The series is based on v6.7-rc1 and can be cloned with:
$ git clone https://gitlab.arm.com/linux-arm/linux-ae.git \
-b arm-mte-dynamic-carveout-rfc-v2
Introduction
============
Memory Tagging Extension (MTE) is currently implemented with a static
carve-out of DRAM to store the allocation tags (a.k.a. memory colour) -
what we call the tag storage. Each 16 bytes of data have 4 bits of tags, so
the tag storage takes 1/32 of the DRAM, roughly 3%. This is
done transparently by the hardware/interconnect (with firmware setup) and
normally hidden from the OS. So a checked memory access to location X
generates a tag fetch from location Y in the carve-out and this tag is
compared with the bits 59:56 in the pointer. The correspondence from X to Y
is linear (subject to a minimum block size to deal with some address
interleaving). The software doesn't need to know about this correspondence
as we have specific instructions like STG/LDG to location X that lead to a
tag store/load to Y.
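To put the linear correspondence in page-frame terms (an illustration only,
not code from this series, and ignoring the minimum block size): one tag
storage page covers 32 data pages, so the lookup is simply

/*
 * Illustration only: the linear X -> Y correspondence for a single
 * (data region, tag storage region) pair, ignoring the minimum block
 * size. One tag storage page holds the tags for 32 data pages.
 */
static unsigned long data_pfn_to_tag_pfn(unsigned long data_pfn,
                                         unsigned long data_start_pfn,
                                         unsigned long tag_start_pfn)
{
        return tag_start_pfn + (data_pfn - data_start_pfn) / 32;
}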
Now, not all memory used by applications is tagged (mmap(PROT_MTE)). For
example, some large allocations may not use PROT_MTE at all, or use it only
for the first and last page, since initialising the tags takes time. The
side-effect is that, of that 3% of DRAM, only a part, say 1%, is effectively
used.
The series aims to take that unused tag storage and release it to the page
allocator for normal data usage.
The first complication is that a PROT_MTE page allocation at address X will
need to reserve the tag storage page at location Y (and migrate any data in
that page if it is in use).
To make things worse, pages in the tag storage/carve-out range cannot use
PROT_MTE themselves on current hardware, so this adds the second
complication - a heterogeneous memory layout. The kernel needs to know
where to allocate a PROT_MTE page from or migrate a current page if it
becomes PROT_MTE (mprotect()) and the range it is in does not support
tagging.
Some other complications are arm64-specific like cache coherency between
tags and data accesses. There is a draft architecture spec which will be
released soon, detailing how the hardware behaves.
All of this will be entirely transparent to userspace. As with the current
kernel (without this dynamic tag storage), a user only needs to ask for
PROT_MTE mappings to get tagged pages.
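For completeness, here is a minimal userspace sketch of that interface (it
follows Documentation/arch/arm64/memory-tagging-extension.rst; nothing in it
is added by this series):

#include <stddef.h>
#include <sys/mman.h>
#include <sys/prctl.h>

/* From arch/arm64/include/uapi/asm/mman.h, if not exposed by the libc. */
#ifndef PROT_MTE
#define PROT_MTE        0x20
#endif

/* Build with -march=armv8.5-a+memtag so the assembler accepts irg/stg. */
int main(void)
{
        char *p;

        /* Enable the tagged address ABI and synchronous MTE tag check faults. */
        if (prctl(PR_SET_TAGGED_ADDR_CTRL,
                  PR_TAGGED_ADDR_ENABLE | PR_MTE_TCF_SYNC |
                  (0xfffe << PR_MTE_TAG_SHIFT), 0, 0, 0))
                return 1;

        /* Ask for a tagged mapping; the kernel takes care of the tag storage. */
        p = mmap(NULL, 4096, PROT_READ | PROT_WRITE | PROT_MTE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED)
                return 1;

        /* Insert a random tag into the pointer and set it on the first granule. */
        asm volatile("irg %0, %0" : "+r" (p));
        asm volatile("stg %0, [%0]" : : "r" (p) : "memory");

        p[0] = 42;      /* checked access: pointer tag matches the memory tag */
        return 0;
}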
Implementation
==============
MTE tag storage reuse is accomplished with the following changes to the
Linux kernel:
1. The tag storage memory is exposed to the memory allocator as
MIGRATE_CMA. The arm64 code manages this memory directly instead of using
cma_declare_contiguous/cma_alloc for performance reasons.
There is a limitation to this approach: MIGRATE_CMA cannot be used for
tagged allocations, even if not all MIGRATE_CMA memory is tag storage.
2. mprotect(PROT_MTE) is implemented by adding a fault-on-access mechanism
for existing pages. When a page is next accessed, a fault is taken and the
corresponding tag storage is reserved.
3. When the code tries to copy tags to a page which doesn't have its tag
storage reserved (when swapping in a newly allocated page, or during
migration/THP collapse), the tags are copied to an xarray and restored once
tag storage is reserved for the destination page.
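Very roughly, that part works as in the sketch below (illustrative only -
the xarray and helper names here are made up; the real code is in the
mteswap.c and copypage.c changes):

/* Sketch only: tags that cannot be restored yet are parked in an xarray
 * indexed by the destination PFN and copied over later. */
static DEFINE_XARRAY(pending_tags_by_pfn);

static int stash_tags_for_page(struct page *page, void *tags)
{
        void *old = xa_store(&pending_tags_by_pfn, page_to_pfn(page), tags,
                             GFP_KERNEL);

        return xa_is_err(old) ? xa_err(old) : 0;
}

/* Called once tag storage has been reserved for the destination page. */
static void restore_pending_tags(struct page *page)
{
        void *tags = xa_erase(&pending_tags_by_pfn, page_to_pfn(page));

        if (!tags)
                return;

        mte_copy_page_tags_from_buf(page_address(page), tags);
        set_page_mte_tagged(page);
        mte_free_tag_buf(tags);
}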
KVM support has not been implemented yet, because a non-MTE-enabled VMA can
back the memory of an MTE-enabled VM. It will be added once there is
consensus on the right approach for the memory management support.
Overview of the patches
=======================
If you are not interested in the arm64 details, you probably want to start
with patches 1-10, which mostly deal with adding the necessary hooks to the
memory management code, and patches 19 and 20, which add the page
fault-on-access mechanism for regular pages and huge pages respectively.
Patch 21 is rather invasive: it moves the definition of struct
migration_target_control out of mm/internal.h to migrate.h, and the arm64
code also uses isolate_lru_page() and putback_movable_pages() when
migrating a tag storage page out of a PROT_MTE VMA. Finally, patch 26 is an
optimization for a performance regression that has been reported with
Chrome; it introduces CONFIG_WANTS_TAKE_PAGE_OFF_BUDDY to allow arm64 to
use take_page_off_buddy() to fast-track reserving tag storage when the page
is free.
The rest of the patches are mostly arm64 specific.
Patches 11-18 add support for detecting the tag storage region and for
reserving tag storage when a tagged page is allocated.
Patches 19-21 add the page fault-on-access mechanism and use it to
reserve tag storage when needed.
Patches 22 and 23 handle saving tags temporarily to an xarray if the page
doesn't have tag storage, and copying the tags over to the tagged page when
tag storage is reserved.
Changelog
=========
Changes since RFC v1 [1]:
* The entire series has been reworked to remove MIGRATE_METADATA and put tag
storage pages on the MIGRATE_CMA freelists.
* Changed how tags are saved and restored when copying them from one page to
another if the destination page doesn't have tag storage - now the tags are
restored when tag storage is reserved for the destination page instead of
restoring them in set_pte_at() -> mte_sync_tags().
[1] https://lore.kernel.org/lkml/[email protected]/
Testing
=======
To enable MTE dynamic tag storage:
- CONFIG_ARM64_MTE_TAG_STORAGE=y
- system_supports_mte() returns true
- kasan_hw_tags_enabled() returns false
- correct DTB node (for the specification, see commit "arm64: mte: Reserve tag
storage memory")
Check dmesg for the message "MTE tag storage region management enabled".
I've tested the series using FVP with MTE enabled, but without support for
dynamic tag storage reuse. To simulate it, I've added two fake tag storage
regions in the DTB by splitting the upper 2GB memory region into three: one
region of normal RAM, followed by the tag storage for the lower 2GB of memory,
then the tag storage for the normal RAM region, as in the diff below:
diff --git a/arch/arm64/boot/dts/arm/fvp-base-revc.dts b/arch/arm64/boot/dts/arm/fvp-base-revc.dts
index 60472d65a355..8c719825a9b3 100644
--- a/arch/arm64/boot/dts/arm/fvp-base-revc.dts
+++ b/arch/arm64/boot/dts/arm/fvp-base-revc.dts
@@ -165,12 +165,30 @@ C1_L2: l2-cache1 {
};
};
- memory@80000000 {
+ memory0: memory@80000000 {
device_type = "memory";
- reg = <0x00000000 0x80000000 0 0x80000000>,
- <0x00000008 0x80000000 0 0x80000000>;
+ reg = <0x00000000 0x80000000 0 0x80000000>;
};
+ memory1: memory@880000000 {
+ device_type = "memory";
+ reg = <0x00000008 0x80000000 0 0x78000000>;
+ };
+
+ tags0: tag-storage@8f8000000 {
+ compatible = "arm,mte-tag-storage";
+ reg = <0x00000008 0xf8000000 0 0x4000000>;
+ block-size = <0x1000>;
+ memory = <&memory0>;
+ };
+
+ tags1: tag-storage@8fc000000 {
+ compatible = "arm,mte-tag-storage";
+ reg = <0x00000008 0xfc000000 0 0x3c00000>;
+ block-size = <0x1000>;
+ memory = <&memory1>;
+ };
+
reserved-memory {
#address-cells = <2>;
#size-cells = <2>;
Alexandru Elisei (27):
arm64: mte: Rework naming for tag manipulation functions
arm64: mte: Rename __GFP_ZEROTAGS to __GFP_TAGGED
mm: cma: Make CMA_ALLOC_SUCCESS/FAIL count the number of pages
mm: migrate/mempolicy: Add hook to modify migration target gfp
mm: page_alloc: Add an arch hook to allow prep_new_page() to fail
mm: page_alloc: Allow an arch to hook early into free_pages_prepare()
mm: page_alloc: Add an arch hook to filter MIGRATE_CMA allocations
mm: page_alloc: Partially revert "mm: page_alloc: remove stale CMA
guard code"
mm: Allow an arch to hook into folio allocation when VMA is known
mm: Call arch_swap_prepare_to_restore() before arch_swap_restore()
arm64: mte: Reserve tag storage memory
arm64: mte: Add tag storage pages to the MIGRATE_CMA migratetype
arm64: mte: Make tag storage depend on ARCH_KEEP_MEMBLOCK
arm64: mte: Disable dynamic tag storage management if HW KASAN is
enabled
arm64: mte: Check that tag storage blocks are in the same zone
arm64: mte: Manage tag storage on page allocation
arm64: mte: Perform CMOs for tag blocks on tagged page allocation/free
arm64: mte: Reserve tag block for the zero page
mm: mprotect: Introduce PAGE_FAULT_ON_ACCESS for mprotect(PROT_MTE)
mm: hugepage: Handle huge page fault on access
mm: arm64: Handle tag storage pages mapped before mprotect(PROT_MTE)
arm64: mte: swap: Handle tag restoring when missing tag storage
arm64: mte: copypage: Handle tag restoring when missing tag storage
arm64: mte: Handle fatal signal in reserve_tag_storage()
KVM: arm64: Disable MTE if tag storage is enabled
arm64: mte: Fast track reserving tag storage when the block is free
arm64: mte: Enable dynamic tag storage reuse
arch/arm64/Kconfig | 16 +
arch/arm64/include/asm/assembler.h | 10 +
arch/arm64/include/asm/mte-def.h | 16 +-
arch/arm64/include/asm/mte.h | 43 +-
arch/arm64/include/asm/mte_tag_storage.h | 75 +++
arch/arm64/include/asm/page.h | 5 +-
arch/arm64/include/asm/pgtable-prot.h | 2 +
arch/arm64/include/asm/pgtable.h | 96 +++-
arch/arm64/kernel/Makefile | 1 +
arch/arm64/kernel/elfcore.c | 14 +-
arch/arm64/kernel/hibernate.c | 46 +-
arch/arm64/kernel/mte.c | 12 +-
arch/arm64/kernel/mte_tag_storage.c | 686 +++++++++++++++++++++++
arch/arm64/kernel/setup.c | 7 +
arch/arm64/kvm/arm.c | 6 +-
arch/arm64/lib/mte.S | 34 +-
arch/arm64/mm/copypage.c | 59 ++
arch/arm64/mm/fault.c | 261 ++++++++-
arch/arm64/mm/mteswap.c | 162 +++++-
fs/proc/page.c | 1 +
include/linux/gfp_types.h | 14 +-
include/linux/huge_mm.h | 2 +
include/linux/kernel-page-flags.h | 1 +
include/linux/migrate.h | 12 +-
include/linux/migrate_mode.h | 1 +
include/linux/mmzone.h | 5 +
include/linux/page-flags.h | 16 +-
include/linux/pgtable.h | 54 ++
include/trace/events/mmflags.h | 5 +-
mm/Kconfig | 7 +
mm/cma.c | 4 +-
mm/huge_memory.c | 5 +-
mm/internal.h | 9 -
mm/memory-failure.c | 8 +-
mm/memory.c | 10 +
mm/mempolicy.c | 3 +
mm/migrate.c | 3 +
mm/page_alloc.c | 118 +++-
mm/shmem.c | 14 +-
mm/swapfile.c | 7 +
40 files changed, 1668 insertions(+), 182 deletions(-)
create mode 100644 arch/arm64/include/asm/mte_tag_storage.h
create mode 100644 arch/arm64/kernel/mte_tag_storage.c
base-commit: b85ea95d086471afb4ad062012a4d73cd328fa86
--
2.42.1
The CMA_ALLOC_SUCCESS and CMA_ALLOC_FAIL counters are each increased by one
after a cma_alloc() call, even though cma_alloc() can allocate an arbitrary
number of CMA pages. When looking at /proc/vmstat, the number of successful
(or failed) cma_alloc() calls doesn't say much about how many CMA pages were
allocated via cma_alloc() versus via the page allocator (regular allocation
request or PCP list refill).
This can also be rather confusing to a user who isn't familiar with the
code, since the unit of measurement for nr_free_cma is the number of pages,
but cma_alloc_success and cma_alloc_fail count the number of cma_alloc()
function calls.
Let's make this consistent, and arguably more useful, by having
CMA_ALLOC_SUCCESS count the number of successfully allocated CMA pages, and
CMA_ALLOC_FAIL count the number of pages cma_alloc() failed to allocate.
For users who wish to track the number of cma_alloc() calls, there are
already tracepoints implemented for that.
Signed-off-by: Alexandru Elisei <[email protected]>
---
mm/cma.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/mm/cma.c b/mm/cma.c
index 2b2494fd6b59..2b74db5116d5 100644
--- a/mm/cma.c
+++ b/mm/cma.c
@@ -517,10 +517,10 @@ struct page *cma_alloc(struct cma *cma, unsigned long count,
pr_debug("%s(): returned %p\n", __func__, page);
out:
if (page) {
- count_vm_event(CMA_ALLOC_SUCCESS);
+ count_vm_events(CMA_ALLOC_SUCCESS, count);
cma_sysfs_account_success_pages(cma, count);
} else {
- count_vm_event(CMA_ALLOC_FAIL);
+ count_vm_events(CMA_ALLOC_FAIL, count);
if (cma)
cma_sysfs_account_fail_pages(cma, count);
}
--
2.42.1
It might be desirable for an architecture to modify the gfp flags used to
allocate the destination page for migration based on the page that is being
replaced. For example, if an architecture has metadata associated with a
page (like arm64, when the memory tagging extension is implemented), it can
request that the destination page similarly has storage for tags already
allocated.
No functional change.
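For illustration, a hypothetical arm64-side definition of the hook (not part
of this patch; arm64 provides its own version later in the series) could look
like this:

/* Hypothetical sketch: request tag storage for the destination if the
 * source page is tagged. */
#define arch_migration_target_gfp arch_migration_target_gfp
static inline gfp_t arch_migration_target_gfp(struct folio *src, gfp_t gfp)
{
        if (page_mte_tagged(&src->page))
                return __GFP_TAGGED;

        return 0;
}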
Signed-off-by: Alexandru Elisei <[email protected]>
---
include/linux/migrate.h | 4 ++++
mm/mempolicy.c | 2 ++
mm/migrate.c | 3 +++
3 files changed, 9 insertions(+)
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 2ce13e8a309b..0acef592043c 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -60,6 +60,10 @@ struct movable_operations {
/* Defined in mm/debug.c: */
extern const char *migrate_reason_names[MR_TYPES];
+#ifndef arch_migration_target_gfp
+#define arch_migration_target_gfp(src, gfp) 0
+#endif
+
#ifdef CONFIG_MIGRATION
void putback_movable_pages(struct list_head *l);
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 10a590ee1c89..50bc43ab50d6 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1182,6 +1182,7 @@ static struct folio *alloc_migration_target_by_mpol(struct folio *src,
h = folio_hstate(src);
gfp = htlb_alloc_mask(h);
+ gfp |= arch_migration_target_gfp(src, gfp);
nodemask = policy_nodemask(gfp, pol, ilx, &nid);
return alloc_hugetlb_folio_nodemask(h, nid, nodemask, gfp);
}
@@ -1190,6 +1191,7 @@ static struct folio *alloc_migration_target_by_mpol(struct folio *src,
gfp = GFP_TRANSHUGE;
else
gfp = GFP_HIGHUSER_MOVABLE | __GFP_RETRY_MAYFAIL | __GFP_COMP;
+ gfp |= arch_migration_target_gfp(src, gfp);
page = alloc_pages_mpol(gfp, order, pol, ilx, nid);
return page_rmappable_folio(page);
diff --git a/mm/migrate.c b/mm/migrate.c
index 35a88334bb3c..dd25ab69e3de 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2016,6 +2016,7 @@ struct folio *alloc_migration_target(struct folio *src, unsigned long private)
struct hstate *h = folio_hstate(src);
gfp_mask = htlb_modify_alloc_mask(h, gfp_mask);
+ gfp_mask |= arch_migration_target_gfp(src, gfp);
return alloc_hugetlb_folio_nodemask(h, nid,
mtc->nmask, gfp_mask);
}
@@ -2032,6 +2033,7 @@ struct folio *alloc_migration_target(struct folio *src, unsigned long private)
zidx = zone_idx(folio_zone(src));
if (is_highmem_idx(zidx) || zidx == ZONE_MOVABLE)
gfp_mask |= __GFP_HIGHMEM;
+ gfp_mask |= arch_migration_target_gfp(src, gfp);
return __folio_alloc(gfp_mask, order, nid, mtc->nmask);
}
@@ -2500,6 +2502,7 @@ static struct folio *alloc_misplaced_dst_folio(struct folio *src,
__GFP_NOWARN;
gfp &= ~__GFP_RECLAIM;
}
+ gfp |= arch_migration_target_gfp(src, gfp);
return __folio_alloc_node(gfp, order, nid);
}
--
2.42.1
Add an arch_free_pages_prepare() hook that is called before the page flags
are cleared. This will be used by arm64 when explicit management of tag
storage pages is enabled.
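For illustration, a hypothetical arm64-side implementation (not part of this
patch; tag_storage_enabled() and free_tag_storage() are made-up names here)
might be:

/* Hypothetical sketch: release the tag storage reserved for this page. */
#define __HAVE_ARCH_FREE_PAGES_PREPARE
static inline void arch_free_pages_prepare(struct page *page, int order)
{
        if (tag_storage_enabled() && page_mte_tagged(page))
                free_tag_storage(page, order);
}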
Signed-off-by: Alexandru Elisei <[email protected]>
---
include/linux/pgtable.h | 4 ++++
mm/page_alloc.c | 4 +++-
2 files changed, 7 insertions(+), 1 deletion(-)
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index b31f53e9ab1d..3f34f00ced62 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -880,6 +880,10 @@ static inline int arch_prep_new_page(struct page *page, int order, gfp_t gfp)
}
#endif
+#ifndef __HAVE_ARCH_FREE_PAGES_PREPARE
+static inline void arch_free_pages_prepare(struct page *page, int order) { }
+#endif
+
#ifndef __HAVE_ARCH_UNMAP_ONE
/*
* Some architectures support metadata associated with a page. When a
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b2782b778e78..86e4b1dac538 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1086,6 +1086,8 @@ static __always_inline bool free_pages_prepare(struct page *page,
trace_mm_page_free(page, order);
kmsan_free_page(page, order);
+ arch_free_pages_prepare(page, order);
+
if (unlikely(PageHWPoison(page)) && !order) {
/*
* Do not let hwpoison pages hit pcplists/buddy
@@ -3171,7 +3173,7 @@ static inline unsigned int gfp_to_alloc_flags_cma(gfp_t gfp_mask,
return alloc_flags;
}
-#ifdef HAVE_ARCH_ALLOC_PAGE
+#ifdef HAVE_ARCH_PREP_NEW_PAGE
static void return_page_to_buddy(struct page *page, int order)
{
unsigned long pfn = page_to_pfn(page);
--
2.42.1
As an architecture might have specific requirements around the allocation
of CMA pages, add an arch hook that can disable allocations from
MIGRATE_CMA even when the allocation would otherwise be allowed.
This will be used by arm64, which will put tag storage pages on the
MIGRATE_CMA freelists; those pages come with specific limitations.
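As an illustration of the intended use (hypothetical; the actual arm64
definition comes later in the series, and tag_storage_enabled() is a made-up
name here), the hook could simply veto MIGRATE_CMA for tagged allocations,
since tag storage pages cannot themselves be tagged:

/* Hypothetical arm64-side sketch. */
#define __HAVE_ARCH_ALLOC_CMA
static inline bool arch_alloc_cma(gfp_t gfp_mask)
{
        if (tag_storage_enabled() && (gfp_mask & __GFP_TAGGED))
                return false;

        return true;
}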
Signed-off-by: Alexandru Elisei <[email protected]>
---
include/linux/pgtable.h | 7 +++++++
mm/page_alloc.c | 3 ++-
2 files changed, 9 insertions(+), 1 deletion(-)
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 3f34f00ced62..b7a9ab818f6d 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -884,6 +884,13 @@ static inline int arch_prep_new_page(struct page *page, int order, gfp_t gfp)
static inline void arch_free_pages_prepare(struct page *page, int order) { }
#endif
+#ifndef __HAVE_ARCH_ALLOC_CMA
+static inline bool arch_alloc_cma(gfp_t gfp)
+{
+ return true;
+}
+#endif
+
#ifndef __HAVE_ARCH_UNMAP_ONE
/*
* Some architectures support metadata associated with a page. When a
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 86e4b1dac538..0f508070c404 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3167,7 +3167,8 @@ static inline unsigned int gfp_to_alloc_flags_cma(gfp_t gfp_mask,
unsigned int alloc_flags)
{
#ifdef CONFIG_CMA
- if (gfp_migratetype(gfp_mask) == MIGRATE_MOVABLE)
+ if (gfp_migratetype(gfp_mask) == MIGRATE_MOVABLE &&
+ arch_alloc_cma(gfp_mask))
alloc_flags |= ALLOC_CMA;
#endif
return alloc_flags;
--
2.42.1
__GFP_ZEROTAGS is used to instruct the page allocator to zero the tags at
the same time as the physical frame is zeroed. The name can be slightly
misleading, because it doesn't mean that the code will zero the tags
unconditionally, but that the tags will be zeroed if and only if the
physical frame is also zeroed (either __GFP_ZERO is set or init_on_alloc is
1).
Rename it to __GFP_TAGGED, in preparation for it to be used by the page
allocator to recognize when an allocation is tagged (has metadata).
Signed-off-by: Alexandru Elisei <[email protected]>
---
arch/arm64/mm/fault.c | 2 +-
include/linux/gfp_types.h | 14 +++++++-------
include/trace/events/mmflags.h | 2 +-
mm/page_alloc.c | 2 +-
4 files changed, 10 insertions(+), 10 deletions(-)
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index 460d799e1296..daa91608d917 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -948,7 +948,7 @@ struct folio *vma_alloc_zeroed_movable_folio(struct vm_area_struct *vma,
* separate DC ZVA and STGM.
*/
if (vma->vm_flags & VM_MTE)
- flags |= __GFP_ZEROTAGS;
+ flags |= __GFP_TAGGED;
return vma_alloc_folio(flags, 0, vma, vaddr, false);
}
diff --git a/include/linux/gfp_types.h b/include/linux/gfp_types.h
index 6583a58670c5..37b9e265d77e 100644
--- a/include/linux/gfp_types.h
+++ b/include/linux/gfp_types.h
@@ -45,7 +45,7 @@ typedef unsigned int __bitwise gfp_t;
#define ___GFP_HARDWALL 0x100000u
#define ___GFP_THISNODE 0x200000u
#define ___GFP_ACCOUNT 0x400000u
-#define ___GFP_ZEROTAGS 0x800000u
+#define ___GFP_TAGGED 0x800000u
#ifdef CONFIG_KASAN_HW_TAGS
#define ___GFP_SKIP_ZERO 0x1000000u
#define ___GFP_SKIP_KASAN 0x2000000u
@@ -226,11 +226,11 @@ typedef unsigned int __bitwise gfp_t;
*
* %__GFP_ZERO returns a zeroed page on success.
*
- * %__GFP_ZEROTAGS zeroes memory tags at allocation time if the memory itself
- * is being zeroed (either via __GFP_ZERO or via init_on_alloc, provided that
- * __GFP_SKIP_ZERO is not set). This flag is intended for optimization: setting
- * memory tags at the same time as zeroing memory has minimal additional
- * performace impact.
+ * %__GFP_TAGGED marks the allocation as having tags, which will be zeroed at
+ * allocation time if the memory itself is being zeroed (either via __GFP_ZERO
+ * or via init_on_alloc, provided that __GFP_SKIP_ZERO is not set). This flag is
+ * intended for optimization: setting memory tags at the same time as zeroing
+ * memory has minimal additional performance impact.
*
* %__GFP_SKIP_KASAN makes KASAN skip unpoisoning on page allocation.
* Used for userspace and vmalloc pages; the latter are unpoisoned by
@@ -241,7 +241,7 @@ typedef unsigned int __bitwise gfp_t;
#define __GFP_NOWARN ((__force gfp_t)___GFP_NOWARN)
#define __GFP_COMP ((__force gfp_t)___GFP_COMP)
#define __GFP_ZERO ((__force gfp_t)___GFP_ZERO)
-#define __GFP_ZEROTAGS ((__force gfp_t)___GFP_ZEROTAGS)
+#define __GFP_TAGGED ((__force gfp_t)___GFP_TAGGED)
#define __GFP_SKIP_ZERO ((__force gfp_t)___GFP_SKIP_ZERO)
#define __GFP_SKIP_KASAN ((__force gfp_t)___GFP_SKIP_KASAN)
diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index d801409b33cf..6ca0d5ed46c0 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -50,7 +50,7 @@
gfpflag_string(__GFP_RECLAIM), \
gfpflag_string(__GFP_DIRECT_RECLAIM), \
gfpflag_string(__GFP_KSWAPD_RECLAIM), \
- gfpflag_string(__GFP_ZEROTAGS)
+ gfpflag_string(__GFP_TAGGED)
#ifdef CONFIG_KASAN_HW_TAGS
#define __def_gfpflag_names_kasan , \
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 733732e7e0ba..770e585b77c8 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1483,7 +1483,7 @@ inline void post_alloc_hook(struct page *page, unsigned int order,
{
bool init = !want_init_on_free() && want_init_on_alloc(gfp_flags) &&
!should_skip_init(gfp_flags);
- bool zero_tags = init && (gfp_flags & __GFP_ZEROTAGS);
+ bool zero_tags = init && (gfp_flags & __GFP_TAGGED);
int i;
set_page_private(page, 0);
--
2.42.1
arm64 uses arch_swap_restore() to restore saved tags before the page is
swapped in, and it's called in atomic context (with the ptl held).
Introduce arch_swap_prepare_to_restore(), which allows an architecture to
perform extra work during swap-in, outside of a critical section.
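For illustration, a hypothetical arm64-side implementation (not part of this
patch; the reserve_tag_storage() and mte_swap_tags_present() names are made
up here) could reserve tag storage for the folio before arch_swap_restore()
runs under the ptl:

/* Hypothetical sketch only. */
#define __HAVE_ARCH_SWAP_PREPARE_TO_RESTORE
static inline vm_fault_t arch_swap_prepare_to_restore(swp_entry_t entry,
                                                      struct folio *folio)
{
        if (!tag_storage_enabled())
                return 0;

        /* No tags were saved for this entry, nothing to restore later. */
        if (!mte_swap_tags_present(entry))
                return 0;

        /* May sleep and may migrate data out of the tag storage pages. */
        if (reserve_tag_storage(folio_page(folio, 0), folio_order(folio),
                                GFP_HIGHUSER_MOVABLE))
                return VM_FAULT_RETRY;

        return 0;
}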
Signed-off-by: Alexandru Elisei <[email protected]>
---
include/linux/pgtable.h | 7 +++++++
mm/memory.c | 4 ++++
mm/shmem.c | 9 +++++++++
mm/swapfile.c | 7 +++++++
4 files changed, 27 insertions(+)
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index b1001ce361ac..ffdb9b6bed6c 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -938,6 +938,13 @@ static inline void arch_swap_invalidate_area(int type)
}
#endif
+#ifndef __HAVE_ARCH_SWAP_PREPARE_TO_RESTORE
+static inline vm_fault_t arch_swap_prepare_to_restore(swp_entry_t entry, struct folio *folio)
+{
+ return 0;
+}
+#endif
+
#ifndef __HAVE_ARCH_SWAP_RESTORE
static inline void arch_swap_restore(swp_entry_t entry, struct folio *folio)
{
diff --git a/mm/memory.c b/mm/memory.c
index 1f18ed4a5497..e137f7673749 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3957,6 +3957,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
folio_throttle_swaprate(folio, GFP_KERNEL);
+ ret = arch_swap_prepare_to_restore(entry, folio);
+ if (ret)
+ goto out_page;
+
/*
* Back out if somebody else already faulted in this pte.
*/
diff --git a/mm/shmem.c b/mm/shmem.c
index 71ce5fe5c779..0449c03dbdfd 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1840,6 +1840,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
struct swap_info_struct *si;
struct folio *folio = NULL;
swp_entry_t swap;
+ vm_fault_t ret;
int error;
VM_BUG_ON(!*foliop || !xa_is_value(*foliop));
@@ -1888,6 +1889,14 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
}
folio_wait_writeback(folio);
+ ret = arch_swap_prepare_to_restore(swap, folio);
+ if (ret) {
+ if (fault_type)
+ *fault_type = ret;
+ error = -EINVAL;
+ goto unlock;
+ }
+
/*
* Some architectures may have to restore extra metadata to the
* folio after reading from swap.
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 4bc70f459164..9983dffce47b 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1746,6 +1746,7 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
spinlock_t *ptl;
pte_t *pte, new_pte, old_pte;
bool hwpoisoned = PageHWPoison(page);
+ vm_fault_t err;
int ret = 1;
swapcache = page;
@@ -1779,6 +1780,12 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
goto setpte;
}
+ err = arch_swap_prepare_to_restore(entry, page_folio(page));
+ if (err) {
+ ret = -EINVAL;
+ goto out;
+ }
+
/*
* Some architectures may have to restore extra metadata to the page
* when reading from swap. This metadata may be indexed by swap entry
--
2.42.1
Allow the kernel to get the size and location of the MTE tag storage
regions from the DTB. This memory is marked as reserved for now.
The DTB node for the tag storage region is defined as:
tags0: tag-storage@8f8000000 {
compatible = "arm,mte-tag-storage";
reg = <0x08 0xf8000000 0x00 0x4000000>;
block-size = <0x1000>;
memory = <&memory0>; // Associated tagged memory node
};
The tag storage region represents the largest contiguous memory region that
holds all the tags for the associated contiguous memory region which can be
tagged. For example, for 32GB of contiguous tagged memory, the corresponding
tag storage region is 1GB of contiguous memory, not two adjacent 512MB
regions of tag storage memory.
"block-size" represents the minimum multiple of 4K of tag storage where all
the tags stored in the block correspond to a contiguous memory region. This
is needed for platforms where the memory controller interleaves tag writes
to memory. For example, if the memory controller interleaves tag writes for
256KB of contiguous memory across 8K of tag storage (2-way interleave),
then the correct value for "block-size" is 0x2000. This value is a hardware
property, independent of the selected kernel page size.
Signed-off-by: Alexandru Elisei <[email protected]>
---
arch/arm64/Kconfig | 12 ++
arch/arm64/include/asm/mte_tag_storage.h | 15 ++
arch/arm64/kernel/Makefile | 1 +
arch/arm64/kernel/mte_tag_storage.c | 256 +++++++++++++++++++++++
arch/arm64/kernel/setup.c | 7 +
5 files changed, 291 insertions(+)
create mode 100644 arch/arm64/include/asm/mte_tag_storage.h
create mode 100644 arch/arm64/kernel/mte_tag_storage.c
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 7b071a00425d..fe8276fdc7a8 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -2062,6 +2062,18 @@ config ARM64_MTE
Documentation/arch/arm64/memory-tagging-extension.rst.
+if ARM64_MTE
+config ARM64_MTE_TAG_STORAGE
+ bool "Dynamic MTE tag storage management"
+ help
+ Adds support for dynamic management of the memory used by the hardware
+ for storing MTE tags. This memory, unlike normal memory, cannot be
+ tagged. When it is used to store tags for another memory location it
+ cannot be used for any type of allocation.
+
+ If unsure, say N
+endif # ARM64_MTE
+
endmenu # "ARMv8.5 architectural features"
menu "ARMv8.7 architectural features"
diff --git a/arch/arm64/include/asm/mte_tag_storage.h b/arch/arm64/include/asm/mte_tag_storage.h
new file mode 100644
index 000000000000..8f86c4f9a7c3
--- /dev/null
+++ b/arch/arm64/include/asm/mte_tag_storage.h
@@ -0,0 +1,15 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (C) 2023 ARM Ltd.
+ */
+#ifndef __ASM_MTE_TAG_STORAGE_H
+#define __ASM_MTE_TAG_STORAGE_H
+
+#ifdef CONFIG_ARM64_MTE_TAG_STORAGE
+void mte_tag_storage_init(void);
+#else
+static inline void mte_tag_storage_init(void)
+{
+}
+#endif /* CONFIG_ARM64_MTE_TAG_STORAGE */
+#endif /* __ASM_MTE_TAG_STORAGE_H */
diff --git a/arch/arm64/kernel/Makefile b/arch/arm64/kernel/Makefile
index d95b3d6b471a..5f031bf9f8f1 100644
--- a/arch/arm64/kernel/Makefile
+++ b/arch/arm64/kernel/Makefile
@@ -70,6 +70,7 @@ obj-$(CONFIG_CRASH_CORE) += crash_core.o
obj-$(CONFIG_ARM_SDE_INTERFACE) += sdei.o
obj-$(CONFIG_ARM64_PTR_AUTH) += pointer_auth.o
obj-$(CONFIG_ARM64_MTE) += mte.o
+obj-$(CONFIG_ARM64_MTE_TAG_STORAGE) += mte_tag_storage.o
obj-y += vdso-wrap.o
obj-$(CONFIG_COMPAT_VDSO) += vdso32-wrap.o
obj-$(CONFIG_UNWIND_PATCH_PAC_INTO_SCS) += patch-scs.o
diff --git a/arch/arm64/kernel/mte_tag_storage.c b/arch/arm64/kernel/mte_tag_storage.c
new file mode 100644
index 000000000000..fa6267ef8392
--- /dev/null
+++ b/arch/arm64/kernel/mte_tag_storage.c
@@ -0,0 +1,256 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Support for dynamic tag storage.
+ *
+ * Copyright (C) 2023 ARM Ltd.
+ */
+
+#include <linux/memblock.h>
+#include <linux/mm.h>
+#include <linux/of_device.h>
+#include <linux/of_fdt.h>
+#include <linux/range.h>
+#include <linux/string.h>
+#include <linux/xarray.h>
+
+#include <asm/mte_tag_storage.h>
+
+struct tag_region {
+ struct range mem_range; /* Memory associated with the tag storage, in PFNs. */
+ struct range tag_range; /* Tag storage memory, in PFNs. */
+ u32 block_size; /* Tag block size, in pages. */
+};
+
+#define MAX_TAG_REGIONS 32
+
+static struct tag_region tag_regions[MAX_TAG_REGIONS];
+static int num_tag_regions;
+
+static int __init tag_storage_of_flat_get_range(unsigned long node, const __be32 *reg,
+ int reg_len, struct range *range)
+{
+ int addr_cells = dt_root_addr_cells;
+ int size_cells = dt_root_size_cells;
+ u64 size;
+
+ if (reg_len / 4 > addr_cells + size_cells)
+ return -EINVAL;
+
+ range->start = PHYS_PFN(of_read_number(reg, addr_cells));
+ size = PHYS_PFN(of_read_number(reg + addr_cells, size_cells));
+ if (size == 0) {
+ pr_err("Invalid node");
+ return -EINVAL;
+ }
+ range->end = range->start + size - 1;
+
+ return 0;
+}
+
+static int __init tag_storage_of_flat_get_tag_range(unsigned long node,
+ struct range *tag_range)
+{
+ const __be32 *reg;
+ int reg_len;
+
+ reg = of_get_flat_dt_prop(node, "reg", ®_len);
+ if (reg == NULL) {
+ pr_err("Invalid metadata node");
+ return -EINVAL;
+ }
+
+ return tag_storage_of_flat_get_range(node, reg, reg_len, tag_range);
+}
+
+static int __init tag_storage_of_flat_get_memory_range(unsigned long node, struct range *mem)
+{
+ const __be32 *reg;
+ int reg_len;
+
+ reg = of_get_flat_dt_prop(node, "linux,usable-memory", ®_len);
+ if (reg == NULL)
+ reg = of_get_flat_dt_prop(node, "reg", ®_len);
+
+ if (reg == NULL) {
+ pr_err("Invalid memory node");
+ return -EINVAL;
+ }
+
+ return tag_storage_of_flat_get_range(node, reg, reg_len, mem);
+}
+
+struct find_memory_node_arg {
+ unsigned long node;
+ u32 phandle;
+};
+
+static int __init fdt_find_memory_node(unsigned long node, const char *uname,
+ int depth, void *data)
+{
+ const char *type = of_get_flat_dt_prop(node, "device_type", NULL);
+ struct find_memory_node_arg *arg = data;
+
+ if (depth != 1 || !type || strcmp(type, "memory") != 0)
+ return 0;
+
+ if (of_get_flat_dt_phandle(node) == arg->phandle) {
+ arg->node = node;
+ return 1;
+ }
+
+ return 0;
+}
+
+static int __init tag_storage_get_memory_node(unsigned long tag_node, unsigned long *mem_node)
+{
+ struct find_memory_node_arg arg = { 0 };
+ const __be32 *memory_prop;
+ u32 mem_phandle;
+ int ret, reg_len;
+
+ memory_prop = of_get_flat_dt_prop(tag_node, "memory", ®_len);
+ if (!memory_prop) {
+ pr_err("Missing 'memory' property in the tag storage node");
+ return -EINVAL;
+ }
+
+ mem_phandle = be32_to_cpup(memory_prop);
+ arg.phandle = mem_phandle;
+
+ ret = of_scan_flat_dt(fdt_find_memory_node, &arg);
+ if (ret != 1) {
+ pr_err("Associated memory node not found");
+ return -EINVAL;
+ }
+
+ *mem_node = arg.node;
+
+ return 0;
+}
+
+static int __init tag_storage_of_flat_read_u32(unsigned long node, const char *propname,
+ u32 *retval)
+{
+ const __be32 *reg;
+
+ reg = of_get_flat_dt_prop(node, propname, NULL);
+ if (!reg)
+ return -EINVAL;
+
+ *retval = be32_to_cpup(reg);
+ return 0;
+}
+
+static u32 __init get_block_size_pages(u32 block_size_bytes)
+{
+ u32 a = PAGE_SIZE;
+ u32 b = block_size_bytes;
+ u32 r;
+
+ /* Find the greatest common divisor using the Euclidean algorithm. */
+ do {
+ r = a % b;
+ a = b;
+ b = r;
+ } while (b != 0);
+
+ return PHYS_PFN(PAGE_SIZE * block_size_bytes / a);
+}
+
+static int __init fdt_init_tag_storage(unsigned long node, const char *uname,
+ int depth, void *data)
+{
+ struct tag_region *region;
+ unsigned long mem_node;
+ struct range *mem_range;
+ struct range *tag_range;
+ u32 block_size_bytes;
+ u32 nid = 0;
+ int ret;
+
+ if (depth != 1 || !strstr(uname, "tag-storage"))
+ return 0;
+
+ if (!of_flat_dt_is_compatible(node, "arm,mte-tag-storage"))
+ return 0;
+
+ if (num_tag_regions == MAX_TAG_REGIONS) {
+ pr_err("Maximum number of tag storage regions exceeded");
+ return -EINVAL;
+ }
+
+ region = &tag_regions[num_tag_regions];
+ mem_range = ®ion->mem_range;
+ tag_range = ®ion->tag_range;
+
+ ret = tag_storage_of_flat_get_tag_range(node, tag_range);
+ if (ret) {
+ pr_err("Invalid tag storage node");
+ return ret;
+ }
+
+ ret = tag_storage_get_memory_node(node, &mem_node);
+ if (ret)
+ return ret;
+
+ ret = tag_storage_of_flat_get_memory_range(mem_node, mem_range);
+ if (ret) {
+ pr_err("Invalid address for associated data memory node");
+ return ret;
+ }
+
+ /* The tag region must exactly match the corresponding memory. */
+ if (range_len(tag_range) * 32 != range_len(mem_range)) {
+ pr_err("Tag storage region 0x%llx-0x%llx does not cover the memory region 0x%llx-0x%llx",
+ PFN_PHYS(tag_range->start), PFN_PHYS(tag_range->end),
+ PFN_PHYS(mem_range->start), PFN_PHYS(mem_range->end));
+ return -EINVAL;
+ }
+
+ ret = tag_storage_of_flat_read_u32(node, "block-size", &block_size_bytes);
+ if (ret || block_size_bytes == 0) {
+ pr_err("Invalid or missing 'block-size' property");
+ return -EINVAL;
+ }
+ region->block_size = get_block_size_pages(block_size_bytes);
+ if (range_len(tag_range) % region->block_size != 0) {
+ pr_err("Tag storage region size 0x%llx is not a multiple of block size %u",
+ PFN_PHYS(range_len(tag_range)), region->block_size);
+ return -EINVAL;
+ }
+
+ ret = tag_storage_of_flat_read_u32(mem_node, "numa-node-id", &nid);
+ if (ret)
+ nid = numa_node_id();
+
+ ret = memblock_add_node(PFN_PHYS(tag_range->start), PFN_PHYS(range_len(tag_range)),
+ nid, MEMBLOCK_NONE);
+ if (ret) {
+ pr_err("Error adding tag memblock (%d)", ret);
+ return ret;
+ }
+ memblock_reserve(PFN_PHYS(tag_range->start), PFN_PHYS(range_len(tag_range)));
+
+ pr_info("Found tag storage region 0x%llx-0x%llx, block size %u pages",
+ PFN_PHYS(tag_range->start), PFN_PHYS(tag_range->end), region->block_size);
+
+ num_tag_regions++;
+
+ return 0;
+}
+
+void __init mte_tag_storage_init(void)
+{
+ struct range *tag_range;
+ int i, ret;
+
+ ret = of_scan_flat_dt(fdt_init_tag_storage, NULL);
+ if (ret) {
+ for (i = 0; i < num_tag_regions; i++) {
+ tag_range = &tag_regions[i].tag_range;
+ memblock_remove(PFN_PHYS(tag_range->start), PFN_PHYS(range_len(tag_range)));
+ }
+ num_tag_regions = 0;
+ pr_info("MTE tag storage region management disabled");
+ }
+}
diff --git a/arch/arm64/kernel/setup.c b/arch/arm64/kernel/setup.c
index 417a8a86b2db..1b77138c1aa5 100644
--- a/arch/arm64/kernel/setup.c
+++ b/arch/arm64/kernel/setup.c
@@ -42,6 +42,7 @@
#include <asm/cpufeature.h>
#include <asm/cpu_ops.h>
#include <asm/kasan.h>
+#include <asm/mte_tag_storage.h>
#include <asm/numa.h>
#include <asm/scs.h>
#include <asm/sections.h>
@@ -342,6 +343,12 @@ void __init __no_sanitize_address setup_arch(char **cmdline_p)
FW_BUG "Booted with MMU enabled!");
}
+ /*
+ * Must be called before memory limits are enforced by
+ * arm64_memblock_init().
+ */
+ mte_tag_storage_init();
+
arm64_memblock_init();
paging_init();
--
2.42.1
Introduce arch_prep_new_page(), which will be used by arm64 to reserve tag
storage for an allocated page. Reserving tag storage can fail, for example
if the tag storage page has a short-term pin on it, so allow prep_new_page()
-> arch_prep_new_page() to fail as well.
arch_alloc_page(), called from post_alloc_hook(), has been considered as an
alternative to adding yet another arch hook, but post_alloc_hook() cannot
fail, as it's also called when free pages are isolated.
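For illustration, a hypothetical arm64-side implementation of the hook (not
part of this patch; tag_storage_enabled() and reserve_tag_storage() are
made-up names here) could be:

/* Hypothetical sketch: reserve tag storage for tagged allocations. */
#define __HAVE_ARCH_PREP_NEW_PAGE
static inline int arch_prep_new_page(struct page *page, int order, gfp_t gfp)
{
        if (tag_storage_enabled() && (gfp & __GFP_TAGGED))
                return reserve_tag_storage(page, order, gfp);

        return 0;
}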
Signed-off-by: Alexandru Elisei <[email protected]>
---
include/linux/pgtable.h | 7 ++++
mm/page_alloc.c | 75 ++++++++++++++++++++++++++++++++---------
2 files changed, 66 insertions(+), 16 deletions(-)
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index af7639c3b0a3..b31f53e9ab1d 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -873,6 +873,13 @@ static inline void arch_do_swap_page(struct mm_struct *mm,
}
#endif
+#ifndef __HAVE_ARCH_PREP_NEW_PAGE
+static inline int arch_prep_new_page(struct page *page, int order, gfp_t gfp)
+{
+ return 0;
+}
+#endif
+
#ifndef __HAVE_ARCH_UNMAP_ONE
/*
* Some architectures support metadata associated with a page. When a
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 770e585b77c8..b2782b778e78 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1538,9 +1538,15 @@ inline void post_alloc_hook(struct page *page, unsigned int order,
page_table_check_alloc(page, order);
}
-static void prep_new_page(struct page *page, unsigned int order, gfp_t gfp_flags,
+static int prep_new_page(struct page *page, unsigned int order, gfp_t gfp_flags,
unsigned int alloc_flags)
{
+ int ret;
+
+ ret = arch_prep_new_page(page, order, gfp_flags);
+ if (unlikely(ret))
+ return ret;
+
post_alloc_hook(page, order, gfp_flags);
if (order && (gfp_flags & __GFP_COMP))
@@ -1556,6 +1562,8 @@ static void prep_new_page(struct page *page, unsigned int order, gfp_t gfp_flags
set_page_pfmemalloc(page);
else
clear_page_pfmemalloc(page);
+
+ return 0;
}
/*
@@ -3163,6 +3171,24 @@ static inline unsigned int gfp_to_alloc_flags_cma(gfp_t gfp_mask,
return alloc_flags;
}
+#ifdef HAVE_ARCH_ALLOC_PAGE
+static void return_page_to_buddy(struct page *page, int order)
+{
+ unsigned long pfn = page_to_pfn(page);
+ int migratetype = get_pfnblock_migratetype(page, pfn);
+ struct zone *zone = page_zone(page);
+ unsigned long flags;
+
+ spin_lock_irqsave(&zone->lock, flags);
+ __free_one_page(page, pfn, zone, order, migratetype, FPI_TO_TAIL);
+ spin_unlock_irqrestore(&zone->lock, flags);
+}
+#else
+static void return_page_to_buddy(struct page *page, int order)
+{
+}
+#endif
+
/*
* get_page_from_freelist goes through the zonelist trying to allocate
* a page.
@@ -3309,7 +3335,10 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
page = rmqueue(ac->preferred_zoneref->zone, zone, order,
gfp_mask, alloc_flags, ac->migratetype);
if (page) {
- prep_new_page(page, order, gfp_mask, alloc_flags);
+ if (prep_new_page(page, order, gfp_mask, alloc_flags)) {
+ return_page_to_buddy(page, order);
+ goto no_page;
+ }
/*
* If this is a high-order atomic allocation then check
@@ -3319,20 +3348,20 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
reserve_highatomic_pageblock(page, zone);
return page;
- } else {
- if (has_unaccepted_memory()) {
- if (try_to_accept_memory(zone, order))
- goto try_this_zone;
- }
+ }
+no_page:
+ if (has_unaccepted_memory()) {
+ if (try_to_accept_memory(zone, order))
+ goto try_this_zone;
+ }
#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
- /* Try again if zone has deferred pages */
- if (deferred_pages_enabled()) {
- if (_deferred_grow_zone(zone, order))
- goto try_this_zone;
- }
-#endif
+ /* Try again if zone has deferred pages */
+ if (deferred_pages_enabled()) {
+ if (_deferred_grow_zone(zone, order))
+ goto try_this_zone;
}
+#endif
}
/*
@@ -3538,8 +3567,12 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
count_vm_event(COMPACTSTALL);
/* Prep a captured page if available */
- if (page)
- prep_new_page(page, order, gfp_mask, alloc_flags);
+ if (page) {
+ if (prep_new_page(page, order, gfp_mask, alloc_flags)) {
+ return_page_to_buddy(page, order);
+ page = NULL;
+ }
+ }
/* Try get a page from the freelist if available */
if (!page)
@@ -4490,9 +4523,18 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
}
break;
}
+
+ if (prep_new_page(page, 0, gfp, 0)) {
+ pcp_spin_unlock(pcp);
+ pcp_trylock_finish(UP_flags);
+ return_page_to_buddy(page, 0);
+ if (!nr_account)
+ goto failed;
+ else
+ goto out_statistics;
+ }
nr_account++;
- prep_new_page(page, 0, gfp, 0);
if (page_list)
list_add(&page->lru, page_list);
else
@@ -4503,6 +4545,7 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
pcp_spin_unlock(pcp);
pcp_trylock_finish(UP_flags);
+out_statistics:
__count_zid_vm_events(PGALLOC, zone_idx(zone), nr_account);
zone_statistics(ac.preferred_zoneref->zone, zone, nr_account);
--
2.42.1
The tag save/restore/copy functions could be more explicit about where the
tags are coming from and where they are being copied to. Rename the
functions to make it easier to understand what they are doing:
- Rename the mte_clear_page_tags() 'addr' parameter to 'page_addr', to
match the other functions that take a page address as parameter.
- Rename mte_save/restore_tags() to
mte_save/restore_page_tags_by_swp_entry() to make it clear that the tags are
saved in a collection indexed by swp_entry (this will become important
when they will also be saved in a collection indexed by page pfn). The same
applies to mte_invalidate_tags{,_area}_by_swp_entry().
- Rename mte_save/restore_page_tags() to make it clear where the tags are
saved to and restored from, respectively - a previously allocated memory
buffer, not an xarray like the one used when the tags are saved for
swapping. Rename the action to 'copy' instead of 'save'/'restore' to match
the copy from user functions, which also copy tags to memory.
- Rename mte_allocate/free_tag_storage() to mte_allocate/free_tag_buf() to
make it clear the functions have nothing to do with the memory where the
corresponding tags for a page live. Change the parameter type for
mte_free_tag_buf() to void *, to match the return value of
mte_allocate_tag_buf(), and also because that memory is opaque and not
meant to be directly dereferenced.
In the name of consistency, rename local variables from tag_storage to tags.
Give a similar treatment to the hibernation code that saves and restores
the tags for all tagged pages.
In the same spirit, rename MTE_PAGE_TAG_STORAGE to
MTE_PAGE_TAG_STORAGE_SIZE to make it clear that it relates to the size of
the memory needed to save the tags for a page. Opportunistically rename
MTE_TAG_SIZE to MTE_TAG_SIZE_BITS to make it clear it is measured in bits,
not in bytes like the rest of the size definitions in the same header file.
Signed-off-by: Alexandru Elisei <[email protected]>
---
arch/arm64/include/asm/mte-def.h | 16 +++++-----
arch/arm64/include/asm/mte.h | 23 +++++++++------
arch/arm64/include/asm/pgtable.h | 8 ++---
arch/arm64/kernel/elfcore.c | 14 ++++-----
arch/arm64/kernel/hibernate.c | 46 ++++++++++++++---------------
arch/arm64/lib/mte.S | 18 ++++++------
arch/arm64/mm/mteswap.c | 50 ++++++++++++++++----------------
7 files changed, 90 insertions(+), 85 deletions(-)
diff --git a/arch/arm64/include/asm/mte-def.h b/arch/arm64/include/asm/mte-def.h
index 14ee86b019c2..eb0d76a6bdcf 100644
--- a/arch/arm64/include/asm/mte-def.h
+++ b/arch/arm64/include/asm/mte-def.h
@@ -5,14 +5,14 @@
#ifndef __ASM_MTE_DEF_H
#define __ASM_MTE_DEF_H
-#define MTE_GRANULE_SIZE UL(16)
-#define MTE_GRANULE_MASK (~(MTE_GRANULE_SIZE - 1))
-#define MTE_GRANULES_PER_PAGE (PAGE_SIZE / MTE_GRANULE_SIZE)
-#define MTE_TAG_SHIFT 56
-#define MTE_TAG_SIZE 4
-#define MTE_TAG_MASK GENMASK((MTE_TAG_SHIFT + (MTE_TAG_SIZE - 1)), MTE_TAG_SHIFT)
-#define MTE_PAGE_TAG_STORAGE (MTE_GRANULES_PER_PAGE * MTE_TAG_SIZE / 8)
+#define MTE_GRANULE_SIZE UL(16)
+#define MTE_GRANULE_MASK (~(MTE_GRANULE_SIZE - 1))
+#define MTE_GRANULES_PER_PAGE (PAGE_SIZE / MTE_GRANULE_SIZE)
+#define MTE_TAG_SHIFT 56
+#define MTE_TAG_SIZE_BITS 4
+#define MTE_TAG_MASK GENMASK((MTE_TAG_SHIFT + (MTE_TAG_SIZE_BITS - 1)), MTE_TAG_SHIFT)
+#define MTE_PAGE_TAG_STORAGE_SIZE (MTE_GRANULES_PER_PAGE * MTE_TAG_SIZE_BITS / 8)
-#define __MTE_PREAMBLE ARM64_ASM_PREAMBLE ".arch_extension memtag\n"
+#define __MTE_PREAMBLE ARM64_ASM_PREAMBLE ".arch_extension memtag\n"
#endif /* __ASM_MTE_DEF_H */
diff --git a/arch/arm64/include/asm/mte.h b/arch/arm64/include/asm/mte.h
index 91fbd5c8a391..8034695b3dd7 100644
--- a/arch/arm64/include/asm/mte.h
+++ b/arch/arm64/include/asm/mte.h
@@ -18,19 +18,24 @@
#include <asm/pgtable-types.h>
-void mte_clear_page_tags(void *addr);
+void mte_clear_page_tags(void *page_addr);
+
unsigned long mte_copy_tags_from_user(void *to, const void __user *from,
unsigned long n);
unsigned long mte_copy_tags_to_user(void __user *to, void *from,
unsigned long n);
-int mte_save_tags(struct page *page);
-void mte_save_page_tags(const void *page_addr, void *tag_storage);
-void mte_restore_tags(swp_entry_t entry, struct page *page);
-void mte_restore_page_tags(void *page_addr, const void *tag_storage);
-void mte_invalidate_tags(int type, pgoff_t offset);
-void mte_invalidate_tags_area(int type);
-void *mte_allocate_tag_storage(void);
-void mte_free_tag_storage(char *storage);
+
+int mte_save_page_tags_by_swp_entry(struct page *page);
+void mte_restore_page_tags_by_swp_entry(swp_entry_t entry, struct page *page);
+
+void mte_copy_page_tags_to_buf(const void *page_addr, void *to);
+void mte_copy_page_tags_from_buf(void *page_addr, const void *from);
+
+void mte_invalidate_tags_by_swp_entry(int type, pgoff_t offset);
+void mte_invalidate_tags_area_by_swp_entry(int type);
+
+void *mte_allocate_tag_buf(void);
+void mte_free_tag_buf(void *buf);
#ifdef CONFIG_ARM64_MTE
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index b19a8aee684c..9b32c74b4a1b 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -1039,7 +1039,7 @@ static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
static inline int arch_prepare_to_swap(struct page *page)
{
if (system_supports_mte())
- return mte_save_tags(page);
+ return mte_save_page_tags_by_swp_entry(page);
return 0;
}
@@ -1047,20 +1047,20 @@ static inline int arch_prepare_to_swap(struct page *page)
static inline void arch_swap_invalidate_page(int type, pgoff_t offset)
{
if (system_supports_mte())
- mte_invalidate_tags(type, offset);
+ mte_invalidate_tags_by_swp_entry(type, offset);
}
static inline void arch_swap_invalidate_area(int type)
{
if (system_supports_mte())
- mte_invalidate_tags_area(type);
+ mte_invalidate_tags_area_by_swp_entry(type);
}
#define __HAVE_ARCH_SWAP_RESTORE
static inline void arch_swap_restore(swp_entry_t entry, struct folio *folio)
{
if (system_supports_mte())
- mte_restore_tags(entry, &folio->page);
+ mte_restore_page_tags_by_swp_entry(entry, &folio->page);
}
#endif /* CONFIG_ARM64_MTE */
diff --git a/arch/arm64/kernel/elfcore.c b/arch/arm64/kernel/elfcore.c
index 2e94d20c4ac7..e9ae00dacad8 100644
--- a/arch/arm64/kernel/elfcore.c
+++ b/arch/arm64/kernel/elfcore.c
@@ -17,7 +17,7 @@
static unsigned long mte_vma_tag_dump_size(struct core_vma_metadata *m)
{
- return (m->dump_size >> PAGE_SHIFT) * MTE_PAGE_TAG_STORAGE;
+ return (m->dump_size >> PAGE_SHIFT) * MTE_PAGE_TAG_STORAGE_SIZE;
}
/* Derived from dump_user_range(); start/end must be page-aligned */
@@ -38,7 +38,7 @@ static int mte_dump_tag_range(struct coredump_params *cprm,
* have been all zeros.
*/
if (!page) {
- dump_skip(cprm, MTE_PAGE_TAG_STORAGE);
+ dump_skip(cprm, MTE_PAGE_TAG_STORAGE_SIZE);
continue;
}
@@ -48,12 +48,12 @@ static int mte_dump_tag_range(struct coredump_params *cprm,
*/
if (!page_mte_tagged(page)) {
put_page(page);
- dump_skip(cprm, MTE_PAGE_TAG_STORAGE);
+ dump_skip(cprm, MTE_PAGE_TAG_STORAGE_SIZE);
continue;
}
if (!tags) {
- tags = mte_allocate_tag_storage();
+ tags = mte_allocate_tag_buf();
if (!tags) {
put_page(page);
ret = 0;
@@ -61,16 +61,16 @@ static int mte_dump_tag_range(struct coredump_params *cprm,
}
}
- mte_save_page_tags(page_address(page), tags);
+ mte_copy_page_tags_to_buf(page_address(page), tags);
put_page(page);
- if (!dump_emit(cprm, tags, MTE_PAGE_TAG_STORAGE)) {
+ if (!dump_emit(cprm, tags, MTE_PAGE_TAG_STORAGE_SIZE)) {
ret = 0;
break;
}
}
if (tags)
- mte_free_tag_storage(tags);
+ mte_free_tag_buf(tags);
return ret;
}
diff --git a/arch/arm64/kernel/hibernate.c b/arch/arm64/kernel/hibernate.c
index 02870beb271e..a3b0e7b32457 100644
--- a/arch/arm64/kernel/hibernate.c
+++ b/arch/arm64/kernel/hibernate.c
@@ -215,41 +215,41 @@ static int create_safe_exec_page(void *src_start, size_t length,
#ifdef CONFIG_ARM64_MTE
-static DEFINE_XARRAY(mte_pages);
+static DEFINE_XARRAY(tags_by_pfn);
-static int save_tags(struct page *page, unsigned long pfn)
+static int save_page_tags_by_pfn(struct page *page, unsigned long pfn)
{
- void *tag_storage, *ret;
+ void *tags, *ret;
- tag_storage = mte_allocate_tag_storage();
- if (!tag_storage)
+ tags = mte_allocate_tag_buf();
+ if (!tags)
return -ENOMEM;
- mte_save_page_tags(page_address(page), tag_storage);
+ mte_copy_page_tags_to_buf(page_address(page), tags);
- ret = xa_store(&mte_pages, pfn, tag_storage, GFP_KERNEL);
+ ret = xa_store(&tags_by_pfn, pfn, tags, GFP_KERNEL);
if (WARN(xa_is_err(ret), "Failed to store MTE tags")) {
- mte_free_tag_storage(tag_storage);
+ mte_free_tag_buf(tags);
return xa_err(ret);
} else if (WARN(ret, "swsusp: %s: Duplicate entry", __func__)) {
- mte_free_tag_storage(ret);
+ mte_free_tag_buf(ret);
}
return 0;
}
-static void swsusp_mte_free_storage(void)
+static void swsusp_mte_free_tags(void)
{
- XA_STATE(xa_state, &mte_pages, 0);
+ XA_STATE(xa_state, &tags_by_pfn, 0);
void *tags;
- xa_lock(&mte_pages);
+ xa_lock(&tags_by_pfn);
xas_for_each(&xa_state, tags, ULONG_MAX) {
- mte_free_tag_storage(tags);
+ mte_free_tag_buf(tags);
}
- xa_unlock(&mte_pages);
+ xa_unlock(&tags_by_pfn);
- xa_destroy(&mte_pages);
+ xa_destroy(&tags_by_pfn);
}
static int swsusp_mte_save_tags(void)
@@ -273,9 +273,9 @@ static int swsusp_mte_save_tags(void)
if (!page_mte_tagged(page))
continue;
- ret = save_tags(page, pfn);
+ ret = save_page_tags_by_pfn(page, pfn);
if (ret) {
- swsusp_mte_free_storage();
+ swsusp_mte_free_tags();
goto out;
}
@@ -290,25 +290,25 @@ static int swsusp_mte_save_tags(void)
static void swsusp_mte_restore_tags(void)
{
- XA_STATE(xa_state, &mte_pages, 0);
+ XA_STATE(xa_state, &tags_by_pfn, 0);
int n = 0;
void *tags;
- xa_lock(&mte_pages);
+ xa_lock(&tags_by_pfn);
xas_for_each(&xa_state, tags, ULONG_MAX) {
unsigned long pfn = xa_state.xa_index;
struct page *page = pfn_to_online_page(pfn);
- mte_restore_page_tags(page_address(page), tags);
+ mte_copy_page_tags_from_buf(page_address(page), tags);
- mte_free_tag_storage(tags);
+ mte_free_tag_buf(tags);
n++;
}
- xa_unlock(&mte_pages);
+ xa_unlock(&tags_by_pfn);
pr_info("Restored %d MTE pages\n", n);
- xa_destroy(&mte_pages);
+ xa_destroy(&tags_by_pfn);
}
#else /* CONFIG_ARM64_MTE */
diff --git a/arch/arm64/lib/mte.S b/arch/arm64/lib/mte.S
index 5018ac03b6bf..9f623e9da09f 100644
--- a/arch/arm64/lib/mte.S
+++ b/arch/arm64/lib/mte.S
@@ -119,7 +119,7 @@ SYM_FUNC_START(mte_copy_tags_to_user)
cbz x2, 2f
1:
ldg x4, [x1]
- ubfx x4, x4, #MTE_TAG_SHIFT, #MTE_TAG_SIZE
+ ubfx x4, x4, #MTE_TAG_SHIFT, #MTE_TAG_SIZE_BITS
USER(2f, sttrb w4, [x0])
add x0, x0, #1
add x1, x1, #MTE_GRANULE_SIZE
@@ -132,11 +132,11 @@ USER(2f, sttrb w4, [x0])
SYM_FUNC_END(mte_copy_tags_to_user)
/*
- * Save the tags in a page
+ * Copy the tags in a page to a buffer
* x0 - page address
- * x1 - tag storage, MTE_PAGE_TAG_STORAGE bytes
+ * x1 - memory buffer, MTE_PAGE_TAG_STORAGE_SIZE bytes
*/
-SYM_FUNC_START(mte_save_page_tags)
+SYM_FUNC_START(mte_copy_page_tags_to_buf)
multitag_transfer_size x7, x5
1:
mov x2, #0
@@ -153,14 +153,14 @@ SYM_FUNC_START(mte_save_page_tags)
b.ne 1b
ret
-SYM_FUNC_END(mte_save_page_tags)
+SYM_FUNC_END(mte_copy_page_tags_to_buf)
/*
- * Restore the tags in a page
+ * Restore the tags in a page from a buffer
* x0 - page address
- * x1 - tag storage, MTE_PAGE_TAG_STORAGE bytes
+ * x1 - memory buffer, MTE_PAGE_TAG_STORAGE_SIZE bytes
*/
-SYM_FUNC_START(mte_restore_page_tags)
+SYM_FUNC_START(mte_copy_page_tags_from_buf)
multitag_transfer_size x7, x5
1:
ldr x2, [x1], #8
@@ -174,4 +174,4 @@ SYM_FUNC_START(mte_restore_page_tags)
b.ne 1b
ret
-SYM_FUNC_END(mte_restore_page_tags)
+SYM_FUNC_END(mte_copy_page_tags_from_buf)
diff --git a/arch/arm64/mm/mteswap.c b/arch/arm64/mm/mteswap.c
index a31833e3ddc5..2a43746b803f 100644
--- a/arch/arm64/mm/mteswap.c
+++ b/arch/arm64/mm/mteswap.c
@@ -7,79 +7,79 @@
#include <linux/swapops.h>
#include <asm/mte.h>
-static DEFINE_XARRAY(mte_pages);
+static DEFINE_XARRAY(tags_by_swp_entry);
-void *mte_allocate_tag_storage(void)
+void *mte_allocate_tag_buf(void)
{
/* tags granule is 16 bytes, 2 tags stored per byte */
- return kmalloc(MTE_PAGE_TAG_STORAGE, GFP_KERNEL);
+ return kmalloc(MTE_PAGE_TAG_STORAGE_SIZE, GFP_KERNEL);
}
-void mte_free_tag_storage(char *storage)
+void mte_free_tag_buf(void *buf)
{
- kfree(storage);
+ kfree(buf);
}
-int mte_save_tags(struct page *page)
+int mte_save_page_tags_by_swp_entry(struct page *page)
{
- void *tag_storage, *ret;
+ void *tags, *ret;
if (!page_mte_tagged(page))
return 0;
- tag_storage = mte_allocate_tag_storage();
- if (!tag_storage)
+ tags = mte_allocate_tag_buf();
+ if (!tags)
return -ENOMEM;
- mte_save_page_tags(page_address(page), tag_storage);
+ mte_copy_page_tags_to_buf(page_address(page), tags);
/* lookup the swap entry.val from the page */
- ret = xa_store(&mte_pages, page_swap_entry(page).val, tag_storage,
+ ret = xa_store(&tags_by_swp_entry, page_swap_entry(page).val, tags,
GFP_KERNEL);
if (WARN(xa_is_err(ret), "Failed to store MTE tags")) {
- mte_free_tag_storage(tag_storage);
+ mte_free_tag_buf(tags);
return xa_err(ret);
} else if (ret) {
/* Entry is being replaced, free the old entry */
- mte_free_tag_storage(ret);
+ mte_free_tag_buf(ret);
}
return 0;
}
-void mte_restore_tags(swp_entry_t entry, struct page *page)
+void mte_restore_page_tags_by_swp_entry(swp_entry_t entry, struct page *page)
{
- void *tags = xa_load(&mte_pages, entry.val);
+ void *tags = xa_load(&tags_by_swp_entry, entry.val);
if (!tags)
return;
if (try_page_mte_tagging(page)) {
- mte_restore_page_tags(page_address(page), tags);
+ mte_copy_page_tags_from_buf(page_address(page), tags);
set_page_mte_tagged(page);
}
}
-void mte_invalidate_tags(int type, pgoff_t offset)
+void mte_invalidate_tags_by_swp_entry(int type, pgoff_t offset)
{
swp_entry_t entry = swp_entry(type, offset);
- void *tags = xa_erase(&mte_pages, entry.val);
+ void *tags = xa_erase(&tags_by_swp_entry, entry.val);
- mte_free_tag_storage(tags);
+ mte_free_tag_buf(tags);
}
-void mte_invalidate_tags_area(int type)
+void mte_invalidate_tags_area_by_swp_entry(int type)
{
swp_entry_t entry = swp_entry(type, 0);
swp_entry_t last_entry = swp_entry(type + 1, 0);
void *tags;
- XA_STATE(xa_state, &mte_pages, entry.val);
+ XA_STATE(xa_state, &tags_by_swp_entry, entry.val);
- xa_lock(&mte_pages);
+ xa_lock(&tags_by_swp_entry);
xas_for_each(&xa_state, tags, last_entry.val - 1) {
- __xa_erase(&mte_pages, xa_state.xa_index);
- mte_free_tag_storage(tags);
+ __xa_erase(&tags_by_swp_entry, xa_state.xa_index);
+ mte_free_tag_buf(tags);
}
- xa_unlock(&mte_pages);
+ xa_unlock(&tags_by_swp_entry);
}
--
2.42.1
Tag storage memory requires that the tag storage pages used for data are
always migratable when they need to be repurposed to store tags.
If ARCH_KEEP_MEMBLOCK is enabled, kexec will scan all non-reserved
memblocks to find a suitable location for copying the kernel image. The
kernel image, once loaded, cannot be moved to another location in physical
memory. The initialization code for the tag storage reserves the memblocks
for the tag storage pages, which means kexec will not use them, and the tag
storage pages can be migrated at any time, which is the desired behaviour.
However, if ARCH_KEEP_MEMBLOCK is not selected, kexec will not skip a
region unless the memory resource has the IORESOURCE_SYSRAM_DRIVER_MANAGED
flag, which isn't currently set by the tag storage initialization code.
Make ARM64_MTE_TAG_STORAGE depend on ARCH_KEEP_MEMBLOCK to make it explicit
that the Kconfig option is required for it to work correctly.
Signed-off-by: Alexandru Elisei <[email protected]>
---
arch/arm64/Kconfig | 1 +
1 file changed, 1 insertion(+)
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 047487046e8f..efa5b7958169 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -2065,6 +2065,7 @@ config ARM64_MTE
if ARM64_MTE
config ARM64_MTE_TAG_STORAGE
bool "Dynamic MTE tag storage management"
+ depends on ARCH_KEEP_MEMBLOCK
select CMA
help
Adds support for dynamic management of the memory used by the hardware
--
2.42.1
Add the MTE tag storage pages to the MIGRATE_CMA migratetype, which allows
the page allocator to manage them like regular pages.
This migratetype lends the pages some very desirable properties:
* They cannot be longterm pinned, meaning they will always be migratable.
* The pages can be allocated explicitly by using their PFN (with
alloc_contig_range()) when they are needed to store tags, as in the sketch
below.
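A minimal sketch of that second point, assuming a tag storage block described
by its first PFN and its size in pages (the series' real reservation path does
more bookkeeping than this):

/* Sketch only: take one tag storage block off the MIGRATE_CMA freelists so
 * it can hold tags again; alloc_contig_range() migrates any data currently
 * occupying the block. */
static int reserve_tag_block(unsigned long block_pfn,
                             unsigned long block_size_pages)
{
        return alloc_contig_range(block_pfn, block_pfn + block_size_pages,
                                  MIGRATE_CMA, GFP_KERNEL);
}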
Signed-off-by: Alexandru Elisei <[email protected]>
---
arch/arm64/Kconfig | 1 +
arch/arm64/kernel/mte_tag_storage.c | 68 +++++++++++++++++++++++++++++
include/linux/mmzone.h | 5 +++
mm/internal.h | 3 --
4 files changed, 74 insertions(+), 3 deletions(-)
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index fe8276fdc7a8..047487046e8f 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -2065,6 +2065,7 @@ config ARM64_MTE
if ARM64_MTE
config ARM64_MTE_TAG_STORAGE
bool "Dynamic MTE tag storage management"
+ select CMA
help
Adds support for dynamic management of the memory used by the hardware
for storing MTE tags. This memory, unlike normal memory, cannot be
diff --git a/arch/arm64/kernel/mte_tag_storage.c b/arch/arm64/kernel/mte_tag_storage.c
index fa6267ef8392..427f4f1909f3 100644
--- a/arch/arm64/kernel/mte_tag_storage.c
+++ b/arch/arm64/kernel/mte_tag_storage.c
@@ -5,10 +5,12 @@
* Copyright (C) 2023 ARM Ltd.
*/
+#include <linux/cma.h>
#include <linux/memblock.h>
#include <linux/mm.h>
#include <linux/of_device.h>
#include <linux/of_fdt.h>
+#include <linux/pageblock-flags.h>
#include <linux/range.h>
#include <linux/string.h>
#include <linux/xarray.h>
@@ -189,6 +191,14 @@ static int __init fdt_init_tag_storage(unsigned long node, const char *uname,
return ret;
}
+ /* Pages are managed in pageblock_nr_pages chunks */
+ if (!IS_ALIGNED(tag_range->start | range_len(tag_range), pageblock_nr_pages)) {
+ pr_err("Tag storage region 0x%llx-0x%llx not aligned to pageblock size 0x%llx",
+ PFN_PHYS(tag_range->start), PFN_PHYS(tag_range->end),
+ PFN_PHYS(pageblock_nr_pages));
+ return -EINVAL;
+ }
+
ret = tag_storage_get_memory_node(node, &mem_node);
if (ret)
return ret;
@@ -254,3 +264,61 @@ void __init mte_tag_storage_init(void)
pr_info("MTE tag storage region management disabled");
}
}
+
+static int __init mte_tag_storage_activate_regions(void)
+{
+ phys_addr_t dram_start, dram_end;
+ struct range *tag_range;
+ unsigned long pfn;
+ int i, ret;
+
+ if (num_tag_regions == 0)
+ return 0;
+
+ dram_start = memblock_start_of_DRAM();
+ dram_end = memblock_end_of_DRAM();
+
+ for (i = 0; i < num_tag_regions; i++) {
+ tag_range = &tag_regions[i].tag_range;
+ /*
+ * Tag storage region was clipped by arm64_bootmem_init()
+ * enforcing addressing limits.
+ */
+ if (PFN_PHYS(tag_range->start) < dram_start ||
+ PFN_PHYS(tag_range->end) >= dram_end) {
+ pr_err("Tag storage region 0x%llx-0x%llx outside addressable memory",
+ PFN_PHYS(tag_range->start), PFN_PHYS(tag_range->end));
+ ret = -EINVAL;
+ goto out_disabled;
+ }
+ }
+
+ /*
+ * MTE disabled, tag storage pages can be used like any other pages. The
+ * only restriction is that the pages cannot be used by kexec because
+ * the memory remains marked as reserved in the memblock allocator.
+ */
+ if (!system_supports_mte()) {
+ for (i = 0; i < num_tag_regions; i++) {
+ tag_range = &tag_regions[i].tag_range;
+ for (pfn = tag_range->start; pfn <= tag_range->end; pfn++)
+ free_reserved_page(pfn_to_page(pfn));
+ }
+ ret = 0;
+ goto out_disabled;
+ }
+
+ for (i = 0; i < num_tag_regions; i++) {
+ tag_range = &tag_regions[i].tag_range;
+ for (pfn = tag_range->start; pfn <= tag_range->end; pfn += pageblock_nr_pages)
+ init_cma_reserved_pageblock(pfn_to_page(pfn));
+ totalcma_pages += range_len(tag_range);
+ }
+
+ return 0;
+
+out_disabled:
+ pr_info("MTE tag storage region management disabled");
+ return ret;
+}
+arch_initcall(mte_tag_storage_activate_regions);
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 3c25226beeed..15f81429e145 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -83,6 +83,11 @@ static inline bool is_migrate_movable(int mt)
return is_migrate_cma(mt) || mt == MIGRATE_MOVABLE;
}
+#ifdef CONFIG_CMA
+/* Free whole pageblock and set its migration type to MIGRATE_CMA. */
+void init_cma_reserved_pageblock(struct page *page);
+#endif
+
/*
* Check whether a migratetype can be merged with another migratetype.
*
diff --git a/mm/internal.h b/mm/internal.h
index b61034bd50f5..ddf6bb6c6308 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -539,9 +539,6 @@ isolate_migratepages_range(struct compact_control *cc,
int __alloc_contig_migrate_range(struct compact_control *cc,
unsigned long start, unsigned long end);
-/* Free whole pageblock and set its migration type to MIGRATE_CMA. */
-void init_cma_reserved_pageblock(struct page *page);
-
#endif /* CONFIG_COMPACTION || CONFIG_CMA */
int find_suitable_fallback(struct free_area *area, unsigned int order,
--
2.42.1
An architecture might want to fixup the gfp flags based on the type of VMA
where the page will be mapped.
On arm64, this is currently used when the VMA is MTE enabled. With
__GFP_TAGGED set, tag zeroing is performed at the same time as the data is
zeroed, for performance reasons, instead of being performed separately in
set_pte_at() -> mte_sync_tags().
Its usage will be expanded when the storage for the tags has to be
explicitly managed by the kernel.
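As an illustration of the calling pattern (the helper below is invented; in
the series the hook is called from vma_alloc_folio() and the shmem paths
shown in the diff):

static struct folio *alloc_movable_folio_for_vma(struct vm_area_struct *vma,
                                                 unsigned long addr)
{
        gfp_t gfp = GFP_HIGHUSER_MOVABLE | __GFP_ZERO;

        /* The architecture may add flags, e.g. __GFP_TAGGED for VM_MTE VMAs. */
        gfp |= arch_calc_vma_gfp(vma, gfp);

        return vma_alloc_folio(gfp, 0, vma, addr, false);
}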
Signed-off-by: Alexandru Elisei <[email protected]>
---
arch/arm64/include/asm/page.h | 5 ++---
arch/arm64/include/asm/pgtable.h | 3 +++
arch/arm64/mm/fault.c | 19 ++++++-------------
include/linux/pgtable.h | 7 +++++++
mm/mempolicy.c | 1 +
mm/shmem.c | 5 ++++-
6 files changed, 23 insertions(+), 17 deletions(-)
diff --git a/arch/arm64/include/asm/page.h b/arch/arm64/include/asm/page.h
index 2312e6ee595f..c8125a28eaa2 100644
--- a/arch/arm64/include/asm/page.h
+++ b/arch/arm64/include/asm/page.h
@@ -29,9 +29,8 @@ void copy_user_highpage(struct page *to, struct page *from,
void copy_highpage(struct page *to, struct page *from);
#define __HAVE_ARCH_COPY_HIGHPAGE
-struct folio *vma_alloc_zeroed_movable_folio(struct vm_area_struct *vma,
- unsigned long vaddr);
-#define vma_alloc_zeroed_movable_folio vma_alloc_zeroed_movable_folio
+#define vma_alloc_zeroed_movable_folio(vma, vaddr) \
+ vma_alloc_folio(GFP_HIGHUSER_MOVABLE | __GFP_ZERO, 0, vma, vaddr, false)
void tag_clear_highpage(struct page *to);
#define __HAVE_ARCH_TAG_CLEAR_HIGHPAGE
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 9b32c74b4a1b..cd5dacd1be3a 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -1065,6 +1065,9 @@ static inline void arch_swap_restore(swp_entry_t entry, struct folio *folio)
#endif /* CONFIG_ARM64_MTE */
+#define __HAVE_ARCH_CALC_VMA_GFP
+gfp_t arch_calc_vma_gfp(struct vm_area_struct *vma, gfp_t gfp);
+
/*
* On AArch64, the cache coherency is handled via the set_pte_at() function.
*/
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index daa91608d917..acbc7530d2b2 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -935,22 +935,15 @@ void do_debug_exception(unsigned long addr_if_watchpoint, unsigned long esr,
NOKPROBE_SYMBOL(do_debug_exception);
/*
- * Used during anonymous page fault handling.
+ * If this is called during anonymous page fault handling, and the page is
+ * mapped with PROT_MTE, initialise the tags at the point of tag zeroing as this
+ * is usually faster than separate DC ZVA and STGM.
*/
-struct folio *vma_alloc_zeroed_movable_folio(struct vm_area_struct *vma,
- unsigned long vaddr)
+gfp_t arch_calc_vma_gfp(struct vm_area_struct *vma, gfp_t gfp)
{
- gfp_t flags = GFP_HIGHUSER_MOVABLE | __GFP_ZERO;
-
- /*
- * If the page is mapped with PROT_MTE, initialise the tags at the
- * point of allocation and page zeroing as this is usually faster than
- * separate DC ZVA and STGM.
- */
if (vma->vm_flags & VM_MTE)
- flags |= __GFP_TAGGED;
-
- return vma_alloc_folio(flags, 0, vma, vaddr, false);
+ return __GFP_TAGGED;
+ return 0;
}
void tag_clear_highpage(struct page *page)
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index b7a9ab818f6d..b1001ce361ac 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -873,6 +873,13 @@ static inline void arch_do_swap_page(struct mm_struct *mm,
}
#endif
+#ifndef __HAVE_ARCH_CALC_VMA_GFP
+static inline gfp_t arch_calc_vma_gfp(struct vm_area_struct *vma, gfp_t gfp)
+{
+ return 0;
+}
+#endif
+
#ifndef __HAVE_ARCH_PREP_NEW_PAGE
static inline int arch_prep_new_page(struct page *page, int order, gfp_t gfp)
{
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 50bc43ab50d6..cb170abae1fd 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2170,6 +2170,7 @@ struct folio *vma_alloc_folio(gfp_t gfp, int order, struct vm_area_struct *vma,
pgoff_t ilx;
struct page *page;
+ gfp |= arch_calc_vma_gfp(vma, gfp);
pol = get_vma_policy(vma, addr, order, &ilx);
page = alloc_pages_mpol(gfp | __GFP_COMP, order,
pol, ilx, numa_node_id());
diff --git a/mm/shmem.c b/mm/shmem.c
index 91e2620148b2..71ce5fe5c779 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1570,7 +1570,7 @@ static struct folio *shmem_swapin_cluster(swp_entry_t swap, gfp_t gfp,
*/
static gfp_t limit_gfp_mask(gfp_t huge_gfp, gfp_t limit_gfp)
{
- gfp_t allowflags = __GFP_IO | __GFP_FS | __GFP_RECLAIM;
+ gfp_t allowflags = __GFP_IO | __GFP_FS | __GFP_RECLAIM | __GFP_TAGGED;
gfp_t denyflags = __GFP_NOWARN | __GFP_NORETRY;
gfp_t zoneflags = limit_gfp & GFP_ZONEMASK;
gfp_t result = huge_gfp & ~(allowflags | GFP_ZONEMASK);
@@ -2023,6 +2023,7 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index,
gfp_t huge_gfp;
huge_gfp = vma_thp_gfp_mask(vma);
+ huge_gfp |= arch_calc_vma_gfp(vma, huge_gfp);
huge_gfp = limit_gfp_mask(huge_gfp, gfp);
folio = shmem_alloc_and_add_folio(huge_gfp,
inode, index, fault_mm, true);
@@ -2199,6 +2200,8 @@ static vm_fault_t shmem_fault(struct vm_fault *vmf)
vm_fault_t ret = 0;
int err;
+ gfp |= arch_calc_vma_gfp(vmf->vma, gfp);
+
/*
* Trinity finds that probing a hole which tmpfs is punching can
* prevent the hole-punch from ever completing: noted in i_private.
--
2.42.1
The patch f945116e4e19 ("mm: page_alloc: remove stale CMA guard code")
removed the CMA filter when allocating from the MIGRATE_MOVABLE pcp list
because CMA is always allowed when __GFP_MOVABLE is set.
With the introduction of the arch_alloc_cma() function, the above is not
true anymore, so bring back the filter.
This is a partial revert, because the stale comment remains removed.
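To make the interaction clearer, here is a sketch (not part of the patch;
the helper name is invented) of how ALLOC_CMA ends up not being set for
tagged allocations, which is why the pcplist fast path has to re-check it:

static inline unsigned int cma_alloc_flags(gfp_t gfp_mask,
                                           unsigned int alloc_flags)
{
#ifdef CONFIG_CMA
        /*
         * __GFP_MOVABLE alone is no longer enough: the architecture can veto
         * CMA, e.g. arm64 returns false for __GFP_TAGGED allocations because
         * tag storage pages cannot themselves be tagged.
         */
        if (gfp_migratetype(gfp_mask) == MIGRATE_MOVABLE &&
            arch_alloc_cma(gfp_mask))
                alloc_flags |= ALLOC_CMA;
#endif
        return alloc_flags;
}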
Signed-off-by: Alexandru Elisei <[email protected]>
---
mm/page_alloc.c | 15 +++++++++++----
1 file changed, 11 insertions(+), 4 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 0f508070c404..135f9283a863 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2907,10 +2907,17 @@ struct page *rmqueue(struct zone *preferred_zone,
WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1));
if (likely(pcp_allowed_order(order))) {
- page = rmqueue_pcplist(preferred_zone, zone, order,
- migratetype, alloc_flags);
- if (likely(page))
- goto out;
+ /*
+ * MIGRATE_MOVABLE pcplist could have the pages on CMA area and
+ * we need to skip it when CMA area isn't allowed.
+ */
+ if (!IS_ENABLED(CONFIG_CMA) || alloc_flags & ALLOC_CMA ||
+ migratetype != MIGRATE_MOVABLE) {
+ page = rmqueue_pcplist(preferred_zone, zone, order,
+ migratetype, alloc_flags);
+ if (likely(page))
+ goto out;
+ }
}
page = rmqueue_buddy(preferred_zone, zone, order, alloc_flags,
--
2.42.1
To enable tagging on a memory range, userspace can use mprotect() with the
PROT_MTE access flag. Pages already mapped in the VMA don't have the
associated tag storage block reserved, so mark the PTEs as
PAGE_FAULT_ON_ACCESS to trigger a fault next time they are accessed, and
reserve the tag storage on the fault path.
This has several benefits over reserving the tag storage as part of the
mprotect() call handling:
- Tag storage is reserved only for those pages in the VMA that are
accessed, instead of for all the pages already mapped in the VMA.
- Reduces the latency of the mprotect() call.
- Eliminates races with page migration.
But all of this is at the expense of an extra page fault per page until the
pages being accessed all have their corresponding tag storage reserved.
For arm64, the PAGE_FAULT_ON_ACCESS protection is created by defining a new
page table entry software bit, PTE_TAG_STORAGE_NONE. Linux doesn't set any
of the PBHA bits in entries from the last level of the translation table
and it doesn't use the TCR_ELx.HWUxx bits; also, the first PBHA bit, bit
59, is already being used as a software bit for PMD_PRESENT_INVALID.
This is only implemented for PTE mappings; PMD mappings will follow.
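From userspace the mechanism stays invisible; a small usage example (the
PROT_MTE value shown is the arm64 one, normally provided by the kernel
headers):

#include <stddef.h>
#include <sys/mman.h>

#ifndef PROT_MTE
#define PROT_MTE        0x20    /* arm64-specific, from <asm/mman.h> */
#endif

/*
 * Turn on tag checking for an existing mapping; pages already present are
 * marked PAGE_FAULT_ON_ACCESS and get their tag storage reserved lazily,
 * on the first access after mprotect() returns.
 */
int tag_existing_buffer(void *buf, size_t len)
{
        if (mprotect(buf, len, PROT_READ | PROT_WRITE | PROT_MTE))
                return -1;
        return 0;
}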
Signed-off-by: Alexandru Elisei <[email protected]>
---
arch/arm64/Kconfig | 1 +
arch/arm64/include/asm/mte.h | 4 +-
arch/arm64/include/asm/mte_tag_storage.h | 2 +
arch/arm64/include/asm/pgtable-prot.h | 2 +
arch/arm64/include/asm/pgtable.h | 40 ++++++---
arch/arm64/kernel/mte.c | 12 ++-
arch/arm64/mm/fault.c | 101 +++++++++++++++++++++++
include/linux/pgtable.h | 17 ++++
mm/Kconfig | 3 +
mm/memory.c | 3 +
10 files changed, 170 insertions(+), 15 deletions(-)
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index efa5b7958169..3b9c435eaafb 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -2066,6 +2066,7 @@ if ARM64_MTE
config ARM64_MTE_TAG_STORAGE
bool "Dynamic MTE tag storage management"
depends on ARCH_KEEP_MEMBLOCK
+ select ARCH_HAS_FAULT_ON_ACCESS
select CMA
help
Adds support for dynamic management of the memory used by the hardware
diff --git a/arch/arm64/include/asm/mte.h b/arch/arm64/include/asm/mte.h
index 6457b7899207..70dc2e409070 100644
--- a/arch/arm64/include/asm/mte.h
+++ b/arch/arm64/include/asm/mte.h
@@ -107,7 +107,7 @@ static inline bool try_page_mte_tagging(struct page *page)
}
void mte_zero_clear_page_tags(void *addr);
-void mte_sync_tags(pte_t pte, unsigned int nr_pages);
+void mte_sync_tags(pte_t *pteval, unsigned int nr_pages);
void mte_copy_page_tags(void *kto, const void *kfrom);
void mte_thread_init_user(void);
void mte_thread_switch(struct task_struct *next);
@@ -139,7 +139,7 @@ static inline bool try_page_mte_tagging(struct page *page)
static inline void mte_zero_clear_page_tags(void *addr)
{
}
-static inline void mte_sync_tags(pte_t pte, unsigned int nr_pages)
+static inline void mte_sync_tags(pte_t *pteval, unsigned int nr_pages)
{
}
static inline void mte_copy_page_tags(void *kto, const void *kfrom)
diff --git a/arch/arm64/include/asm/mte_tag_storage.h b/arch/arm64/include/asm/mte_tag_storage.h
index 6e5d28e607bb..c70ced60a0cd 100644
--- a/arch/arm64/include/asm/mte_tag_storage.h
+++ b/arch/arm64/include/asm/mte_tag_storage.h
@@ -33,6 +33,8 @@ int reserve_tag_storage(struct page *page, int order, gfp_t gfp);
void free_tag_storage(struct page *page, int order);
bool page_tag_storage_reserved(struct page *page);
+
+vm_fault_t handle_page_missing_tag_storage(struct vm_fault *vmf);
#else
static inline bool tag_storage_enabled(void)
{
diff --git a/arch/arm64/include/asm/pgtable-prot.h b/arch/arm64/include/asm/pgtable-prot.h
index e9624f6326dd..85ebb3e352ad 100644
--- a/arch/arm64/include/asm/pgtable-prot.h
+++ b/arch/arm64/include/asm/pgtable-prot.h
@@ -19,6 +19,7 @@
#define PTE_SPECIAL (_AT(pteval_t, 1) << 56)
#define PTE_DEVMAP (_AT(pteval_t, 1) << 57)
#define PTE_PROT_NONE (_AT(pteval_t, 1) << 58) /* only when !PTE_VALID */
+#define PTE_TAG_STORAGE_NONE (_AT(pteval_t, 1) << 60) /* only when PTE_PROT_NONE */
/*
* This bit indicates that the entry is present i.e. pmd_page()
@@ -94,6 +95,7 @@ extern bool arm64_use_ng_mappings;
})
#define PAGE_NONE __pgprot(((_PAGE_DEFAULT) & ~PTE_VALID) | PTE_PROT_NONE | PTE_RDONLY | PTE_NG | PTE_PXN | PTE_UXN)
+#define PAGE_FAULT_ON_ACCESS __pgprot(((_PAGE_DEFAULT) & ~PTE_VALID) | PTE_PROT_NONE | PTE_TAG_STORAGE_NONE | PTE_RDONLY | PTE_NG | PTE_PXN | PTE_UXN)
/* shared+writable pages are clean by default, hence PTE_RDONLY|PTE_WRITE */
#define PAGE_SHARED __pgprot(_PAGE_SHARED)
#define PAGE_SHARED_EXEC __pgprot(_PAGE_SHARED_EXEC)
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 20e8de853f5d..8cc135f1c112 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -326,10 +326,10 @@ static inline void __check_safe_pte_update(struct mm_struct *mm, pte_t *ptep,
__func__, pte_val(old_pte), pte_val(pte));
}
-static inline void __sync_cache_and_tags(pte_t pte, unsigned int nr_pages)
+static inline void __sync_cache_and_tags(pte_t *pteval, unsigned int nr_pages)
{
- if (pte_present(pte) && pte_user_exec(pte) && !pte_special(pte))
- __sync_icache_dcache(pte);
+ if (pte_present(*pteval) && pte_user_exec(*pteval) && !pte_special(*pteval))
+ __sync_icache_dcache(*pteval);
/*
* If the PTE would provide user space access to the tags associated
@@ -337,9 +337,9 @@ static inline void __sync_cache_and_tags(pte_t pte, unsigned int nr_pages)
* pte_access_permitted() returns false for exec only mappings, they
* don't expose tags (instruction fetches don't check tags).
*/
- if (system_supports_mte() && pte_access_permitted(pte, false) &&
- !pte_special(pte) && pte_tagged(pte))
- mte_sync_tags(pte, nr_pages);
+ if (system_supports_mte() && pte_access_permitted(*pteval, false) &&
+ !pte_special(*pteval) && pte_tagged(*pteval))
+ mte_sync_tags(pteval, nr_pages);
}
static inline void set_ptes(struct mm_struct *mm,
@@ -347,7 +347,7 @@ static inline void set_ptes(struct mm_struct *mm,
pte_t *ptep, pte_t pte, unsigned int nr)
{
page_table_check_ptes_set(mm, ptep, pte, nr);
- __sync_cache_and_tags(pte, nr);
+ __sync_cache_and_tags(&pte, nr);
for (;;) {
__check_safe_pte_update(mm, ptep, pte);
@@ -459,6 +459,26 @@ static inline int pmd_protnone(pmd_t pmd)
}
#endif
+#ifdef CONFIG_ARCH_HAS_FAULT_ON_ACCESS
+static inline bool fault_on_access_pte(pte_t pte)
+{
+ return (pte_val(pte) & (PTE_PROT_NONE | PTE_TAG_STORAGE_NONE | PTE_VALID)) ==
+ (PTE_PROT_NONE | PTE_TAG_STORAGE_NONE);
+}
+
+static inline bool fault_on_access_pmd(pmd_t pmd)
+{
+ return fault_on_access_pte(pmd_pte(pmd));
+}
+
+static inline vm_fault_t arch_do_page_fault_on_access(struct vm_fault *vmf)
+{
+ if (tag_storage_enabled())
+ return handle_page_missing_tag_storage(vmf);
+ return VM_FAULT_SIGBUS;
+}
+#endif /* CONFIG_ARCH_HAS_FAULT_ON_ACCESS */
+
#define pmd_present_invalid(pmd) (!!(pmd_val(pmd) & PMD_PRESENT_INVALID))
static inline int pmd_present(pmd_t pmd)
@@ -533,7 +553,7 @@ static inline void __set_pte_at(struct mm_struct *mm,
unsigned long __always_unused addr,
pte_t *ptep, pte_t pte, unsigned int nr)
{
- __sync_cache_and_tags(pte, nr);
+ __sync_cache_and_tags(&pte, nr);
__check_safe_pte_update(mm, ptep, pte);
set_pte(ptep, pte);
}
@@ -828,8 +848,8 @@ static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
* in MAIR_EL1. The mask below has to include PTE_ATTRINDX_MASK.
*/
const pteval_t mask = PTE_USER | PTE_PXN | PTE_UXN | PTE_RDONLY |
- PTE_PROT_NONE | PTE_VALID | PTE_WRITE | PTE_GP |
- PTE_ATTRINDX_MASK;
+ PTE_PROT_NONE | PTE_TAG_STORAGE_NONE | PTE_VALID |
+ PTE_WRITE | PTE_GP | PTE_ATTRINDX_MASK;
/* preserve the hardware dirty information */
if (pte_hw_dirty(pte))
pte = set_pte_bit(pte, __pgprot(PTE_DIRTY));
diff --git a/arch/arm64/kernel/mte.c b/arch/arm64/kernel/mte.c
index a41ef3213e1e..5962bab1d549 100644
--- a/arch/arm64/kernel/mte.c
+++ b/arch/arm64/kernel/mte.c
@@ -21,6 +21,7 @@
#include <asm/barrier.h>
#include <asm/cpufeature.h>
#include <asm/mte.h>
+#include <asm/mte_tag_storage.h>
#include <asm/ptrace.h>
#include <asm/sysreg.h>
@@ -35,13 +36,18 @@ DEFINE_STATIC_KEY_FALSE(mte_async_or_asymm_mode);
EXPORT_SYMBOL_GPL(mte_async_or_asymm_mode);
#endif
-void mte_sync_tags(pte_t pte, unsigned int nr_pages)
+void mte_sync_tags(pte_t *pteval, unsigned int nr_pages)
{
- struct page *page = pte_page(pte);
+ struct page *page = pte_page(*pteval);
unsigned int i;
- /* if PG_mte_tagged is set, tags have already been initialised */
for (i = 0; i < nr_pages; i++, page++) {
+ if (tag_storage_enabled() && unlikely(!page_tag_storage_reserved(page))) {
+ *pteval = pte_modify(*pteval, PAGE_FAULT_ON_ACCESS);
+ continue;
+ }
+
+ /* if PG_mte_tagged is set, tags have already been initialised */
if (try_page_mte_tagging(page)) {
mte_clear_page_tags(page_address(page));
set_page_mte_tagged(page);
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index acbc7530d2b2..f5fa583acf18 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -19,6 +19,7 @@
#include <linux/kprobes.h>
#include <linux/uaccess.h>
#include <linux/page-flags.h>
+#include <linux/page-isolation.h>
#include <linux/sched/signal.h>
#include <linux/sched/debug.h>
#include <linux/highmem.h>
@@ -953,3 +954,103 @@ void tag_clear_highpage(struct page *page)
mte_zero_clear_page_tags(page_address(page));
set_page_mte_tagged(page);
}
+
+#ifdef CONFIG_ARM64_MTE_TAG_STORAGE
+vm_fault_t handle_page_missing_tag_storage(struct vm_fault *vmf)
+{
+ struct vm_area_struct *vma = vmf->vma;
+ struct page *page = NULL;
+ pte_t new_pte, old_pte;
+ bool writable = false;
+ vm_fault_t err;
+ int ret;
+
+ spin_lock(vmf->ptl);
+ if (unlikely(!pte_same(*vmf->pte, vmf->orig_pte))) {
+ pte_unmap_unlock(vmf->pte, vmf->ptl);
+ return 0;
+ }
+
+ /* Get the normal PTE */
+ old_pte = ptep_get(vmf->pte);
+ new_pte = pte_modify(old_pte, vma->vm_page_prot);
+
+ /*
+ * Detect now whether the PTE could be writable; this information
+ * is only valid while holding the PT lock.
+ */
+ writable = pte_write(new_pte);
+ if (!writable && vma_wants_manual_pte_write_upgrade(vma) &&
+ can_change_pte_writable(vma, vmf->address, new_pte))
+ writable = true;
+
+ page = vm_normal_page(vma, vmf->address, new_pte);
+ if (!page || is_zone_device_page(page))
+ goto out_map;
+
+ /*
+ * This should never happen: once a VMA has been marked as tagged, that
+ * cannot be changed.
+ */
+ if (!(vma->vm_flags & VM_MTE))
+ goto out_map;
+
+ /* Prevent the page from being unmapped from under us. */
+ get_page(page);
+ vma_set_access_pid_bit(vma);
+
+ /*
+ * Pairs with pte_offset_map_nolock(), which takes the RCU read lock,
+ * and spin_lock() above which takes the ptl lock. Both locks should be
+ * balanced after this point.
+ */
+ pte_unmap_unlock(vmf->pte, vmf->ptl);
+
+ /*
+ * The page is probably being isolated for migration; replay the fault
+ * to give time for the entry to be replaced by a migration pte.
+ */
+ if (unlikely(is_migrate_isolate_page(page)))
+ goto out_retry;
+
+ ret = reserve_tag_storage(page, 0, GFP_HIGHUSER_MOVABLE);
+ if (ret)
+ goto out_retry;
+
+ put_page(page);
+
+ vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address, &vmf->ptl);
+ if (unlikely(!pte_same(*vmf->pte, vmf->orig_pte))) {
+ pte_unmap_unlock(vmf->pte, vmf->ptl);
+ return 0;
+ }
+
+out_map:
+ /*
+ * Make it present again, depending on how arch implements
+ * non-accessible ptes, some can allow access by kernel mode.
+ */
+ old_pte = ptep_modify_prot_start(vma, vmf->address, vmf->pte);
+ new_pte = pte_modify(old_pte, vma->vm_page_prot);
+ new_pte = pte_mkyoung(new_pte);
+ if (writable)
+ new_pte = pte_mkwrite(new_pte, vma);
+ ptep_modify_prot_commit(vma, vmf->address, vmf->pte, old_pte, new_pte);
+ update_mmu_cache(vma, vmf->address, vmf->pte);
+ pte_unmap_unlock(vmf->pte, vmf->ptl);
+
+ return 0;
+
+out_retry:
+ put_page(page);
+ if (vmf->flags & FAULT_FLAG_VMA_LOCK)
+ vma_end_read(vma);
+ if (fault_flag_allow_retry_first(vmf->flags)) {
+ err = VM_FAULT_RETRY;
+ } else {
+ /* Replay the fault. */
+ err = 0;
+ }
+ return err;
+}
+#endif
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index ffdb9b6bed6c..e2c761dd6c41 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1458,6 +1458,23 @@ static inline int pmd_protnone(pmd_t pmd)
}
#endif /* CONFIG_NUMA_BALANCING */
+#ifndef CONFIG_ARCH_HAS_FAULT_ON_ACCESS
+static inline bool fault_on_access_pte(pte_t pte)
+{
+ return false;
+}
+
+static inline bool fault_on_access_pmd(pmd_t pmd)
+{
+ return false;
+}
+
+static inline vm_fault_t arch_do_page_fault_on_access(struct vm_fault *vmf)
+{
+ return VM_FAULT_SIGBUS;
+}
+#endif
+
#endif /* CONFIG_MMU */
#ifdef CONFIG_HAVE_ARCH_HUGE_VMAP
diff --git a/mm/Kconfig b/mm/Kconfig
index 89971a894b60..a90eefc3ee80 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1019,6 +1019,9 @@ config IDLE_PAGE_TRACKING
config ARCH_HAS_CACHE_LINE_SIZE
bool
+config ARCH_HAS_FAULT_ON_ACCESS
+ bool
+
config ARCH_HAS_CURRENT_STACK_POINTER
bool
help
diff --git a/mm/memory.c b/mm/memory.c
index e137f7673749..a04a971200b9 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5044,6 +5044,9 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
if (!pte_present(vmf->orig_pte))
return do_swap_page(vmf);
+ if (fault_on_access_pte(vmf->orig_pte) && vma_is_accessible(vmf->vma))
+ return arch_do_page_fault_on_access(vmf);
+
if (pte_protnone(vmf->orig_pte) && vma_is_accessible(vmf->vma))
return do_numa_page(vmf);
--
2.42.1
Reserve tag storage for a tagged page by migrating the contents of the tag
storage (if in use for data) and removing the tag storage pages from the
page allocator by calling alloc_contig_range().
When all the associated tagged pages have been freed, return the tag
storage pages back to the page allocator, where they can be used again for
data allocations.
Tag storage pages cannot be tagged, so disallow allocations from
MIGRATE_CMA when the allocation is tagged.
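As a worked example of the sizing (assuming 4K pages and the 1/32
tag-to-data ratio), matching what order_to_num_blocks() in the diff below
computes:

/*
 * 4 bits of tag per 16 bytes of data means PAGE_SIZE / 32 = 128 bytes of
 * tags per page, i.e. one tag storage page covers 32 data pages.
 *
 *   order 0 (one 4K page) -> max(1 / 32, 1)   = 1 tag storage block
 *   order 9 (2M THP)      -> max(512 / 32, 1) = 16 tag storage blocks
 *
 * (with block_size == 1 page; larger block sizes scale accordingly)
 */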
Signed-off-by: Alexandru Elisei <[email protected]>
---
arch/arm64/include/asm/mte.h | 16 +-
arch/arm64/include/asm/mte_tag_storage.h | 45 +++++
arch/arm64/include/asm/pgtable.h | 27 +++
arch/arm64/kernel/mte_tag_storage.c | 241 +++++++++++++++++++++++
fs/proc/page.c | 1 +
include/linux/kernel-page-flags.h | 1 +
include/linux/page-flags.h | 1 +
include/trace/events/mmflags.h | 3 +-
mm/huge_memory.c | 1 +
9 files changed, 333 insertions(+), 3 deletions(-)
diff --git a/arch/arm64/include/asm/mte.h b/arch/arm64/include/asm/mte.h
index 8034695b3dd7..6457b7899207 100644
--- a/arch/arm64/include/asm/mte.h
+++ b/arch/arm64/include/asm/mte.h
@@ -40,12 +40,24 @@ void mte_free_tag_buf(void *buf);
#ifdef CONFIG_ARM64_MTE
/* track which pages have valid allocation tags */
-#define PG_mte_tagged PG_arch_2
+#define PG_mte_tagged PG_arch_2
/* simple lock to avoid multiple threads tagging the same page */
-#define PG_mte_lock PG_arch_3
+#define PG_mte_lock PG_arch_3
+/* Track if a tagged page has tag storage reserved */
+#define PG_tag_storage_reserved PG_arch_4
+
+#ifdef CONFIG_ARM64_MTE_TAG_STORAGE
+DECLARE_STATIC_KEY_FALSE(tag_storage_enabled_key);
+extern bool page_tag_storage_reserved(struct page *page);
+#endif
static inline void set_page_mte_tagged(struct page *page)
{
+#ifdef CONFIG_ARM64_MTE_TAG_STORAGE
+ /* Open code tag_storage_enabled() */
+ WARN_ON_ONCE(static_branch_likely(&tag_storage_enabled_key) &&
+ !page_tag_storage_reserved(page));
+#endif
/*
* Ensure that the tags written prior to this function are visible
* before the page flags update.
diff --git a/arch/arm64/include/asm/mte_tag_storage.h b/arch/arm64/include/asm/mte_tag_storage.h
index 8f86c4f9a7c3..cab033b184ab 100644
--- a/arch/arm64/include/asm/mte_tag_storage.h
+++ b/arch/arm64/include/asm/mte_tag_storage.h
@@ -5,11 +5,56 @@
#ifndef __ASM_MTE_TAG_STORAGE_H
#define __ASM_MTE_TAG_STORAGE_H
+#ifndef __ASSEMBLY__
+
+#include <linux/mm_types.h>
+
+#include <asm/mte.h>
+
#ifdef CONFIG_ARM64_MTE_TAG_STORAGE
+
+DECLARE_STATIC_KEY_FALSE(tag_storage_enabled_key);
+
+static inline bool tag_storage_enabled(void)
+{
+ return static_branch_likely(&tag_storage_enabled_key);
+}
+
+static inline bool alloc_requires_tag_storage(gfp_t gfp)
+{
+ return gfp & __GFP_TAGGED;
+}
+
void mte_tag_storage_init(void);
+
+int reserve_tag_storage(struct page *page, int order, gfp_t gfp);
+void free_tag_storage(struct page *page, int order);
+
+bool page_tag_storage_reserved(struct page *page);
#else
+static inline bool tag_storage_enabled(void)
+{
+ return false;
+}
+static inline bool alloc_requires_tag_storage(gfp_t gfp)
+{
+ return false;
+}
static inline void mte_tag_storage_init(void)
{
}
+static inline int reserve_tag_storage(struct page *page, int order, gfp_t gfp)
+{
+ return 0;
+}
+static inline void free_tag_storage(struct page *page, int order)
+{
+}
+static inline bool page_tag_storage_reserved(struct page *page)
+{
+ return true;
+}
#endif /* CONFIG_ARM64_MTE_TAG_STORAGE */
+
+#endif /* !__ASSEMBLY__ */
#endif /* __ASM_MTE_TAG_STORAGE_H */
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index cd5dacd1be3a..20e8de853f5d 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -10,6 +10,7 @@
#include <asm/memory.h>
#include <asm/mte.h>
+#include <asm/mte_tag_storage.h>
#include <asm/pgtable-hwdef.h>
#include <asm/pgtable-prot.h>
#include <asm/tlbflush.h>
@@ -1063,6 +1064,32 @@ static inline void arch_swap_restore(swp_entry_t entry, struct folio *folio)
mte_restore_page_tags_by_swp_entry(entry, &folio->page);
}
+#ifdef CONFIG_ARM64_MTE_TAG_STORAGE
+
+#define __HAVE_ARCH_PREP_NEW_PAGE
+static inline int arch_prep_new_page(struct page *page, int order, gfp_t gfp)
+{
+ if (tag_storage_enabled() && alloc_requires_tag_storage(gfp))
+ return reserve_tag_storage(page, order, gfp);
+ return 0;
+}
+
+#define __HAVE_ARCH_FREE_PAGES_PREPARE
+static inline void arch_free_pages_prepare(struct page *page, int order)
+{
+ if (tag_storage_enabled() && page_mte_tagged(page))
+ free_tag_storage(page, order);
+}
+
+#define __HAVE_ARCH_ALLOC_CMA
+static inline bool arch_alloc_cma(gfp_t gfp_mask)
+{
+ if (tag_storage_enabled() && alloc_requires_tag_storage(gfp_mask))
+ return false;
+ return true;
+}
+
+#endif /* CONFIG_ARM64_MTE_TAG_STORAGE */
#endif /* CONFIG_ARM64_MTE */
#define __HAVE_ARCH_CALC_VMA_GFP
diff --git a/arch/arm64/kernel/mte_tag_storage.c b/arch/arm64/kernel/mte_tag_storage.c
index fd63430d4dc0..9f8ef3116fc3 100644
--- a/arch/arm64/kernel/mte_tag_storage.c
+++ b/arch/arm64/kernel/mte_tag_storage.c
@@ -11,12 +11,18 @@
#include <linux/of_device.h>
#include <linux/of_fdt.h>
#include <linux/pageblock-flags.h>
+#include <linux/page-flags.h>
+#include <linux/page_owner.h>
#include <linux/range.h>
+#include <linux/sched/mm.h>
#include <linux/string.h>
+#include <linux/vm_event_item.h>
#include <linux/xarray.h>
#include <asm/mte_tag_storage.h>
+__ro_after_init DEFINE_STATIC_KEY_FALSE(tag_storage_enabled_key);
+
struct tag_region {
struct range mem_range; /* Memory associated with the tag storage, in PFNs. */
struct range tag_range; /* Tag storage memory, in PFNs. */
@@ -28,6 +34,31 @@ struct tag_region {
static struct tag_region tag_regions[MAX_TAG_REGIONS];
static int num_tag_regions;
+/*
+ * A note on locking. Reserving tag storage takes the tag_blocks_lock mutex,
+ * because alloc_contig_range() might sleep.
+ *
+ * Freeing tag storage takes the xa_lock spinlock with interrupts disabled
+ * because pages can be freed from non-preemptible contexts, including from an
+ * interrupt handler.
+ *
+ * Because tag storage can be freed from interrupt contexts, the xarray is
+ * defined with the XA_FLAGS_LOCK_IRQ flag to disable interrupts when calling
+ * xa_store(). This is done to prevent a deadlock with free_tag_storage() being
+ * called from an interrupt raised before xa_store() releases the xa_lock.
+ *
+ * All of the above means that reserve_tag_storage() cannot run concurrently
+ * with itself (no concurrent insertions), but it can run at the same time as
+ * free_tag_storage(). The first thing that reserve_tag_storage() does after
+ * taking the mutex is increase the refcount on all present tag storage blocks
+ * with the xa_lock held, to serialize against freeing the blocks. Taking all
+ * the refcounts in a single critical section is an optimization: if the
+ * refcount operation were moved inside the reservation loop, the xa_lock
+ * would have to be taken and released for each block.
+ */
+static DEFINE_XARRAY_FLAGS(tag_blocks_reserved, XA_FLAGS_LOCK_IRQ);
+static DEFINE_MUTEX(tag_blocks_lock);
+
static int __init tag_storage_of_flat_get_range(unsigned long node, const __be32 *reg,
int reg_len, struct range *range)
{
@@ -368,3 +399,213 @@ static int __init mte_tag_storage_activate_regions(void)
return ret;
}
arch_initcall(mte_tag_storage_activate_regions);
+
+static void page_set_tag_storage_reserved(struct page *page, int order)
+{
+ int i;
+
+ for (i = 0; i < (1 << order); i++)
+ set_bit(PG_tag_storage_reserved, &(page + i)->flags);
+}
+
+static void block_ref_add(unsigned long block, struct tag_region *region, int order)
+{
+ int count;
+
+ count = min(1u << order, 32 * region->block_size);
+ page_ref_add(pfn_to_page(block), count);
+}
+
+static int block_ref_sub_return(unsigned long block, struct tag_region *region, int order)
+{
+ int count;
+
+ count = min(1u << order, 32 * region->block_size);
+ return page_ref_sub_return(pfn_to_page(block), count);
+}
+
+static bool tag_storage_block_is_reserved(unsigned long block)
+{
+ return xa_load(&tag_blocks_reserved, block) != NULL;
+}
+
+static int tag_storage_reserve_block(unsigned long block, struct tag_region *region, int order)
+{
+ int ret;
+
+ ret = xa_err(xa_store(&tag_blocks_reserved, block, pfn_to_page(block), GFP_KERNEL));
+ if (!ret)
+ block_ref_add(block, region, order);
+
+ return ret;
+}
+
+static int order_to_num_blocks(int order)
+{
+ return max((1 << order) / 32, 1);
+}
+
+static int tag_storage_find_block_in_region(struct page *page, unsigned long *blockp,
+ struct tag_region *region)
+{
+ struct range *tag_range = &region->tag_range;
+ struct range *mem_range = &region->mem_range;
+ u64 page_pfn = page_to_pfn(page);
+ u64 block, block_offset;
+
+ if (!(mem_range->start <= page_pfn && page_pfn <= mem_range->end))
+ return -ERANGE;
+
+ block_offset = (page_pfn - mem_range->start) / 32;
+ block = tag_range->start + rounddown(block_offset, region->block_size);
+
+ if (block + region->block_size - 1 > tag_range->end) {
+ pr_err("Block 0x%llx-0x%llx is outside tag region 0x%llx-0x%llx\n",
+ PFN_PHYS(block), PFN_PHYS(block + region->block_size),
+ PFN_PHYS(tag_range->start), PFN_PHYS(tag_range->end));
+ return -ERANGE;
+ }
+ *blockp = block;
+
+ return 0;
+
+}
+
+static int tag_storage_find_block(struct page *page, unsigned long *block,
+ struct tag_region **region)
+{
+ int i, ret;
+
+ for (i = 0; i < num_tag_regions; i++) {
+ ret = tag_storage_find_block_in_region(page, block, &tag_regions[i]);
+ if (ret == 0) {
+ *region = &tag_regions[i];
+ return 0;
+ }
+ }
+
+ return -EINVAL;
+}
+
+bool page_tag_storage_reserved(struct page *page)
+{
+ return test_bit(PG_tag_storage_reserved, &page->flags);
+}
+
+int reserve_tag_storage(struct page *page, int order, gfp_t gfp)
+{
+ unsigned long start_block, end_block;
+ struct tag_region *region;
+ unsigned long block;
+ unsigned long flags;
+ unsigned int tries;
+ int ret = 0;
+
+ VM_WARN_ON_ONCE(!preemptible());
+
+ if (page_tag_storage_reserved(page))
+ return 0;
+
+ /*
+ * __alloc_contig_migrate_range() ignores gfp when allocating the
+ * destination page for migration. Regardless, massage gfp flags and
+ * remove __GFP_TAGGED to avoid recursion in case gfp stops being
+ * ignored.
+ */
+ gfp &= ~__GFP_TAGGED;
+ if (!(gfp & __GFP_NORETRY))
+ gfp |= __GFP_RETRY_MAYFAIL;
+
+ ret = tag_storage_find_block(page, &start_block, &region);
+ if (WARN_ONCE(ret, "Missing tag storage block for pfn 0x%lx", page_to_pfn(page)))
+ return 0;
+ end_block = start_block + order_to_num_blocks(order) * region->block_size;
+
+ mutex_lock(&tag_blocks_lock);
+
+ /* Check again, this time with the lock held. */
+ if (page_tag_storage_reserved(page))
+ goto out_unlock;
+
+ /* Make sure existing entries are not freed from under our feet. */
+ xa_lock_irqsave(&tag_blocks_reserved, flags);
+ for (block = start_block; block < end_block; block += region->block_size) {
+ if (tag_storage_block_is_reserved(block))
+ block_ref_add(block, region, order);
+ }
+ xa_unlock_irqrestore(&tag_blocks_reserved, flags);
+
+ for (block = start_block; block < end_block; block += region->block_size) {
+ /* Refcount incremented above. */
+ if (tag_storage_block_is_reserved(block))
+ continue;
+
+ tries = 3;
+ while (tries--) {
+ ret = alloc_contig_range(block, block + region->block_size, MIGRATE_CMA, gfp);
+ if (ret == 0 || ret != -EBUSY)
+ break;
+ }
+
+ if (ret)
+ goto out_error;
+
+ ret = tag_storage_reserve_block(block, region, order);
+ if (ret) {
+ free_contig_range(block, region->block_size);
+ goto out_error;
+ }
+
+ count_vm_events(CMA_ALLOC_SUCCESS, region->block_size);
+ }
+
+ page_set_tag_storage_reserved(page, order);
+out_unlock:
+ mutex_unlock(&tag_blocks_lock);
+
+ return 0;
+
+out_error:
+ xa_lock_irqsave(&tag_blocks_reserved, flags);
+ for (block = start_block; block < end_block; block += region->block_size) {
+ if (tag_storage_block_is_reserved(block) &&
+ block_ref_sub_return(block, region, order) == 1) {
+ __xa_erase(&tag_blocks_reserved, block);
+ free_contig_range(block, region->block_size);
+ }
+ }
+ xa_unlock_irqrestore(&tag_blocks_reserved, flags);
+
+ mutex_unlock(&tag_blocks_lock);
+
+ count_vm_events(CMA_ALLOC_FAIL, region->block_size);
+
+ return ret;
+}
+
+void free_tag_storage(struct page *page, int order)
+{
+ unsigned long block, start_block, end_block;
+ struct tag_region *region;
+ unsigned long flags;
+ int ret;
+
+ ret = tag_storage_find_block(page, &start_block, &region);
+ if (WARN_ONCE(ret, "Missing tag storage block for pfn 0x%lx", page_to_pfn(page)))
+ return;
+
+ end_block = start_block + order_to_num_blocks(order) * region->block_size;
+
+ xa_lock_irqsave(&tag_blocks_reserved, flags);
+ for (block = start_block; block < end_block; block += region->block_size) {
+ if (WARN_ONCE(!tag_storage_block_is_reserved(block),
+ "Block 0x%lx is not reserved for pfn 0x%lx", block, page_to_pfn(page)))
+ continue;
+
+ if (block_ref_sub_return(block, region, order) == 1) {
+ __xa_erase(&tag_blocks_reserved, block);
+ free_contig_range(block, region->block_size);
+ }
+ }
+ xa_unlock_irqrestore(&tag_blocks_reserved, flags);
+}
diff --git a/fs/proc/page.c b/fs/proc/page.c
index 195b077c0fac..e7eb584a9234 100644
--- a/fs/proc/page.c
+++ b/fs/proc/page.c
@@ -221,6 +221,7 @@ u64 stable_page_flags(struct page *page)
#ifdef CONFIG_ARCH_USES_PG_ARCH_X
u |= kpf_copy_bit(k, KPF_ARCH_2, PG_arch_2);
u |= kpf_copy_bit(k, KPF_ARCH_3, PG_arch_3);
+ u |= kpf_copy_bit(k, KPF_ARCH_4, PG_arch_4);
#endif
return u;
diff --git a/include/linux/kernel-page-flags.h b/include/linux/kernel-page-flags.h
index 859f4b0c1b2b..4a0d719ffdd4 100644
--- a/include/linux/kernel-page-flags.h
+++ b/include/linux/kernel-page-flags.h
@@ -19,5 +19,6 @@
#define KPF_SOFTDIRTY 40
#define KPF_ARCH_2 41
#define KPF_ARCH_3 42
+#define KPF_ARCH_4 43
#endif /* LINUX_KERNEL_PAGE_FLAGS_H */
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index a88e64acebfe..7915165a51bd 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -135,6 +135,7 @@ enum pageflags {
#ifdef CONFIG_ARCH_USES_PG_ARCH_X
PG_arch_2,
PG_arch_3,
+ PG_arch_4,
#endif
__NR_PAGEFLAGS,
diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index 6ca0d5ed46c0..ba962fd10a2c 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -125,7 +125,8 @@ IF_HAVE_PG_HWPOISON(hwpoison) \
IF_HAVE_PG_IDLE(idle) \
IF_HAVE_PG_IDLE(young) \
IF_HAVE_PG_ARCH_X(arch_2) \
-IF_HAVE_PG_ARCH_X(arch_3)
+IF_HAVE_PG_ARCH_X(arch_3) \
+IF_HAVE_PG_ARCH_X(arch_4)
#define show_page_flags(flags) \
(flags) ? __print_flags(flags, "|", \
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index f31f02472396..9beead961a65 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2474,6 +2474,7 @@ static void __split_huge_page_tail(struct folio *folio, int tail,
#ifdef CONFIG_ARCH_USES_PG_ARCH_X
(1L << PG_arch_2) |
(1L << PG_arch_3) |
+ (1L << PG_arch_4) |
#endif
(1L << PG_dirty) |
LRU_GEN_MASK | LRU_REFS_MASK));
--
2.42.1
To be able to reserve the tag storage associated with a page requires that
the tag storage page can be migrated.
When HW KASAN is enabled, the kernel allocates tagged pages in
non-preemptible contexts, which can make reserving the associated tag
storage impossible.
Keep the tag storage pages reserved if HW KASAN is enabled.
Signed-off-by: Alexandru Elisei <[email protected]>
---
arch/arm64/kernel/mte_tag_storage.c | 13 +++++++++++++
1 file changed, 13 insertions(+)
diff --git a/arch/arm64/kernel/mte_tag_storage.c b/arch/arm64/kernel/mte_tag_storage.c
index 427f4f1909f3..8b9bedf7575d 100644
--- a/arch/arm64/kernel/mte_tag_storage.c
+++ b/arch/arm64/kernel/mte_tag_storage.c
@@ -308,6 +308,19 @@ static int __init mte_tag_storage_activate_regions(void)
goto out_disabled;
}
+ /*
+ * The kernel allocates memory in non-preemptible contexts, which makes
+ * migration impossible when reserving the associated tag storage.
+ *
+ * The check is safe to make because KASAN HW tags are enabled before
+ * the rest of the init functions are called, in smp_prepare_boot_cpu().
+ */
+ if (kasan_hw_tags_enabled()) {
+ pr_info("KASAN HW tags incompatible with MTE tag storage management");
+ ret = 0;
+ goto out_disabled;
+ }
+
for (i = 0; i < num_tag_regions; i++) {
tag_range = &tag_regions[i].tag_range;
for (pfn = tag_range->start; pfn <= tag_range->end; pfn += pageblock_nr_pages)
--
2.42.1
Make sure the contents of the tag storage block are not corrupted by
performing:
1. A tag dcache inval when the associated tagged pages are freed, to avoid
dirty tag cache lines being evicted and corrupting the tag storage
block when it's being used to store data (see the note after this list
on how the tag cache line size is derived).
2. A data cache inval when the tag storage block is being reserved, to
ensure that no dirty data cache lines are present, which would
trigger a writeback that could corrupt the tags stored in the block.
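A note on the tag cache line size used by the new invalidation routine: the
C helper below is only a sketch of what the tcache_line_size assembler macro
computes, assuming the CTR_EL0.TminLine encoding (log2 of the number of
4-byte words covered by allocation tags in the smallest cache line):

static inline unsigned long tcache_line_size_bytes(unsigned long ctr_el0)
{
        unsigned long tminline = (ctr_el0 >> 32) & 0x3f;  /* CTR_EL0[37:32] */

        return 4UL << tminline; /* e.g. TminLine == 4 -> 64-byte tag lines */
}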
Signed-off-by: Alexandru Elisei <[email protected]>
---
arch/arm64/include/asm/assembler.h | 10 ++++++++++
arch/arm64/include/asm/mte_tag_storage.h | 2 ++
arch/arm64/kernel/mte_tag_storage.c | 11 +++++++++++
arch/arm64/lib/mte.S | 16 ++++++++++++++++
4 files changed, 39 insertions(+)
diff --git a/arch/arm64/include/asm/assembler.h b/arch/arm64/include/asm/assembler.h
index 376a980f2bad..8d41c8cfdc69 100644
--- a/arch/arm64/include/asm/assembler.h
+++ b/arch/arm64/include/asm/assembler.h
@@ -310,6 +310,16 @@ alternative_cb_end
lsl \reg, \reg, \tmp // actual cache line size
.endm
+/*
+ * tcache_line_size - get the safe tag cache line size across all CPUs
+ */
+ .macro tcache_line_size, reg, tmp
+ read_ctr \tmp
+ ubfm \tmp, \tmp, #32, #37 // tag cache line size encoding
+ mov \reg, #4 // bytes per word
+ lsl \reg, \reg, \tmp // actual tag cache line size
+ .endm
+
/*
* raw_icache_line_size - get the minimum I-cache line size on this CPU
* from the CTR register.
diff --git a/arch/arm64/include/asm/mte_tag_storage.h b/arch/arm64/include/asm/mte_tag_storage.h
index cab033b184ab..6e5d28e607bb 100644
--- a/arch/arm64/include/asm/mte_tag_storage.h
+++ b/arch/arm64/include/asm/mte_tag_storage.h
@@ -11,6 +11,8 @@
#include <asm/mte.h>
+extern void dcache_inval_tags_poc(unsigned long start, unsigned long end);
+
#ifdef CONFIG_ARM64_MTE_TAG_STORAGE
DECLARE_STATIC_KEY_FALSE(tag_storage_enabled_key);
diff --git a/arch/arm64/kernel/mte_tag_storage.c b/arch/arm64/kernel/mte_tag_storage.c
index 9f8ef3116fc3..833480048170 100644
--- a/arch/arm64/kernel/mte_tag_storage.c
+++ b/arch/arm64/kernel/mte_tag_storage.c
@@ -19,6 +19,7 @@
#include <linux/vm_event_item.h>
#include <linux/xarray.h>
+#include <asm/cacheflush.h>
#include <asm/mte_tag_storage.h>
__ro_after_init DEFINE_STATIC_KEY_FALSE(tag_storage_enabled_key);
@@ -431,8 +432,13 @@ static bool tag_storage_block_is_reserved(unsigned long block)
static int tag_storage_reserve_block(unsigned long block, struct tag_region *region, int order)
{
+ unsigned long block_va;
int ret;
+ block_va = (unsigned long)page_to_virt(pfn_to_page(block));
+ /* Avoid writeback of dirty data cache lines corrupting tags. */
+ dcache_inval_poc(block_va, block_va + region->block_size * PAGE_SIZE);
+
ret = xa_err(xa_store(&tag_blocks_reserved, block, pfn_to_page(block), GFP_KERNEL));
if (!ret)
block_ref_add(block, region, order);
@@ -587,6 +593,7 @@ void free_tag_storage(struct page *page, int order)
{
unsigned long block, start_block, end_block;
struct tag_region *region;
+ unsigned long page_va;
unsigned long flags;
int ret;
@@ -594,6 +601,10 @@ void free_tag_storage(struct page *page, int order)
if (WARN_ONCE(ret, "Missing tag storage block for pfn 0x%lx", page_to_pfn(page)))
return;
+ page_va = (unsigned long)page_to_virt(page);
+ /* Avoid writeback of dirty tag cache lines corrupting data. */
+ dcache_inval_tags_poc(page_va, page_va + (PAGE_SIZE << order));
+
end_block = start_block + order_to_num_blocks(order) * region->block_size;
xa_lock_irqsave(&tag_blocks_reserved, flags);
diff --git a/arch/arm64/lib/mte.S b/arch/arm64/lib/mte.S
index 9f623e9da09f..bc02b4e95062 100644
--- a/arch/arm64/lib/mte.S
+++ b/arch/arm64/lib/mte.S
@@ -175,3 +175,19 @@ SYM_FUNC_START(mte_copy_page_tags_from_buf)
ret
SYM_FUNC_END(mte_copy_page_tags_from_buf)
+
+/*
+ * dcache_inval_tags_poc(start, end)
+ *
+ * Ensure that any tags in the D-cache for the interval [start, end)
+ * are invalidated to PoC.
+ *
+ * - start - virtual start address of region
+ * - end - virtual end address of region
+ */
+SYM_FUNC_START(__pi_dcache_inval_tags_poc)
+ tcache_line_size x2, x3
+ dcache_by_myline_op igvac, sy, x0, x1, x2, x3
+ ret
+SYM_FUNC_END(__pi_dcache_inval_tags_poc)
+SYM_FUNC_ALIAS(dcache_inval_tags_poc, __pi_dcache_inval_tags_poc)
--
2.42.1
alloc_contig_range() requires that the requested pages are in the same
zone. Check that this is indeed the case before initializing the tag
storage blocks.
Signed-off-by: Alexandru Elisei <[email protected]>
---
arch/arm64/kernel/mte_tag_storage.c | 33 +++++++++++++++++++++++++++++
1 file changed, 33 insertions(+)
diff --git a/arch/arm64/kernel/mte_tag_storage.c b/arch/arm64/kernel/mte_tag_storage.c
index 8b9bedf7575d..fd63430d4dc0 100644
--- a/arch/arm64/kernel/mte_tag_storage.c
+++ b/arch/arm64/kernel/mte_tag_storage.c
@@ -265,6 +265,35 @@ void __init mte_tag_storage_init(void)
}
}
+/* alloc_contig_range() requires all pages to be in the same zone. */
+static int __init mte_tag_storage_check_zone(void)
+{
+ struct range *tag_range;
+ struct zone *zone;
+ unsigned long pfn;
+ u32 block_size;
+ int i, j;
+
+ for (i = 0; i < num_tag_regions; i++) {
+ block_size = tag_regions[i].block_size;
+ if (block_size == 1)
+ continue;
+
+ tag_range = &tag_regions[i].tag_range;
+ for (pfn = tag_range->start; pfn <= tag_range->end; pfn += block_size) {
+ zone = page_zone(pfn_to_page(pfn));
+ for (j = 1; j < block_size; j++) {
+ if (page_zone(pfn_to_page(pfn + j)) != zone) {
+ pr_err("Tag storage block pages in different zones");
+ return -EINVAL;
+ }
+ }
+ }
+ }
+
+ return 0;
+}
+
static int __init mte_tag_storage_activate_regions(void)
{
phys_addr_t dram_start, dram_end;
@@ -321,6 +350,10 @@ static int __init mte_tag_storage_activate_regions(void)
goto out_disabled;
}
+ ret = mte_tag_storage_check_zone();
+ if (ret)
+ goto out_disabled;
+
for (i = 0; i < num_tag_regions; i++) {
tag_range = &tag_regions[i].tag_range;
for (pfn = tag_range->start; pfn <= tag_range->end; pfn += pageblock_nr_pages)
--
2.42.1
On arm64, the zero page receives special treatment by having the tagged
flag set on MTE initialization, not when the page is mapped in a process
address space. Reserve the corresponding tag block when tag storage
management is being activated.
Signed-off-by: Alexandru Elisei <[email protected]>
---
arch/arm64/kernel/mte_tag_storage.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/arch/arm64/kernel/mte_tag_storage.c b/arch/arm64/kernel/mte_tag_storage.c
index 833480048170..a1cc239f7211 100644
--- a/arch/arm64/kernel/mte_tag_storage.c
+++ b/arch/arm64/kernel/mte_tag_storage.c
@@ -393,6 +393,8 @@ static int __init mte_tag_storage_activate_regions(void)
totalcma_pages += range_len(tag_range);
}
+ reserve_tag_storage(ZERO_PAGE(0), 0, GFP_HIGHUSER_MOVABLE);
+
return 0;
out_disabled:
--
2.42.1
Handle PAGE_FAULT_ON_ACCESS faults for huge pages in a similar way to
regular pages.
Signed-off-by: Alexandru Elisei <[email protected]>
---
arch/arm64/include/asm/mte_tag_storage.h | 1 +
arch/arm64/include/asm/pgtable.h | 7 ++
arch/arm64/mm/fault.c | 81 ++++++++++++++++++++++++
include/linux/huge_mm.h | 2 +
include/linux/pgtable.h | 5 ++
mm/huge_memory.c | 4 +-
mm/memory.c | 3 +
7 files changed, 101 insertions(+), 2 deletions(-)
diff --git a/arch/arm64/include/asm/mte_tag_storage.h b/arch/arm64/include/asm/mte_tag_storage.h
index c70ced60a0cd..b97406d369ce 100644
--- a/arch/arm64/include/asm/mte_tag_storage.h
+++ b/arch/arm64/include/asm/mte_tag_storage.h
@@ -35,6 +35,7 @@ void free_tag_storage(struct page *page, int order);
bool page_tag_storage_reserved(struct page *page);
vm_fault_t handle_page_missing_tag_storage(struct vm_fault *vmf);
+vm_fault_t handle_huge_page_missing_tag_storage(struct vm_fault *vmf);
#else
static inline bool tag_storage_enabled(void)
{
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 8cc135f1c112..1704411c096d 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -477,6 +477,13 @@ static inline vm_fault_t arch_do_page_fault_on_access(struct vm_fault *vmf)
return handle_page_missing_tag_storage(vmf);
return VM_FAULT_SIGBUS;
}
+
+static inline vm_fault_t arch_do_huge_page_fault_on_access(struct vm_fault *vmf)
+{
+ if (tag_storage_enabled())
+ return handle_huge_page_missing_tag_storage(vmf);
+ return VM_FAULT_SIGBUS;
+}
#endif /* CONFIG_ARCH_HAS_FAULT_ON_ACCESS */
#define pmd_present_invalid(pmd) (!!(pmd_val(pmd) & PMD_PRESENT_INVALID))
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index f5fa583acf18..6730a0812a24 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -1041,6 +1041,87 @@ vm_fault_t handle_page_missing_tag_storage(struct vm_fault *vmf)
return 0;
+out_retry:
+ put_page(page);
+ if (vmf->flags & FAULT_FLAG_VMA_LOCK)
+ vma_end_read(vma);
+ if (fault_flag_allow_retry_first(vmf->flags)) {
+ err = VM_FAULT_RETRY;
+ } else {
+ /* Replay the fault. */
+ err = 0;
+ }
+ return err;
+}
+
+vm_fault_t handle_huge_page_missing_tag_storage(struct vm_fault *vmf)
+{
+ unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
+ struct vm_area_struct *vma = vmf->vma;
+ pmd_t old_pmd, new_pmd;
+ bool writable = false;
+ struct page *page;
+ vm_fault_t err;
+ int ret;
+
+ vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
+ if (unlikely(!pmd_same(vmf->orig_pmd, *vmf->pmd))) {
+ spin_unlock(vmf->ptl);
+ return 0;
+ }
+
+ old_pmd = vmf->orig_pmd;
+ new_pmd = pmd_modify(old_pmd, vma->vm_page_prot);
+
+ /*
+ * Detect now whether the PMD could be writable; this information
+ * is only valid while holding the PT lock.
+ */
+ writable = pmd_write(new_pmd);
+ if (!writable && vma_wants_manual_pte_write_upgrade(vma) &&
+ can_change_pmd_writable(vma, vmf->address, new_pmd))
+ writable = true;
+
+ page = vm_normal_page_pmd(vma, haddr, new_pmd);
+ if (!page)
+ goto out_map;
+
+ if (!(vma->vm_flags & VM_MTE))
+ goto out_map;
+
+ get_page(page);
+ vma_set_access_pid_bit(vma);
+
+ spin_unlock(vmf->ptl);
+ writable = false;
+
+ if (unlikely(is_migrate_isolate_page(page)))
+ goto out_retry;
+
+ ret = reserve_tag_storage(page, HPAGE_PMD_ORDER, GFP_HIGHUSER_MOVABLE);
+ if (ret)
+ goto out_retry;
+
+ put_page(page);
+
+ vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
+ if (unlikely(!pmd_same(old_pmd, *vmf->pmd))) {
+ spin_unlock(vmf->ptl);
+ return 0;
+ }
+
+out_map:
+ /* Restore the PMD */
+ new_pmd = pmd_modify(old_pmd, vma->vm_page_prot);
+ new_pmd = pmd_mkyoung(new_pmd);
+ if (writable)
+ new_pmd = pmd_mkwrite(new_pmd, vma);
+ set_pmd_at(vma->vm_mm, haddr, vmf->pmd, new_pmd);
+ update_mmu_cache_pmd(vma, vmf->address, vmf->pmd);
+ spin_unlock(vmf->ptl);
+
+ return 0;
+
out_retry:
put_page(page);
if (vmf->flags & FAULT_FLAG_VMA_LOCK)
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index fa0350b0812a..bb84291f9231 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -36,6 +36,8 @@ bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,
int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
pmd_t *pmd, unsigned long addr, pgprot_t newprot,
unsigned long cp_flags);
+bool can_change_pmd_writable(struct vm_area_struct *vma, unsigned long addr,
+ pmd_t pmd);
vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, bool write);
vm_fault_t vmf_insert_pfn_pud(struct vm_fault *vmf, pfn_t pfn, bool write);
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index e2c761dd6c41..de45f475bf8d 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1473,6 +1473,11 @@ static inline vm_fault_t arch_do_page_fault_on_access(struct vm_fault *vmf)
{
return VM_FAULT_SIGBUS;
}
+
+static inline vm_fault_t arch_do_huge_page_fault_on_access(struct vm_fault *vmf)
+{
+ return VM_FAULT_SIGBUS;
+}
#endif
#endif /* CONFIG_MMU */
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 9beead961a65..d1402b43ea39 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1406,8 +1406,8 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf)
return VM_FAULT_FALLBACK;
}
-static inline bool can_change_pmd_writable(struct vm_area_struct *vma,
- unsigned long addr, pmd_t pmd)
+inline bool can_change_pmd_writable(struct vm_area_struct *vma,
+ unsigned long addr, pmd_t pmd)
{
struct page *page;
diff --git a/mm/memory.c b/mm/memory.c
index a04a971200b9..46b926625503 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5168,6 +5168,9 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
return 0;
}
if (pmd_trans_huge(vmf.orig_pmd) || pmd_devmap(vmf.orig_pmd)) {
+ if (fault_on_access_pmd(vmf.orig_pmd) && vma_is_accessible(vma))
+ return arch_do_huge_page_fault_on_access(&vmf);
+
if (pmd_protnone(vmf.orig_pmd) && vma_is_accessible(vma))
return do_huge_pmd_numa_page(&vmf);
--
2.42.1
Linux restores tags when a page is swapped in and there are tags associated
with the swap entry which the new page will replace. The saved tags are
restored even if the page will not be mapped as tagged, to protect against
cases where the page is shared between different VMAs, and is tagged in
some, but untagged in others. By using this approach, the process can still
access the correct tags following an mprotect(PROT_MTE) on the non-MTE
enabled VMA.
But this poses a challenge for managing tag storage: in the scenario above,
when a new page is allocated to be swapped in for the process where it will
be mapped as untagged, the corresponding tag storage block is not reserved.
mte_restore_page_tags_by_swp_entry(), when it restores the saved tags, will
overwrite data in the tag storage block associated with the new page,
leading to data corruption if the block is in use by a process.
Get around this issue by saving the tags in a new xarray, this time indexed
by the page pfn, and then restoring them when tag storage is reserved for
the page.
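A sketch of the resulting swap-in flow (the hooks are the ones added by this
patch; the generic callsites named below are assumed, they are outside this
excerpt):

/*
 *   do_swap_page()
 *     arch_swap_prepare_to_restore()
 *       mte_try_transfer_swap_tags()    // no tag storage reserved yet:
 *                                       // stash the tags in tags_by_pfn
 *   ...
 *   reserve_tag_storage()               // e.g. from a later fault
 *     mte_restore_tags_for_pfn()        // tag storage now reserved:
 *                                       // copy the stashed tags to the page
 */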
Signed-off-by: Alexandru Elisei <[email protected]>
---
arch/arm64/include/asm/mte_tag_storage.h | 9 ++
arch/arm64/include/asm/pgtable.h | 11 +++
arch/arm64/kernel/mte_tag_storage.c | 20 +++-
arch/arm64/mm/mteswap.c | 112 +++++++++++++++++++++++
4 files changed, 148 insertions(+), 4 deletions(-)
diff --git a/arch/arm64/include/asm/mte_tag_storage.h b/arch/arm64/include/asm/mte_tag_storage.h
index 6a8b19a6a758..a3c38099fe1a 100644
--- a/arch/arm64/include/asm/mte_tag_storage.h
+++ b/arch/arm64/include/asm/mte_tag_storage.h
@@ -37,6 +37,15 @@ bool page_is_tag_storage(struct page *page);
vm_fault_t handle_page_missing_tag_storage(struct vm_fault *vmf);
vm_fault_t handle_huge_page_missing_tag_storage(struct vm_fault *vmf);
+
+void tags_by_pfn_lock(void);
+void tags_by_pfn_unlock(void);
+
+void *mte_erase_tags_for_pfn(unsigned long pfn);
+bool mte_save_tags_for_pfn(void *tags, unsigned long pfn);
+void mte_restore_tags_for_pfn(unsigned long start_pfn, int order);
+
+vm_fault_t mte_try_transfer_swap_tags(swp_entry_t entry, struct page *page);
#else
static inline bool tag_storage_enabled(void)
{
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 1704411c096d..1a25b7d601c2 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -1084,6 +1084,17 @@ static inline void arch_swap_invalidate_area(int type)
mte_invalidate_tags_area_by_swp_entry(type);
}
+#ifdef CONFIG_ARM64_MTE_TAG_STORAGE
+#define __HAVE_ARCH_SWAP_PREPARE_TO_RESTORE
+static inline vm_fault_t arch_swap_prepare_to_restore(swp_entry_t entry,
+ struct folio *folio)
+{
+ if (tag_storage_enabled())
+ return mte_try_transfer_swap_tags(entry, &folio->page);
+ return 0;
+}
+#endif
+
#define __HAVE_ARCH_SWAP_RESTORE
static inline void arch_swap_restore(swp_entry_t entry, struct folio *folio)
{
diff --git a/arch/arm64/kernel/mte_tag_storage.c b/arch/arm64/kernel/mte_tag_storage.c
index 5096ce859136..6b11bb408b51 100644
--- a/arch/arm64/kernel/mte_tag_storage.c
+++ b/arch/arm64/kernel/mte_tag_storage.c
@@ -547,8 +547,10 @@ int reserve_tag_storage(struct page *page, int order, gfp_t gfp)
mutex_lock(&tag_blocks_lock);
/* Check again, this time with the lock held. */
- if (page_tag_storage_reserved(page))
- goto out_unlock;
+ if (page_tag_storage_reserved(page)) {
+ mutex_unlock(&tag_blocks_lock);
+ return 0;
+ }
/* Make sure existing entries are not freed from out under out feet. */
xa_lock_irqsave(&tag_blocks_reserved, flags);
@@ -583,9 +585,10 @@ int reserve_tag_storage(struct page *page, int order, gfp_t gfp)
}
page_set_tag_storage_reserved(page, order);
-out_unlock:
mutex_unlock(&tag_blocks_lock);
+ mte_restore_tags_for_pfn(page_to_pfn(page), order);
+
return 0;
out_error:
@@ -612,7 +615,8 @@ void free_tag_storage(struct page *page, int order)
struct tag_region *region;
unsigned long page_va;
unsigned long flags;
- int ret;
+ void *tags;
+ int i, ret;
ret = tag_storage_find_block(page, &start_block, &region);
if (WARN_ONCE(ret, "Missing tag storage block for pfn 0x%lx", page_to_pfn(page)))
@@ -622,6 +626,14 @@ void free_tag_storage(struct page *page, int order)
/* Avoid writeback of dirty tag cache lines corrupting data. */
dcache_inval_tags_poc(page_va, page_va + (PAGE_SIZE << order));
+ tags_by_pfn_lock();
+ for (i = 0; i < (1 << order); i++) {
+ tags = mte_erase_tags_for_pfn(page_to_pfn(page + i));
+ if (unlikely(tags))
+ mte_free_tag_buf(tags);
+ }
+ tags_by_pfn_unlock();
+
end_block = start_block + order_to_num_blocks(order) * region->block_size;
xa_lock_irqsave(&tag_blocks_reserved, flags);
diff --git a/arch/arm64/mm/mteswap.c b/arch/arm64/mm/mteswap.c
index 2a43746b803f..20d718a514af 100644
--- a/arch/arm64/mm/mteswap.c
+++ b/arch/arm64/mm/mteswap.c
@@ -20,6 +20,114 @@ void mte_free_tag_buf(void *buf)
kfree(buf);
}
+#ifdef CONFIG_ARM64_MTE_TAG_STORAGE
+static DEFINE_XARRAY(tags_by_pfn);
+
+void tags_by_pfn_lock(void)
+{
+ xa_lock(&tags_by_pfn);
+}
+
+void tags_by_pfn_unlock(void)
+{
+ xa_unlock(&tags_by_pfn);
+}
+
+void *mte_erase_tags_for_pfn(unsigned long pfn)
+{
+ return __xa_erase(&tags_by_pfn, pfn);
+}
+
+bool mte_save_tags_for_pfn(void *tags, unsigned long pfn)
+{
+ void *entry;
+ int ret;
+
+ ret = xa_reserve(&tags_by_pfn, pfn, GFP_KERNEL);
+ if (ret)
+ return true;
+
+ tags_by_pfn_lock();
+
+ if (page_tag_storage_reserved(pfn_to_page(pfn))) {
+ tags_by_pfn_unlock();
+ return false;
+ }
+
+ entry = __xa_store(&tags_by_pfn, pfn, tags, GFP_ATOMIC);
+ if (xa_is_err(entry)) {
+ xa_release(&tags_by_pfn, pfn);
+ goto out_unlock;
+ } else if (entry) {
+ mte_free_tag_buf(entry);
+ }
+
+out_unlock:
+ tags_by_pfn_unlock();
+ return true;
+}
+
+void mte_restore_tags_for_pfn(unsigned long start_pfn, int order)
+{
+ struct page *page = pfn_to_page(start_pfn);
+ unsigned long pfn;
+ void *tags;
+
+ tags_by_pfn_lock();
+
+ for (pfn = start_pfn; pfn < start_pfn + (1 << order); pfn++, page++) {
+ if (WARN_ON_ONCE(!page_tag_storage_reserved(page)))
+ continue;
+
+ tags = mte_erase_tags_for_pfn(pfn);
+ if (unlikely(tags)) {
+ /*
+ * Mark the page as tagged so mte_sync_tags() doesn't
+ * clear the tags.
+ */
+ WARN_ON_ONCE(!try_page_mte_tagging(page));
+ mte_copy_page_tags_from_buf(page_address(page), tags);
+ set_page_mte_tagged(page);
+ mte_free_tag_buf(tags);
+ }
+ }
+
+ tags_by_pfn_unlock();
+}
+
+/*
+ * Note on locking: swap in/out is done with the folio locked, which eliminates
+ * races with mte_save/restore_page_tags_by_swp_entry.
+ */
+vm_fault_t mte_try_transfer_swap_tags(swp_entry_t entry, struct page *page)
+{
+ void *swap_tags, *pfn_tags;
+ bool saved;
+
+ /*
+ * mte_restore_page_tags_by_swp_entry() will take care of copying the
+ * tags over.
+ */
+ if (likely(page_mte_tagged(page) || page_tag_storage_reserved(page)))
+ return 0;
+
+ swap_tags = xa_load(&tags_by_swp_entry, entry.val);
+ if (!swap_tags)
+ return 0;
+
+ pfn_tags = mte_allocate_tag_buf();
+ if (!pfn_tags)
+ return VM_FAULT_OOM;
+
+ memcpy(pfn_tags, swap_tags, MTE_PAGE_TAG_STORAGE_SIZE);
+ saved = mte_save_tags_for_pfn(pfn_tags, page_to_pfn(page));
+ if (!saved)
+ mte_free_tag_buf(pfn_tags);
+
+ return 0;
+}
+#endif
+
int mte_save_page_tags_by_swp_entry(struct page *page)
{
void *tags, *ret;
@@ -54,6 +162,10 @@ void mte_restore_page_tags_by_swp_entry(swp_entry_t entry, struct page *page)
if (!tags)
return;
+ /* Tags will be restored when tag storage is reserved. */
+ if (tag_storage_enabled() && unlikely(!page_tag_storage_reserved(page)))
+ return;
+
if (try_page_mte_tagging(page)) {
mte_copy_page_tags_from_buf(page_address(page), tags);
set_page_mte_tagged(page);
--
2.42.1
As long as a fatal signal is pending, alloc_contig_range() will fail with
-EINTR. This makes it impossible for tag storage allocation to succeed, and
the page allocator will print an OOM splat.
The process is going to be killed, so return 0 (success) from
reserve_tag_storage() to allow the page allocator to make progress.
set_pte_at() will map the page with PAGE_FAULT_ON_ACCESS, and subsequent
accesses from other threads will fault until the signal is delivered.
Signed-off-by: Alexandru Elisei <[email protected]>
---
arch/arm64/kernel/mte_tag_storage.c | 17 +++++++++++++++++
arch/arm64/mm/fault.c | 5 +++++
2 files changed, 22 insertions(+)
diff --git a/arch/arm64/kernel/mte_tag_storage.c b/arch/arm64/kernel/mte_tag_storage.c
index 6b11bb408b51..602fdc70db1c 100644
--- a/arch/arm64/kernel/mte_tag_storage.c
+++ b/arch/arm64/kernel/mte_tag_storage.c
@@ -572,6 +572,23 @@ int reserve_tag_storage(struct page *page, int order, gfp_t gfp)
break;
}
+ /*
+ * alloc_contig_range() returns -EINTR from
+ * __alloc_contig_migrate_range() if a fatal signal is pending.
+ * As long as the signal hasn't been handled, it is impossible
+ * to reserve tag storage for any page. Stop trying to reserve
+ * tag storage, but return 0 so the page allocator can make
+ * forward progress, instead of printing an OOM splat.
+ *
+ * The tagged page with missing tag storage will be mapped with
+ * PAGE_FAULT_ON_ACCESS in set_pte_at(), which means accesses
+ * until the signal is delivered will cause a fault.
+ */
+ if (ret == -EINTR) {
+ ret = 0;
+ goto out_error;
+ }
+
if (ret)
goto out_error;
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index 964c5ae161a3..fdc98c5828bf 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -950,6 +950,11 @@ gfp_t arch_calc_vma_gfp(struct vm_area_struct *vma, gfp_t gfp)
void tag_clear_highpage(struct page *page)
{
+ if (tag_storage_enabled() && unlikely(!page_tag_storage_reserved(page))) {
+ clear_page(page_address(page));
+ return;
+ }
+
/* Newly allocated page, shouldn't have been tagged yet */
WARN_ON_ONCE(!try_page_mte_tagging(page));
mte_zero_clear_page_tags(page_address(page));
--
2.42.1
KVM allows MTE enabled VMs to be created when the backing VMA does not have
MTE enabled. Without changes to how KVM allocates memory for a VM, it is
impossible at the moment to discern when the corresponding tag storage
needs to be reserved.
For now, disable MTE in KVM if tag storage is enabled.
Signed-off-by: Alexandru Elisei <[email protected]>
---
arch/arm64/kvm/arm.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index e5f75f1f1085..5b33c532c62a 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -29,6 +29,7 @@
#include <linux/uaccess.h>
#include <asm/ptrace.h>
#include <asm/mman.h>
+#include <asm/mte_tag_storage.h>
#include <asm/tlbflush.h>
#include <asm/cacheflush.h>
#include <asm/cpufeature.h>
@@ -86,7 +87,8 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
break;
case KVM_CAP_ARM_MTE:
mutex_lock(&kvm->lock);
- if (!system_supports_mte() || kvm->created_vcpus) {
+ if (!system_supports_mte() || tag_storage_enabled() ||
+ kvm->created_vcpus) {
r = -EINVAL;
} else {
r = 0;
@@ -279,7 +281,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
r = 1;
break;
case KVM_CAP_ARM_MTE:
- r = system_supports_mte();
+ r = system_supports_mte() && !tag_storage_enabled();
break;
case KVM_CAP_STEAL_TIME:
r = kvm_arm_pvtime_supported();
--
2.42.1
Signed-off-by: Alexandru Elisei <[email protected]>
---
arch/arm64/include/asm/mte_tag_storage.h | 1 +
arch/arm64/kernel/mte_tag_storage.c | 15 +++++++
arch/arm64/mm/fault.c | 55 ++++++++++++++++++++++++
include/linux/migrate.h | 8 +++-
include/linux/migrate_mode.h | 1 +
mm/internal.h | 6 ---
6 files changed, 78 insertions(+), 8 deletions(-)
diff --git a/arch/arm64/include/asm/mte_tag_storage.h b/arch/arm64/include/asm/mte_tag_storage.h
index b97406d369ce..6a8b19a6a758 100644
--- a/arch/arm64/include/asm/mte_tag_storage.h
+++ b/arch/arm64/include/asm/mte_tag_storage.h
@@ -33,6 +33,7 @@ int reserve_tag_storage(struct page *page, int order, gfp_t gfp);
void free_tag_storage(struct page *page, int order);
bool page_tag_storage_reserved(struct page *page);
+bool page_is_tag_storage(struct page *page);
vm_fault_t handle_page_missing_tag_storage(struct vm_fault *vmf);
vm_fault_t handle_huge_page_missing_tag_storage(struct vm_fault *vmf);
diff --git a/arch/arm64/kernel/mte_tag_storage.c b/arch/arm64/kernel/mte_tag_storage.c
index a1cc239f7211..5096ce859136 100644
--- a/arch/arm64/kernel/mte_tag_storage.c
+++ b/arch/arm64/kernel/mte_tag_storage.c
@@ -500,6 +500,21 @@ bool page_tag_storage_reserved(struct page *page)
return test_bit(PG_tag_storage_reserved, &page->flags);
}
+bool page_is_tag_storage(struct page *page)
+{
+ unsigned long pfn = page_to_pfn(page);
+ struct range *tag_range;
+ int i;
+
+ for (i = 0; i < num_tag_regions; i++) {
+ tag_range = &tag_regions[i].tag_range;
+ if (tag_range->start <= pfn && pfn <= tag_range->end)
+ return true;
+ }
+
+ return false;
+}
+
int reserve_tag_storage(struct page *page, int order, gfp_t gfp)
{
unsigned long start_block, end_block;
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index 6730a0812a24..964c5ae161a3 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -12,6 +12,7 @@
#include <linux/extable.h>
#include <linux/kfence.h>
#include <linux/signal.h>
+#include <linux/migrate.h>
#include <linux/mm.h>
#include <linux/hardirq.h>
#include <linux/init.h>
@@ -956,6 +957,50 @@ void tag_clear_highpage(struct page *page)
}
#ifdef CONFIG_ARM64_MTE_TAG_STORAGE
+
+#define MR_TAGGED_TAG_STORAGE MR_ARCH_1
+
+extern bool isolate_lru_page(struct page *page);
+extern void putback_movable_pages(struct list_head *l);
+
+/* Returns with the page reference dropped. */
+static void migrate_tag_storage_page(struct page *page)
+{
+ struct migration_target_control mtc = {
+ .nid = NUMA_NO_NODE,
+ .gfp_mask = GFP_HIGHUSER_MOVABLE | __GFP_TAGGED,
+ };
+ unsigned long i, nr_pages = compound_nr(page);
+ LIST_HEAD(pagelist);
+ int ret, tries;
+
+ lru_cache_disable();
+
+ for (i = 0; i < nr_pages; i++) {
+ if (!isolate_lru_page(page + i)) {
+ ret = -EAGAIN;
+ goto out;
+ }
+ /* Isolate just grabbed another reference, drop ours. */
+ put_page(page + i);
+ list_add_tail(&(page + i)->lru, &pagelist);
+ }
+
+ tries = 5;
+ while (tries--) {
+ ret = migrate_pages(&pagelist, alloc_migration_target, NULL, (unsigned long)&mtc,
+ MIGRATE_SYNC, MR_TAGGED_TAG_STORAGE, NULL);
+ if (ret == 0 || ret != -EBUSY)
+ break;
+ }
+
+out:
+ if (ret != 0)
+ putback_movable_pages(&pagelist);
+
+ lru_cache_enable();
+}
+
vm_fault_t handle_page_missing_tag_storage(struct vm_fault *vmf)
{
struct vm_area_struct *vma = vmf->vma;
@@ -1013,6 +1058,11 @@ vm_fault_t handle_page_missing_tag_storage(struct vm_fault *vmf)
if (unlikely(is_migrate_isolate_page(page)))
goto out_retry;
+ if (unlikely(page_is_tag_storage(page))) {
+ migrate_tag_storage_page(page);
+ return 0;
+ }
+
ret = reserve_tag_storage(page, 0, GFP_HIGHUSER_MOVABLE);
if (ret)
goto out_retry;
@@ -1098,6 +1148,11 @@ vm_fault_t handle_huge_page_missing_tag_storage(struct vm_fault *vmf)
if (unlikely(is_migrate_isolate_page(page)))
goto out_retry;
+ if (unlikely(page_is_tag_storage(page))) {
+ migrate_tag_storage_page(page);
+ return 0;
+ }
+
ret = reserve_tag_storage(page, HPAGE_PMD_ORDER, GFP_HIGHUSER_MOVABLE);
if (ret)
goto out_retry;
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 0acef592043c..afca42ace735 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -10,8 +10,6 @@
typedef struct folio *new_folio_t(struct folio *folio, unsigned long private);
typedef void free_folio_t(struct folio *folio, unsigned long private);
-struct migration_target_control;
-
/*
* Return values from addresss_space_operations.migratepage():
* - negative errno on page migration failure;
@@ -57,6 +55,12 @@ struct movable_operations {
void (*putback_page)(struct page *);
};
+struct migration_target_control {
+ int nid; /* preferred node id */
+ nodemask_t *nmask;
+ gfp_t gfp_mask;
+};
+
/* Defined in mm/debug.c: */
extern const char *migrate_reason_names[MR_TYPES];
diff --git a/include/linux/migrate_mode.h b/include/linux/migrate_mode.h
index f37cc03f9369..c6c5c7726d26 100644
--- a/include/linux/migrate_mode.h
+++ b/include/linux/migrate_mode.h
@@ -29,6 +29,7 @@ enum migrate_reason {
MR_CONTIG_RANGE,
MR_LONGTERM_PIN,
MR_DEMOTION,
+ MR_ARCH_1,
MR_TYPES
};
diff --git a/mm/internal.h b/mm/internal.h
index ddf6bb6c6308..96fff5dfc041 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -949,12 +949,6 @@ static inline bool is_migrate_highatomic_page(struct page *page)
void setup_zone_pageset(struct zone *zone);
-struct migration_target_control {
- int nid; /* preferred node id */
- nodemask_t *nmask;
- gfp_t gfp_mask;
-};
-
/*
* mm/filemap.c
*/
--
2.42.1
A double-digit performance decrease for Chrome startup time has been
reported with dynamic tag storage management enabled. A large part of the
regression is caused by lru_cache_disable(), called from
__alloc_contig_migrate_range(), which IPIs all CPUs in the system.
Improve performance by taking the tag storage block directly off the free
list when it is free, thus sidestepping the costly call to
alloc_contig_range().
Note that at the moment this is implemented only when the block size is
1 (the block is one page); larger block sizes could be added later if
necessary.
Signed-off-by: Alexandru Elisei <[email protected]>
---
arch/arm64/Kconfig | 1 +
arch/arm64/kernel/mte_tag_storage.c | 15 +++++++++++++++
include/linux/page-flags.h | 15 +++++++++++++--
mm/Kconfig | 4 ++++
mm/memory-failure.c | 8 ++++----
mm/page_alloc.c | 21 ++++++++++++---------
6 files changed, 49 insertions(+), 15 deletions(-)
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 3b9c435eaafb..93a4bbca3800 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -2067,6 +2067,7 @@ config ARM64_MTE_TAG_STORAGE
bool "Dynamic MTE tag storage management"
depends on ARCH_KEEP_MEMBLOCK
select ARCH_HAS_FAULT_ON_ACCESS
+ select WANTS_TAKE_PAGE_OFF_BUDDY
select CONFIG_CMA
help
Adds support for dynamic management of the memory used by the hardware
diff --git a/arch/arm64/kernel/mte_tag_storage.c b/arch/arm64/kernel/mte_tag_storage.c
index 602fdc70db1c..11961587382d 100644
--- a/arch/arm64/kernel/mte_tag_storage.c
+++ b/arch/arm64/kernel/mte_tag_storage.c
@@ -522,6 +522,7 @@ int reserve_tag_storage(struct page *page, int order, gfp_t gfp)
unsigned long block;
unsigned long flags;
unsigned int tries;
+ bool success;
int ret = 0;
VM_WARN_ON_ONCE(!preemptible());
@@ -565,6 +566,19 @@ int reserve_tag_storage(struct page *page, int order, gfp_t gfp)
if (tag_storage_block_is_reserved(block))
continue;
+ if (region->block_size == 1 && is_free_buddy_page(pfn_to_page(block))) {
+ success = take_page_off_buddy(pfn_to_page(block), false);
+ if (success) {
+ ret = tag_storage_reserve_block(block, region, order);
+ if (ret) {
+ put_page_back_buddy(pfn_to_page(block), false);
+ goto out_error;
+ }
+ page_ref_inc(pfn_to_page(block));
+ goto success_next;
+ }
+ }
+
tries = 3;
while (tries--) {
ret = alloc_contig_range(block, block + region->block_size, MIGRATE_CMA, gfp);
@@ -598,6 +612,7 @@ int reserve_tag_storage(struct page *page, int order, gfp_t gfp)
goto out_error;
}
+success_next:
count_vm_events(CMA_ALLOC_SUCCESS, region->block_size);
}
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 7915165a51bd..0d0380141f5d 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -576,11 +576,22 @@ TESTSCFLAG(HWPoison, hwpoison, PF_ANY)
#define MAGIC_HWPOISON 0x48575053U /* HWPS */
extern void SetPageHWPoisonTakenOff(struct page *page);
extern void ClearPageHWPoisonTakenOff(struct page *page);
-extern bool take_page_off_buddy(struct page *page);
-extern bool put_page_back_buddy(struct page *page);
+extern bool PageHWPoisonTakenOff(struct page *page);
#else
PAGEFLAG_FALSE(HWPoison, hwpoison)
+TESTSCFLAG_FALSE(HWPoison, hwpoison)
#define __PG_HWPOISON 0
+static inline void SetPageHWPoisonTakenOff(struct page *page) { }
+static inline void ClearPageHWPoisonTakenOff(struct page *page) { }
+static inline bool PageHWPoisonTakenOff(struct page *page)
+{
+ return false;
+}
+#endif
+
+#ifdef CONFIG_WANTS_TAKE_PAGE_OFF_BUDDY
+extern bool take_page_off_buddy(struct page *page, bool poison);
+extern bool put_page_back_buddy(struct page *page, bool unpoison);
#endif
#if defined(CONFIG_PAGE_IDLE_FLAG) && defined(CONFIG_64BIT)
diff --git a/mm/Kconfig b/mm/Kconfig
index a90eefc3ee80..0766cdc3de4d 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -773,6 +773,7 @@ config MEMORY_FAILURE
depends on MMU
depends on ARCH_SUPPORTS_MEMORY_FAILURE
bool "Enable recovery from hardware memory errors"
+ select WANTS_TAKE_PAGE_OFF_BUDDY
select MEMORY_ISOLATION
select RAS
help
@@ -1022,6 +1023,9 @@ config ARCH_HAS_CACHE_LINE_SIZE
config ARCH_HAS_FAULT_ON_ACCESS
bool
+config WANTS_TAKE_PAGE_OFF_BUDDY
+ bool
+
config ARCH_HAS_CURRENT_STACK_POINTER
bool
help
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 660c21859118..8b44afd6a558 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -157,7 +157,7 @@ static int __page_handle_poison(struct page *page)
zone_pcp_disable(page_zone(page));
ret = dissolve_free_huge_page(page);
if (!ret)
- ret = take_page_off_buddy(page);
+ ret = take_page_off_buddy(page, true);
zone_pcp_enable(page_zone(page));
return ret;
@@ -1348,7 +1348,7 @@ static int page_action(struct page_state *ps, struct page *p,
return action_result(pfn, ps->type, result);
}
-static inline bool PageHWPoisonTakenOff(struct page *page)
+bool PageHWPoisonTakenOff(struct page *page)
{
return PageHWPoison(page) && page_private(page) == MAGIC_HWPOISON;
}
@@ -2236,7 +2236,7 @@ int memory_failure(unsigned long pfn, int flags)
res = get_hwpoison_page(p, flags);
if (!res) {
if (is_free_buddy_page(p)) {
- if (take_page_off_buddy(p)) {
+ if (take_page_off_buddy(p, true)) {
page_ref_inc(p);
res = MF_RECOVERED;
} else {
@@ -2567,7 +2567,7 @@ int unpoison_memory(unsigned long pfn)
ret = folio_test_clear_hwpoison(folio) ? 0 : -EBUSY;
} else if (ghp < 0) {
if (ghp == -EHWPOISON) {
- ret = put_page_back_buddy(p) ? 0 : -EBUSY;
+ ret = put_page_back_buddy(p, true) ? 0 : -EBUSY;
} else {
ret = ghp;
unpoison_pr_info("Unpoison: failed to grab page %#lx\n",
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 135f9283a863..4b74acfc41a6 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6700,7 +6700,7 @@ bool is_free_buddy_page(struct page *page)
}
EXPORT_SYMBOL(is_free_buddy_page);
-#ifdef CONFIG_MEMORY_FAILURE
+#ifdef CONFIG_WANTS_TAKE_PAGE_OFF_BUDDY
/*
* Break down a higher-order page in sub-pages, and keep our target out of
* buddy allocator.
@@ -6730,11 +6730,10 @@ static void break_down_buddy_pages(struct zone *zone, struct page *page,
set_buddy_order(current_buddy, high);
}
}
-
/*
- * Take a page that will be marked as poisoned off the buddy allocator.
+ * Take a page off the buddy allocator, and optionally mark it as poisoned.
*/
-bool take_page_off_buddy(struct page *page)
+bool take_page_off_buddy(struct page *page, bool poison)
{
struct zone *zone = page_zone(page);
unsigned long pfn = page_to_pfn(page);
@@ -6755,7 +6754,8 @@ bool take_page_off_buddy(struct page *page)
del_page_from_free_list(page_head, zone, page_order);
break_down_buddy_pages(zone, page_head, page, 0,
page_order, migratetype);
- SetPageHWPoisonTakenOff(page);
+ if (poison)
+ SetPageHWPoisonTakenOff(page);
if (!is_migrate_isolate(migratetype))
__mod_zone_freepage_state(zone, -1, migratetype);
ret = true;
@@ -6769,9 +6769,10 @@ bool take_page_off_buddy(struct page *page)
}
/*
- * Cancel takeoff done by take_page_off_buddy().
+ * Cancel takeoff done by take_page_off_buddy(), and optionally unpoison the
+ * page.
*/
-bool put_page_back_buddy(struct page *page)
+bool put_page_back_buddy(struct page *page, bool unpoison)
{
struct zone *zone = page_zone(page);
unsigned long pfn = page_to_pfn(page);
@@ -6781,9 +6782,11 @@ bool put_page_back_buddy(struct page *page)
spin_lock_irqsave(&zone->lock, flags);
if (put_page_testzero(page)) {
- ClearPageHWPoisonTakenOff(page);
+ VM_WARN_ON_ONCE(PageHWPoisonTakenOff(page) && !unpoison);
+ if (unpoison)
+ ClearPageHWPoisonTakenOff(page);
__free_one_page(page, pfn, zone, 0, migratetype, FPI_NONE);
- if (TestClearPageHWPoison(page)) {
+ if (!unpoison || (unpoison && TestClearPageHWPoison(page))) {
ret = true;
}
}
--
2.42.1
There are several situations where copy_highpage() can end up copying
tags to a page which doesn't have its tag storage reserved.
One situation involves migration racing with mprotect(PROT_MTE): VMA is
initially untagged, migration starts and destination page is allocated
as untagged, mprotect(PROT_MTE) changes the VMA to tagged and userspace
accesses the source page, thus making it tagged. The migration code
then calls copy_highpage(), which will copy the tags from the source
page (now tagged) to the destination page (allocated as untagged).
Yet another situation can happen during THP collapse. The huge page that
will replace the HPAGE_PMD_NR contiguously mapped pages is allocated
without __GFP_TAGGED set. copy_highpage() will copy the tags from the pages
being replaced to the huge page, which doesn't have tag storage reserved.
The situation gets even more complicated when the replacement huge page
is a tag storage page. The tag storage huge page will be migrated after
a fault on access, but the tags from the original pages must be copied
over to the huge page that will be replacing the tag storage huge page.
Signed-off-by: Alexandru Elisei <[email protected]>
---
arch/arm64/mm/copypage.c | 59 ++++++++++++++++++++++++++++++++++++++++
1 file changed, 59 insertions(+)
diff --git a/arch/arm64/mm/copypage.c b/arch/arm64/mm/copypage.c
index a7bb20055ce0..7899f38773b9 100644
--- a/arch/arm64/mm/copypage.c
+++ b/arch/arm64/mm/copypage.c
@@ -13,6 +13,62 @@
#include <asm/cacheflush.h>
#include <asm/cpufeature.h>
#include <asm/mte.h>
+#include <asm/mte_tag_storage.h>
+
+#ifdef CONFIG_ARM64_MTE_TAG_STORAGE
+static inline bool try_transfer_saved_tags(struct page *from, struct page *to)
+{
+ void *tags;
+ bool saved;
+
+ VM_WARN_ON_ONCE(!preemptible());
+
+ if (page_mte_tagged(from)) {
+ if (likely(page_tag_storage_reserved(to)))
+ return false;
+
+ tags = mte_allocate_tag_buf();
+ if (WARN_ON(!tags))
+ return true;
+
+ mte_copy_page_tags_to_buf(page_address(from), tags);
+ saved = mte_save_tags_for_pfn(tags, page_to_pfn(to));
+ if (!saved)
+ mte_free_tag_buf(tags);
+
+ return saved;
+ }
+
+ if (likely(!page_is_tag_storage(from)))
+ return false;
+
+ tags_by_pfn_lock();
+ tags = mte_erase_tags_for_pfn(page_to_pfn(from));
+ tags_by_pfn_unlock();
+
+ if (likely(!tags))
+ return false;
+
+ if (page_tag_storage_reserved(to)) {
+ WARN_ON_ONCE(!try_page_mte_tagging(to));
+ mte_copy_page_tags_from_buf(page_address(to), tags);
+ set_page_mte_tagged(to);
+ mte_free_tag_buf(tags);
+ return true;
+ }
+
+ saved = mte_save_tags_for_pfn(tags, page_to_pfn(to));
+ if (!saved)
+ mte_free_tag_buf(tags);
+
+ return saved;
+}
+#else
+static inline bool try_transfer_saved_tags(struct page *from, struct page *to)
+{
+ return false;
+}
+#endif
void copy_highpage(struct page *to, struct page *from)
{
@@ -24,6 +80,9 @@ void copy_highpage(struct page *to, struct page *from)
if (kasan_hw_tags_enabled())
page_kasan_tag_reset(to);
+ if (tag_storage_enabled() && try_transfer_saved_tags(from, to))
+ return;
+
if (system_supports_mte() && page_mte_tagged(from)) {
/* It's a new page, shouldn't have been tagged yet */
WARN_ON_ONCE(!try_page_mte_tagging(to));
--
2.42.1
Everything is in place, enable tag storage management.
Signed-off-by: Alexandru Elisei <[email protected]>
---
arch/arm64/kernel/mte_tag_storage.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/arch/arm64/kernel/mte_tag_storage.c b/arch/arm64/kernel/mte_tag_storage.c
index 11961587382d..9f60e952a814 100644
--- a/arch/arm64/kernel/mte_tag_storage.c
+++ b/arch/arm64/kernel/mte_tag_storage.c
@@ -395,6 +395,9 @@ static int __init mte_tag_storage_activate_regions(void)
reserve_tag_storage(ZERO_PAGE(0), 0, GFP_HIGHUSER_MOVABLE);
+ static_branch_enable(&tag_storage_enabled_key);
+ pr_info("MTE tag storage region management enabled");
+
return 0;
out_disabled:
--
2.42.1
On Sun, Nov 19, 2023 at 8:59 AM Alexandru Elisei
<[email protected]> wrote:
>
> Handle PAGE_FAULT_ON_ACCESS faults for huge pages in a similar way to
> regular pages.
>
> Signed-off-by: Alexandru Elisei <[email protected]>
> ---
> arch/arm64/include/asm/mte_tag_storage.h | 1 +
> arch/arm64/include/asm/pgtable.h | 7 ++
> arch/arm64/mm/fault.c | 81 ++++++++++++++++++++++++
> include/linux/huge_mm.h | 2 +
> include/linux/pgtable.h | 5 ++
> mm/huge_memory.c | 4 +-
> mm/memory.c | 3 +
> 7 files changed, 101 insertions(+), 2 deletions(-)
>
> diff --git a/arch/arm64/include/asm/mte_tag_storage.h b/arch/arm64/include/asm/mte_tag_storage.h
> index c70ced60a0cd..b97406d369ce 100644
> --- a/arch/arm64/include/asm/mte_tag_storage.h
> +++ b/arch/arm64/include/asm/mte_tag_storage.h
> @@ -35,6 +35,7 @@ void free_tag_storage(struct page *page, int order);
> bool page_tag_storage_reserved(struct page *page);
>
> vm_fault_t handle_page_missing_tag_storage(struct vm_fault *vmf);
> +vm_fault_t handle_huge_page_missing_tag_storage(struct vm_fault *vmf);
> #else
> static inline bool tag_storage_enabled(void)
> {
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index 8cc135f1c112..1704411c096d 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -477,6 +477,13 @@ static inline vm_fault_t arch_do_page_fault_on_access(struct vm_fault *vmf)
> return handle_page_missing_tag_storage(vmf);
> return VM_FAULT_SIGBUS;
> }
> +
> +static inline vm_fault_t arch_do_huge_page_fault_on_access(struct vm_fault *vmf)
> +{
> + if (tag_storage_enabled())
> + return handle_huge_page_missing_tag_storage(vmf);
> + return VM_FAULT_SIGBUS;
> +}
> #endif /* CONFIG_ARCH_HAS_FAULT_ON_ACCESS */
>
> #define pmd_present_invalid(pmd) (!!(pmd_val(pmd) & PMD_PRESENT_INVALID))
> diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
> index f5fa583acf18..6730a0812a24 100644
> --- a/arch/arm64/mm/fault.c
> +++ b/arch/arm64/mm/fault.c
> @@ -1041,6 +1041,87 @@ vm_fault_t handle_page_missing_tag_storage(struct vm_fault *vmf)
>
> return 0;
>
> +out_retry:
> + put_page(page);
> + if (vmf->flags & FAULT_FLAG_VMA_LOCK)
> + vma_end_read(vma);
> + if (fault_flag_allow_retry_first(vmf->flags)) {
> + err = VM_FAULT_RETRY;
> + } else {
> + /* Replay the fault. */
> + err = 0;
> + }
> + return err;
> +}
> +
> +vm_fault_t handle_huge_page_missing_tag_storage(struct vm_fault *vmf)
> +{
> + unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
> + struct vm_area_struct *vma = vmf->vma;
> + pmd_t old_pmd, new_pmd;
> + bool writable = false;
> + struct page *page;
> + vm_fault_t err;
> + int ret;
> +
> + vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
> + if (unlikely(!pmd_same(vmf->orig_pmd, *vmf->pmd))) {
> + spin_unlock(vmf->ptl);
> + return 0;
> + }
> +
> + old_pmd = vmf->orig_pmd;
> + new_pmd = pmd_modify(old_pmd, vma->vm_page_prot);
> +
> + /*
> + * Detect now whether the PMD could be writable; this information
> + * is only valid while holding the PT lock.
> + */
> + writable = pmd_write(new_pmd);
> + if (!writable && vma_wants_manual_pte_write_upgrade(vma) &&
> + can_change_pmd_writable(vma, vmf->address, new_pmd))
> + writable = true;
> +
> + page = vm_normal_page_pmd(vma, haddr, new_pmd);
> + if (!page)
> + goto out_map;
> +
> + if (!(vma->vm_flags & VM_MTE))
> + goto out_map;
> +
> + get_page(page);
> + vma_set_access_pid_bit(vma);
> +
> + spin_unlock(vmf->ptl);
> + writable = false;
> +
> + if (unlikely(is_migrate_isolate_page(page)))
> + goto out_retry;
> +
> + ret = reserve_tag_storage(page, HPAGE_PMD_ORDER, GFP_HIGHUSER_MOVABLE);
> + if (ret)
> + goto out_retry;
> +
> + put_page(page);
> +
> + vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
> + if (unlikely(!pmd_same(old_pmd, *vmf->pmd))) {
> + spin_unlock(vmf->ptl);
> + return 0;
> + }
> +
> +out_map:
> + /* Restore the PMD */
> + new_pmd = pmd_modify(old_pmd, vma->vm_page_prot);
> + new_pmd = pmd_mkyoung(new_pmd);
> + if (writable)
> + new_pmd = pmd_mkwrite(new_pmd, vma);
> + set_pmd_at(vma->vm_mm, haddr, vmf->pmd, new_pmd);
> + update_mmu_cache_pmd(vma, vmf->address, vmf->pmd);
> + spin_unlock(vmf->ptl);
> +
> + return 0;
> +
> out_retry:
> put_page(page);
> if (vmf->flags & FAULT_FLAG_VMA_LOCK)
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index fa0350b0812a..bb84291f9231 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -36,6 +36,8 @@ bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,
> int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
> pmd_t *pmd, unsigned long addr, pgprot_t newprot,
> unsigned long cp_flags);
> +bool can_change_pmd_writable(struct vm_area_struct *vma, unsigned long addr,
> + pmd_t pmd);
>
> vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, bool write);
> vm_fault_t vmf_insert_pfn_pud(struct vm_fault *vmf, pfn_t pfn, bool write);
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index e2c761dd6c41..de45f475bf8d 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -1473,6 +1473,11 @@ static inline vm_fault_t arch_do_page_fault_on_access(struct vm_fault *vmf)
> {
> return VM_FAULT_SIGBUS;
> }
> +
> +static inline vm_fault_t arch_do_huge_page_fault_on_access(struct vm_fault *vmf)
> +{
> + return VM_FAULT_SIGBUS;
> +}
> #endif
>
> #endif /* CONFIG_MMU */
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 9beead961a65..d1402b43ea39 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1406,8 +1406,8 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf)
> return VM_FAULT_FALLBACK;
> }
>
> -static inline bool can_change_pmd_writable(struct vm_area_struct *vma,
> - unsigned long addr, pmd_t pmd)
> +inline bool can_change_pmd_writable(struct vm_area_struct *vma,
Remove inline keyword here.
Peter
> + unsigned long addr, pmd_t pmd)
> {
> struct page *page;
>
> diff --git a/mm/memory.c b/mm/memory.c
> index a04a971200b9..46b926625503 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -5168,6 +5168,9 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
> return 0;
> }
> if (pmd_trans_huge(vmf.orig_pmd) || pmd_devmap(vmf.orig_pmd)) {
> + if (fault_on_access_pmd(vmf.orig_pmd) && vma_is_accessible(vma))
> + return arch_do_huge_page_fault_on_access(&vmf);
> +
> if (pmd_protnone(vmf.orig_pmd) && vma_is_accessible(vma))
> return do_huge_pmd_numa_page(&vmf);
>
> --
> 2.42.1
>
Hi Peter,
On Tue, Nov 21, 2023 at 05:28:49PM -0800, Peter Collingbourne wrote:
> On Sun, Nov 19, 2023 at 8:59 AM Alexandru Elisei
> <[email protected]> wrote:
> >
> > Handle PAGE_FAULT_ON_ACCESS faults for huge pages in a similar way to
> > regular pages.
> >
> > Signed-off-by: Alexandru Elisei <[email protected]>
> > ---
> > arch/arm64/include/asm/mte_tag_storage.h | 1 +
> > arch/arm64/include/asm/pgtable.h | 7 ++
> > arch/arm64/mm/fault.c | 81 ++++++++++++++++++++++++
> > include/linux/huge_mm.h | 2 +
> > include/linux/pgtable.h | 5 ++
> > mm/huge_memory.c | 4 +-
> > mm/memory.c | 3 +
> > 7 files changed, 101 insertions(+), 2 deletions(-)
> >
> > diff --git a/arch/arm64/include/asm/mte_tag_storage.h b/arch/arm64/include/asm/mte_tag_storage.h
> > index c70ced60a0cd..b97406d369ce 100644
> > --- a/arch/arm64/include/asm/mte_tag_storage.h
> > +++ b/arch/arm64/include/asm/mte_tag_storage.h
> > @@ -35,6 +35,7 @@ void free_tag_storage(struct page *page, int order);
> > bool page_tag_storage_reserved(struct page *page);
> >
> > vm_fault_t handle_page_missing_tag_storage(struct vm_fault *vmf);
> > +vm_fault_t handle_huge_page_missing_tag_storage(struct vm_fault *vmf);
> > #else
> > static inline bool tag_storage_enabled(void)
> > {
> > diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> > index 8cc135f1c112..1704411c096d 100644
> > --- a/arch/arm64/include/asm/pgtable.h
> > +++ b/arch/arm64/include/asm/pgtable.h
> > @@ -477,6 +477,13 @@ static inline vm_fault_t arch_do_page_fault_on_access(struct vm_fault *vmf)
> > return handle_page_missing_tag_storage(vmf);
> > return VM_FAULT_SIGBUS;
> > }
> > +
> > +static inline vm_fault_t arch_do_huge_page_fault_on_access(struct vm_fault *vmf)
> > +{
> > + if (tag_storage_enabled())
> > + return handle_huge_page_missing_tag_storage(vmf);
> > + return VM_FAULT_SIGBUS;
> > +}
> > #endif /* CONFIG_ARCH_HAS_FAULT_ON_ACCESS */
> >
> > #define pmd_present_invalid(pmd) (!!(pmd_val(pmd) & PMD_PRESENT_INVALID))
> > diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
> > index f5fa583acf18..6730a0812a24 100644
> > --- a/arch/arm64/mm/fault.c
> > +++ b/arch/arm64/mm/fault.c
> > @@ -1041,6 +1041,87 @@ vm_fault_t handle_page_missing_tag_storage(struct vm_fault *vmf)
> >
> > return 0;
> >
> > +out_retry:
> > + put_page(page);
> > + if (vmf->flags & FAULT_FLAG_VMA_LOCK)
> > + vma_end_read(vma);
> > + if (fault_flag_allow_retry_first(vmf->flags)) {
> > + err = VM_FAULT_RETRY;
> > + } else {
> > + /* Replay the fault. */
> > + err = 0;
> > + }
> > + return err;
> > +}
> > +
> > +vm_fault_t handle_huge_page_missing_tag_storage(struct vm_fault *vmf)
> > +{
> > + unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
> > + struct vm_area_struct *vma = vmf->vma;
> > + pmd_t old_pmd, new_pmd;
> > + bool writable = false;
> > + struct page *page;
> > + vm_fault_t err;
> > + int ret;
> > +
> > + vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
> > + if (unlikely(!pmd_same(vmf->orig_pmd, *vmf->pmd))) {
> > + spin_unlock(vmf->ptl);
> > + return 0;
> > + }
> > +
> > + old_pmd = vmf->orig_pmd;
> > + new_pmd = pmd_modify(old_pmd, vma->vm_page_prot);
> > +
> > + /*
> > + * Detect now whether the PMD could be writable; this information
> > + * is only valid while holding the PT lock.
> > + */
> > + writable = pmd_write(new_pmd);
> > + if (!writable && vma_wants_manual_pte_write_upgrade(vma) &&
> > + can_change_pmd_writable(vma, vmf->address, new_pmd))
> > + writable = true;
> > +
> > + page = vm_normal_page_pmd(vma, haddr, new_pmd);
> > + if (!page)
> > + goto out_map;
> > +
> > + if (!(vma->vm_flags & VM_MTE))
> > + goto out_map;
> > +
> > + get_page(page);
> > + vma_set_access_pid_bit(vma);
> > +
> > + spin_unlock(vmf->ptl);
> > + writable = false;
> > +
> > + if (unlikely(is_migrate_isolate_page(page)))
> > + goto out_retry;
> > +
> > + ret = reserve_tag_storage(page, HPAGE_PMD_ORDER, GFP_HIGHUSER_MOVABLE);
> > + if (ret)
> > + goto out_retry;
> > +
> > + put_page(page);
> > +
> > + vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
> > + if (unlikely(!pmd_same(old_pmd, *vmf->pmd))) {
> > + spin_unlock(vmf->ptl);
> > + return 0;
> > + }
> > +
> > +out_map:
> > + /* Restore the PMD */
> > + new_pmd = pmd_modify(old_pmd, vma->vm_page_prot);
> > + new_pmd = pmd_mkyoung(new_pmd);
> > + if (writable)
> > + new_pmd = pmd_mkwrite(new_pmd, vma);
> > + set_pmd_at(vma->vm_mm, haddr, vmf->pmd, new_pmd);
> > + update_mmu_cache_pmd(vma, vmf->address, vmf->pmd);
> > + spin_unlock(vmf->ptl);
> > +
> > + return 0;
> > +
> > out_retry:
> > put_page(page);
> > if (vmf->flags & FAULT_FLAG_VMA_LOCK)
> > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> > index fa0350b0812a..bb84291f9231 100644
> > --- a/include/linux/huge_mm.h
> > +++ b/include/linux/huge_mm.h
> > @@ -36,6 +36,8 @@ bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,
> > int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
> > pmd_t *pmd, unsigned long addr, pgprot_t newprot,
> > unsigned long cp_flags);
> > +bool can_change_pmd_writable(struct vm_area_struct *vma, unsigned long addr,
> > + pmd_t pmd);
> >
> > vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, bool write);
> > vm_fault_t vmf_insert_pfn_pud(struct vm_fault *vmf, pfn_t pfn, bool write);
> > diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> > index e2c761dd6c41..de45f475bf8d 100644
> > --- a/include/linux/pgtable.h
> > +++ b/include/linux/pgtable.h
> > @@ -1473,6 +1473,11 @@ static inline vm_fault_t arch_do_page_fault_on_access(struct vm_fault *vmf)
> > {
> > return VM_FAULT_SIGBUS;
> > }
> > +
> > +static inline vm_fault_t arch_do_huge_page_fault_on_access(struct vm_fault *vmf)
> > +{
> > + return VM_FAULT_SIGBUS;
> > +}
> > #endif
> >
> > #endif /* CONFIG_MMU */
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index 9beead961a65..d1402b43ea39 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -1406,8 +1406,8 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf)
> > return VM_FAULT_FALLBACK;
> > }
> >
> > -static inline bool can_change_pmd_writable(struct vm_area_struct *vma,
> > - unsigned long addr, pmd_t pmd)
> > +inline bool can_change_pmd_writable(struct vm_area_struct *vma,
>
> Remove inline keyword here.
Indeed, as it does nothing now that the function is not static.
Thanks,
Alex
>
> Peter
>
> > + unsigned long addr, pmd_t pmd)
> > {
> > struct page *page;
> >
> > diff --git a/mm/memory.c b/mm/memory.c
> > index a04a971200b9..46b926625503 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -5168,6 +5168,9 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
> > return 0;
> > }
> > if (pmd_trans_huge(vmf.orig_pmd) || pmd_devmap(vmf.orig_pmd)) {
> > + if (fault_on_access_pmd(vmf.orig_pmd) && vma_is_accessible(vma))
> > + return arch_do_huge_page_fault_on_access(&vmf);
> > +
> > if (pmd_protnone(vmf.orig_pmd) && vma_is_accessible(vma))
> > return do_huge_pmd_numa_page(&vmf);
> >
> > --
> > 2.42.1
> >
On 19.11.23 17:56, Alexandru Elisei wrote:
> Introduce arch_prep_new_page(), which will be used by arm64 to reserve tag
> storage for an allocated page. Reserving tag storage can fail, for example,
> if the tag storage page has a short pin on it, so allow prep_new_page() ->
> arch_prep_new_page() to similarly fail.
But what are the side-effects of this? How does the calling code recover?
E.g., what if we need to populate a page into user space, but that
particular page we allocated fails to be prepared? So we inject a signal
into that poor process?
--
Cheers,
David / dhildenb
On 19.11.23 17:57, Alexandru Elisei wrote:
> Add an arch_free_pages_prepare() hook that is called before the page flags
> are cleared. This will be used by arm64 when explicit management of tag
> storage pages is enabled.
Can you elaborate a bit what exactly will be done by that code with that
information?
--
Cheers,
David / dhildenb
On 19.11.23 17:57, Alexandru Elisei wrote:
> Add the MTE tag storage pages to the MIGRATE_CMA migratetype, which allows
> the page allocator to manage them like regular pages.
>
> The migratetype lends the pages some very desirable properties:
>
> * They cannot be longterm pinned, meaning they will always be migratable.
>
> * The pages can be allocated explicitly by using their PFN (with
> alloc_contig_range()) when they are needed to store tags.
>
> Signed-off-by: Alexandru Elisei <[email protected]>
> ---
> arch/arm64/Kconfig | 1 +
> arch/arm64/kernel/mte_tag_storage.c | 68 +++++++++++++++++++++++++++++
> include/linux/mmzone.h | 5 +++
> mm/internal.h | 3 --
> 4 files changed, 74 insertions(+), 3 deletions(-)
>
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index fe8276fdc7a8..047487046e8f 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -2065,6 +2065,7 @@ config ARM64_MTE
> if ARM64_MTE
> config ARM64_MTE_TAG_STORAGE
> bool "Dynamic MTE tag storage management"
> + select CONFIG_CMA
> help
> Adds support for dynamic management of the memory used by the hardware
> for storing MTE tags. This memory, unlike normal memory, cannot be
> diff --git a/arch/arm64/kernel/mte_tag_storage.c b/arch/arm64/kernel/mte_tag_storage.c
> index fa6267ef8392..427f4f1909f3 100644
> --- a/arch/arm64/kernel/mte_tag_storage.c
> +++ b/arch/arm64/kernel/mte_tag_storage.c
> @@ -5,10 +5,12 @@
> * Copyright (C) 2023 ARM Ltd.
> */
>
> +#include <linux/cma.h>
> #include <linux/memblock.h>
> #include <linux/mm.h>
> #include <linux/of_device.h>
> #include <linux/of_fdt.h>
> +#include <linux/pageblock-flags.h>
> #include <linux/range.h>
> #include <linux/string.h>
> #include <linux/xarray.h>
> @@ -189,6 +191,14 @@ static int __init fdt_init_tag_storage(unsigned long node, const char *uname,
> return ret;
> }
>
> + /* Pages are managed in pageblock_nr_pages chunks */
> + if (!IS_ALIGNED(tag_range->start | range_len(tag_range), pageblock_nr_pages)) {
> + pr_err("Tag storage region 0x%llx-0x%llx not aligned to pageblock size 0x%llx",
> + PFN_PHYS(tag_range->start), PFN_PHYS(tag_range->end),
> + PFN_PHYS(pageblock_nr_pages));
> + return -EINVAL;
> + }
> +
> ret = tag_storage_get_memory_node(node, &mem_node);
> if (ret)
> return ret;
> @@ -254,3 +264,61 @@ void __init mte_tag_storage_init(void)
> pr_info("MTE tag storage region management disabled");
> }
> }
> +
> +static int __init mte_tag_storage_activate_regions(void)
> +{
> + phys_addr_t dram_start, dram_end;
> + struct range *tag_range;
> + unsigned long pfn;
> + int i, ret;
> +
> + if (num_tag_regions == 0)
> + return 0;
> +
> + dram_start = memblock_start_of_DRAM();
> + dram_end = memblock_end_of_DRAM();
> +
> + for (i = 0; i < num_tag_regions; i++) {
> + tag_range = &tag_regions[i].tag_range;
> + /*
> + * Tag storage region was clipped by arm64_bootmem_init()
> + * enforcing addressing limits.
> + */
> + if (PFN_PHYS(tag_range->start) < dram_start ||
> + PFN_PHYS(tag_range->end) >= dram_end) {
> + pr_err("Tag storage region 0x%llx-0x%llx outside addressable memory",
> + PFN_PHYS(tag_range->start), PFN_PHYS(tag_range->end));
> + ret = -EINVAL;
> + goto out_disabled;
> + }
> + }
> +
> + /*
> + * MTE disabled, tag storage pages can be used like any other pages. The
> + * only restriction is that the pages cannot be used by kexec because
> + * the memory remains marked as reserved in the memblock allocator.
> + */
> + if (!system_supports_mte()) {
> + for (i = 0; i< num_tag_regions; i++) {
> + tag_range = &tag_regions[i].tag_range;
> + for (pfn = tag_range->start; pfn <= tag_range->end; pfn++)
> + free_reserved_page(pfn_to_page(pfn));
> + }
> + ret = 0;
> + goto out_disabled;
> + }
> +
> + for (i = 0; i < num_tag_regions; i++) {
> + tag_range = &tag_regions[i].tag_range;
> + for (pfn = tag_range->start; pfn <= tag_range->end; pfn += pageblock_nr_pages)
> + init_cma_reserved_pageblock(pfn_to_page(pfn));
> + totalcma_pages += range_len(tag_range);
> + }
You shouldn't be doing that manually in arm code. Likely you want some
cma.c helper for something like that.
But, can you elaborate on why you took this hacky (sorry) approach as
documented in the cover letter:
"The arm64 code manages this memory directly instead of using
cma_declare_contiguous/cma_alloc for performance reasons."
What is the exact problem?
--
Cheers,
David / dhildenb
On 19.11.23 17:57, Alexandru Elisei wrote:
> Tag storage memory requires that the tag storage pages used for data are
> always migratable when they need to be repurposed to store tags.
>
> If ARCH_KEEP_MEMBLOCK is enabled, kexec will scan all non-reserved
> memblocks to find a suitable location for copying the kernel image. The
> kernel image, once loaded, cannot be moved to another location in physical
> memory. The initialization code for the tag storage reserves the memblocks
> for the tag storage pages, which means kexec will not use them, and the tag
> storage pages can be migrated at any time, which is the desired behaviour.
>
> However, if ARCH_KEEP_MEMBLOCK is not selected, kexec will not skip a
> region unless the memory resource has the IORESOURCE_SYSRAM_DRIVER_MANAGED
> flag, which isn't currently set by the tag storage initialization code.
>
> Make ARM64_MTE_TAG_STORAGE depend on ARCH_KEEP_MEMBLOCK to make it explicit
> that the Kconfig option is required for it to work correctly.
>
> Signed-off-by: Alexandru Elisei <[email protected]>
> ---
> arch/arm64/Kconfig | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index 047487046e8f..efa5b7958169 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -2065,6 +2065,7 @@ config ARM64_MTE
> if ARM64_MTE
> config ARM64_MTE_TAG_STORAGE
> bool "Dynamic MTE tag storage management"
> + depends on ARCH_KEEP_MEMBLOCK
> select CONFIG_CMA
> help
> Adds support for dynamic management of the memory used by the hardware
Doesn't arm64 select that unconditionally? Why is this required then?
--
Cheers,
David / dhildenb
On 19.11.23 17:57, Alexandru Elisei wrote:
> Reserving the tag storage associated with a page requires that the tag
> storage page can be migrated.
>
> When HW KASAN is enabled, the kernel allocates pages, which are now tagged,
> in non-preemptible contexts, which can make reserving the associated tag
> storage impossible.
I assume that it's the only in-kernel user that actually requires tagged
memory (besides for user space), correct?
--
Cheers,
David / dhildenb
On 19.11.23 17:57, Alexandru Elisei wrote:
> alloc_contig_range() requires that the requested pages are in the same
> zone. Check that this is indeed the case before initializing the tag
> storage blocks.
>
> Signed-off-by: Alexandru Elisei <[email protected]>
> ---
> arch/arm64/kernel/mte_tag_storage.c | 33 +++++++++++++++++++++++++++++
> 1 file changed, 33 insertions(+)
>
> diff --git a/arch/arm64/kernel/mte_tag_storage.c b/arch/arm64/kernel/mte_tag_storage.c
> index 8b9bedf7575d..fd63430d4dc0 100644
> --- a/arch/arm64/kernel/mte_tag_storage.c
> +++ b/arch/arm64/kernel/mte_tag_storage.c
> @@ -265,6 +265,35 @@ void __init mte_tag_storage_init(void)
> }
> }
>
> +/* alloc_contig_range() requires all pages to be in the same zone. */
> +static int __init mte_tag_storage_check_zone(void)
> +{
> + struct range *tag_range;
> + struct zone *zone;
> + unsigned long pfn;
> + u32 block_size;
> + int i, j;
> +
> + for (i = 0; i < num_tag_regions; i++) {
> + block_size = tag_regions[i].block_size;
> + if (block_size == 1)
> + continue;
> +
> + tag_range = &tag_regions[i].tag_range;
> + for (pfn = tag_range->start; pfn <= tag_range->end; pfn += block_size) {
> + zone = page_zone(pfn_to_page(pfn));
> + for (j = 1; j < block_size; j++) {
> + if (page_zone(pfn_to_page(pfn + j)) != zone) {
> + pr_err("Tag storage block pages in different zones");
> + return -EINVAL;
> + }
> + }
> + }
> + }
> +
> + return 0;
> +}
> +
Looks like something that ordinary CMA provides. See cma_activate_area().
Can't we find a way to let CMA do CMA thingies and only be a user of
that? What would be required to make the performance issue you spelled
out in the cover letter be gone and not have to open-code that in arch code?
--
Cheers,
David / dhildenb
On Sun, Nov 19, 2023 at 04:56:58PM +0000, Alexandru Elisei wrote:
> It might be desirable for an architecture to modify the gfp flags used to
> allocate the destination page for migration based on the page that it is
> being replaced. For example, if an architectures has metadata associated
> with a page (like arm64, when the memory tagging extension is implemented),
> it can request that the destination page similarly has storage for tags
> already allocated.
>
> No functional change.
>
> Signed-off-by: Alexandru Elisei <[email protected]>
> ---
> include/linux/migrate.h | 4 ++++
> mm/mempolicy.c | 2 ++
> mm/migrate.c | 3 +++
> 3 files changed, 9 insertions(+)
>
> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> index 2ce13e8a309b..0acef592043c 100644
> --- a/include/linux/migrate.h
> +++ b/include/linux/migrate.h
> @@ -60,6 +60,10 @@ struct movable_operations {
> /* Defined in mm/debug.c: */
> extern const char *migrate_reason_names[MR_TYPES];
>
> +#ifndef arch_migration_target_gfp
> +#define arch_migration_target_gfp(src, gfp) 0
> +#endif
> +
> #ifdef CONFIG_MIGRATION
>
> void putback_movable_pages(struct list_head *l);
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 10a590ee1c89..50bc43ab50d6 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -1182,6 +1182,7 @@ static struct folio *alloc_migration_target_by_mpol(struct folio *src,
>
> h = folio_hstate(src);
> gfp = htlb_alloc_mask(h);
> + gfp |= arch_migration_target_gfp(src, gfp);
I think it'll be more robust to have arch_migration_target_gfp() modify
the flags and return the new mask with added (or potentially removed)
flags.
> nodemask = policy_nodemask(gfp, pol, ilx, &nid);
> return alloc_hugetlb_folio_nodemask(h, nid, nodemask, gfp);
> }
> @@ -1190,6 +1191,7 @@ static struct folio *alloc_migration_target_by_mpol(struct folio *src,
> gfp = GFP_TRANSHUGE;
> else
> gfp = GFP_HIGHUSER_MOVABLE | __GFP_RETRY_MAYFAIL | __GFP_COMP;
> + gfp |= arch_migration_target_gfp(src, gfp);
>
> page = alloc_pages_mpol(gfp, order, pol, ilx, nid);
> return page_rmappable_folio(page);
--
Sincerely yours,
Mike.
Hi Mike,
I really appreciate you having a look!
On Sat, Nov 25, 2023 at 12:03:22PM +0200, Mike Rapoport wrote:
> On Sun, Nov 19, 2023 at 04:56:58PM +0000, Alexandru Elisei wrote:
> > It might be desirable for an architecture to modify the gfp flags used to
> > allocate the destination page for migration based on the page that it is
> > being replaced. For example, if an architectures has metadata associated
> > with a page (like arm64, when the memory tagging extension is implemented),
> > it can request that the destination page similarly has storage for tags
> > already allocated.
> >
> > No functional change.
> >
> > Signed-off-by: Alexandru Elisei <[email protected]>
> > ---
> > include/linux/migrate.h | 4 ++++
> > mm/mempolicy.c | 2 ++
> > mm/migrate.c | 3 +++
> > 3 files changed, 9 insertions(+)
> >
> > diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> > index 2ce13e8a309b..0acef592043c 100644
> > --- a/include/linux/migrate.h
> > +++ b/include/linux/migrate.h
> > @@ -60,6 +60,10 @@ struct movable_operations {
> > /* Defined in mm/debug.c: */
> > extern const char *migrate_reason_names[MR_TYPES];
> >
> > +#ifndef arch_migration_target_gfp
> > +#define arch_migration_target_gfp(src, gfp) 0
> > +#endif
> > +
> > #ifdef CONFIG_MIGRATION
> >
> > void putback_movable_pages(struct list_head *l);
> > diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> > index 10a590ee1c89..50bc43ab50d6 100644
> > --- a/mm/mempolicy.c
> > +++ b/mm/mempolicy.c
> > @@ -1182,6 +1182,7 @@ static struct folio *alloc_migration_target_by_mpol(struct folio *src,
> >
> > h = folio_hstate(src);
> > gfp = htlb_alloc_mask(h);
> > + gfp |= arch_migration_target_gfp(src, gfp);
>
> I think it'll be more robust to have arch_migration_target_gfp() modify
> the flags and return the new mask with added (or potentially removed)
> flags.
I did it this way so an arch won't be able to remove flags set by the MM code.
There's a similar pattern in do_mmap() -> calc_vm_flag_bits() ->
arch_calc_vm_flag_bits().
I'll change it to return the new mask if you think that's better.
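For concreteness, that variant would look roughly like this (just a sketch
of the shape, not the final patch):

#ifndef arch_migration_target_gfp
#define arch_migration_target_gfp(src, gfp)	(gfp)
#endif

	/* e.g. in alloc_migration_target_by_mpol() */
	gfp = htlb_alloc_mask(h);
	gfp = arch_migration_target_gfp(src, gfp);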
Thanks,
Alex
>
> > nodemask = policy_nodemask(gfp, pol, ilx, &nid);
> > return alloc_hugetlb_folio_nodemask(h, nid, nodemask, gfp);
> > }
> > @@ -1190,6 +1191,7 @@ static struct folio *alloc_migration_target_by_mpol(struct folio *src,
> > gfp = GFP_TRANSHUGE;
> > else
> > gfp = GFP_HIGHUSER_MOVABLE | __GFP_RETRY_MAYFAIL | __GFP_COMP;
> > + gfp |= arch_migration_target_gfp(src, gfp);
> >
> > page = alloc_pages_mpol(gfp, order, pol, ilx, nid);
> > return page_rmappable_folio(page);
>
> --
> Sincerely yours,
> Mike.
>
Hi,
Thank you so much for your comments, they are genuinely useful.
On Fri, Nov 24, 2023 at 08:35:47PM +0100, David Hildenbrand wrote:
> On 19.11.23 17:56, Alexandru Elisei wrote:
> > Introduce arch_prep_new_page(), which will be used by arm64 to reserve tag
> > storage for an allocated page. Reserving tag storage can fail, for example,
> > if the tag storage page has a short pin on it, so allow prep_new_page() ->
> > arch_prep_new_page() to similarly fail.
>
> But what are the side-effects of this? How does the calling code recover?
>
> E.g., what if we need to populate a page into user space, but that
> particular page we allocated fails to be prepared? So we inject a signal
> into that poor process?
When the page fails to be prepared, it is put back to the tail of the
freelist with __free_one_page(.., FPI_TO_TAIL). If all the allocation paths
are exhausted and no page has been found for which tag storage has been
reserved, then that's treated like an OOM situation.
I have been thinking about this, and I think I can simplify the code by
making tag reservation a best effort approach. The page can be allocated
even if reserving tag storage fails, but the page is marked as invalid in
set_pte_at() (PAGE_NONE + an extra bit to tell arm64 that it needs tag
storage) and next time it is accessed, arm64 will reserve tag storage in
the fault handling code (the mechanism for that is implemented in patch #19
of the series, "mm: mprotect: Introduce PAGE_FAULT_ON_ACCESS for
mprotect(PROT_MTE)").
With this new approach, prep_new_page() stays the way it is, and no further
changes are required for the page allocator, as there are already arch
callbacks that can be used for that, for example tag_clear_highpage() and
arch_alloc_page(). The downside is extra page faults, which might impact
performance.
What do you think?
Thanks,
Alex
>
> --
> Cheers,
>
> David / dhildenb
>
Hi,
On Fri, Nov 24, 2023 at 08:36:52PM +0100, David Hildenbrand wrote:
> On 19.11.23 17:57, Alexandru Elisei wrote:
> > Add an arch_free_pages_prepare() hook that is called before the page flags
> > are cleared. This will be used by arm64 when explicit management of tag
> > storage pages is enabled.
>
> Can you elaborate a bit what exactly will be done by that code with that
> information?
Of course.
The MTE code that is in the kernel today uses the PG_arch_2 page flag, which it
renames to PG_mte_tagged, to track if a page has been mapped with tagging
enabled. That flag is cleared by free_pages_prepare() when it does:
page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
When tag storage management is enabled, tag storage is reserved for a page if
and only if the page is mapped as tagged. When a page is freed, the code looks
at the PG_mte_tagged flag to determine if the page was mapped as tagged, and
therefore has tag storage reserved, to determine if the corresponding tag
storage should also be freed.
I have considered using arch_free_page(), but free_pages_prepare() calls the
function after the flags are cleared.
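In other words, the hook ends up being roughly this (a sketch using the
helpers from this series; the exact signature may end up different):

void arch_free_pages_prepare(struct page *page, int order)
{
	if (tag_storage_enabled() && page_mte_tagged(page))
		free_tag_storage(page, order);
}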
Does that answer your question?
Alex
>
> --
> Cheers,
>
> David / dhildenb
>
Hi David,
On Fri, Nov 24, 2023 at 08:40:55PM +0100, David Hildenbrand wrote:
> On 19.11.23 17:57, Alexandru Elisei wrote:
> > Add the MTE tag storage pages to the MIGRATE_CMA migratetype, which allows
> > the page allocator to manage them like regular pages.
> >
> > The migratetype lends the pages some very desirable properties:
> >
> > * They cannot be longterm pinned, meaning they will always be migratable.
> >
> > * The pages can be allocated explicitly by using their PFN (with
> > alloc_contig_range()) when they are needed to store tags.
> >
> > Signed-off-by: Alexandru Elisei <[email protected]>
> > ---
> > arch/arm64/Kconfig | 1 +
> > arch/arm64/kernel/mte_tag_storage.c | 68 +++++++++++++++++++++++++++++
> > include/linux/mmzone.h | 5 +++
> > mm/internal.h | 3 --
> > 4 files changed, 74 insertions(+), 3 deletions(-)
> >
> > diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> > index fe8276fdc7a8..047487046e8f 100644
> > --- a/arch/arm64/Kconfig
> > +++ b/arch/arm64/Kconfig
> > @@ -2065,6 +2065,7 @@ config ARM64_MTE
> > if ARM64_MTE
> > config ARM64_MTE_TAG_STORAGE
> > bool "Dynamic MTE tag storage management"
> > + select CONFIG_CMA
> > help
> > Adds support for dynamic management of the memory used by the hardware
> > for storing MTE tags. This memory, unlike normal memory, cannot be
> > diff --git a/arch/arm64/kernel/mte_tag_storage.c b/arch/arm64/kernel/mte_tag_storage.c
> > index fa6267ef8392..427f4f1909f3 100644
> > --- a/arch/arm64/kernel/mte_tag_storage.c
> > +++ b/arch/arm64/kernel/mte_tag_storage.c
> > @@ -5,10 +5,12 @@
> > * Copyright (C) 2023 ARM Ltd.
> > */
> > +#include <linux/cma.h>
> > #include <linux/memblock.h>
> > #include <linux/mm.h>
> > #include <linux/of_device.h>
> > #include <linux/of_fdt.h>
> > +#include <linux/pageblock-flags.h>
> > #include <linux/range.h>
> > #include <linux/string.h>
> > #include <linux/xarray.h>
> > @@ -189,6 +191,14 @@ static int __init fdt_init_tag_storage(unsigned long node, const char *uname,
> > return ret;
> > }
> > + /* Pages are managed in pageblock_nr_pages chunks */
> > + if (!IS_ALIGNED(tag_range->start | range_len(tag_range), pageblock_nr_pages)) {
> > + pr_err("Tag storage region 0x%llx-0x%llx not aligned to pageblock size 0x%llx",
> > + PFN_PHYS(tag_range->start), PFN_PHYS(tag_range->end),
> > + PFN_PHYS(pageblock_nr_pages));
> > + return -EINVAL;
> > + }
> > +
> > ret = tag_storage_get_memory_node(node, &mem_node);
> > if (ret)
> > return ret;
> > @@ -254,3 +264,61 @@ void __init mte_tag_storage_init(void)
> > pr_info("MTE tag storage region management disabled");
> > }
> > }
> > +
> > +static int __init mte_tag_storage_activate_regions(void)
> > +{
> > + phys_addr_t dram_start, dram_end;
> > + struct range *tag_range;
> > + unsigned long pfn;
> > + int i, ret;
> > +
> > + if (num_tag_regions == 0)
> > + return 0;
> > +
> > + dram_start = memblock_start_of_DRAM();
> > + dram_end = memblock_end_of_DRAM();
> > +
> > + for (i = 0; i < num_tag_regions; i++) {
> > + tag_range = &tag_regions[i].tag_range;
> > + /*
> > + * Tag storage region was clipped by arm64_bootmem_init()
> > + * enforcing addressing limits.
> > + */
> > + if (PFN_PHYS(tag_range->start) < dram_start ||
> > + PFN_PHYS(tag_range->end) >= dram_end) {
> > + pr_err("Tag storage region 0x%llx-0x%llx outside addressable memory",
> > + PFN_PHYS(tag_range->start), PFN_PHYS(tag_range->end));
> > + ret = -EINVAL;
> > + goto out_disabled;
> > + }
> > + }
> > +
> > + /*
> > + * MTE disabled, tag storage pages can be used like any other pages. The
> > + * only restriction is that the pages cannot be used by kexec because
> > + * the memory remains marked as reserved in the memblock allocator.
> > + */
> > + if (!system_supports_mte()) {
> > + for (i = 0; i< num_tag_regions; i++) {
> > + tag_range = &tag_regions[i].tag_range;
> > + for (pfn = tag_range->start; pfn <= tag_range->end; pfn++)
> > + free_reserved_page(pfn_to_page(pfn));
> > + }
> > + ret = 0;
> > + goto out_disabled;
> > + }
> > +
> > + for (i = 0; i < num_tag_regions; i++) {
> > + tag_range = &tag_regions[i].tag_range;
> > + for (pfn = tag_range->start; pfn <= tag_range->end; pfn += pageblock_nr_pages)
> > + init_cma_reserved_pageblock(pfn_to_page(pfn));
> > + totalcma_pages += range_len(tag_range);
> > + }
>
> You shouldn't be doing that manually in arm code. Likely you want some cma.c
> helper for something like that.
If you are referring to the last loop (the one that does
init_cma_reserved_pageblock()), indeed, there's already a function which
does that, cma_init_reserved_areas() -> cma_activate_area().
>
> But, can you elaborate on why you took this hacky (sorry) approach as
> documented in the cover letter:
No worries, it is indeed a bit hacky :)
>
> "The arm64 code manages this memory directly instead of using
> cma_declare_contiguous/cma_alloc for performance reasons."
>
> What is the exact problem?
I am referring to the performance degradation that is fixed in patch #26,
"arm64: mte: Fast track reserving tag storage when the block is free" [1].
The issue is that alloc_contig_range() -> __alloc_contig_migrate_range()
calls lru_cache_disable(), which IPIs all the CPUs in the system, and that
leads to a 10-20% performance degradation on Chrome. It has been observed
that most of the time the tag storage pages are free, and the
lru_cache_disable() calls are unnecessary.
The performance degradation is almost entirely eliminated by having the code
take the tag storage page directly from the free list if it's free, instead
of calling alloc_contig_range().
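For reference, the fast path is conceptually something like the sketch below
(illustrative only, not the actual patch from [1]; take_block_off_free_list()
is a made-up helper, while PageBuddy()/buddy_order() are the existing mm
helpers, the latter being mm-internal):

static int reserve_tag_block(unsigned long block_pfn, unsigned int block_order)
{
	struct page *block_page = pfn_to_page(block_pfn);
	struct zone *zone = page_zone(block_page);
	unsigned long flags;
	bool taken = false;

	/* Fast path: the whole block is a free buddy page, so grab it
	 * directly and avoid lru_cache_disable() and its IPIs entirely. */
	spin_lock_irqsave(&zone->lock, flags);
	if (PageBuddy(block_page) && buddy_order(block_page) >= block_order)
		taken = take_block_off_free_list(block_page, block_order);
	spin_unlock_irqrestore(&zone->lock, flags);
	if (taken)
		return 0;

	/* Slow path: migrate whatever is currently using the block. */
	return alloc_contig_range(block_pfn, block_pfn + (1UL << block_order),
				  MIGRATE_CMA, GFP_KERNEL);
}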
Do you believe it would be better to use the cma code, and modify it to use
this fast path to take the page directly from the buddy allocator?
I can definitely try to integrate the code with cma_alloc(), but I think
keeping the fast path for reserving tag storage is extremely desirable,
since it makes such a huge difference to performance.
[1] https://lore.kernel.org/linux-trace-kernel/[email protected]/
Thanks,
Alex
>
> --
> Cheers,
>
> David / dhildenb
>
>
Hi,
On Fri, Nov 24, 2023 at 08:51:38PM +0100, David Hildenbrand wrote:
> On 19.11.23 17:57, Alexandru Elisei wrote:
> > Tag storage memory requires that the tag storage pages used for data are
> > always migratable when they need to be repurposed to store tags.
> >
> > If ARCH_KEEP_MEMBLOCK is enabled, kexec will scan all non-reserved
> > memblocks to find a suitable location for copying the kernel image. The
> > kernel image, once loaded, cannot be moved to another location in physical
> > memory. The initialization code for the tag storage reserves the memblocks
> > for the tag storage pages, which means kexec will not use them, and the tag
> > storage pages can be migrated at any time, which is the desired behaviour.
> >
> > However, if ARCH_KEEP_MEMBLOCK is not selected, kexec will not skip a
> > region unless the memory resource has the IORESOURCE_SYSRAM_DRIVER_MANAGED
> > flag, which isn't currently set by the tag storage initialization code.
> >
> > Make ARM64_MTE_TAG_STORAGE depend on ARCH_KEEP_MEMBLOCK to make it explicit
> > that the Kconfig option is required for it to work correctly.
> >
> > Signed-off-by: Alexandru Elisei <[email protected]>
> > ---
> > arch/arm64/Kconfig | 1 +
> > 1 file changed, 1 insertion(+)
> >
> > diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> > index 047487046e8f..efa5b7958169 100644
> > --- a/arch/arm64/Kconfig
> > +++ b/arch/arm64/Kconfig
> > @@ -2065,6 +2065,7 @@ config ARM64_MTE
> > if ARM64_MTE
> > config ARM64_MTE_TAG_STORAGE
> > bool "Dynamic MTE tag storage management"
> > + depends on ARCH_KEEP_MEMBLOCK
> > select CONFIG_CMA
> > help
> > Adds support for dynamic management of the memory used by the hardware
>
> Doesn't arm64 select that unconditionally? Why is this required then?
I've added this patch to make the dependency explicit. If, in the future, arm64
stops selecting ARCH_KEEP_MEMBLOCK, I think it would be very easy to miss the
fact that tag storage depends on it. So this patch is not required per se, it's
there to document the dependency.
Thanks,
Alex
>
> --
> Cheers,
>
> David / dhildenb
>
Hi,
On Fri, Nov 24, 2023 at 08:54:12PM +0100, David Hildenbrand wrote:
> On 19.11.23 17:57, Alexandru Elisei wrote:
> > To be able to reserve the tag storage associated with a page requires that
> > the tag storage page can be migrated.
> >
> > When HW KASAN is enabled, the kernel allocates pages, which are now tagged,
> > in non-preemptible contexts, which can make reserving the associated tag
> > storage impossible.
>
> I assume that it's the only in-kernel user that actually requires tagged
> memory (besides for user space), correct?
Indeed, this is the case. I'll expand the commit message to be more clear about
it.
Thanks,
Alex
>
> --
> Cheers,
>
> David / dhildenb
>
Hi,
On Fri, Nov 24, 2023 at 08:56:59PM +0100, David Hildenbrand wrote:
> On 19.11.23 17:57, Alexandru Elisei wrote:
> > alloc_contig_range() requires that the requested pages are in the same
> > zone. Check that this is indeed the case before initializing the tag
> > storage blocks.
> >
> > Signed-off-by: Alexandru Elisei <[email protected]>
> > ---
> > arch/arm64/kernel/mte_tag_storage.c | 33 +++++++++++++++++++++++++++++
> > 1 file changed, 33 insertions(+)
> >
> > diff --git a/arch/arm64/kernel/mte_tag_storage.c b/arch/arm64/kernel/mte_tag_storage.c
> > index 8b9bedf7575d..fd63430d4dc0 100644
> > --- a/arch/arm64/kernel/mte_tag_storage.c
> > +++ b/arch/arm64/kernel/mte_tag_storage.c
> > @@ -265,6 +265,35 @@ void __init mte_tag_storage_init(void)
> > }
> > }
> > +/* alloc_contig_range() requires all pages to be in the same zone. */
> > +static int __init mte_tag_storage_check_zone(void)
> > +{
> > + struct range *tag_range;
> > + struct zone *zone;
> > + unsigned long pfn;
> > + u32 block_size;
> > + int i, j;
> > +
> > + for (i = 0; i < num_tag_regions; i++) {
> > + block_size = tag_regions[i].block_size;
> > + if (block_size == 1)
> > + continue;
> > +
> > + tag_range = &tag_regions[i].tag_range;
> > + for (pfn = tag_range->start; pfn <= tag_range->end; pfn += block_size) {
> > + zone = page_zone(pfn_to_page(pfn));
> > + for (j = 1; j < block_size; j++) {
> > + if (page_zone(pfn_to_page(pfn + j)) != zone) {
> > + pr_err("Tag storage block pages in different zones");
> > + return -EINVAL;
> > + }
> > + }
> > + }
> > + }
> > +
> > + return 0;
> > +}
> > +
>
> Looks like something that ordinary CMA provides. See cma_activate_area().
Indeed.
>
> Can't we find a way to let CMA do CMA thingies and only be a user of that?
> What would be required to make the performance issue you spelled out in the
> cover letter be gone and not have to open-code that in arch code?
I've replied with a possible solution here [1].
[1] https://lore.kernel.org/all/ZWSvMYMjFLFZ-abv@raptor/
Thanks,
Alex
>
> --
> Cheers,
>
> David / dhildenb
>
Hi Alexandru,
On Sun, Nov 19, 2023 at 8:59 AM Alexandru Elisei
<[email protected]> wrote:
>
> Signed-off-by: Alexandru Elisei <[email protected]>
> ---
> arch/arm64/include/asm/mte_tag_storage.h | 1 +
> arch/arm64/kernel/mte_tag_storage.c | 15 +++++++
> arch/arm64/mm/fault.c | 55 ++++++++++++++++++++++++
> include/linux/migrate.h | 8 +++-
> include/linux/migrate_mode.h | 1 +
> mm/internal.h | 6 ---
> 6 files changed, 78 insertions(+), 8 deletions(-)
>
> diff --git a/arch/arm64/include/asm/mte_tag_storage.h b/arch/arm64/include/asm/mte_tag_storage.h
> index b97406d369ce..6a8b19a6a758 100644
> --- a/arch/arm64/include/asm/mte_tag_storage.h
> +++ b/arch/arm64/include/asm/mte_tag_storage.h
> @@ -33,6 +33,7 @@ int reserve_tag_storage(struct page *page, int order, gfp_t gfp);
> void free_tag_storage(struct page *page, int order);
>
> bool page_tag_storage_reserved(struct page *page);
> +bool page_is_tag_storage(struct page *page);
>
> vm_fault_t handle_page_missing_tag_storage(struct vm_fault *vmf);
> vm_fault_t handle_huge_page_missing_tag_storage(struct vm_fault *vmf);
> diff --git a/arch/arm64/kernel/mte_tag_storage.c b/arch/arm64/kernel/mte_tag_storage.c
> index a1cc239f7211..5096ce859136 100644
> --- a/arch/arm64/kernel/mte_tag_storage.c
> +++ b/arch/arm64/kernel/mte_tag_storage.c
> @@ -500,6 +500,21 @@ bool page_tag_storage_reserved(struct page *page)
> return test_bit(PG_tag_storage_reserved, &page->flags);
> }
>
> +bool page_is_tag_storage(struct page *page)
> +{
> + unsigned long pfn = page_to_pfn(page);
> + struct range *tag_range;
> + int i;
> +
> + for (i = 0; i < num_tag_regions; i++) {
> + tag_range = &tag_regions[i].tag_range;
> + if (tag_range->start <= pfn && pfn <= tag_range->end)
> + return true;
> + }
> +
> + return false;
> +}
> +
> int reserve_tag_storage(struct page *page, int order, gfp_t gfp)
> {
> unsigned long start_block, end_block;
> diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
> index 6730a0812a24..964c5ae161a3 100644
> --- a/arch/arm64/mm/fault.c
> +++ b/arch/arm64/mm/fault.c
> @@ -12,6 +12,7 @@
> #include <linux/extable.h>
> #include <linux/kfence.h>
> #include <linux/signal.h>
> +#include <linux/migrate.h>
> #include <linux/mm.h>
> #include <linux/hardirq.h>
> #include <linux/init.h>
> @@ -956,6 +957,50 @@ void tag_clear_highpage(struct page *page)
> }
>
> #ifdef CONFIG_ARM64_MTE_TAG_STORAGE
> +
> +#define MR_TAGGED_TAG_STORAGE MR_ARCH_1
> +
> +extern bool isolate_lru_page(struct page *page);
> +extern void putback_movable_pages(struct list_head *l);
Could we move these declarations to a non-mm-internal header and
#include it instead of manually declaring them here?
> +
> +/* Returns with the page reference dropped. */
> +static void migrate_tag_storage_page(struct page *page)
> +{
> + struct migration_target_control mtc = {
> + .nid = NUMA_NO_NODE,
> + .gfp_mask = GFP_HIGHUSER_MOVABLE | __GFP_TAGGED,
> + };
> + unsigned long i, nr_pages = compound_nr(page);
> + LIST_HEAD(pagelist);
> + int ret, tries;
> +
> + lru_cache_disable();
> +
> + for (i = 0; i < nr_pages; i++) {
> + if (!isolate_lru_page(page + i)) {
> + ret = -EAGAIN;
> + goto out;
> + }
> + /* Isolate just grabbed another reference, drop ours. */
> + put_page(page + i);
> + list_add_tail(&(page + i)->lru, &pagelist);
> + }
> +
> + tries = 5;
> + while (tries--) {
> + ret = migrate_pages(&pagelist, alloc_migration_target, NULL, (unsigned long)&mtc,
> + MIGRATE_SYNC, MR_TAGGED_TAG_STORAGE, NULL);
> + if (ret == 0 || ret != -EBUSY)
This could be simplified to:
if (ret != -EBUSY)
Peter
On Mon, Nov 27, 2023 at 11:52:56AM +0000, Alexandru Elisei wrote:
> Hi Mike,
>
> I really appreciate you having a look!
>
> On Sat, Nov 25, 2023 at 12:03:22PM +0200, Mike Rapoport wrote:
> > On Sun, Nov 19, 2023 at 04:56:58PM +0000, Alexandru Elisei wrote:
> > > It might be desirable for an architecture to modify the gfp flags used to
> > > allocate the destination page for migration based on the page that is
> > > being replaced. For example, if an architecture has metadata associated
> > > with a page (like arm64, when the memory tagging extension is implemented),
> > > it can request that the destination page similarly has storage for tags
> > > already allocated.
> > >
> > > No functional change.
> > >
> > > Signed-off-by: Alexandru Elisei <[email protected]>
> > > ---
> > > include/linux/migrate.h | 4 ++++
> > > mm/mempolicy.c | 2 ++
> > > mm/migrate.c | 3 +++
> > > 3 files changed, 9 insertions(+)
> > >
> > > diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> > > index 2ce13e8a309b..0acef592043c 100644
> > > --- a/include/linux/migrate.h
> > > +++ b/include/linux/migrate.h
> > > @@ -60,6 +60,10 @@ struct movable_operations {
> > > /* Defined in mm/debug.c: */
> > > extern const char *migrate_reason_names[MR_TYPES];
> > >
> > > +#ifndef arch_migration_target_gfp
> > > +#define arch_migration_target_gfp(src, gfp) 0
> > > +#endif
> > > +
> > > #ifdef CONFIG_MIGRATION
> > >
> > > void putback_movable_pages(struct list_head *l);
> > > diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> > > index 10a590ee1c89..50bc43ab50d6 100644
> > > --- a/mm/mempolicy.c
> > > +++ b/mm/mempolicy.c
> > > @@ -1182,6 +1182,7 @@ static struct folio *alloc_migration_target_by_mpol(struct folio *src,
> > >
> > > h = folio_hstate(src);
> > > gfp = htlb_alloc_mask(h);
> > > + gfp |= arch_migration_target_gfp(src, gfp);
> >
> > I think it'll be more robust to have arch_migration_target_gfp() to modify
> > the flags and return the new mask with added (or potentially removed)
> > flags.
>
> I did it this way so an arch won't be able to remove flags set by the MM code.
> There's a similar pattern in do_mmap() -> calc_vm_flag_bits() ->
> arch_calc_vm_flag_bits().
Ok, just add a sentence about it to the commit message.
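To make the pattern concrete, the end result would look roughly like this
(only the generic fallback define and the "gfp |=" call sites are from the
patch; the arm64 override below is a sketch with illustrative names, and
__GFP_TAGGED is the flag introduced elsewhere in this series):

/* include/linux/migrate.h (from the patch): arch adds no extra flags. */
#ifndef arch_migration_target_gfp
#define arch_migration_target_gfp(src, gfp)	0
#endif

/* Hypothetical arm64 header: request a destination page with tag storage
 * reserved when the source folio is tagged. Because the callers do
 * gfp |= arch_migration_target_gfp(src, gfp), the arch can only add flags,
 * never remove the ones set by the MM code. */
#define arch_migration_target_gfp arch_migration_target_gfp
static inline gfp_t arch_migration_target_gfp(struct folio *src, gfp_t gfp)
{
	if (page_mte_tagged(&src->page))
		return __GFP_TAGGED;
	return 0;
}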
> Thanks,
> Alex
>
> >
> > > nodemask = policy_nodemask(gfp, pol, ilx, &nid);
> > > return alloc_hugetlb_folio_nodemask(h, nid, nodemask, gfp);
> > > }
> > > @@ -1190,6 +1191,7 @@ static struct folio *alloc_migration_target_by_mpol(struct folio *src,
> > > gfp = GFP_TRANSHUGE;
> > > else
> > > gfp = GFP_HIGHUSER_MOVABLE | __GFP_RETRY_MAYFAIL | __GFP_COMP;
> > > + gfp |= arch_migration_target_gfp(src, gfp);
> > >
> > > page = alloc_pages_mpol(gfp, order, pol, ilx, nid);
> > > return page_rmappable_folio(page);
> >
> > --
> > Sincerely yours,
> > Mike.
> >
--
Sincerely yours,
Mike.
On 27.11.23 13:09, Alexandru Elisei wrote:
> Hi,
>
> Thank you so much for your comments, they are genuinely useful.
>
> On Fri, Nov 24, 2023 at 08:35:47PM +0100, David Hildenbrand wrote:
>> On 19.11.23 17:56, Alexandru Elisei wrote:
>>> Introduce arch_prep_new_page(), which will be used by arm64 to reserve tag
>>> storage for an allocated page. Reserving tag storage can fail, for example,
>>> if the tag storage page has a short pin on it, so allow prep_new_page() ->
>>> arch_prep_new_page() to similarly fail.
>>
>> But what are the side-effects of this? How does the calling code recover?
>>
>> E.g., what if we need to populate a page into user space, but that
>> particular page we allocated fails to be prepared? So we inject a signal
>> into that poor process?
>
> When the page fails to be prepared, it is put back to the tail of the
> freelist with __free_one_page(.., FPI_TO_TAIL). If all the allocation paths
> are exhausted and no page has been found for which tag storage has been
> reserved, then that's treated like an OOM situation.
>
> I have been thinking about this, and I think I can simplify the code by
> making tag reservation a best effort approach. The page can be allocated
> even if reserving tag storage fails, but the page is marked as invalid in
> set_pte_at() (PAGE_NONE + an extra bit to tell arm64 that it needs tag
> storage) and next time it is accessed, arm64 will reserve tag storage in
> the fault handling code (the mechanism for that is implemented in patch #19
> of the series, "mm: mprotect: Introduce PAGE_FAULT_ON_ACCESS for
> mprotect(PROT_MTE)").
>
> With this new approach, prep_new_page() stays the way it is, and no further
> changes are required for the page allocator, as there are already arch
> callbacks that can be used for that, for example tag_clear_highpage() and
> arch_alloc_page(). The downside is extra page faults, which might impact
> performance.
>
> What do you think?
That sounds a lot more robust, compared to intermittent failures to
allocate pages.
--
Cheers,
David / dhildenb
On 27.11.23 14:03, Alexandru Elisei wrote:
> Hi,
>
> On Fri, Nov 24, 2023 at 08:36:52PM +0100, David Hildenbrand wrote:
>> On 19.11.23 17:57, Alexandru Elisei wrote:
>>> Add arch_free_pages_prepare() hook that is called before the page flags
>>> are cleared. This will be used by arm64 when explicit management of tag
>>> storage pages is enabled.
>>
>> Can you elaborate a bit what exactly will be done by that code with that
>> information?
>
> Of course.
>
> The MTE code that is in the kernel today uses the PG_arch_2 page flag, which it
> renames to PG_mte_tagged, to track if a page has been mapped with tagging
> enabled. That flag is cleared by free_pages_prepare() when it does:
>
> page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
>
> When tag storage management is enabled, tag storage is reserved for a page if
> and only if the page is mapped as tagged. When a page is freed, the code looks
> at the PG_mte_tagged flag to determine if the page was mapped as tagged, and
> therefore has tag storage reserved, so that the corresponding tag
> storage can also be freed.
>
> I have considered using arch_free_page(), but free_pages_prepare() calls the
> function after the flags are cleared.
>
> Does that answer your question?
Yes, please add some of that to the patch description!
--
Cheers,
David / dhildenb
On 27.11.23 16:01, Alexandru Elisei wrote:
> Hi David,
>
> On Fri, Nov 24, 2023 at 08:40:55PM +0100, David Hildenbrand wrote:
>> On 19.11.23 17:57, Alexandru Elisei wrote:
>>> Add the MTE tag storage pages to the MIGRATE_CMA migratetype, which allows
>>> the page allocator to manage them like regular pages.
>>>
>>> This migratetype lends the pages some very desirable properties:
>>>
>>> * They cannot be longterm pinned, meaning they will always be migratable.
>>>
>>> * The pages can be allocated explicitly by using their PFN (with
>>> alloc_contig_range()) when they are needed to store tags.
>>>
>>> Signed-off-by: Alexandru Elisei <[email protected]>
>>> ---
>>> arch/arm64/Kconfig | 1 +
>>> arch/arm64/kernel/mte_tag_storage.c | 68 +++++++++++++++++++++++++++++
>>> include/linux/mmzone.h | 5 +++
>>> mm/internal.h | 3 --
>>> 4 files changed, 74 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
>>> index fe8276fdc7a8..047487046e8f 100644
>>> --- a/arch/arm64/Kconfig
>>> +++ b/arch/arm64/Kconfig
>>> @@ -2065,6 +2065,7 @@ config ARM64_MTE
>>> if ARM64_MTE
>>> config ARM64_MTE_TAG_STORAGE
>>> bool "Dynamic MTE tag storage management"
>>> + select CONFIG_CMA
>>> help
>>> Adds support for dynamic management of the memory used by the hardware
>>> for storing MTE tags. This memory, unlike normal memory, cannot be
>>> diff --git a/arch/arm64/kernel/mte_tag_storage.c b/arch/arm64/kernel/mte_tag_storage.c
>>> index fa6267ef8392..427f4f1909f3 100644
>>> --- a/arch/arm64/kernel/mte_tag_storage.c
>>> +++ b/arch/arm64/kernel/mte_tag_storage.c
>>> @@ -5,10 +5,12 @@
>>> * Copyright (C) 2023 ARM Ltd.
>>> */
>>> +#include <linux/cma.h>
>>> #include <linux/memblock.h>
>>> #include <linux/mm.h>
>>> #include <linux/of_device.h>
>>> #include <linux/of_fdt.h>
>>> +#include <linux/pageblock-flags.h>
>>> #include <linux/range.h>
>>> #include <linux/string.h>
>>> #include <linux/xarray.h>
>>> @@ -189,6 +191,14 @@ static int __init fdt_init_tag_storage(unsigned long node, const char *uname,
>>> return ret;
>>> }
>>> + /* Pages are managed in pageblock_nr_pages chunks */
>>> + if (!IS_ALIGNED(tag_range->start | range_len(tag_range), pageblock_nr_pages)) {
>>> + pr_err("Tag storage region 0x%llx-0x%llx not aligned to pageblock size 0x%llx",
>>> + PFN_PHYS(tag_range->start), PFN_PHYS(tag_range->end),
>>> + PFN_PHYS(pageblock_nr_pages));
>>> + return -EINVAL;
>>> + }
>>> +
>>> ret = tag_storage_get_memory_node(node, &mem_node);
>>> if (ret)
>>> return ret;
>>> @@ -254,3 +264,61 @@ void __init mte_tag_storage_init(void)
>>> pr_info("MTE tag storage region management disabled");
>>> }
>>> }
>>> +
>>> +static int __init mte_tag_storage_activate_regions(void)
>>> +{
>>> + phys_addr_t dram_start, dram_end;
>>> + struct range *tag_range;
>>> + unsigned long pfn;
>>> + int i, ret;
>>> +
>>> + if (num_tag_regions == 0)
>>> + return 0;
>>> +
>>> + dram_start = memblock_start_of_DRAM();
>>> + dram_end = memblock_end_of_DRAM();
>>> +
>>> + for (i = 0; i < num_tag_regions; i++) {
>>> + tag_range = &tag_regions[i].tag_range;
>>> + /*
>>> + * Tag storage region was clipped by arm64_bootmem_init()
>>> + * enforcing addressing limits.
>>> + */
>>> + if (PFN_PHYS(tag_range->start) < dram_start ||
>>> + PFN_PHYS(tag_range->end) >= dram_end) {
>>> + pr_err("Tag storage region 0x%llx-0x%llx outside addressable memory",
>>> + PFN_PHYS(tag_range->start), PFN_PHYS(tag_range->end));
>>> + ret = -EINVAL;
>>> + goto out_disabled;
>>> + }
>>> + }
>>> +
>>> + /*
>>> + * MTE disabled, tag storage pages can be used like any other pages. The
>>> + * only restriction is that the pages cannot be used by kexec because
>>> + * the memory remains marked as reserved in the memblock allocator.
>>> + */
>>> + if (!system_supports_mte()) {
>>> + for (i = 0; i< num_tag_regions; i++) {
>>> + tag_range = &tag_regions[i].tag_range;
>>> + for (pfn = tag_range->start; pfn <= tag_range->end; pfn++)
>>> + free_reserved_page(pfn_to_page(pfn));
>>> + }
>>> + ret = 0;
>>> + goto out_disabled;
>>> + }
>>> +
>>> + for (i = 0; i < num_tag_regions; i++) {
>>> + tag_range = &tag_regions[i].tag_range;
>>> + for (pfn = tag_range->start; pfn <= tag_range->end; pfn += pageblock_nr_pages)
>>> + init_cma_reserved_pageblock(pfn_to_page(pfn));
>>> + totalcma_pages += range_len(tag_range);
>>> + }
>>
>> You shouldn't be doing that manually in arm code. Likely you want some cma.c
>> helper for something like that.
>
> > If you are referring to the last loop (the one that does
> > init_cma_reserved_pageblock()), indeed, there's already a function which
> does that, cma_init_reserved_areas() -> cma_activate_area().
>
>>
>> But, can you elaborate on why you took this hacky (sorry) approach as
>> documented in the cover letter:
>
> No worries, it is indeed a bit hacky :)
>
>>
>> "The arm64 code manages this memory directly instead of using
>> cma_declare_contiguous/cma_alloc for performance reasons."
>>
>> What is the exact problem?
>
> > I am referring to the performance degradation that is fixed in patch #26,
> "arm64: mte: Fast track reserving tag storage when the block is free" [1].
> The issue is that alloc_contig_range() -> __alloc_contig_migrate_range()
> calls lru_cache_disable(), which IPIs all the CPUs in the system, and that
> leads to a 10-20% performance degradation on Chrome. It has been observed
> that most of the time the tag storage pages are free, and the
> lru_cache_disable() calls are unnecessary.
This sounds like something eventually worth integrating into
CMA/alloc_contig_range(). Like, a fast path to check if we are only
allocating something small (e.g., falls within a single pageblock), and
if the page is free.
>
> The performance degradation is almost entirely eliminated by having the code
> take the tag storage page directly from the free list if it's free, instead
> of calling alloc_contig_range().
>
> Do you believe it would be better to use the cma code, and modify it to use
> > this fast path to take the page directly from the buddy allocator?
That sounds reasonable yes. Do you see any blockers for that?
>
> I can definitely try to integrate the code with cma_alloc(), but I think
> keeping the fast path for reserving tag storage is extremely desirable,
> since it makes such a huge difference to performance.
Yes, but let's try finding a way to optimize common code, to eventually
improve some CMA cases as well? :)
--
Cheers,
David / dhildenb
On 27.11.23 16:07, Alexandru Elisei wrote:
> Hi,
>
> On Fri, Nov 24, 2023 at 08:54:12PM +0100, David Hildenbrand wrote:
>> On 19.11.23 17:57, Alexandru Elisei wrote:
>>> To be able to reserve the tag storage associated with a page requires that
>>> the tag storage page can be migrated.
>>>
>>> When HW KASAN is enabled, the kernel allocates pages, which are now tagged,
>>> in non-preemptible contexts, which can make reserving the associated tag
>>> storage impossible.
>>
>> I assume that it's the only in-kernel user that actually requires tagged
>> memory (besides for user space), correct?
>
> Indeed, this is the case. I'll expand the commit message to be more clear about
> it.
>
Great, thanks!
--
Cheers,
David / dhildenb
On 27.11.23 16:04, Alexandru Elisei wrote:
> Hi,
>
> On Fri, Nov 24, 2023 at 08:51:38PM +0100, David Hildenbrand wrote:
>> On 19.11.23 17:57, Alexandru Elisei wrote:
>>> Tag storage memory requires that the tag storage pages used for data are
>>> always migratable when they need to be repurposed to store tags.
>>>
>>> If ARCH_KEEP_MEMBLOCK is enabled, kexec will scan all non-reserved
>>> memblocks to find a suitable location for copying the kernel image. The
>>> kernel image, once loaded, cannot be moved to another location in physical
>>> memory. The initialization code for the tag storage reserves the memblocks
>>> for the tag storage pages, which means kexec will not use them, and the tag
>>> storage pages can be migrated at any time, which is the desired behaviour.
>>>
>>> However, if ARCH_KEEP_MEMBLOCK is not selected, kexec will not skip a
>>> region unless the memory resource has the IORESOURCE_SYSRAM_DRIVER_MANAGED
>>> flag, which isn't currently set by the tag storage initialization code.
>>>
>>> Make ARM64_MTE_TAG_STORAGE depend on ARCH_KEEP_MEMBLOCK to make it explicit
>>> that the Kconfig option is required for it to work correctly.
>>>
>>> Signed-off-by: Alexandru Elisei <[email protected]>
>>> ---
>>> arch/arm64/Kconfig | 1 +
>>> 1 file changed, 1 insertion(+)
>>>
>>> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
>>> index 047487046e8f..efa5b7958169 100644
>>> --- a/arch/arm64/Kconfig
>>> +++ b/arch/arm64/Kconfig
>>> @@ -2065,6 +2065,7 @@ config ARM64_MTE
>>> if ARM64_MTE
>>> config ARM64_MTE_TAG_STORAGE
>>> bool "Dynamic MTE tag storage management"
>>> + depends on ARCH_KEEP_MEMBLOCK
>>> select CONFIG_CMA
>>> help
>>> Adds support for dynamic management of the memory used by the hardware
>>
>> Doesn't arm64 select that unconditionally? Why is this required then?
>
> I've added this patch to make the dependency explicit. If, in the future, arm64
> stops selecting ARCH_KEEP_MEMBLOCK, I think it would be very easy to miss the
> fact that tag storage depends on it. So this patch is not required per se, it's
> there to document the dependency.
I see. Could you add some static_assert / BUILD_BUG_ON instead?
I suspect there are plenty of other (undocumented) reasons why
ARCH_KEEP_MEMBLOCK has to be enabled for now, and none of them selects
ARCH_KEEP_MEMBLOCK, I suspect.
--
Cheers,
David / dhildenb
On 19.11.23 17:57, Alexandru Elisei wrote:
> On arm64, the zero page receives special treatment by having the tagged
> flag set on MTE initialization, not when the page is mapped in a process
> address space. Reserve the corresponding tag block when tag storage
> management is being activated.
Out of curiosity: why does the shared zeropage require tagged storage?
What about the huge zeropage?
--
Cheers,
David / dhildenb
Hi,
On Tue, Nov 28, 2023 at 05:57:31PM +0100, David Hildenbrand wrote:
> On 27.11.23 13:09, Alexandru Elisei wrote:
> > Hi,
> >
> > Thank you so much for your comments, they are genuinely useful.
> >
> > On Fri, Nov 24, 2023 at 08:35:47PM +0100, David Hildenbrand wrote:
> > > On 19.11.23 17:56, Alexandru Elisei wrote:
> > > > Introduce arch_prep_new_page(), which will be used by arm64 to reserve tag
> > > > storage for an allocated page. Reserving tag storage can fail, for example,
> > > > if the tag storage page has a short pin on it, so allow prep_new_page() ->
> > > > arch_prep_new_page() to similarly fail.
> > >
> > > But what are the side-effects of this? How does the calling code recover?
> > >
> > > E.g., what if we need to populate a page into user space, but that
> > > particular page we allocated fails to be prepared? So we inject a signal
> > > into that poor process?
> >
> > When the page fails to be prepared, it is put back to the tail of the
> > freelist with __free_one_page(.., FPI_TO_TAIL). If all the allocation paths
> > are exhausted and no page has been found for which tag storage has been
> > reserved, then that's treated like an OOM situation.
> >
> > I have been thinking about this, and I think I can simplify the code by
> > making tag reservation a best effort approach. The page can be allocated
> > even if reserving tag storage fails, but the page is marked as invalid in
> > set_pte_at() (PAGE_NONE + an extra bit to tell arm64 that it needs tag
> > storage) and next time it is accessed, arm64 will reserve tag storage in
> > the fault handling code (the mechanism for that is implemented in patch #19
> > of the series, "mm: mprotect: Introduce PAGE_FAULT_ON_ACCESS for
> > mprotect(PROT_MTE)").
> >
> > With this new approach, prep_new_page() stays the way it is, and no further
> > changes are required for the page allocator, as there are already arch
> > callbacks that can be used for that, for example tag_clear_highpage() and
> > arch_alloc_page(). The downside is extra page faults, which might impact
> > performance.
> >
> > What do you think?
>
> That sounds a lot more robust, compared to intermittent failures to allocate
> pages.
Great, thank you for the feedback, I will use this approach for the next
iteration of the series.
Thanks,
Alex
>
> --
> Cheers,
>
> David / dhildenb
>
Hi,
On Tue, Nov 28, 2023 at 05:58:55PM +0100, David Hildenbrand wrote:
> On 27.11.23 14:03, Alexandru Elisei wrote:
> > Hi,
> >
> > On Fri, Nov 24, 2023 at 08:36:52PM +0100, David Hildenbrand wrote:
> > > On 19.11.23 17:57, Alexandru Elisei wrote:
> > > > Add arch_free_pages_prepare() hook that is called before the page flags
> > > > are cleared. This will be used by arm64 when explicit management of tag
> > > > storage pages is enabled.
> > >
> > > Can you elaborate a bit what exactly will be done by that code with that
> > > information?
> >
> > Of course.
> >
> > The MTE code that is in the kernel today uses the PG_arch_2 page flag, which it
> > renames to PG_mte_tagged, to track if a page has been mapped with tagging
> > enabled. That flag is cleared by free_pages_prepare() when it does:
> >
> > page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
> >
> > When tag storage management is enabled, tag storage is reserved for a page if
> > and only if the page is mapped as tagged. When a page is freed, the code looks
> > at the PG_mte_tagged flag to determine if the page was mapped as tagged, and
> > therefore has tag storage reserved, so that the corresponding tag
> > storage can also be freed.
> >
> > I have considered using arch_free_page(), but free_pages_prepare() calls the
> > function after the flags are cleared.
> >
> > Does that answer your question?
>
> Yes, please add some of that to the patch description!
Will do!
Thanks,
Alex
>
> --
> Cheers,
>
> David / dhildenb
>
Hi,
On Tue, Nov 28, 2023 at 08:49:57AM +0200, Mike Rapoport wrote:
> On Mon, Nov 27, 2023 at 11:52:56AM +0000, Alexandru Elisei wrote:
> > Hi Mike,
> >
> > I really appreciate you having a look!
> >
> > On Sat, Nov 25, 2023 at 12:03:22PM +0200, Mike Rapoport wrote:
> > > On Sun, Nov 19, 2023 at 04:56:58PM +0000, Alexandru Elisei wrote:
> > > > It might be desirable for an architecture to modify the gfp flags used to
> > > > allocate the destination page for migration based on the page that is
> > > > being replaced. For example, if an architecture has metadata associated
> > > > with a page (like arm64, when the memory tagging extension is implemented),
> > > > it can request that the destination page similarly has storage for tags
> > > > already allocated.
> > > >
> > > > No functional change.
> > > >
> > > > Signed-off-by: Alexandru Elisei <[email protected]>
> > > > ---
> > > > include/linux/migrate.h | 4 ++++
> > > > mm/mempolicy.c | 2 ++
> > > > mm/migrate.c | 3 +++
> > > > 3 files changed, 9 insertions(+)
> > > >
> > > > diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> > > > index 2ce13e8a309b..0acef592043c 100644
> > > > --- a/include/linux/migrate.h
> > > > +++ b/include/linux/migrate.h
> > > > @@ -60,6 +60,10 @@ struct movable_operations {
> > > > /* Defined in mm/debug.c: */
> > > > extern const char *migrate_reason_names[MR_TYPES];
> > > >
> > > > +#ifndef arch_migration_target_gfp
> > > > +#define arch_migration_target_gfp(src, gfp) 0
> > > > +#endif
> > > > +
> > > > #ifdef CONFIG_MIGRATION
> > > >
> > > > void putback_movable_pages(struct list_head *l);
> > > > diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> > > > index 10a590ee1c89..50bc43ab50d6 100644
> > > > --- a/mm/mempolicy.c
> > > > +++ b/mm/mempolicy.c
> > > > @@ -1182,6 +1182,7 @@ static struct folio *alloc_migration_target_by_mpol(struct folio *src,
> > > >
> > > > h = folio_hstate(src);
> > > > gfp = htlb_alloc_mask(h);
> > > > + gfp |= arch_migration_target_gfp(src, gfp);
> > >
> > > I think it'll be more robust to have arch_migration_target_gfp() to modify
> > > the flags and return the new mask with added (or potentially removed)
> > > flags.
> >
> > I did it this way so an arch won't be able to remove flags set by the MM code.
> > There's a similar pattern in do_mmap() -> calc_vm_flag_bits() ->
> > arch_calc_vm_flag_bits().
>
> Ok, just add a sentence about it to the commit message.
Great, will do that!
Thanks,
Alex
>
> > Thanks,
> > Alex
> >
> > >
> > > > nodemask = policy_nodemask(gfp, pol, ilx, &nid);
> > > > return alloc_hugetlb_folio_nodemask(h, nid, nodemask, gfp);
> > > > }
> > > > @@ -1190,6 +1191,7 @@ static struct folio *alloc_migration_target_by_mpol(struct folio *src,
> > > > gfp = GFP_TRANSHUGE;
> > > > else
> > > > gfp = GFP_HIGHUSER_MOVABLE | __GFP_RETRY_MAYFAIL | __GFP_COMP;
> > > > + gfp |= arch_migration_target_gfp(src, gfp);
> > > >
> > > > page = alloc_pages_mpol(gfp, order, pol, ilx, nid);
> > > > return page_rmappable_folio(page);
> > >
> > > --
> > > Sincerely yours,
> > > Mike.
> > >
>
> --
> Sincerely yours,
> Mike.
>
On 19.11.23 17:57, Alexandru Elisei wrote:
> To enable tagging on a memory range, userspace can use mprotect() with the
> PROT_MTE access flag. Pages already mapped in the VMA don't have the
> associated tag storage block reserved, so mark the PTEs as
> PAGE_FAULT_ON_ACCESS to trigger a fault next time they are accessed, and
> reserve the tag storage on the fault path.
That sounds a lot like fake PROT_NONE. Would there be a way to unify that
handling and simply reuse pte_protnone()? For example, could we special
case on VMA flags?
Like, don't do NUMA hinting in these special VMAs. Then, have something
like:
if (pte_protnone(vmf->orig_pte))
return handle_pte_protnone(vmf);
In there, special case on the VMA flags.
I *suspect* that handle_page_missing_tag_storage() stole (sorry :P) some
code from the prot_none handling path. At least the recovery path and
writability handling looks like it better be located shared in
handle_pte_protnone() as well.
That might take some magic out of this patch.
--
Cheers,
David / dhildenb
On 19.11.23 17:57, Alexandru Elisei wrote:
> Handle PAGE_FAULT_ON_ACCESS faults for huge pages in a similar way to
> regular pages.
>
> Signed-off-by: Alexandru Elisei <[email protected]>
> ---
Same comments :)
--
Cheers,
David / dhildenb
On 28.11.23 18:55, David Hildenbrand wrote:
> On 19.11.23 17:57, Alexandru Elisei wrote:
>> To enable tagging on a memory range, userspace can use mprotect() with the
>> PROT_MTE access flag. Pages already mapped in the VMA don't have the
>> associated tag storage block reserved, so mark the PTEs as
>> PAGE_FAULT_ON_ACCESS to trigger a fault next time they are accessed, and
>> reserve the tag storage on the fault path.
>
> That sounds a lot like fake PROT_NONE. Would there be a way to unify that
> handling and simply reuse pte_protnone()? For example, could we special
> case on VMA flags?
>
> Like, don't do NUMA hinting in these special VMAs. Then, have something
> like:
>
> if (pte_protnone(vmf->orig_pte))
> return handle_pte_protnone(vmf);
>
Thinking out loud: maybe there isn't even the need to special-case on the
VMA. Arch code should know if there is something to do. If not, it
surely was triggered by NUMA hinting. So maybe that could be handled in
handle_pte_protnone() quite nicely.
--
Cheers,
David / dhildenb
Hi,
On Tue, Nov 28, 2023 at 06:03:52PM +0100, David Hildenbrand wrote:
> On 27.11.23 16:01, Alexandru Elisei wrote:
> > Hi David,
> >
> > On Fri, Nov 24, 2023 at 08:40:55PM +0100, David Hildenbrand wrote:
> > > On 19.11.23 17:57, Alexandru Elisei wrote:
> > > > Add the MTE tag storage pages to the MIGRATE_CMA migratetype, which allows
> > > > the page allocator to manage them like regular pages.
> > > >
> > > > This migratetype lends the pages some very desirable properties:
> > > >
> > > > * They cannot be longterm pinned, meaning they will always be migratable.
> > > >
> > > > * The pages can be allocated explicitly by using their PFN (with
> > > > alloc_contig_range()) when they are needed to store tags.
> > > >
> > > > Signed-off-by: Alexandru Elisei <[email protected]>
> > > > ---
> > > > arch/arm64/Kconfig | 1 +
> > > > arch/arm64/kernel/mte_tag_storage.c | 68 +++++++++++++++++++++++++++++
> > > > include/linux/mmzone.h | 5 +++
> > > > mm/internal.h | 3 --
> > > > 4 files changed, 74 insertions(+), 3 deletions(-)
> > > >
> > > > diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> > > > index fe8276fdc7a8..047487046e8f 100644
> > > > --- a/arch/arm64/Kconfig
> > > > +++ b/arch/arm64/Kconfig
> > > > @@ -2065,6 +2065,7 @@ config ARM64_MTE
> > > > if ARM64_MTE
> > > > config ARM64_MTE_TAG_STORAGE
> > > > bool "Dynamic MTE tag storage management"
> > > > + select CONFIG_CMA
> > > > help
> > > > Adds support for dynamic management of the memory used by the hardware
> > > > for storing MTE tags. This memory, unlike normal memory, cannot be
> > > > diff --git a/arch/arm64/kernel/mte_tag_storage.c b/arch/arm64/kernel/mte_tag_storage.c
> > > > index fa6267ef8392..427f4f1909f3 100644
> > > > --- a/arch/arm64/kernel/mte_tag_storage.c
> > > > +++ b/arch/arm64/kernel/mte_tag_storage.c
> > > > @@ -5,10 +5,12 @@
> > > > * Copyright (C) 2023 ARM Ltd.
> > > > */
> > > > +#include <linux/cma.h>
> > > > #include <linux/memblock.h>
> > > > #include <linux/mm.h>
> > > > #include <linux/of_device.h>
> > > > #include <linux/of_fdt.h>
> > > > +#include <linux/pageblock-flags.h>
> > > > #include <linux/range.h>
> > > > #include <linux/string.h>
> > > > #include <linux/xarray.h>
> > > > @@ -189,6 +191,14 @@ static int __init fdt_init_tag_storage(unsigned long node, const char *uname,
> > > > return ret;
> > > > }
> > > > + /* Pages are managed in pageblock_nr_pages chunks */
> > > > + if (!IS_ALIGNED(tag_range->start | range_len(tag_range), pageblock_nr_pages)) {
> > > > + pr_err("Tag storage region 0x%llx-0x%llx not aligned to pageblock size 0x%llx",
> > > > + PFN_PHYS(tag_range->start), PFN_PHYS(tag_range->end),
> > > > + PFN_PHYS(pageblock_nr_pages));
> > > > + return -EINVAL;
> > > > + }
> > > > +
> > > > ret = tag_storage_get_memory_node(node, &mem_node);
> > > > if (ret)
> > > > return ret;
> > > > @@ -254,3 +264,61 @@ void __init mte_tag_storage_init(void)
> > > > pr_info("MTE tag storage region management disabled");
> > > > }
> > > > }
> > > > +
> > > > +static int __init mte_tag_storage_activate_regions(void)
> > > > +{
> > > > + phys_addr_t dram_start, dram_end;
> > > > + struct range *tag_range;
> > > > + unsigned long pfn;
> > > > + int i, ret;
> > > > +
> > > > + if (num_tag_regions == 0)
> > > > + return 0;
> > > > +
> > > > + dram_start = memblock_start_of_DRAM();
> > > > + dram_end = memblock_end_of_DRAM();
> > > > +
> > > > + for (i = 0; i < num_tag_regions; i++) {
> > > > + tag_range = &tag_regions[i].tag_range;
> > > > + /*
> > > > + * Tag storage region was clipped by arm64_bootmem_init()
> > > > + * enforcing addressing limits.
> > > > + */
> > > > + if (PFN_PHYS(tag_range->start) < dram_start ||
> > > > + PFN_PHYS(tag_range->end) >= dram_end) {
> > > > + pr_err("Tag storage region 0x%llx-0x%llx outside addressable memory",
> > > > + PFN_PHYS(tag_range->start), PFN_PHYS(tag_range->end));
> > > > + ret = -EINVAL;
> > > > + goto out_disabled;
> > > > + }
> > > > + }
> > > > +
> > > > + /*
> > > > + * MTE disabled, tag storage pages can be used like any other pages. The
> > > > + * only restriction is that the pages cannot be used by kexec because
> > > > + * the memory remains marked as reserved in the memblock allocator.
> > > > + */
> > > > + if (!system_supports_mte()) {
> > > > + for (i = 0; i< num_tag_regions; i++) {
> > > > + tag_range = &tag_regions[i].tag_range;
> > > > + for (pfn = tag_range->start; pfn <= tag_range->end; pfn++)
> > > > + free_reserved_page(pfn_to_page(pfn));
> > > > + }
> > > > + ret = 0;
> > > > + goto out_disabled;
> > > > + }
> > > > +
> > > > + for (i = 0; i < num_tag_regions; i++) {
> > > > + tag_range = &tag_regions[i].tag_range;
> > > > + for (pfn = tag_range->start; pfn <= tag_range->end; pfn += pageblock_nr_pages)
> > > > + init_cma_reserved_pageblock(pfn_to_page(pfn));
> > > > + totalcma_pages += range_len(tag_range);
> > > > + }
> > >
> > > You shouldn't be doing that manually in arm code. Likely you want some cma.c
> > > helper for something like that.
> >
> > If you are referring to the last loop (the one that does
> > init_cma_reserved_pageblock()), indeed, there's already a function which
> > does that, cma_init_reserved_areas() -> cma_activate_area().
> >
> > >
> > > But, can you elaborate on why you took this hacky (sorry) approach as
> > > documented in the cover letter:
> >
> > No worries, it is indeed a bit hacky :)
> >
> > >
> > > "The arm64 code manages this memory directly instead of using
> > > cma_declare_contiguous/cma_alloc for performance reasons."
> > >
> > > What is the exact problem?
> >
> > I am referring to the performance degradation that is fixed in patch #26,
> > "arm64: mte: Fast track reserving tag storage when the block is free" [1].
> > The issue is that alloc_contig_range() -> __alloc_contig_migrate_range()
> > calls lru_cache_disable(), which IPIs all the CPUs in the system, and that
> > leads to a 10-20% performance degradation on Chrome. It has been observed
> > that most of the time the tag storage pages are free, and the
> > lru_cache_disable() calls are unnecessary.
>
> This sounds like something eventually worth integrating into
> CMA/alloc_contig_range(). Like, a fast path to check if we are only
> allocating something small (e.g., falls within a single pageblock), and if
> the page is free.
>
> >
> > The performance degradation is almost entirely eliminated by having the code
> > take the tag storage page directly from the free list if it's free, instead
> > of calling alloc_contig_range().
> >
> > Do you believe it would be better to use the cma code, and modify it to use
> > this fast path to take the page directly from the buddy allocator?
>
> That sounds reasonable yes. Do you see any blockers for that?
I have been looking at the CMA code, and nothing stands out. I'll try changing
the code to use cma_alloc/cma_release for the next iteration.
>
> >
> > I can definitely try to integrate the code with cma_alloc(), but I think
> > keeping the fast path for reserving tag storage is extremely desirable,
> > since it makes such a huge difference to performance.
>
> Yes, but let's try finding a way to optimize common code, to eventually
> improve some CMA cases as well? :)
Sounds good, I'll try to integrate the fast path code into cma_alloc(); that way
existing callers can benefit from it immediately.
Thanks,
Alex
>
> --
> Cheers,
>
> David / dhildenb
>
Hi,
On Tue, Nov 28, 2023 at 06:05:20PM +0100, David Hildenbrand wrote:
> On 27.11.23 16:04, Alexandru Elisei wrote:
> > Hi,
> >
> > On Fri, Nov 24, 2023 at 08:51:38PM +0100, David Hildenbrand wrote:
> > > On 19.11.23 17:57, Alexandru Elisei wrote:
> > > > Tag storage memory requires that the tag storage pages used for data are
> > > > always migratable when they need to be repurposed to store tags.
> > > >
> > > > If ARCH_KEEP_MEMBLOCK is enabled, kexec will scan all non-reserved
> > > > memblocks to find a suitable location for copying the kernel image. The
> > > > kernel image, once loaded, cannot be moved to another location in physical
> > > > memory. The initialization code for the tag storage reserves the memblocks
> > > > for the tag storage pages, which means kexec will not use them, and the tag
> > > > storage pages can be migrated at any time, which is the desired behaviour.
> > > >
> > > > However, if ARCH_KEEP_MEMBLOCK is not selected, kexec will not skip a
> > > > region unless the memory resource has the IORESOURCE_SYSRAM_DRIVER_MANAGED
> > > > flag, which isn't currently set by the tag storage initialization code.
> > > >
> > > > Make ARM64_MTE_TAG_STORAGE depend on ARCH_KEEP_MEMBLOCK to make it explicit
> > > > that the Kconfig option is required for it to work correctly.
> > > >
> > > > Signed-off-by: Alexandru Elisei <[email protected]>
> > > > ---
> > > > arch/arm64/Kconfig | 1 +
> > > > 1 file changed, 1 insertion(+)
> > > >
> > > > diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> > > > index 047487046e8f..efa5b7958169 100644
> > > > --- a/arch/arm64/Kconfig
> > > > +++ b/arch/arm64/Kconfig
> > > > @@ -2065,6 +2065,7 @@ config ARM64_MTE
> > > > if ARM64_MTE
> > > > config ARM64_MTE_TAG_STORAGE
> > > > bool "Dynamic MTE tag storage management"
> > > > + depends on ARCH_KEEP_MEMBLOCK
> > > > select CONFIG_CMA
> > > > help
> > > > Adds support for dynamic management of the memory used by the hardware
> > >
> > > Doesn't arm64 select that unconditionally? Why is this required then?
> >
> > I've added this patch to make the dependency explicit. If, in the future, arm64
> > stops selecting ARCH_KEEP_MEMBLOCK, I think it would be very easy to miss the
> > fact that tag storage depends on it. So this patch is not required per se, it's
> > there to document the dependency.
>
> I see. Could you add some static_assert / BUILD_BUG_ON instead?
I can do that, sure.
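Something along these lines should do it (a sketch, placed for example in
mte_tag_storage.c, which is only built when ARM64_MTE_TAG_STORAGE is enabled):

#include <linux/build_bug.h>

/* Fail the build if the dependency is ever dropped. */
static_assert(IS_ENABLED(CONFIG_ARCH_KEEP_MEMBLOCK),
	      "ARM64_MTE_TAG_STORAGE requires ARCH_KEEP_MEMBLOCK");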
Thanks,
Alex
>
> I suspect there are plenty of other (undocumented) reasons why
> ARCH_KEEP_MEMBLOCK has to be enabled for now, and none of them selects
> ARCH_KEEP_MEMBLOCK, I suspect.
>
> --
> Cheers,
>
> David / dhildenb
>
On Tue, Nov 28, 2023 at 06:06:54PM +0100, David Hildenbrand wrote:
> On 19.11.23 17:57, Alexandru Elisei wrote:
> > On arm64, the zero page receives special treatment by having the tagged
> > flag set on MTE initialization, not when the page is mapped in a process
> > address space. Reserve the corresponding tag block when tag storage
> > management is being activated.
>
> Out of curiosity: why does the shared zeropage require tagged storage? What
> about the huge zeropage?
There are two different tags that are used for tag checking: the logical
tag, the tag embedded in bits 59:56 of an address, and the physical tag
corresponding to the address. This tag is stored in a separate memory
location, called tag storage. When an access is performed, hardware
compares the logical tag (from the address) with the physical tag (from the
tag storage). If they match, the access is permitted.
The physical tag is set with special instructions.
Userspace pointers have bits 59:56 zero. If the pointer is in a VMA with
MTE enabled, then for userspace to be able to access this address, the
physical tag must also be 0b0000.
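As a small illustration, the logical tag is just bits 59:56 of the pointer
(sketch):

static inline unsigned int logical_tag(const void *addr)
{
	return ((unsigned long)addr >> 56) & 0xf;
}

For an untagged userspace pointer this is always 0, which is why the physical
tag of the zero page must be 0b0000 as well.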
To make it easier on userspace, when a page is first mapped as tagged, its
tags are cleared by the kernel; this way, userspace can access the address
immediately, without clearing the physical tags beforehand. Another reason
for clearing the physical tags when a page is mapped as tagged would be to
avoid leaking uninitialized tags to userspace.
The zero page is special, because the physical tags are not zeroed every
time the page is mapped in a process; instead, the zero page is marked as
tagged (by setting a page flag) and the physical tags are zeroed only once,
when MTE is enabled at boot.
All of this means that when tag storage is enabled, which happens after MTE
is enabled, the tag storage corresponding to the zero page is already in
use and must be reserved, and it can never be used for data allocations.
I hope all of the above makes sense. I can also put it in the commit
message :)
As for the zero huge page, the MTE code in the kernel treats it like a
regular page, and it zeroes the tags when it is mapped as tagged in a
process. I agree that this might not be the best solution from a
performance perspective, but it has worked so far.
With tag storage management enabled, set_pte_at()->mte_sync_tags() will
discover that the huge zero page doesn't have tag storage reserved, the
table entry will be mapped as invalid to use the page fault-on-access
mechanism that I introduce later in the series [1] to reserve tag storage,
and after that set_pte_at() will zero the physical tags.
[1] https://lore.kernel.org/all/[email protected]/
Thanks,
Alex
>
> --
> Cheers,
>
> David / dhildenb
>
Hi,
On Tue, Nov 28, 2023 at 06:55:18PM +0100, David Hildenbrand wrote:
> On 19.11.23 17:57, Alexandru Elisei wrote:
> > To enable tagging on a memory range, userspace can use mprotect() with the
> > PROT_MTE access flag. Pages already mapped in the VMA don't have the
> > associated tag storage block reserved, so mark the PTEs as
> > PAGE_FAULT_ON_ACCESS to trigger a fault next time they are accessed, and
> > reserve the tag storage on the fault path.
>
> That sounds a lot like fake PROT_NONE. Would there be a way to unify that
Yes, arm64 basically defines PAGE_FAULT_ON_ACCESS as PAGE_NONE |
PTE_TAG_STORAGE_NONE.
> handling and simply reuse pte_protnone()? For example, could we special case
> on VMA flags?
>
> Like, don't do NUMA hinting in these special VMAs. Then, have something
> like:
>
> if (pte_protnone(vmf->orig_pte))
> return handle_pte_protnone(vmf);
>
> In there, special case on the VMA flags.
Your suggestion from the follow-up reply that an arch should know if it needs to
do something was spot on; arm64 can use the software bit in the translation
table entry for that.
So what you are proposing is this:
* Rename do_numa_page->handle_pte_protnone
* At some point in the do_numa_page (now renamed to handle_pte_protnone) flow,
decide if pte_protnone() has been set for an arch specific reason or because
of automatic NUMA balancing.
* if pte_protnone() has been set by an architecture, then let the architecture
handle the fault.
If I understood you correctly, that's a good idea, and should be easy to
implement.
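Roughly like this, I think (a sketch of the dispatch; both arch hook names are
made up for illustration):

	/* In handle_pte_fault(), replacing the current do_numa_page() call: */
	if (pte_protnone(vmf->orig_pte) && vma_is_accessible(vmf->vma)) {
		/*
		 * Let the arch claim prot-none entries it created itself
		 * (PAGE_FAULT_ON_ACCESS); everything else is NUMA hinting.
		 */
		if (arch_fault_on_access_pte(vmf->orig_pte))
			return arch_handle_fault_on_access(vmf);
		return handle_pte_protnone(vmf);	/* the renamed do_numa_page() */
	}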
>
> I *suspect* that handle_page_missing_tag_storage() stole (sorry :P) some
Indeed, most of the code is taken as-is from do_numa_page().
> code from the prot_none handling path. At least the recovery path and
> writability handling looks like it better be located shared in
> handle_pte_protnone() as well.
Yes, I agree.
Thanks,
Alex
>
> That might take some magic out of this patch.
>
> --
> Cheers,
>
> David / dhildenb
>
Hi,
On Tue, Nov 28, 2023 at 06:56:34PM +0100, David Hildenbrand wrote:
> On 19.11.23 17:57, Alexandru Elisei wrote:
> > Handle PAGE_FAULT_ON_ACCESS faults for huge pages in a similar way to
> > regular pages.
> >
> > Signed-off-by: Alexandru Elisei <[email protected]>
> > ---
>
> Same comments :)
Yes, will have a look at this fault handling path too :)
Thanks,
Alex
>
> --
> Cheers,
>
> David / dhildenb
>
Hi,
On Wed, Nov 29, 2023 at 05:44:24PM +0900, Hyesoo Yu wrote:
> Hello.
>
> On Sun, Nov 19, 2023 at 04:57:05PM +0000, Alexandru Elisei wrote:
> > Allow the kernel to get the size and location of the MTE tag storage
> > regions from the DTB. This memory is marked as reserved for now.
> >
> > The DTB node for the tag storage region is defined as:
> >
> > tags0: tag-storage@8f8000000 {
> > compatible = "arm,mte-tag-storage";
> > reg = <0x08 0xf8000000 0x00 0x4000000>;
> > block-size = <0x1000>;
> > memory = <&memory0>; // Associated tagged memory node
> > };
> >
>
> How about using compatible = "shared-dma-pool" like below ?
>
> &reserved_memory {
> tags0: tag0@8f8000000 {
> compatible = "arm,mte-tag-storage";
> reg = <0x08 0xf8000000 0x00 0x4000000>;
> };
> }
>
> tag-storage {
> compatible = "arm,mte-tag-storage";
> memory-region = <&tag>;
> memory = <&memory0>;
> block-size = <0x1000>;
> }
I'm sorry, but I don't follow where compatible = "shared-dma-pool" fits
with the examples.
>
> And then, the activation of CMA would be performed in the CMA code.
> We just can get the region information from memory-region and allocate it directly
> like alloc_contig_range, take_page_off_buddy. It seems like we can remove a lot of code.
For the next iteration I am planning to integrate the code more tightly
with CMA, so any suggestions to that effect are very welcome :)
>
> > The tag storage region represents the largest contiguous memory region that
> > holds all the tags for the associated contiguous memory region which can be
> > tagged. For example, for a 32GB contiguous tagged memory the corresponding
> > tag storage region is 1GB of contiguous memory, not two adjacent 512M of
> > tag storage memory.
> >
> > "block-size" represents the minimum multiple of 4K of tag storage where all
> > the tags stored in the block correspond to a contiguous memory region. This
> > is needed for platforms where the memory controller interleaves tag writes
> > to memory. For example, if the memory controller interleaves tag writes for
> > 256KB of contiguous memory across 8K of tag storage (2-way interleave),
> > then the correct value for "block-size" is 0x2000. This value is a hardware
> > property, independent of the selected kernel page size.
> >
>
> Is it considered for kernel page size like 16K page, 64K page ? The comment says
> it should be a multiple of 4K, but it should be a multiple of the "page size" more accurately.
> Please let me know if there's anything I misunderstood. :-)
The block size in the DTB is a hardware property; it's independent of the
kernel page size, which is a compile-time option.
The function get_block_size_pages(), which computes the tag storage block
size as the kernel will use it, takes into account the fact that the
hardware block size is not necessarily a multiple of the kernel page size,
and computes the least common multiple by doing:
(kernel page size in bytes x DTB block size in bytes) / greatest common divisor
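In code that would be something like the following (a sketch of the
computation, not necessarily the exact function from the patch):

#include <linux/gcd.h>

static u32 get_block_size_pages(u32 block_size_bytes)
{
	unsigned long a = PAGE_SIZE;
	unsigned long b = block_size_bytes;

	/* Least common multiple of the kernel page size and the hardware
	 * block size, expressed as a number of kernel pages. */
	return (a * b) / gcd(a, b) / PAGE_SIZE;
}

For example, with 16K kernel pages and a 0x2000 (8K) hardware block size, the
result is one 16K page.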
As for why the hardware block size is a multiple of 4k, that was chosen
because it will be part of the architecture update. Since the minimum
hardware page size is 4K, it doesn't make much sense to have the DTB
block-size smaller than that.
Hope that makes sense!
Thanks,
Alex
>
>
> > Signed-off-by: Alexandru Elisei <[email protected]>
> > ---
> > arch/arm64/Kconfig | 12 ++
> > arch/arm64/include/asm/mte_tag_storage.h | 15 ++
> > arch/arm64/kernel/Makefile | 1 +
> > arch/arm64/kernel/mte_tag_storage.c | 256 +++++++++++++++++++++++
> > arch/arm64/kernel/setup.c | 7 +
> > 5 files changed, 291 insertions(+)
> > create mode 100644 arch/arm64/include/asm/mte_tag_storage.h
> > create mode 100644 arch/arm64/kernel/mte_tag_storage.c
> >
> > diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> > index 7b071a00425d..fe8276fdc7a8 100644
> > --- a/arch/arm64/Kconfig
> > +++ b/arch/arm64/Kconfig
> > @@ -2062,6 +2062,18 @@ config ARM64_MTE
> >
> > Documentation/arch/arm64/memory-tagging-extension.rst.
> >
> > +if ARM64_MTE
> > +config ARM64_MTE_TAG_STORAGE
> > + bool "Dynamic MTE tag storage management"
> > + help
> > + Adds support for dynamic management of the memory used by the hardware
> > + for storing MTE tags. This memory, unlike normal memory, cannot be
> > + tagged. When it is used to store tags for another memory location it
> > + cannot be used for any type of allocation.
> > +
> > + If unsure, say N
> > +endif # ARM64_MTE
> > +
> > endmenu # "ARMv8.5 architectural features"
> >
> > menu "ARMv8.7 architectural features"
> > diff --git a/arch/arm64/include/asm/mte_tag_storage.h b/arch/arm64/include/asm/mte_tag_storage.h
> > new file mode 100644
> > index 000000000000..8f86c4f9a7c3
> > --- /dev/null
> > +++ b/arch/arm64/include/asm/mte_tag_storage.h
> > @@ -0,0 +1,15 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +/*
> > + * Copyright (C) 2023 ARM Ltd.
> > + */
> > +#ifndef __ASM_MTE_TAG_STORAGE_H
> > +#define __ASM_MTE_TAG_STORAGE_H
> > +
> > +#ifdef CONFIG_ARM64_MTE_TAG_STORAGE
> > +void mte_tag_storage_init(void);
> > +#else
> > +static inline void mte_tag_storage_init(void)
> > +{
> > +}
> > +#endif /* CONFIG_ARM64_MTE_TAG_STORAGE */
> > +#endif /* __ASM_MTE_TAG_STORAGE_H */
> > diff --git a/arch/arm64/kernel/Makefile b/arch/arm64/kernel/Makefile
> > index d95b3d6b471a..5f031bf9f8f1 100644
> > --- a/arch/arm64/kernel/Makefile
> > +++ b/arch/arm64/kernel/Makefile
> > @@ -70,6 +70,7 @@ obj-$(CONFIG_CRASH_CORE) += crash_core.o
> > obj-$(CONFIG_ARM_SDE_INTERFACE) += sdei.o
> > obj-$(CONFIG_ARM64_PTR_AUTH) += pointer_auth.o
> > obj-$(CONFIG_ARM64_MTE) += mte.o
> > +obj-$(CONFIG_ARM64_MTE_TAG_STORAGE) += mte_tag_storage.o
> > obj-y += vdso-wrap.o
> > obj-$(CONFIG_COMPAT_VDSO) += vdso32-wrap.o
> > obj-$(CONFIG_UNWIND_PATCH_PAC_INTO_SCS) += patch-scs.o
> > diff --git a/arch/arm64/kernel/mte_tag_storage.c b/arch/arm64/kernel/mte_tag_storage.c
> > new file mode 100644
> > index 000000000000..fa6267ef8392
> > --- /dev/null
> > +++ b/arch/arm64/kernel/mte_tag_storage.c
> > @@ -0,0 +1,256 @@
> > +// SPDX-License-Identifier: GPL-2.0-only
> > +/*
> > + * Support for dynamic tag storage.
> > + *
> > + * Copyright (C) 2023 ARM Ltd.
> > + */
> > +
> > +#include <linux/memblock.h>
> > +#include <linux/mm.h>
> > +#include <linux/of_device.h>
> > +#include <linux/of_fdt.h>
> > +#include <linux/range.h>
> > +#include <linux/string.h>
> > +#include <linux/xarray.h>
> > +
> > +#include <asm/mte_tag_storage.h>
> > +
> > +struct tag_region {
> > + struct range mem_range; /* Memory associated with the tag storage, in PFNs. */
> > + struct range tag_range; /* Tag storage memory, in PFNs. */
> > + u32 block_size; /* Tag block size, in pages. */
> > +};
> > +
> > +#define MAX_TAG_REGIONS 32
> > +
> > +static struct tag_region tag_regions[MAX_TAG_REGIONS];
> > +static int num_tag_regions;
> > +
> > +static int __init tag_storage_of_flat_get_range(unsigned long node, const __be32 *reg,
> > + int reg_len, struct range *range)
> > +{
> > + int addr_cells = dt_root_addr_cells;
> > + int size_cells = dt_root_size_cells;
> > + u64 size;
> > +
> > + if (reg_len / 4 > addr_cells + size_cells)
> > + return -EINVAL;
> > +
> > + range->start = PHYS_PFN(of_read_number(reg, addr_cells));
> > + size = PHYS_PFN(of_read_number(reg + addr_cells, size_cells));
> > + if (size == 0) {
> > + pr_err("Invalid node");
> > + return -EINVAL;
> > + }
> > + range->end = range->start + size - 1;
> > +
> > + return 0;
> > +}
> > +
> > +static int __init tag_storage_of_flat_get_tag_range(unsigned long node,
> > + struct range *tag_range)
> > +{
> > + const __be32 *reg;
> > + int reg_len;
> > +
> > + reg = of_get_flat_dt_prop(node, "reg", &reg_len);
> > + if (reg == NULL) {
> > + pr_err("Invalid metadata node");
> > + return -EINVAL;
> > + }
> > +
> > + return tag_storage_of_flat_get_range(node, reg, reg_len, tag_range);
> > +}
> > +
> > +static int __init tag_storage_of_flat_get_memory_range(unsigned long node, struct range *mem)
> > +{
> > + const __be32 *reg;
> > + int reg_len;
> > +
> > + reg = of_get_flat_dt_prop(node, "linux,usable-memory", &reg_len);
> > + if (reg == NULL)
> > + reg = of_get_flat_dt_prop(node, "reg", &reg_len);
> > +
> > + if (reg == NULL) {
> > + pr_err("Invalid memory node");
> > + return -EINVAL;
> > + }
> > +
> > + return tag_storage_of_flat_get_range(node, reg, reg_len, mem);
> > +}
> > +
> > +struct find_memory_node_arg {
> > + unsigned long node;
> > + u32 phandle;
> > +};
> > +
> > +static int __init fdt_find_memory_node(unsigned long node, const char *uname,
> > + int depth, void *data)
> > +{
> > + const char *type = of_get_flat_dt_prop(node, "device_type", NULL);
> > + struct find_memory_node_arg *arg = data;
> > +
> > + if (depth != 1 || !type || strcmp(type, "memory") != 0)
> > + return 0;
> > +
> > + if (of_get_flat_dt_phandle(node) == arg->phandle) {
> > + arg->node = node;
> > + return 1;
> > + }
> > +
> > + return 0;
> > +}
> > +
> > +static int __init tag_storage_get_memory_node(unsigned long tag_node, unsigned long *mem_node)
> > +{
> > + struct find_memory_node_arg arg = { 0 };
> > + const __be32 *memory_prop;
> > + u32 mem_phandle;
> > + int ret, reg_len;
> > +
> > + memory_prop = of_get_flat_dt_prop(tag_node, "memory", &reg_len);
> > + if (!memory_prop) {
> > + pr_err("Missing 'memory' property in the tag storage node");
> > + return -EINVAL;
> > + }
> > +
> > + mem_phandle = be32_to_cpup(memory_prop);
> > + arg.phandle = mem_phandle;
> > +
> > + ret = of_scan_flat_dt(fdt_find_memory_node, &arg);
> > + if (ret != 1) {
> > + pr_err("Associated memory node not found");
> > + return -EINVAL;
> > + }
> > +
> > + *mem_node = arg.node;
> > +
> > + return 0;
> > +}
> > +
> > +static int __init tag_storage_of_flat_read_u32(unsigned long node, const char *propname,
> > + u32 *retval)
> > +{
> > + const __be32 *reg;
> > +
> > + reg = of_get_flat_dt_prop(node, propname, NULL);
> > + if (!reg)
> > + return -EINVAL;
> > +
> > + *retval = be32_to_cpup(reg);
> > + return 0;
> > +}
> > +
> > +static u32 __init get_block_size_pages(u32 block_size_bytes)
> > +{
> > + u32 a = PAGE_SIZE;
> > + u32 b = block_size_bytes;
> > + u32 r;
> > +
> > + /* Find greatest common divisor using the Euclidian algorithm. */
> > + do {
> > + r = a % b;
> > + a = b;
> > + b = r;
> > + } while (b != 0);
> > +
> > + return PHYS_PFN(PAGE_SIZE * block_size_bytes / a);
> > +}
> > +
> > +static int __init fdt_init_tag_storage(unsigned long node, const char *uname,
> > + int depth, void *data)
> > +{
> > + struct tag_region *region;
> > + unsigned long mem_node;
> > + struct range *mem_range;
> > + struct range *tag_range;
> > + u32 block_size_bytes;
> > + u32 nid = 0;
> > + int ret;
> > +
> > + if (depth != 1 || !strstr(uname, "tag-storage"))
> > + return 0;
> > +
> > + if (!of_flat_dt_is_compatible(node, "arm,mte-tag-storage"))
> > + return 0;
> > +
> > + if (num_tag_regions == MAX_TAG_REGIONS) {
> > + pr_err("Maximum number of tag storage regions exceeded");
> > + return -EINVAL;
> > + }
> > +
> > + region = &tag_regions[num_tag_regions];
> > + mem_range = &region->mem_range;
> > + tag_range = &region->tag_range;
> > +
> > + ret = tag_storage_of_flat_get_tag_range(node, tag_range);
> > + if (ret) {
> > + pr_err("Invalid tag storage node");
> > + return ret;
> > + }
> > +
> > + ret = tag_storage_get_memory_node(node, &mem_node);
> > + if (ret)
> > + return ret;
> > +
> > + ret = tag_storage_of_flat_get_memory_range(mem_node, mem_range);
> > + if (ret) {
> > + pr_err("Invalid address for associated data memory node");
> > + return ret;
> > + }
> > +
> > + /* The tag region must exactly match the corresponding memory. */
> > + if (range_len(tag_range) * 32 != range_len(mem_range)) {
> > + pr_err("Tag storage region 0x%llx-0x%llx does not cover the memory region 0x%llx-0x%llx",
> > + PFN_PHYS(tag_range->start), PFN_PHYS(tag_range->end),
> > + PFN_PHYS(mem_range->start), PFN_PHYS(mem_range->end));
> > + return -EINVAL;
> > + }
> > +
> > + ret = tag_storage_of_flat_read_u32(node, "block-size", &block_size_bytes);
> > + if (ret || block_size_bytes == 0) {
> > + pr_err("Invalid or missing 'block-size' property");
> > + return -EINVAL;
> > + }
> > + region->block_size = get_block_size_pages(block_size_bytes);
> > + if (range_len(tag_range) % region->block_size != 0) {
> > + pr_err("Tag storage region size 0x%llx is not a multiple of block size %u",
> > + PFN_PHYS(range_len(tag_range)), region->block_size);
> > + return -EINVAL;
> > + }
> > +
>
> I was confused about the variable "block_size". The block size declared in the device tree is
> in bytes, but the actual block size used is in pages. I think the term "block_size" can cause
> confusion as it might be interpreted as bytes. If possible, I suggest changing the term "block_size"
> to something more readable, such as "block_nr_pages" (this is just an example!)
>
> Thanks,
> Regards.
>
> > + ret = tag_storage_of_flat_read_u32(mem_node, "numa-node-id", &nid);
> > + if (ret)
> > + nid = numa_node_id();
> > +
> > + ret = memblock_add_node(PFN_PHYS(tag_range->start), PFN_PHYS(range_len(tag_range)),
> > + nid, MEMBLOCK_NONE);
> > + if (ret) {
> > + pr_err("Error adding tag memblock (%d)", ret);
> > + return ret;
> > + }
> > + memblock_reserve(PFN_PHYS(tag_range->start), PFN_PHYS(range_len(tag_range)));
> > +
> > + pr_info("Found tag storage region 0x%llx-0x%llx, block size %u pages",
> > + PFN_PHYS(tag_range->start), PFN_PHYS(tag_range->end), region->block_size);
> > +
> > + num_tag_regions++;
> > +
> > + return 0;
> > +}
> > +
> > +void __init mte_tag_storage_init(void)
> > +{
> > + struct range *tag_range;
> > + int i, ret;
> > +
> > + ret = of_scan_flat_dt(fdt_init_tag_storage, NULL);
> > + if (ret) {
> > + for (i = 0; i < num_tag_regions; i++) {
> > + tag_range = &tag_regions[i].tag_range;
> > + memblock_remove(PFN_PHYS(tag_range->start), PFN_PHYS(range_len(tag_range)));
> > + }
> > + num_tag_regions = 0;
> > + pr_info("MTE tag storage region management disabled");
> > + }
> > +}
> > diff --git a/arch/arm64/kernel/setup.c b/arch/arm64/kernel/setup.c
> > index 417a8a86b2db..1b77138c1aa5 100644
> > --- a/arch/arm64/kernel/setup.c
> > +++ b/arch/arm64/kernel/setup.c
> > @@ -42,6 +42,7 @@
> > #include <asm/cpufeature.h>
> > #include <asm/cpu_ops.h>
> > #include <asm/kasan.h>
> > +#include <asm/mte_tag_storage.h>
> > #include <asm/numa.h>
> > #include <asm/scs.h>
> > #include <asm/sections.h>
> > @@ -342,6 +343,12 @@ void __init __no_sanitize_address setup_arch(char **cmdline_p)
> > FW_BUG "Booted with MMU enabled!");
> > }
> >
> > + /*
> > + * Must be called before memory limits are enforced by
> > + * arm64_memblock_init().
> > + */
> > + mte_tag_storage_init();
> > +
> > arm64_memblock_init();
> >
> > paging_init();
> > --
> > 2.42.1
> >
> >
Hi,
On Wed, Nov 29, 2023 at 05:57:44PM +0900, Hyesoo Yu wrote:
> On Sun, Nov 19, 2023 at 04:57:09PM +0000, Alexandru Elisei wrote:
> > alloc_contig_range() requires that the requested pages are in the same
> > zone. Check that this is indeed the case before initializing the tag
> > storage blocks.
> >
> > Signed-off-by: Alexandru Elisei <[email protected]>
> > ---
> > arch/arm64/kernel/mte_tag_storage.c | 33 +++++++++++++++++++++++++++++
> > 1 file changed, 33 insertions(+)
> >
> > diff --git a/arch/arm64/kernel/mte_tag_storage.c b/arch/arm64/kernel/mte_tag_storage.c
> > index 8b9bedf7575d..fd63430d4dc0 100644
> > --- a/arch/arm64/kernel/mte_tag_storage.c
> > +++ b/arch/arm64/kernel/mte_tag_storage.c
> > @@ -265,6 +265,35 @@ void __init mte_tag_storage_init(void)
> > }
> > }
> >
> > +/* alloc_contig_range() requires all pages to be in the same zone. */
> > +static int __init mte_tag_storage_check_zone(void)
> > +{
> > + struct range *tag_range;
> > + struct zone *zone;
> > + unsigned long pfn;
> > + u32 block_size;
> > + int i, j;
> > +
> > + for (i = 0; i < num_tag_regions; i++) {
> > + block_size = tag_regions[i].block_size;
> > + if (block_size == 1)
> > + continue;
> > +
> > + tag_range = &tag_regions[i].tag_range;
> > + for (pfn = tag_range->start; pfn <= tag_range->end; pfn += block_size) {
> > + zone = page_zone(pfn_to_page(pfn));
>
> Hello.
>
> Since the blocks within the tag_range must all be in the same zone, can we move the "page_zone"
> out of the loop?
Hmm.. why do you say that the pages in a tag_range must be in the same
zone? I am not very familiar with how the memory management code puts pages
into zones, but I would imagine that pages in a tag range straddling the
4GB limit (so, let's say, from 3GB to 5GB) will end up in both ZONE_DMA and
ZONE_NORMAL.
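As a rough illustration (a sketch only, assuming a typical arm64
configuration where ZONE_DMA covers the first 4GB; not code from this
series), the first and last page of such a range would already disagree:

#include <linux/mm.h>
#include <linux/sizes.h>

/* A hypothetical tag range from 3GB to 5GB straddles the 4GB boundary. */
static bool __init tag_range_straddles_zones(void)
{
	unsigned long start_pfn = PHYS_PFN(3UL * SZ_1G);	/* likely ZONE_DMA */
	unsigned long end_pfn = PHYS_PFN(5UL * SZ_1G) - 1;	/* likely ZONE_NORMAL */

	return page_zone(pfn_to_page(start_pfn)) != page_zone(pfn_to_page(end_pfn));
}

That's why page_zone() is looked up per block instead of once for the whole
tag range.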
Thanks,
Alex
>
> Thanks,
> Regards.
>
> > + for (j = 1; j < block_size; j++) {
> > + if (page_zone(pfn_to_page(pfn + j)) != zone) {
> > + pr_err("Tag storage block pages in different zones");
> > + return -EINVAL;
> > + }
> > + }
> > + }
> > + }
> > +
> > + return 0;
> > +}
> > +
> > static int __init mte_tag_storage_activate_regions(void)
> > {
> > phys_addr_t dram_start, dram_end;
> > @@ -321,6 +350,10 @@ static int __init mte_tag_storage_activate_regions(void)
> > goto out_disabled;
> > }
> >
> > + ret = mte_tag_storage_check_zone();
> > + if (ret)
> > + goto out_disabled;
> > +
> > for (i = 0; i < num_tag_regions; i++) {
> > tag_range = &tag_regions[i].tag_range;
> > for (pfn = tag_range->start; pfn <= tag_range->end; pfn += pageblock_nr_pages)
> > --
> > 2.42.1
> >
> >
Hi,
On Wed, Nov 29, 2023 at 06:27:25PM +0900, Hyesoo Yu wrote:
> On Sun, Nov 19, 2023 at 04:57:13PM +0000, Alexandru Elisei wrote:
> > To enable tagging on a memory range, userspace can use mprotect() with the
> > PROT_MTE access flag. Pages already mapped in the VMA don't have the
> > associated tag storage block reserved, so mark the PTEs as
> > PAGE_FAULT_ON_ACCESS to trigger a fault next time they are accessed, and
> > reserve the tag storage on the fault path.
> >
> > This has several benefits over reserving the tag storage as part of the
> > mprotect() call handling:
> >
> > - Tag storage is reserved only for those pages in the VMA that are
> > accessed, instead of for all the pages already mapped in the VMA.
> > - Reduces the latency of the mprotect() call.
> > - Eliminates races with page migration.
> >
> > But all of this is at the expense of an extra page fault per page until the
> > pages being accessed all have their corresponding tag storage reserved.
> >
> > For arm64, the PAGE_FAULT_ON_ACCESS protection is created by defining a new
> > page table entry software bit, PTE_TAG_STORAGE_NONE. Linux doesn't set any
> > of the PBHA bits in entries from the last level of the translation table
> > and it doesn't use the TCR_ELx.HWUxx bits; also, the first PBHA bit, bit
> > 59, is already being used as a software bit for PMD_PRESENT_INVALID.
> >
> > This is only implemented for PTE mappings; PMD mappings will follow.
> >
> > Signed-off-by: Alexandru Elisei <[email protected]>
> > ---
> > arch/arm64/Kconfig | 1 +
> > arch/arm64/include/asm/mte.h | 4 +-
> > arch/arm64/include/asm/mte_tag_storage.h | 2 +
> > arch/arm64/include/asm/pgtable-prot.h | 2 +
> > arch/arm64/include/asm/pgtable.h | 40 ++++++---
> > arch/arm64/kernel/mte.c | 12 ++-
> > arch/arm64/mm/fault.c | 101 +++++++++++++++++++++++
> > include/linux/pgtable.h | 17 ++++
> > mm/Kconfig | 3 +
> > mm/memory.c | 3 +
> > 10 files changed, 170 insertions(+), 15 deletions(-)
> >
> > diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> > index efa5b7958169..3b9c435eaafb 100644
> > --- a/arch/arm64/Kconfig
> > +++ b/arch/arm64/Kconfig
> > @@ -2066,6 +2066,7 @@ if ARM64_MTE
> > config ARM64_MTE_TAG_STORAGE
> > bool "Dynamic MTE tag storage management"
> > depends on ARCH_KEEP_MEMBLOCK
> > + select ARCH_HAS_FAULT_ON_ACCESS
> > select CONFIG_CMA
> > help
> > Adds support for dynamic management of the memory used by the hardware
> > diff --git a/arch/arm64/include/asm/mte.h b/arch/arm64/include/asm/mte.h
> > index 6457b7899207..70dc2e409070 100644
> > --- a/arch/arm64/include/asm/mte.h
> > +++ b/arch/arm64/include/asm/mte.h
> > @@ -107,7 +107,7 @@ static inline bool try_page_mte_tagging(struct page *page)
> > }
> >
> > void mte_zero_clear_page_tags(void *addr);
> > -void mte_sync_tags(pte_t pte, unsigned int nr_pages);
> > +void mte_sync_tags(pte_t *pteval, unsigned int nr_pages);
> > void mte_copy_page_tags(void *kto, const void *kfrom);
> > void mte_thread_init_user(void);
> > void mte_thread_switch(struct task_struct *next);
> > @@ -139,7 +139,7 @@ static inline bool try_page_mte_tagging(struct page *page)
> > static inline void mte_zero_clear_page_tags(void *addr)
> > {
> > }
> > -static inline void mte_sync_tags(pte_t pte, unsigned int nr_pages)
> > +static inline void mte_sync_tags(pte_t *pteval, unsigned int nr_pages)
> > {
> > }
> > static inline void mte_copy_page_tags(void *kto, const void *kfrom)
> > diff --git a/arch/arm64/include/asm/mte_tag_storage.h b/arch/arm64/include/asm/mte_tag_storage.h
> > index 6e5d28e607bb..c70ced60a0cd 100644
> > --- a/arch/arm64/include/asm/mte_tag_storage.h
> > +++ b/arch/arm64/include/asm/mte_tag_storage.h
> > @@ -33,6 +33,8 @@ int reserve_tag_storage(struct page *page, int order, gfp_t gfp);
> > void free_tag_storage(struct page *page, int order);
> >
> > bool page_tag_storage_reserved(struct page *page);
> > +
> > +vm_fault_t handle_page_missing_tag_storage(struct vm_fault *vmf);
> > #else
> > static inline bool tag_storage_enabled(void)
> > {
> > diff --git a/arch/arm64/include/asm/pgtable-prot.h b/arch/arm64/include/asm/pgtable-prot.h
> > index e9624f6326dd..85ebb3e352ad 100644
> > --- a/arch/arm64/include/asm/pgtable-prot.h
> > +++ b/arch/arm64/include/asm/pgtable-prot.h
> > @@ -19,6 +19,7 @@
> > #define PTE_SPECIAL (_AT(pteval_t, 1) << 56)
> > #define PTE_DEVMAP (_AT(pteval_t, 1) << 57)
> > #define PTE_PROT_NONE (_AT(pteval_t, 1) << 58) /* only when !PTE_VALID */
> > +#define PTE_TAG_STORAGE_NONE (_AT(pteval_t, 1) << 60) /* only when PTE_PROT_NONE */
> >
> > /*
> > * This bit indicates that the entry is present i.e. pmd_page()
> > @@ -94,6 +95,7 @@ extern bool arm64_use_ng_mappings;
> > })
> >
> > #define PAGE_NONE __pgprot(((_PAGE_DEFAULT) & ~PTE_VALID) | PTE_PROT_NONE | PTE_RDONLY | PTE_NG | PTE_PXN | PTE_UXN)
> > +#define PAGE_FAULT_ON_ACCESS __pgprot(((_PAGE_DEFAULT) & ~PTE_VALID) | PTE_PROT_NONE | PTE_TAG_STORAGE_NONE | PTE_RDONLY | PTE_NG | PTE_PXN | PTE_UXN)
> > /* shared+writable pages are clean by default, hence PTE_RDONLY|PTE_WRITE */
> > #define PAGE_SHARED __pgprot(_PAGE_SHARED)
> > #define PAGE_SHARED_EXEC __pgprot(_PAGE_SHARED_EXEC)
> > diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> > index 20e8de853f5d..8cc135f1c112 100644
> > --- a/arch/arm64/include/asm/pgtable.h
> > +++ b/arch/arm64/include/asm/pgtable.h
> > @@ -326,10 +326,10 @@ static inline void __check_safe_pte_update(struct mm_struct *mm, pte_t *ptep,
> > __func__, pte_val(old_pte), pte_val(pte));
> > }
> >
> > -static inline void __sync_cache_and_tags(pte_t pte, unsigned int nr_pages)
> > +static inline void __sync_cache_and_tags(pte_t *pteval, unsigned int nr_pages)
> > {
> > - if (pte_present(pte) && pte_user_exec(pte) && !pte_special(pte))
> > - __sync_icache_dcache(pte);
> > + if (pte_present(*pteval) && pte_user_exec(*pteval) && !pte_special(*pteval))
> > + __sync_icache_dcache(*pteval);
> >
> > /*
> > * If the PTE would provide user space access to the tags associated
> > @@ -337,9 +337,9 @@ static inline void __sync_cache_and_tags(pte_t pte, unsigned int nr_pages)
> > * pte_access_permitted() returns false for exec only mappings, they
> > * don't expose tags (instruction fetches don't check tags).
> > */
> > - if (system_supports_mte() && pte_access_permitted(pte, false) &&
> > - !pte_special(pte) && pte_tagged(pte))
> > - mte_sync_tags(pte, nr_pages);
> > + if (system_supports_mte() && pte_access_permitted(*pteval, false) &&
> > + !pte_special(*pteval) && pte_tagged(*pteval))
> > + mte_sync_tags(pteval, nr_pages);
> > }
> >
> > static inline void set_ptes(struct mm_struct *mm,
> > @@ -347,7 +347,7 @@ static inline void set_ptes(struct mm_struct *mm,
> > pte_t *ptep, pte_t pte, unsigned int nr)
> > {
> > page_table_check_ptes_set(mm, ptep, pte, nr);
> > - __sync_cache_and_tags(pte, nr);
> > + __sync_cache_and_tags(&pte, nr);
> >
> > for (;;) {
> > __check_safe_pte_update(mm, ptep, pte);
> > @@ -459,6 +459,26 @@ static inline int pmd_protnone(pmd_t pmd)
> > }
> > #endif
> >
> > +#ifdef CONFIG_ARCH_HAS_FAULT_ON_ACCESS
> > +static inline bool fault_on_access_pte(pte_t pte)
> > +{
> > + return (pte_val(pte) & (PTE_PROT_NONE | PTE_TAG_STORAGE_NONE | PTE_VALID)) ==
> > + (PTE_PROT_NONE | PTE_TAG_STORAGE_NONE);
> > +}
> > +
> > +static inline bool fault_on_access_pmd(pmd_t pmd)
> > +{
> > + return fault_on_access_pte(pmd_pte(pmd));
> > +}
> > +
> > +static inline vm_fault_t arch_do_page_fault_on_access(struct vm_fault *vmf)
> > +{
> > + if (tag_storage_enabled())
> > + return handle_page_missing_tag_storage(vmf);
> > + return VM_FAULT_SIGBUS;
> > +}
> > +#endif /* CONFIG_ARCH_HAS_FAULT_ON_ACCESS */
> > +
> > #define pmd_present_invalid(pmd) (!!(pmd_val(pmd) & PMD_PRESENT_INVALID))
> >
> > static inline int pmd_present(pmd_t pmd)
> > @@ -533,7 +553,7 @@ static inline void __set_pte_at(struct mm_struct *mm,
> > unsigned long __always_unused addr,
> > pte_t *ptep, pte_t pte, unsigned int nr)
> > {
> > - __sync_cache_and_tags(pte, nr);
> > + __sync_cache_and_tags(&pte, nr);
> > __check_safe_pte_update(mm, ptep, pte);
> > set_pte(ptep, pte);
> > }
> > @@ -828,8 +848,8 @@ static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
> > * in MAIR_EL1. The mask below has to include PTE_ATTRINDX_MASK.
> > */
> > const pteval_t mask = PTE_USER | PTE_PXN | PTE_UXN | PTE_RDONLY |
> > - PTE_PROT_NONE | PTE_VALID | PTE_WRITE | PTE_GP |
> > - PTE_ATTRINDX_MASK;
> > + PTE_PROT_NONE | PTE_TAG_STORAGE_NONE | PTE_VALID |
> > + PTE_WRITE | PTE_GP | PTE_ATTRINDX_MASK;
> > /* preserve the hardware dirty information */
> > if (pte_hw_dirty(pte))
> > pte = set_pte_bit(pte, __pgprot(PTE_DIRTY));
> > diff --git a/arch/arm64/kernel/mte.c b/arch/arm64/kernel/mte.c
> > index a41ef3213e1e..5962bab1d549 100644
> > --- a/arch/arm64/kernel/mte.c
> > +++ b/arch/arm64/kernel/mte.c
> > @@ -21,6 +21,7 @@
> > #include <asm/barrier.h>
> > #include <asm/cpufeature.h>
> > #include <asm/mte.h>
> > +#include <asm/mte_tag_storage.h>
> > #include <asm/ptrace.h>
> > #include <asm/sysreg.h>
> >
> > @@ -35,13 +36,18 @@ DEFINE_STATIC_KEY_FALSE(mte_async_or_asymm_mode);
> > EXPORT_SYMBOL_GPL(mte_async_or_asymm_mode);
> > #endif
> >
> > -void mte_sync_tags(pte_t pte, unsigned int nr_pages)
> > +void mte_sync_tags(pte_t *pteval, unsigned int nr_pages)
> > {
> > - struct page *page = pte_page(pte);
> > + struct page *page = pte_page(*pteval);
> > unsigned int i;
> >
> > - /* if PG_mte_tagged is set, tags have already been initialised */
> > for (i = 0; i < nr_pages; i++, page++) {
> > + if (tag_storage_enabled() && unlikely(!page_tag_storage_reserved(page))) {
> > + *pteval = pte_modify(*pteval, PAGE_FAULT_ON_ACCESS);
> > + continue;
> > + }
> > +
> > + /* if PG_mte_tagged is set, tags have already been initialised */
> > if (try_page_mte_tagging(page)) {
> > mte_clear_page_tags(page_address(page));
> > set_page_mte_tagged(page);
> > diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
> > index acbc7530d2b2..f5fa583acf18 100644
> > --- a/arch/arm64/mm/fault.c
> > +++ b/arch/arm64/mm/fault.c
> > @@ -19,6 +19,7 @@
> > #include <linux/kprobes.h>
> > #include <linux/uaccess.h>
> > #include <linux/page-flags.h>
> > +#include <linux/page-isolation.h>
> > #include <linux/sched/signal.h>
> > #include <linux/sched/debug.h>
> > #include <linux/highmem.h>
> > @@ -953,3 +954,103 @@ void tag_clear_highpage(struct page *page)
> > mte_zero_clear_page_tags(page_address(page));
> > set_page_mte_tagged(page);
> > }
> > +
> > +#ifdef CONFIG_ARM64_MTE_TAG_STORAGE
> > +vm_fault_t handle_page_missing_tag_storage(struct vm_fault *vmf)
> > +{
> > + struct vm_area_struct *vma = vmf->vma;
> > + struct page *page = NULL;
> > + pte_t new_pte, old_pte;
> > + bool writable = false;
> > + vm_fault_t err;
> > + int ret;
> > +
> > + spin_lock(vmf->ptl);
> > + if (unlikely(!pte_same(*vmf->pte, vmf->orig_pte))) {
> > + pte_unmap_unlock(vmf->pte, vmf->ptl);
> > + return 0;
> > + }
> > +
> > + /* Get the normal PTE */
> > + old_pte = ptep_get(vmf->pte);
> > + new_pte = pte_modify(old_pte, vma->vm_page_prot);
> > +
> > + /*
> > + * Detect now whether the PTE could be writable; this information
> > + * is only valid while holding the PT lock.
> > + */
> > + writable = pte_write(new_pte);
> > + if (!writable && vma_wants_manual_pte_write_upgrade(vma) &&
> > + can_change_pte_writable(vma, vmf->address, new_pte))
> > + writable = true;
> > +
> > + page = vm_normal_page(vma, vmf->address, new_pte);
> > + if (!page || is_zone_device_page(page))
> > + goto out_map;
> > +
> > + /*
> > + * This should never happen, once a VMA has been marked as tagged, that
> > + * cannot be changed.
> > + */
> > + if (!(vma->vm_flags & VM_MTE))
> > + goto out_map;
> > +
> > + /* Prevent the page from being unmapped from under us. */
> > + get_page(page);
> > + vma_set_access_pid_bit(vma);
> > +
> > + /*
> > + * Pairs with pte_offset_map_nolock(), which takes the RCU read lock,
> > + * and spin_lock() above which takes the ptl lock. Both locks should be
> > + * balanced after this point.
> > + */
> > + pte_unmap_unlock(vmf->pte, vmf->ptl);
> > +
> > + /*
> > + * Probably the page is being isolated for migration, replay the fault
> > + * to give time for the entry to be replaced by a migration pte.
> > + */
> > + if (unlikely(is_migrate_isolate_page(page)))
> > + goto out_retry;
> > +
> > + ret = reserve_tag_storage(page, 0, GFP_HIGHUSER_MOVABLE);
> > + if (ret)
> > + goto out_retry;
> > +
> > + put_page(page);
> > +
> > + vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address, &vmf->ptl);
> > + if (unlikely(!pte_same(*vmf->pte, vmf->orig_pte))) {
> > + pte_unmap_unlock(vmf->pte, vmf->ptl);
> > + return 0;
> > + }
> > +
> > +out_map:
> > + /*
> > + * Make it present again, depending on how arch implements
> > + * non-accessible ptes, some can allow access by kernel mode.
> > + */
> > + old_pte = ptep_modify_prot_start(vma, vmf->address, vmf->pte);
> > + new_pte = pte_modify(old_pte, vma->vm_page_prot);
> > + new_pte = pte_mkyoung(new_pte);
> > + if (writable)
> > + new_pte = pte_mkwrite(new_pte, vma);
> > + ptep_modify_prot_commit(vma, vmf->address, vmf->pte, old_pte, new_pte);
> > + update_mmu_cache(vma, vmf->address, vmf->pte);
> > + pte_unmap_unlock(vmf->pte, vmf->ptl);
> > +
> > + return 0;
> > +
> > +out_retry:
> > + put_page(page);
> > + if (vmf->flags & FAULT_FLAG_VMA_LOCK)
> > + vma_end_read(vma);
> > + if (fault_flag_allow_retry_first(vmf->flags)) {
> > + err = VM_FAULT_RETRY;
> > + } else {
> > + /* Replay the fault. */
> > + err = 0;
>
> Hello!
>
> Unfortunately, if the page continues to be pinned, it seems like fault will continue to occur.
> I guess it makes system stability issue. (but I'm not familiar with that, so please let me know if I'm mistaken!)
>
> How about migrating the page when migration problem repeats.
Yes, I had the same though in the previous iteration of the series, the
page was migrated out of the VMA if tag storage couldn't be reserved.
Only short term pins are allowed on MIGRATE_CMA pages, so I expect that the
pin will be released before the fault is replayed. Because of this, and
because it makes the code simpler, I chose not to migrate the page if tag
storage couldn't be reserved.
I'd be happy to revisit this if it turns out that in the real world
replaying the fault happens often enough that migrating the page is faster.
In fact, statistics about how often the fault is replayed and how long that
takes would be very helpful.
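Something as simple as this would probably be enough to get a first idea (a
rough instrumentation sketch, not part of the series; the counter name is
made up):

/*
 * Count how many times the fault is replayed because tag storage could not
 * be reserved (hypothetical counter, for experiments only).
 */
static atomic_long_t tag_storage_fault_replays = ATOMIC_LONG_INIT(0);

	/* ... on the out_retry path in handle_page_missing_tag_storage() ... */
	atomic_long_inc(&tag_storage_fault_replays);
	trace_printk("tag storage fault replays: %ld\n",
		     atomic_long_read(&tag_storage_fault_replays));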
Thanks,
Alex
>
> Thanks,
> Regards.
>
> > + }
> > + return err;
> > +}
> > +#endif
> > diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> > index ffdb9b6bed6c..e2c761dd6c41 100644
> > --- a/include/linux/pgtable.h
> > +++ b/include/linux/pgtable.h
> > @@ -1458,6 +1458,23 @@ static inline int pmd_protnone(pmd_t pmd)
> > }
> > #endif /* CONFIG_NUMA_BALANCING */
> >
> > +#ifndef CONFIG_ARCH_HAS_FAULT_ON_ACCESS
> > +static inline bool fault_on_access_pte(pte_t pte)
> > +{
> > + return false;
> > +}
> > +
> > +static inline bool fault_on_access_pmd(pmd_t pmd)
> > +{
> > + return false;
> > +}
> > +
> > +static inline vm_fault_t arch_do_page_fault_on_access(struct vm_fault *vmf)
> > +{
> > + return VM_FAULT_SIGBUS;
> > +}
> > +#endif
> > +
> > #endif /* CONFIG_MMU */
> >
> > #ifdef CONFIG_HAVE_ARCH_HUGE_VMAP
> > diff --git a/mm/Kconfig b/mm/Kconfig
> > index 89971a894b60..a90eefc3ee80 100644
> > --- a/mm/Kconfig
> > +++ b/mm/Kconfig
> > @@ -1019,6 +1019,9 @@ config IDLE_PAGE_TRACKING
> > config ARCH_HAS_CACHE_LINE_SIZE
> > bool
> >
> > +config ARCH_HAS_FAULT_ON_ACCESS
> > + bool
> > +
> > config ARCH_HAS_CURRENT_STACK_POINTER
> > bool
> > help
> > diff --git a/mm/memory.c b/mm/memory.c
> > index e137f7673749..a04a971200b9 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -5044,6 +5044,9 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
> > if (!pte_present(vmf->orig_pte))
> > return do_swap_page(vmf);
> >
> > + if (fault_on_access_pte(vmf->orig_pte) && vma_is_accessible(vmf->vma))
> > + return arch_do_page_fault_on_access(vmf);
> > +
> > if (pte_protnone(vmf->orig_pte) && vma_is_accessible(vmf->vma))
> > return do_numa_page(vmf);
> >
> > --
> > 2.42.1
> >
> >
>>> +
>>> +out_retry:
>>> + put_page(page);
>>> + if (vmf->flags & FAULT_FLAG_VMA_LOCK)
>>> + vma_end_read(vma);
>>> + if (fault_flag_allow_retry_first(vmf->flags)) {
>>> + err = VM_FAULT_RETRY;
>>> + } else {
>>> + /* Replay the fault. */
>>> + err = 0;
>>
>> Hello!
>>
>> Unfortunately, if the page continues to be pinned, it seems like fault will continue to occur.
>> I guess it makes system stability issue. (but I'm not familiar with that, so please let me know if I'm mistaken!)
>>
>> How about migrating the page when migration problem repeats.
>
> Yes, I had the same though in the previous iteration of the series, the
> page was migrated out of the VMA if tag storage couldn't be reserved.
>
> Only short term pins are allowed on MIGRATE_CMA pages, so I expect that the
> pin will be released before the fault is replayed. Because of this, and
> because it makes the code simpler, I chose not to migrate the page if tag
> storage couldn't be reserved.
There are still some cases that are theoretically problematic:
vmsplice() can pin pages forever and doesn't use FOLL_LONGTERM yet.
All these things also affect other users that rely on movability (e.g.,
CMA, memory hotunplug).
--
Cheers,
David / dhildenb
Hi,
On Thu, Nov 30, 2023 at 01:49:34PM +0100, David Hildenbrand wrote:
> > > > +
> > > > +out_retry:
> > > > + put_page(page);
> > > > + if (vmf->flags & FAULT_FLAG_VMA_LOCK)
> > > > + vma_end_read(vma);
> > > > + if (fault_flag_allow_retry_first(vmf->flags)) {
> > > > + err = VM_FAULT_RETRY;
> > > > + } else {
> > > > + /* Replay the fault. */
> > > > + err = 0;
> > >
> > > Hello!
> > >
> > > Unfortunately, if the page continues to be pinned, it seems like fault will continue to occur.
> > > I guess it makes system stability issue. (but I'm not familiar with that, so please let me know if I'm mistaken!)
> > >
> > > How about migrating the page when migration problem repeats.
> >
> > Yes, I had the same though in the previous iteration of the series, the
> > page was migrated out of the VMA if tag storage couldn't be reserved.
> >
> > Only short term pins are allowed on MIGRATE_CMA pages, so I expect that the
> > pin will be released before the fault is replayed. Because of this, and
> > because it makes the code simpler, I chose not to migrate the page if tag
> > storage couldn't be reserved.
>
> There are still some cases that are theoretically problematic: vmsplice()
> can pin pages forever and doesn't use FOLL_LONGTERM yet.
>
> All these things also affect other users that rely on movability (e.g., CMA,
> memory hotunplug).
I wasn't aware of that, thank you for the information. Then to ensure that the
process doesn't hang by replaying the fault indefinitely, I'll migrate the page if
tag storage cannot be reserved. Looking over the code again, I think I can reuse
the same function that migrates tag storage pages out of the MTE VMA (added in
patch #21), so no major changes needed.
Thanks,
Alex
>
> --
> Cheers,
>
> David / dhildenb
>
>
On 30.11.23 14:32, Alexandru Elisei wrote:
> Hi,
>
> On Thu, Nov 30, 2023 at 01:49:34PM +0100, David Hildenbrand wrote:
>>>>> +
>>>>> +out_retry:
>>>>> + put_page(page);
>>>>> + if (vmf->flags & FAULT_FLAG_VMA_LOCK)
>>>>> + vma_end_read(vma);
>>>>> + if (fault_flag_allow_retry_first(vmf->flags)) {
>>>>> + err = VM_FAULT_RETRY;
>>>>> + } else {
>>>>> + /* Replay the fault. */
>>>>> + err = 0;
>>>>
>>>> Hello!
>>>>
>>>> Unfortunately, if the page continues to be pinned, it seems like fault will continue to occur.
>>>> I guess it makes system stability issue. (but I'm not familiar with that, so please let me know if I'm mistaken!)
>>>>
>>>> How about migrating the page when migration problem repeats.
>>>
>>> Yes, I had the same though in the previous iteration of the series, the
>>> page was migrated out of the VMA if tag storage couldn't be reserved.
>>>
>>> Only short term pins are allowed on MIGRATE_CMA pages, so I expect that the
>>> pin will be released before the fault is replayed. Because of this, and
>>> because it makes the code simpler, I chose not to migrate the page if tag
>>> storage couldn't be reserved.
>>
>> There are still some cases that are theoretically problematic: vmsplice()
>> can pin pages forever and doesn't use FOLL_LONGTERM yet.
>>
>> All these things also affect other users that rely on movability (e.g., CMA,
>> memory hotunplug).
>
> I wasn't aware of that, thank you for the information. Then to ensure that the
> process doesn't hang by replaying the fault indefinitely, I'll migrate the page if
> tag storage cannot be reserved. Looking over the code again, I think I can reuse
> the same function that migrates tag storage pages out of the MTE VMA (added in
> patch #21), so no major changes needed.
It's going to be interesting if migrating that page fails because it is
pinned :/
--
Cheers,
David / dhildenb
On Thu, Nov 30, 2023 at 02:43:48PM +0100, David Hildenbrand wrote:
> On 30.11.23 14:32, Alexandru Elisei wrote:
> > Hi,
> >
> > On Thu, Nov 30, 2023 at 01:49:34PM +0100, David Hildenbrand wrote:
> > > > > > +
> > > > > > +out_retry:
> > > > > > + put_page(page);
> > > > > > + if (vmf->flags & FAULT_FLAG_VMA_LOCK)
> > > > > > + vma_end_read(vma);
> > > > > > + if (fault_flag_allow_retry_first(vmf->flags)) {
> > > > > > + err = VM_FAULT_RETRY;
> > > > > > + } else {
> > > > > > + /* Replay the fault. */
> > > > > > + err = 0;
> > > > >
> > > > > Hello!
> > > > >
> > > > > Unfortunately, if the page continues to be pinned, it seems like fault will continue to occur.
> > > > > I guess it makes system stability issue. (but I'm not familiar with that, so please let me know if I'm mistaken!)
> > > > >
> > > > > How about migrating the page when migration problem repeats.
> > > >
> > > > Yes, I had the same though in the previous iteration of the series, the
> > > > page was migrated out of the VMA if tag storage couldn't be reserved.
> > > >
> > > > Only short term pins are allowed on MIGRATE_CMA pages, so I expect that the
> > > > pin will be released before the fault is replayed. Because of this, and
> > > > because it makes the code simpler, I chose not to migrate the page if tag
> > > > storage couldn't be reserved.
> > >
> > > There are still some cases that are theoretically problematic: vmsplice()
> > > can pin pages forever and doesn't use FOLL_LONGTERM yet.
> > >
> > > All these things also affect other users that rely on movability (e.g., CMA,
> > > memory hotunplug).
> >
> > I wasn't aware of that, thank you for the information. Then to ensure that the
> > process doesn't hang by replaying the fault indefinitely, I'll migrate the page if
> > tag storage cannot be reserved. Looking over the code again, I think I can reuse
> > the same function that migrates tag storage pages out of the MTE VMA (added in
> > patch #21), so no major changes needed.
>
> It's going to be interesting if migrating that page fails because it is
> pinned :/
I imagine that having both the page **and** its tag storage pinned longterm
without FOLL_LONGTERM is going to be exceedingly rare.
Am I mistaken in believing that the problematic vmsplice() behaviour is
recognized as something that needs to be fixed?
Thanks,
Alex
>
> --
> Cheers,
>
> David / dhildenb
>
On 30.11.23 15:33, Alexandru Elisei wrote:
> On Thu, Nov 30, 2023 at 02:43:48PM +0100, David Hildenbrand wrote:
>> On 30.11.23 14:32, Alexandru Elisei wrote:
>>> Hi,
>>>
>>> On Thu, Nov 30, 2023 at 01:49:34PM +0100, David Hildenbrand wrote:
>>>>>>> +
>>>>>>> +out_retry:
>>>>>>> + put_page(page);
>>>>>>> + if (vmf->flags & FAULT_FLAG_VMA_LOCK)
>>>>>>> + vma_end_read(vma);
>>>>>>> + if (fault_flag_allow_retry_first(vmf->flags)) {
>>>>>>> + err = VM_FAULT_RETRY;
>>>>>>> + } else {
>>>>>>> + /* Replay the fault. */
>>>>>>> + err = 0;
>>>>>>
>>>>>> Hello!
>>>>>>
>>>>>> Unfortunately, if the page continues to be pinned, it seems like fault will continue to occur.
>>>>>> I guess it makes system stability issue. (but I'm not familiar with that, so please let me know if I'm mistaken!)
>>>>>>
>>>>>> How about migrating the page when migration problem repeats.
>>>>>
>>>>> Yes, I had the same though in the previous iteration of the series, the
>>>>> page was migrated out of the VMA if tag storage couldn't be reserved.
>>>>>
>>>>> Only short term pins are allowed on MIGRATE_CMA pages, so I expect that the
>>>>> pin will be released before the fault is replayed. Because of this, and
>>>>> because it makes the code simpler, I chose not to migrate the page if tag
>>>>> storage couldn't be reserved.
>>>>
>>>> There are still some cases that are theoretically problematic: vmsplice()
>>>> can pin pages forever and doesn't use FOLL_LONGTERM yet.
>>>>
>>>> All these things also affect other users that rely on movability (e.g., CMA,
>>>> memory hotunplug).
>>>
>>> I wasn't aware of that, thank you for the information. Then to ensure that the
>>> process doesn't hang by replaying the fault indefinitely, I'll migrate the page if
>>> tag storage cannot be reserved. Looking over the code again, I think I can reuse
>>> the same function that migrates tag storage pages out of the MTE VMA (added in
>>> patch #21), so no major changes needed.
>>
>> It's going to be interesting if migrating that page fails because it is
>> pinned :/
>
> I imagine that having both the page **and** its tag storage pinned longterm
> without FOLL_LONGTERM is going to be exceedingly rare.
Yes. I recall that the rule of thumb is that some O_DIRECT I/O can take
up to 10 seconds, although extremely rare (and maybe not applicable on
arm64).
>
> Am I mistaken in believing that the problematic vmsplice() behaviour is
> recognized as something that needs to be fixed?
Yes, I've been hoping for a couple of years that this will actually get fixed, now
that O_DIRECT mostly uses FOLL_PIN instead of FOLL_GET.
--
Cheers,
David / dhildenb
Hi Peter,
On Mon, Nov 27, 2023 at 09:39:17PM -0800, Peter Collingbourne wrote:
> Hi Alexandru,
>
> On Sun, Nov 19, 2023 at 8:59 AM Alexandru Elisei
> <[email protected]> wrote:
> >
> > Signed-off-by: Alexandru Elisei <[email protected]>
> > ---
> > arch/arm64/include/asm/mte_tag_storage.h | 1 +
> > arch/arm64/kernel/mte_tag_storage.c | 15 +++++++
> > arch/arm64/mm/fault.c | 55 ++++++++++++++++++++++++
> > include/linux/migrate.h | 8 +++-
> > include/linux/migrate_mode.h | 1 +
> > mm/internal.h | 6 ---
> > 6 files changed, 78 insertions(+), 8 deletions(-)
> >
> > diff --git a/arch/arm64/include/asm/mte_tag_storage.h b/arch/arm64/include/asm/mte_tag_storage.h
> > index b97406d369ce..6a8b19a6a758 100644
> > --- a/arch/arm64/include/asm/mte_tag_storage.h
> > +++ b/arch/arm64/include/asm/mte_tag_storage.h
> > @@ -33,6 +33,7 @@ int reserve_tag_storage(struct page *page, int order, gfp_t gfp);
> > void free_tag_storage(struct page *page, int order);
> >
> > bool page_tag_storage_reserved(struct page *page);
> > +bool page_is_tag_storage(struct page *page);
> >
> > vm_fault_t handle_page_missing_tag_storage(struct vm_fault *vmf);
> > vm_fault_t handle_huge_page_missing_tag_storage(struct vm_fault *vmf);
> > diff --git a/arch/arm64/kernel/mte_tag_storage.c b/arch/arm64/kernel/mte_tag_storage.c
> > index a1cc239f7211..5096ce859136 100644
> > --- a/arch/arm64/kernel/mte_tag_storage.c
> > +++ b/arch/arm64/kernel/mte_tag_storage.c
> > @@ -500,6 +500,21 @@ bool page_tag_storage_reserved(struct page *page)
> > return test_bit(PG_tag_storage_reserved, &page->flags);
> > }
> >
> > +bool page_is_tag_storage(struct page *page)
> > +{
> > + unsigned long pfn = page_to_pfn(page);
> > + struct range *tag_range;
> > + int i;
> > +
> > + for (i = 0; i < num_tag_regions; i++) {
> > + tag_range = &tag_regions[i].tag_range;
> > + if (tag_range->start <= pfn && pfn <= tag_range->end)
> > + return true;
> > + }
> > +
> > + return false;
> > +}
> > +
> > int reserve_tag_storage(struct page *page, int order, gfp_t gfp)
> > {
> > unsigned long start_block, end_block;
> > diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
> > index 6730a0812a24..964c5ae161a3 100644
> > --- a/arch/arm64/mm/fault.c
> > +++ b/arch/arm64/mm/fault.c
> > @@ -12,6 +12,7 @@
> > #include <linux/extable.h>
> > #include <linux/kfence.h>
> > #include <linux/signal.h>
> > +#include <linux/migrate.h>
> > #include <linux/mm.h>
> > #include <linux/hardirq.h>
> > #include <linux/init.h>
> > @@ -956,6 +957,50 @@ void tag_clear_highpage(struct page *page)
> > }
> >
> > #ifdef CONFIG_ARM64_MTE_TAG_STORAGE
> > +
> > +#define MR_TAGGED_TAG_STORAGE MR_ARCH_1
> > +
> > +extern bool isolate_lru_page(struct page *page);
> > +extern void putback_movable_pages(struct list_head *l);
>
> Could we move these declarations to a non-mm-internal header and
> #include it instead of manually declaring them here?
Yes, that's better than this hackish way of doing it.
>
> > +
> > +/* Returns with the page reference dropped. */
> > +static void migrate_tag_storage_page(struct page *page)
> > +{
> > + struct migration_target_control mtc = {
> > + .nid = NUMA_NO_NODE,
> > + .gfp_mask = GFP_HIGHUSER_MOVABLE | __GFP_TAGGED,
> > + };
> > + unsigned long i, nr_pages = compound_nr(page);
> > + LIST_HEAD(pagelist);
> > + int ret, tries;
> > +
> > + lru_cache_disable();
> > +
> > + for (i = 0; i < nr_pages; i++) {
> > + if (!isolate_lru_page(page + i)) {
> > + ret = -EAGAIN;
> > + goto out;
> > + }
> > + /* Isolate just grabbed another reference, drop ours. */
> > + put_page(page + i);
> > + list_add_tail(&(page + i)->lru, &pagelist);
> > + }
> > +
> > + tries = 5;
> > + while (tries--) {
> > + ret = migrate_pages(&pagelist, alloc_migration_target, NULL, (unsigned long)&mtc,
> > + MIGRATE_SYNC, MR_TAGGED_TAG_STORAGE, NULL);
> > + if (ret == 0 || ret != -EBUSY)
>
> This could be simplified to:
>
> if (ret != -EBUSY)
Indeed! I can do the same thing in reserve_tag_storage(), in the loop where I
call alloc_contig_range().
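Something along these lines for the reserve_tag_storage() loop (a sketch
with details elided; the surrounding variables and the error label are
approximations of what is in the patch):

	for (tries = 0; tries < 5; tries++) {
		ret = alloc_contig_range(block, block + region->block_size,
					 MIGRATE_CMA, gfp);
		if (ret != -EBUSY)
			break;
		cond_resched();
	}
	if (ret)
		goto out_error;	/* hypothetical error label */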
Thanks,
Alex
Hi,
On Wed, Nov 29, 2023 at 05:44:24PM +0900, Hyesoo Yu wrote:
> Hello.
>
> On Sun, Nov 19, 2023 at 04:57:05PM +0000, Alexandru Elisei wrote:
> > Allow the kernel to get the size and location of the MTE tag storage
> > regions from the DTB. This memory is marked as reserved for now.
> >
> > The DTB node for the tag storage region is defined as:
> >
> > tags0: tag-storage@8f8000000 {
> > compatible = "arm,mte-tag-storage";
> > reg = <0x08 0xf8000000 0x00 0x4000000>;
> > block-size = <0x1000>;
> > memory = <&memory0>; // Associated tagged memory node
> > };
> >
>
> How about using compatible = "shared-dma-pool" like below ?
>
> &reserved_memory {
> tags0: tag0@8f8000000 {
> compatible = "arm,mte-tag-storage";
> reg = <0x08 0xf8000000 0x00 0x4000000>;
> };
> }
>
> tag-storage {
> compatible = "arm,mte-tag-storage";
> memory-region = <&tag>;
> memory = <&memory0>;
> block-size = <0x1000>;
> }
>
> And then, the activation of CMA would be performed in the CMA code.
> We can just get the region information from memory-region and allocate it directly
> using something like alloc_contig_range or take_page_off_buddy. It seems like we can remove a lot of code.
Played with reserved_mem a bit. I don't think that's the correct path
forward.
The location of the tag storage is a hardware property, independent of how
Linux is configured.
early_init_fdt_scan_reserved_mem() is called from arm64_memblock_init(),
**after** the kernel enforces an upper address for various reasons. One of
the reasons can be that the kernel has been compiled with a 39-bit VA.
After early_init_fdt_scan_reserved_mem() returns, the kernel sets the
maximum address, stored in the variable "high_memory".
What can happen is that tag storage is present at an address above the
maximum addressable by the kernel, and the CMA code will trigger an
unrecoverable page fault.
I was able to trigger this with the dts change:
diff --git a/arch/arm64/boot/dts/arm/fvp-base-revc.dts b/arch/arm64/boot/dts/arm/fvp-base-revc.dts
index 60472d65a355..201359d014e4 100644
--- a/arch/arm64/boot/dts/arm/fvp-base-revc.dts
+++ b/arch/arm64/boot/dts/arm/fvp-base-revc.dts
@@ -183,6 +183,13 @@ vram: vram@18000000 {
reg = <0x00000000 0x18000000 0 0x00800000>;
no-map;
};
+
+
+ linux,cma {
+ compatible = "shared-dma-pool";
+ reg = <0x100 0x0 0x00 0x4000000>;
+ reusable;
+ };
};
gic: interrupt-controller@2f000000 {
And the error I got:
[ 0.000000] Reserved memory: created CMA memory pool at 0x0000010000000000, size 64 MiB
[ 0.000000] OF: reserved mem: initialized node linux,cma, compatible id shared-dma-pool
[ 0.000000] OF: reserved mem: 0x0000010000000000..0x0000010003ffffff (65536 KiB) map reusable linux,cma
[..]
[ 0.793193] WARNING: CPU: 0 PID: 1 at mm/cma.c:111 cma_init_reserved_areas+0xa8/0x378
[..]
[ 0.806945] Unable to handle kernel paging request at virtual address 00000001fe000000
[ 0.807277] Mem abort info:
[ 0.807277] ESR = 0x0000000096000005
[ 0.807693] EC = 0x25: DABT (current EL), IL = 32 bits
[ 0.808110] SET = 0, FnV = 0
[ 0.808443] EA = 0, S1PTW = 0
[ 0.808526] FSC = 0x05: level 1 translation fault
[ 0.808943] Data abort info:
[ 0.808943] ISV = 0, ISS = 0x00000005, ISS2 = 0x00000000
[ 0.809360] CM = 0, WnR = 0, TnD = 0, TagAccess = 0
[ 0.809776] GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
[ 0.810221] [00000001fe000000] user address but active_mm is swapper
[..]
[ 0.820887] Call trace:
[ 0.821027] cma_init_reserved_areas+0xc4/0x378
[ 0.821443] do_one_initcall+0x7c/0x1c0
[ 0.821860] kernel_init_freeable+0x1bc/0x284
[ 0.822277] kernel_init+0x24/0x1dc
[ 0.822693] ret_from_fork+0x10/0x20
[ 0.823554] Code: 9127a29a cb813321 d37ae421 8b030020 (f8636822)
[ 0.823554] ---[ end trace 0000000000000000 ]---
[ 0.824360] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
[ 0.824443] SMP: stopping secondary CPUs
[ 0.825193] ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b ]---
Should reserved mem check if the reserved memory is actually addressable by
the kernel if it's not "no-map"? Should cma fail gracefully if
!pfn_valid(base_pfn)? Should early_init_fdt_scan_reserved_mem() be moved
because of arm64_bootmem_init()? I don't have the answer to any of those. And
I got a kernel panic because the kernel cannot address that memory (39-bit
VA). I don't know what would happen if the upper limit is reduced for
another reason.
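For the middle question, failing gracefully could be something as simple as
this in the CMA activation path (a sketch only, not actual mm/cma.c code;
the error label is hypothetical):

	if (!pfn_valid(cma->base_pfn)) {
		pr_err("CMA area %s: base pfn 0x%lx is not valid\n",
		       cma->name, cma->base_pfn);
		goto out_error;
	}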
What I think should happen:
1. Add the tag storage memory before any limits are enforced by
arm64_bootmem_init().
2. Call cma_declare_contiguous_nid() after arm64_bootmem_init(), because
the function will check the memory limit.
3. Have an arch initcall that checks that the CMA regions corresponding to
the tag storage have been activated successfully (cma_init_reserved_areas()
is a core initcall). If not, then don't enable tag storage.
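To make step 3 concrete, something along these lines (a rough sketch; the
"cma" member in the region struct and the function name are hypothetical):

/* Runs after cma_init_reserved_areas(), which is a core_initcall(). */
static int __init mte_tag_storage_check_cma(void)
{
	int i;

	for (i = 0; i < num_tag_regions; i++) {
		struct cma *cma = tag_regions[i].cma;
		struct page *page;

		/* A test allocation fails if the area was not activated. */
		page = cma_alloc(cma, 1, 0, true);
		if (!page) {
			pr_warn("Tag storage CMA area %d unusable, disabling tag storage\n", i);
			/* disable tag storage here */
			return 0;
		}
		cma_release(cma, page, 1);
	}

	return 0;
}
arch_initcall(mte_tag_storage_check_cma);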
How does that sound to you?
Thanks,
Alex
>
> > The tag storage region represents the largest contiguous memory region that
> > holds all the tags for the associated contiguous memory region which can be
> > tagged. For example, for a 32GB contiguous tagged memory the corresponding
> > tag storage region is 1GB of contiguous memory, not two adjacent 512M of
> > tag storage memory.
> >
> > "block-size" represents the minimum multiple of 4K of tag storage where all
> > the tags stored in the block correspond to a contiguous memory region. This
> > is needed for platforms where the memory controller interleaves tag writes
> > to memory. For example, if the memory controller interleaves tag writes for
> > 256KB of contiguous memory across 8K of tag storage (2-way interleave),
> > then the correct value for "block-size" is 0x2000. This value is a hardware
> > property, independent of the selected kernel page size.
> >
>
> Is this considered for kernel page sizes like 16K or 64K pages? The comment says
> it should be a multiple of 4K, but more accurately it should be a multiple of the "page size".
> Please let me know if there's anything I misunderstood. :-)
>
>
> > Signed-off-by: Alexandru Elisei <[email protected]>
> > ---
> > arch/arm64/Kconfig | 12 ++
> > arch/arm64/include/asm/mte_tag_storage.h | 15 ++
> > arch/arm64/kernel/Makefile | 1 +
> > arch/arm64/kernel/mte_tag_storage.c | 256 +++++++++++++++++++++++
> > arch/arm64/kernel/setup.c | 7 +
> > 5 files changed, 291 insertions(+)
> > create mode 100644 arch/arm64/include/asm/mte_tag_storage.h
> > create mode 100644 arch/arm64/kernel/mte_tag_storage.c
> >
> > diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> > index 7b071a00425d..fe8276fdc7a8 100644
> > --- a/arch/arm64/Kconfig
> > +++ b/arch/arm64/Kconfig
> > @@ -2062,6 +2062,18 @@ config ARM64_MTE
> >
> > Documentation/arch/arm64/memory-tagging-extension.rst.
> >
> > +if ARM64_MTE
> > +config ARM64_MTE_TAG_STORAGE
> > + bool "Dynamic MTE tag storage management"
> > + help
> > + Adds support for dynamic management of the memory used by the hardware
> > + for storing MTE tags. This memory, unlike normal memory, cannot be
> > + tagged. When it is used to store tags for another memory location it
> > + cannot be used for any type of allocation.
> > +
> > + If unsure, say N
> > +endif # ARM64_MTE
> > +
> > endmenu # "ARMv8.5 architectural features"
> >
> > menu "ARMv8.7 architectural features"
> > diff --git a/arch/arm64/include/asm/mte_tag_storage.h b/arch/arm64/include/asm/mte_tag_storage.h
> > new file mode 100644
> > index 000000000000..8f86c4f9a7c3
> > --- /dev/null
> > +++ b/arch/arm64/include/asm/mte_tag_storage.h
> > @@ -0,0 +1,15 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +/*
> > + * Copyright (C) 2023 ARM Ltd.
> > + */
> > +#ifndef __ASM_MTE_TAG_STORAGE_H
> > +#define __ASM_MTE_TAG_STORAGE_H
> > +
> > +#ifdef CONFIG_ARM64_MTE_TAG_STORAGE
> > +void mte_tag_storage_init(void);
> > +#else
> > +static inline void mte_tag_storage_init(void)
> > +{
> > +}
> > +#endif /* CONFIG_ARM64_MTE_TAG_STORAGE */
> > +#endif /* __ASM_MTE_TAG_STORAGE_H */
> > diff --git a/arch/arm64/kernel/Makefile b/arch/arm64/kernel/Makefile
> > index d95b3d6b471a..5f031bf9f8f1 100644
> > --- a/arch/arm64/kernel/Makefile
> > +++ b/arch/arm64/kernel/Makefile
> > @@ -70,6 +70,7 @@ obj-$(CONFIG_CRASH_CORE) += crash_core.o
> > obj-$(CONFIG_ARM_SDE_INTERFACE) += sdei.o
> > obj-$(CONFIG_ARM64_PTR_AUTH) += pointer_auth.o
> > obj-$(CONFIG_ARM64_MTE) += mte.o
> > +obj-$(CONFIG_ARM64_MTE_TAG_STORAGE) += mte_tag_storage.o
> > obj-y += vdso-wrap.o
> > obj-$(CONFIG_COMPAT_VDSO) += vdso32-wrap.o
> > obj-$(CONFIG_UNWIND_PATCH_PAC_INTO_SCS) += patch-scs.o
> > diff --git a/arch/arm64/kernel/mte_tag_storage.c b/arch/arm64/kernel/mte_tag_storage.c
> > new file mode 100644
> > index 000000000000..fa6267ef8392
> > --- /dev/null
> > +++ b/arch/arm64/kernel/mte_tag_storage.c
> > @@ -0,0 +1,256 @@
> > +// SPDX-License-Identifier: GPL-2.0-only
> > +/*
> > + * Support for dynamic tag storage.
> > + *
> > + * Copyright (C) 2023 ARM Ltd.
> > + */
> > +
> > +#include <linux/memblock.h>
> > +#include <linux/mm.h>
> > +#include <linux/of_device.h>
> > +#include <linux/of_fdt.h>
> > +#include <linux/range.h>
> > +#include <linux/string.h>
> > +#include <linux/xarray.h>
> > +
> > +#include <asm/mte_tag_storage.h>
> > +
> > +struct tag_region {
> > + struct range mem_range; /* Memory associated with the tag storage, in PFNs. */
> > + struct range tag_range; /* Tag storage memory, in PFNs. */
> > + u32 block_size; /* Tag block size, in pages. */
> > +};
> > +
> > +#define MAX_TAG_REGIONS 32
> > +
> > +static struct tag_region tag_regions[MAX_TAG_REGIONS];
> > +static int num_tag_regions;
> > +
> > +static int __init tag_storage_of_flat_get_range(unsigned long node, const __be32 *reg,
> > + int reg_len, struct range *range)
> > +{
> > + int addr_cells = dt_root_addr_cells;
> > + int size_cells = dt_root_size_cells;
> > + u64 size;
> > +
> > + if (reg_len / 4 > addr_cells + size_cells)
> > + return -EINVAL;
> > +
> > + range->start = PHYS_PFN(of_read_number(reg, addr_cells));
> > + size = PHYS_PFN(of_read_number(reg + addr_cells, size_cells));
> > + if (size == 0) {
> > + pr_err("Invalid node");
> > + return -EINVAL;
> > + }
> > + range->end = range->start + size - 1;
> > +
> > + return 0;
> > +}
> > +
> > +static int __init tag_storage_of_flat_get_tag_range(unsigned long node,
> > + struct range *tag_range)
> > +{
> > + const __be32 *reg;
> > + int reg_len;
> > +
> > + reg = of_get_flat_dt_prop(node, "reg", &reg_len);
> > + if (reg == NULL) {
> > + pr_err("Invalid metadata node");
> > + return -EINVAL;
> > + }
> > +
> > + return tag_storage_of_flat_get_range(node, reg, reg_len, tag_range);
> > +}
> > +
> > +static int __init tag_storage_of_flat_get_memory_range(unsigned long node, struct range *mem)
> > +{
> > + const __be32 *reg;
> > + int reg_len;
> > +
> > + reg = of_get_flat_dt_prop(node, "linux,usable-memory", &reg_len);
> > + if (reg == NULL)
> > + reg = of_get_flat_dt_prop(node, "reg", &reg_len);
> > +
> > + if (reg == NULL) {
> > + pr_err("Invalid memory node");
> > + return -EINVAL;
> > + }
> > +
> > + return tag_storage_of_flat_get_range(node, reg, reg_len, mem);
> > +}
> > +
> > +struct find_memory_node_arg {
> > + unsigned long node;
> > + u32 phandle;
> > +};
> > +
> > +static int __init fdt_find_memory_node(unsigned long node, const char *uname,
> > + int depth, void *data)
> > +{
> > + const char *type = of_get_flat_dt_prop(node, "device_type", NULL);
> > + struct find_memory_node_arg *arg = data;
> > +
> > + if (depth != 1 || !type || strcmp(type, "memory") != 0)
> > + return 0;
> > +
> > + if (of_get_flat_dt_phandle(node) == arg->phandle) {
> > + arg->node = node;
> > + return 1;
> > + }
> > +
> > + return 0;
> > +}
> > +
> > +static int __init tag_storage_get_memory_node(unsigned long tag_node, unsigned long *mem_node)
> > +{
> > + struct find_memory_node_arg arg = { 0 };
> > + const __be32 *memory_prop;
> > + u32 mem_phandle;
> > + int ret, reg_len;
> > +
> > + memory_prop = of_get_flat_dt_prop(tag_node, "memory", &reg_len);
> > + if (!memory_prop) {
> > + pr_err("Missing 'memory' property in the tag storage node");
> > + return -EINVAL;
> > + }
> > +
> > + mem_phandle = be32_to_cpup(memory_prop);
> > + arg.phandle = mem_phandle;
> > +
> > + ret = of_scan_flat_dt(fdt_find_memory_node, &arg);
> > + if (ret != 1) {
> > + pr_err("Associated memory node not found");
> > + return -EINVAL;
> > + }
> > +
> > + *mem_node = arg.node;
> > +
> > + return 0;
> > +}
> > +
> > +static int __init tag_storage_of_flat_read_u32(unsigned long node, const char *propname,
> > + u32 *retval)
> > +{
> > + const __be32 *reg;
> > +
> > + reg = of_get_flat_dt_prop(node, propname, NULL);
> > + if (!reg)
> > + return -EINVAL;
> > +
> > + *retval = be32_to_cpup(reg);
> > + return 0;
> > +}
> > +
> > +static u32 __init get_block_size_pages(u32 block_size_bytes)
> > +{
> > + u32 a = PAGE_SIZE;
> > + u32 b = block_size_bytes;
> > + u32 r;
> > +
> > + /* Find greatest common divisor using the Euclidian algorithm. */
> > + do {
> > + r = a % b;
> > + a = b;
> > + b = r;
> > + } while (b != 0);
> > +
> > + return PHYS_PFN(PAGE_SIZE * block_size_bytes / a);
> > +}
> > +
> > +static int __init fdt_init_tag_storage(unsigned long node, const char *uname,
> > + int depth, void *data)
> > +{
> > + struct tag_region *region;
> > + unsigned long mem_node;
> > + struct range *mem_range;
> > + struct range *tag_range;
> > + u32 block_size_bytes;
> > + u32 nid = 0;
> > + int ret;
> > +
> > + if (depth != 1 || !strstr(uname, "tag-storage"))
> > + return 0;
> > +
> > + if (!of_flat_dt_is_compatible(node, "arm,mte-tag-storage"))
> > + return 0;
> > +
> > + if (num_tag_regions == MAX_TAG_REGIONS) {
> > + pr_err("Maximum number of tag storage regions exceeded");
> > + return -EINVAL;
> > + }
> > +
> > + region = &tag_regions[num_tag_regions];
> > + mem_range = &region->mem_range;
> > + tag_range = &region->tag_range;
> > +
> > + ret = tag_storage_of_flat_get_tag_range(node, tag_range);
> > + if (ret) {
> > + pr_err("Invalid tag storage node");
> > + return ret;
> > + }
> > +
> > + ret = tag_storage_get_memory_node(node, &mem_node);
> > + if (ret)
> > + return ret;
> > +
> > + ret = tag_storage_of_flat_get_memory_range(mem_node, mem_range);
> > + if (ret) {
> > + pr_err("Invalid address for associated data memory node");
> > + return ret;
> > + }
> > +
> > + /* The tag region must exactly match the corresponding memory. */
> > + if (range_len(tag_range) * 32 != range_len(mem_range)) {
> > + pr_err("Tag storage region 0x%llx-0x%llx does not cover the memory region 0x%llx-0x%llx",
> > + PFN_PHYS(tag_range->start), PFN_PHYS(tag_range->end),
> > + PFN_PHYS(mem_range->start), PFN_PHYS(mem_range->end));
> > + return -EINVAL;
> > + }
> > +
> > + ret = tag_storage_of_flat_read_u32(node, "block-size", &block_size_bytes);
> > + if (ret || block_size_bytes == 0) {
> > + pr_err("Invalid or missing 'block-size' property");
> > + return -EINVAL;
> > + }
> > + region->block_size = get_block_size_pages(block_size_bytes);
> > + if (range_len(tag_range) % region->block_size != 0) {
> > + pr_err("Tag storage region size 0x%llx is not a multiple of block size %u",
> > + PFN_PHYS(range_len(tag_range)), region->block_size);
> > + return -EINVAL;
> > + }
> > +
>
> I was confused by the variable "block_size". The block size declared in the device tree is
> in bytes, but the actual block size used is in pages. I think the term "block_size" can cause
> confusion as it might be interpreted as bytes. If possible, I suggest changing the name "block_size"
> to something more readable, such as "block_nr_pages" (this is just an example!).
>
> Thanks,
> Regards.
>
> > + ret = tag_storage_of_flat_read_u32(mem_node, "numa-node-id", &nid);
> > + if (ret)
> > + nid = numa_node_id();
> > +
> > + ret = memblock_add_node(PFN_PHYS(tag_range->start), PFN_PHYS(range_len(tag_range)),
> > + nid, MEMBLOCK_NONE);
> > + if (ret) {
> > + pr_err("Error adding tag memblock (%d)", ret);
> > + return ret;
> > + }
> > + memblock_reserve(PFN_PHYS(tag_range->start), PFN_PHYS(range_len(tag_range)));
> > +
> > + pr_info("Found tag storage region 0x%llx-0x%llx, block size %u pages",
> > + PFN_PHYS(tag_range->start), PFN_PHYS(tag_range->end), region->block_size);
> > +
> > + num_tag_regions++;
> > +
> > + return 0;
> > +}
> > +
> > +void __init mte_tag_storage_init(void)
> > +{
> > + struct range *tag_range;
> > + int i, ret;
> > +
> > + ret = of_scan_flat_dt(fdt_init_tag_storage, NULL);
> > + if (ret) {
> > + for (i = 0; i < num_tag_regions; i++) {
> > + tag_range = &tag_regions[i].tag_range;
> > + memblock_remove(PFN_PHYS(tag_range->start), PFN_PHYS(range_len(tag_range)));
> > + }
> > + num_tag_regions = 0;
> > + pr_info("MTE tag storage region management disabled");
> > + }
> > +}
> > diff --git a/arch/arm64/kernel/setup.c b/arch/arm64/kernel/setup.c
> > index 417a8a86b2db..1b77138c1aa5 100644
> > --- a/arch/arm64/kernel/setup.c
> > +++ b/arch/arm64/kernel/setup.c
> > @@ -42,6 +42,7 @@
> > #include <asm/cpufeature.h>
> > #include <asm/cpu_ops.h>
> > #include <asm/kasan.h>
> > +#include <asm/mte_tag_storage.h>
> > #include <asm/numa.h>
> > #include <asm/scs.h>
> > #include <asm/sections.h>
> > @@ -342,6 +343,12 @@ void __init __no_sanitize_address setup_arch(char **cmdline_p)
> > FW_BUG "Booted with MMU enabled!");
> > }
> >
> > + /*
> > + * Must be called before memory limits are enforced by
> > + * arm64_memblock_init().
> > + */
> > + mte_tag_storage_init();
> > +
> > arm64_memblock_init();
> >
> > paging_init();
> > --
> > 2.42.1
> >
> >
Hi,
On Fri, Dec 08, 2023 at 02:27:39PM +0900, Hyesoo Yu wrote:
> Hi~
>
> On Thu, Nov 30, 2023 at 12:00:11PM +0000, Alexandru Elisei wrote:
> > Hi,
> >
> > On Wed, Nov 29, 2023 at 05:57:44PM +0900, Hyesoo Yu wrote:
> > > On Sun, Nov 19, 2023 at 04:57:09PM +0000, Alexandru Elisei wrote:
> > > > alloc_contig_range() requires that the requested pages are in the same
> > > > zone. Check that this is indeed the case before initializing the tag
> > > > storage blocks.
> > > >
> > > > Signed-off-by: Alexandru Elisei <[email protected]>
> > > > ---
> > > > arch/arm64/kernel/mte_tag_storage.c | 33 +++++++++++++++++++++++++++++
> > > > 1 file changed, 33 insertions(+)
> > > >
> > > > diff --git a/arch/arm64/kernel/mte_tag_storage.c b/arch/arm64/kernel/mte_tag_storage.c
> > > > index 8b9bedf7575d..fd63430d4dc0 100644
> > > > --- a/arch/arm64/kernel/mte_tag_storage.c
> > > > +++ b/arch/arm64/kernel/mte_tag_storage.c
> > > > @@ -265,6 +265,35 @@ void __init mte_tag_storage_init(void)
> > > > }
> > > > }
> > > >
> > > > +/* alloc_contig_range() requires all pages to be in the same zone. */
> > > > +static int __init mte_tag_storage_check_zone(void)
> > > > +{
> > > > + struct range *tag_range;
> > > > + struct zone *zone;
> > > > + unsigned long pfn;
> > > > + u32 block_size;
> > > > + int i, j;
> > > > +
> > > > + for (i = 0; i < num_tag_regions; i++) {
> > > > + block_size = tag_regions[i].block_size;
> > > > + if (block_size == 1)
> > > > + continue;
> > > > +
> > > > + tag_range = &tag_regions[i].tag_range;
> > > > + for (pfn = tag_range->start; pfn <= tag_range->end; pfn += block_size) {
> > > > + zone = page_zone(pfn_to_page(pfn));
> > >
> > > Hello.
> > >
> > > Since the blocks within the tag_range must all be in the same zone, can we move the "page_zone"
> > > out of the loop ?
> >
> > Hmm.. why do you say that the pages in a tag_range must be in the same
> > zone? I am not very familiar with how the memory management code puts pages
> > into zones, but I would imagine that pages in a tag range straddling the
> > 4GB limit (so, let's say, from 3GB to 5GB) will end up in both ZONE_DMA and
> > ZONE_NORMAL.
> >
> > Thanks,
> > Alex
> >
>
> Oh, I see that reserve_tag_storage only calls alloc_contig_range() in units of block_size;
> I thought it could be called once for the entire range the page needs.
> (Maybe it could be a bit faster ? That way the drain and other operations would not be
> repeated unnecessarily.)
Yes, that might be useful to do. Worth keeping in mind is that:
- a number of block size pages at the start and end of the range might
already be reserved for other tagged pages, so the actual range that is
being reserved might end up being smaller than what we are expecting.
- the most common allocation order is smaller than or equal to
PAGE_ALLOC_COSTLY_ORDER, which is 3, which means that the most common
case is that reserve_tag_storage reserves only one tag storage block.
I will definitely keep this optimization in mind, but I would prefer to get
the series into a more stable shape before looking at performance
optimizations.
>
> If we use the cma code when activating the tag storage, it will be an error if the
> entire tag region is not in the same zone, so there should be a constraint
> that it must be in the same zone when defining the tag region in the device tree.
I don't think that's the best approach, because the device tree describes
the hardware, which does not change, and this is a software limitation
(i.e., CMA doesn't work if a CMA region spans different zones), which might
get fixed in a future version of Linux.
In my opinion, the simplest solution would be to check that all tag storage
regions have been activated successfully by CMA before enabling tag
storage. Another alternative would be to split the tag storage region into
several CMA regions at the zone boundaries, and add them as distinct CMA regions.
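To make the first option more concrete, here is a rough sketch of such a check (not part of the series; it assumes struct tag_region grows a flag recording whether its CMA area activated, and how best to detect an activation failure is still an open question):

/* Sketch only, not part of the series. */
static bool tag_storage_enabled __ro_after_init;

static int __init mte_tag_storage_check_cma(void)
{
        int i;

        for (i = 0; i < num_tag_regions; i++) {
                /* cma_activated is a hypothetical member, set from the CMA
                 * activation path. */
                if (!tag_regions[i].cma_activated) {
                        pr_warn("Tag storage region %d not activated by CMA, tag storage disabled", i);
                        return 0;
                }
        }
        tag_storage_enabled = true;

        return 0;
}
/* Runs after cma_init_reserved_areas(), which is a core initcall. */
arch_initcall(mte_tag_storage_check_cma);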
Thanks,
Alex
>
> Thanks,
> Regards.
>
> > >
> > > Thanks,
> > > Regards.
> > >
> > > > + for (j = 1; j < block_size; j++) {
> > > > + if (page_zone(pfn_to_page(pfn + j)) != zone) {
> > > > + pr_err("Tag storage block pages in different zones");
> > > > + return -EINVAL;
> > > > + }
> > > > + }
> > > > + }
> > > > + }
> > > > +
> > > > + return 0;
> > > > +}
> > > > +
> > > > static int __init mte_tag_storage_activate_regions(void)
> > > > {
> > > > phys_addr_t dram_start, dram_end;
> > > > @@ -321,6 +350,10 @@ static int __init mte_tag_storage_activate_regions(void)
> > > > goto out_disabled;
> > > > }
> > > >
> > > > + ret = mte_tag_storage_check_zone();
> > > > + if (ret)
> > > > + goto out_disabled;
> > > > +
> > > > for (i = 0; i < num_tag_regions; i++) {
> > > > tag_range = &tag_regions[i].tag_range;
> > > > for (pfn = tag_range->start; pfn <= tag_range->end; pfn += pageblock_nr_pages)
> > > > --
> > > > 2.42.1
> > > >
> > > >
> >
> >
> >
Hi,
On Fri, Dec 08, 2023 at 02:03:44PM +0900, Hyesoo Yu wrote:
> Hi,
>
> I'm sorry for the late response, I was on vacation.
>
> On Sun, Dec 03, 2023 at 12:14:30PM +0000, Alexandru Elisei wrote:
> > Hi,
> >
> > On Wed, Nov 29, 2023 at 05:44:24PM +0900, Hyesoo Yu wrote:
> > > Hello.
> > >
> > > On Sun, Nov 19, 2023 at 04:57:05PM +0000, Alexandru Elisei wrote:
> > > > Allow the kernel to get the size and location of the MTE tag storage
> > > > regions from the DTB. This memory is marked as reserved for now.
> > > >
> > > > The DTB node for the tag storage region is defined as:
> > > >
> > > > tags0: tag-storage@8f8000000 {
> > > > compatible = "arm,mte-tag-storage";
> > > > reg = <0x08 0xf8000000 0x00 0x4000000>;
> > > > block-size = <0x1000>;
> > > > memory = <&memory0>; // Associated tagged memory node
> > > > };
> > > >
> > >
> > > How about using compatible = "shared-dma-pool" like below ?
> > >
> > > &reserved_memory {
> > > tags0: tag0@8f8000000 {
> > > compatible = "arm,mte-tag-storage";
> > > reg = <0x08 0xf8000000 0x00 0x4000000>;
> > > };
> > > }
> > >
> > > tag-storage {
> > > compatible = "arm,mte-tag-storage";
> > > memory-region = <&tag>;
> > > memory = <&memory0>;
> > > block-size = <0x1000>;
> > > }
> > >
> > > And then, the activation of CMA would be performed in the CMA code.
> > > We can just get the region information from memory-region and allocate from it directly
> > > with alloc_contig_range() or take_page_off_buddy(). It seems like we can remove a lot of code.
> >
>
> Sorry, that example was my mistake. Actually I wanted to write it like this:
>
> &reserved_memory {
> tags0: tag0@8f8000000 {
> compatible = "shared-dma-pool";
> reg = <0x08 0xf8000000 0x00 0x4000000>;
> reusable;
> };
> }
>
> tag-storage {
> compatible = "arm,mte-tag-storage";
> memory-region = <&tag>;
> memory = <&memory0>;
> block-size = <0x1000>;
> }
I prototyped your suggestion with this change to the device tree:
reserved-memory {
#address-cells = <0x02>;
#size-cells = <0x02>;
ranges;
tags0: tag-storage@8f8000000 {
compatible = "arm,mte-tag-storage";
reg = <0x08 0xf8000000 0x00 0x4000000>;
block-size = <0x1000>;
memory = <&memory0>;
reusable;
};
};
Would you mind explaining what we are gaining by using reserved mem?
Struct reserved_mem only has the base and size of the tag storage region,
and initialization for reserved mem happens before the DTB is unflattened.
When I prototyped using reserved mem, I still had to write the code to
parse the memory node address and size. This code was the same as the code
needed to parse the tag storage region address and size, so having that
information in struct reserved_mem does not reduce the size of the code by
a meaningful amount.
>
>
> > Played with reserved_mem a bit. I don't think that's the correct path
> > forward.
> >
> > The location of the tag storage is a hardware property, independent of how
> > Linux is configured.
> >
> > early_init_fdt_scan_reserved_mem() is called from arm64_memblock_init(),
> > **after** the kernel enforces an upper address for various reasons. One of
> > the reasons can be that it's been compiled with 39 bits VA.
> >
>
> I'm not sure about this part. What is the upper address enforced by the kernel ?
> Where can I check the code ? Do you mean memblock_end_of_DRAM() ?
I am referring to arch/arm64/mm/init.c::arm64_memblock_init(). The
function initializes reserved mem (in early_init_fdt_scan_reserved_mem())
**after** removing memory from memblock that the kernel cannot address.
>
> > After early_init_fdt_scan_reserved_mem() returns, the kernel sets the
> > maximum address, stored in the variable "high_memory".
> >
> > What can happen is that tag storage is present at an address above the
> > maximum addressable by the kernel, and the CMA code will trigger an
> > unrecoverable page fault.
> >
> > I was able to trigger this with the dts change:
> >
> > diff --git a/arch/arm64/boot/dts/arm/fvp-base-revc.dts b/arch/arm64/boot/dts/arm/fvp-base-revc.dts
> > index 60472d65a355..201359d014e4 100644
> > --- a/arch/arm64/boot/dts/arm/fvp-base-revc.dts
> > +++ b/arch/arm64/boot/dts/arm/fvp-base-revc.dts
> > @@ -183,6 +183,13 @@ vram: vram@18000000 {
> > reg = <0x00000000 0x18000000 0 0x00800000>;
> > no-map;
> > };
> > +
> > +
> > + linux,cma {
> > + compatible = "shared-dma-pool";
> > + reg = <0x100 0x0 0x00 0x4000000>;
> > + reusable;
> > + };
> > };
> >
> > gic: interrupt-controller@2f000000 {
> >
> > And the error I got:
> >
> > [ 0.000000] Reserved memory: created CMA memory pool at 0x0000010000000000, size 64 MiB
> > [ 0.000000] OF: reserved mem: initialized node linux,cma, compatible id shared-dma-pool
> > [ 0.000000] OF: reserved mem: 0x0000010000000000..0x0000010003ffffff (65536 KiB) map reusable linux,cma
> > [..]
> > [ 0.793193] WARNING: CPU: 0 PID: 1 at mm/cma.c:111 cma_init_reserved_areas+0xa8/0x378
> > [..]
> > [ 0.806945] Unable to handle kernel paging request at virtual address 00000001fe000000
> > [ 0.807277] Mem abort info:
> > [ 0.807277] ESR = 0x0000000096000005
> > [ 0.807693] EC = 0x25: DABT (current EL), IL = 32 bits
> > [ 0.808110] SET = 0, FnV = 0
> > [ 0.808443] EA = 0, S1PTW = 0
> > [ 0.808526] FSC = 0x05: level 1 translation fault
> > [ 0.808943] Data abort info:
> > [ 0.808943] ISV = 0, ISS = 0x00000005, ISS2 = 0x00000000
> > [ 0.809360] CM = 0, WnR = 0, TnD = 0, TagAccess = 0
> > [ 0.809776] GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
> > [ 0.810221] [00000001fe000000] user address but active_mm is swapper
> > [..]
> > [ 0.820887] Call trace:
> > [ 0.821027] cma_init_reserved_areas+0xc4/0x378
> > [ 0.821443] do_one_initcall+0x7c/0x1c0
> > [ 0.821860] kernel_init_freeable+0x1bc/0x284
> > [ 0.822277] kernel_init+0x24/0x1dc
> > [ 0.822693] ret_from_fork+0x10/0x20
> > [ 0.823554] Code: 9127a29a cb813321 d37ae421 8b030020 (f8636822)
> > [ 0.823554] ---[ end trace 0000000000000000 ]---
> > [ 0.824360] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
> > [ 0.824443] SMP: stopping secondary CPUs
> > [ 0.825193] ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b ]---
> >
> > Should reserved mem check if the reserved memory is actually addressable by
> > the kernel if it's not "no-map"? Should cma fail gracefully if
> > !pfn_valid(base_pfn)? Should early_init_fdt_scan_reserved_mem() be moved
> > before arm64_bootmem_init()? I don't have the answer to any of those. And
> > I got a kernel panic because the kernel cannot address that memory (39 bits
> > VA). I don't know what would happen if the upper limit is reduced for
> > another reason.
> >
>
> My answer may not be accurate because I don't understand what this upper limit is.
> Is this a problem caused by the tag storage area not being included in the memory node ?
This problem is caused by the kernel not being able to use virtual addresses
in the linear map (which maps physical memory at a fixed offset) to access
the tag storage region.
>
> The reason for not including it in the memory node is to enable static MTE when dynamic MTE
> initialization fails, right ? I think I missed that. I thought the tag storage was included
> in the memory node and registered as CMA.
>
> > What I think should happen:
> >
> > 1. Add the tag storage memory before any limits are enforced by
> > arm64_bootmem_init().
> >
> > 2. Call cma_declare_contiguous_nid() after arm64_bootmem_init(), because
> > the function will check the memory limit.
> >
> > 3. Have an arch initcall that checks that the CMA regions corresponding to
> > the tag storage have been activated successfully (cma_init_reserved_areas()
> > is a core initcall). If not, then don't enable tag storage.
> >
> > How does that sound to you?
> >
> > Thanks,
> > Alex
> >
>
> I think this is a good way to utilize the cma code !
Cool, thanks!
Alex
>
> Thanks,
> Regards.
>
> > > > + ret = tag_storage_of_flat_read_u32(node, "block-size", &block_size_bytes);
> > > > + if (ret || block_size_bytes == 0) {
> > > > + pr_err("Invalid or missing 'block-size' property");
> > > > + return -EINVAL;
> > > > + }
> > > > + region->block_size = get_block_size_pages(block_size_bytes);
> > > > + if (range_len(tag_range) % region->block_size != 0) {
> > > > + pr_err("Tag storage region size 0x%llx is not a multiple of block size %u",
> > > > + PFN_PHYS(range_len(tag_range)), region->block_size);
> > > > + return -EINVAL;
> > > > + }
> > > > +
> > >
> > > I was confused by the variable "block_size". The block size declared in the device tree is
> > > in bytes, but the actual block size used is in pages. I think the term "block_size" can cause
> > > confusion as it might be interpreted as bytes. If possible, I suggest changing the name "block_size"
> > > to something more readable, such as "block_nr_pages" (this is just an example!).
> > >
> > > Thanks,
> > > Regards.
> >
>
> What do you think about this ?
>
> Thanks,
> Regards.
>
> > > > + ret = tag_storage_of_flat_read_u32(mem_node, "numa-node-id", &nid);
> > > > + if (ret)
> > > > + nid = numa_node_id();
> > > > +
> > > > + ret = memblock_add_node(PFN_PHYS(tag_range->start), PFN_PHYS(range_len(tag_range)),
> > > > + nid, MEMBLOCK_NONE);
> > > > + if (ret) {
> > > > + pr_err("Error adding tag memblock (%d)", ret);
> > > > + return ret;
> > > > + }
> > > > + memblock_reserve(PFN_PHYS(tag_range->start), PFN_PHYS(range_len(tag_range)));
> > > > +
> > > > + pr_info("Found tag storage region 0x%llx-0x%llx, block size %u pages",
> > > > + PFN_PHYS(tag_range->start), PFN_PHYS(tag_range->end), region->block_size);
> > > > +
> > > > + num_tag_regions++;
> > > > +
> > > > + return 0;
> > > > +}
> > > > +
> > > > +void __init mte_tag_storage_init(void)
> > > > +{
> > > > + struct range *tag_range;
> > > > + int i, ret;
> > > > +
> > > > + ret = of_scan_flat_dt(fdt_init_tag_storage, NULL);
> > > > + if (ret) {
> > > > + for (i = 0; i < num_tag_regions; i++) {
> > > > + tag_range = &tag_regions[i].tag_range;
> > > > + memblock_remove(PFN_PHYS(tag_range->start), PFN_PHYS(range_len(tag_range)));
> > > > + }
> > > > + num_tag_regions = 0;
> > > > + pr_info("MTE tag storage region management disabled");
> > > > + }
> > > > +}
> > > > diff --git a/arch/arm64/kernel/setup.c b/arch/arm64/kernel/setup.c
> > > > index 417a8a86b2db..1b77138c1aa5 100644
> > > > --- a/arch/arm64/kernel/setup.c
> > > > +++ b/arch/arm64/kernel/setup.c
> > > > @@ -42,6 +42,7 @@
> > > > #include <asm/cpufeature.h>
> > > > #include <asm/cpu_ops.h>
> > > > #include <asm/kasan.h>
> > > > +#include <asm/mte_tag_storage.h>
> > > > #include <asm/numa.h>
> > > > #include <asm/scs.h>
> > > > #include <asm/sections.h>
> > > > @@ -342,6 +343,12 @@ void __init __no_sanitize_address setup_arch(char **cmdline_p)
> > > > FW_BUG "Booted with MMU enabled!");
> > > > }
> > > >
> > > > + /*
> > > > + * Must be called before memory limits are enforced by
> > > > + * arm64_memblock_init().
> > > > + */
> > > > + mte_tag_storage_init();
> > > > +
> > > > arm64_memblock_init();
> > > >
> > > > paging_init();
> > > > --
> > > > 2.42.1
> > > >
> > > >
> >
> >
> >
On Sun, Nov 19, 2023 at 10:59 AM Alexandru Elisei
<[email protected]> wrote:
>
> Allow the kernel to get the size and location of the MTE tag storage
> regions from the DTB. This memory is marked as reserved for now.
>
> The DTB node for the tag storage region is defined as:
>
> tags0: tag-storage@8f8000000 {
> compatible = "arm,mte-tag-storage";
> reg = <0x08 0xf8000000 0x00 0x4000000>;
> block-size = <0x1000>;
> memory = <&memory0>; // Associated tagged memory node
> };
I skimmed thru the discussion some. If this memory range is within
main RAM, then it definitely belongs in /reserved-memory.
You need a binding for this too.
> The tag storage region represents the largest contiguous memory region that
> holds all the tags for the associated contiguous memory region which can be
> tagged. For example, for a 32GB contiguous tagged memory the corresponding
> tag storage region is 1GB of contiguous memory, not two adjacent 512M of
> tag storage memory.
>
> "block-size" represents the minimum multiple of 4K of tag storage where all
> the tags stored in the block correspond to a contiguous memory region. This
> is needed for platforms where the memory controller interleaves tag writes
> to memory. For example, if the memory controller interleaves tag writes for
> 256KB of contiguous memory across 8K of tag storage (2-way interleave),
> then the correct value for "block-size" is 0x2000. This value is a hardware
> property, independent of the selected kernel page size.
>
> Signed-off-by: Alexandru Elisei <[email protected]>
> ---
> arch/arm64/Kconfig | 12 ++
> arch/arm64/include/asm/mte_tag_storage.h | 15 ++
> arch/arm64/kernel/Makefile | 1 +
> arch/arm64/kernel/mte_tag_storage.c | 256 +++++++++++++++++++++++
> arch/arm64/kernel/setup.c | 7 +
> 5 files changed, 291 insertions(+)
> create mode 100644 arch/arm64/include/asm/mte_tag_storage.h
> create mode 100644 arch/arm64/kernel/mte_tag_storage.c
>
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index 7b071a00425d..fe8276fdc7a8 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -2062,6 +2062,18 @@ config ARM64_MTE
>
> Documentation/arch/arm64/memory-tagging-extension.rst.
>
> +if ARM64_MTE
> +config ARM64_MTE_TAG_STORAGE
> + bool "Dynamic MTE tag storage management"
> + help
> + Adds support for dynamic management of the memory used by the hardware
> + for storing MTE tags. This memory, unlike normal memory, cannot be
> + tagged. When it is used to store tags for another memory location it
> + cannot be used for any type of allocation.
> +
> + If unsure, say N
> +endif # ARM64_MTE
> +
> endmenu # "ARMv8.5 architectural features"
>
> menu "ARMv8.7 architectural features"
> diff --git a/arch/arm64/include/asm/mte_tag_storage.h b/arch/arm64/include/asm/mte_tag_storage.h
> new file mode 100644
> index 000000000000..8f86c4f9a7c3
> --- /dev/null
> +++ b/arch/arm64/include/asm/mte_tag_storage.h
> @@ -0,0 +1,15 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * Copyright (C) 2023 ARM Ltd.
> + */
> +#ifndef __ASM_MTE_TAG_STORAGE_H
> +#define __ASM_MTE_TAG_STORAGE_H
> +
> +#ifdef CONFIG_ARM64_MTE_TAG_STORAGE
> +void mte_tag_storage_init(void);
> +#else
> +static inline void mte_tag_storage_init(void)
> +{
> +}
> +#endif /* CONFIG_ARM64_MTE_TAG_STORAGE */
> +#endif /* __ASM_MTE_TAG_STORAGE_H */
> diff --git a/arch/arm64/kernel/Makefile b/arch/arm64/kernel/Makefile
> index d95b3d6b471a..5f031bf9f8f1 100644
> --- a/arch/arm64/kernel/Makefile
> +++ b/arch/arm64/kernel/Makefile
> @@ -70,6 +70,7 @@ obj-$(CONFIG_CRASH_CORE) += crash_core.o
> obj-$(CONFIG_ARM_SDE_INTERFACE) += sdei.o
> obj-$(CONFIG_ARM64_PTR_AUTH) += pointer_auth.o
> obj-$(CONFIG_ARM64_MTE) += mte.o
> +obj-$(CONFIG_ARM64_MTE_TAG_STORAGE) += mte_tag_storage.o
> obj-y += vdso-wrap.o
> obj-$(CONFIG_COMPAT_VDSO) += vdso32-wrap.o
> obj-$(CONFIG_UNWIND_PATCH_PAC_INTO_SCS) += patch-scs.o
> diff --git a/arch/arm64/kernel/mte_tag_storage.c b/arch/arm64/kernel/mte_tag_storage.c
> new file mode 100644
> index 000000000000..fa6267ef8392
> --- /dev/null
> +++ b/arch/arm64/kernel/mte_tag_storage.c
> @@ -0,0 +1,256 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Support for dynamic tag storage.
> + *
> + * Copyright (C) 2023 ARM Ltd.
> + */
> +
> +#include <linux/memblock.h>
> +#include <linux/mm.h>
> +#include <linux/of_device.h>
You probably don't need this header. If you depend on what it
implicitly includes, then that will now break in linux-next.
> +#include <linux/of_fdt.h>
> +#include <linux/range.h>
> +#include <linux/string.h>
> +#include <linux/xarray.h>
> +
> +#include <asm/mte_tag_storage.h>
> +
> +struct tag_region {
> + struct range mem_range; /* Memory associated with the tag storage, in PFNs. */
> + struct range tag_range; /* Tag storage memory, in PFNs. */
> + u32 block_size; /* Tag block size, in pages. */
> +};
> +
> +#define MAX_TAG_REGIONS 32
> +
> +static struct tag_region tag_regions[MAX_TAG_REGIONS];
> +static int num_tag_regions;
> +
> +static int __init tag_storage_of_flat_get_range(unsigned long node, const __be32 *reg,
> + int reg_len, struct range *range)
> +{
> + int addr_cells = dt_root_addr_cells;
> + int size_cells = dt_root_size_cells;
> + u64 size;
> +
> + if (reg_len / 4 > addr_cells + size_cells)
> + return -EINVAL;
> +
> + range->start = PHYS_PFN(of_read_number(reg, addr_cells));
> + size = PHYS_PFN(of_read_number(reg + addr_cells, size_cells));
> + if (size == 0) {
> + pr_err("Invalid node");
> + return -EINVAL;
> + }
> + range->end = range->start + size - 1;
We have a function to read (and translate, which you forgot) addresses.
Add what's missing rather than open code your own.
> +
> + return 0;
> +}
> +
> +static int __init tag_storage_of_flat_get_tag_range(unsigned long node,
> + struct range *tag_range)
> +{
> + const __be32 *reg;
> + int reg_len;
> +
> + reg = of_get_flat_dt_prop(node, "reg", &reg_len);
> + if (reg == NULL) {
> + pr_err("Invalid metadata node");
> + return -EINVAL;
> + }
> +
> + return tag_storage_of_flat_get_range(node, reg, reg_len, tag_range);
> +}
> +
> +static int __init tag_storage_of_flat_get_memory_range(unsigned long node, struct range *mem)
> +{
> + const __be32 *reg;
> + int reg_len;
> +
> + reg = of_get_flat_dt_prop(node, "linux,usable-memory", &reg_len);
> + if (reg == NULL)
> + reg = of_get_flat_dt_prop(node, "reg", &reg_len);
> +
> + if (reg == NULL) {
> + pr_err("Invalid memory node");
> + return -EINVAL;
> + }
> +
> + return tag_storage_of_flat_get_range(node, reg, reg_len, mem);
> +}
> +
> +struct find_memory_node_arg {
> + unsigned long node;
> + u32 phandle;
> +};
> +
> +static int __init fdt_find_memory_node(unsigned long node, const char *uname,
> + int depth, void *data)
> +{
> + const char *type = of_get_flat_dt_prop(node, "device_type", NULL);
> + struct find_memory_node_arg *arg = data;
> +
> + if (depth != 1 || !type || strcmp(type, "memory") != 0)
> + return 0;
> +
> + if (of_get_flat_dt_phandle(node) == arg->phandle) {
> + arg->node = node;
> + return 1;
> + }
> +
> + return 0;
> +}
> +
> +static int __init tag_storage_get_memory_node(unsigned long tag_node, unsigned long *mem_node)
> +{
> + struct find_memory_node_arg arg = { 0 };
> + const __be32 *memory_prop;
> + u32 mem_phandle;
> + int ret, reg_len;
> +
> + memory_prop = of_get_flat_dt_prop(tag_node, "memory", &reg_len);
> + if (!memory_prop) {
> + pr_err("Missing 'memory' property in the tag storage node");
> + return -EINVAL;
> + }
> +
> + mem_phandle = be32_to_cpup(memory_prop);
> + arg.phandle = mem_phandle;
> +
> + ret = of_scan_flat_dt(fdt_find_memory_node, &arg);
Do not use of_scan_flat_dt. It is a relic predating libfdt which can
get a node by phandle directly.
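For reference, a sketch of what the direct lookup could look like (untested; it assumes initial_boot_params still points at the flat DT blob, which is the case this early during boot):

#include <linux/libfdt.h>
#include <linux/of_fdt.h>

/* Sketch: look the memory node up by phandle instead of scanning the tree. */
static int __init tag_storage_get_memory_node(unsigned long tag_node, unsigned long *mem_node)
{
        const __be32 *memory_prop;
        int offset;

        memory_prop = of_get_flat_dt_prop(tag_node, "memory", NULL);
        if (!memory_prop) {
                pr_err("Missing 'memory' property in the tag storage node");
                return -EINVAL;
        }

        offset = fdt_node_offset_by_phandle(initial_boot_params, be32_to_cpup(memory_prop));
        if (offset < 0) {
                pr_err("Associated memory node not found");
                return -EINVAL;
        }

        *mem_node = offset;
        return 0;
}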
> + if (ret != 1) {
> + pr_err("Associated memory node not found");
> + return -EINVAL;
> + }
> +
> + *mem_node = arg.node;
> +
> + return 0;
> +}
> +
> +static int __init tag_storage_of_flat_read_u32(unsigned long node, const char *propname,
> + u32 *retval)
If you are going to make a generic function, make it for everyone.
> +{
> + const __be32 *reg;
> +
> + reg = of_get_flat_dt_prop(node, propname, NULL);
> + if (!reg)
> + return -EINVAL;
> +
> + *retval = be32_to_cpup(reg);
> + return 0;
> +}
> +
> +static u32 __init get_block_size_pages(u32 block_size_bytes)
> +{
> + u32 a = PAGE_SIZE;
> + u32 b = block_size_bytes;
> + u32 r;
> +
> + /* Find greatest common divisor using the Euclidean algorithm. */
> + do {
> + r = a % b;
> + a = b;
> + b = r;
> + } while (b != 0);
> +
> + return PHYS_PFN(PAGE_SIZE * block_size_bytes / a);
> +}
> +
> +static int __init fdt_init_tag_storage(unsigned long node, const char *uname,
> + int depth, void *data)
> +{
> + struct tag_region *region;
> + unsigned long mem_node;
> + struct range *mem_range;
> + struct range *tag_range;
> + u32 block_size_bytes;
> + u32 nid = 0;
> + int ret;
> +
> + if (depth != 1 || !strstr(uname, "tag-storage"))
> + return 0;
> +
> + if (!of_flat_dt_is_compatible(node, "arm,mte-tag-storage"))
> + return 0;
> +
> + if (num_tag_regions == MAX_TAG_REGIONS) {
> + pr_err("Maximum number of tag storage regions exceeded");
> + return -EINVAL;
> + }
> +
> + region = &tag_regions[num_tag_regions];
> + mem_range = &region->mem_range;
> + tag_range = &region->tag_range;
> +
> + ret = tag_storage_of_flat_get_tag_range(node, tag_range);
> + if (ret) {
> + pr_err("Invalid tag storage node");
> + return ret;
> + }
> +
> + ret = tag_storage_get_memory_node(node, &mem_node);
> + if (ret)
> + return ret;
> +
> + ret = tag_storage_of_flat_get_memory_range(mem_node, mem_range);
> + if (ret) {
> + pr_err("Invalid address for associated data memory node");
> + return ret;
> + }
> +
> + /* The tag region must exactly match the corresponding memory. */
> + if (range_len(tag_range) * 32 != range_len(mem_range)) {
> + pr_err("Tag storage region 0x%llx-0x%llx does not cover the memory region 0x%llx-0x%llx",
> + PFN_PHYS(tag_range->start), PFN_PHYS(tag_range->end),
> + PFN_PHYS(mem_range->start), PFN_PHYS(mem_range->end));
> + return -EINVAL;
> + }
> +
> + ret = tag_storage_of_flat_read_u32(node, "block-size", &block_size_bytes);
> + if (ret || block_size_bytes == 0) {
> + pr_err("Invalid or missing 'block-size' property");
> + return -EINVAL;
> + }
> + region->block_size = get_block_size_pages(block_size_bytes);
> + if (range_len(tag_range) % region->block_size != 0) {
> + pr_err("Tag storage region size 0x%llx is not a multiple of block size %u",
> + PFN_PHYS(range_len(tag_range)), region->block_size);
> + return -EINVAL;
> + }
> +
> + ret = tag_storage_of_flat_read_u32(mem_node, "numa-node-id", &nid);
I was going to say we already have a way to associate memory nodes with
other nodes using "numa-node-id", so the "memory" phandle property is
somewhat redundant. Maybe the tag node should have a numa-node-id.
With that, it looks like you don't even need to access the /memory
node. Avoiding that would be good for 2 reasons. It avoids parsing
memory nodes twice and it's not the kernel's job to validate the DT.
Really, if you want memory info, you should use memblock to get it
because all the special cases of memory layout are handled. For
example you can have memory nodes with multiple 'reg' entries or
multiple memory nodes or both, and then some of those could be
contiguous.
Rob
Hi Rob,
Thank you so much for the feedback, I'm not very familiar with device tree,
and any comments are very useful.
On Mon, Dec 11, 2023 at 11:29:40AM -0600, Rob Herring wrote:
> On Sun, Nov 19, 2023 at 10:59 AM Alexandru Elisei
> <[email protected]> wrote:
> >
> > Allow the kernel to get the size and location of the MTE tag storage
> > regions from the DTB. This memory is marked as reserved for now.
> >
> > The DTB node for the tag storage region is defined as:
> >
> > tags0: tag-storage@8f8000000 {
> > compatible = "arm,mte-tag-storage";
> > reg = <0x08 0xf8000000 0x00 0x4000000>;
> > block-size = <0x1000>;
> > memory = <&memory0>; // Associated tagged memory node
> > };
>
> I skimmed thru the discussion some. If this memory range is within
> main RAM, then it definitely belongs in /reserved-memory.
Ok, will do that.
If you don't mind, why do you say that it definitely belongs in
reserved-memory? I'm not trying to argue otherwise, I'm curious about the
motivation.
Tag storage is not DMA and can live anywhere in memory. In
arm64_memblock_init(), the kernel first removes the memory that it cannot
address from memblock. For example, because it has been compiled with
CONFIG_ARM64_VA_BITS_39=y. And then calls
early_init_fdt_scan_reserved_mem().
What happens if reserved memory is above what the kernel can address?
From my testing, when the kernel is compiled with 39 bit VA, if I use
reserved memory to discover tag storage that lives above the virtual address
limit and then I try to use CMA to manage the tag storage memory, I get a
kernel panic:
[ 0.000000] Reserved memory: created CMA memory pool at 0x0000010000000000, size 64 MiB
[ 0.000000] OF: reserved mem: initialized node linux,cma, compatible id shared-dma-pool
[ 0.000000] OF: reserved mem: 0x0000010000000000..0x0000010003ffffff (65536 KiB) map reusable linux,cma
[..]
[ 0.806945] Unable to handle kernel paging request at virtual address 00000001fe000000
[ 0.807277] Mem abort info:
[ 0.807277] ESR = 0x0000000096000005
[ 0.807693] EC = 0x25: DABT (current EL), IL = 32 bits
[ 0.808110] SET = 0, FnV = 0
[ 0.808443] EA = 0, S1PTW = 0
[ 0.808526] FSC = 0x05: level 1 translation fault
[ 0.808943] Data abort info:
[ 0.808943] ISV = 0, ISS = 0x00000005, ISS2 = 0x00000000
[ 0.809360] CM = 0, WnR = 0, TnD = 0, TagAccess = 0
[ 0.809776] GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
[ 0.810221] [00000001fe000000] user address but active_mm is swapper
[..]
[ 0.820887] Call trace:
[ 0.821027] cma_init_reserved_areas+0xc4/0x378
>
> You need a binding for this too.
By binding you mean having a yaml file in dt-schema [1] describing the tag
storage node, right?
[1] https://github.com/devicetree-org/dt-schema
>
> > The tag storage region represents the largest contiguous memory region that
> > holds all the tags for the associated contiguous memory region which can be
> > tagged. For example, for a 32GB contiguous tagged memory the corresponding
> > tag storage region is 1GB of contiguous memory, not two adjacent 512M of
> > tag storage memory.
> >
> > "block-size" represents the minimum multiple of 4K of tag storage where all
> > the tags stored in the block correspond to a contiguous memory region. This
> > is needed for platforms where the memory controller interleaves tag writes
> > to memory. For example, if the memory controller interleaves tag writes for
> > 256KB of contiguous memory across 8K of tag storage (2-way interleave),
> > then the correct value for "block-size" is 0x2000. This value is a hardware
> > property, independent of the selected kernel page size.
> >
> > Signed-off-by: Alexandru Elisei <[email protected]>
> > ---
> > arch/arm64/Kconfig | 12 ++
> > arch/arm64/include/asm/mte_tag_storage.h | 15 ++
> > arch/arm64/kernel/Makefile | 1 +
> > arch/arm64/kernel/mte_tag_storage.c | 256 +++++++++++++++++++++++
> > arch/arm64/kernel/setup.c | 7 +
> > 5 files changed, 291 insertions(+)
> > create mode 100644 arch/arm64/include/asm/mte_tag_storage.h
> > create mode 100644 arch/arm64/kernel/mte_tag_storage.c
> >
> > diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> > index 7b071a00425d..fe8276fdc7a8 100644
> > --- a/arch/arm64/Kconfig
> > +++ b/arch/arm64/Kconfig
> > @@ -2062,6 +2062,18 @@ config ARM64_MTE
> >
> > Documentation/arch/arm64/memory-tagging-extension.rst.
> >
> > +if ARM64_MTE
> > +config ARM64_MTE_TAG_STORAGE
> > + bool "Dynamic MTE tag storage management"
> > + help
> > + Adds support for dynamic management of the memory used by the hardware
> > + for storing MTE tags. This memory, unlike normal memory, cannot be
> > + tagged. When it is used to store tags for another memory location it
> > + cannot be used for any type of allocation.
> > +
> > + If unsure, say N
> > +endif # ARM64_MTE
> > +
> > endmenu # "ARMv8.5 architectural features"
> >
> > menu "ARMv8.7 architectural features"
> > diff --git a/arch/arm64/include/asm/mte_tag_storage.h b/arch/arm64/include/asm/mte_tag_storage.h
> > new file mode 100644
> > index 000000000000..8f86c4f9a7c3
> > --- /dev/null
> > +++ b/arch/arm64/include/asm/mte_tag_storage.h
> > @@ -0,0 +1,15 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +/*
> > + * Copyright (C) 2023 ARM Ltd.
> > + */
> > +#ifndef __ASM_MTE_TAG_STORAGE_H
> > +#define __ASM_MTE_TAG_STORAGE_H
> > +
> > +#ifdef CONFIG_ARM64_MTE_TAG_STORAGE
> > +void mte_tag_storage_init(void);
> > +#else
> > +static inline void mte_tag_storage_init(void)
> > +{
> > +}
> > +#endif /* CONFIG_ARM64_MTE_TAG_STORAGE */
> > +#endif /* __ASM_MTE_TAG_STORAGE_H */
> > diff --git a/arch/arm64/kernel/Makefile b/arch/arm64/kernel/Makefile
> > index d95b3d6b471a..5f031bf9f8f1 100644
> > --- a/arch/arm64/kernel/Makefile
> > +++ b/arch/arm64/kernel/Makefile
> > @@ -70,6 +70,7 @@ obj-$(CONFIG_CRASH_CORE) += crash_core.o
> > obj-$(CONFIG_ARM_SDE_INTERFACE) += sdei.o
> > obj-$(CONFIG_ARM64_PTR_AUTH) += pointer_auth.o
> > obj-$(CONFIG_ARM64_MTE) += mte.o
> > +obj-$(CONFIG_ARM64_MTE_TAG_STORAGE) += mte_tag_storage.o
> > obj-y += vdso-wrap.o
> > obj-$(CONFIG_COMPAT_VDSO) += vdso32-wrap.o
> > obj-$(CONFIG_UNWIND_PATCH_PAC_INTO_SCS) += patch-scs.o
> > diff --git a/arch/arm64/kernel/mte_tag_storage.c b/arch/arm64/kernel/mte_tag_storage.c
> > new file mode 100644
> > index 000000000000..fa6267ef8392
> > --- /dev/null
> > +++ b/arch/arm64/kernel/mte_tag_storage.c
> > @@ -0,0 +1,256 @@
> > +// SPDX-License-Identifier: GPL-2.0-only
> > +/*
> > + * Support for dynamic tag storage.
> > + *
> > + * Copyright (C) 2023 ARM Ltd.
> > + */
> > +
> > +#include <linux/memblock.h>
> > +#include <linux/mm.h>
> > +#include <linux/of_device.h>
>
> You probably don't need this header. If you depend on what it
> implicitly includes, then that will now break in linux-next.
I'll have a look if I can remove it. Might be an artifact from an earlier
version of the patches.
>
> > +#include <linux/of_fdt.h>
> > +#include <linux/range.h>
> > +#include <linux/string.h>
> > +#include <linux/xarray.h>
> > +
> > +#include <asm/mte_tag_storage.h>
> > +
> > +struct tag_region {
> > + struct range mem_range; /* Memory associated with the tag storage, in PFNs. */
> > + struct range tag_range; /* Tag storage memory, in PFNs. */
> > + u32 block_size; /* Tag block size, in pages. */
> > +};
> > +
> > +#define MAX_TAG_REGIONS 32
> > +
> > +static struct tag_region tag_regions[MAX_TAG_REGIONS];
> > +static int num_tag_regions;
> > +
> > +static int __init tag_storage_of_flat_get_range(unsigned long node, const __be32 *reg,
> > + int reg_len, struct range *range)
> > +{
> > + int addr_cells = dt_root_addr_cells;
> > + int size_cells = dt_root_size_cells;
> > + u64 size;
> > +
> > + if (reg_len / 4 > addr_cells + size_cells)
> > + return -EINVAL;
> > +
> > + range->start = PHYS_PFN(of_read_number(reg, addr_cells));
> > + size = PHYS_PFN(of_read_number(reg + addr_cells, size_cells));
> > + if (size == 0) {
> > + pr_err("Invalid node");
> > + return -EINVAL;
> > + }
> > + range->end = range->start + size - 1;
>
> We have a function to read (and translate which you forgot) addresses.
> Add what's missing rather than open code your own.
I must have missed that there's already a function to read addresses. Would
you mind pointing me in the right direction?
>
> > +
> > + return 0;
> > +}
> > +
> > +static int __init tag_storage_of_flat_get_tag_range(unsigned long node,
> > + struct range *tag_range)
> > +{
> > + const __be32 *reg;
> > + int reg_len;
> > +
> > + reg = of_get_flat_dt_prop(node, "reg", &reg_len);
> > + if (reg == NULL) {
> > + pr_err("Invalid metadata node");
> > + return -EINVAL;
> > + }
> > +
> > + return tag_storage_of_flat_get_range(node, reg, reg_len, tag_range);
> > +}
> > +
> > +static int __init tag_storage_of_flat_get_memory_range(unsigned long node, struct range *mem)
> > +{
> > + const __be32 *reg;
> > + int reg_len;
> > +
> > + reg = of_get_flat_dt_prop(node, "linux,usable-memory", &reg_len);
> > + if (reg == NULL)
> > + reg = of_get_flat_dt_prop(node, "reg", &reg_len);
> > +
> > + if (reg == NULL) {
> > + pr_err("Invalid memory node");
> > + return -EINVAL;
> > + }
> > +
> > + return tag_storage_of_flat_get_range(node, reg, reg_len, mem);
> > +}
> > +
> > +struct find_memory_node_arg {
> > + unsigned long node;
> > + u32 phandle;
> > +};
> > +
> > +static int __init fdt_find_memory_node(unsigned long node, const char *uname,
> > + int depth, void *data)
> > +{
> > + const char *type = of_get_flat_dt_prop(node, "device_type", NULL);
> > + struct find_memory_node_arg *arg = data;
> > +
> > + if (depth != 1 || !type || strcmp(type, "memory") != 0)
> > + return 0;
> > +
> > + if (of_get_flat_dt_phandle(node) == arg->phandle) {
> > + arg->node = node;
> > + return 1;
> > + }
> > +
> > + return 0;
> > +}
> > +
> > +static int __init tag_storage_get_memory_node(unsigned long tag_node, unsigned long *mem_node)
> > +{
> > + struct find_memory_node_arg arg = { 0 };
> > + const __be32 *memory_prop;
> > + u32 mem_phandle;
> > + int ret, reg_len;
> > +
> > + memory_prop = of_get_flat_dt_prop(tag_node, "memory", &reg_len);
> > + if (!memory_prop) {
> > + pr_err("Missing 'memory' property in the tag storage node");
> > + return -EINVAL;
> > + }
> > +
> > + mem_phandle = be32_to_cpup(memory_prop);
> > + arg.phandle = mem_phandle;
> > +
> > + ret = of_scan_flat_dt(fdt_find_memory_node, &arg);
>
> Do not use of_scan_flat_dt. It is a relic predating libfdt which can
> get a node by phandle directly.
I used that because that's what drivers/of/fdt.c uses. With reserved memory
I shouldn't need it, because struct reserved_mem already includes a
phandle.
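For illustration, the reserved-memory variant could be wired up along these lines (a sketch; the hook name is made up, error handling is omitted, and the lookup of the associated memory node is left out):

#include <linux/of_fdt.h>
#include <linux/of_reserved_mem.h>

/* Sketch of a reserved-memory based setup: the early hook gets base, size,
 * the node offset and the node's phandle from struct reserved_mem.
 * "block-size" and the "memory" phandle can still be read with
 * of_get_flat_dt_prop() from rmem->fdt_node, as in the patch above. */
static int __init tag_storage_rmem_init(struct reserved_mem *rmem)
{
        pr_info("Tag storage at %pa, size %pa, phandle %lu",
                &rmem->base, &rmem->size, rmem->phandle);

        /* Parsing of block-size and the associated memory node goes here. */
        return 0;
}
RESERVEDMEM_OF_DECLARE(mte_tag_storage, "arm,mte-tag-storage", tag_storage_rmem_init);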
>
> > + if (ret != 1) {
> > + pr_err("Associated memory node not found");
> > + return -EINVAL;
> > + }
> > +
> > + *mem_node = arg.node;
> > +
> > + return 0;
> > +}
> > +
> > +static int __init tag_storage_of_flat_read_u32(unsigned long node, const char *propname,
> > + u32 *retval)
>
> If you are going to make a generic function, make it for everyone.
Sure. If I still need it, should I put the function in
include/linux/of_fdt.h?
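Something along these lines is what I have in mind, if it turns out to still be needed (a sketch only; the name and exact location are of course up for discussion):

/* Possible generic helper, declared in include/linux/of_fdt.h and defined
 * in drivers/of/fdt.c. */
int __init of_flat_dt_read_u32(unsigned long node, const char *propname, u32 *out)
{
        const __be32 *prop;
        int len;

        prop = of_get_flat_dt_prop(node, propname, &len);
        if (!prop || len < (int)sizeof(u32))
                return -EINVAL;

        *out = be32_to_cpup(prop);
        return 0;
}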
>
> > +{
> > + const __be32 *reg;
> > +
> > + reg = of_get_flat_dt_prop(node, propname, NULL);
> > + if (!reg)
> > + return -EINVAL;
> > +
> > + *retval = be32_to_cpup(reg);
> > + return 0;
> > +}
> > +
> > +static u32 __init get_block_size_pages(u32 block_size_bytes)
> > +{
> > + u32 a = PAGE_SIZE;
> > + u32 b = block_size_bytes;
> > + u32 r;
> > +
> > + /* Find greatest common divisor using the Euclidean algorithm. */
> > + do {
> > + r = a % b;
> > + a = b;
> > + b = r;
> > + } while (b != 0);
> > +
> > + return PHYS_PFN(PAGE_SIZE * block_size_bytes / a);
> > +}
> > +
> > +static int __init fdt_init_tag_storage(unsigned long node, const char *uname,
> > + int depth, void *data)
> > +{
> > + struct tag_region *region;
> > + unsigned long mem_node;
> > + struct range *mem_range;
> > + struct range *tag_range;
> > + u32 block_size_bytes;
> > + u32 nid = 0;
> > + int ret;
> > +
> > + if (depth != 1 || !strstr(uname, "tag-storage"))
> > + return 0;
> > +
> > + if (!of_flat_dt_is_compatible(node, "arm,mte-tag-storage"))
> > + return 0;
> > +
> > + if (num_tag_regions == MAX_TAG_REGIONS) {
> > + pr_err("Maximum number of tag storage regions exceeded");
> > + return -EINVAL;
> > + }
> > +
> > + region = &tag_regions[num_tag_regions];
> > + mem_range = &region->mem_range;
> > + tag_range = &region->tag_range;
> > +
> > + ret = tag_storage_of_flat_get_tag_range(node, tag_range);
> > + if (ret) {
> > + pr_err("Invalid tag storage node");
> > + return ret;
> > + }
> > +
> > + ret = tag_storage_get_memory_node(node, &mem_node);
> > + if (ret)
> > + return ret;
> > +
> > + ret = tag_storage_of_flat_get_memory_range(mem_node, mem_range);
> > + if (ret) {
> > + pr_err("Invalid address for associated data memory node");
> > + return ret;
> > + }
> > +
> > + /* The tag region must exactly match the corresponding memory. */
> > + if (range_len(tag_range) * 32 != range_len(mem_range)) {
> > + pr_err("Tag storage region 0x%llx-0x%llx does not cover the memory region 0x%llx-0x%llx",
> > + PFN_PHYS(tag_range->start), PFN_PHYS(tag_range->end),
> > + PFN_PHYS(mem_range->start), PFN_PHYS(mem_range->end));
> > + return -EINVAL;
> > + }
> > +
> > + ret = tag_storage_of_flat_read_u32(node, "block-size", &block_size_bytes);
> > + if (ret || block_size_bytes == 0) {
> > + pr_err("Invalid or missing 'block-size' property");
> > + return -EINVAL;
> > + }
> > + region->block_size = get_block_size_pages(block_size_bytes);
> > + if (range_len(tag_range) % region->block_size != 0) {
> > + pr_err("Tag storage region size 0x%llx is not a multiple of block size %u",
> > + PFN_PHYS(range_len(tag_range)), region->block_size);
> > + return -EINVAL;
> > + }
> > +
> > + ret = tag_storage_of_flat_read_u32(mem_node, "numa-node-id", &nid);
>
> I was going to say we already have a way to associate memory nodes with
> other nodes using "numa-node-id", so the "memory" phandle property is
> somewhat redundant. Maybe the tag node should have a numa-node-id.
> With that, it looks like you don't even need to access the /memory
> node. Avoiding that would be good for 2 reasons. It avoids parsing
> memory nodes twice and it's not the kernel's job to validate the DT.
> Really, if you want memory info, you should use memblock to get it
> because all the special cases of memory layout are handled. For
> example you can have memory nodes with multiple 'reg' entries or
> multiple memory nodes or both, and then some of those could be
> contiguous.
I need to have a memory node associated with the tag storage node because
there is a static relationship between a page from "normal" memory and its
associated tag storage. If the code doesn't know that the memory region
A..B has the corresponding tag storage in the region X..Y, then it doesn't
know which tag storage to reserve when a page is allocated as tagged.
In the example above, assuming that page P is allocated as tagged, the
corresponding tag storage page that needs to be reserved is:
tag_storage_pfn = (page_to_pfn(P) - PHYS_PFN(A)) / 32* + PHYS_PFN(X)
numa-node-id is not enough for this, because as far as I know you can have
multiple memory regions within the same numa node.
*32 tagged pages use one tag storage page to store the tags.
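Expressed as code, the lookup is roughly the following (a sketch based on the tag_regions array from the patch, assuming the 32:1 data-to-tag-storage ratio):

/* Sketch: find the tag storage PFN for a tagged data page, based on the
 * tag_regions array populated from the DT. */
static unsigned long page_to_tag_storage_pfn(struct page *page)
{
        unsigned long pfn = page_to_pfn(page);
        struct tag_region *region;
        int i;

        for (i = 0; i < num_tag_regions; i++) {
                region = &tag_regions[i];
                if (pfn >= region->mem_range.start && pfn <= region->mem_range.end)
                        return (pfn - region->mem_range.start) / 32 +
                               region->tag_range.start;
        }

        /* The page is not backed by any tag storage region. */
        return 0;
}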
Thanks,
Alex
On Tue, Dec 12, 2023 at 10:38 AM Alexandru Elisei
<[email protected]> wrote:
>
> Hi Rob,
>
> Thank you so much for the feedback, I'm not very familiar with device tree,
> and any comments are very useful.
>
> On Mon, Dec 11, 2023 at 11:29:40AM -0600, Rob Herring wrote:
> > On Sun, Nov 19, 2023 at 10:59 AM Alexandru Elisei
> > <[email protected]> wrote:
> > >
> > > Allow the kernel to get the size and location of the MTE tag storage
> > > regions from the DTB. This memory is marked as reserved for now.
> > >
> > > The DTB node for the tag storage region is defined as:
> > >
> > > tags0: tag-storage@8f8000000 {
> > > compatible = "arm,mte-tag-storage";
> > > reg = <0x08 0xf8000000 0x00 0x4000000>;
> > > block-size = <0x1000>;
> > > memory = <&memory0>; // Associated tagged memory node
> > > };
> >
> > I skimmed thru the discussion some. If this memory range is within
> > main RAM, then it definitely belongs in /reserved-memory.
>
> Ok, will do that.
>
> If you don't mind, why do you say that it definitely belongs in
> reserved-memory? I'm not trying to argue otherwise, I'm curious about the
> motivation.
Simply so that /memory nodes describe all possible memory and
/reserved-memory is just adding restrictions. It's also because
/reserved-memory is what gets handled early, and we don't need
multiple things to handle early.
> Tag storage is not DMA and can live anywhere in memory.
Then why put it in DT at all? The only reason CMA is there is to set
the size. It's not even clear to me we need CMA in DT either. The
reasoning long ago was the kernel didn't do a good job of moving and
reclaiming contiguous space, but that's supposed to be better now (and
most h/w figured out they need IOMMUs).
But for tag storage you know the size as it is a function of the
memory size, right? After all, you are validating the size is correct.
I guess there is still the aspect of whether you want to enable MTE or
not, which could be done in a variety of ways.
> In
> arm64_memblock_init(), the kernel first removes the memory that it cannot
> address from memblock. For example, because it has been compiled with
> CONFIG_ARM64_VA_BITS_39=y. And then calls
> early_init_fdt_scan_reserved_mem().
>
> What happens if reserved memory is above what the kernel can address?
I would hope the kernel handles it. That's the kernel's problem unless
there's some h/w limitation to access some region. The DT can't have
things dependent on the kernel config.
> From my testing, when the kernel is compiled with 39 bit VA, if I use
> reserved memory to discover tag storage that lives above the virtual address
> limit and then I try to use CMA to manage the tag storage memory, I get a
> kernel panic:
Looks like we should handle that better...
> [ 0.000000] Reserved memory: created CMA memory pool at 0x0000010000000000, size 64 MiB
> [ 0.000000] OF: reserved mem: initialized node linux,cma, compatible id shared-dma-pool
> [ 0.000000] OF: reserved mem: 0x0000010000000000..0x0000010003ffffff (65536 KiB) map reusable linux,cma
> [..]
> [ 0.806945] Unable to handle kernel paging request at virtual address 00000001fe000000
> [ 0.807277] Mem abort info:
> [ 0.807277] ESR = 0x0000000096000005
> [ 0.807693] EC = 0x25: DABT (current EL), IL = 32 bits
> [ 0.808110] SET = 0, FnV = 0
> [ 0.808443] EA = 0, S1PTW = 0
> [ 0.808526] FSC = 0x05: level 1 translation fault
> [ 0.808943] Data abort info:
> [ 0.808943] ISV = 0, ISS = 0x00000005, ISS2 = 0x00000000
> [ 0.809360] CM = 0, WnR = 0, TnD = 0, TagAccess = 0
> [ 0.809776] GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
> [ 0.810221] [00000001fe000000] user address but active_mm is swapper
> [..]
> [ 0.820887] Call trace:
> [ 0.821027] cma_init_reserved_areas+0xc4/0x378
>
> >
> > You need a binding for this too.
>
> By binding you mean having a yaml file in dt-schema [1] describing the tag
> storage node, right?
Yes, but in the kernel tree is fine.
[...]
> > > +static int __init tag_storage_of_flat_get_range(unsigned long node, const __be32 *reg,
> > > + int reg_len, struct range *range)
> > > +{
> > > + int addr_cells = dt_root_addr_cells;
> > > + int size_cells = dt_root_size_cells;
> > > + u64 size;
> > > +
> > > + if (reg_len / 4 > addr_cells + size_cells)
> > > + return -EINVAL;
> > > +
> > > + range->start = PHYS_PFN(of_read_number(reg, addr_cells));
> > > + size = PHYS_PFN(of_read_number(reg + addr_cells, size_cells));
> > > + if (size == 0) {
> > > + pr_err("Invalid node");
> > > + return -EINVAL;
> > > + }
> > > + range->end = range->start + size - 1;
> >
> > We have a function to read (and translate which you forgot) addresses.
> > Add what's missing rather than open code your own.
>
> I must have missed that there's already a function to read addresses. Would
> you mind pointing me in the right direction?
drivers/of/fdt_address.c
Though it doesn't provide getting the size, so that will have to be added.
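Something along these lines, perhaps (the extended helper is hypothetical, sketched only to show the shape of the change):

/* Hypothetical extension to drivers/of/fdt_address.c, mirroring
 * of_flat_dt_translate_address() but also returning the size of the first
 * "reg" entry (name and signature are made up). */
u64 __init of_flat_dt_translate_and_get_size(unsigned long node, u64 *size);

/* Usage sketch in the tag storage parsing code (OF_BAD_ADDR comes from
 * <linux/of_address.h>): */
static int __init tag_storage_of_flat_get_tag_range(unsigned long node, struct range *tag_range)
{
        u64 addr, size;

        addr = of_flat_dt_translate_and_get_size(node, &size);
        if (addr == OF_BAD_ADDR || size == 0)
                return -EINVAL;

        tag_range->start = PHYS_PFN(addr);
        tag_range->end = PHYS_PFN(addr + size - 1);
        return 0;
}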
> > > +
> > > + return 0;
> > > +}
> > > +
> > > +static int __init tag_storage_of_flat_get_tag_range(unsigned long node,
> > > + struct range *tag_range)
> > > +{
> > > + const __be32 *reg;
> > > + int reg_len;
> > > +
> > > + reg = of_get_flat_dt_prop(node, "reg", &reg_len);
> > > + if (reg == NULL) {
> > > + pr_err("Invalid metadata node");
> > > + return -EINVAL;
> > > + }
> > > +
> > > + return tag_storage_of_flat_get_range(node, reg, reg_len, tag_range);
> > > +}
> > > +
> > > +static int __init tag_storage_of_flat_get_memory_range(unsigned long node, struct range *mem)
> > > +{
> > > + const __be32 *reg;
> > > + int reg_len;
> > > +
> > > + reg = of_get_flat_dt_prop(node, "linux,usable-memory", &reg_len);
> > > + if (reg == NULL)
> > > + reg = of_get_flat_dt_prop(node, "reg", &reg_len);
> > > +
> > > + if (reg == NULL) {
> > > + pr_err("Invalid memory node");
> > > + return -EINVAL;
> > > + }
> > > +
> > > + return tag_storage_of_flat_get_range(node, reg, reg_len, mem);
> > > +}
> > > +
> > > +struct find_memory_node_arg {
> > > + unsigned long node;
> > > + u32 phandle;
> > > +};
> > > +
> > > +static int __init fdt_find_memory_node(unsigned long node, const char *uname,
> > > + int depth, void *data)
> > > +{
> > > + const char *type = of_get_flat_dt_prop(node, "device_type", NULL);
> > > + struct find_memory_node_arg *arg = data;
> > > +
> > > + if (depth != 1 || !type || strcmp(type, "memory") != 0)
> > > + return 0;
> > > +
> > > + if (of_get_flat_dt_phandle(node) == arg->phandle) {
> > > + arg->node = node;
> > > + return 1;
> > > + }
> > > +
> > > + return 0;
> > > +}
> > > +
> > > +static int __init tag_storage_get_memory_node(unsigned long tag_node, unsigned long *mem_node)
> > > +{
> > > + struct find_memory_node_arg arg = { 0 };
> > > + const __be32 *memory_prop;
> > > + u32 mem_phandle;
> > > + int ret, reg_len;
> > > +
> > > + memory_prop = of_get_flat_dt_prop(tag_node, "memory", &reg_len);
> > > + if (!memory_prop) {
> > > + pr_err("Missing 'memory' property in the tag storage node");
> > > + return -EINVAL;
> > > + }
> > > +
> > > + mem_phandle = be32_to_cpup(memory_prop);
> > > + arg.phandle = mem_phandle;
> > > +
> > > + ret = of_scan_flat_dt(fdt_find_memory_node, &arg);
> >
> > Do not use of_scan_flat_dt. It is a relic predating libfdt which can
> > get a node by phandle directly.
>
> I used that because that's what drivers/of/fdt.c uses. With reserved memory
> I shouldn't need it, because struct reserved_mem already includes a
> phandle.
Check again. Only some arch/ code (mostly powerpc) uses it. I've
killed off most of it.
> > > + if (ret != 1) {
> > > + pr_err("Associated memory node not found");
> > > + return -EINVAL;
> > > + }
> > > +
> > > + *mem_node = arg.node;
> > > +
> > > + return 0;
> > > +}
> > > +
> > > +static int __init tag_storage_of_flat_read_u32(unsigned long node, const char *propname,
> > > + u32 *retval)
> >
> > If you are going to make a generic function, make it for everyone.
>
> Sure. If I still need it, should I put the function in
> include/linux/of_fdt.h?
Yes.
> > > +{
> > > + const __be32 *reg;
> > > +
> > > + reg = of_get_flat_dt_prop(node, propname, NULL);
> > > + if (!reg)
> > > + return -EINVAL;
> > > +
> > > + *retval = be32_to_cpup(reg);
> > > + return 0;
> > > +}
> > > +
> > > +static u32 __init get_block_size_pages(u32 block_size_bytes)
> > > +{
> > > + u32 a = PAGE_SIZE;
> > > + u32 b = block_size_bytes;
> > > + u32 r;
> > > +
> > > + /* Find greatest common divisor using the Euclidean algorithm. */
> > > + do {
> > > + r = a % b;
> > > + a = b;
> > > + b = r;
> > > + } while (b != 0);
> > > +
> > > + return PHYS_PFN(PAGE_SIZE * block_size_bytes / a);
> > > +}
> > > +
> > > +static int __init fdt_init_tag_storage(unsigned long node, const char *uname,
> > > + int depth, void *data)
> > > +{
> > > + struct tag_region *region;
> > > + unsigned long mem_node;
> > > + struct range *mem_range;
> > > + struct range *tag_range;
> > > + u32 block_size_bytes;
> > > + u32 nid = 0;
> > > + int ret;
> > > +
> > > + if (depth != 1 || !strstr(uname, "tag-storage"))
> > > + return 0;
> > > +
> > > + if (!of_flat_dt_is_compatible(node, "arm,mte-tag-storage"))
> > > + return 0;
> > > +
> > > + if (num_tag_regions == MAX_TAG_REGIONS) {
> > > + pr_err("Maximum number of tag storage regions exceeded");
> > > + return -EINVAL;
> > > + }
> > > +
> > > + region = &tag_regions[num_tag_regions];
> > > + mem_range = &region->mem_range;
> > > + tag_range = &region->tag_range;
> > > +
> > > + ret = tag_storage_of_flat_get_tag_range(node, tag_range);
> > > + if (ret) {
> > > + pr_err("Invalid tag storage node");
> > > + return ret;
> > > + }
> > > +
> > > + ret = tag_storage_get_memory_node(node, &mem_node);
> > > + if (ret)
> > > + return ret;
> > > +
> > > + ret = tag_storage_of_flat_get_memory_range(mem_node, mem_range);
> > > + if (ret) {
> > > + pr_err("Invalid address for associated data memory node");
> > > + return ret;
> > > + }
> > > +
> > > + /* The tag region must exactly match the corresponding memory. */
> > > + if (range_len(tag_range) * 32 != range_len(mem_range)) {
> > > + pr_err("Tag storage region 0x%llx-0x%llx does not cover the memory region 0x%llx-0x%llx",
> > > + PFN_PHYS(tag_range->start), PFN_PHYS(tag_range->end),
> > > + PFN_PHYS(mem_range->start), PFN_PHYS(mem_range->end));
> > > + return -EINVAL;
> > > + }
> > > +
> > > + ret = tag_storage_of_flat_read_u32(node, "block-size", &block_size_bytes);
> > > + if (ret || block_size_bytes == 0) {
> > > + pr_err("Invalid or missing 'block-size' property");
> > > + return -EINVAL;
> > > + }
> > > + region->block_size = get_block_size_pages(block_size_bytes);
> > > + if (range_len(tag_range) % region->block_size != 0) {
> > > + pr_err("Tag storage region size 0x%llx is not a multiple of block size %u",
> > > + PFN_PHYS(range_len(tag_range)), region->block_size);
> > > + return -EINVAL;
> > > + }
> > > +
> > > + ret = tag_storage_of_flat_read_u32(mem_node, "numa-node-id", &nid);
> >
> > I was going to say we already have a way to associate memory nodes with
> > other nodes using "numa-node-id", so the "memory" phandle property is
> > somewhat redundant. Maybe the tag node should have a numa-node-id.
> > With that, it looks like you don't even need to access the /memory
> > node. Avoiding that would be good for 2 reasons. It avoids parsing
> > memory nodes twice and it's not the kernel's job to validate the DT.
> > Really, if you want memory info, you should use memblock to get it
> > because all the special cases of memory layout are handled. For
> > example you can have memory nodes with multiple 'reg' entries or
> > multiple memory nodes or both, and then some of those could be
> > contiguous.
>
> I need to have a memory node associated with the tag storage node because
> there is a static relationship between a page from "normal" memory and its
> associated tag storage. If the code doesn't know that the memory region
> A..B has the corresponding tag storage in the region X..Y, then it doesn't
> know which tag storage to reserve when a page is allocated as tagged.
>
> In the example above, assuming that page P is allocated as tagged, the
> corresponding tag storage page that needs to be reserved is:
>
> tag_storage_pfn = (page_to_pfn(P) - PHYS_PFN(A)) / 32 + PHYS_PFN(X)
>
> numa-node-id is not enough for this, because as far as I know you can have
> multiple memory regions within the same numa node.
Okay.
Rob
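For readers following along, a minimal C sketch of the lookup described
above, with hypothetical names (mem_range covering A..B and tag_range
covering X..Y, both as PFN ranges); this is not the series' code and it
ignores the block-size granularity at which tag storage is reserved:

    #include <linux/range.h>

    /* 32 bytes of data need 1 byte of tags, so 32 data pages share one tag page. */
    static unsigned long data_pfn_to_tag_pfn(const struct range *mem_range,
                                             const struct range *tag_range,
                                             unsigned long data_pfn)
    {
            return tag_range->start + (data_pfn - mem_range->start) / 32;
    }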
Hi Rob,
On Tue, Dec 12, 2023 at 12:44:06PM -0600, Rob Herring wrote:
> On Tue, Dec 12, 2023 at 10:38 AM Alexandru Elisei
> <[email protected]> wrote:
> >
> > Hi Rob,
> >
> > Thank you so much for the feedback, I'm not very familiar with device tree,
> > and any comments are very useful.
> >
> > On Mon, Dec 11, 2023 at 11:29:40AM -0600, Rob Herring wrote:
> > > On Sun, Nov 19, 2023 at 10:59 AM Alexandru Elisei
> > > <[email protected]> wrote:
> > > >
> > > > Allow the kernel to get the size and location of the MTE tag storage
> > > > regions from the DTB. This memory is marked as reserved for now.
> > > >
> > > > The DTB node for the tag storage region is defined as:
> > > >
> > > > tags0: tag-storage@8f8000000 {
> > > > compatible = "arm,mte-tag-storage";
> > > > reg = <0x08 0xf8000000 0x00 0x4000000>;
> > > > block-size = <0x1000>;
> > > > memory = <&memory0>; // Associated tagged memory node
> > > > };
> > >
> > > I skimmed thru the discussion some. If this memory range is within
> > > main RAM, then it definitely belongs in /reserved-memory.
> >
> > Ok, will do that.
> >
> > If you don't mind, why do you say that it definitely belongs in
> > reserved-memory? I'm not trying to argue otherwise, I'm curious about the
> > motivation.
>
> Simply so that /memory nodes describe all possible memory and
> /reserved-memory is just adding restrictions. It's also because
> /reserved-memory is what gets handled early, and we don't need
> multiple things to handle early.
>
> > Tag storage is not DMA and can live anywhere in memory.
>
> Then why put it in DT at all? The only reason CMA is there is to set
> the size. It's not even clear to me we need CMA in DT either. The
> reasoning long ago was the kernel didn't do a good job of moving and
> reclaiming contiguous space, but that's supposed to be better now (and
> most h/w figured out they need IOMMUs).
>
> But for tag storage you know the size as it is a function of the
> memory size, right? After all, you are validating the size is correct.
> I guess there is still the aspect of whether you want to enable MTE or
> not, which could be done in a variety of ways.
Oh, sorry, my bad, I should have been clearer about this. I don't want to
put it in the DT as a "linux,cma" node. But I want it to be managed by CMA.
>
> > In
> > arm64_memblock_init(), the kernel first removes the memory that it cannot
> > address from memblock. For example, because it has been compiled with
> > CONFIG_ARM64_VA_BITS_39=y. And then calls
> > early_init_fdt_scan_reserved_mem().
> >
> > What happens if reserved memory is above what the kernel can address?
>
> I would hope the kernel handles it. That's the kernel's problem unless
> there's some h/w limitation to access some region. The DT can't have
> things dependent on the kernel config.
I would hope so too, that's why I was surprised when I put reserved memory
at 1TB in a 39 bit VA kernel and got a panic.
>
> > From my testing, when the kernel is compiled with 39 bit VA, if I use
> > reserved memory to discover tag storage that lives above the virtual address
> > limit and then I try to use CMA to manage the tag storage memory, I get a
> > kernel panic:
>
> Looks like we should handle that better...
I guess we don't need to tackle that problem right now. I don't know of
many systems in the wild that have memory above the 1TB address.
>
> > [ 0.000000] Reserved memory: created CMA memory pool at 0x0000010000000000, size 64 MiB
> > [ 0.000000] OF: reserved mem: initialized node linux,cma, compatible id shared-dma-pool
> > [ 0.000000] OF: reserved mem: 0x0000010000000000..0x0000010003ffffff (65536 KiB) map reusable linux,cma
> > [..]
> > [ 0.806945] Unable to handle kernel paging request at virtual address 00000001fe000000
> > [ 0.807277] Mem abort info:
> > [ 0.807277] ESR = 0x0000000096000005
> > [ 0.807693] EC = 0x25: DABT (current EL), IL = 32 bits
> > [ 0.808110] SET = 0, FnV = 0
> > [ 0.808443] EA = 0, S1PTW = 0
> > [ 0.808526] FSC = 0x05: level 1 translation fault
> > [ 0.808943] Data abort info:
> > [ 0.808943] ISV = 0, ISS = 0x00000005, ISS2 = 0x00000000
> > [ 0.809360] CM = 0, WnR = 0, TnD = 0, TagAccess = 0
> > [ 0.809776] GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
> > [ 0.810221] [00000001fe000000] user address but active_mm is swapper
> > [..]
> > [ 0.820887] Call trace:
> > [ 0.821027] cma_init_reserved_areas+0xc4/0x378
> >
> > >
> > > You need a binding for this too.
> >
> > By binding you mean having a yaml file in dt-schema [1] describing the tag
> > storage node, right?
>
> Yes, but in the kernel tree is fine.
Cool, thanks.
>
> [...]
>
> > > > +static int __init tag_storage_of_flat_get_range(unsigned long node, const __be32 *reg,
> > > > + int reg_len, struct range *range)
> > > > +{
> > > > + int addr_cells = dt_root_addr_cells;
> > > > + int size_cells = dt_root_size_cells;
> > > > + u64 size;
> > > > +
> > > > + if (reg_len / 4 > addr_cells + size_cells)
> > > > + return -EINVAL;
> > > > +
> > > > + range->start = PHYS_PFN(of_read_number(reg, addr_cells));
> > > > + size = PHYS_PFN(of_read_number(reg + addr_cells, size_cells));
> > > > + if (size == 0) {
> > > > + pr_err("Invalid node");
> > > > + return -EINVAL;
> > > > + }
> > > > + range->end = range->start + size - 1;
> > >
> > > We have a function to read (and translate which you forgot) addresses.
> > > Add what's missing rather than open code your own.
> >
> > I must have missed that there's already a function to read addresses. Would
> > you mind pointing me in the right direction?
>
> drivers/of/fdt_address.c
>
> Though it doesn't provide getting the size, so that will have to be added.
Ok, will do!
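As a rough sketch of what leaning on that helper could look like (assumed
usage, not the eventual patch): of_flat_dt_translate_address() returns only
the translated base address, so the size would still need a separate read
until a size-returning variant is added, as noted above.

    #include <linux/of_fdt.h>
    #include <linux/of_address.h>

    static int __init read_translated_base(unsigned long node, u64 *base)
    {
            u64 addr = of_flat_dt_translate_address(node);

            if (addr == OF_BAD_ADDR)
                    return -EINVAL;

            *base = addr;
            return 0;
    }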
>
>
> > > > +
> > > > + return 0;
> > > > +}
> > > > +
> > > > +static int __init tag_storage_of_flat_get_tag_range(unsigned long node,
> > > > + struct range *tag_range)
> > > > +{
> > > > + const __be32 *reg;
> > > > + int reg_len;
> > > > +
> > > > + reg = of_get_flat_dt_prop(node, "reg", ®_len);
> > > > + if (reg == NULL) {
> > > > + pr_err("Invalid metadata node");
> > > > + return -EINVAL;
> > > > + }
> > > > +
> > > > + return tag_storage_of_flat_get_range(node, reg, reg_len, tag_range);
> > > > +}
> > > > +
> > > > +static int __init tag_storage_of_flat_get_memory_range(unsigned long node, struct range *mem)
> > > > +{
> > > > + const __be32 *reg;
> > > > + int reg_len;
> > > > +
> > > > + reg = of_get_flat_dt_prop(node, "linux,usable-memory", ®_len);
> > > > + if (reg == NULL)
> > > > + reg = of_get_flat_dt_prop(node, "reg", ®_len);
> > > > +
> > > > + if (reg == NULL) {
> > > > + pr_err("Invalid memory node");
> > > > + return -EINVAL;
> > > > + }
> > > > +
> > > > + return tag_storage_of_flat_get_range(node, reg, reg_len, mem);
> > > > +}
> > > > +
> > > > +struct find_memory_node_arg {
> > > > + unsigned long node;
> > > > + u32 phandle;
> > > > +};
> > > > +
> > > > +static int __init fdt_find_memory_node(unsigned long node, const char *uname,
> > > > + int depth, void *data)
> > > > +{
> > > > + const char *type = of_get_flat_dt_prop(node, "device_type", NULL);
> > > > + struct find_memory_node_arg *arg = data;
> > > > +
> > > > + if (depth != 1 || !type || strcmp(type, "memory") != 0)
> > > > + return 0;
> > > > +
> > > > + if (of_get_flat_dt_phandle(node) == arg->phandle) {
> > > > + arg->node = node;
> > > > + return 1;
> > > > + }
> > > > +
> > > > + return 0;
> > > > +}
> > > > +
> > > > +static int __init tag_storage_get_memory_node(unsigned long tag_node, unsigned long *mem_node)
> > > > +{
> > > > + struct find_memory_node_arg arg = { 0 };
> > > > + const __be32 *memory_prop;
> > > > + u32 mem_phandle;
> > > > + int ret, reg_len;
> > > > +
> > > > + memory_prop = of_get_flat_dt_prop(tag_node, "memory", &reg_len);
> > > > + if (!memory_prop) {
> > > > + pr_err("Missing 'memory' property in the tag storage node");
> > > > + return -EINVAL;
> > > > + }
> > > > +
> > > > + mem_phandle = be32_to_cpup(memory_prop);
> > > > + arg.phandle = mem_phandle;
> > > > +
> > > > + ret = of_scan_flat_dt(fdt_find_memory_node, &arg);
> > >
> > > Do not use of_scan_flat_dt. It is a relic predating libfdt which can
> > > get a node by phandle directly.
> >
> > I used that because that's what drivers/of/fdt.c uses. With reserved memory
> > I shouldn't need it, because struct reserved_mem already includes a
> > phandle.
>
> Check again. Only some arch/ code (mostly powerpc) uses it. I've
> killed off most of it.
You're right, I think I grep'ed for a different function name in
drivers/of/fdt.c. Either way, the message is clear: no of_scan_flat_dt().
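A hedged sketch of the direct lookup libfdt allows, resolving the "memory"
phandle straight to a node offset instead of scanning every node
(illustrative names only; initial_boot_params is the flattened DT blob):

    #include <linux/libfdt.h>
    #include <linux/of_fdt.h>

    static int __init find_memory_node_by_phandle(u32 phandle, int *node_offset)
    {
            int offset = fdt_node_offset_by_phandle(initial_boot_params, phandle);

            if (offset < 0)
                    return -EINVAL;

            *node_offset = offset;
            return 0;
    }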
>
>
> > > > + if (ret != 1) {
> > > > + pr_err("Associated memory node not found");
> > > > + return -EINVAL;
> > > > + }
> > > > +
> > > > + *mem_node = arg.node;
> > > > +
> > > > + return 0;
> > > > +}
> > > > +
> > > > +static int __init tag_storage_of_flat_read_u32(unsigned long node, const char *propname,
> > > > + u32 *retval)
> > >
> > > If you are going to make a generic function, make it for everyone.
> >
> > Sure. If I still need it, should I put the function in
> > include/linux/of_fdt.h?
>
> Yes.
Noted.
>
> > > > +{
> > > > + const __be32 *reg;
> > > > +
> > > > + reg = of_get_flat_dt_prop(node, propname, NULL);
> > > > + if (!reg)
> > > > + return -EINVAL;
> > > > +
> > > > + *retval = be32_to_cpup(reg);
> > > > + return 0;
> > > > +}
> > > > +
> > > > +static u32 __init get_block_size_pages(u32 block_size_bytes)
> > > > +{
> > > > + u32 a = PAGE_SIZE;
> > > > + u32 b = block_size_bytes;
> > > > + u32 r;
> > > > +
> > > > + /* Find greatest common divisor using the Euclidean algorithm. */
> > > > + do {
> > > > + r = a % b;
> > > > + a = b;
> > > > + b = r;
> > > > + } while (b != 0);
> > > > +
> > > > + return PHYS_PFN(PAGE_SIZE * block_size_bytes / a);
> > > > +}
> > > > +
> > > > +static int __init fdt_init_tag_storage(unsigned long node, const char *uname,
> > > > + int depth, void *data)
> > > > +{
> > > > + struct tag_region *region;
> > > > + unsigned long mem_node;
> > > > + struct range *mem_range;
> > > > + struct range *tag_range;
> > > > + u32 block_size_bytes;
> > > > + u32 nid = 0;
> > > > + int ret;
> > > > +
> > > > + if (depth != 1 || !strstr(uname, "tag-storage"))
> > > > + return 0;
> > > > +
> > > > + if (!of_flat_dt_is_compatible(node, "arm,mte-tag-storage"))
> > > > + return 0;
> > > > +
> > > > + if (num_tag_regions == MAX_TAG_REGIONS) {
> > > > + pr_err("Maximum number of tag storage regions exceeded");
> > > > + return -EINVAL;
> > > > + }
> > > > +
> > > > + region = &tag_regions[num_tag_regions];
> > > > + mem_range = &region->mem_range;
> > > > + tag_range = &region->tag_range;
> > > > +
> > > > + ret = tag_storage_of_flat_get_tag_range(node, tag_range);
> > > > + if (ret) {
> > > > + pr_err("Invalid tag storage node");
> > > > + return ret;
> > > > + }
> > > > +
> > > > + ret = tag_storage_get_memory_node(node, &mem_node);
> > > > + if (ret)
> > > > + return ret;
> > > > +
> > > > + ret = tag_storage_of_flat_get_memory_range(mem_node, mem_range);
> > > > + if (ret) {
> > > > + pr_err("Invalid address for associated data memory node");
> > > > + return ret;
> > > > + }
> > > > +
> > > > + /* The tag region must exactly match the corresponding memory. */
> > > > + if (range_len(tag_range) * 32 != range_len(mem_range)) {
> > > > + pr_err("Tag storage region 0x%llx-0x%llx does not cover the memory region 0x%llx-0x%llx",
> > > > + PFN_PHYS(tag_range->start), PFN_PHYS(tag_range->end),
> > > > + PFN_PHYS(mem_range->start), PFN_PHYS(mem_range->end));
> > > > + return -EINVAL;
> > > > + }
> > > > +
> > > > + ret = tag_storage_of_flat_read_u32(node, "block-size", &block_size_bytes);
> > > > + if (ret || block_size_bytes == 0) {
> > > > + pr_err("Invalid or missing 'block-size' property");
> > > > + return -EINVAL;
> > > > + }
> > > > + region->block_size = get_block_size_pages(block_size_bytes);
> > > > + if (range_len(tag_range) % region->block_size != 0) {
> > > > + pr_err("Tag storage region size 0x%llx is not a multiple of block size %u",
> > > > + PFN_PHYS(range_len(tag_range)), region->block_size);
> > > > + return -EINVAL;
> > > > + }
> > > > +
> > > > + ret = tag_storage_of_flat_read_u32(mem_node, "numa-node-id", &nid);
> > >
> > > I was going to say we already have a way to associate memory nodes
> > > other nodes using "numa-node-id", so the "memory" phandle property is
> > > somewhat redundant. Maybe the tag node should have a numa-node-id.
> > > With that, it looks like you don't even need to access the /memory
> > > node. Avoiding that would be good for 2 reasons. It avoids parsing
> > > memory nodes twice and it's not the kernel's job to validate the DT.
> > > Really, if you want memory info, you should use memblock to get it
> > > because all the special cases of memory layout are handled. For
> > > example you can have memory nodes with multiple 'reg' entries or
> > > multiple memory nodes or both, and then some of those could be
> > > contiguous.
> >
> > I need to have a memory node associated with the tag storage node because
> > there is a static relationship between a page from "normal" memory and its
> > associated tag storage. If the code doesn't know that the memory region
> > A..B has the corresponding tag storage in the region X..Y, then it doesn't
> > know which tag storage to reserve when a page is allocated as tagged.
> >
> > In the example above, assuming that page P is allocated as tagged, the
> > corresponding tag storage page that needs to be reserved is:
> >
> > tag_storage_pfn = (page_to_pfn(P) - PHYS_PFN(A)) / 32 + PHYS_PFN(X)
> >
> > numa-node-id is not enough for this, because as far as I know you can have
> > multiple memory regions within the same numa node.
>
> Okay.
Great, glad we are on the same page.
Thanks,
Alex
On Wed, Dec 13, 2023 at 7:05 AM Alexandru Elisei
<[email protected]> wrote:
>
> Hi Rob,
>
> On Tue, Dec 12, 2023 at 12:44:06PM -0600, Rob Herring wrote:
> > On Tue, Dec 12, 2023 at 10:38 AM Alexandru Elisei
> > <[email protected]> wrote:
> > >
> > > Hi Rob,
> > >
> > > Thank you so much for the feedback, I'm not very familiar with device tree,
> > > and any comments are very useful.
> > >
> > > On Mon, Dec 11, 2023 at 11:29:40AM -0600, Rob Herring wrote:
> > > > On Sun, Nov 19, 2023 at 10:59 AM Alexandru Elisei
> > > > <[email protected]> wrote:
> > > > >
> > > > > Allow the kernel to get the size and location of the MTE tag storage
> > > > > regions from the DTB. This memory is marked as reserved for now.
> > > > >
> > > > > The DTB node for the tag storage region is defined as:
> > > > >
> > > > > tags0: tag-storage@8f8000000 {
> > > > > compatible = "arm,mte-tag-storage";
> > > > > reg = <0x08 0xf8000000 0x00 0x4000000>;
> > > > > block-size = <0x1000>;
> > > > > memory = <&memory0>; // Associated tagged memory node
> > > > > };
> > > >
> > > > I skimmed thru the discussion some. If this memory range is within
> > > > main RAM, then it definitely belongs in /reserved-memory.
> > >
> > > Ok, will do that.
> > >
> > > If you don't mind, why do you say that it definitely belongs in
> > > reserved-memory? I'm not trying to argue otherwise, I'm curious about the
> > > motivation.
> >
> > Simply so that /memory nodes describe all possible memory and
> > /reserved-memory is just adding restrictions. It's also because
> > /reserved-memory is what gets handled early, and we don't need
> > multiple things to handle early.
> >
> > > Tag storage is not DMA and can live anywhere in memory.
> >
> > Then why put it in DT at all? The only reason CMA is there is to set
> > the size. It's not even clear to me we need CMA in DT either. The
> > reasoning long ago was the kernel didn't do a good job of moving and
> > reclaiming contiguous space, but that's supposed to be better now (and
> > most h/w figured out they need IOMMUs).
> >
> > But for tag storage you know the size as it is a function of the
> > memory size, right? After all, you are validating the size is correct.
> > I guess there is still the aspect of whether you want enable MTE or
> > not which could be done in a variety of ways.
>
> Oh, sorry, my bad, I should have been clearer about this. I don't want to
> put it in the DT as a "linux,cma" node. But I want it to be managed by CMA.
Yes, I understand, but my point remains. Why do you need this in DT?
If the location doesn't matter and you can calculate the size from the
memory size, what else is there to add to the DT?
Rob
Hi,
On Wed, Dec 13, 2023 at 08:06:44AM -0600, Rob Herring wrote:
> On Wed, Dec 13, 2023 at 7:05 AM Alexandru Elisei
> <[email protected]> wrote:
> >
> > Hi Rob,
> >
> > On Tue, Dec 12, 2023 at 12:44:06PM -0600, Rob Herring wrote:
> > > On Tue, Dec 12, 2023 at 10:38 AM Alexandru Elisei
> > > <[email protected]> wrote:
> > > >
> > > > Hi Rob,
> > > >
> > > > Thank you so much for the feedback, I'm not very familiar with device tree,
> > > > and any comments are very useful.
> > > >
> > > > On Mon, Dec 11, 2023 at 11:29:40AM -0600, Rob Herring wrote:
> > > > > On Sun, Nov 19, 2023 at 10:59 AM Alexandru Elisei
> > > > > <[email protected]> wrote:
> > > > > >
> > > > > > Allow the kernel to get the size and location of the MTE tag storage
> > > > > > regions from the DTB. This memory is marked as reserved for now.
> > > > > >
> > > > > > The DTB node for the tag storage region is defined as:
> > > > > >
> > > > > > tags0: tag-storage@8f8000000 {
> > > > > > compatible = "arm,mte-tag-storage";
> > > > > > reg = <0x08 0xf8000000 0x00 0x4000000>;
> > > > > > block-size = <0x1000>;
> > > > > > memory = <&memory0>; // Associated tagged memory node
> > > > > > };
> > > > >
> > > > > I skimmed thru the discussion some. If this memory range is within
> > > > > main RAM, then it definitely belongs in /reserved-memory.
> > > >
> > > > Ok, will do that.
> > > >
> > > > If you don't mind, why do you say that it definitely belongs in
> > > > reserved-memory? I'm not trying to argue otherwise, I'm curious about the
> > > > motivation.
> > >
> > > Simply so that /memory nodes describe all possible memory and
> > > /reserved-memory is just adding restrictions. It's also because
> > > /reserved-memory is what gets handled early, and we don't need
> > > multiple things to handle early.
> > >
> > > > Tag storage is not DMA and can live anywhere in memory.
> > >
> > > Then why put it in DT at all? The only reason CMA is there is to set
> > > the size. It's not even clear to me we need CMA in DT either. The
> > > reasoning long ago was the kernel didn't do a good job of moving and
> > > reclaiming contiguous space, but that's supposed to be better now (and
> > > most h/w figured out they need IOMMUs).
> > >
> > > But for tag storage you know the size as it is a function of the
> > > memory size, right? After all, you are validating the size is correct.
> > > I guess there is still the aspect of whether you want enable MTE or
> > > not which could be done in a variety of ways.
> >
> > Oh, sorry, my bad, I should have been clearer about this. I don't want to
> > put it in the DT as a "linux,cma" node. But I want it to be managed by CMA.
>
> Yes, I understand, but my point remains. Why do you need this in DT?
> If the location doesn't matter and you can calculate the size from the
> memory size, what else is there to add to the DT?
I am afraid there has been a misunderstanding. What do you mean by
"location doesn't matter"?
At the very least, Linux needs to know the address and size of a memory
region to use it. The series is about using the tag storage memory for
data. Tag storage cannot be described as a regular memory node because it
cannot be tagged (and normal memory can).
Then there's the matter of the tag storage block size (explained in this
commit message), and also knowing the memory range for which a tag storage
region stores the tags. This is explained in the cover letter.
Is there something that you feel is not clear enough? I am more than
happy to go into details.
Thanks,
Alex
On Wed, Dec 13, 2023 at 8:51 AM Alexandru Elisei
<[email protected]> wrote:
>
> Hi,
>
> On Wed, Dec 13, 2023 at 08:06:44AM -0600, Rob Herring wrote:
> > On Wed, Dec 13, 2023 at 7:05 AM Alexandru Elisei
> > <[email protected]> wrote:
> > >
> > > Hi Rob,
> > >
> > > On Tue, Dec 12, 2023 at 12:44:06PM -0600, Rob Herring wrote:
> > > > On Tue, Dec 12, 2023 at 10:38 AM Alexandru Elisei
> > > > <[email protected]> wrote:
> > > > >
> > > > > Hi Rob,
> > > > >
> > > > > Thank you so much for the feedback, I'm not very familiar with device tree,
> > > > > and any comments are very useful.
> > > > >
> > > > > On Mon, Dec 11, 2023 at 11:29:40AM -0600, Rob Herring wrote:
> > > > > > On Sun, Nov 19, 2023 at 10:59 AM Alexandru Elisei
> > > > > > <[email protected]> wrote:
> > > > > > >
> > > > > > > Allow the kernel to get the size and location of the MTE tag storage
> > > > > > > regions from the DTB. This memory is marked as reserved for now.
> > > > > > >
> > > > > > > The DTB node for the tag storage region is defined as:
> > > > > > >
> > > > > > > tags0: tag-storage@8f8000000 {
> > > > > > > compatible = "arm,mte-tag-storage";
> > > > > > > reg = <0x08 0xf8000000 0x00 0x4000000>;
> > > > > > > block-size = <0x1000>;
> > > > > > > memory = <&memory0>; // Associated tagged memory node
> > > > > > > };
> > > > > >
> > > > > > I skimmed thru the discussion some. If this memory range is within
> > > > > > main RAM, then it definitely belongs in /reserved-memory.
> > > > >
> > > > > Ok, will do that.
> > > > >
> > > > > If you don't mind, why do you say that it definitely belongs in
> > > > > reserved-memory? I'm not trying to argue otherwise, I'm curious about the
> > > > > motivation.
> > > >
> > > > Simply so that /memory nodes describe all possible memory and
> > > > /reserved-memory is just adding restrictions. It's also because
> > > > /reserved-memory is what gets handled early, and we don't need
> > > > multiple things to handle early.
> > > >
> > > > > Tag storage is not DMA and can live anywhere in memory.
> > > >
> > > > Then why put it in DT at all? The only reason CMA is there is to set
> > > > the size. It's not even clear to me we need CMA in DT either. The
> > > > reasoning long ago was the kernel didn't do a good job of moving and
> > > > reclaiming contiguous space, but that's supposed to be better now (and
> > > > most h/w figured out they need IOMMUs).
> > > >
> > > > But for tag storage you know the size as it is a function of the
> > > > memory size, right? After all, you are validating the size is correct.
> > > > I guess there is still the aspect of whether you want enable MTE or
> > > > not which could be done in a variety of ways.
> > >
> > > Oh, sorry, my bad, I should have been clearer about this. I don't want to
> > > put it in the DT as a "linux,cma" node. But I want it to be managed by CMA.
> >
> > Yes, I understand, but my point remains. Why do you need this in DT?
> > If the location doesn't matter and you can calculate the size from the
> > memory size, what else is there to add to the DT?
>
> I am afraid there has been a misunderstanding. What do you mean by
> "location doesn't matter"?
You said:
> Tag storage is not DMA and can live anywhere in memory.
Which I took as the kernel can figure out where to put it. But maybe
you meant the h/w platform can hard code it to be anywhere in memory?
If so, then yes, DT is needed.
> At the very least, Linux needs to know the address and size of a memory
> region to use it. The series is about using the tag storage memory for
> data. Tag storage cannot be described as a regular memory node because it
> cannot be tagged (and normal memory can).
If the tag storage lives in the middle of memory, then it would be
described in the memory node, but removed by being in a reserved-memory
node.
> Then there's the matter of the tag storage block size (explained in this
> commit message), and also knowing the memory range for which a tag storage
> region stores the tags. This is explained in the cover letter.
Honestly, I just forgot about that part.
Rob
On Wed, Dec 13, 2023 at 11:22:17AM -0600, Rob Herring wrote:
> On Wed, Dec 13, 2023 at 8:51 AM Alexandru Elisei
> <[email protected]> wrote:
> >
> > Hi,
> >
> > On Wed, Dec 13, 2023 at 08:06:44AM -0600, Rob Herring wrote:
> > > On Wed, Dec 13, 2023 at 7:05 AM Alexandru Elisei
> > > <[email protected]> wrote:
> > > >
> > > > Hi Rob,
> > > >
> > > > On Tue, Dec 12, 2023 at 12:44:06PM -0600, Rob Herring wrote:
> > > > > On Tue, Dec 12, 2023 at 10:38 AM Alexandru Elisei
> > > > > <[email protected]> wrote:
> > > > > >
> > > > > > Hi Rob,
> > > > > >
> > > > > > Thank you so much for the feedback, I'm not very familiar with device tree,
> > > > > > and any comments are very useful.
> > > > > >
> > > > > > On Mon, Dec 11, 2023 at 11:29:40AM -0600, Rob Herring wrote:
> > > > > > > On Sun, Nov 19, 2023 at 10:59 AM Alexandru Elisei
> > > > > > > <[email protected]> wrote:
> > > > > > > >
> > > > > > > > Allow the kernel to get the size and location of the MTE tag storage
> > > > > > > > regions from the DTB. This memory is marked as reserved for now.
> > > > > > > >
> > > > > > > > The DTB node for the tag storage region is defined as:
> > > > > > > >
> > > > > > > > tags0: tag-storage@8f8000000 {
> > > > > > > > compatible = "arm,mte-tag-storage";
> > > > > > > > reg = <0x08 0xf8000000 0x00 0x4000000>;
> > > > > > > > block-size = <0x1000>;
> > > > > > > > memory = <&memory0>; // Associated tagged memory node
> > > > > > > > };
> > > > > > >
> > > > > > > I skimmed thru the discussion some. If this memory range is within
> > > > > > > main RAM, then it definitely belongs in /reserved-memory.
> > > > > >
> > > > > > Ok, will do that.
> > > > > >
> > > > > > If you don't mind, why do you say that it definitely belongs in
> > > > > > reserved-memory? I'm not trying to argue otherwise, I'm curious about the
> > > > > > motivation.
> > > > >
> > > > > Simply so that /memory nodes describe all possible memory and
> > > > > /reserved-memory is just adding restrictions. It's also because
> > > > > /reserved-memory is what gets handled early, and we don't need
> > > > > multiple things to handle early.
> > > > >
> > > > > > Tag storage is not DMA and can live anywhere in memory.
> > > > >
> > > > > Then why put it in DT at all? The only reason CMA is there is to set
> > > > > the size. It's not even clear to me we need CMA in DT either. The
> > > > > reasoning long ago was the kernel didn't do a good job of moving and
> > > > > reclaiming contiguous space, but that's supposed to be better now (and
> > > > > most h/w figured out they need IOMMUs).
> > > > >
> > > > > But for tag storage you know the size as it is a function of the
> > > > > memory size, right? After all, you are validating the size is correct.
> > > > > I guess there is still the aspect of whether you want enable MTE or
> > > > > not which could be done in a variety of ways.
> > > >
> > > > Oh, sorry, my bad, I should have been clearer about this. I don't want to
> > > > put it in the DT as a "linux,cma" node. But I want it to be managed by CMA.
> > >
> > > Yes, I understand, but my point remains. Why do you need this in DT?
> > > If the location doesn't matter and you can calculate the size from the
> > > memory size, what else is there to add to the DT?
> >
> > I am afraid there has been a misunderstanding. What do you mean by
> > "location doesn't matter"?
>
> You said:
> > Tag storage is not DMA and can live anywhere in memory.
>
> Which I took as the kernel can figure out where to put it. But maybe
> you meant the h/w platform can hard code it to be anywhere in memory?
> If so, then yes, DT is needed.
Ah, I see, sorry for not being clear enough, you are correct: tag storage
is a hardware property, and software needs a mechanism (in this case, the
dt) to discover its properties.
>
> > At the very least, Linux needs to know the address and size of a memory
> > region to use it. The series is about using the tag storage memory for
> > data. Tag storage cannot be described as a regular memory node because it
> > cannot be tagged (and normal memory can).
>
> If the tag storage lives in the middle of memory, then it would be
> described in the memory node, but removed by being in reserved-memory
> node.
I don't follow. Would you mind going into more details?
>
> > Then there's the matter of the tag storage block size (explained in this
> > commit message), and also knowing the memory range for which a tag storage
> > region stores the tags. This is explained in the cover letter.
>
> Honestly, I just forgot about that part.
I totally understand, there are a lot of things to consider at the same
time.
Thanks,
Alex
On Wed, Dec 13, 2023 at 11:44 AM Alexandru Elisei
<[email protected]> wrote:
>
> On Wed, Dec 13, 2023 at 11:22:17AM -0600, Rob Herring wrote:
> > On Wed, Dec 13, 2023 at 8:51 AM Alexandru Elisei
> > <[email protected]> wrote:
> > >
> > > Hi,
> > >
> > > On Wed, Dec 13, 2023 at 08:06:44AM -0600, Rob Herring wrote:
> > > > On Wed, Dec 13, 2023 at 7:05 AM Alexandru Elisei
> > > > <[email protected]> wrote:
> > > > >
> > > > > Hi Rob,
> > > > >
> > > > > On Tue, Dec 12, 2023 at 12:44:06PM -0600, Rob Herring wrote:
> > > > > > On Tue, Dec 12, 2023 at 10:38 AM Alexandru Elisei
> > > > > > <[email protected]> wrote:
> > > > > > >
> > > > > > > Hi Rob,
> > > > > > >
> > > > > > > Thank you so much for the feedback, I'm not very familiar with device tree,
> > > > > > > and any comments are very useful.
> > > > > > >
> > > > > > > On Mon, Dec 11, 2023 at 11:29:40AM -0600, Rob Herring wrote:
> > > > > > > > On Sun, Nov 19, 2023 at 10:59 AM Alexandru Elisei
> > > > > > > > <[email protected]> wrote:
> > > > > > > > >
> > > > > > > > > Allow the kernel to get the size and location of the MTE tag storage
> > > > > > > > > regions from the DTB. This memory is marked as reserved for now.
> > > > > > > > >
> > > > > > > > > The DTB node for the tag storage region is defined as:
> > > > > > > > >
> > > > > > > > > tags0: tag-storage@8f8000000 {
> > > > > > > > > compatible = "arm,mte-tag-storage";
> > > > > > > > > reg = <0x08 0xf8000000 0x00 0x4000000>;
> > > > > > > > > block-size = <0x1000>;
> > > > > > > > > memory = <&memory0>; // Associated tagged memory node
> > > > > > > > > };
> > > > > > > >
> > > > > > > > I skimmed thru the discussion some. If this memory range is within
> > > > > > > > main RAM, then it definitely belongs in /reserved-memory.
> > > > > > >
> > > > > > > Ok, will do that.
> > > > > > >
> > > > > > > If you don't mind, why do you say that it definitely belongs in
> > > > > > > reserved-memory? I'm not trying to argue otherwise, I'm curious about the
> > > > > > > motivation.
> > > > > >
> > > > > > Simply so that /memory nodes describe all possible memory and
> > > > > > /reserved-memory is just adding restrictions. It's also because
> > > > > > /reserved-memory is what gets handled early, and we don't need
> > > > > > multiple things to handle early.
> > > > > >
> > > > > > > Tag storage is not DMA and can live anywhere in memory.
> > > > > >
> > > > > > Then why put it in DT at all? The only reason CMA is there is to set
> > > > > > the size. It's not even clear to me we need CMA in DT either. The
> > > > > > reasoning long ago was the kernel didn't do a good job of moving and
> > > > > > reclaiming contiguous space, but that's supposed to be better now (and
> > > > > > most h/w figured out they need IOMMUs).
> > > > > >
> > > > > > But for tag storage you know the size as it is a function of the
> > > > > > memory size, right? After all, you are validating the size is correct.
> > > > > > I guess there is still the aspect of whether you want enable MTE or
> > > > > > not which could be done in a variety of ways.
> > > > >
> > > > > Oh, sorry, my bad, I should have been clearer about this. I don't want to
> > > > > put it in the DT as a "linux,cma" node. But I want it to be managed by CMA.
> > > >
> > > > Yes, I understand, but my point remains. Why do you need this in DT?
> > > > If the location doesn't matter and you can calculate the size from the
> > > > memory size, what else is there to add to the DT?
> > >
> > > I am afraid there has been a misunderstanding. What do you mean by
> > > "location doesn't matter"?
> >
> > You said:
> > > Tag storage is not DMA and can live anywhere in memory.
> >
> > Which I took as the kernel can figure out where to put it. But maybe
> > you meant the h/w platform can hard code it to be anywhere in memory?
> > If so, then yes, DT is needed.
>
> Ah, I see, sorry for not being clear enough, you are correct: tag storage
> is a hardware property, and software needs a mechanism (in this case, the
> dt) to discover its properties.
>
> >
> > > At the very least, Linux needs to know the address and size of a memory
> > > region to use it. The series is about using the tag storage memory for
> > > data. Tag storage cannot be described as a regular memory node because it
> > > cannot be tagged (and normal memory can).
> >
> > If the tag storage lives in the middle of memory, then it would be
> > described in the memory node, but removed by being in reserved-memory
> > node.
>
> I don't follow. Would you mind going into more details?
It goes back to what I said earlier about /memory nodes describing all
the memory. There's no reason to reserve memory if you haven't
described that range as memory to begin with. One could presumably
just have a memory node for each contiguous chunk and not need
/reserved-memory (ignoring the need to say what things are reserved
for). That would become very difficult to adjust. Note that the kernel
has a hardcoded limit of 64 reserved regions currently and that is not
enough for some people. Seems like a lot, but I have no idea how they
are (ab)using /reserved-memory.
Let me give an example. Presumably using MTE at all is configurable.
If you boot a kernel with MTE disabled (or older and not supporting
it), then I'd assume you'd want to use the tag storage for regular
memory. Well, if tag storage is already part of /memory, then all you
have to do is ignore the tag reserved-memory region. Tweaking the
memory nodes would be more work.
Also, I should point out that /memory and /reserved-memory nodes are
not used for UEFI boot.
Rob
Hi,
On Wed, Dec 13, 2023 at 02:30:42PM -0600, Rob Herring wrote:
> On Wed, Dec 13, 2023 at 11:44 AM Alexandru Elisei
> <[email protected]> wrote:
> >
> > On Wed, Dec 13, 2023 at 11:22:17AM -0600, Rob Herring wrote:
> > > On Wed, Dec 13, 2023 at 8:51 AM Alexandru Elisei
> > > <[email protected]> wrote:
> > > >
> > > > Hi,
> > > >
> > > > On Wed, Dec 13, 2023 at 08:06:44AM -0600, Rob Herring wrote:
> > > > > On Wed, Dec 13, 2023 at 7:05 AM Alexandru Elisei
> > > > > <[email protected]> wrote:
> > > > > >
> > > > > > Hi Rob,
> > > > > >
> > > > > > On Tue, Dec 12, 2023 at 12:44:06PM -0600, Rob Herring wrote:
> > > > > > > On Tue, Dec 12, 2023 at 10:38 AM Alexandru Elisei
> > > > > > > <[email protected]> wrote:
> > > > > > > >
> > > > > > > > Hi Rob,
> > > > > > > >
> > > > > > > > Thank you so much for the feedback, I'm not very familiar with device tree,
> > > > > > > > and any comments are very useful.
> > > > > > > >
> > > > > > > > On Mon, Dec 11, 2023 at 11:29:40AM -0600, Rob Herring wrote:
> > > > > > > > > On Sun, Nov 19, 2023 at 10:59 AM Alexandru Elisei
> > > > > > > > > <[email protected]> wrote:
> > > > > > > > > >
> > > > > > > > > > Allow the kernel to get the size and location of the MTE tag storage
> > > > > > > > > > regions from the DTB. This memory is marked as reserved for now.
> > > > > > > > > >
> > > > > > > > > > The DTB node for the tag storage region is defined as:
> > > > > > > > > >
> > > > > > > > > > tags0: tag-storage@8f8000000 {
> > > > > > > > > > compatible = "arm,mte-tag-storage";
> > > > > > > > > > reg = <0x08 0xf8000000 0x00 0x4000000>;
> > > > > > > > > > block-size = <0x1000>;
> > > > > > > > > > memory = <&memory0>; // Associated tagged memory node
> > > > > > > > > > };
> > > > > > > > >
> > > > > > > > > I skimmed thru the discussion some. If this memory range is within
> > > > > > > > > main RAM, then it definitely belongs in /reserved-memory.
> > > > > > > >
> > > > > > > > Ok, will do that.
> > > > > > > >
> > > > > > > > If you don't mind, why do you say that it definitely belongs in
> > > > > > > > reserved-memory? I'm not trying to argue otherwise, I'm curious about the
> > > > > > > > motivation.
> > > > > > >
> > > > > > > Simply so that /memory nodes describe all possible memory and
> > > > > > > /reserved-memory is just adding restrictions. It's also because
> > > > > > > /reserved-memory is what gets handled early, and we don't need
> > > > > > > multiple things to handle early.
> > > > > > >
> > > > > > > > Tag storage is not DMA and can live anywhere in memory.
> > > > > > >
> > > > > > > Then why put it in DT at all? The only reason CMA is there is to set
> > > > > > > the size. It's not even clear to me we need CMA in DT either. The
> > > > > > > reasoning long ago was the kernel didn't do a good job of moving and
> > > > > > > reclaiming contiguous space, but that's supposed to be better now (and
> > > > > > > most h/w figured out they need IOMMUs).
> > > > > > >
> > > > > > > But for tag storage you know the size as it is a function of the
> > > > > > > memory size, right? After all, you are validating the size is correct.
> > > > > > > I guess there is still the aspect of whether you want enable MTE or
> > > > > > > not which could be done in a variety of ways.
> > > > > >
> > > > > > Oh, sorry, my bad, I should have been clearer about this. I don't want to
> > > > > > put it in the DT as a "linux,cma" node. But I want it to be managed by CMA.
> > > > >
> > > > > Yes, I understand, but my point remains. Why do you need this in DT?
> > > > > If the location doesn't matter and you can calculate the size from the
> > > > > memory size, what else is there to add to the DT?
> > > >
> > > > I am afraid there has been a misunderstanding. What do you mean by
> > > > "location doesn't matter"?
> > >
> > > You said:
> > > > Tag storage is not DMA and can live anywhere in memory.
> > >
> > > Which I took as the kernel can figure out where to put it. But maybe
> > > you meant the h/w platform can hard code it to be anywhere in memory?
> > > If so, then yes, DT is needed.
> >
> > Ah, I see, sorry for not being clear enough, you are correct: tag storage
> > is a hardware property, and software needs a mechanism (in this case, the
> > dt) to discover its properties.
> >
> > >
> > > > At the very least, Linux needs to know the address and size of a memory
> > > > region to use it. The series is about using the tag storage memory for
> > > > data. Tag storage cannot be described as a regular memory node because it
> > > > cannot be tagged (and normal memory can).
> > >
> > > If the tag storage lives in the middle of memory, then it would be
> > > described in the memory node, but removed by being in reserved-memory
> > > node.
> >
> > I don't follow. Would you mind going into more details?
>
> It goes back to what I said earlier about /memory nodes describing all
> the memory. There's no reason to reserve memory if you haven't
> described that range as memory to begin with. One could presumably
> just have a memory node for each contiguous chunk and not need
> /reserved-memory (ignoring the need to say what things are reserved
> for). That would become very difficult to adjust. Note that the kernel
> has a hardcoded limit of 64 reserved regions currently and that is not
> enough for some people. Seems like a lot, but I have no idea how they
> are (ab)using /reserved-memory.
Ah, I see what you mean: reserved memory is about marking existing memory
(from a /memory node) as special, not about adding new memory.
After the memblock allocator is initialized, the kernel can use it for its
own allocations. Kernel allocations are not movable.
When a page is allocated as tagged, the associated tag storage cannot be
used for data, otherwise the tags would corrupt that data. To avoid this,
the requirement is that tag storage pages are only used for movable
allocations. When a page is allocated as tagged, the data in the associated
tag storage is migrated and the tag storage is taken from the page
allocator (via alloc_contig_range()).
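As a rough illustration of the reservation step just described (a
hypothetical helper, not the series' code): the tag storage block stays
movable until a tagged allocation needs it, at which point
alloc_contig_range() migrates whatever movable data occupies it and takes
it from the page allocator. The MIGRATE_CMA migratetype follows the cover
letter's approach of exposing tag storage as MIGRATE_CMA.

    #include <linux/gfp.h>
    #include <linux/align.h>

    static int reserve_tag_block(unsigned long tag_pfn, unsigned long block_size)
    {
            unsigned long start = ALIGN_DOWN(tag_pfn, block_size);

            /* Migrates any movable data currently in the block. */
            return alloc_contig_range(start, start + block_size,
                                      MIGRATE_CMA, GFP_KERNEL);
    }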
My understanding is that the memblock allocator can use all the memory from
a /memory node. If the tag storage memory is declared in a /memory node,
there exists the possibility that Linux will use tag storage memory for its
own allocations, which would make that tag storage memory unmovable, and
thus unusable for storing tags.
Looking at early_init_dt_scan_memory(), even if a /memory node is marked as
hotpluggable, memblock will still use it, unless "movable_node" is set on
the kernel command line.
That's the reason why I'm not describing tag storage in a /memory node. Is
there a way to tell the memblock allocator not to use memory from a /memory
node?
>
> Let me give an example. Presumably using MTE at all is configurable.
> If you boot a kernel with MTE disabled (or older and not supporting
> it), then I'd assume you'd want to use the tag storage for regular
> memory. Well, If tag storage is already part of /memory, then all you
> have to do is ignore the tag reserved-memory region. Tweaking the
> memory nodes would be more work.
Right now, the tag storage memory is reserved via memblock_reserve(), and if
MTE is disabled (for example, via the kernel command line), the code calls
free_reserved_page() for each tag storage page. I find that straightforward
to implement.
Thanks,
Alex
>
>
> Also, I should point out that /memory and /reserved-memory nodes are
> not used for UEFI boot.
>
> Rob
>
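And a similarly hypothetical sketch of the fallback mentioned above: if MTE
ends up disabled, hand the memblock_reserve()'d tag storage back to the
buddy allocator page by page (names are illustrative, not the series' code):

    #include <linux/mm.h>
    #include <linux/range.h>

    static void __init release_tag_storage(const struct range *tag_range)
    {
            unsigned long pfn;

            for (pfn = tag_range->start; pfn <= tag_range->end; pfn++)
                    free_reserved_page(pfn_to_page(pfn));
    }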
On Thu, Dec 14, 2023 at 9:45 AM Alexandru Elisei
<[email protected]> wrote:
>
> Hi,
>
> On Wed, Dec 13, 2023 at 02:30:42PM -0600, Rob Herring wrote:
> > On Wed, Dec 13, 2023 at 11:44 AM Alexandru Elisei
> > <[email protected]> wrote:
> > >
> > > On Wed, Dec 13, 2023 at 11:22:17AM -0600, Rob Herring wrote:
> > > > On Wed, Dec 13, 2023 at 8:51 AM Alexandru Elisei
> > > > <[email protected]> wrote:
> > > > >
> > > > > Hi,
> > > > >
> > > > > On Wed, Dec 13, 2023 at 08:06:44AM -0600, Rob Herring wrote:
> > > > > > On Wed, Dec 13, 2023 at 7:05 AM Alexandru Elisei
> > > > > > <[email protected]> wrote:
> > > > > > >
> > > > > > > Hi Rob,
> > > > > > >
> > > > > > > On Tue, Dec 12, 2023 at 12:44:06PM -0600, Rob Herring wrote:
> > > > > > > > On Tue, Dec 12, 2023 at 10:38 AM Alexandru Elisei
> > > > > > > > <[email protected]> wrote:
> > > > > > > > >
> > > > > > > > > Hi Rob,
> > > > > > > > >
> > > > > > > > > Thank you so much for the feedback, I'm not very familiar with device tree,
> > > > > > > > > and any comments are very useful.
> > > > > > > > >
> > > > > > > > > On Mon, Dec 11, 2023 at 11:29:40AM -0600, Rob Herring wrote:
> > > > > > > > > > On Sun, Nov 19, 2023 at 10:59 AM Alexandru Elisei
> > > > > > > > > > <[email protected]> wrote:
> > > > > > > > > > >
> > > > > > > > > > > Allow the kernel to get the size and location of the MTE tag storage
> > > > > > > > > > > regions from the DTB. This memory is marked as reserved for now.
> > > > > > > > > > >
> > > > > > > > > > > The DTB node for the tag storage region is defined as:
> > > > > > > > > > >
> > > > > > > > > > > tags0: tag-storage@8f8000000 {
> > > > > > > > > > > compatible = "arm,mte-tag-storage";
> > > > > > > > > > > reg = <0x08 0xf8000000 0x00 0x4000000>;
> > > > > > > > > > > block-size = <0x1000>;
> > > > > > > > > > > memory = <&memory0>; // Associated tagged memory node
> > > > > > > > > > > };
> > > > > > > > > >
> > > > > > > > > > I skimmed thru the discussion some. If this memory range is within
> > > > > > > > > > main RAM, then it definitely belongs in /reserved-memory.
> > > > > > > > >
> > > > > > > > > Ok, will do that.
> > > > > > > > >
> > > > > > > > > If you don't mind, why do you say that it definitely belongs in
> > > > > > > > > reserved-memory? I'm not trying to argue otherwise, I'm curious about the
> > > > > > > > > motivation.
> > > > > > > >
> > > > > > > > Simply so that /memory nodes describe all possible memory and
> > > > > > > > /reserved-memory is just adding restrictions. It's also because
> > > > > > > > /reserved-memory is what gets handled early, and we don't need
> > > > > > > > multiple things to handle early.
> > > > > > > >
> > > > > > > > > Tag storage is not DMA and can live anywhere in memory.
> > > > > > > >
> > > > > > > > Then why put it in DT at all? The only reason CMA is there is to set
> > > > > > > > the size. It's not even clear to me we need CMA in DT either. The
> > > > > > > > reasoning long ago was the kernel didn't do a good job of moving and
> > > > > > > > reclaiming contiguous space, but that's supposed to be better now (and
> > > > > > > > most h/w figured out they need IOMMUs).
> > > > > > > >
> > > > > > > > But for tag storage you know the size as it is a function of the
> > > > > > > > memory size, right? After all, you are validating the size is correct.
> > > > > > > > I guess there is still the aspect of whether you want enable MTE or
> > > > > > > > not which could be done in a variety of ways.
> > > > > > >
> > > > > > > Oh, sorry, my bad, I should have been clearer about this. I don't want to
> > > > > > > put it in the DT as a "linux,cma" node. But I want it to be managed by CMA.
> > > > > >
> > > > > > Yes, I understand, but my point remains. Why do you need this in DT?
> > > > > > If the location doesn't matter and you can calculate the size from the
> > > > > > memory size, what else is there to add to the DT?
> > > > >
> > > > > I am afraid there has been a misunderstanding. What do you mean by
> > > > > "location doesn't matter"?
> > > >
> > > > You said:
> > > > > Tag storage is not DMA and can live anywhere in memory.
> > > >
> > > > Which I took as the kernel can figure out where to put it. But maybe
> > > > you meant the h/w platform can hard code it to be anywhere in memory?
> > > > If so, then yes, DT is needed.
> > >
> > > Ah, I see, sorry for not being clear enough, you are correct: tag storage
> > > is a hardware property, and software needs a mechanism (in this case, the
> > > dt) to discover its properties.
> > >
> > > >
> > > > > At the very least, Linux needs to know the address and size of a memory
> > > > > region to use it. The series is about using the tag storage memory for
> > > > > data. Tag storage cannot be described as a regular memory node because it
> > > > > cannot be tagged (and normal memory can).
> > > >
> > > > If the tag storage lives in the middle of memory, then it would be
> > > > described in the memory node, but removed by being in reserved-memory
> > > > node.
> > >
> > > I don't follow. Would you mind going into more details?
> >
> > It goes back to what I said earlier about /memory nodes describing all
> > the memory. There's no reason to reserve memory if you haven't
> > described that range as memory to begin with. One could presumably
> > just have a memory node for each contiguous chunk and not need
> > /reserved-memory (ignoring the need to say what things are reserved
> > for). That would become very difficult to adjust. Note that the kernel
> > has a hardcoded limit of 64 reserved regions currently and that is not
> > enough for some people. Seems like a lot, but I have no idea how they
> > are (ab)using /reserved-memory.
>
> Ah, I see what you mean, reserved memory is about marking existing memory
> (from a /memory node) as special, not about adding new memory.
>
> After the memblock allocator is initialized, the kernel can use it for its
> own allocations. Kernel allocations are not movable.
>
> When a page is allocated as tagged, the associated tag storage cannot be
> used for data, otherwise the tags would corrupt that data. To avoid this,
> the requirement is that tag storage pages are only used for movable
> allocations. When a page is allocated as tagged, the data in the associated
> tag storage is migrated and the tag storage is taken from the page
> allocator (via alloc_contig_range()).
>
> My understanding is that the memblock allocator can use all the memory from
> a /memory node. If the tags storage memory is declared in a /memory node,
> there exists the possibility that Linux will use tag storage memory for its
> own allocation, which would make that tags storage memory unmovable, and
> thus unusable for storing tags.
No, because the tag storage would be reserved in /reserved-memory.
Of course, the arch code could do something between scanning /memory
nodes and /reserved-memory, but that would be broken arch code.
Ideally, there wouldn't be any arch code in between those 2 points,
but it's complicated. It used to mainly be powerpc, but we keep adding
to the complexity on arm64.
> Looking at early_init_dt_scan_memory(), even if a /memory node if marked at
> hotpluggable, memblock will still use it, unless "movable_node" is set on
> the kernel command line.
>
> That's the reason why I'm not describing tag storage in a /memory node. Is
> there way to tell the memblock allocator not to use memory from a /memory
> node?
>
> >
> > Let me give an example. Presumably using MTE at all is configurable.
> > If you boot a kernel with MTE disabled (or older and not supporting
> > it), then I'd assume you'd want to use the tag storage for regular
> > memory. Well, If tag storage is already part of /memory, then all you
> > have to do is ignore the tag reserved-memory region. Tweaking the
> > memory nodes would be more work.
>
> Right now, memory is added via memblock_reserve(), and if MTE is disabled
> (for example, via the kernel command line), the code calls
> free_reserved_page() for each tag storage page. I find that straightfoward
> to implement.
But better to just not reserve the region in the first place. Also, it
needs to be simple enough to back port.
Also, does free_reserved_page() work on ranges outside of memblock
range (e.g. beyond end_of_DRAM())? If the tag storage happened to live
at the end of DRAM and you shorten the /memory node size to remove tag
storage, is it still going to work?
Rob
Hi,
On Thu, Dec 14, 2023 at 12:55:14PM -0600, Rob Herring wrote:
> On Thu, Dec 14, 2023 at 9:45 AM Alexandru Elisei
> <[email protected]> wrote:
> >
> > Hi,
> >
> > On Wed, Dec 13, 2023 at 02:30:42PM -0600, Rob Herring wrote:
> > > On Wed, Dec 13, 2023 at 11:44 AM Alexandru Elisei
> > > <[email protected]> wrote:
> > > >
> > > > On Wed, Dec 13, 2023 at 11:22:17AM -0600, Rob Herring wrote:
> > > > > On Wed, Dec 13, 2023 at 8:51 AM Alexandru Elisei
> > > > > <[email protected]> wrote:
> > > > > >
> > > > > > Hi,
> > > > > >
> > > > > > On Wed, Dec 13, 2023 at 08:06:44AM -0600, Rob Herring wrote:
> > > > > > > On Wed, Dec 13, 2023 at 7:05 AM Alexandru Elisei
> > > > > > > <[email protected]> wrote:
> > > > > > > >
> > > > > > > > Hi Rob,
> > > > > > > >
> > > > > > > > On Tue, Dec 12, 2023 at 12:44:06PM -0600, Rob Herring wrote:
> > > > > > > > > On Tue, Dec 12, 2023 at 10:38 AM Alexandru Elisei
> > > > > > > > > <[email protected]> wrote:
> > > > > > > > > >
> > > > > > > > > > Hi Rob,
> > > > > > > > > >
> > > > > > > > > > Thank you so much for the feedback, I'm not very familiar with device tree,
> > > > > > > > > > and any comments are very useful.
> > > > > > > > > >
> > > > > > > > > > On Mon, Dec 11, 2023 at 11:29:40AM -0600, Rob Herring wrote:
> > > > > > > > > > > On Sun, Nov 19, 2023 at 10:59 AM Alexandru Elisei
> > > > > > > > > > > <[email protected]> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > Allow the kernel to get the size and location of the MTE tag storage
> > > > > > > > > > > > regions from the DTB. This memory is marked as reserved for now.
> > > > > > > > > > > >
> > > > > > > > > > > > The DTB node for the tag storage region is defined as:
> > > > > > > > > > > >
> > > > > > > > > > > > tags0: tag-storage@8f8000000 {
> > > > > > > > > > > > compatible = "arm,mte-tag-storage";
> > > > > > > > > > > > reg = <0x08 0xf8000000 0x00 0x4000000>;
> > > > > > > > > > > > block-size = <0x1000>;
> > > > > > > > > > > > memory = <&memory0>; // Associated tagged memory node
> > > > > > > > > > > > };
> > > > > > > > > > >
> > > > > > > > > > > I skimmed thru the discussion some. If this memory range is within
> > > > > > > > > > > main RAM, then it definitely belongs in /reserved-memory.
> > > > > > > > > >
> > > > > > > > > > Ok, will do that.
> > > > > > > > > >
> > > > > > > > > > If you don't mind, why do you say that it definitely belongs in
> > > > > > > > > > reserved-memory? I'm not trying to argue otherwise, I'm curious about the
> > > > > > > > > > motivation.
> > > > > > > > >
> > > > > > > > > Simply so that /memory nodes describe all possible memory and
> > > > > > > > > /reserved-memory is just adding restrictions. It's also because
> > > > > > > > > /reserved-memory is what gets handled early, and we don't need
> > > > > > > > > multiple things to handle early.
> > > > > > > > >
> > > > > > > > > > Tag storage is not DMA and can live anywhere in memory.
> > > > > > > > >
> > > > > > > > > Then why put it in DT at all? The only reason CMA is there is to set
> > > > > > > > > the size. It's not even clear to me we need CMA in DT either. The
> > > > > > > > > reasoning long ago was the kernel didn't do a good job of moving and
> > > > > > > > > reclaiming contiguous space, but that's supposed to be better now (and
> > > > > > > > > most h/w figured out they need IOMMUs).
> > > > > > > > >
> > > > > > > > > But for tag storage you know the size as it is a function of the
> > > > > > > > > memory size, right? After all, you are validating the size is correct.
> > > > > > > > > I guess there is still the aspect of whether you want enable MTE or
> > > > > > > > > not which could be done in a variety of ways.
> > > > > > > >
> > > > > > > > Oh, sorry, my bad, I should have been clearer about this. I don't want to
> > > > > > > > put it in the DT as a "linux,cma" node. But I want it to be managed by CMA.
> > > > > > >
> > > > > > > Yes, I understand, but my point remains. Why do you need this in DT?
> > > > > > > If the location doesn't matter and you can calculate the size from the
> > > > > > > memory size, what else is there to add to the DT?
> > > > > >
> > > > > > I am afraid there has been a misunderstanding. What do you mean by
> > > > > > "location doesn't matter"?
> > > > >
> > > > > You said:
> > > > > > Tag storage is not DMA and can live anywhere in memory.
> > > > >
> > > > > Which I took as the kernel can figure out where to put it. But maybe
> > > > > you meant the h/w platform can hard code it to be anywhere in memory?
> > > > > If so, then yes, DT is needed.
> > > >
> > > > Ah, I see, sorry for not being clear enough, you are correct: tag storage
> > > > is a hardware property, and software needs a mechanism (in this case, the
> > > > dt) to discover its properties.
> > > >
> > > > >
> > > > > > At the very least, Linux needs to know the address and size of a memory
> > > > > > region to use it. The series is about using the tag storage memory for
> > > > > > data. Tag storage cannot be described as a regular memory node because it
> > > > > > cannot be tagged (and normal memory can).
> > > > >
> > > > > If the tag storage lives in the middle of memory, then it would be
> > > > > described in the memory node, but removed by being in reserved-memory
> > > > > node.
> > > >
> > > > I don't follow. Would you mind going into more details?
> > >
> > > It goes back to what I said earlier about /memory nodes describing all
> > > the memory. There's no reason to reserve memory if you haven't
> > > described that range as memory to begin with. One could presumably
> > > just have a memory node for each contiguous chunk and not need
> > > /reserved-memory (ignoring the need to say what things are reserved
> > > for). That would become very difficult to adjust. Note that the kernel
> > > has a hardcoded limit of 64 reserved regions currently and that is not
> > > enough for some people. Seems like a lot, but I have no idea how they
> > > are (ab)using /reserved-memory.
> >
> > Ah, I see what you mean, reserved memory is about marking existing memory
> > (from a /memory node) as special, not about adding new memory.
> >
> > After the memblock allocator is initialized, the kernel can use it for its
> > own allocations. Kernel allocations are not movable.
> >
> > When a page is allocated as tagged, the associated tag storage cannot be
> > used for data, otherwise the tags would corrupt that data. To avoid this,
> > the requirement is that tag storage pages are only used for movable
> > allocations. When a page is allocated as tagged, the data in the associated
> > tag storage is migrated and the tag storage is taken from the page
> > allocator (via alloc_contig_range()).
> >
> > My understanding is that the memblock allocator can use all the memory from
> > a /memory node. If the tag storage memory is declared in a /memory node,
> > there exists the possibility that Linux will use tag storage memory for its
> > own allocations, which would make that tag storage memory unmovable, and
> > thus unusable for storing tags.
>
> No, because the tag storage would be reserved in /reserved-memory.
>
> Of course, the arch code could do something between scanning /memory
> nodes and /reserved-memory, but that would be broken arch code.
> Ideally, there wouldn't be any arch code in between those 2 points,
> but it's complicated. It used to mainly be powerpc, but we keep adding
> to the complexity on arm64.
Ah, yes, that's what I was referring to: the fact that the memory nodes are
parsed in setup_arch() -> setup_machine_fdt() -> early_init_dt_scan(), and the
reserved memory is parsed later, in setup_arch() -> arm64_memblock_init().
If the rule is that no memblock allocations can take place between
setup_machine_fdt() and arm64_memblock_init(), then putting tag storage in a
/memory node will work, thank you for the clarification.
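For context, this is roughly the window being discussed (a simplified sketch
of setup_arch() in arch/arm64/kernel/setup.c, not verbatim kernel code; the
exact contents vary between kernel versions):

void __init setup_arch(char **cmdline_p)
{
	/* ... */

	/* Scans the devicetree, including the /memory nodes
	 * (early_init_dt_scan() ends up calling memblock_add()). */
	setup_machine_fdt(__fdt_pointer);

	/* ... */

	/* Scans /reserved-memory (early_init_fdt_scan_reserved_mem())
	 * and applies the reservations. */
	arm64_memblock_init();

	/* ... */
}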
>
> > Looking at early_init_dt_scan_memory(), even if a /memory node is marked
> > as hotpluggable, memblock will still use it, unless "movable_node" is set
> > on the kernel command line.
> >
> > That's the reason why I'm not describing tag storage in a /memory node.
> > Is there a way to tell the memblock allocator not to use memory from a
> > /memory node?
> >
> > >
> > > Let me give an example. Presumably using MTE at all is configurable.
> > > If you boot a kernel with MTE disabled (or older and not supporting
> > > it), then I'd assume you'd want to use the tag storage for regular
> > > memory. Well, if tag storage is already part of /memory, then all you
> > > have to do is ignore the tag reserved-memory region. Tweaking the
> > > memory nodes would be more work.
> >
> > Right now, memory is added via memblock_reserve(), and if MTE is disabled
> > (for example, via the kernel command line), the code calls
> > free_reserved_page() for each tag storage page. I find that straightforward
> > to implement.
>
> But better to just not reserve the region in the first place. Also, it
> needs to be simple enough to back port.
I don't think that works - reserved memory is parsed in setup_arch() ->
arm64_memblock_init(), and the CPU capabilities (which determine whether MTE
is enabled) are initialized later, in smp_prepare_boot_cpu().
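To illustrate the ordering (a rough sketch of the relevant calls in
init/main.c:start_kernel(), simplified; exact call sites differ between
kernel versions):

void __init start_kernel(void)
{
	/* ... */
	setup_arch(&command_line);	/* /memory and /reserved-memory parsed here */
	/* ... */
	smp_prepare_boot_cpu();		/* arm64 initializes the boot CPU capabilities
					 * here, so system_supports_mte() is only
					 * meaningful from this point on */
	/* ... */
}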
>
> Also, does free_reserved_page() work on ranges outside of memblock
> range (e.g. beyond end_of_DRAM())? If the tag storage happened to live
> at the end of DRAM and you shorten the /memory node size to remove tag
> storage, is it still going to work?
Tag storage memory is discovered in two stages: first it is added to memblock
with memblock_add(), then it is reserved with memblock_reserve(). This is
performed in setup_arch(), after setup_machine_fdt() and before
arm64_memblock_init(). The tag storage code keeps an array of the discovered
tag regions. This is implemented in this patch.
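In other words, for each region described in the devicetree, roughly the
following is done (a sketch; the devicetree parsing and the bookkeeping of
the region array are elided, and 'region' is just an illustrative
placeholder):

	memblock_add(region->start, region->size);	/* known as memory... */
	memblock_reserve(region->start, region->size);	/* ...but kept out of early allocations */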
The next patch [1] adds an arch_initcall that checks whether
memblock_end_of_DRAM() is below the upper address of a tag storage region.
If that is the case, the tag storage memory is kept reserved and remains
unused by the kernel.
The next check is whether MTE is enabled: if it is not, the pages are
unreserved with free_reserved_page().
Finally, if all the checks pass, the tag storage pages are put on the
MIGRATE_CMA lists with init_cma_reserved_pageblock().
[1] https://lore.kernel.org/all/[email protected]/
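For reference, a condensed sketch of that initcall flow (simplified; the
struct and iterator names below are made up for illustration, only
memblock_end_of_DRAM(), system_supports_mte(), free_reserved_page(),
init_cma_reserved_pageblock() and arch_initcall() are the real interfaces):

#include <linux/init.h>
#include <linux/memblock.h>
#include <linux/mm.h>

static int __init tag_storage_sketch_init(void)
{
	struct tag_region *region;	/* hypothetical per-region descriptor */
	struct page *page;

	/* If any region lies beyond the end of DRAM, keep all of the tag
	 * storage reserved and unused by the kernel. */
	for_each_tag_region(region)			/* hypothetical iterator */
		if (region->end > memblock_end_of_DRAM())
			return 0;

	if (!system_supports_mte()) {
		/* MTE disabled: hand the pages back as normal memory. */
		for_each_tag_storage_page(region, page)	/* hypothetical */
			free_reserved_page(page);
		return 0;
	}

	/* MTE enabled: release the tag storage to the page allocator as
	 * MIGRATE_CMA pageblocks. */
	for_each_tag_storage_pageblock(region, page)	/* hypothetical */
		init_cma_reserved_pageblock(page);

	return 0;
}
arch_initcall(tag_storage_sketch_init);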
Thanks,
Alex
>
> Rob