2023-08-23 16:33:35

by Alexandru Elisei

Subject: [PATCH RFC 00/37] Add support for arm64 MTE dynamic tag storage reuse

Introduction
============

Arm has implemented memory coloring in hardware; the feature is called the
Memory Tagging Extension (MTE). It works by embedding a 4-bit tag in bits
59..56 of a pointer, and storing this tag in a reserved memory location.
When the pointer is dereferenced, the hardware compares the tag embedded in
the pointer (logical tag) with the tag stored in memory (allocation tag).

The relationship between a memory location and the place where its tag is
stored is static.
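
For reference, this is roughly how the feature is exercised from userspace
today (illustrative only, loosely following the kernel's MTE documentation;
it assumes an MTE-capable CPU and is compiled with -march=armv8.5-a+memtag):

#include <stdlib.h>
#include <sys/mman.h>
#include <sys/prctl.h>
#include <linux/prctl.h>

#ifndef PROT_MTE
#define PROT_MTE 0x20
#endif

int main(void)
{
	unsigned char *p;

	/* Enable the tagged address ABI and synchronous tag check faults. */
	if (prctl(PR_SET_TAGGED_ADDR_CTRL,
		  PR_TAGGED_ADDR_ENABLE | PR_MTE_TCF_SYNC |
		  (0xfffe << PR_MTE_TAG_SHIFT), 0, 0, 0))
		return EXIT_FAILURE;

	p = mmap(NULL, 4096, PROT_READ | PROT_WRITE | PROT_MTE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return EXIT_FAILURE;

	/* Insert a random logical tag in bits 59..56 of the pointer... */
	asm volatile("irg %0, %0" : "+r" (p));
	/* ...and set the matching allocation tag for the first 16-byte granule. */
	asm volatile("stg %0, [%0]" : : "r" (p) : "memory");

	p[0] = 42;	/* logical and allocation tags match: no fault */

	return EXIT_SUCCESS;
}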

The memory where the tags are stored has so far been inaccessible to Linux.
This series aims to change that by adding support for using the tag storage
memory only as data memory; tag storage memory cannot itself be tagged.


Implementation
==============

The series is based on v6.5-rc3 with these two patches cherry-picked:

- mm: Call arch_swap_restore() from unuse_pte():

https://lore.kernel.org/all/[email protected]/

- arm64: mte: Simplify swap tag restoration logic:

https://lore.kernel.org/all/[email protected]/

The above two patches are queued for the v6.6 merge window:

https://lore.kernel.org/all/[email protected]/

The entire series, including the above patches, can be cloned with:

$ git clone https://gitlab.arm.com/linux-arm/linux-ae.git \
-b arm-mte-dynamic-carveout-rfc-v1

On the arm64 architecture side, an extension is being worked on that will
clarify how MTE tag storage reuse should behave. The extension will be
made public soon.

On the Linux side, MTE tag storage reuse is accomplished with the
following changes:

1. The tag storage memory is exposed to the memory allocator as a new
migratetype, MIGRATE_METADATA. It behaves similarly to MIGRATE_CMA, with
the restriction that it cannot be used to allocate tagged memory (tag
storage memory cannot be tagged). On tagged page allocation, the
corresponding tag storage is reserved via alloc_contig_range(); see the
sketch after this list.

2. mprotect(PROT_MTE) is implemented by changing the pte prot to
PAGE_METADATA_NONE. When the page is next accessed, a fault is taken and
the corresponding tag storage is reserved.

3. When the code tries to copy tags to a page which doesn't have the tag
storage reserved, the tags are copied to an xarray and restored in
set_pte_at(), when the page is eventually mapped with the tag storage
reserved.
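
As a rough sketch of the reservation step in change 1 (this is not the
actual patch: tag_storage_range_of() is a hypothetical stand-in for the
arm64 code that computes the linear tag storage range, and the real series
additionally tracks and refcounts reserved blocks):

/*
 * Reserve the tag storage backing a newly allocated tagged page by
 * migrating away whatever data currently lives in it.
 */
static int reserve_tag_storage_for_page(struct page *page)
{
	unsigned long start_pfn, end_pfn;

	tag_storage_range_of(page, &start_pfn, &end_pfn);

	return alloc_contig_range(start_pfn, end_pfn, MIGRATE_METADATA,
				  GFP_KERNEL);
}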

KVM support has not been implemented yet; that is because a non-MTE enabled
VMA can back the memory of an MTE-enabled VM. I will add it after there is
consensus on the right approach for the memory management support.

Explanations for the last two changes follow. The gist of it is that they
were added mostly because of races, and it is my intention to make the code
more robust.

PAGE_METADATA_NONE was introduced to avoid races with mprotect(PROT_MTE).
For example, migration can race with mprotect(PROT_MTE):
- thread 0 initiates migration for a page in a non-MTE enabled VMA and a
destination page is allocated without tag storage.
- thread 1 handles an mprotect(PROT_MTE), the VMA becomes tagged, and an
access turns the source page that is in the process of being migrated
into a tagged page.
- thread 0 finishes migration and the destination page is mapped as tagged,
but without tag storage reserved.
More details and examples can be found in the patches.

This race is also related to how tag restoring is handled when tag storage
is missing: when a tagged page is swapped out, the tags are saved in an
xarray indexed by swp_entry.val. When a page is swapped back in, if there
are tags corresponding to the swp_entry that the page will replace, the
tags are unconditionally restored, even if the page will be mapped as
untagged. Because the page will be mapped as untagged, tag storage was not
reserved when the page was allocated, even though the swp_entry it replaces
has tags associated with it.

To get around this, save the tags in a new xarray, this time indexed by
pfn, and restore them when the same page is mapped as tagged.
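
A minimal sketch of that idea (not the actual patch: the helper names are
made up, the mte_*_page_tags()/mte_*_tag_storage() calls are modelled on
the existing arch/arm64 helpers, and locking is omitted):

/* Tags waiting for their page to get tag storage, indexed by pfn. */
static DEFINE_XARRAY(tags_by_pfn);

/* Stash a buffer of tags for the page at @pfn until it can be tagged. */
static int mte_stash_tags_for_pfn(unsigned long pfn, void *tags)
{
	void *ret = xa_store(&tags_by_pfn, pfn, tags, GFP_KERNEL);

	return xa_is_err(ret) ? xa_err(ret) : 0;
}

/*
 * Called from set_pte_at(), once the page is finally mapped with its tag
 * storage reserved.
 */
static void mte_restore_tags_by_pfn(struct page *page)
{
	void *tags = xa_erase(&tags_by_pfn, page_to_pfn(page));

	if (tags) {
		mte_restore_page_tags(page_address(page), tags);
		set_page_mte_tagged(page);
		mte_free_tag_storage(tags);
	}
}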

This also solves another race, this time with copy_highpage(). In the
scenario where migration races with mprotect(PROT_MTE), before the page is
mapped, the contents of the source page are copied to the destination, and
this includes the tags, which will be copied to a page with missing tag
storage; this can lead to data corruption if the missing tag storage is in
use for data. So copy_highpage() has received a similar treatment to the
swap code, and the source tags are copied into the xarray indexed by the
destination page's pfn.
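
For copy_highpage() the direction is roughly the following (again a sketch,
not the actual patch; page_tag_storage_reserved() is from this series and
mte_stash_tags_for_pfn() is the helper sketched above):

void copy_highpage(struct page *to, struct page *from)
{
	void *kto = page_address(to);
	void *kfrom = page_address(from);

	copy_page(kto, kfrom);

	if (system_supports_mte() && page_mte_tagged(from)) {
		if (page_tag_storage_reserved(to)) {
			mte_copy_page_tags(kto, kfrom);
			set_page_mte_tagged(to);
		} else {
			void *tags = mte_allocate_tag_storage();

			/* Stash the source tags under the destination pfn. */
			if (tags) {
				mte_save_page_tags(kfrom, tags);
				mte_stash_tags_for_pfn(page_to_pfn(to), tags);
			}
		}
	}
}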


Overview of the patches
=======================

Patches 1-3 do some preparatory work by renaming a few functions and a gfp
flag.

Patches 4-12 are arch independent and introduce MIGRATE_METADATA to the
page allocator.

Patches 13-18 are arm64 specific and add support for detecting the tag
storage region and onlining it with the MIGRATE_METADATA migratetype.

Patches 19-24 are arch independent and modify the page allocator to call
back into arch-dependent functions to reserve metadata storage for an
allocation which requires metadata.

Patches 25-28 are mostly arm64 specific and implement the reservation and
freeing of tag storage on tagged page allocation. Patch #28 ("mm: sched:
Introduce PF_MEMALLOC_ISOLATE") adds a new flag to current,
PF_MEMALLOC_ISOLATE, which makes the page isolation limits be ignored; it
is used by arm64 when reserving tag storage in the same patch.

Patches 29-30 add arch independent support for doing mprotect(PROT_MTE)
when metadata storage is enabled.

Patches 31-37 are mostly arm64 specific and handle the restoring of tags
when tag storage is missing. The exceptions are patch 32 (which adds the
arch_swap_prepare_to_restore() function) and patch 35 (which adds
PAGE_METADATA_NONE support for THPs).

Testing
=======

To enable MTE dynamic tag storage:

- CONFIG_ARM64_MTE_TAG_STORAGE=y
- system_supports_mte() returns true
- kasan_hw_tags_enabled() returns false
- correct DTB node (for the specification, see commit "arm64: mte: Reserve tag
storage memory")

Check dmesg for the message "MTE tag storage enabled" or grep for metadata
in /proc/vmstat.
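
For example (the metadata counters themselves are added by this series):

$ dmesg | grep 'MTE tag storage'
$ grep -i metadata /proc/vmstat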

I've tested the series using FVP with MTE enabled, but without support for
dynamic tag storage reuse. To simulate it, I've added two fake tag storage
regions in the DTB by splitting a 2GB region roughly into 33 slices of size
0x3e0_0000, and using 32 of them for tagged memory and one slice for tag
storage:

diff --git a/arch/arm64/boot/dts/arm/fvp-base-revc.dts b/arch/arm64/boot/dts/arm/fvp-base-revc.dts
index 60472d65a355..bd050373d6cf 100644
--- a/arch/arm64/boot/dts/arm/fvp-base-revc.dts
+++ b/arch/arm64/boot/dts/arm/fvp-base-revc.dts
@@ -165,10 +165,28 @@ C1_L2: l2-cache1 {
};
};

- memory@80000000 {
+ memory0: memory@80000000 {
device_type = "memory";
- reg = <0x00000000 0x80000000 0 0x80000000>,
- <0x00000008 0x80000000 0 0x80000000>;
+ reg = <0x00 0x80000000 0x00 0x7c000000>;
+ };
+
+ metadata0: metadata@c0000000 {
+ compatible = "arm,mte-tag-storage";
+ reg = <0x00 0xfc000000 0x00 0x3e00000>;
+ block-size = <0x1000>;
+ memory = <&memory0>;
+ };
+
+ memory1: memory@880000000 {
+ device_type = "memory";
+ reg = <0x08 0x80000000 0x00 0x7c000000>;
+ };
+
+ metadata1: metadata@8c0000000 {
+ compatible = "arm,mte-tag-storage";
+ reg = <0x08 0xfc000000 0x00 0x3e00000>;
+ block-size = <0x1000>;
+ memory = <&memory1>;
};

reserved-memory {


Alexandru Elisei (37):
mm: page_alloc: Rename gfp_to_alloc_flags_cma ->
gfp_to_alloc_flags_fast
arm64: mte: Rework naming for tag manipulation functions
arm64: mte: Rename __GFP_ZEROTAGS to __GFP_TAGGED
mm: Add MIGRATE_METADATA allocation policy
mm: Add memory statistics for the MIGRATE_METADATA allocation policy
mm: page_alloc: Allocate from movable pcp lists only if
ALLOC_FROM_METADATA
mm: page_alloc: Bypass pcp when freeing MIGRATE_METADATA pages
mm: compaction: Account for free metadata pages in
__compact_finished()
mm: compaction: Handle metadata pages as source for direct compaction
mm: compaction: Do not use MIGRATE_METADATA to replace pages with
metadata
mm: migrate/mempolicy: Allocate metadata-enabled destination page
mm: gup: Don't allow longterm pinning of MIGRATE_METADATA pages
arm64: mte: Reserve tag storage memory
arm64: mte: Expose tag storage pages to the MIGRATE_METADATA freelist
arm64: mte: Make tag storage depend on ARCH_KEEP_MEMBLOCK
arm64: mte: Move tag storage to MIGRATE_MOVABLE when MTE is disabled
arm64: mte: Disable dynamic tag storage management if HW KASAN is
enabled
arm64: mte: Check that tag storage blocks are in the same zone
mm: page_alloc: Manage metadata storage on page allocation
mm: compaction: Reserve metadata storage in compaction_alloc()
mm: khugepaged: Handle metadata-enabled VMAs
mm: shmem: Allocate metadata storage for in-memory filesystems
mm: Teach vma_alloc_folio() about metadata-enabled VMAs
mm: page_alloc: Teach alloc_contig_range() about MIGRATE_METADATA
arm64: mte: Manage tag storage on page allocation
arm64: mte: Perform CMOs for tag blocks on tagged page allocation/free
arm64: mte: Reserve tag block for the zero page
mm: sched: Introduce PF_MEMALLOC_ISOLATE
mm: arm64: Define the PAGE_METADATA_NONE page protection
mm: mprotect: arm64: Set PAGE_METADATA_NONE for mprotect(PROT_MTE)
mm: arm64: Set PAGE_METADATA_NONE in set_pte_at() if missing metadata
storage
mm: Call arch_swap_prepare_to_restore() before arch_swap_restore()
arm64: mte: swap/copypage: Handle tag restoring when missing tag
storage
arm64: mte: Handle fatal signal in reserve_metadata_storage()
mm: hugepage: Handle PAGE_METADATA_NONE faults for huge pages
KVM: arm64: Disable MTE if tag storage is enabled
arm64: mte: Enable tag storage management

arch/arm64/Kconfig | 13 +
arch/arm64/include/asm/assembler.h | 10 +
arch/arm64/include/asm/memory_metadata.h | 49 ++
arch/arm64/include/asm/mte-def.h | 16 +-
arch/arm64/include/asm/mte.h | 40 +-
arch/arm64/include/asm/mte_tag_storage.h | 36 ++
arch/arm64/include/asm/page.h | 5 +-
arch/arm64/include/asm/pgtable-prot.h | 2 +
arch/arm64/include/asm/pgtable.h | 33 +-
arch/arm64/kernel/Makefile | 1 +
arch/arm64/kernel/elfcore.c | 14 +-
arch/arm64/kernel/hibernate.c | 46 +-
arch/arm64/kernel/mte.c | 31 +-
arch/arm64/kernel/mte_tag_storage.c | 667 +++++++++++++++++++++++
arch/arm64/kernel/setup.c | 7 +
arch/arm64/kvm/arm.c | 6 +-
arch/arm64/lib/mte.S | 30 +-
arch/arm64/mm/copypage.c | 26 +
arch/arm64/mm/fault.c | 35 +-
arch/arm64/mm/mteswap.c | 113 +++-
fs/proc/meminfo.c | 8 +
fs/proc/page.c | 1 +
include/asm-generic/Kbuild | 1 +
include/asm-generic/memory_metadata.h | 50 ++
include/linux/gfp.h | 10 +
include/linux/gfp_types.h | 14 +-
include/linux/huge_mm.h | 6 +
include/linux/kernel-page-flags.h | 1 +
include/linux/migrate_mode.h | 1 +
include/linux/mm.h | 12 +-
include/linux/mmzone.h | 26 +-
include/linux/page-flags.h | 1 +
include/linux/pgtable.h | 19 +
include/linux/sched.h | 2 +-
include/linux/sched/mm.h | 13 +
include/linux/vm_event_item.h | 5 +
include/linux/vmstat.h | 2 +
include/trace/events/mmflags.h | 5 +-
mm/Kconfig | 5 +
mm/compaction.c | 52 +-
mm/huge_memory.c | 109 ++++
mm/internal.h | 7 +
mm/khugepaged.c | 7 +
mm/memory.c | 180 +++++-
mm/mempolicy.c | 7 +
mm/migrate.c | 6 +
mm/mm_init.c | 23 +-
mm/mprotect.c | 46 ++
mm/page_alloc.c | 136 ++++-
mm/page_isolation.c | 19 +-
mm/page_owner.c | 3 +-
mm/shmem.c | 14 +-
mm/show_mem.c | 4 +
mm/swapfile.c | 4 +
mm/vmscan.c | 3 +
mm/vmstat.c | 13 +-
56 files changed, 1834 insertions(+), 161 deletions(-)
create mode 100644 arch/arm64/include/asm/memory_metadata.h
create mode 100644 arch/arm64/include/asm/mte_tag_storage.h
create mode 100644 arch/arm64/kernel/mte_tag_storage.c
create mode 100644 include/asm-generic/memory_metadata.h

--
2.41.0



2023-08-23 16:38:52

by Alexandru Elisei

Subject: [PATCH RFC 11/37] mm: migrate/mempolicy: Allocate metadata-enabled destination page

With explicit metadata page management support, it's important to know if
the source page for migration has metadata associated with it, for two
reasons:

- So the page allocator knows to skip metadata pages (which cannot have
metadata) when allocating the destination page.
- So the associated metadata page is correctly reserved when fulfilling the
allocation for the destination page.

When choosing the destination during migration, keep track if the source
page has metadata.

The mbind() system call changes the NUMA allocation policy for the
specified memory range and nodemask. If the MPOL_MF_MOVE or
MPOL_MF_MOVE_ALL flags are set, then any existing allocations that fall
within the range which don't conform to the specified policy will be
migrated. The function that allocates the destination page for migration
is new_folio(); teach it, too, about source pages with metadata.

Signed-off-by: Alexandru Elisei <[email protected]>
---
arch/arm64/include/asm/memory_metadata.h | 4 ++++
include/asm-generic/memory_metadata.h | 4 ++++
mm/mempolicy.c | 4 ++++
mm/migrate.c | 6 ++++++
4 files changed, 18 insertions(+)

diff --git a/arch/arm64/include/asm/memory_metadata.h b/arch/arm64/include/asm/memory_metadata.h
index c57c435c8ba3..132707fce9ab 100644
--- a/arch/arm64/include/asm/memory_metadata.h
+++ b/arch/arm64/include/asm/memory_metadata.h
@@ -21,6 +21,10 @@ static inline bool alloc_can_use_metadata_pages(gfp_t gfp_mask)

#define page_has_metadata(page) page_mte_tagged(page)

+static inline bool folio_has_metadata(struct folio *folio)
+{
+ return page_has_metadata(&folio->page);
+}
#endif /* CONFIG_MEMORY_METADATA */

#endif /* __ASM_MEMORY_METADATA_H */
diff --git a/include/asm-generic/memory_metadata.h b/include/asm-generic/memory_metadata.h
index 02b279823920..8f4e2fba222f 100644
--- a/include/asm-generic/memory_metadata.h
+++ b/include/asm-generic/memory_metadata.h
@@ -20,6 +20,10 @@ static inline bool page_has_metadata(struct page *page)
{
return false;
}
+static inline bool folio_has_metadata(struct folio *folio)
+{
+ return false;
+}
#endif /* !CONFIG_MEMORY_METADATA */

#endif /* __ASM_GENERIC_MEMORY_METADATA_H */
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index edc25195f5bd..d164b5c50243 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -103,6 +103,7 @@
#include <linux/printk.h>
#include <linux/swapops.h>

+#include <asm/memory_metadata.h>
#include <asm/tlbflush.h>
#include <asm/tlb.h>
#include <linux/uaccess.h>
@@ -1219,6 +1220,9 @@ static struct folio *new_folio(struct folio *src, unsigned long start)
if (folio_test_large(src))
gfp = GFP_TRANSHUGE;

+ if (folio_has_metadata(src))
+ gfp |= __GFP_TAGGED;
+
/*
* if !vma, vma_alloc_folio() will use task or system default policy
*/
diff --git a/mm/migrate.c b/mm/migrate.c
index 24baad2571e3..c6826562220a 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -51,6 +51,7 @@
#include <linux/sched/sysctl.h>
#include <linux/memory-tiers.h>

+#include <asm/memory_metadata.h>
#include <asm/tlbflush.h>

#include <trace/events/migrate.h>
@@ -1990,6 +1991,9 @@ struct folio *alloc_migration_target(struct folio *src, unsigned long private)
if (nid == NUMA_NO_NODE)
nid = folio_nid(src);

+ if (folio_has_metadata(src))
+ gfp_mask |= __GFP_TAGGED;
+
if (folio_test_hugetlb(src)) {
struct hstate *h = folio_hstate(src);

@@ -2476,6 +2480,8 @@ static struct folio *alloc_misplaced_dst_folio(struct folio *src,
__GFP_NOWARN;
gfp &= ~__GFP_RECLAIM;
}
+ if (folio_has_metadata(src))
+ gfp |= __GFP_TAGGED;
return __folio_alloc_node(gfp, order, nid);
}

--
2.41.0


2023-08-23 16:47:43

by Alexandru Elisei

Subject: [PATCH RFC 10/37] mm: compaction: Do not use MIGRATE_METADATA to replace pages with metadata

MIGRATE_METADATA pages are special because, for the one architecture
(arm64) that uses them, it is not possible to have metadata associated
with a page used to store metadata.

To avoid a situation where a page with metadata is being migrated to a
page which cannot have metadata, keep track of whether such pages have
been isolated as the source for migration. When allocating a destination
page for migration, deny allocations from MIGRATE_METADATA if that's the
case.

fast_isolate_freepages() takes pages only from the MIGRATE_MOVABLE list,
which means it is not necessary to have a similar check, as
MIGRATE_METADATA pages will never be considered.

Signed-off-by: Alexandru Elisei <[email protected]>
---
arch/arm64/include/asm/memory_metadata.h | 5 +++++
include/asm-generic/memory_metadata.h | 5 +++++
include/linux/mmzone.h | 2 +-
mm/compaction.c | 19 +++++++++++++++++--
mm/internal.h | 1 +
5 files changed, 29 insertions(+), 3 deletions(-)

diff --git a/arch/arm64/include/asm/memory_metadata.h b/arch/arm64/include/asm/memory_metadata.h
index 5269be7f455f..c57c435c8ba3 100644
--- a/arch/arm64/include/asm/memory_metadata.h
+++ b/arch/arm64/include/asm/memory_metadata.h
@@ -7,6 +7,8 @@

#include <asm-generic/memory_metadata.h>

+#include <asm/mte.h>
+
#ifdef CONFIG_MEMORY_METADATA
static inline bool metadata_storage_enabled(void)
{
@@ -16,6 +18,9 @@ static inline bool alloc_can_use_metadata_pages(gfp_t gfp_mask)
{
return false;
}
+
+#define page_has_metadata(page) page_mte_tagged(page)
+
#endif /* CONFIG_MEMORY_METADATA */

#endif /* __ASM_MEMORY_METADATA_H */
diff --git a/include/asm-generic/memory_metadata.h b/include/asm-generic/memory_metadata.h
index 63ea661b354d..02b279823920 100644
--- a/include/asm-generic/memory_metadata.h
+++ b/include/asm-generic/memory_metadata.h
@@ -3,6 +3,7 @@
#define __ASM_GENERIC_MEMORY_METADATA_H

#include <linux/gfp.h>
+#include <linux/mm_types.h>

extern unsigned long totalmetadata_pages;

@@ -15,6 +16,10 @@ static inline bool alloc_can_use_metadata_pages(gfp_t gfp_mask)
{
return false;
}
+static inline bool page_has_metadata(struct page *page)
+{
+ return false;
+}
#endif /* !CONFIG_MEMORY_METADATA */

#endif /* __ASM_GENERIC_MEMORY_METADATA_H */
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 48c237248d87..12d5072668ab 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -91,7 +91,7 @@ extern const char * const migratetype_names[MIGRATE_TYPES];

static inline bool is_migrate_movable(int mt)
{
- return is_migrate_cma(mt) || mt == MIGRATE_MOVABLE;
+ return is_migrate_cma(mt) || is_migrate_metadata(mt) || mt == MIGRATE_MOVABLE;
}

/*
diff --git a/mm/compaction.c b/mm/compaction.c
index a29db409c5cc..cc0139fa0cb0 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1153,6 +1153,9 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
nr_isolated += folio_nr_pages(folio);
nr_scanned += folio_nr_pages(folio) - 1;

+ if (page_has_metadata(&folio->page))
+ cc->source_has_metadata = true;
+
/*
* Avoid isolating too much unless this block is being
* fully scanned (e.g. dirty/writeback pages, parallel allocation)
@@ -1328,6 +1331,15 @@ static bool suitable_migration_source(struct compact_control *cc,
static bool suitable_migration_target(struct compact_control *cc,
struct page *page)
{
+ int block_mt;
+
+ block_mt = get_pageblock_migratetype(page);
+
+ /* Pages from MIGRATE_METADATA cannot have metadata. */
+ if (is_migrate_metadata(block_mt) && cc->source_has_metadata)
+ return false;
+
+
/* If the page is a large free page, then disallow migration */
if (PageBuddy(page)) {
/*
@@ -1342,8 +1354,11 @@ static bool suitable_migration_target(struct compact_control *cc,
if (cc->ignore_block_suitable)
return true;

- /* If the block is MIGRATE_MOVABLE or MIGRATE_CMA, allow migration */
- if (is_migrate_movable(get_pageblock_migratetype(page)))
+ /*
+ * If the block is MIGRATE_MOVABLE, MIGRATE_CMA or MIGRATE_METADATA,
+ * allow migration.
+ */
+ if (is_migrate_movable(block_mt))
return true;

/* Otherwise skip the block */
diff --git a/mm/internal.h b/mm/internal.h
index efd52c9f1578..d28ac0085f61 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -491,6 +491,7 @@ struct compact_control {
* ensure forward progress.
*/
bool alloc_contig; /* alloc_contig_range allocation */
+ bool source_has_metadata; /* source pages have associated metadata */
};

/*
--
2.41.0


2023-08-23 16:48:49

by Alexandru Elisei

Subject: [PATCH RFC 30/37] mm: mprotect: arm64: Set PAGE_METADATA_NONE for mprotect(PROT_MTE)

To enable tagging on a memory range, userspace can use mprotect() with the
PROT_MTE access flag. Pages already mapped in the VMA obviously don't have
the associated tag storage block reserved, so mark the PTEs as
PAGE_METADATA_NONE to trigger a fault next time they are accessed, and
reserve the tag storage as part of the fault handling. If the tag storage
for the page cannot be reserved, then migrate the page, because
alloc_migration_target() will do the right thing and allocate a destination
page with the tag storage reserved.

If the mapped page is a metadata storage page, which cannot have metadata
associated with it, the page is unconditionally migrated.

This has several benefits over reserving the tag storage as part of the
mprotect() call handling:

- Tag storage is reserved only for pages that are accessed.
- Reduces the latency of the mprotect() call.
- Eliminates races with page migration.

But all of this is at the expense of an extra page fault until the pages
being accessed all have their corresponding tag storage reserved.

This is only implemented for PTE mappings; PMD mappings will follow.

Signed-off-by: Alexandru Elisei <[email protected]>
---
arch/arm64/kernel/mte_tag_storage.c | 6 ++
include/linux/migrate_mode.h | 1 +
include/linux/mm.h | 2 +
mm/memory.c | 152 +++++++++++++++++++++++++++-
mm/mprotect.c | 46 +++++++++
5 files changed, 206 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/kernel/mte_tag_storage.c b/arch/arm64/kernel/mte_tag_storage.c
index ba316ffb9aef..27bde1d2609c 100644
--- a/arch/arm64/kernel/mte_tag_storage.c
+++ b/arch/arm64/kernel/mte_tag_storage.c
@@ -531,6 +531,10 @@ int reserve_metadata_storage(struct page *page, int order, gfp_t gfp)

mutex_lock(&tag_blocks_lock);

+ /* Can happen for concurrent accesses to a METADATA_NONE page. */
+ if (page_tag_storage_reserved(page))
+ goto out_unlock;
+
/* Make sure existing entries are not freed from out under out feet. */
xa_lock_irqsave(&tag_blocks_reserved, flags);
for (block = start_block; block < end_block; block += region->block_size) {
@@ -568,6 +572,8 @@ int reserve_metadata_storage(struct page *page, int order, gfp_t gfp)
set_bit(PG_tag_storage_reserved, &(page + i)->flags);

memalloc_isolate_restore(cflags);
+
+out_unlock:
mutex_unlock(&tag_blocks_lock);

return 0;
diff --git a/include/linux/migrate_mode.h b/include/linux/migrate_mode.h
index f37cc03f9369..5a9af239e425 100644
--- a/include/linux/migrate_mode.h
+++ b/include/linux/migrate_mode.h
@@ -29,6 +29,7 @@ enum migrate_reason {
MR_CONTIG_RANGE,
MR_LONGTERM_PIN,
MR_DEMOTION,
+ MR_METADATA_NONE,
MR_TYPES
};

diff --git a/include/linux/mm.h b/include/linux/mm.h
index ce87d55ecf87..6bd7d5810122 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2466,6 +2466,8 @@ extern unsigned long move_page_tables(struct vm_area_struct *vma,
#define MM_CP_UFFD_WP_RESOLVE (1UL << 3) /* Resolve wp */
#define MM_CP_UFFD_WP_ALL (MM_CP_UFFD_WP | \
MM_CP_UFFD_WP_RESOLVE)
+/* Whether this protection change is to allocate metadata on next access */
+#define MM_CP_PROT_METADATA_NONE (1UL << 4)

bool vma_needs_dirty_tracking(struct vm_area_struct *vma);
int vma_wants_writenotify(struct vm_area_struct *vma, pgprot_t vm_page_prot);
diff --git a/mm/memory.c b/mm/memory.c
index 01f39e8144ef..6c4a6151c7b2 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -51,6 +51,7 @@
#include <linux/swap.h>
#include <linux/highmem.h>
#include <linux/pagemap.h>
+#include <linux/page-isolation.h>
#include <linux/memremap.h>
#include <linux/kmsan.h>
#include <linux/ksm.h>
@@ -82,6 +83,7 @@
#include <trace/events/kmem.h>

#include <asm/io.h>
+#include <asm/memory_metadata.h>
#include <asm/mmu_context.h>
#include <asm/pgalloc.h>
#include <linux/uaccess.h>
@@ -4681,6 +4683,151 @@ static vm_fault_t do_fault(struct vm_fault *vmf)
return ret;
}

+/* Returns with the page reference dropped. */
+static void migrate_metadata_none_page(struct page *page, struct vm_area_struct *vma)
+{
+ struct migration_target_control mtc = {
+ .nid = NUMA_NO_NODE,
+ .gfp_mask = GFP_HIGHUSER_MOVABLE | __GFP_TAGGED,
+ };
+ LIST_HEAD(pagelist);
+ int ret, tries;
+
+ lru_cache_disable();
+
+ if (!isolate_lru_page(page)) {
+ put_page(page);
+ lru_cache_enable();
+ return;
+ }
+ /* Isolate just grabbed another reference, drop ours. */
+ put_page(page);
+
+ list_add_tail(&page->lru, &pagelist);
+
+ tries = 5;
+ while (tries--) {
+ ret = migrate_pages(&pagelist, alloc_migration_target, NULL,
+ (unsigned long)&mtc, MIGRATE_SYNC, MR_METADATA_NONE, NULL);
+ if (ret == 0 || ret != -EBUSY)
+ break;
+ }
+
+ if (ret != 0) {
+ list_del(&page->lru);
+ putback_movable_pages(&pagelist);
+ }
+ lru_cache_enable();
+}
+
+static vm_fault_t do_metadata_none_page(struct vm_fault *vmf)
+{
+ struct vm_area_struct *vma = vmf->vma;
+ struct page *page = NULL;
+ bool do_migrate = false;
+ pte_t new_pte, old_pte;
+ bool writable = false;
+ vm_fault_t err;
+ int ret;
+
+ /*
+ * The pte at this point cannot be used safely without validation
+ * through pte_same().
+ */
+ vmf->ptl = pte_lockptr(vma->vm_mm, vmf->pmd);
+ spin_lock(vmf->ptl);
+ if (unlikely(!pte_same(*vmf->pte, vmf->orig_pte))) {
+ pte_unmap_unlock(vmf->pte, vmf->ptl);
+ return 0;
+ }
+
+ /* Get the normal PTE */
+ old_pte = ptep_get(vmf->pte);
+ new_pte = pte_modify(old_pte, vma->vm_page_prot);
+
+ /*
+ * Detect now whether the PTE could be writable; this information
+ * is only valid while holding the PT lock.
+ */
+ writable = pte_write(new_pte);
+ if (!writable && vma_wants_manual_pte_write_upgrade(vma) &&
+ can_change_pte_writable(vma, vmf->address, new_pte))
+ writable = true;
+
+ page = vm_normal_page(vma, vmf->address, new_pte);
+ if (!page)
+ goto out_map;
+
+ /*
+ * This should never happen, once a VMA has been marked as tagged, that
+ * cannot be changed.
+ */
+ if (!(vma->vm_flags & VM_MTE))
+ goto out_map;
+
+ /* Prevent the page from being unmapped from under us. */
+ get_page(page);
+ vma_set_access_pid_bit(vma);
+
+ pte_unmap_unlock(vmf->pte, vmf->ptl);
+
+ /*
+ * Probably the page is being isolated for migration, replay the fault
+ * to give time for the entry to be replaced by a migration pte.
+ */
+ if (unlikely(is_migrate_isolate_page(page))) {
+ if (!(vmf->flags & FAULT_FLAG_TRIED))
+ err = VM_FAULT_RETRY;
+ else
+ err = 0;
+ put_page(page);
+ return err;
+ } else if (is_migrate_metadata_page(page)) {
+ do_migrate = true;
+ } else {
+ ret = reserve_metadata_storage(page, 0, GFP_HIGHUSER_MOVABLE);
+ if (ret == -EINTR) {
+ put_page(page);
+ return VM_FAULT_RETRY;
+ } else if (ret) {
+ do_migrate = true;
+ }
+ }
+ if (do_migrate) {
+ migrate_metadata_none_page(page, vma);
+ /*
+ * Either the page was migrated, in which case there's nothing
+ * we need to do; either migration failed, in which case all we
+ * can do is try again. So don't change the pte.
+ */
+ return 0;
+ }
+
+ put_page(page);
+
+ spin_lock(vmf->ptl);
+ if (unlikely(!pte_same(*vmf->pte, vmf->orig_pte))) {
+ pte_unmap_unlock(vmf->pte, vmf->ptl);
+ return 0;
+ }
+
+out_map:
+ /*
+ * Make it present again, depending on how arch implements
+ * non-accessible ptes, some can allow access by kernel mode.
+ */
+ old_pte = ptep_modify_prot_start(vma, vmf->address, vmf->pte);
+ new_pte = pte_modify(old_pte, vma->vm_page_prot);
+ new_pte = pte_mkyoung(new_pte);
+ if (writable)
+ new_pte = pte_mkwrite(new_pte);
+ ptep_modify_prot_commit(vma, vmf->address, vmf->pte, old_pte, new_pte);
+ update_mmu_cache(vma, vmf->address, vmf->pte);
+ pte_unmap_unlock(vmf->pte, vmf->ptl);
+
+ return 0;
+}
+
int numa_migrate_prep(struct page *page, struct vm_area_struct *vma,
unsigned long addr, int page_nid, int *flags)
{
@@ -4941,8 +5088,11 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
if (!pte_present(vmf->orig_pte))
return do_swap_page(vmf);

- if (pte_protnone(vmf->orig_pte) && vma_is_accessible(vmf->vma))
+ if (pte_protnone(vmf->orig_pte) && vma_is_accessible(vmf->vma)) {
+ if (metadata_storage_enabled() && pte_metadata_none(vmf->orig_pte))
+ return do_metadata_none_page(vmf);
return do_numa_page(vmf);
+ }

spin_lock(vmf->ptl);
entry = vmf->orig_pte;
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 6f658d483704..2c022133aed3 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -33,6 +33,7 @@
#include <linux/userfaultfd_k.h>
#include <linux/memory-tiers.h>
#include <asm/cacheflush.h>
+#include <asm/memory_metadata.h>
#include <asm/mmu_context.h>
#include <asm/tlbflush.h>
#include <asm/tlb.h>
@@ -89,6 +90,7 @@ static long change_pte_range(struct mmu_gather *tlb,
long pages = 0;
int target_node = NUMA_NO_NODE;
bool prot_numa = cp_flags & MM_CP_PROT_NUMA;
+ bool prot_metadata_none = cp_flags & MM_CP_PROT_METADATA_NONE;
bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;

@@ -161,6 +163,40 @@ static long change_pte_range(struct mmu_gather *tlb,
jiffies_to_msecs(jiffies));
}

+ if (prot_metadata_none) {
+ struct page *page;
+
+ /*
+ * Skip METADATA_NONE pages, but not NUMA pages,
+ * just so we don't get two faults, one after
+ * the other. The page fault handling code
+ * might end up migrating the current page
+ * anyway, so there really is no need to keep
+ * the pte marked for NUMA balancing.
+ */
+ if (pte_protnone(oldpte) && pte_metadata_none(oldpte))
+ continue;
+
+ page = vm_normal_page(vma, addr, oldpte);
+ if (!page || is_zone_device_page(page))
+ continue;
+
+ /* Page already mapped as tagged in a shared VMA. */
+ if (page_has_metadata(page))
+ continue;
+
+ /*
+ * The LRU takes a page reference, which means
+ * that page_count > 1 is true even if the page
+ * is not COW. Reserving tag storage for a COW
+ * page is ok, because one mapping of that page
+ * won't be migrated; but not reserving tag
+ * storage for a page is definitely wrong. So
+ * don't skip pages that might be COW, like
+ * NUMA does.
+ */
+ }
+
oldpte = ptep_modify_prot_start(vma, addr, pte);
ptent = pte_modify(oldpte, newprot);

@@ -531,6 +567,13 @@ long change_protection(struct mmu_gather *tlb,
WARN_ON_ONCE(cp_flags & MM_CP_PROT_NUMA);
#endif

+#ifdef CONFIG_MEMORY_METADATA
+ if (cp_flags & MM_CP_PROT_METADATA_NONE)
+ newprot = PAGE_METADATA_NONE;
+#else
+ WARN_ON_ONCE(cp_flags & MM_CP_PROT_METADATA_NONE);
+#endif
+
if (is_vm_hugetlb_page(vma))
pages = hugetlb_change_protection(vma, start, end, newprot,
cp_flags);
@@ -661,6 +704,9 @@ mprotect_fixup(struct vma_iterator *vmi, struct mmu_gather *tlb,
mm_cp_flags |= MM_CP_TRY_CHANGE_WRITABLE;
vma_set_page_prot(vma);

+ if (metadata_storage_enabled() && (newflags & VM_MTE) && !(oldflags & VM_MTE))
+ mm_cp_flags |= MM_CP_PROT_METADATA_NONE;
+
change_protection(tlb, vma, start, end, mm_cp_flags);

/*
--
2.41.0


2023-08-23 16:48:51

by Alexandru Elisei

Subject: [PATCH RFC 37/37] arm64: mte: Enable tag storage management

Everything is in place, enable tag storage management.

Signed-off-by: Alexandru Elisei <[email protected]>
---
arch/arm64/kernel/mte_tag_storage.c | 6 ++++++
1 file changed, 6 insertions(+)

diff --git a/arch/arm64/kernel/mte_tag_storage.c b/arch/arm64/kernel/mte_tag_storage.c
index 1ccbcc144979..18264bc8f590 100644
--- a/arch/arm64/kernel/mte_tag_storage.c
+++ b/arch/arm64/kernel/mte_tag_storage.c
@@ -399,6 +399,12 @@ static int __init mte_tag_storage_activate_regions(void)
}

ret = reserve_metadata_storage(ZERO_PAGE(0), 0, GFP_HIGHUSER_MOVABLE);
+ if (ret) {
+ pr_info("MTE tag storage disabled\n");
+ } else {
+ static_branch_enable(&metadata_storage_enabled_key);
+ pr_info("MTE tag storage enabled\n");
+ }

return ret;
}
--
2.41.0


2023-08-23 16:49:28

by Alexandru Elisei

Subject: [PATCH RFC 29/37] mm: arm64: Define the PAGE_METADATA_NONE page protection

Define the PAGE_METADATA_NONE page protection to be used when a page with
metadata doesn't have metadata storage reserved.

For arm64, this is accomplished by adding a new page table entry software
bit PTE_METADATA_NONE. Linux doesn't set any of the PBHA bits in entries
from the last level of the translation table and it doesn't use the
TCR_ELx.HWUxx bits. This makes it safe to define PTE_METADATA_NONE as bit
59.

Signed-off-by: Alexandru Elisei <[email protected]>
---
arch/arm64/include/asm/pgtable-prot.h | 2 ++
arch/arm64/include/asm/pgtable.h | 16 ++++++++++++++--
include/linux/pgtable.h | 12 ++++++++++++
3 files changed, 28 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable-prot.h b/arch/arm64/include/asm/pgtable-prot.h
index eed814b00a38..ed2a98ec4e95 100644
--- a/arch/arm64/include/asm/pgtable-prot.h
+++ b/arch/arm64/include/asm/pgtable-prot.h
@@ -19,6 +19,7 @@
#define PTE_SPECIAL (_AT(pteval_t, 1) << 56)
#define PTE_DEVMAP (_AT(pteval_t, 1) << 57)
#define PTE_PROT_NONE (_AT(pteval_t, 1) << 58) /* only when !PTE_VALID */
+#define PTE_METADATA_NONE (_AT(pteval_t, 1) << 59) /* only when PTE_PROT_NONE */

/*
* This bit indicates that the entry is present i.e. pmd_page()
@@ -98,6 +99,7 @@ extern bool arm64_use_ng_mappings;
})

#define PAGE_NONE __pgprot(((_PAGE_DEFAULT) & ~PTE_VALID) | PTE_PROT_NONE | PTE_RDONLY | PTE_NG | PTE_PXN | PTE_UXN)
+#define PAGE_METADATA_NONE __pgprot((_PAGE_DEFAULT & ~PTE_VALID) | PTE_PROT_NONE | PTE_METADATA_NONE | PTE_RDONLY | PTE_NG | PTE_PXN | PTE_UXN)
/* shared+writable pages are clean by default, hence PTE_RDONLY|PTE_WRITE */
#define PAGE_SHARED __pgprot(_PAGE_SHARED)
#define PAGE_SHARED_EXEC __pgprot(_PAGE_SHARED_EXEC)
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 944860d7090e..2e42f7713425 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -451,6 +451,18 @@ static inline int pmd_protnone(pmd_t pmd)
}
#endif

+#ifdef CONFIG_MEMORY_METADATA
+static inline bool pte_metadata_none(pte_t pte)
+{
+ return (((pte_val(pte) & (PTE_VALID | PTE_PROT_NONE)) == PTE_PROT_NONE)
+ && (pte_val(pte) & PTE_METADATA_NONE));
+}
+static inline bool pmd_metadata_none(pmd_t pmd)
+{
+ return pte_metadata_none(pmd_pte(pmd));
+}
+#endif
+
#define pmd_present_invalid(pmd) (!!(pmd_val(pmd) & PMD_PRESENT_INVALID))

static inline int pmd_present(pmd_t pmd)
@@ -809,8 +821,8 @@ static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
* in MAIR_EL1. The mask below has to include PTE_ATTRINDX_MASK.
*/
const pteval_t mask = PTE_USER | PTE_PXN | PTE_UXN | PTE_RDONLY |
- PTE_PROT_NONE | PTE_VALID | PTE_WRITE | PTE_GP |
- PTE_ATTRINDX_MASK;
+ PTE_PROT_NONE | PTE_METADATA_NONE | PTE_VALID |
+ PTE_WRITE | PTE_GP | PTE_ATTRINDX_MASK;
/* preserve the hardware dirty information */
if (pte_hw_dirty(pte))
pte = pte_mkdirty(pte);
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 5063b482e34f..0119ffa2c0ab 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1340,6 +1340,18 @@ static inline int pmd_protnone(pmd_t pmd)
}
#endif /* CONFIG_NUMA_BALANCING */

+#ifndef CONFIG_MEMORY_METADATA
+static inline bool pte_metadata_none(pte_t pte)
+{
+ return false;
+}
+
+static inline bool pmd_metadata_none(pmd_t pmd)
+{
+ return false;
+}
+#endif /* CONFIG_MEMORY_METADATA */
+
#endif /* CONFIG_MMU */

#ifdef CONFIG_HAVE_ARCH_HUGE_VMAP
--
2.41.0


2023-08-23 17:25:23

by Alexandru Elisei

Subject: [PATCH RFC 16/37] arm64: mte: Move tag storage to MIGRATE_MOVABLE when MTE is disabled

If MTE is disabled (for example, from the kernel command line with the
arm64.nomte option), the tag storage pages behave just like normal
pages, because they will never be used to store tags. If that's the
case, expose them to the page allocator as MIGRATE_MOVABLE pages.

MIGRATE_MOVABLE has been chosen because the bulk of memory allocations
comes from userspace, and the migratetype for those allocations is
MIGRATE_MOVABLE. MIGRATE_RECLAIMABLE and MIGRATE_UNMOVABLE requests can
still use the pages as a fallback.

Signed-off-by: Alexandru Elisei <[email protected]>
---
arch/arm64/kernel/mte_tag_storage.c | 18 ++++++++++++++++++
include/linux/gfp.h | 2 ++
mm/mm_init.c | 3 +--
3 files changed, 21 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/kernel/mte_tag_storage.c b/arch/arm64/kernel/mte_tag_storage.c
index 87160f53960f..4a6bfdf88458 100644
--- a/arch/arm64/kernel/mte_tag_storage.c
+++ b/arch/arm64/kernel/mte_tag_storage.c
@@ -296,6 +296,24 @@ static int __init mte_tag_storage_activate_regions(void)
}
}

+ /*
+ * MTE disabled, tag storage pages can be used like any other pages. The
+ * only restriction is that the pages cannot be used by kexec because
+ * the memory is marked as reserved in the memblock allocator.
+ */
+ if (!system_supports_mte()) {
+ for (i = 0; i < num_tag_regions; i++) {
+ tag_range = &tag_regions[i].tag_range;
+ for (pfn = tag_range->start;
+ pfn <= tag_range->end;
+ pfn += pageblock_nr_pages) {
+ init_reserved_pageblock(pfn_to_page(pfn), MIGRATE_MOVABLE);
+ }
+ }
+
+ return 0;
+ }
+
for (i = 0; i < num_tag_regions; i++) {
tag_range = &tag_regions[i].tag_range;
for (pfn = tag_range->start; pfn <= tag_range->end; pfn += pageblock_nr_pages) {
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index fb344baa9a9b..622bb9406cae 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -354,6 +354,8 @@ extern struct page *alloc_contig_pages(unsigned long nr_pages, gfp_t gfp_mask,
#endif
void free_contig_range(unsigned long pfn, unsigned long nr_pages);

+extern void init_reserved_pageblock(struct page *page, enum migratetype migratetype);
+
#ifdef CONFIG_MEMORY_METADATA
extern void init_metadata_reserved_pageblock(struct page *page);
#else
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 467c80e9dacc..eedaacdf153d 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -2330,8 +2330,7 @@ bool __init deferred_grow_zone(struct zone *zone, unsigned int order)
#endif /* CONFIG_DEFERRED_STRUCT_PAGE_INIT */

#if defined(CONFIG_CMA) || defined(CONFIG_MEMORY_METADATA)
-static void __init init_reserved_pageblock(struct page *page,
- enum migratetype migratetype)
+void __init init_reserved_pageblock(struct page *page, enum migratetype migratetype)
{
unsigned i = pageblock_nr_pages;
struct page *p = page;
--
2.41.0


2023-08-23 18:07:59

by Alexandru Elisei

Subject: [PATCH RFC 08/37] mm: compaction: Account for free metadata pages in __compact_finished()

__compact_finished() signals the end of compaction if a page of an order
greater than or equal to the requested order is found on a free_area.
When allocation of MIGRATE_METADATA pages is allowed, count the number
of free metadata storage pages towards the requested order.

Signed-off-by: Alexandru Elisei <[email protected]>
---
mm/compaction.c | 7 +++++++
1 file changed, 7 insertions(+)

diff --git a/mm/compaction.c b/mm/compaction.c
index dbc9f86b1934..f132c02b0655 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -2208,6 +2208,13 @@ static enum compact_result __compact_finished(struct compact_control *cc)
if (migratetype == MIGRATE_MOVABLE &&
!free_area_empty(area, MIGRATE_CMA))
return COMPACT_SUCCESS;
+#endif
+#ifdef CONFIG_MEMORY_METADATA
+ if (metadata_storage_enabled() &&
+ migratetype == MIGRATE_MOVABLE &&
+ (cc->alloc_flags & ALLOC_FROM_METADATA) &&
+ !free_area_empty(area, MIGRATE_METADATA))
+ return COMPACT_SUCCESS;
#endif
/*
* Job done if allocation would steal freepages from
--
2.41.0


2023-08-23 19:07:33

by Alexandru Elisei

Subject: [PATCH RFC 14/37] arm64: mte: Expose tag storage pages to the MIGRATE_METADATA freelist

Add the MTE tag storage pages to the MIGRATE_METADATA freelist, which
allows the page allocator to manage them like (almost) regular pages.

Signed-off-by: Alexandru Elisei <[email protected]>
---
arch/arm64/kernel/mte_tag_storage.c | 47 +++++++++++++++++++++++++++++
include/linux/gfp.h | 8 +++++
mm/mm_init.c | 24 +++++++++++++--
3 files changed, 76 insertions(+), 3 deletions(-)

diff --git a/arch/arm64/kernel/mte_tag_storage.c b/arch/arm64/kernel/mte_tag_storage.c
index 5014dda9bf35..87160f53960f 100644
--- a/arch/arm64/kernel/mte_tag_storage.c
+++ b/arch/arm64/kernel/mte_tag_storage.c
@@ -5,10 +5,12 @@
* Copyright (C) 2023 ARM Ltd.
*/

+#include <linux/gfp.h>
#include <linux/memblock.h>
#include <linux/mm.h>
#include <linux/of_device.h>
#include <linux/of_fdt.h>
+#include <linux/pageblock-flags.h>
#include <linux/range.h>
#include <linux/string.h>
#include <linux/xarray.h>
@@ -190,6 +192,12 @@ static int __init fdt_init_tag_storage(unsigned long node, const char *uname,
return ret;
}

+ /* Pages are managed in pageblock_nr_pages chunks */
+ if (!IS_ALIGNED(tag_range->start | range_len(tag_range), pageblock_nr_pages)) {
+ pr_err("Tag storage region not aligned to 0x%lx", pageblock_nr_pages);
+ return -EINVAL;
+ }
+
ret = tag_storage_get_memory_node(node, &mem_node);
if (ret)
return ret;
@@ -260,3 +268,42 @@ void __init mte_tag_storage_init(void)
}
num_tag_regions = 0;
}
+
+static int __init mte_tag_storage_activate_regions(void)
+{
+ phys_addr_t dram_start, dram_end;
+ struct range *tag_range;
+ unsigned long pfn;
+ int i;
+
+ if (num_tag_regions == 0)
+ return 0;
+
+ dram_start = memblock_start_of_DRAM();
+ dram_end = memblock_end_of_DRAM();
+
+ for (i = 0; i < num_tag_regions; i++) {
+ tag_range = &tag_regions[i].tag_range;
+ /*
+ * Tag storage region was clipped by arm64_bootmem_init()
+ * enforcing addressing limits.
+ */
+ if (PFN_PHYS(tag_range->start) < dram_start ||
+ PFN_PHYS(tag_range->end) >= dram_end) {
+ pr_err("Tag storage region 0x%llx-0x%llx outside addressable memory",
+ PFN_PHYS(tag_range->start), PFN_PHYS(tag_range->end + 1));
+ return -EINVAL;
+ }
+ }
+
+ for (i = 0; i < num_tag_regions; i++) {
+ tag_range = &tag_regions[i].tag_range;
+ for (pfn = tag_range->start; pfn <= tag_range->end; pfn += pageblock_nr_pages) {
+ init_metadata_reserved_pageblock(pfn_to_page(pfn));
+ totalmetadata_pages += pageblock_nr_pages;
+ }
+ }
+
+ return 0;
+}
+core_initcall(mte_tag_storage_activate_regions)
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 665f06675c83..fb344baa9a9b 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -354,4 +354,12 @@ extern struct page *alloc_contig_pages(unsigned long nr_pages, gfp_t gfp_mask,
#endif
void free_contig_range(unsigned long pfn, unsigned long nr_pages);

+#ifdef CONFIG_MEMORY_METADATA
+extern void init_metadata_reserved_pageblock(struct page *page);
+#else
+static inline void init_metadata_reserved_pageblock(struct page *page)
+{
+}
+#endif
+
#endif /* __LINUX_GFP_H */
diff --git a/mm/mm_init.c b/mm/mm_init.c
index a1963c3322af..467c80e9dacc 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -2329,8 +2329,9 @@ bool __init deferred_grow_zone(struct zone *zone, unsigned int order)

#endif /* CONFIG_DEFERRED_STRUCT_PAGE_INIT */

-#ifdef CONFIG_CMA
-void __init init_cma_reserved_pageblock(struct page *page)
+#if defined(CONFIG_CMA) || defined(CONFIG_MEMORY_METADATA)
+static void __init init_reserved_pageblock(struct page *page,
+ enum migratetype migratetype)
{
unsigned i = pageblock_nr_pages;
struct page *p = page;
@@ -2340,15 +2341,32 @@ void __init init_cma_reserved_pageblock(struct page *page)
set_page_count(p, 0);
} while (++p, --i);

- set_pageblock_migratetype(page, MIGRATE_CMA);
+ set_pageblock_migratetype(page, migratetype);
set_page_refcounted(page);
__free_pages(page, pageblock_order);

adjust_managed_page_count(page, pageblock_nr_pages);
+}
+
+#ifdef CONFIG_CMA
+/* Free whole pageblock and set its migration type to MIGRATE_CMA. */
+void __init init_cma_reserved_pageblock(struct page *page)
+{
+ init_reserved_pageblock(page, MIGRATE_CMA);
page_zone(page)->cma_pages += pageblock_nr_pages;
}
#endif

+#ifdef CONFIG_MEMORY_METADATA
+/* Free whole pageblock and set its migration type to MIGRATE_METADATA. */
+void __init init_metadata_reserved_pageblock(struct page *page)
+{
+ init_reserved_pageblock(page, MIGRATE_METADATA);
+ page_zone(page)->metadata_pages += pageblock_nr_pages;
+}
+#endif
+#endif /* CONFIG_CMA || CONFIG_MEMORY_METADATA */
+
void set_zone_contiguous(struct zone *zone)
{
unsigned long block_start_pfn = zone->zone_start_pfn;
--
2.41.0


2023-08-23 19:44:19

by Alexandru Elisei

Subject: [PATCH RFC 22/37] mm: shmem: Allocate metadata storage for in-memory filesystems

Set __GFP_TAGGED when a new page is faulted in, so the page allocator
reserves the corresponding metadata storage.

Signed-off-by: Alexandru Elisei <[email protected]>
---
mm/shmem.c | 10 +++++++++-
1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index 2f2e0e618072..0b772ec34caa 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -81,6 +81,8 @@ static struct vfsmount *shm_mnt;

#include <linux/uaccess.h>

+#include <asm/memory_metadata.h>
+
#include "internal.h"

#define BLOCKS_PER_PAGE (PAGE_SIZE/512)
@@ -1530,7 +1532,7 @@ static struct folio *shmem_swapin(swp_entry_t swap, gfp_t gfp,
*/
static gfp_t limit_gfp_mask(gfp_t huge_gfp, gfp_t limit_gfp)
{
- gfp_t allowflags = __GFP_IO | __GFP_FS | __GFP_RECLAIM;
+ gfp_t allowflags = __GFP_IO | __GFP_FS | __GFP_RECLAIM | __GFP_TAGGED;
gfp_t denyflags = __GFP_NOWARN | __GFP_NORETRY;
gfp_t zoneflags = limit_gfp & GFP_ZONEMASK;
gfp_t result = huge_gfp & ~(allowflags | GFP_ZONEMASK);
@@ -1941,6 +1943,8 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index,
goto alloc_nohuge;

huge_gfp = vma_thp_gfp_mask(vma);
+ if (vma_has_metadata(vma))
+ huge_gfp |= __GFP_TAGGED;
huge_gfp = limit_gfp_mask(huge_gfp, gfp);
folio = shmem_alloc_and_acct_folio(huge_gfp, inode, index, true);
if (IS_ERR(folio)) {
@@ -2101,6 +2105,10 @@ static vm_fault_t shmem_fault(struct vm_fault *vmf)
int err;
vm_fault_t ret = VM_FAULT_LOCKED;

+ /* Fixup gfp flags for metadata enabled VMAs. */
+ if (vma_has_metadata(vma))
+ gfp |= __GFP_TAGGED;
+
/*
* Trinity finds that probing a hole which tmpfs is punching can
* prevent the hole-punch from ever completing: which in turn
--
2.41.0


2023-08-23 20:19:17

by Alexandru Elisei

Subject: [PATCH RFC 28/37] mm: sched: Introduce PF_MEMALLOC_ISOLATE

On arm64, when reserving tag storage for an allocated page, if the tag
storage is in use, the tag storage must be migrated before it can be
reserved. As part of the migration process, the tag storage block is
first isolated.

Compaction also isolates the source pages before migrating them. If the
target for compaction requires metadata pages to be reserved, those
metadata pages might also need to be isolated, which, in rare
circumstances, can lead to the threshold in too_many_isolated() being
reached, and isolate_migratepages_block() will get stuck in an infinite
loop.

Add the flag PF_MEMALLOC_ISOLATE for the current thread, which makes
too_many_isolated() ignore the threshold so that forward progress can be
made in isolate_migratepages_block().

For consistency, the similarly named function too_many_isolated() called
during reclaim has received the same treatment.

Signed-off-by: Alexandru Elisei <[email protected]>
---
arch/arm64/kernel/mte_tag_storage.c | 5 ++++-
include/linux/sched.h | 2 +-
include/linux/sched/mm.h | 13 +++++++++++++
mm/compaction.c | 3 +++
mm/vmscan.c | 3 +++
5 files changed, 24 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/kernel/mte_tag_storage.c b/arch/arm64/kernel/mte_tag_storage.c
index 1ab875be5f9b..ba316ffb9aef 100644
--- a/arch/arm64/kernel/mte_tag_storage.c
+++ b/arch/arm64/kernel/mte_tag_storage.c
@@ -505,9 +505,9 @@ static int order_to_num_blocks(int order)
int reserve_metadata_storage(struct page *page, int order, gfp_t gfp)
{
unsigned long start_block, end_block;
+ unsigned long flags, cflags;
struct tag_region *region;
unsigned long block;
- unsigned long flags;
int i, tries;
int ret = 0;

@@ -539,6 +539,7 @@ int reserve_metadata_storage(struct page *page, int order, gfp_t gfp)
}
xa_unlock_irqrestore(&tag_blocks_reserved, flags);

+ cflags = memalloc_isolate_save();
for (block = start_block; block < end_block; block += region->block_size) {
/* Refcount incremented above. */
if (tag_storage_block_is_reserved(block))
@@ -566,6 +567,7 @@ int reserve_metadata_storage(struct page *page, int order, gfp_t gfp)
for (i = 0; i < (1 << order); i++)
set_bit(PG_tag_storage_reserved, &(page + i)->flags);

+ memalloc_isolate_restore(cflags);
mutex_unlock(&tag_blocks_lock);

return 0;
@@ -581,6 +583,7 @@ int reserve_metadata_storage(struct page *page, int order, gfp_t gfp)
}
xa_unlock_irqrestore(&tag_blocks_reserved, flags);

+ memalloc_isolate_restore(cflags);
mutex_unlock(&tag_blocks_lock);

count_vm_events(METADATA_RESERVE_FAIL, region->block_size);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 609bde814cb0..a2a930cab31a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1734,7 +1734,7 @@ extern struct pid *cad_pid;
#define PF_USED_MATH 0x00002000 /* If unset the fpu must be initialized before use */
#define PF_USER_WORKER 0x00004000 /* Kernel thread cloned from userspace thread */
#define PF_NOFREEZE 0x00008000 /* This thread should not be frozen */
-#define PF__HOLE__00010000 0x00010000
+#define PF_MEMALLOC_ISOLATE 0x00010000 /* Ignore isolation limits */
#define PF_KSWAPD 0x00020000 /* I am kswapd */
#define PF_MEMALLOC_NOFS 0x00040000 /* All allocation requests will inherit GFP_NOFS */
#define PF_MEMALLOC_NOIO 0x00080000 /* All allocation requests will inherit GFP_NOIO */
diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index 8d89c8c4fac1..8db491208746 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -393,6 +393,19 @@ static inline void memalloc_pin_restore(unsigned int flags)
current->flags = (current->flags & ~PF_MEMALLOC_PIN) | flags;
}

+static inline unsigned int memalloc_isolate_save(void)
+{
+ unsigned int flags = current->flags & PF_MEMALLOC_ISOLATE;
+
+ current->flags |= PF_MEMALLOC_ISOLATE;
+ return flags;
+}
+
+static inline void memalloc_isolate_restore(unsigned int flags)
+{
+ current->flags = (current->flags & ~PF_MEMALLOC_ISOLATE) | flags;
+}
+
#ifdef CONFIG_MEMCG
DECLARE_PER_CPU(struct mem_cgroup *, int_active_memcg);
/**
diff --git a/mm/compaction.c b/mm/compaction.c
index 314793ec8bdb..fdb75316f0cc 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -778,6 +778,9 @@ static bool too_many_isolated(struct compact_control *cc)

unsigned long active, inactive, isolated;

+ if (current->flags & PF_MEMALLOC_ISOLATE)
+ return false;
+
inactive = node_page_state(pgdat, NR_INACTIVE_FILE) +
node_page_state(pgdat, NR_INACTIVE_ANON);
active = node_page_state(pgdat, NR_ACTIVE_FILE) +
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 1080209a568b..912ebb6003a0 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2453,6 +2453,9 @@ static int too_many_isolated(struct pglist_data *pgdat, int file,
if (current_is_kswapd())
return 0;

+ if (current->flags & PF_MEMALLOC_ISOLATE)
+ return 0;
+
if (!writeback_throttling_sane(sc))
return 0;

--
2.41.0


2023-08-23 20:33:52

by Alexandru Elisei

Subject: [PATCH RFC 01/37] mm: page_alloc: Rename gfp_to_alloc_flags_cma -> gfp_to_alloc_flags_fast

gfp_to_alloc_flags_cma() is called on the fast path of the page allocator
and all it does is set the ALLOC_CMA flag if all the conditions are met for
the allocation to be satisfied from the MIGRATE_CMA list. Rename it to be
more generic, as it will soon have to handle another flag.

Signed-off-by: Alexandru Elisei <[email protected]>
---
mm/page_alloc.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 7d3460c7a480..e6f950c54494 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3081,7 +3081,7 @@ alloc_flags_nofragment(struct zone *zone, gfp_t gfp_mask)
}

/* Must be called after current_gfp_context() which can change gfp_mask */
-static inline unsigned int gfp_to_alloc_flags_cma(gfp_t gfp_mask,
+static inline unsigned int gfp_to_alloc_flags_fast(gfp_t gfp_mask,
unsigned int alloc_flags)
{
#ifdef CONFIG_CMA
@@ -3784,7 +3784,7 @@ gfp_to_alloc_flags(gfp_t gfp_mask, unsigned int order)
} else if (unlikely(rt_task(current)) && in_task())
alloc_flags |= ALLOC_MIN_RESERVE;

- alloc_flags = gfp_to_alloc_flags_cma(gfp_mask, alloc_flags);
+ alloc_flags = gfp_to_alloc_flags_fast(gfp_mask, alloc_flags);

return alloc_flags;
}
@@ -4074,7 +4074,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,

reserve_flags = __gfp_pfmemalloc_flags(gfp_mask);
if (reserve_flags)
- alloc_flags = gfp_to_alloc_flags_cma(gfp_mask, reserve_flags) |
+ alloc_flags = gfp_to_alloc_flags_fast(gfp_mask, reserve_flags) |
(alloc_flags & ALLOC_KSWAPD);

/*
@@ -4250,7 +4250,7 @@ static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order,
if (should_fail_alloc_page(gfp_mask, order))
return false;

- *alloc_flags = gfp_to_alloc_flags_cma(gfp_mask, *alloc_flags);
+ *alloc_flags = gfp_to_alloc_flags_fast(gfp_mask, *alloc_flags);

/* Dirty zone balancing only done in the fast path */
ac->spread_dirty_pages = (gfp_mask & __GFP_WRITE);
--
2.41.0


2023-08-23 20:56:32

by Alexandru Elisei

Subject: [PATCH RFC 03/37] arm64: mte: Rename __GFP_ZEROTAGS to __GFP_TAGGED

__GFP_ZEROTAGS is used to instruct the page allocator to zero the tags at
the same time as the physical frame is zeroed. The name can be slightly
misleading, because it doesn't mean that the code will zero the tags
unconditionally, but that the tags will be zeroed if and only if the
physical frame is also zeroed (either __GFP_ZERO is set or init_on_alloc is
1).

Rename it to __GFP_TAGGED, in preparation for it to be used by the page
allocator to recognize when an allocation is tagged (has metadata).

Signed-off-by: Alexandru Elisei <[email protected]>
---
arch/arm64/mm/fault.c | 2 +-
include/linux/gfp_types.h | 14 +++++++-------
include/trace/events/mmflags.h | 2 +-
mm/page_alloc.c | 2 +-
4 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index 3fe516b32577..0ca89ebcdc63 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -949,7 +949,7 @@ struct folio *vma_alloc_zeroed_movable_folio(struct vm_area_struct *vma,
* separate DC ZVA and STGM.
*/
if (vma->vm_flags & VM_MTE)
- flags |= __GFP_ZEROTAGS;
+ flags |= __GFP_TAGGED;

return vma_alloc_folio(flags, 0, vma, vaddr, false);
}
diff --git a/include/linux/gfp_types.h b/include/linux/gfp_types.h
index 6583a58670c5..37b9e265d77e 100644
--- a/include/linux/gfp_types.h
+++ b/include/linux/gfp_types.h
@@ -45,7 +45,7 @@ typedef unsigned int __bitwise gfp_t;
#define ___GFP_HARDWALL 0x100000u
#define ___GFP_THISNODE 0x200000u
#define ___GFP_ACCOUNT 0x400000u
-#define ___GFP_ZEROTAGS 0x800000u
+#define ___GFP_TAGGED 0x800000u
#ifdef CONFIG_KASAN_HW_TAGS
#define ___GFP_SKIP_ZERO 0x1000000u
#define ___GFP_SKIP_KASAN 0x2000000u
@@ -226,11 +226,11 @@ typedef unsigned int __bitwise gfp_t;
*
* %__GFP_ZERO returns a zeroed page on success.
*
- * %__GFP_ZEROTAGS zeroes memory tags at allocation time if the memory itself
- * is being zeroed (either via __GFP_ZERO or via init_on_alloc, provided that
- * __GFP_SKIP_ZERO is not set). This flag is intended for optimization: setting
- * memory tags at the same time as zeroing memory has minimal additional
- * performace impact.
+ * %__GFP_TAGGED marks the allocation as having tags, which will be zeroed at
+ * allocation time if the memory itself is being zeroed (either via __GFP_ZERO
+ * or via init_on_alloc, provided that __GFP_SKIP_ZERO is not set). This flag is
+ * intended for optimization: setting memory tags at the same time as zeroing
+ * memory has minimal additional performace impact.
*
* %__GFP_SKIP_KASAN makes KASAN skip unpoisoning on page allocation.
* Used for userspace and vmalloc pages; the latter are unpoisoned by
@@ -241,7 +241,7 @@ typedef unsigned int __bitwise gfp_t;
#define __GFP_NOWARN ((__force gfp_t)___GFP_NOWARN)
#define __GFP_COMP ((__force gfp_t)___GFP_COMP)
#define __GFP_ZERO ((__force gfp_t)___GFP_ZERO)
-#define __GFP_ZEROTAGS ((__force gfp_t)___GFP_ZEROTAGS)
+#define __GFP_TAGGED ((__force gfp_t)___GFP_TAGGED)
#define __GFP_SKIP_ZERO ((__force gfp_t)___GFP_SKIP_ZERO)
#define __GFP_SKIP_KASAN ((__force gfp_t)___GFP_SKIP_KASAN)

diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index 1478b9dd05fa..4ccca8e73c93 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -50,7 +50,7 @@
gfpflag_string(__GFP_RECLAIM), \
gfpflag_string(__GFP_DIRECT_RECLAIM), \
gfpflag_string(__GFP_KSWAPD_RECLAIM), \
- gfpflag_string(__GFP_ZEROTAGS)
+ gfpflag_string(__GFP_TAGGED)

#ifdef CONFIG_KASAN_HW_TAGS
#define __def_gfpflag_names_kasan , \
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e6f950c54494..fdc230440a44 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1516,7 +1516,7 @@ inline void post_alloc_hook(struct page *page, unsigned int order,
{
bool init = !want_init_on_free() && want_init_on_alloc(gfp_flags) &&
!should_skip_init(gfp_flags);
- bool zero_tags = init && (gfp_flags & __GFP_ZEROTAGS);
+ bool zero_tags = init && (gfp_flags & __GFP_TAGGED);
int i;

set_page_private(page, 0);
--
2.41.0
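
For illustration only (not part of the patch): a hypothetical caller that
wants a tagged, zero-initialised page after the rename would combine the
flags as below; per the post_alloc_hook() hunk above, the tags are cleared
only because the data is also being zeroed.

#include <linux/gfp.h>

/* Made-up helper, shown only to illustrate the renamed flag. */
static struct page *alloc_tagged_zeroed_page(gfp_t gfp)
{
        /*
         * __GFP_TAGGED alone does not zero the tags; the data must also be
         * zeroed (__GFP_ZERO here, or init_on_alloc=1) for post_alloc_hook()
         * to clear the tags at allocation time.
         */
        return alloc_page(gfp | __GFP_ZERO | __GFP_TAGGED);
}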


2023-08-24 12:33:01

by Catalin Marinas

[permalink] [raw]
Subject: Re: [PATCH RFC 00/37] Add support for arm64 MTE dynamic tag storage reuse

On Thu, Aug 24, 2023 at 09:50:32AM +0200, David Hildenbrand wrote:
> after re-reading it 2 times, I still have no clue what your patch set is
> actually trying to achieve. Probably there is a way to describe how user
> space intends to interact with this feature, so to see which value this
> actually has for user space -- and if we are using the right APIs and
> allocators.

I'll try with an alternative summary, hopefully it becomes clearer (I
think Alex is away until the end of the week, so he may not reply
immediately). If this still doesn't work, maybe we should try a
different implementation ;).

The way MTE is implemented currently is to have a static carve-out of
the DRAM to store the allocation tags (a.k.a. memory colour). This is
what we call the tag storage. Each 16 bytes have 4 bits of tags, so this
means 1/32 of the DRAM, roughly 3% used for the tag storage. This is
done transparently by the hardware/interconnect (with firmware setup)
and normally hidden from the OS. So a checked memory access to location
X generates a tag fetch from location Y in the carve-out and this tag is
compared with the bits 59:56 in the pointer. The correspondence from X
to Y is linear (subject to a minimum block size to deal with some
address interleaving). The software doesn't need to know about this
correspondence as we have specific instructions like STG/LDG to location
X that lead to a tag store/load to Y.
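
To put numbers on this: one 4-bit tag covers 16 bytes of data and two tags
fit in one byte of tag storage, so Y advances by one byte for every 32
bytes of X. A rough sketch, purely illustrative (the kernel never computes
this itself, the hardware/interconnect does):

#define DATA_BYTES_PER_TAG      16      /* bytes covered by one 4-bit tag */
#define TAGS_PER_STORAGE_BYTE   2       /* two 4-bit tags per byte of tag storage */

static unsigned long tag_storage_offset(unsigned long data_offset)
{
        return data_offset / (DATA_BYTES_PER_TAG * TAGS_PER_STORAGE_BYTE); /* 1/32 */
}
/* e.g. a 4K page of data needs 4096 / 32 = 128 bytes of tag storage */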

Now, not all memory used by applications is tagged (mmap(PROT_MTE)).
For example, some large allocations may not use PROT_MTE at all or only
for the first and last page since initialising the tags takes time. The
side-effect is that, of this 3% of DRAM, only part, say 1%, is effectively
used. Some people want the unused tag storage to be released for normal
data usage (i.e. give it to the kernel page allocator).

So the first complication is that a PROT_MTE page allocation at address
X will need to reserve the tag storage at location Y (and migrate any
data in that page if it is in use).

To make things worse, pages in the tag storage/carve-out range cannot
use PROT_MTE themselves on current hardware, so this adds the second
complication - a heterogeneous memory layout. The kernel needs to know
where to allocate a PROT_MTE page from or migrate a current page if it
becomes PROT_MTE (mprotect()) and the range it is in does not support
tagging.

Some other complications are arm64-specific like cache coherency between
tags and data accesses. There is a draft architecture spec which will be
released soon, detailing how the hardware behaves.

To your question about user APIs/ABIs, that's entirely transparent. As
with the current kernel (without this dynamic tag storage), a user only
needs to ask for PROT_MTE mappings to get tagged pages.
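
For completeness, the userspace side is just the usual arm64 mmap() flow
and does not change with this series; a minimal example (error handling
and the prctl() needed to enable tag check faults omitted):

#include <stddef.h>
#include <sys/mman.h>

#ifndef PROT_MTE
#define PROT_MTE        0x20    /* arm64-specific mmap() protection flag */
#endif

static void *alloc_tagged(size_t len)
{
        return mmap(NULL, len, PROT_READ | PROT_WRITE | PROT_MTE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
}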

> So some dummy questions / statements
>
> 1) Is this about re-purposing the memory used to hold tags for a different
> purpose?

Yes. To allow part of this 3% to be used for data. It could even be the
whole 3% if no application is enabling MTE.

> Or what exactly is user space going to do with the PROT_MTE memory?
> The whole mprotect(PROT_MTE) approach might not be the right thing to do.

As I mentioned above, there's no difference to the user ABI. PROT_MTE
works as before with the kernel moving pages around as needed.

> 2) Why do we even have to involve the page allocator if this is some
> special-purpose memory? Re-porpusing the buddy when later using
> alloc_contig_range() either way feels wrong.

The aim here is to rebrand this special-purpose memory as a nearly
general-purpose one (bar the PROT_MTE restriction).

> The core-mm changes don't look particularly appealing :)

OTOH, it's a fun project to learn about the mm ;).

Our aim for now is to get some feedback from the mm community on whether
this special -> nearly general rebranding is acceptable together with
the introduction of a heterogeneous memory concept for the general
purpose page allocator.

There are some alternatives we looked at with a smaller mm impact but we
haven't prototyped them yet: (a) use the available tag storage as a
frontswap accelerator or (b) use it as a (compressed) ramdisk that can
be mounted as swap. The latter has the advantage of showing up in the
available total memory, which keeps customers happy ;). Both options would
need some mm hooks when a PROT_MTE page gets allocated to release the
corresponding page in the tag storage range.

--
Catalin

2023-09-06 16:07:48

by Alexandru Elisei

[permalink] [raw]
Subject: Re: [PATCH RFC 00/37] Add support for arm64 MTE dynamic tag storage reuse

Hi,

Thank you for the feedback!

Catalin did a great job explaining what this patch series does, I'll add my
own comments on top of his.

On Thu, Aug 24, 2023 at 04:24:30PM +0100, Catalin Marinas wrote:
> On Thu, Aug 24, 2023 at 01:25:41PM +0200, David Hildenbrand wrote:
> > On 24.08.23 13:06, David Hildenbrand wrote:
> > > On 24.08.23 12:44, Catalin Marinas wrote:
> > > > The way MTE is implemented currently is to have a static carve-out of
> > > > the DRAM to store the allocation tags (a.k.a. memory colour). This is
> > > > what we call the tag storage. Each 16 bytes have 4 bits of tags, so this
> > > > means 1/32 of the DRAM, roughly 3% used for the tag storage. This is
> > > > done transparently by the hardware/interconnect (with firmware setup)
> > > > and normally hidden from the OS. So a checked memory access to location
> > > > X generates a tag fetch from location Y in the carve-out and this tag is
> > > > compared with the bits 59:56 in the pointer. The correspondence from X
> > > > to Y is linear (subject to a minimum block size to deal with some
> > > > address interleaving). The software doesn't need to know about this
> > > > correspondence as we have specific instructions like STG/LDG to location
> > > > X that lead to a tag store/load to Y.
> > > >
> > > > Now, not all memory used by applications is tagged (mmap(PROT_MTE)).
> > > > For example, some large allocations may not use PROT_MTE at all or only
> > > > for the first and last page since initialising the tags takes time. The
> > > > side-effect is that of these 3% DRAM, only part, say 1% is effectively
> > > > used. Some people want the unused tag storage to be released for normal
> > > > data usage (i.e. give it to the kernel page allocator).
> [...]
> > > So it sounds like you might want to provide that tag memory using CMA.
> > >
> > > That way, only movable allocations can end up on that CMA memory area,
> > > and you can allocate selected tag pages on demand (similar to the
> > > alloc_contig_range() use case).
> > >
> > > That also solves the issue that such tag memory must not be longterm-pinned.
> > >
> > > Regarding one complication: "The kernel needs to know where to allocate
> > > a PROT_MTE page from or migrate a current page if it becomes PROT_MTE
> > > (mprotect()) and the range it is in does not support tagging.",
> > > simplified handling would be if it's in a MIGRATE_CMA pageblock, it
> > > doesn't support tagging. You have to migrate to a !CMA page (for
> > > example, not specifying GFP_MOVABLE as a quick way to achieve that).
> >
> > Okay, I now realize that this patch set effectively duplicates some CMA
> > behavior using a new migrate-type.
>
> Yes, pretty much, with some additional hooks to trigger migration. The
> CMA mechanism was a great source of inspiration.
>
> In addition, there are some races that are addressed mostly around page
> migration/copying: the source page is untagged, the destination
> allocated as untagged but before the copy an mprotect() makes the source
> tagged (PG_mte_tagged set) and the copy_highpage() mechanism not having
> anywhere to store the tags.
>
> > Yeah, that's probably not what we want just to identify if memory is
> > taggable or not.
> >
> > Maybe there is a way to just keep reusing most of CMA instead.
>
> A potential issue is that devices (mobile phones) may need a different
> CMA range as well for DMA (and not necessarily in ZONE_DMA). Can
> free_area[MIGRATE_CMA] handle multiple disjoint ranges? I don't see why
> not as it's just a list.

I don't think that's a problem either; today the user can specify multiple
CMA ranges on the kernel command line (via "cma", "hugetlb_cma", etc). CMA
already has the mechanism to keep track of multiple regions - it stores
them in the cma_areas array.

>
> We (Google and Arm) went through a few rounds of discussions and
> prototyping trying to find the best approach: (1) a separate free_area[]
> array in each zone (early proof of concept from Peter C and Evgenii S,
> https://github.com/google/sanitizers/tree/master/mte-dynamic-carveout),
> (2) a new ZONE_METADATA, (3) a separate CPU-less NUMA node just for the
> tag storage, (4) a new MIGRATE_METADATA type.
>
> We settled on the latter as it closely resembles CMA without interfering
> with it. I don't remember why we did not just go for MIGRATE_CMA, it may
> have been the heterogeneous memory aspect and the fact that we don't
> want PROT_MTE (VM_MTE) allocations from this range. If the hardware
> allowed this, I think the patches would have been a bit simpler.

You are correct, we settled on a new migrate type because the tag storage
memory is fundamentally a different memory type with different properties
than the rest of the memory in the system: tag storage memory cannot be
tagged, MIGRATE_CMA memory can be tagged.

>
> Alex can comment more next week on how we ended up with this choice but
> if we find a way to avoid VM_MTE allocations from certain areas, I think
> we can reuse the CMA infrastructure. A bigger hammer would be no VM_MTE
> allocations from any CMA range but it seems too restrictive.

I considered mixing the tag storage memory with normal memory and adding
it to MIGRATE_CMA. But since tag storage memory cannot be tagged, it is no
longer enough to have a __GFP_MOVABLE allocation request to use
MIGRATE_CMA.

I considered two solutions to this problem:

1. Only allocate from MIGRATE_CMA if the requested memory is not tagged =>
this effectively means transforming all memory from MIGRATE_CMA into the
MIGRATE_METADATA migratetype that the series introduces. Not very
appealing, because that means treating normal memory that is also on the
MIGRATE_CMA lists as if it were tag storage (i.e. never taggable).

2. Keep track of which pages are tag storage at page granularity (either by
a page flag, or by checking that the pfn falls in one of the tag storage
regions, or by some other mechanism). When the page allocator takes free
pages from the MIGRATE_METADATA list to satisfy an allocation, compare the
gfp mask with the page type, and if the allocation is tagged and the page
is a tag storage page, put it back at the tail of the free list and choose
the next page. Repeat until the page allocator finds a normal memory page
that can be tagged (some refinements are obviously needed to avoid
infinite loops); a rough sketch follows below.
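
Pseudocode only, with page_is_tag_storage() being a made-up helper and
without the refinements needed to guarantee termination:

static struct page *pick_free_page(struct list_head *free_list, gfp_t gfp)
{
        struct page *page, *tmp;

        list_for_each_entry_safe(page, tmp, free_list, lru) {
                /* tag storage pages cannot themselves be tagged */
                if ((gfp & __GFP_TAGGED) && page_is_tag_storage(page)) {
                        list_move_tail(&page->lru, free_list);
                        continue;
                }
                return page;
        }
        return NULL;
}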

I considered solution 2 to be more complicated than keeping track of tag
storage pages at the migratetype level. Conceptually, keeping two distinct
memory types on separate migratetypes looked to me like the cleaner and
simpler solution.

Maybe I missed something, I'm definitely open to suggestions regarding
putting the tag storage pages on MIGRATE_CMA (or another migratetype) if
that's a better approach.

Might be worth pointing out that putting the tag storage memory on the
MIGRATE_CMA migratetype only changes how the page allocator allocates
pages; all the other changes to migration/compaction/mprotect/etc will
still be there, because they are needed not because of how the tag storage
memory is represented by the page allocator, but because tag storage memory
cannot be tagged, and regular memory can.

Thanks,
Alex

>
> > Another simpler idea to get started would be to just intercept the first
> > PROT_MTE, and allocate all CMA memory. In that case, systems that don't ever
> > use PROT_MTE can have that additional 3% of memory.
>
> We had this on the table as well but the most likely deployment, at
> least initially, is only some secure services enabling MTE with various
> apps gradually moving towards this in time. So that's why the main
> pushback from vendors is having this 3% reserved permanently. Even if
> all apps use MTE, only the anonymous mappings are PROT_MTE, so still not
> fully using the tag storage.
>
> --
> Catalin
>

2023-09-13 08:11:40

by Kuan-Ying Lee

[permalink] [raw]
Subject: Re: [PATCH RFC 00/37] Add support for arm64 MTE dynamic tag storage reuse

On Wed, 2023-08-23 at 14:13 +0100, Alexandru Elisei wrote:
> [...]
> I've tested the series using FVP with MTE enabled, but without support
> for dynamic tag storage reuse. To simulate it, I've added two fake tag
> storage regions in the DTB by splitting a 2GB region roughly into 33
> slices of size 0x3e0_0000, and using 32 of them for tagged memory and
> one slice for tag storage:
>
> diff --git a/arch/arm64/boot/dts/arm/fvp-base-revc.dts
> b/arch/arm64/boot/dts/arm/fvp-base-revc.dts
> index 60472d65a355..bd050373d6cf 100644
> --- a/arch/arm64/boot/dts/arm/fvp-base-revc.dts
> +++ b/arch/arm64/boot/dts/arm/fvp-base-revc.dts
> @@ -165,10 +165,28 @@ C1_L2: l2-cache1 {
> };
> };
>
> - memory@80000000 {
> + memory0: memory@80000000 {
> device_type = "memory";
> - reg = <0x00000000 0x80000000 0 0x80000000>,
> - <0x00000008 0x80000000 0 0x80000000>;
> + reg = <0x00 0x80000000 0x00 0x7c000000>;
> + };
> +
> + metadata0: metadata@c0000000 {
> + compatible = "arm,mte-tag-storage";
> + reg = <0x00 0xfc000000 0x00 0x3e00000>;
> + block-size = <0x1000>;
> + memory = <&memory0>;
> + };
> +
> + memory1: memory@880000000 {
> + device_type = "memory";
> + reg = <0x08 0x80000000 0x00 0x7c000000>;
> + };
> +
> + metadata1: metadata@8c0000000 {
> + compatible = "arm,mte-tag-storage";
> + reg = <0x08 0xfc000000 0x00 0x3e00000>;
> + block-size = <0x1000>;
> + memory = <&memory1>;
> };
>

Hi Alexandru,

AFAIK, the above memory configuration means that there are two region
of dram(0x80000000-0xfc000000 and 0x8_80000000-0x8_fc0000000) and this
is called PDD memory map.

Document[1] said there are some constraints of tag memory as below.

| The following constraints apply to the tag regions in DRAM:
| 1. The tag region cannot be interleaved with the data region.
| The tag region must also be above the data region within DRAM.
|
| 2.The tag region in the physical address space cannot straddle
| multiple regions of a memory map.
|
| PDD memory map is not allowed to have part of the tag region between
| 2GB-4GB and another part between 34GB-64GB.


I'm not sure if we can separate tag memory with the above
configuration. Or do I miss something?

[1] https://developer.arm.com/documentation/101569/0300/?lang=en
(Section 5.4.6.1)

Thanks,
Kuan-Ying Lee

2023-09-13 15:44:51

by Catalin Marinas

[permalink] [raw]
Subject: Re: [PATCH RFC 00/37] Add support for arm64 MTE dynamic tag storage reuse

On Mon, Sep 11, 2023 at 02:29:03PM +0200, David Hildenbrand wrote:
> On 11.09.23 13:52, Catalin Marinas wrote:
> > On Wed, Sep 06, 2023 at 12:23:21PM +0100, Alexandru Elisei wrote:
> > > On Thu, Aug 24, 2023 at 04:24:30PM +0100, Catalin Marinas wrote:
> > > > On Thu, Aug 24, 2023 at 01:25:41PM +0200, David Hildenbrand wrote:
> > > > > On 24.08.23 13:06, David Hildenbrand wrote:
> > > > > > Regarding one complication: "The kernel needs to know where to allocate
> > > > > > a PROT_MTE page from or migrate a current page if it becomes PROT_MTE
> > > > > > (mprotect()) and the range it is in does not support tagging.",
> > > > > > simplified handling would be if it's in a MIGRATE_CMA pageblock, it
> > > > > > doesn't support tagging. You have to migrate to a !CMA page (for
> > > > > > example, not specifying GFP_MOVABLE as a quick way to achieve that).
> > > > >
> > > > > Okay, I now realize that this patch set effectively duplicates some CMA
> > > > > behavior using a new migrate-type.
> > [...]
> > > I considered mixing the tag storage memory memory with normal memory and
> > > adding it to MIGRATE_CMA. But since tag storage memory cannot be tagged,
> > > this means that it's not enough anymore to have a __GFP_MOVABLE allocation
> > > request to use MIGRATE_CMA.
> > >
> > > I considered two solutions to this problem:
> > >
> > > 1. Only allocate from MIGRATE_CMA is the requested memory is not tagged =>
> > > this effectively means transforming all memory from MIGRATE_CMA into the
> > > MIGRATE_METADATA migratetype that the series introduces. Not very
> > > appealing, because that means treating normal memory that is also on the
> > > MIGRATE_CMA lists as tagged memory.
> >
> > That's indeed not ideal. We could try this if it makes the patches
> > significantly simpler, though I'm not so sure.
> >
> > Allocating metadata is the easier part as we know the correspondence
> > from the tagged pages (32 PROT_MTE page) to the metadata page (1 tag
> > storage page), so alloc_contig_range() does this for us. Just adding it
> > to the CMA range is sufficient.
> >
> > However, making sure that we don't allocate PROT_MTE pages from the
> > metadata range is what led us to another migrate type. I guess we could
> > achieve something similar with a new zone or a CPU-less NUMA node,
>
> Ideally, no significant core-mm changes to optimize for an architecture
> oddity. That implies, no new zones and no new migratetypes -- unless it is
> unavoidable and you are confident that you can convince core-MM people that
> the use case (giving back 3% of system RAM at max in some setups) is worth
> the trouble.

If I was an mm maintainer, I'd also question this ;). But vendors seem
pretty picky about the amount of RAM reserved for MTE (e.g. 0.5G for a
16G platform does look somewhat big). As more and more apps adopt MTE,
the wastage would be smaller but the first step is getting vendors to
enable it.

> I also had CPU-less NUMA nodes in mind when thinking about that, but not
> sure how easy it would be to integrate it. If the tag memory has actually
> different performance characteristics as well, a NUMA node would be the
> right choice.

In general I'd expect the same characteristics. However, changing the
memory designation from tag to data (and vice-versa) requires some cache
maintenance. The allocation cost is slightly higher (not the runtime
one), so it would help if the page allocator does not favour this range.
Anyway, that's an optimisation to worry about later.

> If we could find some way to easily support this either via CMA or CPU-less
> NUMA nodes, that would be much preferable; even if we cannot cover each and
> every future use case right now. I expect some issues with CXL+MTE either
> way, but am happy to be taught otherwise :)

I think CXL+MTE is rather theoretical at the moment. Given that PCIe
doesn't have any notion of MTE, more likely there would be some piece of
interconnect that generates two memory accesses: one for data and the
other for tags at a configurable offset (which may or may not be in the
same CXL range).

> Another thought I had was adding something like CMA memory characteristics.
> Like, asking if a given CMA area/page supports tagging (i.e., flag for the
> CMA area set?)?

I don't think adding CMA memory characteristics helps much. The metadata
allocation wouldn't go through cma_alloc() but rather
alloc_contig_range() directly for a specific pfn corresponding to the
data pages with PROT_MTE. The core mm code doesn't need to know about
the tag storage layout.
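
Roughly something like the below, where one tag storage page covers 32
data pages (data_pfn_to_tag_pfn() is a made-up helper for the linear
mapping, and refcounting of tag pages shared by several data pages is not
shown):

#include <linux/mm.h>

static int reserve_tag_storage_for(unsigned long data_pfn)
{
        unsigned long tag_pfn = data_pfn_to_tag_pfn(data_pfn);

        return alloc_contig_range(tag_pfn, tag_pfn + 1,
                                  MIGRATE_CMA, GFP_KERNEL);
}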

It's also unlikely for cma_alloc() memory to be mapped as PROT_MTE.
That's typically coming from device drivers (DMA API) with their own
mmap() implementation that doesn't normally set VM_MTE_ALLOWED (and
therefore PROT_MTE is rejected).

What we need though is to prevent vma_alloc_folio() from allocating from
a MIGRATE_CMA list if PROT_MTE (VM_MTE). I guess that's basically
removing __GFP_MOVABLE in those cases. As long as we don't have large
ZONE_MOVABLE areas, it shouldn't be an issue.
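
Something along these lines (sketch only, not a patch):

static gfp_t vma_alloc_gfp(struct vm_area_struct *vma, gfp_t gfp)
{
        /* tagged mappings must not be backed by MIGRATE_CMA/ZONE_MOVABLE */
        if (vma->vm_flags & VM_MTE)
                gfp &= ~__GFP_MOVABLE;
        return gfp;
}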

> When you need memory that supports tagging and have a page that does not
> support tagging (CMA && taggable), simply migrate to !MOVABLE memory
> (eventually we could also try adding !CMA).
>
> Was that discussed and what would be the challenges with that? Page
> migration due to compaction comes to mind, but it might also be easy to
> handle if we can just avoid CMA memory for that.

IIRC that was because PROT_MTE pages would have to come only from
!MOVABLE ranges. Maybe that's not such a big deal.

We'll give this a go and hopefully it simplifies the patches a bit (it
will take a while as Alex keeps going on holiday ;)). In the meantime,
I'm talking to the hardware people to see whether we can have MTE pages
in the tag storage/metadata range. We'd still need to reserve about 0.1%
of the RAM for the metadata corresponding to the tag storage range when
used as data but that's negligible (1/32 of 1/32). So if some future
hardware allows this, we can drop the page allocation restriction from
the CMA range.

> > though the latter is not guaranteed not to allocate memory from the
> > range, only make it less likely. Both these options are less flexible in
> > terms of size/alignment/placement.
> >
> > Maybe as a quick hack - only allow PROT_MTE from ZONE_NORMAL and
> > configure the metadata range in ZONE_MOVABLE but at some point I'd
> > expect some CXL-attached memory to support MTE with additional carveout
> > reserved.
>
> I have no idea how we could possibly cleanly support memory hotplug in
> virtual environments (virtual DIMMs, virtio-mem) with MTE. In contrast to
> s390x storage keys, the approach that arm64 with MTE took here (exposing tag
> memory to the VM) makes it rather hard and complicated.

The current thinking is that the VM is not aware of the tag storage,
that's entirely managed by the host. The host would treat the guest
memory similarly to the PROT_MTE user allocations, reserve metadata etc.

Thanks for the feedback so far, very useful.

--
Catalin

2023-09-14 17:44:34

by Catalin Marinas

[permalink] [raw]
Subject: Re: [PATCH RFC 00/37] Add support for arm64 MTE dynamic tag storage reuse

Hi Kuan-Ying,

On Wed, Sep 13, 2023 at 08:11:40AM +0000, Kuan-Ying Lee (李冠穎) wrote:
> On Wed, 2023-08-23 at 14:13 +0100, Alexandru Elisei wrote:
> > diff --git a/arch/arm64/boot/dts/arm/fvp-base-revc.dts
> > b/arch/arm64/boot/dts/arm/fvp-base-revc.dts
> > index 60472d65a355..bd050373d6cf 100644
> > --- a/arch/arm64/boot/dts/arm/fvp-base-revc.dts
> > +++ b/arch/arm64/boot/dts/arm/fvp-base-revc.dts
> > @@ -165,10 +165,28 @@ C1_L2: l2-cache1 {
> > };
> > };
> >
> > - memory@80000000 {
> > + memory0: memory@80000000 {
> > device_type = "memory";
> > - reg = <0x00000000 0x80000000 0 0x80000000>,
> > - <0x00000008 0x80000000 0 0x80000000>;
> > + reg = <0x00 0x80000000 0x00 0x7c000000>;
> > + };
> > +
> > + metadata0: metadata@c0000000 {
> > + compatible = "arm,mte-tag-storage";
> > + reg = <0x00 0xfc000000 0x00 0x3e00000>;
> > + block-size = <0x1000>;
> > + memory = <&memory0>;
> > + };
> > +
> > + memory1: memory@880000000 {
> > + device_type = "memory";
> > + reg = <0x08 0x80000000 0x00 0x7c000000>;
> > + };
> > +
> > + metadata1: metadata@8c0000000 {
> > + compatible = "arm,mte-tag-storage";
> > + reg = <0x08 0xfc000000 0x00 0x3e00000>;
> > + block-size = <0x1000>;
> > + memory = <&memory1>;
> > };
> >
>
> AFAIK, the above memory configuration means that there are two region
> of dram(0x80000000-0xfc000000 and 0x8_80000000-0x8_fc0000000) and this
> is called PDD memory map.
>
> Document[1] said there are some constraints of tag memory as below.
>
> | The following constraints apply to the tag regions in DRAM:
> | 1. The tag region cannot be interleaved with the data region.
> | The tag region must also be above the data region within DRAM.
> |
> | 2.The tag region in the physical address space cannot straddle
> | multiple regions of a memory map.
> |
> | PDD memory map is not allowed to have part of the tag region between
> | 2GB-4GB and another part between 34GB-64GB.
>
> I'm not sure if we can separate tag memory with the above
> configuration. Or do I miss something?
>
> [1] https://developer.arm.com/documentation/101569/0300/?lang=en
> (Section 5.4.6.1)

Good point, thanks. The above dts is some random layout we picked as an
example; it doesn't match any real hardware and we didn't pay attention
to the interconnect limitations (we fake the tag storage on the model).

I'll try to dig out how the mtu_tag_addr_shutter registers work and how
the sparse DRAM space is compressed to a smaller tag range. But that's
something done by firmware and the kernel only learns the tag storage
location from the DT (provided by firmware). We also don't need to know
the fine-grained mapping between 32 bytes of data and 1 byte (2 tags) in
the tag storage, only the block size in the tag storage space that
covers all interleaving done by the interconnect (it can be from 1 byte
to something larger like a page; the kernel will then use the lowest
common multiple between a page size and this tag block size to figure
out how many pages to reserve).
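
In kernel terms, the reservation unit would be something like the below
(illustrative only, helper name made up):

#include <linux/lcm.h>

/* tag storage pages that must be reserved/freed as a single unit */
static unsigned long tag_block_pages(unsigned long tag_block_size)
{
        return lcm(PAGE_SIZE, tag_block_size) / PAGE_SIZE;
}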

--
Catalin

2023-10-25 08:47:56

by Alexandru Elisei

[permalink] [raw]
Subject: Re: [PATCH RFC 00/37] Add support for arm64 MTE dynamic tag storage reuse

Hi,

On Wed, Oct 25, 2023 at 11:59:32AM +0900, Hyesoo Yu wrote:
> On Wed, Sep 13, 2023 at 04:29:25PM +0100, Catalin Marinas wrote:
> > On Mon, Sep 11, 2023 at 02:29:03PM +0200, David Hildenbrand wrote:
> > > On 11.09.23 13:52, Catalin Marinas wrote:
> > > > On Wed, Sep 06, 2023 at 12:23:21PM +0100, Alexandru Elisei wrote:
> > > > > On Thu, Aug 24, 2023 at 04:24:30PM +0100, Catalin Marinas wrote:
> > > > > > On Thu, Aug 24, 2023 at 01:25:41PM +0200, David Hildenbrand wrote:
> > > > > > > On 24.08.23 13:06, David Hildenbrand wrote:
> > > > > > > > Regarding one complication: "The kernel needs to know where to allocate
> > > > > > > > a PROT_MTE page from or migrate a current page if it becomes PROT_MTE
> > > > > > > > (mprotect()) and the range it is in does not support tagging.",
> > > > > > > > simplified handling would be if it's in a MIGRATE_CMA pageblock, it
> > > > > > > > doesn't support tagging. You have to migrate to a !CMA page (for
> > > > > > > > example, not specifying GFP_MOVABLE as a quick way to achieve that).
> > > > > > >
> > > > > > > Okay, I now realize that this patch set effectively duplicates some CMA
> > > > > > > behavior using a new migrate-type.
> > > > [...]
> > > > > I considered mixing the tag storage memory memory with normal memory and
> > > > > adding it to MIGRATE_CMA. But since tag storage memory cannot be tagged,
> > > > > this means that it's not enough anymore to have a __GFP_MOVABLE allocation
> > > > > request to use MIGRATE_CMA.
> > > > >
> > > > > I considered two solutions to this problem:
> > > > >
> > > > > 1. Only allocate from MIGRATE_CMA is the requested memory is not tagged =>
> > > > > this effectively means transforming all memory from MIGRATE_CMA into the
> > > > > MIGRATE_METADATA migratetype that the series introduces. Not very
> > > > > appealing, because that means treating normal memory that is also on the
> > > > > MIGRATE_CMA lists as tagged memory.
> > > >
> > > > That's indeed not ideal. We could try this if it makes the patches
> > > > significantly simpler, though I'm not so sure.
> > > >
> > > > Allocating metadata is the easier part as we know the correspondence
> > > > from the tagged pages (32 PROT_MTE page) to the metadata page (1 tag
> > > > storage page), so alloc_contig_range() does this for us. Just adding it
> > > > to the CMA range is sufficient.
> > > >
> > > > However, making sure that we don't allocate PROT_MTE pages from the
> > > > metadata range is what led us to another migrate type. I guess we could
> > > > achieve something similar with a new zone or a CPU-less NUMA node,
> > >
> > > Ideally, no significant core-mm changes to optimize for an architecture
> > > oddity. That implies, no new zones and no new migratetypes -- unless it is
> > > unavoidable and you are confident that you can convince core-MM people that
> > > the use case (giving back 3% of system RAM at max in some setups) is worth
> > > the trouble.
> >
> > If I was an mm maintainer, I'd also question this ;). But vendors seem
> > pretty picky about the amount of RAM reserved for MTE (e.g. 0.5G for a
> > 16G platform does look somewhat big). As more and more apps adopt MTE,
> > the wastage would be smaller but the first step is getting vendors to
> > enable it.
> >
> > > I also had CPU-less NUMA nodes in mind when thinking about that, but not
> > > sure how easy it would be to integrate it. If the tag memory has actually
> > > different performance characteristics as well, a NUMA node would be the
> > > right choice.
> >
> > In general I'd expect the same characteristics. However, changing the
> > memory designation from tag to data (and vice-versa) requires some cache
> > maintenance. The allocation cost is slightly higher (not the runtime
> > one), so it would help if the page allocator does not favour this range.
> > Anyway, that's an optimisation to worry about later.
> >
> > > If we could find some way to easily support this either via CMA or CPU-less
> > > NUMA nodes, that would be much preferable; even if we cannot cover each and
> > > every future use case right now. I expect some issues with CXL+MTE either
> > > way , but are happy to be taught otherwise :)
> >
> > I think CXL+MTE is rather theoretical at the moment. Given that PCIe
> > doesn't have any notion of MTE, more likely there would be some piece of
> > interconnect that generates two memory accesses: one for data and the
> > other for tags at a configurable offset (which may or may not be in the
> > same CXL range).
> >
> > > Another thought I had was adding something like CMA memory characteristics.
> > > Like, asking if a given CMA area/page supports tagging (i.e., flag for the
> > > CMA area set?)?
> >
> > I don't think adding CMA memory characteristics helps much. The metadata
> > allocation wouldn't go through cma_alloc() but rather
> > alloc_contig_range() directly for a specific pfn corresponding to the
> > data pages with PROT_MTE. The core mm code doesn't need to know about
> > the tag storage layout.
> >
> > It's also unlikely for cma_alloc() memory to be mapped as PROT_MTE.
> > That's typically coming from device drivers (DMA API) with their own
> > mmap() implementation that doesn't normally set VM_MTE_ALLOWED (and
> > therefore PROT_MTE is rejected).
> >
> > What we need though is to prevent vma_alloc_folio() from allocating from
> > a MIGRATE_CMA list if PROT_MTE (VM_MTE). I guess that's basically
> > removing __GFP_MOVABLE in those cases. As long as we don't have large
> > ZONE_MOVABLE areas, it shouldn't be an issue.
> >
>
> How about unsetting ALLOC_CMA if __GFP_TAGGED?
> Removing __GFP_MOVABLE may cause movable pages to be allocated in an
> unmovable migratetype, which may not be desirable for page fragmentation.

Yes, not setting ALLOC_CMA in alloc_flags if __GFP_TAGGED is what I am
intending to do.
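
i.e., roughly this shape, modelled on the existing gfp_to_alloc_flags_cma()
helper in mm/page_alloc.c (sketch only, not the final patch):

static unsigned int gfp_to_alloc_flags_cma(gfp_t gfp_mask,
                                           unsigned int alloc_flags)
{
#ifdef CONFIG_CMA
        /* tagged allocations must not be served from MIGRATE_CMA pageblocks */
        if (gfp_migratetype(gfp_mask) == MIGRATE_MOVABLE &&
            !(gfp_mask & __GFP_TAGGED))
                alloc_flags |= ALLOC_CMA;
#endif
        return alloc_flags;
}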

>
> > > When you need memory that supports tagging and have a page that does not
> > > support tagging (CMA && taggable), simply migrate to !MOVABLE memory
> > > (eventually we could also try adding !CMA).
> > >
> > > Was that discussed and what would be the challenges with that? Page
> > > migration due to compaction comes to mind, but it might also be easy to
> > > handle if we can just avoid CMA memory for that.
> >
> > IIRC that was because PROT_MTE pages would have to come only from
> > !MOVABLE ranges. Maybe that's not such big deal.
> >
>
> Could you explain what it means that PROT_MTE pages have to come only from
> !MOVABLE ranges? I don't understand this part very well.

I believe that was with the old approach, where tag storage cannot be tagged.

I'm guessing that the idea was that, during migration of a tagged page, to
make sure that the destination page is not a tag storage page (which cannot
be tagged), the gfp flags used for allocating the destination page would be
set without __GFP_MOVABLE, which ensures that the destination page is not
allocated from MIGRATE_CMA. But that is no longer needed if we don't set
ALLOC_CMA when __GFP_TAGGED is set.

Thanks,
Alex

>
> Thanks,
> Hyesoo.
>
> > We'll give this a go and hopefully it simplifies the patches a bit (it
> > will take a while as Alex keeps going on holiday ;)). In the meantime,
> > I'm talking to the hardware people to see whether we can have MTE pages
> > in the tag storage/metadata range. We'd still need to reserve about 0.1%
> > of the RAM for the metadata corresponding to the tag storage range when
> > used as data but that's negligible (1/32 of 1/32). So if some future
> > hardware allows this, we can drop the page allocation restriction from
> > the CMA range.
> >
> > > > though the latter is not guaranteed not to allocate memory from the
> > > > range, only make it less likely. Both these options are less flexible in
> > > > terms of size/alignment/placement.
> > > >
> > > > Maybe as a quick hack - only allow PROT_MTE from ZONE_NORMAL and
> > > > configure the metadata range in ZONE_MOVABLE but at some point I'd
> > > > expect some CXL-attached memory to support MTE with additional carveout
> > > > reserved.
> > >
> > > I have no idea how we could possibly cleanly support memory hotplug in
> > > virtual environments (virtual DIMMs, virtio-mem) with MTE. In contrast to
> > > s390x storage keys, the approach that arm64 with MTE took here (exposing tag
> > > memory to the VM) makes it rather hard and complicated.
> >
> > The current thinking is that the VM is not aware of the tag storage,
> > that's entirely managed by the host. The host would treat the guest
> > memory similarly to the PROT_MTE user allocations, reserve metadata etc.
> >
> > Thanks for the feedback so far, very useful.
> >
> > --
> > Catalin
> >


2023-10-27 11:05:00

by Catalin Marinas

[permalink] [raw]
Subject: Re: [PATCH RFC 00/37] Add support for arm64 MTE dynamic tag storage reuse

On Wed, Oct 25, 2023 at 05:52:58PM +0900, Hyesoo Yu wrote:
> If we only avoid using ALLOC_CMA for __GFP_TAGGED, would we still be able to use
> the next iteration even if the hardware does not support "tag of tag"?

It depends on what the next iteration looks like. The plan was not to
support this so that we avoid another complication where a non-tagged
page is mprotect'ed to become tagged and it would need to be migrated
out of the CMA range. Not sure how much code it would save.

> I am not sure every vendor will support tag of tag, since there is no information
> related to that feature, like in the Google spec document.

If you are aware of any vendors not supporting this, please direct them
to the Arm support team; it would be very useful information for us.

Thanks.

--
Catalin