2021-06-14 09:07:03

by Steven Price

Subject: [PATCH v15 0/7] MTE support for KVM guest

This series adds support for using the Arm Memory Tagging Extensions
(MTE) in a KVM guest.

I realise there are still open questions[1] around the performance of
this series (the 'big lock', tag_sync_lock, introduced in the first
patch). But there should be no impact on non-MTE workloads, and until we
get real MTE-enabled hardware it's hard to know whether something more
sophisticated is needed. Peter Collingbourne's patch[3] to clear the
tags at page allocation time should hide more of the impact for non-VM
cases. So the remaining concern is around VM startup, which could be
effectively serialised through the lock.

Changes since v14[2]:

* Dropped "Handle MTE tags zeroing" patch in favour of Peter's similar
patch[3] (now in arm64 tree).

* Improved documentation following Catalin's review.

[1]: https://lore.kernel.org/r/874ke7z3ng.wl-maz%40kernel.org
[2]: https://lore.kernel.org/r/[email protected]/
[3]: https://lore.kernel.org/r/[email protected]/

Steven Price (7):
arm64: mte: Handle race when synchronising tags
arm64: mte: Sync tags for pages where PTE is untagged
KVM: arm64: Introduce MTE VM feature
KVM: arm64: Save/restore MTE registers
KVM: arm64: Expose KVM_CAP_ARM_MTE
KVM: arm64: ioctl to fetch/store tags in a guest
KVM: arm64: Document MTE capability and ioctl

Documentation/virt/kvm/api.rst | 57 +++++++++++++++
arch/arm64/include/asm/kvm_arm.h | 3 +-
arch/arm64/include/asm/kvm_emulate.h | 3 +
arch/arm64/include/asm/kvm_host.h | 12 ++++
arch/arm64/include/asm/kvm_mte.h | 66 +++++++++++++++++
arch/arm64/include/asm/mte-def.h | 1 +
arch/arm64/include/asm/mte.h | 8 ++-
arch/arm64/include/asm/pgtable.h | 22 +++++-
arch/arm64/include/asm/sysreg.h | 3 +-
arch/arm64/include/uapi/asm/kvm.h | 11 +++
arch/arm64/kernel/asm-offsets.c | 2 +
arch/arm64/kernel/mte.c | 54 ++++++++++++--
arch/arm64/kvm/arm.c | 16 +++++
arch/arm64/kvm/guest.c | 82 ++++++++++++++++++++++
arch/arm64/kvm/hyp/entry.S | 7 ++
arch/arm64/kvm/hyp/exception.c | 3 +-
arch/arm64/kvm/hyp/include/hyp/sysreg-sr.h | 21 ++++++
arch/arm64/kvm/mmu.c | 42 ++++++++++-
arch/arm64/kvm/reset.c | 3 +-
arch/arm64/kvm/sys_regs.c | 32 +++++++--
include/uapi/linux/kvm.h | 2 +
21 files changed, 429 insertions(+), 21 deletions(-)
create mode 100644 arch/arm64/include/asm/kvm_mte.h

--
2.20.1


2021-06-14 09:07:05

by Steven Price

Subject: [PATCH v15 1/7] arm64: mte: Handle race when synchronising tags

mte_sync_tags() used test_and_set_bit() to set the PG_mte_tagged flag
before restoring/zeroing the MTE tags. However, if another thread raced
and attempted to sync the tags on the same page before the first thread
had completed the restore/zeroing, it would see the flag already set and
continue without waiting. This could expose the previous contents of the
tags to user space, and any updates that user space made before the
restore/zeroing completed could be lost.

Since this code is run from atomic contexts, we can't simply lock the
page during the process. Instead, implement a new (global) spinlock to
protect the mte_sync_page_tags() function.
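
The resulting pattern is the usual check/lock/re-check idiom. As a
simplified sketch of what the patch below does (not the exact kernel
code; the real function also handles restoring tags from swap):

	static DEFINE_SPINLOCK(tag_sync_lock);

	static void sync_page_tags_once(struct page *page)
	{
		unsigned long flags;

		spin_lock_irqsave(&tag_sync_lock, flags);

		/* Re-check under the lock: another CPU may have won the race */
		if (!test_bit(PG_mte_tagged, &page->flags)) {
			mte_clear_page_tags(page_address(page));
			/* Publish the flag only once the tags are initialised */
			set_bit(PG_mte_tagged, &page->flags);
		}

		spin_unlock_irqrestore(&tag_sync_lock, flags);
	}

The caller still performs an unlocked test_bit() first, so the lock is
only taken for pages whose tags have not yet been initialised; a racing
thread either takes the lock and waits, or sees the flag set and knows
the tags are complete.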

Fixes: 34bfeea4a9e9 ("arm64: mte: Clear the tags when a page is mapped in user-space with PROT_MTE")
Reviewed-by: Catalin Marinas <[email protected]>
Signed-off-by: Steven Price <[email protected]>
---
arch/arm64/kernel/mte.c | 20 +++++++++++++++++---
1 file changed, 17 insertions(+), 3 deletions(-)

diff --git a/arch/arm64/kernel/mte.c b/arch/arm64/kernel/mte.c
index 125a10e413e9..a3583a7fd400 100644
--- a/arch/arm64/kernel/mte.c
+++ b/arch/arm64/kernel/mte.c
@@ -25,6 +25,7 @@
u64 gcr_kernel_excl __ro_after_init;

static bool report_fault_once = true;
+static DEFINE_SPINLOCK(tag_sync_lock);

#ifdef CONFIG_KASAN_HW_TAGS
/* Whether the MTE asynchronous mode is enabled. */
@@ -34,13 +35,22 @@ EXPORT_SYMBOL_GPL(mte_async_mode);

static void mte_sync_page_tags(struct page *page, pte_t *ptep, bool check_swap)
{
+ unsigned long flags;
pte_t old_pte = READ_ONCE(*ptep);

+ spin_lock_irqsave(&tag_sync_lock, flags);
+
+ /* Recheck with the lock held */
+ if (test_bit(PG_mte_tagged, &page->flags))
+ goto out;
+
if (check_swap && is_swap_pte(old_pte)) {
swp_entry_t entry = pte_to_swp_entry(old_pte);

- if (!non_swap_entry(entry) && mte_restore_tags(entry, page))
- return;
+ if (!non_swap_entry(entry) && mte_restore_tags(entry, page)) {
+ set_bit(PG_mte_tagged, &page->flags);
+ goto out;
+ }
}

page_kasan_tag_reset(page);
@@ -53,6 +63,10 @@ static void mte_sync_page_tags(struct page *page, pte_t *ptep, bool check_swap)
*/
smp_wmb();
mte_clear_page_tags(page_address(page));
+ set_bit(PG_mte_tagged, &page->flags);
+
+out:
+ spin_unlock_irqrestore(&tag_sync_lock, flags);
}

void mte_sync_tags(pte_t *ptep, pte_t pte)
@@ -63,7 +77,7 @@ void mte_sync_tags(pte_t *ptep, pte_t pte)

/* if PG_mte_tagged is set, tags have already been initialised */
for (i = 0; i < nr_pages; i++, page++) {
- if (!test_and_set_bit(PG_mte_tagged, &page->flags))
+ if (!test_bit(PG_mte_tagged, &page->flags))
mte_sync_page_tags(page, ptep, check_swap);
}
}
--
2.20.1

2021-06-14 09:07:13

by Steven Price

Subject: [PATCH v15 2/7] arm64: mte: Sync tags for pages where PTE is untagged

A KVM guest could store tags in a page even if the VMM hasn't mapped
the page with PROT_MTE. So when restoring pages from swap we need to
check for any saved tags even if !pte_tagged().

However, don't check pages for which pte_access_permitted() returns
false, as these will not have been swapped out.

Reviewed-by: Catalin Marinas <[email protected]>
Signed-off-by: Steven Price <[email protected]>
---
arch/arm64/include/asm/mte.h | 4 ++--
arch/arm64/include/asm/pgtable.h | 22 +++++++++++++++++++---
arch/arm64/kernel/mte.c | 17 +++++++++++++----
3 files changed, 34 insertions(+), 9 deletions(-)

diff --git a/arch/arm64/include/asm/mte.h b/arch/arm64/include/asm/mte.h
index bc88a1ced0d7..347ef38a35f7 100644
--- a/arch/arm64/include/asm/mte.h
+++ b/arch/arm64/include/asm/mte.h
@@ -37,7 +37,7 @@ void mte_free_tag_storage(char *storage);
/* track which pages have valid allocation tags */
#define PG_mte_tagged PG_arch_2

-void mte_sync_tags(pte_t *ptep, pte_t pte);
+void mte_sync_tags(pte_t old_pte, pte_t pte);
void mte_copy_page_tags(void *kto, const void *kfrom);
void mte_thread_init_user(void);
void mte_thread_switch(struct task_struct *next);
@@ -53,7 +53,7 @@ int mte_ptrace_copy_tags(struct task_struct *child, long request,
/* unused if !CONFIG_ARM64_MTE, silence the compiler */
#define PG_mte_tagged 0

-static inline void mte_sync_tags(pte_t *ptep, pte_t pte)
+static inline void mte_sync_tags(pte_t old_pte, pte_t pte)
{
}
static inline void mte_copy_page_tags(void *kto, const void *kfrom)
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 0b10204e72fc..db5402168841 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -314,9 +314,25 @@ static inline void set_pte_at(struct mm_struct *mm, unsigned long addr,
if (pte_present(pte) && pte_user_exec(pte) && !pte_special(pte))
__sync_icache_dcache(pte);

- if (system_supports_mte() &&
- pte_present(pte) && pte_tagged(pte) && !pte_special(pte))
- mte_sync_tags(ptep, pte);
+ /*
+ * If the PTE would provide user space access to the tags associated
+ * with it then ensure that the MTE tags are synchronised. Although
+ * pte_access_permitted() returns false for exec only mappings, they
+ * don't expose tags (instruction fetches don't check tags).
+ */
+ if (system_supports_mte() && pte_access_permitted(pte, false) &&
+ !pte_special(pte)) {
+ pte_t old_pte = READ_ONCE(*ptep);
+ /*
+ * We only need to synchronise if the new PTE has tags enabled
+ * or if swapping in (in which case another mapping may have
+ * set tags in the past even if this PTE isn't tagged).
+ * (!pte_none() && !pte_present()) is an open coded version of
+ * is_swap_pte()
+ */
+ if (pte_tagged(pte) || (!pte_none(old_pte) && !pte_present(old_pte)))
+ mte_sync_tags(old_pte, pte);
+ }

__check_racy_pte_update(mm, ptep, pte);

diff --git a/arch/arm64/kernel/mte.c b/arch/arm64/kernel/mte.c
index a3583a7fd400..ae0a3c68fece 100644
--- a/arch/arm64/kernel/mte.c
+++ b/arch/arm64/kernel/mte.c
@@ -33,10 +33,10 @@ DEFINE_STATIC_KEY_FALSE(mte_async_mode);
EXPORT_SYMBOL_GPL(mte_async_mode);
#endif

-static void mte_sync_page_tags(struct page *page, pte_t *ptep, bool check_swap)
+static void mte_sync_page_tags(struct page *page, pte_t old_pte,
+ bool check_swap, bool pte_is_tagged)
{
unsigned long flags;
- pte_t old_pte = READ_ONCE(*ptep);

spin_lock_irqsave(&tag_sync_lock, flags);

@@ -53,6 +53,9 @@ static void mte_sync_page_tags(struct page *page, pte_t *ptep, bool check_swap)
}
}

+ if (!pte_is_tagged)
+ goto out;
+
page_kasan_tag_reset(page);
/*
* We need smp_wmb() in between setting the flags and clearing the
@@ -69,16 +72,22 @@ static void mte_sync_page_tags(struct page *page, pte_t *ptep, bool check_swap)
spin_unlock_irqrestore(&tag_sync_lock, flags);
}

-void mte_sync_tags(pte_t *ptep, pte_t pte)
+void mte_sync_tags(pte_t old_pte, pte_t pte)
{
struct page *page = pte_page(pte);
long i, nr_pages = compound_nr(page);
bool check_swap = nr_pages == 1;
+ bool pte_is_tagged = pte_tagged(pte);
+
+ /* Early out if there's nothing to do */
+ if (!check_swap && !pte_is_tagged)
+ return;

/* if PG_mte_tagged is set, tags have already been initialised */
for (i = 0; i < nr_pages; i++, page++) {
if (!test_bit(PG_mte_tagged, &page->flags))
- mte_sync_page_tags(page, ptep, check_swap);
+ mte_sync_page_tags(page, old_pte, check_swap,
+ pte_is_tagged);
}
}

--
2.20.1

2021-06-14 09:07:38

by Steven Price

Subject: [PATCH v15 5/7] KVM: arm64: Expose KVM_CAP_ARM_MTE

It's now safe for the VMM to enable MTE in a guest, so expose the
capability to user space.
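
For reference, from the VMM side the capability is enabled with the
generic KVM_ENABLE_CAP vm ioctl, and (as the code below enforces) this
must happen before any vCPUs are created. A minimal sketch, assuming an
existing VM file descriptor 'vm_fd' and with error handling elided:

	#include <sys/ioctl.h>
	#include <linux/kvm.h>

	static int enable_mte(int vm_fd)
	{
		struct kvm_enable_cap cap = {
			.cap = KVM_CAP_ARM_MTE,
		};

		/* KVM_CHECK_EXTENSION returns 0 if MTE is not supported */
		if (ioctl(vm_fd, KVM_CHECK_EXTENSION, KVM_CAP_ARM_MTE) <= 0)
			return -1;

		/* Must precede the first KVM_CREATE_VCPU, or KVM returns -EINVAL */
		return ioctl(vm_fd, KVM_ENABLE_CAP, &cap);
	}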

Reviewed-by: Catalin Marinas <[email protected]>
Signed-off-by: Steven Price <[email protected]>
---
arch/arm64/kvm/arm.c | 9 +++++++++
arch/arm64/kvm/reset.c | 3 ++-
arch/arm64/kvm/sys_regs.c | 3 +++
3 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 1cb39c0803a4..e89a5e275e25 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -93,6 +93,12 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
r = 0;
kvm->arch.return_nisv_io_abort_to_user = true;
break;
+ case KVM_CAP_ARM_MTE:
+ if (!system_supports_mte() || kvm->created_vcpus)
+ return -EINVAL;
+ r = 0;
+ kvm->arch.mte_enabled = true;
+ break;
default:
r = -EINVAL;
break;
@@ -237,6 +243,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
*/
r = 1;
break;
+ case KVM_CAP_ARM_MTE:
+ r = system_supports_mte();
+ break;
case KVM_CAP_STEAL_TIME:
r = kvm_arm_pvtime_supported();
break;
diff --git a/arch/arm64/kvm/reset.c b/arch/arm64/kvm/reset.c
index 956cdc240148..50635eacfa43 100644
--- a/arch/arm64/kvm/reset.c
+++ b/arch/arm64/kvm/reset.c
@@ -220,7 +220,8 @@ int kvm_reset_vcpu(struct kvm_vcpu *vcpu)
switch (vcpu->arch.target) {
default:
if (test_bit(KVM_ARM_VCPU_EL1_32BIT, vcpu->arch.features)) {
- if (!cpus_have_const_cap(ARM64_HAS_32BIT_EL1)) {
+ if (!cpus_have_const_cap(ARM64_HAS_32BIT_EL1) ||
+ vcpu->kvm->arch.mte_enabled) {
ret = -EINVAL;
goto out;
}
diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
index 440315a556c2..d4e1c1b1a08d 100644
--- a/arch/arm64/kvm/sys_regs.c
+++ b/arch/arm64/kvm/sys_regs.c
@@ -1312,6 +1312,9 @@ static bool access_ccsidr(struct kvm_vcpu *vcpu, struct sys_reg_params *p,
static unsigned int mte_visibility(const struct kvm_vcpu *vcpu,
const struct sys_reg_desc *rd)
{
+ if (kvm_has_mte(vcpu->kvm))
+ return 0;
+
return REG_HIDDEN;
}

--
2.20.1

2021-06-14 09:07:40

by Steven Price

Subject: [PATCH v15 3/7] KVM: arm64: Introduce MTE VM feature

Add a new VM feature 'KVM_CAP_ARM_MTE' which enables memory tagging
for a VM. This will expose the feature to the guest and automatically
tag memory pages touched by the VM as PG_mte_tagged (and clear the tag
storage) to ensure that the guest cannot see stale tags, and so that
the tags are correctly saved/restored across swap.

Actually exposing the new capability to user space happens in a later
patch.

Reviewed-by: Catalin Marinas <[email protected]>
Signed-off-by: Steven Price <[email protected]>
---
arch/arm64/include/asm/kvm_emulate.h | 3 ++
arch/arm64/include/asm/kvm_host.h | 3 ++
arch/arm64/include/asm/mte.h | 4 +++
arch/arm64/kernel/mte.c | 17 +++++++++++
arch/arm64/kvm/hyp/exception.c | 3 +-
arch/arm64/kvm/mmu.c | 42 +++++++++++++++++++++++++++-
arch/arm64/kvm/sys_regs.c | 7 +++++
include/uapi/linux/kvm.h | 1 +
8 files changed, 78 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_emulate.h b/arch/arm64/include/asm/kvm_emulate.h
index f612c090f2e4..6bf776c2399c 100644
--- a/arch/arm64/include/asm/kvm_emulate.h
+++ b/arch/arm64/include/asm/kvm_emulate.h
@@ -84,6 +84,9 @@ static inline void vcpu_reset_hcr(struct kvm_vcpu *vcpu)
if (cpus_have_const_cap(ARM64_MISMATCHED_CACHE_TYPE) ||
vcpu_el1_is_32bit(vcpu))
vcpu->arch.hcr_el2 |= HCR_TID2;
+
+ if (kvm_has_mte(vcpu->kvm))
+ vcpu->arch.hcr_el2 |= HCR_ATA;
}

static inline unsigned long *vcpu_hcr(struct kvm_vcpu *vcpu)
diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 7cd7d5c8c4bc..afaa5333f0e4 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -132,6 +132,8 @@ struct kvm_arch {

u8 pfr0_csv2;
u8 pfr0_csv3;
+ /* Memory Tagging Extension enabled for the guest */
+ bool mte_enabled;
};

struct kvm_vcpu_fault_info {
@@ -769,6 +771,7 @@ bool kvm_arm_vcpu_is_finalized(struct kvm_vcpu *vcpu);
#define kvm_arm_vcpu_sve_finalized(vcpu) \
((vcpu)->arch.flags & KVM_ARM64_VCPU_SVE_FINALIZED)

+#define kvm_has_mte(kvm) (system_supports_mte() && (kvm)->arch.mte_enabled)
#define kvm_vcpu_has_pmu(vcpu) \
(test_bit(KVM_ARM_VCPU_PMU_V3, (vcpu)->arch.features))

diff --git a/arch/arm64/include/asm/mte.h b/arch/arm64/include/asm/mte.h
index 347ef38a35f7..be1de541a11c 100644
--- a/arch/arm64/include/asm/mte.h
+++ b/arch/arm64/include/asm/mte.h
@@ -37,6 +37,7 @@ void mte_free_tag_storage(char *storage);
/* track which pages have valid allocation tags */
#define PG_mte_tagged PG_arch_2

+void mte_prepare_page_tags(struct page *page);
void mte_sync_tags(pte_t old_pte, pte_t pte);
void mte_copy_page_tags(void *kto, const void *kfrom);
void mte_thread_init_user(void);
@@ -53,6 +54,9 @@ int mte_ptrace_copy_tags(struct task_struct *child, long request,
/* unused if !CONFIG_ARM64_MTE, silence the compiler */
#define PG_mte_tagged 0

+static inline void mte_prepare_page_tags(struct page *page)
+{
+}
static inline void mte_sync_tags(pte_t old_pte, pte_t pte)
{
}
diff --git a/arch/arm64/kernel/mte.c b/arch/arm64/kernel/mte.c
index ae0a3c68fece..b120f82a2258 100644
--- a/arch/arm64/kernel/mte.c
+++ b/arch/arm64/kernel/mte.c
@@ -72,6 +72,23 @@ static void mte_sync_page_tags(struct page *page, pte_t old_pte,
spin_unlock_irqrestore(&tag_sync_lock, flags);
}

+void mte_prepare_page_tags(struct page *page)
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(&tag_sync_lock, flags);
+
+ /* Recheck with the lock held */
+ if (test_bit(PG_mte_tagged, &page->flags))
+ goto out;
+
+ mte_clear_page_tags(page_address(page));
+ set_bit(PG_mte_tagged, &page->flags);
+
+out:
+ spin_unlock_irqrestore(&tag_sync_lock, flags);
+}
+
void mte_sync_tags(pte_t old_pte, pte_t pte)
{
struct page *page = pte_page(pte);
diff --git a/arch/arm64/kvm/hyp/exception.c b/arch/arm64/kvm/hyp/exception.c
index 73629094f903..56426565600c 100644
--- a/arch/arm64/kvm/hyp/exception.c
+++ b/arch/arm64/kvm/hyp/exception.c
@@ -112,7 +112,8 @@ static void enter_exception64(struct kvm_vcpu *vcpu, unsigned long target_mode,
new |= (old & PSR_C_BIT);
new |= (old & PSR_V_BIT);

- // TODO: TCO (if/when ARMv8.5-MemTag is exposed to guests)
+ if (kvm_has_mte(vcpu->kvm))
+ new |= PSR_TCO_BIT;

new |= (old & PSR_DIT_BIT);

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index c5d1f3c87dbd..ed7c624e7362 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -822,6 +822,36 @@ transparent_hugepage_adjust(struct kvm_memory_slot *memslot,
return PAGE_SIZE;
}

+static int sanitise_mte_tags(struct kvm *kvm, kvm_pfn_t pfn,
+ unsigned long size)
+{
+ unsigned long i, nr_pages = size >> PAGE_SHIFT;
+ struct page *page;
+
+ if (!kvm_has_mte(kvm))
+ return 0;
+
+ /*
+ * The page will be mapped in stage 2 as Normal Cacheable, so
+ * the VM will be able to see the page's tags and therefore
+ * they must be initialised first. If PG_mte_tagged is set,
+ * tags have already been initialised.
+ * pfn_to_online_page() is used to reject ZONE_DEVICE pages
+ * that may not support tags.
+ */
+ page = pfn_to_online_page(pfn);
+
+ if (!page)
+ return -EFAULT;
+
+ for (i = 0; i < nr_pages; i++, page++) {
+ if (!test_bit(PG_mte_tagged, &page->flags))
+ mte_prepare_page_tags(page);
+ }
+
+ return 0;
+}
+
static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
struct kvm_memory_slot *memslot, unsigned long hva,
unsigned long fault_status)
@@ -971,8 +1001,13 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
if (writable)
prot |= KVM_PGTABLE_PROT_W;

- if (fault_status != FSC_PERM && !device)
+ if (fault_status != FSC_PERM && !device) {
+ ret = sanitise_mte_tags(kvm, pfn, vma_pagesize);
+ if (ret)
+ goto out_unlock;
+
clean_dcache_guest_page(pfn, vma_pagesize);
+ }

if (exec_fault) {
prot |= KVM_PGTABLE_PROT_X;
@@ -1168,12 +1203,17 @@ bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
{
kvm_pfn_t pfn = pte_pfn(range->pte);
+ int ret;

if (!kvm->arch.mmu.pgt)
return 0;

WARN_ON(range->end - range->start != 1);

+ ret = sanitise_mte_tags(kvm, pfn, PAGE_SIZE);
+ if (ret)
+ return false;
+
/*
* We've moved a page around, probably through CoW, so let's treat it
* just like a translation fault and clean the cache to the PoC.
diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
index 76ea2800c33e..4a98902eaf1a 100644
--- a/arch/arm64/kvm/sys_regs.c
+++ b/arch/arm64/kvm/sys_regs.c
@@ -1047,6 +1047,13 @@ static u64 read_id_reg(const struct kvm_vcpu *vcpu,
break;
case SYS_ID_AA64PFR1_EL1:
val &= ~FEATURE(ID_AA64PFR1_MTE);
+ if (kvm_has_mte(vcpu->kvm)) {
+ u64 pfr, mte;
+
+ pfr = read_sanitised_ftr_reg(SYS_ID_AA64PFR1_EL1);
+ mte = cpuid_feature_extract_unsigned_field(pfr, ID_AA64PFR1_MTE_SHIFT);
+ val |= FIELD_PREP(FEATURE(ID_AA64PFR1_MTE), mte);
+ }
break;
case SYS_ID_AA64ISAR1_EL1:
if (!vcpu_has_ptrauth(vcpu))
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 3fd9a7e9d90c..8c95ba0fadda 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1082,6 +1082,7 @@ struct kvm_ppc_resize_hpt {
#define KVM_CAP_SGX_ATTRIBUTE 196
#define KVM_CAP_VM_COPY_ENC_CONTEXT_FROM 197
#define KVM_CAP_PTP_KVM 198
+#define KVM_CAP_ARM_MTE 199

#ifdef KVM_CAP_IRQ_ROUTING

--
2.20.1

2021-06-14 09:07:48

by Steven Price

Subject: [PATCH v15 4/7] KVM: arm64: Save/restore MTE registers

Define the new system registers that MTE introduces and context switch
them. The MTE feature is still hidden from the ID register as it isn't
supported in a VM yet.
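
In C terms, the assembly macros added below amount to roughly the
following on guest entry (a sketch of the logic only, with illustrative
field names; the real code is the kvm_mte.h assembly, which also skips
everything when the CPU lacks MTE or HCR_EL2.ATA is clear for the
guest):

	/* save the host's tag-generation state ... */
	host_ctxt->rgsr_el1 = read_sysreg_s(SYS_RGSR_EL1);
	host_ctxt->gcr_el1  = read_sysreg_s(SYS_GCR_EL1);
	/* ... and install the guest's */
	write_sysreg_s(guest_ctxt->rgsr_el1, SYS_RGSR_EL1);
	write_sysreg_s(guest_ctxt->gcr_el1, SYS_GCR_EL1);

with the mirror image (plus an ISB) on the return to the host. TFSR_EL1
and TFSRE0_EL1 are switched in the C sysreg save/restore path instead,
as the sysreg-sr.h hunks show.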

Reviewed-by: Catalin Marinas <[email protected]>
Signed-off-by: Steven Price <[email protected]>
---
arch/arm64/include/asm/kvm_arm.h | 3 +-
arch/arm64/include/asm/kvm_host.h | 6 ++
arch/arm64/include/asm/kvm_mte.h | 66 ++++++++++++++++++++++
arch/arm64/include/asm/sysreg.h | 3 +-
arch/arm64/kernel/asm-offsets.c | 2 +
arch/arm64/kvm/hyp/entry.S | 7 +++
arch/arm64/kvm/hyp/include/hyp/sysreg-sr.h | 21 +++++++
arch/arm64/kvm/sys_regs.c | 22 ++++++--
8 files changed, 124 insertions(+), 6 deletions(-)
create mode 100644 arch/arm64/include/asm/kvm_mte.h

diff --git a/arch/arm64/include/asm/kvm_arm.h b/arch/arm64/include/asm/kvm_arm.h
index 692c9049befa..d436831dd706 100644
--- a/arch/arm64/include/asm/kvm_arm.h
+++ b/arch/arm64/include/asm/kvm_arm.h
@@ -12,7 +12,8 @@
#include <asm/types.h>

/* Hyp Configuration Register (HCR) bits */
-#define HCR_ATA (UL(1) << 56)
+#define HCR_ATA_SHIFT 56
+#define HCR_ATA (UL(1) << HCR_ATA_SHIFT)
#define HCR_FWB (UL(1) << 46)
#define HCR_API (UL(1) << 41)
#define HCR_APK (UL(1) << 40)
diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index afaa5333f0e4..309e36cc1b42 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -208,6 +208,12 @@ enum vcpu_sysreg {
CNTP_CVAL_EL0,
CNTP_CTL_EL0,

+ /* Memory Tagging Extension registers */
+ RGSR_EL1, /* Random Allocation Tag Seed Register */
+ GCR_EL1, /* Tag Control Register */
+ TFSR_EL1, /* Tag Fault Status Register (EL1) */
+ TFSRE0_EL1, /* Tag Fault Status Register (EL0) */
+
/* 32bit specific registers. Keep them at the end of the range */
DACR32_EL2, /* Domain Access Control Register */
IFSR32_EL2, /* Instruction Fault Status Register */
diff --git a/arch/arm64/include/asm/kvm_mte.h b/arch/arm64/include/asm/kvm_mte.h
new file mode 100644
index 000000000000..88dd1199670b
--- /dev/null
+++ b/arch/arm64/include/asm/kvm_mte.h
@@ -0,0 +1,66 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (C) 2020-2021 ARM Ltd.
+ */
+#ifndef __ASM_KVM_MTE_H
+#define __ASM_KVM_MTE_H
+
+#ifdef __ASSEMBLY__
+
+#include <asm/sysreg.h>
+
+#ifdef CONFIG_ARM64_MTE
+
+.macro mte_switch_to_guest g_ctxt, h_ctxt, reg1
+alternative_if_not ARM64_MTE
+ b .L__skip_switch\@
+alternative_else_nop_endif
+ mrs \reg1, hcr_el2
+ tbz \reg1, #(HCR_ATA_SHIFT), .L__skip_switch\@
+
+ mrs_s \reg1, SYS_RGSR_EL1
+ str \reg1, [\h_ctxt, #CPU_RGSR_EL1]
+ mrs_s \reg1, SYS_GCR_EL1
+ str \reg1, [\h_ctxt, #CPU_GCR_EL1]
+
+ ldr \reg1, [\g_ctxt, #CPU_RGSR_EL1]
+ msr_s SYS_RGSR_EL1, \reg1
+ ldr \reg1, [\g_ctxt, #CPU_GCR_EL1]
+ msr_s SYS_GCR_EL1, \reg1
+
+.L__skip_switch\@:
+.endm
+
+.macro mte_switch_to_hyp g_ctxt, h_ctxt, reg1
+alternative_if_not ARM64_MTE
+ b .L__skip_switch\@
+alternative_else_nop_endif
+ mrs \reg1, hcr_el2
+ tbz \reg1, #(HCR_ATA_SHIFT), .L__skip_switch\@
+
+ mrs_s \reg1, SYS_RGSR_EL1
+ str \reg1, [\g_ctxt, #CPU_RGSR_EL1]
+ mrs_s \reg1, SYS_GCR_EL1
+ str \reg1, [\g_ctxt, #CPU_GCR_EL1]
+
+ ldr \reg1, [\h_ctxt, #CPU_RGSR_EL1]
+ msr_s SYS_RGSR_EL1, \reg1
+ ldr \reg1, [\h_ctxt, #CPU_GCR_EL1]
+ msr_s SYS_GCR_EL1, \reg1
+
+ isb
+
+.L__skip_switch\@:
+.endm
+
+#else /* CONFIG_ARM64_MTE */
+
+.macro mte_switch_to_guest g_ctxt, h_ctxt, reg1
+.endm
+
+.macro mte_switch_to_hyp g_ctxt, h_ctxt, reg1
+.endm
+
+#endif /* CONFIG_ARM64_MTE */
+#endif /* __ASSEMBLY__ */
+#endif /* __ASM_KVM_MTE_H */
diff --git a/arch/arm64/include/asm/sysreg.h b/arch/arm64/include/asm/sysreg.h
index 65d15700a168..347ccac2341e 100644
--- a/arch/arm64/include/asm/sysreg.h
+++ b/arch/arm64/include/asm/sysreg.h
@@ -651,7 +651,8 @@

#define INIT_SCTLR_EL2_MMU_ON \
(SCTLR_ELx_M | SCTLR_ELx_C | SCTLR_ELx_SA | SCTLR_ELx_I | \
- SCTLR_ELx_IESB | SCTLR_ELx_WXN | ENDIAN_SET_EL2 | SCTLR_EL2_RES1)
+ SCTLR_ELx_IESB | SCTLR_ELx_WXN | ENDIAN_SET_EL2 | \
+ SCTLR_ELx_ITFSB | SCTLR_EL2_RES1)

#define INIT_SCTLR_EL2_MMU_OFF \
(SCTLR_EL2_RES1 | ENDIAN_SET_EL2)
diff --git a/arch/arm64/kernel/asm-offsets.c b/arch/arm64/kernel/asm-offsets.c
index 0cb34ccb6e73..6f0044cb233e 100644
--- a/arch/arm64/kernel/asm-offsets.c
+++ b/arch/arm64/kernel/asm-offsets.c
@@ -111,6 +111,8 @@ int main(void)
DEFINE(VCPU_WORKAROUND_FLAGS, offsetof(struct kvm_vcpu, arch.workaround_flags));
DEFINE(VCPU_HCR_EL2, offsetof(struct kvm_vcpu, arch.hcr_el2));
DEFINE(CPU_USER_PT_REGS, offsetof(struct kvm_cpu_context, regs));
+ DEFINE(CPU_RGSR_EL1, offsetof(struct kvm_cpu_context, sys_regs[RGSR_EL1]));
+ DEFINE(CPU_GCR_EL1, offsetof(struct kvm_cpu_context, sys_regs[GCR_EL1]));
DEFINE(CPU_APIAKEYLO_EL1, offsetof(struct kvm_cpu_context, sys_regs[APIAKEYLO_EL1]));
DEFINE(CPU_APIBKEYLO_EL1, offsetof(struct kvm_cpu_context, sys_regs[APIBKEYLO_EL1]));
DEFINE(CPU_APDAKEYLO_EL1, offsetof(struct kvm_cpu_context, sys_regs[APDAKEYLO_EL1]));
diff --git a/arch/arm64/kvm/hyp/entry.S b/arch/arm64/kvm/hyp/entry.S
index e831d3dfd50d..435346ea1504 100644
--- a/arch/arm64/kvm/hyp/entry.S
+++ b/arch/arm64/kvm/hyp/entry.S
@@ -13,6 +13,7 @@
#include <asm/kvm_arm.h>
#include <asm/kvm_asm.h>
#include <asm/kvm_mmu.h>
+#include <asm/kvm_mte.h>
#include <asm/kvm_ptrauth.h>

.text
@@ -51,6 +52,9 @@ alternative_else_nop_endif

add x29, x0, #VCPU_CONTEXT

+ // mte_switch_to_guest(g_ctxt, h_ctxt, tmp1)
+ mte_switch_to_guest x29, x1, x2
+
// Macro ptrauth_switch_to_guest format:
// ptrauth_switch_to_guest(guest cxt, tmp1, tmp2, tmp3)
// The below macro to restore guest keys is not implemented in C code
@@ -142,6 +146,9 @@ SYM_INNER_LABEL(__guest_exit, SYM_L_GLOBAL)
// when this feature is enabled for kernel code.
ptrauth_switch_to_hyp x1, x2, x3, x4, x5

+ // mte_switch_to_hyp(g_ctxt, h_ctxt, reg1)
+ mte_switch_to_hyp x1, x2, x3
+
// Restore hyp's sp_el0
restore_sp_el0 x2, x3

diff --git a/arch/arm64/kvm/hyp/include/hyp/sysreg-sr.h b/arch/arm64/kvm/hyp/include/hyp/sysreg-sr.h
index cce43bfe158f..de7e14c862e6 100644
--- a/arch/arm64/kvm/hyp/include/hyp/sysreg-sr.h
+++ b/arch/arm64/kvm/hyp/include/hyp/sysreg-sr.h
@@ -14,6 +14,7 @@
#include <asm/kvm_asm.h>
#include <asm/kvm_emulate.h>
#include <asm/kvm_hyp.h>
+#include <asm/kvm_mmu.h>

static inline void __sysreg_save_common_state(struct kvm_cpu_context *ctxt)
{
@@ -26,6 +27,16 @@ static inline void __sysreg_save_user_state(struct kvm_cpu_context *ctxt)
ctxt_sys_reg(ctxt, TPIDRRO_EL0) = read_sysreg(tpidrro_el0);
}

+static inline bool ctxt_has_mte(struct kvm_cpu_context *ctxt)
+{
+ struct kvm_vcpu *vcpu = ctxt->__hyp_running_vcpu;
+
+ if (!vcpu)
+ vcpu = container_of(ctxt, struct kvm_vcpu, arch.ctxt);
+
+ return kvm_has_mte(kern_hyp_va(vcpu->kvm));
+}
+
static inline void __sysreg_save_el1_state(struct kvm_cpu_context *ctxt)
{
ctxt_sys_reg(ctxt, CSSELR_EL1) = read_sysreg(csselr_el1);
@@ -46,6 +57,11 @@ static inline void __sysreg_save_el1_state(struct kvm_cpu_context *ctxt)
ctxt_sys_reg(ctxt, PAR_EL1) = read_sysreg_par();
ctxt_sys_reg(ctxt, TPIDR_EL1) = read_sysreg(tpidr_el1);

+ if (ctxt_has_mte(ctxt)) {
+ ctxt_sys_reg(ctxt, TFSR_EL1) = read_sysreg_el1(SYS_TFSR);
+ ctxt_sys_reg(ctxt, TFSRE0_EL1) = read_sysreg_s(SYS_TFSRE0_EL1);
+ }
+
ctxt_sys_reg(ctxt, SP_EL1) = read_sysreg(sp_el1);
ctxt_sys_reg(ctxt, ELR_EL1) = read_sysreg_el1(SYS_ELR);
ctxt_sys_reg(ctxt, SPSR_EL1) = read_sysreg_el1(SYS_SPSR);
@@ -107,6 +123,11 @@ static inline void __sysreg_restore_el1_state(struct kvm_cpu_context *ctxt)
write_sysreg(ctxt_sys_reg(ctxt, PAR_EL1), par_el1);
write_sysreg(ctxt_sys_reg(ctxt, TPIDR_EL1), tpidr_el1);

+ if (ctxt_has_mte(ctxt)) {
+ write_sysreg_el1(ctxt_sys_reg(ctxt, TFSR_EL1), SYS_TFSR);
+ write_sysreg_s(ctxt_sys_reg(ctxt, TFSRE0_EL1), SYS_TFSRE0_EL1);
+ }
+
if (!has_vhe() &&
cpus_have_final_cap(ARM64_WORKAROUND_SPECULATIVE_AT) &&
ctxt->__hyp_running_vcpu) {
diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
index 4a98902eaf1a..440315a556c2 100644
--- a/arch/arm64/kvm/sys_regs.c
+++ b/arch/arm64/kvm/sys_regs.c
@@ -1309,6 +1309,20 @@ static bool access_ccsidr(struct kvm_vcpu *vcpu, struct sys_reg_params *p,
return true;
}

+static unsigned int mte_visibility(const struct kvm_vcpu *vcpu,
+ const struct sys_reg_desc *rd)
+{
+ return REG_HIDDEN;
+}
+
+#define MTE_REG(name) { \
+ SYS_DESC(SYS_##name), \
+ .access = undef_access, \
+ .reset = reset_unknown, \
+ .reg = name, \
+ .visibility = mte_visibility, \
+}
+
/* sys_reg_desc initialiser for known cpufeature ID registers */
#define ID_SANITISED(name) { \
SYS_DESC(SYS_##name), \
@@ -1477,8 +1491,8 @@ static const struct sys_reg_desc sys_reg_descs[] = {
{ SYS_DESC(SYS_ACTLR_EL1), access_actlr, reset_actlr, ACTLR_EL1 },
{ SYS_DESC(SYS_CPACR_EL1), NULL, reset_val, CPACR_EL1, 0 },

- { SYS_DESC(SYS_RGSR_EL1), undef_access },
- { SYS_DESC(SYS_GCR_EL1), undef_access },
+ MTE_REG(RGSR_EL1),
+ MTE_REG(GCR_EL1),

{ SYS_DESC(SYS_ZCR_EL1), NULL, reset_val, ZCR_EL1, 0, .visibility = sve_visibility },
{ SYS_DESC(SYS_TRFCR_EL1), undef_access },
@@ -1505,8 +1519,8 @@ static const struct sys_reg_desc sys_reg_descs[] = {
{ SYS_DESC(SYS_ERXMISC0_EL1), trap_raz_wi },
{ SYS_DESC(SYS_ERXMISC1_EL1), trap_raz_wi },

- { SYS_DESC(SYS_TFSR_EL1), undef_access },
- { SYS_DESC(SYS_TFSRE0_EL1), undef_access },
+ MTE_REG(TFSR_EL1),
+ MTE_REG(TFSRE0_EL1),

{ SYS_DESC(SYS_FAR_EL1), access_vm_reg, reset_unknown, FAR_EL1 },
{ SYS_DESC(SYS_PAR_EL1), NULL, reset_unknown, PAR_EL1 },
--
2.20.1

2021-06-14 09:08:32

by Steven Price

Subject: [PATCH v15 6/7] KVM: arm64: ioctl to fetch/store tags in a guest

The VMM may not wish to have its own mapping of guest memory mapped
with PROT_MTE because this causes problems if the VMM has tag checking
enabled (the guest controls the tags in physical RAM and it's unlikely
the tags are correct for the VMM).

Instead, add a new ioctl which allows the VMM to easily read/write the
tags from guest memory, allowing the VMM's mapping to be non-PROT_MTE
while the VMM can still read/write the tags for the purpose of
migration.
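
As an illustration of the intended use, a VMM could pull the tags for a
page-aligned range of guest IPA space with something like the sketch
below (assuming a 'vm_fd' for a VM with KVM_CAP_ARM_MTE enabled; not
taken from a real VMM):

	#include <sys/ioctl.h>
	#include <linux/kvm.h>

	/*
	 * 'buf' must hold length / 16 bytes: one tag byte per 16-byte
	 * MTE granule. Returns the number of bytes copied, or < 0 on
	 * error.
	 */
	static long fetch_guest_tags(int vm_fd, __u64 ipa, __u64 length,
				     void *buf)
	{
		struct kvm_arm_copy_mte_tags copy = {
			.guest_ipa = ipa,	/* must be PAGE_SIZE aligned */
			.length    = length,	/* must be PAGE_SIZE aligned */
			.addr      = buf,
			.flags     = KVM_ARM_TAGS_FROM_GUEST,
		};

		return ioctl(vm_fd, KVM_ARM_MTE_COPY_TAGS, &copy);
	}

Writing tags back on the destination side is the same call with
KVM_ARM_TAGS_TO_GUEST.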

Reviewed-by: Catalin Marinas <[email protected]>
Signed-off-by: Steven Price <[email protected]>
---
arch/arm64/include/asm/kvm_host.h | 3 ++
arch/arm64/include/asm/mte-def.h | 1 +
arch/arm64/include/uapi/asm/kvm.h | 11 +++++
arch/arm64/kvm/arm.c | 7 +++
arch/arm64/kvm/guest.c | 82 +++++++++++++++++++++++++++++++
include/uapi/linux/kvm.h | 1 +
6 files changed, 105 insertions(+)

diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 309e36cc1b42..6a2ac4636d42 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -729,6 +729,9 @@ int kvm_arm_vcpu_arch_get_attr(struct kvm_vcpu *vcpu,
int kvm_arm_vcpu_arch_has_attr(struct kvm_vcpu *vcpu,
struct kvm_device_attr *attr);

+long kvm_vm_ioctl_mte_copy_tags(struct kvm *kvm,
+ struct kvm_arm_copy_mte_tags *copy_tags);
+
/* Guest/host FPSIMD coordination helpers */
int kvm_arch_vcpu_run_map_fp(struct kvm_vcpu *vcpu);
void kvm_arch_vcpu_load_fp(struct kvm_vcpu *vcpu);
diff --git a/arch/arm64/include/asm/mte-def.h b/arch/arm64/include/asm/mte-def.h
index cf241b0f0a42..626d359b396e 100644
--- a/arch/arm64/include/asm/mte-def.h
+++ b/arch/arm64/include/asm/mte-def.h
@@ -7,6 +7,7 @@

#define MTE_GRANULE_SIZE UL(16)
#define MTE_GRANULE_MASK (~(MTE_GRANULE_SIZE - 1))
+#define MTE_GRANULES_PER_PAGE (PAGE_SIZE / MTE_GRANULE_SIZE)
#define MTE_TAG_SHIFT 56
#define MTE_TAG_SIZE 4
#define MTE_TAG_MASK GENMASK((MTE_TAG_SHIFT + (MTE_TAG_SIZE - 1)), MTE_TAG_SHIFT)
diff --git a/arch/arm64/include/uapi/asm/kvm.h b/arch/arm64/include/uapi/asm/kvm.h
index 24223adae150..b3edde68bc3e 100644
--- a/arch/arm64/include/uapi/asm/kvm.h
+++ b/arch/arm64/include/uapi/asm/kvm.h
@@ -184,6 +184,17 @@ struct kvm_vcpu_events {
__u32 reserved[12];
};

+struct kvm_arm_copy_mte_tags {
+ __u64 guest_ipa;
+ __u64 length;
+ void __user *addr;
+ __u64 flags;
+ __u64 reserved[2];
+};
+
+#define KVM_ARM_TAGS_TO_GUEST 0
+#define KVM_ARM_TAGS_FROM_GUEST 1
+
/* If you need to interpret the index values, here is the key: */
#define KVM_REG_ARM_COPROC_MASK 0x000000000FFF0000
#define KVM_REG_ARM_COPROC_SHIFT 16
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index e89a5e275e25..baa33359e477 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -1345,6 +1345,13 @@ long kvm_arch_vm_ioctl(struct file *filp,

return 0;
}
+ case KVM_ARM_MTE_COPY_TAGS: {
+ struct kvm_arm_copy_mte_tags copy_tags;
+
+ if (copy_from_user(&copy_tags, argp, sizeof(copy_tags)))
+ return -EFAULT;
+ return kvm_vm_ioctl_mte_copy_tags(kvm, &copy_tags);
+ }
default:
return -EINVAL;
}
diff --git a/arch/arm64/kvm/guest.c b/arch/arm64/kvm/guest.c
index 5cb4a1cd5603..4ddb20017b2f 100644
--- a/arch/arm64/kvm/guest.c
+++ b/arch/arm64/kvm/guest.c
@@ -995,3 +995,85 @@ int kvm_arm_vcpu_arch_has_attr(struct kvm_vcpu *vcpu,

return ret;
}
+
+long kvm_vm_ioctl_mte_copy_tags(struct kvm *kvm,
+ struct kvm_arm_copy_mte_tags *copy_tags)
+{
+ gpa_t guest_ipa = copy_tags->guest_ipa;
+ size_t length = copy_tags->length;
+ void __user *tags = copy_tags->addr;
+ gpa_t gfn;
+ bool write = !(copy_tags->flags & KVM_ARM_TAGS_FROM_GUEST);
+ int ret = 0;
+
+ if (!kvm_has_mte(kvm))
+ return -EINVAL;
+
+ if (copy_tags->reserved[0] || copy_tags->reserved[1])
+ return -EINVAL;
+
+ if (copy_tags->flags & ~KVM_ARM_TAGS_FROM_GUEST)
+ return -EINVAL;
+
+ if (length & ~PAGE_MASK || guest_ipa & ~PAGE_MASK)
+ return -EINVAL;
+
+ gfn = gpa_to_gfn(guest_ipa);
+
+ mutex_lock(&kvm->slots_lock);
+
+ while (length > 0) {
+ kvm_pfn_t pfn = gfn_to_pfn_prot(kvm, gfn, write, NULL);
+ void *maddr;
+ unsigned long num_tags;
+ struct page *page;
+
+ if (is_error_noslot_pfn(pfn)) {
+ ret = -EFAULT;
+ goto out;
+ }
+
+ page = pfn_to_online_page(pfn);
+ if (!page) {
+ /* Reject ZONE_DEVICE memory */
+ ret = -EFAULT;
+ goto out;
+ }
+ maddr = page_address(page);
+
+ if (!write) {
+ if (test_bit(PG_mte_tagged, &page->flags))
+ num_tags = mte_copy_tags_to_user(tags, maddr,
+ MTE_GRANULES_PER_PAGE);
+ else
+ /* No tags in memory, so write zeros */
+ num_tags = MTE_GRANULES_PER_PAGE -
+ clear_user(tags, MTE_GRANULES_PER_PAGE);
+ kvm_release_pfn_clean(pfn);
+ } else {
+ num_tags = mte_copy_tags_from_user(maddr, tags,
+ MTE_GRANULES_PER_PAGE);
+ kvm_release_pfn_dirty(pfn);
+ }
+
+ if (num_tags != MTE_GRANULES_PER_PAGE) {
+ ret = -EFAULT;
+ goto out;
+ }
+
+ /* Set the flag after checking the write completed fully */
+ if (write)
+ set_bit(PG_mte_tagged, &page->flags);
+
+ gfn++;
+ tags += num_tags;
+ length -= PAGE_SIZE;
+ }
+
+out:
+ mutex_unlock(&kvm->slots_lock);
+ /* If some data has been copied report the number of bytes copied */
+ if (length != copy_tags->length)
+ return copy_tags->length - length;
+ return ret;
+}
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 8c95ba0fadda..4c011c60d468 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1428,6 +1428,7 @@ struct kvm_s390_ucas_mapping {
/* Available with KVM_CAP_PMU_EVENT_FILTER */
#define KVM_SET_PMU_EVENT_FILTER _IOW(KVMIO, 0xb2, struct kvm_pmu_event_filter)
#define KVM_PPC_SVM_OFF _IO(KVMIO, 0xb3)
+#define KVM_ARM_MTE_COPY_TAGS _IOR(KVMIO, 0xb4, struct kvm_arm_copy_mte_tags)

/* ioctl for vm fd */
#define KVM_CREATE_DEVICE _IOWR(KVMIO, 0xe0, struct kvm_create_device)
--
2.20.1

2021-06-14 09:10:46

by Steven Price

Subject: [PATCH v15 7/7] KVM: arm64: Document MTE capability and ioctl

A new capability (KVM_CAP_ARM_MTE) identifies that the kernel supports
granting a guest access to the tags, and provides a mechanism for the
VMM to enable it.

A new ioctl (KVM_ARM_MTE_COPY_TAGS) provides a simple way for a VMM to
access the tags of a guest without having to maintain a PROT_MTE mapping
in userspace. The above capability gates access to the ioctl.

Reviewed-by: Catalin Marinas <[email protected]>
Signed-off-by: Steven Price <[email protected]>
---
Documentation/virt/kvm/api.rst | 57 ++++++++++++++++++++++++++++++++++
1 file changed, 57 insertions(+)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 22d077562149..d412928e50cd 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -5034,6 +5034,43 @@ see KVM_XEN_VCPU_SET_ATTR above.
The KVM_XEN_VCPU_ATTR_TYPE_RUNSTATE_ADJUST type may not be used
with the KVM_XEN_VCPU_GET_ATTR ioctl.

+4.130 KVM_ARM_MTE_COPY_TAGS
+---------------------------
+
+:Capability: KVM_CAP_ARM_MTE
+:Architectures: arm64
+:Type: vm ioctl
+:Parameters: struct kvm_arm_copy_mte_tags
+:Returns: number of bytes copied, < 0 on error (-EINVAL for incorrect
+ arguments, -EFAULT if memory cannot be accessed).
+
+::
+
+ struct kvm_arm_copy_mte_tags {
+ __u64 guest_ipa;
+ __u64 length;
+ void __user *addr;
+ __u64 flags;
+ __u64 reserved[2];
+ };
+
+Copies Memory Tagging Extension (MTE) tags to/from guest tag memory. The
+``guest_ipa`` and ``length`` fields must be ``PAGE_SIZE`` aligned. The ``addr``
+field must point to a buffer which the tags will be copied to or from.
+
+``flags`` specifies the direction of copy, either ``KVM_ARM_TAGS_TO_GUEST`` or
+``KVM_ARM_TAGS_FROM_GUEST``.
+
+The size of the buffer to store the tags is ``(length / 16)`` bytes
+(granules in MTE are 16 bytes long). Each byte contains a single tag
+value. This matches the format of ``PTRACE_PEEKMTETAGS`` and
+``PTRACE_POKEMTETAGS``.
+
+If an error occurs before any data is copied then a negative error code is
+returned. If some tags have been copied before an error occurs then the number
+of bytes successfully copied is returned. If the call completes successfully
+then ``length`` is returned.
+
5. The kvm_run structure
========================

@@ -6362,6 +6399,26 @@ default.

See Documentation/x86/sgx/2.Kernel-internals.rst for more details.

+7.26 KVM_CAP_ARM_MTE
+--------------------
+
+:Architectures: arm64
+:Parameters: none
+
+This capability indicates that KVM (and the hardware) supports exposing the
+Memory Tagging Extensions (MTE) to the guest. It must also be enabled by the
+VMM before creating any VCPUs to allow the guest access. Note that MTE is only
+available to a guest running in AArch64 mode and enabling this capability will
+cause attempts to create AArch32 VCPUs to fail.
+
+When enabled the guest is able to access tags associated with any memory given
+to the guest. KVM will ensure that the tags are maintained during swap or
+hibernation of the host; however the VMM needs to manually save/restore the
+tags as appropriate if the VM is migrated.
+
+When enabled the VMM may make use of the ``KVM_ARM_MTE_COPY_TAGS`` ioctl to
+perform a bulk copy of tags to/from the guest.
+
8. Other capabilities.
======================

--
2.20.1

2021-06-17 14:15:50

by Marc Zyngier

Subject: Re: [PATCH v15 0/7] MTE support for KVM guest

On Thu, 17 Jun 2021 13:13:22 +0100,
Catalin Marinas <[email protected]> wrote:
>
> On Mon, Jun 14, 2021 at 10:05:18AM +0100, Steven Price wrote:
> > I realise there are still open questions[1] around the performance of
> > this series (the 'big lock', tag_sync_lock, introduced in the first
> > patch). But there should be no impact on non-MTE workloads and until we
> > get real MTE-enabled hardware it's hard to know whether there is a need
> > for something more sophisticated or not. Peter Collingbourne's patch[3]
> > to clear the tags at page allocation time should hide more of the impact
> > for non-VM cases. So the remaining concern is around VM startup which
> > could be effectively serialised through the lock.
> [...]
> > [1]: https://lore.kernel.org/r/874ke7z3ng.wl-maz%40kernel.org
>
> Start-up, VM resume, migration could be affected by this lock, basically
> any time you fault a page into the guest. As you said, for now it should
> be fine as long as the hardware doesn't support MTE or qemu doesn't
> enable MTE in guests. But the problem won't go away.

Indeed. And I find it odd to say "it's not a problem, we don't have
any HW available". By this token, why should we merge this work in the
first place, or any of the MTE work that has gone into the kernel over
the past years?

> We have a partial solution with an array of locks to mitigate against
> this but there's still the question of whether we should actually bother
> for something that's unlikely to happen in practice: MAP_SHARED memory
> in guests (ignoring the stage 1 case for now).
>
> If MAP_SHARED in guests is not a realistic use-case, we have the vma in
> user_mem_abort() and if the VM_SHARED flag is set together with MTE
> enabled for guests, we can reject the mapping.

That's a reasonable approach. I wonder whether we could do that right
at the point where the memslot is associated with the VM, like this:

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index a36a2e3082d8..ebd3b3224386 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1376,6 +1376,9 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
if (!vma)
break;

+ if (kvm_has_mte(kvm) && vma->vm_flags & VM_SHARED)
+ return -EINVAL;
+
/*
* Take the intersection of this VMA with the memory region
*/

which takes the problem out of the fault path altogether? We document
the restriction and move on. With that, we can use a non-locking
version of mte_sync_page_tags().

> We can discuss the stage 1 case separately from this series.

Works for me.

Thanks,

M.

--
Without deviation from the norm, progress is not possible.

2021-06-17 16:03:26

by Catalin Marinas

Subject: Re: [PATCH v15 0/7] MTE support for KVM guest

On Mon, Jun 14, 2021 at 10:05:18AM +0100, Steven Price wrote:
> I realise there are still open questions[1] around the performance of
> this series (the 'big lock', tag_sync_lock, introduced in the first
> patch). But there should be no impact on non-MTE workloads and until we
> get real MTE-enabled hardware it's hard to know whether there is a need
> for something more sophisticated or not. Peter Collingbourne's patch[3]
> to clear the tags at page allocation time should hide more of the impact
> for non-VM cases. So the remaining concern is around VM startup which
> could be effectively serialised through the lock.
[...]
> [1]: https://lore.kernel.org/r/874ke7z3ng.wl-maz%40kernel.org

Start-up, VM resume, migration could be affected by this lock, basically
any time you fault a page into the guest. As you said, for now it should
be fine as long as the hardware doesn't support MTE or qemu doesn't
enable MTE in guests. But the problem won't go away.

We have a partial solution with an array of locks to mitigate against
this but there's still the question of whether we should actually bother
for something that's unlikely to happen in practice: MAP_SHARED memory
in guests (ignoring the stage 1 case for now).
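
(For the record, the "array of locks" here just means hashing the page
to one of N spinlocks instead of taking a single global one; roughly:

	#include <linux/hash.h>

	#define TAG_SYNC_LOCK_BITS	6	/* 64 locks */
	static spinlock_t tag_sync_locks[1 << TAG_SYNC_LOCK_BITS]; /* init'd at boot */

	static spinlock_t *tag_sync_lock_for(struct page *page)
	{
		/* hash the pfn so faults on unrelated pages rarely contend */
		return &tag_sync_locks[hash_long(page_to_pfn(page),
						 TAG_SYNC_LOCK_BITS)];
	}

so concurrent faults only serialise when two pages hash to the same
bucket. A sketch of the idea only, not the actual patch.)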

If MAP_SHARED in guests is not a realistic use-case, we have the vma in
user_mem_abort() and if the VM_SHARED flag is set together with MTE
enabled for guests, we can reject the mapping.

We can discuss the stage 1 case separately from this series.

--
Catalin

2021-06-17 16:07:03

by Steven Price

Subject: Re: [PATCH v15 0/7] MTE support for KVM guest

On 17/06/2021 14:15, Marc Zyngier wrote:
> On Thu, 17 Jun 2021 13:13:22 +0100,
> Catalin Marinas <[email protected]> wrote:
>>
>> On Mon, Jun 14, 2021 at 10:05:18AM +0100, Steven Price wrote:
>>> I realise there are still open questions[1] around the performance of
>>> this series (the 'big lock', tag_sync_lock, introduced in the first
>>> patch). But there should be no impact on non-MTE workloads and until we
>>> get real MTE-enabled hardware it's hard to know whether there is a need
>>> for something more sophisticated or not. Peter Collingbourne's patch[3]
>>> to clear the tags at page allocation time should hide more of the impact
>>> for non-VM cases. So the remaining concern is around VM startup which
>>> could be effectively serialised through the lock.
>> [...]
>>> [1]: https://lore.kernel.org/r/874ke7z3ng.wl-maz%40kernel.org
>>
>> Start-up, VM resume, migration could be affected by this lock, basically
>> any time you fault a page into the guest. As you said, for now it should
>> be fine as long as the hardware doesn't support MTE or qemu doesn't
>> enable MTE in guests. But the problem won't go away.
>
> Indeed. And I find it odd to say "it's not a problem, we don't have
> any HW available". By this token, why should we merge this work in the
> first place, or any of the MTE work that has gone into the kernel over
> the past years?
>
>> We have a partial solution with an array of locks to mitigate against
>> this but there's still the question of whether we should actually bother
>> for something that's unlikely to happen in practice: MAP_SHARED memory
>> in guests (ignoring the stage 1 case for now).
>>
>> If MAP_SHARED in guests is not a realistic use-case, we have the vma in
>> user_mem_abort() and if the VM_SHARED flag is set together with MTE
>> enabled for guests, we can reject the mapping.
>
> That's a reasonable approach. I wonder whether we could do that right
> at the point where the memslot is associated with the VM, like this:
>
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index a36a2e3082d8..ebd3b3224386 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -1376,6 +1376,9 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
> if (!vma)
> break;
>
> + if (kvm_has_mte(kvm) && vma->vm_flags & VM_SHARED)
> + return -EINVAL;
> +
> /*
> * Take the intersection of this VMA with the memory region
> */
>
> which takes the problem out of the fault path altogether? We document
> the restriction and move on. With that, we can use a non-locking
> version of mte_sync_page_tags().

Does this deal with the case where the VMAs are changed after the
memslot is created? While we can do the check here to give the VMM a
heads-up if it gets it wrong, I think we also need it in
user_mem_abort() to deal with a VMM which mmap()s over the VA of the
memslot. Or am I missing something?
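
(i.e. something along these lines in user_mem_abort(), after the vma
lookup - an untested sketch, exact placement and error code up for
debate:

	if (kvm_has_mte(kvm) && (vma->vm_flags & VM_SHARED))
		return -EFAULT;

)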

But if everyone is happy with the restriction (just for KVM) of not
allowing MTE+VM_SHARED then that sounds like a good way forward.

Thanks,

Steve

>> We can discuss the stage 1 case separately from this series.
>
> Works for me.
>
> Thanks,
>
> M.
>

2021-06-17 18:00:44

by Marc Zyngier

Subject: Re: [PATCH v15 0/7] MTE support for KVM guest

On Thu, 17 Jun 2021 14:24:25 +0100,
Steven Price <[email protected]> wrote:
>
> On 17/06/2021 14:15, Marc Zyngier wrote:
> > On Thu, 17 Jun 2021 13:13:22 +0100,
> > Catalin Marinas <[email protected]> wrote:
> >>
> >> On Mon, Jun 14, 2021 at 10:05:18AM +0100, Steven Price wrote:
> >>> I realise there are still open questions[1] around the performance of
> >>> this series (the 'big lock', tag_sync_lock, introduced in the first
> >>> patch). But there should be no impact on non-MTE workloads and until we
> >>> get real MTE-enabled hardware it's hard to know whether there is a need
> >>> for something more sophisticated or not. Peter Collingbourne's patch[3]
> >>> to clear the tags at page allocation time should hide more of the impact
> >>> for non-VM cases. So the remaining concern is around VM startup which
> >>> could be effectively serialised through the lock.
> >> [...]
> >>> [1]: https://lore.kernel.org/r/874ke7z3ng.wl-maz%40kernel.org
> >>
> >> Start-up, VM resume, migration could be affected by this lock, basically
> >> any time you fault a page into the guest. As you said, for now it should
> >> be fine as long as the hardware doesn't support MTE or qemu doesn't
> >> enable MTE in guests. But the problem won't go away.
> >
> > Indeed. And I find it odd to say "it's not a problem, we don't have
> > any HW available". By this token, why should we merge this work in the
> > first place, or any of the MTE work that has gone into the kernel over
> > the past years?
> >
> >> We have a partial solution with an array of locks to mitigate against
> >> this but there's still the question of whether we should actually bother
> >> for something that's unlikely to happen in practice: MAP_SHARED memory
> >> in guests (ignoring the stage 1 case for now).
> >>
> >> If MAP_SHARED in guests is not a realistic use-case, we have the vma in
> >> user_mem_abort() and if the VM_SHARED flag is set together with MTE
> >> enabled for guests, we can reject the mapping.
> >
> > That's a reasonable approach. I wonder whether we could do that right
> > at the point where the memslot is associated with the VM, like this:
> >
> > diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> > index a36a2e3082d8..ebd3b3224386 100644
> > --- a/arch/arm64/kvm/mmu.c
> > +++ b/arch/arm64/kvm/mmu.c
> > @@ -1376,6 +1376,9 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
> > if (!vma)
> > break;
> >
> > + if (kvm_has_mte(kvm) && vma->vm_flags & VM_SHARED)
> > + return -EINVAL;
> > +
> > /*
> > * Take the intersection of this VMA with the memory region
> > */
> >
> > which takes the problem out of the fault path altogether? We document
> > the restriction and move on. With that, we can use a non-locking
> > version of mte_sync_page_tags().
>
> Does this deal with the case where the VMAs are changed after the
> memslot is created? While we can do the check here to give the VMM a
> heads-up if it gets it wrong, I think we also need it in
> user_mem_abort() to deal with a VMM which mmap()s over the VA of the
> memslot. Or am I missing something?

No, you're right. I wish the memslot API wasn't so lax... Anyway, even
a VMA flag check in user_mem_abort() will be cheaper than this new BKL.

> But if everyone is happy with the restriction (just for KVM) of not
> allowing MTE+VM_SHARED then that sounds like a good way forward.

Definitely works for me.

M.

--
Without deviation from the norm, progress is not possible.