2023-05-05 15:23:15

by Mickaël Salaün

Subject: [RFC PATCH v1 0/9] Hypervisor-Enforced Kernel Integrity

Hi,

This patch series is a proof-of-concept that implements new KVM features
(extended page tracking, MBEC support, CR pinning) and defines a new API to
protect guest VMs. No VMM (e.g., Qemu) modification is required.

The main idea is that kernel self-protection mechanisms should be delegated
to a more privileged part of the system: the hypervisor. It is still the
role of the guest kernel to request such restrictions according to its
configuration. The high-level security guarantees provided by the hypervisor
are semantically the same as a subset of those the kernel already enforces on
itself (CR pinning hardening and memory page table protections), but with much
stronger guarantees.

We'd like the mainline kernel to support such hardening features leveraging
virtualization. We're looking for reviews and comments that can help mainline
these two parts: the KVM implementation and the guest kernel API layer designed
to support different hypervisors. The struct heki_hypervisor makes it
possible to plug in different backend implementations, which are initialized
with the heki_early_init() and heki_late_init() calls. This RFC is an initial
call for collaboration. There is a lot to do on the hypervisor, guest kernel,
and VMM sides.

We took inspiration from previous patches, mainly the KVMI [1] [2] and KVM
CR-pinning [3] series, revamped and simplified relevant parts to fit well with
our goal, added support for MBEC to enable a deny-by-default kernel execution
policy (e.g., write xor execute), added two hypercalls, and created a kernel
API for VMs to request protection in a generic way that can be leveraged by any
hypervisor.

This proof-of-concept is named Hypervisor-Enforced Kernel Integrity (Heki),
which reflects the goal to empower guest kernels to protect themselves. This
name is new to the kernel, and it makes it easy to identify the new code
required for this set of features.

This patch series is based on Linux 6.2 and requires the host to support
MBEC. This can easily be checked with:

  grep ept_mode_based_exec /proc/cpuinfo

You can test it by enabling CONFIG_HEKI, CONFIG_HEKI_TEST,
CONFIG_KUNIT_DEFAULT_ENABLED, and adding the heki_test=N boot argument
to the guest as explained in the last patch. Another way to test it is
to try to load a kernel module in the guest: you'll see KVM creating
synthetic page faults. This only works using a bare metal machine as KVM
host; nested virtualization is not supported yet.

# Threat model

The initial threat model is a malicious user space process exploiting a kernel
vulnerability to gain more privileges or to bypass the access-control system.
An extended threat model could include attacks coming from network or storage
data (e.g., malformed network packet, inconsistent drive content).

Considering all potential ways to compromise a kernel, Heki's goal is to harden
a sane kernel before a runtime attack, to make such an attack more difficult
and potentially to make it fail. We consider the kernel itself to be
partially malicious during its lifetime, e.g., because of a ROP attack that
could disable kernel self-protection mechanisms and make kernel exploitation
much easier. Indeed, an exploit is often split into several stages, each
bypassing some security measures. Getting the guarantee that new kernel
executable code cannot be added increases the cost of an attack, hopefully to
the point that it is not worth it.

To protect against persistent attacks, complementary security mechanisms should
be used (e.g., kernel module signing, IMA, IPE, Lockdown).

# Prerequisites

For this set of features to be useful, guest kernels must be trusted by the VM
owners at boot time, before launching any user space processes or receiving
potentially malicious network packets. It is thus required to have a security
mechanism to provide or check this initial trust (e.g., secure boot, kernel
module signing).

# How does it work?

This implementation mainly leverages KVM capabilities to control the Second
Level Address Translation (or Two-Dimensional Paging, e.g., Intel's EPT or
AMD's RVI/NPT) and Mode Based Execution Control (Intel's MBEC), introduced
with the Kaby Lake (7th generation) architecture. This makes it possible to
set permissions on memory pages in a way that complements the guest kernel's
own memory permissions. Once these permissions are set, they are locked and
there is no way back.

A first hypercall, KVM_HC_LOCK_MEM_PAGE_RANGES, enables the guest kernel to
lock a set of its memory page ranges with either the HEKI_ATTR_MEM_NOWRITE or
the HEKI_ATTR_MEM_EXEC attribute. The first one denies write access to the
listed pages (a deny-list approach), and the second allows kernel execution
only for the listed pages (an allow-list approach).

The current implementation sets the whole kernel's .rodata (i.e., any const or
__ro_after_init variables, which includes critical security data such as LSM
parameters) and .text sections as non-writable, and the .text section is the
only one where kernel execution is allowed. This is possible thanks to the new
MBEC support also brought by this series (otherwise the vDSO would have to be
executable). Thanks to this hardware support (VT-x, EPT and MBEC), the
performance impact of such guest protection is negligible.

The second hypercall, KVM_HC_LOCK_CR_UPDATE, enables guests to pin some of
their CPU control register flags (e.g., X86_CR0_WP, X86_CR4_SMEP, X86_CR4_SMAP),
which is another complementary hardening mechanism.

Heki can be enabled with the heki=1 boot command argument.

# Similar implementations

Here is a non-exhaustive list of similar implementations that we looked at and
took some ideas from. Linux mainline doesn't support such security features;
let's change that!

Windows's Virtualization-Based Security is a proprietary technology that
provides a superset of this kind of security mechanism. It relies on Hyper-V
and Virtual Trust Levels, which make it possible to use a light and secure VM
to enforce restrictions on a full guest VM. This includes several components
such as HVCI, which is in charge of code authenticity, or HyperGuard, which
monitors and protects kernel code and data.

Samsung's Real-time Kernel Protection (RKP) and Huawei's Hypervisor Execution
Environment (HHEE) rely on proprietary hypervisors to protect some Android
devices. They monitor critical kernel data (e.g., page tables, credentials,
selinux_enforcing).

The iOS Kernel Patch Protection is a proprietary solution that relies on a
secure enclave (dedicated hardware component) to monitor and protect critical
parts of the kernel.

Bitdefender's Hypervisor Memory Introspection (HVMI) is an open-source (but out
of tree) set of components leveraging virtualization. The HVMI implementation
is very complex, and this approach implies potential semantic gap issues
(i.e., kernel data structures may change from one version to another).

Linux Kernel Runtime Guard is an open-source kernel module that can detect
some illegitimate modifications of kernel data. Because it runs in the same
kernel as the compromised one, an attacker could also bypass or disable these
checks.

Intel's Virtualization Based Hardening [4] [5] is an open-source
proof-of-concept of a thin hypervisor dedicated to guest protection. As such,
it cannot be used to manage several VMs.

# Similar Linux patches

The VM introspection [1] [2] patch series proposed a set of features to
put probes and introspect VMs for debugging and security reasons. We
changed and included the prewrite page tracking and the fault_gva parts.
Heki is much simpler because it focuses on guest hardening, not
introspection.

Paravirtualized Control Register pinning [3] added a set of KVM IOCTLs to
restrict changes to some control register flags. Heki doesn't implement such a
user space interface, but only a dedicated hypercall to lock such registers. A
superset of these flags is configurable with Heki.

The Hypervisor Based Integrity patches [6] [7] only contain a generic IPC
mechanism (the KVM_HC_UCALL hypercall) to request protection from the VMM. The
idea was to extend the KVM_SET_USER_MEMORY_REGION IOCTL to support more
permissions than read-only.

# Current limitations

The main limitation of this patch series is that the enforced permissions are
static. This is not an issue for kernels without modules, but it needs to be
addressed. Mechanisms that dynamically impact kernel executable memory are
not handled for now (e.g., kernel modules, tracepoints, eBPF JIT), and such
code will need to be authenticated. Because the hypervisor is highly
privileged and critical to the security of all the VMs, we don't want to
implement a code authentication mechanism in the hypervisor itself but instead
delegate this verification to something much less privileged. We are thinking
of two ways to solve this: implement this verification in the VMM, or spawn a
dedicated special VM (similar to Windows's VBS). There are pros and cons to
each approach: complexity, verification code ownership (guest's or VMM's),
access to guest memory (i.e., confidential computing).

Because the guest's virtual address translation is not protected by the
hypervisor, a compromised kernel could map existing physical pages to
arbitrary virtual addresses. Intel's new Hypervisor-Managed Linear Address
Translation (HLAT) [8] could be used to extend the current protection and
cover this case.

ROP is not covered by this patch series. Guest kernels can still jump to
arbitrary executable pages according to their control-flow integrity
protection.

# Future work

We think this kind of restriction can be leveraged to log attempts at
forbidden actions. Forwarding such signals to the VMM could help improve
attack detection.

Giving visibility to the VMM would also make it possible to migrate VMs.

New dynamic restrictions could extend the protected data to include
security-sensitive data such as LSM states, seccomp filters, and keyrings.
This requires support outside of the hypervisor.

An execute-only mode could also be useful (cf. XOM for KVM [9] [10]).

Register pinning could be extended (e.g., to MSRs).

Being able to protect nested guests might be possible but we need to figure out
the potential security implications.

Protecting the host would be useful, but that doesn't really fit with the KVM
model. The Protected KVM project is a first step to help in this direction
[11].

We only tested this with an Intel CPU, but this approach should work the same
with an AMD CPU starting with the Zen 2 generation and their Guest Mode Execute
Trap (GMET) capability.

We also kept some TODOs to highlight missing checks and code sharing issues,
and some pr_warn() calls to help understand how it works. Tests need to be
improved (e.g., invalid hypercall arguments).

We'll present this work at the Linux Security Summit North America next week.

[1] https://lore.kernel.org/all/[email protected]/
[2] https://www.linux-kvm.org/images/7/72/KVMForum2017_Introspection.pdf
[3] https://lore.kernel.org/all/[email protected]/
[4] https://github.com/intel/vbh
[5] https://sched.co/TmwN
[6] https://sched.co/eE3f
[7] https://lore.kernel.org/all/[email protected]/
[8] https://sched.co/eE4F
[9] https://lore.kernel.org/kvm/[email protected]/
[10] https://lpc.events/event/4/contributions/283/
[11] https://sched.co/eE24

Please reach out to us by replying to this thread; we're looking for
people to join and collaborate on this project!

Regards,

Madhavan T. Venkataraman (2):
virt: Implement Heki common code
KVM: x86: Add Heki hypervisor support

Mickaël Salaün (7):
KVM: x86: Add kvm_x86_ops.fault_gva()
KVM: x86/mmu: Add support for prewrite page tracking
KVM: x86: Add new hypercall to set EPT permissions
KVM: x86: Add new hypercall to lock control registers
KVM: VMX: Add MBEC support
KVM: x86/mmu: Enable guests to lock themselves thanks to MBEC
virt: Add Heki KUnit tests

Documentation/virt/kvm/x86/hypercalls.rst | 34 +++
Kconfig | 2 +
arch/x86/Kconfig | 1 +
arch/x86/include/asm/kvm-x86-ops.h | 1 +
arch/x86/include/asm/kvm_host.h | 2 +
arch/x86/include/asm/kvm_page_track.h | 12 +
arch/x86/include/asm/sections.h | 4 +
arch/x86/include/asm/vmx.h | 11 +-
arch/x86/include/asm/x86_init.h | 2 +
arch/x86/kernel/cpu/common.c | 2 +-
arch/x86/kernel/cpu/hypervisor.c | 1 +
arch/x86/kernel/kvm.c | 72 +++++
arch/x86/kernel/setup.c | 49 +++
arch/x86/kernel/x86_init.c | 1 +
arch/x86/kvm/Kconfig | 1 +
arch/x86/kvm/mmu.h | 3 +-
arch/x86/kvm/mmu/mmu.c | 105 ++++++-
arch/x86/kvm/mmu/mmutrace.h | 11 +-
arch/x86/kvm/mmu/page_track.c | 33 +-
arch/x86/kvm/mmu/paging_tmpl.h | 16 +-
arch/x86/kvm/mmu/spte.c | 29 +-
arch/x86/kvm/mmu/spte.h | 15 +-
arch/x86/kvm/mmu/tdp_mmu.c | 73 +++++
arch/x86/kvm/mmu/tdp_mmu.h | 4 +
arch/x86/kvm/svm/svm.c | 9 +
arch/x86/kvm/vmx/capabilities.h | 7 +
arch/x86/kvm/vmx/nested.c | 7 +
arch/x86/kvm/vmx/vmx.c | 48 ++-
arch/x86/kvm/vmx/vmx.h | 1 +
arch/x86/kvm/x86.c | 352 +++++++++++++++++++++-
arch/x86/kvm/x86.h | 23 ++
include/linux/heki.h | 90 ++++++
include/linux/kvm_host.h | 20 ++
include/uapi/linux/kvm_para.h | 2 +
init/main.c | 3 +
virt/Makefile | 1 +
virt/heki/Kconfig | 41 +++
virt/heki/Makefile | 3 +
virt/heki/heki.c | 321 ++++++++++++++++++++
virt/kvm/kvm_main.c | 5 +
40 files changed, 1377 insertions(+), 40 deletions(-)
create mode 100644 include/linux/heki.h
create mode 100644 virt/heki/Kconfig
create mode 100644 virt/heki/Makefile
create mode 100644 virt/heki/heki.c


base-commit: c9c3395d5e3dcc6daee66c6908354d47bf98cb0c
--
2.40.1


2023-05-05 15:43:50

by Mickaël Salaün

Subject: [PATCH v1 8/9] KVM: x86/mmu: Enable guests to lock themselves thanks to MBEC

This change makes it possible to enforce a deny-by-default execution security
policy for guest kernels, leveraged by the Heki implementation.

Create synthetic page faults when an access is denied by Heki. This
kind of kernel page fault needs to be handled by guests, which is not
currently the case, making them retry again and again. We are working on
calming down such guests by teaching them how to handle these page faults.

The MMU tracepoints are updated to reflect the difference between kernel
and user space executions.

kvm_heki_fix_all_ept_exec_perm() walks through all guest memory pages to
set the configured default execution permissions (i.e., only allow the
configured executable memory pages).

The struct heki_mem_range's attribute field now understands
HEKI_ATTR_MEM_EXEC, which allows the related kernel sections to be
executable and denies any other kernel memory from being executable for
the whole lifetime of the guest. This obviously can only work with
static kernels, and we are exploring ways to handle authenticated and
dynamic kernel memory permission updates.

If the host doesn't have MBEC enabled, the KVM_HC_LOCK_MEM_PAGE_RANGES
hypercall will return -KVM_EOPNOTSUPP and might only apply the previous
ranges, if any. This is useful to develop this RFC and make sure
execution restrictions are enforced (and not silently ignored), but this
behavior might change in a future patch series. Guest kernels could
check for MBEC support to not use the HEKI_ATTR_MEM_EXEC attribute.

The number of configurable memory ranges per guest is 16 for now. This
will change with a follow-up.

There are currently some pr_warn() calls to make it easy to test this
code.

Cc: Borislav Petkov <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Madhavan T. Venkataraman <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: Sean Christopherson <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vitaly Kuznetsov <[email protected]>
Cc: Wanpeng Li <[email protected]>
Signed-off-by: Mickaël Salaün <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
Documentation/virt/kvm/x86/hypercalls.rst | 4 +-
arch/x86/kvm/mmu/mmu.c | 35 ++++++++-
arch/x86/kvm/mmu/mmutrace.h | 11 ++-
arch/x86/kvm/mmu/spte.c | 19 ++++-
arch/x86/kvm/mmu/spte.h | 15 +++-
arch/x86/kvm/mmu/tdp_mmu.c | 73 ++++++++++++++++++
arch/x86/kvm/mmu/tdp_mmu.h | 4 +
arch/x86/kvm/x86.c | 90 ++++++++++++++++++++++-
arch/x86/kvm/x86.h | 7 ++
include/linux/kvm_host.h | 4 +
virt/kvm/kvm_main.c | 1 +
11 files changed, 250 insertions(+), 13 deletions(-)

diff --git a/Documentation/virt/kvm/x86/hypercalls.rst b/Documentation/virt/kvm/x86/hypercalls.rst
index 8aa5d28986e3..5accf5f6de13 100644
--- a/Documentation/virt/kvm/x86/hypercalls.rst
+++ b/Documentation/virt/kvm/x86/hypercalls.rst
@@ -204,7 +204,9 @@ must also set up an MSR filter to process writes to MSR_KVM_MIGRATION_CONTROL.

The hypercall lets a guest request memory permissions to be removed for itself,
identified with set of physical page ranges (GFNs). The HEKI_ATTR_MEM_NOWRITE
-memory page range attribute forbids related modification to the guest.
+memory page range attribute forbids related modification to the guest. The
+HEKI_ATTR_MEM_EXEC attribute allows execution on the specified pages while
+removing it for all the others.

Returns 0 on success or a KVM error code otherwise.

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index a47e63217eb8..56a8bcac1b82 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3313,7 +3313,7 @@ fast_pf_fix_direct_spte(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
static bool is_access_allowed(struct kvm_page_fault *fault, u64 spte)
{
if (fault->exec)
- return is_executable_pte(spte);
+ return is_executable_pte(spte, !fault->user);

if (fault->write)
return is_writable_pte(spte);
@@ -5602,6 +5602,39 @@ int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err
if (WARN_ON(!VALID_PAGE(vcpu->arch.mmu->root.hpa)))
return RET_PF_RETRY;

+ /* Skips real page faults if not needed. */
+ if ((error_code & PFERR_FETCH_MASK) &&
+ !kvm_heki_is_exec_allowed(vcpu, cr2_or_gpa)) {
+ /*
+ * TODO: To avoid kvm_heki_is_exec_allowed() call, check
+ * enable_mbec and EPT_VIOLATION_KERNEL_INSTR, see
+ * handle_ept_violation().
+ */
+ struct x86_exception fault = {
+ .vector = PF_VECTOR,
+ .error_code_valid = true,
+ .error_code = error_code,
+ .nested_page_fault = false,
+ /*
+ * TODO: This kind of kernel page fault needs to be handled by
+ * the guest, which is not currently the case, making it try
+ * again and again.
+ *
+ * You may want to test with cr2_or_gva to see the page
+ * fault caught by the guest kernel (thinking it is a
+ * user space fault).
+ */
+ .address = static_call(kvm_x86_fault_gva)(vcpu),
+ .async_page_fault = false,
+ };
+
+ pr_warn_ratelimited(
+ "heki-kvm: Creating fetch #PF at 0x%016llx\n",
+ fault.address);
+ kvm_inject_page_fault(vcpu, &fault);
+ return RET_PF_INVALID;
+ }
+
r = RET_PF_INVALID;
if (unlikely(error_code & PFERR_RSVD_MASK)) {
r = handle_mmio_page_fault(vcpu, cr2_or_gpa, direct);
diff --git a/arch/x86/kvm/mmu/mmutrace.h b/arch/x86/kvm/mmu/mmutrace.h
index ae86820cef69..cb7df95aec25 100644
--- a/arch/x86/kvm/mmu/mmutrace.h
+++ b/arch/x86/kvm/mmu/mmutrace.h
@@ -342,7 +342,8 @@ TRACE_EVENT(
__field(u8, level)
/* These depend on page entry type, so compute them now. */
__field(bool, r)
- __field(bool, x)
+ __field(bool, kx)
+ __field(bool, ux)
__field(signed char, u)
),

@@ -352,15 +353,17 @@ TRACE_EVENT(
__entry->sptep = virt_to_phys(sptep);
__entry->level = level;
__entry->r = shadow_present_mask || (__entry->spte & PT_PRESENT_MASK);
- __entry->x = is_executable_pte(__entry->spte);
+ __entry->kx = is_executable_pte(__entry->spte, true);
+ __entry->ux = is_executable_pte(__entry->spte, false);
__entry->u = shadow_user_mask ? !!(__entry->spte & shadow_user_mask) : -1;
),

- TP_printk("gfn %llx spte %llx (%s%s%s%s) level %d at %llx",
+ TP_printk("gfn %llx spte %llx (%s%s%s%s%s) level %d at %llx",
__entry->gfn, __entry->spte,
__entry->r ? "r" : "-",
__entry->spte & PT_WRITABLE_MASK ? "w" : "-",
- __entry->x ? "x" : "-",
+ __entry->kx ? "X" : "-",
+ __entry->ux ? "x" : "-",
__entry->u == -1 ? "" : (__entry->u ? "u" : "-"),
__entry->level, __entry->sptep
)
diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index f1e2e3cad878..c9fabb3c9cb2 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -184,10 +184,25 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
pte_access &= ~ACC_EXEC_MASK;
}

- if (pte_access & ACC_EXEC_MASK)
+ if (pte_access & ACC_EXEC_MASK) {
spte |= shadow_x_mask;
- else
+#ifdef CONFIG_HEKI
+ /*
+ * FIXME: Race condition (at boot) if no
+ * lockdep_assert_held_write(vcpu->kvm->mmu_lock);
+ */
+ if (READ_ONCE(vcpu->kvm->heki_kernel_exec_locked)) {
+ if (!heki_exec_is_allowed(vcpu->kvm, gfn))
+ spte &= ~VMX_EPT_EXECUTABLE_MASK;
+ else
+ pr_warn("heki-kvm: Allowing kernel execution "
+ "for GFN 0x%llx\n",
+ gfn);
+ }
+#endif /* CONFIG_HEKI */
+ } else {
spte |= shadow_nx_mask;
+ }

if (pte_access & ACC_USER_MASK)
spte |= shadow_user_mask;
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index 6f54dc9409c9..30b250d03132 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -3,7 +3,10 @@
#ifndef KVM_X86_MMU_SPTE_H
#define KVM_X86_MMU_SPTE_H

+#include <asm/vmx.h>
+
#include "mmu_internal.h"
+#include "../vmx/vmx.h"

/*
* A MMU present SPTE is backed by actual memory and may or may not be present
@@ -307,9 +310,17 @@ static inline bool is_last_spte(u64 pte, int level)
return (level == PG_LEVEL_4K) || is_large_pte(pte);
}

-static inline bool is_executable_pte(u64 spte)
+static inline bool is_executable_pte(u64 spte, bool for_kernel_mode)
{
- return (spte & (shadow_x_mask | shadow_nx_mask)) == shadow_x_mask;
+ u64 x_mask = shadow_x_mask;
+
+ if (enable_mbec) {
+ if (for_kernel_mode)
+ x_mask &= ~VMX_EPT_USER_EXECUTABLE_MASK;
+ else
+ x_mask &= ~VMX_EPT_EXECUTABLE_MASK;
+ }
+ return (spte & (x_mask | shadow_nx_mask)) == x_mask;
}

static inline kvm_pfn_t spte_to_pfn(u64 pte)
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index d6df38d371a0..0be34a9e90c0 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -7,7 +7,10 @@
#include "tdp_mmu.h"
#include "spte.h"

+#include "../x86.h"
+
#include <asm/cmpxchg.h>
+#include <asm/vmx.h>
#include <trace/events/kvm.h>

static bool __read_mostly tdp_mmu_enabled = true;
@@ -1021,6 +1024,76 @@ void kvm_tdp_mmu_zap_all(struct kvm *kvm)
}
}

+#ifdef CONFIG_HEKI
+
+/* TODO: Handle flush? */
+void kvm_heki_fix_all_ept_exec_perm(struct kvm *const kvm)
+{
+ int i;
+ struct kvm_mmu_page *root;
+ const gfn_t start = 0;
+ const gfn_t end = tdp_mmu_max_gfn_exclusive();
+
+ if (WARN_ON_ONCE(!is_tdp_mmu_enabled(kvm)))
+ return;
+
+ if (WARN_ON_ONCE(!enable_mbec))
+ return;
+
+ write_lock(&kvm->mmu_lock);
+
+ /*
+ * Because heki_exec_locked is only set with this code, it cannot be
+ * unlocked. This is protected against race condition thanks to
+ * mmu_lock.
+ */
+ WRITE_ONCE(kvm->heki_kernel_exec_locked, true);
+
+ for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
+ for_each_tdp_mmu_root(kvm, root, i) {
+ struct tdp_iter iter;
+
+ WARN_ON_ONCE(!refcount_read(&root->tdp_mmu_root_count));
+
+ /*
+ * TODO: Make sure
+ * !is_shadow_present_pte()/SPTE_MMU_PRESENT_MASK are
+ * well handled when they are present.
+ */
+
+ rcu_read_lock();
+ tdp_root_for_each_leaf_pte(iter, root, start, end) {
+ u64 new_spte;
+
+ if (heki_exec_is_allowed(kvm, iter.gfn)) {
+ pr_warn("heki-kvm: Allowing kernel "
+ "execution for GFN 0x%llx\n",
+ iter.gfn);
+ continue;
+ }
+ pr_warn("heki-kvm: Denying kernel execution "
+ "for GFN 0x%llx\n",
+ iter.gfn);
+
+retry:
+ new_spte = iter.old_spte &
+ ~VMX_EPT_EXECUTABLE_MASK;
+ if (new_spte == iter.old_spte)
+ continue;
+
+ if (tdp_mmu_set_spte_atomic(kvm, &iter,
+ new_spte))
+ goto retry;
+ }
+ rcu_read_unlock();
+ }
+ }
+ write_unlock(&kvm->mmu_lock);
+ pr_warn("heki-kvm: Locked executable kernel memory\n");
+}
+
+#endif /* CONFIG_HEKI */
+
/*
* Zap all invalidated roots to ensure all SPTEs are dropped before the "fast
* zap" completes.
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index d3714200b932..8b70b6af68d4 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -24,6 +24,10 @@ void kvm_tdp_mmu_zap_all(struct kvm *kvm);
void kvm_tdp_mmu_invalidate_all_roots(struct kvm *kvm);
void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm);

+#ifdef CONFIG_HEKI
+void kvm_heki_fix_all_ept_exec_perm(struct kvm *const kvm);
+#endif /* CONFIG_HEKI */
+
int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);

bool kvm_tdp_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range,
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index a529455359ac..7ac8d9fabc18 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -20,6 +20,7 @@
#include "irq.h"
#include "ioapic.h"
#include "mmu.h"
+#include "mmu/tdp_mmu.h"
#include "i8254.h"
#include "tss.h"
#include "kvm_cache_regs.h"
@@ -31,6 +32,7 @@
#include "lapic.h"
#include "xen.h"
#include "smm.h"
+#include "vmx/capabilities.h"

#include <linux/clocksource.h>
#include <linux/interrupt.h>
@@ -9705,6 +9707,45 @@ heki_page_track_prewrite(struct kvm_vcpu *const vcpu, const gpa_t gpa,
return true;
}

+bool heki_exec_is_allowed(const struct kvm *const kvm, const gfn_t gfn)
+{
+ unsigned int gfn_last;
+
+ if (!READ_ONCE(kvm->heki_kernel_exec_locked))
+ return true;
+
+ /*
+ * heki_gfn_exec_last is initialized with (HEKI_GFN_MAX + 1),
+ * and 0 means that heki_gfn_exec_last is full.
+ */
+ for (gfn_last = atomic_read(&kvm->heki_gfn_exec_last);
+ gfn_last > 0 && gfn_last <= HEKI_GFN_MAX;) {
+ gfn_last--;
+
+ /* Ignores unused slots. */
+ if (kvm->heki_gfn_exec[gfn_last].end == 0)
+ break;
+
+ if (gfn >= kvm->heki_gfn_exec[gfn_last].start &&
+ gfn <= kvm->heki_gfn_exec[gfn_last].end) {
+ /* TODO: Opportunistically shrink heki_gfn_exec. */
+ return true;
+ }
+ }
+ return false;
+}
+
+bool kvm_heki_is_exec_allowed(struct kvm_vcpu *vcpu, gpa_t gpa)
+{
+ const gfn_t gfn = gpa_to_gfn(gpa);
+ const struct kvm *const kvm = vcpu->kvm;
+
+ if (heki_exec_is_allowed(kvm, gfn))
+ return true;
+
+ return false;
+}
+
static int kvm_heki_init_vm(struct kvm *const kvm)
{
struct kvm_page_track_notifier_node *const node =
@@ -9733,6 +9774,7 @@ static int heki_lock_mem_page_ranges(struct kvm *const kvm, gpa_t mem_ranges,
int err;
size_t i, ranges_num;
struct heki_pa_range *ranges;
+ bool has_exec_restriction = false;

if (mem_ranges_size > HEKI_PA_RANGE_MAX_SIZE)
return -KVM_E2BIG;
@@ -9752,7 +9794,8 @@ static int heki_lock_mem_page_ranges(struct kvm *const kvm, gpa_t mem_ranges,

ranges_num = mem_ranges_size / sizeof(struct heki_pa_range);
for (i = 0; i < ranges_num; i++) {
- const u64 attributes_mask = HEKI_ATTR_MEM_NOWRITE;
+ const u64 attributes_mask = HEKI_ATTR_MEM_NOWRITE |
+ HEKI_ATTR_MEM_EXEC;
const gfn_t gfn_start = ranges[i].gfn_start;
const gfn_t gfn_end = ranges[i].gfn_end;
const u64 attributes = ranges[i].attributes;
@@ -9799,11 +9842,52 @@ static int heki_lock_mem_page_ranges(struct kvm *const kvm, gpa_t mem_ranges,
kvm, gfn, KVM_PAGE_TRACK_PREWRITE));
}

- pr_warn("heki-kvm: Locking GFN 0x%llx-0x%llx with %s\n",
+ /*
+ * Allow-list for execute permission,
+ * see kvm_heki_fix_all_ept_exec_perm().
+ */
+ if (attributes & HEKI_ATTR_MEM_EXEC) {
+ size_t gfn_i;
+
+ if (!enable_mbec) {
+ /*
+ * Guests can check for MBEC support to avoid
+ * such error by not using HEKI_ATTR_MEM_EXEC.
+ */
+ err = -KVM_EOPNOTSUPP;
+ pr_warn("heki-kvm: HEKI_ATTR_MEM_EXEC "
+ "depends on MBEC, which is disabled.");
+ /*
+ * We should continue partially applying
+ * restrictions, but it is useful for this RFC
+ * to exit early in case of missing MBEC
+ * support.
+ */
+ goto out_free_ranges;
+ }
+
+ has_exec_restriction = true;
+ gfn_i = atomic_dec_if_positive(
+ &kvm->heki_gfn_exec_last);
+ if (gfn_i == 0) {
+ err = -KVM_E2BIG;
+ goto out_free_ranges;
+ }
+
+ gfn_i--;
+ kvm->heki_gfn_exec[gfn_i].start = gfn_start;
+ kvm->heki_gfn_exec[gfn_i].end = gfn_end;
+ }
+
+ pr_warn("heki-kvm: Locking GFN 0x%llx-0x%llx with %s%s\n",
gfn_start, gfn_end,
- (attributes & HEKI_ATTR_MEM_NOWRITE) ? "[nowrite]" : "");
+ (attributes & HEKI_ATTR_MEM_NOWRITE) ? "[nowrite]" : "",
+ (attributes & HEKI_ATTR_MEM_EXEC) ? "[exec]" : "");
}

+ if (has_exec_restriction)
+ kvm_heki_fix_all_ept_exec_perm(kvm);
+
out_free_ranges:
kfree(ranges);
return err;
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index 3e80a60ecbd8..2127e551202d 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -282,6 +282,8 @@ int heki_check_cr(const struct kvm *kvm, unsigned long cr, unsigned long val);

bool kvm_heki_is_exec_allowed(struct kvm_vcpu *vcpu, gpa_t gpa);

+bool heki_exec_is_allowed(const struct kvm *const kvm, const gfn_t gfn);
+
#else /* CONFIG_HEKI */

static inline int heki_check_cr(const struct kvm *const kvm,
@@ -290,6 +292,11 @@ static inline int heki_check_cr(const struct kvm *const kvm,
return 0;
}

+static inline bool kvm_heki_is_exec_allowed(struct kvm_vcpu *vcpu, gpa_t gpa)
+{
+ return true;
+}
+
#endif /* CONFIG_HEKI */

void kvm_inject_realmode_interrupt(struct kvm_vcpu *vcpu, int irq, int inc_eip);
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index ab9dc723bc89..82c7b02cbcc3 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -812,9 +812,13 @@ struct kvm {
#define HEKI_GFN_MAX 16
atomic_t heki_gfn_no_write_num;
struct heki_gfn_range heki_gfn_no_write[HEKI_GFN_MAX];
+ atomic_t heki_gfn_exec_last;
+ struct heki_gfn_range heki_gfn_exec[HEKI_GFN_MAX];

atomic_long_t heki_pinned_cr0;
atomic_long_t heki_pinned_cr4;
+
+ bool heki_kernel_exec_locked;
#endif /* CONFIG_HEKI */

#ifdef CONFIG_HAVE_KVM_PM_NOTIFIER
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 4aea936dfe73..a177f8ff5123 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1232,6 +1232,7 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)

#ifdef CONFIG_HEKI
atomic_set(&kvm->heki_gfn_no_write_num, HEKI_GFN_MAX + 1);
+ atomic_set(&kvm->heki_gfn_exec_last, HEKI_GFN_MAX + 1);
#endif /* CONFIG_HEKI */

preempt_notifier_inc();
--
2.40.1

2023-05-05 15:44:12

by Mickaël Salaün

Subject: [PATCH v1 3/9] virt: Implement Heki common code

From: Madhavan T. Venkataraman <[email protected]>

Hypervisor Enforced Kernel Integrity (Heki) is a feature that will use
the hypervisor to enhance guest virtual machine security.

Configuration
=============

Define the config variables for the feature. This feature depends on
support from the architecture as well as the hypervisor.

Enabling HEKI
=============

Define a kernel command line parameter "heki" to turn the feature on or
off. By default, Heki is on.

Feature initialization
======================

The linker script, vmlinux.lds.S, defines a number of sections that are
loaded in kernel memory. Each of these sections has its own permissions.
For instance, .text has HEKI_ATTR_MEM_EXEC | HEKI_ATTR_MEM_NOWRITE, and
.rodata has HEKI_ATTR_MEM_NOWRITE.

Define an architecture specific init function, heki_arch_init(). In this
function, collect the ranges of all of the sections. These sections will
be protected in the host page table with their respective permissions so
that even if the guest kernel is compromised, their permissions cannot
be changed.

Define heki_early_init() to initialize the feature. For now, this
function just checks if the feature is enabled and calls
heki_arch_init().

Define heki_late_init() that protects the sections in the host page
table. This needs hypervisor support which will be introduced in the
future. This function is called at the end of kernel init.

Cc: Borislav Petkov <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Madhavan T. Venkataraman <[email protected]>
Cc: Mickaël Salaün <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: Sean Christopherson <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vitaly Kuznetsov <[email protected]>
Cc: Wanpeng Li <[email protected]>
Signed-off-by: Madhavan T. Venkataraman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
Kconfig | 2 +
arch/x86/Kconfig | 1 +
arch/x86/include/asm/sections.h | 4 +
arch/x86/kernel/setup.c | 49 ++++++++++++
include/linux/heki.h | 90 +++++++++++++++++++++
init/main.c | 3 +
virt/Makefile | 1 +
virt/heki/Kconfig | 22 ++++++
virt/heki/Makefile | 3 +
virt/heki/heki.c | 135 ++++++++++++++++++++++++++++++++
10 files changed, 310 insertions(+)
create mode 100644 include/linux/heki.h
create mode 100644 virt/heki/Kconfig
create mode 100644 virt/heki/Makefile
create mode 100644 virt/heki/heki.c

diff --git a/Kconfig b/Kconfig
index 745bc773f567..0c844d9bcb03 100644
--- a/Kconfig
+++ b/Kconfig
@@ -29,4 +29,6 @@ source "lib/Kconfig"

source "lib/Kconfig.debug"

+source "virt/heki/Kconfig"
+
source "Documentation/Kconfig"
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 3604074a878b..5cf5a7a97811 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -297,6 +297,7 @@ config X86
select FUNCTION_ALIGNMENT_4B
imply IMA_SECURE_AND_OR_TRUSTED_BOOT if EFI
select HAVE_DYNAMIC_FTRACE_NO_PATCHABLE
+ select ARCH_SUPPORTS_HEKI if X86_64

config INSTRUCTION_DECODER
def_bool y
diff --git a/arch/x86/include/asm/sections.h b/arch/x86/include/asm/sections.h
index a6e8373a5170..42ef1e33b8a5 100644
--- a/arch/x86/include/asm/sections.h
+++ b/arch/x86/include/asm/sections.h
@@ -18,6 +18,10 @@ extern char __end_of_kernel_reserve[];

extern unsigned long _brk_start, _brk_end;

+extern int __start_orc_unwind_ip[], __stop_orc_unwind_ip[];
+extern struct orc_entry __start_orc_unwind[], __stop_orc_unwind[];
+extern unsigned int orc_lookup[], orc_lookup_end[];
+
static inline bool arch_is_kernel_initmem_freed(unsigned long addr)
{
/*
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 88188549647c..f0ddaf24ab63 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -11,6 +11,7 @@
#include <linux/dma-map-ops.h>
#include <linux/dmi.h>
#include <linux/efi.h>
+#include <linux/heki.h>
#include <linux/ima.h>
#include <linux/init_ohci1394_dma.h>
#include <linux/initrd.h>
@@ -850,6 +851,54 @@ static void __init x86_report_nx(void)
}
}

+#ifdef CONFIG_HEKI
+
+/*
+ * Gather all of the statically defined sections so heki_late_init() can
+ * protect these sections in the host page table.
+ *
+ * The sections are defined under "SECTIONS" in vmlinux.lds.S
+ * Keep this array in sync with SECTIONS.
+ */
+struct heki_va_range __initdata heki_va_ranges[] = {
+ {
+ .va_start = _stext,
+ .va_end = _etext,
+ .attributes = HEKI_ATTR_MEM_NOWRITE | HEKI_ATTR_MEM_EXEC,
+ },
+ {
+ .va_start = __start_rodata,
+ .va_end = __end_rodata,
+ .attributes = HEKI_ATTR_MEM_NOWRITE,
+ },
+#ifdef CONFIG_UNWINDER_ORC
+ {
+ .va_start = __start_orc_unwind_ip,
+ .va_end = __stop_orc_unwind_ip,
+ .attributes = HEKI_ATTR_MEM_NOWRITE,
+ },
+ {
+ .va_start = __start_orc_unwind,
+ .va_end = __stop_orc_unwind,
+ .attributes = HEKI_ATTR_MEM_NOWRITE,
+ },
+ {
+ .va_start = orc_lookup,
+ .va_end = orc_lookup_end,
+ .attributes = HEKI_ATTR_MEM_NOWRITE,
+ },
+#endif /* CONFIG_UNWINDER_ORC */
+};
+
+void __init heki_arch_init(void)
+{
+ heki.num_static_ranges = ARRAY_SIZE(heki_va_ranges);
+ heki.static_ranges =
+ heki_alloc_pa_ranges(heki_va_ranges, heki.num_static_ranges);
+}
+
+#endif /* CONFIG_HEKI */
+
/*
* Determine if we were loaded by an EFI loader. If so, then we have also been
* passed the efi memmap, systab, etc., so we should use these data structures
diff --git a/include/linux/heki.h b/include/linux/heki.h
new file mode 100644
index 000000000000..e4a3192ba687
--- /dev/null
+++ b/include/linux/heki.h
@@ -0,0 +1,90 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Hypervisor Enforced Kernel Integrity (Heki) - Headers
+ *
+ * Copyright © 2023 Microsoft Corporation
+ */
+
+#ifndef __HEKI_H__
+#define __HEKI_H__
+
+#ifdef CONFIG_HEKI
+
+#include <linux/kvm_types.h>
+
+/* Heki attributes for memory pages. */
+/* clang-format off */
+#define HEKI_ATTR_MEM_NOWRITE (1ULL << 0)
+#define HEKI_ATTR_MEM_EXEC (1ULL << 1)
+/* clang-format on */
+
+/*
+ * heki_va_range is used to specify a virtual address range within the kernel
+ * address space along with its attributes.
+ */
+struct heki_va_range {
+ void *va_start;
+ void *va_end;
+ u64 attributes;
+};
+
+/*
+ * heki_pa_range is passed to the VMM or hypervisor so it can be processed by
+ * the VMM or the hypervisor based on range attributes. Examples of ranges:
+ *
+ * - a range whose permissions need to be set in the host page table
+ * - a range that contains information needed for authentication
+ *
+ * When an array of these is passed to the Hypervisor or VMM, the array
+ * must be in physically contiguous memory.
+ */
+struct heki_pa_range {
+ gfn_t gfn_start;
+ gfn_t gfn_end;
+ u64 attributes;
+};
+
+/*
+ * A hypervisor that supports Heki will instantiate this structure to
+ * provide hypervisor specific functions for Heki.
+ */
+struct heki_hypervisor {
+ int (*protect_ranges)(struct heki_pa_range *ranges, int num_ranges);
+ int (*lock_crs)(void);
+};
+
+/*
+ * If the architecture supports Heki, it will initialize static_ranges in
+ * early boot.
+ *
+ * If the active hypervisor supports Heki, it will plug its heki_hypervisor
+ * pointer into this heki structure.
+ */
+struct heki {
+ struct heki_pa_range *static_ranges;
+ int num_static_ranges;
+ struct heki_hypervisor *hypervisor;
+};
+
+extern struct heki heki;
+
+void heki_early_init(void);
+void heki_arch_init(void);
+void heki_late_init(void);
+
+struct heki_pa_range *heki_alloc_pa_ranges(struct heki_va_range *va_ranges,
+ int num_ranges);
+void heki_free_pa_ranges(struct heki_pa_range *pa_ranges, int num_ranges);
+
+#else /* !CONFIG_HEKI */
+
+static inline void heki_early_init(void)
+{
+}
+static inline void heki_late_init(void)
+{
+}
+
+#endif /* CONFIG_HEKI */
+
+#endif /* __HEKI_H__ */
diff --git a/init/main.c b/init/main.c
index e1c3911d7c70..8649dbb07f18 100644
--- a/init/main.c
+++ b/init/main.c
@@ -102,6 +102,7 @@
#include <linux/stackdepot.h>
#include <linux/randomize_kstack.h>
#include <net/net_namespace.h>
+#include <linux/heki.h>

#include <asm/io.h>
#include <asm/bugs.h>
@@ -999,6 +1000,7 @@ asmlinkage __visible void __init __no_sanitize_address start_kernel(void)
sort_main_extable();
trap_init();
mm_init();
+ heki_early_init();
poking_init();
ftrace_init();

@@ -1530,6 +1532,7 @@ static int __ref kernel_init(void *unused)
exit_boot_config();
free_initmem();
mark_readonly();
+ heki_late_init();

/*
* Kernel mappings are now finalized - update the userspace page-table
diff --git a/virt/Makefile b/virt/Makefile
index 1cfea9436af9..4550dc624466 100644
--- a/virt/Makefile
+++ b/virt/Makefile
@@ -1,2 +1,3 @@
# SPDX-License-Identifier: GPL-2.0-only
obj-y += lib/
+obj-$(CONFIG_HEKI) += heki/
diff --git a/virt/heki/Kconfig b/virt/heki/Kconfig
new file mode 100644
index 000000000000..9858a827fe17
--- /dev/null
+++ b/virt/heki/Kconfig
@@ -0,0 +1,22 @@
+# SPDX-License-Identifier: GPL-2.0
+#
+# Hypervisor Enforced Kernel Integrity (HEKI)
+#
+
+config HEKI
+ bool "Hypervisor Enforced Kernel Integrity (Heki)"
+ default y
+ depends on !JUMP_LABEL && ARCH_SUPPORTS_HEKI
+ select KVM_EXTERNAL_WRITE_TRACKING if KVM
+ help
+ This feature enhances guest virtual machine security by taking
+ advantage of security features provided by the hypervisor for guests.
+ This feature is helpful in maintaining guest virtual machine security
+ even after the guest kernel has been compromised.
+
+config ARCH_SUPPORTS_HEKI
+ bool "Architecture support for HEKI"
+ help
+ An architecture should select this when it can successfully build
+ and run with CONFIG_HEKI. That is, it should provide all of the
+ architecture support required for the HEKI feature.
diff --git a/virt/heki/Makefile b/virt/heki/Makefile
new file mode 100644
index 000000000000..2bc2061c9dfc
--- /dev/null
+++ b/virt/heki/Makefile
@@ -0,0 +1,3 @@
+# SPDX-License-Identifier: GPL-2.0-only
+
+obj-y += heki.o
diff --git a/virt/heki/heki.c b/virt/heki/heki.c
new file mode 100644
index 000000000000..c8cb1b84cceb
--- /dev/null
+++ b/virt/heki/heki.c
@@ -0,0 +1,135 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Hypervisor Enforced Kernel Integrity (Heki) - Common code
+ *
+ * Copyright © 2023 Microsoft Corporation
+ */
+
+#include <linux/cache.h>
+#include <linux/heki.h>
+#include <linux/kernel.h>
+#include <linux/mm.h>
+#include <linux/printk.h>
+#include <linux/types.h>
+#include <linux/vmalloc.h>
+
+#ifdef pr_fmt
+#undef pr_fmt
+#endif
+
+#define pr_fmt(fmt) "heki-guest: " fmt
+
+static bool heki_enabled __ro_after_init = true;
+
+struct heki heki = {};
+
+struct heki_pa_range *heki_alloc_pa_ranges(struct heki_va_range *va_ranges,
+ int num_ranges)
+{
+ struct heki_pa_range *pa_ranges, *pa_range;
+ struct heki_va_range *va_range;
+ u64 attributes;
+ size_t size;
+ int i;
+
+ size = PAGE_ALIGN(sizeof(struct heki_pa_range) * num_ranges);
+ pa_ranges = alloc_pages_exact(size, GFP_KERNEL);
+ if (!pa_ranges)
+ return NULL;
+
+ for (i = 0; i < num_ranges; i++) {
+ va_range = &va_ranges[i];
+ pa_range = &pa_ranges[i];
+
+ pa_range->gfn_start = PFN_DOWN(__pa_symbol(va_range->va_start));
+ pa_range->gfn_end = PFN_UP(__pa_symbol(va_range->va_end)) - 1;
+ pa_range->attributes = va_range->attributes;
+
+ /*
+ * WARNING:
+ * Leaks addresses, should only be kept for development.
+ */
+ attributes = pa_range->attributes;
+ pr_warn("Configuring GFN 0x%llx-0x%llx with %s\n",
+ pa_range->gfn_start, pa_range->gfn_end,
+ (attributes & HEKI_ATTR_MEM_NOWRITE) ? "[nowrite]" :
+ "");
+ }
+
+ return pa_ranges;
+}
+
+void heki_free_pa_ranges(struct heki_pa_range *pa_ranges, int num_ranges)
+{
+ size_t size;
+
+ size = PAGE_ALIGN(sizeof(struct heki_pa_range) * num_ranges);
+ free_pages_exact(pa_ranges, size);
+}
+
+void __init heki_early_init(void)
+{
+ if (!heki_enabled) {
+ pr_warn("Disabled\n");
+ return;
+ }
+ pr_warn("Enabled\n");
+
+ heki_arch_init();
+}
+
+void heki_late_init(void)
+{
+ struct heki_hypervisor *hypervisor = heki.hypervisor;
+ int ret;
+
+ if (!heki_enabled)
+ return;
+
+ if (!heki.static_ranges) {
+ pr_warn("Architecture did not initialize static ranges\n");
+ return;
+ }
+
+ /*
+ * Hypervisor support will be added in the future. When it is, the
+ * hypervisor will be used to protect guest kernel memory and
+ * control registers.
+ */
+
+ if (!hypervisor) {
+ /* This happens for kernels running on bare metal as well. */
+ pr_warn("No hypervisor support\n");
+ goto out;
+ }
+
+ /* Protects statically defined sections in the host page table. */
+ ret = hypervisor->protect_ranges(heki.static_ranges,
+ heki.num_static_ranges);
+ if (WARN(ret, "Failed to protect static sections: %d\n", ret))
+ goto out;
+ pr_warn("Static sections protected\n");
+
+ /*
+ * Locks control registers so a compromised guest cannot change
+ * them.
+ */
+ ret = hypervisor->lock_crs();
+ if (WARN(ret, "Failed to lock control registers: %d\n", ret))
+ goto out;
+ pr_warn("Control registers locked\n");
+
+out:
+ heki_free_pa_ranges(heki.static_ranges, heki.num_static_ranges);
+ heki.static_ranges = NULL;
+ heki.num_static_ranges = 0;
+}
+
+static int __init heki_parse_config(char *str)
+{
+ if (strtobool(str, &heki_enabled))
+ pr_warn("Invalid option string for heki: '%s'\n", str);
+ return 1;
+}
+
+__setup("heki=", heki_parse_config);
--
2.40.1

2023-05-05 15:44:55

by Mickaël Salaün

Subject: [PATCH v1 7/9] KVM: VMX: Add MBEC support

This change adds support for VMX_FEATURE_MODE_BASED_EPT_EXEC (named
ept_mode_based_exec in /proc/cpuinfo and MBEC elsewhere), which makes
it possible to use separate EPT execution bits for supervisor and user
accesses. It changes the semantics of VMX_EPT_EXECUTABLE_MASK from
global execution to kernel execution, and uses the
VMX_EPT_USER_EXECUTABLE_MASK bit to identify user execution.

The main use case is to be able to restrict kernel execution while
ignoring user space execution from the hypervisor point of view.
Indeed, user space execution can already be restricted by the guest
kernel.

This change enables MBEC but doesn't alter the default configuration,
which is to allow execution of all guest memory. However, the next
commit leverages MBEC to restrict kernel memory pages.

MBEC can be configured with the new "enable_mbec" module parameter, set
to true by default. However, MBEC is disabled for L1 and L2 for now.

Replace EPT_VIOLATION_RWX_MASK (3 bits) with 4 dedicated
EPT_VIOLATION_READ, EPT_VIOLATION_WRITE, EPT_VIOLATION_KERNEL_INSTR, and
EPT_VIOLATION_USER_INSTR bits.

From the Intel 64 and IA-32 Architectures Software Developer's Manual,
Volume 3C (System Programming Guide), Part 3:

SECONDARY_EXEC_MODE_BASED_EPT_EXEC (bit 22):
If either the "unrestricted guest" VM-execution control or the
"mode-based execute control for EPT" VM-execution control is 1, the
"enable EPT" VM-execution control must also be 1.

EPT_VIOLATION_KERNEL_INSTR_BIT (bit 5):
The logical-AND of bit 2 in the EPT paging-structure entries used to
translate the guest-physical address of the access causing the EPT
violation. If the "mode-based execute control for EPT" VM-execution
control is 0, this indicates whether the guest-physical address was
executable. If that control is 1, this indicates whether the
guest-physical address was executable for supervisor-mode linear
addresses.

EPT_VIOLATION_USER_INSTR_BIT (bit 6):
If the "mode-based execute control" VM-execution control is 0, the value
of this bit is undefined. If that control is 1, this bit is the
logical-AND of bit 10 in the EPT paging-structure entries used to
translate the guest-physical address of the access causing the EPT
violation. In this case, it indicates whether the guest-physical address
was executable for user-mode linear addresses.

PT_USER_EXEC_MASK (bit 10):
Execute access for user-mode linear addresses. If the "mode-based
execute control for EPT" VM-execution control is 1, indicates whether
instruction fetches are allowed from user-mode linear addresses in the
512-GByte region controlled by this entry. If that control is 0, this
bit is ignored.

Cc: Borislav Petkov <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Madhavan T. Venkataraman <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: Sean Christopherson <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vitaly Kuznetsov <[email protected]>
Cc: Wanpeng Li <[email protected]>
Signed-off-by: Mickaël Salaün <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
arch/x86/include/asm/vmx.h | 11 +++++++++--
arch/x86/kvm/mmu.h | 3 ++-
arch/x86/kvm/mmu/mmu.c | 6 +++++-
arch/x86/kvm/mmu/paging_tmpl.h | 16 ++++++++++++++--
arch/x86/kvm/mmu/spte.c | 4 +++-
arch/x86/kvm/vmx/capabilities.h | 7 +++++++
arch/x86/kvm/vmx/nested.c | 7 +++++++
arch/x86/kvm/vmx/vmx.c | 28 +++++++++++++++++++++++++---
arch/x86/kvm/vmx/vmx.h | 1 +
9 files changed, 73 insertions(+), 10 deletions(-)

diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index 498dc600bd5c..452e7d153832 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -511,6 +511,7 @@ enum vmcs_field {
#define VMX_EPT_IPAT_BIT (1ull << 6)
#define VMX_EPT_ACCESS_BIT (1ull << 8)
#define VMX_EPT_DIRTY_BIT (1ull << 9)
+#define VMX_EPT_USER_EXECUTABLE_MASK (1ull << 10)
#define VMX_EPT_RWX_MASK (VMX_EPT_READABLE_MASK | \
VMX_EPT_WRITABLE_MASK | \
VMX_EPT_EXECUTABLE_MASK)
@@ -556,13 +557,19 @@ enum vm_entry_failure_code {
#define EPT_VIOLATION_ACC_READ_BIT 0
#define EPT_VIOLATION_ACC_WRITE_BIT 1
#define EPT_VIOLATION_ACC_INSTR_BIT 2
-#define EPT_VIOLATION_RWX_SHIFT 3
+#define EPT_VIOLATION_READ_BIT 3
+#define EPT_VIOLATION_WRITE_BIT 4
+#define EPT_VIOLATION_KERNEL_INSTR_BIT 5
+#define EPT_VIOLATION_USER_INSTR_BIT 6
#define EPT_VIOLATION_GVA_IS_VALID_BIT 7
#define EPT_VIOLATION_GVA_TRANSLATED_BIT 8
#define EPT_VIOLATION_ACC_READ (1 << EPT_VIOLATION_ACC_READ_BIT)
#define EPT_VIOLATION_ACC_WRITE (1 << EPT_VIOLATION_ACC_WRITE_BIT)
#define EPT_VIOLATION_ACC_INSTR (1 << EPT_VIOLATION_ACC_INSTR_BIT)
-#define EPT_VIOLATION_RWX_MASK (VMX_EPT_RWX_MASK << EPT_VIOLATION_RWX_SHIFT)
+#define EPT_VIOLATION_READ (1 << EPT_VIOLATION_READ_BIT)
+#define EPT_VIOLATION_WRITE (1 << EPT_VIOLATION_WRITE_BIT)
+#define EPT_VIOLATION_KERNEL_INSTR (1 << EPT_VIOLATION_KERNEL_INSTR_BIT)
+#define EPT_VIOLATION_USER_INSTR (1 << EPT_VIOLATION_USER_INSTR_BIT)
#define EPT_VIOLATION_GVA_IS_VALID (1 << EPT_VIOLATION_GVA_IS_VALID_BIT)
#define EPT_VIOLATION_GVA_TRANSLATED (1 << EPT_VIOLATION_GVA_TRANSLATED_BIT)

diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 6bdaacb6faa0..3c4fd4618cc1 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -24,6 +24,7 @@ extern bool __read_mostly enable_mmio_caching;
#define PT_PAGE_SIZE_MASK (1ULL << PT_PAGE_SIZE_SHIFT)
#define PT_PAT_MASK (1ULL << 7)
#define PT_GLOBAL_MASK (1ULL << 8)
+#define PT_USER_EXEC_MASK (1ULL << 10)
#define PT64_NX_SHIFT 63
#define PT64_NX_MASK (1ULL << PT64_NX_SHIFT)

@@ -102,7 +103,7 @@ static inline u8 kvm_get_shadow_phys_bits(void)

void kvm_mmu_set_mmio_spte_mask(u64 mmio_value, u64 mmio_mask, u64 access_mask);
void kvm_mmu_set_me_spte_mask(u64 me_value, u64 me_mask);
-void kvm_mmu_set_ept_masks(bool has_ad_bits, bool has_exec_only);
+void kvm_mmu_set_ept_masks(bool has_ad_bits, bool has_exec_only, bool has_mbec);

void kvm_init_mmu(struct kvm_vcpu *vcpu);
void kvm_init_shadow_npt_mmu(struct kvm_vcpu *vcpu, unsigned long cr0,
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index e5d1e241ff0f..a47e63217eb8 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -27,6 +27,9 @@
#include "cpuid.h"
#include "spte.h"

+/* Required by paging_tmpl.h for enable_mbec */
+#include "../vmx/capabilities.h"
+
#include <linux/kvm_host.h>
#include <linux/types.h>
#include <linux/string.h>
@@ -3763,7 +3766,8 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
*/
pm_mask = PT_PRESENT_MASK | shadow_me_value;
if (mmu->root_role.level >= PT64_ROOT_4LEVEL) {
- pm_mask |= PT_ACCESSED_MASK | PT_WRITABLE_MASK | PT_USER_MASK;
+ pm_mask |= PT_ACCESSED_MASK | PT_WRITABLE_MASK | PT_USER_MASK |
+ PT_USER_EXEC_MASK;

if (WARN_ON_ONCE(!mmu->pml4_root)) {
r = -EIO;
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index 0f6455072055..12119d519c77 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -498,8 +498,20 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker,
* Note, pte_access holds the raw RWX bits from the EPTE, not
* ACC_*_MASK flags!
*/
- vcpu->arch.exit_qualification |= (pte_access & VMX_EPT_RWX_MASK) <<
- EPT_VIOLATION_RWX_SHIFT;
+ vcpu->arch.exit_qualification |=
+ !!(pte_access & VMX_EPT_READABLE_MASK)
+ << EPT_VIOLATION_READ_BIT;
+ vcpu->arch.exit_qualification |=
+ !!(pte_access & VMX_EPT_WRITABLE_MASK)
+ << EPT_VIOLATION_WRITE_BIT;
+ vcpu->arch.exit_qualification |=
+ !!(pte_access & VMX_EPT_EXECUTABLE_MASK)
+ << EPT_VIOLATION_KERNEL_INSTR_BIT;
+ if (enable_mbec) {
+ vcpu->arch.exit_qualification |=
+ !!(pte_access & VMX_EPT_USER_EXECUTABLE_MASK)
+ << EPT_VIOLATION_USER_INSTR_BIT;
+ }
}
#endif
walker->fault.address = addr;
diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index 639f220a1ed5..f1e2e3cad878 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -430,13 +430,15 @@ void kvm_mmu_set_me_spte_mask(u64 me_value, u64 me_mask)
}
EXPORT_SYMBOL_GPL(kvm_mmu_set_me_spte_mask);

-void kvm_mmu_set_ept_masks(bool has_ad_bits, bool has_exec_only)
+void kvm_mmu_set_ept_masks(bool has_ad_bits, bool has_exec_only, bool has_mbec)
{
shadow_user_mask = VMX_EPT_READABLE_MASK;
shadow_accessed_mask = has_ad_bits ? VMX_EPT_ACCESS_BIT : 0ull;
shadow_dirty_mask = has_ad_bits ? VMX_EPT_DIRTY_BIT : 0ull;
shadow_nx_mask = 0ull;
shadow_x_mask = VMX_EPT_EXECUTABLE_MASK;
+ if (has_mbec)
+ shadow_x_mask |= VMX_EPT_USER_EXECUTABLE_MASK;
shadow_present_mask = has_exec_only ? 0ull : VMX_EPT_READABLE_MASK;
/*
* EPT overrides the host MTRRs, and so KVM must program the desired
diff --git a/arch/x86/kvm/vmx/capabilities.h b/arch/x86/kvm/vmx/capabilities.h
index cd2ac9536c99..2cc5d7d20144 100644
--- a/arch/x86/kvm/vmx/capabilities.h
+++ b/arch/x86/kvm/vmx/capabilities.h
@@ -13,6 +13,7 @@ extern bool __read_mostly enable_vpid;
extern bool __read_mostly flexpriority_enabled;
extern bool __read_mostly enable_ept;
extern bool __read_mostly enable_unrestricted_guest;
+extern bool __read_mostly enable_mbec;
extern bool __read_mostly enable_ept_ad_bits;
extern bool __read_mostly enable_pml;
extern bool __read_mostly enable_ipiv;
@@ -255,6 +256,12 @@ static inline bool cpu_has_vmx_xsaves(void)
SECONDARY_EXEC_XSAVES;
}

+static inline bool cpu_has_vmx_mbec(void)
+{
+ return vmcs_config.cpu_based_2nd_exec_ctrl &
+ SECONDARY_EXEC_MODE_BASED_EPT_EXEC;
+}
+
static inline bool cpu_has_vmx_waitpkg(void)
{
return vmcs_config.cpu_based_2nd_exec_ctrl &
diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index d93c715cda6a..3c381c75e2a9 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -2317,6 +2317,9 @@ static void prepare_vmcs02_early(struct vcpu_vmx *vmx, struct loaded_vmcs *vmcs0
/* VMCS shadowing for L2 is emulated for now */
exec_control &= ~SECONDARY_EXEC_SHADOW_VMCS;

+ /* MBEC is currently only handled for L0. */
+ exec_control &= ~SECONDARY_EXEC_MODE_BASED_EPT_EXEC;
+
/*
* Preset *DT exiting when emulating UMIP, so that vmx_set_cr4()
* will not have to rewrite the controls just for this bit.
@@ -6870,6 +6873,10 @@ void nested_vmx_setup_ctls_msrs(struct vmcs_config *vmcs_conf, u32 ept_caps)
*/
msrs->secondary_ctls_low = 0;

+ /*
+ * Currently, SECONDARY_EXEC_MODE_BASED_EPT_EXEC is only handled for
+ * L0 and doesn't need to be exposed to L1 nor L2.
+ */
msrs->secondary_ctls_high = vmcs_conf->cpu_based_2nd_exec_ctrl;
msrs->secondary_ctls_high &=
SECONDARY_EXEC_DESC |
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 931688edc8eb..004fd4e5e057 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -94,6 +94,9 @@ bool __read_mostly enable_unrestricted_guest = 1;
module_param_named(unrestricted_guest,
enable_unrestricted_guest, bool, S_IRUGO);

+bool __read_mostly enable_mbec = true;
+module_param_named(mbec, enable_mbec, bool, 0444);
+
bool __read_mostly enable_ept_ad_bits = 1;
module_param_named(eptad, enable_ept_ad_bits, bool, S_IRUGO);

@@ -4518,10 +4521,21 @@ static u32 vmx_secondary_exec_control(struct vcpu_vmx *vmx)
exec_control &= ~SECONDARY_EXEC_ENABLE_VPID;
if (!enable_ept) {
exec_control &= ~SECONDARY_EXEC_ENABLE_EPT;
+ /*
+ * From Intel's SDM:
+ * If either the "unrestricted guest" VM-execution control or
+ * the "mode-based execute control for EPT" VM-execution
+ * control is 1, the "enable EPT" VM-execution control must
+ * also be 1.
+ */
enable_unrestricted_guest = 0;
+ enable_mbec = false;
}
if (!enable_unrestricted_guest)
exec_control &= ~SECONDARY_EXEC_UNRESTRICTED_GUEST;
+ if (!enable_mbec)
+ exec_control &= ~SECONDARY_EXEC_MODE_BASED_EPT_EXEC;
+
if (kvm_pause_in_guest(vmx->vcpu.kvm))
exec_control &= ~SECONDARY_EXEC_PAUSE_LOOP_EXITING;
if (!kvm_vcpu_apicv_active(vcpu))
@@ -5658,7 +5672,7 @@ static int handle_task_switch(struct kvm_vcpu *vcpu)

static int handle_ept_violation(struct kvm_vcpu *vcpu)
{
- unsigned long exit_qualification;
+ unsigned long exit_qualification, rwx_mask;
gpa_t gpa;
u64 error_code;

@@ -5688,7 +5702,11 @@ static int handle_ept_violation(struct kvm_vcpu *vcpu)
error_code |= (exit_qualification & EPT_VIOLATION_ACC_INSTR)
? PFERR_FETCH_MASK : 0;
/* ept page table entry is present? */
- error_code |= (exit_qualification & EPT_VIOLATION_RWX_MASK)
+ rwx_mask = EPT_VIOLATION_READ | EPT_VIOLATION_WRITE |
+ EPT_VIOLATION_KERNEL_INSTR;
+ if (enable_mbec)
+ rwx_mask |= EPT_VIOLATION_USER_INSTR;
+ error_code |= (exit_qualification & rwx_mask)
? PFERR_PRESENT_MASK : 0;

error_code |= (exit_qualification & EPT_VIOLATION_GVA_TRANSLATED) != 0 ?
@@ -8345,6 +8363,9 @@ static __init int hardware_setup(void)
if (!cpu_has_vmx_unrestricted_guest() || !enable_ept)
enable_unrestricted_guest = 0;

+ if (!cpu_has_vmx_mbec() || !enable_ept)
+ enable_mbec = false;
+
if (!cpu_has_vmx_flexpriority())
flexpriority_enabled = 0;

@@ -8404,7 +8425,8 @@ static __init int hardware_setup(void)

if (enable_ept)
kvm_mmu_set_ept_masks(enable_ept_ad_bits,
- cpu_has_vmx_ept_execute_only());
+ cpu_has_vmx_ept_execute_only(),
+ enable_mbec);

/*
* Setup shadow_me_value/shadow_me_mask to include MKTME KeyID
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index a3da84f4ea45..815db44cd51e 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -585,6 +585,7 @@ static inline u8 vmx_get_rvi(void)
SECONDARY_EXEC_ENABLE_VMFUNC | \
SECONDARY_EXEC_BUS_LOCK_DETECTION | \
SECONDARY_EXEC_NOTIFY_VM_EXITING | \
+ SECONDARY_EXEC_MODE_BASED_EPT_EXEC | \
SECONDARY_EXEC_ENCLS_EXITING)

#define KVM_REQUIRED_VMX_TERTIARY_VM_EXEC_CONTROL 0
--
2.40.1

2023-05-05 15:46:41

by Mickaël Salaün

Subject: [PATCH v1 2/9] KVM: x86/mmu: Add support for prewrite page tracking

Add a new page tracking mode to deny a page update and inject a page
fault into the guest. This is useful for KVM to be able to make some
pages non-writable (not read-only, since non-writable does not imply
execution restrictions); see the next Heki commits.

This kind of synthetic kernel page fault needs to be handled by the
guest, which is not currently the case, so the guest retries the
faulting access again and again. This will be addressed in a follow-up
patch series.

Update emulator_read_write_onepage() to handle X86EMUL_CONTINUE and
X86EMUL_PROPAGATE_FAULT.

Update page_fault_handle_page_track() to call
kvm_slot_page_track_is_active() whenever this is required for
KVM_PAGE_TRACK_PREWRITE and KVM_PAGE_TRACK_WRITE, even if one tracker
already returned true.

Invert the return code semantic for read_emulate() and write_emulate():
- from 1=Ok 0=Error
- to X86EMUL_* return codes (e.g. X86EMUL_CONTINUE == 0)

Imported the prewrite page tracking support part originally written by
Mihai Donțu, Marian Rotariu, and Ștefan Șicleru:
https://lore.kernel.org/r/[email protected]
https://lore.kernel.org/r/[email protected]
Removed the GVA changes for page tracking, removed the
X86EMUL_RETRY_INSTR case, and some emulation part for now.

Cc: Borislav Petkov <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Madhavan T. Venkataraman <[email protected]>
Cc: Marian Rotariu <[email protected]>
Cc: Mihai Donțu <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: Sean Christopherson <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vitaly Kuznetsov <[email protected]>
Cc: Wanpeng Li <[email protected]>
Cc: Ștefan Șicleru <[email protected]>
Signed-off-by: Mickaël Salaün <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
arch/x86/include/asm/kvm_page_track.h | 12 +++++
arch/x86/kvm/mmu/mmu.c | 64 ++++++++++++++++++++++-----
arch/x86/kvm/mmu/page_track.c | 33 +++++++++++++-
arch/x86/kvm/mmu/spte.c | 6 +++
arch/x86/kvm/x86.c | 27 +++++++----
5 files changed, 122 insertions(+), 20 deletions(-)

diff --git a/arch/x86/include/asm/kvm_page_track.h b/arch/x86/include/asm/kvm_page_track.h
index eb186bc57f6a..a7fb4ff888e6 100644
--- a/arch/x86/include/asm/kvm_page_track.h
+++ b/arch/x86/include/asm/kvm_page_track.h
@@ -3,6 +3,7 @@
#define _ASM_X86_KVM_PAGE_TRACK_H

enum kvm_page_track_mode {
+ KVM_PAGE_TRACK_PREWRITE,
KVM_PAGE_TRACK_WRITE,
KVM_PAGE_TRACK_MAX,
};
@@ -22,6 +23,16 @@ struct kvm_page_track_notifier_head {
struct kvm_page_track_notifier_node {
struct hlist_node node;

+ /*
+ * It is called when guest is writing the write-tracked page
+ * and the write emulation hasn't happened yet.
+ *
+ * @vcpu: the vcpu where the write access happened
+ * @gpa: the physical address written by guest
+ * @node: this node
+ */
+ bool (*track_prewrite)(struct kvm_vcpu *vcpu, gpa_t gpa,
+ struct kvm_page_track_notifier_node *node);
/*
* It is called when guest is writing the write-tracked page
* and write emulation is finished at that time.
@@ -73,6 +84,7 @@ kvm_page_track_register_notifier(struct kvm *kvm,
void
kvm_page_track_unregister_notifier(struct kvm *kvm,
struct kvm_page_track_notifier_node *n);
+bool kvm_page_track_prewrite(struct kvm_vcpu *vcpu, gpa_t gpa);
void kvm_page_track_write(struct kvm_vcpu *vcpu, gpa_t gpa, const u8 *new,
int bytes);
void kvm_page_track_flush_slot(struct kvm *kvm, struct kvm_memory_slot *slot);
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 835426254e76..e5d1e241ff0f 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -793,9 +793,13 @@ static void account_shadowed(struct kvm *kvm, struct kvm_mmu_page *sp)
slot = __gfn_to_memslot(slots, gfn);

/* the non-leaf shadow pages are keeping readonly. */
- if (sp->role.level > PG_LEVEL_4K)
- return kvm_slot_page_track_add_page(kvm, slot, gfn,
- KVM_PAGE_TRACK_WRITE);
+ if (sp->role.level > PG_LEVEL_4K) {
+ kvm_slot_page_track_add_page(kvm, slot, gfn,
+ KVM_PAGE_TRACK_PREWRITE);
+ kvm_slot_page_track_add_page(kvm, slot, gfn,
+ KVM_PAGE_TRACK_WRITE);
+ return;
+ }

kvm_mmu_gfn_disallow_lpage(slot, gfn);

@@ -840,9 +844,13 @@ static void unaccount_shadowed(struct kvm *kvm, struct kvm_mmu_page *sp)
gfn = sp->gfn;
slots = kvm_memslots_for_spte_role(kvm, sp->role);
slot = __gfn_to_memslot(slots, gfn);
- if (sp->role.level > PG_LEVEL_4K)
- return kvm_slot_page_track_remove_page(kvm, slot, gfn,
- KVM_PAGE_TRACK_WRITE);
+ if (sp->role.level > PG_LEVEL_4K) {
+ kvm_slot_page_track_remove_page(kvm, slot, gfn,
+ KVM_PAGE_TRACK_PREWRITE);
+ kvm_slot_page_track_remove_page(kvm, slot, gfn,
+ KVM_PAGE_TRACK_WRITE);
+ return;
+ }

kvm_mmu_gfn_allow_lpage(slot, gfn);
}
@@ -2714,7 +2722,10 @@ int mmu_try_to_unsync_pages(struct kvm *kvm, const struct kvm_memory_slot *slot,
* track machinery is used to write-protect upper-level shadow pages,
* i.e. this guards the role.level == 4K assertion below!
*/
- if (kvm_slot_page_track_is_active(kvm, slot, gfn, KVM_PAGE_TRACK_WRITE))
+ if (kvm_slot_page_track_is_active(kvm, slot, gfn,
+ KVM_PAGE_TRACK_PREWRITE) ||
+ kvm_slot_page_track_is_active(kvm, slot, gfn,
+ KVM_PAGE_TRACK_WRITE))
return -EPERM;

/*
@@ -4103,6 +4114,8 @@ static int handle_mmio_page_fault(struct kvm_vcpu *vcpu, u64 addr, bool direct)
static bool page_fault_handle_page_track(struct kvm_vcpu *vcpu,
struct kvm_page_fault *fault)
{
+ bool ret = false;
+
if (unlikely(fault->rsvd))
return false;

@@ -4113,10 +4126,14 @@ static bool page_fault_handle_page_track(struct kvm_vcpu *vcpu,
* guest is writing the page which is write tracked which can
* not be fixed by page fault handler.
*/
- if (kvm_slot_page_track_is_active(vcpu->kvm, fault->slot, fault->gfn, KVM_PAGE_TRACK_WRITE))
- return true;
+ ret = kvm_slot_page_track_is_active(vcpu->kvm, fault->slot, fault->gfn,
+ KVM_PAGE_TRACK_PREWRITE) ||
+ ret;
+ ret = kvm_slot_page_track_is_active(vcpu->kvm, fault->slot, fault->gfn,
+ KVM_PAGE_TRACK_WRITE) ||
+ ret;

- return false;
+ return ret;
}

static void shadow_page_table_clear_flood(struct kvm_vcpu *vcpu, gva_t addr)
@@ -5600,6 +5617,33 @@ int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err
if (r != RET_PF_EMULATE)
return 1;

+ if ((error_code & PFERR_WRITE_MASK) &&
+ !kvm_page_track_prewrite(vcpu, cr2_or_gpa)) {
+ struct x86_exception fault = {
+ .vector = PF_VECTOR,
+ .error_code_valid = true,
+ .error_code = error_code,
+ .nested_page_fault = false,
+ /*
+ * TODO: This kind of kernel page fault needs to be handled by
+ * the guest, which is not currently the case, making it try
+ * again and again.
+ *
+ * You may want to test with cr2_or_gva to see the page
+ * fault caught by the guest kernel (thinking it is a
+ * user space fault).
+ */
+ .address = static_call(kvm_x86_fault_gva)(vcpu),
+ .async_page_fault = false,
+ };
+
+ pr_warn_ratelimited(
+ "heki-kvm: Creating write #PF at 0x%016llx\n",
+ fault.address);
+ kvm_inject_page_fault(vcpu, &fault);
+ return RET_PF_INVALID;
+ }
+
/*
* Before emulating the instruction, check if the error code
* was due to a RO violation while translating the guest page.
diff --git a/arch/x86/kvm/mmu/page_track.c b/arch/x86/kvm/mmu/page_track.c
index 2e09d1b6249f..2454887cd48b 100644
--- a/arch/x86/kvm/mmu/page_track.c
+++ b/arch/x86/kvm/mmu/page_track.c
@@ -131,9 +131,10 @@ void kvm_slot_page_track_add_page(struct kvm *kvm,
*/
kvm_mmu_gfn_disallow_lpage(slot, gfn);

- if (mode == KVM_PAGE_TRACK_WRITE)
+ if (mode == KVM_PAGE_TRACK_PREWRITE || mode == KVM_PAGE_TRACK_WRITE) {
if (kvm_mmu_slot_gfn_write_protect(kvm, slot, gfn, PG_LEVEL_4K))
kvm_flush_remote_tlbs(kvm);
+ }
}
EXPORT_SYMBOL_GPL(kvm_slot_page_track_add_page);

@@ -248,6 +249,36 @@ kvm_page_track_unregister_notifier(struct kvm *kvm,
}
EXPORT_SYMBOL_GPL(kvm_page_track_unregister_notifier);

+/*
+ * Notify the node that a write access is about to happen. Returning false
+ * doesn't stop the other nodes from being called, but it will stop
+ * the emulation.
+ *
+ * Each node should determine by itself whether the written page is one it
+ * is tracking.
+ */
+bool kvm_page_track_prewrite(struct kvm_vcpu *vcpu, gpa_t gpa)
+{
+ struct kvm_page_track_notifier_head *head;
+ struct kvm_page_track_notifier_node *n;
+ int idx;
+ bool ret = true;
+
+ head = &vcpu->kvm->arch.track_notifier_head;
+
+ if (hlist_empty(&head->track_notifier_list))
+ return ret;
+
+ idx = srcu_read_lock(&head->track_srcu);
+ hlist_for_each_entry_srcu(n, &head->track_notifier_list, node,
+ srcu_read_lock_held(&head->track_srcu))
+ if (n->track_prewrite)
+ if (!n->track_prewrite(vcpu, gpa, n))
+ ret = false;
+ srcu_read_unlock(&head->track_srcu, idx);
+ return ret;
+}
+
/*
* Notify the node that write access is intercepted and write emulation is
* finished at this time.
diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index c0fd7e049b4e..639f220a1ed5 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -144,6 +144,12 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
u64 spte = SPTE_MMU_PRESENT_MASK;
bool wrprot = false;

+ if (kvm_slot_page_track_is_active(vcpu->kvm, slot, gfn,
+ KVM_PAGE_TRACK_PREWRITE) ||
+ kvm_slot_page_track_is_active(vcpu->kvm, slot, gfn,
+ KVM_PAGE_TRACK_WRITE))
+ pte_access &= ~ACC_WRITE_MASK;
+
WARN_ON_ONCE(!pte_access && !shadow_present_mask);

if (sp->role.ad_disabled)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index a2c299d47e69..fd05f42c9913 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -7325,6 +7325,7 @@ static int kvm_write_guest_virt_helper(gva_t addr, void *val, unsigned int bytes
r = X86EMUL_IO_NEEDED;
goto out;
}
+ kvm_page_track_write(vcpu, gpa, data, towrite);

bytes -= towrite;
data += towrite;
@@ -7441,13 +7442,12 @@ static int vcpu_mmio_gva_to_gpa(struct kvm_vcpu *vcpu, unsigned long gva,
int emulator_write_phys(struct kvm_vcpu *vcpu, gpa_t gpa,
const void *val, int bytes)
{
- int ret;
-
- ret = kvm_vcpu_write_guest(vcpu, gpa, val, bytes);
- if (ret < 0)
- return 0;
+ if (!kvm_page_track_prewrite(vcpu, gpa))
+ return X86EMUL_PROPAGATE_FAULT;
+ if (kvm_vcpu_write_guest(vcpu, gpa, val, bytes))
+ return X86EMUL_UNHANDLEABLE;
kvm_page_track_write(vcpu, gpa, val, bytes);
- return 1;
+ return X86EMUL_CONTINUE;
}

struct read_write_emulator_ops {
@@ -7477,7 +7477,9 @@ static int read_prepare(struct kvm_vcpu *vcpu, void *val, int bytes)
static int read_emulate(struct kvm_vcpu *vcpu, gpa_t gpa,
void *val, int bytes)
{
- return !kvm_vcpu_read_guest(vcpu, gpa, val, bytes);
+ if (kvm_vcpu_read_guest(vcpu, gpa, val, bytes))
+ return X86EMUL_UNHANDLEABLE;
+ return X86EMUL_CONTINUE;
}

static int write_emulate(struct kvm_vcpu *vcpu, gpa_t gpa,
@@ -7551,8 +7553,12 @@ static int emulator_read_write_onepage(unsigned long addr, void *val,
return X86EMUL_PROPAGATE_FAULT;
}

- if (!ret && ops->read_write_emulate(vcpu, gpa, val, bytes))
- return X86EMUL_CONTINUE;
+ if (!ret) {
+ ret = ops->read_write_emulate(vcpu, gpa, val, bytes);
+ if (ret != X86EMUL_UNHANDLEABLE)
+ /* Handles X86EMUL_CONTINUE and X86EMUL_PROPAGATE_FAULT. */
+ return ret;
+ }

/*
* Is this MMIO handled locally?
@@ -7689,6 +7695,9 @@ static int emulator_cmpxchg_emulated(struct x86_emulate_ctxt *ctxt,
if (kvm_is_error_hva(hva))
goto emul_write;

+ if (!kvm_page_track_prewrite(vcpu, gpa))
+ return X86EMUL_PROPAGATE_FAULT;
+
hva += offset_in_page(gpa);

switch (bytes) {
--
2.40.1

2023-05-05 15:48:04

by Mickaël Salaün

Subject: [PATCH v1 4/9] KVM: x86: Add new hypercall to set EPT permissions

Add a new KVM_HC_LOCK_MEM_PAGE_RANGES hypercall that enables a guest to
set EPT permissions on a set of page ranges.

This hypercall takes three arguments. The first is the GPA of an array
of struct heki_pa_range. The second is the size of this array in bytes,
not its number of elements. The third is reserved for future use and
designed to contain optional flags (e.g., to change the array type), but
must be zero for now.

The struct heki_pa_range contains a GFN that starts the range, another
that identifies the last (inclusive) page, and a bit field of attributes
tied to this range.

The HEKI_ATTR_MEM_NOWRITE attribute is interpreted as a removal of the
EPT write permission, denying any write access from the guest for the
rest of its lifetime. We chose "nowrite" over "read-only" because
"read-only" would exclude execution, because it follows a deny-list
approach, and most importantly because it is an incremental addition to
the status quo (i.e., everything is allowed from the TDP point of view).
This is implemented thanks to the KVM_PAGE_TRACK_PREWRITE mode
previously introduced.

Page range recording is currently implemented with a static array of 16
elements to keep things simple; this mechanism will be made dynamic in a
follow-up.

Define a kernel command line parameter "heki" to turn the feature on or
off. By default, Heki is turned on.

Cc: Borislav Petkov <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Madhavan T. Venkataraman <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: Sean Christopherson <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vitaly Kuznetsov <[email protected]>
Cc: Wanpeng Li <[email protected]>
Signed-off-by: Mickaël Salaün <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
Documentation/virt/kvm/x86/hypercalls.rst | 17 +++
arch/x86/kvm/x86.c | 169 ++++++++++++++++++++++
include/linux/kvm_host.h | 13 ++
include/uapi/linux/kvm_para.h | 1 +
virt/kvm/kvm_main.c | 4 +
5 files changed, 204 insertions(+)

diff --git a/Documentation/virt/kvm/x86/hypercalls.rst b/Documentation/virt/kvm/x86/hypercalls.rst
index 10db7924720f..0ec79cc77f53 100644
--- a/Documentation/virt/kvm/x86/hypercalls.rst
+++ b/Documentation/virt/kvm/x86/hypercalls.rst
@@ -190,3 +190,20 @@ the KVM_CAP_EXIT_HYPERCALL capability. Userspace must enable that capability
before advertising KVM_FEATURE_HC_MAP_GPA_RANGE in the guest CPUID. In
addition, if the guest supports KVM_FEATURE_MIGRATION_CONTROL, userspace
must also set up an MSR filter to process writes to MSR_KVM_MIGRATION_CONTROL.
+
+9. KVM_HC_LOCK_MEM_PAGE_RANGES
+------------------------------
+
+:Architecture: x86
+:Status: active
+:Purpose: Request memory page ranges to be restricted.
+
+- a0: physical address of a struct heki_pa_range array
+- a1: size of the array in bytes
+- a2: optional flags, must be 0 for now
+
+The hypercall lets a guest request memory permissions to be removed for itself,
+identified with a set of physical page ranges (GFNs). The HEKI_ATTR_MEM_NOWRITE
+attribute forbids any modification of the related pages by the guest.
+
+Returns 0 on success or a KVM error code otherwise.
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index fd05f42c9913..ffab64d08de3 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -59,6 +59,7 @@
#include <linux/mem_encrypt.h>
#include <linux/entry-kvm.h>
#include <linux/suspend.h>
+#include <linux/heki.h>

#include <trace/events/kvm.h>

@@ -9596,6 +9597,161 @@ static void kvm_sched_yield(struct kvm_vcpu *vcpu, unsigned long dest_id)
return;
}

+#ifdef CONFIG_HEKI
+
+static int heki_page_track_add(struct kvm *const kvm, const gfn_t gfn,
+ const enum kvm_page_track_mode mode)
+{
+ struct kvm_memory_slot *slot;
+ int idx;
+
+ BUILD_BUG_ON(!IS_ENABLED(CONFIG_KVM_EXTERNAL_WRITE_TRACKING));
+
+ idx = srcu_read_lock(&kvm->srcu);
+ slot = gfn_to_memslot(kvm, gfn);
+ if (!slot) {
+ srcu_read_unlock(&kvm->srcu, idx);
+ return -EINVAL;
+ }
+
+ write_lock(&kvm->mmu_lock);
+ kvm_slot_page_track_add_page(kvm, slot, gfn, mode);
+ write_unlock(&kvm->mmu_lock);
+ srcu_read_unlock(&kvm->srcu, idx);
+ return 0;
+}
+
+static bool
+heki_page_track_prewrite(struct kvm_vcpu *const vcpu, const gpa_t gpa,
+ struct kvm_page_track_notifier_node *const node)
+{
+ const gfn_t gfn = gpa_to_gfn(gpa);
+ const struct kvm *const kvm = vcpu->kvm;
+ size_t i;
+
+ /* Checks if it is our own tracked pages, or those of someone else. */
+ for (i = 0; i < HEKI_GFN_MAX; i++) {
+ if (gfn >= kvm->heki_gfn_no_write[i].start &&
+ gfn <= kvm->heki_gfn_no_write[i].end)
+ return false;
+ }
+
+ return true;
+}
+
+static int kvm_heki_init_vm(struct kvm *const kvm)
+{
+ struct kvm_page_track_notifier_node *const node =
+ kzalloc(sizeof(*node), GFP_KERNEL);
+
+ if (!node)
+ return -ENOMEM;
+
+ node->track_prewrite = heki_page_track_prewrite;
+ kvm_page_track_register_notifier(kvm, node);
+ return 0;
+}
+
+static bool is_gfn_overflow(unsigned long val)
+{
+ const gfn_t gfn_mask = gpa_to_gfn(~0);
+
+ return (val | gfn_mask) != gfn_mask;
+}
+
+#define HEKI_PA_RANGE_MAX_SIZE (sizeof(struct heki_pa_range) * HEKI_GFN_MAX)
+
+static int heki_lock_mem_page_ranges(struct kvm *const kvm, gpa_t mem_ranges,
+ unsigned long mem_ranges_size)
+{
+ int err;
+ size_t i, ranges_num;
+ struct heki_pa_range *ranges;
+
+ if (mem_ranges_size > HEKI_PA_RANGE_MAX_SIZE)
+ return -KVM_E2BIG;
+
+ if ((mem_ranges_size % sizeof(struct heki_pa_range)) != 0)
+ return -KVM_EINVAL;
+
+ ranges = kzalloc(mem_ranges_size, GFP_KERNEL);
+ if (!ranges)
+ return -KVM_E2BIG;
+
+ err = kvm_read_guest(kvm, mem_ranges, ranges, mem_ranges_size);
+ if (err) {
+ err = -KVM_EFAULT;
+ goto out_free_ranges;
+ }
+
+ ranges_num = mem_ranges_size / sizeof(struct heki_pa_range);
+ for (i = 0; i < ranges_num; i++) {
+ const u64 attributes_mask = HEKI_ATTR_MEM_NOWRITE;
+ const gfn_t gfn_start = ranges[i].gfn_start;
+ const gfn_t gfn_end = ranges[i].gfn_end;
+ const u64 attributes = ranges[i].attributes;
+
+ if (is_gfn_overflow(ranges[i].gfn_start)) {
+ err = -KVM_EINVAL;
+ goto out_free_ranges;
+ }
+ if (is_gfn_overflow(ranges[i].gfn_end)) {
+ err = -KVM_EINVAL;
+ goto out_free_ranges;
+ }
+ if (ranges[i].gfn_start > ranges[i].gfn_end) {
+ err = -KVM_EINVAL;
+ goto out_free_ranges;
+ }
+ if (!ranges[i].attributes) {
+ err = -KVM_EINVAL;
+ goto out_free_ranges;
+ }
+ if ((ranges[i].attributes | attributes_mask) !=
+ attributes_mask) {
+ err = -KVM_EINVAL;
+ goto out_free_ranges;
+ }
+
+ if (attributes & HEKI_ATTR_MEM_NOWRITE) {
+ unsigned long gfn;
+ size_t gfn_i;
+
+ gfn_i = atomic_dec_if_positive(
+ &kvm->heki_gfn_no_write_num);
+ if (gfn_i == 0) {
+ err = -KVM_E2BIG;
+ goto out_free_ranges;
+ }
+
+ gfn_i--;
+ kvm->heki_gfn_no_write[gfn_i].start = gfn_start;
+ kvm->heki_gfn_no_write[gfn_i].end = gfn_end;
+
+ for (gfn = gfn_start; gfn <= gfn_end; gfn++)
+ WARN_ON_ONCE(heki_page_track_add(
+ kvm, gfn, KVM_PAGE_TRACK_PREWRITE));
+ }
+
+ pr_warn("heki-kvm: Locking GFN 0x%llx-0x%llx with %s\n",
+ gfn_start, gfn_end,
+ (attributes & HEKI_ATTR_MEM_NOWRITE) ? "[nowrite]" : "");
+ }
+
+out_free_ranges:
+ kfree(ranges);
+ return err;
+}
+
+#else /* CONFIG_HEKI */
+
+static int kvm_heki_init_vm(struct kvm *const kvm)
+{
+ return 0;
+}
+
+#endif /* CONFIG_HEKI */
+
static int complete_hypercall_exit(struct kvm_vcpu *vcpu)
{
u64 ret = vcpu->run->hypercall.ret;
@@ -9694,6 +9850,15 @@ int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
vcpu->arch.complete_userspace_io = complete_hypercall_exit;
return 0;
}
+#ifdef CONFIG_HEKI
+ case KVM_HC_LOCK_MEM_PAGE_RANGES:
+ /* No flags for now. */
+ if (a2)
+ ret = -KVM_EINVAL;
+ else
+ ret = heki_lock_mem_page_ranges(vcpu->kvm, a0, a1);
+ break;
+#endif /* CONFIG_HEKI */
default:
ret = -KVM_ENOSYS;
break;
@@ -12126,6 +12291,10 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
if (ret)
goto out_page_track;

+ ret = kvm_heki_init_vm(kvm);
+ if (ret)
+ goto out_page_track;
+
ret = static_call(kvm_x86_vm_init)(kvm);
if (ret)
goto out_uninit_mmu;
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 4f26b244f6d0..39a1bdc2ba42 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -699,6 +699,13 @@ struct kvm_memslots {
int node_idx;
};

+#ifdef CONFIG_HEKI
+struct heki_gfn_range {
+ gfn_t start;
+ gfn_t end;
+};
+#endif /* CONFIG_HEKI */
+
struct kvm {
#ifdef KVM_HAVE_MMU_RWLOCK
rwlock_t mmu_lock;
@@ -801,6 +808,12 @@ struct kvm {
bool vm_bugged;
bool vm_dead;

+#ifdef CONFIG_HEKI
+#define HEKI_GFN_MAX 16
+ atomic_t heki_gfn_no_write_num;
+ struct heki_gfn_range heki_gfn_no_write[HEKI_GFN_MAX];
+#endif /* CONFIG_HEKI */
+
#ifdef CONFIG_HAVE_KVM_PM_NOTIFIER
struct notifier_block pm_notifier;
#endif
diff --git a/include/uapi/linux/kvm_para.h b/include/uapi/linux/kvm_para.h
index 960c7e93d1a9..d7512a10880e 100644
--- a/include/uapi/linux/kvm_para.h
+++ b/include/uapi/linux/kvm_para.h
@@ -30,6 +30,7 @@
#define KVM_HC_SEND_IPI 10
#define KVM_HC_SCHED_YIELD 11
#define KVM_HC_MAP_GPA_RANGE 12
+#define KVM_HC_LOCK_MEM_PAGE_RANGES 13

/*
* hypercalls use architecture specific
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 9c60384b5ae0..4aea936dfe73 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1230,6 +1230,10 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
list_add(&kvm->vm_list, &vm_list);
mutex_unlock(&kvm_lock);

+#ifdef CONFIG_HEKI
+ atomic_set(&kvm->heki_gfn_no_write_num, HEKI_GFN_MAX + 1);
+#endif /* CONFIG_HEKI */
+
preempt_notifier_inc();
kvm_init_pm_notifier(kvm);

--
2.40.1

2023-05-05 15:48:12

by Mickaël Salaün

Subject: [PATCH v1 9/9] virt: Add Heki KUnit tests

This adds a new CONFIG_HEKI_TEST option to run tests at boot time.
Because this patch series forbids the loading of kernel modules after
boot, the tests need to be built in. Furthermore, because they use
symbols not exported to modules (e.g., kernel_set_to_readonly), they
could not work as modules anyway.

To run these tests, boot the kernel with the heki_test=N argument, with
N selecting a specific test:
1. heki_test_cr_disable_smep: Check CR pinning and try to disable SMEP.
2. heki_test_write_to_const: Check .rodata (const) protection.
3. heki_test_write_to_ro_after_init: Check __ro_after_init protection.
4. heki_test_exec: Check non-executable kernel memory.

This test selection mechanism should no longer be required once the
kernel properly handles the triggered synthetic page faults. For now,
these page faults make the kernel loop.

All these tests temporarily disable the related kernel self-protections
and should therefore fail if Heki doesn't protect the kernel. They are
verbose to make it easier to understand what is going on.

Cc: Borislav Petkov <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Madhavan T. Venkataraman <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: Sean Christopherson <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vitaly Kuznetsov <[email protected]>
Cc: Wanpeng Li <[email protected]>
Signed-off-by: Mickaël Salaün <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
virt/heki/Kconfig | 12 +++
virt/heki/heki.c | 194 +++++++++++++++++++++++++++++++++++++++++++++-
2 files changed, 205 insertions(+), 1 deletion(-)

diff --git a/virt/heki/Kconfig b/virt/heki/Kconfig
index 96f18ce03013..806981f2b22d 100644
--- a/virt/heki/Kconfig
+++ b/virt/heki/Kconfig
@@ -27,3 +27,15 @@ config HYPERVISOR_SUPPORTS_HEKI
A hypervisor should select this when it can successfully build
and run with CONFIG_HEKI. That is, it should provide all of the
hypervisor support required for the Heki feature.
+
+config HEKI_TEST
+ bool "Tests for Heki" if !KUNIT_ALL_TESTS
+ depends on HEKI && KUNIT=y
+ default KUNIT_ALL_TESTS
+ help
+ Run Heki tests at runtime according to the heki_test=N boot
+ parameter, with N identifying the test to run (between 1 and 4).
+
+ Before launching the init process, the system might not respond
+ because of an unhandled kernel page fault. This will be fixed in a
+ future patch series.
diff --git a/virt/heki/heki.c b/virt/heki/heki.c
index 142b5dc98a2f..361e7734e950 100644
--- a/virt/heki/heki.c
+++ b/virt/heki/heki.c
@@ -5,11 +5,13 @@
* Copyright © 2023 Microsoft Corporation
*/

+#include <kunit/test.h>
#include <linux/cache.h>
#include <linux/heki.h>
#include <linux/kernel.h>
#include <linux/mm.h>
#include <linux/printk.h>
+#include <linux/set_memory.h>
#include <linux/types.h>
#include <linux/vmalloc.h>

@@ -78,13 +80,201 @@ void __init heki_early_init(void)
heki_arch_init();
}

+#ifdef CONFIG_HEKI_TEST
+
+/* Heki test data */
+
+/* Takes two pages to not change permission of other read-only pages. */
+const char heki_test_const_buf[PAGE_SIZE * 2] = {};
+char heki_test_ro_after_init_buf[PAGE_SIZE * 2] __ro_after_init = {};
+
+long heki_test_exec_data(long);
+void _test_exec_data_end(void);
+
+/* Used to test ROP execution against the .rodata section. */
+/* clang-format off */
+asm(
+".pushsection .rodata;" // NOT .text section
+".global heki_test_exec_data;"
+".type heki_test_exec_data, @function;"
+"heki_test_exec_data:"
+ASM_ENDBR
+"movq %rdi, %rax;"
+"inc %rax;"
+ASM_RET
+".size heki_test_exec_data, .-heki_test_exec_data;"
+"_test_exec_data_end:"
+".popsection");
+/* clang-format on */
+
+static void heki_test_cr_disable_smep(struct kunit *test)
+{
+ unsigned long cr4;
+
+ /* SMEP should be initially enabled. */
+ KUNIT_ASSERT_TRUE(test, __read_cr4() & X86_CR4_SMEP);
+
+ kunit_warn(test,
+ "Starting control register pinning tests with SMEP check\n");
+
+ /*
+ * Trying to disable SMEP, bypassing kernel self-protection by not
+ * using cr4_clear_bits(X86_CR4_SMEP).
+ */
+ cr4 = __read_cr4() & ~X86_CR4_SMEP;
+ asm volatile("mov %0,%%cr4" : "+r"(cr4) : : "memory");
+
+ /* SMEP should still be enabled. */
+ KUNIT_ASSERT_TRUE(test, __read_cr4() & X86_CR4_SMEP);
+}
+
+static inline void print_addr(struct kunit *test, const char *const buf_name,
+ void *const buf)
+{
+ const pte_t pte = *virt_to_kpte((unsigned long)buf);
+ const phys_addr_t paddr = slow_virt_to_phys(buf);
+ bool present = pte_flags(pte) & (_PAGE_PRESENT);
+ bool accessible = pte_accessible(&init_mm, pte);
+
+ kunit_warn(
+ test,
+ "%s vaddr:%llx paddr:%llx exec:%d write:%d present:%d accessible:%d\n",
+ buf_name, (unsigned long long)buf, paddr, !!pte_exec(pte),
+ !!pte_write(pte), present, accessible);
+}
+
+extern int kernel_set_to_readonly;
+
+static void heki_test_write_to_rodata(struct kunit *test,
+ const char *const buf_name,
+ char *const ro_buf)
+{
+ print_addr(test, buf_name, (void *)ro_buf);
+ KUNIT_EXPECT_EQ(test, 0, *ro_buf);
+
+ kunit_warn(
+ test,
+ "Bypassing kernel self-protection: mark memory as writable\n");
+ kernel_set_to_readonly = 0;
+ /*
+ * Removes execute permission that might be set by bugdoor-exec,
+ * because change_page_attr_clear() is not used by set_memory_rw().
+ * This is required since commit 652c5bf380ad ("x86/mm: Refuse W^X
+ * violations").
+ */
+ KUNIT_ASSERT_FALSE(test, set_memory_nx((unsigned long)PTR_ALIGN_DOWN(
+ ro_buf, PAGE_SIZE),
+ 1));
+ KUNIT_ASSERT_FALSE(test, set_memory_rw((unsigned long)PTR_ALIGN_DOWN(
+ ro_buf, PAGE_SIZE),
+ 1));
+ kernel_set_to_readonly = 1;
+
+ kunit_warn(test, "Trying memory write\n");
+ *ro_buf = 0x11;
+ KUNIT_EXPECT_EQ(test, 0, *ro_buf);
+ kunit_warn(test, "New content: 0x%02x\n", *ro_buf);
+}
+
+static void heki_test_write_to_const(struct kunit *test)
+{
+ heki_test_write_to_rodata(test, "const_buf",
+ (void *)heki_test_const_buf);
+}
+
+static void heki_test_write_to_ro_after_init(struct kunit *test)
+{
+ heki_test_write_to_rodata(test, "ro_after_init_buf",
+ (void *)heki_test_ro_after_init_buf);
+}
+
+typedef long test_exec_t(long);
+
+static void heki_test_exec(struct kunit *test)
+{
+ const size_t exec_size = 7;
+ unsigned long nx_page_start = (unsigned long)PTR_ALIGN_DOWN(
+ (const void *const)heki_test_exec_data, PAGE_SIZE);
+ unsigned long nx_page_end = (unsigned long)PTR_ALIGN(
+ (const void *const)heki_test_exec_data + exec_size, PAGE_SIZE);
+ test_exec_t *exec = (test_exec_t *)heki_test_exec_data;
+ long ret;
+
+ /* Starting non-executable memory tests. */
+ print_addr(test, "test_exec_data", heki_test_exec_data);
+
+ kunit_warn(
+ test,
+ "Bypassing kernel-self protection: mark memory as executable\n");
+ kernel_set_to_readonly = 0;
+ KUNIT_ASSERT_FALSE(test,
+ set_memory_rox(nx_page_start,
+ PFN_UP(nx_page_end - nx_page_start)));
+ kernel_set_to_readonly = 1;
+
+ kunit_warn(
+ test,
+ "Trying to execute data (ROP) in (initially) non-executable memory\n");
+ ret = exec(3);
+
+ /* This should not be reached because of the uncaught page fault. */
+ KUNIT_EXPECT_EQ(test, 3, ret);
+ kunit_warn(test, "Result of execution: 3 + 1 = %ld\n", ret);
+}
+
+const struct kunit_case heki_test_cases[] = {
+ KUNIT_CASE(heki_test_cr_disable_smep),
+ KUNIT_CASE(heki_test_write_to_const),
+ KUNIT_CASE(heki_test_write_to_ro_after_init),
+ KUNIT_CASE(heki_test_exec),
+ {}
+};
+
+static unsigned long heki_test __ro_after_init;
+
+static int __init parse_heki_test_config(char *str)
+{
+ if (kstrtoul(str, 10, &heki_test) ||
+ heki_test > (ARRAY_SIZE(heki_test_cases) - 1))
+ pr_warn("Invalid option string for heki_test: '%s'\n", str);
+ return 1;
+}
+
+__setup("heki_test=", parse_heki_test_config);
+
+static void heki_run_test(void)
+{
+ struct kunit_case heki_test_case[2] = {};
+ struct kunit_suite heki_test_suite = {
+ .name = "heki",
+ .test_cases = heki_test_case,
+ };
+ struct kunit_suite *const test_suite = &heki_test_suite;
+
+ if (!kunit_enabled() || heki_test == 0 ||
+ heki_test >= ARRAY_SIZE(heki_test_cases))
+ return;
+
+ pr_warn("Running test #%lu\n", heki_test);
+ heki_test_case[0] = heki_test_cases[heki_test - 1];
+ __kunit_test_suites_init(&test_suite, 1);
+}
+
+#else /* CONFIG_HEKI_TEST */
+
+static inline void heki_run_test(void)
+{
+}
+
+#endif /* CONFIG_HEKI_TEST */
+
void heki_late_init(void)
{
struct heki_hypervisor *hypervisor = heki.hypervisor;
int ret;

if (!heki_enabled)
- return;
+ return heki_run_test();

if (!heki.static_ranges) {
pr_warn("Architecture did not initialize static ranges\n");
@@ -113,6 +303,8 @@ void heki_late_init(void)
goto out;
pr_warn("Control registers locked\n");

+ heki_run_test();
+
out:
heki_free_pa_ranges(heki.static_ranges, heki.num_static_ranges);
heki.static_ranges = NULL;
--
2.40.1

2023-05-05 15:48:19

by Mickaël Salaün

Subject: [PATCH v1 5/9] KVM: x86: Add new hypercall to lock control registers

This enables guests to lock their CR0 and CR4 registers with a subset of
X86_CR0_WP, X86_CR4_SMEP, X86_CR4_SMAP, X86_CR4_UMIP, X86_CR4_FSGSBASE
and X86_CR4_CET flags.

The new KVM_HC_LOCK_CR_UPDATE hypercall takes two arguments. The first
identifies the control register, and the second is a bit mask of flags
to pin (i.e., mark as read-only).

These register flags should already be pinned by Linux guests, but once
the kernel is compromised, this self-protection mechanism could be
disabled, which is not the case with this dedicated hypercall.

Cc: Borislav Petkov <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Madhavan T. Venkataraman <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: Sean Christopherson <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vitaly Kuznetsov <[email protected]>
Cc: Wanpeng Li <[email protected]>
Signed-off-by: Mickaël Salaün <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
Documentation/virt/kvm/x86/hypercalls.rst | 15 +++++
arch/x86/kernel/cpu/common.c | 2 +-
arch/x86/kvm/vmx/vmx.c | 10 ++++
arch/x86/kvm/x86.c | 72 +++++++++++++++++++++++
arch/x86/kvm/x86.h | 16 +++++
include/linux/kvm_host.h | 3 +
include/uapi/linux/kvm_para.h | 1 +
7 files changed, 118 insertions(+), 1 deletion(-)

diff --git a/Documentation/virt/kvm/x86/hypercalls.rst b/Documentation/virt/kvm/x86/hypercalls.rst
index 0ec79cc77f53..8aa5d28986e3 100644
--- a/Documentation/virt/kvm/x86/hypercalls.rst
+++ b/Documentation/virt/kvm/x86/hypercalls.rst
@@ -207,3 +207,18 @@ identified with set of physical page ranges (GFNs). The HEKI_ATTR_MEM_NOWRITE
memory page range attribute forbids related modification to the guest.

Returns 0 on success or a KVM error code otherwise.
+
+10. KVM_HC_LOCK_CR_UPDATE
+-------------------------
+
+:Architecture: x86
+:Status: active
+:Purpose: Request some control registers to be restricted.
+
+- a0: identify a control register
+- a1: bit mask to make some flags read-only
+
+The hypercall lets a guest request control register flags to be pinned for
+itself.
+
+Returns 0 on success or a KVM error code otherwise.
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index f3cc7699e1e1..dd89379fe5ac 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -413,7 +413,7 @@ static __always_inline void setup_umip(struct cpuinfo_x86 *c)
}

/* These bits should not change their value after CPU init is finished. */
-static const unsigned long cr4_pinned_mask =
+const unsigned long cr4_pinned_mask =
X86_CR4_SMEP | X86_CR4_SMAP | X86_CR4_UMIP |
X86_CR4_FSGSBASE | X86_CR4_CET;
static DEFINE_STATIC_KEY_FALSE_RO(cr_pinning);
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 9870db887a62..931688edc8eb 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -3162,6 +3162,11 @@ void vmx_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0)
struct vcpu_vmx *vmx = to_vmx(vcpu);
unsigned long hw_cr0, old_cr0_pg;
u32 tmp;
+ int res;
+
+ res = heki_check_cr(vcpu->kvm, 0, cr0);
+ if (res)
+ return;

old_cr0_pg = kvm_read_cr0_bits(vcpu, X86_CR0_PG);

@@ -3323,6 +3328,11 @@ void vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
* this bit, even if host CR4.MCE == 0.
*/
unsigned long hw_cr4;
+ int res;
+
+ res = heki_check_cr(vcpu->kvm, 4, cr4);
+ if (res)
+ return;

hw_cr4 = (cr4_read_shadow() & X86_CR4_MCE) | (cr4 & ~X86_CR4_MCE);
if (is_unrestricted_guest(vcpu))
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index ffab64d08de3..a529455359ac 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -7927,11 +7927,77 @@ static unsigned long emulator_get_cr(struct x86_emulate_ctxt *ctxt, int cr)
return value;
}

+#ifdef CONFIG_HEKI
+
+extern unsigned long cr4_pinned_mask;
+
+static int heki_lock_cr(struct kvm *const kvm, const unsigned long cr,
+ unsigned long pin)
+{
+ if (!pin)
+ return -KVM_EINVAL;
+
+ switch (cr) {
+ case 0:
+ /* Cf. arch/x86/kernel/cpu/common.c */
+ if (!(pin & X86_CR0_WP))
+ return -KVM_EINVAL;
+
+ if ((read_cr0() & pin) != pin)
+ return -KVM_EINVAL;
+
+ atomic_long_or(pin, &kvm->heki_pinned_cr0);
+ return 0;
+ case 4:
+ /* Checks for irrelevant bits. */
+ if ((pin & cr4_pinned_mask) != pin)
+ return -KVM_EINVAL;
+
+ /* Ignores bits not present in host. */
+ pin &= __read_cr4();
+ atomic_long_or(pin, &kvm->heki_pinned_cr4);
+ return 0;
+ }
+ return -KVM_EINVAL;
+}
+
+int heki_check_cr(const struct kvm *const kvm, const unsigned long cr,
+ const unsigned long val)
+{
+ unsigned long pinned;
+
+ switch (cr) {
+ case 0:
+ pinned = atomic_long_read(&kvm->heki_pinned_cr0);
+ if ((val & pinned) != pinned) {
+ pr_warn_ratelimited(
+ "heki-kvm: Blocked CR0 update: 0x%lx\n", val);
+ return -KVM_EPERM;
+ }
+ return 0;
+ case 4:
+ pinned = atomic_long_read(&kvm->heki_pinned_cr4);
+ if ((val & pinned) != pinned) {
+ pr_warn_ratelimited(
+ "heki-kvm: Blocked CR4 update: 0x%lx\n", val);
+ return -KVM_EPERM;
+ }
+ return 0;
+ }
+ return 0;
+}
+
+#endif /* CONFIG_HEKI */
+
static int emulator_set_cr(struct x86_emulate_ctxt *ctxt, int cr, ulong val)
{
struct kvm_vcpu *vcpu = emul_to_vcpu(ctxt);
int res = 0;

+ res = heki_check_cr(vcpu->kvm, cr, val);
+ if (res)
+ return res;
+
switch (cr) {
case 0:
res = kvm_set_cr0(vcpu, mk_cr_64(kvm_read_cr0(vcpu), val));
@@ -9858,6 +9924,12 @@ int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
else
ret = heki_lock_mem_page_ranges(vcpu->kvm, a0, a1);
break;
+ case KVM_HC_LOCK_CR_UPDATE:
+ if (a0 > U32_MAX)
+ ret = -KVM_EINVAL;
+ else
+ ret = heki_lock_cr(vcpu->kvm, a0, a1);
+ break;
#endif /* CONFIG_HEKI */
default:
ret = -KVM_ENOSYS;
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index 9de72586f406..3e80a60ecbd8 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -276,6 +276,22 @@ static inline bool kvm_check_has_quirk(struct kvm *kvm, u64 quirk)
return !(kvm->arch.disabled_quirks & quirk);
}

+#ifdef CONFIG_HEKI
+
+int heki_check_cr(const struct kvm *kvm, unsigned long cr, unsigned long val);
+
+bool kvm_heki_is_exec_allowed(struct kvm_vcpu *vcpu, gpa_t gpa);
+
+#else /* CONFIG_HEKI */
+
+static inline int heki_check_cr(const struct kvm *const kvm,
+ const unsigned long cr, const unsigned long val)
+{
+ return 0;
+}
+
+#endif /* CONFIG_HEKI */
+
void kvm_inject_realmode_interrupt(struct kvm_vcpu *vcpu, int irq, int inc_eip);

u64 get_kvmclock_ns(struct kvm *kvm);
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 39a1bdc2ba42..ab9dc723bc89 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -812,6 +812,9 @@ struct kvm {
#define HEKI_GFN_MAX 16
atomic_t heki_gfn_no_write_num;
struct heki_gfn_range heki_gfn_no_write[HEKI_GFN_MAX];
+
+ atomic_long_t heki_pinned_cr0;
+ atomic_long_t heki_pinned_cr4;
#endif /* CONFIG_HEKI */

#ifdef CONFIG_HAVE_KVM_PM_NOTIFIER
diff --git a/include/uapi/linux/kvm_para.h b/include/uapi/linux/kvm_para.h
index d7512a10880e..9f68d4ba646b 100644
--- a/include/uapi/linux/kvm_para.h
+++ b/include/uapi/linux/kvm_para.h
@@ -31,6 +31,7 @@
#define KVM_HC_SCHED_YIELD 11
#define KVM_HC_MAP_GPA_RANGE 12
#define KVM_HC_LOCK_MEM_PAGE_RANGES 13
+#define KVM_HC_LOCK_CR_UPDATE 14

/*
* hypercalls use architecture specific
--
2.40.1

2023-05-05 15:50:25

by Mickaël Salaün

Subject: [PATCH v1 1/9] KVM: x86: Add kvm_x86_ops.fault_gva()

This function is needed for kvm_mmu_page_fault() to create synthetic
page faults.

Code originally written by Mihai Donțu and Nicușor Cîțu:
https://lore.kernel.org/r/[email protected]
Renamed fault_gla() to fault_gva() and used the new
EPT_VIOLATION_GVA_IS_VALID flag.

Cc: Borislav Petkov <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Madhavan T. Venkataraman <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: Sean Christopherson <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vitaly Kuznetsov <[email protected]>
Cc: Wanpeng Li <[email protected]>
Co-developed-by: Mihai Donțu <[email protected]>
Signed-off-by: Mihai Donțu <[email protected]>
Co-developed-by: Nicușor Cîțu <[email protected]>
Signed-off-by: Nicușor Cîțu <[email protected]>
Signed-off-by: Mickaël Salaün <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
arch/x86/include/asm/kvm-x86-ops.h | 1 +
arch/x86/include/asm/kvm_host.h | 2 ++
arch/x86/kvm/svm/svm.c | 9 +++++++++
arch/x86/kvm/vmx/vmx.c | 10 ++++++++++
4 files changed, 22 insertions(+)

diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index abccd51dcfca..b761182a9444 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -131,6 +131,7 @@ KVM_X86_OP(msr_filter_changed)
KVM_X86_OP(complete_emulated_msr)
KVM_X86_OP(vcpu_deliver_sipi_vector)
KVM_X86_OP_OPTIONAL_RET0(vcpu_get_apicv_inhibit_reasons);
+KVM_X86_OP(fault_gva)

#undef KVM_X86_OP
#undef KVM_X86_OP_OPTIONAL
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 6aaae18f1854..f319bcdeb8bd 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1706,6 +1706,8 @@ struct kvm_x86_ops {
* Returns vCPU specific APICv inhibit reasons
*/
unsigned long (*vcpu_get_apicv_inhibit_reasons)(struct kvm_vcpu *vcpu);
+
+ u64 (*fault_gva)(struct kvm_vcpu *vcpu);
};

struct kvm_x86_nested_ops {
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 9a194aa1a75a..8b47b38aaf7f 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -4700,6 +4700,13 @@ static int svm_vm_init(struct kvm *kvm)
return 0;
}

+static u64 svm_fault_gva(struct kvm_vcpu *vcpu)
+{
+ const struct vcpu_svm *svm = to_svm(vcpu);
+
+ return svm->vcpu.arch.cr2 ? svm->vcpu.arch.cr2 : ~0ull;
+}
+
static struct kvm_x86_ops svm_x86_ops __initdata = {
.name = "kvm_amd",

@@ -4826,6 +4833,8 @@ static struct kvm_x86_ops svm_x86_ops __initdata = {

.vcpu_deliver_sipi_vector = svm_vcpu_deliver_sipi_vector,
.vcpu_get_apicv_inhibit_reasons = avic_vcpu_get_apicv_inhibit_reasons,
+
+ .fault_gva = svm_fault_gva,
};

/*
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 7eec0226d56a..9870db887a62 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -8067,6 +8067,14 @@ static void vmx_vm_destroy(struct kvm *kvm)
free_pages((unsigned long)kvm_vmx->pid_table, vmx_get_pid_table_order(kvm));
}

+static u64 vmx_fault_gva(struct kvm_vcpu *vcpu)
+{
+ if (vcpu->arch.exit_qualification & EPT_VIOLATION_GVA_IS_VALID)
+ return vmcs_readl(GUEST_LINEAR_ADDRESS);
+
+ return ~0ull;
+}
+
static struct kvm_x86_ops vmx_x86_ops __initdata = {
.name = "kvm_intel",

@@ -8204,6 +8212,8 @@ static struct kvm_x86_ops vmx_x86_ops __initdata = {
.complete_emulated_msr = kvm_complete_insn_gp,

.vcpu_deliver_sipi_vector = kvm_vcpu_deliver_sipi_vector,
+
+ .fault_gva = vmx_fault_gva,
};

static unsigned int vmx_handle_intel_pt_intr(void)
--
2.40.1

2023-05-05 15:51:43

by Mickaël Salaün

Subject: [PATCH v1 6/9] KVM: x86: Add Heki hypervisor support

From: Madhavan T. Venkataraman <[email protected]>

Each supported hypervisor in x86 implements a struct x86_hyper_init to
define the init functions for the hypervisor. Define a new init_heki()
entry point in struct x86_hyper_init. Hypervisors that support Heki
must define this init_heki() function. Call init_heki() of the chosen
hypervisor in init_hypervisor_platform().

Create a heki_hypervisor structure that each hypervisor can fill
with its data and functions. This will allow the Heki feature to work
in a hypervisor-agnostic way.

Declare and initialize a "heki_hypervisor" structure for KVM so KVM can
support Heki. Define the init_heki() function for KVM. In init_heki(),
set the hypervisor field in the generic "heki" structure to the KVM
"heki_hypervisor". After this point, generic Heki code can access the
KVM Heki data and functions.

Cc: Borislav Petkov <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: Sean Christopherson <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vitaly Kuznetsov <[email protected]>
Cc: Wanpeng Li <[email protected]>
Co-developed-by: Mickaël Salaün <[email protected]>
Signed-off-by: Mickaël Salaün <[email protected]>
Signed-off-by: Madhavan T. Venkataraman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
arch/x86/include/asm/x86_init.h | 2 +
arch/x86/kernel/cpu/hypervisor.c | 1 +
arch/x86/kernel/kvm.c | 72 ++++++++++++++++++++++++++++++++
arch/x86/kernel/x86_init.c | 1 +
arch/x86/kvm/Kconfig | 1 +
virt/heki/Kconfig | 9 +++-
virt/heki/heki.c | 6 ---
7 files changed, 85 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/x86_init.h b/arch/x86/include/asm/x86_init.h
index c1c8c581759d..0fc5041a66c6 100644
--- a/arch/x86/include/asm/x86_init.h
+++ b/arch/x86/include/asm/x86_init.h
@@ -119,6 +119,7 @@ struct x86_init_pci {
* @msi_ext_dest_id: MSI supports 15-bit APIC IDs
* @init_mem_mapping: setup early mappings during init_mem_mapping()
* @init_after_bootmem: guest init after boot allocator is finished
+ * @init_heki: Hypervisor enforced kernel integrity
*/
struct x86_hyper_init {
void (*init_platform)(void);
@@ -127,6 +128,7 @@ struct x86_hyper_init {
bool (*msi_ext_dest_id)(void);
void (*init_mem_mapping)(void);
void (*init_after_bootmem)(void);
+ void (*init_heki)(void);
};

/**
diff --git a/arch/x86/kernel/cpu/hypervisor.c b/arch/x86/kernel/cpu/hypervisor.c
index 553bfbfc3a1b..6085c8129e0c 100644
--- a/arch/x86/kernel/cpu/hypervisor.c
+++ b/arch/x86/kernel/cpu/hypervisor.c
@@ -106,4 +106,5 @@ void __init init_hypervisor_platform(void)

x86_hyper_type = h->type;
x86_init.hyper.init_platform();
+ x86_init.hyper.init_heki();
}
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 1cceac5984da..e53cebdcf3ac 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -29,6 +29,7 @@
#include <linux/syscore_ops.h>
#include <linux/cc_platform.h>
#include <linux/efi.h>
+#include <linux/heki.h>
#include <asm/timer.h>
#include <asm/cpu.h>
#include <asm/traps.h>
@@ -866,6 +867,45 @@ static void __init kvm_guest_init(void)
hardlockup_detector_disable();
}

+#ifdef CONFIG_HEKI
+
+static int kvm_protect_ranges(struct heki_pa_range *ranges, int num_ranges)
+{
+ size_t size;
+ long err;
+
+ WARN_ON(in_interrupt());
+
+ size = sizeof(ranges[0]) * num_ranges;
+ err = kvm_hypercall3(KVM_HC_LOCK_MEM_PAGE_RANGES, __pa(ranges), size, 0);
+ if (WARN(err, "Failed to enforce memory protection: %ld\n", err))
+ return err;
+
+ return 0;
+}
+
+extern unsigned long cr4_pinned_mask;
+
+/*
+ * TODO: Check SMP policy consistency, e.g. with
+ * this_cpu_read(cpu_tlbstate.cr4)
+ */
+static int kvm_lock_crs(void)
+{
+ unsigned long cr4;
+ int err;
+
+ err = kvm_hypercall2(KVM_HC_LOCK_CR_UPDATE, 0, X86_CR0_WP);
+ if (err)
+ return err;
+
+ cr4 = __read_cr4();
+ err = kvm_hypercall2(KVM_HC_LOCK_CR_UPDATE, 4, cr4 & cr4_pinned_mask);
+ return err;
+}
+
+#endif /* CONFIG_HEKI */
+
static noinline uint32_t __kvm_cpuid_base(void)
{
if (boot_cpu_data.cpuid_level < 0)
@@ -999,6 +1039,37 @@ static bool kvm_sev_es_hcall_finish(struct ghcb *ghcb, struct pt_regs *regs)
}
#endif

+#ifdef CONFIG_HEKI
+
+static struct heki_hypervisor kvm_heki_hypervisor = {
+ .protect_ranges = kvm_protect_ranges,
+ .lock_crs = kvm_lock_crs,
+};
+
+static void kvm_init_heki(void)
+{
+ long err;
+
+ if (!kvm_para_available())
+ /* Cannot make KVM hypercalls. */
+ return;
+
+ err = kvm_hypercall3(KVM_HC_LOCK_MEM_PAGE_RANGES, -1, -1, -1);
+ if (err == -KVM_ENOSYS)
+ /* Ignores host. */
+ return;
+
+ heki.hypervisor = &kvm_heki_hypervisor;
+}
+
+#else /* CONFIG_HEKI */
+
+static void kvm_init_heki(void)
+{
+}
+
+#endif /* CONFIG_HEKI */
+
const __initconst struct hypervisor_x86 x86_hyper_kvm = {
.name = "KVM",
.detect = kvm_detect,
@@ -1007,6 +1078,7 @@ const __initconst struct hypervisor_x86 x86_hyper_kvm = {
.init.x2apic_available = kvm_para_available,
.init.msi_ext_dest_id = kvm_msi_ext_dest_id,
.init.init_platform = kvm_init_platform,
+ .init.init_heki = kvm_init_heki,
#if defined(CONFIG_AMD_MEM_ENCRYPT)
.runtime.sev_es_hcall_prepare = kvm_sev_es_hcall_prepare,
.runtime.sev_es_hcall_finish = kvm_sev_es_hcall_finish,
diff --git a/arch/x86/kernel/x86_init.c b/arch/x86/kernel/x86_init.c
index ef80d361b463..0a023c24fcdb 100644
--- a/arch/x86/kernel/x86_init.c
+++ b/arch/x86/kernel/x86_init.c
@@ -114,6 +114,7 @@ struct x86_init_ops x86_init __initdata = {
.msi_ext_dest_id = bool_x86_init_noop,
.init_mem_mapping = x86_init_noop,
.init_after_bootmem = x86_init_noop,
+ .init_heki = x86_init_noop,
},

.acpi = {
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index fbeaa9ddef59..ba355171ceeb 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -49,6 +49,7 @@ config KVM
select SRCU
select INTERVAL_TREE
select HAVE_KVM_PM_NOTIFIER if PM
+ select HYPERVISOR_SUPPORTS_HEKI
help
Support hosting fully virtualized guest machines using hardware
virtualization extensions. You will need a fairly recent
diff --git a/virt/heki/Kconfig b/virt/heki/Kconfig
index 9858a827fe17..96f18ce03013 100644
--- a/virt/heki/Kconfig
+++ b/virt/heki/Kconfig
@@ -6,7 +6,7 @@
config HEKI
bool "Hypervisor Enforced Kernel Integrity (Heki)"
default y
- depends on !JUMP_LABEL && ARCH_SUPPORTS_HEKI
+ depends on !JUMP_LABEL && ARCH_SUPPORTS_HEKI && HYPERVISOR_SUPPORTS_HEKI
select KVM_EXTERNAL_WRITE_TRACKING if KVM
help
This feature enhances guest virtual machine security by taking
@@ -20,3 +20,10 @@ config ARCH_SUPPORTS_HEKI
An architecture should select this when it can successfully build
and run with CONFIG_HEKI. That is, it should provide all of the
architecture support required for the HEKI feature.
+
+config HYPERVISOR_SUPPORTS_HEKI
+ bool "Hypervisor support for Heki"
+ help
+ A hypervisor should select this when it can successfully build
+ and run with CONFIG_HEKI. That is, it should provide all of the
+ hypervisor support required for the Heki feature.
diff --git a/virt/heki/heki.c b/virt/heki/heki.c
index c8cb1b84cceb..142b5dc98a2f 100644
--- a/virt/heki/heki.c
+++ b/virt/heki/heki.c
@@ -91,12 +91,6 @@ void heki_late_init(void)
return;
}

- /*
- * Hypervisor support will be added in the future. When it is, the
- * hypervisor will be used to protect guest kernel memory and
- * control registers.
- */
-
if (!hypervisor) {
/* This happens for kernels running on bare metal as well. */
pr_warn("No hypervisor support\n");
--
2.40.1

2023-05-05 16:46:13

by Sean Christopherson

Subject: Re: [PATCH v1 2/9] KVM: x86/mmu: Add support for prewrite page tracking

On Fri, May 05, 2023, Mickaël Salaün wrote:
> diff --git a/arch/x86/include/asm/kvm_page_track.h b/arch/x86/include/asm/kvm_page_track.h
> index eb186bc57f6a..a7fb4ff888e6 100644
> --- a/arch/x86/include/asm/kvm_page_track.h
> +++ b/arch/x86/include/asm/kvm_page_track.h
> @@ -3,6 +3,7 @@
> #define _ASM_X86_KVM_PAGE_TRACK_H
>
> enum kvm_page_track_mode {
> + KVM_PAGE_TRACK_PREWRITE,

Heh, just when I decide to finally kill off support for multiple modes[1] :-)

My assessment from that changelog still holds true for this case:

Drop "support" for multiple page-track modes, as there is no evidence
that array-based and refcounted metadata is the optimal solution for
other modes, nor is there any evidence that other use cases, e.g. for
access-tracking, will be a good fit for the page-track machinery in
general.

E.g. one potential use case of access-tracking would be to prevent guest
access to poisoned memory (from the guest's perspective). In that case,
the number of poisoned pages is likely to be a very small percentage of
the guest memory, and there is no need to reference count the number of
access-tracking users, i.e. expanding gfn_track[] for a new mode would be
grossly inefficient. And for poisoned memory, host userspace would also
likely want to trap accesses, e.g. to inject #MC into the guest, and that
isn't currently supported by the page-track framework.

A better alternative for that poisoned page use case is likely a
variation of the proposed per-gfn attributes overlay (linked), which
would allow efficiently tracking the sparse set of poisoned pages, and by
default would exit to userspace on access.

Of particular relevance:

- Using the page-track machinery is inefficient because the guest is likely
going to write-protect a minority of its memory. And this

select KVM_EXTERNAL_WRITE_TRACKING if KVM

is particularly nasty because simply enabling HEKI in the Kconfig will cause
KVM to allocate rmaps and gfn tracking.

- There's no need to reference count the protection, i.e. 15 of the 16 bits of
gfn_track are dead weight.

- As proposed, adding a second "mode" would double the cost of gfn tracking.

- Tying the protections to the memslots will create an impossible-to-maintain
ABI because the protections will be lost if the owning memslot is deleted and
recreated.

- The page-track framework provides incomplete protection and will lead to an
ongoing game of whack-a-mole, e.g. this patch catches the obvious cases by
adding calls to kvm_page_track_prewrite(), but misses things like kvm_vcpu_map().

- The scaling and maintenance issues will only get worse if/when someone tries
to support dropping read and/or execute permissions, e.g. for execute-only.

- The code is x86-only, and is likely to stay that way for the foreseeable
future.

The proposed alternative is to piggyback the memory attributes implementation[2]
that is being added (if all goes according to plan) for confidential VMs. This
use case (dropping permissions) came up not too long ago[3], which is why I have
a ready-made answer.

I have no doubt that we'll need to solve performance and scaling issues with the
memory attributes implementation, e.g. to utilize xarray multi-range support
instead of storing information on a per-4KiB-page basis, but AFAICT, the core
idea is sound. And a very big positive from a maintenance perspective is that
any optimizations, fixes, etc. for one use case (CoCo vs. hardening) should also
benefit the other use case.

[1] https://lore.kernel.org/all/[email protected]
[2] https://lore.kernel.org/all/[email protected]
[3] https://lore.kernel.org/all/Y1a1i9vbJ%[email protected]

2023-05-05 16:49:32

by Sean Christopherson

Subject: Re: [PATCH v1 4/9] KVM: x86: Add new hypercall to set EPT permissions

On Fri, May 05, 2023, Mickaël Salaün wrote:
> Add a new KVM_HC_LOCK_MEM_PAGE_RANGES hypercall that enables a guest to
> set EPT permissions on a set of page ranges.

IMO, manipulation of protections, both for memory (this patch) and CPU state
(control registers in the next patch) should come from userspace. I have no
objection to KVM providing plumbing if necessary, but I think userspace needs
to have full control over the actual state.

One of the things that caused Intel's control register pinning series to stall
out was how to handle edge cases like kexec() and reboot. Deferring to userspace
means the kernel doesn't need to define policy, e.g. when to unprotect memory,
and avoids questions like "should userspace be able to overwrite pinned control
registers".

And like the confidential VM use case, keeping userspace in the loop is a big
benefit, e.g. the guest can't circumvent protections by coercing userspace into
writing to protected memory.

2023-05-05 16:57:06

by Mickaël Salaün

Subject: Re: [PATCH v1 2/9] KVM: x86/mmu: Add support for prewrite page tracking


On 05/05/2023 18:28, Sean Christopherson wrote:
> On Fri, May 05, 2023, Mickaël Salaün wrote:
>> diff --git a/arch/x86/include/asm/kvm_page_track.h b/arch/x86/include/asm/kvm_page_track.h
>> index eb186bc57f6a..a7fb4ff888e6 100644
>> --- a/arch/x86/include/asm/kvm_page_track.h
>> +++ b/arch/x86/include/asm/kvm_page_track.h
>> @@ -3,6 +3,7 @@
>> #define _ASM_X86_KVM_PAGE_TRACK_H
>>
>> enum kvm_page_track_mode {
>> + KVM_PAGE_TRACK_PREWRITE,
>
> Heh, just when I decide to finally kill off support for multiple modes[1] :-)
>
> My assessment from that changelog still holds true for this case:
>
> Drop "support" for multiple page-track modes, as there is no evidence
> that array-based and refcounted metadata is the optimal solution for
> other modes, nor is there any evidence that other use cases, e.g. for
> access-tracking, will be a good fit for the page-track machinery in
> general.
>
> E.g. one potential use case of access-tracking would be to prevent guest
> access to poisoned memory (from the guest's perspective). In that case,
> the number of poisoned pages is likely to be a very small percentage of
> the guest memory, and there is no need to reference count the number of
> access-tracking users, i.e. expanding gfn_track[] for a new mode would be
> grossly inefficient. And for poisoned memory, host userspace would also
> likely want to trap accesses, e.g. to inject #MC into the guest, and that
> isn't currently supported by the page-track framework.
>
> A better alternative for that poisoned page use case is likely a
> variation of the proposed per-gfn attributes overlay (linked), which
> would allow efficiently tracking the sparse set of poisoned pages, and by
> default would exit to userspace on access.
>
> Of particular relevance:
>
> - Using the page-track machinery is inefficient because the guest is likely
> going to write-protect a minority of its memory. And this
>
> select KVM_EXTERNAL_WRITE_TRACKING if KVM
>
> is particularly nasty because simply enabling HEKI in the Kconfig will cause
> KVM to allocate rmaps and gfn tracking.
>
> - There's no need to reference count the protection, i.e. 15 of the 16 bits of
> gfn_track are dead weight.
>
> - As proposed, adding a second "mode" would double the cost of gfn tracking.
>
> - Tying the protections to the memslots will create an impossible-to-maintain
> ABI because the protections will be lost if the owning memslot is deleted and
> recreated.
>
> - The page-track framework provides incomplete protection and will lead to an
> ongoing game of whack-a-mole, e.g. this patch catches the obvious cases by
> adding calls to kvm_page_track_prewrite(), but misses things like kvm_vcpu_map().
>
> - The scaling and maintenance issues will only get worse if/when someone tries
> to support dropping read and/or execute permissions, e.g. for execute-only.
>
> - The code is x86-only, and is likely to stay that way for the foreseeable
> future.
>
> The proposed alternative is to piggyback the memory attributes implementation[2]
> that is being added (if all goes according to plan) for confidential VMs. This
> use case (dropping permissions) came up not too long ago[3], which is why I have
> a ready-made answer.
>
> I have no doubt that we'll need to solve performance and scaling issues with the
> memory attributes implementation, e.g. to utilize xarray multi-range support
> instead of storing information on a per-4KiB-page basis, but AFAICT, the core
> idea is sound. And a very big positive from a maintenance perspective is that
> any optimizations, fixes, etc. for one use case (CoCo vs. hardening) should also
> benefit the other use case.
>
> [1] https://lore.kernel.org/all/[email protected]
> [2] https://lore.kernel.org/all/[email protected]
> [3] https://lore.kernel.org/all/Y1a1i9vbJ%[email protected]

I agree, I used this mechanism because it was easier at first to rely on
a previous work, but while I was working on the MBEC support, I realized
that it's not the optimal way to do it.

I was thinking about using a new special EPT bit similar to
EPT_SPTE_HOST_WRITABLE, but it may not be portable. What do you
think?

2023-05-05 17:12:50

by Mickaël Salaün

Subject: Re: [PATCH v1 4/9] KVM: x86: Add new hypercall to set EPT permissions


On 05/05/2023 18:44, Sean Christopherson wrote:
On Fri, May 05, 2023, Mickaël Salaün wrote:
>> Add a new KVM_HC_LOCK_MEM_PAGE_RANGES hypercall that enables a guest to
>> set EPT permissions on a set of page ranges.
>
> IMO, manipulation of protections, both for memory (this patch) and CPU state
> (control registers in the next patch) should come from userspace. I have no
> objection to KVM providing plumbing if necessary, but I think userspace needs
> to have full control over the actual state.

By user space, do you mean the host user space or the guest user space?

About the guest user space, I see several issues to delegate this kind
of control:
- These are restrictions only relevant to the kernel.
- The threat model is to protect against user space as early as possible.
- It would be more complex for no obvious gain.

This patch series is an extension of the kernel self-protections
mechanisms, and they are not configured by user space.


>
> One of the things that caused Intel's control register pinning series to stall
> out was how to handle edge cases like kexec() and reboot. Deferring to userspace
> means the kernel doesn't need to define policy, e.g. when to unprotect memory,
> and avoids questions like "should userspace be able to overwrite pinned control
> registers".

The idea is to authenticate every change. For kexec, the VMM (or
something else) would have to authenticate the new kernel. Do you have
something else in mind that could legitimately require such memory or CR
changes?


>
> And like the confidential VM use case, keeping userspace in the loop is a big
> beneifit, e.g. the guest can't circumvent protections by coercing userspace into
> writing to protected memory .

I don't understand this part. Are you talking about the host user space?
How could the guest circumvent protections?

2023-05-05 17:22:48

by Sean Christopherson

Subject: Re: [PATCH v1 4/9] KVM: x86: Add new hypercall to set EPT permissions

On Fri, May 05, 2023, Mickaël Salaün wrote:
>
> On 05/05/2023 18:44, Sean Christopherson wrote:
> On Fri, May 05, 2023, Mickaël Salaün wrote:
> > > Add a new KVM_HC_LOCK_MEM_PAGE_RANGES hypercall that enables a guest to
> > > set EPT permissions on a set of page ranges.
> >
> > IMO, manipulation of protections, both for memory (this patch) and CPU state
> > (control registers in the next patch) should come from userspace. I have no
> > objection to KVM providing plumbing if necessary, but I think userspace needs
> > to have full control over the actual state.
>
> By user space, do you mean the host user space or the guest user space?

Host userspace, a.k.a. the VMM. Definitely not guest userspace.

> About the guest user space, I see several issues to delegate this kind of
> control:
> - These are restrictions only relevant to the kernel.
> - The threat model is to protect against user space as early as possible.
> - It would be more complex for no obvious gain.
>
> This patch series is an extension of the kernel self-protections mechanisms,
> and they are not configured by user space.
>
>
> >
> > One of the things that caused Intel's control register pinning series to stall
> > out was how to handle edge cases like kexec() and reboot. Deferring to userspace
> > means the kernel doesn't need to define policy, e.g. when to unprotect memory,
> > and avoids questions like "should userspace be able to overwrite pinned control
> > registers".
>
> The idea is to authenticate every change. For kexec, the VMM (or something
> else) would have to authenticate the new kernel. Do you have something else
> in mind that could legitimately require such memory or CR changes?

I think we're on the same page, the VMM (host userspace) would need to ack any
changes.

FWIW, SMM is another wart as entry to SMM clobbers CRs. Now that CONFIG_KVM_SMM
is a thing, the easiest solution would be to disallow coexistence with SMM, though
that might not be an option for many use cases (IIUC, QEMU-based deployments use
SMM to implement secure boot).

> > And like the confidential VM use case, keeping userspace in the loop is a big
> benefit, e.g. the guest can't circumvent protections by coercing userspace into
> writing to protected memory.
>
> I don't understand this part. Are you talking about the host user space? How
> could the guest circumvent protections?

Host userspace. Guest configures a device buffer in write-protected memory, gets
a host (synthetic) device to write into the memory.

2023-05-05 17:44:24

by Sean Christopherson

Subject: Re: [PATCH v1 2/9] KVM: x86/mmu: Add support for prewrite page tracking

On Fri, May 05, 2023, Mickaël Salaün wrote:
>
> On 05/05/2023 18:28, Sean Christopherson wrote:
> > I have no doubt that we'll need to solve performance and scaling issues with the
> > memory attributes implementation, e.g. to utilize xarray multi-range support
> > instead of storing information on a per-4KiB-page basis, but AFAICT, the core
> > idea is sound. And a very big positive from a maintenance perspective is that
> > any optimizations, fixes, etc. for one use case (CoCo vs. hardening) should also
> > benefit the other use case.
> >
> > [1] https://lore.kernel.org/all/[email protected]
> > [2] https://lore.kernel.org/all/[email protected]
> > [3] https://lore.kernel.org/all/Y1a1i9vbJ%[email protected]
>
> I agree, I used this mechanism because it was easier at first to rely on
> previous work, but while I was working on the MBEC support, I realized that
> it's not the optimal way to do it.
>
> I was thinking about using a new special EPT bit similar to
> EPT_SPTE_HOST_WRITABLE, but it may not be portable. What do you
> think?

On x86, SPTEs are even more ephemeral than memslots. E.g. for historical reasons,
KVM zaps all SPTEs if _any_ memslot is deleted, which is problematic if the guest
is moving around BARs, using option ROMs, etc.

ARM's pKVM tracks metadata in its stage-2 PTEs, i.e. doesn't need an xarray to
track attributes, but that works only because pKVM is more privileged than the
host kernel, and the shared vs. private memory attribute that pKVM cares about
is very, very restricted in how it can be used and changed.

I tried shoehorning private vs. shared metadata into x86's SPTEs in the past, and
it ended up being a constant battle with the kernel, e.g. page migration, and with
KVM itself, e.g. the above memslot mess.

2023-05-08 17:38:40

by Wei Liu

Subject: Re: [PATCH v1 3/9] virt: Implement Heki common code

On Fri, May 05, 2023 at 05:20:40PM +0200, Mickaël Salaün wrote:
> From: Madhavan T. Venkataraman <[email protected]>
>
> Hypervisor Enforced Kernel Integrity (Heki) is a feature that will use
> the hypervisor to enhance guest virtual machine security.
>
> Configuration
> =============
>
> Define the config variables for the feature. This feature depends on
> support from the architecture as well as the hypervisor.
>
> Enabling HEKI
> =============
>
> Define a kernel command line parameter "heki" to turn the feature on or
> off. By default, Heki is on.

For such a newfangled feature can we have it off by default? Especially
when there are unsolved issues around dynamically loaded code.

>
[...]
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 3604074a878b..5cf5a7a97811 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -297,6 +297,7 @@ config X86
> select FUNCTION_ALIGNMENT_4B
> imply IMA_SECURE_AND_OR_TRUSTED_BOOT if EFI
> select HAVE_DYNAMIC_FTRACE_NO_PATCHABLE
> + select ARCH_SUPPORTS_HEKI if X86_64

Why is there a restriction on X86_64?

>
> config INSTRUCTION_DECODER
> def_bool y
> diff --git a/arch/x86/include/asm/sections.h b/arch/x86/include/asm/sections.h
> index a6e8373a5170..42ef1e33b8a5 100644
> --- a/arch/x86/include/asm/sections.h
> +++ b/arch/x86/include/asm/sections.h
[...]
>
> +#ifdef CONFIG_HEKI
> +
> +/*
> + * Gather all of the statically defined sections so heki_late_init() can
> + * protect these sections in the host page table.
> + *
> + * The sections are defined under "SECTIONS" in vmlinux.lds.S
> + * Keep this array in sync with SECTIONS.
> + */

This seems a bit fragile, because it requires constant attention from
people who care about this functionality. Can this table be
automatically generated?

Thanks,
Wei.

> +struct heki_va_range __initdata heki_va_ranges[] = {
> + {
> + .va_start = _stext,
> + .va_end = _etext,
> + .attributes = HEKI_ATTR_MEM_NOWRITE | HEKI_ATTR_MEM_EXEC,
> + },
> + {
> + .va_start = __start_rodata,
> + .va_end = __end_rodata,
> + .attributes = HEKI_ATTR_MEM_NOWRITE,
> + },
> +#ifdef CONFIG_UNWINDER_ORC
> + {
> + .va_start = __start_orc_unwind_ip,
> + .va_end = __stop_orc_unwind_ip,
> + .attributes = HEKI_ATTR_MEM_NOWRITE,
> + },
> + {
> + .va_start = __start_orc_unwind,
> + .va_end = __stop_orc_unwind,
> + .attributes = HEKI_ATTR_MEM_NOWRITE,
> + },
> + {
> + .va_start = orc_lookup,
> + .va_end = orc_lookup_end,
> + .attributes = HEKI_ATTR_MEM_NOWRITE,
> + },
> +#endif /* CONFIG_UNWINDER_ORC */
> +};
> +

2023-05-08 21:20:53

by Wei Liu

Subject: Re: [PATCH v1 6/9] KVM: x86: Add Heki hypervisor support

On Fri, May 05, 2023 at 05:20:43PM +0200, Mickaël Salaün wrote:
> From: Madhavan T. Venkataraman <[email protected]>
>
> Each supported hypervisor in x86 implements a struct x86_hyper_init to
> define the init functions for the hypervisor. Define a new init_heki()
> entry point in struct x86_hyper_init. Hypervisors that support Heki
> must define this init_heki() function. Call init_heki() of the chosen
> hypervisor in init_hypervisor_platform().
>
> Create a heki_hypervisor structure that each hypervisor can fill
> with its data and functions. This will allow the Heki feature to work
> in a hypervisor agnostic way.
>
> Declare and initialize a "heki_hypervisor" structure for KVM so KVM can
> support Heki. Define the init_heki() function for KVM. In init_heki(),
> set the hypervisor field in the generic "heki" structure to the KVM
> "heki_hypervisor". After this point, generic Heki code can access the
> KVM Heki data and functions.
>
[...]
> +static void kvm_init_heki(void)
> +{
> + long err;
> +
> + if (!kvm_para_available())
> + /* Cannot make KVM hypercalls. */
> + return;
> +
> + err = kvm_hypercall3(KVM_HC_LOCK_MEM_PAGE_RANGES, -1, -1, -1);

Why not do a proper version check or capability check here? If the ABI
or supported features ever change, we would have something to rely on.

Thanks,
Wei.

2023-05-08 21:25:09

by Wei Liu

Subject: Re: [PATCH v1 5/9] KVM: x86: Add new hypercall to lock control registers

On Fri, May 05, 2023 at 05:20:42PM +0200, Mickaël Salaün wrote:
> This enables guests to lock their CR0 and CR4 registers with a subset of
> X86_CR0_WP, X86_CR4_SMEP, X86_CR4_SMAP, X86_CR4_UMIP, X86_CR4_FSGSBASE
> and X86_CR4_CET flags.
>
> The new KVM_HC_LOCK_CR_UPDATE hypercall takes two arguments. The first
> is to identify the control register, and the second is a bit mask to
> pin (i.e. mark as read-only).
>
> These register flags should already be pinned by Linux guests, but once
> compromised, this self-protection mechanism could be disabled, which is
> not the case with this dedicated hypercall.
>
> Cc: Borislav Petkov <[email protected]>
> Cc: Dave Hansen <[email protected]>
> Cc: H. Peter Anvin <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Kees Cook <[email protected]>
> Cc: Madhavan T. Venkataraman <[email protected]>
> Cc: Paolo Bonzini <[email protected]>
> Cc: Sean Christopherson <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: Vitaly Kuznetsov <[email protected]>
> Cc: Wanpeng Li <[email protected]>
> Signed-off-by: Mickaël Salaün <[email protected]>
> Link: https://lore.kernel.org/r/[email protected]
[...]
> hw_cr4 = (cr4_read_shadow() & X86_CR4_MCE) | (cr4 & ~X86_CR4_MCE);
> if (is_unrestricted_guest(vcpu))
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index ffab64d08de3..a529455359ac 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -7927,11 +7927,77 @@ static unsigned long emulator_get_cr(struct x86_emulate_ctxt *ctxt, int cr)
> return value;
> }
>
> +#ifdef CONFIG_HEKI
> +
> +extern unsigned long cr4_pinned_mask;
> +

Can this be moved to a header file?

> +static int heki_lock_cr(struct kvm *const kvm, const unsigned long cr,
> + unsigned long pin)
> +{
> + if (!pin)
> + return -KVM_EINVAL;
> +
> + switch (cr) {
> + case 0:
> + /* Cf. arch/x86/kernel/cpu/common.c */
> + if (!(pin & X86_CR0_WP))
> + return -KVM_EINVAL;
> +
> + if ((read_cr0() & pin) != pin)
> + return -KVM_EINVAL;
> +
> + atomic_long_or(pin, &kvm->heki_pinned_cr0);
> + return 0;
> + case 4:
> + /* Checks for irrelevant bits. */
> + if ((pin & cr4_pinned_mask) != pin)
> + return -KVM_EINVAL;
> +

It is enforcing the host mask on the guest, right? If the guest's set is a
superset of the host's, then it will get rejected.


> + /* Ignores bits not present in host. */
> + pin &= __read_cr4();
> + atomic_long_or(pin, &kvm->heki_pinned_cr4);
> + return 0;
> + }
> + return -KVM_EINVAL;
> +}
> +
> +int heki_check_cr(const struct kvm *const kvm, const unsigned long cr,
> + const unsigned long val)
> +{
> + unsigned long pinned;
> +
> + switch (cr) {
> + case 0:
> + pinned = atomic_long_read(&kvm->heki_pinned_cr0);
> + if ((val & pinned) != pinned) {
> + pr_warn_ratelimited(
> + "heki-kvm: Blocked CR0 update: 0x%lx\n", val);

I think if the message contains the VM and VCPU identifier it will
become more useful.

Thanks,
Wei.
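
The pinning semantics in the quoted patch can be modeled as a small userspace sketch: locking accumulates pinned bits, and a later control-register write is allowed only if every pinned bit remains set. The constants and helpers below are stand-ins mirroring the patch, not the kernel's actual definitions.

```c
/* Stand-in flag value mirroring arch/x86/include/uapi/asm/processor-flags.h. */
#define X86_CR0_WP (1UL << 16)

static unsigned long pinned_cr0; /* models kvm->heki_pinned_cr0 */

/* Models heki_lock_cr() for CR0: reject an empty mask or one without
 * X86_CR0_WP, require the pinned bits to already be set in the current
 * CR0 value, then OR them into the pinned set. */
static int lock_cr0(unsigned long cr0, unsigned long pin)
{
	if (!pin || !(pin & X86_CR0_WP))
		return -1; /* -KVM_EINVAL in the patch */
	if ((cr0 & pin) != pin)
		return -1;
	pinned_cr0 |= pin;
	return 0;
}

/* Models heki_check_cr(): a CR0 update is blocked if it clears any
 * pinned bit. */
static int check_cr0(unsigned long new_val)
{
	return ((new_val & pinned_cr0) == pinned_cr0) ? 0 : -1;
}
```

For CR4, the patch additionally intersects the requested mask with the host's cr4_pinned_mask and __read_cr4() value, which is what the superset question above is about: guest bits outside the host's set are rejected or ignored rather than pinned.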

2023-05-17 13:01:25

by Madhavan T. Venkataraman

[permalink] [raw]
Subject: Re: [PATCH v1 3/9] virt: Implement Heki common code

Sorry for the delay. See inline...

On 5/8/23 12:29, Wei Liu wrote:
> On Fri, May 05, 2023 at 05:20:40PM +0200, Mickaël Salaün wrote:
>> From: Madhavan T. Venkataraman <[email protected]>
>>
>> Hypervisor Enforced Kernel Integrity (Heki) is a feature that will use
>> the hypervisor to enhance guest virtual machine security.
>>
>> Configuration
>> =============
>>
>> Define the config variables for the feature. This feature depends on
>> support from the architecture as well as the hypervisor.
>>
>> Enabling HEKI
>> =============
>>
>> Define a kernel command line parameter "heki" to turn the feature on or
>> off. By default, Heki is on.
>
> For such a newfangled feature can we have it off by default? Especially
> when there are unsolved issues around dynamically loaded code.
>

Yes. We can certainly do that.

>>
> [...]
>> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
>> index 3604074a878b..5cf5a7a97811 100644
>> --- a/arch/x86/Kconfig
>> +++ b/arch/x86/Kconfig
>> @@ -297,6 +297,7 @@ config X86
>> select FUNCTION_ALIGNMENT_4B
>> imply IMA_SECURE_AND_OR_TRUSTED_BOOT if EFI
>> select HAVE_DYNAMIC_FTRACE_NO_PATCHABLE
>> + select ARCH_SUPPORTS_HEKI if X86_64
>
> Why is there a restriction on X86_64?
>

We want to get the PoC working and reviewed on X64 first. We have tested this only on X64 so far.

>>
>> config INSTRUCTION_DECODER
>> def_bool y
>> diff --git a/arch/x86/include/asm/sections.h b/arch/x86/include/asm/sections.h
>> index a6e8373a5170..42ef1e33b8a5 100644
>> --- a/arch/x86/include/asm/sections.h
>> +++ b/arch/x86/include/asm/sections.h
> [...]
>>
>> +#ifdef CONFIG_HEKI
>> +
>> +/*
>> + * Gather all of the statically defined sections so heki_late_init() can
>> + * protect these sections in the host page table.
>> + *
>> + * The sections are defined under "SECTIONS" in vmlinux.lds.S
>> + * Keep this array in sync with SECTIONS.
>> + */
>
> This seems a bit fragile, because it requires constant attention from
> people who care about this functionality. Can this table be
> automatically generated?
>

We realize that. But I don't know of a way this can be automatically generated. Also, the permissions for
each section are specific to the use of that section. The developer who introduces a new section is the
one who will know what the permissions should be.

If anyone has any ideas of how we can generate this table automatically, or even just add a build-time check
of some sort, please let us know.

Thanks.

Madhavan

> Thanks,
> Wei.
>
>> +struct heki_va_range __initdata heki_va_ranges[] = {
>> + {
>> + .va_start = _stext,
>> + .va_end = _etext,
>> + .attributes = HEKI_ATTR_MEM_NOWRITE | HEKI_ATTR_MEM_EXEC,
>> + },
>> + {
>> + .va_start = __start_rodata,
>> + .va_end = __end_rodata,
>> + .attributes = HEKI_ATTR_MEM_NOWRITE,
>> + },
>> +#ifdef CONFIG_UNWINDER_ORC
>> + {
>> + .va_start = __start_orc_unwind_ip,
>> + .va_end = __stop_orc_unwind_ip,
>> + .attributes = HEKI_ATTR_MEM_NOWRITE,
>> + },
>> + {
>> + .va_start = __start_orc_unwind,
>> + .va_end = __stop_orc_unwind,
>> + .attributes = HEKI_ATTR_MEM_NOWRITE,
>> + },
>> + {
>> + .va_start = orc_lookup,
>> + .va_end = orc_lookup_end,
>> + .attributes = HEKI_ATTR_MEM_NOWRITE,
>> + },
>> +#endif /* CONFIG_UNWINDER_ORC */
>> +};
>> +

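Short of generating the table, even a minimal consistency check over the array would catch some mistakes, e.g. empty or inverted ranges or entries with no attributes. A hypothetical userspace sketch of such a check (the struct and flag names mirror the quoted patch but are stand-ins, with numeric addresses instead of linker symbols):

```c
#include <stddef.h>

/* Stand-in attribute flags mirroring the quoted patch. */
#define HEKI_ATTR_MEM_NOWRITE (1U << 0)
#define HEKI_ATTR_MEM_EXEC    (1U << 1)

/* Simplified model of struct heki_va_range. */
struct va_range {
	unsigned long va_start;
	unsigned long va_end;
	unsigned long attributes;
};

/* Returns 0 if every range is well-formed: non-empty and carrying at
 * least one attribute. Returns -1 on the first malformed entry. */
static int check_ranges(const struct va_range *ranges, size_t count)
{
	for (size_t i = 0; i < count; i++) {
		if (ranges[i].va_start >= ranges[i].va_end)
			return -1;
		if (!ranges[i].attributes)
			return -1;
	}
	return 0;
}
```

A build-time variant could run the same check in a host tool over the linker-script symbols, but keeping the table itself in sync with vmlinux.lds.S would still be manual.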
2023-05-24 21:03:05

by Madhavan T. Venkataraman

[permalink] [raw]
Subject: Re: [PATCH v1 2/9] KVM: x86/mmu: Add support for prewrite page tracking



On 5/5/23 12:31, Sean Christopherson wrote:
> On Fri, May 05, 2023, Mickaël Salaün wrote:
>>
>> On 05/05/2023 18:28, Sean Christopherson wrote:
>>> I have no doubt that we'll need to solve performance and scaling issues with the
>>> memory attributes implementation, e.g. to utilize xarray multi-range support
>>> instead of storing information on a per-4KiB-page basis, but AFAICT, the core
>>> idea is sound. And a very big positive from a maintenance perspective is that
>>> any optimizations, fixes, etc. for one use case (CoCo vs. hardening) should also
>>> benefit the other use case.
>>>
>>> [1] https://lore.kernel.org/all/[email protected]
>>> [2] https://lore.kernel.org/all/[email protected]
>>> [3] https://lore.kernel.org/all/Y1a1i9vbJ%[email protected]
>>
>> I agree, I used this mechanism because it was easier at first to rely on a
>> previous work, but while I was working on the MBEC support, I realized that
>> it's not the optimal way to do it.
>>
>> I was thinking about using a new special EPT bit similar to
>> EPT_SPTE_HOST_WRITABLE, but it may not be portable though. What do you
>> think?
>
> On x86, SPTEs are even more ephemeral than memslots. E.g. for historical reasons,
> KVM zaps all SPTEs if _any_ memslot is deleted, which is problematic if the guest
> is moving around BARs, using option ROMs, etc.
>
> ARM's pKVM tracks metadata in its stage-2 PTEs, i.e. doesn't need an xarray to
> track attributes, but that works only because pKVM is more privileged than the
> host kernel, and the shared vs. private memory attribute that pKVM cares about
> is very, very restricted in how it can be used and changed.
>
> I tried shoehorning private vs. shared metadata into x86's SPTEs in the past, and
> it ended up being a constant battle with the kernel, e.g. page migration, and with
> KVM itself, e.g. the above memslot mess.

Sorry for the delay in responding to this. I wanted to study the KVM code and fully
understand your comment before responding.

Yes, I quite agree with you. I will make an attempt to address this in the next version.
I am working on it right now.

Thanks.

Madhavan

2023-05-24 21:08:45

by Trilok Soni

[permalink] [raw]
Subject: Re: [RFC PATCH v1 0/9] Hypervisor-Enforced Kernel Integrity

On 5/5/2023 8:20 AM, Mickaël Salaün wrote:
> Hi,
>
> This patch series is a proof-of-concept that implements new KVM features
> (extended page tracking, MBEC support, CR pinning) and defines a new API to
> protect guest VMs. No VMM (e.g., Qemu) modification is required.
>
> The main idea being that kernel self-protection mechanisms should be delegated
> to a more privileged part of the system, hence the hypervisor. It is still the
> role of the guest kernel to request such restrictions according to its

Only for the guest kernel images here? Why not for the host OS kernel?
Embedded devices w/ Android you have mentioned below support the host
OS as well it seems, right?

Do we suggest that all the functionalities should be implemented in the
Hypervisor (NS-EL2 for ARM) or even at Secure EL like Secure-EL1 (ARM).

I am hoping that whatever we suggest the interface here from the Guest
to the Hypervisor becomes the ABI right?


>
> # Current limitations
>
> The main limitation of this patch series is the statically enforced
> permissions. This is not an issue for kernels without modules, but this needs to
> be addressed. Mechanisms that dynamically impact kernel executable memory are
> not handled for now (e.g., kernel modules, tracepoints, eBPF JIT), and such
> code will need to be authenticated. Because the hypervisor is highly
> privileged and critical to the security of all the VMs, we don't want to
> implement a code authentication mechanism in the hypervisor itself but delegate
> this verification to something much less privileged. We are thinking of two
> ways to solve this: implement this verification in the VMM or spawn a dedicated
> special VM (similar to Windows's VBS). There are pros and cons to each approach:
> complexity, verification code ownership (guest's or VMM's), access to guest
> memory (i.e., confidential computing).

Do you foresee performance regressions due to a lot of tracking here?
Production kernels do have a lot of tracepoints, and we use them as a feature
in the GKI kernel for the vendor hooks implementation; in those cases
every vendor driver is a module. A separate VM further fragments this
design and delegates more of it to proprietary solutions?

Do you have any performance numbers w/ current RFC?

---Trilok Soni

2023-05-24 22:54:12

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [RFC PATCH v1 0/9] Hypervisor-Enforced Kernel Integrity

On Fri, 2023-05-05 at 17:20 +0200, Mickaël Salaün wrote:
> # How does it work?
>
> This implementation mainly leverages KVM capabilities to control the
> Second
> Layer Address Translation (or the Two Dimensional Paging e.g.,
> Intel's EPT or
> AMD's RVI/NPT) and Mode Based Execution Control (Intel's MBEC)
> introduced with
> the Kaby Lake (7th generation) architecture. This allows to set
> permissions on
> memory pages in a complementary way to the guest kernel's managed
> memory
> permissions. Once these permissions are set, they are locked and
> there is no
> way back.
>
> A first KVM_HC_LOCK_MEM_PAGE_RANGES hypercall enables the guest
> kernel to lock
> a set of its memory page ranges with either the HEKI_ATTR_MEM_NOWRITE
> or the
> HEKI_ATTR_MEM_EXEC attribute. The first one denies write access to a
> specific
> set of pages (allow-list approach), and the second only allows kernel
> execution
> for a set of pages (deny-list approach).
>
> The current implementation sets the whole kernel's .rodata (i.e., any
> const or
> __ro_after_init variables, which includes critical security data such
> as LSM
> parameters) and .text sections as non-writable, and the .text section
> is the
> only one where kernel execution is allowed. This is possible thanks
> to the new
> MBEC support also brought by this series (otherwise the vDSO would
> have to be
> executable). Thanks to this hardware support (VT-x, EPT and MBEC),
> the
> performance impact of such guest protection is negligible.
>
> The second KVM_HC_LOCK_CR_UPDATE hypercall enables guests to pin some
> of its
> CPU control register flags (e.g., X86_CR0_WP, X86_CR4_SMEP,
> X86_CR4_SMAP),
> which is another complementary hardening mechanism.
>
> Heki can be enabled with the heki=1 boot command argument.
>
>

Can the guest kernel ask the host VMM's emulated devices to DMA into
the protected data? It should go through the host userspace mappings I
think, which don't care about EPT permissions. Or did I miss where you
are protecting that another way? There are a lot of easy ways to ask
the host to write to guest memory that don't involve the EPT. You
probably need to protect the host userspace mappings, and also the
places in KVM that kmap a GPA provided by the guest.

[ snip ]

>
> # Current limitations
>
> The main limitation of this patch series is the statically enforced
> permissions. This is not an issue for kernels without modules, but this
> needs to
> be addressed.  Mechanisms that dynamically impact kernel executable
> memory are
> not handled for now (e.g., kernel modules, tracepoints, eBPF JIT),
> and such
> code will need to be authenticated.  Because the hypervisor is highly
> privileged and critical to the security of all the VMs, we don't want
> to
> implement a code authentication mechanism in the hypervisor itself
> but delegate
> this verification to something much less privileged. We are thinking
> of two
> ways to solve this: implement this verification in the VMM or spawn a
> dedicated
> special VM (similar to Windows's VBS). There are pros and cons to each
> approach:
> complexity, verification code ownership (guest's or VMM's), access to
> guest
> memory (i.e., confidential computing).

The kernel often creates writable aliases in order to write to
protected data (kernel text, etc). Some of this is done right as text
is being first written out (alternatives for example), and some happens
way later (jump labels, etc). So for verification, I wonder what stage
you would be verifying? If you want to verify the end state, you would
have to maintain knowledge in the verifier of all the touch-ups the
kernel does. I think it would get very tricky.

It also seems it will be a decent ask for the guest kernel to keep
track of GPA permissions as well as normal virtual memory permissions,
if this thing is not widely used.

So I'm wondering if you could go in two directions with this:
1. Make this a feature only for super locked down kernels (no modules,
etc). Forbid any configurations that might modify text. But eBPF is
used for seccomp, so you might be turning off some security protections
to get this.
2. Loosen the rules to allow the protections to not be so one-way
enable. Get less security, but used more widely.

There were similar dilemmas with the PV CR pinning stuff.
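
For reference, the semantics of the two memory attributes described in the cover letter can be modeled in plain C: HEKI_ATTR_MEM_NOWRITE denies writes to a locked range (allow-list), and once an HEKI_ATTR_MEM_EXEC range is locked, execution is only allowed inside such ranges (deny-by-default). A single-range userspace sketch, with stand-in names, assuming this reading of the cover letter:

```c
#include <stdbool.h>

#define HEKI_ATTR_MEM_NOWRITE (1U << 0)
#define HEKI_ATTR_MEM_EXEC    (1U << 1)

/* One locked guest-physical range [start, end) with its attributes. */
struct locked_range {
	unsigned long start, end;
	unsigned int attr;
};

static bool in_range(const struct locked_range *r, unsigned long gpa)
{
	return gpa >= r->start && gpa < r->end;
}

/* A write is denied only when it hits a NOWRITE range (allow-list). */
static bool write_allowed(const struct locked_range *r, unsigned long gpa)
{
	return !(in_range(r, gpa) && (r->attr & HEKI_ATTR_MEM_NOWRITE));
}

/* Once an EXEC range is locked, execution outside it is denied
 * (deny-by-default); with no EXEC policy installed, everything may
 * execute. */
static bool exec_allowed(const struct locked_range *r, unsigned long gpa)
{
	if (!(r->attr & HEKI_ATTR_MEM_EXEC))
		return true;
	return in_range(r, gpa);
}
```

This model only captures the guest-facing policy; as noted above, it says nothing about host userspace mappings or KVM's own kmaps of guest memory, which bypass the EPT permissions.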

2023-05-25 01:21:56

by Trilok Soni

[permalink] [raw]
Subject: Re: [RFC PATCH v1 0/9] Hypervisor-Enforced Kernel Integrity

On 5/24/2023 3:20 PM, Edgecombe, Rick P wrote:
> On Fri, 2023-05-05 at 17:20 +0200, Mickaël Salaün wrote:
>> # How does it work?
>>
>> This implementation mainly leverages KVM capabilities to control the
>> Second
>> Layer Address Translation (or the Two Dimensional Paging e.g.,
>> Intel's EPT or
>> AMD's RVI/NPT) and Mode Based Execution Control (Intel's MBEC)
>> introduced with
>> the Kaby Lake (7th generation) architecture. This allows to set
>> permissions on
>> memory pages in a complementary way to the guest kernel's managed
>> memory
>> permissions. Once these permissions are set, they are locked and
>> there is no
>> way back.
>>
>> A first KVM_HC_LOCK_MEM_PAGE_RANGES hypercall enables the guest
>> kernel to lock
>> a set of its memory page ranges with either the HEKI_ATTR_MEM_NOWRITE
>> or the
>> HEKI_ATTR_MEM_EXEC attribute. The first one denies write access to a
>> specific
>> set of pages (allow-list approach), and the second only allows kernel
>> execution
>> for a set of pages (deny-list approach).
>>
>> The current implementation sets the whole kernel's .rodata (i.e., any
>> const or
>> __ro_after_init variables, which includes critical security data such
>> as LSM
>> parameters) and .text sections as non-writable, and the .text section
>> is the
>> only one where kernel execution is allowed. This is possible thanks
>> to the new
>> MBEC support also brought by this series (otherwise the vDSO would
>> have to be
>> executable). Thanks to this hardware support (VT-x, EPT and MBEC),
>> the
>> performance impact of such guest protection is negligible.
>>
>> The second KVM_HC_LOCK_CR_UPDATE hypercall enables guests to pin some
>> of its
>> CPU control register flags (e.g., X86_CR0_WP, X86_CR4_SMEP,
>> X86_CR4_SMAP),
>> which is another complementary hardening mechanism.
>>
>> Heki can be enabled with the heki=1 boot command argument.
>>
>>
>
> Can the guest kernel ask the host VMM's emulated devices to DMA into
> the protected data? It should go through the host userspace mappings I
> think, which don't care about EPT permissions. Or did I miss where you
> are protecting that another way? There are a lot of easy ways to ask
> the host to write to guest memory that don't involve the EPT. You
> probably need to protect the host userspace mappings, and also the
> places in KVM that kmap a GPA provided by the guest.
>
> [ snip ]
>
>>
>> # Current limitations
>>
>> The main limitation of this patch series is the statically enforced
>> permissions. This is not an issue for kernels without modules, but this
>> needs to
>> be addressed.  Mechanisms that dynamically impact kernel executable
>> memory are
>> not handled for now (e.g., kernel modules, tracepoints, eBPF JIT),
>> and such
>> code will need to be authenticated.  Because the hypervisor is highly
>> privileged and critical to the security of all the VMs, we don't want
>> to
>> implement a code authentication mechanism in the hypervisor itself
>> but delegate
>> this verification to something much less privileged. We are thinking
>> of two
>> ways to solve this: implement this verification in the VMM or spawn a
>> dedicated
>> special VM (similar to Windows's VBS). There are pros and cons to each
>> approach:
>> complexity, verification code ownership (guest's or VMM's), access to
>> guest
>> memory (i.e., confidential computing).
>
> The kernel often creates writable aliases in order to write to
> protected data (kernel text, etc). Some of this is done right as text
> is being first written out (alternatives for example), and some happens
> way later (jump labels, etc). So for verification, I wonder what stage
> you would be verifying? If you want to verify the end state, you would
> have to maintain knowledge in the verifier of all the touch-ups the
> kernel does. I think it would get very tricky.

Right, and on ARM (from what I know) errata can be applied
using the alternatives framework when you hotplug in a CPU post boot.

---Trilok Soni

2023-05-25 13:40:04

by Mickaël Salaün

[permalink] [raw]
Subject: Re: [RFC PATCH v1 0/9] Hypervisor-Enforced Kernel Integrity


On 24/05/2023 23:04, Trilok Soni wrote:
> On 5/5/2023 8:20 AM, Mickaël Salaün wrote:
>> Hi,
>>
>> This patch series is a proof-of-concept that implements new KVM features
>> (extended page tracking, MBEC support, CR pinning) and defines a new API to
>> protect guest VMs. No VMM (e.g., Qemu) modification is required.
>>
>> The main idea being that kernel self-protection mechanisms should be delegated
>> to a more privileged part of the system, hence the hypervisor. It is still the
>> role of the guest kernel to request such restrictions according to its
>
> Only for the guest kernel images here? Why not for the host OS kernel?

As explained in the Future work section, protecting the host would be
useful, but that doesn't really fit with the KVM model. The Protected
KVM project is a first step to help in this direction [11].

In a nutshell, KVM is close to a type-2 hypervisor, and the host kernel
is also part of the hypervisor.


> Embedded devices w/ Android you have mentioned below support the host
> OS as well it seems, right?

What do you mean?


>
> Do we suggest that all the functionalities should be implemented in the
> Hypervisor (NS-EL2 for ARM) or even at Secure EL like Secure-EL1 (ARM).

KVM runs in EL2. TrustZone is mainly used to enforce DRM, which means
that we may not control the related code.

This patch series is dedicated to hypervisor-enforced kernel integrity,
then KVM.

>
> I am hoping that whatever we suggest the interface here from the Guest
> to the Hypervisor becomes the ABI right?

Yes, hypercalls are part of the KVM ABI.

>
>
>>
>> # Current limitations
>>
>> The main limitation of this patch series is the statically enforced
>> permissions. This is not an issue for kernels without modules, but this needs to
>> be addressed. Mechanisms that dynamically impact kernel executable memory are
>> not handled for now (e.g., kernel modules, tracepoints, eBPF JIT), and such
>> code will need to be authenticated. Because the hypervisor is highly
>> privileged and critical to the security of all the VMs, we don't want to
>> implement a code authentication mechanism in the hypervisor itself but delegate
>> this verification to something much less privileged. We are thinking of two
>> ways to solve this: implement this verification in the VMM or spawn a dedicated
>> special VM (similar to Windows's VBS). There are pros and cons to each approach:
>> complexity, verification code ownership (guest's or VMM's), access to guest
>> memory (i.e., confidential computing).
>
> Do you foresee performance regressions due to a lot of tracking here?

The performance impact of execution prevention should be negligible
because once configured, the hypervisor does nothing except catch
illegitimate access attempts.


> Production kernels do have a lot of tracepoints and we use them as a feature
> in the GKI kernel for the vendor hooks implementation and in those cases
> every vendor driver is a module.

As explained in this section, dynamic kernel modifications such as
tracepoints or modules are not currently supported by this patch series.
Handling tracepoints is possible but requires more work to define and
check legitimate changes. This proposal is still useful for static
kernels though.


> Separate VM further fragments this
> design and delegates more of it to proprietary solutions?

What do you mean? KVM is not a proprietary solution.

For dynamic checks, this would require code not run by KVM itself, but
either the VMM or a dedicated VM. In this case, the dynamic
authentication code could come from the guest VM or from the VMM itself.
In the former case, it is more challenging from a security point of view
but doesn't rely on external (proprietary) solution. In the latter case,
open-source VMMs should implement the specification to provide the
required service (e.g. check kernel module signature).

The goal of the common API layer provided by this RFC is to share code
as much as possible between different hypervisor backends.


>
> Do you have any performance numbers w/ current RFC?

No, but the only hypervisor performance impact is at boot time and
should be negligible. I'll try to get some numbers for the
hardware-enforcement impact, but it should be negligible too.

2023-05-25 14:22:05

by Mickaël Salaün

[permalink] [raw]
Subject: Re: [RFC PATCH v1 0/9] Hypervisor-Enforced Kernel Integrity


On 25/05/2023 00:20, Edgecombe, Rick P wrote:
> On Fri, 2023-05-05 at 17:20 +0200, Mickaël Salaün wrote:
>> # How does it work?
>>
>> This implementation mainly leverages KVM capabilities to control the
>> Second
>> Layer Address Translation (or the Two Dimensional Paging e.g.,
>> Intel's EPT or
>> AMD's RVI/NPT) and Mode Based Execution Control (Intel's MBEC)
>> introduced with
>> the Kaby Lake (7th generation) architecture. This allows to set
>> permissions on
>> memory pages in a complementary way to the guest kernel's managed
>> memory
>> permissions. Once these permissions are set, they are locked and
>> there is no
>> way back.
>>
>> A first KVM_HC_LOCK_MEM_PAGE_RANGES hypercall enables the guest
>> kernel to lock
>> a set of its memory page ranges with either the HEKI_ATTR_MEM_NOWRITE
>> or the
>> HEKI_ATTR_MEM_EXEC attribute. The first one denies write access to a
>> specific
>> set of pages (allow-list approach), and the second only allows kernel
>> execution
>> for a set of pages (deny-list approach).
>>
>> The current implementation sets the whole kernel's .rodata (i.e., any
>> const or
>> __ro_after_init variables, which includes critical security data such
>> as LSM
>> parameters) and .text sections as non-writable, and the .text section
>> is the
>> only one where kernel execution is allowed. This is possible thanks
>> to the new
>> MBEC support also brought by this series (otherwise the vDSO would
>> have to be
>> executable). Thanks to this hardware support (VT-x, EPT and MBEC),
>> the
>> performance impact of such guest protection is negligible.
>>
>> The second KVM_HC_LOCK_CR_UPDATE hypercall enables guests to pin some
>> of its
>> CPU control register flags (e.g., X86_CR0_WP, X86_CR4_SMEP,
>> X86_CR4_SMAP),
>> which is another complementary hardening mechanism.
>>
>> Heki can be enabled with the heki=1 boot command argument.
>>
>>
>
> Can the guest kernel ask the host VMM's emulated devices to DMA into
> the protected data? It should go through the host userspace mappings I
> think, which don't care about EPT permissions. Or did I miss where you
> are protecting that another way? There are a lot of easy ways to ask
> the host to write to guest memory that don't involve the EPT. You
> probably need to protect the host userspace mappings, and also the
> places in KVM that kmap a GPA provided by the guest.

Good point, I'll check this confused deputy attack. Extended KVM
protections should indeed handle all ways to map guests' memory. I'm
wondering if current VMMs would gracefully handle such new restrictions
though.


>
> [ snip ]
>
>>
>> # Current limitations
>>
>> The main limitation of this patch series is the statically enforced
>> permissions. This is not an issue for kernels without modules, but this
>> needs to
>> be addressed.  Mechanisms that dynamically impact kernel executable
>> memory are
>> not handled for now (e.g., kernel modules, tracepoints, eBPF JIT),
>> and such
>> code will need to be authenticated.  Because the hypervisor is highly
>> privileged and critical to the security of all the VMs, we don't want
>> to
>> implement a code authentication mechanism in the hypervisor itself
>> but delegate
>> this verification to something much less privileged. We are thinking
>> of two
>> ways to solve this: implement this verification in the VMM or spawn a
>> dedicated
>> special VM (similar to Windows's VBS). There are pros and cons to each
>> approach:
>> complexity, verification code ownership (guest's or VMM's), access to
>> guest
>> memory (i.e., confidential computing).
>
> The kernel often creates writable aliases in order to write to
> protected data (kernel text, etc). Some of this is done right as text
> is being first written out (alternatives for example), and some happens
> way later (jump labels, etc). So for verification, I wonder what stage
> you would be verifying? If you want to verify the end state, you would
> have to maintain knowledge in the verifier of all the touch-ups the
> kernel does. I think it would get very tricky.

For now, in the static kernel case, all rodata and text GPA is
restricted, so aliasing such memory in a writable way before or after
the KVM enforcement would still restrict write access to this memory,
which could be an issue but not a security one. Do you have such
examples in mind?


>
> It also seems it will be a decent ask for the guest kernel to keep
> track of GPA permissions as well as normal virtual memory permissions,
> if this thing is not widely used.

This would indeed be required to properly handle the dynamic cases.


>
> So I'm wondering if you could go in two directions with this:
> 1. Make this a feature only for super locked down kernels (no modules,
> etc). Forbid any configurations that might modify text. But eBPF is
> used for seccomp, so you might be turning off some security protections
> to get this.

Good idea. For "super locked down kernels" :) , we should disable all
kernel executable changes with the related kernel build configuration
(e.g. eBPF JIT, kernel module, kprobes…) to make sure there is no such
legitimate access. This looks like an acceptable initial feature.


> 2. Loosen the rules to allow the protections to not be so one-way
> enable. Get less security, but used more widely.

This is our goal. I think both static and dynamic cases are legitimate
and have value according to the level of security sought. This should be
a build-time configuration.

>
> There were similar dilemmas with the PV CR pinning stuff.

2023-05-25 16:01:38

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [RFC PATCH v1 0/9] Hypervisor-Enforced Kernel Integrity

On Thu, 2023-05-25 at 15:59 +0200, Mickaël Salaün wrote:
[ snip ]

> > The kernel often creates writable aliases in order to write to
> > protected data (kernel text, etc). Some of this is done right as
> > text
> > is being first written out (alternatives for example), and some
> > happens
> > way later (jump labels, etc). So for verification, I wonder what
> > stage
> > you would be verifying? If you want to verify the end state, you
> > would
> > have to maintain knowledge in the verifier of all the touch-ups the
> > kernel does. I think it would get very tricky.
>
> For now, in the static kernel case, all rodata and text GPA is
> restricted, so aliasing such memory in a writable way before or after
> the KVM enforcement would still restrict write access to this memory,
> which could be an issue but not a security one. Do you have such
> examples in mind?
>

On x86, look at all the callers of the text_poke() family. In
arch/x86/include/asm/text-patching.h.

>
> >
> > It also seems it will be a decent ask for the guest kernel to keep
> > track of GPA permissions as well as normal virtual memory
> > permissions,
> > if this thing is not widely used.
>
> This would indeed be required to properly handle the dynamic cases.
>
>
> >
> > So I'm wondering if you could go in two directions with this:
> > 1. Make this a feature only for super locked down kernels (no
> > modules,
> > etc). Forbid any configurations that might modify text. But eBPF is
> > used for seccomp, so you might be turning off some security
> > protections
> > to get this.
>
> Good idea. For "super locked down kernels" :) , we should disable all
> kernel executable changes with the related kernel build configuration
> (e.g. eBPF JIT, kernel module, kprobes…) to make sure there is no
> such
> legitimate access. This looks like an acceptable initial feature.

How many users do you think will want this protection but not
protections that would have to be disabled? The main one that came to
mind for me is cBPF seccomp stuff.

But also, the alternative to JITing cBPF is the eBPF interpreter which
AFAIU is considered a juicy enough target for speculative attacks that
they created an option to compile it out. And leaving an interpreter in
the kernel means any data could be "executed" in the normal non-
speculative scenario, kind of working around the hypervisor executable
protections. Dropping e/cBPF entirely would be an option, but then I
wonder how many users you have left. Hopefully that is all correct,
it's hard to keep track with the pace of BPF development.

I wonder if it might be a good idea to POC the guest side before
settling on the KVM interface. Then you can also look at the whole
thing and judge how much usage it would get for the different options
of restrictions.

>
>
> > 2. Loosen the rules to allow the protections to not be so one-way
> > enable. Get less security, but used more widely.
>
> This is our goal. I think both static and dynamic cases are
> legitimate
> and have value according to the level of security sought. This should
> be
> a build-time configuration.

Yea, the proper way to do this is probably to move all text handling
stuff into a separate domain of some sort, like you mentioned
elsewhere. It would be quite a job.

2023-05-25 16:33:02

by Sean Christopherson

[permalink] [raw]
Subject: Re: [RFC PATCH v1 0/9] Hypervisor-Enforced Kernel Integrity

On Thu, May 25, 2023, Rick P Edgecombe wrote:
> I wonder if it might be a good idea to POC the guest side before
> settling on the KVM interface. Then you can also look at the whole
> thing and judge how much usage it would get for the different options
> of restrictions.

As I said earlier[*], IMO the control plane logic needs to live in host userspace.
I think any attempt to have KVM provide anything but the low level plumbing will
suffer the same fate as CR4 pinning and XO memory. Iterating on an imperfect
solution to incrementally improve security is far, far easier to do in userspace,
and far more likely to get merged.

[*] https://lore.kernel.org/all/[email protected]

2023-05-25 19:39:44

by Trilok Soni

[permalink] [raw]
Subject: Re: [RFC PATCH v1 0/9] Hypervisor-Enforced Kernel Integrity

On 5/25/2023 6:25 AM, Mickaël Salaün wrote:
>
> On 24/05/2023 23:04, Trilok Soni wrote:
>> On 5/5/2023 8:20 AM, Mickaël Salaün wrote:
>>> Hi,
>>>
>>> This patch series is a proof-of-concept that implements new KVM features
>>> (extended page tracking, MBEC support, CR pinning) and defines a new
>>> API to
>>> protect guest VMs. No VMM (e.g., Qemu) modification is required.
>>>
>>> The main idea being that kernel self-protection mechanisms should be
>>> delegated
>>> to a more privileged part of the system, hence the hypervisor. It is
>>> still the
>>> role of the guest kernel to request such restrictions according to its
>>
>> Only for the guest kernel images here? Why not for the host OS kernel?
>
> As explained in the Future work section, protecting the host would be
> useful, but that doesn't really fit with the KVM model. The Protected
> KVM project is a first step to help in this direction [11].
>
> In a nutshell, KVM is close to a type-2 hypervisor, and the host kernel
> is also part of the hypervisor.
>
>
>> Embedded devices w/ Android you have mentioned below supports the host
>> OS as well it seems, right?
>
> What do you mean?

I think you have answered this above w/ pKVM, and I was referring to the
host protection as well w/ Heki. The links/references below refer to the
Android OS, it seems, and not a guest VM.

>
>
>>
>> Do we suggest that all the functionalities should be implemented in the
>> Hypervisor (NS-EL2 for ARM) or even at Secure EL like Secure-EL1 (ARM).
>
> KVM runs in EL2. TrustZone is mainly used to enforce DRM, which means
> that we may not control the related code.
>
> This patch series is dedicated to hypervisor-enforced kernel integrity,
> then KVM.
>
>>
>> I am hoping that whatever we suggest the interface here from the Guest
>> to the Hypervisor becomes the ABI right?
>
> Yes, hypercalls are part of the KVM ABI.

Sure. I just hope that they are extensible enough to support other
hypervisors too. I am not sure if others like ACRN / Xen are on this list
to see if it fits their needs too.

Is there any other Hypervisor you plan to test this feature as well?

>
>>
>>
>>>
>>> # Current limitations
>>>
>>> The main limitation of this patch series is the statically enforced
>>> permissions. This is not an issue for kernels without modules but this
>>> needs to
>>> be addressed.  Mechanisms that dynamically impact kernel executable
>>> memory are
>>> not handled for now (e.g., kernel modules, tracepoints, eBPF JIT),
>>> and such
>>> code will need to be authenticated.  Because the hypervisor is highly
>>> privileged and critical to the security of all the VMs, we don't want to
>>> implement a code authentication mechanism in the hypervisor itself
>>> but delegate
>>> this verification to something much less privileged. We are thinking
>>> of two
>>> ways to solve this: implement this verification in the VMM or spawn a
>>> dedicated
>>> special VM (similar to Windows's VBS). There are pros and cons to each
>>> approach:
>>> complexity, verification code ownership (guest's or VMM's), access to
>>> guest
>>> memory (i.e., confidential computing).
>>
>> Do you foresee the performance regressions due to lot of tracking here?
>
> The performance impact of execution prevention should be negligible
> because once configured the hypervisor does nothing except catch
> illegitimate access attempts.

Yes, if you are using the static kernel only and not considering the
other dynamic patching features as explained. They need to be thought
about differently to reduce the likely impact.

>
>
>> Production kernels do have lot of tracepoints and we use it as feature
>> in the GKI kernel for the vendor hooks implementation and in those cases
>> every vendor driver is a module.
>
> As explained in this section, dynamic kernel modifications such as
> tracepoints or modules are not currently supported by this patch series.
> Handling tracepoints is possible but requires more work to define and
> check legitimate changes. This proposal is still useful for static
> kernels though.
>
>
>> Separate VM further fragments this
>> design and delegates more of it to proprietary solutions?
>
> What do you mean? KVM is not a proprietary solution.

Ah, I was referring to the VBS Windows VM mentioned in the above text. Is
it open-source? The reference to a VM (or dedicated VM) didn't mention
that the VM itself would be open-source and running a Linux kernel.

>
> For dynamic checks, this would require code not run by KVM itself, but
> either the VMM or a dedicated VM. In this case, the dynamic
> authentication code could come from the guest VM or from the VMM itself.
> In the former case, it is more challenging from a security point of view
> but doesn't rely on external (proprietary) solution. In the latter case,
> open-source VMMs should implement the specification to provide the
> required service (e.g. check kernel module signature).
>
> The goal of the common API layer provided by this RFC is to share code
> as much as possible between different hypervisor backends.
>
>
>>
>> Do you have any performance numbers w/ current RFC?
>
> No, but the only hypervisor performance impact is at boot time and
> should be negligible. I'll try to get some numbers for the
> hardware-enforcement impact, but it should be negligible too.

Thanks. Please share the data once you have it ready.

---Trilok Soni


2023-05-25 19:45:26

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [RFC PATCH v1 0/9] Hypervisor-Enforced Kernel Integrity

On Thu, 2023-05-25 at 09:07 -0700, Sean Christopherson wrote:
> On Thu, May 25, 2023, Rick P Edgecombe wrote:
> > I wonder if it might be a good idea to POC the guest side before
> > settling on the KVM interface. Then you can also look at the whole
> > thing and judge how much usage it would get for the different
> > options
> > of restrictions.
>
> As I said earlier[*], IMO the control plane logic needs to live in
> host userspace.
> I think any attempt to have KVM provide anything but the low level
> plumbing will
> suffer the same fate as CR4 pinning and XO memory.  Iterating on an
> imperfect
> solution to incrementally improve security is far, far easier to do in
> userspace,
> and far more likely to get merged.
>
> [*] https://lore.kernel.org/all/[email protected]

Sure, I should have put it more generally. I just meant people are not
going to want to maintain host-based features that guests can't
effectively use.

My takeaway from the CR pinning was that the guest kernel integration
was surprisingly tricky due to the one-way nature of the interface. XO
was more flexible than CR pinning in that respect, because the guest
could turn it off (and indeed, in the XO kernel text patches it had to
do this a lot).

2023-05-26 02:37:09

by James Morris

[permalink] [raw]
Subject: Re: [RFC PATCH v1 0/9] Hypervisor-Enforced Kernel Integrity

[Side topic]

Would folks be interested in a Linux Plumbers Conference MC on this
topic generally, across different hypervisors, VMMs, and architectures?

If so, please let me know who the key folk would be and we can try writing
up an MC proposal.



--
James Morris
<[email protected]>

2023-05-26 15:36:41

by Mickaël Salaün

[permalink] [raw]
Subject: Re: [RFC PATCH v1 0/9] Hypervisor-Enforced Kernel Integrity


On 25/05/2023 15:59, Mickaël Salaün wrote:
>
> On 25/05/2023 00:20, Edgecombe, Rick P wrote:
>> On Fri, 2023-05-05 at 17:20 +0200, Mickaël Salaün wrote:
>>> # How does it work?
>>>
>>> This implementation mainly leverages KVM capabilities to control the
>>> Second
>>> Layer Address Translation (or the Two Dimensional Paging e.g.,
>>> Intel's EPT or
>>> AMD's RVI/NPT) and Mode Based Execution Control (Intel's MBEC)
>>> introduced with
>>> the Kaby Lake (7th generation) architecture. This allows to set
>>> permissions on
>>> memory pages in a complementary way to the guest kernel's managed
>>> memory
>>> permissions. Once these permissions are set, they are locked and
>>> there is no
>>> way back.
>>>
>>> A first KVM_HC_LOCK_MEM_PAGE_RANGES hypercall enables the guest
>>> kernel to lock
>>> a set of its memory page ranges with either the HEKI_ATTR_MEM_NOWRITE
>>> or the
>>> HEKI_ATTR_MEM_EXEC attribute. The first one denies write access to a
>>> specific
>>> set of pages (allow-list approach), and the second only allows kernel
>>> execution
>>> for a set of pages (deny-list approach).
>>>
>>> The current implementation sets the whole kernel's .rodata (i.e., any
>>> const or
>>> __ro_after_init variables, which includes critical security data such
>>> as LSM
>>> parameters) and .text sections as non-writable, and the .text section
>>> is the
>>> only one where kernel execution is allowed. This is possible thanks
>>> to the new
>>> MBEC support also brought by this series (otherwise the vDSO would
>>> have to be
>>> executable). Thanks to this hardware support (VT-x, EPT and MBEC),
>>> the
>>> performance impact of such guest protection is negligible.
>>>
>>> The second KVM_HC_LOCK_CR_UPDATE hypercall enables guests to pin some
>>> of its
>>> CPU control register flags (e.g., X86_CR0_WP, X86_CR4_SMEP,
>>> X86_CR4_SMAP),
>>> which is another complementary hardening mechanism.
>>>
>>> Heki can be enabled with the heki=1 boot command argument.
>>>
>>>
>>
>> Can the guest kernel ask the host VMM's emulated devices to DMA into
>> the protected data? It should go through the host userspace mappings I
>> think, which don't care about EPT permissions. Or did I miss where you
>> are protecting that another way? There are a lot of easy ways to ask
>> the host to write to guest memory that don't involve the EPT. You
>> probably need to protect the host userspace mappings, and also the
>> places in KVM that kmap a GPA provided by the guest.
>
> Good point, I'll check this confused deputy attack. Extended KVM
> protections should indeed handle all ways to map guests' memory. I'm
> wondering if current VMMs would gracefully handle such new restrictions
> though.

I guess the host could map arbitrary data to the guest, so that needs to
be handled, but how could the VMM (not the host kernel) bypass or update
the EPT initially used for the guest (and potentially later mapped to the host)?

2023-05-26 15:41:39

by Mickaël Salaün

[permalink] [raw]
Subject: Re: [RFC PATCH v1 0/9] Hypervisor-Enforced Kernel Integrity


On 25/05/2023 17:52, Edgecombe, Rick P wrote:
> On Thu, 2023-05-25 at 15:59 +0200, Mickaël Salaün wrote:
> [ snip ]
>
>>> The kernel often creates writable aliases in order to write to
>>> protected data (kernel text, etc). Some of this is done right as
>>> text
>>> is being first written out (alternatives for example), and some
>>> happens
>>> way later (jump labels, etc). So for verification, I wonder what
>>> stage
>>> you would be verifying? If you want to verify the end state, you
>>> would
>>> have to maintain knowledge in the verifier of all the touch-ups the
>>> kernel does. I think it would get very tricky.
>>
>> For now, in the static kernel case, all rodata and text GPA is
>> restricted, so aliasing such memory in a writable way before or after
>> the KVM enforcement would still restrict write access to this memory,
>> which could be an issue but not a security one. Do you have such
>> examples in mind?
>>
>
> On x86, look at all the callers of the text_poke() family. In
> arch/x86/include/asm/text-patching.h.

OK, thanks!


>
>>
>>>
>>> It also seems it will be a decent ask for the guest kernel to keep
>>> track of GPA permissions as well as normal virtual memory
>>> permissions,
>>> if this thing is not widely used.
>>
>> This would indeed be required to properly handle the dynamic cases.
>>
>>
>>>
>>> So I wondering if you could go in two directions with this:
>>> 1. Make this a feature only for super locked down kernels (no
>>> modules,
>>> etc). Forbid any configurations that might modify text. But eBPF is
>>> used for seccomp, so you might be turning off some security
>>> protections
>>> to get this.
>>
>> Good idea. For "super locked down kernels" :) , we should disable all
>> kernel executable changes with the related kernel build configuration
>> (e.g. eBPF JIT, kernel module, kprobes…) to make sure there is no
>> such
>> legitimate access. This looks like an acceptable initial feature.
>
> How many users do you think will want this protection but not
> protections that would have to be disabled? The main one that came to
> mind for me is cBPF seccomp stuff.
>
> But also, the alternative to JITing cBPF is the eBPF interpreter which
> AFAIU is considered a juicy enough target for speculative attacks that
> they created an option to compile it out. And leaving an interpreter in
> the kernel means any data could be "executed" in the normal non-
> speculative scenario, kind of working around the hypervisor executable
> protections. Dropping e/cBPF entirely would be an option, but then I
> wonder how many users you have left. Hopefully that is all correct,
> it's hard to keep track with the pace of BPF development.

seccomp-bpf doesn't rely on JIT, so it is not an issue. For eBPF, JIT is
optional, but other text changes may be required depending on the eBPF
program type (e.g. using kprobes).


>
> I wonder if it might be a good idea to POC the guest side before
> settling on the KVM interface. Then you can also look at the whole
> thing and judge how much usage it would get for the different options
> of restrictions.

The next step is to handle dynamic permissions, but it will be easier to
first implement that in KVM itself (which already has the required
authentication code). The current interface may be flexible enough
though; only new attribute flags (and potentially an async mode) should
be required. Anyway, this will enable us to look at the whole thing.


>
>>
>>
>>> 2. Loosen the rules to allow the protections to not be so one-way
>>> enable. Get less security, but used more widely.
>>
>> This is our goal. I think both static and dynamic cases are
>> legitimate
>> and have value according to the level of security sought. This should
>> be
>> a build-time configuration.
>
> Yea, the proper way to do this is probably to move all text handling
> stuff into a separate domain of some sort, like you mentioned
> elsewhere. It would be quite a job.

Not necessarily to move this code, but to make sure that the changes are
legitimate (e.g. text signatures, legitimate addresses). This doesn't
need to be perfect but it should improve the current state by increasing
the cost of attacks.

2023-05-26 16:57:04

by Mickaël Salaün

[permalink] [raw]
Subject: Re: [PATCH v1 6/9] KVM: x86: Add Heki hypervisor support


On 08/05/2023 23:18, Wei Liu wrote:
> On Fri, May 05, 2023 at 05:20:43PM +0200, Mickaël Salaün wrote:
>> From: Madhavan T. Venkataraman <[email protected]>
>>
>> Each supported hypervisor in x86 implements a struct x86_hyper_init to
>> define the init functions for the hypervisor. Define a new init_heki()
>> entry point in struct x86_hyper_init. Hypervisors that support Heki
>> must define this init_heki() function. Call init_heki() of the chosen
>> hypervisor in init_hypervisor_platform().
>>
>> Create a heki_hypervisor structure that each hypervisor can fill
>> with its data and functions. This will allow the Heki feature to work
>> in a hypervisor agnostic way.
>>
>> Declare and initialize a "heki_hypervisor" structure for KVM so KVM can
>> support Heki. Define the init_heki() function for KVM. In init_heki(),
>> set the hypervisor field in the generic "heki" structure to the KVM
>> "heki_hypervisor". After this point, generic Heki code can access the
>> KVM Heki data and functions.
>>
> [...]
>> +static void kvm_init_heki(void)
>> +{
>> + long err;
>> +
>> + if (!kvm_para_available())
>> + /* Cannot make KVM hypercalls. */
>> + return;
>> +
>> + err = kvm_hypercall3(KVM_HC_LOCK_MEM_PAGE_RANGES, -1, -1, -1);
>
> Why not do a proper version check or capability check here? If the ABI
> or supported features ever change then we have something to rely on?

The attributes will indeed get extended, but I wanted to have a simple
proposal for now.

Do you mean getting the version of this hypercall, e.g. with a dedicated
flag, like the landlock_create_ruleset() syscall with
LANDLOCK_CREATE_RULESET_VERSION?


>
> Thanks,
> Wei.

2023-05-29 16:11:19

by Mickaël Salaün

[permalink] [raw]
Subject: Re: [PATCH v1 3/9] virt: Implement Heki common code


On 17/05/2023 14:47, Madhavan T. Venkataraman wrote:
> Sorry for the delay. See inline...
>
> On 5/8/23 12:29, Wei Liu wrote:
>> On Fri, May 05, 2023 at 05:20:40PM +0200, Mickaël Salaün wrote:
>>> From: Madhavan T. Venkataraman <[email protected]>
>>>
>>> Hypervisor Enforced Kernel Integrity (Heki) is a feature that will use
>>> the hypervisor to enhance guest virtual machine security.
>>>
>>> Configuration
>>> =============
>>>
>>> Define the config variables for the feature. This feature depends on
>>> support from the architecture as well as the hypervisor.
>>>
>>> Enabling HEKI
>>> =============
>>>
>>> Define a kernel command line parameter "heki" to turn the feature on or
>>> off. By default, Heki is on.
>>
>> For such a newfangled feature can we have it off by default? Especially
>> when there are unsolved issues around dynamically loaded code.
>>
>
> Yes. We can certainly do that.

By default the Kconfig option should definitely be off. We also need to
change the Kconfig option to only be set if kernel modules, JIT, kprobes
and other dynamic text change features are disabled at build time (see
discussion with Sean).

With this new Kconfig option for the static case, I think the boot
option should be on by default because otherwise it would not really be
possible to switch it back on later without risking silently breaking
users' machines. However, we should rename this option to something like
"heki_static" to be in line with the new Kconfig option.

The goal of Heki is to improve and complement kernel self-protection
mechanisms (which don't have boot time options), and to make it
available to everyone, see
https://kernsec.org/wiki/index.php/Kernel_Self_Protection_Project/Recommended_Settings
In practice, it would then be kind of useless to be required to set a
boot option to enable Heki (rather than to disable it).
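For reference, the build-time restriction discussed above could be expressed along these lines. This is a sketch only: ARCH_SUPPORTS_HEKI comes from this series, while the !MODULES/!BPF_JIT/!KPROBES dependency list is taken from this discussion, not from the actual patches:

```kconfig
config HEKI
	bool "Hypervisor Enforced Kernel Integrity (Heki)"
	default n
	depends on ARCH_SUPPORTS_HEKI
	# Static case only: no dynamic kernel text modification built in.
	depends on !MODULES && !BPF_JIT && !KPROBES
	help
	  Lock down static kernel text and rodata through the hypervisor.
	  Only available when no dynamic kernel text modification feature
	  is enabled at build time.
```

The heki_static boot parameter would then only toggle enforcement within that already-restricted configuration.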


>
>>>
>> [...]
>>> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
>>> index 3604074a878b..5cf5a7a97811 100644
>>> --- a/arch/x86/Kconfig
>>> +++ b/arch/x86/Kconfig
>>> @@ -297,6 +297,7 @@ config X86
>>> select FUNCTION_ALIGNMENT_4B
>>> imply IMA_SECURE_AND_OR_TRUSTED_BOOT if EFI
>>> select HAVE_DYNAMIC_FTRACE_NO_PATCHABLE
>>> + select ARCH_SUPPORTS_HEKI if X86_64
>>
>> Why is there a restriction on X86_64?
>>
>
> We want to get the PoC working and reviewed on X64 first. We have tested this only on X64 so far.

X86_64 includes Intel CPUs, which can support EPT and MBEC, which are a
requirement for Heki. ARM might have similar features but we're focused
on x86 for now.

As a side note, I only have access to an Intel machine, which means that
I cannot work on AMD support. However, I'll be pleased to implement such
support if I get access to a machine with a recent AMD CPU.


>
>>>
>>> config INSTRUCTION_DECODER
>>> def_bool y
>>> diff --git a/arch/x86/include/asm/sections.h b/arch/x86/include/asm/sections.h
>>> index a6e8373a5170..42ef1e33b8a5 100644
>>> --- a/arch/x86/include/asm/sections.h
>>> +++ b/arch/x86/include/asm/sections.h
>> [...]
>>>
>>> +#ifdef CONFIG_HEKI
>>> +
>>> +/*
>>> + * Gather all of the statically defined sections so heki_late_init() can
>>> + * protect these sections in the host page table.
>>> + *
>>> + * The sections are defined under "SECTIONS" in vmlinux.lds.S
>>> + * Keep this array in sync with SECTIONS.
>>> + */
>>
>> This seems a bit fragile, because it requires constant attention from
>> people who care about this functionality. Can this table be
>> automatically generated?
>>
>
> We realize that. But I don't know of a way this can be automatically generated. Also, the permissions for
> each section are specific to the use of that section. The developer who introduces a new section is the
> one who will know what the permissions should be.
>
> If anyone has any ideas of how we can generate this table automatically or even just add a build-time check
> of some sort, please let us know.

One clean solution might be to parse the vmlinux.lds.S file, extract
sections and their permissions, and fill that into an automatically
generated header file.

Another way to do it would be to extract sections and associated
permissions with objdump, but that could be an issue because of longer
build times.

A better solution would be to extract such sections and associated
permissions at boot time. I guess the kernel already has such helpers
used in early boot.

2023-05-29 16:49:53

by Mickaël Salaün

[permalink] [raw]
Subject: Re: [PATCH v1 5/9] KVM: x86: Add new hypercall to lock control registers


On 08/05/2023 23:11, Wei Liu wrote:
> On Fri, May 05, 2023 at 05:20:42PM +0200, Mickaël Salaün wrote:
>> This enables guests to lock their CR0 and CR4 registers with a subset of
>> X86_CR0_WP, X86_CR4_SMEP, X86_CR4_SMAP, X86_CR4_UMIP, X86_CR4_FSGSBASE
>> and X86_CR4_CET flags.
>>
>> The new KVM_HC_LOCK_CR_UPDATE hypercall takes two arguments. The first
>> is to identify the control register, and the second is a bit mask to
>> pin (i.e. mark as read-only).
>>
>> These register flags should already be pinned by Linux guests, but once
>> compromised, this self-protection mechanism could be disabled, which is
>> not the case with this dedicated hypercall.
>>
>> Cc: Borislav Petkov <[email protected]>
>> Cc: Dave Hansen <[email protected]>
>> Cc: H. Peter Anvin <[email protected]>
>> Cc: Ingo Molnar <[email protected]>
>> Cc: Kees Cook <[email protected]>
>> Cc: Madhavan T. Venkataraman <[email protected]>
>> Cc: Paolo Bonzini <[email protected]>
>> Cc: Sean Christopherson <[email protected]>
>> Cc: Thomas Gleixner <[email protected]>
>> Cc: Vitaly Kuznetsov <[email protected]>
>> Cc: Wanpeng Li <[email protected]>
>> Signed-off-by: Mickaël Salaün <[email protected]>
>> Link: https://lore.kernel.org/r/[email protected]
> [...]
>> hw_cr4 = (cr4_read_shadow() & X86_CR4_MCE) | (cr4 & ~X86_CR4_MCE);
>> if (is_unrestricted_guest(vcpu))
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index ffab64d08de3..a529455359ac 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -7927,11 +7927,77 @@ static unsigned long emulator_get_cr(struct x86_emulate_ctxt *ctxt, int cr)
>> return value;
>> }
>>
>> +#ifdef CONFIG_HEKI
>> +
>> +extern unsigned long cr4_pinned_mask;
>> +
>
> Can this be moved to a header file?

Yep, but I'm not sure which one. Any preference, Kees?


>
>> +static int heki_lock_cr(struct kvm *const kvm, const unsigned long cr,
>> + unsigned long pin)
>> +{
>> + if (!pin)
>> + return -KVM_EINVAL;
>> +
>> + switch (cr) {
>> + case 0:
>> + /* Cf. arch/x86/kernel/cpu/common.c */
>> + if (!(pin & X86_CR0_WP))
>> + return -KVM_EINVAL;
>> +
>> + if ((read_cr0() & pin) != pin)
>> + return -KVM_EINVAL;
>> +
>> + atomic_long_or(pin, &kvm->heki_pinned_cr0);
>> + return 0;
>> + case 4:
>> + /* Checks for irrelevant bits. */
>> + if ((pin & cr4_pinned_mask) != pin)
>> + return -KVM_EINVAL;
>> +
>
> It is enforcing the host mask on the guest, right? If the guest's set is a
> super set of the host's then it will get rejected.
>
>
>> + /* Ignores bits not present in host. */
>> + pin &= __read_cr4();
>> + atomic_long_or(pin, &kvm->heki_pinned_cr4);

We assume that the host's mask is a superset of the guest's mask. I
guess we should check the absolute supported bits instead, even if it
would be weird for the host to not support these bits.


>> + return 0;
>> + }
>> + return -KVM_EINVAL;
>> +}
>> +
>> +int heki_check_cr(const struct kvm *const kvm, const unsigned long cr,
>> + const unsigned long val)
>> +{
>> + unsigned long pinned;
>> +
>> + switch (cr) {
>> + case 0:
>> + pinned = atomic_long_read(&kvm->heki_pinned_cr0);
>> + if ((val & pinned) != pinned) {
>> + pr_warn_ratelimited(
>> + "heki-kvm: Blocked CR0 update: 0x%lx\n", val);
>
> I think if the message contains the VM and VCPU identifier it will
> become more useful.

Indeed, and this should be the case for all log messages, but I'd left
that for future work. ;) I'll update the logs for the next series with a
new kvm_warn_ratelimited() helper using the VCPU's PID.

2023-05-30 10:07:15

by Mickaël Salaün

[permalink] [raw]
Subject: Re: [RFC PATCH v1 0/9] Hypervisor-Enforced Kernel Integrity


On 25/05/2023 20:34, Trilok Soni wrote:
> On 5/25/2023 6:25 AM, Mickaël Salaün wrote:
>>
>> On 24/05/2023 23:04, Trilok Soni wrote:
>>> On 5/5/2023 8:20 AM, Mickaël Salaün wrote:
>>>> Hi,
>>>>
>>>> This patch series is a proof-of-concept that implements new KVM features
>>>> (extended page tracking, MBEC support, CR pinning) and defines a new
>>>> API to
>>>> protect guest VMs. No VMM (e.g., Qemu) modification is required.
>>>>
>>>> The main idea being that kernel self-protection mechanisms should be
>>>> delegated
>>>> to a more privileged part of the system, hence the hypervisor. It is
>>>> still the
>>>> role of the guest kernel to request such restrictions according to its
>>>
>>> Only for the guest kernel images here? Why not for the host OS kernel?
>>
>> As explained in the Future work section, protecting the host would be
>> useful, but that doesn't really fit with the KVM model. The Protected
>> KVM project is a first step to help in this direction [11].
>>
>> In a nutshell, KVM is close to a type-2 hypervisor, and the host kernel
>> is also part of the hypervisor.
>>
>>
>>> Embedded devices w/ Android you have mentioned below supports the host
>>> OS as well it seems, right?
>>
>> What do you mean?
>
> I think you have answered this above w/ pKVM, and I was referring to the
> host protection as well w/ Heki. The links/references below refer to the
> Android OS, it seems, and not a guest VM.
>
>>
>>
>>>
>>> Do we suggest that all the functionalities should be implemented in the
>>> Hypervisor (NS-EL2 for ARM) or even at Secure EL like Secure-EL1 (ARM).
>>
>> KVM runs in EL2. TrustZone is mainly used to enforce DRM, which means
>> that we may not control the related code.
>>
>> This patch series is dedicated to hypervisor-enforced kernel integrity,
>> then KVM.
>>
>>>
>>> I am hoping that whatever we suggest the interface here from the Guest
>>> to the Hypervisor becomes the ABI right?
>>
>> Yes, hypercalls are part of the KVM ABI.
>
> Sure. I just hope that they are extensible enough to support other
> hypervisors too. I am not sure if others like ACRN / Xen are on this list
> to see if it fits their needs too.

KVM, Hyper-V and Xen mailing lists are CCed. The KVM hypercalls are
specific to KVM, but this patch series also includes a common guest API
intended to be used with all hypervisors.


>
> Is there any other Hypervisor you plan to test this feature as well?

We're also working on Hyper-V.

>
>>
>>>
>>>
>>>>
>>>> # Current limitations
>>>>
>>>> The main limitation of this patch series is the statically enforced
>>>> permissions. This is not an issue for kernels without modules but this
>>>> needs to
>>>> be addressed.  Mechanisms that dynamically impact kernel executable
>>>> memory are
>>>> not handled for now (e.g., kernel modules, tracepoints, eBPF JIT),
>>>> and such
>>>> code will need to be authenticated.  Because the hypervisor is highly
>>>> privileged and critical to the security of all the VMs, we don't want to
>>>> implement a code authentication mechanism in the hypervisor itself
>>>> but delegate
>>>> this verification to something much less privileged. We are thinking
>>>> of two
>>>> ways to solve this: implement this verification in the VMM or spawn a
>>>> dedicated
>>>> special VM (similar to Windows's VBS). There are pros and cons to each
>>>> approach:
>>>> complexity, verification code ownership (guest's or VMM's), access to
>>>> guest
>>>> memory (i.e., confidential computing).
>>>
>>> Do you foresee the performance regressions due to lot of tracking here?
>>
>> The performance impact of execution prevention should be negligible
>> because once configured the hypervisor does nothing except catch
>> illegitimate access attempts.
>
> Yes, if you are using the static kernel only and not considering the
> other dynamic patching features as explained. They need to be thought
> about differently to reduce the likely impact.

What do you mean? We plan to support dynamic code, and performance is of
course part of the requirement.


>
>>
>>
>>> Production kernels do have lot of tracepoints and we use it as feature
>>> in the GKI kernel for the vendor hooks implementation and in those cases
>>> every vendor driver is a module.
>>
>> As explained in this section, dynamic kernel modifications such as
>> tracepoints or modules are not currently supported by this patch series.
>> Handling tracepoints is possible but requires more work to define and
>> check legitimate changes. This proposal is still useful for static
>> kernels though.
>>
>>
>>> Separate VM further fragments this
>>> design and delegates more of it to proprietary solutions?
>>
>> What do you mean? KVM is not a proprietary solution.
>
> Ah, I was referring to the VBS Windows VM mentioned in the above text. Is
> it open-source? The reference to a VM (or dedicated VM) didn't mention
> that the VM itself would be open-source and running a Linux kernel.

This patch series is dedicated to KVM. Windows VBS was only mentioned as
a comparable (but much more advanced) set of features. Everything
required to use these new KVM features is and will be open-source. There
is nothing to worry about regarding licensing; the goal is to make this
widely and freely available to protect users.


>
>>
>> For dynamic checks, this would require code not run by KVM itself, but
>> either the VMM or a dedicated VM. In this case, the dynamic
>> authentication code could come from the guest VM or from the VMM itself.
>> In the former case, it is more challenging from a security point of view
>> but doesn't rely on external (proprietary) solution. In the latter case,
>> open-source VMMs should implement the specification to provide the
>> required service (e.g. check kernel module signature).
>>
>> The goal of the common API layer provided by this RFC is to share code
>> as much as possible between different hypervisor backends.
>>
>>
>>>
>>> Do you have any performance numbers w/ current RFC?
>>
>> No, but the only hypervisor performance impact is at boot time and
>> should be negligible. I'll try to get some numbers for the
>> hardware-enforcement impact, but it should be negligible too.
>
> Thanks. Please share the data once you have it ready.

It's on my todo list, but again, that should not be an issue and I even
doubt the difference will be measurable.

2023-05-30 16:32:40

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [RFC PATCH v1 0/9] Hypervisor-Enforced Kernel Integrity

On Fri, 2023-05-26 at 17:22 +0200, Mickaël Salaün wrote:
> > > Can the guest kernel ask the host VMM's emulated devices to DMA
> > > into
> > > the protected data? It should go through the host userspace
> > > mappings I
> > > think, which don't care about EPT permissions. Or did I miss
> > > where you
> > > are protecting that another way? There are a lot of easy ways to
> > > ask
> > > the host to write to guest memory that don't involve the EPT. You
> > > probably need to protect the host userspace mappings, and also
> > > the
> > > places in KVM that kmap a GPA provided by the guest.
> >
> > Good point, I'll check this confused deputy attack. Extended KVM
> > protections should indeed handle all ways to map guests' memory.
> > I'm
> > wondering if current VMMs would gracefully handle such new
> > restrictions
> > though.
>
> I guess the host could map arbitrary data to the guest, so that needs
> to be handled, but how could the VMM (not the host kernel)
> bypass/update the EPT initially used for the guest (and potentially
> later mapped to the host)?

Well traditionally both QEMU and KVM accessed guest memory via host
mappings instead of the EPT. So I'm wondering what is stopping the
guest from passing a protected gfn when setting up the DMA, and QEMU
being enticed to write to it? The emulator as well would use these host
userspace mappings and not consult the EPT IIRC.

I think Sean was suggesting host userspace should be more involved in
this process, so perhaps it could protect its own alias of the
protected memory, for example mprotect() it as read-only.

There is (was?) some KVM PV features that accessed guest memory via the
host direct map as well. I would think mprotect() should protect this
at the get_user_pages() stage, but it looks like the details have
changed since I last understood it.

2023-05-30 23:33:16

by Kees Cook

[permalink] [raw]
Subject: Re: [PATCH v1 5/9] KVM: x86: Add new hypercall to lock control registers

On Mon, May 29, 2023 at 06:48:03PM +0200, Mickaël Salaün wrote:
>
> On 08/05/2023 23:11, Wei Liu wrote:
> > On Fri, May 05, 2023 at 05:20:42PM +0200, Mickaël Salaün wrote:
> > > This enables guests to lock their CR0 and CR4 registers with a subset of
> > > X86_CR0_WP, X86_CR4_SMEP, X86_CR4_SMAP, X86_CR4_UMIP, X86_CR4_FSGSBASE
> > > and X86_CR4_CET flags.
> > >
> > > The new KVM_HC_LOCK_CR_UPDATE hypercall takes two arguments. The first
> > > is to identify the control register, and the second is a bit mask to
> > > pin (i.e. mark as read-only).
> > >
> > > These register flags should already be pinned by Linux guests, but once
> > > compromised, this self-protection mechanism could be disabled, which is
> > > not the case with this dedicated hypercall.
> > >
> > > Cc: Borislav Petkov <[email protected]>
> > > Cc: Dave Hansen <[email protected]>
> > > Cc: H. Peter Anvin <[email protected]>
> > > Cc: Ingo Molnar <[email protected]>
> > > Cc: Kees Cook <[email protected]>
> > > Cc: Madhavan T. Venkataraman <[email protected]>
> > > Cc: Paolo Bonzini <[email protected]>
> > > Cc: Sean Christopherson <[email protected]>
> > > Cc: Thomas Gleixner <[email protected]>
> > > Cc: Vitaly Kuznetsov <[email protected]>
> > > Cc: Wanpeng Li <[email protected]>
> > > Signed-off-by: Mickaël Salaün <[email protected]>
> > > Link: https://lore.kernel.org/r/[email protected]
> > [...]
> > > hw_cr4 = (cr4_read_shadow() & X86_CR4_MCE) | (cr4 & ~X86_CR4_MCE);
> > > if (is_unrestricted_guest(vcpu))
> > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > index ffab64d08de3..a529455359ac 100644
> > > --- a/arch/x86/kvm/x86.c
> > > +++ b/arch/x86/kvm/x86.c
> > > @@ -7927,11 +7927,77 @@ static unsigned long emulator_get_cr(struct x86_emulate_ctxt *ctxt, int cr)
> > > return value;
> > > }
> > > +#ifdef CONFIG_HEKI
> > > +
> > > +extern unsigned long cr4_pinned_mask;
> > > +
> >
> > Can this be moved to a header file?
>
> Yep, but I'm not sure which one. Any preference Kees?

Uh, er, I was never expecting that mask to be non-static. ;) To that
end, how about putting it in arch/x86/kvm/x86.h ?

--
Kees Cook

2023-05-31 20:35:53

by Sean Christopherson

[permalink] [raw]
Subject: Re: [RFC PATCH v1 0/9] Hypervisor-Enforced Kernel Integrity

On Tue, May 30, 2023, Rick P Edgecombe wrote:
> On Fri, 2023-05-26 at 17:22 +0200, Mickaël Salaün wrote:
> > > > Can the guest kernel ask the host VMM's emulated devices to DMA into
> > > > the protected data? It should go through the host userspace mappings I
> > > > think, which don't care about EPT permissions. Or did I miss where you
> > > > are protecting that another way? There are a lot of easy ways to ask
> > > > the host to write to guest memory that don't involve the EPT. You
> > > > probably need to protect the host userspace mappings, and also the
> > > > places in KVM that kmap a GPA provided by the guest.
> > >
> > > Good point, I'll check this confused deputy attack. Extended KVM
> > > protections should indeed handle all ways to map guests' memory. I'm
> > > wondering if current VMMs would gracefully handle such new restrictions
> > > though.
> >
> > I guess the host could map arbitrary data to the guest, so that needs to be
> > handled, but how could the VMM (not the host kernel) bypass/update the EPT
> > initially used for the guest (and potentially later mapped to the host)?
>
> Well traditionally both QEMU and KVM accessed guest memory via host
> mappings instead of the EPT. So I'm wondering what is stopping the
> guest from passing a protected gfn when setting up the DMA, and QEMU
> being enticed to write to it? The emulator as well would use these host
> userspace mappings and not consult the EPT IIRC.
>
> I think Sean was suggesting host userspace should be more involved in
> this process, so perhaps it could protect its own alias of the
> protected memory, for example mprotect() it as read-only.

Ya, though "suggesting" is really "demanding, unless someone provides super strong
justification for handling this directly in KVM". It's basically the same argument
that led to Linux Security Modules: I'm all for KVM providing the framework and
plumbing, but I don't want KVM to get involved in defining policy, threat models, etc.

2023-06-02 15:13:01

by Mickaël Salaün

[permalink] [raw]
Subject: Re: [RFC PATCH v1 0/9] Hypervisor-Enforced Kernel Integrity


On 31/05/2023 22:24, Sean Christopherson wrote:
> On Tue, May 30, 2023, Rick P Edgecombe wrote:
>> On Fri, 2023-05-26 at 17:22 +0200, Mickaël Salaün wrote:
>>>>> Can the guest kernel ask the host VMM's emulated devices to DMA into
>>>>> the protected data? It should go through the host userspace mappings I
>>>>> think, which don't care about EPT permissions. Or did I miss where you
>>>>> are protecting that another way? There are a lot of easy ways to ask
>>>>> the host to write to guest memory that don't involve the EPT. You
>>>>> probably need to protect the host userspace mappings, and also the
>>>>> places in KVM that kmap a GPA provided by the guest.
>>>>
>>>> Good point, I'll check this confused deputy attack. Extended KVM
>>>> protections should indeed handle all ways to map guests' memory. I'm
>>>> wondering if current VMMs would gracefully handle such new restrictions
>>>> though.
>>>
>>> I guess the host could map arbitrary data to the guest, so that needs to be
>>> handled, but how could the VMM (not the host kernel) bypass/update the EPT
>>> initially used for the guest (and potentially later mapped to the host)?
>>
>> Well traditionally both QEMU and KVM accessed guest memory via host
>> mappings instead of the EPT. So I'm wondering what is stopping the
>> guest from passing a protected gfn when setting up the DMA, and QEMU
>> being enticed to write to it? The emulator as well would use these host
>> userspace mappings and not consult the EPT IIRC.
>>
>> I think Sean was suggesting host userspace should be more involved in
>> this process, so perhaps it could protect its own alias of the
>> protected memory, for example mprotect() it as read-only.
>
> Ya, though "suggesting" is really "demanding, unless someone provides super strong
> justification for handling this directly in KVM". It's basically the same argument
> that led to Linux Security Modules: I'm all for KVM providing the framework and
> plumbing, but I don't want KVM to get involved in defining policy, threat models, etc.

I agree that KVM should not provide its own policy but only the building
blocks to enforce one. There are two complementary points:
- policy definition by the guest, provided to KVM and the host;
- policy enforcement by KVM and the host.

A potential extension of this framework could be to enable the host to
define its own policy for guests, but this would be a different threat
model.

To avoid too much latency from involving the host in policy enforcement,
I'd like to explore an asynchronous approach, which would be especially
well suited to dynamic restrictions.