2022-02-23 08:28:31

by Junaid Shahid

Subject: [RFC PATCH 00/47] Address Space Isolation for KVM

This patch series is a proof-of-concept RFC for an end-to-end implementation of
Address Space Isolation for KVM. It has similar goals to, and a somewhat similar
high-level design as, the original ASI patches from Alexandre Chartre
([1],[2],[3],[4]), but with a different underlying implementation. It also
includes several memory management changes to help differentiate between
sensitive and non-sensitive memory and to map non-sensitive memory into the
ASI restricted address spaces.

This RFC is intended as a demonstration of what a full ASI implementation for
KVM could look like, not necessarily as a direct proposal for what might
eventually be merged. In particular, these patches do not yet implement KPTI on
top of ASI, although the framework is generic enough to support it. Similarly,
these patches do not include non-sensitive annotations for data structures that
were not frequently accessed while running our test workloads, but the framework
is designed such that new non-sensitive memory annotations can be added
trivially.

The patches apply on top of Linux v5.16. These patches are also available via
gerrit at https://linux-review.googlesource.com/q/topic:asi-rfc.

Background
==========
Address Space Isolation is a comprehensive security mitigation for several types
of speculative execution attacks. Even though the kernel already has several
mitigations for speculative execution vulnerabilities, some of them can be quite
expensive if enabled fully. For example, fully mitigating L1TF using the
existing mechanisms requires an L1 cache flush on every single VM entry as well
as disabling hyperthreading altogether. (Although core scheduling can provide
some protection when hyperthreading is enabled, it is not sufficient by itself
to protect against all leaks unless sibling hyperthread stunning is also
performed on every VM exit.) ASI provides a much less expensive mitigation for
such vulnerabilities while still providing nearly the same level of protection.

There are a couple of basic insights/assumptions behind ASI:

1. Most execution paths in the kernel (especially during virtual machine
execution) access only memory that is not particularly sensitive even if it were
to get leaked to the executing process/VM (setting aside for a moment what
exactly should be considered sensitive or non-sensitive).
2. Even when executing speculatively, the CPU can generally only bring memory
that is mapped in the current page tables into its various caches and internal
buffers.

Given these, the idea of using ASI to thwart speculative attacks is that we can
execute the kernel using a restricted set of page tables most of the time and
switch to the full unrestricted kernel address space only when the kernel needs
to access something that is not mapped in the restricted address space. We also
keep track of when a switch to the full kernel address space is done, so that
before returning to the process/VM, we can switch back to the restricted
address space. In the paths where the kernel is able to execute entirely while
remaining in the restricted address space, we can skip other mitigations for
speculative execution attacks (such as L1 cache / micro-arch buffer flushes,
sibling hyperthread stunning etc.). Only in the cases where we do end up
switching the page tables do we perform these more expensive mitigations.
Assuming that happens relatively infrequently, the performance can be
significantly better than performing these mitigations all the time.
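
As a rough sketch of the intended flow on the KVM side (illustrative
pseudo-code only, not code from these patches; asi_enter/asi_exit are
described under "Basic concepts" below, and vcpu_asi() / enter_guest()
are placeholder names):

  asi_enter(vcpu_asi(vcpu));  /* enter hook: L1 flush / MDS clear (and
                               * sibling unstun); the expensive work is
                               * needed only if we had left the
                               * restricted space since the last entry */
  enter_guest(vcpu);          /* guest runs; most VM exits are handled
                               * entirely within the restricted space */
  /*
   * If a VM exit handler touched unmapped (sensitive) memory, an
   * implicit asi_exit() already happened and its hook performed the
   * expensive mitigations; otherwise no flush or stun is needed.
   */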

Please note that although we do have a sibling hyperthread stunning
implementation internally, which is fully integrated with KVM-ASI, it is not
included in this RFC for the time being. The earlier upstream proposal for
sibling stunning [6] could potentially be integrated into an upstream ASI
implementation.

Basic concepts
==============
Different types of restricted address spaces are represented by different ASI
classes. For instance, KVM-ASI is an ASI class used during VM execution. KPTI
would be another ASI class. An ASI instance (struct asi) represents a single
restricted address space. There is a separate ASI instance for each untrusted
context (e.g. a userspace process, a VM, or even a single VCPU). Note that
there can be multiple untrusted security contexts (and thus multiple restricted
address spaces) within a single process. For example, in the case of VMs, the
userspace process is a different security context than the guest VM, and in
principle, even each VCPU could be considered a separate security context
(which would be primarily useful for securing nested virtualization).

In this RFC, a process can have at most one ASI instance of each class, though
this is not an inherent limitation and multiple instances of the same class
should eventually be supported. (A process can still have ASI instances of
different classes, e.g. KVM-ASI and KPTI.) In fact, in principle, it is not even
entirely necessary to tie an ASI instance to a process. That is just a
simplification for the initial implementation.

An asi_enter operation switches into the restricted address space represented by
the given ASI instance. An asi_exit operation switches to the full unrestricted
kernel address space. Each ASI class can provide hooks to be executed during
these operations, which can be used to perform speculative attack mitigations
relevant to that class. For instance, the KVM-ASI hooks would perform a
sibling-hyperthread-stun operation in the asi_exit hook, and L1-flush/MDS-clear
and sibling-hyperthread-unstun operations in the asi_enter hook. On the other
hand, the hooks for the KPTI class would be no-ops, since switching the page
tables is sufficient mitigation in that case.
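
For illustration, registering such a class could look roughly like the
following (a sketch based on the interfaces introduced later in this
series; the hook signatures and the return-value convention are
simplified here):

  static void kvm_post_asi_enter(void)
  {
          /* Flush L1D / clear CPU buffers, un-stun the sibling. */
  }

  static void kvm_pre_asi_exit(void)
  {
          /* Stun the sibling hyperthread. */
  }

  static const struct asi_hooks kvm_asi_hooks = {
          .post_asi_enter = kvm_post_asi_enter,
          .pre_asi_exit   = kvm_pre_asi_exit,
  };

  /* At init time (assuming the class index is returned on success): */
  kvm_asi_index = asi_register_class("KVM", ASI_MAP_STANDARD_NONSENSITIVE,
                                     &kvm_asi_hooks);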

If the kernel attempts to access memory that is not mapped in the currently
active ASI instance, the page fault handler automatically performs an asi_exit
operation. This means that except for a few critical pieces of memory, leaving
something out of a restricted address space will result in only a performance
hit, rather than a catastrophic failure. The kernel can also perform explicit
asi_exit operations in some paths as needed.
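
A minimal sketch of that fault-handler check (simplified; the actual
change is in arch/x86/mm/fault.c in this series, and
fault_addr_in_full_kernel_map() is a placeholder name):

  if (is_asi_active() && fault_addr_in_full_kernel_map(address)) {
          /*
           * Not a real fault -- the address is just not mapped in the
           * restricted tables. Switch to the unrestricted address
           * space and let the access be retried.
           */
          asi_exit();
          return;
  }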

Apart from the page fault handler, other exceptions and interrupts (even NMIs)
do not automatically cause an asi_exit and could potentially be executed
completely within a restricted address space if they don't end up accessing any
sensitive piece of memory.

The mappings within a restricted address space are always a subset of the full
kernel address space and each mapping is always the same as the corresponding
mapping in the full kernel address space. This is necessary because we could
potentially end up performing an asi_exit at any point.

Although this RFC only includes an implementation of the KVM-ASI class, a KPTI
class could also be implemented on top of the same infrastructure. Furthermore,
in the future we could also implement a KPTI-Next class that actually uses the
ASI model for userspace processes, i.e. mapping non-sensitive kernel memory in
the restricted address space and trying to execute most syscalls/interrupts
without switching to the full kernel address space, as opposed to the current
KPTI, which requires an address space switch on every kernel/user mode
transition.

Memory classification
=====================
We divide memory into three categories.

1. Sensitive memory
This is memory that should never get leaked to any process or VM. Sensitive
memory is only mapped in the unrestricted kernel page tables. By default, all
memory is considered sensitive unless specifically categorized otherwise.

2. Globally non-sensitive memory
This is memory that does not present a substantial security threat even if it
were to get leaked to any process or VM in the system. Globally non-sensitive
memory is mapped in the restricted address spaces for all processes.

3. Locally non-sensitive memory
This is memory that does not present a substantial security threat if it were to
get leaked to the currently running process or VM, but would present a security
issue if it were to get leaked to any other process or VM in the system.
Examples include userspace memory (or guest memory in the case of VMs) and
kernel structures containing userspace/guest register context. Locally
non-sensitive memory is mapped only in the restricted address space of a single
process.

Various mechanisms are provided to annotate different types of memory (static,
buddy allocator, slab, vmalloc etc.) as globally or locally non-sensitive. In
addition, the ASI infrastructure takes care to ensure that different classes of
memory do not share the same physical page. This includes separating
sensitive, globally non-sensitive and locally non-sensitive memory into
different pages, as well as separating the locally non-sensitive memory of
different processes into different pages.
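
For illustration, a few of the annotation forms introduced later in this
series (the variable names here are hypothetical; the flag and attribute
spellings are the ones added by these patches):

  /* A static variable mapped into all restricted address spaces: */
  static unsigned long some_stat __asi_not_sensitive;

  /* Direct-map (page allocator) memory marked globally non-sensitive: */
  page = alloc_pages(GFP_KERNEL | __GFP_GLOBAL_NONSENSITIVE, order);

  /* vmalloc memory marked globally non-sensitive: */
  buf = __vmalloc(size, GFP_KERNEL | __GFP_GLOBAL_NONSENSITIVE);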

What exactly should be considered non-sensitive (either globally or locally) is
somewhat open-ended. Some things are clearly sensitive or non-sensitive, but
many things also fall into a gray area, depending on how paranoid one wants to
be. For this proof of concept, we have generally treated such things as
non-sensitive, though that may not necessarily be the ideal classification in
each case. Similarly, there is also a gray area between globally and locally
non-sensitive classifications in some cases, and in those cases this RFC has
mostly erred on the side of marking them as locally non-sensitive, even though
many of those cases could likely be safely classified as globally non-sensitive.

Although this implementation includes fairly extensive support for marking most
types of dynamically allocated memory as locally non-sensitive, it may be
feasible, at least for KVM-ASI, to get away with a simpler implementation (such
as [5]) if we are very selective about what memory we treat as locally
non-sensitive (as opposed to globally non-sensitive). Nevertheless, the more
general mechanism is included in this proof of concept as an illustration of
what could be done if we really needed to treat any arbitrary kernel memory as
locally non-sensitive.

It is also possible to have ASI classes that do not utilize the above described
infrastructure and instead manage all the memory mappings inside the restricted
address space on their own.


References
==========
[1] https://lore.kernel.org/lkml/[email protected]
[2] https://lore.kernel.org/lkml/[email protected]
[3] https://lore.kernel.org/lkml/[email protected]
[4] https://lore.kernel.org/lkml/[email protected]
[5] https://lore.kernel.org/lkml/[email protected]
[6] https://lore.kernel.org/lkml/[email protected]

Cc: Paul Turner <[email protected]>
Cc: Jim Mattson <[email protected]>
Cc: Alexandre Chartre <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Andy Lutomirski <[email protected]>


Junaid Shahid (32):
mm: asi: Introduce ASI core API
mm: asi: Add command-line parameter to enable/disable ASI
mm: asi: Switch to unrestricted address space when entering scheduler
mm: asi: ASI support in interrupts/exceptions
mm: asi: Make __get_current_cr3_fast() ASI-aware
mm: asi: ASI page table allocation and free functions
mm: asi: Functions to map/unmap a memory range into ASI page tables
mm: asi: Add basic infrastructure for global non-sensitive mappings
mm: Add __PAGEFLAG_FALSE
mm: asi: Support for global non-sensitive direct map allocations
mm: asi: Global non-sensitive vmalloc/vmap support
mm: asi: Support for global non-sensitive slab caches
mm: asi: Disable ASI API when ASI is not enabled for a process
kvm: asi: Restricted address space for VM execution
mm: asi: Support for mapping non-sensitive pcpu chunks
mm: asi: Aliased direct map for local non-sensitive allocations
mm: asi: Support for pre-ASI-init local non-sensitive allocations
mm: asi: Support for locally nonsensitive page allocations
mm: asi: Support for locally non-sensitive vmalloc allocations
mm: asi: Add support for locally non-sensitive VM_USERMAP pages
mm: asi: Add support for mapping all userspace memory into ASI
mm: asi: Support for local non-sensitive slab caches
mm: asi: Avoid warning from NMI userspace accesses in ASI context
mm: asi: Use separate PCIDs for restricted address spaces
mm: asi: Avoid TLB flushes during ASI CR3 switches when possible
mm: asi: Avoid TLB flush IPIs to CPUs not in ASI context
mm: asi: Reduce TLB flushes when freeing pages asynchronously
mm: asi: Add API for mapping userspace address ranges
mm: asi: Support for non-sensitive SLUB caches
x86: asi: Allocate FPU state separately when ASI is enabled.
kvm: asi: Map guest memory into restricted ASI address space
kvm: asi: Unmap guest memory from ASI address space when using nested
virt

Ofir Weisse (15):
asi: Added ASI memory cgroup flag
mm: asi: Added refcounting when initializing an asi
mm: asi: asi_exit() on PF, skip handling if address is accessible
mm: asi: Adding support for dynamic percpu ASI allocations
mm: asi: ASI annotation support for static variables.
mm: asi: ASI annotation support for dynamic modules.
mm: asi: Skip conventional L1TF/MDS mitigations
mm: asi: support for static percpu DEFINE_PER_CPU*_ASI
mm: asi: Annotation of static variables to be nonsensitive
mm: asi: Annotation of PERCPU variables to be nonsensitive
mm: asi: Annotation of dynamic variables to be nonsensitive
kvm: asi: Splitting kvm_vcpu_arch into non/sensitive parts
mm: asi: Mapping global nonsensitive areas in asi_global_init
kvm: asi: Do asi_exit() in vcpu_run loop before returning to userspace
mm: asi: Properly un/mapping task stack from ASI + tlb flush

arch/alpha/include/asm/Kbuild | 1 +
arch/arc/include/asm/Kbuild | 1 +
arch/arm/include/asm/Kbuild | 1 +
arch/arm64/include/asm/Kbuild | 1 +
arch/csky/include/asm/Kbuild | 1 +
arch/h8300/include/asm/Kbuild | 1 +
arch/hexagon/include/asm/Kbuild | 1 +
arch/ia64/include/asm/Kbuild | 1 +
arch/m68k/include/asm/Kbuild | 1 +
arch/microblaze/include/asm/Kbuild | 1 +
arch/mips/include/asm/Kbuild | 1 +
arch/nds32/include/asm/Kbuild | 1 +
arch/nios2/include/asm/Kbuild | 1 +
arch/openrisc/include/asm/Kbuild | 1 +
arch/parisc/include/asm/Kbuild | 1 +
arch/powerpc/include/asm/Kbuild | 1 +
arch/riscv/include/asm/Kbuild | 1 +
arch/s390/include/asm/Kbuild | 1 +
arch/sh/include/asm/Kbuild | 1 +
arch/sparc/include/asm/Kbuild | 1 +
arch/um/include/asm/Kbuild | 1 +
arch/x86/events/core.c | 6 +-
arch/x86/events/intel/bts.c | 2 +-
arch/x86/events/intel/core.c | 2 +-
arch/x86/events/msr.c | 2 +-
arch/x86/events/perf_event.h | 4 +-
arch/x86/include/asm/asi.h | 215 ++++
arch/x86/include/asm/cpufeatures.h | 1 +
arch/x86/include/asm/current.h | 2 +-
arch/x86/include/asm/debugreg.h | 2 +-
arch/x86/include/asm/desc.h | 2 +-
arch/x86/include/asm/disabled-features.h | 8 +-
arch/x86/include/asm/fpu/api.h | 3 +-
arch/x86/include/asm/hardirq.h | 2 +-
arch/x86/include/asm/hw_irq.h | 2 +-
arch/x86/include/asm/idtentry.h | 25 +-
arch/x86/include/asm/kvm_host.h | 124 +-
arch/x86/include/asm/page.h | 19 +-
arch/x86/include/asm/page_64.h | 27 +-
arch/x86/include/asm/page_64_types.h | 20 +
arch/x86/include/asm/percpu.h | 2 +-
arch/x86/include/asm/pgtable_64_types.h | 10 +
arch/x86/include/asm/preempt.h | 2 +-
arch/x86/include/asm/processor.h | 17 +-
arch/x86/include/asm/smp.h | 2 +-
arch/x86/include/asm/tlbflush.h | 49 +-
arch/x86/include/asm/topology.h | 2 +-
arch/x86/kernel/alternative.c | 2 +-
arch/x86/kernel/apic/apic.c | 2 +-
arch/x86/kernel/apic/x2apic_cluster.c | 8 +-
arch/x86/kernel/cpu/bugs.c | 2 +-
arch/x86/kernel/cpu/common.c | 12 +-
arch/x86/kernel/e820.c | 7 +-
arch/x86/kernel/fpu/core.c | 47 +-
arch/x86/kernel/fpu/init.c | 7 +-
arch/x86/kernel/fpu/internal.h | 1 +
arch/x86/kernel/fpu/xstate.c | 21 +-
arch/x86/kernel/head_64.S | 12 +
arch/x86/kernel/hw_breakpoint.c | 2 +-
arch/x86/kernel/irq.c | 2 +-
arch/x86/kernel/irqinit.c | 2 +-
arch/x86/kernel/nmi.c | 6 +-
arch/x86/kernel/process.c | 13 +-
arch/x86/kernel/setup.c | 4 +-
arch/x86/kernel/setup_percpu.c | 4 +-
arch/x86/kernel/smp.c | 2 +-
arch/x86/kernel/smpboot.c | 3 +-
arch/x86/kernel/traps.c | 2 +
arch/x86/kernel/tsc.c | 10 +-
arch/x86/kernel/vmlinux.lds.S | 2 +-
arch/x86/kvm/cpuid.c | 18 +-
arch/x86/kvm/kvm_cache_regs.h | 22 +-
arch/x86/kvm/lapic.c | 11 +-
arch/x86/kvm/mmu.h | 16 +-
arch/x86/kvm/mmu/mmu.c | 209 ++--
arch/x86/kvm/mmu/mmu_internal.h | 2 +-
arch/x86/kvm/mmu/paging_tmpl.h | 40 +-
arch/x86/kvm/mmu/spte.c | 6 +-
arch/x86/kvm/mmu/spte.h | 2 +-
arch/x86/kvm/mmu/tdp_mmu.c | 14 +-
arch/x86/kvm/mtrr.c | 2 +-
arch/x86/kvm/svm/nested.c | 34 +-
arch/x86/kvm/svm/sev.c | 70 +-
arch/x86/kvm/svm/svm.c | 52 +-
arch/x86/kvm/trace.h | 10 +-
arch/x86/kvm/vmx/capabilities.h | 14 +-
arch/x86/kvm/vmx/nested.c | 90 +-
arch/x86/kvm/vmx/vmx.c | 152 ++-
arch/x86/kvm/x86.c | 315 +++--
arch/x86/kvm/x86.h | 4 +-
arch/x86/mm/Makefile | 1 +
arch/x86/mm/asi.c | 1397 ++++++++++++++++++++++
arch/x86/mm/fault.c | 67 +-
arch/x86/mm/init.c | 7 +-
arch/x86/mm/init_64.c | 26 +-
arch/x86/mm/kaslr.c | 34 +-
arch/x86/mm/mm_internal.h | 5 +
arch/x86/mm/physaddr.c | 8 +
arch/x86/mm/tlb.c | 419 ++++++-
arch/xtensa/include/asm/Kbuild | 1 +
fs/binfmt_elf.c | 2 +-
fs/eventfd.c | 2 +-
fs/eventpoll.c | 10 +-
fs/exec.c | 7 +
fs/file.c | 3 +-
fs/timerfd.c | 2 +-
include/asm-generic/asi.h | 149 +++
include/asm-generic/irq_regs.h | 2 +-
include/asm-generic/percpu.h | 6 +
include/asm-generic/vmlinux.lds.h | 36 +-
include/linux/arch_topology.h | 2 +-
include/linux/debug_locks.h | 4 +-
include/linux/gfp.h | 13 +-
include/linux/hrtimer.h | 2 +-
include/linux/interrupt.h | 2 +-
include/linux/jiffies.h | 4 +-
include/linux/kernel_stat.h | 4 +-
include/linux/kvm_host.h | 7 +-
include/linux/kvm_types.h | 3 +
include/linux/memcontrol.h | 3 +
include/linux/mm_types.h | 59 +
include/linux/module.h | 15 +
include/linux/notifier.h | 2 +-
include/linux/page-flags.h | 19 +
include/linux/percpu-defs.h | 39 +
include/linux/percpu.h | 8 +-
include/linux/pgtable.h | 3 +
include/linux/prandom.h | 2 +-
include/linux/profile.h | 2 +-
include/linux/rcupdate.h | 4 +-
include/linux/rcutree.h | 2 +-
include/linux/sched.h | 5 +
include/linux/sched/mm.h | 12 +
include/linux/sched/sysctl.h | 1 +
include/linux/slab.h | 68 +-
include/linux/slab_def.h | 4 +
include/linux/slub_def.h | 6 +
include/linux/vmalloc.h | 16 +-
include/trace/events/mmflags.h | 14 +-
init/main.c | 2 +-
kernel/cgroup/cgroup.c | 9 +-
kernel/cpu.c | 14 +-
kernel/entry/common.c | 6 +
kernel/events/core.c | 25 +-
kernel/exit.c | 2 +
kernel/fork.c | 69 +-
kernel/freezer.c | 2 +-
kernel/irq_work.c | 6 +-
kernel/locking/lockdep.c | 14 +-
kernel/module-internal.h | 1 +
kernel/module.c | 210 +++-
kernel/panic.c | 2 +-
kernel/printk/printk.c | 4 +-
kernel/profile.c | 4 +-
kernel/rcu/srcutree.c | 3 +-
kernel/rcu/tree.c | 12 +-
kernel/rcu/update.c | 4 +-
kernel/sched/clock.c | 2 +-
kernel/sched/core.c | 23 +-
kernel/sched/cpuacct.c | 10 +-
kernel/sched/cpufreq.c | 3 +-
kernel/sched/cputime.c | 4 +-
kernel/sched/fair.c | 7 +-
kernel/sched/loadavg.c | 2 +-
kernel/sched/rt.c | 2 +-
kernel/sched/sched.h | 25 +-
kernel/sched/topology.c | 28 +-
kernel/smp.c | 26 +-
kernel/softirq.c | 5 +-
kernel/time/hrtimer.c | 4 +-
kernel/time/jiffies.c | 8 +-
kernel/time/ntp.c | 30 +-
kernel/time/tick-common.c | 6 +-
kernel/time/tick-internal.h | 6 +-
kernel/time/tick-sched.c | 4 +-
kernel/time/timekeeping.c | 10 +-
kernel/time/timekeeping.h | 2 +-
kernel/time/timer.c | 4 +-
kernel/trace/ring_buffer.c | 5 +-
kernel/trace/trace.c | 4 +-
kernel/trace/trace_preemptirq.c | 2 +-
kernel/trace/trace_sched_switch.c | 4 +-
kernel/tracepoint.c | 2 +-
kernel/watchdog.c | 12 +-
lib/debug_locks.c | 5 +-
lib/irq_regs.c | 2 +-
lib/radix-tree.c | 6 +-
lib/random32.c | 3 +-
mm/init-mm.c | 2 +
mm/internal.h | 3 +
mm/memcontrol.c | 37 +-
mm/memory.c | 4 +-
mm/page_alloc.c | 204 +++-
mm/percpu-internal.h | 23 +-
mm/percpu-km.c | 5 +-
mm/percpu-vm.c | 57 +-
mm/percpu.c | 273 ++++-
mm/slab.c | 42 +-
mm/slab.h | 166 ++-
mm/slab_common.c | 461 ++++++-
mm/slub.c | 140 ++-
mm/sparse.c | 4 +-
mm/util.c | 3 +-
mm/vmalloc.c | 193 ++-
net/core/skbuff.c | 2 +-
net/core/sock.c | 2 +-
security/Kconfig | 12 +
tools/perf/builtin-kmem.c | 2 +
virt/kvm/coalesced_mmio.c | 2 +-
virt/kvm/eventfd.c | 5 +-
virt/kvm/kvm_main.c | 61 +-
211 files changed, 5727 insertions(+), 959 deletions(-)
create mode 100644 arch/x86/include/asm/asi.h
create mode 100644 arch/x86/mm/asi.c
create mode 100644 include/asm-generic/asi.h

--
2.35.1.473.g83b2b277ed-goog


2022-02-23 08:34:36

by Junaid Shahid

Subject: [RFC PATCH 11/47] mm: asi: Global non-sensitive vmalloc/vmap support

A new flag, VM_GLOBAL_NONSENSITIVE, is added to designate globally
non-sensitive vmalloc/vmap areas. When using the __vmalloc /
__vmalloc_node APIs, if the corresponding GFP flag is specified, the
VM flag is automatically added. When using the __vmalloc_node_range API,
either flag can be specified independently. The VM flag will only map
the vmalloc area as non-sensitive, while the GFP flag will only map the
underlying direct map area as non-sensitive.

When using the __vmalloc_node_range API, instead of VMALLOC_START/END,
VMALLOC_GLOBAL_NONSENSITIVE_START/END should be used. This is to
keep these mappings separate from locally non-sensitive vmalloc areas,
which will be added later. Areas outside of the standard vmalloc range
can specify the range as before.
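
For illustration (the callers below are hypothetical; the flags and
range macros are the ones added by this patch):

  /* Both the vmalloc mapping and the underlying pages non-sensitive: */
  p = __vmalloc(size, GFP_KERNEL | __GFP_GLOBAL_NONSENSITIVE);

  /* Finer-grained control via __vmalloc_node_range(), mapping only the
   * vmalloc area as non-sensitive, within the dedicated range:
   */
  p = __vmalloc_node_range(size, align,
                           VMALLOC_GLOBAL_NONSENSITIVE_START,
                           VMALLOC_GLOBAL_NONSENSITIVE_END,
                           GFP_KERNEL, PAGE_KERNEL, VM_GLOBAL_NONSENSITIVE,
                           NUMA_NO_NODE, __builtin_return_address(0));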

Signed-off-by: Junaid Shahid <[email protected]>


---
arch/x86/include/asm/pgtable_64_types.h | 5 +++
arch/x86/mm/asi.c | 3 +-
include/asm-generic/asi.h | 3 ++
include/linux/vmalloc.h | 6 +++
mm/vmalloc.c | 53 ++++++++++++++++++++++---
5 files changed, 64 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/pgtable_64_types.h b/arch/x86/include/asm/pgtable_64_types.h
index 91ac10654570..0fc380ba25b8 100644
--- a/arch/x86/include/asm/pgtable_64_types.h
+++ b/arch/x86/include/asm/pgtable_64_types.h
@@ -141,6 +141,11 @@ extern unsigned int ptrs_per_p4d;

#define VMALLOC_END (VMALLOC_START + (VMALLOC_SIZE_TB << 40) - 1)

+#ifdef CONFIG_ADDRESS_SPACE_ISOLATION
+#define VMALLOC_GLOBAL_NONSENSITIVE_START VMALLOC_START
+#define VMALLOC_GLOBAL_NONSENSITIVE_END VMALLOC_END
+#endif
+
#define MODULES_VADDR (__START_KERNEL_map + KERNEL_IMAGE_SIZE)
/* The module sections ends with the start of the fixmap */
#ifndef CONFIG_DEBUG_KMAP_LOCAL_FORCE_MAP
diff --git a/arch/x86/mm/asi.c b/arch/x86/mm/asi.c
index d381ae573af9..71348399baf1 100644
--- a/arch/x86/mm/asi.c
+++ b/arch/x86/mm/asi.c
@@ -198,7 +198,8 @@ static int __init asi_global_init(void)
"ASI Global Non-sensitive direct map");

preallocate_toplevel_pgtbls(asi_global_nonsensitive_pgd,
- VMALLOC_START, VMALLOC_END,
+ VMALLOC_GLOBAL_NONSENSITIVE_START,
+ VMALLOC_GLOBAL_NONSENSITIVE_END,
"ASI Global Non-sensitive vmalloc");

return 0;
diff --git a/include/asm-generic/asi.h b/include/asm-generic/asi.h
index 012691e29895..f918cd052722 100644
--- a/include/asm-generic/asi.h
+++ b/include/asm-generic/asi.h
@@ -14,6 +14,9 @@

#define ASI_GLOBAL_NONSENSITIVE NULL

+#define VMALLOC_GLOBAL_NONSENSITIVE_START VMALLOC_START
+#define VMALLOC_GLOBAL_NONSENSITIVE_END VMALLOC_END
+
#ifndef _ASSEMBLY_

struct asi_hooks {};
diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index 6e022cc712e6..c7c66decda3e 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -39,6 +39,12 @@ struct notifier_block; /* in notifier.h */
* determine which allocations need the module shadow freed.
*/

+#ifdef CONFIG_ADDRESS_SPACE_ISOLATION
+#define VM_GLOBAL_NONSENSITIVE 0x00000800 /* Similar to __GFP_GLOBAL_NONSENSITIVE */
+#else
+#define VM_GLOBAL_NONSENSITIVE 0
+#endif
+
/* bits [20..32] reserved for arch specific ioremap internals */

/*
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index f2ef719f1cba..ba588a37ee75 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -2393,6 +2393,33 @@ void __init vmalloc_init(void)
vmap_initialized = true;
}

+static int asi_map_vm_area(struct vm_struct *area)
+{
+ if (!static_asi_enabled())
+ return 0;
+
+ if (area->flags & VM_GLOBAL_NONSENSITIVE)
+ return asi_map(ASI_GLOBAL_NONSENSITIVE, area->addr,
+ get_vm_area_size(area));
+
+ return 0;
+}
+
+static void asi_unmap_vm_area(struct vm_struct *area)
+{
+ if (!static_asi_enabled())
+ return;
+
+ /*
+ * TODO: The TLB flush here could potentially be avoided in
+ * the case when the existing flush from try_purge_vmap_area_lazy()
+ * and/or vm_unmap_aliases() happens non-lazily.
+ */
+ if (area->flags & VM_GLOBAL_NONSENSITIVE)
+ asi_unmap(ASI_GLOBAL_NONSENSITIVE, area->addr,
+ get_vm_area_size(area), true);
+}
+
static inline void setup_vmalloc_vm_locked(struct vm_struct *vm,
struct vmap_area *va, unsigned long flags, const void *caller)
{
@@ -2570,6 +2597,7 @@ static void vm_remove_mappings(struct vm_struct *area, int deallocate_pages)
int flush_dmap = 0;
int i;

+ asi_unmap_vm_area(area);
remove_vm_area(area->addr);

/* If this is not VM_FLUSH_RESET_PERMS memory, no need for the below. */
@@ -2787,16 +2815,20 @@ void *vmap(struct page **pages, unsigned int count,

addr = (unsigned long)area->addr;
if (vmap_pages_range(addr, addr + size, pgprot_nx(prot),
- pages, PAGE_SHIFT) < 0) {
- vunmap(area->addr);
- return NULL;
- }
+ pages, PAGE_SHIFT) < 0)
+ goto err;
+
+ if (asi_map_vm_area(area))
+ goto err;

if (flags & VM_MAP_PUT_PAGES) {
area->pages = pages;
area->nr_pages = count;
}
return area->addr;
+err:
+ vunmap(area->addr);
+ return NULL;
}
EXPORT_SYMBOL(vmap);

@@ -2991,6 +3023,9 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
goto fail;
}

+ if (asi_map_vm_area(area))
+ goto fail;
+
return area->addr;

fail:
@@ -3038,6 +3073,9 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align,
if (WARN_ON_ONCE(!size))
return NULL;

+ if (static_asi_enabled() && (vm_flags & VM_GLOBAL_NONSENSITIVE))
+ gfp_mask |= __GFP_ZERO;
+
if ((size >> PAGE_SHIFT) > totalram_pages()) {
warn_alloc(gfp_mask, NULL,
"vmalloc error: size %lu, exceeds total pages",
@@ -3127,8 +3165,13 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align,
void *__vmalloc_node(unsigned long size, unsigned long align,
gfp_t gfp_mask, int node, const void *caller)
{
+ ulong vm_flags = 0;
+
+ if (static_asi_enabled() && (gfp_mask & __GFP_GLOBAL_NONSENSITIVE))
+ vm_flags |= VM_GLOBAL_NONSENSITIVE;
+
return __vmalloc_node_range(size, align, VMALLOC_START, VMALLOC_END,
- gfp_mask, PAGE_KERNEL, 0, node, caller);
+ gfp_mask, PAGE_KERNEL, vm_flags, node, caller);
}
/*
* This is only for performance analysis of vmalloc and stress purpose.
--
2.35.1.473.g83b2b277ed-goog

2022-02-23 08:41:32

by Junaid Shahid

Subject: [RFC PATCH 13/47] asi: Added ASI memory cgroup flag

From: Ofir Weisse <[email protected]>

Adds a cgroup flag to control whether ASI is enabled for processes in
that cgroup.

The flag can be set or cleared by writing to the memory.use_asi file of
the memory cgroup. It only affects new processes created after the flag
was set.

In addition to the cgroup flag, we may also want to add a per-process
flag, though it will have to be something that can be set at process
creation time.
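
For example (the mount point below is illustrative, assuming the v1
memory controller is mounted at /sys/fs/cgroup/memory):

  echo 1 > /sys/fs/cgroup/memory/<cgroup>/memory.use_asi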

Signed-off-by: Ofir Weisse <[email protected]>
Co-developed-by: Junaid Shahid <[email protected]>
Signed-off-by: Junaid Shahid <[email protected]>


---
arch/x86/mm/asi.c | 14 ++++++++++++++
include/linux/memcontrol.h | 3 +++
include/linux/mm_types.h | 17 +++++++++++++++++
mm/memcontrol.c | 30 ++++++++++++++++++++++++++++++
4 files changed, 64 insertions(+)

diff --git a/arch/x86/mm/asi.c b/arch/x86/mm/asi.c
index 71348399baf1..ca50a32ecd7e 100644
--- a/arch/x86/mm/asi.c
+++ b/arch/x86/mm/asi.c
@@ -2,6 +2,7 @@

#include <linux/init.h>
#include <linux/memblock.h>
+#include <linux/memcontrol.h>

#include <asm/asi.h>
#include <asm/pgalloc.h>
@@ -322,7 +323,20 @@ EXPORT_SYMBOL_GPL(asi_exit);

void asi_init_mm_state(struct mm_struct *mm)
{
+ struct mem_cgroup *memcg = get_mem_cgroup_from_mm(mm);
+
memset(mm->asi, 0, sizeof(mm->asi));
+ mm->asi_enabled = false;
+
+ /*
+ * TODO: In addition to a cgroup flag, we may also want a per-process
+ * flag.
+ */
+ if (memcg) {
+ mm->asi_enabled = boot_cpu_has(X86_FEATURE_ASI) &&
+ memcg->use_asi;
+ css_put(&memcg->css);
+ }
}

static bool is_page_within_range(size_t addr, size_t page_size,
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 0c5c403f4be6..a883cb458b06 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -259,6 +259,9 @@ struct mem_cgroup {
*/
bool oom_group;

+#ifdef CONFIG_ADDRESS_SPACE_ISOLATION
+ bool use_asi;
+#endif
/* protected by memcg_oom_lock */
bool oom_lock;
int under_oom;
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 5b8028fcfe67..8624d2783661 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -607,6 +607,14 @@ struct mm_struct {
* new_owner->alloc_lock is held
*/
struct task_struct __rcu *owner;
+
+#endif
+#ifdef CONFIG_ADDRESS_SPACE_ISOLATION
+ /* Is ASI enabled for this mm? ASI requires allocating extra
+ * resources, such as ASI page tables. To prevent allocating
+ * these resources for every mm in the system, we expect that
+ * only VM mm's will have this flag set. */
+ bool asi_enabled;
#endif
struct user_namespace *user_ns;

@@ -665,6 +673,15 @@ struct mm_struct {

extern struct mm_struct init_mm;

+#ifdef CONFIG_ADDRESS_SPACE_ISOLATION
+static inline bool mm_asi_enabled(struct mm_struct *mm)
+{
+ return mm->asi_enabled;
+}
+#else
+static inline bool mm_asi_enabled(struct mm_struct *mm) { return false; }
+#endif
+
/* Pointer magic because the dynamic array size confuses some compilers. */
static inline void mm_init_cpumask(struct mm_struct *mm)
{
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 2ed5f2a0879d..a66d6b222ecf 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3539,6 +3539,29 @@ static int mem_cgroup_hierarchy_write(struct cgroup_subsys_state *css,
return -EINVAL;
}

+#ifdef CONFIG_ADDRESS_SPACE_ISOLATION
+
+static u64 mem_cgroup_asi_read(struct cgroup_subsys_state *css,
+ struct cftype *cft)
+{
+ return mem_cgroup_from_css(css)->use_asi;
+}
+
+static int mem_cgroup_asi_write(struct cgroup_subsys_state *css,
+ struct cftype *cft, u64 val)
+{
+ struct mem_cgroup *memcg = mem_cgroup_from_css(css);
+
+ if (val == 1 || val == 0)
+ memcg->use_asi = val;
+ else
+ return -EINVAL;
+
+ return 0;
+}
+
+#endif
+
static unsigned long mem_cgroup_usage(struct mem_cgroup *memcg, bool swap)
{
unsigned long val;
@@ -4888,6 +4911,13 @@ static struct cftype mem_cgroup_legacy_files[] = {
.write_u64 = mem_cgroup_hierarchy_write,
.read_u64 = mem_cgroup_hierarchy_read,
},
+#ifdef CONFIG_ADDRESS_SPACE_ISOLATION
+ {
+ .name = "use_asi",
+ .write_u64 = mem_cgroup_asi_write,
+ .read_u64 = mem_cgroup_asi_read,
+ },
+#endif
{
.name = "cgroup.event_control", /* XXX: for compat */
.write = memcg_write_event_control,
--
2.35.1.473.g83b2b277ed-goog

2022-02-23 09:13:33

by Junaid Shahid

Subject: [RFC PATCH 03/47] mm: asi: Switch to unrestricted address space when entering scheduler

To keep things simpler, we run the scheduler only in the full
unrestricted address space for the time being.

Signed-off-by: Junaid Shahid <[email protected]>


---
kernel/sched/core.c | 5 +++++
1 file changed, 5 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 77563109c0ea..44ea197c16ea 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -19,6 +19,7 @@

#include <asm/switch_to.h>
#include <asm/tlb.h>
+#include <asm/asi.h>

#include "../workqueue_internal.h"
#include "../../fs/io-wq.h"
@@ -6141,6 +6142,10 @@ static void __sched notrace __schedule(unsigned int sched_mode)
rq = cpu_rq(cpu);
prev = rq->curr;

+ /* This could possibly be delayed to just before the context switch. */
+ VM_WARN_ON(!asi_is_target_unrestricted());
+ asi_exit();
+
schedule_debug(prev, !!sched_mode);

if (sched_feat(HRTICK) || sched_feat(HRTICK_DL))
--
2.35.1.473.g83b2b277ed-goog

2022-02-23 09:30:05

by Junaid Shahid

Subject: [RFC PATCH 08/47] mm: asi: Add basic infrastructure for global non-sensitive mappings

A pseudo-PGD is added to store global non-sensitive ASI mappings.
Actual ASI PGDs copy entries from this pseudo-PGD during asi_init().

Memory can be mapped as globally non-sensitive by calling asi_map()
with ASI_GLOBAL_NONSENSITIVE.

Page tables allocated for global non-sensitive mappings are never
freed.
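
For illustration, a hypothetical caller (addr and len are placeholders
for a page-aligned region):

  /* Map the region into the global non-sensitive pseudo-PGD, and hence
   * into every restricted address space of the standard classes:
   */
  err = asi_map(ASI_GLOBAL_NONSENSITIVE, addr, len);
  if (err)
          /* The memory stays mapped only in the full kernel tables. */
          return err;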

Signed-off-by: Junaid Shahid <[email protected]>


---
arch/x86/include/asm/asi.h | 12 ++++++++++++
arch/x86/mm/asi.c | 36 +++++++++++++++++++++++++++++++++++-
arch/x86/mm/init_64.c | 26 +++++++++++++++++---------
arch/x86/mm/mm_internal.h | 3 +++
include/asm-generic/asi.h | 5 +++++
mm/init-mm.c | 2 ++
6 files changed, 74 insertions(+), 10 deletions(-)

diff --git a/arch/x86/include/asm/asi.h b/arch/x86/include/asm/asi.h
index 521b40d1864b..64c2b4d1dba2 100644
--- a/arch/x86/include/asm/asi.h
+++ b/arch/x86/include/asm/asi.h
@@ -15,6 +15,8 @@
#define ASI_MAX_NUM_ORDER 2
#define ASI_MAX_NUM (1 << ASI_MAX_NUM_ORDER)

+#define ASI_GLOBAL_NONSENSITIVE (&init_mm.asi[0])
+
struct asi_state {
struct asi *curr_asi;
struct asi *target_asi;
@@ -41,6 +43,8 @@ struct asi {

DECLARE_PER_CPU_ALIGNED(struct asi_state, asi_cpu_state);

+extern pgd_t asi_global_nonsensitive_pgd[];
+
void asi_init_mm_state(struct mm_struct *mm);

int asi_register_class(const char *name, uint flags,
@@ -117,6 +121,14 @@ static inline void asi_intr_exit(void)
}
}

+#define INIT_MM_ASI(init_mm) \
+ .asi = { \
+ [0] = { \
+ .pgd = asi_global_nonsensitive_pgd, \
+ .mm = &init_mm \
+ } \
+ },
+
static inline pgd_t *asi_pgd(struct asi *asi)
{
return asi->pgd;
diff --git a/arch/x86/mm/asi.c b/arch/x86/mm/asi.c
index 84d220cbdcfc..d381ae573af9 100644
--- a/arch/x86/mm/asi.c
+++ b/arch/x86/mm/asi.c
@@ -1,11 +1,13 @@
// SPDX-License-Identifier: GPL-2.0

#include <linux/init.h>
+#include <linux/memblock.h>

#include <asm/asi.h>
#include <asm/pgalloc.h>
#include <asm/mmu_context.h>

+#include "mm_internal.h"
#include "../../../mm/internal.h"

#undef pr_fmt
@@ -17,6 +19,8 @@ static DEFINE_SPINLOCK(asi_class_lock);
DEFINE_PER_CPU_ALIGNED(struct asi_state, asi_cpu_state);
EXPORT_PER_CPU_SYMBOL_GPL(asi_cpu_state);

+__aligned(PAGE_SIZE) pgd_t asi_global_nonsensitive_pgd[PTRS_PER_PGD];
+
int asi_register_class(const char *name, uint flags,
const struct asi_hooks *ops)
{
@@ -160,12 +164,17 @@ static void asi_free_pgd_range(struct asi *asi, uint start, uint end)
* Free the page tables allocated for the given ASI instance.
* The caller must ensure that all the mappings have already been cleared
* and appropriate TLB flushes have been issued before calling this function.
+ *
+ * For standard non-sensitive ASI classes, the page tables shared with the
+ * master pseudo-PGD are not freed.
*/
static void asi_free_pgd(struct asi *asi)
{
VM_BUG_ON(asi->mm == &init_mm);

- asi_free_pgd_range(asi, KERNEL_PGD_BOUNDARY, PTRS_PER_PGD);
+ if (!(asi->class->flags & ASI_MAP_STANDARD_NONSENSITIVE))
+ asi_free_pgd_range(asi, KERNEL_PGD_BOUNDARY, PTRS_PER_PGD);
+
free_pages((ulong)asi->pgd, PGD_ALLOCATION_ORDER);
}

@@ -178,6 +187,24 @@ static int __init set_asi_param(char *str)
}
early_param("asi", set_asi_param);

+static int __init asi_global_init(void)
+{
+ if (!boot_cpu_has(X86_FEATURE_ASI))
+ return 0;
+
+ preallocate_toplevel_pgtbls(asi_global_nonsensitive_pgd,
+ PAGE_OFFSET,
+ PAGE_OFFSET + PFN_PHYS(max_possible_pfn) - 1,
+ "ASI Global Non-sensitive direct map");
+
+ preallocate_toplevel_pgtbls(asi_global_nonsensitive_pgd,
+ VMALLOC_START, VMALLOC_END,
+ "ASI Global Non-sensitive vmalloc");
+
+ return 0;
+}
+subsys_initcall(asi_global_init)
+
int asi_init(struct mm_struct *mm, int asi_index)
{
struct asi *asi = &mm->asi[asi_index];
@@ -202,6 +229,13 @@ int asi_init(struct mm_struct *mm, int asi_index)
asi->class = &asi_class[asi_index];
asi->mm = mm;

+ if (asi->class->flags & ASI_MAP_STANDARD_NONSENSITIVE) {
+ uint i;
+
+ for (i = KERNEL_PGD_BOUNDARY; i < PTRS_PER_PGD; i++)
+ set_pgd(asi->pgd + i, asi_global_nonsensitive_pgd[i]);
+ }
+
return 0;
}
EXPORT_SYMBOL_GPL(asi_init);
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 36098226a957..ebd512c64ed0 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -1277,18 +1277,15 @@ static void __init register_page_bootmem_info(void)
#endif
}

-/*
- * Pre-allocates page-table pages for the vmalloc area in the kernel page-table.
- * Only the level which needs to be synchronized between all page-tables is
- * allocated because the synchronization can be expensive.
- */
-static void __init preallocate_vmalloc_pages(void)
+void __init preallocate_toplevel_pgtbls(pgd_t *pgd_table,
+ ulong start, ulong end,
+ const char *name)
{
unsigned long addr;
const char *lvl;

- for (addr = VMALLOC_START; addr <= VMALLOC_END; addr = ALIGN(addr + 1, PGDIR_SIZE)) {
- pgd_t *pgd = pgd_offset_k(addr);
+ for (addr = start; addr <= end; addr = ALIGN(addr + 1, PGDIR_SIZE)) {
+ pgd_t *pgd = pgd_offset_pgd(pgd_table, addr);
p4d_t *p4d;
pud_t *pud;

@@ -1324,7 +1321,18 @@ static void __init preallocate_vmalloc_pages(void)
* The pages have to be there now or they will be missing in
* process page-tables later.
*/
- panic("Failed to pre-allocate %s pages for vmalloc area\n", lvl);
+ panic("Failed to pre-allocate %s pages for %s area\n", lvl, name);
+}
+
+/*
+ * Pre-allocates page-table pages for the vmalloc area in the kernel page-table.
+ * Only the level which needs to be synchronized between all page-tables is
+ * allocated because the synchronization can be expensive.
+ */
+static void __init preallocate_vmalloc_pages(void)
+{
+ preallocate_toplevel_pgtbls(init_mm.pgd, VMALLOC_START, VMALLOC_END,
+ "vmalloc");
}

void __init mem_init(void)
diff --git a/arch/x86/mm/mm_internal.h b/arch/x86/mm/mm_internal.h
index 3f37b5c80bb3..a1e8c523ab08 100644
--- a/arch/x86/mm/mm_internal.h
+++ b/arch/x86/mm/mm_internal.h
@@ -19,6 +19,9 @@ unsigned long kernel_physical_mapping_change(unsigned long start,
unsigned long page_size_mask);
void zone_sizes_init(void);

+void preallocate_toplevel_pgtbls(pgd_t *pgd_table, ulong start, ulong end,
+ const char *name);
+
extern int after_bootmem;

void update_cache_mode_entry(unsigned entry, enum page_cache_mode cache);
diff --git a/include/asm-generic/asi.h b/include/asm-generic/asi.h
index 7da91cbe075d..012691e29895 100644
--- a/include/asm-generic/asi.h
+++ b/include/asm-generic/asi.h
@@ -12,6 +12,8 @@
#define ASI_MAX_NUM_ORDER 0
#define ASI_MAX_NUM 0

+#define ASI_GLOBAL_NONSENSITIVE NULL
+
#ifndef _ASSEMBLY_

struct asi_hooks {};
@@ -63,8 +65,11 @@ void asi_unmap(struct asi *asi, void *addr, size_t len, bool flush_tlb) { }
static inline
void asi_flush_tlb_range(struct asi *asi, void *addr, size_t len) { }

+#define INIT_MM_ASI(init_mm)
+
#define static_asi_enabled() false

+
#endif /* !_ASSEMBLY_ */

#endif /* !CONFIG_ADDRESS_SPACE_ISOLATION */
diff --git a/mm/init-mm.c b/mm/init-mm.c
index b4a6f38fb51d..47a6a66610fb 100644
--- a/mm/init-mm.c
+++ b/mm/init-mm.c
@@ -11,6 +11,7 @@
#include <linux/atomic.h>
#include <linux/user_namespace.h>
#include <asm/mmu.h>
+#include <asm/asi.h>

#ifndef INIT_MM_CONTEXT
#define INIT_MM_CONTEXT(name)
@@ -38,6 +39,7 @@ struct mm_struct init_mm = {
.mmlist = LIST_HEAD_INIT(init_mm.mmlist),
.user_ns = &init_user_ns,
.cpu_bitmap = CPU_BITS_NONE,
+ INIT_MM_ASI(init_mm)
INIT_MM_CONTEXT(init_mm)
};

--
2.35.1.473.g83b2b277ed-goog

2022-02-23 09:49:55

by Junaid Shahid

Subject: [RFC PATCH 27/47] mm: asi: Avoid TLB flushes during ASI CR3 switches when possible

The TLB flush functions are modified to flush the ASI PCIDs in addition
to the unrestricted kernel PCID and the KPTI PCID. Some tracking is
also added to figure out when the TLB state for ASI PCIDs is out-of-date
(e.g. due to lack of INVPCID support), and ASI Enter/Exit use this
information to skip a TLB flush during the CR3 switch when the TLB is
already up-to-date.

Signed-off-by: Junaid Shahid <[email protected]>


---
arch/x86/include/asm/asi.h | 11 ++-
arch/x86/include/asm/tlbflush.h | 47 ++++++++++
arch/x86/mm/asi.c | 38 +++++++-
arch/x86/mm/tlb.c | 152 ++++++++++++++++++++++++++++++--
4 files changed, 234 insertions(+), 14 deletions(-)

diff --git a/arch/x86/include/asm/asi.h b/arch/x86/include/asm/asi.h
index aaa0d0bdbf59..1a77917c79c7 100644
--- a/arch/x86/include/asm/asi.h
+++ b/arch/x86/include/asm/asi.h
@@ -126,11 +126,18 @@ static inline void asi_intr_exit(void)
if (static_cpu_has(X86_FEATURE_ASI)) {
barrier();

- if (--current->thread.intr_nest_depth == 0)
+ if (--current->thread.intr_nest_depth == 0) {
+ barrier();
__asi_enter();
+ }
}
}

+static inline int asi_intr_nest_depth(void)
+{
+ return current->thread.intr_nest_depth;
+}
+
#define INIT_MM_ASI(init_mm) \
.asi = { \
[0] = { \
@@ -150,6 +157,8 @@ static inline void asi_intr_enter(void) { }

static inline void asi_intr_exit(void) { }

+static inline int asi_intr_nest_depth(void) { return 0; }
+
static inline void asi_init_thread_state(struct thread_struct *thread) { }

static inline pgd_t *asi_pgd(struct asi *asi) { return NULL; }
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index f9ec5e67e361..295bebdb4395 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -12,6 +12,7 @@
#include <asm/invpcid.h>
#include <asm/pti.h>
#include <asm/processor-flags.h>
+#include <asm/asi.h>

void __flush_tlb_all(void);

@@ -59,9 +60,20 @@ static inline void cr4_clear_bits(unsigned long mask)
*/
#define TLB_NR_DYN_ASIDS 6

+#ifdef CONFIG_ADDRESS_SPACE_ISOLATION
+
+struct asi_tlb_context {
+ bool flush_pending;
+};
+
+#endif
+
struct tlb_context {
u64 ctx_id;
u64 tlb_gen;
+#ifdef CONFIG_ADDRESS_SPACE_ISOLATION
+ struct asi_tlb_context asi_context[ASI_MAX_NUM];
+#endif
};

struct tlb_state {
@@ -100,6 +112,10 @@ struct tlb_state {
*/
bool invalidate_other;

+#ifdef CONFIG_ADDRESS_SPACE_ISOLATION
+ /* If set, ASI Exit needs to do a TLB flush during the CR3 switch */
+ bool kern_pcid_needs_flush;
+#endif
/*
* Mask that contains TLB_NR_DYN_ASIDS+1 bits to indicate
* the corresponding user PCID needs a flush next time we
@@ -262,8 +278,39 @@ extern void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch);
unsigned long build_cr3(pgd_t *pgd, u16 asid);
unsigned long build_cr3_pcid(pgd_t *pgd, u16 pcid, bool noflush);

+u16 kern_pcid(u16 asid);
u16 asi_pcid(struct asi *asi, u16 asid);

+#ifdef CONFIG_ADDRESS_SPACE_ISOLATION
+
+static inline bool *__asi_tlb_flush_pending(struct asi *asi)
+{
+ struct tlb_state *tlb_state;
+ struct tlb_context *tlb_context;
+
+ tlb_state = this_cpu_ptr(&cpu_tlbstate);
+ tlb_context = &tlb_state->ctxs[tlb_state->loaded_mm_asid];
+ return &tlb_context->asi_context[asi->pcid_index].flush_pending;
+}
+
+static inline bool asi_get_and_clear_tlb_flush_pending(struct asi *asi)
+{
+ bool *tlb_flush_pending_ptr = __asi_tlb_flush_pending(asi);
+ bool tlb_flush_pending = READ_ONCE(*tlb_flush_pending_ptr);
+
+ if (tlb_flush_pending)
+ WRITE_ONCE(*tlb_flush_pending_ptr, false);
+
+ return tlb_flush_pending;
+}
+
+static inline void asi_clear_pending_tlb_flush(struct asi *asi)
+{
+ WRITE_ONCE(*__asi_tlb_flush_pending(asi), false);
+}
+
+#endif /* CONFIG_ADDRESS_SPACE_ISOLATION */
+
#endif /* !MODULE */

#endif /* _ASM_X86_TLBFLUSH_H */
diff --git a/arch/x86/mm/asi.c b/arch/x86/mm/asi.c
index dbfea3dc4bb1..17b8e6e60312 100644
--- a/arch/x86/mm/asi.c
+++ b/arch/x86/mm/asi.c
@@ -388,6 +388,7 @@ void __asi_enter(void)
{
u64 asi_cr3;
u16 pcid;
+ bool need_flush = false;
struct asi *target = this_cpu_read(asi_cpu_state.target_asi);

VM_BUG_ON(preemptible());
@@ -401,8 +402,18 @@ void __asi_enter(void)

this_cpu_write(asi_cpu_state.curr_asi, target);

+ if (static_cpu_has(X86_FEATURE_PCID))
+ need_flush = asi_get_and_clear_tlb_flush_pending(target);
+
+ /*
+ * It is possible that we may get a TLB flush IPI after
+ * already reading need_flush, in which case we won't do the
+ * flush below. However, in that case the interrupt epilog
+ * will also call __asi_enter(), which will do the flush.
+ */
+
pcid = asi_pcid(target, this_cpu_read(cpu_tlbstate.loaded_mm_asid));
- asi_cr3 = build_cr3_pcid(target->pgd, pcid, false);
+ asi_cr3 = build_cr3_pcid(target->pgd, pcid, !need_flush);
write_cr3(asi_cr3);

if (target->class->ops.post_asi_enter)
@@ -437,12 +448,31 @@ void asi_exit(void)
asi = this_cpu_read(asi_cpu_state.curr_asi);

if (asi) {
+ bool need_flush = false;
+
if (asi->class->ops.pre_asi_exit)
asi->class->ops.pre_asi_exit();

- unrestricted_cr3 =
- build_cr3(this_cpu_read(cpu_tlbstate.loaded_mm)->pgd,
- this_cpu_read(cpu_tlbstate.loaded_mm_asid));
+ if (static_cpu_has(X86_FEATURE_PCID) &&
+ !static_cpu_has(X86_FEATURE_INVPCID_SINGLE)) {
+ need_flush = this_cpu_read(
+ cpu_tlbstate.kern_pcid_needs_flush);
+ this_cpu_write(cpu_tlbstate.kern_pcid_needs_flush,
+ false);
+ }
+
+ /*
+ * It is possible that we may get a TLB flush IPI after
+ * already reading need_flush. However, in that case the IPI
+ * will not set flush_pending for the unrestricted address
+ * space, as that is done by flush_tlb_one_user() only if
+ * asi_intr_nest_depth() is 0.
+ */
+
+ unrestricted_cr3 = build_cr3_pcid(
+ this_cpu_read(cpu_tlbstate.loaded_mm)->pgd,
+ kern_pcid(this_cpu_read(cpu_tlbstate.loaded_mm_asid)),
+ !need_flush);

write_cr3(unrestricted_cr3);
this_cpu_write(asi_cpu_state.curr_asi, NULL);
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 312b9c185a55..5c9681df3a16 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -114,7 +114,7 @@ static_assert(TLB_NR_DYN_ASIDS < BIT(CR3_AVAIL_PCID_BITS));
/*
* Given @asid, compute kPCID
*/
-static inline u16 kern_pcid(u16 asid)
+inline u16 kern_pcid(u16 asid)
{
VM_WARN_ON_ONCE(asid > MAX_ASID_AVAILABLE);

@@ -166,6 +166,60 @@ u16 asi_pcid(struct asi *asi, u16 asid)
return kern_pcid(asid) | (asi->pcid_index << ASI_PCID_BITS_SHIFT);
}

+static void invalidate_kern_pcid(void)
+{
+ this_cpu_write(cpu_tlbstate.kern_pcid_needs_flush, true);
+}
+
+static void invalidate_asi_pcid(struct asi *asi, u16 asid)
+{
+ uint i;
+ struct asi_tlb_context *asi_tlb_context;
+
+ if (!static_cpu_has(X86_FEATURE_ASI) ||
+ !static_cpu_has(X86_FEATURE_PCID))
+ return;
+
+ asi_tlb_context = this_cpu_ptr(cpu_tlbstate.ctxs[asid].asi_context);
+
+ if (asi)
+ asi_tlb_context[asi->pcid_index].flush_pending = true;
+ else
+ for (i = 1; i < ASI_MAX_NUM; i++)
+ asi_tlb_context[i].flush_pending = true;
+}
+
+static void flush_asi_pcid(struct asi *asi)
+{
+ u16 asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid);
+ /*
+ * The flag should be cleared before the INVPCID, to avoid clearing it
+ * in case an interrupt/exception sets it again after the INVPCID.
+ */
+ asi_clear_pending_tlb_flush(asi);
+ invpcid_flush_single_context(asi_pcid(asi, asid));
+}
+
+static void __flush_tlb_one_asi(struct asi *asi, u16 asid, size_t addr)
+{
+ if (!static_cpu_has(X86_FEATURE_ASI))
+ return;
+
+ if (!static_cpu_has(X86_FEATURE_INVPCID_SINGLE)) {
+ invalidate_asi_pcid(asi, asid);
+ } else if (asi) {
+ invpcid_flush_one(asi_pcid(asi, asid), addr);
+ } else {
+ uint i;
+ struct mm_struct *mm = this_cpu_read(cpu_tlbstate.loaded_mm);
+
+ for (i = 1; i < ASI_MAX_NUM; i++)
+ if (mm->asi[i].pgd)
+ invpcid_flush_one(asi_pcid(&mm->asi[i], asid),
+ addr);
+ }
+}
+
#else /* CONFIG_ADDRESS_SPACE_ISOLATION */

u16 asi_pcid(struct asi *asi, u16 asid)
@@ -173,6 +227,11 @@ u16 asi_pcid(struct asi *asi, u16 asid)
return kern_pcid(asid);
}

+static inline void invalidate_kern_pcid(void) { }
+static inline void invalidate_asi_pcid(struct asi *asi, u16 asid) { }
+static inline void flush_asi_pcid(struct asi *asi) { }
+static inline void __flush_tlb_one_asi(struct asi *asi, u16 asid, size_t addr) { }
+
#endif /* CONFIG_ADDRESS_SPACE_ISOLATION */

unsigned long build_cr3_pcid(pgd_t *pgd, u16 pcid, bool noflush)
@@ -223,7 +282,8 @@ static void clear_asid_other(void)
* This is only expected to be set if we have disabled
* kernel _PAGE_GLOBAL pages.
*/
- if (!static_cpu_has(X86_FEATURE_PTI)) {
+ if (!static_cpu_has(X86_FEATURE_PTI) &&
+ !cpu_feature_enabled(X86_FEATURE_ASI)) {
WARN_ON_ONCE(1);
return;
}
@@ -313,6 +373,7 @@ static void load_new_mm_cr3(pgd_t *pgdir, u16 new_asid, bool need_flush)

if (need_flush) {
invalidate_user_asid(new_asid);
+ invalidate_asi_pcid(NULL, new_asid);
new_mm_cr3 = build_cr3(pgdir, new_asid);
} else {
new_mm_cr3 = build_cr3_noflush(pgdir, new_asid);
@@ -741,11 +802,17 @@ void initialize_tlbstate_and_flush(void)
this_cpu_write(cpu_tlbstate.next_asid, 1);
this_cpu_write(cpu_tlbstate.ctxs[0].ctx_id, mm->context.ctx_id);
this_cpu_write(cpu_tlbstate.ctxs[0].tlb_gen, tlb_gen);
+ invalidate_asi_pcid(NULL, 0);

for (i = 1; i < TLB_NR_DYN_ASIDS; i++)
this_cpu_write(cpu_tlbstate.ctxs[i].ctx_id, 0);
}

+static inline void invlpg(unsigned long addr)
+{
+ asm volatile("invlpg (%0)" ::"r"(addr) : "memory");
+}
+
/*
* flush_tlb_func()'s memory ordering requirement is that any
* TLB fills that happen after we flush the TLB are ordered after we
@@ -967,7 +1034,8 @@ void flush_tlb_multi(const struct cpumask *cpumask,
* least 95%) of allocations, and is small enough that we are
* confident it will not cause too much overhead. Each single
* flush is about 100 ns, so this caps the maximum overhead at
- * _about_ 3,000 ns.
+ * _about_ 3,000 ns (plus up to an additional ~3000 ns for each
+ * ASI instance, or for KPTI).
*
* This is in units of pages.
*/
@@ -1157,7 +1225,8 @@ void flush_tlb_one_kernel(unsigned long addr)
*/
flush_tlb_one_user(addr);

- if (!static_cpu_has(X86_FEATURE_PTI))
+ if (!static_cpu_has(X86_FEATURE_PTI) &&
+ !cpu_feature_enabled(X86_FEATURE_ASI))
return;

/*
@@ -1174,9 +1243,45 @@ void flush_tlb_one_kernel(unsigned long addr)
*/
STATIC_NOPV void native_flush_tlb_one_user(unsigned long addr)
{
- u32 loaded_mm_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid);
+ u16 loaded_mm_asid;

- asm volatile("invlpg (%0)" ::"r" (addr) : "memory");
+ if (!static_cpu_has(X86_FEATURE_PCID)) {
+ invlpg(addr);
+ return;
+ }
+
+ loaded_mm_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid);
+
+ /*
+ * If we don't have INVPCID support, then we do an ASI Exit so that
+ * the invlpg happens in the unrestricted address space, and we
+ * invalidate the ASI PCID so that it is flushed at the next ASI Enter.
+ *
+ * But if a valid target ASI is set, then an ASI Exit can be ephemeral
+ * due to interrupts/exceptions/NMIs (except if we are already inside
+ * one), so we just invalidate both the ASI and the unrestricted kernel
+ * PCIDs and let the invlpg flush whichever happens to be the current
+ * address space. This is a bit more wasteful, but this scenario is not
+ * actually expected to occur with the current usage of ASI, and is
+ * handled here just for completeness. (If we wanted to optimize this,
+ * we could manipulate the intr_nest_depth to guarantee that an ASI
+ * Exit is not ephemeral).
+ */
+ if (!static_cpu_has(X86_FEATURE_INVPCID_SINGLE)) {
+ if (unlikely(!asi_is_target_unrestricted()) &&
+ asi_intr_nest_depth() == 0)
+ invalidate_kern_pcid();
+ else
+ asi_exit();
+ }
+
+ /* Flush the unrestricted kernel address space */
+ if (!is_asi_active())
+ invlpg(addr);
+ else
+ invpcid_flush_one(kern_pcid(loaded_mm_asid), addr);
+
+ __flush_tlb_one_asi(NULL, loaded_mm_asid, addr);

if (!static_cpu_has(X86_FEATURE_PTI))
return;
@@ -1235,6 +1340,9 @@ STATIC_NOPV void native_flush_tlb_global(void)
*/
STATIC_NOPV void native_flush_tlb_local(void)
{
+ struct asi *asi;
+ u16 asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid);
+
/*
* Preemption or interrupts must be disabled to protect the access
* to the per CPU variable and to prevent being preempted between
@@ -1242,10 +1350,36 @@ STATIC_NOPV void native_flush_tlb_local(void)
*/
WARN_ON_ONCE(preemptible());

- invalidate_user_asid(this_cpu_read(cpu_tlbstate.loaded_mm_asid));
+ /*
+ * If we don't have INVPCID support, then we have to use
+ * write_cr3(read_cr3()). However, that is not safe when ASI is active,
+ * as an interrupt/exception/NMI could cause an ASI Exit in the middle
+ * and change CR3. So we trigger an ASI Exit beforehand. But if a valid
+ * target ASI is set, then an ASI Exit can also be ephemeral due to
+ * interrupts (except if we are already inside one), and thus we have to
+ * fallback to a global TLB flush.
+ */
+ if (!static_cpu_has(X86_FEATURE_INVPCID_SINGLE)) {
+ if (unlikely(!asi_is_target_unrestricted()) &&
+ asi_intr_nest_depth() == 0) {
+ native_flush_tlb_global();
+ return;
+ }
+ asi_exit();
+ }

- /* If current->mm == NULL then the read_cr3() "borrows" an mm */
- native_write_cr3(__native_read_cr3());
+ invalidate_user_asid(asid);
+ invalidate_asi_pcid(NULL, asid);
+
+ asi = asi_get_current();
+
+ if (!asi) {
+ /* If current->mm == NULL then the read_cr3() "borrows" an mm */
+ native_write_cr3(__native_read_cr3());
+ } else {
+ invpcid_flush_single_context(kern_pcid(asid));
+ flush_asi_pcid(asi);
+ }
}

void flush_tlb_local(void)
--
2.35.1.473.g83b2b277ed-goog

2022-02-23 10:05:02

by Junaid Shahid

Subject: [RFC PATCH 45/47] mm: asi: Mapping global nonsensitive areas in asi_global_init

From: Ofir Weisse <[email protected]>

There are several areas in memory which we consider non-sensitive.
These areas should be mapped in every ASI domain. We map these areas
in asi_global_init(). We modified some of the linker scripts to
ensure these areas start and end on page boundaries.

The areas:
- _stext --> _etext
- __init_begin --> __init_end
- __start_rodata --> __end_rodata
- __start_once --> __end_once
- __start___ex_table --> __stop___ex_table
- __start_asi_nonsensitive --> __end_asi_nonsensitive
- __start_asi_nonsensitive_readmostly -->
__end_asi_nonsensitive_readmostly
- __vvar_page --> + PAGE_SIZE
- APIC_BASE --> + PAGE_SIZE
- phys_base --> + PAGE_SIZE
- __start___tracepoints_ptrs --> __stop___tracepoints_ptrs
- __start___tracepoint_str --> __stop___tracepoint_str
- __per_cpu_asi_start --> __per_cpu_asi_end (percpu)
- irq_stack_backing_store --> + sizeof(irq_stack_backing_store)
(percpu)

The pgd's of the following addresses are cloned, modeled after KPTI:
- CPU_ENTRY_AREA_BASE
- ESPFIX_BASE_ADDR

Signed-off-by: Ofir Weisse <[email protected]>


---
arch/x86/kernel/head_64.S | 12 +++++
arch/x86/kernel/vmlinux.lds.S | 2 +-
arch/x86/mm/asi.c | 82 +++++++++++++++++++++++++++++++
include/asm-generic/vmlinux.lds.h | 13 +++--
4 files changed, 105 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index d8b3ebd2bb85..3d3874661895 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -574,9 +574,21 @@ SYM_DATA_LOCAL(early_gdt_descr_base, .quad INIT_PER_CPU_VAR(gdt_page))

.align 16
/* This must match the first entry in level2_kernel_pgt */
+
+#ifdef CONFIG_ADDRESS_SPACE_ISOLATION
+/* TODO: Find a way to mark .section for phys_base */
+/* Ideally, we want to map phys_base in .data..asi_non_sensitive. That doesn't
+ * seem to work properly. For now, we just make sure phys_base is in its own
+ * page. */
+ .align PAGE_SIZE
+#endif
SYM_DATA(phys_base, .quad 0x0)
EXPORT_SYMBOL(phys_base)

+#ifdef CONFIG_ADDRESS_SPACE_ISOLATION
+ .align PAGE_SIZE
+#endif
+
#include "../../x86/xen/xen-head.S"

__PAGE_ALIGNED_BSS
diff --git a/arch/x86/kernel/vmlinux.lds.S b/arch/x86/kernel/vmlinux.lds.S
index 3d6dc12d198f..2b3668291785 100644
--- a/arch/x86/kernel/vmlinux.lds.S
+++ b/arch/x86/kernel/vmlinux.lds.S
@@ -148,8 +148,8 @@ SECTIONS
} :text =0xcccc

/* End of text section, which should occupy whole number of pages */
- _etext = .;
. = ALIGN(PAGE_SIZE);
+ _etext = .;

X86_ALIGN_RODATA_BEGIN
RO_DATA(PAGE_SIZE)
diff --git a/arch/x86/mm/asi.c b/arch/x86/mm/asi.c
index 04628949e89d..7f2aa1823736 100644
--- a/arch/x86/mm/asi.c
+++ b/arch/x86/mm/asi.c
@@ -9,6 +9,7 @@

#include <asm/asi.h>
#include <asm/pgalloc.h>
+#include <asm/processor.h> /* struct irq_stack */
#include <asm/mmu_context.h>

#include "mm_internal.h"
@@ -17,6 +18,24 @@
#undef pr_fmt
#define pr_fmt(fmt) "ASI: " fmt

+#include <linux/extable.h>
+#include <asm-generic/sections.h>
+
+extern struct exception_table_entry __start___ex_table[];
+extern struct exception_table_entry __stop___ex_table[];
+
+extern const char __start_asi_nonsensitive[], __end_asi_nonsensitive[];
+extern const char __start_asi_nonsensitive_readmostly[],
+ __end_asi_nonsensitive_readmostly[];
+extern const char __per_cpu_asi_start[], __per_cpu_asi_end[];
+extern const char *__start___tracepoint_str[];
+extern const char *__stop___tracepoint_str[];
+extern const char *__start___tracepoints_ptrs[];
+extern const char *__stop___tracepoints_ptrs[];
+extern const char __vvar_page[];
+
+DECLARE_PER_CPU_PAGE_ALIGNED(struct irq_stack, irq_stack_backing_store);
+
static struct asi_class asi_class[ASI_MAX_NUM] __asi_not_sensitive;
static DEFINE_SPINLOCK(asi_class_lock __asi_not_sensitive);

@@ -412,6 +431,7 @@ void asi_unload_module(struct module* module)
static int __init asi_global_init(void)
{
uint i, n;
+ int err = 0;

if (!boot_cpu_has(X86_FEATURE_ASI))
return 0;
@@ -436,6 +456,68 @@ static int __init asi_global_init(void)

pcpu_map_asi_reserved_chunk();

+
+ /*
+ * TODO: We need to ensure that all the sections mapped below are
+ * actually page-aligned by the linker. For now, we temporarily just
+ * align the start/end addresses here, but that is incorrect as the
+ * rest of the page could potentially contain sensitive data.
+ */
+#define MAP_SECTION(start, end) \
+ pr_err("%s:%d mapping 0x%lx --> 0x%lx", \
+ __FUNCTION__, __LINE__, start, end); \
+ err = asi_map(ASI_GLOBAL_NONSENSITIVE, \
+ (void*)((unsigned long)(start) & PAGE_MASK),\
+ PAGE_ALIGN((unsigned long)(end)) - \
+ ((unsigned long)(start) & PAGE_MASK)); \
+ BUG_ON(err);
+
+#define MAP_SECTION_PERCPU(start, size) \
+ pr_err("%s:%d mapping PERCPU 0x%lx --> 0x%lx", \
+ __FUNCTION__, __LINE__, start, (unsigned long)start+size); \
+ err = asi_map_percpu(ASI_GLOBAL_NONSENSITIVE, \
+ (void*)((unsigned long)(start) & PAGE_MASK), \
+ PAGE_ALIGN((unsigned long)(size))); \
+ BUG_ON(err);
+
+ MAP_SECTION(_stext, _etext);
+ MAP_SECTION(__init_begin, __init_end);
+ MAP_SECTION(__start_rodata, __end_rodata);
+ MAP_SECTION(__start_once, __end_once);
+ MAP_SECTION(__start___ex_table, __stop___ex_table);
+ MAP_SECTION(__start_asi_nonsensitive, __end_asi_nonsensitive);
+ MAP_SECTION(__start_asi_nonsensitive_readmostly,
+ __end_asi_nonsensitive_readmostly);
+ MAP_SECTION(__vvar_page, __vvar_page + PAGE_SIZE);
+ MAP_SECTION(APIC_BASE, APIC_BASE + PAGE_SIZE);
+ MAP_SECTION(&phys_base, &phys_base + PAGE_SIZE);
+
+ /* TODO: add a build flag to enable/disable mapping only when
+ * instrumentation is used */
+ MAP_SECTION(__start___tracepoints_ptrs, __stop___tracepoints_ptrs);
+ MAP_SECTION(__start___tracepoint_str, __stop___tracepoint_str);
+
+ MAP_SECTION_PERCPU((void*)__per_cpu_asi_start,
+ __per_cpu_asi_end - __per_cpu_asi_start);
+
+ MAP_SECTION_PERCPU(&irq_stack_backing_store,
+ sizeof(irq_stack_backing_store));
+
+ /* We have to map the stack canary into ASI. This is far from ideal, as
+ * attackers can use L1TF to steal the canary value, and then perhaps
+ * mount some other attack including a buffer overflow. This is a price
+ * we must pay to use ASI.
+ */
+ MAP_SECTION_PERCPU(&fixed_percpu_data, PAGE_SIZE);
+
+#define CLONE_INIT_PGD(addr) \
+ asi_clone_pgd(asi_global_nonsensitive_pgd, init_mm.pgd, addr);
+
+ CLONE_INIT_PGD(CPU_ENTRY_AREA_BASE);
+#ifdef CONFIG_X86_ESPFIX64
+ CLONE_INIT_PGD(ESPFIX_BASE_ADDR);
+#endif
+
return 0;
}
subsys_initcall(asi_global_init)
diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h
index 0a931aedc285..7152ce3613f5 100644
--- a/include/asm-generic/vmlinux.lds.h
+++ b/include/asm-generic/vmlinux.lds.h
@@ -235,8 +235,10 @@
#define TRACE_PRINTKS() __start___trace_bprintk_fmt = .; \
KEEP(*(__trace_printk_fmt)) /* Trace_printk fmt' pointer */ \
__stop___trace_bprintk_fmt = .;
-#define TRACEPOINT_STR() __start___tracepoint_str = .; \
+#define TRACEPOINT_STR() . = ALIGN(PAGE_SIZE); \
+ __start___tracepoint_str = .; \
KEEP(*(__tracepoint_str)) /* Trace_printk fmt' pointer */ \
+ . = ALIGN(PAGE_SIZE); \
__stop___tracepoint_str = .;
#else
#define TRACE_PRINTKS()
@@ -348,8 +350,10 @@
MEM_KEEP(init.data*) \
MEM_KEEP(exit.data*) \
*(.data.unlikely) \
+ . = ALIGN(PAGE_SIZE); \
__start_once = .; \
*(.data.once) \
+ . = ALIGN(PAGE_SIZE); \
__end_once = .; \
STRUCT_ALIGN(); \
*(__tracepoints) \
@@ -453,9 +457,10 @@
*(.rodata) *(.rodata.*) \
SCHED_DATA \
RO_AFTER_INIT_DATA /* Read only after init */ \
- . = ALIGN(8); \
+ . = ALIGN(PAGE_SIZE); \
__start___tracepoints_ptrs = .; \
KEEP(*(__tracepoints_ptrs)) /* Tracepoints: pointer array */ \
+ . = ALIGN(PAGE_SIZE); \
__stop___tracepoints_ptrs = .; \
*(__tracepoints_strings)/* Tracepoints: strings */ \
} \
@@ -671,11 +676,13 @@
*/
#define EXCEPTION_TABLE(align) \
. = ALIGN(align); \
+ . = ALIGN(PAGE_SIZE); \
__ex_table : AT(ADDR(__ex_table) - LOAD_OFFSET) { \
__start___ex_table = .; \
KEEP(*(__ex_table)) \
+ . = ALIGN(PAGE_SIZE); \
__stop___ex_table = .; \
- }
+ } \

/*
* .BTF
--
2.35.1.473.g83b2b277ed-goog

2022-02-23 10:05:43

by Junaid Shahid

[permalink] [raw]
Subject: [RFC PATCH 33/47] kvm: asi: Map guest memory into restricted ASI address space

A module parameter, treat_all_userspace_as_nonsensitive, is added
which, if set, maps the entire userspace of the process running the VM
into the ASI restricted address space.

If the flag is not set (the default), then only the userspace memory
that is mapped into the VM's address space is mapped into the ASI
restricted address space.
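
As a usage note (assuming KVM is built as a module and the parameter keeps
this name), the coarser mode has to be selected at load time, since the
parameter is registered read-only (0444):

    modprobe kvm treat_all_userspace_as_nonsensitive=1

or, for a built-in KVM, kvm.treat_all_userspace_as_nonsensitive=1 on the
kernel command line.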

Signed-off-by: Junaid Shahid <[email protected]>


---
arch/x86/include/asm/kvm_host.h | 2 ++
arch/x86/kvm/mmu.h | 6 ++++
arch/x86/kvm/mmu/mmu.c | 54 +++++++++++++++++++++++++++++++++
arch/x86/kvm/mmu/paging_tmpl.h | 14 +++++++++
arch/x86/kvm/x86.c | 19 +++++++++++-
include/linux/kvm_host.h | 3 ++
virt/kvm/kvm_main.c | 7 +++++
7 files changed, 104 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 98cbd6447e3e..e63a2f244d7b 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -681,6 +681,8 @@ struct kvm_vcpu_arch {
struct kvm_mmu_memory_cache mmu_gfn_array_cache;
struct kvm_mmu_memory_cache mmu_page_header_cache;

+ struct asi_pgtbl_pool asi_pgtbl_pool;
+
/*
* QEMU userspace and the guest each have their own FPU state.
* In vcpu_run, we switch between the user and guest FPU contexts.
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 9ae6168d381e..60b84331007d 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -49,6 +49,12 @@

#define KVM_MMU_CR0_ROLE_BITS (X86_CR0_PG | X86_CR0_WP)

+#ifdef CONFIG_ADDRESS_SPACE_ISOLATION
+extern bool treat_all_userspace_as_nonsensitive;
+#else
+#define treat_all_userspace_as_nonsensitive true
+#endif
+
static __always_inline u64 rsvd_bits(int s, int e)
{
BUILD_BUG_ON(__builtin_constant_p(e) && __builtin_constant_p(s) && e < s);
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index fcdf3f8bb59a..485c0ba3ce8b 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -91,6 +91,11 @@ __MODULE_PARM_TYPE(nx_huge_pages_recovery_period_ms, "uint");
static bool __read_mostly force_flush_and_sync_on_reuse;
module_param_named(flush_on_reuse, force_flush_and_sync_on_reuse, bool, 0644);

+#ifdef CONFIG_ADDRESS_SPACE_ISOLATION
+bool __ro_after_init treat_all_userspace_as_nonsensitive;
+module_param(treat_all_userspace_as_nonsensitive, bool, 0444);
+#endif
+
/*
* When setting this variable to true it enables Two-Dimensional-Paging
* where the hardware walks 2 page tables:
@@ -2757,6 +2762,21 @@ static int mmu_set_spte(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot,
return ret;
}

+static void asi_map_gfn_range(struct kvm_vcpu *vcpu,
+ struct kvm_memory_slot *slot,
+ gfn_t gfn, size_t npages)
+{
+ int err;
+ size_t hva = __gfn_to_hva_memslot(slot, gfn);
+
+ err = asi_map_user(vcpu->kvm->asi, (void *)hva, PAGE_SIZE * npages,
+ &vcpu->arch.asi_pgtbl_pool, slot->userspace_addr,
+ slot->userspace_addr + slot->npages * PAGE_SIZE);
+ if (err)
+ kvm_err("asi_map_user for %lx-%lx failed with code %d", hva,
+ hva + PAGE_SIZE * npages, err);
+}
+
static int direct_pte_prefetch_many(struct kvm_vcpu *vcpu,
struct kvm_mmu_page *sp,
u64 *start, u64 *end)
@@ -2776,6 +2796,9 @@ static int direct_pte_prefetch_many(struct kvm_vcpu *vcpu,
if (ret <= 0)
return -1;

+ if (!treat_all_userspace_as_nonsensitive)
+ asi_map_gfn_range(vcpu, slot, gfn, ret);
+
for (i = 0; i < ret; i++, gfn++, start++) {
mmu_set_spte(vcpu, slot, start, access, gfn,
page_to_pfn(pages[i]), NULL);
@@ -3980,6 +4003,15 @@ static bool kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
return true;
}

+static void vcpu_fill_asi_pgtbl_pool(struct kvm_vcpu *vcpu)
+{
+ int err = asi_fill_pgtbl_pool(&vcpu->arch.asi_pgtbl_pool,
+ CONFIG_PGTABLE_LEVELS - 1, GFP_KERNEL);
+
+ if (err)
+ kvm_err("asi_fill_pgtbl_pool failed with code %d", err);
+}
+
/*
* Returns true if the page fault is stale and needs to be retried, i.e. if the
* root was invalidated by a memslot update or a relevant mmu_notifier fired.
@@ -4013,6 +4045,7 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
bool is_tdp_mmu_fault = is_tdp_mmu(vcpu->arch.mmu);

unsigned long mmu_seq;
+ bool try_asi_map;
int r;

fault->gfn = fault->addr >> PAGE_SHIFT;
@@ -4038,6 +4071,12 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
if (handle_abnormal_pfn(vcpu, fault, ACC_ALL, &r))
return r;

+ try_asi_map = !treat_all_userspace_as_nonsensitive &&
+ !is_noslot_pfn(fault->pfn);
+
+ if (try_asi_map)
+ vcpu_fill_asi_pgtbl_pool(vcpu);
+
r = RET_PF_RETRY;

if (is_tdp_mmu_fault)
@@ -4052,6 +4091,9 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
if (r)
goto out_unlock;

+ if (try_asi_map)
+ asi_map_gfn_range(vcpu, fault->slot, fault->gfn, 1);
+
if (is_tdp_mmu_fault)
r = kvm_tdp_mmu_map(vcpu, fault);
else
@@ -5584,6 +5626,8 @@ int kvm_mmu_create(struct kvm_vcpu *vcpu)

vcpu->arch.nested_mmu.translate_gpa = translate_nested_gpa;

+ asi_init_pgtbl_pool(&vcpu->arch.asi_pgtbl_pool);
+
ret = __kvm_mmu_create(vcpu, &vcpu->arch.guest_mmu);
if (ret)
return ret;
@@ -5713,6 +5757,15 @@ static void kvm_mmu_invalidate_zap_pages_in_memslot(struct kvm *kvm,
struct kvm_memory_slot *slot,
struct kvm_page_track_notifier_node *node)
{
+ /*
+ * Currently, we just zap the entire address range, instead of only the
+ * memslot. So we also just asi_unmap the entire userspace. But in the
+ * future, if we zap only the range belonging to the memslot, then we
+ * should also asi_unmap only that range.
+ */
+ if (!treat_all_userspace_as_nonsensitive)
+ asi_unmap_user(kvm->asi, 0, TASK_SIZE_MAX);
+
kvm_mmu_zap_all_fast(kvm);
}

@@ -6194,6 +6247,7 @@ void kvm_mmu_destroy(struct kvm_vcpu *vcpu)
free_mmu_pages(&vcpu->arch.root_mmu);
free_mmu_pages(&vcpu->arch.guest_mmu);
mmu_free_memory_caches(vcpu);
+ asi_clear_pgtbl_pool(&vcpu->arch.asi_pgtbl_pool);
}

void kvm_mmu_module_exit(void)
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index 708a5d297fe1..193317ad60a4 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -584,6 +584,9 @@ FNAME(prefetch_gpte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
if (is_error_pfn(pfn))
return false;

+ if (!treat_all_userspace_as_nonsensitive)
+ asi_map_gfn_range(vcpu, slot, gfn, 1);
+
mmu_set_spte(vcpu, slot, spte, pte_access, gfn, pfn, NULL);
kvm_release_pfn_clean(pfn);
return true;
@@ -836,6 +839,7 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
int r;
unsigned long mmu_seq;
bool is_self_change_mapping;
+ bool try_asi_map;

pgprintk("%s: addr %lx err %x\n", __func__, fault->addr, fault->error_code);
WARN_ON_ONCE(fault->is_tdp);
@@ -890,6 +894,12 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
if (handle_abnormal_pfn(vcpu, fault, walker.pte_access, &r))
return r;

+ try_asi_map = !treat_all_userspace_as_nonsensitive &&
+ !is_noslot_pfn(fault->pfn);
+
+ if (try_asi_map)
+ vcpu_fill_asi_pgtbl_pool(vcpu);
+
/*
* Do not change pte_access if the pfn is a mmio page, otherwise
* we will cache the incorrect access into mmio spte.
@@ -919,6 +929,10 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
r = make_mmu_pages_available(vcpu);
if (r)
goto out_unlock;
+
+ if (try_asi_map)
+ asi_map_gfn_range(vcpu, fault->slot, walker.gfn, 1);
+
r = FNAME(fetch)(vcpu, fault, &walker);
kvm_mmu_audit(vcpu, AUDIT_POST_PAGE_FAULT);

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index dd07f677d084..d0df14deae80 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8722,7 +8722,10 @@ int kvm_arch_init(void *opaque)
goto out_free_percpu;

if (ops->runtime_ops->flush_sensitive_cpu_state) {
- r = asi_register_class("KVM", ASI_MAP_STANDARD_NONSENSITIVE,
+ r = asi_register_class("KVM",
+ ASI_MAP_STANDARD_NONSENSITIVE |
+ (treat_all_userspace_as_nonsensitive ?
+ ASI_MAP_ALL_USERSPACE : 0),
&kvm_asi_hooks);
if (r < 0)
goto out_mmu_exit;
@@ -9675,6 +9678,17 @@ void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
apic_address = gfn_to_hva(kvm, APIC_DEFAULT_PHYS_BASE >> PAGE_SHIFT);
if (start <= apic_address && apic_address < end)
kvm_make_all_cpus_request(kvm, KVM_REQ_APIC_PAGE_RELOAD);
+
+ if (!treat_all_userspace_as_nonsensitive)
+ asi_unmap_user(kvm->asi, (void *)start, end - start);
+}
+
+void kvm_arch_mmu_notifier_invalidate_range_start(struct kvm *kvm,
+ unsigned long start,
+ unsigned long end)
+{
+ if (!treat_all_userspace_as_nonsensitive)
+ asi_unmap_user(kvm->asi, (void *)start, end - start);
}

void kvm_vcpu_reload_apic_access_page(struct kvm_vcpu *vcpu)
@@ -11874,6 +11888,9 @@ void kvm_arch_commit_memory_region(struct kvm *kvm,

void kvm_arch_flush_shadow_all(struct kvm *kvm)
{
+ if (!treat_all_userspace_as_nonsensitive)
+ asi_unmap_user(kvm->asi, 0, TASK_SIZE_MAX);
+
kvm_mmu_zap_all(kvm);
}

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 9dd63ed21f75..f31f7442eced 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1819,6 +1819,9 @@ static inline long kvm_arch_vcpu_async_ioctl(struct file *filp,

void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
unsigned long start, unsigned long end);
+void kvm_arch_mmu_notifier_invalidate_range_start(struct kvm *kvm,
+ unsigned long start,
+ unsigned long end);

#ifdef CONFIG_HAVE_KVM_VCPU_RUN_PID_CHANGE
int kvm_arch_vcpu_run_pid_change(struct kvm_vcpu *vcpu);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 72c4e6b39389..e8e9c8588908 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -162,6 +162,12 @@ __weak void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
{
}

+__weak void kvm_arch_mmu_notifier_invalidate_range_start(struct kvm *kvm,
+ unsigned long start,
+ unsigned long end)
+{
+}
+
bool kvm_is_zone_device_pfn(kvm_pfn_t pfn)
{
/*
@@ -685,6 +691,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
spin_unlock(&kvm->mn_invalidate_lock);

__kvm_handle_hva_range(kvm, &hva_range);
+ kvm_arch_mmu_notifier_invalidate_range_start(kvm, range->start, range->end);

return 0;
}
--
2.35.1.473.g83b2b277ed-goog

2022-02-23 10:06:09

by Junaid Shahid

[permalink] [raw]
Subject: [RFC PATCH 04/47] mm: asi: ASI support in interrupts/exceptions

Add support for potentially switching address spaces from within
interrupts/exceptions/NMIs etc. An interrupt does not automatically
switch to the unrestricted address space. It can switch if needed to
access some memory not available in the restricted address space, using
the normal asi_exit call.

On return from the outermost interrupt, if the target address space was
the restricted address space (e.g. we were in the critical code path
between ASI Enter and VM Enter), the restricted address space will be
automatically restored. Otherwise, execution will continue in the
unrestricted address space until the next explicit ASI Enter.

In order to keep track of when to restore the restricted address space,
an interrupt/exception nesting depth counter is maintained per-task.
An alternative implementation that does not need this counter is also
possible, but the counter has the additional nice-to-have benefit of
letting us detect whether we are currently executing inside an
exception context, which will be useful in a later patch.
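
To make the intended usage concrete, here is a minimal sketch (not from
this patch) of an interrupt handler that needs memory outside the
restricted address space; the helper names are hypothetical:

    static void example_irq_handler(void)
    {
            /* Memory mapped in the restricted address space works as-is. */
            handle_nonsensitive_state();

            /*
             * Anything not mapped there requires an explicit switch to the
             * full kernel address space.
             */
            asi_exit();
            handle_sensitive_state();

            /*
             * Nothing to undo here: the outermost asi_intr_exit() re-enters
             * the restricted address space (via __asi_enter()) if that was
             * the target before the interrupt.
             */
    }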

Signed-off-by: Junaid Shahid <[email protected]>


---
arch/x86/include/asm/asi.h | 35 ++++++++++++++++++++++++++++++++
arch/x86/include/asm/idtentry.h | 25 +++++++++++++++++++++--
arch/x86/include/asm/processor.h | 5 +++++
arch/x86/kernel/process.c | 2 ++
arch/x86/kernel/traps.c | 2 ++
arch/x86/mm/asi.c | 3 ++-
kernel/entry/common.c | 6 ++++++
7 files changed, 75 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/asi.h b/arch/x86/include/asm/asi.h
index 0a4af23ed0eb..7702332c62e8 100644
--- a/arch/x86/include/asm/asi.h
+++ b/arch/x86/include/asm/asi.h
@@ -4,6 +4,8 @@

#include <asm-generic/asi.h>

+#include <linux/sched.h>
+
#include <asm/pgtable_types.h>
#include <asm/percpu.h>
#include <asm/cpufeature.h>
@@ -51,6 +53,11 @@ void asi_destroy(struct asi *asi);
void asi_enter(struct asi *asi);
void asi_exit(void);

+static inline void asi_init_thread_state(struct thread_struct *thread)
+{
+ thread->intr_nest_depth = 0;
+}
+
static inline void asi_set_target_unrestricted(void)
{
if (static_cpu_has(X86_FEATURE_ASI)) {
@@ -85,6 +92,34 @@ static inline bool asi_is_target_unrestricted(void)

#define static_asi_enabled() cpu_feature_enabled(X86_FEATURE_ASI)

+static inline void asi_intr_enter(void)
+{
+ if (static_cpu_has(X86_FEATURE_ASI)) {
+ current->thread.intr_nest_depth++;
+ barrier();
+ }
+}
+
+static inline void asi_intr_exit(void)
+{
+ void __asi_enter(void);
+
+ if (static_cpu_has(X86_FEATURE_ASI)) {
+ barrier();
+
+ if (--current->thread.intr_nest_depth == 0)
+ __asi_enter();
+ }
+}
+
+#else /* CONFIG_ADDRESS_SPACE_ISOLATION */
+
+static inline void asi_intr_enter(void) { }
+
+static inline void asi_intr_exit(void) { }
+
+static inline void asi_init_thread_state(struct thread_struct *thread) { }
+
#endif /* CONFIG_ADDRESS_SPACE_ISOLATION */

#endif
diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index 1345088e9902..ea5cdc90403d 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -10,6 +10,7 @@
#include <linux/hardirq.h>

#include <asm/irq_stack.h>
+#include <asm/asi.h>

/**
* DECLARE_IDTENTRY - Declare functions for simple IDT entry points
@@ -133,7 +134,16 @@ static __always_inline void __##func(struct pt_regs *regs, \
* is required before the enter/exit() helpers are invoked.
*/
#define DEFINE_IDTENTRY_RAW(func) \
-__visible noinstr void func(struct pt_regs *regs)
+static __always_inline void __##func(struct pt_regs *regs); \
+ \
+__visible noinstr void func(struct pt_regs *regs) \
+{ \
+ asi_intr_enter(); \
+ __##func (regs); \
+ asi_intr_exit(); \
+} \
+ \
+static __always_inline void __##func(struct pt_regs *regs)

/**
* DECLARE_IDTENTRY_RAW_ERRORCODE - Declare functions for raw IDT entry points
@@ -161,7 +171,18 @@ __visible noinstr void func(struct pt_regs *regs)
* is required before the enter/exit() helpers are invoked.
*/
#define DEFINE_IDTENTRY_RAW_ERRORCODE(func) \
-__visible noinstr void func(struct pt_regs *regs, unsigned long error_code)
+static __always_inline void __##func(struct pt_regs *regs, \
+ unsigned long error_code); \
+ \
+__visible noinstr void func(struct pt_regs *regs, unsigned long error_code)\
+{ \
+ asi_intr_enter(); \
+ __##func (regs, error_code); \
+ asi_intr_exit(); \
+} \
+ \
+static __always_inline void __##func(struct pt_regs *regs, \
+ unsigned long error_code)

/**
* DECLARE_IDTENTRY_IRQ - Declare functions for device interrupt IDT entry
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 355d38c0cf60..20116efd2756 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -519,6 +519,11 @@ struct thread_struct {
unsigned int iopl_warn:1;
unsigned int sig_on_uaccess_err:1;

+#ifdef CONFIG_ADDRESS_SPACE_ISOLATION
+ /* The nesting depth of exceptions/interrupts */
+ int intr_nest_depth;
+#endif
+
/*
* Protection Keys Register for Userspace. Loaded immediately on
* context switch. Store it in thread_struct to avoid a lookup in
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 04143a653a8a..c8d4a00a4de7 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -90,6 +90,8 @@ int arch_dup_task_struct(struct task_struct *dst, struct task_struct *src)
#ifdef CONFIG_VM86
dst->thread.vm86 = NULL;
#endif
+ asi_init_thread_state(&dst->thread);
+
/* Drop the copied pointer to current's fpstate */
dst->thread.fpu.fpstate = NULL;

diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index c9d566dcf89a..acf675ddda96 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -61,6 +61,7 @@
#include <asm/insn.h>
#include <asm/insn-eval.h>
#include <asm/vdso.h>
+#include <asm/asi.h>

#ifdef CONFIG_X86_64
#include <asm/x86_init.h>
@@ -413,6 +414,7 @@ DEFINE_IDTENTRY_DF(exc_double_fault)
}
#endif

+ asi_exit();
irqentry_nmi_enter(regs);
instrumentation_begin();
notify_die(DIE_TRAP, str, regs, error_code, X86_TRAP_DF, SIGSEGV);
diff --git a/arch/x86/mm/asi.c b/arch/x86/mm/asi.c
index d274c86f89b7..2453124f221d 100644
--- a/arch/x86/mm/asi.c
+++ b/arch/x86/mm/asi.c
@@ -107,12 +107,13 @@ void asi_destroy(struct asi *asi)
}
EXPORT_SYMBOL_GPL(asi_destroy);

-static void __asi_enter(void)
+void __asi_enter(void)
{
u64 asi_cr3;
struct asi *target = this_cpu_read(asi_cpu_state.target_asi);

VM_BUG_ON(preemptible());
+ VM_BUG_ON(current->thread.intr_nest_depth != 0);

if (!target || target == this_cpu_read(asi_cpu_state.curr_asi))
return;
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index d5a61d565ad5..9064253085c7 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -9,6 +9,8 @@

#include "common.h"

+#include <asm/asi.h>
+
#define CREATE_TRACE_POINTS
#include <trace/events/syscalls.h>

@@ -321,6 +323,8 @@ noinstr irqentry_state_t irqentry_enter(struct pt_regs *regs)
.exit_rcu = false,
};

+ asi_intr_enter();
+
if (user_mode(regs)) {
irqentry_enter_from_user_mode(regs);
return ret;
@@ -416,6 +420,7 @@ noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state)
instrumentation_end();
rcu_irq_exit();
lockdep_hardirqs_on(CALLER_ADDR0);
+ asi_intr_exit();
return;
}

@@ -438,6 +443,7 @@ noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state)
if (state.exit_rcu)
rcu_irq_exit();
}
+ asi_intr_exit();
}

irqentry_state_t noinstr irqentry_nmi_enter(struct pt_regs *regs)
--
2.35.1.473.g83b2b277ed-goog

2022-02-23 11:33:06

by Junaid Shahid

[permalink] [raw]
Subject: [RFC PATCH 26/47] mm: asi: Use separate PCIDs for restricted address spaces

Each restricted address space is assigned a separate PCID. Since
currently only one ASI instance per class exists for a given process,
the PCID is simply derived from the class index.

This commit only sets the appropriate PCID when switching CR3, but does
not set the NOFLUSH bit. That will be done by later patches.
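
A worked example of the resulting PCID layout (numbers are illustrative
only, since the exact bit counts depend on the configuration):

    /*
     * Suppose CR3_AVAIL_PCID_BITS works out to 10, kern_pcid(asid) == 2 for
     * the current mm, and the ASI instance has pcid_index == 1. Then:
     *
     *     asi_pcid(asi, asid) = kern_pcid(asid) | (pcid_index << 10)
     *                         = 0x002 | 0x400
     *                         = 0x402
     *
     * i.e. the class index occupies PCID bits above those used for the
     * regular per-mm ASIDs, so the restricted and unrestricted address
     * spaces of the same mm get distinct TLB tags.
     */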

Signed-off-by: Junaid Shahid <[email protected]>


---
arch/x86/include/asm/asi.h | 3 ++-
arch/x86/include/asm/tlbflush.h | 3 +++
arch/x86/mm/asi.c | 6 +++--
arch/x86/mm/tlb.c | 45 ++++++++++++++++++++++++++++++---
4 files changed, 50 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/asi.h b/arch/x86/include/asm/asi.h
index 062ccac07fd9..aaa0d0bdbf59 100644
--- a/arch/x86/include/asm/asi.h
+++ b/arch/x86/include/asm/asi.h
@@ -40,7 +40,8 @@ struct asi {
pgd_t *pgd;
struct asi_class *class;
struct mm_struct *mm;
- int64_t asi_ref_count;
+ u16 pcid_index;
+ int64_t asi_ref_count;
};

DECLARE_PER_CPU_ALIGNED(struct asi_state, asi_cpu_state);
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 3c43ad46c14a..f9ec5e67e361 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -260,6 +260,9 @@ static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch *batch,
extern void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch);

unsigned long build_cr3(pgd_t *pgd, u16 asid);
+unsigned long build_cr3_pcid(pgd_t *pgd, u16 pcid, bool noflush);
+
+u16 asi_pcid(struct asi *asi, u16 asid);

#endif /* !MODULE */

diff --git a/arch/x86/mm/asi.c b/arch/x86/mm/asi.c
index 6b9a0f5ab391..dbfea3dc4bb1 100644
--- a/arch/x86/mm/asi.c
+++ b/arch/x86/mm/asi.c
@@ -335,6 +335,7 @@ int asi_init(struct mm_struct *mm, int asi_index, struct asi **out_asi)

asi->class = &asi_class[asi_index];
asi->mm = mm;
+ asi->pcid_index = asi_index;

if (asi->class->flags & ASI_MAP_STANDARD_NONSENSITIVE) {
uint i;
@@ -386,6 +387,7 @@ EXPORT_SYMBOL_GPL(asi_destroy);
void __asi_enter(void)
{
u64 asi_cr3;
+ u16 pcid;
struct asi *target = this_cpu_read(asi_cpu_state.target_asi);

VM_BUG_ON(preemptible());
@@ -399,8 +401,8 @@ void __asi_enter(void)

this_cpu_write(asi_cpu_state.curr_asi, target);

- asi_cr3 = build_cr3(target->pgd,
- this_cpu_read(cpu_tlbstate.loaded_mm_asid));
+ pcid = asi_pcid(target, this_cpu_read(cpu_tlbstate.loaded_mm_asid));
+ asi_cr3 = build_cr3_pcid(target->pgd, pcid, false);
write_cr3(asi_cr3);

if (target->class->ops.post_asi_enter)
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 628f1cd904ac..312b9c185a55 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -97,7 +97,12 @@
# define PTI_CONSUMED_PCID_BITS 0
#endif

-#define CR3_AVAIL_PCID_BITS (X86_CR3_PCID_BITS - PTI_CONSUMED_PCID_BITS)
+#define ASI_CONSUMED_PCID_BITS ASI_MAX_NUM_ORDER
+#define ASI_PCID_BITS_SHIFT CR3_AVAIL_PCID_BITS
+#define CR3_AVAIL_PCID_BITS (X86_CR3_PCID_BITS - PTI_CONSUMED_PCID_BITS - \
+ ASI_CONSUMED_PCID_BITS)
+
+static_assert(TLB_NR_DYN_ASIDS < BIT(CR3_AVAIL_PCID_BITS));

/*
* ASIDs are zero-based: 0->MAX_AVAIL_ASID are valid. -1 below to account
@@ -154,6 +159,34 @@ static inline u16 user_pcid(u16 asid)
return ret;
}

+#ifdef CONFIG_ADDRESS_SPACE_ISOLATION
+
+u16 asi_pcid(struct asi *asi, u16 asid)
+{
+ return kern_pcid(asid) | (asi->pcid_index << ASI_PCID_BITS_SHIFT);
+}
+
+#else /* CONFIG_ADDRESS_SPACE_ISOLATION */
+
+u16 asi_pcid(struct asi *asi, u16 asid)
+{
+ return kern_pcid(asid);
+}
+
+#endif /* CONFIG_ADDRESS_SPACE_ISOLATION */
+
+unsigned long build_cr3_pcid(pgd_t *pgd, u16 pcid, bool noflush)
+{
+ u64 noflush_bit = 0;
+
+ if (!static_cpu_has(X86_FEATURE_PCID))
+ pcid = 0;
+ else if (noflush)
+ noflush_bit = CR3_NOFLUSH;
+
+ return __sme_pa(pgd) | pcid | noflush_bit;
+}
+
inline unsigned long build_cr3(pgd_t *pgd, u16 asid)
{
if (static_cpu_has(X86_FEATURE_PCID)) {
@@ -1078,13 +1111,17 @@ unsigned long __get_current_cr3_fast(void)
pgd_t *pgd;
u16 asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid);
struct asi *asi = asi_get_current();
+ u16 pcid;

- if (asi)
+ if (asi) {
pgd = asi_pgd(asi);
- else
+ pcid = asi_pcid(asi, asid);
+ } else {
pgd = this_cpu_read(cpu_tlbstate.loaded_mm)->pgd;
+ pcid = kern_pcid(asid);
+ }

- cr3 = build_cr3(pgd, asid);
+ cr3 = build_cr3_pcid(pgd, pcid, false);

/* For now, be very restrictive about when this can be called. */
VM_WARN_ON(in_nmi() || preemptible());
--
2.35.1.473.g83b2b277ed-goog

2022-02-23 11:35:26

by Junaid Shahid

[permalink] [raw]
Subject: [RFC PATCH 05/47] mm: asi: Make __get_current_cr3_fast() ASI-aware

When ASI is active, __get_current_cr3_fast() adjusts the returned CR3
value to reflect the actual ASI CR3.

Signed-off-by: Junaid Shahid <[email protected]>


---
arch/x86/include/asm/asi.h | 7 +++++++
arch/x86/mm/tlb.c | 20 ++++++++++++++++++--
2 files changed, 25 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/asi.h b/arch/x86/include/asm/asi.h
index 7702332c62e8..95557211dabd 100644
--- a/arch/x86/include/asm/asi.h
+++ b/arch/x86/include/asm/asi.h
@@ -112,6 +112,11 @@ static inline void asi_intr_exit(void)
}
}

+static inline pgd_t *asi_pgd(struct asi *asi)
+{
+ return asi->pgd;
+}
+
#else /* CONFIG_ADDRESS_SPACE_ISOLATION */

static inline void asi_intr_enter(void) { }
@@ -120,6 +125,8 @@ static inline void asi_intr_exit(void) { }

static inline void asi_init_thread_state(struct thread_struct *thread) { }

+static inline pgd_t *asi_pgd(struct asi *asi) { return NULL; }
+
#endif /* CONFIG_ADDRESS_SPACE_ISOLATION */

#endif
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 88d9298720dc..25bee959d1d3 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -17,6 +17,7 @@
#include <asm/cacheflush.h>
#include <asm/apic.h>
#include <asm/perf_event.h>
+#include <asm/asi.h>

#include "mm_internal.h"

@@ -1073,12 +1074,27 @@ void flush_tlb_kernel_range(unsigned long start, unsigned long end)
*/
unsigned long __get_current_cr3_fast(void)
{
- unsigned long cr3 = build_cr3(this_cpu_read(cpu_tlbstate.loaded_mm)->pgd,
- this_cpu_read(cpu_tlbstate.loaded_mm_asid));
+ unsigned long cr3;
+ pgd_t *pgd;
+ u16 asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid);
+ struct asi *asi = asi_get_current();
+
+ if (asi)
+ pgd = asi_pgd(asi);
+ else
+ pgd = this_cpu_read(cpu_tlbstate.loaded_mm)->pgd;
+
+ cr3 = build_cr3(pgd, asid);

/* For now, be very restrictive about when this can be called. */
VM_WARN_ON(in_nmi() || preemptible());

+ /*
+ * CR3 is unstable if the target ASI is unrestricted
+ * and a restricted ASI is currently loaded.
+ */
+ VM_WARN_ON_ONCE(asi && asi_is_target_unrestricted());
+
VM_BUG_ON(cr3 != __read_cr3());
return cr3;
}
--
2.35.1.473.g83b2b277ed-goog

2022-02-23 11:37:53

by Junaid Shahid

[permalink] [raw]
Subject: [RFC PATCH 10/47] mm: asi: Support for global non-sensitive direct map allocations

A new GFP flag is added to specify that an allocation should be
considered globally non-sensitive. The pages will be mapped into the
ASI global non-sensitive pseudo-PGD, which is shared between all
standard ASI instances. A new page flag is also added so that when
these pages are freed, they can also be unmapped from the ASI page
tables.
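
For illustration, a globally non-sensitive page allocation would look
roughly like this (the call site is hypothetical; alloc_pages() funnels
into __alloc_pages(), where the new hooks run):

    /* The flag implies __GFP_ZERO, and the pages are mapped into the ASI
     * global non-sensitive address space before being returned.
     */
    struct page *page = alloc_pages(GFP_KERNEL | __GFP_GLOBAL_NONSENSITIVE, 0);

    if (page) {
            void *va = page_to_virt(page); /* usable while restricted */

            /* ... use va ... */

            __free_pages(page, 0); /* asi_unmap_freed_pages() unmaps it */
    }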

Signed-off-by: Junaid Shahid <[email protected]>


---
include/linux/gfp.h | 10 ++-
include/linux/mm_types.h | 5 ++
include/linux/page-flags.h | 9 ++
include/trace/events/mmflags.h | 12 ++-
mm/page_alloc.c | 145 ++++++++++++++++++++++++++++++++-
tools/perf/builtin-kmem.c | 1 +
6 files changed, 178 insertions(+), 4 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 8fcc38467af6..07a99a463a34 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -60,6 +60,11 @@ struct vm_area_struct;
#else
#define ___GFP_NOLOCKDEP 0
#endif
+#ifdef CONFIG_ADDRESS_SPACE_ISOLATION
+#define ___GFP_GLOBAL_NONSENSITIVE 0x4000000u
+#else
+#define ___GFP_GLOBAL_NONSENSITIVE 0
+#endif
/* If the above are modified, __GFP_BITS_SHIFT may need updating */

/*
@@ -248,8 +253,11 @@ struct vm_area_struct;
/* Disable lockdep for GFP context tracking */
#define __GFP_NOLOCKDEP ((__force gfp_t)___GFP_NOLOCKDEP)

+/* Allocate non-sensitive memory */
+#define __GFP_GLOBAL_NONSENSITIVE ((__force gfp_t)___GFP_GLOBAL_NONSENSITIVE)
+
/* Room for N __GFP_FOO bits */
-#define __GFP_BITS_SHIFT (25 + IS_ENABLED(CONFIG_LOCKDEP))
+#define __GFP_BITS_SHIFT 27
#define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))

/**
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 3de1afa57289..5b8028fcfe67 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -191,6 +191,11 @@ struct page {

/** @rcu_head: You can use this to free a page by RCU. */
struct rcu_head rcu_head;
+
+#ifdef CONFIG_ADDRESS_SPACE_ISOLATION
+ /* Links the pages_to_free_async list */
+ struct llist_node async_free_node;
+#endif
};

union { /* This union is 4 bytes in size. */
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index b90a17e9796d..a07434cc679c 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -140,6 +140,9 @@ enum pageflags {
#endif
#ifdef CONFIG_KASAN_HW_TAGS
PG_skip_kasan_poison,
+#endif
+#ifdef CONFIG_ADDRESS_SPACE_ISOLATION
+ PG_global_nonsensitive,
#endif
__NR_PAGEFLAGS,

@@ -542,6 +545,12 @@ TESTCLEARFLAG(Young, young, PF_ANY)
PAGEFLAG(Idle, idle, PF_ANY)
#endif

+#ifdef CONFIG_ADDRESS_SPACE_ISOLATION
+__PAGEFLAG(GlobalNonSensitive, global_nonsensitive, PF_ANY);
+#else
+__PAGEFLAG_FALSE(GlobalNonSensitive, global_nonsensitive);
+#endif
+
#ifdef CONFIG_KASAN_HW_TAGS
PAGEFLAG(SkipKASanPoison, skip_kasan_poison, PF_HEAD)
#else
diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index 116ed4d5d0f8..73a49197ef54 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -50,7 +50,8 @@
{(unsigned long)__GFP_DIRECT_RECLAIM, "__GFP_DIRECT_RECLAIM"},\
{(unsigned long)__GFP_KSWAPD_RECLAIM, "__GFP_KSWAPD_RECLAIM"},\
{(unsigned long)__GFP_ZEROTAGS, "__GFP_ZEROTAGS"}, \
- {(unsigned long)__GFP_SKIP_KASAN_POISON,"__GFP_SKIP_KASAN_POISON"}\
+ {(unsigned long)__GFP_SKIP_KASAN_POISON,"__GFP_SKIP_KASAN_POISON"},\
+ {(unsigned long)__GFP_GLOBAL_NONSENSITIVE, "__GFP_GLOBAL_NONSENSITIVE"}\

#define show_gfp_flags(flags) \
(flags) ? __print_flags(flags, "|", \
@@ -93,6 +94,12 @@
#define IF_HAVE_PG_SKIP_KASAN_POISON(flag,string)
#endif

+#ifdef CONFIG_ADDRESS_SPACE_ISOLATION
+#define IF_HAVE_ASI(flag, string) ,{1UL << flag, string}
+#else
+#define IF_HAVE_ASI(flag, string)
+#endif
+
#define __def_pageflag_names \
{1UL << PG_locked, "locked" }, \
{1UL << PG_waiters, "waiters" }, \
@@ -121,7 +128,8 @@ IF_HAVE_PG_HWPOISON(PG_hwpoison, "hwpoison" ) \
IF_HAVE_PG_IDLE(PG_young, "young" ) \
IF_HAVE_PG_IDLE(PG_idle, "idle" ) \
IF_HAVE_PG_ARCH_2(PG_arch_2, "arch_2" ) \
-IF_HAVE_PG_SKIP_KASAN_POISON(PG_skip_kasan_poison, "skip_kasan_poison")
+IF_HAVE_PG_SKIP_KASAN_POISON(PG_skip_kasan_poison, "skip_kasan_poison") \
+IF_HAVE_ASI(PG_global_nonsensitive, "global_nonsensitive")

#define show_page_flags(flags) \
(flags) ? __print_flags(flags, "|", \
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c5952749ad40..a4048fa1868a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -697,7 +697,7 @@ static inline bool pcp_allowed_order(unsigned int order)
return false;
}

-static inline void free_the_page(struct page *page, unsigned int order)
+static inline void __free_the_page(struct page *page, unsigned int order)
{
if (pcp_allowed_order(order)) /* Via pcp? */
free_unref_page(page, order);
@@ -705,6 +705,14 @@ static inline void free_the_page(struct page *page, unsigned int order)
__free_pages_ok(page, order, FPI_NONE);
}

+static bool asi_unmap_freed_pages(struct page *page, unsigned int order);
+
+static inline void free_the_page(struct page *page, unsigned int order)
+{
+ if (asi_unmap_freed_pages(page, order))
+ __free_the_page(page, order);
+}
+
/*
* Higher-order pages are called "compound pages". They are structured thusly:
*
@@ -5162,6 +5170,129 @@ static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order,
return true;
}

+#ifdef CONFIG_ADDRESS_SPACE_ISOLATION
+
+static DEFINE_PER_CPU(struct work_struct, async_free_work);
+static DEFINE_PER_CPU(struct llist_head, pages_to_free_async);
+static bool async_free_work_initialized;
+
+static void __free_the_page(struct page *page, unsigned int order);
+
+static void async_free_work_fn(struct work_struct *work)
+{
+ struct page *page, *tmp;
+ struct llist_node *pages_to_free;
+ void *va;
+ size_t len;
+ uint order;
+
+ pages_to_free = llist_del_all(this_cpu_ptr(&pages_to_free_async));
+
+ /* A later patch will do a more optimized TLB flush. */
+
+ llist_for_each_entry_safe(page, tmp, pages_to_free, async_free_node) {
+ va = page_to_virt(page);
+ order = page->private;
+ len = PAGE_SIZE * (1 << order);
+
+ asi_flush_tlb_range(ASI_GLOBAL_NONSENSITIVE, va, len);
+ __free_the_page(page, order);
+ }
+}
+
+static int __init asi_page_alloc_init(void)
+{
+ int cpu;
+
+ if (!static_asi_enabled())
+ return 0;
+
+ for_each_possible_cpu(cpu)
+ INIT_WORK(per_cpu_ptr(&async_free_work, cpu),
+ async_free_work_fn);
+
+ /*
+ * This function is called before SMP is initialized, so we can assume
+ * that this is the only running CPU at this point.
+ */
+
+ barrier();
+ async_free_work_initialized = true;
+ barrier();
+
+ if (!llist_empty(this_cpu_ptr(&pages_to_free_async)))
+ queue_work_on(smp_processor_id(), mm_percpu_wq,
+ this_cpu_ptr(&async_free_work));
+
+ return 0;
+}
+early_initcall(asi_page_alloc_init);
+
+static int asi_map_alloced_pages(struct page *page, uint order, gfp_t gfp_mask)
+{
+ uint i;
+
+ if (!static_asi_enabled())
+ return 0;
+
+ if (gfp_mask & __GFP_GLOBAL_NONSENSITIVE) {
+ for (i = 0; i < (1 << order); i++)
+ __SetPageGlobalNonSensitive(page + i);
+
+ return asi_map_gfp(ASI_GLOBAL_NONSENSITIVE, page_to_virt(page),
+ PAGE_SIZE * (1 << order), gfp_mask);
+ }
+
+ return 0;
+}
+
+static bool asi_unmap_freed_pages(struct page *page, unsigned int order)
+{
+ void *va;
+ size_t len;
+ bool async_flush_needed;
+
+ if (!static_asi_enabled())
+ return true;
+
+ if (!PageGlobalNonSensitive(page))
+ return true;
+
+ va = page_to_virt(page);
+ len = PAGE_SIZE * (1 << order);
+ async_flush_needed = irqs_disabled() || in_interrupt();
+
+ asi_unmap(ASI_GLOBAL_NONSENSITIVE, va, len, !async_flush_needed);
+
+ if (!async_flush_needed)
+ return true;
+
+ page->private = order;
+ llist_add(&page->async_free_node, this_cpu_ptr(&pages_to_free_async));
+
+ if (async_free_work_initialized)
+ queue_work_on(smp_processor_id(), mm_percpu_wq,
+ this_cpu_ptr(&async_free_work));
+
+ return false;
+}
+
+#else /* CONFIG_ADDRESS_SPACE_ISOLATION */
+
+static inline
+int asi_map_alloced_pages(struct page *pages, uint order, gfp_t gfp_mask)
+{
+ return 0;
+}
+
+static inline
+bool asi_unmap_freed_pages(struct page *page, unsigned int order)
+{
+ return true;
+}
+
+#endif
+
/*
* __alloc_pages_bulk - Allocate a number of order-0 pages to a list or array
* @gfp: GFP flags for the allocation
@@ -5345,6 +5476,9 @@ struct page *__alloc_pages(gfp_t gfp, unsigned int order, int preferred_nid,
return NULL;
}

+ if (static_asi_enabled() && (gfp & __GFP_GLOBAL_NONSENSITIVE))
+ gfp |= __GFP_ZERO;
+
gfp &= gfp_allowed_mask;
/*
* Apply scoped allocation constraints. This is mainly about GFP_NOFS
@@ -5388,6 +5522,15 @@ struct page *__alloc_pages(gfp_t gfp, unsigned int order, int preferred_nid,
page = NULL;
}

+ if (page) {
+ int err = asi_map_alloced_pages(page, order, gfp);
+
+ if (unlikely(err)) {
+ __free_pages(page, order);
+ page = NULL;
+ }
+ }
+
trace_mm_page_alloc(page, order, alloc_gfp, ac.migratetype);

return page;
diff --git a/tools/perf/builtin-kmem.c b/tools/perf/builtin-kmem.c
index da03a341c63c..5857953cd5c1 100644
--- a/tools/perf/builtin-kmem.c
+++ b/tools/perf/builtin-kmem.c
@@ -660,6 +660,7 @@ static const struct {
{ "__GFP_RECLAIM", "R" },
{ "__GFP_DIRECT_RECLAIM", "DR" },
{ "__GFP_KSWAPD_RECLAIM", "KR" },
+ { "__GFP_GLOBAL_NONSENSITIVE", "GNS" },
};

static size_t max_gfp_len;
--
2.35.1.473.g83b2b277ed-goog

2022-02-23 11:40:55

by Junaid Shahid

[permalink] [raw]
Subject: [RFC PATCH 20/47] mm: asi: Support for locally non-sensitive vmalloc allocations

A new flag, VM_LOCAL_NONSENSITIVE, is added to designate locally
non-sensitive vmalloc/vmap areas. When using the __vmalloc /
__vmalloc_node APIs, if the corresponding GFP flag is specified, the
VM flag is automatically added. When using the __vmalloc_node_range API,
either flag can be specified independently. The VM flag will only map
the vmalloc area as non-sensitive, while the GFP flag will only map the
underlying direct map area as non-sensitive.

When using the __vmalloc_node_range API, instead of VMALLOC_START/END,
VMALLOC_LOCAL_NONSENSITIVE_START/END should be used. This is the range
that will have different ASI page tables for each process, thereby
providing the local mapping.

A command line parameter vmalloc_local_nonsensitive_percent is added to
specify the approximate division between the per-process and global
vmalloc ranges. Note that regular/sensitive vmalloc/vmap allocations
are not restricted by this division and can go anywhere in the entire
vmalloc range. The division only applies to non-sensitive allocations.

Since no attempt is made to balance regular/sensitive allocations across
the division, it is possible that one of these ranges gets filled up
by regular allocations, leaving no room for the non-sensitive
allocations for which that range was designated. But since the vmalloc
range is fairly large, hopefully that will not be a problem in
practice. If that assumption turns out to be incorrect, we could
implement a more sophisticated scheme.
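
For illustration, a locally non-sensitive allocation made with the
range-aware API could look roughly like this (sizes and the call site are
hypothetical; __GFP_LOCAL_NONSENSITIVE comes from an earlier patch in this
series):

    /* Map both the vmalloc area and the underlying direct-map pages as
     * locally non-sensitive.
     */
    void *buf = __vmalloc_node_range(size, 1,
                            VMALLOC_LOCAL_NONSENSITIVE_START,
                            VMALLOC_LOCAL_NONSENSITIVE_END,
                            GFP_KERNEL | __GFP_LOCAL_NONSENSITIVE,
                            PAGE_KERNEL, VM_LOCAL_NONSENSITIVE,
                            NUMA_NO_NODE, __builtin_return_address(0));

The split between the two ranges can be tuned at boot, e.g. with
vmalloc_local_nonsensitive_percent=25 on the kernel command line.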

Signed-off-by: Junaid Shahid <[email protected]>


---
arch/x86/include/asm/asi.h | 2 +
arch/x86/include/asm/page_64.h | 2 +
arch/x86/include/asm/pgtable_64_types.h | 7 ++-
arch/x86/mm/asi.c | 57 ++++++++++++++++++
include/asm-generic/asi.h | 5 ++
include/linux/vmalloc.h | 6 ++
mm/vmalloc.c | 78 ++++++++++++++++++++-----
7 files changed, 142 insertions(+), 15 deletions(-)

diff --git a/arch/x86/include/asm/asi.h b/arch/x86/include/asm/asi.h
index f11010c0334b..e3cbf6d8801e 100644
--- a/arch/x86/include/asm/asi.h
+++ b/arch/x86/include/asm/asi.h
@@ -46,6 +46,8 @@ DECLARE_PER_CPU_ALIGNED(struct asi_state, asi_cpu_state);

extern pgd_t asi_global_nonsensitive_pgd[];

+void asi_vmalloc_init(void);
+
int asi_init_mm_state(struct mm_struct *mm);
void asi_free_mm_state(struct mm_struct *mm);

diff --git a/arch/x86/include/asm/page_64.h b/arch/x86/include/asm/page_64.h
index 2845eca02552..b17574349572 100644
--- a/arch/x86/include/asm/page_64.h
+++ b/arch/x86/include/asm/page_64.h
@@ -18,6 +18,8 @@ extern unsigned long vmemmap_base;

#ifdef CONFIG_ADDRESS_SPACE_ISOLATION

+extern unsigned long vmalloc_global_nonsensitive_start;
+extern unsigned long vmalloc_local_nonsensitive_end;
extern unsigned long asi_local_map_base;
DECLARE_STATIC_KEY_FALSE(asi_local_map_initialized);

diff --git a/arch/x86/include/asm/pgtable_64_types.h b/arch/x86/include/asm/pgtable_64_types.h
index 0fc380ba25b8..06793f7ef1aa 100644
--- a/arch/x86/include/asm/pgtable_64_types.h
+++ b/arch/x86/include/asm/pgtable_64_types.h
@@ -142,8 +142,13 @@ extern unsigned int ptrs_per_p4d;
#define VMALLOC_END (VMALLOC_START + (VMALLOC_SIZE_TB << 40) - 1)

#ifdef CONFIG_ADDRESS_SPACE_ISOLATION
-#define VMALLOC_GLOBAL_NONSENSITIVE_START VMALLOC_START
+
+#define VMALLOC_LOCAL_NONSENSITIVE_START VMALLOC_START
+#define VMALLOC_LOCAL_NONSENSITIVE_END vmalloc_local_nonsensitive_end
+
+#define VMALLOC_GLOBAL_NONSENSITIVE_START vmalloc_global_nonsensitive_start
#define VMALLOC_GLOBAL_NONSENSITIVE_END VMALLOC_END
+
#endif

#define MODULES_VADDR (__START_KERNEL_map + KERNEL_IMAGE_SIZE)
diff --git a/arch/x86/mm/asi.c b/arch/x86/mm/asi.c
index 3ba0971a318d..91e5ff1224ff 100644
--- a/arch/x86/mm/asi.c
+++ b/arch/x86/mm/asi.c
@@ -3,6 +3,7 @@
#include <linux/init.h>
#include <linux/memblock.h>
#include <linux/memcontrol.h>
+#include <linux/moduleparam.h>

#include <asm/asi.h>
#include <asm/pgalloc.h>
@@ -28,6 +29,17 @@ EXPORT_SYMBOL(asi_local_map_initialized);
unsigned long asi_local_map_base __ro_after_init;
EXPORT_SYMBOL(asi_local_map_base);

+unsigned long vmalloc_global_nonsensitive_start __ro_after_init;
+EXPORT_SYMBOL(vmalloc_global_nonsensitive_start);
+
+unsigned long vmalloc_local_nonsensitive_end __ro_after_init;
+EXPORT_SYMBOL(vmalloc_local_nonsensitive_end);
+
+/* Approximate percent only. Rounded to PGDIR_SIZE boundary. */
+static uint vmalloc_local_nonsensitive_percent __ro_after_init = 50;
+core_param(vmalloc_local_nonsensitive_percent,
+ vmalloc_local_nonsensitive_percent, uint, 0444);
+
int asi_register_class(const char *name, uint flags,
const struct asi_hooks *ops)
{
@@ -307,6 +319,10 @@ int asi_init(struct mm_struct *mm, int asi_index, struct asi **out_asi)
i++)
set_pgd(asi->pgd + i, mm->asi[0].pgd[i]);

+ for (i = pgd_index(VMALLOC_LOCAL_NONSENSITIVE_START);
+ i <= pgd_index(VMALLOC_LOCAL_NONSENSITIVE_END); i++)
+ set_pgd(asi->pgd + i, mm->asi[0].pgd[i]);
+
for (i = pgd_index(VMALLOC_GLOBAL_NONSENSITIVE_START);
i < PTRS_PER_PGD; i++)
set_pgd(asi->pgd + i, asi_global_nonsensitive_pgd[i]);
@@ -432,6 +448,10 @@ void asi_free_mm_state(struct mm_struct *mm)
pgd_index(ASI_LOCAL_MAP +
PFN_PHYS(max_possible_pfn)) + 1);

+ asi_free_pgd_range(&mm->asi[0],
+ pgd_index(VMALLOC_LOCAL_NONSENSITIVE_START),
+ pgd_index(VMALLOC_LOCAL_NONSENSITIVE_END) + 1);
+
free_page((ulong)mm->asi[0].pgd);
}

@@ -671,3 +691,40 @@ void asi_sync_mapping(struct asi *asi, void *start, size_t len)
for (; addr < end; addr = pgd_addr_end(addr, end))
asi_clone_pgd(asi->pgd, asi->mm->asi[0].pgd, addr);
}
+
+void __init asi_vmalloc_init(void)
+{
+ uint start_index = pgd_index(VMALLOC_START);
+ uint end_index = pgd_index(VMALLOC_END);
+ uint global_start_index;
+
+ if (!boot_cpu_has(X86_FEATURE_ASI)) {
+ vmalloc_global_nonsensitive_start = VMALLOC_START;
+ vmalloc_local_nonsensitive_end = VMALLOC_END;
+ return;
+ }
+
+ if (vmalloc_local_nonsensitive_percent == 0) {
+ vmalloc_local_nonsensitive_percent = 1;
+ pr_warn("vmalloc_local_nonsensitive_percent must be non-zero");
+ }
+
+ if (vmalloc_local_nonsensitive_percent >= 100) {
+ vmalloc_local_nonsensitive_percent = 99;
+ pr_warn("vmalloc_local_nonsensitive_percent must be less than 100");
+ }
+
+ global_start_index = start_index + (end_index - start_index) *
+ vmalloc_local_nonsensitive_percent / 100;
+ global_start_index = max(global_start_index, start_index + 1);
+
+ vmalloc_global_nonsensitive_start = -(PTRS_PER_PGD - global_start_index)
+ * PGDIR_SIZE;
+ vmalloc_local_nonsensitive_end = vmalloc_global_nonsensitive_start - 1;
+
+ pr_debug("vmalloc_global_nonsensitive_start = %llx",
+ vmalloc_global_nonsensitive_start);
+
+ VM_BUG_ON(vmalloc_local_nonsensitive_end >= VMALLOC_END);
+ VM_BUG_ON(vmalloc_global_nonsensitive_start <= VMALLOC_START);
+}
diff --git a/include/asm-generic/asi.h b/include/asm-generic/asi.h
index a1c8ebff70e8..7c50d8b64fa4 100644
--- a/include/asm-generic/asi.h
+++ b/include/asm-generic/asi.h
@@ -18,6 +18,9 @@
#define VMALLOC_GLOBAL_NONSENSITIVE_START VMALLOC_START
#define VMALLOC_GLOBAL_NONSENSITIVE_END VMALLOC_END

+#define VMALLOC_LOCAL_NONSENSITIVE_START VMALLOC_START
+#define VMALLOC_LOCAL_NONSENSITIVE_END VMALLOC_END
+
#ifndef _ASSEMBLY_

struct asi_hooks {};
@@ -36,6 +39,8 @@ static inline int asi_init_mm_state(struct mm_struct *mm) { return 0; }

static inline void asi_free_mm_state(struct mm_struct *mm) { }

+static inline void asi_vmalloc_init(void) { }
+
static inline
int asi_init(struct mm_struct *mm, int asi_index, struct asi **out_asi)
{
diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index 5f85690f27b6..2b4eafc21fa5 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -41,8 +41,10 @@ struct notifier_block; /* in notifier.h */

#ifdef CONFIG_ADDRESS_SPACE_ISOLATION
#define VM_GLOBAL_NONSENSITIVE 0x00000800 /* Similar to __GFP_GLOBAL_NONSENSITIVE */
+#define VM_LOCAL_NONSENSITIVE 0x00001000 /* Similar to __GFP_LOCAL_NONSENSITIVE */
#else
#define VM_GLOBAL_NONSENSITIVE 0
+#define VM_LOCAL_NONSENSITIVE 0
#endif

/* bits [20..32] reserved for arch specific ioremap internals */
@@ -67,6 +69,10 @@ struct vm_struct {
unsigned int nr_pages;
phys_addr_t phys_addr;
const void *caller;
+#ifdef CONFIG_ADDRESS_SPACE_ISOLATION
+ /* Valid if flags contain VM_*_NONSENSITIVE */
+ struct asi *asi;
+#endif
};

struct vmap_area {
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index f13bfe7e896b..ea94d8a1e2e9 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -2391,18 +2391,25 @@ void __init vmalloc_init(void)
*/
vmap_init_free_space();
vmap_initialized = true;
+
+ asi_vmalloc_init();
}

+#ifdef CONFIG_ADDRESS_SPACE_ISOLATION
+
static int asi_map_vm_area(struct vm_struct *area)
{
if (!static_asi_enabled())
return 0;

if (area->flags & VM_GLOBAL_NONSENSITIVE)
- return asi_map(ASI_GLOBAL_NONSENSITIVE, area->addr,
- get_vm_area_size(area));
+ area->asi = ASI_GLOBAL_NONSENSITIVE;
+ else if (area->flags & VM_LOCAL_NONSENSITIVE)
+ area->asi = ASI_LOCAL_NONSENSITIVE;
+ else
+ return 0;

- return 0;
+ return asi_map(area->asi, area->addr, get_vm_area_size(area));
}

static void asi_unmap_vm_area(struct vm_struct *area)
@@ -2415,11 +2422,17 @@ static void asi_unmap_vm_area(struct vm_struct *area)
* the case when the existing flush from try_purge_vmap_area_lazy()
* and/or vm_unmap_aliases() happens non-lazily.
*/
- if (area->flags & VM_GLOBAL_NONSENSITIVE)
- asi_unmap(ASI_GLOBAL_NONSENSITIVE, area->addr,
- get_vm_area_size(area), true);
+ if (area->flags & (VM_GLOBAL_NONSENSITIVE | VM_LOCAL_NONSENSITIVE))
+ asi_unmap(area->asi, area->addr, get_vm_area_size(area), true);
}

+#else
+
+static inline int asi_map_vm_area(struct vm_struct *area) { return 0; }
+static inline void asi_unmap_vm_area(struct vm_struct *area) { }
+
+#endif
+
static inline void setup_vmalloc_vm_locked(struct vm_struct *vm,
struct vmap_area *va, unsigned long flags, const void *caller)
{
@@ -2463,6 +2476,15 @@ static struct vm_struct *__get_vm_area_node(unsigned long size,
if (unlikely(!size))
return NULL;

+ if (static_asi_enabled()) {
+ VM_BUG_ON((flags & VM_LOCAL_NONSENSITIVE) &&
+ !(start >= VMALLOC_LOCAL_NONSENSITIVE_START &&
+ end <= VMALLOC_LOCAL_NONSENSITIVE_END));
+
+ VM_BUG_ON((flags & VM_GLOBAL_NONSENSITIVE) &&
+ start < VMALLOC_GLOBAL_NONSENSITIVE_START);
+ }
+
if (flags & VM_IOREMAP)
align = 1ul << clamp_t(int, get_count_order_long(size),
PAGE_SHIFT, IOREMAP_MAX_ORDER);
@@ -3073,8 +3095,22 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align,
if (WARN_ON_ONCE(!size))
return NULL;

- if (static_asi_enabled() && (vm_flags & VM_GLOBAL_NONSENSITIVE))
- gfp_mask |= __GFP_ZERO;
+ if (static_asi_enabled()) {
+ VM_BUG_ON((vm_flags & (VM_LOCAL_NONSENSITIVE |
+ VM_GLOBAL_NONSENSITIVE)) ==
+ (VM_LOCAL_NONSENSITIVE | VM_GLOBAL_NONSENSITIVE));
+
+ if ((vm_flags & VM_LOCAL_NONSENSITIVE) &&
+ !mm_asi_enabled(current->mm)) {
+ vm_flags &= ~VM_LOCAL_NONSENSITIVE;
+
+ if (end == VMALLOC_LOCAL_NONSENSITIVE_END)
+ end = VMALLOC_END;
+ }
+
+ if (vm_flags & (VM_GLOBAL_NONSENSITIVE | VM_LOCAL_NONSENSITIVE))
+ gfp_mask |= __GFP_ZERO;
+ }

if ((size >> PAGE_SHIFT) > totalram_pages()) {
warn_alloc(gfp_mask, NULL,
@@ -3166,11 +3202,19 @@ void *__vmalloc_node(unsigned long size, unsigned long align,
gfp_t gfp_mask, int node, const void *caller)
{
ulong vm_flags = 0;
+ ulong start = VMALLOC_START, end = VMALLOC_END;

- if (static_asi_enabled() && (gfp_mask & __GFP_GLOBAL_NONSENSITIVE))
- vm_flags |= VM_GLOBAL_NONSENSITIVE;
+ if (static_asi_enabled()) {
+ if (gfp_mask & __GFP_GLOBAL_NONSENSITIVE) {
+ vm_flags |= VM_GLOBAL_NONSENSITIVE;
+ start = VMALLOC_GLOBAL_NONSENSITIVE_START;
+ } else if (gfp_mask & __GFP_LOCAL_NONSENSITIVE) {
+ vm_flags |= VM_LOCAL_NONSENSITIVE;
+ end = VMALLOC_LOCAL_NONSENSITIVE_END;
+ }
+ }

- return __vmalloc_node_range(size, align, VMALLOC_START, VMALLOC_END,
+ return __vmalloc_node_range(size, align, start, end,
gfp_mask, PAGE_KERNEL, vm_flags, node, caller);
}
/*
@@ -3678,9 +3722,15 @@ struct vm_struct **pcpu_get_vm_areas(const unsigned long *offsets,
/* verify parameters and allocate data structures */
BUG_ON(offset_in_page(align) || !is_power_of_2(align));

- if (static_asi_enabled() && (flags & VM_GLOBAL_NONSENSITIVE)) {
- vmalloc_start = VMALLOC_GLOBAL_NONSENSITIVE_START;
- vmalloc_end = VMALLOC_GLOBAL_NONSENSITIVE_END;
+ if (static_asi_enabled()) {
+ VM_BUG_ON((flags & (VM_LOCAL_NONSENSITIVE |
+ VM_GLOBAL_NONSENSITIVE)) ==
+ (VM_LOCAL_NONSENSITIVE | VM_GLOBAL_NONSENSITIVE));
+
+ if (flags & VM_GLOBAL_NONSENSITIVE)
+ vmalloc_start = VMALLOC_GLOBAL_NONSENSITIVE_START;
+ else if (flags & VM_LOCAL_NONSENSITIVE)
+ vmalloc_end = VMALLOC_LOCAL_NONSENSITIVE_END;
}

vmalloc_start = ALIGN(vmalloc_start, align);
--
2.35.1.473.g83b2b277ed-goog

2022-02-23 11:41:50

by Junaid Shahid

[permalink] [raw]
Subject: [RFC PATCH 31/47] mm: asi: Support for non-sensitive SLUB caches

This adds support for allocating globally and locally non-sensitive objects
using the SLUB allocator. Similar to SLAB, per-process child caches are
created for locally non-sensitive allocations. This mechanism is based
on a modified form of the earlier implementation of per-memcg caches.
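
A minimal usage sketch (not from this patch; SLAB_NONSENSITIVE and the
__GFP_*_NONSENSITIVE flags come from earlier patches in the series, and the
exact interplay between the cache flag and the per-allocation GFP flags
follows those patches):

    struct demo_obj { unsigned long data; };     /* illustrative type */
    struct kmem_cache *demo_cache;
    struct demo_obj *obj;

    demo_cache = kmem_cache_create("demo_cache", sizeof(struct demo_obj), 0,
                                   SLAB_NONSENSITIVE, NULL);

    /* Globally non-sensitive object: */
    obj = kmem_cache_alloc(demo_cache, GFP_KERNEL | __GFP_GLOBAL_NONSENSITIVE);

    /* Locally non-sensitive object, served from the per-process child cache: */
    obj = kmem_cache_alloc(demo_cache, GFP_KERNEL | __GFP_LOCAL_NONSENSITIVE);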

Signed-off-by: Junaid Shahid <[email protected]>


---
include/linux/slub_def.h | 6 ++
mm/slab.h | 5 ++
mm/slab_common.c | 33 +++++++--
mm/slub.c | 140 ++++++++++++++++++++++++++++++++++++++-
security/Kconfig | 3 +-
5 files changed, 179 insertions(+), 8 deletions(-)

diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
index 0fa751b946fa..6e185b61582c 100644
--- a/include/linux/slub_def.h
+++ b/include/linux/slub_def.h
@@ -137,6 +137,12 @@ struct kmem_cache {
struct kasan_cache kasan_info;
#endif

+#ifdef CONFIG_ADDRESS_SPACE_ISOLATION
+ struct kmem_local_cache_info local_cache_info;
+ /* For propagation, maximum size of a stored attr */
+ unsigned int max_attr_size;
+#endif
+
unsigned int useroffset; /* Usercopy region offset */
unsigned int usersize; /* Usercopy region size */

diff --git a/mm/slab.h b/mm/slab.h
index b9e11038be27..8799bcdd2fff 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -216,6 +216,7 @@ int __kmem_cache_shutdown(struct kmem_cache *);
void __kmem_cache_release(struct kmem_cache *);
int __kmem_cache_shrink(struct kmem_cache *);
void slab_kmem_cache_release(struct kmem_cache *);
+void kmem_cache_shrink_all(struct kmem_cache *s);

struct seq_file;
struct file;
@@ -344,6 +345,7 @@ void restore_page_nonsensitive_metadata(struct page *page,
}

void set_nonsensitive_cache_params(struct kmem_cache *s);
+void init_local_cache_info(struct kmem_cache *s, struct kmem_cache *root);

#else /* CONFIG_ADDRESS_SPACE_ISOLATION */

@@ -380,6 +382,9 @@ static inline void restore_page_nonsensitive_metadata(struct page *page,

static inline void set_nonsensitive_cache_params(struct kmem_cache *s) { }

+static inline
+void init_local_cache_info(struct kmem_cache *s, struct kmem_cache *root) { }
+
#endif /* CONFIG_ADDRESS_SPACE_ISOLATION */

#ifdef CONFIG_MEMCG_KMEM
diff --git a/mm/slab_common.c b/mm/slab_common.c
index b486b72d6344..efa61b97902a 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -142,7 +142,7 @@ int __kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t nr,

LIST_HEAD(slab_root_caches);

-static void init_local_cache_info(struct kmem_cache *s, struct kmem_cache *root)
+void init_local_cache_info(struct kmem_cache *s, struct kmem_cache *root)
{
if (root) {
s->local_cache_info.root_cache = root;
@@ -194,9 +194,6 @@ void set_nonsensitive_cache_params(struct kmem_cache *s)

#else

-static inline
-void init_local_cache_info(struct kmem_cache *s, struct kmem_cache *root) { }
-
static inline void cleanup_local_cache_info(struct kmem_cache *s) { }

#endif /* CONFIG_ADDRESS_SPACE_ISOLATION */
@@ -644,6 +641,34 @@ int kmem_cache_shrink(struct kmem_cache *cachep)
}
EXPORT_SYMBOL(kmem_cache_shrink);

+/**
+ * kmem_cache_shrink_all - shrink a cache and all child caches for root cache
+ * @s: The cache pointer
+ */
+void kmem_cache_shrink_all(struct kmem_cache *s)
+{
+ struct kmem_cache *c;
+
+ if (!static_asi_enabled() || !is_root_cache(s)) {
+ kmem_cache_shrink(s);
+ return;
+ }
+
+ kasan_cache_shrink(s);
+ __kmem_cache_shrink(s);
+
+ /*
+ * We have to take the slab_mutex to protect from the child cache list
+ * modification.
+ */
+ mutex_lock(&slab_mutex);
+ for_each_child_cache(c, s) {
+ kasan_cache_shrink(c);
+ __kmem_cache_shrink(c);
+ }
+ mutex_unlock(&slab_mutex);
+}
+
bool slab_is_available(void)
{
return slab_state >= UP;
diff --git a/mm/slub.c b/mm/slub.c
index abe7db581d68..df0191f8b0e2 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -289,6 +289,21 @@ static void debugfs_slab_add(struct kmem_cache *);
static inline void debugfs_slab_add(struct kmem_cache *s) { }
#endif

+#if defined(CONFIG_SYSFS) && defined(CONFIG_ADDRESS_SPACE_ISOLATION)
+static void propagate_slab_attrs_from_parent(struct kmem_cache *s);
+static void propagate_slab_attr_to_children(struct kmem_cache *s,
+ struct attribute *attr,
+ const char *buf, size_t len);
+#else
+static inline void propagate_slab_attrs_from_parent(struct kmem_cache *s) { }
+
+static inline
+void propagate_slab_attr_to_children(struct kmem_cache *s,
+ struct attribute *attr,
+ const char *buf, size_t len)
+{ }
+#endif
+
static inline void stat(const struct kmem_cache *s, enum stat_item si)
{
#ifdef CONFIG_SLUB_STATS
@@ -2015,6 +2030,7 @@ static void __free_slab(struct kmem_cache *s, struct page *page)
if (current->reclaim_state)
current->reclaim_state->reclaimed_slab += pages;
unaccount_slab_page(page, order, s);
+ restore_page_nonsensitive_metadata(page, s);
__free_pages(page, order);
}

@@ -4204,6 +4220,8 @@ static int kmem_cache_open(struct kmem_cache *s, slab_flags_t flags)
}
}

+ set_nonsensitive_cache_params(s);
+
#if defined(CONFIG_HAVE_CMPXCHG_DOUBLE) && \
defined(CONFIG_HAVE_ALIGNED_STRUCT_PAGE)
if (system_has_cmpxchg_double() && (s->flags & SLAB_NO_CMPXCHG) == 0)
@@ -4797,6 +4815,10 @@ static struct kmem_cache * __init bootstrap(struct kmem_cache *static_cache)
#endif
}
list_add(&s->list, &slab_caches);
+ init_local_cache_info(s, NULL);
+#ifdef CONFIG_ADDRESS_SPACE_ISOLATION
+ list_del(&static_cache->root_caches_node);
+#endif
return s;
}

@@ -4863,7 +4885,7 @@ struct kmem_cache *
__kmem_cache_alias(const char *name, unsigned int size, unsigned int align,
slab_flags_t flags, void (*ctor)(void *))
{
- struct kmem_cache *s;
+ struct kmem_cache *s, *c;

s = find_mergeable(size, align, flags, name, ctor);
if (s) {
@@ -4876,6 +4898,11 @@ __kmem_cache_alias(const char *name, unsigned int size, unsigned int align,
s->object_size = max(s->object_size, size);
s->inuse = max(s->inuse, ALIGN(size, sizeof(void *)));

+ for_each_child_cache(c, s) {
+ c->object_size = s->object_size;
+ c->inuse = max(c->inuse, ALIGN(size, sizeof(void *)));
+ }
+
if (sysfs_slab_alias(s, name)) {
s->refcount--;
s = NULL;
@@ -4889,6 +4916,9 @@ int __kmem_cache_create(struct kmem_cache *s, slab_flags_t flags)
{
int err;

+ if (!static_asi_enabled())
+ flags &= ~SLAB_NONSENSITIVE;
+
err = kmem_cache_open(s, flags);
if (err)
return err;
@@ -4897,6 +4927,8 @@ int __kmem_cache_create(struct kmem_cache *s, slab_flags_t flags)
if (slab_state <= UP)
return 0;

+ propagate_slab_attrs_from_parent(s);
+
err = sysfs_slab_add(s);
if (err) {
__kmem_cache_release(s);
@@ -5619,7 +5651,7 @@ static ssize_t shrink_store(struct kmem_cache *s,
const char *buf, size_t length)
{
if (buf[0] == '1')
- kmem_cache_shrink(s);
+ kmem_cache_shrink_all(s);
else
return -EINVAL;
return length;
@@ -5829,6 +5861,87 @@ static ssize_t slab_attr_show(struct kobject *kobj,
return err;
}

+#ifdef CONFIG_ADDRESS_SPACE_ISOLATION
+
+static void propagate_slab_attrs_from_parent(struct kmem_cache *s)
+{
+ int i;
+ char *buffer = NULL;
+ struct kmem_cache *root_cache;
+
+ if (is_root_cache(s))
+ return;
+
+ root_cache = s->local_cache_info.root_cache;
+
+ /*
+ * This means this cache had no attributes written. Therefore, there is no
+ * point in copying default values around.
+ */
+ if (!root_cache->max_attr_size)
+ return;
+
+ for (i = 0; i < ARRAY_SIZE(slab_attrs); i++) {
+ char mbuf[64];
+ char *buf;
+ struct slab_attribute *attr = to_slab_attr(slab_attrs[i]);
+ ssize_t len;
+
+ if (!attr || !attr->store || !attr->show)
+ continue;
+
+ /*
+ * It is really bad that we have to allocate here, so we will
+ * do it only as a fallback. If we actually allocate, though,
+ * we can just use the allocated buffer until the end.
+ *
+ * Most of the slub attributes will tend to be very small in
+ * size, but sysfs allows buffers up to a page, so they can
+ * theoretically happen.
+ */
+ if (buffer) {
+ buf = buffer;
+ } else if (root_cache->max_attr_size < ARRAY_SIZE(mbuf) &&
+ !IS_ENABLED(CONFIG_SLUB_STATS)) {
+ buf = mbuf;
+ } else {
+ buffer = (char *)get_zeroed_page(GFP_KERNEL);
+ if (WARN_ON(!buffer))
+ continue;
+ buf = buffer;
+ }
+
+ len = attr->show(root_cache, buf);
+ if (len > 0)
+ attr->store(s, buf, len);
+ }
+
+ if (buffer)
+ free_page((unsigned long)buffer);
+}
+
+static void propagate_slab_attr_to_children(struct kmem_cache *s,
+ struct attribute *attr,
+ const char *buf, size_t len)
+{
+ struct kmem_cache *c;
+ struct slab_attribute *attribute = to_slab_attr(attr);
+
+ if (static_asi_enabled()) {
+ mutex_lock(&slab_mutex);
+
+ if (s->max_attr_size < len)
+ s->max_attr_size = len;
+
+ for_each_child_cache(c, s)
+ attribute->store(c, buf, len);
+
+ mutex_unlock(&slab_mutex);
+ }
+}
+
+#endif
+
static ssize_t slab_attr_store(struct kobject *kobj,
struct attribute *attr,
const char *buf, size_t len)
@@ -5844,6 +5957,27 @@ static ssize_t slab_attr_store(struct kobject *kobj,
return -EIO;

err = attribute->store(s, buf, len);
+
+ /*
+ * This is a best effort propagation, so this function's return
+ * value will be determined by the parent cache only. This is
+ * basically because not all attributes will have a well
+ * defined semantics for rollbacks - most of the actions will
+ * have permanent effects.
+ *
+ * Returning the error value of any of the children that fail
+ * is not 100 % defined, in the sense that users seeing the
+ * error code won't be able to know anything about the state of
+ * the cache.
+ *
+ * Only returning the error code for the parent cache at least
+ * has well defined semantics. The cache being written to
+ * directly either failed or succeeded, in which case we loop
+ * through the descendants with best-effort propagation.
+ */
+ if (slab_state >= FULL && err >= 0 && is_root_cache(s))
+ propagate_slab_attr_to_children(s, attr, buf, len);
+
return err;
}

@@ -5866,7 +6000,7 @@ static struct kset *slab_kset;

static inline struct kset *cache_kset(struct kmem_cache *s)
{
- return slab_kset;
+ return is_root_cache(s) ? slab_kset : NULL;
}

#define ID_STR_LENGTH 64
diff --git a/security/Kconfig b/security/Kconfig
index 070a948b5266..a5cfb09352b0 100644
--- a/security/Kconfig
+++ b/security/Kconfig
@@ -68,7 +68,8 @@ config PAGE_TABLE_ISOLATION
config ADDRESS_SPACE_ISOLATION
bool "Allow code to run with a reduced kernel address space"
default n
- depends on X86_64 && !UML && SLAB && !NEED_PER_CPU_KM
+ depends on X86_64 && !UML && !NEED_PER_CPU_KM
+ depends on SLAB || SLUB
depends on !PARAVIRT
depends on !MEMORY_HOTPLUG
help
--
2.35.1.473.g83b2b277ed-goog

2022-02-23 12:00:53

by Junaid Shahid

[permalink] [raw]
Subject: [RFC PATCH 07/47] mm: asi: Functions to map/unmap a memory range into ASI page tables

Two functions, asi_map() and asi_map_gfp(), are added to allow mapping
memory into ASI page tables. The mapping will be identical to the one
for the same virtual address in the unrestricted page tables. This is
necessary to allow switching between the page tables at any arbitrary
point in the kernel.

Another function, asi_unmap(), is added to allow unmapping memory
previously mapped via the asi_map*() functions.
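
As a rough illustration of the intended usage, a hypothetical caller (not
part of this patch) that wants a page-aligned kernel buffer visible inside
the restricted address space might do something like:

    static int example_share_buffer(struct asi *asi, void *buf, size_t len)
    {
            int err;

            /* buf and len must be page-aligned kernel memory. */
            err = asi_map(asi, buf, len);
            if (err)
                    return err;

            /* ... buf is now accessible while running inside this ASI ... */

            /* Unmap again and flush the TLB for the range. */
            asi_unmap(asi, buf, len, true);
            return 0;
    }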

Signed-off-by: Junaid Shahid <[email protected]>


---
arch/x86/include/asm/asi.h | 5 +
arch/x86/mm/asi.c | 196 +++++++++++++++++++++++++++++++++++++
include/asm-generic/asi.h | 19 ++++
mm/internal.h | 3 +
mm/vmalloc.c | 60 +++++++-----
5 files changed, 261 insertions(+), 22 deletions(-)

diff --git a/arch/x86/include/asm/asi.h b/arch/x86/include/asm/asi.h
index 95557211dabd..521b40d1864b 100644
--- a/arch/x86/include/asm/asi.h
+++ b/arch/x86/include/asm/asi.h
@@ -53,6 +53,11 @@ void asi_destroy(struct asi *asi);
void asi_enter(struct asi *asi);
void asi_exit(void);

+int asi_map_gfp(struct asi *asi, void *addr, size_t len, gfp_t gfp_flags);
+int asi_map(struct asi *asi, void *addr, size_t len);
+void asi_unmap(struct asi *asi, void *addr, size_t len, bool flush_tlb);
+void asi_flush_tlb_range(struct asi *asi, void *addr, size_t len);
+
static inline void asi_init_thread_state(struct thread_struct *thread)
{
thread->intr_nest_depth = 0;
diff --git a/arch/x86/mm/asi.c b/arch/x86/mm/asi.c
index 40d772b2e2a8..84d220cbdcfc 100644
--- a/arch/x86/mm/asi.c
+++ b/arch/x86/mm/asi.c
@@ -6,6 +6,8 @@
#include <asm/pgalloc.h>
#include <asm/mmu_context.h>

+#include "../../../mm/internal.h"
+
#undef pr_fmt
#define pr_fmt(fmt) "ASI: " fmt

@@ -287,3 +289,197 @@ void asi_init_mm_state(struct mm_struct *mm)
{
memset(mm->asi, 0, sizeof(mm->asi));
}
+
+static bool is_page_within_range(size_t addr, size_t page_size,
+ size_t range_start, size_t range_end)
+{
+ size_t page_start, page_end, page_mask;
+
+ page_mask = ~(page_size - 1);
+ page_start = addr & page_mask;
+ page_end = page_start + page_size;
+
+ return page_start >= range_start && page_end <= range_end;
+}
+
+static bool follow_physaddr(struct mm_struct *mm, size_t virt,
+ phys_addr_t *phys, size_t *page_size, ulong *flags)
+{
+ pgd_t *pgd;
+ p4d_t *p4d;
+ pud_t *pud;
+ pmd_t *pmd;
+ pte_t *pte;
+
+#define follow_addr_at_level(base, level, LEVEL) \
+ do { \
+ *page_size = LEVEL##_SIZE; \
+ level = level##_offset(base, virt); \
+ if (!level##_present(*level)) \
+ return false; \
+ \
+ if (level##_large(*level)) { \
+ *phys = PFN_PHYS(level##_pfn(*level)) | \
+ (virt & ~LEVEL##_MASK); \
+ *flags = level##_flags(*level); \
+ return true; \
+ } \
+ } while (false)
+
+ follow_addr_at_level(mm, pgd, PGDIR);
+ follow_addr_at_level(pgd, p4d, P4D);
+ follow_addr_at_level(p4d, pud, PUD);
+ follow_addr_at_level(pud, pmd, PMD);
+
+ *page_size = PAGE_SIZE;
+ pte = pte_offset_map(pmd, virt);
+ if (!pte)
+ return false;
+
+ if (!pte_present(*pte)) {
+ pte_unmap(pte);
+ return false;
+ }
+
+ *phys = PFN_PHYS(pte_pfn(*pte)) | (virt & ~PAGE_MASK);
+ *flags = pte_flags(*pte);
+
+ pte_unmap(pte);
+ return true;
+
+#undef follow_addr_at_level
+}
+
+/*
+ * Map the given range into the ASI page tables. The source of the mapping
+ * is the regular unrestricted page tables.
+ * Can be used to map any kernel memory.
+ *
+ * The caller MUST ensure that the source mapping will not change during this
+ * function. For dynamic kernel memory, this is generally ensured by mapping
+ * the memory within the allocator.
+ *
+ * If the source mapping is a large page and the range being mapped spans the
+ * entire large page, then it will be mapped as a large page in the ASI page
+ * tables too. If the range does not span the entire huge page, then it will
+ * be mapped as smaller pages. In that case, the implementation is slightly
+ * inefficient, as it will walk the source page tables again for each small
+ * destination page, but that should be ok for now, as usually in such cases,
+ * the range would consist of a small-ish number of pages.
+ */
+int asi_map_gfp(struct asi *asi, void *addr, size_t len, gfp_t gfp_flags)
+{
+ size_t virt;
+ size_t start = (size_t)addr;
+ size_t end = start + len;
+ size_t page_size;
+
+ if (!static_cpu_has(X86_FEATURE_ASI))
+ return 0;
+
+ VM_BUG_ON(start & ~PAGE_MASK);
+ VM_BUG_ON(len & ~PAGE_MASK);
+ VM_BUG_ON(start < TASK_SIZE_MAX);
+
+ gfp_flags &= GFP_RECLAIM_MASK;
+
+ if (asi->mm != &init_mm)
+ gfp_flags |= __GFP_ACCOUNT;
+
+ for (virt = start; virt < end; virt = ALIGN(virt + 1, page_size)) {
+ pgd_t *pgd;
+ p4d_t *p4d;
+ pud_t *pud;
+ pmd_t *pmd;
+ pte_t *pte;
+ phys_addr_t phys;
+ ulong flags;
+
+ if (!follow_physaddr(asi->mm, virt, &phys, &page_size, &flags))
+ continue;
+
+#define MAP_AT_LEVEL(base, BASE, level, LEVEL) { \
+ if (base##_large(*base)) { \
+ VM_BUG_ON(PHYS_PFN(phys & BASE##_MASK) != \
+ base##_pfn(*base)); \
+ continue; \
+ } \
+ \
+ level = asi_##level##_alloc(asi, base, virt, gfp_flags);\
+ if (!level) \
+ return -ENOMEM; \
+ \
+ if (page_size >= LEVEL##_SIZE && \
+ (level##_none(*level) || level##_leaf(*level)) && \
+ is_page_within_range(virt, LEVEL##_SIZE, \
+ start, end)) { \
+ page_size = LEVEL##_SIZE; \
+ phys &= LEVEL##_MASK; \
+ \
+ if (level##_none(*level)) \
+ set_##level(level, \
+ __##level(phys | flags)); \
+ else \
+ VM_BUG_ON(level##_pfn(*level) != \
+ PHYS_PFN(phys)); \
+ continue; \
+ } \
+ }
+
+ pgd = pgd_offset_pgd(asi->pgd, virt);
+
+ MAP_AT_LEVEL(pgd, PGDIR, p4d, P4D);
+ MAP_AT_LEVEL(p4d, P4D, pud, PUD);
+ MAP_AT_LEVEL(pud, PUD, pmd, PMD);
+ MAP_AT_LEVEL(pmd, PMD, pte, PAGE);
+
+ VM_BUG_ON(true); /* Should never reach here. */
+#undef MAP_AT_LEVEL
+ }
+
+ return 0;
+}
+
+int asi_map(struct asi *asi, void *addr, size_t len)
+{
+ return asi_map_gfp(asi, addr, len, GFP_KERNEL);
+}
+
+/*
+ * Unmap a kernel address range previously mapped into the ASI page tables.
+ * The caller must ensure appropriate TLB flushing.
+ *
+ * The area being unmapped must be a whole previously mapped region (or
+ * regions). Unmapping a partial subset of a previously mapped region is not
+ * supported: it will work, but may end up unmapping more than what was asked
+ * for, if the mapping contained huge pages.
+ *
+ * Note that higher order direct map allocations are allowed to be partially
+ * freed. If it turns out that that actually happens for any of the
+ * non-sensitive allocations, then the above limitation may be a problem. For
+ * now, vunmap_pgd_range() will emit a warning if this situation is detected.
+ */
+void asi_unmap(struct asi *asi, void *addr, size_t len, bool flush_tlb)
+{
+ size_t start = (size_t)addr;
+ size_t end = start + len;
+ pgtbl_mod_mask mask = 0;
+
+ if (!static_cpu_has(X86_FEATURE_ASI) || !len)
+ return;
+
+ VM_BUG_ON(start & ~PAGE_MASK);
+ VM_BUG_ON(len & ~PAGE_MASK);
+ VM_BUG_ON(start < TASK_SIZE_MAX);
+
+ vunmap_pgd_range(asi->pgd, start, end, &mask, false);
+
+ if (flush_tlb)
+ asi_flush_tlb_range(asi, addr, len);
+}
+
+void asi_flush_tlb_range(struct asi *asi, void *addr, size_t len)
+{
+ /* Later patches will do a more optimized flush. */
+ flush_tlb_kernel_range((ulong)addr, (ulong)addr + len);
+}
diff --git a/include/asm-generic/asi.h b/include/asm-generic/asi.h
index dae1403ee1d0..7da91cbe075d 100644
--- a/include/asm-generic/asi.h
+++ b/include/asm-generic/asi.h
@@ -2,6 +2,8 @@
#ifndef __ASM_GENERIC_ASI_H
#define __ASM_GENERIC_ASI_H

+#include <linux/types.h>
+
/* ASI class flags */
#define ASI_MAP_STANDARD_NONSENSITIVE 1

@@ -44,6 +46,23 @@ static inline struct asi *asi_get_target(void) { return NULL; }

static inline struct asi *asi_get_current(void) { return NULL; }

+static inline
+int asi_map_gfp(struct asi *asi, void *addr, size_t len, gfp_t gfp_flags)
+{
+ return 0;
+}
+
+static inline int asi_map(struct asi *asi, void *addr, size_t len)
+{
+ return 0;
+}
+
+static inline
+void asi_unmap(struct asi *asi, void *addr, size_t len, bool flush_tlb) { }
+
+static inline
+void asi_flush_tlb_range(struct asi *asi, void *addr, size_t len) { }
+
#define static_asi_enabled() false

#endif /* !_ASSEMBLY_ */
diff --git a/mm/internal.h b/mm/internal.h
index 3b79a5c9427a..ae8799d86dd3 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -79,6 +79,9 @@ void unmap_page_range(struct mmu_gather *tlb,
unsigned long addr, unsigned long end,
struct zap_details *details);

+void vunmap_pgd_range(pgd_t *pgd_table, unsigned long addr, unsigned long end,
+ pgtbl_mod_mask *mask, bool sleepable);
+
void do_page_cache_ra(struct readahead_control *, unsigned long nr_to_read,
unsigned long lookahead_size);
void force_page_cache_ra(struct readahead_control *, unsigned long nr);
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index d2a00ad4e1dd..f2ef719f1cba 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -336,7 +336,7 @@ static void vunmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
}

static void vunmap_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
- pgtbl_mod_mask *mask)
+ pgtbl_mod_mask *mask, bool sleepable)
{
pmd_t *pmd;
unsigned long next;
@@ -350,18 +350,22 @@ static void vunmap_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
if (cleared || pmd_bad(*pmd))
*mask |= PGTBL_PMD_MODIFIED;

- if (cleared)
+ if (cleared) {
+ WARN_ON(addr & ~PMD_MASK);
+ WARN_ON(next & ~PMD_MASK);
continue;
+ }
if (pmd_none_or_clear_bad(pmd))
continue;
vunmap_pte_range(pmd, addr, next, mask);

- cond_resched();
+ if (sleepable)
+ cond_resched();
} while (pmd++, addr = next, addr != end);
}

static void vunmap_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
- pgtbl_mod_mask *mask)
+ pgtbl_mod_mask *mask, bool sleepable)
{
pud_t *pud;
unsigned long next;
@@ -375,16 +379,19 @@ static void vunmap_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
if (cleared || pud_bad(*pud))
*mask |= PGTBL_PUD_MODIFIED;

- if (cleared)
+ if (cleared) {
+ WARN_ON(addr & ~PUD_MASK);
+ WARN_ON(next & ~PUD_MASK);
continue;
+ }
if (pud_none_or_clear_bad(pud))
continue;
- vunmap_pmd_range(pud, addr, next, mask);
+ vunmap_pmd_range(pud, addr, next, mask, sleepable);
} while (pud++, addr = next, addr != end);
}

static void vunmap_p4d_range(pgd_t *pgd, unsigned long addr, unsigned long end,
- pgtbl_mod_mask *mask)
+ pgtbl_mod_mask *mask, bool sleepable)
{
p4d_t *p4d;
unsigned long next;
@@ -398,14 +405,35 @@ static void vunmap_p4d_range(pgd_t *pgd, unsigned long addr, unsigned long end,
if (cleared || p4d_bad(*p4d))
*mask |= PGTBL_P4D_MODIFIED;

- if (cleared)
+ if (cleared) {
+ WARN_ON(addr & ~P4D_MASK);
+ WARN_ON(next & ~P4D_MASK);
continue;
+ }
if (p4d_none_or_clear_bad(p4d))
continue;
- vunmap_pud_range(p4d, addr, next, mask);
+ vunmap_pud_range(p4d, addr, next, mask, sleepable);
} while (p4d++, addr = next, addr != end);
}

+void vunmap_pgd_range(pgd_t *pgd_table, unsigned long addr, unsigned long end,
+ pgtbl_mod_mask *mask, bool sleepable)
+{
+ unsigned long next;
+ pgd_t *pgd = pgd_offset_pgd(pgd_table, addr);
+
+ BUG_ON(addr >= end);
+
+ do {
+ next = pgd_addr_end(addr, end);
+ if (pgd_bad(*pgd))
+ *mask |= PGTBL_PGD_MODIFIED;
+ if (pgd_none_or_clear_bad(pgd))
+ continue;
+ vunmap_p4d_range(pgd, addr, next, mask, sleepable);
+ } while (pgd++, addr = next, addr != end);
+}
+
/*
* vunmap_range_noflush is similar to vunmap_range, but does not
* flush caches or TLBs.
@@ -420,21 +448,9 @@ static void vunmap_p4d_range(pgd_t *pgd, unsigned long addr, unsigned long end,
*/
void vunmap_range_noflush(unsigned long start, unsigned long end)
{
- unsigned long next;
- pgd_t *pgd;
- unsigned long addr = start;
pgtbl_mod_mask mask = 0;

- BUG_ON(addr >= end);
- pgd = pgd_offset_k(addr);
- do {
- next = pgd_addr_end(addr, end);
- if (pgd_bad(*pgd))
- mask |= PGTBL_PGD_MODIFIED;
- if (pgd_none_or_clear_bad(pgd))
- continue;
- vunmap_p4d_range(pgd, addr, next, &mask);
- } while (pgd++, addr = next, addr != end);
+ vunmap_pgd_range(init_mm.pgd, start, end, &mask, true);

if (mask & ARCH_PAGE_TABLE_SYNC_MASK)
arch_sync_kernel_mappings(start, end);
--
2.35.1.473.g83b2b277ed-goog

2022-02-23 12:18:12

by Junaid Shahid

[permalink] [raw]
Subject: [RFC PATCH 29/47] mm: asi: Reduce TLB flushes when freeing pages asynchronously

When we are freeing pages asynchronously (because the original free
was issued with IRQs disabled), we now issue only one TLB flush per
execution of the async work function. If there is only one page to
free, we do a targeted flush for that page only. Otherwise, we just do
a full flush.
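
The per-invocation flush choice in the async work function boils down to
the following (simplified sketch of the code added below, not the literal
hunk):

    if (!llist_next(pages_to_free)) {
            /* Exactly one page queued: targeted flush of just its range. */
            __asi_flush_tlb_range(page->asi_mm_ctx_id, 0, page->asi_tlb_gen,
                                  addr, addr + PAGE_SIZE * (1 << order),
                                  cpu_online_mask);
    } else {
            /* Multiple pages queued: a single full flush covers them all. */
            asi_flush_tlb_range(ASI_GLOBAL_NONSENSITIVE, 0, TLB_FLUSH_ALL);
    }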

Signed-off-by: Junaid Shahid <[email protected]>


---
arch/x86/include/asm/tlbflush.h | 8 +++++
arch/x86/mm/tlb.c | 52 ++++++++++++++++++++-------------
include/linux/mm_types.h | 30 +++++++++++++------
mm/page_alloc.c | 40 ++++++++++++++++++++-----
4 files changed, 93 insertions(+), 37 deletions(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 85315d1d2d70..7d04aa2a5f86 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -296,6 +296,14 @@ unsigned long build_cr3_pcid(pgd_t *pgd, u16 pcid, bool noflush);
u16 kern_pcid(u16 asid);
u16 asi_pcid(struct asi *asi, u16 asid);

+#ifdef CONFIG_ADDRESS_SPACE_ISOLATION
+
+void __asi_prepare_tlb_flush(struct asi *asi, u64 *new_tlb_gen);
+void __asi_flush_tlb_range(u64 mm_context_id, u16 pcid_index, u64 new_tlb_gen,
+ size_t start, size_t end, const cpumask_t *cpu_mask);
+
+#endif /* CONFIG_ADDRESS_SPACE_ISOLATION */
+
#endif /* !MODULE */

#endif /* _ASM_X86_TLBFLUSH_H */
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 2a442335501f..fcd2c8e92f83 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -1302,21 +1302,10 @@ static bool is_asi_active_on_cpu(int cpu, void *info)
return per_cpu(asi_cpu_state.curr_asi, cpu);
}

-void asi_flush_tlb_range(struct asi *asi, void *addr, size_t len)
+void __asi_prepare_tlb_flush(struct asi *asi, u64 *new_tlb_gen)
{
- size_t start = (size_t)addr;
- size_t end = start + len;
- struct flush_tlb_info *info;
- u64 mm_context_id;
- const cpumask_t *cpu_mask;
- u64 new_tlb_gen = 0;
-
- if (!static_cpu_has(X86_FEATURE_ASI))
- return;
-
if (static_cpu_has(X86_FEATURE_PCID)) {
- new_tlb_gen = atomic64_inc_return(asi->tlb_gen);
-
+ *new_tlb_gen = atomic64_inc_return(asi->tlb_gen);
/*
* The increment of tlb_gen must happen before the curr_asi
* reads in is_asi_active_on_cpu(). That ensures that if another
@@ -1326,8 +1315,35 @@ void asi_flush_tlb_range(struct asi *asi, void *addr, size_t len)
*/
smp_mb__after_atomic();
}
+}
+
+void __asi_flush_tlb_range(u64 mm_context_id, u16 pcid_index, u64 new_tlb_gen,
+ size_t start, size_t end, const cpumask_t *cpu_mask)
+{
+ struct flush_tlb_info *info;

preempt_disable();
+ info = get_flush_tlb_info(NULL, start, end, 0, false, new_tlb_gen,
+ mm_context_id, pcid_index);
+
+ on_each_cpu_cond_mask(is_asi_active_on_cpu, do_asi_tlb_flush, info,
+ true, cpu_mask);
+ put_flush_tlb_info();
+ preempt_enable();
+}
+
+void asi_flush_tlb_range(struct asi *asi, void *addr, size_t len)
+{
+ size_t start = (size_t)addr;
+ size_t end = start + len;
+ u64 mm_context_id;
+ u64 new_tlb_gen = 0;
+ const cpumask_t *cpu_mask;
+
+ if (!static_cpu_has(X86_FEATURE_ASI))
+ return;
+
+ __asi_prepare_tlb_flush(asi, &new_tlb_gen);

if (asi == ASI_GLOBAL_NONSENSITIVE) {
mm_context_id = U64_MAX;
@@ -1337,14 +1353,8 @@ void asi_flush_tlb_range(struct asi *asi, void *addr, size_t len)
cpu_mask = mm_cpumask(asi->mm);
}

- info = get_flush_tlb_info(NULL, start, end, 0, false, new_tlb_gen,
- mm_context_id, asi->pcid_index);
-
- on_each_cpu_cond_mask(is_asi_active_on_cpu, do_asi_tlb_flush, info,
- true, cpu_mask);
-
- put_flush_tlb_info();
- preempt_enable();
+ __asi_flush_tlb_range(mm_context_id, asi->pcid_index, new_tlb_gen,
+ start, end, cpu_mask);
}

#endif
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 56511adc263e..7d38229ca85c 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -193,21 +193,33 @@ struct page {
/** @rcu_head: You can use this to free a page by RCU. */
struct rcu_head rcu_head;

-#ifdef CONFIG_ADDRESS_SPACE_ISOLATION
+#if defined(CONFIG_ADDRESS_SPACE_ISOLATION) && !defined(BUILD_VDSO32)
struct {
/* Links the pages_to_free_async list */
struct llist_node async_free_node;

unsigned long _asi_pad_1;
- unsigned long _asi_pad_2;
+ u64 asi_tlb_gen;

- /*
- * Upon allocation of a locally non-sensitive page, set
- * to the allocating mm. Must be set to the same mm when
- * the page is freed. May potentially be overwritten in
- * the meantime, as long as it is restored before free.
- */
- struct mm_struct *asi_mm;
+ union {
+ /*
+ * Upon allocation of a locally non-sensitive
+ * page, set to the allocating mm. Must be set
+ * to the same mm when the page is freed. May
+ * potentially be overwritten in the meantime,
+ * as long as it is restored before free.
+ */
+ struct mm_struct *asi_mm;
+
+ /*
+ * Set to the above mm's context ID if the page
+ * is being freed asynchronously. Can't directly
+ * use the mm_struct, unless we take additional
+ * steps to prevent it from being freed while the
+ * async work is pending.
+ */
+ u64 asi_mm_ctx_id;
+ };
};
#endif
};
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 01784bff2a80..998ff6a56732 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5182,20 +5182,41 @@ static void async_free_work_fn(struct work_struct *work)
{
struct page *page, *tmp;
struct llist_node *pages_to_free;
- void *va;
- size_t len;
+ size_t addr;
uint order;

pages_to_free = llist_del_all(this_cpu_ptr(&pages_to_free_async));

- /* A later patch will do a more optimized TLB flush. */
+ if (!pages_to_free)
+ return;
+
+ /* If we only have one page to free, then do a targeted TLB flush. */
+ if (!llist_next(pages_to_free)) {
+ page = llist_entry(pages_to_free, struct page, async_free_node);
+ addr = (size_t)page_to_virt(page);
+ order = page->private;
+
+ __asi_flush_tlb_range(page->asi_mm_ctx_id, 0, page->asi_tlb_gen,
+ addr, addr + PAGE_SIZE * (1 << order),
+ cpu_online_mask);
+ /* Need to clear, since it shares space with page->mapping. */
+ page->asi_tlb_gen = 0;
+
+ __free_the_page(page, order);
+ return;
+ }
+
+ /*
+ * Otherwise, do a full flush. We could potentially try to optimize it
+ * via taking a union of what needs to be flushed, but it may not be
+ * worth the additional complexity.
+ */
+ asi_flush_tlb_range(ASI_GLOBAL_NONSENSITIVE, 0, TLB_FLUSH_ALL);

llist_for_each_entry_safe(page, tmp, pages_to_free, async_free_node) {
- va = page_to_virt(page);
order = page->private;
- len = PAGE_SIZE * (1 << order);
-
- asi_flush_tlb_range(ASI_GLOBAL_NONSENSITIVE, va, len);
+ /* Need to clear, since it shares space with page->mapping. */
+ page->asi_tlb_gen = 0;
__free_the_page(page, order);
}
}
@@ -5291,6 +5312,11 @@ static bool asi_unmap_freed_pages(struct page *page, unsigned int order)
if (!async_flush_needed)
return true;

+ page->asi_mm_ctx_id = PageGlobalNonSensitive(page)
+ ? U64_MAX : asi->mm->context.ctx_id;
+
+ __asi_prepare_tlb_flush(asi, &page->asi_tlb_gen);
+
page->private = order;
llist_add(&page->async_free_node, this_cpu_ptr(&pages_to_free_async));

--
2.35.1.473.g83b2b277ed-goog

2022-02-23 13:44:44

by Junaid Shahid

[permalink] [raw]
Subject: [RFC PATCH 24/47] mm: asi: Support for local non-sensitive slab caches

A new flag SLAB_LOCAL_NONSENSITIVE is added to designate that a slab
cache can be used for local non-sensitive allocations. For such caches,
a per-process child cache will be created when a process tries to
make an allocation from that cache for the first time, similar to the
per-memcg child caches that used to exist before the object-based memcg
charging mechanism. (A lot of the infrastructure for handling these
child caches is derived from the original per-memcg cache code).

If a cache only has SLAB_LOCAL_NONSENSITIVE, then all allocations from
that cache will automatically be considered locally non-sensitive. But
if a cache has both SLAB_LOCAL_NONSENSITIVE and
SLAB_GLOBAL_NONSENSITIVE, then each allocation must specify one of
__GFP_LOCAL_NONSENSITIVE or __GFP_GLOBAL_NONSENSITIVE.

Note that the first locally non-sensitive allocation that a process
makes from a given slab cache must occur from a sleepable context. If
that cannot be guaranteed, then one of the new
kmem_cache_precreate_local* APIs must be called from a sleepable
context before the first allocation.
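
For illustration, a hypothetical cache (the name and object type below are
made up, not part of this patch) that supports both kinds of non-sensitive
allocations could be set up and used roughly as follows:

    struct foo {
            unsigned long a, b;
    };

    static struct kmem_cache *foo_cache;

    static int __init foo_cache_init(void)
    {
            foo_cache = kmem_cache_create("foo", sizeof(struct foo), 0,
                                          SLAB_LOCAL_NONSENSITIVE |
                                          SLAB_GLOBAL_NONSENSITIVE, NULL);
            return foo_cache ? 0 : -ENOMEM;
    }

    /*
     * Called by a process from a sleepable context before its first local
     * allocation, in case later allocations happen from atomic context.
     */
    static int foo_prepare_current_process(void)
    {
            return kmem_cache_precreate_local(foo_cache);
    }

    static struct foo *foo_alloc_local(void)
    {
            /*
             * The cache has both flags, so every allocation must pick one
             * of the two non-sensitive GFP flags.
             */
            return kmem_cache_alloc(foo_cache,
                                    GFP_KERNEL | __GFP_LOCAL_NONSENSITIVE);
    }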

Signed-off-by: Junaid Shahid <[email protected]>


---
arch/x86/mm/asi.c | 5 +
include/linux/mm_types.h | 4 +
include/linux/sched/mm.h | 12 ++
include/linux/slab.h | 38 +++-
include/linux/slab_def.h | 4 +
kernel/fork.c | 3 +-
mm/slab.c | 41 ++++-
mm/slab.h | 151 +++++++++++++++-
mm/slab_common.c | 363 ++++++++++++++++++++++++++++++++++++++-
9 files changed, 602 insertions(+), 19 deletions(-)

diff --git a/arch/x86/mm/asi.c b/arch/x86/mm/asi.c
index a3d96be76fa9..6b9a0f5ab391 100644
--- a/arch/x86/mm/asi.c
+++ b/arch/x86/mm/asi.c
@@ -4,6 +4,7 @@
#include <linux/memblock.h>
#include <linux/memcontrol.h>
#include <linux/moduleparam.h>
+#include <linux/slab.h>

#include <asm/asi.h>
#include <asm/pgalloc.h>
@@ -455,6 +456,8 @@ int asi_init_mm_state(struct mm_struct *mm)

memset(mm->asi, 0, sizeof(mm->asi));
mm->asi_enabled = false;
+ RCU_INIT_POINTER(mm->local_slab_caches, NULL);
+ mm->local_slab_caches_array_size = 0;

/*
* TODO: In addition to a cgroup flag, we may also want a per-process
@@ -482,6 +485,8 @@ void asi_free_mm_state(struct mm_struct *mm)
if (!boot_cpu_has(X86_FEATURE_ASI) || !mm->asi_enabled)
return;

+ free_local_slab_caches(mm);
+
asi_free_pgd_range(&mm->asi[0], pgd_index(ASI_LOCAL_MAP),
pgd_index(ASI_LOCAL_MAP +
PFN_PHYS(max_possible_pfn)) + 1);
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index e6980ae31323..56511adc263e 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -517,6 +517,10 @@ struct mm_struct {

struct asi asi[ASI_MAX_NUM];

+#ifdef CONFIG_ADDRESS_SPACE_ISOLATION
+ struct kmem_cache * __rcu *local_slab_caches;
+ uint local_slab_caches_array_size;
+#endif
/**
* @mm_users: The number of users including userspace.
*
diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index aca874d33fe6..c9122d4436d4 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -37,9 +37,21 @@ static inline void mmgrab(struct mm_struct *mm)
}

extern void __mmdrop(struct mm_struct *mm);
+extern void mmdrop_async(struct mm_struct *mm);

static inline void mmdrop(struct mm_struct *mm)
{
+#ifdef CONFIG_ADDRESS_SPACE_ISOLATION
+ /*
+ * We really only need to do this if we are in an atomic context.
+ * Unfortunately, there doesn't seem to be a reliable way to detect
+ * atomic context across all kernel configs. So we just always do async.
+ */
+ if (rcu_access_pointer(mm->local_slab_caches)) {
+ mmdrop_async(mm);
+ return;
+ }
+#endif
/*
* The implicit full barrier implied by atomic_dec_and_test() is
* required by the membarrier system call before returning to
diff --git a/include/linux/slab.h b/include/linux/slab.h
index 7b8a3853d827..ef9c73c0d874 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -93,6 +93,8 @@
/* Avoid kmemleak tracing */
#define SLAB_NOLEAKTRACE ((slab_flags_t __force)0x00800000U)

+/* 0x01000000U is used below for SLAB_LOCAL_NONSENSITIVE */
+
/* Fault injection mark */
#ifdef CONFIG_FAILSLAB
# define SLAB_FAILSLAB ((slab_flags_t __force)0x02000000U)
@@ -121,8 +123,10 @@
#define SLAB_DEACTIVATED ((slab_flags_t __force)0x10000000U)

#ifdef CONFIG_ADDRESS_SPACE_ISOLATION
+#define SLAB_LOCAL_NONSENSITIVE ((slab_flags_t __force)0x01000000U)
#define SLAB_GLOBAL_NONSENSITIVE ((slab_flags_t __force)0x20000000U)
#else
+#define SLAB_LOCAL_NONSENSITIVE 0
#define SLAB_GLOBAL_NONSENSITIVE 0
#endif

@@ -377,7 +381,8 @@ static __always_inline struct kmem_cache *get_kmalloc_cache(gfp_t flags,
{
#ifdef CONFIG_ADDRESS_SPACE_ISOLATION

- if (static_asi_enabled() && (flags & __GFP_GLOBAL_NONSENSITIVE))
+ if (static_asi_enabled() &&
+ (flags & (__GFP_GLOBAL_NONSENSITIVE | __GFP_LOCAL_NONSENSITIVE)))
return nonsensitive_kmalloc_caches[kmalloc_type(flags)][index];
#endif
return kmalloc_caches[kmalloc_type(flags)][index];
@@ -800,4 +805,35 @@ int slab_dead_cpu(unsigned int cpu);
#define slab_dead_cpu NULL
#endif

+#ifdef CONFIG_ADDRESS_SPACE_ISOLATION
+
+struct kmem_cache *get_local_kmem_cache(struct kmem_cache *s,
+ struct mm_struct *mm, gfp_t flags);
+void free_local_slab_caches(struct mm_struct *mm);
+int kmem_cache_precreate_local(struct kmem_cache *s);
+int kmem_cache_precreate_local_kmalloc(size_t size, gfp_t flags);
+
+#else
+
+static inline
+struct kmem_cache *get_local_kmem_cache(struct kmem_cache *s,
+ struct mm_struct *mm, gfp_t flags)
+{
+ return NULL;
+}
+
+static inline void free_local_slab_caches(struct mm_struct *mm) { }
+
+static inline int kmem_cache_precreate_local(struct kmem_cache *s)
+{
+ return 0;
+}
+
+static inline int kmem_cache_precreate_local_kmalloc(size_t size, gfp_t flags)
+{
+ return 0;
+}
+
+#endif
+
#endif /* _LINUX_SLAB_H */
diff --git a/include/linux/slab_def.h b/include/linux/slab_def.h
index 3aa5e1e73ab6..53cbc1f40031 100644
--- a/include/linux/slab_def.h
+++ b/include/linux/slab_def.h
@@ -81,6 +81,10 @@ struct kmem_cache {
unsigned int *random_seq;
#endif

+#ifdef CONFIG_ADDRESS_SPACE_ISOLATION
+ struct kmem_local_cache_info local_cache_info;
+#endif
+
unsigned int useroffset; /* Usercopy region offset */
unsigned int usersize; /* Usercopy region size */

diff --git a/kernel/fork.c b/kernel/fork.c
index 68b3aeab55ac..d7f55de00947 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -714,13 +714,14 @@ static void mmdrop_async_fn(struct work_struct *work)
__mmdrop(mm);
}

-static void mmdrop_async(struct mm_struct *mm)
+void mmdrop_async(struct mm_struct *mm)
{
if (unlikely(atomic_dec_and_test(&mm->mm_count))) {
INIT_WORK(&mm->async_put_work, mmdrop_async_fn);
schedule_work(&mm->async_put_work);
}
}
+EXPORT_SYMBOL(mmdrop_async);

static inline void free_signal_struct(struct signal_struct *sig)
{
diff --git a/mm/slab.c b/mm/slab.c
index 5a928d95d67b..44cf6d127a4c 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -1403,6 +1403,8 @@ static void kmem_freepages(struct kmem_cache *cachep, struct page *page)
/* In union with page->mapping where page allocator expects NULL */
page->slab_cache = NULL;

+ restore_page_nonsensitive_metadata(page, cachep);
+
if (current->reclaim_state)
current->reclaim_state->reclaimed_slab += 1 << order;
unaccount_slab_page(page, order, cachep);
@@ -2061,11 +2063,9 @@ int __kmem_cache_create(struct kmem_cache *cachep, slab_flags_t flags)
cachep->allocflags |= GFP_DMA32;
if (flags & SLAB_RECLAIM_ACCOUNT)
cachep->allocflags |= __GFP_RECLAIMABLE;
- if (flags & SLAB_GLOBAL_NONSENSITIVE)
- cachep->allocflags |= __GFP_GLOBAL_NONSENSITIVE;
cachep->size = size;
cachep->reciprocal_buffer_size = reciprocal_value(size);
-
+ set_nonsensitive_cache_params(cachep);
#if DEBUG
/*
* If we're going to use the generic kernel_map_pages()
@@ -3846,8 +3846,8 @@ static int setup_kmem_cache_nodes(struct kmem_cache *cachep, gfp_t gfp)
}

/* Always called with the slab_mutex held */
-static int do_tune_cpucache(struct kmem_cache *cachep, int limit,
- int batchcount, int shared, gfp_t gfp)
+static int __do_tune_cpucache(struct kmem_cache *cachep, int limit,
+ int batchcount, int shared, gfp_t gfp)
{
struct array_cache __percpu *cpu_cache, *prev;
int cpu;
@@ -3892,6 +3892,29 @@ static int do_tune_cpucache(struct kmem_cache *cachep, int limit,
return setup_kmem_cache_nodes(cachep, gfp);
}

+static int do_tune_cpucache(struct kmem_cache *cachep, int limit,
+ int batchcount, int shared, gfp_t gfp)
+{
+ int ret;
+ struct kmem_cache *c;
+
+ ret = __do_tune_cpucache(cachep, limit, batchcount, shared, gfp);
+
+ if (slab_state < FULL)
+ return ret;
+
+ if ((ret < 0) || !is_root_cache(cachep))
+ return ret;
+
+ lockdep_assert_held(&slab_mutex);
+ for_each_child_cache(c, cachep) {
+ /* return value determined by the root cache only */
+ __do_tune_cpucache(c, limit, batchcount, shared, gfp);
+ }
+
+ return ret;
+}
+
/* Called with slab_mutex held always */
static int enable_cpucache(struct kmem_cache *cachep, gfp_t gfp)
{
@@ -3904,6 +3927,14 @@ static int enable_cpucache(struct kmem_cache *cachep, gfp_t gfp)
if (err)
goto end;

+ if (!is_root_cache(cachep)) {
+ struct kmem_cache *root = get_root_cache(cachep);
+
+ limit = root->limit;
+ shared = root->shared;
+ batchcount = root->batchcount;
+ }
+
/*
* The head array serves three purposes:
* - create a LIFO ordering, i.e. return objects that are cache-warm
diff --git a/mm/slab.h b/mm/slab.h
index f190f4fc0286..b9e11038be27 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -5,6 +5,45 @@
* Internal slab definitions
*/

+#ifdef CONFIG_ADDRESS_SPACE_ISOLATION
+
+struct kmem_local_cache_info {
+ /* Valid for child caches. NULL for the root cache itself. */
+ struct kmem_cache *root_cache;
+ union {
+ /* For root caches */
+ struct {
+ int cache_id;
+ struct list_head __root_caches_node;
+ struct list_head children;
+ /*
+ * For SLAB_LOCAL_NONSENSITIVE root caches, this points
+ * to the cache to be used for local non-sensitive
+ * allocations from processes without ASI enabled.
+ *
+ * For root caches with only SLAB_LOCAL_NONSENSITIVE,
+ * the root cache itself is used as the sensitive cache.
+ *
+ * For root caches with both SLAB_LOCAL_NONSENSITIVE and
+ * SLAB_GLOBAL_NONSENSITIVE, the sensitive cache will be
+ * a child cache allocated on-demand.
+ *
+ * For non-sensitive kmalloc caches, the sensitive cache
+ * will just be the corresponding regular kmalloc cache.
+ */
+ struct kmem_cache *sensitive_cache;
+ };
+
+ /* For child (process-local) caches */
+ struct {
+ struct mm_struct *mm;
+ struct list_head children_node;
+ };
+ };
+};
+
+#endif
+
#ifdef CONFIG_SLOB
/*
* Common fields provided in kmem_cache by all slab allocators
@@ -128,8 +167,7 @@ static inline slab_flags_t kmem_cache_flags(unsigned int object_size,
}
#endif

-/* This will also include SLAB_LOCAL_NONSENSITIVE in a later patch. */
-#define SLAB_NONSENSITIVE SLAB_GLOBAL_NONSENSITIVE
+#define SLAB_NONSENSITIVE (SLAB_GLOBAL_NONSENSITIVE | SLAB_LOCAL_NONSENSITIVE)

/* Legal flag mask for kmem_cache_create(), for various configurations */
#define SLAB_CORE_FLAGS (SLAB_HWCACHE_ALIGN | SLAB_CACHE_DMA | \
@@ -251,6 +289,99 @@ static inline bool kmem_cache_debug_flags(struct kmem_cache *s, slab_flags_t fla
return false;
}

+#ifdef CONFIG_ADDRESS_SPACE_ISOLATION
+
+/* List of all root caches. */
+extern struct list_head slab_root_caches;
+#define root_caches_node local_cache_info.__root_caches_node
+
+/*
+ * Iterate over all child caches of the given root cache. The caller must hold
+ * slab_mutex.
+ */
+#define for_each_child_cache(iter, root) \
+ list_for_each_entry(iter, &(root)->local_cache_info.children, \
+ local_cache_info.children_node)
+
+static inline bool is_root_cache(struct kmem_cache *s)
+{
+ return !s->local_cache_info.root_cache;
+}
+
+static inline bool slab_equal_or_root(struct kmem_cache *s,
+ struct kmem_cache *p)
+{
+ return p == s || p == s->local_cache_info.root_cache;
+}
+
+/*
+ * We append a suffix to the name of each child cache because we can't have
+ * multiple caches in the system with the same name. But when we print them
+ * locally, it is better to refer to them by the base name.
+ */
+static inline const char *cache_name(struct kmem_cache *s)
+{
+ if (!is_root_cache(s))
+ s = s->local_cache_info.root_cache;
+ return s->name;
+}
+
+static inline struct kmem_cache *get_root_cache(struct kmem_cache *s)
+{
+ if (is_root_cache(s))
+ return s;
+ return s->local_cache_info.root_cache;
+}
+
+static inline
+void restore_page_nonsensitive_metadata(struct page *page,
+ struct kmem_cache *cachep)
+{
+ if (PageLocalNonSensitive(page)) {
+ VM_BUG_ON(is_root_cache(cachep));
+ page->asi_mm = cachep->local_cache_info.mm;
+ }
+}
+
+void set_nonsensitive_cache_params(struct kmem_cache *s);
+
+#else /* CONFIG_ADDRESS_SPACE_ISOLATION */
+
+#define slab_root_caches slab_caches
+#define root_caches_node list
+
+#define for_each_child_cache(iter, root) \
+ for ((void)(iter), (void)(root); 0; )
+
+static inline bool is_root_cache(struct kmem_cache *s)
+{
+ return true;
+}
+
+static inline bool slab_equal_or_root(struct kmem_cache *s,
+ struct kmem_cache *p)
+{
+ return s == p;
+}
+
+static inline const char *cache_name(struct kmem_cache *s)
+{
+ return s->name;
+}
+
+static inline struct kmem_cache *get_root_cache(struct kmem_cache *s)
+{
+ return s;
+}
+
+static inline void restore_page_nonsensitive_metadata(struct page *page,
+ struct kmem_cache *cachep)
+{ }
+
+static inline void set_nonsensitive_cache_params(struct kmem_cache *s) { }
+
+#endif /* CONFIG_ADDRESS_SPACE_ISOLATION */
+
#ifdef CONFIG_MEMCG_KMEM
int memcg_alloc_page_obj_cgroups(struct page *page, struct kmem_cache *s,
gfp_t gfp, bool new_page);
@@ -449,11 +580,12 @@ static inline struct kmem_cache *cache_from_obj(struct kmem_cache *s, void *x)
struct kmem_cache *cachep;

if (!IS_ENABLED(CONFIG_SLAB_FREELIST_HARDENED) &&
+ !(s->flags & SLAB_LOCAL_NONSENSITIVE) &&
!kmem_cache_debug_flags(s, SLAB_CONSISTENCY_CHECKS))
return s;

cachep = virt_to_cache(x);
- if (WARN(cachep && cachep != s,
+ if (WARN(cachep && !slab_equal_or_root(cachep, s),
"%s: Wrong slab cache. %s but object is from %s\n",
__func__, s->name, cachep->name))
print_tracking(cachep, x);
@@ -501,11 +633,24 @@ static inline struct kmem_cache *slab_pre_alloc_hook(struct kmem_cache *s,
if (static_asi_enabled()) {
VM_BUG_ON(!(s->flags & SLAB_GLOBAL_NONSENSITIVE) &&
(flags & __GFP_GLOBAL_NONSENSITIVE));
+ VM_BUG_ON(!(s->flags & SLAB_LOCAL_NONSENSITIVE) &&
+ (flags & __GFP_LOCAL_NONSENSITIVE));
+ VM_BUG_ON((s->flags & SLAB_NONSENSITIVE) == SLAB_NONSENSITIVE &&
+ !(flags & (__GFP_LOCAL_NONSENSITIVE |
+ __GFP_GLOBAL_NONSENSITIVE)));
}

if (should_failslab(s, flags))
return NULL;

+ if (static_asi_enabled() &&
+ (!(flags & __GFP_GLOBAL_NONSENSITIVE) &&
+ (s->flags & SLAB_LOCAL_NONSENSITIVE))) {
+ s = get_local_kmem_cache(s, current->mm, flags);
+ if (!s)
+ return NULL;
+ }
+
if (!memcg_slab_pre_alloc_hook(s, objcgp, size, flags))
return NULL;

diff --git a/mm/slab_common.c b/mm/slab_common.c
index 72dee2494bf8..b486b72d6344 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -42,6 +42,13 @@ static void slab_caches_to_rcu_destroy_workfn(struct work_struct *work);
static DECLARE_WORK(slab_caches_to_rcu_destroy_work,
slab_caches_to_rcu_destroy_workfn);

+#ifdef CONFIG_ADDRESS_SPACE_ISOLATION
+
+static DEFINE_IDA(nonsensitive_cache_ids);
+static uint max_num_local_slab_caches = 32;
+
+#endif
+
/*
* Set of flags that will prevent slab merging
*/
@@ -131,6 +138,69 @@ int __kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t nr,
return i;
}

+#ifdef CONFIG_ADDRESS_SPACE_ISOLATION
+
+LIST_HEAD(slab_root_caches);
+
+static void init_local_cache_info(struct kmem_cache *s, struct kmem_cache *root)
+{
+ if (root) {
+ s->local_cache_info.root_cache = root;
+ list_add(&s->local_cache_info.children_node,
+ &root->local_cache_info.children);
+ } else {
+ s->local_cache_info.cache_id = -1;
+ INIT_LIST_HEAD(&s->local_cache_info.children);
+ list_add(&s->root_caches_node, &slab_root_caches);
+ }
+}
+
+static void cleanup_local_cache_info(struct kmem_cache *s)
+{
+ if (is_root_cache(s)) {
+ VM_BUG_ON(!list_empty(&s->local_cache_info.children));
+
+ list_del(&s->root_caches_node);
+ if (s->local_cache_info.cache_id >= 0)
+ ida_free(&nonsensitive_cache_ids,
+ s->local_cache_info.cache_id);
+ } else {
+ struct mm_struct *mm = s->local_cache_info.mm;
+ struct kmem_cache *root_cache = s->local_cache_info.root_cache;
+ int id = root_cache->local_cache_info.cache_id;
+
+ list_del(&s->local_cache_info.children_node);
+ if (mm) {
+ struct kmem_cache **local_caches =
+ rcu_dereference_protected(mm->local_slab_caches,
+ lockdep_is_held(&slab_mutex));
+ local_caches[id] = NULL;
+ }
+ }
+}
+
+void set_nonsensitive_cache_params(struct kmem_cache *s)
+{
+ if (s->flags & SLAB_GLOBAL_NONSENSITIVE) {
+ s->allocflags |= __GFP_GLOBAL_NONSENSITIVE;
+ VM_BUG_ON(!is_root_cache(s));
+ } else if (s->flags & SLAB_LOCAL_NONSENSITIVE) {
+ if (is_root_cache(s))
+ s->local_cache_info.sensitive_cache = s;
+ else
+ s->allocflags |= __GFP_LOCAL_NONSENSITIVE;
+ }
+}
+
+#else
+
+static inline
+void init_local_cache_info(struct kmem_cache *s, struct kmem_cache *root) { }
+
+static inline void cleanup_local_cache_info(struct kmem_cache *s) { }
+
+#endif /* CONFIG_ADDRESS_SPACE_ISOLATION */
+
/*
* Figure out what the alignment of the objects will be given a set of
* flags, a user specified alignment and the size of the objects.
@@ -168,6 +238,9 @@ int slab_unmergeable(struct kmem_cache *s)
if (slab_nomerge || (s->flags & SLAB_NEVER_MERGE))
return 1;

+ if (!is_root_cache(s))
+ return 1;
+
if (s->ctor)
return 1;

@@ -202,7 +275,7 @@ struct kmem_cache *find_mergeable(unsigned int size, unsigned int align,
if (flags & SLAB_NEVER_MERGE)
return NULL;

- list_for_each_entry_reverse(s, &slab_caches, list) {
+ list_for_each_entry_reverse(s, &slab_root_caches, root_caches_node) {
if (slab_unmergeable(s))
continue;

@@ -254,6 +327,8 @@ static struct kmem_cache *create_cache(const char *name,
s->useroffset = useroffset;
s->usersize = usersize;

+ init_local_cache_info(s, root_cache);
+
err = __kmem_cache_create(s, flags);
if (err)
goto out_free_cache;
@@ -266,6 +341,7 @@ static struct kmem_cache *create_cache(const char *name,
return s;

out_free_cache:
+ cleanup_local_cache_info(s);
kmem_cache_free(kmem_cache, s);
goto out;
}
@@ -459,6 +535,7 @@ static int shutdown_cache(struct kmem_cache *s)
return -EBUSY;

list_del(&s->list);
+ cleanup_local_cache_info(s);

if (s->flags & SLAB_TYPESAFE_BY_RCU) {
#ifdef SLAB_SUPPORTS_SYSFS
@@ -480,6 +557,36 @@ static int shutdown_cache(struct kmem_cache *s)
return 0;
}

+#ifdef CONFIG_ADDRESS_SPACE_ISOLATION
+
+static int shutdown_child_caches(struct kmem_cache *s)
+{
+ struct kmem_cache *c, *c2;
+ int r;
+
+ VM_BUG_ON(!is_root_cache(s));
+
+ lockdep_assert_held(&slab_mutex);
+
+ list_for_each_entry_safe(c, c2, &s->local_cache_info.children,
+ local_cache_info.children_node) {
+ r = shutdown_cache(c);
+ if (r)
+ return r;
+ }
+
+ return 0;
+}
+
+#else
+
+static inline int shutdown_child_caches(struct kmem_cache *s)
+{
+ return 0;
+}
+
+#endif /* CONFIG_ADDRESS_SPACE_ISOLATION */
+
void slab_kmem_cache_release(struct kmem_cache *s)
{
__kmem_cache_release(s);
@@ -501,7 +608,10 @@ void kmem_cache_destroy(struct kmem_cache *s)
if (s->refcount)
goto out_unlock;

- err = shutdown_cache(s);
+ err = shutdown_child_caches(s);
+ if (!err)
+ err = shutdown_cache(s);
+
if (err) {
pr_err("%s %s: Slab cache still has objects\n",
__func__, s->name);
@@ -651,6 +761,8 @@ void __init create_boot_cache(struct kmem_cache *s, const char *name,
s->useroffset = useroffset;
s->usersize = usersize;

+ init_local_cache_info(s, NULL);
+
err = __kmem_cache_create(s, flags);

if (err)
@@ -897,6 +1009,13 @@ new_kmalloc_cache(int idx, enum kmalloc_cache_type type, slab_flags_t flags)
*/
if (IS_ENABLED(CONFIG_MEMCG_KMEM) && (type == KMALLOC_NORMAL))
caches[type][idx]->refcount = -1;
+
+#ifdef CONFIG_ADDRESS_SPACE_ISOLATION
+
+ if (flags & SLAB_NONSENSITIVE)
+ caches[type][idx]->local_cache_info.sensitive_cache =
+ kmalloc_caches[type][idx];
+#endif
}

/*
@@ -1086,12 +1205,12 @@ static void print_slabinfo_header(struct seq_file *m)
void *slab_start(struct seq_file *m, loff_t *pos)
{
mutex_lock(&slab_mutex);
- return seq_list_start(&slab_caches, *pos);
+ return seq_list_start(&slab_root_caches, *pos);
}

void *slab_next(struct seq_file *m, void *p, loff_t *pos)
{
- return seq_list_next(p, &slab_caches, pos);
+ return seq_list_next(p, &slab_root_caches, pos);
}

void slab_stop(struct seq_file *m, void *p)
@@ -1099,6 +1218,24 @@ void slab_stop(struct seq_file *m, void *p)
mutex_unlock(&slab_mutex);
}

+static void
+accumulate_children_slabinfo(struct kmem_cache *s, struct slabinfo *info)
+{
+ struct kmem_cache *c;
+ struct slabinfo sinfo;
+
+ for_each_child_cache(c, s) {
+ memset(&sinfo, 0, sizeof(sinfo));
+ get_slabinfo(c, &sinfo);
+
+ info->active_slabs += sinfo.active_slabs;
+ info->num_slabs += sinfo.num_slabs;
+ info->shared_avail += sinfo.shared_avail;
+ info->active_objs += sinfo.active_objs;
+ info->num_objs += sinfo.num_objs;
+ }
+}
+
static void cache_show(struct kmem_cache *s, struct seq_file *m)
{
struct slabinfo sinfo;
@@ -1106,8 +1243,10 @@ static void cache_show(struct kmem_cache *s, struct seq_file *m)
memset(&sinfo, 0, sizeof(sinfo));
get_slabinfo(s, &sinfo);

+ accumulate_children_slabinfo(s, &sinfo);
+
seq_printf(m, "%-17s %6lu %6lu %6u %4u %4d",
- s->name, sinfo.active_objs, sinfo.num_objs, s->size,
+ cache_name(s), sinfo.active_objs, sinfo.num_objs, s->size,
sinfo.objects_per_slab, (1 << sinfo.cache_order));

seq_printf(m, " : tunables %4u %4u %4u",
@@ -1120,9 +1259,9 @@ static void cache_show(struct kmem_cache *s, struct seq_file *m)

static int slab_show(struct seq_file *m, void *p)
{
- struct kmem_cache *s = list_entry(p, struct kmem_cache, list);
+ struct kmem_cache *s = list_entry(p, struct kmem_cache, root_caches_node);

- if (p == slab_caches.next)
+ if (p == slab_root_caches.next)
print_slabinfo_header(m);
cache_show(s, m);
return 0;
@@ -1148,14 +1287,14 @@ void dump_unreclaimable_slab(void)
pr_info("Unreclaimable slab info:\n");
pr_info("Name Used Total\n");

- list_for_each_entry(s, &slab_caches, list) {
+ list_for_each_entry(s, &slab_root_caches, root_caches_node) {
if (s->flags & SLAB_RECLAIM_ACCOUNT)
continue;

get_slabinfo(s, &sinfo);

if (sinfo.num_objs > 0)
- pr_info("%-17s %10luKB %10luKB\n", s->name,
+ pr_info("%-17s %10luKB %10luKB\n", cache_name(s),
(sinfo.active_objs * s->size) / 1024,
(sinfo.num_objs * s->size) / 1024);
}
@@ -1361,3 +1500,209 @@ int should_failslab(struct kmem_cache *s, gfp_t gfpflags)
return 0;
}
ALLOW_ERROR_INJECTION(should_failslab, ERRNO);
+
+#ifdef CONFIG_ADDRESS_SPACE_ISOLATION
+
+static int resize_local_slab_caches_array(struct mm_struct *mm, gfp_t flags)
+{
+ struct kmem_cache **new_array;
+ struct kmem_cache **old_array =
+ rcu_dereference_protected(mm->local_slab_caches,
+ lockdep_is_held(&slab_mutex));
+
+ new_array = kcalloc(max_num_local_slab_caches,
+ sizeof(struct kmem_cache *), flags);
+ if (!new_array)
+ return -ENOMEM;
+
+ if (old_array)
+ memcpy(new_array, old_array, mm->local_slab_caches_array_size *
+ sizeof(struct kmem_cache *));
+
+ rcu_assign_pointer(mm->local_slab_caches, new_array);
+ smp_store_release(&mm->local_slab_caches_array_size,
+ max_num_local_slab_caches);
+
+ if (old_array) {
+ synchronize_rcu();
+ kfree(old_array);
+ }
+
+ return 0;
+}
+
+static int get_or_alloc_cache_id(struct kmem_cache *root_cache, gfp_t flags)
+{
+ int id = root_cache->local_cache_info.cache_id;
+
+ if (id >= 0)
+ return id;
+
+ id = ida_alloc_max(&nonsensitive_cache_ids,
+ max_num_local_slab_caches - 1, flags);
+ if (id == -ENOSPC) {
+ max_num_local_slab_caches *= 2;
+ id = ida_alloc_max(&nonsensitive_cache_ids,
+ max_num_local_slab_caches - 1, flags);
+ }
+
+ if (id >= 0)
+ root_cache->local_cache_info.cache_id = id;
+
+ return id;
+}
+
+static struct kmem_cache *create_local_kmem_cache(struct kmem_cache *root_cache,
+ struct mm_struct *mm,
+ gfp_t flags)
+{
+ char *name;
+ struct kmem_cache *s = NULL;
+ slab_flags_t slab_flags = root_cache->flags & CACHE_CREATE_MASK;
+ struct kmem_cache **cache_ptr;
+
+ flags &= GFP_RECLAIM_MASK;
+
+ mutex_lock(&slab_mutex);
+
+ if (mm_asi_enabled(mm)) {
+ struct kmem_cache **caches;
+ int id = get_or_alloc_cache_id(root_cache, flags);
+
+ if (id < 0)
+ goto out;
+
+ flags |= __GFP_ACCOUNT;
+
+ if (mm->local_slab_caches_array_size <= id &&
+ resize_local_slab_caches_array(mm, flags) < 0)
+ goto out;
+
+ caches = rcu_dereference_protected(mm->local_slab_caches,
+ lockdep_is_held(&slab_mutex));
+ cache_ptr = &caches[id];
+ if (*cache_ptr) {
+ s = *cache_ptr;
+ goto out;
+ }
+
+ slab_flags &= ~SLAB_GLOBAL_NONSENSITIVE;
+ name = kasprintf(flags, "%s(%d:%s)", root_cache->name,
+ task_pid_nr(mm->owner), mm->owner->comm);
+ if (!name)
+ goto out;
+
+ } else {
+ cache_ptr = &root_cache->local_cache_info.sensitive_cache;
+ if (*cache_ptr) {
+ s = *cache_ptr;
+ goto out;
+ }
+
+ slab_flags &= ~SLAB_NONSENSITIVE;
+ name = kasprintf(flags, "%s(sensitive)", root_cache->name);
+ if (!name)
+ goto out;
+ }
+
+ s = create_cache(name,
+ root_cache->object_size,
+ root_cache->align,
+ slab_flags,
+ root_cache->useroffset, root_cache->usersize,
+ root_cache->ctor, root_cache);
+ if (IS_ERR(s)) {
+ pr_info("Unable to create child kmem cache %s. Err %ld\n",
+ name, PTR_ERR(s));
+ kfree(name);
+ s = NULL;
+ goto out;
+ }
+
+ if (mm_asi_enabled(mm))
+ s->local_cache_info.mm = mm;
+
+ smp_store_release(cache_ptr, s);
+out:
+ mutex_unlock(&slab_mutex);
+
+ return s;
+}
+
+struct kmem_cache *get_local_kmem_cache(struct kmem_cache *s,
+ struct mm_struct *mm, gfp_t flags)
+{
+ struct kmem_cache *local_cache = NULL;
+
+ if (!(s->flags & SLAB_LOCAL_NONSENSITIVE) || !is_root_cache(s))
+ return s;
+
+ if (mm_asi_enabled(mm)) {
+ struct kmem_cache **caches;
+ int id = READ_ONCE(s->local_cache_info.cache_id);
+ uint array_size = smp_load_acquire(
+ &mm->local_slab_caches_array_size);
+
+ if (id >= 0 && array_size > id) {
+ rcu_read_lock();
+ caches = rcu_dereference(mm->local_slab_caches);
+ local_cache = smp_load_acquire(&caches[id]);
+ rcu_read_unlock();
+ }
+ } else {
+ local_cache =
+ smp_load_acquire(&s->local_cache_info.sensitive_cache);
+ }
+
+ if (!local_cache)
+ local_cache = create_local_kmem_cache(s, mm, flags);
+
+ return local_cache;
+}
+
+void free_local_slab_caches(struct mm_struct *mm)
+{
+ uint i;
+ struct kmem_cache **caches =
+ rcu_dereference_protected(mm->local_slab_caches,
+ atomic_read(&mm->mm_count) == 0);
+
+ if (!caches)
+ return;
+
+ cpus_read_lock();
+ mutex_lock(&slab_mutex);
+
+ for (i = 0; i < mm->local_slab_caches_array_size; i++)
+ if (caches[i])
+ WARN_ON(shutdown_cache(caches[i]));
+
+ mutex_unlock(&slab_mutex);
+ cpus_read_unlock();
+
+ kfree(caches);
+}
+
+int kmem_cache_precreate_local(struct kmem_cache *s)
+{
+ VM_BUG_ON(!is_root_cache(s));
+ VM_BUG_ON(!in_task());
+ might_sleep();
+
+ return get_local_kmem_cache(s, current->mm, GFP_KERNEL) ? 0 : -ENOMEM;
+}
+EXPORT_SYMBOL(kmem_cache_precreate_local);
+
+int kmem_cache_precreate_local_kmalloc(size_t size, gfp_t flags)
+{
+ struct kmem_cache *s = kmalloc_slab(size,
+ flags | __GFP_LOCAL_NONSENSITIVE);
+
+ if (ZERO_OR_NULL_PTR(s))
+ return 0;
+
+ return kmem_cache_precreate_local(s);
+}
+EXPORT_SYMBOL(kmem_cache_precreate_local_kmalloc);
+
+#endif
--
2.35.1.473.g83b2b277ed-goog

2022-02-23 13:45:46

by Junaid Shahid

[permalink] [raw]
Subject: [RFC PATCH 44/47] kvm: asi: Splitting kvm_vcpu_arch into non/sensitive parts

From: Ofir Weisse <[email protected]>

The part that was allocated via ASI LOCAL SENSITIVE is in
`struct kvm_vcpu_arch_private`. The rest is in `struct kvm_vcpu_arch`.
The latter contains a pointer, `private`, which is allocated to be ASI
non-sensitive from a cache.
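
The mechanical effect on the rest of the KVM code is a change in the
access path for the moved fields, e.g. (illustrative before/after):

    /* Before the split: */
    vcpu->arch.regs[VCPU_REGS_RAX] = val;
    ctxt = vcpu->arch.emulate_ctxt;

    /* After the split: */
    vcpu->arch.private->regs[VCPU_REGS_RAX] = val;
    ctxt = vcpu->arch.private->emulate_ctxt;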

Signed-off-by: Ofir Weisse <[email protected]>


---
arch/x86/include/asm/kvm_host.h | 109 ++++++++++++----------
arch/x86/kvm/cpuid.c | 14 +--
arch/x86/kvm/kvm_cache_regs.h | 22 ++---
arch/x86/kvm/mmu.h | 10 +-
arch/x86/kvm/mmu/mmu.c | 138 +++++++++++++--------------
arch/x86/kvm/mmu/mmu_internal.h | 2 +-
arch/x86/kvm/mmu/paging_tmpl.h | 26 +++---
arch/x86/kvm/mmu/spte.c | 4 +-
arch/x86/kvm/mmu/tdp_mmu.c | 14 +--
arch/x86/kvm/svm/nested.c | 34 +++----
arch/x86/kvm/svm/sev.c | 70 +++++++-------
arch/x86/kvm/svm/svm.c | 52 +++++------
arch/x86/kvm/trace.h | 10 +-
arch/x86/kvm/vmx/nested.c | 68 +++++++-------
arch/x86/kvm/vmx/vmx.c | 64 ++++++-------
arch/x86/kvm/x86.c | 160 ++++++++++++++++----------------
arch/x86/kvm/x86.h | 2 +-
virt/kvm/kvm_main.c | 38 ++++++--
18 files changed, 436 insertions(+), 401 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 34a05add5e77..d7315f86f85c 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -606,14 +606,12 @@ struct kvm_vcpu_xen {
u64 runstate_times[4];
};

-struct kvm_vcpu_arch {
- /*
+struct kvm_vcpu_arch_private {
+ /*
* rip and regs accesses must go through
* kvm_{register,rip}_{read,write} functions.
*/
unsigned long regs[NR_VCPU_REGS];
- u32 regs_avail;
- u32 regs_dirty;

unsigned long cr0;
unsigned long cr0_guest_owned_bits;
@@ -623,6 +621,63 @@ struct kvm_vcpu_arch {
unsigned long cr4_guest_owned_bits;
unsigned long cr4_guest_rsvd_bits;
unsigned long cr8;
+
+ /*
+ * QEMU userspace and the guest each have their own FPU state.
+ * In vcpu_run, we switch between the user and guest FPU contexts.
+ * While running a VCPU, the VCPU thread will have the guest FPU
+ * context.
+ *
+ * Note that while the PKRU state lives inside the fpu registers,
+ * it is switched out separately at VMENTER and VMEXIT time. The
+ * "guest_fpstate" state here contains the guest FPU context, with the
+ * host PRKU bits.
+ */
+ struct fpu_guest guest_fpu;
+
+ u64 xcr0;
+ u64 guest_supported_xcr0;
+
+ /*
+ * Paging state of the vcpu
+ *
+ * If the vcpu runs in guest mode with two level paging this still saves
+ * the paging mode of the l1 guest. This context is always used to
+ * handle faults.
+ */
+ struct kvm_mmu *mmu;
+
+ /* Non-nested MMU for L1 */
+ struct kvm_mmu root_mmu;
+
+ /* L1 MMU when running nested */
+ struct kvm_mmu guest_mmu;
+
+ /*
+ * Pointer to the mmu context currently used for
+ * gva_to_gpa translations.
+ */
+ struct kvm_mmu *walk_mmu;
+
+ /*
+ * Paging state of an L2 guest (used for nested npt)
+ *
+ * This context will save all necessary information to walk page tables
+ * of an L2 guest. This context is only initialized for page table
+ * walking and not for faulting since we never handle l2 page faults on
+ * the host.
+ */
+ struct kvm_mmu nested_mmu;
+
+ struct x86_emulate_ctxt *emulate_ctxt;
+};
+
+struct kvm_vcpu_arch {
+ struct kvm_vcpu_arch_private *private;
+
+ u32 regs_avail;
+ u32 regs_dirty;
+
u32 host_pkru;
u32 pkru;
u32 hflags;
@@ -645,36 +700,6 @@ struct kvm_vcpu_arch {
u64 arch_capabilities;
u64 perf_capabilities;

- /*
- * Paging state of the vcpu
- *
- * If the vcpu runs in guest mode with two level paging this still saves
- * the paging mode of the l1 guest. This context is always used to
- * handle faults.
- */
- struct kvm_mmu *mmu;
-
- /* Non-nested MMU for L1 */
- struct kvm_mmu root_mmu;
-
- /* L1 MMU when running nested */
- struct kvm_mmu guest_mmu;
-
- /*
- * Paging state of an L2 guest (used for nested npt)
- *
- * This context will save all necessary information to walk page tables
- * of an L2 guest. This context is only initialized for page table
- * walking and not for faulting since we never handle l2 page faults on
- * the host.
- */
- struct kvm_mmu nested_mmu;
-
- /*
- * Pointer to the mmu context currently used for
- * gva_to_gpa translations.
- */
- struct kvm_mmu *walk_mmu;

struct kvm_mmu_memory_cache mmu_pte_list_desc_cache;
struct kvm_mmu_memory_cache mmu_shadow_page_cache;
@@ -683,21 +708,6 @@ struct kvm_vcpu_arch {

struct asi_pgtbl_pool asi_pgtbl_pool;

- /*
- * QEMU userspace and the guest each have their own FPU state.
- * In vcpu_run, we switch between the user and guest FPU contexts.
- * While running a VCPU, the VCPU thread will have the guest FPU
- * context.
- *
- * Note that while the PKRU state lives inside the fpu registers,
- * it is switched out separately at VMENTER and VMEXIT time. The
- * "guest_fpstate" state here contains the guest FPU context, with the
- * host PRKU bits.
- */
- struct fpu_guest guest_fpu;
-
- u64 xcr0;
- u64 guest_supported_xcr0;

struct kvm_pio_request pio;
void *pio_data;
@@ -734,7 +744,6 @@ struct kvm_vcpu_arch {

/* emulate context */

- struct x86_emulate_ctxt *emulate_ctxt;
bool emulate_regs_need_sync_to_vcpu;
bool emulate_regs_need_sync_from_vcpu;
int (*complete_userspace_io)(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index dedabfdd292e..7192cbe06ba3 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -169,12 +169,12 @@ void kvm_update_cpuid_runtime(struct kvm_vcpu *vcpu)

best = kvm_find_cpuid_entry(vcpu, 0xD, 0);
if (best)
- best->ebx = xstate_required_size(vcpu->arch.xcr0, false);
+ best->ebx = xstate_required_size(vcpu->arch.private->xcr0, false);

best = kvm_find_cpuid_entry(vcpu, 0xD, 1);
if (best && (cpuid_entry_has(best, X86_FEATURE_XSAVES) ||
cpuid_entry_has(best, X86_FEATURE_XSAVEC)))
- best->ebx = xstate_required_size(vcpu->arch.xcr0, true);
+ best->ebx = xstate_required_size(vcpu->arch.private->xcr0, true);

best = kvm_find_kvm_cpuid_features(vcpu);
if (kvm_hlt_in_guest(vcpu->kvm) && best &&
@@ -208,9 +208,9 @@ static void kvm_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)

best = kvm_find_cpuid_entry(vcpu, 0xD, 0);
if (!best)
- vcpu->arch.guest_supported_xcr0 = 0;
+ vcpu->arch.private->guest_supported_xcr0 = 0;
else
- vcpu->arch.guest_supported_xcr0 =
+ vcpu->arch.private->guest_supported_xcr0 =
(best->eax | ((u64)best->edx << 32)) & supported_xcr0;

/*
@@ -223,8 +223,8 @@ static void kvm_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
*/
best = kvm_find_cpuid_entry(vcpu, 0x12, 0x1);
if (best) {
- best->ecx &= vcpu->arch.guest_supported_xcr0 & 0xffffffff;
- best->edx &= vcpu->arch.guest_supported_xcr0 >> 32;
+ best->ecx &= vcpu->arch.private->guest_supported_xcr0 & 0xffffffff;
+ best->edx &= vcpu->arch.private->guest_supported_xcr0 >> 32;
best->ecx |= XFEATURE_MASK_FPSSE;
}

@@ -234,7 +234,7 @@ static void kvm_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
vcpu->arch.reserved_gpa_bits = kvm_vcpu_reserved_gpa_bits_raw(vcpu);

kvm_pmu_refresh(vcpu);
- vcpu->arch.cr4_guest_rsvd_bits =
+ vcpu->arch.private->cr4_guest_rsvd_bits =
__cr4_reserved_bits(guest_cpuid_has, vcpu);

kvm_hv_set_cpuid(vcpu);
diff --git a/arch/x86/kvm/kvm_cache_regs.h b/arch/x86/kvm/kvm_cache_regs.h
index 90e1ffdc05b7..592780402160 100644
--- a/arch/x86/kvm/kvm_cache_regs.h
+++ b/arch/x86/kvm/kvm_cache_regs.h
@@ -12,12 +12,12 @@
#define BUILD_KVM_GPR_ACCESSORS(lname, uname) \
static __always_inline unsigned long kvm_##lname##_read(struct kvm_vcpu *vcpu)\
{ \
- return vcpu->arch.regs[VCPU_REGS_##uname]; \
+ return vcpu->arch.private->regs[VCPU_REGS_##uname]; \
} \
static __always_inline void kvm_##lname##_write(struct kvm_vcpu *vcpu, \
unsigned long val) \
{ \
- vcpu->arch.regs[VCPU_REGS_##uname] = val; \
+ vcpu->arch.private->regs[VCPU_REGS_##uname] = val; \
}
BUILD_KVM_GPR_ACCESSORS(rax, RAX)
BUILD_KVM_GPR_ACCESSORS(rbx, RBX)
@@ -82,7 +82,7 @@ static inline unsigned long kvm_register_read_raw(struct kvm_vcpu *vcpu, int reg
if (!kvm_register_is_available(vcpu, reg))
static_call(kvm_x86_cache_reg)(vcpu, reg);

- return vcpu->arch.regs[reg];
+ return vcpu->arch.private->regs[reg];
}

static inline void kvm_register_write_raw(struct kvm_vcpu *vcpu, int reg,
@@ -91,7 +91,7 @@ static inline void kvm_register_write_raw(struct kvm_vcpu *vcpu, int reg,
if (WARN_ON_ONCE((unsigned int)reg >= NR_VCPU_REGS))
return;

- vcpu->arch.regs[reg] = val;
+ vcpu->arch.private->regs[reg] = val;
kvm_register_mark_dirty(vcpu, reg);
}

@@ -122,21 +122,21 @@ static inline u64 kvm_pdptr_read(struct kvm_vcpu *vcpu, int index)
if (!kvm_register_is_available(vcpu, VCPU_EXREG_PDPTR))
static_call(kvm_x86_cache_reg)(vcpu, VCPU_EXREG_PDPTR);

- return vcpu->arch.walk_mmu->pdptrs[index];
+ return vcpu->arch.private->walk_mmu->pdptrs[index];
}

static inline void kvm_pdptr_write(struct kvm_vcpu *vcpu, int index, u64 value)
{
- vcpu->arch.walk_mmu->pdptrs[index] = value;
+ vcpu->arch.private->walk_mmu->pdptrs[index] = value;
}

static inline ulong kvm_read_cr0_bits(struct kvm_vcpu *vcpu, ulong mask)
{
ulong tmask = mask & KVM_POSSIBLE_CR0_GUEST_BITS;
- if ((tmask & vcpu->arch.cr0_guest_owned_bits) &&
+ if ((tmask & vcpu->arch.private->cr0_guest_owned_bits) &&
!kvm_register_is_available(vcpu, VCPU_EXREG_CR0))
static_call(kvm_x86_cache_reg)(vcpu, VCPU_EXREG_CR0);
- return vcpu->arch.cr0 & mask;
+ return vcpu->arch.private->cr0 & mask;
}

static inline ulong kvm_read_cr0(struct kvm_vcpu *vcpu)
@@ -147,17 +147,17 @@ static inline ulong kvm_read_cr0(struct kvm_vcpu *vcpu)
static inline ulong kvm_read_cr4_bits(struct kvm_vcpu *vcpu, ulong mask)
{
ulong tmask = mask & KVM_POSSIBLE_CR4_GUEST_BITS;
- if ((tmask & vcpu->arch.cr4_guest_owned_bits) &&
+ if ((tmask & vcpu->arch.private->cr4_guest_owned_bits) &&
!kvm_register_is_available(vcpu, VCPU_EXREG_CR4))
static_call(kvm_x86_cache_reg)(vcpu, VCPU_EXREG_CR4);
- return vcpu->arch.cr4 & mask;
+ return vcpu->arch.private->cr4 & mask;
}

static inline ulong kvm_read_cr3(struct kvm_vcpu *vcpu)
{
if (!kvm_register_is_available(vcpu, VCPU_EXREG_CR3))
static_call(kvm_x86_cache_reg)(vcpu, VCPU_EXREG_CR3);
- return vcpu->arch.cr3;
+ return vcpu->arch.private->cr3;
}

static inline ulong kvm_read_cr4(struct kvm_vcpu *vcpu)
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 60b84331007d..aea21355580d 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -89,7 +89,7 @@ void kvm_mmu_sync_prev_roots(struct kvm_vcpu *vcpu);

static inline int kvm_mmu_reload(struct kvm_vcpu *vcpu)
{
- if (likely(vcpu->arch.mmu->root_hpa != INVALID_PAGE))
+ if (likely(vcpu->arch.private->mmu->root_hpa != INVALID_PAGE))
return 0;

return kvm_mmu_load(vcpu);
@@ -111,13 +111,13 @@ static inline unsigned long kvm_get_active_pcid(struct kvm_vcpu *vcpu)

static inline void kvm_mmu_load_pgd(struct kvm_vcpu *vcpu)
{
- u64 root_hpa = vcpu->arch.mmu->root_hpa;
+ u64 root_hpa = vcpu->arch.private->mmu->root_hpa;

if (!VALID_PAGE(root_hpa))
return;

static_call(kvm_x86_load_mmu_pgd)(vcpu, root_hpa,
- vcpu->arch.mmu->shadow_root_level);
+ vcpu->arch.private->mmu->shadow_root_level);
}

struct kvm_page_fault {
@@ -193,7 +193,7 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
.rsvd = err & PFERR_RSVD_MASK,
.user = err & PFERR_USER_MASK,
.prefetch = prefetch,
- .is_tdp = likely(vcpu->arch.mmu->page_fault == kvm_tdp_page_fault),
+ .is_tdp = likely(vcpu->arch.private->mmu->page_fault == kvm_tdp_page_fault),
.nx_huge_page_workaround_enabled = is_nx_huge_page_enabled(),

.max_level = KVM_MAX_HUGEPAGE_LEVEL,
@@ -204,7 +204,7 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
if (fault.is_tdp)
return kvm_tdp_page_fault(vcpu, &fault);
#endif
- return vcpu->arch.mmu->page_fault(vcpu, &fault);
+ return vcpu->arch.private->mmu->page_fault(vcpu, &fault);
}

/*
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index a2ada1104c2d..e36171f69b8e 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -704,7 +704,7 @@ static bool mmu_spte_age(u64 *sptep)

static void walk_shadow_page_lockless_begin(struct kvm_vcpu *vcpu)
{
- if (is_tdp_mmu(vcpu->arch.mmu)) {
+ if (is_tdp_mmu(vcpu->arch.private->mmu)) {
kvm_tdp_mmu_walk_lockless_begin();
} else {
/*
@@ -723,7 +723,7 @@ static void walk_shadow_page_lockless_begin(struct kvm_vcpu *vcpu)

static void walk_shadow_page_lockless_end(struct kvm_vcpu *vcpu)
{
- if (is_tdp_mmu(vcpu->arch.mmu)) {
+ if (is_tdp_mmu(vcpu->arch.private->mmu)) {
kvm_tdp_mmu_walk_lockless_end();
} else {
/*
@@ -1909,7 +1909,7 @@ static void kvm_mmu_commit_zap_page(struct kvm *kvm,
static bool kvm_sync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
struct list_head *invalid_list)
{
- int ret = vcpu->arch.mmu->sync_page(vcpu, sp);
+ int ret = vcpu->arch.private->mmu->sync_page(vcpu, sp);

if (ret < 0) {
kvm_mmu_prepare_zap_page(vcpu->kvm, sp, invalid_list);
@@ -2081,7 +2081,7 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
int direct,
unsigned int access)
{
- bool direct_mmu = vcpu->arch.mmu->direct_map;
+ bool direct_mmu = vcpu->arch.private->mmu->direct_map;
union kvm_mmu_page_role role;
struct hlist_head *sp_list;
unsigned quadrant;
@@ -2089,13 +2089,13 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
int collisions = 0;
LIST_HEAD(invalid_list);

- role = vcpu->arch.mmu->mmu_role.base;
+ role = vcpu->arch.private->mmu->mmu_role.base;
role.level = level;
role.direct = direct;
if (role.direct)
role.gpte_is_8_bytes = true;
role.access = access;
- if (!direct_mmu && vcpu->arch.mmu->root_level <= PT32_ROOT_LEVEL) {
+ if (!direct_mmu && vcpu->arch.private->mmu->root_level <= PT32_ROOT_LEVEL) {
quadrant = gaddr >> (PAGE_SHIFT + (PT64_PT_BITS * level));
quadrant &= (1 << ((PT32_PT_BITS - PT64_PT_BITS) * level)) - 1;
role.quadrant = quadrant;
@@ -2181,11 +2181,11 @@ static void shadow_walk_init_using_root(struct kvm_shadow_walk_iterator *iterato
{
iterator->addr = addr;
iterator->shadow_addr = root;
- iterator->level = vcpu->arch.mmu->shadow_root_level;
+ iterator->level = vcpu->arch.private->mmu->shadow_root_level;

if (iterator->level >= PT64_ROOT_4LEVEL &&
- vcpu->arch.mmu->root_level < PT64_ROOT_4LEVEL &&
- !vcpu->arch.mmu->direct_map)
+ vcpu->arch.private->mmu->root_level < PT64_ROOT_4LEVEL &&
+ !vcpu->arch.private->mmu->direct_map)
iterator->level = PT32E_ROOT_LEVEL;

if (iterator->level == PT32E_ROOT_LEVEL) {
@@ -2193,10 +2193,10 @@ static void shadow_walk_init_using_root(struct kvm_shadow_walk_iterator *iterato
* prev_root is currently only used for 64-bit hosts. So only
* the active root_hpa is valid here.
*/
- BUG_ON(root != vcpu->arch.mmu->root_hpa);
+ BUG_ON(root != vcpu->arch.private->mmu->root_hpa);

iterator->shadow_addr
- = vcpu->arch.mmu->pae_root[(addr >> 30) & 3];
+ = vcpu->arch.private->mmu->pae_root[(addr >> 30) & 3];
iterator->shadow_addr &= PT64_BASE_ADDR_MASK;
--iterator->level;
if (!iterator->shadow_addr)
@@ -2207,7 +2207,7 @@ static void shadow_walk_init_using_root(struct kvm_shadow_walk_iterator *iterato
static void shadow_walk_init(struct kvm_shadow_walk_iterator *iterator,
struct kvm_vcpu *vcpu, u64 addr)
{
- shadow_walk_init_using_root(iterator, vcpu, vcpu->arch.mmu->root_hpa,
+ shadow_walk_init_using_root(iterator, vcpu, vcpu->arch.private->mmu->root_hpa,
addr);
}

@@ -2561,7 +2561,7 @@ static int kvm_mmu_unprotect_page_virt(struct kvm_vcpu *vcpu, gva_t gva)
gpa_t gpa;
int r;

- if (vcpu->arch.mmu->direct_map)
+ if (vcpu->arch.private->mmu->direct_map)
return 0;

gpa = kvm_mmu_gva_to_gpa_read(vcpu, gva, NULL);
@@ -3186,7 +3186,7 @@ static int fast_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
do {
u64 new_spte;

- if (is_tdp_mmu(vcpu->arch.mmu))
+ if (is_tdp_mmu(vcpu->arch.private->mmu))
sptep = kvm_tdp_mmu_fast_pf_get_last_sptep(vcpu, fault->addr, &spte);
else
sptep = fast_pf_get_last_sptep(vcpu, fault->addr, &spte);
@@ -3393,7 +3393,7 @@ static hpa_t mmu_alloc_root(struct kvm_vcpu *vcpu, gfn_t gfn, gva_t gva,

static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
{
- struct kvm_mmu *mmu = vcpu->arch.mmu;
+ struct kvm_mmu *mmu = vcpu->arch.private->mmu;
u8 shadow_root_level = mmu->shadow_root_level;
hpa_t root;
unsigned i;
@@ -3501,7 +3501,7 @@ static int mmu_first_shadow_root_alloc(struct kvm *kvm)

static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
{
- struct kvm_mmu *mmu = vcpu->arch.mmu;
+ struct kvm_mmu *mmu = vcpu->arch.private->mmu;
u64 pdptrs[4], pm_mask;
gfn_t root_gfn, root_pgd;
hpa_t root;
@@ -3611,7 +3611,7 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)

static int mmu_alloc_special_roots(struct kvm_vcpu *vcpu)
{
- struct kvm_mmu *mmu = vcpu->arch.mmu;
+ struct kvm_mmu *mmu = vcpu->arch.private->mmu;
bool need_pml5 = mmu->shadow_root_level > PT64_ROOT_4LEVEL;
u64 *pml5_root = NULL;
u64 *pml4_root = NULL;
@@ -3712,16 +3712,16 @@ void kvm_mmu_sync_roots(struct kvm_vcpu *vcpu)
int i;
struct kvm_mmu_page *sp;

- if (vcpu->arch.mmu->direct_map)
+ if (vcpu->arch.private->mmu->direct_map)
return;

- if (!VALID_PAGE(vcpu->arch.mmu->root_hpa))
+ if (!VALID_PAGE(vcpu->arch.private->mmu->root_hpa))
return;

vcpu_clear_mmio_info(vcpu, MMIO_GVA_ANY);

- if (vcpu->arch.mmu->root_level >= PT64_ROOT_4LEVEL) {
- hpa_t root = vcpu->arch.mmu->root_hpa;
+ if (vcpu->arch.private->mmu->root_level >= PT64_ROOT_4LEVEL) {
+ hpa_t root = vcpu->arch.private->mmu->root_hpa;
sp = to_shadow_page(root);

if (!is_unsync_root(root))
@@ -3741,7 +3741,7 @@ void kvm_mmu_sync_roots(struct kvm_vcpu *vcpu)
kvm_mmu_audit(vcpu, AUDIT_PRE_SYNC);

for (i = 0; i < 4; ++i) {
- hpa_t root = vcpu->arch.mmu->pae_root[i];
+ hpa_t root = vcpu->arch.private->mmu->pae_root[i];

if (IS_VALID_PAE_ROOT(root)) {
root &= PT64_BASE_ADDR_MASK;
@@ -3760,11 +3760,11 @@ void kvm_mmu_sync_prev_roots(struct kvm_vcpu *vcpu)
int i;

for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++)
- if (is_unsync_root(vcpu->arch.mmu->prev_roots[i].hpa))
+ if (is_unsync_root(vcpu->arch.private->mmu->prev_roots[i].hpa))
roots_to_free |= KVM_MMU_ROOT_PREVIOUS(i);

/* sync prev_roots by simply freeing them */
- kvm_mmu_free_roots(vcpu, vcpu->arch.mmu, roots_to_free);
+ kvm_mmu_free_roots(vcpu, vcpu->arch.private->mmu, roots_to_free);
}

static gpa_t nonpaging_gva_to_gpa(struct kvm_vcpu *vcpu, gpa_t vaddr,
@@ -3781,7 +3781,7 @@ static gpa_t nonpaging_gva_to_gpa_nested(struct kvm_vcpu *vcpu, gpa_t vaddr,
{
if (exception)
exception->error_code = 0;
- return vcpu->arch.nested_mmu.translate_gpa(vcpu, vaddr, access, exception);
+ return vcpu->arch.private->nested_mmu.translate_gpa(vcpu, vaddr, access, exception);
}

static bool mmio_info_in_cache(struct kvm_vcpu *vcpu, u64 addr, bool direct)
@@ -3834,7 +3834,7 @@ static bool get_mmio_spte(struct kvm_vcpu *vcpu, u64 addr, u64 *sptep)

walk_shadow_page_lockless_begin(vcpu);

- if (is_tdp_mmu(vcpu->arch.mmu))
+ if (is_tdp_mmu(vcpu->arch.private->mmu))
leaf = kvm_tdp_mmu_get_walk(vcpu, addr, sptes, &root);
else
leaf = get_walk(vcpu, addr, sptes, &root);
@@ -3857,7 +3857,7 @@ static bool get_mmio_spte(struct kvm_vcpu *vcpu, u64 addr, u64 *sptep)
if (!is_shadow_present_pte(sptes[leaf]))
leaf++;

- rsvd_check = &vcpu->arch.mmu->shadow_zero_check;
+ rsvd_check = &vcpu->arch.private->mmu->shadow_zero_check;

for (level = root; level >= leaf; level--)
reserved |= is_rsvd_spte(rsvd_check, sptes[level], level);
@@ -3945,8 +3945,8 @@ static bool kvm_arch_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,

arch.token = (vcpu->arch.apf.id++ << 12) | vcpu->vcpu_id;
arch.gfn = gfn;
- arch.direct_map = vcpu->arch.mmu->direct_map;
- arch.cr3 = vcpu->arch.mmu->get_guest_pgd(vcpu);
+ arch.direct_map = vcpu->arch.private->mmu->direct_map;
+ arch.cr3 = vcpu->arch.private->mmu->get_guest_pgd(vcpu);

return kvm_setup_async_pf(vcpu, cr2_or_gpa,
kvm_vcpu_gfn_to_hva(vcpu, gfn), &arch);
@@ -4029,7 +4029,7 @@ static void vcpu_fill_asi_pgtbl_pool(struct kvm_vcpu *vcpu)
static bool is_page_fault_stale(struct kvm_vcpu *vcpu,
struct kvm_page_fault *fault, int mmu_seq)
{
- struct kvm_mmu_page *sp = to_shadow_page(vcpu->arch.mmu->root_hpa);
+ struct kvm_mmu_page *sp = to_shadow_page(vcpu->arch.private->mmu->root_hpa);

/* Special roots, e.g. pae_root, are not backed by shadow pages. */
if (sp && is_obsolete_sp(vcpu->kvm, sp))
@@ -4052,7 +4052,7 @@ static bool is_page_fault_stale(struct kvm_vcpu *vcpu,

static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
{
- bool is_tdp_mmu_fault = is_tdp_mmu(vcpu->arch.mmu);
+ bool is_tdp_mmu_fault = is_tdp_mmu(vcpu->arch.private->mmu);

unsigned long mmu_seq;
bool try_asi_map;
@@ -4206,7 +4206,7 @@ static bool cached_root_available(struct kvm_vcpu *vcpu, gpa_t new_pgd,
{
uint i;
struct kvm_mmu_root_info root;
- struct kvm_mmu *mmu = vcpu->arch.mmu;
+ struct kvm_mmu *mmu = vcpu->arch.private->mmu;

root.pgd = mmu->root_pgd;
root.hpa = mmu->root_hpa;
@@ -4230,7 +4230,7 @@ static bool cached_root_available(struct kvm_vcpu *vcpu, gpa_t new_pgd,
static bool fast_pgd_switch(struct kvm_vcpu *vcpu, gpa_t new_pgd,
union kvm_mmu_page_role new_role)
{
- struct kvm_mmu *mmu = vcpu->arch.mmu;
+ struct kvm_mmu *mmu = vcpu->arch.private->mmu;

/*
* For now, limit the fast switch to 64-bit hosts+VMs in order to avoid
@@ -4248,7 +4248,7 @@ static void __kvm_mmu_new_pgd(struct kvm_vcpu *vcpu, gpa_t new_pgd,
union kvm_mmu_page_role new_role)
{
if (!fast_pgd_switch(vcpu, new_pgd, new_role)) {
- kvm_mmu_free_roots(vcpu, vcpu->arch.mmu, KVM_MMU_ROOT_CURRENT);
+ kvm_mmu_free_roots(vcpu, vcpu->arch.private->mmu, KVM_MMU_ROOT_CURRENT);
return;
}

@@ -4279,7 +4279,7 @@ static void __kvm_mmu_new_pgd(struct kvm_vcpu *vcpu, gpa_t new_pgd,
*/
if (!new_role.direct)
__clear_sp_write_flooding_count(
- to_shadow_page(vcpu->arch.mmu->root_hpa));
+ to_shadow_page(vcpu->arch.private->mmu->root_hpa));
}

void kvm_mmu_new_pgd(struct kvm_vcpu *vcpu, gpa_t new_pgd)
@@ -4826,7 +4826,7 @@ kvm_calc_tdp_mmu_root_page_role(struct kvm_vcpu *vcpu,

static void init_kvm_tdp_mmu(struct kvm_vcpu *vcpu)
{
- struct kvm_mmu *context = &vcpu->arch.root_mmu;
+ struct kvm_mmu *context = &vcpu->arch.private->root_mmu;
struct kvm_mmu_role_regs regs = vcpu_to_role_regs(vcpu);
union kvm_mmu_role new_role =
kvm_calc_tdp_mmu_root_page_role(vcpu, &regs, false);
@@ -4914,7 +4914,7 @@ static void shadow_mmu_init_context(struct kvm_vcpu *vcpu, struct kvm_mmu *conte
static void kvm_init_shadow_mmu(struct kvm_vcpu *vcpu,
struct kvm_mmu_role_regs *regs)
{
- struct kvm_mmu *context = &vcpu->arch.root_mmu;
+ struct kvm_mmu *context = &vcpu->arch.private->root_mmu;
union kvm_mmu_role new_role =
kvm_calc_shadow_mmu_root_page_role(vcpu, regs, false);

@@ -4937,7 +4937,7 @@ kvm_calc_shadow_npt_root_page_role(struct kvm_vcpu *vcpu,
void kvm_init_shadow_npt_mmu(struct kvm_vcpu *vcpu, unsigned long cr0,
unsigned long cr4, u64 efer, gpa_t nested_cr3)
{
- struct kvm_mmu *context = &vcpu->arch.guest_mmu;
+ struct kvm_mmu *context = &vcpu->arch.private->guest_mmu;
struct kvm_mmu_role_regs regs = {
.cr0 = cr0,
.cr4 = cr4 & ~X86_CR4_PKE,
@@ -4960,7 +4960,7 @@ kvm_calc_shadow_ept_root_page_role(struct kvm_vcpu *vcpu, bool accessed_dirty,
union kvm_mmu_role role = {0};

/* SMM flag is inherited from root_mmu */
- role.base.smm = vcpu->arch.root_mmu.mmu_role.base.smm;
+ role.base.smm = vcpu->arch.private->root_mmu.mmu_role.base.smm;

role.base.level = level;
role.base.gpte_is_8_bytes = true;
@@ -4980,7 +4980,7 @@ kvm_calc_shadow_ept_root_page_role(struct kvm_vcpu *vcpu, bool accessed_dirty,
void kvm_init_shadow_ept_mmu(struct kvm_vcpu *vcpu, bool execonly,
bool accessed_dirty, gpa_t new_eptp)
{
- struct kvm_mmu *context = &vcpu->arch.guest_mmu;
+ struct kvm_mmu *context = &vcpu->arch.private->guest_mmu;
u8 level = vmx_eptp_page_walk_level(new_eptp);
union kvm_mmu_role new_role =
kvm_calc_shadow_ept_root_page_role(vcpu, accessed_dirty,
@@ -5012,7 +5012,7 @@ EXPORT_SYMBOL_GPL(kvm_init_shadow_ept_mmu);

static void init_kvm_softmmu(struct kvm_vcpu *vcpu)
{
- struct kvm_mmu *context = &vcpu->arch.root_mmu;
+ struct kvm_mmu *context = &vcpu->arch.private->root_mmu;
struct kvm_mmu_role_regs regs = vcpu_to_role_regs(vcpu);

kvm_init_shadow_mmu(vcpu, &regs);
@@ -5043,7 +5043,7 @@ static void init_kvm_nested_mmu(struct kvm_vcpu *vcpu)
{
struct kvm_mmu_role_regs regs = vcpu_to_role_regs(vcpu);
union kvm_mmu_role new_role = kvm_calc_nested_mmu_role(vcpu, &regs);
- struct kvm_mmu *g_context = &vcpu->arch.nested_mmu;
+ struct kvm_mmu *g_context = &vcpu->arch.private->nested_mmu;

if (new_role.as_u64 == g_context->mmu_role.as_u64)
return;
@@ -5061,9 +5061,9 @@ static void init_kvm_nested_mmu(struct kvm_vcpu *vcpu)
g_context->invlpg = NULL;

/*
- * Note that arch.mmu->gva_to_gpa translates l2_gpa to l1_gpa using
+ * Note that arch.private->mmu->gva_to_gpa translates l2_gpa to l1_gpa using
* L1's nested page tables (e.g. EPT12). The nested translation
- * of l2_gva to l1_gpa is done by arch.nested_mmu.gva_to_gpa using
+ * of l2_gva to l1_gpa is done by arch.private->nested_mmu.gva_to_gpa using
* L2's page tables as the first level of translation and L1's
* nested page tables as the second level of translation. Basically
* the gva_to_gpa functions between mmu and nested_mmu are swapped.
@@ -5119,9 +5119,9 @@ void kvm_mmu_after_set_cpuid(struct kvm_vcpu *vcpu)
* problem is swept under the rug; KVM's CPUID API is horrific and
* it's all but impossible to solve it without introducing a new API.
*/
- vcpu->arch.root_mmu.mmu_role.ext.valid = 0;
- vcpu->arch.guest_mmu.mmu_role.ext.valid = 0;
- vcpu->arch.nested_mmu.mmu_role.ext.valid = 0;
+ vcpu->arch.private->root_mmu.mmu_role.ext.valid = 0;
+ vcpu->arch.private->guest_mmu.mmu_role.ext.valid = 0;
+ vcpu->arch.private->nested_mmu.mmu_role.ext.valid = 0;
kvm_mmu_reset_context(vcpu);

/*
@@ -5142,13 +5142,13 @@ int kvm_mmu_load(struct kvm_vcpu *vcpu)
{
int r;

- r = mmu_topup_memory_caches(vcpu, !vcpu->arch.mmu->direct_map);
+ r = mmu_topup_memory_caches(vcpu, !vcpu->arch.private->mmu->direct_map);
if (r)
goto out;
r = mmu_alloc_special_roots(vcpu);
if (r)
goto out;
- if (vcpu->arch.mmu->direct_map)
+ if (vcpu->arch.private->mmu->direct_map)
r = mmu_alloc_direct_roots(vcpu);
else
r = mmu_alloc_shadow_roots(vcpu);
@@ -5165,10 +5165,10 @@ int kvm_mmu_load(struct kvm_vcpu *vcpu)

void kvm_mmu_unload(struct kvm_vcpu *vcpu)
{
- kvm_mmu_free_roots(vcpu, &vcpu->arch.root_mmu, KVM_MMU_ROOTS_ALL);
- WARN_ON(VALID_PAGE(vcpu->arch.root_mmu.root_hpa));
- kvm_mmu_free_roots(vcpu, &vcpu->arch.guest_mmu, KVM_MMU_ROOTS_ALL);
- WARN_ON(VALID_PAGE(vcpu->arch.guest_mmu.root_hpa));
+ kvm_mmu_free_roots(vcpu, &vcpu->arch.private->root_mmu, KVM_MMU_ROOTS_ALL);
+ WARN_ON(VALID_PAGE(vcpu->arch.private->root_mmu.root_hpa));
+ kvm_mmu_free_roots(vcpu, &vcpu->arch.private->guest_mmu, KVM_MMU_ROOTS_ALL);
+ WARN_ON(VALID_PAGE(vcpu->arch.private->guest_mmu.root_hpa));
}

static bool need_remote_flush(u64 old, u64 new)
@@ -5351,9 +5351,9 @@ int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 error_code,
void *insn, int insn_len)
{
int r, emulation_type = EMULTYPE_PF;
- bool direct = vcpu->arch.mmu->direct_map;
+ bool direct = vcpu->arch.private->mmu->direct_map;

- if (WARN_ON(!VALID_PAGE(vcpu->arch.mmu->root_hpa)))
+ if (WARN_ON(!VALID_PAGE(vcpu->arch.private->mmu->root_hpa)))
return RET_PF_RETRY;

r = RET_PF_INVALID;
@@ -5382,14 +5382,14 @@ int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 error_code,
* paging in both guests. If true, we simply unprotect the page
* and resume the guest.
*/
- if (vcpu->arch.mmu->direct_map &&
+ if (vcpu->arch.private->mmu->direct_map &&
(error_code & PFERR_NESTED_GUEST_PAGE) == PFERR_NESTED_GUEST_PAGE) {
kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(cr2_or_gpa));
return 1;
}

/*
- * vcpu->arch.mmu.page_fault returned RET_PF_EMULATE, but we can still
+ * vcpu->arch.private->mmu.page_fault returned RET_PF_EMULATE, but we can still
* optimistically try to just unprotect the page and let the processor
* re-execute the instruction that caused the page fault. Do not allow
* retrying MMIO emulation, as it's not only pointless but could also
@@ -5412,8 +5412,8 @@ void kvm_mmu_invalidate_gva(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
{
int i;

- /* It's actually a GPA for vcpu->arch.guest_mmu. */
- if (mmu != &vcpu->arch.guest_mmu) {
+ /* It's actually a GPA for vcpu->arch.private->guest_mmu. */
+ if (mmu != &vcpu->arch.private->guest_mmu) {
/* INVLPG on a non-canonical address is a NOP according to the SDM. */
if (is_noncanonical_address(gva, vcpu))
return;
@@ -5448,7 +5448,7 @@ void kvm_mmu_invalidate_gva(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,

void kvm_mmu_invlpg(struct kvm_vcpu *vcpu, gva_t gva)
{
- kvm_mmu_invalidate_gva(vcpu, vcpu->arch.walk_mmu, gva, INVALID_PAGE);
+ kvm_mmu_invalidate_gva(vcpu, vcpu->arch.private->walk_mmu, gva, INVALID_PAGE);
++vcpu->stat.invlpg;
}
EXPORT_SYMBOL_GPL(kvm_mmu_invlpg);
@@ -5456,7 +5456,7 @@ EXPORT_SYMBOL_GPL(kvm_mmu_invlpg);

void kvm_mmu_invpcid_gva(struct kvm_vcpu *vcpu, gva_t gva, unsigned long pcid)
{
- struct kvm_mmu *mmu = vcpu->arch.mmu;
+ struct kvm_mmu *mmu = vcpu->arch.private->mmu;
bool tlb_flush = false;
uint i;

@@ -5638,24 +5638,24 @@ int kvm_mmu_create(struct kvm_vcpu *vcpu)
vcpu->arch.mmu_shadow_page_cache.gfp_asi = 0;
#endif

- vcpu->arch.mmu = &vcpu->arch.root_mmu;
- vcpu->arch.walk_mmu = &vcpu->arch.root_mmu;
+ vcpu->arch.private->mmu = &vcpu->arch.private->root_mmu;
+ vcpu->arch.private->walk_mmu = &vcpu->arch.private->root_mmu;

- vcpu->arch.nested_mmu.translate_gpa = translate_nested_gpa;
+ vcpu->arch.private->nested_mmu.translate_gpa = translate_nested_gpa;

asi_init_pgtbl_pool(&vcpu->arch.asi_pgtbl_pool);

- ret = __kvm_mmu_create(vcpu, &vcpu->arch.guest_mmu);
+ ret = __kvm_mmu_create(vcpu, &vcpu->arch.private->guest_mmu);
if (ret)
return ret;

- ret = __kvm_mmu_create(vcpu, &vcpu->arch.root_mmu);
+ ret = __kvm_mmu_create(vcpu, &vcpu->arch.private->root_mmu);
if (ret)
goto fail_allocate_root;

return ret;
fail_allocate_root:
- free_mmu_pages(&vcpu->arch.guest_mmu);
+ free_mmu_pages(&vcpu->arch.private->guest_mmu);
return ret;
}

@@ -6261,8 +6261,8 @@ unsigned long kvm_mmu_calculate_default_mmu_pages(struct kvm *kvm)
void kvm_mmu_destroy(struct kvm_vcpu *vcpu)
{
kvm_mmu_unload(vcpu);
- free_mmu_pages(&vcpu->arch.root_mmu);
- free_mmu_pages(&vcpu->arch.guest_mmu);
+ free_mmu_pages(&vcpu->arch.private->root_mmu);
+ free_mmu_pages(&vcpu->arch.private->guest_mmu);
mmu_free_memory_caches(vcpu);
asi_clear_pgtbl_pool(&vcpu->arch.asi_pgtbl_pool);
}
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 52c6527b1a06..57ec9dd147da 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -114,7 +114,7 @@ static inline bool kvm_vcpu_ad_need_write_protect(struct kvm_vcpu *vcpu)
* being enabled is mandatory as the bits used to denote WP-only SPTEs
* are reserved for NPT w/ PAE (32-bit KVM).
*/
- return vcpu->arch.mmu == &vcpu->arch.guest_mmu &&
+ return vcpu->arch.private->mmu == &vcpu->arch.private->guest_mmu &&
kvm_x86_ops.cpu_dirty_log_size;
}

diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index 193317ad60a4..c39a1a870a2b 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -194,11 +194,11 @@ static bool FNAME(prefetch_invalid_gpte)(struct kvm_vcpu *vcpu,
goto no_present;

/* if accessed bit is not supported prefetch non accessed gpte */
- if (PT_HAVE_ACCESSED_DIRTY(vcpu->arch.mmu) &&
+ if (PT_HAVE_ACCESSED_DIRTY(vcpu->arch.private->mmu) &&
!(gpte & PT_GUEST_ACCESSED_MASK))
goto no_present;

- if (FNAME(is_rsvd_bits_set)(vcpu->arch.mmu, gpte, PG_LEVEL_4K))
+ if (FNAME(is_rsvd_bits_set)(vcpu->arch.private->mmu, gpte, PG_LEVEL_4K))
goto no_present;

return false;
@@ -533,7 +533,7 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker,
}
#endif
walker->fault.address = addr;
- walker->fault.nested_page_fault = mmu != vcpu->arch.walk_mmu;
+ walker->fault.nested_page_fault = mmu != vcpu->arch.private->walk_mmu;
walker->fault.async_page_fault = false;

trace_kvm_mmu_walker_error(walker->fault.error_code);
@@ -543,7 +543,7 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker,
static int FNAME(walk_addr)(struct guest_walker *walker,
struct kvm_vcpu *vcpu, gpa_t addr, u32 access)
{
- return FNAME(walk_addr_generic)(walker, vcpu, vcpu->arch.mmu, addr,
+ return FNAME(walk_addr_generic)(walker, vcpu, vcpu->arch.private->mmu, addr,
access);
}

@@ -552,7 +552,7 @@ static int FNAME(walk_addr_nested)(struct guest_walker *walker,
struct kvm_vcpu *vcpu, gva_t addr,
u32 access)
{
- return FNAME(walk_addr_generic)(walker, vcpu, &vcpu->arch.nested_mmu,
+ return FNAME(walk_addr_generic)(walker, vcpu, &vcpu->arch.private->nested_mmu,
addr, access);
}
#endif
@@ -573,7 +573,7 @@ FNAME(prefetch_gpte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,

gfn = gpte_to_gfn(gpte);
pte_access = sp->role.access & FNAME(gpte_access)(gpte);
- FNAME(protect_clean_gpte)(vcpu->arch.mmu, &pte_access, gpte);
+ FNAME(protect_clean_gpte)(vcpu->arch.private->mmu, &pte_access, gpte);

slot = gfn_to_memslot_dirty_bitmap(vcpu, gfn,
no_dirty_log && (pte_access & ACC_WRITE_MASK));
@@ -670,7 +670,7 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
WARN_ON_ONCE(gw->gfn != base_gfn);
direct_access = gw->pte_access;

- top_level = vcpu->arch.mmu->root_level;
+ top_level = vcpu->arch.private->mmu->root_level;
if (top_level == PT32E_ROOT_LEVEL)
top_level = PT32_ROOT_LEVEL;
/*
@@ -682,7 +682,7 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
if (FNAME(gpte_changed)(vcpu, gw, top_level))
goto out_gpte_changed;

- if (WARN_ON(!VALID_PAGE(vcpu->arch.mmu->root_hpa)))
+ if (WARN_ON(!VALID_PAGE(vcpu->arch.private->mmu->root_hpa)))
goto out_gpte_changed;

for (shadow_walk_init(&it, vcpu, fault->addr);
@@ -806,7 +806,7 @@ FNAME(is_self_change_mapping)(struct kvm_vcpu *vcpu,
bool self_changed = false;

if (!(walker->pte_access & ACC_WRITE_MASK ||
- (!is_cr0_wp(vcpu->arch.mmu) && !user_fault)))
+ (!is_cr0_wp(vcpu->arch.private->mmu) && !user_fault)))
return false;

for (level = walker->level; level <= walker->max_level; level++) {
@@ -905,7 +905,7 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
* we will cache the incorrect access into mmio spte.
*/
if (fault->write && !(walker.pte_access & ACC_WRITE_MASK) &&
- !is_cr0_wp(vcpu->arch.mmu) && !fault->user && fault->slot) {
+ !is_cr0_wp(vcpu->arch.private->mmu) && !fault->user && fault->slot) {
walker.pte_access |= ACC_WRITE_MASK;
walker.pte_access &= ~ACC_USER_MASK;

@@ -915,7 +915,7 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
* then we should prevent the kernel from executing it
* if SMEP is enabled.
*/
- if (is_cr4_smep(vcpu->arch.mmu))
+ if (is_cr4_smep(vcpu->arch.private->mmu))
walker.pte_access &= ~ACC_EXEC_MASK;
}

@@ -1071,7 +1071,7 @@ static gpa_t FNAME(gva_to_gpa_nested)(struct kvm_vcpu *vcpu, gpa_t vaddr,
*/
static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
{
- union kvm_mmu_page_role mmu_role = vcpu->arch.mmu->mmu_role.base;
+ union kvm_mmu_page_role mmu_role = vcpu->arch.private->mmu->mmu_role.base;
int i;
bool host_writable;
gpa_t first_pte_gpa;
@@ -1129,7 +1129,7 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
gfn = gpte_to_gfn(gpte);
pte_access = sp->role.access;
pte_access &= FNAME(gpte_access)(gpte);
- FNAME(protect_clean_gpte)(vcpu->arch.mmu, &pte_access, gpte);
+ FNAME(protect_clean_gpte)(vcpu->arch.private->mmu, &pte_access, gpte);

if (sync_mmio_spte(vcpu, &sp->spt[i], gfn, pte_access))
continue;
diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index 13038fae5088..df14b6639b35 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -177,9 +177,9 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
if (prefetch)
spte = mark_spte_for_access_track(spte);

- WARN_ONCE(is_rsvd_spte(&vcpu->arch.mmu->shadow_zero_check, spte, level),
+ WARN_ONCE(is_rsvd_spte(&vcpu->arch.private->mmu->shadow_zero_check, spte, level),
"spte = 0x%llx, level = %d, rsvd bits = 0x%llx", spte, level,
- get_rsvd_bits(&vcpu->arch.mmu->shadow_zero_check, spte, level));
+ get_rsvd_bits(&vcpu->arch.private->mmu->shadow_zero_check, spte, level));

if ((spte & PT_WRITABLE_MASK) && kvm_slot_dirty_track_enabled(slot)) {
/* Enforced by kvm_mmu_hugepage_adjust. */
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 1beb4ca90560..c3634ac01869 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -162,7 +162,7 @@ static union kvm_mmu_page_role page_role_for_level(struct kvm_vcpu *vcpu,
{
union kvm_mmu_page_role role;

- role = vcpu->arch.mmu->mmu_role.base;
+ role = vcpu->arch.private->mmu->mmu_role.base;
role.level = level;
role.direct = true;
role.gpte_is_8_bytes = true;
@@ -198,7 +198,7 @@ hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu)

lockdep_assert_held_write(&kvm->mmu_lock);

- role = page_role_for_level(vcpu, vcpu->arch.mmu->shadow_root_level);
+ role = page_role_for_level(vcpu, vcpu->arch.private->mmu->shadow_root_level);

/* Check for an existing root before allocating a new one. */
for_each_tdp_mmu_root(kvm, root, kvm_mmu_role_as_id(role)) {
@@ -207,7 +207,7 @@ hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu)
goto out;
}

- root = alloc_tdp_mmu_page(vcpu, 0, vcpu->arch.mmu->shadow_root_level);
+ root = alloc_tdp_mmu_page(vcpu, 0, vcpu->arch.private->mmu->shadow_root_level);
refcount_set(&root->tdp_mmu_root_count, 1);

spin_lock(&kvm->arch.tdp_mmu_pages_lock);
@@ -952,7 +952,7 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu,
*/
int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
{
- struct kvm_mmu *mmu = vcpu->arch.mmu;
+ struct kvm_mmu *mmu = vcpu->arch.private->mmu;
struct tdp_iter iter;
struct kvm_mmu_page *sp;
u64 *child_pt;
@@ -1486,11 +1486,11 @@ int kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
int *root_level)
{
struct tdp_iter iter;
- struct kvm_mmu *mmu = vcpu->arch.mmu;
+ struct kvm_mmu *mmu = vcpu->arch.private->mmu;
gfn_t gfn = addr >> PAGE_SHIFT;
int leaf = -1;

- *root_level = vcpu->arch.mmu->shadow_root_level;
+ *root_level = vcpu->arch.private->mmu->shadow_root_level;

tdp_mmu_for_each_pte(iter, mmu, gfn, gfn + 1) {
leaf = iter.level;
@@ -1515,7 +1515,7 @@ u64 *kvm_tdp_mmu_fast_pf_get_last_sptep(struct kvm_vcpu *vcpu, u64 addr,
u64 *spte)
{
struct tdp_iter iter;
- struct kvm_mmu *mmu = vcpu->arch.mmu;
+ struct kvm_mmu *mmu = vcpu->arch.private->mmu;
gfn_t gfn = addr >> PAGE_SHIFT;
tdp_ptep_t sptep = NULL;

diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index f8b7bc04b3e7..c90ef5bf26cf 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -97,7 +97,7 @@ static void nested_svm_init_mmu_context(struct kvm_vcpu *vcpu)

WARN_ON(mmu_is_nested(vcpu));

- vcpu->arch.mmu = &vcpu->arch.guest_mmu;
+ vcpu->arch.private->mmu = &vcpu->arch.private->guest_mmu;

/*
* The NPT format depends on L1's CR4 and EFER, which is in vmcb01. Note,
@@ -107,16 +107,16 @@ static void nested_svm_init_mmu_context(struct kvm_vcpu *vcpu)
kvm_init_shadow_npt_mmu(vcpu, X86_CR0_PG, svm->vmcb01.ptr->save.cr4,
svm->vmcb01.ptr->save.efer,
svm->nested.ctl.nested_cr3);
- vcpu->arch.mmu->get_guest_pgd = nested_svm_get_tdp_cr3;
- vcpu->arch.mmu->get_pdptr = nested_svm_get_tdp_pdptr;
- vcpu->arch.mmu->inject_page_fault = nested_svm_inject_npf_exit;
- vcpu->arch.walk_mmu = &vcpu->arch.nested_mmu;
+ vcpu->arch.private->mmu->get_guest_pgd = nested_svm_get_tdp_cr3;
+ vcpu->arch.private->mmu->get_pdptr = nested_svm_get_tdp_pdptr;
+ vcpu->arch.private->mmu->inject_page_fault = nested_svm_inject_npf_exit;
+ vcpu->arch.private->walk_mmu = &vcpu->arch.private->nested_mmu;
}

static void nested_svm_uninit_mmu_context(struct kvm_vcpu *vcpu)
{
- vcpu->arch.mmu = &vcpu->arch.root_mmu;
- vcpu->arch.walk_mmu = &vcpu->arch.root_mmu;
+ vcpu->arch.private->mmu = &vcpu->arch.private->root_mmu;
+ vcpu->arch.private->walk_mmu = &vcpu->arch.private->root_mmu;
}

void recalc_intercepts(struct vcpu_svm *svm)
@@ -437,13 +437,13 @@ static int nested_svm_load_cr3(struct kvm_vcpu *vcpu, unsigned long cr3,
return -EINVAL;

if (reload_pdptrs && !nested_npt && is_pae_paging(vcpu) &&
- CC(!load_pdptrs(vcpu, vcpu->arch.walk_mmu, cr3)))
+ CC(!load_pdptrs(vcpu, vcpu->arch.private->walk_mmu, cr3)))
return -EINVAL;

if (!nested_npt)
kvm_mmu_new_pgd(vcpu, cr3);

- vcpu->arch.cr3 = cr3;
+ vcpu->arch.private->cr3 = cr3;
kvm_register_mark_available(vcpu, VCPU_EXREG_CR3);

/* Re-initialize the MMU, e.g. to pick up CR4 MMU role changes. */
@@ -500,7 +500,7 @@ static void nested_vmcb02_prepare_save(struct vcpu_svm *svm, struct vmcb *vmcb12
svm_set_cr0(&svm->vcpu, vmcb12->save.cr0);
svm_set_cr4(&svm->vcpu, vmcb12->save.cr4);

- svm->vcpu.arch.cr2 = vmcb12->save.cr2;
+ svm->vcpu.arch.private->cr2 = vmcb12->save.cr2;

kvm_rax_write(&svm->vcpu, vmcb12->save.rax);
kvm_rsp_write(&svm->vcpu, vmcb12->save.rsp);
@@ -634,7 +634,7 @@ int enter_svm_guest_mode(struct kvm_vcpu *vcpu, u64 vmcb12_gpa,
return ret;

if (!npt_enabled)
- vcpu->arch.mmu->inject_page_fault = svm_inject_page_fault_nested;
+ vcpu->arch.private->mmu->inject_page_fault = svm_inject_page_fault_nested;

if (!from_vmrun)
kvm_make_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu);
@@ -695,7 +695,7 @@ int nested_svm_vmrun(struct kvm_vcpu *vcpu)
*/
svm->vmcb01.ptr->save.efer = vcpu->arch.efer;
svm->vmcb01.ptr->save.cr0 = kvm_read_cr0(vcpu);
- svm->vmcb01.ptr->save.cr4 = vcpu->arch.cr4;
+ svm->vmcb01.ptr->save.cr4 = vcpu->arch.private->cr4;
svm->vmcb01.ptr->save.rflags = kvm_get_rflags(vcpu);
svm->vmcb01.ptr->save.rip = kvm_rip_read(vcpu);

@@ -805,7 +805,7 @@ int nested_svm_vmexit(struct vcpu_svm *svm)
vmcb12->save.cr0 = kvm_read_cr0(vcpu);
vmcb12->save.cr3 = kvm_read_cr3(vcpu);
vmcb12->save.cr2 = vmcb->save.cr2;
- vmcb12->save.cr4 = svm->vcpu.arch.cr4;
+ vmcb12->save.cr4 = svm->vcpu.arch.private->cr4;
vmcb12->save.rflags = kvm_get_rflags(vcpu);
vmcb12->save.rip = kvm_rip_read(vcpu);
vmcb12->save.rsp = kvm_rsp_read(vcpu);
@@ -991,7 +991,7 @@ static int nested_svm_exit_handled_msr(struct vcpu_svm *svm)
if (!(vmcb_is_intercept(&svm->nested.ctl, INTERCEPT_MSR_PROT)))
return NESTED_EXIT_HOST;

- msr = svm->vcpu.arch.regs[VCPU_REGS_RCX];
+ msr = svm->vcpu.arch.private->regs[VCPU_REGS_RCX];
offset = svm_msrpm_offset(msr);
write = svm->vmcb->control.exit_info_1 & 1;
mask = 1 << ((2 * (msr & 0xf)) + write);
@@ -1131,7 +1131,7 @@ static void nested_svm_inject_exception_vmexit(struct vcpu_svm *svm)
else if (svm->vcpu.arch.exception.has_payload)
svm->vmcb->control.exit_info_2 = svm->vcpu.arch.exception.payload;
else
- svm->vmcb->control.exit_info_2 = svm->vcpu.arch.cr2;
+ svm->vmcb->control.exit_info_2 = svm->vcpu.arch.private->cr2;
} else if (nr == DB_VECTOR) {
/* See inject_pending_event. */
kvm_deliver_exception_payload(&svm->vcpu);
@@ -1396,7 +1396,7 @@ static int svm_set_nested_state(struct kvm_vcpu *vcpu,
* Set it again to fix this.
*/

- ret = nested_svm_load_cr3(&svm->vcpu, vcpu->arch.cr3,
+ ret = nested_svm_load_cr3(&svm->vcpu, vcpu->arch.private->cr3,
nested_npt_enabled(svm), false);
if (WARN_ON_ONCE(ret))
goto out_free;
@@ -1449,7 +1449,7 @@ static bool svm_get_nested_state_pages(struct kvm_vcpu *vcpu)
* the guest CR3 might be restored prior to setting the nested
* state which can lead to a load of wrong PDPTRs.
*/
- if (CC(!load_pdptrs(vcpu, vcpu->arch.walk_mmu, vcpu->arch.cr3)))
+ if (CC(!load_pdptrs(vcpu, vcpu->arch.private->walk_mmu, vcpu->arch.private->cr3)))
return false;

if (!nested_svm_vmrun_msrpm(svm)) {
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index be2883141220..9c62566ddde8 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -565,28 +565,28 @@ static int sev_es_sync_vmsa(struct vcpu_svm *svm)
return -EINVAL;

/* Sync registgers */
- save->rax = svm->vcpu.arch.regs[VCPU_REGS_RAX];
- save->rbx = svm->vcpu.arch.regs[VCPU_REGS_RBX];
- save->rcx = svm->vcpu.arch.regs[VCPU_REGS_RCX];
- save->rdx = svm->vcpu.arch.regs[VCPU_REGS_RDX];
- save->rsp = svm->vcpu.arch.regs[VCPU_REGS_RSP];
- save->rbp = svm->vcpu.arch.regs[VCPU_REGS_RBP];
- save->rsi = svm->vcpu.arch.regs[VCPU_REGS_RSI];
- save->rdi = svm->vcpu.arch.regs[VCPU_REGS_RDI];
+ save->rax = svm->vcpu.arch.private->regs[VCPU_REGS_RAX];
+ save->rbx = svm->vcpu.arch.private->regs[VCPU_REGS_RBX];
+ save->rcx = svm->vcpu.arch.private->regs[VCPU_REGS_RCX];
+ save->rdx = svm->vcpu.arch.private->regs[VCPU_REGS_RDX];
+ save->rsp = svm->vcpu.arch.private->regs[VCPU_REGS_RSP];
+ save->rbp = svm->vcpu.arch.private->regs[VCPU_REGS_RBP];
+ save->rsi = svm->vcpu.arch.private->regs[VCPU_REGS_RSI];
+ save->rdi = svm->vcpu.arch.private->regs[VCPU_REGS_RDI];
#ifdef CONFIG_X86_64
- save->r8 = svm->vcpu.arch.regs[VCPU_REGS_R8];
- save->r9 = svm->vcpu.arch.regs[VCPU_REGS_R9];
- save->r10 = svm->vcpu.arch.regs[VCPU_REGS_R10];
- save->r11 = svm->vcpu.arch.regs[VCPU_REGS_R11];
- save->r12 = svm->vcpu.arch.regs[VCPU_REGS_R12];
- save->r13 = svm->vcpu.arch.regs[VCPU_REGS_R13];
- save->r14 = svm->vcpu.arch.regs[VCPU_REGS_R14];
- save->r15 = svm->vcpu.arch.regs[VCPU_REGS_R15];
+ save->r8 = svm->vcpu.arch.private->regs[VCPU_REGS_R8];
+ save->r9 = svm->vcpu.arch.private->regs[VCPU_REGS_R9];
+ save->r10 = svm->vcpu.arch.private->regs[VCPU_REGS_R10];
+ save->r11 = svm->vcpu.arch.private->regs[VCPU_REGS_R11];
+ save->r12 = svm->vcpu.arch.private->regs[VCPU_REGS_R12];
+ save->r13 = svm->vcpu.arch.private->regs[VCPU_REGS_R13];
+ save->r14 = svm->vcpu.arch.private->regs[VCPU_REGS_R14];
+ save->r15 = svm->vcpu.arch.private->regs[VCPU_REGS_R15];
#endif
- save->rip = svm->vcpu.arch.regs[VCPU_REGS_RIP];
+ save->rip = svm->vcpu.arch.private->regs[VCPU_REGS_RIP];

/* Sync some non-GPR registers before encrypting */
- save->xcr0 = svm->vcpu.arch.xcr0;
+ save->xcr0 = svm->vcpu.arch.private->xcr0;
save->pkru = svm->vcpu.arch.pkru;
save->xss = svm->vcpu.arch.ia32_xss;
save->dr6 = svm->vcpu.arch.dr6;
@@ -2301,10 +2301,10 @@ static void sev_es_sync_to_ghcb(struct vcpu_svm *svm)
* Copy their values, even if they may not have been written during the
* VM-Exit. It's the guest's responsibility to not consume random data.
*/
- ghcb_set_rax(ghcb, vcpu->arch.regs[VCPU_REGS_RAX]);
- ghcb_set_rbx(ghcb, vcpu->arch.regs[VCPU_REGS_RBX]);
- ghcb_set_rcx(ghcb, vcpu->arch.regs[VCPU_REGS_RCX]);
- ghcb_set_rdx(ghcb, vcpu->arch.regs[VCPU_REGS_RDX]);
+ ghcb_set_rax(ghcb, vcpu->arch.private->regs[VCPU_REGS_RAX]);
+ ghcb_set_rbx(ghcb, vcpu->arch.private->regs[VCPU_REGS_RBX]);
+ ghcb_set_rcx(ghcb, vcpu->arch.private->regs[VCPU_REGS_RCX]);
+ ghcb_set_rdx(ghcb, vcpu->arch.private->regs[VCPU_REGS_RDX]);
}

static void sev_es_sync_from_ghcb(struct vcpu_svm *svm)
@@ -2326,18 +2326,18 @@ static void sev_es_sync_from_ghcb(struct vcpu_svm *svm)
*
* Copy their values to the appropriate location if supplied.
*/
- memset(vcpu->arch.regs, 0, sizeof(vcpu->arch.regs));
+ memset(vcpu->arch.private->regs, 0, sizeof(vcpu->arch.private->regs));

- vcpu->arch.regs[VCPU_REGS_RAX] = ghcb_get_rax_if_valid(ghcb);
- vcpu->arch.regs[VCPU_REGS_RBX] = ghcb_get_rbx_if_valid(ghcb);
- vcpu->arch.regs[VCPU_REGS_RCX] = ghcb_get_rcx_if_valid(ghcb);
- vcpu->arch.regs[VCPU_REGS_RDX] = ghcb_get_rdx_if_valid(ghcb);
- vcpu->arch.regs[VCPU_REGS_RSI] = ghcb_get_rsi_if_valid(ghcb);
+ vcpu->arch.private->regs[VCPU_REGS_RAX] = ghcb_get_rax_if_valid(ghcb);
+ vcpu->arch.private->regs[VCPU_REGS_RBX] = ghcb_get_rbx_if_valid(ghcb);
+ vcpu->arch.private->regs[VCPU_REGS_RCX] = ghcb_get_rcx_if_valid(ghcb);
+ vcpu->arch.private->regs[VCPU_REGS_RDX] = ghcb_get_rdx_if_valid(ghcb);
+ vcpu->arch.private->regs[VCPU_REGS_RSI] = ghcb_get_rsi_if_valid(ghcb);

svm->vmcb->save.cpl = ghcb_get_cpl_if_valid(ghcb);

if (ghcb_xcr0_is_valid(ghcb)) {
- vcpu->arch.xcr0 = ghcb_get_xcr0(ghcb);
+ vcpu->arch.private->xcr0 = ghcb_get_xcr0(ghcb);
kvm_update_cpuid_runtime(vcpu);
}

@@ -2667,8 +2667,8 @@ static int sev_handle_vmgexit_msr_protocol(struct vcpu_svm *svm)
GHCB_MSR_CPUID_FUNC_POS);

/* Initialize the registers needed by the CPUID intercept */
- vcpu->arch.regs[VCPU_REGS_RAX] = cpuid_fn;
- vcpu->arch.regs[VCPU_REGS_RCX] = 0;
+ vcpu->arch.private->regs[VCPU_REGS_RAX] = cpuid_fn;
+ vcpu->arch.private->regs[VCPU_REGS_RCX] = 0;

ret = svm_invoke_exit_handler(vcpu, SVM_EXIT_CPUID);
if (!ret) {
@@ -2680,13 +2680,13 @@ static int sev_handle_vmgexit_msr_protocol(struct vcpu_svm *svm)
GHCB_MSR_CPUID_REG_MASK,
GHCB_MSR_CPUID_REG_POS);
if (cpuid_reg == 0)
- cpuid_value = vcpu->arch.regs[VCPU_REGS_RAX];
+ cpuid_value = vcpu->arch.private->regs[VCPU_REGS_RAX];
else if (cpuid_reg == 1)
- cpuid_value = vcpu->arch.regs[VCPU_REGS_RBX];
+ cpuid_value = vcpu->arch.private->regs[VCPU_REGS_RBX];
else if (cpuid_reg == 2)
- cpuid_value = vcpu->arch.regs[VCPU_REGS_RCX];
+ cpuid_value = vcpu->arch.private->regs[VCPU_REGS_RCX];
else
- cpuid_value = vcpu->arch.regs[VCPU_REGS_RDX];
+ cpuid_value = vcpu->arch.private->regs[VCPU_REGS_RDX];

set_ghcb_msr_bits(svm, cpuid_value,
GHCB_MSR_CPUID_VALUE_MASK,
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 5151efa424ac..516af87e7ab1 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -1425,10 +1425,10 @@ static int svm_create_vcpu(struct kvm_vcpu *vcpu)
/*
* SEV-ES guests maintain an encrypted version of their FPU
* state which is restored and saved on VMRUN and VMEXIT.
- * Mark vcpu->arch.guest_fpu->fpstate as scratch so it won't
+ * Mark vcpu->arch.private->guest_fpu->fpstate as scratch so it won't
* do xsave/xrstor on it.
*/
- fpstate_set_confidential(&vcpu->arch.guest_fpu);
+ fpstate_set_confidential(&vcpu->arch.private->guest_fpu);
}

err = avic_init_vcpu(svm);
@@ -1599,7 +1599,7 @@ static void svm_cache_reg(struct kvm_vcpu *vcpu, enum kvm_reg reg)
switch (reg) {
case VCPU_EXREG_PDPTR:
BUG_ON(!npt_enabled);
- load_pdptrs(vcpu, vcpu->arch.walk_mmu, kvm_read_cr3(vcpu));
+ load_pdptrs(vcpu, vcpu->arch.private->walk_mmu, kvm_read_cr3(vcpu));
break;
default:
KVM_BUG_ON(1, vcpu->kvm);
@@ -1804,7 +1804,7 @@ void svm_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0)
}
}
#endif
- vcpu->arch.cr0 = cr0;
+ vcpu->arch.private->cr0 = cr0;

if (!npt_enabled)
hcr0 |= X86_CR0_PG | X86_CR0_WP;
@@ -1845,12 +1845,12 @@ static bool svm_is_valid_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
void svm_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
{
unsigned long host_cr4_mce = cr4_read_shadow() & X86_CR4_MCE;
- unsigned long old_cr4 = vcpu->arch.cr4;
+ unsigned long old_cr4 = vcpu->arch.private->cr4;

if (npt_enabled && ((old_cr4 ^ cr4) & X86_CR4_PGE))
svm_flush_tlb(vcpu);

- vcpu->arch.cr4 = cr4;
+ vcpu->arch.private->cr4 = cr4;
if (!npt_enabled)
cr4 |= X86_CR4_PAE;
cr4 |= host_cr4_mce;
@@ -2239,7 +2239,7 @@ enum {
/* Return NONE_SVM_INSTR if not SVM instrs, otherwise return decode result */
static int svm_instr_opcode(struct kvm_vcpu *vcpu)
{
- struct x86_emulate_ctxt *ctxt = vcpu->arch.emulate_ctxt;
+ struct x86_emulate_ctxt *ctxt = vcpu->arch.private->emulate_ctxt;

if (ctxt->b != 0x1 || ctxt->opcode_len != 2)
return NONE_SVM_INSTR;
@@ -2513,7 +2513,7 @@ static bool check_selective_cr0_intercepted(struct kvm_vcpu *vcpu,
unsigned long val)
{
struct vcpu_svm *svm = to_svm(vcpu);
- unsigned long cr0 = vcpu->arch.cr0;
+ unsigned long cr0 = vcpu->arch.private->cr0;
bool ret = false;

if (!is_guest_mode(vcpu) ||
@@ -2585,7 +2585,7 @@ static int cr_interception(struct kvm_vcpu *vcpu)
val = kvm_read_cr0(vcpu);
break;
case 2:
- val = vcpu->arch.cr2;
+ val = vcpu->arch.private->cr2;
break;
case 3:
val = kvm_read_cr3(vcpu);
@@ -3396,9 +3396,9 @@ static int handle_exit(struct kvm_vcpu *vcpu, fastpath_t exit_fastpath)
/* SEV-ES guests must use the CR write traps to track CR registers. */
if (!sev_es_guest(vcpu->kvm)) {
if (!svm_is_intercept(svm, INTERCEPT_CR0_WRITE))
- vcpu->arch.cr0 = svm->vmcb->save.cr0;
+ vcpu->arch.private->cr0 = svm->vmcb->save.cr0;
if (npt_enabled)
- vcpu->arch.cr3 = svm->vmcb->save.cr3;
+ vcpu->arch.private->cr3 = svm->vmcb->save.cr3;
}

if (is_guest_mode(vcpu)) {
@@ -3828,7 +3828,7 @@ static noinstr void svm_vcpu_enter_exit(struct kvm_vcpu *vcpu)
* vmcb02 when switching vmcbs for nested virtualization.
*/
vmload(svm->vmcb01.pa);
- __svm_vcpu_run(vmcb_pa, (unsigned long *)&vcpu->arch.regs);
+ __svm_vcpu_run(vmcb_pa, (unsigned long *)&vcpu->arch.private->regs);
vmsave(svm->vmcb01.pa);

vmload(__sme_page_pa(sd->save_area));
@@ -3843,9 +3843,9 @@ static __no_kcsan fastpath_t svm_vcpu_run(struct kvm_vcpu *vcpu)

trace_kvm_entry(vcpu);

- svm->vmcb->save.rax = vcpu->arch.regs[VCPU_REGS_RAX];
- svm->vmcb->save.rsp = vcpu->arch.regs[VCPU_REGS_RSP];
- svm->vmcb->save.rip = vcpu->arch.regs[VCPU_REGS_RIP];
+ svm->vmcb->save.rax = vcpu->arch.private->regs[VCPU_REGS_RAX];
+ svm->vmcb->save.rsp = vcpu->arch.private->regs[VCPU_REGS_RSP];
+ svm->vmcb->save.rip = vcpu->arch.private->regs[VCPU_REGS_RIP];

/*
* Disable singlestep if we're injecting an interrupt/exception.
@@ -3871,7 +3871,7 @@ static __no_kcsan fastpath_t svm_vcpu_run(struct kvm_vcpu *vcpu)
svm->vmcb->control.asid = svm->asid;
vmcb_mark_dirty(svm->vmcb, VMCB_ASID);
}
- svm->vmcb->save.cr2 = vcpu->arch.cr2;
+ svm->vmcb->save.cr2 = vcpu->arch.private->cr2;

svm_hv_update_vp_id(svm->vmcb, vcpu);

@@ -3926,10 +3926,10 @@ static __no_kcsan fastpath_t svm_vcpu_run(struct kvm_vcpu *vcpu)
x86_spec_ctrl_restore_host(svm->spec_ctrl, svm->virt_spec_ctrl);

if (!sev_es_guest(vcpu->kvm)) {
- vcpu->arch.cr2 = svm->vmcb->save.cr2;
- vcpu->arch.regs[VCPU_REGS_RAX] = svm->vmcb->save.rax;
- vcpu->arch.regs[VCPU_REGS_RSP] = svm->vmcb->save.rsp;
- vcpu->arch.regs[VCPU_REGS_RIP] = svm->vmcb->save.rip;
+ vcpu->arch.private->cr2 = svm->vmcb->save.cr2;
+ vcpu->arch.private->regs[VCPU_REGS_RAX] = svm->vmcb->save.rax;
+ vcpu->arch.private->regs[VCPU_REGS_RSP] = svm->vmcb->save.rsp;
+ vcpu->arch.private->regs[VCPU_REGS_RIP] = svm->vmcb->save.rip;
}

if (unlikely(svm->vmcb->control.exit_code == SVM_EXIT_NMI))
@@ -3999,8 +3999,8 @@ static void svm_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa,
/* Loading L2's CR3 is handled by enter_svm_guest_mode. */
if (!test_bit(VCPU_EXREG_CR3, (ulong *)&vcpu->arch.regs_avail))
return;
- cr3 = vcpu->arch.cr3;
- } else if (vcpu->arch.mmu->shadow_root_level >= PT64_ROOT_4LEVEL) {
+ cr3 = vcpu->arch.private->cr3;
+ } else if (vcpu->arch.private->mmu->shadow_root_level >= PT64_ROOT_4LEVEL) {
cr3 = __sme_set(root_hpa) | kvm_get_active_pcid(vcpu);
} else {
/* PCID in the guest should be impossible with a 32-bit MMU. */
@@ -4221,7 +4221,7 @@ static int svm_check_intercept(struct kvm_vcpu *vcpu,
INTERCEPT_SELECTIVE_CR0)))
break;

- cr0 = vcpu->arch.cr0 & ~SVM_CR0_SELECTIVE_MASK;
+ cr0 = vcpu->arch.private->cr0 & ~SVM_CR0_SELECTIVE_MASK;
val = info->src_val & ~SVM_CR0_SELECTIVE_MASK;

if (info->intercept == x86_intercept_lmsw) {
@@ -4358,9 +4358,9 @@ static int svm_enter_smm(struct kvm_vcpu *vcpu, char *smstate)
/* FEE0h - SVM Guest VMCB Physical Address */
put_smstate(u64, smstate, 0x7ee0, svm->nested.vmcb12_gpa);

- svm->vmcb->save.rax = vcpu->arch.regs[VCPU_REGS_RAX];
- svm->vmcb->save.rsp = vcpu->arch.regs[VCPU_REGS_RSP];
- svm->vmcb->save.rip = vcpu->arch.regs[VCPU_REGS_RIP];
+ svm->vmcb->save.rax = vcpu->arch.private->regs[VCPU_REGS_RAX];
+ svm->vmcb->save.rsp = vcpu->arch.private->regs[VCPU_REGS_RSP];
+ svm->vmcb->save.rip = vcpu->arch.private->regs[VCPU_REGS_RIP];

ret = nested_svm_vmexit(svm);
if (ret)
diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h
index 953b0fcb21ee..2dc906dc9c13 100644
--- a/arch/x86/kvm/trace.h
+++ b/arch/x86/kvm/trace.h
@@ -791,13 +791,13 @@ TRACE_EVENT(kvm_emulate_insn,

TP_fast_assign(
__entry->csbase = static_call(kvm_x86_get_segment_base)(vcpu, VCPU_SREG_CS);
- __entry->len = vcpu->arch.emulate_ctxt->fetch.ptr
- - vcpu->arch.emulate_ctxt->fetch.data;
- __entry->rip = vcpu->arch.emulate_ctxt->_eip - __entry->len;
+ __entry->len = vcpu->arch.private->emulate_ctxt->fetch.ptr
+ - vcpu->arch.private->emulate_ctxt->fetch.data;
+ __entry->rip = vcpu->arch.private->emulate_ctxt->_eip - __entry->len;
memcpy(__entry->insn,
- vcpu->arch.emulate_ctxt->fetch.data,
+ vcpu->arch.private->emulate_ctxt->fetch.data,
15);
- __entry->flags = kei_decode_mode(vcpu->arch.emulate_ctxt->mode);
+ __entry->flags = kei_decode_mode(vcpu->arch.private->emulate_ctxt->mode);
__entry->failed = failed;
),

diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 0a0092e4102d..34b7621adf99 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -313,7 +313,7 @@ static void free_nested(struct kvm_vcpu *vcpu)
kvm_vcpu_unmap(vcpu, &vmx->nested.pi_desc_map, true);
vmx->nested.pi_desc = NULL;

- kvm_mmu_free_roots(vcpu, &vcpu->arch.guest_mmu, KVM_MMU_ROOTS_ALL);
+ kvm_mmu_free_roots(vcpu, &vcpu->arch.private->guest_mmu, KVM_MMU_ROOTS_ALL);

nested_release_evmcs(vcpu);

@@ -356,11 +356,11 @@ static void nested_ept_invalidate_addr(struct kvm_vcpu *vcpu, gpa_t eptp,
WARN_ON_ONCE(!mmu_is_nested(vcpu));

for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++) {
- cached_root = &vcpu->arch.mmu->prev_roots[i];
+ cached_root = &vcpu->arch.private->mmu->prev_roots[i];

if (nested_ept_root_matches(cached_root->hpa, cached_root->pgd,
eptp))
- vcpu->arch.mmu->invlpg(vcpu, addr, cached_root->hpa);
+ vcpu->arch.private->mmu->invlpg(vcpu, addr, cached_root->hpa);
}
}

@@ -410,19 +410,19 @@ static void nested_ept_init_mmu_context(struct kvm_vcpu *vcpu)
{
WARN_ON(mmu_is_nested(vcpu));

- vcpu->arch.mmu = &vcpu->arch.guest_mmu;
+ vcpu->arch.private->mmu = &vcpu->arch.private->guest_mmu;
nested_ept_new_eptp(vcpu);
- vcpu->arch.mmu->get_guest_pgd = nested_ept_get_eptp;
- vcpu->arch.mmu->inject_page_fault = nested_ept_inject_page_fault;
- vcpu->arch.mmu->get_pdptr = kvm_pdptr_read;
+ vcpu->arch.private->mmu->get_guest_pgd = nested_ept_get_eptp;
+ vcpu->arch.private->mmu->inject_page_fault = nested_ept_inject_page_fault;
+ vcpu->arch.private->mmu->get_pdptr = kvm_pdptr_read;

- vcpu->arch.walk_mmu = &vcpu->arch.nested_mmu;
+ vcpu->arch.private->walk_mmu = &vcpu->arch.private->nested_mmu;
}

static void nested_ept_uninit_mmu_context(struct kvm_vcpu *vcpu)
{
- vcpu->arch.mmu = &vcpu->arch.root_mmu;
- vcpu->arch.walk_mmu = &vcpu->arch.root_mmu;
+ vcpu->arch.private->mmu = &vcpu->arch.private->root_mmu;
+ vcpu->arch.private->walk_mmu = &vcpu->arch.private->root_mmu;
}

static bool nested_vmx_is_page_fault_vmexit(struct vmcs12 *vmcs12,
@@ -456,7 +456,7 @@ static int nested_vmx_check_exception(struct kvm_vcpu *vcpu, unsigned long *exit
}
if (nested_vmx_is_page_fault_vmexit(vmcs12,
vcpu->arch.exception.error_code)) {
- *exit_qual = has_payload ? payload : vcpu->arch.cr2;
+ *exit_qual = has_payload ? payload : vcpu->arch.private->cr2;
return 1;
}
} else if (vmcs12->exception_bitmap & (1u << nr)) {
@@ -1103,7 +1103,7 @@ static int nested_vmx_load_cr3(struct kvm_vcpu *vcpu, unsigned long cr3,
* must not be dereferenced.
*/
if (reload_pdptrs && !nested_ept && is_pae_paging(vcpu) &&
- CC(!load_pdptrs(vcpu, vcpu->arch.walk_mmu, cr3))) {
+ CC(!load_pdptrs(vcpu, vcpu->arch.private->walk_mmu, cr3))) {
*entry_failure_code = ENTRY_FAIL_PDPTE;
return -EINVAL;
}
@@ -1111,7 +1111,7 @@ static int nested_vmx_load_cr3(struct kvm_vcpu *vcpu, unsigned long cr3,
if (!nested_ept)
kvm_mmu_new_pgd(vcpu, cr3);

- vcpu->arch.cr3 = cr3;
+ vcpu->arch.private->cr3 = cr3;
kvm_register_mark_available(vcpu, VCPU_EXREG_CR3);

/* Re-initialize the MMU, e.g. to pick up CR4 MMU role changes. */
@@ -2508,8 +2508,8 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
* trap. Note that CR0.TS also needs updating - we do this later.
*/
vmx_update_exception_bitmap(vcpu);
- vcpu->arch.cr0_guest_owned_bits &= ~vmcs12->cr0_guest_host_mask;
- vmcs_writel(CR0_GUEST_HOST_MASK, ~vcpu->arch.cr0_guest_owned_bits);
+ vcpu->arch.private->cr0_guest_owned_bits &= ~vmcs12->cr0_guest_host_mask;
+ vmcs_writel(CR0_GUEST_HOST_MASK, ~vcpu->arch.private->cr0_guest_owned_bits);

if (vmx->nested.nested_run_pending &&
(vmcs12->vm_entry_controls & VM_ENTRY_LOAD_IA32_PAT)) {
@@ -2595,7 +2595,7 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
}

if (!enable_ept)
- vcpu->arch.walk_mmu->inject_page_fault = vmx_inject_page_fault_nested;
+ vcpu->arch.private->walk_mmu->inject_page_fault = vmx_inject_page_fault_nested;

if ((vmcs12->vm_entry_controls & VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL) &&
WARN_ON_ONCE(kvm_set_msr(vcpu, MSR_CORE_PERF_GLOBAL_CTRL,
@@ -3070,7 +3070,7 @@ static int nested_vmx_check_vmentry_hw(struct kvm_vcpu *vcpu)
vmx->loaded_vmcs->host_state.cr4 = cr4;
}

- vm_fail = __vmx_vcpu_run(vmx, (unsigned long *)&vcpu->arch.regs,
+ vm_fail = __vmx_vcpu_run(vmx, (unsigned long *)&vcpu->arch.private->regs,
vmx->loaded_vmcs->launched);

if (vmx->msr_autoload.host.nr)
@@ -3153,7 +3153,7 @@ static bool nested_get_vmcs12_pages(struct kvm_vcpu *vcpu)
* the guest CR3 might be restored prior to setting the nested
* state which can lead to a load of wrong PDPTRs.
*/
- if (CC(!load_pdptrs(vcpu, vcpu->arch.walk_mmu, vcpu->arch.cr3)))
+ if (CC(!load_pdptrs(vcpu, vcpu->arch.private->walk_mmu, vcpu->arch.private->cr3)))
return false;
}

@@ -3370,18 +3370,18 @@ enum nvmx_vmentry_status nested_vmx_enter_non_root_mode(struct kvm_vcpu *vcpu,
* i.e. a VM-Fail detected by hardware but not KVM, KVM must unwind its
* software model to the pre-VMEntry host state. When EPT is disabled,
* GUEST_CR3 holds KVM's shadow CR3, not L1's "real" CR3, which causes
- * nested_vmx_restore_host_state() to corrupt vcpu->arch.cr3. Stuffing
- * vmcs01.GUEST_CR3 results in the unwind naturally setting arch.cr3 to
+ * nested_vmx_restore_host_state() to corrupt vcpu->arch.private->cr3. Stuffing
+ * vmcs01.GUEST_CR3 results in the unwind naturally setting arch.private->cr3 to
* the correct value. Smashing vmcs01.GUEST_CR3 is safe because nested
* VM-Exits, and the unwind, reset KVM's MMU, i.e. vmcs01.GUEST_CR3 is
* guaranteed to be overwritten with a shadow CR3 prior to re-entering
* L1. Don't stuff vmcs01.GUEST_CR3 when using nested early checks as
- * KVM modifies vcpu->arch.cr3 if and only if the early hardware checks
+ * KVM modifies vcpu->arch.private->cr3 if and only if the early hardware checks
* pass, and early VM-Fails do not reset KVM's MMU, i.e. the VM-Fail
* path would need to manually save/restore vmcs01.GUEST_CR3.
*/
if (!enable_ept && !nested_early_check)
- vmcs_writel(GUEST_CR3, vcpu->arch.cr3);
+ vmcs_writel(GUEST_CR3, vcpu->arch.private->cr3);

vmx_switch_vmcs(vcpu, &vmx->nested.vmcs02);

@@ -3655,20 +3655,20 @@ static inline unsigned long
vmcs12_guest_cr0(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
{
return
- /*1*/ (vmcs_readl(GUEST_CR0) & vcpu->arch.cr0_guest_owned_bits) |
+ /*1*/ (vmcs_readl(GUEST_CR0) & vcpu->arch.private->cr0_guest_owned_bits) |
/*2*/ (vmcs12->guest_cr0 & vmcs12->cr0_guest_host_mask) |
/*3*/ (vmcs_readl(CR0_READ_SHADOW) & ~(vmcs12->cr0_guest_host_mask |
- vcpu->arch.cr0_guest_owned_bits));
+ vcpu->arch.private->cr0_guest_owned_bits));
}

static inline unsigned long
vmcs12_guest_cr4(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
{
return
- /*1*/ (vmcs_readl(GUEST_CR4) & vcpu->arch.cr4_guest_owned_bits) |
+ /*1*/ (vmcs_readl(GUEST_CR4) & vcpu->arch.private->cr4_guest_owned_bits) |
/*2*/ (vmcs12->guest_cr4 & vmcs12->cr4_guest_host_mask) |
/*3*/ (vmcs_readl(CR4_READ_SHADOW) & ~(vmcs12->cr4_guest_host_mask |
- vcpu->arch.cr4_guest_owned_bits));
+ vcpu->arch.private->cr4_guest_owned_bits));
}

static void vmcs12_save_pending_event(struct kvm_vcpu *vcpu,
@@ -4255,11 +4255,11 @@ static void load_vmcs12_host_state(struct kvm_vcpu *vcpu,
* CR0_GUEST_HOST_MASK is already set in the original vmcs01
* (KVM doesn't change it);
*/
- vcpu->arch.cr0_guest_owned_bits = KVM_POSSIBLE_CR0_GUEST_BITS;
+ vcpu->arch.private->cr0_guest_owned_bits = KVM_POSSIBLE_CR0_GUEST_BITS;
vmx_set_cr0(vcpu, vmcs12->host_cr0);

/* Same as above - no reason to call set_cr4_guest_host_mask(). */
- vcpu->arch.cr4_guest_owned_bits = ~vmcs_readl(CR4_GUEST_HOST_MASK);
+ vcpu->arch.private->cr4_guest_owned_bits = ~vmcs_readl(CR4_GUEST_HOST_MASK);
vmx_set_cr4(vcpu, vmcs12->host_cr4);

nested_ept_uninit_mmu_context(vcpu);
@@ -4405,14 +4405,14 @@ static void nested_vmx_restore_host_state(struct kvm_vcpu *vcpu)
*/
vmx_set_efer(vcpu, nested_vmx_get_vmcs01_guest_efer(vmx));

- vcpu->arch.cr0_guest_owned_bits = KVM_POSSIBLE_CR0_GUEST_BITS;
+ vcpu->arch.private->cr0_guest_owned_bits = KVM_POSSIBLE_CR0_GUEST_BITS;
vmx_set_cr0(vcpu, vmcs_readl(CR0_READ_SHADOW));

- vcpu->arch.cr4_guest_owned_bits = ~vmcs_readl(CR4_GUEST_HOST_MASK);
+ vcpu->arch.private->cr4_guest_owned_bits = ~vmcs_readl(CR4_GUEST_HOST_MASK);
vmx_set_cr4(vcpu, vmcs_readl(CR4_READ_SHADOW));

nested_ept_uninit_mmu_context(vcpu);
- vcpu->arch.cr3 = vmcs_readl(GUEST_CR3);
+ vcpu->arch.private->cr3 = vmcs_readl(GUEST_CR3);
kvm_register_mark_available(vcpu, VCPU_EXREG_CR3);

/*
@@ -5000,7 +5000,7 @@ static inline void nested_release_vmcs12(struct kvm_vcpu *vcpu)
vmx->nested.current_vmptr >> PAGE_SHIFT,
vmx->nested.cached_vmcs12, 0, VMCS12_SIZE);

- kvm_mmu_free_roots(vcpu, &vcpu->arch.guest_mmu, KVM_MMU_ROOTS_ALL);
+ kvm_mmu_free_roots(vcpu, &vcpu->arch.private->guest_mmu, KVM_MMU_ROOTS_ALL);

vmx->nested.current_vmptr = INVALID_GPA;
}
@@ -5427,7 +5427,7 @@ static int handle_invept(struct kvm_vcpu *vcpu)
* Nested EPT roots are always held through guest_mmu,
* not root_mmu.
*/
- mmu = &vcpu->arch.guest_mmu;
+ mmu = &vcpu->arch.private->guest_mmu;

switch (type) {
case VMX_EPT_EXTENT_CONTEXT:
@@ -5545,7 +5545,7 @@ static int handle_invvpid(struct kvm_vcpu *vcpu)
* TODO: sync only the affected SPTEs for INVDIVIDUAL_ADDR.
*/
if (!enable_ept)
- kvm_mmu_free_guest_mode_roots(vcpu, &vcpu->arch.root_mmu);
+ kvm_mmu_free_guest_mode_roots(vcpu, &vcpu->arch.private->root_mmu);

return nested_vmx_succeed(vcpu);
}
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 6e1bb017b696..beba656116d7 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -2242,20 +2242,20 @@ static void vmx_cache_reg(struct kvm_vcpu *vcpu, enum kvm_reg reg)

switch (reg) {
case VCPU_REGS_RSP:
- vcpu->arch.regs[VCPU_REGS_RSP] = vmcs_readl(GUEST_RSP);
+ vcpu->arch.private->regs[VCPU_REGS_RSP] = vmcs_readl(GUEST_RSP);
break;
case VCPU_REGS_RIP:
- vcpu->arch.regs[VCPU_REGS_RIP] = vmcs_readl(GUEST_RIP);
+ vcpu->arch.private->regs[VCPU_REGS_RIP] = vmcs_readl(GUEST_RIP);
break;
case VCPU_EXREG_PDPTR:
if (enable_ept)
ept_save_pdptrs(vcpu);
break;
case VCPU_EXREG_CR0:
- guest_owned_bits = vcpu->arch.cr0_guest_owned_bits;
+ guest_owned_bits = vcpu->arch.private->cr0_guest_owned_bits;

- vcpu->arch.cr0 &= ~guest_owned_bits;
- vcpu->arch.cr0 |= vmcs_readl(GUEST_CR0) & guest_owned_bits;
+ vcpu->arch.private->cr0 &= ~guest_owned_bits;
+ vcpu->arch.private->cr0 |= vmcs_readl(GUEST_CR0) & guest_owned_bits;
break;
case VCPU_EXREG_CR3:
/*
@@ -2263,13 +2263,13 @@ static void vmx_cache_reg(struct kvm_vcpu *vcpu, enum kvm_reg reg)
* CR3 is loaded into hardware, not the guest's CR3.
*/
if (!(exec_controls_get(to_vmx(vcpu)) & CPU_BASED_CR3_LOAD_EXITING))
- vcpu->arch.cr3 = vmcs_readl(GUEST_CR3);
+ vcpu->arch.private->cr3 = vmcs_readl(GUEST_CR3);
break;
case VCPU_EXREG_CR4:
- guest_owned_bits = vcpu->arch.cr4_guest_owned_bits;
+ guest_owned_bits = vcpu->arch.private->cr4_guest_owned_bits;

- vcpu->arch.cr4 &= ~guest_owned_bits;
- vcpu->arch.cr4 |= vmcs_readl(GUEST_CR4) & guest_owned_bits;
+ vcpu->arch.private->cr4 &= ~guest_owned_bits;
+ vcpu->arch.private->cr4 |= vmcs_readl(GUEST_CR4) & guest_owned_bits;
break;
default:
KVM_BUG_ON(1, vcpu->kvm);
@@ -2926,7 +2926,7 @@ static inline int vmx_get_current_vpid(struct kvm_vcpu *vcpu)

static void vmx_flush_tlb_current(struct kvm_vcpu *vcpu)
{
- struct kvm_mmu *mmu = vcpu->arch.mmu;
+ struct kvm_mmu *mmu = vcpu->arch.private->mmu;
u64 root_hpa = mmu->root_hpa;

/* No flush required if the current context is invalid. */
@@ -2963,7 +2963,7 @@ static void vmx_flush_tlb_guest(struct kvm_vcpu *vcpu)

void vmx_ept_load_pdptrs(struct kvm_vcpu *vcpu)
{
- struct kvm_mmu *mmu = vcpu->arch.walk_mmu;
+ struct kvm_mmu *mmu = vcpu->arch.private->walk_mmu;

if (!kvm_register_is_dirty(vcpu, VCPU_EXREG_PDPTR))
return;
@@ -2978,7 +2978,7 @@ void vmx_ept_load_pdptrs(struct kvm_vcpu *vcpu)

void ept_save_pdptrs(struct kvm_vcpu *vcpu)
{
- struct kvm_mmu *mmu = vcpu->arch.walk_mmu;
+ struct kvm_mmu *mmu = vcpu->arch.private->walk_mmu;

if (WARN_ON_ONCE(!is_pae_paging(vcpu)))
return;
@@ -3019,7 +3019,7 @@ void vmx_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0)

vmcs_writel(CR0_READ_SHADOW, cr0);
vmcs_writel(GUEST_CR0, hw_cr0);
- vcpu->arch.cr0 = cr0;
+ vcpu->arch.private->cr0 = cr0;
kvm_register_mark_available(vcpu, VCPU_EXREG_CR0);

#ifdef CONFIG_X86_64
@@ -3067,12 +3067,12 @@ void vmx_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0)
exec_controls_set(vmx, tmp);
}

- /* Note, vmx_set_cr4() consumes the new vcpu->arch.cr0. */
+ /* Note, vmx_set_cr4() consumes the new vcpu->arch.private->cr0. */
if ((old_cr0_pg ^ cr0) & X86_CR0_PG)
vmx_set_cr4(vcpu, kvm_read_cr4(vcpu));
}

- /* depends on vcpu->arch.cr0 to be set to a new value */
+ /* depends on vcpu->arch.private->cr0 to be set to a new value */
vmx->emulation_required = vmx_emulation_required(vcpu);
}

@@ -3114,7 +3114,7 @@ static void vmx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa,
if (!enable_unrestricted_guest && !is_paging(vcpu))
guest_cr3 = to_kvm_vmx(kvm)->ept_identity_map_addr;
else if (test_bit(VCPU_EXREG_CR3, (ulong *)&vcpu->arch.regs_avail))
- guest_cr3 = vcpu->arch.cr3;
+ guest_cr3 = vcpu->arch.private->cr3;
else /* vmcs01.GUEST_CR3 is already up-to-date. */
update_guest_cr3 = false;
vmx_ept_load_pdptrs(vcpu);
@@ -3144,7 +3144,7 @@ static bool vmx_is_valid_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)

void vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
{
- unsigned long old_cr4 = vcpu->arch.cr4;
+ unsigned long old_cr4 = vcpu->arch.private->cr4;
struct vcpu_vmx *vmx = to_vmx(vcpu);
/*
* Pass through host's Machine Check Enable value to hw_cr4, which
@@ -3171,7 +3171,7 @@ void vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
}
}

- vcpu->arch.cr4 = cr4;
+ vcpu->arch.private->cr4 = cr4;
kvm_register_mark_available(vcpu, VCPU_EXREG_CR4);

if (!is_unrestricted_guest(vcpu)) {
@@ -4040,14 +4040,14 @@ void set_cr4_guest_host_mask(struct vcpu_vmx *vmx)
{
struct kvm_vcpu *vcpu = &vmx->vcpu;

- vcpu->arch.cr4_guest_owned_bits = KVM_POSSIBLE_CR4_GUEST_BITS &
- ~vcpu->arch.cr4_guest_rsvd_bits;
+ vcpu->arch.private->cr4_guest_owned_bits = KVM_POSSIBLE_CR4_GUEST_BITS &
+ ~vcpu->arch.private->cr4_guest_rsvd_bits;
if (!enable_ept)
- vcpu->arch.cr4_guest_owned_bits &= ~X86_CR4_PGE;
+ vcpu->arch.private->cr4_guest_owned_bits &= ~X86_CR4_PGE;
if (is_guest_mode(&vmx->vcpu))
- vcpu->arch.cr4_guest_owned_bits &=
+ vcpu->arch.private->cr4_guest_owned_bits &=
~get_vmcs12(vcpu)->cr4_guest_host_mask;
- vmcs_writel(CR4_GUEST_HOST_MASK, ~vcpu->arch.cr4_guest_owned_bits);
+ vmcs_writel(CR4_GUEST_HOST_MASK, ~vcpu->arch.private->cr4_guest_owned_bits);
}

static u32 vmx_pin_based_exec_ctrl(struct vcpu_vmx *vmx)
@@ -4345,8 +4345,8 @@ static void init_vmcs(struct vcpu_vmx *vmx)
/* 22.2.1, 20.8.1 */
vm_entry_controls_set(vmx, vmx_vmentry_ctrl());

- vmx->vcpu.arch.cr0_guest_owned_bits = KVM_POSSIBLE_CR0_GUEST_BITS;
- vmcs_writel(CR0_GUEST_HOST_MASK, ~vmx->vcpu.arch.cr0_guest_owned_bits);
+ vmx->vcpu.arch.private->cr0_guest_owned_bits = KVM_POSSIBLE_CR0_GUEST_BITS;
+ vmcs_writel(CR0_GUEST_HOST_MASK, ~vmx->vcpu.arch.private->cr0_guest_owned_bits);

set_cr4_guest_host_mask(vmx);

@@ -4956,7 +4956,7 @@ static int handle_set_cr4(struct kvm_vcpu *vcpu, unsigned long val)

static int handle_desc(struct kvm_vcpu *vcpu)
{
- WARN_ON(!(vcpu->arch.cr4 & X86_CR4_UMIP));
+ WARN_ON(!(vcpu->arch.private->cr4 & X86_CR4_UMIP));
return kvm_emulate_instruction(vcpu, 0);
}

@@ -6626,13 +6626,13 @@ static noinstr void vmx_vcpu_enter_exit(struct kvm_vcpu *vcpu,
vmx->loaded_vmcs->host_state.cr3 = cr3;
}

- if (vcpu->arch.cr2 != native_read_cr2())
- native_write_cr2(vcpu->arch.cr2);
+ if (vcpu->arch.private->cr2 != native_read_cr2())
+ native_write_cr2(vcpu->arch.private->cr2);

- vmx->fail = __vmx_vcpu_run(vmx, (unsigned long *)&vcpu->arch.regs,
+ vmx->fail = __vmx_vcpu_run(vmx, (unsigned long *)&vcpu->arch.private->regs,
vmx->loaded_vmcs->launched);

- vcpu->arch.cr2 = native_read_cr2();
+ vcpu->arch.private->cr2 = native_read_cr2();

VM_WARN_ON_ONCE(vcpu->kvm->asi && !is_asi_active());
asi_set_target_unrestricted();
@@ -6681,9 +6681,9 @@ static fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu)
WARN_ON_ONCE(vmx->nested.need_vmcs12_to_shadow_sync);

if (kvm_register_is_dirty(vcpu, VCPU_REGS_RSP))
- vmcs_writel(GUEST_RSP, vcpu->arch.regs[VCPU_REGS_RSP]);
+ vmcs_writel(GUEST_RSP, vcpu->arch.private->regs[VCPU_REGS_RSP]);
if (kvm_register_is_dirty(vcpu, VCPU_REGS_RIP))
- vmcs_writel(GUEST_RIP, vcpu->arch.regs[VCPU_REGS_RIP]);
+ vmcs_writel(GUEST_RIP, vcpu->arch.private->regs[VCPU_REGS_RIP]);

cr4 = cr4_read_shadow();
if (unlikely(cr4 != vmx->loaded_vmcs->host_state.cr4)) {
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index dd862edc1b5a..680725089a18 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -595,7 +595,7 @@ void kvm_deliver_exception_payload(struct kvm_vcpu *vcpu)
vcpu->arch.dr6 &= ~BIT(12);
break;
case PF_VECTOR:
- vcpu->arch.cr2 = payload;
+ vcpu->arch.private->cr2 = payload;
break;
}

@@ -736,8 +736,8 @@ bool kvm_inject_emulated_page_fault(struct kvm_vcpu *vcpu,
struct kvm_mmu *fault_mmu;
WARN_ON_ONCE(fault->vector != PF_VECTOR);

- fault_mmu = fault->nested_page_fault ? vcpu->arch.mmu :
- vcpu->arch.walk_mmu;
+ fault_mmu = fault->nested_page_fault ? vcpu->arch.private->mmu :
+ vcpu->arch.private->walk_mmu;

/*
* Invalidate the TLB entry for the faulting address, if it exists,
@@ -892,7 +892,7 @@ int kvm_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0)
#endif
if (!(vcpu->arch.efer & EFER_LME) && (cr0 & X86_CR0_PG) &&
is_pae(vcpu) && ((cr0 ^ old_cr0) & pdptr_bits) &&
- !load_pdptrs(vcpu, vcpu->arch.walk_mmu, kvm_read_cr3(vcpu)))
+ !load_pdptrs(vcpu, vcpu->arch.private->walk_mmu, kvm_read_cr3(vcpu)))
return 1;

if (!(cr0 & X86_CR0_PG) &&
@@ -920,8 +920,8 @@ void kvm_load_guest_xsave_state(struct kvm_vcpu *vcpu)

if (kvm_read_cr4_bits(vcpu, X86_CR4_OSXSAVE)) {

- if (vcpu->arch.xcr0 != host_xcr0)
- xsetbv(XCR_XFEATURE_ENABLED_MASK, vcpu->arch.xcr0);
+ if (vcpu->arch.private->xcr0 != host_xcr0)
+ xsetbv(XCR_XFEATURE_ENABLED_MASK, vcpu->arch.private->xcr0);

if (vcpu->arch.xsaves_enabled &&
vcpu->arch.ia32_xss != host_xss)
@@ -930,7 +930,7 @@ void kvm_load_guest_xsave_state(struct kvm_vcpu *vcpu)

if (static_cpu_has(X86_FEATURE_PKU) &&
(kvm_read_cr4_bits(vcpu, X86_CR4_PKE) ||
- (vcpu->arch.xcr0 & XFEATURE_MASK_PKRU)) &&
+ (vcpu->arch.private->xcr0 & XFEATURE_MASK_PKRU)) &&
vcpu->arch.pkru != vcpu->arch.host_pkru)
write_pkru(vcpu->arch.pkru);
}
@@ -943,7 +943,7 @@ void kvm_load_host_xsave_state(struct kvm_vcpu *vcpu)

if (static_cpu_has(X86_FEATURE_PKU) &&
(kvm_read_cr4_bits(vcpu, X86_CR4_PKE) ||
- (vcpu->arch.xcr0 & XFEATURE_MASK_PKRU))) {
+ (vcpu->arch.private->xcr0 & XFEATURE_MASK_PKRU))) {
vcpu->arch.pkru = rdpkru();
if (vcpu->arch.pkru != vcpu->arch.host_pkru)
write_pkru(vcpu->arch.host_pkru);
@@ -951,7 +951,7 @@ void kvm_load_host_xsave_state(struct kvm_vcpu *vcpu)

if (kvm_read_cr4_bits(vcpu, X86_CR4_OSXSAVE)) {

- if (vcpu->arch.xcr0 != host_xcr0)
+ if (vcpu->arch.private->xcr0 != host_xcr0)
xsetbv(XCR_XFEATURE_ENABLED_MASK, host_xcr0);

if (vcpu->arch.xsaves_enabled &&
@@ -965,7 +965,7 @@ EXPORT_SYMBOL_GPL(kvm_load_host_xsave_state);
static int __kvm_set_xcr(struct kvm_vcpu *vcpu, u32 index, u64 xcr)
{
u64 xcr0 = xcr;
- u64 old_xcr0 = vcpu->arch.xcr0;
+ u64 old_xcr0 = vcpu->arch.private->xcr0;
u64 valid_bits;

/* Only support XCR_XFEATURE_ENABLED_MASK(xcr0) now */
@@ -981,7 +981,7 @@ static int __kvm_set_xcr(struct kvm_vcpu *vcpu, u32 index, u64 xcr)
* saving. However, xcr0 bit 0 is always set, even if the
* emulated CPU does not support XSAVE (see kvm_vcpu_reset()).
*/
- valid_bits = vcpu->arch.guest_supported_xcr0 | XFEATURE_MASK_FP;
+ valid_bits = vcpu->arch.private->guest_supported_xcr0 | XFEATURE_MASK_FP;
if (xcr0 & ~valid_bits)
return 1;

@@ -995,7 +995,7 @@ static int __kvm_set_xcr(struct kvm_vcpu *vcpu, u32 index, u64 xcr)
if ((xcr0 & XFEATURE_MASK_AVX512) != XFEATURE_MASK_AVX512)
return 1;
}
- vcpu->arch.xcr0 = xcr0;
+ vcpu->arch.private->xcr0 = xcr0;

if ((xcr0 ^ old_xcr0) & XFEATURE_MASK_EXTEND)
kvm_update_cpuid_runtime(vcpu);
@@ -1019,7 +1019,7 @@ bool kvm_is_valid_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
if (cr4 & cr4_reserved_bits)
return false;

- if (cr4 & vcpu->arch.cr4_guest_rsvd_bits)
+ if (cr4 & vcpu->arch.private->cr4_guest_rsvd_bits)
return false;

return static_call(kvm_x86_is_valid_cr4)(vcpu, cr4);
@@ -1069,7 +1069,7 @@ int kvm_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
return 1;
} else if (is_paging(vcpu) && (cr4 & X86_CR4_PAE)
&& ((cr4 ^ old_cr4) & pdptr_bits)
- && !load_pdptrs(vcpu, vcpu->arch.walk_mmu,
+ && !load_pdptrs(vcpu, vcpu->arch.private->walk_mmu,
kvm_read_cr3(vcpu)))
return 1;

@@ -1092,7 +1092,7 @@ EXPORT_SYMBOL_GPL(kvm_set_cr4);

static void kvm_invalidate_pcid(struct kvm_vcpu *vcpu, unsigned long pcid)
{
- struct kvm_mmu *mmu = vcpu->arch.mmu;
+ struct kvm_mmu *mmu = vcpu->arch.private->mmu;
unsigned long roots_to_free = 0;
int i;

@@ -1159,13 +1159,13 @@ int kvm_set_cr3(struct kvm_vcpu *vcpu, unsigned long cr3)
if (kvm_vcpu_is_illegal_gpa(vcpu, cr3))
return 1;

- if (is_pae_paging(vcpu) && !load_pdptrs(vcpu, vcpu->arch.walk_mmu, cr3))
+ if (is_pae_paging(vcpu) && !load_pdptrs(vcpu, vcpu->arch.private->walk_mmu, cr3))
return 1;

if (cr3 != kvm_read_cr3(vcpu))
kvm_mmu_new_pgd(vcpu, cr3);

- vcpu->arch.cr3 = cr3;
+ vcpu->arch.private->cr3 = cr3;
kvm_register_mark_available(vcpu, VCPU_EXREG_CR3);

handle_tlb_flush:
@@ -1190,7 +1190,7 @@ int kvm_set_cr8(struct kvm_vcpu *vcpu, unsigned long cr8)
if (lapic_in_kernel(vcpu))
kvm_lapic_set_tpr(vcpu, cr8);
else
- vcpu->arch.cr8 = cr8;
+ vcpu->arch.private->cr8 = cr8;
return 0;
}
EXPORT_SYMBOL_GPL(kvm_set_cr8);
@@ -1200,7 +1200,7 @@ unsigned long kvm_get_cr8(struct kvm_vcpu *vcpu)
if (lapic_in_kernel(vcpu))
return kvm_lapic_get_cr8(vcpu);
else
- return vcpu->arch.cr8;
+ return vcpu->arch.private->cr8;
}
EXPORT_SYMBOL_GPL(kvm_get_cr8);

@@ -4849,10 +4849,10 @@ static int kvm_vcpu_ioctl_x86_set_debugregs(struct kvm_vcpu *vcpu,
static void kvm_vcpu_ioctl_x86_get_xsave(struct kvm_vcpu *vcpu,
struct kvm_xsave *guest_xsave)
{
- if (fpstate_is_confidential(&vcpu->arch.guest_fpu))
+ if (fpstate_is_confidential(&vcpu->arch.private->guest_fpu))
return;

- fpu_copy_guest_fpstate_to_uabi(&vcpu->arch.guest_fpu,
+ fpu_copy_guest_fpstate_to_uabi(&vcpu->arch.private->guest_fpu,
guest_xsave->region,
sizeof(guest_xsave->region),
vcpu->arch.pkru);
@@ -4861,10 +4861,10 @@ static void kvm_vcpu_ioctl_x86_get_xsave(struct kvm_vcpu *vcpu,
static int kvm_vcpu_ioctl_x86_set_xsave(struct kvm_vcpu *vcpu,
struct kvm_xsave *guest_xsave)
{
- if (fpstate_is_confidential(&vcpu->arch.guest_fpu))
+ if (fpstate_is_confidential(&vcpu->arch.private->guest_fpu))
return 0;

- return fpu_copy_uabi_to_guest_fpstate(&vcpu->arch.guest_fpu,
+ return fpu_copy_uabi_to_guest_fpstate(&vcpu->arch.private->guest_fpu,
guest_xsave->region,
supported_xcr0, &vcpu->arch.pkru);
}
@@ -4880,7 +4880,7 @@ static void kvm_vcpu_ioctl_x86_get_xcrs(struct kvm_vcpu *vcpu,
guest_xcrs->nr_xcrs = 1;
guest_xcrs->flags = 0;
guest_xcrs->xcrs[0].xcr = XCR_XFEATURE_ENABLED_MASK;
- guest_xcrs->xcrs[0].value = vcpu->arch.xcr0;
+ guest_xcrs->xcrs[0].value = vcpu->arch.private->xcr0;
}

static int kvm_vcpu_ioctl_x86_set_xcrs(struct kvm_vcpu *vcpu,
@@ -6516,7 +6516,7 @@ gpa_t translate_nested_gpa(struct kvm_vcpu *vcpu, gpa_t gpa, u32 access,

/* NPT walks are always user-walks */
access |= PFERR_USER_MASK;
- t_gpa = vcpu->arch.mmu->gva_to_gpa(vcpu, gpa, access, exception);
+ t_gpa = vcpu->arch.private->mmu->gva_to_gpa(vcpu, gpa, access, exception);

return t_gpa;
}
@@ -6525,7 +6525,7 @@ gpa_t kvm_mmu_gva_to_gpa_read(struct kvm_vcpu *vcpu, gva_t gva,
struct x86_exception *exception)
{
u32 access = (static_call(kvm_x86_get_cpl)(vcpu) == 3) ? PFERR_USER_MASK : 0;
- return vcpu->arch.walk_mmu->gva_to_gpa(vcpu, gva, access, exception);
+ return vcpu->arch.private->walk_mmu->gva_to_gpa(vcpu, gva, access, exception);
}
EXPORT_SYMBOL_GPL(kvm_mmu_gva_to_gpa_read);

@@ -6534,7 +6534,7 @@ EXPORT_SYMBOL_GPL(kvm_mmu_gva_to_gpa_read);
{
u32 access = (static_call(kvm_x86_get_cpl)(vcpu) == 3) ? PFERR_USER_MASK : 0;
access |= PFERR_FETCH_MASK;
- return vcpu->arch.walk_mmu->gva_to_gpa(vcpu, gva, access, exception);
+ return vcpu->arch.private->walk_mmu->gva_to_gpa(vcpu, gva, access, exception);
}

gpa_t kvm_mmu_gva_to_gpa_write(struct kvm_vcpu *vcpu, gva_t gva,
@@ -6542,7 +6542,7 @@ gpa_t kvm_mmu_gva_to_gpa_write(struct kvm_vcpu *vcpu, gva_t gva,
{
u32 access = (static_call(kvm_x86_get_cpl)(vcpu) == 3) ? PFERR_USER_MASK : 0;
access |= PFERR_WRITE_MASK;
- return vcpu->arch.walk_mmu->gva_to_gpa(vcpu, gva, access, exception);
+ return vcpu->arch.private->walk_mmu->gva_to_gpa(vcpu, gva, access, exception);
}
EXPORT_SYMBOL_GPL(kvm_mmu_gva_to_gpa_write);

@@ -6550,7 +6550,7 @@ EXPORT_SYMBOL_GPL(kvm_mmu_gva_to_gpa_write);
gpa_t kvm_mmu_gva_to_gpa_system(struct kvm_vcpu *vcpu, gva_t gva,
struct x86_exception *exception)
{
- return vcpu->arch.walk_mmu->gva_to_gpa(vcpu, gva, 0, exception);
+ return vcpu->arch.private->walk_mmu->gva_to_gpa(vcpu, gva, 0, exception);
}

static int kvm_read_guest_virt_helper(gva_t addr, void *val, unsigned int bytes,
@@ -6561,7 +6561,7 @@ static int kvm_read_guest_virt_helper(gva_t addr, void *val, unsigned int bytes,
int r = X86EMUL_CONTINUE;

while (bytes) {
- gpa_t gpa = vcpu->arch.walk_mmu->gva_to_gpa(vcpu, addr, access,
+ gpa_t gpa = vcpu->arch.private->walk_mmu->gva_to_gpa(vcpu, addr, access,
exception);
unsigned offset = addr & (PAGE_SIZE-1);
unsigned toread = min(bytes, (unsigned)PAGE_SIZE - offset);
@@ -6595,7 +6595,7 @@ static int kvm_fetch_guest_virt(struct x86_emulate_ctxt *ctxt,
int ret;

/* Inline kvm_read_guest_virt_helper for speed. */
- gpa_t gpa = vcpu->arch.walk_mmu->gva_to_gpa(vcpu, addr, access|PFERR_FETCH_MASK,
+ gpa_t gpa = vcpu->arch.private->walk_mmu->gva_to_gpa(vcpu, addr, access|PFERR_FETCH_MASK,
exception);
if (unlikely(gpa == UNMAPPED_GVA))
return X86EMUL_PROPAGATE_FAULT;
@@ -6659,7 +6659,7 @@ static int kvm_write_guest_virt_helper(gva_t addr, void *val, unsigned int bytes
int r = X86EMUL_CONTINUE;

while (bytes) {
- gpa_t gpa = vcpu->arch.walk_mmu->gva_to_gpa(vcpu, addr,
+ gpa_t gpa = vcpu->arch.private->walk_mmu->gva_to_gpa(vcpu, addr,
access,
exception);
unsigned offset = addr & (PAGE_SIZE-1);
@@ -6757,7 +6757,7 @@ static int vcpu_mmio_gva_to_gpa(struct kvm_vcpu *vcpu, unsigned long gva,
* shadow page table for L2 guest.
*/
if (vcpu_match_mmio_gva(vcpu, gva) && (!is_paging(vcpu) ||
- !permission_fault(vcpu, vcpu->arch.walk_mmu,
+ !permission_fault(vcpu, vcpu->arch.private->walk_mmu,
vcpu->arch.mmio_access, 0, access))) {
*gpa = vcpu->arch.mmio_gfn << PAGE_SHIFT |
(gva & (PAGE_SIZE - 1));
@@ -6765,7 +6765,7 @@ static int vcpu_mmio_gva_to_gpa(struct kvm_vcpu *vcpu, unsigned long gva,
return 1;
}

- *gpa = vcpu->arch.walk_mmu->gva_to_gpa(vcpu, gva, access, exception);
+ *gpa = vcpu->arch.private->walk_mmu->gva_to_gpa(vcpu, gva, access, exception);

if (*gpa == UNMAPPED_GVA)
return -1;
@@ -6867,7 +6867,7 @@ static int emulator_read_write_onepage(unsigned long addr, void *val,
int handled, ret;
bool write = ops->write;
struct kvm_mmio_fragment *frag;
- struct x86_emulate_ctxt *ctxt = vcpu->arch.emulate_ctxt;
+ struct x86_emulate_ctxt *ctxt = vcpu->arch.private->emulate_ctxt;

/*
* If the exit was due to a NPF we may already have a GPA.
@@ -7246,7 +7246,7 @@ static unsigned long emulator_get_cr(struct x86_emulate_ctxt *ctxt, int cr)
value = kvm_read_cr0(vcpu);
break;
case 2:
- value = vcpu->arch.cr2;
+ value = vcpu->arch.private->cr2;
break;
case 3:
value = kvm_read_cr3(vcpu);
@@ -7275,7 +7275,7 @@ static int emulator_set_cr(struct x86_emulate_ctxt *ctxt, int cr, ulong val)
res = kvm_set_cr0(vcpu, mk_cr_64(kvm_read_cr0(vcpu), val));
break;
case 2:
- vcpu->arch.cr2 = val;
+ vcpu->arch.private->cr2 = val;
break;
case 3:
res = kvm_set_cr3(vcpu, val);
@@ -7597,7 +7597,7 @@ static void toggle_interruptibility(struct kvm_vcpu *vcpu, u32 mask)

static bool inject_emulated_exception(struct kvm_vcpu *vcpu)
{
- struct x86_emulate_ctxt *ctxt = vcpu->arch.emulate_ctxt;
+ struct x86_emulate_ctxt *ctxt = vcpu->arch.private->emulate_ctxt;
if (ctxt->exception.vector == PF_VECTOR)
return kvm_inject_emulated_page_fault(vcpu, &ctxt->exception);

@@ -7621,14 +7621,14 @@ static struct x86_emulate_ctxt *alloc_emulate_ctxt(struct kvm_vcpu *vcpu)

ctxt->vcpu = vcpu;
ctxt->ops = &emulate_ops;
- vcpu->arch.emulate_ctxt = ctxt;
+ vcpu->arch.private->emulate_ctxt = ctxt;

return ctxt;
}

static void init_emulate_ctxt(struct kvm_vcpu *vcpu)
{
- struct x86_emulate_ctxt *ctxt = vcpu->arch.emulate_ctxt;
+ struct x86_emulate_ctxt *ctxt = vcpu->arch.private->emulate_ctxt;
int cs_db, cs_l;

static_call(kvm_x86_get_cs_db_l_bits)(vcpu, &cs_db, &cs_l);
@@ -7658,7 +7658,7 @@ static void init_emulate_ctxt(struct kvm_vcpu *vcpu)

void kvm_inject_realmode_interrupt(struct kvm_vcpu *vcpu, int irq, int inc_eip)
{
- struct x86_emulate_ctxt *ctxt = vcpu->arch.emulate_ctxt;
+ struct x86_emulate_ctxt *ctxt = vcpu->arch.private->emulate_ctxt;
int ret;

init_emulate_ctxt(vcpu);
@@ -7731,7 +7731,7 @@ static void prepare_emulation_failure_exit(struct kvm_vcpu *vcpu, u64 *data,

static void prepare_emulation_ctxt_failure_exit(struct kvm_vcpu *vcpu)
{
- struct x86_emulate_ctxt *ctxt = vcpu->arch.emulate_ctxt;
+ struct x86_emulate_ctxt *ctxt = vcpu->arch.private->emulate_ctxt;

prepare_emulation_failure_exit(vcpu, NULL, 0, ctxt->fetch.data,
ctxt->fetch.end - ctxt->fetch.data);
@@ -7792,7 +7792,7 @@ static bool reexecute_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
WARN_ON_ONCE(!(emulation_type & EMULTYPE_PF)))
return false;

- if (!vcpu->arch.mmu->direct_map) {
+ if (!vcpu->arch.private->mmu->direct_map) {
/*
* Write permission should be allowed since only
* write access need to be emulated.
@@ -7825,7 +7825,7 @@ static bool reexecute_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
kvm_release_pfn_clean(pfn);

/* The instructions are well-emulated on direct mmu. */
- if (vcpu->arch.mmu->direct_map) {
+ if (vcpu->arch.private->mmu->direct_map) {
unsigned int indirect_shadow_pages;

write_lock(&vcpu->kvm->mmu_lock);
@@ -7893,7 +7893,7 @@ static bool retry_instruction(struct x86_emulate_ctxt *ctxt,
vcpu->arch.last_retry_eip = ctxt->eip;
vcpu->arch.last_retry_addr = cr2_or_gpa;

- if (!vcpu->arch.mmu->direct_map)
+ if (!vcpu->arch.private->mmu->direct_map)
gpa = kvm_mmu_gva_to_gpa_write(vcpu, cr2_or_gpa, NULL);

kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(gpa));
@@ -8055,7 +8055,7 @@ int x86_decode_emulated_instruction(struct kvm_vcpu *vcpu, int emulation_type,
void *insn, int insn_len)
{
int r = EMULATION_OK;
- struct x86_emulate_ctxt *ctxt = vcpu->arch.emulate_ctxt;
+ struct x86_emulate_ctxt *ctxt = vcpu->arch.private->emulate_ctxt;

init_emulate_ctxt(vcpu);

@@ -8081,7 +8081,7 @@ int x86_emulate_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
int emulation_type, void *insn, int insn_len)
{
int r;
- struct x86_emulate_ctxt *ctxt = vcpu->arch.emulate_ctxt;
+ struct x86_emulate_ctxt *ctxt = vcpu->arch.private->emulate_ctxt;
bool writeback = true;
bool write_fault_to_spt;

@@ -8160,7 +8160,7 @@ int x86_emulate_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
ctxt->exception.address = cr2_or_gpa;

/* With shadow page tables, cr2 contains a GVA or nGPA. */
- if (vcpu->arch.mmu->direct_map) {
+ if (vcpu->arch.private->mmu->direct_map) {
ctxt->gpa_available = true;
ctxt->gpa_val = cr2_or_gpa;
}
@@ -9484,9 +9484,9 @@ static void enter_smm(struct kvm_vcpu *vcpu)
kvm_set_rflags(vcpu, X86_EFLAGS_FIXED);
kvm_rip_write(vcpu, 0x8000);

- cr0 = vcpu->arch.cr0 & ~(X86_CR0_PE | X86_CR0_EM | X86_CR0_TS | X86_CR0_PG);
+ cr0 = vcpu->arch.private->cr0 & ~(X86_CR0_PE | X86_CR0_EM | X86_CR0_TS | X86_CR0_PG);
static_call(kvm_x86_set_cr0)(vcpu, cr0);
- vcpu->arch.cr0 = cr0;
+ vcpu->arch.private->cr0 = cr0;

static_call(kvm_x86_set_cr4)(vcpu, 0);

@@ -10245,14 +10245,14 @@ static void kvm_load_guest_fpu(struct kvm_vcpu *vcpu)
* Exclude PKRU from restore as restored separately in
* kvm_x86_ops.run().
*/
- fpu_swap_kvm_fpstate(&vcpu->arch.guest_fpu, true);
+ fpu_swap_kvm_fpstate(&vcpu->arch.private->guest_fpu, true);
trace_kvm_fpu(1);
}

/* When vcpu_run ends, restore user space FPU context. */
static void kvm_put_guest_fpu(struct kvm_vcpu *vcpu)
{
- fpu_swap_kvm_fpstate(&vcpu->arch.guest_fpu, false);
+ fpu_swap_kvm_fpstate(&vcpu->arch.private->guest_fpu, false);
++vcpu->stat.fpu_reload;
trace_kvm_fpu(0);
}
@@ -10342,7 +10342,7 @@ static void __get_regs(struct kvm_vcpu *vcpu, struct kvm_regs *regs)
* that usually, but some bad designed PV devices (vmware
* backdoor interface) need this to work
*/
- emulator_writeback_register_cache(vcpu->arch.emulate_ctxt);
+ emulator_writeback_register_cache(vcpu->arch.private->emulate_ctxt);
vcpu->arch.emulate_regs_need_sync_to_vcpu = false;
}
regs->rax = kvm_rax_read(vcpu);
@@ -10450,7 +10450,7 @@ static void __get_sregs_common(struct kvm_vcpu *vcpu, struct kvm_sregs *sregs)
sregs->gdt.limit = dt.size;
sregs->gdt.base = dt.address;

- sregs->cr2 = vcpu->arch.cr2;
+ sregs->cr2 = vcpu->arch.private->cr2;
sregs->cr3 = kvm_read_cr3(vcpu);

skip_protected_regs:
@@ -10563,7 +10563,7 @@ int kvm_arch_vcpu_ioctl_set_mpstate(struct kvm_vcpu *vcpu,
int kvm_task_switch(struct kvm_vcpu *vcpu, u16 tss_selector, int idt_index,
int reason, bool has_error_code, u32 error_code)
{
- struct x86_emulate_ctxt *ctxt = vcpu->arch.emulate_ctxt;
+ struct x86_emulate_ctxt *ctxt = vcpu->arch.private->emulate_ctxt;
int ret;

init_emulate_ctxt(vcpu);
@@ -10632,9 +10632,9 @@ static int __set_sregs_common(struct kvm_vcpu *vcpu, struct kvm_sregs *sregs,
dt.address = sregs->gdt.base;
static_call(kvm_x86_set_gdt)(vcpu, &dt);

- vcpu->arch.cr2 = sregs->cr2;
+ vcpu->arch.private->cr2 = sregs->cr2;
*mmu_reset_needed |= kvm_read_cr3(vcpu) != sregs->cr3;
- vcpu->arch.cr3 = sregs->cr3;
+ vcpu->arch.private->cr3 = sregs->cr3;
kvm_register_mark_available(vcpu, VCPU_EXREG_CR3);

kvm_set_cr8(vcpu, sregs->cr8);
@@ -10644,7 +10644,7 @@ static int __set_sregs_common(struct kvm_vcpu *vcpu, struct kvm_sregs *sregs,

*mmu_reset_needed |= kvm_read_cr0(vcpu) != sregs->cr0;
static_call(kvm_x86_set_cr0)(vcpu, sregs->cr0);
- vcpu->arch.cr0 = sregs->cr0;
+ vcpu->arch.private->cr0 = sregs->cr0;

*mmu_reset_needed |= kvm_read_cr4(vcpu) != sregs->cr4;
static_call(kvm_x86_set_cr4)(vcpu, sregs->cr4);
@@ -10652,7 +10652,7 @@ static int __set_sregs_common(struct kvm_vcpu *vcpu, struct kvm_sregs *sregs,
if (update_pdptrs) {
idx = srcu_read_lock(&vcpu->kvm->srcu);
if (is_pae_paging(vcpu)) {
- load_pdptrs(vcpu, vcpu->arch.walk_mmu, kvm_read_cr3(vcpu));
+ load_pdptrs(vcpu, vcpu->arch.private->walk_mmu, kvm_read_cr3(vcpu));
*mmu_reset_needed = 1;
}
srcu_read_unlock(&vcpu->kvm->srcu, idx);
@@ -10853,12 +10853,12 @@ int kvm_arch_vcpu_ioctl_get_fpu(struct kvm_vcpu *vcpu, struct kvm_fpu *fpu)
{
struct fxregs_state *fxsave;

- if (fpstate_is_confidential(&vcpu->arch.guest_fpu))
+ if (fpstate_is_confidential(&vcpu->arch.private->guest_fpu))
return 0;

vcpu_load(vcpu);

- fxsave = &vcpu->arch.guest_fpu.fpstate->regs.fxsave;
+ fxsave = &vcpu->arch.private->guest_fpu.fpstate->regs.fxsave;
memcpy(fpu->fpr, fxsave->st_space, 128);
fpu->fcw = fxsave->cwd;
fpu->fsw = fxsave->swd;
@@ -10876,12 +10876,12 @@ int kvm_arch_vcpu_ioctl_set_fpu(struct kvm_vcpu *vcpu, struct kvm_fpu *fpu)
{
struct fxregs_state *fxsave;

- if (fpstate_is_confidential(&vcpu->arch.guest_fpu))
+ if (fpstate_is_confidential(&vcpu->arch.private->guest_fpu))
return 0;

vcpu_load(vcpu);

- fxsave = &vcpu->arch.guest_fpu.fpstate->regs.fxsave;
+ fxsave = &vcpu->arch.private->guest_fpu.fpstate->regs.fxsave;

memcpy(fxsave->st_space, fpu->fpr, 128);
fxsave->cwd = fpu->fcw;
@@ -10988,7 +10988,7 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
if (!alloc_emulate_ctxt(vcpu))
goto free_wbinvd_dirty_mask;

- if (!fpu_alloc_guest_fpstate(&vcpu->arch.guest_fpu)) {
+ if (!fpu_alloc_guest_fpstate(&vcpu->arch.private->guest_fpu)) {
pr_err("kvm: failed to allocate vcpu's fpu\n");
goto free_emulate_ctxt;
}
@@ -11023,9 +11023,9 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
return 0;

free_guest_fpu:
- fpu_free_guest_fpstate(&vcpu->arch.guest_fpu);
+ fpu_free_guest_fpstate(&vcpu->arch.private->guest_fpu);
free_emulate_ctxt:
- kmem_cache_free(x86_emulator_cache, vcpu->arch.emulate_ctxt);
+ kmem_cache_free(x86_emulator_cache, vcpu->arch.private->emulate_ctxt);
free_wbinvd_dirty_mask:
free_cpumask_var(vcpu->arch.wbinvd_dirty_mask);
fail_free_mce_banks:
@@ -11067,9 +11067,9 @@ void kvm_arch_vcpu_destroy(struct kvm_vcpu *vcpu)

static_call(kvm_x86_vcpu_free)(vcpu);

- kmem_cache_free(x86_emulator_cache, vcpu->arch.emulate_ctxt);
+ kmem_cache_free(x86_emulator_cache, vcpu->arch.private->emulate_ctxt);
free_cpumask_var(vcpu->arch.wbinvd_dirty_mask);
- fpu_free_guest_fpstate(&vcpu->arch.guest_fpu);
+ fpu_free_guest_fpstate(&vcpu->arch.private->guest_fpu);

kvm_hv_vcpu_uninit(vcpu);
kvm_pmu_destroy(vcpu);
@@ -11118,7 +11118,7 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
vcpu->arch.dr7 = DR7_FIXED_1;
kvm_update_dr7(vcpu);

- vcpu->arch.cr2 = 0;
+ vcpu->arch.private->cr2 = 0;

kvm_make_request(KVM_REQ_EVENT, vcpu);
vcpu->arch.apf.msr_en_val = 0;
@@ -11131,8 +11131,8 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
kvm_async_pf_hash_reset(vcpu);
vcpu->arch.apf.halted = false;

- if (vcpu->arch.guest_fpu.fpstate && kvm_mpx_supported()) {
- struct fpstate *fpstate = vcpu->arch.guest_fpu.fpstate;
+ if (vcpu->arch.private->guest_fpu.fpstate && kvm_mpx_supported()) {
+ struct fpstate *fpstate = vcpu->arch.private->guest_fpu.fpstate;

/*
* To avoid have the INIT path from kvm_apic_has_events() that be
@@ -11154,11 +11154,11 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)

vcpu->arch.msr_misc_features_enables = 0;

- vcpu->arch.xcr0 = XFEATURE_MASK_FP;
+ vcpu->arch.private->xcr0 = XFEATURE_MASK_FP;
}

/* All GPRs except RDX (handled below) are zeroed on RESET/INIT. */
- memset(vcpu->arch.regs, 0, sizeof(vcpu->arch.regs));
+ memset(vcpu->arch.private->regs, 0, sizeof(vcpu->arch.private->regs));
kvm_register_mark_dirty(vcpu, VCPU_REGS_RSP);

/*
@@ -11178,7 +11178,7 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
kvm_set_rflags(vcpu, X86_EFLAGS_FIXED);
kvm_rip_write(vcpu, 0xfff0);

- vcpu->arch.cr3 = 0;
+ vcpu->arch.private->cr3 = 0;
kvm_register_mark_dirty(vcpu, VCPU_EXREG_CR3);

/*
@@ -12043,7 +12043,7 @@ void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work)
{
int r;

- if ((vcpu->arch.mmu->direct_map != work->arch.direct_map) ||
+ if ((vcpu->arch.private->mmu->direct_map != work->arch.direct_map) ||
work->wakeup_all)
return;

@@ -12051,8 +12051,8 @@ void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work)
if (unlikely(r))
return;

- if (!vcpu->arch.mmu->direct_map &&
- work->arch.cr3 != vcpu->arch.mmu->get_guest_pgd(vcpu))
+ if (!vcpu->arch.private->mmu->direct_map &&
+ work->arch.cr3 != vcpu->arch.private->mmu->get_guest_pgd(vcpu))
return;

kvm_mmu_do_page_fault(vcpu, work->cr2_or_gpa, 0, true);
@@ -12398,9 +12398,9 @@ void kvm_fixup_and_inject_pf_error(struct kvm_vcpu *vcpu, gva_t gva, u16 error_c
(PFERR_WRITE_MASK | PFERR_FETCH_MASK | PFERR_USER_MASK);

if (!(error_code & PFERR_PRESENT_MASK) ||
- vcpu->arch.walk_mmu->gva_to_gpa(vcpu, gva, access, &fault) != UNMAPPED_GVA) {
+ vcpu->arch.private->walk_mmu->gva_to_gpa(vcpu, gva, access, &fault) != UNMAPPED_GVA) {
/*
- * If vcpu->arch.walk_mmu->gva_to_gpa succeeded, the page
+ * If vcpu->arch.private->walk_mmu->gva_to_gpa succeeded, the page
* tables probably do not match the TLB. Just proceed
* with the error code that the processor gave.
*/
@@ -12410,7 +12410,7 @@ void kvm_fixup_and_inject_pf_error(struct kvm_vcpu *vcpu, gva_t gva, u16 error_c
fault.nested_page_fault = false;
fault.address = gva;
}
- vcpu->arch.walk_mmu->inject_page_fault(vcpu, &fault);
+ vcpu->arch.private->walk_mmu->inject_page_fault(vcpu, &fault);
}
EXPORT_SYMBOL_GPL(kvm_fixup_and_inject_pf_error);

diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index 3d5da4daaf53..dbcb6551d111 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -183,7 +183,7 @@ static inline bool x86_exception_has_error_code(unsigned int vector)

static inline bool mmu_is_nested(struct kvm_vcpu *vcpu)
{
- return vcpu->arch.walk_mmu == &vcpu->arch.nested_mmu;
+ return vcpu->arch.private->walk_mmu == &vcpu->arch.private->nested_mmu;
}

static inline int is_pae(struct kvm_vcpu *vcpu)
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 587a75428da8..3c4e27c5aea9 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -109,6 +109,8 @@ static atomic_t hardware_enable_failed;

static struct kmem_cache *kvm_vcpu_cache;

+static struct kmem_cache *kvm_vcpu_private_cache;
+
static __read_mostly struct preempt_ops kvm_preempt_ops;
static DEFINE_PER_CPU_ASI_NOT_SENSITIVE(struct kvm_vcpu *, kvm_running_vcpu);

@@ -457,6 +459,7 @@ void kvm_vcpu_destroy(struct kvm_vcpu *vcpu)
put_pid(rcu_dereference_protected(vcpu->pid, 1));

free_page((unsigned long)vcpu->run);
+ kmem_cache_free(kvm_vcpu_private_cache, vcpu->arch.private);
kmem_cache_free(kvm_vcpu_cache, vcpu);
}
EXPORT_SYMBOL_GPL(kvm_vcpu_destroy);
@@ -2392,7 +2395,7 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
* tail pages of non-compound higher order allocations, which
* would then underflow the refcount when the caller does the
* required put_page. Don't allow those pages here.
- */
+ */
if (!kvm_try_get_pfn(pfn))
r = -EFAULT;

@@ -3562,17 +3565,25 @@ static int kvm_vm_ioctl_create_vcpu(struct kvm *kvm, u32 id)
if (r)
goto vcpu_decrement;

- vcpu = kmem_cache_zalloc(kvm_vcpu_cache, GFP_KERNEL_ACCOUNT);
+ vcpu = kmem_cache_zalloc(kvm_vcpu_cache,
+ GFP_KERNEL_ACCOUNT | __GFP_GLOBAL_NONSENSITIVE);
if (!vcpu) {
r = -ENOMEM;
goto vcpu_decrement;
}

+ vcpu->arch.private = kmem_cache_zalloc(kvm_vcpu_private_cache,
+ GFP_KERNEL | __GFP_LOCAL_NONSENSITIVE);
+ if (!vcpu->arch.private) {
+ r = -ENOMEM;
+ goto vcpu_free;
+ }
+
BUILD_BUG_ON(sizeof(struct kvm_run) > PAGE_SIZE);
page = alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO | __GFP_LOCAL_NONSENSITIVE);
if (!page) {
r = -ENOMEM;
- goto vcpu_free;
+ goto vcpu_private_free;
}
vcpu->run = page_address(page);

@@ -3631,6 +3642,8 @@ static int kvm_vm_ioctl_create_vcpu(struct kvm *kvm, u32 id)
kvm_arch_vcpu_destroy(vcpu);
vcpu_free_run_page:
free_page((unsigned long)vcpu->run);
+vcpu_private_free:
+ kmem_cache_free(kvm_vcpu_private_cache, vcpu->arch.private);
vcpu_free:
kmem_cache_free(kvm_vcpu_cache, vcpu);
vcpu_decrement:
@@ -5492,7 +5505,7 @@ int kvm_init(void *opaque, unsigned vcpu_size, unsigned vcpu_align,
vcpu_align = __alignof__(struct kvm_vcpu);
kvm_vcpu_cache =
kmem_cache_create_usercopy("kvm_vcpu", vcpu_size, vcpu_align,
- SLAB_ACCOUNT,
+ SLAB_ACCOUNT|SLAB_GLOBAL_NONSENSITIVE,
offsetof(struct kvm_vcpu, arch),
offsetofend(struct kvm_vcpu, stats_id)
- offsetof(struct kvm_vcpu, arch),
@@ -5501,12 +5514,22 @@ int kvm_init(void *opaque, unsigned vcpu_size, unsigned vcpu_align,
r = -ENOMEM;
goto out_free_3;
}
-
+ kvm_vcpu_private_cache = kmem_cache_create_usercopy("kvm_vcpu_private",
+ sizeof(struct kvm_vcpu_arch_private),
+ __alignof__(struct kvm_vcpu_arch_private),
+ SLAB_ACCOUNT | SLAB_LOCAL_NONSENSITIVE,
+ 0,
+ sizeof(struct kvm_vcpu_arch_private),
+ NULL);
+ if (!kvm_vcpu_private_cache) {
+ r = -ENOMEM;
+ goto out_free_4;
+ }
for_each_possible_cpu(cpu) {
if (!alloc_cpumask_var_node(&per_cpu(cpu_kick_mask, cpu),
GFP_KERNEL, cpu_to_node(cpu))) {
r = -ENOMEM;
- goto out_free_4;
+ goto out_free_vcpu_private_cache;
}
}

@@ -5541,6 +5564,8 @@ int kvm_init(void *opaque, unsigned vcpu_size, unsigned vcpu_align,
out_free_5:
for_each_possible_cpu(cpu)
free_cpumask_var(per_cpu(cpu_kick_mask, cpu));
+out_free_vcpu_private_cache:
+ kmem_cache_destroy(kvm_vcpu_private_cache);
out_free_4:
kmem_cache_destroy(kvm_vcpu_cache);
out_free_3:
@@ -5567,6 +5592,7 @@ void kvm_exit(void)
misc_deregister(&kvm_dev);
for_each_possible_cpu(cpu)
free_cpumask_var(per_cpu(cpu_kick_mask, cpu));
+ kmem_cache_destroy(kvm_vcpu_private_cache);
kmem_cache_destroy(kvm_vcpu_cache);
kvm_async_pf_deinit();
unregister_syscore_ops(&kvm_syscore_ops);
--
2.35.1.473.g83b2b277ed-goog

2022-02-23 13:50:07

by Junaid Shahid

[permalink] [raw]
Subject: [RFC PATCH 22/47] mm: asi: Added refcounting when initializing an asi

From: Ofir Weisse <[email protected]>

Some KVM tests initialize multiple VMs in a single process. For these
cases, we want to support multiple calls to asi_init() before a single
asi_destroy() is called, and the initialization should happen exactly
once. When asi_destroy() is called, the resources are released only
once the counter reaches zero. In the current implementation, ASIs are
tied to a specific mm. This may change in a future implementation, in
which case the mutex used for the refcounting will need to move into
struct asi.
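
As an illustration, here is a minimal sketch of the intended calling
pattern (the caller below is hypothetical and not part of this patch;
it assumes an ASI class has been registered at index 1):

static int create_two_vms_sketch(struct mm_struct *mm)
{
	struct asi *asi_a, *asi_b;
	int err;

	err = asi_init(mm, 1, &asi_a);	/* allocates the pgd, refcount 0 -> 1 */
	if (err)
		return err;

	/* Same mm and index: returns the same asi, refcount 1 -> 2. */
	err = asi_init(mm, 1, &asi_b);
	if (err) {
		asi_destroy(asi_a);
		return err;
	}

	/* ... both VMs run ... */

	asi_destroy(asi_b);	/* refcount 2 -> 1, pgd is kept */
	asi_destroy(asi_a);	/* refcount 1 -> 0, pgd is freed */
	return 0;
}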

Signed-off-by: Ofir Weisse <[email protected]>


---
arch/x86/include/asm/asi.h | 1 +
arch/x86/mm/asi.c | 52 +++++++++++++++++++++++++++++++++-----
include/linux/mm_types.h | 2 ++
kernel/fork.c | 3 +++
4 files changed, 51 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/asi.h b/arch/x86/include/asm/asi.h
index e3cbf6d8801e..2dc465f78bcc 100644
--- a/arch/x86/include/asm/asi.h
+++ b/arch/x86/include/asm/asi.h
@@ -40,6 +40,7 @@ struct asi {
pgd_t *pgd;
struct asi_class *class;
struct mm_struct *mm;
+ int64_t asi_ref_count;
};

DECLARE_PER_CPU_ALIGNED(struct asi_state, asi_cpu_state);
diff --git a/arch/x86/mm/asi.c b/arch/x86/mm/asi.c
index 91e5ff1224ff..ac35323193a3 100644
--- a/arch/x86/mm/asi.c
+++ b/arch/x86/mm/asi.c
@@ -282,9 +282,25 @@ static int __init asi_global_init(void)
}
subsys_initcall(asi_global_init)

+/* We're assuming we hold mm->asi_init_lock */
+static void __asi_destroy(struct asi *asi)
+{
+ if (!boot_cpu_has(X86_FEATURE_ASI))
+ return;
+
+ /* If refcount is non-zero, it means asi_init() was called multiple
+ * times. We free the asi pgd only when the last VM is destroyed. */
+ if (--(asi->asi_ref_count) > 0)
+ return;
+
+ asi_free_pgd(asi);
+ memset(asi, 0, sizeof(struct asi));
+}
+
int asi_init(struct mm_struct *mm, int asi_index, struct asi **out_asi)
{
- struct asi *asi = &mm->asi[asi_index];
+ int err = 0;
+ struct asi *asi = &mm->asi[asi_index];

*out_asi = NULL;

@@ -295,6 +311,15 @@ int asi_init(struct mm_struct *mm, int asi_index, struct asi **out_asi)
WARN_ON(asi_index == 0 || asi_index >= ASI_MAX_NUM);
WARN_ON(asi->pgd != NULL);

+ /* Currently, mm and asi structs are conceptually tied together. In
+ * future implementations an asi object might be unrelated to a specific
+ * mm. In that future implementation, the mutex will have to be inside
+ * asi. */
+ mutex_lock(&mm->asi_init_lock);
+
+ if (asi->asi_ref_count++ > 0)
+ goto exit_unlock; /* err is 0 */
+
/*
* For now, we allocate 2 pages to avoid any potential problems with
* KPTI code. This won't be needed once KPTI is folded into the ASI
@@ -302,8 +327,10 @@ int asi_init(struct mm_struct *mm, int asi_index, struct asi **out_asi)
*/
asi->pgd = (pgd_t *)__get_free_pages(GFP_PGTABLE_USER,
PGD_ALLOCATION_ORDER);
- if (!asi->pgd)
- return -ENOMEM;
+ if (!asi->pgd) {
+ err = -ENOMEM;
+ goto exit_unlock;
+ }

asi->class = &asi_class[asi_index];
asi->mm = mm;
@@ -328,19 +355,30 @@ int asi_init(struct mm_struct *mm, int asi_index, struct asi **out_asi)
set_pgd(asi->pgd + i, asi_global_nonsensitive_pgd[i]);
}

- *out_asi = asi;
+exit_unlock:
+ if (err)
+ __asi_destroy(asi);

- return 0;
+ /* This unlock signals future asi_init() callers that we finished. */
+ mutex_unlock(&mm->asi_init_lock);
+
+ if (!err)
+ *out_asi = asi;
+ return err;
}
EXPORT_SYMBOL_GPL(asi_init);

void asi_destroy(struct asi *asi)
{
+ struct mm_struct *mm;
+
if (!boot_cpu_has(X86_FEATURE_ASI) || !asi)
return;

- asi_free_pgd(asi);
- memset(asi, 0, sizeof(struct asi));
+ mm = asi->mm;
+ mutex_lock(&mm->asi_init_lock);
+ __asi_destroy(asi);
+ mutex_unlock(&mm->asi_init_lock);
}
EXPORT_SYMBOL_GPL(asi_destroy);

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index f9702d070975..e6980ae31323 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -16,6 +16,7 @@
#include <linux/page-flags-layout.h>
#include <linux/workqueue.h>
#include <linux/seqlock.h>
+#include <linux/mutex.h>

#include <asm/mmu.h>
#include <asm/asi.h>
@@ -628,6 +629,7 @@ struct mm_struct {
* these resources for every mm in the system, we expect that
* only VM mm's will have this flag set. */
bool asi_enabled;
+ struct mutex asi_init_lock;
#endif
struct user_namespace *user_ns;

diff --git a/kernel/fork.c b/kernel/fork.c
index dd5a86e913ea..68b3aeab55ac 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1084,6 +1084,9 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,

mm->user_ns = get_user_ns(user_ns);

+#ifdef CONFIG_ADDRESS_SPACE_ISOLATION
+ mutex_init(&mm->asi_init_lock);
+#endif
return mm;

fail_noasi:
--
2.35.1.473.g83b2b277ed-goog

2022-02-23 14:19:34

by Junaid Shahid

[permalink] [raw]
Subject: [RFC PATCH 30/47] mm: asi: Add API for mapping userspace address ranges

asi_map_user()/asi_unmap_user() can be used to map userspace address
ranges for ASI classes that do not specify ASI_MAP_ALL_USERSPACE.
In addition, another structure, asi_pgtbl_pool, allows for
pre-allocating a set of pages to avoid having to allocate memory
for page tables within asi_map_user(), which makes it easier to use
that function while holding locks.
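
As an illustration, a minimal usage sketch (the function, lock and page
count below are hypothetical, not part of this patch): it maps one
page-aligned user page while holding a spinlock, pre-filling a pool so
that asi_map_user() never needs to allocate page tables under the lock.

static int map_one_user_page_sketch(struct asi *asi, void *uaddr,
				    spinlock_t *lock)
{
	struct asi_pgtbl_pool pool;
	int err;

	asi_init_pgtbl_pool(&pool);

	/* Worst case for one page: one new table per level below the PGD. */
	err = asi_fill_pgtbl_pool(&pool, 4, GFP_KERNEL);
	if (err)
		return err;

	/* The lock is assumed to keep the source mapping stable. */
	spin_lock(lock);
	err = asi_map_user(asi, uaddr, PAGE_SIZE, &pool,
			   (size_t)uaddr, (size_t)uaddr + PAGE_SIZE);
	spin_unlock(lock);

	/* Return any pre-allocated pages that were not consumed. */
	asi_clear_pgtbl_pool(&pool);
	return err;
}

The range would later be torn down with asi_unmap_user(asi, uaddr,
PAGE_SIZE).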

Signed-off-by: Junaid Shahid <[email protected]>


---
arch/x86/include/asm/asi.h | 19 +++
arch/x86/mm/asi.c | 252 ++++++++++++++++++++++++++++++++++---
include/asm-generic/asi.h | 21 ++++
include/linux/mm_types.h | 2 +-
4 files changed, 275 insertions(+), 19 deletions(-)

diff --git a/arch/x86/include/asm/asi.h b/arch/x86/include/asm/asi.h
index 35421356584b..bdb2f70d4f85 100644
--- a/arch/x86/include/asm/asi.h
+++ b/arch/x86/include/asm/asi.h
@@ -44,6 +44,12 @@ struct asi {
atomic64_t *tlb_gen;
atomic64_t __tlb_gen;
int64_t asi_ref_count;
+ rwlock_t user_map_lock;
+};
+
+struct asi_pgtbl_pool {
+ struct page *pgtbl_list;
+ uint count;
};

DECLARE_PER_CPU_ALIGNED(struct asi_state, asi_cpu_state);
@@ -74,6 +80,19 @@ void asi_do_lazy_map(struct asi *asi, size_t addr);
void asi_clear_user_pgd(struct mm_struct *mm, size_t addr);
void asi_clear_user_p4d(struct mm_struct *mm, size_t addr);

+int asi_map_user(struct asi *asi, void *addr, size_t len,
+ struct asi_pgtbl_pool *pool,
+ size_t allowed_start, size_t allowed_end);
+void asi_unmap_user(struct asi *asi, void *va, size_t len);
+int asi_fill_pgtbl_pool(struct asi_pgtbl_pool *pool, uint count, gfp_t flags);
+void asi_clear_pgtbl_pool(struct asi_pgtbl_pool *pool);
+
+static inline void asi_init_pgtbl_pool(struct asi_pgtbl_pool *pool)
+{
+ pool->pgtbl_list = NULL;
+ pool->count = 0;
+}
+
static inline void asi_init_thread_state(struct thread_struct *thread)
{
thread->intr_nest_depth = 0;
diff --git a/arch/x86/mm/asi.c b/arch/x86/mm/asi.c
index 29c74b6d4262..9b1bd005f343 100644
--- a/arch/x86/mm/asi.c
+++ b/arch/x86/mm/asi.c
@@ -86,6 +86,55 @@ void asi_unregister_class(int index)
}
EXPORT_SYMBOL_GPL(asi_unregister_class);

+static ulong get_pgtbl_from_pool(struct asi_pgtbl_pool *pool)
+{
+ struct page *pgtbl;
+
+ if (pool->count == 0)
+ return 0;
+
+ pgtbl = pool->pgtbl_list;
+ pool->pgtbl_list = pgtbl->asi_pgtbl_pool_next;
+ pgtbl->asi_pgtbl_pool_next = NULL;
+ pool->count--;
+
+ return (ulong)page_address(pgtbl);
+}
+
+static void return_pgtbl_to_pool(struct asi_pgtbl_pool *pool, ulong virt)
+{
+ struct page *pgtbl = virt_to_page(virt);
+
+ pgtbl->asi_pgtbl_pool_next = pool->pgtbl_list;
+ pool->pgtbl_list = pgtbl;
+ pool->count++;
+}
+
+int asi_fill_pgtbl_pool(struct asi_pgtbl_pool *pool, uint count, gfp_t flags)
+{
+ if (!static_cpu_has(X86_FEATURE_ASI))
+ return 0;
+
+ while (pool->count < count) {
+ ulong pgtbl = get_zeroed_page(flags);
+
+ if (!pgtbl)
+ return -ENOMEM;
+
+ return_pgtbl_to_pool(pool, pgtbl);
+ }
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(asi_fill_pgtbl_pool);
+
+void asi_clear_pgtbl_pool(struct asi_pgtbl_pool *pool)
+{
+ while (pool->count > 0)
+ free_page(get_pgtbl_from_pool(pool));
+}
+EXPORT_SYMBOL_GPL(asi_clear_pgtbl_pool);
+
static void asi_clone_pgd(pgd_t *dst_table, pgd_t *src_table, size_t addr)
{
pgd_t *src = pgd_offset_pgd(src_table, addr);
@@ -110,10 +159,12 @@ static void asi_clone_pgd(pgd_t *dst_table, pgd_t *src_table, size_t addr)
#define DEFINE_ASI_PGTBL_ALLOC(base, level) \
static level##_t * asi_##level##_alloc(struct asi *asi, \
base##_t *base, ulong addr, \
- gfp_t flags) \
+ gfp_t flags, \
+ struct asi_pgtbl_pool *pool) \
{ \
if (unlikely(base##_none(*base))) { \
- ulong pgtbl = get_zeroed_page(flags); \
+ ulong pgtbl = pool ? get_pgtbl_from_pool(pool) \
+ : get_zeroed_page(flags); \
phys_addr_t pgtbl_pa; \
\
if (pgtbl == 0) \
@@ -127,7 +178,10 @@ static level##_t * asi_##level##_alloc(struct asi *asi, \
mm_inc_nr_##level##s(asi->mm); \
} else { \
paravirt_release_##level(PHYS_PFN(pgtbl_pa)); \
- free_page(pgtbl); \
+ if (pool) \
+ return_pgtbl_to_pool(pool, pgtbl); \
+ else \
+ free_page(pgtbl); \
} \
\
/* NOP on native. PV call on Xen. */ \
@@ -336,6 +390,7 @@ int asi_init(struct mm_struct *mm, int asi_index, struct asi **out_asi)
asi->class = &asi_class[asi_index];
asi->mm = mm;
asi->pcid_index = asi_index;
+ rwlock_init(&asi->user_map_lock);

if (asi->class->flags & ASI_MAP_STANDARD_NONSENSITIVE) {
uint i;
@@ -650,11 +705,6 @@ static bool follow_physaddr(struct mm_struct *mm, size_t virt,
/*
* Map the given range into the ASI page tables. The source of the mapping
* is the regular unrestricted page tables.
- * Can be used to map any kernel memory.
- *
- * The caller MUST ensure that the source mapping will not change during this
- * function. For dynamic kernel memory, this is generally ensured by mapping
- * the memory within the allocator.
*
* If the source mapping is a large page and the range being mapped spans the
* entire large page, then it will be mapped as a large page in the ASI page
@@ -664,19 +714,17 @@ static bool follow_physaddr(struct mm_struct *mm, size_t virt,
* destination page, but that should be ok for now, as usually in such cases,
* the range would consist of a small-ish number of pages.
*/
-int asi_map_gfp(struct asi *asi, void *addr, size_t len, gfp_t gfp_flags)
+int __asi_map(struct asi *asi, size_t start, size_t end, gfp_t gfp_flags,
+ struct asi_pgtbl_pool *pool,
+ size_t allowed_start, size_t allowed_end)
{
size_t virt;
- size_t start = (size_t)addr;
- size_t end = start + len;
size_t page_size;

- if (!static_cpu_has(X86_FEATURE_ASI) || !asi)
- return 0;
-
VM_BUG_ON(start & ~PAGE_MASK);
- VM_BUG_ON(len & ~PAGE_MASK);
- VM_BUG_ON(start < TASK_SIZE_MAX);
+ VM_BUG_ON(end & ~PAGE_MASK);
+ VM_BUG_ON(end > allowed_end);
+ VM_BUG_ON(start < allowed_start);

gfp_flags &= GFP_RECLAIM_MASK;

@@ -702,14 +750,15 @@ int asi_map_gfp(struct asi *asi, void *addr, size_t len, gfp_t gfp_flags)
continue; \
} \
\
- level = asi_##level##_alloc(asi, base, virt, gfp_flags);\
+ level = asi_##level##_alloc(asi, base, virt, \
+ gfp_flags, pool); \
if (!level) \
return -ENOMEM; \
\
if (page_size >= LEVEL##_SIZE && \
(level##_none(*level) || level##_leaf(*level)) && \
is_page_within_range(virt, LEVEL##_SIZE, \
- start, end)) { \
+ allowed_start, allowed_end)) {\
page_size = LEVEL##_SIZE; \
phys &= LEVEL##_MASK; \
\
@@ -737,6 +786,26 @@ int asi_map_gfp(struct asi *asi, void *addr, size_t len, gfp_t gfp_flags)
return 0;
}

+/*
+ * Maps the given kernel address range into the ASI page tables.
+ *
+ * The caller MUST ensure that the source mapping will not change during this
+ * function. For dynamic kernel memory, this is generally ensured by mapping
+ * the memory within the allocator.
+ */
+int asi_map_gfp(struct asi *asi, void *addr, size_t len, gfp_t gfp_flags)
+{
+ size_t start = (size_t)addr;
+ size_t end = start + len;
+
+ if (!static_cpu_has(X86_FEATURE_ASI) || !asi)
+ return 0;
+
+ VM_BUG_ON(start < TASK_SIZE_MAX);
+
+ return __asi_map(asi, start, end, gfp_flags, NULL, start, end);
+}
+
int asi_map(struct asi *asi, void *addr, size_t len)
{
return asi_map_gfp(asi, addr, len, GFP_KERNEL);
@@ -935,3 +1004,150 @@ void asi_clear_user_p4d(struct mm_struct *mm, size_t addr)
if (!pgtable_l5_enabled())
__asi_clear_user_pgd(mm, addr);
}
+
+/*
+ * Maps the given userspace address range into the ASI page tables.
+ *
+ * The caller MUST ensure that the source mapping will not change during this
+ * function e.g. by synchronizing via MMU notifiers or acquiring the
+ * appropriate locks.
+ */
+int asi_map_user(struct asi *asi, void *addr, size_t len,
+ struct asi_pgtbl_pool *pool,
+ size_t allowed_start, size_t allowed_end)
+{
+ int err;
+ size_t start = (size_t)addr;
+ size_t end = start + len;
+
+ if (!static_cpu_has(X86_FEATURE_ASI) || !asi)
+ return 0;
+
+ VM_BUG_ON(end > TASK_SIZE_MAX);
+
+ read_lock(&asi->user_map_lock);
+ err = __asi_map(asi, start, end, GFP_NOWAIT, pool,
+ allowed_start, allowed_end);
+ read_unlock(&asi->user_map_lock);
+
+ return err;
+}
+EXPORT_SYMBOL_GPL(asi_map_user);
+
+static bool
+asi_unmap_free_pte_range(struct asi_pgtbl_pool *pgtbls_to_free,
+ pte_t *pte, size_t addr, size_t end)
+{
+ do {
+ pte_clear(NULL, addr, pte);
+ } while (pte++, addr += PAGE_SIZE, addr != end);
+
+ return true;
+}
+
+#define DEFINE_ASI_UNMAP_FREE_RANGE(level, LEVEL, next_level, NEXT_LVL_SIZE) \
+static bool \
+asi_unmap_free_##level##_range(struct asi_pgtbl_pool *pgtbls_to_free, \
+ level##_t *level, size_t addr, size_t end) \
+{ \
+ bool unmapped = false; \
+ size_t next; \
+ \
+ do { \
+ next = level##_addr_end(addr, end); \
+ if (level##_none(*level)) \
+ continue; \
+ \
+ if (IS_ALIGNED(addr, LEVEL##_SIZE) && \
+ IS_ALIGNED(next, LEVEL##_SIZE)) { \
+ if (!level##_large(*level)) { \
+ ulong pgtbl = level##_page_vaddr(*level); \
+ struct page *page = virt_to_page(pgtbl); \
+ \
+ page->private = PG_LEVEL_##NEXT_LVL_SIZE; \
+ return_pgtbl_to_pool(pgtbls_to_free, pgtbl); \
+ } \
+ level##_clear(level); \
+ unmapped = true; \
+ } else { \
+ /* \
+ * At this time, we don't have a case where we need to \
+ * unmap a subset of a huge page. But that could arise \
+ * in the future. In that case, we'll need to split \
+ * the huge mapping here. \
+ */ \
+ if (WARN_ON(level##_large(*level))) \
+ continue; \
+ \
+ unmapped |= asi_unmap_free_##next_level##_range( \
+ pgtbls_to_free, \
+ next_level##_offset(level, addr), \
+ addr, next); \
+ } \
+ } while (level++, addr = next, addr != end); \
+ \
+ return unmapped; \
+}
+
+DEFINE_ASI_UNMAP_FREE_RANGE(pmd, PMD, pte, 4K)
+DEFINE_ASI_UNMAP_FREE_RANGE(pud, PUD, pmd, 2M)
+DEFINE_ASI_UNMAP_FREE_RANGE(p4d, P4D, pud, 1G)
+DEFINE_ASI_UNMAP_FREE_RANGE(pgd, PGDIR, p4d, 512G)
+
+static bool asi_unmap_and_free_range(struct asi_pgtbl_pool *pgtbls_to_free,
+ struct asi *asi, size_t addr, size_t end)
+{
+ size_t next;
+ bool unmapped = false;
+ pgd_t *pgd = pgd_offset_pgd(asi->pgd, addr);
+
+ BUILD_BUG_ON((void *)&((struct page *)NULL)->private ==
+ (void *)&((struct page *)NULL)->asi_pgtbl_pool_next);
+
+ if (pgtable_l5_enabled())
+ return asi_unmap_free_pgd_range(pgtbls_to_free, pgd, addr, end);
+
+ do {
+ next = pgd_addr_end(addr, end);
+ unmapped |= asi_unmap_free_p4d_range(pgtbls_to_free,
+ p4d_offset(pgd, addr),
+ addr, next);
+ } while (pgd++, addr = next, addr != end);
+
+ return unmapped;
+}
+
+void asi_unmap_user(struct asi *asi, void *addr, size_t len)
+{
+ static void (*const free_pgtbl_at_level[])(struct asi *, size_t) = {
+ NULL,
+ asi_free_pte,
+ asi_free_pmd,
+ asi_free_pud,
+ asi_free_p4d
+ };
+
+ struct asi_pgtbl_pool pgtbls_to_free = { 0 };
+ size_t start = (size_t)addr;
+ size_t end = start + len;
+ bool unmapped;
+
+ if (!static_cpu_has(X86_FEATURE_ASI) || !asi)
+ return;
+
+ write_lock(&asi->user_map_lock);
+ unmapped = asi_unmap_and_free_range(&pgtbls_to_free, asi, start, end);
+ write_unlock(&asi->user_map_lock);
+
+ if (unmapped)
+ asi_flush_tlb_range(asi, addr, len);
+
+ while (pgtbls_to_free.count > 0) {
+ size_t pgtbl = get_pgtbl_from_pool(&pgtbls_to_free);
+ struct page *page = virt_to_page(pgtbl);
+
+ VM_BUG_ON(page->private >= PG_LEVEL_NUM);
+ free_pgtbl_at_level[page->private](asi, pgtbl);
+ }
+}
+EXPORT_SYMBOL_GPL(asi_unmap_user);
diff --git a/include/asm-generic/asi.h b/include/asm-generic/asi.h
index 8513d0d7865a..fffb323d2a00 100644
--- a/include/asm-generic/asi.h
+++ b/include/asm-generic/asi.h
@@ -26,6 +26,7 @@

struct asi_hooks {};
struct asi {};
+struct asi_pgtbl_pool {};

static inline
int asi_register_class(const char *name, uint flags,
@@ -92,6 +93,26 @@ void asi_clear_user_pgd(struct mm_struct *mm, size_t addr) { }
static inline
void asi_clear_user_p4d(struct mm_struct *mm, size_t addr) { }

+static inline
+int asi_map_user(struct asi *asi, void *addr, size_t len,
+ struct asi_pgtbl_pool *pool,
+ size_t allowed_start, size_t allowed_end)
+{
+ return 0;
+}
+
+static inline void asi_unmap_user(struct asi *asi, void *va, size_t len) { }
+
+static inline
+int asi_fill_pgtbl_pool(struct asi_pgtbl_pool *pool, uint count, gfp_t flags)
+{
+ return 0;
+}
+
+static inline void asi_clear_pgtbl_pool(struct asi_pgtbl_pool *pool) { }
+
+static inline void asi_init_pgtbl_pool(struct asi_pgtbl_pool *pool) { }
+
static inline
void asi_flush_tlb_range(struct asi *asi, void *addr, size_t len) { }

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 7d38229ca85c..c3f209720a84 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -198,7 +198,7 @@ struct page {
/* Links the pages_to_free_async list */
struct llist_node async_free_node;

- unsigned long _asi_pad_1;
+ struct page *asi_pgtbl_pool_next;
u64 asi_tlb_gen;

union {
--
2.35.1.473.g83b2b277ed-goog

2022-02-23 15:24:25

by Junaid Shahid

[permalink] [raw]
Subject: [RFC PATCH 40/47] mm: asi: support for static percpu DEFINE_PER_CPU*_ASI

From: Ofir Weisse <[email protected]>

Implemented the following PERCPU static declarations:

- DECLARE/DEFINE_PER_CPU_ASI_NOT_SENSITIVE
- DECLARE/DEFINE_PER_CPU_SHARED_ALIGNED_ASI_NOT_SENSITIVE
- DECLARE/DEFINE_PER_CPU_ALIGNED_ASI_NOT_SENSITIVE
- DECLARE/DEFINE_PER_CPU_PAGE_ALIGNED_ASI_NOT_SENSITIVE

These definitions are also supported in dynamic modules.
To support percpu variables in dynamic modules, we're creating an ASI
pcpu reserved chunk. The reserved size PERCPU_MODULE_RESERVE is now
split between the normal reserved chunk and the ASI one.
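
As an illustration, a minimal sketch of how one of these macros would
be used (the variable and function below are hypothetical, assuming the
macro takes (type, name) like DEFINE_PER_CPU):

#include <linux/percpu.h>

/* Deemed non-sensitive: stays mapped in the restricted address space. */
DEFINE_PER_CPU_ASI_NOT_SENSITIVE(unsigned long, demo_exit_count);

static void demo_count_exit(void)
{
	/* Updated from the restricted context without leaving ASI. */
	this_cpu_inc(demo_exit_count);
}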

Signed-off-by: Ofir Weisse <[email protected]>


---
arch/x86/mm/asi.c | 39 +++++++-
include/asm-generic/percpu.h | 6 ++
include/asm-generic/vmlinux.lds.h | 5 +
include/linux/module.h | 6 ++
include/linux/percpu-defs.h | 39 ++++++++
include/linux/percpu.h | 8 +-
kernel/module-internal.h | 1 +
kernel/module.c | 154 ++++++++++++++++++++++++++----
mm/percpu.c | 134 ++++++++++++++++++++++----
9 files changed, 356 insertions(+), 36 deletions(-)

diff --git a/arch/x86/mm/asi.c b/arch/x86/mm/asi.c
index 6c14aa1fc4aa..ba373b461855 100644
--- a/arch/x86/mm/asi.c
+++ b/arch/x86/mm/asi.c
@@ -309,6 +309,32 @@ static int __init set_asi_param(char *str)
}
early_param("asi", set_asi_param);

+static int asi_map_percpu(struct asi *asi, void *percpu_addr, size_t len)
+{
+ int cpu, err;
+ void *ptr;
+
+ for_each_possible_cpu(cpu) {
+ ptr = per_cpu_ptr(percpu_addr, cpu);
+ err = asi_map(asi, ptr, len);
+ if (err)
+ return err;
+ }
+
+ return 0;
+}
+
+static void asi_unmap_percpu(struct asi *asi, void *percpu_addr, size_t len)
+{
+ int cpu;
+ void *ptr;
+
+ for_each_possible_cpu(cpu) {
+ ptr = per_cpu_ptr(percpu_addr, cpu);
+ asi_unmap(asi, ptr, len, true);
+ }
+}
+
/* asi_load_module() is called from layout_and_allocate() in kernel/module.c
* We map the module and its data in init_mm.asi_pgd[0].
*/
@@ -347,7 +373,13 @@ int asi_load_module(struct module* module)
if (err)
return err;

- return 0;
+ err = asi_map_percpu(ASI_GLOBAL_NONSENSITIVE,
+ module->percpu_asi,
+ module->percpu_asi_size);
+ if (err)
+ return err;
+
+ return 0;
}
EXPORT_SYMBOL_GPL(asi_load_module);

@@ -372,6 +404,9 @@ void asi_unload_module(struct module* module)
module->core_layout.once_section_offset,
module->core_layout.once_section_size, true);

+ asi_unmap_percpu(ASI_GLOBAL_NONSENSITIVE, module->percpu_asi,
+ module->percpu_asi_size);
+
}

static int __init asi_global_init(void)
@@ -399,6 +434,8 @@ static int __init asi_global_init(void)

static_branch_enable(&asi_local_map_initialized);

+ pcpu_map_asi_reserved_chunk();
+
return 0;
}
subsys_initcall(asi_global_init)
diff --git a/include/asm-generic/percpu.h b/include/asm-generic/percpu.h
index 6432a7fade91..40001b74114f 100644
--- a/include/asm-generic/percpu.h
+++ b/include/asm-generic/percpu.h
@@ -50,6 +50,12 @@ extern void setup_per_cpu_areas(void);

#endif /* SMP */

+#ifdef CONFIG_ADDRESS_SPACE_ISOLATION
+void __init pcpu_map_asi_reserved_chunk(void);
+#else
+static inline void pcpu_map_asi_reserved_chunk(void) {}
+#endif
+
#ifndef PER_CPU_BASE_SECTION
#ifdef CONFIG_SMP
#define PER_CPU_BASE_SECTION ".data..percpu"
diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h
index c769d939c15f..0a931aedc285 100644
--- a/include/asm-generic/vmlinux.lds.h
+++ b/include/asm-generic/vmlinux.lds.h
@@ -1080,6 +1080,11 @@
. = ALIGN(cacheline); \
*(.data..percpu) \
*(.data..percpu..shared_aligned) \
+ . = ALIGN(PAGE_SIZE); \
+ __per_cpu_asi_start = .; \
+ *(.data..percpu..asi_non_sensitive) \
+ . = ALIGN(PAGE_SIZE); \
+ __per_cpu_asi_end = .; \
PERCPU_DECRYPTED_SECTION \
__per_cpu_end = .;

diff --git a/include/linux/module.h b/include/linux/module.h
index 82267a95f936..d4d020bae171 100644
--- a/include/linux/module.h
+++ b/include/linux/module.h
@@ -463,6 +463,12 @@ struct module {
/* Per-cpu data. */
void __percpu *percpu;
unsigned int percpu_size;
+#ifdef CONFIG_ADDRESS_SPACE_ISOLATION
+ /* Per-cpu data for ASI */
+ void __percpu *percpu_asi;
+ unsigned int percpu_asi_size;
+#endif /* CONFIG_ADDRESS_SPACE_ISOLATION */
+
#endif
void *noinstr_text_start;
unsigned int noinstr_text_size;
diff --git a/include/linux/percpu-defs.h b/include/linux/percpu-defs.h
index af1071535de8..5d9fdc93e0fa 100644
--- a/include/linux/percpu-defs.h
+++ b/include/linux/percpu-defs.h
@@ -170,6 +170,45 @@

#define DEFINE_PER_CPU_READ_MOSTLY(type, name) \
DEFINE_PER_CPU_SECTION(type, name, "..read_mostly")
+/*
+ * Declaration/definition used for per-CPU variables which, for the sake of
+ * address space isolation (ASI), are deemed not sensitive.
+ */
+#ifdef CONFIG_ADDRESS_SPACE_ISOLATION
+#define ASI_PERCPU_SECTION "..asi_non_sensitive"
+#else
+#define ASI_PERCPU_SECTION ""
+#endif
+
+#define DECLARE_PER_CPU_ASI_NOT_SENSITIVE(type, name) \
+ DECLARE_PER_CPU_SECTION(type, name, ASI_PERCPU_SECTION)
+
+#define DECLARE_PER_CPU_SHARED_ALIGNED_ASI_NOT_SENSITIVE(type, name) \
+ DECLARE_PER_CPU_SECTION(type, name, ASI_PERCPU_SECTION) \
+ ____cacheline_aligned_in_smp
+
+#define DECLARE_PER_CPU_ALIGNED_ASI_NOT_SENSITIVE(type, name) \
+ DECLARE_PER_CPU_SECTION(type, name, ASI_PERCPU_SECTION) \
+ ____cacheline_aligned
+
+#define DECLARE_PER_CPU_PAGE_ALIGNED_ASI_NOT_SENSITIVE(type, name) \
+ DECLARE_PER_CPU_SECTION(type, name, ASI_PERCPU_SECTION) \
+ __aligned(PAGE_SIZE)
+
+#define DEFINE_PER_CPU_ASI_NOT_SENSITIVE(type, name) \
+ DEFINE_PER_CPU_SECTION(type, name, ASI_PERCPU_SECTION)
+
+#define DEFINE_PER_CPU_SHARED_ALIGNED_ASI_NOT_SENSITIVE(type, name) \
+ DEFINE_PER_CPU_SECTION(type, name, ASI_PERCPU_SECTION) \
+ ____cacheline_aligned_in_smp
+
+#define DEFINE_PER_CPU_ALIGNED_ASI_NOT_SENSITIVE(type, name) \
+ DEFINE_PER_CPU_SECTION(type, name, ASI_PERCPU_SECTION) \
+ ____cacheline_aligned
+
+#define DEFINE_PER_CPU_PAGE_ALIGNED_ASI_NOT_SENSITIVE(type, name) \
+ DEFINE_PER_CPU_SECTION(type, name, ASI_PERCPU_SECTION) \
+ __aligned(PAGE_SIZE)

/*
* Declaration/definition used for per-CPU variables that should be accessed
diff --git a/include/linux/percpu.h b/include/linux/percpu.h
index ae4004e7957e..a2cc4c32cabd 100644
--- a/include/linux/percpu.h
+++ b/include/linux/percpu.h
@@ -13,7 +13,8 @@

/* enough to cover all DEFINE_PER_CPUs in modules */
#ifdef CONFIG_MODULES
-#define PERCPU_MODULE_RESERVE (8 << 10)
+/* Doubled so that it can be split between the normal and ASI reserved chunks. */
+#define PERCPU_MODULE_RESERVE (16 << 10)
#else
#define PERCPU_MODULE_RESERVE 0
#endif
@@ -123,6 +124,11 @@ extern int __init pcpu_page_first_chunk(size_t reserved_size,
#endif

extern void __percpu *__alloc_reserved_percpu(size_t size, size_t align) __alloc_size(1);
+
+#ifdef CONFIG_ADDRESS_SPACE_ISOLATION
+extern void __percpu *__alloc_reserved_percpu_asi(size_t size, size_t align);
+#endif
+
extern bool __is_kernel_percpu_address(unsigned long addr, unsigned long *can_addr);
extern bool is_kernel_percpu_address(unsigned long addr);

diff --git a/kernel/module-internal.h b/kernel/module-internal.h
index 33783abc377b..44c05ae06b2c 100644
--- a/kernel/module-internal.h
+++ b/kernel/module-internal.h
@@ -25,6 +25,7 @@ struct load_info {
#endif
struct {
unsigned int sym, str, mod, vers, info, pcpu;
+ unsigned int pcpu_asi;
} index;
};

diff --git a/kernel/module.c b/kernel/module.c
index d363b8a0ee24..0048b7843903 100644
--- a/kernel/module.c
+++ b/kernel/module.c
@@ -587,6 +587,13 @@ static inline void __percpu *mod_percpu(struct module *mod)
return mod->percpu;
}

+#ifdef CONFIG_ADDRESS_SPACE_ISOLATION
+static inline void __percpu *mod_percpu_asi(struct module *mod)
+{
+ return mod->percpu_asi;
+}
+#endif
+
static int percpu_modalloc(struct module *mod, struct load_info *info)
{
Elf_Shdr *pcpusec = &info->sechdrs[info->index.pcpu];
@@ -611,9 +618,34 @@ static int percpu_modalloc(struct module *mod, struct load_info *info)
return 0;
}

+#ifdef CONFIG_ADDRESS_SPACE_ISOLATION
+static int percpu_asi_modalloc(struct module *mod, struct load_info *info)
+{
+ Elf_Shdr *pcpusec = &info->sechdrs[info->index.pcpu_asi];
+ unsigned long align = pcpusec->sh_addralign;
+
+ if (!pcpusec->sh_size)
+ return 0;
+
+ mod->percpu_asi = __alloc_reserved_percpu_asi(pcpusec->sh_size, align);
+ if (!mod->percpu_asi) {
+ pr_warn("%s: Could not allocate %lu bytes percpu data\n",
+ mod->name, (unsigned long)pcpusec->sh_size);
+ return -ENOMEM;
+ }
+ mod->percpu_asi_size = pcpusec->sh_size;
+
+ return 0;
+}
+#endif
+
static void percpu_modfree(struct module *mod)
{
free_percpu(mod->percpu);
+
+#ifdef CONFIG_ADDRESS_SPACE_ISOLATION
+ free_percpu(mod->percpu_asi);
+#endif
}

static unsigned int find_pcpusec(struct load_info *info)
@@ -621,6 +653,13 @@ static unsigned int find_pcpusec(struct load_info *info)
return find_sec(info, ".data..percpu");
}

+#ifdef CONFIG_ADDRESS_SPACE_ISOLATION
+static unsigned int find_pcpusec_asi(struct load_info *info)
+{
+ return find_sec(info, ".data..percpu" ASI_PERCPU_SECTION);
+}
+#endif
+
static void percpu_modcopy(struct module *mod,
const void *from, unsigned long size)
{
@@ -630,6 +669,39 @@ static void percpu_modcopy(struct module *mod,
memcpy(per_cpu_ptr(mod->percpu, cpu), from, size);
}

+#ifdef CONFIG_ADDRESS_SPACE_ISOLATION
+static void percpu_asi_modcopy(struct module *mod,
+ const void *from, unsigned long size)
+{
+ int cpu;
+
+ for_each_possible_cpu(cpu)
+ memcpy(per_cpu_ptr(mod->percpu_asi, cpu), from, size);
+}
+#endif
+
+bool __is_module_percpu_address_helper(unsigned long addr,
+ unsigned long *can_addr,
+ unsigned int cpu,
+ void* percpu_start,
+ unsigned int percpu_size)
+{
+ void *start = per_cpu_ptr(percpu_start, cpu);
+ void *va = (void *)addr;
+
+ if (va >= start && va < start + percpu_size) {
+ if (can_addr) {
+ *can_addr = (unsigned long) (va - start);
+ *can_addr += (unsigned long)
+ per_cpu_ptr(percpu_start,
+ get_boot_cpu_id());
+ }
+ return true;
+ }
+
+ return false;
+}
+
bool __is_module_percpu_address(unsigned long addr, unsigned long *can_addr)
{
struct module *mod;
@@ -640,22 +712,34 @@ bool __is_module_percpu_address(unsigned long addr, unsigned long *can_addr)
list_for_each_entry_rcu(mod, &modules, list) {
if (mod->state == MODULE_STATE_UNFORMED)
continue;
+#ifdef CONFIG_ADDRESS_SPACE_ISOLATION
+ if (!mod->percpu_size && !mod->percpu_asi_size)
+ continue;
+#else
if (!mod->percpu_size)
continue;
+#endif
for_each_possible_cpu(cpu) {
- void *start = per_cpu_ptr(mod->percpu, cpu);
- void *va = (void *)addr;
-
- if (va >= start && va < start + mod->percpu_size) {
- if (can_addr) {
- *can_addr = (unsigned long) (va - start);
- *can_addr += (unsigned long)
- per_cpu_ptr(mod->percpu,
- get_boot_cpu_id());
- }
+ if (__is_module_percpu_address_helper(addr,
+ can_addr,
+ cpu,
+ mod->percpu,
+ mod->percpu_size)) {
preempt_enable();
return true;
- }
+ }
+
+#ifdef CONFIG_ADDRESS_SPACE_ISOLATION
+ if (__is_module_percpu_address_helper(
+ addr,
+ can_addr,
+ cpu,
+ mod->percpu_asi,
+ mod->percpu_asi_size)) {
+ preempt_enable();
+ return true;
+ }
+#endif
}
}

@@ -2344,6 +2428,10 @@ static int simplify_symbols(struct module *mod, const struct load_info *info)
/* Divert to percpu allocation if a percpu var. */
if (sym[i].st_shndx == info->index.pcpu)
secbase = (unsigned long)mod_percpu(mod);
+#ifdef CONFIG_ADDRESS_SPACE_ISOLATION
+ else if (sym[i].st_shndx == info->index.pcpu_asi)
+ secbase = (unsigned long)mod_percpu_asi(mod);
+#endif
else
secbase = info->sechdrs[sym[i].st_shndx].sh_addr;
sym[i].st_value += secbase;
@@ -2664,6 +2752,10 @@ static char elf_type(const Elf_Sym *sym, const struct load_info *info)
return 'U';
if (sym->st_shndx == SHN_ABS || sym->st_shndx == info->index.pcpu)
return 'a';
+#ifdef CONFIG_ADDRESS_SPACE_ISOLATION
+ if (sym->st_shndx == info->index.pcpu_asi)
+ return 'a';
+#endif
if (sym->st_shndx >= SHN_LORESERVE)
return '?';
if (sechdrs[sym->st_shndx].sh_flags & SHF_EXECINSTR)
@@ -2691,7 +2783,8 @@ static char elf_type(const Elf_Sym *sym, const struct load_info *info)
}

static bool is_core_symbol(const Elf_Sym *src, const Elf_Shdr *sechdrs,
- unsigned int shnum, unsigned int pcpundx)
+ unsigned int shnum, unsigned int pcpundx,
+ unsigned pcpu_asi_ndx)
{
const Elf_Shdr *sec;

@@ -2701,7 +2794,7 @@ static bool is_core_symbol(const Elf_Sym *src, const Elf_Shdr *sechdrs,
return false;

#ifdef CONFIG_KALLSYMS_ALL
- if (src->st_shndx == pcpundx)
+ if (src->st_shndx == pcpundx || src->st_shndx == pcpu_asi_ndx)
return true;
#endif

@@ -2743,7 +2836,7 @@ static void layout_symtab(struct module *mod, struct load_info *info)
for (ndst = i = 0; i < nsrc; i++) {
if (i == 0 || is_livepatch_module(mod) ||
is_core_symbol(src+i, info->sechdrs, info->hdr->e_shnum,
- info->index.pcpu)) {
+ info->index.pcpu, info->index.pcpu_asi)) {
strtab_size += strlen(&info->strtab[src[i].st_name])+1;
ndst++;
}
@@ -2807,7 +2900,7 @@ static void add_kallsyms(struct module *mod, const struct load_info *info)
mod->kallsyms->typetab[i] = elf_type(src + i, info);
if (i == 0 || is_livepatch_module(mod) ||
is_core_symbol(src+i, info->sechdrs, info->hdr->e_shnum,
- info->index.pcpu)) {
+ info->index.pcpu, info->index.pcpu_asi)) {
mod->core_kallsyms.typetab[ndst] =
mod->kallsyms->typetab[i];
dst[ndst] = src[i];
@@ -3289,6 +3382,12 @@ static int setup_load_info(struct load_info *info, int flags)

info->index.pcpu = find_pcpusec(info);

+#ifdef CONFIG_ADDRESS_SPACE_ISOLATION
+ info->index.pcpu_asi = find_pcpusec_asi(info);
+#else
+ info->index.pcpu_asi = 0;
+#endif
+
return 0;
}

@@ -3629,6 +3728,12 @@ static struct module *layout_and_allocate(struct load_info *info, int flags)
/* We will do a special allocation for per-cpu sections later. */
info->sechdrs[info->index.pcpu].sh_flags &= ~(unsigned long)SHF_ALLOC;

+#ifdef CONFIG_ADDRESS_SPACE_ISOLATION
+ if (info->index.pcpu_asi)
+ info->sechdrs[info->index.pcpu_asi].sh_flags &=
+ ~(unsigned long)SHF_ALLOC;
+#endif
+
/*
* Mark ro_after_init section with SHF_RO_AFTER_INIT so that
* layout_sections() can put it in the right place.
@@ -3700,6 +3805,14 @@ static int post_relocation(struct module *mod, const struct load_info *info)
percpu_modcopy(mod, (void *)info->sechdrs[info->index.pcpu].sh_addr,
info->sechdrs[info->index.pcpu].sh_size);

+#ifdef CONFIG_ADDRESS_SPACE_ISOLATION
+ /* Copy relocated percpu ASI area over. */
+ percpu_asi_modcopy(
+ mod,
+ (void *)info->sechdrs[info->index.pcpu_asi].sh_addr,
+ info->sechdrs[info->index.pcpu_asi].sh_size);
+#endif
+
/* Setup kallsyms-specific fields. */
add_kallsyms(mod, info);

@@ -4094,6 +4207,11 @@ static int load_module(struct load_info *info, const char __user *uargs,
err = percpu_modalloc(mod, info);
if (err)
goto unlink_mod;
+#ifdef CONFIG_ADDRESS_SPACE_ISOLATION
+ err = percpu_asi_modalloc(mod, info);
+ if (err)
+ goto unlink_mod;
+#endif

/* Now module is in final location, initialize linked lists, etc. */
err = module_unload_init(mod);
@@ -4183,7 +4301,11 @@ static int load_module(struct load_info *info, const char __user *uargs,
/* Get rid of temporary copy. */
free_copy(info);

- asi_load_module(mod);
+ err = asi_load_module(mod);
+ /* If the ASI loading failed, it doesn't necessarily mean that the
+ * module loading failed. We print an error and move on. */
+ if (err)
+ pr_err("ASI: failed loading module %s", mod->name);

/* Done! */
trace_module_load(mod);
diff --git a/mm/percpu.c b/mm/percpu.c
index beaca5adf9d4..3665a5ea71ec 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -169,6 +169,10 @@ struct pcpu_chunk *pcpu_first_chunk __ro_after_init;
*/
struct pcpu_chunk *pcpu_reserved_chunk __ro_after_init;

+#ifdef CONFIG_ADDRESS_SPACE_ISOLATION
+struct pcpu_chunk *pcpu_reserved_nonsensitive_chunk __ro_after_init;
+#endif
+
DEFINE_SPINLOCK(pcpu_lock); /* all internal data structures */
static DEFINE_MUTEX(pcpu_alloc_mutex); /* chunk create/destroy, [de]pop, map ext */

@@ -1621,6 +1625,11 @@ static struct pcpu_chunk *pcpu_chunk_addr_search(void *addr)
if (pcpu_addr_in_chunk(pcpu_first_chunk, addr))
return pcpu_first_chunk;

+#ifdef CONFIG_ADDRESS_SPACE_ISOLATION
+ /* is it in the reserved ASI region? */
+ if (pcpu_addr_in_chunk(pcpu_reserved_nonsensitive_chunk, addr))
+ return pcpu_reserved_nonsensitive_chunk;
+#endif
/* is it in the reserved region? */
if (pcpu_addr_in_chunk(pcpu_reserved_chunk, addr))
return pcpu_reserved_chunk;
@@ -1805,23 +1814,37 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved,

spin_lock_irqsave(&pcpu_lock, flags);

+#define TRY_ALLOC_FROM_CHUNK(source_chunk, chunk_name) \
+do { \
+ if (!source_chunk) { \
+ err = chunk_name " chunk not allocated"; \
+ goto fail_unlock; \
+ } \
+ chunk = source_chunk; \
+ \
+ off = pcpu_find_block_fit(chunk, bits, bit_align, is_atomic); \
+ if (off < 0) { \
+ err = "alloc from " chunk_name " chunk failed"; \
+ goto fail_unlock; \
+ } \
+ \
+ off = pcpu_alloc_area(chunk, bits, bit_align, off); \
+ if (off >= 0) \
+ goto area_found; \
+ \
+ err = "alloc from " chunk_name " chunk failed"; \
+ goto fail_unlock; \
+} while (0)
+
/* serve reserved allocations from the reserved chunk if available */
- if (reserved && pcpu_reserved_chunk) {
- chunk = pcpu_reserved_chunk;
-
- off = pcpu_find_block_fit(chunk, bits, bit_align, is_atomic);
- if (off < 0) {
- err = "alloc from reserved chunk failed";
- goto fail_unlock;
- }
-
- off = pcpu_alloc_area(chunk, bits, bit_align, off);
- if (off >= 0)
- goto area_found;
-
- err = "alloc from reserved chunk failed";
- goto fail_unlock;
- }
+#ifdef CONFIG_ADDRESS_SPACE_ISOLATION
+ if (reserved && (gfp & __GFP_GLOBAL_NONSENSITIVE))
+ TRY_ALLOC_FROM_CHUNK(pcpu_reserved_nonsensitive_chunk,
+ "reserverved ASI");
+ else
+#endif
+ if (reserved && pcpu_reserved_chunk)
+ TRY_ALLOC_FROM_CHUNK(pcpu_reserved_chunk, "reserved");

restart:
/* search through normal chunks */
@@ -1998,6 +2021,14 @@ void __percpu *__alloc_reserved_percpu(size_t size, size_t align)
return pcpu_alloc(size, align, true, GFP_KERNEL);
}

+#ifdef CONFIG_ADDRESS_SPACE_ISOLATION
+void __percpu *__alloc_reserved_percpu_asi(size_t size, size_t align)
+{
+ return pcpu_alloc(size, align, true,
+ GFP_KERNEL | __GFP_GLOBAL_NONSENSITIVE);
+}
+#endif
+
/**
* pcpu_balance_free - manage the amount of free chunks
* @empty_only: free chunks only if there are no populated pages
@@ -2838,15 +2869,46 @@ void __init pcpu_setup_first_chunk(const struct pcpu_alloc_info *ai,
* the dynamic region.
*/
tmp_addr = (unsigned long)base_addr + static_size;
+#ifdef CONFIG_ADDRESS_SPACE_ISOLATION
+ /* If ASI is used, split the reserved size between the nonsensitive
+ * chunk and the normal chunk evenly. */
+ map_size = (ai->reserved_size / 2) ?: dyn_size;
+#else
map_size = ai->reserved_size ?: dyn_size;
+#endif
chunk = pcpu_alloc_first_chunk(tmp_addr, map_size);

/* init dynamic chunk if necessary */
if (ai->reserved_size) {
- pcpu_reserved_chunk = chunk;
+#ifdef CONFIG_ADDRESS_SPACE_ISOLATION
+ /* TODO: check if ASI was enabled via boot param or static branch */
+ /* We allocated pcpu_reserved_nonsensitive_chunk only if
+ * pcpu_reserved_chunk is used as well. */
+ pcpu_reserved_nonsensitive_chunk = chunk;
+ pcpu_reserved_nonsensitive_chunk->is_asi_nonsensitive = true;

+ /* We used the previous chunk as pcpu_reserved_nonsensitive_chunk. Now
+ * allocate pcpu_reserved_chunk */
+ tmp_addr = (unsigned long)base_addr + static_size +
+ (ai->reserved_size / 2);
+ map_size = ai->reserved_size / 2;
+ chunk = pcpu_alloc_first_chunk(tmp_addr, map_size);
+#endif
+ /* Whether ASI is enabled or disabled, the end result is the
+ * same:
+ * If ASI is enabled, tmp_addr, used for pcpu_first_chunk should
+ * be after
+ * 1. pcpu_reserved_nonsensitive_chunk AND
+ * 2. pcpu_reserved_chunk
+ * Since we split the reserve size in half, we skip in total the
+ * whole ai->reserved_size.
+ * If ASI is disabled, tmp_addr, used for pcpu_first_chunk is
+ * just after pcpu_reserved_chunk */
tmp_addr = (unsigned long)base_addr + static_size +
ai->reserved_size;
+
+ pcpu_reserved_chunk = chunk;
+
map_size = dyn_size;
chunk = pcpu_alloc_first_chunk(tmp_addr, map_size);
}
@@ -3129,7 +3191,6 @@ int __init pcpu_embed_first_chunk(size_t reserved_size, size_t dyn_size,
cpu_distance_fn);
if (IS_ERR(ai))
return PTR_ERR(ai);
-
size_sum = ai->static_size + ai->reserved_size + ai->dyn_size;
areas_size = PFN_ALIGN(ai->nr_groups * sizeof(void *));

@@ -3460,3 +3521,40 @@ static int __init percpu_enable_async(void)
return 0;
}
subsys_initcall(percpu_enable_async);
+
+#ifdef CONFIG_ADDRESS_SPACE_ISOLATION
+void __init pcpu_map_asi_reserved_chunk(void)
+{
+ void *start_addr, *end_addr;
+ unsigned long map_start_addr, map_end_addr;
+ struct pcpu_chunk *chunk = pcpu_reserved_nonsensitive_chunk;
+ int err = 0;
+
+ if (!chunk)
+ return;
+
+ start_addr = chunk->base_addr + chunk->start_offset;
+ end_addr = chunk->base_addr + chunk->nr_pages * PAGE_SIZE -
+ chunk->end_offset;
+
+
+ /* No need for asi_map_percpu, since these addresses are "real". The
+ * chunk has full pages allocated, so we're not worried about leakage of
+ * data caused by start_addr-->end_addr not being page aligned. asi_map,
+ * however, will fail/crash if the addresses are not aligned. */
+ map_start_addr = (unsigned long)start_addr & PAGE_MASK;
+ map_end_addr = PAGE_ALIGN((unsigned long)end_addr);
+
+ pr_debug("%s:%d mapping 0x%lx --> 0x%lx\n",
+ __func__, __LINE__, map_start_addr, map_end_addr);
+ err = asi_map(ASI_GLOBAL_NONSENSITIVE,
+ (void*)map_start_addr, map_end_addr - map_start_addr);
+
+ WARN(err, "Failed mapping percpu reserved chunk into ASI");
+
+ /* If we couldn't map the chunk into ASI, it is useless. Set the chunk
+ * to NULL, so allocations from it will fail. */
+ if (err)
+ pcpu_reserved_nonsensitive_chunk = NULL;
+}
+#endif
--
2.35.1.473.g83b2b277ed-goog

2022-02-23 15:40:02

by Junaid Shahid

[permalink] [raw]
Subject: [RFC PATCH 47/47] mm: asi: Properly un/mapping task stack from ASI + tlb flush

From: Ofir Weisse <[email protected]>

The task stack must be explicitly mapped into and unmapped from ASI in
several places, especially since a task_struct might be reused,
potentially with a different mm.

1. Map in vcpu_run() @ arch/x86/kvm/x86.c
2. Unmap in release_task_stack() @ kernel/fork.c
3. Unmap in do_exit() @ kernel/exit.c
4. Unmap in begin_new_exec() @ fs/exec.c
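
A rough sketch of the intended lifecycle (simplified from the hunks
below; error handling abbreviated):

  /* KVM entry path (vcpu_run): make sure the current task's stack is
   * mapped into the VM's restricted address space before running the
   * guest. */
  r = asi_map_task_stack(current, vcpu->kvm->asi);
  if (r)
          return r;

  /* Teardown paths (release_task_stack, do_exit, begin_new_exec):
   * remove the mapping before the stack or the mm goes away. */
  asi_unmap_task_stack(tsk);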

Signed-off-by: Ofir Weisse <[email protected]>


---
arch/x86/include/asm/asi.h | 6 ++++
arch/x86/kvm/x86.c | 6 ++++
arch/x86/mm/asi.c | 59 ++++++++++++++++++++++++++++++++++++++
fs/exec.c | 7 ++++-
include/asm-generic/asi.h | 16 +++++++++--
include/linux/sched.h | 5 ++++
kernel/exit.c | 2 +-
kernel/fork.c | 22 +++++++++++++-
8 files changed, 118 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/asi.h b/arch/x86/include/asm/asi.h
index 6148e65fb0c2..9d8f43981678 100644
--- a/arch/x86/include/asm/asi.h
+++ b/arch/x86/include/asm/asi.h
@@ -87,6 +87,12 @@ void asi_unmap_user(struct asi *asi, void *va, size_t len);
int asi_fill_pgtbl_pool(struct asi_pgtbl_pool *pool, uint count, gfp_t flags);
void asi_clear_pgtbl_pool(struct asi_pgtbl_pool *pool);

+int asi_map_task_stack(struct task_struct *tsk, struct asi *asi);
+void asi_unmap_task_stack(struct task_struct *tsk);
+void asi_mark_pages_local_nonsensitive(struct page *pages, uint order,
+ struct mm_struct *mm);
+void asi_clear_pages_local_nonsensitive(struct page *pages, uint order);
+
static inline void asi_init_pgtbl_pool(struct asi_pgtbl_pool *pool)
{
pool->pgtbl_list = NULL;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 294f73e9e71e..718104eefaed 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10122,6 +10122,12 @@ static int vcpu_run(struct kvm_vcpu *vcpu)
vcpu->srcu_idx = srcu_read_lock(&kvm->srcu);
vcpu->arch.l1tf_flush_l1d = true;

+ /* We must have current->stack mapped into asi. This function can be
+ * safely called many times, as it will only do the actual mapping once. */
+ r = asi_map_task_stack(current, vcpu->kvm->asi);
+ if (r != 0)
+ return r;
+
for (;;) {
if (kvm_vcpu_running(vcpu)) {
r = vcpu_enter_guest(vcpu);
diff --git a/arch/x86/mm/asi.c b/arch/x86/mm/asi.c
index 7f2aa1823736..a86ac6644a57 100644
--- a/arch/x86/mm/asi.c
+++ b/arch/x86/mm/asi.c
@@ -1029,6 +1029,45 @@ void asi_unmap(struct asi *asi, void *addr, size_t len, bool flush_tlb)
asi_flush_tlb_range(asi, addr, len);
}

+int asi_map_task_stack(struct task_struct *tsk, struct asi *asi)
+{
+ int ret;
+
+ /* If the stack is already mapped to asi - don't need to map it again. */
+ if (tsk->asi_stack_mapped)
+ return 0;
+
+ if (!tsk->mm)
+ return -EINVAL;
+
+ /* If the stack was allocated via the page allocator, we assume the
+ * stack pages were marked with PageNonSensitive, therefore tsk->stack
+ * address is properly aliased. */
+ ret = asi_map(ASI_LOCAL_NONSENSITIVE, tsk->stack, THREAD_SIZE);
+ if (!ret) {
+ tsk->asi_stack_mapped = asi;
+ asi_sync_mapping(asi, tsk->stack, THREAD_SIZE);
+ }
+
+ return ret;
+}
+
+void asi_unmap_task_stack(struct task_struct *tsk)
+{
+ /* No need to unmap if the stack was not mapped to begin with. */
+ if (!tsk->asi_stack_mapped)
+ return;
+
+ if (!tsk->mm)
+ return;
+
+ asi_unmap(ASI_LOCAL_NONSENSITIVE, tsk->stack, THREAD_SIZE,
+ /* flush_tlb = */ true);
+
+ tsk->asi_stack_mapped = NULL;
+}
+
+
void *asi_va(unsigned long pa)
{
struct page *page = pfn_to_page(PHYS_PFN(pa));
@@ -1336,3 +1375,23 @@ void asi_unmap_user(struct asi *asi, void *addr, size_t len)
}
}
EXPORT_SYMBOL_GPL(asi_unmap_user);
+
+void asi_mark_pages_local_nonsensitive(struct page *pages, uint order,
+ struct mm_struct *mm)
+{
+ uint i;
+ for (i = 0; i < (1 << order); i++) {
+ __SetPageLocalNonSensitive(pages + i);
+ pages[i].asi_mm = mm;
+ }
+}
+
+void asi_clear_pages_local_nonsensitive(struct page *pages, uint order)
+{
+ uint i;
+ for (i = 0; i < (1 << order); i++) {
+ __ClearPageLocalNonSensitive(pages + i);
+ pages[i].asi_mm = NULL;
+ }
+}
+
diff --git a/fs/exec.c b/fs/exec.c
index 76f3b433e80d..fb9182cf3f33 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -69,6 +69,7 @@
#include <linux/uaccess.h>
#include <asm/mmu_context.h>
#include <asm/tlb.h>
+#include <asm/asi.h>

#include <trace/events/task.h>
#include "internal.h"
@@ -1238,7 +1239,11 @@ int begin_new_exec(struct linux_binprm * bprm)
struct task_struct *me = current;
int retval;

- /* TODO: (oweisse) unmap the stack from ASI */
+ /* The old mm is about to be released later on in exec_mmap. We are
+ * reusing the task, including its stack which was mapped to
+ * mm->asi_pgd[0]. We need to asi_unmap the stack, so the destructor of
+ * the mm won't complain about "lingering" asi mappings. */
+ asi_unmap_task_stack(current);

/* Once we are committed compute the creds */
retval = bprm_creds_from_file(bprm);
diff --git a/include/asm-generic/asi.h b/include/asm-generic/asi.h
index 2763cb1a974c..6e9a261a2b9d 100644
--- a/include/asm-generic/asi.h
+++ b/include/asm-generic/asi.h
@@ -66,8 +66,13 @@ static inline struct asi *asi_get_target(void) { return NULL; }

static inline struct asi *asi_get_current(void) { return NULL; }

-static inline
-int asi_map_gfp(struct asi *asi, void *addr, size_t len, gfp_t gfp_flags)
+static inline int asi_map_task_stack(struct task_struct *tsk, struct asi *asi)
+{ return 0; }
+
+static inline void asi_unmap_task_stack(struct task_struct *tsk) { }
+
+static inline int asi_map_gfp(struct asi *asi, void *addr, size_t len,
+ gfp_t gfp_flags)
{
return 0;
}
@@ -130,6 +135,13 @@ static inline int asi_load_module(struct module* module) {return 0;}

static inline void asi_unload_module(struct module* module) { }

+static inline
+void asi_mark_pages_local_nonsensitive(struct page *pages, uint order,
+ struct mm_struct *mm) { }
+
+static inline
+void asi_clear_pages_local_nonsensitive(struct page *pages, uint order) { }
+
#endif /* !_ASSEMBLY_ */

#endif /* !CONFIG_ADDRESS_SPACE_ISOLATION */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 78c351e35fec..87ad45e52b19 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -67,6 +67,7 @@ struct sighand_struct;
struct signal_struct;
struct task_delay_info;
struct task_group;
+struct asi;

/*
* Task state bitmask. NOTE! These bits are also
@@ -1470,6 +1471,10 @@ struct task_struct {
int mce_count;
#endif

+#ifdef CONFIG_ADDRESS_SPACE_ISOLATION
+ struct asi *asi_stack_mapped;
+#endif
+
#ifdef CONFIG_KRETPROBES
struct llist_head kretprobe_instances;
#endif
diff --git a/kernel/exit.c b/kernel/exit.c
index ab2749cf6887..f21cc21814d1 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -768,7 +768,7 @@ void __noreturn do_exit(long code)
profile_task_exit(tsk);
kcov_task_exit(tsk);

- /* TODO: (oweisse) unmap the stack from ASI */
+ asi_unmap_task_stack(tsk);

coredump_task_exit(tsk);
ptrace_event(PTRACE_EVENT_EXIT, code);
diff --git a/kernel/fork.c b/kernel/fork.c
index cb147a72372d..876fefc477cb 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -216,7 +216,6 @@ static int free_vm_stack_cache(unsigned int cpu)

static unsigned long *alloc_thread_stack_node(struct task_struct *tsk, int node)
{
- /* TODO: (oweisse) Add annotation to map the stack into ASI */
#ifdef CONFIG_VMAP_STACK
void *stack;
int i;
@@ -269,7 +268,16 @@ static unsigned long *alloc_thread_stack_node(struct task_struct *tsk, int node)
struct page *page = alloc_pages_node(node, THREADINFO_GFP,
THREAD_SIZE_ORDER);

+ /* When marking pages as PageLocalNonSensitive we set page->mm to
+ * NULL. We must make sure the flag is cleared from the stack pages
+ * before free_pages is called. Otherwise, page->mm will be accessed,
+ * which will result in a NULL dereference. page_address() below will
+ * yield an aliased address after ASI_LOCAL_MAP, thanks to the
+ * PageLocalNonSensitive flag. */
if (likely(page)) {
+ asi_mark_pages_local_nonsensitive(page,
+ THREAD_SIZE_ORDER,
+ NULL);
tsk->stack = kasan_reset_tag(page_address(page));
return tsk->stack;
}
@@ -301,6 +309,14 @@ static inline void free_thread_stack(struct task_struct *tsk)
}
#endif

+ /* We must clear the PageLocalNonSensitive flag before calling free_pages().
+ * Otherwise page->mm (which is NULL) will be accessed, in order to
+ * unmap the pages from ASI. Specifically for the stack, we assume the
+ * pages were already unmapped from ASI before we got here, via
+ * asi_unmap_task_stack(). */
+ asi_clear_pages_local_nonsensitive(virt_to_page(tsk->stack),
+ THREAD_SIZE_ORDER);
+
__free_pages(virt_to_page(tsk->stack), THREAD_SIZE_ORDER);
}
# else
@@ -436,6 +452,7 @@ static void release_task_stack(struct task_struct *tsk)
if (WARN_ON(READ_ONCE(tsk->__state) != TASK_DEAD))
return; /* Better to leak the stack than to free prematurely */

+ asi_unmap_task_stack(tsk);
account_kernel_stack(tsk, -1);
free_thread_stack(tsk);
tsk->stack = NULL;
@@ -916,6 +933,9 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
* functions again.
*/
tsk->stack = stack;
+#ifdef CONFIG_ADDRESS_SPACE_ISOLATION
+ tsk->asi_stack_mapped = NULL;
+#endif
#ifdef CONFIG_VMAP_STACK
tsk->stack_vm_area = stack_vm_area;
#endif
--
2.35.1.473.g83b2b277ed-goog

2022-02-23 15:41:24

by Junaid Shahid

[permalink] [raw]
Subject: [RFC PATCH 25/47] mm: asi: Avoid warning from NMI userspace accesses in ASI context

nmi_uaccess_okay() emits a warning if the current CR3 does not match
mm->pgd. Limit the warning to the case where ASI is not active, since
with ASI active the loaded CR3 legitimately points to the restricted
ASI page tables rather than mm->pgd.

Signed-off-by: Junaid Shahid <[email protected]>


---
arch/x86/mm/tlb.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 25bee959d1d3..628f1cd904ac 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -1292,7 +1292,8 @@ bool nmi_uaccess_okay(void)
if (loaded_mm != current_mm)
return false;

- VM_WARN_ON_ONCE(current_mm->pgd != __va(read_cr3_pa()));
+ VM_WARN_ON_ONCE(current_mm->pgd != __va(read_cr3_pa()) &&
+ !is_asi_active());

return true;
}
--
2.35.1.473.g83b2b277ed-goog

2022-02-23 16:09:50

by Junaid Shahid

[permalink] [raw]
Subject: [RFC PATCH 35/47] mm: asi: asi_exit() on PF, skip handling if address is accessible

From: Ofir Weisse <[email protected]>

On a page fault, do asi_exit() and then check whether the address is
accessible after the exit. We do this by refactoring
spurious_kernel_fault() into two parts:

1. Verify that the error code value is something that could arise from a
lazy TLB update.
2. Walk the page table and verify permissions; this part is now called
is_address_accessible_now(). We also define PTE_PRESENT() and
PMD_PRESENT(), which are suitable for checking userspace pages. For the
sake of spurious faults, pte_present() and pmd_present() are only good
for kernelspace pages, because these macros might return true even if
the present bit is 0 (which is only relevant for userspace).
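
Roughly, the fault path then becomes (a simplified sketch of the flow
added to exc_page_fault() below):

  /* On a fault taken while inside an ASI domain, leave the restricted
   * address space and recheck the access against the full page tables
   * before falling through to the normal fault handling. */
  if (asi_get_current())
          asi_exit();

  if (current && mm_asi_enabled(current->mm)) {
          pgd_t *pgd = (pgd_t *)__va(read_cr3_pa()) + pgd_index(address);

          if (is_address_accessible_now(error_code, address, pgd))
                  return; /* the asi_exit() already resolved the fault */
  }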

Signed-off-by: Ofir Weisse <[email protected]>


---
arch/x86/mm/fault.c | 60 ++++++++++++++++++++++++++++++++++------
include/linux/mm_types.h | 3 ++
2 files changed, 55 insertions(+), 8 deletions(-)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 8692eb50f4a5..d08021ba380b 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -982,6 +982,8 @@ static int spurious_kernel_fault_check(unsigned long error_code, pte_t *pte)
return 1;
}

+static int is_address_accessible_now(unsigned long error_code, unsigned long address,
+ pgd_t *pgd);
/*
* Handle a spurious fault caused by a stale TLB entry.
*
@@ -1003,15 +1005,13 @@ static int spurious_kernel_fault_check(unsigned long error_code, pte_t *pte)
* See Intel Developer's Manual Vol 3 Section 4.10.4.3, bullet 3
* (Optional Invalidation).
*/
+/* A spurious fault is also possible when Address Space Isolation (ASI) is in
+ * use. Specifically, code running within an ASI domain touched memory outside
+ * the domain. This access causes a page-fault --> asi_exit() */
static noinline int
spurious_kernel_fault(unsigned long error_code, unsigned long address)
{
pgd_t *pgd;
- p4d_t *p4d;
- pud_t *pud;
- pmd_t *pmd;
- pte_t *pte;
- int ret;

/*
* Only writes to RO or instruction fetches from NX may cause
@@ -1027,6 +1027,37 @@ spurious_kernel_fault(unsigned long error_code, unsigned long address)
return 0;

pgd = init_mm.pgd + pgd_index(address);
+ return is_address_accessible_now(error_code, address, pgd);
+}
+NOKPROBE_SYMBOL(spurious_kernel_fault);
+
+
+/* Check if an address (kernel or userspace) would cause a page fault if
+ * accessed now.
+ *
+ * For kernel addresses, pte_present and pmd_present are sufficient. For
+ * userspace, we must use PTE_PRESENT and PMD_PRESENT, which will only check the
+ * present bits.
+ * The existing pmd_present() in arch/x86/include/asm/pgtable.h is misleading.
+ * The PMD page might be in the middle of split_huge_page with present bit
+ * clear, but pmd_present will still return true. We are interested in knowing
+ * if the page is accessible to hardware - that is, the present bit is 1. */
+#define PMD_PRESENT(pmd) (pmd_flags(pmd) & _PAGE_PRESENT)
+
+/* pte_present will return true if _PAGE_PROTNONE is 1. We care if the hardware
+ * can actually access the page right now. */
+#define PTE_PRESENT(pte) (pte_flags(pte) & _PAGE_PRESENT)
+
+static noinline int
+is_address_accessible_now(unsigned long error_code, unsigned long address,
+ pgd_t *pgd)
+{
+ p4d_t *p4d;
+ pud_t *pud;
+ pmd_t *pmd;
+ pte_t *pte;
+ int ret;
+
if (!pgd_present(*pgd))
return 0;

@@ -1045,14 +1076,14 @@ spurious_kernel_fault(unsigned long error_code, unsigned long address)
return spurious_kernel_fault_check(error_code, (pte_t *) pud);

pmd = pmd_offset(pud, address);
- if (!pmd_present(*pmd))
+ if (!PMD_PRESENT(*pmd))
return 0;

if (pmd_large(*pmd))
return spurious_kernel_fault_check(error_code, (pte_t *) pmd);

pte = pte_offset_kernel(pmd, address);
- if (!pte_present(*pte))
+ if (!PTE_PRESENT(*pte))
return 0;

ret = spurious_kernel_fault_check(error_code, pte);
@@ -1068,7 +1099,6 @@ spurious_kernel_fault(unsigned long error_code, unsigned long address)

return ret;
}
-NOKPROBE_SYMBOL(spurious_kernel_fault);

int show_unhandled_signals = 1;

@@ -1504,6 +1534,20 @@ DEFINE_IDTENTRY_RAW_ERRORCODE(exc_page_fault)
* the fixup on the next page fault.
*/
struct asi *asi = asi_get_current();
+ if (asi)
+ asi_exit();
+
+ /* handle_page_fault() might call BUG() if we run it for a kernel
+ * address. This might be the case if we got here due to an ASI fault.
+ * We avoid this case by checking whether the address is now, after a
+ * potential asi_exit(), accessible by hardware. If it is - there's
+ * nothing to do.
+ */
+ if (current && mm_asi_enabled(current->mm)) {
+ pgd_t *pgd = (pgd_t*)__va(read_cr3_pa()) + pgd_index(address);
+ if (is_address_accessible_now(error_code, address, pgd))
+ return;
+ }

prefetchw(&current->mm->mmap_lock);

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index c3f209720a84..560909e80841 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -707,6 +707,9 @@ extern struct mm_struct init_mm;
#ifdef CONFIG_ADDRESS_SPACE_ISOLATION
static inline bool mm_asi_enabled(struct mm_struct *mm)
{
+ if (!mm)
+ return false;
+
return mm->asi_enabled;
}
#else
--
2.35.1.473.g83b2b277ed-goog

2022-02-23 17:05:55

by Junaid Shahid

[permalink] [raw]
Subject: [RFC PATCH 43/47] mm: asi: Annotation of dynamic variables to be nonsensitive

From: Ofir Weisse <[email protected]>

The heart of ASI is to differentiate between sensitive and non-sensitive
data access. This commit marks certain dynamic allocations as not
sensitive.

Some dynamically allocated variables are accessed frequently and would
therefore cause many ASI exits. The frequency of these accesses is
monitored by tracing asi_exits and analyzing the accessed addresses.
Many of these variables don't contain sensitive information and can
therefore be mapped into the global ASI region. This commit adds
GFP_LOCAL/GLOBAL_NONSENSITIVE attributes to these frequently-accessed
yet non-sensitive allocations. The end result is a very significant
reduction in ASI exits on real benchmarks.
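
For example, a typical conversion of an allocation site looks like this
(hypothetical call site; the real changes are in the diff below):

  /* Before: the object is only mapped in the full kernel address space,
   * so touching it inside the ASI restricted address space forces an
   * asi_exit(). */
  ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);

  /* After: the allocation is also mapped into the ASI global
   * non-sensitive region and can be accessed without leaving it. */
  ctx = kzalloc(sizeof(*ctx), GFP_KERNEL | __GFP_GLOBAL_NONSENSITIVE);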

Signed-off-by: Ofir Weisse <[email protected]>


---
arch/x86/include/asm/kvm_host.h | 3 ++-
arch/x86/kernel/apic/x2apic_cluster.c | 2 +-
arch/x86/kvm/cpuid.c | 4 ++-
arch/x86/kvm/lapic.c | 9 ++++---
arch/x86/kvm/mmu/mmu.c | 7 ++++++
arch/x86/kvm/vmx/vmx.c | 6 +++--
arch/x86/kvm/x86.c | 8 +++---
fs/binfmt_elf.c | 2 +-
fs/eventfd.c | 2 +-
fs/eventpoll.c | 10 +++++---
fs/exec.c | 2 ++
fs/file.c | 3 ++-
fs/timerfd.c | 2 +-
include/linux/kvm_host.h | 2 +-
include/linux/kvm_types.h | 3 +++
kernel/cgroup/cgroup.c | 4 +--
kernel/events/core.c | 15 +++++++----
kernel/exit.c | 2 ++
kernel/fork.c | 36 +++++++++++++++++++++------
kernel/rcu/srcutree.c | 3 ++-
kernel/sched/core.c | 6 +++--
kernel/sched/cpuacct.c | 8 +++---
kernel/sched/fair.c | 3 ++-
kernel/sched/topology.c | 14 +++++++----
kernel/smp.c | 17 +++++++------
kernel/trace/ring_buffer.c | 5 ++--
kernel/tracepoint.c | 2 +-
lib/radix-tree.c | 6 ++---
mm/memcontrol.c | 7 +++---
mm/util.c | 3 ++-
mm/vmalloc.c | 3 ++-
net/core/skbuff.c | 2 +-
net/core/sock.c | 2 +-
virt/kvm/coalesced_mmio.c | 2 +-
virt/kvm/eventfd.c | 5 ++--
virt/kvm/kvm_main.c | 12 ++++++---
36 files changed, 148 insertions(+), 74 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index b7292c4fece7..34a05add5e77 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1562,7 +1562,8 @@ static inline void kvm_ops_static_call_update(void)
#define __KVM_HAVE_ARCH_VM_ALLOC
static inline struct kvm *kvm_arch_alloc_vm(void)
{
- return __vmalloc(kvm_x86_ops.vm_size, GFP_KERNEL_ACCOUNT | __GFP_ZERO);
+ return __vmalloc(kvm_x86_ops.vm_size, GFP_KERNEL_ACCOUNT | __GFP_ZERO |
+ __GFP_GLOBAL_NONSENSITIVE);
}

#define __KVM_HAVE_ARCH_VM_FREE
diff --git a/arch/x86/kernel/apic/x2apic_cluster.c b/arch/x86/kernel/apic/x2apic_cluster.c
index 655fe820a240..a1f6eb51ecb7 100644
--- a/arch/x86/kernel/apic/x2apic_cluster.c
+++ b/arch/x86/kernel/apic/x2apic_cluster.c
@@ -144,7 +144,7 @@ static int alloc_clustermask(unsigned int cpu, int node)
}

cluster_hotplug_mask = kzalloc_node(sizeof(*cluster_hotplug_mask),
- GFP_KERNEL, node);
+ GFP_KERNEL | __GFP_GLOBAL_NONSENSITIVE, node);
if (!cluster_hotplug_mask)
return -ENOMEM;
cluster_hotplug_mask->node = node;
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 07e9215e911d..dedabfdd292e 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -310,7 +310,9 @@ int kvm_vcpu_ioctl_set_cpuid(struct kvm_vcpu *vcpu,
if (IS_ERR(e))
return PTR_ERR(e);

- e2 = kvmalloc_array(cpuid->nent, sizeof(*e2), GFP_KERNEL_ACCOUNT);
+ e2 = kvmalloc_array(cpuid->nent, sizeof(*e2),
+ GFP_KERNEL_ACCOUNT |
+ __GFP_LOCAL_NONSENSITIVE);
if (!e2) {
r = -ENOMEM;
goto out_free_cpuid;
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 213bbdfab49e..3a550299f015 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -213,7 +213,7 @@ void kvm_recalculate_apic_map(struct kvm *kvm)

new = kvzalloc(sizeof(struct kvm_apic_map) +
sizeof(struct kvm_lapic *) * ((u64)max_id + 1),
- GFP_KERNEL_ACCOUNT);
+ GFP_KERNEL_ACCOUNT | __GFP_LOCAL_NONSENSITIVE);

if (!new)
goto out;
@@ -993,7 +993,7 @@ bool kvm_irq_delivery_to_apic_fast(struct kvm *kvm, struct kvm_lapic *src,
*r = -1;

if (irq->shorthand == APIC_DEST_SELF) {
- *r = kvm_apic_set_irq(src->vcpu, irq, dest_map);
+ *r = kvm_apic_set_irq(src->vcpu, irq, dest_map);
return true;
}

@@ -2455,13 +2455,14 @@ int kvm_create_lapic(struct kvm_vcpu *vcpu, int timer_advance_ns)

ASSERT(vcpu != NULL);

- apic = kzalloc(sizeof(*apic), GFP_KERNEL_ACCOUNT);
+ apic = kzalloc(sizeof(*apic), GFP_KERNEL_ACCOUNT | __GFP_LOCAL_NONSENSITIVE);
if (!apic)
goto nomem;

vcpu->arch.apic = apic;

- apic->regs = (void *)get_zeroed_page(GFP_KERNEL_ACCOUNT);
+ apic->regs = (void *)get_zeroed_page(GFP_KERNEL_ACCOUNT
+ | __GFP_LOCAL_NONSENSITIVE);
if (!apic->regs) {
printk(KERN_ERR "malloc apic regs error for vcpu %x\n",
vcpu->vcpu_id);
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 5785a0d02558..a2ada1104c2d 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5630,6 +5630,13 @@ int kvm_mmu_create(struct kvm_vcpu *vcpu)
vcpu->arch.mmu_page_header_cache.gfp_zero = __GFP_ZERO;

vcpu->arch.mmu_shadow_page_cache.gfp_zero = __GFP_ZERO;
+#ifdef CONFIG_ADDRESS_SPACE_ISOLATION
+ if (static_cpu_has(X86_FEATURE_ASI) && mm_asi_enabled(current->mm))
+ vcpu->arch.mmu_shadow_page_cache.gfp_asi =
+ __GFP_LOCAL_NONSENSITIVE;
+ else
+ vcpu->arch.mmu_shadow_page_cache.gfp_asi = 0;
+#endif

vcpu->arch.mmu = &vcpu->arch.root_mmu;
vcpu->arch.walk_mmu = &vcpu->arch.root_mmu;
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index e1ad82c25a78..6e1bb017b696 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -2629,7 +2629,7 @@ void free_loaded_vmcs(struct loaded_vmcs *loaded_vmcs)
free_vmcs(loaded_vmcs->vmcs);
loaded_vmcs->vmcs = NULL;
if (loaded_vmcs->msr_bitmap)
- free_page((unsigned long)loaded_vmcs->msr_bitmap);
+ kfree(loaded_vmcs->msr_bitmap);
WARN_ON(loaded_vmcs->shadow_vmcs != NULL);
}

@@ -2648,7 +2648,9 @@ int alloc_loaded_vmcs(struct loaded_vmcs *loaded_vmcs)

if (cpu_has_vmx_msr_bitmap()) {
loaded_vmcs->msr_bitmap = (unsigned long *)
- __get_free_page(GFP_KERNEL_ACCOUNT);
+ kzalloc(PAGE_SIZE,
+ GFP_KERNEL_ACCOUNT |
+ __GFP_LOCAL_NONSENSITIVE);
if (!loaded_vmcs->msr_bitmap)
goto out_vmcs;
memset(loaded_vmcs->msr_bitmap, 0xff, PAGE_SIZE);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 451872d178e5..dd862edc1b5a 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -329,7 +329,8 @@ static struct kmem_cache *kvm_alloc_emulator_cache(void)

return kmem_cache_create_usercopy("x86_emulator", size,
__alignof__(struct x86_emulate_ctxt),
- SLAB_ACCOUNT, useroffset,
+ SLAB_ACCOUNT|SLAB_LOCAL_NONSENSITIVE,
+ useroffset,
size - useroffset, NULL);
}

@@ -10969,7 +10970,7 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)

r = -ENOMEM;

- page = alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO);
+ page = alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO | __GFP_LOCAL_NONSENSITIVE);
if (!page)
goto fail_free_lapic;
vcpu->arch.pio_data = page_address(page);
@@ -11718,7 +11719,8 @@ static int kvm_alloc_memslot_metadata(struct kvm *kvm,

lpages = __kvm_mmu_slot_lpages(slot, npages, level);

- linfo = kvcalloc(lpages, sizeof(*linfo), GFP_KERNEL_ACCOUNT);
+ linfo = kvcalloc(lpages, sizeof(*linfo),
+ GFP_KERNEL_ACCOUNT | __GFP_LOCAL_NONSENSITIVE);
if (!linfo)
goto out_free;

diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index f8c7f26f1fbb..b0550951da59 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -477,7 +477,7 @@ static struct elf_phdr *load_elf_phdrs(const struct elfhdr *elf_ex,
if (size == 0 || size > 65536 || size > ELF_MIN_ALIGN)
goto out;

- elf_phdata = kmalloc(size, GFP_KERNEL);
+ elf_phdata = kmalloc(size, GFP_KERNEL | __GFP_GLOBAL_NONSENSITIVE);
if (!elf_phdata)
goto out;

diff --git a/fs/eventfd.c b/fs/eventfd.c
index 3627dd7d25db..c748433e52af 100644
--- a/fs/eventfd.c
+++ b/fs/eventfd.c
@@ -415,7 +415,7 @@ static int do_eventfd(unsigned int count, int flags)
if (flags & ~EFD_FLAGS_SET)
return -EINVAL;

- ctx = kmalloc(sizeof(*ctx), GFP_KERNEL);
+ ctx = kmalloc(sizeof(*ctx), GFP_KERNEL | __GFP_GLOBAL_NONSENSITIVE);
if (!ctx)
return -ENOMEM;

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 06f4c5ae1451..b28826c9f079 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -1239,7 +1239,7 @@ static void ep_ptable_queue_proc(struct file *file, wait_queue_head_t *whead,
if (unlikely(!epi)) // an earlier allocation has failed
return;

- pwq = kmem_cache_alloc(pwq_cache, GFP_KERNEL);
+ pwq = kmem_cache_alloc(pwq_cache, GFP_KERNEL | __GFP_GLOBAL_NONSENSITIVE);
if (unlikely(!pwq)) {
epq->epi = NULL;
return;
@@ -1453,7 +1453,8 @@ static int ep_insert(struct eventpoll *ep, const struct epoll_event *event,
return -ENOSPC;
percpu_counter_inc(&ep->user->epoll_watches);

- if (!(epi = kmem_cache_zalloc(epi_cache, GFP_KERNEL))) {
+ if (!(epi = kmem_cache_zalloc(epi_cache,
+ GFP_KERNEL | __GFP_GLOBAL_NONSENSITIVE))) {
percpu_counter_dec(&ep->user->epoll_watches);
return -ENOMEM;
}
@@ -2373,11 +2374,12 @@ static int __init eventpoll_init(void)

/* Allocates slab cache used to allocate "struct epitem" items */
epi_cache = kmem_cache_create("eventpoll_epi", sizeof(struct epitem),
- 0, SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT, NULL);
+ 0, SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT|SLAB_GLOBAL_NONSENSITIVE, NULL);

/* Allocates slab cache used to allocate "struct eppoll_entry" */
pwq_cache = kmem_cache_create("eventpoll_pwq",
- sizeof(struct eppoll_entry), 0, SLAB_PANIC|SLAB_ACCOUNT, NULL);
+ sizeof(struct eppoll_entry), 0,
+ SLAB_PANIC|SLAB_ACCOUNT|SLAB_GLOBAL_NONSENSITIVE, NULL);

ephead_cache = kmem_cache_create("ep_head",
sizeof(struct epitems_head), 0, SLAB_PANIC|SLAB_ACCOUNT, NULL);
diff --git a/fs/exec.c b/fs/exec.c
index 537d92c41105..76f3b433e80d 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1238,6 +1238,8 @@ int begin_new_exec(struct linux_binprm * bprm)
struct task_struct *me = current;
int retval;

+ /* TODO: (oweisse) unmap the stack from ASI */
+
/* Once we are committed compute the creds */
retval = bprm_creds_from_file(bprm);
if (retval)
diff --git a/fs/file.c b/fs/file.c
index 97d212a9b814..85bfa5d70323 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -117,7 +117,8 @@ static struct fdtable * alloc_fdtable(unsigned int nr)
if (!fdt)
goto out;
fdt->max_fds = nr;
- data = kvmalloc_array(nr, sizeof(struct file *), GFP_KERNEL_ACCOUNT);
+ data = kvmalloc_array(nr, sizeof(struct file *),
+ GFP_KERNEL_ACCOUNT | __GFP_LOCAL_NONSENSITIVE);
if (!data)
goto out_fdt;
fdt->fd = data;
diff --git a/fs/timerfd.c b/fs/timerfd.c
index e9c96a0c79f1..385fbb29837d 100644
--- a/fs/timerfd.c
+++ b/fs/timerfd.c
@@ -425,7 +425,7 @@ SYSCALL_DEFINE2(timerfd_create, int, clockid, int, flags)
!capable(CAP_WAKE_ALARM))
return -EPERM;

- ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
+ ctx = kzalloc(sizeof(*ctx), GFP_KERNEL | __GFP_GLOBAL_NONSENSITIVE);
if (!ctx)
return -ENOMEM;

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index f31f7442eced..dfbb26d7a185 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1085,7 +1085,7 @@ int kvm_arch_create_vm_debugfs(struct kvm *kvm);
*/
static inline struct kvm *kvm_arch_alloc_vm(void)
{
- return kzalloc(sizeof(struct kvm), GFP_KERNEL);
+ return kzalloc(sizeof(struct kvm), GFP_KERNEL | __GFP_LOCAL_NONSENSITIVE);
}
#endif

diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
index 234eab059839..a5a810db85ca 100644
--- a/include/linux/kvm_types.h
+++ b/include/linux/kvm_types.h
@@ -64,6 +64,9 @@ struct gfn_to_hva_cache {
struct kvm_mmu_memory_cache {
int nobjs;
gfp_t gfp_zero;
+#ifdef CONFIG_ADDRESS_SPACE_ISOLATION
+ gfp_t gfp_asi;
+#endif
struct kmem_cache *kmem_cache;
void *objects[KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE];
};
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 729495e17363..79692dafd2be 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -1221,7 +1221,7 @@ static struct css_set *find_css_set(struct css_set *old_cset,
if (cset)
return cset;

- cset = kzalloc(sizeof(*cset), GFP_KERNEL);
+ cset = kzalloc(sizeof(*cset), GFP_KERNEL | __GFP_GLOBAL_NONSENSITIVE);
if (!cset)
return NULL;

@@ -5348,7 +5348,7 @@ static struct cgroup *cgroup_create(struct cgroup *parent, const char *name,

/* allocate the cgroup and its ID, 0 is reserved for the root */
cgrp = kzalloc(struct_size(cgrp, ancestor_ids, (level + 1)),
- GFP_KERNEL);
+ GFP_KERNEL | __GFP_GLOBAL_NONSENSITIVE);
if (!cgrp)
return ERR_PTR(-ENOMEM);

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 1914cc538cab..64eeb2c67d92 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -4586,7 +4586,8 @@ alloc_perf_context(struct pmu *pmu, struct task_struct *task)
{
struct perf_event_context *ctx;

- ctx = kzalloc(sizeof(struct perf_event_context), GFP_KERNEL);
+ ctx = kzalloc(sizeof(struct perf_event_context),
+ GFP_KERNEL | __GFP_GLOBAL_NONSENSITIVE);
if (!ctx)
return NULL;

@@ -11062,7 +11063,8 @@ int perf_pmu_register(struct pmu *pmu, const char *name, int type)

mutex_lock(&pmus_lock);
ret = -ENOMEM;
- pmu->pmu_disable_count = alloc_percpu(int);
+ pmu->pmu_disable_count = alloc_percpu_gfp(int,
+ GFP_KERNEL | __GFP_GLOBAL_NONSENSITIVE);
if (!pmu->pmu_disable_count)
goto unlock;

@@ -11112,7 +11114,8 @@ int perf_pmu_register(struct pmu *pmu, const char *name, int type)
goto got_cpu_context;

ret = -ENOMEM;
- pmu->pmu_cpu_context = alloc_percpu(struct perf_cpu_context);
+ pmu->pmu_cpu_context = alloc_percpu_gfp(struct perf_cpu_context,
+ GFP_KERNEL | __GFP_GLOBAL_NONSENSITIVE);
if (!pmu->pmu_cpu_context)
goto free_dev;

@@ -11493,7 +11496,8 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
}

node = (cpu >= 0) ? cpu_to_node(cpu) : -1;
- event = kmem_cache_alloc_node(perf_event_cache, GFP_KERNEL | __GFP_ZERO,
+ event = kmem_cache_alloc_node(perf_event_cache,
+ GFP_KERNEL | __GFP_ZERO | __GFP_GLOBAL_NONSENSITIVE,
node);
if (!event)
return ERR_PTR(-ENOMEM);
@@ -13378,7 +13382,8 @@ void __init perf_event_init(void)
ret = init_hw_breakpoint();
WARN(ret, "hw_breakpoint initialization failed with: %d", ret);

- perf_event_cache = KMEM_CACHE(perf_event, SLAB_PANIC);
+ perf_event_cache = KMEM_CACHE(perf_event,
+ SLAB_PANIC | SLAB_GLOBAL_NONSENSITIVE);

/*
* Build time assertion that we keep the data_head at the intended
diff --git a/kernel/exit.c b/kernel/exit.c
index f702a6a63686..ab2749cf6887 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -768,6 +768,8 @@ void __noreturn do_exit(long code)
profile_task_exit(tsk);
kcov_task_exit(tsk);

+ /* TODO: (oweisse) unmap the stack from ASI */
+
coredump_task_exit(tsk);
ptrace_event(PTRACE_EVENT_EXIT, code);

diff --git a/kernel/fork.c b/kernel/fork.c
index d7f55de00947..cb147a72372d 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -168,6 +168,8 @@ static struct kmem_cache *task_struct_cachep;

static inline struct task_struct *alloc_task_struct_node(int node)
{
+ /* TODO: Figure out how to allocate this properly in the ASI process map. This
+ * should be mapped in a __GFP_LOCAL_NONSENSITIVE slab. */
return kmem_cache_alloc_node(task_struct_cachep, GFP_KERNEL, node);
}

@@ -214,6 +216,7 @@ static int free_vm_stack_cache(unsigned int cpu)

static unsigned long *alloc_thread_stack_node(struct task_struct *tsk, int node)
{
+ /* TODO: (oweisse) Add annotation to map the stack into ASI */
#ifdef CONFIG_VMAP_STACK
void *stack;
int i;
@@ -242,9 +245,13 @@ static unsigned long *alloc_thread_stack_node(struct task_struct *tsk, int node)
* so memcg accounting is performed manually on assigning/releasing
* stacks to tasks. Drop __GFP_ACCOUNT.
*/
+ /* ASI: We intentionally don't pass VM_LOCAL_NONSENSITIVE nor
+ * __GFP_LOCAL_NONSENSITIVE since we don't have an mm yet. Later on we'll
+ * map the stack into the mm asi map. That being said, we do care about
+ * the stack being allocated below VMALLOC_LOCAL_NONSENSITIVE_END */
stack = __vmalloc_node_range(THREAD_SIZE, THREAD_ALIGN,
- VMALLOC_START, VMALLOC_END,
- THREADINFO_GFP & ~__GFP_ACCOUNT,
+ VMALLOC_START, VMALLOC_LOCAL_NONSENSITIVE_END,
+ (THREADINFO_GFP & (~__GFP_ACCOUNT)),
PAGE_KERNEL,
0, node, __builtin_return_address(0));

@@ -346,7 +353,8 @@ struct vm_area_struct *vm_area_alloc(struct mm_struct *mm)
{
struct vm_area_struct *vma;

- vma = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL);
+ vma = kmem_cache_alloc(vm_area_cachep,
+ GFP_KERNEL);
if (vma)
vma_init(vma, mm);
return vma;
@@ -683,6 +691,8 @@ static void check_mm(struct mm_struct *mm)
#endif
}

+/* TODO: (oweisse) ASI: we need to allocate mm such that it will only be visible
+ * within itself. */
#define allocate_mm() (kmem_cache_alloc(mm_cachep, GFP_KERNEL))
#define free_mm(mm) (kmem_cache_free(mm_cachep, (mm)))

@@ -823,9 +833,12 @@ void __init fork_init(void)

/* create a slab on which task_structs can be allocated */
task_struct_whitelist(&useroffset, &usersize);
+ /* TODO: (oweisse) for the time being this cache is shared among all tasks. We
+ * mark it SLAB_GLOBAL_NONSENSITIVE so task_struct can be accessed within ASI.
+ * A final secure solution should have this memory LOCAL, not GLOBAL. */
task_struct_cachep = kmem_cache_create_usercopy("task_struct",
arch_task_struct_size, align,
- SLAB_PANIC|SLAB_ACCOUNT,
+ SLAB_PANIC|SLAB_ACCOUNT|SLAB_GLOBAL_NONSENSITIVE,
useroffset, usersize, NULL);
#endif

@@ -1601,6 +1614,7 @@ static int copy_sighand(unsigned long clone_flags, struct task_struct *tsk)
refcount_inc(&current->sighand->count);
return 0;
}
+ /* TODO: (oweisse) replace with proper ASI allocation. */
sig = kmem_cache_alloc(sighand_cachep, GFP_KERNEL);
RCU_INIT_POINTER(tsk->sighand, sig);
if (!sig)
@@ -1649,6 +1663,8 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
if (clone_flags & CLONE_THREAD)
return 0;

+ /* TODO: (oweisse) figure out how to properly allocate this in ASI for local
+ * process */
sig = kmem_cache_zalloc(signal_cachep, GFP_KERNEL);
tsk->signal = sig;
if (!sig)
@@ -2923,7 +2939,8 @@ void __init proc_caches_init(void)
SLAB_ACCOUNT, sighand_ctor);
signal_cachep = kmem_cache_create("signal_cache",
sizeof(struct signal_struct), 0,
- SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT,
+ SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT|
+ SLAB_GLOBAL_NONSENSITIVE,
NULL);
files_cachep = kmem_cache_create("files_cache",
sizeof(struct files_struct), 0,
@@ -2941,13 +2958,18 @@ void __init proc_caches_init(void)
*/
mm_size = sizeof(struct mm_struct) + cpumask_size();

+ /* TODO: (oweisse) replace with proper ASI allocation. */
mm_cachep = kmem_cache_create_usercopy("mm_struct",
mm_size, ARCH_MIN_MMSTRUCT_ALIGN,
- SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT,
+ SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT
+ |SLAB_GLOBAL_NONSENSITIVE,
offsetof(struct mm_struct, saved_auxv),
sizeof_field(struct mm_struct, saved_auxv),
NULL);
- vm_area_cachep = KMEM_CACHE(vm_area_struct, SLAB_PANIC|SLAB_ACCOUNT);
+
+ /* TODO: (oweisse) replace with proper ASI allocation. */
+ vm_area_cachep = KMEM_CACHE(vm_area_struct,
+ SLAB_PANIC|SLAB_ACCOUNT|SLAB_LOCAL_NONSENSITIVE);
mmap_init();
nsproxy_cache_init();
}
diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
index 6833d8887181..553221503803 100644
--- a/kernel/rcu/srcutree.c
+++ b/kernel/rcu/srcutree.c
@@ -171,7 +171,8 @@ static int init_srcu_struct_fields(struct srcu_struct *ssp, bool is_static)
atomic_set(&ssp->srcu_barrier_cpu_cnt, 0);
INIT_DELAYED_WORK(&ssp->work, process_srcu);
if (!is_static)
- ssp->sda = alloc_percpu(struct srcu_data);
+ ssp->sda = alloc_percpu_gfp(struct srcu_data,
+ GFP_KERNEL | __GFP_GLOBAL_NONSENSITIVE);
if (!ssp->sda)
return -ENOMEM;
init_srcu_struct_nodes(ssp);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7c96f0001c7f..7515f0612f5c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -9329,7 +9329,8 @@ void __init sched_init(void)
#endif /* CONFIG_RT_GROUP_SCHED */

#ifdef CONFIG_CGROUP_SCHED
- task_group_cache = KMEM_CACHE(task_group, 0);
+ /* TODO: (oweisse) add SLAB_NONSENSITIVE */
+ task_group_cache = KMEM_CACHE(task_group, SLAB_GLOBAL_NONSENSITIVE);

list_add(&root_task_group.list, &task_groups);
INIT_LIST_HEAD(&root_task_group.children);
@@ -9741,7 +9742,8 @@ struct task_group *sched_create_group(struct task_group *parent)
{
struct task_group *tg;

- tg = kmem_cache_alloc(task_group_cache, GFP_KERNEL | __GFP_ZERO);
+ tg = kmem_cache_alloc(task_group_cache,
+ GFP_KERNEL | __GFP_ZERO | __GFP_GLOBAL_NONSENSITIVE);
if (!tg)
return ERR_PTR(-ENOMEM);

diff --git a/kernel/sched/cpuacct.c b/kernel/sched/cpuacct.c
index 6e3da149125c..e8b0b29b4d37 100644
--- a/kernel/sched/cpuacct.c
+++ b/kernel/sched/cpuacct.c
@@ -64,15 +64,17 @@ cpuacct_css_alloc(struct cgroup_subsys_state *parent_css)
if (!parent_css)
return &root_cpuacct.css;

- ca = kzalloc(sizeof(*ca), GFP_KERNEL);
+ ca = kzalloc(sizeof(*ca), GFP_KERNEL | __GFP_GLOBAL_NONSENSITIVE);
if (!ca)
goto out;

- ca->cpuusage = alloc_percpu(struct cpuacct_usage);
+ ca->cpuusage = alloc_percpu_gfp(struct cpuacct_usage,
+ GFP_KERNEL | __GFP_GLOBAL_NONSENSITIVE);
if (!ca->cpuusage)
goto out_free_ca;

- ca->cpustat = alloc_percpu(struct kernel_cpustat);
+ ca->cpustat = alloc_percpu_gfp(struct kernel_cpustat,
+ GFP_KERNEL | __GFP_GLOBAL_NONSENSITIVE);
if (!ca->cpustat)
goto out_free_cpuusage;

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index dc9b6133b059..97d70f1eb2c5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -11486,7 +11486,8 @@ int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)

for_each_possible_cpu(i) {
cfs_rq = kzalloc_node(sizeof(struct cfs_rq),
- GFP_KERNEL, cpu_to_node(i));
+ GFP_KERNEL | __GFP_GLOBAL_NONSENSITIVE,
+ cpu_to_node(i));
if (!cfs_rq)
goto err;

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 1dcea6a6133e..2ad96c78306c 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -569,7 +569,7 @@ static struct root_domain *alloc_rootdomain(void)
{
struct root_domain *rd;

- rd = kzalloc(sizeof(*rd), GFP_KERNEL);
+ rd = kzalloc(sizeof(*rd), GFP_KERNEL | __GFP_GLOBAL_NONSENSITIVE);
if (!rd)
return NULL;

@@ -2044,21 +2044,24 @@ static int __sdt_alloc(const struct cpumask *cpu_map)
struct sched_group_capacity *sgc;

sd = kzalloc_node(sizeof(struct sched_domain) + cpumask_size(),
- GFP_KERNEL, cpu_to_node(j));
+ GFP_KERNEL | __GFP_GLOBAL_NONSENSITIVE,
+ cpu_to_node(j));
if (!sd)
return -ENOMEM;

*per_cpu_ptr(sdd->sd, j) = sd;

sds = kzalloc_node(sizeof(struct sched_domain_shared),
- GFP_KERNEL, cpu_to_node(j));
+ GFP_KERNEL | __GFP_GLOBAL_NONSENSITIVE,
+ cpu_to_node(j));
if (!sds)
return -ENOMEM;

*per_cpu_ptr(sdd->sds, j) = sds;

sg = kzalloc_node(sizeof(struct sched_group) + cpumask_size(),
- GFP_KERNEL, cpu_to_node(j));
+ GFP_KERNEL | __GFP_GLOBAL_NONSENSITIVE,
+ cpu_to_node(j));
if (!sg)
return -ENOMEM;

@@ -2067,7 +2070,8 @@ static int __sdt_alloc(const struct cpumask *cpu_map)
*per_cpu_ptr(sdd->sg, j) = sg;

sgc = kzalloc_node(sizeof(struct sched_group_capacity) + cpumask_size(),
- GFP_KERNEL, cpu_to_node(j));
+ GFP_KERNEL | __GFP_GLOBAL_NONSENSITIVE,
+ cpu_to_node(j));
if (!sgc)
return -ENOMEM;

diff --git a/kernel/smp.c b/kernel/smp.c
index 3c1b328f0a09..db9ab5a58e2c 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -103,15 +103,18 @@ int smpcfd_prepare_cpu(unsigned int cpu)
{
struct call_function_data *cfd = &per_cpu(cfd_data, cpu);

- if (!zalloc_cpumask_var_node(&cfd->cpumask, GFP_KERNEL,
+ if (!zalloc_cpumask_var_node(&cfd->cpumask,
+ GFP_KERNEL | __GFP_GLOBAL_NONSENSITIVE,
cpu_to_node(cpu)))
return -ENOMEM;
- if (!zalloc_cpumask_var_node(&cfd->cpumask_ipi, GFP_KERNEL,
+ if (!zalloc_cpumask_var_node(&cfd->cpumask_ipi,
+ GFP_KERNEL | __GFP_GLOBAL_NONSENSITIVE,
cpu_to_node(cpu))) {
free_cpumask_var(cfd->cpumask);
return -ENOMEM;
}
- cfd->pcpu = alloc_percpu(struct cfd_percpu);
+ cfd->pcpu = alloc_percpu_gfp(struct cfd_percpu,
+ GFP_KERNEL | __GFP_GLOBAL_NONSENSITIVE);
if (!cfd->pcpu) {
free_cpumask_var(cfd->cpumask);
free_cpumask_var(cfd->cpumask_ipi);
@@ -179,10 +182,10 @@ static int __init csdlock_debug(char *str)
}
early_param("csdlock_debug", csdlock_debug);

-static DEFINE_PER_CPU(call_single_data_t *, cur_csd);
-static DEFINE_PER_CPU(smp_call_func_t, cur_csd_func);
-static DEFINE_PER_CPU(void *, cur_csd_info);
-static DEFINE_PER_CPU(struct cfd_seq_local, cfd_seq_local);
+static DEFINE_PER_CPU_ASI_NOT_SENSITIVE(call_single_data_t *, cur_csd);
+static DEFINE_PER_CPU_ASI_NOT_SENSITIVE(smp_call_func_t, cur_csd_func);
+static DEFINE_PER_CPU_ASI_NOT_SENSITIVE(void *, cur_csd_info);
+static DEFINE_PER_CPU_ASI_NOT_SENSITIVE(struct cfd_seq_local, cfd_seq_local);

#define CSD_LOCK_TIMEOUT (5ULL * NSEC_PER_SEC)
static atomic_t csd_bug_count = ATOMIC_INIT(0);
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 2699e9e562b1..9ad7d4569d4b 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -1539,7 +1539,8 @@ static int __rb_allocate_pages(struct ring_buffer_per_cpu *cpu_buffer,
* gracefully without invoking oom-killer and the system is not
* destabilized.
*/
- mflags = GFP_KERNEL | __GFP_RETRY_MAYFAIL;
+ /* TODO(oweisse): this is a hack to enable ASI tracing. */
+ mflags = GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_GLOBAL_NONSENSITIVE;

/*
* If a user thread allocates too much, and si_mem_available()
@@ -1718,7 +1719,7 @@ struct trace_buffer *__ring_buffer_alloc(unsigned long size, unsigned flags,

/* keep it in its own cache line */
buffer = kzalloc(ALIGN(sizeof(*buffer), cache_line_size()),
- GFP_KERNEL);
+ GFP_KERNEL | __GFP_GLOBAL_NONSENSITIVE);
if (!buffer)
return NULL;

diff --git a/kernel/tracepoint.c b/kernel/tracepoint.c
index 64ea283f2f86..0ae6c38ee121 100644
--- a/kernel/tracepoint.c
+++ b/kernel/tracepoint.c
@@ -107,7 +107,7 @@ static void tp_stub_func(void)
static inline void *allocate_probes(int count)
{
struct tp_probes *p = kmalloc(struct_size(p, probes, count),
- GFP_KERNEL);
+ GFP_KERNEL | __GFP_GLOBAL_NONSENSITIVE);
return p == NULL ? NULL : p->probes;
}

diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index b3afafe46fff..c7d3342a7b30 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -248,8 +248,7 @@ radix_tree_node_alloc(gfp_t gfp_mask, struct radix_tree_node *parent,
* cache first for the new node to get accounted to the memory
* cgroup.
*/
- ret = kmem_cache_alloc(radix_tree_node_cachep,
- gfp_mask | __GFP_NOWARN);
+ ret = kmem_cache_alloc(radix_tree_node_cachep, gfp_mask | __GFP_NOWARN);
if (ret)
goto out;

@@ -1597,9 +1596,10 @@ void __init radix_tree_init(void)
BUILD_BUG_ON(RADIX_TREE_MAX_TAGS + __GFP_BITS_SHIFT > 32);
BUILD_BUG_ON(ROOT_IS_IDR & ~GFP_ZONEMASK);
BUILD_BUG_ON(XA_CHUNK_SIZE > 255);
+ /*TODO: (oweisse) ASI add SLAB_NONSENSITIVE */
radix_tree_node_cachep = kmem_cache_create("radix_tree_node",
sizeof(struct radix_tree_node), 0,
- SLAB_PANIC | SLAB_RECLAIM_ACCOUNT,
+ SLAB_PANIC | SLAB_RECLAIM_ACCOUNT | SLAB_GLOBAL_NONSENSITIVE,
radix_tree_node_ctor);
ret = cpuhp_setup_state_nocalls(CPUHP_RADIX_DEAD, "lib/radix:dead",
NULL, radix_tree_cpu_dead);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index a66d6b222ecf..fbc42e96b157 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5143,20 +5143,21 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
size = sizeof(struct mem_cgroup);
size += nr_node_ids * sizeof(struct mem_cgroup_per_node *);

- memcg = kzalloc(size, GFP_KERNEL);
+ memcg = kzalloc(size, GFP_KERNEL | __GFP_GLOBAL_NONSENSITIVE);
if (!memcg)
return ERR_PTR(error);

memcg->id.id = idr_alloc(&mem_cgroup_idr, NULL,
1, MEM_CGROUP_ID_MAX,
- GFP_KERNEL);
+ GFP_KERNEL | __GFP_GLOBAL_NONSENSITIVE);
if (memcg->id.id < 0) {
error = memcg->id.id;
goto fail;
}

memcg->vmstats_percpu = alloc_percpu_gfp(struct memcg_vmstats_percpu,
- GFP_KERNEL_ACCOUNT);
+ GFP_KERNEL_ACCOUNT |
+ __GFP_GLOBAL_NONSENSITIVE);
if (!memcg->vmstats_percpu)
goto fail;

diff --git a/mm/util.c b/mm/util.c
index 741ba32a43ac..0a49e15a0765 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -196,7 +196,8 @@ void *vmemdup_user(const void __user *src, size_t len)
{
void *p;

- p = kvmalloc(len, GFP_USER);
+ /* TODO(oweisse): is this secure? */
+ p = kvmalloc(len, GFP_USER | __GFP_LOCAL_NONSENSITIVE);
if (!p)
return ERR_PTR(-ENOMEM);

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index a89866a926f6..659560f286b0 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -3309,7 +3309,8 @@ EXPORT_SYMBOL(vzalloc);
void *vmalloc_user(unsigned long size)
{
return __vmalloc_node_range(size, SHMLBA, VMALLOC_START, VMALLOC_END,
- GFP_KERNEL | __GFP_ZERO, PAGE_KERNEL,
+ GFP_KERNEL | __GFP_ZERO
+ | __GFP_LOCAL_NONSENSITIVE, PAGE_KERNEL,
VM_USERMAP, NUMA_NO_NODE,
__builtin_return_address(0));
}
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 909db87d7383..ce8c331386fb 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -404,7 +404,7 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
? skbuff_fclone_cache : skbuff_head_cache;

if (sk_memalloc_socks() && (flags & SKB_ALLOC_RX))
- gfp_mask |= __GFP_MEMALLOC;
+ gfp_mask |= __GFP_MEMALLOC | __GFP_GLOBAL_NONSENSITIVE;

/* Get the HEAD */
if ((flags & (SKB_ALLOC_FCLONE | SKB_ALLOC_NAPI)) == SKB_ALLOC_NAPI &&
diff --git a/net/core/sock.c b/net/core/sock.c
index 41e91d0f7061..6f6e0bd5ebf1 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -2704,7 +2704,7 @@ bool skb_page_frag_refill(unsigned int sz, struct page_frag *pfrag, gfp_t gfp)
/* Avoid direct reclaim but allow kswapd to wake */
pfrag->page = alloc_pages((gfp & ~__GFP_DIRECT_RECLAIM) |
__GFP_COMP | __GFP_NOWARN |
- __GFP_NORETRY,
+ __GFP_NORETRY | __GFP_GLOBAL_NONSENSITIVE,
SKB_FRAG_PAGE_ORDER);
if (likely(pfrag->page)) {
pfrag->size = PAGE_SIZE << SKB_FRAG_PAGE_ORDER;
diff --git a/virt/kvm/coalesced_mmio.c b/virt/kvm/coalesced_mmio.c
index 0be80c213f7f..5b87476566c4 100644
--- a/virt/kvm/coalesced_mmio.c
+++ b/virt/kvm/coalesced_mmio.c
@@ -111,7 +111,7 @@ int kvm_coalesced_mmio_init(struct kvm *kvm)
{
struct page *page;

- page = alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO);
+ page = alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO | __GFP_LOCAL_NONSENSITIVE);
if (!page)
return -ENOMEM;

diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
index 2ad013b8bde9..40acb841135c 100644
--- a/virt/kvm/eventfd.c
+++ b/virt/kvm/eventfd.c
@@ -306,7 +306,8 @@ kvm_irqfd_assign(struct kvm *kvm, struct kvm_irqfd *args)
if (!kvm_arch_irqfd_allowed(kvm, args))
return -EINVAL;

- irqfd = kzalloc(sizeof(*irqfd), GFP_KERNEL_ACCOUNT);
+ irqfd = kzalloc(sizeof(*irqfd),
+ GFP_KERNEL_ACCOUNT | __GFP_GLOBAL_NONSENSITIVE);
if (!irqfd)
return -ENOMEM;

@@ -813,7 +814,7 @@ static int kvm_assign_ioeventfd_idx(struct kvm *kvm,
if (IS_ERR(eventfd))
return PTR_ERR(eventfd);

- p = kzalloc(sizeof(*p), GFP_KERNEL_ACCOUNT);
+ p = kzalloc(sizeof(*p), GFP_KERNEL_ACCOUNT | __GFP_GLOBAL_NONSENSITIVE);
if (!p) {
ret = -ENOMEM;
goto fail;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 8d2d76de5bd0..587a75428da8 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -370,6 +370,9 @@ static inline void *mmu_memory_cache_alloc_obj(struct kvm_mmu_memory_cache *mc,
gfp_t gfp_flags)
{
gfp_flags |= mc->gfp_zero;
+#ifdef CONFIG_ADDRESS_SPACE_ISOLATION
+ gfp_flags |= mc->gfp_asi;
+#endif

if (mc->kmem_cache)
return kmem_cache_alloc(mc->kmem_cache, gfp_flags);
@@ -863,7 +866,8 @@ static struct kvm_memslots *kvm_alloc_memslots(void)
int i;
struct kvm_memslots *slots;

- slots = kvzalloc(sizeof(struct kvm_memslots), GFP_KERNEL_ACCOUNT);
+ slots = kvzalloc(sizeof(struct kvm_memslots),
+ GFP_KERNEL_ACCOUNT | __GFP_LOCAL_NONSENSITIVE);
if (!slots)
return NULL;

@@ -1529,7 +1533,7 @@ static struct kvm_memslots *kvm_dup_memslots(struct kvm_memslots *old,
else
new_size = kvm_memslots_size(old->used_slots);

- slots = kvzalloc(new_size, GFP_KERNEL_ACCOUNT);
+ slots = kvzalloc(new_size, GFP_KERNEL_ACCOUNT | __GFP_LOCAL_NONSENSITIVE);
if (likely(slots))
kvm_copy_memslots(slots, old);

@@ -3565,7 +3569,7 @@ static int kvm_vm_ioctl_create_vcpu(struct kvm *kvm, u32 id)
}

BUILD_BUG_ON(sizeof(struct kvm_run) > PAGE_SIZE);
- page = alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO);
+ page = alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO | __GFP_LOCAL_NONSENSITIVE);
if (!page) {
r = -ENOMEM;
goto vcpu_free;
@@ -4959,7 +4963,7 @@ int kvm_io_bus_register_dev(struct kvm *kvm, enum kvm_bus bus_idx, gpa_t addr,
return -ENOSPC;

new_bus = kmalloc(struct_size(bus, range, bus->dev_count + 1),
- GFP_KERNEL_ACCOUNT);
+ GFP_KERNEL_ACCOUNT | __GFP_LOCAL_NONSENSITIVE);
if (!new_bus)
return -ENOMEM;

--
2.35.1.473.g83b2b277ed-goog

2022-02-23 17:40:04

by Junaid Shahid

[permalink] [raw]
Subject: [RFC PATCH 19/47] mm: asi: Support for locally nonsensitive page allocations

A new GFP flag, __GFP_LOCAL_NONSENSITIVE, is added to allocate pages
that are considered non-sensitive within the context of the current
process, but sensitive in the context of other processes.

For these allocations, page->asi_mm is set to the current mm during
allocation, and it must hold the same value when the page is freed. It
may be overwritten and used for some other purpose in the meantime, as
long as it is restored before the page is freed.
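
To make that contract concrete, here is a minimal usage sketch. It is
illustrative only and not part of the patch; it assumes the caller runs
in a process whose mm has ASI enabled, since the allocator otherwise
clears __GFP_LOCAL_NONSENSITIVE:

	#include <linux/mm.h>

	static void local_nonsensitive_page_example(void)
	{
		struct page *page;
		struct mm_struct *owner;

		page = alloc_page(GFP_KERNEL | __GFP_LOCAL_NONSENSITIVE);
		if (!page)
			return;

		/* Set by the allocator to current->mm. */
		owner = page->asi_mm;

		/* The field may be reused while the page is live. */
		page->asi_mm = NULL;

		/* ... use the page ... */

		/*
		 * Restore before freeing: the free path uses asi_mm to
		 * find the ASI instance to unmap the page from.
		 */
		page->asi_mm = owner;
		__free_page(page);
	}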

Signed-off-by: Junaid Shahid <[email protected]>


---
include/linux/gfp.h | 5 +++-
include/linux/mm_types.h | 17 ++++++++++--
include/trace/events/mmflags.h | 1 +
mm/page_alloc.c | 47 ++++++++++++++++++++++++++++------
tools/perf/builtin-kmem.c | 1 +
5 files changed, 60 insertions(+), 11 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 07a99a463a34..2ab394adbda3 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -62,8 +62,10 @@ struct vm_area_struct;
#endif
#ifdef CONFIG_ADDRESS_SPACE_ISOLATION
#define ___GFP_GLOBAL_NONSENSITIVE 0x4000000u
+#define ___GFP_LOCAL_NONSENSITIVE 0x8000000u
#else
#define ___GFP_GLOBAL_NONSENSITIVE 0
+#define ___GFP_LOCAL_NONSENSITIVE 0
#endif
/* If the above are modified, __GFP_BITS_SHIFT may need updating */

@@ -255,9 +257,10 @@ struct vm_area_struct;

/* Allocate non-sensitive memory */
#define __GFP_GLOBAL_NONSENSITIVE ((__force gfp_t)___GFP_GLOBAL_NONSENSITIVE)
+#define __GFP_LOCAL_NONSENSITIVE ((__force gfp_t)___GFP_LOCAL_NONSENSITIVE)

/* Room for N __GFP_FOO bits */
-#define __GFP_BITS_SHIFT 27
+#define __GFP_BITS_SHIFT 28
#define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))

/**
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 8624d2783661..f9702d070975 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -193,8 +193,21 @@ struct page {
struct rcu_head rcu_head;

#ifdef CONFIG_ADDRESS_SPACE_ISOLATION
- /* Links the pages_to_free_async list */
- struct llist_node async_free_node;
+ struct {
+ /* Links the pages_to_free_async list */
+ struct llist_node async_free_node;
+
+ unsigned long _asi_pad_1;
+ unsigned long _asi_pad_2;
+
+ /*
+ * Upon allocation of a locally non-sensitive page, set
+ * to the allocating mm. Must be set to the same mm when
+ * the page is freed. May potentially be overwritten in
+ * the meantime, as long as it is restored before free.
+ */
+ struct mm_struct *asi_mm;
+ };
#endif
};

diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index 96e61d838bec..c00b8a4e1968 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -51,6 +51,7 @@
{(unsigned long)__GFP_KSWAPD_RECLAIM, "__GFP_KSWAPD_RECLAIM"},\
{(unsigned long)__GFP_ZEROTAGS, "__GFP_ZEROTAGS"}, \
{(unsigned long)__GFP_SKIP_KASAN_POISON,"__GFP_SKIP_KASAN_POISON"},\
+ {(unsigned long)__GFP_LOCAL_NONSENSITIVE, "__GFP_LOCAL_NONSENSITIVE"},\
{(unsigned long)__GFP_GLOBAL_NONSENSITIVE, "__GFP_GLOBAL_NONSENSITIVE"}\

#define show_gfp_flags(flags) \
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a4048fa1868a..01784bff2a80 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5231,19 +5231,33 @@ early_initcall(asi_page_alloc_init);
static int asi_map_alloced_pages(struct page *page, uint order, gfp_t gfp_mask)
{
uint i;
+ struct asi *asi;
+
+ VM_BUG_ON((gfp_mask & (__GFP_GLOBAL_NONSENSITIVE |
+ __GFP_LOCAL_NONSENSITIVE)) ==
+ (__GFP_GLOBAL_NONSENSITIVE | __GFP_LOCAL_NONSENSITIVE));

if (!static_asi_enabled())
return 0;

+ if (!(gfp_mask & (__GFP_GLOBAL_NONSENSITIVE |
+ __GFP_LOCAL_NONSENSITIVE)))
+ return 0;
+
if (gfp_mask & __GFP_GLOBAL_NONSENSITIVE) {
+ asi = ASI_GLOBAL_NONSENSITIVE;
for (i = 0; i < (1 << order); i++)
__SetPageGlobalNonSensitive(page + i);
-
- return asi_map_gfp(ASI_GLOBAL_NONSENSITIVE, page_to_virt(page),
- PAGE_SIZE * (1 << order), gfp_mask);
+ } else {
+ asi = ASI_LOCAL_NONSENSITIVE;
+ for (i = 0; i < (1 << order); i++) {
+ __SetPageLocalNonSensitive(page + i);
+ page[i].asi_mm = current->mm;
+ }
}

- return 0;
+ return asi_map_gfp(asi, page_to_virt(page),
+ PAGE_SIZE * (1 << order), gfp_mask);
}

static bool asi_unmap_freed_pages(struct page *page, unsigned int order)
@@ -5251,18 +5265,28 @@ static bool asi_unmap_freed_pages(struct page *page, unsigned int order)
void *va;
size_t len;
bool async_flush_needed;
+ struct asi *asi;
+
+ VM_BUG_ON(PageGlobalNonSensitive(page) && PageLocalNonSensitive(page));

if (!static_asi_enabled())
return true;

- if (!PageGlobalNonSensitive(page))
+ if (PageGlobalNonSensitive(page))
+ asi = ASI_GLOBAL_NONSENSITIVE;
+ else if (PageLocalNonSensitive(page))
+ asi = &page->asi_mm->asi[0];
+ else
return true;

+ /* Heuristic to check that page->asi_mm is actually an mm_struct */
+ VM_BUG_ON(PageLocalNonSensitive(page) && asi->mm != page->asi_mm);
+
va = page_to_virt(page);
len = PAGE_SIZE * (1 << order);
async_flush_needed = irqs_disabled() || in_interrupt();

- asi_unmap(ASI_GLOBAL_NONSENSITIVE, va, len, !async_flush_needed);
+ asi_unmap(asi, va, len, !async_flush_needed);

if (!async_flush_needed)
return true;
@@ -5476,8 +5500,15 @@ struct page *__alloc_pages(gfp_t gfp, unsigned int order, int preferred_nid,
return NULL;
}

- if (static_asi_enabled() && (gfp & __GFP_GLOBAL_NONSENSITIVE))
- gfp |= __GFP_ZERO;
+ if (static_asi_enabled()) {
+ if ((gfp & __GFP_LOCAL_NONSENSITIVE) &&
+ !mm_asi_enabled(current->mm))
+ gfp &= ~__GFP_LOCAL_NONSENSITIVE;
+
+ if (gfp & (__GFP_GLOBAL_NONSENSITIVE |
+ __GFP_LOCAL_NONSENSITIVE))
+ gfp |= __GFP_ZERO;
+ }

gfp &= gfp_allowed_mask;
/*
diff --git a/tools/perf/builtin-kmem.c b/tools/perf/builtin-kmem.c
index 5857953cd5c1..a2337fc3404f 100644
--- a/tools/perf/builtin-kmem.c
+++ b/tools/perf/builtin-kmem.c
@@ -661,6 +661,7 @@ static const struct {
{ "__GFP_DIRECT_RECLAIM", "DR" },
{ "__GFP_KSWAPD_RECLAIM", "KR" },
{ "__GFP_GLOBAL_NONSENSITIVE", "GNS" },
+ { "__GFP_LOCAL_NONSENSITIVE", "LNS" },
};

static size_t max_gfp_len;
--
2.35.1.473.g83b2b277ed-goog

2022-02-23 17:50:03

by Junaid Shahid

[permalink] [raw]
Subject: [RFC PATCH 09/47] mm: Add __PAGEFLAG_FALSE

__PAGEFLAG_FALSE is a non-atomic equivalent of PAGEFLAG_FALSE.
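
As a usage sketch only (the lower-case name and the PF_ANY policy below
are assumptions for illustration, not taken from this patch), a later
patch can then declare a page flag that compiles away when ASI is
disabled:

	#ifdef CONFIG_ADDRESS_SPACE_ISOLATION
	__PAGEFLAG(LocalNonSensitive, local_non_sensitive, PF_ANY)
	#else
	__PAGEFLAG_FALSE(LocalNonSensitive, local_non_sensitive)
	#endif

With the latter, __SetPageLocalNonSensitive() becomes a no-op and
PageLocalNonSensitive() evaluates to 0, so callers need no #ifdefs.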

Signed-off-by: Junaid Shahid <[email protected]>


---
include/linux/page-flags.h | 7 +++++++
1 file changed, 7 insertions(+)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index b5f14d581113..b90a17e9796d 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -390,6 +390,10 @@ static inline int Page##uname(const struct page *page) { return 0; }
static inline void folio_set_##lname(struct folio *folio) { } \
static inline void SetPage##uname(struct page *page) { }

+#define __SETPAGEFLAG_NOOP(uname, lname) \
+static inline void __folio_set_##lname(struct folio *folio) { } \
+static inline void __SetPage##uname(struct page *page) { }
+
#define CLEARPAGEFLAG_NOOP(uname, lname) \
static inline void folio_clear_##lname(struct folio *folio) { } \
static inline void ClearPage##uname(struct page *page) { }
@@ -411,6 +415,9 @@ static inline int TestClearPage##uname(struct page *page) { return 0; }
#define PAGEFLAG_FALSE(uname, lname) TESTPAGEFLAG_FALSE(uname, lname) \
SETPAGEFLAG_NOOP(uname, lname) CLEARPAGEFLAG_NOOP(uname, lname)

+#define __PAGEFLAG_FALSE(uname, lname) TESTPAGEFLAG_FALSE(uname, lname) \
+ __SETPAGEFLAG_NOOP(uname, lname) __CLEARPAGEFLAG_NOOP(uname, lname)
+
#define TESTSCFLAG_FALSE(uname, lname) \
TESTSETFLAG_FALSE(uname, lname) TESTCLEARFLAG_FALSE(uname, lname)

--
2.35.1.473.g83b2b277ed-goog

2022-02-23 19:14:55

by Junaid Shahid

[permalink] [raw]
Subject: [RFC PATCH 42/47] mm: asi: Annotation of PERCPU variables to be nonsensitive

From: Ofir Weisse <[email protected]>

The heart of ASI is to differentiate between sensitive and non-sensitive
data access. This commit marks certain static per-CPU variables as not
sensitive.

Some static variables are accessed frequently and would therefore cause
many ASI exits. The frequency of these accesses is monitored by tracing
asi_exits and analyzing the accessed addresses. Many of these variables
don't contain sensitive information and can therefore be mapped into the
global ASI region. This commit changes
DEFINE_PER_CPU --> DEFINE_PER_CPU_ASI_NOT_SENSITIVE for variables that
are frequently accessed yet not sensitive.
The end result is a very significant reduction in ASI exits on real
benchmarks.
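
The DEFINE_PER_CPU_ASI_NOT_SENSITIVE macro itself is introduced by an
earlier patch in the series. As a rough sketch of the idea (the section
name below is an assumption, not taken from this patch), it can be built
on the existing DECLARE_PER_CPU_SECTION/DEFINE_PER_CPU_SECTION helpers,
so that annotated variables land in a dedicated percpu section which ASI
maps into the global non-sensitive region:

	/* Sketch only; the section name is illustrative. */
	#define ASI_NOT_SENSITIVE_SECTION	"..asi_not_sensitive"

	#define DECLARE_PER_CPU_ASI_NOT_SENSITIVE(type, name)		\
		DECLARE_PER_CPU_SECTION(type, name, ASI_NOT_SENSITIVE_SECTION)

	#define DEFINE_PER_CPU_ASI_NOT_SENSITIVE(type, name)		\
		DEFINE_PER_CPU_SECTION(type, name, ASI_NOT_SENSITIVE_SECTION)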

Signed-off-by: Ofir Weisse <[email protected]>


---
arch/x86/events/core.c | 2 +-
arch/x86/events/intel/bts.c | 2 +-
arch/x86/events/perf_event.h | 2 +-
arch/x86/include/asm/asi.h | 2 +-
arch/x86/include/asm/current.h | 2 +-
arch/x86/include/asm/debugreg.h | 2 +-
arch/x86/include/asm/desc.h | 2 +-
arch/x86/include/asm/fpu/api.h | 2 +-
arch/x86/include/asm/hardirq.h | 2 +-
arch/x86/include/asm/hw_irq.h | 2 +-
arch/x86/include/asm/percpu.h | 2 +-
arch/x86/include/asm/preempt.h | 2 +-
arch/x86/include/asm/processor.h | 12 ++++++------
arch/x86/include/asm/smp.h | 2 +-
arch/x86/include/asm/tlbflush.h | 4 ++--
arch/x86/include/asm/topology.h | 2 +-
arch/x86/kernel/apic/apic.c | 2 +-
arch/x86/kernel/apic/x2apic_cluster.c | 6 +++---
arch/x86/kernel/cpu/common.c | 12 ++++++------
arch/x86/kernel/fpu/core.c | 2 +-
arch/x86/kernel/hw_breakpoint.c | 2 +-
arch/x86/kernel/irq.c | 2 +-
arch/x86/kernel/irqinit.c | 2 +-
arch/x86/kernel/nmi.c | 6 +++---
arch/x86/kernel/process.c | 4 ++--
arch/x86/kernel/setup_percpu.c | 4 ++--
arch/x86/kernel/smpboot.c | 3 ++-
arch/x86/kernel/tsc.c | 2 +-
arch/x86/kvm/x86.c | 2 +-
arch/x86/kvm/x86.h | 2 +-
arch/x86/mm/asi.c | 2 +-
arch/x86/mm/init.c | 2 +-
arch/x86/mm/tlb.c | 2 +-
include/asm-generic/irq_regs.h | 2 +-
include/linux/arch_topology.h | 2 +-
include/linux/hrtimer.h | 2 +-
include/linux/interrupt.h | 2 +-
include/linux/kernel_stat.h | 4 ++--
include/linux/prandom.h | 2 +-
kernel/events/core.c | 6 +++---
kernel/irq_work.c | 6 +++---
kernel/rcu/tree.c | 2 +-
kernel/sched/core.c | 6 +++---
kernel/sched/cpufreq.c | 3 ++-
kernel/sched/cputime.c | 2 +-
kernel/sched/sched.h | 21 +++++++++++----------
kernel/sched/topology.c | 14 +++++++-------
kernel/smp.c | 7 ++++---
kernel/softirq.c | 2 +-
kernel/time/hrtimer.c | 2 +-
kernel/time/tick-common.c | 2 +-
kernel/time/tick-internal.h | 4 ++--
kernel/time/tick-sched.c | 2 +-
kernel/time/timer.c | 2 +-
kernel/trace/trace.c | 2 +-
kernel/trace/trace_preemptirq.c | 2 +-
kernel/watchdog.c | 12 ++++++------
lib/irq_regs.c | 2 +-
lib/random32.c | 3 ++-
virt/kvm/kvm_main.c | 2 +-
60 files changed, 112 insertions(+), 107 deletions(-)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index db825bf053fd..2d9829d774d7 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -47,7 +47,7 @@
struct x86_pmu x86_pmu __asi_not_sensitive_readmostly;
static struct pmu pmu;

-DEFINE_PER_CPU(struct cpu_hw_events, cpu_hw_events) = {
+DEFINE_PER_CPU_ASI_NOT_SENSITIVE(struct cpu_hw_events, cpu_hw_events) = {
.enabled = 1,
.pmu = &pmu,
};
diff --git a/arch/x86/events/intel/bts.c b/arch/x86/events/intel/bts.c
index 974e917e65b2..06d9de514b0d 100644
--- a/arch/x86/events/intel/bts.c
+++ b/arch/x86/events/intel/bts.c
@@ -36,7 +36,7 @@ enum {
BTS_STATE_ACTIVE,
};

-static DEFINE_PER_CPU(struct bts_ctx, bts_ctx);
+static DEFINE_PER_CPU_ASI_NOT_SENSITIVE(struct bts_ctx, bts_ctx);

#define BTS_RECORD_SIZE 24
#define BTS_SAFETY_MARGIN 4080
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 27cca7fd6f17..9a4855e6ffa6 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -1036,7 +1036,7 @@ static inline bool x86_pmu_has_lbr_callstack(void)
x86_pmu.lbr_sel_map[PERF_SAMPLE_BRANCH_CALL_STACK_SHIFT] > 0;
}

-DECLARE_PER_CPU(struct cpu_hw_events, cpu_hw_events);
+DECLARE_PER_CPU_ASI_NOT_SENSITIVE(struct cpu_hw_events, cpu_hw_events);

int x86_perf_event_set_period(struct perf_event *event);

diff --git a/arch/x86/include/asm/asi.h b/arch/x86/include/asm/asi.h
index d43f6aadffee..6148e65fb0c2 100644
--- a/arch/x86/include/asm/asi.h
+++ b/arch/x86/include/asm/asi.h
@@ -52,7 +52,7 @@ struct asi_pgtbl_pool {
uint count;
};

-DECLARE_PER_CPU_ALIGNED(struct asi_state, asi_cpu_state);
+DECLARE_PER_CPU_ALIGNED_ASI_NOT_SENSITIVE(struct asi_state, asi_cpu_state);

extern pgd_t asi_global_nonsensitive_pgd[];

diff --git a/arch/x86/include/asm/current.h b/arch/x86/include/asm/current.h
index 3e204e6140b5..a4bcf1f305bf 100644
--- a/arch/x86/include/asm/current.h
+++ b/arch/x86/include/asm/current.h
@@ -8,7 +8,7 @@
#ifndef __ASSEMBLY__
struct task_struct;

-DECLARE_PER_CPU(struct task_struct *, current_task);
+DECLARE_PER_CPU_ASI_NOT_SENSITIVE(struct task_struct *, current_task);

static __always_inline struct task_struct *get_current(void)
{
diff --git a/arch/x86/include/asm/debugreg.h b/arch/x86/include/asm/debugreg.h
index cfdf307ddc01..fa67db27b098 100644
--- a/arch/x86/include/asm/debugreg.h
+++ b/arch/x86/include/asm/debugreg.h
@@ -6,7 +6,7 @@
#include <linux/bug.h>
#include <uapi/asm/debugreg.h>

-DECLARE_PER_CPU(unsigned long, cpu_dr7);
+DECLARE_PER_CPU_ASI_NOT_SENSITIVE(unsigned long, cpu_dr7);

#ifndef CONFIG_PARAVIRT_XXL
/*
diff --git a/arch/x86/include/asm/desc.h b/arch/x86/include/asm/desc.h
index ab97b22ac04a..7d9fff8c9543 100644
--- a/arch/x86/include/asm/desc.h
+++ b/arch/x86/include/asm/desc.h
@@ -298,7 +298,7 @@ static inline void native_load_tls(struct thread_struct *t, unsigned int cpu)
gdt[GDT_ENTRY_TLS_MIN + i] = t->tls_array[i];
}

-DECLARE_PER_CPU(bool, __tss_limit_invalid);
+DECLARE_PER_CPU_ASI_NOT_SENSITIVE(bool, __tss_limit_invalid);

static inline void force_reload_TR(void)
{
diff --git a/arch/x86/include/asm/fpu/api.h b/arch/x86/include/asm/fpu/api.h
index 6f5ca3c2ef4a..15abb1b05fbc 100644
--- a/arch/x86/include/asm/fpu/api.h
+++ b/arch/x86/include/asm/fpu/api.h
@@ -121,7 +121,7 @@ static inline void fpstate_init_soft(struct swregs_state *soft) {}
#endif

/* State tracking */
-DECLARE_PER_CPU(struct fpu *, fpu_fpregs_owner_ctx);
+DECLARE_PER_CPU_ASI_NOT_SENSITIVE(struct fpu *, fpu_fpregs_owner_ctx);

/* Process cleanup */
#ifdef CONFIG_X86_64
diff --git a/arch/x86/include/asm/hardirq.h b/arch/x86/include/asm/hardirq.h
index 275e7fd20310..2f70deca4a20 100644
--- a/arch/x86/include/asm/hardirq.h
+++ b/arch/x86/include/asm/hardirq.h
@@ -46,7 +46,7 @@ typedef struct {
#endif
} ____cacheline_aligned irq_cpustat_t;

-DECLARE_PER_CPU_SHARED_ALIGNED(irq_cpustat_t, irq_stat);
+DECLARE_PER_CPU_SHARED_ALIGNED_ASI_NOT_SENSITIVE(irq_cpustat_t, irq_stat);

#define __ARCH_IRQ_STAT

diff --git a/arch/x86/include/asm/hw_irq.h b/arch/x86/include/asm/hw_irq.h
index d465ece58151..e561abfce735 100644
--- a/arch/x86/include/asm/hw_irq.h
+++ b/arch/x86/include/asm/hw_irq.h
@@ -128,7 +128,7 @@ extern char spurious_entries_start[];
#define VECTOR_RETRIGGERED ((void *)-2L)

typedef struct irq_desc* vector_irq_t[NR_VECTORS];
-DECLARE_PER_CPU(vector_irq_t, vector_irq);
+DECLARE_PER_CPU_ASI_NOT_SENSITIVE(vector_irq_t, vector_irq);

#endif /* !ASSEMBLY_ */

diff --git a/arch/x86/include/asm/percpu.h b/arch/x86/include/asm/percpu.h
index a3c33b79fb86..f9486bbe8a76 100644
--- a/arch/x86/include/asm/percpu.h
+++ b/arch/x86/include/asm/percpu.h
@@ -390,7 +390,7 @@ static inline bool x86_this_cpu_variable_test_bit(int nr,
#include <asm-generic/percpu.h>

/* We can use this directly for local CPU (faster). */
-DECLARE_PER_CPU_READ_MOSTLY(unsigned long, this_cpu_off);
+DECLARE_PER_CPU_ASI_NOT_SENSITIVE(unsigned long, this_cpu_off);

#endif /* !__ASSEMBLY__ */

diff --git a/arch/x86/include/asm/preempt.h b/arch/x86/include/asm/preempt.h
index fe5efbcba824..204a8532b870 100644
--- a/arch/x86/include/asm/preempt.h
+++ b/arch/x86/include/asm/preempt.h
@@ -7,7 +7,7 @@
#include <linux/thread_info.h>
#include <linux/static_call_types.h>

-DECLARE_PER_CPU(int, __preempt_count);
+DECLARE_PER_CPU_ASI_NOT_SENSITIVE(int, __preempt_count);

/* We use the MSB mostly because its available */
#define PREEMPT_NEED_RESCHED 0x80000000
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 20116efd2756..63831f9a503b 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -417,14 +417,14 @@ struct tss_struct {
struct x86_io_bitmap io_bitmap;
} __aligned(PAGE_SIZE);

-DECLARE_PER_CPU_PAGE_ALIGNED(struct tss_struct, cpu_tss_rw);
+DECLARE_PER_CPU_PAGE_ALIGNED_ASI_NOT_SENSITIVE(struct tss_struct, cpu_tss_rw);

/* Per CPU interrupt stacks */
struct irq_stack {
char stack[IRQ_STACK_SIZE];
} __aligned(IRQ_STACK_SIZE);

-DECLARE_PER_CPU(unsigned long, cpu_current_top_of_stack);
+DECLARE_PER_CPU_ASI_NOT_SENSITIVE(unsigned long, cpu_current_top_of_stack);

#ifdef CONFIG_X86_64
struct fixed_percpu_data {
@@ -448,8 +448,8 @@ static inline unsigned long cpu_kernelmode_gs_base(int cpu)
return (unsigned long)per_cpu(fixed_percpu_data.gs_base, cpu);
}

-DECLARE_PER_CPU(void *, hardirq_stack_ptr);
-DECLARE_PER_CPU(bool, hardirq_stack_inuse);
+DECLARE_PER_CPU_ASI_NOT_SENSITIVE(void *, hardirq_stack_ptr);
+DECLARE_PER_CPU_ASI_NOT_SENSITIVE(bool, hardirq_stack_inuse);
extern asmlinkage void ignore_sysret(void);

/* Save actual FS/GS selectors and bases to current->thread */
@@ -458,8 +458,8 @@ void current_save_fsgs(void);
#ifdef CONFIG_STACKPROTECTOR
DECLARE_PER_CPU(unsigned long, __stack_chk_guard);
#endif
-DECLARE_PER_CPU(struct irq_stack *, hardirq_stack_ptr);
-DECLARE_PER_CPU(struct irq_stack *, softirq_stack_ptr);
+DECLARE_PER_CPU_ASI_NOT_SENSITIVE(struct irq_stack *, hardirq_stack_ptr);
+DECLARE_PER_CPU_ASI_NOT_SENSITIVE(struct irq_stack *, softirq_stack_ptr);
#endif /* !X86_64 */

struct perf_event;
diff --git a/arch/x86/include/asm/smp.h b/arch/x86/include/asm/smp.h
index 81a0211a372d..8d85a918532e 100644
--- a/arch/x86/include/asm/smp.h
+++ b/arch/x86/include/asm/smp.h
@@ -19,7 +19,7 @@ DECLARE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_llc_shared_map);
DECLARE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_l2c_shared_map);
DECLARE_PER_CPU_READ_MOSTLY(u16, cpu_llc_id);
DECLARE_PER_CPU_READ_MOSTLY(u16, cpu_l2c_id);
-DECLARE_PER_CPU_READ_MOSTLY(int, cpu_number);
+DECLARE_PER_CPU_ASI_NOT_SENSITIVE(int, cpu_number);

static inline struct cpumask *cpu_llc_shared_mask(int cpu)
{
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 7d04aa2a5f86..adcdeb58d817 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -151,7 +151,7 @@ struct tlb_state {
*/
struct tlb_context ctxs[TLB_NR_DYN_ASIDS];
};
-DECLARE_PER_CPU_ALIGNED(struct tlb_state, cpu_tlbstate);
+DECLARE_PER_CPU_SHARED_ALIGNED_ASI_NOT_SENSITIVE(struct tlb_state, cpu_tlbstate);

struct tlb_state_shared {
/*
@@ -171,7 +171,7 @@ struct tlb_state_shared {
*/
bool is_lazy;
};
-DECLARE_PER_CPU_SHARED_ALIGNED(struct tlb_state_shared, cpu_tlbstate_shared);
+DECLARE_PER_CPU_SHARED_ALIGNED_ASI_NOT_SENSITIVE(struct tlb_state_shared, cpu_tlbstate_shared);

bool nmi_uaccess_okay(void);
#define nmi_uaccess_okay nmi_uaccess_okay
diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
index cc164777e661..bff1a9123469 100644
--- a/arch/x86/include/asm/topology.h
+++ b/arch/x86/include/asm/topology.h
@@ -203,7 +203,7 @@ DECLARE_STATIC_KEY_FALSE(arch_scale_freq_key);

#define arch_scale_freq_invariant() static_branch_likely(&arch_scale_freq_key)

-DECLARE_PER_CPU(unsigned long, arch_freq_scale);
+DECLARE_PER_CPU_ASI_NOT_SENSITIVE(unsigned long, arch_freq_scale);

static inline long arch_scale_freq_capacity(int cpu)
{
diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index b70344bf6600..5fa0ce0ecfb3 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -548,7 +548,7 @@ static struct clock_event_device lapic_clockevent = {
.rating = 100,
.irq = -1,
};
-static DEFINE_PER_CPU(struct clock_event_device, lapic_events);
+static DEFINE_PER_CPU_ASI_NOT_SENSITIVE(struct clock_event_device, lapic_events);

static const struct x86_cpu_id deadline_match[] __initconst = {
X86_MATCH_INTEL_FAM6_MODEL_STEPPINGS(HASWELL_X, X86_STEPPINGS(0x2, 0x2), 0x3a), /* EP */
diff --git a/arch/x86/kernel/apic/x2apic_cluster.c b/arch/x86/kernel/apic/x2apic_cluster.c
index e696e22d0531..655fe820a240 100644
--- a/arch/x86/kernel/apic/x2apic_cluster.c
+++ b/arch/x86/kernel/apic/x2apic_cluster.c
@@ -20,10 +20,10 @@ struct cluster_mask {
* x86_cpu_to_logical_apicid for all online cpus in a sequential way.
* Using per cpu variable would cost one cache line per cpu.
*/
-static u32 *x86_cpu_to_logical_apicid __read_mostly;
+static u32 *x86_cpu_to_logical_apicid __asi_not_sensitive_readmostly;

-static DEFINE_PER_CPU(cpumask_var_t, ipi_mask);
-static DEFINE_PER_CPU_READ_MOSTLY(struct cluster_mask *, cluster_masks);
+static DEFINE_PER_CPU_ASI_NOT_SENSITIVE(cpumask_var_t, ipi_mask);
+static DEFINE_PER_CPU_ASI_NOT_SENSITIVE(struct cluster_mask *, cluster_masks);
static struct cluster_mask *cluster_hotplug_mask;

static int x2apic_acpi_madt_oem_check(char *oem_id, char *oem_table_id)
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 0083464de5e3..471b3a42db64 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1775,17 +1775,17 @@ EXPORT_PER_CPU_SYMBOL_GPL(fixed_percpu_data);
* The following percpu variables are hot. Align current_task to
* cacheline size such that they fall in the same cacheline.
*/
-DEFINE_PER_CPU(struct task_struct *, current_task) ____cacheline_aligned =
+DEFINE_PER_CPU_ASI_NOT_SENSITIVE(struct task_struct *, current_task) ____cacheline_aligned =
&init_task;
EXPORT_PER_CPU_SYMBOL(current_task);

-DEFINE_PER_CPU(void *, hardirq_stack_ptr);
-DEFINE_PER_CPU(bool, hardirq_stack_inuse);
+DEFINE_PER_CPU_ASI_NOT_SENSITIVE(void *, hardirq_stack_ptr);
+DEFINE_PER_CPU_ASI_NOT_SENSITIVE(bool, hardirq_stack_inuse);

-DEFINE_PER_CPU(int, __preempt_count) = INIT_PREEMPT_COUNT;
+DEFINE_PER_CPU_ASI_NOT_SENSITIVE(int, __preempt_count) = INIT_PREEMPT_COUNT;
EXPORT_PER_CPU_SYMBOL(__preempt_count);

-DEFINE_PER_CPU(unsigned long, cpu_current_top_of_stack) = TOP_OF_INIT_STACK;
+DEFINE_PER_CPU_ASI_NOT_SENSITIVE(unsigned long, cpu_current_top_of_stack) = TOP_OF_INIT_STACK;

/* May not be marked __init: used by software suspend */
void syscall_init(void)
@@ -1826,7 +1826,7 @@ void syscall_init(void)

#else /* CONFIG_X86_64 */

-DEFINE_PER_CPU(struct task_struct *, current_task) = &init_task;
+DEFINE_PER_CPU_ASI_NOT_SENSITIVE(struct task_struct *, current_task) = &init_task;
EXPORT_PER_CPU_SYMBOL(current_task);
DEFINE_PER_CPU(int, __preempt_count) = INIT_PREEMPT_COUNT;
EXPORT_PER_CPU_SYMBOL(__preempt_count);
diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index d7859573973d..b59317c5721f 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -57,7 +57,7 @@ static DEFINE_PER_CPU(bool, in_kernel_fpu);
/*
* Track which context is using the FPU on the CPU:
*/
-DEFINE_PER_CPU(struct fpu *, fpu_fpregs_owner_ctx);
+DEFINE_PER_CPU_ASI_NOT_SENSITIVE(struct fpu *, fpu_fpregs_owner_ctx);

struct kmem_cache *fpstate_cachep;

diff --git a/arch/x86/kernel/hw_breakpoint.c b/arch/x86/kernel/hw_breakpoint.c
index 668a4a6533d9..c2ceea8f6801 100644
--- a/arch/x86/kernel/hw_breakpoint.c
+++ b/arch/x86/kernel/hw_breakpoint.c
@@ -36,7 +36,7 @@
#include <asm/tlbflush.h>

/* Per cpu debug control register value */
-DEFINE_PER_CPU(unsigned long, cpu_dr7);
+DEFINE_PER_CPU_ASI_NOT_SENSITIVE(unsigned long, cpu_dr7);
EXPORT_PER_CPU_SYMBOL(cpu_dr7);

/* Per cpu debug address registers values */
diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
index 766ffe3ba313..5c5aa75050a5 100644
--- a/arch/x86/kernel/irq.c
+++ b/arch/x86/kernel/irq.c
@@ -26,7 +26,7 @@
#define CREATE_TRACE_POINTS
#include <asm/trace/irq_vectors.h>

-DEFINE_PER_CPU_SHARED_ALIGNED(irq_cpustat_t, irq_stat);
+DEFINE_PER_CPU_SHARED_ALIGNED_ASI_NOT_SENSITIVE(irq_cpustat_t, irq_stat);
EXPORT_PER_CPU_SYMBOL(irq_stat);

atomic_t irq_err_count;
diff --git a/arch/x86/kernel/irqinit.c b/arch/x86/kernel/irqinit.c
index beb1bada1b0a..d7893e040695 100644
--- a/arch/x86/kernel/irqinit.c
+++ b/arch/x86/kernel/irqinit.c
@@ -46,7 +46,7 @@
* (these are usually mapped into the 0x30-0xff vector range)
*/

-DEFINE_PER_CPU(vector_irq_t, vector_irq) = {
+DEFINE_PER_CPU_ASI_NOT_SENSITIVE(vector_irq_t, vector_irq) = {
[0 ... NR_VECTORS - 1] = VECTOR_UNUSED,
};

diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c
index 4bce802d25fb..ef95071228ca 100644
--- a/arch/x86/kernel/nmi.c
+++ b/arch/x86/kernel/nmi.c
@@ -469,9 +469,9 @@ enum nmi_states {
NMI_EXECUTING,
NMI_LATCHED,
};
-static DEFINE_PER_CPU(enum nmi_states, nmi_state);
-static DEFINE_PER_CPU(unsigned long, nmi_cr2);
-static DEFINE_PER_CPU(unsigned long, nmi_dr7);
+static DEFINE_PER_CPU_ASI_NOT_SENSITIVE(enum nmi_states, nmi_state);
+static DEFINE_PER_CPU_ASI_NOT_SENSITIVE(unsigned long, nmi_cr2);
+static DEFINE_PER_CPU_ASI_NOT_SENSITIVE(unsigned long, nmi_dr7);

DEFINE_IDTENTRY_RAW(exc_nmi)
{
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index f9bd1c3415d4..e4a32490dda0 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -56,7 +56,7 @@
* section. Since TSS's are completely CPU-local, we want them
* on exact cacheline boundaries, to eliminate cacheline ping-pong.
*/
-__visible DEFINE_PER_CPU_PAGE_ALIGNED(struct tss_struct, cpu_tss_rw) = {
+__visible DEFINE_PER_CPU_PAGE_ALIGNED_ASI_NOT_SENSITIVE(struct tss_struct, cpu_tss_rw) = {
.x86_tss = {
/*
* .sp0 is only used when entering ring 0 from a lower
@@ -77,7 +77,7 @@ __visible DEFINE_PER_CPU_PAGE_ALIGNED(struct tss_struct, cpu_tss_rw) = {
};
EXPORT_PER_CPU_SYMBOL(cpu_tss_rw);

-DEFINE_PER_CPU(bool, __tss_limit_invalid);
+DEFINE_PER_CPU_ASI_NOT_SENSITIVE(bool, __tss_limit_invalid);
EXPORT_PER_CPU_SYMBOL_GPL(__tss_limit_invalid);

void __init arch_task_cache_init(void)
diff --git a/arch/x86/kernel/setup_percpu.c b/arch/x86/kernel/setup_percpu.c
index 7b65275544b2..13c94a512b7e 100644
--- a/arch/x86/kernel/setup_percpu.c
+++ b/arch/x86/kernel/setup_percpu.c
@@ -23,7 +23,7 @@
#include <asm/cpu.h>
#include <asm/stackprotector.h>

-DEFINE_PER_CPU_READ_MOSTLY(int, cpu_number);
+DEFINE_PER_CPU_ASI_NOT_SENSITIVE(int, cpu_number);
EXPORT_PER_CPU_SYMBOL(cpu_number);

#ifdef CONFIG_X86_64
@@ -32,7 +32,7 @@ EXPORT_PER_CPU_SYMBOL(cpu_number);
#define BOOT_PERCPU_OFFSET 0
#endif

-DEFINE_PER_CPU_READ_MOSTLY(unsigned long, this_cpu_off) = BOOT_PERCPU_OFFSET;
+DEFINE_PER_CPU_ASI_NOT_SENSITIVE(unsigned long, this_cpu_off) = BOOT_PERCPU_OFFSET;
EXPORT_PER_CPU_SYMBOL(this_cpu_off);

unsigned long __per_cpu_offset[NR_CPUS] __ro_after_init = {
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 617012f4619f..0cfc4fdc2476 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -2224,7 +2224,8 @@ static void disable_freq_invariance_workfn(struct work_struct *work)
static DECLARE_WORK(disable_freq_invariance_work,
disable_freq_invariance_workfn);

-DEFINE_PER_CPU(unsigned long, arch_freq_scale) = SCHED_CAPACITY_SCALE;
+DEFINE_PER_CPU_ASI_NOT_SENSITIVE(unsigned long, arch_freq_scale) =
+ SCHED_CAPACITY_SCALE;

void arch_scale_freq_tick(void)
{
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index d7169da99b01..39c441409dec 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -59,7 +59,7 @@ struct cyc2ns {

}; /* fits one cacheline */

-static DEFINE_PER_CPU_ALIGNED(struct cyc2ns, cyc2ns);
+static DEFINE_PER_CPU_ALIGNED_ASI_NOT_SENSITIVE(struct cyc2ns, cyc2ns);

static int __init tsc_early_khz_setup(char *buf)
{
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 0df88eadab60..451872d178e5 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8523,7 +8523,7 @@ static void kvm_timer_init(void)
kvmclock_cpu_online, kvmclock_cpu_down_prep);
}

-DEFINE_PER_CPU(struct kvm_vcpu *, current_vcpu);
+DEFINE_PER_CPU_ASI_NOT_SENSITIVE(struct kvm_vcpu *, current_vcpu);
EXPORT_PER_CPU_SYMBOL_GPL(current_vcpu);

int kvm_is_in_guest(void)
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index 4abcd8d9836d..3d5da4daaf53 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -392,7 +392,7 @@ static inline bool kvm_cstate_in_guest(struct kvm *kvm)
return kvm->arch.cstate_in_guest;
}

-DECLARE_PER_CPU(struct kvm_vcpu *, current_vcpu);
+DECLARE_PER_CPU_ASI_NOT_SENSITIVE(struct kvm_vcpu *, current_vcpu);

static inline void kvm_before_interrupt(struct kvm_vcpu *vcpu)
{
diff --git a/arch/x86/mm/asi.c b/arch/x86/mm/asi.c
index fdc117929fc7..04628949e89d 100644
--- a/arch/x86/mm/asi.c
+++ b/arch/x86/mm/asi.c
@@ -20,7 +20,7 @@
static struct asi_class asi_class[ASI_MAX_NUM] __asi_not_sensitive;
static DEFINE_SPINLOCK(asi_class_lock __asi_not_sensitive);

-DEFINE_PER_CPU_ALIGNED(struct asi_state, asi_cpu_state);
+DEFINE_PER_CPU_ALIGNED_ASI_NOT_SENSITIVE(struct asi_state, asi_cpu_state);
EXPORT_PER_CPU_SYMBOL_GPL(asi_cpu_state);

__aligned(PAGE_SIZE) pgd_t asi_global_nonsensitive_pgd[PTRS_PER_PGD];
diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index dfff17363365..012631d03c4f 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -1025,7 +1025,7 @@ void __init zone_sizes_init(void)
free_area_init(max_zone_pfns);
}

-__visible DEFINE_PER_CPU_ALIGNED(struct tlb_state, cpu_tlbstate) = {
+__visible DEFINE_PER_CPU_SHARED_ALIGNED_ASI_NOT_SENSITIVE(struct tlb_state, cpu_tlbstate) = {
.loaded_mm = &init_mm,
.next_asid = 1,
.cr4 = ~0UL, /* fail hard if we screw up cr4 shadow initialization */
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index fcd2c8e92f83..36d41356ed04 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -972,7 +972,7 @@ static bool tlb_is_not_lazy(int cpu)

static DEFINE_PER_CPU(cpumask_t, flush_tlb_mask);

-DEFINE_PER_CPU_SHARED_ALIGNED(struct tlb_state_shared, cpu_tlbstate_shared);
+DEFINE_PER_CPU_SHARED_ALIGNED_ASI_NOT_SENSITIVE(struct tlb_state_shared, cpu_tlbstate_shared);
EXPORT_PER_CPU_SYMBOL(cpu_tlbstate_shared);

STATIC_NOPV void native_flush_tlb_multi(const struct cpumask *cpumask,
diff --git a/include/asm-generic/irq_regs.h b/include/asm-generic/irq_regs.h
index 2e7c6e89d42e..3225bdb2aefa 100644
--- a/include/asm-generic/irq_regs.h
+++ b/include/asm-generic/irq_regs.h
@@ -14,7 +14,7 @@
* Per-cpu current frame pointer - the location of the last exception frame on
* the stack
*/
-DECLARE_PER_CPU(struct pt_regs *, __irq_regs);
+DECLARE_PER_CPU_ASI_NOT_SENSITIVE(struct pt_regs *, __irq_regs);

static inline struct pt_regs *get_irq_regs(void)
{
diff --git a/include/linux/arch_topology.h b/include/linux/arch_topology.h
index b97cea83b25e..35fdf256777a 100644
--- a/include/linux/arch_topology.h
+++ b/include/linux/arch_topology.h
@@ -23,7 +23,7 @@ static inline unsigned long topology_get_cpu_scale(int cpu)

void topology_set_cpu_scale(unsigned int cpu, unsigned long capacity);

-DECLARE_PER_CPU(unsigned long, arch_freq_scale);
+DECLARE_PER_CPU_ASI_NOT_SENSITIVE(unsigned long, arch_freq_scale);

static inline unsigned long topology_get_freq_scale(int cpu)
{
diff --git a/include/linux/hrtimer.h b/include/linux/hrtimer.h
index 0ee140176f10..68b2f10aaa46 100644
--- a/include/linux/hrtimer.h
+++ b/include/linux/hrtimer.h
@@ -355,7 +355,7 @@ static inline void timerfd_clock_was_set(void) { }
static inline void timerfd_resume(void) { }
#endif

-DECLARE_PER_CPU(struct tick_device, tick_cpu_device);
+DECLARE_PER_CPU_ASI_NOT_SENSITIVE(struct tick_device, tick_cpu_device);

#ifdef CONFIG_PREEMPT_RT
void hrtimer_cancel_wait_running(const struct hrtimer *timer);
diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index 1f22a30c0963..6ae485d2ebb3 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -554,7 +554,7 @@ extern void __raise_softirq_irqoff(unsigned int nr);
extern void raise_softirq_irqoff(unsigned int nr);
extern void raise_softirq(unsigned int nr);

-DECLARE_PER_CPU(struct task_struct *, ksoftirqd);
+DECLARE_PER_CPU_ASI_NOT_SENSITIVE(struct task_struct *, ksoftirqd);

static inline struct task_struct *this_cpu_ksoftirqd(void)
{
diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h
index 69ae6b278464..89609dc5d30f 100644
--- a/include/linux/kernel_stat.h
+++ b/include/linux/kernel_stat.h
@@ -40,8 +40,8 @@ struct kernel_stat {
unsigned int softirqs[NR_SOFTIRQS];
};

-DECLARE_PER_CPU(struct kernel_stat, kstat);
-DECLARE_PER_CPU(struct kernel_cpustat, kernel_cpustat);
+DECLARE_PER_CPU_ASI_NOT_SENSITIVE(struct kernel_stat, kstat);
+DECLARE_PER_CPU_ASI_NOT_SENSITIVE(struct kernel_cpustat, kernel_cpustat);

/* Must have preemption disabled for this to be meaningful. */
#define kstat_this_cpu this_cpu_ptr(&kstat)
diff --git a/include/linux/prandom.h b/include/linux/prandom.h
index 056d31317e49..f02392ca6dc2 100644
--- a/include/linux/prandom.h
+++ b/include/linux/prandom.h
@@ -16,7 +16,7 @@ void prandom_bytes(void *buf, size_t nbytes);
void prandom_seed(u32 seed);
void prandom_reseed_late(void);

-DECLARE_PER_CPU(unsigned long, net_rand_noise);
+DECLARE_PER_CPU_ASI_NOT_SENSITIVE(unsigned long, net_rand_noise);

#define PRANDOM_ADD_NOISE(a, b, c, d) \
prandom_u32_add_noise((unsigned long)(a), (unsigned long)(b), \
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 6ea559b6e0f4..1914cc538cab 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -1207,7 +1207,7 @@ void perf_pmu_enable(struct pmu *pmu)
pmu->pmu_enable(pmu);
}

-static DEFINE_PER_CPU(struct list_head, active_ctx_list);
+static DEFINE_PER_CPU_ASI_NOT_SENSITIVE(struct list_head, active_ctx_list);

/*
* perf_event_ctx_activate(), perf_event_ctx_deactivate(), and
@@ -4007,8 +4007,8 @@ do { \
return div64_u64(dividend, divisor);
}

-static DEFINE_PER_CPU(int, perf_throttled_count);
-static DEFINE_PER_CPU(u64, perf_throttled_seq);
+static DEFINE_PER_CPU_ASI_NOT_SENSITIVE(int, perf_throttled_count);
+static DEFINE_PER_CPU_ASI_NOT_SENSITIVE(u64, perf_throttled_seq);

static void perf_adjust_period(struct perf_event *event, u64 nsec, u64 count, bool disable)
{
diff --git a/kernel/irq_work.c b/kernel/irq_work.c
index f7df715ec28e..10df3577c733 100644
--- a/kernel/irq_work.c
+++ b/kernel/irq_work.c
@@ -22,9 +22,9 @@
#include <asm/processor.h>
#include <linux/kasan.h>

-static DEFINE_PER_CPU(struct llist_head, raised_list);
-static DEFINE_PER_CPU(struct llist_head, lazy_list);
-static DEFINE_PER_CPU(struct task_struct *, irq_workd);
+static DEFINE_PER_CPU_ASI_NOT_SENSITIVE(struct llist_head, raised_list);
+static DEFINE_PER_CPU_ASI_NOT_SENSITIVE(struct llist_head, lazy_list);
+static DEFINE_PER_CPU_ASI_NOT_SENSITIVE(struct task_struct *, irq_workd);

static void wake_irq_workd(void)
{
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 284d2722cf0c..aee2b6994bc2 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -74,7 +74,7 @@

/* Data structures. */

-static DEFINE_PER_CPU_SHARED_ALIGNED(struct rcu_data, rcu_data) = {
+static DEFINE_PER_CPU_SHARED_ALIGNED_ASI_NOT_SENSITIVE(struct rcu_data, rcu_data) = {
.dynticks_nesting = 1,
.dynticks_nmi_nesting = DYNTICK_IRQ_NONIDLE,
.dynticks = ATOMIC_INIT(1),
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e1c08ff4130e..7c96f0001c7f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -43,7 +43,7 @@ EXPORT_TRACEPOINT_SYMBOL_GPL(sched_util_est_cfs_tp);
EXPORT_TRACEPOINT_SYMBOL_GPL(sched_util_est_se_tp);
EXPORT_TRACEPOINT_SYMBOL_GPL(sched_update_nr_running_tp);

-DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
+DEFINE_PER_CPU_SHARED_ALIGNED_ASI_NOT_SENSITIVE(struct rq, runqueues);

#ifdef CONFIG_SCHED_DEBUG
/*
@@ -5104,8 +5104,8 @@ void sched_exec(void)

#endif

-DEFINE_PER_CPU(struct kernel_stat, kstat);
-DEFINE_PER_CPU(struct kernel_cpustat, kernel_cpustat);
+DEFINE_PER_CPU_ASI_NOT_SENSITIVE(struct kernel_stat, kstat);
+DEFINE_PER_CPU_ASI_NOT_SENSITIVE(struct kernel_cpustat, kernel_cpustat);

EXPORT_PER_CPU_SYMBOL(kstat);
EXPORT_PER_CPU_SYMBOL(kernel_cpustat);
diff --git a/kernel/sched/cpufreq.c b/kernel/sched/cpufreq.c
index 7c2fe50fd76d..c55a47f8e963 100644
--- a/kernel/sched/cpufreq.c
+++ b/kernel/sched/cpufreq.c
@@ -9,7 +9,8 @@

#include "sched.h"

-DEFINE_PER_CPU(struct update_util_data __rcu *, cpufreq_update_util_data);
+DEFINE_PER_CPU_ASI_NOT_SENSITIVE(struct update_util_data __rcu *,
+ cpufreq_update_util_data);

/**
* cpufreq_add_update_util_hook - Populate the CPU's update_util_data pointer.
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 623b5feb142a..d3ad13308889 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -17,7 +17,7 @@
* task when irq is in progress while we read rq->clock. That is a worthy
* compromise in place of having locks on each irq in account_system_time.
*/
-DEFINE_PER_CPU(struct irqtime, cpu_irqtime);
+DEFINE_PER_CPU_ASI_NOT_SENSITIVE(struct irqtime, cpu_irqtime);

static int __asi_not_sensitive sched_clock_irqtime;

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 517c70a29a57..4188c1a570db 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1360,7 +1360,7 @@ static inline void update_idle_core(struct rq *rq)
static inline void update_idle_core(struct rq *rq) { }
#endif

-DECLARE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
+DECLARE_PER_CPU_SHARED_ALIGNED_ASI_NOT_SENSITIVE(struct rq, runqueues);

#define cpu_rq(cpu) (&per_cpu(runqueues, (cpu)))
#define this_rq() this_cpu_ptr(&runqueues)
@@ -1760,13 +1760,13 @@ static inline struct sched_domain *lowest_flag_domain(int cpu, int flag)
return sd;
}

-DECLARE_PER_CPU(struct sched_domain __rcu *, sd_llc);
-DECLARE_PER_CPU(int, sd_llc_size);
-DECLARE_PER_CPU(int, sd_llc_id);
-DECLARE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
-DECLARE_PER_CPU(struct sched_domain __rcu *, sd_numa);
-DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
-DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_cpucapacity);
+DECLARE_PER_CPU_ASI_NOT_SENSITIVE(struct sched_domain __rcu *, sd_llc);
+DECLARE_PER_CPU_ASI_NOT_SENSITIVE(int, sd_llc_size);
+DECLARE_PER_CPU_ASI_NOT_SENSITIVE(int, sd_llc_id);
+DECLARE_PER_CPU_ASI_NOT_SENSITIVE(struct sched_domain_shared __rcu *, sd_llc_shared);
+DECLARE_PER_CPU_ASI_NOT_SENSITIVE(struct sched_domain __rcu *, sd_numa);
+DECLARE_PER_CPU_ASI_NOT_SENSITIVE(struct sched_domain __rcu *, sd_asym_packing);
+DECLARE_PER_CPU_ASI_NOT_SENSITIVE(struct sched_domain __rcu *, sd_asym_cpucapacity);
extern struct static_key_false sched_asym_cpucapacity;

struct sched_group_capacity {
@@ -2753,7 +2753,7 @@ struct irqtime {
struct u64_stats_sync sync;
};

-DECLARE_PER_CPU(struct irqtime, cpu_irqtime);
+DECLARE_PER_CPU_ASI_NOT_SENSITIVE(struct irqtime, cpu_irqtime);

/*
* Returns the irqtime minus the softirq time computed by ksoftirqd.
@@ -2776,7 +2776,8 @@ static inline u64 irq_time_read(int cpu)
#endif /* CONFIG_IRQ_TIME_ACCOUNTING */

#ifdef CONFIG_CPU_FREQ
-DECLARE_PER_CPU(struct update_util_data __rcu *, cpufreq_update_util_data);
+DECLARE_PER_CPU_ASI_NOT_SENSITIVE(struct update_util_data __rcu *,
+ cpufreq_update_util_data);

/**
* cpufreq_update_util - Take a note about CPU utilization changes.
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index d201a7052a29..1dcea6a6133e 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -641,13 +641,13 @@ static void destroy_sched_domains(struct sched_domain *sd)
* the cpumask of the domain), this allows us to quickly tell if
* two CPUs are in the same cache domain, see cpus_share_cache().
*/
-DEFINE_PER_CPU(struct sched_domain __rcu *, sd_llc);
-DEFINE_PER_CPU(int, sd_llc_size);
-DEFINE_PER_CPU(int, sd_llc_id);
-DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
-DEFINE_PER_CPU(struct sched_domain __rcu *, sd_numa);
-DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
-DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_cpucapacity);
+DEFINE_PER_CPU_ASI_NOT_SENSITIVE(struct sched_domain __rcu *, sd_llc);
+DEFINE_PER_CPU_ASI_NOT_SENSITIVE(int, sd_llc_size);
+DEFINE_PER_CPU_ASI_NOT_SENSITIVE(int, sd_llc_id);
+DEFINE_PER_CPU_ASI_NOT_SENSITIVE(struct sched_domain_shared __rcu *, sd_llc_shared);
+DEFINE_PER_CPU_ASI_NOT_SENSITIVE(struct sched_domain __rcu *, sd_numa);
+DEFINE_PER_CPU_ASI_NOT_SENSITIVE(struct sched_domain __rcu *, sd_asym_packing);
+DEFINE_PER_CPU_ASI_NOT_SENSITIVE(struct sched_domain __rcu *, sd_asym_cpucapacity);
DEFINE_STATIC_KEY_FALSE(sched_asym_cpucapacity);

static void update_top_cache_domain(int cpu)
diff --git a/kernel/smp.c b/kernel/smp.c
index c51fd981a4a9..3c1b328f0a09 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -92,9 +92,10 @@ struct call_function_data {
cpumask_var_t cpumask_ipi;
};

-static DEFINE_PER_CPU_ALIGNED(struct call_function_data, cfd_data);
+static DEFINE_PER_CPU_ALIGNED_ASI_NOT_SENSITIVE(struct call_function_data, cfd_data);

-static DEFINE_PER_CPU_SHARED_ALIGNED(struct llist_head, call_single_queue);
+static DEFINE_PER_CPU_SHARED_ALIGNED_ASI_NOT_SENSITIVE(struct llist_head,
+ call_single_queue);

static void flush_smp_call_function_queue(bool warn_cpu_offline);

@@ -464,7 +465,7 @@ static __always_inline void csd_unlock(struct __call_single_data *csd)
smp_store_release(&csd->node.u_flags, 0);
}

-static DEFINE_PER_CPU_SHARED_ALIGNED(call_single_data_t, csd_data);
+static DEFINE_PER_CPU_SHARED_ALIGNED_ASI_NOT_SENSITIVE(call_single_data_t, csd_data);

void __smp_call_single_queue(int cpu, struct llist_node *node)
{
diff --git a/kernel/softirq.c b/kernel/softirq.c
index c462b7fab4d3..d2660a59feab 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -59,7 +59,7 @@ EXPORT_PER_CPU_SYMBOL(irq_stat);
static struct softirq_action softirq_vec[NR_SOFTIRQS]
__asi_not_sensitive ____cacheline_aligned;

-DEFINE_PER_CPU(struct task_struct *, ksoftirqd);
+DEFINE_PER_CPU_ASI_NOT_SENSITIVE(struct task_struct *, ksoftirqd);

const char * const softirq_to_name[NR_SOFTIRQS] = {
"HI", "TIMER", "NET_TX", "NET_RX", "BLOCK", "IRQ_POLL",
diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index 8b176f5c01f2..74cfc89a17c4 100644
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -65,7 +65,7 @@
* to reach a base using a clockid, hrtimer_clockid_to_base()
* is used to convert from clockid to the proper hrtimer_base_type.
*/
-DEFINE_PER_CPU(struct hrtimer_cpu_base, hrtimer_bases) =
+DEFINE_PER_CPU_ASI_NOT_SENSITIVE(struct hrtimer_cpu_base, hrtimer_bases) =
{
.lock = __RAW_SPIN_LOCK_UNLOCKED(hrtimer_bases.lock),
.clock_base =
diff --git a/kernel/time/tick-common.c b/kernel/time/tick-common.c
index cbe75661ca74..67180cb44394 100644
--- a/kernel/time/tick-common.c
+++ b/kernel/time/tick-common.c
@@ -25,7 +25,7 @@
/*
* Tick devices
*/
-DEFINE_PER_CPU(struct tick_device, tick_cpu_device);
+DEFINE_PER_CPU_ASI_NOT_SENSITIVE(struct tick_device, tick_cpu_device);
/*
* Tick next event: keeps track of the tick time. It's updated by the
* CPU which handles the tick and protected by jiffies_lock. There is
diff --git a/kernel/time/tick-internal.h b/kernel/time/tick-internal.h
index ed7e2a18060a..6961318d41b7 100644
--- a/kernel/time/tick-internal.h
+++ b/kernel/time/tick-internal.h
@@ -13,7 +13,7 @@
# define TICK_DO_TIMER_NONE -1
# define TICK_DO_TIMER_BOOT -2

-DECLARE_PER_CPU(struct tick_device, tick_cpu_device);
+DECLARE_PER_CPU_ASI_NOT_SENSITIVE(struct tick_device, tick_cpu_device);
extern ktime_t tick_next_period;
extern int tick_do_timer_cpu;

@@ -161,7 +161,7 @@ static inline void timers_update_nohz(void) { }
#define tick_nohz_active (0)
#endif

-DECLARE_PER_CPU(struct hrtimer_cpu_base, hrtimer_bases);
+DECLARE_PER_CPU_ASI_NOT_SENSITIVE(struct hrtimer_cpu_base, hrtimer_bases);

extern u64 get_next_timer_interrupt(unsigned long basej, u64 basem);
void timer_clear_idle(void);
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index c23fecbb68c2..afd393b85577 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -36,7 +36,7 @@
/*
* Per-CPU nohz control structure
*/
-static DEFINE_PER_CPU(struct tick_sched, tick_cpu_sched);
+static DEFINE_PER_CPU_ASI_NOT_SENSITIVE(struct tick_sched, tick_cpu_sched);

struct tick_sched *tick_get_tick_sched(int cpu)
{
diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index 0b09c99b568c..9567df187420 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -212,7 +212,7 @@ struct timer_base {
struct hlist_head vectors[WHEEL_SIZE];
} ____cacheline_aligned;

-static DEFINE_PER_CPU(struct timer_base, timer_bases[NR_BASES]);
+static DEFINE_PER_CPU_ASI_NOT_SENSITIVE(struct timer_base, timer_bases[NR_BASES]);

#ifdef CONFIG_NO_HZ_COMMON

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index eaec3814c5a4..b82f478caf4e 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -106,7 +106,7 @@ dummy_set_flag(struct trace_array *tr, u32 old_flags, u32 bit, int set)
* tracing is active, only save the comm when a trace event
* occurred.
*/
-static DEFINE_PER_CPU(bool, trace_taskinfo_save);
+static DEFINE_PER_CPU_ASI_NOT_SENSITIVE(bool, trace_taskinfo_save);

/*
* Kill all tracing for good (never come back).
diff --git a/kernel/trace/trace_preemptirq.c b/kernel/trace/trace_preemptirq.c
index f4938040c228..177de3501677 100644
--- a/kernel/trace/trace_preemptirq.c
+++ b/kernel/trace/trace_preemptirq.c
@@ -17,7 +17,7 @@

#ifdef CONFIG_TRACE_IRQFLAGS
/* Per-cpu variable to prevent redundant calls when IRQs already off */
-static DEFINE_PER_CPU(int, tracing_irq_cpu);
+static DEFINE_PER_CPU_ASI_NOT_SENSITIVE(int, tracing_irq_cpu);

/*
* Like trace_hardirqs_on() but without the lockdep invocation. This is
diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index ad912511a0c0..c2bf55024202 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -174,13 +174,13 @@ static bool softlockup_initialized __read_mostly;
static u64 __read_mostly sample_period;

/* Timestamp taken after the last successful reschedule. */
-static DEFINE_PER_CPU(unsigned long, watchdog_touch_ts);
+static DEFINE_PER_CPU_ASI_NOT_SENSITIVE(unsigned long, watchdog_touch_ts);
/* Timestamp of the last softlockup report. */
-static DEFINE_PER_CPU(unsigned long, watchdog_report_ts);
-static DEFINE_PER_CPU(struct hrtimer, watchdog_hrtimer);
-static DEFINE_PER_CPU(bool, softlockup_touch_sync);
-static DEFINE_PER_CPU(unsigned long, hrtimer_interrupts);
-static DEFINE_PER_CPU(unsigned long, hrtimer_interrupts_saved);
+static DEFINE_PER_CPU_ASI_NOT_SENSITIVE(unsigned long, watchdog_report_ts);
+static DEFINE_PER_CPU_ASI_NOT_SENSITIVE(struct hrtimer, watchdog_hrtimer);
+static DEFINE_PER_CPU_ASI_NOT_SENSITIVE(bool, softlockup_touch_sync);
+static DEFINE_PER_CPU_ASI_NOT_SENSITIVE(unsigned long, hrtimer_interrupts);
+static DEFINE_PER_CPU_ASI_NOT_SENSITIVE(unsigned long, hrtimer_interrupts_saved);
static unsigned long soft_lockup_nmi_warn;

static int __init nowatchdog_setup(char *str)
diff --git a/lib/irq_regs.c b/lib/irq_regs.c
index 0d545a93070e..8b3c6be06a7a 100644
--- a/lib/irq_regs.c
+++ b/lib/irq_regs.c
@@ -9,6 +9,6 @@
#include <asm/irq_regs.h>

#ifndef ARCH_HAS_OWN_IRQ_REGS
-DEFINE_PER_CPU(struct pt_regs *, __irq_regs);
+DEFINE_PER_CPU_ASI_NOT_SENSITIVE(struct pt_regs *, __irq_regs);
EXPORT_PER_CPU_SYMBOL(__irq_regs);
#endif
diff --git a/lib/random32.c b/lib/random32.c
index a57a0e18819d..e4c1cb1a70b4 100644
--- a/lib/random32.c
+++ b/lib/random32.c
@@ -339,7 +339,8 @@ struct siprand_state {
};

static DEFINE_PER_CPU(struct siprand_state, net_rand_state) __latent_entropy;
-DEFINE_PER_CPU(unsigned long, net_rand_noise);
+/* TODO(oweisse): Is this entropy sensitive?? */
+DEFINE_PER_CPU_ASI_NOT_SENSITIVE(unsigned long, net_rand_noise);
EXPORT_PER_CPU_SYMBOL(net_rand_noise);

/*
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 0af973b950c2..8d2d76de5bd0 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -110,7 +110,7 @@ static atomic_t hardware_enable_failed;
static struct kmem_cache *kvm_vcpu_cache;

static __read_mostly struct preempt_ops kvm_preempt_ops;
-static DEFINE_PER_CPU(struct kvm_vcpu *, kvm_running_vcpu);
+static DEFINE_PER_CPU_ASI_NOT_SENSITIVE(struct kvm_vcpu *, kvm_running_vcpu);

struct dentry *kvm_debugfs_dir;
EXPORT_SYMBOL_GPL(kvm_debugfs_dir);
--
2.35.1.473.g83b2b277ed-goog

2022-02-23 19:16:15

by Junaid Shahid

[permalink] [raw]
Subject: [RFC PATCH 16/47] mm: asi: Support for mapping non-sensitive pcpu chunks

This adds support for mapping and unmapping dynamic percpu chunks as
globally non-sensitive. A later patch will modify the percpu allocator
to use this for dynamically allocating non-sensitive percpu memory.
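For illustration only (not part of the diff below), once the later allocator
patch wires __GFP_GLOBAL_NONSENSITIVE through pcpu_alloc(), a caller could
request a globally non-sensitive percpu allocation roughly as follows; the
"foo_stats" names are placeholders:

#include <linux/init.h>
#include <linux/types.h>
#include <linux/percpu.h>
#include <linux/gfp.h>
#include <linux/errno.h>

struct foo_stats {
	u64 vm_exits;
};

static struct foo_stats __percpu *foo_stats;

static int __init foo_stats_init(void)
{
	/*
	 * The chunk backing this allocation is created with
	 * VM_GLOBAL_NONSENSITIVE, and its pages are asi_map()ed into the
	 * ASI_GLOBAL_NONSENSITIVE page tables by pcpu_map_pages().
	 */
	foo_stats = alloc_percpu_gfp(struct foo_stats,
				     GFP_KERNEL | __GFP_GLOBAL_NONSENSITIVE);
	return foo_stats ? 0 : -ENOMEM;
}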

Signed-off-by: Junaid Shahid <[email protected]>


---
include/linux/vmalloc.h | 4 ++--
mm/percpu-vm.c | 51 +++++++++++++++++++++++++++++++++--------
mm/vmalloc.c | 17 ++++++++++----
security/Kconfig | 2 +-
4 files changed, 58 insertions(+), 16 deletions(-)

diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index c7c66decda3e..5f85690f27b6 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -260,14 +260,14 @@ extern __init void vm_area_register_early(struct vm_struct *vm, size_t align);
# ifdef CONFIG_MMU
struct vm_struct **pcpu_get_vm_areas(const unsigned long *offsets,
const size_t *sizes, int nr_vms,
- size_t align);
+ size_t align, ulong flags);

void pcpu_free_vm_areas(struct vm_struct **vms, int nr_vms);
# else
static inline struct vm_struct **
pcpu_get_vm_areas(const unsigned long *offsets,
const size_t *sizes, int nr_vms,
- size_t align)
+ size_t align, ulong flags)
{
return NULL;
}
diff --git a/mm/percpu-vm.c b/mm/percpu-vm.c
index 2054c9213c43..5579a96ad782 100644
--- a/mm/percpu-vm.c
+++ b/mm/percpu-vm.c
@@ -153,8 +153,12 @@ static void __pcpu_unmap_pages(unsigned long addr, int nr_pages)
static void pcpu_unmap_pages(struct pcpu_chunk *chunk,
struct page **pages, int page_start, int page_end)
{
+ struct vm_struct **vms = (struct vm_struct **)chunk->data;
unsigned int cpu;
int i;
+ ulong addr, nr_pages;
+
+ nr_pages = page_end - page_start;

for_each_possible_cpu(cpu) {
for (i = page_start; i < page_end; i++) {
@@ -164,8 +168,14 @@ static void pcpu_unmap_pages(struct pcpu_chunk *chunk,
WARN_ON(!page);
pages[pcpu_page_idx(cpu, i)] = page;
}
- __pcpu_unmap_pages(pcpu_chunk_addr(chunk, cpu, page_start),
- page_end - page_start);
+ addr = pcpu_chunk_addr(chunk, cpu, page_start);
+
+ /* TODO: We should batch the TLB flushes */
+ if (vms[0]->flags & VM_GLOBAL_NONSENSITIVE)
+ asi_unmap(ASI_GLOBAL_NONSENSITIVE, (void *)addr,
+ nr_pages * PAGE_SIZE, true);
+
+ __pcpu_unmap_pages(addr, nr_pages);
}
}

@@ -212,18 +222,30 @@ static int __pcpu_map_pages(unsigned long addr, struct page **pages,
* reverse lookup (addr -> chunk).
*/
static int pcpu_map_pages(struct pcpu_chunk *chunk,
- struct page **pages, int page_start, int page_end)
+ struct page **pages, int page_start, int page_end,
+ gfp_t gfp)
{
unsigned int cpu, tcpu;
int i, err;
+ ulong addr, nr_pages;
+
+ nr_pages = page_end - page_start;

for_each_possible_cpu(cpu) {
- err = __pcpu_map_pages(pcpu_chunk_addr(chunk, cpu, page_start),
+ addr = pcpu_chunk_addr(chunk, cpu, page_start);
+ err = __pcpu_map_pages(addr,
&pages[pcpu_page_idx(cpu, page_start)],
- page_end - page_start);
+ nr_pages);
if (err < 0)
goto err;

+ if (gfp & __GFP_GLOBAL_NONSENSITIVE) {
+ err = asi_map(ASI_GLOBAL_NONSENSITIVE, (void *)addr,
+ nr_pages * PAGE_SIZE);
+ if (err)
+ goto err;
+ }
+
for (i = page_start; i < page_end; i++)
pcpu_set_page_chunk(pages[pcpu_page_idx(cpu, i)],
chunk);
@@ -231,10 +253,15 @@ static int pcpu_map_pages(struct pcpu_chunk *chunk,
return 0;
err:
for_each_possible_cpu(tcpu) {
+ addr = pcpu_chunk_addr(chunk, tcpu, page_start);
+
+ if (gfp & __GFP_GLOBAL_NONSENSITIVE)
+ asi_unmap(ASI_GLOBAL_NONSENSITIVE, (void *)addr,
+ nr_pages * PAGE_SIZE, false);
+
+ __pcpu_unmap_pages(addr, nr_pages);
if (tcpu == cpu)
break;
- __pcpu_unmap_pages(pcpu_chunk_addr(chunk, tcpu, page_start),
- page_end - page_start);
}
pcpu_post_unmap_tlb_flush(chunk, page_start, page_end);
return err;
@@ -285,7 +312,7 @@ static int pcpu_populate_chunk(struct pcpu_chunk *chunk,
if (pcpu_alloc_pages(chunk, pages, page_start, page_end, gfp))
return -ENOMEM;

- if (pcpu_map_pages(chunk, pages, page_start, page_end)) {
+ if (pcpu_map_pages(chunk, pages, page_start, page_end, gfp)) {
pcpu_free_pages(chunk, pages, page_start, page_end);
return -ENOMEM;
}
@@ -334,13 +361,19 @@ static struct pcpu_chunk *pcpu_create_chunk(gfp_t gfp)
{
struct pcpu_chunk *chunk;
struct vm_struct **vms;
+ ulong vm_flags = 0;
+
+ if (static_asi_enabled() && (gfp & __GFP_GLOBAL_NONSENSITIVE))
+ vm_flags = VM_GLOBAL_NONSENSITIVE;
+
+ gfp &= ~__GFP_GLOBAL_NONSENSITIVE;

chunk = pcpu_alloc_chunk(gfp);
if (!chunk)
return NULL;

vms = pcpu_get_vm_areas(pcpu_group_offsets, pcpu_group_sizes,
- pcpu_nr_groups, pcpu_atom_size);
+ pcpu_nr_groups, pcpu_atom_size, vm_flags);
if (!vms) {
pcpu_free_chunk(chunk);
return NULL;
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index ba588a37ee75..f13bfe7e896b 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -3664,10 +3664,10 @@ pvm_determine_end_from_reverse(struct vmap_area **va, unsigned long align)
*/
struct vm_struct **pcpu_get_vm_areas(const unsigned long *offsets,
const size_t *sizes, int nr_vms,
- size_t align)
+ size_t align, ulong flags)
{
- const unsigned long vmalloc_start = ALIGN(VMALLOC_START, align);
- const unsigned long vmalloc_end = VMALLOC_END & ~(align - 1);
+ unsigned long vmalloc_start = VMALLOC_START;
+ unsigned long vmalloc_end = VMALLOC_END;
struct vmap_area **vas, *va;
struct vm_struct **vms;
int area, area2, last_area, term_area;
@@ -3677,6 +3677,15 @@ struct vm_struct **pcpu_get_vm_areas(const unsigned long *offsets,

/* verify parameters and allocate data structures */
BUG_ON(offset_in_page(align) || !is_power_of_2(align));
+
+ if (static_asi_enabled() && (flags & VM_GLOBAL_NONSENSITIVE)) {
+ vmalloc_start = VMALLOC_GLOBAL_NONSENSITIVE_START;
+ vmalloc_end = VMALLOC_GLOBAL_NONSENSITIVE_END;
+ }
+
+ vmalloc_start = ALIGN(vmalloc_start, align);
+ vmalloc_end = vmalloc_end & ~(align - 1);
+
for (last_area = 0, area = 0; area < nr_vms; area++) {
start = offsets[area];
end = start + sizes[area];
@@ -3815,7 +3824,7 @@ struct vm_struct **pcpu_get_vm_areas(const unsigned long *offsets,
for (area = 0; area < nr_vms; area++) {
insert_vmap_area(vas[area], &vmap_area_root, &vmap_area_list);

- setup_vmalloc_vm_locked(vms[area], vas[area], VM_ALLOC,
+ setup_vmalloc_vm_locked(vms[area], vas[area], flags | VM_ALLOC,
pcpu_get_vm_areas);
}
spin_unlock(&vmap_area_lock);
diff --git a/security/Kconfig b/security/Kconfig
index 0a3e49d6a331..e89c2658e6cf 100644
--- a/security/Kconfig
+++ b/security/Kconfig
@@ -68,7 +68,7 @@ config PAGE_TABLE_ISOLATION
config ADDRESS_SPACE_ISOLATION
bool "Allow code to run with a reduced kernel address space"
default n
- depends on X86_64 && !UML && SLAB
+ depends on X86_64 && !UML && SLAB && !NEED_PER_CPU_KM
depends on !PARAVIRT
help
This feature provides the ability to run some kernel code
--
2.35.1.473.g83b2b277ed-goog

2022-02-23 19:57:59

by Junaid Shahid

[permalink] [raw]
Subject: [RFC PATCH 18/47] mm: asi: Support for pre-ASI-init local non-sensitive allocations

Local non-sensitive allocations can be made before an actual ASI
instance is initialized. To support this, a process-wide pseudo-PGD
is created, which contains mappings for all locally non-sensitive
allocations. Memory can be mapped into this pseudo-PGD by passing
ASI_LOCAL_NONSENSITIVE to asi_map(). When an ASI instance is later
initialized in that process, these mappings are carried over by
copying all PGD entries in the local non-sensitive range from the
pseudo-PGD to the ASI PGD. In addition, the page fault handler copies
any new PGD entries that get added after the ASI instance has been
initialized.
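For illustration (not part of the diff), a caller could use the pseudo-PGD
roughly as follows; map_buf_local_nonsensitive() and its arguments are
placeholder names:

#include <linux/types.h>
#include <asm/asi.h>

static int map_buf_local_nonsensitive(void *buf, size_t len,
				      struct asi *target_asi)
{
	int err;

	/* Installs the mapping in the process-wide pseudo-PGD. */
	err = asi_map(ASI_LOCAL_NONSENSITIVE, buf, len);
	if (err)
		return err;

	/*
	 * PGD entries added after asi_init() are normally copied lazily
	 * by the page fault handler via asi_do_lazy_map(). For mappings
	 * that must exist before asi_enter() (e.g. thread stacks), copy
	 * them eagerly instead.
	 */
	if (target_asi)
		asi_sync_mapping(target_asi, buf, len);

	return 0;
}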

Signed-off-by: Junaid Shahid <[email protected]>


---
arch/x86/include/asm/asi.h | 6 +++-
arch/x86/mm/asi.c | 74 +++++++++++++++++++++++++++++++++++++-
arch/x86/mm/fault.c | 7 ++++
include/asm-generic/asi.h | 12 ++++++-
kernel/fork.c | 8 +++--
5 files changed, 102 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/asi.h b/arch/x86/include/asm/asi.h
index f69e1f2f09a4..f11010c0334b 100644
--- a/arch/x86/include/asm/asi.h
+++ b/arch/x86/include/asm/asi.h
@@ -16,6 +16,7 @@
#define ASI_MAX_NUM (1 << ASI_MAX_NUM_ORDER)

#define ASI_GLOBAL_NONSENSITIVE (&init_mm.asi[0])
+#define ASI_LOCAL_NONSENSITIVE (&current->mm->asi[0])

struct asi_state {
struct asi *curr_asi;
@@ -45,7 +46,8 @@ DECLARE_PER_CPU_ALIGNED(struct asi_state, asi_cpu_state);

extern pgd_t asi_global_nonsensitive_pgd[];

-void asi_init_mm_state(struct mm_struct *mm);
+int asi_init_mm_state(struct mm_struct *mm);
+void asi_free_mm_state(struct mm_struct *mm);

int asi_register_class(const char *name, uint flags,
const struct asi_hooks *ops);
@@ -61,6 +63,8 @@ int asi_map_gfp(struct asi *asi, void *addr, size_t len, gfp_t gfp_flags);
int asi_map(struct asi *asi, void *addr, size_t len);
void asi_unmap(struct asi *asi, void *addr, size_t len, bool flush_tlb);
void asi_flush_tlb_range(struct asi *asi, void *addr, size_t len);
+void asi_sync_mapping(struct asi *asi, void *addr, size_t len);
+void asi_do_lazy_map(struct asi *asi, size_t addr);

static inline void asi_init_thread_state(struct thread_struct *thread)
{
diff --git a/arch/x86/mm/asi.c b/arch/x86/mm/asi.c
index 38eaa650bac1..3ba0971a318d 100644
--- a/arch/x86/mm/asi.c
+++ b/arch/x86/mm/asi.c
@@ -73,6 +73,17 @@ void asi_unregister_class(int index)
}
EXPORT_SYMBOL_GPL(asi_unregister_class);

+static void asi_clone_pgd(pgd_t *dst_table, pgd_t *src_table, size_t addr)
+{
+ pgd_t *src = pgd_offset_pgd(src_table, addr);
+ pgd_t *dst = pgd_offset_pgd(dst_table, addr);
+
+ if (!pgd_val(*dst))
+ set_pgd(dst, *src);
+ else
+ VM_BUG_ON(pgd_val(*dst) != pgd_val(*src));
+}
+
#ifndef mm_inc_nr_p4ds
#define mm_inc_nr_p4ds(mm) do {} while (false)
#endif
@@ -291,6 +302,11 @@ int asi_init(struct mm_struct *mm, int asi_index, struct asi **out_asi)
for (i = KERNEL_PGD_BOUNDARY; i < pgd_index(ASI_LOCAL_MAP); i++)
set_pgd(asi->pgd + i, asi_global_nonsensitive_pgd[i]);

+ for (i = pgd_index(ASI_LOCAL_MAP);
+ i <= pgd_index(ASI_LOCAL_MAP + PFN_PHYS(max_possible_pfn));
+ i++)
+ set_pgd(asi->pgd + i, mm->asi[0].pgd[i]);
+
for (i = pgd_index(VMALLOC_GLOBAL_NONSENSITIVE_START);
i < PTRS_PER_PGD; i++)
set_pgd(asi->pgd + i, asi_global_nonsensitive_pgd[i]);
@@ -379,7 +395,7 @@ void asi_exit(void)
}
EXPORT_SYMBOL_GPL(asi_exit);

-void asi_init_mm_state(struct mm_struct *mm)
+int asi_init_mm_state(struct mm_struct *mm)
{
struct mem_cgroup *memcg = get_mem_cgroup_from_mm(mm);

@@ -395,6 +411,28 @@ void asi_init_mm_state(struct mm_struct *mm)
memcg->use_asi;
css_put(&memcg->css);
}
+
+ if (!mm->asi_enabled)
+ return 0;
+
+ mm->asi[0].mm = mm;
+ mm->asi[0].pgd = (pgd_t *)__get_free_page(GFP_PGTABLE_USER);
+ if (!mm->asi[0].pgd)
+ return -ENOMEM;
+
+ return 0;
+}
+
+void asi_free_mm_state(struct mm_struct *mm)
+{
+ if (!boot_cpu_has(X86_FEATURE_ASI) || !mm->asi_enabled)
+ return;
+
+ asi_free_pgd_range(&mm->asi[0], pgd_index(ASI_LOCAL_MAP),
+ pgd_index(ASI_LOCAL_MAP +
+ PFN_PHYS(max_possible_pfn)) + 1);
+
+ free_page((ulong)mm->asi[0].pgd);
}

static bool is_page_within_range(size_t addr, size_t page_size,
@@ -599,3 +637,37 @@ void *asi_va(unsigned long pa)
? ASI_LOCAL_MAP : PAGE_OFFSET));
}
EXPORT_SYMBOL(asi_va);
+
+static bool is_addr_in_local_nonsensitive_range(size_t addr)
+{
+ return addr >= ASI_LOCAL_MAP &&
+ addr < VMALLOC_GLOBAL_NONSENSITIVE_START;
+}
+
+void asi_do_lazy_map(struct asi *asi, size_t addr)
+{
+ if (!static_cpu_has(X86_FEATURE_ASI) || !asi)
+ return;
+
+ if ((asi->class->flags & ASI_MAP_STANDARD_NONSENSITIVE) &&
+ is_addr_in_local_nonsensitive_range(addr))
+ asi_clone_pgd(asi->pgd, asi->mm->asi[0].pgd, addr);
+}
+
+/*
+ * Should be called after asi_map(ASI_LOCAL_NONSENSITIVE,...) for any mapping
+ * that is required to exist prior to asi_enter() (e.g. thread stacks)
+ */
+void asi_sync_mapping(struct asi *asi, void *start, size_t len)
+{
+ size_t addr = (size_t)start;
+ size_t end = addr + len;
+
+ if (!static_cpu_has(X86_FEATURE_ASI) || !asi)
+ return;
+
+ if ((asi->class->flags & ASI_MAP_STANDARD_NONSENSITIVE) &&
+ is_addr_in_local_nonsensitive_range(addr))
+ for (; addr < end; addr = pgd_addr_end(addr, end))
+ asi_clone_pgd(asi->pgd, asi->mm->asi[0].pgd, addr);
+}
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 4bfed53e210e..8692eb50f4a5 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1498,6 +1498,12 @@ DEFINE_IDTENTRY_RAW_ERRORCODE(exc_page_fault)
{
unsigned long address = read_cr2();
irqentry_state_t state;
+ /*
+ * There is a very small chance that an NMI could cause an asi_exit()
+ * before this asi_get_current(), but that is ok, we will just do
+ * the fixup on the next page fault.
+ */
+ struct asi *asi = asi_get_current();

prefetchw(&current->mm->mmap_lock);

@@ -1539,6 +1545,7 @@ DEFINE_IDTENTRY_RAW_ERRORCODE(exc_page_fault)

instrumentation_begin();
handle_page_fault(regs, error_code, address);
+ asi_do_lazy_map(asi, address);
instrumentation_end();

irqentry_exit(regs, state);
diff --git a/include/asm-generic/asi.h b/include/asm-generic/asi.h
index 51c9c4a488e8..a1c8ebff70e8 100644
--- a/include/asm-generic/asi.h
+++ b/include/asm-generic/asi.h
@@ -13,6 +13,7 @@
#define ASI_MAX_NUM 0

#define ASI_GLOBAL_NONSENSITIVE NULL
+#define ASI_LOCAL_NONSENSITIVE NULL

#define VMALLOC_GLOBAL_NONSENSITIVE_START VMALLOC_START
#define VMALLOC_GLOBAL_NONSENSITIVE_END VMALLOC_END
@@ -31,7 +32,9 @@ int asi_register_class(const char *name, uint flags,

static inline void asi_unregister_class(int asi_index) { }

-static inline void asi_init_mm_state(struct mm_struct *mm) { }
+static inline int asi_init_mm_state(struct mm_struct *mm) { return 0; }
+
+static inline void asi_free_mm_state(struct mm_struct *mm) { }

static inline
int asi_init(struct mm_struct *mm, int asi_index, struct asi **out_asi)
@@ -67,9 +70,16 @@ static inline int asi_map(struct asi *asi, void *addr, size_t len)
return 0;
}

+static inline
+void asi_sync_mapping(struct asi *asi, void *addr, size_t len) { }
+
static inline
void asi_unmap(struct asi *asi, void *addr, size_t len, bool flush_tlb) { }

+
+static inline
+void asi_do_lazy_map(struct asi *asi, size_t addr) { }
+
static inline
void asi_flush_tlb_range(struct asi *asi, void *addr, size_t len) { }

diff --git a/kernel/fork.c b/kernel/fork.c
index 3695a32ee9bd..dd5a86e913ea 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -699,6 +699,7 @@ void __mmdrop(struct mm_struct *mm)
mm_free_pgd(mm);
destroy_context(mm);
mmu_notifier_subscriptions_destroy(mm);
+ asi_free_mm_state(mm);
check_mm(mm);
put_user_ns(mm->user_ns);
free_mm(mm);
@@ -1072,17 +1073,20 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
mm->def_flags = 0;
}

- asi_init_mm_state(mm);
-
if (mm_alloc_pgd(mm))
goto fail_nopgd;

if (init_new_context(p, mm))
goto fail_nocontext;

+ if (asi_init_mm_state(mm))
+ goto fail_noasi;
+
mm->user_ns = get_user_ns(user_ns);
+
return mm;

+fail_noasi:
fail_nocontext:
mm_free_pgd(mm);
fail_nopgd:
--
2.35.1.473.g83b2b277ed-goog

2022-02-23 20:39:22

by Junaid Shahid

[permalink] [raw]
Subject: [RFC PATCH 36/47] mm: asi: Adding support for dynamic percpu ASI allocations

From: Ofir Weisse <[email protected]>

Add infrastructure to support pcpu_alloc() with the gfp flag
__GFP_GLOBAL_NONSENSITIVE. We use a mechanism similar to the earlier
infrastructure for memcg percpu allocations and add the chunk type
PCPU_CHUNK_ASI_NONSENSITIVE. Indexing pcpu_chunk_lists[] with
PCPU_CHUNK_ASI_NONSENSITIVE yields the list of ASI non-sensitive
percpu chunks, which allows most of the existing code to remain
unchanged.
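The indexing pattern used throughout the diff can be condensed to the
following sketch; pcpu_slot_list() is not a helper added by this patch, just
an illustration of the idea, and it assumes the definitions from
mm/percpu-internal.h:

/* Assumes the mm/percpu-internal.h definitions from this patch. */
static inline struct list_head *pcpu_slot_list(struct pcpu_chunk *chunk,
					       int slot)
{
	/* Each chunk type has its own array of per-slot lists. */
	struct list_head *lists = pcpu_chunk_lists[pcpu_chunk_type(chunk)];

	return &lists[slot];
}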

Signed-off-by: Ofir Weisse <[email protected]>


---
mm/percpu-internal.h | 23 ++++++-
mm/percpu-km.c | 5 +-
mm/percpu-vm.c | 6 +-
mm/percpu.c | 139 ++++++++++++++++++++++++++++++++++---------
4 files changed, 141 insertions(+), 32 deletions(-)

diff --git a/mm/percpu-internal.h b/mm/percpu-internal.h
index 639662c20c82..2fac01114edc 100644
--- a/mm/percpu-internal.h
+++ b/mm/percpu-internal.h
@@ -5,6 +5,15 @@
#include <linux/types.h>
#include <linux/percpu.h>

+enum pcpu_chunk_type {
+ PCPU_CHUNK_ROOT,
+#ifdef CONFIG_ADDRESS_SPACE_ISOLATION
+ PCPU_CHUNK_ASI_NONSENSITIVE,
+#endif
+ PCPU_NR_CHUNK_TYPES,
+ PCPU_FAIL_ALLOC = PCPU_NR_CHUNK_TYPES
+};
+
/*
* pcpu_block_md is the metadata block struct.
* Each chunk's bitmap is split into a number of full blocks.
@@ -59,6 +68,9 @@ struct pcpu_chunk {
#ifdef CONFIG_MEMCG_KMEM
struct obj_cgroup **obj_cgroups; /* vector of object cgroups */
#endif
+#ifdef CONFIG_ADDRESS_SPACE_ISOLATION
+ bool is_asi_nonsensitive; /* ASI nonsensitive chunk */
+#endif

int nr_pages; /* # of pages served by this chunk */
int nr_populated; /* # of populated pages */
@@ -68,7 +80,7 @@ struct pcpu_chunk {

extern spinlock_t pcpu_lock;

-extern struct list_head *pcpu_chunk_lists;
+extern struct list_head *pcpu_chunk_lists[PCPU_NR_CHUNK_TYPES];
extern int pcpu_nr_slots;
extern int pcpu_sidelined_slot;
extern int pcpu_to_depopulate_slot;
@@ -113,6 +125,15 @@ static inline int pcpu_chunk_map_bits(struct pcpu_chunk *chunk)
return pcpu_nr_pages_to_map_bits(chunk->nr_pages);
}

+static inline enum pcpu_chunk_type pcpu_chunk_type(struct pcpu_chunk *chunk)
+{
+#ifdef CONFIG_ADDRESS_SPACE_ISOLATION
+ if (chunk->is_asi_nonsensitive)
+ return PCPU_CHUNK_ASI_NONSENSITIVE;
+#endif
+ return PCPU_CHUNK_ROOT;
+}
+
#ifdef CONFIG_PERCPU_STATS

#include <linux/spinlock.h>
diff --git a/mm/percpu-km.c b/mm/percpu-km.c
index fe31aa19db81..01e31bd55860 100644
--- a/mm/percpu-km.c
+++ b/mm/percpu-km.c
@@ -50,7 +50,8 @@ static void pcpu_depopulate_chunk(struct pcpu_chunk *chunk,
/* nada */
}

-static struct pcpu_chunk *pcpu_create_chunk(gfp_t gfp)
+static struct pcpu_chunk *pcpu_create_chunk(enum pcpu_chunk_type type,
+ gfp_t gfp)
{
const int nr_pages = pcpu_group_sizes[0] >> PAGE_SHIFT;
struct pcpu_chunk *chunk;
@@ -58,7 +59,7 @@ static struct pcpu_chunk *pcpu_create_chunk(gfp_t gfp)
unsigned long flags;
int i;

- chunk = pcpu_alloc_chunk(gfp);
+ chunk = pcpu_alloc_chunk(type, gfp);
if (!chunk)
return NULL;

diff --git a/mm/percpu-vm.c b/mm/percpu-vm.c
index 5579a96ad782..59f3b55abdd1 100644
--- a/mm/percpu-vm.c
+++ b/mm/percpu-vm.c
@@ -357,7 +357,8 @@ static void pcpu_depopulate_chunk(struct pcpu_chunk *chunk,
pcpu_free_pages(chunk, pages, page_start, page_end);
}

-static struct pcpu_chunk *pcpu_create_chunk(gfp_t gfp)
+static struct pcpu_chunk *pcpu_create_chunk(enum pcpu_chunk_type type,
+ gfp_t gfp)
{
struct pcpu_chunk *chunk;
struct vm_struct **vms;
@@ -368,7 +369,8 @@ static struct pcpu_chunk *pcpu_create_chunk(gfp_t gfp)

gfp &= ~__GFP_GLOBAL_NONSENSITIVE;

- chunk = pcpu_alloc_chunk(gfp);
+ chunk = pcpu_alloc_chunk(type, gfp);
+
if (!chunk)
return NULL;

diff --git a/mm/percpu.c b/mm/percpu.c
index f5b2c2ea5a54..beaca5adf9d4 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -172,7 +172,7 @@ struct pcpu_chunk *pcpu_reserved_chunk __ro_after_init;
DEFINE_SPINLOCK(pcpu_lock); /* all internal data structures */
static DEFINE_MUTEX(pcpu_alloc_mutex); /* chunk create/destroy, [de]pop, map ext */

-struct list_head *pcpu_chunk_lists __ro_after_init; /* chunk list slots */
+struct list_head *pcpu_chunk_lists[PCPU_NR_CHUNK_TYPES] __ro_after_init; /* chunk list slots */

/* chunks which need their map areas extended, protected by pcpu_lock */
static LIST_HEAD(pcpu_map_extend_chunks);
@@ -531,10 +531,12 @@ static void __pcpu_chunk_move(struct pcpu_chunk *chunk, int slot,
bool move_front)
{
if (chunk != pcpu_reserved_chunk) {
+ struct list_head *pcpu_type_lists =
+ pcpu_chunk_lists[pcpu_chunk_type(chunk)];
if (move_front)
- list_move(&chunk->list, &pcpu_chunk_lists[slot]);
+ list_move(&chunk->list, &pcpu_type_lists[slot]);
else
- list_move_tail(&chunk->list, &pcpu_chunk_lists[slot]);
+ list_move_tail(&chunk->list, &pcpu_type_lists[slot]);
}
}

@@ -570,13 +572,16 @@ static void pcpu_chunk_relocate(struct pcpu_chunk *chunk, int oslot)

static void pcpu_isolate_chunk(struct pcpu_chunk *chunk)
{
+ struct list_head *pcpu_type_lists =
+ pcpu_chunk_lists[pcpu_chunk_type(chunk)];
+
lockdep_assert_held(&pcpu_lock);

if (!chunk->isolated) {
chunk->isolated = true;
pcpu_nr_empty_pop_pages -= chunk->nr_empty_pop_pages;
}
- list_move(&chunk->list, &pcpu_chunk_lists[pcpu_to_depopulate_slot]);
+ list_move(&chunk->list, &pcpu_type_lists[pcpu_to_depopulate_slot]);
}

static void pcpu_reintegrate_chunk(struct pcpu_chunk *chunk)
@@ -1438,7 +1443,8 @@ static struct pcpu_chunk * __init pcpu_alloc_first_chunk(unsigned long tmp_addr,
return chunk;
}

-static struct pcpu_chunk *pcpu_alloc_chunk(gfp_t gfp)
+static struct pcpu_chunk *pcpu_alloc_chunk(enum pcpu_chunk_type type,
+ gfp_t gfp)
{
struct pcpu_chunk *chunk;
int region_bits;
@@ -1475,6 +1481,13 @@ static struct pcpu_chunk *pcpu_alloc_chunk(gfp_t gfp)
goto objcg_fail;
}
#endif
+#ifdef CONFIG_ADDRESS_SPACE_ISOLATION
+ /* TODO: (oweisse) do asi_map for nonsensitive chunks */
+ if (type == PCPU_CHUNK_ASI_NONSENSITIVE)
+ chunk->is_asi_nonsensitive = true;
+ else
+ chunk->is_asi_nonsensitive = false;
+#endif

pcpu_init_md_blocks(chunk);

@@ -1580,7 +1593,8 @@ static void pcpu_depopulate_chunk(struct pcpu_chunk *chunk,
int page_start, int page_end);
static void pcpu_post_unmap_tlb_flush(struct pcpu_chunk *chunk,
int page_start, int page_end);
-static struct pcpu_chunk *pcpu_create_chunk(gfp_t gfp);
+static struct pcpu_chunk *pcpu_create_chunk(enum pcpu_chunk_type type,
+ gfp_t gfp);
static void pcpu_destroy_chunk(struct pcpu_chunk *chunk);
static struct page *pcpu_addr_to_page(void *addr);
static int __init pcpu_verify_alloc_info(const struct pcpu_alloc_info *ai);
@@ -1733,6 +1747,8 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved,
unsigned long flags;
void __percpu *ptr;
size_t bits, bit_align;
+ enum pcpu_chunk_type type;
+ struct list_head *pcpu_type_lists;

gfp = current_gfp_context(gfp);
/* whitelisted flags that can be passed to the backing allocators */
@@ -1763,6 +1779,16 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved,
if (unlikely(!pcpu_memcg_pre_alloc_hook(size, gfp, &objcg)))
return NULL;

+ type = PCPU_CHUNK_ROOT;
+#ifdef CONFIG_ADDRESS_SPACE_ISOLATION
+ if (static_asi_enabled() && (gfp & __GFP_GLOBAL_NONSENSITIVE)) {
+ type = PCPU_CHUNK_ASI_NONSENSITIVE;
+ pcpu_gfp |= __GFP_GLOBAL_NONSENSITIVE;
+ }
+#endif
+ pcpu_type_lists = pcpu_chunk_lists[type];
+ BUG_ON(!pcpu_type_lists);
+
if (!is_atomic) {
/*
* pcpu_balance_workfn() allocates memory under this mutex,
@@ -1800,7 +1826,7 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved,
restart:
/* search through normal chunks */
for (slot = pcpu_size_to_slot(size); slot <= pcpu_free_slot; slot++) {
- list_for_each_entry_safe(chunk, next, &pcpu_chunk_lists[slot],
+ list_for_each_entry_safe(chunk, next, &pcpu_type_lists[slot],
list) {
off = pcpu_find_block_fit(chunk, bits, bit_align,
is_atomic);
@@ -1830,8 +1856,8 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved,
goto fail;
}

- if (list_empty(&pcpu_chunk_lists[pcpu_free_slot])) {
- chunk = pcpu_create_chunk(pcpu_gfp);
+ if (list_empty(&pcpu_type_lists[pcpu_free_slot])) {
+ chunk = pcpu_create_chunk(type, pcpu_gfp);
if (!chunk) {
err = "failed to allocate new chunk";
goto fail;
@@ -1983,12 +2009,19 @@ void __percpu *__alloc_reserved_percpu(size_t size, size_t align)
* CONTEXT:
* pcpu_lock (can be dropped temporarily)
*/
-static void pcpu_balance_free(bool empty_only)
+
+static void __pcpu_balance_free(bool empty_only,
+ enum pcpu_chunk_type type)
{
LIST_HEAD(to_free);
- struct list_head *free_head = &pcpu_chunk_lists[pcpu_free_slot];
+ struct list_head *pcpu_type_lists = pcpu_chunk_lists[type];
+ struct list_head *free_head;
struct pcpu_chunk *chunk, *next;

+ if (!pcpu_type_lists)
+ return;
+ free_head = &pcpu_type_lists[pcpu_free_slot];
+
lockdep_assert_held(&pcpu_lock);

/*
@@ -2026,6 +2059,14 @@ static void pcpu_balance_free(bool empty_only)
spin_lock_irq(&pcpu_lock);
}

+static void pcpu_balance_free(bool empty_only)
+{
+ enum pcpu_chunk_type type;
+ for (type = 0; type < PCPU_NR_CHUNK_TYPES; type++) {
+ __pcpu_balance_free(empty_only, type);
+ }
+}
+
/**
* pcpu_balance_populated - manage the amount of populated pages
*
@@ -2038,12 +2079,21 @@ static void pcpu_balance_free(bool empty_only)
* CONTEXT:
* pcpu_lock (can be dropped temporarily)
*/
-static void pcpu_balance_populated(void)
+static void __pcpu_balance_populated(enum pcpu_chunk_type type)
{
/* gfp flags passed to underlying allocators */
- const gfp_t gfp = GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN;
+ const gfp_t gfp = GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN
+#ifdef CONFIG_ADDRESS_SPACE_ISOLATION
+ | (type == PCPU_CHUNK_ASI_NONSENSITIVE ?
+ __GFP_GLOBAL_NONSENSITIVE : 0)
+#endif
+ ;
struct pcpu_chunk *chunk;
int slot, nr_to_pop, ret;
+ struct list_head *pcpu_type_lists = pcpu_chunk_lists[type];
+
+ if (!pcpu_type_lists)
+ return;

lockdep_assert_held(&pcpu_lock);

@@ -2074,7 +2124,7 @@ static void pcpu_balance_populated(void)
if (!nr_to_pop)
break;

- list_for_each_entry(chunk, &pcpu_chunk_lists[slot], list) {
+ list_for_each_entry(chunk, &pcpu_type_lists[slot], list) {
nr_unpop = chunk->nr_pages - chunk->nr_populated;
if (nr_unpop)
break;
@@ -2107,7 +2157,7 @@ static void pcpu_balance_populated(void)
if (nr_to_pop) {
/* ran out of chunks to populate, create a new one and retry */
spin_unlock_irq(&pcpu_lock);
- chunk = pcpu_create_chunk(gfp);
+ chunk = pcpu_create_chunk(type, gfp);
cond_resched();
spin_lock_irq(&pcpu_lock);
if (chunk) {
@@ -2117,6 +2167,14 @@ static void pcpu_balance_populated(void)
}
}

+static void pcpu_balance_populated(void)
+{
+ enum pcpu_chunk_type type;
+
+ for (type = 0; type < PCPU_NR_CHUNK_TYPES; type++)
+ __pcpu_balance_populated(type);
+}
+
/**
* pcpu_reclaim_populated - scan over to_depopulate chunks and free empty pages
*
@@ -2132,13 +2190,19 @@ static void pcpu_balance_populated(void)
* pcpu_lock (can be dropped temporarily)
*
*/
-static void pcpu_reclaim_populated(void)
+
+
+static void __pcpu_reclaim_populated(enum pcpu_chunk_type type)
{
struct pcpu_chunk *chunk;
struct pcpu_block_md *block;
int freed_page_start, freed_page_end;
int i, end;
bool reintegrate;
+ struct list_head *pcpu_type_lists = pcpu_chunk_lists[type];
+
+ if (!pcpu_type_lists)
+ return;

lockdep_assert_held(&pcpu_lock);

@@ -2148,8 +2212,8 @@ static void pcpu_reclaim_populated(void)
* other accessor is the free path which only returns area back to the
* allocator not touching the populated bitmap.
*/
- while (!list_empty(&pcpu_chunk_lists[pcpu_to_depopulate_slot])) {
- chunk = list_first_entry(&pcpu_chunk_lists[pcpu_to_depopulate_slot],
+ while (!list_empty(&pcpu_type_lists[pcpu_to_depopulate_slot])) {
+ chunk = list_first_entry(&pcpu_type_lists[pcpu_to_depopulate_slot],
struct pcpu_chunk, list);
WARN_ON(chunk->immutable);

@@ -2219,10 +2283,18 @@ static void pcpu_reclaim_populated(void)
pcpu_reintegrate_chunk(chunk);
else
list_move_tail(&chunk->list,
- &pcpu_chunk_lists[pcpu_sidelined_slot]);
+ &pcpu_type_lists[pcpu_sidelined_slot]);
}
}

+static void pcpu_reclaim_populated(void)
+{
+ enum pcpu_chunk_type type;
+ for (type = 0; type < PCPU_NR_CHUNK_TYPES; type++) {
+ __pcpu_reclaim_populated(type);
+ }
+}
+
/**
* pcpu_balance_workfn - manage the amount of free chunks and populated pages
* @work: unused
@@ -2268,6 +2340,7 @@ void free_percpu(void __percpu *ptr)
unsigned long flags;
int size, off;
bool need_balance = false;
+ struct list_head *pcpu_type_lists = NULL;

if (!ptr)
return;
@@ -2280,6 +2353,8 @@ void free_percpu(void __percpu *ptr)

chunk = pcpu_chunk_addr_search(addr);
off = addr - chunk->base_addr;
+ pcpu_type_lists = pcpu_chunk_lists[pcpu_chunk_type(chunk)];
+ BUG_ON(!pcpu_type_lists);

size = pcpu_free_area(chunk, off);

@@ -2293,7 +2368,7 @@ void free_percpu(void __percpu *ptr)
if (!chunk->isolated && chunk->free_bytes == pcpu_unit_size) {
struct pcpu_chunk *pos;

- list_for_each_entry(pos, &pcpu_chunk_lists[pcpu_free_slot], list)
+ list_for_each_entry(pos, &pcpu_type_lists[pcpu_free_slot], list)
if (pos != chunk) {
need_balance = true;
break;
@@ -2601,6 +2676,7 @@ void __init pcpu_setup_first_chunk(const struct pcpu_alloc_info *ai,
int map_size;
unsigned long tmp_addr;
size_t alloc_size;
+ enum pcpu_chunk_type type;

#define PCPU_SETUP_BUG_ON(cond) do { \
if (unlikely(cond)) { \
@@ -2723,15 +2799,24 @@ void __init pcpu_setup_first_chunk(const struct pcpu_alloc_info *ai,
pcpu_free_slot = pcpu_sidelined_slot + 1;
pcpu_to_depopulate_slot = pcpu_free_slot + 1;
pcpu_nr_slots = pcpu_to_depopulate_slot + 1;
- pcpu_chunk_lists = memblock_alloc(pcpu_nr_slots *
- sizeof(pcpu_chunk_lists[0]),
+ for (type = 0; type < PCPU_NR_CHUNK_TYPES; type++) {
+#ifdef CONFIG_ADDRESS_SPACE_ISOLATION
+ if (type == PCPU_CHUNK_ASI_NONSENSITIVE &&
+ !static_asi_enabled()) {
+ pcpu_chunk_lists[type] = NULL;
+ continue;
+ }
+#endif
+ pcpu_chunk_lists[type] = memblock_alloc(pcpu_nr_slots *
+ sizeof(pcpu_chunk_lists[0][0]),
SMP_CACHE_BYTES);
- if (!pcpu_chunk_lists)
- panic("%s: Failed to allocate %zu bytes\n", __func__,
- pcpu_nr_slots * sizeof(pcpu_chunk_lists[0]));
+ if (!pcpu_chunk_lists[type])
+ panic("%s: Failed to allocate %zu bytes\n", __func__,
+ pcpu_nr_slots * sizeof(pcpu_chunk_lists[0][0]));

- for (i = 0; i < pcpu_nr_slots; i++)
- INIT_LIST_HEAD(&pcpu_chunk_lists[i]);
+ for (i = 0; i < pcpu_nr_slots; i++)
+ INIT_LIST_HEAD(&pcpu_chunk_lists[type][i]);
+ }

/*
* The end of the static region needs to be aligned with the
--
2.35.1.473.g83b2b277ed-goog

2022-02-23 21:00:16

by Junaid Shahid

[permalink] [raw]
Subject: [RFC PATCH 12/47] mm: asi: Support for global non-sensitive slab caches

A new flag, SLAB_GLOBAL_NONSENSITIVE, is added; it designates all
objects within a slab cache as globally non-sensitive.

A second flag, SLAB_NONSENSITIVE, is added as well. It is currently
just an alias for SLAB_GLOBAL_NONSENSITIVE, but will eventually be
used to designate slab caches that can allocate either global or
local non-sensitive objects.

In addition, new kmalloc caches have been added that can be used to
allocate non-sensitive objects.
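For illustration (not part of the diff), the two ways of getting globally
non-sensitive slab memory would look roughly like this; the "foo" names are
placeholders:

#include <linux/init.h>
#include <linux/errno.h>
#include <linux/slab.h>

struct foo {
	int id;
};

static struct kmem_cache *foo_cache;

static int __init foo_cache_init(void)
{
	/* Every object in this cache is globally non-sensitive. */
	foo_cache = kmem_cache_create("foo", sizeof(struct foo), 0,
				      SLAB_GLOBAL_NONSENSITIVE, NULL);
	return foo_cache ? 0 : -ENOMEM;
}

static void *foo_scratch_alloc(size_t len)
{
	/* Served from the new "ns-kmalloc-*" caches when ASI is enabled. */
	return kmalloc(len, GFP_KERNEL | __GFP_GLOBAL_NONSENSITIVE);
}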

Signed-off-by: Junaid Shahid <[email protected]>


---
include/linux/slab.h | 32 +++++++++++++++----
mm/slab.c | 5 +++
mm/slab.h | 14 ++++++++-
mm/slab_common.c | 73 +++++++++++++++++++++++++++++++++-----------
security/Kconfig | 2 +-
5 files changed, 101 insertions(+), 25 deletions(-)

diff --git a/include/linux/slab.h b/include/linux/slab.h
index 181045148b06..7b8a3853d827 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -120,6 +120,12 @@
/* Slab deactivation flag */
#define SLAB_DEACTIVATED ((slab_flags_t __force)0x10000000U)

+#ifdef CONFIG_ADDRESS_SPACE_ISOLATION
+#define SLAB_GLOBAL_NONSENSITIVE ((slab_flags_t __force)0x20000000U)
+#else
+#define SLAB_GLOBAL_NONSENSITIVE 0
+#endif
+
/*
* ZERO_SIZE_PTR will be returned for zero sized kmalloc requests.
*
@@ -329,6 +335,11 @@ enum kmalloc_cache_type {
extern struct kmem_cache *
kmalloc_caches[NR_KMALLOC_TYPES][KMALLOC_SHIFT_HIGH + 1];

+#ifdef CONFIG_ADDRESS_SPACE_ISOLATION
+extern struct kmem_cache *
+nonsensitive_kmalloc_caches[NR_KMALLOC_TYPES][KMALLOC_SHIFT_HIGH + 1];
+#endif
+
/*
* Define gfp bits that should not be set for KMALLOC_NORMAL.
*/
@@ -361,6 +372,17 @@ static __always_inline enum kmalloc_cache_type kmalloc_type(gfp_t flags)
return KMALLOC_CGROUP;
}

+static __always_inline struct kmem_cache *get_kmalloc_cache(gfp_t flags,
+ uint index)
+{
+#ifdef CONFIG_ADDRESS_SPACE_ISOLATION
+
+ if (static_asi_enabled() && (flags & __GFP_GLOBAL_NONSENSITIVE))
+ return nonsensitive_kmalloc_caches[kmalloc_type(flags)][index];
+#endif
+ return kmalloc_caches[kmalloc_type(flags)][index];
+}
+
/*
* Figure out which kmalloc slab an allocation of a certain size
* belongs to.
@@ -587,9 +609,8 @@ static __always_inline __alloc_size(1) void *kmalloc(size_t size, gfp_t flags)
if (!index)
return ZERO_SIZE_PTR;

- return kmem_cache_alloc_trace(
- kmalloc_caches[kmalloc_type(flags)][index],
- flags, size);
+ return kmem_cache_alloc_trace(get_kmalloc_cache(flags, index),
+ flags, size);
#endif
}
return __kmalloc(size, flags);
@@ -605,9 +626,8 @@ static __always_inline __alloc_size(1) void *kmalloc_node(size_t size, gfp_t fla
if (!i)
return ZERO_SIZE_PTR;

- return kmem_cache_alloc_node_trace(
- kmalloc_caches[kmalloc_type(flags)][i],
- flags, node, size);
+ return kmem_cache_alloc_node_trace(get_kmalloc_cache(flags, i),
+ flags, node, size);
}
#endif
return __kmalloc_node(size, flags, node);
diff --git a/mm/slab.c b/mm/slab.c
index ca4822f6b2b6..5a928d95d67b 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -1956,6 +1956,9 @@ int __kmem_cache_create(struct kmem_cache *cachep, slab_flags_t flags)
size = ALIGN(size, REDZONE_ALIGN);
}

+ if (!static_asi_enabled())
+ flags &= ~SLAB_NONSENSITIVE;
+
/* 3) caller mandated alignment */
if (ralign < cachep->align) {
ralign = cachep->align;
@@ -2058,6 +2061,8 @@ int __kmem_cache_create(struct kmem_cache *cachep, slab_flags_t flags)
cachep->allocflags |= GFP_DMA32;
if (flags & SLAB_RECLAIM_ACCOUNT)
cachep->allocflags |= __GFP_RECLAIMABLE;
+ if (flags & SLAB_GLOBAL_NONSENSITIVE)
+ cachep->allocflags |= __GFP_GLOBAL_NONSENSITIVE;
cachep->size = size;
cachep->reciprocal_buffer_size = reciprocal_value(size);

diff --git a/mm/slab.h b/mm/slab.h
index 56ad7eea3ddf..f190f4fc0286 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -77,6 +77,10 @@ extern struct kmem_cache *kmem_cache;
/* A table of kmalloc cache names and sizes */
extern const struct kmalloc_info_struct {
const char *name[NR_KMALLOC_TYPES];
+#ifdef CONFIG_ADDRESS_SPACE_ISOLATION
+ const char *nonsensitive_name[NR_KMALLOC_TYPES];
+#endif
+ slab_flags_t flags[NR_KMALLOC_TYPES];
unsigned int size;
} kmalloc_info[];

@@ -124,11 +128,14 @@ static inline slab_flags_t kmem_cache_flags(unsigned int object_size,
}
#endif

+/* This will also include SLAB_LOCAL_NONSENSITIVE in a later patch. */
+#define SLAB_NONSENSITIVE SLAB_GLOBAL_NONSENSITIVE

/* Legal flag mask for kmem_cache_create(), for various configurations */
#define SLAB_CORE_FLAGS (SLAB_HWCACHE_ALIGN | SLAB_CACHE_DMA | \
SLAB_CACHE_DMA32 | SLAB_PANIC | \
- SLAB_TYPESAFE_BY_RCU | SLAB_DEBUG_OBJECTS )
+ SLAB_TYPESAFE_BY_RCU | SLAB_DEBUG_OBJECTS | \
+ SLAB_NONSENSITIVE)

#if defined(CONFIG_DEBUG_SLAB)
#define SLAB_DEBUG_FLAGS (SLAB_RED_ZONE | SLAB_POISON | SLAB_STORE_USER)
@@ -491,6 +498,11 @@ static inline struct kmem_cache *slab_pre_alloc_hook(struct kmem_cache *s,

might_alloc(flags);

+ if (static_asi_enabled()) {
+ VM_BUG_ON(!(s->flags & SLAB_GLOBAL_NONSENSITIVE) &&
+ (flags & __GFP_GLOBAL_NONSENSITIVE));
+ }
+
if (should_failslab(s, flags))
return NULL;

diff --git a/mm/slab_common.c b/mm/slab_common.c
index e5d080a93009..72dee2494bf8 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -50,7 +50,7 @@ static DECLARE_WORK(slab_caches_to_rcu_destroy_work,
SLAB_FAILSLAB | kasan_never_merge())

#define SLAB_MERGE_SAME (SLAB_RECLAIM_ACCOUNT | SLAB_CACHE_DMA | \
- SLAB_CACHE_DMA32 | SLAB_ACCOUNT)
+ SLAB_CACHE_DMA32 | SLAB_ACCOUNT | SLAB_NONSENSITIVE)

/*
* Merge control. If this is set then no merging of slab caches will occur.
@@ -681,6 +681,15 @@ kmalloc_caches[NR_KMALLOC_TYPES][KMALLOC_SHIFT_HIGH + 1] __ro_after_init =
{ /* initialization for https://bugs.llvm.org/show_bug.cgi?id=42570 */ };
EXPORT_SYMBOL(kmalloc_caches);

+#ifdef CONFIG_ADDRESS_SPACE_ISOLATION
+
+struct kmem_cache *
+nonsensitive_kmalloc_caches[NR_KMALLOC_TYPES][KMALLOC_SHIFT_HIGH + 1] __ro_after_init =
+{ /* initialization for https://bugs.llvm.org/show_bug.cgi?id=42570 */ };
+EXPORT_SYMBOL(nonsensitive_kmalloc_caches);
+
+#endif
+
/*
* Conversion table for small slabs sizes / 8 to the index in the
* kmalloc array. This is necessary for slabs < 192 since we have non power
@@ -738,25 +747,34 @@ struct kmem_cache *kmalloc_slab(size_t size, gfp_t flags)
index = fls(size - 1);
}

- return kmalloc_caches[kmalloc_type(flags)][index];
+ return get_kmalloc_cache(flags, index);
}

+#ifdef CONFIG_ADDRESS_SPACE_ISOLATION
+#define __KMALLOC_NAME(type, base_name, sz) \
+ .name[type] = base_name "-" #sz, \
+ .nonsensitive_name[type] = "ns-" base_name "-" #sz,
+#else
+#define __KMALLOC_NAME(type, base_name, sz) \
+ .name[type] = base_name "-" #sz,
+#endif
+
#ifdef CONFIG_ZONE_DMA
-#define KMALLOC_DMA_NAME(sz) .name[KMALLOC_DMA] = "dma-kmalloc-" #sz,
+#define KMALLOC_DMA_NAME(sz) __KMALLOC_NAME(KMALLOC_DMA, "dma-kmalloc", sz)
#else
#define KMALLOC_DMA_NAME(sz)
#endif

#ifdef CONFIG_MEMCG_KMEM
-#define KMALLOC_CGROUP_NAME(sz) .name[KMALLOC_CGROUP] = "kmalloc-cg-" #sz,
+#define KMALLOC_CGROUP_NAME(sz) __KMALLOC_NAME(KMALLOC_CGROUP, "kmalloc-cg", sz)
#else
#define KMALLOC_CGROUP_NAME(sz)
#endif

#define INIT_KMALLOC_INFO(__size, __short_size) \
{ \
- .name[KMALLOC_NORMAL] = "kmalloc-" #__short_size, \
- .name[KMALLOC_RECLAIM] = "kmalloc-rcl-" #__short_size, \
+ __KMALLOC_NAME(KMALLOC_NORMAL, "kmalloc", __short_size) \
+ __KMALLOC_NAME(KMALLOC_RECLAIM, "kmalloc-rcl", __short_size) \
KMALLOC_CGROUP_NAME(__short_size) \
KMALLOC_DMA_NAME(__short_size) \
.size = __size, \
@@ -846,18 +864,30 @@ void __init setup_kmalloc_cache_index_table(void)
static void __init
new_kmalloc_cache(int idx, enum kmalloc_cache_type type, slab_flags_t flags)
{
+ struct kmem_cache *(*caches)[KMALLOC_SHIFT_HIGH + 1] = kmalloc_caches;
+ const char *name = kmalloc_info[idx].name[type];
+
+#ifdef CONFIG_ADDRESS_SPACE_ISOLATION
+
+ if (flags & SLAB_NONSENSITIVE) {
+ caches = nonsensitive_kmalloc_caches;
+ name = kmalloc_info[idx].nonsensitive_name[type];
+ }
+#endif
+
if (type == KMALLOC_RECLAIM) {
flags |= SLAB_RECLAIM_ACCOUNT;
} else if (IS_ENABLED(CONFIG_MEMCG_KMEM) && (type == KMALLOC_CGROUP)) {
if (cgroup_memory_nokmem) {
- kmalloc_caches[type][idx] = kmalloc_caches[KMALLOC_NORMAL][idx];
+ caches[type][idx] = caches[KMALLOC_NORMAL][idx];
return;
}
flags |= SLAB_ACCOUNT;
+ } else if (IS_ENABLED(CONFIG_ZONE_DMA) && (type == KMALLOC_DMA)) {
+ flags |= SLAB_CACHE_DMA;
}

- kmalloc_caches[type][idx] = create_kmalloc_cache(
- kmalloc_info[idx].name[type],
+ caches[type][idx] = create_kmalloc_cache(name,
kmalloc_info[idx].size, flags, 0,
kmalloc_info[idx].size);

@@ -866,7 +896,7 @@ new_kmalloc_cache(int idx, enum kmalloc_cache_type type, slab_flags_t flags)
* KMALLOC_NORMAL caches.
*/
if (IS_ENABLED(CONFIG_MEMCG_KMEM) && (type == KMALLOC_NORMAL))
- kmalloc_caches[type][idx]->refcount = -1;
+ caches[type][idx]->refcount = -1;
}

/*
@@ -908,15 +938,24 @@ void __init create_kmalloc_caches(slab_flags_t flags)
for (i = 0; i <= KMALLOC_SHIFT_HIGH; i++) {
struct kmem_cache *s = kmalloc_caches[KMALLOC_NORMAL][i];

- if (s) {
- kmalloc_caches[KMALLOC_DMA][i] = create_kmalloc_cache(
- kmalloc_info[i].name[KMALLOC_DMA],
- kmalloc_info[i].size,
- SLAB_CACHE_DMA | flags, 0,
- kmalloc_info[i].size);
- }
+ if (s)
+ new_kmalloc_cache(i, KMALLOC_DMA, flags);
}
#endif
+ /*
+ * TODO: We may want to make slab allocations without exiting ASI.
+ * In that case, the cache metadata itself would need to be
+ * treated as non-sensitive and mapped as such, and we would need to
+ * do the bootstrap much more carefully. We can do that if we find
+ * that slab allocations while inside a restricted address space are
+ * frequent enough to warrant the additional complexity.
+ */
+ if (static_asi_enabled())
+ for (type = KMALLOC_NORMAL; type < NR_KMALLOC_TYPES; type++)
+ for (i = 0; i <= KMALLOC_SHIFT_HIGH; i++)
+ if (kmalloc_caches[type][i])
+ new_kmalloc_cache(i, type,
+ flags | SLAB_NONSENSITIVE);
}
#endif /* !CONFIG_SLOB */

diff --git a/security/Kconfig b/security/Kconfig
index 21b15ecaf2c1..0a3e49d6a331 100644
--- a/security/Kconfig
+++ b/security/Kconfig
@@ -68,7 +68,7 @@ config PAGE_TABLE_ISOLATION
config ADDRESS_SPACE_ISOLATION
bool "Allow code to run with a reduced kernel address space"
default n
- depends on X86_64 && !UML
+ depends on X86_64 && !UML && SLAB
depends on !PARAVIRT
help
This feature provides the ability to run some kernel code
--
2.35.1.473.g83b2b277ed-goog

2022-02-24 00:34:00

by Junaid Shahid

[permalink] [raw]
Subject: [RFC PATCH 15/47] kvm: asi: Restricted address space for VM execution

An ASI restricted address space is added for KVM. It is currently
only enabled for Intel CPUs. The ASI hooks have been set up to do an
L1D cache flush and an MDS buffer clear when entering the restricted
address space.

The hooks are also meant to stun and unstun the sibling hyperthread
when exiting and entering the restricted address space. We do have a
full stunning implementation available internally, but it has not yet
been determined whether it is fully compatible with the upstream core
scheduling implementation. It is therefore not included in this patch
series; instead, this patch provides stub functions to show where the
stun/unstun would happen.
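The ordering that vmx_vcpu_enter_exit() ends up with after this patch can be
summarized by the following sketch (a condensed illustration of the diff
below, not a separate implementation):

static void vcpu_enter_exit_flow_sketch(struct kvm_vcpu *vcpu)
{
	/* 1. Flush L1D / clear CPU buffers via the new hook. */
	vmx_flush_sensitive_cpu_state(vcpu);

	/* 2. Switch to the VM's ASI restricted page tables. */
	asi_enter(vcpu->kvm->asi);

	/*
	 * 3. HOST_CR3 is refreshed here, after the address space switch,
	 *    and the guest is run (VMLAUNCH/VMRESUME).
	 */

	/*
	 * 4. On the way back to the host, mark the unrestricted address
	 *    space as the target again.
	 */
	asi_set_target_unrestricted();
}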

Signed-off-by: Junaid Shahid <[email protected]>


---
arch/x86/include/asm/kvm_host.h | 2 +
arch/x86/kvm/vmx/vmx.c | 41 ++++++++++++-----
arch/x86/kvm/x86.c | 81 ++++++++++++++++++++++++++++++++-
include/linux/kvm_host.h | 2 +
4 files changed, 113 insertions(+), 13 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 555f4de47ef2..98cbd6447e3e 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1494,6 +1494,8 @@ struct kvm_x86_ops {
int (*complete_emulated_msr)(struct kvm_vcpu *vcpu, int err);

void (*vcpu_deliver_sipi_vector)(struct kvm_vcpu *vcpu, u8 vector);
+
+ void (*flush_sensitive_cpu_state)(struct kvm_vcpu *vcpu);
};

struct kvm_x86_nested_ops {
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 0dbf94eb954f..e0178b57be75 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -47,6 +47,7 @@
#include <asm/spec-ctrl.h>
#include <asm/virtext.h>
#include <asm/vmx.h>
+#include <asm/asi.h>

#include "capabilities.h"
#include "cpuid.h"
@@ -300,7 +301,7 @@ static int vmx_setup_l1d_flush(enum vmx_l1d_flush_state l1tf)
else
static_branch_disable(&vmx_l1d_should_flush);

- if (l1tf == VMENTER_L1D_FLUSH_COND)
+ if (l1tf == VMENTER_L1D_FLUSH_COND && !boot_cpu_has(X86_FEATURE_ASI))
static_branch_enable(&vmx_l1d_flush_cond);
else
static_branch_disable(&vmx_l1d_flush_cond);
@@ -6079,6 +6080,8 @@ static noinstr void vmx_l1d_flush(struct kvm_vcpu *vcpu)
if (static_branch_likely(&vmx_l1d_flush_cond)) {
bool flush_l1d;

+ VM_BUG_ON(vcpu->kvm->asi);
+
/*
* Clear the per-vcpu flush bit, it gets set again
* either from vcpu_run() or from one of the unsafe
@@ -6590,16 +6593,31 @@ static fastpath_t vmx_exit_handlers_fastpath(struct kvm_vcpu *vcpu)
}
}

-static noinstr void vmx_vcpu_enter_exit(struct kvm_vcpu *vcpu,
- struct vcpu_vmx *vmx)
+static void vmx_flush_sensitive_cpu_state(struct kvm_vcpu *vcpu)
{
- kvm_guest_enter_irqoff();
-
/* L1D Flush includes CPU buffer clear to mitigate MDS */
if (static_branch_unlikely(&vmx_l1d_should_flush))
vmx_l1d_flush(vcpu);
else if (static_branch_unlikely(&mds_user_clear))
mds_clear_cpu_buffers();
+}
+
+static noinstr void vmx_vcpu_enter_exit(struct kvm_vcpu *vcpu,
+ struct vcpu_vmx *vmx)
+{
+ unsigned long cr3;
+
+ kvm_guest_enter_irqoff();
+
+ vmx_flush_sensitive_cpu_state(vcpu);
+
+ asi_enter(vcpu->kvm->asi);
+
+ cr3 = __get_current_cr3_fast();
+ if (unlikely(cr3 != vmx->loaded_vmcs->host_state.cr3)) {
+ vmcs_writel(HOST_CR3, cr3);
+ vmx->loaded_vmcs->host_state.cr3 = cr3;
+ }

if (vcpu->arch.cr2 != native_read_cr2())
native_write_cr2(vcpu->arch.cr2);
@@ -6609,13 +6627,16 @@ static noinstr void vmx_vcpu_enter_exit(struct kvm_vcpu *vcpu,

vcpu->arch.cr2 = native_read_cr2();

+ VM_WARN_ON_ONCE(vcpu->kvm->asi && !is_asi_active());
+ asi_set_target_unrestricted();
+
kvm_guest_exit_irqoff();
}

static fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
- unsigned long cr3, cr4;
+ unsigned long cr4;

/* Record the guest's net vcpu time for enforced NMI injections. */
if (unlikely(!enable_vnmi &&
@@ -6657,12 +6678,6 @@ static fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu)
if (kvm_register_is_dirty(vcpu, VCPU_REGS_RIP))
vmcs_writel(GUEST_RIP, vcpu->arch.regs[VCPU_REGS_RIP]);

- cr3 = __get_current_cr3_fast();
- if (unlikely(cr3 != vmx->loaded_vmcs->host_state.cr3)) {
- vmcs_writel(HOST_CR3, cr3);
- vmx->loaded_vmcs->host_state.cr3 = cr3;
- }
-
cr4 = cr4_read_shadow();
if (unlikely(cr4 != vmx->loaded_vmcs->host_state.cr4)) {
vmcs_writel(HOST_CR4, cr4);
@@ -7691,6 +7706,8 @@ static struct kvm_x86_ops vmx_x86_ops __initdata = {
.complete_emulated_msr = kvm_complete_insn_gp,

.vcpu_deliver_sipi_vector = kvm_vcpu_deliver_sipi_vector,
+
+ .flush_sensitive_cpu_state = vmx_flush_sensitive_cpu_state,
};

static __init void vmx_setup_user_return_msrs(void)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index e50e97ac4408..dd07f677d084 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -81,6 +81,7 @@
#include <asm/emulate_prefix.h>
#include <asm/sgx.h>
#include <clocksource/hyperv_timer.h>
+#include <asm/asi.h>

#define CREATE_TRACE_POINTS
#include "trace.h"
@@ -297,6 +298,8 @@ EXPORT_SYMBOL_GPL(supported_xcr0);

static struct kmem_cache *x86_emulator_cache;

+static int __read_mostly kvm_asi_index;
+
/*
* When called, it means the previous get/set msr reached an invalid msr.
* Return true if we want to ignore/silent this failed msr access.
@@ -8620,6 +8623,50 @@ static struct notifier_block pvclock_gtod_notifier = {
};
#endif

+#ifdef CONFIG_ADDRESS_SPACE_ISOLATION
+
+/*
+ * We have an HT-stunning implementation available internally,
+ * but it is yet to be determined if it is fully compatible with the
+ * upstream core scheduling implementation. So leaving it out for now
+ * and just leaving these stubs here.
+ */
+static void stun_sibling(void) { }
+static void unstun_sibling(void) { }
+
+/*
+ * This function must be fully re-entrant and idempotent.
+ * Though the idempotency requirement could potentially be relaxed for stuff
+ * like stats where complete accuracy is not needed.
+ */
+static void kvm_pre_asi_exit(void)
+{
+ stun_sibling();
+}
+
+/*
+ * This function must be fully re-entrant and idempotent.
+ * Though the idempotency requirement could potentially be relaxed for stuff
+ * like stats where complete accuracy is not needed.
+ */
+static void kvm_post_asi_enter(void)
+{
+ struct kvm_vcpu *vcpu = raw_cpu_read(*kvm_get_running_vcpus());
+
+ kvm_x86_ops.flush_sensitive_cpu_state(vcpu);
+
+ unstun_sibling();
+}
+
+#endif
+
+static const struct asi_hooks kvm_asi_hooks = {
+#ifdef CONFIG_ADDRESS_SPACE_ISOLATION
+ .pre_asi_exit = kvm_pre_asi_exit,
+ .post_asi_enter = kvm_post_asi_enter
+#endif
+};
+
int kvm_arch_init(void *opaque)
{
struct kvm_x86_init_ops *ops = opaque;
@@ -8674,6 +8721,15 @@ int kvm_arch_init(void *opaque)
if (r)
goto out_free_percpu;

+ if (ops->runtime_ops->flush_sensitive_cpu_state) {
+ r = asi_register_class("KVM", ASI_MAP_STANDARD_NONSENSITIVE,
+ &kvm_asi_hooks);
+ if (r < 0)
+ goto out_mmu_exit;
+
+ kvm_asi_index = r;
+ }
+
kvm_timer_init();

perf_register_guest_info_callbacks(&kvm_guest_cbs);
@@ -8694,6 +8750,8 @@ int kvm_arch_init(void *opaque)

return 0;

+out_mmu_exit:
+ kvm_mmu_module_exit();
out_free_percpu:
free_percpu(user_return_msrs);
out_free_x86_emulator_cache:
@@ -8720,6 +8778,11 @@ void kvm_arch_exit(void)
irq_work_sync(&pvclock_irq_work);
cancel_work_sync(&pvclock_gtod_work);
#endif
+ if (kvm_asi_index > 0) {
+ asi_unregister_class(kvm_asi_index);
+ kvm_asi_index = 0;
+ }
+
kvm_x86_ops.hardware_enable = NULL;
kvm_mmu_module_exit();
free_percpu(user_return_msrs);
@@ -11391,11 +11454,26 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
INIT_DELAYED_WORK(&kvm->arch.kvmclock_sync_work, kvmclock_sync_fn);

kvm_apicv_init(kvm);
+
+ if (kvm_asi_index > 0) {
+ ret = asi_init(kvm->mm, kvm_asi_index, &kvm->asi);
+ if (ret)
+ goto error;
+ }
+
kvm_hv_init_vm(kvm);
kvm_mmu_init_vm(kvm);
kvm_xen_init_vm(kvm);

- return static_call(kvm_x86_vm_init)(kvm);
+ ret = static_call(kvm_x86_vm_init)(kvm);
+ if (ret)
+ goto error;
+
+ return 0;
+error:
+ kvm_page_track_cleanup(kvm);
+ asi_destroy(kvm->asi);
+ return ret;
}

int kvm_arch_post_init_vm(struct kvm *kvm)
@@ -11549,6 +11627,7 @@ void kvm_arch_destroy_vm(struct kvm *kvm)
kvm_page_track_cleanup(kvm);
kvm_xen_destroy_vm(kvm);
kvm_hv_destroy_vm(kvm);
+ asi_destroy(kvm->asi);
}

static void memslot_rmap_free(struct kvm_memory_slot *slot)
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index c310648cc8f1..9dd63ed21f75 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -38,6 +38,7 @@

#include <asm/kvm_host.h>
#include <linux/kvm_dirty_ring.h>
+#include <asm/asi.h>

#ifndef KVM_MAX_VCPU_IDS
#define KVM_MAX_VCPU_IDS KVM_MAX_VCPUS
@@ -551,6 +552,7 @@ struct kvm {
*/
struct mutex slots_arch_lock;
struct mm_struct *mm; /* userspace tied to this vm */
+ struct asi *asi;
struct kvm_memslots __rcu *memslots[KVM_ADDRESS_SPACE_NUM];
struct kvm_vcpu *vcpus[KVM_MAX_VCPUS];

--
2.35.1.473.g83b2b277ed-goog

2022-02-24 00:55:45

by Junaid Shahid

[permalink] [raw]
Subject: [RFC PATCH 01/47] mm: asi: Introduce ASI core API

Introduce core API for Address Space Isolation (ASI).
Kernel address space isolation provides the ability to run some kernel
code with a reduced kernel address space.

There can be multiple classes of such restricted kernel address spaces
(e.g. KPTI, KVM-PTI etc.). Each ASI class is identified by an index.
The ASI class can register some hooks to be called when
entering/exiting the restricted address space.

Currently, a fixed maximum number of ASI classes is supported. In
addition, each process can have at most one restricted address space
from each ASI class. Neither of these is an inherent limitation; both
are merely simplifying assumptions for the time being.

(The Kconfig and the high-level ASI API are derived from the original
ASI RFC by Alexandre Chartre).
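
For illustration (not part of the diff), a minimal usage sketch of the API,
using the signatures as they appear in this series, is shown below; all
"my_*" identifiers are placeholders and the hook bodies are stubs:

#include <asm/asi.h>

static void my_pre_asi_exit(void)
{
	/* e.g. stun the sibling hyperthread */
}

static void my_post_asi_enter(void)
{
	/* e.g. unstun the sibling, flush sensitive CPU state */
}

static const struct asi_hooks my_hooks = {
	.pre_asi_exit = my_pre_asi_exit,
	.post_asi_enter = my_post_asi_enter,
};

static int my_asi_index;
static struct asi *my_asi;

static int my_class_setup(struct mm_struct *mm)
{
	/* Class flags omitted (0) in this illustration. */
	my_asi_index = asi_register_class("my-class", 0, &my_hooks);
	if (my_asi_index < 0)
		return my_asi_index;

	return asi_init(mm, my_asi_index, &my_asi);
}

static void my_restricted_section(void)
{
	asi_enter(my_asi);
	/* ... code intended to run with the reduced address space ... */
	asi_exit();
}

static void my_class_teardown(void)
{
	asi_destroy(my_asi);
	asi_unregister_class(my_asi_index);
}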

Originally-by: Alexandre Chartre <[email protected]>

Signed-off-by: Junaid Shahid <[email protected]>


---
arch/alpha/include/asm/Kbuild | 1 +
arch/arc/include/asm/Kbuild | 1 +
arch/arm/include/asm/Kbuild | 1 +
arch/arm64/include/asm/Kbuild | 1 +
arch/csky/include/asm/Kbuild | 1 +
arch/h8300/include/asm/Kbuild | 1 +
arch/hexagon/include/asm/Kbuild | 1 +
arch/ia64/include/asm/Kbuild | 1 +
arch/m68k/include/asm/Kbuild | 1 +
arch/microblaze/include/asm/Kbuild | 1 +
arch/mips/include/asm/Kbuild | 1 +
arch/nds32/include/asm/Kbuild | 1 +
arch/nios2/include/asm/Kbuild | 1 +
arch/openrisc/include/asm/Kbuild | 1 +
arch/parisc/include/asm/Kbuild | 1 +
arch/powerpc/include/asm/Kbuild | 1 +
arch/riscv/include/asm/Kbuild | 1 +
arch/s390/include/asm/Kbuild | 1 +
arch/sh/include/asm/Kbuild | 1 +
arch/sparc/include/asm/Kbuild | 1 +
arch/um/include/asm/Kbuild | 1 +
arch/x86/include/asm/asi.h | 81 +++++++++++++++
arch/x86/include/asm/tlbflush.h | 2 +
arch/x86/mm/Makefile | 1 +
arch/x86/mm/asi.c | 152 +++++++++++++++++++++++++++++
arch/x86/mm/init.c | 5 +-
arch/x86/mm/tlb.c | 2 +-
arch/xtensa/include/asm/Kbuild | 1 +
include/asm-generic/asi.h | 51 ++++++++++
include/linux/mm_types.h | 3 +
kernel/fork.c | 3 +
security/Kconfig | 10 ++
32 files changed, 329 insertions(+), 3 deletions(-)
create mode 100644 arch/x86/include/asm/asi.h
create mode 100644 arch/x86/mm/asi.c
create mode 100644 include/asm-generic/asi.h

diff --git a/arch/alpha/include/asm/Kbuild b/arch/alpha/include/asm/Kbuild
index 42911c8340c7..e3cd063d9cca 100644
--- a/arch/alpha/include/asm/Kbuild
+++ b/arch/alpha/include/asm/Kbuild
@@ -4,3 +4,4 @@ generated-y += syscall_table.h
generic-y += export.h
generic-y += kvm_para.h
generic-y += mcs_spinlock.h
+generic-y += asi.h
diff --git a/arch/arc/include/asm/Kbuild b/arch/arc/include/asm/Kbuild
index 3c1afa524b9c..60bdeffa7c31 100644
--- a/arch/arc/include/asm/Kbuild
+++ b/arch/arc/include/asm/Kbuild
@@ -4,3 +4,4 @@ generic-y += kvm_para.h
generic-y += mcs_spinlock.h
generic-y += parport.h
generic-y += user.h
+generic-y += asi.h
diff --git a/arch/arm/include/asm/Kbuild b/arch/arm/include/asm/Kbuild
index 03657ff8fbe3..1e2c3d8dbbd9 100644
--- a/arch/arm/include/asm/Kbuild
+++ b/arch/arm/include/asm/Kbuild
@@ -6,3 +6,4 @@ generic-y += parport.h

generated-y += mach-types.h
generated-y += unistd-nr.h
+generic-y += asi.h
diff --git a/arch/arm64/include/asm/Kbuild b/arch/arm64/include/asm/Kbuild
index 64202010b700..086e94f00f94 100644
--- a/arch/arm64/include/asm/Kbuild
+++ b/arch/arm64/include/asm/Kbuild
@@ -4,5 +4,6 @@ generic-y += mcs_spinlock.h
generic-y += qrwlock.h
generic-y += qspinlock.h
generic-y += user.h
+generic-y += asi.h

generated-y += cpucaps.h
diff --git a/arch/csky/include/asm/Kbuild b/arch/csky/include/asm/Kbuild
index 904a18a818be..b4af49fa48c3 100644
--- a/arch/csky/include/asm/Kbuild
+++ b/arch/csky/include/asm/Kbuild
@@ -6,3 +6,4 @@ generic-y += kvm_para.h
generic-y += qrwlock.h
generic-y += user.h
generic-y += vmlinux.lds.h
+generic-y += asi.h
diff --git a/arch/h8300/include/asm/Kbuild b/arch/h8300/include/asm/Kbuild
index e23139c8fc0d..f1e937df4c8e 100644
--- a/arch/h8300/include/asm/Kbuild
+++ b/arch/h8300/include/asm/Kbuild
@@ -6,3 +6,4 @@ generic-y += kvm_para.h
generic-y += mcs_spinlock.h
generic-y += parport.h
generic-y += spinlock.h
+generic-y += asi.h
diff --git a/arch/hexagon/include/asm/Kbuild b/arch/hexagon/include/asm/Kbuild
index 3ece3c93fe08..744ffbeeb7ae 100644
--- a/arch/hexagon/include/asm/Kbuild
+++ b/arch/hexagon/include/asm/Kbuild
@@ -3,3 +3,4 @@ generic-y += extable.h
generic-y += iomap.h
generic-y += kvm_para.h
generic-y += mcs_spinlock.h
+generic-y += asi.h
diff --git a/arch/ia64/include/asm/Kbuild b/arch/ia64/include/asm/Kbuild
index f994c1daf9d4..897a388f3e85 100644
--- a/arch/ia64/include/asm/Kbuild
+++ b/arch/ia64/include/asm/Kbuild
@@ -3,3 +3,4 @@ generated-y += syscall_table.h
generic-y += kvm_para.h
generic-y += mcs_spinlock.h
generic-y += vtime.h
+generic-y += asi.h
diff --git a/arch/m68k/include/asm/Kbuild b/arch/m68k/include/asm/Kbuild
index 0dbf9c5c6fae..faf0f135df4a 100644
--- a/arch/m68k/include/asm/Kbuild
+++ b/arch/m68k/include/asm/Kbuild
@@ -4,3 +4,4 @@ generic-y += extable.h
generic-y += kvm_para.h
generic-y += mcs_spinlock.h
generic-y += spinlock.h
+generic-y += asi.h
diff --git a/arch/microblaze/include/asm/Kbuild b/arch/microblaze/include/asm/Kbuild
index a055f5dbe00a..012e4bf83c13 100644
--- a/arch/microblaze/include/asm/Kbuild
+++ b/arch/microblaze/include/asm/Kbuild
@@ -8,3 +8,4 @@ generic-y += parport.h
generic-y += syscalls.h
generic-y += tlb.h
generic-y += user.h
+generic-y += asi.h
diff --git a/arch/mips/include/asm/Kbuild b/arch/mips/include/asm/Kbuild
index dee172716581..b2c7b62536b4 100644
--- a/arch/mips/include/asm/Kbuild
+++ b/arch/mips/include/asm/Kbuild
@@ -14,3 +14,4 @@ generic-y += parport.h
generic-y += qrwlock.h
generic-y += qspinlock.h
generic-y += user.h
+generic-y += asi.h
diff --git a/arch/nds32/include/asm/Kbuild b/arch/nds32/include/asm/Kbuild
index 82a4453c9c2d..e8c4cf63db79 100644
--- a/arch/nds32/include/asm/Kbuild
+++ b/arch/nds32/include/asm/Kbuild
@@ -6,3 +6,4 @@ generic-y += gpio.h
generic-y += kvm_para.h
generic-y += parport.h
generic-y += user.h
+generic-y += asi.h
diff --git a/arch/nios2/include/asm/Kbuild b/arch/nios2/include/asm/Kbuild
index 7fe7437555fb..bfdc4026c5b1 100644
--- a/arch/nios2/include/asm/Kbuild
+++ b/arch/nios2/include/asm/Kbuild
@@ -5,3 +5,4 @@ generic-y += kvm_para.h
generic-y += mcs_spinlock.h
generic-y += spinlock.h
generic-y += user.h
+generic-y += asi.h
diff --git a/arch/openrisc/include/asm/Kbuild b/arch/openrisc/include/asm/Kbuild
index ca5987e11053..3d365bec74d0 100644
--- a/arch/openrisc/include/asm/Kbuild
+++ b/arch/openrisc/include/asm/Kbuild
@@ -7,3 +7,4 @@ generic-y += qspinlock.h
generic-y += qrwlock_types.h
generic-y += qrwlock.h
generic-y += user.h
+generic-y += asi.h
diff --git a/arch/parisc/include/asm/Kbuild b/arch/parisc/include/asm/Kbuild
index e6e7f74c8ac9..b14e4f727331 100644
--- a/arch/parisc/include/asm/Kbuild
+++ b/arch/parisc/include/asm/Kbuild
@@ -4,3 +4,4 @@ generated-y += syscall_table_64.h
generic-y += kvm_para.h
generic-y += mcs_spinlock.h
generic-y += user.h
+generic-y += asi.h
diff --git a/arch/powerpc/include/asm/Kbuild b/arch/powerpc/include/asm/Kbuild
index bcf95ce0964f..2aff0fa469c4 100644
--- a/arch/powerpc/include/asm/Kbuild
+++ b/arch/powerpc/include/asm/Kbuild
@@ -8,3 +8,4 @@ generic-y += mcs_spinlock.h
generic-y += qrwlock.h
generic-y += vtime.h
generic-y += early_ioremap.h
+generic-y += asi.h
diff --git a/arch/riscv/include/asm/Kbuild b/arch/riscv/include/asm/Kbuild
index 445ccc97305a..3e2022a5a6c5 100644
--- a/arch/riscv/include/asm/Kbuild
+++ b/arch/riscv/include/asm/Kbuild
@@ -5,3 +5,4 @@ generic-y += flat.h
generic-y += kvm_para.h
generic-y += user.h
generic-y += vmlinux.lds.h
+generic-y += asi.h
diff --git a/arch/s390/include/asm/Kbuild b/arch/s390/include/asm/Kbuild
index 1a18d7b82f86..ef80906ed195 100644
--- a/arch/s390/include/asm/Kbuild
+++ b/arch/s390/include/asm/Kbuild
@@ -8,3 +8,4 @@ generic-y += asm-offsets.h
generic-y += export.h
generic-y += kvm_types.h
generic-y += mcs_spinlock.h
+generic-y += asi.h
diff --git a/arch/sh/include/asm/Kbuild b/arch/sh/include/asm/Kbuild
index fc44d9c88b41..ea19e4515828 100644
--- a/arch/sh/include/asm/Kbuild
+++ b/arch/sh/include/asm/Kbuild
@@ -3,3 +3,4 @@ generated-y += syscall_table.h
generic-y += kvm_para.h
generic-y += mcs_spinlock.h
generic-y += parport.h
+generic-y += asi.h
diff --git a/arch/sparc/include/asm/Kbuild b/arch/sparc/include/asm/Kbuild
index 0b9d98ced34a..08730a26aaed 100644
--- a/arch/sparc/include/asm/Kbuild
+++ b/arch/sparc/include/asm/Kbuild
@@ -4,3 +4,4 @@ generated-y += syscall_table_64.h
generic-y += export.h
generic-y += kvm_para.h
generic-y += mcs_spinlock.h
+generic-y += asi.h
diff --git a/arch/um/include/asm/Kbuild b/arch/um/include/asm/Kbuild
index e5a7b552bb38..b62245b2445a 100644
--- a/arch/um/include/asm/Kbuild
+++ b/arch/um/include/asm/Kbuild
@@ -27,3 +27,4 @@ generic-y += word-at-a-time.h
generic-y += kprobes.h
generic-y += mm_hooks.h
generic-y += vga.h
+generic-y += asi.h
diff --git a/arch/x86/include/asm/asi.h b/arch/x86/include/asm/asi.h
new file mode 100644
index 000000000000..f9fc928a555d
--- /dev/null
+++ b/arch/x86/include/asm/asi.h
@@ -0,0 +1,81 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_X86_ASI_H
+#define _ASM_X86_ASI_H
+
+#include <asm-generic/asi.h>
+
+#include <asm/pgtable_types.h>
+#include <asm/percpu.h>
+
+#ifdef CONFIG_ADDRESS_SPACE_ISOLATION
+
+#define ASI_MAX_NUM_ORDER 2
+#define ASI_MAX_NUM (1 << ASI_MAX_NUM_ORDER)
+
+struct asi_state {
+ struct asi *curr_asi;
+ struct asi *target_asi;
+};
+
+struct asi_hooks {
+ /* Both of these functions MUST be idempotent and re-entrant. */
+
+ void (*post_asi_enter)(void);
+ void (*pre_asi_exit)(void);
+};
+
+struct asi_class {
+ struct asi_hooks ops;
+ uint flags;
+ const char *name;
+};
+
+struct asi {
+ pgd_t *pgd;
+ struct asi_class *class;
+ struct mm_struct *mm;
+};
+
+DECLARE_PER_CPU_ALIGNED(struct asi_state, asi_cpu_state);
+
+void asi_init_mm_state(struct mm_struct *mm);
+
+int asi_register_class(const char *name, uint flags,
+ const struct asi_hooks *ops);
+void asi_unregister_class(int index);
+
+int asi_init(struct mm_struct *mm, int asi_index);
+void asi_destroy(struct asi *asi);
+
+void asi_enter(struct asi *asi);
+void asi_exit(void);
+
+static inline void asi_set_target_unrestricted(void)
+{
+ barrier();
+ this_cpu_write(asi_cpu_state.target_asi, NULL);
+}
+
+static inline struct asi *asi_get_current(void)
+{
+ return this_cpu_read(asi_cpu_state.curr_asi);
+}
+
+static inline struct asi *asi_get_target(void)
+{
+ return this_cpu_read(asi_cpu_state.target_asi);
+}
+
+static inline bool is_asi_active(void)
+{
+ return (bool)asi_get_current();
+}
+
+static inline bool asi_is_target_unrestricted(void)
+{
+ return !asi_get_target();
+}
+
+#endif /* CONFIG_ADDRESS_SPACE_ISOLATION */
+
+#endif
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index b587a9ee9cb2..3c43ad46c14a 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -259,6 +259,8 @@ static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch *batch,

extern void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch);

+unsigned long build_cr3(pgd_t *pgd, u16 asid);
+
#endif /* !MODULE */

#endif /* _ASM_X86_TLBFLUSH_H */
diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
index 5864219221ca..09d5e65e47c8 100644
--- a/arch/x86/mm/Makefile
+++ b/arch/x86/mm/Makefile
@@ -51,6 +51,7 @@ obj-$(CONFIG_NUMA_EMU) += numa_emulation.o
obj-$(CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS) += pkeys.o
obj-$(CONFIG_RANDOMIZE_MEMORY) += kaslr.o
obj-$(CONFIG_PAGE_TABLE_ISOLATION) += pti.o
+obj-$(CONFIG_ADDRESS_SPACE_ISOLATION) += asi.o

obj-$(CONFIG_AMD_MEM_ENCRYPT) += mem_encrypt.o
obj-$(CONFIG_AMD_MEM_ENCRYPT) += mem_encrypt_identity.o
diff --git a/arch/x86/mm/asi.c b/arch/x86/mm/asi.c
new file mode 100644
index 000000000000..9928325f3787
--- /dev/null
+++ b/arch/x86/mm/asi.c
@@ -0,0 +1,152 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <asm/asi.h>
+#include <asm/pgalloc.h>
+#include <asm/mmu_context.h>
+
+#undef pr_fmt
+#define pr_fmt(fmt) "ASI: " fmt
+
+static struct asi_class asi_class[ASI_MAX_NUM];
+static DEFINE_SPINLOCK(asi_class_lock);
+
+DEFINE_PER_CPU_ALIGNED(struct asi_state, asi_cpu_state);
+EXPORT_PER_CPU_SYMBOL_GPL(asi_cpu_state);
+
+int asi_register_class(const char *name, uint flags,
+ const struct asi_hooks *ops)
+{
+ int i;
+
+ VM_BUG_ON(name == NULL);
+
+ spin_lock(&asi_class_lock);
+
+ for (i = 1; i < ASI_MAX_NUM; i++) {
+ if (asi_class[i].name == NULL) {
+ asi_class[i].name = name;
+ asi_class[i].flags = flags;
+ if (ops != NULL)
+ asi_class[i].ops = *ops;
+ break;
+ }
+ }
+
+ spin_unlock(&asi_class_lock);
+
+ if (i == ASI_MAX_NUM)
+ i = -ENOSPC;
+
+ return i;
+}
+EXPORT_SYMBOL_GPL(asi_register_class);
+
+void asi_unregister_class(int index)
+{
+ spin_lock(&asi_class_lock);
+
+ WARN_ON(asi_class[index].name == NULL);
+ memset(&asi_class[index], 0, sizeof(struct asi_class));
+
+ spin_unlock(&asi_class_lock);
+}
+EXPORT_SYMBOL_GPL(asi_unregister_class);
+
+int asi_init(struct mm_struct *mm, int asi_index)
+{
+ struct asi *asi = &mm->asi[asi_index];
+
+ /* Index 0 is reserved for special purposes. */
+ WARN_ON(asi_index == 0 || asi_index >= ASI_MAX_NUM);
+ WARN_ON(asi->pgd != NULL);
+
+ /*
+ * For now, we allocate 2 pages to avoid any potential problems with
+ * KPTI code. This won't be needed once KPTI is folded into the ASI
+ * framework.
+ */
+ asi->pgd = (pgd_t *)__get_free_pages(GFP_PGTABLE_USER,
+ PGD_ALLOCATION_ORDER);
+ if (!asi->pgd)
+ return -ENOMEM;
+
+ asi->class = &asi_class[asi_index];
+ asi->mm = mm;
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(asi_init);
+
+void asi_destroy(struct asi *asi)
+{
+ free_pages((ulong)asi->pgd, PGD_ALLOCATION_ORDER);
+ memset(asi, 0, sizeof(struct asi));
+}
+EXPORT_SYMBOL_GPL(asi_destroy);
+
+static void __asi_enter(void)
+{
+ u64 asi_cr3;
+ struct asi *target = this_cpu_read(asi_cpu_state.target_asi);
+
+ VM_BUG_ON(preemptible());
+
+ if (!target || target == this_cpu_read(asi_cpu_state.curr_asi))
+ return;
+
+ VM_BUG_ON(this_cpu_read(cpu_tlbstate.loaded_mm) ==
+ LOADED_MM_SWITCHING);
+
+ this_cpu_write(asi_cpu_state.curr_asi, target);
+
+ asi_cr3 = build_cr3(target->pgd,
+ this_cpu_read(cpu_tlbstate.loaded_mm_asid));
+ write_cr3(asi_cr3);
+
+ if (target->class->ops.post_asi_enter)
+ target->class->ops.post_asi_enter();
+}
+
+void asi_enter(struct asi *asi)
+{
+ VM_WARN_ON_ONCE(!asi);
+
+ this_cpu_write(asi_cpu_state.target_asi, asi);
+ barrier();
+
+ __asi_enter();
+}
+EXPORT_SYMBOL_GPL(asi_enter);
+
+void asi_exit(void)
+{
+ u64 unrestricted_cr3;
+ struct asi *asi;
+
+ preempt_disable();
+
+ VM_BUG_ON(this_cpu_read(cpu_tlbstate.loaded_mm) ==
+ LOADED_MM_SWITCHING);
+
+ asi = this_cpu_read(asi_cpu_state.curr_asi);
+
+ if (asi) {
+ if (asi->class->ops.pre_asi_exit)
+ asi->class->ops.pre_asi_exit();
+
+ unrestricted_cr3 =
+ build_cr3(this_cpu_read(cpu_tlbstate.loaded_mm)->pgd,
+ this_cpu_read(cpu_tlbstate.loaded_mm_asid));
+
+ write_cr3(unrestricted_cr3);
+ this_cpu_write(asi_cpu_state.curr_asi, NULL);
+ }
+
+ preempt_enable();
+}
+EXPORT_SYMBOL_GPL(asi_exit);
+
+void asi_init_mm_state(struct mm_struct *mm)
+{
+ memset(mm->asi, 0, sizeof(mm->asi));
+}
diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index 1895986842b9..000cbe5315f5 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -238,8 +238,9 @@ static void __init probe_page_size_mask(void)

/* By the default is everything supported: */
__default_kernel_pte_mask = __supported_pte_mask;
- /* Except when with PTI where the kernel is mostly non-Global: */
- if (cpu_feature_enabled(X86_FEATURE_PTI))
+ /* Except when with PTI or ASI where the kernel is mostly non-Global: */
+ if (cpu_feature_enabled(X86_FEATURE_PTI) ||
+ IS_ENABLED(CONFIG_ADDRESS_SPACE_ISOLATION))
__default_kernel_pte_mask &= ~_PAGE_GLOBAL;

/* Enable 1 GB linear kernel mappings if available: */
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 59ba2968af1b..88d9298720dc 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -153,7 +153,7 @@ static inline u16 user_pcid(u16 asid)
return ret;
}

-static inline unsigned long build_cr3(pgd_t *pgd, u16 asid)
+inline unsigned long build_cr3(pgd_t *pgd, u16 asid)
{
if (static_cpu_has(X86_FEATURE_PCID)) {
return __sme_pa(pgd) | kern_pcid(asid);
diff --git a/arch/xtensa/include/asm/Kbuild b/arch/xtensa/include/asm/Kbuild
index 854c5e07e867..49fcdf9d83f5 100644
--- a/arch/xtensa/include/asm/Kbuild
+++ b/arch/xtensa/include/asm/Kbuild
@@ -7,3 +7,4 @@ generic-y += param.h
generic-y += qrwlock.h
generic-y += qspinlock.h
generic-y += user.h
+generic-y += asi.h
diff --git a/include/asm-generic/asi.h b/include/asm-generic/asi.h
new file mode 100644
index 000000000000..e5ba51d30b90
--- /dev/null
+++ b/include/asm-generic/asi.h
@@ -0,0 +1,51 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __ASM_GENERIC_ASI_H
+#define __ASM_GENERIC_ASI_H
+
+/* ASI class flags */
+#define ASI_MAP_STANDARD_NONSENSITIVE 1
+
+#ifndef CONFIG_ADDRESS_SPACE_ISOLATION
+
+#define ASI_MAX_NUM_ORDER 0
+#define ASI_MAX_NUM 0
+
+#ifndef _ASSEMBLY_
+
+struct asi_hooks {};
+struct asi {};
+
+static inline
+int asi_register_class(const char *name, uint flags,
+ const struct asi_hooks *ops)
+{
+ return 0;
+}
+
+static inline void asi_unregister_class(int asi_index) { }
+
+static inline void asi_init_mm_state(struct mm_struct *mm) { }
+
+static inline int asi_init(struct mm_struct *mm, int asi_index) { return 0; }
+
+static inline void asi_destroy(struct asi *asi) { }
+
+static inline void asi_enter(struct asi *asi) { }
+
+static inline void asi_set_target_unrestricted(void) { }
+
+static inline bool asi_is_target_unrestricted(void) { return true; }
+
+static inline void asi_exit(void) { }
+
+static inline bool is_asi_active(void) { return false; }
+
+static inline struct asi *asi_get_target(void) { return NULL; }
+
+static inline struct asi *asi_get_current(void) { return NULL; }
+
+#endif /* !_ASSEMBLY_ */
+
+#endif /* !CONFIG_ADDRESS_SPACE_ISOLATION */
+
+#endif
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index c3a6e6209600..3de1afa57289 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -18,6 +18,7 @@
#include <linux/seqlock.h>

#include <asm/mmu.h>
+#include <asm/asi.h>

#ifndef AT_VECTOR_SIZE_ARCH
#define AT_VECTOR_SIZE_ARCH 0
@@ -495,6 +496,8 @@ struct mm_struct {
atomic_t membarrier_state;
#endif

+ struct asi asi[ASI_MAX_NUM];
+
/**
* @mm_users: The number of users including userspace.
*
diff --git a/kernel/fork.c b/kernel/fork.c
index 3244cc56b697..3695a32ee9bd 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -102,6 +102,7 @@
#include <asm/mmu_context.h>
#include <asm/cacheflush.h>
#include <asm/tlbflush.h>
+#include <asm/asi.h>

#include <trace/events/sched.h>

@@ -1071,6 +1072,8 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
mm->def_flags = 0;
}

+ asi_init_mm_state(mm);
+
if (mm_alloc_pgd(mm))
goto fail_nopgd;

diff --git a/security/Kconfig b/security/Kconfig
index 0b847f435beb..21b15ecaf2c1 100644
--- a/security/Kconfig
+++ b/security/Kconfig
@@ -65,6 +65,16 @@ config PAGE_TABLE_ISOLATION

See Documentation/x86/pti.rst for more details.

+config ADDRESS_SPACE_ISOLATION
+ bool "Allow code to run with a reduced kernel address space"
+ default n
+ depends on X86_64 && !UML
+ depends on !PARAVIRT
+ help
+ This feature provides the ability to run some kernel code
+ with a reduced kernel address space. This can be used to
+ mitigate some speculative execution attacks.
+
config SECURITY_INFINIBAND
bool "Infiniband Security Hooks"
depends on SECURITY && INFINIBAND
--
2.35.1.473.g83b2b277ed-goog
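
For reference, here is a minimal sketch of how a user of the interface
introduced above might register an ASI class and enter/exit the restricted
address space. This is illustrative only and not part of the series; the
example_* names are made up, and it is the KVM integration elsewhere in the
series that actually exercises this API.

#include <asm/asi.h>
#include <linux/mm_types.h>
#include <linux/preempt.h>

static void example_post_asi_enter(void)
{
	/* E.g. note that expensive flushes can now be skipped. */
}

static void example_pre_asi_exit(void)
{
	/* E.g. flush L1D / uarch buffers before unrestricting. */
}

static const struct asi_hooks example_hooks = {
	.post_asi_enter	= example_post_asi_enter,
	.pre_asi_exit	= example_pre_asi_exit,
};

static int example_asi_index;

static int example_setup(struct mm_struct *mm)
{
	int err;

	example_asi_index = asi_register_class("example",
					       ASI_MAP_STANDARD_NONSENSITIVE,
					       &example_hooks);
	if (example_asi_index < 0)
		return example_asi_index;

	err = asi_init(mm, example_asi_index);
	if (err)
		asi_unregister_class(example_asi_index);
	return err;
}

static void example_restricted_section(struct mm_struct *mm)
{
	/* __asi_enter() requires that preemption is disabled. */
	preempt_disable();
	asi_enter(&mm->asi[example_asi_index]);

	/* Code here runs on the restricted page tables. */

	asi_exit();
	preempt_enable();
}

asi_enter() records the target ASI and switches CR3 to the restricted page
tables; asi_exit() invokes the class's pre_asi_exit hook, switches back to
the full kernel page tables and clears the current-ASI pointer.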

2022-02-24 00:58:45

by Junaid Shahid

[permalink] [raw]
Subject: [RFC PATCH 41/47] mm: asi: Annotation of static variables to be nonsensitive

From: Ofir Weisse <[email protected]>

The heart of ASI is to differentiate between sensitive and non-sensitive
data accesses. This commit marks certain static variables as
non-sensitive.

Some static variables are accessed frequently and would therefore cause
many ASI exits. The frequency of these accesses was monitored by tracing
asi_exits and analyzing the accessed addresses. Many of these variables
don't contain sensitive information and can therefore be mapped into the
global ASI region. This commit applies the __asi_not_sensitive*
attributes to these frequently accessed yet non-sensitive variables.
The end result is a very significant reduction in ASI exits on real
benchmarks.
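
The __asi_not_sensitive and __asi_not_sensitive_readmostly attributes
themselves are introduced by an earlier patch in the series and are not
shown here. As a rough illustration only (the section names below are
assumptions for this sketch, not the series' actual definitions), they can
be thought of as placing the annotated object into a data section that ASI
maps into the restricted address space:

#include <linux/cache.h>		/* __read_mostly */
#include <linux/compiler_attributes.h>	/* __section() */

/*
 * Illustrative sketch only; the real definitions come from an earlier
 * patch in this series and the section names here are assumptions.
 */
#ifdef CONFIG_ADDRESS_SPACE_ISOLATION
#define __asi_not_sensitive \
	__section(".data..asi_non_sensitive")
#define __asi_not_sensitive_readmostly \
	__section(".data..asi_non_sensitive_readmostly")
#else
/* Without ASI, presumably fall back to the ordinary attributes. */
#define __asi_not_sensitive
#define __asi_not_sensitive_readmostly	__read_mostly
#endif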

Signed-off-by: Ofir Weisse <[email protected]>


---
arch/x86/events/core.c | 4 ++--
arch/x86/events/intel/core.c | 2 +-
arch/x86/events/msr.c | 2 +-
arch/x86/events/perf_event.h | 2 +-
arch/x86/include/asm/kvm_host.h | 4 ++--
arch/x86/kernel/alternative.c | 2 +-
arch/x86/kernel/cpu/bugs.c | 2 +-
arch/x86/kernel/setup.c | 4 ++--
arch/x86/kernel/smp.c | 2 +-
arch/x86/kernel/tsc.c | 8 +++----
arch/x86/kvm/lapic.c | 2 +-
arch/x86/kvm/mmu/spte.c | 2 +-
arch/x86/kvm/mmu/spte.h | 2 +-
arch/x86/kvm/mtrr.c | 2 +-
arch/x86/kvm/vmx/capabilities.h | 14 ++++++------
arch/x86/kvm/vmx/vmx.c | 37 ++++++++++++++++---------------
arch/x86/kvm/x86.c | 35 +++++++++++++++--------------
arch/x86/mm/asi.c | 4 ++--
include/linux/debug_locks.h | 4 ++--
include/linux/jiffies.h | 4 ++--
include/linux/notifier.h | 2 +-
include/linux/profile.h | 2 +-
include/linux/rcupdate.h | 4 +++-
include/linux/rcutree.h | 2 +-
include/linux/sched/sysctl.h | 1 +
init/main.c | 2 +-
kernel/cgroup/cgroup.c | 5 +++--
kernel/cpu.c | 14 ++++++------
kernel/events/core.c | 4 ++--
kernel/freezer.c | 2 +-
kernel/locking/lockdep.c | 14 ++++++------
kernel/panic.c | 2 +-
kernel/printk/printk.c | 4 ++--
kernel/profile.c | 4 ++--
kernel/rcu/tree.c | 10 ++++-----
kernel/rcu/update.c | 4 ++--
kernel/sched/clock.c | 2 +-
kernel/sched/core.c | 6 ++---
kernel/sched/cpuacct.c | 2 +-
kernel/sched/cputime.c | 2 +-
kernel/sched/fair.c | 4 ++--
kernel/sched/loadavg.c | 2 +-
kernel/sched/rt.c | 2 +-
kernel/sched/sched.h | 4 ++--
kernel/smp.c | 2 +-
kernel/softirq.c | 3 ++-
kernel/time/hrtimer.c | 2 +-
kernel/time/jiffies.c | 8 ++++++-
kernel/time/ntp.c | 30 ++++++++++++-------------
kernel/time/tick-common.c | 4 ++--
kernel/time/tick-internal.h | 2 +-
kernel/time/tick-sched.c | 2 +-
kernel/time/timekeeping.c | 10 ++++-----
kernel/time/timekeeping.h | 2 +-
kernel/time/timer.c | 2 +-
kernel/trace/trace.c | 2 +-
kernel/trace/trace_sched_switch.c | 4 ++--
lib/debug_locks.c | 5 +++--
mm/memory.c | 2 +-
mm/page_alloc.c | 2 +-
mm/sparse.c | 4 ++--
virt/kvm/kvm_main.c | 2 +-
62 files changed, 170 insertions(+), 156 deletions(-)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 38b2c779146f..db825bf053fd 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -44,7 +44,7 @@

#include "perf_event.h"

-struct x86_pmu x86_pmu __read_mostly;
+struct x86_pmu x86_pmu __asi_not_sensitive_readmostly;
static struct pmu pmu;

DEFINE_PER_CPU(struct cpu_hw_events, cpu_hw_events) = {
@@ -2685,7 +2685,7 @@ static int x86_pmu_filter_match(struct perf_event *event)
return 1;
}

-static struct pmu pmu = {
+static struct pmu pmu __asi_not_sensitive = {
.pmu_enable = x86_pmu_enable,
.pmu_disable = x86_pmu_disable,

diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index ec6444f2c9dc..5b2b7473b2f2 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -189,7 +189,7 @@ static struct event_constraint intel_slm_event_constraints[] __read_mostly =
EVENT_CONSTRAINT_END
};

-static struct event_constraint intel_skl_event_constraints[] = {
+static struct event_constraint intel_skl_event_constraints[] __asi_not_sensitive = {
FIXED_EVENT_CONSTRAINT(0x00c0, 0), /* INST_RETIRED.ANY */
FIXED_EVENT_CONSTRAINT(0x003c, 1), /* CPU_CLK_UNHALTED.CORE */
FIXED_EVENT_CONSTRAINT(0x0300, 2), /* CPU_CLK_UNHALTED.REF */
diff --git a/arch/x86/events/msr.c b/arch/x86/events/msr.c
index 96c775abe31f..db7bca37c726 100644
--- a/arch/x86/events/msr.c
+++ b/arch/x86/events/msr.c
@@ -280,7 +280,7 @@ static int msr_event_add(struct perf_event *event, int flags)
return 0;
}

-static struct pmu pmu_msr = {
+static struct pmu pmu_msr __asi_not_sensitive = {
.task_ctx_nr = perf_sw_context,
.attr_groups = attr_groups,
.event_init = msr_event_init,
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 5480db242083..27cca7fd6f17 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -1020,7 +1020,7 @@ static struct perf_pmu_format_hybrid_attr format_attr_hybrid_##_name = {\
}

struct pmu *x86_get_pmu(unsigned int cpu);
-extern struct x86_pmu x86_pmu __read_mostly;
+extern struct x86_pmu x86_pmu __asi_not_sensitive_readmostly;

static __always_inline struct x86_perf_task_context_opt *task_context_opt(void *ctx)
{
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 8ba88bbcf895..b7292c4fece7 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1542,8 +1542,8 @@ struct kvm_arch_async_pf {

extern u32 __read_mostly kvm_nr_uret_msrs;
extern u64 __read_mostly host_efer;
-extern bool __read_mostly allow_smaller_maxphyaddr;
-extern bool __read_mostly enable_apicv;
+extern bool __asi_not_sensitive_readmostly allow_smaller_maxphyaddr;
+extern bool __asi_not_sensitive_readmostly enable_apicv;
extern struct kvm_x86_ops kvm_x86_ops;

#define KVM_X86_OP(func) \
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 23fb4d51a5da..9836ebe953ed 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -31,7 +31,7 @@
#include <asm/paravirt.h>
#include <asm/asm-prototypes.h>

-int __read_mostly alternatives_patched;
+int __asi_not_sensitive alternatives_patched;

EXPORT_SYMBOL_GPL(alternatives_patched);

diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
index 1c1f218a701d..6b5e6574e391 100644
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -46,7 +46,7 @@ static void __init srbds_select_mitigation(void);
static void __init l1d_flush_select_mitigation(void);

/* The base value of the SPEC_CTRL MSR that always has to be preserved. */
-u64 x86_spec_ctrl_base;
+u64 x86_spec_ctrl_base __asi_not_sensitive;
EXPORT_SYMBOL_GPL(x86_spec_ctrl_base);
static DEFINE_MUTEX(spec_ctrl_mutex);

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index e04f5e6eb33f..d8461ac88b36 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -116,7 +116,7 @@ static struct resource bss_resource = {
struct cpuinfo_x86 new_cpu_data;

/* Common CPU data for all CPUs */
-struct cpuinfo_x86 boot_cpu_data __read_mostly;
+struct cpuinfo_x86 boot_cpu_data __asi_not_sensitive_readmostly;
EXPORT_SYMBOL(boot_cpu_data);

unsigned int def_to_bigsmp;
@@ -133,7 +133,7 @@ struct ist_info ist_info;
#endif

#else
-struct cpuinfo_x86 boot_cpu_data __read_mostly;
+struct cpuinfo_x86 boot_cpu_data __asi_not_sensitive_readmostly;
EXPORT_SYMBOL(boot_cpu_data);
#endif

diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
index 06db901fabe8..e9e10ffc2ec2 100644
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -257,7 +257,7 @@ static int __init nonmi_ipi_setup(char *str)

__setup("nonmi_ipi", nonmi_ipi_setup);

-struct smp_ops smp_ops = {
+struct smp_ops smp_ops __asi_not_sensitive = {
.smp_prepare_boot_cpu = native_smp_prepare_boot_cpu,
.smp_prepare_cpus = native_smp_prepare_cpus,
.smp_cpus_done = native_smp_cpus_done,
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index a698196377be..d7169da99b01 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -30,10 +30,10 @@
#include <asm/i8259.h>
#include <asm/uv/uv.h>

-unsigned int __read_mostly cpu_khz; /* TSC clocks / usec, not used here */
+unsigned int __asi_not_sensitive_readmostly cpu_khz; /* TSC clocks / usec, not used here */
EXPORT_SYMBOL(cpu_khz);

-unsigned int __read_mostly tsc_khz;
+unsigned int __asi_not_sensitive_readmostly tsc_khz;
EXPORT_SYMBOL(tsc_khz);

#define KHZ 1000
@@ -41,7 +41,7 @@ EXPORT_SYMBOL(tsc_khz);
/*
* TSC can be unstable due to cpufreq or due to unsynced TSCs
*/
-static int __read_mostly tsc_unstable;
+static int __asi_not_sensitive_readmostly tsc_unstable;
static unsigned int __initdata tsc_early_khz;

static DEFINE_STATIC_KEY_FALSE(__use_tsc);
@@ -1146,7 +1146,7 @@ static struct clocksource clocksource_tsc_early = {
* this one will immediately take over. We will only register if TSC has
* been found good.
*/
-static struct clocksource clocksource_tsc = {
+static struct clocksource clocksource_tsc __asi_not_sensitive = {
.name = "tsc",
.rating = 300,
.read = read_tsc,
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index f206fc35deff..213bbdfab49e 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -60,7 +60,7 @@
#define MAX_APIC_VECTOR 256
#define APIC_VECTORS_PER_REG 32

-static bool lapic_timer_advance_dynamic __read_mostly;
+static bool lapic_timer_advance_dynamic __asi_not_sensitive_readmostly;
#define LAPIC_TIMER_ADVANCE_ADJUST_MIN 100 /* clock cycles */
#define LAPIC_TIMER_ADVANCE_ADJUST_MAX 10000 /* clock cycles */
#define LAPIC_TIMER_ADVANCE_NS_INIT 1000
diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index 0c76c45fdb68..13038fae5088 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -33,7 +33,7 @@ u64 __read_mostly shadow_mmio_mask;
u64 __read_mostly shadow_mmio_access_mask;
u64 __read_mostly shadow_present_mask;
u64 __read_mostly shadow_me_mask;
-u64 __read_mostly shadow_acc_track_mask;
+u64 __asi_not_sensitive_readmostly shadow_acc_track_mask;

u64 __read_mostly shadow_nonpresent_or_rsvd_mask;
u64 __read_mostly shadow_nonpresent_or_rsvd_lower_gfn_mask;
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index cc432f9a966b..d1af03f63009 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -151,7 +151,7 @@ extern u64 __read_mostly shadow_me_mask;
* shadow_acc_track_mask is the set of bits to be cleared in non-accessed
* pages.
*/
-extern u64 __read_mostly shadow_acc_track_mask;
+extern u64 __asi_not_sensitive_readmostly shadow_acc_track_mask;

/*
* This mask must be set on all non-zero Non-Present or Reserved SPTEs in order
diff --git a/arch/x86/kvm/mtrr.c b/arch/x86/kvm/mtrr.c
index a8502e02f479..66228abfa9fa 100644
--- a/arch/x86/kvm/mtrr.c
+++ b/arch/x86/kvm/mtrr.c
@@ -138,7 +138,7 @@ struct fixed_mtrr_segment {
int range_start;
};

-static struct fixed_mtrr_segment fixed_seg_table[] = {
+static struct fixed_mtrr_segment fixed_seg_table[] __asi_not_sensitive = {
/* MSR_MTRRfix64K_00000, 1 unit. 64K fixed mtrr. */
{
.start = 0x0,
diff --git a/arch/x86/kvm/vmx/capabilities.h b/arch/x86/kvm/vmx/capabilities.h
index 4705ad55abb5..0ab03ec7d6d0 100644
--- a/arch/x86/kvm/vmx/capabilities.h
+++ b/arch/x86/kvm/vmx/capabilities.h
@@ -6,13 +6,13 @@

#include "lapic.h"

-extern bool __read_mostly enable_vpid;
-extern bool __read_mostly flexpriority_enabled;
-extern bool __read_mostly enable_ept;
-extern bool __read_mostly enable_unrestricted_guest;
-extern bool __read_mostly enable_ept_ad_bits;
-extern bool __read_mostly enable_pml;
-extern int __read_mostly pt_mode;
+extern bool __asi_not_sensitive_readmostly enable_vpid;
+extern bool __asi_not_sensitive_readmostly flexpriority_enabled;
+extern bool __asi_not_sensitive_readmostly enable_ept;
+extern bool __asi_not_sensitive_readmostly enable_unrestricted_guest;
+extern bool __asi_not_sensitive_readmostly enable_ept_ad_bits;
+extern bool __asi_not_sensitive_readmostly enable_pml;
+extern int __asi_not_sensitive_readmostly pt_mode;

#define PT_MODE_SYSTEM 0
#define PT_MODE_HOST_GUEST 1
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 6549fef39f2b..e1ad82c25a78 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -78,29 +78,29 @@ static const struct x86_cpu_id vmx_cpu_id[] = {
MODULE_DEVICE_TABLE(x86cpu, vmx_cpu_id);
#endif

-bool __read_mostly enable_vpid = 1;
+bool __asi_not_sensitive_readmostly enable_vpid = 1;
module_param_named(vpid, enable_vpid, bool, 0444);

-static bool __read_mostly enable_vnmi = 1;
+static bool __asi_not_sensitive_readmostly enable_vnmi = 1;
module_param_named(vnmi, enable_vnmi, bool, S_IRUGO);

-bool __read_mostly flexpriority_enabled = 1;
+bool __asi_not_sensitive_readmostly flexpriority_enabled = 1;
module_param_named(flexpriority, flexpriority_enabled, bool, S_IRUGO);

-bool __read_mostly enable_ept = 1;
+bool __asi_not_sensitive_readmostly enable_ept = 1;
module_param_named(ept, enable_ept, bool, S_IRUGO);

-bool __read_mostly enable_unrestricted_guest = 1;
+bool __asi_not_sensitive_readmostly enable_unrestricted_guest = 1;
module_param_named(unrestricted_guest,
enable_unrestricted_guest, bool, S_IRUGO);

-bool __read_mostly enable_ept_ad_bits = 1;
+bool __asi_not_sensitive_readmostly enable_ept_ad_bits = 1;
module_param_named(eptad, enable_ept_ad_bits, bool, S_IRUGO);

-static bool __read_mostly emulate_invalid_guest_state = true;
+static bool __asi_not_sensitive_readmostly emulate_invalid_guest_state = true;
module_param(emulate_invalid_guest_state, bool, S_IRUGO);

-static bool __read_mostly fasteoi = 1;
+static bool __asi_not_sensitive_readmostly fasteoi = 1;
module_param(fasteoi, bool, S_IRUGO);

module_param(enable_apicv, bool, S_IRUGO);
@@ -110,13 +110,13 @@ module_param(enable_apicv, bool, S_IRUGO);
* VMX and be a hypervisor for its own guests. If nested=0, guests may not
* use VMX instructions.
*/
-static bool __read_mostly nested = 1;
+static bool __asi_not_sensitive_readmostly nested = 1;
module_param(nested, bool, S_IRUGO);

-bool __read_mostly enable_pml = 1;
+bool __asi_not_sensitive_readmostly enable_pml = 1;
module_param_named(pml, enable_pml, bool, S_IRUGO);

-static bool __read_mostly dump_invalid_vmcs = 0;
+static bool __asi_not_sensitive_readmostly dump_invalid_vmcs = 0;
module_param(dump_invalid_vmcs, bool, 0644);

#define MSR_BITMAP_MODE_X2APIC 1
@@ -125,13 +125,13 @@ module_param(dump_invalid_vmcs, bool, 0644);
#define KVM_VMX_TSC_MULTIPLIER_MAX 0xffffffffffffffffULL

/* Guest_tsc -> host_tsc conversion requires 64-bit division. */
-static int __read_mostly cpu_preemption_timer_multi;
-static bool __read_mostly enable_preemption_timer = 1;
+static int __asi_not_sensitive_readmostly cpu_preemption_timer_multi;
+static bool __asi_not_sensitive_readmostly enable_preemption_timer = 1;
#ifdef CONFIG_X86_64
module_param_named(preemption_timer, enable_preemption_timer, bool, S_IRUGO);
#endif

-extern bool __read_mostly allow_smaller_maxphyaddr;
+extern bool __asi_not_sensitive_readmostly allow_smaller_maxphyaddr;
module_param(allow_smaller_maxphyaddr, bool, S_IRUGO);

#define KVM_VM_CR0_ALWAYS_OFF (X86_CR0_NW | X86_CR0_CD)
@@ -202,7 +202,7 @@ static unsigned int ple_window_max = KVM_VMX_DEFAULT_PLE_WINDOW_MAX;
module_param(ple_window_max, uint, 0444);

/* Default is SYSTEM mode, 1 for host-guest mode */
-int __read_mostly pt_mode = PT_MODE_SYSTEM;
+int __asi_not_sensitive_readmostly pt_mode = PT_MODE_SYSTEM;
module_param(pt_mode, int, S_IRUGO);

static DEFINE_STATIC_KEY_FALSE(vmx_l1d_should_flush);
@@ -421,7 +421,7 @@ static DEFINE_PER_CPU(struct list_head, loaded_vmcss_on_cpu);
static DECLARE_BITMAP(vmx_vpid_bitmap, VMX_NR_VPIDS);
static DEFINE_SPINLOCK(vmx_vpid_lock);

-struct vmcs_config vmcs_config;
+struct vmcs_config vmcs_config __asi_not_sensitive;
struct vmx_capability vmx_capability;

#define VMX_SEGMENT_FIELD(seg) \
@@ -453,7 +453,7 @@ static inline void vmx_segment_cache_clear(struct vcpu_vmx *vmx)
vmx->segment_cache.bitmask = 0;
}

-static unsigned long host_idt_base;
+static unsigned long host_idt_base __asi_not_sensitive;

#if IS_ENABLED(CONFIG_HYPERV)
static bool __read_mostly enlightened_vmcs = true;
@@ -5549,7 +5549,8 @@ static int handle_bus_lock_vmexit(struct kvm_vcpu *vcpu)
* may resume. Otherwise they set the kvm_run parameter to indicate what needs
* to be done to userspace and return 0.
*/
-static int (*kvm_vmx_exit_handlers[])(struct kvm_vcpu *vcpu) = {
+static int (*kvm_vmx_exit_handlers[])(struct kvm_vcpu *vcpu) __asi_not_sensitive
+= {
[EXIT_REASON_EXCEPTION_NMI] = handle_exception_nmi,
[EXIT_REASON_EXTERNAL_INTERRUPT] = handle_external_interrupt,
[EXIT_REASON_TRIPLE_FAULT] = handle_triple_fault,
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index d0df14deae80..0df88eadab60 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -123,7 +123,7 @@ static int sync_regs(struct kvm_vcpu *vcpu);
static int __set_sregs2(struct kvm_vcpu *vcpu, struct kvm_sregs2 *sregs2);
static void __get_sregs2(struct kvm_vcpu *vcpu, struct kvm_sregs2 *sregs2);

-struct kvm_x86_ops kvm_x86_ops __read_mostly;
+struct kvm_x86_ops kvm_x86_ops __asi_not_sensitive_readmostly;
EXPORT_SYMBOL_GPL(kvm_x86_ops);

#define KVM_X86_OP(func) \
@@ -148,17 +148,17 @@ module_param(min_timer_period_us, uint, S_IRUGO | S_IWUSR);
static bool __read_mostly kvmclock_periodic_sync = true;
module_param(kvmclock_periodic_sync, bool, S_IRUGO);

-bool __read_mostly kvm_has_tsc_control;
+bool __asi_not_sensitive_readmostly kvm_has_tsc_control;
EXPORT_SYMBOL_GPL(kvm_has_tsc_control);
-u32 __read_mostly kvm_max_guest_tsc_khz;
+u32 __asi_not_sensitive_readmostly kvm_max_guest_tsc_khz;
EXPORT_SYMBOL_GPL(kvm_max_guest_tsc_khz);
-u8 __read_mostly kvm_tsc_scaling_ratio_frac_bits;
+u8 __asi_not_sensitive_readmostly kvm_tsc_scaling_ratio_frac_bits;
EXPORT_SYMBOL_GPL(kvm_tsc_scaling_ratio_frac_bits);
-u64 __read_mostly kvm_max_tsc_scaling_ratio;
+u64 __asi_not_sensitive_readmostly kvm_max_tsc_scaling_ratio;
EXPORT_SYMBOL_GPL(kvm_max_tsc_scaling_ratio);
-u64 __read_mostly kvm_default_tsc_scaling_ratio;
+u64 __asi_not_sensitive_readmostly kvm_default_tsc_scaling_ratio;
EXPORT_SYMBOL_GPL(kvm_default_tsc_scaling_ratio);
-bool __read_mostly kvm_has_bus_lock_exit;
+bool __asi_not_sensitive_readmostly kvm_has_bus_lock_exit;
EXPORT_SYMBOL_GPL(kvm_has_bus_lock_exit);

/* tsc tolerance in parts per million - default to 1/2 of the NTP threshold */
@@ -171,20 +171,20 @@ module_param(tsc_tolerance_ppm, uint, S_IRUGO | S_IWUSR);
* advancement entirely. Any other value is used as-is and disables adaptive
* tuning, i.e. allows privileged userspace to set an exact advancement time.
*/
-static int __read_mostly lapic_timer_advance_ns = -1;
+static int __asi_not_sensitive_readmostly lapic_timer_advance_ns = -1;
module_param(lapic_timer_advance_ns, int, S_IRUGO | S_IWUSR);

-static bool __read_mostly vector_hashing = true;
+static bool __asi_not_sensitive_readmostly vector_hashing = true;
module_param(vector_hashing, bool, S_IRUGO);

-bool __read_mostly enable_vmware_backdoor = false;
+bool __asi_not_sensitive_readmostly enable_vmware_backdoor = false;
module_param(enable_vmware_backdoor, bool, S_IRUGO);
EXPORT_SYMBOL_GPL(enable_vmware_backdoor);

-static bool __read_mostly force_emulation_prefix = false;
+static bool __asi_not_sensitive_readmostly force_emulation_prefix = false;
module_param(force_emulation_prefix, bool, S_IRUGO);

-int __read_mostly pi_inject_timer = -1;
+int __asi_not_sensitive_readmostly pi_inject_timer = -1;
module_param(pi_inject_timer, bint, S_IRUGO | S_IWUSR);

/*
@@ -216,13 +216,14 @@ static struct kvm_user_return_msrs __percpu *user_return_msrs;
u64 __read_mostly host_efer;
EXPORT_SYMBOL_GPL(host_efer);

-bool __read_mostly allow_smaller_maxphyaddr = 0;
+bool __asi_not_sensitive_readmostly allow_smaller_maxphyaddr = 0;
EXPORT_SYMBOL_GPL(allow_smaller_maxphyaddr);

-bool __read_mostly enable_apicv = true;
+bool __asi_not_sensitive_readmostly enable_apicv = true;
EXPORT_SYMBOL_GPL(enable_apicv);

-u64 __read_mostly host_xss;
+/* TODO(oweisse): how dangerous is this variable, from a security standpoint? */
+u64 __asi_not_sensitive_readmostly host_xss;
EXPORT_SYMBOL_GPL(host_xss);
u64 __read_mostly supported_xss;
EXPORT_SYMBOL_GPL(supported_xss);
@@ -292,7 +293,7 @@ const struct kvm_stats_header kvm_vcpu_stats_header = {
sizeof(kvm_vcpu_stats_desc),
};

-u64 __read_mostly host_xcr0;
+u64 __asi_not_sensitive_readmostly host_xcr0;
u64 __read_mostly supported_xcr0;
EXPORT_SYMBOL_GPL(supported_xcr0);

@@ -2077,7 +2078,7 @@ struct pvclock_gtod_data {
u64 wall_time_sec;
};

-static struct pvclock_gtod_data pvclock_gtod_data;
+static struct pvclock_gtod_data pvclock_gtod_data __asi_not_sensitive;

static void update_pvclock_gtod(struct timekeeper *tk)
{
diff --git a/arch/x86/mm/asi.c b/arch/x86/mm/asi.c
index ba373b461855..fdc117929fc7 100644
--- a/arch/x86/mm/asi.c
+++ b/arch/x86/mm/asi.c
@@ -17,8 +17,8 @@
#undef pr_fmt
#define pr_fmt(fmt) "ASI: " fmt

-static struct asi_class asi_class[ASI_MAX_NUM];
-static DEFINE_SPINLOCK(asi_class_lock);
+static struct asi_class asi_class[ASI_MAX_NUM] __asi_not_sensitive;
+static DEFINE_SPINLOCK(asi_class_lock __asi_not_sensitive);

DEFINE_PER_CPU_ALIGNED(struct asi_state, asi_cpu_state);
EXPORT_PER_CPU_SYMBOL_GPL(asi_cpu_state);
diff --git a/include/linux/debug_locks.h b/include/linux/debug_locks.h
index dbb409d77d4f..7bd0c3dd6d47 100644
--- a/include/linux/debug_locks.h
+++ b/include/linux/debug_locks.h
@@ -7,8 +7,8 @@

struct task_struct;

-extern int debug_locks __read_mostly;
-extern int debug_locks_silent __read_mostly;
+extern int debug_locks;
+extern int debug_locks_silent;


static __always_inline int __debug_locks_off(void)
diff --git a/include/linux/jiffies.h b/include/linux/jiffies.h
index 5e13f801c902..deccab0dcb4a 100644
--- a/include/linux/jiffies.h
+++ b/include/linux/jiffies.h
@@ -76,8 +76,8 @@ extern int register_refined_jiffies(long clock_tick_rate);
* without sampling the sequence number in jiffies_lock.
* get_jiffies_64() will do this for you as appropriate.
*/
-extern u64 __cacheline_aligned_in_smp jiffies_64;
-extern unsigned long volatile __cacheline_aligned_in_smp __jiffy_arch_data jiffies;
+extern u64 jiffies_64;
+extern unsigned long volatile __jiffy_arch_data jiffies;

#if (BITS_PER_LONG < 64)
u64 get_jiffies_64(void);
diff --git a/include/linux/notifier.h b/include/linux/notifier.h
index 87069b8459af..a27b193b8e60 100644
--- a/include/linux/notifier.h
+++ b/include/linux/notifier.h
@@ -117,7 +117,7 @@ extern void srcu_init_notifier_head(struct srcu_notifier_head *nh);
struct blocking_notifier_head name = \
BLOCKING_NOTIFIER_INIT(name)
#define RAW_NOTIFIER_HEAD(name) \
- struct raw_notifier_head name = \
+ struct raw_notifier_head name __asi_not_sensitive = \
RAW_NOTIFIER_INIT(name)

#ifdef CONFIG_TREE_SRCU
diff --git a/include/linux/profile.h b/include/linux/profile.h
index fd18ca96f557..4988b6d05d4c 100644
--- a/include/linux/profile.h
+++ b/include/linux/profile.h
@@ -38,7 +38,7 @@ enum profile_type {

#ifdef CONFIG_PROFILING

-extern int prof_on __read_mostly;
+extern int prof_on;

/* init basic kernel profiler */
int profile_init(void);
diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index 5e0beb5c5659..34f5073c88a2 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -84,7 +84,7 @@ static inline int rcu_preempt_depth(void)

/* Internal to kernel */
void rcu_init(void);
-extern int rcu_scheduler_active __read_mostly;
+extern int rcu_scheduler_active;
void rcu_sched_clock_irq(int user);
void rcu_report_dead(unsigned int cpu);
void rcutree_migrate_callbacks(int cpu);
@@ -308,6 +308,8 @@ static inline int rcu_read_lock_any_held(void)

#ifdef CONFIG_PROVE_RCU

+/* TODO: ASI - (oweisse) we might want to switch ".data.unlikely" to some other
+ * section that will be mapped to ASI. */
/**
* RCU_LOCKDEP_WARN - emit lockdep splat if specified condition is met
* @c: condition to check
diff --git a/include/linux/rcutree.h b/include/linux/rcutree.h
index 53209d669400..76665db179fa 100644
--- a/include/linux/rcutree.h
+++ b/include/linux/rcutree.h
@@ -62,7 +62,7 @@ static inline void rcu_irq_exit_check_preempt(void) { }
void exit_rcu(void);

void rcu_scheduler_starting(void);
-extern int rcu_scheduler_active __read_mostly;
+extern int rcu_scheduler_active;
void rcu_end_inkernel_boot(void);
bool rcu_inkernel_boot_has_ended(void);
bool rcu_is_watching(void);
diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index 304f431178fd..1529e3835939 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -3,6 +3,7 @@
#define _LINUX_SCHED_SYSCTL_H

#include <linux/types.h>
+#include <asm/asi.h>

struct ctl_table;

diff --git a/init/main.c b/init/main.c
index bb984ed79de0..ce87fac83aed 100644
--- a/init/main.c
+++ b/init/main.c
@@ -123,7 +123,7 @@ extern void radix_tree_init(void);
* operations which are not allowed with IRQ disabled are allowed while the
* flag is set.
*/
-bool early_boot_irqs_disabled __read_mostly;
+bool early_boot_irqs_disabled __asi_not_sensitive;

enum system_states system_state __read_mostly;
EXPORT_SYMBOL(system_state);
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index cafb8c114a21..729495e17363 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -162,7 +162,8 @@ static struct static_key_true *cgroup_subsys_on_dfl_key[] = {
static DEFINE_PER_CPU(struct cgroup_rstat_cpu, cgrp_dfl_root_rstat_cpu);

/* the default hierarchy */
-struct cgroup_root cgrp_dfl_root = { .cgrp.rstat_cpu = &cgrp_dfl_root_rstat_cpu };
+struct cgroup_root cgrp_dfl_root __asi_not_sensitive =
+ { .cgrp.rstat_cpu = &cgrp_dfl_root_rstat_cpu };
EXPORT_SYMBOL_GPL(cgrp_dfl_root);

/*
@@ -755,7 +756,7 @@ EXPORT_SYMBOL_GPL(of_css);
* reference-counted, to improve performance when child cgroups
* haven't been created.
*/
-struct css_set init_css_set = {
+struct css_set init_css_set __asi_not_sensitive = {
.refcount = REFCOUNT_INIT(1),
.dom_cset = &init_css_set,
.tasks = LIST_HEAD_INIT(init_css_set.tasks),
diff --git a/kernel/cpu.c b/kernel/cpu.c
index 407a2568f35e..59530bd5da39 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -2581,26 +2581,26 @@ const DECLARE_BITMAP(cpu_all_bits, NR_CPUS) = CPU_BITS_ALL;
EXPORT_SYMBOL(cpu_all_bits);

#ifdef CONFIG_INIT_ALL_POSSIBLE
-struct cpumask __cpu_possible_mask __read_mostly
+struct cpumask __cpu_possible_mask __asi_not_sensitive_readmostly
= {CPU_BITS_ALL};
#else
-struct cpumask __cpu_possible_mask __read_mostly;
+struct cpumask __cpu_possible_mask __asi_not_sensitive_readmostly;
#endif
EXPORT_SYMBOL(__cpu_possible_mask);

-struct cpumask __cpu_online_mask __read_mostly;
+struct cpumask __cpu_online_mask __asi_not_sensitive_readmostly;
EXPORT_SYMBOL(__cpu_online_mask);

-struct cpumask __cpu_present_mask __read_mostly;
+struct cpumask __cpu_present_mask __asi_not_sensitive_readmostly;
EXPORT_SYMBOL(__cpu_present_mask);

-struct cpumask __cpu_active_mask __read_mostly;
+struct cpumask __cpu_active_mask __asi_not_sensitive_readmostly;
EXPORT_SYMBOL(__cpu_active_mask);

-struct cpumask __cpu_dying_mask __read_mostly;
+struct cpumask __cpu_dying_mask __asi_not_sensitive_readmostly;
EXPORT_SYMBOL(__cpu_dying_mask);

-atomic_t __num_online_cpus __read_mostly;
+atomic_t __num_online_cpus __asi_not_sensitive_readmostly;
EXPORT_SYMBOL(__num_online_cpus);

void init_cpu_present(const struct cpumask *src)
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 30d94f68c5bd..6ea559b6e0f4 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -9651,7 +9651,7 @@ static int perf_swevent_init(struct perf_event *event)
return 0;
}

-static struct pmu perf_swevent = {
+static struct pmu perf_swevent __asi_not_sensitive = {
.task_ctx_nr = perf_sw_context,

.capabilities = PERF_PMU_CAP_NO_NMI,
@@ -9800,7 +9800,7 @@ static int perf_tp_event_init(struct perf_event *event)
return 0;
}

-static struct pmu perf_tracepoint = {
+static struct pmu perf_tracepoint __asi_not_sensitive = {
.task_ctx_nr = perf_sw_context,

.event_init = perf_tp_event_init,
diff --git a/kernel/freezer.c b/kernel/freezer.c
index 45ab36ffd0e7..6ca163e4880b 100644
--- a/kernel/freezer.c
+++ b/kernel/freezer.c
@@ -13,7 +13,7 @@
#include <linux/kthread.h>

/* total number of freezing conditions in effect */
-atomic_t system_freezing_cnt = ATOMIC_INIT(0);
+atomic_t __asi_not_sensitive system_freezing_cnt = ATOMIC_INIT(0);
EXPORT_SYMBOL(system_freezing_cnt);

/* indicate whether PM freezing is in effect, protected by
diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index 2270ec68f10a..1b8f51a37883 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -64,7 +64,7 @@
#include <trace/events/lock.h>

#ifdef CONFIG_PROVE_LOCKING
-int prove_locking = 1;
+int prove_locking __asi_not_sensitive = 1;
module_param(prove_locking, int, 0644);
#else
#define prove_locking 0
@@ -186,8 +186,8 @@ unsigned long nr_zapped_classes;
#ifndef CONFIG_DEBUG_LOCKDEP
static
#endif
-struct lock_class lock_classes[MAX_LOCKDEP_KEYS];
-static DECLARE_BITMAP(lock_classes_in_use, MAX_LOCKDEP_KEYS);
+struct lock_class lock_classes[MAX_LOCKDEP_KEYS] __asi_not_sensitive;
+static DECLARE_BITMAP(lock_classes_in_use, MAX_LOCKDEP_KEYS) __asi_not_sensitive;

static inline struct lock_class *hlock_class(struct held_lock *hlock)
{
@@ -389,7 +389,7 @@ static struct hlist_head classhash_table[CLASSHASH_SIZE];
#define __chainhashfn(chain) hash_long(chain, CHAINHASH_BITS)
#define chainhashentry(chain) (chainhash_table + __chainhashfn((chain)))

-static struct hlist_head chainhash_table[CHAINHASH_SIZE];
+static struct hlist_head chainhash_table[CHAINHASH_SIZE] __asi_not_sensitive;

/*
* the id of held_lock
@@ -599,7 +599,7 @@ u64 lockdep_stack_hash_count(void)
unsigned int nr_hardirq_chains;
unsigned int nr_softirq_chains;
unsigned int nr_process_chains;
-unsigned int max_lockdep_depth;
+unsigned int max_lockdep_depth __asi_not_sensitive;

#ifdef CONFIG_DEBUG_LOCKDEP
/*
@@ -3225,8 +3225,8 @@ check_prevs_add(struct task_struct *curr, struct held_lock *next)
return 0;
}

-struct lock_chain lock_chains[MAX_LOCKDEP_CHAINS];
-static DECLARE_BITMAP(lock_chains_in_use, MAX_LOCKDEP_CHAINS);
+struct lock_chain lock_chains[MAX_LOCKDEP_CHAINS] __asi_not_sensitive;
+static DECLARE_BITMAP(lock_chains_in_use, MAX_LOCKDEP_CHAINS) __asi_not_sensitive;
static u16 chain_hlocks[MAX_LOCKDEP_CHAIN_HLOCKS];
unsigned long nr_zapped_lock_chains;
unsigned int nr_free_chain_hlocks; /* Free chain_hlocks in buckets */
diff --git a/kernel/panic.c b/kernel/panic.c
index cefd7d82366f..6d0ee3ddd58b 100644
--- a/kernel/panic.c
+++ b/kernel/panic.c
@@ -56,7 +56,7 @@ int panic_on_warn __read_mostly;
unsigned long panic_on_taint;
bool panic_on_taint_nousertaint = false;

-int panic_timeout = CONFIG_PANIC_TIMEOUT;
+int panic_timeout __asi_not_sensitive = CONFIG_PANIC_TIMEOUT;
EXPORT_SYMBOL_GPL(panic_timeout);

#define PANIC_PRINT_TASK_INFO 0x00000001
diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index 57b132b658e1..3425fb1554d3 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -75,7 +75,7 @@ EXPORT_SYMBOL(ignore_console_lock_warning);
* Low level drivers may need that to know if they can schedule in
* their unblank() callback or not. So let's export it.
*/
-int oops_in_progress;
+int oops_in_progress __asi_not_sensitive;
EXPORT_SYMBOL(oops_in_progress);

/*
@@ -2001,7 +2001,7 @@ static u8 *__printk_recursion_counter(void)
local_irq_restore(flags); \
} while (0)

-int printk_delay_msec __read_mostly;
+int printk_delay_msec __asi_not_sensitive_readmostly;

static inline void printk_delay(void)
{
diff --git a/kernel/profile.c b/kernel/profile.c
index eb9c7f0f5ac5..c5beb9b0b0a8 100644
--- a/kernel/profile.c
+++ b/kernel/profile.c
@@ -44,10 +44,10 @@ static atomic_t *prof_buffer;
static unsigned long prof_len;
static unsigned short int prof_shift;

-int prof_on __read_mostly;
+int prof_on __asi_not_sensitive_readmostly;
EXPORT_SYMBOL_GPL(prof_on);

-static cpumask_var_t prof_cpu_mask;
+static cpumask_var_t prof_cpu_mask __asi_not_sensitive;
#if defined(CONFIG_SMP) && defined(CONFIG_PROC_FS)
static DEFINE_PER_CPU(struct profile_hit *[2], cpu_profile_hits);
static DEFINE_PER_CPU(int, cpu_profile_flip);
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index ef8d36f580fc..284d2722cf0c 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -82,7 +82,7 @@ static DEFINE_PER_CPU_SHARED_ALIGNED(struct rcu_data, rcu_data) = {
.cblist.flags = SEGCBLIST_SOFTIRQ_ONLY,
#endif
};
-static struct rcu_state rcu_state = {
+static struct rcu_state rcu_state __asi_not_sensitive = {
.level = { &rcu_state.node[0] },
.gp_state = RCU_GP_IDLE,
.gp_seq = (0UL - 300UL) << RCU_SEQ_CTR_SHIFT,
@@ -98,7 +98,7 @@ static struct rcu_state rcu_state = {
static bool dump_tree;
module_param(dump_tree, bool, 0444);
/* By default, use RCU_SOFTIRQ instead of rcuc kthreads. */
-static bool use_softirq = !IS_ENABLED(CONFIG_PREEMPT_RT);
+static __asi_not_sensitive bool use_softirq = !IS_ENABLED(CONFIG_PREEMPT_RT);
#ifndef CONFIG_PREEMPT_RT
module_param(use_softirq, bool, 0444);
#endif
@@ -125,7 +125,7 @@ int rcu_num_nodes __read_mostly = NUM_RCU_NODES; /* Total # rcu_nodes in use. */
* transitions from RCU_SCHEDULER_INIT to RCU_SCHEDULER_RUNNING after RCU
* is fully initialized, including all of its kthreads having been spawned.
*/
-int rcu_scheduler_active __read_mostly;
+int rcu_scheduler_active __asi_not_sensitive;
EXPORT_SYMBOL_GPL(rcu_scheduler_active);

/*
@@ -140,7 +140,7 @@ EXPORT_SYMBOL_GPL(rcu_scheduler_active);
* early boot to take responsibility for these callbacks, but one step at
* a time.
*/
-static int rcu_scheduler_fully_active __read_mostly;
+static int rcu_scheduler_fully_active __asi_not_sensitive;

static void rcu_report_qs_rnp(unsigned long mask, struct rcu_node *rnp,
unsigned long gps, unsigned long flags);
@@ -470,7 +470,7 @@ module_param(qovld, long, 0444);

static ulong jiffies_till_first_fqs = IS_ENABLED(CONFIG_RCU_STRICT_GRACE_PERIOD) ? 0 : ULONG_MAX;
static ulong jiffies_till_next_fqs = ULONG_MAX;
-static bool rcu_kick_kthreads;
+static bool rcu_kick_kthreads __asi_not_sensitive;
static int rcu_divisor = 7;
module_param(rcu_divisor, int, 0644);

diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c
index 156892c22bb5..b61a3854e62d 100644
--- a/kernel/rcu/update.c
+++ b/kernel/rcu/update.c
@@ -243,7 +243,7 @@ core_initcall(rcu_set_runtime_mode);

#ifdef CONFIG_DEBUG_LOCK_ALLOC
static struct lock_class_key rcu_lock_key;
-struct lockdep_map rcu_lock_map = {
+struct lockdep_map rcu_lock_map __asi_not_sensitive = {
.name = "rcu_read_lock",
.key = &rcu_lock_key,
.wait_type_outer = LD_WAIT_FREE,
@@ -494,7 +494,7 @@ EXPORT_SYMBOL_GPL(rcutorture_sched_setaffinity);
#ifdef CONFIG_RCU_STALL_COMMON
int rcu_cpu_stall_ftrace_dump __read_mostly;
module_param(rcu_cpu_stall_ftrace_dump, int, 0644);
-int rcu_cpu_stall_suppress __read_mostly; // !0 = suppress stall warnings.
+int rcu_cpu_stall_suppress __asi_not_sensitive_readmostly; // !0 = suppress stall warnings.
EXPORT_SYMBOL_GPL(rcu_cpu_stall_suppress);
module_param(rcu_cpu_stall_suppress, int, 0644);
int rcu_cpu_stall_timeout __read_mostly = CONFIG_RCU_CPU_STALL_TIMEOUT;
diff --git a/kernel/sched/clock.c b/kernel/sched/clock.c
index c2b2859ddd82..6c3585053f05 100644
--- a/kernel/sched/clock.c
+++ b/kernel/sched/clock.c
@@ -84,7 +84,7 @@ static int __sched_clock_stable_early = 1;
/*
* We want: ktime_get_ns() + __gtod_offset == sched_clock() + __sched_clock_offset
*/
-__read_mostly u64 __sched_clock_offset;
+__asi_not_sensitive u64 __sched_clock_offset;
static __read_mostly u64 __gtod_offset;

struct sched_clock_data {
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 44ea197c16ea..e1c08ff4130e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -76,9 +76,9 @@ __read_mostly int sysctl_resched_latency_warn_once = 1;
* Limited because this is done with IRQs disabled.
*/
#ifdef CONFIG_PREEMPT_RT
-const_debug unsigned int sysctl_sched_nr_migrate = 8;
+unsigned int sysctl_sched_nr_migrate __asi_not_sensitive_readmostly = 8;
#else
-const_debug unsigned int sysctl_sched_nr_migrate = 32;
+unsigned int sysctl_sched_nr_migrate __asi_not_sensitive_readmostly = 32;
#endif

/*
@@ -9254,7 +9254,7 @@ int in_sched_functions(unsigned long addr)
* Default task group.
* Every task in system belongs to this group at bootup.
*/
-struct task_group root_task_group;
+struct task_group root_task_group __asi_not_sensitive;
LIST_HEAD(task_groups);

/* Cacheline aligned slab cache for task_group */
diff --git a/kernel/sched/cpuacct.c b/kernel/sched/cpuacct.c
index 893eece65bfd..6e3da149125c 100644
--- a/kernel/sched/cpuacct.c
+++ b/kernel/sched/cpuacct.c
@@ -50,7 +50,7 @@ static inline struct cpuacct *parent_ca(struct cpuacct *ca)
}

static DEFINE_PER_CPU(struct cpuacct_usage, root_cpuacct_cpuusage);
-static struct cpuacct root_cpuacct = {
+static struct cpuacct root_cpuacct __asi_not_sensitive = {
.cpustat = &kernel_cpustat,
.cpuusage = &root_cpuacct_cpuusage,
};
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 9392aea1804e..623b5feb142a 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -19,7 +19,7 @@
*/
DEFINE_PER_CPU(struct irqtime, cpu_irqtime);

-static int sched_clock_irqtime;
+static int __asi_not_sensitive sched_clock_irqtime;

void enable_sched_clock_irqtime(void)
{
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6e476f6d9435..dc9b6133b059 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -35,7 +35,7 @@
*
* (default: 6ms * (1 + ilog(ncpus)), units: nanoseconds)
*/
-unsigned int sysctl_sched_latency = 6000000ULL;
+__asi_not_sensitive unsigned int sysctl_sched_latency = 6000000ULL;
static unsigned int normalized_sysctl_sched_latency = 6000000ULL;

/*
@@ -90,7 +90,7 @@ unsigned int sysctl_sched_child_runs_first __read_mostly;
unsigned int sysctl_sched_wakeup_granularity = 1000000UL;
static unsigned int normalized_sysctl_sched_wakeup_granularity = 1000000UL;

-const_debug unsigned int sysctl_sched_migration_cost = 500000UL;
+unsigned int sysctl_sched_migration_cost __asi_not_sensitive_readmostly = 500000UL;

int sched_thermal_decay_shift;
static int __init setup_sched_thermal_decay_shift(char *str)
diff --git a/kernel/sched/loadavg.c b/kernel/sched/loadavg.c
index 954b229868d9..af71cde93e98 100644
--- a/kernel/sched/loadavg.c
+++ b/kernel/sched/loadavg.c
@@ -57,7 +57,7 @@

/* Variables and functions for calc_load */
atomic_long_t calc_load_tasks;
-unsigned long calc_load_update;
+unsigned long calc_load_update __asi_not_sensitive;
unsigned long avenrun[3];
EXPORT_SYMBOL(avenrun); /* should be removed */

diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index b48baaba2fc2..9d5fbe66d355 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -14,7 +14,7 @@ static const u64 max_rt_runtime = MAX_BW;

static int do_sched_rt_period_timer(struct rt_bandwidth *rt_b, int overrun);

-struct rt_bandwidth def_rt_bandwidth;
+struct rt_bandwidth def_rt_bandwidth __asi_not_sensitive;

static enum hrtimer_restart sched_rt_period_timer(struct hrtimer *timer)
{
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 0e66749486e7..517c70a29a57 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2379,8 +2379,8 @@ extern void deactivate_task(struct rq *rq, struct task_struct *p, int flags);

extern void check_preempt_curr(struct rq *rq, struct task_struct *p, int flags);

-extern const_debug unsigned int sysctl_sched_nr_migrate;
-extern const_debug unsigned int sysctl_sched_migration_cost;
+extern unsigned int sysctl_sched_nr_migrate;
+extern unsigned int sysctl_sched_migration_cost;

#ifdef CONFIG_SCHED_DEBUG
extern unsigned int sysctl_sched_latency;
diff --git a/kernel/smp.c b/kernel/smp.c
index 01a7c1706a58..c51fd981a4a9 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -1070,7 +1070,7 @@ static int __init maxcpus(char *str)
early_param("maxcpus", maxcpus);

/* Setup number of possible processor ids */
-unsigned int nr_cpu_ids __read_mostly = NR_CPUS;
+unsigned int nr_cpu_ids __asi_not_sensitive = NR_CPUS;
EXPORT_SYMBOL(nr_cpu_ids);

/* An arch may set nr_cpu_ids earlier if needed, so this would be redundant */
diff --git a/kernel/softirq.c b/kernel/softirq.c
index 41f470929e99..c462b7fab4d3 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -56,7 +56,8 @@ DEFINE_PER_CPU_ALIGNED(irq_cpustat_t, irq_stat);
EXPORT_PER_CPU_SYMBOL(irq_stat);
#endif

-static struct softirq_action softirq_vec[NR_SOFTIRQS] __cacheline_aligned_in_smp;
+static struct softirq_action softirq_vec[NR_SOFTIRQS]
+__asi_not_sensitive ____cacheline_aligned;

DEFINE_PER_CPU(struct task_struct *, ksoftirqd);

diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index 0ea8702eb516..8b176f5c01f2 100644
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -706,7 +706,7 @@ hrtimer_force_reprogram(struct hrtimer_cpu_base *cpu_base, int skip_equal)
* High resolution timer enabled ?
*/
static bool hrtimer_hres_enabled __read_mostly = true;
-unsigned int hrtimer_resolution __read_mostly = LOW_RES_NSEC;
+unsigned int hrtimer_resolution __asi_not_sensitive = LOW_RES_NSEC;
EXPORT_SYMBOL_GPL(hrtimer_resolution);

/*
diff --git a/kernel/time/jiffies.c b/kernel/time/jiffies.c
index bc4db9e5ab70..c60f8da1cfb5 100644
--- a/kernel/time/jiffies.c
+++ b/kernel/time/jiffies.c
@@ -40,7 +40,13 @@ static struct clocksource clocksource_jiffies = {
.max_cycles = 10,
};

-__cacheline_aligned_in_smp DEFINE_RAW_SPINLOCK(jiffies_lock);
+/* TODO(oweisse): __cacheline_aligned_in_smp is expanded to
+ __section__(".data..cacheline_aligned"))) which is at odds with
+ __asi_not_sensitive. We should consider instead using
+ __attribute__ ((__aligned__(XXX))) where XXX is a def for cacheline or
+ something*/
+/* __cacheline_aligned_in_smp */
+__asi_not_sensitive DEFINE_RAW_SPINLOCK(jiffies_lock);
__cacheline_aligned_in_smp seqcount_raw_spinlock_t jiffies_seq =
SEQCNT_RAW_SPINLOCK_ZERO(jiffies_seq, &jiffies_lock);

diff --git a/kernel/time/ntp.c b/kernel/time/ntp.c
index 406dccb79c2b..23711fb94323 100644
--- a/kernel/time/ntp.c
+++ b/kernel/time/ntp.c
@@ -31,13 +31,13 @@


/* USER_HZ period (usecs): */
-unsigned long tick_usec = USER_TICK_USEC;
+unsigned long tick_usec __asi_not_sensitive = USER_TICK_USEC;

/* SHIFTED_HZ period (nsecs): */
-unsigned long tick_nsec;
+unsigned long tick_nsec __asi_not_sensitive;

-static u64 tick_length;
-static u64 tick_length_base;
+static u64 tick_length __asi_not_sensitive;
+static u64 tick_length_base __asi_not_sensitive;

#define SECS_PER_DAY 86400
#define MAX_TICKADJ 500LL /* usecs */
@@ -54,36 +54,36 @@ static u64 tick_length_base;
*
* (TIME_ERROR prevents overwriting the CMOS clock)
*/
-static int time_state = TIME_OK;
+static int time_state __asi_not_sensitive = TIME_OK;

/* clock status bits: */
-static int time_status = STA_UNSYNC;
+static int time_status __asi_not_sensitive = STA_UNSYNC;

/* time adjustment (nsecs): */
-static s64 time_offset;
+static s64 time_offset __asi_not_sensitive;

/* pll time constant: */
-static long time_constant = 2;
+static long time_constant __asi_not_sensitive = 2;

/* maximum error (usecs): */
-static long time_maxerror = NTP_PHASE_LIMIT;
+static long time_maxerror __asi_not_sensitive = NTP_PHASE_LIMIT;

/* estimated error (usecs): */
-static long time_esterror = NTP_PHASE_LIMIT;
+static long time_esterror __asi_not_sensitive = NTP_PHASE_LIMIT;

/* frequency offset (scaled nsecs/secs): */
-static s64 time_freq;
+static s64 time_freq __asi_not_sensitive;

/* time at last adjustment (secs): */
-static time64_t time_reftime;
+static time64_t time_reftime __asi_not_sensitive;

-static long time_adjust;
+static long time_adjust __asi_not_sensitive;

/* constant (boot-param configurable) NTP tick adjustment (upscaled) */
-static s64 ntp_tick_adj;
+static s64 ntp_tick_adj __asi_not_sensitive;

/* second value of the next pending leapsecond, or TIME64_MAX if no leap */
-static time64_t ntp_next_leap_sec = TIME64_MAX;
+static time64_t ntp_next_leap_sec __asi_not_sensitive = TIME64_MAX;

#ifdef CONFIG_NTP_PPS

diff --git a/kernel/time/tick-common.c b/kernel/time/tick-common.c
index 46789356f856..cbe75661ca74 100644
--- a/kernel/time/tick-common.c
+++ b/kernel/time/tick-common.c
@@ -31,7 +31,7 @@ DEFINE_PER_CPU(struct tick_device, tick_cpu_device);
* CPU which handles the tick and protected by jiffies_lock. There is
* no requirement to write hold the jiffies seqcount for it.
*/
-ktime_t tick_next_period;
+ktime_t tick_next_period __asi_not_sensitive;

/*
* tick_do_timer_cpu is a timer core internal variable which holds the CPU NR
@@ -47,7 +47,7 @@ ktime_t tick_next_period;
* at it will take over and keep the time keeping alive. The handover
* procedure also covers cpu hotplug.
*/
-int tick_do_timer_cpu __read_mostly = TICK_DO_TIMER_BOOT;
+int tick_do_timer_cpu __asi_not_sensitive_readmostly = TICK_DO_TIMER_BOOT;
#ifdef CONFIG_NO_HZ_FULL
/*
* tick_do_timer_boot_cpu indicates the boot CPU temporarily owns
diff --git a/kernel/time/tick-internal.h b/kernel/time/tick-internal.h
index 649f2b48e8f0..ed7e2a18060a 100644
--- a/kernel/time/tick-internal.h
+++ b/kernel/time/tick-internal.h
@@ -15,7 +15,7 @@

DECLARE_PER_CPU(struct tick_device, tick_cpu_device);
extern ktime_t tick_next_period;
-extern int tick_do_timer_cpu __read_mostly;
+extern int tick_do_timer_cpu;

extern void tick_setup_periodic(struct clock_event_device *dev, int broadcast);
extern void tick_handle_periodic(struct clock_event_device *dev);
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 17a283ce2b20..c23fecbb68c2 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -49,7 +49,7 @@ struct tick_sched *tick_get_tick_sched(int cpu)
* jiffies_lock and jiffies_seq. tick_nohz_next_event() needs to get a
* consistent view of jiffies and last_jiffies_update.
*/
-static ktime_t last_jiffies_update;
+static ktime_t last_jiffies_update __asi_not_sensitive;

/*
* Must be called with interrupts disabled !
diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index dcdcb85121e4..120395965e45 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -39,7 +39,7 @@ enum timekeeping_adv_mode {
TK_ADV_FREQ
};

-DEFINE_RAW_SPINLOCK(timekeeper_lock);
+__asi_not_sensitive DEFINE_RAW_SPINLOCK(timekeeper_lock);

/*
* The most important data for readout fits into a single 64 byte
@@ -48,14 +48,14 @@ DEFINE_RAW_SPINLOCK(timekeeper_lock);
static struct {
seqcount_raw_spinlock_t seq;
struct timekeeper timekeeper;
-} tk_core ____cacheline_aligned = {
+} tk_core ____cacheline_aligned __asi_not_sensitive = {
.seq = SEQCNT_RAW_SPINLOCK_ZERO(tk_core.seq, &timekeeper_lock),
};

-static struct timekeeper shadow_timekeeper;
+static struct timekeeper shadow_timekeeper __asi_not_sensitive;

/* flag for if timekeeping is suspended */
-int __read_mostly timekeeping_suspended;
+int __asi_not_sensitive_readmostly timekeeping_suspended;

/**
* struct tk_fast - NMI safe timekeeper
@@ -72,7 +72,7 @@ struct tk_fast {
};

/* Suspend-time cycles value for halted fast timekeeper. */
-static u64 cycles_at_suspend;
+static u64 cycles_at_suspend __asi_not_sensitive;

static u64 dummy_clock_read(struct clocksource *cs)
{
diff --git a/kernel/time/timekeeping.h b/kernel/time/timekeeping.h
index 543beba096c7..b32ee75808fe 100644
--- a/kernel/time/timekeeping.h
+++ b/kernel/time/timekeeping.h
@@ -26,7 +26,7 @@ extern void update_process_times(int user);
extern void do_timer(unsigned long ticks);
extern void update_wall_time(void);

-extern raw_spinlock_t jiffies_lock;
+extern __asi_not_sensitive raw_spinlock_t jiffies_lock;
extern seqcount_raw_spinlock_t jiffies_seq;

#define CS_NAME_LEN 32
diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index 85f1021ad459..0b09c99b568c 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -56,7 +56,7 @@
#define CREATE_TRACE_POINTS
#include <trace/events/timer.h>

-__visible u64 jiffies_64 __cacheline_aligned_in_smp = INITIAL_JIFFIES;
+u64 jiffies_64 __asi_not_sensitive ____cacheline_aligned = INITIAL_JIFFIES;

EXPORT_SYMBOL(jiffies_64);

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 78ea542ce3bc..eaec3814c5a4 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -432,7 +432,7 @@ EXPORT_SYMBOL_GPL(unregister_ftrace_export);
* The global_trace is the descriptor that holds the top-level tracing
* buffers for the live tracing.
*/
-static struct trace_array global_trace = {
+static struct trace_array global_trace __asi_not_sensitive = {
.trace_flags = TRACE_DEFAULT_FLAGS,
};

diff --git a/kernel/trace/trace_sched_switch.c b/kernel/trace/trace_sched_switch.c
index e304196d7c28..d49db8e2430a 100644
--- a/kernel/trace/trace_sched_switch.c
+++ b/kernel/trace/trace_sched_switch.c
@@ -16,8 +16,8 @@
#define RECORD_CMDLINE 1
#define RECORD_TGID 2

-static int sched_cmdline_ref;
-static int sched_tgid_ref;
+static int sched_cmdline_ref __asi_not_sensitive;
+static int sched_tgid_ref __asi_not_sensitive;
static DEFINE_MUTEX(sched_register_mutex);

static void
diff --git a/lib/debug_locks.c b/lib/debug_locks.c
index a75ee30b77cb..f2d217859be6 100644
--- a/lib/debug_locks.c
+++ b/lib/debug_locks.c
@@ -14,6 +14,7 @@
#include <linux/export.h>
#include <linux/spinlock.h>
#include <linux/debug_locks.h>
+#include <asm/asi.h>

/*
* We want to turn all lock-debugging facilities on/off at once,
@@ -22,7 +23,7 @@
* that would just muddy the log. So we report the first one and
* shut up after that.
*/
-int debug_locks __read_mostly = 1;
+int debug_locks __asi_not_sensitive_readmostly = 1;
EXPORT_SYMBOL_GPL(debug_locks);

/*
@@ -30,7 +31,7 @@ EXPORT_SYMBOL_GPL(debug_locks);
* 'silent failure': nothing is printed to the console when
* a locking bug is detected.
*/
-int debug_locks_silent __read_mostly;
+int debug_locks_silent __asi_not_sensitive_readmostly;
EXPORT_SYMBOL_GPL(debug_locks_silent);

/*
diff --git a/mm/memory.c b/mm/memory.c
index 667ece86e051..5aa39d0aba2b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -152,7 +152,7 @@ static int __init disable_randmaps(char *s)
}
__setup("norandmaps", disable_randmaps);

-unsigned long zero_pfn __read_mostly;
+unsigned long zero_pfn __asi_not_sensitive;
EXPORT_SYMBOL(zero_pfn);

unsigned long highest_memmap_pfn __read_mostly;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 998ff6a56732..9c850b8bd1fc 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -183,7 +183,7 @@ unsigned long totalreserve_pages __read_mostly;
unsigned long totalcma_pages __read_mostly;

int percpu_pagelist_high_fraction;
-gfp_t gfp_allowed_mask __read_mostly = GFP_BOOT_MASK;
+gfp_t gfp_allowed_mask __asi_not_sensitive_readmostly = GFP_BOOT_MASK;
DEFINE_STATIC_KEY_MAYBE(CONFIG_INIT_ON_ALLOC_DEFAULT_ON, init_on_alloc);
EXPORT_SYMBOL(init_on_alloc);

diff --git a/mm/sparse.c b/mm/sparse.c
index e5c84b0cf0c9..64dcf7fceaed 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -24,10 +24,10 @@
* 1) mem_section - memory sections, mem_map's for valid memory
*/
#ifdef CONFIG_SPARSEMEM_EXTREME
-struct mem_section **mem_section;
+struct mem_section **mem_section __asi_not_sensitive;
#else
struct mem_section mem_section[NR_SECTION_ROOTS][SECTIONS_PER_ROOT]
- ____cacheline_internodealigned_in_smp;
+ ____cacheline_internodealigned_in_smp __asi_not_sensitive;
#endif
EXPORT_SYMBOL(mem_section);

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index e8e9c8588908..0af973b950c2 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -3497,7 +3497,7 @@ static int kvm_vcpu_release(struct inode *inode, struct file *filp)
return 0;
}

-static struct file_operations kvm_vcpu_fops = {
+static struct file_operations kvm_vcpu_fops __asi_not_sensitive = {
.release = kvm_vcpu_release,
.unlocked_ioctl = kvm_vcpu_ioctl,
.mmap = kvm_vcpu_mmap,
--
2.35.1.473.g83b2b277ed-goog

2022-02-24 01:03:05

by Junaid Shahid

[permalink] [raw]
Subject: [RFC PATCH 14/47] mm: asi: Disable ASI API when ASI is not enabled for a process

If ASI is not enabled for a process, asi_init() will return a NULL ASI
pointer as output, though it will still return a 0 error code. All other
ASI API functions become no-ops and return without an error when they
are passed a NULL ASI pointer.
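
For illustration, a minimal caller-side sketch (not part of this patch;
mm is the target mm_struct and ASI_EXAMPLE_CLASS is a placeholder index)
of how a NULL ASI pointer flows through the API:

	struct asi *asi;
	int err;

	err = asi_init(mm, ASI_EXAMPLE_CLASS, &asi);
	if (err)
		return err;

	/* asi may legitimately be NULL here if ASI is not enabled. */
	asi_enter(asi);		/* no-op when asi == NULL */
	/* ... run with the restricted address space ... */
	asi_exit();

	asi_destroy(asi);	/* also tolerates a NULL asi */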

Signed-off-by: Junaid Shahid <[email protected]>


---
arch/x86/include/asm/asi.h | 2 +-
arch/x86/mm/asi.c | 18 ++++++++++--------
include/asm-generic/asi.h | 7 ++++++-
3 files changed, 17 insertions(+), 10 deletions(-)

diff --git a/arch/x86/include/asm/asi.h b/arch/x86/include/asm/asi.h
index 64c2b4d1dba2..f69e1f2f09a4 100644
--- a/arch/x86/include/asm/asi.h
+++ b/arch/x86/include/asm/asi.h
@@ -51,7 +51,7 @@ int asi_register_class(const char *name, uint flags,
const struct asi_hooks *ops);
void asi_unregister_class(int index);

-int asi_init(struct mm_struct *mm, int asi_index);
+int asi_init(struct mm_struct *mm, int asi_index, struct asi **out_asi);
void asi_destroy(struct asi *asi);

void asi_enter(struct asi *asi);
diff --git a/arch/x86/mm/asi.c b/arch/x86/mm/asi.c
index ca50a32ecd7e..58d1c532274a 100644
--- a/arch/x86/mm/asi.c
+++ b/arch/x86/mm/asi.c
@@ -207,11 +207,13 @@ static int __init asi_global_init(void)
}
subsys_initcall(asi_global_init)

-int asi_init(struct mm_struct *mm, int asi_index)
+int asi_init(struct mm_struct *mm, int asi_index, struct asi **out_asi)
{
struct asi *asi = &mm->asi[asi_index];

- if (!boot_cpu_has(X86_FEATURE_ASI))
+ *out_asi = NULL;
+
+ if (!boot_cpu_has(X86_FEATURE_ASI) || !mm->asi_enabled)
return 0;

/* Index 0 is reserved for special purposes. */
@@ -238,13 +240,15 @@ int asi_init(struct mm_struct *mm, int asi_index)
set_pgd(asi->pgd + i, asi_global_nonsensitive_pgd[i]);
}

+ *out_asi = asi;
+
return 0;
}
EXPORT_SYMBOL_GPL(asi_init);

void asi_destroy(struct asi *asi)
{
- if (!boot_cpu_has(X86_FEATURE_ASI))
+ if (!boot_cpu_has(X86_FEATURE_ASI) || !asi)
return;

asi_free_pgd(asi);
@@ -278,11 +282,9 @@ void __asi_enter(void)

void asi_enter(struct asi *asi)
{
- if (!static_cpu_has(X86_FEATURE_ASI))
+ if (!static_cpu_has(X86_FEATURE_ASI) || !asi)
return;

- VM_WARN_ON_ONCE(!asi);
-
this_cpu_write(asi_cpu_state.target_asi, asi);
barrier();

@@ -423,7 +425,7 @@ int asi_map_gfp(struct asi *asi, void *addr, size_t len, gfp_t gfp_flags)
size_t end = start + len;
size_t page_size;

- if (!static_cpu_has(X86_FEATURE_ASI))
+ if (!static_cpu_has(X86_FEATURE_ASI) || !asi)
return 0;

VM_BUG_ON(start & ~PAGE_MASK);
@@ -514,7 +516,7 @@ void asi_unmap(struct asi *asi, void *addr, size_t len, bool flush_tlb)
size_t end = start + len;
pgtbl_mod_mask mask = 0;

- if (!static_cpu_has(X86_FEATURE_ASI) || !len)
+ if (!static_cpu_has(X86_FEATURE_ASI) || !asi || !len)
return;

VM_BUG_ON(start & ~PAGE_MASK);
diff --git a/include/asm-generic/asi.h b/include/asm-generic/asi.h
index f918cd052722..51c9c4a488e8 100644
--- a/include/asm-generic/asi.h
+++ b/include/asm-generic/asi.h
@@ -33,7 +33,12 @@ static inline void asi_unregister_class(int asi_index) { }

static inline void asi_init_mm_state(struct mm_struct *mm) { }

-static inline int asi_init(struct mm_struct *mm, int asi_index) { return 0; }
+static inline
+int asi_init(struct mm_struct *mm, int asi_index, struct asi **out_asi)
+{
+ *out_asi = NULL;
+ return 0;
+}

static inline void asi_destroy(struct asi *asi) { }

--
2.35.1.473.g83b2b277ed-goog

2022-02-24 01:38:00

by Junaid Shahid

[permalink] [raw]
Subject: [RFC PATCH 46/47] kvm: asi: Do asi_exit() in vcpu_run loop before returning to userspace

From: Ofir Weisse <[email protected]>

For the time being, we switch to the full kernel address space before
returning to userspace. Once KPTI is also implemented using ASI, we
could potentially switch directly to the KPTI address space instead.

Signed-off-by: Ofir Weisse <[email protected]>


---
arch/x86/kvm/x86.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 680725089a18..294f73e9e71e 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10148,13 +10148,17 @@ static int vcpu_run(struct kvm_vcpu *vcpu)
srcu_read_unlock(&kvm->srcu, vcpu->srcu_idx);
r = xfer_to_guest_mode_handle_work(vcpu);
if (r)
- return r;
+ goto exit;
vcpu->srcu_idx = srcu_read_lock(&kvm->srcu);
}
}

srcu_read_unlock(&kvm->srcu, vcpu->srcu_idx);

+exit:
+ /* TODO(oweisse): trace this exit if we're still within an ASI. */
+ asi_exit();
+
return r;
}

--
2.35.1.473.g83b2b277ed-goog

2022-02-24 01:50:21

by Junaid Shahid

[permalink] [raw]
Subject: [RFC PATCH 06/47] mm: asi: ASI page table allocation and free functions

This adds custom allocation and free functions for ASI page tables.

The alloc functions support allocating memory using different GFP
reclaim flags, so that non-sensitive allocations can be made from both
standard and atomic contexts. They also install the page tables
locklessly, which makes it slightly simpler to handle non-sensitive
allocations from interrupts/exceptions.
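
For reference, the expansion of DEFINE_ASI_PGTBL_ALLOC(pgd, p4d) below
is roughly the following (shown here only for illustration):

	static p4d_t *asi_p4d_alloc(struct asi *asi, pgd_t *pgd,
				    ulong addr, gfp_t flags)
	{
		if (unlikely(pgd_none(*pgd))) {
			ulong pgtbl = get_zeroed_page(flags);
			phys_addr_t pgtbl_pa;

			if (pgtbl == 0)
				return NULL;

			pgtbl_pa = __pa(pgtbl);
			paravirt_alloc_p4d(asi->mm, PHYS_PFN(pgtbl_pa));

			/* Lockless install: the first writer wins, the
			 * loser releases its page. */
			if (cmpxchg((ulong *)pgd, 0,
				    pgtbl_pa | _PAGE_TABLE) == 0) {
				mm_inc_nr_p4ds(asi->mm);
			} else {
				paravirt_release_p4d(PHYS_PFN(pgtbl_pa));
				free_page(pgtbl);
			}

			/* NOP on native. PV call on Xen. */
			set_pgd(pgd, *pgd);
		}
		VM_BUG_ON(pgd_large(*pgd));
		return p4d_offset(pgd, addr);
	}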

The free functions recursively free the page tables when the ASI
instance is being torn down.

Signed-off-by: Junaid Shahid <[email protected]>


---
arch/x86/mm/asi.c | 109 +++++++++++++++++++++++++++++++++++++++-
include/linux/pgtable.h | 3 ++
2 files changed, 111 insertions(+), 1 deletion(-)

diff --git a/arch/x86/mm/asi.c b/arch/x86/mm/asi.c
index 2453124f221d..40d772b2e2a8 100644
--- a/arch/x86/mm/asi.c
+++ b/arch/x86/mm/asi.c
@@ -60,6 +60,113 @@ void asi_unregister_class(int index)
}
EXPORT_SYMBOL_GPL(asi_unregister_class);

+#ifndef mm_inc_nr_p4ds
+#define mm_inc_nr_p4ds(mm) do {} while (false)
+#endif
+
+#ifndef mm_dec_nr_p4ds
+#define mm_dec_nr_p4ds(mm) do {} while (false)
+#endif
+
+#define pte_offset pte_offset_kernel
+
+#define DEFINE_ASI_PGTBL_ALLOC(base, level) \
+static level##_t * asi_##level##_alloc(struct asi *asi, \
+ base##_t *base, ulong addr, \
+ gfp_t flags) \
+{ \
+ if (unlikely(base##_none(*base))) { \
+ ulong pgtbl = get_zeroed_page(flags); \
+ phys_addr_t pgtbl_pa; \
+ \
+ if (pgtbl == 0) \
+ return NULL; \
+ \
+ pgtbl_pa = __pa(pgtbl); \
+ paravirt_alloc_##level(asi->mm, PHYS_PFN(pgtbl_pa)); \
+ \
+ if (cmpxchg((ulong *)base, 0, \
+ pgtbl_pa | _PAGE_TABLE) == 0) { \
+ mm_inc_nr_##level##s(asi->mm); \
+ } else { \
+ paravirt_release_##level(PHYS_PFN(pgtbl_pa)); \
+ free_page(pgtbl); \
+ } \
+ \
+ /* NOP on native. PV call on Xen. */ \
+ set_##base(base, *base); \
+ } \
+ VM_BUG_ON(base##_large(*base)); \
+ return level##_offset(base, addr); \
+}
+
+DEFINE_ASI_PGTBL_ALLOC(pgd, p4d)
+DEFINE_ASI_PGTBL_ALLOC(p4d, pud)
+DEFINE_ASI_PGTBL_ALLOC(pud, pmd)
+DEFINE_ASI_PGTBL_ALLOC(pmd, pte)
+
+#define asi_free_dummy(asi, addr)
+#define __pmd_free(mm, pmd) free_page((ulong)(pmd))
+#define pud_page_vaddr(pud) ((ulong)pud_pgtable(pud))
+#define p4d_page_vaddr(p4d) ((ulong)p4d_pgtable(p4d))
+
+static inline unsigned long pte_page_vaddr(pte_t pte)
+{
+ return (unsigned long)__va(pte_val(pte) & PTE_PFN_MASK);
+}
+
+#define DEFINE_ASI_PGTBL_FREE(level, LEVEL, next, free) \
+static void asi_free_##level(struct asi *asi, ulong pgtbl_addr) \
+{ \
+ uint i; \
+ level##_t *level = (level##_t *)pgtbl_addr; \
+ \
+ for (i = 0; i < PTRS_PER_##LEVEL; i++) { \
+ ulong vaddr; \
+ \
+ if (level##_none(level[i])) \
+ continue; \
+ \
+ vaddr = level##_page_vaddr(level[i]); \
+ \
+ if (!level##_leaf(level[i])) \
+ asi_free_##next(asi, vaddr); \
+ else \
+ VM_WARN(true, "Lingering mapping in ASI %p at %lx",\
+ asi, vaddr); \
+ } \
+ paravirt_release_##level(PHYS_PFN(__pa(pgtbl_addr))); \
+ free(asi->mm, level); \
+ mm_dec_nr_##level##s(asi->mm); \
+}
+
+DEFINE_ASI_PGTBL_FREE(pte, PTE, dummy, pte_free_kernel)
+DEFINE_ASI_PGTBL_FREE(pmd, PMD, pte, __pmd_free)
+DEFINE_ASI_PGTBL_FREE(pud, PUD, pmd, pud_free)
+DEFINE_ASI_PGTBL_FREE(p4d, P4D, pud, p4d_free)
+
+static void asi_free_pgd_range(struct asi *asi, uint start, uint end)
+{
+ uint i;
+
+ for (i = start; i < end; i++)
+ if (pgd_present(asi->pgd[i]))
+ asi_free_p4d(asi, (ulong)p4d_offset(asi->pgd + i, 0));
+}
+
+/*
+ * Free the page tables allocated for the given ASI instance.
+ * The caller must ensure that all the mappings have already been cleared
+ * and appropriate TLB flushes have been issued before calling this function.
+ */
+static void asi_free_pgd(struct asi *asi)
+{
+ VM_BUG_ON(asi->mm == &init_mm);
+
+ asi_free_pgd_range(asi, KERNEL_PGD_BOUNDARY, PTRS_PER_PGD);
+ free_pages((ulong)asi->pgd, PGD_ALLOCATION_ORDER);
+}
+
static int __init set_asi_param(char *str)
{
if (strcmp(str, "on") == 0)
@@ -102,7 +209,7 @@ void asi_destroy(struct asi *asi)
if (!boot_cpu_has(X86_FEATURE_ASI))
return;

- free_pages((ulong)asi->pgd, PGD_ALLOCATION_ORDER);
+ asi_free_pgd(asi);
memset(asi, 0, sizeof(struct asi));
}
EXPORT_SYMBOL_GPL(asi_destroy);
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index e24d2c992b11..2fff17a939f0 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1593,6 +1593,9 @@ typedef unsigned int pgtbl_mod_mask;
#ifndef pmd_leaf
#define pmd_leaf(x) 0
#endif
+#ifndef pte_leaf
+#define pte_leaf(x) 1
+#endif

#ifndef pgd_leaf_size
#define pgd_leaf_size(x) (1ULL << PGDIR_SHIFT)
--
2.35.1.473.g83b2b277ed-goog

2022-03-05 12:13:47

by Hyeonggon Yoo

[permalink] [raw]
Subject: Re: [RFC PATCH 00/47] Address Space Isolation for KVM

On Tue, Feb 22, 2022 at 09:21:36PM -0800, Junaid Shahid wrote:
> This patch series is a proof-of-concept RFC for an end-to-end implementation of
> Address Space Isolation for KVM. It has similar goals and a somewhat similar
> high-level design as the original ASI patches from Alexandre Chartre
> ([1],[2],[3],[4]), but with a different underlying implementation. This also
> includes several memory management changes to help with differentiating between
> sensitive and non-sensitive memory and mapping of non-sensitive memory into the
> ASI restricted address spaces.
>
> This RFC is intended as a demonstration of what a full ASI implementation for
> KVM could look like, not necessarily as a direct proposal for what might
> eventually be merged. In particular, these patches do not yet implement KPTI on
> top of ASI, although the framework is generic enough to be able to support it.
> Similarly, these patches do not include non-sensitive annotations for data
> structures that did not get frequently accessed during execution of our test
> workloads, but the framework is designed such that new non-sensitive memory
> annotations can be added trivially.
>
> The patches apply on top of Linux v5.16. These patches are also available via
> gerrit at https://linux-review.googlesource.com/q/topic:asi-rfc.

[+Cc slab maintainers/reviewers]

Please Cc the relevant people.
Patches 14, 24 and 31 need to be reviewed by the slab people :)

--
Thank you, You are awesome!
Hyeonggon :-)

2022-03-15 07:33:45

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [RFC PATCH 04/47] mm: asi: ASI support in interrupts/exceptions

On Tue, Feb 22 2022 at 21:21, Junaid Shahid wrote:
> #define DEFINE_IDTENTRY_RAW(func) \
> -__visible noinstr void func(struct pt_regs *regs)
> +static __always_inline void __##func(struct pt_regs *regs); \
> + \
> +__visible noinstr void func(struct pt_regs *regs) \
> +{ \
> + asi_intr_enter(); \

This is wrong. You cannot invoke arbitrary code within a noinstr
section.

Please enable CONFIG_VMLINUX_VALIDATION and watch the build result with
and without your patches.

Thanks,

tglx

2022-03-15 07:53:13

by Junaid Shahid

[permalink] [raw]
Subject: Re: [RFC PATCH 04/47] mm: asi: ASI support in interrupts/exceptions

On 3/14/22 08:50, Thomas Gleixner wrote:
> On Tue, Feb 22 2022 at 21:21, Junaid Shahid wrote:
>> #define DEFINE_IDTENTRY_RAW(func) \
>> -__visible noinstr void func(struct pt_regs *regs)
>> +static __always_inline void __##func(struct pt_regs *regs); \
>> + \
>> +__visible noinstr void func(struct pt_regs *regs) \
>> +{ \
>> + asi_intr_enter(); \
>
> This is wrong. You cannot invoke arbitrary code within a noinstr
> section.
>
> Please enable CONFIG_VMLINUX_VALIDATION and watch the build result with
> and without your patches.
>
> Thanks,
>
> tglx

Thank you for the pointer. It seems that marking asi_intr_enter/exit and asi_enter/exit, and the few functions that they in turn call, as noinstr would fix this, correct? (Along with removing the VM_BUG_ONs from those functions and using notrace/nodebug variants of a couple of functions).
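
I.e., roughly this kind of change (just a sketch of what I have in mind, not the actual fix):

	/* Excluded from instrumentation so it can be called from the
	 * noinstr entry sections. */
	noinstr void asi_enter(struct asi *asi)
	{
		/* ... existing body, with the instrumented helpers
		 * swapped for their notrace/nodebug variants ... */
	}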

Thanks,
Junaid

2022-03-16 13:49:43

by Junaid Shahid

[permalink] [raw]
Subject: Re: [RFC PATCH 04/47] mm: asi: ASI support in interrupts/exceptions

Hi Thomas,

On 3/15/22 05:55, Thomas Gleixner wrote:
>>>
>>> This is wrong. You cannot invoke arbitrary code within a noinstr
>>> section.
>>>
>>> Please enable CONFIG_VMLINUX_VALIDATION and watch the build result with
>>> and without your patches.
>>>
>> Thank you for the pointer. It seems that marking asi_intr_enter/exit
>> and asi_enter/exit, and the few functions that they in turn call, as
>> noinstr would fix this, correct? (Along with removing the VM_BUG_ONs
>> from those functions and using notrace/nodebug variants of a couple of
>> functions).
>
> you can keep the BUG_ON()s. If such a bug happens the noinstr
> correctness is the least of your worries, but it's important to get the
> information out, right?

Yes, that makes sense :)

>
> Vs. adding noinstr. Yes, making the full callchain noinstr is going to
> cure it, but you really want to think hard whether these calls need to
> be in this section of the exception handlers.
>
> These code sections have other constraints aside of being excluded from
> instrumentation, the main one being that you cannot use RCU there.

Neither of these functions needs to use RCU, so that should be OK. Are there any other constraints that could matter here?

>
> I'm not yet convinced that asi_intr_enter()/exit() need to be invoked in
> exactly the places you put it. The changelog does not give any clue
> about the why...

I had to place these calls early in the exception/interrupt handlers, specifically before the point where things like tracing and lockdep can be used (and, for the asi_intr_exit() calls, after the point where they can no longer be used). Otherwise, we would need to map all the data structures touched by the tracing/lockdep infrastructure into the ASI restricted address spaces.

Basically, if, while running in a restricted address space, some kernel code touches memory which is not mapped in the restricted address space, it takes an implicit ASI exit via the page fault handler and continues running, so that is just a small performance hit rather than a fatal issue. But there are 3 critical code regions where this implicit ASI exit mechanism doesn't apply:

1. The region between an asi_enter() call and the asi_set_target_unrestricted() call.
2. The region between the start of an interrupt/exception handler and the asi_intr_enter() call.
3. The region between the asi_intr_exit() call and the IRET.

Any memory accessed by code in these regions has to be mapped in the restricted address space, which is why I tried to place the asi_intr_enter/exit calls fairly early/late in the handlers. It is possible to move them further in, but if we accidentally miss annotating some data needed in those regions, it could potentially be fatal in some situations.
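
To make the interrupt-side regions concrete, the shape is roughly the following (a simplified sketch based on the DEFINE_IDTENTRY_RAW wrapper, not the exact code in the patch):

	__visible noinstr void func(struct pt_regs *regs)
	{
		/*
		 * Region 2: from the handler entry up to this call,
		 * anything touched must already be mapped in the
		 * restricted address space.
		 */
		asi_intr_enter();

		__func(regs);	/* an implicit ASI exit via #PF is OK here */

		asi_intr_exit();
		/* Region 3: from here until the IRET, the same constraint applies. */
	}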

Thanks,
Junaid

>
> Thanks,
>
> tglx
>
>
>

2022-03-17 01:28:12

by Alexandre Chartre

[permalink] [raw]
Subject: Re: [RFC PATCH 00/47] Address Space Isolation for KVM


Hi Junaid,

On 2/23/22 06:21, Junaid Shahid wrote:
> This patch series is a proof-of-concept RFC for an end-to-end implementation of
> Address Space Isolation for KVM. It has similar goals and a somewhat similar
> high-level design as the original ASI patches from Alexandre Chartre
> ([1],[2],[3],[4]), but with a different underlying implementation. This also
> includes several memory management changes to help with differentiating between
> sensitive and non-sensitive memory and mapping of non-sensitive memory into the
> ASI restricted address spaces.
>
> This RFC is intended as a demonstration of what a full ASI implementation for
> KVM could look like, not necessarily as a direct proposal for what might
> eventually be merged. In particular, these patches do not yet implement KPTI on
> top of ASI, although the framework is generic enough to be able to support it.
> Similarly, these patches do not include non-sensitive annotations for data
> structures that did not get frequently accessed during execution of our test
> workloads, but the framework is designed such that new non-sensitive memory
> annotations can be added trivially.
>
> The patches apply on top of Linux v5.16. These patches are also available via
> gerrit at https://linux-review.googlesource.com/q/topic:asi-rfc.
>
Sorry for the late answer, and thanks for investigating possible ASI
implementations. I have to admit I put ASI on the back burner for
a while because I am more and more wondering if the complexity of
ASI is worth the benefit, especially given the challenges of effectively
exploiting the flaws that ASI is expected to mitigate, in particular
when VMs are running on dedicated CPU cores, or when core scheduling is
used. So I have been looking at a simpler approach (see below, A
Possible Alternative to ASI).

But first, your implementation confirms that KVM-ASI can be broken up
into different parts: pagetable management, the ASI core, and sibling
CPU synchronization.

Pagetable Management
====================
For ASI, we need to build a pagetable with a subset of the kernel
pagetable mappings. Your solution is interesting as it provides
a broad solution and also works well with dynamic allocations (while
my approach of copying mappings had several limitations). The drawback
is the extent of your changes, which spread over all the mm code
(while the simple solution of copying mappings can be done with a few
self-contained, independent functions).

ASI Core
========

KPTI
----
Implementing KPTI with ASI is possible but not straightforward.
It requires some special handling, in particular in the assembly kernel
entry/exit code for syscalls, interrupts and exceptions (see ASI RFC v4
[4] as an example), because we are also switching privilege level in
addition to switching the pagetable. So this might be something to
consider early in your implementation to ensure it is effectively
compatible with KPTI.

Going beyond KPTI (with a KPTI-next) and trying to execute most
syscalls/interrupts without switching to the full kernel address space
is more challenging, because it would require many more kernel mappings
in the user pagetable, and that would basically defeat the purpose of
KPTI. You can refer to the discussion of the RFC to defer the CR3 switch
to C code [7], which was an attempt to just reach the kernel entry C
code with a KPTI pagetable.

Interrupts/Exceptions
---------------------
As long as interrupts/exceptions are not expected to be processed with
ASI, it is probably better to explicitly exit ASI before processing an
interrupt/exception; otherwise you will incur extra overhead on each
interrupt/exception to take a page fault and then exit ASI.

This is particularly true if you want KPTI to use ASI, and in that case
the ASI exit will need to be done early in the interrupt and exception
assembly entry code.

ASI Hooks
---------
ASI hooks are certainly a good idea for performing specific actions on
ASI enter or exit. However, I am not sure they are the appropriate place
for CPU stunning with KVM-ASI. That's because CPU stunning doesn't need
to be done precisely when entering and exiting ASI, and it probably
shouldn't be done there: it should be done right before VMEnter and
right after VMExit (see below).

Sibling CPUs Synchronization
============================
KVM-ASI requires the synchronization of sibling CPUs from the same CPU
core so that while a VM is running, the sibling CPUs are running with
the ASI associated with this VM (or an ASI compatible with the VM,
depending on how ASI is defined). That way the VM can only spy on data
from ASI and won't be able to access any sensitive data.

So, right before entering a VM, KVM should ensure that sibling CPUs are
using ASI. If a sibling CPU is not using ASI, then KVM can either wait
for that sibling to enter ASI, or force it to use ASI (or to become
idle). This behavior should be enforced as long as any sibling is
running the VM. Once no sibling is running the VM, the other siblings
can run any code (using ASI or not).

It would be interesting to see the code you use to achieve this, because
I don't get how this is achieved from the description of your sibling
hyperthread stun and unstun mechanism.

Note that this synchronization is critical for ASI to work; in
particular, when entering the VM we need to be absolutely sure that the
sibling CPUs are effectively using ASI. The core scheduling sibling
stunning code you referenced [6] uses a mechanism which is fine for
userspace synchronization (the delivery of the IPI forces the sibling to
immediately enter the kernel), but this won't work for ASI, as the
delivery of the IPI won't guarantee that the sibling has entered ASI
yet. I did some experiments that show that data will leak if siblings
are not perfectly synchronized.

A Possible Alternative to ASI?
=============================
ASI prevents access to sensitive data by unmapping it. On the other
hand, the KVM code somewhat already identifies accesses to sensitive
data as part of the L1TF/MDS mitigation: when KVM is about to access
sensitive data, it sets l1tf_flush_l1d to true (so that the L1D gets
flushed before VMEnter).

With KVM knowing when it accesses sensitive data, I think we can provide
the same mitigation as ASI by simply allowing KVM code which doesn't
access sensitive data to run concurrently with a VM. This can be done
by tagging the kernel thread when it enters KVM code which doesn't
access sensitive data, and untagging the thread right before it accesses
sensitive data. And when KVM is about to do a VMEnter, we synchronize
the sibling CPUs so that they run threads with the same tag. Sounds
familiar? Yes, because that's similar to core scheduling, but inside the
kernel (let's call it "kernel core scheduling").
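
Very roughly, the idea would look something like this (purely
illustrative pseudo-code, all names are made up):

	/* Tag for KVM code known not to touch sensitive data. */
	enum kvm_code_tag { KVM_TAG_SENSITIVE, KVM_TAG_NONSENSITIVE };

	static inline void kvm_tag_thread(enum kvm_code_tag tag)
	{
		current->kvm_code_tag = tag;	/* hypothetical field */
	}

	/* Wherever KVM today sets l1tf_flush_l1d before touching
	 * sensitive data, it would also drop the non-sensitive tag: */
	kvm_tag_thread(KVM_TAG_SENSITIVE);

	/* Right before VMEnter, wait for (or force) the siblings to only
	 * run threads carrying the non-sensitive tag: */
	kvm_sync_siblings(KVM_TAG_NONSENSITIVE);	/* hypothetical */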

I think the benefit of this approach would be that it should be much
simpler to implement and less invasive than ASI, and it doesn't preclude
doing ASI later: ASI can be added on top and provide an extra level
of mitigation in case some sensitive data is still accessed by KVM.
Also, it would provide the critical sibling CPU synchronization
mechanism that we also need with ASI.

I did some prototyping to implement this kernel core scheduling a while
ago (and then got diverted onto other stuff), but so far performance has
been abysmal, especially when doing strict synchronization between
sibling CPUs. I am planning to go back and do more investigation when I
have cycles, but probably not that soon.


alex.

[4] https://lore.kernel.org/lkml/[email protected]
[6] https://lore.kernel.org/lkml/[email protected]
[7] https://lore.kernel.org/lkml/[email protected]


> Background
> ==========
> Address Space Isolation is a comprehensive security mitigation for several types
> of speculative execution attacks. Even though the kernel already has several
> speculative execution vulnerability mitigations, some of them can be quite
> expensive if enabled fully e.g. to fully mitigate L1TF using the existing
> mechanisms requires doing an L1 cache flush on every single VM entry as well as
> disabling hyperthreading altogether. (Although core scheduling can provide some
> protection when hyperthreading is enabled, it is not sufficient by itself to
> protect against all leaks unless sibling hyperthread stunning is also performed
> on every VM exit.) ASI provides a much less expensive mitigation for such
> vulnerabilities while still providing an almost similar level of protection.
>
> There are a couple of basic insights/assumptions behind ASI:
>
> 1. Most execution paths in the kernel (especially during virtual machine
> execution) access only memory that is not particularly sensitive even if it were
> to get leaked to the executing process/VM (setting aside for a moment what
> exactly should be considered sensitive or non-sensitive).
> 2. Even when executing speculatively, the CPU can generally only bring memory
> that is mapped in the current page tables into its various caches and internal
> buffers.
>
> Given these, the idea of using ASI to thwart speculative attacks is that we can
> execute the kernel using a restricted set of page tables most of the time and
> switch to the full unrestricted kernel address space only when the kernel needs
> to access something that is not mapped in the restricted address space. And we
> keep track of when a switch to the full kernel address space is done, so that
> before returning back to the process/VM, we can switch back to the restricted
> address space. In the paths where the kernel is able to execute entirely while
> remaining in the restricted address space, we can skip other mitigations for
> speculative execution attacks (such as L1 cache / micro-arch buffer flushes,
> sibling hyperthread stunning etc.). Only in the cases where we do end up
> switching the page tables, we perform these more expensive mitigations. Assuming
> that happens relatively infrequently, the performance can be significantly
> better compared to performing these mitigations all the time.
>
> Please note that although we do have a sibling hyperthread stunning
> implementation internally, which is fully integrated with KVM-ASI, it is not
> included in this RFC for the time being. The earlier upstream proposal for
> sibling stunning [6] could potentially be integrated into an upstream ASI
> implementation.
>
> Basic concepts
> ==============
> Different types of restricted address spaces are represented by different ASI
> classes. For instance, KVM-ASI is an ASI class used during VM execution. KPTI
> would be another ASI class. An ASI instance (struct asi) represents a single
> restricted address space. There is a separate ASI instance for each untrusted
> context (e.g. a userspace process, a VM, or even a single VCPU etc.) Note that
> there can be multiple untrusted security contexts (and thus multiple restricted
> address spaces) within a single process e.g. in the case of VMs, the userspace
> process is a different security context than the guest VM, and in principle,
> even each VCPU could be considered a separate security context (That would be
> primarily useful for securing nested virtualization).
>
> In this RFC, a process can have at most one ASI instance of each class, though
> this is not an inherent limitation and multiple instances of the same class
> should eventually be supported. (A process can still have ASI instances of
> different classes e.g. KVM-ASI and KPTI.) In fact, in principle, it is not even
> entirely necessary to tie an ASI instance to a process. That is just a
> simplification for the initial implementation.
>
> An asi_enter operation switches into the restricted address space represented by
> the given ASI instance. An asi_exit operation switches to the full unrestricted
> kernel address space. Each ASI class can provide hooks to be executed during
> these operations, which can be used to perform speculative attack mitigations
> relevant to that class. For instance, the KVM-ASI hooks would perform a
> sibling-hyperthread-stun operation in the asi_exit hook, and L1-flush/MDS-clear
> and sibling-hyperthread-unstun operations in the asi_enter hook. On the other
> hand, the hooks for the KPTI class would be NO-OP, since the switching of the
> page tables is enough mitigation in that case.
>
> If the kernel attempts to access memory that is not mapped in the currently
> active ASI instance, the page fault handler automatically performs an asi_exit
> operation. This means that, except for a few critical pieces of memory, leaving
> something out of a restricted address space will result in only a performance
> hit, rather than a catastrophic failure. The kernel can also perform explicit
> asi_exit operations in some paths as needed.
>
> Apart from the page fault handler, other exceptions and interrupts (even NMIs)
> do not automatically cause an asi_exit and could potentially be executed
> completely within a restricted address space if they don't end up accessing any
> sensitive piece of memory.
>
> The mappings within a restricted address space are always a subset of the full
> kernel address space and each mapping is always the same as the corresponding
> mapping in the full kernel address space. This is necessary because we could
> potentially end up performing an asi_exit at any point.
>
> Although this RFC only includes an implementation of the KVM-ASI class, a KPTI
> class could also be implemented on top of the same infrastructure. Furthermore,
> in the future we could also implement a KPTI-Next class that actually uses the
> ASI model for userspace processes i.e. mapping non-sensitive kernel memory in
> the restricted address space and trying to execute most syscalls/interrupts
> without switching to the full kernel address space, as opposed to the current
> KPTI which requires an address space switch on every kernel/user mode
> transition.
>
> Memory classification
> =====================
> We divide memory into three categories.
>
> 1. Sensitive memory
> This is memory that should never get leaked to any process or VM. Sensitive
> memory is only mapped in the unrestricted kernel page tables. By default, all
> memory is considered sensitive unless specifically categorized otherwise.
>
> 2. Globally non-sensitive memory
> This is memory that does not present a substantial security threat even if it
> were to get leaked to any process or VM in the system. Globally non-sensitive
> memory is mapped in the restricted address spaces for all processes.
>
> 3. Locally non-sensitive memory
> This is memory that does not present a substantial security threat if it were to
> get leaked to the currently running process or VM, but would present a security
> issue if it were to get leaked to any other process or VM in the system.
> Examples include userspace memory (or guest memory in the case of VMs) or kernel
> structures containing userspace/guest register context etc. Locally
> non-sensitive memory is mapped only in the restricted address space of a single
> process.
>
> Various mechanisms are provided to annotate different types of memory (static,
> buddy allocator, slab, vmalloc etc.) as globally or locally non-sensitive. In
> addition, the ASI infrastructure takes care to ensure that different classes of
> memory do not share the same physical page. This includes separation of
> sensitive, globally non-sensitive and locally non-sensitive memory into
> different pages, as well as separation of locally non-sensitive memory for
> different processes into different pages.
>
> What exactly should be considered non-sensitive (either globally or locally) is
> somewhat open-ended. Some things are clearly sensitive or non-sensitive, but
> many things also fall into a gray area, depending on how paranoid one wants to
> be. For this proof of concept, we have generally treated such things as
> non-sensitive, though that may not necessarily be the ideal classification in
> each case. Similarly, there is also a gray area between globally and locally
> non-sensitive classifications in some cases, and in those cases this RFC has
> mostly erred on the side of marking them as locally non-sensitive, even though
> many of those cases could likely be safely classified as globally non-sensitive.
>
> Although this implementation includes fairly extensive support for marking most
> types of dynamically allocated memory as locally non-sensitive, it is possibly
> feasible, at least for KVM-ASI, to get away with a simpler implementation (such
> as [5]), if we are very selective about what memory we treat as locally
> non-sensitive (as opposed to globally non-sensitive). Nevertheless, the more
> general mechanism is included in this proof of concept as an illustration for
> what could be done if we really needed to treat any arbitrary kernel memory as
> locally non-sensitive.
>
> It is also possible to have ASI classes that do not utilize the above described
> infrastructure and instead manage all the memory mappings inside the restricted
> address space on their own.
>
>
> References
> ==========
> [1] https://lore.kernel.org/lkml/[email protected]
> [2] https://lore.kernel.org/lkml/[email protected]
> [3] https://lore.kernel.org/lkml/[email protected]
> [4] https://lore.kernel.org/lkml/[email protected]
> [5] https://lore.kernel.org/lkml/[email protected]
> [6] https://lore.kernel.org/lkml/[email protected]
>

2022-03-17 03:41:44

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [RFC PATCH 04/47] mm: asi: ASI support in interrupts/exceptions

Junaid,

On Mon, Mar 14 2022 at 19:01, Junaid Shahid wrote:
> On 3/14/22 08:50, Thomas Gleixner wrote:
>> On Tue, Feb 22 2022 at 21:21, Junaid Shahid wrote:
>>> #define DEFINE_IDTENTRY_RAW(func) \
>>> -__visible noinstr void func(struct pt_regs *regs)
>>> +static __always_inline void __##func(struct pt_regs *regs); \
>>> + \
>>> +__visible noinstr void func(struct pt_regs *regs) \
>>> +{ \
>>> + asi_intr_enter(); \
>>
>> This is wrong. You cannot invoke arbitrary code within a noinstr
>> section.
>>
>> Please enable CONFIG_VMLINUX_VALIDATION and watch the build result with
>> and without your patches.
>>
> Thank you for the pointer. It seems that marking asi_intr_enter/exit
> and asi_enter/exit, and the few functions that they in turn call, as
> noinstr would fix this, correct? (Along with removing the VM_BUG_ONs
> from those functions and using notrace/nodebug variants of a couple of
> functions).

you can keep the BUG_ON()s. If such a bug happens the noinstr
correctness is the least of your worries, but it's important to get the
information out, right?

Vs. adding noinstr. Yes, making the full callchain noinstr is going to
cure it, but you really want to think hard whether these calls need to
be in this section of the exception handlers.

These code sections have other constraints aside of being excluded from
instrumentation, the main one being that you cannot use RCU there.

I'm not yet convinced that asi_intr_enter()/exit() need to be invoked in
exactly the places you put it. The changelog does not give any clue
about the why...

Thanks,

tglx



2022-03-17 06:11:19

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [RFC PATCH 00/47] Address Space Isolation for KVM

Junaid,

On Tue, Feb 22 2022 at 21:21, Junaid Shahid wrote:
>
> The patches apply on top of Linux v5.16.

Why are you posting patches against some randomly chosen release?

Documentation/process/ is pretty clear about how this works. It's not
optional.

> These patches are also available via
> gerrit at https://linux-review.googlesource.com/q/topic:asi-rfc.

This is useful because?

If you want to provide patches in a usable form then please expose them
as git tree which can be pulled and not via the random tool of the day.

Thanks,

tglx


2022-03-17 21:56:06

by Junaid Shahid

[permalink] [raw]
Subject: Re: [RFC PATCH 00/47] Address Space Isolation for KVM

On 3/16/22 15:49, Thomas Gleixner wrote:
> Junaid,
>
> On Tue, Feb 22 2022 at 21:21, Junaid Shahid wrote:
>>
>> The patches apply on top of Linux v5.16.
>
> Why are you posting patches against some randomly chosen release?
>
> Documentation/process/ is pretty clear about how this works. It's not
> optional.

Sorry, I assumed that for an RFC it might be acceptable to base it on the latest release version, but it looks like I guessed wrong. I will base the next version of the RFC on the HEAD of the Linus tree.

>
>> These patches are also available via
>> gerrit at https://linux-review.googlesource.com/q/topic:asi-rfc.
>
> This is useful because?
>
> If you want to provide patches in a usable form then please expose them
> as git tree which can be pulled and not via the random tool of the day.

The patches are now available as the branch "asi-rfc-v1" in the git repo https://github.com/googleprodkernel/linux-kvm.git

Thanks,
Junaid

>
> Thanks,
>
> tglx
>
>

2022-03-18 00:12:28

by Junaid Shahid

[permalink] [raw]
Subject: Re: [RFC PATCH 00/47] Address Space Isolation for KVM

Hi Alex,

On 3/16/22 14:34, Alexandre Chartre wrote:
>
> Hi Junaid,
>
> On 2/23/22 06:21, Junaid Shahid wrote:
>> This patch series is a proof-of-concept RFC for an end-to-end implementation of
>> Address Space Isolation for KVM. It has similar goals and a somewhat similar
>> high-level design as the original ASI patches from Alexandre Chartre
>> ([1],[2],[3],[4]), but with a different underlying implementation. This also
>> includes several memory management changes to help with differentiating between
>> sensitive and non-sensitive memory and mapping of non-sensitive memory into the
>> ASI restricted address spaces.
>>
>> This RFC is intended as a demonstration of what a full ASI implementation for
>> KVM could look like, not necessarily as a direct proposal for what might
>> eventually be merged. In particular, these patches do not yet implement KPTI on
>> top of ASI, although the framework is generic enough to be able to support it.
>> Similarly, these patches do not include non-sensitive annotations for data
>> structures that did not get frequently accessed during execution of our test
>> workloads, but the framework is designed such that new non-sensitive memory
>> annotations can be added trivially.
>>
>> The patches apply on top of Linux v5.16. These patches are also available via
>> gerrit at https://linux-review.googlesource.com/q/topic:asi-rfc.
>>
> Sorry for the late answer, and thanks for investigating possible ASI
> implementations. I have to admit I put ASI on the back-burner for
> a while because I am more and more wondering if the complexity of
> ASI is worth the benefit, especially given the challenges of effectively
> exploiting the flaws that ASI is expected to mitigate, in particular when VMs
> are running on dedicated cpu cores, or when core scheduling is used.
> So I have been looking at a simpler approach (see below, A
> Possible Alternative to ASI).
>
> But first, your implementation confirms that KVM-ASI can be broken up
> into different parts: pagetable management, ASI core and sibling cpus
> synchronization.
>
> Pagetable Management
> ====================
> For ASI, we need to build a pagetable with a subset of the kernel
> pagetable mappings. Your solution is interesting as it provides
> a broad solution and also works well with dynamic allocations (while
> my approach to copy mappings had several limitations). The drawback
> is the extent of your changes, which spread over all the mm code
> (while the simple solution to copy mappings can be done with a few
> self-contained independent functions).
>
> ASI Core
> ========
>
> KPTI
> ----
> Implementing KPTI with ASI is possible but this is not straightforward.
> This requires some special handling in particular in the assembly kernel
> entry/exit code for syscall, interrupt and exception (see ASI RFC v4 [4]
> as an example) because we are also switching privilege level in addition
> to switching the pagetable. So this might be something to consider early
> in your implementation to ensure it is effectively compatible with KPTI.

Yes, I will look in more detail into how to implement KPTI on top of this ASI implementation, but at least at a high level, it seems that it should work. Of course, the devil is always in the details :)

>
> Going beyond KPTI (with a KPTI-next) and trying to execute most
> syscalls/interrupts without switching to the full kernel address space
> is more challenging, because it would require many more kernel mappings
> in the user pagetable, and this would basically defeat the purpose of
> KPTI. You can refer to discussions about the RFC to defer the CR3 switch
> to C code [7], which was an attempt to just reach the kernel entry C
> code with a KPTI pagetable.

In principle, the ASI restricted address space would not contain any sensitive data, so having more mappings should be ok as long as they really are non-sensitive. Of course, it is possible that we mistakenly mark some data as non-sensitive when it really is sensitive in some way. In that sense, a strict KPTI is certainly a little more secure than the KPTI-Next that I mentioned, but KPTI-Next would also have lower performance overhead compared to the strict KPTI.

>
> Interrupts/Exceptions
> ---------------------
> As long as interrupts/exceptions are not expected to be processed with
> ASI, it is probably better to explicitly exit ASI before processing an
> interrupt/exception, otherwise you will have extra overhead on each
> interrupt/exception to take a page fault and then exit ASI.

I agree that for those interrupts/exceptions that will need to access sensitive data, it is better to do an explicit ASI Exit at the start. But it is probably possible for many interrupts to be handled without needing to access sensitive data, in which case, it would be better to avoid the ASI Exit.
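
For illustration, the difference is roughly the following (asi_exit() is the operation described in the cover letter; the handler and the helper it calls are made-up examples, and the exact signatures are assumptions rather than actual RFC code):

/*
 * Sketch: a handler that is known to touch sensitive data exits ASI up
 * front, instead of paying for a page fault on its first sensitive access
 * and exiting ASI from the fault handler. Handlers that only touch
 * non-sensitive data skip this and can run entirely inside the restricted
 * address space.
 */
static irqreturn_t sensitive_device_irq(int irq, void *dev_id)
{
        asi_exit();                      /* switch to the full kernel page tables */
        process_sensitive_state(dev_id); /* hypothetical work on sensitive memory */
        return IRQ_HANDLED;
}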

>
> This is particularly true if you want to have KPTI use ASI, and
> in that case the ASI exit will need to be done early in the interrupt
> and exception assembly entry code.
>
> ASI Hooks
> ---------
> ASI hooks are certainly a good idea to perform specific actions on ASI
> enter or exit. However, I am not sure they are appropriate places for cpu
> stunning with KVM-ASI. That's because cpu stunning doesn't need to be
> done precisely when entering and exiting ASI, and it probably shouldn't be
> done there: it should be done right before VMEnter and right after VMExit
> (see below).
>

I believe that performing sibling CPU stunning right after VM Exit will negate most of the performance advantage of ASI. I think that it is feasible to do the stunning on ASI Exit. Please see below for how we handle the problem that you have mentioned.


> Sibling CPUs Synchronization
> ============================
> KVM-ASI requires the synchronization of sibling CPUs from the same CPU
> core so that when a VM is running then sibling CPUs are running with the
> ASI associated with this VM (or an ASI compatible with the VM, depending
> on how ASI is defined). That way the VM can only spy on data from ASI
> and won't be able to access any sensitive data.
>
> So, right before entering a VM, KVM should ensure that sibling CPUs are
> using ASI. If a sibling CPU is not using ASI then KVM can either wait for
> that sibling to run ASI, or force it to use ASI (or to become idle).
> This behavior should be enforced as long as any sibling is running the
> VM. When all siblings are not running the VM then other siblings can run
> any code (using ASI or not).
>
> It would be interesting to see the code you use to achieve this, because
> I don't get how this is achieved from the description of your sibling
> hyperthread stun and unstun mechanism.
>
> Note that this synchronization is critical for ASI to work, in particular
> when entering the VM, we need to be absolutely sure that sibling CPUs are
> effectively using ASI. The core scheduling sibling stunning code you
> referenced [6] uses a mechanism which is fine for userspace synchronization
> (the delivery of the IPI forces the sibling to immediately enter the kernel)
> but this won't work for ASI as the delivery of the IPI won't guarantee that
> the sibling has entered ASI yet. I did some experiments that show that data
> will leak if siblings are not perfectly synchronized.

I agree that it is not secure to run one sibling in the unrestricted kernel address space while the other sibling is running in an ASI restricted address space, without doing a cache flush before re-entering the VM. However, I think that avoiding this situation does not require doing a sibling stun operation immediately after VM Exit. The way we avoid it is as follows.

First, we always use ASI in conjunction with core scheduling. This means that if HT0 is running a VCPU thread, then HT1 will be running either a VCPU thread of the same VM or the Idle thread. If it is running a VCPU thread, then if/when that thread takes a VM Exit, it will also be running in the same ASI restricted address space. For the idle thread, we have created another ASI Class, called Idle-ASI, which maps only globally non-sensitive kernel memory. The idle loop enters this ASI address space.
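
As a rough sketch of the idle-thread side (asi_enter() is the operation from the cover letter; the Idle-ASI instance name and the loop structure are placeholders for illustration, not the actual patch code):

/*
 * Sketch: each iteration of the idle loop (re-)enters the Idle-ASI
 * restricted address space, which maps only globally non-sensitive
 * memory, so an idle sibling never has sensitive mappings while the
 * other sibling runs the guest.
 */
static void idle_loop_iteration(void)
{
        asi_enter(idle_asi);    /* placeholder for the real Idle-ASI instance */
        while (!need_resched())
                cpu_relax();    /* stand-in for the actual arch idle (hlt/mwait) */
}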

This means that when HT0 does a VM Exit, HT1 will either be running the guest code of a VCPU of the same VM, or it will be running kernel code in either a KVM-ASI or the Idle-ASI address space. (If HT1 is already running in the full kernel address space, that would imply that it had previously done an ASI Exit, which would have triggered a stun_sibling, which would have already caused HT0 to exit the VM and wait in the kernel).

If HT1 now does an ASI Exit, that will trigger the stun_sibling() operation in its pre_asi_exit() handler, which will set the state of the core/HT0 to Stunned (and possibly send an IPI too, though that will be ignored if HT0 was already in kernel mode). Now when HT0 tries to re-enter the VM, since its state is set to Stunned, it will just wait in a loop until HT1 does an unstun_sibling() operation, which it will do in its post_asi_enter handler the next time it does an ASI Enter (which would be either just before VM Enter if it was KVM-ASI, or in the next iteration of the idle loop if it was Idle-ASI). In either case, HT1's post_asi_enter() handler would also do a flush_sensitive_cpu_state operation before the unstun_sibling(), so when HT0 gets out of its wait-loop and does a VM Enter, there will not be any sensitive state left.

One thing that probably was not clear from the patch is that the stun state check and wait-loop is still always executed before VM Enter, even if no ASI Exit happened in that execution.
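
To make that more concrete, the pieces fit together roughly as in the sketch below. The hook and helper names (pre_asi_exit, post_asi_enter, stun_sibling, unstun_sibling, flush_sensitive_cpu_state) are the ones used in this discussion; the state-check helper and all signatures are assumptions for illustration, not the actual patch code.

/* KVM-ASI class hooks. */

static void kvm_asi_pre_asi_exit(void)
{
        /* Leaving the restricted address space: keep the sibling from
         * (re-)entering the guest until this CPU is back in ASI. */
        stun_sibling();                 /* marks the core Stunned, may send an IPI */
}

static void kvm_asi_post_asi_enter(void)
{
        /* Back in the restricted address space: purge any sensitive CPU
         * state, then allow the sibling to enter the guest again. */
        flush_sensitive_cpu_state();    /* e.g. L1D flush / MDS buffer clear */
        unstun_sibling();
}

/* Executed before every VM Enter, even if no ASI Exit happened. */
static bool kvm_asi_wait_until_unstunned(void)
{
        while (this_core_is_stunned()) {        /* hypothetical state check */
                if (need_resched())
                        return false;           /* bail out to the scheduler */
                cpu_relax();
        }
        return true;                            /* safe to do VM Enter */
}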

>
> A Possible Alternative to ASI?
> =============================
> ASI prevents access to sensitive data by unmapping them. On the other
> hand, the KVM code somewhat already identifies access to sensitive data
> as part of the L1TF/MDS mitigation, and when KVM is about to access
> sensitive data then it sets l1tf_flush_l1d to true (so that L1D gets
> flushed before VMEnter).
>
> With KVM knowing when it accesses sensitive data, I think we can provide
> the same mitigation as ASI by simply allowing KVM code which doesn't
> access sensitive data to be run concurrently with a VM. This can be done
> by tagging the kernel thread when it enters KVM code which doesn't
> access sensitive data, and untagging the thread right before it accesses
> sensitive data. And when KVM is about to do a VMEnter then we synchronize
> sibling CPUs so that they run threads with the same tag. Sound familiar?
> Yes, because that's similar to core scheduling but inside the kernel
> (let's call it "kernel core scheduling").
>
> I think the benefit of this approach would be that it should be much
> simpler to implement and less invasive than ASI, and it doesn't preclude
> doing ASI later: ASI can be done in addition and provide an extra level
> of mitigation in case some sensitive data is still accessed by KVM. Also it
> would provide the critical sibling CPU synchronization mechanism that
> we also need with ASI.
>
> I did some prototyping to implement this kernel core scheduling a while
> ago (and then got diverted onto other stuff) but so far performance has
> been abysmal, especially when doing strict synchronization between
> sibling CPUs. I am planning to go back and do more investigations when I
> have cycles but probably not that soon.
>

This also seems like an interesting approach. It does have some different trade-offs compared to ASI. First, there is the trade-off between a blacklist vs. whitelist approach. Secondly, ASI has a more structured approach based on the sensitivity of the data itself, instead of having to audit every new code path to verify whether or not it can potentially access any sensitive data. On the other hand, as you point out, this approach is much simpler than ASI, which is certainly a plus.

Thanks,
Junaid

2022-03-22 11:07:56

by Alexandre Chartre

[permalink] [raw]
Subject: Re: [RFC PATCH 00/47] Address Space Isolation for KVM


On 3/18/22 00:25, Junaid Shahid wrote:
>> ASI Core
>> ========
>>
>> KPTI
>> ----
>> Implementing KPTI with ASI is possible but this is not straightforward.
>> This requires some special handling in particular in the assembly kernel
>> entry/exit code for syscall, interrupt and exception (see ASI RFC v4 [4]
>> as an example) because we are also switching privilege level in addition
>> to switching the pagetable. So this might be something to consider early
>> in your implementation to ensure it is effectively compatible with KPTI.
>
> Yes, I will look in more detail into how to implement KPTI on top of
> this ASI implementation, but at least at a high level, it seems that
> it should work. Of course, the devil is always in the details :)
>
>>
>> Going beyond KPTI (with a KPTI-next) and trying to execute most
>> syscalls/interrupts without switching to the full kernel address space
>> is more challenging, because it would require much more kernel mapping
>> in the user pagetable, and this would basically defeat the purpose of
>> KPTI. You can refer to discussions about the RFC to defer CR3 switch
>> to C code [7] which was an attempt to just reach the kernel entry C
>> code with a KPTI pagetable.
>
> In principle, the ASI restricted address space would not contain any
> sensitive data, so having more mappings should be ok as long as they
> really are non-sensitive. Of course, it is possible that we may
> mistakenly think that some data is not sensitive and mark it as such,
> but in reality it really was sensitive in some way. In that sense, a
> strict KPTI is certainly a little more secure than the KPTI-Next that
> I mentioned, but KPTI-Next would also have lower performance overhead
> compared to the strict KPTI.
>

Mappings are precisely the issue for KPTI-next. The RFC I submitted shows
that going beyond KPTI might require mapping data which could be deemed
sensitive. Also, there are extra complications that make it difficult to reach
C code with a KPTI page-table. This was discussed in v2 of the "Defer CR3
switch to C code" RFC:
https://lore.kernel.org/all/[email protected]/


>>
>> Interrupts/Exceptions
>> ---------------------
>> As long as interrupts/exceptions are not expected to be processed with
>> ASI, it is probably better to explicitly exit ASI before processing an
>> interrupt/exception, otherwise you will have an extra overhead on each
>> interrupt/exception to take a page fault and then exit ASI.
>
> I agree that for those interrupts/exceptions that will need to access
> sensitive data, it is better to do an explicit ASI Exit at the start.
> But it is probably possible for many interrupts to be handled without
> needing to access sensitive data, in which case, it would be better
> to avoid the ASI Exit.
>
>>
>> This is particularly true if you want to have KPTI use ASI, and
>> in that case the ASI exit will need to be done early in the interrupt
>> and exception assembly entry code.
>>
>> ASI Hooks
>> ---------
>> ASI hooks are certainly a good idea to perform specific actions on ASI
>> enter or exit. However, I am not sure they are appropriate places for cpu
>> stunning with KVM-ASI. That's because cpu stunning doesn't need to be
>> done precisely when entering and exiting ASI, and it probably shouldn't be
>> done there: it should be done right before VMEnter and right after VMExit
>> (see below).
>>
>
> I believe that performing sibling CPU stunning right after VM Exit
> will negate most of the performance advantage of ASI. I think that it
> is feasible to do the stunning on ASI Exit. Please see below for how
> we handle the problem that you have mentioned.
>

Right, I was confused about what exactly you meant by cpu stun/unstun, but
I think it's now clearer with your explanation below.


>
>> Sibling CPUs Synchronization
>> ============================
>> KVM-ASI requires the synchronization of sibling CPUs from the same CPU
>> core so that when a VM is running then sibling CPUs are running with the
>> ASI associated with this VM (or an ASI compatible with the VM, depending
>> on how ASI is defined). That way the VM can only spy on data from ASI
>> and won't be able to access any sensitive data.
>>
>> So, right before entering a VM, KVM should ensure that sibling CPUs are
>> using ASI. If a sibling CPU is not using ASI then KVM can either wait for
>> that sibling to run ASI, or force it to use ASI (or to become idle).
>> This behavior should be enforced as long as any sibling is running the
>> VM. When all siblings are not running the VM then other siblings can run
>> any code (using ASI or not).
>>
>> It would be interesting to see the code you use to achieve this, because
>> I don't get how this is achieved from the description of your sibling
>> hyperthread stun and unstun mechanism.
>>
>> Note that this synchronization is critical for ASI to work, in particular
>> when entering the VM, we need to be absolutely sure that sibling CPUs are
>> effectively using ASI. The core scheduling sibling stunning code you
>> referenced [6] uses a mechanism which is fine for userspace synchronization
>> (the delivery of the IPI forces the sibling to immediately enter the kernel)
>> but this won't work for ASI as the delivery of the IPI won't guarantee that
>> the sibling has entered ASI yet. I did some experiments that show that data
>> will leak if siblings are not perfectly synchronized.
>
> I agree that it is not secure to run one sibling in the unrestricted
> kernel address space while the other sibling is running in an ASI
> restricted address space, without doing a cache flush before
> re-entering the VM. However, I think that avoiding this situation
> does not require doing a sibling stun operation immediately after VM
> Exit. The way we avoid it is as follows.
>
> First, we always use ASI in conjunction with core scheduling. This
> means that if HT0 is running a VCPU thread, then HT1 will be running
> either a VCPU thread of the same VM or the Idle thread. If it is
> running a VCPU thread, then if/when that thread takes a VM Exit, it
> will also be running in the same ASI restricted address space. For
> the idle thread, we have created another ASI Class, called Idle-ASI,
> which maps only globally non-sensitive kernel memory. The idle loop
> enters this ASI address space.
>
> This means that when HT0 does a VM Exit, HT1 will either be running
> the guest code of a VCPU of the same VM, or it will be running kernel
> code in either a KVM-ASI or the Idle-ASI address space. (If HT1 is
> already running in the full kernel address space, that would imply
> that it had previously done an ASI Exit, which would have triggered a
> stun_sibling, which would have already caused HT0 to exit the VM and
> wait in the kernel).

Note that using core scheduling (or not) is a detail; what is important
is whether HTs are running with ASI or not. Running core scheduling will
just improve the chances that all siblings run ASI at the same time
and so improve ASI performance.


> If HT1 now does an ASI Exit, that will trigger the stun_sibling()
> operation in its pre_asi_exit() handler, which will set the state of
> the core/HT0 to Stunned (and possibly send an IPI too, though that
> will be ignored if HT0 was already in kernel mode). Now when HT0
> tries to re-enter the VM, since its state is set to Stunned, it will
> just wait in a loop until HT1 does an unstun_sibling() operation,
> which it will do in its post_asi_enter handler the next time it does
> an ASI Enter (which would be either just before VM Enter if it was
> KVM-ASI, or in the next iteration of the idle loop if it was
> Idle-ASI). In either case, HT1's post_asi_enter() handler would also
> do a flush_sensitive_cpu_state operation before the unstun_sibling(),
> so when HT0 gets out of its wait-loop and does a VM Enter, there will
> not be any sensitive state left.
>
> One thing that probably was not clear from the patch, is that the
> stun state check and wait-loop is still always executed before VM
> Enter, even if no ASI Exit happened in that execution.
>

So if I understand correctly, you have the following sequence:

0 - Initially the state is set to "stunned" for all cpus (i.e. a cpu should
wait before VMEnter)

1 - After ASI Enter: Set sibling state to "unstunned" (i.e. sibling can
do VMEnter)

2 - Before VMEnter : wait while my state is "stunned"

3 - Before ASI Exit : Set sibling state to "stunned" (i.e. sibling should
wait before VMEnter)

I have tried this kind of implementation, and the problem is with step 2
(wait while my state is "stunned"); how do you wait exactly? You can't
just do an active wait, otherwise you have all kinds of problems (depending
on whether you have interrupts enabled or not), especially as you don't know
how long you have to wait (this depends on what the other cpu is doing).

That's why I have been dissociating ASI and cpu stunning (and eventually
moving to only doing kernel core scheduling). Basically I replaced step 2 with
a call to the scheduler to select threads using ASI on all siblings (or
run something else if there are higher-priority threads to run), which means
enabling kernel core scheduling at this point.

>>
>> A Possible Alternative to ASI?
>> =============================
>> ASI prevents access to sensitive data by unmapping them. On the other
>> hand, the KVM code somewhat already identifies access to sensitive data
>> as part of the L1TF/MDS mitigation, and when KVM is about to access
>> sensitive data then it sets l1tf_flush_l1d to true (so that L1D gets
>> flushed before VMEnter).
>>
>> With KVM knowing when it accesses sensitive data, I think we can provide
>> the same mitigation as ASI by simply allowing KVM code which doesn't
>> access sensitive data to be run concurrently with a VM. This can be done
>> by tagging the kernel thread when it enters KVM code which doesn't
>> access sensitive data, and untagging the thread right before it accesses
>> sensitive data. And when KVM is about to do a VMEnter then we synchronize
>> sibling CPUs so that they run threads with the same tag. Sound familiar?
>> Yes, because that's similar to core scheduling but inside the kernel
>> (let's call it "kernel core scheduling").
>>
>> I think the benefit of this approach would be that it should be much
>> simpler to implement and less invasive than ASI, and it doesn't preclude
>> doing ASI later: ASI can be done in addition and provide an extra level
>> of mitigation in case some sensitive data is still accessed by KVM. Also it
>> would provide the critical sibling CPU synchronization mechanism that
>> we also need with ASI.
>>
>> I did some prototyping to implement this kernel core scheduling a while
>> ago (and then got diverted onto other stuff) but so far performance has
>> been abysmal, especially when doing strict synchronization between
>> sibling CPUs. I am planning to go back and do more investigations when I
>> have cycles but probably not that soon.
>>
>
> This also seems like an interesting approach. It does have some
> different trade-offs compared to ASI. First, there is the trade-off
> between a blacklist vs. whitelist approach. Secondly, ASI has a more
> structured approach based on the sensitivity of the data itself,
> instead of having to audit every new code path to verify whether or
> not it can potentially access any sensitive data. On the other hand,
> as you point out, this approach is much simpler than ASI, which is
> certainly a plus.

I think the main benefit is that it provides a mechanism for running
specific kernel threads together on sibling cpus independently of ASI.
So it will be easier to implement (you don't need ASI) and to test.

Then, once this mechanism has proven to work (and to be efficient),
you can have KVM ASI use it.

alex.

2022-03-24 16:59:48

by Junaid Shahid

[permalink] [raw]
Subject: Re: [RFC PATCH 10/47] mm: asi: Support for global non-sensitive direct map allocations

Hi Matthew,

On 3/23/22 14:06, Matthew Wilcox wrote:
> On Tue, Feb 22, 2022 at 09:21:46PM -0800, Junaid Shahid wrote:
>> standard ASI instances. A new page flag is also added so that when
>> these pages are freed, they can also be unmapped from the ASI page
>> tables.
>
> It's cute how you just throw this in as an aside. Page flags are
> in high demand and just adding them is not to be done lightly. Is
> there any other way of accomplishing what you want?
>

I suppose we may be able to use page_ext instead. That certainly should be
feasible for the PG_local_nonsensitive flag introduced in a later patch,
although I am not completely sure about the PG_global_nonsensitive flag. That
could get slightly tricky (though likely still possible to do) in case we need
to allocate any non-sensitive memory before page_ext is initialized. One concern
with using page_ext could be the extra memory usage on large machines.
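
If it helps the discussion, a page_ext-based variant could look roughly like the sketch below (purely illustrative: PAGE_EXT_GLOBAL_NONSENSITIVE would be a new entry in enum page_ext_flags hooked up via page_ext_operations, and the early-boot allocation problem mentioned above is glossed over):

static bool page_is_global_nonsensitive(struct page *page)
{
        struct page_ext *page_ext = lookup_page_ext(page);

        if (!page_ext)  /* e.g. before page_ext is initialized */
                return false;
        return test_bit(PAGE_EXT_GLOBAL_NONSENSITIVE, &page_ext->flags);
}

static void set_page_global_nonsensitive(struct page *page)
{
        struct page_ext *page_ext = lookup_page_ext(page);

        if (page_ext)
                set_bit(PAGE_EXT_GLOBAL_NONSENSITIVE, &page_ext->flags);
}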

BTW is page flag scarcity an issue on 64-bit systems as well, or only 32-bit
systems? ASI is only supported on 64-bit systems (at least currently).

>> @@ -542,6 +545,12 @@ TESTCLEARFLAG(Young, young, PF_ANY)
>> PAGEFLAG(Idle, idle, PF_ANY)
>> #endif
>>
>> +#ifdef CONFIG_ADDRESS_SPACE_ISOLATION
>> +__PAGEFLAG(GlobalNonSensitive, global_nonsensitive, PF_ANY);
>
> Why is PF_ANY appropriate?
>

I think we actually can use PF_HEAD here. That would make the alloc_pages path
faster too. I'll change to that. (Initially I had gone with PF_ANY because in
theory, you could allocate a compound page and then break it later and free the
individual pages, but I don't know if that actually happens apart from THP, and
certainly doesn't look like the case for any of the stuff that we have marked as
non-sensitive.)
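
For reference, the change being described would presumably just be (sketch of the intended follow-up, not a final patch):

#ifdef CONFIG_ADDRESS_SPACE_ISOLATION
__PAGEFLAG(GlobalNonSensitive, global_nonsensitive, PF_HEAD);
#endif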

Thanks,
Junaid

2022-03-24 23:39:28

by Junaid Shahid

[permalink] [raw]
Subject: Re: [RFC PATCH 00/47] Address Space Isolation for KVM

On 3/22/22 02:46, Alexandre Chartre wrote:
>
> On 3/18/22 00:25, Junaid Shahid wrote:
>>
>> I agree that it is not secure to run one sibling in the unrestricted
>> kernel address space while the other sibling is running in an ASI
>> restricted address space, without doing a cache flush before
>> re-entering the VM. However, I think that avoiding this situation
>> does not require doing a sibling stun operation immediately after VM
>> Exit. The way we avoid it is as follows.
>>
>> First, we always use ASI in conjunction with core scheduling. This
>> means that if HT0 is running a VCPU thread, then HT1 will be running
>> either a VCPU thread of the same VM or the Idle thread. If it is
>> running a VCPU thread, then if/when that thread takes a VM Exit, it
>> will also be running in the same ASI restricted address space. For
>> the idle thread, we have created another ASI Class, called Idle-ASI,
>> which maps only globally non-sensitive kernel memory. The idle loop
>> enters this ASI address space.
>>
>> This means that when HT0 does a VM Exit, HT1 will either be running
>> the guest code of a VCPU of the same VM, or it will be running kernel
>> code in either a KVM-ASI or the Idle-ASI address space. (If HT1 is
>> already running in the full kernel address space, that would imply
>> that it had previously done an ASI Exit, which would have triggered a
>> stun_sibling, which would have already caused HT0 to exit the VM and
>> wait in the kernel).
>
> Note that using core scheduling (or not) is a detail; what is important
> is whether HTs are running with ASI or not. Running core scheduling will
> just improve the chances that all siblings run ASI at the same time
> and so improve ASI performance.
>
>
>> If HT1 now does an ASI Exit, that will trigger the stun_sibling()
>> operation in its pre_asi_exit() handler, which will set the state of
>> the core/HT0 to Stunned (and possibly send an IPI too, though that
>> will be ignored if HT0 was already in kernel mode). Now when HT0
>> tries to re-enter the VM, since its state is set to Stunned, it will
>> just wait in a loop until HT1 does an unstun_sibling() operation,
>> which it will do in its post_asi_enter handler the next time it does
>> an ASI Enter (which would be either just before VM Enter if it was
>> KVM-ASI, or in the next iteration of the idle loop if it was
>> Idle-ASI). In either case, HT1's post_asi_enter() handler would also
>> do a flush_sensitive_cpu_state operation before the unstun_sibling(),
>> so when HT0 gets out of its wait-loop and does a VM Enter, there will
>> not be any sensitive state left.
>>
>> One thing that probably was not clear from the patch, is that the
>> stun state check and wait-loop is still always executed before VM
>> Enter, even if no ASI Exit happened in that execution.
>>
>
> So if I understand correctly, you have following sequence:
>
> 0 - Initially state is set to "stunned" for all cpus (i.e. a cpu should
>     wait before VMEnter)
>
> 1 - After ASI Enter: Set sibling state to "unstunned" (i.e. sibling can
>     do VMEnter)
>
> 2 - Before VMEnter : wait while my state is "stunned"
>
> 3 - Before ASI Exit : Set sibling state to "stunned" (i.e. sibling should
>     wait before VMEnter)
>
> I have tried this kind of implementation, and the problem is with step 2
> (wait while my state is "stunned"); how do you wait exactly? You can't
> just do an active wait otherwise you have all kind of problems (depending
> if you have interrupts enabled or not) especially as you don't know how
> long you have to wait for (this depends on what the other cpu is doing).

In our stunning implementation, we do an active wait with interrupts enabled and with a need_resched() check to decide when to bail out to the scheduler (plus we also make sure that we re-enter ASI at the end of the wait in case some interrupt exited ASI). What kind of problems have you run into with an active wait, besides wasted CPU cycles?
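
A minimal sketch of the wait described here (helper names are placeholders, not the actual implementation):

/* Active wait before VM Enter: interrupts stay enabled, we bail out to the
 * scheduler when needed, and we re-enter ASI at the end because an interrupt
 * taken during the wait may have exited ASI. */
static void asi_stun_wait(struct kvm_vcpu *vcpu)
{
        while (this_core_is_stunned() && !need_resched())
                cpu_relax();

        asi_enter(vcpu_to_asi(vcpu));   /* placeholder for re-entering KVM-ASI */
}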

In any case, the specific stunning mechanism is orthogonal to ASI. This implementation of ASI can be integrated with different stunning implementations. The "kernel core scheduling" that you proposed is also an alternative to stunning and could be similarly integrated with ASI.

>
> That's why I have been dissociating ASI and cpu stunning (and eventually
> move to only do kernel core scheduling). Basically I replaced step 2 by
> a call to the scheduler to select threads using ASI on all siblings (or
> run something else if there's higher priority threads to run) which means
> enabling kernel core scheduling at this point.
>
>>>
>>> A Possible Alternative to ASI?
>>> =============================
>>> ASI prevents access to sensitive data by unmapping them. On the other
>>> hand, the KVM code somewhat already identifies access to sensitive data
>>> as part of the L1TF/MDS mitigation, and when KVM is about to access
>>> sensitive data then it sets l1tf_flush_l1d to true (so that L1D gets
>>> flushed before VMEnter).
>>>
>>> With KVM knowing when it accesses sensitive data, I think we can provide
>>> the same mitigation as ASI by simply allowing KVM code which doesn't
>>> access sensitive data to be run concurrently with a VM. This can be done
>>> by tagging the kernel thread when it enters KVM code which doesn't
>>> access sensitive data, and untagging the thread right before it accesses
>>> sensitive data. And when KVM is about to do a VMEnter then we synchronize
>>> sibling CPUs so that they run threads with the same tag. Sound familiar?
>>> Yes, because that's similar to core scheduling but inside the kernel
>>> (let's call it "kernel core scheduling").
>>>
>>> I think the benefit of this approach would be that it should be much
>>> simpler to implement and less invasive than ASI, and it doesn't preclude
>>> doing ASI later: ASI can be done in addition and provide an extra level
>>> of mitigation in case some sensitive data is still accessed by KVM. Also it
>>> would provide the critical sibling CPU synchronization mechanism that
>>> we also need with ASI.
>>>
>>> I did some prototyping to implement this kernel core scheduling a while
>>> ago (and then got diverted onto other stuff) but so far performance has
>>> been abysmal, especially when doing strict synchronization between
>>> sibling CPUs. I am planning to go back and do more investigations when I
>>> have cycles but probably not that soon.
>>>
>>
>> This also seems like an interesting approach. It does have some
>> different trade-offs compared to ASI. First, there is the trade-off
>> between a blacklist vs. whitelist approach. Secondly, ASI has a more
>> structured approach based on the sensitivity of the data itself,
>> instead of having to audit every new code path to verify whether or
>> not it can potentially access any sensitive data. On the other hand,
>> as you point out, this approach is much simpler than ASI, which is
>> certainly a plus.
>
> I think the main benefit is that it provides a mechanism for running
> specific kernel threads together on sibling cpus independently of ASI.
> So it will be easier to implement (you don't need ASI) and to test.
>

It would be interesting to see the performance characteristics of this approach compared to stunning. I think it would really depend on how long we typically end up staying in the full kernel address space when running VCPUs.

Note that stunning can also be implemented independently of ASI by integrating it with the same conditional L1TF mitigation mechanism (l1tf_flush_l1d) that currently exists in KVM. The way I see it, this kernel core scheduling is an alternative to stunning, regardless of whether we integrate it with ASI or with the existing conditional mitigation mechanism.
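
For context, the existing conditional mechanism works roughly as follows: code paths that may have touched sensitive data set vcpu->arch.l1tf_flush_l1d, and (in the conditional flush mode) the VM-entry path flushes L1D only when that flag is set. A stunning scheme reusing that signal might look something like this sketch (the flush/stun helpers are placeholders, not KVM code):

static void pre_vmenter_conditional_mitigation(struct kvm_vcpu *vcpu)
{
        if (vcpu->arch.l1tf_flush_l1d) {
                /* Sensitive data may have been touched since the last
                 * VM Enter: flush CPU state and synchronize the sibling. */
                flush_sensitive_cpu_state();
                stun_wait_for_sibling();        /* placeholder stun/wait step */
                vcpu->arch.l1tf_flush_l1d = false;
        }
}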

> Then, once this mechanism has proven to work (and to be efficient),
> you can have KVM ASI use it.
>

Yes, if this mechanism seems to work better than stunning, then we could certainly integrate this with ASI. Though it is possible that we may end up needing ASI to get to the "efficient" part.

Thanks,
Junaid

2022-03-25 02:47:29

by Junaid Shahid

[permalink] [raw]
Subject: Re: [RFC PATCH 10/47] mm: asi: Support for global non-sensitive direct map allocations

On 3/23/22 16:48, Junaid Shahid wrote:
> Hi Matthew,
>
> On 3/23/22 14:06, Matthew Wilcox wrote:
>> On Tue, Feb 22, 2022 at 09:21:46PM -0800, Junaid Shahid wrote:
>>> standard ASI instances. A new page flag is also added so that when
>>> these pages are freed, they can also be unmapped from the ASI page
>>> tables.
>>
>> It's cute how you just throw this in as an aside.  Page flags are
>> in high demand and just adding them is not to be done lightly.  Is
>> there any other way of accomplishing what you want?
>>
>
> I suppose we may be able to use page_ext instead. That certainly should be
> feasible for the PG_local_nonsensitive flag introduced in a later patch,
> although I am not completely sure about the PG_global_nonsensitive flag. That
> could get slightly tricky (though likely still possible to do) in case we need
> to allocate any non-sensitive memory before page_ext is initialized. One concern
> with using page_ext could be the extra memory usage on large machines.
>
> BTW is page flag scarcity an issue on 64-bit systems as well, or only 32-bit
> systems? ASI is only supported on 64-bit systems (at least currently).
>

One other thing that we could do to remove the need for the
PG_global_nonsensitive flag altogether (though not the PG_local_nonsensitive
flag) would be to always try to unmap pages from the asi_global_nonsensitive_pgd
in free_pages(). Basically, that would mean adding a page table walk to every
free_pages() rather than just non-sensitive free_pages(). Do you think that may
be a better trade-off in order to avoid the flag?
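
Roughly, the trade-off would be between the two shapes below (a sketch of the idea only; asi_unmap() stands in for whatever the unmap helper ends up being, and error handling is omitted):

/* Current approach (sketch): only pages marked non-sensitive pay for the
 * unmap walk when freed. */
void free_pages_with_flag(struct page *page, unsigned int order)
{
        if (PageGlobalNonSensitive(page))
                asi_unmap(asi_global_nonsensitive_pgd, page, order);
        __free_pages(page, order);
}

/* Alternative: drop the page flag and walk the ASI page tables on every
 * free; the unmap is a no-op when the page was never mapped there. */
void free_pages_without_flag(struct page *page, unsigned int order)
{
        asi_unmap(asi_global_nonsensitive_pgd, page, order);
        __free_pages(page, order);
}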

Thanks,
Junaid

2022-03-25 19:29:58

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [RFC PATCH 10/47] mm: asi: Support for global non-sensitive direct map allocations

On Tue, Feb 22, 2022 at 09:21:46PM -0800, Junaid Shahid wrote:
> standard ASI instances. A new page flag is also added so that when
> these pages are freed, they can also be unmapped from the ASI page
> tables.

It's cute how you just throw this in as an aside. Page flags are
in high demand and just adding them is not to be done lightly. Is
there any other way of accomplishing what you want?

> @@ -542,6 +545,12 @@ TESTCLEARFLAG(Young, young, PF_ANY)
> PAGEFLAG(Idle, idle, PF_ANY)
> #endif
>
> +#ifdef CONFIG_ADDRESS_SPACE_ISOLATION
> +__PAGEFLAG(GlobalNonSensitive, global_nonsensitive, PF_ANY);

Why is PF_ANY appropriate?

2022-04-08 12:28:03

by Alexandre Chartre

[permalink] [raw]
Subject: Re: [RFC PATCH 00/47] Address Space Isolation for KVM


On 3/23/22 20:35, Junaid Shahid wrote:
> On 3/22/22 02:46, Alexandre Chartre wrote:
>>
>> On 3/18/22 00:25, Junaid Shahid wrote:
>>>
>>> I agree that it is not secure to run one sibling in the
>>> unrestricted kernel address space while the other sibling is
>>> running in an ASI restricted address space, without doing a cache
>>> flush before re-entering the VM. However, I think that avoiding
>>> this situation does not require doing a sibling stun operation
>>> immediately after VM Exit. The way we avoid it is as follows.
>>>
>>> First, we always use ASI in conjunction with core scheduling.
>>> This means that if HT0 is running a VCPU thread, then HT1 will be
>>> running either a VCPU thread of the same VM or the Idle thread.
>>> If it is running a VCPU thread, then if/when that thread takes a
>>> VM Exit, it will also be running in the same ASI restricted
>>> address space. For the idle thread, we have created another ASI
>>> Class, called Idle-ASI, which maps only globally non-sensitive
>>> kernel memory. The idle loop enters this ASI address space.
>>>
>>> This means that when HT0 does a VM Exit, HT1 will either be
>>> running the guest code of a VCPU of the same VM, or it will be
>>> running kernel code in either a KVM-ASI or the Idle-ASI address
>>> space. (If HT1 is already running in the full kernel address
>>> space, that would imply that it had previously done an ASI Exit,
>>> which would have triggered a stun_sibling, which would have
>>> already caused HT0 to exit the VM and wait in the kernel).
>>
>> Note that using core scheduling (or not) is a detail, what is
>> important is whether HT are running with ASI or not. Running core
>> scheduling will just improve chances to have all siblings run ASI
>> at the same time and so improve ASI performances.
>>
>>
>>> If HT1 now does an ASI Exit, that will trigger the
>>> stun_sibling() operation in its pre_asi_exit() handler, which
>>> will set the state of the core/HT0 to Stunned (and possibly send
>>> an IPI too, though that will be ignored if HT0 was already in
>>> kernel mode). Now when HT0 tries to re-enter the VM, since its
>>> state is set to Stunned, it will just wait in a loop until HT1
>>> does an unstun_sibling() operation, which it will do in its
>>> post_asi_enter handler the next time it does an ASI Enter (which
>>> would be either just before VM Enter if it was KVM-ASI, or in the
>>> next iteration of the idle loop if it was Idle-ASI). In either
>>> case, HT1's post_asi_enter() handler would also do a
>>> flush_sensitive_cpu_state operation before the unstun_sibling(),
>>> so when HT0 gets out of its wait-loop and does a VM Enter, there
>>> will not be any sensitive state left.
>>>
>>> One thing that probably was not clear from the patch, is that
>>> the stun state check and wait-loop is still always executed
>>> before VM Enter, even if no ASI Exit happened in that execution.
>>>
>>
>> So if I understand correctly, you have following sequence:
>>
>> 0 - Initially state is set to "stunned" for all cpus (i.e. a cpu
>> should wait before VMEnter)
>>
>> 1 - After ASI Enter: Set sibling state to "unstunned" (i.e. sibling
>> can do VMEnter)
>>
>> 2 - Before VMEnter : wait while my state is "stunned"
>>
>> 3 - Before ASI Exit : Set sibling state to "stunned" (i.e. sibling
>> should wait before VMEnter)
>>
>> I have tried this kind of implementation, and the problem is with
>> step 2 (wait while my state is "stunned"); how do you wait exactly?
>> You can't just do an active wait otherwise you have all kind of
>> problems (depending if you have interrupts enabled or not)
>> especially as you don't know how long you have to wait for (this
>> depends on what the other cpu is doing).
>
> In our stunning implementation, we do an active wait with interrupts
> enabled and with a need_resched() check to decide when to bail out
> to the scheduler (plus we also make sure that we re-enter ASI at the
> end of the wait in case some interrupt exited ASI). What kind of
> problems have you run into with an active wait, besides wasted CPU
> cycles?

If you wait with interrupts enabled then there is a window after the
wait and before interrupts get disabled where a cpu can get an interrupt
and exit ASI while the sibling is entering the VM. Also, after a CPU has
passed the wait and has disabled interrupts, it can't be notified if the
sibling has exited ASI:

T+01 - cpu A and B enter ASI - interrupts are enabled
T+02 - cpu A and B pass the wait because both are using ASI - interrupts are enabled
T+03 - cpu A gets an interrupt
T+04 - cpu B disables interrupts
T+05 - cpu A exit ASI and process interrupts
T+06 - cpu B enters VM => cpu B runs VM while cpu A is not using ASI
T+07 - cpu B exits VM
T+08 - cpu B exits ASI
T+09 - cpu A returns from interrupt
T+10 - cpu A disables interrupts and enters VM => cpu A runs VM while cpu A is not using ASI


> In any case, the specific stunning mechanism is orthogonal to ASI.
> This implementation of ASI can be integrated with different stunning
> implementations. The "kernel core scheduling" that you proposed is
> also an alternative to stunning and could be similarly integrated
> with ASI.

Yes, but for ASI to be relevant for preventing data leaks with KVM, you need
a fully functional and reliable stunning mechanism, otherwise ASI is
useless. That's why I think it is better to first focus on having an
effective stunning mechanism and then implement ASI.


alex.

2022-04-11 11:11:52

by junaid_shahid

[permalink] [raw]
Subject: Re: [RFC PATCH 00/47] Address Space Isolation for KVM

Hi Alex,

> On 3/23/22 20:35, Junaid Shahid wrote:
>> On 3/22/22 02:46, Alexandre Chartre wrote:
>>>
>>> So if I understand correctly, you have following sequence:
>>>
>>> 0 - Initially state is set to "stunned" for all cpus (i.e. a cpu
>>> should wait before VMEnter)
>>>
>>> 1 - After ASI Enter: Set sibling state to "unstunned" (i.e. sibling
>>> can do VMEnter)
>>>
>>> 2 - Before VMEnter : wait while my state is "stunned"
>>>
>>> 3 - Before ASI Exit : Set sibling state to "stunned" (i.e. sibling
>>> should wait before VMEnter)
>>>
>>> I have tried this kind of implementation, and the problem is with
>>> step 2 (wait while my state is "stunned"); how do you wait exactly?
>>> You can't just do an active wait otherwise you have all kind of
>>> problems (depending if you have interrupts enabled or not)
>>> especially as you don't know how long you have to wait for (this
>>> depends on what the other cpu is doing).
>>
>> In our stunning implementation, we do an active wait with interrupts
>> enabled and with a need_resched() check to decide when to bail out
>> to the scheduler (plus we also make sure that we re-enter ASI at the
>> end of the wait in case some interrupt exited ASI). What kind of
>> problems have you run into with an active wait, besides wasted CPU
>> cycles?
>
> If you wait with interrupts enabled then there is a window after the
> wait and before interrupts get disabled where a cpu can get an interrupt,
> exit ASI while the sibling is entering the VM.

We actually do another check after disabling interrupts, and if it turns out
that we need to wait again, we just go back to the wait loop after re-enabling
interrupts. But, irrespective of that,

> Also after a CPU has passed
> the wait and has disabled interrupts, it can't be notified if the sibling
> has exited ASI:

I don't think that this is actually the case. Yes, the IPI from the sibling
will be blocked while the host kernel has disabled interrupts. However, when
the host kernel executes a VMENTER, if there is a pending IPI, the VM will
immediately exit back to the host even before executing any guest code. So
AFAICT there is not going to be any data leak in the scenario that you
mentioned. Basically, the "cpu B runs VM" in step T+06 won't actually happen.

>
> T+01 - cpu A and B enter ASI - interrupts are enabled
> T+02 - cpu A and B pass the wait because both are using ASI - interrupts are enabled
> T+03 - cpu A gets an interrupt
> T+04 - cpu B disables interrupts
> T+05 - cpu A exit ASI and process interrupts
> T+06 - cpu B enters VM => cpu B runs VM while cpu A is not using ASI
> T+07 - cpu B exits VM
> T+08 - cpu B exits ASI
> T+09 - cpu A returns from interrupt
> T+10 - cpu A disables interrupts and enter VM => cpu A runs VM while cpu A is not using ASI

The "cpu A runs VM while cpu A is not using ASI" will also not happen, because
cpu A will re-enter ASI after disabling interrupts and before entering the VM.
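
Putting these pieces together, the pre-VM-Enter sequence being described is roughly (helper names are placeholders, not the actual implementation):

static void kvm_asi_pre_vmenter(struct kvm_vcpu *vcpu)
{
        for (;;) {
                wait_while_stunned();           /* active wait, interrupts enabled */

                local_irq_disable();
                asi_enter(vcpu_to_asi(vcpu));   /* re-enter ASI in case an IRQ exited it */
                if (!this_core_is_stunned())
                        break;                  /* safe: sibling is in ASI or in the guest */

                local_irq_enable();             /* stunned again: go back to waiting */
        }
        /* Interrupts remain disabled through VM Enter; if the sibling sends a
         * stun IPI after this point, the pending IPI forces an immediate VM
         * Exit before any guest code runs. */
}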

>
>> In any case, the specific stunning mechanism is orthogonal to ASI.
>> This implementation of ASI can be integrated with different stunning
>> implementations. The "kernel core scheduling" that you proposed is
>> also an alternative to stunning and could be similarly integrated
>> with ASI.
>
> Yes, but for ASI to be relevant with KVM to prevent data leak, you need
> a fully functional and reliable stunning mechanism, otherwise ASI is
> useless. That's why I think it is better to first focus on having an
> effective stunning mechanism and then implement ASI.
>

Sure, that makes sense. The only caveat is that, at least in our testing, the
overhead of stunning alone without ASI seemed too high. But I can try to see
if we might be able to post our stunning implementation with the next version
of the RFC.

Thanks,
Junaid

PS: I am away from the office for a few weeks, so email replies may be delayed
until next month.