A KVM host (or another hypervisor) might advertise paravirtualized
features and optimization hints (e.g. KVM_HINTS_REALTIME) which can
become stale over the lifetime of the guest. For instance, the
host might go from being undersubscribed to being oversubscribed
(or the other way round), and it would make sense for the guest
to switch pv-ops based on that.
This locktorture splat, seen on the guest while testing this, is
indicative of the problem:
[ 1136.461522] watchdog: BUG: soft lockup - CPU#8 stuck for 22s! [lock_torture_wr:12865]
[ 1136.461542] CPU: 8 PID: 12865 Comm: lock_torture_wr Tainted: G W L 5.4.0-rc7+ #77
[ 1136.461546] RIP: 0010:native_queued_spin_lock_slowpath+0x15/0x220
(Caused by an oversubscribed host while the guest was still using the
mismatched native pv_lock_ops.)
This series addresses the problem by doing paravirt switching at runtime.
We keep an interesting subset of pv-ops (pv_lock_ops only for now,
but PV-TLB ops are also good candidates) in .parainstructions.runtime,
while discarding the .parainstructions as usual at init. This is then
used for switching back and forth between native and paravirt mode.
([1] lists some representative numbers of the increased memory
footprint.)
Mechanism: the patching itself is done using stop_machine(). That is
not ideal -- text_poke_stop_machine() was replaced with INT3+emulation
via text_poke_bp() -- but I'm using it to address two issues:
1) emulation in text_poke() can only easily handle a small set
of instructions, and this is problematic for inlined pv-ops (and see
a possible alternatives use-case below.)
2) paravirt patching might have inter-dependent ops (e.g.
lock.queued_spin_lock_slowpath and lock.queued_spin_unlock are paired
and need to be updated atomically.)
The alternative use-case is a runtime version of apply_alternatives()
(not posted with this patch-set) that can be used for some safe subset
of X86_FEATUREs. This could be useful in conjunction with the ongoing
late microcode loading work that Mihai Carabas and others have been
working on.
Also, there are points of similarity with the ongoing static_call work,
which rewrites indirect calls. The difference here is that we need to
switch a group of calls atomically, and, given that some of them can be
inlined, we need to handle a wider variety of opcodes.
To patch safely we need to satisfy these constraints:
- No references to insn sequences under replacement on any kernel stack
once replacement is in progress. Without this constraint we might end
up returning to an address that is in the middle of an instruction.
- Handle inter-dependent ops: as above, lock.queued_spin_unlock(),
lock.queued_spin_lock_slowpath() and the rest of the pv_lock_ops are
a good example.
- Handle a broader set of insns than CALL and JMP: some pv-ops end up
getting inlined. Alternatives can contain arbitrary instructions.
- Locking operations can be called from interrupt handlers, which means
we cannot trivially use IPIs for flushing.
Handling these necessitates that the target pv-ops not be preemptible.
Once that is a given (for safety, these need to be explicitly whitelisted
in runtime_patch()), we use a state machine with the primary CPU doing
the patching and the secondary CPUs in a sync_core() loop.
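To make the rendezvous concrete, here is a minimal sketch (illustrative
only: poke_staged_sites() and the exact protocol are stand-ins; the real
logic is in the text_poke_site()/NMI patches below):

  #include <linux/atomic.h>
  #include <linux/smp.h>
  #include <linux/stop_machine.h>
  #include <asm/processor.h>      /* sync_core() */

  static atomic_t patch_done;
  static int patching_cpu;

  static int patch_rendezvous(void *stage)
  {
          if (smp_processor_id() == patching_cpu) {
                  /* Primary: INT3 the first byte, poke the tail,
                   * then fix up the first byte. */
                  poke_staged_sites(stage);       /* assumed helper */
                  atomic_set(&patch_done, 1);
          } else {
                  /* Secondaries: serialize until the primary is done,
                   * so no stale insns are executed. */
                  while (!atomic_read(&patch_done))
                          sync_core();
          }
          return 0;
  }

  static void runtime_patch_all(void *stage)
  {
          patching_cpu = raw_smp_processor_id();
          atomic_set(&patch_done, 0);
          /* Every online CPU enters the callback. */
          stop_machine(patch_rendezvous, stage, cpu_online_mask);
  }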
In case we hit an INT3/BP (in NMI or thread-context) we make forward
progress by continuing the patching instead of emulating.
One remaining issue is inter-dependent pv-ops which are also executed in
the NMI handler -- patching can potentially deadlock in case of multiple
NMIs. Handle these by pushing some of this work into the NMI handler,
where we know it will run uninterrupted.
There are four main sets of patches in this series:
1. PV-ops management (patches 1-10, 20): mostly infrastructure and
refactoring pieces to make paravirt patching usable at runtime. For the
most part scoped under CONFIG_PARAVIRT_RUNTIME.
Patches 1-7 persist part of .parainstructions in memory:
"x86/paravirt: Specify subsection in PVOP macros"
"x86/paravirt: Allow paravirt patching post-init"
"x86/paravirt: PVRTOP macros for PARAVIRT_RUNTIME"
"x86/alternatives: Refactor alternatives_smp_module*
"x86/alternatives: Rename alternatives_smp*, smp_alt_module
"x86/alternatives: Remove stale symbols
"x86/paravirt: Persist .parainstructions.runtime"
Patches 8-10 develop the interfaces to safely switch pv-ops:
"x86/paravirt: Stash native pv-ops"
"x86/paravirt: Add runtime_patch()"
"x86/paravirt: Add primitives to stage pv-ops"
Patch 20 enables switching of pv_lock_ops:
"x86/paravirt: Enable pv-spinlocks in runtime_patch()"
2. Non-emulated text poking (patches 11-19)
Patches 11-13 are mostly refactoring to split __text_poke() into map,
poke/memcpy and unmap phases, with the poke portion being re-entrant:
"x86/alternatives: Remove return value of text_poke*()"
"x86/alternatives: Use __get_unlocked_pte() in text_poke()"
"x86/alternatives: Split __text_poke()"
Patches 15, 17 add the actual poking state-machine:
"x86/alternatives: Non-emulated text poking"
"x86/alternatives: Add patching logic in text_poke_site()"
with patches 14 and 18 containing the pieces for BP handling:
"x86/alternatives: Handle native insns in text_poke_loc*()"
"x86/alternatives: Handle BP in non-emulated text poking"
and patch 19 provides the ability to use the state-machine above in an
NMI context (fixes some potential deadlocks when handling inter-
dependent operations and multiple NMIs):
"x86/alternatives: NMI safe runtime patching".
Patch 16 provides the interface (paravirt_runtime_patch()) to use the
poking mechanism developed above, and patch 21 adds a selftest (a usage
sketch follows this list):
"x86/alternatives: Add paravirt patching at runtime"
"x86/alternatives: Paravirt runtime selftest"
3. KVM guest changes to be able to use this (patches 22-23,25-26):
"kvm/paravirt: Encapsulate KVM pv switching logic"
"x86/kvm: Add worker to trigger runtime patching"
"x86/kvm: Guest support for dynamic hints"
"x86/kvm: Add hint change notifier for KVM_HINT_REALTIME".
4. KVM host changes to notify the guest of a change (patch 24):
"x86/kvm: Support dynamic CPUID hints"
Testing:
With paravirt patching, the code is mostly stable on Intel and AMD
systems under kernbench and locktorture with paravirt toggling (with
and without synthetic NMIs) in the background.
Queued spinlock performance for locktorture is also on expected lines:
[ 1533.221563] Writes: Total: 1048759000 Max/Min: 0/0 Fail: 0
# toggle PV spinlocks
[ 1594.713699] Writes: Total: 1111660545 Max/Min: 0/0 Fail: 0
# PV spinlocks (in ~60 seconds) = 62,901,545
# toggle native spinlocks
[ 1656.117175] Writes: Total: 1113888840 Max/Min: 0/0 Fail: 0
# native spinlocks (in ~60 seconds) = 2,228,295
The alternatives testing is more limited: it was used to rewrite mostly
harmless X86_FEATUREs while running a load in the background.
Patches also at:
ssh://[email protected]/terminus/linux.git alternatives-rfc-upstream-v1
Please review.
Thanks
Ankur
[1] The precise change in memory footprint depends on config options
but the following example inlines queued_spin_unlock() (which forms
the bulk of the added state). The added footprint is the size of the
.parainstructions.runtime section:
$ objdump -h vmlinux|grep .parainstructions
Idx Name Size VMA
LMA File-off Algn
27 .parainstructions 0001013c ffffffff82895000
0000000002895000 01c95000 2**3
28 .parainstructions.runtime 0000cd2c ffffffff828a5140
00000000028a5140 01ca5140 2**3
$ size vmlinux
text data bss dec hex filename
13726196 12302814 14094336 40123346 2643bd2 vmlinux
Ankur Arora (26):
x86/paravirt: Specify subsection in PVOP macros
x86/paravirt: Allow paravirt patching post-init
x86/paravirt: PVRTOP macros for PARAVIRT_RUNTIME
x86/alternatives: Refactor alternatives_smp_module*
x86/alternatives: Rename alternatives_smp*, smp_alt_module
x86/alternatives: Remove stale symbols
x86/paravirt: Persist .parainstructions.runtime
x86/paravirt: Stash native pv-ops
x86/paravirt: Add runtime_patch()
x86/paravirt: Add primitives to stage pv-ops
x86/alternatives: Remove return value of text_poke*()
x86/alternatives: Use __get_unlocked_pte() in text_poke()
x86/alternatives: Split __text_poke()
x86/alternatives: Handle native insns in text_poke_loc*()
x86/alternatives: Non-emulated text poking
x86/alternatives: Add paravirt patching at runtime
x86/alternatives: Add patching logic in text_poke_site()
x86/alternatives: Handle BP in non-emulated text poking
x86/alternatives: NMI safe runtime patching
x86/paravirt: Enable pv-spinlocks in runtime_patch()
x86/alternatives: Paravirt runtime selftest
kvm/paravirt: Encapsulate KVM pv switching logic
x86/kvm: Add worker to trigger runtime patching
x86/kvm: Support dynamic CPUID hints
x86/kvm: Guest support for dynamic hints
x86/kvm: Add hint change notifier for KVM_HINT_REALTIME
Documentation/virt/kvm/api.rst | 17 +
Documentation/virt/kvm/cpuid.rst | 9 +-
arch/x86/Kconfig | 14 +
arch/x86/Kconfig.debug | 13 +
arch/x86/entry/entry_64.S | 5 +
arch/x86/include/asm/alternative.h | 20 +-
arch/x86/include/asm/kvm_host.h | 6 +
arch/x86/include/asm/kvm_para.h | 17 +
arch/x86/include/asm/paravirt.h | 10 +-
arch/x86/include/asm/paravirt_types.h | 230 ++++--
arch/x86/include/asm/text-patching.h | 18 +-
arch/x86/include/uapi/asm/kvm_para.h | 2 +
arch/x86/kernel/Makefile | 1 +
arch/x86/kernel/alternative.c | 987 +++++++++++++++++++++++---
arch/x86/kernel/kvm.c | 191 ++++-
arch/x86/kernel/module.c | 42 +-
arch/x86/kernel/paravirt.c | 16 +-
arch/x86/kernel/paravirt_patch.c | 61 ++
arch/x86/kernel/pv_selftest.c | 264 +++++++
arch/x86/kernel/pv_selftest.h | 15 +
arch/x86/kernel/setup.c | 2 +
arch/x86/kernel/vmlinux.lds.S | 16 +
arch/x86/kvm/cpuid.c | 3 +-
arch/x86/kvm/x86.c | 39 +
include/asm-generic/kvm_para.h | 12 +
include/asm-generic/vmlinux.lds.h | 8 +
include/linux/kvm_para.h | 5 +
include/linux/mm.h | 16 +-
include/linux/preempt.h | 17 +
include/uapi/linux/kvm.h | 4 +
kernel/locking/lock_events.c | 2 +-
mm/memory.c | 9 +-
32 files changed, 1850 insertions(+), 221 deletions(-)
create mode 100644 arch/x86/kernel/pv_selftest.c
create mode 100644 arch/x86/kernel/pv_selftest.h
--
2.20.1
Add paravirt_stage_alt(), which conditionally selects between a paravirt
and a native pv-op and then stages it for later patching.
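As an example, the KVM guest code later in the series (patch 22) stages
the io_delay op like this; at runtime the call has to be made under
text_mutex, while at boot text_poke_pv_stage() just writes pv_ops
directly:

  static bool kvm_pv_io_delay(void)
  {
          bool cond = kvm_para_has_feature(KVM_FEATURE_NOP_IO_DELAY);

          /* Stage kvm_io_delay() or fall back to the native op. */
          paravirt_stage_alt(cond, cpu.io_delay, kvm_io_delay);

          return cond;
  }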
Signed-off-by: Ankur Arora <[email protected]>
---
arch/x86/include/asm/paravirt_types.h | 6 +++
arch/x86/include/asm/text-patching.h | 3 ++
arch/x86/kernel/alternative.c | 58 +++++++++++++++++++++++++++
3 files changed, 67 insertions(+)
diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index 3b9f6c105397..0c4ca7ad719c 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -350,6 +350,12 @@ extern struct paravirt_patch_template native_pv_ops;
#define PARAVIRT_PATCH(x) \
(offsetof(struct paravirt_patch_template, x) / sizeof(void *))
+#define paravirt_stage_alt(do_stage, op, opfn) \
+ (text_poke_pv_stage(PARAVIRT_PATCH(op), \
+ (do_stage) ? (opfn) : (native_pv_ops.op)))
+
+#define paravirt_stage_zero() text_poke_pv_stage_zero()
+
/*
* Neat trick to map patch type back to the call within the
* corresponding structure.
diff --git a/arch/x86/include/asm/text-patching.h b/arch/x86/include/asm/text-patching.h
index e2ef241c261e..706e61e6967d 100644
--- a/arch/x86/include/asm/text-patching.h
+++ b/arch/x86/include/asm/text-patching.h
@@ -55,6 +55,9 @@ extern void text_poke_bp(void *addr, const void *opcode, size_t len, const void
extern void text_poke_queue(void *addr, const void *opcode, size_t len, const void *emulate);
extern void text_poke_finish(void);
+bool text_poke_pv_stage(u8 type, void *opfn);
+void text_poke_pv_stage_zero(void);
+
#define INT3_INSN_SIZE 1
#define INT3_INSN_OPCODE 0xCC
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 8189ac21624c..0c335af9ee28 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -1307,3 +1307,61 @@ void __ref text_poke_bp(void *addr, const void *opcode, size_t len, const void *
text_poke_loc_init(&tp, addr, opcode, len, emulate);
text_poke_bp_batch(&tp, 1);
}
+
+#ifdef CONFIG_PARAVIRT_RUNTIME
+struct paravirt_stage_entry {
+ void *dest; /* pv_op destination */
+ u8 type; /* pv_op type */
+};
+
+/*
+ * We don't anticipate many pv-ops being written at runtime.
+ */
+#define PARAVIRT_STAGE_MAX 8
+struct paravirt_stage {
+ struct paravirt_stage_entry ops[PARAVIRT_STAGE_MAX];
+ u32 count;
+};
+
+/* Protected by text_mutex */
+static struct paravirt_stage pv_stage;
+
+/**
+ * text_poke_pv_stage - Stage paravirt-op for poking.
+ * @type: pv-op type
+ * @opfn: destination of the pv-op
+ *
+ * Context: called with text_mutex held, except early during boot.
+ * Return: staging status.
+ */
+bool text_poke_pv_stage(u8 type, void *opfn)
+{
+ if (system_state == SYSTEM_BOOTING) { /* Passthrough */
+ PARAVIRT_PATCH_OP(pv_ops, type) = (long)opfn;
+ goto out;
+ }
+
+ lockdep_assert_held(&text_mutex);
+
+ if (PARAVIRT_PATCH_OP(pv_ops, type) == (long)opfn)
+ goto out;
+
+ if (pv_stage.count >= PARAVIRT_STAGE_MAX)
+ goto out;
+
+ pv_stage.ops[pv_stage.count].type = type;
+ pv_stage.ops[pv_stage.count].dest = opfn;
+
+ pv_stage.count++;
+
+ return true;
+out:
+ return false;
+}
+
+void text_poke_pv_stage_zero(void)
+{
+ lockdep_assert_held(&text_mutex);
+ pv_stage.count = 0;
+}
+#endif /* CONFIG_PARAVIRT_RUNTIME */
--
2.20.1
Various text_poke() variants don't return a useful value. Remove it.
Signed-off-by: Ankur Arora <[email protected]>
---
arch/x86/include/asm/text-patching.h | 4 ++--
arch/x86/kernel/alternative.c | 11 +++++------
2 files changed, 7 insertions(+), 8 deletions(-)
diff --git a/arch/x86/include/asm/text-patching.h b/arch/x86/include/asm/text-patching.h
index 706e61e6967d..04778c2bc34e 100644
--- a/arch/x86/include/asm/text-patching.h
+++ b/arch/x86/include/asm/text-patching.h
@@ -46,9 +46,9 @@ extern void text_poke_early(void *addr, const void *opcode, size_t len);
* On the local CPU you need to be protected against NMI or MCE handlers seeing
* an inconsistent instruction while you patch.
*/
-extern void *text_poke(void *addr, const void *opcode, size_t len);
+extern void text_poke(void *addr, const void *opcode, size_t len);
extern void text_poke_sync(void);
-extern void *text_poke_kgdb(void *addr, const void *opcode, size_t len);
+extern void text_poke_kgdb(void *addr, const void *opcode, size_t len);
extern int poke_int3_handler(struct pt_regs *regs);
extern void text_poke_bp(void *addr, const void *opcode, size_t len, const void *emulate);
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 0c335af9ee28..8c79a3dc5e72 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -805,7 +805,7 @@ void __init_or_module text_poke_early(void *addr, const void *opcode,
__ro_after_init struct mm_struct *poking_mm;
__ro_after_init unsigned long poking_addr;
-static void *__text_poke(void *addr, const void *opcode, size_t len)
+static void __text_poke(void *addr, const void *opcode, size_t len)
{
bool cross_page_boundary = offset_in_page(addr) + len > PAGE_SIZE;
struct page *pages[2] = {NULL};
@@ -906,7 +906,6 @@ static void *__text_poke(void *addr, const void *opcode, size_t len)
pte_unmap_unlock(ptep, ptl);
local_irq_restore(flags);
- return addr;
}
/**
@@ -925,11 +924,11 @@ static void *__text_poke(void *addr, const void *opcode, size_t len)
* by registering a module notifier, and ordering module removal and patching
* trough a mutex.
*/
-void *text_poke(void *addr, const void *opcode, size_t len)
+void text_poke(void *addr, const void *opcode, size_t len)
{
lockdep_assert_held(&text_mutex);
- return __text_poke(addr, opcode, len);
+ __text_poke(addr, opcode, len);
}
/**
@@ -946,9 +945,9 @@ void *text_poke(void *addr, const void *opcode, size_t len)
* Context: should only be used by kgdb, which ensures no other core is running,
* despite the fact it does not hold the text_mutex.
*/
-void *text_poke_kgdb(void *addr, const void *opcode, size_t len)
+void text_poke_kgdb(void *addr, const void *opcode, size_t len)
{
- return __text_poke(addr, opcode, len);
+ __text_poke(addr, opcode, len);
}
static void do_sync_core(void *info)
--
2.20.1
Persist .parainstructions.runtime in memory. We will use it to
patch paravirt-ops at runtime.
The extra memory footprint depends on the chosen config options; the
inlined queued_spin_unlock() below forms the bulk of the added state:
$ objdump -h vmlinux|grep .parainstructions
Idx Name Size VMA
LMA File-off Algn
27 .parainstructions 0001013c ffffffff82895000
0000000002895000 01c95000 2**3
28 .parainstructions.runtime 0000cd2c ffffffff828a5140
00000000028a5140 01ca5140 2**3
(The added footprint is the size of the .parainstructions.runtime
section.)
$ size vmlinux
text data bss dec hex filename
13726196 12302814 14094336 40123346 2643bd2 vmlinux
Signed-off-by: Ankur Arora <[email protected]>
---
arch/x86/include/asm/alternative.h | 1 +
arch/x86/kernel/alternative.c | 16 +++++++++++++++-
arch/x86/kernel/module.c | 28 +++++++++++++++++++++++-----
3 files changed, 39 insertions(+), 6 deletions(-)
diff --git a/arch/x86/include/asm/alternative.h b/arch/x86/include/asm/alternative.h
index db91a7731d87..d19546c14ff6 100644
--- a/arch/x86/include/asm/alternative.h
+++ b/arch/x86/include/asm/alternative.h
@@ -76,6 +76,7 @@ extern void apply_alternatives(struct alt_instr *start, struct alt_instr *end);
struct module;
void alternatives_module_add(struct module *mod, char *name,
+ void *para, void *para_end,
void *locks, void *locks_end,
void *text, void *text_end);
void alternatives_module_del(struct module *mod);
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 09e4ee0e09a2..8189ac21624c 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -482,6 +482,12 @@ struct alt_module {
struct module *mod;
char *name;
+#ifdef CONFIG_PARAVIRT_RUNTIME
+ /* ptrs to paravirt sites */
+ struct paravirt_patch_site *para;
+ struct paravirt_patch_site *para_end;
+#endif
+
/* ptrs to lock prefixes */
const s32 *locks;
const s32 *locks_end;
@@ -496,6 +502,7 @@ struct alt_module {
static LIST_HEAD(alt_modules);
void __init_or_module alternatives_module_add(struct module *mod, char *name,
+ void *para, void *para_end,
void *locks, void *locks_end,
void *text, void *text_end)
{
@@ -506,7 +513,7 @@ void __init_or_module alternatives_module_add(struct module *mod, char *name,
if (!noreplace_smp && (num_present_cpus() == 1 || setup_max_cpus <= 1))
uniproc_patched = true;
#endif
- if (!uniproc_patched)
+ if (!IS_ENABLED(CONFIG_PARAVIRT_RUNTIME) && !uniproc_patched)
return;
mutex_lock(&text_mutex);
@@ -516,6 +523,11 @@ void __init_or_module alternatives_module_add(struct module *mod, char *name,
alt->mod = mod;
alt->name = name;
+#ifdef CONFIG_PARAVIRT_RUNTIME
+ alt->para = para;
+ alt->para_end = para_end;
+#endif
+
if (num_possible_cpus() != 1 || uniproc_patched) {
/* Remember only if we'll need to undo it. */
alt->locks = locks;
@@ -733,6 +745,8 @@ void __init alternative_instructions(void)
apply_alternatives(__alt_instructions, __alt_instructions_end);
alternatives_module_add(NULL, "core kernel",
+ __parainstructions_runtime,
+ __parainstructions_runtime_end,
__smp_locks, __smp_locks_end,
_text, _etext);
diff --git a/arch/x86/kernel/module.c b/arch/x86/kernel/module.c
index fc3d35198b09..7b2632184c11 100644
--- a/arch/x86/kernel/module.c
+++ b/arch/x86/kernel/module.c
@@ -248,12 +248,30 @@ int module_finalize(const Elf_Ehdr *hdr,
void *aseg = (void *)alt->sh_addr;
apply_alternatives(aseg, aseg + alt->sh_size);
}
- if (locks && text) {
- void *lseg = (void *)locks->sh_addr;
- void *tseg = (void *)text->sh_addr;
+ if (para_run || (locks && text)) {
+ void *pseg, *pseg_end;
+ void *lseg, *lseg_end;
+ void *tseg, *tseg_end;
+
+ pseg = pseg_end = NULL;
+ lseg = lseg_end = NULL;
+ tseg = tseg_end = NULL;
+ if (para_run) {
+ pseg = (void *)para_run->sh_addr;
+ pseg_end = pseg + para_run->sh_size;
+ }
+
+ if (locks && text) {
+ tseg = (void *)text->sh_addr;
+ tseg_end = tseg + text->sh_size;
+
+ lseg = (void *)locks->sh_addr;
+ lseg_end = lseg + locks->sh_size;
+ }
alternatives_module_add(me, me->name,
- lseg, lseg + locks->sh_size,
- tseg, tseg + text->sh_size);
+ pseg, pseg_end,
+ lseg, lseg_end,
+ tseg, tseg_end);
}
if (para) {
--
2.20.1
Define PVRT* macros which can be used to put pv-ops in
.parainstructions.runtime.
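For instance, a pv-op wrapper that should stay patchable at runtime
would use the PVRTOP_* variant instead of PVOP_* (a sketch modelled on
the existing queued_spin_unlock() wrapper; the lock ops are the intended
users per the cover letter):

  /* Emits the call-site record into .parainstructions.runtime. */
  static __always_inline void pv_queued_spin_unlock(struct qspinlock *lock)
  {
          PVRTOP_VCALLEE1(lock.queued_spin_unlock, lock);
  }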
Signed-off-by: Ankur Arora <[email protected]>
---
arch/x86/include/asm/paravirt_types.h | 49 +++++++++++++++++++++++++++
1 file changed, 49 insertions(+)
diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index 00e4a062ca10..f1153f53c529 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -337,6 +337,12 @@ struct paravirt_patch_template {
extern struct pv_info pv_info;
extern struct paravirt_patch_template pv_ops;
+#ifdef CONFIG_PARAVIRT_RUNTIME
+#define PVRT_SUFFIX ".runtime"
+#else
+#define PVRT_SUFFIX ""
+#endif
+
/* Sub-section for .parainstructions */
#define PV_SUFFIX ""
@@ -693,6 +699,49 @@ int paravirt_disable_iospace(void);
#define PVOP_VCALL4(op, arg1, arg2, arg3, arg4) \
_PVOP_VCALL4(PV_SUFFIX, op, arg1, arg2, arg3, arg4)
+/*
+ * PVRTOP macros for .parainstructions.runtime
+ */
+#define PVRTOP_CALL0(rettype, op) \
+ _PVOP_CALL0(PVRT_SUFFIX, rettype, op)
+#define PVRTOP_VCALL0(op) \
+ _PVOP_VCALL0(PVRT_SUFFIX, op)
+
+#define PVRTOP_CALLEE0(rettype, op) \
+ _PVOP_CALLEE0(PVRT_SUFFIX, rettype, op)
+#define PVRTOP_VCALLEE0(op) \
+ _PVOP_VCALLEE0(PVRT_SUFFIX, op)
+
+#define PVRTOP_CALL1(rettype, op, arg1) \
+ _PVOP_CALL1(PVRT_SUFFIX, rettype, op, arg1)
+#define PVRTOP_VCALL1(op, arg1) \
+ _PVOP_VCALL1(PVRT_SUFFIX, op, arg1)
+
+#define PVRTOP_CALLEE1(rettype, op, arg1) \
+ _PVOP_CALLEE1(PVRT_SUFFIX, rettype, op, arg1)
+#define PVRTOP_VCALLEE1(op, arg1) \
+ _PVOP_VCALLEE1(PVRT_SUFFIX, op, arg1)
+
+#define PVRTOP_CALL2(rettype, op, arg1, arg2) \
+ _PVOP_CALL2(PVRT_SUFFIX, rettype, op, arg1, arg2)
+#define PVRTOP_VCALL2(op, arg1, arg2) \
+ _PVOP_VCALL2(PVRT_SUFFIX, op, arg1, arg2)
+
+#define PVRTOP_CALLEE2(rettype, op, arg1, arg2) \
+ _PVOP_CALLEE2(PVRT_SUFFIX, rettype, op, arg1, arg2)
+#define PVRTOP_VCALLEE2(op, arg1, arg2) \
+ _PVOP_VCALLEE2(PVRT_SUFFIX, op, arg1, arg2)
+
+#define PVRTOP_CALL3(rettype, op, arg1, arg2, arg3) \
+ _PVOP_CALL3(PVRT_SUFFIX, rettype, op, arg1, arg2, arg3)
+#define PVRTOP_VCALL3(op, arg1, arg2, arg3) \
+ _PVOP_VCALL3(PVRT_SUFFIX, op, arg1, arg2, arg3)
+
+#define PVRTOP_CALL4(rettype, op, arg1, arg2, arg3, arg4) \
+ _PVOP_CALL4(PVRT_SUFFIX, rettype, op, arg1, arg2, arg3, arg4)
+#define PVRTOP_VCALL4(op, arg1, arg2, arg3, arg4) \
+ _PVOP_VCALL4(PVRT_SUFFIX, op, arg1, arg2, arg3, arg4)
+
/* Lazy mode for batching updates / context switch */
enum paravirt_lazy_mode {
PARAVIRT_LAZY_NONE,
--
2.20.1
Refactor the alternatives_smp_module* logic to make it available for
holding generic late-patching state.
Most of the changes pull the module handling logic out from under
CONFIG_SMP. In addition, we now call alternatives_smp_module_add()
unconditionally and make the UP-patching decision there.
Signed-off-by: Ankur Arora <[email protected]>
---
arch/x86/include/asm/alternative.h | 13 ++-----
arch/x86/kernel/alternative.c | 55 ++++++++++++++++--------------
2 files changed, 32 insertions(+), 36 deletions(-)
diff --git a/arch/x86/include/asm/alternative.h b/arch/x86/include/asm/alternative.h
index 13adca37c99a..8235bbb746d9 100644
--- a/arch/x86/include/asm/alternative.h
+++ b/arch/x86/include/asm/alternative.h
@@ -75,24 +75,15 @@ extern void apply_alternatives(struct alt_instr *start, struct alt_instr *end);
struct module;
-#ifdef CONFIG_SMP
extern void alternatives_smp_module_add(struct module *mod, char *name,
void *locks, void *locks_end,
void *text, void *text_end);
extern void alternatives_smp_module_del(struct module *mod);
-extern void alternatives_enable_smp(void);
extern int alternatives_text_reserved(void *start, void *end);
-extern bool skip_smp_alternatives;
+#ifdef CONFIG_SMP
+extern void alternatives_enable_smp(void);
#else
-static inline void alternatives_smp_module_add(struct module *mod, char *name,
- void *locks, void *locks_end,
- void *text, void *text_end) {}
-static inline void alternatives_smp_module_del(struct module *mod) {}
static inline void alternatives_enable_smp(void) {}
-static inline int alternatives_text_reserved(void *start, void *end)
-{
- return 0;
-}
#endif /* CONFIG_SMP */
#define b_replacement(num) "664"#num
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index fdfda1375f82..32aa1ddf441d 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -470,6 +470,13 @@ static void alternatives_smp_unlock(const s32 *start, const s32 *end,
}
}
+static bool uniproc_patched; /* protected by text_mutex */
+#else /* !CONFIG_SMP */
+#define uniproc_patched false
+static inline void alternatives_smp_unlock(const s32 *start, const s32 *end,
+ u8 *text, u8 *text_end) { }
+#endif /* CONFIG_SMP */
+
struct smp_alt_module {
/* what is this ??? */
struct module *mod;
@@ -486,7 +493,6 @@ struct smp_alt_module {
struct list_head next;
};
static LIST_HEAD(smp_alt_modules);
-static bool uniproc_patched = false; /* protected by text_mutex */
void __init_or_module alternatives_smp_module_add(struct module *mod,
char *name,
@@ -495,23 +501,27 @@ void __init_or_module alternatives_smp_module_add(struct module *mod,
{
struct smp_alt_module *smp;
- mutex_lock(&text_mutex);
+#ifdef CONFIG_SMP
+ /* Patch to UP if other cpus not imminent. */
+ if (!noreplace_smp && (num_present_cpus() == 1 || setup_max_cpus <= 1))
+ uniproc_patched = true;
+#endif
if (!uniproc_patched)
- goto unlock;
+ return;
- if (num_possible_cpus() == 1)
- /* Don't bother remembering, we'll never have to undo it. */
- goto smp_unlock;
+ mutex_lock(&text_mutex);
- smp = kzalloc(sizeof(*smp), GFP_KERNEL);
- if (NULL == smp)
- /* we'll run the (safe but slow) SMP code then ... */
- goto unlock;
+ smp = kzalloc(sizeof(*smp), GFP_KERNEL | __GFP_NOFAIL);
smp->mod = mod;
smp->name = name;
- smp->locks = locks;
- smp->locks_end = locks_end;
+
+ if (num_possible_cpus() != 1 || uniproc_patched) {
+ /* Remember only if we'll need to undo it. */
+ smp->locks = locks;
+ smp->locks_end = locks_end;
+ }
+
smp->text = text;
smp->text_end = text_end;
DPRINTK("locks %p -> %p, text %p -> %p, name %s\n",
@@ -519,9 +529,9 @@ void __init_or_module alternatives_smp_module_add(struct module *mod,
smp->text, smp->text_end, smp->name);
list_add_tail(&smp->next, &smp_alt_modules);
-smp_unlock:
- alternatives_smp_unlock(locks, locks_end, text, text_end);
-unlock:
+
+ if (uniproc_patched)
+ alternatives_smp_unlock(locks, locks_end, text, text_end);
mutex_unlock(&text_mutex);
}
@@ -540,6 +550,7 @@ void __init_or_module alternatives_smp_module_del(struct module *mod)
mutex_unlock(&text_mutex);
}
+#ifdef CONFIG_SMP
void alternatives_enable_smp(void)
{
struct smp_alt_module *mod;
@@ -561,6 +572,7 @@ void alternatives_enable_smp(void)
}
mutex_unlock(&text_mutex);
}
+#endif /* CONFIG_SMP */
/*
* Return 1 if the address range is reserved for SMP-alternatives.
@@ -588,7 +600,6 @@ int alternatives_text_reserved(void *start, void *end)
return 0;
}
-#endif /* CONFIG_SMP */
#ifdef CONFIG_PARAVIRT
void __init_or_module apply_paravirt(struct paravirt_patch_site *start,
@@ -723,21 +734,15 @@ void __init alternative_instructions(void)
apply_alternatives(__alt_instructions, __alt_instructions_end);
-#ifdef CONFIG_SMP
- /* Patch to UP if other cpus not imminent. */
- if (!noreplace_smp && (num_present_cpus() == 1 || setup_max_cpus <= 1)) {
- uniproc_patched = true;
- alternatives_smp_module_add(NULL, "core kernel",
- __smp_locks, __smp_locks_end,
- _text, _etext);
- }
+ alternatives_smp_module_add(NULL, "core kernel",
+ __smp_locks, __smp_locks_end,
+ _text, _etext);
if (!uniproc_patched || num_possible_cpus() == 1) {
free_init_pages("SMP alternatives",
(unsigned long)__smp_locks,
(unsigned long)__smp_locks_end);
}
-#endif
apply_paravirt(__parainstructions, __parainstructions_end);
apply_paravirt(__parainstructions_runtime,
--
2.20.1
Separate __text_poke() into map, memcpy and unmap portions
(__text_poke_map(), __text_do_poke() and __text_poke_unmap()).
Do this to separate the non-reentrant bits from the reentrant
__text_do_poke(): __text_poke_map()/_unmap() modify poking_mm and
poking_addr and do the pte-mapping, and are thus non-reentrant.
This allows __text_do_poke() to be safely called from an INT3
context, with __text_poke_map()/_unmap() being called at the
start and the end of the patching of a call-site instead of
for each of the three patching stages.
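For illustration, a sketch of the intended call pattern (patch one
call-site in the usual three stages while mapping the poking_mm only
once); the real sequence, including the cross-CPU serialization between
stages, is in text_poke_site() added later:

  static void poke_site_sketch(void *addr, const u8 *opcode, size_t len)
  {
          const u8 int3 = INT3_INSN_OPCODE;
          temp_mm_state_t prev_mm;
          unsigned long flags;
          pte_t *ptep;

          /* Assumes len > INT3_INSN_SIZE; sync between stages elided. */
          local_irq_save(flags);
          __text_poke_map(addr, len, &prev_mm, &ptep);

          /* 1. INT3 on the first byte. */
          __text_do_poke(offset_in_page(addr), &int3, INT3_INSN_SIZE);
          /* 2. Everything but the first byte. */
          __text_do_poke(offset_in_page(addr) + INT3_INSN_SIZE,
                         opcode + INT3_INSN_SIZE, len - INT3_INSN_SIZE);
          /* 3. Finally the first byte itself. */
          __text_do_poke(offset_in_page(addr), opcode, INT3_INSN_SIZE);

          __text_poke_unmap(addr, opcode, len, &prev_mm, ptep);
          local_irq_restore(flags);
  }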
Signed-off-by: Ankur Arora <[email protected]>
---
arch/x86/kernel/alternative.c | 46 +++++++++++++++++++++++++----------
1 file changed, 33 insertions(+), 13 deletions(-)
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 0344e49a4ade..337aad8c2521 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -805,13 +805,12 @@ void __init_or_module text_poke_early(void *addr, const void *opcode,
__ro_after_init struct mm_struct *poking_mm;
__ro_after_init unsigned long poking_addr;
-static void __text_poke(void *addr, const void *opcode, size_t len)
+static void __text_poke_map(void *addr, size_t len,
+ temp_mm_state_t *prev_mm, pte_t **ptep)
{
bool cross_page_boundary = offset_in_page(addr) + len > PAGE_SIZE;
struct page *pages[2] = {NULL};
- temp_mm_state_t prev;
- unsigned long flags;
- pte_t pte, *ptep;
+ pte_t pte;
pgprot_t pgprot;
/*
@@ -836,8 +835,6 @@ static void __text_poke(void *addr, const void *opcode, size_t len)
*/
BUG_ON(!pages[0] || (cross_page_boundary && !pages[1]));
- local_irq_save(flags);
-
/*
* Map the page without the global bit, as TLB flushing is done with
* flush_tlb_mm_range(), which is intended for non-global PTEs.
@@ -849,30 +846,42 @@ static void __text_poke(void *addr, const void *opcode, size_t len)
* unlocked. This does mean that we need to be careful that no other
* context (ex. INT3 handler) is simultaneously writing to this pte.
*/
- ptep = __get_unlocked_pte(poking_mm, poking_addr);
+ *ptep = __get_unlocked_pte(poking_mm, poking_addr);
/*
* This must not fail; preallocated in poking_init().
*/
- VM_BUG_ON(!ptep);
+ VM_BUG_ON(!*ptep);
pte = mk_pte(pages[0], pgprot);
- set_pte_at(poking_mm, poking_addr, ptep, pte);
+ set_pte_at(poking_mm, poking_addr, *ptep, pte);
if (cross_page_boundary) {
pte = mk_pte(pages[1], pgprot);
- set_pte_at(poking_mm, poking_addr + PAGE_SIZE, ptep + 1, pte);
+ set_pte_at(poking_mm, poking_addr + PAGE_SIZE, *ptep + 1, pte);
}
/*
* Loading the temporary mm behaves as a compiler barrier, which
* guarantees that the PTE will be set at the time memcpy() is done.
*/
- prev = use_temporary_mm(poking_mm);
+ *prev_mm = use_temporary_mm(poking_mm);
+}
+/*
+ * Do the actual poke. Needs to be re-entrant as this can be called
+ * via INT3 context as well.
+ */
+static void __text_do_poke(unsigned long offset, const void *opcode, size_t len)
+{
kasan_disable_current();
- memcpy((u8 *)poking_addr + offset_in_page(addr), opcode, len);
+ memcpy((u8 *)poking_addr + offset, opcode, len);
kasan_enable_current();
+}
+static void __text_poke_unmap(void *addr, const void *opcode, size_t len,
+ temp_mm_state_t *prev_mm, pte_t *ptep)
+{
+ bool cross_page_boundary = offset_in_page(addr) + len > PAGE_SIZE;
/*
* Ensure that the PTE is only cleared after the instructions of memcpy
* were issued by using a compiler barrier.
@@ -888,7 +897,7 @@ static void __text_poke(void *addr, const void *opcode, size_t len)
* instruction that already allows the core to see the updated version.
* Xen-PV is assumed to serialize execution in a similar manner.
*/
- unuse_temporary_mm(prev);
+ unuse_temporary_mm(*prev_mm);
/*
* Flushing the TLB might involve IPIs, which would require enabled
@@ -903,7 +912,18 @@ static void __text_poke(void *addr, const void *opcode, size_t len)
* fundamentally screwy; there's nothing we can really do about that.
*/
BUG_ON(memcmp(addr, opcode, len));
+}
+static void __text_poke(void *addr, const void *opcode, size_t len)
+{
+ temp_mm_state_t prev_mm;
+ unsigned long flags;
+ pte_t *ptep;
+
+ local_irq_save(flags);
+ __text_poke_map(addr, len, &prev_mm, &ptep);
+ __text_do_poke(offset_in_page(addr), opcode, len);
+ __text_poke_unmap(addr, opcode, len, &prev_mm, ptep);
local_irq_restore(flags);
}
--
2.20.1
KVM pv-ops are dependent on KVM features/hints which are exposed
via CPUID. Decouple the probing and the enabling of these ops from
__init so they can be called post-init as well.
Signed-off-by: Ankur Arora <[email protected]>
---
arch/x86/Kconfig | 1 +
arch/x86/kernel/kvm.c | 87 ++++++++++++++++++++++++++++++-------------
2 files changed, 63 insertions(+), 25 deletions(-)
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 605619938f08..e0629558b6b5 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -809,6 +809,7 @@ config KVM_GUEST
depends on PARAVIRT
select PARAVIRT_CLOCK
select ARCH_CPUIDLE_HALTPOLL
+ select PARAVIRT_RUNTIME
default y
---help---
This option enables various optimizations for running under the KVM
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index e56d263159d7..31f5ecfd3907 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -24,6 +24,7 @@
#include <linux/debugfs.h>
#include <linux/nmi.h>
#include <linux/swait.h>
+#include <linux/memory.h>
#include <asm/timer.h>
#include <asm/cpu.h>
#include <asm/traps.h>
@@ -262,12 +263,20 @@ do_async_page_fault(struct pt_regs *regs, unsigned long error_code, unsigned lon
}
NOKPROBE_SYMBOL(do_async_page_fault);
+static bool kvm_pv_io_delay(void)
+{
+ bool cond = kvm_para_has_feature(KVM_FEATURE_NOP_IO_DELAY);
+
+ paravirt_stage_alt(cond, cpu.io_delay, kvm_io_delay);
+
+ return cond;
+}
+
static void __init paravirt_ops_setup(void)
{
pv_info.name = "KVM";
- if (kvm_para_has_feature(KVM_FEATURE_NOP_IO_DELAY))
- pv_ops.cpu.io_delay = kvm_io_delay;
+ kvm_pv_io_delay();
#ifdef CONFIG_X86_IO_APIC
no_timer_check = 1;
@@ -432,6 +441,15 @@ static bool pv_tlb_flush_supported(void)
kvm_para_has_feature(KVM_FEATURE_STEAL_TIME));
}
+static bool kvm_pv_steal_clock(void)
+{
+ bool cond = kvm_para_has_feature(KVM_FEATURE_STEAL_TIME);
+
+ paravirt_stage_alt(cond, time.steal_clock, kvm_steal_clock);
+
+ return cond;
+}
+
static DEFINE_PER_CPU(cpumask_var_t, __pv_cpu_mask);
#ifdef CONFIG_SMP
@@ -624,6 +642,17 @@ static void kvm_flush_tlb_others(const struct cpumask *cpumask,
native_flush_tlb_others(flushmask, info);
}
+static bool kvm_pv_tlb(void)
+{
+ bool cond = pv_tlb_flush_supported();
+
+ paravirt_stage_alt(cond, mmu.flush_tlb_others,
+ kvm_flush_tlb_others);
+ paravirt_stage_alt(cond, mmu.tlb_remove_table,
+ tlb_remove_table);
+ return cond;
+}
+
static void __init kvm_guest_init(void)
{
int i;
@@ -635,16 +664,11 @@ static void __init kvm_guest_init(void)
if (kvm_para_has_feature(KVM_FEATURE_ASYNC_PF))
x86_init.irqs.trap_init = kvm_apf_trap_init;
- if (kvm_para_has_feature(KVM_FEATURE_STEAL_TIME)) {
+ if (kvm_pv_steal_clock())
has_steal_clock = 1;
- pv_ops.time.steal_clock = kvm_steal_clock;
- }
- if (pv_tlb_flush_supported()) {
- pv_ops.mmu.flush_tlb_others = kvm_flush_tlb_others;
- pv_ops.mmu.tlb_remove_table = tlb_remove_table;
+ if (kvm_pv_tlb())
pr_info("KVM setup pv remote TLB flush\n");
- }
if (kvm_para_has_feature(KVM_FEATURE_PV_EOI))
apic_set_eoi_write(kvm_guest_apic_eoi_write);
@@ -849,33 +873,46 @@ asm(
#endif
+static inline bool kvm_para_lock_ops(void)
+{
+ /* Does host kernel support KVM_FEATURE_PV_UNHALT? */
+ return kvm_para_has_feature(KVM_FEATURE_PV_UNHALT) &&
+ !kvm_para_has_hint(KVM_HINTS_REALTIME);
+}
+
+static bool kvm_pv_spinlock(void)
+{
+ bool cond = kvm_para_lock_ops();
+ bool preempt_cond = cond &&
+ kvm_para_has_feature(KVM_FEATURE_STEAL_TIME);
+
+ paravirt_stage_alt(cond, lock.queued_spin_lock_slowpath,
+ __pv_queued_spin_lock_slowpath);
+ paravirt_stage_alt(cond, lock.queued_spin_unlock.func,
+ PV_CALLEE_SAVE(__pv_queued_spin_unlock).func);
+ paravirt_stage_alt(cond, lock.wait, kvm_wait);
+ paravirt_stage_alt(cond, lock.kick, kvm_kick_cpu);
+
+ paravirt_stage_alt(preempt_cond,
+ lock.vcpu_is_preempted.func,
+ PV_CALLEE_SAVE(__kvm_vcpu_is_preempted).func);
+ return cond;
+}
+
/*
* Setup pv_lock_ops to exploit KVM_FEATURE_PV_UNHALT if present.
*/
void __init kvm_spinlock_init(void)
{
- /* Does host kernel support KVM_FEATURE_PV_UNHALT? */
- if (!kvm_para_has_feature(KVM_FEATURE_PV_UNHALT))
- return;
-
- if (kvm_para_has_hint(KVM_HINTS_REALTIME))
- return;
/* Don't use the pvqspinlock code if there is only 1 vCPU. */
if (num_possible_cpus() == 1)
return;
- __pv_init_lock_hash();
- pv_ops.lock.queued_spin_lock_slowpath = __pv_queued_spin_lock_slowpath;
- pv_ops.lock.queued_spin_unlock =
- PV_CALLEE_SAVE(__pv_queued_spin_unlock);
- pv_ops.lock.wait = kvm_wait;
- pv_ops.lock.kick = kvm_kick_cpu;
+ if (!kvm_pv_spinlock())
+ return;
- if (kvm_para_has_feature(KVM_FEATURE_STEAL_TIME)) {
- pv_ops.lock.vcpu_is_preempted =
- PV_CALLEE_SAVE(__kvm_vcpu_is_preempted);
- }
+ __pv_init_lock_hash();
}
#endif /* CONFIG_PARAVIRT_SPINLOCKS */
--
2.20.1
A change in the state of a KVM hint like KVM_HINTS_REALTIME can have a
significant performance impact. Given that a hint might not be stable
across the lifetime of a guest, dynamic hints allow the host to inform
the guest when a hint changes.
Do this via the KVM CPUID leaf, with %ecx carrying the currently
applicable hints. If the guest has registered a callback via
MSR_KVM_HINT_VECTOR, the hint change is notified to it by means of a
callback triggered via the vcpu ioctl KVM_CALLBACK.
Signed-off-by: Ankur Arora <[email protected]>
---
The callback vector is currently tied in with the hint notification
and can (should) be made more generic such that we could deliver
arbitrary callbacks on it.
One use might be TSC frequency switching notifications for emulated
Hyper-V guests.
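For completeness, a hypothetical VMM-side sketch of the flow (only
KVM_CALLBACK and the %ecx hint bits come from this patch; the vcpu fd
plumbing and the CPUID update helper are assumptions):

  #include <stdio.h>
  #include <sys/ioctl.h>
  #include <linux/kvm.h>  /* from a kernel with this patch applied */

  static void notify_hint_change(int vcpu_fd)
  {
          /*
           * vmm_update_kvm_hints() is an assumed helper that re-issues
           * KVM_SET_CPUID2 with the new KVM_HINTS_REALTIME state in
           * %ecx of the KVM feature leaf.
           */
          vmm_update_kvm_hints(vcpu_fd);

          /* Kick the guest so it re-reads the hints. */
          if (ioctl(vcpu_fd, KVM_CALLBACK) < 0)
                  perror("KVM_CALLBACK"); /* e.g. no vector registered */
  }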
---
Documentation/virt/kvm/api.rst | 17 ++++++++++++
Documentation/virt/kvm/cpuid.rst | 9 +++++--
arch/x86/include/asm/kvm_host.h | 6 +++++
arch/x86/include/uapi/asm/kvm_para.h | 2 ++
arch/x86/kvm/cpuid.c | 3 ++-
arch/x86/kvm/x86.c | 39 ++++++++++++++++++++++++++++
include/uapi/linux/kvm.h | 4 +++
7 files changed, 77 insertions(+), 3 deletions(-)
diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index efbbe570aa9b..40a9b22d6979 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -4690,6 +4690,17 @@ KVM_PV_VM_VERIFY
Verify the integrity of the unpacked image. Only if this succeeds,
KVM is allowed to start protected VCPUs.
+4.126 KVM_CALLBACK
+------------------
+
+:Capability: KVM_CAP_CALLBACK
+:Architectures: x86
+:Type: vcpu ioctl
+:Parameters: none
+:Returns: 0 on success, -1 on error
+
+Queues a callback on the guest's vcpu if a callback has been registered.
+
5. The kvm_run structure
========================
@@ -6109,3 +6120,9 @@ KVM can therefore start protected VMs.
This capability governs the KVM_S390_PV_COMMAND ioctl and the
KVM_MP_STATE_LOAD MP_STATE. KVM_SET_MP_STATE can fail for protected
guests when the state change is invalid.
+
+8.24 KVM_CAP_CALLBACK
+
+Architectures: x86_64
+
+This capability indicates that the ioctl KVM_CALLBACK is available.
diff --git a/Documentation/virt/kvm/cpuid.rst b/Documentation/virt/kvm/cpuid.rst
index 01b081f6e7ea..5a997c9e74c0 100644
--- a/Documentation/virt/kvm/cpuid.rst
+++ b/Documentation/virt/kvm/cpuid.rst
@@ -86,6 +86,9 @@ KVM_FEATURE_PV_SCHED_YIELD 13 guest checks this feature bit
before using paravirtualized
sched yield.
+KVM_FEATURE_DYNAMIC_HINTS 14 guest handles feature hints
+ changing under it.
+
KVM_FEATURE_CLOCKSOURCE_STABLE_BIT 24 host will warn if no guest-side
per-cpu warps are expected in
kvmclock
@@ -93,9 +96,11 @@ KVM_FEATURE_CLOCSOURCE_STABLE_BIT 24 host will warn if no guest-side
::
- edx = an OR'ed group of (1 << flag)
+ ecx, edx = an OR'ed group of (1 << flag)
-Where ``flag`` here is defined as below:
+Where the ``flag`` in ecx is currently applicable hints, and ``flag`` in
+edx is the union of all hints ever provided to the guest, both drawn from
+the set listed below:
================== ============ =================================
flag value meaning
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 42a2d0d3984a..4f061550274d 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -723,6 +723,8 @@ struct kvm_vcpu_arch {
bool nmi_injected; /* Trying to inject an NMI this entry */
bool smi_pending; /* SMI queued after currently running handler */
+ bool callback_pending; /* Callback queued after running handler */
+
struct kvm_mtrr mtrr_state;
u64 pat;
@@ -982,6 +984,10 @@ struct kvm_arch {
struct kvm_pmu_event_filter *pmu_event_filter;
struct task_struct *nx_lpage_recovery_thread;
+
+ struct {
+ u8 vector;
+ } callback;
};
struct kvm_vm_stat {
diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h
index 2a8e0b6b9805..bf016e232f2f 100644
--- a/arch/x86/include/uapi/asm/kvm_para.h
+++ b/arch/x86/include/uapi/asm/kvm_para.h
@@ -31,6 +31,7 @@
#define KVM_FEATURE_PV_SEND_IPI 11
#define KVM_FEATURE_POLL_CONTROL 12
#define KVM_FEATURE_PV_SCHED_YIELD 13
+#define KVM_FEATURE_DYNAMIC_HINTS 14
#define KVM_HINTS_REALTIME 0
@@ -50,6 +51,7 @@
#define MSR_KVM_STEAL_TIME 0x4b564d03
#define MSR_KVM_PV_EOI_EN 0x4b564d04
#define MSR_KVM_POLL_CONTROL 0x4b564d05
+#define MSR_KVM_HINT_VECTOR 0x4b564d06
struct kvm_steal_time {
__u64 steal;
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 901cd1fdecd9..db6a4c4d9430 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -712,7 +712,8 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function)
(1 << KVM_FEATURE_ASYNC_PF_VMEXIT) |
(1 << KVM_FEATURE_PV_SEND_IPI) |
(1 << KVM_FEATURE_POLL_CONTROL) |
- (1 << KVM_FEATURE_PV_SCHED_YIELD);
+ (1 << KVM_FEATURE_PV_SCHED_YIELD) |
+ (1 << KVM_FEATURE_DYNAMIC_HINTS);
if (sched_info_on())
entry->eax |= (1 << KVM_FEATURE_STEAL_TIME);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index b8124b562dea..838d033bf5ba 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1282,6 +1282,7 @@ static const u32 emulated_msrs_all[] = {
MSR_K7_HWCR,
MSR_KVM_POLL_CONTROL,
+ MSR_KVM_HINT_VECTOR,
};
static u32 emulated_msrs[ARRAY_SIZE(emulated_msrs_all)];
@@ -2910,7 +2911,15 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
vcpu->arch.msr_kvm_poll_control = data;
break;
+ case MSR_KVM_HINT_VECTOR: {
+ u8 vector = (u8)data;
+ if ((u64)data > 0xffUL)
+ return 1;
+
+ vcpu->kvm->arch.callback.vector = vector;
+ break;
+ }
case MSR_IA32_MCG_CTL:
case MSR_IA32_MCG_STATUS:
case MSR_IA32_MC0_CTL ... MSR_IA32_MCx_CTL(KVM_MAX_MCE_BANKS) - 1:
@@ -3156,6 +3165,9 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
case MSR_KVM_POLL_CONTROL:
msr_info->data = vcpu->arch.msr_kvm_poll_control;
break;
+ case MSR_KVM_HINT_VECTOR:
+ msr_info->data = vcpu->kvm->arch.callback.vector;
+ break;
case MSR_IA32_P5_MC_ADDR:
case MSR_IA32_P5_MC_TYPE:
case MSR_IA32_MCG_CAP:
@@ -3373,6 +3385,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
case KVM_CAP_GET_MSR_FEATURES:
case KVM_CAP_MSR_PLATFORM_INFO:
case KVM_CAP_EXCEPTION_PAYLOAD:
+ case KVM_CAP_CALLBACK:
r = 1;
break;
case KVM_CAP_SYNC_REGS:
@@ -3721,6 +3734,20 @@ static int kvm_vcpu_ioctl_interrupt(struct kvm_vcpu *vcpu,
return 0;
}
+static int kvm_vcpu_ioctl_callback(struct kvm_vcpu *vcpu)
+{
+ /*
+ * Has the guest setup a callback?
+ */
+ if (vcpu->kvm->arch.callback.vector) {
+ vcpu->arch.callback_pending = true;
+ kvm_make_request(KVM_REQ_EVENT, vcpu);
+ return 0;
+ } else {
+ return -EINVAL;
+ }
+}
+
static int kvm_vcpu_ioctl_nmi(struct kvm_vcpu *vcpu)
{
kvm_inject_nmi(vcpu);
@@ -4611,6 +4638,10 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
r = 0;
break;
}
+ case KVM_CALLBACK: {
+ r = kvm_vcpu_ioctl_callback(vcpu);
+ break;
+ }
default:
r = -EINVAL;
}
@@ -7737,6 +7768,14 @@ static int inject_pending_event(struct kvm_vcpu *vcpu)
--vcpu->arch.nmi_pending;
vcpu->arch.nmi_injected = true;
kvm_x86_ops.set_nmi(vcpu);
+ } else if (vcpu->arch.callback_pending) {
+ if (kvm_x86_ops.interrupt_allowed(vcpu)) {
+ vcpu->arch.callback_pending = false;
+ kvm_queue_interrupt(vcpu,
+ vcpu->kvm->arch.callback.vector,
+ false);
+ kvm_x86_ops.set_irq(vcpu);
+ }
} else if (kvm_cpu_has_injectable_intr(vcpu)) {
/*
* Because interrupts can be injected asynchronously, we are
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 428c7dde6b4b..5401c056742c 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1017,6 +1017,7 @@ struct kvm_ppc_resize_hpt {
#define KVM_CAP_S390_VCPU_RESETS 179
#define KVM_CAP_S390_PROTECTED 180
#define KVM_CAP_PPC_SECURE_GUEST 181
+#define KVM_CAP_CALLBACK 182
#ifdef KVM_CAP_IRQ_ROUTING
@@ -1518,6 +1519,9 @@ struct kvm_pv_cmd {
/* Available with KVM_CAP_S390_PROTECTED */
#define KVM_S390_PV_COMMAND _IOWR(KVMIO, 0xc5, struct kvm_pv_cmd)
+/* Available with KVM_CAP_CALLBACK */
+#define KVM_CALLBACK _IO(KVMIO, 0xc6)
+
/* Secure Encrypted Virtualization command */
enum sev_cmd_id {
/* Guest initialization commands */
--
2.20.1
Paravirt-ops are patched at init to convert indirect calls into
direct calls and, in some cases, to inline the target at the call-site.
This is done by way of PVOP* macros which save the call-site
information via compile-time annotations.
Pull this state out into .parainstructions.runtime for some pv-ops so
that they can be used for runtime patching.
Signed-off-by: Ankur Arora <[email protected]>
---
arch/x86/Kconfig | 12 ++++++++++++
arch/x86/include/asm/paravirt_types.h | 5 +++++
arch/x86/include/asm/text-patching.h | 5 +++++
arch/x86/kernel/alternative.c | 2 ++
arch/x86/kernel/module.c | 10 +++++++++-
arch/x86/kernel/vmlinux.lds.S | 16 ++++++++++++++++
include/asm-generic/vmlinux.lds.h | 8 ++++++++
7 files changed, 57 insertions(+), 1 deletion(-)
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 1edf788d301c..605619938f08 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -764,6 +764,18 @@ config PARAVIRT
over full virtualization. However, when run without a hypervisor
the kernel is theoretically slower and slightly larger.
+config PARAVIRT_RUNTIME
+ bool "Enable paravirtualized ops to be patched at runtime"
+ depends on PARAVIRT
+ help
+ Enable the paravirtualized guest kernel to switch pv-ops based on
+ changed host conditions, potentially improving performance
+ significantly.
+
+ This would increase the memory footprint of the running kernel
+ slightly (depending mostly on whether lock and unlock are inlined
+ or not.)
+
config PARAVIRT_XXL
bool
diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index 37e8f27a3b9d..00e4a062ca10 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -723,6 +723,11 @@ struct paravirt_patch_site {
extern struct paravirt_patch_site __parainstructions[],
__parainstructions_end[];
+#ifdef CONFIG_PARAVIRT_RUNTIME
+extern struct paravirt_patch_site __parainstructions_runtime[],
+ __parainstructions_runtime_end[];
+#endif
+
#endif /* __ASSEMBLY__ */
#endif /* _ASM_X86_PARAVIRT_TYPES_H */
diff --git a/arch/x86/include/asm/text-patching.h b/arch/x86/include/asm/text-patching.h
index 67315fa3956a..e2ef241c261e 100644
--- a/arch/x86/include/asm/text-patching.h
+++ b/arch/x86/include/asm/text-patching.h
@@ -18,6 +18,11 @@ static inline void apply_paravirt(struct paravirt_patch_site *start,
#define __parainstructions_end NULL
#endif
+#ifndef CONFIG_PARAVIRT_RUNTIME
+#define __parainstructions_runtime NULL
+#define __parainstructions_runtime_end NULL
+#endif
+
/*
* Currently, the max observed size in the kernel code is
* JUMP_LABEL_NOP_SIZE/RELATIVEJUMP_SIZE, which are 5.
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 7867dfb3963e..fdfda1375f82 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -740,6 +740,8 @@ void __init alternative_instructions(void)
#endif
apply_paravirt(__parainstructions, __parainstructions_end);
+ apply_paravirt(__parainstructions_runtime,
+ __parainstructions_runtime_end);
restart_nmi();
alternatives_patched = 1;
diff --git a/arch/x86/kernel/module.c b/arch/x86/kernel/module.c
index d5c72cb877b3..658ea60ce324 100644
--- a/arch/x86/kernel/module.c
+++ b/arch/x86/kernel/module.c
@@ -222,7 +222,7 @@ int module_finalize(const Elf_Ehdr *hdr,
struct module *me)
{
const Elf_Shdr *s, *text = NULL, *alt = NULL, *locks = NULL,
- *para = NULL, *orc = NULL, *orc_ip = NULL;
+ *para = NULL, *para_run = NULL, *orc = NULL, *orc_ip = NULL;
char *secstrings = (void *)hdr + sechdrs[hdr->e_shstrndx].sh_offset;
for (s = sechdrs; s < sechdrs + hdr->e_shnum; s++) {
@@ -234,6 +234,9 @@ int module_finalize(const Elf_Ehdr *hdr,
locks = s;
if (!strcmp(".parainstructions", secstrings + s->sh_name))
para = s;
+ if (!strcmp(".parainstructions.runtime",
+ secstrings + s->sh_name))
+ para_run = s;
if (!strcmp(".orc_unwind", secstrings + s->sh_name))
orc = s;
if (!strcmp(".orc_unwind_ip", secstrings + s->sh_name))
@@ -257,6 +260,11 @@ int module_finalize(const Elf_Ehdr *hdr,
void *pseg = (void *)para->sh_addr;
apply_paravirt(pseg, pseg + para->sh_size);
}
+ if (para_run) {
+ void *pseg = (void *)para_run->sh_addr;
+
+ apply_paravirt(pseg, pseg + para_run->sh_size);
+ }
/* make jump label nops */
jump_label_apply_nops(me);
diff --git a/arch/x86/kernel/vmlinux.lds.S b/arch/x86/kernel/vmlinux.lds.S
index 1bf7e312361f..7f5b8f6ab96e 100644
--- a/arch/x86/kernel/vmlinux.lds.S
+++ b/arch/x86/kernel/vmlinux.lds.S
@@ -269,6 +269,7 @@ SECTIONS
.parainstructions : AT(ADDR(.parainstructions) - LOAD_OFFSET) {
__parainstructions = .;
*(.parainstructions)
+ PARAVIRT_DISCARD(.parainstructions.runtime)
__parainstructions_end = .;
}
@@ -348,6 +349,21 @@ SECTIONS
__smp_locks_end = .;
}
+#ifdef CONFIG_PARAVIRT_RUNTIME
+ /*
+ * .parainstructions.runtime sticks around in memory after
+ * init so it doesn't need to be page-aligned but everything
+ * around us is so we will be too.
+ */
+ . = ALIGN(8);
+ .parainstructions.runtime : AT(ADDR(.parainstructions.runtime) - \
+ LOAD_OFFSET) {
+ __parainstructions_runtime = .;
+ PARAVIRT_KEEP(.parainstructions.runtime)
+ __parainstructions_runtime_end = .;
+ }
+#endif
+
#ifdef CONFIG_X86_64
.data_nosave : AT(ADDR(.data_nosave) - LOAD_OFFSET) {
NOSAVE_DATA
diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h
index 71e387a5fe90..6b009d5ce51f 100644
--- a/include/asm-generic/vmlinux.lds.h
+++ b/include/asm-generic/vmlinux.lds.h
@@ -135,6 +135,14 @@
#define MEM_DISCARD(sec) *(.mem##sec)
#endif
+#if defined(CONFIG_PARAVIRT_RUNTIME)
+#define PARAVIRT_KEEP(sec) *(sec)
+#define PARAVIRT_DISCARD(sec)
+#else
+#define PARAVIRT_KEEP(sec)
+#define PARAVIRT_DISCARD(sec) *(sec)
+#endif
+
#ifdef CONFIG_FTRACE_MCOUNT_RECORD
/*
* The ftrace call sites are logged to a section whose name depends on the
--
2.20.1
runtime_patch() generates insn sequences for patching supported pv_ops.
It does this by calling paravirt_patch_default() or native_patch(),
depending on whether the target is a paravirt or native pv-op.
In addition, runtime_patch() whitelists the pv-ops that are safe to
patch at runtime.
The static conditions that need to be satisfied to patch safely:
- Insn sequences under replacement need to execute without preemption.
This is meant to avoid scenarios where a call-site (ex.
lock.vcpu_is_preempted) switches between the following sequences:
lock.vcpu_is_preempted = __raw_callee_save___kvm_vcpu_is_preempted
0: e8 31 e6 ff ff callq 0xffffffffffffe636
5: 66 90 xchg %ax,%ax # NOP2
lock.vcpu_is_preempted = __raw_callee_save___native_vcpu_is_preempted
0: 31 c0 xor %eax,%eax
2: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1) # NOP5
If kvm_vcpu_is_preempted() were preemptible, then, post patching,
we would return to address 5 above, which is in the middle of an
instruction in the sequence for native_vcpu_is_preempted().
Even if this were to be made safe (e.g. by changing the NOP2 to be a
prefix instead of a suffix), it would still not be enough -- since
we do not want any code from the switched-out pv-op to be executing
after the pv-op has been switched out.
- Entered only at the beginning: this allows us to use text_poke()
which uses INT3 as a barrier.
We don't store addresses that point inside any call-site, so the second
condition can be assumed.
Guaranteeing the first condition boils down to stating that any pv-op
being patched cannot be present/referenced from any call-stack in the
system. pv-ops that are not obviously non-preemptible need to be
enclosed in preempt_disable_runtime_patch()/preempt_enable_runtime_patch().
This should be sufficient because runtime_patch() itself is called from
a stop_machine() context, which is enough to flush out any
non-preemptible sequences.
Note that preemption in the host is okay: stop_machine() would unwind
any pv-ops sleeping in the host.
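As an example of that bracketing, a sketch of what the
vcpu_is_preempted() wrapper might look like (the actual whitelisting and
conversion happen in the later spinlock patches):

  static __always_inline bool pv_vcpu_is_preempted(long cpu)
  {
          bool ret;

          /* No reschedule point is added on the enable side. */
          preempt_disable_runtime_patch();
          ret = PVOP_CALLEE1(bool, lock.vcpu_is_preempted, cpu);
          preempt_enable_runtime_patch();

          return ret;
  }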
Signed-off-by: Ankur Arora <[email protected]>
---
arch/x86/include/asm/paravirt_types.h | 8 +++++
arch/x86/kernel/paravirt.c | 6 +---
arch/x86/kernel/paravirt_patch.c | 49 +++++++++++++++++++++++++++
include/linux/preempt.h | 17 ++++++++++
4 files changed, 75 insertions(+), 5 deletions(-)
diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index bc935eec7ec6..3b9f6c105397 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -350,6 +350,12 @@ extern struct paravirt_patch_template native_pv_ops;
#define PARAVIRT_PATCH(x) \
(offsetof(struct paravirt_patch_template, x) / sizeof(void *))
+/*
+ * Neat trick to map patch type back to the call within the
+ * corresponding structure.
+ */
+#define PARAVIRT_PATCH_OP(ops, type) (*(long *)(&((long **)&(ops))[type]))
+
#define paravirt_type(op) \
[paravirt_typenum] "i" (PARAVIRT_PATCH(op)), \
[paravirt_opptr] "i" (&(pv_ops.op))
@@ -383,6 +389,8 @@ unsigned paravirt_patch_default(u8 type, void *insn_buff, unsigned long addr, un
unsigned paravirt_patch_insns(void *insn_buff, unsigned len, const char *start, const char *end);
unsigned native_patch(u8 type, void *insn_buff, unsigned long addr, unsigned len);
+int runtime_patch(u8 type, void *insn_buff, void *op, unsigned long addr,
+ unsigned int len);
int paravirt_disable_iospace(void);
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index 8c511cc4d4f4..c4128436b05a 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -117,11 +117,7 @@ void __init native_pv_lock_init(void)
unsigned paravirt_patch_default(u8 type, void *insn_buff,
unsigned long addr, unsigned len)
{
- /*
- * Neat trick to map patch type back to the call within the
- * corresponding structure.
- */
- void *opfunc = *((void **)&pv_ops + type);
+ void *opfunc = (void *)PARAVIRT_PATCH_OP(pv_ops, type);
unsigned ret;
if (opfunc == NULL)
diff --git a/arch/x86/kernel/paravirt_patch.c b/arch/x86/kernel/paravirt_patch.c
index 3eff63c090d2..3eb8c0e720b4 100644
--- a/arch/x86/kernel/paravirt_patch.c
+++ b/arch/x86/kernel/paravirt_patch.c
@@ -1,5 +1,6 @@
// SPDX-License-Identifier: GPL-2.0
#include <linux/stringify.h>
+#include <linux/errno.h>
#include <asm/paravirt.h>
#include <asm/asm-offsets.h>
@@ -124,3 +125,51 @@ unsigned int native_patch(u8 type, void *insn_buff, unsigned long addr,
return paravirt_patch_default(type, insn_buff, addr, len);
}
+
+#ifdef CONFIG_PARAVIRT_RUNTIME
+/**
+ * runtime_patch - Generate patching code for a native/paravirt op
+ * @type: op type to generate code for
+ * @insn_buff: destination buffer
+ * @op: op target
+ * @addr: call site address
+ * @len: length of insn_buff
+ *
+ * Note that pv-ops are only suitable for runtime patching if they are
+ * non-preemptible. This is necessary for two reasons: we don't want to
+ * be overwriting insn sequences which might be referenced from call-stacks
+ * (and thus would be returned to), and we want patching to act as a barrier
+ * so no code from now stale paravirt ops should execute after an op has
+ * changed.
+ *
+ * Return: size of insn sequence on success, -EINVAL on error.
+ */
+int runtime_patch(u8 type, void *insn_buff, void *op,
+ unsigned long addr, unsigned int len)
+{
+ void *native_op;
+ int used = 0;
+
+ /* Nothing whitelisted for now. */
+ switch (type) {
+ default:
+ pr_warn("type=%d unsuitable for runtime-patching\n", type);
+ return -EINVAL;
+ }
+
+ if (PARAVIRT_PATCH_OP(pv_ops, type) != (long)op)
+ PARAVIRT_PATCH_OP(pv_ops, type) = (long)op;
+
+ native_op = (void *)PARAVIRT_PATCH_OP(native_pv_ops, type);
+
+ /*
+ * Use native_patch() to get the right insns if we are switching
+ * back to a native_op.
+ */
+ if (op == native_op)
+ used = native_patch(type, insn_buff, addr, len);
+ else
+ used = paravirt_patch_default(type, insn_buff, addr, len);
+ return used;
+}
+#endif /* CONFIG_PARAVIRT_RUNTIME */
diff --git a/include/linux/preempt.h b/include/linux/preempt.h
index bc3f1aecaa19..c569d077aab2 100644
--- a/include/linux/preempt.h
+++ b/include/linux/preempt.h
@@ -203,6 +203,13 @@ do { \
__preempt_schedule(); \
} while (0)
+/*
+ * preempt_enable_no_resched() so we don't add any preemption points until
+ * after the caller has returned.
+ */
+#define preempt_enable_runtime_patch() preempt_enable_no_resched()
+#define preempt_disable_runtime_patch() preempt_disable()
+
#else /* !CONFIG_PREEMPTION */
#define preempt_enable() \
do { \
@@ -217,6 +224,12 @@ do { \
} while (0)
#define preempt_check_resched() do { } while (0)
+
+/*
+ * NOP, if there's no preemption.
+ */
+#define preempt_disable_runtime_patch() do { } while (0)
+#define preempt_enable_runtime_patch() do { } while (0)
#endif /* CONFIG_PREEMPTION */
#define preempt_disable_notrace() \
@@ -250,6 +263,8 @@ do { \
#define preempt_enable_notrace() barrier()
#define preemptible() 0
+#define preempt_disable_runtime_patch() do { } while (0)
+#define preempt_enable_runtime_patch() do { } while (0)
#endif /* CONFIG_PREEMPT_COUNT */
#ifdef MODULE
@@ -260,6 +275,8 @@ do { \
#undef preempt_enable_no_resched
#undef preempt_enable_no_resched_notrace
#undef preempt_check_resched
+#undef preempt_disable_runtime_patch
+#undef preempt_enable_runtime_patch
#endif
#define preempt_set_need_resched() \
--
2.20.1
Introduce native_pv_ops, where we stash a copy of the pv_ops structure
before hypervisor-specific hooks have modified it.
native_pv_ops is used when switching between paravirt and native
pv-ops at runtime.
Signed-off-by: Ankur Arora <[email protected]>
---
arch/x86/include/asm/paravirt_types.h | 4 ++++
arch/x86/kernel/paravirt.c | 10 ++++++++++
arch/x86/kernel/setup.c | 2 ++
3 files changed, 16 insertions(+)
diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index f1153f53c529..bc935eec7ec6 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -339,6 +339,7 @@ extern struct paravirt_patch_template pv_ops;
#ifdef CONFIG_PARAVIRT_RUNTIME
#define PVRT_SUFFIX ".runtime"
+extern struct paravirt_patch_template native_pv_ops;
#else
#define PVRT_SUFFIX ""
#endif
@@ -775,6 +776,9 @@ extern struct paravirt_patch_site __parainstructions[],
#ifdef CONFIG_PARAVIRT_RUNTIME
extern struct paravirt_patch_site __parainstructions_runtime[],
__parainstructions_runtime_end[];
+void paravirt_ops_init(void);
+#else
+static inline void paravirt_ops_init(void) { }
#endif
#endif /* __ASSEMBLY__ */
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index c131ba4e70ef..8c511cc4d4f4 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -458,5 +458,15 @@ NOKPROBE_SYMBOL(native_set_debugreg);
NOKPROBE_SYMBOL(native_load_idt);
#endif
+#ifdef CONFIG_PARAVIRT_RUNTIME
+__ro_after_init struct paravirt_patch_template native_pv_ops;
+
+void __init paravirt_ops_init(void)
+{
+ native_pv_ops = pv_ops;
+}
+EXPORT_SYMBOL(native_pv_ops);
+#endif
+
EXPORT_SYMBOL(pv_ops);
EXPORT_SYMBOL_GPL(pv_info);
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index e6b545047f38..2746a6a78fe7 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -43,6 +43,7 @@
#include <asm/unwind.h>
#include <asm/vsyscall.h>
#include <linux/vmalloc.h>
+#include <asm/paravirt_types.h>
/*
* max_low_pfn_mapped: highest directly mapped pfn < 4 GB
@@ -831,6 +832,7 @@ void __init setup_arch(char **cmdline_p)
boot_cpu_data.x86_phys_bits = MAX_PHYSMEM_BITS;
#endif
+ paravirt_ops_init();
/*
* If we have OLPC OFW, we might end up relocating the fixmap due to
* reserve_top(), so do this before touching the ioremap area.
--
2.20.1
Add paravirt_runtime_patch(), which uses text_poke_late() to patch
paravirt sites.
Also add paravirt_worker(), which drives the actual patching: for each
site it calls generate_paravirt() (which uses runtime_patch() to
generate the appropriate native or paravirt insn sequence) and then
text_poke_site() to do the actual poking.
  CPU0                                    CPUx
  ----                                    ----
  patch_worker()                          patch_worker()

    /* Traversal, insn-gen */               text_poke_sync_finish()
    tps.patch_worker()                        /*
      /* = paravirt_worker() */                * wait until:
                                               *  tps->state == PATCH_DONE
      /* for each patch-site */                */
        generate_paravirt()
          runtime_patch()

        text_poke_site()
          poke_sync()

         ...                                  ...

  smp_store_release(&tps->state, PATCH_DONE)
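A self-contained userspace model of the handshake in the diagram (pthreads
plus C11 atomics; the state names, fields and function names here are
illustrative stand-ins, not the kernel's text_poke_state machinery):

	#include <pthread.h>
	#include <stdatomic.h>
	#include <stdio.h>

	enum { PATCH_IN_PROGRESS, PATCH_DONE };

	struct poke_state {
		_Atomic int state;
		int nsites;
	};

	/* Secondary CPUs: text_poke_sync_finish() -- spin until done. */
	static void *secondary(void *arg)
	{
		struct poke_state *tps = arg;

		while (atomic_load_explicit(&tps->state,
					    memory_order_acquire) != PATCH_DONE)
			;	/* the real loop would sync_core() between polls */
		return NULL;
	}

	/* Primary CPU: generate and poke each site, then release PATCH_DONE. */
	static void primary(struct poke_state *tps)
	{
		for (int i = 0; i < tps->nsites; i++)
			printf("patched site %d\n", i);	/* generate + poke + sync */

		/* smp_store_release(&tps->state, PATCH_DONE) */
		atomic_store_explicit(&tps->state, PATCH_DONE,
				      memory_order_release);
	}

	int main(void)
	{
		struct poke_state tps = { PATCH_IN_PROGRESS, 3 };
		pthread_t t;

		pthread_create(&t, NULL, secondary, &tps);
		primary(&tps);
		pthread_join(t, NULL);
		return 0;
	}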
Signed-off-by: Ankur Arora <[email protected]>
---
arch/x86/include/asm/text-patching.h | 2 +
arch/x86/kernel/alternative.c | 98 +++++++++++++++++++++++++++-
2 files changed, 99 insertions(+), 1 deletion(-)
diff --git a/arch/x86/include/asm/text-patching.h b/arch/x86/include/asm/text-patching.h
index c4b2814f2f9d..e86709a8287e 100644
--- a/arch/x86/include/asm/text-patching.h
+++ b/arch/x86/include/asm/text-patching.h
@@ -21,6 +21,8 @@ static inline void apply_paravirt(struct paravirt_patch_site *start,
#ifndef CONFIG_PARAVIRT_RUNTIME
#define __parainstructions_runtime NULL
#define __parainstructions_runtime_end NULL
+#else
+int paravirt_runtime_patch(void);
#endif
/*
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 452d4081eded..1c5acdc4f349 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -1463,7 +1463,9 @@ static void poke_sync(struct text_poke_state *tps, int state, int offset,
/**
* text_poke_site() - called on the primary to patch a single call site.
*
- * Returns after switching tps->state to PATCH_SYNC_DONE.
+ * Called in thread context with tps->state == PATCH_SYNC_DONE where it
+ * takes tps->state through different PATCH_SYNC_* states, returning
+ * after having switched the tps->state back to PATCH_SYNC_DONE.
*/
static void __maybe_unused text_poke_site(struct text_poke_state *tps,
struct text_poke_loc *tp)
@@ -1598,6 +1600,16 @@ static int __maybe_unused text_poke_late(patch_worker_t worker, void *stage)
return ret;
}
+/*
+ * Check if this address is still in scope of this module's .text section.
+ */
+static bool __maybe_unused stale_address(struct alt_module *am, u8 *p)
+{
+ if (p < am->text || p >= am->text_end)
+ return true;
+ return false;
+}
+
#ifdef CONFIG_PARAVIRT_RUNTIME
struct paravirt_stage_entry {
void *dest; /* pv_op destination */
@@ -1654,4 +1666,88 @@ void text_poke_pv_stage_zero(void)
lockdep_assert_held(&text_mutex);
pv_stage.count = 0;
}
+
+/**
+ * generate_paravirt - fill up the insn sequence for a pv-op.
+ *
+ * @tp - address of struct text_poke_loc
+ * @op - the pv-op entry for this location
+ * @site - patch site (kernel or module text)
+ */
+static void generate_paravirt(struct text_poke_loc *tp,
+ struct paravirt_stage_entry *op,
+ struct paravirt_patch_site *site)
+{
+ int used;
+
+ BUG_ON(site->len > POKE_MAX_OPCODE_SIZE);
+
+ text_poke_loc_init(tp, site->instr, site->instr, site->len, NULL, true);
+
+ /*
+ * Paravirt patches can patch calls (ex. mmu.tlb_flush),
+ * callee_saves(ex. queued_spin_unlock).
+ *
+ * runtime_patch() calls native_patch(), or paravirt_patch()
+ * based on the destination.
+ */
+ used = runtime_patch(site->type, (void *)tp->text, op->dest,
+ (unsigned long)site->instr, site->len);
+
+ /* No good way to recover. */
+ BUG_ON(used < 0);
+
+ /* Pad the rest with nops */
+ add_nops((void *)tp->text + used, site->len - used);
+}
+
+/**
+ * paravirt_worker - generate the paravirt patching
+ * insns and calls text_poke_site() to do the actual patching.
+ */
+static void paravirt_worker(struct text_poke_state *tps)
+{
+ struct paravirt_patch_site *site;
+ struct paravirt_stage *stage = tps->stage;
+ struct paravirt_stage_entry *op = &stage->ops[0];
+ struct alt_module *am;
+ struct text_poke_loc tp;
+ int i;
+
+ list_for_each_entry(am, tps->head, next) {
+ for (site = am->para; site < am->para_end; site++) {
+ if (stale_address(am, site->instr))
+ continue;
+
+ for (i = 0; i < stage->count; i++) {
+ if (op[i].type != site->type)
+ continue;
+
+ generate_paravirt(&tp, &op[i], site);
+
+ text_poke_site(tps, &tp);
+ }
+ }
+ }
+}
+
+/**
+ * paravirt_runtime_patch() -- patch pv-ops, including paired ops.
+ *
+ * Called holding the text_mutex.
+ *
+ * Modify possibly multiple mutually-dependent pv-op callsites
+ * (ex. pv_lock_ops) using stop_machine().
+ *
+ * Return: 0 on success, -errno on failure.
+ */
+int paravirt_runtime_patch(void)
+{
+ lockdep_assert_held(&text_mutex);
+
+ if (!pv_stage.count)
+ return -EINVAL;
+
+ return text_poke_late(paravirt_worker, &pv_stage);
+}
#endif /* CONFIG_PARAVIRT_RUNTIME */
--
2.20.1
Enable runtime patching of paravirt spinlocks. These can be switched
trivially because pv_lock_ops are never preemptible -- preemption is
disabled at entry to spin_lock*().
Note that a vCPU might get preempted in the host, but because
paravirt_runtime_patch() is called via stop_machine(), the migration
thread flushes out any kernel threads preempted in the host.
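For reference, the generic lock entry path (essentially as in
include/linux/spinlock_api_smp.h, lightly condensed here) shows that
preemption is already off before any pv_lock_op can run:

	static inline void __raw_spin_lock(raw_spinlock_t *lock)
	{
		preempt_disable();	/* no preemption past this point */
		spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
		/*
		 * do_raw_spin_lock() -> queued_spin_lock_slowpath(): the
		 * pv_lock_ops run with preemption already disabled.
		 */
		LOCK_CONTENDED(lock, do_raw_spin_trylock, do_raw_spin_lock);
	}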
Signed-off-by: Ankur Arora <[email protected]>
---
arch/x86/include/asm/paravirt.h | 10 +++++-----
arch/x86/kernel/paravirt_patch.c | 12 ++++++++++++
kernel/locking/lock_events.c | 2 +-
3 files changed, 18 insertions(+), 6 deletions(-)
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index 694d8daf4983..cb3d0a91c060 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -642,27 +642,27 @@ static inline void __set_fixmap(unsigned /* enum fixed_addresses */ idx,
static __always_inline void pv_queued_spin_lock_slowpath(struct qspinlock *lock,
u32 val)
{
- PVOP_VCALL2(lock.queued_spin_lock_slowpath, lock, val);
+ PVRTOP_VCALL2(lock.queued_spin_lock_slowpath, lock, val);
}
static __always_inline void pv_queued_spin_unlock(struct qspinlock *lock)
{
- PVOP_VCALLEE1(lock.queued_spin_unlock, lock);
+ PVRTOP_VCALLEE1(lock.queued_spin_unlock, lock);
}
static __always_inline void pv_wait(u8 *ptr, u8 val)
{
- PVOP_VCALL2(lock.wait, ptr, val);
+ PVRTOP_VCALL2(lock.wait, ptr, val);
}
static __always_inline void pv_kick(int cpu)
{
- PVOP_VCALL1(lock.kick, cpu);
+ PVRTOP_VCALL1(lock.kick, cpu);
}
static __always_inline bool pv_vcpu_is_preempted(long cpu)
{
- return PVOP_CALLEE1(bool, lock.vcpu_is_preempted, cpu);
+ return PVRTOP_CALLEE1(bool, lock.vcpu_is_preempted, cpu);
}
void __raw_callee_save___native_queued_spin_unlock(struct qspinlock *lock);
diff --git a/arch/x86/kernel/paravirt_patch.c b/arch/x86/kernel/paravirt_patch.c
index 3eb8c0e720b4..3f8606f2811c 100644
--- a/arch/x86/kernel/paravirt_patch.c
+++ b/arch/x86/kernel/paravirt_patch.c
@@ -152,6 +152,18 @@ int runtime_patch(u8 type, void *insn_buff, void *op,
/* Nothing whitelisted for now. */
switch (type) {
+#ifdef CONFIG_PARAVIRT_SPINLOCKS
+ /*
+ * Preemption is always disabled in the lifetime of a spinlock
+ * (whether held or while waiting to acquire.)
+ */
+ case PARAVIRT_PATCH(lock.queued_spin_lock_slowpath):
+ case PARAVIRT_PATCH(lock.queued_spin_unlock):
+ case PARAVIRT_PATCH(lock.wait):
+ case PARAVIRT_PATCH(lock.kick):
+ case PARAVIRT_PATCH(lock.vcpu_is_preempted):
+ break;
+#endif
default:
pr_warn("type=%d unsuitable for runtime-patching\n", type);
return -EINVAL;
diff --git a/kernel/locking/lock_events.c b/kernel/locking/lock_events.c
index fa2c2f951c6b..c3057e82e6f9 100644
--- a/kernel/locking/lock_events.c
+++ b/kernel/locking/lock_events.c
@@ -115,7 +115,7 @@ static const struct file_operations fops_lockevent = {
.llseek = default_llseek,
};
-#ifdef CONFIG_PARAVIRT_SPINLOCKS
+#if defined(CONFIG_PARAVIRT_SPINLOCKS) && !defined(CONFIG_PARAVIRT_RUNTIME)
#include <asm/paravirt.h>
static bool __init skip_lockevent(const char *name)
--
2.20.1
Make __pv_init_lock_hash() conditional on paravirt spinlocks either
being enabled (via kvm_pv_spinlock()) or possibly getting enabled later
(runtime patching via CONFIG_PARAVIRT_RUNTIME).
Also add a handler for CPUID reprobe which can trigger this patching.
Signed-off-by: Ankur Arora <[email protected]>
---
arch/x86/kernel/kvm.c | 34 +++++++++++++++++++++++++++++-----
1 file changed, 29 insertions(+), 5 deletions(-)
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 31f5ecfd3907..1cb7eab805a6 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -35,6 +35,7 @@
#include <asm/hypervisor.h>
#include <asm/tlb.h>
#include <asm/cpuidle_haltpoll.h>
+#include <asm/text-patching.h>
static int kvmapf = 1;
@@ -909,12 +910,15 @@ void __init kvm_spinlock_init(void)
if (num_possible_cpus() == 1)
return;
- if (!kvm_pv_spinlock())
- return;
-
- __pv_init_lock_hash();
+ /*
+ * Allocate if pv_spinlocks are enabled or if we might
+ * end up patching them in later.
+ */
+ if (kvm_pv_spinlock() || IS_ENABLED(CONFIG_PARAVIRT_RUNTIME))
+ __pv_init_lock_hash();
}
-
+#else /* !CONFIG_PARAVIRT_SPINLOCKS */
+static inline bool kvm_pv_spinlock(void) { return false; }
#endif /* CONFIG_PARAVIRT_SPINLOCKS */
#ifdef CONFIG_ARCH_CPUIDLE_HALTPOLL
@@ -952,3 +956,23 @@ void arch_haltpoll_disable(unsigned int cpu)
}
EXPORT_SYMBOL_GPL(arch_haltpoll_disable);
#endif
+
+#ifdef CONFIG_PARAVIRT_RUNTIME
+void kvm_trigger_reprobe_cpuid(struct work_struct *work)
+{
+ mutex_lock(&text_mutex);
+
+ paravirt_stage_zero();
+
+ kvm_pv_steal_clock();
+ kvm_pv_tlb();
+ paravirt_runtime_patch(false);
+
+ paravirt_stage_zero();
+
+ kvm_pv_spinlock();
+ paravirt_runtime_patch(true);
+
+ mutex_unlock(&text_mutex);
+}
+#endif /* CONFIG_PARAVIRT_RUNTIME */
--
2.20.1
On Tue, Apr 07, 2020 at 10:03:06PM -0700, Ankur Arora wrote:
> +/*
> + * preempt_enable_no_resched() so we don't add any preemption points until
> + * after the caller has returned.
> + */
> +#define preempt_enable_runtime_patch() preempt_enable_no_resched()
> +#define preempt_disable_runtime_patch() preempt_disable()
NAK, this is probably a straight preemption bug, also, afaict, there
aren't actually any users of this in the patch-set.
On Tue, Apr 07, 2020 at 10:02:57PM -0700, Ankur Arora wrote:
> A KVM host (or another hypervisor) might advertise paravirtualized
> features and optimization hints (ex KVM_HINTS_REALTIME) which might
> become stale over the lifetime of the guest. For instance, the
> host might go from being undersubscribed to being oversubscribed
> (or the other way round) and it would make sense for the guest
> switch pv-ops based on that.
So what, the paravirt spinlock stuff works just fine when you're not
oversubscribed.
> We keep an interesting subset of pv-ops (pv_lock_ops only for now,
> but PV-TLB ops are also good candidates)
The PV-TLB ops also work just fine when not oversubscribed. IIRC
kvm_flush_tlb_others() is pretty much the same in that case.
> in .parainstructions.runtime,
> while discarding the .parainstructions as usual at init. This is then
> used for switching back and forth between native and paravirt mode.
> ([1] lists some representative numbers of the increased memory
> footprint.)
>
> Mechanism: the patching itself is done using stop_machine(). That is
> not ideal -- text_poke_stop_machine() was replaced with INT3+emulation
> via text_poke_bp(), but I'm using this to address two issues:
> 1) emulation in text_poke() can only easily handle a small set
> of instructions and this is problematic for inlined pv-ops (and see
> a possible alternatives use-case below.)
> 2) paravirt patching might have inter-dependendent ops (ex.
> lock.queued_lock_slowpath, lock.queued_lock_unlock are paired and
> need to be updated atomically.)
And then you hope that the spinlock state transfers.. That is that both
implementations agree what an unlocked spinlock looks like.
Suppose the native one was a ticket spinlock, where unlocked means 'head
== tail' while the paravirt one is a test-and-set spinlock, where
unlocked means 'val == 0'.
That just happens to not be the case now, but it was for a fair while.
> The alternative use-case is a runtime version of apply_alternatives()
> (not posted with this patch-set) that can be used for some safe subset
> of X86_FEATUREs. This could be useful in conjunction with the ongoing
> late microcode loading work that Mihai Carabas and others have been
> working on.
The whole late-microcode loading stuff is crazy already; you're making
it take double bonghits.
> Also, there are points of similarity with the ongoing static_call work
> which does rewriting of indirect calls.
Only in so far as that code patching is involved. An analogy would be
comparing having a beer with shooting dope. They're both 'drugs'.
> The difference here is that
> we need to switch a group of calls atomically and given that
> some of them can be inlined, need to handle a wider variety of opcodes.
>
> To patch safely we need to satisfy these constraints:
>
> - No references to insn sequences under replacement on any kernel stack
> once replacement is in progress. Without this constraint we might end
> up returning to an address that is in the middle of an instruction.
Both ftrace and optprobes have that issue, neither of them are quite as
crazy as this.
> - handle inter-dependent ops: as above, lock.queued_lock_unlock(),
> lock.queued_lock_slowpath() and the rest of the pv_lock_ops are
> a good example.
While I'm sure this is a fun problem, why are we solving it?
> - handle a broader set of insns than CALL and JMP: some pv-ops end up
> getting inlined. Alternatives can contain arbitrary instructions.
So can optprobes.
> - locking operations can be called from interrupt handlers which means
> we cannot trivially use IPIs for flushing.
Heck, some NMI handlers use locks..
> Handling these, necessitates that target pv-ops not be preemptible.
I don't think that is a correct inference.
> Once that is a given (for safety these need to be explicitly whitelisted
> in runtime_patch()), use a state-machine with the primary CPU doing the
> patching and secondary CPUs in a sync_core() loop.
>
> In case we hit an INT3/BP (in NMI or thread-context) we makes forward
> progress by continuing the patching instead of emulating.
>
> One remaining issue is inter-dependent pv-ops which are also executed in
> the NMI handler -- patching can potentially deadlock in case of multiple
> NMIs. Handle these by pushing some of this work in the NMI handler where
> we know it will be uninterrupted.
I'm just seeing a lot of bonghits without sane rationale. Why is any of
this important?
On 08.04.20 07:02, Ankur Arora wrote:
> A KVM host (or another hypervisor) might advertise paravirtualized
> features and optimization hints (ex KVM_HINTS_REALTIME) which might
> become stale over the lifetime of the guest. For instance, the
Then this hint is wrong if it can't be guaranteed.
> host might go from being undersubscribed to being oversubscribed
> (or the other way round) and it would make sense for the guest
> switch pv-ops based on that.
I think using pvops for such a feature change is just wrong.
What comes next? Using pvops for being able to migrate a guest from an
Intel to an AMD machine?
...
> There are four main sets of patches in this series:
>
> 1. PV-ops management (patches 1-10, 20): mostly infrastructure and
> refactoring pieces to make paravirt patching usable at runtime. For the
> most part scoped under CONFIG_PARAVIRT_RUNTIME.
>
> Patches 1-7, to persist part of parainstructions in memory:
> "x86/paravirt: Specify subsection in PVOP macros"
> "x86/paravirt: Allow paravirt patching post-init"
> "x86/paravirt: PVRTOP macros for PARAVIRT_RUNTIME"
> "x86/alternatives: Refactor alternatives_smp_module*
> "x86/alternatives: Rename alternatives_smp*, smp_alt_module
> "x86/alternatives: Remove stale symbols
> "x86/paravirt: Persist .parainstructions.runtime"
>
> Patches 8-10, develop the interfaces to safely switch pv-ops:
> "x86/paravirt: Stash native pv-ops"
> "x86/paravirt: Add runtime_patch()"
> "x86/paravirt: Add primitives to stage pv-ops"
>
> Patch 20 enables switching of pv_lock_ops:
> "x86/paravirt: Enable pv-spinlocks in runtime_patch()"
>
> 2. Non-emulated text poking (patches 11-19)
>
> Patches 11-13 are mostly refactoring to split __text_poke() into map,
> unmap and poke/memcpy phases with the poke portion being re-entrant
> "x86/alternatives: Remove return value of text_poke*()"
> "x86/alternatives: Use __get_unlocked_pte() in text_poke()"
> "x86/alternatives: Split __text_poke()"
>
> Patches 15, 17 add the actual poking state-machine:
> "x86/alternatives: Non-emulated text poking"
> "x86/alternatives: Add patching logic in text_poke_site()"
>
> with patches 14 and 18 containing the pieces for BP handling:
> "x86/alternatives: Handle native insns in text_poke_loc*()"
> "x86/alternatives: Handle BP in non-emulated text poking"
>
> and patch 19 provides the ability to use the state-machine above in an
> NMI context (fixes some potential deadlocks when handling inter-
> dependent operations and multiple NMIs):
> "x86/alternatives: NMI safe runtime patching".
>
> Patch 16 provides the interface (paravirt_runtime_patch()) to use the
> poking mechanism developed above and patch 21 adds a selftest:
> "x86/alternatives: Add paravirt patching at runtime"
> "x86/alternatives: Paravirt runtime selftest"
>
> 3. KVM guest changes to be able to use this (patches 22-23,25-26):
> "kvm/paravirt: Encapsulate KVM pv switching logic"
> "x86/kvm: Add worker to trigger runtime patching"
> "x86/kvm: Guest support for dynamic hints"
> "x86/kvm: Add hint change notifier for KVM_HINT_REALTIME".
>
> 4. KVM host changes to notify the guest of a change (patch 24):
> "x86/kvm: Support dynamic CPUID hints"
>
> Testing:
> With paravirt patching, the code is mostly stable on Intel and AMD
> systems under kernbench and locktorture with paravirt toggling (with,
> without synthetic NMIs) in the background.
>
> Queued spinlock performance for locktorture is also on expected lines:
> [ 1533.221563] Writes: Total: 1048759000 Max/Min: 0/0 Fail: 0
> # toggle PV spinlocks
>
> [ 1594.713699] Writes: Total: 1111660545 Max/Min: 0/0 Fail: 0
> # PV spinlocks (in ~60 seconds) = 62,901,545
>
> # toggle native spinlocks
> [ 1656.117175] Writes: Total: 1113888840 Max/Min: 0/0 Fail: 0
> # native spinlocks (in ~60 seconds) = 2,228,295
>
> The alternatives testing is more limited with it being used to rewrite
> mostly harmless X86_FEATUREs with load in the background.
>
> Patches also at:
>
> ssh://[email protected]/terminus/linux.git alternatives-rfc-upstream-v1
>
> Please review.
>
> Thanks
> Ankur
>
> [1] The precise change in memory footprint depends on config options
> but the following example inlines queued_spin_unlock() (which forms
> the bulk of the added state). The added footprint is the size of the
> .parainstructions.runtime section:
>
> $ objdump -h vmlinux|grep .parainstructions
> Idx Name Size VMA
> LMA File-off Algn
> 27 .parainstructions 0001013c ffffffff82895000
> 0000000002895000 01c95000 2**3
> 28 .parainstructions.runtime 0000cd2c ffffffff828a5140
> 00000000028a5140 01ca5140 2**3
>
> $ size vmlinux
> text data bss dec hex filename
> 13726196 12302814 14094336 40123346 2643bd2 vmlinux
>
> Ankur Arora (26):
> x86/paravirt: Specify subsection in PVOP macros
> x86/paravirt: Allow paravirt patching post-init
> x86/paravirt: PVRTOP macros for PARAVIRT_RUNTIME
> x86/alternatives: Refactor alternatives_smp_module*
> x86/alternatives: Rename alternatives_smp*, smp_alt_module
> x86/alternatives: Remove stale symbols
> x86/paravirt: Persist .parainstructions.runtime
> x86/paravirt: Stash native pv-ops
> x86/paravirt: Add runtime_patch()
> x86/paravirt: Add primitives to stage pv-ops
> x86/alternatives: Remove return value of text_poke*()
> x86/alternatives: Use __get_unlocked_pte() in text_poke()
> x86/alternatives: Split __text_poke()
> x86/alternatives: Handle native insns in text_poke_loc*()
> x86/alternatives: Non-emulated text poking
> x86/alternatives: Add paravirt patching at runtime
> x86/alternatives: Add patching logic in text_poke_site()
> x86/alternatives: Handle BP in non-emulated text poking
> x86/alternatives: NMI safe runtime patching
> x86/paravirt: Enable pv-spinlocks in runtime_patch()
> x86/alternatives: Paravirt runtime selftest
> kvm/paravirt: Encapsulate KVM pv switching logic
> x86/kvm: Add worker to trigger runtime patching
> x86/kvm: Support dynamic CPUID hints
> x86/kvm: Guest support for dynamic hints
> x86/kvm: Add hint change notifier for KVM_HINT_REALTIME
>
> Documentation/virt/kvm/api.rst | 17 +
> Documentation/virt/kvm/cpuid.rst | 9 +-
> arch/x86/Kconfig | 14 +
> arch/x86/Kconfig.debug | 13 +
> arch/x86/entry/entry_64.S | 5 +
> arch/x86/include/asm/alternative.h | 20 +-
> arch/x86/include/asm/kvm_host.h | 6 +
> arch/x86/include/asm/kvm_para.h | 17 +
> arch/x86/include/asm/paravirt.h | 10 +-
> arch/x86/include/asm/paravirt_types.h | 230 ++++--
> arch/x86/include/asm/text-patching.h | 18 +-
> arch/x86/include/uapi/asm/kvm_para.h | 2 +
> arch/x86/kernel/Makefile | 1 +
> arch/x86/kernel/alternative.c | 987 +++++++++++++++++++++++---
> arch/x86/kernel/kvm.c | 191 ++++-
> arch/x86/kernel/module.c | 42 +-
> arch/x86/kernel/paravirt.c | 16 +-
> arch/x86/kernel/paravirt_patch.c | 61 ++
> arch/x86/kernel/pv_selftest.c | 264 +++++++
> arch/x86/kernel/pv_selftest.h | 15 +
> arch/x86/kernel/setup.c | 2 +
> arch/x86/kernel/vmlinux.lds.S | 16 +
> arch/x86/kvm/cpuid.c | 3 +-
> arch/x86/kvm/x86.c | 39 +
> include/asm-generic/kvm_para.h | 12 +
> include/asm-generic/vmlinux.lds.h | 8 +
> include/linux/kvm_para.h | 5 +
> include/linux/mm.h | 16 +-
> include/linux/preempt.h | 17 +
> include/uapi/linux/kvm.h | 4 +
> kernel/locking/lock_events.c | 2 +-
> mm/memory.c | 9 +-
> 32 files changed, 1850 insertions(+), 221 deletions(-)
> create mode 100644 arch/x86/kernel/pv_selftest.c
> create mode 100644 arch/x86/kernel/pv_selftest.h
>
Quite a lot of code churn and hacks for a problem which should not
occur on a well administrated machine.
Especially the NMI dependencies make me not wanting to Ack this series.
Juergen
On 08.04.20 14:08, Peter Zijlstra wrote:
> On Tue, Apr 07, 2020 at 10:02:57PM -0700, Ankur Arora wrote:
>> Mechanism: the patching itself is done using stop_machine(). That is
>> not ideal -- text_poke_stop_machine() was replaced with INT3+emulation
>> via text_poke_bp(), but I'm using this to address two issues:
>> 1) emulation in text_poke() can only easily handle a small set
>> of instructions and this is problematic for inlined pv-ops (and see
>> a possible alternatives use-case below.)
>> 2) paravirt patching might have inter-dependendent ops (ex.
>> lock.queued_lock_slowpath, lock.queued_lock_unlock are paired and
>> need to be updated atomically.)
>
> And then you hope that the spinlock state transfers.. That is that both
> implementations agree what an unlocked spinlock looks like.
>
> Suppose the native one was a ticket spinlock, where unlocked means 'head
> == tail' while the paravirt one is a test-and-set spinlock, where
> unlocked means 'val == 0'.
>
> That just happens to not be the case now, but it was for a fair while.
Sure? This would mean that before the spinlock pv-ops are set, no lock
is allowed to be used in the kernel, because that would block the
boot-time transition to the lock variant to be used.
Another problem I'm seeing is that runtime pvops patching would rely on
the fact that stop_machine() isn't guarded by a spinlock.
Juergen
Ankur Arora <[email protected]> writes:
> A KVM host (or another hypervisor) might advertise paravirtualized
> features and optimization hints (ex KVM_HINTS_REALTIME) which might
> become stale over the lifetime of the guest. For instance, the
> host might go from being undersubscribed to being oversubscribed
> (or the other way round) and it would make sense for the guest
> switch pv-ops based on that.
If your host changes his advertised behaviour then you want to fix the
host setup or find a competent admin.
> This lockorture splat that I saw on the guest while testing this is
> indicative of the problem:
>
> [ 1136.461522] watchdog: BUG: soft lockup - CPU#8 stuck for 22s! [lock_torture_wr:12865]
> [ 1136.461542] CPU: 8 PID: 12865 Comm: lock_torture_wr Tainted: G W L 5.4.0-rc7+ #77
> [ 1136.461546] RIP: 0010:native_queued_spin_lock_slowpath+0x15/0x220
>
> (Caused by an oversubscribed host but using mismatched native pv_lock_ops
> on the gues.)
And this illustrates what? The fact that you used a misconfigured setup.
> This series addresses the problem by doing paravirt switching at
> runtime.
You're not addressing the problem. You're fixing the symptom, which is
wrong to begin with.
> The alternative use-case is a runtime version of apply_alternatives()
> (not posted with this patch-set) that can be used for some safe subset
> of X86_FEATUREs. This could be useful in conjunction with the ongoing
> late microcode loading work that Mihai Carabas and others have been
> working on.
This has been discussed to death before and there is no safe subset as
long as this hasn't been resolved:
https://lore.kernel.org/lkml/[email protected]/
Thanks,
tglx
On Wed, Apr 08, 2020 at 03:33:52PM +0200, Jürgen Groß wrote:
> On 08.04.20 14:08, Peter Zijlstra wrote:
> > On Tue, Apr 07, 2020 at 10:02:57PM -0700, Ankur Arora wrote:
> > > Mechanism: the patching itself is done using stop_machine(). That is
> > > not ideal -- text_poke_stop_machine() was replaced with INT3+emulation
> > > via text_poke_bp(), but I'm using this to address two issues:
> > > 1) emulation in text_poke() can only easily handle a small set
> > > of instructions and this is problematic for inlined pv-ops (and see
> > > a possible alternatives use-case below.)
> > > 2) paravirt patching might have inter-dependendent ops (ex.
> > > lock.queued_lock_slowpath, lock.queued_lock_unlock are paired and
> > > need to be updated atomically.)
> >
> > And then you hope that the spinlock state transfers.. That is that both
> > implementations agree what an unlocked spinlock looks like.
> >
> > Suppose the native one was a ticket spinlock, where unlocked means 'head
> > == tail' while the paravirt one is a test-and-set spinlock, where
> > unlocked means 'val == 0'.
> >
> > That just happens to not be the case now, but it was for a fair while.
>
> Sure? This would mean that before spinlock-pvops are being set no lock
> is allowed to be used in the kernel, because this would block the boot
> time transition of the lock variant to use.
Hurm.. true. I suppose I completely forgot how paravirt spinlocks looked
before it got rewritten.
> Another problem I'm seeing is that runtime pvops patching would rely on
> the fact that stop_machine() isn't guarded by a spinlock.
It can't be, stop_machine() relies on scheduling. But yes, that another
variation of 'stuff uses spinlocks'.
So, first thanks for the quick comments even though some of my choices
were straight NAKs (or maybe because of that!)
Second, I clearly did a bad job of motivating the series. Let me try
to address the motivation comments first and then I can address the
technical concerns separately.
[ I'm collating all the motivation comments below. ]
>> A KVM host (or another hypervisor) might advertise paravirtualized
>> features and optimization hints (ex KVM_HINTS_REALTIME) which might
>> become stale over the lifetime of the guest. For instance, the
Thomas> If your host changes his advertised behaviour then you want to
Thomas> fix the host setup or find a competent admin.
Juergen> Then this hint is wrong if it can't be guaranteed.
I agree, the hint behaviour is wrong and the host shouldn't be giving
hints it can only temporarily honor.
The host problem is hard to fix though: the behaviour change is either
because of a guest migration or, in the case of a hosted guest, cloud
economics -- customers want to go to a 2:1 or worse vCPU:CPU ratio at
times of low load.
I had an offline discussion with Paolo Bonzini where he agreed that
it makes sense to make KVM_HINTS_REALTIME a dynamic hint rather than
static as it is now. (That was really the starting point for this
series.)
>> host might go from being undersubscribed to being oversubscribed
>> (or the other way round) and it would make sense for the guest
>> switch pv-ops based on that.
Juergen> I think using pvops for such a feature change is just wrong.
Juergen> What comes next? Using pvops for being able to migrate a guest
Juergen> from an Intel to an AMD machine?
My statement about switching pv-ops was too broadly worded. What
I meant to say was that KVM guests choose pv_lock_ops to be native
or paravirt based on undersubscribed/oversubscribed hint at boot,
and this choice should be available at run-time as well.
KVM chooses between native/paravirt spinlocks at boot based on this
reasoning (from commit b2798ba0b8):
"Waiman Long mentioned that:
> Generally speaking, unfair lock performs well for VMs with a small
> number of vCPUs. Native qspinlock may perform better than pvqspinlock
> if there is vCPU pinning and there is no vCPU over-commitment.
"
PeterZ> So what, the paravirt spinlock stuff works just fine when
PeterZ> you're not oversubscribed.
Yeah, the paravirt spinlocks work fine for both under and oversubscribed
hosts, but they are more expensive and that extra cost provides no benefits
when CPUs are pinned.
For instance, pvqueued spin_unlock() is a call+locked cmpxchg as opposed
to just a movb $0, (%rdi).
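A userspace model of those two unlock fast paths (C11 atomics stand in
for the kernel primitives; the struct, values and kick function below are
illustrative, not the kernel's, and the real PV slowpath also clears the
lock):

	#include <stdatomic.h>
	#include <stdio.h>

	struct qspinlock { _Atomic unsigned char locked; };

	#define _Q_LOCKED_VAL	1
	#define _Q_SLOW_VAL	3	/* a waiter halted and wants a kick */

	/* Stand-in for the hypercall that kicks a halted waiter. */
	static void pv_kick_waiter(struct qspinlock *lock)
	{
		printf("kick waiter of lock %p\n", (void *)lock);
	}

	/* Native: one release store -- the "movb $0, (%rdi)". */
	static void native_queued_spin_unlock(struct qspinlock *lock)
	{
		atomic_store_explicit(&lock->locked, 0, memory_order_release);
	}

	/* Paravirt: locked cmpxchg; kick the waiter if it halted. */
	static void pv_queued_spin_unlock(struct qspinlock *lock)
	{
		unsigned char old = _Q_LOCKED_VAL;

		if (atomic_compare_exchange_strong_explicit(&lock->locked,
							    &old, 0,
							    memory_order_release,
							    memory_order_relaxed))
			return;

		pv_kick_waiter(lock);
	}

	int main(void)
	{
		struct qspinlock a = { _Q_LOCKED_VAL };
		struct qspinlock b = { _Q_SLOW_VAL };

		native_queued_spin_unlock(&a);	/* plain store */
		pv_queued_spin_unlock(&b);	/* cmpxchg fails -> kick */
		return 0;
	}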
This difference shows up in kernbench running on a KVM guest with native
and paravirt spinlocks. I ran with 8 and 64 CPU guests with CPUs pinned.
The native version performs same or better.
 8 CPU           Native (std-dev)     Paravirt (std-dev)
                 -----------------    ------------------
 -j   4: sys     151.89  ( 0.2462)    160.14  ( 4.8366)    +5.4%
 -j  32: sys     162.715 (11.4129)    170.225 (11.1138)    +4.6%
 -j   0: sys     164.193 ( 9.4063)    170.843 ( 8.9651)    +4.0%

 64 CPU          Native (std-dev)     Paravirt (std-dev)
                 -----------------    ------------------
 -j  32: sys     209.448 ( 0.37009)   210.976 ( 0.4245)    +0.7%
 -j 256: sys     267.401 (61.0928)    285.73  (78.8021)    +6.8%
 -j   0: sys     286.313 (56.5978)    307.721 (70.9758)    +7.4%
In all cases the pv_kick, pv_wait numbers were minimal as expected.
The lock_slowpath counts were higher with PV but AFAICS the native
and paravirt lock_slowpath are not directly comparable.
Detailed kernbench numbers attached.
Thanks
Ankur
On 2020-04-08 5:08 a.m., Peter Zijlstra wrote:
> On Tue, Apr 07, 2020 at 10:02:57PM -0700, Ankur Arora wrote:
>> A KVM host (or another hypervisor) might advertise paravirtualized
>> features and optimization hints (ex KVM_HINTS_REALTIME) which might
>> become stale over the lifetime of the guest. For instance, the
>> host might go from being undersubscribed to being oversubscribed
>> (or the other way round) and it would make sense for the guest
>> switch pv-ops based on that.
>
> So what, the paravirt spinlock stuff works just fine when you're not
> oversubscribed.
>
>> We keep an interesting subset of pv-ops (pv_lock_ops only for now,
>> but PV-TLB ops are also good candidates)
>
> The PV-TLB ops also work just fine when not oversubscribed. IIRC
> kvm_flush_tlb_others() is pretty much the same in that case.
>
>> in .parainstructions.runtime,
>> while discarding the .parainstructions as usual at init. This is then
>> used for switching back and forth between native and paravirt mode.
>> ([1] lists some representative numbers of the increased memory
>> footprint.)
>>
>> Mechanism: the patching itself is done using stop_machine(). That is
>> not ideal -- text_poke_stop_machine() was replaced with INT3+emulation
>> via text_poke_bp(), but I'm using this to address two issues:
>> 1) emulation in text_poke() can only easily handle a small set
>> of instructions and this is problematic for inlined pv-ops (and see
>> a possible alternatives use-case below.)
>> 2) paravirt patching might have inter-dependendent ops (ex.
>> lock.queued_lock_slowpath, lock.queued_lock_unlock are paired and
>> need to be updated atomically.)
>
> And then you hope that the spinlock state transfers.. That is that both
> implementations agree what an unlocked spinlock looks like.
>
> Suppose the native one was a ticket spinlock, where unlocked means 'head
> == tail' while the paravirt one is a test-and-set spinlock, where
> unlocked means 'val == 0'.
>
> That just happens to not be the case now, but it was for a fair while.
>
>> The alternative use-case is a runtime version of apply_alternatives()
>> (not posted with this patch-set) that can be used for some safe subset
>> of X86_FEATUREs. This could be useful in conjunction with the ongoing
>> late microcode loading work that Mihai Carabas and others have been
>> working on.
>
> The whole late-microcode loading stuff is crazy already; you're making
> it take double bonghits.
That's fair. I was talking in a fairly limited sense, e.g. making
static_cpu_has() catch up with boot_cpu_has() after a microcode update,
but I should have specified that.
>
>> Also, there are points of similarity with the ongoing static_call work
>> which does rewriting of indirect calls.
>
> Only in so far as that code patching is involved. An analogy would be
> comparing having a beer with shooting dope. They're both 'drugs'.
I meant closer to updating indirect pointers, like static_call_update()
semantics. But of course I don't know static_call code well enough.
>
>> The difference here is that
>> we need to switch a group of calls atomically and given that
>> some of them can be inlined, need to handle a wider variety of opcodes.
>>
>> To patch safely we need to satisfy these constraints:
>>
>> - No references to insn sequences under replacement on any kernel stack
>> once replacement is in progress. Without this constraint we might end
>> up returning to an address that is in the middle of an instruction.
>
> Both ftrace and optprobes have that issue, neither of them are quite as
> crazy as this.
I did look at ftrace. Will look at optprobes. Thanks.
>
>> - handle inter-dependent ops: as above, lock.queued_lock_unlock(),
>> lock.queued_lock_slowpath() and the rest of the pv_lock_ops are
>> a good example.
>
> While I'm sure this is a fun problem, why are we solving it?
>
>> - handle a broader set of insns than CALL and JMP: some pv-ops end up
>> getting inlined. Alternatives can contain arbitrary instructions.
>
> So can optprobes.>
>> - locking operations can be called from interrupt handlers which means
>> we cannot trivially use IPIs for flushing.
>
> Heck, some NMI handlers use locks..
This does handle the NMI locking problem. The solution -- doing it
in the NMI handler -- was of course pretty ugly.
>> Handling these, necessitates that target pv-ops not be preemptible.
>
> I don't think that is a correct inference.
The non-preemptibility requirement was to ensure that any pv-op under
replacement not be under execution after it is patched out.
(Not a concern for pv_lock_ops.)
Ensuring that we don't return to an address in the middle of an instruction
could be done by moving the NOPs in the prefix, but I couldn't think of
any other way to ensure that a function not be under execution.
Thanks
Ankur
>> Once that is a given (for safety these need to be explicitly whitelisted
>> in runtime_patch()), use a state-machine with the primary CPU doing the
>> patching and secondary CPUs in a sync_core() loop.
>>
>> In case we hit an INT3/BP (in NMI or thread-context) we makes forward
>> progress by continuing the patching instead of emulating.
>>
>> One remaining issue is inter-dependent pv-ops which are also executed in
>> the NMI handler -- patching can potentially deadlock in case of multiple
>> NMIs. Handle these by pushing some of this work in the NMI handler where
>> we know it will be uninterrupted.
>
> I'm just seeing a lot of bonghits without sane rationale. Why is any of
> this important?
>
On 2020-04-08 5:28 a.m., Jürgen Groß wrote:
> On 08.04.20 07:02, Ankur Arora wrote:
[ snip ]
>
> Quite a lot of code churn and hacks for a problem which should not
> occur on a well administrated machine.
Yeah, I agree the patch set is pretty large and clearly the NMI
handling and the stop_machine() approach are completely out. That said,
as I wrote in my other mail, I think the problem is still worth solving.
> Especially the NMI dependencies make me not wanting to Ack this series.
The NMI solution did turn out to be pretty ugly.
I was using it to solve two problems: to avoid a deadlock where an NMI
handler could take a lock while the stop_machine() thread is trying to
rewrite the corresponding call-sites, and to ensure that we don't lock
and unlock using mismatched primitives.
Thanks
Ankur
>
>
> Juergen
On 2020-04-08 7:12 a.m., Thomas Gleixner wrote:
> Ankur Arora <[email protected]> writes:
>> A KVM host (or another hypervisor) might advertise paravirtualized
>> features and optimization hints (ex KVM_HINTS_REALTIME) which might
>> become stale over the lifetime of the guest. For instance, the
>> host might go from being undersubscribed to being oversubscribed
>> (or the other way round) and it would make sense for the guest
>> switch pv-ops based on that.
>
> If your host changes his advertised behaviour then you want to fix the
> host setup or find a competent admin.
>
>> This lockorture splat that I saw on the guest while testing this is
>> indicative of the problem:
>>
>> [ 1136.461522] watchdog: BUG: soft lockup - CPU#8 stuck for 22s! [lock_torture_wr:12865]
>> [ 1136.461542] CPU: 8 PID: 12865 Comm: lock_torture_wr Tainted: G W L 5.4.0-rc7+ #77
>> [ 1136.461546] RIP: 0010:native_queued_spin_lock_slowpath+0x15/0x220
>>
>> (Caused by an oversubscribed host but using mismatched native pv_lock_ops
>> on the gues.)
>
> And this illustrates what? The fact that you used a misconfigured setup.
>
>> This series addresses the problem by doing paravirt switching at
>> runtime.
>
> You're not addressing the problem. Your fixing the symptom, which is
> wrong to begin with.
>
>> The alternative use-case is a runtime version of apply_alternatives()
>> (not posted with this patch-set) that can be used for some safe subset
>> of X86_FEATUREs. This could be useful in conjunction with the ongoing
>> late microcode loading work that Mihai Carabas and others have been
>> working on.
>
> This has been discussed to death before and there is no safe subset as
> long as this hasn't been resolved:
>
> https://lore.kernel.org/lkml/[email protected]/
Thanks. I was thinking of a fairly limited subset: e.g. re-evaluating
X86_FEATURE_ALWAYS to make sure static_cpu_has() reflects reality, but
I guess that has second-order effects here.
Ankur
>
> Thanks,
>
> tglx
>