From: Joao Martins
Date: 2019-02-20 20:23:58
Subject: [PATCH RFC 00/39] x86/KVM: Xen HVM guest support

Hey,

Presented herewith is a series that allows KVM to boot Xen x86 HVM guests (with
their respective frontends and backends). On the hypervisor side, the approach
is to keep the implementation similar to how Hyper-V emulation is done on x86
KVM. On the backend driver side, the intent is to reuse the existing Xen support.

Note that this is an RFC so there are bugs and room for improvement. The intent
is to get overall feedback before proceeding further.

Running Xen guests on KVM enables the following use cases:

* Run unmodified Xen HVM images;

* Facilitate development/testing of Xen guests and Xen PV drivers;

There has been a similar proposal in the past with Xenner; this work
has the following advantages over it:

* Allows use of existing Xen PV drivers such as block, net etc, both as
frontends and backends;

* Xen tooling will see the same UABI as on Xen. This means things like
xenstored, xenstore-list, xenstore-read run unmodified. Optionally, the
userspace VMM can emulate xenstore operations;

This work is divided into two parts:

1. Xen HVM ABI support (patches 1 - 16)

Support the necessary mechanisms to allow HVM guests to
boot *without* PV frontends exposed to the guest.

We start by intercepting hypercalls made by the guest, followed
by event channel IPIs/VIRQs (for PV IPI, timers, spinlocks),
pvclock and steal clock.
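
For context, the interception starting point is the hypercall page
MSR write that every Xen HVM guest performs at boot; roughly, on the
guest side (a sketch, not code from this series):

    /* Sketch: Xen HVM guests discover the hypercall MSR via the
     * Xen CPUID leaves and point it at a page of their choosing;
     * the wrmsr then traps into KVM's xen_hvm_config(). */
    cpuid(xen_cpuid_base() + 2, &pages, &msr, &ecx, &edx);
    wrmsr(msr, __pa(hypercall_page));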

Ankur Arora (1):
KVM: x86/xen: support upcall vector

Boris Ostrovsky (1):
KVM: x86/xen: handle PV spinlocks slowpath

Joao Martins (14):
KVM: x86: fix Xen hypercall page msr handling
KVM: x86/xen: intercept xen hypercalls if enabled
KVM: x86/xen: register shared_info page
KVM: x86/xen: setup pvclock updates
KVM: x86/xen: update wallclock region
KVM: x86/xen: register vcpu info
KVM: x86/xen: register vcpu time info region
KVM: x86/xen: register steal clock
KVM: x86: declare Xen HVM guest capability
KVM: x86/xen: evtchn signaling via eventfd
KVM: x86/xen: store virq when assigning evtchn
KVM: x86/xen: handle PV timers oneshot mode
KVM: x86/xen: handle PV IPI vcpu yield
KVM: x86: declare Xen HVM evtchn offload capability

Documentation/virtual/kvm/api.txt | 23 ++
arch/x86/include/asm/kvm_host.h | 46 ++++
arch/x86/kvm/Makefile | 2 +-
arch/x86/kvm/irq.c | 25 ++-
arch/x86/kvm/irq_comm.c | 11 +
arch/x86/kvm/trace.h | 33 +++
arch/x86/kvm/x86.c | 60 +++++-
arch/x86/kvm/x86.h | 1 +
arch/x86/kvm/xen.c | 1025 +++++++++++++++++++++++++++++
arch/x86/kvm/xen.h | 48 +++++
include/linux/kvm_host.h | 24 +++
include/uapi/linux/kvm.h | 73 ++++++-
12 files changed, 1361 insertions(+), 10 deletions(-)


2. PV Driver support (patches 17 - 39)

We start by redirecting hypercalls from the backend to routines
which emulate the behaviour that PV backends expect, i.e. grant
table and interdomain events. Next, we add support for late
initialization of xenbus, followed by implementing
frontend/backend communication mechanisms (i.e. grant tables and
interdomain event channels). Finally, we introduce xen-shim.ko,
which sets up a limited Xen environment. This uses the added
functionality of Xen-specific shared memory (grant tables) and
notifications (event channels).

Note that patch 19 is useful to Xen on its own, independently of this series.

Ankur Arora (11):
x86/xen: export vcpu_info and shared_info
x86/xen: make hypercall_page generic
KVM: x86/xen: backend hypercall support
KVM: x86/xen: grant map support
KVM: x86/xen: grant unmap support
KVM: x86/xen: interdomain evtchn support
KVM: x86/xen: evtchn unmask support
KVM: x86/xen: add additional evtchn ops
xen-shim: introduce shim domain driver
xen/gntdev: xen_shim_domain() support
xen/xenbus: xen_shim_domain() support

Joao Martins (12):
xen/xenbus: xenbus uninit support
xen-blkback: module_exit support
KVM: x86/xen: domid allocation
KVM: x86/xen: grant table init
KVM: x86/xen: grant table grow support
KVM: x86/xen: grant copy support
xen/balloon: xen_shim_domain() support
xen/grant-table: xen_shim_domain() support
drivers/xen: xen_shim_domain() support
xen-netback: xen_shim_domain() support
xen-blkback: xen_shim_domain() support
KVM: x86: declare Xen HVM Dom0 capability

[See the entire series diffstat at the end]

There are additional QEMU patches to take advantage of this (they
are available here[0][1]). An example of how you could run it:

$ ./x86_64-softmmu/qemu-system-x86_64 -nodefaults -serial mon:stdio \
-machine xenfv,accel=kvm \
-cpu host,-kvm,+xen,xen-major-version=4,xen-minor-version=4,+xen-vapic,+xen-pvclock \
-kernel /path/to/kernel -m 16G -smp 16,sockets=1,cores=16,threads=1,maxcpus=16 \
-append "earlyprintk=ttyS0 console=tty0 console=ttyS0" \
-device xen-platform -device usb-ehci -device usb-tablet,bus=usb-bus.0 -vnc :0 -k pt \
-netdev type=tap,id=net1,ifname=vif1.0,script=qemu-ifup \
-device xen-nic,netdev=net1,mac=52:54:00:12:34:56,backendtype=vif,backend=0 \
-drive file=/path/to/image.img,format=raw,id=legacy,if=ide \
-blockdev file,node-name=drive,filename=/path/to/image.img,locking=off \
-device xen-disk,vdev=xvda,drive=drive,backendtype=vbd

Naturally, other options are still at your disposal (e.g. booting with the
q35 platform, with virtio, etc.).

Thoughts?

Cheers,
Joao

[0] https://www.github.com/jpemartins/qemu xen-shim-rfc
[1] https://www.github.com/jpemartins/linux xen-shim-rfc

Documentation/virtual/kvm/api.txt | 33 +
arch/x86/include/asm/kvm_host.h | 88 ++
arch/x86/include/asm/xen/hypercall.h | 12 +-
arch/x86/kvm/Kconfig | 10 +
arch/x86/kvm/Makefile | 3 +-
arch/x86/kvm/irq.c | 25 +-
arch/x86/kvm/irq_comm.c | 11 +
arch/x86/kvm/trace.h | 33 +
arch/x86/kvm/x86.c | 66 +-
arch/x86/kvm/x86.h | 1 +
arch/x86/kvm/xen-asm.S | 66 +
arch/x86/kvm/xen-shim.c | 138 ++
arch/x86/kvm/xen.c | 2262 ++++++++++++++++++++++++++++++
arch/x86/kvm/xen.h | 55 +
arch/x86/xen/enlighten.c | 49 +
arch/x86/xen/enlighten_hvm.c | 3 +-
arch/x86/xen/enlighten_pv.c | 1 +
arch/x86/xen/enlighten_pvh.c | 3 +-
arch/x86/xen/xen-asm_32.S | 2 +-
arch/x86/xen/xen-asm_64.S | 2 +-
arch/x86/xen/xen-head.S | 8 +-
drivers/block/xen-blkback/blkback.c | 27 +-
drivers/block/xen-blkback/common.h | 2 +
drivers/block/xen-blkback/xenbus.c | 19 +-
drivers/net/xen-netback/netback.c | 25 +-
drivers/xen/balloon.c | 15 +-
drivers/xen/events/events_2l.c | 4 +-
drivers/xen/events/events_base.c | 6 +-
drivers/xen/events/events_fifo.c | 2 +-
drivers/xen/evtchn.c | 4 +-
drivers/xen/features.c | 1 +
drivers/xen/gntdev.c | 10 +-
drivers/xen/grant-table.c | 15 +-
drivers/xen/privcmd.c | 5 +-
drivers/xen/xenbus/xenbus.h | 2 +
drivers/xen/xenbus/xenbus_client.c | 28 +-
drivers/xen/xenbus/xenbus_dev_backend.c | 4 +-
drivers/xen/xenbus/xenbus_dev_frontend.c | 6 +-
drivers/xen/xenbus/xenbus_probe.c | 57 +-
drivers/xen/xenbus/xenbus_xs.c | 40 +-
drivers/xen/xenfs/super.c | 7 +-
drivers/xen/xenfs/xensyms.c | 4 +
include/linux/kvm_host.h | 24 +
include/uapi/linux/kvm.h | 104 +-
include/xen/balloon.h | 7 +
include/xen/xen-ops.h | 2 +-
include/xen/xen.h | 5 +
include/xen/xenbus.h | 3 +
48 files changed, 3232 insertions(+), 67 deletions(-)
create mode 100644 arch/x86/kvm/xen-asm.S
create mode 100644 arch/x86/kvm/xen-shim.c
create mode 100644 arch/x86/kvm/xen.c
create mode 100644 arch/x86/kvm/xen.h

--
2.11.0



From: Joao Martins
Date: 2019-02-20 20:18:06
Subject: [PATCH RFC 02/39] KVM: x86/xen: intercept xen hypercalls if enabled

Add a new exit reason for the emulator to handle Xen hypercalls.
These exits are raised only if the guest has initialized the Xen
hypercall page. Using the hypercall page is merely a convenience,
but one that pretty much all guests rely on; hence if the guest
sets up the hypercall page, we assume a Xen guest is being set up.

Without this exit being handled, the emulator panics with:

KVM: unknown exit reason 28
RAX=0000000000000011 RBX=ffffffff81e03e94 RCX=0000000040000000
RDX=0000000000000000
RSI=ffffffff81e03e70 RDI=0000000000000006 RBP=ffffffff81e03e90
RSP=ffffffff81e03e68
R8 =73726576206e6558 R9 =ffffffff81e03e90 R10=ffffffff81e03e94
R11=2e362e34206e6f69
R12=0000000040000004 R13=ffffffff81e03e8c R14=ffffffff81e03e88
R15=0000000000000000
RIP=ffffffff81001228 RFL=00000082 [--S----] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =0000 0000000000000000 ffffffff 00c00000
CS =0010 0000000000000000 ffffffff 00a09b00 DPL=0 CS64 [-RA]
SS =0000 0000000000000000 ffffffff 00c00000
DS =0000 0000000000000000 ffffffff 00c00000
FS =0000 0000000000000000 ffffffff 00c00000
GS =0000 ffffffff81f34000 ffffffff 00c00000
LDT=0000 0000000000000000 ffffffff 00c00000
TR =0020 0000000000000000 00000fff 00808b00 DPL=0 TSS64-busy
GDT= ffffffff81f3c000 0000007f
IDT= ffffffff83265000 00000fff
CR0=80050033 CR2=ffff880001fa6ff8 CR3=0000000001fa6000 CR4=000406a0
DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000
DR3=0000000000000000
DR6=00000000ffff0ff0 DR7=0000000000000400
EFER=0000000000000d01
Code=cc cc cc cc cc cc cc cc cc cc cc cc b8 11 00 00 00 0f 01 c1 <c3> cc
cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc b8 12
00 00 00 0f
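
With this exit handled, the VMM can service the hypercall instead of
aborting. A minimal sketch of such a handler (hypothetical userspace
code; emulate_xen_hcall() stands in for whatever hypercall emulation
the VMM provides):

    #include <linux/kvm.h>
    #include <sys/ioctl.h>

    /* Hypothetical helper implemented elsewhere in the VMM. */
    extern __u64 emulate_xen_hcall(__u64 input, __u64 *params);

    static int handle_vcpu_run(int vcpu_fd, struct kvm_run *run)
    {
            if (ioctl(vcpu_fd, KVM_RUN, 0) < 0)
                    return -1;

            if (run->exit_reason == KVM_EXIT_XEN &&
                run->xen.type == KVM_EXIT_XEN_HCALL) {
                    /* KVM copies .result into the guest's RAX on
                     * the next KVM_RUN via complete_userspace_io. */
                    run->xen.u.hcall.result =
                            emulate_xen_hcall(run->xen.u.hcall.input,
                                              run->xen.u.hcall.params);
            }
            return 0;
    }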

Signed-off-by: Joao Martins <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 13 +++++++
arch/x86/kvm/Makefile | 2 +-
arch/x86/kvm/trace.h | 33 +++++++++++++++++
arch/x86/kvm/x86.c | 12 +++++++
arch/x86/kvm/xen.c | 79 +++++++++++++++++++++++++++++++++++++++++
arch/x86/kvm/xen.h | 10 ++++++
include/uapi/linux/kvm.h | 17 ++++++++-
7 files changed, 164 insertions(+), 2 deletions(-)
create mode 100644 arch/x86/kvm/xen.c
create mode 100644 arch/x86/kvm/xen.h

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 9417febf8490..0f469ce439c0 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -79,6 +79,7 @@
#define KVM_REQ_HV_STIMER KVM_ARCH_REQ(22)
#define KVM_REQ_LOAD_EOI_EXITMAP KVM_ARCH_REQ(23)
#define KVM_REQ_GET_VMCS12_PAGES KVM_ARCH_REQ(24)
+#define KVM_REQ_XEN_EXIT KVM_ARCH_REQ(25)

#define CR0_RESERVED_BITS \
(~(unsigned long)(X86_CR0_PE | X86_CR0_MP | X86_CR0_EM | X86_CR0_TS \
@@ -533,6 +534,11 @@ struct kvm_vcpu_hv {
cpumask_t tlb_flush;
};

+/* Xen per vcpu emulation context */
+struct kvm_vcpu_xen {
+ struct kvm_xen_exit exit;
+};
+
struct kvm_vcpu_arch {
/*
* rip and regs accesses must go through
@@ -720,6 +726,7 @@ struct kvm_vcpu_arch {
unsigned long singlestep_rip;

struct kvm_vcpu_hv hyperv;
+ struct kvm_vcpu_xen xen;

cpumask_var_t wbinvd_dirty_mask;

@@ -833,6 +840,11 @@ struct kvm_hv {
atomic_t num_mismatched_vp_indexes;
};

+/* Xen emulation context */
+struct kvm_xen {
+ u64 xen_hypercall;
+};
+
enum kvm_irqchip_mode {
KVM_IRQCHIP_NONE,
KVM_IRQCHIP_KERNEL, /* created with KVM_CREATE_IRQCHIP */
@@ -899,6 +911,7 @@ struct kvm_arch {
struct hlist_head mask_notifier_list;

struct kvm_hv hyperv;
+ struct kvm_xen xen;

#ifdef CONFIG_KVM_MMU_AUDIT
int audit_point;
diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
index 31ecf7a76d5a..2b46c93c9380 100644
--- a/arch/x86/kvm/Makefile
+++ b/arch/x86/kvm/Makefile
@@ -10,7 +10,7 @@ kvm-$(CONFIG_KVM_ASYNC_PF) += $(KVM)/async_pf.o

kvm-y += x86.o mmu.o emulate.o i8259.o irq.o lapic.o \
i8254.o ioapic.o irq_comm.o cpuid.o pmu.o mtrr.o \
- hyperv.o page_track.o debugfs.o
+ hyperv.o xen.o page_track.o debugfs.o

kvm-intel-y += vmx/vmx.o vmx/vmenter.o vmx/pmu_intel.o vmx/vmcs12.o vmx/evmcs.o vmx/nested.o
kvm-amd-y += svm.o pmu_amd.o
diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h
index 6432d08c7de7..4fe9fd86292f 100644
--- a/arch/x86/kvm/trace.h
+++ b/arch/x86/kvm/trace.h
@@ -91,6 +91,39 @@ TRACE_EVENT(kvm_hv_hypercall,
);

/*
+ * Tracepoint for Xen hypercall.
+ */
+TRACE_EVENT(kvm_xen_hypercall,
+ TP_PROTO(unsigned long nr, unsigned long a0, unsigned long a1,
+ unsigned long a2, unsigned long a3, unsigned long a4),
+ TP_ARGS(nr, a0, a1, a2, a3, a4),
+
+ TP_STRUCT__entry(
+ __field(unsigned long, nr)
+ __field(unsigned long, a0)
+ __field(unsigned long, a1)
+ __field(unsigned long, a2)
+ __field(unsigned long, a3)
+ __field(unsigned long, a4)
+ ),
+
+ TP_fast_assign(
+ __entry->nr = nr;
+ __entry->a0 = a0;
+ __entry->a1 = a1;
+ __entry->a2 = a2;
+ __entry->a3 = a3;
+ __entry->a4 = a4;
+ ),
+
+ TP_printk("nr 0x%lx a0 0x%lx a1 0x%lx a2 0x%lx a3 0x%lx a4 0x%lx",
+ __entry->nr, __entry->a0, __entry->a1, __entry->a2,
+ __entry->a3, __entry->a4)
+);
+
+
+
+/*
* Tracepoint for PIO.
*/

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 47360a4b0d42..be8def385e3f 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -29,6 +29,7 @@
#include "cpuid.h"
#include "pmu.h"
#include "hyperv.h"
+#include "xen.h"

#include <linux/clocksource.h>
#include <linux/interrupt.h>
@@ -2338,6 +2339,8 @@ static int xen_hvm_config(struct kvm_vcpu *vcpu, u64 data)
}
if (kvm_vcpu_write_guest(vcpu, page_addr, page, PAGE_SIZE))
goto out_free;
+
+ kvm_xen_hypercall_set(kvm);
r = 0;
out_free:
kfree(page);
@@ -7076,6 +7079,9 @@ int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
if (kvm_hv_hypercall_enabled(vcpu->kvm))
return kvm_hv_hypercall(vcpu);

+ if (kvm_xen_hypercall_enabled(vcpu->kvm))
+ return kvm_xen_hypercall(vcpu);
+
nr = kvm_register_read(vcpu, VCPU_REGS_RAX);
a0 = kvm_register_read(vcpu, VCPU_REGS_RBX);
a1 = kvm_register_read(vcpu, VCPU_REGS_RCX);
@@ -7736,6 +7742,12 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
r = 0;
goto out;
}
+ if (kvm_check_request(KVM_REQ_XEN_EXIT, vcpu)) {
+ vcpu->run->exit_reason = KVM_EXIT_XEN;
+ vcpu->run->xen = vcpu->arch.xen.exit;
+ r = 0;
+ goto out;
+ }

/*
* KVM_REQ_HV_STIMER has to be processed after
diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
new file mode 100644
index 000000000000..76f0e4b812d2
--- /dev/null
+++ b/arch/x86/kvm/xen.c
@@ -0,0 +1,79 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (c) 2019 Oracle and/or its affiliates. All rights reserved.
+ *
+ * KVM Xen emulation
+ */
+
+#include "x86.h"
+#include "xen.h"
+
+#include <linux/kvm_host.h>
+
+#include <trace/events/kvm.h>
+
+#include "trace.h"
+
+bool kvm_xen_hypercall_enabled(struct kvm *kvm)
+{
+ return READ_ONCE(kvm->arch.xen.xen_hypercall);
+}
+
+bool kvm_xen_hypercall_set(struct kvm *kvm)
+{
+ return WRITE_ONCE(kvm->arch.xen.xen_hypercall, 1);
+}
+
+static void kvm_xen_hypercall_set_result(struct kvm_vcpu *vcpu, u64 result)
+{
+ kvm_register_write(vcpu, VCPU_REGS_RAX, result);
+}
+
+static int kvm_xen_hypercall_complete_userspace(struct kvm_vcpu *vcpu)
+{
+ struct kvm_run *run = vcpu->run;
+
+ kvm_xen_hypercall_set_result(vcpu, run->xen.u.hcall.result);
+ return kvm_skip_emulated_instruction(vcpu);
+}
+
+int kvm_xen_hypercall(struct kvm_vcpu *vcpu)
+{
+ bool longmode;
+ u64 input, params[5];
+
+ input = (u64)kvm_register_read(vcpu, VCPU_REGS_RAX);
+
+ longmode = is_64_bit_mode(vcpu);
+ if (!longmode) {
+ params[0] = (u64)kvm_register_read(vcpu, VCPU_REGS_RBX);
+ params[1] = (u64)kvm_register_read(vcpu, VCPU_REGS_RCX);
+ params[2] = (u64)kvm_register_read(vcpu, VCPU_REGS_RDX);
+ params[3] = (u64)kvm_register_read(vcpu, VCPU_REGS_RSI);
+ params[4] = (u64)kvm_register_read(vcpu, VCPU_REGS_RDI);
+ }
+#ifdef CONFIG_X86_64
+ else {
+ params[0] = (u64)kvm_register_read(vcpu, VCPU_REGS_RDI);
+ params[1] = (u64)kvm_register_read(vcpu, VCPU_REGS_RSI);
+ params[2] = (u64)kvm_register_read(vcpu, VCPU_REGS_RDX);
+ params[3] = (u64)kvm_register_read(vcpu, VCPU_REGS_R10);
+ params[4] = (u64)kvm_register_read(vcpu, VCPU_REGS_R8);
+ }
+#endif
+ trace_kvm_xen_hypercall(input, params[0], params[1], params[2],
+ params[3], params[4]);
+
+ vcpu->run->exit_reason = KVM_EXIT_XEN;
+ vcpu->run->xen.type = KVM_EXIT_XEN_HCALL;
+ vcpu->run->xen.u.hcall.input = input;
+ vcpu->run->xen.u.hcall.params[0] = params[0];
+ vcpu->run->xen.u.hcall.params[1] = params[1];
+ vcpu->run->xen.u.hcall.params[2] = params[2];
+ vcpu->run->xen.u.hcall.params[3] = params[3];
+ vcpu->run->xen.u.hcall.params[4] = params[4];
+ vcpu->arch.complete_userspace_io =
+ kvm_xen_hypercall_complete_userspace;
+
+ return 0;
+}
diff --git a/arch/x86/kvm/xen.h b/arch/x86/kvm/xen.h
new file mode 100644
index 000000000000..a2ae079c3ef3
--- /dev/null
+++ b/arch/x86/kvm/xen.h
@@ -0,0 +1,10 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright (c) 2019 Oracle and/or its affiliates. All rights reserved. */
+#ifndef __ARCH_X86_KVM_XEN_H__
+#define __ARCH_X86_KVM_XEN_H__
+
+bool kvm_xen_hypercall_enabled(struct kvm *kvm);
+bool kvm_xen_hypercall_set(struct kvm *kvm);
+int kvm_xen_hypercall(struct kvm_vcpu *vcpu);
+
+#endif
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 6d4ea4b6c922..d07520c216a1 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -204,6 +204,18 @@ struct kvm_hyperv_exit {
} u;
};

+struct kvm_xen_exit {
+#define KVM_EXIT_XEN_HCALL 1
+ __u32 type;
+ union {
+ struct {
+ __u64 input;
+ __u64 result;
+ __u64 params[5];
+ } hcall;
+ } u;
+};
+
#define KVM_S390_GET_SKEYS_NONE 1
#define KVM_S390_SKEYS_MAX 1048576

@@ -235,6 +247,7 @@ struct kvm_hyperv_exit {
#define KVM_EXIT_S390_STSI 25
#define KVM_EXIT_IOAPIC_EOI 26
#define KVM_EXIT_HYPERV 27
+#define KVM_EXIT_XEN 28

/* For KVM_EXIT_INTERNAL_ERROR */
/* Emulate instruction failed. */
@@ -392,8 +405,10 @@ struct kvm_run {
} eoi;
/* KVM_EXIT_HYPERV */
struct kvm_hyperv_exit hyperv;
+ /* KVM_EXIT_XEN */
+ struct kvm_xen_exit xen;
/* Fix the size of the union. */
- char padding[256];
+ char padding[196];
};

/* 2048 is the size of the char array used to bound/pad the size
--
2.11.0


From: Joao Martins
Date: 2019-02-20 20:18:08
Subject: [PATCH RFC 04/39] KVM: x86/xen: setup pvclock updates

When the shared_info page GPA is set, request a masterclock update.
This triggers all vcpus to update their respective pvclock data
shared with the guest. We follow an approach similar to the existing
Hyper-V and KVM pvclock handling, and adjust it accordingly.

Note however that Xen differs a little in how pvclock pages are set
up. Specifically, KVM assumes 4K page alignment with the pvclock data
starting at the beginning of the page, whereas on Xen that information
can be placed anywhere in the page.
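
For reference, the guest consumes this region with the usual pvclock
version protocol, so the update sequence above must pair with a
reader along these lines (a sketch, assuming the standard
pvclock_vcpu_time_info layout):

    /* Sketch: seqlock-style reader; the version must be even and
     * unchanged across the copy for the snapshot to be valid. */
    static void read_pvclock(struct pvclock_vcpu_time_info *ti,
                             struct pvclock_vcpu_time_info *dst)
    {
            u32 version;

            do {
                    version = READ_ONCE(ti->version);
                    smp_rmb();
                    *dst = *ti;
                    smp_rmb();
            } while ((version & 1) ||
                     version != READ_ONCE(ti->version));
    }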

Signed-off-by: Joao Martins <[email protected]>
---
arch/x86/kvm/x86.c | 2 ++
arch/x86/kvm/xen.c | 47 +++++++++++++++++++++++++++++++++++++++++++++++
arch/x86/kvm/xen.h | 1 +
3 files changed, 50 insertions(+)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 1eda96304180..6eb2afaa2af2 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2211,6 +2211,8 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)

if (vcpu->pv_time_enabled)
kvm_setup_pvclock_page(v);
+ if (ka->xen.shinfo)
+ kvm_xen_setup_pvclock_page(v);
if (v == kvm_get_vcpu(v->kvm, 0))
kvm_hv_setup_tsc_page(v->kvm, &vcpu->hv_clock);
return 0;
diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
index 4df223bd3cd7..b4bd1949656e 100644
--- a/arch/x86/kvm/xen.c
+++ b/arch/x86/kvm/xen.c
@@ -29,9 +29,56 @@ static int kvm_xen_shared_info_init(struct kvm *kvm, gfn_t gfn)
shared_info = page_to_virt(page);
memset(shared_info, 0, sizeof(struct shared_info));
kvm->arch.xen.shinfo = shared_info;
+
+ kvm_make_all_cpus_request(kvm, KVM_REQ_MASTERCLOCK_UPDATE);
return 0;
}

+void kvm_xen_setup_pvclock_page(struct kvm_vcpu *v)
+{
+ struct kvm_vcpu_arch *vcpu = &v->arch;
+ struct pvclock_vcpu_time_info *guest_hv_clock;
+ unsigned int offset;
+
+ if (v->vcpu_id >= MAX_VIRT_CPUS)
+ return;
+
+ offset = offsetof(struct vcpu_info, time);
+ offset += offsetof(struct shared_info, vcpu_info);
+ offset += v->vcpu_id * sizeof(struct vcpu_info);
+
+ guest_hv_clock = (struct pvclock_vcpu_time_info *)
+ (((void *)v->kvm->arch.xen.shinfo) + offset);
+
+ BUILD_BUG_ON(offsetof(struct pvclock_vcpu_time_info, version) != 0);
+
+ if (guest_hv_clock->version & 1)
+ ++guest_hv_clock->version; /* first time write, random junk */
+
+ vcpu->hv_clock.version = guest_hv_clock->version + 1;
+ guest_hv_clock->version = vcpu->hv_clock.version;
+
+ smp_wmb();
+
+ /* retain PVCLOCK_GUEST_STOPPED if set in guest copy */
+ vcpu->hv_clock.flags |= (guest_hv_clock->flags & PVCLOCK_GUEST_STOPPED);
+
+ if (vcpu->pvclock_set_guest_stopped_request) {
+ vcpu->hv_clock.flags |= PVCLOCK_GUEST_STOPPED;
+ vcpu->pvclock_set_guest_stopped_request = false;
+ }
+
+ trace_kvm_pvclock_update(v->vcpu_id, &vcpu->hv_clock);
+
+ *guest_hv_clock = vcpu->hv_clock;
+
+ smp_wmb();
+
+ vcpu->hv_clock.version++;
+
+ guest_hv_clock->version = vcpu->hv_clock.version;
+}
+
int kvm_xen_hvm_set_attr(struct kvm *kvm, struct kvm_xen_hvm_attr *data)
{
int r = -ENOENT;
diff --git a/arch/x86/kvm/xen.h b/arch/x86/kvm/xen.h
index bb38edf383fe..827c9390da34 100644
--- a/arch/x86/kvm/xen.h
+++ b/arch/x86/kvm/xen.h
@@ -3,6 +3,7 @@
#ifndef __ARCH_X86_KVM_XEN_H__
#define __ARCH_X86_KVM_XEN_H__

+void kvm_xen_setup_pvclock_page(struct kvm_vcpu *vcpu);
int kvm_xen_hvm_set_attr(struct kvm *kvm, struct kvm_xen_hvm_attr *data);
int kvm_xen_hvm_get_attr(struct kvm *kvm, struct kvm_xen_hvm_attr *data);
bool kvm_xen_hypercall_enabled(struct kvm *kvm);
--
2.11.0


From: Joao Martins
Date: 2019-02-20 20:18:23
Subject: [PATCH RFC 08/39] KVM: x86/xen: register steal clock

Allow the emulator to register vcpu runstate areas, which Xen guests
use for their steal clock. The 'preempted' state of the KVM steal
clock equates to Xen's 'runnable' state, 'running' has a similar
meaning for both, and 'offline' is used when the system administrator
needs to bring a vcpu offline or hotplug it.
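
A sketch of the state correspondence described above (runstate
values as in xen/interface/vcpu.h; this patch only transitions
between the first two, 'offline' being an administrative state):

    /* Sketch: KVM steal-clock notions mapped to Xen runstates. */
    static int kvm_to_xen_runstate(bool preempted, bool offline)
    {
            if (offline)
                    return RUNSTATE_offline;   /* vcpu down/hotplug */
            if (preempted)
                    return RUNSTATE_runnable;  /* KVM "preempted" */
            return RUNSTATE_running;
    }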

Signed-off-by: Joao Martins <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 2 ++
arch/x86/kvm/x86.c | 10 ++++++++
arch/x86/kvm/xen.c | 51 +++++++++++++++++++++++++++++++++++++++++
arch/x86/kvm/xen.h | 2 ++
include/uapi/linux/kvm.h | 1 +
5 files changed, 66 insertions(+)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index f39d50dd8f40..9d388ba0a05c 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -541,6 +541,8 @@ struct kvm_vcpu_xen {
struct vcpu_info *vcpu_info;
gpa_t pv_time_addr;
struct pvclock_vcpu_time_info *pv_time;
+ gpa_t steal_time_addr;
+ struct vcpu_runstate_info *steal_time;
};

struct kvm_vcpu_arch {
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 3ce97860e6ee..888598fdf543 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2389,6 +2389,11 @@ static void kvm_vcpu_flush_tlb(struct kvm_vcpu *vcpu, bool invalidate_gpa)

static void record_steal_time(struct kvm_vcpu *vcpu)
{
+ if (vcpu->arch.xen.steal_time_addr) {
+ kvm_xen_setup_runstate_page(vcpu);
+ return;
+ }
+
if (!(vcpu->arch.st.msr_val & KVM_MSR_ENABLED))
return;

@@ -3251,6 +3256,11 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)

static void kvm_steal_time_set_preempted(struct kvm_vcpu *vcpu)
{
+ if (vcpu->arch.xen.steal_time_addr) {
+ kvm_xen_runstate_set_preempted(vcpu);
+ return;
+ }
+
if (!(vcpu->arch.st.msr_val & KVM_MSR_ENABLED))
return;

diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
index 77d1153386bc..4fdc4c71245a 100644
--- a/arch/x86/kvm/xen.c
+++ b/arch/x86/kvm/xen.c
@@ -9,9 +9,11 @@
#include "xen.h"

#include <linux/kvm_host.h>
+#include <linux/sched/stat.h>

#include <trace/events/kvm.h>
#include <xen/interface/xen.h>
+#include <xen/interface/vcpu.h>

#include "trace.h"

@@ -30,6 +32,11 @@ static void set_vcpu_attr(struct kvm_vcpu *v, u16 type, gpa_t gpa, void *addr)
vcpu_xen->pv_time = addr;
kvm_xen_setup_pvclock_page(v);
break;
+ case KVM_XEN_ATTR_TYPE_VCPU_RUNSTATE:
+ vcpu_xen->steal_time_addr = gpa;
+ vcpu_xen->steal_time = addr;
+ kvm_xen_setup_runstate_page(v);
+ break;
default:
break;
}
@@ -44,6 +51,8 @@ static gpa_t get_vcpu_attr(struct kvm_vcpu *v, u16 type)
return vcpu_xen->vcpu_info_addr;
case KVM_XEN_ATTR_TYPE_VCPU_TIME_INFO:
return vcpu_xen->pv_time_addr;
+ case KVM_XEN_ATTR_TYPE_VCPU_RUNSTATE:
+ return vcpu_xen->steal_time_addr;
default:
return 0;
}
@@ -124,6 +133,41 @@ static void kvm_xen_update_vcpu_time(struct kvm_vcpu *v,
guest_hv_clock->version = vcpu->hv_clock.version;
}

+void kvm_xen_runstate_set_preempted(struct kvm_vcpu *vcpu)
+{
+ struct kvm_vcpu_xen *vcpu_xen = vcpu_to_xen_vcpu(vcpu);
+ int state = RUNSTATE_runnable;
+
+ vcpu->arch.st.steal.preempted = KVM_VCPU_PREEMPTED;
+
+ vcpu_xen->steal_time->state = state;
+}
+
+void kvm_xen_setup_runstate_page(struct kvm_vcpu *vcpu)
+{
+ struct kvm_vcpu_xen *vcpu_xen = vcpu_to_xen_vcpu(vcpu);
+ struct vcpu_runstate_info runstate;
+
+ runstate = *vcpu_xen->steal_time;
+
+ runstate.state_entry_time += 1;
+ runstate.state_entry_time |= XEN_RUNSTATE_UPDATE;
+ vcpu_xen->steal_time->state_entry_time = runstate.state_entry_time;
+ smp_wmb();
+
+ vcpu->arch.st.steal.steal += current->sched_info.run_delay -
+ vcpu->arch.st.last_steal;
+ vcpu->arch.st.last_steal = current->sched_info.run_delay;
+
+ runstate.state = RUNSTATE_running;
+ runstate.time[RUNSTATE_runnable] = vcpu->arch.st.steal.steal;
+ *vcpu_xen->steal_time = runstate;
+
+ runstate.state_entry_time &= ~XEN_RUNSTATE_UPDATE;
+ vcpu_xen->steal_time->state_entry_time = runstate.state_entry_time;
+ smp_wmb();
+}
+
void kvm_xen_setup_pvclock_page(struct kvm_vcpu *v)
{
struct kvm_vcpu_xen *vcpu_xen = vcpu_to_xen_vcpu(v);
@@ -155,6 +199,10 @@ int kvm_xen_hvm_set_attr(struct kvm *kvm, struct kvm_xen_hvm_attr *data)
r = kvm_xen_shared_info_init(kvm, gfn);
break;
}
+ case KVM_XEN_ATTR_TYPE_VCPU_RUNSTATE:
+ if (unlikely(!sched_info_on()))
+ return -ENOTSUPP;
+ /* fallthrough */
case KVM_XEN_ATTR_TYPE_VCPU_TIME_INFO:
case KVM_XEN_ATTR_TYPE_VCPU_INFO: {
gpa_t gpa = data->u.vcpu_attr.gpa;
@@ -191,6 +239,7 @@ int kvm_xen_hvm_get_attr(struct kvm *kvm, struct kvm_xen_hvm_attr *data)
data->u.shared_info.gfn = kvm->arch.xen.shinfo_addr;
break;
}
+ case KVM_XEN_ATTR_TYPE_VCPU_RUNSTATE:
case KVM_XEN_ATTR_TYPE_VCPU_TIME_INFO:
case KVM_XEN_ATTR_TYPE_VCPU_INFO: {
struct kvm_vcpu *v;
@@ -282,6 +331,8 @@ void kvm_xen_vcpu_uninit(struct kvm_vcpu *vcpu)
put_page(virt_to_page(vcpu_xen->vcpu_info));
if (vcpu_xen->pv_time)
put_page(virt_to_page(vcpu_xen->pv_time));
+ if (vcpu_xen->steal_time)
+ put_page(virt_to_page(vcpu_xen->steal_time));
}

void kvm_xen_destroy_vm(struct kvm *kvm)
diff --git a/arch/x86/kvm/xen.h b/arch/x86/kvm/xen.h
index 10ebd0b7a25e..2feef68ee80f 100644
--- a/arch/x86/kvm/xen.h
+++ b/arch/x86/kvm/xen.h
@@ -17,6 +17,8 @@ static inline struct kvm_vcpu *xen_vcpu_to_vcpu(struct kvm_vcpu_xen *xen_vcpu)
}

void kvm_xen_setup_pvclock_page(struct kvm_vcpu *vcpu);
+void kvm_xen_setup_runstate_page(struct kvm_vcpu *vcpu);
+void kvm_xen_runstate_set_preempted(struct kvm_vcpu *vcpu);
int kvm_xen_hvm_set_attr(struct kvm *kvm, struct kvm_xen_hvm_attr *data);
int kvm_xen_hvm_get_attr(struct kvm *kvm, struct kvm_xen_hvm_attr *data);
bool kvm_xen_hypercall_enabled(struct kvm *kvm);
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 8296c3a2434f..b91e57d9e6d3 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1475,6 +1475,7 @@ struct kvm_xen_hvm_attr {
#define KVM_XEN_ATTR_TYPE_SHARED_INFO 0x0
#define KVM_XEN_ATTR_TYPE_VCPU_INFO 0x1
#define KVM_XEN_ATTR_TYPE_VCPU_TIME_INFO 0x2
+#define KVM_XEN_ATTR_TYPE_VCPU_RUNSTATE 0x3

/* Secure Encrypted Virtualization command */
enum sev_cmd_id {
--
2.11.0


From: Joao Martins
Date: 2019-02-20 20:18:37
Subject: [PATCH RFC 09/39] KVM: x86: declare Xen HVM guest capability

Declare the new KVM_CAP_XEN_HVM_GUEST capability. Also take the
chance to document the rather old KVM_CAP_XEN_HVM capability, which
only means that KVM supports the Xen hypercall MSR.
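
As an illustration, userspace would probe for these before using any
of the Xen attribute ioctls (a sketch, not taken from the QEMU
patches):

    #include <linux/kvm.h>
    #include <sys/ioctl.h>

    static int xen_guest_supported(int kvm_fd)
    {
            return ioctl(kvm_fd, KVM_CHECK_EXTENSION,
                         KVM_CAP_XEN_HVM) > 0 &&
                   ioctl(kvm_fd, KVM_CHECK_EXTENSION,
                         KVM_CAP_XEN_HVM_GUEST) > 0;
    }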

Signed-off-by: Joao Martins <[email protected]>
---
Documentation/virtual/kvm/api.txt | 14 ++++++++++++++
arch/x86/kvm/x86.c | 1 +
include/uapi/linux/kvm.h | 3 +++
3 files changed, 18 insertions(+)

diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index 356156f5c52d..c3a1402b76bf 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -5016,6 +5016,7 @@ CPU when the exception is taken. If this virtual SError is taken to EL1 using
AArch64, this value will be reported in the ISS field of ESR_ELx.

See KVM_CAP_VCPU_EVENTS for more details.
+
8.20 KVM_CAP_HYPERV_SEND_IPI

Architectures: x86
@@ -5023,3 +5024,16 @@ Architectures: x86
This capability indicates that KVM supports paravirtualized Hyper-V IPI send
hypercalls:
HvCallSendSyntheticClusterIpi, HvCallSendSyntheticClusterIpiEx.
+
+8.21 KVM_CAP_XEN_HVM
+
+Architectures: x86
+
+This capability indicates that KVM supports the Xen hypercall page MSR.
+
+8.22 KVM_CAP_XEN_HVM_GUEST
+
+Architectures: x86
+
+This capability indicates that KVM supports Xen HVM guests.
+This includes KVM_IRQ_ROUTING_XEN_EVTCHN as well.
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 888598fdf543..11b9ff2bd901 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3009,6 +3009,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
case KVM_CAP_PIT_STATE2:
case KVM_CAP_SET_IDENTITY_MAP_ADDR:
case KVM_CAP_XEN_HVM:
+ case KVM_CAP_XEN_HVM_GUEST:
case KVM_CAP_VCPU_EVENTS:
case KVM_CAP_HYPERV:
case KVM_CAP_HYPERV_VAPIC:
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index b91e57d9e6d3..682ea00abd58 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1003,6 +1003,7 @@ struct kvm_ppc_resize_hpt {
#define KVM_CAP_ARM_VM_IPA_SIZE 165
#define KVM_CAP_MANUAL_DIRTY_LOG_PROTECT 166
#define KVM_CAP_HYPERV_CPUID 167
+#define KVM_CAP_XEN_HVM_GUEST 168

#ifdef KVM_CAP_IRQ_ROUTING

@@ -1455,6 +1456,7 @@ struct kvm_enc_region {
/* Available with KVM_CAP_HYPERV_CPUID */
#define KVM_GET_SUPPORTED_HV_CPUID _IOWR(KVMIO, 0xc1, struct kvm_cpuid2)

+/* Available with KVM_CAP_XEN_HVM_GUEST */
#define KVM_XEN_HVM_GET_ATTR _IOWR(KVMIO, 0xc2, struct kvm_xen_hvm_attr)
#define KVM_XEN_HVM_SET_ATTR _IOW(KVMIO, 0xc3, struct kvm_xen_hvm_attr)

@@ -1472,6 +1474,7 @@ struct kvm_xen_hvm_attr {
} u;
};

+/* Available with KVM_CAP_XEN_HVM_GUEST */
#define KVM_XEN_ATTR_TYPE_SHARED_INFO 0x0
#define KVM_XEN_ATTR_TYPE_VCPU_INFO 0x1
#define KVM_XEN_ATTR_TYPE_VCPU_TIME_INFO 0x2
--
2.11.0


From: Joao Martins
Date: 2019-02-20 20:19:13
Subject: [PATCH RFC 16/39] KVM: x86: declare Xen HVM evtchn offload capability

Add a new capability for event channel offload support in the hypervisor.

Signed-off-by: Joao Martins <[email protected]>
---
Documentation/virtual/kvm/api.txt | 9 +++++++++
arch/x86/kvm/x86.c | 1 +
include/uapi/linux/kvm.h | 3 +++
3 files changed, 13 insertions(+)

diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index c3a1402b76bf..36d9386415fa 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -5037,3 +5037,12 @@ Architectures: x86

This capability indicates that KVM supports Xen HVM guests.
This includes KVM_IRQ_ROUTING_XEN_EVTCHN as well.
+
+8.23 KVM_CAP_XEN_HVM_EVTCHN
+
+Architectures: x86
+
+This capability indicates KVM's support for the event channel offload.
+Implies support for KVM_IRQ_ROUTING_XEN_EVTCHN irq routing, and
+for attribute KVM_XEN_ATTR_TYPE_EVTCHN in KVM_XEN_HVM_GET_ATTR or
+KVM_XEN_HVM_SET_ATTR.
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index e29cefd2dc6a..b1d9045d7989 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3010,6 +3010,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
case KVM_CAP_SET_IDENTITY_MAP_ADDR:
case KVM_CAP_XEN_HVM:
case KVM_CAP_XEN_HVM_GUEST:
+ case KVM_CAP_XEN_HVM_EVTCHN:
case KVM_CAP_VCPU_EVENTS:
case KVM_CAP_HYPERV:
case KVM_CAP_HYPERV_VAPIC:
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 4eae47a0ef63..1b3ecce5f92e 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1004,6 +1004,7 @@ struct kvm_ppc_resize_hpt {
#define KVM_CAP_MANUAL_DIRTY_LOG_PROTECT 166
#define KVM_CAP_HYPERV_CPUID 167
#define KVM_CAP_XEN_HVM_GUEST 168
+#define KVM_CAP_XEN_HVM_EVTCHN 169

#ifdef KVM_CAP_IRQ_ROUTING

@@ -1046,6 +1047,7 @@ struct kvm_irq_routing_xen_evtchn {
#define KVM_IRQ_ROUTING_MSI 2
#define KVM_IRQ_ROUTING_S390_ADAPTER 3
#define KVM_IRQ_ROUTING_HV_SINT 4
+/* Available with KVM_CAP_XEN_HVM_EVTCHN */
#define KVM_IRQ_ROUTING_XEN_EVTCHN 5

struct kvm_irq_routing_entry {
@@ -1506,6 +1508,7 @@ struct kvm_xen_hvm_attr {
#define KVM_XEN_ATTR_TYPE_VCPU_INFO 0x1
#define KVM_XEN_ATTR_TYPE_VCPU_TIME_INFO 0x2
#define KVM_XEN_ATTR_TYPE_VCPU_RUNSTATE 0x3
+/* Available with KVM_CAP_XEN_HVM_EVTCHN */
#define KVM_XEN_ATTR_TYPE_EVTCHN 0x4

/* Secure Encrypted Virtualization command */
--
2.11.0


From: Joao Martins
Date: 2019-02-20 20:19:20
Subject: [PATCH RFC 19/39] xen/xenbus: xenbus uninit support

This allows reinitialization of xenbus, which is useful for
xen_shim_domain() support. Cleaning up xenbus state means cancelling
pending watch events, deleting all watches, closing the xenstore
event channel, and finally stopping the xenbus/xenwatch kthreads,
alongside unregistering /proc/xen.
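
With these in place, a later consumer such as the shim can bounce
xenbus; a sketch of such a caller (the actual call sites appear in
later patches of this series):

    /* Sketch: tear xenbus down, switch the xenstore transport,
     * and bring it back up. */
    xenbus_deinit();
    /* ... repoint xen_store_evtchn/xen_store_interface ... */
    err = xenbus_init();
    if (err)
            pr_err("xenbus reinit failed: %d\n", err);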

Signed-off-by: Joao Martins <[email protected]>
---
drivers/xen/xenbus/xenbus.h | 2 ++
drivers/xen/xenbus/xenbus_client.c | 5 ++++
drivers/xen/xenbus/xenbus_probe.c | 51 +++++++++++++++++++++++++++++++++++---
drivers/xen/xenbus/xenbus_xs.c | 38 ++++++++++++++++++++++++++++
4 files changed, 93 insertions(+), 3 deletions(-)

diff --git a/drivers/xen/xenbus/xenbus.h b/drivers/xen/xenbus/xenbus.h
index 092981171df1..e0e586d81d48 100644
--- a/drivers/xen/xenbus/xenbus.h
+++ b/drivers/xen/xenbus/xenbus.h
@@ -96,6 +96,7 @@ extern wait_queue_head_t xb_waitq;
extern struct mutex xb_write_mutex;

int xs_init(void);
+void xs_deinit(void);
int xb_init_comms(void);
void xb_deinit_comms(void);
int xs_watch_msg(struct xs_watch_event *event);
@@ -129,6 +130,7 @@ int xenbus_read_otherend_details(struct xenbus_device *xendev,
char *id_node, char *path_node);

void xenbus_ring_ops_init(void);
+void xenbus_ring_ops_deinit(void);

int xenbus_dev_request_and_reply(struct xsd_sockmsg *msg, void *par);
void xenbus_dev_queue_reply(struct xb_req_data *req);
diff --git a/drivers/xen/xenbus/xenbus_client.c b/drivers/xen/xenbus/xenbus_client.c
index e17ca8156171..ada1c9aa6525 100644
--- a/drivers/xen/xenbus/xenbus_client.c
+++ b/drivers/xen/xenbus/xenbus_client.c
@@ -935,3 +935,8 @@ void __init xenbus_ring_ops_init(void)
#endif
ring_ops = &ring_ops_hvm;
}
+
+void xenbus_ring_ops_deinit(void)
+{
+ ring_ops = NULL;
+}
diff --git a/drivers/xen/xenbus/xenbus_probe.c b/drivers/xen/xenbus/xenbus_probe.c
index 5b471889d723..2e0ed46b05e7 100644
--- a/drivers/xen/xenbus/xenbus_probe.c
+++ b/drivers/xen/xenbus/xenbus_probe.c
@@ -741,6 +741,21 @@ static int __init xenstored_local_init(void)
return err;
}

+static void xenstored_local_deinit(void)
+{
+ struct evtchn_close close;
+ void *page = NULL;
+
+ page = gfn_to_virt(xen_store_gfn);
+ free_page((unsigned long)page);
+
+ close.port = xen_store_evtchn;
+
+ HYPERVISOR_event_channel_op(EVTCHNOP_close, &close);
+
+ xen_store_evtchn = 0;
+}
+
static int xenbus_resume_cb(struct notifier_block *nb,
unsigned long action, void *data)
{
@@ -765,7 +780,11 @@ static struct notifier_block xenbus_resume_nb = {
.notifier_call = xenbus_resume_cb,
};

-static int __init xenbus_init(void)
+#ifdef CONFIG_XEN_COMPAT_XENFS
+struct proc_dir_entry *xen_procfs;
+#endif
+
+int xenbus_init(void)
{
int err = 0;
uint64_t v = 0;
@@ -833,13 +852,39 @@ static int __init xenbus_init(void)
* Create xenfs mountpoint in /proc for compatibility with
* utilities that expect to find "xenbus" under "/proc/xen".
*/
- proc_create_mount_point("xen");
+ xen_procfs = proc_create_mount_point("xen");
#endif

out_error:
return err;
}
-
+EXPORT_SYMBOL_GPL(xenbus_init);
postcore_initcall(xenbus_init);

+void xenbus_deinit(void)
+{
+ if (!xen_domain())
+ return;
+
+#ifdef CONFIG_XEN_COMPAT_XENFS
+ proc_remove(xen_procfs);
+ xen_procfs = NULL;
+#endif
+
+ xs_deinit();
+ xenstored_ready = 0;
+
+ switch (xen_store_domain_type) {
+ case XS_LOCAL:
+ xenstored_local_deinit();
+ xen_store_interface = NULL;
+ break;
+ default:
+ pr_warn("Xenstore state unknown\n");
+ break;
+ }
+ xenbus_ring_ops_deinit();
+}
+EXPORT_SYMBOL_GPL(xenbus_deinit);
+
MODULE_LICENSE("GPL");
diff --git a/drivers/xen/xenbus/xenbus_xs.c b/drivers/xen/xenbus/xenbus_xs.c
index 49a3874ae6bb..bd6db3703972 100644
--- a/drivers/xen/xenbus/xenbus_xs.c
+++ b/drivers/xen/xenbus/xenbus_xs.c
@@ -866,6 +866,7 @@ static int xenwatch_thread(void *unused)

for (;;) {
wait_event_interruptible(watch_events_waitq,
+ kthread_should_stop() ||
!list_empty(&watch_events));

if (kthread_should_stop())
@@ -917,6 +918,8 @@ static struct notifier_block xs_reboot_nb = {
.notifier_call = xs_reboot_notify,
};

+static struct task_struct *xenwatch_task;
+
int xs_init(void)
{
int err;
@@ -932,9 +935,44 @@ int xs_init(void)
task = kthread_run(xenwatch_thread, NULL, "xenwatch");
if (IS_ERR(task))
return PTR_ERR(task);
+ xenwatch_task = task;

/* shutdown watches for kexec boot */
xs_reset_watches();

return 0;
}
+
+void cancel_watches(void)
+{
+ struct xs_watch_event *event, *tmp;
+
+ /* Cancel pending watch events. */
+ spin_lock(&watch_events_lock);
+ list_for_each_entry_safe(event, tmp, &watch_events, list) {
+ list_del(&event->list);
+ kfree(event);
+ }
+ spin_unlock(&watch_events_lock);
+}
+
+void delete_watches(void)
+{
+ struct xenbus_watch *watch, *tmp;
+
+ spin_lock(&watches_lock);
+ list_for_each_entry_safe(watch, tmp, &watches, list) {
+ list_del(&watch->list);
+ }
+ spin_unlock(&watches_lock);
+}
+
+void xs_deinit(void)
+{
+ kthread_stop(xenwatch_task);
+ xenwatch_task = NULL;
+ xb_deinit_comms();
+ unregister_reboot_notifier(&xs_reboot_nb);
+ cancel_watches();
+ delete_watches();
+}
--
2.11.0


From: Joao Martins
Date: 2019-02-20 20:19:36
Subject: [PATCH RFC 21/39] KVM: x86/xen: domid allocation

Userspace requests a free @domid to be assigned to itself by passing
a negative value, or explicitly selects one by passing the desired
id. The @domid is then used for various interdomain/unbound event
purposes.
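
For illustration, the userspace side would look along these lines (a
sketch using the attribute added below; passing -1 requests any free
domid):

    struct kvm_xen_hvm_attr ha = {
            .type = KVM_XEN_ATTR_TYPE_DOMID,
            .u.dom.domid = -1,
    };

    if (!ioctl(vm_fd, KVM_XEN_HVM_SET_ATTR, &ha) &&
        !ioctl(vm_fd, KVM_XEN_HVM_GET_ATTR, &ha))
            printf("assigned dom%d\n", ha.u.dom.domid);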

Signed-off-by: Joao Martins <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 2 ++
arch/x86/kvm/x86.c | 2 ++
arch/x86/kvm/xen.c | 70 +++++++++++++++++++++++++++++++++++++++++
arch/x86/kvm/xen.h | 2 ++
include/uapi/linux/kvm.h | 4 +++
5 files changed, 80 insertions(+)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index c629fedb2e21..384247fc433d 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -27,6 +27,7 @@
#include <linux/clocksource.h>
#include <linux/irqbypass.h>
#include <linux/hyperv.h>
+#include <xen/interface/xen.h>

#include <asm/apic.h>
#include <asm/pvclock-abi.h>
@@ -862,6 +863,7 @@ struct kvm_hv {
/* Xen emulation context */
struct kvm_xen {
u64 xen_hypercall;
+ domid_t domid;

gfn_t shinfo_addr;
struct shared_info *shinfo;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index b1d9045d7989..cb95f7f8bed9 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -6986,6 +6986,7 @@ int kvm_arch_init(void *opaque)
if (hypervisor_is_type(X86_HYPER_MS_HYPERV))
set_hv_tscchange_cb(kvm_hyperv_tsc_notifier);
#endif
+ kvm_xen_init();

return 0;

@@ -6999,6 +7000,7 @@ int kvm_arch_init(void *opaque)

void kvm_arch_exit(void)
{
+ kvm_xen_exit();
#ifdef CONFIG_X86_64
if (hypervisor_is_type(X86_HYPER_MS_HYPERV))
clear_hv_tscchange_cb();
diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
index 07066402737d..e570c9b26563 100644
--- a/arch/x86/kvm/xen.c
+++ b/arch/x86/kvm/xen.c
@@ -36,6 +36,48 @@ struct evtchnfd {
static int kvm_xen_evtchn_send(struct kvm_vcpu *vcpu, int port);
static void *xen_vcpu_info(struct kvm_vcpu *v);

+#define XEN_DOMID_MIN 1
+#define XEN_DOMID_MAX (DOMID_FIRST_RESERVED - 1)
+
+static rwlock_t domid_lock;
+static struct idr domid_to_kvm;
+
+static int kvm_xen_domid_init(struct kvm *kvm, bool any, domid_t domid)
+{
+ u16 min = XEN_DOMID_MIN, max = XEN_DOMID_MAX;
+ struct kvm_xen *xen = &kvm->arch.xen;
+ int ret;
+
+ if (!any) {
+ min = domid;
+ max = domid + 1;
+ }
+
+ write_lock_bh(&domid_lock);
+ ret = idr_alloc(&domid_to_kvm, kvm, min, max, GFP_ATOMIC);
+ write_unlock_bh(&domid_lock);
+
+ if (ret < 0)
+ return ret;
+
+ xen->domid = ret;
+ return 0;
+}
+
+int kvm_xen_free_domid(struct kvm *kvm)
+{
+ struct kvm_xen *xen = &kvm->arch.xen;
+ struct kvm *vm;
+
+ write_lock_bh(&domid_lock);
+ vm = idr_remove(&domid_to_kvm, xen->domid);
+ write_unlock_bh(&domid_lock);
+
+ synchronize_srcu(&kvm->srcu);
+
+ return vm == kvm;
+}
+
int kvm_xen_has_interrupt(struct kvm_vcpu *vcpu)
{
struct kvm_vcpu_xen *vcpu_xen = vcpu_to_xen_vcpu(vcpu);
@@ -460,6 +502,17 @@ int kvm_xen_hvm_set_attr(struct kvm *kvm, struct kvm_xen_hvm_attr *data)
r = kvm_vm_ioctl_xen_eventfd(kvm, &xevfd);
break;
}
+ case KVM_XEN_ATTR_TYPE_DOMID: {
+ domid_t domid = (u16) data->u.dom.domid;
+ bool any = (data->u.dom.domid < 0);
+
+ /* Domain ID 0 or >= 0x7ff0 are reserved */
+ if (!any && (!domid || (domid >= XEN_DOMID_MAX)))
+ return -EINVAL;
+
+ r = kvm_xen_domid_init(kvm, any, domid);
+ break;
+ }
default:
break;
}
@@ -489,6 +542,11 @@ int kvm_xen_hvm_get_attr(struct kvm *kvm, struct kvm_xen_hvm_attr *data)
r = 0;
break;
}
+ case KVM_XEN_ATTR_TYPE_DOMID: {
+ data->u.dom.domid = kvm->arch.xen.domid;
+ r = 0;
+ break;
+ }
default:
break;
}
@@ -909,6 +967,18 @@ void kvm_xen_destroy_vm(struct kvm *kvm)

if (xen->shinfo)
put_page(virt_to_page(xen->shinfo));
+
+ kvm_xen_free_domid(kvm);
+}
+
+void kvm_xen_init(void)
+{
+ idr_init(&domid_to_kvm);
+ rwlock_init(&domid_lock);
+}
+
+void kvm_xen_exit(void)
+{
}

static int kvm_xen_eventfd_update(struct kvm *kvm, struct idr *port_to_evt,
diff --git a/arch/x86/kvm/xen.h b/arch/x86/kvm/xen.h
index f82b8b5b3345..76ef2150c650 100644
--- a/arch/x86/kvm/xen.h
+++ b/arch/x86/kvm/xen.h
@@ -39,6 +39,8 @@ void kvm_xen_destroy_vm(struct kvm *kvm);
int kvm_vm_ioctl_xen_eventfd(struct kvm *kvm, struct kvm_xen_eventfd *args);
void kvm_xen_vcpu_init(struct kvm_vcpu *vcpu);
void kvm_xen_vcpu_uninit(struct kvm_vcpu *vcpu);
+void kvm_xen_init(void);
+void kvm_xen_exit(void);

void __kvm_migrate_xen_timer(struct kvm_vcpu *vcpu);
int kvm_xen_has_pending_timer(struct kvm_vcpu *vcpu);
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 1b3ecce5f92e..3212cad732dd 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1500,6 +1500,9 @@ struct kvm_xen_hvm_attr {
__u32 padding[2];
};
} evtchn;
+ struct {
+ __s32 domid;
+ } dom;
} u;
};

@@ -1510,6 +1513,7 @@ struct kvm_xen_hvm_attr {
#define KVM_XEN_ATTR_TYPE_VCPU_RUNSTATE 0x3
/* Available with KVM_CAP_XEN_HVM_EVTCHN */
#define KVM_XEN_ATTR_TYPE_EVTCHN 0x4
+#define KVM_XEN_ATTR_TYPE_DOMID 0x5

/* Secure Encrypted Virtualization command */
enum sev_cmd_id {
--
2.11.0


From: Joao Martins
Date: 2019-02-20 20:19:44
Subject: [PATCH RFC 22/39] KVM: x86/xen: grant table init

Add support for guest grant table initialization. This is mostly
scaffolding at this point: we allocate grant table state and map
it globally.

Later patches add support for seeding the grant table with reserved
entries, and set up the maptrack which will be used for grant map and
unmap operations.
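
As an illustration, userspace would initialize the table like so (a
sketch; the limits are arbitrary, and initial_page is assumed to be
a page-aligned user mapping backing frame 0):

    struct kvm_xen_hvm_attr ha = {
            .type = KVM_XEN_ATTR_TYPE_GNTTAB,
            .u.gnttab = {
                    .flags = KVM_XEN_GNTTAB_F_INIT,
                    .init = {
                            .max_frames = 32,
                            .max_maptrack_frames = 32,
                            .initial_frame = (__u64)initial_page,
                    },
            },
    };

    ret = ioctl(vm_fd, KVM_XEN_HVM_SET_ATTR, &ha);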

Signed-off-by: Joao Martins <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 19 +++++++++
arch/x86/kvm/xen.c | 88 +++++++++++++++++++++++++++++++++++++++++
arch/x86/kvm/xen.h | 1 +
include/uapi/linux/kvm.h | 13 ++++++
4 files changed, 121 insertions(+)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 384247fc433d..e0cbc0899580 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -860,6 +860,23 @@ struct kvm_hv {
atomic_t num_mismatched_vp_indexes;
};

+/* Xen grant table */
+struct kvm_grant_table {
+ u32 nr_frames;
+ u32 max_nr_frames;
+ union {
+ void **frames;
+ struct grant_entry_v1 **frames_v1;
+ };
+ gfn_t *frames_addr;
+ gpa_t initial_addr;
+ struct grant_entry_v1 *initial;
+
+ /* maptrack limits */
+ u32 max_mt_frames;
+ u32 nr_mt_frames;
+};
+
/* Xen emulation context */
struct kvm_xen {
u64 xen_hypercall;
@@ -871,6 +888,8 @@ struct kvm_xen {
struct idr port_to_evt;
unsigned long poll_mask[BITS_TO_LONGS(KVM_MAX_VCPUS)];
struct mutex xen_lock;
+
+ struct kvm_grant_table gnttab;
};

enum kvm_xen_callback_via {
diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
index e570c9b26563..b9e6e8f72d87 100644
--- a/arch/x86/kvm/xen.c
+++ b/arch/x86/kvm/xen.c
@@ -17,6 +17,7 @@
#include <xen/interface/xen.h>
#include <xen/interface/vcpu.h>
#include <xen/interface/event_channel.h>
+#include <xen/interface/grant_table.h>
#include <xen/interface/sched.h>

#include "trace.h"
@@ -35,6 +36,7 @@ struct evtchnfd {

static int kvm_xen_evtchn_send(struct kvm_vcpu *vcpu, int port);
static void *xen_vcpu_info(struct kvm_vcpu *v);
+static void kvm_xen_gnttab_free(struct kvm_xen *xen);

#define XEN_DOMID_MIN 1
#define XEN_DOMID_MAX (DOMID_FIRST_RESERVED - 1)
@@ -513,6 +515,12 @@ int kvm_xen_hvm_set_attr(struct kvm *kvm, struct kvm_xen_hvm_attr *data)
r = kvm_xen_domid_init(kvm, any, domid);
break;
}
+ case KVM_XEN_ATTR_TYPE_GNTTAB: {
+ struct kvm_xen_gnttab xevfd = data->u.gnttab;
+
+ r = kvm_vm_ioctl_xen_gnttab(kvm, &xevfd);
+ break;
+ }
default:
break;
}
@@ -969,6 +977,7 @@ void kvm_xen_destroy_vm(struct kvm *kvm)
put_page(virt_to_page(xen->shinfo));

kvm_xen_free_domid(kvm);
+ kvm_xen_gnttab_free(&kvm->arch.xen);
}

void kvm_xen_init(void)
@@ -1093,3 +1102,82 @@ int kvm_vm_ioctl_xen_eventfd(struct kvm *kvm, struct kvm_xen_eventfd *args)
return kvm_xen_eventfd_assign(kvm, &xen->port_to_evt,
&xen->xen_lock, args);
}
+
+int kvm_xen_gnttab_init(struct kvm *kvm, struct kvm_xen *xen,
+ struct kvm_xen_gnttab *op, int dom0)
+{
+ u32 max_mt_frames = op->init.max_maptrack_frames;
+ unsigned long initial = op->init.initial_frame;
+ struct kvm_grant_table *gnttab = &xen->gnttab;
+ u32 max_frames = op->init.max_frames;
+ struct page *page = NULL;
+ void *addr;
+
+ if (!dom0) {
+ if (!op->init.initial_frame ||
+ offset_in_page(op->init.initial_frame))
+ return -EINVAL;
+
+ if (get_user_pages_fast(initial, 1, 1, &page) != 1)
+ return -EFAULT;
+
+ gnttab->initial_addr = initial;
+ gnttab->initial = page_to_virt(page);
+ put_page(page);
+ }
+
+ addr = kcalloc(max_frames, sizeof(gfn_t), GFP_KERNEL);
+ if (!addr)
+ goto out;
+ xen->gnttab.frames_addr = addr;
+
+ addr = kcalloc(max_frames, sizeof(addr), GFP_KERNEL);
+ if (!addr)
+ goto out;
+
+ gnttab->frames = addr;
+ gnttab->frames[0] = xen->gnttab.initial;
+ gnttab->max_nr_frames = max_frames;
+ gnttab->max_mt_frames = max_mt_frames;
+ gnttab->nr_mt_frames = 1;
+ gnttab->nr_frames = 0;
+
+ pr_debug("kvm_xen: dom%u: grant table limits (gnttab:%d maptrack:%d)\n",
+ xen->domid, gnttab->max_nr_frames, gnttab->max_mt_frames);
+ return 0;
+
+out:
+ kfree(xen->gnttab.frames);
+ kfree(xen->gnttab.frames_addr);
+ if (page)
+ put_page(page);
+ memset(&xen->gnttab, 0, sizeof(xen->gnttab));
+ return -ENOMEM;
+}
+
+void kvm_xen_gnttab_free(struct kvm_xen *xen)
+{
+ struct kvm_grant_table *gnttab = &xen->gnttab;
+
+ kfree(gnttab->frames);
+ kfree(gnttab->frames_addr);
+}
+
+int kvm_vm_ioctl_xen_gnttab(struct kvm *kvm, struct kvm_xen_gnttab *op)
+{
+ int r = -EINVAL;
+
+ if (!op)
+ return r;
+
+ switch (op->flags) {
+ case KVM_XEN_GNTTAB_F_INIT:
+ r = kvm_xen_gnttab_init(kvm, &kvm->arch.xen, op, 0);
+ break;
+ default:
+ r = -ENOSYS;
+ break;
+ }
+
+ return r;
+}
diff --git a/arch/x86/kvm/xen.h b/arch/x86/kvm/xen.h
index 76ef2150c650..08ad4e1259df 100644
--- a/arch/x86/kvm/xen.h
+++ b/arch/x86/kvm/xen.h
@@ -37,6 +37,7 @@ int kvm_xen_setup_evtchn(struct kvm *kvm,
void kvm_xen_init_vm(struct kvm *kvm);
void kvm_xen_destroy_vm(struct kvm *kvm);
int kvm_vm_ioctl_xen_eventfd(struct kvm *kvm, struct kvm_xen_eventfd *args);
+int kvm_vm_ioctl_xen_gnttab(struct kvm *kvm, struct kvm_xen_gnttab *op);
void kvm_xen_vcpu_init(struct kvm_vcpu *vcpu);
void kvm_xen_vcpu_uninit(struct kvm_vcpu *vcpu);
void kvm_xen_init(void);
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 3212cad732dd..e4fb9bc34d61 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1503,6 +1503,18 @@ struct kvm_xen_hvm_attr {
struct {
__s32 domid;
} dom;
+ struct kvm_xen_gnttab {
+#define KVM_XEN_GNTTAB_F_INIT 0
+ __u32 flags;
+ union {
+ struct {
+ __u32 max_frames;
+ __u32 max_maptrack_frames;
+ __u64 initial_frame;
+ } init;
+ __u32 padding[4];
+ };
+ } gnttab;
} u;
};

@@ -1514,6 +1526,7 @@ struct kvm_xen_hvm_attr {
/* Available with KVM_CAP_XEN_HVM_EVTCHN */
#define KVM_XEN_ATTR_TYPE_EVTCHN 0x4
#define KVM_XEN_ATTR_TYPE_DOMID 0x5
+#define KVM_XEN_ATTR_TYPE_GNTTAB 0x6

/* Secure Encrypted Virtualization command */
enum sev_cmd_id {
--
2.11.0


From: Joao Martins
Date: 2019-02-20 20:19:48
Subject: [PATCH RFC 25/39] KVM: x86/xen: grant map support

From: Ankur Arora <[email protected]>

Introduce support for mapping grant references. The sequence of events
to map a grant is:

  rframe = read_shared_entry(guest_grant_table, grant-ref);
  rpfn = get_user_pages_remote(remote_mm, rframe);
  mark_shared_entry(guest_grant_table, grant-ref,
                    GTF_reading | GTF_writing);

To correctly handle grant unmaps for mapped grants, we save the
mapping parameters in maptrack. Also, grant map (and unmap) can be
called from non-sleeping contexts, so we call get_user_pages_remote()
in non-blocking mode and ask the caller to retry.

Also note that this code is not compliant with Xen's grant map/unmap
ABI. In particular, we do not support multiple simultaneous mappings of
a grant-reference. Later versions will support that.
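
Given the non-blocking get_user_pages_remote(), callers are expected
to retry on GNTST_eagain, roughly as follows (a sketch of the
backend-side loop, not code from this series):

    /* Sketch: retry while the remote page is faulted in
     * asynchronously. */
    do {
            rc = HYPERVISOR_grant_table_op(GNTTABOP_map_grant_ref,
                                           &map_op, 1);
            cpu_relax();
    } while (rc == 0 && map_op.status == GNTST_eagain);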

Co-developed-by: Joao Martins <[email protected]>
Signed-off-by: Ankur Arora <[email protected]>
Signed-off-by: Joao Martins <[email protected]>
---
arch/x86/kvm/xen.c | 396 +++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 396 insertions(+)

diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
index 645cd22ab4e7..3603645086a7 100644
--- a/arch/x86/kvm/xen.c
+++ b/arch/x86/kvm/xen.c
@@ -9,6 +9,7 @@
#include "xen.h"
#include "ioapic.h"

+#include <linux/mman.h>
#include <linux/kvm_host.h>
#include <linux/eventfd.h>
#include <linux/sched/stat.h>
@@ -29,9 +30,11 @@

/* Grant v1 references per 4K page */
#define GPP_V1 (PAGE_SIZE / sizeof(struct grant_entry_v1))
+#define shared_entry(gt, ref) (&((gt)[(ref) / GPP_V1][(ref) % GPP_V1]))

/* Grant mappings per 4K page */
#define MPP (PAGE_SIZE / sizeof(struct kvm_grant_map))
+#define maptrack_entry(mt, hdl) (&((mt)[(hdl) / MPP][(hdl) % MPP]))

struct evtchnfd {
struct eventfd_ctx *ctx;
@@ -81,6 +84,18 @@ static int kvm_xen_domid_init(struct kvm *kvm, bool any, domid_t domid)
return 0;
}

+static struct kvm *kvm_xen_find_vm(domid_t domid)
+{
+ unsigned long flags;
+ struct kvm *vm;
+
+ read_lock_irqsave(&domid_lock, flags);
+ vm = idr_find(&domid_to_kvm, domid);
+ read_unlock_irqrestore(&domid_lock, flags);
+
+ return vm;
+}
+
int kvm_xen_free_domid(struct kvm *kvm)
{
struct kvm_xen *xen = &kvm->arch.xen;
@@ -1153,7 +1168,20 @@ int kvm_xen_gnttab_init(struct kvm *kvm, struct kvm_xen *xen,
gnttab->frames = addr;
gnttab->frames[0] = xen->gnttab.initial;
gnttab->max_nr_frames = max_frames;
+
+ addr = kcalloc(max_mt_frames, sizeof(addr), GFP_KERNEL);
+ if (!addr)
+ goto out;
+
+ /* Needs to be aligned at 16b boundary. */
+ gnttab->handle = addr;
gnttab->max_mt_frames = max_mt_frames;
+
+ addr = (void *) get_zeroed_page(GFP_KERNEL);
+ if (!addr)
+ goto out;
+ gnttab->handle[0] = addr;
+
gnttab->nr_mt_frames = 1;
gnttab->nr_frames = 0;

@@ -1162,6 +1190,7 @@ int kvm_xen_gnttab_init(struct kvm *kvm, struct kvm_xen *xen,
return 0;

out:
+ kfree(xen->gnttab.handle);
kfree(xen->gnttab.frames);
kfree(xen->gnttab.frames_addr);
if (page)
@@ -1170,11 +1199,38 @@ int kvm_xen_gnttab_init(struct kvm *kvm, struct kvm_xen *xen,
return -ENOMEM;
}

+static void kvm_xen_maptrack_free(struct kvm_xen *xen)
+{
+ u32 max_entries = xen->gnttab.nr_mt_frames * MPP;
+ struct kvm_grant_map *map;
+ int ref, inuse = 0;
+
+ for (ref = 0; ref < max_entries; ref++) {
+ map = maptrack_entry(xen->gnttab.handle, ref);
+
+ if (test_and_clear_bit(_KVM_GNTMAP_ACTIVE,
+ (unsigned long *)&map->flags)) {
+ put_page(virt_to_page(map->gpa));
+ inuse++;
+ }
+ }
+
+ if (inuse)
+ pr_debug("kvm: dom%u teardown %u mappings\n",
+ xen->domid, inuse);
+}
+
void kvm_xen_gnttab_free(struct kvm_xen *xen)
{
struct kvm_grant_table *gnttab = &xen->gnttab;
int i;

+ if (xen->domid)
+ kvm_xen_maptrack_free(xen);
+
+ for (i = 0; i < gnttab->nr_mt_frames; i++)
+ free_page((unsigned long)gnttab->handle[i]);
+
for (i = 0; i < gnttab->nr_frames; i++)
put_page(virt_to_page(gnttab->frames[i]));

@@ -1313,6 +1369,343 @@ void kvm_xen_unregister_lcall(void)
}
EXPORT_SYMBOL_GPL(kvm_xen_unregister_lcall);

+static inline int gnttab_entries(struct kvm *kvm)
+{
+ struct kvm_grant_table *gnttab = &kvm->arch.xen.gnttab;
+ int n = max_t(unsigned int, gnttab->nr_frames, 1);
+
+ return n * ((n << PAGE_SHIFT) / sizeof(struct grant_entry_v1));
+}
+
+/*
+ * The first two members of a grant entry are updated as a combined pair.
+ * The following union allows that to happen in an endian-neutral fashion.
+ * Taken from Xen.
+ */
+union grant_combo {
+ uint32_t word;
+ struct {
+ uint16_t flags;
+ domid_t domid;
+ } shorts;
+};
+
+/* Marks a grant in use. Code largely borrowed from Xen. */
+static int set_grant_status(domid_t domid, bool readonly,
+ struct grant_entry_v1 *shah)
+{
+ int rc = GNTST_okay;
+ union grant_combo scombo, prev_scombo, new_scombo;
+ uint16_t mask = GTF_type_mask;
+
+ /*
+ * We bound the number of times we retry CMPXCHG on memory locations
+ * that we share with a guest OS. The reason is that the guest can
+ * modify that location at a higher rate than we can
+ * read-modify-CMPXCHG, so the guest could cause us to livelock. There
+ * are a few cases where it is valid for the guest to race our updates
+ * (e.g., to change the GTF_readonly flag), so we allow a few retries
+ * before failing.
+ */
+ int retries = 0;
+
+ scombo.word = *(u32 *)shah;
+
+ /*
+ * This loop attempts to set the access (reading/writing) flags
+ * in the grant table entry. It tries a cmpxchg on the field
+ * up to five times, and then fails under the assumption that
+ * the guest is misbehaving.
+ */
+ for (;;) {
+ /* If not already pinned, check the grant domid and type. */
+ if ((((scombo.shorts.flags & mask) != GTF_permit_access) ||
+ (scombo.shorts.domid != domid))) {
+ rc = GNTST_general_error;
+ pr_err("Bad flags (%x) or dom (%d); expected d%d\n",
+ scombo.shorts.flags, scombo.shorts.domid,
+ domid);
+ return rc;
+ }
+
+ new_scombo = scombo;
+ new_scombo.shorts.flags |= GTF_reading;
+
+ if (!readonly) {
+ new_scombo.shorts.flags |= GTF_writing;
+ if (unlikely(scombo.shorts.flags & GTF_readonly)) {
+ rc = GNTST_general_error;
+ pr_err("Attempt to write-pin a r/o grant entry\n");
+ return rc;
+ }
+ }
+
+ prev_scombo.word = cmpxchg((u32 *)shah,
+ scombo.word, new_scombo.word);
+ if (likely(prev_scombo.word == scombo.word))
+ break;
+
+ if (retries++ == 4) {
+ rc = GNTST_general_error;
+ pr_err("Shared grant entry is unstable\n");
+ return rc;
+ }
+
+ scombo = prev_scombo;
+ }
+
+ return rc;
+}
+
+#define MT_HANDLE_DOMID_SHIFT 17
+#define MT_HANDLE_DOMID_MASK 0x7fff
+#define MT_HANDLE_GREF_MASK 0x1ffff
+
+static u32 handle_get(domid_t domid, grant_ref_t ref)
+{
+ return (domid << MT_HANDLE_DOMID_SHIFT) | ref;
+}
+
+static u16 handle_get_domid(grant_handle_t handle)
+{
+ return (handle >> MT_HANDLE_DOMID_SHIFT) & MT_HANDLE_DOMID_MASK;
+}
+
+static grant_ref_t handle_get_grant(grant_handle_t handle)
+{
+ return handle & MT_HANDLE_GREF_MASK;
+}
+
+static int map_grant_nosleep(struct kvm *rd, u64 frame, bool readonly,
+ struct page **page, u16 *err)
+{
+ unsigned long rhva;
+ int gup_flags, non_blocking;
+ int ret;
+
+ *err = GNTST_general_error;
+
+ if (!err || !page)
+ return -EINVAL;
+
+ rhva = gfn_to_hva(rd, frame);
+ if (kvm_is_error_hva(rhva)) {
+ *err = GNTST_bad_page;
+ return -EFAULT;
+ }
+
+ gup_flags = (readonly ? 0 : FOLL_WRITE) | FOLL_NOWAIT;
+
+ /* get_user_pages will reset this were IO to be needed */
+ non_blocking = 1;
+
+ /*
+ * get_user_pages_*() family of functions can sleep if the page needs
+ * to be mapped in. However, our main consumer is the grant map
+ * hypercall and because we run in the same context as the caller
+ * (unlike a real hypercall) sleeping is not an option.
+ *
+ * This is how we avoid it:
+ * - sleeping on mmap_sem acquisition: we handle that by acquiring the
+ * read-lock before calling.
+ * If mmap_sem is contended, return with GNTST_eagain.
+ * - sync wait for pages to be swapped in: specify FOLL_NOWAIT. If IO
+ * was needed, would be returned via @non_blocking. Return
+ * GNTST_eagain if it is necessary and the user would retry.
+ * Also, in the blocking case, mmap_sem will be released
+ * asynchronously when the IO completes.
+ */
+ ret = down_read_trylock(&rd->mm->mmap_sem);
+ if (ret == 0) {
+ *err = GNTST_eagain;
+ return -EBUSY;
+ }
+
+ ret = get_user_pages_remote(rd->mm->owner, rd->mm, rhva, 1, gup_flags,
+ page, NULL, &non_blocking);
+ if (non_blocking)
+ up_read(&rd->mm->mmap_sem);
+
+ if (ret == 1) {
+ *err = GNTST_okay;
+ } else if (ret == 0) {
+ *err = GNTST_eagain;
+ ret = -EBUSY;
+ } else if (ret < 0) {
+ pr_err("gnttab: failed to get pfn for hva %lx, err %d\n",
+ rhva, ret);
+ if (ret == -EFAULT) {
+ *err = GNTST_bad_page;
+ } else if (ret == -EBUSY) {
+ WARN_ON(non_blocking);
+ *err = GNTST_eagain;
+ } else {
+ *err = GNTST_general_error;
+ }
+ }
+
+ return (ret >= 0) ? 0 : ret;
+}
+
+static int shim_hcall_gntmap(struct kvm_xen *ld,
+ struct gnttab_map_grant_ref *op)
+{
+ struct kvm_grant_map map_old, map_new, *map = NULL;
+ bool readonly = op->flags & GNTMAP_readonly;
+ struct grant_entry_v1 *shah;
+ struct page *page = NULL;
+ unsigned long host_kaddr;
+ int err = -ENOSYS;
+ struct kvm *rd;
+ kvm_pfn_t rpfn;
+ u32 frame;
+ u32 idx;
+
+ BUILD_BUG_ON(sizeof(*map) != 16);
+
+	if (unlikely(op->host_addr)) {
+ pr_err("gnttab: bad host_addr %llx in map\n", op->host_addr);
+ op->status = GNTST_bad_virt_addr;
+ return 0;
+ }
+
+ /*
+ * Make sure the guest does not try to smuggle any flags here
+ * (for instance _KVM_GNTMAP_ACTIVE.)
+ * The only allowable flag is GNTMAP_readonly.
+ */
+ if (unlikely(op->flags & ~((u16) GNTMAP_readonly))) {
+ pr_err("gnttab: bad flags %x in map\n", op->flags);
+ op->status = GNTST_bad_gntref;
+ return 0;
+ }
+
+ rd = kvm_xen_find_vm(op->dom);
+ if (unlikely(!rd)) {
+ pr_err("gnttab: could not find domain %u\n", op->dom);
+ op->status = GNTST_bad_domain;
+ return 0;
+ }
+
+ if (unlikely(op->ref >= gnttab_entries(rd))) {
+ pr_err("gnttab: bad ref %u\n", op->ref);
+ op->status = GNTST_bad_gntref;
+ return 0;
+ }
+
+ /*
+ * shah is potentially controlled by the user. We cache the frame but
+ * don't care about any changes to domid or flags since those get
+ * validated in set_grant_status() anyway.
+ *
+ * Note that if the guest changes the frame we will end up mapping the
+ * old frame.
+ */
+ shah = shared_entry(rd->arch.xen.gnttab.frames_v1, op->ref);
+ frame = READ_ONCE(shah->frame);
+
+ if (unlikely(shah->domid != ld->domid)) {
+ pr_err("gnttab: bad domain (%u != %u)\n",
+ shah->domid, ld->domid);
+ op->status = GNTST_bad_gntref;
+ goto out;
+ }
+
+ idx = handle_get(op->dom, op->ref);
+ if (handle_get_grant(idx) < op->ref ||
+ handle_get_domid(idx) < op->dom) {
+ pr_err("gnttab: out of maptrack entries (dom %u)\n", ld->domid);
+ op->status = GNTST_general_error;
+ goto out;
+ }
+
+ map = maptrack_entry(rd->arch.xen.gnttab.handle, op->ref);
+
+ /*
+ * Cache the old map value so we can do our checks on the stable
+ * version. Once the map is done, swap the mapping with the new map.
+ */
+ map_old = *map;
+ if (map_old.flags & KVM_GNTMAP_ACTIVE) {
+ pr_err("gnttab: grant ref %u dom %u in use\n",
+ op->ref, ld->domid);
+ op->status = GNTST_bad_gntref;
+ goto out;
+ }
+
+ err = map_grant_nosleep(rd, frame, readonly, &page, &op->status);
+ if (err) {
+ if (err != -EBUSY)
+ op->status = GNTST_bad_gntref;
+ goto out;
+ }
+
+ err = set_grant_status(ld->domid, readonly, shah);
+ if (err != GNTST_okay) {
+ pr_err("gnttab: pin failed\n");
+ put_page(page);
+ op->status = err;
+ goto out;
+ }
+
+ rpfn = page_to_pfn(page);
+ host_kaddr = (unsigned long) pfn_to_kaddr(rpfn);
+
+ map_new.domid = op->dom;
+ map_new.ref = op->ref;
+ map_new.flags = op->flags;
+ map_new.gpa = host_kaddr;
+
+ map_new.flags |= KVM_GNTMAP_ACTIVE;
+
+ /*
+ * Protect against a grant-map that could come in between our check for
+ * KVM_GNTMAP_ACTIVE above and assuming the ownership of the mapping.
+ *
+ * Use cmpxchg_double() so we can update mapping atomically (which
+ * luckily fits in 16b.)
+ */
+ if (cmpxchg_double(&map->gpa, &map->fields,
+ map_old.gpa, map_old.fields,
+ map_new.gpa, map_new.fields) == false) {
+ put_page(page);
+ op->status = GNTST_bad_gntref;
+ goto out;
+ }
+
+ op->dev_bus_addr = rpfn << PAGE_SHIFT;
+ op->handle = idx;
+ op->status = GNTST_okay;
+ op->host_addr = host_kaddr;
+ return 0;
+
+out:
+ /* The error code is stored in @status. */
+ return 0;
+}
+
+static int shim_hcall_gnttab(int op, void *p, int count)
+{
+ int ret = -ENOSYS;
+ int i;
+
+ switch (op) {
+ case GNTTABOP_map_grant_ref: {
+ struct gnttab_map_grant_ref *ref = p;
+
+ for (i = 0; i < count; i++)
+ shim_hcall_gntmap(xen_shim, ref + i);
+ ret = 0;
+ break;
+ }
+ default:
+		pr_info("gnttab: unsupported op %d\n", op);
+ break;
+ }
+
+ return ret;
+}
+
static int shim_hcall_version(int op, struct xen_feature_info *fi)
{
if (op != XENVER_get_features || !fi || fi->submap_idx != 0)
@@ -1330,6 +1723,9 @@ static int shim_hypercall(u64 code, u64 a0, u64 a1, u64 a2, u64 a3, u64 a4)
int ret = -ENOSYS;

switch (code) {
+ case __HYPERVISOR_grant_table_op:
+ ret = shim_hcall_gnttab((int) a0, (void *) a1, (int) a2);
+ break;
case __HYPERVISOR_xen_version:
ret = shim_hcall_version((int)a0, (void *)a1);
break;
--
2.11.0


2019-02-20 20:19:48

by Joao Martins

[permalink] [raw]
Subject: [PATCH RFC 27/39] KVM: x86/xen: grant copy support

Add support for GNTTABOP_copy. The source and the destination of a copy
can each be specified as either a grant reference or a local GFN, so
data can be copied in either direction.
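
For illustration, this is roughly what a dom0 backend issues to copy a
local buffer into a page the guest has granted it (buf, len, remote_gref
and remote_domid are placeholders); under the shim this is handled by
shim_hcall_gntcopy() below:

    struct gnttab_copy op = {
        .source.u.gmfn = virt_to_gfn(buf),  /* local page */
        .source.domid  = DOMID_SELF,
        .dest.u.ref    = remote_gref,       /* guest's grant reference */
        .dest.domid    = remote_domid,
        .len           = len,
        .flags         = GNTCOPY_dest_gref, /* dest is a gref */
    };

    HYPERVISOR_grant_table_op(GNTTABOP_copy, &op, 1);
    if (op.status != GNTST_okay)
        pr_err("copy failed: %d\n", op.status);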

Signed-off-by: Joao Martins <[email protected]>
---
arch/x86/kvm/xen.c | 151 +++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 151 insertions(+)

diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
index 8f06924e0dfa..fecc548b2f12 100644
--- a/arch/x86/kvm/xen.c
+++ b/arch/x86/kvm/xen.c
@@ -10,6 +10,7 @@
#include "ioapic.h"

#include <linux/mman.h>
+#include <linux/highmem.h>
#include <linux/kvm_host.h>
#include <linux/eventfd.h>
#include <linux/sched/stat.h>
@@ -1747,6 +1748,148 @@ static int shim_hcall_gntunmap(struct kvm_xen *xen,
return 0;
}

+static unsigned long __kvm_gfn_to_hva(struct kvm_vcpu *vcpu, gfn_t gfn)
+{
+ struct kvm_xen *xen = vcpu ? &vcpu->kvm->arch.xen : xen_shim;
+ unsigned long hva;
+
+ if (xen->domid == 0)
+ return (unsigned long) page_to_virt(pfn_to_page(gfn));
+
+ hva = gfn_to_hva(vcpu->kvm, gfn);
+ if (unlikely(kvm_is_error_hva(hva)))
+ return 0;
+
+ return hva;
+}
+
+static int __kvm_gref_to_page(struct kvm_vcpu *vcpu, grant_ref_t ref,
+ domid_t domid, struct page **page,
+ int16_t *status)
+{
+ struct kvm_xen *source = vcpu ? &vcpu->kvm->arch.xen : xen_shim;
+ struct grant_entry_v1 *shah;
+ struct grant_entry_v1 **gt;
+ struct kvm *dest;
+
+ dest = kvm_xen_find_vm(domid);
+ if (unlikely(!dest)) {
+ pr_err("gnttab: could not find domain %u\n", domid);
+ *status = GNTST_bad_domain;
+ return 0;
+ }
+
+ if (unlikely(ref >= gnttab_entries(dest))) {
+ pr_err("gnttab: bad ref %u\n", ref);
+ *status = GNTST_bad_gntref;
+ return 0;
+ }
+
+ gt = dest->arch.xen.gnttab.frames_v1;
+ shah = shared_entry(gt, ref);
+ if (unlikely(shah->domid != source->domid)) {
+ pr_err("gnttab: bad domain (%u != %u)\n",
+ shah->domid, source->domid);
+ *status = GNTST_bad_gntref;
+ return 0;
+ }
+
+ (void) map_grant_nosleep(dest, shah->frame, 0, page, status);
+
+ return 0;
+}
+
+static int shim_hcall_gntcopy(struct kvm_vcpu *vcpu,
+ struct gnttab_copy *op)
+{
+ void *saddr = NULL, *daddr = NULL;
+ struct page *spage = NULL, *dpage = NULL;
+ unsigned long hva;
+ int err = -ENOSYS;
+	gfn_t gfn;
+
+	/* Reject copies that would run past the end of either page. */
+	if (op->source.offset + op->len > PAGE_SIZE ||
+	    op->dest.offset + op->len > PAGE_SIZE) {
+		op->status = GNTST_bad_copy_arg;
+		return 0;
+	}
+
+ if (!(op->flags & GNTCOPY_source_gref) &&
+ (op->source.domid == DOMID_SELF)) {
+ gfn = op->source.u.gmfn;
+ hva = __kvm_gfn_to_hva(vcpu, gfn);
+ if (unlikely(!hva)) {
+ pr_err("gnttab: bad source gfn:%llx\n", gfn);
+ op->status = GNTST_general_error;
+ err = 0;
+ return 0;
+ }
+
+ saddr = (void *) (((unsigned long) hva) + op->source.offset);
+ } else if (op->flags & GNTCOPY_source_gref) {
+ op->status = GNTST_okay;
+ if (__kvm_gref_to_page(vcpu, op->source.u.ref,
+ op->source.domid, &spage, &op->status))
+ return -EFAULT;
+
+ if (!spage || op->status != GNTST_okay) {
+ pr_err("gnttab: failed to get page for source gref:%x\n",
+ op->source.u.ref);
+ err = 0;
+ goto out;
+ }
+
+ saddr = kmap(spage);
+ saddr = (void *) (((unsigned long) saddr) + op->source.offset);
+ }
+
+ if (!(op->flags & GNTCOPY_dest_gref) &&
+ (op->dest.domid == DOMID_SELF)) {
+ gfn = op->dest.u.gmfn;
+ hva = __kvm_gfn_to_hva(vcpu, gfn);
+ if (unlikely(!hva)) {
+ pr_err("gnttab: bad dest gfn:%llx\n", gfn);
+ op->status = GNTST_general_error;
+ err = 0;
+			goto out;
+ }
+
+ daddr = (void *) (((unsigned long) hva) + op->dest.offset);
+ } else if (op->flags & GNTCOPY_dest_gref) {
+ op->status = GNTST_okay;
+		if (__kvm_gref_to_page(vcpu, op->dest.u.ref,
+				       op->dest.domid, &dpage, &op->status)) {
+			err = -EFAULT;
+			goto out;
+		}
+
+ if (!dpage || op->status != GNTST_okay) {
+ pr_err("gnttab: failed to get page for dest gref:%x\n",
+ op->dest.u.ref);
+ err = 0;
+ goto out;
+ }
+
+ daddr = kmap(dpage);
+ daddr = (void *) (((unsigned long) daddr) + op->dest.offset);
+ }
+
+ if (unlikely(!daddr || !saddr)) {
+ op->status = GNTST_general_error;
+ err = 0;
+ goto out;
+ }
+
+ memcpy(daddr, saddr, op->len);
+
+	err = 0;
+	op->status = GNTST_okay;
+out:
+	if (spage) {
+		kunmap(spage);
+		put_page(spage);
+	}
+	if (dpage) {
+		kunmap(dpage);
+		put_page(dpage);
+	}
+ return err;
+}
+
static int shim_hcall_gnttab(int op, void *p, int count)
{
int ret = -ENOSYS;
@@ -1771,6 +1914,14 @@ static int shim_hcall_gnttab(int op, void *p, int count)
ret = 0;
break;
}
+ case GNTTABOP_copy: {
+ struct gnttab_copy *op = p;
+
+ for (i = 0; i < count; i++)
+ shim_hcall_gntcopy(NULL, op + i);
+ ret = 0;
+ break;
+ }
default:
 		pr_info("gnttab: unsupported op %d\n", op);
break;
--
2.11.0


2019-02-20 20:19:50

by Joao Martins

[permalink] [raw]
Subject: [PATCH RFC 28/39] KVM: x86/xen: interdomain evtchn support

From: Ankur Arora <[email protected]>

Implement sending events between backend and the guest. To send an
event we mark the event channel pending by setting some bits in the
shared_info and vcpu_info pages and deliver the upcall on the
destination vcpu.

To send an event to dom0, we mark the event channel pending and send
an IPI to the destination vcpu, which invokes the event channel upcall
handler; that in turn calls the ISR registered by the backend drivers.

When sending to the guest we fetch the vcpu from the guest, mark the
event channel pending and deliver the interrupt to the guest.
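
For reference, the notify side in a backend remains the standard Xen
sequence below (local_port is a placeholder for a port previously bound
with EVTCHNOP_bind_interdomain); under the shim it is intercepted by
shim_hcall_evtchn_send():

    struct evtchn_send send = { .port = local_port };

    HYPERVISOR_event_channel_op(EVTCHNOP_send, &send);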

Co-developed-by: Joao Martins <[email protected]>
Signed-off-by: Ankur Arora <[email protected]>
Signed-off-by: Joao Martins <[email protected]>
---
arch/x86/kvm/xen.c | 271 ++++++++++++++++++++++++++++++++++++++++++++---
include/uapi/linux/kvm.h | 10 +-
2 files changed, 263 insertions(+), 18 deletions(-)

diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
index fecc548b2f12..420e3ebb66bc 100644
--- a/arch/x86/kvm/xen.c
+++ b/arch/x86/kvm/xen.c
@@ -27,6 +27,10 @@
#include <xen/features.h>
#include <asm/xen/hypercall.h>

+#include <xen/xen.h>
+#include <xen/events.h>
+#include <xen/xen-ops.h>
+
#include "trace.h"

/* Grant v1 references per 4K page */
@@ -46,12 +50,18 @@ struct evtchnfd {
struct {
u8 type;
} virq;
+ struct {
+ domid_t dom;
+ struct kvm *vm;
+ u32 port;
+ } remote;
};
};

static int kvm_xen_evtchn_send(struct kvm_vcpu *vcpu, int port);
-static void *xen_vcpu_info(struct kvm_vcpu *v);
+static void *vcpu_to_xen_vcpu_info(struct kvm_vcpu *v);
static void kvm_xen_gnttab_free(struct kvm_xen *xen);
+static int kvm_xen_evtchn_send_shim(struct kvm_xen *shim, struct evtchnfd *evt);
static int shim_hypercall(u64 code, u64 a0, u64 a1, u64 a2, u64 a3, u64 a4);

#define XEN_DOMID_MIN 1
@@ -114,7 +124,7 @@ int kvm_xen_free_domid(struct kvm *kvm)
int kvm_xen_has_interrupt(struct kvm_vcpu *vcpu)
{
struct kvm_vcpu_xen *vcpu_xen = vcpu_to_xen_vcpu(vcpu);
- struct vcpu_info *vcpu_info = xen_vcpu_info(vcpu);
+ struct vcpu_info *vcpu_info = vcpu_to_xen_vcpu_info(vcpu);

if (!!atomic_read(&vcpu_xen->cb.queued) || (vcpu_info &&
test_bit(0, (unsigned long *) &vcpu_info->evtchn_upcall_pending)))
@@ -386,7 +396,7 @@ static int kvm_xen_shared_info_init(struct kvm *kvm, gfn_t gfn)
return 0;
}

-static void *xen_vcpu_info(struct kvm_vcpu *v)
+static void *vcpu_to_xen_vcpu_info(struct kvm_vcpu *v)
{
struct kvm_vcpu_xen *vcpu_xen = vcpu_to_xen_vcpu(v);
struct kvm_xen *kvm = &v->kvm->arch.xen;
@@ -478,7 +488,7 @@ void kvm_xen_setup_pvclock_page(struct kvm_vcpu *v)
{
struct kvm_vcpu_xen *vcpu_xen = vcpu_to_xen_vcpu(v);
struct pvclock_vcpu_time_info *guest_hv_clock;
- void *hva = xen_vcpu_info(v);
+ void *hva = vcpu_to_xen_vcpu_info(v);
unsigned int offset;

offset = offsetof(struct vcpu_info, time);
@@ -638,8 +648,6 @@ static int kvm_xen_evtchn_2l_set_pending(struct shared_info *shared_info,
return 1;
}

-#undef BITS_PER_EVTCHN_WORD
-
static int kvm_xen_evtchn_set_pending(struct kvm_vcpu *svcpu,
struct evtchnfd *evfd)
{
@@ -670,8 +678,44 @@ static void kvm_xen_check_poller(struct kvm_vcpu *vcpu, int port)
wake_up(&vcpu_xen->sched_waitq);
}

+static void kvm_xen_evtchn_2l_reset_port(struct shared_info *shared_info,
+ int port)
+{
+ clear_bit(port, (unsigned long *) shared_info->evtchn_pending);
+ clear_bit(port, (unsigned long *) shared_info->evtchn_mask);
+}
+
+static inline struct evtchnfd *port_to_evtchn(struct kvm *kvm, int port)
+{
+ struct kvm_xen *xen = kvm ? &kvm->arch.xen : xen_shim;
+
+ return idr_find(&xen->port_to_evt, port);
+}
+
+static struct kvm_vcpu *get_remote_vcpu(struct evtchnfd *source)
+{
+ struct kvm *rkvm = source->remote.vm;
+ int rport = source->remote.port;
+ struct evtchnfd *dest = NULL;
+ struct kvm_vcpu *vcpu = NULL;
+
+ WARN_ON(source->type <= XEN_EVTCHN_TYPE_IPI);
+
+ if (!rkvm)
+ return NULL;
+
+	/* port_to_evt is protected by vcpu->kvm->srcu */
+ dest = port_to_evtchn(rkvm, rport);
+ if (!dest)
+ return NULL;
+
+ vcpu = kvm_get_vcpu(rkvm, dest->vcpu);
+ return vcpu;
+}
+
static int kvm_xen_evtchn_send(struct kvm_vcpu *vcpu, int port)
{
+ struct kvm_vcpu *target = vcpu;
struct eventfd_ctx *eventfd;
struct evtchnfd *evtchnfd;

@@ -680,10 +724,19 @@ static int kvm_xen_evtchn_send(struct kvm_vcpu *vcpu, int port)
if (!evtchnfd)
return -ENOENT;

+ if (evtchnfd->type == XEN_EVTCHN_TYPE_INTERDOM ||
+ evtchnfd->type == XEN_EVTCHN_TYPE_UNBOUND) {
+ target = get_remote_vcpu(evtchnfd);
+ port = evtchnfd->remote.port;
+
+ if (!target && !evtchnfd->remote.dom)
+ return kvm_xen_evtchn_send_shim(xen_shim, evtchnfd);
+ }
+
eventfd = evtchnfd->ctx;
- if (!kvm_xen_evtchn_set_pending(vcpu, evtchnfd)) {
+ if (!kvm_xen_evtchn_set_pending(target, evtchnfd)) {
if (!eventfd)
- kvm_xen_evtchnfd_upcall(vcpu, evtchnfd);
+ kvm_xen_evtchnfd_upcall(target, evtchnfd);
else
eventfd_signal(eventfd, 1);
}
@@ -894,6 +947,67 @@ static int kvm_xen_hcall_sched_op(struct kvm_vcpu *vcpu, int cmd, u64 param)
return ret;
}

+static void kvm_xen_call_function_deliver(void *_)
+{
+ xen_hvm_evtchn_do_upcall();
+}
+
+static inline int kvm_xen_evtchn_call_function(struct evtchnfd *event)
+{
+ int ret;
+
+ if (!irqs_disabled())
+ return smp_call_function_single(event->vcpu,
+ kvm_xen_call_function_deliver,
+ NULL, 0);
+
+ local_irq_enable();
+ ret = smp_call_function_single(event->vcpu,
+ kvm_xen_call_function_deliver, NULL, 0);
+ local_irq_disable();
+
+ return ret;
+}
+
+static int kvm_xen_evtchn_send_shim(struct kvm_xen *dom0, struct evtchnfd *e)
+{
+ struct shared_info *s = HYPERVISOR_shared_info;
+ struct evtchnfd *remote;
+
+ remote = idr_find(&dom0->port_to_evt, e->remote.port);
+ if (!remote)
+ return -ENOENT;
+
+	kvm_xen_evtchn_2l_set_pending(s, per_cpu(xen_vcpu, remote->vcpu),
+				      remote->port);
+ return kvm_xen_evtchn_call_function(remote);
+}
+
+static int __kvm_xen_evtchn_send_guest(struct kvm_vcpu *vcpu, int port)
+{
+ struct evtchnfd *evtchnfd;
+ struct eventfd_ctx *eventfd;
+
+	/* port_to_evt is protected by vcpu->kvm->srcu */
+ evtchnfd = idr_find(&vcpu->kvm->arch.xen.port_to_evt, port);
+ if (!evtchnfd)
+ return -ENOENT;
+
+ eventfd = evtchnfd->ctx;
+ if (!kvm_xen_evtchn_set_pending(vcpu, evtchnfd))
+ kvm_xen_evtchnfd_upcall(vcpu, evtchnfd);
+
+ kvm_xen_check_poller(kvm_get_vcpu(vcpu->kvm, evtchnfd->vcpu), port);
+ return 0;
+}
+
+static int kvm_xen_evtchn_send_guest(struct evtchnfd *evt, int port)
+{
+ return __kvm_xen_evtchn_send_guest(get_remote_vcpu(evt), port);
+}
+
int kvm_xen_hypercall(struct kvm_vcpu *vcpu)
{
bool longmode;
@@ -1045,13 +1159,15 @@ static int kvm_xen_eventfd_update(struct kvm *kvm, struct idr *port_to_evt,
return 0;
}

-static int kvm_xen_eventfd_assign(struct kvm *kvm, struct idr *port_to_evt,
- struct mutex *port_lock,
- struct kvm_xen_eventfd *args)
+int kvm_xen_eventfd_assign(struct kvm *kvm, struct idr *port_to_evt,
+ struct mutex *port_lock,
+ struct kvm_xen_eventfd *args)
{
+ struct evtchnfd *evtchnfd, *unbound = NULL;
struct eventfd_ctx *eventfd = NULL;
- struct evtchnfd *evtchnfd;
+ struct kvm *remote_vm = NULL;
u32 port = args->port;
+ u32 endport = 0;
int ret;

if (args->fd != -1) {
@@ -1064,25 +1180,56 @@ static int kvm_xen_eventfd_assign(struct kvm *kvm, struct idr *port_to_evt,
args->virq.type >= KVM_XEN_NR_VIRQS)
return -EINVAL;

+ if (args->remote.domid == DOMID_SELF)
+ remote_vm = kvm;
+ else if (args->remote.domid == xen_shim->domid)
+ remote_vm = NULL;
+ else if ((args->type == XEN_EVTCHN_TYPE_INTERDOM ||
+ args->type == XEN_EVTCHN_TYPE_UNBOUND)) {
+ remote_vm = kvm_xen_find_vm(args->remote.domid);
+ if (!remote_vm)
+ return -ENOENT;
+ }
+
+ if (args->type == XEN_EVTCHN_TYPE_INTERDOM) {
+ unbound = port_to_evtchn(remote_vm, args->remote.port);
+ if (!unbound)
+ return -ENOENT;
+ }
+
evtchnfd = kzalloc(sizeof(struct evtchnfd), GFP_KERNEL);
if (!evtchnfd)
return -ENOMEM;

evtchnfd->ctx = eventfd;
- evtchnfd->port = port;
evtchnfd->vcpu = args->vcpu;
evtchnfd->type = args->type;
+
if (evtchnfd->type == XEN_EVTCHN_TYPE_VIRQ)
evtchnfd->virq.type = args->virq.type;
+ else if ((evtchnfd->type == XEN_EVTCHN_TYPE_UNBOUND) ||
+ (evtchnfd->type == XEN_EVTCHN_TYPE_INTERDOM)) {
+ evtchnfd->remote.dom = args->remote.domid;
+ evtchnfd->remote.vm = remote_vm;
+ evtchnfd->remote.port = args->remote.port;
+ }
+
+ if (port == 0)
+ port = 1; /* evtchns in range (0..INT_MAX] */
+ else
+ endport = port + 1;

mutex_lock(port_lock);
- ret = idr_alloc(port_to_evt, evtchnfd, port, port + 1,
+ ret = idr_alloc(port_to_evt, evtchnfd, port, endport,
GFP_KERNEL);
mutex_unlock(port_lock);

if (ret >= 0) {
- if (evtchnfd->type == XEN_EVTCHN_TYPE_VIRQ)
+ evtchnfd->port = args->port = ret;
+ if (kvm && evtchnfd->type == XEN_EVTCHN_TYPE_VIRQ)
kvm_xen_set_virq(kvm, evtchnfd);
+ else if (evtchnfd->type == XEN_EVTCHN_TYPE_INTERDOM)
+ unbound->remote.port = ret;
return 0;
}

@@ -1107,8 +1254,14 @@ static int kvm_xen_eventfd_deassign(struct kvm *kvm, struct idr *port_to_evt,
if (!evtchnfd)
return -ENOENT;

- if (kvm)
+ if (!kvm) {
+ struct shared_info *shinfo = HYPERVISOR_shared_info;
+
+ kvm_xen_evtchn_2l_reset_port(shinfo, port);
+ } else {
synchronize_srcu(&kvm->srcu);
+ }
+
if (evtchnfd->ctx)
eventfd_ctx_put(evtchnfd->ctx);
kfree(evtchnfd);
@@ -1930,6 +2083,89 @@ static int shim_hcall_gnttab(int op, void *p, int count)
return ret;
}

+static int shim_hcall_evtchn_send(struct kvm_xen *dom0, struct evtchn_send *snd)
+{
+ struct evtchnfd *event;
+
+ event = idr_find(&dom0->port_to_evt, snd->port);
+ if (!event)
+ return -ENOENT;
+
+ if (event->remote.vm == NULL)
+ return kvm_xen_evtchn_send_shim(xen_shim, event);
+ else if (event->type == XEN_EVTCHN_TYPE_INTERDOM ||
+ event->type == XEN_EVTCHN_TYPE_UNBOUND)
+ return kvm_xen_evtchn_send_guest(event, event->remote.port);
+ else
+ return -EINVAL;
+}
+
+static int shim_hcall_evtchn(int op, void *p)
+{
+ int ret;
+ struct kvm_xen_eventfd evt;
+
+ if (p == NULL)
+ return -EINVAL;
+
+ memset(&evt, 0, sizeof(evt));
+
+ switch (op) {
+ case EVTCHNOP_bind_interdomain: {
+ struct evtchn_bind_interdomain *un;
+
+ un = (struct evtchn_bind_interdomain *) p;
+
+ evt.fd = -1;
+ evt.port = 0;
+ if (un->remote_port == 0) {
+ evt.type = XEN_EVTCHN_TYPE_UNBOUND;
+ evt.remote.domid = un->remote_dom;
+ } else {
+ evt.type = XEN_EVTCHN_TYPE_INTERDOM;
+ evt.remote.domid = un->remote_dom;
+ evt.remote.port = un->remote_port;
+ }
+
+ ret = kvm_xen_eventfd_assign(NULL, &xen_shim->port_to_evt,
+ &xen_shim->xen_lock, &evt);
+ un->local_port = evt.port;
+ break;
+ }
+ case EVTCHNOP_alloc_unbound: {
+ struct evtchn_alloc_unbound *un;
+
+ un = (struct evtchn_alloc_unbound *) p;
+
+ if (un->dom != DOMID_SELF || un->remote_dom != DOMID_SELF)
+ return -EINVAL;
+ evt.fd = -1;
+ evt.port = 0;
+ evt.type = XEN_EVTCHN_TYPE_UNBOUND;
+ evt.remote.domid = DOMID_SELF;
+
+ ret = kvm_xen_eventfd_assign(NULL, &xen_shim->port_to_evt,
+ &xen_shim->xen_lock, &evt);
+ un->port = evt.port;
+ break;
+ }
+ case EVTCHNOP_send: {
+ struct evtchn_send *send;
+
+ send = (struct evtchn_send *) p;
+ ret = shim_hcall_evtchn_send(xen_shim, send);
+ break;
+ }
+ default:
+ ret = -EINVAL;
+ break;
+ }
+
+ return ret;
+}
+
static int shim_hcall_version(int op, struct xen_feature_info *fi)
{
if (op != XENVER_get_features || !fi || fi->submap_idx != 0)
@@ -1947,6 +2183,9 @@ static int shim_hypercall(u64 code, u64 a0, u64 a1, u64 a2, u64 a3, u64 a4)
int ret = -ENOSYS;

switch (code) {
+ case __HYPERVISOR_event_channel_op:
+ ret = shim_hcall_evtchn((int) a0, (void *)a1);
+ break;
case __HYPERVISOR_grant_table_op:
ret = shim_hcall_gnttab((int) a0, (void *) a1, (int) a2);
break;
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index ff7f7d019472..74d877792dfa 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1483,8 +1483,10 @@ struct kvm_xen_hvm_attr {
} vcpu_attr;
struct kvm_xen_eventfd {

-#define XEN_EVTCHN_TYPE_VIRQ 0
-#define XEN_EVTCHN_TYPE_IPI 1
+#define XEN_EVTCHN_TYPE_VIRQ 0
+#define XEN_EVTCHN_TYPE_IPI 1
+#define XEN_EVTCHN_TYPE_INTERDOM 2
+#define XEN_EVTCHN_TYPE_UNBOUND 3
__u32 type;
__u32 port;
__u32 vcpu;
@@ -1497,6 +1499,10 @@ struct kvm_xen_hvm_attr {
struct {
__u8 type;
} virq;
+ struct {
+ __u16 domid;
+ __u32 port;
+ } remote;
__u32 padding[2];
};
} evtchn;
--
2.11.0


2019-02-20 20:19:52

by Joao Martins

[permalink] [raw]
Subject: [PATCH RFC 18/39] x86/xen: make hypercall_page generic

From: Ankur Arora <[email protected]>

Export hypercall_page as a generic interface which can be implemented
by other hypervisors. With this change, hypercall_page now points to
the newly introduced xen_hypercall_page which is seeded by Xen, or to
one that is filled in by a different hypervisor.
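
A minimal sketch of how a non-Xen environment would now hook in
(kvm_hypercall_page is a hypothetical page of call stubs filled in by
that hypervisor's init code):

    static struct hypercall_entry kvm_hypercall_page[PAGE_SIZE / 32]
        __aligned(PAGE_SIZE);

    static void __init kvm_shim_init_hypercalls(void)
    {
        /* ... fill kvm_hypercall_page[] with call stubs ... */
        hypercall_page = kvm_hypercall_page;
    }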

Signed-off-by: Ankur Arora <[email protected]>
---
arch/x86/include/asm/xen/hypercall.h | 12 +++++++-----
arch/x86/xen/enlighten.c | 1 +
arch/x86/xen/enlighten_hvm.c | 3 ++-
arch/x86/xen/enlighten_pv.c | 1 +
arch/x86/xen/enlighten_pvh.c | 3 ++-
arch/x86/xen/xen-asm_32.S | 2 +-
arch/x86/xen/xen-asm_64.S | 2 +-
arch/x86/xen/xen-head.S | 8 ++++----
8 files changed, 19 insertions(+), 13 deletions(-)

diff --git a/arch/x86/include/asm/xen/hypercall.h b/arch/x86/include/asm/xen/hypercall.h
index ef05bea7010d..1a3cd6680e6f 100644
--- a/arch/x86/include/asm/xen/hypercall.h
+++ b/arch/x86/include/asm/xen/hypercall.h
@@ -86,11 +86,13 @@ struct xen_dm_op_buf;
* there aren't more than 5 arguments...)
*/

-extern struct { char _entry[32]; } hypercall_page[];
+struct hypercall_entry { char _entry[32]; };
+extern struct hypercall_entry xen_hypercall_page[128];
+extern struct hypercall_entry *hypercall_page;

-#define __HYPERCALL "call hypercall_page+%c[offset]"
+#define __HYPERCALL CALL_NOSPEC
#define __HYPERCALL_ENTRY(x) \
- [offset] "i" (__HYPERVISOR_##x * sizeof(hypercall_page[0]))
+ [thunk_target] "0" (hypercall_page + __HYPERVISOR_##x)

#ifdef CONFIG_X86_32
#define __HYPERCALL_RETREG "eax"
@@ -116,7 +118,7 @@ extern struct { char _entry[32]; } hypercall_page[];
register unsigned long __arg4 asm(__HYPERCALL_ARG4REG) = __arg4; \
register unsigned long __arg5 asm(__HYPERCALL_ARG5REG) = __arg5;

-#define __HYPERCALL_0PARAM "=r" (__res), ASM_CALL_CONSTRAINT
+#define __HYPERCALL_0PARAM "=&r" (__res), ASM_CALL_CONSTRAINT
#define __HYPERCALL_1PARAM __HYPERCALL_0PARAM, "+r" (__arg1)
#define __HYPERCALL_2PARAM __HYPERCALL_1PARAM, "+r" (__arg2)
#define __HYPERCALL_3PARAM __HYPERCALL_2PARAM, "+r" (__arg3)
@@ -208,7 +210,7 @@ xen_single_call(unsigned int call,

asm volatile(CALL_NOSPEC
: __HYPERCALL_5PARAM
- : [thunk_target] "a" (&hypercall_page[call])
+ : [thunk_target] "0" (hypercall_page + call)
: __HYPERCALL_CLOBBER5);

return (long)__res;
diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
index 73b9736e89d2..b36a10e6b5d7 100644
--- a/arch/x86/xen/enlighten.c
+++ b/arch/x86/xen/enlighten.c
@@ -20,6 +20,7 @@
#include "smp.h"
#include "pmu.h"

+struct hypercall_entry *hypercall_page;
EXPORT_SYMBOL_GPL(hypercall_page);

/*
diff --git a/arch/x86/xen/enlighten_hvm.c b/arch/x86/xen/enlighten_hvm.c
index 0e75642d42a3..40845e3e9a96 100644
--- a/arch/x86/xen/enlighten_hvm.c
+++ b/arch/x86/xen/enlighten_hvm.c
@@ -105,8 +105,9 @@ static void __init init_hvm_pv_info(void)

pv_info.name = "Xen HVM";
msr = cpuid_ebx(base + 2);
- pfn = __pa(hypercall_page);
+ pfn = __pa(xen_hypercall_page);
wrmsr_safe(msr, (u32)pfn, (u32)(pfn >> 32));
+ hypercall_page = xen_hypercall_page;
}

xen_setup_features();
diff --git a/arch/x86/xen/enlighten_pv.c b/arch/x86/xen/enlighten_pv.c
index c54a493e139a..e1537713b57d 100644
--- a/arch/x86/xen/enlighten_pv.c
+++ b/arch/x86/xen/enlighten_pv.c
@@ -1197,6 +1197,7 @@ asmlinkage __visible void __init xen_start_kernel(void)

if (!xen_start_info)
return;
+ hypercall_page = xen_hypercall_page;

xen_domain_type = XEN_PV_DOMAIN;
xen_start_flags = xen_start_info->flags;
diff --git a/arch/x86/xen/enlighten_pvh.c b/arch/x86/xen/enlighten_pvh.c
index 35b7599d2d0b..d57a8ad1769e 100644
--- a/arch/x86/xen/enlighten_pvh.c
+++ b/arch/x86/xen/enlighten_pvh.c
@@ -30,8 +30,9 @@ void __init xen_pvh_init(void)
xen_start_flags = pvh_start_info.flags;

msr = cpuid_ebx(xen_cpuid_base() + 2);
- pfn = __pa(hypercall_page);
+ pfn = __pa(xen_hypercall_page);
wrmsr_safe(msr, (u32)pfn, (u32)(pfn >> 32));
+ hypercall_page = xen_hypercall_page;
}

void __init mem_map_via_hcall(struct boot_params *boot_params_p)
diff --git a/arch/x86/xen/xen-asm_32.S b/arch/x86/xen/xen-asm_32.S
index c15db060a242..ee4998055ea9 100644
--- a/arch/x86/xen/xen-asm_32.S
+++ b/arch/x86/xen/xen-asm_32.S
@@ -121,7 +121,7 @@ xen_iret_end_crit:

hyper_iret:
/* put this out of line since its very rarely used */
- jmp hypercall_page + __HYPERVISOR_iret * 32
+ jmp xen_hypercall_page + __HYPERVISOR_iret * 32

.globl xen_iret_start_crit, xen_iret_end_crit

diff --git a/arch/x86/xen/xen-asm_64.S b/arch/x86/xen/xen-asm_64.S
index 1e9ef0ba30a5..2172d6aec9a3 100644
--- a/arch/x86/xen/xen-asm_64.S
+++ b/arch/x86/xen/xen-asm_64.S
@@ -70,7 +70,7 @@ ENTRY(xen_early_idt_handler_array)
END(xen_early_idt_handler_array)
__FINIT

-hypercall_iret = hypercall_page + __HYPERVISOR_iret * 32
+hypercall_iret = xen_hypercall_page + __HYPERVISOR_iret * 32
/*
* Xen64 iret frame:
*
diff --git a/arch/x86/xen/xen-head.S b/arch/x86/xen/xen-head.S
index 5077ead5e59c..7ff5437bd83f 100644
--- a/arch/x86/xen/xen-head.S
+++ b/arch/x86/xen/xen-head.S
@@ -58,18 +58,18 @@ END(startup_xen)

.pushsection .text
.balign PAGE_SIZE
-ENTRY(hypercall_page)
+ENTRY(xen_hypercall_page)
.rept (PAGE_SIZE / 32)
UNWIND_HINT_EMPTY
.skip 32
.endr

#define HYPERCALL(n) \
- .equ xen_hypercall_##n, hypercall_page + __HYPERVISOR_##n * 32; \
+ .equ xen_hypercall_##n, xen_hypercall_page + __HYPERVISOR_##n * 32; \
.type xen_hypercall_##n, @function; .size xen_hypercall_##n, 32
#include <asm/xen-hypercalls.h>
#undef HYPERCALL
-END(hypercall_page)
+END(xen_hypercall_page)
.popsection

ELFNOTE(Xen, XEN_ELFNOTE_GUEST_OS, .asciz "linux")
@@ -85,7 +85,7 @@ END(hypercall_page)
#ifdef CONFIG_XEN_PV
ELFNOTE(Xen, XEN_ELFNOTE_ENTRY, _ASM_PTR startup_xen)
#endif
- ELFNOTE(Xen, XEN_ELFNOTE_HYPERCALL_PAGE, _ASM_PTR hypercall_page)
+ ELFNOTE(Xen, XEN_ELFNOTE_HYPERCALL_PAGE, _ASM_PTR xen_hypercall_page)
ELFNOTE(Xen, XEN_ELFNOTE_FEATURES,
.ascii "!writable_page_tables|pae_pgdir_above_4gb")
ELFNOTE(Xen, XEN_ELFNOTE_SUPPORTED_FEATURES,
--
2.11.0


2019-02-20 20:19:53

by Joao Martins

[permalink] [raw]
Subject: [PATCH RFC 12/39] KVM: x86/xen: store virq when assigning evtchn

Enable virq offload to the hypervisor. The primary user for this is
the timer virq.
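
With the port recorded, delivering e.g. a timer event reduces to a
lookup (sketch; VIRQ_TIMER comes from Xen's public headers and
kvm_xen_evtchn_send() is introduced earlier in the series):

    int port = vcpu_xen->virq_to_port[VIRQ_TIMER];

    if (port)
        kvm_xen_evtchn_send(vcpu, port);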

Signed-off-by: Joao Martins <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 2 ++
arch/x86/kvm/xen.c | 23 ++++++++++++++++++++++-
2 files changed, 24 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index f31fcaf8fa7c..92b76127eb43 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -550,6 +550,8 @@ struct kvm_vcpu_xen {
gpa_t steal_time_addr;
struct vcpu_runstate_info *steal_time;
struct kvm_xen_callback cb;
+#define KVM_XEN_NR_VIRQS 24
+ unsigned int virq_to_port[KVM_XEN_NR_VIRQS];
};

struct kvm_vcpu_arch {
diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
index 1fbdfa7c4356..42c1fe01600d 100644
--- a/arch/x86/kvm/xen.c
+++ b/arch/x86/kvm/xen.c
@@ -101,6 +101,20 @@ static void kvm_xen_evtchnfd_upcall(struct kvm_vcpu *vcpu, struct evtchnfd *e)
kvm_xen_do_upcall(vcpu->kvm, e->vcpu, vx->cb.via, vx->cb.vector, 0);
}

+void kvm_xen_set_virq(struct kvm *kvm, struct evtchnfd *evt)
+{
+ int virq = evt->virq.type;
+ struct kvm_vcpu_xen *vcpu_xen;
+ struct kvm_vcpu *vcpu;
+
+ vcpu = kvm_get_vcpu(kvm, evt->vcpu);
+ if (!vcpu)
+ return;
+
+ vcpu_xen = vcpu_to_xen_vcpu(vcpu);
+ vcpu_xen->virq_to_port[virq] = evt->port;
+}
+
int kvm_xen_set_evtchn(struct kvm_kernel_irq_routing_entry *e,
struct kvm *kvm, int irq_source_id, int level,
bool line_status)
@@ -620,6 +634,10 @@ static int kvm_xen_eventfd_assign(struct kvm *kvm, struct idr *port_to_evt,
return PTR_ERR(eventfd);
}

+ if (args->type == XEN_EVTCHN_TYPE_VIRQ &&
+ args->virq.type >= KVM_XEN_NR_VIRQS)
+ return -EINVAL;
+
evtchnfd = kzalloc(sizeof(struct evtchnfd), GFP_KERNEL);
if (!evtchnfd)
return -ENOMEM;
@@ -636,8 +654,11 @@ static int kvm_xen_eventfd_assign(struct kvm *kvm, struct idr *port_to_evt,
GFP_KERNEL);
mutex_unlock(port_lock);

- if (ret >= 0)
+ if (ret >= 0) {
+ if (evtchnfd->type == XEN_EVTCHN_TYPE_VIRQ)
+ kvm_xen_set_virq(kvm, evtchnfd);
return 0;
+ }

if (ret == -ENOSPC)
ret = -EEXIST;
--
2.11.0


2019-02-20 20:19:53

by Joao Martins

[permalink] [raw]
Subject: [PATCH RFC 10/39] KVM: x86/xen: support upcall vector

From: Ankur Arora <[email protected]>

Add support for HVM_PARAM_CALLBACK_VIA_TYPE_VECTOR and
HVM_PARAM_CALLBACK_VIA_TYPE_EVTCHN upcall. Some Xen upcall variants do
not have an EOI for received upcalls. We handle that by injecting the
interrupt directly into the VMCS instead of going through the LAPIC.

Note that the @vcpu field of the route represents the vcpu index and
not a vcpu id. The vcpu id is architecture specific, e.g. on x86 it is
set to the APIC id by userspace.
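
For illustration, a userspace VMM could install the vector callback
with a routing entry along these lines (error handling omitted;
cb_vector and vm_fd are placeholders, and the KVM_XEN_CALLBACK_VIA_*
values mirror the kernel-side enum added here):

    struct kvm_irq_routing *table;
    struct kvm_irq_routing_entry *e;

    table = calloc(1, sizeof(*table) + sizeof(*e));
    table->nr = 1;
    e = &table->entries[0];
    e->type = KVM_IRQ_ROUTING_XEN_EVTCHN;
    e->u.evtchn.via = KVM_XEN_CALLBACK_VIA_VECTOR;
    e->u.evtchn.vcpu = 0;           /* vcpu index, not vcpu id */
    e->u.evtchn.vector = cb_vector; /* vector the guest registered */

    ioctl(vm_fd, KVM_SET_GSI_ROUTING, table);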

Co-developed-by: Joao Martins <[email protected]>
Signed-off-by: Ankur Arora <[email protected]>
Signed-off-by: Joao Martins <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 14 ++++++
arch/x86/kvm/irq.c | 14 ++++--
arch/x86/kvm/irq_comm.c | 11 +++++
arch/x86/kvm/xen.c | 106 ++++++++++++++++++++++++++++++++++++++++
arch/x86/kvm/xen.h | 9 ++++
include/linux/kvm_host.h | 24 +++++++++
include/uapi/linux/kvm.h | 8 +++
7 files changed, 183 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 9d388ba0a05c..3305173bf10b 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -534,6 +534,12 @@ struct kvm_vcpu_hv {
cpumask_t tlb_flush;
};

+struct kvm_xen_callback {
+ u32 via;
+ u32 vector;
+ atomic_t queued;
+};
+
/* Xen per vcpu emulation context */
struct kvm_vcpu_xen {
struct kvm_xen_exit exit;
@@ -543,6 +549,7 @@ struct kvm_vcpu_xen {
struct pvclock_vcpu_time_info *pv_time;
gpa_t steal_time_addr;
struct vcpu_runstate_info *steal_time;
+ struct kvm_xen_callback cb;
};

struct kvm_vcpu_arch {
@@ -854,6 +861,13 @@ struct kvm_xen {
struct shared_info *shinfo;
};

+enum kvm_xen_callback_via {
+ KVM_XEN_CALLBACK_VIA_GSI,
+ KVM_XEN_CALLBACK_VIA_PCI_INTX,
+ KVM_XEN_CALLBACK_VIA_VECTOR,
+ KVM_XEN_CALLBACK_VIA_EVTCHN,
+};
+
enum kvm_irqchip_mode {
KVM_IRQCHIP_NONE,
KVM_IRQCHIP_KERNEL, /* created with KVM_CREATE_IRQCHIP */
diff --git a/arch/x86/kvm/irq.c b/arch/x86/kvm/irq.c
index faa264822cee..cdb1dbfcc9b1 100644
--- a/arch/x86/kvm/irq.c
+++ b/arch/x86/kvm/irq.c
@@ -26,6 +26,7 @@
#include "irq.h"
#include "i8254.h"
#include "x86.h"
+#include "xen.h"

/*
* check if there are pending timer events
@@ -61,7 +62,9 @@ static int kvm_cpu_has_extint(struct kvm_vcpu *v)
return pending_userspace_extint(v);
else
return v->kvm->arch.vpic->output;
- } else
+ } else if (kvm_xen_has_interrupt(v) != -1)
+ return 1;
+ else
return 0;
}

@@ -119,7 +122,7 @@ int kvm_cpu_has_interrupt(struct kvm_vcpu *v)
if (kvm_cpu_has_extint(v))
return 1;

- return kvm_apic_has_interrupt(v) != -1; /* LAPIC */
+ return kvm_apic_has_interrupt(v) != -1; /* LAPIC */
}
EXPORT_SYMBOL_GPL(kvm_cpu_has_interrupt);

@@ -135,8 +138,13 @@ static int kvm_cpu_get_extint(struct kvm_vcpu *v)

v->arch.pending_external_vector = -1;
return vector;
- } else
+ } else {
+ int vector = kvm_xen_get_interrupt(v);
+
+ if (vector)
+ return vector; /* Xen */
return kvm_pic_read_irq(v->kvm); /* PIC */
+ }
} else
return -1;
}
diff --git a/arch/x86/kvm/irq_comm.c b/arch/x86/kvm/irq_comm.c
index 3cc3b2d130a0..3b5da18c9ce2 100644
--- a/arch/x86/kvm/irq_comm.c
+++ b/arch/x86/kvm/irq_comm.c
@@ -36,6 +36,7 @@
#include "lapic.h"

#include "hyperv.h"
+#include "xen.h"
#include "x86.h"

static int kvm_set_pic_irq(struct kvm_kernel_irq_routing_entry *e,
@@ -176,6 +177,9 @@ int kvm_arch_set_irq_inatomic(struct kvm_kernel_irq_routing_entry *e,
int r;

switch (e->type) {
+ case KVM_IRQ_ROUTING_XEN_EVTCHN:
+ return kvm_xen_set_evtchn(e, kvm, irq_source_id, level,
+ line_status);
case KVM_IRQ_ROUTING_HV_SINT:
return kvm_hv_set_sint(e, kvm, irq_source_id, level,
line_status);
@@ -325,6 +329,13 @@ int kvm_set_routing_entry(struct kvm *kvm,
e->hv_sint.vcpu = ue->u.hv_sint.vcpu;
e->hv_sint.sint = ue->u.hv_sint.sint;
break;
+ case KVM_IRQ_ROUTING_XEN_EVTCHN:
+ e->set = kvm_xen_set_evtchn;
+ e->evtchn.vcpu = ue->u.evtchn.vcpu;
+ e->evtchn.vector = ue->u.evtchn.vector;
+ e->evtchn.via = ue->u.evtchn.via;
+
+ return kvm_xen_setup_evtchn(kvm, e);
default:
return -EINVAL;
}
diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
index 4fdc4c71245a..99a3722146d8 100644
--- a/arch/x86/kvm/xen.c
+++ b/arch/x86/kvm/xen.c
@@ -7,6 +7,7 @@

#include "x86.h"
#include "xen.h"
+#include "ioapic.h"

#include <linux/kvm_host.h>
#include <linux/sched/stat.h>
@@ -17,6 +18,111 @@

#include "trace.h"

+static void *xen_vcpu_info(struct kvm_vcpu *v);
+
+int kvm_xen_has_interrupt(struct kvm_vcpu *vcpu)
+{
+ struct kvm_vcpu_xen *vcpu_xen = vcpu_to_xen_vcpu(vcpu);
+ struct vcpu_info *vcpu_info = xen_vcpu_info(vcpu);
+
+ if (!!atomic_read(&vcpu_xen->cb.queued) || (vcpu_info &&
+ test_bit(0, (unsigned long *) &vcpu_info->evtchn_upcall_pending)))
+ return 1;
+
+ return -1;
+}
+
+int kvm_xen_get_interrupt(struct kvm_vcpu *vcpu)
+{
+ struct kvm_vcpu_xen *vcpu_xen = vcpu_to_xen_vcpu(vcpu);
+ u32 vector = vcpu_xen->cb.vector;
+
+ if (kvm_xen_has_interrupt(vcpu) == -1)
+ return 0;
+
+ atomic_set(&vcpu_xen->cb.queued, 0);
+ return vector;
+}
+
+static int kvm_xen_do_upcall(struct kvm *kvm, u32 dest_vcpu,
+ u32 via, u32 vector, int level)
+{
+ struct kvm_vcpu_xen *vcpu_xen;
+ struct kvm_lapic_irq irq;
+ struct kvm_vcpu *vcpu;
+
+ if (vector > 0xff || vector < 0x10 || dest_vcpu >= KVM_MAX_VCPUS)
+ return -EINVAL;
+
+ vcpu = kvm_get_vcpu(kvm, dest_vcpu);
+ if (!vcpu)
+ return -EINVAL;
+
+ memset(&irq, 0, sizeof(irq));
+ if (via == KVM_XEN_CALLBACK_VIA_VECTOR) {
+ vcpu_xen = vcpu_to_xen_vcpu(vcpu);
+ atomic_set(&vcpu_xen->cb.queued, 1);
+ kvm_make_request(KVM_REQ_EVENT, vcpu);
+ kvm_vcpu_kick(vcpu);
+ } else if (via == KVM_XEN_CALLBACK_VIA_EVTCHN) {
+ irq.shorthand = APIC_DEST_SELF;
+ irq.dest_mode = APIC_DEST_PHYSICAL;
+ irq.delivery_mode = APIC_DM_FIXED;
+ irq.vector = vector;
+ irq.level = level;
+
+ /* Deliver upcall to a vector on the destination vcpu */
+ kvm_irq_delivery_to_apic(kvm, vcpu->arch.apic, &irq, NULL);
+ } else {
+ return -EINVAL;
+ }
+
+ return 0;
+}
+
+int kvm_xen_set_evtchn(struct kvm_kernel_irq_routing_entry *e,
+ struct kvm *kvm, int irq_source_id, int level,
+ bool line_status)
+{
+ /*
+ * The routing information for the kirq specifies the vector
+ * on the destination vcpu.
+ */
+ return kvm_xen_do_upcall(kvm, e->evtchn.vcpu, e->evtchn.via,
+ e->evtchn.vector, level);
+}
+
+int kvm_xen_setup_evtchn(struct kvm *kvm,
+ struct kvm_kernel_irq_routing_entry *e)
+{
+ struct kvm_vcpu_xen *vcpu_xen;
+ struct kvm_vcpu *vcpu = NULL;
+
+ if (e->evtchn.vector > 0xff || e->evtchn.vector < 0x10)
+ return -EINVAL;
+
+ /* Expect vcpu to be sane */
+ if (e->evtchn.vcpu >= KVM_MAX_VCPUS)
+ return -EINVAL;
+
+ vcpu = kvm_get_vcpu(kvm, e->evtchn.vcpu);
+ if (!vcpu)
+ return -EINVAL;
+
+ vcpu_xen = vcpu_to_xen_vcpu(vcpu);
+ if (e->evtchn.via == KVM_XEN_CALLBACK_VIA_VECTOR) {
+ vcpu_xen->cb.via = KVM_XEN_CALLBACK_VIA_VECTOR;
+ vcpu_xen->cb.vector = e->evtchn.vector;
+ } else if (e->evtchn.via == KVM_XEN_CALLBACK_VIA_EVTCHN) {
+ vcpu_xen->cb.via = KVM_XEN_CALLBACK_VIA_EVTCHN;
+ vcpu_xen->cb.vector = e->evtchn.vector;
+ } else {
+ return -EINVAL;
+ }
+
+ return 0;
+}
+
static void set_vcpu_attr(struct kvm_vcpu *v, u16 type, gpa_t gpa, void *addr)
{
struct kvm_vcpu_xen *vcpu_xen = vcpu_to_xen_vcpu(v);
diff --git a/arch/x86/kvm/xen.h b/arch/x86/kvm/xen.h
index 2feef68ee80f..6a42e134924a 100644
--- a/arch/x86/kvm/xen.h
+++ b/arch/x86/kvm/xen.h
@@ -25,6 +25,15 @@ bool kvm_xen_hypercall_enabled(struct kvm *kvm);
bool kvm_xen_hypercall_set(struct kvm *kvm);
int kvm_xen_hypercall(struct kvm_vcpu *vcpu);

+int kvm_xen_has_interrupt(struct kvm_vcpu *vcpu);
+int kvm_xen_get_interrupt(struct kvm_vcpu *vcpu);
+
+int kvm_xen_set_evtchn(struct kvm_kernel_irq_routing_entry *e,
+ struct kvm *kvm, int irq_source_id, int level,
+ bool line_status);
+int kvm_xen_setup_evtchn(struct kvm *kvm,
+ struct kvm_kernel_irq_routing_entry *e);
+
void kvm_xen_destroy_vm(struct kvm *kvm);
void kvm_xen_vcpu_uninit(struct kvm_vcpu *vcpu);

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 9d55c63db09b..af5e7455ff6a 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -350,6 +350,29 @@ struct kvm_hv_sint {
u32 sint;
};

+/*
+ * struct kvm_xen_evtchn: currently specifies the upcall vector setup to
+ * deliver the interrupt to the guest.
+ *
+ * via = HVM_PARAM_CALLBACK_VIA_TYPE_GSI|_PCI
+ * vcpu: always deliver to vcpu-0
+ * vector: is used as upcall-vector
+ * EOI: none
+ * via = HVM_PARAM_CALLBACK_VIA_TYPE_VECTOR
+ * vcpu: deliver to specified vcpu
+ * vector: used as upcall-vector
+ * EOI: none
+ * via = HVM_PARAM_CALLBACK_VIA_TYPE_EVTCHN
+ * vcpu: deliver to specified vcpu (vector should be bound to the vcpu)
+ * vector: used as upcall-vector
+ * EOI: expected
+ */
+struct kvm_xen_evtchn {
+ u32 via;
+ u32 vcpu;
+ u32 vector;
+};
+
struct kvm_kernel_irq_routing_entry {
u32 gsi;
u32 type;
@@ -370,6 +393,7 @@ struct kvm_kernel_irq_routing_entry {
} msi;
struct kvm_s390_adapter_int adapter;
struct kvm_hv_sint hv_sint;
+ struct kvm_xen_evtchn evtchn;
};
struct hlist_node link;
};
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 682ea00abd58..49001f681cd1 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1035,11 +1035,18 @@ struct kvm_irq_routing_hv_sint {
__u32 sint;
};

+struct kvm_irq_routing_xen_evtchn {
+ __u32 via;
+ __u32 vcpu;
+ __u32 vector;
+};
+
/* gsi routing entry types */
#define KVM_IRQ_ROUTING_IRQCHIP 1
#define KVM_IRQ_ROUTING_MSI 2
#define KVM_IRQ_ROUTING_S390_ADAPTER 3
#define KVM_IRQ_ROUTING_HV_SINT 4
+#define KVM_IRQ_ROUTING_XEN_EVTCHN 5

struct kvm_irq_routing_entry {
__u32 gsi;
@@ -1051,6 +1058,7 @@ struct kvm_irq_routing_entry {
struct kvm_irq_routing_msi msi;
struct kvm_irq_routing_s390_adapter adapter;
struct kvm_irq_routing_hv_sint hv_sint;
+ struct kvm_irq_routing_xen_evtchn evtchn;
__u32 pad[8];
} u;
};
--
2.11.0


2019-02-20 20:19:56

by Joao Martins

[permalink] [raw]
Subject: [PATCH RFC 30/39] KVM: x86/xen: add additional evtchn ops

From: Ankur Arora <[email protected]>

Add support for changing event channel affinity (EVTCHNOP_bind_vcpu)
and closing an event (EVTCHNOP_close).

We piggyback on the functionality already implemented for guest event
channels.
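
For example, a backend moving an event channel to vCPU 2 and later
tearing it down would issue (port is a placeholder):

    struct evtchn_bind_vcpu bind = { .port = port, .vcpu = 2 };
    struct evtchn_close close = { .port = port };

    HYPERVISOR_event_channel_op(EVTCHNOP_bind_vcpu, &bind);
    /* ... */
    HYPERVISOR_event_channel_op(EVTCHNOP_close, &close);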

Signed-off-by: Ankur Arora <[email protected]>
---
arch/x86/kvm/xen.c | 32 ++++++++++++++++++++++++++++++++
1 file changed, 32 insertions(+)

diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
index 1988ed3866bf..666dd6d1f5a3 100644
--- a/arch/x86/kvm/xen.c
+++ b/arch/x86/kvm/xen.c
@@ -2188,6 +2188,38 @@ static int shim_hcall_evtchn(int op, void *p)
ret = shim_hcall_evtchn_send(xen_shim, send);
break;
}
+ case EVTCHNOP_bind_virq: {
+ struct evtchn_bind_virq *un;
+
+ un = (struct evtchn_bind_virq *) p;
+
+ evt.fd = -1;
+ evt.port = 0;
+ evt.type = XEN_EVTCHN_TYPE_VIRQ;
+ ret = kvm_xen_eventfd_assign(NULL, &xen_shim->port_to_evt,
+ &xen_shim->xen_lock, &evt);
+ un->port = evt.port;
+ break;
+ }
+ case EVTCHNOP_bind_vcpu: {
+ struct evtchn_bind_vcpu *bind_vcpu;
+
+ bind_vcpu = (struct evtchn_bind_vcpu *) p;
+
+ evt.port = bind_vcpu->port;
+ evt.vcpu = bind_vcpu->vcpu;
+ ret = kvm_xen_eventfd_update(NULL, &xen_shim->port_to_evt,
+ &xen_shim->xen_lock, &evt);
+ break;
+ }
+ case EVTCHNOP_close: {
+ struct evtchn_close *cls;
+
+ cls = (struct evtchn_close *) p;
+ ret = kvm_xen_eventfd_deassign(NULL, &xen_shim->port_to_evt,
+ &xen_shim->xen_lock, cls->port);
+ break;
+ }
default:
ret = -EINVAL;
break;
--
2.11.0


2019-02-20 20:20:02

by Joao Martins

[permalink] [raw]
Subject: [PATCH RFC 31/39] xen-shim: introduce shim domain driver

From: Ankur Arora <[email protected]>

xen-shim.ko sets up and tears down state needed to support Xen
backends. The underlying primitives that are exposed are interdomain
event-channels and grant-table map/unmap/copy.

We setup the following:

* Initialize shared_info and vcpu_info pages, essentially setting
up event-channel state.
* Set up features (this allows xen_feature() to work)
* Initialize event-channel subsystem (select event ops and related
setup.)
* Initialize xenbus and tear it down on module exit.

This functionality would be used by the backend drivers (e.g. netback,
scsiback, blkback etc) in order to drive guest I/O.
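
A backend would pin the shim module for as long as it holds shim
resources, roughly as sketched here using the helpers added below:

    if (!xen_shim_domain_get())
        return -ENODEV;

    /* ... bind event channels, map grants, run the backend ... */

    xen_shim_domain_put();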

Co-developed-by: Joao Martins <[email protected]>
Signed-off-by: Ankur Arora <[email protected]>
Signed-off-by: Joao Martins <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 2 +
arch/x86/kvm/Kconfig | 10 +++
arch/x86/kvm/Makefile | 1 +
arch/x86/kvm/xen-shim.c | 107 +++++++++++++++++++++++++++++++
arch/x86/xen/enlighten.c | 45 +++++++++++++
drivers/xen/events/events_2l.c | 2 +-
drivers/xen/events/events_base.c | 6 +-
drivers/xen/features.c | 1 +
drivers/xen/xenbus/xenbus_dev_frontend.c | 4 +-
include/xen/xen.h | 5 ++
include/xen/xenbus.h | 3 +
11 files changed, 182 insertions(+), 4 deletions(-)
create mode 100644 arch/x86/kvm/xen-shim.c

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 55609e919e14..6bdae8649d56 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -896,6 +896,8 @@ struct kvm_grant_table {
/* Xen emulation context */
struct kvm_xen {
u64 xen_hypercall;
+
+#define XEN_SHIM_DOMID 0
domid_t domid;

gfn_t shinfo_addr;
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index 72fa955f4a15..47347df282dc 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -96,6 +96,16 @@ config KVM_MMU_AUDIT
This option adds a R/W kVM module parameter 'mmu_audit', which allows
auditing of KVM MMU events at runtime.

+config XEN_SHIM
+ tristate "Xen hypercall emulation shim"
+ depends on KVM
+ depends on XEN
+ default m
+ help
+ Shim to support Xen hypercalls on non-Xen hosts. It intercepts grant
+ table and event channels hypercalls same way as Xen hypervisor. This is
+ useful for having Xen backend drivers work on KVM.
+
# OK, it's a little counter-intuitive to do this, but it puts it neatly under
# the virtualization menu.
source "drivers/vhost/Kconfig"
diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
index c1eaabbd0a54..a96a96a002a7 100644
--- a/arch/x86/kvm/Makefile
+++ b/arch/x86/kvm/Makefile
@@ -18,3 +18,4 @@ kvm-amd-y += svm.o pmu_amd.o
obj-$(CONFIG_KVM) += kvm.o
obj-$(CONFIG_KVM_INTEL) += kvm-intel.o
obj-$(CONFIG_KVM_AMD) += kvm-amd.o
+obj-$(CONFIG_XEN_SHIM) += xen-shim.o
diff --git a/arch/x86/kvm/xen-shim.c b/arch/x86/kvm/xen-shim.c
new file mode 100644
index 000000000000..61fdceb63ec2
--- /dev/null
+++ b/arch/x86/kvm/xen-shim.c
@@ -0,0 +1,107 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (c) 2019 Oracle and/or its affiliates. All rights reserved.
+ *
+ * Xen hypercall emulation shim
+ */
+
+#define pr_fmt(fmt) "KVM:" KBUILD_MODNAME ": " fmt
+
+#include <asm/kvm_host.h>
+
+#include <xen/xen.h>
+#include <xen/xen-ops.h>
+#include <xen/events.h>
+#include <xen/xenbus.h>
+
+#define BITS_PER_EVTCHN_WORD (sizeof(xen_ulong_t)*8)
+
+static struct kvm_xen shim = { .domid = XEN_SHIM_DOMID };
+
+static void shim_evtchn_setup(struct shared_info *s)
+{
+ int cpu;
+
+ /* Point Xen's shared_info to the domain's sinfo page */
+ HYPERVISOR_shared_info = s;
+
+ /* Evtchns will be marked pending on allocation */
+ memset(s->evtchn_pending, 0, sizeof(s->evtchn_pending));
+	/* ... but we mask them all -- dom0 expects that. */
+ memset(s->evtchn_mask, 1, sizeof(s->evtchn_mask));
+
+ for_each_possible_cpu(cpu) {
+ struct vcpu_info *vcpu_info;
+ int i;
+
+ /* Direct CPU mapping as far as dom0 is concerned */
+ per_cpu(xen_vcpu_id, cpu) = cpu;
+
+ vcpu_info = &per_cpu(xen_vcpu_info, cpu);
+ memset(vcpu_info, 0, sizeof(*vcpu_info));
+
+ vcpu_info->evtchn_upcall_mask = 0;
+
+ vcpu_info->evtchn_upcall_pending = 0;
+ for (i = 0; i < BITS_PER_EVTCHN_WORD; i++)
+ clear_bit(i, &vcpu_info->evtchn_pending_sel);
+
+ per_cpu(xen_vcpu, cpu) = vcpu_info;
+ }
+}
+
+static int __init shim_register(void)
+{
+ struct shared_info *shinfo;
+
+ shinfo = (struct shared_info *)get_zeroed_page(GFP_KERNEL);
+ if (!shinfo) {
+ pr_err("Failed to allocate shared_info page\n");
+ return -ENOMEM;
+ }
+ shim.shinfo = shinfo;
+
+ idr_init(&shim.port_to_evt);
+ mutex_init(&shim.xen_lock);
+
+ kvm_xen_register_lcall(&shim);
+
+ /* We can handle hypercalls after this point */
+ xen_shim_domain = 1;
+
+ shim_evtchn_setup(shim.shinfo);
+
+ xen_setup_features();
+
+ xen_init_IRQ();
+
+ xenbus_init();
+
+ return 0;
+}
+
+static int __init shim_init(void)
+{
+ if (xen_domain())
+ return -ENODEV;
+
+ return shim_register();
+}
+
+static void __exit shim_exit(void)
+{
+ xenbus_deinit();
+ xen_shim_domain = 0;
+
+ kvm_xen_unregister_lcall();
+ HYPERVISOR_shared_info = NULL;
+ free_page((unsigned long) shim.shinfo);
+ shim.shinfo = NULL;
+}
+
+module_init(shim_init);
+module_exit(shim_exit);
+
+MODULE_AUTHOR("Ankur Arora <[email protected]>,"
+ "Joao Martins <[email protected]>");
+MODULE_LICENSE("GPL");
diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
index b36a10e6b5d7..8d9e93b6eb09 100644
--- a/arch/x86/xen/enlighten.c
+++ b/arch/x86/xen/enlighten.c
@@ -57,6 +57,9 @@ EXPORT_PER_CPU_SYMBOL(xen_vcpu_info);
enum xen_domain_type xen_domain_type = XEN_NATIVE;
EXPORT_SYMBOL_GPL(xen_domain_type);

+int xen_shim_domain;
+EXPORT_SYMBOL_GPL(xen_shim_domain);
+
unsigned long *machine_to_phys_mapping = (void *)MACH2PHYS_VIRT_START;
EXPORT_SYMBOL(machine_to_phys_mapping);
unsigned long machine_to_phys_nr;
@@ -349,3 +352,45 @@ void xen_arch_unregister_cpu(int num)
}
EXPORT_SYMBOL(xen_arch_unregister_cpu);
#endif
+
+static struct module *find_module_shim(void)
+{
+ static const char name[] = "xen_shim";
+ struct module *module;
+
+ mutex_lock(&module_mutex);
+ module = find_module(name);
+ mutex_unlock(&module_mutex);
+
+ return module;
+}
+
+bool xen_shim_domain_get(void)
+{
+ struct module *shim;
+
+ if (!xen_shim_domain())
+ return false;
+
+ shim = find_module_shim();
+ if (!shim)
+ return false;
+
+ return try_module_get(shim);
+}
+EXPORT_SYMBOL(xen_shim_domain_get);
+
+void xen_shim_domain_put(void)
+{
+ struct module *shim;
+
+ if (!xen_shim_domain())
+ return;
+
+ shim = find_module_shim();
+ if (!shim)
+ return;
+
+ module_put(shim);
+}
+EXPORT_SYMBOL(xen_shim_domain_put);
diff --git a/drivers/xen/events/events_2l.c b/drivers/xen/events/events_2l.c
index b5acf4b09971..f08d13a033c1 100644
--- a/drivers/xen/events/events_2l.c
+++ b/drivers/xen/events/events_2l.c
@@ -89,7 +89,7 @@ static void evtchn_2l_unmask(unsigned port)
unsigned int cpu = get_cpu();
int do_hypercall = 0, evtchn_pending = 0;

- BUG_ON(!irqs_disabled());
+ WARN_ON(!irqs_disabled());

if (unlikely((cpu != cpu_from_evtchn(port))))
do_hypercall = 1;
diff --git a/drivers/xen/events/events_base.c b/drivers/xen/events/events_base.c
index 117e76b2f939..a2087287c3b6 100644
--- a/drivers/xen/events/events_base.c
+++ b/drivers/xen/events/events_base.c
@@ -1665,7 +1665,7 @@ void xen_callback_vector(void) {}
static bool fifo_events = true;
module_param(fifo_events, bool, 0);

-void __init xen_init_IRQ(void)
+void xen_init_IRQ(void)
{
int ret = -EINVAL;
unsigned int evtchn;
@@ -1683,6 +1683,9 @@ void __init xen_init_IRQ(void)
for (evtchn = 0; evtchn < xen_evtchn_nr_channels(); evtchn++)
mask_evtchn(evtchn);

+ if (xen_shim_domain())
+ return;
+
pirq_needs_eoi = pirq_needs_eoi_flag;

#ifdef CONFIG_X86
@@ -1714,3 +1717,4 @@ void __init xen_init_IRQ(void)
}
#endif
}
+EXPORT_SYMBOL_GPL(xen_init_IRQ);
diff --git a/drivers/xen/features.c b/drivers/xen/features.c
index d7d34fdfc993..1518c3b6f004 100644
--- a/drivers/xen/features.c
+++ b/drivers/xen/features.c
@@ -31,3 +31,4 @@ void xen_setup_features(void)
xen_features[i * 32 + j] = !!(fi.submap & 1<<j);
}
}
+EXPORT_SYMBOL_GPL(xen_setup_features);
diff --git a/drivers/xen/xenbus/xenbus_dev_frontend.c b/drivers/xen/xenbus/xenbus_dev_frontend.c
index c3e201025ef0..a4080d04a01c 100644
--- a/drivers/xen/xenbus/xenbus_dev_frontend.c
+++ b/drivers/xen/xenbus/xenbus_dev_frontend.c
@@ -680,7 +680,7 @@ static struct miscdevice xenbus_dev = {
.fops = &xen_xenbus_fops,
};

-static int __init xenbus_init(void)
+static int __init xenbus_frontend_init(void)
{
int err;

@@ -692,4 +692,4 @@ static int __init xenbus_init(void)
pr_err("Could not register xenbus frontend device\n");
return err;
}
-device_initcall(xenbus_init);
+device_initcall(xenbus_frontend_init);
diff --git a/include/xen/xen.h b/include/xen/xen.h
index 0e2156786ad2..04dfa99e67eb 100644
--- a/include/xen/xen.h
+++ b/include/xen/xen.h
@@ -10,8 +10,12 @@ enum xen_domain_type {

#ifdef CONFIG_XEN
extern enum xen_domain_type xen_domain_type;
+extern int xen_shim_domain;
+extern bool xen_shim_domain_get(void);
+extern void xen_shim_domain_put(void);
#else
#define xen_domain_type XEN_NATIVE
+#define xen_shim_domain 0
#endif

#ifdef CONFIG_XEN_PVH
@@ -24,6 +28,7 @@ extern bool xen_pvh;
#define xen_pv_domain() (xen_domain_type == XEN_PV_DOMAIN)
#define xen_hvm_domain() (xen_domain_type == XEN_HVM_DOMAIN)
#define xen_pvh_domain() (xen_pvh)
+#define xen_shim_domain() (!xen_domain() && xen_shim_domain)

#include <linux/types.h>

diff --git a/include/xen/xenbus.h b/include/xen/xenbus.h
index 869c816d5f8c..d2789e7d2055 100644
--- a/include/xen/xenbus.h
+++ b/include/xen/xenbus.h
@@ -233,4 +233,7 @@ extern const struct file_operations xen_xenbus_fops;
extern struct xenstore_domain_interface *xen_store_interface;
extern int xen_store_evtchn;

+int xenbus_init(void);
+void xenbus_deinit(void);
+
#endif /* _XEN_XENBUS_H */
--
2.11.0


2019-02-20 20:20:06

by Joao Martins

[permalink] [raw]
Subject: [PATCH RFC 15/39] KVM: x86/xen: handle PV spinlocks slowpath

From: Boris Ostrovsky <[email protected]>

Add support for SCHEDOP_poll hypercall.

This implementation is optimized for polling for a single channel, which
is what Linux does. Polling for multiple channels is not especially
efficient (and has not been tested).

The PV spinlock slow path uses this hypercall, and explicitly crashes
if it is not supported.
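
The guest-side slow path this serves is essentially what Linux's
xen_qlock_wait() does via xen_poll_irq(): poll a single lock-kicker
event channel (port is a placeholder):

    struct sched_poll poll = {
        .nr_ports = 1,
        .timeout = 0,   /* 0 = no timeout, block until pending */
    };

    set_xen_guest_handle(poll.ports, &port);
    HYPERVISOR_sched_op(SCHEDOP_poll, &poll);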

Signed-off-by: Boris Ostrovsky <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 3 ++
arch/x86/kvm/xen.c | 108 ++++++++++++++++++++++++++++++++++++++++
2 files changed, 111 insertions(+)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 7fcc81dbb688..c629fedb2e21 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -554,6 +554,8 @@ struct kvm_vcpu_xen {
unsigned int virq_to_port[KVM_XEN_NR_VIRQS];
struct hrtimer timer;
atomic_t timer_pending;
+ wait_queue_head_t sched_waitq;
+ int poll_evtchn;
};

struct kvm_vcpu_arch {
@@ -865,6 +867,7 @@ struct kvm_xen {
struct shared_info *shinfo;

struct idr port_to_evt;
+ unsigned long poll_mask[BITS_TO_LONGS(KVM_MAX_VCPUS)];
struct mutex xen_lock;
};

diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
index 753a6d2c11cd..07066402737d 100644
--- a/arch/x86/kvm/xen.c
+++ b/arch/x86/kvm/xen.c
@@ -563,6 +563,16 @@ static int kvm_xen_evtchn_set_pending(struct kvm_vcpu *svcpu,
evfd->port);
}

+static void kvm_xen_check_poller(struct kvm_vcpu *vcpu, int port)
+{
+ struct kvm_vcpu_xen *vcpu_xen = vcpu_to_xen_vcpu(vcpu);
+
+ if ((vcpu_xen->poll_evtchn == port ||
+ vcpu_xen->poll_evtchn == -1) &&
+ test_and_clear_bit(vcpu->vcpu_id, vcpu->kvm->arch.xen.poll_mask))
+ wake_up(&vcpu_xen->sched_waitq);
+}
+
static int kvm_xen_evtchn_send(struct kvm_vcpu *vcpu, int port)
{
struct eventfd_ctx *eventfd;
@@ -581,6 +591,8 @@ static int kvm_xen_evtchn_send(struct kvm_vcpu *vcpu, int port)
eventfd_signal(eventfd, 1);
}

+ kvm_xen_check_poller(kvm_get_vcpu(vcpu->kvm, evtchnfd->vcpu), port);
+
return 0;
}

@@ -669,6 +681,94 @@ static int kvm_xen_hcall_set_timer_op(struct kvm_vcpu *vcpu, uint64_t timeout)
return 0;
}

+static bool wait_pending_event(struct kvm_vcpu *vcpu, int nr_ports,
+ evtchn_port_t *ports)
+{
+ int i;
+ struct shared_info *shared_info =
+ (struct shared_info *)vcpu->kvm->arch.xen.shinfo;
+
+ for (i = 0; i < nr_ports; i++)
+ if (test_bit(ports[i],
+ (unsigned long *)shared_info->evtchn_pending))
+ return true;
+
+ return false;
+}
+
+static int kvm_xen_schedop_poll(struct kvm_vcpu *vcpu, gpa_t gpa)
+{
+ struct kvm_vcpu_xen *vcpu_xen = vcpu_to_xen_vcpu(vcpu);
+ int idx, i;
+ struct sched_poll sched_poll;
+ evtchn_port_t port, *ports;
+ struct shared_info *shared_info;
+ struct evtchnfd *evtchnfd;
+ int ret = 0;
+
+ if (kvm_vcpu_read_guest(vcpu, gpa,
+ &sched_poll, sizeof(sched_poll)))
+ return -EFAULT;
+
+ shared_info = (struct shared_info *)vcpu->kvm->arch.xen.shinfo;
+
+ if (unlikely(sched_poll.nr_ports > 1)) {
+ /* Xen (unofficially) limits number of pollers to 128 */
+ if (sched_poll.nr_ports > 128)
+ return -EINVAL;
+
+ ports = kmalloc_array(sched_poll.nr_ports,
+ sizeof(*ports), GFP_KERNEL);
+ if (!ports)
+ return -ENOMEM;
+ } else
+ ports = &port;
+
+ set_bit(vcpu->vcpu_id, vcpu->kvm->arch.xen.poll_mask);
+
+ for (i = 0; i < sched_poll.nr_ports; i++) {
+ idx = srcu_read_lock(&vcpu->kvm->srcu);
+ gpa = kvm_mmu_gva_to_gpa_system(vcpu,
+ (gva_t)(sched_poll.ports + i),
+ NULL);
+ srcu_read_unlock(&vcpu->kvm->srcu, idx);
+
+ if (!gpa || kvm_vcpu_read_guest(vcpu, gpa,
+ &ports[i], sizeof(port))) {
+ ret = -EFAULT;
+ goto out;
+ }
+
+ evtchnfd = idr_find(&vcpu->kvm->arch.xen.port_to_evt,
+ ports[i]);
+ if (!evtchnfd) {
+ ret = -ENOENT;
+ goto out;
+ }
+ }
+
+ if (sched_poll.nr_ports == 1)
+ vcpu_xen->poll_evtchn = port;
+ else
+ vcpu_xen->poll_evtchn = -1;
+
+ if (!wait_pending_event(vcpu, sched_poll.nr_ports, ports))
+ wait_event_interruptible_timeout(
+ vcpu_xen->sched_waitq,
+ wait_pending_event(vcpu, sched_poll.nr_ports, ports),
+ sched_poll.timeout ?: KTIME_MAX);
+
+ vcpu_xen->poll_evtchn = 0;
+
+out:
+ /* Really, this is only needed in case of timeout */
+ clear_bit(vcpu->vcpu_id, vcpu->kvm->arch.xen.poll_mask);
+
+ if (unlikely(sched_poll.nr_ports > 1))
+ kfree(ports);
+ return ret;
+}
+
static int kvm_xen_hcall_sched_op(struct kvm_vcpu *vcpu, int cmd, u64 param)
{
int ret = -ENOSYS;
@@ -687,6 +787,9 @@ static int kvm_xen_hcall_sched_op(struct kvm_vcpu *vcpu, int cmd, u64 param)
kvm_vcpu_on_spin(vcpu, true);
ret = 0;
break;
+ case SCHEDOP_poll:
+ ret = kvm_xen_schedop_poll(vcpu, gpa);
+ break;
default:
break;
}
@@ -744,6 +847,9 @@ int kvm_xen_hypercall(struct kvm_vcpu *vcpu)
r = kvm_xen_hcall_sched_op(vcpu, params[0], params[1]);
if (!r)
goto hcall_success;
+ else if (params[0] == SCHEDOP_poll)
+ /* SCHEDOP_poll should be handled in kernel */
+ return r;
break;
/* fallthrough */
default:
@@ -770,6 +876,8 @@ int kvm_xen_hypercall(struct kvm_vcpu *vcpu)

void kvm_xen_vcpu_init(struct kvm_vcpu *vcpu)
{
+ init_waitqueue_head(&vcpu->arch.xen.sched_waitq);
+ vcpu->arch.xen.poll_evtchn = 0;
}

void kvm_xen_vcpu_uninit(struct kvm_vcpu *vcpu)
--
2.11.0


2019-02-20 20:20:15

by Joao Martins

[permalink] [raw]
Subject: [PATCH RFC 23/39] KVM: x86/xen: grant table grow support

Guest grant tables used by core Xen PV devices (xenbus, console) need
to be seeded with a number of reserved entries at boot. However, at
init, the grant table is, from the guest's perspective, empty and has
no frames backing it. That only happens once the guest does:

XENMEM_add_to_physmap(idx=N,gfn=M,space=XENMAPSPACE_grant_table)

Which will share the added page with the hypervisor.

The way we handle this then is to seed (from userspace) the initial
frame where we store special entries which reference guest PV ring
pages. These pages are in-turn mapped/unmapped in backend domains
hosting xenstored and xenconsoled.

When the guest initializes its grant tables (with the hypercall listed
above) we copy the entries from the private frame into a "mapped" gfn.
To do this, the userspace VMM handles XENMEM_add_to_physmap hypercall and
the hypervisor grows its grant table. Note that a grant table can only
grow - no shrinking is possible.
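
To illustrate the userspace side, here is a minimal sketch of how a VMM
could drive the grow operation when it handles XENMEM_add_to_physmap.
It assumes the kvm_xen_hvm_attr/kvm_xen_gnttab uapi added below plus
the KVM_XEN_ATTR_TYPE_GNTTAB attribute from later in the series;
vmm_gnttab_grow() is a hypothetical helper and error handling is
elided:

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Hypothetical VMM helper: invoked on XENMEM_add_to_physmap with
 * space == XENMAPSPACE_grant_table. */
static int vmm_gnttab_grow(int vm_fd, uint32_t idx, uint64_t gfn)
{
	struct kvm_xen_hvm_attr xha = {
		.type = KVM_XEN_ATTR_TYPE_GNTTAB,
	};

	xha.u.gnttab.flags = KVM_XEN_GNTTAB_F_GROW;
	xha.u.gnttab.grow.idx = idx;	/* frame index the guest added */
	xha.u.gnttab.grow.gfn = gfn;	/* guest frame backing it */

	return ioctl(vm_fd, KVM_XEN_HVM_SET_ATTR, &xha);
}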

Signed-off-by: Joao Martins <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 16 ++++++++
arch/x86/kvm/xen.c | 90 +++++++++++++++++++++++++++++++++++++++++
include/uapi/linux/kvm.h | 5 +++
3 files changed, 111 insertions(+)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index e0cbc0899580..70bb7339ddd4 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -860,6 +860,21 @@ struct kvm_hv {
atomic_t num_mismatched_vp_indexes;
};

+struct kvm_grant_map {
+ u64 gpa;
+ union {
+ struct {
+
+#define _KVM_GNTMAP_ACTIVE (15)
+#define KVM_GNTMAP_ACTIVE (1 << _KVM_GNTMAP_ACTIVE)
+ u16 flags;
+ u16 ref;
+ u32 domid;
+ };
+ u64 fields;
+ };
+};
+
/* Xen grant table */
struct kvm_grant_table {
u32 nr_frames;
@@ -871,6 +886,7 @@ struct kvm_grant_table {
gfn_t *frames_addr;
gpa_t initial_addr;
struct grant_entry_v1 *initial;
+ struct kvm_grant_map **handle;

/* maptrack limits */
u32 max_mt_frames;
diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
index b9e6e8f72d87..7266d27db210 100644
--- a/arch/x86/kvm/xen.c
+++ b/arch/x86/kvm/xen.c
@@ -22,6 +22,12 @@

#include "trace.h"

+/* Grant v1 references per 4K page */
+#define GPP_V1 (PAGE_SIZE / sizeof(struct grant_entry_v1))
+
+/* Grant mappings per 4K page */
+#define MPP (PAGE_SIZE / sizeof(struct kvm_grant_map))
+
struct evtchnfd {
struct eventfd_ctx *ctx;
u32 vcpu;
@@ -1158,11 +1164,92 @@ int kvm_xen_gnttab_init(struct kvm *kvm, struct kvm_xen *xen,
void kvm_xen_gnttab_free(struct kvm_xen *xen)
{
struct kvm_grant_table *gnttab = &xen->gnttab;
+ int i;
+
+ for (i = 0; i < gnttab->nr_frames; i++)
+ put_page(virt_to_page(gnttab->frames[i]));

kfree(gnttab->frames);
kfree(gnttab->frames_addr);
}

+int kvm_xen_gnttab_copy_initial_frame(struct kvm *kvm)
+{
+ struct kvm_grant_table *gnttab = &kvm->arch.xen.gnttab;
+ int idx = 0;
+
+ /* Only meant to copy the first gpa being populated */
+ if (!gnttab->initial_addr || !gnttab->frames[idx])
+ return -EINVAL;
+
+ memcpy(gnttab->frames[idx], gnttab->initial, PAGE_SIZE);
+ return 0;
+}
+
+int kvm_xen_maptrack_grow(struct kvm_xen *xen, u32 target)
+{
+ u32 max_entries = target * GPP_V1;
+ u32 nr_entries = xen->gnttab.nr_mt_frames * MPP;
+ int i, j, err = 0;
+ void *addr;
+
+ for (i = nr_entries, j = xen->gnttab.nr_mt_frames;
+ i < max_entries; i += MPP, j++) {
+ addr = (void *) get_zeroed_page(GFP_KERNEL);
+ if (!addr) {
+ err = -ENOMEM;
+ break;
+ }
+
+ xen->gnttab.handle[j] = addr;
+ }
+
+ xen->gnttab.nr_mt_frames = j;
+ xen->gnttab.nr_frames = target;
+ return err;
+}
+
+int kvm_xen_gnttab_grow(struct kvm *kvm, struct kvm_xen_gnttab *op)
+{
+ struct kvm_xen *xen = &kvm->arch.xen;
+ struct kvm_grant_table *gnttab = &xen->gnttab;
+ gfn_t *map = gnttab->frames_addr;
+ u64 gfn = op->grow.gfn;
+ u32 idx = op->grow.idx;
+ struct page *page;
+
+ if (idx < gnttab->nr_frames || idx >= gnttab->max_nr_frames)
+ return -EINVAL;
+
+ if (!idx && !gnttab->nr_frames && !gnttab->initial)
+ return -EINVAL;
+
+ page = gfn_to_page(kvm, gfn);
+ if (is_error_page(page))
+ return -EINVAL;
+
+ map[idx] = gfn;
+
+ gnttab->frames[idx] = page_to_virt(page);
+ if (!idx && !gnttab->nr_frames &&
+ kvm_xen_gnttab_copy_initial_frame(kvm)) {
+ pr_err("kvm_xen: dom%u: failed to copy initial frame\n",
+ xen->domid);
+ return -EFAULT;
+ }
+
+ if (kvm_xen_maptrack_grow(xen, gnttab->nr_frames + 1)) {
+ pr_warn("kvm_xen: dom%u: cannot grow maptrack\n", xen->domid);
+ return -EFAULT;
+ }
+
+ pr_debug("kvm_xen: dom%u: grant table grow frames:%d/%d\n", xen->domid,
+ gnttab->nr_frames, gnttab->max_nr_frames);
+ return 0;
+}
+
int kvm_vm_ioctl_xen_gnttab(struct kvm *kvm, struct kvm_xen_gnttab *op)
{
int r = -EINVAL;
@@ -1174,6 +1261,9 @@ int kvm_vm_ioctl_xen_gnttab(struct kvm *kvm, struct kvm_xen_gnttab *op)
case KVM_XEN_GNTTAB_F_INIT:
r = kvm_xen_gnttab_init(kvm, &kvm->arch.xen, op, 0);
break;
+ case KVM_XEN_GNTTAB_F_GROW:
+ r = kvm_xen_gnttab_grow(kvm, op);
+ break;
default:
r = -ENOSYS;
break;
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index e4fb9bc34d61..ff7f7d019472 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1505,6 +1505,7 @@ struct kvm_xen_hvm_attr {
} dom;
struct kvm_xen_gnttab {
#define KVM_XEN_GNTTAB_F_INIT 0
+#define KVM_XEN_GNTTAB_F_GROW (1 << 0)
__u32 flags;
union {
struct {
@@ -1512,6 +1513,10 @@ struct kvm_xen_hvm_attr {
__u32 max_maptrack_frames;
__u64 initial_frame;
} init;
+ struct {
+ __u32 idx;
+ __u64 gfn;
+ } grow;
__u32 padding[4];
};
} gnttab;
--
2.11.0


2019-02-20 20:20:26

by Joao Martins

Subject: [PATCH RFC 39/39] KVM: x86: declare Xen HVM Dom0 capability

Add a new capability covering domid assignment, interdomain/unbound
event channel types, and grant table support in the hypervisor. This
is used to drive Xen kernel backends.
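
As a sketch, userspace would probe for it in the usual way; the helper
name below is hypothetical:

#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Non-zero iff the host can back Xen kernel backends (shim loaded). */
static int have_xen_hvm_dom0(int kvm_fd)
{
	return ioctl(kvm_fd, KVM_CHECK_EXTENSION, KVM_CAP_XEN_HVM_DOM0) > 0;
}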

Co-developed-by: Ankur Arora <[email protected]>
Signed-off-by: Joao Martins <[email protected]>
Signed-off-by: Ankur Arora <[email protected]>
---
Documentation/virtual/kvm/api.txt | 10 ++++++++++
arch/x86/kvm/x86.c | 4 ++++
include/uapi/linux/kvm.h | 3 +++
3 files changed, 17 insertions(+)

diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index 36d9386415fa..311dcded5e28 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -5046,3 +5046,13 @@ This capability indicates KVM's support for the event channel offload.
Implies support for KVM_IRQ_ROUTING_XEN_EVTCHN irq routing, and
for attribute KVM_XEN_ATTR_TYPE_EVTCHN in KVM_XEN_HVM_GET_ATTR or
KVM_XEN_HVM_SET_ATTR.
+
+8.24 KVM_CAP_XEN_HVM_DOM0
+
+Architectures: x86
+
+This capability indicates support for assigning a domid and for handling
+kernel backends in the hypervisor. It also implies that the attributes
+KVM_XEN_ATTR_TYPE_DOMID and KVM_XEN_ATTR_TYPE_GNTTAB are supported. For
+the existing KVM_XEN_ATTR_TYPE_EVTCHN attribute, it indicates support for
+interdomain and unbound event channels.
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index cb95f7f8bed9..e8c3494b10cb 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -71,6 +71,7 @@
#include <asm/mshyperv.h>
#include <asm/hypervisor.h>
#include <asm/intel_pt.h>
+#include <xen/xen.h>

#define CREATE_TRACE_POINTS
#include "trace.h"
@@ -3049,6 +3050,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
case KVM_CAP_ADJUST_CLOCK:
r = KVM_CLOCK_TSC_STABLE;
break;
+ case KVM_CAP_XEN_HVM_DOM0:
+ r = xen_shim_domain();
+ break;
case KVM_CAP_X86_DISABLE_EXITS:
r |= KVM_X86_DISABLE_EXITS_HLT | KVM_X86_DISABLE_EXITS_PAUSE;
if(kvm_can_mwait_in_guest())
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 74d877792dfa..d817a7bbf507 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1005,6 +1005,7 @@ struct kvm_ppc_resize_hpt {
#define KVM_CAP_HYPERV_CPUID 167
#define KVM_CAP_XEN_HVM_GUEST 168
#define KVM_CAP_XEN_HVM_EVTCHN 169
+#define KVM_CAP_XEN_HVM_DOM0 170

#ifdef KVM_CAP_IRQ_ROUTING

@@ -1485,6 +1486,7 @@ struct kvm_xen_hvm_attr {

#define XEN_EVTCHN_TYPE_VIRQ 0
#define XEN_EVTCHN_TYPE_IPI 1
+/* Available with KVM_CAP_XEN_HVM_DOM0 */
#define XEN_EVTCHN_TYPE_INTERDOM 2
#define XEN_EVTCHN_TYPE_UNBOUND 3
__u32 type;
@@ -1536,6 +1538,7 @@ struct kvm_xen_hvm_attr {
#define KVM_XEN_ATTR_TYPE_VCPU_RUNSTATE 0x3
/* Available with KVM_CAP_XEN_HVM_EVTCHN */
#define KVM_XEN_ATTR_TYPE_EVTCHN 0x4
+/* Available with KVM_CAP_XEN_HVM_DOM0 */
#define KVM_XEN_ATTR_TYPE_DOMID 0x5
#define KVM_XEN_ATTR_TYPE_GNTTAB 0x6

--
2.11.0


2019-02-20 20:20:32

by Joao Martins

Subject: [PATCH RFC 34/39] xen/gntdev: xen_shim_domain() support

From: Ankur Arora <[email protected]>

GNTTABOP_map_grant_ref treats host_addr as an OUT parameter for
xen_shim_domain().

Accordingly it's updated in struct gnttab_unmap_grant_ref before it gets
used via GNTTABOP_unmap_grant_ref.
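
In other words (a sketch of the convention, not part of the diff below;
the helper name is hypothetical):

#include <xen/grant_table.h>

/* host_addr acts as an OUT parameter under xen_shim_domain(). */
static void shim_prime_unmap(grant_ref_t gref, domid_t domid,
			     struct gnttab_map_grant_ref *map,
			     struct gnttab_unmap_grant_ref *unmap)
{
	/* The addr/flags inputs are meaningless in shim mode. */
	gnttab_set_map_op(map, 0, 0, gref, domid);
	/* ... GNTTABOP_map_grant_ref fills in map->host_addr here ... */
	gnttab_set_unmap_op(unmap, map->host_addr, 0, map->handle);
}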

Co-developed-by: Joao Martins <[email protected]>
Signed-off-by: Ankur Arora <[email protected]>
Signed-off-by: Joao Martins <[email protected]>
---
drivers/xen/gntdev.c | 10 ++++++++--
1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/drivers/xen/gntdev.c b/drivers/xen/gntdev.c
index 5efc5eee9544..8540a51f7597 100644
--- a/drivers/xen/gntdev.c
+++ b/drivers/xen/gntdev.c
@@ -351,6 +351,8 @@ int gntdev_map_grant_pages(struct gntdev_grant_map *map)
}

map->unmap_ops[i].handle = map->map_ops[i].handle;
+ if (xen_shim_domain())
+ map->unmap_ops[i].host_addr = map->map_ops[i].host_addr;
if (use_ptemod)
map->kunmap_ops[i].handle = map->kmap_ops[i].handle;
#ifdef CONFIG_XEN_GRANT_DMA_ALLOC
@@ -1122,7 +1124,9 @@ static int gntdev_mmap(struct file *flip, struct vm_area_struct *vma)
(map->flags & GNTMAP_readonly))
goto out_unlock_put;
} else {
- map->flags = GNTMAP_host_map;
+ map->flags = 0;
+ if (!xen_shim_domain())
+ map->flags = GNTMAP_host_map;
if (!(vma->vm_flags & VM_WRITE))
map->flags |= GNTMAP_readonly;
}
@@ -1207,7 +1211,7 @@ static int __init gntdev_init(void)
{
int err;

- if (!xen_domain())
+ if (!xen_domain() && !xen_shim_domain_get())
return -ENODEV;

use_ptemod = !xen_feature(XENFEAT_auto_translated_physmap);
@@ -1215,6 +1219,7 @@ static int __init gntdev_init(void)
err = misc_register(&gntdev_miscdev);
if (err != 0) {
pr_err("Could not register gntdev device\n");
+ xen_shim_domain_put();
return err;
}
return 0;
@@ -1223,6 +1228,7 @@ static int __init gntdev_init(void)
static void __exit gntdev_exit(void)
{
misc_deregister(&gntdev_miscdev);
+ xen_shim_domain_put();
}

module_init(gntdev_init);
--
2.11.0


2019-02-20 20:20:44

by Joao Martins

Subject: [PATCH RFC 17/39] x86/xen: export vcpu_info and shared_info

From: Ankur Arora <[email protected]>

Export the vcpu_info and shared_info symbols, and remove the __init
annotations from xen_evtchn_2l/fifo_init().

This allows us to support the 2-level event channel ABI on
xen_shim_domain().

Signed-off-by: Ankur Arora <[email protected]>
---
arch/x86/xen/enlighten.c | 3 +++
drivers/xen/events/events_2l.c | 2 +-
drivers/xen/events/events_fifo.c | 2 +-
include/xen/xen-ops.h | 2 +-
4 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
index 750f46ad018a..73b9736e89d2 100644
--- a/arch/x86/xen/enlighten.c
+++ b/arch/x86/xen/enlighten.c
@@ -50,6 +50,8 @@ DEFINE_PER_CPU(struct vcpu_info, xen_vcpu_info);
/* Linux <-> Xen vCPU id mapping */
DEFINE_PER_CPU(uint32_t, xen_vcpu_id);
EXPORT_PER_CPU_SYMBOL(xen_vcpu_id);
+EXPORT_PER_CPU_SYMBOL(xen_vcpu);
+EXPORT_PER_CPU_SYMBOL(xen_vcpu_info);

enum xen_domain_type xen_domain_type = XEN_NATIVE;
EXPORT_SYMBOL_GPL(xen_domain_type);
@@ -79,6 +81,7 @@ EXPORT_SYMBOL(xen_start_flags);
* page as soon as fixmap is up and running.
*/
struct shared_info *HYPERVISOR_shared_info = &xen_dummy_shared_info;
+EXPORT_SYMBOL_GPL(HYPERVISOR_shared_info);

/*
* Flag to determine whether vcpu info placement is available on all
diff --git a/drivers/xen/events/events_2l.c b/drivers/xen/events/events_2l.c
index 8edef51c92e5..b5acf4b09971 100644
--- a/drivers/xen/events/events_2l.c
+++ b/drivers/xen/events/events_2l.c
@@ -369,7 +369,7 @@ static const struct evtchn_ops evtchn_ops_2l = {
.resume = evtchn_2l_resume,
};

-void __init xen_evtchn_2l_init(void)
+void xen_evtchn_2l_init(void)
{
pr_info("Using 2-level ABI\n");
evtchn_ops = &evtchn_ops_2l;
diff --git a/drivers/xen/events/events_fifo.c b/drivers/xen/events/events_fifo.c
index 76b318e88382..453b4b05f238 100644
--- a/drivers/xen/events/events_fifo.c
+++ b/drivers/xen/events/events_fifo.c
@@ -430,7 +430,7 @@ static int xen_evtchn_cpu_dead(unsigned int cpu)
return 0;
}

-int __init xen_evtchn_fifo_init(void)
+int xen_evtchn_fifo_init(void)
{
int cpu = smp_processor_id();
int ret;
diff --git a/include/xen/xen-ops.h b/include/xen/xen-ops.h
index 4969817124a8..92fc45075500 100644
--- a/include/xen/xen-ops.h
+++ b/include/xen/xen-ops.h
@@ -10,7 +10,7 @@
#include <xen/interface/vcpu.h>

DECLARE_PER_CPU(struct vcpu_info *, xen_vcpu);
-
+DECLARE_PER_CPU(struct vcpu_info, xen_vcpu_info);
DECLARE_PER_CPU(uint32_t, xen_vcpu_id);
static inline uint32_t xen_vcpu_nr(int cpu)
{
--
2.11.0


2019-02-20 20:20:49

by Joao Martins

Subject: [PATCH RFC 38/39] xen-blkback: xen_shim_domain() support

With xen_shim_domain() enabled, the struct pages used for grant mappings
are only valid in the interval between GNTTABOP_map_grant_ref and
GNTTABOP_unmap_grant_ref.

Ensure that we do not cache the struct page outside that duration.

Also, update the struct page for the segment after
GNTTABOP_map_grant_ref, for the subsequent tracking or when doing IO.

In addition, we disable persistent grants for this configuration, since
the map/unmap overhead is fairly small (in the fast path: a page lookup,
get_page() and put_page()). This reduces the memory usage in
blkback/blkfront.

Co-developed-by: Ankur Arora <[email protected]>
Signed-off-by: Joao Martins <[email protected]>
Signed-off-by: Ankur Arora <[email protected]>
---
drivers/block/xen-blkback/blkback.c | 19 ++++++++++++++++---
drivers/block/xen-blkback/xenbus.c | 5 +++--
2 files changed, 19 insertions(+), 5 deletions(-)

diff --git a/drivers/block/xen-blkback/blkback.c b/drivers/block/xen-blkback/blkback.c
index d51d88be88e1..20c15e6787b1 100644
--- a/drivers/block/xen-blkback/blkback.c
+++ b/drivers/block/xen-blkback/blkback.c
@@ -167,6 +167,15 @@ static inline void put_free_pages(struct xen_blkif_ring *ring, struct page **pag
unsigned long flags;
int i;

+ /*
+ * We don't own the struct page after unmap has been called.
+ * Reallocation is cheap anyway for the shim domain case, so free it.
+ */
+ if (xen_shim_domain()) {
+ gnttab_free_pages(num, page);
+ return;
+ }
+
spin_lock_irqsave(&ring->free_pages_lock, flags);
for (i = 0; i < num; i++)
list_add(&page[i]->lru, &ring->free_pages);
@@ -825,7 +834,7 @@ static int xen_blkbk_map(struct xen_blkif_ring *ring,
*/
again:
for (i = map_until; i < num; i++) {
- uint32_t flags;
+ uint32_t flags = 0;

if (use_persistent_gnts) {
persistent_gnt = get_persistent_gnt(
@@ -846,7 +855,8 @@ static int xen_blkbk_map(struct xen_blkif_ring *ring,
addr = vaddr(pages[i]->page);
pages_to_gnt[segs_to_map] = pages[i]->page;
pages[i]->persistent_gnt = NULL;
- flags = GNTMAP_host_map;
+ if (!xen_shim_domain())
+ flags = GNTMAP_host_map;
if (!use_persistent_gnts && ro)
flags |= GNTMAP_readonly;
gnttab_set_map_op(&map[segs_to_map++], addr,
@@ -880,6 +890,8 @@ static int xen_blkbk_map(struct xen_blkif_ring *ring,
goto next;
}
pages[seg_idx]->handle = map[new_map_idx].handle;
+ if (xen_shim_domain())
+ pages[seg_idx]->page = pages_to_gnt[new_map_idx];
} else {
continue;
}
@@ -1478,7 +1490,7 @@ static int __init xen_blkif_init(void)
{
int rc = 0;

- if (!xen_domain())
+ if (!xen_domain() && !xen_shim_domain_get())
return -ENODEV;

if (xen_blkif_max_ring_order > XENBUS_MAX_RING_GRANT_ORDER) {
@@ -1508,6 +1520,7 @@ static void __exit xen_blkif_exit(void)
{
xen_blkif_interface_exit();
xen_blkif_xenbus_exit();
+ xen_shim_domain_put();
}

module_exit(xen_blkif_exit);
diff --git a/drivers/block/xen-blkback/xenbus.c b/drivers/block/xen-blkback/xenbus.c
index 424e2efebe85..11be0c052b77 100644
--- a/drivers/block/xen-blkback/xenbus.c
+++ b/drivers/block/xen-blkback/xenbus.c
@@ -871,7 +871,8 @@ static void connect(struct backend_info *be)

xen_blkbk_barrier(xbt, be, be->blkif->vbd.flush_support);

- err = xenbus_printf(xbt, dev->nodename, "feature-persistent", "%u", 1);
+ err = xenbus_printf(xbt, dev->nodename, "feature-persistent", "%u",
+ !xen_shim_domain());
if (err) {
xenbus_dev_fatal(dev, err, "writing %s/feature-persistent",
dev->nodename);
@@ -1059,7 +1060,7 @@ static int connect_ring(struct backend_info *be)
}
pers_grants = xenbus_read_unsigned(dev->otherend, "feature-persistent",
0);
- be->blkif->vbd.feature_gnt_persistent = pers_grants;
+ be->blkif->vbd.feature_gnt_persistent = pers_grants && !xen_shim_domain();
be->blkif->vbd.overflow_max_grants = 0;

/*
--
2.11.0


2019-02-20 20:21:05

by Joao Martins

Subject: [PATCH RFC 29/39] KVM: x86/xen: evtchn unmask support

From: Ankur Arora <[email protected]>

Handle the hypercall that unmasks event channel ports.

A subtlety here is that we deliver an upcall only if an actual unmask
happened, the event channel was pending, and the vcpu-wide
evtchn_pending_sel bit was not already set.
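
For reference, the backend-side caller is just the standard unmask
hypercall; a minimal sketch:

#include <xen/interface/event_channel.h>
#include <asm/xen/hypercall.h>

static int unmask_port(evtchn_port_t port)
{
	struct evtchn_unmask unmask = { .port = port };

	/* Routed to shim_hcall_evtchn_unmask() below when on the shim. */
	return HYPERVISOR_event_channel_op(EVTCHNOP_unmask, &unmask);
}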

Co-developed-by: Joao Martins <[email protected]>
Signed-off-by: Ankur Arora <[email protected]>
Signed-off-by: Joao Martins <[email protected]>
---
arch/x86/kvm/xen.c | 30 ++++++++++++++++++++++++++++++
1 file changed, 30 insertions(+)

diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
index 420e3ebb66bc..1988ed3866bf 100644
--- a/arch/x86/kvm/xen.c
+++ b/arch/x86/kvm/xen.c
@@ -2102,6 +2102,29 @@ static int shim_hcall_evtchn_send(struct kvm_xen *dom0, struct evtchn_send *snd)
return 0;
}

+static int shim_hcall_evtchn_unmask(struct kvm_xen *dom0, int port)
+{
+ struct shared_info *s = HYPERVISOR_shared_info;
+ struct evtchnfd *evtchnfd;
+ struct vcpu_info *v;
+
+ evtchnfd = idr_find(&dom0->port_to_evt, port);
+ if (!evtchnfd)
+ return -ENOENT;
+
+ v = per_cpu(xen_vcpu, evtchnfd->vcpu);
+
+ if (test_and_clear_bit(port, (unsigned long *) s->evtchn_mask) &&
+ test_bit(port, (unsigned long *) s->evtchn_pending) &&
+ !test_and_set_bit(port / BITS_PER_EVTCHN_WORD,
+ (unsigned long *) &v->evtchn_pending_sel)) {
+ kvm_xen_evtchn_2l_vcpu_set_pending(v);
+ return kvm_xen_evtchn_call_function(evtchnfd);
+ }
+
+ return 0;
+}
+
static int shim_hcall_evtchn(int op, void *p)
{
int ret;
@@ -2151,6 +2174,13 @@ static int shim_hcall_evtchn(int op, void *p)
un->port = evt.port;
break;
}
+ case EVTCHNOP_unmask: {
+ struct evtchn_unmask *unmask;
+
+ unmask = (struct evtchn_unmask *) p;
+ ret = shim_hcall_evtchn_unmask(xen_shim, unmask->port);
+ break;
+ }
case EVTCHNOP_send: {
struct evtchn_send *send;

--
2.11.0


2019-02-20 20:21:12

by Joao Martins

Subject: [PATCH RFC 37/39] xen-netback: xen_shim_domain() support


GNTTABOP_map_grant_ref treats host_addr as an OUT parameter for
xen_shim_domain(). After gnttab_map_refs() completes, we check each map
op struct and replace the page passed in with the page returned by the
hypercall.

Note that mmap_pages is just a placeholder when netback runs in
xen_shim_domain() mode.

Signed-off-by: Joao Martins <[email protected]>
---
drivers/net/xen-netback/netback.c | 25 +++++++++++++++++++++----
1 file changed, 21 insertions(+), 4 deletions(-)

diff --git a/drivers/net/xen-netback/netback.c b/drivers/net/xen-netback/netback.c
index 80aae3a32c2a..a5dd6de6d3e6 100644
--- a/drivers/net/xen-netback/netback.c
+++ b/drivers/net/xen-netback/netback.c
@@ -324,6 +324,9 @@ struct xenvif_tx_cb {

#define XENVIF_TX_CB(skb) ((struct xenvif_tx_cb *)(skb)->cb)

+/* Selects host map only if on native Xen */
+#define GNTTAB_host_map (xen_shim_domain() ? 0 : GNTMAP_host_map)
+
static inline void xenvif_tx_create_map_op(struct xenvif_queue *queue,
u16 pending_idx,
struct xen_netif_tx_request *txp,
@@ -332,9 +335,12 @@ static inline void xenvif_tx_create_map_op(struct xenvif_queue *queue,
{
queue->pages_to_map[mop-queue->tx_map_ops] = queue->mmap_pages[pending_idx];
gnttab_set_map_op(mop, idx_to_kaddr(queue, pending_idx),
- GNTMAP_host_map | GNTMAP_readonly,
+ GNTTAB_host_map | GNTMAP_readonly,
txp->gref, queue->vif->domid);

+ if (xen_shim_domain())
+ mop->host_addr = 0;
+
memcpy(&queue->pending_tx_info[pending_idx].req, txp,
sizeof(*txp));
queue->pending_tx_info[pending_idx].extra_count = extra_count;
@@ -405,6 +411,12 @@ static struct gnttab_map_grant_ref *xenvif_get_requests(struct xenvif_queue *que
return gop;
}

+static inline void xenvif_grant_replace_page(struct xenvif_queue *queue,
+ u16 pending_idx, u32 mapping_idx)
+{
+ queue->mmap_pages[pending_idx] = queue->pages_to_map[mapping_idx];
+}
+
static inline void xenvif_grant_handle_set(struct xenvif_queue *queue,
u16 pending_idx,
grant_handle_t handle)
@@ -481,6 +493,10 @@ static int xenvif_tx_check_gop(struct xenvif_queue *queue,
xenvif_grant_handle_set(queue,
pending_idx,
gop_map->handle);
+
+ if (!err && xen_shim_domain())
+ xenvif_grant_replace_page(queue, pending_idx,
+ gop_map - queue->tx_map_ops);
/* Had a previous error? Invalidate this fragment. */
if (unlikely(err)) {
xenvif_idx_unmap(queue, pending_idx);
@@ -1268,7 +1284,7 @@ static inline void xenvif_tx_dealloc_action(struct xenvif_queue *queue)
queue->mmap_pages[pending_idx];
gnttab_set_unmap_op(gop,
idx_to_kaddr(queue, pending_idx),
- GNTMAP_host_map,
+ GNTTAB_host_map,
queue->grant_tx_handle[pending_idx]);
xenvif_grant_handle_reset(queue, pending_idx);
++gop;
@@ -1394,7 +1410,7 @@ void xenvif_idx_unmap(struct xenvif_queue *queue, u16 pending_idx)

gnttab_set_unmap_op(&tx_unmap_op,
idx_to_kaddr(queue, pending_idx),
- GNTMAP_host_map,
+ GNTTAB_host_map,
queue->grant_tx_handle[pending_idx]);
xenvif_grant_handle_reset(queue, pending_idx);

@@ -1622,7 +1638,7 @@ static int __init netback_init(void)
{
int rc = 0;

- if (!xen_domain())
+ if (!xen_domain() && !xen_shim_domain_get())
return -ENODEV;

/* Allow as many queues as there are CPUs but max. 8 if user has not
@@ -1663,6 +1679,7 @@ static void __exit netback_fini(void)
debugfs_remove_recursive(xen_netback_dbg_root);
#endif /* CONFIG_DEBUG_FS */
xenvif_xenbus_fini();
+ xen_shim_domain_put();
}
module_exit(netback_fini);

--
2.11.0


2019-02-20 20:21:21

by Joao Martins

Subject: [PATCH RFC 35/39] xen/xenbus: xen_shim_domain() support

From: Ankur Arora <[email protected]>

Fix up the gnttab unmap_ops (and other data structures) to handle
host_addr as an OUT parameter from the call to GNTTABOP_map_grant_ref.

Also, allow xenstored to be hosted in XS_LOCAL mode for
xen_shim_domain() -- this means that it does not need to acquire the
xenstore event channel and pfn externally.

Co-developed-by: Joao Martins <[email protected]>
Signed-off-by: Ankur Arora <[email protected]>
Signed-off-by: Joao Martins <[email protected]>
---
drivers/xen/xenbus/xenbus_client.c | 23 +++++++++++++++++------
drivers/xen/xenbus/xenbus_dev_backend.c | 4 ++--
drivers/xen/xenbus/xenbus_dev_frontend.c | 4 ++--
drivers/xen/xenbus/xenbus_probe.c | 8 ++++----
4 files changed, 25 insertions(+), 14 deletions(-)

diff --git a/drivers/xen/xenbus/xenbus_client.c b/drivers/xen/xenbus/xenbus_client.c
index ada1c9aa6525..b22149a789d4 100644
--- a/drivers/xen/xenbus/xenbus_client.c
+++ b/drivers/xen/xenbus/xenbus_client.c
@@ -487,8 +487,11 @@ static int __xenbus_map_ring(struct xenbus_device *dev,
"mapping in shared page %d from domain %d",
gnt_refs[i], dev->otherend_id);
goto fail;
- } else
+ } else {
handles[i] = map[i].handle;
+ if (xen_shim_domain())
+ addrs[i] = map[i].host_addr;
+ }
}

return GNTST_okay;
@@ -498,7 +501,8 @@ static int __xenbus_map_ring(struct xenbus_device *dev,
if (handles[i] != INVALID_GRANT_HANDLE) {
memset(&unmap[j], 0, sizeof(unmap[j]));
gnttab_set_unmap_op(&unmap[j], (phys_addr_t)addrs[i],
- GNTMAP_host_map, handles[i]);
+ !xen_shim_domain() ? GNTMAP_host_map : 0,
+ handles[i]);
j++;
}
}
@@ -546,7 +550,7 @@ static int xenbus_map_ring_valloc_hvm(struct xenbus_device *dev,
void **vaddr)
{
struct xenbus_map_node *node;
- int err;
+ int i, err;
void *addr;
bool leaked = false;
struct map_ring_valloc_hvm info = {
@@ -572,9 +576,16 @@ static int xenbus_map_ring_valloc_hvm(struct xenbus_device *dev,
&info);

err = __xenbus_map_ring(dev, gnt_ref, nr_grefs, node->handles,
- info.phys_addrs, GNTMAP_host_map, &leaked);
+ info.phys_addrs,
+ !xen_shim_domain() ? GNTMAP_host_map : 0,
+ &leaked);
node->nr_handles = nr_grefs;

+ if (xen_shim_domain()) {
+ for (i = 0; i < nr_grefs; i++)
+ node->hvm.pages[i] = virt_to_page(info.phys_addrs[i]);
+ }
+
if (err)
goto out_free_ballooned_pages;

@@ -882,7 +893,7 @@ int xenbus_unmap_ring(struct xenbus_device *dev,

for (i = 0; i < nr_handles; i++)
gnttab_set_unmap_op(&unmap[i], vaddrs[i],
- GNTMAP_host_map, handles[i]);
+ !xen_shim_domain() ? GNTMAP_host_map : 0, handles[i]);

if (HYPERVISOR_grant_table_op(GNTTABOP_unmap_grant_ref, unmap, i))
BUG();
@@ -926,7 +937,7 @@ static const struct xenbus_ring_ops ring_ops_hvm = {
.unmap = xenbus_unmap_ring_vfree_hvm,
};

-void __init xenbus_ring_ops_init(void)
+void xenbus_ring_ops_init(void)
{
#ifdef CONFIG_XEN_PV
if (!xen_feature(XENFEAT_auto_translated_physmap))
diff --git a/drivers/xen/xenbus/xenbus_dev_backend.c b/drivers/xen/xenbus/xenbus_dev_backend.c
index edba5fecde4d..b605c87bff76 100644
--- a/drivers/xen/xenbus/xenbus_dev_backend.c
+++ b/drivers/xen/xenbus/xenbus_dev_backend.c
@@ -119,11 +119,11 @@ static struct miscdevice xenbus_backend_dev = {
.fops = &xenbus_backend_fops,
};

-static int __init xenbus_backend_init(void)
+int xenbus_backend_init(void)
{
int err;

- if (!xen_initial_domain())
+ if (!xen_initial_domain() && !xen_shim_domain())
return -ENODEV;

err = misc_register(&xenbus_backend_dev);
diff --git a/drivers/xen/xenbus/xenbus_dev_frontend.c b/drivers/xen/xenbus/xenbus_dev_frontend.c
index a4080d04a01c..c6fca6cca6c8 100644
--- a/drivers/xen/xenbus/xenbus_dev_frontend.c
+++ b/drivers/xen/xenbus/xenbus_dev_frontend.c
@@ -680,11 +680,11 @@ static struct miscdevice xenbus_dev = {
.fops = &xen_xenbus_fops,
};

-static int __init xenbus_frontend_init(void)
+static int xenbus_frontend_init(void)
{
int err;

- if (!xen_domain())
+ if (!xen_domain() && !xen_shim_domain())
return -ENODEV;

err = misc_register(&xenbus_dev);
diff --git a/drivers/xen/xenbus/xenbus_probe.c b/drivers/xen/xenbus/xenbus_probe.c
index 2e0ed46b05e7..bbc405cd01ef 100644
--- a/drivers/xen/xenbus/xenbus_probe.c
+++ b/drivers/xen/xenbus/xenbus_probe.c
@@ -693,7 +693,7 @@ EXPORT_SYMBOL_GPL(xenbus_probe);

static int __init xenbus_probe_initcall(void)
{
- if (!xen_domain())
+ if (!xen_domain() && !xen_shim_domain())
return -ENODEV;

if (xen_initial_domain() || xen_hvm_domain())
@@ -790,7 +790,7 @@ int xenbus_init(void)
uint64_t v = 0;
xen_store_domain_type = XS_UNKNOWN;

- if (!xen_domain())
+ if (!xen_domain() && !xen_shim_domain())
return -ENODEV;

xenbus_ring_ops_init();
@@ -799,7 +799,7 @@ int xenbus_init(void)
xen_store_domain_type = XS_PV;
if (xen_hvm_domain())
xen_store_domain_type = XS_HVM;
- if (xen_hvm_domain() && xen_initial_domain())
+ if ((xen_hvm_domain() && xen_initial_domain()) || xen_shim_domain())
xen_store_domain_type = XS_LOCAL;
if (xen_pv_domain() && !xen_start_info->store_evtchn)
xen_store_domain_type = XS_LOCAL;
@@ -863,7 +863,7 @@ postcore_initcall(xenbus_init);

void xenbus_deinit(void)
{
- if (!xen_domain())
+ if (!xen_domain() && !xen_shim_domain())
return;

#ifdef CONFIG_XEN_COMPAT_XENFS
--
2.11.0


2019-02-20 20:21:22

by Joao Martins

Subject: [PATCH RFC 33/39] xen/grant-table: xen_shim_domain() support

With xen-shim, alloc_xenballooned_pages() only allocates a placeholder
page (pfn 0), expecting a subsequent GNTTABOP_map_grant_ref to fix it
up.

However, this means that, until the grant operation
(GNTTABOP_map_grant_ref) provides a valid page, we cannot set
PagePrivate or save any state.

This patch elides the setting of that state if xen_shim_domain(). In
addition, gnttab_map_refs() now fills in the appropriate page returned
from the grant operation.
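
The net effect for a grant-mapping caller, as a minimal sketch (shim
mode assumed; the helper name is hypothetical):

#include <xen/balloon.h>
#include <xen/grant_table.h>

static int shim_map_one(struct gnttab_map_grant_ref *op, struct page **page)
{
	/* Under the shim this hands back the pfn-0 placeholder... */
	int ret = alloc_xenballooned_pages(1, page);

	if (ret)
		return ret;

	/* ...and gnttab_map_refs() swaps in the real page from host_addr. */
	return gnttab_map_refs(op, NULL, page, 1);
}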

Signed-off-by: Joao Martins <[email protected]>
---
drivers/xen/grant-table.c | 15 +++++++++++----
1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/drivers/xen/grant-table.c b/drivers/xen/grant-table.c
index 7ea6fb6a2e5d..ab05b70d98bb 100644
--- a/drivers/xen/grant-table.c
+++ b/drivers/xen/grant-table.c
@@ -804,7 +804,7 @@ int gnttab_alloc_pages(int nr_pages, struct page **pages)
int ret;

ret = alloc_xenballooned_pages(nr_pages, pages);
- if (ret < 0)
+ if (ret < 0 || xen_shim_domain())
return ret;

ret = gnttab_pages_set_private(nr_pages, pages);
@@ -1045,6 +1045,11 @@ int gnttab_map_refs(struct gnttab_map_grant_ref *map_ops,
{
struct xen_page_foreign *foreign;

+ if (xen_shim_domain()) {
+ pages[i] = virt_to_page(map_ops[i].host_addr);
+ continue;
+ }
+
SetPageForeign(pages[i]);
foreign = xen_page_foreign(pages[i]);
foreign->domid = map_ops[i].dom;
@@ -1085,8 +1090,10 @@ int gnttab_unmap_refs(struct gnttab_unmap_grant_ref *unmap_ops,
if (ret)
return ret;

- for (i = 0; i < count; i++)
- ClearPageForeign(pages[i]);
+ for (i = 0; i < count; i++) {
+ if (!xen_shim_domain())
+ ClearPageForeign(pages[i]);
+ }

return clear_foreign_p2m_mapping(unmap_ops, kunmap_ops, pages, count);
}
@@ -1113,7 +1120,7 @@ static void __gnttab_unmap_refs_async(struct gntab_unmap_queue_data* item)
int pc;

for (pc = 0; pc < item->count; pc++) {
- if (page_count(item->pages[pc]) > 1) {
+ if (page_count(item->pages[pc]) > 1 && !xen_shim_domain()) {
unsigned long delay = GNTTAB_UNMAP_REFS_DELAY * (item->age + 1);
schedule_delayed_work(&item->gnttab_work,
msecs_to_jiffies(delay));
--
2.11.0


2019-02-20 20:21:23

by Joao Martins

Subject: [PATCH RFC 20/39] xen-blkback: module_exit support


Implement module_exit to allow users to unload blkback. We prevent
unloading while there are still interfaces allocated; in other words,
we take a module reference in xen_blkif_alloc() and drop it in
xen_blkif_free().

Signed-off-by: Joao Martins <[email protected]>
---
drivers/block/xen-blkback/blkback.c | 8 ++++++++
drivers/block/xen-blkback/common.h | 2 ++
drivers/block/xen-blkback/xenbus.c | 14 ++++++++++++++
3 files changed, 24 insertions(+)

diff --git a/drivers/block/xen-blkback/blkback.c b/drivers/block/xen-blkback/blkback.c
index fd1e19f1a49f..d51d88be88e1 100644
--- a/drivers/block/xen-blkback/blkback.c
+++ b/drivers/block/xen-blkback/blkback.c
@@ -1504,5 +1504,13 @@ static int __init xen_blkif_init(void)

module_init(xen_blkif_init);

+static void __exit xen_blkif_exit(void)
+{
+ xen_blkif_interface_exit();
+ xen_blkif_xenbus_exit();
+}
+
+module_exit(xen_blkif_exit);
+
MODULE_LICENSE("Dual BSD/GPL");
MODULE_ALIAS("xen-backend:vbd");
diff --git a/drivers/block/xen-blkback/common.h b/drivers/block/xen-blkback/common.h
index 1d3002d773f7..3415c558e115 100644
--- a/drivers/block/xen-blkback/common.h
+++ b/drivers/block/xen-blkback/common.h
@@ -376,8 +376,10 @@ struct phys_req {
blkif_sector_t sector_number;
};
int xen_blkif_interface_init(void);
+void xen_blkif_interface_exit(void);

int xen_blkif_xenbus_init(void);
+void xen_blkif_xenbus_exit(void);

irqreturn_t xen_blkif_be_int(int irq, void *dev_id);
int xen_blkif_schedule(void *arg);
diff --git a/drivers/block/xen-blkback/xenbus.c b/drivers/block/xen-blkback/xenbus.c
index a4bc74e72c39..424e2efebe85 100644
--- a/drivers/block/xen-blkback/xenbus.c
+++ b/drivers/block/xen-blkback/xenbus.c
@@ -181,6 +181,8 @@ static struct xen_blkif *xen_blkif_alloc(domid_t domid)
init_completion(&blkif->drain_complete);
INIT_WORK(&blkif->free_work, xen_blkif_deferred_free);

+ __module_get(THIS_MODULE);
+
return blkif;
}

@@ -328,6 +330,8 @@ static void xen_blkif_free(struct xen_blkif *blkif)

/* Make sure everything is drained before shutting down */
kmem_cache_free(xen_blkif_cachep, blkif);
+
+ module_put(THIS_MODULE);
}

int __init xen_blkif_interface_init(void)
@@ -341,6 +345,11 @@ int __init xen_blkif_interface_init(void)
return 0;
}

+void xen_blkif_interface_exit(void)
+{
+ kmem_cache_destroy(xen_blkif_cachep);
+}
+
/*
* sysfs interface for VBD I/O requests
*/
@@ -1115,3 +1124,8 @@ int xen_blkif_xenbus_init(void)
{
return xenbus_register_backend(&xen_blkbk_driver);
}
+
+void xen_blkif_xenbus_exit(void)
+{
+ xenbus_unregister_driver(&xen_blkbk_driver);
+}
--
2.11.0


2019-02-20 20:21:28

by Joao Martins

Subject: [PATCH RFC 32/39] xen/balloon: xen_shim_domain() support

Xen ballooning uses hollow struct pages (with the underlying PFNs being
populated/unpopulated via hypercalls) which are used by the grant logic
to map grants from other domains.

For purposes of a KVM based xen-shim, this model is not useful --
mapping is unnecessary since all guest memory is already mapped in the
KVM host. The simplest option is to just translate grant references to
GPAs (essentially a get_page() on the appropriate GPA.)

This patch provides an alternate balloon allocation mechanism where, in
the allocation path, we just provide a constant struct page
(corresponding to page 0). This allows the calling code -- which does a
page_to_pfn() on the returned struct page -- to remain unchanged before
doing the grant operation (which in this case fills in the real struct
page).

Co-developed-by: Ankur Arora <[email protected]>
Signed-off-by: Joao Martins <[email protected]>
Signed-off-by: Ankur Arora <[email protected]>
---
arch/x86/kvm/xen-shim.c | 31 +++++++++++++++++++++++++++++++
drivers/xen/balloon.c | 15 ++++++++++++++-
include/xen/balloon.h | 7 +++++++
3 files changed, 52 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/xen-shim.c b/arch/x86/kvm/xen-shim.c
index 61fdceb63ec2..4086d92a4bfb 100644
--- a/arch/x86/kvm/xen-shim.c
+++ b/arch/x86/kvm/xen-shim.c
@@ -13,11 +13,40 @@
#include <xen/xen-ops.h>
#include <xen/events.h>
#include <xen/xenbus.h>
+#include <xen/balloon.h>

#define BITS_PER_EVTCHN_WORD (sizeof(xen_ulong_t)*8)

static struct kvm_xen shim = { .domid = XEN_SHIM_DOMID };

+static int shim_alloc_pages(int nr_pages, struct page **pages)
+{
+ int i;
+
+ /*
+ * We provide page 0 instead of NULL because we'll effectively
+ * do the inverse operation while deriving the pfn to pass to
+ * xen for mapping.
+ */
+ for (i = 0; i < nr_pages; i++)
+ pages[i] = pfn_to_page(0);
+
+ return 0;
+}
+
+static void shim_free_pages(int nr_pages, struct page **pages)
+{
+ int i;
+
+ for (i = 0; i < nr_pages; i++)
+ pages[i] = NULL;
+}
+
+static struct xen_balloon_ops shim_balloon_ops = {
+ .alloc_pages = shim_alloc_pages,
+ .free_pages = shim_free_pages,
+};
+
static void shim_evtchn_setup(struct shared_info *s)
{
int cpu;
@@ -65,6 +94,7 @@ static int __init shim_register(void)
mutex_init(&shim.xen_lock);

kvm_xen_register_lcall(&shim);
+ xen_balloon_ops = &shim_balloon_ops;

/* We can handle hypercalls after this point */
xen_shim_domain = 1;
@@ -94,6 +124,7 @@ static void __exit shim_exit(void)
xen_shim_domain = 0;

kvm_xen_unregister_lcall();
+ xen_balloon_ops = NULL;
HYPERVISOR_shared_info = NULL;
free_page((unsigned long) shim.shinfo);
shim.shinfo = NULL;
diff --git a/drivers/xen/balloon.c b/drivers/xen/balloon.c
index ceb5048de9a7..00375fa6c122 100644
--- a/drivers/xen/balloon.c
+++ b/drivers/xen/balloon.c
@@ -138,7 +138,7 @@ enum bp_state {

static DEFINE_MUTEX(balloon_mutex);

-struct balloon_stats balloon_stats;
+struct balloon_stats balloon_stats __read_mostly;
EXPORT_SYMBOL_GPL(balloon_stats);

/* We increase/decrease in batches which fit in a page */
@@ -158,6 +158,9 @@ static DECLARE_DELAYED_WORK(balloon_worker, balloon_process);
#define GFP_BALLOON \
(GFP_HIGHUSER | __GFP_NOWARN | __GFP_NORETRY | __GFP_NOMEMALLOC)

+struct xen_balloon_ops *xen_balloon_ops;
+EXPORT_SYMBOL(xen_balloon_ops);
+
/* balloon_append: add the given page to the balloon. */
static void __balloon_append(struct page *page)
{
@@ -589,6 +592,11 @@ int alloc_xenballooned_pages(int nr_pages, struct page **pages)
struct page *page;
int ret;

+ if (xen_shim_domain() && xen_balloon_ops)
+ return xen_balloon_ops->alloc_pages(nr_pages, pages);
+
+ WARN_ON_ONCE(xen_shim_domain());
+
mutex_lock(&balloon_mutex);

balloon_stats.target_unpopulated += nr_pages;
@@ -634,6 +642,11 @@ void free_xenballooned_pages(int nr_pages, struct page **pages)
{
int i;

+ if (xen_shim_domain() && xen_balloon_ops)
+ return xen_balloon_ops->free_pages(nr_pages, pages);
+
+ WARN_ON_ONCE(xen_shim_domain());
+
mutex_lock(&balloon_mutex);

for (i = 0; i < nr_pages; i++) {
diff --git a/include/xen/balloon.h b/include/xen/balloon.h
index 4914b93a23f2..9ba6a7e91d5e 100644
--- a/include/xen/balloon.h
+++ b/include/xen/balloon.h
@@ -22,6 +22,13 @@ struct balloon_stats {

extern struct balloon_stats balloon_stats;

+struct xen_balloon_ops {
+ int (*alloc_pages)(int nr_pages, struct page **pages);
+ void (*free_pages)(int nr_pages, struct page **pages);
+};
+
+extern struct xen_balloon_ops *xen_balloon_ops;
+
void balloon_set_new_target(unsigned long target);

int alloc_xenballooned_pages(int nr_pages, struct page **pages);
--
2.11.0


2019-02-20 20:21:50

by Joao Martins

Subject: [PATCH RFC 36/39] drivers/xen: xen_shim_domain() support

Enable /dev/xen/{privcmd,evtchn} and /proc/xen/ for xen_shim_domain().

These interfaces will be used by xenstored to initialize its
event channel port and the kva used to communicate with the
xenbus driver.

Signed-off-by: Joao Martins <[email protected]>
---
drivers/xen/evtchn.c | 4 +++-
drivers/xen/privcmd.c | 5 ++++-
drivers/xen/xenbus/xenbus_xs.c | 2 +-
drivers/xen/xenfs/super.c | 7 ++++---
drivers/xen/xenfs/xensyms.c | 4 ++++
5 files changed, 16 insertions(+), 6 deletions(-)

diff --git a/drivers/xen/evtchn.c b/drivers/xen/evtchn.c
index 6d1a5e58968f..367390eb5883 100644
--- a/drivers/xen/evtchn.c
+++ b/drivers/xen/evtchn.c
@@ -708,13 +708,14 @@ static int __init evtchn_init(void)
{
int err;

- if (!xen_domain())
+ if (!xen_domain() && !xen_shim_domain_get())
return -ENODEV;

/* Create '/dev/xen/evtchn'. */
err = misc_register(&evtchn_miscdev);
if (err != 0) {
pr_err("Could not register /dev/xen/evtchn\n");
+ xen_shim_domain_put();
return err;
}

@@ -726,6 +727,7 @@ static int __init evtchn_init(void)
static void __exit evtchn_cleanup(void)
{
misc_deregister(&evtchn_miscdev);
+ xen_shim_domain_put();
}

module_init(evtchn_init);
diff --git a/drivers/xen/privcmd.c b/drivers/xen/privcmd.c
index b24ddac1604b..a063ad5c8260 100644
--- a/drivers/xen/privcmd.c
+++ b/drivers/xen/privcmd.c
@@ -999,12 +999,13 @@ static int __init privcmd_init(void)
{
int err;

- if (!xen_domain())
+ if (!xen_domain() && !xen_shim_domain_get())
return -ENODEV;

err = misc_register(&privcmd_dev);
if (err != 0) {
pr_err("Could not register Xen privcmd device\n");
+ xen_shim_domain_put();
return err;
}

@@ -1012,6 +1013,7 @@ static int __init privcmd_init(void)
if (err != 0) {
pr_err("Could not register Xen hypercall-buf device\n");
misc_deregister(&privcmd_dev);
+ xen_shim_domain_put();
return err;
}

@@ -1022,6 +1024,7 @@ static void __exit privcmd_exit(void)
{
misc_deregister(&privcmd_dev);
misc_deregister(&xen_privcmdbuf_dev);
+ xen_shim_domain_put();
}

module_init(privcmd_init);
diff --git a/drivers/xen/xenbus/xenbus_xs.c b/drivers/xen/xenbus/xenbus_xs.c
index bd6db3703972..062c0a5fa415 100644
--- a/drivers/xen/xenbus/xenbus_xs.c
+++ b/drivers/xen/xenbus/xenbus_xs.c
@@ -735,7 +735,7 @@ static void xs_reset_watches(void)
{
int err;

- if (!xen_hvm_domain() || xen_initial_domain())
+ if (!xen_hvm_domain() || xen_initial_domain() || xen_shim_domain())
return;

if (xen_strict_xenbus_quirk())
diff --git a/drivers/xen/xenfs/super.c b/drivers/xen/xenfs/super.c
index 71ddfb4cf61c..2ef326645880 100644
--- a/drivers/xen/xenfs/super.c
+++ b/drivers/xen/xenfs/super.c
@@ -64,7 +64,8 @@ static int xenfs_fill_super(struct super_block *sb, void *data, int silent)
};

return simple_fill_super(sb, XENFS_SUPER_MAGIC,
- xen_initial_domain() ? xenfs_init_files : xenfs_files);
+ xen_initial_domain() || xen_shim_domain() ?
+ xenfs_init_files : xenfs_files);
}

static struct dentry *xenfs_mount(struct file_system_type *fs_type,
@@ -84,7 +85,7 @@ MODULE_ALIAS_FS("xenfs");

static int __init xenfs_init(void)
{
- if (xen_domain())
+ if (xen_domain() || xen_shim_domain())
return register_filesystem(&xenfs_type);

return 0;
@@ -92,7 +93,7 @@ static int __init xenfs_init(void)

static void __exit xenfs_exit(void)
{
- if (xen_domain())
+ if (xen_domain() || xen_shim_domain())
unregister_filesystem(&xenfs_type);
}

diff --git a/drivers/xen/xenfs/xensyms.c b/drivers/xen/xenfs/xensyms.c
index c6c73a33c44d..79bccb53d344 100644
--- a/drivers/xen/xenfs/xensyms.c
+++ b/drivers/xen/xenfs/xensyms.c
@@ -8,6 +8,7 @@
#include <xen/interface/platform.h>
#include <asm/xen/hypercall.h>
#include <xen/xen-ops.h>
+#include <xen/xen.h>
#include "xenfs.h"


@@ -114,6 +115,9 @@ static int xensyms_open(struct inode *inode, struct file *file)
struct xensyms *xs;
int ret;

+ if (xen_shim_domain())
+ return -EINVAL;
+
ret = seq_open_private(file, &xensyms_seq_ops,
sizeof(struct xensyms));
if (ret)
--
2.11.0


2019-02-20 20:21:58

by Joao Martins

Subject: [PATCH RFC 26/39] KVM: x86/xen: grant unmap support

From: Ankur Arora <[email protected]>

Grant unmap removes the grant from maptrack and marks it as not in use.
We maintain a one-to-one correspondence between grant table and maptrack
entries so there's no contention in allocation/free.

Co-developed-by: Joao Martins <[email protected]>
Signed-off-by: Ankur Arora <[email protected]>
Signed-off-by: Joao Martins <[email protected]>
---
arch/x86/kvm/xen.c | 73 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 73 insertions(+)

diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
index 3603645086a7..8f06924e0dfa 100644
--- a/arch/x86/kvm/xen.c
+++ b/arch/x86/kvm/xen.c
@@ -1684,6 +1684,69 @@ static int shim_hcall_gntmap(struct kvm_xen *ld,
return 0;
}

+static int shim_hcall_gntunmap(struct kvm_xen *xen,
+ struct gnttab_unmap_grant_ref *op)
+{
+ struct kvm_grant_map *map, unmap;
+ struct grant_entry_v1 **rgt;
+ struct grant_entry_v1 *shah;
+ struct kvm *rd = NULL;
+ domid_t domid;
+ u32 ref;
+
+ domid = handle_get_domid(op->handle);
+ ref = handle_get_grant(op->handle);
+
+ rd = kvm_xen_find_vm(domid);
+ if (unlikely(!rd)) {
+ /* We already teardown all ongoing grant maps */
+ op->status = GNTST_okay;
+ return 0;
+ }
+
+ if (unlikely(ref >= gnttab_entries(rd))) {
+ pr_err("gnttab: bad ref %u\n", ref);
+ op->status = GNTST_bad_handle;
+ return 0;
+ }
+
+ rgt = rd->arch.xen.gnttab.frames_v1;
+ map = maptrack_entry(rd->arch.xen.gnttab.handle, ref);
+
+ /*
+ * The test_and_clear_bit (below) serializes ownership of this
+ * grant-entry. After we clear it, there can be a grant-map on this
+ * entry. So we cache the unmap entry before relinquishing ownership.
+ */
+ unmap = *map;
+
+ if (!test_and_clear_bit(_KVM_GNTMAP_ACTIVE,
+ (unsigned long *) &map->flags)) {
+ pr_err("gnttab: bad flags for %u (dom %u ref %u) flags %x\n",
+ op->handle, domid, ref, unmap.flags);
+ op->status = GNTST_bad_handle;
+ return 0;
+ }
+
+ /* Give up the reference taken in get_user_pages_remote(). */
+ put_page(virt_to_page(unmap.gpa));
+
+ shah = shared_entry(rgt, unmap.ref);
+
+ /*
+ * We have cleared _KVM_GNTMAP_ACTIVE, so a simultaneous grant-map
+ * could update the shah and we would stomp all over it but the
+ * guest deserves it.
+ */
+ if (!(unmap.flags & GNTMAP_readonly))
+ clear_bit(_GTF_writing, (unsigned long *) &shah->flags);
+ clear_bit(_GTF_reading, (unsigned long *) &shah->flags);
+
+ op->status = GNTST_okay;
+ return 0;
+}
+
static int shim_hcall_gnttab(int op, void *p, int count)
{
int ret = -ENOSYS;
@@ -1698,6 +1761,16 @@ static int shim_hcall_gnttab(int op, void *p, int count)
ret = 0;
break;
}
+ case GNTTABOP_unmap_grant_ref: {
+ struct gnttab_unmap_grant_ref *ref = p;
+
+ for (i = 0; i < count; i++) {
+ shim_hcall_gntunmap(xen_shim, ref + i);
+ ref[i].host_addr = 0;
+ }
+ ret = 0;
+ break;
+ }
default:
pr_info("lcall-gnttab:op default=%d\n", op);
break;
--
2.11.0


2019-02-20 20:22:51

by Joao Martins

Subject: [PATCH RFC 14/39] KVM: x86/xen: handle PV IPI vcpu yield

After an IPI-many, cooperative Linux guests may yield the vcpu if any
of the IPI'd vcpus were preempted (i.e. their runstate is 'runnable').
Support SCHEDOP_yield for handling this.
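
The guest side is the stock yield hypercall; for reference, a minimal
sketch:

#include <xen/interface/sched.h>
#include <asm/xen/hypercall.h>

static void guest_yield(void)
{
	/* Trapped by KVM and routed to kvm_vcpu_on_spin() below. */
	HYPERVISOR_sched_op(SCHEDOP_yield, NULL);
}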

Signed-off-by: Joao Martins <[email protected]>
---
arch/x86/kvm/xen.c | 31 +++++++++++++++++++++++++++++++
1 file changed, 31 insertions(+)

diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
index ec40cb1de6b6..753a6d2c11cd 100644
--- a/arch/x86/kvm/xen.c
+++ b/arch/x86/kvm/xen.c
@@ -17,6 +17,7 @@
#include <xen/interface/xen.h>
#include <xen/interface/vcpu.h>
#include <xen/interface/event_channel.h>
+#include <xen/interface/sched.h>

#include "trace.h"

@@ -668,6 +669,31 @@ static int kvm_xen_hcall_set_timer_op(struct kvm_vcpu *vcpu, uint64_t timeout)
return 0;
}

+static int kvm_xen_hcall_sched_op(struct kvm_vcpu *vcpu, int cmd, u64 param)
+{
+ int ret = -ENOSYS;
+ gpa_t gpa;
+ int idx;
+
+ idx = srcu_read_lock(&vcpu->kvm->srcu);
+ gpa = kvm_mmu_gva_to_gpa_system(vcpu, param, NULL);
+ srcu_read_unlock(&vcpu->kvm->srcu, idx);
+
+ if (!gpa)
+ return -EFAULT;
+
+ switch (cmd) {
+ case SCHEDOP_yield:
+ kvm_vcpu_on_spin(vcpu, true);
+ ret = 0;
+ break;
+ default:
+ break;
+ }
+
+ return ret;
+}
+
int kvm_xen_hypercall(struct kvm_vcpu *vcpu)
{
bool longmode;
@@ -714,6 +740,11 @@ int kvm_xen_hypercall(struct kvm_vcpu *vcpu)
if (!r)
goto hcall_success;
break;
+ case __HYPERVISOR_sched_op:
+ r = kvm_xen_hcall_sched_op(vcpu, params[0], params[1]);
+ if (!r)
+ goto hcall_success;
+ break;
/* fallthrough */
default:
break;
--
2.11.0


2019-02-20 20:23:15

by Joao Martins

Subject: [PATCH RFC 03/39] KVM: x86/xen: register shared_info page


We add a new attribute, KVM_XEN_ATTR_TYPE_SHARED_INFO, set via the new
KVM_XEN_HVM_SET_ATTR ioctl, to let the hypervisor know where the
guest's shared info page is.
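
A minimal userspace sketch of the registration (the helper name is
hypothetical and error handling is elided):

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

static int set_shared_info(int vm_fd, uint64_t gfn)
{
	struct kvm_xen_hvm_attr xha = {
		.type = KVM_XEN_ATTR_TYPE_SHARED_INFO,
		.u.shared_info.gfn = gfn,
	};

	return ioctl(vm_fd, KVM_XEN_HVM_SET_ATTR, &xha);
}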

Signed-off-by: Joao Martins <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 3 +++
arch/x86/kvm/x86.c | 21 +++++++++++++++
arch/x86/kvm/xen.c | 60 +++++++++++++++++++++++++++++++++++++++++
arch/x86/kvm/xen.h | 4 +++
include/uapi/linux/kvm.h | 15 +++++++++++
5 files changed, 103 insertions(+)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 0f469ce439c0..befc0e37f162 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -843,6 +843,9 @@ struct kvm_hv {
/* Xen emulation context */
struct kvm_xen {
u64 xen_hypercall;
+
+ gfn_t shinfo_addr;
+ struct shared_info *shinfo;
};

enum kvm_irqchip_mode {
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index be8def385e3f..1eda96304180 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4793,6 +4793,26 @@ long kvm_arch_vm_ioctl(struct file *filp,
r = 0;
break;
}
+ case KVM_XEN_HVM_GET_ATTR: {
+ struct kvm_xen_hvm_attr xha;
+
+ r = -EFAULT;
+ if (copy_from_user(&xha, argp, sizeof(xha)))
+ goto out;
+ r = kvm_xen_hvm_get_attr(kvm, &xha);
+ if (!r && copy_to_user(argp, &xha, sizeof(xha)))
+ r = -EFAULT;
+ break;
+ }
+ case KVM_XEN_HVM_SET_ATTR: {
+ struct kvm_xen_hvm_attr xha;
+
+ r = -EFAULT;
+ if (copy_from_user(&xha, argp, sizeof(xha)))
+ goto out;
+ r = kvm_xen_hvm_set_attr(kvm, &xha);
+ break;
+ }
case KVM_SET_CLOCK: {
struct kvm_clock_data user_ns;
u64 now_ns;
@@ -9279,6 +9299,7 @@ void kvm_arch_destroy_vm(struct kvm *kvm)
kvm_mmu_uninit_vm(kvm);
kvm_page_track_cleanup(kvm);
kvm_hv_destroy_vm(kvm);
+ kvm_xen_destroy_vm(kvm);
}

void kvm_arch_free_memslot(struct kvm *kvm, struct kvm_memory_slot *free,
diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
index 76f0e4b812d2..4df223bd3cd7 100644
--- a/arch/x86/kvm/xen.c
+++ b/arch/x86/kvm/xen.c
@@ -11,9 +11,61 @@
#include <linux/kvm_host.h>

#include <trace/events/kvm.h>
+#include <xen/interface/xen.h>

#include "trace.h"

+static int kvm_xen_shared_info_init(struct kvm *kvm, gfn_t gfn)
+{
+ struct shared_info *shared_info;
+ struct page *page;
+
+ page = gfn_to_page(kvm, gfn);
+ if (is_error_page(page))
+ return -EINVAL;
+
+ kvm->arch.xen.shinfo_addr = gfn;
+
+ shared_info = page_to_virt(page);
+ memset(shared_info, 0, sizeof(struct shared_info));
+ kvm->arch.xen.shinfo = shared_info;
+ return 0;
+}
+
+int kvm_xen_hvm_set_attr(struct kvm *kvm, struct kvm_xen_hvm_attr *data)
+{
+ int r = -ENOENT;
+
+ switch (data->type) {
+ case KVM_XEN_ATTR_TYPE_SHARED_INFO: {
+ gfn_t gfn = data->u.shared_info.gfn;
+
+ r = kvm_xen_shared_info_init(kvm, gfn);
+ break;
+ }
+ default:
+ break;
+ }
+
+ return r;
+}
+
+int kvm_xen_hvm_get_attr(struct kvm *kvm, struct kvm_xen_hvm_attr *data)
+{
+ int r = -ENOENT;
+
+ switch (data->type) {
+ case KVM_XEN_ATTR_TYPE_SHARED_INFO: {
+ data->u.shared_info.gfn = kvm->arch.xen.shinfo_addr;
+ r = 0;
+ break;
+ }
+ default:
+ break;
+ }
+
+ return r;
+}
+
bool kvm_xen_hypercall_enabled(struct kvm *kvm)
{
return READ_ONCE(kvm->arch.xen.xen_hypercall);
@@ -77,3 +129,11 @@ int kvm_xen_hypercall(struct kvm_vcpu *vcpu)

return 0;
}
+
+void kvm_xen_destroy_vm(struct kvm *kvm)
+{
+ struct kvm_xen *xen = &kvm->arch.xen;
+
+ if (xen->shinfo)
+ put_page(virt_to_page(xen->shinfo));
+}
diff --git a/arch/x86/kvm/xen.h b/arch/x86/kvm/xen.h
index a2ae079c3ef3..bb38edf383fe 100644
--- a/arch/x86/kvm/xen.h
+++ b/arch/x86/kvm/xen.h
@@ -3,8 +3,12 @@
#ifndef __ARCH_X86_KVM_XEN_H__
#define __ARCH_X86_KVM_XEN_H__

+int kvm_xen_hvm_set_attr(struct kvm *kvm, struct kvm_xen_hvm_attr *data);
+int kvm_xen_hvm_get_attr(struct kvm *kvm, struct kvm_xen_hvm_attr *data);
bool kvm_xen_hypercall_enabled(struct kvm *kvm);
bool kvm_xen_hypercall_set(struct kvm *kvm);
int kvm_xen_hypercall(struct kvm_vcpu *vcpu);

+void kvm_xen_destroy_vm(struct kvm *kvm);
+
#endif
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index d07520c216a1..de2168d235af 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1455,6 +1455,21 @@ struct kvm_enc_region {
/* Available with KVM_CAP_HYPERV_CPUID */
#define KVM_GET_SUPPORTED_HV_CPUID _IOWR(KVMIO, 0xc1, struct kvm_cpuid2)

+#define KVM_XEN_HVM_GET_ATTR _IOWR(KVMIO, 0xc2, struct kvm_xen_hvm_attr)
+#define KVM_XEN_HVM_SET_ATTR _IOW(KVMIO, 0xc3, struct kvm_xen_hvm_attr)
+
+struct kvm_xen_hvm_attr {
+ __u16 type;
+
+ union {
+ struct {
+ __u64 gfn;
+ } shared_info;
+ } u;
+};
+
+#define KVM_XEN_ATTR_TYPE_SHARED_INFO 0x0
+
/* Secure Encrypted Virtualization command */
enum sev_cmd_id {
/* Guest initialization commands */
--
2.11.0


2019-02-20 20:23:15

by Joao Martins

Subject: [PATCH RFC 24/39] KVM: x86/xen: backend hypercall support

From: Ankur Arora <[email protected]>

Ordinarily a Xen backend domain would do hypercalls via int 0x82 (or
vmcall) to enter a lower ring of execution. This is done via a
hypercall_page which contains call stubs corresponding to each
hypercall.

For Xen backend driver support, however, we would like to do Xen
hypercalls in the same ring. To that end we point the hypercall_page to
a kvm-owned text page which just does a local call (to
kvm_xen_host_hcall()).

Note that this is different from the hypercalls handled in
kvm_xen_hypercall(): the latter serves domU hypercalls (so there is an
actual drop in execution ring), while there is none in
kvm_xen_host_hcall().
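
With the page swapped, in-kernel backends keep using the ordinary
HYPERVISOR_* wrappers. A sketch of the XENVER_get_features path handled
below (the helper name is hypothetical):

#include <xen/features.h>
#include <xen/interface/version.h>
#include <asm/xen/hypercall.h>

static int shim_is_auto_translated(void)
{
	struct xen_feature_info fi = { .submap_idx = 0 };

	/* Stays in the host kernel: the stub calls kvm_xen_host_hcall(). */
	if (HYPERVISOR_xen_version(XENVER_get_features, &fi))
		return 0;

	return !!(fi.submap & (1U << XENFEAT_auto_translated_physmap));
}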

Signed-off-by: Ankur Arora <[email protected]>
Signed-off-by: Joao Martins <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 3 ++
arch/x86/kvm/Makefile | 2 +-
arch/x86/kvm/xen-asm.S | 66 +++++++++++++++++++++++++++++++++++++++
arch/x86/kvm/xen.c | 68 +++++++++++++++++++++++++++++++++++++++++
arch/x86/kvm/xen.h | 4 +++
5 files changed, 142 insertions(+), 1 deletion(-)
create mode 100644 arch/x86/kvm/xen-asm.S

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 70bb7339ddd4..55609e919e14 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1669,4 +1669,7 @@ static inline int kvm_cpu_get_apicid(int mps_cpu)
#define put_smstate(type, buf, offset, val) \
*(type *)((buf) + (offset) - 0x7e00) = val

+void kvm_xen_register_lcall(struct kvm_xen *shim);
+void kvm_xen_unregister_lcall(void);
+
#endif /* _ASM_X86_KVM_HOST_H */
diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
index 2b46c93c9380..c1eaabbd0a54 100644
--- a/arch/x86/kvm/Makefile
+++ b/arch/x86/kvm/Makefile
@@ -10,7 +10,7 @@ kvm-$(CONFIG_KVM_ASYNC_PF) += $(KVM)/async_pf.o

kvm-y += x86.o mmu.o emulate.o i8259.o irq.o lapic.o \
i8254.o ioapic.o irq_comm.o cpuid.o pmu.o mtrr.o \
- hyperv.o xen.o page_track.o debugfs.o
+ hyperv.o xen-asm.o xen.o page_track.o debugfs.o

kvm-intel-y += vmx/vmx.o vmx/vmenter.o vmx/pmu_intel.o vmx/vmcs12.o vmx/evmcs.o vmx/nested.o
kvm-amd-y += svm.o pmu_amd.o
diff --git a/arch/x86/kvm/xen-asm.S b/arch/x86/kvm/xen-asm.S
new file mode 100644
index 000000000000..10559fcfbe38
--- /dev/null
+++ b/arch/x86/kvm/xen-asm.S
@@ -0,0 +1,66 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright (c) 2019 Oracle and/or its affiliates. All rights reserved. */
+#include <linux/linkage.h>
+#include <asm/asm.h>
+#include <asm/frame.h>
+#include <asm/page_types.h>
+#include <asm/unwind_hints.h>
+#include <xen/interface/xen.h>
+#include <xen/interface/xen-mca.h>
+#include <asm/xen/interface.h>
+
+ .balign PAGE_SIZE
+ENTRY(kvm_xen_hypercall_page)
+ hcall=0
+ .rept (PAGE_SIZE / 32)
+ FRAME_BEGIN
+ push %rcx /* Push call clobbered registers */
+ push %r9
+ push %r11
+ mov $hcall, %rax
+
+ call kvm_xen_host_hcall
+ pop %r11
+ pop %r9
+ pop %rcx
+
+ FRAME_END
+ ret
+ .balign 32
+ hcall = hcall + 1
+ .endr
+/*
+ * Hypercall symbols are used for unwinding the stack, so we give them names
+ * prefixed with kvm_xen_ (Xen hypercalls have symbols prefixed with xen_.)
+ */
+#define HYPERCALL(n) \
+ .equ kvm_xen_hypercall_##n, kvm_xen_hypercall_page + __HYPERVISOR_##n * 32; \
+ .type kvm_xen_hypercall_##n, @function; \
+ .size kvm_xen_hypercall_##n, 32
+#include <asm/xen-hypercalls.h>
+#undef HYPERCALL
+END(kvm_xen_hypercall_page)
+
+/*
+ * Some call stubs generated above do not have associated symbols. Generate
+ * bogus symbols for those hypercall blocks to stop objtool from complaining
+ * about unreachable code.
+ */
+.altmacro
+.macro hypercall_missing N
+ .equ kvm_xen_hypercall_missing_\N, kvm_xen_hypercall_page + \N * 32;
+ .type kvm_xen_hypercall_missing_\N, @function;
+ .size kvm_xen_hypercall_missing_\N, 32;
+.endm
+
+.macro hypercalls_missing N count=1
+ .set n,\N
+ .rept \count
+ hypercall_missing %n
+ .set n,n+1
+ .endr
+.endm
+
+hypercalls_missing 11 1
+hypercalls_missing 42 6
+hypercalls_missing 56 72
diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
index 7266d27db210..645cd22ab4e7 100644
--- a/arch/x86/kvm/xen.c
+++ b/arch/x86/kvm/xen.c
@@ -12,6 +12,7 @@
#include <linux/kvm_host.h>
#include <linux/eventfd.h>
#include <linux/sched/stat.h>
+#include <linux/linkage.h>

#include <trace/events/kvm.h>
#include <xen/interface/xen.h>
@@ -19,6 +20,10 @@
#include <xen/interface/event_channel.h>
#include <xen/interface/grant_table.h>
#include <xen/interface/sched.h>
+#include <xen/interface/version.h>
+#include <xen/xen.h>
+#include <xen/features.h>
+#include <asm/xen/hypercall.h>

#include "trace.h"

@@ -43,6 +48,7 @@ struct evtchnfd {
static int kvm_xen_evtchn_send(struct kvm_vcpu *vcpu, int port);
static void *xen_vcpu_info(struct kvm_vcpu *v);
static void kvm_xen_gnttab_free(struct kvm_xen *xen);
+static int shim_hypercall(u64 code, u64 a0, u64 a1, u64 a2, u64 a3, u64 a4);

#define XEN_DOMID_MIN 1
#define XEN_DOMID_MAX (DOMID_FIRST_RESERVED - 1)
@@ -50,6 +56,9 @@ static void kvm_xen_gnttab_free(struct kvm_xen *xen);
static rwlock_t domid_lock;
static struct idr domid_to_kvm;

+static struct hypercall_entry *hypercall_page_save;
+static struct kvm_xen *xen_shim __read_mostly;
+
static int kvm_xen_domid_init(struct kvm *kvm, bool any, domid_t domid)
{
u16 min = XEN_DOMID_MIN, max = XEN_DOMID_MAX;
@@ -1271,3 +1280,62 @@ int kvm_vm_ioctl_xen_gnttab(struct kvm *kvm, struct kvm_xen_gnttab *op)

return r;
}
+
+asmlinkage int kvm_xen_host_hcall(void)
+{
+ register unsigned long a0 asm(__HYPERCALL_RETREG);
+ register unsigned long a1 asm(__HYPERCALL_ARG1REG);
+ register unsigned long a2 asm(__HYPERCALL_ARG2REG);
+ register unsigned long a3 asm(__HYPERCALL_ARG3REG);
+ register unsigned long a4 asm(__HYPERCALL_ARG4REG);
+ register unsigned long a5 asm(__HYPERCALL_ARG5REG);
+ int ret;
+
+ preempt_disable();
+ ret = shim_hypercall(a0, a1, a2, a3, a4, a5);
+ preempt_enable();
+
+ return ret;
+}
+
+void kvm_xen_register_lcall(struct kvm_xen *shim)
+{
+ hypercall_page_save = hypercall_page;
+ hypercall_page = kvm_xen_hypercall_page;
+ xen_shim = shim;
+}
+EXPORT_SYMBOL_GPL(kvm_xen_register_lcall);
+
+void kvm_xen_unregister_lcall(void)
+{
+ hypercall_page = hypercall_page_save;
+ hypercall_page_save = NULL;
+}
+EXPORT_SYMBOL_GPL(kvm_xen_unregister_lcall);
+
+static int shim_hcall_version(int op, struct xen_feature_info *fi)
+{
+ if (op != XENVER_get_features || !fi || fi->submap_idx != 0)
+ return -EINVAL;
+
+ /*
+ * We need a limited set of features for a pseudo dom0.
+ */
+ fi->submap = (1U << XENFEAT_auto_translated_physmap);
+ return 0;
+}
+
+static int shim_hypercall(u64 code, u64 a0, u64 a1, u64 a2, u64 a3, u64 a4)
+{
+ int ret = -ENOSYS;
+
+ switch (code) {
+ case __HYPERVISOR_xen_version:
+ ret = shim_hcall_version((int)a0, (void *)a1);
+ break;
+ default:
+ break;
+ }
+
+ return ret;
+}
diff --git a/arch/x86/kvm/xen.h b/arch/x86/kvm/xen.h
index 08ad4e1259df..9fa7c3dd111a 100644
--- a/arch/x86/kvm/xen.h
+++ b/arch/x86/kvm/xen.h
@@ -3,6 +3,8 @@
#ifndef __ARCH_X86_KVM_XEN_H__
#define __ARCH_X86_KVM_XEN_H__

+#include <asm/xen/hypercall.h>
+
static inline struct kvm_vcpu_xen *vcpu_to_xen_vcpu(struct kvm_vcpu *vcpu)
{
return &vcpu->arch.xen;
@@ -48,4 +50,6 @@ int kvm_xen_has_pending_timer(struct kvm_vcpu *vcpu);
void kvm_xen_inject_timer_irqs(struct kvm_vcpu *vcpu);
bool kvm_xen_timer_enabled(struct kvm_vcpu *vcpu);

+extern struct hypercall_entry kvm_xen_hypercall_page[128];
+
#endif
--
2.11.0


2019-02-20 20:23:21

by Joao Martins

[permalink] [raw]
Subject: [PATCH RFC 07/39] KVM: x86/xen: register vcpu time info region

Allow the emulated Xen guest to register a secondary vcpu time
info region. On Xen guests this region is mapped into userspace,
which allows the vDSO gettimeofday to work.

In doing so, move the kvm_xen_setup_pvclock_page() logic into
kvm_xen_update_vcpu_time() and make the former a top-level
function which updates both the primary vcpu time info (in
struct vcpu_info) and the secondary one.
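
As an illustration, the userspace side of this registration could look
roughly like the sketch below. This is hedged: KVM_XEN_HVM_SET_ATTR is
assumed to be the ioctl this series uses to carry a struct
kvm_xen_hvm_attr, and "time_info_gpa" is a hypothetical guest address
chosen by the VMM:

    #include <sys/ioctl.h>
    #include <err.h>
    #include <linux/kvm.h>

    /* Register a secondary pvclock area for vcpu 0; the gpa must point
     * at guest memory sized for a struct pvclock_vcpu_time_info. */
    static void register_vcpu_time_info(int vm_fd, __u64 time_info_gpa)
    {
            struct kvm_xen_hvm_attr attr = {
                    .type = KVM_XEN_ATTR_TYPE_VCPU_TIME_INFO,
                    .u.vcpu_attr = {
                            .vcpu = 0,
                            .gpa = time_info_gpa,
                    },
            };

            if (ioctl(vm_fd, KVM_XEN_HVM_SET_ATTR, &attr) < 0)
                    err(1, "KVM_XEN_HVM_SET_ATTR");
    }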

Signed-off-by: Joao Martins <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 2 ++
arch/x86/kvm/xen.c | 45 ++++++++++++++++++++++++++++++-----------
include/uapi/linux/kvm.h | 1 +
3 files changed, 36 insertions(+), 12 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 96f65ba4b3c0..f39d50dd8f40 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -539,6 +539,8 @@ struct kvm_vcpu_xen {
struct kvm_xen_exit exit;
gpa_t vcpu_info_addr;
struct vcpu_info *vcpu_info;
+ gpa_t pv_time_addr;
+ struct pvclock_vcpu_time_info *pv_time;
};

struct kvm_vcpu_arch {
diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
index 36d6dd0ea4b8..77d1153386bc 100644
--- a/arch/x86/kvm/xen.c
+++ b/arch/x86/kvm/xen.c
@@ -25,6 +25,11 @@ static void set_vcpu_attr(struct kvm_vcpu *v, u16 type, gpa_t gpa, void *addr)
vcpu_xen->vcpu_info = addr;
kvm_xen_setup_pvclock_page(v);
break;
+ case KVM_XEN_ATTR_TYPE_VCPU_TIME_INFO:
+ vcpu_xen->pv_time_addr = gpa;
+ vcpu_xen->pv_time = addr;
+ kvm_xen_setup_pvclock_page(v);
+ break;
default:
break;
}
@@ -37,6 +42,8 @@ static gpa_t get_vcpu_attr(struct kvm_vcpu *v, u16 type)
switch (type) {
case KVM_XEN_ATTR_TYPE_VCPU_INFO:
return vcpu_xen->vcpu_info_addr;
+ case KVM_XEN_ATTR_TYPE_VCPU_TIME_INFO:
+ return vcpu_xen->pv_time_addr;
default:
return 0;
}
@@ -83,20 +90,10 @@ static void *xen_vcpu_info(struct kvm_vcpu *v)
return hva + offset;
}

-void kvm_xen_setup_pvclock_page(struct kvm_vcpu *v)
+static void kvm_xen_update_vcpu_time(struct kvm_vcpu *v,
+ struct pvclock_vcpu_time_info *guest_hv_clock)
{
struct kvm_vcpu_arch *vcpu = &v->arch;
- struct pvclock_vcpu_time_info *guest_hv_clock;
- void *hva = xen_vcpu_info(v);
- unsigned int offset;
-
- if (!hva)
- return;
-
- offset = offsetof(struct vcpu_info, time);
-
- guest_hv_clock = (struct pvclock_vcpu_time_info *)
- (hva + offset);

BUILD_BUG_ON(offsetof(struct pvclock_vcpu_time_info, version) != 0);

@@ -127,6 +124,26 @@ void kvm_xen_setup_pvclock_page(struct kvm_vcpu *v)
guest_hv_clock->version = vcpu->hv_clock.version;
}

+void kvm_xen_setup_pvclock_page(struct kvm_vcpu *v)
+{
+ struct kvm_vcpu_xen *vcpu_xen = vcpu_to_xen_vcpu(v);
+ struct pvclock_vcpu_time_info *guest_hv_clock;
+ void *hva = xen_vcpu_info(v);
+ unsigned int offset;
+
+ if (likely(hva)) {
+ offset = offsetof(struct vcpu_info, time);
+ guest_hv_clock = (struct pvclock_vcpu_time_info *)(hva + offset);
+ kvm_xen_update_vcpu_time(v, guest_hv_clock);
+ }
+
+ /* Update secondary pvclock region if registered */
+ if (vcpu_xen->pv_time_addr) {
+ guest_hv_clock = vcpu_xen->pv_time;
+ kvm_xen_update_vcpu_time(v, guest_hv_clock);
+ }
+}
+
int kvm_xen_hvm_set_attr(struct kvm *kvm, struct kvm_xen_hvm_attr *data)
{
int r = -ENOENT;
@@ -138,6 +155,7 @@ int kvm_xen_hvm_set_attr(struct kvm *kvm, struct kvm_xen_hvm_attr *data)
r = kvm_xen_shared_info_init(kvm, gfn);
break;
}
+ case KVM_XEN_ATTR_TYPE_VCPU_TIME_INFO:
case KVM_XEN_ATTR_TYPE_VCPU_INFO: {
gpa_t gpa = data->u.vcpu_attr.gpa;
struct kvm_vcpu *v;
@@ -173,6 +191,7 @@ int kvm_xen_hvm_get_attr(struct kvm *kvm, struct kvm_xen_hvm_attr *data)
data->u.shared_info.gfn = kvm->arch.xen.shinfo_addr;
break;
}
+ case KVM_XEN_ATTR_TYPE_VCPU_TIME_INFO:
case KVM_XEN_ATTR_TYPE_VCPU_INFO: {
struct kvm_vcpu *v;

@@ -261,6 +280,8 @@ void kvm_xen_vcpu_uninit(struct kvm_vcpu *vcpu)

if (vcpu_xen->vcpu_info)
put_page(virt_to_page(vcpu_xen->vcpu_info));
+ if (vcpu_xen->pv_time)
+ put_page(virt_to_page(vcpu_xen->pv_time));
}

void kvm_xen_destroy_vm(struct kvm *kvm)
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 782f497a0fdd..8296c3a2434f 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1474,6 +1474,7 @@ struct kvm_xen_hvm_attr {

#define KVM_XEN_ATTR_TYPE_SHARED_INFO 0x0
#define KVM_XEN_ATTR_TYPE_VCPU_INFO 0x1
+#define KVM_XEN_ATTR_TYPE_VCPU_TIME_INFO 0x2

/* Secure Encrypted Virtualization command */
enum sev_cmd_id {
--
2.11.0


2019-02-20 20:23:31

by Joao Martins

[permalink] [raw]
Subject: [PATCH RFC 11/39] KVM: x86/xen: evtchn signaling via eventfd

Userspace registers a @port with an @eventfd that is bound to a
@vcpu. This information is then used when the guest does an
EVTCHNOP_send with a port registered with the kernel.

EVTCHNOP_send short-circuiting happens by marking the event as pending
in the shared info and vcpu info pages and doing the upcall. For IPIs
and interdomain event channels, we do the upcall on the assigned vcpu.

After binding an event, the guest or host may wish to bind it
to a particular vcpu. This is usually done for unbound and
interdomain events. Update requests are handled via the
KVM_XEN_EVENTFD_UPDATE flag.

Unregistered ports are handled by the emulator.
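
As an illustration of the uapi, assigning a port from the VMM could
look like the sketch below (hedged: KVM_XEN_HVM_SET_ATTR is assumed to
be the ioctl that carries struct kvm_xen_hvm_attr, and "port"/"efd" are
hypothetical values the VMM obtained elsewhere). flags == 0 requests an
assign; passing .fd = -1 instead asks the kernel to do the upcall
itself:

    struct kvm_xen_hvm_attr attr = {
            .type = KVM_XEN_ATTR_TYPE_EVTCHN,
            .u.evtchn = {
                    .type = XEN_EVTCHN_TYPE_IPI,
                    .port = port,
                    .vcpu = 0,
                    .fd = efd,   /* eventfd to signal; -1 for in-kernel upcall */
                    .flags = 0,  /* assign; see _DEASSIGN/_UPDATE otherwise */
            },
    };

    if (ioctl(vm_fd, KVM_XEN_HVM_SET_ATTR, &attr) < 0)
            err(1, "KVM_XEN_HVM_SET_ATTR");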

Co-developed-by: Ankur Arora <[email protected]>
Signed-off-by: Joao Martins <[email protected]>
Signed-off-by: Ankur Arora <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 3 +
arch/x86/kvm/x86.c | 1 +
arch/x86/kvm/xen.c | 238 ++++++++++++++++++++++++++++++++++++++++
arch/x86/kvm/xen.h | 2 +
include/uapi/linux/kvm.h | 20 ++++
5 files changed, 264 insertions(+)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 3305173bf10b..f31fcaf8fa7c 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -859,6 +859,9 @@ struct kvm_xen {

gfn_t shinfo_addr;
struct shared_info *shinfo;
+
+ struct idr port_to_evt;
+ struct mutex xen_lock;
};

enum kvm_xen_callback_via {
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 11b9ff2bd901..76bd23113ccd 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -9181,6 +9181,7 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
INIT_DELAYED_WORK(&kvm->arch.kvmclock_update_work, kvmclock_update_fn);
INIT_DELAYED_WORK(&kvm->arch.kvmclock_sync_work, kvmclock_sync_fn);

+ kvm_xen_init_vm(kvm);
kvm_hv_init_vm(kvm);
kvm_page_track_init(kvm);
kvm_mmu_init_vm(kvm);
diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
index 99a3722146d8..1fbdfa7c4356 100644
--- a/arch/x86/kvm/xen.c
+++ b/arch/x86/kvm/xen.c
@@ -10,14 +10,28 @@
#include "ioapic.h"

#include <linux/kvm_host.h>
+#include <linux/eventfd.h>
#include <linux/sched/stat.h>

#include <trace/events/kvm.h>
#include <xen/interface/xen.h>
#include <xen/interface/vcpu.h>
+#include <xen/interface/event_channel.h>

#include "trace.h"

+struct evtchnfd {
+ struct eventfd_ctx *ctx;
+ u32 vcpu;
+ u32 port;
+ u32 type;
+ union {
+ struct {
+ u8 type;
+ } virq;
+ };
+};
+
static void *xen_vcpu_info(struct kvm_vcpu *v);

int kvm_xen_has_interrupt(struct kvm_vcpu *vcpu)
@@ -80,6 +94,13 @@ static int kvm_xen_do_upcall(struct kvm *kvm, u32 dest_vcpu,
return 0;
}

+static void kvm_xen_evtchnfd_upcall(struct kvm_vcpu *vcpu, struct evtchnfd *e)
+{
+ struct kvm_vcpu_xen *vx = vcpu_to_xen_vcpu(vcpu);
+
+ kvm_xen_do_upcall(vcpu->kvm, e->vcpu, vx->cb.via, vx->cb.vector, 0);
+}
+
int kvm_xen_set_evtchn(struct kvm_kernel_irq_routing_entry *e,
struct kvm *kvm, int irq_source_id, int level,
bool line_status)
@@ -329,6 +350,12 @@ int kvm_xen_hvm_set_attr(struct kvm *kvm, struct kvm_xen_hvm_attr *data)
r = 0;
break;
}
+ case KVM_XEN_ATTR_TYPE_EVTCHN: {
+ struct kvm_xen_eventfd xevfd = data->u.evtchn;
+
+ r = kvm_vm_ioctl_xen_eventfd(kvm, &xevfd);
+ break;
+ }
default:
break;
}
@@ -388,10 +415,96 @@ static int kvm_xen_hypercall_complete_userspace(struct kvm_vcpu *vcpu)
return kvm_skip_emulated_instruction(vcpu);
}

+static int kvm_xen_evtchn_2l_vcpu_set_pending(struct vcpu_info *v)
+{
+ return test_and_set_bit(0, (unsigned long *) &v->evtchn_upcall_pending);
+}
+
+#define BITS_PER_EVTCHN_WORD (sizeof(xen_ulong_t)*8)
+
+static int kvm_xen_evtchn_2l_set_pending(struct shared_info *shared_info,
+ struct vcpu_info *vcpu_info,
+ int p)
+{
+ if (test_and_set_bit(p, (unsigned long *) shared_info->evtchn_pending))
+ return 1;
+
+ if (!test_bit(p, (unsigned long *) shared_info->evtchn_mask) &&
+ !test_and_set_bit(p / BITS_PER_EVTCHN_WORD,
+ (unsigned long *) &vcpu_info->evtchn_pending_sel))
+ return kvm_xen_evtchn_2l_vcpu_set_pending(vcpu_info);
+
+ return 1;
+}
+
+#undef BITS_PER_EVTCHN_WORD
+
+static int kvm_xen_evtchn_set_pending(struct kvm_vcpu *svcpu,
+ struct evtchnfd *evfd)
+{
+ struct kvm_vcpu_xen *vcpu_xen;
+ struct vcpu_info *vcpu_info;
+ struct shared_info *shared_info;
+ struct kvm_vcpu *vcpu;
+
+ vcpu = kvm_get_vcpu(svcpu->kvm, evfd->vcpu);
+ if (!vcpu)
+ return -ENOENT;
+
+ vcpu_xen = vcpu_to_xen_vcpu(vcpu);
+ shared_info = (struct shared_info *) vcpu->kvm->arch.xen.shinfo;
+ vcpu_info = (struct vcpu_info *) vcpu_xen->vcpu_info;
+
+ return kvm_xen_evtchn_2l_set_pending(shared_info, vcpu_info,
+ evfd->port);
+}
+
+static int kvm_xen_evtchn_send(struct kvm_vcpu *vcpu, int port)
+{
+ struct eventfd_ctx *eventfd;
+ struct evtchnfd *evtchnfd;
+
+ /* port_to_evt is protected by vcpu->kvm->srcu */
+ evtchnfd = idr_find(&vcpu->kvm->arch.xen.port_to_evt, port);
+ if (!evtchnfd)
+ return -ENOENT;
+
+ eventfd = evtchnfd->ctx;
+ if (!kvm_xen_evtchn_set_pending(vcpu, evtchnfd)) {
+ if (!eventfd)
+ kvm_xen_evtchnfd_upcall(vcpu, evtchnfd);
+ else
+ eventfd_signal(eventfd, 1);
+ }
+
+ return 0;
+}
+
+static int kvm_xen_hcall_evtchn_send(struct kvm_vcpu *vcpu, int cmd, u64 param)
+{
+ struct evtchn_send send;
+ gpa_t gpa;
+ int idx;
+
+ /* Port management is done in userspace */
+ if (cmd != EVTCHNOP_send)
+ return -EINVAL;
+
+ idx = srcu_read_lock(&vcpu->kvm->srcu);
+ gpa = kvm_mmu_gva_to_gpa_system(vcpu, param, NULL);
+ srcu_read_unlock(&vcpu->kvm->srcu, idx);
+
+ if (!gpa || kvm_vcpu_read_guest(vcpu, gpa, &send, sizeof(send)))
+ return -EFAULT;
+
+ return kvm_xen_evtchn_send(vcpu, send.port);
+}
+
int kvm_xen_hypercall(struct kvm_vcpu *vcpu)
{
bool longmode;
u64 input, params[5];
+ int r;

input = (u64)kvm_register_read(vcpu, VCPU_REGS_RAX);

@@ -415,6 +528,19 @@ int kvm_xen_hypercall(struct kvm_vcpu *vcpu)
trace_kvm_xen_hypercall(input, params[0], params[1], params[2],
params[3], params[4]);

+ switch (input) {
+ case __HYPERVISOR_event_channel_op:
+ r = kvm_xen_hcall_evtchn_send(vcpu, params[0],
+ params[1]);
+ if (!r) {
+ kvm_xen_hypercall_set_result(vcpu, r);
+ return kvm_skip_emulated_instruction(vcpu);
+ }
+ /* fallthrough */
+ default:
+ break;
+ }
+
vcpu->run->exit_reason = KVM_EXIT_XEN;
vcpu->run->xen.type = KVM_EXIT_XEN_HCALL;
vcpu->run->xen.u.hcall.input = input;
@@ -441,6 +567,12 @@ void kvm_xen_vcpu_uninit(struct kvm_vcpu *vcpu)
put_page(virt_to_page(vcpu_xen->steal_time));
}

+void kvm_xen_init_vm(struct kvm *kvm)
+{
+ mutex_init(&kvm->arch.xen.xen_lock);
+ idr_init(&kvm->arch.xen.port_to_evt);
+}
+
void kvm_xen_destroy_vm(struct kvm *kvm)
{
struct kvm_xen *xen = &kvm->arch.xen;
@@ -448,3 +580,109 @@ void kvm_xen_destroy_vm(struct kvm *kvm)
if (xen->shinfo)
put_page(virt_to_page(xen->shinfo));
}
+
+static int kvm_xen_eventfd_update(struct kvm *kvm, struct idr *port_to_evt,
+ struct mutex *port_lock,
+ struct kvm_xen_eventfd *args)
+{
+ struct eventfd_ctx *eventfd = NULL;
+ struct evtchnfd *evtchnfd;
+
+ mutex_lock(port_lock);
+ evtchnfd = idr_find(port_to_evt, args->port);
+ mutex_unlock(port_lock);
+
+ if (!evtchnfd)
+ return -ENOENT;
+
+ if (args->fd != -1) {
+ eventfd = eventfd_ctx_fdget(args->fd);
+ if (IS_ERR(eventfd))
+ return PTR_ERR(eventfd);
+ }
+
+ evtchnfd->vcpu = args->vcpu;
+ return 0;
+}
+
+static int kvm_xen_eventfd_assign(struct kvm *kvm, struct idr *port_to_evt,
+ struct mutex *port_lock,
+ struct kvm_xen_eventfd *args)
+{
+ struct eventfd_ctx *eventfd = NULL;
+ struct evtchnfd *evtchnfd;
+ u32 port = args->port;
+ int ret;
+
+ if (args->fd != -1) {
+ eventfd = eventfd_ctx_fdget(args->fd);
+ if (IS_ERR(eventfd))
+ return PTR_ERR(eventfd);
+ }
+
+ evtchnfd = kzalloc(sizeof(struct evtchnfd), GFP_KERNEL);
+ if (!evtchnfd)
+ return -ENOMEM;
+
+ evtchnfd->ctx = eventfd;
+ evtchnfd->port = port;
+ evtchnfd->vcpu = args->vcpu;
+ evtchnfd->type = args->type;
+ if (evtchnfd->type == XEN_EVTCHN_TYPE_VIRQ)
+ evtchnfd->virq.type = args->virq.type;
+
+ mutex_lock(port_lock);
+ ret = idr_alloc(port_to_evt, evtchnfd, port, port + 1,
+ GFP_KERNEL);
+ mutex_unlock(port_lock);
+
+ if (ret >= 0)
+ return 0;
+
+ if (ret == -ENOSPC)
+ ret = -EEXIST;
+
+ if (eventfd)
+ eventfd_ctx_put(eventfd);
+ kfree(evtchnfd);
+ return ret;
+}
+
+static int kvm_xen_eventfd_deassign(struct kvm *kvm, struct idr *port_to_evt,
+ struct mutex *port_lock, u32 port)
+{
+ struct evtchnfd *evtchnfd;
+
+ mutex_lock(port_lock);
+ evtchnfd = idr_remove(port_to_evt, port);
+ mutex_unlock(port_lock);
+
+ if (!evtchnfd)
+ return -ENOENT;
+
+ if (kvm)
+ synchronize_srcu(&kvm->srcu);
+ if (evtchnfd->ctx)
+ eventfd_ctx_put(evtchnfd->ctx);
+ kfree(evtchnfd);
+ return 0;
+}
+
+int kvm_vm_ioctl_xen_eventfd(struct kvm *kvm, struct kvm_xen_eventfd *args)
+{
+ struct kvm_xen *xen = &kvm->arch.xen;
+ int allowed_flags = (KVM_XEN_EVENTFD_DEASSIGN | KVM_XEN_EVENTFD_UPDATE);
+
+ if ((args->flags & (~allowed_flags)) ||
+ (args->port <= 0))
+ return -EINVAL;
+
+ if (args->flags == KVM_XEN_EVENTFD_DEASSIGN)
+ return kvm_xen_eventfd_deassign(kvm, &xen->port_to_evt,
+ &xen->xen_lock, args->port);
+ if (args->flags == KVM_XEN_EVENTFD_UPDATE)
+ return kvm_xen_eventfd_update(kvm, &xen->port_to_evt,
+ &xen->xen_lock, args);
+ return kvm_xen_eventfd_assign(kvm, &xen->port_to_evt,
+ &xen->xen_lock, args);
+}
diff --git a/arch/x86/kvm/xen.h b/arch/x86/kvm/xen.h
index 6a42e134924a..8f26625564c8 100644
--- a/arch/x86/kvm/xen.h
+++ b/arch/x86/kvm/xen.h
@@ -34,7 +34,9 @@ int kvm_xen_set_evtchn(struct kvm_kernel_irq_routing_entry *e,
int kvm_xen_setup_evtchn(struct kvm *kvm,
struct kvm_kernel_irq_routing_entry *e);

+void kvm_xen_init_vm(struct kvm *kvm);
void kvm_xen_destroy_vm(struct kvm *kvm);
+int kvm_vm_ioctl_xen_eventfd(struct kvm *kvm, struct kvm_xen_eventfd *args);
void kvm_xen_vcpu_uninit(struct kvm_vcpu *vcpu);

#endif
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 49001f681cd1..4eae47a0ef63 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1479,6 +1479,25 @@ struct kvm_xen_hvm_attr {
__u32 vcpu;
__u64 gpa;
} vcpu_attr;
+ struct kvm_xen_eventfd {
+
+#define XEN_EVTCHN_TYPE_VIRQ 0
+#define XEN_EVTCHN_TYPE_IPI 1
+ __u32 type;
+ __u32 port;
+ __u32 vcpu;
+ __s32 fd;
+
+#define KVM_XEN_EVENTFD_DEASSIGN (1 << 0)
+#define KVM_XEN_EVENTFD_UPDATE (1 << 1)
+ __u32 flags;
+ union {
+ struct {
+ __u8 type;
+ } virq;
+ __u32 padding[2];
+ };
+ } evtchn;
} u;
};

@@ -1487,6 +1506,7 @@ struct kvm_xen_hvm_attr {
#define KVM_XEN_ATTR_TYPE_VCPU_INFO 0x1
#define KVM_XEN_ATTR_TYPE_VCPU_TIME_INFO 0x2
#define KVM_XEN_ATTR_TYPE_VCPU_RUNSTATE 0x3
+#define KVM_XEN_ATTR_TYPE_EVTCHN 0x4

/* Secure Encrypted Virtualization command */
enum sev_cmd_id {
--
2.11.0


2019-02-20 20:23:36

by Joao Martins

[permalink] [raw]
Subject: [PATCH RFC 13/39] KVM: x86/xen: handle PV timers oneshot mode

If the guest has offloaded the timer virq, handle the following
hypercalls for programming the timer:

VCPUOP_set_singleshot_timer
VCPUOP_stop_singleshot_timer
set_timer_op(timestamp_ns)

The event channel corresponding to the timer virq is then used to inject
events once timer deadlines are met. For now we back the PV timer with
an hrtimer.
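
From the guest's point of view these are the standard Xen calls; a
minimal guest-side sketch for context ("now_ns" would be read from the
pvclock, and "cpu" is this vcpu's id):

    /* Arm a one-shot timer 1ms from now via the vcpu op... */
    struct vcpu_set_singleshot_timer single = {
            .timeout_abs_ns = now_ns + NSEC_PER_MSEC,
            .flags = 0,
    };
    HYPERVISOR_vcpu_op(VCPUOP_set_singleshot_timer, cpu, &single);

    /* ...or via the shorter form; both paths are short-circuited in
     * kvm_xen_hypercall() once the timer virq is offloaded. */
    HYPERVISOR_set_timer_op(now_ns + NSEC_PER_MSEC);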

Signed-off-by: Joao Martins <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 2 +
arch/x86/kvm/irq.c | 11 ++-
arch/x86/kvm/x86.c | 4 +
arch/x86/kvm/xen.c | 185 +++++++++++++++++++++++++++++++++++++++-
arch/x86/kvm/xen.h | 6 ++
5 files changed, 202 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 92b76127eb43..7fcc81dbb688 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -552,6 +552,8 @@ struct kvm_vcpu_xen {
struct kvm_xen_callback cb;
#define KVM_XEN_NR_VIRQS 24
unsigned int virq_to_port[KVM_XEN_NR_VIRQS];
+ struct hrtimer timer;
+ atomic_t timer_pending;
};

struct kvm_vcpu_arch {
diff --git a/arch/x86/kvm/irq.c b/arch/x86/kvm/irq.c
index cdb1dbfcc9b1..936c31ae019a 100644
--- a/arch/x86/kvm/irq.c
+++ b/arch/x86/kvm/irq.c
@@ -34,10 +34,14 @@
*/
int kvm_cpu_has_pending_timer(struct kvm_vcpu *vcpu)
{
+ int r = 0;
+
if (lapic_in_kernel(vcpu))
- return apic_has_pending_timer(vcpu);
+ r = apic_has_pending_timer(vcpu);
+ if (kvm_xen_timer_enabled(vcpu))
+ r += kvm_xen_has_pending_timer(vcpu);

- return 0;
+ return r;
}
EXPORT_SYMBOL(kvm_cpu_has_pending_timer);

@@ -172,6 +176,8 @@ void kvm_inject_pending_timer_irqs(struct kvm_vcpu *vcpu)
{
if (lapic_in_kernel(vcpu))
kvm_inject_apic_timer_irqs(vcpu);
+ if (kvm_xen_timer_enabled(vcpu))
+ kvm_xen_inject_timer_irqs(vcpu);
}
EXPORT_SYMBOL_GPL(kvm_inject_pending_timer_irqs);

@@ -179,4 +185,5 @@ void __kvm_migrate_timers(struct kvm_vcpu *vcpu)
{
__kvm_migrate_apic_timer(vcpu);
__kvm_migrate_pit_timer(vcpu);
+ __kvm_migrate_xen_timer(vcpu);
}
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 76bd23113ccd..e29cefd2dc6a 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -9115,6 +9115,7 @@ int kvm_arch_vcpu_init(struct kvm_vcpu *vcpu)
vcpu->arch.preempted_in_kernel = false;

kvm_hv_vcpu_init(vcpu);
+ kvm_xen_vcpu_init(vcpu);

return 0;

@@ -9566,6 +9567,9 @@ static inline bool kvm_vcpu_has_events(struct kvm_vcpu *vcpu)
if (kvm_hv_has_stimer_pending(vcpu))
return true;

+ if (kvm_xen_has_pending_timer(vcpu))
+ return true;
+
return false;
}

diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
index 42c1fe01600d..ec40cb1de6b6 100644
--- a/arch/x86/kvm/xen.c
+++ b/arch/x86/kvm/xen.c
@@ -32,6 +32,7 @@ struct evtchnfd {
};
};

+static int kvm_xen_evtchn_send(struct kvm_vcpu *vcpu, int port);
static void *xen_vcpu_info(struct kvm_vcpu *v);

int kvm_xen_has_interrupt(struct kvm_vcpu *vcpu)
@@ -101,6 +102,91 @@ static void kvm_xen_evtchnfd_upcall(struct kvm_vcpu *vcpu, struct evtchnfd *e)
kvm_xen_do_upcall(vcpu->kvm, e->vcpu, vx->cb.via, vx->cb.vector, 0);
}

+int kvm_xen_has_pending_timer(struct kvm_vcpu *vcpu)
+{
+ struct kvm_vcpu_xen *vcpu_xen = vcpu_to_xen_vcpu(vcpu);
+
+ if (kvm_xen_hypercall_enabled(vcpu->kvm) && kvm_xen_timer_enabled(vcpu))
+ return atomic_read(&vcpu_xen->timer_pending);
+
+ return 0;
+}
+
+void kvm_xen_inject_timer_irqs(struct kvm_vcpu *vcpu)
+{
+ struct kvm_vcpu_xen *vcpu_xen = vcpu_to_xen_vcpu(vcpu);
+
+ if (atomic_read(&vcpu_xen->timer_pending) > 0) {
+ kvm_xen_evtchn_send(vcpu, vcpu_xen->virq_to_port[VIRQ_TIMER]);
+
+ atomic_set(&vcpu_xen->timer_pending, 0);
+ }
+}
+
+static enum hrtimer_restart xen_timer_callback(struct hrtimer *timer)
+{
+ struct kvm_vcpu_xen *vcpu_xen =
+ container_of(timer, struct kvm_vcpu_xen, timer);
+ struct kvm_vcpu *vcpu = xen_vcpu_to_vcpu(vcpu_xen);
+ struct swait_queue_head *wq = &vcpu->wq;
+
+ if (atomic_read(&vcpu_xen->timer_pending))
+ return HRTIMER_NORESTART;
+
+ atomic_inc(&vcpu_xen->timer_pending);
+ kvm_set_pending_timer(vcpu);
+
+ if (swait_active(wq))
+ swake_up_one(wq);
+
+ return HRTIMER_NORESTART;
+}
+
+void __kvm_migrate_xen_timer(struct kvm_vcpu *vcpu)
+{
+ struct hrtimer *timer;
+
+ if (!kvm_xen_timer_enabled(vcpu))
+ return;
+
+ timer = &vcpu->arch.xen.timer;
+ if (hrtimer_cancel(timer))
+ hrtimer_start_expires(timer, HRTIMER_MODE_ABS_PINNED);
+}
+
+static void kvm_xen_start_timer(struct kvm_vcpu *vcpu, u64 delta_ns)
+{
+ struct kvm_vcpu_xen *vcpu_xen = vcpu_to_xen_vcpu(vcpu);
+ struct hrtimer *timer = &vcpu_xen->timer;
+ ktime_t ktime_now;
+
+ atomic_set(&vcpu_xen->timer_pending, 0);
+ ktime_now = ktime_get();
+ hrtimer_start(timer, ktime_add_ns(ktime_now, delta_ns),
+ HRTIMER_MODE_ABS_PINNED);
+}
+
+static void kvm_xen_stop_timer(struct kvm_vcpu *vcpu)
+{
+ struct kvm_vcpu_xen *vcpu_xen = vcpu_to_xen_vcpu(vcpu);
+
+ hrtimer_cancel(&vcpu_xen->timer);
+}
+
+void kvm_xen_init_timer(struct kvm_vcpu *vcpu)
+{
+ struct kvm_vcpu_xen *vcpu_xen = vcpu_to_xen_vcpu(vcpu);
+
+ hrtimer_init(&vcpu_xen->timer, CLOCK_MONOTONIC,
+ HRTIMER_MODE_ABS_PINNED);
+ vcpu_xen->timer.function = xen_timer_callback;
+}
+
+bool kvm_xen_timer_enabled(struct kvm_vcpu *vcpu)
+{
+ return !!vcpu->arch.xen.virq_to_port[VIRQ_TIMER];
+}
+
void kvm_xen_set_virq(struct kvm *kvm, struct evtchnfd *evt)
{
int virq = evt->virq.type;
@@ -111,6 +197,9 @@ void kvm_xen_set_virq(struct kvm *kvm, struct evtchnfd *evt)
if (!vcpu)
return;

+ if (virq == VIRQ_TIMER)
+ kvm_xen_init_timer(vcpu);
+
vcpu_xen = vcpu_to_xen_vcpu(vcpu);
vcpu_xen->virq_to_port[virq] = evt->port;
}
@@ -514,6 +603,71 @@ static int kvm_xen_hcall_evtchn_send(struct kvm_vcpu *vcpu, int cmd, u64 param)
return kvm_xen_evtchn_send(vcpu, send.port);
}

+static int kvm_xen_hcall_vcpu_op(struct kvm_vcpu *vcpu, int cmd, int vcpu_id,
+ u64 param)
+{
+ struct vcpu_set_singleshot_timer oneshot;
+ int ret = -EINVAL;
+ long delta;
+ gpa_t gpa;
+ int idx;
+
+ /* Only process timer ops with commands 6 to 9 */
+ if (cmd < VCPUOP_set_periodic_timer ||
+ cmd > VCPUOP_stop_singleshot_timer)
+ return ret;
+
+ if (!kvm_xen_timer_enabled(vcpu))
+ return ret;
+
+ idx = srcu_read_lock(&vcpu->kvm->srcu);
+ gpa = kvm_mmu_gva_to_gpa_system(vcpu, param, NULL);
+ srcu_read_unlock(&vcpu->kvm->srcu, idx);
+
+ if (!gpa)
+ return ret;
+
+ switch (cmd) {
+ case VCPUOP_set_singleshot_timer:
+ if (kvm_vcpu_read_guest(vcpu, gpa, &oneshot,
+ sizeof(oneshot)))
+ return -EFAULT;
+
+ delta = oneshot.timeout_abs_ns - get_kvmclock_ns(vcpu->kvm);
+ kvm_xen_start_timer(vcpu, delta);
+ ret = 0;
+ break;
+ case VCPUOP_stop_singleshot_timer:
+ kvm_xen_stop_timer(vcpu);
+ ret = 0;
+ break;
+ default:
+ break;
+ }
+
+ return ret;
+}
+
+static int kvm_xen_hcall_set_timer_op(struct kvm_vcpu *vcpu, uint64_t timeout)
+{
+ ktime_t ktime_now = ktime_get();
+ long delta = timeout - get_kvmclock_ns(vcpu->kvm);
+
+ if (!kvm_xen_timer_enabled(vcpu))
+ return -EINVAL;
+
+ if (timeout == 0) {
+ kvm_xen_stop_timer(vcpu);
+ } else if (unlikely(timeout < ktime_now) ||
+ ((uint32_t) (delta >> 50) != 0)) {
+ kvm_xen_start_timer(vcpu, 50000000);
+ } else {
+ kvm_xen_start_timer(vcpu, delta);
+ }
+
+ return 0;
+}
+
int kvm_xen_hypercall(struct kvm_vcpu *vcpu)
{
bool longmode;
@@ -546,10 +700,20 @@ int kvm_xen_hypercall(struct kvm_vcpu *vcpu)
case __HYPERVISOR_event_channel_op:
r = kvm_xen_hcall_evtchn_send(vcpu, params[0],
params[1]);
- if (!r) {
- kvm_xen_hypercall_set_result(vcpu, r);
- return kvm_skip_emulated_instruction(vcpu);
- }
+ if (!r)
+ goto hcall_success;
+ break;
+ case __HYPERVISOR_vcpu_op:
+ r = kvm_xen_hcall_vcpu_op(vcpu, params[0], params[1],
+ params[2]);
+ if (!r)
+ goto hcall_success;
+ break;
+ case __HYPERVISOR_set_timer_op:
+ r = kvm_xen_hcall_set_timer_op(vcpu, params[0]);
+ if (!r)
+ goto hcall_success;
+ break;
/* fallthrough */
default:
break;
@@ -567,6 +731,14 @@ int kvm_xen_hypercall(struct kvm_vcpu *vcpu)
kvm_xen_hypercall_complete_userspace;

return 0;
+
+hcall_success:
+ kvm_xen_hypercall_set_result(vcpu, r);
+ return kvm_skip_emulated_instruction(vcpu);
+}
+
+void kvm_xen_vcpu_init(struct kvm_vcpu *vcpu)
+{
}

void kvm_xen_vcpu_uninit(struct kvm_vcpu *vcpu)
@@ -579,6 +751,11 @@ void kvm_xen_vcpu_uninit(struct kvm_vcpu *vcpu)
put_page(virt_to_page(vcpu_xen->pv_time));
if (vcpu_xen->steal_time)
put_page(virt_to_page(vcpu_xen->steal_time));
+
+ if (!kvm_xen_timer_enabled(vcpu))
+ return;
+
+ kvm_xen_stop_timer(vcpu);
}

void kvm_xen_init_vm(struct kvm *kvm)
diff --git a/arch/x86/kvm/xen.h b/arch/x86/kvm/xen.h
index 8f26625564c8..f82b8b5b3345 100644
--- a/arch/x86/kvm/xen.h
+++ b/arch/x86/kvm/xen.h
@@ -37,6 +37,12 @@ int kvm_xen_setup_evtchn(struct kvm *kvm,
void kvm_xen_init_vm(struct kvm *kvm);
void kvm_xen_destroy_vm(struct kvm *kvm);
int kvm_vm_ioctl_xen_eventfd(struct kvm *kvm, struct kvm_xen_eventfd *args);
+void kvm_xen_vcpu_init(struct kvm_vcpu *vcpu);
void kvm_xen_vcpu_uninit(struct kvm_vcpu *vcpu);

+void __kvm_migrate_xen_timer(struct kvm_vcpu *vcpu);
+int kvm_xen_has_pending_timer(struct kvm_vcpu *vcpu);
+void kvm_xen_inject_timer_irqs(struct kvm_vcpu *vcpu);
+bool kvm_xen_timer_enabled(struct kvm_vcpu *vcpu);
+
#endif
--
2.11.0


2019-02-20 20:23:45

by Joao Martins

[permalink] [raw]
Subject: [PATCH RFC 06/39] KVM: x86/xen: register vcpu info


The vcpu_info structure supersedes the per-vcpu area of the shared_info
page, and guest vcpus will use it instead.
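
For reference, this mirrors what a guest does with the standard Xen
ABI, where each vcpu relocates its vcpu_info out of the shared_info
page; a guest-side sketch ("vcpu_info_page" and "cpu" being
placeholders for a per-cpu allocation and the vcpu id), which the VMM
can then mirror through KVM_XEN_ATTR_TYPE_VCPU_INFO:

    struct vcpu_register_vcpu_info info = {
            .mfn = virt_to_gfn(vcpu_info_page),
            .offset = offset_in_page(vcpu_info_page),
    };

    /* On failure the vcpu keeps using its shared_info slot, which
     * only exists for vcpu_id < MAX_VIRT_CPUS. */
    if (HYPERVISOR_vcpu_op(VCPUOP_register_vcpu_info, cpu, &info))
            pr_warn("vcpu_info registration failed\n");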

Signed-off-by: Joao Martins <[email protected]>
Signed-off-by: Ankur Arora <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 2 +
arch/x86/kvm/x86.c | 1 +
arch/x86/kvm/xen.c | 93 ++++++++++++++++++++++++++++++++++++++---
arch/x86/kvm/xen.h | 14 +++++++
include/uapi/linux/kvm.h | 5 +++
5 files changed, 110 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index befc0e37f162..96f65ba4b3c0 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -537,6 +537,8 @@ struct kvm_vcpu_hv {
/* Xen per vcpu emulation context */
struct kvm_vcpu_xen {
struct kvm_xen_exit exit;
+ gpa_t vcpu_info_addr;
+ struct vcpu_info *vcpu_info;
};

struct kvm_vcpu_arch {
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 31a102b22042..3ce97860e6ee 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -9124,6 +9124,7 @@ void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu)
int idx;

kvm_hv_vcpu_uninit(vcpu);
+ kvm_xen_vcpu_uninit(vcpu);
kvm_pmu_destroy(vcpu);
kfree(vcpu->arch.mce_banks);
kvm_free_lapic(vcpu);
diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
index 879bcfdd7b1d..36d6dd0ea4b8 100644
--- a/arch/x86/kvm/xen.c
+++ b/arch/x86/kvm/xen.c
@@ -15,6 +15,33 @@

#include "trace.h"

+static void set_vcpu_attr(struct kvm_vcpu *v, u16 type, gpa_t gpa, void *addr)
+{
+ struct kvm_vcpu_xen *vcpu_xen = vcpu_to_xen_vcpu(v);
+
+ switch (type) {
+ case KVM_XEN_ATTR_TYPE_VCPU_INFO:
+ vcpu_xen->vcpu_info_addr = gpa;
+ vcpu_xen->vcpu_info = addr;
+ kvm_xen_setup_pvclock_page(v);
+ break;
+ default:
+ break;
+ }
+}
+
+static gpa_t get_vcpu_attr(struct kvm_vcpu *v, u16 type)
+{
+ struct kvm_vcpu_xen *vcpu_xen = vcpu_to_xen_vcpu(v);
+
+ switch (type) {
+ case KVM_XEN_ATTR_TYPE_VCPU_INFO:
+ return vcpu_xen->vcpu_info_addr;
+ default:
+ return 0;
+ }
+}
+
static int kvm_xen_shared_info_init(struct kvm *kvm, gfn_t gfn)
{
struct shared_info *shared_info;
@@ -37,26 +64,44 @@ static int kvm_xen_shared_info_init(struct kvm *kvm, gfn_t gfn)
return 0;
}

+static void *xen_vcpu_info(struct kvm_vcpu *v)
+{
+ struct kvm_vcpu_xen *vcpu_xen = vcpu_to_xen_vcpu(v);
+ struct kvm_xen *kvm = &v->kvm->arch.xen;
+ unsigned int offset = 0;
+ void *hva = NULL;
+
+ if (vcpu_xen->vcpu_info_addr)
+ return vcpu_xen->vcpu_info;
+
+ if (kvm->shinfo_addr && v->vcpu_id < MAX_VIRT_CPUS) {
+ hva = kvm->shinfo;
+ offset += offsetof(struct shared_info, vcpu_info);
+ offset += v->vcpu_id * sizeof(struct vcpu_info);
+ }
+
+ return hva + offset;
+}
+
void kvm_xen_setup_pvclock_page(struct kvm_vcpu *v)
{
struct kvm_vcpu_arch *vcpu = &v->arch;
struct pvclock_vcpu_time_info *guest_hv_clock;
+ void *hva = xen_vcpu_info(v);
unsigned int offset;

- if (v->vcpu_id >= MAX_VIRT_CPUS)
+ if (!hva)
return;

offset = offsetof(struct vcpu_info, time);
- offset += offsetof(struct shared_info, vcpu_info);
- offset += v->vcpu_id * sizeof(struct vcpu_info);

guest_hv_clock = (struct pvclock_vcpu_time_info *)
- (((void *)v->kvm->arch.xen.shinfo) + offset);
+ (hva + offset);

BUILD_BUG_ON(offsetof(struct pvclock_vcpu_time_info, version) != 0);

if (guest_hv_clock->version & 1)
- ++guest_hv_clock->version; /* first time write, random junk */
+ ++guest_hv_clock->version;

vcpu->hv_clock.version = guest_hv_clock->version + 1;
guest_hv_clock->version = vcpu->hv_clock.version;
@@ -93,6 +138,25 @@ int kvm_xen_hvm_set_attr(struct kvm *kvm, struct kvm_xen_hvm_attr *data)
r = kvm_xen_shared_info_init(kvm, gfn);
break;
}
+ case KVM_XEN_ATTR_TYPE_VCPU_INFO: {
+ gpa_t gpa = data->u.vcpu_attr.gpa;
+ struct kvm_vcpu *v;
+ struct page *page;
+ void *addr;
+
+ v = kvm_get_vcpu(kvm, data->u.vcpu_attr.vcpu);
+ if (!v)
+ return -EINVAL;
+
+ page = gfn_to_page(v->kvm, gpa_to_gfn(gpa));
+ if (is_error_page(page))
+ return -EFAULT;
+
+ addr = page_to_virt(page) + offset_in_page(gpa);
+ set_vcpu_attr(v, data->type, gpa, addr);
+ r = 0;
+ break;
+ }
default:
break;
}
@@ -109,6 +173,17 @@ int kvm_xen_hvm_get_attr(struct kvm *kvm, struct kvm_xen_hvm_attr *data)
data->u.shared_info.gfn = kvm->arch.xen.shinfo_addr;
break;
}
+ case KVM_XEN_ATTR_TYPE_VCPU_INFO: {
+ struct kvm_vcpu *v;
+
+ v = kvm_get_vcpu(kvm, data->u.vcpu_attr.vcpu);
+ if (!v)
+ return -EINVAL;
+
+ data->u.vcpu_attr.gpa = get_vcpu_attr(v, data->type);
+ r = 0;
+ break;
+ }
default:
break;
}
@@ -180,6 +255,14 @@ int kvm_xen_hypercall(struct kvm_vcpu *vcpu)
return 0;
}

+void kvm_xen_vcpu_uninit(struct kvm_vcpu *vcpu)
+{
+ struct kvm_vcpu_xen *vcpu_xen = vcpu_to_xen_vcpu(vcpu);
+
+ if (vcpu_xen->vcpu_info)
+ put_page(virt_to_page(vcpu_xen->vcpu_info));
+}
+
void kvm_xen_destroy_vm(struct kvm *kvm)
{
struct kvm_xen *xen = &kvm->arch.xen;
diff --git a/arch/x86/kvm/xen.h b/arch/x86/kvm/xen.h
index 827c9390da34..10ebd0b7a25e 100644
--- a/arch/x86/kvm/xen.h
+++ b/arch/x86/kvm/xen.h
@@ -3,6 +3,19 @@
#ifndef __ARCH_X86_KVM_XEN_H__
#define __ARCH_X86_KVM_XEN_H__

+static inline struct kvm_vcpu_xen *vcpu_to_xen_vcpu(struct kvm_vcpu *vcpu)
+{
+ return &vcpu->arch.xen;
+}
+
+static inline struct kvm_vcpu *xen_vcpu_to_vcpu(struct kvm_vcpu_xen *xen_vcpu)
+{
+ struct kvm_vcpu_arch *arch;
+
+ arch = container_of(xen_vcpu, struct kvm_vcpu_arch, xen);
+ return container_of(arch, struct kvm_vcpu, arch);
+}
+
void kvm_xen_setup_pvclock_page(struct kvm_vcpu *vcpu);
int kvm_xen_hvm_set_attr(struct kvm *kvm, struct kvm_xen_hvm_attr *data);
int kvm_xen_hvm_get_attr(struct kvm *kvm, struct kvm_xen_hvm_attr *data);
@@ -11,5 +24,6 @@ bool kvm_xen_hypercall_set(struct kvm *kvm);
int kvm_xen_hypercall(struct kvm_vcpu *vcpu);

void kvm_xen_destroy_vm(struct kvm *kvm);
+void kvm_xen_vcpu_uninit(struct kvm_vcpu *vcpu);

#endif
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index de2168d235af..782f497a0fdd 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1465,10 +1465,15 @@ struct kvm_xen_hvm_attr {
struct {
__u64 gfn;
} shared_info;
+ struct {
+ __u32 vcpu;
+ __u64 gpa;
+ } vcpu_attr;
} u;
};

#define KVM_XEN_ATTR_TYPE_SHARED_INFO 0x0
+#define KVM_XEN_ATTR_TYPE_VCPU_INFO 0x1

/* Secure Encrypted Virtualization command */
enum sev_cmd_id {
--
2.11.0


2019-02-20 20:24:21

by Joao Martins

[permalink] [raw]
Subject: [PATCH RFC 05/39] KVM: x86/xen: update wallclock region


On Xen, the wallclock is written into the shared_info page.

To that end, export kvm_write_wall_clock() and pass it the GPA of
the wallclock's location in order to populate the shared_info wall
clock data.
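
For context, guests consume what kvm_write_wall_clock() publishes with
the usual version-seqcount protocol; a guest-side read sketch:

    struct shared_info *s = HYPERVISOR_shared_info;
    uint32_t ver, sec, nsec;

    do {
            ver = s->wc_version;
            rmb();          /* fetch version before time data */
            sec = s->wc_sec;
            nsec = s->wc_nsec;
            rmb();          /* fetch time data before re-checking version */
    } while ((ver & 1) || ver != s->wc_version);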

Signed-off-by: Joao Martins <[email protected]>
---
arch/x86/kvm/x86.c | 2 +-
arch/x86/kvm/x86.h | 1 +
arch/x86/kvm/xen.c | 3 +++
3 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 6eb2afaa2af2..31a102b22042 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1418,7 +1418,7 @@ void kvm_set_pending_timer(struct kvm_vcpu *vcpu)
kvm_make_request(KVM_REQ_PENDING_TIMER, vcpu);
}

-static void kvm_write_wall_clock(struct kvm *kvm, gpa_t wall_clock)
+void kvm_write_wall_clock(struct kvm *kvm, gpa_t wall_clock)
{
int version;
int r;
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index 28406aa1136d..b5f2e66a4c81 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -256,6 +256,7 @@ static inline bool kvm_check_has_quirk(struct kvm *kvm, u64 quirk)
}

void kvm_set_pending_timer(struct kvm_vcpu *vcpu);
+void kvm_write_wall_clock(struct kvm *kvm, gpa_t wall_clock);
int kvm_inject_realmode_interrupt(struct kvm_vcpu *vcpu, int irq, int inc_eip);

void kvm_write_tsc(struct kvm_vcpu *vcpu, struct msr_data *msr);
diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
index b4bd1949656e..879bcfdd7b1d 100644
--- a/arch/x86/kvm/xen.c
+++ b/arch/x86/kvm/xen.c
@@ -19,6 +19,7 @@ static int kvm_xen_shared_info_init(struct kvm *kvm, gfn_t gfn)
{
struct shared_info *shared_info;
struct page *page;
+ gpa_t gpa = gfn_to_gpa(gfn);

page = gfn_to_page(kvm, gfn);
if (is_error_page(page))
@@ -30,6 +31,8 @@ static int kvm_xen_shared_info_init(struct kvm *kvm, gfn_t gfn)
memset(shared_info, 0, sizeof(struct shared_info));
kvm->arch.xen.shinfo = shared_info;

+ kvm_write_wall_clock(kvm, gpa + offsetof(struct shared_info, wc));
+
kvm_make_all_cpus_request(kvm, KVM_REQ_MASTERCLOCK_UPDATE);
return 0;
}
--
2.11.0


2019-02-20 20:24:41

by Joao Martins

[permalink] [raw]
Subject: [PATCH RFC 01/39] KVM: x86: fix Xen hypercall page msr handling

Xen usually places its MSR at 0x40000000 or 0x40000200 depending on
whether it is running in viridian mode or not. Note that this is not
guaranteed by the ABI, so it is possible for Xen to advertise the MSR
somewhere else.
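
(For reference, guests don't hardcode the address either: they read the
MSR index out of the Xen CPUID leaves. A guest-side sketch of the usual
sequence, with "base" being the detected Xen leaf base, 0x40000000 or
0x40000100 under viridian:)

    uint32_t eax, ebx, ecx, edx;

    /* Leaf base+2: EAX = number of hypercall pages,
     * EBX = the MSR used to install them. */
    cpuid(base + 2, &eax, &ebx, &ecx, &edx);

    /* Writing the PA makes the hypervisor fill in the page
     * (assuming a single page here). */
    wrmsrl(ebx, __pa(hypercall_page));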

Given the way xen_hvm_config() is handled, if the former address is
selected, this will conflict with HyperV's MSR
(HV_X64_MSR_GUEST_OS_ID) which uses the same address.

Given that the MSR location is arbitrary, move the xen_hvm_config()
handling to the top of kvm_set_msr_common() before falling through.

Signed-off-by: Joao Martins <[email protected]>
---
arch/x86/kvm/x86.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 65e4559eef2f..47360a4b0d42 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2429,6 +2429,9 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
u32 msr = msr_info->index;
u64 data = msr_info->data;

+ if (msr && (msr == vcpu->kvm->arch.xen_hvm_config.msr))
+ return xen_hvm_config(vcpu, data);
+
switch (msr) {
case MSR_AMD64_NB_CFG:
case MSR_IA32_UCODE_WRITE:
@@ -2644,8 +2647,6 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
vcpu->arch.msr_misc_features_enables = data;
break;
default:
- if (msr && (msr == vcpu->kvm->arch.xen_hvm_config.msr))
- return xen_hvm_config(vcpu, data);
if (kvm_pmu_is_valid_msr(vcpu, msr))
return kvm_pmu_set_msr(vcpu, msr_info);
if (!ignore_msrs) {
--
2.11.0


2019-02-20 21:10:49

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH RFC 00/39] x86/KVM: Xen HVM guest support

On 20/02/19 21:15, Joao Martins wrote:
> 2. PV Driver support (patches 17 - 39)
>
> We start by redirecting hypercalls from the backend to routines
> which emulate the behaviour that PV backends expect i.e. grant
> table and interdomain events. Next, we add support for late
> initialization of xenbus, followed by implementing
> frontend/backend communication mechanisms (i.e. grant tables and
> interdomain event channels). Finally, introduce xen-shim.ko,
> which will setup a limited Xen environment. This uses the added
> functionality of Xen specific shared memory (grant tables) and
> notifications (event channels).

I am a bit worried by the last patches, they seem really brittle and
prone to breakage. I don't know Xen well enough to understand if the
lack of support for GNTMAP_host_map is fixable, but if not, you have to
define a completely different hypercall.

Of course, tests are missing. You should use the
tools/testing/selftests/kvm/ framework, and ideally each patch should
come with coverage for the newly-added code.

Thanks,

Paolo

by Marek Marczykowski-Górecki

[permalink] [raw]

Subject: Re: [Xen-devel] [PATCH RFC 00/39] x86/KVM: Xen HVM guest support

On Wed, Feb 20, 2019 at 08:15:30PM +0000, Joao Martins wrote:
> 2. PV Driver support (patches 17 - 39)
>
> We start by redirecting hypercalls from the backend to routines
> which emulate the behaviour that PV backends expect i.e. grant
> table and interdomain events. Next, we add support for late
> initialization of xenbus, followed by implementing
> frontend/backend communication mechanisms (i.e. grant tables and
> interdomain event channels). Finally, introduce xen-shim.ko,
> which will setup a limited Xen environment. This uses the added
> functionality of Xen specific shared memory (grant tables) and
> notifications (event channels).

Does it mean backends could be run in another guest, similarly as on
real Xen? AFAIK virtio doesn't allow that as virtio backends need
arbitrary write access to guest memory. But grant tables provide enough
abstraction to do that safely.

--
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab
A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?



2019-02-21 00:30:55

by Ankur Arora

[permalink] [raw]
Subject: Re: [PATCH RFC 00/39] x86/KVM: Xen HVM guest support

On 2/20/19 1:09 PM, Paolo Bonzini wrote:
> On 20/02/19 21:15, Joao Martins wrote:
>> 2. PV Driver support (patches 17 - 39)
>>
>> We start by redirecting hypercalls from the backend to routines
>> which emulate the behaviour that PV backends expect i.e. grant
>> table and interdomain events. Next, we add support for late
>> initialization of xenbus, followed by implementing
>> frontend/backend communication mechanisms (i.e. grant tables and
>> interdomain event channels). Finally, introduce xen-shim.ko,
>> which will setup a limited Xen environment. This uses the added
>> functionality of Xen specific shared memory (grant tables) and
>> notifications (event channels).
>
> I am a bit worried by the last patches, they seem really brittle and
> prone to breakage. I don't know Xen well enough to understand if the
> lack of support for GNTMAP_host_map is fixable, but if not, you have to
> define a completely different hypercall.
I assume you are aware of most of this, but just in case, here's the
flow when a backend driver wants to map a grant-reference in the
host: it allocates an empty struct page (via ballooning) and does a
map_grant_ref(GNTMAP_host_map) hypercall. In response, Xen validates the
grant-reference and maps it onto the address associated with the struct
page.
After this, from the POV of the underlying network/block drivers, these
struct pages can be used as just regular pages.
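
For concreteness, the map hypercall looks roughly like this on the
backend side, using the structs from xen/interface/grant_table.h
("vaddr", "gref" and "frontend_domid" are placeholders):

    struct gnttab_map_grant_ref op = {
            .host_addr = vaddr,          /* address of the ballooned page */
            .flags = GNTMAP_host_map,
            .ref = gref,                 /* grant ref from the frontend */
            .dom = frontend_domid,
    };

    HYPERVISOR_grant_table_op(GNTTABOP_map_grant_ref, &op, 1);
    if (op.status != GNTST_okay)
            /* mapping failed */;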

To support this in a KVM environment, where AFAICS no remapping of pages
is possible, the idea was to make minimal changes to the backend drivers
such that map_grant_ref() could just return the PFN from which the
backend could derive the struct page.

To ensure that backends -- when running in this environment -- have been
modified to deal with these new semantics, our map_grant_ref()
implementation explicitly disallows the GNTMAP_host_map flag.

Now if I'm reading you right, you would prefer something more
straightforward -- perhaps similar semantics but a new flag that
makes this behaviour explicit?

>
> Of course, tests are missing. You should use the
> tools/testing/selftests/kvm/ framework, and ideally each patch should
> come with coverage for the newly-added code.
Agreed.

Thanks
Ankur

>
> Thanks,
>
> Paolo
>

2019-02-21 00:33:57

by Ankur Arora

[permalink] [raw]
Subject: Re: [Xen-devel] [PATCH RFC 00/39] x86/KVM: Xen HVM guest support



On 2/20/19 3:39 PM, Marek Marczykowski-Górecki wrote:
> On Wed, Feb 20, 2019 at 08:15:30PM +0000, Joao Martins wrote:
>> 2. PV Driver support (patches 17 - 39)
>>
>> We start by redirecting hypercalls from the backend to routines
>> which emulate the behaviour that PV backends expect i.e. grant
>> table and interdomain events. Next, we add support for late
>> initialization of xenbus, followed by implementing
>> frontend/backend communication mechanisms (i.e. grant tables and
>> interdomain event channels). Finally, introduce xen-shim.ko,
>> which will setup a limited Xen environment. This uses the added
>> functionality of Xen specific shared memory (grant tables) and
>> notifications (event channels).
>
> Does it mean backends could be run in another guest, similarly as on
> real Xen? AFAIK virtio doesn't allow that as virtio backends need
I'm afraid not. For now grant operations (map/unmap) can only be done
by backends to the local KVM instance.

Ankur

> arbitrary write access to guest memory. But grant tables provide enough
> abstraction to do that safely.

>

2019-02-21 07:58:08

by Jürgen Groß

[permalink] [raw]
Subject: Re: [Xen-devel] [PATCH RFC 00/39] x86/KVM: Xen HVM guest support

On 21/02/2019 00:39, Marek Marczykowski-Górecki wrote:
> On Wed, Feb 20, 2019 at 08:15:30PM +0000, Joao Martins wrote:
>> 2. PV Driver support (patches 17 - 39)
>>
>> We start by redirecting hypercalls from the backend to routines
>> which emulate the behaviour that PV backends expect i.e. grant
>> table and interdomain events. Next, we add support for late
>> initialization of xenbus, followed by implementing
>> frontend/backend communication mechanisms (i.e. grant tables and
>> interdomain event channels). Finally, introduce xen-shim.ko,
>> which will setup a limited Xen environment. This uses the added
>> functionality of Xen specific shared memory (grant tables) and
>> notifications (event channels).
>
> Does it mean backends could be run in another guest, similarly as on
> real Xen? AFAIK virtio doesn't allow that as virtio backends need
> arbitrary write access to guest memory. But grant tables provide enough
> abstraction to do that safely.

As long as the grant table emulation in xen-shim isn't just a wrapper to
"normal" KVM guest memory access.

I guess the xen-shim implementation doesn't support the same kind of
guest memory isolation as Xen does?


Juergen

2019-02-21 11:46:45

by Joao Martins

[permalink] [raw]
Subject: Re: [PATCH RFC 00/39] x86/KVM: Xen HVM guest support

On 2/20/19 9:09 PM, Paolo Bonzini wrote:
> On 20/02/19 21:15, Joao Martins wrote:
>> 2. PV Driver support (patches 17 - 39)
>>
>> We start by redirecting hypercalls from the backend to routines
>> which emulate the behaviour that PV backends expect i.e. grant
>> table and interdomain events. Next, we add support for late
>> initialization of xenbus, followed by implementing
>> frontend/backend communication mechanisms (i.e. grant tables and
>> interdomain event channels). Finally, introduce xen-shim.ko,
>> which will setup a limited Xen environment. This uses the added
>> functionality of Xen specific shared memory (grant tables) and
>> notifications (event channels).
>
> I am a bit worried by the last patches, they seem really brittle and
> prone to breakage. I don't know Xen well enough to understand if the
> lack of support for GNTMAP_host_map is fixable, but if not, you have to
> define a completely different hypercall.
>
I guess Ankur already answered this, so I'll just stack this on top of his comment.

The xen_shim_domain() is only meant to handle the case where the backend
has/can-have full access to guest memory [i.e. netback and blkback would work
with similar assumptions as vhost?]. For the normal case, where a backend *in a
guest* maps and unmaps other guest memory, this is not applicable and these
changes don't affect that case.

IOW, the PV backend here sits on the hypervisor, and the hypercalls aren't
actual hypercalls but rather invoking shim_hypercall(). The call chain would go
more or less like:

<netback|blkback|scsiback>
  gnttab_map_refs(map_ops, pages)
    HYPERVISOR_grant_table_op(GNTTABOP_map_grant_ref,...)
      shim_hypercall()
        shim_hcall_gntmap()

Our reasoning was that, given we are already in KVM, why map a page if the
user (i.e. the kernel PV backend) is KVM itself? The lack of GNTMAP_host_map
is how the shim determines that its user doesn't want the page mapped. Also,
there's the other issue Ankur pointed out, where PV backends always need a
struct page to reference the device inflight data.

> Of course, tests are missing.

FWIW: this was deliberate, as we wanted to get folks' impressions before
proceeding further with the work.

> You should use the
> tools/testing/selftests/kvm/ framework, and ideally each patch should
> come with coverage for the newly-added code.

Got it.

Cheers,
Joao

2019-02-21 11:57:22

by Joao Martins

[permalink] [raw]
Subject: Re: [Xen-devel] [PATCH RFC 00/39] x86/KVM: Xen HVM guest support

On 2/20/19 11:39 PM, Marek Marczykowski-Górecki wrote:
> On Wed, Feb 20, 2019 at 08:15:30PM +0000, Joao Martins wrote:
>> 2. PV Driver support (patches 17 - 39)
>>
>> We start by redirecting hypercalls from the backend to routines
>> which emulate the behaviour that PV backends expect i.e. grant
>> table and interdomain events. Next, we add support for late
>> initialization of xenbus, followed by implementing
>> frontend/backend communication mechanisms (i.e. grant tables and
>> interdomain event channels). Finally, introduce xen-shim.ko,
>> which will setup a limited Xen environment. This uses the added
>> functionality of Xen specific shared memory (grant tables) and
>> notifications (event channels).
>
> Does it mean backends could be run in another guest, similarly as on
> real Xen? AFAIK virtio doesn't allow that as virtio backends need
> arbitrary write access to guest memory. But grant tables provide enough
> abstraction to do that safely.
>
In this series not yet. Here we are trying to resemble how KVM drives its kernel
PV backends (i.e. vhost virtio).

The domU grant {un,}mapping could be added as a complement. The main
difference between domU gntmap and shim gntmap is that the former
maps/unmaps the guest page on the backend. Most of it is common code, I think.

Joao

2019-02-21 12:02:53

by Joao Martins

[permalink] [raw]
Subject: Re: [Xen-devel] [PATCH RFC 00/39] x86/KVM: Xen HVM guest support

On 2/21/19 7:57 AM, Juergen Gross wrote:
> On 21/02/2019 00:39, Marek Marczykowski-Górecki wrote:
>> On Wed, Feb 20, 2019 at 08:15:30PM +0000, Joao Martins wrote:
>>> 2. PV Driver support (patches 17 - 39)
>>>
>>> We start by redirecting hypercalls from the backend to routines
>>> which emulate the behaviour that PV backends expect i.e. grant
>>> table and interdomain events. Next, we add support for late
>>> initialization of xenbus, followed by implementing
>>> frontend/backend communication mechanisms (i.e. grant tables and
>>> interdomain event channels). Finally, introduce xen-shim.ko,
>>> which will setup a limited Xen environment. This uses the added
>>> functionality of Xen specific shared memory (grant tables) and
>>> notifications (event channels).
>>
>> Does it mean backends could be run in another guest, similarly as on
>> real Xen? AFAIK virtio doesn't allow that as virtio backends need
>> arbitrary write access to guest memory. But grant tables provide enough
>> abstraction to do that safely.
>
> As long as the grant table emulation in xen-shim isn't just a wrapper to
> "normal" KVM guest memory access.
>
> I guess the xen-shim implementation doesn't support the same kind of
> guest memory isolation as Xen does?
>
It doesn't, but these are also two different use cases.

The xen-shim is meant for when the PV backend lives in the hypervisor (a
similar model to KVM vhost), whereas the domU grant mapping that Marek is
asking about requires additional hypercalls handled by the guest (i.e. in
kvm_xen_hypercall). This would equate to how Xen currently performs grant
mapping/unmapping.

Joao

2019-02-21 18:30:22

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH RFC 02/39] KVM: x86/xen: intercept xen hypercalls if enabled

On Wed, Feb 20, 2019 at 08:15:32PM +0000, Joao Martins wrote:
> Add a new exit reason for emulator to handle Xen hypercalls.
> Albeit these are injected only if guest has initialized the Xen
> hypercall page - the hypercall is just a convenience but one
> that is done by pretty much all guests. Hence if the guest
> sets the hypercall page, we assume a Xen guest is going to
> be set up.
>
> Emulator will then panic with:
>
> KVM: unknown exit reason 28
> RAX=0000000000000011 RBX=ffffffff81e03e94 RCX=0000000040000000
> RDX=0000000000000000
> RSI=ffffffff81e03e70 RDI=0000000000000006 RBP=ffffffff81e03e90
> RSP=ffffffff81e03e68
> R8 =73726576206e6558 R9 =ffffffff81e03e90 R10=ffffffff81e03e94
> R11=2e362e34206e6f69
> R12=0000000040000004 R13=ffffffff81e03e8c R14=ffffffff81e03e88
> R15=0000000000000000
> RIP=ffffffff81001228 RFL=00000082 [--S----] CPL=0 II=0 A20=1 SMM=0 HLT=0
> ES =0000 0000000000000000 ffffffff 00c00000
> CS =0010 0000000000000000 ffffffff 00a09b00 DPL=0 CS64 [-RA]
> SS =0000 0000000000000000 ffffffff 00c00000
> DS =0000 0000000000000000 ffffffff 00c00000
> FS =0000 0000000000000000 ffffffff 00c00000
> GS =0000 ffffffff81f34000 ffffffff 00c00000
> LDT=0000 0000000000000000 ffffffff 00c00000
> TR =0020 0000000000000000 00000fff 00808b00 DPL=0 TSS64-busy
> GDT= ffffffff81f3c000 0000007f
> IDT= ffffffff83265000 00000fff
> CR0=80050033 CR2=ffff880001fa6ff8 CR3=0000000001fa6000 CR4=000406a0
> DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000
> DR3=0000000000000000
> DR6=00000000ffff0ff0 DR7=0000000000000400
> EFER=0000000000000d01
> Code=cc cc cc cc cc cc cc cc cc cc cc cc b8 11 00 00 00 0f 01 c1 <c3> cc
> cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc b8 12
> 00 00 00 0f
>
> Signed-off-by: Joao Martins <[email protected]>
> ---
> arch/x86/include/asm/kvm_host.h | 13 +++++++
> arch/x86/kvm/Makefile | 2 +-
> arch/x86/kvm/trace.h | 33 +++++++++++++++++
> arch/x86/kvm/x86.c | 12 +++++++
> arch/x86/kvm/xen.c | 79 +++++++++++++++++++++++++++++++++++++++++
> arch/x86/kvm/xen.h | 10 ++++++
> include/uapi/linux/kvm.h | 17 ++++++++-
> 7 files changed, 164 insertions(+), 2 deletions(-)
> create mode 100644 arch/x86/kvm/xen.c
> create mode 100644 arch/x86/kvm/xen.h

...

> diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
> index 31ecf7a76d5a..2b46c93c9380 100644
> --- a/arch/x86/kvm/Makefile
> +++ b/arch/x86/kvm/Makefile
> @@ -10,7 +10,7 @@ kvm-$(CONFIG_KVM_ASYNC_PF) += $(KVM)/async_pf.o
>
> kvm-y += x86.o mmu.o emulate.o i8259.o irq.o lapic.o \
> i8254.o ioapic.o irq_comm.o cpuid.o pmu.o mtrr.o \
> - hyperv.o page_track.o debugfs.o
> + hyperv.o xen.o page_track.o debugfs.o

Can this be wrapped in a config? Or even better, as a loadable module?
2k+ lines of code is a non-trivial amount of baggage for folks that don't
care about running Xen guests. I've only glanced through the series, so
I've no idea if the resulting code would be an abomination.

>
> kvm-intel-y += vmx/vmx.o vmx/vmenter.o vmx/pmu_intel.o vmx/vmcs12.o vmx/evmcs.o vmx/nested.o
> kvm-amd-y += svm.o pmu_amd.o

2019-02-21 20:57:12

by Joao Martins

[permalink] [raw]
Subject: Re: [PATCH RFC 02/39] KVM: x86/xen: intercept xen hypercalls if enabled

On 2/21/19 6:29 PM, Sean Christopherson wrote:
> On Wed, Feb 20, 2019 at 08:15:32PM +0000, Joao Martins wrote:
>> Add a new exit reason for emulator to handle Xen hypercalls.
>> Albeit these are injected only if guest has initialized the Xen
>> hypercall page - the hypercall is just a convenience but one
>> that is done by pretty much all guests. Hence if the guest
>> sets the hypercall page, we assume a Xen guest is going to
>> be set up.
>>
>> Emulator will then panic with:
>>
>> KVM: unknown exit reason 28
>> RAX=0000000000000011 RBX=ffffffff81e03e94 RCX=0000000040000000
>> RDX=0000000000000000
>> RSI=ffffffff81e03e70 RDI=0000000000000006 RBP=ffffffff81e03e90
>> RSP=ffffffff81e03e68
>> R8 =73726576206e6558 R9 =ffffffff81e03e90 R10=ffffffff81e03e94
>> R11=2e362e34206e6f69
>> R12=0000000040000004 R13=ffffffff81e03e8c R14=ffffffff81e03e88
>> R15=0000000000000000
>> RIP=ffffffff81001228 RFL=00000082 [--S----] CPL=0 II=0 A20=1 SMM=0 HLT=0
>> ES =0000 0000000000000000 ffffffff 00c00000
>> CS =0010 0000000000000000 ffffffff 00a09b00 DPL=0 CS64 [-RA]
>> SS =0000 0000000000000000 ffffffff 00c00000
>> DS =0000 0000000000000000 ffffffff 00c00000
>> FS =0000 0000000000000000 ffffffff 00c00000
>> GS =0000 ffffffff81f34000 ffffffff 00c00000
>> LDT=0000 0000000000000000 ffffffff 00c00000
>> TR =0020 0000000000000000 00000fff 00808b00 DPL=0 TSS64-busy
>> GDT= ffffffff81f3c000 0000007f
>> IDT= ffffffff83265000 00000fff
>> CR0=80050033 CR2=ffff880001fa6ff8 CR3=0000000001fa6000 CR4=000406a0
>> DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000
>> DR3=0000000000000000
>> DR6=00000000ffff0ff0 DR7=0000000000000400
>> EFER=0000000000000d01
>> Code=cc cc cc cc cc cc cc cc cc cc cc cc b8 11 00 00 00 0f 01 c1 <c3> cc
>> cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc b8 12
>> 00 00 00 0f
>>
>> Signed-off-by: Joao Martins <[email protected]>
>> ---
>> arch/x86/include/asm/kvm_host.h | 13 +++++++
>> arch/x86/kvm/Makefile | 2 +-
>> arch/x86/kvm/trace.h | 33 +++++++++++++++++
>> arch/x86/kvm/x86.c | 12 +++++++
>> arch/x86/kvm/xen.c | 79 +++++++++++++++++++++++++++++++++++++++++
>> arch/x86/kvm/xen.h | 10 ++++++
>> include/uapi/linux/kvm.h | 17 ++++++++-
>> 7 files changed, 164 insertions(+), 2 deletions(-)
>> create mode 100644 arch/x86/kvm/xen.c
>> create mode 100644 arch/x86/kvm/xen.h
>
> ...
>
>> diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
>> index 31ecf7a76d5a..2b46c93c9380 100644
>> --- a/arch/x86/kvm/Makefile
>> +++ b/arch/x86/kvm/Makefile
>> @@ -10,7 +10,7 @@ kvm-$(CONFIG_KVM_ASYNC_PF) += $(KVM)/async_pf.o
>>
>> kvm-y += x86.o mmu.o emulate.o i8259.o irq.o lapic.o \
>> i8254.o ioapic.o irq_comm.o cpuid.o pmu.o mtrr.o \
>> - hyperv.o page_track.o debugfs.o
>> + hyperv.o xen.o page_track.o debugfs.o
>
> Can this be wrapped in a config? Or even better, as a loadable module?

Turning that into a loadable module might be a little trickier, but I think it
is doable if that's what folks/maintainers would prefer.

The Xen addition follows the same structure as Hyper-V in kvm (what you suggest
here is probably applicable for both). i.e. there's some Xen specific routines
for vm/vcpu init/teardown, and timer handling. We would have to place some of
those functions into a struct that gets registered at modinit.

> 2k+ lines of code is a non-trivial amount of baggage for folks that don't
> care about running Xen guests. I've only glanced through the series, so
> I've no idea if the resulting code would be an abomination.
>
>>
>> kvm-intel-y += vmx/vmx.o vmx/vmenter.o vmx/pmu_intel.o vmx/vmcs12.o vmx/evmcs.o vmx/nested.o
>> kvm-amd-y += svm.o pmu_amd.o

2019-02-22 00:31:45

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH RFC 02/39] KVM: x86/xen: intercept xen hypercalls if enabled

On Thu, Feb 21, 2019 at 08:56:06PM +0000, Joao Martins wrote:
> On 2/21/19 6:29 PM, Sean Christopherson wrote:
> > On Wed, Feb 20, 2019 at 08:15:32PM +0000, Joao Martins wrote:
> >> Add a new exit reason for emulator to handle Xen hypercalls.
> >> Albeit these are injected only if guest has initialized the Xen
> >> hypercall page - the hypercall is just a convenience but one
> >> that is done by pretty much all guests. Hence if the guest
> >> sets the hypercall page, we assume a Xen guest is going to
> >> be set up.
> >>
> >> Emulator will then panic with:
> >>
> >> KVM: unknown exit reason 28
> >> RAX=0000000000000011 RBX=ffffffff81e03e94 RCX=0000000040000000
> >> RDX=0000000000000000
> >> RSI=ffffffff81e03e70 RDI=0000000000000006 RBP=ffffffff81e03e90
> >> RSP=ffffffff81e03e68
> >> R8 =73726576206e6558 R9 =ffffffff81e03e90 R10=ffffffff81e03e94
> >> R11=2e362e34206e6f69
> >> R12=0000000040000004 R13=ffffffff81e03e8c R14=ffffffff81e03e88
> >> R15=0000000000000000
> >> RIP=ffffffff81001228 RFL=00000082 [--S----] CPL=0 II=0 A20=1 SMM=0 HLT=0
> >> ES =0000 0000000000000000 ffffffff 00c00000
> >> CS =0010 0000000000000000 ffffffff 00a09b00 DPL=0 CS64 [-RA]
> >> SS =0000 0000000000000000 ffffffff 00c00000
> >> DS =0000 0000000000000000 ffffffff 00c00000
> >> FS =0000 0000000000000000 ffffffff 00c00000
> >> GS =0000 ffffffff81f34000 ffffffff 00c00000
> >> LDT=0000 0000000000000000 ffffffff 00c00000
> >> TR =0020 0000000000000000 00000fff 00808b00 DPL=0 TSS64-busy
> >> GDT= ffffffff81f3c000 0000007f
> >> IDT= ffffffff83265000 00000fff
> >> CR0=80050033 CR2=ffff880001fa6ff8 CR3=0000000001fa6000 CR4=000406a0
> >> DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000
> >> DR3=0000000000000000
> >> DR6=00000000ffff0ff0 DR7=0000000000000400
> >> EFER=0000000000000d01
> >> Code=cc cc cc cc cc cc cc cc cc cc cc cc b8 11 00 00 00 0f 01 c1 <c3> cc
> >> cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc b8 12
> >> 00 00 00 0f
> >>
> >> Signed-off-by: Joao Martins <[email protected]>
> >> ---
> >> arch/x86/include/asm/kvm_host.h | 13 +++++++
> >> arch/x86/kvm/Makefile | 2 +-
> >> arch/x86/kvm/trace.h | 33 +++++++++++++++++
> >> arch/x86/kvm/x86.c | 12 +++++++
> >> arch/x86/kvm/xen.c | 79 +++++++++++++++++++++++++++++++++++++++++
> >> arch/x86/kvm/xen.h | 10 ++++++
> >> include/uapi/linux/kvm.h | 17 ++++++++-
> >> 7 files changed, 164 insertions(+), 2 deletions(-)
> >> create mode 100644 arch/x86/kvm/xen.c
> >> create mode 100644 arch/x86/kvm/xen.h
> >
> > ...
> >
> >> diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
> >> index 31ecf7a76d5a..2b46c93c9380 100644
> >> --- a/arch/x86/kvm/Makefile
> >> +++ b/arch/x86/kvm/Makefile
> >> @@ -10,7 +10,7 @@ kvm-$(CONFIG_KVM_ASYNC_PF) += $(KVM)/async_pf.o
> >>
> >> kvm-y += x86.o mmu.o emulate.o i8259.o irq.o lapic.o \
> >> i8254.o ioapic.o irq_comm.o cpuid.o pmu.o mtrr.o \
> >> - hyperv.o page_track.o debugfs.o
> >> + hyperv.o xen.o page_track.o debugfs.o
> >
> > Can this be wrapped in a config? Or even better, as a loadable module?
>
> Turning that into a loadable module might be a little trickier, but I think it
> is doable if that's what folks/maintainers would prefer.
>
> The Xen addition follows the same structure as Hyper-V in KVM (what you suggest
> here is probably applicable to both), i.e. there are some Xen-specific routines
> for vm/vcpu init/teardown and timer handling. We would have to place some of
> those functions into a struct that gets registered at modinit.

Huh. I never really looked at the Hyper-V code; for some reason I always
assumed it was only related to running KVM on Hyper-V. I agree that this
extra hurdle only makes sense if it's also applied to Hyper-V.

2019-02-22 01:31:12

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH RFC 01/39] KVM: x86: fix Xen hypercall page msr handling

On Wed, Feb 20, 2019 at 08:15:31PM +0000, Joao Martins wrote:
Xen usually places its MSR at 0x40000000 or 0x40000200 depending on
whether it is running in viridian mode or not. Note that this is not
guaranteed by the ABI, so it is possible for Xen to advertise the MSR
someplace else.
>
> Given the way xen_hvm_config() is handled, if the former address is
> selected, this will conflict with HyperV's MSR
> (HV_X64_MSR_GUEST_OS_ID) which uses the same address.

Unconditionally servicing Hyper-V and KVM MSRs seems wrong, i.e. KVM
should only expose MSRs specific to a hypervisor if userspace has
configured CPUID to advertise support for said hypervisor. If we do the
same thing for Xen, then the common MSR code looks like:

if (kvm_advertise_kvm()) {
        if (<handle kvm msr>)
                return ...;
} else if (kvm_advertise_hyperv()) {
        if (<handle hyperv msr>)
                return ...;
} else if (kvm_advertise_xen()) {
        if (<handle xen msrs>)
                return ...;
}

<fall through to main switch statement>

Obviously assumes KVM only advertises itself as one hypervisor, and so
the ordering is arbitrary.

> Given that the MSR location is arbitrary, move the xen_hvm_config()
> handling to the top of kvm_set_msr_common() before falling through.
> Signed-off-by: Joao Martins <[email protected]>
> ---
> arch/x86/kvm/x86.c | 5 +++--
> 1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 65e4559eef2f..47360a4b0d42 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -2429,6 +2429,9 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>         u32 msr = msr_info->index;
>         u64 data = msr_info->data;
>
> +        if (msr && (msr == vcpu->kvm->arch.xen_hvm_config.msr))
> +                return xen_hvm_config(vcpu, data);
> +
>         switch (msr) {
>         case MSR_AMD64_NB_CFG:
>         case MSR_IA32_UCODE_WRITE:
> @@ -2644,8 +2647,6 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>                 vcpu->arch.msr_misc_features_enables = data;
>                 break;
>         default:
> -                if (msr && (msr == vcpu->kvm->arch.xen_hvm_config.msr))
> -                        return xen_hvm_config(vcpu, data);
>                 if (kvm_pmu_is_valid_msr(vcpu, msr))
>                         return kvm_pmu_set_msr(vcpu, msr_info);
>                 if (!ignore_msrs) {
> --
> 2.11.0
>

2019-02-22 11:48:50

by Joao Martins

[permalink] [raw]
Subject: Re: [PATCH RFC 01/39] KVM: x86: fix Xen hypercall page msr handling

On 2/22/19 1:30 AM, Sean Christopherson wrote:
> On Wed, Feb 20, 2019 at 08:15:31PM +0000, Joao Martins wrote:
>> Xen usually places its MSR at 0x40000000 or 0x40000200 depending on
>> whether it is running in viridian mode or not. Note that this is not
>> guaranteed by the ABI, so it is possible for Xen to advertise the MSR
>> someplace else.
>>
>> Given the way xen_hvm_config() is handled, if the former address is
>> selected, this will conflict with HyperV's MSR
>> (HV_X64_MSR_GUEST_OS_ID) which uses the same address.
>
> Unconditionally servicing Hyper-V and KVM MSRs seems wrong, i.e. KVM
> should only expose MSRs specific to a hypervisor if userspace has
> configured CPUID to advertise support for said hypervisor.
>
Yeah, that makes sense.

> If we do the
> same thing for Xen, then the common MSR code looks like:
>
> if (kvm_advertise_kvm()) {
>         if (<handle kvm msr>)
>                 return ...;
> } else if (kvm_advertise_hyperv()) {
>         if (<handle hyperv msr>)
>                 return ...;
> } else if (kvm_advertise_xen()) {
>         if (<handle xen msrs>)
>                 return ...;
> }
>
> <fall through to main switch statement>
>
> Obviously assumes KVM only advertises itself as one hypervisor, and so
> the ordering is arbitrary.
>
One thing to consider with the above is that there might be multiple hypervisors
advertised in the hypervisor CPUID leaves. This is used when you want to run a KVM
guest but with Hyper-V drivers, or a Hyper-V guest with KVM extensions or VirtIO.

The else part would probably need to be removed, as IIUC when multiple
hypervisors are advertised (e.g. Hyper-V in leaf 0x40000000, KVM in leaf
0x40000100) the guest is still allowed to use the features of all of them.
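
So the dispatch would need to look more like the below instead (same
pseudocode style as yours, kvm_advertise_*() still being hypothetical
helpers):

if (kvm_advertise_kvm()) {
        if (<handle kvm msr>)
                return ...;
}
if (kvm_advertise_hyperv()) {
        if (<handle hyperv msr>)
                return ...;
}
if (kvm_advertise_xen()) {
        if (<handle xen msrs>)
                return ...;
}

<fall through to main switch statement>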

2019-02-22 12:51:12

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH RFC 02/39] KVM: x86/xen: intercept xen hypercalls if enabled

On 22/02/19 01:30, Sean Christopherson wrote:
> On Thu, Feb 21, 2019 at 08:56:06PM +0000, Joao Martins wrote:
>> The Xen addition follows the same structure as Hyper-V in kvm (what you suggest
>> here is probably applicable for both). i.e. there's some Xen specific routines
>> for vm/vcpu init/teardown, and timer handling. We would have to place some of
>> those functions into a struct that gets registered at modinit.
>
> Huh. I never really looked at the Hyper-V code; for some reason I always
> assumed it was only related to running KVM on Hyper-V. I agree that this
> extra hurdle only makes sense if it's also applied to Hyper-V.

The difference is that Hyper-V support is more or less mandatory to run
recent Windows guests. Having a config symbol would be enough for me.

Paolo

2019-02-22 12:53:38

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH RFC 01/39] KVM: x86: fix Xen hypercall page msr handling

On 22/02/19 02:30, Sean Christopherson wrote:
> if (kvm_advertise_kvm()) {
>         if (<handle kvm msr>)
>                 return ...;
> } else if (kvm_advertise_hyperv()) {
>         if (<handle hyperv msr>)
>                 return ...;
> } else if (kvm_advertise_xen()) {
>         if (<handle xen msrs>)
>                 return ...;
> }
>
> <fall through to main switch statement>
>
> Obviously assumes KVM only advertises itself as one hypervisor, and so
> the ordering is arbitrary.

No, KVM can advertise as both KVM and Hyper-V. CPUID 0x40000000 is used
for Hyper-V, while 0x40000100 is used for KVM. The MSRs do not conflict.

Paolo

2019-02-22 17:00:18

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH RFC 00/39] x86/KVM: Xen HVM guest support

On 21/02/19 12:45, Joao Martins wrote:
> On 2/20/19 9:09 PM, Paolo Bonzini wrote:
>> On 20/02/19 21:15, Joao Martins wrote:
>>> 2. PV Driver support (patches 17 - 39)
>>>
>>> We start by redirecting hypercalls from the backend to routines
>>> which emulate the behaviour that PV backends expect i.e. grant
>>> table and interdomain events. Next, we add support for late
>>> initialization of xenbus, followed by implementing
>>> frontend/backend communication mechanisms (i.e. grant tables and
>>> interdomain event channels). Finally, introduce xen-shim.ko,
>>> which will setup a limited Xen environment. This uses the added
>>> functionality of Xen specific shared memory (grant tables) and
>>> notifications (event channels).
>>
>> I am a bit worried by the last patches, they seem really brittle and
>> prone to breakage. I don't know Xen well enough to understand if the
>> lack of support for GNTMAP_host_map is fixable, but if not, you have to
>> define a completely different hypercall.
>>
> I guess Ankur already answered this; so just to stack this on top of his comment.
>
> The xen_shim_domain() is only meant to handle the case where the backend
> has/can-have full access to guest memory [i.e. netback and blkback would work
> with similar assumptions as vhost?]. For the normal case, where a backend *in a
> guest* maps and unmaps other guest memory, this is not applicable and these
> changes don't affect that case.
>
> IOW, the PV backend here sits on the hypervisor, and the hypercalls aren't
> actual hypercalls but rather invoking shim_hypercall(). The call chain would go
> more or less like:
>
> <netback|blkback|scsiback>
> gnttab_map_refs(map_ops, pages)
> HYPERVISOR_grant_table_op(GNTTABOP_map_grant_ref,...)
> shim_hypercall()
> shim_hcall_gntmap()
>
> Our reasoning was that, given we are already in KVM, why map a page if the
> user (i.e. the kernel PV backend) is the kernel itself? The lack of
> GNTMAP_host_map is how the shim determines its user doesn't want to map the
> page. Also, there's another issue where PV backends always need a struct page
> to reference the device inflight data, as Ankur pointed out.

Ultimately it's up to the Xen people. It does make their API uglier,
especially the in/out change for the parameter. If you can at least
avoid that, it would alleviate my concerns quite a bit.

Paolo

2019-02-25 18:58:16

by Konrad Rzeszutek Wilk

[permalink] [raw]
Subject: Re: [PATCH RFC 20/39] xen-blkback: module_exit support

On Wed, Feb 20, 2019 at 08:15:50PM +0000, Joao Martins wrote:
>
> Implement module_exit to allow users to unload the blkback module.
> We prevent module unload whenever there are still interfaces
> allocated; in other words, we do module_get on xen_blkif_alloc() and
> module_put on xen_blkif_free().

This patch looks like it can go now in right?

>
> Signed-off-by: Joao Martins <[email protected]>
> ---
> drivers/block/xen-blkback/blkback.c | 8 ++++++++
> drivers/block/xen-blkback/common.h | 2 ++
> drivers/block/xen-blkback/xenbus.c | 14 ++++++++++++++
> 3 files changed, 24 insertions(+)
>
> diff --git a/drivers/block/xen-blkback/blkback.c b/drivers/block/xen-blkback/blkback.c
> index fd1e19f1a49f..d51d88be88e1 100644
> --- a/drivers/block/xen-blkback/blkback.c
> +++ b/drivers/block/xen-blkback/blkback.c
> @@ -1504,5 +1504,13 @@ static int __init xen_blkif_init(void)
>
> module_init(xen_blkif_init);
>
> +static void __exit xen_blkif_exit(void)
> +{
> +        xen_blkif_interface_exit();
> +        xen_blkif_xenbus_exit();
> +}
> +
> +module_exit(xen_blkif_exit);
> +
> MODULE_LICENSE("Dual BSD/GPL");
> MODULE_ALIAS("xen-backend:vbd");
> diff --git a/drivers/block/xen-blkback/common.h b/drivers/block/xen-blkback/common.h
> index 1d3002d773f7..3415c558e115 100644
> --- a/drivers/block/xen-blkback/common.h
> +++ b/drivers/block/xen-blkback/common.h
> @@ -376,8 +376,10 @@ struct phys_req {
>         blkif_sector_t sector_number;
> };
> int xen_blkif_interface_init(void);
> +void xen_blkif_interface_exit(void);
>
> int xen_blkif_xenbus_init(void);
> +void xen_blkif_xenbus_exit(void);
>
> irqreturn_t xen_blkif_be_int(int irq, void *dev_id);
> int xen_blkif_schedule(void *arg);
> diff --git a/drivers/block/xen-blkback/xenbus.c b/drivers/block/xen-blkback/xenbus.c
> index a4bc74e72c39..424e2efebe85 100644
> --- a/drivers/block/xen-blkback/xenbus.c
> +++ b/drivers/block/xen-blkback/xenbus.c
> @@ -181,6 +181,8 @@ static struct xen_blkif *xen_blkif_alloc(domid_t domid)
>         init_completion(&blkif->drain_complete);
>         INIT_WORK(&blkif->free_work, xen_blkif_deferred_free);
>
> +        __module_get(THIS_MODULE);
> +
>         return blkif;
> }
>
> @@ -328,6 +330,8 @@ static void xen_blkif_free(struct xen_blkif *blkif)
>
>         /* Make sure everything is drained before shutting down */
>         kmem_cache_free(xen_blkif_cachep, blkif);
> +
> +        module_put(THIS_MODULE);
> }
>
> int __init xen_blkif_interface_init(void)
> @@ -341,6 +345,11 @@ int __init xen_blkif_interface_init(void)
>         return 0;
> }
>
> +void xen_blkif_interface_exit(void)
> +{
> +        kmem_cache_destroy(xen_blkif_cachep);
> +}
> +
> /*
>  * sysfs interface for VBD I/O requests
>  */
> @@ -1115,3 +1124,8 @@ int xen_blkif_xenbus_init(void)
> {
>         return xenbus_register_backend(&xen_blkbk_driver);
> }
> +
> +void xen_blkif_xenbus_exit(void)
> +{
> +        xenbus_unregister_driver(&xen_blkbk_driver);
> +}
> --
> 2.11.0
>

2019-02-26 11:23:40

by Joao Martins

[permalink] [raw]
Subject: Re: [PATCH RFC 20/39] xen-blkback: module_exit support

On 2/25/19 6:57 PM, Konrad Rzeszutek Wilk wrote:
> On Wed, Feb 20, 2019 at 08:15:50PM +0000, Joao Martins wrote:
>>
>> Implement module_exit to allow users to unload the blkback module.
>> We prevent module unload whenever there are still interfaces
>> allocated; in other words, we do module_get on xen_blkif_alloc() and
>> module_put on xen_blkif_free().
>
> This patch looks like it can go now in right?
>
Yes.

Joao

2019-03-12 17:45:22

by Joao Martins

[permalink] [raw]
Subject: Re: [PATCH RFC 00/39] x86/KVM: Xen HVM guest support

On 2/22/19 4:59 PM, Paolo Bonzini wrote:
> On 21/02/19 12:45, Joao Martins wrote:
>> ...
>
> Ultimately it's up to the Xen people. It does make their API uglier,
> especially the in/out change for the parameter. If you can at least
> avoid that, it would alleviate my concerns quite a bit.

In my view, we have two options overall:

1) Make explicit the changes we have to make to the PV drivers in
order to support xen_shim_domain(). This could mean e.g. a) adding a callback
argument to gnttab_map_refs() that is invoked for every page that gets looked up
successfully, and inside this callback the PV driver may update its tracking
page. Here we no longer have this in/out parameter in gnttab_map_refs, and all
shim_domain-specific bits would be a little more abstracted from Xen PV
backends. See the netback example below the scissors mark. Or b) having a sort of
translate_gref() and put_gref() API that Xen PV drivers use, which makes it even
more explicit that there are no grant ops involved. The latter is more invasive.

2) The second option is to support guest grant mapping/unmapping [*] to allow
hosting PV backends inside the guest. This would remove the Xen changes in this
series completely. But it would require another guest to be used
as netback/blkback/xenstored, and would give less performance than 1) (though,
in theory, it would be equivalent to what Xen does with grants/events). The only
change in Linux Xen code is adding xenstored domain support, but that is useful
on its own outside the scope of this work.

I think there's value in both; 1) is probably more familiar for KVM users
(as it is similar to what vhost does?) while 2) equates to implementing
Xen disaggregation capabilities in KVM.

Thoughts? Xen maintainers what's your take on this?

Joao

[*] Interdomain events would also have to change.

---------------- >8 ----------------

It isn't much cleaner, but PV drivers avoid/hide a bunch of xen_shim_domain()
conditionals in the data path. It is more explicit while avoiding the in/out
parameter change in gnttab_map_refs.

diff --git a/drivers/net/xen-netback/common.h b/drivers/net/xen-netback/common.h
index 936c0b3e0ba2..c6e47dcb7e10 100644
--- a/drivers/net/xen-netback/common.h
+++ b/drivers/net/xen-netback/common.h
@@ -158,6 +158,7 @@ struct xenvif_queue { /* Per-queue data for xenvif */
        struct gnttab_copy tx_copy_ops[MAX_PENDING_REQS];
        struct gnttab_map_grant_ref tx_map_ops[MAX_PENDING_REQS];
        struct gnttab_unmap_grant_ref tx_unmap_ops[MAX_PENDING_REQS];
+       struct gnttab_page_changed page_cb[MAX_PENDING_REQS];
        /* passed to gnttab_[un]map_refs with pages under (un)mapping */
        struct page *pages_to_map[MAX_PENDING_REQS];
        struct page *pages_to_unmap[MAX_PENDING_REQS];
diff --git a/drivers/net/xen-netback/netback.c b/drivers/net/xen-netback/netback.c
index 80aae3a32c2a..56788d8cd813 100644
--- a/drivers/net/xen-netback/netback.c
+++ b/drivers/net/xen-netback/netback.c
@@ -324,15 +324,29 @@ struct xenvif_tx_cb {

#define XENVIF_TX_CB(skb) ((struct xenvif_tx_cb *)(skb)->cb)

+static inline void xenvif_tx_page_changed(phys_addr_t addr, void *opaque)
+{
+        struct page **page = opaque;
+
+        *page = virt_to_page(addr);
+}
+
static inline void xenvif_tx_create_map_op(struct xenvif_queue *queue,
                                           u16 pending_idx,
                                           struct xen_netif_tx_request *txp,
                                           unsigned int extra_count,
                                           struct gnttab_map_grant_ref *mop)
{
-        queue->pages_to_map[mop-queue->tx_map_ops] = queue->mmap_pages[pending_idx];
+        u32 map_idx = mop - queue->tx_map_ops;
+
+        queue->pages_to_map[map_idx] = queue->mmap_pages[pending_idx];
+        queue->page_cb[map_idx].ctx = &queue->mmap_pages[pending_idx];
+        queue->page_cb[map_idx].cb = xenvif_tx_page_changed;
+
        gnttab_set_map_op(mop, idx_to_kaddr(queue, pending_idx),
-                          GNTMAP_host_map | GNTMAP_readonly,
+                          GNTTAB_host_map | GNTMAP_readonly,
                          txp->gref, queue->vif->domid);

        memcpy(&queue->pending_tx_info[pending_idx].req, txp,
@@ -1268,7 +1283,7 @@ static inline void xenvif_tx_dealloc_action(struct xenvif_queue *queue)
                        queue->mmap_pages[pending_idx];
                gnttab_set_unmap_op(gop,
                                    idx_to_kaddr(queue, pending_idx),
-                                    GNTMAP_host_map,
+                                    GNTTAB_host_map,
                                    queue->grant_tx_handle[pending_idx]);
                xenvif_grant_handle_reset(queue, pending_idx);
                ++gop;
@@ -1322,7 +1337,7 @@ int xenvif_tx_action(struct xenvif_queue *queue, int budget)
        gnttab_batch_copy(queue->tx_copy_ops, nr_cops);
        if (nr_mops != 0) {
                ret = gnttab_map_refs(queue->tx_map_ops,
-                                      NULL,
+                                      NULL, queue->page_cb,
                                      queue->pages_to_map,
                                      nr_mops);
                BUG_ON(ret);
@@ -1394,7 +1409,7 @@ void xenvif_idx_unmap(struct xenvif_queue *queue, u16 pending_idx)

        gnttab_set_unmap_op(&tx_unmap_op,
                            idx_to_kaddr(queue, pending_idx),
-                            GNTMAP_host_map,
+                            GNTTAB_host_map,
                            queue->grant_tx_handle[pending_idx]);
        xenvif_grant_handle_reset(queue, pending_idx);

@@ -1622,7 +1637,7 @@ static int __init netback_init(void)
{
        int rc = 0;

-        if (!xen_domain())
+        if (!xen_domain() && !xen_shim_domain_get())
                return -ENODEV;

        /* Allow as many queues as there are CPUs but max. 8 if user has not
@@ -1663,6 +1678,7 @@ static void __exit netback_fini(void)
        debugfs_remove_recursive(xen_netback_dbg_root);
#endif /* CONFIG_DEBUG_FS */
        xenvif_xenbus_fini();
+        xen_shim_domain_put();
}
module_exit(netback_fini);

diff --git a/drivers/xen/grant-table.c b/drivers/xen/grant-table.c
index 7ea6fb6a2e5d..b4c9d7ff531f 100644
--- a/drivers/xen/grant-table.c
+++ b/drivers/xen/grant-table.c
@@ -1031,6 +1031,7 @@ void gnttab_foreach_grant(struct page **pages,

int gnttab_map_refs(struct gnttab_map_grant_ref *map_ops,
                    struct gnttab_map_grant_ref *kmap_ops,
+                   struct gnttab_page_changed *page_cb,
                    struct page **pages, unsigned int count)
{
        int i, ret;
@@ -1045,6 +1046,12 @@ int gnttab_map_refs(struct gnttab_map_grant_ref *map_ops,
                {
                        struct xen_page_foreign *foreign;

+                       if (xen_shim_domain() && page_cb) {
+                               page_cb[i].cb(map_ops[i].host_addr,
+                                             page_cb[i].ctx);
+                               continue;
+                       }
+
                        SetPageForeign(pages[i]);
                        foreign = xen_page_foreign(pages[i]);
                        foreign->domid = map_ops[i].dom;
diff --git a/include/xen/grant_table.h b/include/xen/grant_table.h
index 9bc5bc07d4d3..5e17fa08e779 100644
--- a/include/xen/grant_table.h
+++ b/include/xen/grant_table.h
@@ -55,6 +55,9 @@
/* NR_GRANT_FRAMES must be less than or equal to that configured in Xen */
#define NR_GRANT_FRAMES 4

+/* Selects host map only if on native Xen */
+#define GNTTAB_host_map (xen_shim_domain() ? 0 : GNTMAP_host_map)
+
struct gnttab_free_callback {
        struct gnttab_free_callback *next;
        void (*fn)(void *);
@@ -78,6 +81,12 @@ struct gntab_unmap_queue_data
        unsigned int age;
};

+struct gnttab_page_changed
+{
+        void (*cb)(phys_addr_t addr, void *opaque);
+        void *ctx;
+};
+
int gnttab_init(void);
int gnttab_suspend(void);
int gnttab_resume(void);
@@ -221,6 +230,7 @@ void gnttab_pages_clear_private(int nr_pages, struct page **pages);

int gnttab_map_refs(struct gnttab_map_grant_ref *map_ops,
                    struct gnttab_map_grant_ref *kmap_ops,
+                   struct gnttab_page_changed *cb,
                    struct page **pages, unsigned int count);
int gnttab_unmap_refs(struct gnttab_unmap_grant_ref *unmap_ops,
                      struct gnttab_unmap_grant_ref *kunmap_ops,

2019-04-08 06:44:59

by Jürgen Groß

[permalink] [raw]
Subject: Re: [PATCH RFC 00/39] x86/KVM: Xen HVM guest support

On 12/03/2019 18:14, Joao Martins wrote:
> On 2/22/19 4:59 PM, Paolo Bonzini wrote:
>> ...
>>
>> Ultimately it's up to the Xen people. It does make their API uglier,
>> especially the in/out change for the parameter. If you can at least
>> avoid that, it would alleviate my concerns quite a bit.
>
> In my view, we have two options overall:
>
> 1) Make explicit the changes we have to make to the PV drivers in
> order to support xen_shim_domain(). This could mean e.g. a) adding a callback
> argument to gnttab_map_refs() that is invoked for every page that gets looked up
> successfully, and inside this callback the PV driver may update its tracking
> page. Here we no longer have this in/out parameter in gnttab_map_refs, and all
> shim_domain-specific bits would be a little more abstracted from Xen PV
> backends. See the netback example below the scissors mark. Or b) having a sort of
> translate_gref() and put_gref() API that Xen PV drivers use, which makes it even
> more explicit that there are no grant ops involved. The latter is more invasive.
>
> 2) The second option is to support guest grant mapping/unmapping [*] to allow
> hosting PV backends inside the guest. This would remove the Xen changes in this
> series completely. But it would require another guest to be used
> as netback/blkback/xenstored, and would give less performance than 1) (though,
> in theory, it would be equivalent to what Xen does with grants/events). The only
> change in Linux Xen code is adding xenstored domain support, but that is useful
> on its own outside the scope of this work.
>
> I think there's value in both; 1) is probably more familiar for KVM users
> (as it is similar to what vhost does?) while 2) equates to implementing
> Xen disaggregation capabilities in KVM.
>
> Thoughts? Xen maintainers what's your take on this?

What I'd like best would be a new handle (e.g. xenhost_t *) used as an
abstraction layer for this kind of stuff. It should be passed to the
backends and those would pass it on to low-level Xen drivers (xenbus,
event channels, grant table, ...).

I was planning to do that (the xenhost_t * stuff) soon in order to add
support for nested Xen using PV devices (you need two Xenstores for that
as the nested dom0 is acting as Xen backend server, while using PV
frontends for accessing the "real" world outside).

The xenhost_t should be used for:

- accessing Xenstore
- issuing and receiving events
- doing hypercalls
- grant table operations

So exactly the kind of stuff you want to do, too.
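
Just to illustrate the direction (nothing of this exists yet; all names
below are made up):

typedef struct xenhost xenhost_t;

struct xenhost_ops {
        /* accessing Xenstore */
        int (*xs_read)(xenhost_t *xh, const char *path, void **result);
        /* issuing and receiving events */
        int (*evtchn_send)(xenhost_t *xh, evtchn_port_t port);
        /* doing hypercalls */
        long (*hypercall)(xenhost_t *xh, unsigned int op, void *arg);
        /* grant table operations */
        int (*grant_map)(xenhost_t *xh, struct gnttab_map_grant_ref *ops,
                         struct page **pages, unsigned int count);
};

struct xenhost {
        const struct xenhost_ops *ops;
        /* per-variant private state */
};

Backends would be handed a xenhost_t * at probe time and pass it down,
instead of the low-level drivers assuming one single global connection
to one hypervisor.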


Juergen

2019-04-08 10:38:49

by Joao Martins

[permalink] [raw]
Subject: Re: [PATCH RFC 00/39] x86/KVM: Xen HVM guest support

On 4/8/19 7:44 AM, Juergen Gross wrote:
> On 12/03/2019 18:14, Joao Martins wrote:
>> ...
>
> What I'd like best would be a new handle (e.g. xenhost_t *) used as an
> abstraction layer for this kind of stuff. It should be passed to the
> backends and those would pass it on to low-level Xen drivers (xenbus,
> event channels, grant table, ...).
>
So, IIRC, backends would use the xenhost layer to access grants or frames
referenced by grants, and that would hook into some of this. IOW, you would have
two implementors of xenhost: one for nested remote/local events+grants and
another for this "shim domain"?

> I was planning to do that (the xenhost_t * stuff) soon in order to add
> support for nested Xen using PV devices (you need two Xenstores for that
> as the nested dom0 is acting as Xen backend server, while using PV
> frontends for accessing the "real" world outside).
>
> The xenhost_t should be used for:
>
> - accessing Xenstore
> - issuing and receiving events
> - doing hypercalls
> - grant table operations
>

In the text above, I sort of suggested a slice of this in 1.b) with a
translate_gref() and put_gref() API -- to get the page from a gref. This was
because of the flags|host_addr hurdle we depicted above wrt using grant
maps/unmaps. Do you think some of the xenhost layer would be amenable to
supporting this case?
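
(For completeness, the 1.b) API I have in mind would be something as
simple as the below -- names tentative:)

/* Look up the page backing @gref granted by @domid; no mapping involved. */
struct page *translate_gref(domid_t domid, grant_ref_t gref);

/* Drop the reference taken by translate_gref(). */
void put_gref(domid_t domid, grant_ref_t gref);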

> So exactly the kind of stuff you want to do, too.
>
Cool idea!

Joao

2019-04-08 10:44:30

by Jürgen Groß

[permalink] [raw]
Subject: Re: [PATCH RFC 00/39] x86/KVM: Xen HVM guest support

On 08/04/2019 12:36, Joao Martins wrote:
> On 4/8/19 7:44 AM, Juergen Gross wrote:
>> On 12/03/2019 18:14, Joao Martins wrote:
>>> ...
>>
>> What I'd like best would be a new handle (e.g. xenhost_t *) used as an
>> abstraction layer for this kind of stuff. It should be passed to the
>> backends and those would pass it on to low-level Xen drivers (xenbus,
>> event channels, grant table, ...).
>>
> So, IIRC, backends would use the xenhost layer to access grants or frames
> referenced by grants, and that would hook into some of this. IOW, you would have
> two implementors of xenhost: one for nested remote/local events+grants and
> another for this "shim domain"?

As I'd need that for nested Xen I guess that would make it 3 variants.
Probably the xen-shim variant would need more hooks, but that should be
no problem.

>> I was planning to do that (the xenhost_t * stuff) soon in order to add
>> support for nested Xen using PV devices (you need two Xenstores for that
>> as the nested dom0 is acting as Xen backend server, while using PV
>> frontends for accessing the "real" world outside).
>>
>> The xenhost_t should be used for:
>>
>> - accessing Xenstore
>> - issuing and receiving events
>> - doing hypercalls
>> - grant table operations
>>
>
> In the text above, I sort of suggested a slice of this in 1.b) with a
> translate_gref() and put_gref() API -- to get the page from a gref. This was
> because of the flags|host_addr hurdle we depicted above wrt using grant
> maps/unmaps. Do you think some of the xenhost layer would be amenable to
> supporting this case?

I think so, yes.

>
>> So exactly the kind of stuff you want to do, too.
>>
> Cool idea!

In the end you might make my life easier for nested Xen. :-)

Do you want to have a try with that idea or should I do that? I might be
able to start working on that in about a month.


Juergen

2019-04-08 18:39:54

by Joao Martins

[permalink] [raw]
Subject: Re: [PATCH RFC 00/39] x86/KVM: Xen HVM guest support

On 4/8/19 11:42 AM, Juergen Gross wrote:
> On 08/04/2019 12:36, Joao Martins wrote:
>> On 4/8/19 7:44 AM, Juergen Gross wrote:
>>> On 12/03/2019 18:14, Joao Martins wrote:
>>>> ...
>>>
>>> What I'd like best would be a new handle (e.g. xenhost_t *) used as an
>>> abstraction layer for this kind of stuff. It should be passed to the
>>> backends and those would pass it on to low-level Xen drivers (xenbus,
>>> event channels, grant table, ...).
>>>
>> So, IIRC, backends would use the xenhost layer to access grants or frames
>> referenced by grants, and that would hook into some of this. IOW, you would have
>> two implementors of xenhost: one for nested remote/local events+grants and
>> another for this "shim domain"?
>
> As I'd need that for nested Xen I guess that would make it 3 variants.
> Probably the xen-shim variant would need more hooks, but that should be
> no problem.
>
I probably messed up in the short description but "nested remote/local
events+grants" was referring to nested Xen (FWIW remote meant L0 and local L1).
So maybe only 2 variants are needed?

>>> I was planning to do that (the xenhost_t * stuff) soon in order to add
>>> support for nested Xen using PV devices (you need two Xenstores for that
>>> as the nested dom0 is acting as Xen backend server, while using PV
>>> frontends for accessing the "real" world outside).
>>>
>>> The xenhost_t should be used for:
>>>
>>> - accessing Xenstore
>>> - issuing and receiving events
>>> - doing hypercalls
>>> - grant table operations
>>>
>>
>> In the text above, I sort of suggested a slice of this in 1.b) with a
>> translate_gref() and put_gref() API -- to get the page from a gref. This was
>> because of the flags|host_addr hurdle we depicted above wrt using grant
>> maps/unmaps. Do you think some of the xenhost layer would be amenable to
>> supporting this case?
>
> I think so, yes.
>
>>
>>> So exactly the kind of stuff you want to do, too.
>>>
>> Cool idea!
>
> In the end you might make my life easier for nested Xen. :-)
>
Hehe :)

> Do you want to have a try with that idea or should I do that? I might be
> able to start working on that in about a month.
>
Ankur (CC'ed) will give a shot at it, and should start a new thread on this
xenhost abstraction layer.

Joao

2019-04-09 00:45:41

by Stefano Stabellini

[permalink] [raw]
Subject: Re: [PATCH RFC 00/39] x86/KVM: Xen HVM guest support

On Mon, 8 Apr 2019, Joao Martins wrote:
> ...
> >
> > In the end you might make my life easier for nested Xen. :-)
> >
> Hehe :)
>
> > Do you want to have a try with that idea or should I do that? I might be
> > able to start working on that in about a month.
> >
> Ankur (CC'ed) will give a shot at it, and should start a new thread on this
> xenhost abstraction layer.

If you are up for it, it would be great to write some documentation too.
We are starting to have decent docs for some PV protocols, describing a
specific PV interface, but we are lacking docs about the basic building
blocks to bring up PV drivers in general. They would be extremely
useful. Given that you have just done the work, you are in a great
position to write those docs. Even bad English would be fine; I am sure
somebody else could volunteer to clean up the language. Anything would
help :-)

2019-04-09 05:40:38

by Jürgen Groß

[permalink] [raw]
Subject: Re: [PATCH RFC 00/39] x86/KVM: Xen HVM guest support

On 08/04/2019 19:31, Joao Martins wrote:
> On 4/8/19 11:42 AM, Juergen Gross wrote:
>> On 08/04/2019 12:36, Joao Martins wrote:
>>> On 4/8/19 7:44 AM, Juergen Gross wrote:
>>>> On 12/03/2019 18:14, Joao Martins wrote:
>>>>> ...
>>>>
>>>> What I'd like best would be a new handle (e.g. xenhost_t *) used as an
>>>> abstraction layer for this kind of stuff. It should be passed to the
>>>> backends and those would pass it on to low-level Xen drivers (xenbus,
>>>> event channels, grant table, ...).
>>>>
>>> So if IIRC backends would use the xenhost layer to access grants or frames
>>> referenced by grants, and that would grok into some of this. IOW, you would have
>>> two implementors of xenhost: one for nested remote/local events+grants and
>>> another for this "shim domain" ?
>>
>> As I'd need that for nested Xen I guess that would make it 3 variants.
>> Probably the xen-shim variant would need more hooks, but that should be
>> no problem.
>>
> I probably messed up in the short description but "nested remote/local
> events+grants" was referring to nested Xen (FWIW remote meant L0 and local L1).
> So maybe only 2 variants are needed?

I need one xenhost variant for the "normal" case as today: talking to
the single hypervisor (or, in the nested case, to the L1 hypervisor).

Then I need a variant for the nested case, talking to the L0 hypervisor.

And you need a variant talking to xen-shim.

The first two variants can be active in the same system in the case of
nested Xen: the backends of the L2 dom0 talk to the L1 hypervisor,
while its frontends talk to the L0 hypervisor.
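
In other words (a purely illustrative sketch; the enum and its names are
invented, not a proposed API):

enum xenhost_type {
	XENHOST_TYPE_DEFAULT,	/* "normal" case: single or L1 hypervisor */
	XENHOST_TYPE_NESTED,	/* nested case: talking to the L0 hypervisor */
	XENHOST_TYPE_SHIM,	/* talking to KVM's xen-shim */
};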

>
>>>> I was planning to do that (the xenhost_t * stuff) soon in order to add
>>>> support for nested Xen using PV devices (you need two Xenstores for that
>>>> as the nested dom0 is acting as Xen backend server, while using PV
>>>> frontends for accessing the "real" world outside).
>>>>
>>>> The xenhost_t should be used for:
>>>>
>>>> - accessing Xenstore
>>>> - issuing and receiving events
>>>> - doing hypercalls
>>>> - grant table operations
>>>>
>>>
>>> In the text above, I sort of suggested a slice of this on 1.b) with a
>>> translate_gref() and put_gref() API -- to get the page from a gref. This was
>>> because of the flags|host_addr hurdle we depicted above wrt to using using grant
>>> maps/unmaps. You think some of the xenhost layer would be ammenable to support
>>> this case?
>>
>> I think so, yes.
>>
>>>
>>>> So exactly the kind of stuff you want to do, too.
>>>>
>>> Cool idea!
>>
>> In the end you might make my life easier for nested Xen. :-)
>>
> Hehe :)
>
>> Do you want to have a try with that idea or should I do that? I might be
>> able to start working on that in about a month.
>>
> Ankur (CC'ed) will give a shot at it, and should start a new thread on this
> xenhost abstraction layer.

Great, looking forward to it!


Juergen

2019-04-10 06:10:31

by Ankur Arora

[permalink] [raw]
Subject: Re: [Xen-devel] [PATCH RFC 00/39] x86/KVM: Xen HVM guest support

On 2019-04-08 5:35 p.m., Stefano Stabellini wrote:
> On Mon, 8 Apr 2019, Joao Martins wrote:
>> On 4/8/19 11:42 AM, Juergen Gross wrote:
>>> On 08/04/2019 12:36, Joao Martins wrote:
>>>> On 4/8/19 7:44 AM, Juergen Gross wrote:
>>>>> On 12/03/2019 18:14, Joao Martins wrote:
>>>>>> On 2/22/19 4:59 PM, Paolo Bonzini wrote:
>>>>>>> On 21/02/19 12:45, Joao Martins wrote:
>>>>>>>> On 2/20/19 9:09 PM, Paolo Bonzini wrote:
>>>>>>>>> On 20/02/19 21:15, Joao Martins wrote:
>>>>>>>>>> 2. PV Driver support (patches 17 - 39)
>>>>>>>>>>
>>>>>>>>>> We start by redirecting hypercalls from the backend to routines
>>>>>>>>>> which emulate the behaviour that PV backends expect i.e. grant
>>>>>>>>>> table and interdomain events. Next, we add support for late
>>>>>>>>>> initialization of xenbus, followed by implementing
>>>>>>>>>> frontend/backend communication mechanisms (i.e. grant tables and
>>>>>>>>>> interdomain event channels). Finally, introduce xen-shim.ko,
>>>>>>>>>> which will setup a limited Xen environment. This uses the added
>>>>>>>>>> functionality of Xen specific shared memory (grant tables) and
>>>>>>>>>> notifications (event channels).
>>>>>>>>>
>>>>>>>>> I am a bit worried by the last patches, they seem really brittle and
>>>>>>>>> prone to breakage. I don't know Xen well enough to understand if the
>>>>>>>>> lack of support for GNTMAP_host_map is fixable, but if not, you have to
>>>>>>>>> define a completely different hypercall.
>>>>>>>>>
>>>>>>>> I guess Ankur already answered this; so just to stack this on top of his comment.
>>>>>>>>
>>>>>>>> The xen_shim_domain() is only meant to handle the case where the backend
>>>>>>>> has/can-have full access to guest memory [i.e. netback and blkback would work
>>>>>>>> with similar assumptions as vhost?]. For the normal case, where a backend *in a
>>>>>>>> guest* maps and unmaps other guest memory, this is not applicable and these
>>>>>>>> changes don't affect that case.
>>>>>>>>
>>>>>>>> IOW, the PV backend here sits on the hypervisor, and the hypercalls aren't
>>>>>>>> actual hypercalls but rather invoking shim_hypercall(). The call chain would go
>>>>>>>> more or less like:
>>>>>>>>
>>>>>>>> <netback|blkback|scsiback>
>>>>>>>> gnttab_map_refs(map_ops, pages)
>>>>>>>> HYPERVISOR_grant_table_op(GNTTABOP_map_grant_ref,...)
>>>>>>>> shim_hypercall()
>>>>>>>> shim_hcall_gntmap()
>>>>>>>>
>>>>>>>> Our reasoning was that given we are already in KVM, why mapping a page if the
>>>>>>>> user (i.e. the kernel PV backend) is himself? The lack of GNTMAP_host_map is how
>>>>>>>> the shim determines its user doesn't want to map the page. Also, there's another
>>>>>>>> issue where PV backends always need a struct page to reference the device
>>>>>>>> inflight data as Ankur pointed out.
>>>>>>>
>>>>>>> Ultimately it's up to the Xen people. It does make their API uglier,
>>>>>>> especially the in/out change for the parameter. If you can at least
>>>>>>> avoid that, it would alleviate my concerns quite a bit.
>>>>>>
>>>>>> In my view, we have two options overall:
>>>>>>
>>>>>> 1) Make it explicit, the changes the PV drivers we have to make in
>>>>>> order to support xen_shim_domain(). This could mean e.g. a) add a callback
>>>>>> argument to gnttab_map_refs() that is invoked for every page that gets looked up
>>>>>> successfully, and inside this callback the PV driver may update it's tracking
>>>>>> page. Here we no longer have this in/out parameter in gnttab_map_refs, and all
>>>>>> shim_domain specific bits would be a little more abstracted from Xen PV
>>>>>> backends. See netback example below the scissors mark. Or b) have sort of a
>>>>>> translate_gref() and put_gref() API that Xen PV drivers use which make it even
>>>>>> more explicit that there's no grant ops involved. The latter is more invasive.
>>>>>>
>>>>>> 2) The second option is to support guest grant mapping/unmapping [*] to allow
>>>>>> hosting PV backends inside the guest. This would remove the Xen changes in this
>>>>>> series completely. But it would require another guest being used
>>>>>> as netback/blkback/xenstored, and less performance than 1) (though, in theory,
>>>>>> it would be equivalent to what does Xen with grants/events). The only changes in
>>>>>> Linux Xen code is adding xenstored domain support, but that is useful on its own
>>>>>> outside the scope of this work.
>>>>>>
>>>>>> I think there's value on both; 1) is probably more familiar for KVM users
>>>>>> perhaps (as it is similar to what vhost does?) while 2) equates to implementing
>>>>>> Xen disagregation capabilities in KVM.
>>>>>>
>>>>>> Thoughts? Xen maintainers what's your take on this?
>>>>>
>>>>> What I'd like best would be a new handle (e.g. xenhost_t *) used as an
>>>>> abstraction layer for this kind of stuff. It should be passed to the
>>>>> backends and those would pass it on to low-level Xen drivers (xenbus,
>>>>> event channels, grant table, ...).
>>>>>
>>>> So if IIRC backends would use the xenhost layer to access grants or frames
>>>> referenced by grants, and that would grok into some of this. IOW, you would have
>>>> two implementors of xenhost: one for nested remote/local events+grants and
>>>> another for this "shim domain" ?
>>>
>>> As I'd need that for nested Xen I guess that would make it 3 variants.
>>> Probably the xen-shim variant would need more hooks, but that should be
>>> no problem.
>>>
>> I probably messed up in the short description but "nested remote/local
>> events+grants" was referring to nested Xen (FWIW remote meant L0 and local L1).
>> So maybe only 2 variants are needed?
>>
>>>>> I was planning to do that (the xenhost_t * stuff) soon in order to add
>>>>> support for nested Xen using PV devices (you need two Xenstores for that
>>>>> as the nested dom0 is acting as Xen backend server, while using PV
>>>>> frontends for accessing the "real" world outside).
>>>>>
>>>>> The xenhost_t should be used for:
>>>>>
>>>>> - accessing Xenstore
>>>>> - issuing and receiving events
>>>>> - doing hypercalls
>>>>> - grant table operations
>>>>>
>>>>
>>>> In the text above, I sort of suggested a slice of this on 1.b) with a
>>>> translate_gref() and put_gref() API -- to get the page from a gref. This was
>>>> because of the flags|host_addr hurdle we depicted above wrt to using using grant
>>>> maps/unmaps. You think some of the xenhost layer would be ammenable to support
>>>> this case?
>>>
>>> I think so, yes.
>>>
>>>>
>>>>> So exactly the kind of stuff you want to do, too.
>>>>>
>>>> Cool idea!
>>>
>>> In the end you might make my life easier for nested Xen. :-)
>>>
>> Hehe :)
>>
>>> Do you want to have a try with that idea or should I do that? I might be
>>> able to start working on that in about a month.
>>>
>> Ankur (CC'ed) will give a shot at it, and should start a new thread on this
>> xenhost abstraction layer.
>
> If you are up for it, it would be great to write some documentation too.
> We are starting to have decent docs for some PV protocols, describing a
> specific PV interface, but we are lacking docs about the basic building
> blocks to bring up PV drivers in general. They would be extremely
Agreed. These would be useful.

> useful. Given that you have just done the work, you are in a great
> position to write those docs. Even bad English would be fine, I am sure
> somebody else could volunteer to clean-up the language. Anything would
> help :-)
Can't make any promises on this yet, but I will see if I can convert the
notes I made into something useful for the community.


Ankur


2019-04-10 07:45:25

by Ankur Arora

[permalink] [raw]
Subject: Re: [PATCH RFC 00/39] x86/KVM: Xen HVM guest support

On 2019-04-08 10:04 p.m., Juergen Gross wrote:
> On 08/04/2019 19:31, Joao Martins wrote:
>> On 4/8/19 11:42 AM, Juergen Gross wrote:
>>> On 08/04/2019 12:36, Joao Martins wrote:
>>>> On 4/8/19 7:44 AM, Juergen Gross wrote:
>>>>> On 12/03/2019 18:14, Joao Martins wrote:
>>>>>> On 2/22/19 4:59 PM, Paolo Bonzini wrote:
>>>>>>> On 21/02/19 12:45, Joao Martins wrote:
>>>>>>>> On 2/20/19 9:09 PM, Paolo Bonzini wrote:
>>>>>>>>> On 20/02/19 21:15, Joao Martins wrote:
>>>>>>>>>> 2. PV Driver support (patches 17 - 39)
>>>>>>>>>>
>>>>>>>>>> We start by redirecting hypercalls from the backend to routines
>>>>>>>>>> which emulate the behaviour that PV backends expect i.e. grant
>>>>>>>>>> table and interdomain events. Next, we add support for late
>>>>>>>>>> initialization of xenbus, followed by implementing
>>>>>>>>>> frontend/backend communication mechanisms (i.e. grant tables and
>>>>>>>>>> interdomain event channels). Finally, introduce xen-shim.ko,
>>>>>>>>>> which will setup a limited Xen environment. This uses the added
>>>>>>>>>> functionality of Xen specific shared memory (grant tables) and
>>>>>>>>>> notifications (event channels).
>>>>>>>>>
>>>>>>>>> I am a bit worried by the last patches, they seem really brittle and
>>>>>>>>> prone to breakage. I don't know Xen well enough to understand if the
>>>>>>>>> lack of support for GNTMAP_host_map is fixable, but if not, you have to
>>>>>>>>> define a completely different hypercall.
>>>>>>>>>
>>>>>>>> I guess Ankur already answered this; so just to stack this on top of his comment.
>>>>>>>>
>>>>>>>> The xen_shim_domain() is only meant to handle the case where the backend
>>>>>>>> has/can-have full access to guest memory [i.e. netback and blkback would work
>>>>>>>> with similar assumptions as vhost?]. For the normal case, where a backend *in a
>>>>>>>> guest* maps and unmaps other guest memory, this is not applicable and these
>>>>>>>> changes don't affect that case.
>>>>>>>>
>>>>>>>> IOW, the PV backend here sits on the hypervisor, and the hypercalls aren't
>>>>>>>> actual hypercalls but rather invoking shim_hypercall(). The call chain would go
>>>>>>>> more or less like:
>>>>>>>>
>>>>>>>> <netback|blkback|scsiback>
>>>>>>>> gnttab_map_refs(map_ops, pages)
>>>>>>>> HYPERVISOR_grant_table_op(GNTTABOP_map_grant_ref,...)
>>>>>>>> shim_hypercall()
>>>>>>>> shim_hcall_gntmap()
>>>>>>>>
>>>>>>>> Our reasoning was that given we are already in KVM, why mapping a page if the
>>>>>>>> user (i.e. the kernel PV backend) is himself? The lack of GNTMAP_host_map is how
>>>>>>>> the shim determines its user doesn't want to map the page. Also, there's another
>>>>>>>> issue where PV backends always need a struct page to reference the device
>>>>>>>> inflight data as Ankur pointed out.
>>>>>>>
>>>>>>> Ultimately it's up to the Xen people. It does make their API uglier,
>>>>>>> especially the in/out change for the parameter. If you can at least
>>>>>>> avoid that, it would alleviate my concerns quite a bit.
>>>>>>
>>>>>> In my view, we have two options overall:
>>>>>>
>>>>>> 1) Make it explicit, the changes the PV drivers we have to make in
>>>>>> order to support xen_shim_domain(). This could mean e.g. a) add a callback
>>>>>> argument to gnttab_map_refs() that is invoked for every page that gets looked up
>>>>>> successfully, and inside this callback the PV driver may update it's tracking
>>>>>> page. Here we no longer have this in/out parameter in gnttab_map_refs, and all
>>>>>> shim_domain specific bits would be a little more abstracted from Xen PV
>>>>>> backends. See netback example below the scissors mark. Or b) have sort of a
>>>>>> translate_gref() and put_gref() API that Xen PV drivers use which make it even
>>>>>> more explicit that there's no grant ops involved. The latter is more invasive.
>>>>>>
>>>>>> 2) The second option is to support guest grant mapping/unmapping [*] to allow
>>>>>> hosting PV backends inside the guest. This would remove the Xen changes in this
>>>>>> series completely. But it would require another guest being used
>>>>>> as netback/blkback/xenstored, and less performance than 1) (though, in theory,
>>>>>> it would be equivalent to what does Xen with grants/events). The only changes in
>>>>>> Linux Xen code is adding xenstored domain support, but that is useful on its own
>>>>>> outside the scope of this work.
>>>>>>
>>>>>> I think there's value on both; 1) is probably more familiar for KVM users
>>>>>> perhaps (as it is similar to what vhost does?) while 2) equates to implementing
>>>>>> Xen disagregation capabilities in KVM.
>>>>>>
>>>>>> Thoughts? Xen maintainers what's your take on this?
>>>>>
>>>>> What I'd like best would be a new handle (e.g. xenhost_t *) used as an
>>>>> abstraction layer for this kind of stuff. It should be passed to the
>>>>> backends and those would pass it on to low-level Xen drivers (xenbus,
>>>>> event channels, grant table, ...).
>>>>>
>>>> So if IIRC backends would use the xenhost layer to access grants or frames
>>>> referenced by grants, and that would grok into some of this. IOW, you would have
>>>> two implementors of xenhost: one for nested remote/local events+grants and
>>>> another for this "shim domain" ?
>>>
>>> As I'd need that for nested Xen I guess that would make it 3 variants.
>>> Probably the xen-shim variant would need more hooks, but that should be
>>> no problem.
>>>
>> I probably messed up in the short description but "nested remote/local
>> events+grants" was referring to nested Xen (FWIW remote meant L0 and local L1).
>> So maybe only 2 variants are needed?
>
> I need one xenhost variant for the "normal" case as today: talking to
> the single hypervisor (or in nested case: to the L1 hypervisor).
>
> Then I need a variant for the nested case talking to L0 hypervisor.
>
> And you need a variant talking to xen-shim.
>
> The first two variants can be active in the same system in case of
> nested Xen: the backends of L2 dom0 are talking to L1 hypervisor,
> while its frontends are talking with L0 hypervisor.
Thanks, this is clarifying.

So, essentially, backend drivers with a xenhost_t handle communicate
with the Xen low-level drivers etc. using that same handle; however, if
they communicate with frontend drivers for accessing the "real" world,
they exclusively use standard mechanisms (Linux or hypercalls)?

In this scenario, L2 dom0 xen-netback and L2 dom0 xen-netfront should
just be able to use Linux interfaces. But if the L2 dom0 xenbus-backend
needs to talk to the L2 dom0 xenbus-frontend, do you see them layered,
or are they still exclusively talking via the standard mechanisms?

Ankur

>
>>
>>>>> I was planning to do that (the xenhost_t * stuff) soon in order to add
>>>>> support for nested Xen using PV devices (you need two Xenstores for that
>>>>> as the nested dom0 is acting as Xen backend server, while using PV
>>>>> frontends for accessing the "real" world outside).
>>>>>
>>>>> The xenhost_t should be used for:
>>>>>
>>>>> - accessing Xenstore
>>>>> - issuing and receiving events
>>>>> - doing hypercalls
>>>>> - grant table operations
>>>>>
>>>>
>>>> In the text above, I sort of suggested a slice of this on 1.b) with a
>>>> translate_gref() and put_gref() API -- to get the page from a gref. This was
>>>> because of the flags|host_addr hurdle we depicted above wrt to using using grant
>>>> maps/unmaps. You think some of the xenhost layer would be ammenable to support
>>>> this case?
>>>
>>> I think so, yes.
>>>
>>>>
>>>>> So exactly the kind of stuff you want to do, too.
>>>>>
>>>> Cool idea!
>>>
>>> In the end you might make my life easier for nested Xen. :-)
>>>
>> Hehe :)
>>
>>> Do you want to have a try with that idea or should I do that? I might be
>>> able to start working on that in about a month.
>>>
>> Ankur (CC'ed) will give a shot at it, and should start a new thread on this
>> xenhost abstraction layer.
>
> Great, looking forward to it!
>
>
> Juergen
>

2019-04-10 07:51:22

by Jürgen Groß

[permalink] [raw]
Subject: Re: [PATCH RFC 00/39] x86/KVM: Xen HVM guest support

On 10/04/2019 08:55, Ankur Arora wrote:
> On 2019-04-08 10:04 p.m., Juergen Gross wrote:
>> On 08/04/2019 19:31, Joao Martins wrote:
>>> On 4/8/19 11:42 AM, Juergen Gross wrote:
>>>> On 08/04/2019 12:36, Joao Martins wrote:
>>>>> On 4/8/19 7:44 AM, Juergen Gross wrote:
>>>>>> On 12/03/2019 18:14, Joao Martins wrote:
>>>>>>> On 2/22/19 4:59 PM, Paolo Bonzini wrote:
>>>>>>>> On 21/02/19 12:45, Joao Martins wrote:
>>>>>>>>> On 2/20/19 9:09 PM, Paolo Bonzini wrote:
>>>>>>>>>> On 20/02/19 21:15, Joao Martins wrote:
>>>>>>>>>>>   2. PV Driver support (patches 17 - 39)
>>>>>>>>>>>
>>>>>>>>>>>   We start by redirecting hypercalls from the backend to
>>>>>>>>>>> routines
>>>>>>>>>>>   which emulate the behaviour that PV backends expect i.e. grant
>>>>>>>>>>>   table and interdomain events. Next, we add support for late
>>>>>>>>>>>   initialization of xenbus, followed by implementing
>>>>>>>>>>>   frontend/backend communication mechanisms (i.e. grant
>>>>>>>>>>> tables and
>>>>>>>>>>>   interdomain event channels). Finally, introduce xen-shim.ko,
>>>>>>>>>>>   which will setup a limited Xen environment. This uses the
>>>>>>>>>>> added
>>>>>>>>>>>   functionality of Xen specific shared memory (grant tables) and
>>>>>>>>>>>   notifications (event channels).
>>>>>>>>>>
>>>>>>>>>> I am a bit worried by the last patches, they seem really
>>>>>>>>>> brittle and
>>>>>>>>>> prone to breakage.  I don't know Xen well enough to understand
>>>>>>>>>> if the
>>>>>>>>>> lack of support for GNTMAP_host_map is fixable, but if not,
>>>>>>>>>> you have to
>>>>>>>>>> define a completely different hypercall.
>>>>>>>>>>
>>>>>>>>> I guess Ankur already answered this; so just to stack this on
>>>>>>>>> top of his comment.
>>>>>>>>>
>>>>>>>>> The xen_shim_domain() is only meant to handle the case where
>>>>>>>>> the backend
>>>>>>>>> has/can-have full access to guest memory [i.e. netback and
>>>>>>>>> blkback would work
>>>>>>>>> with similar assumptions as vhost?]. For the normal case, where
>>>>>>>>> a backend *in a
>>>>>>>>> guest* maps and unmaps other guest memory, this is not
>>>>>>>>> applicable and these
>>>>>>>>> changes don't affect that case.
>>>>>>>>>
>>>>>>>>> IOW, the PV backend here sits on the hypervisor, and the
>>>>>>>>> hypercalls aren't
>>>>>>>>> actual hypercalls but rather invoking shim_hypercall(). The
>>>>>>>>> call chain would go
>>>>>>>>> more or less like:
>>>>>>>>>
>>>>>>>>> <netback|blkback|scsiback>
>>>>>>>>>   gnttab_map_refs(map_ops, pages)
>>>>>>>>>     HYPERVISOR_grant_table_op(GNTTABOP_map_grant_ref,...)
>>>>>>>>>       shim_hypercall()
>>>>>>>>>         shim_hcall_gntmap()
>>>>>>>>>
>>>>>>>>> Our reasoning was that given we are already in KVM, why mapping
>>>>>>>>> a page if the
>>>>>>>>> user (i.e. the kernel PV backend) is himself? The lack of
>>>>>>>>> GNTMAP_host_map is how
>>>>>>>>> the shim determines its user doesn't want to map the page.
>>>>>>>>> Also, there's another
>>>>>>>>> issue where PV backends always need a struct page to reference
>>>>>>>>> the device
>>>>>>>>> inflight data as Ankur pointed out.
>>>>>>>>
>>>>>>>> Ultimately it's up to the Xen people.  It does make their API
>>>>>>>> uglier,
>>>>>>>> especially the in/out change for the parameter.  If you can at
>>>>>>>> least
>>>>>>>> avoid that, it would alleviate my concerns quite a bit.
>>>>>>>
>>>>>>> In my view, we have two options overall:
>>>>>>>
>>>>>>> 1) Make it explicit, the changes the PV drivers we have to make in
>>>>>>> order to support xen_shim_domain(). This could mean e.g. a) add a
>>>>>>> callback
>>>>>>> argument to gnttab_map_refs() that is invoked for every page that
>>>>>>> gets looked up
>>>>>>> successfully, and inside this callback the PV driver may update
>>>>>>> it's tracking
>>>>>>> page. Here we no longer have this in/out parameter in
>>>>>>> gnttab_map_refs, and all
>>>>>>> shim_domain specific bits would be a little more abstracted from
>>>>>>> Xen PV
>>>>>>> backends. See netback example below the scissors mark. Or b) have
>>>>>>> sort of a
>>>>>>> translate_gref() and put_gref() API that Xen PV drivers use which
>>>>>>> make it even
>>>>>>> more explicit that there's no grant ops involved. The latter is
>>>>>>> more invasive.
>>>>>>>
>>>>>>> 2) The second option is to support guest grant mapping/unmapping
>>>>>>> [*] to allow
>>>>>>> hosting PV backends inside the guest. This would remove the Xen
>>>>>>> changes in this
>>>>>>> series completely. But it would require another guest being used
>>>>>>> as netback/blkback/xenstored, and less performance than 1)
>>>>>>> (though, in theory,
>>>>>>> it would be equivalent to what does Xen with grants/events). The
>>>>>>> only changes in
>>>>>>> Linux Xen code is adding xenstored domain support, but that is
>>>>>>> useful on its own
>>>>>>> outside the scope of this work.
>>>>>>>
>>>>>>> I think there's value on both; 1) is probably more familiar for
>>>>>>> KVM users
>>>>>>> perhaps (as it is similar to what vhost does?) while  2) equates
>>>>>>> to implementing
>>>>>>> Xen disagregation capabilities in KVM.
>>>>>>>
>>>>>>> Thoughts? Xen maintainers what's your take on this?
>>>>>>
>>>>>> What I'd like best would be a new handle (e.g. xenhost_t *) used
>>>>>> as an
>>>>>> abstraction layer for this kind of stuff. It should be passed to the
>>>>>> backends and those would pass it on to low-level Xen drivers (xenbus,
>>>>>> event channels, grant table, ...).
>>>>>>
>>>>> So if IIRC backends would use the xenhost layer to access grants or
>>>>> frames
>>>>> referenced by grants, and that would grok into some of this. IOW,
>>>>> you would have
>>>>> two implementors of xenhost: one for nested remote/local
>>>>> events+grants and
>>>>> another for this "shim domain" ?
>>>>
>>>> As I'd need that for nested Xen I guess that would make it 3 variants.
>>>> Probably the xen-shim variant would need more hooks, but that should be
>>>> no problem.
>>>>
>>> I probably messed up in the short description but "nested remote/local
>>> events+grants" was referring to nested Xen (FWIW remote meant L0 and
>>> local L1).
>>> So maybe only 2 variants are needed?
>>
>> I need one xenhost variant for the "normal" case as today: talking to
>> the single hypervisor (or in nested case: to the L1 hypervisor).
>>
>> Then I need a variant for the nested case talking to L0 hypervisor.
>>
>> And you need a variant talking to xen-shim.
>>
>> The first two variants can be active in the same system in case of
>> nested Xen: the backends of L2 dom0 are talking to L1 hypervisor,
>> while its frontends are talking with L0 hypervisor.
> Thanks this is clarifying.
>
> So, essentially backend drivers with a xenhost_t handle, communicate
> with Xen low-level drivers etc using the same handle, however, if they
> communicate with frontend drivers for accessing the "real" world,
> they exclusively use standard mechanisms (Linux or hypercalls)?

This should be opaque to the backends. The xenhost_t handle should have
a pointer to a function vector for relevant grant-, event- and Xenstore-
related functions. Calls to such functions should be done via an inline
function with the xenhost_t handle being one parameter; that function
will then call the correct implementation.
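
Roughly like this (a minimal sketch; all the names are illustrative
assumptions, not an existing or proposed API):

typedef struct xenhost xenhost_t;

/* One function vector per variant: default, nested-L0, xen-shim. */
struct xenhost_ops {
	int  (*xs_read)(xenhost_t *xh, const char *path, char *buf,
			unsigned int len);
	int  (*evtchn_send)(xenhost_t *xh, unsigned int port);
	long (*hypercall)(xenhost_t *xh, unsigned int op, void *arg);
	int  (*grant_map)(xenhost_t *xh, void *map_ops, unsigned int count);
};

struct xenhost {
	const struct xenhost_ops *ops;
	void *priv;			/* variant-specific state */
};

/* Backends would only ever call inline wrappers like this one: */
static inline int xenhost_evtchn_send(xenhost_t *xh, unsigned int port)
{
	return xh->ops->evtchn_send(xh, port);
}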

> In this scenario L2 dom0 xen-netback and L2 dom0 xen-netfront should
> just be able to use Linux interfaces. But if L2 dom0 xenbus-backend
> needs to talk to L2 dom0 xenbus-frontend then do you see them layered
> or are they still exclusively talking via the standard mechanisms?

The distinction is made via the function vector in xenhost_t. So the
only change needed in the backends is the introduction of xenhost_t.

Whether we want to introduce xenhost_t in frontends, too, is TBD.


Juergen

2019-04-10 20:46:28

by Stefano Stabellini

[permalink] [raw]
Subject: Re: [Xen-devel] [PATCH RFC 00/39] x86/KVM: Xen HVM guest support

On Tue, 9 Apr 2019, Ankur Arora wrote:
> On 2019-04-08 5:35 p.m., Stefano Stabellini wrote:
> > On Mon, 8 Apr 2019, Joao Martins wrote:
> > > On 4/8/19 11:42 AM, Juergen Gross wrote:
> > > > On 08/04/2019 12:36, Joao Martins wrote:
> > > > > On 4/8/19 7:44 AM, Juergen Gross wrote:
> > > > > > On 12/03/2019 18:14, Joao Martins wrote:
> > > > > > > On 2/22/19 4:59 PM, Paolo Bonzini wrote:
> > > > > > > > On 21/02/19 12:45, Joao Martins wrote:
> > > > > > > > > On 2/20/19 9:09 PM, Paolo Bonzini wrote:
> > > > > > > > > > On 20/02/19 21:15, Joao Martins wrote:
> > > > > > > > > > > 2. PV Driver support (patches 17 - 39)
> > > > > > > > > > >
> > > > > > > > > > > We start by redirecting hypercalls from the backend to
> > > > > > > > > > > routines
> > > > > > > > > > > which emulate the behaviour that PV backends expect i.e.
> > > > > > > > > > > grant
> > > > > > > > > > > table and interdomain events. Next, we add support for
> > > > > > > > > > > late
> > > > > > > > > > > initialization of xenbus, followed by implementing
> > > > > > > > > > > frontend/backend communication mechanisms (i.e. grant
> > > > > > > > > > > tables and
> > > > > > > > > > > interdomain event channels). Finally, introduce
> > > > > > > > > > > xen-shim.ko,
> > > > > > > > > > > which will setup a limited Xen environment. This uses
> > > > > > > > > > > the added
> > > > > > > > > > > functionality of Xen specific shared memory (grant
> > > > > > > > > > > tables) and
> > > > > > > > > > > notifications (event channels).
> > > > > > > > > >
> > > > > > > > > > I am a bit worried by the last patches, they seem really
> > > > > > > > > > brittle and
> > > > > > > > > > prone to breakage. I don't know Xen well enough to
> > > > > > > > > > understand if the
> > > > > > > > > > lack of support for GNTMAP_host_map is fixable, but if not,
> > > > > > > > > > you have to
> > > > > > > > > > define a completely different hypercall.
> > > > > > > > > >
> > > > > > > > > I guess Ankur already answered this; so just to stack this on
> > > > > > > > > top of his comment.
> > > > > > > > >
> > > > > > > > > The xen_shim_domain() is only meant to handle the case where
> > > > > > > > > the backend
> > > > > > > > > has/can-have full access to guest memory [i.e. netback and
> > > > > > > > > blkback would work
> > > > > > > > > with similar assumptions as vhost?]. For the normal case,
> > > > > > > > > where a backend *in a
> > > > > > > > > guest* maps and unmaps other guest memory, this is not
> > > > > > > > > applicable and these
> > > > > > > > > changes don't affect that case.
> > > > > > > > >
> > > > > > > > > IOW, the PV backend here sits on the hypervisor, and the
> > > > > > > > > hypercalls aren't
> > > > > > > > > actual hypercalls but rather invoking shim_hypercall(). The
> > > > > > > > > call chain would go
> > > > > > > > > more or less like:
> > > > > > > > >
> > > > > > > > > <netback|blkback|scsiback>
> > > > > > > > > gnttab_map_refs(map_ops, pages)
> > > > > > > > > HYPERVISOR_grant_table_op(GNTTABOP_map_grant_ref,...)
> > > > > > > > > shim_hypercall()
> > > > > > > > > shim_hcall_gntmap()
> > > > > > > > >
> > > > > > > > > Our reasoning was that given we are already in KVM, why
> > > > > > > > > mapping a page if the
> > > > > > > > > user (i.e. the kernel PV backend) is himself? The lack of
> > > > > > > > > GNTMAP_host_map is how
> > > > > > > > > the shim determines its user doesn't want to map the page.
> > > > > > > > > Also, there's another
> > > > > > > > > issue where PV backends always need a struct page to reference
> > > > > > > > > the device
> > > > > > > > > inflight data as Ankur pointed out.
> > > > > > > >
> > > > > > > > Ultimately it's up to the Xen people. It does make their API
> > > > > > > > uglier,
> > > > > > > > especially the in/out change for the parameter. If you can at
> > > > > > > > least
> > > > > > > > avoid that, it would alleviate my concerns quite a bit.
> > > > > > >
> > > > > > > In my view, we have two options overall:
> > > > > > >
> > > > > > > 1) Make it explicit, the changes the PV drivers we have to make in
> > > > > > > order to support xen_shim_domain(). This could mean e.g. a) add a
> > > > > > > callback
> > > > > > > argument to gnttab_map_refs() that is invoked for every page that
> > > > > > > gets looked up
> > > > > > > successfully, and inside this callback the PV driver may update
> > > > > > > it's tracking
> > > > > > > page. Here we no longer have this in/out parameter in
> > > > > > > gnttab_map_refs, and all
> > > > > > > shim_domain specific bits would be a little more abstracted from
> > > > > > > Xen PV
> > > > > > > backends. See netback example below the scissors mark. Or b) have
> > > > > > > sort of a
> > > > > > > translate_gref() and put_gref() API that Xen PV drivers use which
> > > > > > > make it even
> > > > > > > more explicit that there's no grant ops involved. The latter is
> > > > > > > more invasive.
> > > > > > >
> > > > > > > 2) The second option is to support guest grant mapping/unmapping
> > > > > > > [*] to allow
> > > > > > > hosting PV backends inside the guest. This would remove the Xen
> > > > > > > changes in this
> > > > > > > series completely. But it would require another guest being used
> > > > > > > as netback/blkback/xenstored, and less performance than 1)
> > > > > > > (though, in theory,
> > > > > > > it would be equivalent to what does Xen with grants/events). The
> > > > > > > only changes in
> > > > > > > Linux Xen code is adding xenstored domain support, but that is
> > > > > > > useful on its own
> > > > > > > outside the scope of this work.
> > > > > > >
> > > > > > > I think there's value on both; 1) is probably more familiar for
> > > > > > > KVM users
> > > > > > > perhaps (as it is similar to what vhost does?) while 2) equates
> > > > > > > to implementing
> > > > > > > Xen disagregation capabilities in KVM.
> > > > > > >
> > > > > > > Thoughts? Xen maintainers what's your take on this?
> > > > > >
> > > > > > What I'd like best would be a new handle (e.g. xenhost_t *) used as
> > > > > > an
> > > > > > abstraction layer for this kind of stuff. It should be passed to the
> > > > > > backends and those would pass it on to low-level Xen drivers
> > > > > > (xenbus,
> > > > > > event channels, grant table, ...).
> > > > > >
> > > > > So if IIRC backends would use the xenhost layer to access grants or
> > > > > frames
> > > > > referenced by grants, and that would grok into some of this. IOW, you
> > > > > would have
> > > > > two implementors of xenhost: one for nested remote/local events+grants
> > > > > and
> > > > > another for this "shim domain" ?
> > > >
> > > > As I'd need that for nested Xen I guess that would make it 3 variants.
> > > > Probably the xen-shim variant would need more hooks, but that should be
> > > > no problem.
> > > >
> > > I probably messed up in the short description but "nested remote/local
> > > events+grants" was referring to nested Xen (FWIW remote meant L0 and local
> > > L1).
> > > So maybe only 2 variants are needed?
> > >
> > > > > > I was planning to do that (the xenhost_t * stuff) soon in order to
> > > > > > add
> > > > > > support for nested Xen using PV devices (you need two Xenstores for
> > > > > > that
> > > > > > as the nested dom0 is acting as Xen backend server, while using PV
> > > > > > frontends for accessing the "real" world outside).
> > > > > >
> > > > > > The xenhost_t should be used for:
> > > > > >
> > > > > > - accessing Xenstore
> > > > > > - issuing and receiving events
> > > > > > - doing hypercalls
> > > > > > - grant table operations
> > > > > >
> > > > >
> > > > > In the text above, I sort of suggested a slice of this on 1.b) with a
> > > > > translate_gref() and put_gref() API -- to get the page from a gref.
> > > > > This was
> > > > > because of the flags|host_addr hurdle we depicted above wrt to using
> > > > > using grant
> > > > > maps/unmaps. You think some of the xenhost layer would be ammenable to
> > > > > support
> > > > > this case?
> > > >
> > > > I think so, yes.
> > > >
> > > > >
> > > > > > So exactly the kind of stuff you want to do, too.
> > > > > >
> > > > > Cool idea!
> > > >
> > > > In the end you might make my life easier for nested Xen. :-)
> > > >
> > > Hehe :)
> > >
> > > > Do you want to have a try with that idea or should I do that? I might be
> > > > able to start working on that in about a month.
> > > >
> > > Ankur (CC'ed) will give a shot at it, and should start a new thread on
> > > this
> > > xenhost abstraction layer.
> >
> > If you are up for it, it would be great to write some documentation too.
> > We are starting to have decent docs for some PV protocols, describing a
> > specific PV interface, but we are lacking docs about the basic building
> > blocks to bring up PV drivers in general. They would be extremely
> Agreed. These would be useful.
>
> > useful. Given that you have just done the work, you are in a great
> > position to write those docs. Even bad English would be fine, I am sure
> > somebody else could volunteer to clean-up the language. Anything would
> > help :-)
> Can't make any promises on this yet but I will see if I can convert
> notes I made into something useful for the community.

Thank you!

2020-11-30 09:45:59

by David Woodhouse

[permalink] [raw]
Subject: Re: [PATCH RFC 11/39] KVM: x86/xen: evtchn signaling via eventfd

On Wed, 2019-02-20 at 20:15 +0000, Joao Martins wrote:
> userspace registers a @port to an @eventfd, that is bound to a
> @vcpu. This information is then used when the guest does an
> EVTCHNOP_send with a port registered with the kernel.

Why do I want this part?

> EVTCHNOP_send short-circuiting happens by marking the event as pending
> in the shared info and vcpu info pages and doing the upcall. For IPIs
> and interdomain event channels, we do the upcall on the assigned vcpu.

This part I understand, 'peeking' at the EVTCHNOP_send hypercall so
that we can short-circuit IPI delivery without it having to bounce
through userspace.

But why would I then want to short-circuit the short-circuit,
providing an eventfd for it to signal... so that I can then just
receive the event in userspace in a *different* form to the original
hypercall exit I would have got?



2020-11-30 10:44:04

by David Woodhouse

[permalink] [raw]
Subject: Re: [PATCH RFC 01/39] KVM: x86: fix Xen hypercall page msr handling

On Fri, 2019-02-22 at 13:51 +0100, Paolo Bonzini wrote:
> On 22/02/19 02:30, Sean Christopherson wrote:
> > if (kvm_advertise_kvm()) {
> > if (<handle kvm msr>)
> > return ...;
> > } else if (kvm_advertise_hyperv()) {
> > if (<handle hyperv msr>)
> > return ...;
> > } else if (kvm_advertise_xen()) {
> > if (<handle xen msrs>)
> > return ...;
> > }
> >
> > <fall through to main switch statement>
> >
> > Obviously assumes KVM only advertises itself as one hypervisor, and so
> > the ordering is arbitrary.
>
> No, KVM can advertise as both KVM and Hyper-V. CPUID 0x40000000 is used
> for Hyper-V, while 0x40000100 is used for KVM. The MSRs do not conflict.

The MSRs *do* conflict. Kind of...

Xen uses MSR 0x40000000 (not to be conflated with CPUID leaf
0x40000000) for the "write hypercall page" request.

That conflicts with Hyper-V's HV_X64_MSR_GUEST_OS_ID.

So when the Hyper-V extensions are enabled, Xen moves its own MSR to
0x40000200 to avoid the conflict.
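
The clash, in numbers (the Xen-side macro names here are illustrative,
not taken from any header):

#define MSR_XEN_HYPERCALL_PAGE	0x40000000	/* Xen "write hypercall page" MSR */
#define HV_X64_MSR_GUEST_OS_ID	0x40000000	/* Hyper-V guest OS ID: same index */
#define MSR_XEN_HYPERCALL_ALT	0x40000200	/* Xen's MSR with Hyper-V extensions on */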

The problem is that KVM services the Hyper-V MSRs unconditionally in
the switch statement in kvm_set_msr_common(). So if the Xen MSR is set
to 0x40000000 and Hyper-V is *not* enabled, the Hyper-V support still
stops the Xen MSR from working.

Joao's patch fixes that.

A nicer alternative might be to disable the Hyper-V MSRs when they
shouldn't be there. Something like...

--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2788,15 +2788,6 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 		 * the need to ignore the workaround.
 		 */
 		break;
-	case HV_X64_MSR_GUEST_OS_ID ... HV_X64_MSR_SINT15:
-	case HV_X64_MSR_CRASH_P0 ... HV_X64_MSR_CRASH_P4:
-	case HV_X64_MSR_CRASH_CTL:
-	case HV_X64_MSR_STIMER0_CONFIG ... HV_X64_MSR_STIMER3_COUNT:
-	case HV_X64_MSR_REENLIGHTENMENT_CONTROL:
-	case HV_X64_MSR_TSC_EMULATION_CONTROL:
-	case HV_X64_MSR_TSC_EMULATION_STATUS:
-		return kvm_hv_set_msr_common(vcpu, msr, data,
-					     msr_info->host_initiated);
 	case MSR_IA32_BBL_CR_CTL3:
 		/* Drop writes to this legacy MSR -- see rdmsr
 		 * counterpart for further detail.
@@ -2829,6 +2820,18 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 			return 1;
 		vcpu->arch.msr_misc_features_enables = data;
 		break;
+	case HV_X64_MSR_GUEST_OS_ID ... HV_X64_MSR_SINT15:
+	case HV_X64_MSR_CRASH_P0 ... HV_X64_MSR_CRASH_P4:
+	case HV_X64_MSR_CRASH_CTL:
+	case HV_X64_MSR_STIMER0_CONFIG ... HV_X64_MSR_STIMER3_COUNT:
+	case HV_X64_MSR_REENLIGHTENMENT_CONTROL:
+	case HV_X64_MSR_TSC_EMULATION_CONTROL:
+	case HV_X64_MSR_TSC_EMULATION_STATUS:
+		if (kvm_hyperv_enabled(vcpu->kvm)) {
+			return kvm_hv_set_msr_common(vcpu, msr, data,
+						     msr_info->host_initiated);
+		}
+		/* fall through */
 	default:
 		if (msr && (msr == vcpu->kvm->arch.xen_hvm_config.msr))
 			return xen_hvm_config(vcpu, data);


... except that's a bit icky because that trick of falling through to
the default case only works for *one* case statement. And more to the
point, the closest thing I can find to a 'kvm_hyperv_enabled()' flag is
what we do for setting the HV_X64_MSR_HYPERCALL_ENABLE flag... which is
based on whether the hv_guest_os_id is set, which in turn is done by
writing one of these MSRs :)

I suppose we could disable them just by letting Xen take precedence, if
kvm->arch.xen_hvm_config.msr == HV_X64_MSR_GUEST_OS_ID. But that's
basically what Joao's patch already does. It doesn't disable the
*other* Hyper-V MSRs except for the one Xen 'conflicts' with, but I
don't think that matters.

The patch stands alone to correct the *existing* functionality of
KVM_XEN_HVM_CONFIG, regardless of the additional functionality being
proposed in the rest of the series that followed it.

Reviewed-by: David Woodhouse <[email protected]>
Cc: [email protected]




2020-11-30 11:09:21

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH RFC 01/39] KVM: x86: fix Xen hypercall page msr handling

On 30/11/20 11:39, David Woodhouse wrote:
> ... except that's a bit icky because that trick of falling through to
> the default case only works for *one* case statement. And more to the
> point, the closest thing I can find to a 'kvm_hyperv_enabled()' flag is
> what we do for setting the HV_X64_MSR_HYPERCALL_ENABLE flag... which is
> based on whether the hv_guest_os_id is set, which in turn is done by
> writing one of these MSRs

You can use CPUID too (search for Hv#1 in leaf 0x40000000)?

Paolo

> I suppose we could disable them just by letting Xen take precedence, if
> kvm->arch.xen_hvm_config.msr == HV_X64_MSR_GUEST_OS_ID. But that's
> basically what Joao's patch already does. It doesn't disable the
> *other* Hyper-V MSRs except for the one Xen 'conflicts' with, but I
> don't think that matters.

2020-11-30 11:32:29

by David Woodhouse

[permalink] [raw]
Subject: Re: [PATCH RFC 01/39] KVM: x86: fix Xen hypercall page msr handling

On Mon, 2020-11-30 at 12:03 +0100, Paolo Bonzini wrote:
> You can use CPUID too (search for Hv#1 in leaf 0x40000000)?

That's leaf 0x40000001, which is also the leaf Xen uses to indicate
the Xen version. So as long as we don't pretend to be Xen version
12579.30280 I suppose that's OK.
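
A standalone snippet, purely for illustration, showing where that version
number comes from:

#include <stdio.h>

int main(void)
{
	unsigned int eax = 0x31237648;	/* 'Hv#1' in CPUID 0x40000001 EAX */

	/* Xen reports its version in the same register: major.minor */
	printf("%u.%u\n", eax >> 16, eax & 0xffff);	/* prints 12579.30280 */
	return 0;
}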

Or we could just check leaf 0x40000000 for 'Microsoft Hv'? Or both.

How about...

#define HYPERV_CPUID_INTERFACE_MAGIC 0x31237648 /* 'Hv#1' */

static inline bool guest_cpu_has_hyperv(struct kvm_vcpu *vcpu)
{
	struct kvm_hv *hv = &vcpu->kvm->arch.hyperv;
	const struct kvm_cpuid_entry2 *entry;

	/*
	 * The call to kvm_find_cpuid_entry() is a bit much to put on
	 * the fast path of some of the Hyper-V MSRs, so bypass it if
	 * the guest OS ID has already been set.
	 */
	if (hv->hv_guest_os_id)
		return true;

	entry = kvm_find_cpuid_entry(vcpu, HYPERV_CPUID_INTERFACE, 0x0);
	return entry && entry->eax == HYPERV_CPUID_INTERFACE_MAGIC;
}

I wonder if this is overengineering when 'let Xen take precedence'
works well enough?



2020-11-30 12:21:37

by Joao Martins

[permalink] [raw]
Subject: Re: [PATCH RFC 11/39] KVM: x86/xen: evtchn signaling via eventfd

On 11/30/20 9:41 AM, David Woodhouse wrote:
> On Wed, 2019-02-20 at 20:15 +0000, Joao Martins wrote:
>> userspace registers a @port to an @eventfd, that is bound to a
>> @vcpu. This information is then used when the guest does an
>> EVTCHNOP_send with a port registered with the kernel.
>
> Why do I want this part?
>
It is unnecessary churn to support eventfd at this point.

The patch subject/name is also a tad misleading, as it implements the
event channel port offloading, with the optional fd being just a small
detail on top.

>> EVTCHNOP_send short-circuiting happens by marking the event as pending
>> in the shared info and vcpu info pages and doing the upcall. For IPIs
>> and interdomain event channels, we do the upcall on the assigned vcpu.
>
> This part I understand, 'peeking' at the EVTCHNOP_send hypercall so
> that we can short-circuit IPI delivery without it having to bounce
> through userspace.
>
> But why would I then want then short-circuit the short-circuit,
> providing an eventfd for it to signal... so that I can then just
> receive the event in userspace in a *different* form to the original
> hypercall exit I would have got?
>

One thing I didn't quite do at the time is the whitelisting of unregistered
ports to userspace. Right now it's a blacklist, i.e. if a port is not handled
in the kernel (IPIs, timer vIRQ, etc.) it goes back to userspace, when really
the only ones which go to userspace should be those explicitly requested as
such, with everything else getting -ENOENT back from the hypercall.

Perhaps the eventfd could be a way to express this? Say, if you register a
port without an eventfd it's offloaded; with an eventfd, it's assigned to
userspace; and if a port is not registered at all, an error is returned
without bothering the VMM.
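
A toy model of that three-way policy (self-contained userspace C; every
name in it is invented for illustration):

#include <errno.h>

enum port_owner { PORT_UNREGISTERED, PORT_KERNEL, PORT_USERSPACE };

static enum port_owner port_table[4096];	/* toy port space */

/* 0: handled in-kernel; 1: exit to the VMM; -ENOENT: nobody registered it. */
static int evtchnop_send(unsigned int port)
{
	if (port >= 4096 || port_table[port] == PORT_UNREGISTERED)
		return -ENOENT;		/* NACK early, no VMM round trip */
	if (port_table[port] == PORT_USERSPACE)
		return 1;		/* explicitly routed to userspace */
	/* kernel fast path: mark pending in shared_info/vcpu_info + upcall */
	return 0;
}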

But still, eventfd is probably unnecessary complexity when another @type
(XEN_EVTCHN_TYPE_USER) would serve, just exiting to userspace and letting
it route its evtchn port handling to its own I/O handling thread.

Joao

2020-11-30 12:58:48

by David Woodhouse

[permalink] [raw]
Subject: Re: [PATCH RFC 11/39] KVM: x86/xen: evtchn signaling via eventfd

On Mon, 2020-11-30 at 12:17 +0000, Joao Martins wrote:
> On 11/30/20 9:41 AM, David Woodhouse wrote:
> > On Wed, 2019-02-20 at 20:15 +0000, Joao Martins wrote:
> > > userspace registers a @port to an @eventfd, that is bound to a
> > > @vcpu. This information is then used when the guest does an
> > > EVTCHNOP_send with a port registered with the kernel.
> >
> > Why do I want this part?
> >
>
> It is unnecessary churn to support eventfd at this point.
>
> The patch subject/name is also a tad misleading, as
> it implements the event channel port offloading with the optional fd
> being just a small detail in addition.

Right, I'd got that the commit title overstated its importance, but was
wondering why the optional fd existed at all.

I looked through the later xen_shim parts, half expecting it to be used
there... but no, they add their own special case next to the place
where eventfd_signal() gets called, instead of hooking the shim up via
an eventfd.

> > > EVTCHNOP_send short-circuiting happens by marking the event as pending
> > > in the shared info and vcpu info pages and doing the upcall. For IPIs
> > > and interdomain event channels, we do the upcall on the assigned vcpu.
> >
> > This part I understand, 'peeking' at the EVTCHNOP_send hypercall so
> > that we can short-circuit IPI delivery without it having to bounce
> > through userspace.
> >
> > But why would I then want then short-circuit the short-circuit,
> > providing an eventfd for it to signal... so that I can then just
> > receive the event in userspace in a *different* form to the original
> > hypercall exit I would have got?
> >
>
> One thing I didn't quite do at the time, is the whitelisting of unregistered
> ports to userspace. Right now, it's a blacklist i.e. if it's not handled in
> the kernel (IPIs, timer vIRQ, etc) it goes back to userspace. When the only
> ones which go to userspace should be explicitly requested as such
> and otherwise return -ENOENT in the hypercall.

Hm, why would -ENOENT be a fast path which needs to be handled in the
kernel?

> Perhaps eventfd could be a way to express this? Like if you register
> without an eventfd it's offloaded, otherwise it's assigned to userspace,
> or if neither it's then returned an error without bothering the VMM.

I much prefer the simple model where the *only* event channels that the
kernel knows about are the ones it's expected to handle.

For any others, the bypass doesn't kick in, and userspace gets the
KVM_EXIT_HYPERCALL exit.
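
On the VMM side that model would look roughly like this (a sketch; how the
Xen hypercall arguments get marshalled into kvm_run is an assumption here,
and vmm_evtchnop_send() is an invented VMM-internal helper):

#include <linux/kvm.h>

#define __HYPERVISOR_event_channel_op	32	/* Xen ABI hypercall number */
#define EVTCHNOP_send			4	/* Xen ABI sub-op */

extern unsigned long vmm_evtchnop_send(unsigned long send_arg_gpa);

static void vmm_handle_exit(struct kvm_run *run)
{
	if (run->exit_reason != KVM_EXIT_HYPERCALL)
		return;

	/* Only ports the kernel doesn't know about land here. */
	if (run->hypercall.nr == __HYPERVISOR_event_channel_op &&
	    run->hypercall.args[0] == EVTCHNOP_send) {
		/* args[1] is a guest pointer to struct evtchn_send;
		 * guest-memory translation is omitted in this sketch. */
		run->hypercall.ret = vmm_evtchnop_send(run->hypercall.args[1]);
	}
}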

> But still, eventfd is probably unnecessary complexity when another @type
> (XEN_EVTCHN_TYPE_USER) would serve, and then just exiting to userspace
> and let it route its evtchn port handling to the its own I/O handling thread.

Hmm... so the benefit of the eventfd is that we can wake the I/O thread
directly instead of bouncing out to userspace on the vCPU thread only
for it to send a signal and return to the guest? Did you ever use that,
and is it worth the additional in-kernel code?

Is there any reason we'd want that for IPI or VIRQ event channels, or
can it be only for INTERDOM/UNBOUND event channels which come later?

I'm tempted to leave it out of the first patch, and *maybe* add it back
in a later patch, putting it in the union alongside .virq.type.


	struct kvm_xen_eventfd {

#define XEN_EVTCHN_TYPE_VIRQ 0
#define XEN_EVTCHN_TYPE_IPI  1
		__u32 type;
		__u32 port;
		__u32 vcpu;
-		__s32 fd;

#define KVM_XEN_EVENTFD_DEASSIGN (1 << 0)
#define KVM_XEN_EVENTFD_UPDATE   (1 << 1)
		__u32 flags;
		union {
			struct {
				__u8 type;
			} virq;
+			struct {
+				__s32 eventfd;
+			} interdom; /* including unbound */
			__u32 padding[2];
		};
	} evtchn;

Does that make sense to you?



2020-11-30 15:14:34

by Joao Martins

[permalink] [raw]
Subject: Re: [PATCH RFC 11/39] KVM: x86/xen: evtchn signaling via eventfd

On 11/30/20 12:55 PM, David Woodhouse wrote:
> On Mon, 2020-11-30 at 12:17 +0000, Joao Martins wrote:
>> On 11/30/20 9:41 AM, David Woodhouse wrote:
>>> On Wed, 2019-02-20 at 20:15 +0000, Joao Martins wrote:
>>>> EVTCHNOP_send short-circuiting happens by marking the event as pending
>>>> in the shared info and vcpu info pages and doing the upcall. For IPIs
>>>> and interdomain event channels, we do the upcall on the assigned vcpu.
>>>
>>> This part I understand, 'peeking' at the EVTCHNOP_send hypercall so
>>> that we can short-circuit IPI delivery without it having to bounce
>>> through userspace.
>>>
>>> But why would I then want then short-circuit the short-circuit,
>>> providing an eventfd for it to signal... so that I can then just
>>> receive the event in userspace in a *different* form to the original
>>> hypercall exit I would have got?
>>>
>>
>> One thing I didn't quite do at the time, is the whitelisting of unregistered
>> ports to userspace. Right now, it's a blacklist i.e. if it's not handled in
>> the kernel (IPIs, timer vIRQ, etc) it goes back to userspace. When the only
>> ones which go to userspace should be explicitly requested as such
>> and otherwise return -ENOENT in the hypercall.
>
> Hm, why would -ENOENT be a fast path which needs to be handled in the
> kernel?
>
It's not that it's a fast path.

Like sending an event to an unbound port: that now becomes a possible vector
to worry about in the userspace VMM, e.g. should that port lookup logic be
fragile.

So it's more along the lines of NACK-ing the invalid port earlier in the
kernel, rather than going to userspace to invalidate it, given that we do
the lookup in the kernel anyway.

>> Perhaps eventfd could be a way to express this? Like if you register
>> without an eventfd it's offloaded, otherwise it's assigned to userspace,
>> or if neither it's then returned an error without bothering the VMM.
>
> I much prefer the simple model where the *only* event channels that the
> kernel knows about are the ones it's expected to handle.
>
> For any others, the bypass doesn't kick in, and userspace gets the
> KVM_EXIT_HYPERCALL exit.
>
/me nods

I should comment on your other patch, but: if we're going to make the
userspace hypercall handling generic, we might as well move Hyper-V there
too. In this series I added KVM_EXIT_XEN, much like the existing
KVM_EXIT_HYPERV -- but with a generic version I wonder if a capability could
gate KVM_EXIT_HYPERCALL to handle both guest types, while disabling
KVM_EXIT_HYPERV. But that is probably the subject of its own separate patch :)

>> But still, eventfd is probably unnecessary complexity when another @type
>> (XEN_EVTCHN_TYPE_USER) would serve, and then just exiting to userspace
>> and let it route its evtchn port handling to the its own I/O handling thread.
>
> Hmm... so the benefit of the eventfd is that we can wake the I/O thread
> directly instead of bouncing out to userspace on the vCPU thread only
> for it to send a signal and return to the guest? Did you ever use that,
> and it is worth the additional in-kernel code?
>
This was my own preemptive optimization of the interface -- it's not worth
the added code for vIRQ and IPI at this point, which are what the kernel
will handle *for sure*.

> Is there any reason we'd want that for IPI or VIRQ event channels, or
> can it be only for INTERDOM/UNBOUND event channels which come later?
>
/me nods.

No reason to keep that for IPI/vIRQ.

> I'm tempted to leave it out of the first patch, and *maybe* add it back
> in a later patch, putting it in the union alongside .virq.type.
>
>
> struct kvm_xen_eventfd {
>
> #define XEN_EVTCHN_TYPE_VIRQ 0
> #define XEN_EVTCHN_TYPE_IPI 1
> __u32 type;
> __u32 port;
> __u32 vcpu;
> - __s32 fd;
>
> #define KVM_XEN_EVENTFD_DEASSIGN (1 << 0)
> #define KVM_XEN_EVENTFD_UPDATE (1 << 1)
> __u32 flags;
> union {
> struct {
> __u8 type;
> } virq;
> + struct {
> + __s32 eventfd;
> + } interdom; /* including unbound */
> __u32 padding[2];
> };
> } evtchn;
>
> Does that make sense to you?
>
Yeap! :)

Joao

2020-11-30 16:52:14

by David Woodhouse

[permalink] [raw]
Subject: Re: [PATCH RFC 11/39] KVM: x86/xen: evtchn signaling via eventfd

On Mon, 2020-11-30 at 15:08 +0000, Joao Martins wrote:
> On 11/30/20 12:55 PM, David Woodhouse wrote:
> > On Mon, 2020-11-30 at 12:17 +0000, Joao Martins wrote:
> > > On 11/30/20 9:41 AM, David Woodhouse wrote:
> > > > On Wed, 2019-02-20 at 20:15 +0000, Joao Martins wrote:
> > > One thing I didn't quite do at the time, is the whitelisting of unregistered
> > > ports to userspace. Right now, it's a blacklist i.e. if it's not handled in
> > > the kernel (IPIs, timer vIRQ, etc) it goes back to userspace. When the only
> > > ones which go to userspace should be explicitly requested as such
> > > and otherwise return -ENOENT in the hypercall.
> >
> > Hm, why would -ENOENT be a fast path which needs to be handled in the
> > kernel?
> >
>
> It's not that it's a fast path.
>
> Like sending an event channel to an unbound vector, now becomes an possible vector to
> worry about in userspace VMM e.g. should that port lookup logic be fragile.
>
> So it's more along the lines of Nack-ing the invalid port earlier to rather go
> to go userspace to invalidate it, provided we do the lookup anyway in the kernel.

If the port lookup logic is fragile, I *want* it in the sandboxed
userspace VMM and not in the kernel :)

And unless we're going to do *all* of the EVTCHNOP_bind*, EVTCHNOP_close,
etc. handling in the kernel, doesn't userspace have to have all that
logic for managing the port space anyway?

I think it's better to let userspace own it outright, and use the
kernel bypass purely for the fast paths. The VMM can even implement
IPI/VIRQ support in userspace, then use the kernel bypass if/when it's
available.

> > > Perhaps eventfd could be a way to express this? Like if you register
> > > without an eventfd it's offloaded, otherwise it's assigned to userspace,
> > > or if neither it's then returned an error without bothering the VMM.
> >
> > I much prefer the simple model where the *only* event channels that the
> > kernel knows about are the ones it's expected to handle.
> >
> > For any others, the bypass doesn't kick in, and userspace gets the
> > KVM_EXIT_HYPERCALL exit.
> >
>
> /me nods
>
> I should comment on your other patch but: if we're going to make it generic for
> the userspace hypercall handling, might as well move hyper-v there too. In this series,
> I added KVM_EXIT_XEN, much like it exists KVM_EXIT_HYPERV -- but with a generic version
> I wonder if a capability could gate KVM_EXIT_HYPERCALL to handle both guest types, while
> disabling KVM_EXIT_HYPERV. But this is probably subject of its own separate patch :)

There's a limit to how much consolidation we can do because the ABI is
different; the args are in different registers.

I do suspect Hyper-V should have marshalled its arguments into the
existing kvm_run->arch.hypercall and used KVM_EXIT_HYPERCALL but I
don't think it makes sense to change it now since it's a user-facing
ABI. I don't want to follow its lead by inventing *another* gratuitous
exit type for Xen though.




2020-11-30 17:18:33

by Joao Martins

[permalink] [raw]
Subject: Re: [PATCH RFC 11/39] KVM: x86/xen: evtchn signaling via eventfd

On 11/30/20 4:48 PM, David Woodhouse wrote:
> On Mon, 2020-11-30 at 15:08 +0000, Joao Martins wrote:
>> On 11/30/20 12:55 PM, David Woodhouse wrote:
>>> On Mon, 2020-11-30 at 12:17 +0000, Joao Martins wrote:
>>>> On 11/30/20 9:41 AM, David Woodhouse wrote:
>>>>> On Wed, 2019-02-20 at 20:15 +0000, Joao Martins wrote:
>>>> One thing I didn't quite do at the time is the whitelisting of unregistered
>>>> ports to userspace. Right now it's a blacklist, i.e. if it's not handled in
>>>> the kernel (IPIs, timer vIRQ, etc.) it goes back to userspace, whereas the
>>>> only ones that should go to userspace are those explicitly requested as
>>>> such; otherwise the hypercall should return -ENOENT.
>>>
>>> Hm, why would -ENOENT be a fast path which needs to be handled in the
>>> kernel?
>>>
>>
>> It's not that it's a fast path.
>>
>> Like sending an event channel to an unbound port now becomes a possible vector to
>> worry about in the userspace VMM, e.g. should that port lookup logic be fragile.
>>
>> So it's more along the lines of NACK-ing the invalid port earlier, rather than going
>> to userspace to invalidate it, provided we do the lookup anyway in the kernel.
>
> If the port lookup logic is fragile, I *want* it in the sandboxed
> userspace VMM and not in the kernel :)
>
Yes definitely -- I think we are on the same page on that.

But it's just that we do the lookup *anyway* to check whether the kernel has a given
evtchn port registered. That's the lookup I am talking about here, with just an
extra bit to tell that it's a userspace-handled port.
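
Something like this, say (a sketch -- the struct/field names here are
hypothetical, not necessarily what the RFC carries):

	struct evtchnfd {
		u32 port;
		u32 vcpu;
		bool to_userspace;	/* whitelisted: punt to the VMM */
		struct eventfd_ctx *ctx;
	};

	/* in the EVTCHNOP_send path */
	evtchnfd = idr_find(&kvm->arch.xen.port_to_evt, port);
	if (!evtchnfd)
		return -ENOENT;		/* NACK unknown ports right here */
	if (evtchnfd->to_userspace)
		return kvm_xen_hcall_to_userspace(vcpu);	/* hypothetical */
	eventfd_signal(evtchnfd->ctx, 1);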

> And unless we're going to do *all* of the EVTCHNOP_bind*, EVTCHN_close,
> etc. handling in the kernel, doesn't userspace have to have all that
> logic for managing the port space anyway?
>
Indeed.

> I think it's better to let userspace own it outright, and use the
> kernel bypass purely for the fast paths. The VMM can even implement
> IPI/VIRQ support in userspace, then use the kernel bypass if/when it's
> available.
>
True, and it's pretty much how it's implemented today.

But I felt it was still worth having this discussion on whether this should be
considered or discarded. I suppose we stick with the latter for now.

>>>> Perhaps eventfd could be a way to express this? Like if you register
>>>> without an eventfd it's offloaded, otherwise it's assigned to userspace,
>>>> or if neither it's then returned an error without bothering the VMM.
>>>
>>> I much prefer the simple model where the *only* event channels that the
>>> kernel knows about are the ones it's expected to handle.
>>>
>>> For any others, the bypass doesn't kick in, and userspace gets the
>>> KVM_EXIT_HYPERCALL exit.
>>>
>>
>> /me nods
>>
>> I should comment on your other patch but: if we're going to make it generic for
>> the userspace hypercall handling, might as well move hyper-v there too. In this series,
>> I added KVM_EXIT_XEN, much like it exists KVM_EXIT_HYPERV -- but with a generic version
>> I wonder if a capability could gate KVM_EXIT_HYPERCALL to handle both guest types, while
>> disabling KVM_EXIT_HYPERV. But this is probably subject of its own separate patch :)
>
> There's a limit to how much consolidation we can do because the ABI is
> different; the args are in different registers.
>
Yes. It would be optionally enabled of course and VMM would have to adjust to the new ABI
-- surely wouldn't want to break current users of KVM_EXIT_HYPERV.

> I do suspect Hyper-V should have marshalled its arguments into the
> existing kvm_run->arch.hypercall and used KVM_EXIT_HYPERCALL but I
> don't think it makes sense to change it now since it's a user-facing
> ABI. I don't want to follow its lead by inventing *another* gratuitous
> exit type for Xen though.
>
I definitely like the KVM_EXIT_HYPERCALL better than a KVM_EXIT_XEN userspace
exit type ;)

But I guess you still need to correlate the type of hypercall (Xen guest cap enabled?) to
tell whether it's Xen or KVM, in order to specially enlighten certain opcodes (EVTCHNOP_send).

Joao

2020-11-30 18:07:16

by David Woodhouse

[permalink] [raw]
Subject: Re: [PATCH RFC 11/39] KVM: x86/xen: evtchn signaling via eventfd

On Mon, 2020-11-30 at 17:15 +0000, Joao Martins wrote:
> On 11/30/20 4:48 PM, David Woodhouse wrote:
> > On Mon, 2020-11-30 at 15:08 +0000, Joao Martins wrote:
> > > On 11/30/20 12:55 PM, David Woodhouse wrote:
> > > > On Mon, 2020-11-30 at 12:17 +0000, Joao Martins wrote:
> > > > > On 11/30/20 9:41 AM, David Woodhouse wrote:
> > > > > > On Wed, 2019-02-20 at 20:15 +0000, Joao Martins wrote:
> > > > >
> > > > > One thing I didn't quite do at the time is the whitelisting of unregistered
> > > > > ports to userspace.

...

> But I felt it was still worth having this discussion on whether this should be
> considered or discarded. I suppose we stick with the latter for now.

Ack. Duly discarded :)

> > > > > Perhaps eventfd could be a way to express this? Like if you register
> > > > > without an eventfd it's offloaded, otherwise it's assigned to userspace,
> > > > > or if neither it's then returned an error without bothering the VMM.
> > > >
> > > > I much prefer the simple model where the *only* event channels that the
> > > > kernel knows about are the ones it's expected to handle.
> > > >
> > > > For any others, the bypass doesn't kick in, and userspace gets the
> > > > KVM_EXIT_HYPERCALL exit.
> > > >
> > >
> > > /me nods
> > >
> > > I should comment on your other patch but: if we're going to make it generic for
> > > the userspace hypercall handling, might as well move hyper-v there too. In this series,
> > > I added KVM_EXIT_XEN, much like it exists KVM_EXIT_HYPERV -- but with a generic version
> > > I wonder if a capability could gate KVM_EXIT_HYPERCALL to handle both guest types, while
> > > disabling KVM_EXIT_HYPERV. But this is probably subject of its own separate patch :)
> >
> > There's a limit to how much consolidation we can do because the ABI is
> > different; the args are in different registers.
> >
>
> Yes. It would be optionally enabled of course and VMM would have to adjust to the new ABI
> -- surely wouldn't want to break current users of KVM_EXIT_HYPERV.

True, but that means we'd have to keep KVM_EXIT_HYPERV around anyway,
and can't actually *remove* it. The "consolidation" gives us more
complexity, not less.

> > I do suspect Hyper-V should have marshalled its arguments into the
> > existing kvm_run->arch.hypercall and used KVM_EXIT_HYPERCALL but I
> > don't think it makes sense to change it now since it's a user-facing
> > ABI. I don't want to follow its lead by inventing *another* gratuitous
> > exit type for Xen though.
> >
>
> I definitely like the KVM_EXIT_HYPERCALL better than a KVM_EXIT_XEN userspace
> exit type ;)
>
> But I guess you still need to correlate the type of hypercall (Xen guest cap enabled?) to
> tell whether it's Xen or KVM, in order to specially enlighten certain opcodes (EVTCHNOP_send).

Sure, but if the VMM doesn't know what kind of guest it's hosting, we
have bigger problems... :)



2020-11-30 18:48:03

by Joao Martins

[permalink] [raw]
Subject: Re: [PATCH RFC 11/39] KVM: x86/xen: evtchn signaling via eventfd

On 11/30/20 6:01 PM, David Woodhouse wrote:
> On Mon, 2020-11-30 at 17:15 +0000, Joao Martins wrote:
>> On 11/30/20 4:48 PM, David Woodhouse wrote:
>>> On Mon, 2020-11-30 at 15:08 +0000, Joao Martins wrote:
>>>> On 11/30/20 12:55 PM, David Woodhouse wrote:
>>>>> On Mon, 2020-11-30 at 12:17 +0000, Joao Martins wrote:
>>>>>> On 11/30/20 9:41 AM, David Woodhouse wrote:
>>>>>>> On Wed, 2019-02-20 at 20:15 +0000, Joao Martins wrote:

[...]

>>>> I should comment on your other patch but: if we're going to make it generic for
>>>> the userspace hypercall handling, might as well move hyper-v there too. In this series,
>>>> I added KVM_EXIT_XEN, much like it exists KVM_EXIT_HYPERV -- but with a generic version
>>>> I wonder if a capability could gate KVM_EXIT_HYPERCALL to handle both guest types, while
>>>> disabling KVM_EXIT_HYPERV. But this is probably subject of its own separate patch :)
>>>
>>> There's a limit to how much consolidation we can do because the ABI is
>>> different; the args are in different registers.
>>>
>>
>> Yes. It would be optionally enabled of course and VMM would have to adjust to the new ABI
>> -- surely wouldn't want to break current users of KVM_EXIT_HYPERV.
>
> True, but that means we'd have to keep KVM_EXIT_HYPERV around anyway,
> and can't actually *remove* it. The "consolidation" gives us more
> complexity, not less.
>
Fair point.

>>> I do suspect Hyper-V should have marshalled its arguments into the
>>> existing kvm_run->arch.hypercall and used KVM_EXIT_HYPERCALL but I
>>> don't think it makes sense to change it now since it's a user-facing
>>> ABI. I don't want to follow its lead by inventing *another* gratuitous
>>> exit type for Xen though.
>>>
>>
>> I definitely like the KVM_EXIT_HYPERCALL better than a KVM_EXIT_XEN userspace
>> exit type ;)
>>
>> But I guess you still need to correlate the type of hypercall (Xen guest cap enabled?) to
>> tell whether it's Xen or KVM, in order to specially enlighten certain opcodes (EVTCHNOP_send).
>
> Sure, but if the VMM doesn't know what kind of guest it's hosting, we
> have bigger problems... :)
>
Right :)

I was referring to the kernel here.

Eventually we need to special-case things for a given guest type, e.g.:

int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
{
	...
	if (kvm_hv_hypercall_enabled(vcpu->kvm))
		return kvm_hv_hypercall(...);

	if (kvm_xen_hypercall_enabled(vcpu->kvm))
		return kvm_xen_hypercall(...);
	...
}

And in kvm_xen_hypercall(), for the cases the VMM offloads, demarshal what the registers
mean, e.g. for an event channel send from a 64-bit guest: RAX for the opcode and RDI/RSI
for cmd and port.

The kernel logic wouldn't be much different at the core, so I thought of this consolidation.
But the added complexity would have come from having to deal with two userspace exit types
-- indeed probably not worth the trouble as you pointed out.

2020-11-30 19:08:55

by David Woodhouse

[permalink] [raw]
Subject: Re: [PATCH RFC 11/39] KVM: x86/xen: evtchn signaling via eventfd

On Mon, 2020-11-30 at 18:41 +0000, Joao Martins wrote:
> int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
> {
> 	...
> 	if (kvm_hv_hypercall_enabled(vcpu->kvm))
> 		return kvm_hv_hypercall(...);
>
> 	if (kvm_xen_hypercall_enabled(vcpu->kvm))
> 		return kvm_xen_hypercall(...);
> 	...
> }
>
> And in kvm_xen_hypercall(), for the cases the VMM offloads, demarshal what the registers
> mean, e.g. for an event channel send from a 64-bit guest: RAX for the opcode and RDI/RSI
> for cmd and port.

Right, although it's a little more abstract than that: "RDI/RSI for
arg#0, arg#1 respectively".

And those are RDI/RSI for 64-bit Xen, EBX/ECX for 32-bit Xen, and
RBX/RDI for Hyper-V. (And Hyper-V seems to use only the two, while Xen
theoretically has up to 6).
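
(For reference, the full Xen argument mapping -- matching what
kvm_xen_hypercall() below ends up reading:

	arg#:		0	1	2	3	4	5
	64-bit Xen:	RDI	RSI	RDX	R10	R8	R9
	32-bit Xen:	EBX	ECX	EDX	ESI	EDI	EBP
)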

> The kernel logic wouldn't be much different at the core, so I thought of this consolidation.
> But the added complexity would have come from having to deal with two userspace exit types
> -- indeed probably not worth the trouble as you pointed out.

Yeah, I think I'm just going to move the 'kvm_userspace_hypercall()'
from my patch to be 'kvm_xen_hypercall()' in a new xen.c but still
using KVM_EXIT_HYPERCALL. Then I can rebase your other patches on top
of that, with the evtchn bypass.
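
So the tail of kvm_xen_hypercall() would become something like this
(just a sketch; it reuses the existing kvm_run->hypercall struct instead
of adding a new kvm_xen_exit):

	vcpu->run->exit_reason = KVM_EXIT_HYPERCALL;
	vcpu->run->hypercall.nr = input;
	memcpy(vcpu->run->hypercall.args, params, sizeof(params));
	vcpu->run->hypercall.longmode = longmode;
	vcpu->arch.xen.hypercall_rip = kvm_get_linear_rip(vcpu);
	vcpu->arch.complete_userspace_io = kvm_xen_hypercall_complete_userspace;
	return 0;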



2020-11-30 19:28:03

by Joao Martins

[permalink] [raw]
Subject: Re: [PATCH RFC 11/39] KVM: x86/xen: evtchn signaling via eventfd



On 11/30/20 7:04 PM, David Woodhouse wrote:
> On Mon, 2020-11-30 at 18:41 +0000, Joao Martins wrote:
>> int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
>> {
>> 	...
>> 	if (kvm_hv_hypercall_enabled(vcpu->kvm))
>> 		return kvm_hv_hypercall(...);
>>
>> 	if (kvm_xen_hypercall_enabled(vcpu->kvm))
>> 		return kvm_xen_hypercall(...);
>> 	...
>> }
>>
>> And in kvm_xen_hypercall(), for the cases the VMM offloads, demarshal what the registers
>> mean, e.g. for an event channel send from a 64-bit guest: RAX for the opcode and RDI/RSI
>> for cmd and port.
>
> Right, although it's a little more abstract than that: "RDI/RSI for
> arg#0, arg#1 respectively".
>
> And those are RDI/RSI for 64-bit Xen, EBX/ECX for 32-bit Xen, and
> RBX/RDI for Hyper-V. (And Hyper-V seems to use only the two, while Xen
> theoretically has up to 6).
>
Indeed, it almost reminds me of my other patch for Xen hypercalls -- it was handling
32-bit and 64-bit that way:

https://lore.kernel.org/kvm/[email protected]/

>> The kernel logic wouldn't be much different at the core, so I thought of this consolidation.
>> But the added complexity would have come from having to deal with two userspace exit types
>> -- indeed probably not worth the trouble as you pointed out.
>
> Yeah, I think I'm just going to move the 'kvm_userspace_hypercall()'
> from my patch to be 'kvm_xen_hypercall()' in a new xen.c but still
> using KVM_EXIT_HYPERCALL. Then I can rebase your other patches on top
> of that, with the evtchn bypass.
>
Yeap, makes sense.

Joao

2020-12-01 13:12:22

by David Woodhouse

[permalink] [raw]
Subject: Re: [PATCH RFC 03/39] KVM: x86/xen: register shared_info page

On Wed, 2019-02-20 at 20:15 +0000, Joao Martins wrote:
> +static int kvm_xen_shared_info_init(struct kvm *kvm, gfn_t gfn)
> +{
> + struct shared_info *shared_info;
> + struct page *page;
> +
> + page = gfn_to_page(kvm, gfn);
> + if (is_error_page(page))
> + return -EINVAL;
> +
> + kvm->arch.xen.shinfo_addr = gfn;
> +
> + shared_info = page_to_virt(page);
> + memset(shared_info, 0, sizeof(struct shared_info));
> + kvm->arch.xen.shinfo = shared_info;
> + return 0;
> +}
> +

Hm.

How come we get to pin the page and directly dereference it every time,
while kvm_setup_pvclock_page() has to use kvm_write_guest_cached()
instead?

If that was allowed, wouldn't it have been a much simpler fix for
CVE-2019-3016? What am I missing?

Should I rework these to use kvm_write_guest_cached()?
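
(That is, keep a gfn_to_hva_cache instead of a pinned page -- something
like this sketch:

	struct gfn_to_hva_cache ghc;

	if (kvm_gfn_to_hva_cache_init(kvm, &ghc, gpa,
				      sizeof(struct shared_info)))
		return -EINVAL;
	...
	kvm_write_guest_cached(kvm, &ghc, &data, sizeof(data));
)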




2020-12-01 22:34:07

by David Woodhouse

[permalink] [raw]
Subject: Re: [PATCH RFC 02/39] KVM: x86/xen: intercept xen hypercalls if enabled

On Wed, 2019-02-20 at 20:15 +0000, Joao Martins wrote:
> Add a new exit reason for emulator to handle Xen hypercalls.
> Albeit these are injected only if guest has initialized the Xen
> hypercall page

I've reworked this a little.

I didn't like the inconsistency of allowing userspace to provide the
hypercall pages even though the ABI is now defined by the kernel and it
*has* to be VMCALL/VMMCALL.

So I switched it to generate the hypercall page directly from the
kernel, just like we do for the Hyper-V hypercall page.

I introduced a new flag in the xen_hvm_config struct to enable this
behaviour, and advertised it in the KVM_CAP_XEN_HVM return value.

I also added the cpl and support for 6-argument hypercalls, and made it
check the guest RIP when completing the call as discussed (although I
still think that probably ought to be a generic thing).

I adjusted the test case from my version of the patch, and added
support for actually testing the hypercall page MSR.

https://git.infradead.org/users/dwmw2/linux.git/shortlog/refs/heads/xenpv

I'll go through and rebase your patch series at least up to patch 16
and collect them in that tree, then probably post them for real once
I've got everything working locally.


=======================
From c037c329c8867b219afe2100e383c62e9db7b06d Mon Sep 17 00:00:00 2001
From: Joao Martins <[email protected]>
Date: Wed, 13 Jun 2018 09:55:44 -0400
Subject: [PATCH] KVM: x86/xen: intercept xen hypercalls if enabled

Add a new exit reason for emulator to handle Xen hypercalls.

Since this means KVM owns the ABI, dispense with the facility for the
VMM to provide its own copy of the hypercall pages; just fill them in
directly using VMCALL/VMMCALL as we do for the Hyper-V hypercall page.

This behaviour is enabled by a new INTERCEPT_HCALL flag in the
KVM_XEN_HVM_CONFIG ioctl structure, and advertised by the same flag
being returned from the KVM_CAP_XEN_HVM check.

Add a test case and shift xen_hvm_config() to the nascent xen.c while
we're at it.

Signed-off-by: Joao Martins <[email protected]>
Signed-off-by: David Woodhouse <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 6 +
arch/x86/kvm/Makefile | 2 +-
arch/x86/kvm/trace.h | 36 +++++
arch/x86/kvm/x86.c | 46 +++---
arch/x86/kvm/xen.c | 140 ++++++++++++++++++
arch/x86/kvm/xen.h | 21 +++
include/uapi/linux/kvm.h | 19 +++
tools/testing/selftests/kvm/Makefile | 1 +
tools/testing/selftests/kvm/lib/kvm_util.c | 1 +
.../selftests/kvm/x86_64/xen_vmcall_test.c | 123 +++++++++++++++
10 files changed, 365 insertions(+), 30 deletions(-)
create mode 100644 arch/x86/kvm/xen.c
create mode 100644 arch/x86/kvm/xen.h
create mode 100644 tools/testing/selftests/kvm/x86_64/xen_vmcall_test.c

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 7e5f33a0d0e2..9de3229e91e1 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -520,6 +520,11 @@ struct kvm_vcpu_hv {
cpumask_t tlb_flush;
};

+/* Xen HVM per vcpu emulation context */
+struct kvm_vcpu_xen {
+ u64 hypercall_rip;
+};
+
struct kvm_vcpu_arch {
/*
* rip and regs accesses must go through
@@ -717,6 +722,7 @@ struct kvm_vcpu_arch {
unsigned long singlestep_rip;

struct kvm_vcpu_hv hyperv;
+ struct kvm_vcpu_xen xen;

cpumask_var_t wbinvd_dirty_mask;

diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
index b804444e16d4..8bee4afc1fec 100644
--- a/arch/x86/kvm/Makefile
+++ b/arch/x86/kvm/Makefile
@@ -13,7 +13,7 @@ kvm-y += $(KVM)/kvm_main.o $(KVM)/coalesced_mmio.o \
$(KVM)/eventfd.o $(KVM)/irqchip.o $(KVM)/vfio.o
kvm-$(CONFIG_KVM_ASYNC_PF) += $(KVM)/async_pf.o

-kvm-y += x86.o emulate.o i8259.o irq.o lapic.o \
+kvm-y += x86.o emulate.o i8259.o irq.o lapic.o xen.o \
i8254.o ioapic.o irq_comm.o cpuid.o pmu.o mtrr.o \
hyperv.o debugfs.o mmu/mmu.o mmu/page_track.o \
mmu/spte.o mmu/tdp_iter.o mmu/tdp_mmu.o
diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h
index aef960f90f26..d28ecb37b62c 100644
--- a/arch/x86/kvm/trace.h
+++ b/arch/x86/kvm/trace.h
@@ -92,6 +92,42 @@ TRACE_EVENT(kvm_hv_hypercall,
__entry->outgpa)
);

+/*
+ * Tracepoint for Xen hypercall.
+ */
+TRACE_EVENT(kvm_xen_hypercall,
+ TP_PROTO(unsigned long nr, unsigned long a0, unsigned long a1,
+ unsigned long a2, unsigned long a3, unsigned long a4,
+ unsigned long a5),
+ TP_ARGS(nr, a0, a1, a2, a3, a4, a5),
+
+ TP_STRUCT__entry(
+ __field(unsigned long, nr)
+ __field(unsigned long, a0)
+ __field(unsigned long, a1)
+ __field(unsigned long, a2)
+ __field(unsigned long, a3)
+ __field(unsigned long, a4)
+ __field(unsigned long, a5)
+ ),
+
+ TP_fast_assign(
+ __entry->nr = nr;
+ __entry->a0 = a0;
+ __entry->a1 = a1;
+ __entry->a2 = a2;
+ __entry->a3 = a3;
+ __entry->a4 = a4;
+ __entry->a5 = a5;
+ ),
+
+ TP_printk("nr 0x%lx a0 0x%lx a1 0x%lx a2 0x%lx a3 0x%lx a4 0x%lx a5 0x%lx",
+ __entry->nr, __entry->a0, __entry->a1, __entry->a2,
+ __entry->a3, __entry->a4, __entry->a5)
+);
+
+
+
/*
* Tracepoint for PIO.
*/
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 0836023963ec..adf04e8cc64a 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -29,6 +29,7 @@
#include "pmu.h"
#include "hyperv.h"
#include "lapic.h"
+#include "xen.h"

#include <linux/clocksource.h>
#include <linux/interrupt.h>
@@ -2842,32 +2843,6 @@ static int set_msr_mce(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
return 0;
}

-static int xen_hvm_config(struct kvm_vcpu *vcpu, u64 data)
-{
- struct kvm *kvm = vcpu->kvm;
- int lm = is_long_mode(vcpu);
- u8 *blob_addr = lm ? (u8 *)(long)kvm->arch.xen_hvm_config.blob_addr_64
- : (u8 *)(long)kvm->arch.xen_hvm_config.blob_addr_32;
- u8 blob_size = lm ? kvm->arch.xen_hvm_config.blob_size_64
- : kvm->arch.xen_hvm_config.blob_size_32;
- u32 page_num = data & ~PAGE_MASK;
- u64 page_addr = data & PAGE_MASK;
- u8 *page;
-
- if (page_num >= blob_size)
- return 1;
-
- page = memdup_user(blob_addr + (page_num * PAGE_SIZE), PAGE_SIZE);
- if (IS_ERR(page))
- return PTR_ERR(page);
-
- if (kvm_vcpu_write_guest(vcpu, page_addr, page, PAGE_SIZE)) {
- kfree(page);
- return 1;
- }
- return 0;
-}
-
static inline bool kvm_pv_async_pf_enabled(struct kvm_vcpu *vcpu)
{
u64 mask = KVM_ASYNC_PF_ENABLED | KVM_ASYNC_PF_DELIVERY_AS_INT;
@@ -3002,7 +2977,7 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
u64 data = msr_info->data;

if (msr && (msr == vcpu->kvm->arch.xen_hvm_config.msr))
- return xen_hvm_config(vcpu, data);
+ return kvm_xen_hvm_config(vcpu, data);

switch (msr) {
case MSR_AMD64_NB_CFG:
@@ -3703,7 +3678,6 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
case KVM_CAP_PIT2:
case KVM_CAP_PIT_STATE2:
case KVM_CAP_SET_IDENTITY_MAP_ADDR:
- case KVM_CAP_XEN_HVM:
case KVM_CAP_VCPU_EVENTS:
case KVM_CAP_HYPERV:
case KVM_CAP_HYPERV_VAPIC:
@@ -3742,6 +3716,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
case KVM_CAP_ENFORCE_PV_FEATURE_CPUID:
r = 1;
break;
+ case KVM_CAP_XEN_HVM:
+ r = 1 | KVM_XEN_HVM_CONFIG_INTERCEPT_HCALL;
+ break;
case KVM_CAP_SYNC_REGS:
r = KVM_SYNC_X86_VALID_FIELDS;
break;
@@ -5603,7 +5580,15 @@ long kvm_arch_vm_ioctl(struct file *filp,
if (copy_from_user(&xhc, argp, sizeof(xhc)))
goto out;
r = -EINVAL;
- if (xhc.flags)
+ if (xhc.flags & ~KVM_XEN_HVM_CONFIG_INTERCEPT_HCALL)
+ goto out;
+ /*
+ * With hypercall interception the kernel generates its own
+ * hypercall page so it must not be provided.
+ */
+ if ((xhc.flags & KVM_XEN_HVM_CONFIG_INTERCEPT_HCALL) &&
+ (xhc.blob_addr_32 || xhc.blob_addr_64 ||
+ xhc.blob_size_32 || xhc.blob_size_64))
goto out;
memcpy(&kvm->arch.xen_hvm_config, &xhc, sizeof(xhc));
r = 0;
@@ -8066,6 +8051,9 @@ int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
unsigned long nr, a0, a1, a2, a3, ret;
int op_64_bit;

+ if (kvm_xen_hypercall_enabled(vcpu->kvm))
+ return kvm_xen_hypercall(vcpu);
+
if (kvm_hv_hypercall_enabled(vcpu->kvm))
return kvm_hv_hypercall(vcpu);

diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
new file mode 100644
index 000000000000..6400a4bc8480
--- /dev/null
+++ b/arch/x86/kvm/xen.c
@@ -0,0 +1,140 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright © 2019 Oracle and/or its affiliates. All rights reserved.
+ * Copyright © 2020 Amazon.com, Inc. or its affiliates. All Rights Reserved.
+ *
+ * KVM Xen emulation
+ */
+
+#include "x86.h"
+#include "xen.h"
+
+#include <linux/kvm_host.h>
+
+#include <trace/events/kvm.h>
+
+#include "trace.h"
+
+int kvm_xen_hvm_config(struct kvm_vcpu *vcpu, u64 data)
+{
+ struct kvm *kvm = vcpu->kvm;
+ u32 page_num = data & ~PAGE_MASK;
+ u64 page_addr = data & PAGE_MASK;
+
+ /*
+ * If Xen hypercall intercept is enabled, fill the hypercall
+ * page with VMCALL/VMMCALL instructions since that's what
+ * we catch. Else the VMM has provided the hypercall pages
+ * with instructions of its own choosing, so use those.
+ */
+ if (kvm_xen_hypercall_enabled(kvm)) {
+ u8 instructions[32];
+ int i;
+
+ if (page_num)
+ return 1;
+
+ /* mov imm32, %eax */
+ instructions[0] = 0xb8;
+
+ /* vmcall / vmmcall */
+ kvm_x86_ops.patch_hypercall(vcpu, instructions + 5);
+
+ /* ret */
+ instructions[8] = 0xc3;
+
+ /* int3 to pad */
+ memset(instructions + 9, 0xcc, sizeof(instructions) - 9);
+
+ for (i = 0; i < PAGE_SIZE / sizeof(instructions); i++) {
+ *(u32 *)&instructions[1] = i;
+ if (kvm_vcpu_write_guest(vcpu,
+ page_addr + (i * sizeof(instructions)),
+ instructions, sizeof(instructions)))
+ return 1;
+ }
+ } else {
+ int lm = is_long_mode(vcpu);
+ u8 *blob_addr = lm ? (u8 *)(long)kvm->arch.xen_hvm_config.blob_addr_64
+ : (u8 *)(long)kvm->arch.xen_hvm_config.blob_addr_32;
+ u8 blob_size = lm ? kvm->arch.xen_hvm_config.blob_size_64
+ : kvm->arch.xen_hvm_config.blob_size_32;
+ u8 *page;
+
+ if (page_num >= blob_size)
+ return 1;
+
+ page = memdup_user(blob_addr + (page_num * PAGE_SIZE), PAGE_SIZE);
+ if (IS_ERR(page))
+ return PTR_ERR(page);
+
+ if (kvm_vcpu_write_guest(vcpu, page_addr, page, PAGE_SIZE)) {
+ kfree(page);
+ return 1;
+ }
+ }
+ return 0;
+}
+
+static int kvm_xen_hypercall_set_result(struct kvm_vcpu *vcpu, u64 result)
+{
+ kvm_rax_write(vcpu, result);
+ return kvm_skip_emulated_instruction(vcpu);
+}
+
+static int kvm_xen_hypercall_complete_userspace(struct kvm_vcpu *vcpu)
+{
+ struct kvm_run *run = vcpu->run;
+
+ if (unlikely(!kvm_is_linear_rip(vcpu, vcpu->arch.xen.hypercall_rip)))
+ return 1;
+
+ return kvm_xen_hypercall_set_result(vcpu, run->xen.u.hcall.result);
+}
+
+int kvm_xen_hypercall(struct kvm_vcpu *vcpu)
+{
+ bool longmode;
+ u64 input, params[6];
+
+ input = (u64)kvm_register_read(vcpu, VCPU_REGS_RAX);
+
+ longmode = is_64_bit_mode(vcpu);
+ if (!longmode) {
+ params[0] = (u32)kvm_rbx_read(vcpu);
+ params[1] = (u32)kvm_rcx_read(vcpu);
+ params[2] = (u32)kvm_rdx_read(vcpu);
+ params[3] = (u32)kvm_rsi_read(vcpu);
+ params[4] = (u32)kvm_rdi_read(vcpu);
+ params[5] = (u32)kvm_rbp_read(vcpu);
+ }
+#ifdef CONFIG_X86_64
+ else {
+ params[0] = (u64)kvm_rdi_read(vcpu);
+ params[1] = (u64)kvm_rsi_read(vcpu);
+ params[2] = (u64)kvm_rdx_read(vcpu);
+ params[3] = (u64)kvm_r10_read(vcpu);
+ params[4] = (u64)kvm_r8_read(vcpu);
+ params[5] = (u64)kvm_r9_read(vcpu);
+ }
+#endif
+ trace_kvm_xen_hypercall(input, params[0], params[1], params[2],
+ params[3], params[4], params[5]);
+
+ vcpu->run->exit_reason = KVM_EXIT_XEN;
+ vcpu->run->xen.type = KVM_EXIT_XEN_HCALL;
+ vcpu->run->xen.u.hcall.longmode = longmode;
+ vcpu->run->xen.u.hcall.cpl = kvm_x86_ops.get_cpl(vcpu);
+ vcpu->run->xen.u.hcall.input = input;
+ vcpu->run->xen.u.hcall.params[0] = params[0];
+ vcpu->run->xen.u.hcall.params[1] = params[1];
+ vcpu->run->xen.u.hcall.params[2] = params[2];
+ vcpu->run->xen.u.hcall.params[3] = params[3];
+ vcpu->run->xen.u.hcall.params[4] = params[4];
+ vcpu->run->xen.u.hcall.params[5] = params[5];
+ vcpu->arch.xen.hypercall_rip = kvm_get_linear_rip(vcpu);
+ vcpu->arch.complete_userspace_io =
+ kvm_xen_hypercall_complete_userspace;
+
+ return 0;
+}
diff --git a/arch/x86/kvm/xen.h b/arch/x86/kvm/xen.h
new file mode 100644
index 000000000000..81e12f716d2e
--- /dev/null
+++ b/arch/x86/kvm/xen.h
@@ -0,0 +1,21 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright © 2019 Oracle and/or its affiliates. All rights reserved.
+ * Copyright © 2020 Amazon.com, Inc. or its affiliates. All Rights Reserved.
+ *
+ * KVM Xen emulation
+ */
+
+#ifndef __ARCH_X86_KVM_XEN_H__
+#define __ARCH_X86_KVM_XEN_H__
+
+int kvm_xen_hypercall(struct kvm_vcpu *vcpu);
+int kvm_xen_hvm_config(struct kvm_vcpu *vcpu, u64 data);
+
+static inline bool kvm_xen_hypercall_enabled(struct kvm *kvm)
+{
+ return kvm->arch.xen_hvm_config.flags &
+ KVM_XEN_HVM_CONFIG_INTERCEPT_HCALL;
+}
+
+#endif /* __ARCH_X86_KVM_XEN_H__ */
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index ca41220b40b8..00221fe56994 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -216,6 +216,20 @@ struct kvm_hyperv_exit {
} u;
};

+struct kvm_xen_exit {
+#define KVM_EXIT_XEN_HCALL 1
+ __u32 type;
+ union {
+ struct {
+ __u32 longmode;
+ __u32 cpl;
+ __u64 input;
+ __u64 result;
+ __u64 params[6];
+ } hcall;
+ } u;
+};
+
#define KVM_S390_GET_SKEYS_NONE 1
#define KVM_S390_SKEYS_MAX 1048576

@@ -250,6 +264,7 @@ struct kvm_hyperv_exit {
#define KVM_EXIT_ARM_NISV 28
#define KVM_EXIT_X86_RDMSR 29
#define KVM_EXIT_X86_WRMSR 30
+#define KVM_EXIT_XEN 31

/* For KVM_EXIT_INTERNAL_ERROR */
/* Emulate instruction failed. */
@@ -426,6 +441,8 @@ struct kvm_run {
__u32 index; /* kernel -> user */
__u64 data; /* kernel <-> user */
} msr;
+ /* KVM_EXIT_XEN */
+ struct kvm_xen_exit xen;
/* Fix the size of the union. */
char padding[256];
};
@@ -1126,6 +1143,8 @@ struct kvm_x86_mce {
#endif

#ifdef KVM_CAP_XEN_HVM
+#define KVM_XEN_HVM_CONFIG_INTERCEPT_HCALL (1 << 1)
+
struct kvm_xen_hvm_config {
__u32 flags;
__u32 msr;
diff --git a/tools/testing/selftests/kvm/Makefile b/tools/testing/selftests/kvm/Makefile
index 3d14ef77755e..d94abec627e6 100644
--- a/tools/testing/selftests/kvm/Makefile
+++ b/tools/testing/selftests/kvm/Makefile
@@ -59,6 +59,7 @@ TEST_GEN_PROGS_x86_64 += x86_64/xss_msr_test
TEST_GEN_PROGS_x86_64 += x86_64/debug_regs
TEST_GEN_PROGS_x86_64 += x86_64/tsc_msrs_test
TEST_GEN_PROGS_x86_64 += x86_64/user_msr_test
+TEST_GEN_PROGS_x86_64 += x86_64/xen_vmcall_test
TEST_GEN_PROGS_x86_64 += demand_paging_test
TEST_GEN_PROGS_x86_64 += dirty_log_test
TEST_GEN_PROGS_x86_64 += dirty_log_perf_test
diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c
index 126c6727a6b0..6e96ae47d28c 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util.c
+++ b/tools/testing/selftests/kvm/lib/kvm_util.c
@@ -1654,6 +1654,7 @@ static struct exit_reason {
{KVM_EXIT_INTERNAL_ERROR, "INTERNAL_ERROR"},
{KVM_EXIT_OSI, "OSI"},
{KVM_EXIT_PAPR_HCALL, "PAPR_HCALL"},
+ {KVM_EXIT_XEN, "XEN"},
#ifdef KVM_EXIT_MEMORY_NOT_PRESENT
{KVM_EXIT_MEMORY_NOT_PRESENT, "MEMORY_NOT_PRESENT"},
#endif
diff --git a/tools/testing/selftests/kvm/x86_64/xen_vmcall_test.c b/tools/testing/selftests/kvm/x86_64/xen_vmcall_test.c
new file mode 100644
index 000000000000..3f1dd93626e5
--- /dev/null
+++ b/tools/testing/selftests/kvm/x86_64/xen_vmcall_test.c
@@ -0,0 +1,123 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * svm_vmcall_test
+ *
+ * Copyright © 2020 Amazon.com, Inc. or its affiliates.
+ *
+ * Userspace hypercall testing
+ */
+
+#include "test_util.h"
+#include "kvm_util.h"
+#include "processor.h"
+
+#define VCPU_ID 5
+
+#define HCALL_REGION_GPA 0xc0000000ULL
+#define HCALL_REGION_SLOT 10
+
+static struct kvm_vm *vm;
+
+#define INPUTVALUE 17
+#define ARGVALUE(x) (0xdeadbeef5a5a0000UL + x)
+#define RETVALUE 0xcafef00dfbfbffffUL
+
+#define XEN_HYPERCALL_MSR 0x40000000
+
+static void guest_code(void)
+{
+ unsigned long rax = INPUTVALUE;
+ unsigned long rdi = ARGVALUE(1);
+ unsigned long rsi = ARGVALUE(2);
+ unsigned long rdx = ARGVALUE(3);
+ register unsigned long r10 __asm__("r10") = ARGVALUE(4);
+ register unsigned long r8 __asm__("r8") = ARGVALUE(5);
+ register unsigned long r9 __asm__("r9") = ARGVALUE(6);
+
+ /* First a direct invocation of 'vmcall' */
+ __asm__ __volatile__("vmcall" :
+ "=a"(rax) :
+ "a"(rax), "D"(rdi), "S"(rsi), "d"(rdx),
+ "r"(r10), "r"(r8), "r"(r9));
+ GUEST_ASSERT(rax == RETVALUE);
+
+ /* Now fill in the hypercall page */
+ __asm__ __volatile__("wrmsr" : : "c" (XEN_HYPERCALL_MSR),
+ "a" (HCALL_REGION_GPA & 0xffffffff),
+ "d" (HCALL_REGION_GPA >> 32));
+
+ /* And invoke the same hypercall that way */
+ __asm__ __volatile__("call *%1" : "=a"(rax) :
+ "r"(HCALL_REGION_GPA + INPUTVALUE * 32),
+ "a"(rax), "D"(rdi), "S"(rsi), "d"(rdx),
+ "r"(r10), "r"(r8), "r"(r9));
+ GUEST_ASSERT(rax == RETVALUE);
+
+ GUEST_DONE();
+}
+
+int main(int argc, char *argv[])
+{
+ if (!(kvm_check_cap(KVM_CAP_XEN_HVM) &
+ KVM_XEN_HVM_CONFIG_INTERCEPT_HCALL)) {
+ print_skip("KVM_XEN_HVM_CONFIG_INTERCEPT_HCALL not available");
+ exit(KSFT_SKIP);
+ }
+
+ vm = vm_create_default(VCPU_ID, 0, (void *) guest_code);
+ vcpu_set_cpuid(vm, VCPU_ID, kvm_get_supported_cpuid());
+
+ struct kvm_xen_hvm_config hvmc = {
+ .flags = KVM_XEN_HVM_CONFIG_INTERCEPT_HCALL,
+ .msr = XEN_HYPERCALL_MSR,
+ };
+ vm_ioctl(vm, KVM_XEN_HVM_CONFIG, &hvmc);
+
+ /* Map a region for the hypercall page */
+ vm_userspace_mem_region_add(vm, VM_MEM_SRC_ANONYMOUS,
+ HCALL_REGION_GPA, HCALL_REGION_SLOT,
+ getpagesize(), 0);
+ virt_map(vm, HCALL_REGION_GPA, HCALL_REGION_GPA, 1, 0);
+
+ for (;;) {
+ volatile struct kvm_run *run = vcpu_state(vm, VCPU_ID);
+ struct ucall uc;
+
+ vcpu_run(vm, VCPU_ID);
+
+ if (run->exit_reason == KVM_EXIT_XEN) {
+ ASSERT_EQ(run->xen.type, KVM_EXIT_XEN_HCALL);
+ ASSERT_EQ(run->xen.u.hcall.cpl, 0);
+ ASSERT_EQ(run->xen.u.hcall.longmode, 1);
+ ASSERT_EQ(run->xen.u.hcall.input, INPUTVALUE);
+ ASSERT_EQ(run->xen.u.hcall.params[0], ARGVALUE(1));
+ ASSERT_EQ(run->xen.u.hcall.params[1], ARGVALUE(2));
+ ASSERT_EQ(run->xen.u.hcall.params[2], ARGVALUE(3));
+ ASSERT_EQ(run->xen.u.hcall.params[3], ARGVALUE(4));
+ ASSERT_EQ(run->xen.u.hcall.params[4], ARGVALUE(5));
+ ASSERT_EQ(run->xen.u.hcall.params[5], ARGVALUE(6));
+ run->xen.u.hcall.result = RETVALUE;
+ continue;
+ }
+
+ TEST_ASSERT(run->exit_reason == KVM_EXIT_IO,
+ "Got exit_reason other than KVM_EXIT_IO: %u (%s)\n",
+ run->exit_reason,
+ exit_reason_str(run->exit_reason));
+
+ switch (get_ucall(vm, VCPU_ID, &uc)) {
+ case UCALL_ABORT:
+ TEST_FAIL("%s", (const char *)uc.args[0]);
+ /* NOT REACHED */
+ case UCALL_SYNC:
+ break;
+ case UCALL_DONE:
+ goto done;
+ default:
+ TEST_FAIL("Unknown ucall 0x%lx.", uc.cmd);
+ }
+ }
+done:
+ kvm_vm_free(vm);
+ return 0;
+}
--
2.17.1



2020-12-01 22:34:17

by David Woodhouse

[permalink] [raw]
Subject: Re: [PATCH RFC 02/39] KVM: x86/xen: intercept xen hypercalls if enabled

On Tue, 2020-12-01 at 09:48 +0000, David Woodhouse wrote:
> So I switched it to generate the hypercall page directly from the
> kernel, just like we do for the Hyper-V hypercall page.

Speaking of Hyper-V... I'll post this one now so you can start heckling
it.

I won't post the rest as I go; not much of the rest of the series when
I eventually post it will be very new and exciting. It'll just be
rebasing and tweaking your originals and perhaps adding some tests.

=======================
From db020c521c3a96b698a78d4e62cdd19da3831585 Mon Sep 17 00:00:00 2001
From: David Woodhouse <[email protected]>
Date: Tue, 1 Dec 2020 10:59:26 +0000
Subject: [PATCH] KVM: x86: Fix coexistence of Xen and Hyper-V hypercalls

Disambiguate Xen vs. Hyper-V calls by adding 'orl $0x80000000, %eax'
at the start of the Hyper-V hypercall page when Xen hypercalls are
also enabled.

That bit is reserved in the Hyper-V ABI, and those hypercall numbers
will never be used by Xen (because it does precisely the same trick).

Switch to using kvm_vcpu_write_guest() while we're at it, instead of
open-coding it.

Signed-off-by: David Woodhouse <[email protected]>
---
arch/x86/kvm/hyperv.c | 40 ++++++++++++++-----
arch/x86/kvm/xen.c | 6 +++
.../selftests/kvm/x86_64/xen_vmcall_test.c | 39 +++++++++++++++---
3 files changed, 68 insertions(+), 17 deletions(-)

diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
index 5c7c4060b45c..347ff9ad70b3 100644
--- a/arch/x86/kvm/hyperv.c
+++ b/arch/x86/kvm/hyperv.c
@@ -23,6 +23,7 @@
#include "ioapic.h"
#include "cpuid.h"
#include "hyperv.h"
+#include "xen.h"

#include <linux/cpu.h>
#include <linux/kvm_host.h>
@@ -1139,9 +1140,9 @@ static int kvm_hv_set_msr_pw(struct kvm_vcpu *vcpu, u32 msr, u64 data,
hv->hv_hypercall &= ~HV_X64_MSR_HYPERCALL_ENABLE;
break;
case HV_X64_MSR_HYPERCALL: {
- u64 gfn;
- unsigned long addr;
- u8 instructions[4];
+ u8 instructions[9];
+ int i = 0;
+ u64 addr;

/* if guest os id is not set hypercall should remain disabled */
if (!hv->hv_guest_os_id)
@@ -1150,16 +1151,33 @@ static int kvm_hv_set_msr_pw(struct kvm_vcpu *vcpu, u32 msr, u64 data,
hv->hv_hypercall = data;
break;
}
- gfn = data >> HV_X64_MSR_HYPERCALL_PAGE_ADDRESS_SHIFT;
- addr = gfn_to_hva(kvm, gfn);
- if (kvm_is_error_hva(addr))
- return 1;
- kvm_x86_ops.patch_hypercall(vcpu, instructions);
- ((unsigned char *)instructions)[3] = 0xc3; /* ret */
- if (__copy_to_user((void __user *)addr, instructions, 4))
+
+ /*
+ * If Xen and Hyper-V hypercalls are both enabled, disambiguate
+ * the same way Xen itself does, by setting the bit 31 of EAX
+ * which is RsvdZ in the 32-bit Hyper-V hypercall ABI and just
+ * going to be clobbered on 64-bit.
+ */
+ if (kvm_xen_hypercall_enabled(kvm)) {
+ /* orl $0x80000000, %eax */
+ instructions[i++] = 0x0d;
+ instructions[i++] = 0x00;
+ instructions[i++] = 0x00;
+ instructions[i++] = 0x00;
+ instructions[i++] = 0x80;
+ }
+
+ /* vmcall/vmmcall */
+ kvm_x86_ops.patch_hypercall(vcpu, instructions + i);
+ i += 3;
+
+ /* ret */
+ ((unsigned char *)instructions)[i++] = 0xc3;
+
+ addr = data & HV_X64_MSR_HYPERCALL_PAGE_ADDRESS_MASK;
+ if (kvm_vcpu_write_guest(vcpu, addr, instructions, i))
return 1;
hv->hv_hypercall = data;
- mark_page_dirty(kvm, gfn);
break;
}
case HV_X64_MSR_REFERENCE_TSC:
diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
index 6400a4bc8480..c3426213a970 100644
--- a/arch/x86/kvm/xen.c
+++ b/arch/x86/kvm/xen.c
@@ -8,6 +8,7 @@

#include "x86.h"
#include "xen.h"
+#include "hyperv.h"

#include <linux/kvm_host.h>

@@ -99,6 +100,11 @@ int kvm_xen_hypercall(struct kvm_vcpu *vcpu)

input = (u64)kvm_register_read(vcpu, VCPU_REGS_RAX);

+ /* Hyper-V hypercalls get bit 31 set in EAX */
+ if ((input & 0x80000000) &&
+ kvm_hv_hypercall_enabled(vcpu->kvm))
+ return kvm_hv_hypercall(vcpu);
+
longmode = is_64_bit_mode(vcpu);
if (!longmode) {
params[0] = (u32)kvm_rbx_read(vcpu);
diff --git a/tools/testing/selftests/kvm/x86_64/xen_vmcall_test.c b/tools/testing/selftests/kvm/x86_64/xen_vmcall_test.c
index 3f1dd93626e5..24f279e1a66b 100644
--- a/tools/testing/selftests/kvm/x86_64/xen_vmcall_test.c
+++ b/tools/testing/selftests/kvm/x86_64/xen_vmcall_test.c
@@ -15,6 +15,7 @@

#define HCALL_REGION_GPA 0xc0000000ULL
#define HCALL_REGION_SLOT 10
+#define PAGE_SIZE 4096

static struct kvm_vm *vm;

@@ -22,7 +23,12 @@ static struct kvm_vm *vm;
#define ARGVALUE(x) (0xdeadbeef5a5a0000UL + x)
#define RETVALUE 0xcafef00dfbfbffffUL

-#define XEN_HYPERCALL_MSR 0x40000000
+#define XEN_HYPERCALL_MSR 0x40000200
+#define HV_GUEST_OS_ID_MSR 0x40000000
+#define HV_HYPERCALL_MSR 0x40000001
+
+#define HVCALL_SIGNAL_EVENT 0x005d
+#define HV_STATUS_INVALID_ALIGNMENT 4

static void guest_code(void)
{
@@ -30,6 +36,7 @@ static void guest_code(void)
unsigned long rdi = ARGVALUE(1);
unsigned long rsi = ARGVALUE(2);
unsigned long rdx = ARGVALUE(3);
+ unsigned long rcx;
register unsigned long r10 __asm__("r10") = ARGVALUE(4);
register unsigned long r8 __asm__("r8") = ARGVALUE(5);
register unsigned long r9 __asm__("r9") = ARGVALUE(6);
@@ -41,18 +48,38 @@ static void guest_code(void)
"r"(r10), "r"(r8), "r"(r9));
GUEST_ASSERT(rax == RETVALUE);

- /* Now fill in the hypercall page */
+ /* Fill in the Xen hypercall page */
__asm__ __volatile__("wrmsr" : : "c" (XEN_HYPERCALL_MSR),
"a" (HCALL_REGION_GPA & 0xffffffff),
"d" (HCALL_REGION_GPA >> 32));

- /* And invoke the same hypercall that way */
+ /* Set Hyper-V Guest OS ID */
+ __asm__ __volatile__("wrmsr" : : "c" (HV_GUEST_OS_ID_MSR),
+ "a" (0x5a), "d" (0));
+
+ /* Hyper-V hypercall page */
+ u64 msrval = HCALL_REGION_GPA + PAGE_SIZE + 1;
+ __asm__ __volatile__("wrmsr" : : "c" (HV_HYPERCALL_MSR),
+ "a" (msrval & 0xffffffff),
+ "d" (msrval >> 32));
+
+ /* Invoke a Xen hypercall */
__asm__ __volatile__("call *%1" : "=a"(rax) :
"r"(HCALL_REGION_GPA + INPUTVALUE * 32),
"a"(rax), "D"(rdi), "S"(rsi), "d"(rdx),
"r"(r10), "r"(r8), "r"(r9));
GUEST_ASSERT(rax == RETVALUE);

+ /* Invoke a Hyper-V hypercall */
+ rax = 0;
+ rcx = HVCALL_SIGNAL_EVENT; /* code */
+ rdx = 0x5a5a5a5a; /* ingpa (badly aligned) */
+ __asm__ __volatile__("call *%1" : "=a"(rax) :
+ "r"(HCALL_REGION_GPA + PAGE_SIZE),
+ "a"(rax), "c"(rcx), "d"(rdx),
+ "r"(r8));
+ GUEST_ASSERT(rax == HV_STATUS_INVALID_ALIGNMENT);
+
GUEST_DONE();
}

@@ -73,11 +100,11 @@ int main(int argc, char *argv[])
};
vm_ioctl(vm, KVM_XEN_HVM_CONFIG, &hvmc);

- /* Map a region for the hypercall page */
+ /* Map a region for the hypercall pages */
vm_userspace_mem_region_add(vm, VM_MEM_SRC_ANONYMOUS,
HCALL_REGION_GPA,
HCALL_REGION_SLOT,
- getpagesize(), 0);
- virt_map(vm, HCALL_REGION_GPA, HCALL_REGION_GPA, 1, 0);
+ 2 * getpagesize(), 0);
+ virt_map(vm, HCALL_REGION_GPA, HCALL_REGION_GPA, 2, 0);

for (;;) {
volatile struct kvm_run *run = vcpu_state(vm, VCPU_ID);
--
2.17.1



2020-12-02 00:46:04

by Ankur Arora

[permalink] [raw]
Subject: Re: [PATCH RFC 03/39] KVM: x86/xen: register shared_info page

On 2020-12-01 5:07 a.m., David Woodhouse wrote:
> On Wed, 2019-02-20 at 20:15 +0000, Joao Martins wrote:
>> +static int kvm_xen_shared_info_init(struct kvm *kvm, gfn_t gfn)
>> +{
>> + struct shared_info *shared_info;
>> + struct page *page;
>> +
>> + page = gfn_to_page(kvm, gfn);
>> + if (is_error_page(page))
>> + return -EINVAL;
>> +
>> + kvm->arch.xen.shinfo_addr = gfn;
>> +
>> + shared_info = page_to_virt(page);
>> + memset(shared_info, 0, sizeof(struct shared_info));
>> + kvm->arch.xen.shinfo = shared_info;
>> + return 0;
>> +}
>> +
>
> Hm.
>
> How come we get to pin the page and directly dereference it every time,
> while kvm_setup_pvclock_page() has to use kvm_write_guest_cached()
> instead?

So looking at my WIP trees from the time, this is something that
we went back and forth on as well with using just a pinned page or a
persistent kvm_vcpu_map().

I remember distinguishing shared_info/vcpu_info from kvm_setup_pvclock_page():
shared_info is created early and is not expected to change during the
lifetime of the guest, which didn't seem true for MSR_KVM_SYSTEM_TIME (or
MSR_KVM_STEAL_TIME), so those would either need a kvm_vcpu_map()/
kvm_vcpu_unmap() dance or some kind of synchronization.

That said, I don't think this code explicitly disallows any updates
to shared_info.

>
> If that was allowed, wouldn't it have been a much simpler fix for
> CVE-2019-3016? What am I missing?

Agreed.

Perhaps Paolo can chime in on why KVM never uses a pinned page
and always prefers cached mappings instead?

>
> Should I rework these to use kvm_write_guest_cached()?

kvm_vcpu_map() would be better. The event channel logic does RMW operations
on shared_info->vcpu_info.
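
i.e. something along these lines (just a sketch):

	struct kvm_host_map map;
	struct shared_info *shinfo;

	if (kvm_vcpu_map(vcpu, gpa_to_gfn(gpa), &map))
		return -EINVAL;
	shinfo = map.hva;

	/* atomic RMW needs a real kernel mapping, not a one-shot write */
	set_bit(port, (unsigned long *)shinfo->evtchn_pending);

	kvm_vcpu_unmap(vcpu, &map, true);	/* true: mark the page dirty */

which a write-only kvm_write_guest_cached() can't express.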


Ankur

>
>

2020-12-02 01:29:39

by David Woodhouse

[permalink] [raw]
Subject: Re: [PATCH RFC 03/39] KVM: x86/xen: register shared_info page

On Tue, 2020-12-01 at 16:40 -0800, Ankur Arora wrote:
> On 2020-12-01 5:07 a.m., David Woodhouse wrote:
> > On Wed, 2019-02-20 at 20:15 +0000, Joao Martins wrote:
> > > +static int kvm_xen_shared_info_init(struct kvm *kvm, gfn_t gfn)
> > > +{
> > > + struct shared_info *shared_info;
> > > + struct page *page;
> > > +
> > > + page = gfn_to_page(kvm, gfn);
> > > + if (is_error_page(page))
> > > + return -EINVAL;
> > > +
> > > + kvm->arch.xen.shinfo_addr = gfn;
> > > +
> > > + shared_info = page_to_virt(page);
> > > + memset(shared_info, 0, sizeof(struct shared_info));
> > > + kvm->arch.xen.shinfo = shared_info;
> > > + return 0;
> > > +}
> > > +
> >
> > Hm.
> >
> > How come we get to pin the page and directly dereference it every time,
> > while kvm_setup_pvclock_page() has to use kvm_write_guest_cached()
> > instead?
>
> So looking at my WIP trees from the time, this is something that
> we went back and forth on as well with using just a pinned page or a
> persistent kvm_vcpu_map().

OK, thanks.

> I remember distinguishing shared_info/vcpu_info from kvm_setup_pvclock_page():
> shared_info is created early and is not expected to change during the
> lifetime of the guest, which didn't seem true for MSR_KVM_SYSTEM_TIME (or
> MSR_KVM_STEAL_TIME), so those would either need a kvm_vcpu_map()/
> kvm_vcpu_unmap() dance or some kind of synchronization.
>
> That said, I don't think this code explicitly disallows any updates
> to shared_info.

It needs to allow updates as well as disabling the shared_info pages.
We're going to need that to implement SHUTDOWN_soft_reset for kexec.

> >
> > If that was allowed, wouldn't it have been a much simpler fix for
> > CVE-2019-3016? What am I missing?
>
> Agreed.
>
> Perhaps Paolo can chime in on why KVM never uses a pinned page
> and always prefers cached mappings instead?
>
> >
> > Should I rework these to use kvm_write_guest_cached()?
>
> kvm_vcpu_map() would be better. The event channel logic does RMW operations
> on shared_info->vcpu_info.

I've ported the shared_info/vcpu_info parts and made a test case, and
was going back through to make it use kvm_write_guest_cached(). But I
should probably plug on through the evtchn bits before I do that.

I also don't see much locking on the cache, and can't convince myself
that accessing the shared_info page from multiple CPUs with
kvm_write_guest_cached() or kvm_map_gfn() is sane unless they each have
their own cache.
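
(Presumably that means something like a per-vCPU cache, e.g.

	struct kvm_vcpu_xen {
		u64 hypercall_rip;
		struct gfn_to_hva_cache vcpu_info_cache;	/* hypothetical */
	};

so that no two CPUs ever share one gfn_to_hva_cache, plus locking for
anything that genuinely is shared.)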

What I have so far is at

https://git.infradead.org/users/dwmw2/linux.git/shortlog/refs/heads/xenpv

I'll do the event channel support tomorrow and hook it up to my actual
VMM to give it some more serious testing.



2020-12-02 05:23:26

by Ankur Arora

[permalink] [raw]
Subject: Re: [PATCH RFC 02/39] KVM: x86/xen: intercept xen hypercalls if enabled

On 2020-12-01 1:48 a.m., David Woodhouse wrote:
> On Wed, 2019-02-20 at 20:15 +0000, Joao Martins wrote:
>> Add a new exit reason for emulator to handle Xen hypercalls.
>> Albeit these are injected only if guest has initialized the Xen
>> hypercall page
>
> I've reworked this a little.
>
> I didn't like the inconsistency of allowing userspace to provide the
> hypercall pages even though the ABI is now defined by the kernel and it
> *has* to be VMCALL/VMMCALL.
>
> So I switched it to generate the hypercall page directly from the
> kernel, just like we do for the Hyper-V hypercall page.
>
> I introduced a new flag in the xen_hvm_config struct to enable this
> behaviour, and advertised it in the KVM_CAP_XEN_HVM return value.
>
> I also added the cpl and support for 6-argument hypercalls, and made it
> check the guest RIP when completing the call as discussed (although I
> still think that probably ought to be a generic thing).
>
> I adjusted the test case from my version of the patch, and added
> support for actually testing the hypercall page MSR.
>
> https://git.infradead.org/users/dwmw2/linux.git/shortlog/refs/heads/xenpv
>
> I'll go through and rebase your patch series at least up to patch 16
> and collect them in that tree, then probably post them for real once
> I've got everything working locally.
>
>
> =======================
> From c037c329c8867b219afe2100e383c62e9db7b06d Mon Sep 17 00:00:00 2001
> From: Joao Martins <[email protected]>
> Date: Wed, 13 Jun 2018 09:55:44 -0400
> Subject: [PATCH] KVM: x86/xen: intercept xen hypercalls if enabled
>
> Add a new exit reason for emulator to handle Xen hypercalls.
>
> Since this means KVM owns the ABI, dispense with the facility for the
> VMM to provide its own copy of the hypercall pages; just fill them in
> directly using VMCALL/VMMCALL as we do for the Hyper-V hypercall page.
>
> This behaviour is enabled by a new INTERCEPT_HCALL flag in the
> KVM_XEN_HVM_CONFIG ioctl structure, and advertised by the same flag
> being returned from the KVM_CAP_XEN_HVM check.
>
> Add a test case and shift xen_hvm_config() to the nascent xen.c while
> we're at it.
>
> Signed-off-by: Joao Martins <[email protected]>
> Signed-off-by: David Woodhouse <[email protected]>
> ---
> arch/x86/include/asm/kvm_host.h | 6 +
> arch/x86/kvm/Makefile | 2 +-
> arch/x86/kvm/trace.h | 36 +++++
> arch/x86/kvm/x86.c | 46 +++---
> arch/x86/kvm/xen.c | 140 ++++++++++++++++++
> arch/x86/kvm/xen.h | 21 +++
> include/uapi/linux/kvm.h | 19 +++
> tools/testing/selftests/kvm/Makefile | 1 +
> tools/testing/selftests/kvm/lib/kvm_util.c | 1 +
> .../selftests/kvm/x86_64/xen_vmcall_test.c | 123 +++++++++++++++
> 10 files changed, 365 insertions(+), 30 deletions(-)
> create mode 100644 arch/x86/kvm/xen.c
> create mode 100644 arch/x86/kvm/xen.h
> create mode 100644 tools/testing/selftests/kvm/x86_64/xen_vmcall_test.c
>

> diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
> new file mode 100644
> index 000000000000..6400a4bc8480
> --- /dev/null
> +++ b/arch/x86/kvm/xen.c
> @@ -0,0 +1,140 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Copyright © 2019 Oracle and/or its affiliates. All rights reserved.
> + * Copyright © 2020 Amazon.com, Inc. or its affiliates. All Rights Reserved.
> + *
> + * KVM Xen emulation
> + */
> +
> +#include "x86.h"
> +#include "xen.h"
> +
> +#include <linux/kvm_host.h>
> +
> +#include <trace/events/kvm.h>
> +
> +#include "trace.h"
> +
> +int kvm_xen_hvm_config(struct kvm_vcpu *vcpu, u64 data)
> +{
> + struct kvm *kvm = vcpu->kvm;
> + u32 page_num = data & ~PAGE_MASK;
> + u64 page_addr = data & PAGE_MASK;
> +
> + /*
> + * If Xen hypercall intercept is enabled, fill the hypercall
> + * page with VMCALL/VMMCALL instructions since that's what
> + * we catch. Else the VMM has provided the hypercall pages
> + * with instructions of its own choosing, so use those.
> + */
> + if (kvm_xen_hypercall_enabled(kvm)) {
> + u8 instructions[32];
> + int i;
> +
> + if (page_num)
> + return 1;
> +
> + /* mov imm32, %eax */
> + instructions[0] = 0xb8;
> +
> + /* vmcall / vmmcall */
> + kvm_x86_ops.patch_hypercall(vcpu, instructions + 5);
> +
> + /* ret */
> + instructions[8] = 0xc3;
> +
> + /* int3 to pad */
> + memset(instructions + 9, 0xcc, sizeof(instructions) - 9);
> +
> + for (i = 0; i < PAGE_SIZE / sizeof(instructions); i++) {
> + *(u32 *)&instructions[1] = i;
> + if (kvm_vcpu_write_guest(vcpu,
> + page_addr + (i * sizeof(instructions)),
> + instructions, sizeof(instructions)))
> + return 1;
> + }

HYPERVISOR_iret isn't supported on 64-bit, so that slot should be ud2 instead.
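
I.e. something like this in the fill loop (a sketch; __HYPERVISOR_iret
is 23 in the Xen ABI):

	if (i == __HYPERVISOR_iret) {
		/* ud2 -- fault instead of offering a bogus 64-bit iret */
		instructions[0] = 0x0f;
		instructions[1] = 0x0b;
		memset(instructions + 2, 0xcc, sizeof(instructions) - 2);
	} else {
		/* mov $i, %eax; vmcall/vmmcall; ret -- as above */
		...
	}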


Ankur

> + } else {
> + int lm = is_long_mode(vcpu);
> + u8 *blob_addr = lm ? (u8 *)(long)kvm->arch.xen_hvm_config.blob_addr_64
> + : (u8 *)(long)kvm->arch.xen_hvm_config.blob_addr_32;
> + u8 blob_size = lm ? kvm->arch.xen_hvm_config.blob_size_64
> + : kvm->arch.xen_hvm_config.blob_size_32;
> + u8 *page;
> +
> + if (page_num >= blob_size)
> + return 1;
> +
> + page = memdup_user(blob_addr + (page_num * PAGE_SIZE), PAGE_SIZE);
> + if (IS_ERR(page))
> + return PTR_ERR(page);
> +
> + if (kvm_vcpu_write_guest(vcpu, page_addr, page, PAGE_SIZE)) {
> + kfree(page);
> + return 1;
> + }
> + }
> + return 0;
> +}
> +
> +static int kvm_xen_hypercall_set_result(struct kvm_vcpu *vcpu, u64 result)
> +{
> + kvm_rax_write(vcpu, result);
> + return kvm_skip_emulated_instruction(vcpu);
> +}
> +
> +static int kvm_xen_hypercall_complete_userspace(struct kvm_vcpu *vcpu)
> +{
> + struct kvm_run *run = vcpu->run;
> +
> + if (unlikely(!kvm_is_linear_rip(vcpu, vcpu->arch.xen.hypercall_rip)))
> + return 1;
> +
> + return kvm_xen_hypercall_set_result(vcpu, run->xen.u.hcall.result);
> +}
> +
> +int kvm_xen_hypercall(struct kvm_vcpu *vcpu)
> +{
> + bool longmode;
> + u64 input, params[6];
> +
> + input = (u64)kvm_register_read(vcpu, VCPU_REGS_RAX);
> +
> + longmode = is_64_bit_mode(vcpu);
> + if (!longmode) {
> + params[0] = (u32)kvm_rbx_read(vcpu);
> + params[1] = (u32)kvm_rcx_read(vcpu);
> + params[2] = (u32)kvm_rdx_read(vcpu);
> + params[3] = (u32)kvm_rsi_read(vcpu);
> + params[4] = (u32)kvm_rdi_read(vcpu);
> + params[5] = (u32)kvm_rbp_read(vcpu);
> + }
> +#ifdef CONFIG_X86_64
> + else {
> + params[0] = (u64)kvm_rdi_read(vcpu);
> + params[1] = (u64)kvm_rsi_read(vcpu);
> + params[2] = (u64)kvm_rdx_read(vcpu);
> + params[3] = (u64)kvm_r10_read(vcpu);
> + params[4] = (u64)kvm_r8_read(vcpu);
> + params[5] = (u64)kvm_r9_read(vcpu);
> + }
> +#endif
> + trace_kvm_xen_hypercall(input, params[0], params[1], params[2],
> + params[3], params[4], params[5]);
> +
> + vcpu->run->exit_reason = KVM_EXIT_XEN;
> + vcpu->run->xen.type = KVM_EXIT_XEN_HCALL;
> + vcpu->run->xen.u.hcall.longmode = longmode;
> + vcpu->run->xen.u.hcall.cpl = kvm_x86_ops.get_cpl(vcpu);
> + vcpu->run->xen.u.hcall.input = input;
> + vcpu->run->xen.u.hcall.params[0] = params[0];
> + vcpu->run->xen.u.hcall.params[1] = params[1];
> + vcpu->run->xen.u.hcall.params[2] = params[2];
> + vcpu->run->xen.u.hcall.params[3] = params[3];
> + vcpu->run->xen.u.hcall.params[4] = params[4];
> + vcpu->run->xen.u.hcall.params[5] = params[5];
> + vcpu->arch.xen.hypercall_rip = kvm_get_linear_rip(vcpu);
> + vcpu->arch.complete_userspace_io =
> + kvm_xen_hypercall_complete_userspace;
> +
> + return 0;
> +}
> diff --git a/arch/x86/kvm/xen.h b/arch/x86/kvm/xen.h
> new file mode 100644
> index 000000000000..81e12f716d2e
> --- /dev/null
> +++ b/arch/x86/kvm/xen.h
> @@ -0,0 +1,21 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Copyright © 2019 Oracle and/or its affiliates. All rights reserved.
> + * Copyright © 2020 Amazon.com, Inc. or its affiliates. All Rights Reserved.
> + *
> + * KVM Xen emulation
> + */
> +
> +#ifndef __ARCH_X86_KVM_XEN_H__
> +#define __ARCH_X86_KVM_XEN_H__
> +
> +int kvm_xen_hypercall(struct kvm_vcpu *vcpu);
> +int kvm_xen_hvm_config(struct kvm_vcpu *vcpu, u64 data);
> +
> +static inline bool kvm_xen_hypercall_enabled(struct kvm *kvm)
> +{
> + return kvm->arch.xen_hvm_config.flags &
> + KVM_XEN_HVM_CONFIG_INTERCEPT_HCALL;
> +}
> +
> +#endif /* __ARCH_X86_KVM_XEN_H__ */
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index ca41220b40b8..00221fe56994 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -216,6 +216,20 @@ struct kvm_hyperv_exit {
> } u;
> };
>
> +struct kvm_xen_exit {
> +#define KVM_EXIT_XEN_HCALL 1
> + __u32 type;
> + union {
> + struct {
> + __u32 longmode;
> + __u32 cpl;
> + __u64 input;
> + __u64 result;
> + __u64 params[6];
> + } hcall;
> + } u;
> +};
> +
> #define KVM_S390_GET_SKEYS_NONE 1
> #define KVM_S390_SKEYS_MAX 1048576
>
> @@ -250,6 +264,7 @@ struct kvm_hyperv_exit {
> #define KVM_EXIT_ARM_NISV 28
> #define KVM_EXIT_X86_RDMSR 29
> #define KVM_EXIT_X86_WRMSR 30
> +#define KVM_EXIT_XEN 31
>
> /* For KVM_EXIT_INTERNAL_ERROR */
> /* Emulate instruction failed. */
> @@ -426,6 +441,8 @@ struct kvm_run {
> __u32 index; /* kernel -> user */
> __u64 data; /* kernel <-> user */
> } msr;
> + /* KVM_EXIT_XEN */
> + struct kvm_xen_exit xen;
> /* Fix the size of the union. */
> char padding[256];
> };
> @@ -1126,6 +1143,8 @@ struct kvm_x86_mce {
> #endif
>
> #ifdef KVM_CAP_XEN_HVM
> +#define KVM_XEN_HVM_CONFIG_INTERCEPT_HCALL (1 << 1)
> +
> struct kvm_xen_hvm_config {
> __u32 flags;
> __u32 msr;
> diff --git a/tools/testing/selftests/kvm/Makefile b/tools/testing/selftests/kvm/Makefile
> index 3d14ef77755e..d94abec627e6 100644
> --- a/tools/testing/selftests/kvm/Makefile
> +++ b/tools/testing/selftests/kvm/Makefile
> @@ -59,6 +59,7 @@ TEST_GEN_PROGS_x86_64 += x86_64/xss_msr_test
> TEST_GEN_PROGS_x86_64 += x86_64/debug_regs
> TEST_GEN_PROGS_x86_64 += x86_64/tsc_msrs_test
> TEST_GEN_PROGS_x86_64 += x86_64/user_msr_test
> +TEST_GEN_PROGS_x86_64 += x86_64/xen_vmcall_test
> TEST_GEN_PROGS_x86_64 += demand_paging_test
> TEST_GEN_PROGS_x86_64 += dirty_log_test
> TEST_GEN_PROGS_x86_64 += dirty_log_perf_test
> diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c
> index 126c6727a6b0..6e96ae47d28c 100644
> --- a/tools/testing/selftests/kvm/lib/kvm_util.c
> +++ b/tools/testing/selftests/kvm/lib/kvm_util.c
> @@ -1654,6 +1654,7 @@ static struct exit_reason {
> {KVM_EXIT_INTERNAL_ERROR, "INTERNAL_ERROR"},
> {KVM_EXIT_OSI, "OSI"},
> {KVM_EXIT_PAPR_HCALL, "PAPR_HCALL"},
> + {KVM_EXIT_XEN, "XEN"},
> #ifdef KVM_EXIT_MEMORY_NOT_PRESENT
> {KVM_EXIT_MEMORY_NOT_PRESENT, "MEMORY_NOT_PRESENT"},
> #endif
> diff --git a/tools/testing/selftests/kvm/x86_64/xen_vmcall_test.c b/tools/testing/selftests/kvm/x86_64/xen_vmcall_test.c
> new file mode 100644
> index 000000000000..3f1dd93626e5
> --- /dev/null
> +++ b/tools/testing/selftests/kvm/x86_64/xen_vmcall_test.c
> @@ -0,0 +1,123 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * svm_vmcall_test
> + *
> + * Copyright © 2020 Amazon.com, Inc. or its affiliates.
> + *
> + * Userspace hypercall testing
> + */
> +
> +#include "test_util.h"
> +#include "kvm_util.h"
> +#include "processor.h"
> +
> +#define VCPU_ID 5
> +
> +#define HCALL_REGION_GPA 0xc0000000ULL
> +#define HCALL_REGION_SLOT 10
> +
> +static struct kvm_vm *vm;
> +
> +#define INPUTVALUE 17
> +#define ARGVALUE(x) (0xdeadbeef5a5a0000UL + x)
> +#define RETVALUE 0xcafef00dfbfbffffUL
> +
> +#define XEN_HYPERCALL_MSR 0x40000000
> +
> +static void guest_code(void)
> +{
> + unsigned long rax = INPUTVALUE;
> + unsigned long rdi = ARGVALUE(1);
> + unsigned long rsi = ARGVALUE(2);
> + unsigned long rdx = ARGVALUE(3);
> + register unsigned long r10 __asm__("r10") = ARGVALUE(4);
> + register unsigned long r8 __asm__("r8") = ARGVALUE(5);
> + register unsigned long r9 __asm__("r9") = ARGVALUE(6);
> +
> + /* First a direct invocation of 'vmcall' */
> + __asm__ __volatile__("vmcall" :
> + "=a"(rax) :
> + "a"(rax), "D"(rdi), "S"(rsi), "d"(rdx),
> + "r"(r10), "r"(r8), "r"(r9));
> + GUEST_ASSERT(rax == RETVALUE);
> +
> + /* Now fill in the hypercall page */
> + __asm__ __volatile__("wrmsr" : : "c" (XEN_HYPERCALL_MSR),
> + "a" (HCALL_REGION_GPA & 0xffffffff),
> + "d" (HCALL_REGION_GPA >> 32));
> +
> + /* And invoke the same hypercall that way */
> + __asm__ __volatile__("call *%1" : "=a"(rax) :
> + "r"(HCALL_REGION_GPA + INPUTVALUE * 32),
> + "a"(rax), "D"(rdi), "S"(rsi), "d"(rdx),
> + "r"(r10), "r"(r8), "r"(r9));
> + GUEST_ASSERT(rax == RETVALUE);
> +
> + GUEST_DONE();
> +}
> +
> +int main(int argc, char *argv[])
> +{
> + if (!(kvm_check_cap(KVM_CAP_XEN_HVM) &
> + KVM_XEN_HVM_CONFIG_INTERCEPT_HCALL)) {
> + print_skip("KVM_XEN_HVM_CONFIG_INTERCEPT_HCALL not available");
> + exit(KSFT_SKIP);
> + }
> +
> + vm = vm_create_default(VCPU_ID, 0, (void *) guest_code);
> + vcpu_set_cpuid(vm, VCPU_ID, kvm_get_supported_cpuid());
> +
> + struct kvm_xen_hvm_config hvmc = {
> + .flags = KVM_XEN_HVM_CONFIG_INTERCEPT_HCALL,
> + .msr = XEN_HYPERCALL_MSR,
> + };
> + vm_ioctl(vm, KVM_XEN_HVM_CONFIG, &hvmc);
> +
> + /* Map a region for the hypercall page */
> + vm_userspace_mem_region_add(vm, VM_MEM_SRC_ANONYMOUS,
> + HCALL_REGION_GPA, HCALL_REGION_SLOT,
> + getpagesize(), 0);
> + virt_map(vm, HCALL_REGION_GPA, HCALL_REGION_GPA, 1, 0);
> +
> + for (;;) {
> + volatile struct kvm_run *run = vcpu_state(vm, VCPU_ID);
> + struct ucall uc;
> +
> + vcpu_run(vm, VCPU_ID);
> +
> + if (run->exit_reason == KVM_EXIT_XEN) {
> + ASSERT_EQ(run->xen.type, KVM_EXIT_XEN_HCALL);
> + ASSERT_EQ(run->xen.u.hcall.cpl, 0);
> + ASSERT_EQ(run->xen.u.hcall.longmode, 1);
> + ASSERT_EQ(run->xen.u.hcall.input, INPUTVALUE);
> + ASSERT_EQ(run->xen.u.hcall.params[0], ARGVALUE(1));
> + ASSERT_EQ(run->xen.u.hcall.params[1], ARGVALUE(2));
> + ASSERT_EQ(run->xen.u.hcall.params[2], ARGVALUE(3));
> + ASSERT_EQ(run->xen.u.hcall.params[3], ARGVALUE(4));
> + ASSERT_EQ(run->xen.u.hcall.params[4], ARGVALUE(5));
> + ASSERT_EQ(run->xen.u.hcall.params[5], ARGVALUE(6));
> + run->xen.u.hcall.result = RETVALUE;
> + continue;
> + }
> +
> + TEST_ASSERT(run->exit_reason == KVM_EXIT_IO,
> + "Got exit_reason other than KVM_EXIT_IO: %u (%s)\n",
> + run->exit_reason,
> + exit_reason_str(run->exit_reason));
> +
> + switch (get_ucall(vm, VCPU_ID, &uc)) {
> + case UCALL_ABORT:
> + TEST_FAIL("%s", (const char *)uc.args[0]);
> + /* NOT REACHED */
> + case UCALL_SYNC:
> + break;
> + case UCALL_DONE:
> + goto done;
> + default:
> + TEST_FAIL("Unknown ucall 0x%lx.", uc.cmd);
> + }
> + }
> +done:
> + kvm_vm_free(vm);
> + return 0;
> +}
>

2020-12-02 05:24:38

by Ankur Arora

[permalink] [raw]
Subject: Re: [PATCH RFC 03/39] KVM: x86/xen: register shared_info page

On 2020-12-01 5:26 p.m., David Woodhouse wrote:
> On Tue, 2020-12-01 at 16:40 -0800, Ankur Arora wrote:
>> On 2020-12-01 5:07 a.m., David Woodhouse wrote:
>>> On Wed, 2019-02-20 at 20:15 +0000, Joao Martins wrote:
>>>> +static int kvm_xen_shared_info_init(struct kvm *kvm, gfn_t gfn)
>>>> +{
>>>> + struct shared_info *shared_info;
>>>> + struct page *page;
>>>> +
>>>> + page = gfn_to_page(kvm, gfn);
>>>> + if (is_error_page(page))
>>>> + return -EINVAL;
>>>> +
>>>> + kvm->arch.xen.shinfo_addr = gfn;
>>>> +
>>>> + shared_info = page_to_virt(page);
>>>> + memset(shared_info, 0, sizeof(struct shared_info));
>>>> + kvm->arch.xen.shinfo = shared_info;
>>>> + return 0;
>>>> +}
>>>> +
>>>
>>> Hm.
>>>
>>> How come we get to pin the page and directly dereference it every time,
>>> while kvm_setup_pvclock_page() has to use kvm_write_guest_cached()
>>> instead?
>>
>> So looking at my WIP trees from the time, this is something that
>> we went back and forth on as well with using just a pinned page or a
>> persistent kvm_vcpu_map().
>
> OK, thanks.
>
>> I remember distinguishing shared_info/vcpu_info from kvm_setup_pvclock_page()
>> as shared_info is created early and is not expected to change during the
>> lifetime of the guest which didn't seem true for MSR_KVM_SYSTEM_TIME (or
>> MSR_KVM_STEAL_TIME) so that would either need to do a kvm_vcpu_map()
>> kvm_vcpu_unmap() dance or do some kind of synchronization.
>>
>> That said, I don't think this code explicitly disallows any updates
>> to shared_info.
>
> It needs to allow updates as well as disabling the shared_info pages.
> We're going to need that to implement SHUTDOWN_soft_reset for kexec.
True.

>
>>>
>>> If that was allowed, wouldn't it have been a much simpler fix for
>>> CVE-2019-3016? What am I missing?
>>
>> Agreed.
>>
>> Perhaps, Paolo can chime in with why KVM never uses pinned page
>> and always prefers to do cached mappings instead?
>>
>>>
>>> Should I rework these to use kvm_write_guest_cached()?
>>
>> kvm_vcpu_map() would be better. The event channel logic does RMW operations
>> on shared_info->vcpu_info.
>
> I've ported the shared_info/vcpu_info parts and made a test case, and
> was going back through to make it use kvm_write_guest_cached(). But I
> should probably plug on through the evtchn bits before I do that.
>
> I also don't see much locking on the cache, and can't convince myself
> that accessing the shared_info page from multiple CPUs with
> kvm_write_guest_cached() or kvm_map_gfn() is sane unless they each have
> their own cache.

I think you could get a VCPU specific cache with kvm_vcpu_map().

>
> What I have so far is at
>
> https://git.infradead.org/users/dwmw2/linux.git/shortlog/refs/heads/xenpv

Thanks. Will take a look there.

Ankur

>
> I'll do the event channel support tomorrow and hook it up to my actual
> VMM to give it some more serious testing.
>

2020-12-02 08:07:09

by David Woodhouse

[permalink] [raw]
Subject: Re: [PATCH RFC 02/39] KVM: x86/xen: intercept xen hypercalls if enabled

On Tue, 2020-12-01 at 21:19 -0800, Ankur Arora wrote:
> > + for (i = 0; i < PAGE_SIZE / sizeof(instructions); i++) {
> > + *(u32 *)&instructions[1] = i;
> > + if (kvm_vcpu_write_guest(vcpu,
> > + page_addr + (i * sizeof(instructions)),
> > + instructions, sizeof(instructions)))
> > + return 1;
> > + }
>
> HYPERVISOR_iret isn't supported on 64bit, so it should be ud2 instead.

Yeah, I got part way through typing that part but concluded it probably
wasn't a fast path that absolutely needed to be emulated in the kernel.

The VMM can inject the UD# when it receives the hypercall.

I appreciate it *is* a guest-visible difference, if we're being really
pedantic, but I don't think we were even going to be able to 100% hide
the fact that it's not actually Xen.
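
For illustration, the VMM side of that could be as simple as this in its KVM_EXIT_XEN handler (a sketch using the standard KVM_GET/SET_VCPU_EVENTS ioctls; error handling omitted):

	struct kvm_vcpu_events events;

	/* Queue a #UD for the guest instead of completing the hypercall. */
	ioctl(vcpu_fd, KVM_GET_VCPU_EVENTS, &events);
	events.exception.injected = 1;
	events.exception.nr = 6;	/* #UD */
	events.exception.has_error_code = 0;
	ioctl(vcpu_fd, KVM_SET_VCPU_EVENTS, &events);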



2020-12-02 10:49:07

by Joao Martins

[permalink] [raw]
Subject: Re: [PATCH RFC 03/39] KVM: x86/xen: register shared_info page

[late response - was on holiday yesterday]

On 12/2/20 12:40 AM, Ankur Arora wrote:
> On 2020-12-01 5:07 a.m., David Woodhouse wrote:
>> On Wed, 2019-02-20 at 20:15 +0000, Joao Martins wrote:
>>> +static int kvm_xen_shared_info_init(struct kvm *kvm, gfn_t gfn)
>>> +{
>>> + struct shared_info *shared_info;
>>> + struct page *page;
>>> +
>>> + page = gfn_to_page(kvm, gfn);
>>> + if (is_error_page(page))
>>> + return -EINVAL;
>>> +
>>> + kvm->arch.xen.shinfo_addr = gfn;
>>> +
>>> + shared_info = page_to_virt(page);
>>> + memset(shared_info, 0, sizeof(struct shared_info));
>>> + kvm->arch.xen.shinfo = shared_info;
>>> + return 0;
>>> +}
>>> +
>>
>> Hm.
>>
>> How come we get to pin the page and directly dereference it every time,
>> while kvm_setup_pvclock_page() has to use kvm_write_guest_cached()
>> instead?
>
> So looking at my WIP trees from the time, this is something that
> we went back and forth on as well with using just a pinned page or a
> persistent kvm_vcpu_map().
>
> I remember distinguishing shared_info/vcpu_info from kvm_setup_pvclock_page()
> as shared_info is created early and is not expected to change during the
> lifetime of the guest which didn't seem true for MSR_KVM_SYSTEM_TIME (or
> MSR_KVM_STEAL_TIME) so that would either need to do a kvm_vcpu_map()
> kvm_vcpu_unmap() dance or do some kind of synchronization.
>
> That said, I don't think this code explicitly disallows any updates
> to shared_info.
>
>>
>> If that was allowed, wouldn't it have been a much simpler fix for
>> CVE-2019-3016? What am I missing?
>
> Agreed.
>
> Perhaps, Paolo can chime in with why KVM never uses pinned page
> and always prefers to do cached mappings instead?
>
Part of the CVE fix was to not use the cached versions.

It's not a longterm pin of the page, unlike what we try to do here (partly due to the nature
of the pages we are mapping), but we still map the gpa, RMW the steal time struct, and
then unmap the page.

See record_steal_time() -- but more specifically commit b043138246 ("x86/KVM: Make sure
KVM_VCPU_FLUSH_TLB flag is not missed").

But I am not sure it's a good idea to do the same as record_steal_time(), given that
this is a fairly sensitive code path for event channels.
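
(For reference, the post-fix pattern is roughly the following -- a trimmed sketch of record_steal_time() as of ~v5.9, with error handling and the rest of the accounting omitted:)

	struct kvm_host_map map;
	struct kvm_steal_time *st;

	if (kvm_map_gfn(vcpu, vcpu->arch.st.msr_val >> PAGE_SHIFT,
			&map, &vcpu->arch.st.cache, false))
		return;

	st = map.hva + offset_in_page(vcpu->arch.st.msr_val & KVM_STEAL_VALID_BITS);

	/* Atomic RMW, so a concurrently-set KVM_VCPU_FLUSH_TLB can't be lost. */
	if (xchg(&st->preempted, 0) & KVM_VCPU_FLUSH_TLB)
		kvm_vcpu_flush_tlb_guest(vcpu);

	kvm_unmap_gfn(vcpu, &map, &vcpu->arch.st.cache, true, false);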

>>
>> Should I rework these to use kvm_write_guest_cached()?
>
> kvm_vcpu_map() would be better. The event channel logic does RMW operations
> on shared_info->vcpu_info.
>
Indeed, yes.

Ankur, IIRC we saw missed event channel notifications when we were using the
{write,read}_cached() version of the patch.

But I can't remember what the reason was, either the evtchn_pending or the mask
word -- which would make it not inject an upcall.

Joao

2020-12-02 10:53:24

by Joao Martins

[permalink] [raw]
Subject: Re: [PATCH RFC 03/39] KVM: x86/xen: register shared_info page

On 12/2/20 5:17 AM, Ankur Arora wrote:
> On 2020-12-01 5:26 p.m., David Woodhouse wrote:
>> On Tue, 2020-12-01 at 16:40 -0800, Ankur Arora wrote:
>>> On 2020-12-01 5:07 a.m., David Woodhouse wrote:

[...]

>>>> If that was allowed, wouldn't it have been a much simpler fix for
>>>> CVE-2019-3016? What am I missing?
>>>
>>> Agreed.
>>>
>>> Perhaps, Paolo can chime in with why KVM never uses pinned page
>>> and always prefers to do cached mappings instead?
>>>
>>>>
>>>> Should I rework these to use kvm_write_guest_cached()?
>>>
>>> kvm_vcpu_map() would be better. The event channel logic does RMW operations
>>> on shared_info->vcpu_info.
>>
>> I've ported the shared_info/vcpu_info parts and made a test case, and
>> was going back through to make it use kvm_write_guest_cached(). But I
>> should probably plug on through the evtchn bits before I do that.
>>
>> I also don't see much locking on the cache, and can't convince myself
>> that accessing the shared_info page from multiple CPUs with
>> kvm_write_guest_cached() or kvm_map_gfn() is sane unless they each have
>> their own cache.
>
> I think you could get a VCPU specific cache with kvm_vcpu_map().
>

The steal clock handling creates such a mapping cache (struct gfn_to_pfn_cache).
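
(Which at this point is just a small struct in include/linux/kvm_types.h, roughly:)

	struct gfn_to_pfn_cache {
		u64 generation;		/* memslot generation the entry was mapped against */
		gfn_t gfn;
		kvm_pfn_t pfn;
		bool dirty;
	};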

Joao

2020-12-02 11:20:16

by David Woodhouse

[permalink] [raw]
Subject: Re: [PATCH RFC 10/39] KVM: x86/xen: support upcall vector

On Wed, 2019-02-20 at 20:15 +0000, Joao Martins wrote:
> @@ -176,6 +177,9 @@ int kvm_arch_set_irq_inatomic(struct kvm_kernel_irq_routing_entry *e,
> int r;
>
> switch (e->type) {
> + case KVM_IRQ_ROUTING_XEN_EVTCHN:
> + return kvm_xen_set_evtchn(e, kvm, irq_source_id, level,
> + line_status);
> case KVM_IRQ_ROUTING_HV_SINT:
> return kvm_hv_set_sint(e, kvm, irq_source_id, level,
> line_status);
> @@ -325,6 +329,13 @@ int kvm_set_routing_entry(struct kvm *kvm,
> e->hv_sint.vcpu = ue->u.hv_sint.vcpu;
> e->hv_sint.sint = ue->u.hv_sint.sint;
> break;
> + case KVM_IRQ_ROUTING_XEN_EVTCHN:
> + e->set = kvm_xen_set_evtchn;
> + e->evtchn.vcpu = ue->u.evtchn.vcpu;
> + e->evtchn.vector = ue->u.evtchn.vector;
> + e->evtchn.via = ue->u.evtchn.via;
> +
> + return kvm_xen_setup_evtchn(kvm, e);
> default:
> return -EINVAL;
> }


Hmm. I'm not sure I'd have done it that way.

These IRQ routing entries aren't for individual event channel ports;
they don't map to kvm_xen_evtchn_send().

They actually represent the upcall to the given vCPU when any event
channel is signalled, and it's really per-vCPU configuration.

When the kernel raises (IPI, VIRQ) events on a given CPU, it doesn't
actually use these routing entries; it just uses the values in
vcpu_xen->cb.{via,vector} which were cached from the last of these IRQ
routing entries that happens to have been processed?

The VMM is *expected* to set up precisely one of these for each vCPU,
right? Would it not be better to do that via KVM_XEN_HVM_SET_ATTR?

The usage model for userspace is presumably that the VMM should set the
appropriate bit in evtchn_pending, check evtchn_mask and then call into
the kernel to do the set_irq() to inject the callback vector to the
guest?

I might be more inclined to go for a model where the kernel handles the
evtchn_pending/evtchn_mask for us. What would go into the irq routing
table is { vcpu, port# } which get passed to kvm_xen_evtchn_send().
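
Something along these lines, perhaps (a sketch for the 2-level ABI only; the function and field names are illustrative, not from the posted patches):

	static void kvm_xen_deliver_port(struct kvm_vcpu *vcpu, int port)
	{
		struct shared_info *s = vcpu->kvm->arch.xen.shinfo;
		struct vcpu_info *vi = vcpu->arch.xen.vcpu_info;

		/* Atomic RMWs: the guest updates these words too. */
		if (test_and_set_bit(port, (unsigned long *)s->evtchn_pending))
			return;			/* already pending */
		if (test_bit(port, (unsigned long *)s->evtchn_mask))
			return;			/* masked: stays pending, no upcall */

		set_bit(port / BITS_PER_LONG,
			(unsigned long *)&vi->evtchn_pending_sel);
		if (!xchg(&vi->evtchn_upcall_pending, 1))
			kvm_vcpu_kick(vcpu);	/* go inject the callback vector */
	}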

Does that seem reasonable?

Either way, I do think we need a way for events raised in the kernel to
be signalled to userspace, if they are targeted at a vCPU which has
CALLBACK_VIA_INTX that the kernel can't do directly. So we probably
*do* need that eventfd I was objecting to earlier, except that it's not
a per-evtchn thing; it's per-vCPU.




2020-12-02 11:22:38

by Joao Martins

[permalink] [raw]
Subject: Re: [PATCH RFC 02/39] KVM: x86/xen: intercept xen hypercalls if enabled

On 12/1/20 11:19 AM, David Woodhouse wrote:
> On Tue, 2020-12-01 at 09:48 +0000, David Woodhouse wrote:
>> So I switched it to generate the hypercall page directly from the
>> kernel, just like we do for the Hyper-V hypercall page.
>
> Speaking of Hyper-V... I'll post this one now so you can start heckling
> it.
>
Xen viridian mode is indeed one thing that needed fixing. And it's definitely a
separate patch, as you do here. Both this one and the previous one are looking good.

I suppose that because you switch to kvm_vcpu_write_guest() you no longer need
to validate that the hypercall page is correct, nor mark it as dirty. It's probably
worth making that explicit in the commit message.

> I won't post the rest as I go; not much of the rest of the series when
> I eventually post it will be very new and exciting. It'll just be
> rebasing and tweaking your originals and perhaps adding some tests.
>

Awesome!!

2020-12-02 12:16:13

by David Woodhouse

[permalink] [raw]
Subject: Re: [PATCH RFC 02/39] KVM: x86/xen: intercept xen hypercalls if enabled

On Wed, 2020-12-02 at 11:17 +0000, Joao Martins wrote:
> Xen viridian mode is indeed one thing that needed fixing. And it's definitely a
> separate patch, as you do here. Both this one and the previous one are looking good.
>
> I suppose that because you switch to kvm_vcpu_write_guest() you no longer need
> to validate that the hypercall page is correct, nor mark it as dirty. It's probably
> worth making that explicit in the commit message.

I had intended that to be covered by "open-coding it".

Surely the *point* in using a helper function and not open-coding it is
that we don't *have* to care about the details of what it does ;)



2020-12-02 12:24:38

by David Woodhouse

[permalink] [raw]
Subject: Re: [PATCH RFC 03/39] KVM: x86/xen: register shared_info page

On Wed, 2020-12-02 at 10:44 +0000, Joao Martins wrote:
> [late response - was on holiday yesterday]
>
> On 12/2/20 12:40 AM, Ankur Arora wrote:
> > On 2020-12-01 5:07 a.m., David Woodhouse wrote:
> > > On Wed, 2019-02-20 at 20:15 +0000, Joao Martins wrote:
> > > > +static int kvm_xen_shared_info_init(struct kvm *kvm, gfn_t gfn)
> > > > +{
> > > > + struct shared_info *shared_info;
> > > > + struct page *page;
> > > > +
> > > > + page = gfn_to_page(kvm, gfn);
> > > > + if (is_error_page(page))
> > > > + return -EINVAL;
> > > > +
> > > > + kvm->arch.xen.shinfo_addr = gfn;
> > > > +
> > > > + shared_info = page_to_virt(page);
> > > > + memset(shared_info, 0, sizeof(struct shared_info));
> > > > + kvm->arch.xen.shinfo = shared_info;
> > > > + return 0;
> > > > +}
> > > > +
> > >
> > > Hm.
> > >
> > > How come we get to pin the page and directly dereference it every time,
> > > while kvm_setup_pvclock_page() has to use kvm_write_guest_cached()
> > > instead?
> >
> > So looking at my WIP trees from the time, this is something that
> > we went back and forth on as well with using just a pinned page or a
> > persistent kvm_vcpu_map().
> >
> > I remember distinguishing shared_info/vcpu_info from kvm_setup_pvclock_page()
> > as shared_info is created early and is not expected to change during the
> > lifetime of the guest which didn't seem true for MSR_KVM_SYSTEM_TIME (or
> > MSR_KVM_STEAL_TIME) so that would either need to do a kvm_vcpu_map()
> > kvm_vcpu_unmap() dance or do some kind of synchronization.
> >
> > That said, I don't think this code explicitly disallows any updates
> > to shared_info.
> >
> > >
> > > If that was allowed, wouldn't it have been a much simpler fix for
> > > CVE-2019-3016? What am I missing?
> >
> > Agreed.
> >
> > Perhaps, Paolo can chime in with why KVM never uses pinned page
> > and always prefers to do cached mappings instead?
> >
>
> Part of the CVE fix was to not use the cached versions.
>
> It's not a longterm pin of the page, unlike what we try to do here (partly due to the nature
> of the pages we are mapping), but we still map the gpa, RMW the steal time struct, and
> then unmap the page.
>
> See record_steal_time() -- but more specifically commit b043138246 ("x86/KVM: Make sure
> KVM_VCPU_FLUSH_TLB flag is not missed").
>
> But I am not sure it's a good idea to do the same as record_steal_time(), given that
> this is a fairly sensitive code path for event channels.

Right. We definitely need to use atomic RMW operations (like the CVE
fix did) so the page needs to be *mapped*.

My question was about a permanent pinned mapping vs the map/unmap-as-we-need-it
that record_steal_time() does.

On IRC, Paolo told me that permanent pinning causes problems for memory
hotplug, and pointed me at the trick we do with an MMU notifier and
kvm_vcpu_reload_apic_access_page().

I'm going to stick with the pinning we have for the moment, and just
fix up the fact that it leaks the pinned pages if the guest sets the
shared_info address more than once.

At some point the apic page MMU notifier thing can be made generic, and
we can use that for this and for KVM steal time too.



2020-12-02 13:14:59

by Joao Martins

[permalink] [raw]
Subject: Re: [PATCH RFC 10/39] KVM: x86/xen: support upcall vector

On 12/2/20 11:17 AM, David Woodhouse wrote:
> On Wed, 2019-02-20 at 20:15 +0000, Joao Martins wrote:
>> @@ -176,6 +177,9 @@ int kvm_arch_set_irq_inatomic(struct kvm_kernel_irq_routing_entry *e,
>> int r;
>>
>> switch (e->type) {
>> + case KVM_IRQ_ROUTING_XEN_EVTCHN:
>> + return kvm_xen_set_evtchn(e, kvm, irq_source_id, level,
>> + line_status);
>> case KVM_IRQ_ROUTING_HV_SINT:
>> return kvm_hv_set_sint(e, kvm, irq_source_id, level,
>> line_status);
>> @@ -325,6 +329,13 @@ int kvm_set_routing_entry(struct kvm *kvm,
>> e->hv_sint.vcpu = ue->u.hv_sint.vcpu;
>> e->hv_sint.sint = ue->u.hv_sint.sint;
>> break;
>> + case KVM_IRQ_ROUTING_XEN_EVTCHN:
>> + e->set = kvm_xen_set_evtchn;
>> + e->evtchn.vcpu = ue->u.evtchn.vcpu;
>> + e->evtchn.vector = ue->u.evtchn.vector;
>> + e->evtchn.via = ue->u.evtchn.via;
>> +
>> + return kvm_xen_setup_evtchn(kvm, e);
>> default:
>> return -EINVAL;
>> }
>
>
> Hmm. I'm not sure I'd have done it that way.
>
> These IRQ routing entries aren't for individual event channel ports;
> they don't map to kvm_xen_evtchn_send().
>
> They actually represent the upcall to the given vCPU when any event
> channel is signalled, and it's really per-vCPU configuration.
>
Right.

> When the kernel raises (IPI, VIRQ) events on a given CPU, it doesn't
> actually use these routing entries; it just uses the values in
> vcpu_xen->cb.{via,vector} which were cached from the last of these IRQ
> routing entries that happens to have been processed?
>
Correct.

> The VMM is *expected* to set up precisely one of these for each vCPU,
> right?
>
From the guest PoV, the hypercall:

HYPERVISOR_hvm_op(HVMOP_set_param, HVM_PARAM_CALLBACK_IRQ, ...)

(...) is global.

But this one (on more recent versions of Xen, used particularly by recent Windows guests):

HVMOP_set_evtchn_upcall_vector

(...) is per-vCPU, and it's a LAPIC vector.

But indeed the VMM ends up changing the @via/@vector on an individual CPU.
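
For reference, from the guest those look roughly like this (illustrative; the structs and constants are from Xen's public hvm_op.h/params.h):

	/* Domain-wide: HVM_PARAM_CALLBACK_IRQ, type 2 (vector) in the top byte. */
	struct xen_hvm_param hp = {
		.domid = DOMID_SELF,
		.index = HVM_PARAM_CALLBACK_IRQ,
		.value = (2ULL << 56) | HYPERVISOR_CALLBACK_VECTOR,
	};
	HYPERVISOR_hvm_op(HVMOP_set_param, &hp);

	/* Per-vCPU LAPIC vector: */
	struct xen_hvm_evtchn_upcall_vector uv = {
		.vcpu = xen_vcpu_nr(cpu),
		.vector = HYPERVISOR_CALLBACK_VECTOR,
	};
	HYPERVISOR_hvm_op(HVMOP_set_evtchn_upcall_vector, &uv);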

> Would it not be better to do that via KVM_XEN_HVM_SET_ATTR?

It's definitely an interesting (better?) alternative, considering we'd use it as a vCPU attribute.

I suppose you're suggesting something like:

KVM_XEN_ATTR_TYPE_CALLBACK

and passing the vector and callback type.

> The usage model for userspace is presumably that the VMM should set the
> appropriate bit in evtchn_pending, check evtchn_mask and then call into
> the kernel to do the set_irq() to inject the callback vector to the
> guest?
>
Correct, that's how it works for userspace handled event channels.

> I might be more inclined to go for a model where the kernel handles the
> evtchn_pending/evtchn_mask for us. What would go into the irq routing
> table is { vcpu, port# } which get passed to kvm_xen_evtchn_send().
>

But if we pass the port to the routing and handle the sending of events, wouldn't it lead to
unnecessary handling of event channels which aren't handled by the kernel, compared to
just caring about injecting the upcall?

I thought from previous feedback that it was something you wanted to avoid.

> Does that seem reasonable?
>
Otherwise, it seems reasonable to have it.

I'll let Ankur chip in too, as this was something he was modelling more closely.

> Either way, I do think we need a way for events raised in the kernel to
> be signalled to userspace, if they are targeted at a vCPU which has
> CALLBACK_VIA_INTX that the kernel can't do directly. So we probably
> *do* need that eventfd I was objecting to earlier, except that it's not
> a per-evtchn thing; it's per-vCPU.
>

Ah!

I wanted to mention the GSI callback method too, but wasn't entirely sure if eventfd was
enough.

Joao

2020-12-02 16:50:53

by David Woodhouse

[permalink] [raw]
Subject: Re: [PATCH RFC 10/39] KVM: x86/xen: support upcall vector

On Wed, 2020-12-02 at 13:12 +0000, Joao Martins wrote:
> On 12/2/20 11:17 AM, David Woodhouse wrote:
> > I might be more inclined to go for a model where the kernel handles the
> > evtchn_pending/evtchn_mask for us. What would go into the irq routing
> > table is { vcpu, port# } which get passed to kvm_xen_evtchn_send().
> >
>
> But if we pass the port to the routing and handle the sending of events, wouldn't it lead to
> unnecessary handling of event channels which aren't handled by the kernel, compared to
> just caring about injecting the upcall?

Well, I'm generally in favour of *not* doing things in the kernel that
don't need to be there.

But if the kernel is going to short-circuit the IPIs and VIRQs, then
it's already going to have to handle the evtchn_pending/evtchn_mask
bitmaps, and actually injecting interrupts.

Given that it has to have that functionality anyway, it seems saner to
let the kernel have full control over it and to just expose
'evtchn_send' to userspace.

The alternative is to have userspace trying to play along with the
atomic handling of those bitmasks too, and injecting events through
KVM_INTERRUPT/KVM_SIGNAL_MSI in parallel to the kernel doing so. That
seems like *more* complexity, not less.

> I wanted to mention the GSI callback method too, but wasn't entirely sure if eventfd was
> enough.

That actually works quite nicely even for userspace irqchip.

Forgetting Xen for the moment... my model for userspace I/OAPIC with
interrupt remapping is that during normal runtime, the irqfd is
assigned and things all work and we can even have IRQ posting for
eventfds which came from VFIO.

When the IOMMU invalidates an IRQ translation, all it does is
*deassign* the irqfd from the KVM IRQ. So the next time that eventfd
fires, it's caught in the userspace event loop instead. Userspace can
then retranslate through the IOMMU and reassign the irqfd for next
time.

So, back to Xen. As things stand with just the hypercall+shinfo patches
I've already rebased, we have enough to do fully functional Xen
hosting. The event channels are slow but it *can* be done entirely in
userspace — handling *all* the hypercalls, and delivering interrupts to
the guest in whatever mode is required.

Event channels are a very important optimisation though. For the VMM
API I think we should follow the Xen model, mixing the domain-wide and
per-vCPU configuration. It's the best way to faithfully model the
behaviour a true Xen guest would experience.

So KVM_XEN_ATTR_TYPE_CALLBACK_VIA can be used to set one of
• HVMIRQ_callback_vector, taking a vector#
• HVMIRQ_callback_gsi for the in-kernel irqchip, taking a GSI#

And *maybe* in a later patch it could also handle
• HVMIRQ_callback_gsi for split-irqchip, taking an eventfd
• HVMIRQ_callback_pci_intx, taking an eventfd (or a pair, for EOI?)

I don't know if the latter two really make sense. After all the
argument for handling IPI/VIRQ in kernel kind of falls over if we have
to bounce out to userspace anyway. So it *only* makes sense if those
eventfds actually end up wired *through* userspace to a KVM IRQFD as I
described for the IOMMU stuff.


In addition to that per-domain setup, we'd also have a per-vCPU
KVM_XEN_ATTR_TYPE_VCPU_CALLBACK_VECTOR which takes {vCPU, vector}.
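
As a strawman for the uAPI, something like this perhaps (the field names here are invented for illustration, not from the posted patches):

	struct kvm_xen_hvm_attr {
		__u16 type;			/* e.g. KVM_XEN_ATTR_TYPE_CALLBACK_VIA */
		__u16 pad[3];
		union {
			struct {
				__u8 via;	/* HVMIRQ_callback_vector/_gsi/_pci_intx */
				union {
					__u8 vector;	/* HVMIRQ_callback_vector */
					__u32 gsi;	/* HVMIRQ_callback_gsi */
					__s32 fd;	/* eventfd, for the split cases */
				};
			} callback_via;
			struct {
				__u32 vcpu;	/* ..._VCPU_CALLBACK_VECTOR */
				__u8 vector;
			} vcpu_callback;
			__u64 pad2[8];
		} u;
	};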




2020-12-02 18:23:43

by Ankur Arora

[permalink] [raw]
Subject: Re: [PATCH RFC 02/39] KVM: x86/xen: intercept xen hypercalls if enabled

On 2020-12-02 12:03 a.m., David Woodhouse wrote:
> On Tue, 2020-12-01 at 21:19 -0800, Ankur Arora wrote:
>>> + for (i = 0; i < PAGE_SIZE / sizeof(instructions); i++) {
>>> + *(u32 *)&instructions[1] = i;
>>> + if (kvm_vcpu_write_guest(vcpu,
>>> + page_addr + (i * sizeof(instructions)),
>>> + instructions, sizeof(instructions)))
>>> + return 1;
>>> + }
>>
>> HYPERVISOR_iret isn't supported on 64bit, so it should be ud2 instead.
>
> Yeah, I got part way through typing that part but concluded it probably
> wasn't a fast path that absolutely needed to be emulated in the kernel.
>
> The VMM can inject the UD# when it receives the hypercall.

That would work as well but if it's a straight ud2 on the hypercall
page, wouldn't the guest just execute it when/if it does a
HYPERVISOR_iret?

Ankur


>
> I appreciate it *is* a guest-visible difference, if we're being really
> pedantic, but I don't think we were even going to be able to 100% hide
> the fact that it's not actually Xen.
>

2020-12-02 18:37:07

by Joao Martins

[permalink] [raw]
Subject: Re: [PATCH RFC 10/39] KVM: x86/xen: support upcall vector

On 12/2/20 4:47 PM, David Woodhouse wrote:
> On Wed, 2020-12-02 at 13:12 +0000, Joao Martins wrote:
>> On 12/2/20 11:17 AM, David Woodhouse wrote:
>>> I might be more inclined to go for a model where the kernel handles the
>>> evtchn_pending/evtchn_mask for us. What would go into the irq routing
>>> table is { vcpu, port# } which get passed to kvm_xen_evtchn_send().
>>
>> But if we pass the port to the routing and handle the sending of events, wouldn't it lead to
>> unnecessary handling of event channels which aren't handled by the kernel, compared to
>> just caring about injecting the upcall?
>
> Well, I'm generally in favour of *not* doing things in the kernel that
> don't need to be there.
>
> But if the kernel is going to short-circuit the IPIs and VIRQs, then
> it's already going to have to handle the evtchn_pending/evtchn_mask
> bitmaps, and actually injecting interrupts.
>
Right. I was trying to point that out in the discussion we had
on the next patch. But truth be told, it was more about touting the idea of the kernel
knowing if a given event channel is registered for userspace handling,
rather than fully handling the event channel.

I suppose we are able to provide both options to the VMM anyway,
i.e. 1) letting them handle it entirely in userspace by intercepting
EVTCHNOP_send, or 2) through the irq route if we want the kernel to offload it.

> Given that it has to have that functionality anyway, it seems saner to
> let the kernel have full control over it and to just expose
> 'evtchn_send' to userspace.
>
> The alternative is to have userspace trying to play along with the
> atomic handling of those bitmasks too

That part is not particularly hard -- we have it done already.

>, and injecting events through
> KVM_INTERRUPT/KVM_SIGNAL_MSI in parallel to the kernel doing so. That
> seems like *more* complexity, not less.
>
/me nods

>> I wanted to mention the GSI callback method too, but wasn't entirely sure if eventfd was
>> enough.
>
> That actually works quite nicely even for userspace irqchip.
>
> Forgetting Xen for the moment... my model for userspace I/OAPIC with
> interrupt remapping is that during normal runtime, the irqfd is
> assigned and things all work and we can even have IRQ posting for
> eventfds which came from VFIO.
>
> When the IOMMU invalidates an IRQ translation, all it does is
> *deassign* the irqfd from the KVM IRQ. So the next time that eventfd
> fires, it's caught in the userspace event loop instead. Userspace can
> then retranslate through the IOMMU and reassign the irqfd for next
> time.
>
> So, back to Xen. As things stand with just the hypercall+shinfo patches
> I've already rebased, we have enough to do fully functional Xen
> hosting.

Yes -- the rest becomes optimizations in performance-sensitive paths.

TBH (and this is slightly off-topic) the somewhat hairy part is xenbus/xenstore.
And the alternative to playing nice with xenstored is the VMM learning
to parse the xenbus messages and fake the xenstored content/transactions stuff
individually per VM.

> The event channels are slow but it *can* be done entirely in

Consulting my old notes: about twice as slow if done in userspace.

> userspace — handling *all* the hypercalls, and delivering interrupts to
> the guest in whatever mode is required.
>
> Event channels are a very important optimisation though.

/me nods

> For the VMM
> API I think we should follow the Xen model, mixing the domain-wide and
> per-vCPU configuration. It's the best way to faithfully model the
> behaviour a true Xen guest would experience.
>
> So KVM_XEN_ATTR_TYPE_CALLBACK_VIA can be used to set one of
> • HVMIRQ_callback_vector, taking a vector#
> • HVMIRQ_callback_gsi for the in-kernel irqchip, taking a GSI#
>
> And *maybe* in a later patch it could also handle
> • HVMIRQ_callback_gsi for split-irqchip, taking an eventfd
> • HVMIRQ_callback_pci_intx, taking an eventfd (or a pair, for EOI?)
>
Most of the Xen versions we were caring about had callback_vector and
the vcpu callback vector (despite Linux not using the latter). But if you're
dating back to 3.2 and 4.1 (or certain Windows drivers), I suppose
gsi and pci-intx are must-haves.

> I don't know if the latter two really make sense. After all the
> argument for handling IPI/VIRQ in kernel kind of falls over if we have
> to bounce out to userspace anyway.

Might as well let userspace's EVTCHNOP_send handling take care of it, I wonder.

> So it *only* makes sense if those
> eventfds actually end up wired *through* userspace to a KVM IRQFD as I
> described for the IOMMU stuff.
>
We didn't implement the phy event channels, but most old kernels we
tested (back to 2.6.XX) at the time seemed to play along.

> In addition to that per-domain setup, we'd also have a per-vCPU
> KVM_XEN_ATTR_TYPE_VCPU_CALLBACK_VECTOR which takes {vCPU, vector}.
>
I feel we could just accommodate it as a subtype in KVM_XEN_ATTR_TYPE_CALLBACK_VIA.
I don't see the advantage in having another xen attr type.

But I kinda have mixed feelings about having the kernel handle the whole event channel ABI,
as opposed to only the ones userspace asked to offload. It looks a tad unnecessary, besides
the added gain to VMMs that then don't need to care about the internals of event channels.
But performance-wise it wouldn't bring anything better. But maybe the former is reason
enough to consider it.

Joao

2020-12-02 19:06:14

by David Woodhouse

[permalink] [raw]
Subject: Re: [PATCH RFC 10/39] KVM: x86/xen: support upcall vector

On Wed, 2020-12-02 at 18:34 +0000, Joao Martins wrote:
> On 12/2/20 4:47 PM, David Woodhouse wrote:
> > On Wed, 2020-12-02 at 13:12 +0000, Joao Martins wrote:
> > > On 12/2/20 11:17 AM, David Woodhouse wrote:
> > > > I might be more inclined to go for a model where the kernel handles the
> > > > evtchn_pending/evtchn_mask for us. What would go into the irq routing
> > > > table is { vcpu, port# } which get passed to kvm_xen_evtchn_send().
> > >
> > > But if we pass the port to the routing and handle the sending of events, wouldn't it lead to
> > > unnecessary handling of event channels which aren't handled by the kernel, compared to
> > > just caring about injecting the upcall?
> >
> > Well, I'm generally in favour of *not* doing things in the kernel that
> > don't need to be there.
> >
> > But if the kernel is going to short-circuit the IPIs and VIRQs, then
> > it's already going to have to handle the evtchn_pending/evtchn_mask
> > bitmaps, and actually injecting interrupts.
> >
>
> Right. I was trying to point that out in the discussion we had
> on the next patch. But truth be told, it was more about touting the idea of the kernel
> knowing if a given event channel is registered for userspace handling,
> rather than fully handling the event channel.
>
> I suppose we are able to provide both options to the VMM anyway,
> i.e. 1) letting them handle it entirely in userspace by intercepting
> EVTCHNOP_send, or 2) through the irq route if we want the kernel to offload it.

Right. The kernel takes what it knows about and anything else goes up
to userspace.

I do like the way you've handled the vcpu binding in userspace, and the
kernel just knows that a given port goes to a given target CPU.

>
> > For the VMM
> > API I think we should follow the Xen model, mixing the domain-wide and
> > per-vCPU configuration. It's the best way to faithfully model the
> > behaviour a true Xen guest would experience.
> >
> > So KVM_XEN_ATTR_TYPE_CALLBACK_VIA can be used to set one of
> > • HVMIRQ_callback_vector, taking a vector#
> > • HVMIRQ_callback_gsi for the in-kernel irqchip, taking a GSI#
> >
> > And *maybe* in a later patch it could also handle
> > • HVMIRQ_callback_gsi for split-irqchip, taking an eventfd
> > • HVMIRQ_callback_pci_intx, taking an eventfd (or a pair, for EOI?)
> >
>
> Most of the Xen versions we were caring about had callback_vector and
> the vcpu callback vector (despite Linux not using the latter). But if you're
> dating back to 3.2 and 4.1 (or certain Windows drivers), I suppose
> gsi and pci-intx are must-haves.

Not sure about GSI, but PCI-INTX is definitely something I've seen in
active use by customers recently. I think SLES10 will use that.

> I feel we could just accommodate it as a subtype in KVM_XEN_ATTR_TYPE_CALLBACK_VIA.
> I don't see the advantage in having another xen attr type.

Yeah, fair enough.

> But I kinda have mixed feelings about having the kernel handle the whole event channel ABI,
> as opposed to only the ones userspace asked to offload. It looks a tad unnecessary, besides
> the added gain to VMMs that then don't need to care about the internals of event channels.
> But performance-wise it wouldn't bring anything better. But maybe the former is reason
> enough to consider it.

Yeah, we'll see. Especially when it comes to implementing FIFO event
channels, I'd rather just do it in one place — and if the kernel does
it anyway then it's hardly difficult to hook into that.

But I've been about as coherent as I can be in email, and I think we're
generally aligned on the direction. I'll do some more experiments and
see what I can get working, and what it looks like.

I'm focusing on making the shinfo stuff all use kvm_map_gfn() first.



2020-12-02 20:18:28

by Joao Martins

[permalink] [raw]
Subject: Re: [PATCH RFC 10/39] KVM: x86/xen: support upcall vector

On 12/2/20 7:02 PM, David Woodhouse wrote:
> On Wed, 2020-12-02 at 18:34 +0000, Joao Martins wrote:
>> On 12/2/20 4:47 PM, David Woodhouse wrote:
>>> On Wed, 2020-12-02 at 13:12 +0000, Joao Martins wrote:
>>>> On 12/2/20 11:17 AM, David Woodhouse wrote:
>>> For the VMM
>>> API I think we should follow the Xen model, mixing the domain-wide and
>>> per-vCPU configuration. It's the best way to faithfully model the
>>> behaviour a true Xen guest would experience.
>>>
>>> So KVM_XEN_ATTR_TYPE_CALLBACK_VIA can be used to set one of
>>> • HVMIRQ_callback_vector, taking a vector#
>>> • HVMIRQ_callback_gsi for the in-kernel irqchip, taking a GSI#
>>>
>>> And *maybe* in a later patch it could also handle
>>> • HVMIRQ_callback_gsi for split-irqchip, taking an eventfd
>>> • HVMIRQ_callback_pci_intx, taking an eventfd (or a pair, for EOI?)
>>>
>>
>> Most of the Xen versions we were caring about had callback_vector and
>> the vcpu callback vector (despite Linux not using the latter). But if you're
>> dating back to 3.2 and 4.1 (or certain Windows drivers), I suppose
>> gsi and pci-intx are must-haves.
>
> Not sure about GSI, but PCI-INTX is definitely something I've seen in
> active use by customers recently. I think SLES10 will use that.
>

Some of the Windows drivers we used were relying on GSI.

I don't know what kernel SLES10 is on, but Linux has been aware
of XENFEAT_hvm_callback_vector since 2.6.35, i.e. about 10 years.
Unless some out-of-tree patch is opting it out, I suppose.

>
>> But I kinda have mixed feelings about having the kernel handle the whole event channel ABI,
>> as opposed to only the ones userspace asked to offload. It looks a tad unnecessary, besides
>> the added gain to VMMs that then don't need to care about the internals of event channels.
>> But performance-wise it wouldn't bring anything better. But maybe the former is reason
>> enough to consider it.
>
> Yeah, we'll see. Especially when it comes to implementing FIFO event
> channels, I'd rather just do it in one place — and if the kernel does
> it anyway then it's hardly difficult to hook into that.
>
Fortunately that's Xen 4.3 and up, *I think* :) (the FIFO events)

And Linux is the only user I am aware of, IIRC.

> But I've been about as coherent as I can be in email, and I think we're
> generally aligned on the direction.

Yes, definitely.

> I'll do some more experiments and
> see what I can get working, and what it looks like.
>
> I'm focusing on making the shinfo stuff all use kvm_map_gfn() first.
>
I was chatting with Ankur, and we can't fully remember why we dropped using
kvm_vcpu_map/kvm_map_gfn. We were using kvm_vcpu_map(), but at the time the new guest
mapping series was under discussion, so we dropped those until it settled in.

One "side effect" of mapping shared_info with kvm_vcpu_map is that we have to loop over all
vcpus, unless we move shared_info elsewhere IIRC. But switching vcpu_info, vcpu_time_info
(and the steal clock) to kvm_vcpu_map is trivial... at least based on our old wip branches here.
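
i.e. roughly this as a one-time setup, using the ~v5.9 signatures (the vcpu_info_map field is hypothetical):

	static int kvm_xen_map_vcpu_info(struct kvm_vcpu *vcpu, gpa_t gpa)
	{
		struct kvm_host_map *map = &vcpu->arch.xen.vcpu_info_map;

		if (kvm_vcpu_map(vcpu, gpa_to_gfn(gpa), map))
			return -EINVAL;

		vcpu->arch.xen.vcpu_info = map->hva + offset_in_page(gpa);
		return 0;
	}

	/* ...with a matching kvm_vcpu_unmap(vcpu, map, true) on teardown. */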

Joao

2020-12-02 20:37:32

by Ankur Arora

[permalink] [raw]
Subject: Re: [PATCH RFC 03/39] KVM: x86/xen: register shared_info page



On 2020-12-02 4:20 a.m., David Woodhouse wrote:
> On Wed, 2020-12-02 at 10:44 +0000, Joao Martins wrote:
>> [late response - was on holiday yesterday]
>>
>> On 12/2/20 12:40 AM, Ankur Arora wrote:
>>> On 2020-12-01 5:07 a.m., David Woodhouse wrote:
>>>> On Wed, 2019-02-20 at 20:15 +0000, Joao Martins wrote:
>>>>> +static int kvm_xen_shared_info_init(struct kvm *kvm, gfn_t gfn)
>>>>> +{
>>>>> + struct shared_info *shared_info;
>>>>> + struct page *page;
>>>>> +
>>>>> + page = gfn_to_page(kvm, gfn);
>>>>> + if (is_error_page(page))
>>>>> + return -EINVAL;
>>>>> +
>>>>> + kvm->arch.xen.shinfo_addr = gfn;
>>>>> +
>>>>> + shared_info = page_to_virt(page);
>>>>> + memset(shared_info, 0, sizeof(struct shared_info));
>>>>> + kvm->arch.xen.shinfo = shared_info;
>>>>> + return 0;
>>>>> +}
>>>>> +
>>>>
>>>> Hm.
>>>>
>>>> How come we get to pin the page and directly dereference it every time,
>>>> while kvm_setup_pvclock_page() has to use kvm_write_guest_cached()
>>>> instead?
>>>
>>> So looking at my WIP trees from the time, this is something that
>>> we went back and forth on as well with using just a pinned page or a
>>> persistent kvm_vcpu_map().
>>>
>>> I remember distinguishing shared_info/vcpu_info from kvm_setup_pvclock_page()
>>> as shared_info is created early and is not expected to change during the
>>> lifetime of the guest which didn't seem true for MSR_KVM_SYSTEM_TIME (or
>>> MSR_KVM_STEAL_TIME) so that would either need to do a kvm_vcpu_map()
>>> kvm_vcpu_unmap() dance or do some kind of synchronization.
>>>
>>> That said, I don't think this code explicitly disallows any updates
>>> to shared_info.
>>>
>>>>
>>>> If that was allowed, wouldn't it have been a much simpler fix for
>>>> CVE-2019-3016? What am I missing?
>>>
>>> Agreed.
>>>
>>> Perhaps, Paolo can chime in with why KVM never uses pinned page
>>> and always prefers to do cached mappings instead?
>>>
>>
>> Part of the CVE fix to not use cached versions.
>>
>> It's not a longterm pin of the page unlike we try to do here (partly due to the nature
>> of the pages we are mapping) but we still we map the gpa, RMW the steal time struct, and
>> then unmap the page.
>>
>> See record_steal_time() -- but more specifically commit b043138246 ("x86/KVM: Make sure
>> KVM_VCPU_FLUSH_TLB flag is not missed").
>>
>> But I am not sure it's a good idea to follow the same as record_steal_time() given that
>> this is a fairly sensitive code path for event channels.
>
> Right. We definitely need to use atomic RMW operations (like the CVE
> fix did) so the page needs to be *mapped*.
>
> My question was about a permanent pinned mapping vs the map/unmap-as-we-need-it
> that record_steal_time() does.
>
> On IRC, Paolo told me that permanent pinning causes problems for memory
> hotplug, and pointed me at the trick we do with an MMU notifier and
> kvm_vcpu_reload_apic_access_page().

Okay that answers my question. Thanks for clearing that up.

Not sure of a good place to document this but it would be good to
have this written down somewhere. Maybe kvm_map_gfn()?

>
> I'm going to stick with the pinning we have for the moment, and just
> fix up the fact that it leaks the pinned pages if the guest sets the
> shared_info address more than once.
>
> At some point the apic page MMU notifier thing can be made generic, and
> we can use that for this and for KVM steal time too.
>

Yeah, that's something that'll definitely be good to have.

Ankur

2020-12-02 20:40:57

by Ankur Arora

[permalink] [raw]
Subject: Re: [PATCH RFC 03/39] KVM: x86/xen: register shared_info page

On 2020-12-02 2:44 a.m., Joao Martins wrote:
> [late response - was on holiday yesterday]
>
> On 12/2/20 12:40 AM, Ankur Arora wrote:
>> On 2020-12-01 5:07 a.m., David Woodhouse wrote:
>>> On Wed, 2019-02-20 at 20:15 +0000, Joao Martins wrote:
>>>> +static int kvm_xen_shared_info_init(struct kvm *kvm, gfn_t gfn)
>>>> +{
>>>> + struct shared_info *shared_info;
>>>> + struct page *page;
>>>> +
>>>> + page = gfn_to_page(kvm, gfn);
>>>> + if (is_error_page(page))
>>>> + return -EINVAL;
>>>> +
>>>> + kvm->arch.xen.shinfo_addr = gfn;
>>>> +
>>>> + shared_info = page_to_virt(page);
>>>> + memset(shared_info, 0, sizeof(struct shared_info));
>>>> + kvm->arch.xen.shinfo = shared_info;
>>>> + return 0;
>>>> +}
>>>> +
>>>
>>> Hm.
>>>
>>> How come we get to pin the page and directly dereference it every time,
>>> while kvm_setup_pvclock_page() has to use kvm_write_guest_cached()
>>> instead?
>>
>> So looking at my WIP trees from the time, this is something that
>> we went back and forth on as well with using just a pinned page or a
>> persistent kvm_vcpu_map().
>>
>> I remember distinguishing shared_info/vcpu_info from kvm_setup_pvclock_page()
>> as shared_info is created early and is not expected to change during the
>> lifetime of the guest which didn't seem true for MSR_KVM_SYSTEM_TIME (or
>> MSR_KVM_STEAL_TIME) so that would either need to do a kvm_vcpu_map()
>> kvm_vcpu_unmap() dance or do some kind of synchronization.
>>
>> That said, I don't think this code explicitly disallows any updates
>> to shared_info.
>>
>>>
>>> If that was allowed, wouldn't it have been a much simpler fix for
>>> CVE-2019-3016? What am I missing?
>>
>> Agreed.
>>
>> Perhaps, Paolo can chime in with why KVM never uses pinned page
>> and always prefers to do cached mappings instead?
>>
> Part of the CVE fix to not use cached versions.
>
> It's not a longterm pin of the page unlike we try to do here (partly due to the nature
> of the pages we are mapping) but we still we map the gpa, RMW the steal time struct, and
> then unmap the page.
>
> See record_steal_time() -- but more specifically commit b043138246 ("x86/KVM: Make sure
> KVM_VCPU_FLUSH_TLB flag is not missed").
>
> But I am not sure it's a good idea to follow the same as record_steal_time() given that
> this is a fairly sensitive code path for event channels.
>
>>>
>>> Should I rework these to use kvm_write_guest_cached()?
>>
>> kvm_vcpu_map() would be better. The event channel logic does RMW operations
>> on shared_info->vcpu_info.
>>
> Indeed, yes.
>
> Ankur, IIRC we saw missed event channel notifications when we were using the
> {write,read}_cached() version of the patch.
>
> But I can't remember what the reason was, either the evtchn_pending or the mask
> word -- which would make it not inject an upcall.

If memory serves, it was the mask. Though I don't think that we had
kvm_{write,read}_cached in use at that point -- given that they were
definitely not RMW safe.


Ankur

>
> Joao
>

2020-12-02 20:43:13

by David Woodhouse

[permalink] [raw]
Subject: Re: [PATCH RFC 10/39] KVM: x86/xen: support upcall vector

On Wed, 2020-12-02 at 20:12 +0000, Joao Martins wrote:
> > I'll do some more experiments and
> > see what I can get working, and what it looks like.
> >
> > I'm focusing on making the shinfo stuff all use kvm_map_gfn() first.
> >
>
> I was chatting with Ankur, and we can't fully remember why we dropped using
> kvm_vcpu_map/kvm_map_gfn. We were using kvm_vcpu_map(), but at the time the new guest
> mapping series was under discussion, so we dropped those until it settled in.
>
> One "side effect" of mapping shared_info with kvm_vcpu_map is that we have to loop over all
> vcpus, unless we move shared_info elsewhere IIRC. But switching vcpu_info, vcpu_time_info
> (and the steal clock) to kvm_vcpu_map is trivial... at least based on our old wip branches here.

I'm not mapping/unmapping every time. I've just changed the
page_to_virt() bit to use kvm_map_gfn() as a one-time setup for now,
because I need it to work for GFNs without a backing page.

I have that working for the shinfo in my tree so far; will do the other
pages next.

In the fullness of time I'd like to eliminate the duplication between
kvm_setup_pvclock_page() and kvm_xen_update_vcpu_time(), which are
doing precisely the same thing. I think that can wait until we fix up
the MMU notifier thing as discussed though, so they can all just do
direct dereferencing.

But I'm inclined not to get hung up on that for now. There are much
more fun things to be playing with.



2020-12-03 01:12:35

by Ankur Arora

[permalink] [raw]
Subject: Re: [PATCH RFC 10/39] KVM: x86/xen: support upcall vector

On 2020-12-02 11:02 a.m., David Woodhouse wrote:
> On Wed, 2020-12-02 at 18:34 +0000, Joao Martins wrote:
>> On 12/2/20 4:47 PM, David Woodhouse wrote:
>>> On Wed, 2020-12-02 at 13:12 +0000, Joao Martins wrote:
>>>> On 12/2/20 11:17 AM, David Woodhouse wrote:
>>>>> I might be more inclined to go for a model where the kernel handles the
>>>>> evtchn_pending/evtchn_mask for us. What would go into the irq routing
>>>>> table is { vcpu, port# } which get passed to kvm_xen_evtchn_send().
>>>>
>>>> But if we pass the port to the routing and handle the sending of events, wouldn't it lead to
>>>> unnecessary handling of event channels which aren't handled by the kernel, compared to
>>>> just caring about injecting the upcall?
>>>
>>> Well, I'm generally in favour of *not* doing things in the kernel that
>>> don't need to be there.
>>>
>>> But if the kernel is going to short-circuit the IPIs and VIRQs, then
>>> it's already going to have to handle the evtchn_pending/evtchn_mask
>>> bitmaps, and actually injecting interrupts.
>>>
>>
>> Right. I was trying to point that out in the discussion we had
>> on the next patch. But truth be told, it was more about touting the idea of the kernel
>> knowing if a given event channel is registered for userspace handling,
>> rather than fully handling the event channel.
>>
>> I suppose we are able to provide both options to the VMM anyway,
>> i.e. 1) letting them handle it entirely in userspace by intercepting
>> EVTCHNOP_send, or 2) through the irq route if we want the kernel to offload it.
>
> Right. The kernel takes what it knows about and anything else goes up
> to userspace.
>
> I do like the way you've handled the vcpu binding in userspace, and the
> kernel just knows that a given port goes to a given target CPU.
>
>>
>>> For the VMM
>>> API I think we should follow the Xen model, mixing the domain-wide and
>>> per-vCPU configuration. It's the best way to faithfully model the
>>> behaviour a true Xen guest would experience.
>>>
>>> So KVM_XEN_ATTR_TYPE_CALLBACK_VIA can be used to set one of
>>> • HVMIRQ_callback_vector, taking a vector#
>>> • HVMIRQ_callback_gsi for the in-kernel irqchip, taking a GSI#
>>>
>>> And *maybe* in a later patch it could also handle
>>> • HVMIRQ_callback_gsi for split-irqchip, taking an eventfd
>>> • HVMIRQ_callback_pci_intx, taking an eventfd (or a pair, for EOI?)
>>>
>>
>> Most of the Xen versions we were caring about had callback_vector and
>> the vcpu callback vector (despite Linux not using the latter). But if you're
>> dating back to 3.2 and 4.1 (or certain Windows drivers), I suppose
>> gsi and pci-intx are must-haves.
>
> Not sure about GSI, but PCI-INTX is definitely something I've seen in
> active use by customers recently. I think SLES10 will use that.
>
>> I feel we could just accommodate it as a subtype in KVM_XEN_ATTR_TYPE_CALLBACK_VIA.
>> I don't see the advantage in having another xen attr type.
>
> Yeah, fair enough.
>
>> But I kinda have mixed feelings about having the kernel handle the whole event channel ABI,
>> as opposed to only the ones userspace asked to offload. It looks a tad unnecessary, besides
>> the added gain to VMMs that then don't need to care about the internals of event channels.
>> But performance-wise it wouldn't bring anything better. But maybe the former is reason
>> enough to consider it.
>
> Yeah, we'll see. Especially when it comes to implementing FIFO event
> channels, I'd rather just do it in one place — and if the kernel does
> it anyway then it's hardly difficult to hook into that.

Sorry I'm late to this conversation. Not a whole lot to add to what Joao
said. I would only differ with him on how much to offload.

Given that we need the fast path in the kernel anyway, I think it's simpler
to do all the event-channel bitmap handling in the kernel.
This would also simplify using the kernel Xen drivers if someone eventually
decides to use them.


Ankur

>
> But I've been about as coherent as I can be in email, and I think we're
> generally aligned on the direction. I'll do some more experiments and
> see what I can get working, and what it looks like.
>
> I'm focusing on making the shinfo stuff all use kvm_map_gfn() first.
>

2020-12-03 10:20:27

by David Woodhouse

[permalink] [raw]
Subject: Re: [PATCH RFC 03/39] KVM: x86/xen: register shared_info page

On Wed, 2020-12-02 at 12:32 -0800, Ankur Arora wrote:
> > On IRC, Paolo told me that permanent pinning causes problems for memory
> > hotplug, and pointed me at the trick we do with an MMU notifier and
> > kvm_vcpu_reload_apic_access_page().
>
> Okay that answers my question. Thanks for clearing that up.
>
> Not sure of a good place to document this but it would be good to
> have this written down somewhere. Maybe kvm_map_gfn()?

Trying not to get too distracted by polishing this part, so I can
continue with making more things actually work. But I took a quick look
at the reload_apic_access_page() thing.

AFAICT it works because the access is only from *within* the vCPU, in
guest mode.

So all the notifier has to do is kick all CPUs, which happens when it
calls kvm_make_all_cpus_request(). Thus we are guaranteed that all CPUs
are *out* of guest mode by the time...

...er... maybe not by the time the notifier returns, because all
we've done is *send* the IPI and we don't know the other CPUs have
actually stopped running the guest yet?

Maybe there's some explanation of why the actual TLB shootdown
truly *will* occur before the page goes away, and some ordering
rules which mean our reschedule IPI will happen first? Something
like that ideally would have been in a comment in the MMU notifier.

Anyway, I'm not going to fret about that because it clearly doesn't
work for the pvclock/shinfo cases anyway; those are accessed from the
kernel *outside* guest mode. I think we can use a similar trick but
just protect the accesses with kvm->srcu.

The code in my tree now uses kvm_map_gfn() at setup time and leaves the
shinfo page mapped. A bit like the KVM pvclock code, except that I
don't pointlessly map and unmap all the time while leaving the page
pinned anyway.

So all we need to do is kvm_release_pfn() right after mapping it,
leaving the map->hva "unsafe".

Then we hook in to the MMU notifier to unmap it when it goes away (in
RCU style; clearing the pointer variable, synchronize_srcu(), and
*then* unmap, possibly in different range_start/range/range_end
callbacks).

The *users* of such mappings will need an RCU read lock, and will need
to be able to fault the GFN back in when they need it.
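
i.e. something like this (illustrative; kvm_xen_remap_shinfo() is a hypothetical helper that faults the GFN back in):

	/* MMU notifier side: */
	rcu_assign_pointer(kvm->arch.xen.shinfo, NULL);
	synchronize_srcu(&kvm->srcu);		/* wait out all current readers */
	/* ...only now actually unmap/release the old page... */

	/* Reader side: */
	idx = srcu_read_lock(&kvm->srcu);
	shinfo = srcu_dereference(kvm->arch.xen.shinfo, &kvm->srcu);
	if (!shinfo)
		shinfo = kvm_xen_remap_shinfo(kvm);
	/* ...atomic RMW on shinfo... */
	srcu_read_unlock(&kvm->srcu, idx);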

For now, though, I'm content that converting the Xen shinfo support to
kvm_map_gfn() is the right way to go, and the memory hotplug support is
an incremental addition on top of it.

And I'm even more comforted by the observation that the KVM pvclock
support *already* pins its pages, so I'm not even failing to meet the
existing bar :)



2020-12-04 17:33:03

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH RFC 03/39] KVM: x86/xen: register shared_info page

On Thu, Dec 03, 2020, David Woodhouse wrote:
> On Wed, 2020-12-02 at 12:32 -0800, Ankur Arora wrote:
> > > On IRC, Paolo told me that permanent pinning causes problems for memory
> > > hotplug, and pointed me at the trick we do with an MMU notifier and
> > > kvm_vcpu_reload_apic_access_page().
> >
> > Okay that answers my question. Thanks for clearing that up.
> >
> > Not sure of a good place to document this but it would be good to
> > have this written down somewhere. Maybe kvm_map_gfn()?
>
> Trying not to get too distracted by polishing this part, so I can
> continue with making more things actually work. But I took a quick look
> at the reload_apic_access_page() thing.
>
> AFAICT it works because the access is only from *within* the vCPU, in
> guest mode.
>
> So all the notifier has to do is kick all CPUs, which happens when it
> calls kvm_make_all_cpus_request(). Thus we are guaranteed that all CPUs
> are *out* of guest mode by the time...
>
> ...er... maybe not by the time the notifier returns, because all
> we've done is *send* the IPI and we don't know the other CPUs have
> actually stopped running the guest yet?
>
> Maybe there's some explanation of why the actual TLB shootdown
> truly *will* occur before the page goes away, and some ordering
> rules which mean our reschedule IPI will happen first? Something
> like that ideally would have been in a comment in the MMU notifier.

KVM_REQ_APIC_PAGE_RELOAD is tagged with KVM_REQUEST_WAIT, which means that
kvm_kick_many_cpus() and thus smp_call_function_many() will have @wait=true,
i.e. the sender will wait for the SMP function call to finish on the target CPUs.
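
i.e., approximately (from the kvm_host.h and kvm_main.c of this era; the request number is incidental):

	#define KVM_REQUEST_WAIT	BIT(9)
	#define KVM_REQ_APIC_PAGE_RELOAD \
		KVM_ARCH_REQ_FLAGS(20, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)

	/* kvm_kick_many_cpus(), simplified: */
	static bool kvm_kick_many_cpus(const struct cpumask *cpus, bool wait)
	{
		/* wait == true: don't return until ack_flush ran on every target */
		smp_call_function_many(cpus, ack_flush, NULL, wait);
		return true;
	}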

2020-12-08 16:12:55

by David Woodhouse

[permalink] [raw]
Subject: Re: [PATCH RFC 10/39] KVM: x86/xen: support upcall vector

On Wed, 2020-12-02 at 19:02 +0000, David Woodhouse wrote:
>
> > I feel we could just accommodate it as a subtype in KVM_XEN_ATTR_TYPE_CALLBACK_VIA.
> > I don't see the advantage in having another xen attr type.
>
> Yeah, fair enough.
>
> > But I kinda have mixed feelings about having the kernel handle the whole event channel ABI,
> > as opposed to only the ones userspace asked to offload. It looks a tad unnecessary, besides
> > the added gain to VMMs that then don't need to care about the internals of event channels.
> > But performance-wise it wouldn't bring anything better. But maybe the former is reason
> > enough to consider it.
>
> Yeah, we'll see. Especially when it comes to implementing FIFO event
> channels, I'd rather just do it in one place — and if the kernel does
> it anyway then it's hardly difficult to hook into that.
>
> But I've been about as coherent as I can be in email, and I think we're
> generally aligned on the direction. I'll do some more experiments and
> see what I can get working, and what it looks like.


So... I did some more typing, and revived our completely userspace
based implementation of event channels. I wanted to declare that such
was *possible*, and that using the kernel for IPI and VIRQ was just a
very desirable optimisation.

It looks like Linux doesn't use the per-vCPU upcall vector that you
called 'KVM_XEN_CALLBACK_VIA_EVTCHN'. So I'm delivering interrupts via
KVM_INTERRUPT as if they were ExtINT....

... except I'm not. Because the kernel really does expect that to be an
ExtINT from a legacy PIC, and kvm_apic_accept_pic_intr() only returns
true if LVT0 is set up for EXTINT and unmasked.

I messed around with this hack and increasingly desperate variations on
the theme (since this one doesn't cause the IRQ window to be opened to
userspace in the first place), but couldn't get anything working:

--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -2380,6 +2380,9 @@ int kvm_apic_accept_pic_intr(struct kvm_vcpu *vcpu)
 	if ((lvt0 & APIC_LVT_MASKED) == 0 &&
 	    GET_APIC_DELIVERY_MODE(lvt0) == APIC_MODE_EXTINT)
 		r = 1;
+	/* Shoot me. */
+	if (vcpu->arch.pending_external_vector == 243)
+		r = 1;
 	return r;
 }


Eventually I resorted to delivering the interrupt through the lapic
*anyway* (through KVM_SIGNAL_MSI with an MSI message constructed for
the appropriate vCPU/vector) and the following hack to auto-EOI:

--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -2416,7 +2419,7 @@ int kvm_get_apic_interrupt(struct kvm_vcpu *vcpu)
 	 */
 
 	apic_clear_irr(vector, apic);
-	if (test_bit(vector, vcpu_to_synic(vcpu)->auto_eoi_bitmap)) {
+	if (vector == 243 || test_bit(vector, vcpu_to_synic(vcpu)->auto_eoi_bitmap)) {
 		/*
 		 * For auto-EOI interrupts, there might be another pending
 		 * interrupt above PPR, so check whether to raise another
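(For concreteness, the userspace side of that delivery is just a
KVM_SIGNAL_MSI ioctl with a fixed-delivery message; a rough sketch,
where the helper name is made up and vector 243 mirrors the hack above:)

	#include <stdint.h>
	#include <sys/ioctl.h>
	#include <linux/kvm.h>

	/* Deliver @vector to the vCPU whose APIC ID is @apic_id as if a
	 * device had signalled an MSI: 0xfee00000 is the MSI address base,
	 * the destination ID sits in bits 12-19 of the address, and the low
	 * byte of the data is the vector (fixed delivery mode, edge). */
	static int send_upcall_as_msi(int vm_fd, uint8_t apic_id, uint8_t vector)
	{
		struct kvm_msi msi = {
			.address_lo = 0xfee00000 | ((uint32_t)apic_id << 12),
			.address_hi = 0,
			.data = vector,
		};

		return ioctl(vm_fd, KVM_SIGNAL_MSI, &msi);
	}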


That works, and now my guest finishes the SMP bringup (and gets as far
as waiting on the XenStore implementation that I haven't put back yet).

So I think we need at least a tiny amount of support in-kernel for
delivering event channel interrupt vectors, even if we wanted to allow
for a completely userspace implementation.

Unless I'm missing something?

I will get on with implementing the in-kernel handling with IRQ routing
entries targeting a given { port, vcpu }. And I'm kind of vacillating
about whether the mode/vector should be separately configured, or
whether they might as well be in the IRQ routing table too, even if
it's kind of redundant because it's specified the same for *every* port
targeting the same vCPU. I *think* I prefer that redundancy over having
a separate configuration mechanism to set the vector for each vCPU. But
we'll see what happens when my fingers do the typing...


Attachments:
smime.p7s (5.05 kB)

2020-12-09 06:39:09

by Ankur Arora

[permalink] [raw]
Subject: Re: [PATCH RFC 10/39] KVM: x86/xen: support upcall vector

On 2020-12-08 8:08 a.m., David Woodhouse wrote:
> On Wed, 2020-12-02 at 19:02 +0000, David Woodhouse wrote:
>>
>>> I feel we could just accommodate it as a subtype in KVM_XEN_ATTR_TYPE_CALLBACK_VIA.
>>> Don't see the advantage in having another xen attr type.
>>
>> Yeah, fair enough.
>>
>>> But I kinda have mixed feelings about having the kernel handle the whole event channels ABI,
>>> as opposed to only the ones userspace asked to offload. It looks a tad unnecessary besides
>>> the added gain to VMMs that don't need to care about the internals of event channels.
>>> But performance-wise it wouldn't bring anything better. But maybe the former is reason
>>> enough to consider it.
>>
>> Yeah, we'll see. Especially when it comes to implementing FIFO event
>> channels, I'd rather just do it in one place — and if the kernel does
>> it anyway then it's hardly difficult to hook into that.
>>
>> But I've been about as coherent as I can be in email, and I think we're
>> generally aligned on the direction. I'll do some more experiments and
>> see what I can get working, and what it looks like.
>
>
> So... I did some more typing, and revived our completely userspace
> based implementation of event channels. I wanted to declare that such
> was *possible*, and that using the kernel for IPI and VIRQ was just a
> very desirable optimisation.
>
> It looks like Linux doesn't use the per-vCPU upcall vector that you
> called 'KVM_XEN_CALLBACK_VIA_EVTCHN'. So I'm delivering interrupts via
> KVM_INTERRUPT as if they were ExtINT....
>
> ... except I'm not. Because the kernel really does expect that to be an
> ExtINT from a legacy PIC, and kvm_apic_accept_pic_intr() only returns
> true if LVT0 is set up for EXTINT and unmasked.
>
> I messed around with this hack and increasingly desperate variations on
> the theme (since this one doesn't cause the IRQ window to be opened to
> userspace in the first place), but couldn't get anything working:

Increasingly desperate variations, about sums up my process as well while
trying to get the upcall vector working.

>
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -2380,6 +2380,9 @@ int kvm_apic_accept_pic_intr(struct kvm_vcpu *vcpu)
> if ((lvt0 & APIC_LVT_MASKED) == 0 &&
> GET_APIC_DELIVERY_MODE(lvt0) == APIC_MODE_EXTINT)
> r = 1;
> + /* Shoot me. */
> + if (vcpu->arch.pending_external_vector == 243)
> + r = 1;
> return r;
> }
>
>
> Eventually I resorted to delivering the interrupt through the lapic
> *anyway* (through KVM_SIGNAL_MSI with an MSI message constructed for
> the appropriate vCPU/vector) and the following hack to auto-EOI:
>
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -2416,7 +2419,7 @@ int kvm_get_apic_interrupt(struct kvm_vcpu *vcpu)
> */
>
> apic_clear_irr(vector, apic);
> - if (test_bit(vector, vcpu_to_synic(vcpu)->auto_eoi_bitmap)) {
> + if (vector == 243 || test_bit(vector, vcpu_to_synic(vcpu)->auto_eoi_bitmap)) {
> /*
> * For auto-EOI interrupts, there might be another pending
> * interrupt above PPR, so check whether to raise another
>
>
> That works, and now my guest finishes the SMP bringup (and gets as far
> as waiting on the XenStore implementation that I haven't put back yet).
>
> So I think we need at least a tiny amount of support in-kernel for
> delivering event channel interrupt vectors, even if we wanted to allow
> for a completely userspace implementation.
>
> Unless I'm missing something?

I did use the auto_eoi hack as well. So, yeah, I don't see any way of
getting around this.

Also, IIRC we had eventually gotten rid of the auto_eoi approach
because that wouldn't work with APICv. At that point we resorted to
direct queuing for vectored callbacks which was a hack that I never
grew fond of...

> I will get on with implementing the in-kernel handling with IRQ routing
> entries targeting a given { port, vcpu }. And I'm kind of vacillating
> about whether the mode/vector should be separately configured, or
> whether they might as well be in the IRQ routing table too, even if
> it's kind of redundant because it's specified the same for *every* port
> targeting the same vCPU. I *think* I prefer that redundancy over having
> a separate configuration mechanism to set the vector for each vCPU. But
> we'll see what happens when my fingers do the typing...
>

Good luck to your fingers!

Ankur

2020-12-09 10:30:56

by David Woodhouse

[permalink] [raw]
Subject: Re: [PATCH RFC 10/39] KVM: x86/xen: support upcall vector

On Tue, 2020-12-08 at 22:35 -0800, Ankur Arora wrote:
> > It looks like Linux doesn't use the per-vCPU upcall vector that you
> > called 'KVM_XEN_CALLBACK_VIA_EVTCHN'. So I'm delivering interrupts via
> > KVM_INTERRUPT as if they were ExtINT....
> >
> > ... except I'm not. Because the kernel really does expect that to be an
> > ExtINT from a legacy PIC, and kvm_apic_accept_pic_intr() only returns
> > true if LVT0 is set up for EXTINT and unmasked.
> >
> > I messed around with this hack and increasingly desperate variations on
> > the theme (since this one doesn't cause the IRQ window to be opened to
> > userspace in the first place), but couldn't get anything working:
>
> Increasingly desperate variations, about sums up my process as well while
> trying to get the upcall vector working.

:)

So this seems to work, and I think it's about as minimal as it can get.

All it does is implement a kvm_xen_has_interrupt() which checks the
vcpu_info->evtchn_upcall_pending flag, just like Xen does.

With this, my completely userspace implementation of event channels is
working. And of course this forms a basis for adding the minimal
acceleration of IPI/VIRQ that we need to the kernel, too.

My only slight concern is the bit in vcpu_enter_guest() where we have
to add the check for kvm_xen_has_interrupt(), because nothing is
setting KVM_REQ_EVENT. I suppose I could look at having something —
even an explicit ioctl from userspace — set that for us.... BUT...

I'm not sure that condition wasn't *already* broken in some cases for
KVM_INTERRUPT anyway. In kvm_vcpu_ioctl_interrupt() we set
vcpu->arch.pending_userspace_vector and we *do* request KVM_REQ_EVENT,
sure.

But what if vcpu_enter_guest() processes that the first time (and
clears KVM_REQ_EVENT), and then exits for some *other* reason with
interrupts still disabled? Next time through vcpu_enter_guest(), even
though kvm_cpu_has_injectable_intr() is still true, we don't enable the
IRQ window vmexit because KVM_REQ_EVENT got lost, so we don't even call
inject_pending_event().

So instead of just kvm_xen_has_interrupt() in my patch below, I wonder
if we need to use kvm_cpu_has_injectable_intr() to fix the existing
bug? Or am I missing something there and there isn't a bug after all?

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index d8716ef27728..4a63f212fdfc 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -902,6 +902,7 @@ struct msr_bitmap_range {
 /* Xen emulation context */
 struct kvm_xen {
 	bool long_mode;
+	u8 upcall_vector;
 	struct kvm_host_map shinfo_map;
 	void *shinfo;
 };
diff --git a/arch/x86/kvm/irq.c b/arch/x86/kvm/irq.c
index 814698e5b152..24668b51b5c8 100644
--- a/arch/x86/kvm/irq.c
+++ b/arch/x86/kvm/irq.c
@@ -14,6 +14,7 @@
 #include "irq.h"
 #include "i8254.h"
 #include "x86.h"
+#include "xen.h"
 
 /*
  * check if there are pending timer events
@@ -56,6 +57,9 @@ int kvm_cpu_has_extint(struct kvm_vcpu *v)
 	if (!lapic_in_kernel(v))
 		return v->arch.interrupt.injected;
 
+	if (kvm_xen_has_interrupt(v))
+		return 1;
+
 	if (!kvm_apic_accept_pic_intr(v))
 		return 0;
 
@@ -110,6 +114,9 @@ static int kvm_cpu_get_extint(struct kvm_vcpu *v)
 	if (!lapic_in_kernel(v))
 		return v->arch.interrupt.nr;
 
+	if (kvm_xen_has_interrupt(v))
+		return v->kvm->arch.xen.upcall_vector;
+
 	if (irqchip_split(v->kvm)) {
 		int vector = v->arch.pending_external_vector;
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index ad9eea8f4f26..1711072b3616 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8891,7 +8891,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 			kvm_x86_ops.msr_filter_changed(vcpu);
 	}
 
-	if (kvm_check_request(KVM_REQ_EVENT, vcpu) || req_int_win) {
+	if (kvm_check_request(KVM_REQ_EVENT, vcpu) || req_int_win ||
+	    kvm_xen_has_interrupt(vcpu)) {
 		++vcpu->stat.req_event;
 		kvm_apic_accept_events(vcpu);
 		if (vcpu->arch.mp_state == KVM_MP_STATE_INIT_RECEIVED) {
diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
index 4aa776c1ad57..116422d93d96 100644
--- a/arch/x86/kvm/xen.c
+++ b/arch/x86/kvm/xen.c
@@ -257,6 +257,22 @@ void kvm_xen_setup_pvclock_page(struct kvm_vcpu *v)
 	srcu_read_unlock(&v->kvm->srcu, idx);
 }
 
+int kvm_xen_has_interrupt(struct kvm_vcpu *v)
+{
+	int rc = 0;
+
+	if (v->kvm->arch.xen.upcall_vector) {
+		int idx = srcu_read_lock(&v->kvm->srcu);
+		struct vcpu_info *vcpu_info = xen_vcpu_info(v);
+
+		if (vcpu_info && READ_ONCE(vcpu_info->evtchn_upcall_pending))
+			rc = 1;
+
+		srcu_read_unlock(&v->kvm->srcu, idx);
+	}
+	return rc;
+}
+
 static int vcpu_attr_loc(struct kvm_vcpu *vcpu, u16 type,
 			 struct kvm_host_map **map, void ***hva, size_t *sz)
 {
@@ -338,6 +354,14 @@ int kvm_xen_hvm_set_attr(struct kvm *kvm, struct kvm_xen_hvm_attr *data)
 		break;
 	}
 
+	case KVM_XEN_ATTR_TYPE_UPCALL_VECTOR:
+		if (data->u.vector < 0x10)
+			return -EINVAL;
+
+		kvm->arch.xen.upcall_vector = data->u.vector;
+		r = 0;
+		break;
+
 	default:
 		break;
 	}
@@ -386,6 +410,11 @@ int kvm_xen_hvm_get_attr(struct kvm *kvm, struct kvm_xen_hvm_attr *data)
 		break;
 	}
 
+	case KVM_XEN_ATTR_TYPE_UPCALL_VECTOR:
+		data->u.vector = kvm->arch.xen.upcall_vector;
+		r = 0;
+		break;
+
 	default:
 		break;
 	}
diff --git a/arch/x86/kvm/xen.h b/arch/x86/kvm/xen.h
index ccd6002f55bc..1f599342f02c 100644
--- a/arch/x86/kvm/xen.h
+++ b/arch/x86/kvm/xen.h
@@ -25,6 +25,7 @@ static inline struct kvm_vcpu *xen_vcpu_to_vcpu(struct kvm_vcpu_xen *xen_vcpu)
 void kvm_xen_setup_pvclock_page(struct kvm_vcpu *vcpu);
 void kvm_xen_setup_runstate_page(struct kvm_vcpu *vcpu);
 void kvm_xen_runstate_set_preempted(struct kvm_vcpu *vcpu);
+int kvm_xen_has_interrupt(struct kvm_vcpu *vcpu);
 int kvm_xen_hvm_set_attr(struct kvm *kvm, struct kvm_xen_hvm_attr *data);
 int kvm_xen_hvm_get_attr(struct kvm *kvm, struct kvm_xen_hvm_attr *data);
 int kvm_xen_hypercall(struct kvm_vcpu *vcpu);
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 1047364d1adf..113279fa9e1e 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1587,6 +1587,7 @@ struct kvm_xen_hvm_attr {
 
 	union {
 		__u8 long_mode;
+		__u8 vector;
 		struct {
 			__u64 gfn;
 		} shared_info;
@@ -1604,6 +1605,7 @@ struct kvm_xen_hvm_attr {
 #define KVM_XEN_ATTR_TYPE_VCPU_INFO	0x2
 #define KVM_XEN_ATTR_TYPE_VCPU_TIME_INFO	0x3
 #define KVM_XEN_ATTR_TYPE_VCPU_RUNSTATE	0x4
+#define KVM_XEN_ATTR_TYPE_UPCALL_VECTOR	0x5
 
 /* Secure Encrypted Virtualization command */
 enum sev_cmd_id {


Attachments:
smime.p7s (5.05 kB)

2020-12-09 11:44:43

by David Woodhouse

[permalink] [raw]
Subject: Re: [PATCH RFC 10/39] KVM: x86/xen: support upcall vector

On Wed, 2020-12-09 at 10:51 +0000, Joao Martins wrote:
> Isn't this what the first half of this patch was doing initially (minus the
> irq routing) ? Looks really similar:
>
> https://lore.kernel.org/kvm/[email protected]/

Absolutely! This thread is in reply to your original posting of
precisely that patch, and I've had your tree open in gitk to crib from
for most of the last week.

There's a *reason* why my tree looks like yours, and why more than half
of the patches in it still show you as being the author :)

> Albeit, I gotta say, after seeing the irq routing removed it ends up much simpler, if we
> just replace the irq routing with a domain-wide upcall vector ;)

I like "simpler".

I also killed the ->cb.queued flag you had because I think it's
redundant with evtchn_upcall_pending anyway.

> Albeit it wouldn't cover the Windows Xen open source drivers, which use the EVTCHN method
> (a regular LAPIC vector) [to answer your question on what uses it]. For the EVTCHN method
> you can just inject a regular vector through LAPIC delivery, and the guest acks it. Sadly
> Linux doesn't use it; if it did, we would probably get faster upcalls with APICv/AVIC.

It doesn't need to, because those can just be injected with
KVM_SIGNAL_MSI.

At most, we just need to make sure that kvm_xen_has_interrupt() returns
false if the per-vCPU LAPIC vector is configured. But I didn't do that
because I checked Xen and it doesn't do it either.

As far as I can tell, Xen's hvm_vcpu_has_pending_irq() will still
return the domain-wide vector in preference to the one in the LAPIC, if
it actually gets invoked. And if the guest sets ->evtchn_upcall_pending
to reinject the IRQ (as Linux does on unmask) Xen won't use the per-
vCPU vector to inject that; it'll use the domain-wide vector.

> > I'm not sure that condition wasn't *already* broken in some cases for
> > KVM_INTERRUPT anyway. In kvm_vcpu_ioctl_interrupt() we set
> > vcpu->arch.pending_userspace_vector and we *do* request KVM_REQ_EVENT,
> > sure.
> >
> > But what if vcpu_enter_guest() processes that the first time (and
> > clears KVM_REQ_EVENT), and then exits for some *other* reason with
> > interrupts still disabled? Next time through vcpu_enter_guest(), even
> > though kvm_cpu_has_injectable_intr() is still true, we don't enable the
> > IRQ window vmexit because KVM_REQ_EVENT got lost, so we don't even call
> > inject_pending_event().
> >
> > So instead of just kvm_xen_has_interrupt() in my patch below, I wonder
> > if we need to use kvm_cpu_has_injectable_intr() to fix the existing
> > bug? Or am I missing something there and there isn't a bug after all?
> >
>
> Given that the notion of a pending event channel is Xen-specific handling, I am not sure
> we can remove the kvm_xen_has_interrupt()/kvm_xen_get_interrupt() logic. Much of the
> reason why I ended up adding a check on vmenter for pending event channels..

Sure, we definitely need the check I added in vcpu_enter_guest() for
Xen unless I'm going to come up with a way to set KVM_REQ_EVENT at the
appropriate time.

But I'm looking at the ExtINT handling and as far as I can tell it's
buggy. So I didn't want to use it as a model for setting KVM_REQ_EVENT
for Xen events.



Attachments:
smime.p7s (5.05 kB)

2020-12-09 13:31:48

by Joao Martins

[permalink] [raw]
Subject: Re: [PATCH RFC 10/39] KVM: x86/xen: support upcall vector

On 12/9/20 11:39 AM, David Woodhouse wrote:
> On Wed, 2020-12-09 at 10:51 +0000, Joao Martins wrote:
>> Isn't this what the first half of this patch was doing initially (minus the
>> irq routing) ? Looks really similar:
>>
>> https://lore.kernel.org/kvm/[email protected]/
>
> Absolutely! This thread is in reply to your original posting of
> precisely that patch, and I've had your tree open in gitk to crib from
> for most of the last week.
>
I forgot about this patch given all the discussion so far, and I had to
re-look because your snippet reminded me of it. But I ended up being a
little pedantic -- sorry about that.

> There's a *reason* why my tree looks like yours, and why more than half
> of the patches in it still show you as being the author :)
>
Btw, in this patch's case the authorship would be Ankur's.

More importantly, thanks a lot for picking it up and for all the awesome stuff you're
doing with it.

>> Albeit, I gotta say, after seeing the irq routing removed it ends up much simpler, if we
>> just replace the irq routing with a domain-wide upcall vector ;)
>
> I like "simpler".
>
> I also killed the ->cb.queued flag you had because I think it's
> redundant with evtchn_upcall_pending anyway.
>
Yeap, indeed.

>> Albeit it wouldn't cover the Windows Xen open source drivers, which use the EVTCHN method
>> (a regular LAPIC vector) [to answer your question on what uses it]. For the EVTCHN method
>> you can just inject a regular vector through LAPIC delivery, and the guest acks it. Sadly
>> Linux doesn't use it; if it did, we would probably get faster upcalls with APICv/AVIC.
>
> It doesn't need to, because those can just be injected with
> KVM_SIGNAL_MSI.
>
/me nods

> At most, we just need to make sure that kvm_xen_has_interrupt() returns
> false if the per-vCPU LAPIC vector is configured. But I didn't do that
> because I checked Xen and it doesn't do it either.
>
Oh! I have this strange recollection that it was, when we were looking at the Xen
implementation.

> As far as I can tell, Xen's hvm_vcpu_has_pending_irq() will still
> return the domain-wide vector in preference to the one in the LAPIC, if
> it actually gets invoked.

Only if the callback installed is HVMIRQ_callback_vector IIUC.

Otherwise the vector would be pending like any other LAPIC vector.

> And if the guest sets ->evtchn_upcall_pending
> to reinject the IRQ (as Linux does on unmask) Xen won't use the per-
> vCPU vector to inject that; it'll use the domain-wide vector.

Right -- I don't think Linux even installs a per-CPU upcall LAPIC vector other than the
domain's callback irq.

>>> I'm not sure that condition wasn't *already* broken in some cases for
>>> KVM_INTERRUPT anyway. In kvm_vcpu_ioctl_interrupt() we set
>>> vcpu->arch.pending_userspace_vector and we *do* request KVM_REQ_EVENT,
>>> sure.
>>>
>>> But what if vcpu_enter_guest() processes that the first time (and
>>> clears KVM_REQ_EVENT), and then exits for some *other* reason with
>>> interrupts still disabled? Next time through vcpu_enter_guest(), even
>>> though kvm_cpu_has_injectable_intr() is still true, we don't enable the
>>> IRQ window vmexit because KVM_REQ_EVENT got lost, so we don't even call
>>> inject_pending_event().
>>>
>>> So instead of just kvm_xen_has_interrupt() in my patch below, I wonder
>>> if we need to use kvm_cpu_has_injectable_intr() to fix the existing
>>> bug? Or am I missing something there and there isn't a bug after all?
>>
>> Given that the notion of a pending event channel is Xen-specific handling, I am not sure
>> we can remove the kvm_xen_has_interrupt()/kvm_xen_get_interrupt() logic. Much of the
>> reason why we ended up adding a check on vmenter for pending event channels..
>
> Sure, we definitely need the check I added in vcpu_enter_guest() for
> Xen unless I'm going to come up with a way to set KVM_REQ_EVENT at the
> appropriate time.
>
> But I'm looking at the ExtINT handling and as far as I can tell it's
> buggy. So I didn't want to use it as a model for setting KVM_REQ_EVENT
> for Xen events.
>
/me nods

2020-12-09 14:28:48

by Joao Martins

[permalink] [raw]
Subject: Re: [PATCH RFC 10/39] KVM: x86/xen: support upcall vector



On 12/9/20 10:27 AM, David Woodhouse wrote:
> On Tue, 2020-12-08 at 22:35 -0800, Ankur Arora wrote:
>>> It looks like Linux doesn't use the per-vCPU upcall vector that you
>>> called 'KVM_XEN_CALLBACK_VIA_EVTCHN'. So I'm delivering interrupts via
>>> KVM_INTERRUPT as if they were ExtINT....
>>>
>>> ... except I'm not. Because the kernel really does expect that to be an
>>> ExtINT from a legacy PIC, and kvm_apic_accept_pic_intr() only returns
>>> true if LVT0 is set up for EXTINT and unmasked.
>>>
>>> I messed around with this hack and increasingly desperate variations on
>>> the theme (since this one doesn't cause the IRQ window to be opened to
>>> userspace in the first place), but couldn't get anything working:
>>
>> Increasingly desperate variations, about sums up my process as well while
>> trying to get the upcall vector working.
>
> :)
>
> So this seems to work, and I think it's about as minimal as it can get.
>
> All it does is implement a kvm_xen_has_interrupt() which checks the
> vcpu_info->evtchn_upcall_pending flag, just like Xen does.
>
> With this, my completely userspace implementation of event channels is
> working. And of course this forms a basis for adding the minimal
> acceleration of IPI/VIRQ that we need to the kernel, too.
>
> My only slight concern is the bit in vcpu_enter_guest() where we have
> to add the check for kvm_xen_has_interrupt(), because nothing is
> setting KVM_REQ_EVENT. I suppose I could look at having something —
> even an explicit ioctl from userspace — set that for us.... BUT...
>
Isn't this what the first half of this patch was doing initially (minus the
irq routing) ? Looks really similar:

https://lore.kernel.org/kvm/[email protected]/

Albeit, I gotta say, after seeing the irq routing removed it ends up much simpler, if we
just replace the irq routing with a domain-wide upcall vector ;)

Albeit it wouldn't cover the Windows Xen open source drivers, which use the EVTCHN method
(a regular LAPIC vector) [to answer your question on what uses it]. For the EVTCHN method
you can just inject a regular vector through LAPIC delivery, and the guest acks it. Sadly
Linux doesn't use it; if it did, we would probably get faster upcalls with APICv/AVIC.

> I'm not sure that condition wasn't *already* broken in some cases for
> KVM_INTERRUPT anyway. In kvm_vcpu_ioctl_interrupt() we set
> vcpu->arch.pending_userspace_vector and we *do* request KVM_REQ_EVENT,
> sure.
>
> But what if vcpu_enter_guest() processes that the first time (and
> clears KVM_REQ_EVENT), and then exits for some *other* reason with
> interrupts still disabled? Next time through vcpu_enter_guest(), even
> though kvm_cpu_has_injectable_intr() is still true, we don't enable the
> IRQ window vmexit because KVM_REQ_EVENT got lost, so we don't even call
> inject_pending_event().
>
> So instead of just kvm_xen_has_interrupt() in my patch below, I wonder
> if we need to use kvm_cpu_has_injectable_intr() to fix the existing
> bug? Or am I missing something there and there isn't a bug after all?
>
Given that the notion of a pending event channel is Xen-specific handling, I am not sure
we can remove the kvm_xen_has_interrupt()/kvm_xen_get_interrupt() logic. Much of the
reason why I ended up adding a check on vmenter for pending event channels..

That or the autoEOI hack Ankur and you were talking about.

> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index d8716ef27728..4a63f212fdfc 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -902,6 +902,7 @@ struct msr_bitmap_range {
> /* Xen emulation context */
> struct kvm_xen {
> bool long_mode;
> + u8 upcall_vector;
> struct kvm_host_map shinfo_map;
> void *shinfo;
> };
> diff --git a/arch/x86/kvm/irq.c b/arch/x86/kvm/irq.c
> index 814698e5b152..24668b51b5c8 100644
> --- a/arch/x86/kvm/irq.c
> +++ b/arch/x86/kvm/irq.c
> @@ -14,6 +14,7 @@
> #include "irq.h"
> #include "i8254.h"
> #include "x86.h"
> +#include "xen.h"
>
> /*
> * check if there are pending timer events
> @@ -56,6 +57,9 @@ int kvm_cpu_has_extint(struct kvm_vcpu *v)
> if (!lapic_in_kernel(v))
> return v->arch.interrupt.injected;
>
> + if (kvm_xen_has_interrupt(v))
> + return 1;
> +
> if (!kvm_apic_accept_pic_intr(v))
> return 0;
>
> @@ -110,6 +114,9 @@ static int kvm_cpu_get_extint(struct kvm_vcpu *v)
> if (!lapic_in_kernel(v))
> return v->arch.interrupt.nr;
>
> + if (kvm_xen_has_interrupt(v))
> + return v->kvm->arch.xen.upcall_vector;
> +
> if (irqchip_split(v->kvm)) {
> int vector = v->arch.pending_external_vector;
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index ad9eea8f4f26..1711072b3616 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -8891,7 +8891,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
> kvm_x86_ops.msr_filter_changed(vcpu);
> }
>
> - if (kvm_check_request(KVM_REQ_EVENT, vcpu) || req_int_win) {
> + if (kvm_check_request(KVM_REQ_EVENT, vcpu) || req_int_win ||
> + kvm_xen_has_interrupt(vcpu)) {
> ++vcpu->stat.req_event;
> kvm_apic_accept_events(vcpu);
> if (vcpu->arch.mp_state == KVM_MP_STATE_INIT_RECEIVED) {
> diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
> index 4aa776c1ad57..116422d93d96 100644
> --- a/arch/x86/kvm/xen.c
> +++ b/arch/x86/kvm/xen.c
> @@ -257,6 +257,22 @@ void kvm_xen_setup_pvclock_page(struct kvm_vcpu *v)
> srcu_read_unlock(&v->kvm->srcu, idx);
> }
>
> +int kvm_xen_has_interrupt(struct kvm_vcpu *v)
> +{
> + int rc = 0;
> +
> + if (v->kvm->arch.xen.upcall_vector) {
> + int idx = srcu_read_lock(&v->kvm->srcu);
> + struct vcpu_info *vcpu_info = xen_vcpu_info(v);
> +
> + if (vcpu_info && READ_ONCE(vcpu_info->evtchn_upcall_pending))
> + rc = 1;
> +
> + srcu_read_unlock(&v->kvm->srcu, idx);
> + }
> + return rc;
> +}
> +
> static int vcpu_attr_loc(struct kvm_vcpu *vcpu, u16 type,
> struct kvm_host_map **map, void ***hva, size_t *sz)
> {
> @@ -338,6 +354,14 @@ int kvm_xen_hvm_set_attr(struct kvm *kvm, struct kvm_xen_hvm_attr *data)
> break;
> }
>
> + case KVM_XEN_ATTR_TYPE_UPCALL_VECTOR:
> + if (data->u.vector < 0x10)
> + return -EINVAL;
> +
> + kvm->arch.xen.upcall_vector = data->u.vector;
> + r = 0;
> + break;
> +
> default:
> break;
> }
> @@ -386,6 +410,11 @@ int kvm_xen_hvm_get_attr(struct kvm *kvm, struct kvm_xen_hvm_attr *data)
> break;
> }
>
> + case KVM_XEN_ATTR_TYPE_UPCALL_VECTOR:
> + data->u.vector = kvm->arch.xen.upcall_vector;
> + r = 0;
> + break;
> +
> default:
> break;
> }
> diff --git a/arch/x86/kvm/xen.h b/arch/x86/kvm/xen.h
> index ccd6002f55bc..1f599342f02c 100644
> --- a/arch/x86/kvm/xen.h
> +++ b/arch/x86/kvm/xen.h
> @@ -25,6 +25,7 @@ static inline struct kvm_vcpu *xen_vcpu_to_vcpu(struct kvm_vcpu_xen *xen_vcpu)
> void kvm_xen_setup_pvclock_page(struct kvm_vcpu *vcpu);
> void kvm_xen_setup_runstate_page(struct kvm_vcpu *vcpu);
> void kvm_xen_runstate_set_preempted(struct kvm_vcpu *vcpu);
> +int kvm_xen_has_interrupt (struct kvm_vcpu *vcpu);
> int kvm_xen_hvm_set_attr(struct kvm *kvm, struct kvm_xen_hvm_attr *data);
> int kvm_xen_hvm_get_attr(struct kvm *kvm, struct kvm_xen_hvm_attr *data);
> int kvm_xen_hypercall(struct kvm_vcpu *vcpu);
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 1047364d1adf..113279fa9e1e 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -1587,6 +1587,7 @@ struct kvm_xen_hvm_attr {
>
> union {
> __u8 long_mode;
> + __u8 vector;
> struct {
> __u64 gfn;
> } shared_info;
> @@ -1604,6 +1605,7 @@ struct kvm_xen_hvm_attr {
> #define KVM_XEN_ATTR_TYPE_VCPU_INFO 0x2
> #define KVM_XEN_ATTR_TYPE_VCPU_TIME_INFO 0x3
> #define KVM_XEN_ATTR_TYPE_VCPU_RUNSTATE 0x4
> +#define KVM_XEN_ATTR_TYPE_UPCALL_VECTOR 0x5
>
> /* Secure Encrypted Virtualization command */
> enum sev_cmd_id {
>

2020-12-09 15:45:21

by David Woodhouse

[permalink] [raw]
Subject: Re: [PATCH RFC 10/39] KVM: x86/xen: support upcall vector



On 9 December 2020 13:26:55 GMT, Joao Martins <[email protected]> wrote:
>On 12/9/20 11:39 AM, David Woodhouse wrote:
>> On Wed, 2020-12-09 at 10:51 +0000, Joao Martins wrote:
>>> Isn't this what the first half of this patch was doing initially (minus the
>>> irq routing) ? Looks really similar:
>>>
>>> https://lore.kernel.org/kvm/[email protected]/
>>
>> Absolutely! This thread is in reply to your original posting of
>> precisely that patch, and I've had your tree open in gitk to crib from
>> for most of the last week.
>>
>I forgot about this patch given all the discussion so far, and I had to
>re-look because your snippet reminded me of it. But I ended up being a
>little pedantic -- sorry about that.

Nah, pedantry is good :)

>> At most, we just need to make sure that kvm_xen_has_interrupt() returns
>> false if the per-vCPU LAPIC vector is configured. But I didn't do that
>> because I checked Xen and it doesn't do it either.
>>
>Oh! I have this strange recollection that it was, when we were looking
>at the Xen implementation.

Hm, maybe I missed it. Will stare at it harder, although looking at Xen code tends to make my brain hurt :)

>> As far as I can tell, Xen's hvm_vcpu_has_pending_irq() will still
>> return the domain-wide vector in preference to the one in the LAPIC, if
>> it actually gets invoked.
>
>Only if the callback installed is HVMIRQ_callback_vector IIUC.
>
>Otherwise the vector would be pending like any other LAPIC vector.

Ah, right.

For some reason I had it in my head that you could only set the per-vCPU lapic vector if the domain was set to HVMIRQ_callback_vector. If the domain is set to HVMIRQ_callback_none, that clearly makes more sense.

Still, my patch should do the same as Xen does in the case where a guest does set both, I think.

Faithful compatibility with odd Xen behaviour FTW :)

--
Sent from my Android device with K-9 Mail. Please excuse my brevity.

2020-12-09 20:05:07

by Joao Martins

[permalink] [raw]
Subject: Re: [PATCH RFC 10/39] KVM: x86/xen: support upcall vector

On 12/9/20 3:41 PM, David Woodhouse wrote:
> On 9 December 2020 13:26:55 GMT, Joao Martins <[email protected]> wrote:
>> On 12/9/20 11:39 AM, David Woodhouse wrote:
>>> As far as I can tell, Xen's hvm_vcpu_has_pending_irq() will still
>>> return the domain-wide vector in preference to the one in the LAPIC,
>> if
>>> it actually gets invoked.
>>
>> Only if the callback installed is HVMIRQ_callback_vector IIUC.
>>
>> Otherwise the vector would be pending like any other LAPIC vector.
>
> Ah, right.
>
> For some reason I had it in my head that you could only set the per-vCPU lapic vector if the domain was set to HVMIRQ_callback_vector. If the domain is set to HVMIRQ_callback_none, that clearly makes more sense.
>
> Still, my patch should do the same as Xen does in the case where a guest does set both, I think.
>
> Faithful compatibility with odd Xen behaviour FTW :)
>
Ah, yes. In that case, HVMIRQ_callback_vector does take precedence.

But it would be very weird for a guest to set up two callback vectors :)

2020-12-14 02:52:35

by David Woodhouse

[permalink] [raw]
Subject: Re: [PATCH RFC 03/39] KVM: x86/xen: register shared_info page

On Tue, 2020-12-01 at 16:40 -0800, Ankur Arora wrote:
> > How come we get to pin the page and directly dereference it every time,
> > while kvm_setup_pvclock_page() has to use kvm_write_guest_cached()
> > instead?
>
> So looking at my WIP trees from the time, this is something that
> we went back and forth on as well with using just a pinned page or a
> persistent kvm_vcpu_map().

I have a new plan.

Screw pinning it and mapping it into kernel space just because we need
to do atomic operations on it.

Futexes can do atomic operations on userspace memory; so can we. We just
add atomic_cmpxchg_user() and use that.

We can add kvm_cmpxchg_guest_offset_cached() and use that from places
like record_steal_time().

That works nicely for all of the Xen support needs too, I think — since
test-and-set and test-and-clear bit operations all want to be built on
cmpxchg.
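
(A rough sketch of what I mean; atomic_cmpxchg_user() is the helper
being *proposed* here, not an existing API, and I'm assuming it returns
0 on success, a positive value if *@p no longer matched @old, and
-EFAULT on fault:)

	/* Guest-memory test-and-set built on a cmpxchg-on-user primitive;
	 * @p is the userspace-mapped word containing the bit. */
	static int guest_test_and_set_bit(unsigned long __user *p, int nr)
	{
		unsigned long old, new;
		int ret;

		do {
			if (__get_user(old, p))
				return -EFAULT;
			if (old & (1UL << nr))
				return 1;	/* bit was already set */
			new = old | (1UL << nr);

			ret = atomic_cmpxchg_user(p, &old, new); /* proposed */
			if (ret < 0)
				return ret;
		} while (ret);	/* raced with another writer; retry */

		return 0;
	}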

The one thing I can think of offhand that it doesn't address easily is
the testing of vcpu_info->evtchn_upcall_pending in
kvm_xen_cpu_has_interrupt(), which really does want to be fast so I'm
not sure I want to be using copy_from_user() for that.


But I think I can open-code that to do the cache checking that
kvm_read_guest_offset_cached() does, and use __get_user() directly in
the fast path, falling back to actually calling
kvm_read_guest_offset_cached() when it needs to. That will suffice
until/unless we grow more use cases which would make it worth providing
that as a kvm_get_guest...() set of functions for generic use.
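
(Sketching that out, with an illustrative cache field name and only the
shape of the logic, nothing final:)

	int kvm_xen_has_interrupt(struct kvm_vcpu *v)
	{
		struct gfn_to_hva_cache *ghc = &v->arch.xen.vcpu_info_cache;
		struct kvm_memslots *slots = kvm_memslots(v->kvm);
		unsigned int offset = offsetof(struct vcpu_info,
					       evtchn_upcall_pending);
		u8 pending;

		if (!v->kvm->arch.xen.upcall_vector)
			return 0;

		/* Fast path: cached translation still valid, read the one
		 * byte directly. */
		if (likely(slots->generation == ghc->generation &&
			   !kvm_is_error_hva(ghc->hva)) &&
		    !__get_user(pending, (u8 __user *)ghc->hva + offset))
			return pending;

		/* Slow path: let the _cached() helper revalidate. */
		if (kvm_read_guest_offset_cached(v->kvm, ghc, &pending,
						 offset, sizeof(pending)))
			return 0;

		return pending;
	}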


Attachments:
smime.p7s (5.05 kB)

2021-01-01 14:35:33

by David Woodhouse

[permalink] [raw]
Subject: Re: [PATCH RFC 10/39] KVM: x86/xen: support upcall vector

On Wed, 2020-12-02 at 18:34 +0000, Joao Martins wrote:
> > But if the kernel is going to short-circuit the IPIs and VIRQs, then
> > it's already going to have to handle the evtchn_pending/evtchn_mask
> > bitmaps, and actually injecting interrupts.
> >
>
> Right. I was trying to point that out in the discussion we had
> in the next patch. But truth be told, it was more about touting the idea
> of the kernel knowing if a given event channel is registered for userspace
> handling, rather than fully handling the event channel.
>
> I suppose we are able to provide both options to the VMM anyway,
> i.e. 1) letting them handle it entirely in userspace by intercepting
> EVTCHNOP_send, or 2) through the irq route if we want the kernel to offload it.
>
> > Given that it has to have that functionality anyway, it seems saner to
> > let the kernel have full control over it and to just expose
> > 'evtchn_send' to userspace.
> >
> > The alternative is to have userspace trying to play along with the
> > atomic handling of those bitmasks too
>
> That part is not particularly hard -- I have it done already.

Right, for 2-level it works out fairly well. I like your model of
installing { vcpu_id, port } into the KVM IRQ routing table and that's
enough for the kernel to raise the event channels that it *needs* to
know about, while userspace can do the others for itself. It's just
atomic test-and-set bitop stuff with no real need for coordination.
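
(For reference, 2-level delivery is just this, following the shape of
Xen's evtchn_2l_set_pending() from memory; the mapped shared_info and
vcpu_info pointers are assumed to already be set up:)

	static void evtchn_2l_deliver(struct shared_info *s,
				      struct vcpu_info *vi, unsigned int port)
	{
		if (test_and_set_bit(port, (unsigned long *)s->evtchn_pending))
			return;		/* already pending */

		if (!test_bit(port, (unsigned long *)s->evtchn_mask) &&
		    !test_and_set_bit(port / BITS_PER_LONG,
				      (unsigned long *)&vi->evtchn_pending_sel))
			/* The flag that kvm_xen_has_interrupt() checks. */
			WRITE_ONCE(vi->evtchn_upcall_pending, 1);
	}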

For FIFO event channels it gets more fun, because the updates aren't
truly atomic — they require a spinlock around the three operations that
the host has to perform when linking an event into a queue:

• Set the new port's LINKED bit
• Set the previous head's LINK to point to the new port
• Store the new port# as the head.
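
(In code, that linking is something like the sketch below; the bit
positions follow the Xen FIFO ABI, the queue struct and locking are
illustrative, and this is exactly the part that can't be one atomic op:)

	#define EVTCHN_FIFO_LINKED	29
	#define EVTCHN_FIFO_LINK_BITS	17
	#define EVTCHN_FIFO_LINK_MASK	((1u << EVTCHN_FIFO_LINK_BITS) - 1)

	struct fifo_queue {
		spinlock_t lock;
		u32 head;	/* port# at the head, 0 if queue is empty */
	};

	static void fifo_link_event(struct fifo_queue *q, u32 *event_words,
				    u32 port)
	{
		spin_lock(&q->lock);
		/* 1. Set the new port's LINKED bit (x86 bitop on a u32
		 * event word; fine for a sketch). */
		set_bit(EVTCHN_FIFO_LINKED,
			(unsigned long *)&event_words[port]);
		/* 2. Set the previous head's LINK to point to the new port. */
		if (q->head)
			event_words[q->head] =
				(event_words[q->head] & ~EVTCHN_FIFO_LINK_MASK) | port;
		/* 3. Store the new port# as the head. */
		q->head = port;
		spin_unlock(&q->lock);
	}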

One option might be to declare that for FIFO, all updates for a given
queue *must* be handled either by the kernel, or by userspace, and
there's no sharing of control.

Or maybe there's a way to keep the kernel side simple by avoiding the
tail handling completely. Surely we only really care about kernel
handling of the *fast* path, where a single event channel is triggered
and handled immediately? In the high-latency case where we're gathering
multiple events in a queue before the guest ever gets to process them,
we might as well let that be handled by userspace, right?

So in that case, perhaps the kernel part could forget all the horrid
nonsense with tracking the tail of the queue. It would handle the event
in-kernel *only* in the case where the event is the *first* in the
queue, and the head was previously zero?

But even that isn't a simple atomic operation though; we still have to
mark the event LINKED, then update the head pointer to reference it.
And what if we set the 'LINKED' bit but then find that userspace has
already linked another port so ->head is no longer zero?

Can we just clear the LINKED bit and then punt the whole event for
userspace to (re)handle? Or raise a special event to userspace so it
knows it needs to go ahead and link the port even though its LINKED bit
has already been set?

None of the available options really fill me with joy; I'm somewhat
inclined just to declare that the kernel won't support acceleration of
FIFO event channels at all.

None of which matters a *huge* amount right now if I was only going to
leave that as a future optimisation anyway.

What it does actually mean in the short term is that as I update your
KVM_IRQ_ROUTING_XEN_EVTCHN support, I probably *won't* bother to add a
'priority' field to struct kvm_irq_routing_xen_evtchn to make it
extensible to FIFO event channels. We can always add that later.

Does that seem reasonable?


Attachments:
smime.p7s (5.05 kB)

2021-01-05 13:47:39

by David Woodhouse

[permalink] [raw]
Subject: Re: [PATCH RFC 10/39] KVM: x86/xen: support upcall vector

On Tue, 2021-01-05 at 12:11 +0000, Joao Martins wrote:
> On 1/1/21 2:33 PM, David Woodhouse wrote:
> > What it does actually mean in the short term is that as I update your
> > KVM_IRQ_ROUTING_XEN_EVTCHN support, I probably *won't* bother to add a
> > 'priority' field to struct kvm_irq_routing_xen_evtchn to make it
> > extensible to FIFO event channels. We can always add that later.
> >
> > Does that seem reasonable?
> >
>
> Yes, makes sense IMHO. Guests need to anyway fall back to 2L if the
> evtchnop_init_control hypercall fails, and the way we are handling events
> doesn't warrant FIFO event channel support being mandatory.

Right.

I think it's perfectly OK to declare that we aren't going to support
FIFO event channel acceleration in Linux/KVM, and anyone who really
wants to support them would have to do it entirely in userspace.

The only reason a VMM would really need to support FIFO event channels
is if we're insane enough to want to do live migration from actual
Xen, of guests which were previously using them.

While I make no comment about my mental state, I can happily observe
that I don't have guests which currently use FIFO support.


Attachments:
smime.p7s (5.05 kB)

2021-01-05 20:22:09

by Joao Martins

[permalink] [raw]
Subject: Re: [PATCH RFC 10/39] KVM: x86/xen: support upcall vector

On 1/1/21 2:33 PM, David Woodhouse wrote:
> On Wed, 2020-12-02 at 18:34 +0000, Joao Martins wrote:
>>> But if the kernel is going to short-circuit the IPIs and VIRQs, then
>>> it's already going to have to handle the evtchn_pending/evtchn_mask
>>> bitmaps, and actually injecting interrupts.
>>>
>>
>> Right. I was trying to point that out in the discussion we had
>> in the next patch. But truth be told, it was more about touting the idea
>> of the kernel knowing if a given event channel is registered for userspace
>> handling, rather than fully handling the event channel.
>>
>> I suppose we are able to provide both options to the VMM anyway,
>> i.e. 1) letting them handle it entirely in userspace by intercepting
>> EVTCHNOP_send, or 2) through the irq route if we want the kernel to offload it.
>>
>>> Given that it has to have that functionality anyway, it seems saner to
>>> let the kernel have full control over it and to just expose
>>> 'evtchn_send' to userspace.
>>>
>>> The alternative is to have userspace trying to play along with the
>>> atomic handling of those bitmasks too
>>
>> That part is not particularly hard -- I have it done already.
>
> Right, for 2-level it works out fairly well. I like your model of
> installing { vcpu_id, port } into the KVM IRQ routing table and that's
> enough for the kernel to raise the event channels that it *needs* to
> know about, while userspace can do the others for itself. It's just
> atomic test-and-set bitop stuff with no real need for coordination.
>
> For FIFO event channels it gets more fun, because the updates aren't
> truly atomic — they require a spinlock around the three operations that
> the host has to perform when linking an event into a queue:
>
> • Set the new port's LINKED bit
> • Set the previous head's LINK to point to the new port
> • Store the new port# as the head.
>
> One option might be to declare that for FIFO, all updates for a given
> queue *must* be handled either by the kernel, or by userspace, and
> there's no sharing of control.
>
> Or maybe there's a way to keep the kernel side simple by avoiding the
> tail handling completely. Surely we only really care about kernel
> handling of the *fast* path, where a single event channel is triggered
> and handled immediately? In the high-latency case where we're gathering
> multiple events in a queue before the guest ever gets to process them,
> we might as well let that be handled by userspace, right?
>
> So in that case, perhaps the kernel part could forget all the horrid
> nonsense with tracking the tail of the queue. It would handle the event
> in-kernel *only* in the case where the event is the *first* in the
> queue, and the head was previously zero?
>
> But even that isn't a simple atomic operation though; we still have to
> mark the event LINKED, then update the head pointer to reference it.
> And what if we set the 'LINKED' bit but then find that userspace has
> already linked another port so ->head is no longer zero?
>
> Can we just clear the LINKED bit and then punt the whole event for
> userspace to (re)handle? Or raise a special event to userspace so it
> knows it needs to go ahead and link the port even though its LINKED bit
> has already been set?
>
> None of the available options really fill me with joy; I'm somewhat
> inclined just to declare that the kernel won't support acceleration of
> FIFO event channels at all.
>
> None of which matters a *huge* amount right now if I was only going to
> leave that as a future optimisation anyway.
>
ACK.

> What it does actually mean in the short term is that as I update your
> KVM_IRQ_ROUTING_XEN_EVTCHN support, I probably *won't* bother to add a
> 'priority' field to struct kvm_irq_routing_xen_evtchn to make it
> extensible to FIFO event channels. We can always add that later.
>
> Does that seem reasonable?
>
Yes, makes sense IMHO. Guests need to anyway fall back to 2L if the
evtchnop_init_control hypercall fails, and the way we are handling events
doesn't warrant FIFO event channel support being mandatory.

Despite the many FIFO event features, IIRC the main driving motivation
was to go beyond the 1K/4K port limit for 32b/64b guests, to 128K max
ports per guest instead. But that was mostly a limit for Domain-0, as it
hosts all the backend handling in the traditional (i.e. without driver
domains) deployment, therefore limiting how many vdevs one could host in
the system. And in cases of dense VM consolidation, where you host 1K
guests with e.g. 3 block devices and 1 network interface each, one would
quickly run out of interdomain event channel ports in Dom0. But that is
not the case here.

We are anyways ABI-limited to HVM_MAX_VCPUS (128), and if we assume
6 event channels for the legacy guest per vCPU (4 IPI evtchns, 1 timer,
1 debug) then it means we still have 4096 - 128 * 6 = 3328 ports left
for interdomain event channels for a 128 vCPU HVM guest ... when using
the 2L event channels ABI.

Joao

2021-11-23 13:15:59

by David Woodhouse

[permalink] [raw]
Subject: Re: [PATCH RFC 11/39] KVM: x86/xen: evtchn signaling via eventfd

So... what we have in the kernel already is entirely sufficient to run
real Xen guests so they don't know they're running under KVM — with PV
netback and blkback, Xen timers, etc.

We can even run the PV Xen shim itself, to host PV guests.

There are some pieces we really want to move into the kernel though,
IPIs being top of the list — we really don't want them bouncing out to
the userspace VMM. In-kernel timers would be good, too.

I've finally got the basic 2l event channel delivery working in the
kernel, with a minor detour into fixing nesting UAF bugs. The first
simple use case for that is routing MSIs of assigned PCI devices to Xen
PIRQs. So I'm ready to stare at the patch in $SUBJECT again...

On Mon, 2020-11-30 at 15:08 +0000, Joao Martins wrote:
> On 11/30/20 12:55 PM, David Woodhouse wrote:
> > On Mon, 2020-11-30 at 12:17 +0000, Joao Martins wrote:
> > > But still, eventfd is probably unnecessary complexity when another @type
> > > (XEN_EVTCHN_TYPE_USER) would serve, and then just exiting to userspace
> > > and let it route its evtchn port handling to its own I/O handling thread.
> >
> > Hmm... so the benefit of the eventfd is that we can wake the I/O thread
> > directly instead of bouncing out to userspace on the vCPU thread only
> > for it to send a signal and return to the guest? Did you ever use that,
> > and is it worth the additional in-kernel code?
> >
> This was my own preemptive optimization to the interface -- it's not worth
> the added code for vIRQ and IPI at this point which is what *for sure* the
> kernel will handle.
>
> > Is there any reason we'd want that for IPI or VIRQ event channels, or
> > can it be only for INTERDOM/UNBOUND event channels which come later?
> >
> /me nods.
>
> No reason to keep that for IPI/vIRQ.
>
> > I'm tempted to leave it out of the first patch, and *maybe* add it back
> > in a later patch, putting it in the union alongside .virq.type.
> >
> >
> > struct kvm_xen_eventfd {
> >
> > #define XEN_EVTCHN_TYPE_VIRQ 0
> > #define XEN_EVTCHN_TYPE_IPI 1
> > __u32 type;
> > __u32 port;
> > __u32 vcpu;
> > - __s32 fd;
> >
> > #define KVM_XEN_EVENTFD_DEASSIGN (1 << 0)
> > #define KVM_XEN_EVENTFD_UPDATE (1 << 1)
> > __u32 flags;
> > union {
> > struct {
> > __u8 type;
> > } virq;
> > + struct {
> > + __s32 eventfd;
> > + } interdom; /* including unbound */
> > __u32 padding[2];
> > };
> > } evtchn;
> >
> > Does that make sense to you?
> >
> Yeap! :)

It turns out there are actually two cases for interdom event channels.
As well as signalling "other domains" (i.e. userspace), they can also be
set up as loopback, so sending an event on one port actually raises an
event on another of the same guest's ports, a bit like an IPI.

So it isn't quite as simple as "if we handle interdom in the kernel it
has an eventfd".

It also means we need to provide both *source* and *target* ports.
Which brings me to the KVM_XEN_EVENTFD_DEASSIGN/KVM_XEN_EVENTFD_UPDATE
flags that you used. I'm not sure I like that API; given that we now
need separate source and target ports can we handle 'deassign' just by
setting the target to zero? (and eventfd to -1) ?

I think your original patch leaked refcounts on the eventfd in
kvm_xen_eventfd_update(), because it would call eventfd_ctx_fdget() and
then neither use nor release the result?

So I'm thinking of making it just a 'set'. And in fact I'm thinking of
modelling it on the irq routing table by having a single call to
set/update all of them at once?
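
(Something of this shape, purely illustrative; none of these names are
real uAPI, it's just to make the 'single set call' idea concrete:)

	/* One entry per outbound port. An all-at-once setter would take an
	 * array of these, the way KVM_SET_GSI_ROUTING takes an array of
	 * routing entries. */
	struct kvm_xen_evtchn_entry {
		__u32 send_port;	/* guest's outbound port# */
		__u32 type;		/* IPI, interdomain, unbound, ... */
		__u32 vcpu;		/* target vCPU for local delivery */
		__u32 target_port;	/* 0 == deassign this send_port */
		__s32 eventfd;		/* -1 == deliver locally instead */
	};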

Any objections?


Attachments:
smime.p7s (5.05 kB)

2022-02-09 10:19:02

by David Woodhouse

[permalink] [raw]
Subject: Re: [PATCH RFC 15/39] KVM: x86/xen: handle PV spinlocks slowpath

On Wed, 2019-02-20 at 20:15 +0000, Joao Martins wrote:
> From: Boris Ostrovsky <[email protected]>
>
> Add support for SCHEDOP_poll hypercall.
>
> This implementation is optimized for polling for a single channel, which
> is what Linux does. Polling for multiple channels is not especially
> efficient (and has not been tested).
>
> PV spinlocks slow path uses this hypercall, and explicitly crashes if it's
> not supported.
>
> Signed-off-by: Boris Ostrovsky <[email protected]>
> ---

...

> +static void kvm_xen_check_poller(struct kvm_vcpu *vcpu, int port)
> +{
> + struct kvm_vcpu_xen *vcpu_xen = vcpu_to_xen_vcpu(vcpu);
> +
> + if ((vcpu_xen->poll_evtchn == port ||
> + vcpu_xen->poll_evtchn == -1) &&
> + test_and_clear_bit(vcpu->vcpu_id, vcpu->kvm->arch.xen.poll_mask))
> + wake_up(&vcpu_xen->sched_waitq);
> +}

...

> + if (sched_poll.nr_ports == 1)
> + vcpu_xen->poll_evtchn = port;
> + else
> + vcpu_xen->poll_evtchn = -1;
> +
> + if (!wait_pending_event(vcpu, sched_poll.nr_ports, ports))
> + wait_event_interruptible_timeout(
> + vcpu_xen->sched_waitq,
> + wait_pending_event(vcpu, sched_poll.nr_ports, ports),
> + sched_poll.timeout ?: KTIME_MAX);

Hm, this doesn't wake on other interrupts, does it? I think it should.
Shouldn't it basically be like HLT, with an additional wakeup when the
listed ports are triggered even when they're masked?

At https://git.infradead.org/users/dwmw2/linux.git/commitdiff/ddfbdf1af
I've tried to make it use kvm_vcpu_halt(), and kvm_xen_check_poller()
sets KVM_REQ_UNBLOCK when an event is delivered to a monitored port.

I haven't quite got it to work yet, but does it seem like a sane
approach?

+	if (!wait_pending_event(vcpu, sched_poll.nr_ports, ports)) {
+		vcpu->arch.mp_state = KVM_MP_STATE_HALTED;
+		kvm_vcpu_halt(vcpu);
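
(And the waker side would then become something like this; a sketch of
the direction, not the final code:)

	static void kvm_xen_check_poller(struct kvm_vcpu *vcpu, int port)
	{
		if ((vcpu->arch.xen.poll_evtchn == port ||
		     vcpu->arch.xen.poll_evtchn == -1) &&
		    test_and_clear_bit(vcpu->vcpu_id,
				       vcpu->kvm->arch.xen.poll_mask)) {
			/* Break out of kvm_vcpu_halt(): KVM_REQ_UNBLOCK makes
			 * the halted vCPU re-evaluate its wakeup conditions. */
			kvm_make_request(KVM_REQ_UNBLOCK, vcpu);
			kvm_vcpu_kick(vcpu);
		}
	}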






Attachments:
smime.p7s (5.83 kB)

2022-02-10 12:25:26

by Joao Martins

[permalink] [raw]
Subject: Re: [PATCH RFC 12/39] KVM: x86/xen: store virq when assigning evtchn

On 2/8/22 16:17, Woodhouse, David wrote:
> On Wed, 2019-02-20 at 20:15 +0000, Joao Martins wrote:
>> Enable virq offload to the hypervisor. The primary user for this is
>> the timer virq.
>>
>> Signed-off-by: Joao Martins <[email protected]>
>
> ...
>
>> @@ -636,8 +654,11 @@ static int kvm_xen_eventfd_assign(struct kvm *kvm, struct idr *port_to_evt,
>> GFP_KERNEL);
>> mutex_unlock(port_lock);
>>
>> - if (ret >= 0)
>> + if (ret >= 0) {
>> + if (evtchnfd->type == XEN_EVTCHN_TYPE_VIRQ)
>> + kvm_xen_set_virq(kvm, evtchnfd);
>> return 0;
>> + }
>>
>> if (ret == -ENOSPC)
>> ret = -EEXIST;
>>
>
> So, I've spent a while vacillating about how best we should do this.
>
> Since event channels are bidirectional, we essentially have *two*
> number spaces for them.
>
> We have the inbound events, in the KVM IRQ routing table (which 5.17
> already supports for delivering PIRQs, based on my mangling of your
> earlier patches).
>
/me nods

> And then we have the *outbound* events, which the guest can invoke with
> the EVTCHNOP_send hypercall. Those are either:
> • IPI, raising the same port# on the guest
> • Interdomain looped back to a different port# on the guest
> • Interdomain triggering an eventfd.
>
/me nods

I am forgetting why one would do this on Xen:

* Interdomain looped back to a different port# on the guest

> In the last case, that eventfd can be set up with IRQFD for direct
> event channel delivery to a different KVM/Xen guest.
>
> I've used your implementation, with an idr for the outbound port# space
> intercepting EVTCHNOP_send for known ports and only letting userspace
> see the hypercall if it's for a port# the kernel doesn't know. Looks a
> bit like
> https://git.infradead.org/users/dwmw2/linux.git/commitdiff/b4fbc49218a
>
>
> But I *don't* want to do the VIRQ part shown above, "spotting" the VIRQ
> in that outbound port# space and squirreling the information away into
> the kvm_vcpu for when we need to deliver a timer event.
>
> The VIRQ isn't part of the *outbound* port# space; it isn't a port to
> which a Xen guest can use EVTCHNOP_send to send an event.

But it is still an event channel whose port is unique regardless of port
type/space, hence (...)

> If anything,
> it would be part of the *inbound* port# space, in the KVM IRQ routing
> table. So perhaps we could have a similar snippet in
> kvm_xen_setup_evtchn() which spots a VIRQ and says "aha, now I know
> where to deliver timer events for this vCPU".
>
(...) The thinking at the time was mainly simplicity, so our way of saying
'offload the evtchn to KVM' was through the machinery that offloads the outbound
part (using your terminology). I don't think that even using XEN_EVENTFD as
proposed here one could send a VIRQ via EVTCHNOP_send (I could be wrong as
it has been a long time).

Regardless, I think you have a good point to split the semantics and (...)

> But... the IRQ routing table isn't really set up for that, and doesn't
> have explicit *deletion*. The kvm_xen_setup_evtchn() function might get
> called to translate into an updated table which is subsequently
> *abandoned*, and it would never know. I suppose we could stash the GSI#
> and then when we want to deliver it we look up that GSI# in the current
> table and see if it's *stale* but that's getting nasty.
>
> I suppose that's not insurmountable, but the other problem with
> inferring it from *either* the inbound or outbound port# tables is that
> the vCPU might not even *exist* at the time the table is set up (after
> live update or live migration, as the vCPU threads all go off and do
> their thing and *eventually* create their vCPUs, while the machine
> itself is being restored on the main VMM thread.)
>
> So I think I'm going to make the timer VIRQ (port#, priority) into an
> explicit KVM_XEN_VCPU_ATTR_TYPE.

(...) thus this makes sense. Do you particularly care about
VIRQ_DEBUG?

> Along with the *actual* timer expiry,
> which we need to extract/restore for LU/LM too, don't we?

/me nods

I haven't thought that one through well for Live Update / Live Migration,
but I wonder if it wouldn't be better to instead have a general 'xen state'
attr type, should you need more than just the pending timer expiry. Albeit
considering that the VMM has everything it needs (?), perhaps the Xen PV
timer looks to be the oddball missing, and we don't need to go to that extent.

2022-02-10 15:06:01

by Joao Martins

[permalink] [raw]
Subject: Re: [PATCH RFC 15/39] KVM: x86/xen: handle PV spinlocks slowpath

On 2/8/22 12:36, David Woodhouse wrote:
> On Wed, 2019-02-20 at 20:15 +0000, Joao Martins wrote:
>> From: Boris Ostrovsky <[email protected]>
>>
>> Add support for SCHEDOP_poll hypercall.
>>
>> This implementation is optimized for polling for a single channel, which
>> is what Linux does. Polling for multiple channels is not especially
>> efficient (and has not been tested).
>>
>> PV spinlocks slow path uses this hypercall, and explicitly crashes if it's
>> not supported.
>>
>> Signed-off-by: Boris Ostrovsky <[email protected]>
>> ---
>
> ...
>
>> +static void kvm_xen_check_poller(struct kvm_vcpu *vcpu, int port)
>> +{
>> + struct kvm_vcpu_xen *vcpu_xen = vcpu_to_xen_vcpu(vcpu);
>> +
>> + if ((vcpu_xen->poll_evtchn == port ||
>> + vcpu_xen->poll_evtchn == -1) &&
>> + test_and_clear_bit(vcpu->vcpu_id, vcpu->kvm->arch.xen.poll_mask))
>> + wake_up(&vcpu_xen->sched_waitq);
>> +}
>
> ...
>
>> + if (sched_poll.nr_ports == 1)
>> + vcpu_xen->poll_evtchn = port;
>> + else
>> + vcpu_xen->poll_evtchn = -1;
>> +
>> + if (!wait_pending_event(vcpu, sched_poll.nr_ports, ports))
>> + wait_event_interruptible_timeout(
>> + vcpu_xen->sched_waitq,
>> + wait_pending_event(vcpu, sched_poll.nr_ports, ports),
>> + sched_poll.timeout ?: KTIME_MAX);
>
> Hm, this doesn't wake on other interrupts, does it?

Hmm, I don't think so? This was specifically polling on event channels,
not sleeping or blocking.

> I think it should.
> Shouldn't it basically be like HLT, with an additional wakeup when the
> listed ports are triggered even when they're masked?
>

I am actually not sure.

Quickly glancing at the xen source, this hypercall doesn't appear to really
block the vcpu, but rather just looks at whether the evtchn ports are pending,
and if a timeout is specified it sets up a timer. And ofc, it wakes any
evtchn pollers. It should be, IIRC, the functional equivalent of
KVM_HC_VAPIC_POLL_IRQ but for event channels.


> At https://git.infradead.org/users/dwmw2/linux.git/commitdiff/ddfbdf1af
> I've tried to make it use kvm_vcpu_halt(), and kvm_xen_check_poller()
> sets KVM_REQ_UNBLOCK when an event is delivered to a monitored port.
>
> I haven't quite got it to work yet, but does it seem like a sane
> approach?
>

Joao

2022-02-10 16:20:38

by David Woodhouse

[permalink] [raw]
Subject: Re: [EXTERNAL] [PATCH RFC 12/39] KVM: x86/xen: store virq when assigning evtchn

On Thu, 2022-02-10 at 12:17 +0000, Joao Martins wrote:
> On 2/8/22 16:17, Woodhouse, David wrote:
> > And then we have the *outbound* events, which the guest can invoke with
> > the EVTCHNOP_send hypercall. Those are either:
> > • IPI, raising the same port# on the guest
> > • Interdomain looped back to a different port# on the guest
> > • Interdomain triggering an eventfd.
> >
>
> /me nods
>
> I am forgetting why one would do this on Xen:
>
> * Interdomain looped back to a different port# on the guest

It's one of the few things we had to fix up when we started running PV
guests in the 'shim' under KVM. I don't know that it actually sends
loopback events via the true Xen (or KVM) host, but it does at least
register them so that the port# is 'reserved' and the host won't
allocate that port for anything else. It does this at least for the
console port.

For the inbound vs. outbound thing.... I did ponder a really simple API
design in which outbound ports are *only* ever associated with an
eventfd, and for IPIs the VMM would be expected to bind those as IRQFD
to an inbound event on the same port#.

You pointed out that it was quite inefficient, but... we already have
complex hacks to bypass the eventfd for posted interrupts when the
source and destination "match", and perhaps we could do something
similar to allow EVTCHNOP_send to deliver directly to a local port#
without having to go through all the eventfd code?

But the implementation of that would end up being awful, *and* the
userspace API isn't even that nice despite being "simple", because it
would force userspace to allocate a whole bunch of eventfds and use
space in the IRQ routing table for them. So it didn't seem worth it.
Let's just let userspace tell us explicitly the vcpu/port/prio instead
of having to jump through hoops to magically work it out from matching
eventfds.
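
(To make that concrete: a sketch of the explicit inbound routing entry this
implies, i.e. what an evtchn entry in the KVM IRQ routing table could carry;
the field names here are illustrative, not a committed uAPI:)

	/* Sketch: explicit inbound evtchn routing info from userspace. */
	struct kvm_irq_routing_xen_evtchn {
		__u32 port;		/* guest evtchn port# to raise */
		__u32 vcpu;		/* target vCPU for delivery */
		__u32 priority;		/* 2-level vs. FIFO evtchn priority */
	};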

> > In the last case, that eventfd can be set up with IRQFD for direct
> > event channel delivery to a different KVM/Xen guest.
> >
> > I've used your implementation, with an idr for the outbound port# space
> > intercepting EVTCHNOP_send for known ports and only letting userspace
> > see the hypercall if it's for a port# the kernel doesn't know. Looks a
> > bit like
> > https://git.infradead.org/users/dwmw2/linux.git/commitdiff/b4fbc49218a
> >
> >
> >
> > But I *don't* want to do the VIRQ part shown above, "spotting" the VIRQ
> > in that outbound port# space and squirreling the information away into
> > the kvm_vcpu for when we need to deliver a timer event.
> >
> > The VIRQ isn't part of the *outbound* port# space; it isn't a port to
> > which a Xen guest can use EVTCHNOP_send to send an event.
>
> But it is still an event channel whose port is unique regardless of port
> type/space, hence (...)
>
> > If anything,
> > it would be part of the *inbound* port# space, in the KVM IRQ routing
> > table. So perhaps we could have a similar snippet in
> > kvm_xen_setup_evtchn() which spots a VIRQ and says "aha, now I know
> > where to deliver timer events for this vCPU".
> >
>
> (...) The thinking at the time was mainly simplicity, so our way of saying
> 'offload the evtchn to KVM' was through the machinery that offloads the outbound
> part (using your terminology). I don't think, even using XEN_EVENTFD as proposed
> here, that one could send a VIRQ via EVTCHNOP_send (I could be wrong, as
> it has been a long time).

I confess I didn't test it but it *looked* like you could, while true
Xen wouldn't permit that.

> Regardless, I think you have a good point to split the semantics and (...)

> >
> > So I think I'm going to make the timer VIRQ (port#, priority) into an
> > explicit KVM_XEN_VCPU_ATTR_TYPE.
>
> (...) thus this makes sense. Do you particularly care about
> VIRQ_DEBUG?


Not really. Especially not as something to accelerate in KVM.

Our environment doesn't have any way to deliver that to guests,
although we *do* have an API call to deliver "diagnostic interrupt"
which maps to an NMI, and we *have* occasionally hacked the VMM to
deliver VIRQ_DEBUG to Xen guests instead of that NMI. Mostly back when
I was busy being confused about ->vcpu_id vs. ->vcpu_idx vs. the
Xen/ACPI CPU# and where the hell my interrupts were going.

> > Along with the *actual* timer expiry,
> > which we need to extract/restore for LU/LM too, don't we?
> > /me nods
>
> I haven't thought that one through well for Live Update / Live Migration, but
> I wonder whether it wouldn't be better to have a general 'xen state'
> attr type instead, should you need more than just the pending timer expiry.
> Albeit, considering that the VMM has everything it needs (?), perhaps the Xen PV
> timer is the oddball missing piece, and we don't need to go to that extent.

Yeah, the VMM knows most of this stuff already, as it *told* the kernel
in the first place. Userspace is still responsible for all the setup
and admin, and the kernel just handles the bare minimum of the fast
path.

So on live update/migrate we only really need to read out the runstate
data from the kernel... and now the current timer expiry.
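
(A sketch of what that read-out could look like from the VMM side, assuming
a GET_ATTR-style vCPU interface; the attr type names here are assumptions,
not confirmed uAPI. Needs <sys/ioctl.h> and <linux/kvm.h>:)

	/* Illustrative only: pull the two pieces of in-kernel state
	 * on live update / live migration. */
	static void save_xen_vcpu_state(int vcpu_fd)
	{
		struct kvm_xen_vcpu_attr attr = { 0 };

		attr.type = KVM_XEN_VCPU_ATTR_TYPE_RUNSTATE_DATA;  /* runstate */
		ioctl(vcpu_fd, KVM_XEN_VCPU_GET_ATTR, &attr);

		attr.type = KVM_XEN_VCPU_ATTR_TYPE_TIMER;  /* pending timer expiry */
		ioctl(vcpu_fd, KVM_XEN_VCPU_GET_ATTR, &attr);
	}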

On the *resume* side it's still lots of syscalls, and perhaps in the
end we might decide we want to do a KVM_XEN_HVM_SET_ATTR_MULTI which
takes an array of them? But I think that's a premature optimisation for
now.
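
(Not a real ioctl, just to make the musing concrete: a hypothetical batched
form could look something like this.)

	/* Hypothetical: apply an array of attrs in one syscall on resume.
	 * Nothing like this exists; shown only to illustrate the idea. */
	struct kvm_xen_hvm_attr_multi {
		__u32 nattrs;				/* entries in attrs[] */
		__u32 pad;
		struct kvm_xen_hvm_attr attrs[];	/* applied in order */
	};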



2022-02-11 01:39:35

by David Woodhouse

[permalink] [raw]
Subject: Re: [PATCH RFC 15/39] KVM: x86/xen: handle PV spinlocks slowpath

On Thu, 2022-02-10 at 12:17 +0000, Joao Martins wrote:
> On 2/8/22 12:36, David Woodhouse wrote:
> > > +	if (!wait_pending_event(vcpu, sched_poll.nr_ports, ports))
> > > +		wait_event_interruptible_timeout(
> > > +			vcpu_xen->sched_waitq,
> > > +			wait_pending_event(vcpu, sched_poll.nr_ports, ports),
> > > +			sched_poll.timeout ?: KTIME_MAX);
> >
> > Hm, this doesn't wake on other interrupts, does it?
>
> Hmm, I don't think so? This was specifically polling on event channels,
> not sleeping or blocking.
>
> > I think it should.
> > Shouldn't it basically be like HLT, with an additional wakeup when the
> > listed ports are triggered even when they're masked?
> >
>
> I am actually not sure.
>
> Quickly glancing at the Xen source, this hypercall doesn't appear to really
> block the vCPU, but rather just looks at whether the evtchn ports are pending,
> and if a timeout is specified it sets up a timer. And ofc, it wakes any evtchn
> pollers. But it doesn't appear to actually block the vCPU. It should be,
> IIRC, the functional equivalent of KVM_HC_VAPIC_POLL_IRQ, but for event
> channels.
>

It does block.

https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=xen/common/sched/core.c;hb=RELEASE-4.14.4#l1385

It sets the _VPF_blocked bit in the vCPU's pause_flags, then the
raise_softirq(SCHEDULE_SOFTIRQ) causes it to deschedule this vCPU when
it gets back to the return-to-guest path. It's not Linux, so you don't
see a schedule() call :)

It'll remain blocked until either an (unmasked) event channel or HVM
IRQ is asserted, or one of the masked event channels listed in the poll
list is raised.

Which is why I tried to tie it into the kvm_vcpu_halt() code path when
I updated the patch.
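
(A minimal sketch of the wakeup half that approach implies, reusing the
poller state from the quoted RFC patch; kvm_make_request() and
kvm_vcpu_kick() are the stock KVM primitives assumed here:)

	static void kvm_xen_check_poller(struct kvm_vcpu *vcpu, int port)
	{
		struct kvm_vcpu_xen *vcpu_xen = vcpu_to_xen_vcpu(vcpu);

		if ((vcpu_xen->poll_evtchn == port ||
		     vcpu_xen->poll_evtchn == -1) &&
		    test_and_clear_bit(vcpu->vcpu_id,
				       vcpu->kvm->arch.xen.poll_mask)) {
			/* Kick the vCPU out of kvm_vcpu_halt() */
			kvm_make_request(KVM_REQ_UNBLOCK, vcpu);
			kvm_vcpu_kick(vcpu);
		}
	}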

