2010-01-05 14:14:54

by Gleb Natapov

Subject: [PATCH v3 00/12] KVM: Add host swap event notifications for PV guest

KVM virtualizes guest memory by means of shadow pages or HW assistance
like NPT/EPT. Not all memory used by a guest is mapped into the guest
address space or even present in host memory at any given time.
When a vcpu tries to access a memory page that is not mapped into the
guest address space, KVM is notified about it. KVM maps the page into
the guest address space and resumes vcpu execution. If the page has been
swapped out of host memory, vcpu execution is suspended until the page
is swapped back in. This is inefficient, since the vcpu could do other
work (run another task or serve interrupts) while the page is being
swapped in.

To overcome this inefficiency this patch series implements "asynchronous
page faults" for paravirtualized KVM guests. If a page that a vcpu is
trying to access is swapped out, KVM sends an async PF to the vcpu
and continues vcpu execution. The requested page is swapped in by another
thread in parallel. When the vcpu gets the async PF it puts the faulting
task to sleep until a "wake up" interrupt is delivered. When the page is
brought into host memory, KVM sends the "wake up" interrupt and the guest
task resumes execution.
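
For readers unfamiliar with the guest side of the protocol, below is a
minimal, single-threaded user-space sketch of the token bookkeeping that
patches 4 and 10 implement with a small hash table of sleepers. A "page
not present" fault queues a sleeper keyed by the token (delivered through
CR2), and the "page ready" notification either wakes that sleeper or, if
it races ahead of the fault, leaves a dummy entry so the fault handler
returns at once. All names, sizes and printouts here are illustrative
only; the real code sleeps on a waitqueue (or halts) instead of printing.

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

#define NBUCKETS 8

struct sleeper {
	struct sleeper *next;
	uint32_t token;
	int dummy;			/* "page ready" seen before the fault */
};

static struct sleeper *buckets[NBUCKETS];

static struct sleeper **find_slot(uint32_t token)
{
	struct sleeper **p;

	for (p = &buckets[token % NBUCKETS]; *p; p = &(*p)->next)
		if ((*p)->token == token)
			return p;
	return NULL;
}

static void enqueue(uint32_t token, int dummy)
{
	struct sleeper *n = calloc(1, sizeof(*n));

	n->token = token;
	n->dummy = dummy;
	n->next = buckets[token % NBUCKETS];
	buckets[token % NBUCKETS] = n;
}

/* Guest gets PAGE_NOT_PRESENT: consume a dummy entry or go to sleep. */
static void page_not_present(uint32_t token)
{
	struct sleeper **p = find_slot(token), *n;

	if (p) {			/* wake-up was delivered first */
		n = *p;
		*p = n->next;
		free(n);
		printf("token %u: wake-up already seen, keep running\n", token);
		return;
	}
	enqueue(token, 0);
	printf("token %u: task put to sleep\n", token);
}

/* Guest gets PAGE_READY: wake the sleeper, or leave a dummy entry. */
static void page_ready(uint32_t token)
{
	struct sleeper **p = find_slot(token), *n;

	if (p) {
		n = *p;
		*p = n->next;
		free(n);
		printf("token %u: sleeper woken up\n", token);
		return;
	}
	enqueue(token, 1);
	printf("token %u: no sleeper yet, dummy entry left behind\n", token);
}

int main(void)
{
	page_not_present(42);		/* usual order: fault, then wake-up */
	page_ready(42);
	page_ready(7);			/* reversed order: wake-up wins the race */
	page_not_present(7);
	return 0;
}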

Changes:
v1->v2
Use an MSR instead of a hypercall.
Move most of the code into an arch-independent place.
Halt inside the guest instead of doing a "wait for page" hypercall if
preemption is disabled.
v2->v3
Use an MSR from the 0x4b564dxx range.
Add memslot version tracking.
Support migration by restarting all guest processes after migration.
Drop the patch that tracked preemptability for non-preemptable kernels
due to performance concerns. Send async PFs to non-preemptable
guests only when the vcpu is executing userspace code.

Gleb Natapov (12):
Move kvm_smp_prepare_boot_cpu() from kvmclock.c to kvm.c.
Add PV MSR to enable asynchronous page faults delivery.
Add async PF initialization to PV guest.
Add "handle page fault" PV helper.
Export __get_user_pages_fast.
Add get_user_pages() variant that fails if major fault is required.
Maintain memslot version number
Inject asynchronous page fault into a guest if page is swapped out.
Retry fault before vmentry
Handle async PF in non preemptable context
Let host know whether the guest can handle async PF in non-userspace
context.
Send async PF when guest is not in userspace too.

arch/x86/include/asm/kvm_host.h | 25 +++-
arch/x86/include/asm/kvm_para.h | 14 ++
arch/x86/include/asm/paravirt.h | 11 ++-
arch/x86/include/asm/paravirt_types.h | 4 +
arch/x86/kernel/kvm.c | 253 +++++++++++++++++++++++++++++++++
arch/x86/kernel/kvmclock.c | 13 +--
arch/x86/kernel/paravirt.c | 8 +
arch/x86/kernel/paravirt_patch_32.c | 8 +
arch/x86/kernel/paravirt_patch_64.c | 7 +
arch/x86/kernel/smpboot.c | 3 +
arch/x86/kvm/Kconfig | 2 +
arch/x86/kvm/mmu.c | 63 ++++++++-
arch/x86/kvm/paging_tmpl.h | 49 ++++++-
arch/x86/kvm/x86.c | 121 +++++++++++++++-
arch/x86/mm/fault.c | 3 +
arch/x86/mm/gup.c | 2 +
fs/ncpfs/mmap.c | 2 +
include/linux/kvm.h | 1 +
include/linux/kvm_host.h | 29 ++++
include/linux/kvm_para.h | 2 +
include/linux/mm.h | 5 +
include/trace/events/kvm.h | 60 ++++++++
mm/filemap.c | 3 +
mm/memory.c | 31 ++++-
mm/shmem.c | 8 +-
virt/kvm/Kconfig | 3 +
virt/kvm/kvm_main.c | 245 +++++++++++++++++++++++++++++++-
27 files changed, 941 insertions(+), 34 deletions(-)


2010-01-05 14:13:27

by Gleb Natapov

Subject: [PATCH v3 11/12] Let host know whether the guest can handle async PF in non-userspace context.

If the guest can detect that it runs in a non-preemptable context it can
handle async PFs at any time, so let the host know that it may send async
PFs even when the guest cpu is not in userspace.
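
As a worked example of the MSR layout (constants copied from patches 2
and 11; the physical address below is made up): the shared per-cpu area
is at least 64-byte aligned, so the low 6 bits of the MSR value carry
flags. Bit 0 enables the mechanism, bit 1 is the new "send always" flag
set by preemptable guests, and bits 2:5 remain reserved, which is why the
host-side check changes from data & 0x3e to data & 0x3c.

#include <stdio.h>
#include <stdint.h>

#define MSR_KVM_ASYNC_PF_EN	 0x4b564d00
#define KVM_ASYNC_PF_ENABLED	 (1 << 0)
#define KVM_ASYNC_PF_SEND_ALWAYS (1 << 1)

int main(void)
{
	uint64_t pa = 0x3f2d1c40;	/* hypothetical __pa(&apf_reason) */
	uint64_t data = pa | KVM_ASYNC_PF_ENABLED | KVM_ASYNC_PF_SEND_ALWAYS;

	printf("guest: wrmsr %#x <- %#llx\n", (unsigned)MSR_KVM_ASYNC_PF_EN,
	       (unsigned long long)data);
	printf("host:  gpa=%#llx enabled=%d send_always=%d reserved_clear=%d\n",
	       (unsigned long long)(data & ~0x3fULL),
	       !!(data & KVM_ASYNC_PF_ENABLED),
	       !!(data & KVM_ASYNC_PF_SEND_ALWAYS),
	       !(data & 0x3c));
	return 0;
}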

Signed-off-by: Gleb Natapov <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 1 +
arch/x86/include/asm/kvm_para.h | 1 +
arch/x86/kernel/kvm.c | 3 +++
arch/x86/kvm/x86.c | 5 +++--
4 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 43c1aca..cb0fd75 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -365,6 +365,7 @@ struct kvm_vcpu_arch {
unsigned long singlestep_rip;

u32 __user *apf_data;
+ bool apf_send_user_only;
u32 apf_memslot_ver;
u64 apf_msr_val;
u32 async_pf_id;
diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index 98edaa9..8a18560 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -24,6 +24,7 @@
#define KVM_MAX_MMU_OP_BATCH 32

#define KVM_ASYNC_PF_ENABLED (1 << 0)
+#define KVM_ASYNC_PF_SEND_ALWAYS (1 << 1)

/* Operations for KVM_HC_MMU_OP */
#define KVM_MMU_OP_WRITE_PTE 1
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 49549fd..4241706 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -480,6 +480,9 @@ void __cpuinit kvm_guest_cpu_init(void)
if (kvm_para_has_feature(KVM_FEATURE_ASYNC_PF)) {
u64 pa = __pa(&__get_cpu_var(apf_reason));

+#ifdef CONFIG_PREEMPT
+ pa |= KVM_ASYNC_PF_SEND_ALWAYS;
+#endif
if (native_write_msr_safe(MSR_KVM_ASYNC_PF_EN,
pa | KVM_ASYNC_PF_ENABLED, pa >> 32))
return;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index e2e33ac..47f1661 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1009,8 +1009,8 @@ static int kvm_pv_enable_async_pf(struct kvm_vcpu *vcpu, u64 data)
int offset = offset_in_page(gpa);
unsigned long addr;

- /* Bits 1:5 are resrved, Should be zero */
- if (data & 0x3e)
+ /* Bits 2:5 are resrved, Should be zero */
+ if (data & 0x3c)
return 1;

vcpu->arch.apf_msr_val = data;
@@ -1032,6 +1032,7 @@ static int kvm_pv_enable_async_pf(struct kvm_vcpu *vcpu, u64 data)
return 1;
}
vcpu->arch.apf_memslot_ver = vcpu->kvm->memslot_version;
+ vcpu->arch.apf_send_user_only = !(data & KVM_ASYNC_PF_SEND_ALWAYS);
kvm_async_pf_wakeup_all(vcpu);
return 0;
}
--
1.6.5

2010-01-05 14:13:36

by Gleb Natapov

Subject: [PATCH v3 10/12] Handle async PF in non preemptable context

If an async page fault is received by the idle task or while preempt_count
is not zero, the guest cannot reschedule, so do sti; hlt and wait for the
page to become ready. The vcpu can still process interrupts while it
waits.
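
One detail worth spelling out: the halted path below relies on
native_safe_halt(), i.e. "sti; hlt", so interrupts are enabled only in
the instruction shadow right before the halt. Assuming the sleep-node
re-check in the loop runs with interrupts disabled (the
local_irq_disable() after the halt suggests it does), the "page ready"
interrupt cannot be consumed between the re-check and the hlt. A sketch
of just that step, not new code from this patch:

static void halt_until_next_irq(void)
{
	native_safe_halt();	/* "sti; hlt": enable IRQs and halt in one go */
	local_irq_disable();	/* back to IRQs-off for the next re-check     */
}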

Signed-off-by: Gleb Natapov <[email protected]>
---
arch/x86/kernel/kvm.c | 36 ++++++++++++++++++++++++++++++++----
1 files changed, 32 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 2245f35..49549fd 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -33,6 +33,7 @@
#include <linux/sched.h>
#include <asm/timer.h>
#include <asm/cpu.h>
+#include <asm/tlbflush.h>

#define MMU_QUEUE_SIZE 1024

@@ -64,6 +65,8 @@ struct kvm_task_sleep_node {
wait_queue_head_t wq;
u32 token;
int cpu;
+ bool halted;
+ struct mm_struct *mm;
};

static struct kvm_task_sleep_head {
@@ -92,6 +95,11 @@ static void apf_task_wait(struct task_struct *tsk, u32 token)
struct kvm_task_sleep_head *b = &async_pf_sleepers[key];
struct kvm_task_sleep_node n, *e;
DEFINE_WAIT(wait);
+ int cpu, idle;
+
+ cpu = get_cpu();
+ idle = idle_cpu(cpu);
+ put_cpu();

spin_lock(&b->lock);
e = _find_apf_task(b, token);
@@ -105,17 +113,31 @@ static void apf_task_wait(struct task_struct *tsk, u32 token)

n.token = token;
n.cpu = smp_processor_id();
+ n.mm = percpu_read(cpu_tlbstate.active_mm);
+ n.halted = idle || preempt_count() > 1;
+ atomic_inc(&n.mm->mm_count);
init_waitqueue_head(&n.wq);
hlist_add_head(&n.link, &b->list);
spin_unlock(&b->lock);

for (;;) {
- prepare_to_wait(&n.wq, &wait, TASK_UNINTERRUPTIBLE);
+ if (!n.halted)
+ prepare_to_wait(&n.wq, &wait, TASK_UNINTERRUPTIBLE);
if (hlist_unhashed(&n.link))
break;
- schedule();
+
+ if (!n.halted) {
+ schedule();
+ } else {
+ /*
+ * We cannot reschedule. So halt.
+ */
+ native_safe_halt();
+ local_irq_disable();
+ }
}
- finish_wait(&n.wq, &wait);
+ if (!n.halted)
+ finish_wait(&n.wq, &wait);

return;
}
@@ -123,7 +145,12 @@ static void apf_task_wait(struct task_struct *tsk, u32 token)
static void apf_task_wake_one(struct kvm_task_sleep_node *n)
{
hlist_del_init(&n->link);
- if (waitqueue_active(&n->wq))
+ if (!n->mm)
+ return;
+ mmdrop(n->mm);
+ if (n->halted)
+ smp_send_reschedule(n->cpu);
+ else if (waitqueue_active(&n->wq))
wake_up(&n->wq);
}

@@ -153,6 +180,7 @@ again:
}
n->token = token;
n->cpu = smp_processor_id();
+ n->mm = NULL;
init_waitqueue_head(&n->wq);
hlist_add_head(&n->link, &b->list);
} else
--
1.6.5

2010-01-05 14:13:44

by Gleb Natapov

Subject: [PATCH v3 03/12] Add async PF initialization to PV guest.


Signed-off-by: Gleb Natapov <[email protected]>
---
arch/x86/include/asm/kvm_para.h | 5 ++++
arch/x86/kernel/kvm.c | 49 +++++++++++++++++++++++++++++++++++++++
arch/x86/kernel/smpboot.c | 3 ++
include/linux/kvm_para.h | 2 +
4 files changed, 59 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index f77eed3..56ca41b 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -51,6 +51,11 @@ struct kvm_mmu_op_release_pt {
__u64 pt_phys;
};

+struct kvm_vcpu_pv_apf_data {
+ __u32 reason;
+ __u32 enabled;
+};
+
#ifdef __KERNEL__
#include <asm/processor.h>

diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index e6db179..001222c 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -27,7 +27,10 @@
#include <linux/mm.h>
#include <linux/highmem.h>
#include <linux/hardirq.h>
+#include <linux/notifier.h>
+#include <linux/reboot.h>
#include <asm/timer.h>
+#include <asm/cpu.h>

#define MMU_QUEUE_SIZE 1024

@@ -37,6 +40,7 @@ struct kvm_para_state {
};

static DEFINE_PER_CPU(struct kvm_para_state, para_state);
+static DEFINE_PER_CPU_ALIGNED(struct kvm_vcpu_pv_apf_data, apf_reason);

static struct kvm_para_state *kvm_para_state(void)
{
@@ -231,10 +235,35 @@ static void __init paravirt_ops_setup(void)
#endif
}

+static void kvm_pv_disable_apf(void *unused)
+{
+ if (!__get_cpu_var(apf_reason).enabled)
+ return;
+
+ wrmsrl(MSR_KVM_ASYNC_PF_EN, 0);
+ __get_cpu_var(apf_reason).enabled = 0;
+
+ printk(KERN_INFO"Unregister pv shared memory for cpu %d\n",
+ smp_processor_id());
+}
+
+static int kvm_pv_reboot_notify(struct notifier_block *nb,
+ unsigned long code, void *unused)
+{
+ if (code == SYS_RESTART)
+ on_each_cpu(kvm_pv_disable_apf, NULL, 1);
+ return NOTIFY_DONE;
+}
+
+static struct notifier_block kvm_pv_reboot_nb = {
+ .notifier_call = kvm_pv_reboot_notify,
+};
+
#ifdef CONFIG_SMP
static void __init kvm_smp_prepare_boot_cpu(void)
{
WARN_ON(kvm_register_clock("primary cpu clock"));
+ kvm_guest_cpu_init();
native_smp_prepare_boot_cpu();
}
#endif
@@ -245,7 +274,27 @@ void __init kvm_guest_init(void)
return;

paravirt_ops_setup();
+ register_reboot_notifier(&kvm_pv_reboot_nb);
#ifdef CONFIG_SMP
smp_ops.smp_prepare_boot_cpu = kvm_smp_prepare_boot_cpu;
+#else
+ kvm_guest_cpu_init();
#endif
}
+
+void __cpuinit kvm_guest_cpu_init(void)
+{
+ if (!kvm_para_available())
+ return;
+
+ if (kvm_para_has_feature(KVM_FEATURE_ASYNC_PF)) {
+ u64 pa = __pa(&__get_cpu_var(apf_reason));
+
+ if (native_write_msr_safe(MSR_KVM_ASYNC_PF_EN,
+ pa | KVM_ASYNC_PF_ENABLED, pa >> 32))
+ return;
+ __get_cpu_var(apf_reason).enabled = 1;
+ printk(KERN_INFO"Setup pv shared memory for cpu %d\n",
+ smp_processor_id());
+ }
+}
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 678d0b8..0a9eef4 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -65,6 +65,7 @@
#include <asm/setup.h>
#include <asm/uv/uv.h>
#include <linux/mc146818rtc.h>
+#include <linux/kvm_para.h>

#include <asm/smpboot_hooks.h>

@@ -321,6 +322,8 @@ notrace static void __cpuinit start_secondary(void *unused)
ipi_call_unlock();
per_cpu(cpu_state, smp_processor_id()) = CPU_ONLINE;

+ kvm_guest_cpu_init();
+
/* enable local interrupts */
local_irq_enable();

diff --git a/include/linux/kvm_para.h b/include/linux/kvm_para.h
index d731092..4c8a2e6 100644
--- a/include/linux/kvm_para.h
+++ b/include/linux/kvm_para.h
@@ -26,8 +26,10 @@
#ifdef __KERNEL__
#ifdef CONFIG_KVM_GUEST
void __init kvm_guest_init(void);
+void __cpuinit kvm_guest_cpu_init(void);
#else
#define kvm_guest_init() do { } while (0)
+#define kvm_guest_cpu_init() do { } while (0)
#endif

static inline int kvm_para_has_feature(unsigned int feature)
--
1.6.5

2010-01-05 14:13:58

by Gleb Natapov

Subject: [PATCH v3 08/12] Inject asynchronous page fault into a guest if page is swapped out.

If the guest accesses swapped-out memory, do not swap it in from the vcpu
thread context. Set up a slow work item to do the swapping and send an
async page fault to the guest.

Allow async page fault injection only when the guest is in user mode,
since otherwise the guest may be in a non-sleepable context and will not
be able to reschedule.
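
Both the tdp_page_fault() and FNAME(page_fault) hunks below add the same
pattern; condensed for readability (code and names are from the patch,
only the comments are added here):

	if (can_do_async_pf(vcpu)) {	/* apf area registered, no event to
					   reinject, and guest CPL != 0     */
		r = gfn_to_pfn_async(vcpu->kvm, gfn, &pfn);	/* never does IO */
	} else {
do_sync:
		r = 1;
		pfn = gfn_to_pfn(vcpu->kvm, gfn);	/* may swap the page in */
	}

	if (!r) {			/* page would need IO to bring in */
		if (!kvm_arch_setup_async_pf(vcpu, gpa, gfn))
			goto do_sync;	/* could not queue the slow work */
		return 0;		/* async PF is injected before the next
					   vmentry; the guest can run something
					   else meanwhile */
	}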

Signed-off-by: Gleb Natapov <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 14 +++
arch/x86/kvm/Kconfig | 2 +
arch/x86/kvm/mmu.c | 36 ++++++-
arch/x86/kvm/paging_tmpl.h | 16 +++-
arch/x86/kvm/x86.c | 62 +++++++++-
include/linux/kvm_host.h | 28 +++++
include/trace/events/kvm.h | 60 ++++++++++
virt/kvm/Kconfig | 3 +
virt/kvm/kvm_main.c | 242 ++++++++++++++++++++++++++++++++++++++-
9 files changed, 456 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 01d3ec4..641943e 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -364,7 +364,9 @@ struct kvm_vcpu_arch {
unsigned long singlestep_rip;

u32 __user *apf_data;
+ u32 apf_memslot_ver;
u64 apf_msr_val;
+ u32 async_pf_id;
};

struct kvm_mem_alias {
@@ -537,6 +539,10 @@ struct kvm_x86_ops {
const struct trace_print_flags *exit_reasons_str;
};

+struct kvm_arch_async_pf {
+ u32 token;
+};
+
extern struct kvm_x86_ops *kvm_x86_ops;

int kvm_mmu_module_init(void);
@@ -815,4 +821,12 @@ int kvm_cpu_get_interrupt(struct kvm_vcpu *v);
void kvm_define_shared_msr(unsigned index, u32 msr);
void kvm_set_shared_msr(unsigned index, u64 val, u64 mask);

+struct kvm_async_pf;
+
+void kvm_arch_inject_async_page_not_present(struct kvm_vcpu *vcpu,
+ struct kvm_async_pf *work);
+void kvm_arch_inject_async_page_present(struct kvm_vcpu *vcpu,
+ struct kvm_async_pf *work);
+bool kvm_arch_can_inject_async_page_present(struct kvm_vcpu *vcpu);
#endif /* _ASM_X86_KVM_HOST_H */
+
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index 0687111..2dc919b 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -28,6 +28,8 @@ config KVM
select HAVE_KVM_IRQCHIP
select HAVE_KVM_EVENTFD
select KVM_APIC_ARCHITECTURE
+ select KVM_ASYNC_PF
+ select SLOW_WORK
select USER_RETURN_NOTIFIER
select KVM_MMIO
---help---
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index ed4f1a3..7214f28 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -19,6 +19,7 @@

#include "mmu.h"
#include "kvm_cache_regs.h"
+#include "x86.h"

#include <linux/kvm_host.h>
#include <linux/types.h>
@@ -30,6 +31,8 @@
#include <linux/hugetlb.h>
#include <linux/compiler.h>
#include <linux/srcu.h>
+#include <trace/events/kvm.h>
+#undef TRACE_INCLUDE_FILE

#include <asm/page.h>
#include <asm/cmpxchg.h>
@@ -2186,6 +2189,21 @@ static int nonpaging_page_fault(struct kvm_vcpu *vcpu, gva_t gva,
error_code & PFERR_WRITE_MASK, gfn);
}

+int kvm_arch_setup_async_pf(struct kvm_vcpu *vcpu, gva_t gva, gfn_t gfn)
+{
+ struct kvm_arch_async_pf arch;
+ arch.token = (vcpu->arch.async_pf_id++ << 12) | vcpu->vcpu_id;
+ return kvm_setup_async_pf(vcpu, gva, gfn, &arch);
+}
+
+static bool can_do_async_pf(struct kvm_vcpu *vcpu)
+{
+ if (!vcpu->arch.apf_data || kvm_event_needs_reinjection(vcpu))
+ return false;
+
+ return !!kvm_x86_ops->get_cpl(vcpu);
+}
+
static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t gpa,
u32 error_code)
{
@@ -2208,7 +2226,23 @@ static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t gpa,

mmu_seq = vcpu->kvm->mmu_notifier_seq;
smp_rmb();
- pfn = gfn_to_pfn(vcpu->kvm, gfn);
+
+ if (can_do_async_pf(vcpu)) {
+ r = gfn_to_pfn_async(vcpu->kvm, gfn, &pfn);
+ trace_kvm_try_async_get_page(r, pfn);
+ } else {
+do_sync:
+ r = 1;
+ pfn = gfn_to_pfn(vcpu->kvm, gfn);
+ }
+
+ if (!r) {
+ if (!kvm_arch_setup_async_pf(vcpu, gpa, gfn))
+ goto do_sync;
+ return 0;
+ }
+
+ /* mmio */
if (is_error_pfn(pfn)) {
kvm_release_pfn_clean(pfn);
return 1;
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 58a0f1e..1b2c605 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -419,7 +419,21 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t addr,

mmu_seq = vcpu->kvm->mmu_notifier_seq;
smp_rmb();
- pfn = gfn_to_pfn(vcpu->kvm, walker.gfn);
+
+ if (can_do_async_pf(vcpu)) {
+ r = gfn_to_pfn_async(vcpu->kvm, walker.gfn, &pfn);
+ trace_kvm_try_async_get_page(r, pfn);
+ } else {
+do_sync:
+ r = 1;
+ pfn = gfn_to_pfn(vcpu->kvm, walker.gfn);
+ }
+
+ if (!r) {
+ if (!kvm_arch_setup_async_pf(vcpu, addr, walker.gfn))
+ goto do_sync;
+ return 0;
+ }

/* mmio */
if (is_error_pfn(pfn)) {
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index f6821b9..cfef357 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1031,6 +1031,8 @@ static int kvm_pv_enable_async_pf(struct kvm_vcpu *vcpu, u64 data)
vcpu->arch.apf_data = NULL;
return 1;
}
+ vcpu->arch.apf_memslot_ver = vcpu->kvm->memslot_version;
+ kvm_async_pf_wakeup_all(vcpu);
return 0;
}

@@ -4046,6 +4048,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
}
}

+ kvm_check_async_pf_completion(vcpu);
+
preempt_disable();

kvm_x86_ops->prepare_guest_switch(vcpu);
@@ -5319,8 +5323,10 @@ static void kvm_free_vcpus(struct kvm *kvm)
/*
* Unpin any mmu pages first.
*/
- kvm_for_each_vcpu(i, vcpu, kvm)
+ kvm_for_each_vcpu(i, vcpu, kvm) {
+ kvm_clear_async_pf_completion_queue(vcpu);
kvm_unload_vcpu_mmu(vcpu);
+ }
kvm_for_each_vcpu(i, vcpu, kvm)
kvm_arch_vcpu_free(vcpu);

@@ -5426,10 +5432,11 @@ void kvm_arch_flush_shadow(struct kvm *kvm)
int kvm_arch_vcpu_runnable(struct kvm_vcpu *vcpu)
{
return vcpu->arch.mp_state == KVM_MP_STATE_RUNNABLE
+ || !list_empty_careful(&vcpu->async_pf_done)
|| vcpu->arch.mp_state == KVM_MP_STATE_SIPI_RECEIVED
- || vcpu->arch.nmi_pending ||
- (kvm_arch_interrupt_allowed(vcpu) &&
- kvm_cpu_has_interrupt(vcpu));
+ || vcpu->arch.nmi_pending
+ || (kvm_arch_interrupt_allowed(vcpu) &&
+ kvm_cpu_has_interrupt(vcpu));
}

void kvm_vcpu_kick(struct kvm_vcpu *vcpu)
@@ -5476,6 +5483,53 @@ void kvm_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags)
}
EXPORT_SYMBOL_GPL(kvm_set_rflags);

+static int apf_put_user(struct kvm_vcpu *vcpu, u32 val)
+{
+ if (unlikely(vcpu->arch.apf_memslot_ver !=
+ vcpu->kvm->memslot_version)) {
+ u64 gpa = vcpu->arch.apf_msr_val & ~0x3f;
+ unsigned long addr;
+ int offset = offset_in_page(gpa);
+
+ addr = gfn_to_hva(vcpu->kvm, gpa >> PAGE_SHIFT);
+ vcpu->arch.apf_data = (u32 __user*)(addr + offset);
+ if (kvm_is_error_hva(addr)) {
+ vcpu->arch.apf_data = NULL;
+ return -EFAULT;
+ }
+ }
+
+ return put_user(val, vcpu->arch.apf_data);
+}
+
+void kvm_arch_inject_async_page_not_present(struct kvm_vcpu *vcpu,
+ struct kvm_async_pf *work)
+{
+ if (!apf_put_user(vcpu, KVM_PV_REASON_PAGE_NOT_PRESENT)) {
+ kvm_inject_page_fault(vcpu, work->arch.token, 0);
+ trace_kvm_send_async_pf(work->arch.token, work->gva,
+ KVM_PV_REASON_PAGE_NOT_PRESENT);
+ }
+}
+
+void kvm_arch_inject_async_page_present(struct kvm_vcpu *vcpu,
+ struct kvm_async_pf *work)
+{
+ if (is_error_page(work->page))
+ work->arch.token = ~0; /* broadcast wakeup */
+ if (!apf_put_user(vcpu, KVM_PV_REASON_PAGE_READY)) {
+ kvm_inject_page_fault(vcpu, work->arch.token, 0);
+ trace_kvm_send_async_pf(work->arch.token, work->gva,
+ KVM_PV_REASON_PAGE_READY);
+ }
+}
+
+bool kvm_arch_can_inject_async_page_present(struct kvm_vcpu *vcpu)
+{
+ return !kvm_event_needs_reinjection(vcpu) &&
+ kvm_x86_ops->interrupt_allowed(vcpu);
+}
+
EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_exit);
EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_inj_virq);
EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_page_fault);
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 3f5ebc2..d5695d4 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -16,6 +16,7 @@
#include <linux/mm.h>
#include <linux/preempt.h>
#include <linux/msi.h>
+#include <linux/slow-work.h>
#include <asm/signal.h>

#include <linux/kvm.h>
@@ -72,6 +73,26 @@ int kvm_io_bus_register_dev(struct kvm *kvm, enum kvm_bus bus_idx,
int kvm_io_bus_unregister_dev(struct kvm *kvm, enum kvm_bus bus_idx,
struct kvm_io_device *dev);

+#ifdef CONFIG_KVM_ASYNC_PF
+struct kvm_async_pf {
+ struct slow_work work;
+ struct list_head link;
+ struct kvm_vcpu *vcpu;
+ struct mm_struct *mm;
+ gva_t gva;
+ unsigned long addr;
+ struct kvm_arch_async_pf arch;
+ struct page *page;
+ atomic_t used;
+};
+
+void kvm_clear_async_pf_completion_queue(struct kvm_vcpu *vcpu);
+void kvm_check_async_pf_completion(struct kvm_vcpu *vcpu);
+int kvm_setup_async_pf(struct kvm_vcpu *vcpu, gva_t gva, gfn_t gfn,
+ struct kvm_arch_async_pf *arch);
+int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu);
+#endif
+
struct kvm_vcpu {
struct kvm *kvm;
#ifdef CONFIG_PREEMPT_NOTIFIERS
@@ -101,6 +122,12 @@ struct kvm_vcpu {
gpa_t mmio_phys_addr;
#endif

+#ifdef CONFIG_KVM_ASYNC_PF
+ struct list_head async_pf_done;
+ spinlock_t async_pf_lock;
+ struct kvm_async_pf *async_pf_work;
+#endif
+
struct kvm_vcpu_arch arch;
};

@@ -276,6 +303,7 @@ void kvm_release_page_dirty(struct page *page);
void kvm_set_page_dirty(struct page *page);
void kvm_set_page_accessed(struct page *page);

+int gfn_to_pfn_async(struct kvm *kvm, gfn_t gfn, pfn_t *pfn);
pfn_t gfn_to_pfn(struct kvm *kvm, gfn_t gfn);
pfn_t gfn_to_pfn_memslot(struct kvm *kvm,
struct kvm_memory_slot *slot, gfn_t gfn);
diff --git a/include/trace/events/kvm.h b/include/trace/events/kvm.h
index dbe1084..ddfdd8e 100644
--- a/include/trace/events/kvm.h
+++ b/include/trace/events/kvm.h
@@ -145,6 +145,66 @@ TRACE_EVENT(kvm_mmio,
__entry->len, __entry->gpa, __entry->val)
);

+#ifdef CONFIG_KVM_ASYNC_PF
+TRACE_EVENT(
+ kvm_try_async_get_page,
+ TP_PROTO(bool r, u64 pfn),
+ TP_ARGS(r, pfn),
+
+ TP_STRUCT__entry(
+ __field(__u64, pfn)
+ ),
+
+ TP_fast_assign(
+ __entry->pfn = r ? pfn : (u64)-1;
+ ),
+
+ TP_printk("pfn %#llx", __entry->pfn)
+);
+
+TRACE_EVENT(
+ kvm_send_async_pf,
+ TP_PROTO(u64 token, u64 gva, u64 reason),
+ TP_ARGS(token, gva, reason),
+
+ TP_STRUCT__entry(
+ __field(__u64, token)
+ __field(__u64, gva)
+ __field(bool, np)
+ ),
+
+ TP_fast_assign(
+ __entry->token = token;
+ __entry->gva = gva;
+ __entry->np = (reason == KVM_PV_REASON_PAGE_NOT_PRESENT);
+ ),
+
+ TP_printk("token %#llx gva %#llx %s", __entry->token, __entry->gva,
+ __entry->np ? "not present" : "ready")
+);
+
+TRACE_EVENT(
+ kvm_async_pf_completed,
+ TP_PROTO(unsigned long address, struct page *page, u64 gva),
+ TP_ARGS(address, page, gva),
+
+ TP_STRUCT__entry(
+ __field(unsigned long, address)
+ __field(struct page*, page)
+ __field(u64, gva)
+ ),
+
+ TP_fast_assign(
+ __entry->address = address;
+ __entry->page = page;
+ __entry->gva = gva;
+ ),
+
+ TP_printk("gva %#llx address %#lx pfn %lx", __entry->gva,
+ __entry->address, page_to_pfn(__entry->page))
+);
+#endif
+
#endif /* _TRACE_KVM_MAIN_H */

/* This part must be outside protection */
diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
index 7f1178f..f63ccb0 100644
--- a/virt/kvm/Kconfig
+++ b/virt/kvm/Kconfig
@@ -15,3 +15,6 @@ config KVM_APIC_ARCHITECTURE

config KVM_MMIO
bool
+
+config KVM_ASYNC_PF
+ bool
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index df3325c..3552be0 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -76,6 +76,10 @@ static atomic_t hardware_enable_failed;
struct kmem_cache *kvm_vcpu_cache;
EXPORT_SYMBOL_GPL(kvm_vcpu_cache);

+#ifdef CONFIG_KVM_ASYNC_PF
+static struct kmem_cache *async_pf_cache;
+#endif
+
static __read_mostly struct preempt_ops kvm_preempt_ops;

struct dentry *kvm_debugfs_dir;
@@ -178,6 +182,10 @@ int kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id)
vcpu->kvm = kvm;
vcpu->vcpu_id = id;
init_waitqueue_head(&vcpu->wq);
+#ifdef CONFIG_KVM_ASYNC_PF
+ INIT_LIST_HEAD(&vcpu->async_pf_done);
+ spin_lock_init(&vcpu->async_pf_lock);
+#endif

page = alloc_page(GFP_KERNEL | __GFP_ZERO);
if (!page) {
@@ -916,6 +924,51 @@ unsigned long gfn_to_hva(struct kvm *kvm, gfn_t gfn)
}
EXPORT_SYMBOL_GPL(gfn_to_hva);

+int gfn_to_pfn_async(struct kvm *kvm, gfn_t gfn, pfn_t *pfn)
+{
+ struct page *page[1];
+ unsigned long addr;
+ int npages = 0;
+
+ *pfn = bad_pfn;
+
+ addr = gfn_to_hva(kvm, gfn);
+ if (kvm_is_error_hva(addr)) {
+ get_page(bad_page);
+ return 1;
+ }
+
+#ifdef CONFIG_X86
+ npages = __get_user_pages_fast(addr, 1, 1, page);
+#endif
+ if (unlikely(npages != 1)) {
+ down_read(&current->mm->mmap_sem);
+ npages = get_user_pages_noio(current, current->mm, addr, 1, 1,
+ 0, page, NULL);
+ up_read(&current->mm->mmap_sem);
+ }
+
+ if (unlikely(npages != 1)) {
+ struct vm_area_struct *vma;
+
+ down_read(&current->mm->mmap_sem);
+ vma = find_vma(current->mm, addr);
+
+ if (vma == NULL || addr < vma->vm_start ||
+ !(vma->vm_flags & VM_PFNMAP)) {
+ up_read(&current->mm->mmap_sem);
+ return 0; /* do async fault in */
+ }
+
+ *pfn = ((addr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
+ up_read(&current->mm->mmap_sem);
+ BUG_ON(!kvm_is_mmio_pfn(*pfn));
+ } else
+ *pfn = page_to_pfn(page[0]);
+
+ return 1;
+}
+
static pfn_t hva_to_pfn(struct kvm *kvm, unsigned long addr)
{
struct page *page[1];
@@ -1187,6 +1240,169 @@ void mark_page_dirty(struct kvm *kvm, gfn_t gfn)
}
}

+#ifdef CONFIG_KVM_ASYNC_PF
+static void async_pf_work_free(struct kvm_async_pf *apf)
+{
+ if (atomic_dec_and_test(&apf->used))
+ kmem_cache_free(async_pf_cache, apf);
+}
+
+static int async_pf_get_ref(struct slow_work *work)
+{
+ struct kvm_async_pf *apf =
+ container_of(work, struct kvm_async_pf, work);
+
+ atomic_inc(&apf->used);
+ return 0;
+}
+
+static void async_pf_put_ref(struct slow_work *work)
+{
+ struct kvm_async_pf *apf =
+ container_of(work, struct kvm_async_pf, work);
+
+ kvm_put_kvm(apf->vcpu->kvm);
+ async_pf_work_free(apf);
+}
+
+static void async_pf_execute(struct slow_work *work)
+{
+ struct page *page;
+ struct kvm_async_pf *apf =
+ container_of(work, struct kvm_async_pf, work);
+ wait_queue_head_t *q = &apf->vcpu->wq;
+
+ might_sleep();
+
+ down_read(&apf->mm->mmap_sem);
+ get_user_pages(current, apf->mm, apf->addr, 1, 1, 0, &page, NULL);
+ up_read(&apf->mm->mmap_sem);
+
+ spin_lock(&apf->vcpu->async_pf_lock);
+ list_add_tail(&apf->link, &apf->vcpu->async_pf_done);
+ apf->page = page;
+ spin_unlock(&apf->vcpu->async_pf_lock);
+
+ trace_kvm_async_pf_completed(apf->addr, apf->page, apf->gva);
+
+ if (waitqueue_active(q))
+ wake_up_interruptible(q);
+
+ mmdrop(apf->mm);
+}
+
+struct slow_work_ops async_pf_ops = {
+ .get_ref = async_pf_get_ref,
+ .put_ref = async_pf_put_ref,
+ .execute = async_pf_execute
+};
+
+void kvm_clear_async_pf_completion_queue(struct kvm_vcpu *vcpu)
+{
+ while (!list_empty(&vcpu->async_pf_done)) {
+ struct kvm_async_pf *work =
+ list_entry(vcpu->async_pf_done.next,
+ typeof(*work), link);
+ list_del(&work->link);
+ put_page(work->page);
+ kmem_cache_free(async_pf_cache, work);
+ }
+}
+
+void kvm_check_async_pf_completion(struct kvm_vcpu *vcpu)
+{
+ struct kvm_async_pf *work = vcpu->async_pf_work;
+
+ if (work) {
+ vcpu->async_pf_work = NULL;
+ if (work->page == NULL) {
+ kvm_arch_inject_async_page_not_present(vcpu, work);
+ return;
+ } else {
+ spin_lock(&vcpu->async_pf_lock);
+ list_del(&work->link);
+ spin_unlock(&vcpu->async_pf_lock);
+ put_page(work->page);
+ async_pf_work_free(work);
+ }
+ }
+
+ if (list_empty_careful(&vcpu->async_pf_done) ||
+ !kvm_arch_can_inject_async_page_present(vcpu))
+ return;
+
+ spin_lock(&vcpu->async_pf_lock);
+ work = list_first_entry(&vcpu->async_pf_done, typeof(*work), link);
+ list_del(&work->link);
+ spin_unlock(&vcpu->async_pf_lock);
+
+ kvm_arch_inject_async_page_present(vcpu, work);
+
+ put_page(work->page);
+ async_pf_work_free(work);
+}
+
+int kvm_setup_async_pf(struct kvm_vcpu *vcpu, gva_t gva, gfn_t gfn,
+ struct kvm_arch_async_pf *arch)
+{
+ struct kvm_async_pf *work;
+
+ /* setup slow work */
+
+ /* do alloc atomic since if we are going to sleep anyway we
+ may as well sleep faulting in page */
+ work = kmem_cache_zalloc(async_pf_cache, GFP_ATOMIC);
+ if (!work)
+ return 0;
+
+ atomic_set(&work->used, 1);
+ work->page = NULL;
+ work->vcpu = vcpu;
+ work->gva = gva;
+ work->addr = gfn_to_hva(vcpu->kvm, gfn);
+ work->arch = *arch;
+ work->mm = current->mm;
+ atomic_inc(&work->mm->mm_count);
+ kvm_get_kvm(work->vcpu->kvm);
+
+ /* this can't really happen otherwise gfn_to_pfn_async
+ would succeed */
+ if (unlikely(kvm_is_error_hva(work->addr)))
+ goto retry_sync;
+
+ slow_work_init(&work->work, &async_pf_ops);
+ if (slow_work_enqueue(&work->work) != 0)
+ goto retry_sync;
+
+ vcpu->async_pf_work = work;
+ return 1;
+retry_sync:
+ kvm_put_kvm(work->vcpu->kvm);
+ mmdrop(work->mm);
+ kmem_cache_free(async_pf_cache, work);
+ return 0;
+}
+
+int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu)
+{
+ struct kvm_async_pf *work;
+
+ if (!list_empty(&vcpu->async_pf_done))
+ return 0;
+
+ work = kmem_cache_zalloc(async_pf_cache, GFP_ATOMIC);
+ if (!work)
+ return -ENOMEM;
+
+ atomic_set(&work->used, 1);
+ work->page = bad_page;
+ get_page(bad_page);
+
+ list_add_tail(&work->link, &vcpu->async_pf_done);
+ return 0;
+}
+#endif
+
/*
* The vCPU has executed a HLT instruction with in-kernel mode enabled.
*/
@@ -2224,6 +2440,19 @@ int kvm_init(void *opaque, unsigned int vcpu_size,
goto out_free_5;
}

+#ifdef CONFIG_KVM_ASYNC_PF
+ async_pf_cache = KMEM_CACHE(kvm_async_pf, 0);
+
+ if (!async_pf_cache) {
+ r = -ENOMEM;
+ goto out_free_6;
+ }
+
+ r = slow_work_register_user(THIS_MODULE);
+ if (r)
+ goto out_free;
+#endif
+
kvm_chardev_ops.owner = module;
kvm_vm_fops.owner = module;
kvm_vcpu_fops.owner = module;
@@ -2231,7 +2460,7 @@ int kvm_init(void *opaque, unsigned int vcpu_size,
r = misc_register(&kvm_dev);
if (r) {
printk(KERN_ERR "kvm: misc device register failed\n");
- goto out_free;
+ goto out_unreg;
}

kvm_preempt_ops.sched_in = kvm_sched_in;
@@ -2241,7 +2470,13 @@ int kvm_init(void *opaque, unsigned int vcpu_size,

return 0;

+out_unreg:
+#ifdef CONFIG_KVM_ASYNC_PF
+ slow_work_unregister_user(THIS_MODULE);
out_free:
+ kmem_cache_destroy(async_pf_cache);
+out_free_6:
+#endif
kmem_cache_destroy(kvm_vcpu_cache);
out_free_5:
sysdev_unregister(&kvm_sysdev);
@@ -2270,6 +2505,11 @@ void kvm_exit(void)
kvm_exit_debug();
misc_deregister(&kvm_dev);
kmem_cache_destroy(kvm_vcpu_cache);
+#ifdef CONFIG_KVM_ASYNC_PF
+ if (async_pf_cache)
+ kmem_cache_destroy(async_pf_cache);
+ slow_work_unregister_user(THIS_MODULE);
+#endif
sysdev_unregister(&kvm_sysdev);
sysdev_class_unregister(&kvm_sysdev_class);
unregister_reboot_notifier(&kvm_reboot_notifier);
--
1.6.5

2010-01-05 14:13:55

by Gleb Natapov

Subject: [PATCH v3 04/12] Add "handle page fault" PV helper.

Allow paravirtualized guests to do special handling for some page faults.

The patch adds one 'if' to the do_page_fault() function. The call is
patched out when running on physical HW; the resulting call chain is
summarized right after the listings below. I ran kernbench on kernels
with and without that additional 'if' and the results were roughly the
same:

With 'if': Without 'if':
Average Half load -j 8 Run (std deviation):
Elapsed Time 338.122 (0.869207) Elapsed Time 336.404 (1.69196)
User Time 2025.49 (4.11097) User Time 2028.17 (1.34199)
System Time 218.098 (0.826632) System Time 219.022 (0.978581)
Percent CPU 663 (2.12132) Percent CPU 667.4 (3.20936)
Context Switches 140903 (2655.99) Context Switches 138476 (3408.49)
Sleeps 836103 (1594.93) Sleeps 836048 (2597.78)

Average Optimal load -j 16 Run (std deviation):
Elapsed Time 255.32 (1.28328) Elapsed Time 255.144 (1.09674)
User Time 2342.93 (334.625) User Time 2343.7 (332.612)
System Time 248.722 (32.2921) System Time 250.06 (32.7329)
Percent CPU 906.9 (257.13) Percent CPU 909.6 (255.326)
Context Switches 294395 (161817) Context Switches 293186 (163124)
Sleeps 890162 (57019.2) Sleeps 890534 (57494.5)

Average Custom load -j 32 Run (std deviation):
Elapsed Time 236.318 (1.56739) Elapsed Time 236.018 (1.32215)
User Time 2528.19 (381.49) User Time 2530.75 (382.228)
System Time 264.739 (34.9425) System Time 266.082 (35.1991)
Percent CPU 1055.2 (299.438) Percent CPU 1058.6 (299.153)
Context Switches 445730 (256736) Context Switches 446093 (259253)
Sleeps 939835 (85894.1) Sleeps 939638 (85396.9)

This is how the 'if' affects the generated assembly:

With 'if': Without 'if':
do_page_fault: do_page_fault:
push %rbp push %rbp
mov %rsp,%rbp mov %rsp,%rbp
push %r15 push %r15
push %r14 push %r14
push %r13 push %r13
push %r12 push %r12
push %rbx push %rbx
sub $0x28,%rsp sub $0x18,%rsp
callq ffffffff81002a80 callq ffffffff81002a80
mov %rdi,%r14 mov %gs:0xb540,%r15
mov %rsi,%r13 mov 0x270(%r15),%rax
callq *0xffffffff816ab308 mov %rdi,%r14
test %eax,%eax mov %rsi,%r13
jne ffffffff813f10cb
mov %gs:0xb540,%r15
mov 0x270(%r15),%rax

And this is how the code looks at runtime after patching:

Running on kvm: Running on phys HW:
do_page_fault: do_page_fault:
push %rbp push %rbp
mov %rsp,%rbp mov %rsp,%rbp
push %r15 push %r15
push %r14 push %r14
push %r13 push %r13
push %r12 push %r12
push %rbx push %rbx
sub $0x28,%rsp sub $0x28,%rsp
callq 0xffffffff81002a80 callq 0xffffffff81002a80
mov %rdi,%r14 mov %rdi,%r14
mov %rsi,%r13 mov %rsi,%r13
callq 0xffffffff8102417e xor %rax,%rax
xchg %ax,%ax nopl 0x0(%rax)
test %eax,%eax test %eax,%eax
jne 0xffffffff813f10cb jne 0xffffffff813f10cb
mov %gs:0xb540,%r15 mov %gs:0xb540,%r15
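
To summarize the plumbing behind the listings above (everything here is
taken from the diff below, with comments added):

/* The single added call in do_page_fault() boils down to: */
static inline int handle_page_fault(struct pt_regs *regs,
				    unsigned long error_code)
{
	/* pv_cpu_ops.handle_pf is kvm_handle_pf when KVM_FEATURE_ASYNC_PF
	 * is present, and paravirt_ret_0 otherwise; the latter is patched
	 * at boot into the inline "xor %eax,%eax" seen on the right above. */
	return PVOP_CALL2(int, pv_cpu_ops.handle_pf, regs, error_code);
}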

Signed-off-by: Gleb Natapov <[email protected]>
---
arch/x86/include/asm/kvm_para.h | 3 +
arch/x86/include/asm/paravirt.h | 11 ++-
arch/x86/include/asm/paravirt_types.h | 4 +
arch/x86/kernel/kvm.c | 162 +++++++++++++++++++++++++++++++++
arch/x86/kernel/paravirt.c | 8 ++
arch/x86/kernel/paravirt_patch_32.c | 8 ++
arch/x86/kernel/paravirt_patch_64.c | 7 ++
arch/x86/mm/fault.c | 3 +
8 files changed, 204 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index 56ca41b..98edaa9 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -51,6 +51,9 @@ struct kvm_mmu_op_release_pt {
__u64 pt_phys;
};

+#define KVM_PV_REASON_PAGE_NOT_PRESENT 1
+#define KVM_PV_REASON_PAGE_READY 2
+
struct kvm_vcpu_pv_apf_data {
__u32 reason;
__u32 enabled;
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index dd59a85..0c3e95b 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -6,6 +6,7 @@
#ifdef CONFIG_PARAVIRT
#include <asm/pgtable_types.h>
#include <asm/asm.h>
+#include <asm/ptrace.h>

#include <asm/paravirt_types.h>

@@ -699,15 +700,21 @@ static inline void pmd_clear(pmd_t *pmdp)
}
#endif /* CONFIG_X86_PAE */

+static inline void arch_end_context_switch(struct task_struct *next)
+{
+ PVOP_VCALL1(pv_cpu_ops.end_context_switch, next);
+}
+
#define __HAVE_ARCH_START_CONTEXT_SWITCH
static inline void arch_start_context_switch(struct task_struct *prev)
{
PVOP_VCALL1(pv_cpu_ops.start_context_switch, prev);
}

-static inline void arch_end_context_switch(struct task_struct *next)
+static inline int handle_page_fault(struct pt_regs *regs,
+ unsigned long error_code)
{
- PVOP_VCALL1(pv_cpu_ops.end_context_switch, next);
+ return PVOP_CALL2(int, pv_cpu_ops.handle_pf, regs, error_code);
}

#define __HAVE_ARCH_ENTER_LAZY_MMU_MODE
diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index b1e70d5..1aec9b5 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -186,6 +186,7 @@ struct pv_cpu_ops {

void (*start_context_switch)(struct task_struct *prev);
void (*end_context_switch)(struct task_struct *next);
+ int (*handle_pf)(struct pt_regs *regs, unsigned long error_code);
};

struct pv_irq_ops {
@@ -385,6 +386,7 @@ extern struct pv_lock_ops pv_lock_ops;
unsigned paravirt_patch_nop(void);
unsigned paravirt_patch_ident_32(void *insnbuf, unsigned len);
unsigned paravirt_patch_ident_64(void *insnbuf, unsigned len);
+unsigned paravirt_patch_ret_0(void *insnbuf, unsigned len);
unsigned paravirt_patch_ignore(unsigned len);
unsigned paravirt_patch_call(void *insnbuf,
const void *target, u16 tgt_clobbers,
@@ -676,8 +678,10 @@ void paravirt_leave_lazy_mmu(void);
void _paravirt_nop(void);
u32 _paravirt_ident_32(u32);
u64 _paravirt_ident_64(u64);
+unsigned long _paravirt_ret_0(void);

#define paravirt_nop ((void *)_paravirt_nop)
+#define paravirt_ret_0 ((void *)_paravirt_ret_0)

/* These all sit in the .parainstructions section to tell us what to patch. */
struct paravirt_patch_site {
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 001222c..2245f35 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -29,6 +29,8 @@
#include <linux/hardirq.h>
#include <linux/notifier.h>
#include <linux/reboot.h>
+#include <linux/hash.h>
+#include <linux/sched.h>
#include <asm/timer.h>
#include <asm/cpu.h>

@@ -54,6 +56,159 @@ static void kvm_io_delay(void)
{
}

+#define KVM_TASK_SLEEP_HASHBITS 8
+#define KVM_TASK_SLEEP_HASHSIZE (1<<KVM_TASK_SLEEP_HASHBITS)
+
+struct kvm_task_sleep_node {
+ struct hlist_node link;
+ wait_queue_head_t wq;
+ u32 token;
+ int cpu;
+};
+
+static struct kvm_task_sleep_head {
+ spinlock_t lock;
+ struct hlist_head list;
+} async_pf_sleepers[KVM_TASK_SLEEP_HASHSIZE];
+
+static struct kvm_task_sleep_node *_find_apf_task(struct kvm_task_sleep_head *b,
+ u64 token)
+{
+ struct hlist_node *p;
+
+ hlist_for_each(p, &b->list) {
+ struct kvm_task_sleep_node *n =
+ hlist_entry(p, typeof(*n), link);
+ if (n->token == token)
+ return n;
+ }
+
+ return NULL;
+}
+
+static void apf_task_wait(struct task_struct *tsk, u32 token)
+{
+ u32 key = hash_32(token, KVM_TASK_SLEEP_HASHBITS);
+ struct kvm_task_sleep_head *b = &async_pf_sleepers[key];
+ struct kvm_task_sleep_node n, *e;
+ DEFINE_WAIT(wait);
+
+ spin_lock(&b->lock);
+ e = _find_apf_task(b, token);
+ if (e) {
+ /* dummy entry exist -> wake up was delivered ahead of PF */
+ hlist_del(&e->link);
+ kfree(e);
+ spin_unlock(&b->lock);
+ return;
+ }
+
+ n.token = token;
+ n.cpu = smp_processor_id();
+ init_waitqueue_head(&n.wq);
+ hlist_add_head(&n.link, &b->list);
+ spin_unlock(&b->lock);
+
+ for (;;) {
+ prepare_to_wait(&n.wq, &wait, TASK_UNINTERRUPTIBLE);
+ if (hlist_unhashed(&n.link))
+ break;
+ schedule();
+ }
+ finish_wait(&n.wq, &wait);
+
+ return;
+}
+
+static void apf_task_wake_one(struct kvm_task_sleep_node *n)
+{
+ hlist_del_init(&n->link);
+ if (waitqueue_active(&n->wq))
+ wake_up(&n->wq);
+}
+
+static void apf_task_wake(u32 token)
+{
+ u32 key = hash_32(token, KVM_TASK_SLEEP_HASHBITS);
+ struct kvm_task_sleep_head *b = &async_pf_sleepers[key];
+ struct kvm_task_sleep_node *n;
+
+again:
+ spin_lock(&b->lock);
+ n = _find_apf_task(b, token);
+ if (!n) {
+ /*
+ * async PF was not yet handled.
+ * Add dummy entry for the token.
+ */
+ n = kmalloc(sizeof(*n), GFP_ATOMIC);
+ if (!n) {
+ /*
+ * Allocation failed! Busy wait while other vcpu
+ * handles async PF.
+ */
+ spin_unlock(&b->lock);
+ cpu_relax();
+ goto again;
+ }
+ n->token = token;
+ n->cpu = smp_processor_id();
+ init_waitqueue_head(&n->wq);
+ hlist_add_head(&n->link, &b->list);
+ } else
+ apf_task_wake_one(n);
+ spin_unlock(&b->lock);
+ return;
+}
+
+static void apf_task_wake_all(void)
+{
+ int i;
+
+ for (i = 0; i < KVM_TASK_SLEEP_HASHSIZE; i++) {
+ struct hlist_node *p, *next;
+ struct kvm_task_sleep_head *b = &async_pf_sleepers[i];
+ spin_lock(&b->lock);
+ hlist_for_each_safe(p, next, &b->list) {
+ struct kvm_task_sleep_node *n =
+ hlist_entry(p, typeof(*n), link);
+ if (n->cpu == smp_processor_id())
+ apf_task_wake_one(n);
+ }
+ spin_unlock(&b->lock);
+ }
+}
+
+int kvm_handle_pf(struct pt_regs *regs, unsigned long error_code)
+{
+ u32 reason, token;
+
+ if (!__get_cpu_var(apf_reason).enabled)
+ return 0;
+
+ reason = __get_cpu_var(apf_reason).reason;
+ __get_cpu_var(apf_reason).reason = 0;
+
+ token = (u32)read_cr2();
+
+ switch (reason) {
+ default:
+ return 0;
+ case KVM_PV_REASON_PAGE_NOT_PRESENT:
+ /* page is swapped out by the host. */
+ apf_task_wait(current, token);
+ break;
+ case KVM_PV_REASON_PAGE_READY:
+ if (unlikely(token == ~0))
+ apf_task_wake_all();
+ else
+ apf_task_wake(token);
+ break;
+ }
+
+ return 1;
+}
+
static void kvm_mmu_op(void *buffer, unsigned len)
{
int r;
@@ -207,6 +362,9 @@ static void __init paravirt_ops_setup(void)
if (kvm_para_has_feature(KVM_FEATURE_NOP_IO_DELAY))
pv_cpu_ops.io_delay = kvm_io_delay;

+ if (kvm_para_has_feature(KVM_FEATURE_ASYNC_PF))
+ pv_cpu_ops.handle_pf = kvm_handle_pf;
+
if (kvm_para_has_feature(KVM_FEATURE_MMU_OP)) {
pv_mmu_ops.set_pte = kvm_set_pte;
pv_mmu_ops.set_pte_at = kvm_set_pte_at;
@@ -270,11 +428,15 @@ static void __init kvm_smp_prepare_boot_cpu(void)

void __init kvm_guest_init(void)
{
+ int i;
+
if (!kvm_para_available())
return;

paravirt_ops_setup();
register_reboot_notifier(&kvm_pv_reboot_nb);
+ for (i = 0; i < KVM_TASK_SLEEP_HASHSIZE; i++)
+ spin_lock_init(&async_pf_sleepers[i].lock);
#ifdef CONFIG_SMP
smp_ops.smp_prepare_boot_cpu = kvm_smp_prepare_boot_cpu;
#else
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index 1b1739d..7d8f37b 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -54,6 +54,11 @@ u64 _paravirt_ident_64(u64 x)
return x;
}

+unsigned long _paravirt_ret_0(void)
+{
+ return 0;
+}
+
void __init default_banner(void)
{
printk(KERN_INFO "Booting paravirtualized kernel on %s\n",
@@ -154,6 +159,8 @@ unsigned paravirt_patch_default(u8 type, u16 clobbers, void *insnbuf,
ret = paravirt_patch_ident_32(insnbuf, len);
else if (opfunc == _paravirt_ident_64)
ret = paravirt_patch_ident_64(insnbuf, len);
+ else if (opfunc == _paravirt_ret_0)
+ ret = paravirt_patch_ret_0(insnbuf, len);

else if (type == PARAVIRT_PATCH(pv_cpu_ops.iret) ||
type == PARAVIRT_PATCH(pv_cpu_ops.irq_enable_sysexit) ||
@@ -380,6 +387,7 @@ struct pv_cpu_ops pv_cpu_ops = {

.start_context_switch = paravirt_nop,
.end_context_switch = paravirt_nop,
+ .handle_pf = paravirt_ret_0,
};

struct pv_apic_ops pv_apic_ops = {
diff --git a/arch/x86/kernel/paravirt_patch_32.c b/arch/x86/kernel/paravirt_patch_32.c
index d9f32e6..de006b1 100644
--- a/arch/x86/kernel/paravirt_patch_32.c
+++ b/arch/x86/kernel/paravirt_patch_32.c
@@ -12,6 +12,8 @@ DEF_NATIVE(pv_mmu_ops, read_cr3, "mov %cr3, %eax");
DEF_NATIVE(pv_cpu_ops, clts, "clts");
DEF_NATIVE(pv_cpu_ops, read_tsc, "rdtsc");

+DEF_NATIVE(, mov0, "xor %eax, %eax");
+
unsigned paravirt_patch_ident_32(void *insnbuf, unsigned len)
{
/* arg in %eax, return in %eax */
@@ -24,6 +26,12 @@ unsigned paravirt_patch_ident_64(void *insnbuf, unsigned len)
return 0;
}

+unsigned paravirt_patch_ret_0(void *insnbuf, unsigned len)
+{
+ return paravirt_patch_insns(insnbuf, len,
+ start__mov0, end__mov0);
+}
+
unsigned native_patch(u8 type, u16 clobbers, void *ibuf,
unsigned long addr, unsigned len)
{
diff --git a/arch/x86/kernel/paravirt_patch_64.c b/arch/x86/kernel/paravirt_patch_64.c
index 3f08f34..d685e7d 100644
--- a/arch/x86/kernel/paravirt_patch_64.c
+++ b/arch/x86/kernel/paravirt_patch_64.c
@@ -21,6 +21,7 @@ DEF_NATIVE(pv_cpu_ops, swapgs, "swapgs");

DEF_NATIVE(, mov32, "mov %edi, %eax");
DEF_NATIVE(, mov64, "mov %rdi, %rax");
+DEF_NATIVE(, mov0, "xor %rax, %rax");

unsigned paravirt_patch_ident_32(void *insnbuf, unsigned len)
{
@@ -34,6 +35,12 @@ unsigned paravirt_patch_ident_64(void *insnbuf, unsigned len)
start__mov64, end__mov64);
}

+unsigned paravirt_patch_ret_0(void *insnbuf, unsigned len)
+{
+ return paravirt_patch_insns(insnbuf, len,
+ start__mov0, end__mov0);
+}
+
unsigned native_patch(u8 type, u16 clobbers, void *ibuf,
unsigned long addr, unsigned len)
{
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index f627779..7abc3ee 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -953,6 +953,9 @@ do_page_fault(struct pt_regs *regs, unsigned long error_code)
int write;
int fault;

+ if (handle_page_fault(regs, error_code))
+ return;
+
tsk = current;
mm = tsk->mm;

--
1.6.5

2010-01-05 14:14:52

by Gleb Natapov

Subject: [PATCH v3 06/12] Add get_user_pages() variant that fails if major fault is required.

This patch adds a get_user_pages() variant that only succeeds if getting
a reference to a page doesn't require a major fault.

Reviewed-by: Rik van Riel <[email protected]>
Signed-off-by: Gleb Natapov <[email protected]>
---
fs/ncpfs/mmap.c | 2 ++
include/linux/mm.h | 5 +++++
mm/filemap.c | 3 +++
mm/memory.c | 31 ++++++++++++++++++++++++++++---
mm/shmem.c | 8 +++++++-
5 files changed, 45 insertions(+), 4 deletions(-)

diff --git a/fs/ncpfs/mmap.c b/fs/ncpfs/mmap.c
index 15458de..338527e 100644
--- a/fs/ncpfs/mmap.c
+++ b/fs/ncpfs/mmap.c
@@ -39,6 +39,8 @@ static int ncp_file_mmap_fault(struct vm_area_struct *area,
int bufsize;
int pos; /* XXX: loff_t ? */

+ if (vmf->flags & FAULT_FLAG_MINOR)
+ return VM_FAULT_MAJOR | VM_FAULT_ERROR;
/*
* ncpfs has nothing against high pages as long
* as recvmsg and memset works on it
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 2265f28..32e9c6e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -136,6 +136,7 @@ extern pgprot_t protection_map[16];
#define FAULT_FLAG_WRITE 0x01 /* Fault was a write access */
#define FAULT_FLAG_NONLINEAR 0x02 /* Fault was via a nonlinear mapping */
#define FAULT_FLAG_MKWRITE 0x04 /* Fault was mkwrite of existing pte */
+#define FAULT_FLAG_MINOR 0x08 /* Do only minor fault */

/*
* This interface is used by x86 PAT code to identify a pfn mapping that is
@@ -836,6 +837,9 @@ extern int access_process_vm(struct task_struct *tsk, unsigned long addr, void *
int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
unsigned long start, int nr_pages, int write, int force,
struct page **pages, struct vm_area_struct **vmas);
+int get_user_pages_noio(struct task_struct *tsk, struct mm_struct *mm,
+ unsigned long start, int nr_pages, int write, int force,
+ struct page **pages, struct vm_area_struct **vmas);
int get_user_pages_fast(unsigned long start, int nr_pages, int write,
struct page **pages);
struct page *get_dump_page(unsigned long addr);
@@ -1257,6 +1261,7 @@ struct page *follow_page(struct vm_area_struct *, unsigned long address,
#define FOLL_GET 0x04 /* do get_page on page */
#define FOLL_DUMP 0x08 /* give error on hole if it would be zero */
#define FOLL_FORCE 0x10 /* get_user_pages read/write w/o permission */
+#define FOLL_MINOR 0x20 /* do only minor page faults */

typedef int (*pte_fn_t)(pte_t *pte, pgtable_t token, unsigned long addr,
void *data);
diff --git a/mm/filemap.c b/mm/filemap.c
index 96ac6b0..25768ad 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1509,6 +1509,9 @@ int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
goto no_cached_page;
}
} else {
+ if (vmf->flags & FAULT_FLAG_MINOR)
+ return VM_FAULT_MAJOR | VM_FAULT_ERROR;
+
/* No page in the page cache at all */
do_sync_mmap_readahead(vma, ra, file, offset);
count_vm_event(PGMAJFAULT);
diff --git a/mm/memory.c b/mm/memory.c
index 09e4b1b..89e0074 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1336,10 +1336,13 @@ int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
cond_resched();
while (!(page = follow_page(vma, start, foll_flags))) {
int ret;
+ unsigned int fault_fl =
+ ((foll_flags & FOLL_WRITE) ?
+ FAULT_FLAG_WRITE : 0) |
+ ((foll_flags & FOLL_MINOR) ?
+ FAULT_FLAG_MINOR : 0);

- ret = handle_mm_fault(mm, vma, start,
- (foll_flags & FOLL_WRITE) ?
- FAULT_FLAG_WRITE : 0);
+ ret = handle_mm_fault(mm, vma, start, fault_fl);

if (ret & VM_FAULT_ERROR) {
if (ret & VM_FAULT_OOM)
@@ -1347,6 +1350,8 @@ int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
if (ret &
(VM_FAULT_HWPOISON|VM_FAULT_SIGBUS))
return i ? i : -EFAULT;
+ else if (ret & VM_FAULT_MAJOR)
+ return i ? i : -EFAULT;
BUG();
}
if (ret & VM_FAULT_MAJOR)
@@ -1457,6 +1462,23 @@ int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
}
EXPORT_SYMBOL(get_user_pages);

+int get_user_pages_noio(struct task_struct *tsk, struct mm_struct *mm,
+ unsigned long start, int nr_pages, int write, int force,
+ struct page **pages, struct vm_area_struct **vmas)
+{
+ int flags = FOLL_TOUCH | FOLL_MINOR;
+
+ if (pages)
+ flags |= FOLL_GET;
+ if (write)
+ flags |= FOLL_WRITE;
+ if (force)
+ flags |= FOLL_FORCE;
+
+ return __get_user_pages(tsk, mm, start, nr_pages, flags, pages, vmas);
+}
+EXPORT_SYMBOL(get_user_pages_noio);
+
/**
* get_dump_page() - pin user page in memory while writing it to core dump
* @addr: user address
@@ -2536,6 +2558,9 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
delayacct_set_flag(DELAYACCT_PF_SWAPIN);
page = lookup_swap_cache(entry);
if (!page) {
+ if (flags & FAULT_FLAG_MINOR)
+ return VM_FAULT_MAJOR | VM_FAULT_ERROR;
+
grab_swap_token(mm); /* Contend for token _before_ read-in */
page = swapin_readahead(entry,
GFP_HIGHUSER_MOVABLE, vma, address);
diff --git a/mm/shmem.c b/mm/shmem.c
index eef4ebe..ab3a255 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1225,6 +1225,7 @@ static int shmem_getpage(struct inode *inode, unsigned long idx,
swp_entry_t swap;
gfp_t gfp;
int error;
+ int flags = type ? *type : 0;

if (idx >= SHMEM_MAX_INDEX)
return -EFBIG;
@@ -1273,6 +1274,11 @@ repeat:
swappage = lookup_swap_cache(swap);
if (!swappage) {
shmem_swp_unmap(entry);
+ if (flags & FAULT_FLAG_MINOR) {
+ spin_unlock(&info->lock);
+ *type = VM_FAULT_MAJOR | VM_FAULT_ERROR;
+ goto failed;
+ }
/* here we actually do the io */
if (type && !(*type & VM_FAULT_MAJOR)) {
__count_vm_event(PGMAJFAULT);
@@ -1481,7 +1487,7 @@ static int shmem_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
{
struct inode *inode = vma->vm_file->f_path.dentry->d_inode;
int error;
- int ret;
+ int ret = (int)vmf->flags;

if (((loff_t)vmf->pgoff << PAGE_CACHE_SHIFT) >= i_size_read(inode))
return VM_FAULT_SIGBUS;
--
1.6.5

2010-01-05 14:13:39

by Gleb Natapov

Subject: [PATCH v3 07/12] Maintain memslot version number

Code that depends on a particular memslot layout can track changes and
adjust to the new layout.
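
The only consumer so far is the async PF code in patch 8, which caches
the host virtual address of the guest's shared apf area and re-resolves
it when the version changes; condensed from apf_put_user() there:

	if (unlikely(vcpu->arch.apf_memslot_ver != vcpu->kvm->memslot_version)) {
		/* memslot layout changed: redo the gfn -> hva translation */
		addr = gfn_to_hva(vcpu->kvm, gpa >> PAGE_SHIFT);
		if (kvm_is_error_hva(addr))
			return -EFAULT;
		vcpu->arch.apf_data = (u32 __user *)(addr + offset);
	}
	return put_user(val, vcpu->arch.apf_data);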

Signed-off-by: Gleb Natapov <[email protected]>
---
include/linux/kvm_host.h | 1 +
virt/kvm/kvm_main.c | 1 +
2 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 600baf0..3f5ebc2 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -163,6 +163,7 @@ struct kvm {
spinlock_t requests_lock;
struct mutex slots_lock;
struct mm_struct *mm; /* userspace tied to this vm */
+ u32 memslot_version;
struct kvm_memslots *memslots;
struct srcu_struct srcu;
#ifdef CONFIG_KVM_APIC_ARCHITECTURE
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index a5077df..df3325c 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -737,6 +737,7 @@ skip_lpage:
slots->memslots[mem->slot] = new;
old_memslots = kvm->memslots;
rcu_assign_pointer(kvm->memslots, slots);
+ kvm->memslot_version++;
synchronize_srcu_expedited(&kvm->srcu);

kvm_arch_commit_memory_region(kvm, mem, old, user_alloc);
--
1.6.5

2010-01-05 14:14:31

by Gleb Natapov

Subject: [PATCH v3 02/12] Add PV MSR to enable asynchronous page faults delivery.


Signed-off-by: Gleb Natapov <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 3 ++
arch/x86/include/asm/kvm_para.h | 4 +++
arch/x86/kvm/x86.c | 49 +++++++++++++++++++++++++++++++++++++-
include/linux/kvm.h | 1 +
4 files changed, 55 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 741b897..01d3ec4 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -362,6 +362,9 @@ struct kvm_vcpu_arch {
/* used for guest single stepping over the given code position */
u16 singlestep_cs;
unsigned long singlestep_rip;
+
+ u32 __user *apf_data;
+ u64 apf_msr_val;
};

struct kvm_mem_alias {
diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index 5f580f2..f77eed3 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -15,12 +15,16 @@
#define KVM_FEATURE_CLOCKSOURCE 0
#define KVM_FEATURE_NOP_IO_DELAY 1
#define KVM_FEATURE_MMU_OP 2
+#define KVM_FEATURE_ASYNC_PF 3

#define MSR_KVM_WALL_CLOCK 0x11
#define MSR_KVM_SYSTEM_TIME 0x12
+#define MSR_KVM_ASYNC_PF_EN 0x4b564d00

#define KVM_MAX_MMU_OP_BATCH 32

+#define KVM_ASYNC_PF_ENABLED (1 << 0)
+
/* Operations for KVM_HC_MMU_OP */
#define KVM_MMU_OP_WRITE_PTE 1
#define KVM_MMU_OP_FLUSH_TLB 2
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index adc8597..f6821b9 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -620,9 +620,9 @@ static inline u32 bit(int bitno)
* kvm-specific. Those are put in the beginning of the list.
*/

-#define KVM_SAVE_MSRS_BEGIN 2
+#define KVM_SAVE_MSRS_BEGIN 3
static u32 msrs_to_save[] = {
- MSR_KVM_SYSTEM_TIME, MSR_KVM_WALL_CLOCK,
+ MSR_KVM_SYSTEM_TIME, MSR_KVM_WALL_CLOCK, MSR_KVM_ASYNC_PF_EN,
MSR_IA32_SYSENTER_CS, MSR_IA32_SYSENTER_ESP, MSR_IA32_SYSENTER_EIP,
MSR_K6_STAR,
#ifdef CONFIG_X86_64
@@ -1003,6 +1003,37 @@ out:
return r;
}

+static int kvm_pv_enable_async_pf(struct kvm_vcpu *vcpu, u64 data)
+{
+ u64 gpa = data & ~0x3f;
+ int offset = offset_in_page(gpa);
+ unsigned long addr;
+
+ /* Bits 1:5 are resrved, Should be zero */
+ if (data & 0x3e)
+ return 1;
+
+ vcpu->arch.apf_msr_val = data;
+
+ if (!(data & KVM_ASYNC_PF_ENABLED)) {
+ vcpu->arch.apf_data = NULL;
+ return 0;
+ }
+
+ addr = gfn_to_hva(vcpu->kvm, gpa >> PAGE_SHIFT);
+ if (kvm_is_error_hva(addr))
+ return 1;
+
+ vcpu->arch.apf_data = (u32 __user*)(addr + offset);
+
+ /* check if address is mapped */
+ if (get_user(offset, vcpu->arch.apf_data)) {
+ vcpu->arch.apf_data = NULL;
+ return 1;
+ }
+ return 0;
+}
+
int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data)
{
switch (msr) {
@@ -1083,6 +1114,10 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data)
kvm_request_guest_time_update(vcpu);
break;
}
+ case MSR_KVM_ASYNC_PF_EN:
+ if (kvm_pv_enable_async_pf(vcpu, data))
+ return 1;
+ break;
case MSR_IA32_MCG_CTL:
case MSR_IA32_MCG_STATUS:
case MSR_IA32_MC0_CTL ... MSR_IA32_MC0_CTL + 4 * KVM_MAX_MCE_BANKS - 1:
@@ -1275,6 +1310,9 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata)
case MSR_KVM_SYSTEM_TIME:
data = vcpu->arch.time;
break;
+ case MSR_KVM_ASYNC_PF_EN:
+ data = vcpu->arch.apf_msr_val;
+ break;
case MSR_IA32_P5_MC_ADDR:
case MSR_IA32_P5_MC_TYPE:
case MSR_IA32_MCG_CAP:
@@ -1397,6 +1435,7 @@ int kvm_dev_ioctl_check_extension(long ext)
case KVM_CAP_XEN_HVM:
case KVM_CAP_ADJUST_CLOCK:
case KVM_CAP_VCPU_EVENTS:
+ case KVM_CAP_ASYNC_PF:
r = 1;
break;
case KVM_CAP_COALESCED_MMIO:
@@ -5117,6 +5156,9 @@ free_vcpu:

void kvm_arch_vcpu_destroy(struct kvm_vcpu *vcpu)
{
+ vcpu->arch.apf_data = NULL;
+ vcpu->arch.apf_msr_val = 0;
+
vcpu_load(vcpu);
kvm_mmu_unload(vcpu);
vcpu_put(vcpu);
@@ -5134,6 +5176,9 @@ int kvm_arch_vcpu_reset(struct kvm_vcpu *vcpu)
vcpu->arch.dr6 = DR6_FIXED_1;
vcpu->arch.dr7 = DR7_FIXED_1;

+ vcpu->arch.apf_data = NULL;
+ vcpu->arch.apf_msr_val = 0;
+
return kvm_x86_ops->vcpu_reset(vcpu);
}

diff --git a/include/linux/kvm.h b/include/linux/kvm.h
index f2feef6..85a7161 100644
--- a/include/linux/kvm.h
+++ b/include/linux/kvm.h
@@ -497,6 +497,7 @@ struct kvm_ioeventfd {
#endif
#define KVM_CAP_S390_PSW 42
#define KVM_CAP_PPC_SEGSTATE 43
+#define KVM_CAP_ASYNC_PF 44

#ifdef KVM_CAP_IRQ_ROUTING

--
1.6.5

2010-01-05 14:16:01

by Gleb Natapov

Subject: [PATCH v3 05/12] Export __get_user_pages_fast.

KVM will use it to try to find a page without falling back to the slow
gup path. That is why get_user_pages_fast() is not enough.
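
Its caller in this series is gfn_to_pfn_async() (patch 8), which first
tries the fast walk and only then takes mmap_sem for the no-IO slow path;
if both fail, the page needs real IO and the fault is handled
asynchronously. Condensed (the VM_PFNMAP special case is omitted):

	npages = __get_user_pages_fast(addr, 1, 1, page);  /* no mmap_sem, no IO */
	if (npages != 1) {
		down_read(&current->mm->mmap_sem);
		npages = get_user_pages_noio(current, current->mm, addr,
					     1, 1, 0, page, NULL);	/* patch 6 */
		up_read(&current->mm->mmap_sem);
	}
	if (npages != 1)
		return 0;	/* would need IO: do the async fault instead */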

Signed-off-by: Gleb Natapov <[email protected]>
---
arch/x86/mm/gup.c | 2 ++
1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
index 71da1bc..cea0dfe 100644
--- a/arch/x86/mm/gup.c
+++ b/arch/x86/mm/gup.c
@@ -8,6 +8,7 @@
#include <linux/mm.h>
#include <linux/vmstat.h>
#include <linux/highmem.h>
+#include <linux/module.h>

#include <asm/pgtable.h>

@@ -274,6 +275,7 @@ int __get_user_pages_fast(unsigned long start, int nr_pages, int write,

return nr;
}
+EXPORT_SYMBOL_GPL(__get_user_pages_fast);

/**
* get_user_pages_fast() - pin user pages in memory
--
1.6.5

2010-01-05 14:16:18

by Gleb Natapov

Subject: [PATCH v3 01/12] Move kvm_smp_prepare_boot_cpu() from kvmclock.c to kvm.c.

Async PF initialization also needs to hook into smp_prepare_boot_cpu(),
so move the hook into generic code.

Signed-off-by: Gleb Natapov <[email protected]>
---
arch/x86/include/asm/kvm_para.h | 1 +
arch/x86/kernel/kvm.c | 11 +++++++++++
arch/x86/kernel/kvmclock.c | 13 +------------
3 files changed, 13 insertions(+), 12 deletions(-)

diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index c584076..5f580f2 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -51,6 +51,7 @@ struct kvm_mmu_op_release_pt {
#include <asm/processor.h>

extern void kvmclock_init(void);
+extern int kvm_register_clock(char *txt);


/* This instruction is vmcall. On non-VT architectures, it will generate a
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 63b0ec8..e6db179 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -231,10 +231,21 @@ static void __init paravirt_ops_setup(void)
#endif
}

+#ifdef CONFIG_SMP
+static void __init kvm_smp_prepare_boot_cpu(void)
+{
+ WARN_ON(kvm_register_clock("primary cpu clock"));
+ native_smp_prepare_boot_cpu();
+}
+#endif
+
void __init kvm_guest_init(void)
{
if (!kvm_para_available())
return;

paravirt_ops_setup();
+#ifdef CONFIG_SMP
+ smp_ops.smp_prepare_boot_cpu = kvm_smp_prepare_boot_cpu;
+#endif
}
diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index feaeb0d..6ab9622 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -122,7 +122,7 @@ static struct clocksource kvm_clock = {
.flags = CLOCK_SOURCE_IS_CONTINUOUS,
};

-static int kvm_register_clock(char *txt)
+int kvm_register_clock(char *txt)
{
int cpu = smp_processor_id();
int low, high;
@@ -146,14 +146,6 @@ static void __cpuinit kvm_setup_secondary_clock(void)
}
#endif

-#ifdef CONFIG_SMP
-static void __init kvm_smp_prepare_boot_cpu(void)
-{
- WARN_ON(kvm_register_clock("primary cpu clock"));
- native_smp_prepare_boot_cpu();
-}
-#endif
-
/*
* After the clock is registered, the host will keep writing to the
* registered memory location. If the guest happens to shutdown, this memory
@@ -192,9 +184,6 @@ void __init kvmclock_init(void)
x86_cpuinit.setup_percpu_clockev =
kvm_setup_secondary_clock;
#endif
-#ifdef CONFIG_SMP
- smp_ops.smp_prepare_boot_cpu = kvm_smp_prepare_boot_cpu;
-#endif
machine_ops.shutdown = kvm_shutdown;
#ifdef CONFIG_KEXEC
machine_ops.crash_shutdown = kvm_crash_shutdown;
--
1.6.5

2010-01-05 14:16:44

by Gleb Natapov

[permalink] [raw]
Subject: [PATCH v3 09/12] Retry fault before vmentry

When a page is swapped in, it is mapped into guest memory only after the
guest tries to access it again and generates another fault. To save this
extra fault we can map the page immediately, since we know the guest is
going to access it.

Signed-off-by: Gleb Natapov <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 7 ++++++-
arch/x86/kvm/mmu.c | 27 ++++++++++++++++++++-------
arch/x86/kvm/paging_tmpl.h | 37 +++++++++++++++++++++++++++++++++----
arch/x86/kvm/x86.c | 9 +++++++++
virt/kvm/kvm_main.c | 2 ++
5 files changed, 70 insertions(+), 12 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 641943e..43c1aca 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -241,7 +241,8 @@ struct kvm_pio_request {
*/
struct kvm_mmu {
void (*new_cr3)(struct kvm_vcpu *vcpu);
- int (*page_fault)(struct kvm_vcpu *vcpu, gva_t gva, u32 err);
+ int (*page_fault)(struct kvm_vcpu *vcpu, gva_t gva, u32 err, bool sync);
+ int (*page_fault_other_cr3)(struct kvm_vcpu *vcpu, gpa_t cr3, gva_t gva, u32 err);
void (*free)(struct kvm_vcpu *vcpu);
gpa_t (*gva_to_gpa)(struct kvm_vcpu *vcpu, gva_t gva);
void (*prefetch_page)(struct kvm_vcpu *vcpu,
@@ -541,6 +542,8 @@ struct kvm_x86_ops {

struct kvm_arch_async_pf {
u32 token;
+ gpa_t cr3;
+ u32 error_code;
};

extern struct kvm_x86_ops *kvm_x86_ops;
@@ -827,6 +830,8 @@ void kvm_arch_inject_async_page_not_present(struct kvm_vcpu *vcpu,
struct kvm_async_pf *work);
void kvm_arch_inject_async_page_present(struct kvm_vcpu *vcpu,
struct kvm_async_pf *work);
+void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu,
+ struct kvm_async_pf *work);
bool kvm_arch_can_inject_async_page_present(struct kvm_vcpu *vcpu);
#endif /* _ASM_X86_KVM_HOST_H */

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 7214f28..9fd29cb 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2170,7 +2170,7 @@ static gpa_t nonpaging_gva_to_gpa(struct kvm_vcpu *vcpu, gva_t vaddr)
}

static int nonpaging_page_fault(struct kvm_vcpu *vcpu, gva_t gva,
- u32 error_code)
+ u32 error_code, bool sync)
{
gfn_t gfn;
int r;
@@ -2189,10 +2189,13 @@ static int nonpaging_page_fault(struct kvm_vcpu *vcpu, gva_t gva,
error_code & PFERR_WRITE_MASK, gfn);
}

-int kvm_arch_setup_async_pf(struct kvm_vcpu *vcpu, gva_t gva, gfn_t gfn)
+int kvm_arch_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr3, gva_t gva,
+ gfn_t gfn, u32 error_code)
{
struct kvm_arch_async_pf arch;
arch.token = (vcpu->arch.async_pf_id++ << 12) | vcpu->vcpu_id;
+ arch.cr3 = cr3;
+ arch.error_code = error_code;
return kvm_setup_async_pf(vcpu, gva, gfn, &arch);
}

@@ -2204,8 +2207,8 @@ static bool can_do_async_pf(struct kvm_vcpu *vcpu)
return !!kvm_x86_ops->get_cpl(vcpu);
}

-static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t gpa,
- u32 error_code)
+static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t gpa, u32 error_code,
+ bool sync)
{
pfn_t pfn;
int r;
@@ -2227,7 +2230,7 @@ static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t gpa,
mmu_seq = vcpu->kvm->mmu_notifier_seq;
smp_rmb();

- if (can_do_async_pf(vcpu)) {
+ if (!sync && can_do_async_pf(vcpu)) {
r = gfn_to_pfn_async(vcpu->kvm, gfn, &pfn);
trace_kvm_try_async_get_page(r, pfn);
} else {
@@ -2237,7 +2240,8 @@ do_sync:
}

if (!r) {
- if (!kvm_arch_setup_async_pf(vcpu, gpa, gfn))
+ if (!kvm_arch_setup_async_pf(vcpu, vcpu->arch.cr3, gpa, gfn,
+ error_code))
goto do_sync;
return 0;
}
@@ -2263,6 +2267,12 @@ out_unlock:
return 0;
}

+static int tdp_page_fault_sync(struct kvm_vcpu *vcpu, gpa_t cr3, gva_t gpa,
+ u32 error_code)
+{
+ return tdp_page_fault(vcpu, gpa, error_code, true);
+}
+
static void nonpaging_free(struct kvm_vcpu *vcpu)
{
mmu_free_roots(vcpu);
@@ -2387,6 +2397,7 @@ static int paging64_init_context_common(struct kvm_vcpu *vcpu, int level)
ASSERT(is_pae(vcpu));
context->new_cr3 = paging_new_cr3;
context->page_fault = paging64_page_fault;
+ context->page_fault_other_cr3 = paging64_page_fault_other_cr3;
context->gva_to_gpa = paging64_gva_to_gpa;
context->prefetch_page = paging64_prefetch_page;
context->sync_page = paging64_sync_page;
@@ -2411,6 +2422,7 @@ static int paging32_init_context(struct kvm_vcpu *vcpu)
reset_rsvds_bits_mask(vcpu, PT32_ROOT_LEVEL);
context->new_cr3 = paging_new_cr3;
context->page_fault = paging32_page_fault;
+ context->page_fault_other_cr3 = paging32_page_fault_other_cr3;
context->gva_to_gpa = paging32_gva_to_gpa;
context->free = paging_free;
context->prefetch_page = paging32_prefetch_page;
@@ -2434,6 +2446,7 @@ static int init_kvm_tdp_mmu(struct kvm_vcpu *vcpu)

context->new_cr3 = nonpaging_new_cr3;
context->page_fault = tdp_page_fault;
+ context->page_fault_other_cr3 = tdp_page_fault_sync;
context->free = nonpaging_free;
context->prefetch_page = nonpaging_prefetch_page;
context->sync_page = nonpaging_sync_page;
@@ -2807,7 +2820,7 @@ int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gva_t cr2, u32 error_code)
int r;
enum emulation_result er;

- r = vcpu->arch.mmu.page_fault(vcpu, cr2, error_code);
+ r = vcpu->arch.mmu.page_fault(vcpu, cr2, error_code, false);
if (r < 0)
goto out;

diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 1b2c605..c2c2f34 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -375,8 +375,8 @@ static u64 *FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
* Returns: 1 if we need to emulate the instruction, 0 otherwise, or
* a negative value on error.
*/
-static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t addr,
- u32 error_code)
+static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t addr, u32 error_code,
+ bool sync)
{
int write_fault = error_code & PFERR_WRITE_MASK;
int user_fault = error_code & PFERR_USER_MASK;
@@ -420,7 +420,7 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t addr,
mmu_seq = vcpu->kvm->mmu_notifier_seq;
smp_rmb();

- if (can_do_async_pf(vcpu)) {
+ if (!sync && can_do_async_pf(vcpu)) {
r = gfn_to_pfn_async(vcpu->kvm, walker.gfn, &pfn);
trace_kvm_try_async_get_page(r, pfn);
} else {
@@ -430,7 +430,8 @@ do_sync:
}

if (!r) {
- if (!kvm_arch_setup_async_pf(vcpu, addr, walker.gfn))
+ if (!kvm_arch_setup_async_pf(vcpu, vcpu->arch.cr3, addr,
+ walker.gfn, error_code))
goto do_sync;
return 0;
}
@@ -466,6 +467,34 @@ out_unlock:
return 0;
}

+static int FNAME(page_fault_other_cr3)(struct kvm_vcpu *vcpu, gpa_t cr3,
+ gva_t addr, u32 error_code)
+{
+ int r = 0;
+ gpa_t curr_cr3 = vcpu->arch.cr3;
+
+ if (curr_cr3 != cr3) {
+ /*
+ * We do page fault on behalf of a process that is sleeping
+ * because of async PF. PV guest takes reference to mm that cr3
+ * belongs to, so it has to be valid here.
+ */
+ kvm_set_cr3(vcpu, cr3);
+ if (kvm_mmu_reload(vcpu))
+ goto switch_cr3;
+ }
+
+ r = FNAME(page_fault)(vcpu, addr, error_code, true);
+
+switch_cr3:
+ if (curr_cr3 != vcpu->arch.cr3) {
+ kvm_set_cr3(vcpu, curr_cr3);
+ kvm_mmu_reload(vcpu);
+ }
+
+ return r;
+}
+
static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva)
{
struct kvm_shadow_walk_iterator iterator;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index cfef357..e2e33ac 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5483,6 +5483,15 @@ void kvm_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags)
}
EXPORT_SYMBOL_GPL(kvm_set_rflags);

+void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu,
+ struct kvm_async_pf *work)
+{
+ if (!vcpu->arch.mmu.page_fault_other_cr3 || is_error_page(work->page))
+ return;
+ vcpu->arch.mmu.page_fault_other_cr3(vcpu, work->arch.cr3, work->gva,
+ work->arch.error_code);
+}
+
static int apf_put_user(struct kvm_vcpu *vcpu, u32 val)
{
if (unlikely(vcpu->arch.apf_memslot_ver !=
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 3552be0..ca5cd7e 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1322,6 +1322,7 @@ void kvm_check_async_pf_completion(struct kvm_vcpu *vcpu)
spin_lock(&vcpu->async_pf_lock);
list_del(&work->link);
spin_unlock(&vcpu->async_pf_lock);
+ kvm_arch_async_page_ready(vcpu, work);
put_page(work->page);
async_pf_work_free(work);
}
@@ -1336,6 +1337,7 @@ void kvm_check_async_pf_completion(struct kvm_vcpu *vcpu)
list_del(&work->link);
spin_unlock(&vcpu->async_pf_lock);

+ kvm_arch_async_page_ready(vcpu, work);
kvm_arch_inject_async_page_present(vcpu, work);

put_page(work->page);
--
1.6.5

2010-01-05 14:17:13

by Gleb Natapov

[permalink] [raw]
Subject: [PATCH v3 12/12] Send async PF when guest is not in userspace too.


Signed-off-by: Gleb Natapov <[email protected]>
---
arch/x86/kvm/mmu.c | 8 +++++++-
1 files changed, 7 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 9fd29cb..7945abf 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2204,7 +2204,13 @@ static bool can_do_async_pf(struct kvm_vcpu *vcpu)
if (!vcpu->arch.apf_data || kvm_event_needs_reinjection(vcpu))
return false;

- return !!kvm_x86_ops->get_cpl(vcpu);
+ if (vcpu->arch.apf_send_user_only)
+ return !!kvm_x86_ops->get_cpl(vcpu);
+
+ if (!kvm_x86_ops->interrupt_allowed(vcpu))
+ return false;
+
+ return true;
}

static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t gpa, u32 error_code,
--
1.6.5

2010-01-05 15:05:56

by Jun Koi

[permalink] [raw]
Subject: Re: [PATCH v3 00/12] KVM: Add host swap event notifications for PV guest

On Tue, Jan 5, 2010 at 11:12 PM, Gleb Natapov <[email protected]> wrote:
> KVM virtualizes guest memory by means of shadow pages or HW assistance
> like NPT/EPT. Not all memory used by a guest is mapped into the guest
> address space or even present in a host memory at any given time.
> When vcpu tries to access memory page that is not mapped into the guest
> address space KVM is notified about it. KVM maps the page into the guest
> address space and resumes vcpu execution. If the page is swapped out
> from host memory vcpu execution is suspended till the page is not swapped
> into the memory again. This is inefficient since vcpu can do other work
> (run other task or serve interrupts) while page gets swapped in.
>
> To overcome this inefficiency this patch series implements "asynchronous
> page fault" for paravirtualized KVM guests. If a page that vcpu is
> trying to access is swapped out KVM sends an async PF to the vcpu
> and continues vcpu execution. Requested page is swapped in by another
> thread in parallel. When vcpu gets async PF it puts faulted task to
> sleep until "wake up" interrupt is delivered. When the page is brought
> to the host memory KVM sends "wake up" interrupt and the guest's task
> resumes execution.
>

Is it true that to make this work, we will need a (PV) kernel driver
for each guest OS (Windows, Linux, ...)?

Thanks,
Jun

2010-01-05 16:00:51

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH v3 00/12] KVM: Add host swap event notifications for PV guest

On 01/05/2010 10:05 AM, Jun Koi wrote:
> On Tue, Jan 5, 2010 at 11:12 PM, Gleb Natapov<[email protected]> wrote:
>> KVM virtualizes guest memory by means of shadow pages or HW assistance
>> like NPT/EPT. Not all memory used by a guest is mapped into the guest
>> address space or even present in a host memory at any given time.
>> When vcpu tries to access memory page that is not mapped into the guest
>> address space KVM is notified about it. KVM maps the page into the guest
>> address space and resumes vcpu execution. If the page is swapped out
>> from host memory vcpu execution is suspended till the page is not swapped
>> into the memory again. This is inefficient since vcpu can do other work
>> (run other task or serve interrupts) while page gets swapped in.
>>
>> To overcome this inefficiency this patch series implements "asynchronous
>> page fault" for paravirtualized KVM guests. If a page that vcpu is
>> trying to access is swapped out KVM sends an async PF to the vcpu
>> and continues vcpu execution. Requested page is swapped in by another
>> thread in parallel. When vcpu gets async PF it puts faulted task to
>> sleep until "wake up" interrupt is delivered. When the page is brought
>> to the host memory KVM sends "wake up" interrupt and the guest's task
>> resumes execution.
>>
>
> Is it true that to make this work, we will need a (PV) kernel driver
> for each guest OS (Windows, Linux, ...)?

This patch series contains the guest kernel code for Linux,
as well as the host side code.

--
All rights reversed.

2010-01-05 16:05:34

by Avi Kivity

[permalink] [raw]
Subject: Re: [PATCH v3 00/12] KVM: Add host swap event notifications for PV guest

On 01/05/2010 05:05 PM, Jun Koi wrote:
> Is it true that to make this work, we will need a (PV) kernel driver
> for each guest OS (Windows, Linux, ...)?
>
>

It's partially usable even without guest modifications; while servicing
a host page fault we can still deliver interrupts to the guest (which
might cause a context switch and thus further progress to be made).

--
error compiling committee.c: too many arguments to function

2010-01-06 10:19:32

by Jun Koi

[permalink] [raw]
Subject: Re: [PATCH v3 00/12] KVM: Add host swap event notifications for PV guest

On Wed, Jan 6, 2010 at 1:04 AM, Avi Kivity <[email protected]> wrote:
> On 01/05/2010 05:05 PM, Jun Koi wrote:
>>
>> Is it true that to make this work, we will need a (PV) kernel driver
>> for each guest OS (Windows, Linux, ...)?
>>
>>
>
> It's partially usable even without guest modifications; while servicing a
> host page fault we can still deliver interrupts to the guest (which might
> cause a context switch and thus further progress to be made).

Let's say the guest has no PV driver. When we find that a guest page is
swapped out, we can send a page fault to the guest to trick it into
loading that page in. And we don't need the driver at all.

Is that a reasonable solution?

Thanks,
J

2010-01-06 10:22:10

by Gleb Natapov

[permalink] [raw]
Subject: Re: [PATCH v3 00/12] KVM: Add host swap event notifications for PV guest

On Wed, Jan 06, 2010 at 07:17:30PM +0900, Jun Koi wrote:
> On Wed, Jan 6, 2010 at 1:04 AM, Avi Kivity <[email protected]> wrote:
> > On 01/05/2010 05:05 PM, Jun Koi wrote:
> >>
> >> Is it true that to make this work, we will need a (PV) kernel driver
> >> for each guest OS (Windows, Linux, ...)?
> >>
> >>
> >
> > It's partially usable even without guest modifications; while servicing a
> > host page fault we can still deliver interrupts to the guest (which might
> > cause a context switch and thus further progress to be made).
>
> Lets say, in the case the guest has no PV driver. When we find that a
> guest page is swapped out, we can send a pagefault
> to the guest to trick it to load that page in. And we dont need the
> driver at all.
>
It's not the guest that should load the page. From the guest's point of view
the page is already in memory.

> Is that a reasonable solution?
>
> Thanks,
> J

--
Gleb.

2010-01-08 16:20:26

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: [PATCH v3 00/12] KVM: Add host swap event notifications for PV guest

On Tue, Jan 05, 2010 at 04:12:42PM +0200, Gleb Natapov wrote:
> KVM virtualizes guest memory by means of shadow pages or HW assistance
> like NPT/EPT. Not all memory used by a guest is mapped into the guest
> address space or even present in a host memory at any given time.
> When vcpu tries to access memory page that is not mapped into the guest
> address space KVM is notified about it. KVM maps the page into the guest
> address space and resumes vcpu execution. If the page is swapped out
> from host memory vcpu execution is suspended till the page is not swapped
> into the memory again. This is inefficient since vcpu can do other work
> (run other task or serve interrupts) while page gets swapped in.
>
> To overcome this inefficiency this patch series implements "asynchronous
> page fault" for paravirtualized KVM guests. If a page that vcpu is
> trying to access is swapped out KVM sends an async PF to the vcpu
> and continues vcpu execution. Requested page is swapped in by another
> thread in parallel. When vcpu gets async PF it puts faulted task to
> sleep until "wake up" interrupt is delivered. When the page is brought
> to the host memory KVM sends "wake up" interrupt and the guest's task
> resumes execution.

Some high level comments:

- cr2 used as token: better use the shared region? what if:

async pf queued
guest triple faults without a vmexit
inject async-pf-done with token in cr2

Also, in such scenario, can't you potentially corrupt guest memory after
the triple fault by writing to the previously registered shared region
address?

- The token can overflow relatively easy. Use u64?

- Does it really inject interrupts for non-pv guests while waiting
for swapin? Can't see that. Wish it was more geared towards fv.

- Please share some perf numbers.

- Limit the number of queued async pf's per guest ?

- Unify gfn_to_pfn / gfn_to_pfn_async code in the pf handlers (easier
to review).

2010-01-08 16:48:12

by Gleb Natapov

[permalink] [raw]
Subject: Re: [PATCH v3 00/12] KVM: Add host swap event notifications for PV guest

On Fri, Jan 08, 2010 at 02:18:28PM -0200, Marcelo Tosatti wrote:
> On Tue, Jan 05, 2010 at 04:12:42PM +0200, Gleb Natapov wrote:
> > KVM virtualizes guest memory by means of shadow pages or HW assistance
> > like NPT/EPT. Not all memory used by a guest is mapped into the guest
> > address space or even present in a host memory at any given time.
> > When vcpu tries to access memory page that is not mapped into the guest
> > address space KVM is notified about it. KVM maps the page into the guest
> > address space and resumes vcpu execution. If the page is swapped out
> > from host memory vcpu execution is suspended till the page is not swapped
> > into the memory again. This is inefficient since vcpu can do other work
> > (run other task or serve interrupts) while page gets swapped in.
> >
> > To overcome this inefficiency this patch series implements "asynchronous
> > page fault" for paravirtualized KVM guests. If a page that vcpu is
> > trying to access is swapped out KVM sends an async PF to the vcpu
> > and continues vcpu execution. Requested page is swapped in by another
> > thread in parallel. When vcpu gets async PF it puts faulted task to
> > sleep until "wake up" interrupt is delivered. When the page is brought
> > to the host memory KVM sends "wake up" interrupt and the guest's task
> > resumes execution.
>
> Some high level comments:
>
> - cr2 used as token: better use the shared region? what if:
>
> async pf queued
> guest triple faults without a vmexit
> inject async-pf-done with token in cr2
>
> Also, in such scenario, can't you potentially corrupt guest memory after
> the triple fault by writing to the previously registered shared region
> address?
>
After a triple fault the guest will reboot, and this should clear all pending
async PF injections. I'll check that this indeed happens.

> - The token can overflow relatively easy. Use u64?
Not only can it, it frequently does; but since there can't be 2^20 outstanding
page faults per vcpu simultaneously, this doesn't cause any problem. (The token
is built as (async_pf_id++ << 12) | vcpu_id in a u32, so the rolling id field
is only 20 bits wide to begin with.)

>
> - Does it really inject interrupts for non-pv guests while waiting
> for swapin? Can't see that. Wish it was more geared towards fv.
>
No, it does not yet. I have only started to play with this to see how it can
work.

> - Please share some perf numbers.
>
OK.

> - Limit the number of queued async pf's per guest ?
>
Makes sense.

> - Unify gfn_to_pfn / gfn_to_pfn_async code in the pf handlers (easier
> to review).
OK.

--
Gleb.

2010-01-08 19:26:11

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH v3 00/12] KVM: Add host swap event notifications for PV guest

On 01/08/2010 11:18 AM, Marcelo Tosatti wrote:

> - Limit the number of queued async pf's per guest ?

This is automatically limited to the number of processes
running in a guest :)

--
All rights reversed.

2010-01-08 19:31:18

by Bryan Donlan

[permalink] [raw]
Subject: Re: [PATCH v3 00/12] KVM: Add host swap event notifications for PV guest

On Fri, Jan 8, 2010 at 2:24 PM, Rik van Riel <[email protected]> wrote:
> On 01/08/2010 11:18 AM, Marcelo Tosatti wrote:
>
>> - Limit the number of queued async pf's per guest ?
>
> This is automatically limited to the number of processes
> running in a guest :)

Only if the guest is nice and plays by the rules. What is to stop a
malicious guest from just immediately switching back to userspace and
kicking off another page-in every time it gets an async PF?

2010-01-08 19:56:16

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH v3 00/12] KVM: Add host swap event notifications for PV guest

On 01/08/2010 02:30 PM, Bryan Donlan wrote:
> On Fri, Jan 8, 2010 at 2:24 PM, Rik van Riel<[email protected]> wrote:
>> On 01/08/2010 11:18 AM, Marcelo Tosatti wrote:
>>
>>> - Limit the number of queued async pf's per guest ?
>>
>> This is automatically limited to the number of processes
>> running in a guest :)
>
> Only if the guest is nice and plays by the rules. What is to stop a
> malicious guest from just immediately switching back to userspace and
> kicking off another page-in every time it gets an async PF?

Good point, a malicious guest could do that and ruin
things for everyone. I guess we do need a limit :)

--
All rights reversed.

2010-01-14 17:31:27

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v3 04/12] Add "handle page fault" PV helper.

On Tue, 2010-01-05 at 16:12 +0200, Gleb Natapov wrote:
> Allow paravirtualized guest to do special handling for some page faults.
>
> The patch adds one 'if' to do_page_fault() function. The call is patched
> out when running on physical HW. I ran kernbech on the kernel with and
> without that additional 'if' and result were rawly the same:

So why not program a different handler address for the #PF/#GP faults
and avoid the if all together?

2010-01-17 14:45:10

by Gleb Natapov

[permalink] [raw]
Subject: Re: [PATCH v3 04/12] Add "handle page fault" PV helper.

On Thu, Jan 14, 2010 at 06:31:07PM +0100, Peter Zijlstra wrote:
> On Tue, 2010-01-05 at 16:12 +0200, Gleb Natapov wrote:
> > Allow paravirtualized guest to do special handling for some page faults.
> >
> > The patch adds one 'if' to do_page_fault() function. The call is patched
> > out when running on physical HW. I ran kernbech on the kernel with and
> > without that additional 'if' and result were rawly the same:
>
> So why not program a different handler address for the #PF/#GP faults
> and avoid the if all together?
I would gladly use a fault vector reserved by the x86 architecture, but I am
not sure Intel will be happy about it.

--
Gleb.

2010-01-17 15:09:44

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v3 04/12] Add "handle page fault" PV helper.

On Sun, 2010-01-17 at 16:44 +0200, Gleb Natapov wrote:
> On Thu, Jan 14, 2010 at 06:31:07PM +0100, Peter Zijlstra wrote:
> > On Tue, 2010-01-05 at 16:12 +0200, Gleb Natapov wrote:
> > > Allow paravirtualized guest to do special handling for some page faults.
> > >
> > > The patch adds one 'if' to do_page_fault() function. The call is patched
> > > out when running on physical HW. I ran kernbech on the kernel with and
> > > without that additional 'if' and result were rawly the same:
> >
> > So why not program a different handler address for the #PF/#GP faults
> > and avoid the if all together?
> I would gladly use fault vector reserved by x86 architecture, but I am
> not sure Intel will be happy about it.

Whatever are we doing to end up in do_page_fault() as it stands? Surely
we can tell the CPU to go elsewhere to handle faults?

Isn't that as simple as calling set_intr_gate(14, my_page_fault)
somewhere on the cpuinit instead of the regular page_fault handler?
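
A minimal sketch of what that suggestion amounts to; every name other
than set_intr_gate() is assumed, and a real version would also need a
matching asm entry stub in entry_*.S, a point raised later in the thread.

/* Assumed asm stub for vector 14 that saves registers, passes the error
 * code and ends up in a C handler which checks the PV reason word and
 * falls back to the regular do_page_fault() when it is zero. */
extern void kvm_page_fault(void);

static void __init kvm_install_pf_gate(void)
{
        /* Point the #PF IDT entry at the KVM-aware handler so native
         * kernels never pay for an extra test in do_page_fault(). */
        set_intr_gate(14, &kvm_page_fault);
}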

2010-01-17 15:12:46

by Gleb Natapov

[permalink] [raw]
Subject: Re: [PATCH v3 04/12] Add "handle page fault" PV helper.

On Sun, Jan 17, 2010 at 04:09:40PM +0100, Peter Zijlstra wrote:
> On Sun, 2010-01-17 at 16:44 +0200, Gleb Natapov wrote:
> > On Thu, Jan 14, 2010 at 06:31:07PM +0100, Peter Zijlstra wrote:
> > > On Tue, 2010-01-05 at 16:12 +0200, Gleb Natapov wrote:
> > > > Allow paravirtualized guest to do special handling for some page faults.
> > > >
> > > > The patch adds one 'if' to do_page_fault() function. The call is patched
> > > > out when running on physical HW. I ran kernbech on the kernel with and
> > > > without that additional 'if' and result were rawly the same:
> > >
> > > So why not program a different handler address for the #PF/#GP faults
> > > and avoid the if all together?
> > I would gladly use fault vector reserved by x86 architecture, but I am
> > not sure Intel will be happy about it.
>
> Whatever are we doing to end up in do_page_fault() as it stands? Surely
> we can tell the CPU to go elsewhere to handle faults?
>
> Isn't that as simple as calling set_intr_gate(14, my_page_fault)
> somewhere on the cpuinit instead of the regular page_fault handler?
>
Hmm, good idea. I'll look into that. Thanks.

--
Gleb.

2010-01-17 15:43:50

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v3 06/12] Add get_user_pages() variant that fails if major fault is required.

On Tue, Jan 05, 2010 at 04:12:48PM +0200, Gleb Natapov wrote:
> This patch add get_user_pages() variant that only succeeds if getting
> a reference to a page doesn't require major fault.

>
> +int get_user_pages_noio(struct task_struct *tsk, struct mm_struct *mm,
> + unsigned long start, int nr_pages, int write, int force,
> + struct page **pages, struct vm_area_struct **vmas)
> +{
> + int flags = FOLL_TOUCH | FOLL_MINOR;
> +
> + if (pages)
> + flags |= FOLL_GET;
> + if (write)
> + flags |= FOLL_WRITE;
> + if (force)
> + flags |= FOLL_FORCE;
> +
> + return __get_user_pages(tsk, mm, start, nr_pages, flags, pages, vmas);

Wouldn't it be better to just export __get_user_pages as a proper user
interface, maybe replacing get_user_pages by it entirely?

2010-01-17 15:46:12

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v3 05/12] Export __get_user_pages_fast.

On Tue, Jan 05, 2010 at 04:12:47PM +0200, Gleb Natapov wrote:
> KVM will use it to try and find a page without falling back to slow
> gup. That is why get_user_pages_fast() is not enough.

Btw, it seems like it is currently declared unconditionally in linux/mm.h
but only implemented by x86, and your code using it needs ifdefs for
that. I think you should just introduce a stub that always returns
an error here.
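
Something along these lines in linux/mm.h would do; the config symbol
and exact placement are assumptions for the sketch, the point being
simply a stub that pins nothing so callers fall back to their slow path.

#ifdef CONFIG_HAVE_GET_USER_PAGES_FAST
int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
                          struct page **pages);
#else
/* No fast-gup implementation on this architecture: pin nothing and let
 * the caller take its slow or asynchronous fallback path. */
static inline int __get_user_pages_fast(unsigned long start, int nr_pages,
                                        int write, struct page **pages)
{
        return 0;
}
#endif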

2010-01-18 08:38:17

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH v3 04/12] Add "handle page fault" PV helper.

On 01/17/2010 06:44 AM, Gleb Natapov wrote:
> On Thu, Jan 14, 2010 at 06:31:07PM +0100, Peter Zijlstra wrote:
>> On Tue, 2010-01-05 at 16:12 +0200, Gleb Natapov wrote:
>>> Allow paravirtualized guest to do special handling for some page faults.
>>>
>>> The patch adds one 'if' to do_page_fault() function. The call is patched
>>> out when running on physical HW. I ran kernbech on the kernel with and
>>> without that additional 'if' and result were rawly the same:
>>
>> So why not program a different handler address for the #PF/#GP faults
>> and avoid the if all together?
> I would gladly use fault vector reserved by x86 architecture, but I am
> not sure Intel will be happy about it.
>

That's what it's there for... see Peter Z.'s response.

-hpa

2010-01-18 08:51:17

by Gleb Natapov

[permalink] [raw]
Subject: Re: [PATCH v3 04/12] Add "handle page fault" PV helper.

On Mon, Jan 18, 2010 at 12:34:16AM -0800, H. Peter Anvin wrote:
> On 01/17/2010 06:44 AM, Gleb Natapov wrote:
> >On Thu, Jan 14, 2010 at 06:31:07PM +0100, Peter Zijlstra wrote:
> >>On Tue, 2010-01-05 at 16:12 +0200, Gleb Natapov wrote:
> >>>Allow paravirtualized guest to do special handling for some page faults.
> >>>
> >>>The patch adds one 'if' to do_page_fault() function. The call is patched
> >>>out when running on physical HW. I ran kernbech on the kernel with and
> >>>without that additional 'if' and result were rawly the same:
> >>
> >>So why not program a different handler address for the #PF/#GP faults
> >>and avoid the if all together?
> >I would gladly use fault vector reserved by x86 architecture, but I am
> >not sure Intel will be happy about it.
> >
>
> That's what it's there for... see Peter Z.'s response.
>
Do you mean I can use one of the exception vectors reserved by Intel
(20-31)? What Peter Z. says is that I can register my own handler for
#PF and avoid the 'if' in the non-PV case, as far as I understand him.

--
Gleb.

2010-01-18 09:53:27

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH v3 04/12] Add "handle page fault" PV helper.

Peter Zijlstra <[email protected]> writes:
>
> Whatever are we doing to end up in do_page_fault() as it stands? Surely
> we can tell the CPU to go elsewhere to handle faults?
>
> Isn't that as simple as calling set_intr_gate(14, my_page_fault)
> somewhere on the cpuinit instead of the regular page_fault handler?

That typically requires ugly ifdefs in entry_*.S and could be described
as code obfuscation ("come from").

As long as he avoids a global reference (patch), the 'if' should be
practically free anyway.

-Andi
--
[email protected] -- Speaking for myself only.

2010-01-19 01:59:12

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH v3 04/12] Add "handle page fault" PV helper.

On 01/18/2010 12:50 AM, Gleb Natapov wrote:
> On Mon, Jan 18, 2010 at 12:34:16AM -0800, H. Peter Anvin wrote:
>> On 01/17/2010 06:44 AM, Gleb Natapov wrote:
>>> On Thu, Jan 14, 2010 at 06:31:07PM +0100, Peter Zijlstra wrote:
>>>> On Tue, 2010-01-05 at 16:12 +0200, Gleb Natapov wrote:
>>>>> Allow paravirtualized guest to do special handling for some page faults.
>>>>>
>>>>> The patch adds one 'if' to do_page_fault() function. The call is patched
>>>>> out when running on physical HW. I ran kernbech on the kernel with and
>>>>> without that additional 'if' and result were rawly the same:
>>>>
>>>> So why not program a different handler address for the #PF/#GP faults
>>>> and avoid the if all together?
>>> I would gladly use fault vector reserved by x86 architecture, but I am
>>> not sure Intel will be happy about it.
>>>
>>
>> That's what it's there for... see Peter Z.'s response.
>>
> Do you mean I can use one of exception vectors reserved by Intel
> (20-31)? What Peter Z says is that I can register my own handler for
> #PF and avoid the 'if' in non PV case as far as I understand him.
>

What I mean is that vector 14 is page faults -- that's what it is all
about. Why on Earth do you need another vector?

-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.

2010-01-19 06:56:13

by Gleb Natapov

[permalink] [raw]
Subject: Re: [PATCH v3 04/12] Add "handle page fault" PV helper.

On Mon, Jan 18, 2010 at 05:53:53PM -0800, H. Peter Anvin wrote:
> On 01/18/2010 12:50 AM, Gleb Natapov wrote:
> > On Mon, Jan 18, 2010 at 12:34:16AM -0800, H. Peter Anvin wrote:
> >> On 01/17/2010 06:44 AM, Gleb Natapov wrote:
> >>> On Thu, Jan 14, 2010 at 06:31:07PM +0100, Peter Zijlstra wrote:
> >>>> On Tue, 2010-01-05 at 16:12 +0200, Gleb Natapov wrote:
> >>>>> Allow paravirtualized guest to do special handling for some page faults.
> >>>>>
> >>>>> The patch adds one 'if' to do_page_fault() function. The call is patched
> >>>>> out when running on physical HW. I ran kernbech on the kernel with and
> >>>>> without that additional 'if' and result were rawly the same:
> >>>>
> >>>> So why not program a different handler address for the #PF/#GP faults
> >>>> and avoid the if all together?
> >>> I would gladly use fault vector reserved by x86 architecture, but I am
> >>> not sure Intel will be happy about it.
> >>>
> >>
> >> That's what it's there for... see Peter Z.'s response.
> >>
> > Do you mean I can use one of exception vectors reserved by Intel
> > (20-31)? What Peter Z says is that I can register my own handler for
> > #PF and avoid the 'if' in non PV case as far as I understand him.
> >
>
> What I mean is that vector 14 is page faults -- that's what it is all
> about. Why on Earth do you need another vector?
>
Because this is not a usual page fault that tells the OS that a page is not
mapped. This is a notification to the guest OS that the page it is trying
to access has been swapped out by the host OS. There is nothing the guest can
do about it except schedule another task. So the guest should handle both
types of exceptions: the usual #PF when a page is not mapped by the guest, and
the new type of notification. Ideally we would use one of the unused exception
vectors for the new type of notification.

--
Gleb.
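
In code, the guest-side dispatch described above looks roughly like the
sketch below. Everything in it (the reason values, the helper names and
the per-cpu apf_reason word) is assumed for illustration rather than
taken from the series.

#define KVM_PV_REASON_PAGE_NOT_PRESENT  1       /* assumed encoding */
#define KVM_PV_REASON_PAGE_READY        2

/* Returns non-zero when the fault was really a host swap notification
 * and has been handled; zero when the ordinary #PF path should run. */
static int kvm_handle_async_pf(struct pt_regs *regs, unsigned long error_code)
{
        u32 reason = xchg(&__get_cpu_var(apf_reason), 0);

        switch (reason) {
        case KVM_PV_REASON_PAGE_NOT_PRESENT:
                /* Host is swapping the page in; put this task to sleep. */
                kvm_async_pf_wait(read_cr2());
                return 1;
        case KVM_PV_REASON_PAGE_READY:
                /* "Wake up" notification: the page is resident again. */
                kvm_async_pf_wake(read_cr2());
                return 1;
        default:
                return 0;       /* an ordinary guest page fault */
        }
}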

2010-01-19 17:07:49

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH v3 04/12] Add "handle page fault" PV helper.

On 01/18/2010 10:55 PM, Gleb Natapov wrote:
>>
>> What I mean is that vector 14 is page faults -- that's what it is all
>> about. Why on Earth do you need another vector?
>>
> Because this is not usual pagefault that tell the OS that page is not
> mapped. This is a notification to a guest OS that the page it is trying
> to access is swapped out by the host OS. There is nothing guest can do
> about it except schedule another task. So the guest should handle both
> type of exceptions: usual #PF when page is not mapped by the guest and
> new type of notifications. Ideally we would use one of unused exception
> vectors for new type of notifications.
>

Ah, this kind of stuff. We have talked about this in the past, and the
right way to do that is to have the guest OS pick a vector out of the
standard 0x20-0xff range, and then notify the hypervisor via a hypercall
which vector to use.

In Linux this means marking it as a system vector. Note that there are
real hardware system vectors which will be mutually exclusive with this,
e.g. the UV one.

-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.
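
A sketch of that scheme, in which every name and number is assumed
(neither the hypercall nor the vector exists in this series): the guest
allocates a vector in the 0x20-0xff range, installs a handler for it,
and tells the hypervisor which vector to inject.

#define ASYNC_PF_VECTOR         0xf8    /* example choice from 0x20-0xff */
#define KVM_HC_SET_APF_VECTOR   42      /* hypothetical hypercall number */

/* Assumed asm entry stub, built like the other system vectors, that
 * ends up in a C handler which wakes the sleeping task. */
extern void async_pf_interrupt(void);

static void __init kvm_register_apf_vector(void)
{
        alloc_intr_gate(ASYNC_PF_VECTOR, async_pf_interrupt);
        kvm_hypercall1(KVM_HC_SET_APF_VECTOR, ASYNC_PF_VECTOR);
}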

2010-01-19 17:46:06

by Gleb Natapov

[permalink] [raw]
Subject: Re: [PATCH v3 04/12] Add "handle page fault" PV helper.

On Tue, Jan 19, 2010 at 09:03:20AM -0800, H. Peter Anvin wrote:
> On 01/18/2010 10:55 PM, Gleb Natapov wrote:
> >>
> >> What I mean is that vector 14 is page faults -- that's what it is all
> >> about. Why on Earth do you need another vector?
> >>
> > Because this is not usual pagefault that tell the OS that page is not
> > mapped. This is a notification to a guest OS that the page it is trying
> > to access is swapped out by the host OS. There is nothing guest can do
> > about it except schedule another task. So the guest should handle both
> > type of exceptions: usual #PF when page is not mapped by the guest and
> > new type of notifications. Ideally we would use one of unused exception
> > vectors for new type of notifications.
> >
>
> Ah, this kind of stuff. We have talked about this in the past, and the
> right way to do that is to have the guest OS pick a vector our of the
> standard 0x20-0xff range, and then notify the hypervisor via a hypercall
> which vector to use.
>
> In Linux this means marking it as a system vector. Note that there are
> real hardware system vectors which will be mutually exclusive with this,
> e.g. the UV one.

Yes, it can be done this way and I'll look into it once more. Using an
exception vector is more convenient for three reasons: it allows passing
additional data in the error code, it doesn't require the guest to issue an
EOI, and an exception can be injected while interrupts are disabled by the
guest. The last one is not important for now since the host currently doesn't
inject notifications when interrupts are disabled. Having Intel allocate one
exception vector for hypervisor use would be really nice though.

--
Gleb.

2010-01-19 20:14:37

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH v3 04/12] Add "handle page fault" PV helper.

On 01/19/2010 09:44 AM, Gleb Natapov wrote:
>
> Yes it can be done this way and I'll look into it once more. Using
> exception vector is more convenient for three reasons: it allows to pass
> additional data in error code, it doesn't require guest to issue EOI,
> exception can be injected when interrupts are disabled by a guest. The
> last one is not important for now since host doesn't inject notifications
> when interrupts are disabled currently. Having Intel allocate one
> exception vector for hypervisor use would be really nice though.
>

That's probably not going to happen, for the rather obvious reason: *you
already have 224 of them*.

You seem to be thinking here that vectors 0-31 have to be exceptions and
32-255 have to be interrupts. *There is no such distinction*; the only
thing special about 0-31 is that we (Intel) reserve the right to control
the assignments; for 32-255 the platform and OS control the assignment.

You can have the guest OS take an exception on a vector above 31 just
fine; you just need it to tell the hypervisor which vector it, the OS,
assigned for this purpose.

-hpa

2010-01-20 10:03:27

by Gleb Natapov

[permalink] [raw]
Subject: Re: [PATCH v3 04/12] Add "handle page fault" PV helper.

On Tue, Jan 19, 2010 at 12:10:17PM -0800, H. Peter Anvin wrote:
> On 01/19/2010 09:44 AM, Gleb Natapov wrote:
> >
> > Yes it can be done this way and I'll look into it once more. Using
> > exception vector is more convenient for three reasons: it allows to pass
> > additional data in error code, it doesn't require guest to issue EOI,
> > exception can be injected when interrupts are disabled by a guest. The
> > last one is not important for now since host doesn't inject notifications
> > when interrupts are disabled currently. Having Intel allocate one
> > exception vector for hypervisor use would be really nice though.
> >
>
> That's probably not going to happen, for the rather obvious reason: *you
> already have 224 of them*.
>
> You seem to be thinking here that vectors 0-31 have to be exceptions and
> 32-255 have to be interrupts. *There is no such distinction*; the only
> thing special about 0-31 is that we (Intel) reserve the right to control
> the assignments; for 32-255 the platform and OS control the assignment.
>
I would be glad to interpret the spec like you do, but table 6-1 in SDM 3A
marks vectors 2 and 32-255 as interrupts while the others are traps, faults
and aborts. Unfortunately the VMX designers seem to interpret the spec
like I do. See below.

> You can have the guest OS take an exception on a vector above 31 just
> fine; you just need it to tell the hypervisor which vector it, the OS,
> assigned for this purpose.
>
VMX doesn't allow injecting a hardware exception with a vector greater than 31.
See SDM 3B section 23.2.1.3.

I can inject the event as a HW interrupt on a vector greater than 32 without
going through the APIC, so an EOI will not be required. This sounds
non-architectural, and I am not sure the kernel has entry point code for this
kind of event; it has one for exceptions and one for interrupts, which goes
through __do_IRQ() and assumes that interrupts should be ACKed.

--
Gleb.

2010-01-20 12:00:44

by Avi Kivity

[permalink] [raw]
Subject: Re: [PATCH v3 04/12] Add "handle page fault" PV helper.

On 01/20/2010 12:02 PM, Gleb Natapov wrote:
>
> I can inject the event as HW interrupt on vector greater then 32 but not
> go through APIC so EOI will not be required. This sounds non-architectural
> and I am not sure kernel has entry point code for this kind of event, it
> has one for exception and one for interrupts that goes through __do_IRQ()
> which assumes that interrupts should be ACKed.
>

Further, we start to interact with the TPR; Linux doesn't use the TPR or
cr8 but if it does one day we don't want it interfering with apf.

--
error compiling committee.c: too many arguments to function

2010-01-20 17:19:20

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH v3 04/12] Add "handle page fault" PV helper.

On 01/20/2010 07:00 AM, Avi Kivity wrote:
> On 01/20/2010 12:02 PM, Gleb Natapov wrote:
>>
>> I can inject the event as HW interrupt on vector greater then 32 but not
>> go through APIC so EOI will not be required. This sounds
>> non-architectural
>> and I am not sure kernel has entry point code for this kind of event, it
>> has one for exception and one for interrupts that goes through __do_IRQ()
>> which assumes that interrupts should be ACKed.
>
> Further, we start to interact with the TPR; Linux doesn't use the TPR or
> cr8 but if it does one day we don't want it interfering with apf.

That's not an issue is it? The guest will tell the host what
vector to use for pseudo page faults.

--
All rights reversed.

2010-01-20 17:48:35

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH v3 04/12] Add "handle page fault" PV helper.

On 01/20/2010 02:02 AM, Gleb Natapov wrote:
>
>> You can have the guest OS take an exception on a vector above 31 just
>> fine; you just need it to tell the hypervisor which vector it, the OS,
>> assigned for this purpose.
>>
> VMX doesn't allow to inject hardware exception with vector greater then 31.
> SDM 3B section 23.2.1.3.
>

OK, you're right. I had missed that... I presume it was done for
implementation reasons.

> I can inject the event as HW interrupt on vector greater then 32 but not
> go through APIC so EOI will not be required. This sounds non-architectural
> and I am not sure kernel has entry point code for this kind of event, it
> has one for exception and one for interrupts that goes through __do_IRQ()
> which assumes that interrupts should be ACKed.

You can also just emulate the state transition -- since you know you're
dealing with a flat protected-mode or long-mode OS (and just make that a
condition of enabling the feature) you don't have to deal with all the
strange combinations of directions that an unrestricted x86 event can
take. Since it's an exception, it is unconditional.

-hpa

2010-01-20 18:48:57

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH v3 04/12] Add "handle page fault" PV helper.

On 01/20/2010 04:00 AM, Avi Kivity wrote:
> On 01/20/2010 12:02 PM, Gleb Natapov wrote:
>>
>> I can inject the event as HW interrupt on vector greater then 32 but not
>> go through APIC so EOI will not be required. This sounds
>> non-architectural
>> and I am not sure kernel has entry point code for this kind of event, it
>> has one for exception and one for interrupts that goes through __do_IRQ()
>> which assumes that interrupts should be ACKed.
>>
>
> Further, we start to interact with the TPR; Linux doesn't use the TPR or
> cr8 but if it does one day we don't want it interfering with apf.
>

I don't think the TPR would be involved unless you involve the APIC
(which you absolutely don't want to do.) What I'm trying to figure out
is if you could inject this vector as "external interrupt" and still
have it deliver if IF=0, or if it would cause any other funnies.

At that point, you do not want to go through the do_IRQ path but rather
through your own exception vector entry point (it would be an entry
point which doesn't get an error code, like #UD.)

-hpa
--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.

2010-01-21 08:57:18

by Avi Kivity

[permalink] [raw]
Subject: Re: [PATCH v3 04/12] Add "handle page fault" PV helper.

On 01/20/2010 07:18 PM, Rik van Riel wrote:
> On 01/20/2010 07:00 AM, Avi Kivity wrote:
>> On 01/20/2010 12:02 PM, Gleb Natapov wrote:
>>>
>>> I can inject the event as HW interrupt on vector greater then 32 but
>>> not
>>> go through APIC so EOI will not be required. This sounds
>>> non-architectural
>>> and I am not sure kernel has entry point code for this kind of
>>> event, it
>>> has one for exception and one for interrupts that goes through
>>> __do_IRQ()
>>> which assumes that interrupts should be ACKed.
>>
>> Further, we start to interact with the TPR; Linux doesn't use the TPR or
>> cr8 but if it does one day we don't want it interfering with apf.
>
> That's not an issue is it? The guest will tell the host what
> vector to use for pseudo page faults.

And kill 15 other vectors?

--
error compiling committee.c: too many arguments to function

2010-01-21 08:59:36

by Avi Kivity

[permalink] [raw]
Subject: Re: [PATCH v3 04/12] Add "handle page fault" PV helper.

On 01/20/2010 08:45 PM, H. Peter Anvin wrote:
> On 01/20/2010 04:00 AM, Avi Kivity wrote:
>
>> On 01/20/2010 12:02 PM, Gleb Natapov wrote:
>>
>>> I can inject the event as HW interrupt on vector greater then 32 but not
>>> go through APIC so EOI will not be required. This sounds
>>> non-architectural
>>> and I am not sure kernel has entry point code for this kind of event, it
>>> has one for exception and one for interrupts that goes through __do_IRQ()
>>> which assumes that interrupts should be ACKed.
>>>
>>>
>> Further, we start to interact with the TPR; Linux doesn't use the TPR or
>> cr8 but if it does one day we don't want it interfering with apf.
>>
>>
> I don't think the TPR would be involved unless you involve the APIC
> (which you absolutely don't want to do.) What I'm trying to figure out
> is if you could inject this vector as "external interrupt" and still
> have it deliver if IF=0, or if it would cause any other funnies.
>

No, and it poses problems further down the line if the hardware
virtualizes more and more of the APIC as seems likely to happen.

External interrupts are asynchronous events, so they're likely not to be
guaranteed to be delivered on an instruction boundary like exceptions.
Things like interrupt shadow will affect them as well.

> As that point, you do not want to go through the do_IRQ path but rather
> through your own exception vector entry point (it would be an entry
> point which doesn't get an error code, like #UD.)
>

An error code would actually be useful.

--
error compiling committee.c: too many arguments to function

2010-01-21 09:03:12

by Avi Kivity

[permalink] [raw]
Subject: Re: [PATCH v3 04/12] Add "handle page fault" PV helper.

On 01/20/2010 07:43 PM, H. Peter Anvin wrote:
> On 01/20/2010 02:02 AM, Gleb Natapov wrote:
>>
>>> You can have the guest OS take an exception on a vector above 31 just
>>> fine; you just need it to tell the hypervisor which vector it, the OS,
>>> assigned for this purpose.
>>>
>> VMX doesn't allow to inject hardware exception with vector greater
>> then 31.
>> SDM 3B section 23.2.1.3.
>>
>
> OK, you're right. I had missed that... I presume it was done for
> implementation reasons.

My expectation is that it was done for forward compatibility reasons.

>
>> I can inject the event as HW interrupt on vector greater then 32 but not
>> go through APIC so EOI will not be required. This sounds
>> non-architectural
>> and I am not sure kernel has entry point code for this kind of event, it
>> has one for exception and one for interrupts that goes through
>> __do_IRQ()
>> which assumes that interrupts should be ACKed.
>
> You can also just emulate the state transition -- since you know
> you're dealing with a flat protected-mode or long-mode OS (and just
> make that a condition of enabling the feature) you don't have to deal
> with all the strange combinations of directions that an unrestricted
> x86 event can take. Since it's an exception, it is unconditional.

Do you mean create the stack frame manually? I'd really like to avoid
that for many reasons, one of which is performance (need to do all the
virt-to-phys walks manually), the other is that we're certain to end up
with something horribly underspecified. I'd really like to keep as
close as possible to the hardware. For the alternative approach, see Xen.

--
error compiling committee.c: too many arguments to function

2010-01-21 09:05:11

by Gleb Natapov

[permalink] [raw]
Subject: Re: [PATCH v3 04/12] Add "handle page fault" PV helper.

On Thu, Jan 21, 2010 at 11:02:19AM +0200, Avi Kivity wrote:
> On 01/20/2010 07:43 PM, H. Peter Anvin wrote:
> >On 01/20/2010 02:02 AM, Gleb Natapov wrote:
> >>
> >>>You can have the guest OS take an exception on a vector above 31 just
> >>>fine; you just need it to tell the hypervisor which vector it, the OS,
> >>>assigned for this purpose.
> >>>
> >>VMX doesn't allow to inject hardware exception with vector
> >>greater then 31.
> >>SDM 3B section 23.2.1.3.
> >>
> >
> >OK, you're right. I had missed that... I presume it was done for
> >implementation reasons.
>
> My expectation is that is was done for forward compatibility reasons.
>
> >
> >>I can inject the event as HW interrupt on vector greater then 32 but not
> >>go through APIC so EOI will not be required. This sounds
> >>non-architectural
> >>and I am not sure kernel has entry point code for this kind of event, it
> >>has one for exception and one for interrupts that goes through
> >>__do_IRQ()
> >>which assumes that interrupts should be ACKed.
> >
> >You can also just emulate the state transition -- since you know
> >you're dealing with a flat protected-mode or long-mode OS (and
> >just make that a condition of enabling the feature) you don't have
> >to deal with all the strange combinations of directions that an
> >unrestricted x86 event can take. Since it's an exception, it is
> >unconditional.
>
> Do you mean create the stack frame manually? I'd really like to
> avoid that for many reasons, one of which is performance (need to do
> all the virt-to-phys walks manually), the other is that we're
> certain to end up with something horribly underspecified. I'd
> really like to keep as close as possible to the hardware. For the
> alternative approach, see Xen.
>
That and our event injection path can't play with guest memory right now
since it is done from atomic context.

--
Gleb.

2010-01-21 09:07:56

by Avi Kivity

[permalink] [raw]
Subject: Re: [PATCH v3 04/12] Add "handle page fault" PV helper.

On 01/21/2010 11:04 AM, Gleb Natapov wrote:
>
>> Do you mean create the stack frame manually? I'd really like to
>> avoid that for many reasons, one of which is performance (need to do
>> all the virt-to-phys walks manually), the other is that we're
>> certain to end up with something horribly underspecified. I'd
>> really like to keep as close as possible to the hardware. For the
>> alternative approach, see Xen.
>>
>>
> That and our event injection path can't play with guest memory right now
> since it is done from atomic context.
>

That's true (I'd like to fix that though, for the real mode stuff).

--
error compiling committee.c: too many arguments to function

2010-01-21 15:52:51

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH v3 04/12] Add "handle page fault" PV helper.

On 01/21/2010 01:02 AM, Avi Kivity wrote:
>>
>> You can also just emulate the state transition -- since you know
>> you're dealing with a flat protected-mode or long-mode OS (and just
>> make that a condition of enabling the feature) you don't have to deal
>> with all the strange combinations of directions that an unrestricted
>> x86 event can take. Since it's an exception, it is unconditional.
>
> Do you mean create the stack frame manually? I'd really like to avoid
> that for many reasons, one of which is performance (need to do all the
> virt-to-phys walks manually), the other is that we're certain to end up
> with something horribly underspecified. I'd really like to keep as
> close as possible to the hardware. For the alternative approach, see Xen.
>

I obviously didn't mean to do something which didn't look like a
hardware-delivered exception. That by itself provides a tight spec.
The performance issue is real, of course.

Obviously, the design of VT-x was before my time at Intel, so I'm not
familiar with why the tradeoffs were made the way they were.

-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.

2010-01-22 07:25:38

by Gleb Natapov

[permalink] [raw]
Subject: Re: [PATCH v3 04/12] Add "handle page fault" PV helper.

On Thu, Jan 21, 2010 at 07:47:22AM -0800, H. Peter Anvin wrote:
> On 01/21/2010 01:02 AM, Avi Kivity wrote:
> >>
> >> You can also just emulate the state transition -- since you know
> >> you're dealing with a flat protected-mode or long-mode OS (and just
> >> make that a condition of enabling the feature) you don't have to deal
> >> with all the strange combinations of directions that an unrestricted
> >> x86 event can take. Since it's an exception, it is unconditional.
> >
> > Do you mean create the stack frame manually? I'd really like to avoid
> > that for many reasons, one of which is performance (need to do all the
> > virt-to-phys walks manually), the other is that we're certain to end up
> > with something horribly underspecified. I'd really like to keep as
> > close as possible to the hardware. For the alternative approach, see Xen.
> >
>
> I obviously didn't mean to do something which didn't look like a
> hardware-delivered exception. That by itself provides a tight spec.
> The performance issue is real, of course.
>
> Obviously, the design of VT-x was before my time at Intel, so I'm not
> familiar with why the tradeoffs that were done they way they were.
>
Is it so out of the question to reserve an exception vector below 32 for PV use?

--
Gleb.