2012-06-04 05:06:22

by Nikunj A Dadhania

Subject: [PATCH v2 0/7] KVM paravirt remote flush tlb


The remote flushing APIs do a busy wait, which is fine in a bare-metal
scenario. But within a guest, the vcpus might have been preempted or
blocked. In that case, the initiator vcpu would end up busy-waiting
for a long time.

This was discovered in our gang scheduling tests; another way to solve
it is by para-virtualizing flush_tlb_others_ipi.

This patch set implements paravirt TLB flushes, making sure that the
initiator does not wait for vcpus that are sleeping; instead, all the
sleeping vcpus flush the TLB on guest enter. The idea was discussed
here: https://lkml.org/lkml/2012/2/20/157
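
In outline, the protocol looks roughly like this (a condensed sketch;
the per-cpu vcpu_state layout and the memory-ordering fixes are in the
patches below):

        /* Guest side, when flushing remote TLBs: */
        for_each_cpu(cpu, flushmask) {
                struct kvm_vcpu_state *vs = &per_cpu(vcpu_state, cpu);

                if (!vs->state) {               /* vcpu looks preempted */
                        vs->flush_on_enter = 1; /* defer the flush */
                        smp_mb();
                        if (!vs->state)         /* still out: skip the IPI */
                                cpumask_clear_cpu(cpu, flushmask);
                }
        }
        /* IPI only the vcpus left in flushmask, then wait for them. */

        /* Host side, on every guest entry: */
        vs->state = 1;
        smp_mb();
        if (vs->flush_on_enter)
                kvm_x86_ops->tlb_flush(vcpu);   /* the deferred flush */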

This also brings in one more dependency, for the lockless page walk
performed by get_user_pages_fast (gup_fast). gup_fast disables
interrupts and assumes that the pages it walks will not be freed
during that period. That was fine as long as flush_tlb_others_ipi
waited for all the IPIs to be processed before returning. With the new
approach of not waiting for the sleeping vcpus, this assumption is no
longer valid. So HAVE_RCU_TABLE_FREE is now used to free the pages,
which makes sure that all the cpus have at least processed the smp
callback before the pages are freed.
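
Condensed from PeterZ's patch below, the deferred free looks roughly
like this (a sketch, not the exact in-tree code):

        struct mmu_table_batch {
                struct rcu_head rcu;
                unsigned int    nr;
                void            *tables[0];
        };

        static void tlb_remove_table_rcu(struct rcu_head *head)
        {
                struct mmu_table_batch *batch =
                        container_of(head, struct mmu_table_batch, rcu);
                int i;

                /* Runs only after every cpu has passed through a
                 * quiescent state, i.e. after any in-flight gup_fast()
                 * (which keeps IRQs disabled) has finished its walk. */
                for (i = 0; i < batch->nr; i++)
                        __tlb_remove_table(batch->tables[i]);

                free_page((unsigned long)batch);
        }

        /* tlb_remove_table() batches the page-table pages and hands
         * the full batch to:
         *      call_rcu_sched(&batch->rcu, tlb_remove_table_rcu);
         */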

The patchset depends on ticketlocks[1] and KVM Paravirt Spinlock
patches[2]

Changelog from v1:
• Race fixes reported by Vatsa
• Address gup_fast dependency using PeterZ's rcu table free patch
• Fix rcu_table_free for hw pagetable walkers
• Increased SPIN_THRESHOLD to 8k, to address the baseline numbers
regression in ebizzy (non-PLE). Raghu is working on tuning the
threshold value along with the ple_window and ple_gap.

Here are the results from PLE hardware. Setup details:
• 8 CPUs (HT disabled)
• 64-bit VM
• 8vcpus
• 1GB RAM

Numbers are % improvement/degradation wrt base kernel 3.4.0-rc4
(commit: af3a3ab2)

Note: SPINLOCK_THRESHOLD is set to 8192

gang - Base kernel + gang scheduling patches
pvspin - Base kernel + ticketlocks patches + paravirt spinlock patches
pvflush - Base kernel + paravirt tlb flush patches
pvall - pvspin + paravirt tlb flush patches
pvallnople - pvall and PLE is disabled(ple_gap = 0)

+-------------+-----------+-----------+-----------+-----------+-----------+
| | gang | pvspin | pvflush | pvall | pvallnople|
+-------------+-----------+-----------+-----------+-----------+-----------+
| ebizzy-1vm | 2 | 2 | 3 | -11 | 4 |
| ebizzy-2vm | 156 | 15 | -58 | 343 | 110 |
| ebizzy-4vm | 238 | 14 | -42 | 17 | 47 |
+-------------+-----------+-----------+-----------+-----------+-----------+
| specjbb-1vm | 3 | 5 | 3 | 3 | 2 |
| specjbb-2vm | -10 | 3 | 2 | 2 | 3 |
| specjbb-4vm | 1 | 4 | 3 | 4 | 4 |
+-------------+-----------+-----------+-----------+-----------+-----------+
| hbench-1vm | -14 | -58 | -1 | 2 | 7 |
| hbench-2vm | -35 | -5 | 7 | 11 | 12 |
| hbench-4vm | 19 | 8 | -1 | 14 | 35 |
+-------------+-----------+-----------+-----------+-----------+-----------+
| dbench-1vm | -1 | -17 | -25 | -7 | -18 |
| dbench-2vm | 3 | -4 | 1 | 5 | 3 |
| dbench-4vm | 8 | 6 | 22 | 6 | -6 |
+-------------+-----------+-----------+-----------+-----------+-----------+
| kbench-1vm | -100 | 8 | 4 | 5 | 7 |
| kbench-2vm | 7 | 9 | 0 | -2 | -2 |
| kbench-4vm | 12 | -1 | 0 | -6 | -15 |
+-------------+-----------+-----------+-----------+-----------+-----------+
| sysbnch-1vm | 4 | 1 | 3 | 4 | 5 |
| sysbnch-2vm | 73 | 15 | 29 | 34 | 49 |
| sysbnch-4vm | 22 | 2 | 9 | 17 | 31 |
+-------------+-----------+-----------+-----------+-----------+-----------+

Observations from the above table:
* pvall does well in most of the benchmarks.
* pvall does not do quite as well for kernbench 2vm (-2%) and 4vm (-6%)

Another experiment, suggested by Vatsa, was to disable PLE, as the
paravirt patches provide similar functionality. In those experiments
we did see notable improvements in hackbench and sysbench. Kernbench
degraded further; PLE does help kernbench. This will be addressed by
Raghu's directed yield approach.

Comments/suggestions welcome.

Regards
Nikunj

---

Nikunj A. Dadhania (6):
KVM Guest: Add VCPU running/pre-empted state for guest
KVM-HV: Add VCPU running/pre-empted state for guest
KVM: Add paravirt kvm_flush_tlb_others
KVM: export kvm_kick_vcpu for pv_flush
KVM: Introduce PV kick in flush tlb
Flush page-table pages before freeing them

Peter Zijlstra (1):
kvm,x86: RCU based table free


arch/Kconfig | 3 ++
arch/powerpc/include/asm/pgalloc.h | 1 +
arch/s390/mm/pgtable.c | 1 +
arch/sparc/include/asm/pgalloc_64.h | 1 +
arch/x86/Kconfig | 12 ++++++
arch/x86/include/asm/kvm_host.h | 7 ++++
arch/x86/include/asm/kvm_para.h | 15 ++++++++
arch/x86/include/asm/tlbflush.h | 9 +++++
arch/x86/kernel/kvm.c | 52 ++++++++++++++++++++++----
arch/x86/kvm/cpuid.c | 1 +
arch/x86/kvm/x86.c | 57 ++++++++++++++++++++++++++++-
arch/x86/mm/pgtable.c | 6 ++-
arch/x86/mm/tlb.c | 70 +++++++++++++++++++++++++++++++++++
include/asm-generic/tlb.h | 9 +++++
mm/memory.c | 31 +++++++++++++++-
15 files changed, 260 insertions(+), 15 deletions(-)


2012-06-04 05:07:06

by Nikunj A Dadhania

Subject: [PATCH v2 2/7] KVM-HV: Add VCPU running/pre-empted state for guest

Hypervisor code to indicate the guest's running/preempted status
through an MSR.

Suggested-by: Peter Zijlstra <[email protected]>
Signed-off-by: Nikunj A. Dadhania <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 7 ++++++
arch/x86/kvm/cpuid.c | 1 +
arch/x86/kvm/x86.c | 45 ++++++++++++++++++++++++++++++++++++++-
3 files changed, 52 insertions(+), 1 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index dad475b..12fe3c7 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -418,6 +418,13 @@ struct kvm_vcpu_arch {
struct kvm_steal_time steal;
} st;

+ /* indicates vcpu is running or preempted */
+ struct {
+ u64 msr_val;
+ struct gfn_to_hva_cache data;
+ struct kvm_vcpu_state vs;
+ } v_state;
+
u64 last_guest_tsc;
u64 last_kernel_ns;
u64 last_host_tsc;
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 7c93806..0588984 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -409,6 +409,7 @@ static int do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
(1 << KVM_FEATURE_CLOCKSOURCE2) |
(1 << KVM_FEATURE_ASYNC_PF) |
(1 << KVM_FEATURE_CLOCKSOURCE_STABLE_BIT) |
+ (1 << KVM_FEATURE_VCPU_STATE) |
(1 << KVM_FEATURE_PV_UNHALT);

if (sched_info_on())
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 8e5f57b..264f172 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -789,12 +789,13 @@ EXPORT_SYMBOL_GPL(kvm_rdpmc);
* kvm-specific. Those are put in the beginning of the list.
*/

-#define KVM_SAVE_MSRS_BEGIN 9
+#define KVM_SAVE_MSRS_BEGIN 10
static u32 msrs_to_save[] = {
MSR_KVM_SYSTEM_TIME, MSR_KVM_WALL_CLOCK,
MSR_KVM_SYSTEM_TIME_NEW, MSR_KVM_WALL_CLOCK_NEW,
HV_X64_MSR_GUEST_OS_ID, HV_X64_MSR_HYPERCALL,
HV_X64_MSR_APIC_ASSIST_PAGE, MSR_KVM_ASYNC_PF_EN, MSR_KVM_STEAL_TIME,
+ MSR_KVM_VCPU_STATE,
MSR_IA32_SYSENTER_CS, MSR_IA32_SYSENTER_ESP, MSR_IA32_SYSENTER_EIP,
MSR_STAR,
#ifdef CONFIG_X86_64
@@ -1539,6 +1540,32 @@ static void record_steal_time(struct kvm_vcpu *vcpu)
&vcpu->arch.st.steal, sizeof(struct kvm_steal_time));
}

+static void kvm_set_vcpu_state(struct kvm_vcpu *vcpu)
+{
+ struct kvm_vcpu_state *vs = &vcpu->arch.v_state.vs;
+ struct gfn_to_hva_cache *ghc = &vcpu->arch.v_state.data;
+
+ if (!(vcpu->arch.v_state.msr_val & KVM_MSR_ENABLED))
+ return;
+
+ vs->state = 1;
+ kvm_write_guest_cached(vcpu->kvm, ghc, vs, 2*sizeof(__u32));
+ smp_wmb();
+}
+
+static void kvm_clear_vcpu_state(struct kvm_vcpu *vcpu)
+{
+ struct kvm_vcpu_state *vs = &vcpu->arch.v_state.vs;
+ struct gfn_to_hva_cache *ghc = &vcpu->arch.v_state.data;
+
+ if (!(vcpu->arch.v_state.msr_val & KVM_MSR_ENABLED))
+ return;
+
+ vs->state = 0;
+ kvm_write_guest_cached(vcpu->kvm, ghc, vs, 2*sizeof(__u32));
+ smp_wmb();
+}
+
int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data)
{
bool pr = false;
@@ -1654,6 +1681,14 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data)

break;

+ case MSR_KVM_VCPU_STATE:
+ if (kvm_gfn_to_hva_cache_init(vcpu->kvm, &vcpu->arch.v_state.data,
+ data & KVM_VCPU_STATE_VALID_BITS))
+ return 1;
+
+ vcpu->arch.v_state.msr_val = data;
+ break;
+
case MSR_IA32_MCG_CTL:
case MSR_IA32_MCG_STATUS:
case MSR_IA32_MC0_CTL ... MSR_IA32_MC0_CTL + 4 * KVM_MAX_MCE_BANKS - 1:
@@ -1974,6 +2009,9 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata)
case MSR_KVM_STEAL_TIME:
data = vcpu->arch.st.msr_val;
break;
+ case MSR_KVM_VCPU_STATE:
+ data = vcpu->arch.v_state.msr_val;
+ break;
case MSR_IA32_P5_MC_ADDR:
case MSR_IA32_P5_MC_TYPE:
case MSR_IA32_MCG_CAP:
@@ -5324,6 +5362,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
kvm_load_guest_fpu(vcpu);
kvm_load_guest_xcr0(vcpu);

+ kvm_set_vcpu_state(vcpu);
+
vcpu->mode = IN_GUEST_MODE;

/* We should set ->mode before check ->requests,
@@ -5340,6 +5380,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
local_irq_enable();
preempt_enable();
kvm_x86_ops->cancel_injection(vcpu);
+ kvm_clear_vcpu_state(vcpu);
r = 1;
goto out;
}
@@ -5374,6 +5415,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)

vcpu->arch.last_guest_tsc = kvm_x86_ops->read_l1_tsc(vcpu);

+ kvm_clear_vcpu_state(vcpu);
vcpu->mode = OUTSIDE_GUEST_MODE;
smp_wmb();
local_irq_enable();
@@ -6029,6 +6071,7 @@ int kvm_arch_vcpu_reset(struct kvm_vcpu *vcpu)
kvm_make_request(KVM_REQ_EVENT, vcpu);
vcpu->arch.apf.msr_val = 0;
vcpu->arch.st.msr_val = 0;
+ vcpu->arch.v_state.msr_val = 0;

kvmclock_reset(vcpu);

2012-06-04 05:08:18

by Nikunj A Dadhania

Subject: [PATCH v2 3/7] KVM: Add paravirt kvm_flush_tlb_others

flush_tlb_others_ipi depends on a lot of statics in tlb.c. Replicate
flush_tlb_others_ipi as kvm_flush_tlb_others to further adapt it to
paravirtualization.

Use the vcpu state information inside kvm_flush_tlb_others to avoid
sending IPIs to preempted vcpus:

* Do not send IPIs to offline vcpus; set the flush_on_enter flag instead
* For online vcpus: wait for them to clear the flag

The approach was discussed here: https://lkml.org/lkml/2012/2/20/157

Suggested-by: Peter Zijlstra <[email protected]>
Signed-off-by: Nikunj A. Dadhania <[email protected]>

--
Pseudo Algo:

   Write()
   ======

        guest_exit()
                flush_on_enter[i] = 0;
                running[i] = 0;

        guest_enter()
                running[i] = 1;
                smp_mb();
                if (flush_on_enter[i]) {
                        tlb_flush()
                        flush_on_enter[i] = 0;
                }


   Read()
   ======

        GUEST                                   KVM-HV

        f->flushcpumask = cpumask - me;

   again:
        for_each_cpu(i, f->flushmask) {

                if (!running[i]) {
                                                case 1:
                                                running[n] = 1

                                                (cpuN does not see
                                                flush_on_enter set,
                                                guest later finds it
                                                running and sends ipi,
                                                we are fine here, need
                                                to clear the flag on
                                                guest_exit)

                        flush_on_enter[i] = 1;
                                                case 2:
                                                running[n] = 1

                                                (cpuN will see flush
                                                on enter and an IPI as
                                                well - addressed in
                                                patch 4)

                        if (!running[i])
                                cpu_clear(f->flushmask);
                                                (all is well, vm_enter
                                                will do the fixup)
                }
                                                case 3:
                                                running[n] = 0

                                                (cpuN went to sleep,
                                                we saw it as awake,
                                                ipi sent, but wait
                                                will break without
                                                zero_mask and goto
                                                again will take care)
        }
        send_ipi(f->flushmask)

        wait_a_while_for_zero_mask();

        if (!zero_mask)
                goto again;
---
arch/x86/include/asm/kvm_para.h | 3 +-
arch/x86/include/asm/tlbflush.h | 9 ++++++
arch/x86/kernel/kvm.c | 1 +
arch/x86/kvm/x86.c | 14 ++++++++-
arch/x86/mm/tlb.c | 61 +++++++++++++++++++++++++++++++++++++++
5 files changed, 86 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index f57b5cc..684a285 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -55,7 +55,8 @@ struct kvm_steal_time {

struct kvm_vcpu_state {
__u32 state;
- __u32 pad[15];
+ __u32 flush_on_enter;
+ __u32 pad[14];
};

#define KVM_VCPU_STATE_ALIGN_BITS 5
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index c0e108e..29470bd 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -119,6 +119,12 @@ static inline void native_flush_tlb_others(const struct cpumask *cpumask,
{
}

+static inline void kvm_flush_tlb_others(const struct cpumask *cpumask,
+ struct mm_struct *mm,
+ unsigned long va)
+{
+}
+
static inline void reset_lazy_tlbstate(void)
{
}
@@ -145,6 +151,9 @@ static inline void flush_tlb_range(struct vm_area_struct *vma,
void native_flush_tlb_others(const struct cpumask *cpumask,
struct mm_struct *mm, unsigned long va);

+void kvm_flush_tlb_others(const struct cpumask *cpumask,
+ struct mm_struct *mm, unsigned long va);
+
#define TLBSTATE_OK 1
#define TLBSTATE_LAZY 2

diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index bb686a6..66db54e 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -465,6 +465,7 @@ void __init kvm_guest_init(void)
}

has_vcpu_state = 1;
+ pv_mmu_ops.flush_tlb_others = kvm_flush_tlb_others;

#ifdef CONFIG_SMP
smp_ops.smp_prepare_boot_cpu = kvm_smp_prepare_boot_cpu;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 264f172..4714a7b 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1548,9 +1548,20 @@ static void kvm_set_vcpu_state(struct kvm_vcpu *vcpu)
if (!(vcpu->arch.v_state.msr_val & KVM_MSR_ENABLED))
return;

+ /*
+ * Let the guest know that we are online, make sure we do not
+ * overwrite flush_on_enter, just write the vs->state.
+ */
vs->state = 1;
- kvm_write_guest_cached(vcpu->kvm, ghc, vs, 2*sizeof(__u32));
+ kvm_write_guest_cached(vcpu->kvm, ghc, vs, 1*sizeof(__u32));
smp_wmb();
+ /*
+ * Guest might have seen us offline and would have set
+ * flush_on_enter.
+ */
+ kvm_read_guest_cached(vcpu->kvm, ghc, vs, 2*sizeof(__u32));
+ if (vs->flush_on_enter)
+ kvm_x86_ops->tlb_flush(vcpu);
}

static void kvm_clear_vcpu_state(struct kvm_vcpu *vcpu)
@@ -1561,6 +1572,7 @@ static void kvm_clear_vcpu_state(struct kvm_vcpu *vcpu)
if (!(vcpu->arch.v_state.msr_val & KVM_MSR_ENABLED))
return;

+ vs->flush_on_enter = 0;
vs->state = 0;
kvm_write_guest_cached(vcpu->kvm, ghc, vs, 2*sizeof(__u32));
smp_wmb();
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index d6c0418..f5dacdd 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -6,6 +6,7 @@
#include <linux/interrupt.h>
#include <linux/module.h>
#include <linux/cpu.h>
+#include <linux/kvm_para.h>

#include <asm/tlbflush.h>
#include <asm/mmu_context.h>
@@ -202,6 +203,66 @@ static void flush_tlb_others_ipi(const struct cpumask *cpumask,
raw_spin_unlock(&f->tlbstate_lock);
}

+#ifdef CONFIG_KVM_GUEST
+
+DECLARE_PER_CPU(struct kvm_vcpu_state, vcpu_state) __aligned(64);
+
+void kvm_flush_tlb_others(const struct cpumask *cpumask,
+ struct mm_struct *mm, unsigned long va)
+{
+ unsigned int sender;
+ union smp_flush_state *f;
+ int cpu, loop;
+ struct kvm_vcpu_state *v_state;
+
+ /* Caller has disabled preemption */
+ sender = this_cpu_read(tlb_vector_offset);
+ f = &flush_state[sender];
+
+ if (nr_cpu_ids > NUM_INVALIDATE_TLB_VECTORS)
+ raw_spin_lock(&f->tlbstate_lock);
+
+ f->flush_mm = mm;
+ f->flush_va = va;
+ if (cpumask_andnot(to_cpumask(f->flush_cpumask), cpumask, cpumask_of(smp_processor_id()))) {
+ /*
+ * We have to send the IPI only to online vCPUs
+ * affected. And queue flush_on_enter for pre-empted
+ * vCPUs
+ */
+again:
+ for_each_cpu(cpu, to_cpumask(f->flush_cpumask)) {
+ v_state = &per_cpu(vcpu_state, cpu);
+
+ if (!v_state->state) {
+ v_state->flush_on_enter = 1;
+ smp_mb();
+ if (!v_state->state)
+ cpumask_clear_cpu(cpu, to_cpumask(f->flush_cpumask));
+ }
+ }
+
+ if (cpumask_empty(to_cpumask(f->flush_cpumask)))
+ goto out;
+
+ apic->send_IPI_mask(to_cpumask(f->flush_cpumask),
+ INVALIDATE_TLB_VECTOR_START + sender);
+
+ loop = 1000;
+ while (!cpumask_empty(to_cpumask(f->flush_cpumask)) && --loop)
+ cpu_relax();
+
+ if (!cpumask_empty(to_cpumask(f->flush_cpumask)))
+ goto again;
+ }
+out:
+ f->flush_mm = NULL;
+ f->flush_va = 0;
+ if (nr_cpu_ids > NUM_INVALIDATE_TLB_VECTORS)
+ raw_spin_unlock(&f->tlbstate_lock);
+}
+#endif /* CONFIG_KVM_GUEST */
+
void native_flush_tlb_others(const struct cpumask *cpumask,
struct mm_struct *mm, unsigned long va)
{

2012-06-04 05:08:30

by Nikunj A Dadhania

Subject: [PATCH v2 4/7] KVM: export kvm_kick_vcpu for pv_flush


Signed-off-by: Nikunj A. Dadhania <[email protected]>
---
arch/x86/include/asm/kvm_para.h | 4 ++++
arch/x86/kernel/kvm.c | 18 +++++++++---------
2 files changed, 13 insertions(+), 9 deletions(-)

diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index 684a285..651a305 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -206,6 +206,7 @@ void kvm_async_pf_task_wait(u32 token);
void kvm_async_pf_task_wake(u32 token);
u32 kvm_read_and_reset_pf_reason(void);
extern void kvm_disable_steal_time(void);
+void kvm_kick_cpu(int cpu);

#ifdef CONFIG_PARAVIRT_SPINLOCKS
void __init kvm_spinlock_init(void);
@@ -229,6 +230,9 @@ static inline void kvm_disable_steal_time(void)
{
return;
}
+
+#define kvm_kick_cpu(T) do {} while (0)
+
#endif

#endif /* __KERNEL__ */
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 66db54e..5943285 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -487,6 +487,15 @@ static __init int activate_jump_labels(void)
}
arch_initcall(activate_jump_labels);

+/* Kick a cpu */
+void kvm_kick_cpu(int cpu)
+{
+ int apicid;
+
+ apicid = per_cpu(x86_cpu_to_apicid, cpu);
+ kvm_hypercall1(KVM_HC_KICK_CPU, apicid);
+}
+
#ifdef CONFIG_PARAVIRT_SPINLOCKS

enum kvm_contention_stat {
@@ -695,15 +704,6 @@ out:
}
PV_CALLEE_SAVE_REGS_THUNK(kvm_lock_spinning);

-/* Kick a cpu by its apicid*/
-static inline void kvm_kick_cpu(int cpu)
-{
- int apicid;
-
- apicid = per_cpu(x86_cpu_to_apicid, cpu);
- kvm_hypercall1(KVM_HC_KICK_CPU, apicid);
-}
-
/* Kick vcpu waiting on @lock->head to reach value @ticket */
static void kvm_unlock_kick(struct arch_spinlock *lock, __ticket_t ticket)
{

2012-06-04 05:08:57

by Nikunj A Dadhania

Subject: [PATCH v2 5/7] KVM: Introduce PV kick in flush tlb

In place of looping continuously, introduce a halt if we do not
succeed after some time.

For vcpus that were running, an IPI is sent. In case a vcpu went to
sleep in between, we will also be doing flush_on_enter (harmless). But
as a flush IPI was already sent, it will still be processed in the IPI
handler; this might result in something undesirable, i.e. it might
clear the flush_mask of a new request.

So after sending an IPI and waiting for a while, do a halt and wait
for a kick from the last vcpu.
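
The handshake, condensed from the diff below (sketch):

        /* Sender, once the IPI wait times out: */
        f->need_kick = 1;
        smp_mb();
        while (!cpumask_empty(to_cpumask(f->flush_cpumask)))
                halt();                         /* woken by KVM_HC_KICK_CPU */

        /* Receiver, in the flush IPI handler, after clearing its bit: */
        if (f->need_kick && cpumask_empty(to_cpumask(f->flush_cpumask))) {
                f->need_kick = 0;
                smp_wmb();
                kvm_kick_cpu(f->sender_cpu);    /* wake the halted sender */
        }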

Signed-off-by: Srivatsa Vaddagiri <[email protected]>
Signed-off-by: Nikunj A. Dadhania <[email protected]>
---
arch/x86/mm/tlb.c | 25 +++++++++++++++++--------
1 files changed, 17 insertions(+), 8 deletions(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index f5dacdd..2c686bf 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -43,6 +43,8 @@ union smp_flush_state {
struct {
struct mm_struct *flush_mm;
unsigned long flush_va;
+ int sender_cpu;
+ unsigned int need_kick;
raw_spinlock_t tlbstate_lock;
DECLARE_BITMAP(flush_cpumask, NR_CPUS);
};
@@ -167,6 +169,11 @@ out:
smp_mb__before_clear_bit();
cpumask_clear_cpu(cpu, to_cpumask(f->flush_cpumask));
smp_mb__after_clear_bit();
+ if (f->need_kick && cpumask_empty(to_cpumask(f->flush_cpumask))) {
+ f->need_kick = 0;
+ smp_wmb();
+ kvm_kick_cpu(f->sender_cpu);
+ }
inc_irq_stat(irq_tlb_count);
}

@@ -222,15 +229,17 @@ void kvm_flush_tlb_others(const struct cpumask *cpumask,
if (nr_cpu_ids > NUM_INVALIDATE_TLB_VECTORS)
raw_spin_lock(&f->tlbstate_lock);

+ cpu = smp_processor_id();
f->flush_mm = mm;
f->flush_va = va;
- if (cpumask_andnot(to_cpumask(f->flush_cpumask), cpumask, cpumask_of(smp_processor_id()))) {
+ f->sender_cpu = cpu;
+ f->need_kick = 0;
+ if (cpumask_andnot(to_cpumask(f->flush_cpumask), cpumask, cpumask_of(cpu))) {
/*
* We have to send the IPI only to online vCPUs
* affected. And queue flush_on_enter for pre-empted
* vCPUs
*/
-again:
for_each_cpu(cpu, to_cpumask(f->flush_cpumask)) {
v_state = &per_cpu(vcpu_state, cpu);

@@ -242,9 +251,6 @@ again:
}
}

- if (cpumask_empty(to_cpumask(f->flush_cpumask)))
- goto out;
-
apic->send_IPI_mask(to_cpumask(f->flush_cpumask),
INVALIDATE_TLB_VECTOR_START + sender);

@@ -252,10 +258,13 @@ again:
while (!cpumask_empty(to_cpumask(f->flush_cpumask)) && --loop)
cpu_relax();

- if (!cpumask_empty(to_cpumask(f->flush_cpumask)))
- goto again;
+ if (!loop) {
+ f->need_kick = 1;
+ smp_mb();
+ while (!cpumask_empty(to_cpumask(f->flush_cpumask)))
+ halt();
+ }
}
-out:
f->flush_mm = NULL;
f->flush_va = 0;
if (nr_cpu_ids > NUM_INVALIDATE_TLB_VECTORS)

2012-06-04 05:09:27

by Nikunj A Dadhania

Subject: [PATCH v2 7/7] Flush page-table pages before freeing them

From: Nikunj A. Dadhania <[email protected]>

Certain architectures (viz. x86, arm, s390) have hardware page-table
walkers (#PF). So during the RCU page-table teardown process, make
sure we do a TLB flush of the page-table pages on all relevant CPUs to
synchronize against hardware walkers, and only then free the pages.

Moreover, the (mm_users < 2) condition does not hold good for the
above architectures, as the hardware engine is one of the users.
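
The resulting ordering, condensed from the diff below (sketch):

        void tlb_table_flush(struct mmu_gather *tlb)
        {
                struct mmu_table_batch **batch = &tlb->batch;

                if (*batch) {
                        /* flush TLBs first, so hardware walkers drop
                         * their cached translations into these
                         * page-table pages ... */
                        tlb_table_flush_mmu(tlb);
                        /* ... and only then let RCU free them */
                        call_rcu_sched(&(*batch)->rcu, tlb_remove_table_rcu);
                        *batch = NULL;
                }
        }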

Suggested-by: Peter Zijlstra <[email protected]>
Signed-off-by: Nikunj A. Dadhania <[email protected]>
---
arch/Kconfig | 3 +++
arch/x86/Kconfig | 12 ++++++++++++
mm/memory.c | 24 ++++++++++++++++++++++--
3 files changed, 37 insertions(+), 2 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index 684eb5a..abc3739 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -196,6 +196,9 @@ config HAVE_ARCH_MUTEX_CPU_RELAX
config HAVE_RCU_TABLE_FREE
bool

+config ARCH_HW_WALKS_PAGE_TABLE
+ bool
+
config ARCH_HAVE_NMI_SAFE_CMPXCHG
bool

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index a9ec0da..b0a9f11 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -617,6 +617,18 @@ config PARAVIRT_SPINLOCKS

If you are unsure how to answer this question, answer N.

+config PARAVIRT_TLB_FLUSH
+ bool "Paravirtualization layer for TLB Flush"
+ depends on PARAVIRT && SMP && EXPERIMENTAL
+ select HAVE_RCU_TABLE_FREE
+ select ARCH_HW_WALKS_PAGE_TABLE
+ ---help---
+ Paravirtualized Flush TLB replace the native implementation
+ with something virtualization-friendly (for example, set a
+ flag for sleeping vcpu and do not wait for it).
+
+ If you are unsure how to answer this question, answer N.
+
config PARAVIRT_CLOCK
bool

diff --git a/mm/memory.c b/mm/memory.c
index c12685d..acfadb8 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -335,11 +335,27 @@ static void tlb_remove_table_rcu(struct rcu_head *head)
free_page((unsigned long)batch);
}

+#ifdef CONFIG_ARCH_HW_WALKS_PAGE_TABLE
+/*
+ * Some architectures(x86, arm, s390) HW walks the page tables when
+ * the page-table tear down might be happening. So make sure that
+ * before freeing the page-table pages, flush their tlbs
+ */
+static inline void tlb_table_flush_mmu(struct mmu_gather *tlb)
+{
+ tlb_flush_mmu(tlb);
+}
+
+#else
+#define tlb_table_flush_mmu(tlb) do {} while (0)
+#endif
+
void tlb_table_flush(struct mmu_gather *tlb)
{
struct mmu_table_batch **batch = &tlb->batch;

if (*batch) {
+ tlb_table_flush_mmu(tlb);
call_rcu_sched(&(*batch)->rcu, tlb_remove_table_rcu);
*batch = NULL;
}
@@ -351,18 +367,22 @@ void tlb_remove_table(struct mmu_gather *tlb, void *table)

tlb->need_flush = 1;

+#ifndef CONFIG_ARCH_HW_WALKS_PAGE_TABLE
/*
- * When there's less then two users of this mm there cannot be a
- * concurrent page-table walk.
+ * When there's less then two users of this mm there cannot be
+ * a concurrent page-table walk for architectures that do not
+ * have hardware page-table walkers.
*/
if (atomic_read(&tlb->mm->mm_users) < 2) {
__tlb_remove_table(table);
return;
}
+#endif

if (*batch == NULL) {
*batch = (struct mmu_table_batch *)__get_free_page(GFP_NOWAIT | __GFP_NOWARN);
if (*batch == NULL) {
+ tlb_table_flush_mmu(tlb);
tlb_remove_table_one(table);
return;
}

2012-06-04 05:13:18

by Nikunj A Dadhania

Subject: [PATCH v2 6/7] kvm,x86: RCU based table free

From: Peter Zijlstra <[email protected]>

get_user_pages_fast() depends on the IPI to hold off page-table
teardown while the tables are locklessly walked with interrupts
disabled. If a vcpu were to be preempted while in this critical
section, another vcpu tearing down page tables would go ahead and
destroy them. When the preempted vcpu resumes, it then touches the
freed pages.

Using HAVE_RCU_TABLE_FREE:

By using call_rcu_sched() to free the page tables, you'd need to
receive and process at least one tick on the woken-up cpu after the
freeing; but since the in-progress gup_fast() has IRQs disabled, this
will be delayed.

http://article.gmane.org/gmane.linux.kernel/1290539
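
Why IRQs-off holds off the rcu-sched callback (a simplified sketch of
the gup_fast() critical section, not the exact in-tree code):

        static int gup_fast_sketch(unsigned long start, unsigned long end,
                                   int write, struct page **pages)
        {
                unsigned long flags;
                int nr = 0;

                local_irq_save(flags);
                /* With IRQs off, this cpu cannot take the tick that
                 * would report an rcu-sched quiescent state, so page
                 * tables queued via call_rcu_sched() stay allocated
                 * while we walk them locklessly here. */
                /* ... walk pgd/pud/pmd/pte, filling pages[] ... */
                local_irq_restore(flags);

                return nr;
        }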

Tested-by: Nikunj A. Dadhania <[email protected]>
Signed-off-by: Nikunj A. Dadhania <[email protected]>
---
arch/powerpc/include/asm/pgalloc.h | 1 +
arch/s390/mm/pgtable.c | 1 +
arch/sparc/include/asm/pgalloc_64.h | 1 +
arch/x86/mm/pgtable.c | 6 +++---
include/asm-generic/tlb.h | 9 +++++++++
mm/memory.c | 7 +++++++
6 files changed, 22 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/include/asm/pgalloc.h b/arch/powerpc/include/asm/pgalloc.h
index bf301ac..c33ae79 100644
--- a/arch/powerpc/include/asm/pgalloc.h
+++ b/arch/powerpc/include/asm/pgalloc.h
@@ -49,6 +49,7 @@ static inline void __tlb_remove_table(void *_table)

pgtable_free(table, shift);
}
+#define __tlb_remove_table __tlb_remove_table
#else /* CONFIG_SMP */
static inline void pgtable_free_tlb(struct mmu_gather *tlb, void *table, unsigned shift)
{
diff --git a/arch/s390/mm/pgtable.c b/arch/s390/mm/pgtable.c
index 6e765bf..7029ed7 100644
--- a/arch/s390/mm/pgtable.c
+++ b/arch/s390/mm/pgtable.c
@@ -730,6 +730,7 @@ void __tlb_remove_table(void *_table)
else
free_pages((unsigned long) table, ALLOC_ORDER);
}
+#define __tlb_remove_table __tlb_remove_table

static void tlb_remove_table_smp_sync(void *arg)
{
diff --git a/arch/sparc/include/asm/pgalloc_64.h b/arch/sparc/include/asm/pgalloc_64.h
index 40b2d7a..d10913a 100644
--- a/arch/sparc/include/asm/pgalloc_64.h
+++ b/arch/sparc/include/asm/pgalloc_64.h
@@ -106,6 +106,7 @@ static inline void __tlb_remove_table(void *_table)
is_page = true;
pgtable_free(table, is_page);
}
+#define __tlb_remove_table __tlb_remove_table
#else /* CONFIG_SMP */
static inline void pgtable_free_tlb(struct mmu_gather *tlb, void *table, bool is_page)
{
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index 8573b83..34fa168 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -51,21 +51,21 @@ void ___pte_free_tlb(struct mmu_gather *tlb, struct page *pte)
{
pgtable_page_dtor(pte);
paravirt_release_pte(page_to_pfn(pte));
- tlb_remove_page(tlb, pte);
+ tlb_remove_table(tlb, pte);
}

#if PAGETABLE_LEVELS > 2
void ___pmd_free_tlb(struct mmu_gather *tlb, pmd_t *pmd)
{
paravirt_release_pmd(__pa(pmd) >> PAGE_SHIFT);
- tlb_remove_page(tlb, virt_to_page(pmd));
+ tlb_remove_table(tlb, virt_to_page(pmd));
}

#if PAGETABLE_LEVELS > 3
void ___pud_free_tlb(struct mmu_gather *tlb, pud_t *pud)
{
paravirt_release_pud(__pa(pud) >> PAGE_SHIFT);
- tlb_remove_page(tlb, virt_to_page(pud));
+ tlb_remove_table(tlb, virt_to_page(pud));
}
#endif /* PAGETABLE_LEVELS > 3 */
#endif /* PAGETABLE_LEVELS > 2 */
diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index f96a5b5..9ac30f7 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -19,6 +19,8 @@
#include <asm/pgalloc.h>
#include <asm/tlbflush.h>

+static inline void tlb_remove_page(struct mmu_gather *tlb, struct page *page);
+
#ifdef CONFIG_HAVE_RCU_TABLE_FREE
/*
* Semi RCU freeing of the page directories.
@@ -60,6 +62,13 @@ struct mmu_table_batch {
extern void tlb_table_flush(struct mmu_gather *tlb);
extern void tlb_remove_table(struct mmu_gather *tlb, void *table);

+#else
+
+static inline void tlb_remove_table(struct mmu_gather *tlb, void *page)
+{
+ tlb_remove_page(tlb, page);
+}
+
#endif

/*
diff --git a/mm/memory.c b/mm/memory.c
index 6105f47..c12685d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -297,6 +297,13 @@ int __tlb_remove_page(struct mmu_gather *tlb, struct page *page)
* See the comment near struct mmu_table_batch.
*/

+#ifndef __tlb_remove_table
+static void __tlb_remove_table(void *table)
+{
+ free_page_and_swap_cache(table);
+}
+#endif
+
static void tlb_remove_table_smp_sync(void *arg)
{
/* Simply deliver the interrupt */

2012-06-04 05:14:41

by Nikunj A Dadhania

Subject: [PATCH v2 1/7] KVM Guest: Add VCPU running/pre-empted state for guest

This patch adds the guest code for an MSR shared between guest and
hypervisor. The MSR exports the vcpu running/preempted information to
the guest from the host. This enables the guest to intelligently send
IPIs to running vcpus, and to set a flag for preempted vcpus, thereby
avoiding waits on vcpus that are not running.
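
Each cpu registers its per-cpu area with the host by writing its
physical address into the new MSR (condensed from the diff below):

        wrmsrl(MSR_KVM_VCPU_STATE, __pa(v_state) | KVM_MSR_ENABLED);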

Suggested-by: Peter Zijlstra <[email protected]>
Signed-off-by: Nikunj A. Dadhania <[email protected]>
---
arch/x86/include/asm/kvm_para.h | 10 ++++++++++
arch/x86/kernel/kvm.c | 33 +++++++++++++++++++++++++++++++++
2 files changed, 43 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index 77266d3..f57b5cc 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -24,6 +24,7 @@
#define KVM_FEATURE_ASYNC_PF 4
#define KVM_FEATURE_STEAL_TIME 5
#define KVM_FEATURE_PV_UNHALT 6
+#define KVM_FEATURE_VCPU_STATE 7

/* The last 8 bits are used to indicate how to interpret the flags field
* in pvclock structure. If no bits are set, all flags are ignored.
@@ -39,6 +40,7 @@
#define MSR_KVM_SYSTEM_TIME_NEW 0x4b564d01
#define MSR_KVM_ASYNC_PF_EN 0x4b564d02
#define MSR_KVM_STEAL_TIME 0x4b564d03
+#define MSR_KVM_VCPU_STATE 0x4b564d04

struct kvm_steal_time {
__u64 steal;
@@ -51,6 +53,14 @@ struct kvm_steal_time {
#define KVM_STEAL_VALID_BITS ((-1ULL << (KVM_STEAL_ALIGNMENT_BITS + 1)))
#define KVM_STEAL_RESERVED_MASK (((1 << KVM_STEAL_ALIGNMENT_BITS) - 1 ) << 1)

+struct kvm_vcpu_state {
+ __u32 state;
+ __u32 pad[15];
+};
+
+#define KVM_VCPU_STATE_ALIGN_BITS 5
+#define KVM_VCPU_STATE_VALID_BITS ((-1ULL << (KVM_VCPU_STATE_ALIGN_BITS + 1)))
+
#define KVM_MAX_MMU_OP_BATCH 32

#define KVM_ASYNC_PF_ENABLED (1 << 0)
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 98f0378..bb686a6 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -64,6 +64,9 @@ static DEFINE_PER_CPU(struct kvm_vcpu_pv_apf_data, apf_reason) __aligned(64);
static DEFINE_PER_CPU(struct kvm_steal_time, steal_time) __aligned(64);
static int has_steal_clock = 0;

+DEFINE_PER_CPU(struct kvm_vcpu_state, vcpu_state) __aligned(64);
+static int has_vcpu_state;
+
/*
* No need for any "IO delay" on KVM
*/
@@ -291,6 +294,22 @@ static void kvm_register_steal_time(void)
cpu, __pa(st));
}

+static void kvm_register_vcpu_state(void)
+{
+ int cpu = smp_processor_id();
+ struct kvm_vcpu_state *v_state;
+
+ if (!has_vcpu_state)
+ return;
+
+ v_state = &per_cpu(vcpu_state, cpu);
+ memset(v_state, 0, sizeof(*v_state));
+
+ wrmsrl(MSR_KVM_VCPU_STATE, (__pa(v_state) | KVM_MSR_ENABLED));
+ printk(KERN_INFO "kvm-vcpustate: cpu %d, msr %lu\n",
+ cpu, __pa(v_state));
+}
+
void __cpuinit kvm_guest_cpu_init(void)
{
if (!kvm_para_available())
@@ -310,6 +329,9 @@ void __cpuinit kvm_guest_cpu_init(void)

if (has_steal_clock)
kvm_register_steal_time();
+
+ if (has_vcpu_state)
+ kvm_register_vcpu_state();
}

static void kvm_pv_disable_apf(void *unused)
@@ -361,6 +383,14 @@ void kvm_disable_steal_time(void)
wrmsr(MSR_KVM_STEAL_TIME, 0, 0);
}

+void kvm_disable_vcpu_state(void)
+{
+ if (!has_vcpu_state)
+ return;
+
+ wrmsr(MSR_KVM_VCPU_STATE, 0, 0);
+}
+
#ifdef CONFIG_SMP
static void __init kvm_smp_prepare_boot_cpu(void)
{
@@ -379,6 +409,7 @@ static void __cpuinit kvm_guest_cpu_online(void *dummy)

static void kvm_guest_cpu_offline(void *dummy)
{
+ kvm_disable_vcpu_state();
kvm_disable_steal_time();
kvm_pv_disable_apf(NULL);
apf_task_wake_all();
@@ -433,6 +464,8 @@ void __init kvm_guest_init(void)
pv_time_ops.steal_clock = kvm_steal_clock;
}

+ has_vcpu_state = 1;
+
#ifdef CONFIG_SMP
smp_ops.smp_prepare_boot_cpu = kvm_smp_prepare_boot_cpu;
register_cpu_notifier(&kvm_cpu_notifier);

2012-06-05 10:48:12

by Stefano Stabellini

Subject: Re: [PATCH v2 6/7] kvm,x86: RCU based table free

On Mon, 4 Jun 2012, Nikunj A. Dadhania wrote:
> From: Peter Zijlstra <[email protected]>
>
> get_user_pages_fast() depends on the IPI to hold off page table
> teardown while they are locklessly walked with interrupts disabled.
> If a vcpu were to be preempted while in this critical section, another
> vcpu tearing down page tables would go ahead and destroy them. when
> the preempted vcpu resumes it then touches the freed pages.
>
> Using HAVE_RCU_TABLE_FREE:
>
> By using call_rcu_sched() to free the page-tables you'd need to
> receive and process at least one tick on the woken up cpu after the
> freeing, but since the in-progress gup_fast() will have IRQs disabled
> this will be delayed.
>
> http://article.gmane.org/gmane.linux.kernel/1290539
>
> Tested-by: Nikunj A. Dadhania <[email protected]>
> Signed-off-by: Nikunj A. Dadhania <[email protected]>
>
> arch/powerpc/include/asm/pgalloc.h | 1 +
> arch/s390/mm/pgtable.c | 1 +
> arch/sparc/include/asm/pgalloc_64.h | 1 +
> arch/x86/mm/pgtable.c | 6 +++---
> include/asm-generic/tlb.h | 9 +++++++++
> mm/memory.c | 7 +++++++
> 6 files changed, 22 insertions(+), 3 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/pgalloc.h b/arch/powerpc/include/asm/pgalloc.h
> index bf301ac..c33ae79 100644
> --- a/arch/powerpc/include/asm/pgalloc.h
> +++ b/arch/powerpc/include/asm/pgalloc.h
> @@ -49,6 +49,7 @@ static inline void __tlb_remove_table(void *_table)
>
> pgtable_free(table, shift);
> }
> +#define __tlb_remove_table __tlb_remove_table
> #else /* CONFIG_SMP */
> static inline void pgtable_free_tlb(struct mmu_gather *tlb, void *table, unsigned shift)
> {
> diff --git a/arch/s390/mm/pgtable.c b/arch/s390/mm/pgtable.c
> index 6e765bf..7029ed7 100644
> --- a/arch/s390/mm/pgtable.c
> +++ b/arch/s390/mm/pgtable.c
> @@ -730,6 +730,7 @@ void __tlb_remove_table(void *_table)
> else
> free_pages((unsigned long) table, ALLOC_ORDER);
> }
> +#define __tlb_remove_table __tlb_remove_table
>
> static void tlb_remove_table_smp_sync(void *arg)
> {
> diff --git a/arch/sparc/include/asm/pgalloc_64.h b/arch/sparc/include/asm/pgalloc_64.h
> index 40b2d7a..d10913a 100644
> --- a/arch/sparc/include/asm/pgalloc_64.h
> +++ b/arch/sparc/include/asm/pgalloc_64.h
> @@ -106,6 +106,7 @@ static inline void __tlb_remove_table(void *_table)
> is_page = true;
> pgtable_free(table, is_page);
> }
> +#define __tlb_remove_table __tlb_remove_table
> #else /* CONFIG_SMP */
> static inline void pgtable_free_tlb(struct mmu_gather *tlb, void *table, bool is_page)
> {
> diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
> index 8573b83..34fa168 100644
> --- a/arch/x86/mm/pgtable.c
> +++ b/arch/x86/mm/pgtable.c
> @@ -51,21 +51,21 @@ void ___pte_free_tlb(struct mmu_gather *tlb, struct page *pte)
> {
> pgtable_page_dtor(pte);
> paravirt_release_pte(page_to_pfn(pte));
> - tlb_remove_page(tlb, pte);
> + tlb_remove_table(tlb, pte);
> }
>
> #if PAGETABLE_LEVELS > 2
> void ___pmd_free_tlb(struct mmu_gather *tlb, pmd_t *pmd)
> {
> paravirt_release_pmd(__pa(pmd) >> PAGE_SHIFT);
> - tlb_remove_page(tlb, virt_to_page(pmd));
> + tlb_remove_table(tlb, virt_to_page(pmd));
> }
>
> #if PAGETABLE_LEVELS > 3
> void ___pud_free_tlb(struct mmu_gather *tlb, pud_t *pud)
> {
> paravirt_release_pud(__pa(pud) >> PAGE_SHIFT);
> - tlb_remove_page(tlb, virt_to_page(pud));
> + tlb_remove_table(tlb, virt_to_page(pud));
> }
> #endif /* PAGETABLE_LEVELS > 3 */
> #endif /* PAGETABLE_LEVELS > 2 */
> diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
> index f96a5b5..9ac30f7 100644
> --- a/include/asm-generic/tlb.h
> +++ b/include/asm-generic/tlb.h
> @@ -19,6 +19,8 @@
> #include <asm/pgalloc.h>
> #include <asm/tlbflush.h>
>
> +static inline void tlb_remove_page(struct mmu_gather *tlb, struct page *page);
> +
> #ifdef CONFIG_HAVE_RCU_TABLE_FREE
> /*
> * Semi RCU freeing of the page directories.
> @@ -60,6 +62,13 @@ struct mmu_table_batch {
> extern void tlb_table_flush(struct mmu_gather *tlb);
> extern void tlb_remove_table(struct mmu_gather *tlb, void *table);
>
> +#else
> +
> +static inline void tlb_remove_table(struct mmu_gather *tlb, void *page)
> +{
> + tlb_remove_page(tlb, page);
> +}
> +
> #endif
>
> /*
> diff --git a/mm/memory.c b/mm/memory.c
> index 6105f47..c12685d 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -297,6 +297,13 @@ int __tlb_remove_page(struct mmu_gather *tlb, struct page *page)
> * See the comment near struct mmu_table_batch.
> */
>
> +#ifndef __tlb_remove_table
> +static void __tlb_remove_table(void *table)
> +{
> + free_page_and_swap_cache(table);
> +}
> +#endif
> +
> static void tlb_remove_table_smp_sync(void *arg)
> {
> /* Simply deliver the interrupt */


I am also interested in introducing HAVE_RCU_TABLE_FREE on x86 for Xen.
Maybe we can pull our efforts together :-)

Giving a look at this patch, it doesn't look like it is introducing
CONFIG_HAVE_RCU_TABLE_FREE anywhere under arch/x86.
How is the user supposed to set it?

I am appending the version of this patch I was working on: it introduces
a pvop in order not to force HAVE_RCU_TABLE_FREE on native x86.


From cbb816810f17b57627285383c3d0f771dc89939f Mon Sep 17 00:00:00 2001
From: Peter Zijlstra <[email protected]>
Date: Tue, 22 May 2012 11:39:44 +0000
Subject: [PATCH] [RFC] Introduce a new pv_mmu_op to free a directory page

At the moment get_user_pages_fast is unsafe on Xen, because it relies
on the fact that flush_tlb_others sends an IPI to flush the tlb, but
xen_flush_tlb_others doesn't send any IPIs and always returns
successfully and immediately.

The kernel offers HAVE_RCU_TABLE_FREE, which enables an RCU lock to
protect this kind of software pagetable walk (see
include/asm-generic/tlb.h).
The goal of this patch is to enable HAVE_RCU_TABLE_FREE on Xen without
impacting the native x86 case.

The original patch to enable HAVE_RCU_TABLE_FREE on x86 is by Peter
Zijlstra, see http://marc.info/?l=linux-kernel&m=133595422305515&w=2; I
have only modified it so that it enables HAVE_RCU_TABLE_FREE on Xen
but not on native.
It does so by introducing a new pv_mmu_op to free a directory page:
pv_mmu_ops.tlb_remove_table.
The pvop is set to tlb_remove_page on native and to tlb_remove_table on
Xen.

The result is that if CONFIG_XEN is disabled, the behavior is the same
as today.
If CONFIG_XEN is enabled and we are running on native,
HAVE_RCU_TABLE_FREE is set but tlb_remove_table is never called: we
still call tlb_remove_page so there should be no performance penalty.
If CONFIG_XEN is enabled and we are running on Xen we make full usage of
the RCU directory page freeing as described in tlb.h.

Please advise on how to proceed: is it correct to enable
HAVE_RCU_TABLE_FREE but never call tlb_remove_table?
Is the pvop really the best thing to do here?
Maybe we can just implement get_user_pages_fast as a call to
get_user_pages on Xen?

Signed-off-by: Peter Zijlstra <[email protected]>
Signed-off-by: Stefano Stabellini <[email protected]>
---
arch/powerpc/include/asm/pgalloc.h | 1 +
arch/s390/mm/pgtable.c | 1 +
arch/sparc/include/asm/pgalloc_64.h | 1 +
arch/x86/include/asm/paravirt.h | 6 ++++++
arch/x86/include/asm/paravirt_types.h | 3 +++
arch/x86/include/asm/pgalloc.h | 1 +
arch/x86/kernel/paravirt.c | 2 ++
arch/x86/mm/pgtable.c | 6 +++---
arch/x86/xen/Kconfig | 1 +
arch/x86/xen/mmu.c | 7 +++++++
mm/memory.c | 7 +++++++
11 files changed, 33 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/include/asm/pgalloc.h b/arch/powerpc/include/asm/pgalloc.h
index bf301ac..c33ae79 100644
--- a/arch/powerpc/include/asm/pgalloc.h
+++ b/arch/powerpc/include/asm/pgalloc.h
@@ -49,6 +49,7 @@ static inline void __tlb_remove_table(void *_table)

pgtable_free(table, shift);
}
+#define __tlb_remove_table __tlb_remove_table
#else /* CONFIG_SMP */
static inline void pgtable_free_tlb(struct mmu_gather *tlb, void *table, unsigned shift)
{
diff --git a/arch/s390/mm/pgtable.c b/arch/s390/mm/pgtable.c
index 6e765bf..7029ed7 100644
--- a/arch/s390/mm/pgtable.c
+++ b/arch/s390/mm/pgtable.c
@@ -730,6 +730,7 @@ void __tlb_remove_table(void *_table)
else
free_pages((unsigned long) table, ALLOC_ORDER);
}
+#define __tlb_remove_table __tlb_remove_table

static void tlb_remove_table_smp_sync(void *arg)
{
diff --git a/arch/sparc/include/asm/pgalloc_64.h b/arch/sparc/include/asm/pgalloc_64.h
index 40b2d7a..d10913a 100644
--- a/arch/sparc/include/asm/pgalloc_64.h
+++ b/arch/sparc/include/asm/pgalloc_64.h
@@ -106,6 +106,7 @@ static inline void __tlb_remove_table(void *_table)
is_page = true;
pgtable_free(table, is_page);
}
+#define __tlb_remove_table __tlb_remove_table
#else /* CONFIG_SMP */
static inline void pgtable_free_tlb(struct mmu_gather *tlb, void *table, bool is_page)
{
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index aa0f913..42c6a9b 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -402,6 +402,12 @@ static inline void flush_tlb_others(const struct cpumask *cpumask,
PVOP_VCALL3(pv_mmu_ops.flush_tlb_others, cpumask, mm, va);
}

+static inline void paravirt_tlb_remove_table(struct mmu_gather *tlb,
+ struct page *page)
+{
+ PVOP_VCALL2(pv_mmu_ops.tlb_remove_table, tlb, page);
+}
+
static inline int paravirt_pgd_alloc(struct mm_struct *mm)
{
return PVOP_CALL1(int, pv_mmu_ops.pgd_alloc, mm);
diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index 8e8b9a4..7e5ad6d 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -51,6 +51,7 @@ struct mm_struct;
struct desc_struct;
struct task_struct;
struct cpumask;
+struct mmu_gather;

/*
* Wrapper type for pointers to code which uses the non-standard
@@ -251,6 +252,8 @@ struct pv_mmu_ops {
void (*flush_tlb_others)(const struct cpumask *cpus,
struct mm_struct *mm,
unsigned long va);
+ /* free a directory page */
+ void (*tlb_remove_table)(struct mmu_gather *tlb, struct page *page);

/* Hooks for allocating and freeing a pagetable top-level */
int (*pgd_alloc)(struct mm_struct *mm);
diff --git a/arch/x86/include/asm/pgalloc.h b/arch/x86/include/asm/pgalloc.h
index b4389a4..0cc3251 100644
--- a/arch/x86/include/asm/pgalloc.h
+++ b/arch/x86/include/asm/pgalloc.h
@@ -20,6 +20,7 @@ static inline void paravirt_alloc_pud(struct mm_struct *mm, unsigned long pfn) {
static inline void paravirt_release_pte(unsigned long pfn) {}
static inline void paravirt_release_pmd(unsigned long pfn) {}
static inline void paravirt_release_pud(unsigned long pfn) {}
+#define paravirt_tlb_remove_table(tlb, page) tlb_remove_page(tlb, page)
#endif

/*
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index ab13760..370b3b4 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -37,6 +37,7 @@
#include <asm/fixmap.h>
#include <asm/apic.h>
#include <asm/tlbflush.h>
+#include <asm/tlb.h>
#include <asm/timer.h>
#include <asm/special_insns.h>

@@ -422,6 +423,7 @@ struct pv_mmu_ops pv_mmu_ops = {
.flush_tlb_kernel = native_flush_tlb_global,
.flush_tlb_single = native_flush_tlb_single,
.flush_tlb_others = native_flush_tlb_others,
+ .tlb_remove_table = tlb_remove_page,

.pgd_alloc = __paravirt_pgd_alloc,
.pgd_free = paravirt_nop,
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index 8573b83..904d45c 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -51,21 +51,21 @@ void ___pte_free_tlb(struct mmu_gather *tlb, struct page *pte)
{
pgtable_page_dtor(pte);
paravirt_release_pte(page_to_pfn(pte));
- tlb_remove_page(tlb, pte);
+ paravirt_tlb_remove_table(tlb, pte);
}

#if PAGETABLE_LEVELS > 2
void ___pmd_free_tlb(struct mmu_gather *tlb, pmd_t *pmd)
{
paravirt_release_pmd(__pa(pmd) >> PAGE_SHIFT);
- tlb_remove_page(tlb, virt_to_page(pmd));
+ paravirt_tlb_remove_table(tlb, virt_to_page(pmd));
}

#if PAGETABLE_LEVELS > 3
void ___pud_free_tlb(struct mmu_gather *tlb, pud_t *pud)
{
paravirt_release_pud(__pa(pud) >> PAGE_SHIFT);
- tlb_remove_page(tlb, virt_to_page(pud));
+ paravirt_tlb_remove_table(tlb, virt_to_page(pud));
}
#endif /* PAGETABLE_LEVELS > 3 */
#endif /* PAGETABLE_LEVELS > 2 */
diff --git a/arch/x86/xen/Kconfig b/arch/x86/xen/Kconfig
index fdce49c..e2efb29 100644
--- a/arch/x86/xen/Kconfig
+++ b/arch/x86/xen/Kconfig
@@ -6,6 +6,7 @@ config XEN
bool "Xen guest support"
select PARAVIRT
select PARAVIRT_CLOCK
+ select HAVE_RCU_TABLE_FREE if SMP
depends on X86_64 || (X86_32 && X86_PAE && !X86_VISWS)
depends on X86_CMPXCHG && X86_TSC
help
diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
index 69f5857..a58153b 100644
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -62,6 +62,7 @@
#include <asm/init.h>
#include <asm/pat.h>
#include <asm/smp.h>
+#include <asm/tlb.h>

#include <asm/xen/hypercall.h>
#include <asm/xen/hypervisor.h>
@@ -1281,6 +1282,11 @@ static void xen_flush_tlb_others(const struct cpumask *cpus,
xen_mc_issue(PARAVIRT_LAZY_MMU);
}

+static void xen_tlb_remove_table(struct mmu_gather *tlb, struct page *page)
+{
+ tlb_remove_table(tlb, page);
+}
+
static unsigned long xen_read_cr3(void)
{
return this_cpu_read(xen_cr3);
@@ -2006,6 +2012,7 @@ static const struct pv_mmu_ops xen_mmu_ops __initconst = {
.flush_tlb_kernel = xen_flush_tlb,
.flush_tlb_single = xen_flush_tlb_single,
.flush_tlb_others = xen_flush_tlb_others,
+ .tlb_remove_table = xen_tlb_remove_table,

.pte_update = paravirt_nop,
.pte_update_defer = paravirt_nop,
diff --git a/mm/memory.c b/mm/memory.c
index 6105f47..c12685d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -297,6 +297,13 @@ int __tlb_remove_page(struct mmu_gather *tlb, struct page *page)
* See the comment near struct mmu_table_batch.
*/

+#ifndef __tlb_remove_table
+static void __tlb_remove_table(void *table)
+{
+ free_page_and_swap_cache(table);
+}
+#endif
+
static void tlb_remove_table_smp_sync(void *arg)
{
/* Simply deliver the interrupt */
--
1.7.2.5

2012-06-05 11:09:33

by Nikunj A Dadhania

Subject: Re: [PATCH v2 6/7] kvm,x86: RCU based table free

On Tue, 5 Jun 2012 11:48:02 +0100, Stefano Stabellini <[email protected]> wrote:
>
> I am also interested in introducing HAVE_RCU_TABLE_FREE on x86 for Xen.
> Maybe we can pull our efforts together :-)
>
> Giving a look at this patch, it doesn't look like it is introducing
> CONFIG_HAVE_RCU_TABLE_FREE anywhere under arch/x86.
> How is the user supposed to set it?
>
I am doing that in the next patch only for KVM-ParavirtTLB flush, as
there is a bug in this implementation that patch [7/7] fixes.

Refer following thread for details:
http://mid.gmane.org/1337254086.4281.26.camel@twins
http://mid.gmane.org/1337273959.4281.62.camel@twins

Regards
Nikunj

2012-06-05 11:58:41

by Stefano Stabellini

Subject: Re: [PATCH v2 6/7] kvm,x86: RCU based table free

On Tue, 5 Jun 2012, Nikunj A Dadhania wrote:
> On Tue, 5 Jun 2012 11:48:02 +0100, Stefano Stabellini <[email protected]> wrote:
> >
> > I am also interested in introducing HAVE_RCU_TABLE_FREE on x86 for Xen.
> > Maybe we can pull our efforts together :-)
> >
> > Giving a look at this patch, it doesn't look like it is introducing
> > CONFIG_HAVE_RCU_TABLE_FREE anywhere under arch/x86.
> > How is the user supposed to set it?
> >
> I am doing that in the next patch only for KVM-ParavirtTLB flush, as
> there is a bug in this implementation that patch [7/7] fixes.
>
> Refer following thread for details:
> http://mid.gmane.org/1337254086.4281.26.camel@twins
> http://mid.gmane.org/1337273959.4281.62.camel@twins

Thanks, somehow I missed the 7/7 patch.

From the Xen POV, your patch is fine because we'll just select
PARAVIRT_TLB_FLUSH on CONFIG_XEN (see appended patch for completeness).

The main difference between the two approaches is that a kernel with
PARAVIRT_TLB_FLUSH and/or CONFIG_XEN enabled is going to have
HAVE_RCU_TABLE_FREE even when running on native.

Are you proposing this series for 3.5?
If not (because it depends on ticketlocks and KVM Paravirt Spinlock
patches), could you extract patch 6/7 and 7/7 and send them out
separately?
I am saying this because Xen needs the HAVE_RCU_TABLE_FREE fix even if
the pv ticketlocks are not accepted. This is an outstanding bug for
us, unfortunately.

---

xen: select PARAVIRT_TLB_FLUSH if SMP

At the moment get_user_pages_fast is unsafe on Xen, because it relies
on the fact that flush_tlb_others sends an IPI to flush the tlb, but
xen_flush_tlb_others doesn't send any IPIs and always returns
successfully and immediately.

Select PARAVIRT_TLB_FLUSH, which enables an RCU lock to protect this
kind of software pagetable walk (also see HAVE_RCU_TABLE_FREE and
include/asm-generic/tlb.h).

Signed-off-by: Stefano Stabellini <[email protected]>

diff --git a/arch/x86/xen/Kconfig b/arch/x86/xen/Kconfig
index fdce49c..18c9876 100644
--- a/arch/x86/xen/Kconfig
+++ b/arch/x86/xen/Kconfig
@@ -6,6 +6,7 @@ config XEN
bool "Xen guest support"
select PARAVIRT
select PARAVIRT_CLOCK
+ select PARAVIRT_TLB_FLUSH if SMP
depends on X86_64 || (X86_32 && X86_PAE && !X86_VISWS)
depends on X86_CMPXCHG && X86_TSC
help

2012-06-05 13:08:35

by Peter Zijlstra

Subject: Re: [PATCH v2 6/7] kvm,x86: RCU based table free

On Tue, 2012-06-05 at 18:34 +0530, Nikunj A Dadhania wrote:
> PeterZ, is 7/7 alright to be picked?

Yeah, I guess it is.. I haven't had time to rework my tlb series yet
though. But these two patches together should make it work for x86.

2012-06-05 13:09:54

by Nikunj A Dadhania

Subject: Re: [PATCH v2 6/7] kvm,x86: RCU based table free

On Tue, 5 Jun 2012 12:58:32 +0100, Stefano Stabellini <[email protected]> wrote:
> On Tue, 5 Jun 2012, Nikunj A Dadhania wrote:
> > On Tue, 5 Jun 2012 11:48:02 +0100, Stefano Stabellini <[email protected]> wrote:
> > >
> > > I am also interested in introducing HAVE_RCU_TABLE_FREE on x86 for Xen.
> > > Maybe we can pull our efforts together :-)
> > >
> > > Giving a look at this patch, it doesn't look like it is introducing
> > > CONFIG_HAVE_RCU_TABLE_FREE anywhere under arch/x86.
> > > How is the user supposed to set it?
> > >
> > I am doing that in the next patch only for KVM-ParavirtTLB flush, as
> > there is a bug in this implementation that patch [7/7] fixes.
> >
> > Refer following thread for details:
> > http://mid.gmane.org/1337254086.4281.26.camel@twins
> > http://mid.gmane.org/1337273959.4281.62.camel@twins
>
> Thanks, somehow I missed the 7/7 patch.
>
> From the Xen POV, your patch is fine because we'll just select
> PARAVIRT_TLB_FLUSH on CONFIG_XEN (see appended patch for completeness).
>
Selecting ARCH_HW_WALKS_PAGE_TABLE in place of PARAVIRT_TLB_FLUSH should
suffice.

> The main difference between the two approaches is that a kernel with
> PARAVIRT_TLB_FLUSH and/or CONFIG_XEN enabled is going to have
> HAVE_RCU_TABLE_FREE even when running on native.
>
> Are you proposing this series for 3.5?
> If not (because it depends on ticketlocks and KVM Paravirt Spinlock
> patches),
>
3.6, I suppose, as the merge window is already closed and we are
having some discussions on the PLE results.

> could you extract patch 6/7 and 7/7 and send them out
> separately?
>
> I am saying this because Xen needs the HAVE_RCU_TABLE_FREE fix even if
> pv ticketlock are not accepted. This is an outstanding bug for us
> unfortunately.
>
PeterZ has a patch in his tlb-unify:

mm, x86: Add HAVE_RCU_TABLE_FREE support

Implements optional HAVE_RCU_TABLE_FREE support for x86.

This is useful for things like Xen and KVM where paravirt tlb flush
means the software page table walkers like GUP-fast cannot rely on
IRQs disabling like regular x86 can.

Cc: Nikunj A Dadhania <[email protected]>
Cc: Jeremy Fitzhardinge <[email protected]>
Cc: Avi Kivity <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>

http://git.kernel.org/?p=linux/kernel/git/peterz/mmu.git;a=commit;h=8a7e6fa5be9d2645c3394892c870113e6e5d9309

PeterZ, is 7/7 alright to be picked?

Regards
Nikunj

2012-06-05 13:21:59

by Stefano Stabellini

Subject: Re: [PATCH v2 6/7] kvm,x86: RCU based table free

On Tue, 5 Jun 2012, Nikunj A Dadhania wrote:
> On Tue, 5 Jun 2012 12:58:32 +0100, Stefano Stabellini <[email protected]> wrote:
> > On Tue, 5 Jun 2012, Nikunj A Dadhania wrote:
> > > On Tue, 5 Jun 2012 11:48:02 +0100, Stefano Stabellini <[email protected]> wrote:
> > > >
> > > > I am also interested in introducing HAVE_RCU_TABLE_FREE on x86 for Xen.
> > > > Maybe we can pull our efforts together :-)
> > > >
> > > > Giving a look at this patch, it doesn't look like it is introducing
> > > > CONFIG_HAVE_RCU_TABLE_FREE anywhere under arch/x86.
> > > > How is the user supposed to set it?
> > > >
> > > I am doing that in the next patch only for KVM-ParavirtTLB flush, as
> > > there is a bug in this implementation that patch [7/7] fixes.
> > >
> > > Refer following thread for details:
> > > http://mid.gmane.org/1337254086.4281.26.camel@twins
> > > http://mid.gmane.org/1337273959.4281.62.camel@twins
> >
> > Thanks, somehow I missed the 7/7 patch.
> >
> > From the Xen POV, your patch is fine because we'll just select
> > PARAVIRT_TLB_FLUSH on CONFIG_XEN (see appended patch for completeness).
> >
> Selecting ARCH_HW_WALKS_PAGE_TABLE in place of PARAVIRT_TLB_FLUSH should
> suffice.

We would also need to select HAVE_RCU_TABLE_FREE, but it could be a
good idea not to go through PARAVIRT_TLB_FLUSH.

2012-06-05 13:26:44

by Stefano Stabellini

Subject: Re: [PATCH v2 6/7] kvm,x86: RCU based table free

On Tue, 5 Jun 2012, Peter Zijlstra wrote:
> On Tue, 2012-06-05 at 18:34 +0530, Nikunj A Dadhania wrote:
> > PeterZ, is 7/7 alright to be picked?
>
> Yeah, I guess it is.. I haven't had time to rework my tlb series yet
> though. But these two patches together should make it work for x86.
>

Good. Do you think they are OK for 3.5-rc2? Or is it better to wait for
3.6?

2012-06-05 13:31:55

by Peter Zijlstra

Subject: Re: [PATCH v2 6/7] kvm,x86: RCU based table free

On Tue, 2012-06-05 at 14:26 +0100, Stefano Stabellini wrote:
> On Tue, 5 Jun 2012, Peter Zijlstra wrote:
> > On Tue, 2012-06-05 at 18:34 +0530, Nikunj A Dadhania wrote:
> > > PeterZ, is 7/7 alright to be picked?
> >
> > Yeah, I guess it is.. I haven't had time to rework my tlb series yet
> > though. But these two patches together should make it work for x86.
> >
>
> Good. Do you think they are OK for 3.5-rc2? Or is it better to wait for
> 3.6?

I wouldn't rush this stuff, but if you're up for it and can convince
Linus that its all peachy you can try ;-)

2012-06-05 13:41:37

by Stefano Stabellini

Subject: Re: [PATCH v2 6/7] kvm,x86: RCU based table free

On Tue, 5 Jun 2012, Peter Zijlstra wrote:
> On Tue, 2012-06-05 at 14:26 +0100, Stefano Stabellini wrote:
> > On Tue, 5 Jun 2012, Peter Zijlstra wrote:
> > > On Tue, 2012-06-05 at 18:34 +0530, Nikunj A Dadhania wrote:
> > > > PeterZ, is 7/7 alright to be picked?
> > >
> > > Yeah, I guess it is.. I haven't had time to rework my tlb series yet
> > > though. But these two patches together should make it work for x86.
> > >
> >
> > Good. Do you think they are OK for 3.5-rc2? Or is it better to wait for
> > 3.6?
>
> I wouldn't rush this stuff, but if you're up for it and can convince
> Linus that its all peachy you can try ;-)
>

Fair enough, I think I'll let Konrad make this call :-)

2012-06-05 15:30:20

by Nikunj A Dadhania

Subject: Re: [PATCH v2 6/7] kvm,x86: RCU based table free

On Tue, 05 Jun 2012 15:08:08 +0200, Peter Zijlstra <[email protected]> wrote:
> On Tue, 2012-06-05 at 18:34 +0530, Nikunj A Dadhania wrote:
> > PeterZ, is 7/7 alright to be picked?
>
> Yeah, I guess it is.. I haven't had time to rework my tlb series yet
> though. But these two patches together should make it work for x86.
>
I haven't added your SOB to this yet, though I have your name
mentioned in "From". Should I add your SOB? I have added a minor fix
for the !CONFIG_HAVE_RCU_TABLE_FREE case.

Regards
Nikunj

2012-06-12 23:02:54

by Marcelo Tosatti

Subject: Re: [PATCH v2 1/7] KVM Guest: Add VCPU running/pre-empted state for guest

On Mon, Jun 04, 2012 at 10:36:05AM +0530, Nikunj A. Dadhania wrote:
> The patch adds guest code for msr between guest and hypervisor. The
> msr will export the vcpu running/pre-empted information to the guest
> from host. This will enable guest to intelligently send ipi to running
> vcpus and set flag for pre-empted vcpus. This will prevent waiting for
> vcpus that are not running.
>
> Suggested-by: Peter Zijlstra <[email protected]>
> Signed-off-by: Nikunj A. Dadhania <[email protected]>
> ---
> arch/x86/include/asm/kvm_para.h | 10 ++++++++++
> arch/x86/kernel/kvm.c | 33 +++++++++++++++++++++++++++++++++
> 2 files changed, 43 insertions(+), 0 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
> index 77266d3..f57b5cc 100644
> --- a/arch/x86/include/asm/kvm_para.h
> +++ b/arch/x86/include/asm/kvm_para.h
> @@ -24,6 +24,7 @@
> #define KVM_FEATURE_ASYNC_PF 4
> #define KVM_FEATURE_STEAL_TIME 5
> #define KVM_FEATURE_PV_UNHALT 6
> +#define KVM_FEATURE_VCPU_STATE 7
>
> /* The last 8 bits are used to indicate how to interpret the flags field
> * in pvclock structure. If no bits are set, all flags are ignored.
> @@ -39,6 +40,7 @@
> #define MSR_KVM_SYSTEM_TIME_NEW 0x4b564d01
> #define MSR_KVM_ASYNC_PF_EN 0x4b564d02
> #define MSR_KVM_STEAL_TIME 0x4b564d03
> +#define MSR_KVM_VCPU_STATE 0x4b564d04
>
> struct kvm_steal_time {
> __u64 steal;
> @@ -51,6 +53,14 @@ struct kvm_steal_time {
> #define KVM_STEAL_VALID_BITS ((-1ULL << (KVM_STEAL_ALIGNMENT_BITS + 1)))
> #define KVM_STEAL_RESERVED_MASK (((1 << KVM_STEAL_ALIGNMENT_BITS) - 1 ) << 1)
>
> +struct kvm_vcpu_state {
> + __u32 state;
> + __u32 pad[15];
> +};
> +
> +#define KVM_VCPU_STATE_ALIGN_BITS 5
> +#define KVM_VCPU_STATE_VALID_BITS ((-1ULL << (KVM_VCPU_STATE_ALIGN_BITS + 1)))
> +
> #define KVM_MAX_MMU_OP_BATCH 32
>
> #define KVM_ASYNC_PF_ENABLED (1 << 0)
> diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
> index 98f0378..bb686a6 100644
> --- a/arch/x86/kernel/kvm.c
> +++ b/arch/x86/kernel/kvm.c
> @@ -64,6 +64,9 @@ static DEFINE_PER_CPU(struct kvm_vcpu_pv_apf_data, apf_reason) __aligned(64);
> static DEFINE_PER_CPU(struct kvm_steal_time, steal_time) __aligned(64);
> static int has_steal_clock = 0;
>
> +DEFINE_PER_CPU(struct kvm_vcpu_state, vcpu_state) __aligned(64);
> +static int has_vcpu_state;
> +
> /*
> * No need for any "IO delay" on KVM
> */
> @@ -291,6 +294,22 @@ static void kvm_register_steal_time(void)
> cpu, __pa(st));
> }
>
> +static void kvm_register_vcpu_state(void)
> +{
> + int cpu = smp_processor_id();
> + struct kvm_vcpu_state *v_state;
> +
> + if (!has_vcpu_state)
> + return;
> +
> + v_state = &per_cpu(vcpu_state, cpu);
> + memset(v_state, 0, sizeof(*v_state));
> +
> + wrmsrl(MSR_KVM_VCPU_STATE, (__pa(v_state) | KVM_MSR_ENABLED));
> + printk(KERN_INFO "kvm-vcpustate: cpu %d, msr %lu\n",
> + cpu, __pa(v_state));
> +}
> +
> void __cpuinit kvm_guest_cpu_init(void)
> {
> if (!kvm_para_available())
> @@ -310,6 +329,9 @@ void __cpuinit kvm_guest_cpu_init(void)
>
> if (has_steal_clock)
> kvm_register_steal_time();
> +
> + if (has_vcpu_state)
> + kvm_register_vcpu_state();
> }
>
> static void kvm_pv_disable_apf(void *unused)
> @@ -361,6 +383,14 @@ void kvm_disable_steal_time(void)
> wrmsr(MSR_KVM_STEAL_TIME, 0, 0);
> }
>
> +void kvm_disable_vcpu_state(void)
> +{
> + if (!has_vcpu_state)
> + return;
> +
> + wrmsr(MSR_KVM_VCPU_STATE, 0, 0);
> +}
> +
> #ifdef CONFIG_SMP
> static void __init kvm_smp_prepare_boot_cpu(void)
> {
> @@ -379,6 +409,7 @@ static void __cpuinit kvm_guest_cpu_online(void *dummy)
>
> static void kvm_guest_cpu_offline(void *dummy)
> {
> + kvm_disable_vcpu_state();
> kvm_disable_steal_time();
> kvm_pv_disable_apf(NULL);
> apf_task_wake_all();
> @@ -433,6 +464,8 @@ void __init kvm_guest_init(void)
> pv_time_ops.steal_clock = kvm_steal_clock;
> }
>
> + has_vcpu_state = 1;
> +

Should be checking for a feature bit, see kvm_para_has_feature()
examples above in the function.
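
For reference, the check being asked for would mirror the
KVM_FEATURE_STEAL_TIME handling earlier in kvm_guest_init(); a sketch
(KVM_FEATURE_VCPU_STATE is the bit patch 1/7 defines, and the host-side
patch would have to advertise it in CPUID):

    /* Gate the feature on the hypervisor-advertised CPUID bit. */
    if (kvm_para_has_feature(KVM_FEATURE_VCPU_STATE))
            has_vcpu_state = 1;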

2012-06-12 23:02:53

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: [PATCH v2 3/7] KVM: Add paravirt kvm_flush_tlb_others

On Mon, Jun 04, 2012 at 10:37:24AM +0530, Nikunj A. Dadhania wrote:
> flush_tlb_others_ipi depends on a lot of statics in tlb.c. Replicated
> the flush_tlb_others_ipi as kvm_flush_tlb_others to further adapt to
> paravirtualization.
>
> Use the vcpu state information inside the kvm_flush_tlb_others to
> avoid sending ipi to pre-empted vcpus.
>
> * Do not send ipi's to offline vcpus and set flush_on_enter flag
> * For online vcpus: Wait for them to clear the flag
>
> The approach was discussed here: https://lkml.org/lkml/2012/2/20/157
>
> Suggested-by: Peter Zijlstra <[email protected]>
> Signed-off-by: Nikunj A. Dadhania <[email protected]>

Why not reintroduce the hypercall to flush TLBs? No waiting, no
entry/exit trickery.

This is half-way-there paravirt with all the downsides. Even though the
guest running information might be useful in other cases.

> Pseudo Algo:
>
> Write()
> ======
>
>    guest_exit()
>        flush_on_enter[i]=0;
>        running[i] = 0;
>
>    guest_enter()
>        running[i] = 1;
>        smp_mb();
>        if(flush_on_enter[i]) {
>            tlb_flush()
>            flush_on_enter[i]=0;
>        }
>
>
> Read()
> ======
>
>        GUEST                          KVM-HV
>
>    f->flushcpumask = cpumask - me;
>
> again:
>    for_each_cpu(i, f->flushmask) {
>
>        if (!running[i]) {
>                                       case 1:
>                                       running[n]=1
>
>                                       (cpuN does not see
>                                       flush_on_enter set,
>                                       guest later finds it
>                                       running and sends ipi,
>                                       we are fine here, need
>                                       to clear the flag on
>                                       guest_exit)
>
>            flush_on_enter[i] = 1;
>                                       case2:
>                                       running[n]=1
>                                       (cpuN - will see flush
>                                       on enter and an IPI as
>                                       well - addressed in patch-4)
>
>            if (!running[i])
>                cpu_clear(f->flushmask);  All is well, vm_enter
>                                          will do the fixup
>        }
>                                       case 3:
>                                       running[n] = 0;
>
>                                       (cpuN went to sleep,
>                                       we saw it as awake,
>                                       ipi sent, but wait
>                                       will break without
>                                       zero_mask and goto
>                                       again will take care)
>
>    }
>    send_ipi(f->flushmask)
>
>    wait_a_while_for_zero_mask();
>
>    if (!zero_mask)
>        goto again;
> ---
> arch/x86/include/asm/kvm_para.h | 3 +-
> arch/x86/include/asm/tlbflush.h | 9 ++++++
> arch/x86/kernel/kvm.c | 1 +
> arch/x86/kvm/x86.c | 14 ++++++++-
> arch/x86/mm/tlb.c | 61 +++++++++++++++++++++++++++++++++++++++
> 5 files changed, 86 insertions(+), 2 deletions(-)
>
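
Rendered as C, the write-side of the pseudo algo above is small, but
the barrier is essential. A rough sketch with illustrative names: the
actual patches pack both bits into the kvm_vcpu_state page registered
via MSR_KVM_VCPU_STATE, and kvm_flush_guest_tlb() is a hypothetical
helper:

    /* Illustrative layout; the real patches pack this into 'state'. */
    struct shared_vcpu_state {
            u32 running;
            u32 flush_on_enter;
    };

    /* Hypervisor side, around guest entry/exit (sketch only). */
    static void vcpu_state_guest_enter(struct shared_vcpu_state *vs)
    {
            vs->running = 1;
            /*
             * Order the 'running' store against the 'flush_on_enter'
             * load; pairs with the guest reading running[i] before it
             * chooses between sending an IPI and setting the flag.
             */
            smp_mb();
            if (vs->flush_on_enter) {
                    kvm_flush_guest_tlb(vs);        /* hypothetical */
                    vs->flush_on_enter = 0;
            }
    }

    static void vcpu_state_guest_exit(struct shared_vcpu_state *vs)
    {
            /* A request raised while we ran was served by the IPI path. */
            vs->flush_on_enter = 0;
            vs->running = 0;
    }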

2012-06-19 06:04:44

by Nikunj A Dadhania

[permalink] [raw]
Subject: Re: [PATCH v2 1/7] KVM Guest: Add VCPU running/pre-empted state for guest

On Tue, 12 Jun 2012 19:43:10 -0300, Marcelo Tosatti <[email protected]> wrote:
> On Mon, Jun 04, 2012 at 10:36:05AM +0530, Nikunj A. Dadhania wrote:
> > The patch adds guest code for an msr shared between guest and
> > hypervisor. The msr will export the vcpu running/pre-empted
> > information from the host to the guest. This will enable the guest to
> > intelligently send ipis to running vcpus and set a flag for pre-empted
> > vcpus. This will prevent waiting for vcpus that are not running.
> >
> > Suggested-by: Peter Zijlstra <[email protected]>
> > Signed-off-by: Nikunj A. Dadhania <[email protected]>

[...]

> > @@ -433,6 +464,8 @@ void __init kvm_guest_init(void)
> > pv_time_ops.steal_clock = kvm_steal_clock;
> > }
> >
> > + has_vcpu_state = 1;
> > +
>
> Should be checking for a feature bit, see kvm_para_has_feature()
> examples above in the function.
>
Sure, will take care of this in my next version.

2012-06-19 06:12:41

by Nikunj A Dadhania

[permalink] [raw]
Subject: Re: [PATCH v2 3/7] KVM: Add paravirt kvm_flush_tlb_others

Hi Marcelo,

Thanks for the review.

On Tue, 12 Jun 2012 20:02:18 -0300, Marcelo Tosatti <[email protected]> wrote:
> On Mon, Jun 04, 2012 at 10:37:24AM +0530, Nikunj A. Dadhania wrote:
> > flush_tlb_others_ipi depends on a lot of statics in tlb.c. Replicated
> > the flush_tlb_others_ipi as kvm_flush_tlb_others to further adapt to
> > paravirtualization.
> >
> > Use the vcpu state information inside the kvm_flush_tlb_others to
> > avoid sending ipi to pre-empted vcpus.
> >
> > * Do not send ipi's to offline vcpus and set flush_on_enter flag
> > * For online vcpus: Wait for them to clear the flag
> >
> > The approach was discussed here: https://lkml.org/lkml/2012/2/20/157
> >
> > Suggested-by: Peter Zijlstra <[email protected]>
> > Signed-off-by: Nikunj A. Dadhania <[email protected]>
>
> Why not reintroduce the hypercall to flush TLBs?
Sure, I will get a version with this.

> No waiting, no entry/exit trickery.
>
Are you also suggesting getting rid of the vcpu-running information?
We would at least need this to raise a flushTLB hypercall on the sleeping
vcpu.

> This is half-way-there paravirt with all the downsides.
Some more details on what the downsides are would help us reach a
better solution.

> Even though the guest running information might be useful in other
> cases.
>
Yes, that was one of the things on my mind.

> > Pseudo Algo:
> >
> > Write()
> > ======
> >
> >    guest_exit()
> >        flush_on_enter[i]=0;
> >        running[i] = 0;
> >
> >    guest_enter()
> >        running[i] = 1;
> >        smp_mb();
> >        if(flush_on_enter[i]) {
> >            tlb_flush()
> >            flush_on_enter[i]=0;
> >        }
> >
> >
> > Read()
> > ======
> >
> >        GUEST                          KVM-HV
> >
> >    f->flushcpumask = cpumask - me;
> >
> > again:
> >    for_each_cpu(i, f->flushmask) {
> >
> >        if (!running[i]) {
> >                                       case 1:
> >                                       running[n]=1
> >
> >                                       (cpuN does not see
> >                                       flush_on_enter set,
> >                                       guest later finds it
> >                                       running and sends ipi,
> >                                       we are fine here, need
> >                                       to clear the flag on
> >                                       guest_exit)
> >
> >            flush_on_enter[i] = 1;
> >                                       case2:
> >                                       running[n]=1
> >                                       (cpuN - will see flush
> >                                       on enter and an IPI as
> >                                       well - addressed in patch-4)
> >
> >            if (!running[i])
> >                cpu_clear(f->flushmask);  All is well, vm_enter
> >                                          will do the fixup
> >        }
> >                                       case 3:
> >                                       running[n] = 0;
> >
> >                                       (cpuN went to sleep,
> >                                       we saw it as awake,
> >                                       ipi sent, but wait
> >                                       will break without
> >                                       zero_mask and goto
> >                                       again will take care)
> >
> >    }
> >    send_ipi(f->flushmask)
> >
> >    wait_a_while_for_zero_mask();
> >
> >    if (!zero_mask)
> >        goto again;
> > ---
> > arch/x86/include/asm/kvm_para.h | 3 +-
> > arch/x86/include/asm/tlbflush.h | 9 ++++++
> > arch/x86/kernel/kvm.c | 1 +
> > arch/x86/kvm/x86.c | 14 ++++++++-
> > arch/x86/mm/tlb.c | 61 +++++++++++++++++++++++++++++++++++++++
> > 5 files changed, 86 insertions(+), 2 deletions(-)
> >
>
>

2012-06-21 12:28:24

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: [PATCH v2 3/7] KVM: Add paravirt kvm_flush_tlb_others

On Tue, Jun 19, 2012 at 11:41:54AM +0530, Nikunj A Dadhania wrote:
> Hi Marcelo,
>
> Thanks for the review.
>
> On Tue, 12 Jun 2012 20:02:18 -0300, Marcelo Tosatti <[email protected]> wrote:
> > On Mon, Jun 04, 2012 at 10:37:24AM +0530, Nikunj A. Dadhania wrote:
> > > flush_tlb_others_ipi depends on a lot of statics in tlb.c. Replicated
> > > the flush_tlb_others_ipi as kvm_flush_tlb_others to further adapt to
> > > paravirtualization.
> > >
> > > Use the vcpu state information inside the kvm_flush_tlb_others to
> > > avoid sending ipi to pre-empted vcpus.
> > >
> > > * Do not send ipi's to offline vcpus and set flush_on_enter flag
> > > * For online vcpus: Wait for them to clear the flag
> > >
> > > The approach was discussed here: https://lkml.org/lkml/2012/2/20/157
> > >
> > > Suggested-by: Peter Zijlstra <[email protected]>
> > > Signed-off-by: Nikunj A. Dadhania <[email protected]>
> >
> > Why not reintroduce the hypercall to flush TLBs?
> Sure, I will get a version with this.
>
> > No waiting, no entry/exit trickery.
> >
> Are you also suggesting getting rid of the vcpu-running information?

Yes.

> We would at least need this to raise a flushTLB hypercall on the sleeping
> vcpu.

No, you always raise a flush TLB hypercall on flush_tlb_others, like Xen does.
Instead of an MMIO exit to APIC to IPI, it's a hypercall exit.
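
On the guest side that collapses flush_tlb_others() to a single exit.
A sketch: KVM_HC_FLUSH_TLB is a hypothetical hypercall number,
kvm_hypercall1() is the existing guest helper, and a real implementation
would need a way to pass the whole cpumask, e.g. via a shared page:

    static void kvm_flush_tlb_others_hcall(const struct cpumask *cpumask,
                                           struct mm_struct *mm,
                                           unsigned long va)
    {
            /*
             * Hand the hypervisor the first word of the target mask;
             * illustration only, wide masks need a shared buffer.
             */
            kvm_hypercall1(KVM_HC_FLUSH_TLB, cpumask_bits(cpumask)[0]);
    }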

> > This is half-way-there paravirt with all the downsides.
> Some more details on what the downsides are would help us reach a
> better solution.

The downside I refer to is the overhead of writing, testing and
maintaining separate code for running as a guest. If you are adding
awareness of the hypervisor anyway, a hypercall is the simplest way:

- It is simple to request a remote TLB flush via vcpu->requests in the
hypervisor. Information about which vcpus are in/out of guest mode is
already available.
- Maintenance of in/out guest mode information in guest memory adds more
overhead to entry/exit paths.
- No need to handle the interprocessor synchronization in the pseudo
algo below (already handled by vcpu->requests mechanism).
- The guest TLB IPI flow is (target vcpu):
- exit-due-to-external-interrupt.
- inject-ipi.
- enter guest mode.
- execute IPI handler, flush TLB.
- ACK IPI.
- source vcpu continues.
- The hypervisor TLB flush flow is:
- exit-due-to-external-interrupt.
- execute IPI handler (in host, from make_all_cpus_request)
- ACK IPI.
- source vcpu continues.

Unless there is an advantage in using APIC IPI exit vs hypercall
exit?
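
Host side, the handler for such a hypercall can lean entirely on the
existing machinery described above; a sketch with a hypothetical
handler name (kvm_for_each_vcpu(), kvm_make_request() and
kvm_vcpu_kick() are existing KVM primitives, KVM_REQ_TLB_FLUSH the
existing request bit):

    static int handle_flush_tlb_hypercall(struct kvm_vcpu *source)
    {
            struct kvm_vcpu *vcpu;
            int i;

            kvm_for_each_vcpu(i, vcpu, source->kvm) {
                    if (vcpu == source)
                            continue;
                    /*
                     * Queue the flush: a vcpu in guest mode is kicked
                     * out and serves the request on its next entry; a
                     * preempted vcpu picks it up when scheduled back in.
                     * The source vcpu never busy-waits.
                     */
                    kvm_make_request(KVM_REQ_TLB_FLUSH, vcpu);
                    kvm_vcpu_kick(vcpu);
            }
            return 0;
    }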

> > Even though the guest running information might be useful in other
> > cases.
> >
> Yes, that was one of the things on my mind.
>
> > > Pseudo Algo:
> > >
> > > Write()
> > > ======
> > >
> > >    guest_exit()
> > >        flush_on_enter[i]=0;
> > >        running[i] = 0;
> > >
> > >    guest_enter()
> > >        running[i] = 1;
> > >        smp_mb();
> > >        if(flush_on_enter[i]) {
> > >            tlb_flush()
> > >            flush_on_enter[i]=0;
> > >        }
> > >
> > >
> > > Read()
> > > ======
> > >
> > >        GUEST                          KVM-HV
> > >
> > >    f->flushcpumask = cpumask - me;
> > >
> > > again:
> > >    for_each_cpu(i, f->flushmask) {
> > >
> > >        if (!running[i]) {
> > >                                       case 1:
> > >                                       running[n]=1
> > >
> > >                                       (cpuN does not see
> > >                                       flush_on_enter set,
> > >                                       guest later finds it
> > >                                       running and sends ipi,
> > >                                       we are fine here, need
> > >                                       to clear the flag on
> > >                                       guest_exit)
> > >
> > >            flush_on_enter[i] = 1;
> > >                                       case2:
> > >                                       running[n]=1
> > >                                       (cpuN - will see flush
> > >                                       on enter and an IPI as
> > >                                       well - addressed in patch-4)
> > >
> > >            if (!running[i])
> > >                cpu_clear(f->flushmask);  All is well, vm_enter
> > >                                          will do the fixup
> > >        }
> > >                                       case 3:
> > >                                       running[n] = 0;
> > >
> > >                                       (cpuN went to sleep,
> > >                                       we saw it as awake,
> > >                                       ipi sent, but wait
> > >                                       will break without
> > >                                       zero_mask and goto
> > >                                       again will take care)
> > >
> > >    }
> > >    send_ipi(f->flushmask)
> > >
> > >    wait_a_while_for_zero_mask();
> > >
> > >    if (!zero_mask)
> > >        goto again;
> > > ---
> > > arch/x86/include/asm/kvm_para.h | 3 +-
> > > arch/x86/include/asm/tlbflush.h | 9 ++++++
> > > arch/x86/kernel/kvm.c | 1 +
> > > arch/x86/kvm/x86.c | 14 ++++++++-
> > > arch/x86/mm/tlb.c | 61 +++++++++++++++++++++++++++++++++++++++
> > > 5 files changed, 86 insertions(+), 2 deletions(-)
> > >
> >
> >

2012-08-01 11:23:57

by Stefano Stabellini

[permalink] [raw]
Subject: Re: [PATCH v2 6/7] kvm,x86: RCU based table free

On Tue, 5 Jun 2012, Stefano Stabellini wrote:
> On Tue, 5 Jun 2012, Peter Zijlstra wrote:
> > On Tue, 2012-06-05 at 18:34 +0530, Nikunj A Dadhania wrote:
> > > PeterZ, is 7/7 alright to be picked?
> >
> > Yeah, I guess it is.. I haven't had time to rework my tlb series yet
> > though. But these two patches together should make it work for x86.
> >
>
> Good. Do you think they are OK for 3.5-rc2? Or is it better to wait for
> 3.6?
>

Hello Nikunj,
what happened to this patch series?
In particular I am interested in the following two patches:

kvm,x86: RCU based table free
Flush page-table pages before freeing them

do you still intend to carry on with the development? Is there anything
missing that is preventing them from going upstream?

Cheers,

Stefano

2012-08-01 12:13:04

by Nikunj A Dadhania

[permalink] [raw]
Subject: Re: [PATCH v2 6/7] kvm,x86: RCU based table free

Hi Stefano,

On Wed, 1 Aug 2012 12:23:37 +0100, Stefano Stabellini <[email protected]> wrote:
> On Tue, 5 Jun 2012, Stefano Stabellini wrote:
> > On Tue, 5 Jun 2012, Peter Zijlstra wrote:
> > > On Tue, 2012-06-05 at 18:34 +0530, Nikunj A Dadhania wrote:
> > > > PeterZ, is 7/7 alright to be picked?
> > >
> > > Yeah, I guess it is.. I haven't had time to rework my tlb series yet
> > > though. But these two patches together should make it work for x86.
> > >
> >
> > Good. Do you think they are OK for 3.5-rc2? Or is it better to wait for
> > 3.6?
> >
>
> Hello Nikunj,
> what happened to this patch series?
> In particular I am interested in the following two patches:
>
> kvm,x86: RCU based table free
> Flush page-table pages before freeing them
>
> do you still intend to carry on with the development? Is there anything
> missing that is preventing them from going upstream?
>
I have posted a v3 on the kvm-list:
http://www.spinics.net/lists/kvm/msg76955.html

I am carrying the above two patches (with one fix) in my series as well
for completeness.

I have picked up the patches from PeterZ's "Unify TLB gather
implementations -v3"
http://article.gmane.org/gmane.linux.kernel.mm/81278

Regards
Nikunj

2012-08-01 13:00:14

by Stefano Stabellini

[permalink] [raw]
Subject: Re: [PATCH v2 6/7] kvm,x86: RCU based table free

On Wed, 1 Aug 2012, Nikunj A Dadhania wrote:
> Hi Stefano,
>
> On Wed, 1 Aug 2012 12:23:37 +0100, Stefano Stabellini <[email protected]> wrote:
> > On Tue, 5 Jun 2012, Stefano Stabellini wrote:
> > > On Tue, 5 Jun 2012, Peter Zijlstra wrote:
> > > > On Tue, 2012-06-05 at 18:34 +0530, Nikunj A Dadhania wrote:
> > > > > PeterZ, is 7/7 alright to be picked?
> > > >
> > > > Yeah, I guess it is.. I haven't had time to rework my tlb series yet
> > > > though. But these two patches together should make it work for x86.
> > > >
> > >
> > > Good. Do you think they are OK for 3.5-rc2? Or is it better to wait for
> > > 3.6?
> > >
> >
> > Hello Nikunj,
> > what happened to this patch series?
> > In particular I am interested in the following two patches:
> >
> > kvm,x86: RCU based table free
> > Flush page-table pages before freeing them
> >
> > do you still intend to carry on with the development? Is there anything
> > missing that is preventing them from going upstream?
> >
> I have posted a v3 on the kvm-list:
> http://www.spinics.net/lists/kvm/msg76955.html
>
> I am carrying the above two patches(with one fix) in my series as well
> for completeness.
>
> I have picked up the patches from PeterZ's "Unify TLB gather
> implementations -v3"
> http://article.gmane.org/gmane.linux.kernel.mm/81278

It is good to see that you are following up on this; since I didn't see
any updates I got worried :)
Cheers,

Stefano