2012-02-20 08:08:54

by Nikunj A Dadhania

Subject: Re: [RFC PATCH 0/4] Gang scheduling in CFS

On Thu, 5 Jan 2012 10:10:59 +0100, Ingo Molnar <[email protected]> wrote:
>
> * Avi Kivity <[email protected]> wrote:
>
> > > So why wait for non-running vcpus at all? That is, why not
> > > paravirt the TLB flush such that the invalidate marks the
> > > non-running VCPU's state so that on resume it will first
> > > flush its TLBs. That way you don't have to wake it up and
> > > wait for it to invalidate its TLBs.
> >
> > That's what Xen does, but it's tricky. For example
> > get_user_pages_fast() depends on the IPI to hold off page
> > freeing, if we paravirt it we have to take that into
> > consideration.
> >
> > > Or am I like totally missing the point (I am after all
> > > reading the thread backwards and I haven't yet fully paged
> > > the kernel stuff back into my brain).
> >
> > You aren't, and I bet those kernel pages are unswappable
> > anyway.
> >
> > > I guess tagging remote VCPU state like that might be
> > > somewhat tricky.. but it seems worth considering, the whole
> > > wake and wait for flush thing seems daft.
> >
> > It's nasty, but then so is paravirt. It's hard to get right,
> > and it has a tendency to cause performance regressions as
> > hardware improves.
>
> Here it would massively improve performance - without regressing
> the scheduler code massively.
>
I tried an experiment with flush_tlb_others_ipi(). This depends
on Raghu's "kvm : Paravirt-spinlock support for KVM guests"
(https://lkml.org/lkml/2012/1/14/66), which adds a new hypercall for
kicking another vcpu out of halt.

Here are the results from non-PLE hardware. Running ebizzy
workload inside the VMs. The table shows the ebizzy score -
Records/sec.

8-CPU Intel Xeon, HT disabled, 64-bit VM (8vcpu, 1G RAM)

+--------+------------+------------+-------------+
| | baseline | gang | pv_flush |
+--------+------------+------------+-------------+
| 2VM | 3979.50 | 8818.00 | 11002.50 |
| 4VM | 1817.50 | 6236.50 | 6196.75 |
| 8VM | 922.12 | 4043.00 | 4001.38 |
+--------+------------+------------+-------------+

I will be posting the results for PLE hardware as well.

Here is the patch; it still needs to be hooked up with the pv_mmu_ops, hence:

Not-yet-Signed-off-by: Nikunj A Dadhania <[email protected]>

Index: linux-tip-f4ab688-pv/arch/x86/mm/tlb.c
===================================================================
--- linux-tip-f4ab688-pv.orig/arch/x86/mm/tlb.c 2012-02-14 18:26:21.000000000 +0800
+++ linux-tip-f4ab688-pv/arch/x86/mm/tlb.c 2012-02-20 15:23:10.242576314 +0800
@@ -43,6 +43,7 @@ union smp_flush_state {
struct mm_struct *flush_mm;
unsigned long flush_va;
raw_spinlock_t tlbstate_lock;
+ int sender_cpu;
DECLARE_BITMAP(flush_cpumask, NR_CPUS);
};
char pad[INTERNODE_CACHE_BYTES];
@@ -116,6 +117,9 @@ EXPORT_SYMBOL_GPL(leave_mm);
*
* Interrupts are disabled.
*/
+#ifdef CONFIG_PARAVIRT_FLUSH_TLB
+extern void kvm_kick_cpu(int cpu);
+#endif

/*
* FIXME: use of asmlinkage is not consistent. On x86_64 it's noop
@@ -166,6 +170,10 @@ out:
smp_mb__before_clear_bit();
cpumask_clear_cpu(cpu, to_cpumask(f->flush_cpumask));
smp_mb__after_clear_bit();
+#ifdef CONFIG_PARAVIRT_FLUSH_TLB
+ if (cpumask_empty(to_cpumask(f->flush_cpumask)))
+ kvm_kick_cpu(f->sender_cpu);
+#endif
inc_irq_stat(irq_tlb_count);
}

@@ -184,7 +192,10 @@ static void flush_tlb_others_ipi(const s

f->flush_mm = mm;
f->flush_va = va;
+ f->sender_cpu = smp_processor_id();
if (cpumask_andnot(to_cpumask(f->flush_cpumask), cpumask, cpumask_of(smp_processor_id()))) {
+ int loop = 1024;
+
/*
* We have to send the IPI only to
* CPUs affected.
@@ -192,8 +203,15 @@ static void flush_tlb_others_ipi(const s
apic->send_IPI_mask(to_cpumask(f->flush_cpumask),
INVALIDATE_TLB_VECTOR_START + sender);

+#ifdef CONFIG_PARAVIRT_FLUSH_TLB
+ while (!cpumask_empty(to_cpumask(f->flush_cpumask)) && --loop)
+ cpu_relax();
+ if (!loop && !cpumask_empty(to_cpumask(f->flush_cpumask)))
+ halt();
+#else
while (!cpumask_empty(to_cpumask(f->flush_cpumask)))
cpu_relax();
+#endif
}

f->flush_mm = NULL;
Index: linux-tip-f4ab688-pv/arch/x86/kernel/kvm.c
===================================================================
--- linux-tip-f4ab688-pv.orig/arch/x86/kernel/kvm.c 2012-02-14 18:26:55.000000000 +0800
+++ linux-tip-f4ab688-pv/arch/x86/kernel/kvm.c 2012-02-14 18:26:55.178450933 +0800
@@ -653,16 +653,17 @@ out:
PV_CALLEE_SAVE_REGS_THUNK(kvm_lock_spinning);

/* Kick a cpu by its apicid*/
-static inline void kvm_kick_cpu(int apicid)
+void kvm_kick_cpu(int cpu)
{
+ int apicid = per_cpu(x86_cpu_to_apicid, cpu);
kvm_hypercall1(KVM_HC_KICK_CPU, apicid);
}
+EXPORT_SYMBOL_GPL(kvm_kick_cpu);

/* Kick vcpu waiting on @lock->head to reach value @ticket */
static void kvm_unlock_kick(struct arch_spinlock *lock, __ticket_t ticket)
{
int cpu;
- int apicid;

add_stats(RELEASED_SLOW, 1);

@@ -671,8 +672,7 @@ static void kvm_unlock_kick(struct arch_
if (ACCESS_ONCE(w->lock) == lock &&
ACCESS_ONCE(w->want) == ticket) {
add_stats(RELEASED_SLOW_KICKED, 1);
- apicid = per_cpu(x86_cpu_to_apicid, cpu);
- kvm_kick_cpu(apicid);
+ kvm_kick_cpu(cpu);
break;
}
}


2012-02-20 08:14:37

by Ingo Molnar

Subject: Re: [RFC PATCH 0/4] Gang scheduling in CFS


* Nikunj A Dadhania <[email protected]> wrote:

> > Here it would massively improve performance - without
> > regressing the scheduler code massively.
>
> I tried an experiment with flush_tlb_others_ipi(). This depends
> on Raghu's "kvm : Paravirt-spinlock support for KVM guests"
> (https://lkml.org/lkml/2012/1/14/66), which adds a new hypercall
> for kicking another vcpu out of halt.
>
> Here are the results from non-PLE hardware. Running ebizzy
> workload inside the VMs. The table shows the ebizzy score -
> Records/sec.
>
> 8-CPU Intel Xeon, HT disabled, 64-bit VM (8vcpu, 1G RAM)
>
> +--------+------------+------------+-------------+
> | | baseline | gang | pv_flush |
> +--------+------------+------------+-------------+
> | 2VM | 3979.50 | 8818.00 | 11002.50 |
> | 4VM | 1817.50 | 6236.50 | 6196.75 |
> | 8VM | 922.12 | 4043.00 | 4001.38 |
> +--------+------------+------------+-------------+

Very nice results!

Seems like the PV approach is massively faster on 2 VMs than
even the gang scheduling hack, because it attacks the problem
at its root, not just the symptom.

The patch is also an order of magnitude simpler. Gang
scheduling, R.I.P.

Thanks,

Ingo

2012-02-20 10:51:36

by Peter Zijlstra

Subject: Re: [RFC PATCH 0/4] Gang scheduling in CFS

On Mon, 2012-02-20 at 13:38 +0530, Nikunj A Dadhania wrote:
> +#ifdef CONFIG_PARAVIRT_FLUSH_TLB
> + while (!cpumask_empty(to_cpumask(f->flush_cpumask)) && --loop)
> + cpu_relax();
> + if (!loop && !cpumask_empty(to_cpumask(f->flush_cpumask)))
> + halt();


That's just vile, you don't need to wait for it, all you need to make
sure is that when that vcpu wakes up it does the flush.

But yeah, the results are a good hint that you're on the right track.

2012-02-20 11:53:41

by Nikunj A Dadhania

Subject: Re: [RFC PATCH 0/4] Gang scheduling in CFS

On Mon, 20 Feb 2012 11:51:13 +0100, Peter Zijlstra <[email protected]> wrote:
> On Mon, 2012-02-20 at 13:38 +0530, Nikunj A Dadhania wrote:
> > +#ifdef CONFIG_PARAVIRT_FLUSH_TLB
> > + while (!cpumask_empty(to_cpumask(f->flush_cpumask)) && --loop)
> > + cpu_relax();
> > + if (!loop && !cpumask_empty(to_cpumask(f->flush_cpumask)))
> > + halt();
>
>
> That's just vile, you don't need to wait for it, all you need to make
> sure is that when that vcpu wakes up it does the flush.
>
Yes, but we are not sure the vcpu will be sleeping or running. In the
case where vcpus are running, it might be beneficial to wait a while.

For example: if it is a remote flush to only one of the vcpus and that vcpu
is already running, is it worth it to halt and come back?

Regards,
Nikunj

2012-02-20 12:02:32

by Srivatsa Vaddagiri

Subject: Re: [RFC PATCH 0/4] Gang scheduling in CFS

* Nikunj A Dadhania <[email protected]> [2012-02-20 17:23:16]:

> On Mon, 20 Feb 2012 11:51:13 +0100, Peter Zijlstra <[email protected]> wrote:
> > On Mon, 2012-02-20 at 13:38 +0530, Nikunj A Dadhania wrote:
> > > +#ifdef CONFIG_PARAVIRT_FLUSH_TLB
> > > + while (!cpumask_empty(to_cpumask(f->flush_cpumask)) && --loop)
> > > + cpu_relax();
> > > + if (!loop && !cpumask_empty(to_cpumask(f->flush_cpumask)))
> > > + halt();
> >
> >
> > That's just vile, you don't need to wait for it, all you need to make
> > sure is that when that vcpu wakes up it does the flush.
> >
> Yes, but we are not sure the vcpu will be sleeping or running. In the
> case where vcpus are running, it might be beneficial to wait a while.

I guess one possibility is for host scheduler to export run/preempt
information to guest OS, as was discussed in the context of this thread
as well:

http://lkml.org/lkml/2010/4/6/223

Essentially the guest OS will know (using such exported information) which of
its vcpus are currently running and thus busy-wait only for them.

- vatsa

2012-02-20 12:14:53

by Peter Zijlstra

Subject: Re: [RFC PATCH 0/4] Gang scheduling in CFS

On Mon, 2012-02-20 at 17:32 +0530, Srivatsa Vaddagiri wrote:
> * Nikunj A Dadhania <[email protected]> [2012-02-20 17:23:16]:
>
> > On Mon, 20 Feb 2012 11:51:13 +0100, Peter Zijlstra <[email protected]> wrote:
> > > On Mon, 2012-02-20 at 13:38 +0530, Nikunj A Dadhania wrote:
> > > > +#ifdef CONFIG_PARAVIRT_FLUSH_TLB
> > > > + while (!cpumask_empty(to_cpumask(f->flush_cpumask)) && --loop)
> > > > + cpu_relax();
> > > > + if (!loop && !cpumask_empty(to_cpumask(f->flush_cpumask)))
> > > > + halt();
> > >
> > >
> > > That's just vile, you don't need to wait for it, all you need to make
> > > sure is that when that vcpu wakes up it does the flush.
> > >
> > Yes, but we are not sure the vcpu will be sleeping or running. In the
> > case where vcpus are running, it might be beneficial to wait a while.
>
> I guess one possibility is for host scheduler to export run/preempt
> information to guest OS, as was discussed in the context of this thread
> as well:
>
> http://lkml.org/lkml/2010/4/6/223

Doesn't need to be the host scheduler, KVM itself can do that just fine
on guest entry/exit.

> Essentially the guest OS will know (using such exported information) which of
> its vcpus are currently running and thus busy-wait only for them.

Right, do something like:

again:
	for_each_cpu_in_mask(cpu, flush_cpumask) {
		if (!vcpu_running(cpu)) {
			set_flush_on_enter(cpu);
			if (!vcpu_running(cpu))
				cpumask_clear(flush_cpumask, cpu); /* vm-enter will do the flush */
		}
	}

	wait_a_while_for_mask_to_clear();
	if (!cpumask_empty(flush_cpumask))
		goto again;

with the appropriate memory barriers and atomic instructions, that way
you can skip waiting for vcpus that are not in guest mode and vm-enter
will fixup.