2019-10-25 18:08:58

by Vitaly Kuznetsov

Subject: [PATCH] x86/hyper-v: micro-optimize send_ipi_one case

When sending an IPI to a single CPU there is no need to deal with cpumasks.
With a 2-CPU guest on WS2019 I'm seeing a minor (~3%, 8043 -> 7761 CPU
cycles) improvement with the smp_call_function_single() loop benchmark. The
optimization, however, is tiny and straightforward. Also, send_ipi_one() is
important for the PV spinlock kick.

I was also wondering if it would make sense to switch to using regular
APIC IPI send for the CPU >= 64 case but no, it is twice as expensive
(12650 CPU cycles for the __send_ipi_mask_ex() call, 26000 for
orig_apic.send_IPI(cpu, vector)).
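
For illustration, a rough sketch of one way such an smp_call_function_single()
loop benchmark could look (hypothetical throwaway module, not the actual test
code; assumes CPU 1 is online and makes no attempt to control for noise):

#include <linux/module.h>
#include <linux/smp.h>
#include <asm/msr.h>

static void ipi_bench_noop(void *info) { }

static int __init ipi_bench_init(void)
{
	const int iters = 100000;
	u64 start, end;
	int i;

	/* Round-trip an IPI to CPU 1 over and over, timing with the TSC. */
	start = rdtsc();
	for (i = 0; i < iters; i++)
		smp_call_function_single(1, ipi_bench_noop, NULL, 1);
	end = rdtsc();

	pr_info("ipi_bench: ~%llu cycles per smp_call_function_single()\n",
		(end - start) / iters);

	return -ENODEV; /* nothing to keep loaded */
}
module_init(ipi_bench_init);

MODULE_LICENSE("GPL");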

Signed-off-by: Vitaly Kuznetsov <[email protected]>
---
arch/x86/hyperv/hv_apic.c | 22 +++++++++++++++++++---
arch/x86/include/asm/trace/hyperv.h | 15 +++++++++++++++
2 files changed, 34 insertions(+), 3 deletions(-)

diff --git a/arch/x86/hyperv/hv_apic.c b/arch/x86/hyperv/hv_apic.c
index e01078e93dd3..847f9d0328fe 100644
--- a/arch/x86/hyperv/hv_apic.c
+++ b/arch/x86/hyperv/hv_apic.c
@@ -194,10 +194,26 @@ static bool __send_ipi_mask(const struct cpumask *mask, int vector)

static bool __send_ipi_one(int cpu, int vector)
{
- struct cpumask mask = CPU_MASK_NONE;
+ int ret;

- cpumask_set_cpu(cpu, &mask);
- return __send_ipi_mask(&mask, vector);
+ trace_hyperv_send_ipi_one(cpu, vector);
+
+ if (unlikely(!hv_hypercall_pg))
+ return false;
+
+ if (unlikely((vector < HV_IPI_LOW_VECTOR) ||
+ (vector > HV_IPI_HIGH_VECTOR)))
+ return false;
+
+ if (cpu >= 64)
+ goto do_ex_hypercall;
+
+ ret = hv_do_fast_hypercall16(HVCALL_SEND_IPI, vector,
+ BIT_ULL(hv_cpu_number_to_vp_number(cpu)));
+ return ((ret == 0) ? true : false);
+
+do_ex_hypercall:
+ return __send_ipi_mask_ex(cpumask_of(cpu), vector);
}

static void hv_send_ipi(int cpu, int vector)
diff --git a/arch/x86/include/asm/trace/hyperv.h b/arch/x86/include/asm/trace/hyperv.h
index ace464f09681..4d705cb4d63b 100644
--- a/arch/x86/include/asm/trace/hyperv.h
+++ b/arch/x86/include/asm/trace/hyperv.h
@@ -71,6 +71,21 @@ TRACE_EVENT(hyperv_send_ipi_mask,
__entry->ncpus, __entry->vector)
);

+TRACE_EVENT(hyperv_send_ipi_one,
+ TP_PROTO(int cpu,
+ int vector),
+ TP_ARGS(cpu, vector),
+ TP_STRUCT__entry(
+ __field(int, cpu)
+ __field(int, vector)
+ ),
+ TP_fast_assign(__entry->cpu = cpu;
+ __entry->vector = vector;
+ ),
+ TP_printk("cpu %d vector %x",
+ __entry->cpu, __entry->vector)
+ );
+
#endif /* CONFIG_HYPERV */

#undef TRACE_INCLUDE_PATH
--
2.20.1


2019-10-25 18:10:57

by Roman Kagan

Subject: Re: [PATCH] x86/hyper-v: micro-optimize send_ipi_one case

On Thu, Oct 24, 2019 at 05:21:52PM +0200, Vitaly Kuznetsov wrote:
> When sending an IPI to a single CPU there is no need to deal with cpumasks.
> With a 2-CPU guest on WS2019 I'm seeing a minor (~3%, 8043 -> 7761 CPU
> cycles) improvement with the smp_call_function_single() loop benchmark. The
> optimization, however, is tiny and straightforward. Also, send_ipi_one() is
> important for the PV spinlock kick.
>
> I was also wondering if it would make sense to switch to using regular
> APIC IPI send for the CPU >= 64 case but no, it is twice as expensive
> (12650 CPU cycles for the __send_ipi_mask_ex() call, 26000 for
> orig_apic.send_IPI(cpu, vector)).

Is it with APICv or emulated apic?

> Signed-off-by: Vitaly Kuznetsov <[email protected]>
> ---
> arch/x86/hyperv/hv_apic.c | 22 +++++++++++++++++++---
> arch/x86/include/asm/trace/hyperv.h | 15 +++++++++++++++
> 2 files changed, 34 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/hyperv/hv_apic.c b/arch/x86/hyperv/hv_apic.c
> index e01078e93dd3..847f9d0328fe 100644
> --- a/arch/x86/hyperv/hv_apic.c
> +++ b/arch/x86/hyperv/hv_apic.c
> @@ -194,10 +194,26 @@ static bool __send_ipi_mask(const struct cpumask *mask, int vector)
>
> static bool __send_ipi_one(int cpu, int vector)
> {
> - struct cpumask mask = CPU_MASK_NONE;
> + int ret;
>
> - cpumask_set_cpu(cpu, &mask);
> - return __send_ipi_mask(&mask, vector);
> + trace_hyperv_send_ipi_one(cpu, vector);
> +
> + if (unlikely(!hv_hypercall_pg))
> + return false;
> +
> + if (unlikely((vector < HV_IPI_LOW_VECTOR) ||
> + (vector > HV_IPI_HIGH_VECTOR)))
> + return false;

I guess 'unlikely' is unnecessary in these cases.

> +
> + if (cpu >= 64)
> + goto do_ex_hypercall;
> +
> + ret = hv_do_fast_hypercall16(HVCALL_SEND_IPI, vector,
> + BIT_ULL(hv_cpu_number_to_vp_number(cpu)));
> + return ((ret == 0) ? true : false);

D'oh. Isn't "return ret == 0;" or just "return ret;" good enough?

These tiny nitpicks are no reason to hold the patch though, so

Reviewed-by: Roman Kagan <[email protected]>

2019-10-25 18:15:07

by Joe Perches

Subject: Re: [PATCH] x86/hyper-v: micro-optimize send_ipi_one case

On Thu, 2019-10-24 at 17:21 +0200, Vitaly Kuznetsov wrote:
> When sending an IPI to a single CPU there is no need to deal with cpumasks.
> With a 2-CPU guest on WS2019 I'm seeing a minor (~3%, 8043 -> 7761 CPU
> cycles) improvement with the smp_call_function_single() loop benchmark. The
> optimization, however, is tiny and straightforward. Also, send_ipi_one() is
> important for the PV spinlock kick.
>
> I was also wondering if it would make sense to switch to using regular
> APIC IPI send for the CPU >= 64 case but no, it is twice as expensive
> (12650 CPU cycles for the __send_ipi_mask_ex() call, 26000 for
> orig_apic.send_IPI(cpu, vector)).

style trivia:

> diff --git a/arch/x86/hyperv/hv_apic.c b/arch/x86/hyperv/hv_apic.c
[]
> @@ -194,10 +194,26 @@ static bool __send_ipi_mask(const struct cpumask *mask, int vector)
>
> static bool __send_ipi_one(int cpu, int vector)
> {
> - struct cpumask mask = CPU_MASK_NONE;
> + int ret;
>
> - cpumask_set_cpu(cpu, &mask);
> - return __send_ipi_mask(&mask, vector);
> + trace_hyperv_send_ipi_one(cpu, vector);
> +
> + if (unlikely(!hv_hypercall_pg))
> + return false;
> +
> + if (unlikely((vector < HV_IPI_LOW_VECTOR) ||
> + (vector > HV_IPI_HIGH_VECTOR)))
> + return false;
> +
> + if (cpu >= 64)
> + goto do_ex_hypercall;

Pretty odd to have a separate single use goto
to a single return statement. Might be better
using a direct return.

> +
> + ret = hv_do_fast_hypercall16(HVCALL_SEND_IPI, vector,
> + BIT_ULL(hv_cpu_number_to_vp_number(cpu)));
> + return ((ret == 0) ? true : false);
> +
> +do_ex_hypercall:
> + return __send_ipi_mask_ex(cpumask_of(cpu), vector);
> }

And the use of a automatic declaration of ret probably
isn't useful either.

Perhaps:
---
arch/x86/hyperv/hv_apic.c | 16 +++++++++++++---
arch/x86/include/asm/trace/hyperv.h | 15 +++++++++++++++
2 files changed, 28 insertions(+), 3 deletions(-)

diff --git a/arch/x86/hyperv/hv_apic.c b/arch/x86/hyperv/hv_apic.c
index e01078e9..16c65cd 100644
--- a/arch/x86/hyperv/hv_apic.c
+++ b/arch/x86/hyperv/hv_apic.c
@@ -194,10 +194,20 @@ static bool __send_ipi_mask(const struct cpumask *mask, int vector)

static bool __send_ipi_one(int cpu, int vector)
{
- struct cpumask mask = CPU_MASK_NONE;
+ trace_hyperv_send_ipi_one(cpu, vector);

- cpumask_set_cpu(cpu, &mask);
- return __send_ipi_mask(&mask, vector);
+ if (unlikely(!hv_hypercall_pg))
+ return false;
+
+ if (unlikely((vector < HV_IPI_LOW_VECTOR) ||
+ (vector > HV_IPI_HIGH_VECTOR)))
+ return false;
+
+ if (cpu >= 64)
+ return __send_ipi_mask_ex(cpumask_of(cpu), vector);
+
+ return !hv_do_fast_hypercall16(HVCALL_SEND_IPI, vector,
+ BIT_ULL(hv_cpu_number_to_vp_number(cpu)));
}

static void hv_send_ipi(int cpu, int vector)
diff --git a/arch/x86/include/asm/trace/hyperv.h b/arch/x86/include/asm/trace/hyperv.h
index ace464f..4d705cb 100644
--- a/arch/x86/include/asm/trace/hyperv.h
+++ b/arch/x86/include/asm/trace/hyperv.h
@@ -71,6 +71,21 @@ TRACE_EVENT(hyperv_send_ipi_mask,
__entry->ncpus, __entry->vector)
);

+TRACE_EVENT(hyperv_send_ipi_one,
+ TP_PROTO(int cpu,
+ int vector),
+ TP_ARGS(cpu, vector),
+ TP_STRUCT__entry(
+ __field(int, cpu)
+ __field(int, vector)
+ ),
+ TP_fast_assign(__entry->cpu = cpu;
+ __entry->vector = vector;
+ ),
+ TP_printk("cpu %d vector %x",
+ __entry->cpu, __entry->vector)
+ );
+
#endif /* CONFIG_HYPERV */

#undef TRACE_INCLUDE_PATH


2019-10-25 19:12:48

by Thomas Gleixner

Subject: Re: [PATCH] x86/hyper-v: micro-optimize send_ipi_one case

On Thu, 24 Oct 2019, Roman Kagan wrote:
> > +
> > + if (cpu >= 64)
> > + goto do_ex_hypercall;
> > +
> > + ret = hv_do_fast_hypercall16(HVCALL_SEND_IPI, vector,
> > + BIT_ULL(hv_cpu_number_to_vp_number(cpu)));
> > + return ((ret == 0) ? true : false);
>
> D'oh. Isn't "return ret == 0;" or just "return ret;" good enough?

'return ret == 0' != 'return ret'

!ret perhaps :)
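
A self-contained sketch of the difference (plain C, just to illustrate the
bool conversion; 0 stands in for HV_STATUS_SUCCESS):

#include <stdbool.h>
#include <stdio.h>

static bool ret_as_is(int ret)   { return ret; }      /* true on failure */
static bool ret_is_zero(int ret) { return ret == 0; } /* true on success */
static bool ret_negated(int ret) { return !ret; }     /* true on success */

int main(void)
{
	int ret = 5;	/* pretend the hypercall failed with status 5 */

	printf("return ret;      -> %d\n", ret_as_is(ret));   /* 1: wrong */
	printf("return ret == 0; -> %d\n", ret_is_zero(ret)); /* 0: right */
	printf("return !ret;     -> %d\n", ret_negated(ret)); /* 0: right */
	return 0;
}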

Thanks,

tglx

2019-10-25 19:35:06

by Roman Kagan

Subject: Re: [PATCH] x86/hyper-v: micro-optimize send_ipi_one case

On Fri, Oct 25, 2019 at 12:38:05AM +0200, Thomas Gleixner wrote:
> On Thu, 24 Oct 2019, Roman Kagan wrote:
> > > +
> > > + if (cpu >= 64)
> > > + goto do_ex_hypercall;
> > > +
> > > + ret = hv_do_fast_hypercall16(HVCALL_SEND_IPI, vector,
> > > + BIT_ULL(hv_cpu_number_to_vp_number(cpu)));
> > > + return ((ret == 0) ? true : false);
> >
> > D'oh. Isn't "return ret == 0;" or just "return ret;" good enough?
>
> 'return ret == 0' != 'return ret'
>
> !ret perhaps :)

Sure. Time to vacuum my keyboard ;)

Thanks,
Roman.

2019-10-25 19:41:51

by Vitaly Kuznetsov

Subject: Re: [PATCH] x86/hyper-v: micro-optimize send_ipi_one case

Roman Kagan <[email protected]> writes:

> On Thu, Oct 24, 2019 at 05:21:52PM +0200, Vitaly Kuznetsov wrote:
>> When sending an IPI to a single CPU there is no need to deal with cpumasks.
>> With a 2-CPU guest on WS2019 I'm seeing a minor (~3%, 8043 -> 7761 CPU
>> cycles) improvement with the smp_call_function_single() loop benchmark. The
>> optimization, however, is tiny and straightforward. Also, send_ipi_one() is
>> important for the PV spinlock kick.
>>
>> I was also wondering if it would make sense to switch to using regular
>> APIC IPI send for the CPU >= 64 case but no, it is twice as expensive
>> (12650 CPU cycles for the __send_ipi_mask_ex() call, 26000 for
>> orig_apic.send_IPI(cpu, vector)).
>
> Is it with APICv or emulated apic?

That's actually a good question. Yesterday I was testing this on a WS2019
host with a Xeon E5-2420 v2 (Ivy Bridge EN) which I *think* should already
support APICv - but I'm not sure and ark.intel.com is not
helpful. Today, I decided to re-test on something more modern and I got a
WS2016 host with an E5-2667 v4 (Broadwell) and the results are:

'Ex' hypercall: 18000 cycles
orig_apic.send_IPI(): 46000 cycles

I'm, however, just assuming that Hyper-V uses APICv when it's available
and have no idea how to check from within the guest. I'm also not sure
if WS2019 is so much faster or if there are other differences on these
hosts which matter.

>
>> Signed-off-by: Vitaly Kuznetsov <[email protected]>
>> ---
>> arch/x86/hyperv/hv_apic.c | 22 +++++++++++++++++++---
>> arch/x86/include/asm/trace/hyperv.h | 15 +++++++++++++++
>> 2 files changed, 34 insertions(+), 3 deletions(-)
>>
>> diff --git a/arch/x86/hyperv/hv_apic.c b/arch/x86/hyperv/hv_apic.c
>> index e01078e93dd3..847f9d0328fe 100644
>> --- a/arch/x86/hyperv/hv_apic.c
>> +++ b/arch/x86/hyperv/hv_apic.c
>> @@ -194,10 +194,26 @@ static bool __send_ipi_mask(const struct cpumask *mask, int vector)
>>
>> static bool __send_ipi_one(int cpu, int vector)
>> {
>> - struct cpumask mask = CPU_MASK_NONE;
>> + int ret;
>>
>> - cpumask_set_cpu(cpu, &mask);
>> - return __send_ipi_mask(&mask, vector);
>> + trace_hyperv_send_ipi_one(cpu, vector);
>> +
>> + if (unlikely(!hv_hypercall_pg))
>> + return false;
>> +
>> + if (unlikely((vector < HV_IPI_LOW_VECTOR) ||
>> + (vector > HV_IPI_HIGH_VECTOR)))
>> + return false;
>
> I guess 'unlikely' is unnecessary in these cases.
>

All I can say is that the resulting asm with my gcc is a bit different
:-)

>> +
>> + if (cpu >= 64)
>> + goto do_ex_hypercall;
>> +
>> + ret = hv_do_fast_hypercall16(HVCALL_SEND_IPI, vector,
>> + BIT_ULL(hv_cpu_number_to_vp_number(cpu)));
>> + return ((ret == 0) ? true : false);
>
> D'oh. Isn't "return ret == 0;" or just "return ret;" good enough?

That's how we do stuff in __send_ipi_mask() :-) I'll send v2
implementing Joe's suggestion to drop 'ret' and just do
return !hv_do_fast_hypercall16().

>
> These tiny nitpicks are no reason to hold the patch though, so
>
> Reviewed-by: Roman Kagan <[email protected]>

Thanks!

--
Vitaly

2019-10-25 20:44:34

by Michael Kelley (LINUX)

Subject: RE: [PATCH] x86/hyper-v: micro-optimize send_ipi_one case

From: Vitaly Kuznetsov <[email protected]> Sent: Friday, October 25, 2019 3:44 AM
>
> Roman Kagan <[email protected]> writes:
>
> > On Thu, Oct 24, 2019 at 05:21:52PM +0200, Vitaly Kuznetsov wrote:
> >> When sending an IPI to a single CPU there is no need to deal with cpumasks.
> >> With a 2-CPU guest on WS2019 I'm seeing a minor (~3%, 8043 -> 7761 CPU
> >> cycles) improvement with the smp_call_function_single() loop benchmark. The
> >> optimization, however, is tiny and straightforward. Also, send_ipi_one() is
> >> important for the PV spinlock kick.
> >>
> >> I was also wondering if it would make sense to switch to using regular
> >> APIC IPI send for the CPU >= 64 case but no, it is twice as expensive
> >> (12650 CPU cycles for the __send_ipi_mask_ex() call, 26000 for
> >> orig_apic.send_IPI(cpu, vector)).
> >
> > Is it with APICv or emulated apic?
>
> That's actually a good question. Yesterday I was testing this on a WS2019
> host with a Xeon E5-2420 v2 (Ivy Bridge EN) which I *think* should already
> support APICv - but I'm not sure and ark.intel.com is not
> helpful. Today, I decided to re-test on something more modern and I got a
> WS2016 host with an E5-2667 v4 (Broadwell) and the results are:
>
> 'Ex' hypercall: 18000 cycles
> orig_apic.send_IPI(): 46000 cycles
>
> I'm, however, just assuming that Hyper-V uses APICv when it's available
> and have no idea how to check from within the guest. I'm also not sure
> if WS2019 is so much faster or if there are other differences on these
> hosts which matter.
>

On Hyper-V 2016 and 2019 (not sure about 2012 R2), and when the guest is
using xAPIC (not x2APIC), you can tell within the guest whether Intel APICv is
enabled based on the setting of HV_X64_APIC_ACCESS_RECOMMENDED. If
this flag is set, then APICv is not present, because Hyper-V only recommends
using the synthetic MSRs when APICv is not present. Conversely, if the flag is
not set, then APICv is present.
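
A minimal guest-side sketch of that check (hypothetical userspace tool, not
an existing one; assumes we are running as a Hyper-V guest so CPUID leaf
0x40000004 is valid, and that HV_X64_APIC_ACCESS_RECOMMENDED is bit 3 of
EAX per my reading of the TLFS):

#include <stdio.h>
#include <cpuid.h>

#define HYPERV_CPUID_ENLIGHTMENT_INFO	0x40000004
#define HV_X64_APIC_ACCESS_RECOMMENDED	(1u << 3)

int main(void)
{
	unsigned int eax, ebx, ecx, edx;

	/* EAX of leaf 0x40000004 carries the implementation recommendations. */
	__cpuid(HYPERV_CPUID_ENLIGHTMENT_INFO, eax, ebx, ecx, edx);

	if (eax & HV_X64_APIC_ACCESS_RECOMMENDED)
		printf("synthetic APIC MSRs recommended -> APICv likely not in use\n");
	else
		printf("synthetic APIC MSRs not recommended -> APICv likely in use\n");

	return 0;
}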

FWIW, when APICv is present in the hardware, you can disable its use in a
particular VM by using PowerShell in the host to set compatibility mode on
the VM:

Set-VMProcessor <vmname> -CompatibilityForMigrationEnabled $true

Then when the VM is booted, HV_X64_APIC_ACCESS_RECOMMENDED
will show as set even if the hardware has APICv. This is useful for testing
the older code paths on hardware that has APICv.

Michael