2019-10-25 20:17:26

by Vitaly Kuznetsov

Subject: [PATCH v2] x86/hyper-v: micro-optimize send_ipi_one case

When sending an IPI to a single CPU there is no need to deal with cpumasks.
With a 2-CPU guest on WS2019 I'm seeing a minor (~3%, 8043 -> 7761 CPU
cycles) improvement with an smp_call_function_single() loop benchmark. The
optimization, however, is tiny and straightforward. Also, send_ipi_one() is
important for the PV spinlock kick.

I was also wondering if it would make sense to switch to using the regular
APIC IPI send for the CPU >= 64 case, but no, it is twice as expensive
(12650 CPU cycles for the __send_ipi_mask_ex() call vs. 26000 for
orig_apic.send_IPI(cpu, vector)).

Signed-off-by: Vitaly Kuznetsov <[email protected]>
---
Changes since v1:
- Style changes [Roman, Joe]
---
arch/x86/hyperv/hv_apic.c | 13 ++++++++++---
arch/x86/include/asm/trace/hyperv.h | 15 +++++++++++++++
2 files changed, 25 insertions(+), 3 deletions(-)

diff --git a/arch/x86/hyperv/hv_apic.c b/arch/x86/hyperv/hv_apic.c
index e01078e93dd3..fd17c6341737 100644
--- a/arch/x86/hyperv/hv_apic.c
+++ b/arch/x86/hyperv/hv_apic.c
@@ -194,10 +194,17 @@ static bool __send_ipi_mask(const struct cpumask *mask, int vector)

 static bool __send_ipi_one(int cpu, int vector)
 {
-	struct cpumask mask = CPU_MASK_NONE;
+	trace_hyperv_send_ipi_one(cpu, vector);
 
-	cpumask_set_cpu(cpu, &mask);
-	return __send_ipi_mask(&mask, vector);
+	if (!hv_hypercall_pg || (vector < HV_IPI_LOW_VECTOR) ||
+	    (vector > HV_IPI_HIGH_VECTOR))
+		return false;
+
+	if (cpu >= 64)
+		return __send_ipi_mask_ex(cpumask_of(cpu), vector);
+
+	return !hv_do_fast_hypercall16(HVCALL_SEND_IPI, vector,
+		BIT_ULL(hv_cpu_number_to_vp_number(cpu)));
 }
 
 static void hv_send_ipi(int cpu, int vector)
diff --git a/arch/x86/include/asm/trace/hyperv.h b/arch/x86/include/asm/trace/hyperv.h
index ace464f09681..4d705cb4d63b 100644
--- a/arch/x86/include/asm/trace/hyperv.h
+++ b/arch/x86/include/asm/trace/hyperv.h
@@ -71,6 +71,21 @@ TRACE_EVENT(hyperv_send_ipi_mask,
 		      __entry->ncpus, __entry->vector)
 	);
 
+TRACE_EVENT(hyperv_send_ipi_one,
+	    TP_PROTO(int cpu,
+		     int vector),
+	    TP_ARGS(cpu, vector),
+	    TP_STRUCT__entry(
+		    __field(int, cpu)
+		    __field(int, vector)
+		    ),
+	    TP_fast_assign(__entry->cpu = cpu;
+			   __entry->vector = vector;
+		    ),
+	    TP_printk("cpu %d vector %x",
+		      __entry->cpu, __entry->vector)
+	);
+
 #endif /* CONFIG_HYPERV */
 
 #undef TRACE_INCLUDE_PATH
--
2.20.1
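
For context, the cycle numbers in the commit message come from timing a tight
smp_call_function_single() loop. A minimal sketch of such a measurement module
follows; it is hypothetical and not part of the patch, and the module name,
target CPU and iteration count are purely illustrative:

/*
 * Hypothetical measurement module (not part of the patch): times
 * smp_call_function_single() round trips with rdtsc_ordered().  The
 * module name, target CPU and iteration count are illustrative.
 */
#include <linux/module.h>
#include <linux/smp.h>
#include <asm/msr.h>

static void ipi_bench_noop(void *info)
{
	/* Empty handler: we only care about IPI send/ack latency. */
}

static int __init ipi_bench_init(void)
{
	const int iterations = 100000;
	int target_cpu = 1;	/* assumes at least 2 online CPUs */
	unsigned long long start, end;
	int i;

	start = rdtsc_ordered();
	for (i = 0; i < iterations; i++)
		smp_call_function_single(target_cpu, ipi_bench_noop,
					 NULL, 1 /* wait */);
	end = rdtsc_ordered();

	pr_info("ipi_bench: ~%llu cycles per smp_call_function_single()\n",
		(end - start) / iterations);
	return 0;
}

static void __exit ipi_bench_exit(void)
{
}

module_init(ipi_bench_init);
module_exit(ipi_bench_exit);
MODULE_LICENSE("GPL");

Timing the whole loop and dividing by the iteration count amortizes the
rdtsc_ordered() overhead; with the handler doing nothing, the per-call figure
is dominated by the IPI send plus the wait for completion.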


2019-10-25 20:35:23

by Roman Kagan

Subject: Re: [PATCH v2] x86/hyper-v: micro-optimize send_ipi_one case

On Fri, Oct 25, 2019 at 03:15:46PM +0200, Vitaly Kuznetsov wrote:
> When sending an IPI to a single CPU there is no need to deal with cpumasks.
> With a 2-CPU guest on WS2019 I'm seeing a minor (~3%, 8043 -> 7761 CPU
> cycles) improvement with an smp_call_function_single() loop benchmark. The
> optimization, however, is tiny and straightforward. Also, send_ipi_one() is
> important for the PV spinlock kick.
>
> I was also wondering if it would make sense to switch to using the regular
> APIC IPI send for the CPU >= 64 case, but no, it is twice as expensive
> (12650 CPU cycles for the __send_ipi_mask_ex() call vs. 26000 for
> orig_apic.send_IPI(cpu, vector)).
>
> Signed-off-by: Vitaly Kuznetsov <[email protected]>
> ---
> Changes since v1:
> - Style changes [Roman, Joe]
> ---
> arch/x86/hyperv/hv_apic.c | 13 ++++++++++---
> arch/x86/include/asm/trace/hyperv.h | 15 +++++++++++++++
> 2 files changed, 25 insertions(+), 3 deletions(-)

Reviewed-by: Roman Kagan <[email protected]>

2019-10-25 20:47:13

by Michael Kelley (LINUX)

Subject: RE: [PATCH v2] x86/hyper-v: micro-optimize send_ipi_one case

From: Vitaly Kuznetsov <[email protected]>
>
> When sending an IPI to a single CPU there is no need to deal with cpumasks.
> With a 2-CPU guest on WS2019 I'm seeing a minor (~3%, 8043 -> 7761 CPU
> cycles) improvement with an smp_call_function_single() loop benchmark. The
> optimization, however, is tiny and straightforward. Also, send_ipi_one() is
> important for the PV spinlock kick.
>
> I was also wondering if it would make sense to switch to using the regular
> APIC IPI send for the CPU >= 64 case, but no, it is twice as expensive
> (12650 CPU cycles for the __send_ipi_mask_ex() call vs. 26000 for
> orig_apic.send_IPI(cpu, vector)).
>
> Signed-off-by: Vitaly Kuznetsov <[email protected]>
> ---
> Changes since v1:
> - Style changes [Roman, Joe]
> ---
> arch/x86/hyperv/hv_apic.c | 13 ++++++++++---
> arch/x86/include/asm/trace/hyperv.h | 15 +++++++++++++++
> 2 files changed, 25 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/hyperv/hv_apic.c b/arch/x86/hyperv/hv_apic.c
> index e01078e93dd3..fd17c6341737 100644
> --- a/arch/x86/hyperv/hv_apic.c
> +++ b/arch/x86/hyperv/hv_apic.c
> @@ -194,10 +194,17 @@ static bool __send_ipi_mask(const struct cpumask *mask, int vector)
>
> static bool __send_ipi_one(int cpu, int vector)
> {
> - struct cpumask mask = CPU_MASK_NONE;
> + trace_hyperv_send_ipi_one(cpu, vector);
>
> - cpumask_set_cpu(cpu, &mask);
> - return __send_ipi_mask(&mask, vector);
> + if (!hv_hypercall_pg || (vector < HV_IPI_LOW_VECTOR) ||
> + (vector > HV_IPI_HIGH_VECTOR))
> + return false;
> +
> + if (cpu >= 64)
> + return __send_ipi_mask_ex(cpumask_of(cpu), vector);

The above test should be checking the VP number, not the CPU
number, since the VP number is used to form the bitmap argument
to the hypercall. In all current implementations of Hyper-V, the CPU number
and VP number are the same as far as I am aware, but that's not guaranteed in
the future.
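
For illustration, the corrected check would presumably look something like
this (a rough sketch only, deciding on the resolved VP number rather than the
CPU number; the actual v3 patch is authoritative):

static bool __send_ipi_one(int cpu, int vector)
{
	int vp = hv_cpu_number_to_vp_number(cpu);

	trace_hyperv_send_ipi_one(cpu, vector);

	if (!hv_hypercall_pg || (vector < HV_IPI_LOW_VECTOR) ||
	    (vector > HV_IPI_HIGH_VECTOR))
		return false;

	/* VP numbers above 63 don't fit the fast hypercall's 64-bit mask. */
	if (vp >= 64)
		return __send_ipi_mask_ex(cpumask_of(cpu), vector);

	return !hv_do_fast_hypercall16(HVCALL_SEND_IPI, vector, BIT_ULL(vp));
}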

Michael

> +
> + return !hv_do_fast_hypercall16(HVCALL_SEND_IPI, vector,
> + BIT_ULL(hv_cpu_number_to_vp_number(cpu)));
> }
>

2019-10-25 20:49:21

by Vitaly Kuznetsov

Subject: RE: [PATCH v2] x86/hyper-v: micro-optimize send_ipi_one case

Michael Kelley <[email protected]> writes:

> From: Vitaly Kuznetsov <[email protected]>
>>
>> When sending an IPI to a single CPU there is no need to deal with cpumasks.
>> With a 2-CPU guest on WS2019 I'm seeing a minor (~3%, 8043 -> 7761 CPU
>> cycles) improvement with an smp_call_function_single() loop benchmark. The
>> optimization, however, is tiny and straightforward. Also, send_ipi_one() is
>> important for the PV spinlock kick.
>>
>> I was also wondering if it would make sense to switch to using the regular
>> APIC IPI send for the CPU >= 64 case, but no, it is twice as expensive
>> (12650 CPU cycles for the __send_ipi_mask_ex() call vs. 26000 for
>> orig_apic.send_IPI(cpu, vector)).
>>
>> Signed-off-by: Vitaly Kuznetsov <[email protected]>
>> ---
>> Changes since v1:
>> - Style changes [Roman, Joe]
>> ---
>> arch/x86/hyperv/hv_apic.c | 13 ++++++++++---
>> arch/x86/include/asm/trace/hyperv.h | 15 +++++++++++++++
>> 2 files changed, 25 insertions(+), 3 deletions(-)
>>
>> diff --git a/arch/x86/hyperv/hv_apic.c b/arch/x86/hyperv/hv_apic.c
>> index e01078e93dd3..fd17c6341737 100644
>> --- a/arch/x86/hyperv/hv_apic.c
>> +++ b/arch/x86/hyperv/hv_apic.c
>> @@ -194,10 +194,17 @@ static bool __send_ipi_mask(const struct cpumask *mask, int vector)
>>
>> static bool __send_ipi_one(int cpu, int vector)
>> {
>> - struct cpumask mask = CPU_MASK_NONE;
>> + trace_hyperv_send_ipi_one(cpu, vector);
>>
>> - cpumask_set_cpu(cpu, &mask);
>> - return __send_ipi_mask(&mask, vector);
>> + if (!hv_hypercall_pg || (vector < HV_IPI_LOW_VECTOR) ||
>> + (vector > HV_IPI_HIGH_VECTOR))
>> + return false;
>> +
>> + if (cpu >= 64)
>> + return __send_ipi_mask_ex(cpumask_of(cpu), vector);
>
> The above test should be checking the VP number, not the CPU
> number,

Oops, of course, thanks for catching this! v3 is coming!

> since the VP number is used to form the bitmap argument
> to the hypercall. In all current implementations of Hyper-V, the CPU number
> and VP number are the same as far as I am aware, but that's not guaranteed in
> the future.
>
> Michael
>
>> +
>> + return !hv_do_fast_hypercall16(HVCALL_SEND_IPI, vector,
>> + BIT_ULL(hv_cpu_number_to_vp_number(cpu)));
>> }
>>

--
Vitaly