2020-09-15 23:09:37

by Marc Zyngier

Subject: Re: [PATCH] irqchip/gic-v4.1: Optimize the delay time of the poll on the GICR_VPENDBASER.Dirty bit

On 2020-09-15 15:04, lushenming wrote:
> Thanks for your quick response.
>
> Okay, I agree that busy-waiting may add more overhead at the RD level.
> But I think that the delay time can be adjusted. In our latest
> hardware implementation, we optimized the search of the VPT, so even
> a VPT full of interrupts (56k) can be parsed within 2 microseconds.

It's not so much a full VPT that is the bad case. It is when the
pending interrupts are not cached and you don't know *where* to look
for them in the VPT.

> It is true that parse speeds differ between hardware implementations,
> but doesn't a fixed 10 microsecond wait completely mask the
> optimization on fast hardware? Maybe we can make the delay smaller,
> like 1 microsecond?

That certainly would be more acceptable. But I still question the impact
of such a change compared to the cost of a vcpu entry. I suggest you
come up with measurements that actually show that polling this register
more often significantly reduces the entry latency. Only then can we
make an educated decision.

Thanks,

M.
--
Jazz is not dead. It just smells funny...


2020-09-16 07:06:22

by Shenming Lu

Subject: RE: [PATCH] irqchip/gic-v4.1: Optimize the delay time of the poll on the GICR_VPENDBASER.Dirty bit

Hi,

Our team just discussed this issue again and consulted our GIC hardware
design team. They think the RD can afford busy waiting. So we still think
maybe 0 is better, at least for our hardware.

In addition, if not 0, as I said before, our measurements show that it takes
only hundreds of nanoseconds, or 1~2 microseconds, to finish parsing the VPT
in most cases. So maybe 1 microsecond, or less, is more appropriate.
Anyway, 10 microseconds is too much.

But it has to be said that it does depend on the hardware implementation.

Besides, I'm not sure where the start and end points of the total vcpu
scheduling latency you mentioned are, since it includes many events. Is the
parse time of the VPT not clear enough?


2020-09-16 08:42:51

by Marc Zyngier

Subject: Re: [PATCH] irqchip/gic-v4.1: Optimize the delay time of the poll on the GICR_VPENDBASER.Dirty bit

On 2020-09-16 08:04, lushenming wrote:
> Hi,
>
> Our team just discussed this issue again and consulted our GIC hardware
> design team. They think the RD can afford busy waiting. So we still think
> maybe 0 is better, at least for our hardware.
>
> In addition, if not 0, as I said before, our measurements show that it takes
> only hundreds of nanoseconds, or 1~2 microseconds, to finish parsing the VPT
> in most cases. So maybe 1 microsecond, or less, is more appropriate.
> Anyway, 10 microseconds is too much.
>
> But it has to be said that it does depend on the hardware implementation.

Exactly. And given that the only publicly available implementation is
a software model, I am reluctant to change performance-related things
based on benchmarks that can't be verified and that appear to me to be
a micro-optimization.

> Besides, I'm not sure where the start and end points of the total vcpu
> scheduling latency you mentioned are, since it includes many events. Is the
> parse time of the VPT not clear enough?

Measure the time it takes from kvm_vcpu_load() to the point where the
vcpu enters the guest. How much, in proportion, do these 1/2/10us
represent?

Also, a better(?) course of action would maybe be to consider whether we
should split the its_vpe_schedule() call into two distinct operations:
one that programs the VPE to be resident, and another that polls the
Dirty bit *much later* on the entry path, giving the GIC a chance to
work in parallel with the CPU on the entry path.

If your HW is as quick as you say it is, it would pretty much guarantee
a clear read of GICR_VPENDBASER without waiting.

M.
--
Jazz is not dead. It just smells funny...

2020-09-18 12:36:04

by Shenming Lu

Subject: RE: [PATCH] irqchip/gic-v4.1: Optimize the delay time of the poll on the GICR_VPENDBASER.Dirty bit

Hi, Marc,

I measured the time from vcpu_load() (inclusive) to __guest_enter() on
Kunpeng 920. On average, it takes 2.55 microseconds (not the first run,
and with the VPT empty). So waiting for 10 microseconds during vcpu
scheduling really hurts performance.

And I agree that delaying the execution of its_wait_vpt_parse_complete() might be a viable solution.
