From: Andrew Geissler <[email protected]>
Commit f90bc0f97f2b ("ipmi: kcs: Poll OBF briefly to reduce OBE
latency") introduced an optimization to poll when the host has
read the output data register (ODR). Testing has shown that the 100us
timeout was not always enough. When we miss that 100us window, it
results in 10x the time to get the next message from the BMC to the
host. When you're sending 100's of messages between the BMC and Host,
this results in a server boot taking 50% longer for IBM P10 machines.
Started with 1000 and worked it down until the issue started to reoccur.
200 was the sweet spot in my testing. 150 showed the issue
intermittently.
Signed-off-by: Andrew Geissler <[email protected]>
---
drivers/char/ipmi/kcs_bmc_aspeed.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/char/ipmi/kcs_bmc_aspeed.c b/drivers/char/ipmi/kcs_bmc_aspeed.c
index 72640da55380..af1eae6153f6 100644
--- a/drivers/char/ipmi/kcs_bmc_aspeed.c
+++ b/drivers/char/ipmi/kcs_bmc_aspeed.c
@@ -422,7 +422,7 @@ static void aspeed_kcs_irq_mask_update(struct kcs_bmc_device *kcs_bmc, u8 mask,
* missed the event.
*/
rc = read_poll_timeout_atomic(aspeed_kcs_inb, str,
- !(str & KCS_BMC_STR_OBF), 1, 100, false,
+ !(str & KCS_BMC_STR_OBF), 1, 200, false,
&priv->kcs_bmc, priv->kcs_bmc.ioreg.str);
/* Time for the slow path? */
if (rc == -ETIMEDOUT)
--
2.39.2
Dear Andrew,
Thank you for your patch. Some style suggestions.
Am 20.02.24 um 13:36 schrieb Andrew Geissler:
> From: Andrew Geissler <[email protected]>
(Oh no, Yahoo. (ignore))
You could be more specific in the git commit message by using *Double*:
> ipmi: kcs: Double OBF poll timeout to reduce latency
> ipmi: kcs: Double OBF poll timeout to 200 us to reduce latency
> Commit f90bc0f97f2b ("ipmi: kcs: Poll OBF briefly to reduce OBE
> latency") introduced an optimization to poll when the host has
> read the output data register (ODR). Testing has shown that the 100us
> timeout was not always enough. When we miss that 100us window, it
> results in 10x the time to get the next message from the BMC to the
> host. When you're sending 100's of messages between the BMC and Host,
I do not understand, how this poll timeout can result in such an
increase, and why a quite big timeout hurts, but I do not know the
implementation.
> this results in a server boot taking 50% longer for IBM P10 machines.
>
> Started with 1000 and worked it down until the issue started to reoccur.
> 200 was the sweet spot in my testing. 150 showed the issue
> intermittently.
I’d add a blank line here.
> Signed-off-by: Andrew Geissler <[email protected]>
> ---
> drivers/char/ipmi/kcs_bmc_aspeed.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/char/ipmi/kcs_bmc_aspeed.c b/drivers/char/ipmi/kcs_bmc_aspeed.c
> index 72640da55380..af1eae6153f6 100644
> --- a/drivers/char/ipmi/kcs_bmc_aspeed.c
> +++ b/drivers/char/ipmi/kcs_bmc_aspeed.c
> @@ -422,7 +422,7 @@ static void aspeed_kcs_irq_mask_update(struct kcs_bmc_device *kcs_bmc, u8 mask,
> * missed the event.
> */
> rc = read_poll_timeout_atomic(aspeed_kcs_inb, str,
> - !(str & KCS_BMC_STR_OBF), 1, 100, false,
> + !(str & KCS_BMC_STR_OBF), 1, 200, false,
> &priv->kcs_bmc, priv->kcs_bmc.ioreg.str);
> /* Time for the slow path? */
> if (rc == -ETIMEDOUT)
Kind regards,
Paul
On Tue, Feb 20, 2024 at 04:51:21PM +0100, Paul Menzel wrote:
> Dear Andrew,
>
>
> Thank you for your patch. Some style suggestions.
>
> Am 20.02.24 um 13:36 schrieb Andrew Geissler:
> > From: Andrew Geissler <[email protected]>
>
> (Oh no, Yahoo. (ignore))
>
> You could be more specific in the git commit message by using *Double*:
>
> > ipmi: kcs: Double OBF poll timeout to reduce latency
>
> > ipmi: kcs: Double OBF poll timeout to 200 us to reduce latency
>
> > Commit f90bc0f97f2b ("ipmi: kcs: Poll OBF briefly to reduce OBE
> > latency") introduced an optimization to poll when the host has
I assume that removing that patch doesn't fix the issue, it would only
make it worse, right?
> > read the output data register (ODR). Testing has shown that the 100us
> > timeout was not always enough. When we miss that 100us window, it
> > results in 10x the time to get the next message from the BMC to the
> > host. When you're sending 100's of messages between the BMC and Host,
>
> I do not understand, how this poll timeout can result in such an increase,
> and why a quite big timeout hurts, but I do not know the implementation.
It's because increasing that number causes it to poll longer for the
event, the host takes longer than 100us to generate the event, and if
the event is missed the time when it is checked again is very long.
Polling for 100us is already pretty extreme. 200us is really too long.
The real problem is that there is no interrupt for this. I'd also guess
there is no interrupt on the host side, because that would solve this
problem, too, as it would certainly get around to handling the interupt
in 100us. I'm assuming the host driver is not the Linux driver, as it
should also handle this in a timely manner, even when polling.
If people want hardware to perform well, they ought to design it and not
expect software to fix all the problems.
The right way to fix this is probably to do the same thing the host side
Linux driver does. It has a kernel thread that is kicked off to do
this. Unfortunately, that's more complicated to implement, but it
avoids polling in this location (which causes latency issues on the BMC
side) and lets you poll longer without causing issues.
I'll let the people who maintain that code comment.
-corey
>
> > this results in a server boot taking 50% longer for IBM P10 machines.
> >
> > Started with 1000 and worked it down until the issue started to reoccur.
> > 200 was the sweet spot in my testing. 150 showed the issue
> > intermittently.
>
> I’d add a blank line here.
>
> > Signed-off-by: Andrew Geissler <[email protected]>
> > ---
> > drivers/char/ipmi/kcs_bmc_aspeed.c | 2 +-
> > 1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/drivers/char/ipmi/kcs_bmc_aspeed.c b/drivers/char/ipmi/kcs_bmc_aspeed.c
> > index 72640da55380..af1eae6153f6 100644
> > --- a/drivers/char/ipmi/kcs_bmc_aspeed.c
> > +++ b/drivers/char/ipmi/kcs_bmc_aspeed.c
> > @@ -422,7 +422,7 @@ static void aspeed_kcs_irq_mask_update(struct kcs_bmc_device *kcs_bmc, u8 mask,
> > * missed the event.
> > */
> > rc = read_poll_timeout_atomic(aspeed_kcs_inb, str,
> > - !(str & KCS_BMC_STR_OBF), 1, 100, false,
> > + !(str & KCS_BMC_STR_OBF), 1, 200, false,
> > &priv->kcs_bmc, priv->kcs_bmc.ioreg.str);
> > /* Time for the slow path? */
> > if (rc == -ETIMEDOUT)
>
>
> Kind regards,
>
> Paul
On Tue, 2024-02-20 at 13:33 -0600, Corey Minyard wrote:
> On Tue, Feb 20, 2024 at 04:51:21PM +0100, Paul Menzel wrote:
> > Dear Andrew,
> >
> >
> > Thank you for your patch. Some style suggestions.
> >
> > Am 20.02.24 um 13:36 schrieb Andrew Geissler:
> > > From: Andrew Geissler <[email protected]>
> >
> > (Oh no, Yahoo. (ignore))
> >
> > You could be more specific in the git commit message by using *Double*:
> >
> > > ipmi: kcs: Double OBF poll timeout to reduce latency
> >
> > > ipmi: kcs: Double OBF poll timeout to 200 us to reduce latency
> >
> > > Commit f90bc0f97f2b ("ipmi: kcs: Poll OBF briefly to reduce OBE
> > > latency") introduced an optimization to poll when the host has
>
> I assume that removing that patch doesn't fix the issue, it would only
> make it worse, right?
Yep.
>
> > > read the output data register (ODR). Testing has shown that the 100us
> > > timeout was not always enough. When we miss that 100us window, it
> > > results in 10x the time to get the next message from the BMC to the
> > > host. When you're sending 100's of messages between the BMC and Host,
> >
> > I do not understand, how this poll timeout can result in such an increase,
> > and why a quite big timeout hurts, but I do not know the implementation.
>
> It's because increasing that number causes it to poll longer for the
> event, the host takes longer than 100us to generate the event, and if
> the event is missed the time when it is checked again is very long.
>
> Polling for 100us is already pretty extreme. 200us is really too long.
>
> The real problem is that there is no interrupt for this. I'd also guess
> there is no interrupt on the host side, because that would solve this
> problem, too, as it would certainly get around to handling the interupt
> in 100us. I'm assuming the host driver is not the Linux driver, as it
> should also handle this in a timely manner, even when polling.
I expect the issues Andrew G is observing are with the Power10 boot
firmware. The boot firmware only polls. The runtime firmware enables
interrupts.
>
> If people want hardware to perform well, they ought to design it and not
> expect software to fix all the problems.
+1
>
> The right way to fix this is probably to do the same thing the host side
> Linux driver does. It has a kernel thread that is kicked off to do
> this. Unfortunately, that's more complicated to implement, but it
> avoids polling in this location (which causes latency issues on the BMC
> side) and lets you poll longer without causing issues.
In Andrew G's case he's talking MCTP over KCS using a vendor-defined
transport binding (that also leverages LPC FWH cycles for bulk data
transfers)[1]. I think it could have taken more inspiration from the
IPMI KCS protocol: It might be worth an experiment to write the dummy
command value to IDR from the host side after each ODR read to signal
the host's clearing of OBF (no interrupt for the BMC) with an IBF
(which does interrupt the BMC). And doing the obverse for the BMC. Some
brief thought suggests that if the dummy value is read there's no need
to send a dummy value in reply (as it's an indicator to read the status
register). With that the need for the spin here (or on the host side)
is reduced at the cost of some constant protocol overhead.
[1]: https://github.com/openbmc/libmctp/blob/master/docs/bindings/vendor-ibm-astlpc.md
Andrew J
> On Feb 20, 2024, at 4:36 PM, Andrew Jeffery <[email protected]> wrote:
>
> On Tue, 2024-02-20 at 13:33 -0600, Corey Minyard wrote:
>> On Tue, Feb 20, 2024 at 04:51:21PM +0100, Paul Menzel wrote:
>>> Dear Andrew,
>>
>> It's because increasing that number causes it to poll longer for the
>> event, the host takes longer than 100us to generate the event, and if
>> the event is missed the time when it is checked again is very long.
>>
>> Polling for 100us is already pretty extreme. 200us is really too long.
>>
>> The real problem is that there is no interrupt for this. I'd also guess
>> there is no interrupt on the host side, because that would solve this
>> problem, too, as it would certainly get around to handling the interupt
>> in 100us. I'm assuming the host driver is not the Linux driver, as it
>> should also handle this in a timely manner, even when polling.
>
> I expect the issues Andrew G is observing are with the Power10 boot
> firmware. The boot firmware only polls. The runtime firmware enables
> interrupts.
Yep, this is with the low level host boot firmware.
Also, further testing over night showed that 200us wasn’t enough for
our larger Everest P10 machines, I needed to go to 300us. As we
were struggling to allow 200us, I assume 300us is going to be a no-go.
>>
>
>>
>> The right way to fix this is probably to do the same thing the host side
>> Linux driver does. It has a kernel thread that is kicked off to do
>> this. Unfortunately, that's more complicated to implement, but it
>> avoids polling in this location (which causes latency issues on the BMC
>> side) and lets you poll longer without causing issues.
>
> In Andrew G's case he's talking MCTP over KCS using a vendor-defined
> transport binding (that also leverages LPC FWH cycles for bulk data
> transfers)[1]. I think it could have taken more inspiration from the
> IPMI KCS protocol: It might be worth an experiment to write the dummy
> command value to IDR from the host side after each ODR read to signal
> the host's clearing of OBF (no interrupt for the BMC) with an IBF
> (which does interrupt the BMC). And doing the obverse for the BMC. Some
> brief thought suggests that if the dummy value is read there's no need
> to send a dummy value in reply (as it's an indicator to read the status
> register). With that the need for the spin here (or on the host side)
> is reduced at the cost of some constant protocol overhead.
>
Thanks for the quick reviews and ideas.
I’ll see if I can find someone on the team to help out with Andrew J’s
thoughts and if that doesn’t work, look into the kernel thread idea.
>
>
> Andrew J
On Wed, Feb 21, 2024 at 10:57:38AM -0600, Andrew Geissler wrote:
>
>
> > On Feb 20, 2024, at 4:36 PM, Andrew Jeffery <[email protected]> wrote:
> >
> > On Tue, 2024-02-20 at 13:33 -0600, Corey Minyard wrote:
> >> On Tue, Feb 20, 2024 at 04:51:21PM +0100, Paul Menzel wrote:
> >>> Dear Andrew,
> >>
> >> It's because increasing that number causes it to poll longer for the
> >> event, the host takes longer than 100us to generate the event, and if
> >> the event is missed the time when it is checked again is very long.
> >>
> >> Polling for 100us is already pretty extreme. 200us is really too long.
> >>
> >> The real problem is that there is no interrupt for this. I'd also guess
> >> there is no interrupt on the host side, because that would solve this
> >> problem, too, as it would certainly get around to handling the interupt
> >> in 100us. I'm assuming the host driver is not the Linux driver, as it
> >> should also handle this in a timely manner, even when polling.
> >
> > I expect the issues Andrew G is observing are with the Power10 boot
> > firmware. The boot firmware only polls. The runtime firmware enables
> > interrupts.
>
> Yep, this is with the low level host boot firmware.
> Also, further testing over night showed that 200us wasn’t enough for
> our larger Everest P10 machines, I needed to go to 300us. As we
> were struggling to allow 200us, I assume 300us is going to be a no-go.
It seems odd to me that firmware polling would be an issue. Usually,
with firmware, you have it just spinning waiting for something. At
least in the firmware I worked with.
I'm not familiar with this firmware, though, maybe it has timers and
such and parallel execution. Can this be fixed on the firmware side?
>
> >>
> >
> >>
> >> The right way to fix this is probably to do the same thing the host side
> >> Linux driver does. It has a kernel thread that is kicked off to do
> >> this. Unfortunately, that's more complicated to implement, but it
> >> avoids polling in this location (which causes latency issues on the BMC
> >> side) and lets you poll longer without causing issues.
> >
> > In Andrew G's case he's talking MCTP over KCS using a vendor-defined
> > transport binding (that also leverages LPC FWH cycles for bulk data
> > transfers)[1]. I think it could have taken more inspiration from the
> > IPMI KCS protocol: It might be worth an experiment to write the dummy
> > command value to IDR from the host side after each ODR read to signal
> > the host's clearing of OBF (no interrupt for the BMC) with an IBF
> > (which does interrupt the BMC). And doing the obverse for the BMC. Some
> > brief thought suggests that if the dummy value is read there's no need
> > to send a dummy value in reply (as it's an indicator to read the status
> > register). With that the need for the spin here (or on the host side)
> > is reduced at the cost of some constant protocol overhead.
> >
>
> Thanks for the quick reviews and ideas.
> I’ll see if I can find someone on the team to help out with Andrew J’s
> thoughts and if that doesn’t work, look into the kernel thread idea.
I don't really understand Andrew J's ideas very well, but hopefully they
help. The kernel thread idea is fairly complicated to implement, and
there has been an impetus in the kernel to not create new kernel
threads. But there just has to be a good reason, and this probably is
one. We worked on it a lot in the IPMI host driver to tune it and got
it to a point where it provided decent performance without causing power
management issues. When I first read the title I was worried it was
talking about this code; I'm lothe to touch it for fear of breaking
things.
-corey