2020-10-15 17:41:45

by Neeraj Upadhyay

[permalink] [raw]
Subject: Queries on ARM SDEI Linux kernel code

Hi James,

Have few queries on ARM SDEI Linux code. Queries are listed below; can
you please help provide your insights on these?

1. Looks like interrupt bind interface (SDEI_1_0_FN_SDEI_INTERRUPT_BIND)
is not available for clients to use; can you please share information on
why it is not provided?

While trying to dig information on this, I saw that [1] says:
Now the hotplug callbacks save nothing, and restore the OS-view of
registered/enabled. This makes bound-interrupts harder to work with.

Based on this comment, the changes from v4 [2], which I could understand
is, cpu down path does not save the current event enable status, and we
rely on the enable status `event->reenable', which is set, when
register/unregister, enable/disable calls are made; this enable status
is used during cpu up path, to decide whether to reenable the interrupt.

Does this make, bound-interrupts harder to work with? how? Can you
please explain? Or above save/restore is not the reason and you meant
something else?

Also, does shared bound interrupts also have the same problem, as
save/restore behavior was only for private events?

2. SDEI_EVENT_SIGNAL api is not provided? What is the reason for it? Its
handling has the same problems, which are there for bound interrupts?

Also, if it is provided, clients need to register event 0 ? Vendor
events or other event nums are not supported, as per spec.

3. Can kernel panic() be triggered from sdei event handler? Is it a safe
operation? The spec says, synchronous exceptions should not be
triggered; I think panic won't do it; but anything which triggers a WARN
or other sync exception in that path can cause undefined behavior. Can
you share your thoughts on this?

"The handler code should not enable asynchronous exceptions by clearing
any of the PSTATE.DAIF bits, and should not cause synchronous exceptions
to the client Exception level."


[1] https://lwn.net/Articles/740817/
[2] https://www.spinics.net/lists/kvm-arm/msg27784.html

--
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a
member of the Code Aurora Forum, hosted by The Linux Foundation


2020-10-16 19:57:04

by James Morse

[permalink] [raw]
Subject: Re: Queries on ARM SDEI Linux kernel code

Hi Neeraj,

On 15/10/2020 07:07, Neeraj Upadhyay wrote:
> 1. Looks like interrupt bind interface (SDEI_1_0_FN_SDEI_INTERRUPT_BIND) is not available
> for clients to use; can you please share information on
> why it is not provided?

There is no compelling use-case for it, and its very complex to support as the driver can
no longer hide things like hibernate.

Last time I looked, it looked like the SDEI driver would need to ask the irqchip to
prevent modification while firmware re-configures the irq. I couldn't work out how this
would work if the irq is in-progress on another CPU.

The reasons to use bound-interrupts can equally be supported with an event provided by
firmware.


> While trying to dig information on this, I saw  that [1] says:
>   Now the hotplug callbacks save  nothing, and restore the OS-view of registered/enabled.
> This makes bound-interrupts harder to work with.

> Based on this comment, the changes from v4 [2], which I could understand is, cpu down path
> does not save the current event enable status, and we rely on the enable status
> `event->reenable', which is set, when register/unregister, enable/disable calls are made;
> this enable status is used during cpu up path, to decide whether to reenable the interrupt.

> Does this make, bound-interrupts harder to work with? how? Can you please explain? Or
> above save/restore is not the reason and you meant something else?

If you bind a level-triggered interrupt, how does firmware know how to clear the interrupt
from whatever is generating it?

What happens if the OS can't do this either, as it needs to allocate memory, or take a
lock, which it can't do in nmi context?


The people that wrote the SDEI spec's answer to this was that the handler can disable the
event from inside the handler... and firmware will do, something, to stop the interrupt
screaming.

So now an event can become disabled anytime its registered, which makes it more
complicated to save/restore.


> Also, does shared bound interrupts

Shared-interrupts as an NMI made me jump. But I think you mean a bound interrupt as a
shared event. i.e. and SPI not a PPI.


> also have the same problem, as save/restore behavior
> was only for private events?

See above, the problem is the event disabling itself.

Additionally those changes to unregister the private-event mean the code can't tell the
difference between cpuhp and hibernate... only hibernate additionally loses the state in
firmware.


> 2. SDEI_EVENT_SIGNAL api is not provided? What is the reason for it? Its handling has the
> same problems, which are there for bound interrupts?

Its not supported as no-one showed up with a use-case.
While firmware is expected to back it with a PPI, its doesn't have the same problems as
bound-interrupts as its not an interrupt the OS ever knows about.


> Also, if it is provided, clients need to register event 0 ? Vendor events or other event
> nums are not supported, as per spec.

Ideally the driver would register the event, and provide a call_on_cpu() helper to trigger
it. This should fit in with however the GIC's PMR based NMI does its PPI based
crash/stacktrace call so that the caller doesn't need to know if its back by IRQ, pNMI or
SDEI.


> 3. Can kernel panic() be triggered from sdei event handler?

Yes,


> Is it a safe operation?

panic() wipes out the machine... did you expect it to keep running?
What does safe mean here?

You should probably call nmi_panic() if there is the risk that the event occurred during
panic() on the same CPU, as it would otherwise just block.


> The spec says, synchronous exceptions should not be triggered; I think panic
> won't do it; but anything which triggers a WARN
> or other sync exception in that path can cause undefined behavior. Can you share your
> thoughts on this?

What do you mean by undefined behaviour?

SDEI was originally to report external abort to the OS in regions where the OS can't take
an exception because the exception-registers are live, just after and exception and just
before eret.

If you take another exception from the NMI handler, chances are you're going to go back
round the loop again, only this time firmware can't inject the SDEI event, so it has to
reboot.

If you know it might cause an exception, you shouldn't do it in NMI context.


> "The handler code should not enable asynchronous exceptions by clearing any of the
> PSTATE.DAIF bits, and should not cause synchronous exceptions to the client Exception level."


What are you using this thing for?


Thanks,

James

2020-10-22 06:21:40

by Neeraj Upadhyay

[permalink] [raw]
Subject: Re: Queries on ARM SDEI Linux kernel code

Hi James,

Sorry for late reply. Thanks for your comments!

On 10/16/2020 9:57 PM, James Morse wrote:
> Hi Neeraj,
>
> On 15/10/2020 07:07, Neeraj Upadhyay wrote:
>> 1. Looks like interrupt bind interface (SDEI_1_0_FN_SDEI_INTERRUPT_BIND) is not available
>> for clients to use; can you please share information on
>> why it is not provided?
>
> There is no compelling use-case for it, and its very complex to support as the driver can
> no longer hide things like hibernate.
>
> Last time I looked, it looked like the SDEI driver would need to ask the irqchip to
> prevent modification while firmware re-configures the irq. I couldn't work out how this
> would work if the irq is in-progress on another CPU.
>

Got it. I will think in this direction, on how to achieve this.

> The reasons to use bound-interrupts can equally be supported with an event provided by
> firmware.
>
>
Ok, I will explore in that direction.

>> While trying to dig information on this, I saw  that [1] says:
>>   Now the hotplug callbacks save  nothing, and restore the OS-view of registered/enabled.
>> This makes bound-interrupts harder to work with.
>
>> Based on this comment, the changes from v4 [2], which I could understand is, cpu down path
>> does not save the current event enable status, and we rely on the enable status
>> `event->reenable', which is set, when register/unregister, enable/disable calls are made;
>> this enable status is used during cpu up path, to decide whether to reenable the interrupt.
>
>> Does this make, bound-interrupts harder to work with? how? Can you please explain? Or
>> above save/restore is not the reason and you meant something else?
>
> If you bind a level-triggered interrupt, how does firmware know how to clear the interrupt
> from whatever is generating it?
>
> What happens if the OS can't do this either, as it needs to allocate memory, or take a
> lock, which it can't do in nmi context?
>
> Ok, makes sense.
> The people that wrote the SDEI spec's answer to this was that the handler can disable the
> event from inside the handler... and firmware will do, something, to stop the interrupt
> screaming.
>
> So now an event can become disabled anytime its registered, which makes it more
> complicated to save/restore.
>
>
>> Also, does shared bound interrupts
>
> Shared-interrupts as an NMI made me jump. But I think you mean a bound interrupt as a
> shared event. i.e. and SPI not a PPI.
>
>
Sorry I should have worded properly; yes I meant SPI as shared event.

>> also have the same problem, as save/restore behavior
>> was only for private events?
>
> See above, the problem is the event disabling itself.
>
This makes sense now.

> Additionally those changes to unregister the private-event mean the code can't tell the
> difference between cpuhp and hibernate... only hibernate additionally loses the state in
> firmware.
>
>
Got it!
>> 2. SDEI_EVENT_SIGNAL api is not provided? What is the reason for it? Its handling has the
>> same problems, which are there for bound interrupts?
>
> Its not supported as no-one showed up with a use-case.
> While firmware is expected to back it with a PPI, its doesn't have the same problems as
> bound-interrupts as its not an interrupt the OS ever knows about.
>
>
>> Also, if it is provided, clients need to register event 0 ? Vendor events or other event
>> nums are not supported, as per spec.
>
> Ideally the driver would register the event, and provide a call_on_cpu() helper to trigger
> it. This should fit in with however the GIC's PMR based NMI does its PPI based
> crash/stacktrace call so that the caller doesn't need to know if its back by IRQ, pNMI or
> SDEI.
>
>
Ok; I will explore how PMR based NMIs work; I thought it was SGI based.
But will recheck.

>> 3. Can kernel panic() be triggered from sdei event handler?
>
> Yes,
>
>
>> Is it a safe operation?
>
> panic() wipes out the machine... did you expect it to keep running?

I wanted to check the case where panic triggers kexec/kdump path into
capture kernel.

> What does safe mean here?
> I think I didn't put it correctly; I meant what possible scenarios can
happen in this case and you explained one below, thanks!

> You should probably call nmi_panic() if there is the risk that the event occurred during
> panic() on the same CPU, as it would otherwise just block.
>
>
>> The spec says, synchronous exceptions should not be triggered; I think panic
>> won't do it; but anything which triggers a WARN
>> or other sync exception in that path can cause undefined behavior. Can you share your
>> thoughts on this?
>
> What do you mean by undefined behaviour?
>
I was thinking, if SDEI event preempts EL1, at the point, where EL1 has
just entered an exception, and hasn't captured the registers like
spsr_el1, elr_el1 and other registers, what will be the behavior?

> SDEI was originally to report external abort to the OS in regions where the OS can't take
> an exception because the exception-registers are live, just after and exception and just
> before eret.
>
> If you take another exception from the NMI handler, chances are you're going to go back
> round the loop again, only this time firmware can't inject the SDEI event, so it has to
> reboot.
>
Got it.
> If you know it might cause an exception, you shouldn't do it in NMI context.
>
>
Ok, I understand now.

>> "The handler code should not enable asynchronous exceptions by clearing any of the
>> PSTATE.DAIF bits, and should not cause synchronous exceptions to the client Exception level."
>
>
> What are you using this thing for?
>
>
Usecase is, a watchdog SPI interrupt, which we want to bound to a SDEI
event. Below is the flow:

wdog expiry -> SDEI event -> HLOS panic -> trigger kexec/kdump


Thanks
Neeraj

> Thanks,
>
> James
>

--
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a
member of the Code Aurora Forum, hosted by The Linux Foundation

2020-10-30 14:57:14

by James Morse

[permalink] [raw]
Subject: Re: Queries on ARM SDEI Linux kernel code

Hi Neeraj,

On 21/10/2020 18:31, Neeraj Upadhyay wrote:
> On 10/16/2020 9:57 PM, James Morse wrote:
>> On 15/10/2020 07:07, Neeraj Upadhyay wrote:
>>> 1. Looks like interrupt bind interface (SDEI_1_0_FN_SDEI_INTERRUPT_BIND) is not available
>>> for clients to use; can you please share information on
>>> why it is not provided?
>>
>> There is no compelling use-case for it, and its very complex to support as the driver can
>> no longer hide things like hibernate.
>>
>> Last time I looked, it looked like the SDEI driver would need to ask the irqchip to
>> prevent modification while firmware re-configures the irq. I couldn't work out how this
>> would work if the irq is in-progress on another CPU.

> Got it. I will think in this direction, on how to achieve this.

I'm really not keen on supporting it! Its basically unusable.


>> The reasons to use bound-interrupts can equally be supported with an event provided by
>> firmware.
>>
>>
> Ok, I will explore in that direction.

Great!


[...]

>> Ideally the driver would register the event, and provide a call_on_cpu() helper to trigger
>> it. This should fit in with however the GIC's PMR based NMI does its PPI based
>> crash/stacktrace call so that the caller doesn't need to know if its back by IRQ, pNMI or
>> SDEI.

> Ok; I will explore how PMR based NMIs work; I thought it was SGI based. But will recheck.

This is where the recent work has been. (One of) Julien's cover-letter's describes it as
supporting PPI and SPI: https://lwn.net/Articles/755906/


>>> 3. Can kernel panic() be triggered from sdei event handler?
>>
>> Yes,
>>
>>
>>> Is it a safe operation?
>>
>> panic() wipes out the machine... did you expect it to keep running?
>
> I wanted to check the case where panic triggers kexec/kdump path into capture kernel.
>
>> What does safe mean here?

> I think I didn't put it correctly; I meant what possible scenarios can
> happen in this case and you explained one below, thanks!

Ah, kdump. You will certainly get into the kdump kernel, but I think the SDEI reset calls
will fail as there is still an event in progress, so the kernel will leave it masked to
prevent any new events being taken.
This shouldn't affect kdumps work of dumping memory, and calling reset.


>> You should probably call nmi_panic() if there is the risk that the event occurred during
>> panic() on the same CPU, as it would otherwise just block.
>>
>>
>>> The spec says, synchronous exceptions should not be triggered; I think panic
>>> won't do it; but anything which triggers a WARN
>>> or other sync exception in that path can cause undefined behavior. Can you share your
>>> thoughts on this?
>>
>> What do you mean by undefined behaviour?

> I was thinking, if SDEI event preempts EL1, at the point, where EL1 has just entered an
> exception, and hasn't captured the registers like spsr_el1, elr_el1 and other registers,
> what will be the behavior?

Exceptions to/from EL3 don't affect them, so those registers keep their value until the
next exception taken by EL1 (which overwrites them), or ERET from EL1 (which makes them
UNKNOWN). Kdump may change exception level on nVHE systems to do a reset.

If you need them for kdump, they should be saved in the crash handler... they aren't
needed in the general case as we learn what those values were by unwinding the stack,
which is also true for SDEI. (it gets the original PC from firmware to build a stack frame)

[...]

>>> "The handler code should not enable asynchronous exceptions by clearing any of the
>>> PSTATE.DAIF bits, and should not cause synchronous exceptions to the client Exception
>>> level."
>>
>>
>> What are you using this thing for?

> Usecase is, a watchdog SPI interrupt, which we want to bound to a SDEI event. Below is the
> flow:
>
> wdog expiry -> SDEI event -> HLOS panic -> trigger kexec/kdump

Having a common interface to this would be a good thing, that way firmware can hide how
its implemented.


Thanks,

James