2014-10-14 22:37:53

by Daniel Drake

[permalink] [raw]
Subject: Re: [PATCH 3.17-rc4 v7 0/6] arm: Implement arch_trigger_all_cpu_backtrace

Hi,

Thanks a lot for working on this!

On Wed, Sep 17, 2014 at 10:10 AM, Daniel Thompson
<[email protected]> wrote:
> Changes *before* v1:
>
> * This patchset is a hugely cut-down successor to "[PATCH v11 00/19]
> arm: KGDB NMI/FIQ support". Thanks to Thomas Gleixner for suggesting
> the new structure. For historic details see:
> https://lkml.org/lkml/2014/9/2/227

What's the right way to extend your work in order to get a NMI-like
watchdog hard lockup detector similar to the one on x86?

I'm testing your patches on Exynos4412 and I guess in their current
state they don't go quite this deep, as the only callers of
trigger_all_cpu_backtrace() are sysrq, hung_task and spinlock debug
code - none of which seem as fail-safe as a trigger like a
pre-programmed watchdog NMI interrupt would be.

Do I need to find a way to get CONFIG_FIQ available on this platform
first? and/or CONFIG_HARDLOCKUP_DETECTOR?

Thanks
Daniel


2014-10-14 23:31:48

by Russell King - ARM Linux

[permalink] [raw]
Subject: Re: [PATCH 3.17-rc4 v7 0/6] arm: Implement arch_trigger_all_cpu_backtrace

On Tue, Oct 14, 2014 at 04:37:51PM -0600, Daniel Drake wrote:
> Hi,
>
> Thanks a lot for working on this!
>
> On Wed, Sep 17, 2014 at 10:10 AM, Daniel Thompson
> <[email protected]> wrote:
> > Changes *before* v1:
> >
> > * This patchset is a hugely cut-down successor to "[PATCH v11 00/19]
> > arm: KGDB NMI/FIQ support". Thanks to Thomas Gleixner for suggesting
> > the new structure. For historic details see:
> > https://lkml.org/lkml/2014/9/2/227
>
> What's the right way to extend your work in order to get a NMI-like
> watchdog hard lockup detector similar to the one on x86?
>
> I'm testing your patches on Exynos4412 and I guess in their current
> state they don't go quite this deep, as the only callers of
> trigger_all_cpu_backtrace() are sysrq, hung_task and spinlock debug
> code - none of which seem as fail-safe as a trigger like a
> pre-programmed watchdog NMI interrupt would be.
>
> Do I need to find a way to get CONFIG_FIQ available on this platform
> first? and/or CONFIG_HARDLOCKUP_DETECTOR?

The blocker on this work right now is the annoying Versatile Express
platform, which pretty much means that we currently can't push the
code into the GIC to support FIQs. As long as adding FIQ support to
the GIC results in the Versatile Express becoming non-bootable, the
idea of using FIQs is a total non-starter.

Or we decide that we dump the platform completely (which will upset
a number of developers.)

I have patches I'm using for trigger_all_cpu_backtrace() which I'm
maintaining privately in my tree until we can get the FIQ situation
sorted.

--
FTTC broadband for 0.8mile line: currently at 9.5Mbps down 400kbps up
according to speedtest.net.

2014-10-16 09:23:46

by Daniel Thompson

[permalink] [raw]
Subject: Re: [PATCH 3.17-rc4 v7 0/6] arm: Implement arch_trigger_all_cpu_backtrace

On 14/10/14 23:37, Daniel Drake wrote:
> Hi,
>
> Thanks a lot for working on this!
>
> On Wed, Sep 17, 2014 at 10:10 AM, Daniel Thompson
> <[email protected]> wrote:
>> Changes *before* v1:
>>
>> * This patchset is a hugely cut-down successor to "[PATCH v11 00/19]
>> arm: KGDB NMI/FIQ support". Thanks to Thomas Gleixner for suggesting
>> the new structure. For historic details see:
>> https://lkml.org/lkml/2014/9/2/227
>
> What's the right way to extend your work in order to get a NMI-like
> watchdog hard lockup detector similar to the one on x86?

There are a few things to get into place for this.

1. Figure out what number to put into the PMU to get an interrupt every
10s and provide the stub functions for the lock up detector.

2. Modify the current ARM PMU support to make is possible for this code
to run from a FIQ handler. This should be feasible by replicating
the design pattern used on x86. Nevertheless this is a fairly big
chunk of code review and testing.

3. Modify the Linux IRQ support to allow some kind of flag to
hint/demand that an interrupt be treated as NMI-ish in order to
switch (unshared) interrupts into FIQ mode and hook this up in
the GIC.

[Side note, this approach was suggested by Thomas Gleixner in
response to some rather hacky patches from me. My patches are
robust enough but are poorly designed and hard to maintain.
Thus if you want to do any quick prototyping you might skip this
step and dig out my old patches:

https://git.linaro.org/people/daniel.thompson/linux.git/shortlog/refs/heads/dev/kdb-fiq

Note also that, as a side effect of the above, tools like oprofile would
also get a very significant boost for kernel profiling because they
would no longer attribute time spent in interrupt handlers to interrupt
unmask functions.

At present I've done a little work towards all three of the above but
none are complete (most of the code has never been executed).


> I'm testing your patches on Exynos4412 and I guess in their current
> state they don't go quite this deep, as the only callers of
> trigger_all_cpu_backtrace() are sysrq, hung_task and spinlock debug
> code - none of which seem as fail-safe as a trigger like a
> pre-programmed watchdog NMI interrupt would be.
>
> Do I need to find a way to get CONFIG_FIQ available on this platform
> first? and/or CONFIG_HARDLOCKUP_DETECTOR?

You need CONFIG_FIQ working first. Be aware that this may be impossible
on Exynos unless you control the TrustZone. For this reason most of my
development is on Freescale i.MX6 (because i.MX6 boots in secure mode).


Daniel.

2014-10-16 09:33:30

by Daniel Thompson

[permalink] [raw]
Subject: Re: [PATCH 3.17-rc4 v7 0/6] arm: Implement arch_trigger_all_cpu_backtrace

On 15/10/14 00:31, Russell King - ARM Linux wrote:
> On Tue, Oct 14, 2014 at 04:37:51PM -0600, Daniel Drake wrote:
>> Hi,
>>
>> Thanks a lot for working on this!
>>
>> On Wed, Sep 17, 2014 at 10:10 AM, Daniel Thompson
>> <[email protected]> wrote:
>>> Changes *before* v1:
>>>
>>> * This patchset is a hugely cut-down successor to "[PATCH v11 00/19]
>>> arm: KGDB NMI/FIQ support". Thanks to Thomas Gleixner for suggesting
>>> the new structure. For historic details see:
>>> https://lkml.org/lkml/2014/9/2/227
>>
>> What's the right way to extend your work in order to get a NMI-like
>> watchdog hard lockup detector similar to the one on x86?
>>
>> I'm testing your patches on Exynos4412 and I guess in their current
>> state they don't go quite this deep, as the only callers of
>> trigger_all_cpu_backtrace() are sysrq, hung_task and spinlock debug
>> code - none of which seem as fail-safe as a trigger like a
>> pre-programmed watchdog NMI interrupt would be.
>>
>> Do I need to find a way to get CONFIG_FIQ available on this platform
>> first? and/or CONFIG_HARDLOCKUP_DETECTOR?
>
> The blocker on this work right now is the annoying Versatile Express
> platform, which pretty much means that we currently can't push the
> code into the GIC to support FIQs. As long as adding FIQ support to
> the GIC results in the Versatile Express becoming non-bootable, the
> idea of using FIQs is a total non-starter.
>
> Or we decide that we dump the platform completely (which will upset
> a number of developers.)
>
> I have patches I'm using for trigger_all_cpu_backtrace() which I'm
> maintaining privately in my tree until we can get the FIQ situation
> sorted.

I do hope to gain (remote) access to a vexpress at some point just to
pick at this issue a little.

That said your previous description of the issue and of the GIC version
register leaves very little to explore!

In the end I'm expecting to have to use some kind of black list logic. I
assuming that vexpress by its nature as a bring up platform is more
likely to exhibit problems in this sort of area and therefore a black
list is more appropriate than a white list.


Daniel.

2014-10-16 12:23:37

by Russell King - ARM Linux

[permalink] [raw]
Subject: Re: [PATCH 3.17-rc4 v7 0/6] arm: Implement arch_trigger_all_cpu_backtrace

On Thu, Oct 16, 2014 at 10:23:52AM +0100, Daniel Thompson wrote:
> On 14/10/14 23:37, Daniel Drake wrote:
> > I'm testing your patches on Exynos4412 and I guess in their current
> > state they don't go quite this deep, as the only callers of
> > trigger_all_cpu_backtrace() are sysrq, hung_task and spinlock debug
> > code - none of which seem as fail-safe as a trigger like a
> > pre-programmed watchdog NMI interrupt would be.
> >
> > Do I need to find a way to get CONFIG_FIQ available on this platform
> > first? and/or CONFIG_HARDLOCKUP_DETECTOR?
>
> You need CONFIG_FIQ working first. Be aware that this may be impossible
> on Exynos unless you control the TrustZone. For this reason most of my
> development is on Freescale i.MX6 (because i.MX6 boots in secure mode).

CONFIG_FIQ enables the legacy FIQ code which is unsuitable for use on
SMP, so that should not be a requirement.

We still need to validate all the code we're proposing to run in FIQ
context does not violate any locking. IRQ-safe locks will do not
prevent FIQs occuring, and using IRQ-safe locks which are also taken
in the FIQ path /will/ cause deadlocks. So, we need to ensure that
the perf internals are safe for this.

Lastly, platforms running in non-secure mode most likely will not be
able to take /any/ advantage from the FIQ stuff because FIQs will
likely only be available to the secure firmware.

--
FTTC broadband for 0.8mile line: currently at 9.5Mbps down 400kbps up
according to speedtest.net.

2014-10-16 13:15:02

by Daniel Thompson

[permalink] [raw]
Subject: Re: [PATCH 3.17-rc4 v7 0/6] arm: Implement arch_trigger_all_cpu_backtrace

On 16/10/14 13:23, Russell King - ARM Linux wrote:
> On Thu, Oct 16, 2014 at 10:23:52AM +0100, Daniel Thompson wrote:
>> On 14/10/14 23:37, Daniel Drake wrote:
>>> I'm testing your patches on Exynos4412 and I guess in their current
>>> state they don't go quite this deep, as the only callers of
>>> trigger_all_cpu_backtrace() are sysrq, hung_task and spinlock debug
>>> code - none of which seem as fail-safe as a trigger like a
>>> pre-programmed watchdog NMI interrupt would be.
>>>
>>> Do I need to find a way to get CONFIG_FIQ available on this platform
>>> first? and/or CONFIG_HARDLOCKUP_DETECTOR?
>>
>> You need CONFIG_FIQ working first. Be aware that this may be impossible
>> on Exynos unless you control the TrustZone. For this reason most of my
>> development is on Freescale i.MX6 (because i.MX6 boots in secure mode).
>
> CONFIG_FIQ enables the legacy FIQ code which is unsuitable for use on
> SMP, so that should not be a requirement.

Sorry. That was rather stupid phrasing on my part.

What I mean is that before doing any other work related to FIQ one
should establish that the platform really is using FIQ to trigger
backtraces! On platforms where FIQ cannot be supported the code falls
back to using IRQs (making the IRQ handling easily spotted in the
backtrace).

2014-11-04 17:05:37

by Daniel Thompson

[permalink] [raw]
Subject: Re: [PATCH 3.17-rc4 v7 0/6] arm: Implement arch_trigger_all_cpu_backtrace

On 16/10/14 10:33, Daniel Thompson wrote:
>> The blocker on this work right now is the annoying Versatile Express
>> platform, which pretty much means that we currently can't push the
>> code into the GIC to support FIQs. As long as adding FIQ support to
>> the GIC results in the Versatile Express becoming non-bootable, the
>> idea of using FIQs is a total non-starter.
>>
>> Or we decide that we dump the platform completely (which will upset
>> a number of developers.)
>>
>> I have patches I'm using for trigger_all_cpu_backtrace() which I'm
>> maintaining privately in my tree until we can get the FIQ situation
>> sorted.
>
> I do hope to gain (remote) access to a vexpress at some point just to
> pick at this issue a little.

This week with the help of one of my colleagues (thanks Tixy) I have
been able to run some tests and figure out what it going on on vexpress-a9.

The summary is that on some GICv1 implementations the bit to enable
group 1 interrupts cannot be accessed using secure memory accesses. More
specifically the presence/absence of the EnableGrp1 bit in the secure
version GICD_CTRL register is implementation defined.

My original patches overlooked this and as a result the existing code
will migrate all interrupts to group but then cannot enable delivery of
group 1 interrupts.

I'm planning to respin the code so it will automatically disable FIQ
support when the EnableGrp1 bit is not implemented. This means
vexpress-a9 will not benefit from FIQ support but will also not be
harmed by it.

2014-11-04 17:19:54

by Russell King - ARM Linux

[permalink] [raw]
Subject: Re: [PATCH 3.17-rc4 v7 0/6] arm: Implement arch_trigger_all_cpu_backtrace

On Tue, Nov 04, 2014 at 05:05:25PM +0000, Daniel Thompson wrote:
> This week with the help of one of my colleagues (thanks Tixy) I have
> been able to run some tests and figure out what it going on on vexpress-a9.
>
> The summary is that on some GICv1 implementations the bit to enable
> group 1 interrupts cannot be accessed using secure memory accesses. More
> specifically the presence/absence of the EnableGrp1 bit in the secure
> version GICD_CTRL register is implementation defined.
>
> My original patches overlooked this and as a result the existing code
> will migrate all interrupts to group but then cannot enable delivery of
> group 1 interrupts.
>
> I'm planning to respin the code so it will automatically disable FIQ
> support when the EnableGrp1 bit is not implemented. This means
> vexpress-a9 will not benefit from FIQ support but will also not be
> harmed by it.

That's good news, and it should mean that the Versatile Express starts
booting on the nightly boot tests again. Thanks for looking into it.

--
FTTC broadband for 0.8mile line: currently at 9.5Mbps down 400kbps up
according to speedtest.net.