On 17 September, 2021 - Corey Minyard wrote:
> On Fri, Sep 17, 2021 at 02:55:25PM +0200, Anton Lundin wrote:
> > On 17 September, 2021 - Corey Minyard wrote:
> >
> > > On Fri, Sep 17, 2021 at 12:14:19PM +0200, Anton Lundin wrote:
> > > > On 16 September, 2021 - Corey Minyard wrote:
> > > >
> > > > > On Thu, Sep 16, 2021 at 04:53:00PM +0200, Anton Lundin wrote:
> > > > > > Hi.
> > > > > >
> > > > > > I've just done a upgrade of the kernel we're using in a product from
> > > > > > 4.19 to 5.10 and I noted a issue.
> > > > > >
> > > > > > It started that with that we didn't get panic and oops dumps in our erst
> > > > > > backed pstore, and when debugging that I noted that the reboot on panic
> > > > > > timer didn't work either.
> > > > > >
> > > > > > I've bisected it down to 2033f6858970 ("ipmi: Free receive messages when
> > > > > > in an oops").
> > > > >
> > > > > Hmm. Unfortunately removing that will break other things. Can you try
> > > > > the following patch? It's a good idea, in general, to do as little as
> > > > > possible in the panic path, this should cover a multitude of issues.
> > > > >
> > > > > Thanks for the report.
> > > > >
> > > >
> > > > I'm sorry to report that the patch didn't solve the issue, and the
> > > > machine locked up in the panic path as before.
> > >
> > > I missed something. Can you try the following? If this doesn't work,
> > > I'm going to have to figure out how to reproduce this.
> > >
> >
> > Sorry, still no joy.
> >
> > My guess is that there is something locking up due to these Supermicro
> > machines have their ERST memory backed by the BMC, and the same BMC is
> > is the other end of all the ipmi communications.
> >
> > I've reproduced this on Server/X11SCZ-F and Server/H11SSL-i but I'm
> > guessing it can be reproduced on most, if not all, of their hardware
> > with the same setup.
> >
> > We're using the ERST backend for pstore, because we're still
> > bios-booting them and don't have efi services available to use as pstore
> > backend.
> >
> >
> > I've tested to just yank out the ipmi modules from the kernel and that
> > fixes the panic timer and we get crash dumps to pstore.
>
> Dang. I'm going to have to look deeper at what that could change to
> cause an issue like this. Are you using the IPMI watchdog? Do you have
> CONFIG_IPMI_PANIC_EVENT=y set in the config?
# CONFIG_IPMI_PANIC_EVENT is not set
We're using the IPMI watchdog.
//Anton
Well, that was dumb. Fix follows...
Thanks for working on this. On your approval, I'll send this to Linus.
-corey
ipmi:watchdog: Set panic count to proper value on a panic
You will get two decrements when the messages on a panic are sent, not
one, since 2033f6858970 ("ipmi: Free receive messages when in an oops")
was added, but the watchdog code had a bug where it didn't set the value
properly.
Reported-by: Anton Lundin <[email protected]>
Cc: <[email protected]> # v5.4+
Fixes: 2033f6858970 ("ipmi: Free receive messages when in an oops")
Signed-off-by: Corey Minyard <[email protected]>
---
drivers/char/ipmi/ipmi_watchdog.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/char/ipmi/ipmi_watchdog.c b/drivers/char/ipmi/ipmi_watchdog.c
index e4ff3b50de7f..9a64a069ffd1 100644
--- a/drivers/char/ipmi/ipmi_watchdog.c
+++ b/drivers/char/ipmi/ipmi_watchdog.c
@@ -531,7 +531,7 @@ static void panic_halt_ipmi_set_timeout(void)
/* Wait for the messages to be free. */
while (atomic_read(&panic_done_count) != 0)
ipmi_poll_interface(watchdog_user);
- atomic_inc(&panic_done_count);
+ atomic_set(&panic_done_count, 2);
rv = __ipmi_set_timeout(&panic_halt_smi_msg,
&panic_halt_recv_msg,
&send_heartbeat_now);
On 20 September, 2021 - Corey Minyard wrote:
> Well, that was dumb. Fix follows...
>
> Thanks for working on this. On your approval, I'll send this to Linus.
Winner winner chicken dinner!
This fixes the issue, and now panic timer works, and we get crashdumps
to pstore.
Great job, I approve!
Thanks for your help getting this fixed.
//Anton
On Mon, Sep 20, 2021 at 04:12:31PM +0200, Anton Lundin wrote:
> On 20 September, 2021 - Corey Minyard wrote:
>
> > Well, that was dumb. Fix follows...
> >
> > Thanks for working on this. On your approval, I'll send this to Linus.
>
> Winner winner chicken dinner!
>
> This fixes the issue, and now panic timer works, and we get crashdumps
> to pstore.
>
> Great job, I approve!
>
>
> Thanks for your help getting this fixed.
Thanks for reporting this. I'll get the patch in.
-corey
>
>
> //Anton
On Mon, Sep 20, 2021 at 09:41:46AM -0500, Corey Minyard wrote:
>On Mon, Sep 20, 2021 at 04:12:31PM +0200, Anton Lundin wrote:
>> On 20 September, 2021 - Corey Minyard wrote:
>>
>> > Well, that was dumb. Fix follows...
>> >
>> > Thanks for working on this. On your approval, I'll send this to Linus.
>>
>> Winner winner chicken dinner!
>>
>> This fixes the issue, and now panic timer works, and we get crashdumps
>> to pstore.
>>
>> Great job, I approve!
>>
>>
>> Thanks for your help getting this fixed.
>
>Thanks for reporting this. I'll get the patch in.
Hey Corey,
Just checking in to see if this patch was lost; I haven't seen it in
Linus's tree just yet.
--
Thanks,
Sasha
On Wed, Oct 27, 2021 at 01:59:09PM -0400, Sasha Levin wrote:
> On Mon, Sep 20, 2021 at 09:41:46AM -0500, Corey Minyard wrote:
> > On Mon, Sep 20, 2021 at 04:12:31PM +0200, Anton Lundin wrote:
> > > On 20 September, 2021 - Corey Minyard wrote:
> > >
> > > > Well, that was dumb. Fix follows...
> > > >
> > > > Thanks for working on this. On your approval, I'll send this to Linus.
> > >
> > > Winner winner chicken dinner!
> > >
> > > This fixes the issue, and now panic timer works, and we get crashdumps
> > > to pstore.
> > >
> > > Great job, I approve!
> > >
> > >
> > > Thanks for your help getting this fixed.
> >
> > Thanks for reporting this. I'll get the patch in.
>
> Hey Corey,
>
> Just checking in to see if this patch was lost; I haven't seen it in
> Linus's tree just yet.
I generally wait until the merge window for changes. It's too late in
the process for a patch now unless it's really critical.
rc7 is out now, the merge window should be opening soon.
-corey
On Wed, Oct 27, 2021 at 01:20:27PM -0500, Corey Minyard wrote:
>On Wed, Oct 27, 2021 at 01:59:09PM -0400, Sasha Levin wrote:
>> On Mon, Sep 20, 2021 at 09:41:46AM -0500, Corey Minyard wrote:
>> > On Mon, Sep 20, 2021 at 04:12:31PM +0200, Anton Lundin wrote:
>> > > On 20 September, 2021 - Corey Minyard wrote:
>> > >
>> > > > Well, that was dumb. Fix follows...
>> > > >
>> > > > Thanks for working on this. On your approval, I'll send this to Linus.
>> > >
>> > > Winner winner chicken dinner!
>> > >
>> > > This fixes the issue, and now panic timer works, and we get crashdumps
>> > > to pstore.
>> > >
>> > > Great job, I approve!
>> > >
>> > >
>> > > Thanks for your help getting this fixed.
>> >
>> > Thanks for reporting this. I'll get the patch in.
>>
>> Hey Corey,
>>
>> Just checking in to see if this patch was lost; I haven't seen it in
>> Linus's tree just yet.
>
>I generally wait until the merge window for changes. It's too late in
>the process for a patch now unless it's really critical.
>
>rc7 is out now, the merge window should be opening soon.
Ah, great. I thought it would go in via one of the -rc releases given
it's a fix.
Thank you!
--
Thanks,
Sasha