2012-02-03 17:18:28

by Don Zickus

Subject: Re: [RFC][PATCH v4 -next 1/4] Move kmsg_dump(KMSG_DUMP_PANIC) below smp_send_stop()

On Fri, Jan 20, 2012 at 05:56:15PM +0000, Luck, Tony wrote:
> > Do you have any comments?
>
> I'm stuck because I don't know how to assign probabilities to
> the failure cases with kmsg_dump() before and after the smp_send_stop().
>
> There's a well documented tendency in humans to stick with the status
> quo in such situations. I'm definitely finding it hard to provide
> a positive recommendation (ACK).
>
> So I'll just talk out loud here for a bit in case someone sees
> something obviously flawed in my understanding.

Thanks for trying to think this through.

>
> Problem statement: We'd like to maximize our chances of saving the
> tail of the kernel log when the system goes down. With the current
> ordering there is a concern that other cpus will interfere with the
> one that is saving the log.
>
> Problems in current code flow:
> *) Other cpus might hold locks that we need. Our options are to fail,
> or to "bust" the locks (but busting the locks may lead to other
> problems in the code path - those locks were there for a reason).
> There are only a couple of ways that this could be an issue.
> 1) The lock is held because someone is doing some other pstore
> filesystem operation (reading and erasing records). This has a
> very low probability. Normal code flow will have some process harvest
> records from pstore in some /etc/rc.d/* script - this process should
> take much less than a second.

ok.

> 2) The lock is held because some other kmsg_dump() store is in progress.
> This one seems more worrying - think of an OOPS (or several) right
> before we panic.

I think Seiji had another patch to address this.

>
> Problems in proposed code flow:
> *) smp_send_stop() fails:
> 1) doesn't actually stop other cpus (we are no worse off than before we
> made this change)
> 2) doesn't return - so we don't even try to dump to pstore back end. x86
> code has recently been hardened (though I can still imagine a pathological
> case where in a crash the cpu calling this is uncertain of its own
> identity, and somehow manages to stop itself - perhaps we are so screwed up
> in this case that we have no hope anyway)

Where in the code do you see that it might not return? We can also
conceive of a scenario where pstore or APEI code has a bug and oopses
the box a second time, leaving us no better off. That code has a lot
more churn than the shutdown code, I believe.

> *) Even if it succeeds - we may still run into problems busting locks because
> even though the cpu that held them isn't executing, the data structures
> or device registers protected by the lock may be in an inconsistent state.
> *) If we had just let these other cpus keep running, they'd have finished their
> operations and freed up the problem lock anyway

So this is an interesting concern and it would be nice to have that extra
second to finish things off before breaking the spin lock. I was trying
to figure out if there was a way to do that.

On x86 I think we can (and maybe other arches can too). I had a thought:
most spinlocks are taken with interrupts disabled (ie spin_lock_irqsave). By
disabling interrupts you block IPIs. Originally the shutdown code would
use the REBOOT_IPI to stop other cpus, but it would fail if some piece of code
on another cpu was spinning forever with irqs disabled. I modified it to
use NMIs to be more robust.

What if we send the REBOOT_IPI first and let it block for up to a second?
Most code paths that are done with spinlocks will use
spin_unlock_irqrestore. As soon as the interrupts are re-enabled, the
REBOOT_IPI comes in and takes the processor. If after a second the cpu
is still blocking interrupts, just use the NMI as a big hammer to shut it
down.

This would allow the pstore stuff to block shutting things down for a
little bit to finish writing its data out before accepting the IPI (at
least for a second). Otherwise if it takes more than a second and the NMI
has to come in, we may have to investigate what is going on.

Would that help win you over?

I know that is x86 specific, but other arches might be able to adapt a
similar approach?

Cheers,
Don


2012-02-03 22:32:49

by Luck, Tony

Subject: RE: [RFC][PATCH v4 -next 1/4] Move kmsg_dump(KMSG_DUMP_PANIC) below smp_send_stop()

> What if we send the REBOOT_IPI first and let it block for up to a second?
> Most code paths that are done with spinlocks will use
> spin_unlock_irqrestore. As soon as the interrupts are re-enabled, the
> REBOOT_IPI comes in and takes the processor. If after a second the cpu
> is still blocking interrupts, just use the NMI as a big hammer to shut it
> down.

This looks good - it certainly deals with my "if we just let them run
a bit, they'd release the locks" quibble. One second sounds very
generous - but I'm not going to bikeshed that (so long as it is a total
of one second - not one second per cpu). So the pseudo-code is:

send_reboot_ipi_to_everyone_else()

wait_1_second()

for_each_cpu_that_didnt_respond_to_reboot_ipi {
    hit_that_cpu_with_NMI()
}

Perhaps a notification printk() if we had to use the NMI hammer?

-Tony

2012-02-03 22:58:06

by Don Zickus

Subject: Re: [RFC][PATCH v4 -next 1/4] Move kmsg_dump(KMSG_DUMP_PANIC) below smp_send_stop()

On Fri, Feb 03, 2012 at 10:32:31PM +0000, Luck, Tony wrote:
> > What if we send the REBOOT_IPI first and let it block for up to a second?
> > Most code paths that are done with spinlocks will use
> > spin_unlock_irqrestore. As soon as the interrupts are re-enabled, the
> > REBOOT_IPI comes in and takes the processor. If after a second the cpu
> > is still blocking interrupts, just use the NMI as a big hammer to shut it
> > down.
>
> This looks good - it certainly deals with my "if we just let them run
> a bit, they'd release the locks" quibble. One second sounds very
> generous - but I'm not going to bikeshed that (so long as it is a total
> of one second - not one second per cpu). So the pseudo-code is:

This is how stop_cpus is implemented on x86, and the one second comes
from there:

arch/x86/kernel/smp.c::native_irq_stop_other_cpus and
native_nmi_stop_other_cpus

>
> send_reboot_ipi_to_everyone_else()
>
> wait_1_second()
>
> for_each_cpu_that_didnt_respond_to_reboot_ipi {
> hit_that_cpu_with_NMI()
> }
>
> Perhaps a notification printk() if we had to use the NMI hammer?

Yes.

Again this is for x86, but I guess that is our common case with pstore.

Cheers,
Don

2012-02-08 20:20:22

by Don Zickus

Subject: Re: [RFC][PATCH v4 -next 1/4] Move kmsg_dump(KMSG_DUMP_PANIC) below smp_send_stop()

On Fri, Feb 03, 2012 at 05:57:40PM -0500, Don Zickus wrote:
> On Fri, Feb 03, 2012 at 10:32:31PM +0000, Luck, Tony wrote:
> > > What if we send the REBOOT_IPI first and let it block for up to a second?
> > > Most code paths that are done with spinlocks will use
> > > spin_unlock_irqrestore. As soon as the interrupts are re-enabled, the
> > > REBOOT_IPI comes in and takes the processor. If after a second the cpu
> > > is still blocking interrupts, just use the NMI as a big hammer to shut it
> > > down.
> >
> > This looks good - it certainly deals with my "if we just let them run
> > a bit, they'd release the locks" quibble. One second sounds very
> > generous - but I'm not going to bikeshed that (so long as it is a total
> > of one second - not one second per cpu). So the pseudo-code is:
>

Hi Tony,

If I put together a patch to address this would you be willing to let
Seiji move the kmsg_dump to below smp_send_stop()?

Cheers,
Don

> This is how the stop_cpus is implemented on x86 and the one second comes
> from there
>
> arch/x86/kernel/smp.c::native_irq_stop_other_cpus and
> native_nmi_stop_other_cpus
>
> >
> > send_reboot_ipi_to_everyone_else()
> >
> > wait_1_second()
> >
> > for_each_cpu_that_didnt_respond_to_reboot_ipi {
> > hit_that_cpu_with_NMI()
> > }
> >
> > Perhaps a notification printk() if we had to use the NMI hammer?
>
> Yes.
>
> Again this is for x86, but I guess that is our common case with pstore.
>
> Cheers,
> Don

2012-02-08 21:28:45

by Luck, Tony

Subject: RE: [RFC][PATCH v4 -next 1/4] Move kmsg_dump(KMSG_DUMP_PANIC) below smp_send_stop()

> If I put together a patch to address this would you be willing to let
> Seiji move the kmsg_dump to below smp_send_stop()?

I'd "Ack" it ... not my decision whether to apply it.

-Tony

2012-02-08 22:49:05

by Don Zickus

Subject: Re: [RFC][PATCH v4 -next 1/4] Move kmsg_dump(KMSG_DUMP_PANIC) below smp_send_stop()

On Wed, Feb 08, 2012 at 09:28:37PM +0000, Luck, Tony wrote:
> > If I put together a patch to address this would you be willing to let
> > Seiji move the kmsg_dump to below smp_send_stop()?
>
> I'd "Ack" it ... not my decision whether to apply it.

Sure, I understand. But I figured your objection is holding it up too. ;-)

Let me put together a patch for that change.

Cheers,
Don

2012-02-08 22:57:12

by Seiji Aguchi

Subject: RE: [RFC][PATCH v4 -next 1/4] Move kmsg_dump(KMSG_DUMP_PANIC) below smp_send_stop()

Tony,

You are the appropriate person to apply this patch.

Seiji

-----Original Message-----
From: Don Zickus [mailto:[email protected]]
Sent: Wednesday, February 08, 2012 5:49 PM
To: Luck, Tony
Cc: Seiji Aguchi; Chen Gong; [email protected]; Matthew Garrett; Vivek Goyal; Chen, Gong; [email protected]; Brown, Len; Huang, Ying; '[email protected]'; '[email protected]'; '[email protected]'; [email protected]; [email protected]; [email protected]; [email protected]; Satoru Moriya
Subject: Re: [RFC][PATCH v4 -next 1/4] Move kmsg_dump(KMSG_DUMP_PANIC) below smp_send_stop()

On Wed, Feb 08, 2012 at 09:28:37PM +0000, Luck, Tony wrote:
> > If I put together a patch to address this would you be willing to
> > let Seiji move the kmsg_dump to below smp_send_stop()?
>
> I'd "Ack" it ... not my decision whether to apply it.

Sure, I understand. But I figured your objection is holding it up too. ;-)

Let me put together a patch for that change.

Cheers,
Don