2011-02-03 00:55:48

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: Query about kdump_msg hook into crash_kexec()

Hi

> Hi,
>
> I noticed the following commit, which hooks into crash_kexec() to call
> kmsg_dump().
>
> I think it is not a very good idea to pass control to modules after
> crash_kexec() has been called, because modules can try to take locks
> or perform other operations that we really should not be doing at that
> point, and thereby fail kdump as well. The whole design of kdump is
> built on doing only minimal work in the crashing kernel so that the
> transition to the second kernel is robust. With this hook, a kmsg
> dumper breaks that assumption.

I guess you are talking about some fool who can shoot himself in the
foot. If so, yes: any kernel module can cause a kernel panic or an even
more disastrous result.


> Anyway, if an image is loaded and we have set things up to capture a
> dump as well, why do we need to save kmsg with the help of a helper? I
> am assuming this is more of a debugging aid if we have no other way to
> capture the kernel log buffer. So if somebody has set up kdump to
> capture the vmcore, why call another handler which tries to save part
> of the vmcore (kmsg) separately?

No.

kmsg_dump() is designed for embedded systems. And some embedded devices
use kexec for non-dumping purposes. (Have you ever seen an embedded
device show a "Now storing dump image.." message?)

Anyway, you are free to avoid using kmsg_dump().




2011-02-03 02:05:33

by Vivek Goyal

[permalink] [raw]
Subject: Re: Query about kdump_msg hook into crash_kexec()

On Thu, Feb 03, 2011 at 09:55:41AM +0900, KOSAKI Motohiro wrote:
> Hi
>
> > Hi,
> >
> > I noticed the following commit, which hooks into crash_kexec() to call
> > kmsg_dump().
> >
> > I think it is not a very good idea to pass control to modules after
> > crash_kexec() has been called, because modules can try to take locks
> > or perform other operations that we really should not be doing at that
> > point, and thereby fail kdump as well. The whole design of kdump is
> > built on doing only minimal work in the crashing kernel so that the
> > transition to the second kernel is robust. With this hook, a kmsg
> > dumper breaks that assumption.
>
> I guess you are talking about some fool who can shoot himself in the
> foot. If so, yes: any kernel module can cause a kernel panic or an even
> more disastrous result.

Yes, but the difference is that once a fool shoots his foot, the kernel
tries to take a meaningful action to figure out what went wrong, like
displaying an oops backtrace or dumping a core (if kdump is configured),
so that one can figure out who the fool was and what he did.

Now think about giving control to two fools. The first fool shoots his
foot, and then the kernel transfers control to another fool, who
completely screws up the situation so that one cannot save the core.

>
>
> > Anyway, if an image is loaded and we have set things up to capture a
> > dump as well, why do we need to save kmsg with the help of a helper? I
> > am assuming this is more of a debugging aid if we have no other way to
> > capture the kernel log buffer. So if somebody has set up kdump to
> > capture the vmcore, why call another handler which tries to save part
> > of the vmcore (kmsg) separately?
>
> No.
>
> kmsg_dump() is designed for embedded systems.

Great. And I like the idea of trying to save some useful information
to non-volatile RAM or flash or something like that.

> And some embedded devices use kexec for non-dumping purposes. (Have you
> ever seen an embedded device show a "Now storing dump image.." message?)

No, I have not. Can you explain a bit more: apart from a kernel dump,
what are the other purposes of kexec?

>
> Anyway, you are free to avoid using kmsg_dump().

Yes, that is one more way, but this information is not even exported to
user space, so one cannot figure out whether there are any registered
users of kmsg_dump.

Secondly, there are two more important things.

- Why do you need a notification from inside crash_kexec()? IOW, what
  is the usage of KMSG_DUMP_KEXEC?

- One can anyway call kmsg_dump() outside crash_kexec(), before it, so
  that the kmsg_dump notification goes out before kdump gets control.
  What I am arguing here is that it is not necessarily a very good idea,
  because external modules can try to do any number of unsafe actions
  once we export the hook.

Doing this is still fine if kdump is not configured, as the system would
have rebooted anyway. But if kdump is configured, then we are just
reducing the reliability of the operation by passing control into the
hands of unaudited code and trusting it when kernel data structures are
corrupt.

So to me, sending out kmsg_dump notifications is perfectly fine when
kdump is not configured. But if it is configured, then it probably is
not a good idea. Anyway, if you have configured the system to capture
the full dump, why do you also need kmsg_dump? And if you are happy
with kmsg_dump(), then you don't need kdump. So the two seem to be
mutually exclusive anyway.

Thanks
Vivek

2011-02-03 04:52:08

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: Query about kdump_msg hook into crash_kexec()

> On Thu, Feb 03, 2011 at 09:55:41AM +0900, KOSAKI Motohiro wrote:
> > Hi
> >
> > > Hi,
> > >
> > > I noticed the following commit, which hooks into crash_kexec() to call
> > > kmsg_dump().
> > >
> > > I think it is not a very good idea to pass control to modules after
> > > crash_kexec() has been called, because modules can try to take locks
> > > or perform other operations that we really should not be doing at that
> > > point, and thereby fail kdump as well. The whole design of kdump is
> > > built on doing only minimal work in the crashing kernel so that the
> > > transition to the second kernel is robust. With this hook, a kmsg
> > > dumper breaks that assumption.
> >
> > I guess you are talking about some fool who can shoot himself in the
> > foot. If so, yes: any kernel module can cause a kernel panic or an even
> > more disastrous result.
>
> Yes, but the difference is that once a fool shoots his foot, the kernel
> tries to take a meaningful action to figure out what went wrong, like
> displaying an oops backtrace or dumping a core (if kdump is configured),
> so that one can figure out who the fool was and what he did.
>
> Now think about giving control to two fools. The first fool shoots his
> foot, and then the kernel transfers control to another fool, who
> completely screws up the situation so that one cannot save the core.

If you really want full control, you should disable CONFIG_MODULES,
kprobes, ftrace and perf. We already have a lot of ways to capture the
kernel, so disabling just this one feature doesn't solve anything.
Alternatively, I can imagine improving the security modules to audit
loaded kernel modules (and other injected code) more carefully.

So I'm curious why you hate this one part of them so much, and not all
of them.



> > > Anyway, if an image is loaded and we have set things up to capture a
> > > dump as well, why do we need to save kmsg with the help of a helper? I
> > > am assuming this is more of a debugging aid if we have no other way to
> > > capture the kernel log buffer. So if somebody has set up kdump to
> > > capture the vmcore, why call another handler which tries to save part
> > > of the vmcore (kmsg) separately?
> >
> > No.
> >
> > kmsg_dump() is designed for embedded systems.
>
> Great. And I like the idea of trying to save some useful information
> to non-volatile RAM or flash or something like that.

Yeah, thanks.

>
> > And some embedded devices use kexec for non-dumping purposes. (Have you
> > ever seen an embedded device show a "Now storing dump image.." message?)
>
> No, I have not. Can you explain a bit more: apart from a kernel dump,
> what are the other purposes of kexec?
>
> >
> > Anyway, you are free to avoid using kmsg_dump().
>
> Yes, that is one more way, but this information is not even exported to
> user space, so one cannot figure out whether there are any registered
> users of kmsg_dump.
>
> Secondly, there are two more important things.
>
> - Why do you need a notification from inside crash_kexec()? IOW, what
>   is the usage of KMSG_DUMP_KEXEC?

AFAIK, kexec is used as a sneaky way of rebooting when the system faces
an unexpected scenario on some devices. (Some embedded systems run for a
very long time, so they can't avoid memory bit corruption; a full reset
is a last resort, and the vendor gathers the logs at a periodic
check-up.)

The main purpose of introducing KMSG_DUMP_KEXEC was to separate it from
KMSG_DUMP_PANIC. In the initial kmsg_dump() patch, KMSG_DUMP_PANIC was
always called whether kdump was configured or not. But it is no good if
the same log appears both when kexec succeeded and when it failed.
Moreover, some people don't want any log in the kexec phase; they only
want logs on the real panic (i.e. kexec failure) path. So I separated it
into two. Two separate reason arguments can satisfy both of these
requirements.
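
To make the split concrete, a dumper can key off the reason argument and
stay quiet on the kexec path while still saving logs on a real panic.
Below is a minimal sketch against the kmsg_dump API of this era (the
dump callback receiving two log-buffer segments); treat it as an
illustration, not code from any in-tree driver.

---------------------------------------------------------------------
#include <linux/kmsg_dump.h>

static void my_dump(struct kmsg_dumper *dumper,
		    enum kmsg_dump_reason reason,
		    const char *s1, unsigned long l1,
		    const char *s2, unsigned long l2)
{
	/* kdump will capture everything anyway; stay quiet here */
	if (reason == KMSG_DUMP_KEXEC)
		return;

	/* only interested in the real panic (i.e. kexec failure) route */
	if (reason != KMSG_DUMP_PANIC)
		return;

	/* write the two log-buffer segments (s1/l1, s2/l2) to
	 * NVRAM/flash here */
}

static struct kmsg_dumper my_dumper = {
	.dump = my_dump,
};

static int __init my_dumper_init(void)
{
	return kmsg_dump_register(&my_dumper);
}
---------------------------------------------------------------------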


> - One can anyway call kmsg_dump() outside crash_kexec(), before it, so
>   that the kmsg_dump notification goes out before kdump gets control.
>   What I am arguing here is that it is not necessarily a very good idea,
>   because external modules can try to do any number of unsafe actions
>   once we export the hook.

I wrote above why I don't think this added any new risk (in short, there
are already plenty of other ways in). Can you please tell me your view
of my point? I'm afraid I haven't understood your worry yet, and I would
like to understand it before making an alternative fix.

> Doing this is still fine if kdump is not configured, as the system would
> have rebooted anyway. But if kdump is configured, then we are just
> reducing the reliability of the operation by passing control into the
> hands of unaudited code and trusting it when kernel data structures are
> corrupt.

At minimum, I fully agree that we need a reliable kdump. I am only
expressing doubt about whether removing this hook is a real solution.

> So to me, sending out kmsg_dump notifications is perfectly fine when
> kdump is not configured. But if it is configured, then it probably is
> not a good idea. Anyway, if you have configured the system to capture
> the full dump, why do you also need kmsg_dump? And if you are happy
> with kmsg_dump(), then you don't need kdump. So the two seem to be
> mutually exclusive anyway.

Honestly, I haven't heard of anyone using both at the same time. But I
can guess some reasons. 1) If the system is a very big box, kdump is a
really slooooooow operation. For example, some stock exchange systems
have half a terabyte of memory, which means dump retrieval can take half
a day in the worst case, yet the market has to open at 9:00 the next
morning. So having summary information (e.g. log and trace information)
saved separately is important. 2) Two crashes in sequence (i.e. crash,
kdump, reboot starts, next crash before the reboot finishes) can
overwrite the former dump image. Usually the admin _guesses_ that the
two crashes have the same cause and reports so to his boss, but
unfortunately customers in the high-end area often refuse a _guessed_
report.

Or it's for business competition reasons. As far as I have heard, IBM
and HP UNI*X systems can save the logs both to the dump and to a special
flash device.

PS: FWIW, the Hitachi folks have a usage idea for their enterprise
purposes, but unfortunately I don't know the details. I hope someone can
tell us.


2011-02-03 05:20:56

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: Query about kdump_msg hook into crash_kexec()

> AFAIK, kexec is used as a sneaky way of rebooting when the system faces
> an unexpected scenario on some devices. (Some embedded systems run for a
> very long time, so they can't avoid memory bit corruption; a full reset
> is a last resort, and the vendor gathers the logs at a periodic
> check-up.)
>
> The main purpose of introducing KMSG_DUMP_KEXEC was to separate it from
> KMSG_DUMP_PANIC. In the initial kmsg_dump() patch, KMSG_DUMP_PANIC was
> always called whether kdump was configured or not. But it is no good if
> the same log appears both when kexec succeeded and when it failed.
> Moreover, some people don't want any log in the kexec phase; they only
> want logs on the real panic (i.e. kexec failure) path. So I separated it
> into two. Two separate reason arguments can satisfy both of these
> requirements.

A bit of additional explanation: the original patch had the
kmsg_dump(KMSG_DUMP_PANIC) call site at the following point. I didn't
think that made either the embedded or the kdump folks happy, so I moved
it after crash_kexec().


---------------------------------------------------------------------
@@ -74,6 +75,7 @@ NORET_TYPE void panic(const char * fmt, ...)
 	dump_stack();
 #endif
 
+	kmsg_dump(KMSG_DUMP_PANIC);
 	/*
 	 * If we have crashed and we have a crash kernel loaded let it handle
 	 * everything else.
 	 * Do we want to call this before we try to display a message?
 	 */
 	crash_kexec(NULL);
---------------------------------------------------------------------
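
For comparison, after the move the ordering in panic() is roughly the
following. This is a simplified sketch of the resulting flow, not the
literal upstream code: kmsg_dump(KMSG_DUMP_KEXEC) fires inside
crash_kexec() only when a crash kernel is loaded, and
kmsg_dump(KMSG_DUMP_PANIC) is reached only if crash_kexec() returns.

---------------------------------------------------------------------
NORET_TYPE void panic(const char * fmt, ...)
{
	...
	/* crash_kexec() internally calls kmsg_dump(KMSG_DUMP_KEXEC)
	 * when a crash kernel is loaded, then boots into it and does
	 * not return. */
	crash_kexec(NULL);

	/* Reached only when kexec is not configured or has failed:
	 * the real panic (i.e. kexec failure) route. */
	kmsg_dump(KMSG_DUMP_PANIC);
	...
}
---------------------------------------------------------------------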


2011-02-03 18:40:05

by Seiji Aguchi

[permalink] [raw]
Subject: RE: Query about kdump_msg hook into crash_kexec()

Hi,

>PS: FWIW, the Hitachi folks have a usage idea for their enterprise
> purposes, but unfortunately I don't know the details. I hope someone
> can tell us.

Let me explain the usage of kmsg_dump(KMSG_DUMP_KEXEC) in the enterprise
area.

[Background]
In our support service experience, we always need to detect the root
cause of an OS panic. Customers in the enterprise area never forgive us
if kdump fails and we can't detect the root cause of the panic due to a
lack of material for the investigation.

>- Why do you need a notification from inside crash_kexec()? IOW, what
> is the usage of KMSG_DUMP_KEXEC?


The usage of kmsg_dump(KMSG_DUMP_KEXEC) in the enterprise area is to get
useful information for investigating a kernel crash in case the kdump
kernel doesn't boot.

The kdump kernel may not start booting, because a sha256 checksum is
verified over the kdump kernel before it starts booting. This means
kdump may fail even if there is no bug in kdump itself, and then we
can't get any information for detecting the root cause of the kernel
crash.

As I mentioned in [Background], we must avoid a lack of material for
the investigation. So kmsg_dump(KMSG_DUMP_KEXEC) is a very important
feature in the enterprise area.
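
For context, the verification Seiji refers to happens in the kexec-tools
purgatory code, which runs between the crashed kernel and the capture
kernel. Conceptually it looks like the sketch below; this is a
simplification from memory, so treat the function names as approximate
rather than the literal purgatory source.

---------------------------------------------------------------------
/* Conceptual sketch of kexec-tools purgatory: verify the sha256
 * digests of the loaded capture-kernel segments before jumping to
 * them. If the segments were corrupted while the first kernel was
 * running, refuse to boot them and hang instead. */
void purgatory(void)
{
	setup_arch();
	if (verify_sha256_digest()) {
		/* digest mismatch: the loaded kernel is corrupt, so
		 * stop here rather than run corrupted code */
		for (;;)
			;
	}
	post_verification_setup_arch();
	/* ...jump to the capture kernel... */
}
---------------------------------------------------------------------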

>- One can anyway call kmsg_dump() outside crash_kexec(), before it, so
> that the kmsg_dump notification goes out before kdump gets control.
> What I am arguing here is that it is not necessarily a very good idea,
> because external modules can try to do any number of unsafe actions
> once we export the hook.
>

kmsg_dump() is a feature for specific servers equipped with NVRAM or
flash memory, so it should be provided as an option.

By providing a notification, the Linux kernel is able to support this
feature flexibly on the following different kinds of servers:

- servers equipped with several NVRAMs or flash memories
- servers equipped with a single NVRAM or flash memory

Also, the Linux kernel remains able to support servers which don't have
any NVRAM/flash memory.

Seiji


2011-02-03 21:13:18

by Eric W. Biederman

[permalink] [raw]
Subject: Re: Query about kdump_msg hook into crash_kexec()

Seiji Aguchi <[email protected]> writes:

> Hi,
>
>>PS: FWIW, the Hitachi folks have a usage idea for their enterprise
>> purposes, but unfortunately I don't know the details. I hope someone
>> can tell us.
>
> Let me explain the usage of kmsg_dump(KMSG_DUMP_KEXEC) in the
> enterprise area.
>
> [Background]
> In our support service experience, we always need to detect the root
> cause of an OS panic. Customers in the enterprise area never forgive us
> if kdump fails and we can't detect the root cause of the panic due to a
> lack of material for the investigation.
>
>>- Why do you need a notification from inside crash_kexec()? IOW, what
>> is the usage of KMSG_DUMP_KEXEC?
>
>
> The usage of kmsg_dump(KMSG_DUMP_KEXEC) in the enterprise area is to
> get useful information for investigating a kernel crash in case the
> kdump kernel doesn't boot.
>
> The kdump kernel may not start booting, because a sha256 checksum is
> verified over the kdump kernel before it starts booting. This means
> kdump may fail even if there is no bug in kdump itself, and then we
> can't get any information for detecting the root cause of the kernel
> crash

Sure, it is theoretically possible that the sha256 checksum check fails
(I have never seen it happen or heard reports of it happening). It is a
feature that if someone has corrupted your code, the code doesn't try to
run anyway and corrupt anything else.

That you are arguing against having such a feature in the code you use
to write to persistent storage is scary.

> As I mentioned in [Background], we must avoid a lack of material for
> the investigation.
> So kmsg_dump(KMSG_DUMP_KEXEC) is a very important feature in the
> enterprise area.

That sounds wonderful, but it doesn't jibe with the code. When I read
through it, kmsg_dump(KMSG_DUMP_KEXEC) was simply not written to be
robust when most of the kernel is suspect, making it inappropriate for
use on the crash_kexec path. I do not believe kmsg_dump has seen any
testing in kernel failure scenarios.

There is this huge assumption that kmsg_dump is more reliable than
crash_kexec; from my review of the code, kmsg_dump is simply not safe in
the context of a broken kernel. Last I looked, the kmsg_dump code won't
work if called with interrupts disabled.

Furthermore, kmsg_dump(KMSG_DUMP_KEXEC) is only useful for debugging
crash_kexec, which has it backwards, as it is kmsg_dump that needs the
debugging.

You just argued that it is better to corrupt the target of your
kmsg_dump in the event of a kernel failure than to fail silently.

I don't want unreliable code that may corrupt my jffs partition anywhere
near my machines.

Eric

2011-02-03 22:09:52

by Seiji Aguchi

[permalink] [raw]
Subject: RE: Query about kdump_msg hook into crash_kexec()

Hi Eric,

Thank you for your prompt reply.

I would like to consider "Needs in enterprise area" and "Implementation of kmsg_dump()" separately.

(1) Needs in enterprise area
In case of kdump failure, we would like to store the kernel buffer to
NVRAM/flash memory for detecting the root cause of the kernel crash.

(2) Implementation of kmsg_dump
You suggest reviewing/testing the code of kmsg_dump() more.

What do you think about (1)?
Is it acceptable to you?

Seiji


2011-02-04 02:27:23

by Cong Wang

[permalink] [raw]
Subject: Re: Query about kdump_msg hook into crash_kexec()

On Thu, Feb 03, 2011 at 05:08:01PM -0500, Seiji Aguchi wrote:
>Hi Eric,
>
>Thank you for your prompt reply.
>
>I would like to consider "Needs in enterprise area" and "Implementation of kmsg_dump()" separately.
>
>(1) Needs in enterprise area
> In case of kdump failure, we would like to store the kernel buffer to
> NVRAM/flash memory for detecting the root cause of the kernel crash.
>
>(2) Implementation of kmsg_dump
> You suggest reviewing/testing the code of kmsg_dump() more.
>
>What do you think about (1)?
>Is it acceptable to you?
>

For "in case of kdump fails", we can move
KMSG_DUMP_OOPS/KMSG_DUMP_PANIC before calling crash_kexec(),
so you can capture the log before kdump happens.

2011-02-04 02:51:21

by Vivek Goyal

[permalink] [raw]
Subject: Re: Query about kdump_msg hook into crash_kexec()

On Fri, Feb 04, 2011 at 10:24:27AM +0800, Américo Wang wrote:
> On Thu, Feb 03, 2011 at 05:08:01PM -0500, Seiji Aguchi wrote:
> >Hi Eric,
> >
> >Thank you for your prompt reply.
> >
> >I would like to consider "Needs in enterprise area" and "Implementation of kmsg_dump()" separately.
> >
> >(1) Needs in enterprise area
> > In case of kdump failure, we would like to store the kernel buffer to
> > NVRAM/flash memory for detecting the root cause of the kernel crash.
> >
> >(2) Implementation of kmsg_dump
> > You suggest reviewing/testing the code of kmsg_dump() more.
> >
> >What do you think about (1)?
> >Is it acceptable to you?
> >
>
> For "in case of kdump fails", we can move
> KMSG_DUMP_OOPS/KMSG_DUMP_PANIC before calling crash_kexec(),
> so you can capture the log before kdump happens.

True. If the idea is to capture the logs before kdump starts saving the
core, then the kdump_msg() call can be put before the crash_kexec()
call, and there is no need to hook into crash_kexec().

Secondly, on the question of whether the kdump_msg() call should be
before crash_kexec() or not, given that modules might want to do lots of
unreliable things, I am now split. Especially because of the fact that
somebody who wants to can probably use a kprobe to hook into
crash_kexec() or panic() or something like that, to execute their code
before everything else.

The other reason I am a little split on this is that in the past I have
seen kdump fail many times, either because of chipset issues or because
of driver issues, etc. So though I don't like it, and I want drivers to
be fixed to take care of booting in the kdump environment, things have
not necessarily worked as well as we expected them to, and it is not
hard to run into a kdump failure.

Hence, though I do not prefer it, I will not stand in the way of
kdump_msg() being called before crash_kexec() gets control.

But we should at least remove the hook from crash_kexec() and get rid
of the additional config option (KMSG_DUMP_KEXEC).

Thanks
Vivek

2011-02-04 03:28:49

by Cong Wang

[permalink] [raw]
Subject: Re: Query about kdump_msg hook into crash_kexec()

On Thu, Feb 03, 2011 at 09:50:42PM -0500, Vivek Goyal wrote:
>On Fri, Feb 04, 2011 at 10:24:27AM +0800, Américo Wang wrote:
>> On Thu, Feb 03, 2011 at 05:08:01PM -0500, Seiji Aguchi wrote:
>> >Hi Eric,
>> >
>> >Thank you for your prompt reply.
>> >
>> >I would like to consider "Needs in enterprise area" and "Implementation of kmsg_dump()" separately.
>> >
>> >(1) Needs in enterprise area
>> > In case of kdump failure, we would like to store the kernel buffer to
>> > NVRAM/flash memory for detecting the root cause of the kernel crash.
>> >
>> >(2) Implementation of kmsg_dump
>> > You suggest reviewing/testing the code of kmsg_dump() more.
>> >
>> >What do you think about (1)?
>> >Is it acceptable to you?
>> >
>>
>> For "in case of kdump fails", we can move
>> KMSG_DUMP_OOPS/KMSG_DUMP_PANIC before calling crash_kexec(),
>> so you can capture the log before kdump happens.
>
>True. If the idea is to capture the logs before kdump starts saving the
>core, then the kdump_msg() call can be put before the crash_kexec()
>call, and there is no need to hook into crash_kexec().
>

Totally agreed.

>Secondly, on the question of whether the kdump_msg() call should be
>before crash_kexec() or not, given that modules might want to do lots of
>unreliable things, I am now split. Especially because of the fact that
>somebody who wants to can probably use a kprobe to hook into
>crash_kexec() or panic() or something like that, to execute their code
>before everything else.

If kprobes are the reason, then we could probably remove lots of other
kernel APIs, like the kmsg dumper.


>
>The other reason I am a little split on this is that in the past I have
>seen kdump fail many times, either because of chipset issues or because
>of driver issues, etc. So though I don't like it, and I want drivers to
>be fixed to take care of booting in the kdump environment, things have
>not necessarily worked as well as we expected them to, and it is not
>hard to run into a kdump failure.
>
>Hence, though I do not prefer it, I will not stand in the way of
>kdump_msg() being called before crash_kexec() gets control.
>

The point is that we don't have to choose one of them (the kmsg dumper
or kdump); we can choose to have both. Like Seiji said, they always want
the kernel messages to be saved. Duplicates are much better than
nothing.

2011-02-04 06:40:15

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: Query about kdump_msg hook into crash_kexec()

> >Secondly, on the question of whether the kdump_msg() call should be
> >before crash_kexec() or not, given that modules might want to do lots of
> >unreliable things, I am now split. Especially because of the fact that
> >somebody who wants to can probably use a kprobe to hook into
> >crash_kexec() or panic() or something like that, to execute their code
> >before everything else.
>
> If kprobes are the reason, then we could probably remove lots of other
> kernel APIs, like the kmsg dumper.

Really?

On enterprise systems, distro kernels have CONFIG_KPROBES=y. That has
both pros and cons. Pro: kprobes can provide an alternative hooking
mechanism, so the kmsg dumper is not a must. Con: even if we remove the
kmsg dumper, we can't get full control of the kernel, because kprobes
can inject unreliable code. It's two sides of the same coin. But many
embedded systems don't have the kprobes feature, so there kprobes can't
be an alternative to the kmsg dumper.


2011-02-04 14:58:06

by Vivek Goyal

[permalink] [raw]
Subject: Re: Query about kdump_msg hook into crash_kexec()

On Thu, Feb 03, 2011 at 01:52:01PM +0900, KOSAKI Motohiro wrote:

[..]
> > Secondly, there are two more important things.
> >
> > - Why do you need a notification from inside crash_kexec()? IOW, what
> >   is the usage of KMSG_DUMP_KEXEC?
>
> AFAIK, kexec is used as a sneaky way of rebooting when the system faces
> an unexpected scenario on some devices. (Some embedded systems run for a
> very long time, so they can't avoid memory bit corruption; a full reset
> is a last resort, and the vendor gathers the logs at a periodic
> check-up.)
>
> The main purpose of introducing KMSG_DUMP_KEXEC was to separate it from
> KMSG_DUMP_PANIC. In the initial kmsg_dump() patch, KMSG_DUMP_PANIC was
> always called whether kdump was configured or not. But it is no good if
> the same log appears both when kexec succeeded and when it failed.
> Moreover, some people don't want any log in the kexec phase; they only
> want logs on the real panic (i.e. kexec failure) path. So I separated it
> into two. Two separate reason arguments can satisfy both of these
> requirements.

This point is not very clear to me. Let me add some details.

- The crash_kexec() path is not taken during a regular kexec boot. It is
  only taken when the first kernel has crashed and we want to boot into
  the kdump kernel.

- If kdump fails, most likely you will either hang or panic in the
  second kernel, or something like that.

- The current code looks as follows:

  if (mutex_trylock(&kexec_mutex)) {
          if (kexec_crash_image) {
                  struct pt_regs fixed_regs;

                  kmsg_dump(KMSG_DUMP_KEXEC);

                  crash_setup_regs(&fixed_regs, regs);
                  crash_save_vmcoreinfo();
                  machine_crash_shutdown(&fixed_regs);
                  machine_kexec(kexec_crash_image);
          }
          mutex_unlock(&kexec_mutex);
  }

If a kdump kernel is not loaded you will call kmsg_dump(PANIC), and if
a kdump kernel is loaded you will call kmsg_dump(KEXEC), irrespective
of whether kdump succeeds or fails.

- The ramoops driver does not differentiate between KMSG_DUMP_KEXEC and
  KMSG_DUMP_PANIC. Similarly, the mtdoops driver does not differentiate
  between the two reasons. IOW, to me there does not seem to be any
  practical difference between the KMSG_DUMP_KEXEC and KMSG_DUMP_PANIC
  events.

So should we simply get rid of the extra KMSG_DUMP_KEXEC event?
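
To illustrate: a dumper like ramoops or mtdoops effectively treats the
two reasons with one and the same write path, roughly as in the sketch
below (an illustration of the behavior, not the actual driver source;
write_log_to_persistent_store() is a hypothetical helper).

---------------------------------------------------------------------
static void oops_do_dump(struct kmsg_dumper *dumper,
			 enum kmsg_dump_reason reason,
			 const char *s1, unsigned long l1,
			 const char *s2, unsigned long l2)
{
	if (reason != KMSG_DUMP_OOPS &&
	    reason != KMSG_DUMP_PANIC &&
	    reason != KMSG_DUMP_KEXEC)
		return;

	/* identical persistence path for all three reasons */
	write_log_to_persistent_store(s1, l1, s2, l2);
}
---------------------------------------------------------------------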

>
>
> > - One can anyway call kmsg_dump() outside crash_kexec(), before it, so
> >   that the kmsg_dump notification goes out before kdump gets control.
> >   What I am arguing here is that it is not necessarily a very good idea,
> >   because external modules can try to do any number of unsafe actions
> >   once we export the hook.
>
> I wrote above why I don't think this added any new risk (in short, there
> are already plenty of other ways in). Can you please tell me your view
> of my point? I'm afraid I haven't understood your worry yet, and I would
> like to understand it before making an alternative fix.

Once you provide a hook to modules, they can try doing anything. As
long as modules keep it really simple, it is probably fine. But what if
they start doing writes to a regular file system or device in an attempt
to save the kernel logs, or even more data? That can lead to hard
lockups or further corruption, possibly resulting in kdump not even
getting a chance to run.

>
> > Doing this is still fine if kdump is not configured, as the system would
> > have rebooted anyway. But if kdump is configured, then we are just
> > reducing the reliability of the operation by passing control into the
> > hands of unaudited code and trusting it when kernel data structures are
> > corrupt.
>
> At minimum, I fully agree that we need a reliable kdump. I am only
> expressing doubt about whether removing this hook is a real solution.
>
> > So to me, sending out kmsg_dump notifications is perfectly fine when
> > kdump is not configured. But if it is configured, then it probably is
> > not a good idea. Anyway, if you have configured the system to capture
> > the full dump, why do you also need kmsg_dump? And if you are happy
> > with kmsg_dump(), then you don't need kdump. So the two seem to be
> > mutually exclusive anyway.
>
> Honestly, I haven't heard of anyone using both at the same time. But I
> can guess some reasons. 1) If the system is a very big box, kdump is a
> really slooooooow operation. For example, some stock exchange systems
> have half a terabyte of memory, which means dump retrieval can take half
> a day in the worst case, yet the market has to open at 9:00 the next
> morning. So having summary information (e.g. log and trace information)
> saved separately is important. 2) Two crashes in sequence (i.e. crash,
> kdump, reboot starts, next crash before the reboot finishes) can
> overwrite the former dump image. Usually the admin _guesses_ that the
> two crashes have the same cause and reports so to his boss, but
> unfortunately customers in the high-end area often refuse a _guessed_
> report.
>
> Or it's for business competition reasons. As far as I have heard, IBM
> and HP UNI*X systems can save the logs both to the dump and to a special
> flash device.
>
> PS: FWIW, the Hitachi folks have a usage idea for their enterprise
> purposes, but unfortunately I don't know the details. I hope someone can
> tell us.

If we keep the kmsg dump drivers simple, I guess things will be just
fine in most cases. So, as I said in the other mail, I am fine with kmsg
notifications being sent before crash_kexec(). At least we can audit
in-kernel driver code; the bigger worry is vendor-specific out-of-tree
modules. But I think we can live with that, and down the line possibly
provide a tunable to change this behavior if it becomes an issue.

So, until and unless we have a very clear reason for differentiating
between the KEXEC and PANIC events, let's get rid of it. So far I have
not been able to understand the difference between the two.

Thanks
Vivek

2011-02-04 15:00:50

by Vivek Goyal

[permalink] [raw]
Subject: Re: Query about kdump_msg hook into crash_kexec()

On Thu, Feb 03, 2011 at 02:20:53PM +0900, KOSAKI Motohiro wrote:
> > AFAIK, kexec is used as a sneaky way of rebooting when the system faces
> > an unexpected scenario on some devices. (Some embedded systems run for a
> > very long time, so they can't avoid memory bit corruption; a full reset
> > is a last resort, and the vendor gathers the logs at a periodic
> > check-up.)
> >
> > The main purpose of introducing KMSG_DUMP_KEXEC was to separate it from
> > KMSG_DUMP_PANIC. In the initial kmsg_dump() patch, KMSG_DUMP_PANIC was
> > always called whether kdump was configured or not. But it is no good if
> > the same log appears both when kexec succeeded and when it failed.
> > Moreover, some people don't want any log in the kexec phase; they only
> > want logs on the real panic (i.e. kexec failure) path. So I separated it
> > into two. Two separate reason arguments can satisfy both of these
> > requirements.
>
> A bit of additional explanation: the original patch had the
> kmsg_dump(KMSG_DUMP_PANIC) call site at the following point. I didn't
> think that made either the embedded or the kdump folks happy, so I moved
> it after crash_kexec().
>
>
> ---------------------------------------------------------------------
> @@ -74,6 +75,7 @@ NORET_TYPE void panic(const char * fmt, ...)
>  	dump_stack();
>  #endif
>  
> +	kmsg_dump(KMSG_DUMP_PANIC);
>  	/*
>  	 * If we have crashed and we have a crash kernel loaded let it handle
>  	 * everything else.
>  	 * Do we want to call this before we try to display a message?
>  	 */
>  	crash_kexec(NULL);
> ---------------------------------------------------------------------

And I think that, to compensate for that, somebody introduced an
additional kmsg_dump(KMSG_DUMP_KEXEC) call inside crash_kexec() and put
it under a CONFIG option, so that one can change the behavior based on
config options.

I think this makes the logic somewhat twisted, and it is an unnecessary
call inside crash_kexec(). So, until and unless there is a strong
reason, we can get rid of the KEXEC event, move the kmsg_dump() call
before crash_kexec() for now, and see how it goes, IMHO.

Vivek

2011-02-08 16:47:28

by Vivek Goyal

[permalink] [raw]
Subject: Re: Query about kdump_msg hook into crash_kexec()

On Thu, Feb 03, 2011 at 05:08:01PM -0500, Seiji Aguchi wrote:
> Hi Eric,
>
> Thank you for your prompt reply.
>
> I would like to consider "Needs in enterprise area" and "Implementation of kmsg_dump()" separately.
>
> (1) Needs in enterprise area
> In case of kdump failure, we would like to store the kernel buffer to
> NVRAM/flash memory for detecting the root cause of the kernel crash.
>
> (2) Implementation of kmsg_dump
> You suggest reviewing/testing the code of kmsg_dump() more.
>
> What do you think about (1)?
> Is it acceptable to you?

OK, I am just thinking out loud about this problem, to see if something
fruitful comes out that paves the way forward.

- So ideally we would like kdump_msg() to be called after crash_kexec(),
  so that unaudited, unreliable calls (third-party modules) do not
  compromise the reliability of the kdump operation.

  But the Hitachi folks seem to want to save at least the kernel buffers
  somewhere in NVRAM etc., because they think that kdump can be
  unreliable and we might not capture any information after the crash.
  So they want two mechanisms in place: a lightweight one which tries to
  save the kernel buffers in NVRAM, and a heavyweight one which tries to
  save the entire/filtered kernel core.

  Personally, I am not too excited about the idea, but I guess I can
  live with it. We can try to audit at least the in-kernel modules; for
  external modules we don't have much control, and we live with the fact
  that if a module screws up, we don't capture the dump.

  Those who don't want this behavior can do three things:

  - Disable kdump_msg() at compile time.
  - Do not load any module which registers for kdump_msg().
  - Implement a /proc tunable which allows controlling this behavior.

- OK, having said why we want it, next comes the question of how to do
  it so that it works reasonably well.

- There seems to be one common requirement between kmsg_dump() and
  kdump(), and that is to stop the other cpus reliably (using NMI if
  possible). Can we try to share this code between kmsg_dump() and
  crash_kexec()? Something like the following:

  - panic happens
  - Do all the activities related to printing the panic string and
    stack dump.
  - Stop other cpus.
    - This can probably be done with the equivalent of the
      machine_crash_shutdown() function. In fact, this function can
      probably be broken down into two parts: the first part does
      shutdown_prepare(), where all the other cpus are shot down, and
      the second part does the actual disabling of the LAPIC/IOAPIC,
      saving of cpu registers, etc.

  if (mutex_trylock(some_shutdown_mutex)) {
          /* set up regs, fix vmcoreinfo etc */
          crash_kexec_prepare();
          machine_shutdown_prepare();
          kdump_msg();
          crash_kexec_execute();
          /* Also call panic_notifier_list here? */
  }

  crash_kexec_prepare() {
          crash_setup_regs(&fixed_regs, regs);
          crash_save_vmcoreinfo();
  }

  crash_kexec_execute() {
          /* Shut down the lapic/ioapic, save this cpu's registers, etc */
          machine_shutdown();
          machine_kexec();
  }

So basically we break the machine_shutdown() function down into two
parts and start sharing the common part between kdump_msg(),
crash_kexec() and possibly the panic notifiers.

If kdump is not configured, then after executing kdump_msg() and the
panic notifiers, we should either sit in a tight loop with interrupts
enabled, waiting for somebody to reboot the machine, or reboot the
system upon expiry of panic_timeout.

Eric, does it make sense to you?

Thanks
Vivek




2011-02-08 17:35:23

by Eric W. Biederman

[permalink] [raw]
Subject: Re: Query about kdump_msg hook into crash_kexec()

Vivek Goyal <[email protected]> writes:

> On Thu, Feb 03, 2011 at 05:08:01PM -0500, Seiji Aguchi wrote:
[..]
>
> OK, I am just thinking out loud about this problem, to see if something
> fruitful comes out that paves the way forward.
>
> - So ideally we would like kdump_msg() to be called after crash_kexec(),
>   so that unaudited, unreliable calls (third-party modules) do not
>   compromise the reliability of the kdump operation.
>
>   But the Hitachi folks seem to want to save at least the kernel buffers
>   somewhere in NVRAM etc., because they think that kdump can be
>   unreliable and we might not capture any information after the crash.
>   So they want two mechanisms in place: a lightweight one which tries to
>   save the kernel buffers in NVRAM, and a heavyweight one which tries to
>   save the entire/filtered kernel core.
>
>   Personally, I am not too excited about the idea, but I guess I can
>   live with it. We can try to audit at least the in-kernel modules; for
>   external modules we don't have much control, and we live with the fact
>   that if a module screws up, we don't capture the dump.
>
>   Those who don't want this behavior can do three things:
>
>   - Disable kdump_msg() at compile time.
>   - Do not load any module which registers for kdump_msg().
>   - Implement a /proc tunable which allows controlling this behavior.
>
> - OK, having said why we want it, next comes the question of how to do
>   it so that it works reasonably well.
>
> - There seems to be one common requirement between kmsg_dump() and
>   kdump(), and that is to stop the other cpus reliably (using NMI if
>   possible). Can we try to share this code between kmsg_dump() and
>   crash_kexec()? Something like the following:
>
>   - panic happens
>   - Do all the activities related to printing the panic string and
>     stack dump.
>   - Stop other cpus.
>     - This can probably be done with the equivalent of the
>       machine_crash_shutdown() function. In fact, this function can
>       probably be broken down into two parts: the first part does
>       shutdown_prepare(), where all the other cpus are shot down, and
>       the second part does the actual disabling of the LAPIC/IOAPIC,
>       saving of cpu registers, etc.
>
>   if (mutex_trylock(some_shutdown_mutex)) {
>           /* set up regs, fix vmcoreinfo etc */
>           crash_kexec_prepare();
>           machine_shutdown_prepare();
>           kdump_msg();
>           crash_kexec_execute();
>           /* Also call panic_notifier_list here? */
>   }
>
>   crash_kexec_prepare() {
>           crash_setup_regs(&fixed_regs, regs);
>           crash_save_vmcoreinfo();
>   }
>
>   crash_kexec_execute() {
>           /* Shut down the lapic/ioapic, save this cpu's registers, etc */
>           machine_shutdown();
>           machine_kexec();
>   }
>
> So basically we break the machine_shutdown() function down into two
> parts and start sharing the common part between kdump_msg(),
> crash_kexec() and possibly the panic notifiers.
>
> If kdump is not configured, then after executing kdump_msg() and the
> panic notifiers, we should either sit in a tight loop with interrupts
> enabled, waiting for somebody to reboot the machine, or reboot the
> system upon expiry of panic_timeout.
>
> Eric, does it make sense to you?

kexec on panic doesn't strictly require that we stop other cpus.

What makes sense to me at this point is for someone on the kmsg_dump
side to make a strong case that the code actually works in a crash dump
scenario. We have lots of experience over the years that says a design
like kmsg_dump is attractive but turns out to be an unreliable piece of
junk that fails just when you need it, because developers only test the
case where the kernel is happy, and because people share code with the
regular-path drivers, and that code assumes things are happy.
I forget exactly why, but last I looked,

	local_irq_disable()
	kmsg_dump()
	local_irq_enable()

was a recipe for disaster, and you have to be at least that good to even
have a chance of working in a crash dump scenario.

In part I am puzzled why kmsg dump doesn't just use the printk
interface. Strangely enough, printk works in the event of a crash and
has been shown to be reliable over the years.

Eric

2011-02-08 19:28:30

by Vivek Goyal

[permalink] [raw]
Subject: Re: Query about kdump_msg hook into crash_kexec()

On Tue, Feb 08, 2011 at 09:35:15AM -0800, Eric W. Biederman wrote:
> Vivek Goyal <[email protected]> writes:
>
[..]
>
> kexec on panic doesn't strictly require that we stop other cpus.

Yes, but it is desirable.

- We don't want cpus to be scribbling on the old kernel's memory, and
  possibly on the new kernel's memory as well in the case of a corrupted
  pointer, and crashing the freshly booted kernel (the new kernel's
  memory is mapped in the old kernel).

- We don't want other cpus to complete panic() and jump into the BIOS,
  or lead to some kind of triple fault that resets the system, etc.

So that suggests that, to be robust, stopping the other cpus is
required.

On a side note, the kdump_msg() hook is present in emergency_restart()
too. So if these paths are not properly designed, the system might not
even reboot automatically, even if a panic_timeout has been specified.

>
> What makes sense to me at this point is for someone on the kmsg_dump
> side to make a strong case that the code actually works in a crash dump
> scenario. We have lots of experience over the years that says a design
> like kmsg_dump is attractive but turns out to be an unreliable piece of
> junk that fails just when you need it, because developers only test the
> case where the kernel is happy, and because people share code with the
> regular-path drivers, and that code assumes things are happy.
>
> I forget exactly why, but last I looked,
>
> 	local_irq_disable()
> 	kmsg_dump()
> 	local_irq_enable()

I agree that the kdump_msg() code should, at the least, be able to work
with interrupts disabled.

There seem to be two pieces to it: one is the generic call which invokes
all the dumpers, and then there are the respective dumpers themselves.

Looking at the generic dumping call, I can't see why it would need
interrupts to be enabled. There is one spin_lock() and then an
rcu_read_lock(). That's it.

Regarding mtdoops, it is hard to tell; there is a lot of code. But the
good thing is that they have introduced a separate write path for panic
context. That way one can at least special-case the panic path to avoid
taking locks and not depend on interrupts.

ramoops seems to be simple. It seems to be just a memcpy(), apart from a
do_gettimeofday(). I noticed that in the past you raised a concern about
the usage of do_gettimeofday(), but I am not sure what the concern is
here (silly question, I guess).

So to me, ramoops seems to be a simple memcpy (at least in principle),
and mtdoops has a dedicated path for handling panic context, so at least
possible issues there are fixable. The generic code seems harmless to me
at this point in time.
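
As a rough illustration of what such a panic-safe write path looks like,
the pattern is to never block and never depend on interrupts: try the
lock, and if it is already held, give up rather than deadlock. The
sketch below is written under those assumptions with hypothetical names
(my_buf, my_lock); it is not the actual mtdoops/ramoops source.

---------------------------------------------------------------------
static DEFINE_SPINLOCK(my_lock);
static char my_buf[4096];

static void my_safe_dump(struct kmsg_dumper *dumper,
			 enum kmsg_dump_reason reason,
			 const char *s1, unsigned long l1,
			 const char *s2, unsigned long l2)
{
	unsigned long n1, n2;

	/* In panic context another cpu may hold the lock forever:
	 * trylock and bail out rather than spin. */
	if (!spin_trylock(&my_lock))
		return;

	/* plain memcpy into a preallocated buffer; nothing here can
	 * sleep or take further locks */
	n1 = min(l1, (unsigned long)sizeof(my_buf));
	memcpy(my_buf, s1, n1);
	n2 = min(l2, (unsigned long)sizeof(my_buf) - n1);
	memcpy(my_buf + n1, s2, n2);

	spin_unlock(&my_lock);
}
---------------------------------------------------------------------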

>
> was a recipe for disaster, and you have to be at least that good to even
> have a chance of working in a crash dump scenario.
>
> In part I am puzzled why kmsg dump doesn't just use the printk
> interface. Strangely enough, printk works in the event of a crash and
> has been shown to be reliable over the years.

Are you suggesting implementing these things as a console driver and
registering with printk as a console? Sounds interesting. I think one
of the issues they will probably find is that they don't want to log
everything; they want to do selective logging. I am not sure how they
would get that info.
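
Something like the following is what I understand by that: a sketch
against the real register_console() API, where my_nvram_append() is a
hypothetical helper. Note that the write() callback only sees formatted
text, with no loglevel or record boundaries, which is exactly the
selective-logging problem above.

#include <linux/console.h>

static void nvram_con_write(struct console *con, const char *s,
			    unsigned int count)
{
	my_nvram_append(s, count);	/* hypothetical NVRAM helper */
}

static struct console nvram_console = {
	.name	= "nvcon",
	.write	= nvram_con_write,
	.flags	= CON_ENABLED | CON_PRINTBUFFER,
	.index	= -1,
};

static int __init nvram_console_init(void)
{
	/* CON_PRINTBUFFER replays the existing log buffer at
	 * registration time, so earlier messages are captured too. */
	register_console(&nvram_console);
	return 0;
}
core_initcall(nvram_console_init);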

In some cases, like emergency_restart(), there is no printk() and they
just consider it one of the events on which to dump the kernel buffer
contents.

Thanks
Vivek

2011-02-08 19:58:16

by Eric W. Biederman

[permalink] [raw]
Subject: Re: Query about kdump_msg hook into crash_kexec()

Vivek Goyal <[email protected]> writes:

> On Tue, Feb 08, 2011 at 09:35:15AM -0800, Eric W. Biederman wrote:
>> Vivek Goyal <[email protected]> writes:
>>
>> > On Thu, Feb 03, 2011 at 05:08:01PM -0500, Seiji Aguchi wrote:
>> >> Hi Eric,
>> >>
>> >> Thank you for your prompt reply.
>> >>
>> >> I would like to consider "Needs in enterprise area" and "Implementation of kmsg_dump()" separately.
>> >>
>> >> (1) Needs in enterprise area
>> >> In case of kdump failure, we would like to store the kernel buffer
>> >> to NVRAM/flash memory for detecting the root cause of the kernel
>> >> crash.
>> >>
>> >> (2) Implementation of kmsg_dump
>> >> You suggest reviewing/testing the coding of kmsg_dump() more.
>> >>
>> >> What do you think about (1)?
>> >> Is it acceptable for you?
>> >
>> > Ok, I am just trying to think out loud about this problem and see if
>> > something fruitful comes out which paves the way forward.
>> >
>> > - So ideally we would like kdump_msg() to be called after
>> > crash_kexec(), so that any unaudited (third-party modules),
>> > unreliable calls do not compromise the reliability of the kdump
>> > operation.
>> >
>> > But the Hitachi folks seem to want to save at least the kernel
>> > buffers somewhere in NVRAM etc., because they think that kdump can
>> > be unreliable and we might not capture any information after the
>> > crash. So they kind of want two mechanisms in place: a lightweight
>> > one which tries to save the kernel buffers in NVRAM, and a
>> > heavyweight one which tries to save the entire/filtered kernel core.
>> >
>> > Personally I am not too excited about the idea, but I guess I can
>> > live with it. We can try to audit at least the in-kernel modules;
>> > for external modules we don't have much control, and we live with
>> > the fact that if the modules screw up, we don't capture the dump.
>> >
>> > Those who don't want this behavior can do three things.
>> >
>> > - Disable kdump_msg() at compile time.
>> > - Do not load any module which registers for kdump_msg()
>> > - Implement a /proc tunable which allows controlling this
>> > behavior.
>> >
>> > - Ok, having said why we want it, comes the question of how to
>> > do it so that it works reasonably well.
>> >
>> > - There seems to be one common requirement of kmsg_dump() and
>> > kdump(), and that is to stop the other cpus reliably (using NMI if
>> > possible). Can we try to share this code between kmsg_dump() and
>> > crash_kexec()? So, something like the following.
>> >
>> > - panic happens
>> > - Do all the activities related to printing the panic string
>> > and the stack dump.
>> > - Stop other cpus.
>> > - This can probably be done with the equivalent of the
>> > machine_crash_shutdown() function. In fact this function
>> > can probably be broken down into two parts. The first part
>> > does shutdown_prepare(), where all other cpus are shot
>> > down, and the second part can do the actual disabling of
>> > LAPIC/IOAPIC and saving of cpu registers etc.
>> >
>> > if (mutex_trylock(some_shutdown_mutex)) {
>> > /* set up regs, fix vmcoreinfo etc */
>> > crash_kexec_prepare();
>> > machine_shutdown_prepare();
>> > kdump_msg();
>> > crash_kexec_execute()
>> > /* Also call panic_notifier_list here ? */
>> > }
>> >
>> > crash_kexec_prepare () {
>> > crash_setup_regs(&fixed_regs, regs);
>> > crash_save_vmcoreinfo();
>> > }
>> >
>> > crash_kexec_execute() {
>> > /* Shutdown lapic/ioapic, save this cpu register etc */
>> > machine_shutdown();
>> > machine_kexec()
>> > }
>> >
>> > So basically we break the machine_shutdown() function down into two
>> > parts and start sharing the common part between kdump_msg(),
>> > crash_kexec() and possibly the panic notifiers.
>> >
>> > If kdump is not configured, then after executing kdump_msg() and the
>> > panic notifiers, we should either be sitting in a tight loop with
>> > interrupts enabled, waiting for somebody to press Ctrl-boot, or
>> > reboot the system upon lapse of panic_timeout().
>> >
>> > Eric, does it make sense to you?
>>
>> kexec on panic doesn't strictly require that we stop other cpus.
>
> Yes but it is desirable.
>
> - We don't want other cpus scribbling on the old memory, and possibly
> on the new kernel's memory too in the case of a corrupted pointer,
> crashing the freshly booted kernel (the new kernel's memory is mapped
> in the old kernel).
>
> - We don't want other cpus to complete panic() and jump to the BIOS,
> or lead to some kind of triple fault and reset the system, etc.
>
> So that suggests that, to be robust, stopping the other cpus is
> required.
>
> On a side note, the kdump_msg() hook is present in emergency_restart()
> too. So if these paths are not properly designed, the system might not
> even reboot automatically, even when panic_timeout() has been
> specified.

Ouch! Ouch! Ouch!

emergency_restart is just supposed to be the restart bits, so
that it always works! We wouldn't need to use emergency restart
if we could trust the normal hardware shutdown code. Sigh.

I guess this is another issue I have with the design of kmsg_dump.
Instead of being a good citizen like everyone else and placing its
hooks explicitly where they can be seen, kmsg_dump buried its hooks
inside other routines, changing the semantics of those routines.

What makes this scary is that the modified routines are deliberately
simple, so that they are robust and can be audited, and now on those
code paths we have hooks so big that everyone gives up before reading
and understanding all of the code involved.
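
For reference, this is roughly what the buried hook looks like in
kernel/sys.c as of the 2.6.37-era tree (before the hook, the function
body was nothing but the machine restart call):

void emergency_restart(void)
{
	kmsg_dump(KMSG_DUMP_EMERG);
	machine_emergency_restart();
}
EXPORT_SYMBOL_GPL(emergency_restart);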

>> What makes sense to me at this point is for someone on the kmsg_dump
>> side to make a strong case that the code actually works in a crash dump
>> scenario. We have lots of experience over the years that says a design
>> like kmsg_dump is attractive but turns out to be an unreliable piece of
>> junk that fails just when you need it, because developers only test
>> the case when the kernel is happy, and because people share code with
>> the regular-path drivers, and that code assumes things are happy.
>>
>> I forget exactly why, but last I looked:
>> local_irq_disable()
>> kmsg_dump()
>> local_irq_enable()
>
> I agree that the kdump_msg() code should be able to work with
> interrupts disabled, at least.
>
> There seem to be two pieces to it. One is the generic call which
> invokes all the dumpers, and then there are the respective dumpers.
>
> Looking at the generic dumping call, I can't see why it needs
> interrupts to be enabled. There is one spin_lock() and then
> rcu_read_lock(). That's it.
>
> Regarding mtdoops, it is hard to tell; there is a lot of code. But the
> good thing is that they have introduced a separate write path for panic
> context. That way one can at least do special casing in the panic path
> to avoid taking locks and not depend on interrupts.

I think that is part of my problem. There is a lot of code, so it is
really hard to audit and really hard to trust. I'm still leery of the
amount of code on the kexec-on-panic code path.

> ramoops seems to be simple. It seems to be just memcpy() except for
> do_gettimeofday(). I noticed that in the past you raised a concern
> about the usage of do_gettimeofday(), but I am not sure what the
> concern is here (silly question, I guess).
>
> So to me, ramoops seems to be a simple memcpy (at least in principle)
> and mtdoops has a dedicated path for handling panic context, so at
> least it is fixable for possible issues. The generic code seems
> harmless to me at this point in time.
>
>>
>> Was a recipe for disaster, and you have to be at least that good to even
>> have a chance of working in a crash dump scenario.
>>
>> In part I am puzzled why the kmsg dump doesn't just use the printk
>> interface. Strangely enough printk works in the event of a crash and
>> has been shown to be reliable over the years.
>
> Are you suggesting implementing these things as a console driver and
> registering with printk as a console? Sounds interesting. I think one
> of the issues they will probably find is that they don't want to log
> everything; they want to do selective logging. I am not sure how they
> would get that info.
>
> In some cases, like emergency_restart(), there is no printk() and they
> just consider it one of the events on which to dump the kernel buffer
> contents.

Sorry. I am tired, and I'm really not ready to deal with yet another
mechanism like this, one that is over-designed and was flaky the first
time I looked at it. That flakiness, combined with the authors not
stepping forward and rushing off to fix things or telling me they have
fixed things, means that I don't have any confidence in the mechanism.

Vivek if you want to step forward and be one of the people taking care
of kmsg_dump, then this conversation might have a point.

What I know for certain is that kmsg_dump and kexec on panic are
used in pretty exclusive ways, so getting the kmsg_dump code out
of crash_kexec does not hurt anyone today.

Right now I can't even get 2.6.38-rc4 to compile a reasonably sized
program, so I see a lot worse regressions to deal with than something
I can just avoid for now.

Eric

2011-03-08 01:31:39

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: Query about kdump_msg hook into crash_kexec()

I'm sorry I've missed this mail for so long.

> > ---------------------------------------------------------------------
> > @@ -74,6 +75,7 @@ NORET_TYPE void panic(const char * fmt, ...)
> > dump_stack();
> > #endif
> >
> > + kmsg_dump(KMSG_DUMP_PANIC);
> > /*
> > * If we have crashed and we have a crash kernel loaded let it handle
> > * everything else.
> > * Do we want to call this before we try to display a message?
> > */
> > crash_kexec(NULL);
> > ---------------------------------------------------------------------
>
> And I think to compensate for that somebody introduced an additional
> kmsg_dump(KMSG_DUMP_KEXEC) call inside crash_kexec() and put it under
> a CONFIG option so that one can change the behavior based on config
> options.
>
> I think this makes the logic somewhat twisted and an unnecessary call
> inside crash_kexec(). So until and unless there is a strong reason, we
> can get rid of the KEXEC event, move the kmsg_dump() call before
> crash_kexec() for now, and see how it goes, IMHO.

I think I can agree with your proposal. But could you please explain
why you think kmsg _before_ kdump and kmsg _in_ kdump are so different?
I think it is only a C-level difference. The CPU doesn't care about C
function boundaries, and the kernel calls kmsg_dump() before invoking
the second kernel either way, even with your proposal applied.

I'm just curious; I'm not against your proposal.
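
For concreteness, the two placements we are comparing look roughly like
this (sketched from a 2.6.37-era tree, abbreviated):

NORET_TYPE void panic(const char * fmt, ...)
{
	/* ... print the panic message, dump_stack() ... */
	kmsg_dump(KMSG_DUMP_PANIC);		/* kmsg before kdump */
	crash_kexec(NULL);
	/* ... panic_timeout handling, blink loop ... */
}

void crash_kexec(struct pt_regs *regs)
{
	if (mutex_trylock(&kexec_mutex)) {
		if (kexec_crash_image) {
			struct pt_regs fixed_regs;
			kmsg_dump(KMSG_DUMP_KEXEC);	/* kmsg in kdump */
			crash_setup_regs(&fixed_regs, regs);
			crash_save_vmcoreinfo();
			machine_crash_shutdown(&fixed_regs);
			machine_kexec(kexec_crash_image);
		}
		mutex_unlock(&kexec_mutex);
	}
}

Either way the dump runs before machine_kexec() starts the second
kernel; the visible difference is only whether it runs under
kexec_mutex, after kdump has committed to taking over.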

Thanks.