2009-10-27 08:39:29

by Jin Dongming

[permalink] [raw]
Subject: [PATCH] [RFC][Patch x86-tip] add notifier before kdump

When the kernel is going on panic, there are three groups of people
who want to do something for themselves at that time:
1. Only do the work for notifying
2. Only do the work for kdump
3. Want to do both of the above, notifying and kdump

The first and second have been done well on current kernel, but we
can not satisfy 3rd when kdump is used, because notify_die() is located
after crash_kexec(). So I think we can do some work to make it working
well. And I am also the person who stand in the third group.

In our PC-cluster, there are two nodes working together, one is running
and the other one is on standby mode. When the running one is going
on panic, we hope the works listed as following would be done:
1. Before the running kernel is going on panic, the node on standby
mode should be notified firstly.
2. After the notified work is done, the panic kernel startup on the
second kernel to get kdump.
But the current kernel could not do them all.

For resolving this problem,I made this patch.
In my patch, because I don't want to give impact to both of all, notifying
and kdump, I do following work:
1. I add crash_notifier instead of the existing notify_die().
So the original functionality of notify_die() will not be changed. I
made all of the work done just before kdump will be got.
2. I add a switch to turn on/off crash_notifier.
This switch could be turned on/off according to your request. For
example, when kdump is only required to be done, the crash_notifier is
not need any more, so the switch could be turned off.
Also there will be not impact to the second kernel. Because it is valid
only when the flag for crash_notifier is set via a parameter of kernel.

This patch is not tested on SH and Power PC.

Signed-off-by: Akiyama, Nobuyuki <[email protected]>
Signed-off-by: Jin Dongming <[email protected]>
---
include/linux/kexec.h | 2 ++
kernel/kexec.c | 24 ++++++++++++++++++++++++
2 files changed, 26 insertions(+), 0 deletions(-)

diff --git a/include/linux/kexec.h b/include/linux/kexec.h
index adc34f2..c50158c 100644
--- a/include/linux/kexec.h
+++ b/include/linux/kexec.h
@@ -9,6 +9,7 @@
#include <linux/ioport.h>
#include <linux/elfcore.h>
#include <linux/elf.h>
+#include <linux/notifier.h>
#include <asm/kexec.h>

/* Verify architecture specific macros are defined */
@@ -158,6 +159,7 @@ unsigned long paddr_vmcoreinfo_note(void);

extern struct kimage *kexec_image;
extern struct kimage *kexec_crash_image;
+extern struct raw_notifier_head crash_notifier_list;

#ifndef kexec_flush_icache_page
#define kexec_flush_icache_page(page)
diff --git a/kernel/kexec.c b/kernel/kexec.c
index f336e21..147f22d 100644
--- a/kernel/kexec.c
+++ b/kernel/kexec.c
@@ -31,6 +31,8 @@
#include <linux/cpu.h>
#include <linux/console.h>
#include <linux/vmalloc.h>
+#include <linux/module.h>
+#include <linux/notifier.h>

#include <asm/page.h>
#include <asm/uaccess.h>
@@ -38,6 +40,10 @@
#include <asm/system.h>
#include <asm/sections.h>

+/* crash notifier list */
+RAW_NOTIFIER_HEAD(crash_notifier_list);
+EXPORT_SYMBOL(crash_notifier_list);
+
/* Per cpu memory for storing cpu states in case of system crash. */
note_buf_t* crash_notes;

@@ -1060,6 +1066,14 @@ asmlinkage long compat_sys_kexec_load(unsigned long entry,
}
#endif

+static int crash_notifier_valid __read_mostly;
+
+static void notify_crash(void)
+{
+ if (crash_notifier_valid)
+ raw_notifier_call_chain(&crash_notifier_list, 0, 0);
+}
+
void crash_kexec(struct pt_regs *regs)
{
/* Take the kexec_mutex here to prevent sys_kexec_load
@@ -1076,6 +1090,7 @@ void crash_kexec(struct pt_regs *regs)
crash_setup_regs(&fixed_regs, regs);
crash_save_vmcoreinfo();
machine_crash_shutdown(&fixed_regs);
+ notify_crash();
machine_kexec(kexec_crash_image);
}
mutex_unlock(&kexec_mutex);
@@ -1502,3 +1517,12 @@ int kernel_kexec(void)
mutex_unlock(&kexec_mutex);
return error;
}
+
+static int __init kexec_crash_notifier_valid(char *str)
+{
+ crash_notifier_valid = 1;
+
+ return 1;
+}
+
+__setup("crash_notifier", kexec_crash_notifier_valid);
--
1.6.2.2


2009-10-27 15:07:24

by Vivek Goyal

[permalink] [raw]
Subject: Re: [PATCH] [RFC][Patch x86-tip] add notifier before kdump

On Tue, Oct 27, 2009 at 05:39:40PM +0900, Jin Dongming wrote:
> When the kernel is going on panic, there are three groups of people
> who want to do something for themselves at that time:
> 1. Only do the work for notifying
> 2. Only do the work for kdump
> 3. Want to do both of the above, notifying and kdump
>
> The first and second have been done well on current kernel, but we
> can not satisfy 3rd when kdump is used, because notify_die() is located
> after crash_kexec(). So I think we can do some work to make it working
> well. And I am also the person who stand in the third group.
>
> In our PC-cluster, there are two nodes working together, one is running
> and the other one is on standby mode. When the running one is going
> on panic, we hope the works listed as following would be done:
> 1. Before the running kernel is going on panic, the node on standby
> mode should be notified firstly.
> 2. After the notified work is done, the panic kernel startup on the
> second kernel to get kdump.
> But the current kernel could not do them all.
>
> For resolving this problem,I made this patch.
> In my patch, because I don't want to give impact to both of all, notifying
> and kdump, I do following work:
> 1. I add crash_notifier instead of the existing notify_die().
> So the original functionality of notify_die() will not be changed. I
> made all of the work done just before kdump will be got.
> 2. I add a switch to turn on/off crash_notifier.
> This switch could be turned on/off according to your request. For
> example, when kdump is only required to be done, the crash_notifier is
> not need any more, so the switch could be turned off.
> Also there will be not impact to the second kernel. Because it is valid
> only when the flag for crash_notifier is set via a parameter of kernel.
>
> This patch is not tested on SH and Power PC.
>

I guess this might be 3rd or 4th attempt to get this kind of
infrastructure in kernel.

In the past exporting this kind of hook to modules has been rejected
becuase of concerns that modules might be doing too much in side a
crashed kernel and that can hang up the system completely and we can't
even capture the dump.

In the past two ways have been proposed to handle this situation.

- Handle it in second kernel. Especially in initrd. Put right
script/binary/tools and configuration in kdump initrd at the time of
configuration and once second kernel boots, initrd will first send the
kdump message out to other node(s). This can be helpful for fencing
scenario also.

- Other thing, I think Eric Biederman suggested was that we need to see
the code that will be executed after crash so that it can be audited and
keep it in kernel instead of blindly exporting a hook to module and let
module do whatever it want to do.

So my question is, how about introducing some infrastructure in kdump
initrd to send a notification out?

Thanks
Vivek

> Signed-off-by: Akiyama, Nobuyuki <[email protected]>
> Signed-off-by: Jin Dongming <[email protected]>
> ---
> include/linux/kexec.h | 2 ++
> kernel/kexec.c | 24 ++++++++++++++++++++++++
> 2 files changed, 26 insertions(+), 0 deletions(-)
>
> diff --git a/include/linux/kexec.h b/include/linux/kexec.h
> index adc34f2..c50158c 100644
> --- a/include/linux/kexec.h
> +++ b/include/linux/kexec.h
> @@ -9,6 +9,7 @@
> #include <linux/ioport.h>
> #include <linux/elfcore.h>
> #include <linux/elf.h>
> +#include <linux/notifier.h>
> #include <asm/kexec.h>
>
> /* Verify architecture specific macros are defined */
> @@ -158,6 +159,7 @@ unsigned long paddr_vmcoreinfo_note(void);
>
> extern struct kimage *kexec_image;
> extern struct kimage *kexec_crash_image;
> +extern struct raw_notifier_head crash_notifier_list;
>
> #ifndef kexec_flush_icache_page
> #define kexec_flush_icache_page(page)
> diff --git a/kernel/kexec.c b/kernel/kexec.c
> index f336e21..147f22d 100644
> --- a/kernel/kexec.c
> +++ b/kernel/kexec.c
> @@ -31,6 +31,8 @@
> #include <linux/cpu.h>
> #include <linux/console.h>
> #include <linux/vmalloc.h>
> +#include <linux/module.h>
> +#include <linux/notifier.h>
>
> #include <asm/page.h>
> #include <asm/uaccess.h>
> @@ -38,6 +40,10 @@
> #include <asm/system.h>
> #include <asm/sections.h>
>
> +/* crash notifier list */
> +RAW_NOTIFIER_HEAD(crash_notifier_list);
> +EXPORT_SYMBOL(crash_notifier_list);
> +
> /* Per cpu memory for storing cpu states in case of system crash. */
> note_buf_t* crash_notes;
>
> @@ -1060,6 +1066,14 @@ asmlinkage long compat_sys_kexec_load(unsigned long entry,
> }
> #endif
>
> +static int crash_notifier_valid __read_mostly;
> +
> +static void notify_crash(void)
> +{
> + if (crash_notifier_valid)
> + raw_notifier_call_chain(&crash_notifier_list, 0, 0);
> +}
> +
> void crash_kexec(struct pt_regs *regs)
> {
> /* Take the kexec_mutex here to prevent sys_kexec_load
> @@ -1076,6 +1090,7 @@ void crash_kexec(struct pt_regs *regs)
> crash_setup_regs(&fixed_regs, regs);
> crash_save_vmcoreinfo();
> machine_crash_shutdown(&fixed_regs);
> + notify_crash();
> machine_kexec(kexec_crash_image);
> }
> mutex_unlock(&kexec_mutex);
> @@ -1502,3 +1517,12 @@ int kernel_kexec(void)
> mutex_unlock(&kexec_mutex);
> return error;
> }
> +
> +static int __init kexec_crash_notifier_valid(char *str)
> +{
> + crash_notifier_valid = 1;
> +
> + return 1;
> +}
> +
> +__setup("crash_notifier", kexec_crash_notifier_valid);
> --
> 1.6.2.2
>

2009-10-27 16:16:52

by Lon Hohberger

[permalink] [raw]
Subject: Re: [PATCH] [RFC][Patch x86-tip] add notifier before kdump

On Tue, 2009-10-27 at 11:07 -0400, Vivek Goyal wrote:
> >
> > In our PC-cluster, there are two nodes working together, one is running
> > and the other one is on standby mode. When the running one is going
> > on panic, we hope the works listed as following would be done:
> > 1. Before the running kernel is going on panic, the node on standby
> > mode should be notified firstly.
> > 2. After the notified work is done, the panic kernel startup on the
> > second kernel to get kdump.
> > But the current kernel could not do them all.
> >

Ok, I'll admit at being naive as to how panicking kernels operate.

I do not understand how this could be safe from a cluster fencing
perspective. Effectively, you're allowing a "bad" kernel to continue to
do "something", when you should be allowing it to do "nothing".

This panicking kernel does "something" and the cluster presumably
initiates recovery /before/ the kdump kernel boots... i.e. with the old,
panicking kernel still present.

Shouldn't you at least wait until the kdump kernel boots before telling
a cluster that it is safe to begin recovery?

> > This patch is not tested on SH and Power PC.
> >
>
> I guess this might be 3rd or 4th attempt to get this kind of
> infrastructure in kernel.
>
> In the past exporting this kind of hook to modules has been rejected
> becuase of concerns that modules might be doing too much in side a
> crashed kernel and that can hang up the system completely and we can't
> even capture the dump.

Right.
- the hook can fail
- the hook could potentially be a poorly written one which tries to
access shared storage

Surely, booting the kdump kernel/env. might fail too - but it's no worse
than the notification hook failing. In both cases, you eventually time
out and fence off (or "STONITH") the failed node

I suspect doing things in a crashing kernel is more likely to fail than
doing things in a kdump-booted kernel...

> In the past two ways have been proposed to handle this situation.
>
> - Handle it in second kernel. Especially in initrd. Put right
> script/binary/tools and configuration in kdump initrd at the time of
> configuration and once second kernel boots, initrd will first send the
> kdump message out to other node(s). This can be helpful for fencing
> scenario also.

I think this is safer and more predictable: once the second kernel
boots, the panicked kernel is not in control any more. I suspect there
is a much higher degree of certainty around what the new kdump kernel
will do than what will happen in the panicked kernel with an added
'crashing' hook.

Waiting for kdump to boot is an unfortunate delay. However, the trade
off, I think, is more predictable, ordered failure recovery and
potentially less risk to data on shared storage (depending on what the
notify hook does).

-- Lon

2009-10-29 04:12:40

by Jin Dongming

[permalink] [raw]
Subject: Re: [PATCH] [RFC][Patch x86-tip] add notifier before kdump

Vivek, Lon Hohberger

Thanks for your comments.

I also agree with your opinion, too. But I still have problems listed
as following.
1. Nobody knows when the second kernel does not work well
2. Too much time cost to startup the second kernel

So I hope that following work could be done.
1. Add some code before kdump to monitor the actions of second kernel
Something we worried about is that nobody knows when the kdump will not
work well. If the second kernel does not work well, nobody knows when
the best time is to restart. So we need to add some code such as setting
watchdog. If we want to monitor second kernel, some work need to be done
before it starts to work.

2. Shorten the startup time of second kernel
I think that the purpose of second kernel is used for backup information
stored in memory to storage only. But the time cost is different
according to the system architecture. And also I think that there are
too many of device is not useful to get the memory information to
storage. I think the startup time of second kernel should be shortened.
I don't know much about second kernel, these are only my thought.

Thanks for your comments again.

Best regards,
Jin Dongming


Lon Hohberger wrote:
> On Tue, 2009-10-27 at 11:07 -0400, Vivek Goyal wrote:
>>> In our PC-cluster, there are two nodes working together, one is running
>>> and the other one is on standby mode. When the running one is going
>>> on panic, we hope the works listed as following would be done:
>>> 1. Before the running kernel is going on panic, the node on standby
>>> mode should be notified firstly.
>>> 2. After the notified work is done, the panic kernel startup on the
>>> second kernel to get kdump.
>>> But the current kernel could not do them all.
>>>
>
> Ok, I'll admit at being naive as to how panicking kernels operate.
>
> I do not understand how this could be safe from a cluster fencing
> perspective. Effectively, you're allowing a "bad" kernel to continue to
> do "something", when you should be allowing it to do "nothing".
>
> This panicking kernel does "something" and the cluster presumably
> initiates recovery /before/ the kdump kernel boots... i.e. with the old,
> panicking kernel still present.
>
> Shouldn't you at least wait until the kdump kernel boots before telling
> a cluster that it is safe to begin recovery?
>
>>> This patch is not tested on SH and Power PC.
>>>
>> I guess this might be 3rd or 4th attempt to get this kind of
>> infrastructure in kernel.
>>
>> In the past exporting this kind of hook to modules has been rejected
>> becuase of concerns that modules might be doing too much in side a

>> crashed kernel and that can hang up the system completely and we can't
>> even capture the dump.
>
> Right.
> - the hook can fail
> - the hook could potentially be a poorly written one which tries to
> access shared storage
>
> Surely, booting the kdump kernel/env. might fail too - but it's no worse
> than the notification hook failing. In both cases, you eventually time
> out and fence off (or "STONITH") the failed node
>
> I suspect doing things in a crashing kernel is more likely to fail than
> doing things in a kdump-booted kernel...
>
>> In the past two ways have been proposed to handle this situation.
>>
>> - Handle it in second kernel. Especially in initrd. Put right
>> script/binary/tools and configuration in kdump initrd at the time of
>> configuration and once second kernel boots, initrd will first send the
>> kdump message out to other node(s). This can be helpful for fencing
>> scenario also.
>
> I think this is safer and more predictable: once the second kernel
> boots, the panicked kernel is not in control any more. I suspect there
> is a much higher degree of certainty around what the new kdump kernel
> will do than what will happen in the panicked kernel with an added
> 'crashing' hook.
>
> Waiting for kdump to boot is an unfortunate delay. However, the trade
> off, I think, is more predictable, ordered failure recovery and
> potentially less risk to data on shared storage (depending on what the
> notify hook does).
>
> -- Lon
>
>
>

2009-10-29 05:32:04

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH] [RFC][Patch x86-tip] add notifier before kdump

Jin Dongming <[email protected]> writes:

> Vivek, Lon Hohberger
>
> Thanks for your comments.
>
> I also agree with your opinion, too. But I still have problems listed
> as following.
> 1. Nobody knows when the second kernel does not work well
> 2. Too much time cost to startup the second kernel
>
> So I hope that following work could be done.
> 1. Add some code before kdump to monitor the actions of second kernel
> Something we worried about is that nobody knows when the kdump will not
> work well. If the second kernel does not work well, nobody knows when
> the best time is to restart. So we need to add some code such as setting
> watchdog. If we want to monitor second kernel, some work need to be done
> before it starts to work.

Any extra work done in the crashing kernel decreases the likely hood
of the second kernel working. If you want a watchdog I recommend you always
run it and have the interval set large enough for the second kernel to boot
before it starts petting it. That way no code needs to run after a kernel crash.

> 2. Shorten the startup time of second kernel
> I think that the purpose of second kernel is used for backup information
> stored in memory to storage only. But the time cost is different
> according to the system architecture. And also I think that there are
> too many of device is not useful to get the memory information to
> storage. I think the startup time of second kernel should be shortened.
> I don't know much about second kernel, these are only my thought.

The second kernel can be anything you like. Including a standalone executable
if needed for very precise requirements. Any improvements in boot time for a
normal linux image should apply to the dump kernel.

Currently with a tight initrd and just dumping dmesg to disk I get away with
a 20M crashkernel reservation, and it typically runs and does it's work before
my watchdog fires.

Eric

2009-10-29 07:48:23

by Jin Dongming

[permalink] [raw]
Subject: Re: [PATCH] [RFC][Patch x86-tip] add notifier before kdump

Hi Eric

Thanks for your comment.
I will consider the method you suggested. Thank you very much.

Best regards,
Jin Dongming

Eric W. Biederman wrote:
> Jin Dongming <[email protected]> writes:
>
>> Vivek, Lon Hohberger
>>
>> Thanks for your comments.
>>
>> I also agree with your opinion, too. But I still have problems listed
>> as following.
>> 1. Nobody knows when the second kernel does not work well
>> 2. Too much time cost to startup the second kernel
>>
>> So I hope that following work could be done.
>> 1. Add some code before kdump to monitor the actions of second kernel
>> Something we worried about is that nobody knows when the kdump will not
>> work well. If the second kernel does not work well, nobody knows when
>> the best time is to restart. So we need to add some code such as setting
>> watchdog. If we want to monitor second kernel, some work need to be done
>> before it starts to work.
>
> Any extra work done in the crashing kernel decreases the likely hood
> of the second kernel working. If you want a watchdog I recommend you always
> run it and have the interval set large enough for the second kernel to boot
> before it starts petting it. That way no code needs to run after a kernel crash.
>
>> 2. Shorten the startup time of second kernel
>> I think that the purpose of second kernel is used for backup information
>> stored in memory to storage only. But the time cost is different
>> according to the system architecture. And also I think that there are
>> too many of device is not useful to get the memory information to
>> storage. I think the startup time of second kernel should be shortened.
>> I don't know much about second kernel, these are only my thought.
>
> The second kernel can be anything you like. Including a standalone executable
> if needed for very precise requirements. Any improvements in boot time for a
> normal linux image should apply to the dump kernel.
>
> Currently with a tight initrd and just dumping dmesg to disk I get away with
> a 20M crashkernel reservation, and it typically runs and does it's work before
> my watchdog fires.
>
> Eric
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>
>

2009-10-29 07:51:50

by Jin Dongming

[permalink] [raw]
Subject: Re: [PATCH] [RFC][Patch x86-tip] add notifier before kdump

Hi Vivek

Vivek Goyal wrote:
> - Other thing, I think Eric Biederman suggested was that we need to see
> the code that will be executed after crash so that it can be audited and
> keep it in kernel instead of blindly exporting a hook to module and let
> module do whatever it want to do.
>

I am sorry for that I could not catch the meaning of above words. In my
understanding:
If the code is already in current kernel and no matter whether it is built
or modulized, the code could be added before crash_kexec. Is it right?

If my understanding is not right, please let me ask a question before we
provide the code.
What kind of code is considered as nice code and could be added into
current kernel? Could you give me some requirements or examples
about nice code?

Best regards,
Jin Dongming

> So my question is, how about introducing some infrastructure in kdump
> initrd to send a notification out?
>
> Thanks
> Vivek

2009-10-29 18:43:50

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH] [RFC][Patch x86-tip] add notifier before kdump

Jin Dongming <[email protected]> writes:

> Hi Vivek
>
> Vivek Goyal wrote:
>> - Other thing, I think Eric Biederman suggested was that we need to see
>> the code that will be executed after crash so that it can be audited and
>> keep it in kernel instead of blindly exporting a hook to module and let
>> module do whatever it want to do.
>>
>
> I am sorry for that I could not catch the meaning of above words. In my
> understanding:
> If the code is already in current kernel and no matter whether it is built
> or modulized, the code could be added before crash_kexec. Is it right?

No. crash_kexec skips the calling the notifiers intentionally to increase
it's robustness.

> If my understanding is not right, please let me ask a question before we
> provide the code.
> What kind of code is considered as nice code and could be added into
> current kernel? Could you give me some requirements or examples
> about nice code?

In general code that is necessary. If we can find a way to avoid running
code in a broken kernel we should.

The recent addition of code to disable processor vitalization features
on a crash is an example of code that is necessary. A small function
that we can always call that is not provided by a module.

Eric