From: Cliff Wickman <[email protected]>
X86 reboot-notify additions.
This patch adds scans of the "reboot_notifier_list" callback chain in
a three other places where the kernel is being stopped and/or restarted.
Adds calls to blocking_notifier_call_chain() in:
crash_kexec(), emergency_restart(), sys_kexec_load()
In the crash_kexec() and emergency_restart() cases it is indicated to the
called-back function that the system is not in a sane state, so that
it can avoid taking a lock or some such potentially blocking action.
These callbacks are important to a partition system. The stopped kernel needs
to inform other partitions of their need to disconnect (stop sharing memory).
Diffed against 2.6.26-rc6
Signed-off-by: Cliff Wickman <[email protected]>
---
include/linux/notifier.h | 4 ++++
kernel/kexec.c | 5 +++++
kernel/sys.c | 1 +
3 files changed, 10 insertions(+)
Index: linux/include/linux/notifier.h
===================================================================
--- linux.orig/include/linux/notifier.h
+++ linux/include/linux/notifier.h
@@ -202,6 +202,10 @@ static inline int notifier_to_errno(int
#define SYS_RESTART SYS_DOWN
#define SYS_HALT 0x0002 /* Notify of system halt */
#define SYS_POWER_OFF 0x0003 /* Notify of system power off */
+#define SYS_INSANE 0x0004 /* Notify of system error/panic/oops */
+/* For the SYS_INSANE case, no locks should be taken by the called-back
+ * function. The kernel is ready for an immediate reboot.
+ */
#define NETLINK_URELEASE 0x0001 /* Unicast netlink socket released */
Index: linux/kernel/kexec.c
===================================================================
--- linux.orig/kernel/kexec.c
+++ linux/kernel/kexec.c
@@ -1001,6 +1001,9 @@ asmlinkage long sys_kexec_load(unsigned
if (result)
goto out;
}
+
+ blocking_notifier_call_chain(&reboot_notifier_list, SYS_RESTART, NULL);
+
/* Install the new kernel, and Uninstall the old */
image = xchg(dest_image, image);
@@ -1068,6 +1071,8 @@ void crash_kexec(struct pt_regs *regs)
if (!locked) {
if (kexec_crash_image) {
struct pt_regs fixed_regs;
+ blocking_notifier_call_chain(&reboot_notifier_list,
+ SYS_INSANE, NULL);
crash_setup_regs(&fixed_regs, regs);
crash_save_vmcoreinfo();
machine_crash_shutdown(&fixed_regs);
Index: linux/kernel/sys.c
===================================================================
--- linux.orig/kernel/sys.c
+++ linux/kernel/sys.c
@@ -270,6 +270,7 @@ out_unlock:
*/
void emergency_restart(void)
{
+ blocking_notifier_call_chain(&reboot_notifier_list, SYS_INSANE, NULL);
machine_emergency_restart();
}
EXPORT_SYMBOL_GPL(emergency_restart);
Cliff Wickman <[email protected]> writes:
> From: Cliff Wickman <[email protected]>
>
> X86 reboot-notify additions.
Doesn't seem x86 specific to me.
> @@ -1068,6 +1071,8 @@ void crash_kexec(struct pt_regs *regs)
> if (!locked) {
> if (kexec_crash_image) {
> struct pt_regs fixed_regs;
> + blocking_notifier_call_chain(&reboot_notifier_list,
> + SYS_INSANE, NULL);
But you don't really want to block during a crash, do you?
> crash_setup_regs(&fixed_regs, regs);
> crash_save_vmcoreinfo();
> machine_crash_shutdown(&fixed_regs);
> Index: linux/kernel/sys.c
> ===================================================================
> --- linux.orig/kernel/sys.c
> +++ linux/kernel/sys.c
> @@ -270,6 +270,7 @@ out_unlock:
> */
> void emergency_restart(void)
> {
> + blocking_notifier_call_chain(&reboot_notifier_list, SYS_INSANE, NULL);
Here neither.
In fact taking any locks here is dangerous because if you crash holding
such a lock the system will deadlock on panic.
-Andi
* Cliff Wickman <[email protected]> wrote:
> From: Cliff Wickman <[email protected]>
>
> X86 reboot-notify additions.
>
> This patch adds scans of the "reboot_notifier_list" callback chain in
> a three other places where the kernel is being stopped and/or restarted.
>
> Adds calls to blocking_notifier_call_chain() in:
> crash_kexec(), emergency_restart(), sys_kexec_load()
>
> In the crash_kexec() and emergency_restart() cases it is indicated to the
> called-back function that the system is not in a sane state, so that
> it can avoid taking a lock or some such potentially blocking action.
>
> These callbacks are important to a partition system. The stopped kernel needs
> to inform other partitions of their need to disconnect (stop sharing memory).
>
> Diffed against 2.6.26-rc6
>
> Signed-off-by: Cliff Wickman <[email protected]>
> ---
> include/linux/notifier.h | 4 ++++
> kernel/kexec.c | 5 +++++
> kernel/sys.c | 1 +
> 3 files changed, 10 insertions(+)
>
> Index: linux/include/linux/notifier.h
> ===================================================================
> --- linux.orig/include/linux/notifier.h
> +++ linux/include/linux/notifier.h
> @@ -202,6 +202,10 @@ static inline int notifier_to_errno(int
> #define SYS_RESTART SYS_DOWN
> #define SYS_HALT 0x0002 /* Notify of system halt */
> #define SYS_POWER_OFF 0x0003 /* Notify of system power off */
> +#define SYS_INSANE 0x0004 /* Notify of system error/panic/oops */
> +/* For the SYS_INSANE case, no locks should be taken by the called-back
> + * function. The kernel is ready for an immediate reboot.
> + */
>
> #define NETLINK_URELEASE 0x0001 /* Unicast netlink socket released */
>
> Index: linux/kernel/kexec.c
> ===================================================================
> --- linux.orig/kernel/kexec.c
> +++ linux/kernel/kexec.c
> @@ -1001,6 +1001,9 @@ asmlinkage long sys_kexec_load(unsigned
> if (result)
> goto out;
> }
> +
> + blocking_notifier_call_chain(&reboot_notifier_list, SYS_RESTART, NULL);
> +
> /* Install the new kernel, and Uninstall the old */
> image = xchg(dest_image, image);
>
> @@ -1068,6 +1071,8 @@ void crash_kexec(struct pt_regs *regs)
> if (!locked) {
> if (kexec_crash_image) {
> struct pt_regs fixed_regs;
> + blocking_notifier_call_chain(&reboot_notifier_list,
> + SYS_INSANE, NULL);
> crash_setup_regs(&fixed_regs, regs);
> crash_save_vmcoreinfo();
> machine_crash_shutdown(&fixed_regs);
> Index: linux/kernel/sys.c
> ===================================================================
> --- linux.orig/kernel/sys.c
> +++ linux/kernel/sys.c
> @@ -270,6 +270,7 @@ out_unlock:
> */
> void emergency_restart(void)
> {
> + blocking_notifier_call_chain(&reboot_notifier_list, SYS_INSANE, NULL);
> machine_emergency_restart();
> }
> EXPORT_SYMBOL_GPL(emergency_restart);
i dont think this is a good idea. reboot_notifier_list is a blocking
notifier, i.e. it comes with a notifier->rwsem read-write mutex that is
taken when blocking_notifier_call_chain() is executed.
i.e. this patch puts a sleeping mutex operation (a down_read()) into a
highly critical code path of the kernel. This will decrease the
reliability of the kernel.
what exactly are you trying to achieve?
Ingo
On Thu, Jun 19, 2008 at 01:02:14PM +0200, Ingo Molnar wrote:
>
> * Cliff Wickman <[email protected]> wrote:
>
> > From: Cliff Wickman <[email protected]>
> >
> > X86 reboot-notify additions.
As Andi Kleen pointed out, this is not X86-specific. (it started out
to be, but what I hoped to achieve for X86 turns out to be generic).
> > This patch adds scans of the "reboot_notifier_list" callback chain in
> > a three other places where the kernel is being stopped and/or restarted.
> >
> > Adds calls to blocking_notifier_call_chain() in:
> > crash_kexec(), emergency_restart(), sys_kexec_load()
> >
> > In the crash_kexec() and emergency_restart() cases it is indicated to the
> > called-back function that the system is not in a sane state, so that
> > it can avoid taking a lock or some such potentially blocking action.
> >
> > These callbacks are important to a partition system. The stopped kernel needs
> > to inform other partitions of their need to disconnect (stop sharing memory).
> >
> > Diffed against 2.6.26-rc6
> >
> > Signed-off-by: Cliff Wickman <[email protected]>
> > ---
> > include/linux/notifier.h | 4 ++++
> > kernel/kexec.c | 5 +++++
> > kernel/sys.c | 1 +
> > 3 files changed, 10 insertions(+)
> >
> > Index: linux/include/linux/notifier.h
> > ===================================================================
> > --- linux.orig/include/linux/notifier.h
> > +++ linux/include/linux/notifier.h
> > @@ -202,6 +202,10 @@ static inline int notifier_to_errno(int
> > #define SYS_RESTART SYS_DOWN
> > #define SYS_HALT 0x0002 /* Notify of system halt */
> > #define SYS_POWER_OFF 0x0003 /* Notify of system power off */
> > +#define SYS_INSANE 0x0004 /* Notify of system error/panic/oops */
> > +/* For the SYS_INSANE case, no locks should be taken by the called-back
> > + * function. The kernel is ready for an immediate reboot.
> > + */
> >
> > #define NETLINK_URELEASE 0x0001 /* Unicast netlink socket released */
> >
> > Index: linux/kernel/kexec.c
> > ===================================================================
> > --- linux.orig/kernel/kexec.c
> > +++ linux/kernel/kexec.c
> > @@ -1001,6 +1001,9 @@ asmlinkage long sys_kexec_load(unsigned
> > if (result)
> > goto out;
> > }
> > +
> > + blocking_notifier_call_chain(&reboot_notifier_list, SYS_RESTART, NULL);
> > +
> > /* Install the new kernel, and Uninstall the old */
> > image = xchg(dest_image, image);
> >
> > @@ -1068,6 +1071,8 @@ void crash_kexec(struct pt_regs *regs)
> > if (!locked) {
> > if (kexec_crash_image) {
> > struct pt_regs fixed_regs;
> > + blocking_notifier_call_chain(&reboot_notifier_list,
> > + SYS_INSANE, NULL);
> > crash_setup_regs(&fixed_regs, regs);
> > crash_save_vmcoreinfo();
> > machine_crash_shutdown(&fixed_regs);
> > Index: linux/kernel/sys.c
> > ===================================================================
> > --- linux.orig/kernel/sys.c
> > +++ linux/kernel/sys.c
> > @@ -270,6 +270,7 @@ out_unlock:
> > */
> > void emergency_restart(void)
> > {
> > + blocking_notifier_call_chain(&reboot_notifier_list, SYS_INSANE, NULL);
> > machine_emergency_restart();
> > }
> > EXPORT_SYMBOL_GPL(emergency_restart);
>
> i dont think this is a good idea. reboot_notifier_list is a blocking
> notifier, i.e. it comes with a notifier->rwsem read-write mutex that is
> taken when blocking_notifier_call_chain() is executed.
>
> i.e. this patch puts a sleeping mutex operation (a down_read()) into a
> highly critical code path of the kernel. This will decrease the
> reliability of the kernel.
Andi pointed this out, too.
For these emergency cases (I'll change "SYS_INSANE" to "SYS_EMERGENCY")
I probably should be using raw_notifier_call_chain(), which requires
a slightly different form of list header but doesn't try to protect
against someone else adding to the notifier list.
>
> what exactly are you trying to achieve?
>
> Ingo
The impetus for these additions is to call back a driver in every case
that the kernel is going down. In a partitioned system we need such a
driver to inform all other partitions that they need to disconnect from
the rebooting/halting/panicing partition (kernel image). If they are not
informed, they may bring themselves crashing down as well.
(xpc is such a cross_partition driver)
I propose this revision of the patch instead:
Subject: [PATCH] reboot-notify additions
reboot-notify additions
This patch adds scans of the "reboot_notifier_list" callback chain in
the remaining places where the kernel is being stopped and/or restarted.
Adds 3 calls to {raw|blocking}_notifier_call_chain() in:
crash_kexec(), sys_kexec_load(), emergency_restart()
Diffed against 2.6.26-rc6
Signed-off-by: Cliff Wickman <[email protected]>
---
include/linux/notifier.h | 3 +++
kernel/kexec.c | 10 ++++++++++
kernel/sys.c | 7 +++++++
3 files changed, 20 insertions(+)
Index: linux/include/linux/notifier.h
===================================================================
--- linux.orig/include/linux/notifier.h
+++ linux/include/linux/notifier.h
@@ -202,6 +202,9 @@ static inline int notifier_to_errno(int
#define SYS_RESTART SYS_DOWN
#define SYS_HALT 0x0002 /* Notify of system halt */
#define SYS_POWER_OFF 0x0003 /* Notify of system power off */
+#define SYS_EMERGENCY 0x0004 /* Notify of system error/panic/oops */
+/* For the SYS_EMERGENCY case, no locks should be taken by the called-back
+ * function. */
#define NETLINK_URELEASE 0x0001 /* Unicast netlink socket released */
Index: linux/kernel/kexec.c
===================================================================
--- linux.orig/kernel/kexec.c
+++ linux/kernel/kexec.c
@@ -24,6 +24,7 @@
#include <linux/utsrelease.h>
#include <linux/utsname.h>
#include <linux/numa.h>
+#include <linux/notifier.h>
#include <asm/page.h>
#include <asm/uaccess.h>
@@ -1001,6 +1002,9 @@ asmlinkage long sys_kexec_load(unsigned
if (result)
goto out;
}
+
+ blocking_notifier_call_chain(&reboot_notifier_list, SYS_RESTART, NULL);
+
/* Install the new kernel, and Uninstall the old */
image = xchg(dest_image, image);
@@ -1063,11 +1067,17 @@ void crash_kexec(struct pt_regs *regs)
* If the crash kernel was not located in a fixed area
* of memory the xchg(&kexec_crash_image) would be
* sufficient. But since I reuse the memory...
+ *
+ * The reboot_notifier_list uses a header for a blocking-form scan.
+ * Use a local header suitable for a non-blocking scan.
*/
locked = xchg(&kexec_lock, 1);
if (!locked) {
if (kexec_crash_image) {
struct pt_regs fixed_regs;
+ struct raw_notifier_head rh;
+ rh.head = reboot_notifier_list.head;
+ raw_notifier_call_chain(&rh, SYS_EMERGENCY, NULL);
crash_setup_regs(&fixed_regs, regs);
crash_save_vmcoreinfo();
machine_crash_shutdown(&fixed_regs);
Index: linux/kernel/sys.c
===================================================================
--- linux.orig/kernel/sys.c
+++ linux/kernel/sys.c
@@ -267,9 +267,16 @@ out_unlock:
* reboot the system. This is called when we know we are in
* trouble so this is our best effort to reboot. This is
* safe to call in interrupt context.
+ *
+ * The reboot_notifier_list uses a header for a blocking-form scan.
+ * Use a local header suitable for a non-blocking scan.
*/
void emergency_restart(void)
{
+ struct raw_notifier_head rh;
+
+ rh.head = reboot_notifier_list.head;
+ raw_notifier_call_chain(&rh, SYS_EMERGENCY, NULL);
machine_emergency_restart();
}
EXPORT_SYMBOL_GPL(emergency_restart);
--
Cliff Wickman
Silicon Graphics, Inc.
[email protected]
(651) 683-3824
Cliff Wickman <[email protected]> writes:
> On Thu, Jun 19, 2008 at 01:02:14PM +0200, Ingo Molnar wrote:
>>
>> * Cliff Wickman <[email protected]> wrote:
>>
>> > From: Cliff Wickman <[email protected]>
>> >
>> > X86 reboot-notify additions.
>
> As Andi Kleen pointed out, this is not X86-specific. (it started out
> to be, but what I hoped to achieve for X86 turns out to be generic).
>
>> > This patch adds scans of the "reboot_notifier_list" callback chain in
>> > a three other places where the kernel is being stopped and/or restarted.
>> >
>> > Adds calls to blocking_notifier_call_chain() in:
>> > crash_kexec(), emergency_restart(), sys_kexec_load()
>> >
>> > In the crash_kexec() and emergency_restart() cases it is indicated to the
>> > called-back function that the system is not in a sane state, so that
>> > it can avoid taking a lock or some such potentially blocking action.
>> >
>> > These callbacks are important to a partition system. The stopped kernel
> needs
>> > to inform other partitions of their need to disconnect (stop sharing
> memory).
>> >
>> > Diffed against 2.6.26-rc6
>> >
>> > Signed-off-by: Cliff Wickman <[email protected]>
>> > ---
>> > include/linux/notifier.h | 4 ++++
>> > kernel/kexec.c | 5 +++++
>> > kernel/sys.c | 1 +
>> > 3 files changed, 10 insertions(+)
>> >
>> > Index: linux/include/linux/notifier.h
>> > ===================================================================
>> > --- linux.orig/include/linux/notifier.h
>> > +++ linux/include/linux/notifier.h
>> > @@ -202,6 +202,10 @@ static inline int notifier_to_errno(int
>> > #define SYS_RESTART SYS_DOWN
>> > #define SYS_HALT 0x0002 /* Notify of system halt */
>> > #define SYS_POWER_OFF 0x0003 /* Notify of system power off */
>> > +#define SYS_INSANE 0x0004 /* Notify of system error/panic/oops */
>> > +/* For the SYS_INSANE case, no locks should be taken by the called-back
>> > + * function. The kernel is ready for an immediate reboot.
>> > + */
>> >
>> > #define NETLINK_URELEASE 0x0001 /* Unicast netlink socket released */
>> >
>> > Index: linux/kernel/kexec.c
>> > ===================================================================
>> > --- linux.orig/kernel/kexec.c
>> > +++ linux/kernel/kexec.c
>> > @@ -1001,6 +1001,9 @@ asmlinkage long sys_kexec_load(unsigned
>> > if (result)
>> > goto out;
>> > }
>> > +
>> > + blocking_notifier_call_chain(&reboot_notifier_list, SYS_RESTART, NULL);
>> > +
>> > /* Install the new kernel, and Uninstall the old */
>> > image = xchg(dest_image, image);
>> >
>> > @@ -1068,6 +1071,8 @@ void crash_kexec(struct pt_regs *regs)
>> > if (!locked) {
>> > if (kexec_crash_image) {
>> > struct pt_regs fixed_regs;
>> > + blocking_notifier_call_chain(&reboot_notifier_list,
>> > + SYS_INSANE, NULL);
>> > crash_setup_regs(&fixed_regs, regs);
>> > crash_save_vmcoreinfo();
>> > machine_crash_shutdown(&fixed_regs);
>> > Index: linux/kernel/sys.c
>> > ===================================================================
>> > --- linux.orig/kernel/sys.c
>> > +++ linux/kernel/sys.c
>> > @@ -270,6 +270,7 @@ out_unlock:
>> > */
>> > void emergency_restart(void)
>> > {
>> > + blocking_notifier_call_chain(&reboot_notifier_list, SYS_INSANE, NULL);
>> > machine_emergency_restart();
>> > }
>> > EXPORT_SYMBOL_GPL(emergency_restart);
>>
>> i dont think this is a good idea. reboot_notifier_list is a blocking
>> notifier, i.e. it comes with a notifier->rwsem read-write mutex that is
>> taken when blocking_notifier_call_chain() is executed.
>>
>> i.e. this patch puts a sleeping mutex operation (a down_read()) into a
>> highly critical code path of the kernel. This will decrease the
>> reliability of the kernel.
>
> Andi pointed this out, too.
>
> For these emergency cases (I'll change "SYS_INSANE" to "SYS_EMERGENCY")
> I probably should be using raw_notifier_call_chain(), which requires
> a slightly different form of list header but doesn't try to protect
> against someone else adding to the notifier list.
>>
>> what exactly are you trying to achieve?
>>
>> Ingo
>
> The impetus for these additions is to call back a driver in every case
> that the kernel is going down. In a partitioned system we need such a
> driver to inform all other partitions that they need to disconnect from
> the rebooting/halting/panicing partition (kernel image). If they are not
> informed, they may bring themselves crashing down as well.
> (xpc is such a cross_partition driver)
Cool. Someone who wants this kind of functionality and has code in
the kernel. Perhaps we can have a reasonable discussion about this.
The last time this came up people wanted a hook so they could support
their out of tree blobs in an enterprise kernel.
emergency_restart only happens or only should happen with explicit admin
request Sysreq-r. Or when a watchdog detects the system is borked.
By design it is not expected to call drivers. The kexec on panic
case is similar.
sys_kexec_load is a ridiculous place to call any kind of reboot notifier
because it is a prep function that doesn't require any kind of connection
to rebooting.
As far as this being a generic problem I half agree. It seems to depend
on your platform if you need something like this.
With that said. I suggest we have a single platform specific function
that can be called. Possibly something like pm_power_off. It is
insanely important that we be able to audit all of the code on these
code paths, and that a random loadable driver not be able to get in
and mess things up.
Eric
* Cliff Wickman <[email protected]> wrote:
> > > +#define SYS_INSANE 0x0004 /* Notify of system error/panic/oops */
> > > +/* For the SYS_INSANE case, no locks should be taken by the called-back
> > > + * function. The kernel is ready for an immediate reboot.
> > > + */
please use standard comment blocks, something like this:
/*
* Notify of system error/panic/oops
*
* No locks should be taken by the called-back function.
* The kernel is ready for an immediate reboot.
*/
#define SYS_INSANE 0x0004
> > > +++ linux/kernel/kexec.c
> > > @@ -1001,6 +1001,9 @@ asmlinkage long sys_kexec_load(unsigned
> > > if (result)
> > > goto out;
> > > }
> > > +
> > > + blocking_notifier_call_chain(&reboot_notifier_list, SYS_RESTART, NULL);
this puts an extra blocking call into a kexec critical path.
> struct pt_regs fixed_regs;
> + struct raw_notifier_head rh;
> + rh.head = reboot_notifier_list.head;
> + raw_notifier_call_chain(&rh, SYS_EMERGENCY, NULL);
that's very nasty. Now callbacks which put themselves on a blocking list
will be called without locking.
> void emergency_restart(void)
> {
> + struct raw_notifier_head rh;
> +
> + rh.head = reboot_notifier_list.head;
> + raw_notifier_call_chain(&rh, SYS_EMERGENCY, NULL);
ditto.
this patch is still far from being acceptable.
Ingo
On Thu, Jun 19, 2008 at 02:50:49PM -0700, Eric W. Biederman wrote:
> Cliff Wickman <[email protected]> writes:
>
> >> > This patch adds scans of the "reboot_notifier_list" callback chain in
> >> > a three other places where the kernel is being stopped and/or restarted.
> >> >
> >> > Adds calls to blocking_notifier_call_chain() in:
> >> > crash_kexec(), emergency_restart(), sys_kexec_load()
> >> >
> >> > In the crash_kexec() and emergency_restart() cases it is indicated to the
> >> > called-back function that the system is not in a sane state, so that
> >> > it can avoid taking a lock or some such potentially blocking action.
> >> >
> >> > These callbacks are important to a partition system. The stopped kernel
> > needs
> >> > to inform other partitions of their need to disconnect (stop sharing
> > memory).
> >> >
> >> > Index: linux/kernel/kexec.c
> >> > ===================================================================
> >> > --- linux.orig/kernel/kexec.c
> >> > +++ linux/kernel/kexec.c
> >> > @@ -1001,6 +1001,9 @@ asmlinkage long sys_kexec_load(unsigned
> >> > if (result)
> >> > goto out;
> >> > }
> >> > +
> >> > + blocking_notifier_call_chain(&reboot_notifier_list, SYS_RESTART, NULL);
> >> > +
> >> > /* Install the new kernel, and Uninstall the old */
> >> > image = xchg(dest_image, image);
> >> >
I withdraw the above chunk. As Eric pointed out, sys_kexec_load() is
not where the new kernel is started. That is done in kernel_kexec()
which already runs the reboot_notifier_list.
> >> i dont think this is a good idea. reboot_notifier_list is a blocking
> >> notifier, i.e. it comes with a notifier->rwsem read-write mutex that is
> >> taken when blocking_notifier_call_chain() is executed.
> >>
> >> i.e. this patch puts a sleeping mutex operation (a down_read()) into a
> >> highly critical code path of the kernel. This will decrease the
> >> reliability of the kernel.
> >
> > Andi pointed this out, too.
> >
> > For these emergency cases (I'll change "SYS_INSANE" to "SYS_EMERGENCY")
> > I probably should be using raw_notifier_call_chain(), which requires
> > a slightly different form of list header but doesn't try to protect
> > against someone else adding to the notifier list.
> >>
> >> what exactly are you trying to achieve?
> >>
> >> Ingo
> >
> > The impetus for these additions is to call back a driver in every case
> > that the kernel is going down. In a partitioned system we need such a
> > driver to inform all other partitions that they need to disconnect from
> > the rebooting/halting/panicing partition (kernel image). If they are not
> > informed, they may bring themselves crashing down as well.
> > (xpc is such a cross_partition driver)
>
> Cool. Someone who wants this kind of functionality and has code in
> the kernel. Perhaps we can have a reasonable discussion about this.
> The last time this came up people wanted a hook so they could support
> their out of tree blobs in an enterprise kernel.
>
> emergency_restart only happens or only should happen with explicit admin
> request Sysreq-r. Or when a watchdog detects the system is borked.
> By design it is not expected to call drivers. The kexec on panic
> case is similar.
I suppose one could trust that someone with superuser permission would
not stop one partition of a multi-partitioned system in a cavalier manner.
I'm inclined to think we should run the reboot_notifier_list even in those
situations.
But definitely on some watchdog timeout event. Some kind of mechanism
should be invoked to communicate the stoppage.
> sys_kexec_load is a ridiculous place to call any kind of reboot notifier
> because it is a prep function that doesn't require any kind of connection
> to rebooting.
agree
done
> As far as this being a generic problem I half agree. It seems to depend
> on your platform if you need something like this.
>
> With that said. I suggest we have a single platform specific function
> that can be called. Possibly something like pm_power_off. It is
> insanely important that we be able to audit all of the code on these
> code paths, and that a random loadable driver not be able to get in
> and mess things up.
The panic() function has the panic_notifier_list for those cases where
crash_kexec() does not find a crash kernel to exec.
That leaves holes for watchdog-type events and crash_kexec().
Can you elborate on the problem with running a non-blocking scan of
the reboot_notifier_list in those situations?
What do you have in mind as a platform specific function, that would
be an improvement over the reboot_notifier_list?
My current (v2) proposed patch for using the reboot_notifier_list as
this mechanism looks like this:
(and I'm not sure if using atomic_notifier_call_chain() might be a better
alternative to raw_notifier_call_chain())
Subject: [PATCHv2] reboot-notify additions
reboot-notify additions
This patch adds scans of the "reboot_notifier_list" callback chain in
the remaining places where the kernel is being stopped and/or restarted.
Adds 2 calls to raw_notifier_call_chain() in:
crash_kexec(), emergency_restart()
Diffed against 2.6.26-rc6
Signed-off-by: Cliff Wickman <[email protected]>
---
include/linux/notifier.h | 5 +++++
kernel/kexec.c | 6 ++++++
kernel/sys.c | 7 +++++++
3 files changed, 18 insertions(+)
Index: linux/include/linux/notifier.h
===================================================================
--- linux.orig/include/linux/notifier.h
+++ linux/include/linux/notifier.h
@@ -202,6 +202,11 @@ static inline int notifier_to_errno(int
#define SYS_RESTART SYS_DOWN
#define SYS_HALT 0x0002 /* Notify of system halt */
#define SYS_POWER_OFF 0x0003 /* Notify of system power off */
+#define SYS_EMERGENCY 0x0004 /* Notify of system error/panic/oops */
+/*
+ * For the SYS_EMERGENCY case, no locks should be taken by the called-back
+ * function.
+ */
#define NETLINK_URELEASE 0x0001 /* Unicast netlink socket released */
Index: linux/kernel/kexec.c
===================================================================
--- linux.orig/kernel/kexec.c
+++ linux/kernel/kexec.c
@@ -1063,11 +1063,17 @@ void crash_kexec(struct pt_regs *regs)
* If the crash kernel was not located in a fixed area
* of memory the xchg(&kexec_crash_image) would be
* sufficient. But since I reuse the memory...
+ *
+ * The reboot_notifier_list uses a header for a blocking-form scan.
+ * Use a local header suitable for a non-blocking scan.
*/
locked = xchg(&kexec_lock, 1);
if (!locked) {
if (kexec_crash_image) {
struct pt_regs fixed_regs;
+ struct raw_notifier_head rh;
+ rh.head = reboot_notifier_list.head;
+ raw_notifier_call_chain(&rh, SYS_EMERGENCY, NULL);
crash_setup_regs(&fixed_regs, regs);
crash_save_vmcoreinfo();
machine_crash_shutdown(&fixed_regs);
Index: linux/kernel/sys.c
===================================================================
--- linux.orig/kernel/sys.c
+++ linux/kernel/sys.c
@@ -267,9 +267,16 @@ out_unlock:
* reboot the system. This is called when we know we are in
* trouble so this is our best effort to reboot. This is
* safe to call in interrupt context.
+ *
+ * The reboot_notifier_list uses a header for a blocking-form scan.
+ * Use a local header suitable for a non-blocking scan.
*/
void emergency_restart(void)
{
+ struct raw_notifier_head rh;
+
+ rh.head = reboot_notifier_list.head;
+ raw_notifier_call_chain(&rh, SYS_EMERGENCY, NULL);
machine_emergency_restart();
}
EXPORT_SYMBOL_GPL(emergency_restart);
--
Cliff Wickman
Silicon Graphics, Inc.
[email protected]
(651) 683-3824
* Cliff Wickman <[email protected]> wrote:
> --- linux.orig/kernel/sys.c
> +++ linux/kernel/sys.c
> @@ -267,9 +267,16 @@ out_unlock:
> * reboot the system. This is called when we know we are in
> * trouble so this is our best effort to reboot. This is
> * safe to call in interrupt context.
> + *
> + * The reboot_notifier_list uses a header for a blocking-form scan.
> + * Use a local header suitable for a non-blocking scan.
> */
> void emergency_restart(void)
> {
> + struct raw_notifier_head rh;
> +
> + rh.head = reboot_notifier_list.head;
> + raw_notifier_call_chain(&rh, SYS_EMERGENCY, NULL);
> machine_emergency_restart();
> }
that's still not a good idea - a blocking notifier list is that: a list
that has stuff which might block. emergency_restart() might get called
by non-blocking codepaths as well and it expects the restart to occur.
Ingo
Cliff Wickman <[email protected]> writes:
>>
>> Cool. Someone who wants this kind of functionality and has code in
>> the kernel. Perhaps we can have a reasonable discussion about this.
>> The last time this came up people wanted a hook so they could support
>> their out of tree blobs in an enterprise kernel.
>>
>> emergency_restart only happens or only should happen with explicit admin
>> request Sysreq-r. Or when a watchdog detects the system is borked.
>> By design it is not expected to call drivers. The kexec on panic
>> case is similar.
>
> I suppose one could trust that someone with superuser permission would
> not stop one partition of a multi-partitioned system in a cavalier manner.
> I'm inclined to think we should run the reboot_notifier_list even in those
> situations.
NACK emergency_restart is for when calling a normal reboot doesn't
work i.e. calling the reboot_notifier_list is broken.
emergency_restart is by definition a hack.
Also now that I think about it now that we have the device tree
notifications the last few users of the reboot_notifier_list should
be updated and the reboot_notifier_list should just be removed.
> But definitely on some watchdog timeout event. Some kind of mechanism
> should be invoked to communicate the stoppage.
I'm not certain why this is important if you have a hardware partition
that looks like real hardware. In that case the firmware should
easily be able to detect this because we reboot the partition.
>> As far as this being a generic problem I half agree. It seems to depend
>> on your platform if you need something like this.
>>
>> With that said. I suggest we have a single platform specific function
>> that can be called. Possibly something like pm_power_off. It is
>> insanely important that we be able to audit all of the code on these
>> code paths, and that a random loadable driver not be able to get in
>> and mess things up.
>
> The panic() function has the panic_notifier_list for those cases where
> crash_kexec() does not find a crash kernel to exec.
>
> That leaves holes for watchdog-type events and crash_kexec().
> Can you elborate on the problem with running a non-blocking scan of
> the reboot_notifier_list in those situations?
>
> What do you have in mind as a platform specific function, that would
> be an improvement over the reboot_notifier_list?
In the general case it is WRONG to notify or run any function before
crash_kexec. The assumption is that the current kernel is BROKEN. So the
goal is to keep our exposure to kernel bugs to a minimum. There is currently
too _much_ code being run on the crash_kexec path.
In the crash_kexec case we can call functions on the other side of the
kexec notifier. So there is very much a hook there.
> My current (v2) proposed patch for using the reboot_notifier_list as
> this mechanism looks like this:
> (and I'm not sure if using atomic_notifier_call_chain() might be a better
> alternative to raw_notifier_call_chain())
I am willing to look at performing actions in the crash_kexec path
on a case by case basis. Which means essentially a hard coded function
call that compiles out on the hardware where it is meaningless.
As for emergency_restart. That is for those times when doing a proper
reboot doesn't work and you just want to rest the hardware.
So can we please start with what exactly you need to do on the xpc and
why?
Eric
On Fri, Jun 20, 2008 at 11:01:34AM -0700, Eric W. Biederman wrote:
> Cliff Wickman <[email protected]> writes:
>> void emergency_restart(void)
>> {
>> + struct raw_notifier_head rh;
>> +
>> + rh.head = reboot_notifier_list.head;
>> + raw_notifier_call_chain(&rh, SYS_EMERGENCY, NULL);
>> machine_emergency_restart();
>> }
> that's still not a good idea - a blocking notifier list is that: a list
> that has stuff which might block. emergency_restart() might get called
> by non-blocking codepaths as well and it expects the restart to occur.
>
> Ingo
Oh. I get it now. I was just looking at how the types of lists protect
against additions to the list.
Seems an atomic list would do what we want. By definition the functions
on such a list should not block.
> >>
> >> Cool. Someone who wants this kind of functionality and has code in
> >> the kernel. Perhaps we can have a reasonable discussion about this.
> >> The last time this came up people wanted a hook so they could support
> >> their out of tree blobs in an enterprise kernel.
> >>
> >> emergency_restart only happens or only should happen with explicit admin
> >> request Sysreq-r. Or when a watchdog detects the system is borked.
> >> By design it is not expected to call drivers. The kexec on panic
> >> case is similar.
> >
> > I suppose one could trust that someone with superuser permission would
> > not stop one partition of a multi-partitioned system in a cavalier manner.
> > I'm inclined to think we should run the reboot_notifier_list even in those
> > situations.
>
> NACK emergency_restart is for when calling a normal reboot doesn't
> work i.e. calling the reboot_notifier_list is broken.
>
> emergency_restart is by definition a hack.
Perhaps there should be a normal_restart() and an emergency_restart().
Currently, on x86, emergency_restart() is called in both sane and error
situations.
> Also now that I think about it now that we have the device tree
> notifications the last few users of the reboot_notifier_list should
> be updated and the reboot_notifier_list should just be removed.
>
> > But definitely on some watchdog timeout event. Some kind of mechanism
> > should be invoked to communicate the stoppage.
>
> I'm not certain why this is important if you have a hardware partition
> that looks like real hardware. In that case the firmware should
> easily be able to detect this because we reboot the partition.
I suppose the firmware could get involved. It's true that the live
partitions have no problem until the dead partition is rebooted and
memory barriers are raised to the memory the live partitions are accessing.
I suppose the rebooting partition could communicate to the firmware in
the live partitions. But to communicate to a driver in the OS, and then
back again to the firmware in the dead partition might be pretty messy.
> >> As far as this being a generic problem I half agree. It seems to depend
> >> on your platform if you need something like this.
> >>
> >> With that said. I suggest we have a single platform specific function
> >> that can be called. Possibly something like pm_power_off. It is
> >> insanely important that we be able to audit all of the code on these
> >> code paths, and that a random loadable driver not be able to get in
> >> and mess things up.
> >
> > The panic() function has the panic_notifier_list for those cases where
> > crash_kexec() does not find a crash kernel to exec.
> >
> > That leaves holes for watchdog-type events and crash_kexec().
> > Can you elborate on the problem with running a non-blocking scan of
> > the reboot_notifier_list in those situations?
> >
> > What do you have in mind as a platform specific function, that would
> > be an improvement over the reboot_notifier_list?
>
> In the general case it is WRONG to notify or run any function before
> crash_kexec. The assumption is that the current kernel is BROKEN. So the
> goal is to keep our exposure to kernel bugs to a minimum. There is currently
> too _much_ code being run on the crash_kexec path.
We do have an atomic panic_notifier_list. How about using that?
The functions on the list are supposed to be non-blocking.
(currently I only see 5 of them in use -- lguest_panic panic_event
wdog_panic_handler panic_happened softlock_panic)
> In the crash_kexec case we can call functions on the other side of the
> kexec notifier. So there is very much a hook there.
Sorry, I'm not understanding that. What is that hook?
> > My current (v2) proposed patch for using the reboot_notifier_list as
> > this mechanism looks like this:
> > (and I'm not sure if using atomic_notifier_call_chain() might be a better
> > alternative to raw_notifier_call_chain())
>
> I am willing to look at performing actions in the crash_kexec path
> on a case by case basis. Which means essentially a hard coded function
> call that compiles out on the hardware where it is meaningless.
>
> As for emergency_restart. That is for those times when doing a proper
> reboot doesn't work and you just want to rest the hardware.
Perhaps a split of emergency_restart() into normal_restart() and
emergency_restart() would reserve emergency_restart() for just those cases.
Then we could use the emergency procedures only in the emergency cases.
> So can we please start with what exactly you need to do on the xpc and
> why?
See xpc_system_reboot() [drivers/misc/sgi-xp/xpc_main.c]
It is called from the reboot_notifier_list when the system is being
rebooted.
The driver calls xpc_do_exit() go through its normal exit processing. This
involves some significant waiting.
And xpc_system_die().
It is called from the die_chain (on ia64). But on x86 there are these
couple of cases where no callback occurs from the reboot or panic lists.
The driver is to be called back when the kernel is restarted or halted due
to some sort of failure. It calls xpc_die_disengage() to notify other
partitions to disengage from all references to the dying partition's memory.
There is some waiting for the other partitions.
-Cliff
Below is my current thought for covering the crash_kexec and
emergency_restart cases:
Subject: [PATCH] panic-notify additions
This patch adds scans of the "panic_notifier_list" callback chain in
the places where the kernel is going down in an error situation, but in
which no notification was provided.
Adds 2 calls to atomic_notifier_call_chain() in:
crash_kexec(), emergency_restart()
Differentiates as to the source of the call with arguments SYS_PANIC,
SYS_KEXEC and SYS_EMERGENCY.
The panic_notifier_list is used instead of the reboot_notifier_list
because it is atomic. That is, by definition, the functions on this type
of list run in an atomic context so they must not block.
Diffed against 2.6.26-rc6
Signed-off-by: Cliff Wickman <[email protected]>
---
include/linux/notifier.h | 3 +++
kernel/kexec.c | 2 ++
kernel/panic.c | 2 +-
kernel/sys.c | 2 ++
4 files changed, 8 insertions(+), 1 deletion(-)
Index: linux/include/linux/notifier.h
===================================================================
--- linux.orig/include/linux/notifier.h
+++ linux/include/linux/notifier.h
@@ -202,6 +202,9 @@ static inline int notifier_to_errno(int
#define SYS_RESTART SYS_DOWN
#define SYS_HALT 0x0002 /* Notify of system halt */
#define SYS_POWER_OFF 0x0003 /* Notify of system power off */
+#define SYS_PANIC 0x0004 /* Notify of a panic */
+#define SYS_KEXEC 0x0005 /* Notify of kexec of a kernel */
+#define SYS_EMERGENCY 0x0006 /* Notify of emergency restart */
#define NETLINK_URELEASE 0x0001 /* Unicast netlink socket released */
Index: linux/kernel/kexec.c
===================================================================
--- linux.orig/kernel/kexec.c
+++ linux/kernel/kexec.c
@@ -1068,6 +1068,8 @@ void crash_kexec(struct pt_regs *regs)
if (!locked) {
if (kexec_crash_image) {
struct pt_regs fixed_regs;
+ atomic_notifier_call_chain(&panic_notifier_list,
+ SYS_KEXEC, "kexec");
crash_setup_regs(&fixed_regs, regs);
crash_save_vmcoreinfo();
machine_crash_shutdown(&fixed_regs);
Index: linux/kernel/sys.c
===================================================================
--- linux.orig/kernel/sys.c
+++ linux/kernel/sys.c
@@ -270,6 +270,8 @@ out_unlock:
*/
void emergency_restart(void)
{
+ atomic_notifier_call_chain(&panic_notifier_list, SYS_EMERGENCY,
+ "emergency restart");
machine_emergency_restart();
}
EXPORT_SYMBOL_GPL(emergency_restart);
Index: linux/kernel/panic.c
===================================================================
--- linux.orig/kernel/panic.c
+++ linux/kernel/panic.c
@@ -98,7 +98,7 @@ NORET_TYPE void panic(const char * fmt,
smp_send_stop();
#endif
- atomic_notifier_call_chain(&panic_notifier_list, 0, buf);
+ atomic_notifier_call_chain(&panic_notifier_list, SYS_PANIC, buf);
if (!panic_blink)
panic_blink = no_blink;
--
Cliff Wickman
Silicon Graphics, Inc.
[email protected]
(651) 683-3824
Cliff Wickman <[email protected]> writes:
> On Fri, Jun 20, 2008 at 11:01:34AM -0700, Eric W. Biederman wrote:
>> Cliff Wickman <[email protected]> writes:
>
>>> void emergency_restart(void)
>>> {
>>> + struct raw_notifier_head rh;
>>> +
>>> + rh.head = reboot_notifier_list.head;
>>> + raw_notifier_call_chain(&rh, SYS_EMERGENCY, NULL);
>>> machine_emergency_restart();
>>> }
>
>> that's still not a good idea - a blocking notifier list is that: a list
>> that has stuff which might block. emergency_restart() might get called
>> by non-blocking codepaths as well and it expects the restart to occur.
>>
>> Ingo
>
> Oh. I get it now. I was just looking at how the types of lists protect
> against additions to the list.
>
> Seems an atomic list would do what we want. By definition the functions
> on such a list should not block.
>
>> >>
>> >> Cool. Someone who wants this kind of functionality and has code in
>> >> the kernel. Perhaps we can have a reasonable discussion about this.
>> >> The last time this came up people wanted a hook so they could support
>> >> their out of tree blobs in an enterprise kernel.
>> >>
>> >> emergency_restart only happens or only should happen with explicit admin
>> >> request Sysreq-r. Or when a watchdog detects the system is borked.
>> >> By design it is not expected to call drivers. The kexec on panic
>> >> case is similar.
>> >
>> > I suppose one could trust that someone with superuser permission would
>> > not stop one partition of a multi-partitioned system in a cavalier manner.
>> > I'm inclined to think we should run the reboot_notifier_list even in those
>> > situations.
>>
>> NACK emergency_restart is for when calling a normal reboot doesn't
>> work i.e. calling the reboot_notifier_list is broken.
>>
>> emergency_restart is by definition a hack.
>
> Perhaps there should be a normal_restart() and an emergency_restart().
> Currently, on x86, emergency_restart() is called in both sane and error
> situations.
Simply because it is a proper subset of the sane situation.
kernel_restart is the normal situation.
>> Also now that I think about it now that we have the device tree
>> notifications the last few users of the reboot_notifier_list should
>> be updated and the reboot_notifier_list should just be removed.
>>
>> > But definitely on some watchdog timeout event. Some kind of mechanism
>> > should be invoked to communicate the stoppage.
>>
>> I'm not certain why this is important if you have a hardware partition
>> that looks like real hardware. In that case the firmware should
>> easily be able to detect this because we reboot the partition.
>
> I suppose the firmware could get involved. It's true that the live
> partitions have no problem until the dead partition is rebooted and
> memory barriers are raised to the memory the live partitions are accessing.
> I suppose the rebooting partition could communicate to the firmware in
> the live partitions. But to communicate to a driver in the OS, and then
> back again to the firmware in the dead partition might be pretty messy.
Fun shared memory between kernels. Can the kernel that is accessing the
shared memory not just trap on the failure and handle it?
> We do have an atomic panic_notifier_list. How about using that?
> The functions on the list are supposed to be non-blocking.
>
> (currently I only see 5 of them in use -- lguest_panic panic_event
> wdog_panic_handler panic_happened softlock_panic)
>
>> In the crash_kexec case we can call functions on the other side of the
>> kexec notifier. So there is very much a hook there.
>
> Sorry, I'm not understanding that. What is that hook?
In the kexec on panic case the hook is any random standalone executable
(typically another kernel) that you want to use. It is arguably the
most powerful hook in the kernel.
> Perhaps a split of emergency_restart() into normal_restart() and
> emergency_restart() would reserve emergency_restart() for just those cases.
> Then we could use the emergency procedures only in the emergency
> cases.
That is where emergency_restart comes from. Now if you think the
watchdog drivers are calling the wrong function it should be easy
to update them.
>> So can we please start with what exactly you need to do on the xpc and
>> why?
>
> See xpc_system_reboot() [drivers/misc/sgi-xp/xpc_main.c]
> It is called from the reboot_notifier_list when the system is being
> rebooted.
> The driver calls xpc_do_exit() go through its normal exit processing. This
> involves some significant waiting.
Ok. Looking at that code. It is wholly inappropriate to be called in
a non-blocking emergency context as written.
>
> And xpc_system_die().
> It is called from the die_chain (on ia64). But on x86 there are these
> couple of cases where no callback occurs from the reboot or panic
> lists.
Yes. emergency_restart() came about because were explicitly skipping
doing a clean shutdown, in a nasty way. So it is the a deliberate
hack.
crash_kexec is similar. xpc_system_die is no where close to as
robust as the rest of the crash_kexec code path, and just calling
it looks like it might double or triple the amount of code on that
path, in general noticeably increasing the chances you won't get a core
dump when something goes wrong.
> The driver is to be called back when the kernel is restarted or halted due
> to some sort of failure. It calls xpc_die_disengage() to notify other
> partitions to disengage from all references to the dying partition's memory.
> There is some waiting for the other partitions.
And now I begin to see part of the issue. The code is constructed as
a driver, as a loadable module in fact. Yet it is grubbing down low
in the chipset, and is doing (to put it politely) highly non-standard
things. Thus the need for more support for the kernel then a normal
driver needs.
Implemented as platform code, as a subarchitecture, I don't have much
problem with weird support code doing things that normally shouldn't
be done. I think that is a maintainable situation.
Making it possible for any random driver to hook crash_kexec or
emergency_restart seems like a short recipe to remove reliability
from those code paths.
If you really want to stay a driver then I suggest making your driver
robust so when one member of your memory sharing group goes down it
doesn't bring the rest of the members down with it.
Eric