2013-04-08 05:47:25

by Dave Young

[permalink] [raw]
Subject: Re: [RFC PATCH] watchdog: Add hook for kicking in kdump path

On 04/06/2013 04:16 AM, Don Zickus wrote:
> A common problem with kdump is that during the boot up of the
> second kernel, the hardware watchdog times out and reboots the
> machine before a vmcore can be captured.
>
> Instead of tellling customers to disable their hardware watchdog
> timers, I hacked up a hook to put in the kdump path that provides
> one last kick before jumping into the second kernel.
>
> The assumption is the watchdog timeout is at least 10-30 seconds
> long, enough to get the second kernel to userspace to kick the watchdog
> again, if needed.

For kdump kernel some devices need to reset, this might increase the
boot time, it's not so reliable for the 10-30s for us to kicking the
watchdog.

Could we have another option to disable/stop the watchdog while panic
happens? Ie. add a kernel cmdline panic_stop_wd=<0|1> for 1st kernel, if
it's set to 1, then just stop the watchdog or we can kick the watchdog
like what you do in this patch. Of course stopping watchdog should be
lockless as well..

>
> Of course kdump is usually executed on a panic path, so grabbing the
> watchdog mutexes to communicate with the hardware won't work. For now,
> I do everything locklessly.
>
> I can attempt a 'mutex_trylock' but not sure what to do in the failure
> case? spin up to a second?
>
> This patch is more of a proof of concept right now. Hopefully feedback
> will help solve this problem better.
>
> I have tested this with a machine using iTCO_wdt and the 'watchdog' app.
> The extra kicked happened as expected.
>
> Signed-off-by: Don Zickus <[email protected]>
> ---
> drivers/watchdog/watchdog_dev.c | 35 +++++++++++++++++++++++++++++++++++
> include/linux/watchdog.h | 7 +++++++
> kernel/kexec.c | 6 ++++++
> 3 files changed, 48 insertions(+), 0 deletions(-)
>
> diff --git a/drivers/watchdog/watchdog_dev.c b/drivers/watchdog/watchdog_dev.c
> index 08b48bb..6393572 100644
> --- a/drivers/watchdog/watchdog_dev.c
> +++ b/drivers/watchdog/watchdog_dev.c
> @@ -84,6 +84,41 @@ out_ping:
> }
>
> /*
> + * watchdog_kick_for_kdump: kick the watchdog in the kdump path
> + *
> + * The kdump path needs time to reboot the kernel back into
> + * userspace. This window is big enough the hardware watchdog
> + * may come in and reboot the box. This hook gives the watchdog
> + * one final kick to get kdump over the hump.
> + *
> + * We don't know what devices are open, so just use the legacy
> + * interface for now. We have to do this locklessly as we can
> + * not wait.
> + */
> +
> +void watchdog_kick_for_kdump(void)
> +{
> + struct watchdog_device *wddev = old_wdd;
> +
> + /* FIXME - Perhaps use a mutex_trylock? */
> +
> + if (!wddev)
> + return;
> +
> + if (test_bit(WDOG_UNREGISTERED, &wddev->status))
> + return;
> +
> + if (!watchdog_active(wddev))
> + return;
> +
> + if (wddev->ops->ping)
> + wddev->ops->ping(wddev); /* ping the watchdog */
> + else
> + wddev->ops->start(wddev); /* restart watchdog */
> +}
> +EXPORT_SYMBOL_GPL(watchdog_kick_for_kdump);
> +
> +/*
> * watchdog_start: wrapper to start the watchdog.
> * @wddev: the watchdog device to start
> *
> diff --git a/include/linux/watchdog.h b/include/linux/watchdog.h
> index 2a3038e..5dff975 100644
> --- a/include/linux/watchdog.h
> +++ b/include/linux/watchdog.h
> @@ -142,4 +142,11 @@ extern int watchdog_init_timeout(struct watchdog_device *wdd,
> extern int watchdog_register_device(struct watchdog_device *);
> extern void watchdog_unregister_device(struct watchdog_device *);
>
> +#ifdef CONFIG_WATCHDOG_CORE
> +/* drivers/watchdog/watchdog_dev.c */
> +extern void watchdog_kick_for_kdump(void);
> +#else
> +static inline void watchdog_kick_for_kdump(void) { };
> +#endif
> +
> #endif /* ifndef _LINUX_WATCHDOG_H */
> diff --git a/kernel/kexec.c b/kernel/kexec.c
> index bddd3d7..ced7ccd 100644
> --- a/kernel/kexec.c
> +++ b/kernel/kexec.c
> @@ -32,6 +32,7 @@
> #include <linux/vmalloc.h>
> #include <linux/swap.h>
> #include <linux/syscore_ops.h>
> +#include <linux/watchdog.h>
>
> #include <asm/page.h>
> #include <asm/uaccess.h>
> @@ -1094,6 +1095,11 @@ void crash_kexec(struct pt_regs *regs)
> if (kexec_crash_image) {
> struct pt_regs fixed_regs;
>
> + /*
> + * Give panic path a chance to do its post processing
> + */
> + watchdog_kick_for_kdump();
> +
> crash_setup_regs(&fixed_regs, regs);
> crash_save_vmcoreinfo();
> machine_crash_shutdown(&fixed_regs);
>


--
Thanks
Dave


2013-04-08 12:49:11

by Don Zickus

[permalink] [raw]
Subject: Re: [RFC PATCH] watchdog: Add hook for kicking in kdump path

On Mon, Apr 08, 2013 at 01:46:58PM +0800, Dave Young wrote:
> On 04/06/2013 04:16 AM, Don Zickus wrote:
> > A common problem with kdump is that during the boot up of the
> > second kernel, the hardware watchdog times out and reboots the
> > machine before a vmcore can be captured.
> >
> > Instead of tellling customers to disable their hardware watchdog
> > timers, I hacked up a hook to put in the kdump path that provides
> > one last kick before jumping into the second kernel.
> >
> > The assumption is the watchdog timeout is at least 10-30 seconds
> > long, enough to get the second kernel to userspace to kick the watchdog
> > again, if needed.
>
> For kdump kernel some devices need to reset, this might increase the
> boot time, it's not so reliable for the 10-30s for us to kicking the
> watchdog.
>
> Could we have another option to disable/stop the watchdog while panic
> happens? Ie. add a kernel cmdline panic_stop_wd=<0|1> for 1st kernel, if
> it's set to 1, then just stop the watchdog or we can kick the watchdog
> like what you do in this patch. Of course stopping watchdog should be
> lockless as well..

Hmm, I can look into that. But I am not sure all watchdogs have the
ability to stop once started. I was also worried about the case where
kdump hangs for some reason. Having the watchdog there to 'reboot' would
be a nice safety net.

Perhaps adjusting the watchdog 'timeout' to something like 3 minutes would
be easier?

I'll wait on feedback from the watchdog community to help point me in a
good direction.

Cheers,
Don

>
> >
> > Of course kdump is usually executed on a panic path, so grabbing the
> > watchdog mutexes to communicate with the hardware won't work. For now,
> > I do everything locklessly.
> >
> > I can attempt a 'mutex_trylock' but not sure what to do in the failure
> > case? spin up to a second?
> >
> > This patch is more of a proof of concept right now. Hopefully feedback
> > will help solve this problem better.
> >
> > I have tested this with a machine using iTCO_wdt and the 'watchdog' app.
> > The extra kicked happened as expected.
> >
> > Signed-off-by: Don Zickus <[email protected]>
> > ---
> > drivers/watchdog/watchdog_dev.c | 35 +++++++++++++++++++++++++++++++++++
> > include/linux/watchdog.h | 7 +++++++
> > kernel/kexec.c | 6 ++++++
> > 3 files changed, 48 insertions(+), 0 deletions(-)
> >
> > diff --git a/drivers/watchdog/watchdog_dev.c b/drivers/watchdog/watchdog_dev.c
> > index 08b48bb..6393572 100644
> > --- a/drivers/watchdog/watchdog_dev.c
> > +++ b/drivers/watchdog/watchdog_dev.c
> > @@ -84,6 +84,41 @@ out_ping:
> > }
> >
> > /*
> > + * watchdog_kick_for_kdump: kick the watchdog in the kdump path
> > + *
> > + * The kdump path needs time to reboot the kernel back into
> > + * userspace. This window is big enough the hardware watchdog
> > + * may come in and reboot the box. This hook gives the watchdog
> > + * one final kick to get kdump over the hump.
> > + *
> > + * We don't know what devices are open, so just use the legacy
> > + * interface for now. We have to do this locklessly as we can
> > + * not wait.
> > + */
> > +
> > +void watchdog_kick_for_kdump(void)
> > +{
> > + struct watchdog_device *wddev = old_wdd;
> > +
> > + /* FIXME - Perhaps use a mutex_trylock? */
> > +
> > + if (!wddev)
> > + return;
> > +
> > + if (test_bit(WDOG_UNREGISTERED, &wddev->status))
> > + return;
> > +
> > + if (!watchdog_active(wddev))
> > + return;
> > +
> > + if (wddev->ops->ping)
> > + wddev->ops->ping(wddev); /* ping the watchdog */
> > + else
> > + wddev->ops->start(wddev); /* restart watchdog */
> > +}
> > +EXPORT_SYMBOL_GPL(watchdog_kick_for_kdump);
> > +
> > +/*
> > * watchdog_start: wrapper to start the watchdog.
> > * @wddev: the watchdog device to start
> > *
> > diff --git a/include/linux/watchdog.h b/include/linux/watchdog.h
> > index 2a3038e..5dff975 100644
> > --- a/include/linux/watchdog.h
> > +++ b/include/linux/watchdog.h
> > @@ -142,4 +142,11 @@ extern int watchdog_init_timeout(struct watchdog_device *wdd,
> > extern int watchdog_register_device(struct watchdog_device *);
> > extern void watchdog_unregister_device(struct watchdog_device *);
> >
> > +#ifdef CONFIG_WATCHDOG_CORE
> > +/* drivers/watchdog/watchdog_dev.c */
> > +extern void watchdog_kick_for_kdump(void);
> > +#else
> > +static inline void watchdog_kick_for_kdump(void) { };
> > +#endif
> > +
> > #endif /* ifndef _LINUX_WATCHDOG_H */
> > diff --git a/kernel/kexec.c b/kernel/kexec.c
> > index bddd3d7..ced7ccd 100644
> > --- a/kernel/kexec.c
> > +++ b/kernel/kexec.c
> > @@ -32,6 +32,7 @@
> > #include <linux/vmalloc.h>
> > #include <linux/swap.h>
> > #include <linux/syscore_ops.h>
> > +#include <linux/watchdog.h>
> >
> > #include <asm/page.h>
> > #include <asm/uaccess.h>
> > @@ -1094,6 +1095,11 @@ void crash_kexec(struct pt_regs *regs)
> > if (kexec_crash_image) {
> > struct pt_regs fixed_regs;
> >
> > + /*
> > + * Give panic path a chance to do its post processing
> > + */
> > + watchdog_kick_for_kdump();
> > +
> > crash_setup_regs(&fixed_regs, regs);
> > crash_save_vmcoreinfo();
> > machine_crash_shutdown(&fixed_regs);
> >
>
>
> --
> Thanks
> Dave
>
>

2013-04-08 15:15:07

by Guenter Roeck

[permalink] [raw]
Subject: Re: [RFC PATCH] watchdog: Add hook for kicking in kdump path

On Mon, Apr 08, 2013 at 08:48:58AM -0400, Don Zickus wrote:
> On Mon, Apr 08, 2013 at 01:46:58PM +0800, Dave Young wrote:
> > On 04/06/2013 04:16 AM, Don Zickus wrote:
> > > A common problem with kdump is that during the boot up of the
> > > second kernel, the hardware watchdog times out and reboots the
> > > machine before a vmcore can be captured.
> > >
> > > Instead of tellling customers to disable their hardware watchdog
> > > timers, I hacked up a hook to put in the kdump path that provides
> > > one last kick before jumping into the second kernel.
> > >
> > > The assumption is the watchdog timeout is at least 10-30 seconds
> > > long, enough to get the second kernel to userspace to kick the watchdog
> > > again, if needed.
> >
> > For kdump kernel some devices need to reset, this might increase the
> > boot time, it's not so reliable for the 10-30s for us to kicking the
> > watchdog.
> >
> > Could we have another option to disable/stop the watchdog while panic
> > happens? Ie. add a kernel cmdline panic_stop_wd=<0|1> for 1st kernel, if
> > it's set to 1, then just stop the watchdog or we can kick the watchdog
> > like what you do in this patch. Of course stopping watchdog should be
> > lockless as well..
>
> Hmm, I can look into that. But I am not sure all watchdogs have the
> ability to stop once started. I was also worried about the case where

Correct.

> kdump hangs for some reason. Having the watchdog there to 'reboot' would
> be a nice safety net.
>
Absolutely agree. After all, the reason for the kdump is most likely that
something went really wrong, meaning there is some likelyhood for the hang
to occur. Turning off the watchdog in this condition does not seem to be
a good idea.

> Perhaps adjusting the watchdog 'timeout' to something like 3 minutes would
> be easier?
>
Not all watchdogs support such large timeouts, unfortunately. Maybe it would
make sense to implement infrastructure support for a softdog on top of the
hardware watchdog. Several drivers implement that outside the infrastructure
already.

Guenter

> I'll wait on feedback from the watchdog community to help point me in a
> good direction.
>
> Cheers,
> Don
>
> >
> > >
> > > Of course kdump is usually executed on a panic path, so grabbing the
> > > watchdog mutexes to communicate with the hardware won't work. For now,
> > > I do everything locklessly.
> > >
> > > I can attempt a 'mutex_trylock' but not sure what to do in the failure
> > > case? spin up to a second?
> > >
> > > This patch is more of a proof of concept right now. Hopefully feedback
> > > will help solve this problem better.
> > >
> > > I have tested this with a machine using iTCO_wdt and the 'watchdog' app.
> > > The extra kicked happened as expected.
> > >
> > > Signed-off-by: Don Zickus <[email protected]>
> > > ---
> > > drivers/watchdog/watchdog_dev.c | 35 +++++++++++++++++++++++++++++++++++
> > > include/linux/watchdog.h | 7 +++++++
> > > kernel/kexec.c | 6 ++++++
> > > 3 files changed, 48 insertions(+), 0 deletions(-)
> > >
> > > diff --git a/drivers/watchdog/watchdog_dev.c b/drivers/watchdog/watchdog_dev.c
> > > index 08b48bb..6393572 100644
> > > --- a/drivers/watchdog/watchdog_dev.c
> > > +++ b/drivers/watchdog/watchdog_dev.c
> > > @@ -84,6 +84,41 @@ out_ping:
> > > }
> > >
> > > /*
> > > + * watchdog_kick_for_kdump: kick the watchdog in the kdump path
> > > + *
> > > + * The kdump path needs time to reboot the kernel back into
> > > + * userspace. This window is big enough the hardware watchdog
> > > + * may come in and reboot the box. This hook gives the watchdog
> > > + * one final kick to get kdump over the hump.
> > > + *
> > > + * We don't know what devices are open, so just use the legacy
> > > + * interface for now. We have to do this locklessly as we can
> > > + * not wait.
> > > + */
> > > +
> > > +void watchdog_kick_for_kdump(void)
> > > +{
> > > + struct watchdog_device *wddev = old_wdd;
> > > +
> > > + /* FIXME - Perhaps use a mutex_trylock? */
> > > +
> > > + if (!wddev)
> > > + return;
> > > +
> > > + if (test_bit(WDOG_UNREGISTERED, &wddev->status))
> > > + return;
> > > +
> > > + if (!watchdog_active(wddev))
> > > + return;
> > > +
> > > + if (wddev->ops->ping)
> > > + wddev->ops->ping(wddev); /* ping the watchdog */
> > > + else
> > > + wddev->ops->start(wddev); /* restart watchdog */
> > > +}
> > > +EXPORT_SYMBOL_GPL(watchdog_kick_for_kdump);
> > > +
> > > +/*
> > > * watchdog_start: wrapper to start the watchdog.
> > > * @wddev: the watchdog device to start
> > > *
> > > diff --git a/include/linux/watchdog.h b/include/linux/watchdog.h
> > > index 2a3038e..5dff975 100644
> > > --- a/include/linux/watchdog.h
> > > +++ b/include/linux/watchdog.h
> > > @@ -142,4 +142,11 @@ extern int watchdog_init_timeout(struct watchdog_device *wdd,
> > > extern int watchdog_register_device(struct watchdog_device *);
> > > extern void watchdog_unregister_device(struct watchdog_device *);
> > >
> > > +#ifdef CONFIG_WATCHDOG_CORE
> > > +/* drivers/watchdog/watchdog_dev.c */
> > > +extern void watchdog_kick_for_kdump(void);
> > > +#else
> > > +static inline void watchdog_kick_for_kdump(void) { };
> > > +#endif
> > > +
> > > #endif /* ifndef _LINUX_WATCHDOG_H */
> > > diff --git a/kernel/kexec.c b/kernel/kexec.c
> > > index bddd3d7..ced7ccd 100644
> > > --- a/kernel/kexec.c
> > > +++ b/kernel/kexec.c
> > > @@ -32,6 +32,7 @@
> > > #include <linux/vmalloc.h>
> > > #include <linux/swap.h>
> > > #include <linux/syscore_ops.h>
> > > +#include <linux/watchdog.h>
> > >
> > > #include <asm/page.h>
> > > #include <asm/uaccess.h>
> > > @@ -1094,6 +1095,11 @@ void crash_kexec(struct pt_regs *regs)
> > > if (kexec_crash_image) {
> > > struct pt_regs fixed_regs;
> > >
> > > + /*
> > > + * Give panic path a chance to do its post processing
> > > + */
> > > + watchdog_kick_for_kdump();
> > > +
> > > crash_setup_regs(&fixed_regs, regs);
> > > crash_save_vmcoreinfo();
> > > machine_crash_shutdown(&fixed_regs);
> > >
> >
> >
> > --
> > Thanks
> > Dave
> >
> >
> --
> To unsubscribe from this list: send the line "unsubscribe linux-watchdog" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>

2013-04-09 14:44:42

by Don Zickus

[permalink] [raw]
Subject: Re: [RFC PATCH] watchdog: Add hook for kicking in kdump path

On Mon, Apr 08, 2013 at 08:15:09AM -0700, Guenter Roeck wrote:
> On Mon, Apr 08, 2013 at 08:48:58AM -0400, Don Zickus wrote:
> > On Mon, Apr 08, 2013 at 01:46:58PM +0800, Dave Young wrote:
> > > On 04/06/2013 04:16 AM, Don Zickus wrote:
> > > > A common problem with kdump is that during the boot up of the
> > > > second kernel, the hardware watchdog times out and reboots the
> > > > machine before a vmcore can be captured.
> > > >
> > > > Instead of tellling customers to disable their hardware watchdog
> > > > timers, I hacked up a hook to put in the kdump path that provides
> > > > one last kick before jumping into the second kernel.
> > > >
> > > > The assumption is the watchdog timeout is at least 10-30 seconds
> > > > long, enough to get the second kernel to userspace to kick the watchdog
> > > > again, if needed.
> > >
> > > For kdump kernel some devices need to reset, this might increase the
> > > boot time, it's not so reliable for the 10-30s for us to kicking the
> > > watchdog.
> > >
> > > Could we have another option to disable/stop the watchdog while panic
> > > happens? Ie. add a kernel cmdline panic_stop_wd=<0|1> for 1st kernel, if
> > > it's set to 1, then just stop the watchdog or we can kick the watchdog
> > > like what you do in this patch. Of course stopping watchdog should be
> > > lockless as well..
> >
> > Hmm, I can look into that. But I am not sure all watchdogs have the
> > ability to stop once started. I was also worried about the case where
>
> Correct.
>
> > kdump hangs for some reason. Having the watchdog there to 'reboot' would
> > be a nice safety net.
> >
> Absolutely agree. After all, the reason for the kdump is most likely that
> something went really wrong, meaning there is some likelyhood for the hang
> to occur. Turning off the watchdog in this condition does not seem to be
> a good idea.
>
> > Perhaps adjusting the watchdog 'timeout' to something like 3 minutes would
> > be easier?
> >
> Not all watchdogs support such large timeouts, unfortunately. Maybe it would
> make sense to implement infrastructure support for a softdog on top of the
> hardware watchdog. Several drivers implement that outside the infrastructure
> already.

Hi Guenter,

I am not familar with a softdog. Can you give me an example of how it
works?

Cheers,
Don

2013-04-09 14:52:28

by Guenter Roeck

[permalink] [raw]
Subject: Re: [RFC PATCH] watchdog: Add hook for kicking in kdump path

On Tue, Apr 09, 2013 at 10:44:31AM -0400, Don Zickus wrote:
> On Mon, Apr 08, 2013 at 08:15:09AM -0700, Guenter Roeck wrote:
> > On Mon, Apr 08, 2013 at 08:48:58AM -0400, Don Zickus wrote:
> > > On Mon, Apr 08, 2013 at 01:46:58PM +0800, Dave Young wrote:
> > > > On 04/06/2013 04:16 AM, Don Zickus wrote:
> > > > > A common problem with kdump is that during the boot up of the
> > > > > second kernel, the hardware watchdog times out and reboots the
> > > > > machine before a vmcore can be captured.
> > > > >
> > > > > Instead of tellling customers to disable their hardware watchdog
> > > > > timers, I hacked up a hook to put in the kdump path that provides
> > > > > one last kick before jumping into the second kernel.
> > > > >
> > > > > The assumption is the watchdog timeout is at least 10-30 seconds
> > > > > long, enough to get the second kernel to userspace to kick the watchdog
> > > > > again, if needed.
> > > >
> > > > For kdump kernel some devices need to reset, this might increase the
> > > > boot time, it's not so reliable for the 10-30s for us to kicking the
> > > > watchdog.
> > > >
> > > > Could we have another option to disable/stop the watchdog while panic
> > > > happens? Ie. add a kernel cmdline panic_stop_wd=<0|1> for 1st kernel, if
> > > > it's set to 1, then just stop the watchdog or we can kick the watchdog
> > > > like what you do in this patch. Of course stopping watchdog should be
> > > > lockless as well..
> > >
> > > Hmm, I can look into that. But I am not sure all watchdogs have the
> > > ability to stop once started. I was also worried about the case where
> >
> > Correct.
> >
> > > kdump hangs for some reason. Having the watchdog there to 'reboot' would
> > > be a nice safety net.
> > >
> > Absolutely agree. After all, the reason for the kdump is most likely that
> > something went really wrong, meaning there is some likelyhood for the hang
> > to occur. Turning off the watchdog in this condition does not seem to be
> > a good idea.
> >
> > > Perhaps adjusting the watchdog 'timeout' to something like 3 minutes would
> > > be easier?
> > >
> > Not all watchdogs support such large timeouts, unfortunately. Maybe it would
> > make sense to implement infrastructure support for a softdog on top of the
> > hardware watchdog. Several drivers implement that outside the infrastructure
> > already.
>
> Hi Guenter,
>
> I am not familar with a softdog. Can you give me an example of how it
> works?
>
Just look for the use of mod_timer in the watchdog directory.

Guenter

2013-04-09 15:14:35

by Don Zickus

[permalink] [raw]
Subject: Re: [RFC PATCH] watchdog: Add hook for kicking in kdump path

On Tue, Apr 09, 2013 at 07:52:28AM -0700, Guenter Roeck wrote:
> On Tue, Apr 09, 2013 at 10:44:31AM -0400, Don Zickus wrote:
> > On Mon, Apr 08, 2013 at 08:15:09AM -0700, Guenter Roeck wrote:
> > > On Mon, Apr 08, 2013 at 08:48:58AM -0400, Don Zickus wrote:
> > > > On Mon, Apr 08, 2013 at 01:46:58PM +0800, Dave Young wrote:
> > > > > On 04/06/2013 04:16 AM, Don Zickus wrote:
> > > > > > A common problem with kdump is that during the boot up of the
> > > > > > second kernel, the hardware watchdog times out and reboots the
> > > > > > machine before a vmcore can be captured.
> > > > > >
> > > > > > Instead of tellling customers to disable their hardware watchdog
> > > > > > timers, I hacked up a hook to put in the kdump path that provides
> > > > > > one last kick before jumping into the second kernel.
> > > > > >
> > > > > > The assumption is the watchdog timeout is at least 10-30 seconds
> > > > > > long, enough to get the second kernel to userspace to kick the watchdog
> > > > > > again, if needed.
> > > > >
> > > > > For kdump kernel some devices need to reset, this might increase the
> > > > > boot time, it's not so reliable for the 10-30s for us to kicking the
> > > > > watchdog.
> > > > >
> > > > > Could we have another option to disable/stop the watchdog while panic
> > > > > happens? Ie. add a kernel cmdline panic_stop_wd=<0|1> for 1st kernel, if
> > > > > it's set to 1, then just stop the watchdog or we can kick the watchdog
> > > > > like what you do in this patch. Of course stopping watchdog should be
> > > > > lockless as well..
> > > >
> > > > Hmm, I can look into that. But I am not sure all watchdogs have the
> > > > ability to stop once started. I was also worried about the case where
> > >
> > > Correct.
> > >
> > > > kdump hangs for some reason. Having the watchdog there to 'reboot' would
> > > > be a nice safety net.
> > > >
> > > Absolutely agree. After all, the reason for the kdump is most likely that
> > > something went really wrong, meaning there is some likelyhood for the hang
> > > to occur. Turning off the watchdog in this condition does not seem to be
> > > a good idea.
> > >
> > > > Perhaps adjusting the watchdog 'timeout' to something like 3 minutes would
> > > > be easier?
> > > >
> > > Not all watchdogs support such large timeouts, unfortunately. Maybe it would
> > > make sense to implement infrastructure support for a softdog on top of the
> > > hardware watchdog. Several drivers implement that outside the infrastructure
> > > already.
> >
> > Hi Guenter,
> >
> > I am not familar with a softdog. Can you give me an example of how it
> > works?
> >
> Just look for the use of mod_timer in the watchdog directory.

So looking at the mod_timer logic in various drivers, it seems regardless
if the /dev/watchdog device is opened or not, if it is running, it will
automagically kick the watchdog.

This seems that we can avoid pulling in userspace pieces for this. Just
load the driver and the hardware starts getting kicked.

Is that true? And if so, do all drivers detect if the hardware is already
running during their init? Or is it based on the first device open?

Cheers,
Don

2013-04-09 16:07:54

by Guenter Roeck

[permalink] [raw]
Subject: Re: [RFC PATCH] watchdog: Add hook for kicking in kdump path

On Tue, Apr 09, 2013 at 11:14:23AM -0400, Don Zickus wrote:
> On Tue, Apr 09, 2013 at 07:52:28AM -0700, Guenter Roeck wrote:
> > On Tue, Apr 09, 2013 at 10:44:31AM -0400, Don Zickus wrote:
> > > On Mon, Apr 08, 2013 at 08:15:09AM -0700, Guenter Roeck wrote:
> > > > On Mon, Apr 08, 2013 at 08:48:58AM -0400, Don Zickus wrote:
> > > > > On Mon, Apr 08, 2013 at 01:46:58PM +0800, Dave Young wrote:
> > > > > > On 04/06/2013 04:16 AM, Don Zickus wrote:
> > > > > > > A common problem with kdump is that during the boot up of the
> > > > > > > second kernel, the hardware watchdog times out and reboots the
> > > > > > > machine before a vmcore can be captured.
> > > > > > >
> > > > > > > Instead of tellling customers to disable their hardware watchdog
> > > > > > > timers, I hacked up a hook to put in the kdump path that provides
> > > > > > > one last kick before jumping into the second kernel.
> > > > > > >
> > > > > > > The assumption is the watchdog timeout is at least 10-30 seconds
> > > > > > > long, enough to get the second kernel to userspace to kick the watchdog
> > > > > > > again, if needed.
> > > > > >
> > > > > > For kdump kernel some devices need to reset, this might increase the
> > > > > > boot time, it's not so reliable for the 10-30s for us to kicking the
> > > > > > watchdog.
> > > > > >
> > > > > > Could we have another option to disable/stop the watchdog while panic
> > > > > > happens? Ie. add a kernel cmdline panic_stop_wd=<0|1> for 1st kernel, if
> > > > > > it's set to 1, then just stop the watchdog or we can kick the watchdog
> > > > > > like what you do in this patch. Of course stopping watchdog should be
> > > > > > lockless as well..
> > > > >
> > > > > Hmm, I can look into that. But I am not sure all watchdogs have the
> > > > > ability to stop once started. I was also worried about the case where
> > > >
> > > > Correct.
> > > >
> > > > > kdump hangs for some reason. Having the watchdog there to 'reboot' would
> > > > > be a nice safety net.
> > > > >
> > > > Absolutely agree. After all, the reason for the kdump is most likely that
> > > > something went really wrong, meaning there is some likelyhood for the hang
> > > > to occur. Turning off the watchdog in this condition does not seem to be
> > > > a good idea.
> > > >
> > > > > Perhaps adjusting the watchdog 'timeout' to something like 3 minutes would
> > > > > be easier?
> > > > >
> > > > Not all watchdogs support such large timeouts, unfortunately. Maybe it would
> > > > make sense to implement infrastructure support for a softdog on top of the
> > > > hardware watchdog. Several drivers implement that outside the infrastructure
> > > > already.
> > >
> > > Hi Guenter,
> > >
> > > I am not familar with a softdog. Can you give me an example of how it
> > > works?
> > >
> > Just look for the use of mod_timer in the watchdog directory.
>
> So looking at the mod_timer logic in various drivers, it seems regardless
> if the /dev/watchdog device is opened or not, if it is running, it will
> automagically kick the watchdog.
>
yes

> This seems that we can avoid pulling in userspace pieces for this. Just
> load the driver and the hardware starts getting kicked.
>
Only if it is already running. Also, you don't want to rely on it, because you
lose protection against user space issues.

A second use is if the hw watchdog needs to be pinged more often than user
space can provide. Some of the HW watchdogs need a ping in one-second intervals
or even faster.

> Is that true? And if so, do all drivers detect if the hardware is already
> running during their init? Or is it based on the first device open?
>
It is usually done in the probe function.

Guenter

2013-04-10 13:40:50

by Don Zickus

[permalink] [raw]
Subject: Re: [RFC PATCH] watchdog: Add hook for kicking in kdump path

On Tue, Apr 09, 2013 at 09:07:58AM -0700, Guenter Roeck wrote:
> > > Just look for the use of mod_timer in the watchdog directory.
> >
> > So looking at the mod_timer logic in various drivers, it seems regardless
> > if the /dev/watchdog device is opened or not, if it is running, it will
> > automagically kick the watchdog.
> >
> yes
>
> > This seems that we can avoid pulling in userspace pieces for this. Just
> > load the driver and the hardware starts getting kicked.
> >
> Only if it is already running. Also, you don't want to rely on it, because you
> lose protection against user space issues.

IOW if something goes wrong with a runaway userspace app, the kernel
blindly continues to kick the watchdog, which masks the problem, right?

>
> A second use is if the hw watchdog needs to be pinged more often than user
> space can provide. Some of the HW watchdogs need a ping in one-second intervals
> or even faster.
>
> > Is that true? And if so, do all drivers detect if the hardware is already
> > running during their init? Or is it based on the first device open?
> >
> It is usually done in the probe function.

Ok. Thanks for the understanding of how the softdog stuff works.

However, we still have the problem that if the machine panics and we want
to jump into the kdump kernel, we need to 'kick' the watchdog one more
time. This provides us a sane sync point for determining how long we have
to load the watchdog driver in the second kernel before the hardware
reboots us. Otherwise the reboots are pretty random and nothing is
guaranteed.

Hence the need for some sort of patch resembling the one I posted.

Soooooooo, any thoughts about that patch and what changes I should make?
:-)

Cheers,
Don

2013-04-10 13:51:17

by Guenter Roeck

[permalink] [raw]
Subject: Re: [RFC PATCH] watchdog: Add hook for kicking in kdump path

On Wed, Apr 10, 2013 at 09:40:39AM -0400, Don Zickus wrote:
> On Tue, Apr 09, 2013 at 09:07:58AM -0700, Guenter Roeck wrote:
> > > > Just look for the use of mod_timer in the watchdog directory.
> > >
> > > So looking at the mod_timer logic in various drivers, it seems regardless
> > > if the /dev/watchdog device is opened or not, if it is running, it will
> > > automagically kick the watchdog.
> > >
> > yes
> >
> > > This seems that we can avoid pulling in userspace pieces for this. Just
> > > load the driver and the hardware starts getting kicked.
> > >
> > Only if it is already running. Also, you don't want to rely on it, because you
> > lose protection against user space issues.
>
> IOW if something goes wrong with a runaway userspace app, the kernel
> blindly continues to kick the watchdog, which masks the problem, right?
>
That would be wrong if any of the drivers does that. The kernel should stop
kicking after the software timeout expires.

For example, if the HW needs to be kicked every second, and the high level
timeout is set to one minute, the driver should keep kicking the hardware
watchdog for one minute and then stop doing it if /dev/watchdog was opened
and userspace is silent.

> >
> > A second use is if the hw watchdog needs to be pinged more often than user
> > space can provide. Some of the HW watchdogs need a ping in one-second intervals
> > or even faster.
> >
> > > Is that true? And if so, do all drivers detect if the hardware is already
> > > running during their init? Or is it based on the first device open?
> > >
> > It is usually done in the probe function.
>
> Ok. Thanks for the understanding of how the softdog stuff works.
>
> However, we still have the problem that if the machine panics and we want
> to jump into the kdump kernel, we need to 'kick' the watchdog one more
> time. This provides us a sane sync point for determining how long we have
> to load the watchdog driver in the second kernel before the hardware
> reboots us. Otherwise the reboots are pretty random and nothing is
> guaranteed.
>
> Hence the need for some sort of patch resembling the one I posted.
>
> Soooooooo, any thoughts about that patch and what changes I should make?
> :-)
>
The FIXME is a problem, and I think the name and scope would have to be
more generic (watchdog_kick ?). Also, it doesn't solve the problem
of having multiple open watchdogs (my system has three, for example),
and it doesn't check if the watchdog is running.

Guenter

2013-04-10 14:21:07

by Don Zickus

[permalink] [raw]
Subject: Re: [RFC PATCH] watchdog: Add hook for kicking in kdump path

On Wed, Apr 10, 2013 at 06:51:23AM -0700, Guenter Roeck wrote:
> On Wed, Apr 10, 2013 at 09:40:39AM -0400, Don Zickus wrote:
> > On Tue, Apr 09, 2013 at 09:07:58AM -0700, Guenter Roeck wrote:
> > > > > Just look for the use of mod_timer in the watchdog directory.
> > > >
> > > > So looking at the mod_timer logic in various drivers, it seems regardless
> > > > if the /dev/watchdog device is opened or not, if it is running, it will
> > > > automagically kick the watchdog.
> > > >
> > > yes
> > >
> > > > This seems that we can avoid pulling in userspace pieces for this. Just
> > > > load the driver and the hardware starts getting kicked.
> > > >
> > > Only if it is already running. Also, you don't want to rely on it, because you
> > > lose protection against user space issues.
> >
> > IOW if something goes wrong with a runaway userspace app, the kernel
> > blindly continues to kick the watchdog, which masks the problem, right?
> >
> That would be wrong if any of the drivers does that. The kernel should stop
> kicking after the software timeout expires.
>
> For example, if the HW needs to be kicked every second, and the high level
> timeout is set to one minute, the driver should keep kicking the hardware
> watchdog for one minute and then stop doing it if /dev/watchdog was opened
> and userspace is silent.

Ah ok.

>
> > >
> > > A second use is if the hw watchdog needs to be pinged more often than user
> > > space can provide. Some of the HW watchdogs need a ping in one-second intervals
> > > or even faster.
> > >
> > > > Is that true? And if so, do all drivers detect if the hardware is already
> > > > running during their init? Or is it based on the first device open?
> > > >
> > > It is usually done in the probe function.
> >
> > Ok. Thanks for the understanding of how the softdog stuff works.
> >
> > However, we still have the problem that if the machine panics and we want
> > to jump into the kdump kernel, we need to 'kick' the watchdog one more
> > time. This provides us a sane sync point for determining how long we have
> > to load the watchdog driver in the second kernel before the hardware
> > reboots us. Otherwise the reboots are pretty random and nothing is
> > guaranteed.
> >
> > Hence the need for some sort of patch resembling the one I posted.
> >
> > Soooooooo, any thoughts about that patch and what changes I should make?
> > :-)
> >
> The FIXME is a problem, and I think the name and scope would have to be
> more generic (watchdog_kick ?). Also, it doesn't solve the problem
> of having multiple open watchdogs (my system has three, for example),
> and it doesn't check if the watchdog is running.

Ok. I didn't know the watchdog subsystem well enough, so I just took
stabs in the dark about how things should work. I appreciate the
feedback.

I could make the name more generic. I wasn't sure if the watchdog
community would frown on that. The FIXME is a problem, I am not sure how
to handle the 'fail' scenario (can't get the mutex with trylock). And I
have no idea how to even find out if multiple watchdogs are open on the
system. Is there a list I could walk? And with regard to 'watchdog is
running', I thought 'watchdog_active' would do that. But again, I could
be misreading the code.

Thanks for the feedback.

Cheers,
Don

>
> Guenter

2013-04-10 15:10:39

by Guenter Roeck

[permalink] [raw]
Subject: Re: [RFC PATCH] watchdog: Add hook for kicking in kdump path

On Wed, Apr 10, 2013 at 10:20:55AM -0400, Don Zickus wrote:
> On Wed, Apr 10, 2013 at 06:51:23AM -0700, Guenter Roeck wrote:
> > On Wed, Apr 10, 2013 at 09:40:39AM -0400, Don Zickus wrote:
> > > On Tue, Apr 09, 2013 at 09:07:58AM -0700, Guenter Roeck wrote:
> > > > > > Just look for the use of mod_timer in the watchdog directory.
> > > > >
> > > > > So looking at the mod_timer logic in various drivers, it seems regardless
> > > > > if the /dev/watchdog device is opened or not, if it is running, it will
> > > > > automagically kick the watchdog.
> > > > >
> > > > yes
> > > >
> > > > > This seems that we can avoid pulling in userspace pieces for this. Just
> > > > > load the driver and the hardware starts getting kicked.
> > > > >
> > > > Only if it is already running. Also, you don't want to rely on it, because you
> > > > lose protection against user space issues.
> > >
> > > IOW if something goes wrong with a runaway userspace app, the kernel
> > > blindly continues to kick the watchdog, which masks the problem, right?
> > >
> > That would be wrong if any of the drivers does that. The kernel should stop
> > kicking after the software timeout expires.
> >
> > For example, if the HW needs to be kicked every second, and the high level
> > timeout is set to one minute, the driver should keep kicking the hardware
> > watchdog for one minute and then stop doing it if /dev/watchdog was opened
> > and userspace is silent.
>
> Ah ok.
>
> >
> > > >
> > > > A second use is if the hw watchdog needs to be pinged more often than user
> > > > space can provide. Some of the HW watchdogs need a ping in one-second intervals
> > > > or even faster.
> > > >
> > > > > Is that true? And if so, do all drivers detect if the hardware is already
> > > > > running during their init? Or is it based on the first device open?
> > > > >
> > > > It is usually done in the probe function.
> > >
> > > Ok. Thanks for the understanding of how the softdog stuff works.
> > >
> > > However, we still have the problem that if the machine panics and we want
> > > to jump into the kdump kernel, we need to 'kick' the watchdog one more
> > > time. This provides us a sane sync point for determining how long we have
> > > to load the watchdog driver in the second kernel before the hardware
> > > reboots us. Otherwise the reboots are pretty random and nothing is
> > > guaranteed.
> > >
> > > Hence the need for some sort of patch resembling the one I posted.
> > >
> > > Soooooooo, any thoughts about that patch and what changes I should make?
> > > :-)
> > >
> > The FIXME is a problem, and I think the name and scope would have to be
> > more generic (watchdog_kick ?). Also, it doesn't solve the problem
> > of having multiple open watchdogs (my system has three, for example),
> > and it doesn't check if the watchdog is running.
>
> Ok. I didn't know the watchdog subsystem well enough, so I just took
> stabs in the dark about how things should work. I appreciate the
> feedback.
>
> I could make the name more generic. I wasn't sure if the watchdog
> community would frown on that. The FIXME is a problem, I am not sure how

I don't know what others think, but I would prefer a more generic name.
That its current use is for kdump is besides the point; there could be
other valid uses.

> to handle the 'fail' scenario (can't get the mutex with trylock). And I

I would just return without doing anything. After all, if the mutex is held,
it is possible if not likely that the code crashed _in_ the watchdog code,
so it might be better to just let it crash and burn in that case.

> have no idea how to even find out if multiple watchdogs are open on the
> system. Is there a list I could walk? And with regard to 'watchdog is

/* the dev_t structure to store the dynamically allocated watchdog devices */
static dev_t watchdog_devt;

One way to look up the allocated watchdogs might be to loop through all kobj
instances for the major device using kobj_lookup. Don't know if there is a
better way.

> running', I thought 'watchdog_active' would do that. But again, I could
> be misreading the code.
>
You are right. Missed that part, sorry.

Guenter

2013-04-10 16:17:56

by Don Zickus

[permalink] [raw]
Subject: Re: [RFC PATCH] watchdog: Add hook for kicking in kdump path

On Wed, Apr 10, 2013 at 08:10:41AM -0700, Guenter Roeck wrote:
> On Wed, Apr 10, 2013 at 10:20:55AM -0400, Don Zickus wrote:
> > On Wed, Apr 10, 2013 at 06:51:23AM -0700, Guenter Roeck wrote:
> > > On Wed, Apr 10, 2013 at 09:40:39AM -0400, Don Zickus wrote:
> > > > On Tue, Apr 09, 2013 at 09:07:58AM -0700, Guenter Roeck wrote:
> > > > > > > Just look for the use of mod_timer in the watchdog directory.
> > > > > >
> > > > > > So looking at the mod_timer logic in various drivers, it seems regardless
> > > > > > if the /dev/watchdog device is opened or not, if it is running, it will
> > > > > > automagically kick the watchdog.
> > > > > >
> > > > > yes
> > > > >
> > > > > > This seems that we can avoid pulling in userspace pieces for this. Just
> > > > > > load the driver and the hardware starts getting kicked.
> > > > > >
> > > > > Only if it is already running. Also, you don't want to rely on it, because you
> > > > > lose protection against user space issues.
> > > >
> > > > IOW if something goes wrong with a runaway userspace app, the kernel
> > > > blindly continues to kick the watchdog, which masks the problem, right?
> > > >
> > > That would be wrong if any of the drivers does that. The kernel should stop
> > > kicking after the software timeout expires.
> > >
> > > For example, if the HW needs to be kicked every second, and the high level
> > > timeout is set to one minute, the driver should keep kicking the hardware
> > > watchdog for one minute and then stop doing it if /dev/watchdog was opened
> > > and userspace is silent.
> >
> > Ah ok.
> >
> > >
> > > > >
> > > > > A second use is if the hw watchdog needs to be pinged more often than user
> > > > > space can provide. Some of the HW watchdogs need a ping in one-second intervals
> > > > > or even faster.
> > > > >
> > > > > > Is that true? And if so, do all drivers detect if the hardware is already
> > > > > > running during their init? Or is it based on the first device open?
> > > > > >
> > > > > It is usually done in the probe function.
> > > >
> > > > Ok. Thanks for the understanding of how the softdog stuff works.
> > > >
> > > > However, we still have the problem that if the machine panics and we want
> > > > to jump into the kdump kernel, we need to 'kick' the watchdog one more
> > > > time. This provides us a sane sync point for determining how long we have
> > > > to load the watchdog driver in the second kernel before the hardware
> > > > reboots us. Otherwise the reboots are pretty random and nothing is
> > > > guaranteed.
> > > >
> > > > Hence the need for some sort of patch resembling the one I posted.
> > > >
> > > > Soooooooo, any thoughts about that patch and what changes I should make?
> > > > :-)
> > > >
> > > The FIXME is a problem, and I think the name and scope would have to be
> > > more generic (watchdog_kick ?). Also, it doesn't solve the problem
> > > of having multiple open watchdogs (my system has three, for example),
> > > and it doesn't check if the watchdog is running.
> >
> > Ok. I didn't know the watchdog subsystem well enough, so I just took
> > stabs in the dark about how things should work. I appreciate the
> > feedback.
> >
> > I could make the name more generic. I wasn't sure if the watchdog
> > community would frown on that. The FIXME is a problem, I am not sure how
>
> I don't know what others think, but I would prefer a more generic name.
> That its current use is for kdump is besides the point; there could be
> other valid uses.

Ok.

>
> > to handle the 'fail' scenario (can't get the mutex with trylock). And I
>
> I would just return without doing anything. After all, if the mutex is held,
> it is possible if not likely that the code crashed _in_ the watchdog code,
> so it might be better to just let it crash and burn in that case.

Hmm. Interesting point. So you are implying most watchdog mutex locks surround
quickly executed code that are not called often. And if we do catch the
mutex being held it is either because the watchdog code has failed (bad)
or just bad timing :-).

For now I can just do the trylock and return on failure, printing a big
fat warning for post-debug purposes. If it becomes a bigger problem, we
can refine the approach later I guess.

>
> > have no idea how to even find out if multiple watchdogs are open on the
> > system. Is there a list I could walk? And with regard to 'watchdog is
>
> /* the dev_t structure to store the dynamically allocated watchdog devices */
> static dev_t watchdog_devt;
>
> One way to look up the allocated watchdogs might be to loop through all kobj
> instances for the major device using kobj_lookup. Don't know if there is a
> better way.

I can poke at that to see if I can get something to work. Thanks.

>
> > running', I thought 'watchdog_active' would do that. But again, I could
> > be misreading the code.
> >
> You are right. Missed that part, sorry.

No problem, just wanted to make sure I was understanding things correctly.

Again thanks for the feedback. I'll spin a V2.

Cheers,
Don

2013-04-10 16:30:49

by Guenter Roeck

[permalink] [raw]
Subject: Re: [RFC PATCH] watchdog: Add hook for kicking in kdump path

On Wed, Apr 10, 2013 at 12:17:39PM -0400, Don Zickus wrote:
> On Wed, Apr 10, 2013 at 08:10:41AM -0700, Guenter Roeck wrote:
> > On Wed, Apr 10, 2013 at 10:20:55AM -0400, Don Zickus wrote:
> > > On Wed, Apr 10, 2013 at 06:51:23AM -0700, Guenter Roeck wrote:
> > > > On Wed, Apr 10, 2013 at 09:40:39AM -0400, Don Zickus wrote:
> > > > > On Tue, Apr 09, 2013 at 09:07:58AM -0700, Guenter Roeck wrote:
> > > > > > > > Just look for the use of mod_timer in the watchdog directory.
> > > > > > >
> > > > > > > So looking at the mod_timer logic in various drivers, it seems regardless
> > > > > > > if the /dev/watchdog device is opened or not, if it is running, it will
> > > > > > > automagically kick the watchdog.
> > > > > > >
> > > > > > yes
> > > > > >
> > > > > > > This seems that we can avoid pulling in userspace pieces for this. Just
> > > > > > > load the driver and the hardware starts getting kicked.
> > > > > > >
> > > > > > Only if it is already running. Also, you don't want to rely on it, because you
> > > > > > lose protection against user space issues.
> > > > >
> > > > > IOW if something goes wrong with a runaway userspace app, the kernel
> > > > > blindly continues to kick the watchdog, which masks the problem, right?
> > > > >
> > > > That would be wrong if any of the drivers does that. The kernel should stop
> > > > kicking after the software timeout expires.
> > > >
> > > > For example, if the HW needs to be kicked every second, and the high level
> > > > timeout is set to one minute, the driver should keep kicking the hardware
> > > > watchdog for one minute and then stop doing it if /dev/watchdog was opened
> > > > and userspace is silent.
> > >
> > > Ah ok.
> > >
> > > >
> > > > > >
> > > > > > A second use is if the hw watchdog needs to be pinged more often than user
> > > > > > space can provide. Some of the HW watchdogs need a ping in one-second intervals
> > > > > > or even faster.
> > > > > >
> > > > > > > Is that true? And if so, do all drivers detect if the hardware is already
> > > > > > > running during their init? Or is it based on the first device open?
> > > > > > >
> > > > > > It is usually done in the probe function.
> > > > >
> > > > > Ok. Thanks for the understanding of how the softdog stuff works.
> > > > >
> > > > > However, we still have the problem that if the machine panics and we want
> > > > > to jump into the kdump kernel, we need to 'kick' the watchdog one more
> > > > > time. This provides us a sane sync point for determining how long we have
> > > > > to load the watchdog driver in the second kernel before the hardware
> > > > > reboots us. Otherwise the reboots are pretty random and nothing is
> > > > > guaranteed.
> > > > >
> > > > > Hence the need for some sort of patch resembling the one I posted.
> > > > >
> > > > > Soooooooo, any thoughts about that patch and what changes I should make?
> > > > > :-)
> > > > >
> > > > The FIXME is a problem, and I think the name and scope would have to be
> > > > more generic (watchdog_kick ?). Also, it doesn't solve the problem
> > > > of having multiple open watchdogs (my system has three, for example),
> > > > and it doesn't check if the watchdog is running.
> > >
> > > Ok. I didn't know the watchdog subsystem well enough, so I just took
> > > stabs in the dark about how things should work. I appreciate the
> > > feedback.
> > >
> > > I could make the name more generic. I wasn't sure if the watchdog
> > > community would frown on that. The FIXME is a problem, I am not sure how
> >
> > I don't know what others think, but I would prefer a more generic name.
> > That its current use is for kdump is besides the point; there could be
> > other valid uses.
>
> Ok.
>
> >
> > > to handle the 'fail' scenario (can't get the mutex with trylock). And I
> >
> > I would just return without doing anything. After all, if the mutex is held,
> > it is possible if not likely that the code crashed _in_ the watchdog code,
> > so it might be better to just let it crash and burn in that case.
>
> Hmm. Interesting point. So you are implying most watchdog mutex locks surround
> quickly executed code that are not called often. And if we do catch the
> mutex being held it is either because the watchdog code has failed (bad)
> or just bad timing :-).
>
Hopefully all locks are for short periods of time ;).

Since the lock is per device, it would either mean that the code has crashed
there, or that some other entity is busy dealing with the device, presumably
trying to ping it. Either case, I think the best (safest) way to deal with
the situation would be to ignore it.

> For now I can just do the trylock and return on failure, printing a big
> fat warning for post-debug purposes. If it becomes a bigger problem, we
> can refine the approach later I guess.
>
Ok.

> >
> > > have no idea how to even find out if multiple watchdogs are open on the
> > > system. Is there a list I could walk? And with regard to 'watchdog is
> >
> > /* the dev_t structure to store the dynamically allocated watchdog devices */
> > static dev_t watchdog_devt;
> >
> > One way to look up the allocated watchdogs might be to loop through all kobj
> > instances for the major device using kobj_lookup. Don't know if there is a
> > better way.
>
> I can poke at that to see if I can get something to work. Thanks.
>
> >
> > > running', I thought 'watchdog_active' would do that. But again, I could
> > > be misreading the code.
> > >
> > You are right. Missed that part, sorry.
>
> No problem, just wanted to make sure I was understanding things correctly.
>
> Again thanks for the feedback. I'll spin a V2.
>
Make sure you add the kexec maintainer to Cc and get an Ack.

Guenter

2013-04-10 16:49:27

by David Teigland

[permalink] [raw]
Subject: Re: [RFC PATCH] watchdog: Add hook for kicking in kdump path

On Wed, Apr 10, 2013 at 09:40:39AM -0400, Don Zickus wrote:
> However, we still have the problem that if the machine panics and we want
> to jump into the kdump kernel, we need to 'kick' the watchdog one more
> time. This provides us a sane sync point for determining how long we have
> to load the watchdog driver in the second kernel before the hardware
> reboots us. Otherwise the reboots are pretty random and nothing is
> guaranteed.

Some time ago I submitted this patch
http://www.spinics.net/lists/linux-watchdog/msg01477.html

to get rid of the one "extraneous" ping that was causing me trouble.
I'd still like to see merged, but haven't had time to follow up.

I have a use case where I need to guarantee that the watchdog
will *not* be pinged unless my userland daemon does the ping.
If my daemon is killed, the close() generates a ping that I
don't intend. This kdump ping looks like it would be another
instance that I'd need to suppress. Perhaps by renaming my flag
WDOG_NO_EXTRA_PING and checking it both in release and in
kick_for_kdump?

(My daemon associates watchdog pings with shared storage heartbeats.
Based on the heartbeats, hosts in a cluster can calculate when an
unresponsive host last pinged its watchdog, and can be fairly
certain that the "dead" host has been reset by its watchdog 60
seconds later. This is used as an alternative to i/o fencing
where we're protecting data on shared storage from corruption
after host failures. If there are uncontrolled watchdog pings,
then hosts don't know when a dead host might have last pinged
its watchdog, since it is no longer based on the last timestamp
it wrote to shared storage.)

Dave

2013-04-10 17:17:17

by Guenter Roeck

[permalink] [raw]
Subject: Re: [RFC PATCH] watchdog: Add hook for kicking in kdump path

On Wed, Apr 10, 2013 at 12:49:14PM -0400, David Teigland wrote:
> On Wed, Apr 10, 2013 at 09:40:39AM -0400, Don Zickus wrote:
> > However, we still have the problem that if the machine panics and we want
> > to jump into the kdump kernel, we need to 'kick' the watchdog one more
> > time. This provides us a sane sync point for determining how long we have
> > to load the watchdog driver in the second kernel before the hardware
> > reboots us. Otherwise the reboots are pretty random and nothing is
> > guaranteed.
>
> Some time ago I submitted this patch
> http://www.spinics.net/lists/linux-watchdog/msg01477.html
>
> to get rid of the one "extraneous" ping that was causing me trouble.
> I'd still like to see merged, but haven't had time to follow up.
>
The use case makes sense to me, so it gets my Ack. Did Wim ever comment on it ?

Thanks,
Guenter

> I have a use case where I need to guarantee that the watchdog
> will *not* be pinged unless my userland daemon does the ping.
> If my daemon is killed, the close() generates a ping that I
> don't intend. This kdump ping looks like it would be another
> instance that I'd need to suppress. Perhaps by renaming my flag
> WDOG_NO_EXTRA_PING and checking it both in release and in
> kick_for_kdump?
>
> (My daemon associates watchdog pings with shared storage heartbeats.
> Based on the heartbeats, hosts in a cluster can calculate when an
> unresponsive host last pinged its watchdog, and can be fairly
> certain that the "dead" host has been reset by its watchdog 60
> seconds later. This is used as an alternative to i/o fencing
> where we're protecting data on shared storage from corruption
> after host failures. If there are uncontrolled watchdog pings,
> then hosts don't know when a dead host might have last pinged
> its watchdog, since it is no longer based on the last timestamp
> it wrote to shared storage.)
>
> Dave
>

2013-04-12 21:16:40

by Don Zickus

[permalink] [raw]
Subject: Re: [RFC PATCH] watchdog: Add hook for kicking in kdump path

On Wed, Apr 10, 2013 at 08:10:41AM -0700, Guenter Roeck wrote:
> > have no idea how to even find out if multiple watchdogs are open on the
> > system. Is there a list I could walk? And with regard to 'watchdog is
>
> /* the dev_t structure to store the dynamically allocated watchdog devices */
> static dev_t watchdog_devt;
>
> One way to look up the allocated watchdogs might be to loop through all kobj
> instances for the major device using kobj_lookup. Don't know if there is a
> better way.

Hmm, I got around to poking at this today and I am not sure kobj_lookup
will work. Besides being surrounded with another mutex, I don't have
access to the character device domain to pass to kobj_lookup.

Perhaps I am not reading the code right, but I can't find a good way
forward.

The only other hack I can think of, is to embed a list object in the
watchdog structure and list_add each new register'd watchdog. Then it
would be trivial to walk the watchdog list.

Thoughts?

Cheers,
Don

>
> > running', I thought 'watchdog_active' would do that. But again, I could
> > be misreading the code.
> >
> You are right. Missed that part, sorry.
>
> Guenter

2013-04-12 21:30:17

by Guenter Roeck

[permalink] [raw]
Subject: Re: [RFC PATCH] watchdog: Add hook for kicking in kdump path

On Fri, Apr 12, 2013 at 05:16:27PM -0400, Don Zickus wrote:
> On Wed, Apr 10, 2013 at 08:10:41AM -0700, Guenter Roeck wrote:
> > > have no idea how to even find out if multiple watchdogs are open on the
> > > system. Is there a list I could walk? And with regard to 'watchdog is
> >
> > /* the dev_t structure to store the dynamically allocated watchdog devices */
> > static dev_t watchdog_devt;
> >
> > One way to look up the allocated watchdogs might be to loop through all kobj
> > instances for the major device using kobj_lookup. Don't know if there is a
> > better way.
>
> Hmm, I got around to poking at this today and I am not sure kobj_lookup
> will work. Besides being surrounded with another mutex, I don't have
> access to the character device domain to pass to kobj_lookup.
>
> Perhaps I am not reading the code right, but I can't find a good way
> forward.
>
> The only other hack I can think of, is to embed a list object in the
> watchdog structure and list_add each new register'd watchdog. Then it
> would be trivial to walk the watchdog list.
>
After looking into it again, I agree. Maybe you can give it a try. At least
other options look even more complicated (eg creating a watchdog class ?).

Guenter

2013-04-15 20:55:19

by Don Zickus

[permalink] [raw]
Subject: Re: [RFC PATCH] watchdog: Add hook for kicking in kdump path

On Fri, Apr 12, 2013 at 02:30:24PM -0700, Guenter Roeck wrote:
> On Fri, Apr 12, 2013 at 05:16:27PM -0400, Don Zickus wrote:
> > On Wed, Apr 10, 2013 at 08:10:41AM -0700, Guenter Roeck wrote:
> > > > have no idea how to even find out if multiple watchdogs are open on the
> > > > system. Is there a list I could walk? And with regard to 'watchdog is
> > >
> > > /* the dev_t structure to store the dynamically allocated watchdog devices */
> > > static dev_t watchdog_devt;
> > >
> > > One way to look up the allocated watchdogs might be to loop through all kobj
> > > instances for the major device using kobj_lookup. Don't know if there is a
> > > better way.
> >
> > Hmm, I got around to poking at this today and I am not sure kobj_lookup
> > will work. Besides being surrounded with another mutex, I don't have
> > access to the character device domain to pass to kobj_lookup.
> >
> > Perhaps I am not reading the code right, but I can't find a good way
> > forward.
> >
> > The only other hack I can think of, is to embed a list object in the
> > watchdog structure and list_add each new register'd watchdog. Then it
> > would be trivial to walk the watchdog list.
> >
> After looking into it again, I agree. Maybe you can give it a try. At least
> other options look even more complicated (eg creating a watchdog class ?).

I looked at the watchdog class in watchdog_core.c. Even implemented
class_for_each_device. But got stuck trying to figure out how to go from
a struct device *dev to a struct watchdog_device. Then I realized in the
bowels of class_for_each_device were spinlocks. :-(

So I implemented an RCU list. It isn't the prettiest solution but it
seems to work.

I posted a V2 and realized I forgot to cc you on it. I apologize for
that. :-( I hope you can find it, otherwise I can post a pointer to it.

Cheers,
Don

2013-04-15 22:50:15

by Guenter Roeck

[permalink] [raw]
Subject: Re: [RFC PATCH] watchdog: Add hook for kicking in kdump path

On Mon, Apr 15, 2013 at 04:55:04PM -0400, Don Zickus wrote:
> On Fri, Apr 12, 2013 at 02:30:24PM -0700, Guenter Roeck wrote:
> > On Fri, Apr 12, 2013 at 05:16:27PM -0400, Don Zickus wrote:
> > > On Wed, Apr 10, 2013 at 08:10:41AM -0700, Guenter Roeck wrote:
> > > > > have no idea how to even find out if multiple watchdogs are open on the
> > > > > system. Is there a list I could walk? And with regard to 'watchdog is
> > > >
> > > > /* the dev_t structure to store the dynamically allocated watchdog devices */
> > > > static dev_t watchdog_devt;
> > > >
> > > > One way to look up the allocated watchdogs might be to loop through all kobj
> > > > instances for the major device using kobj_lookup. Don't know if there is a
> > > > better way.
> > >
> > > Hmm, I got around to poking at this today and I am not sure kobj_lookup
> > > will work. Besides being surrounded with another mutex, I don't have
> > > access to the character device domain to pass to kobj_lookup.
> > >
> > > Perhaps I am not reading the code right, but I can't find a good way
> > > forward.
> > >
> > > The only other hack I can think of, is to embed a list object in the
> > > watchdog structure and list_add each new register'd watchdog. Then it
> > > would be trivial to walk the watchdog list.
> > >
> > After looking into it again, I agree. Maybe you can give it a try. At least
> > other options look even more complicated (eg creating a watchdog class ?).
>
> I looked at the watchdog class in watchdog_core.c. Even implemented
> class_for_each_device. But got stuck trying to figure out how to go from
> a struct device *dev to a struct watchdog_device. Then I realized in the
> bowels of class_for_each_device were spinlocks. :-(
>
> So I implemented an RCU list. It isn't the prettiest solution but it
> seems to work.
>
> I posted a V2 and realized I forgot to cc you on it. I apologize for
> that. :-( I hope you can find it, otherwise I can post a pointer to it.
>
Where did you submit it ? I am subscribed to the watchdog mailing list, but I
didn't see it.

Thanks,
Guenter