On Tue, Mar 31, 2015 at 02:30:44PM -0400, Chris Metcalf wrote:
> On 03/31/2015 03:25 AM, Ingo Molnar wrote:
> >* [email protected] <[email protected]> wrote:
> >
> >>From: Chris Metcalf <[email protected]>
> >>
> >>Running the watchdog can be a helpful debugging feature on regular
> >>cores, but it's incompatible with nohz_full, since it forces
> >>regular scheduling events. Accordingly, just exit immediately
> >>from any nohz_full core.
> >>
> >>An alternate approach would be to add a flags field or function to
> >>smp_hotplug_thread to control on which cores the percpu threads
> >>are created, but it wasn't clear that much mechanism was useful.
> >>
> >>[...]
> >So what happens if someone wants to enable the lockup detector, with a
> >long timeout, even on nohz-full CPUs? This patch makes that
> >impossible.
> >
> >A better solution would be to tweak the defaults:
> >
> > - to default the watchdog(s) to disabled when nohz-full is
> > enabled, even if HARDLOCKUP_DETECTOR=y or DETECT_HUNG_TASK=y, and
> > allow it to be re-enabled via its sysctl.
>
> That's certainly a reasonable thing to do; it looks like just an #ifdef
> at the top of watchdog.c would suffice. Does this look right?
>
> diff --git a/kernel/watchdog.c b/kernel/watchdog.c
> index 8a46d9d8a66f..c8555c211e65 100644
> --- a/kernel/watchdog.c
> +++ b/kernel/watchdog.c
> @@ -25,7 +25,11 @@
> #include <linux/kvm_para.h>
> #include <linux/perf_event.h>
> +#ifdef CONFIG_NO_HZ_FULL
> +int watchdog_user_enabled = 0;
> +#else
> int watchdog_user_enabled = 1;
> +#endif
> int __read_mostly watchdog_thresh = 10;
> #ifdef CONFIG_SMP
> int __read_mostly sysctl_softlockup_all_cpu_backtrace;
>
> It doesn't look like I need to do anything else special to disable
> HARDLOCKUP_DETECTOR, and khungtaskd can happily run on
> a non-nohz core, so that should be OK.
>
> What I was trying to achieve with my proposed patch was kind
> of orthogonal: to allow the watchdog to run on standard cores,
> but not run on nohz cores, so we could benefit from it on the
> cores where it was safe for it to run. Do you see value in this,
> or better to just enable/disable all watchdog threads collectively?
Hmm, I am not sure I am a big fan of this approach. I know RHEL keeps the
watchdogs enabled for customers, and it would be a regression if we disabled
them. At the same time, I could see RHEL leaning towards enabling
CONFIG_NO_HZ_FULL, which would just postpone this problem for a number of
years until RHEL-8 gets around to ramping up.
So I guess I would prefer to figure out a better co-existing solution now.
Can I ask how the NO_HZ_FULL technology works from userspace? Is there a
system command that has to be sent? How does the kernel know to turn off
ticks and trust userspace to do the right thing?
Cheers,
Don
>
> --
> Chris Metcalf, EZChip Semiconductor
> http://www.ezchip.com
>
On 4/2/2015 9:35 AM, Don Zickus wrote:
> On Tue, Mar 31, 2015 at 02:30:44PM -0400, Chris Metcalf wrote:
>> On 03/31/2015 03:25 AM, Ingo Molnar wrote:
>>> * [email protected] <[email protected]> wrote:
>>>
>>>> From: Chris Metcalf <[email protected]>
>>>>
>>>> Running the watchdog can be a helpful debugging feature on regular
>>>> cores, but it's incompatible with nohz_full, since it forces
>>>> regular scheduling events. Accordingly, just exit immediately
>>>> from any nohz_full core.
>>>>
>>>> An alternate approach would be to add a flags field or function to
>>>> smp_hotplug_thread to control on which cores the percpu threads
>>>> are created, but it wasn't clear that much mechanism was useful.
>>>>
>>>> [...]
>>> So what happens if someone wants to enable the lockup detector, with a
>>> long timeout, even on nohz-full CPUs? This patch makes that
>>> impossible.
>>>
>>> A better solution would be to tweak the defaults:
>>>
>>> - to default the watchdog(s) to disabled when nohz-full is
>>> enabled, even if HARDLOCKUP_DETECTOR=y or DETECT_HUNG_TASK=y, and
>>> allow it to be re-enabled via its sysctl.
>> That's certainly a reasonable thing to do; it looks like just an #ifdef
>> at the top of watchdog.c would suffice. Does this look right?
>>
>> diff --git a/kernel/watchdog.c b/kernel/watchdog.c
>> index 8a46d9d8a66f..c8555c211e65 100644
>> --- a/kernel/watchdog.c
>> +++ b/kernel/watchdog.c
>> @@ -25,7 +25,11 @@
>> #include <linux/kvm_para.h>
>> #include <linux/perf_event.h>
>> +#ifdef CONFIG_NO_HZ_FULL
>> +int watchdog_user_enabled = 0;
>> +#else
>> int watchdog_user_enabled = 1;
>> +#endif
>> int __read_mostly watchdog_thresh = 10;
>> #ifdef CONFIG_SMP
>> int __read_mostly sysctl_softlockup_all_cpu_backtrace;
>>
>> It doesn't look like I need to do anything else special to disable
>> HARDLOCKUP_DETECTOR, and khungtaskd can happily run on
>> a non-nohz core, so that should be OK.
>>
>> What I was trying to achieve with my proposed patch was kind
>> of orthogonal: to allow the watchdog to run on standard cores,
>> but not run on nohz cores, so we could benefit from it on the
>> cores where it was safe for it to run. Do you see value in this,
>> or better to just enable/disable all watchdog threads collectively?
>
> Hmm, I am not sure I am a big fan of this approach. I know RHEL keeps the
> watchdogs enabled for customers, and it would be a regression if we disabled
> them. At the same time, I could see RHEL leaning towards enabling
> CONFIG_NO_HZ_FULL, which would just postpone this problem for a number of
> years until RHEL-8 gets around to ramping up.
>
> So I guess I would prefer to figure out a better co-existing solution now.
>
> Can I ask how the NO_HZ_FULL technology works from userspace? Is there a
> system command that has to be sent? How does the kernel know to turn off
> ticks and trust userspace to do the right thing?
The NO_HZ_FULL option, when configured into the kernel, lets
you boot with "nohz_full=1-15" (or whatever cpumask you like),
typically in conjunction with "isolcpus=1-15". At this point no tasks
will run on those cores until explicitly placed there by affinity, and
once there and running in userspace, the kernel will automatically
get out of their way and not interrupt at all. This lets those tasks
run with 100.000% of the cpu, which is a requirement for many
user-space device drivers running high throughput devices.
(This is typically the use case for the tile architecture customers.)
So, other than a boot flag, there are no system commands or
other APIs to deal with.
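For example, the "placed there by affinity" step is just ordinary CPU
affinity from userspace. A minimal sketch (the core number is illustrative
only, assuming core 1 was given to nohz_full=):

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    int main(void)
    {
            cpu_set_t set;

            CPU_ZERO(&set);
            CPU_SET(1, &set);       /* an assumed nohz_full core */
            if (sched_setaffinity(0, sizeof(set), &set) != 0) {
                    perror("sched_setaffinity");
                    return 1;
            }
            /* Once this is the only runnable task here, the tick stops. */
            for (;;)
                    ;       /* pure userspace work, no kernel entries */
    }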
Part of the requirement, though, is that there can be only one task
bound and runnable on that cpu, otherwise the kernel has to be
involved to do the context-switching off of the scheduler tick.
This is why having the standard watchdog kernel thread doesn't
work in this context.
I continue to suspect that the right model here is to disable the
watchdog specifically on the cores that the user has tagged with
the nohz_full boot argument. I agree that there might be a case
to be made for leaving the watchdog conditionally (as suggested
by Ingo) but it should be possible to have the watchdogs on
the nohz_full cores be turned off completely if desired.
--
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com
On Thu, Apr 02, 2015 at 09:49:45AM -0400, Chris Metcalf wrote:
> >Can I ask how the NO_HZ_FULL technology works from userspace? Is there a
> >system command that has to be sent? How does the kernel know to turn off
> >ticks and trust userspace to do the right thing?
>
> The NO_HZ_FULL option, when configured into the kernel, lets
> you boot with "nohz_full=1-15" (or whatever cpumask you like),
> typically in conjunction with "isolcpus=1-15". At this point no tasks
> will run on those cores until explicitly placed there by affinity, and
> once there and running in userspace, the kernel will automatically
> get out of their way and not interrupt at all. This lets those tasks
> run with 100.000% of the cpu, which is a requirement for many
> user-space device drivers running high throughput devices.
> (This is typically the use case for the tile architecture customers.)
>
> So, other than a boot flag, there are no system commands or
> other APIs to deal with.
Ah, I am starting to understand your approach in the original patch better.
>
> Part of the requirement, though, is that there can be only one task
> bound and runnable on that cpu, otherwise the kernel has to be
> involved to do the context-switching off of the scheduler tick.
> This is why having the standard watchdog kernel thread doesn't
> work in this context.
So, there is no preemption happening, which means the softlockup is rather
pointless. Can interrupts be disabled or handled on that cpu? I am trying
to see if the hardlockup detector becomes rather silly on those cpus too.
>
> I continue to suspect that the right model here is to disable the
> watchdog specifically on the cores that the user has tagged with
> the nohz_full boot argument. I agree that there might be a case
> to be made for leaving the watchdog conditionally (as suggested
> by Ingo) but it should be possible to have the watchdogs on
> the nohz_full cores be turned off completely if desired.
I think I might be slowly coming around to your thoughts. I might request a
different patch though based on the answers above. Maybe even create a
subset of the online cpus for the watchdog to work off of. The watchdog
would copy the online cpu mask, mask off the nohz cpus and just function
that way. It would print loud messages for each nohz cpu it was masking
off.
Then perhaps as a debug aid, expose a /proc/sys/kernel/watchdog_cpumask for
folks to modify in case they want to enable the watchdog on the nohz cpus.
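Something like this, roughly (an untested sketch; tick_nohz_full_mask only
exists under CONFIG_NO_HZ_FULL, and "watchdog_cpumask" is just the proposed
name, here a plain struct cpumask):

    int cpu;

    /* sketch: restrict the watchdog to non-nohz cpus, complaining loudly */
    cpumask_copy(&watchdog_cpumask, cpu_online_mask);
    cpumask_andnot(&watchdog_cpumask, &watchdog_cpumask, tick_nohz_full_mask);
    for_each_cpu(cpu, tick_nohz_full_mask)
            pr_info("watchdog: not running on nohz_full cpu %d\n", cpu);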
Just some thoughts.
Cheers,
Don
On Thu, Apr 02, 2015 at 10:15:27AM -0400, Don Zickus wrote:
> On Thu, Apr 02, 2015 at 09:49:45AM -0400, Chris Metcalf wrote:
> > >Can I ask how the NO_HZ_FULL technology works from userspace? Is there a
> > >system command that has to be sent? How does the kernel know to turn off
> > >ticks and trust userspace to do the right thing?
> >
> > The NO_HZ_FULL option, when configured into the kernel, lets
> > you boot with "nohz_full=1-15" (or whatever cpumask you like),
> > typically in conjunction with "isolcpus=1-15". At this point no tasks
> > will run on those cores until explicitly placed there by affinity, and
> > once there and running in userspace, the kernel will automatically
> > get out of their way and not interrupt at all. This lets those tasks
> > run with 100.000% of the cpu, which is a requirement for many
> > user-space device drivers running high throughput devices.
> > (This is typically the use case for the tile architecture customers.)
> >
> > So, other than a boot flag, there are no system commands or
> > other APIs to deal with.
>
> Ah, I am starting to understand your approach in the original patch better.
>
> >
> > Part of the requirement, though, is that there can be only one task
> > bound and runnable on that cpu, otherwise the kernel has to be
> > involved to do the context-switching off of the scheduler tick.
> > This is why having the standard watchdog kernel thread doesn't
> > work in this context.
>
> So, there is no preemption happening, which means the softlockup is rather
> pointless.
Still useful actually, because nohz full only takes effect when a single task runs
on the CPU. But there can still be more than one task running; nohz full will just
be disabled then. It all happens dynamically.
> Can interrupts be disabled or handled on that cpu? I am trying
> to see if the hardlockup detector becomes rather silly on those cpus too.
No, interrupts aren't disabled on these CPUs. Now the goal is to avoid them:
migrate irqs, nohz full, etc...
But there can be irqs. And in fact there is at least one tick every second in
order to keep the scheduler stats moving forward. We plan to get rid of it, but
anyway the point is that IRQs can happen on nohz_full CPUs.
>
> >
> > I continue to suspect that the right model here is to disable the
> > watchdog specifically on the cores that the user has tagged with
> > the nohz_full boot argument. I agree that there might be a case
> > to be made for leaving the watchdog conditionally (as suggested
> > by Ingo) but it should be possible to have the watchdogs on
> > the nohz_full cores be turned off completely if desired.
>
> I think I might be slowly coming around to your thoughts. I might request a
> different patch though based on the answers above. Maybe even create a
> subset of the online cpus for the watchdog to work off of. The watchdog
> would copy the online cpu mask, mask off the nohz cpus and just function
> that way. It would print loud messages for each nohz cpu it was masking
> off.
All agreed with that! We should at least keep the watchdog running on
non-nohz-full CPUs. And also allow re-enabling it everywhere when needed,
in case we have a lockup to chase on nohz_full CPUs.
> Then perhaps as a debug aid, expose a /proc/sys/kernel/watchdog_cpumask for
> folks to modify in case they want to enable the watchdog on the nohz cpus.
That sounds like a good idea.
>
> Just some thoughts.
>
> Cheers,
> Don
On 04/02/2015 11:38 AM, Frederic Weisbecker wrote:
> On Thu, Apr 02, 2015 at 10:15:27AM -0400, Don Zickus wrote:
>> On Thu, Apr 02, 2015 at 09:49:45AM -0400, Chris Metcalf wrote:
>>>> Can I ask how the NO_HZ_FULL technology works from userspace? Is there a
>>>> system command that has to be sent? How does the kernel know to turn off
>>>> ticks and trust userspace to do the right thing?
>>> The NO_HZ_FULL option, when configured into the kernel, lets
>>> you boot with "nohz_full=1-15" (or whatever cpumask you like),
>>> typically in conjunction with "isolcpus=1-15". At this point no tasks
>>> will run on those cores until explicitly placed there by affinity, and
>>> once there and running in userspace, the kernel will automatically
>>> get out of their way and not interrupt at all. This lets those tasks
>>> run with 100.000% of the cpu, which is a requirement for many
>>> user-space device drivers running high throughput devices.
>>> (This is typically the use case for the tile architecture customers.)
>>>
>>> So, other than a boot flag, there are no system commands or
>>> other APIs to deal with.
>> Ah, I am starting to understand your approach in the original patch better.
>>
>>> Part of the requirement, though, is that there can be only one task
>>> bound and runnable on that cpu, otherwise the kernel has to be
>>> involved to do the context-switching off of the scheduler tick.
>>> This is why having the standard watchdog kernel thread doesn't
>>> work in this context.
>> So, there is no preemption happening, which means the softlockup is rather
>> pointless.
> Still useful actually, because nohz full only takes effect when a single task runs
> on the CPU. But there can still be more than one task running; nohz full will just
> be disabled then. It all happens dynamically.
>
>> Can interrupts be disabled or handled on that cpu? I am trying
>> to see if the hardlockup detector becomes rather silly on those cpus too.
> No, interrupts aren't disabled on these CPUs. Now the goal is to avoid them:
> migrate irqs, nohz full, etc...
>
> But there can be irqs. And in fact there is at least one tick every second in
> order to keep the scheduler stats moving forward. We plan to get rid of it, but
> anyway the point is that IRQs can happen on nohz_full CPUs.
>
>>> I continue to suspect that the right model here is to disable the
>>> watchdog specifically on the cores that the user has tagged with
>>> the nohz_full boot argument. I agree that there might be a case
>>> to be made for leaving the watchdog conditionally (as suggested
>>> by Ingo) but it should be possible to have the watchdogs on
>>> the nohz_full cores be turned off completely if desired.
>> I think I might be slowly coming around to your thoughts. I might request a
>> different patch though based on the answers above. Maybe even create a
>> subset of the online cpus for the watchdog to work off of. The watchdog
>> would copy the online cpu mask, mask off the nohz cpus and just function
>> that way. It would print loud messages for each nohz cpu it was masking
>> off.
> All agreed with that! We should at least keep the watchdog running on
> non-nohz-full CPUs. And also allow re-enabling it everywhere when needed,
> in case we have a lockup to chase on nohz_full CPUs.
>
>> Then perhaps as a debug aid, expose a /proc/sys/kernel/watchdog_cpumask for
>> folks to modify in case they want to enable the watchdog on the nohz cpus.
> That sounds like a good idea.
OK, I will respin v2 of the patch as follows:
- Provide a watchdog_cpumask as suggested by Don.
- On a non-NO_HZ_FULL build, it defaults to cpu_possible as normal
- On a NO_HZ_FULL build, it defaults to the housekeeping cpus
- If the mask is modified, we disable and then re-enable the watchdog,
so that the watchdog init code can exit() the appropriate threads as
they start up
This should address the various concerns that have been raised.
--
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com
On Thu, Apr 02, 2015 at 11:42:43AM -0400, Chris Metcalf wrote:
> >
> >>Then perhaps as a debug aid, expose a /proc/sys/kernel/watchdog_cpumask for
> >>folks to modify in case they want to enable the watchdog on the nohz cpus.
> >That sounds like a good idea.
>
> OK, I will respin v2 of the patch as follows:
>
> - Provide a watchdog_cpumask as suggested by Don.
> - On a non-NO_HZ_FULL build, it defaults to cpu_possible as normal
> - On a NO_HZ_FULL build, it defaults to the housekeeping cpus
> - If the mask is modified, we disable and then re-enable the watchdog,
> so that the watchdog init code can exit() the appropriate threads as
> they start up
Sounds good. :-)
Cheers,
Don
On Thu, Apr 02, 2015 at 11:42:43AM -0400, Chris Metcalf wrote:
> OK, I will respin v2 of the patch as follows:
>
> - Provide a watchdog_cpumask as suggested by Don.
> - On a non-NO_HZ_FULL build, it defaults to cpu_possible as normal
> - On a NO_HZ_FULL build, it defaults to the housekeeping cpus
Ah, note that NO_HZ_FULL is only the capability. Nohz full is actually
only running if the nohz_full parameter is passed (or NO_HZ_FULL_ALL=y).
And now general-purpose distros enable NO_HZ_FULL so that anybody can use it.
So it's better to check tick_nohz_full_enabled() instead of the CONFIG.
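For example (just a sketch of the idea, not tested, using the watchdog_mask
from your patch):

    /* gate on the runtime nohz_full state, not the Kconfig symbol */
    cpumask_copy(watchdog_mask, cpu_possible_mask);
    if (tick_nohz_full_enabled())
            cpumask_andnot(watchdog_mask, watchdog_mask, tick_nohz_full_mask);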
Thanks.
> - If the mask is modified, we disable and then re-enable the watchdog,
> so that the watchdog init code can exit() the appropriate threads as
> they start up
>
> This should address the various concerns that have been raised.
>
> --
> Chris Metcalf, EZChip Semiconductor
> http://www.ezchip.com
>
From: Chris Metcalf <[email protected]>
Change the default behavior of watchdog so it only runs on the
housekeeping cores when nohz_full is enabled at build and boot time.
Allow modifying the set of cores the watchdog is currently running
on with a new kernel.watchdog_cpumask sysctl.
Signed-off-by: Chris Metcalf <[email protected]>
---
Technically this is only v2, but I accidentally replied to an
earlier email after adding v2 to the subject line, so for clarity
I'm calling this thread v3.
This change depends on my earlier change to add a
tick_nohz_full_clear_cpus() API. If folks are OK with my doing so, I can
add it to the set of patches I'm planning to ask Linus to pull for 4.1.
Documentation/lockup-watchdogs.txt | 6 ++++++
Documentation/sysctl/kernel.txt | 9 +++++++++
include/linux/nmi.h | 1 +
include/linux/sched.h | 3 +++
kernel/sysctl.c | 7 +++++++
kernel/watchdog.c | 33 ++++++++++++++++++++++++++++++++-
6 files changed, 58 insertions(+), 1 deletion(-)
diff --git a/Documentation/lockup-watchdogs.txt b/Documentation/lockup-watchdogs.txt
index ab0baa692c13..82a99eedf904 100644
--- a/Documentation/lockup-watchdogs.txt
+++ b/Documentation/lockup-watchdogs.txt
@@ -61,3 +61,9 @@ As explained above, a kernel knob is provided that allows
administrators to configure the period of the hrtimer and the perf
event. The right value for a particular environment is a trade-off
between fast response to lockups and detection overhead.
+
+By default, the watchdog runs on all online cores. However, on a
+kernel configured with NO_HZ_FULL, by default the watchdog runs only
+on the housekeeping cores, not the cores specified in the "nohz_full"
+boot argument. In either case, the set of cores running the watchdog
+may be adjusted via the kernel.watchdog_cpumask sysctl.
diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index 83ab25660fc9..5821dc6bb5c2 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -858,6 +858,15 @@ example. If a system hangs up, try pressing the NMI switch.
==============================================================
+watchdog_cpumask:
+
+This value can be used to control on which cpus the watchdog will run.
+The default cpumask specifies every core, but if NO_HZ_FULL is enabled
+in the kernel config, and cores are specified with the nohz_full= boot
+argument, those cores are excluded by default.
+
+==============================================================
+
watchdog_thresh:
This value can be used to control the frequency of hrtimer and NMI
diff --git a/include/linux/nmi.h b/include/linux/nmi.h
index 9b2022ab4d85..cebf36e618e0 100644
--- a/include/linux/nmi.h
+++ b/include/linux/nmi.h
@@ -70,6 +70,7 @@ int hw_nmi_is_cpu_stuck(struct pt_regs *);
u64 hw_nmi_get_sample_period(int watchdog_thresh);
extern int watchdog_user_enabled;
extern int watchdog_thresh;
+extern unsigned long *watchdog_mask_bits;
extern int sysctl_softlockup_all_cpu_backtrace;
struct ctl_table;
extern int proc_dowatchdog(struct ctl_table *, int ,
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6d77432e14ff..a6f048f4fbeb 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -377,6 +377,9 @@ extern void touch_all_softlockup_watchdogs(void);
extern int proc_dowatchdog_thresh(struct ctl_table *table, int write,
void __user *buffer,
size_t *lenp, loff_t *ppos);
+extern int proc_dowatchdog_mask(struct ctl_table *table, int write,
+ void __user *buffer,
+ size_t *lenp, loff_t *ppos);
extern unsigned int softlockup_panic;
void lockup_detector_init(void);
#else
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 88ea2d6e0031..2fb96ffa56d1 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -860,6 +860,13 @@ static struct ctl_table kern_table[] = {
.extra2 = &sixty,
},
{
+ .procname = "watchdog_cpumask",
+ .data = &watchdog_mask_bits,
+ .maxlen = NR_CPUS,
+ .mode = 0644,
+ .proc_handler = proc_dowatchdog_mask,
+ },
+ {
.procname = "softlockup_panic",
.data = &softlockup_panic,
.maxlen = sizeof(int),
diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 3174bf8e3538..2140c2d81dc9 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -19,6 +19,7 @@
#include <linux/sysctl.h>
#include <linux/smpboot.h>
#include <linux/sched/rt.h>
+#include <linux/tick.h>
#include <asm/irq_regs.h>
#include <linux/kvm_para.h>
@@ -31,6 +32,8 @@ int __read_mostly sysctl_softlockup_all_cpu_backtrace;
#else
#define sysctl_softlockup_all_cpu_backtrace 0
#endif
+static cpumask_var_t watchdog_mask;
+unsigned long *watchdog_mask_bits;
static int __read_mostly watchdog_running;
static u64 __read_mostly sample_period;
@@ -431,6 +434,10 @@ static void watchdog_enable(unsigned int cpu)
hrtimer_init(hrtimer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
hrtimer->function = watchdog_timer_fn;
+ /* Exit if the cpu is not allowed for watchdog. */
+ if (!cpumask_test_cpu(cpu, watchdog_mask))
+ do_exit(0);
+
/* Enable the perf event */
watchdog_nmi_enable(cpu);
@@ -653,6 +660,8 @@ static void watchdog_disable_all_cpus(void)
}
}
+static DEFINE_MUTEX(watchdog_proc_mutex);
+
/*
* proc handler for /proc/sys/kernel/nmi_watchdog,watchdog_thresh
*/
@@ -662,7 +671,6 @@ int proc_dowatchdog(struct ctl_table *table, int write,
{
int err, old_thresh, old_enabled;
bool old_hardlockup;
- static DEFINE_MUTEX(watchdog_proc_mutex);
mutex_lock(&watchdog_proc_mutex);
old_thresh = ACCESS_ONCE(watchdog_thresh);
@@ -700,12 +708,35 @@ out:
mutex_unlock(&watchdog_proc_mutex);
return err;
}
+
+int proc_dowatchdog_mask(struct ctl_table *table, int write,
+ void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+ int err;
+
+ mutex_lock(&watchdog_proc_mutex);
+ err = proc_do_large_bitmap(table, write, buffer, lenp, ppos);
+ if (!err && write && watchdog_user_enabled) {
+ watchdog_disable_all_cpus();
+ watchdog_enable_all_cpus(false);
+ }
+ mutex_unlock(&watchdog_proc_mutex);
+ return err;
+}
+
#endif /* CONFIG_SYSCTL */
void __init lockup_detector_init(void)
{
set_sample_period();
+ alloc_bootmem_cpumask_var(&watchdog_mask);
+ cpumask_copy(watchdog_mask, cpu_possible_mask);
+ tick_nohz_full_clear_cpus(watchdog_mask);
+
+ /* The sysctl API requires a variable holding a pointer to the mask. */
+ watchdog_mask_bits = cpumask_bits(watchdog_mask);
+
if (watchdog_user_enabled)
watchdog_enable_all_cpus(false);
}
--
2.1.2
On Thu, Apr 02, 2015 at 01:39:28PM -0400, [email protected] wrote:
> @@ -431,6 +434,10 @@ static void watchdog_enable(unsigned int cpu)
> hrtimer_init(hrtimer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
> hrtimer->function = watchdog_timer_fn;
>
> + /* Exit if the cpu is not allowed for watchdog. */
> + if (!cpumask_test_cpu(cpu, watchdog_mask))
> + do_exit(0);
> +
Ick, that doesn't look right for smpboot threads.
On 04/02/2015 02:06 PM, Peter Zijlstra wrote:
> On Thu, Apr 02, 2015 at 01:39:28PM -0400, [email protected] wrote:
>> @@ -431,6 +434,10 @@ static void watchdog_enable(unsigned int cpu)
>> hrtimer_init(hrtimer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
>> hrtimer->function = watchdog_timer_fn;
>>
>> + /* Exit if the cpu is not allowed for watchdog. */
>> + if (!cpumask_test_cpu(cpu, watchdog_mask))
>> + do_exit(0);
>> +
> Ick, that doesn't look right for smpboot threads.
I didn't see a better way to make this happen without adding
a bunch of infrastructure to the smpboot thread mechanism
to use a cpumask other than for_each_online_cpu(). The exit
seems benign in my testing, but I agree it's not the cleanest
way to express what we're trying to do here.
Perhaps something like an optional cpumask_t pointer in
struct smp_hotplug_thread, which if present specifies the
cpus to run on, and otherwise we stick with cpu_online_mask?
--
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com
On Thu, Apr 02, 2015 at 02:16:09PM -0400, Chris Metcalf wrote:
> On 04/02/2015 02:06 PM, Peter Zijlstra wrote:
> >On Thu, Apr 02, 2015 at 01:39:28PM -0400, [email protected] wrote:
> >>@@ -431,6 +434,10 @@ static void watchdog_enable(unsigned int cpu)
> >> hrtimer_init(hrtimer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
> >> hrtimer->function = watchdog_timer_fn;
> >>+ /* Exit if the cpu is not allowed for watchdog. */
> >>+ if (!cpumask_test_cpu(cpu, watchdog_mask))
> >>+ do_exit(0);
> >>+
> >Ick, that doesn't look right for smpboot threads.
>
> I didn't see a better way to make this happen without adding
> a bunch of infrastructure to the smpboot thread mechanism
> to use a cpumask other than for_each_online_cpu(). The exit
> seems benign in my testing, but I agree it's not the cleanest
> way to express what we're trying to do here.
>
> Perhaps something like an optional cpumask_t pointer in
> struct smp_hotplug_thread, which if present specifies the
> cpus to run on, and otherwise we stick with cpu_online_mask?
What's wrong with just leaving the thread be but making sure it'll never
actually do anything?
On Thu, Apr 02, 2015 at 01:39:28PM -0400, [email protected] wrote:
> From: Chris Metcalf <[email protected]>
>
> Change the default behavior of watchdog so it only runs on the
> housekeeping cores when nohz_full is enabled at build and boot time.
>
> Allow modifying the set of cores the watchdog is currently running
> on with a new kernel.watchdog_cpumask sysctl.
>
> Signed-off-by: Chris Metcalf <[email protected]>
> ---
> Technically this is only v2, but I accidentally replied to an
> earlier email after adding v2 to the subject line, so for clarity
> I'm calling this thread v3.
>
> This change depends on my earlier change to add a
> tick_nohz_full_clear_cpus() API. If folks are OK with my doing so, I can
> add it to the set of patches I'm planning to ask Linus to pull for 4.1.
>
> Documentation/lockup-watchdogs.txt | 6 ++++++
> Documentation/sysctl/kernel.txt | 9 +++++++++
> include/linux/nmi.h | 1 +
> include/linux/sched.h | 3 +++
> kernel/sysctl.c | 7 +++++++
> kernel/watchdog.c | 33 ++++++++++++++++++++++++++++++++-
> 6 files changed, 58 insertions(+), 1 deletion(-)
>
> diff --git a/Documentation/lockup-watchdogs.txt b/Documentation/lockup-watchdogs.txt
> index ab0baa692c13..82a99eedf904 100644
> --- a/Documentation/lockup-watchdogs.txt
> +++ b/Documentation/lockup-watchdogs.txt
> @@ -61,3 +61,9 @@ As explained above, a kernel knob is provided that allows
> administrators to configure the period of the hrtimer and the perf
> event. The right value for a particular environment is a trade-off
> between fast response to lockups and detection overhead.
> +
> +By default, the watchdog runs on all online cores. However, on a
> +kernel configured with NO_HZ_FULL, by default the watchdog runs only
> +on the housekeeping cores, not the cores specified in the "nohz_full"
> +boot argument. In either case, the set of cores running the watchdog
> +may be adjusted via the kernel.watchdog_cpumask sysctl.
> diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
> index 83ab25660fc9..5821dc6bb5c2 100644
> --- a/Documentation/sysctl/kernel.txt
> +++ b/Documentation/sysctl/kernel.txt
> @@ -858,6 +858,15 @@ example. If a system hangs up, try pressing the NMI switch.
>
> ==============================================================
>
> +watchdog_cpumask:
> +
> +This value can be used to control on which cpus the watchdog will run.
> +The default cpumask specifies every core, but if NO_HZ_FULL is enabled
> +in the kernel config, and cores are specified with the nohz_full= boot
> +argument, those cores are excluded by default.
> +
> +==============================================================
> +
> watchdog_thresh:
>
> This value can be used to control the frequency of hrtimer and NMI
> diff --git a/include/linux/nmi.h b/include/linux/nmi.h
> index 9b2022ab4d85..cebf36e618e0 100644
> --- a/include/linux/nmi.h
> +++ b/include/linux/nmi.h
> @@ -70,6 +70,7 @@ int hw_nmi_is_cpu_stuck(struct pt_regs *);
> u64 hw_nmi_get_sample_period(int watchdog_thresh);
> extern int watchdog_user_enabled;
> extern int watchdog_thresh;
> +extern unsigned long *watchdog_mask_bits;
> extern int sysctl_softlockup_all_cpu_backtrace;
> struct ctl_table;
> extern int proc_dowatchdog(struct ctl_table *, int ,
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 6d77432e14ff..a6f048f4fbeb 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -377,6 +377,9 @@ extern void touch_all_softlockup_watchdogs(void);
> extern int proc_dowatchdog_thresh(struct ctl_table *table, int write,
> void __user *buffer,
> size_t *lenp, loff_t *ppos);
> +extern int proc_dowatchdog_mask(struct ctl_table *table, int write,
> + void __user *buffer,
> + size_t *lenp, loff_t *ppos);
> extern unsigned int softlockup_panic;
> void lockup_detector_init(void);
> #else
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index 88ea2d6e0031..2fb96ffa56d1 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -860,6 +860,13 @@ static struct ctl_table kern_table[] = {
> .extra2 = &sixty,
> },
> {
> + .procname = "watchdog_cpumask",
> + .data = &watchdog_mask_bits,
> + .maxlen = NR_CPUS,
> + .mode = 0644,
> + .proc_handler = proc_dowatchdog_mask,
> + },
> + {
> .procname = "softlockup_panic",
> .data = &softlockup_panic,
> .maxlen = sizeof(int),
> diff --git a/kernel/watchdog.c b/kernel/watchdog.c
> index 3174bf8e3538..2140c2d81dc9 100644
> --- a/kernel/watchdog.c
> +++ b/kernel/watchdog.c
> @@ -19,6 +19,7 @@
> #include <linux/sysctl.h>
> #include <linux/smpboot.h>
> #include <linux/sched/rt.h>
> +#include <linux/tick.h>
>
> #include <asm/irq_regs.h>
> #include <linux/kvm_para.h>
> @@ -31,6 +32,8 @@ int __read_mostly sysctl_softlockup_all_cpu_backtrace;
> #else
> #define sysctl_softlockup_all_cpu_backtrace 0
> #endif
> +static cpumask_var_t watchdog_mask;
> +unsigned long *watchdog_mask_bits;
>
> static int __read_mostly watchdog_running;
> static u64 __read_mostly sample_period;
> @@ -431,6 +434,10 @@ static void watchdog_enable(unsigned int cpu)
> hrtimer_init(hrtimer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
> hrtimer->function = watchdog_timer_fn;
>
> + /* Exit if the cpu is not allowed for watchdog. */
> + if (!cpumask_test_cpu(cpu, watchdog_mask))
> + do_exit(0);
> +
Besides the do_exit(), a printk is probably needed.
> /* Enable the perf event */
> watchdog_nmi_enable(cpu);
>
> @@ -653,6 +660,8 @@ static void watchdog_disable_all_cpus(void)
> }
> }
>
> +static DEFINE_MUTEX(watchdog_proc_mutex);
> +
I posted a patchset from Uli to akpm a while ago that changed things around
with regard to the procfs stuff. Andrew queued it, but I wasn't sure if
there were other issues with it or if it is good to go for 4.1. So this
piece and the stuff below might get modified later.
> /*
> * proc handler for /proc/sys/kernel/nmi_watchdog,watchdog_thresh
> */
> @@ -662,7 +671,6 @@ int proc_dowatchdog(struct ctl_table *table, int write,
> {
> int err, old_thresh, old_enabled;
> bool old_hardlockup;
> - static DEFINE_MUTEX(watchdog_proc_mutex);
>
> mutex_lock(&watchdog_proc_mutex);
> old_thresh = ACCESS_ONCE(watchdog_thresh);
> @@ -700,12 +708,35 @@ out:
> mutex_unlock(&watchdog_proc_mutex);
> return err;
> }
> +
> +int proc_dowatchdog_mask(struct ctl_table *table, int write,
> + void __user *buffer, size_t *lenp, loff_t *ppos)
> +{
> + int err;
> +
> + mutex_lock(&watchdog_proc_mutex);
> + err = proc_do_large_bitmap(table, write, buffer, lenp, ppos);
> + if (!err && write && watchdog_user_enabled) {
> + watchdog_disable_all_cpus();
> + watchdog_enable_all_cpus(false);
> + }
> + mutex_unlock(&watchdog_proc_mutex);
> + return err;
> +}
> +
> #endif /* CONFIG_SYSCTL */
Hmm, based on the procfs changes in the new code, what if we do a 'return'
instead of a do_exit()? This keeps the thread registered but doing
nothing. Later, if we update the watchdog_cpumask, a restart easily enables
the soft/hard watchdog pieces.
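I.e., something like this in watchdog_enable(), as a sketch:

    /* sketch: stay registered but dormant on excluded cpus */
    if (!cpumask_test_cpu(cpu, watchdog_mask)) {
            pr_info("watchdog: not enabling on cpu %u\n", cpu);
            return;
    }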
The new procfs changes try hard to handle a 'restart' scenario better as
the procfs variables are updated. This piece could fit nicely into that, I
think.
Those changes start here: https://lkml.org/lkml/2015/2/5/626
Cheers,
Don
>
> void __init lockup_detector_init(void)
> {
> set_sample_period();
>
> + alloc_bootmem_cpumask_var(&watchdog_mask);
> + cpumask_copy(watchdog_mask, cpu_possible_mask);
> + tick_nohz_full_clear_cpus(watchdog_mask);
> +
> + /* The sysctl API requires a variable holding a pointer to the mask. */
> + watchdog_mask_bits = cpumask_bits(watchdog_mask);
> +
> if (watchdog_user_enabled)
> watchdog_enable_all_cpus(false);
> }
> --
> 2.1.2
>
On 04/02/2015 02:33 PM, Peter Zijlstra wrote:
> On Thu, Apr 02, 2015 at 02:16:09PM -0400, Chris Metcalf wrote:
>> On 04/02/2015 02:06 PM, Peter Zijlstra wrote:
>>> On Thu, Apr 02, 2015 at 01:39:28PM -0400, [email protected] wrote:
>>>> @@ -431,6 +434,10 @@ static void watchdog_enable(unsigned int cpu)
>>>> hrtimer_init(hrtimer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
>>>> hrtimer->function = watchdog_timer_fn;
>>>> + /* Exit if the cpu is not allowed for watchdog. */
>>>> + if (!cpumask_test_cpu(cpu, watchdog_mask))
>>>> + do_exit(0);
>>>> +
>>> Ick, that doesn't look right for smpboot threads.
>> I didn't see a better way to make this happen without adding
>> a bunch of infrastructure to the smpboot thread mechanism
>> to use a cpumask other than for_each_online_cpu(). The exit
>> seems benign in my testing, but I agree it's not the cleanest
>> way to express what we're trying to do here.
>>
>> Perhaps something like an optional cpumask_t pointer in
>> struct smp_hotplug_thread, which if present specifies the
>> cpus to run on, and otherwise we stick with cpu_online_mask?
> What's wrong with just leaving the thread be but making sure it'll never
> actually do anything?
I think a common case for nohz_full systems is that you'll
have a whole lot of watchdog threads that never do anything.
Our TILEGx-72 systems are often run with one housekeeping
core and the rest doing userspace nohz_full driver work. So
not creating the threads seems tidier - it keeps 71 threads out
of the "ps" listing :-)
Here's a quick sketch of the delta from my previous patch to
one with a new smp_hotplug_thread.cpumask field. If folks
are OK with modifying the smpboot threads like this, I think
it probably is a cleaner approach:
diff --git a/include/linux/smpboot.h b/include/linux/smpboot.h
index 13e929679550..f28519612ee3 100644
--- a/include/linux/smpboot.h
+++ b/include/linux/smpboot.h
@@ -27,6 +27,7 @@ struct smpboot_thread_data;
* @pre_unpark: Optional unpark function, called before the thread is
* unparked (cpu online). This is not guaranteed to be
* called on the target cpu of the thread. Careful!
+ * @cpumask: Optional cpumask, specifying what cores to run on.
* @selfparking: Thread is not parked by the park function.
* @thread_comm: The base name of the thread
*/
@@ -41,6 +42,7 @@ struct smp_hotplug_thread {
void (*park)(unsigned int cpu);
void (*unpark)(unsigned int cpu);
void (*pre_unpark)(unsigned int cpu);
+ cpumask_t *cpumask;
bool selfparking;
const char *thread_comm;
};
diff --git a/kernel/smpboot.c b/kernel/smpboot.c
index 40190f28db35..be503c2ddb5f 100644
--- a/kernel/smpboot.c
+++ b/kernel/smpboot.c
@@ -172,6 +172,9 @@ __smpboot_create_thread(struct smp_hotplug_thread *ht, unsigned int cpu)
if (tsk)
return 0;
+ if (ht->cpumask && !cpumask_test_cpu(cpu, ht->cpumask))
+ return 0;
+
td = kzalloc_node(sizeof(*td), GFP_KERNEL, cpu_to_node(cpu));
if (!td)
return -ENOMEM;
@@ -220,9 +223,11 @@ static void smpboot_unpark_thread(struct smp_hotplug_thread *ht, unsigned int cpu)
{
struct task_struct *tsk = *per_cpu_ptr(ht->store, cpu);
- if (ht->pre_unpark)
- ht->pre_unpark(cpu);
- kthread_unpark(tsk);
+ if (tsk) {
+ if (ht->pre_unpark)
+ ht->pre_unpark(cpu);
+ kthread_unpark(tsk);
+ }
}
void smpboot_unpark_threads(unsigned int cpu)
diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 2140c2d81dc9..681e5648e093 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -434,10 +434,6 @@ static void watchdog_enable(unsigned int cpu)
hrtimer_init(hrtimer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
hrtimer->function = watchdog_timer_fn;
- /* Exit if the cpu is not allowed for watchdog. */
- if (!cpumask_test_cpu(cpu, watchdog_mask))
- do_exit(0);
-
/* Enable the perf event */
watchdog_nmi_enable(cpu);
@@ -588,6 +584,7 @@ static struct smp_hotplug_thread watchdog_threads = {
.cleanup = watchdog_cleanup,
.park = watchdog_disable,
.unpark = watchdog_enable,
+ .cpumask = watchdog_mask,
};
static void restart_watchdog_hrtimer(void *info)
--
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com
From: Chris Metcalf <[email protected]>
This change allows some cores to be excluded from running the
smp_hotplug_thread tasks. The motivating example for this is
the watchdog threads, which by default we don't want to run
on any enabled nohz_full cores.
Signed-off-by: Chris Metcalf <[email protected]>
---
Relative to the quick diff I emailed out yesterday (sorry, I missed
adding Thomas to that one), this change uses a notion of an "exclude_mask"
for smp_hotplug_thread. I think this is a better fit in general with
the idea that smp_hotplug_thread should run everywhere (including on
any newly-onlined cpus), EXCEPT for cores that have been specifically
tagged as reserved for some reason. Obviously for nohz_full we have
the nohz_full boot argument to use for this at the moment.
Thomas, Peter, how does this look to you?
include/linux/smpboot.h | 2 ++
kernel/smpboot.c | 11 ++++++++---
2 files changed, 10 insertions(+), 3 deletions(-)
diff --git a/include/linux/smpboot.h b/include/linux/smpboot.h
index 13e929679550..0631964525f7 100644
--- a/include/linux/smpboot.h
+++ b/include/linux/smpboot.h
@@ -27,6 +27,7 @@ struct smpboot_thread_data;
* @pre_unpark: Optional unpark function, called before the thread is
* unparked (cpu online). This is not guaranteed to be
* called on the target cpu of the thread. Careful!
+ * @exclude_mask: Optional cpumask, specifying cores to exclude.
* @selfparking: Thread is not parked by the park function.
* @thread_comm: The base name of the thread
*/
@@ -41,6 +42,7 @@ struct smp_hotplug_thread {
void (*park)(unsigned int cpu);
void (*unpark)(unsigned int cpu);
void (*pre_unpark)(unsigned int cpu);
+ cpumask_t *exclude_mask;
bool selfparking;
const char *thread_comm;
};
diff --git a/kernel/smpboot.c b/kernel/smpboot.c
index 40190f28db35..7df326ed80eb 100644
--- a/kernel/smpboot.c
+++ b/kernel/smpboot.c
@@ -172,6 +172,9 @@ __smpboot_create_thread(struct smp_hotplug_thread *ht, unsigned int cpu)
if (tsk)
return 0;
+ if (ht->exclude_mask && cpumask_test_cpu(cpu, ht->exclude_mask))
+ return 0;
+
td = kzalloc_node(sizeof(*td), GFP_KERNEL, cpu_to_node(cpu));
if (!td)
return -ENOMEM;
@@ -220,9 +223,11 @@ static void smpboot_unpark_thread(struct smp_hotplug_thread *ht, unsigned int cpu)
{
struct task_struct *tsk = *per_cpu_ptr(ht->store, cpu);
- if (ht->pre_unpark)
- ht->pre_unpark(cpu);
- kthread_unpark(tsk);
+ if (tsk) {
+ if (ht->pre_unpark)
+ ht->pre_unpark(cpu);
+ kthread_unpark(tsk);
+ }
}
void smpboot_unpark_threads(unsigned int cpu)
--
2.1.2
From: Chris Metcalf <[email protected]>
Change the default behavior of watchdog so it only runs on the
housekeeping cores when nohz_full is enabled at build and boot time.
Allow modifying the set of cores the watchdog is currently running
on with a new kernel.watchdog_exclude sysctl.
Signed-off-by: Chris Metcalf <[email protected]>
---
Don, I think this will merge pretty well with the restructuring changes
you passed to Andrew. In particular it benefits from that code moving
the mutex out to file scope already, and I don't think it conflicts
with any of the proposed sysctl renaming or file refactoring.
I changed your suggested kernel.watchdog_cpumask to
kernel.watchdog_exclude (i.e. the inverse set) since I thought that
was clearer in the context of smp_hotplug_thread, where cores might
potentially go online or offline and the important invariant is that
the nohz_full cpu set be respected.
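Put differently, the set that actually runs the watchdog is always
derived on the fly, so newly-onlined cpus do the right thing. As a
sketch (with "allowed" a hypothetical scratch mask):

    /* cpus that run the watchdog = online minus excluded */
    cpumask_andnot(&allowed, cpu_online_mask, watchdog_exclude_mask);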
What do you think of using my proposed new smp_hotplug_thread
exclude_mask to simply prevent unwanted watchdog threads from existing
at all? It's cleaner than the "do_exit(0)" strategy, and I think also
better than leaving the watchdog threads hanging around: in the most
common nohz_full case, "n - 1" cpus would otherwise have kthreads
created but never used, cluttering ps and potentially confusing people
trying to understand possible sources of interference with the
nohz_full userspace tasks.
Documentation/lockup-watchdogs.txt | 6 ++++++
Documentation/sysctl/kernel.txt | 9 +++++++++
include/linux/nmi.h | 3 +++
kernel/sysctl.c | 7 +++++++
kernel/watchdog.c | 36 +++++++++++++++++++++++++++++++++++-
5 files changed, 60 insertions(+), 1 deletion(-)
diff --git a/Documentation/lockup-watchdogs.txt b/Documentation/lockup-watchdogs.txt
index ab0baa692c13..4f86aec1d69d 100644
--- a/Documentation/lockup-watchdogs.txt
+++ b/Documentation/lockup-watchdogs.txt
@@ -61,3 +61,9 @@ As explained above, a kernel knob is provided that allows
administrators to configure the period of the hrtimer and the perf
event. The right value for a particular environment is a trade-off
between fast response to lockups and detection overhead.
+
+By default, the watchdog runs on all online cores. However, on a
+kernel configured with NO_HZ_FULL, by default the watchdog runs only
+on the housekeeping cores, not the cores specified in the "nohz_full"
+boot argument. In either case, the set of cores excluded from running
+the watchdog may be adjusted via the kernel.watchdog_exclude sysctl.
diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index 83ab25660fc9..aad9f9ba347c 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -858,6 +858,15 @@ example. If a system hangs up, try pressing the NMI switch.
==============================================================
+watchdog_exclude:
+
+This value can be used to control on which cpus the watchdog is
+prohibited from running. The default exclude mask is empty, but if
+NO_HZ_FULL is enabled in the kernel config, and cores are specified
+with the nohz_full= boot argument, those cores are excluded by default.
+
+==============================================================
+
watchdog_thresh:
This value can be used to control the frequency of hrtimer and NMI
diff --git a/include/linux/nmi.h b/include/linux/nmi.h
index 9b2022ab4d85..1703829c5812 100644
--- a/include/linux/nmi.h
+++ b/include/linux/nmi.h
@@ -70,10 +70,13 @@ int hw_nmi_is_cpu_stuck(struct pt_regs *);
u64 hw_nmi_get_sample_period(int watchdog_thresh);
extern int watchdog_user_enabled;
extern int watchdog_thresh;
+extern unsigned long *watchdog_exclude_mask_bits;
extern int sysctl_softlockup_all_cpu_backtrace;
struct ctl_table;
extern int proc_dowatchdog(struct ctl_table *, int ,
void __user *, size_t *, loff_t *);
+extern int proc_dowatchdog_exclude(struct ctl_table *, int,
+ void __user *, size_t *, loff_t *);
#endif
#ifdef CONFIG_HAVE_ACPI_APEI_NMI
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 88ea2d6e0031..f2c544181f4f 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -860,6 +860,13 @@ static struct ctl_table kern_table[] = {
.extra2 = &sixty,
},
{
+ .procname = "watchdog_exclude",
+ .data = &watchdog_exclude_mask_bits,
+ .maxlen = NR_CPUS,
+ .mode = 0644,
+ .proc_handler = proc_dowatchdog_exclude,
+ },
+ {
.procname = "softlockup_panic",
.data = &softlockup_panic,
.maxlen = sizeof(int),
diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 3174bf8e3538..66bfc80854d1 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -19,6 +19,7 @@
#include <linux/sysctl.h>
#include <linux/smpboot.h>
#include <linux/sched/rt.h>
+#include <linux/tick.h>
#include <asm/irq_regs.h>
#include <linux/kvm_para.h>
@@ -31,6 +32,8 @@ int __read_mostly sysctl_softlockup_all_cpu_backtrace;
#else
#define sysctl_softlockup_all_cpu_backtrace 0
#endif
+static cpumask_var_t watchdog_exclude_mask;
+unsigned long *watchdog_exclude_mask_bits;
static int __read_mostly watchdog_running;
static u64 __read_mostly sample_period;
@@ -581,6 +584,7 @@ static struct smp_hotplug_thread watchdog_threads = {
.cleanup = watchdog_cleanup,
.park = watchdog_disable,
.unpark = watchdog_enable,
+ .exclude_mask = watchdog_exclude_mask,
};
static void restart_watchdog_hrtimer(void *info)
@@ -653,6 +657,8 @@ static void watchdog_disable_all_cpus(void)
}
}
+static DEFINE_MUTEX(watchdog_proc_mutex);
+
/*
* proc handler for /proc/sys/kernel/nmi_watchdog,watchdog_thresh
*/
@@ -662,7 +668,6 @@ int proc_dowatchdog(struct ctl_table *table, int write,
{
int err, old_thresh, old_enabled;
bool old_hardlockup;
- static DEFINE_MUTEX(watchdog_proc_mutex);
mutex_lock(&watchdog_proc_mutex);
old_thresh = ACCESS_ONCE(watchdog_thresh);
@@ -700,12 +705,41 @@ out:
mutex_unlock(&watchdog_proc_mutex);
return err;
}
+
+int proc_dowatchdog_exclude(struct ctl_table *table, int write,
+ void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+ int err;
+
+ mutex_lock(&watchdog_proc_mutex);
+ err = proc_do_large_bitmap(table, write, buffer, lenp, ppos);
+ if (!err && write && watchdog_user_enabled) {
+ watchdog_disable_all_cpus();
+ watchdog_enable_all_cpus(false);
+ }
+ mutex_unlock(&watchdog_proc_mutex);
+ return err;
+}
+
#endif /* CONFIG_SYSCTL */
void __init lockup_detector_init(void)
{
set_sample_period();
+ alloc_bootmem_cpumask_var(&watchdog_exclude_mask);
+
+#ifdef CONFIG_NO_HZ_FULL
+ if (!cpumask_empty(tick_nohz_full_mask))
+ pr_info("Disabling watchdog on nohz_full cores by default\n");
+ cpumask_copy(watchdog_exclude_mask, tick_nohz_full_mask);
+#else
+ cpumask_clear(watchdog_exclude_mask);
+#endif
+
+ /* The sysctl API requires a variable holding a pointer to the mask. */
+ watchdog_exclude_mask_bits = cpumask_bits(watchdog_exclude_mask);
+
if (watchdog_user_enabled)
watchdog_enable_all_cpus(false);
}
--
2.1.2
Chris,
I'd like to comment on the following proposed change:
+int proc_dowatchdog_exclude(struct ctl_table *table, int write,
+ void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+ int err;
+
+ mutex_lock(&watchdog_proc_mutex);
+ err = proc_do_large_bitmap(table, write, buffer, lenp, ppos);
+ if (!err && write && watchdog_user_enabled) {
+ watchdog_disable_all_cpus();
+ watchdog_enable_all_cpus(false);
+ }
+ mutex_unlock(&watchdog_proc_mutex);
+ return err;
+}
The watchdog mechanism is enabled if watchdog_user_enabled and watchdog_thresh
are both non-zero. Hence, I think the if-statement in the above snippet of code
should look like this:
if (!err && write && watchdog_user_enabled && watchdog_thresh)
Please see proc_dowatchdog() which checks the content of both variables before
it calls watchdog_enable_all_cpus():
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/kernel/watchdog.c?id=refs/tags/v4.0-rc6#n682
For completeness, I'd also like to point out that if the patch series at
https://lkml.org/lkml/2015/2/5/626 gets accepted upstream, the if-statement
will have to be adjusted. I think it should then look like this:
if (!err && write && watchdog_enabled && watchdog_thresh) {
watchdog_disable_all_cpus();
watchdog_enable_all_cpus();
}
Please see proc_watchdog_update() here which is similar to the above.
https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git/tree/kernel/watchdog.c?id=refs/tags/next-20150402#n710
Regards,
Uli
From: Chris Metcalf <[email protected]>
These changes allow the watchdog to work cleanly with nohz_full
by default, and to be configurable if desired to enable the watchdog
on cores that would normally disable it due to being nohz_full.
Thomas, does the addition of an exclude_mask to smp_hotplug_thread
meet with your approval? It seems like a pretty clean extension
and enables the desired watchdog functionality for nohz_full pretty
nicely, I think.
Uli, this version is based on the linux-next tree, which includes
your recent refactoring changes; I've made all your suggested changes.
Don and/or Uli, do you want to give your Acked-by to the watchdog change?
Frederic, I guess you could push it through your nohz tree? Or would
it make sense for Andrew to take this into his tree, which looks like
what happened with Uli/Don's earlier watchdog changes? I'm assuming my
pushing it through the arch/tile tree is probably not the best way to go.
Chris Metcalf (2):
smpboot: allow excluding cpus from the smpboot threads
watchdog: add watchdog_exclude sysctl to assist nohz
Documentation/lockup-watchdogs.txt | 6 ++++++
Documentation/sysctl/kernel.txt | 9 +++++++++
include/linux/nmi.h | 3 +++
include/linux/smpboot.h | 2 ++
kernel/smpboot.c | 11 ++++++++---
kernel/sysctl.c | 7 +++++++
kernel/watchdog.c | 33 +++++++++++++++++++++++++++++++++
7 files changed, 68 insertions(+), 3 deletions(-)
--
2.1.2
From: Chris Metcalf <[email protected]>
This change allows some cores to be excluded from running the
smp_hotplug_thread tasks. The motivating example for this is
the watchdog threads, which by default we don't want to run
on any enabled nohz_full cores.
Signed-off-by: Chris Metcalf <[email protected]>
---
include/linux/smpboot.h | 2 ++
kernel/smpboot.c | 11 ++++++++---
2 files changed, 10 insertions(+), 3 deletions(-)
diff --git a/include/linux/smpboot.h b/include/linux/smpboot.h
index d600afb21926..de2f64a98108 100644
--- a/include/linux/smpboot.h
+++ b/include/linux/smpboot.h
@@ -27,6 +27,7 @@ struct smpboot_thread_data;
* @pre_unpark: Optional unpark function, called before the thread is
* unparked (cpu online). This is not guaranteed to be
* called on the target cpu of the thread. Careful!
+ * @exclude_mask: Optional cpumask, specifying cores to exclude.
* @selfparking: Thread is not parked by the park function.
* @thread_comm: The base name of the thread
*/
@@ -41,6 +42,7 @@ struct smp_hotplug_thread {
void (*park)(unsigned int cpu);
void (*unpark)(unsigned int cpu);
void (*pre_unpark)(unsigned int cpu);
+ cpumask_t *exclude_mask;
bool selfparking;
const char *thread_comm;
};
diff --git a/kernel/smpboot.c b/kernel/smpboot.c
index c697f73d82d6..8adff4f817fc 100644
--- a/kernel/smpboot.c
+++ b/kernel/smpboot.c
@@ -173,6 +173,9 @@ __smpboot_create_thread(struct smp_hotplug_thread *ht, unsigned int cpu)
if (tsk)
return 0;
+ if (ht->exclude_mask && cpumask_test_cpu(cpu, ht->exclude_mask))
+ return 0;
+
td = kzalloc_node(sizeof(*td), GFP_KERNEL, cpu_to_node(cpu));
if (!td)
return -ENOMEM;
@@ -221,9 +224,11 @@ static void smpboot_unpark_thread(struct smp_hotplug_thread *ht, unsigned int cpu)
{
struct task_struct *tsk = *per_cpu_ptr(ht->store, cpu);
- if (ht->pre_unpark)
- ht->pre_unpark(cpu);
- kthread_unpark(tsk);
+ if (tsk) {
+ if (ht->pre_unpark)
+ ht->pre_unpark(cpu);
+ kthread_unpark(tsk);
+ }
}
void smpboot_unpark_threads(unsigned int cpu)
--
2.1.2
From: Chris Metcalf <[email protected]>
Change the default behavior of watchdog so it only runs on the
housekeeping cores when nohz_full is enabled at build and boot time.
Allow modifying the set of cores the watchdog is currently running
on with a new kernel.watchdog_exclude sysctl.
Signed-off-by: Chris Metcalf <[email protected]>
---
Documentation/lockup-watchdogs.txt | 6 ++++++
Documentation/sysctl/kernel.txt | 9 +++++++++
include/linux/nmi.h | 3 +++
kernel/sysctl.c | 7 +++++++
kernel/watchdog.c | 33 +++++++++++++++++++++++++++++++++
5 files changed, 58 insertions(+)
diff --git a/Documentation/lockup-watchdogs.txt b/Documentation/lockup-watchdogs.txt
index ab0baa692c13..4f86aec1d69d 100644
--- a/Documentation/lockup-watchdogs.txt
+++ b/Documentation/lockup-watchdogs.txt
@@ -61,3 +61,9 @@ As explained above, a kernel knob is provided that allows
administrators to configure the period of the hrtimer and the perf
event. The right value for a particular environment is a trade-off
between fast response to lockups and detection overhead.
+
+By default, the watchdog runs on all online cores. However, on a
+kernel configured with NO_HZ_FULL, by default the watchdog runs only
+on the housekeeping cores, not the cores specified in the "nohz_full"
+boot argument. In either case, the set of cores excluded from running
+the watchdog may be adjusted via the kernel.watchdog_exclude sysctl.
diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index c831001c45f1..799a1fee3f26 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -923,6 +923,15 @@ and nmi_watchdog.
==============================================================
+watchdog_exclude:
+
+This value can be used to control on which cpus the watchdog is
+prohibited from running. The default exclude mask is empty, but if
+NO_HZ_FULL is enabled in the kernel config, and cores are specified
+with the nohz_full= boot argument, those cores are excluded by default.
+
+==============================================================
+
watchdog_thresh:
This value can be used to control the frequency of hrtimer and NMI
diff --git a/include/linux/nmi.h b/include/linux/nmi.h
index 3d46fb4708e0..5094386b4fb1 100644
--- a/include/linux/nmi.h
+++ b/include/linux/nmi.h
@@ -67,6 +67,7 @@ extern int nmi_watchdog_enabled;
extern int soft_watchdog_enabled;
extern int watchdog_user_enabled;
extern int watchdog_thresh;
+extern unsigned long *watchdog_exclude_mask_bits;
extern int sysctl_softlockup_all_cpu_backtrace;
struct ctl_table;
extern int proc_watchdog(struct ctl_table *, int ,
@@ -77,6 +78,8 @@ extern int proc_soft_watchdog(struct ctl_table *, int ,
void __user *, size_t *, loff_t *);
extern int proc_watchdog_thresh(struct ctl_table *, int ,
void __user *, size_t *, loff_t *);
+extern int proc_watchdog_exclude(struct ctl_table *, int,
+ void __user *, size_t *, loff_t *);
#endif
#ifdef CONFIG_HAVE_ACPI_APEI_NMI
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 2082b1a88fb9..b934a4a01f0a 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -881,6 +881,13 @@ static struct ctl_table kern_table[] = {
.extra2 = &one,
},
{
+ .procname = "watchdog_exclude",
+ .data = &watchdog_exclude_mask_bits,
+ .maxlen = NR_CPUS,
+ .mode = 0644,
+ .proc_handler = proc_watchdog_exclude,
+ },
+ {
.procname = "softlockup_panic",
.data = &softlockup_panic,
.maxlen = sizeof(int),
diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index f2be11ab7e08..3f4fbb208437 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -19,6 +19,7 @@
#include <linux/sysctl.h>
#include <linux/smpboot.h>
#include <linux/sched/rt.h>
+#include <linux/tick.h>
#include <asm/irq_regs.h>
#include <linux/kvm_para.h>
@@ -56,6 +57,8 @@ int __read_mostly sysctl_softlockup_all_cpu_backtrace;
#else
#define sysctl_softlockup_all_cpu_backtrace 0
#endif
+static cpumask_var_t watchdog_exclude_mask;
+unsigned long *watchdog_exclude_mask_bits;
static int __read_mostly watchdog_running;
static u64 __read_mostly sample_period;
@@ -841,12 +844,42 @@ out:
mutex_unlock(&watchdog_proc_mutex);
return err;
}
+
+int proc_watchdog_exclude(struct ctl_table *table, int write,
+ void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+ int err;
+
+ mutex_lock(&watchdog_proc_mutex);
+ err = proc_do_large_bitmap(table, write, buffer, lenp, ppos);
+ if (!err && write && watchdog_enabled && watchdog_thresh) {
+ watchdog_disable_all_cpus();
+ watchdog_enable_all_cpus();
+ }
+ mutex_unlock(&watchdog_proc_mutex);
+ return err;
+}
+
#endif /* CONFIG_SYSCTL */
void __init lockup_detector_init(void)
{
set_sample_period();
+ alloc_bootmem_cpumask_var(&watchdog_exclude_mask);
+ watchdog_threads.exclude_mask = watchdog_exclude_mask;
+
+#ifdef CONFIG_NO_HZ_FULL
+ if (!cpumask_empty(tick_nohz_full_mask))
+ pr_info("Disabling watchdog on nohz_full cores by default\n");
+ cpumask_copy(watchdog_exclude_mask, tick_nohz_full_mask);
+#else
+ cpumask_clear(watchdog_exclude_mask);
+#endif
+
+ /* The sysctl API requires a variable holding a pointer to the mask. */
+ watchdog_exclude_mask_bits = cpumask_bits(watchdog_exclude_mask);
+
if (watchdog_enabled)
watchdog_enable_all_cpus();
}
--
2.1.2
On Mon, Apr 06, 2015 at 03:45:56PM -0400, [email protected] wrote:
> From: Chris Metcalf <[email protected]>
>
> Change the default behavior of watchdog so it only runs on the
> housekeeping cores when nohz_full is enabled at build and boot time.
>
> Allow modifying the set of cores the watchdog is currently running
> on with a new kernel.watchdog_exclude sysctl.
Assuming the first patch gets accepted, this implementation looks fine by
me.
Acked-by: Don Zickus <[email protected]>
>
> Signed-off-by: Chris Metcalf <[email protected]>
> ---
> Documentation/lockup-watchdogs.txt | 6 ++++++
> Documentation/sysctl/kernel.txt | 9 +++++++++
> include/linux/nmi.h | 3 +++
> kernel/sysctl.c | 7 +++++++
> kernel/watchdog.c | 33 +++++++++++++++++++++++++++++++++
> 5 files changed, 58 insertions(+)
>
> diff --git a/Documentation/lockup-watchdogs.txt b/Documentation/lockup-watchdogs.txt
> index ab0baa692c13..4f86aec1d69d 100644
> --- a/Documentation/lockup-watchdogs.txt
> +++ b/Documentation/lockup-watchdogs.txt
> @@ -61,3 +61,9 @@ As explained above, a kernel knob is provided that allows
> administrators to configure the period of the hrtimer and the perf
> event. The right value for a particular environment is a trade-off
> between fast response to lockups and detection overhead.
> +
> +By default, the watchdog runs on all online cores. However, on a
> +kernel configured with NO_HZ_FULL, by default the watchdog runs only
> +on the housekeeping cores, not the cores specified in the "nohz_full"
> +boot argument. In either case, the set of cores excluded from running
> +the watchdog may be adjusted via the kernel.watchdog_exclude sysctl.
> diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
> index c831001c45f1..799a1fee3f26 100644
> --- a/Documentation/sysctl/kernel.txt
> +++ b/Documentation/sysctl/kernel.txt
> @@ -923,6 +923,15 @@ and nmi_watchdog.
>
> ==============================================================
>
> +watchdog_exclude:
> +
> +This value can be used to control on which cpus the watchdog is
> +prohibited from running. The default exclude mask is empty, but if
> +NO_HZ_FULL is enabled in the kernel config, and cores are specified
> +with the nohz_full= boot argument, those cores are excluded by default.
> +
> +==============================================================
> +
> watchdog_thresh:
>
> This value can be used to control the frequency of hrtimer and NMI
> diff --git a/include/linux/nmi.h b/include/linux/nmi.h
> index 3d46fb4708e0..5094386b4fb1 100644
> --- a/include/linux/nmi.h
> +++ b/include/linux/nmi.h
> @@ -67,6 +67,7 @@ extern int nmi_watchdog_enabled;
> extern int soft_watchdog_enabled;
> extern int watchdog_user_enabled;
> extern int watchdog_thresh;
> +extern unsigned long *watchdog_exclude_mask_bits;
> extern int sysctl_softlockup_all_cpu_backtrace;
> struct ctl_table;
> extern int proc_watchdog(struct ctl_table *, int ,
> @@ -77,6 +78,8 @@ extern int proc_soft_watchdog(struct ctl_table *, int ,
> void __user *, size_t *, loff_t *);
> extern int proc_watchdog_thresh(struct ctl_table *, int ,
> void __user *, size_t *, loff_t *);
> +extern int proc_watchdog_exclude(struct ctl_table *, int,
> + void __user *, size_t *, loff_t *);
> #endif
>
> #ifdef CONFIG_HAVE_ACPI_APEI_NMI
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index 2082b1a88fb9..b934a4a01f0a 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -881,6 +881,13 @@ static struct ctl_table kern_table[] = {
> .extra2 = &one,
> },
> {
> + .procname = "watchdog_exclude",
> + .data = &watchdog_exclude_mask_bits,
> + .maxlen = NR_CPUS,
> + .mode = 0644,
> + .proc_handler = proc_watchdog_exclude,
> + },
> + {
> .procname = "softlockup_panic",
> .data = &softlockup_panic,
> .maxlen = sizeof(int),
> diff --git a/kernel/watchdog.c b/kernel/watchdog.c
> index f2be11ab7e08..3f4fbb208437 100644
> --- a/kernel/watchdog.c
> +++ b/kernel/watchdog.c
> @@ -19,6 +19,7 @@
> #include <linux/sysctl.h>
> #include <linux/smpboot.h>
> #include <linux/sched/rt.h>
> +#include <linux/tick.h>
>
> #include <asm/irq_regs.h>
> #include <linux/kvm_para.h>
> @@ -56,6 +57,8 @@ int __read_mostly sysctl_softlockup_all_cpu_backtrace;
> #else
> #define sysctl_softlockup_all_cpu_backtrace 0
> #endif
> +static cpumask_var_t watchdog_exclude_mask;
> +unsigned long *watchdog_exclude_mask_bits;
>
> static int __read_mostly watchdog_running;
> static u64 __read_mostly sample_period;
> @@ -841,12 +844,42 @@ out:
> mutex_unlock(&watchdog_proc_mutex);
> return err;
> }
> +
> +int proc_watchdog_exclude(struct ctl_table *table, int write,
> + void __user *buffer, size_t *lenp, loff_t *ppos)
> +{
> + int err;
> +
> + mutex_lock(&watchdog_proc_mutex);
> + err = proc_do_large_bitmap(table, write, buffer, lenp, ppos);
> + if (!err && write && watchdog_enabled && watchdog_thresh) {
> + watchdog_disable_all_cpus();
> + watchdog_enable_all_cpus();
> + }
> + mutex_unlock(&watchdog_proc_mutex);
> + return err;
> +}
> +
> #endif /* CONFIG_SYSCTL */
>
> void __init lockup_detector_init(void)
> {
> set_sample_period();
>
> + alloc_bootmem_cpumask_var(&watchdog_exclude_mask);
> + watchdog_threads.exclude_mask = watchdog_exclude_mask;
> +
> +#ifdef CONFIG_NO_HZ_FULL
> + if (!cpumask_empty(tick_nohz_full_mask))
> + pr_info("Disabling watchdog on nohz_full cores by default\n");
> + cpumask_copy(watchdog_exclude_mask, tick_nohz_full_mask);
> +#else
> + cpumask_clear(watchdog_exclude_mask);
> +#endif
> +
> + /* The sysctl API requires a variable holding a pointer to the mask. */
> + watchdog_exclude_mask_bits = cpumask_bits(watchdog_exclude_mask);
> +
> if (watchdog_enabled)
> watchdog_enable_all_cpus();
> }
> --
> 2.1.2
>
On 04/06/2015 03:45 PM, [email protected] wrote:
> void __init lockup_detector_init(void)
> {
> set_sample_period();
>
> + alloc_bootmem_cpumask_var(&watchdog_exclude_mask);
This happens pretty late in the boot process and should just be:
alloc_cpumask_var(&watchdog_exclude_mask, GFP_KERNEL);
Thanks,
Sasha
On 04/07/2015 11:56 AM, Sasha Levin wrote:
> On 04/06/2015 03:45 PM, [email protected] wrote:
>> void __init lockup_detector_init(void)
>> {
>> set_sample_period();
>>
>> + alloc_bootmem_cpumask_var(&watchdog_exclude_mask);
> This happens pretty late in the boot process and should just be:
>
> alloc_cpumask_var(&watchdog_exclude_mask, GFP_KERNEL);
Thanks; fixed. It will be in the v6 patch, which should be the final
one, I think.
--
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com
On Mon, Apr 06, 2015 at 03:45:55PM -0400, [email protected] wrote:
> From: Chris Metcalf <[email protected]>
>
> This change allows some cores to be excluded from running the
> smp_hotplug_thread tasks. The motivating example for this is
> the watchdog threads, which by default we don't want to run
> on any enabled nohz_full cores.
>
> Signed-off-by: Chris Metcalf <[email protected]>
> ---
> include/linux/smpboot.h | 2 ++
> kernel/smpboot.c | 11 ++++++++---
> 2 files changed, 10 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/smpboot.h b/include/linux/smpboot.h
> index d600afb21926..de2f64a98108 100644
> --- a/include/linux/smpboot.h
> +++ b/include/linux/smpboot.h
> @@ -27,6 +27,7 @@ struct smpboot_thread_data;
> * @pre_unpark: Optional unpark function, called before the thread is
> * unparked (cpu online). This is not guaranteed to be
> * called on the target cpu of the thread. Careful!
> + * @exclude_mask: Optional cpumask, specifying cores to exclude.
> * @selfparking: Thread is not parked by the park function.
> * @thread_comm: The base name of the thread
> */
> @@ -41,6 +42,7 @@ struct smp_hotplug_thread {
> void (*park)(unsigned int cpu);
> void (*unpark)(unsigned int cpu);
> void (*pre_unpark)(unsigned int cpu);
> + cpumask_t *exclude_mask;
The usual pattern for cpumasks is to use them as affinity values instead
of non-affinity values.
Thanks.
On Mon, Apr 06, 2015 at 03:45:56PM -0400, [email protected] wrote:
> From: Chris Metcalf <[email protected]>
>
> Change the default behavior of watchdog so it only runs on the
> housekeeping cores when nohz_full is enabled at build and boot time.
>
> Allow modifying the set of cores the watchdog is currently running
> on with a new kernel.watchdog_exclude sysctl.
>
> Signed-off-by: Chris Metcalf <[email protected]>
> ---
> Documentation/lockup-watchdogs.txt | 6 ++++++
> Documentation/sysctl/kernel.txt | 9 +++++++++
> include/linux/nmi.h | 3 +++
> kernel/sysctl.c | 7 +++++++
> kernel/watchdog.c | 33 +++++++++++++++++++++++++++++++++
> 5 files changed, 58 insertions(+)
>
> diff --git a/Documentation/lockup-watchdogs.txt b/Documentation/lockup-watchdogs.txt
> index ab0baa692c13..4f86aec1d69d 100644
> --- a/Documentation/lockup-watchdogs.txt
> +++ b/Documentation/lockup-watchdogs.txt
> @@ -61,3 +61,9 @@ As explained above, a kernel knob is provided that allows
> administrators to configure the period of the hrtimer and the perf
> event. The right value for a particular environment is a trade-off
> between fast response to lockups and detection overhead.
> +
> +By default, the watchdog runs on all online cores. However, on a
> +kernel configured with NO_HZ_FULL, by default the watchdog runs only
> +on the housekeeping cores, not the cores specified in the "nohz_full"
> +boot argument. In either case, the set of cores excluded from running
> +the watchdog may be adjusted via the kernel.watchdog_exclude sysctl.
> diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
> index c831001c45f1..799a1fee3f26 100644
> --- a/Documentation/sysctl/kernel.txt
> +++ b/Documentation/sysctl/kernel.txt
> @@ -923,6 +923,15 @@ and nmi_watchdog.
>
> ==============================================================
>
> +watchdog_exclude:
> +
> +This value can be used to control on which cpus the watchdog is
> +prohibited from running. The default exclude mask is empty, but if
> +NO_HZ_FULL is enabled in the kernel config, and cores are specified
> +with the nohz_full= boot argument, those cores are excluded by default.
> +
> +==============================================================
Same here: cpumasks are usually used as inclusive rather than exclusive sets.
So I'd rather see "watchdog_cpumask".
> +
> watchdog_thresh:
>
> This value can be used to control the frequency of hrtimer and NMI
> diff --git a/include/linux/nmi.h b/include/linux/nmi.h
> index 3d46fb4708e0..5094386b4fb1 100644
> --- a/include/linux/nmi.h
> +++ b/include/linux/nmi.h
> @@ -67,6 +67,7 @@ extern int nmi_watchdog_enabled;
> extern int soft_watchdog_enabled;
> extern int watchdog_user_enabled;
> extern int watchdog_thresh;
> +extern unsigned long *watchdog_exclude_mask_bits;
> extern int sysctl_softlockup_all_cpu_backtrace;
> struct ctl_table;
> extern int proc_watchdog(struct ctl_table *, int ,
> @@ -77,6 +78,8 @@ extern int proc_soft_watchdog(struct ctl_table *, int ,
> void __user *, size_t *, loff_t *);
> extern int proc_watchdog_thresh(struct ctl_table *, int ,
> void __user *, size_t *, loff_t *);
> +extern int proc_watchdog_exclude(struct ctl_table *, int,
> + void __user *, size_t *, loff_t *);
> #endif
>
> #ifdef CONFIG_HAVE_ACPI_APEI_NMI
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index 2082b1a88fb9..b934a4a01f0a 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -881,6 +881,13 @@ static struct ctl_table kern_table[] = {
> .extra2 = &one,
> },
> {
> + .procname = "watchdog_exclude",
> + .data = &watchdog_exclude_mask_bits,
> + .maxlen = NR_CPUS,
> + .mode = 0644,
> + .proc_handler = proc_watchdog_exclude,
> + },
> + {
> .procname = "softlockup_panic",
> .data = &softlockup_panic,
> .maxlen = sizeof(int),
> diff --git a/kernel/watchdog.c b/kernel/watchdog.c
> index f2be11ab7e08..3f4fbb208437 100644
> --- a/kernel/watchdog.c
> +++ b/kernel/watchdog.c
> @@ -19,6 +19,7 @@
> #include <linux/sysctl.h>
> #include <linux/smpboot.h>
> #include <linux/sched/rt.h>
> +#include <linux/tick.h>
>
> #include <asm/irq_regs.h>
> #include <linux/kvm_para.h>
> @@ -56,6 +57,8 @@ int __read_mostly sysctl_softlockup_all_cpu_backtrace;
> #else
> #define sysctl_softlockup_all_cpu_backtrace 0
> #endif
> +static cpumask_var_t watchdog_exclude_mask;
> +unsigned long *watchdog_exclude_mask_bits;
>
> static int __read_mostly watchdog_running;
> static u64 __read_mostly sample_period;
> @@ -841,12 +844,42 @@ out:
> mutex_unlock(&watchdog_proc_mutex);
> return err;
> }
> +
> +int proc_watchdog_exclude(struct ctl_table *table, int write,
> + void __user *buffer, size_t *lenp, loff_t *ppos)
> +{
> + int err;
> +
> + mutex_lock(&watchdog_proc_mutex);
> + err = proc_do_large_bitmap(table, write, buffer, lenp, ppos);
> + if (!err && write && watchdog_enabled && watchdog_thresh) {
> + watchdog_disable_all_cpus();
> + watchdog_enable_all_cpus();
The problem is that it modifies watchdog_threads.exclude_mask, and watchdog_threads
is a live object handled by smpboot. It happens not to be racy now because smpboot
only checks this cpumask at per-cpu thread register/unregister time, but that could
change and become racy in the future.
How about creating smpboot_update_mask_percpu_thread() and handling it from smpboot?
That way future evolutions of smpboot won't overlook live changes to this cpumask.
Thanks.
> + }
> + mutex_unlock(&watchdog_proc_mutex);
> + return err;
> +}
> +
> #endif /* CONFIG_SYSCTL */
>
> void __init lockup_detector_init(void)
> {
> set_sample_period();
>
> + alloc_bootmem_cpumask_var(&watchdog_exclude_mask);
> + watchdog_threads.exclude_mask = watchdog_exclude_mask;
> +
> +#ifdef CONFIG_NO_HZ_FULL
> + if (!cpumask_empty(tick_nohz_full_mask))
> + pr_info("Disabling watchdog on nohz_full cores by default\n");
> + cpumask_copy(watchdog_exclude_mask, tick_nohz_full_mask);
> +#else
> + cpumask_clear(watchdog_exclude_mask);
> +#endif
> +
> + /* The sysctl API requires a variable holding a pointer to the mask. */
> + watchdog_exclude_mask_bits = cpumask_bits(watchdog_exclude_mask);
> +
> if (watchdog_enabled)
> watchdog_enable_all_cpus();
> }
> --
> 2.1.2
>
On 04/08/2015 09:28 AM, Frederic Weisbecker wrote:
> On Mon, Apr 06, 2015 at 03:45:55PM -0400, [email protected] wrote:
>> From: Chris Metcalf <[email protected]>
>>
>> This change allows some cores to be excluded from running the
>> smp_hotplug_thread tasks. The motivating example for this is
>> the watchdog threads, which by default we don't want to run
>> on any enabled nohz_full cores.
>>
>> Signed-off-by: Chris Metcalf <[email protected]>
>> ---
>> include/linux/smpboot.h | 2 ++
>> kernel/smpboot.c | 11 ++++++++---
>> 2 files changed, 10 insertions(+), 3 deletions(-)
>>
>> diff --git a/include/linux/smpboot.h b/include/linux/smpboot.h
>> index d600afb21926..de2f64a98108 100644
>> --- a/include/linux/smpboot.h
>> +++ b/include/linux/smpboot.h
>> @@ -27,6 +27,7 @@ struct smpboot_thread_data;
>> * @pre_unpark: Optional unpark function, called before the thread is
>> * unparked (cpu online). This is not guaranteed to be
>> * called on the target cpu of the thread. Careful!
>> + * @exclude_mask: Optional cpumask, specifying cores to exclude.
>> * @selfparking: Thread is not parked by the park function.
>> * @thread_comm: The base name of the thread
>> */
>> @@ -41,6 +42,7 @@ struct smp_hotplug_thread {
>> void (*park)(unsigned int cpu);
>> void (*unpark)(unsigned int cpu);
>> void (*pre_unpark)(unsigned int cpu);
>> + cpumask_t *exclude_mask;
> The usual pattern for cpumasks is to use them as affinity values instead
> of non-affinity values.
Yes. The issue here is that as cpus come and go from the hotplug set,
the ones that we want to exclude remain fixed. If we do it the way you
propose (and it's the way I originally did it), it means that if a new cpu
comes online you automatically treat it as nohz_full, which seems wrong
to me. I suppose we could add another callback so that the
smp_hotplug_thread struct could explicitly decide how to mark any
new cpu that comes online, but that all seems more complicated than
my final suggestion.
What do you think?
--
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com
On Wed, Apr 08, 2015 at 10:06:44AM -0400, Chris Metcalf wrote:
> On 04/08/2015 09:28 AM, Frederic Weisbecker wrote:
> >On Mon, Apr 06, 2015 at 03:45:55PM -0400, [email protected] wrote:
> >>From: Chris Metcalf <[email protected]>
> >>
> >>This change allows some cores to be excluded from running the
> >>smp_hotplug_thread tasks. The motivating example for this is
> >>the watchdog threads, which by default we don't want to run
> >>on any enabled nohz_full cores.
> >>
> >>Signed-off-by: Chris Metcalf <[email protected]>
> >>---
> >> include/linux/smpboot.h | 2 ++
> >> kernel/smpboot.c | 11 ++++++++---
> >> 2 files changed, 10 insertions(+), 3 deletions(-)
> >>
> >>diff --git a/include/linux/smpboot.h b/include/linux/smpboot.h
> >>index d600afb21926..de2f64a98108 100644
> >>--- a/include/linux/smpboot.h
> >>+++ b/include/linux/smpboot.h
> >>@@ -27,6 +27,7 @@ struct smpboot_thread_data;
> >> * @pre_unpark: Optional unpark function, called before the thread is
> >> * unparked (cpu online). This is not guaranteed to be
> >> * called on the target cpu of the thread. Careful!
> >>+ * @exclude_mask: Optional cpumask, specifying cores to exclude.
> >> * @selfparking: Thread is not parked by the park function.
> >> * @thread_comm: The base name of the thread
> >> */
> >>@@ -41,6 +42,7 @@ struct smp_hotplug_thread {
> >> void (*park)(unsigned int cpu);
> >> void (*unpark)(unsigned int cpu);
> >> void (*pre_unpark)(unsigned int cpu);
> >>+ cpumask_t *exclude_mask;
> >The usual pattern for cpumasks is to use them as affinity values instead
> >of non-affinity values.
>
> Yes. The issue here is that as cpus come and go from the hotplug set,
> the ones that we want to exclude remain fixed. If we do it the way you
> propose (and it's the way I originally did it), it means that if a new cpu
> comes online you automatically treat it as nohz_full, which seems wrong
> to me. I suppose we could add another callback so that the
> smp_hotplug_thread struct could explicitly decide how to mark any
> new cpu that comes online, but that all seems more complicated than
> my final suggestion.
>
> What do you think?
No; cpumasks are allocated to handle any cpu from cpu_possible_mask.
So imagine that CPU 1 is offline and CPU 0 is online. It's perfectly
fine to write 0x3 to the cpumask, which means the thread is affine to both;
if CPU 1 is brought online later, the smpboot subsystem takes care of it.
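For concreteness, the userspace side of that 0x3 example might look like the
sketch below. The only assumptions are that the interface lands as
/proc/sys/kernel/watchdog_cpumask (the rename suggested earlier) and that it
keeps proc_do_large_bitmap()'s cpu-list syntax, where "0-1" is the list form
of 0x3:

/* Sketch: set the watchdog cpumask from userspace; CPU 1 may be offline. */
#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/proc/sys/kernel/watchdog_cpumask", "w");

	if (!f) {
		perror("watchdog_cpumask");
		return 1;
	}
	/* cpu-list form of 0x3; smpboot unparks on CPU 1 if it comes online */
	fprintf(f, "0-1\n");
	return fclose(f) != 0;
}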
From: Chris Metcalf <[email protected]>
This change allows some cores to be excluded from running the
smp_hotplug_thread tasks. The motivating example for this is
the watchdog threads, which by default we don't want to run
on any enabled nohz_full cores.
Signed-off-by: Chris Metcalf <[email protected]>
---
v6: change from an "exclude" data pointer to a more generic
valid_cpu() callback [Frederic]
include/linux/smpboot.h | 3 +++
kernel/smpboot.c | 11 ++++++++---
2 files changed, 11 insertions(+), 3 deletions(-)
diff --git a/include/linux/smpboot.h b/include/linux/smpboot.h
index d600afb21926..4648a4576ae4 100644
--- a/include/linux/smpboot.h
+++ b/include/linux/smpboot.h
@@ -27,6 +27,8 @@ struct smpboot_thread_data;
* @pre_unpark: Optional unpark function, called before the thread is
* unparked (cpu online). This is not guaranteed to be
* called on the target cpu of the thread. Careful!
+ * @valid_cpu: Optional function, called when creating the threads,
+ * to limit the set of cpus on which threads are created.
* @selfparking: Thread is not parked by the park function.
* @thread_comm: The base name of the thread
*/
@@ -41,6 +43,7 @@ struct smp_hotplug_thread {
void (*park)(unsigned int cpu);
void (*unpark)(unsigned int cpu);
void (*pre_unpark)(unsigned int cpu);
+ int (*valid_cpu)(unsigned int cpu);
bool selfparking;
const char *thread_comm;
};
diff --git a/kernel/smpboot.c b/kernel/smpboot.c
index c697f73d82d6..6ffc2dacb94a 100644
--- a/kernel/smpboot.c
+++ b/kernel/smpboot.c
@@ -173,6 +173,9 @@ __smpboot_create_thread(struct smp_hotplug_thread *ht, unsigned int cpu)
if (tsk)
return 0;
+ if (ht->valid_cpu && !ht->valid_cpu(cpu))
+ return 0;
+
td = kzalloc_node(sizeof(*td), GFP_KERNEL, cpu_to_node(cpu));
if (!td)
return -ENOMEM;
@@ -221,9 +224,11 @@ static void smpboot_unpark_thread(struct smp_hotplug_thread *ht, unsigned int cp
{
struct task_struct *tsk = *per_cpu_ptr(ht->store, cpu);
- if (ht->pre_unpark)
- ht->pre_unpark(cpu);
- kthread_unpark(tsk);
+ if (tsk) {
+ if (ht->pre_unpark)
+ ht->pre_unpark(cpu);
+ kthread_unpark(tsk);
+ }
}
void smpboot_unpark_threads(unsigned int cpu)
--
2.1.2
From: Chris Metcalf <[email protected]>
Change the default behavior of watchdog so it only runs on the
housekeeping cores when nohz_full is enabled at build and boot time.
Allow modifying the set of cores the watchdog is currently running
on with a new kernel.watchdog_cpumask sysctl.
Acked-by: Don Zickus <[email protected]>
Signed-off-by: Chris Metcalf <[email protected]>
---
v6: use alloc_cpumask_var() [Sasha Levin]
switch from watchdog_exclude to watchdog_cpumask [Frederic]
simplify the smp_hotplug_thread API to watchdog [Frederic]
add Don's Acked-by
Documentation/lockup-watchdogs.txt | 6 ++++
Documentation/sysctl/kernel.txt | 11 ++++++++
include/linux/nmi.h | 3 ++
kernel/sysctl.c | 7 +++++
kernel/watchdog.c | 56 ++++++++++++++++++++++++++++++++++++++
5 files changed, 83 insertions(+)
diff --git a/Documentation/lockup-watchdogs.txt b/Documentation/lockup-watchdogs.txt
index ab0baa692c13..31c312853d4c 100644
--- a/Documentation/lockup-watchdogs.txt
+++ b/Documentation/lockup-watchdogs.txt
@@ -61,3 +61,9 @@ As explained above, a kernel knob is provided that allows
administrators to configure the period of the hrtimer and the perf
event. The right value for a particular environment is a trade-off
between fast response to lockups and detection overhead.
+
+By default, the watchdog runs on all online cores. However, on a
+kernel configured with NO_HZ_FULL, by default the watchdog runs only
+on the housekeeping cores, not the cores specified in the "nohz_full"
+boot argument. In either case, the set of cores excluded from running
+the watchdog may be adjusted via the kernel.watchdog_cpumask sysctl.
diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index c831001c45f1..f6a9dca8c100 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -923,6 +923,17 @@ and nmi_watchdog.
==============================================================
+watchdog_cpumask:
+
+This value can be used to control on which cpus the watchdog may run.
+The default cpumask is all possible cores, but if NO_HZ_FULL is
+enabled in the kernel config, and cores are specified with the
+nohz_full= boot argument, those cores are excluded by default.
+Offline cores can be included in this mask, and if the core is later
+brought online, the watchdog will be started based on the mask value.
+
+==============================================================
+
watchdog_thresh:
This value can be used to control the frequency of hrtimer and NMI
diff --git a/include/linux/nmi.h b/include/linux/nmi.h
index 3d46fb4708e0..f94da0e65dea 100644
--- a/include/linux/nmi.h
+++ b/include/linux/nmi.h
@@ -67,6 +67,7 @@ extern int nmi_watchdog_enabled;
extern int soft_watchdog_enabled;
extern int watchdog_user_enabled;
extern int watchdog_thresh;
+extern unsigned long *watchdog_cpumask_bits;
extern int sysctl_softlockup_all_cpu_backtrace;
struct ctl_table;
extern int proc_watchdog(struct ctl_table *, int ,
@@ -77,6 +78,8 @@ extern int proc_soft_watchdog(struct ctl_table *, int ,
void __user *, size_t *, loff_t *);
extern int proc_watchdog_thresh(struct ctl_table *, int ,
void __user *, size_t *, loff_t *);
+extern int proc_watchdog_cpumask(struct ctl_table *, int,
+ void __user *, size_t *, loff_t *);
#endif
#ifdef CONFIG_HAVE_ACPI_APEI_NMI
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 2082b1a88fb9..699571a74e3b 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -881,6 +881,13 @@ static struct ctl_table kern_table[] = {
.extra2 = &one,
},
{
+ .procname = "watchdog_cpumask",
+ .data = &watchdog_cpumask_bits,
+ .maxlen = NR_CPUS,
+ .mode = 0644,
+ .proc_handler = proc_watchdog_cpumask,
+ },
+ {
.procname = "softlockup_panic",
.data = &softlockup_panic,
.maxlen = sizeof(int),
diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 2316f50b07a4..fc0a90684639 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -19,6 +19,7 @@
#include <linux/sysctl.h>
#include <linux/smpboot.h>
#include <linux/sched/rt.h>
+#include <linux/tick.h>
#include <asm/irq_regs.h>
#include <linux/kvm_para.h>
@@ -56,6 +57,8 @@ int __read_mostly sysctl_softlockup_all_cpu_backtrace;
#else
#define sysctl_softlockup_all_cpu_backtrace 0
#endif
+static cpumask_var_t watchdog_cpumask;
+unsigned long *watchdog_cpumask_bits;
static int __read_mostly watchdog_running;
static u64 __read_mostly sample_period;
@@ -477,6 +480,16 @@ static int watchdog_should_run(unsigned int cpu)
}
/*
+ * This test is not serialized with updates to watchdog_cpumask,
+ * but when the update is complete we will disable and re-enable all
+ * the watchdog threads anyway.
+ */
+static int watchdog_valid_cpu(unsigned int cpu)
+{
+ return cpumask_test_cpu(cpu, watchdog_cpumask);
+}
+
+/*
* The watchdog thread function - touches the timestamp.
*
* It only runs once every sample_period seconds (4 seconds by
@@ -645,6 +658,7 @@ static struct smp_hotplug_thread watchdog_threads = {
.cleanup = watchdog_cleanup,
.park = watchdog_disable,
.unpark = watchdog_enable,
+ .valid_cpu = watchdog_valid_cpu,
};
static void restart_watchdog_hrtimer(void *info)
@@ -869,12 +883,54 @@ out:
mutex_unlock(&watchdog_proc_mutex);
return err;
}
+
+/*
+ * The cpumask is the mask of possible cpus that the watchdog can run
+ * on, not the mask of cpus it is actually running on. This allows the
+ * user to specify a mask that will include cpus that have not yet
+ * been brought online, if desired.
+ */
+int proc_watchdog_cpumask(struct ctl_table *table, int write,
+ void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+ int err;
+
+ mutex_lock(&watchdog_proc_mutex);
+ err = proc_do_large_bitmap(table, write, buffer, lenp, ppos);
+ if (!err && write) {
+ /* Remove impossible cpus to keep sysctl output cleaner. */
+ cpumask_and(watchdog_cpumask, watchdog_cpumask,
+ cpu_possible_mask);
+
+ if (watchdog_enabled && watchdog_thresh) {
+ watchdog_disable_all_cpus();
+ watchdog_enable_all_cpus();
+ }
+ }
+ mutex_unlock(&watchdog_proc_mutex);
+ return err;
+}
+
#endif /* CONFIG_SYSCTL */
void __init lockup_detector_init(void)
{
set_sample_period();
+ alloc_cpumask_var(&watchdog_cpumask, GFP_KERNEL);
+
+#ifdef CONFIG_NO_HZ_FULL
+ if (!cpumask_empty(tick_nohz_full_mask))
+ pr_info("Disabling watchdog on nohz_full cores by default\n");
+ cpumask_andnot(watchdog_cpumask, cpu_possible_mask,
+ tick_nohz_full_mask);
+#else
+ cpumask_copy(watchdog_cpumask, cpu_possible_mask);
+#endif
+
+ /* The sysctl API requires a variable holding a pointer to the mask. */
+ watchdog_cpumask_bits = cpumask_bits(watchdog_cpumask);
+
if (watchdog_enabled)
watchdog_enable_all_cpus();
}
--
2.1.2
On 04/08/2015 10:01 AM, Frederic Weisbecker wrote:
> How about creating smpboot_update_mask_percpu_thread() and handle it from smpboot,
> this way future evolutions of smpboot won't overlook this cpumask live change?
It seemed like your proposed approach was actually a bit heavier-weight
from the perspective of generic smp_hotplug_thread, so instead I just
modified the proposed API to have a simple "valid_cpu()" callback,
which I think is clear and won't be damaged by smpboot evolution.
Let me know what you think.
--
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com
On Wed, 8 Apr 2015, [email protected] wrote:
> @@ -173,6 +173,9 @@ __smpboot_create_thread(struct smp_hotplug_thread *ht, unsigned int cpu)
> if (tsk)
> return 0;
>
> + if (ht->valid_cpu && !ht->valid_cpu(cpu))
> + return 0;
> +
> td = kzalloc_node(sizeof(*td), GFP_KERNEL, cpu_to_node(cpu));
> if (!td)
> return -ENOMEM;
> @@ -221,9 +224,11 @@ static void smpboot_unpark_thread(struct smp_hotplug_thread *ht, unsigned int cp
> {
> struct task_struct *tsk = *per_cpu_ptr(ht->store, cpu);
>
> - if (ht->pre_unpark)
> - ht->pre_unpark(cpu);
> - kthread_unpark(tsk);
> + if (tsk) {
> + if (ht->pre_unpark)
> + ht->pre_unpark(cpu);
> + kthread_unpark(tsk);
> + }
This is a very watchdog-centric implementation. The watchdog actually
can afford to unregister/register the per-cpu threads when the mask
changes, but other facilities might not.
So I'd like to see the threads still created, with the valid-cpu
check in the unpark function. That way you can just park/unpark a
particular per-cpu thread when the mask changes, and you are not forced
to tear down/re-enable the whole facility.
Thanks,
tglx
On Wed, Apr 08, 2015 at 03:21:00PM -0400, Chris Metcalf wrote:
> On 04/08/2015 10:01 AM, Frederic Weisbecker wrote:
> >How about creating smpboot_update_mask_percpu_thread() and handle it from smpboot,
> >this way future evolutions of smpboot won't overlook this cpumask live change?
>
> It seemed like your proposed approach was actually a bit heavier-weight
> from the perspective of generic smp_hotplug_thread, so instead I just
> modified the proposed API to have a simple "valid_cpu()" callback,
> which I think is clear and won't be damaged by smpboot evolution.
> Let me know what you think.
You mean keep the cpumask private to the watchdog and implement valid_cpu()
on top of it?
Well, this just pulls all the complexity into the smpboot client instead of
the smpboot subsystem. If you implement it in smpboot, it will be reusable
for other smpboot threads besides the watchdog.
Eventually if you take into account Thomas review that we should rather park
cpu threads that aren't included in the cpumask, this should look like this
(warning: totally untested):
diff --git a/kernel/smpboot.c b/kernel/smpboot.c
index 40190f2..01bfb51 100644
--- a/kernel/smpboot.c
+++ b/kernel/smpboot.c
@@ -230,8 +230,10 @@ void smpboot_unpark_threads(unsigned int cpu)
struct smp_hotplug_thread *cur;
mutex_lock(&smpboot_threads_lock);
- list_for_each_entry(cur, &hotplug_threads, list)
- smpboot_unpark_thread(cur, cpu);
+ list_for_each_entry(cur, &hotplug_threads, list) {
+ if (cpumask_test_cpu(cpu, cur->cpumask))
+ smpboot_unpark_thread(cur, cpu);
+ }
mutex_unlock(&smpboot_threads_lock);
}
@@ -288,7 +290,8 @@ int smpboot_register_percpu_thread(struct smp_hotplug_thread *plug_thread)
smpboot_destroy_threads(plug_thread);
goto out;
}
- smpboot_unpark_thread(plug_thread, cpu);
+ if (cpumask_test_cpu(cpu, plug_thread->cpumask))
+ smpboot_unpark_thread(plug_thread, cpu);
}
list_add(&plug_thread->list, &hotplug_threads);
out:
@@ -298,6 +301,41 @@ out:
}
EXPORT_SYMBOL_GPL(smpboot_register_percpu_thread);
+int smpboot_update_cpumask_percpu_thread(struct smp_hotplug_thread *plug_thread,
+ cpumask_var_t new)
+{
+ cpumask_var_t tmp;
+ unsigned int cpu;
+
+ if (!alloc_cpumask_var(&tmp, GFP_KERNEL))
+ return -ENOMEM;
+
+ get_online_cpus();
+ mutex_lock(&smpboot_threads_lock);
+
+ /* Park those that were exclusively enabled on old mask */
+ cpumask_andnot(tmp, plug_thread->cpumask, new);
+ for_each_cpu(cpu, tmp)
+ smpboot_park_thread(plug_thread, cpu);
+
+ /* Unpark those that are exclusively enabled on new mask */
+ cpumask_andnot(tmp, new, plug_thread->cpumask);
+ for_each_cpu(cpu, tmp) {
+ if (cpu_online(cpu))
+ smpboot_unpark_thread(plug_thread, cpu);
+ }
+ cpumask_copy(plug_thread->cpumask, new);
+
+ mutex_unlock(&smpboot_threads_lock);
+ put_online_cpus();
+
+ free_cpumask_var(tmp);
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(smpboot_update_cpumask_percpu_thread);
+
/**
* smpboot_unregister_percpu_thread - Unregister a per_cpu thread related to hotplug
* @plug_thread: Hotplug thread descriptor
This change allows some cores to be excluded from running the
smp_hotplug_thread tasks. The motivating example for this is
the watchdog threads, which by default we don't want to run
on any enabled nohz_full cores.
A new smp_hotplug_thread field is introduced, "valid_cpu", which
is an optional pointer to a function that returns per-cpu whether
or not the given smp_hotplug_thread should run on that core; the
function is called when deciding whether to unpark the thread.
If a change is made to which cpus are valid, the
smpboot_repark_percpu_thread() function should be called and
threads will be suitably parked and unparked.
Signed-off-by: Chris Metcalf <[email protected]>
---
Thomas, how does this look? If this seems about right, I'll fold
in your feedback and put out a patch set that includes the matching
changes to the watchdog and, if Frederic will take it for the
nohz queue to the timer tree, send it up that way. This is just
compile-tested so far since I have to wrap up for the time being
and head home. Final patch will actually be tested :-)
I took Frederic's suggested patch from a 10,000 foot viewpoint and
modified it to stick with the valid_cpu() callback approach.
p.s. I think the smpboot_thread_schedule() declaration in
linux/smpboot.h is dead; there doesn't seem to be a definition.
include/linux/smpboot.h | 4 ++++
kernel/smpboot.c | 37 +++++++++++++++++++++++++++++++++----
2 files changed, 37 insertions(+), 4 deletions(-)
diff --git a/include/linux/smpboot.h b/include/linux/smpboot.h
index 13e929679550..7dedbf92420e 100644
--- a/include/linux/smpboot.h
+++ b/include/linux/smpboot.h
@@ -27,6 +27,8 @@ struct smpboot_thread_data;
* @pre_unpark: Optional unpark function, called before the thread is
* unparked (cpu online). This is not guaranteed to be
* called on the target cpu of the thread. Careful!
+ * @valid_cpu: Optional function, called when unparking the threads,
+ * to limit the set of cpus on which threads are unparked.
* @selfparking: Thread is not parked by the park function.
* @thread_comm: The base name of the thread
*/
@@ -41,12 +43,14 @@ struct smp_hotplug_thread {
void (*park)(unsigned int cpu);
void (*unpark)(unsigned int cpu);
void (*pre_unpark)(unsigned int cpu);
+ int (*valid_cpu)(unsigned int cpu);
bool selfparking;
const char *thread_comm;
};
int smpboot_register_percpu_thread(struct smp_hotplug_thread *plug_thread);
void smpboot_unregister_percpu_thread(struct smp_hotplug_thread *plug_thread);
+void smpboot_repark_percpu_thread(struct smp_hotplug_thread *plug_thread);
int smpboot_thread_schedule(void);
#endif
diff --git a/kernel/smpboot.c b/kernel/smpboot.c
index 40190f28db35..c7dd768a4599 100644
--- a/kernel/smpboot.c
+++ b/kernel/smpboot.c
@@ -218,11 +218,13 @@ int smpboot_create_threads(unsigned int cpu)
static void smpboot_unpark_thread(struct smp_hotplug_thread *ht, unsigned int cpu)
{
- struct task_struct *tsk = *per_cpu_ptr(ht->store, cpu);
+ if (!ht->valid_cpu || ht->valid_cpu(cpu)) {
+ struct task_struct *tsk = *per_cpu_ptr(ht->store, cpu);
- if (ht->pre_unpark)
- ht->pre_unpark(cpu);
- kthread_unpark(tsk);
+ if (ht->pre_unpark)
+ ht->pre_unpark(cpu);
+ kthread_unpark(tsk);
+ }
}
void smpboot_unpark_threads(unsigned int cpu)
@@ -314,3 +316,30 @@ void smpboot_unregister_percpu_thread(struct smp_hotplug_thread *plug_thread)
put_online_cpus();
}
EXPORT_SYMBOL_GPL(smpboot_unregister_percpu_thread);
+
+/**
+ * smpboot_repark_percpu_thread - Adjust which per_cpu hotplug threads stay parked
+ * @plug_thread: Hotplug thread descriptor
+ *
+ * After changing what the valid_cpu() callback will return, call this
+ * function to let appropriate threads park and unpark.
+ */
+void smpboot_repark_percpu_thread(struct smp_hotplug_thread *plug_thread)
+{
+ unsigned int cpu;
+
+ if (!plug_thread->valid_cpu)
+ return;
+
+ get_online_cpus();
+ mutex_lock(&smpboot_threads_lock);
+ for_each_online_cpu(cpu) {
+ if (plug_thread->valid_cpu(cpu))
+ smpboot_unpark_thread(plug_thread, cpu);
+ else
+ smpboot_park_thread(plug_thread, cpu);
+ }
+ mutex_unlock(&smpboot_threads_lock);
+ put_online_cpus();
+}
+EXPORT_SYMBOL_GPL(smpboot_repark_percpu_thread);
--
2.1.2
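For the record, the matching watchdog caller isn't posted with this revision.
A minimal sketch, assuming the sysctl-handler shape used elsewhere in this
thread (the body here is a guess, not a posted patch), might be:

/* Sketch only: drive the repark API from the watchdog sysctl handler. */
int proc_watchdog_cpumask(struct ctl_table *table, int write,
			  void __user *buffer, size_t *lenp, loff_t *ppos)
{
	int err;

	mutex_lock(&watchdog_proc_mutex);
	err = proc_do_large_bitmap(table, write, buffer, lenp, ppos);
	if (!err && write && watchdog_enabled && watchdog_thresh)
		/* the mask valid_cpu() consults has changed; repark */
		smpboot_repark_percpu_thread(&watchdog_threads);
	mutex_unlock(&watchdog_proc_mutex);
	return err;
}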
On Thu, Apr 09, 2015 at 04:29:01PM -0400, Chris Metcalf wrote:
> This change allows some cores to be excluded from running the
> smp_hotplug_thread tasks. The motivating example for this is
> the watchdog threads, which by default we don't want to run
> on any enabled nohz_full cores.
>
> A new smp_hotplug_thread field is introduced, "valid_cpu", which
> is an optional pointer to a function that returns per-cpu whether
> or not the given smp_hotplug_thread should run on that core; the
> function is called when deciding whether to unpark the thread.
>
> If a change is made to which cpus are valid, the
> smpboot_repark_percpu_thread() function should be called and
> threads will be suitably parked and unparked.
>
> Signed-off-by: Chris Metcalf <[email protected]>
> ---
> Thomas, how does this look? If this seems about right, I'll fold
> in your feedback and put out a patch set that includes the matching
> changes to the watchdog and, if Frederic will take it for the
> nohz queue to the timer tree, send it up that way. This is just
> compile-tested so far since I have to wrap up for the time being
> and head home. Final patch will actually be tested :-)
>
> I took Frederic's suggested patch from a 10,000 foot viewpoint and
> modified it to stick with the valid_cpu() callback approach.
>
> p.s. I think the smpboot_thread_schedule() declaration in
> linux/smpboot.h is dead; there doesn't seem to be a definition.
>
> include/linux/smpboot.h | 4 ++++
> kernel/smpboot.c | 37 +++++++++++++++++++++++++++++++++----
> 2 files changed, 37 insertions(+), 4 deletions(-)
>
> diff --git a/include/linux/smpboot.h b/include/linux/smpboot.h
> index 13e929679550..7dedbf92420e 100644
> --- a/include/linux/smpboot.h
> +++ b/include/linux/smpboot.h
> @@ -27,6 +27,8 @@ struct smpboot_thread_data;
> * @pre_unpark: Optional unpark function, called before the thread is
> * unparked (cpu online). This is not guaranteed to be
> * called on the target cpu of the thread. Careful!
> + * @valid_cpu: Optional function, called when unparking the threads,
> + * to limit the set of cpus on which threads are unparked.
I'm not sure why this needs to be a callback instead of a pointer to a cpumask
that smpboot can handle by itself. In fact I don't understand why you want to stick with
this valid_cpu() approach.
> * @selfparking: Thread is not parked by the park function.
> * @thread_comm: The base name of the thread
> */
> @@ -41,12 +43,14 @@ struct smp_hotplug_thread {
> void (*park)(unsigned int cpu);
> void (*unpark)(unsigned int cpu);
> void (*pre_unpark)(unsigned int cpu);
> + int (*valid_cpu)(unsigned int cpu);
> bool selfparking;
> const char *thread_comm;
> };
>
> int smpboot_register_percpu_thread(struct smp_hotplug_thread *plug_thread);
> void smpboot_unregister_percpu_thread(struct smp_hotplug_thread *plug_thread);
> +void smpboot_repark_percpu_thread(struct smp_hotplug_thread *plug_thread);
> int smpboot_thread_schedule(void);
>
> #endif
> diff --git a/kernel/smpboot.c b/kernel/smpboot.c
> index 40190f28db35..c7dd768a4599 100644
> --- a/kernel/smpboot.c
> +++ b/kernel/smpboot.c
> @@ -218,11 +218,13 @@ int smpboot_create_threads(unsigned int cpu)
>
> static void smpboot_unpark_thread(struct smp_hotplug_thread *ht, unsigned int cpu)
> {
> - struct task_struct *tsk = *per_cpu_ptr(ht->store, cpu);
> + if (!ht->valid_cpu || ht->valid_cpu(cpu)) {
> + struct task_struct *tsk = *per_cpu_ptr(ht->store, cpu);
>
> - if (ht->pre_unpark)
> - ht->pre_unpark(cpu);
> - kthread_unpark(tsk);
> + if (ht->pre_unpark)
> + ht->pre_unpark(cpu);
> + kthread_unpark(tsk);
> + }
> }
>
> void smpboot_unpark_threads(unsigned int cpu)
> @@ -314,3 +316,30 @@ void smpboot_unregister_percpu_thread(struct smp_hotplug_thread *plug_thread)
> put_online_cpus();
> }
> EXPORT_SYMBOL_GPL(smpboot_unregister_percpu_thread);
> +
> +/**
> + * smpboot_repark_percpu_thread - Adjust which per_cpu hotplug threads stay parked
> + * @plug_thread: Hotplug thread descriptor
> + *
> + * After changing what the valid_cpu() callback will return, call this
> + * function to let appropriate threads park and unpark.
> + */
> +void smpboot_repark_percpu_thread(struct smp_hotplug_thread *plug_thread)
That looks to me like a bit of an unnecessarily indirect way to say "update cpumask".
> +{
> + unsigned int cpu;
> +
> + if (!plug_thread->valid_cpu)
> + return;
> +
> + get_online_cpus();
> + mutex_lock(&smpboot_threads_lock);
> + for_each_online_cpu(cpu) {
> + if (plug_thread->valid_cpu(cpu))
> + smpboot_unpark_thread(plug_thread, cpu);
> + else
> + smpboot_park_thread(plug_thread, cpu);
> + }
> + mutex_unlock(&smpboot_threads_lock);
> + put_online_cpus();
> +}
> +EXPORT_SYMBOL_GPL(smpboot_repark_percpu_thread);
> --
> 2.1.2
>
On 04/09/2015 09:58 PM, Frederic Weisbecker wrote:
> On Thu, Apr 09, 2015 at 04:29:01PM -0400, Chris Metcalf wrote:
>> --- a/include/linux/smpboot.h
>> +++ b/include/linux/smpboot.h
>> @@ -27,6 +27,8 @@ struct smpboot_thread_data;
>> * @pre_unpark: Optional unpark function, called before the thread is
>> * unparked (cpu online). This is not guaranteed to be
>> * called on the target cpu of the thread. Careful!
>> + * @valid_cpu: Optional function, called when unparking the threads,
>> + * to limit the set of cpus on which threads are unparked.
> I'm not sure why this needs to be a callback instead of a pointer to a cpumask
> that smpboot can handle by itself. In fact I don't understand why you want to stick with
> this valid_cpu() approach.
I stuck with it since Thomas mentioned valid_cpu() as part of his earlier
suggestion to just park/unpark the threads, so I was assuming he had
a preference for that approach.
The problem with the code you provided, as I see it, is that the cpumask
field being kept in the struct smp_hotplug_thread is awkward to
initialize while keeping the default that it doesn't have to be mentioned
in the initializer for the client's structure. To make this work, in the
register function you have to check for a NULL pointer (for OFFSTACK)
and then allocate and initialize to cpu_possible_mask, but in the
!OFFSTACK case you could just require that an empty mask really means
cpu_possible_mask, which seems like an unfortunate overloading.
Or, you can add an extra bool that says "hey, the cpumask is valid",
and that at least makes the register function's work unambiguous.
But, you then never consult that bool field again, which seems a little
odd as part of the published API structure.
Or, we could create a new register function just for use with clients
that want to specify the cpumask at registration time, though that seems
a little clumsy.
Or, we could say that you can't set the cpumask at registration time,
but only by later calling the update_cpumask function. But this seems
somewhat unfortunate too, particularly since "cpumask" is sitting right
there in a structure where every other field is controlled by the client.
Or, we can go back to my original suggestion of a cpumask pointer.
You raised the issue of potential racing between client cpumask updates
and smpboot subsystem updates, but I think it's a red herring -- basically,
if the client sets/clears a bit while a cpu is coming online, it's
unspecified whether that cpu ends up with a thread or not; but we don't
really care, because the client ends up calling the "update_cpumask"
function after we're done updating, and that forces all the threads to be
properly parked or unparked.
The last option seems like the cleanest if you prefer using "struct
cpumask *" rather than a valid_cpu function pointer. But let me spin a
version of the patch using "struct cpumask *" and you and Thomas can
chime in with which one you prefer (or if you prefer a different model).
--
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com
This change allows some cores to be excluded from running the
smp_hotplug_thread tasks. The motivating example for this is
the watchdog threads, which by default we don't want to run
on any enabled nohz_full cores.
A new smp_hotplug_thread field is introduced, "cpumask", which
is an optional pointer to a cpumask that indicates whether
or not the given smp_hotplug_thread should run on that core; the
cpumask is checked when deciding whether to unpark the thread.
If a change is made to the cpumask, the
smpboot_update_cpumask_percpu_thread() function should be called and
threads will be suitably parked and unparked.
Signed-off-by: Chris Metcalf <[email protected]>
---
I don't know why it is necessary to explicitly unpark the threads in
smpboot_destroy_threads() before destroying them. We can't even do it
in the same for_each_possible_cpu() loop as the kthread_stop() call;
it appears all the threads on online cores must be unparked prior to
trying to stop all the threads, or the system hangs.
Also, one unexpected consequence of leaving threads in TASK_PARKED
state is that we can actually see them in /proc! This isn't normally
true since we usually just park them briefly during cpu offlining.
/proc/NNN/stat{,us} reports the parked threads as "R (running)", even
though they are waiting in __kthread_parkme(). I proposed a fix for
this in a new patch 3/3 in this patch series.
In all honesty, I'm still fond of the model where we just do_exit(0)
the threads that we don't want, as soon as registration creates them.
(We can still support the watchdog_cpumask sysctl easily enough.)
This doesn't add any new semantics to smpboot (for good or for bad),
and more importantly it makes it clear that we don't have watchdog
tasks running on the nohz_full cores, which otherwise are going to
make people continually wonder what's going on until they carefully
read the code. But I'm OK with the direction laid out in this patch
if it's the consensus preferred model.
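For concreteness, that do_exit(0) model might be sketched as below. The hook
point is an assumption: watchdog_enable() is the .unpark callback, which
smpboot invokes from the per-cpu thread itself, so the thread can simply
exit on cores where it should never run:

/* Sketch of the do_exit(0) model: the thread exits on excluded cores. */
static void watchdog_enable(unsigned int cpu)
{
	if (!cpumask_test_cpu(cpu, watchdog_cpumask))
		do_exit(0);	/* never run on nohz_full/excluded cores */
	/* ... existing hrtimer and perf counter setup ... */
}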
v7: change from valid_cpu() callback to optional cpumask field
park smpboot threads rather than just not creating them
v6: change from an "exclude" data pointer to a more generic
valid_cpu() callback [Frederic]
v5: use alloc_cpumask_var() [Sasha Levin]
switch from watchdog_exclude to watchdog_cpumask [Frederic]
simplify the smp_hotplug_thread API to watchdog [Frederic]
add Don's Acked-by
include/linux/smpboot.h | 6 ++++++
kernel/smpboot.c | 42 ++++++++++++++++++++++++++++++++++++++----
2 files changed, 44 insertions(+), 4 deletions(-)
diff --git a/include/linux/smpboot.h b/include/linux/smpboot.h
index d600afb21926..fb9ed92201a5 100644
--- a/include/linux/smpboot.h
+++ b/include/linux/smpboot.h
@@ -27,6 +27,10 @@ struct smpboot_thread_data;
* @pre_unpark: Optional unpark function, called before the thread is
* unparked (cpu online). This is not guaranteed to be
* called on the target cpu of the thread. Careful!
+ * @cpumask: Optional pointer to a set of possible cores to
+ * allow threads to come unparked on.
+ * You must call smpboot_update_cpumask_percpu_thread()
+ * after any updates to the pointed-to mask.
* @selfparking: Thread is not parked by the park function.
* @thread_comm: The base name of the thread
*/
@@ -41,11 +45,13 @@ struct smp_hotplug_thread {
void (*park)(unsigned int cpu);
void (*unpark)(unsigned int cpu);
void (*pre_unpark)(unsigned int cpu);
+ struct cpumask *cpumask;
bool selfparking;
const char *thread_comm;
};
int smpboot_register_percpu_thread(struct smp_hotplug_thread *plug_thread);
void smpboot_unregister_percpu_thread(struct smp_hotplug_thread *plug_thread);
+void smpboot_update_cpumask_percpu_thread(struct smp_hotplug_thread *plug_thread);
#endif
diff --git a/kernel/smpboot.c b/kernel/smpboot.c
index c697f73d82d6..12bd9b57a682 100644
--- a/kernel/smpboot.c
+++ b/kernel/smpboot.c
@@ -219,11 +219,13 @@ int smpboot_create_threads(unsigned int cpu)
static void smpboot_unpark_thread(struct smp_hotplug_thread *ht, unsigned int cpu)
{
- struct task_struct *tsk = *per_cpu_ptr(ht->store, cpu);
+ if (ht->cpumask == NULL || cpumask_test_cpu(cpu, ht->cpumask)) {
+ struct task_struct *tsk = *per_cpu_ptr(ht->store, cpu);
- if (ht->pre_unpark)
- ht->pre_unpark(cpu);
- kthread_unpark(tsk);
+ if (ht->pre_unpark)
+ ht->pre_unpark(cpu);
+ kthread_unpark(tsk);
+ }
}
void smpboot_unpark_threads(unsigned int cpu)
@@ -258,6 +260,13 @@ static void smpboot_destroy_threads(struct smp_hotplug_thread *ht)
{
unsigned int cpu;
+ /* Unpark any threads that were voluntarily parked. */
+ if (ht->cpumask) {
+ for_each_online_cpu(cpu)
+ if (!cpumask_test_cpu(cpu, ht->cpumask))
+ kthread_unpark(*per_cpu_ptr(ht->store, cpu));
+ }
+
/* We need to destroy also the parked threads of offline cpus */
for_each_possible_cpu(cpu) {
struct task_struct *tsk = *per_cpu_ptr(ht->store, cpu);
@@ -316,6 +325,31 @@ void smpboot_unregister_percpu_thread(struct smp_hotplug_thread *plug_thread)
}
EXPORT_SYMBOL_GPL(smpboot_unregister_percpu_thread);
+/**
+ * smpboot_update_cpumask_percpu_thread - Adjust which per_cpu hotplug threads stay parked
+ * @plug_thread: Hotplug thread descriptor
+ *
+ * After changing any bits in the mask pointed to by "cpumask", call this
+ * function to let appropriate threads park and unpark.
+ */
+void smpboot_update_cpumask_percpu_thread(struct smp_hotplug_thread *plug_thread)
+{
+ unsigned int cpu;
+
+ get_online_cpus();
+ mutex_lock(&smpboot_threads_lock);
+ for_each_online_cpu(cpu) {
+ if (plug_thread->cpumask == NULL ||
+ cpumask_test_cpu(cpu, plug_thread->cpumask))
+ smpboot_unpark_thread(plug_thread, cpu);
+ else
+ smpboot_park_thread(plug_thread, cpu);
+ }
+ mutex_unlock(&smpboot_threads_lock);
+ put_online_cpus();
+}
+EXPORT_SYMBOL_GPL(smpboot_update_cpumask_percpu_thread);
+
static DEFINE_PER_CPU(atomic_t, cpu_hotplug_state) = ATOMIC_INIT(CPU_POST_DEAD);
/*
--
2.1.2
Change the default behavior of watchdog so it only runs on the
housekeeping cores when nohz_full is enabled at build and boot time.
Allow modifying the set of cores the watchdog is currently running
on with a new kernel.watchdog_cpumask sysctl.
Acked-by: Don Zickus <[email protected]>
Signed-off-by: Chris Metcalf <[email protected]>
---
Documentation/lockup-watchdogs.txt | 6 ++++++
Documentation/sysctl/kernel.txt | 11 ++++++++++
include/linux/nmi.h | 3 +++
kernel/sysctl.c | 7 ++++++
kernel/watchdog.c | 44 ++++++++++++++++++++++++++++++++++++++
5 files changed, 71 insertions(+)
diff --git a/Documentation/lockup-watchdogs.txt b/Documentation/lockup-watchdogs.txt
index ab0baa692c13..31c312853d4c 100644
--- a/Documentation/lockup-watchdogs.txt
+++ b/Documentation/lockup-watchdogs.txt
@@ -61,3 +61,9 @@ As explained above, a kernel knob is provided that allows
administrators to configure the period of the hrtimer and the perf
event. The right value for a particular environment is a trade-off
between fast response to lockups and detection overhead.
+
+By default, the watchdog runs on all online cores. However, on a
+kernel configured with NO_HZ_FULL, by default the watchdog runs only
+on the housekeeping cores, not the cores specified in the "nohz_full"
+boot argument. In either case, the set of cores excluded from running
+the watchdog may be adjusted via the kernel.watchdog_cpumask sysctl.
diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index c831001c45f1..f6a9dca8c100 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -923,6 +923,17 @@ and nmi_watchdog.
==============================================================
+watchdog_cpumask:
+
+This value can be used to control on which cpus the watchdog may run.
+The default cpumask is all possible cores, but if NO_HZ_FULL is
+enabled in the kernel config, and cores are specified with the
+nohz_full= boot argument, those cores are excluded by default.
+Offline cores can be included in this mask, and if the core is later
+brought online, the watchdog will be started based on the mask value.
+
+==============================================================
+
watchdog_thresh:
This value can be used to control the frequency of hrtimer and NMI
diff --git a/include/linux/nmi.h b/include/linux/nmi.h
index 3d46fb4708e0..f94da0e65dea 100644
--- a/include/linux/nmi.h
+++ b/include/linux/nmi.h
@@ -67,6 +67,7 @@ extern int nmi_watchdog_enabled;
extern int soft_watchdog_enabled;
extern int watchdog_user_enabled;
extern int watchdog_thresh;
+extern unsigned long *watchdog_cpumask_bits;
extern int sysctl_softlockup_all_cpu_backtrace;
struct ctl_table;
extern int proc_watchdog(struct ctl_table *, int ,
@@ -77,6 +78,8 @@ extern int proc_soft_watchdog(struct ctl_table *, int ,
void __user *, size_t *, loff_t *);
extern int proc_watchdog_thresh(struct ctl_table *, int ,
void __user *, size_t *, loff_t *);
+extern int proc_watchdog_cpumask(struct ctl_table *, int,
+ void __user *, size_t *, loff_t *);
#endif
#ifdef CONFIG_HAVE_ACPI_APEI_NMI
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 2082b1a88fb9..699571a74e3b 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -881,6 +881,13 @@ static struct ctl_table kern_table[] = {
.extra2 = &one,
},
{
+ .procname = "watchdog_cpumask",
+ .data = &watchdog_cpumask_bits,
+ .maxlen = NR_CPUS,
+ .mode = 0644,
+ .proc_handler = proc_watchdog_cpumask,
+ },
+ {
.procname = "softlockup_panic",
.data = &softlockup_panic,
.maxlen = sizeof(int),
diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 2316f50b07a4..2199f1f0b5a5 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -19,6 +19,7 @@
#include <linux/sysctl.h>
#include <linux/smpboot.h>
#include <linux/sched/rt.h>
+#include <linux/tick.h>
#include <asm/irq_regs.h>
#include <linux/kvm_para.h>
@@ -56,6 +57,8 @@ int __read_mostly sysctl_softlockup_all_cpu_backtrace;
#else
#define sysctl_softlockup_all_cpu_backtrace 0
#endif
+static cpumask_var_t watchdog_cpumask;
+unsigned long *watchdog_cpumask_bits;
static int __read_mostly watchdog_running;
static u64 __read_mostly sample_period;
@@ -869,12 +872,53 @@ out:
mutex_unlock(&watchdog_proc_mutex);
return err;
}
+
+/*
+ * The cpumask is the mask of possible cpus that the watchdog can run
+ * on, not the mask of cpus it is actually running on. This allows the
+ * user to specify a mask that will include cpus that have not yet
+ * been brought online, if desired.
+ */
+int proc_watchdog_cpumask(struct ctl_table *table, int write,
+ void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+ int err;
+
+ mutex_lock(&watchdog_proc_mutex);
+ err = proc_do_large_bitmap(table, write, buffer, lenp, ppos);
+ if (!err && write) {
+ /* Remove impossible cpus to keep sysctl output cleaner. */
+ cpumask_and(watchdog_cpumask, watchdog_cpumask,
+ cpu_possible_mask);
+
+ if (watchdog_enabled && watchdog_thresh)
+ smpboot_update_cpumask_percpu_thread(&watchdog_threads);
+ }
+ mutex_unlock(&watchdog_proc_mutex);
+ return err;
+}
+
#endif /* CONFIG_SYSCTL */
void __init lockup_detector_init(void)
{
set_sample_period();
+ alloc_cpumask_var(&watchdog_cpumask, GFP_KERNEL);
+ watchdog_threads.cpumask = watchdog_cpumask;
+
+#ifdef CONFIG_NO_HZ_FULL
+ if (!cpumask_empty(tick_nohz_full_mask))
+ pr_info("Disabling watchdog on nohz_full cores by default\n");
+ cpumask_andnot(watchdog_cpumask, cpu_possible_mask,
+ tick_nohz_full_mask);
+#else
+ cpumask_copy(watchdog_cpumask, cpu_possible_mask);
+#endif
+
+ /* The sysctl API requires a variable holding a pointer to the mask. */
+ watchdog_cpumask_bits = cpumask_bits(watchdog_cpumask);
+
if (watchdog_enabled)
watchdog_enable_all_cpus();
}
--
2.1.2
Allowing watchdog threads to be parked means that we now have the
opportunity of actually seeing persistently parked threads in the output
of /proc's stat and status files. The existing code reported such
threads as "Running", which is kind-of true if you think of the case
where we park them as part of taking cpus offline. But if we allow
parking them indefinitely, "Running" is pretty misleading, so we report
them as "Sleeping" instead.
We could simply report them with a new string, "Parked", but it feels
like it's a bit risky for userspace to see unexpected new values.
The scheduler does report parked tasks with a "P" in debugging output
from sched_show_task() or dump_cpu_task(), but that's a different API.
This change seemed slightly cleaner than updating the task_state_array
to have additional rows. TASK_DEAD should be subsumed by the exit_state
bits; TASK_WAKEKILL is just a modifier; and TASK_WAKING can very
reasonably be reported as "Running" (as it is now). Only TASK_PARKED
shows up with unreasonable output here.
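For reference, this is the array being indexed (contents per the 4.0-era
fs/proc/array.c, reproduced here from memory and only as an illustration):

	static const char * const task_state_array[] = {
		"R (running)",		/* fls(0x00) == 0 */
		"S (sleeping)",		/* fls(0x01) == 1 */
		"D (disk sleep)",	/* fls(0x02) == 2 */
		"T (stopped)",		/* fls(0x04) == 3 */
		"t (tracing stop)",	/* fls(0x08) == 4 */
		"X (dead)",		/* fls(0x10) == 5 */
		"Z (zombie)",		/* fls(0x20) == 6 */
	};

Mapping TASK_PARKED onto TASK_INTERRUPTIBLE thus reports parked threads
as "S (sleeping)".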
Signed-off-by: Chris Metcalf <[email protected]>
---
fs/proc/array.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/fs/proc/array.c b/fs/proc/array.c
index a3893b7505b2..2eb623ffb0b7 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -126,6 +126,10 @@ static inline const char *get_task_state(struct task_struct *tsk)
{
unsigned int state = (tsk->state | tsk->exit_state) & TASK_REPORT;
+ /* Treat parked tasks as sleeping. */
+ if (tsk->state == TASK_PARKED)
+ state = TASK_INTERRUPTIBLE;
+
BUILD_BUG_ON(1 + ilog2(TASK_REPORT) != ARRAY_SIZE(task_state_array)-1);
return task_state_array[fls(state)];
--
2.1.2
On Fri, 10 Apr 2015 16:48:18 -0400 Chris Metcalf <[email protected]> wrote:
> This change allows some cores to be excluded from running the
> smp_hotplug_thread tasks. The motivating example for this is
> the watchdog threads, which by default we don't want to run
> on any enabled nohz_full cores.
Why not?
I can guess, but I'd rather not guess. Please fully explain the
end-user value of this change. Providing a benefit to users is the
whole point of the patchset, but the above assertion is the only
description we have.
This info should be in Documentation/lockup-watchdogs.txt and/or
Documentation/sysctl/kernel.txt as well as the changelogs, so users
have an answer to "why the heck should I enable this".
Please also describe the downside of the change. I assume this is
"lockups will go undetected on some CPUs"? Let's expand on this so we
can understand where the best tradeoff point lies.
If people are experiencing <whatever this problem is> then they can
disable the watchdog altogether. What value is there in this partial
disabling? Why is it worth doing this?
On Fri, Apr 10, 2015 at 12:33:38PM -0400, Chris Metcalf wrote:
> On 04/09/2015 09:58 PM, Frederic Weisbecker wrote:
> >On Thu, Apr 09, 2015 at 04:29:01PM -0400, Chris Metcalf wrote:
> >>--- a/include/linux/smpboot.h
> >>+++ b/include/linux/smpboot.h
> >>@@ -27,6 +27,8 @@ struct smpboot_thread_data;
> >> * @pre_unpark: Optional unpark function, called before the thread is
> >> * unparked (cpu online). This is not guaranteed to be
> >> * called on the target cpu of the thread. Careful!
> >>+ * @valid_cpu: Optional function, called when unparking the threads,
> >>+ * to limit the set of cpus on which threads are unparked.
> >I'm not sure why this needs to be a callback instead of a pointer to a cpumask
> >that smpboot can handle by itself. In fact I don't understand why you want to stick with
> >this valid_cpu() approach.
>
> I stuck with it since Thomas mentioned valid_cpu() as part of his earlier
> suggestion to just park/unpark the threads, so I was assuming he had
> a preference for that approach.
Hmm, that's not quite what he mentioned. He suggested checking whether the
CPU is valid in the unpark function; that didn't necessarily imply doing it
through a callback.
>
> The problem with the code you provided, as I see it, is that the cpumask
> field being kept in the struct smp_hotplug_thread is awkward to
> initialize while keeping the default that it doesn't have to be mentioned
> in the initializer for the client's structure. To make this work, in the
> register function you have to check for a NULL pointer (for OFFSTACK)
> and then allocate and initialize to cpu_possible_mask, but in the
> !OFFSTACK case you could just require that an empty mask really means
> cpu_possible_mask, which seems like an unfortunate overloading.
If the field is of type "struct cpumask *", just checking NULL is enough.
I don't think OFFSTACK changes anything. This only changes the allocation
on the client side. But the pointer passed to the "struct smp_hotplug_thread"
is the same and that's all transparent to the smpboot subsystem.
Also if the cpumask is NULL on that struct (default), let the smpboot
subsystem attribute cpu_possible_mask to it (no need to allocate a copy).
Well this could even not be overwritten and handled by smpboot_thread_unpark()
itself.
Thanks.
On 04/10/2015 05:11 PM, Andrew Morton wrote:
> On Fri, 10 Apr 2015 16:48:18 -0400 Chris Metcalf <[email protected]> wrote:
>
>> This change allows some cores to be excluded from running the
>> smp_hotplug_thread tasks. The motivating example for this is
>> the watchdog threads, which by default we don't want to run
>> on any enabled nohz_full cores.
> Why not?
Thanks for the feedback. It's easy to assume everyone knows
everything about what's being done in the kernel :-)
I'll add some more descriptive language around what the point
of nohz_full is, and why the watchdog interferes with it, in v8.
>
> I can guess, but I'd rather not guess. Please fully explain the
> end-user value of this change. Providing a benefit to users is the
> whole point of the patchset, but the above assertion is the only
> description we have.
>
> This info should be in Documentation/lockup-watchdogs.txt and/or
> Documentation/sysctl/kernel.txt as well as the changelogs, so users
> have an answer to "why the heck should I enable this".
>
> Please also describe the downside of the change. I assume this is
> "lockups will go undetected on some CPUs"? Let's expand on this so we
> can understand where the best tradeoff point lies.
>
> If people are experiencing <whatever this problem is> then they can
> disable the watchdog altogether. What value is there in this partial
> disabling? Why is it worth doing this?
>
--
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com
On 04/12/2015 03:14 PM, Frederic Weisbecker wrote:
> On Fri, Apr 10, 2015 at 12:33:38PM -0400, Chris Metcalf wrote:
>> On 04/09/2015 09:58 PM, Frederic Weisbecker wrote:
>>> On Thu, Apr 09, 2015 at 04:29:01PM -0400, Chris Metcalf wrote:
>>>> --- a/include/linux/smpboot.h
>>>> +++ b/include/linux/smpboot.h
>>>> @@ -27,6 +27,8 @@ struct smpboot_thread_data;
>>>> * @pre_unpark: Optional unpark function, called before the thread is
>>>> * unparked (cpu online). This is not guaranteed to be
>>>> * called on the target cpu of the thread. Careful!
>>>> + * @valid_cpu: Optional function, called when unparking the threads,
>>>> + * to limit the set of cpus on which threads are unparked.
>>> I'm not sure why this needs to be a callback instead of a pointer to a cpumask
>>> that smpboot can handle by itself. In fact I don't understand why you want to stick with
>>> this valid_cpu() approach.
>> I stuck with it since Thomas mentioned valid_cpu() as part of his earlier
>> suggestion to just park/unpark the threads, so I was assuming he had
>> a preference for that approach.
> Hmm, that's not quite what he mentioned. He suggested checking whether the
> CPU is valid in the unpark function; that didn't necessarily imply doing it
> through a callback.
Fair enough; I may have read his comment too specifically.
>> The problem with the code you provided, as I see it, is that the cpumask
>> field being kept in the struct smp_hotplug_thread is awkward to
>> initialize while keeping the default that it doesn't have to be mentioned
>> in the initializer for the client's structure. To make this work, in the
>> register function you have to check for a NULL pointer (for OFFSTACK)
>> and then allocate and initialize to cpu_possible_mask, but in the
>> !OFFSTACK case you could just require that an empty mask really means
>> cpu_possible_mask, which seems like an unfortunate overloading.
> If the field is of type "struct cpumask *", just checking NULL is enough.
> I don't think OFFSTACK changes anything. This only changes the allocation
> on the client side. But the pointer passed to the "struct smp_hotplug_thread"
> is the same and that's all transparent to the smpboot subsystem.
>
> Also if the cpumask is NULL on that struct (default), let the smpboot
> subsystem attribute cpu_possible_mask to it (no need to allocate a copy).
> Well this could even not be overwritten and handled by smpboot_thread_unpark()
> itself.
As you saw, I adopted the "struct cpumask *" approach in my current
(v7) patchset last Friday:
https://lkml.org/lkml/2015/4/10/750
There are really two ways to handle this:
1. The client owns the cpumask, and notifies the smpboot subsystem
whenever it has finished a round of changes to the cpumask so that
they can take effect. There is a technical race here where the smpboot
subsystem might look at the mask as it is being updated, but this is
OK since worst-case is a thread on a newly-brought-up core is incorrectly
parked or unparked, but that is corrected immediately when the client
calls in to say it has finished updating the mask.
2. The smpboot subsystem owns the cpumask, and it's only updated
by having the client call in to pass a new mask. This avoids the technical
race, but it does mean that the client can't update a field that it
allocated and provided to the subsystem, which feels a bit unnatural.
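To make the tradeoff concrete, the two call patterns would look roughly
like this (shapes only, not actual code from the series):

	/* #1: client owns the mask; edit it in place, then notify. */
	cpumask_copy(watchdog_cpumask, new_mask);	/* client-owned storage */
	smpboot_update_cpumask_percpu_thread(&watchdog_threads); /* "done editing" */

	/* #2: smpboot owns the mask; pass the new value in. */
	smpboot_update_cpumask_percpu_thread(&watchdog_threads, new_mask);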
Either one could be OK, but I opted for #1. What do you think of it?
Also, I want to ask Linus to pull the tile-specific changes for nohz_full
for the tile architecture. This includes a copy of the change to add the
tick_nohz_full_add_cpus_to() and tick_nohz_full_remove_cpus_from()
routines here:
https://lkml.org/lkml/2015/4/9/792
which I used to fix the tilegx network driver's default irq affinity mask.
There's also the support for tile's nohz_full in general, which you
commented on, but didn't provide an explicit Ack for:
https://lkml.org/lkml/2015/3/24/953
If you'd like to nack either change, or better yet ack them, let me know.
I'll wait a little while before asking Linus to pull.
The tile tree stuff to be pulled for v4.1 is here:
http://git.kernel.org/cgit/linux/kernel/git/cmetcalf/linux-tile.git/log/
if you want to look more closely.
Thanks!
--
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com
On Mon, Apr 13, 2015 at 12:06:50PM -0400, Chris Metcalf wrote:
> >>The problem with the code you provided, as I see it, is that the cpumask
> >>field being kept in the struct smp_hotplug_thread is awkward to
> >>initialize while keeping the default that it doesn't have to be mentioned
> >>in the initializer for the client's structure. To make this work, in the
> >>register function you have to check for a NULL pointer (for OFFSTACK)
> >>and then allocate and initialize to cpu_possible_mask, but in the
> >>!OFFSTACK case you could just require that an empty mask really means
> >>cpu_possible_mask, which seems like an unfortunate overloading.
> >If the field is of type "struct cpumask *", just checking NULL is enough.
> >I don't think OFFSTACK changes anything. This only changes the allocation
> >on the client side. But the pointer passed to the "struct smp_hotplug_thread"
> >is the same and that's all transparent to the smpboot subsystem.
> >
> >Also if the cpumask is NULL on that struct (default), let the smpboot
> >subsystem attribute cpu_possible_mask to it (no need to allocate a copy).
> >Well this could even not be overwritten and handled by smpboot_thread_unpark()
> >itself.
>
> As you saw, I adopted the "struct cpumask *" approach in my current
> (v7) patchset last Friday:
>
> https://lkml.org/lkml/2015/4/10/750
>
> There are really two ways to handle this:
>
> 1. The client owns the cpumask, and notifies the smpboot subsystem
> whenever it has finished a round of changes to the cpumask so that
> they can take effect. There is a technical race here where the smpboot
> subsystem might look at the mask as it is being updated, but this is
> OK since worst-case is a thread on a newly-brought-up core is incorrectly
> parked or unparked, but that is corrected immediately when the client
> calls in to say it has finished updating the mask.
>
> 2. The smpboot subsystem owns the cpumask, and it's only updated
> by having the client call in to pass a new mask. This avoids the technical
> race, but it does mean that the client can't update a field that it
> allocated and provided to the subsystem, which feels a bit unnatural.
That's actually a common pattern. Check out struct timer_list,
it is allocated and pre-filled by the client. The "expires" field is
initialized by the client which then calls add_timer() to arm it.
Now if you want to modify the expiration of the timer while it's
queued, raw-modifying the "expires" field won't work as expected.
You need to do that through mod_timer().
You can seldom change a field of an object directly while it's being
handled live by another subsystem.
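For instance, the timer pattern looks roughly like this (a minimal sketch
using the setup_timer()/mod_timer() interfaces of this era; my_callback is
just a placeholder):

	static void my_callback(unsigned long data) { /* ... */ }
	static struct timer_list my_timer;

	setup_timer(&my_timer, my_callback, 0);
	my_timer.expires = jiffies + HZ;	/* client pre-fills the field... */
	add_timer(&my_timer);			/* ...then hands the timer over */

	/* While queued, don't raw-modify my_timer.expires; instead: */
	mod_timer(&my_timer, jiffies + 2 * HZ);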
>
> Either one could be OK, but I opted for #1. What do you think of it?
>
> Also, I want to ask Linus to pull the tile-specific changes for nohz_full
> for the tile architecture. This includes a copy of the change to add the
> tick_nohz_full_add_cpus_to() and tick_nohz_full_remove_cpus_from()
> routines here:
>
> https://lkml.org/lkml/2015/4/9/792
Let's see that on the thread.
>
> which I used to fix the tilegx network driver's default irq affinity mask.
>
> There's also the support for tile's nohz_full in general, which you
> commented on, but didn't provide an explicit Ack for:
>
> https://lkml.org/lkml/2015/3/24/953
Right, I'll have a look at this.
Thanks.
>
> If you'd like to nack either change, or better yet ack them, let me know.
> I'll wait a little while before asking Linus to pull.
>
> The tile tree stuff to be pulled for v4.1 is here:
>
> http://git.kernel.org/cgit/linux/kernel/git/cmetcalf/linux-tile.git/log/
>
> if you want to look more closely.
>
> Thanks!
>
> --
> Chris Metcalf, EZChip Semiconductor
> http://www.ezchip.com
>
On Mon, Apr 13, 2015 at 12:06:50PM -0400, Chris Metcalf wrote:
> Also, I want to ask Linus to pull the tile-specific changes for nohz_full
> for the tile architecture. This includes a copy of the change to add the
> tick_nohz_full_add_cpus_to() and tick_nohz_full_remove_cpus_from()
> routines here:
Hmm, this is going to be too late to push the tick_nohz_full_add_cpus_to()
stuff for this merge window. This needs to go through the nohz core tree and
we are already in the middle of the merge window.
On 04/14/2015 11:23 AM, Frederic Weisbecker wrote:
> On Mon, Apr 13, 2015 at 12:06:50PM -0400, Chris Metcalf wrote:
>> Also, I want to ask Linus to pull the tile-specific changes for nohz_full
>> for the tile architecture. This includes a copy of the change to add the
>> tick_nohz_full_add_cpus_to() and tick_nohz_full_remove_cpus_from()
>> routines here:
> Hmm, this is going to be too late to push the tick_nohz_full_add_cpus_to()
> stuff for this merge window. This needs to go through the nohz core tree and
> we are already in the middle of the merge window.
I'll defer the change in the tilegx network driver until the next merge window.
(I guess what I would do is cherry-pick the nohz change into the tile tree
to keep it easy to build and just assume it's easy to resolve when it's
pulled for 4.2.)
--
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com
On Tue, 14 Apr 2015, Chris Metcalf wrote:
> On 04/14/2015 11:23 AM, Frederic Weisbecker wrote:
> > On Mon, Apr 13, 2015 at 12:06:50PM -0400, Chris Metcalf wrote:
> > > Also, I want to ask Linus to pull the tile-specific changes for nohz_full
> > > for the tile architecture. This includes a copy of the change to add the
> > > tick_nohz_full_add_cpus_to() and tick_nohz_full_remove_cpus_from()
> > > routines here:
> > Hmm, this is going to be too late to push the tick_nohz_full_add_cpus_to()
> > stuff for this merge window. This needs to go through the nohz core tree and
> > we are already in the middle of the merge window.
>
> I'll defer the change in the tilegx network driver until the next merge window.
> (I guess what I would do is cherry-pick the nohz change into the tile tree
> to keep it easy to build and just assume it's easy to resolve when it's
> pulled for 4.2.)
Please do not cherry pick. We provide a branch for you to pull so we
don't end up with different commit ids for the same patches.
Thanks,
tglx
This change allows some cores to be excluded from running the
smp_hotplug_thread tasks. The following commit to update
kernel/watchdog.c to use this functionality is the motivating
example, and more information on the motivation is provided there.
A new smp_hotplug_thread field is introduced, "cpumask", which
is an optional pointer to a cpumask that indicates whether
or not the given smp_hotplug_thread should run on that core; the
cpumask is checked when deciding whether to unpark the thread.
To change the cpumask after registering the thread, you must call
smpboot_update_cpumask_percpu_thread() with the new cpumask.
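A sketch of the intended client usage (watchdog-shaped, but abridged;
treat the details as illustrative rather than as code lifted from this
series):

	static struct smp_hotplug_thread watchdog_threads = {
		.store		= &softlockup_watchdog,
		.thread_fn	= watchdog,
		.thread_comm	= "watchdog/%u",
		/* .cpumask defaults to NULL, meaning "all possible cpus" */
	};

	watchdog_threads.cpumask = my_cpumask;	/* optional; set before registering */
	smpboot_register_percpu_thread(&watchdog_threads);
	/* ... later, to change which cpus run the threads: */
	smpboot_update_cpumask_percpu_thread(&watchdog_threads, new_mask);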
Signed-off-by: Chris Metcalf <[email protected]>
---
(Text repeated from v7 post since these are still open issues)
I don't know why it is necessary to explicitly unpark the threads in
smpboot_destroy_threads() before destroying them. We can't even do it
in the same for_each_possible_cpu() loop as the kthread_stop() call;
it appears all the threads on online cores must be unparked prior to
trying to stop all the threads, or the system hangs.
In all honesty, I'm still fond of the model where we just do_exit(0)
the threads that we don't want, as soon as registration creates them.
(We can still support the watchdog_cpumask sysctl easily enough.)
This doesn't add any new semantics to smpboot (for good or for bad),
and more importantly it makes it clear that we don't have watchdog
tasks running on the nohz_full cores, which otherwise are going to
make people continually wonder what's going on until they carefully
read the code. But I'm OK with the direction laid out in this patch
if it's the consensus preferred model.
v8: make cpumask only updated by smpboot subsystem [Frederic]
v7: change from valid_cpu() callback to optional cpumask field
park smpboot threads rather than just not creating them
v6: change from an "exclude" data pointer to a more generic
valid_cpu() callback [Frederic]
v5: switch from watchdog_exclude to watchdog_cpumask [Frederic]
simplify the smp_hotplug_thread API to watchdog [Frederic]
include/linux/smpboot.h | 6 ++++++
kernel/smpboot.c | 57 +++++++++++++++++++++++++++++++++++++++++++++++--
2 files changed, 61 insertions(+), 2 deletions(-)
diff --git a/include/linux/smpboot.h b/include/linux/smpboot.h
index d600afb21926..63271b19333e 100644
--- a/include/linux/smpboot.h
+++ b/include/linux/smpboot.h
@@ -27,6 +27,9 @@ struct smpboot_thread_data;
* @pre_unpark: Optional unpark function, called before the thread is
* unparked (cpu online). This is not guaranteed to be
* called on the target cpu of the thread. Careful!
+ * @cpumask: Optional pointer to a set of possible cores to
+ * allow threads to come unparked on. To change it later
+ * you must call smpboot_update_cpumask_percpu_thread().
* @selfparking: Thread is not parked by the park function.
* @thread_comm: The base name of the thread
*/
@@ -41,11 +44,14 @@ struct smp_hotplug_thread {
void (*park)(unsigned int cpu);
void (*unpark)(unsigned int cpu);
void (*pre_unpark)(unsigned int cpu);
+ struct cpumask *cpumask;
bool selfparking;
const char *thread_comm;
};
int smpboot_register_percpu_thread(struct smp_hotplug_thread *plug_thread);
void smpboot_unregister_percpu_thread(struct smp_hotplug_thread *plug_thread);
+void smpboot_update_cpumask_percpu_thread(struct smp_hotplug_thread *plug_thread,
+ const struct cpumask *);
#endif
diff --git a/kernel/smpboot.c b/kernel/smpboot.c
index c697f73d82d6..c5d53a335387 100644
--- a/kernel/smpboot.c
+++ b/kernel/smpboot.c
@@ -92,6 +92,9 @@ enum {
HP_THREAD_PARKED,
};
+/* Statically allocated and used under smpboot_threads_lock. */
+static struct cpumask tmp_mask;
+
/**
* smpboot_thread_fn - percpu hotplug thread loop function
* @data: thread data pointer
@@ -232,7 +235,8 @@ void smpboot_unpark_threads(unsigned int cpu)
mutex_lock(&smpboot_threads_lock);
list_for_each_entry(cur, &hotplug_threads, list)
- smpboot_unpark_thread(cur, cpu);
+ if (cur->cpumask == NULL || cpumask_test_cpu(cpu, cur->cpumask))
+ smpboot_unpark_thread(cur, cpu);
mutex_unlock(&smpboot_threads_lock);
}
@@ -258,6 +262,16 @@ static void smpboot_destroy_threads(struct smp_hotplug_thread *ht)
{
unsigned int cpu;
+ /* Unpark any threads that were voluntarily parked. */
+ if (ht->cpumask) {
+ cpumask_andnot(&tmp_mask, cpu_online_mask, ht->cpumask);
+ for_each_cpu(cpu, &tmp_mask) {
+ struct task_struct *tsk = *per_cpu_ptr(ht->store, cpu);
+ if (tsk)
+ kthread_unpark(tsk);
+ }
+ }
+
/* We need to destroy also the parked threads of offline cpus */
for_each_possible_cpu(cpu) {
struct task_struct *tsk = *per_cpu_ptr(ht->store, cpu);
@@ -289,7 +303,9 @@ int smpboot_register_percpu_thread(struct smp_hotplug_thread *plug_thread)
smpboot_destroy_threads(plug_thread);
goto out;
}
- smpboot_unpark_thread(plug_thread, cpu);
+ if (plug_thread->cpumask == NULL ||
+ cpumask_test_cpu(cpu, plug_thread->cpumask))
+ smpboot_unpark_thread(plug_thread, cpu);
}
list_add(&plug_thread->list, &hotplug_threads);
out:
@@ -316,6 +332,43 @@ void smpboot_unregister_percpu_thread(struct smp_hotplug_thread *plug_thread)
}
EXPORT_SYMBOL_GPL(smpboot_unregister_percpu_thread);
+/**
+ * smpboot_update_cpumask_percpu_thread - Adjust which per_cpu hotplug threads stay parked
+ * @plug_thread: Hotplug thread descriptor
+ * @new: Revised mask to use
+ *
+ * The cpumask field in the smp_hotplug_thread must not be updated directly
+ * by the client, but only by calling this function. A non-NULL cpumask must
+ * have been provided at registration time to be able to use this function.
+ */
+void smpboot_update_cpumask_percpu_thread(struct smp_hotplug_thread *plug_thread,
+ const struct cpumask *new)
+{
+ unsigned int cpu;
+ struct cpumask *old = plug_thread->cpumask;
+
+ BUG_ON(old == NULL);
+
+ get_online_cpus();
+ mutex_lock(&smpboot_threads_lock);
+
+ /* Park threads that were exclusively enabled on the old mask. */
+ cpumask_andnot(&tmp_mask, old, new);
+ for_each_cpu_and(cpu, &tmp_mask, cpu_online_mask)
+ smpboot_park_thread(plug_thread, cpu);
+
+ /* Unpark threads that are exclusively enabled on the new mask. */
+ cpumask_andnot(&tmp_mask, new, old);
+ for_each_cpu_and(cpu, &tmp_mask, cpu_online_mask)
+ smpboot_unpark_thread(plug_thread, cpu);
+
+ cpumask_copy(old, new);
+
+ mutex_unlock(&smpboot_threads_lock);
+ put_online_cpus();
+}
+EXPORT_SYMBOL_GPL(smpboot_update_cpumask_percpu_thread);
+
static DEFINE_PER_CPU(atomic_t, cpu_hotplug_state) = ATOMIC_INIT(CPU_POST_DEAD);
/*
--
2.1.2
Change the default behavior of watchdog so it only runs on the
housekeeping cores when nohz_full is enabled at build and boot time.
Allow modifying the set of cores the watchdog is currently running
on with a new kernel.watchdog_cpumask sysctl.
If we allowed the watchdog to run on nohz_full cores, the timer
interrupts and scheduler work would prevent the desired tickless
operation on those cores. But if we disable the watchdog globally,
then the housekeeping cores can't benefit from the watchdog
functionality. So we allow disabling it only on some cores.
See Documentation/lockup-watchdogs.txt for more information.
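For example, with housekeeping on cores 0-3, the watchdog could later be
re-enabled on nohz_full cores 4-7 at run time (core numbers hypothetical;
writes use the usual cpu-list syntax accepted by proc_do_large_bitmap):

	echo 0-7 > /proc/sys/kernel/watchdog_cpumask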
Acked-by: Don Zickus <[email protected]>
Signed-off-by: Chris Metcalf <[email protected]>
---
v8: use new semantics of smpboot_update_cpumask_percpu_thread() [Frederic]
improve documentation in "Documentation/" and in changelong [akpm]
v7: use cpumask field instead of valid_cpu() callback
v6: use alloc_cpumask_var() [Sasha Levin]
switch from watchdog_exclude to watchdog_cpumask [Frederic]
simplify the smp_hotplug_thread API to watchdog [Frederic]
add Don's Acked-by
Documentation/lockup-watchdogs.txt | 18 ++++++++++++++
Documentation/sysctl/kernel.txt | 15 ++++++++++++
include/linux/nmi.h | 3 +++
kernel/sysctl.c | 7 ++++++
kernel/watchdog.c | 49 ++++++++++++++++++++++++++++++++++++++
5 files changed, 92 insertions(+)
diff --git a/Documentation/lockup-watchdogs.txt b/Documentation/lockup-watchdogs.txt
index ab0baa692c13..22dd6af2e4bd 100644
--- a/Documentation/lockup-watchdogs.txt
+++ b/Documentation/lockup-watchdogs.txt
@@ -61,3 +61,21 @@ As explained above, a kernel knob is provided that allows
administrators to configure the period of the hrtimer and the perf
event. The right value for a particular environment is a trade-off
between fast response to lockups and detection overhead.
+
+By default, the watchdog runs on all online cores. However, on a
+kernel configured with NO_HZ_FULL, by default the watchdog runs only
+on the housekeeping cores, not the cores specified in the "nohz_full"
+boot argument. If we allowed the watchdog to run by default on
+the "nohz_full" cores, we would have to run timer ticks to activate
+the scheduler, which would prevent the "nohz_full" functionality
+from protecting the user code on those cores from the kernel.
+Of course, disabling it by default on the nohz_full cores means that
+when those cores do enter the kernel, by default we will not be
+able to detect if they lock up. However, allowing the watchdog
+to continue to run on the housekeeping (non-tickless) cores means
+that we will continue to detect lockups properly on those cores.
+
+In either case, the set of cores excluded from running the watchdog
+may be adjusted via the kernel.watchdog_cpumask sysctl. For
+nohz_full cores, this may be useful for debugging a case where the
+kernel seems to be hanging on the nohz_full cores.
diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index c831001c45f1..f1697858d71c 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -923,6 +923,21 @@ and nmi_watchdog.
==============================================================
+watchdog_cpumask:
+
+This value can be used to control on which cpus the watchdog may run.
+The default cpumask is all possible cores, but if NO_HZ_FULL is
+enabled in the kernel config, and cores are specified with the
+nohz_full= boot argument, those cores are excluded by default.
+Offline cores can be included in this mask, and if the core is later
+brought online, the watchdog will be started based on the mask value.
+
+Typically this value would only be touched in the nohz_full case
+to re-enable cores that by default were not running the watchdog,
+if a kernel lockup was suspected on those cores.
+
+==============================================================
+
watchdog_thresh:
This value can be used to control the frequency of hrtimer and NMI
diff --git a/include/linux/nmi.h b/include/linux/nmi.h
index 3d46fb4708e0..f94da0e65dea 100644
--- a/include/linux/nmi.h
+++ b/include/linux/nmi.h
@@ -67,6 +67,7 @@ extern int nmi_watchdog_enabled;
extern int soft_watchdog_enabled;
extern int watchdog_user_enabled;
extern int watchdog_thresh;
+extern unsigned long *watchdog_cpumask_bits;
extern int sysctl_softlockup_all_cpu_backtrace;
struct ctl_table;
extern int proc_watchdog(struct ctl_table *, int ,
@@ -77,6 +78,8 @@ extern int proc_soft_watchdog(struct ctl_table *, int ,
void __user *, size_t *, loff_t *);
extern int proc_watchdog_thresh(struct ctl_table *, int ,
void __user *, size_t *, loff_t *);
+extern int proc_watchdog_cpumask(struct ctl_table *, int,
+ void __user *, size_t *, loff_t *);
#endif
#ifdef CONFIG_HAVE_ACPI_APEI_NMI
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 2082b1a88fb9..699571a74e3b 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -881,6 +881,13 @@ static struct ctl_table kern_table[] = {
.extra2 = &one,
},
{
+ .procname = "watchdog_cpumask",
+ .data = &watchdog_cpumask_bits,
+ .maxlen = NR_CPUS,
+ .mode = 0644,
+ .proc_handler = proc_watchdog_cpumask,
+ },
+ {
.procname = "softlockup_panic",
.data = &softlockup_panic,
.maxlen = sizeof(int),
diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 2316f50b07a4..5bd80a212486 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -19,6 +19,7 @@
#include <linux/sysctl.h>
#include <linux/smpboot.h>
#include <linux/sched/rt.h>
+#include <linux/tick.h>
#include <asm/irq_regs.h>
#include <linux/kvm_para.h>
@@ -56,6 +57,9 @@ int __read_mostly sysctl_softlockup_all_cpu_backtrace;
#else
#define sysctl_softlockup_all_cpu_backtrace 0
#endif
+static cpumask_var_t watchdog_cpumask_for_smpboot;
+static cpumask_var_t watchdog_cpumask;
+unsigned long *watchdog_cpumask_bits;
static int __read_mostly watchdog_running;
static u64 __read_mostly sample_period;
@@ -694,6 +698,7 @@ static int watchdog_enable_all_cpus(void)
int err = 0;
if (!watchdog_running) {
+ cpumask_copy(watchdog_cpumask_for_smpboot, watchdog_cpumask);
err = smpboot_register_percpu_thread(&watchdog_threads);
if (err)
pr_err("Failed to create watchdog threads, disabled\n");
@@ -869,12 +874,56 @@ out:
mutex_unlock(&watchdog_proc_mutex);
return err;
}
+
+/*
+ * The cpumask is the mask of possible cpus that the watchdog can run
+ * on, not the mask of cpus it is actually running on. This allows the
+ * user to specify a mask that will include cpus that have not yet
+ * been brought online, if desired.
+ */
+int proc_watchdog_cpumask(struct ctl_table *table, int write,
+ void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+ int err;
+
+ mutex_lock(&watchdog_proc_mutex);
+ err = proc_do_large_bitmap(table, write, buffer, lenp, ppos);
+ if (!err && write) {
+ /* Remove impossible cpus to keep sysctl output cleaner. */
+ cpumask_and(watchdog_cpumask, watchdog_cpumask,
+ cpu_possible_mask);
+
+ if (watchdog_enabled && watchdog_thresh)
+ smpboot_update_cpumask_percpu_thread(&watchdog_threads,
+ watchdog_cpumask);
+ }
+ mutex_unlock(&watchdog_proc_mutex);
+ return err;
+}
+
#endif /* CONFIG_SYSCTL */
void __init lockup_detector_init(void)
{
set_sample_period();
+ /* One cpumask is allocated for smpboot to own. */
+ alloc_cpumask_var(&watchdog_cpumask_for_smpboot, GFP_KERNEL);
+ watchdog_threads.cpumask = watchdog_cpumask_for_smpboot;
+
+ /* Another cpumask is allocated for /proc to use. */
+ alloc_cpumask_var(&watchdog_cpumask, GFP_KERNEL);
+ watchdog_cpumask_bits = cpumask_bits(watchdog_cpumask);
+
+#ifdef CONFIG_NO_HZ_FULL
+ if (!cpumask_empty(tick_nohz_full_mask))
+ pr_info("Disabling watchdog on nohz_full cores by default\n");
+ cpumask_andnot(watchdog_cpumask, cpu_possible_mask,
+ tick_nohz_full_mask);
+#else
+ cpumask_copy(watchdog_cpumask, cpu_possible_mask);
+#endif
+
if (watchdog_enabled)
watchdog_enable_all_cpus();
}
--
2.1.2
Allowing watchdog threads to be parked means that we now have the
opportunity of actually seeing persistent parked threads in the output
of /proc's stat and status files. The existing code reported such
threads as "Running", which is kind-of true if you think of the case
where we park them as part of taking cpus offline. But if we allow
parking them indefinitely, "Running" is pretty misleading, so we report
them as "Sleeping" instead.
We could simply report them with a new string, "Parked", but it feels
like it's a bit risky for userspace to see unexpected new values.
The scheduler does report parked tasks with a "P" in debugging output
from sched_show_task() or dump_cpu_task(), but that's a different API.
This change seemed slightly cleaner than updating the task_state_array
to have additional rows. TASK_DEAD should be subsumed by the exit_state
bits; TASK_WAKEKILL is just a modifier; and TASK_WAKING can very
reasonably be reported as "Running" (as it is now). Only TASK_PARKED
shows up with unreasonable output here.
Signed-off-by: Chris Metcalf <[email protected]>
---
v8: no change to this patch, just in 1/3 and 2/3
fs/proc/array.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/fs/proc/array.c b/fs/proc/array.c
index a3893b7505b2..2eb623ffb0b7 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -126,6 +126,10 @@ static inline const char *get_task_state(struct task_struct *tsk)
{
unsigned int state = (tsk->state | tsk->exit_state) & TASK_REPORT;
+ /* Treat parked tasks as sleeping. */
+ if (tsk->state == TASK_PARKED)
+ state = TASK_INTERRUPTIBLE;
+
BUILD_BUG_ON(1 + ilog2(TASK_REPORT) != ARRAY_SIZE(task_state_array)-1);
return task_state_array[fls(state)];
--
2.1.2
Chris,
if a user changes watchdog parameters in /proc/sys/kernel, the watchdog threads
are not stopped and restarted in all cases. Parameters can also be changed 'on
the fly', for example like 'watchdog_thresh' in the following flow of execution:
proc_watchdog_thresh
proc_watchdog_update
if (watchdog_enabled && watchdog_thresh)
watchdog_enable_all_cpus
if (!watchdog_running) {
// watchdog threads are already running so we don't get here
} else {
update_watchdog_all_cpus
for_each_online_cpu <-----------------------------.
update_watchdog |
watchdog_nmi_disable |
watchdog_nmi_enable |
} |
|
I think we would not want to call watchdog_nmi_enable() for each _online_ CPU,
but rather for each CPU that has an _unparked_ watchdog thread (i.e. where the
watchdog mechanism is actually enabled). So I think for_each_online_cpu() needs
to be replaced by something like:
for_each_cpu(cpu, &watchdog_cpumask_for_smpboot)
It seems that this has been an issue in previous versions of your proposed patch
as well. I'm sorry that I haven't noticed this earlier. I think you also need to
review watchdog_nmi_disable_all() and watchdog_nmi_enable_all() because those
functions call for_each_online_cpu() too.
Regards,
Uli
On Tue, Apr 14, 2015 at 03:37:31PM -0400, Chris Metcalf wrote:
> diff --git a/kernel/smpboot.c b/kernel/smpboot.c
> index c697f73d82d6..c5d53a335387 100644
> --- a/kernel/smpboot.c
> +++ b/kernel/smpboot.c
> @@ -92,6 +92,9 @@ enum {
> HP_THREAD_PARKED,
> };
>
> +/* Statically allocated and used under smpboot_threads_lock. */
> +static struct cpumask tmp_mask;
> +
Better allocate the cpumask on need rather than have it resident on memory.
struct cpumask can be large. Plus we need to worry about locking it.
> /**
> * smpboot_thread_fn - percpu hotplug thread loop function
> * @data: thread data pointer
> @@ -232,7 +235,8 @@ void smpboot_unpark_threads(unsigned int cpu)
>
> mutex_lock(&smpboot_threads_lock);
> list_for_each_entry(cur, &hotplug_threads, list)
> - smpboot_unpark_thread(cur, cpu);
> + if (cur->cpumask == NULL || cpumask_test_cpu(cpu, cur->cpumask))
> + smpboot_unpark_thread(cur, cpu);
> mutex_unlock(&smpboot_threads_lock);
> }
>
> @@ -258,6 +262,16 @@ static void smpboot_destroy_threads(struct smp_hotplug_thread *ht)
> {
> unsigned int cpu;
>
> + /* Unpark any threads that were voluntarily parked. */
> + if (ht->cpumask) {
> + cpumask_andnot(&tmp_mask, cpu_online_mask, ht->cpumask);
> + for_each_cpu(cpu, &tmp_mask) {
> + struct task_struct *tsk = *per_cpu_ptr(ht->store, cpu);
> + if (tsk)
> + kthread_unpark(tsk);
> + }
> + }
Why do you need to do that? smpboot_destroy_threads() doesn't work on parked threads?
But kthread_stop() does an explicit unparking.
> +
> /* We need to destroy also the parked threads of offline cpus */
> for_each_possible_cpu(cpu) {
> struct task_struct *tsk = *per_cpu_ptr(ht->store, cpu);
> @@ -289,7 +303,9 @@ int smpboot_register_percpu_thread(struct smp_hotplug_thread *plug_thread)
> smpboot_destroy_threads(plug_thread);
> goto out;
> }
> - smpboot_unpark_thread(plug_thread, cpu);
> + if (plug_thread->cpumask == NULL ||
> + cpumask_test_cpu(cpu, plug_thread->cpumask))
> + smpboot_unpark_thread(plug_thread, cpu);
> }
> list_add(&plug_thread->list, &hotplug_threads);
> out:
> @@ -316,6 +332,43 @@ void smpboot_unregister_percpu_thread(struct smp_hotplug_thread *plug_thread)
> }
> EXPORT_SYMBOL_GPL(smpboot_unregister_percpu_thread);
>
> +/**
> + * smpboot_update_cpumask_percpu_thread - Adjust which per_cpu hotplug threads stay parked
> + * @plug_thread: Hotplug thread descriptor
> + * @new: Revised mask to use
> + *
> + * The cpumask field in the smp_hotplug_thread must not be updated directly
> + * by the client, but only by calling this function. A non-NULL cpumask must
> + * have been provided at registration time to be able to use this function.
> + */
> +void smpboot_update_cpumask_percpu_thread(struct smp_hotplug_thread *plug_thread,
> + const struct cpumask *new)
> +{
> + unsigned int cpu;
> + struct cpumask *old = plug_thread->cpumask;
> +
> + BUG_ON(old == NULL);
Ouch. So the caller must have passed an explicit mask to be able to modify it?
We can't do that.
> +
> + get_online_cpus();
> + mutex_lock(&smpboot_threads_lock);
> +
> + /* Park threads that were exclusively enabled on the old mask. */
> + cpumask_andnot(&tmp_mask, old, new);
> + for_each_cpu_and(cpu, &tmp_mask, cpu_online_mask)
> + smpboot_park_thread(plug_thread, cpu);
> +
> + /* Unpark threads that are exclusively enabled on the new mask. */
> + cpumask_andnot(&tmp_mask, new, old);
> + for_each_cpu_and(cpu, &tmp_mask, cpu_online_mask)
> + smpboot_unpark_thread(plug_thread, cpu);
> +
> + cpumask_copy(old, new);
So unfortunately I had to see the result to realize my mistake on one detail.
With this scheme, it's not clear who allocates and who releases the cpumasks.
If the caller of smpboot_register_percpu_thread() allocates the cpumask, then it
should release it itself after calling smpboot_unregister_percpu_thread(). But
if the cpumask is NULL and we call smpboot_update_cpumask_percpu_thread(), it's
not clear to the caller whether we make a copy, whether it can release the mask
after calling the function, and so on.
So the client should not touch the cpumask field of struct smp_hotplug_thread at all;
it should instead pass the cpumask to smpboot_register_percpu_thread() and
smpboot_update_cpumask_percpu_thread(). The smpboot subsystem then makes its own
copy in the struct smp_hotplug_thread, which it releases from
smpboot_unregister_percpu_thread().
This way we prevent any nasty side effects or head-scratching about who is
responsible for allocations and releases.
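Concretely, the shape would be something like this (hypothetical signatures,
just to illustrate the ownership model):

	int smpboot_register_percpu_thread(struct smp_hotplug_thread *ht,
					   const struct cpumask *cpumask);
	void smpboot_update_cpumask_percpu_thread(struct smp_hotplug_thread *ht,
						  const struct cpumask *new);

	/* smpboot copies the mask into ht->cpumask at registration/update
	 * time and frees its copy in smpboot_unregister_percpu_thread(). */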
> +
> + mutex_unlock(&smpboot_threads_lock);
> + put_online_cpus();
> +}
> +EXPORT_SYMBOL_GPL(smpboot_update_cpumask_percpu_thread);
> +
> static DEFINE_PER_CPU(atomic_t, cpu_hotplug_state) = ATOMIC_INIT(CPU_POST_DEAD);
>
> /*
> --
> 2.1.2
>
On 4/16/2015 11:28 AM, Frederic Weisbecker wrote:
>> + /* Unpark any threads that were voluntarily parked. */
>> >+ if (ht->cpumask) {
>> >+ cpumask_andnot(&tmp_mask, cpu_online_mask, ht->cpumask);
>> >+ for_each_cpu(cpu, &tmp_mask) {
>> >+ struct task_struct *tsk = *per_cpu_ptr(ht->store, cpu);
>> >+ if (tsk)
>> >+ kthread_unpark(tsk);
>> >+ }
>> >+ }
> Why do you need to do that? smpboot_destroy_threads() doesn't work on parked threads?
> But kthread_stop() does an explicit unparking.
Yes, this part left me scratching my head. Experimentally, this was necessary.
I saw the unpark in kthread_stop() but it didn't make things work properly.
Currently it looks like parked threads are only in that state while cores are
being offlined, and then they are killed individually, so it seems likely that
this particular path hasn't been tested before.
> +/* Statically allocated and used under smpboot_threads_lock. */
> +static struct cpumask tmp_mask;
> +
> Better allocate the cpumask on need rather than have it resident on memory.
> struct cpumask can be large. Plus we need to worry about locking it.
>
I was trying to avoid the need to make functions return errors for the
extremely unlikely case of ENOMEM. No one is going to check that error
return in practice anyway; programmers are lazy. It seemed easy to
allocate one mask statically and use it under the lock; even large systems aren't
likely to burn more than a couple hundred bytes of .bss for this.
But, if you'd prefer using allocation and the error-return model, I can
certainly change the code to do that.
--
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com
On Thu, Apr 16, 2015 at 11:50:06AM -0400, Chris Metcalf wrote:
> On 4/16/2015 11:28 AM, Frederic Weisbecker wrote:
> >>+ /* Unpark any threads that were voluntarily parked. */
> >>>+ if (ht->cpumask) {
> >>>+ cpumask_andnot(&tmp_mask, cpu_online_mask, ht->cpumask);
> >>>+ for_each_cpu(cpu, &tmp_mask) {
> >>>+ struct task_struct *tsk = *per_cpu_ptr(ht->store, cpu);
> >>>+ if (tsk)
> >>>+ kthread_unpark(tsk);
> >>>+ }
> >>>+ }
> >Why do you need to do that? smpboot_destroy_threads() doesn't work on parked threads?
> >But kthread_stop() does an explicit unparking.
>
> Yes, this part left me scratching my head. Experimentally, this was necessary.
> I saw the unpark in kthread_stop() but it didn't make things work properly.
> Currently it looks like parked threads are only in that state while cores are
> being offlined, and then they are killed individually, so it seems likely that
> this particular path hasn't been tested before.
I'm not sure I understand. You mean that kthreads can be parked only when cores they
are affine to are offline?
Also I'm scratching my head around kthread_stop() when called on kthreads that are parked
on offline cores. I don't see how they can wake up and do the kthread->exited completion since
they are only affine to that offline core. But I likely overlooked something.
>
> >+/* Statically allocated and used under smpboot_threads_lock. */
> >+static struct cpumask tmp_mask;
> >+
> >Better allocate the cpumask on need rather than have it resident on memory.
> >struct cpumask can be large. Plus we need to worry about locking it.
> >
>
> I was trying to avoid the need to make functions return errors for the
> extremely unlikely case of ENOMEM. No one is going to check that error
> return in practice anyway; programmers are lazy. It seemed easy to
> allocate one mask statically and use it under the lock; even large systems aren't
> likely to burn more than a couple hundred bytes of .bss for this.
Sure, but I guess it's common practice to allocate temporary cpumasks. I don't
see many "static struct cpumask" instances around that are used for temporary work.
>
> But, if you'd prefer using allocation and the error-return model, I can
> certainly change the code to do that.
There is always a caller to return -ENOMEM to ;-)
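The allocate-on-demand pattern would look roughly like this (a minimal
sketch, not the final code):

	static int smpboot_update_cpumask_percpu_thread(...)
	{
		cpumask_var_t tmp;

		if (!alloc_cpumask_var(&tmp, GFP_KERNEL))
			return -ENOMEM;		/* the caller gets to see this */

		/* ... use tmp for the park/unpark passes ... */

		free_cpumask_var(tmp);
		return 0;
	}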
>
> --
> Chris Metcalf, EZChip Semiconductor
> http://www.ezchip.com
>
On 04/15/2015 03:37 AM, Chris Metcalf wrote:
> Change the default behavior of watchdog so it only runs on the
> housekeeping cores when nohz_full is enabled at build and boot time.
> Allow modifying the set of cores the watchdog is currently running
> on with a new kernel.watchdog_cpumask sysctl.
>
> If we allowed the watchdog to run on nohz_full cores, the timer
> interrupts and scheduler work would prevent the desired tickless
> operation on those cores. But if we disable the watchdog globally,
> then the housekeeping cores can't benefit from the watchdog
> functionality. So we allow disabling it only on some cores.
> See Documentation/lockup-watchdogs.txt for more information.
>
> Acked-by: Don Zickus <[email protected]>
> Signed-off-by: Chris Metcalf <[email protected]>
> ---
> v8: use new semantics of smpboot_update_cpumask_percpu_thread() [Frederic]
> improve documentation in "Documentation/" and in changelong [akpm]
s/changelong/changelog/
>
> v7: use cpumask field instead of valid_cpu() callback
>
> v6: use alloc_cpumask_var() [Sasha Levin]
> switch from watchdog_exclude to watchdog_cpumask [Frederic]
> simplify the smp_hotplug_thread API to watchdog [Frederic]
> add Don's Acked-by
>
> Documentation/lockup-watchdogs.txt | 18 ++++++++++++++
> Documentation/sysctl/kernel.txt | 15 ++++++++++++
> include/linux/nmi.h | 3 +++
> kernel/sysctl.c | 7 ++++++
> kernel/watchdog.c | 49 ++++++++++++++++++++++++++++++++++++++
> 5 files changed, 92 insertions(+)
>
> diff --git a/Documentation/lockup-watchdogs.txt b/Documentation/lockup-watchdogs.txt
> index ab0baa692c13..22dd6af2e4bd 100644
> --- a/Documentation/lockup-watchdogs.txt
> +++ b/Documentation/lockup-watchdogs.txt
> @@ -61,3 +61,21 @@ As explained above, a kernel knob is provided that allows
> administrators to configure the period of the hrtimer and the perf
> event. The right value for a particular environment is a trade-off
> between fast response to lockups and detection overhead.
> +
> +By default, the watchdog runs on all online cores. However, on a
> +kernel configured with NO_HZ_FULL, by default the watchdog runs only
> +on the housekeeping cores, not the cores specified in the "nohz_full"
> +boot argument. If we allowed the watchdog to run by default on
> +the "nohz_full" cores, we would have to run timer ticks to activate
> +the scheduler, which would prevent the "nohz_full" functionality
> +from protecting the user code on those cores from the kernel.
> +Of course, disabling it by default on the nohz_full cores means that
> +when those cores do enter the kernel, by default we will not be
> +able to detect if they lock up. However, allowing the watchdog
> +to continue to run on the housekeeping (non-tickless) cores means
> +that we will continue to detect lockups properly on those cores.
> +
> +In either case, the set of cores excluded from running the watchdog
> +may be adjusted via the kernel.watchdog_cpumask sysctl. For
> +nohz_full cores, this may be useful for debugging a case where the
> +kernel seems to be hanging on the nohz_full cores.
> diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
> index c831001c45f1..f1697858d71c 100644
> --- a/Documentation/sysctl/kernel.txt
> +++ b/Documentation/sysctl/kernel.txt
> @@ -923,6 +923,21 @@ and nmi_watchdog.
>
> ==============================================================
>
> +watchdog_cpumask:
> +
> +This value can be used to control on which cpus the watchdog may run.
> +The default cpumask is all possible cores, but if NO_HZ_FULL is
> +enabled in the kernel config, and cores are specified with the
> +nohz_full= boot argument, those cores are excluded by default.
> +Offline cores can be included in this mask, and if the core is later
> +brought online, the watchdog will be started based on the mask value.
> +
> +Typically this value would only be touched in the nohz_full case
> +to re-enable cores that by default were not running the watchdog,
> +if a kernel lockup was suspected on those cores.
> +
> +==============================================================
> +
> watchdog_thresh:
>
> This value can be used to control the frequency of hrtimer and NMI
> diff --git a/include/linux/nmi.h b/include/linux/nmi.h
> index 3d46fb4708e0..f94da0e65dea 100644
> --- a/include/linux/nmi.h
> +++ b/include/linux/nmi.h
> @@ -67,6 +67,7 @@ extern int nmi_watchdog_enabled;
> extern int soft_watchdog_enabled;
> extern int watchdog_user_enabled;
> extern int watchdog_thresh;
> +extern unsigned long *watchdog_cpumask_bits;
> extern int sysctl_softlockup_all_cpu_backtrace;
> struct ctl_table;
> extern int proc_watchdog(struct ctl_table *, int ,
> @@ -77,6 +78,8 @@ extern int proc_soft_watchdog(struct ctl_table *, int ,
> void __user *, size_t *, loff_t *);
> extern int proc_watchdog_thresh(struct ctl_table *, int ,
> void __user *, size_t *, loff_t *);
> +extern int proc_watchdog_cpumask(struct ctl_table *, int,
> + void __user *, size_t *, loff_t *);
> #endif
>
> #ifdef CONFIG_HAVE_ACPI_APEI_NMI
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index 2082b1a88fb9..699571a74e3b 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -881,6 +881,13 @@ static struct ctl_table kern_table[] = {
> .extra2 = &one,
> },
> {
> + .procname = "watchdog_cpumask",
> + .data = &watchdog_cpumask_bits,
> + .maxlen = NR_CPUS,
> + .mode = 0644,
> + .proc_handler = proc_watchdog_cpumask,
> + },
> + {
> .procname = "softlockup_panic",
> .data = &softlockup_panic,
> .maxlen = sizeof(int),
> diff --git a/kernel/watchdog.c b/kernel/watchdog.c
> index 2316f50b07a4..5bd80a212486 100644
> --- a/kernel/watchdog.c
> +++ b/kernel/watchdog.c
> @@ -19,6 +19,7 @@
> #include <linux/sysctl.h>
> #include <linux/smpboot.h>
> #include <linux/sched/rt.h>
> +#include <linux/tick.h>
>
> #include <asm/irq_regs.h>
> #include <linux/kvm_para.h>
> @@ -56,6 +57,9 @@ int __read_mostly sysctl_softlockup_all_cpu_backtrace;
> #else
> #define sysctl_softlockup_all_cpu_backtrace 0
> #endif
> +static cpumask_var_t watchdog_cpumask_for_smpboot;
> +static cpumask_var_t watchdog_cpumask;
> +unsigned long *watchdog_cpumask_bits;
>
> static int __read_mostly watchdog_running;
> static u64 __read_mostly sample_period;
> @@ -694,6 +698,7 @@ static int watchdog_enable_all_cpus(void)
> int err = 0;
>
> if (!watchdog_running) {
> + cpumask_copy(watchdog_cpumask_for_smpboot, watchdog_cpumask);
> err = smpboot_register_percpu_thread(&watchdog_threads);
> if (err)
> pr_err("Failed to create watchdog threads, disabled\n");
> @@ -869,12 +874,56 @@ out:
> mutex_unlock(&watchdog_proc_mutex);
> return err;
> }
> +
> +/*
> + * The cpumask is the mask of possible cpus that the watchdog can run
> + * on, not the mask of cpus it is actually running on. This allows the
> + * user to specify a mask that will include cpus that have not yet
> + * been brought online, if desired.
> + */
> +int proc_watchdog_cpumask(struct ctl_table *table, int write,
> + void __user *buffer, size_t *lenp, loff_t *ppos)
> +{
> + int err;
> +
> + mutex_lock(&watchdog_proc_mutex);
> + err = proc_do_large_bitmap(table, write, buffer, lenp, ppos);
> + if (!err && write) {
> + /* Remove impossible cpus to keep sysctl output cleaner. */
> + cpumask_and(watchdog_cpumask, watchdog_cpumask,
> + cpu_possible_mask);
> +
> + if (watchdog_enabled && watchdog_thresh)
If the new mask is the same as the current one, then there is no need to go on?
cpus_equal(watchdog_cpumask, watchdog_cpumask_for_smpboot) or something else?
> + smpboot_update_cpumask_percpu_thread(&watchdog_threads,
> + watchdog_cpumask);
> + }
> + mutex_unlock(&watchdog_proc_mutex);
> + return err;
> +}
> +
> #endif /* CONFIG_SYSCTL */
>
> void __init lockup_detector_init(void)
> {
> set_sample_period();
>
> + /* One cpumask is allocated for smpboot to own. */
> + alloc_cpumask_var(&watchdog_cpumask_for_smpboot, GFP_KERNEL);
alloc_cpumask_var could fail?
> + watchdog_threads.cpumask = watchdog_cpumask_for_smpboot;
> +
> + /* Another cpumask is allocated for /proc to use. */
> + alloc_cpumask_var(&watchdog_cpumask, GFP_KERNEL);
ditto
thanks
chai wen
> + watchdog_cpumask_bits = cpumask_bits(watchdog_cpumask);
> +
> +#ifdef CONFIG_NO_HZ_FULL
> + if (!cpumask_empty(tick_nohz_full_mask))
> + pr_info("Disabling watchdog on nohz_full cores by default\n");
> + cpumask_andnot(watchdog_cpumask, cpu_possible_mask,
> + tick_nohz_full_mask);
> +#else
> + cpumask_copy(watchdog_cpumask, cpu_possible_mask);
> +#endif
> +
> if (watchdog_enabled)
> watchdog_enable_all_cpus();
> }
On 04/16/2015 06:46 AM, Ulrich Obergfell wrote:
> if a user changes watchdog parameters in /proc/sys/kernel, the watchdog threads
> are not stopped and restarted in all cases. Parameters can also be changed 'on
> the fly', for example like 'watchdog_thresh' in the following flow of execution:
>
> proc_watchdog_thresh
> proc_watchdog_update
> if (watchdog_enabled && watchdog_thresh)
> watchdog_enable_all_cpus
> if (!watchdog_running) {
> // watchdog threads are already running so we don't get here
> } else {
> update_watchdog_all_cpus
> for_each_online_cpu <-----------------------------.
> update_watchdog |
> watchdog_nmi_disable |
> watchdog_nmi_enable |
> } |
> |
> I think we would not want to call watchdog_nmi_enable() for each_online_ CPU,
> but rather for each CPU that has an_unparked_ watchdog thread (i.e. where the
> watchdog mechanism is actually enabled).
How about something like this? I'll fold it into v9 of the patchset.
Thanks!
diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 0c5a37cdbedd..a4e1c9a2e769 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -61,6 +61,10 @@ static cpumask_var_t watchdog_cpumask_for_smpboot;
static cpumask_var_t watchdog_cpumask;
unsigned long *watchdog_cpumask_bits;
+/* Helper for online, unparked cpus. */
+#define for_each_watchdog_cpu(cpu) \
+ for_each_cpu_and((cpu), cpu_online_mask, watchdog_cpumask)
+
static int __read_mostly watchdog_running;
static u64 __read_mostly sample_period;
@@ -209,7 +213,7 @@ void touch_all_softlockup_watchdogs(void)
* do we care if a 0 races with a timestamp?
* all it means is the softlock check starts one cycle later
*/
- for_each_online_cpu(cpu)
+ for_each_watchdog_cpu(cpu)
per_cpu(watchdog_touch_ts, cpu) = 0;
}
@@ -616,7 +620,7 @@ void watchdog_nmi_enable_all(void)
return;
get_online_cpus();
- for_each_online_cpu(cpu)
+ for_each_watchdog_cpu(cpu)
watchdog_nmi_enable(cpu);
put_online_cpus();
}
@@ -629,7 +633,7 @@ void watchdog_nmi_disable_all(void)
return;
get_online_cpus();
- for_each_online_cpu(cpu)
+ for_each_watchdog_cpu(cpu)
watchdog_nmi_disable(cpu);
put_online_cpus();
}
@@ -688,7 +692,7 @@ static void update_watchdog_all_cpus(void)
int cpu;
get_online_cpus();
- for_each_online_cpu(cpu)
+ for_each_watchdog_cpu(cpu)
update_watchdog(cpu);
put_online_cpus();
}
--
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com
On 04/16/2015 09:31 PM, Chai Wen wrote:
> On 04/15/2015 03:37 AM, Chris Metcalf wrote:
>> +/*
>> + * The cpumask is the mask of possible cpus that the watchdog can run
>> + * on, not the mask of cpus it is actually running on. This allows the
>> + * user to specify a mask that will include cpus that have not yet
>> + * been brought online, if desired.
>> + */
>> +int proc_watchdog_cpumask(struct ctl_table *table, int write,
>> + void __user *buffer, size_t *lenp, loff_t *ppos)
>> +{
>> + int err;
>> +
>> + mutex_lock(&watchdog_proc_mutex);
>> + err = proc_do_large_bitmap(table, write, buffer, lenp, ppos);
>> + if (!err && write) {
>> + /* Remove impossible cpus to keep sysctl output cleaner. */
>> + cpumask_and(watchdog_cpumask, watchdog_cpumask,
>> + cpu_possible_mask);
>> +
>> + if (watchdog_enabled && watchdog_thresh)
>
> If the new mask is the same as the current one, then there is no need to go on?
> cpus_equal(watchdog_cpumask, watchdog_cpumask_for_smpboot) or something else?
It's a minor optimization, though, since smpboot_update_cpumask_percpu_thread()
will do some cpumask calls, realize that nothing has changed, and return
without doing anything anyway.
In any case, with Frederic's recent suggestion, we won't have a
watchdog_cpumask_for_smpboot variable exposed anyway.
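For reference, a minimal sketch of that early-out (hypothetical, and using
cpumask_equal(), the current API, rather than the older cpus_equal()):

	/* Hypothetical early-out: skip the update when nothing changed. */
	if (!cpumask_equal(watchdog_cpumask, &watchdog_threads.cpumask))
		smpboot_update_cpumask_percpu_thread(&watchdog_threads,
						     watchdog_cpumask);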
>> + smpboot_update_cpumask_percpu_thread(&watchdog_threads,
>> + watchdog_cpumask);
>> + }
>> + mutex_unlock(&watchdog_proc_mutex);
>> + return err;
>> +}
>> +
>> #endif /* CONFIG_SYSCTL */
>>
>> void __init lockup_detector_init(void)
>> {
>> set_sample_period();
>>
>> + /* One cpumask is allocated for smpboot to own. */
>> + alloc_cpumask_var(&watchdog_cpumask_for_smpboot, GFP_KERNEL);
>
> alloc_cpumask_var could fail?
Good catch; if I get a failure I'll just return early without trying to
start the watchdog, since clearly things are too memory-constrained
to enable that functionality anyway.
Thanks!
--
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com
On 04/16/2015 11:28 AM, Frederic Weisbecker wrote:
> So unfortunately I had to see the result to realize my mistake on one detail.
> With this scheme, it's not clear who allocates and who releases the cpumasks.
> If the caller of smpboot_register_percpu_thread() allocates the cpumask, then the
> caller should also release it after calling smpboot_unregister_percpu_thread(). But
> if the cpumask is NULL and we call smpboot_update_cpumask_percpu_thread(), it's
> not clear to the caller whether we make a copy, whether it can be released after
> calling the function, etc...
>
> So the client should not touch the cpumask field of struct smp_hotplug_thread at all
> and it should pass the cpumask to smpboot_register_percpu_thread() and smpboot_update_cpumask_percpu_thread().
>
> The smpboot subsystem then makes its own copy into the struct smp_hotplug_thread, which it
> releases from smpboot_unregister_percpu_thread().
>
> This way we prevent any nasty side effects or head-scratching about who is responsible
> for allocations and releases.
So here's a diff to change things in line with your suggestions.
Other than the unfortunate awkwardness that we now have to
test for error return from smpboot_update_cpumask_percpu_thread(),
I think it's a net win.
Note that I assume it's reasonable to register the smpboot thread
and create and unpark all the threads, then later call back in to
park the ones that we want to park. This avoids having to create
another API just for the watchdog case. It has some runtime cost,
but I think it's relatively minor.
This is just the delta diff for your convenience; I'll roll out a v9 with
the various suggestions I've received shortly.
diff --git a/include/linux/smpboot.h b/include/linux/smpboot.h
index 63271b19333e..7c42153edfac 100644
--- a/include/linux/smpboot.h
+++ b/include/linux/smpboot.h
@@ -27,9 +27,8 @@ struct smpboot_thread_data;
* @pre_unpark: Optional unpark function, called before the thread is
* unparked (cpu online). This is not guaranteed to be
* called on the target cpu of the thread. Careful!
- * @cpumask: Optional pointer to a set of possible cores to
- * allow threads to come unparked on. To change it later
- * you must call smpboot_update_cpumask_percpu_thread().
+ * @cpumask: Internal state. To update which threads are unparked,
+ * call smpboot_update_cpumask_percpu_thread().
* @selfparking: Thread is not parked by the park function.
* @thread_comm: The base name of the thread
*/
@@ -44,14 +43,14 @@ struct smp_hotplug_thread {
void (*park)(unsigned int cpu);
void (*unpark)(unsigned int cpu);
void (*pre_unpark)(unsigned int cpu);
- struct cpumask *cpumask;
+ struct cpumask cpumask;
bool selfparking;
const char *thread_comm;
};
int smpboot_register_percpu_thread(struct smp_hotplug_thread *plug_thread);
void smpboot_unregister_percpu_thread(struct smp_hotplug_thread *plug_thread);
-void smpboot_update_cpumask_percpu_thread(struct smp_hotplug_thread *plug_thread,
- const struct cpumask *);
+int smpboot_update_cpumask_percpu_thread(struct smp_hotplug_thread *plug_thread,
+ const struct cpumask *);
#endif
diff --git a/kernel/smpboot.c b/kernel/smpboot.c
index c5d53a335387..0d131daf3e7f 100644
--- a/kernel/smpboot.c
+++ b/kernel/smpboot.c
@@ -92,9 +92,6 @@ enum {
HP_THREAD_PARKED,
};
-/* Statically allocated and used under smpboot_threads_lock. */
-static struct cpumask tmp_mask;
-
/**
* smpboot_thread_fn - percpu hotplug thread loop function
* @data: thread data pointer
@@ -235,7 +232,7 @@ void smpboot_unpark_threads(unsigned int cpu)
mutex_lock(&smpboot_threads_lock);
list_for_each_entry(cur, &hotplug_threads, list)
- if (cur->cpumask == NULL || cpumask_test_cpu(cpu, cur->cpumask))
+ if (cpumask_test_cpu(cpu, &cur->cpumask))
smpboot_unpark_thread(cur, cpu);
mutex_unlock(&smpboot_threads_lock);
}
@@ -263,9 +260,8 @@ static void smpboot_destroy_threads(struct smp_hotplug_thread *ht)
unsigned int cpu;
/* Unpark any threads that were voluntarily parked. */
- if (ht->cpumask) {
- cpumask_andnot(&tmp_mask, cpu_online_mask, ht->cpumask);
- for_each_cpu(cpu, &tmp_mask) {
+ for_each_cpu_not(cpu, &ht->cpumask) {
+ if (cpu_online(cpu)) {
struct task_struct *tsk = *per_cpu_ptr(ht->store, cpu);
if (tsk)
kthread_unpark(tsk);
@@ -295,6 +291,7 @@ int smpboot_register_percpu_thread(struct smp_hotplug_thread *plug_thread)
unsigned int cpu;
int ret = 0;
+ cpumask_copy(&plug_thread->cpumask, cpu_possible_mask);
get_online_cpus();
mutex_lock(&smpboot_threads_lock);
for_each_online_cpu(cpu) {
@@ -303,9 +300,7 @@ int smpboot_register_percpu_thread(struct smp_hotplug_thread *plug_thread)
smpboot_destroy_threads(plug_thread);
goto out;
}
- if (plug_thread->cpumask == NULL ||
- cpumask_test_cpu(cpu, plug_thread->cpumask))
- smpboot_unpark_thread(plug_thread, cpu);
+ smpboot_unpark_thread(plug_thread, cpu);
}
list_add(&plug_thread->list, &hotplug_threads);
out:
@@ -338,34 +333,39 @@ EXPORT_SYMBOL_GPL(smpboot_unregister_percpu_thread);
* @new: Revised mask to use
*
* The cpumask field in the smp_hotplug_thread must not be updated directly
- * by the client, but only by calling this function. A non-NULL cpumask must
- * have been provided at registration time to be able to use this function.
+ * by the client, but only by calling this function.
*/
-void smpboot_update_cpumask_percpu_thread(struct smp_hotplug_thread *plug_thread,
- const struct cpumask *new)
+int smpboot_update_cpumask_percpu_thread(struct smp_hotplug_thread *plug_thread,
+ const struct cpumask *new)
{
+ struct cpumask *old = &plug_thread->cpumask;
+ cpumask_var_t tmp;
unsigned int cpu;
- struct cpumask *old = plug_thread->cpumask;
- BUG_ON(old == NULL);
+ if (!alloc_cpumask_var(&tmp, GFP_KERNEL))
+ return -ENOMEM;
get_online_cpus();
mutex_lock(&smpboot_threads_lock);
/* Park threads that were exclusively enabled on the old mask. */
- cpumask_andnot(&tmp_mask, old, new);
- for_each_cpu_and(cpu, &tmp_mask, cpu_online_mask)
+ cpumask_andnot(tmp, old, new);
+ for_each_cpu_and(cpu, tmp, cpu_online_mask)
smpboot_park_thread(plug_thread, cpu);
/* Unpark threads that are exclusively enabled on the new mask. */
- cpumask_andnot(&tmp_mask, new, old);
- for_each_cpu_and(cpu, &tmp_mask, cpu_online_mask)
+ cpumask_andnot(&tmp, new, old);
+ for_each_cpu_and(cpu, tmp, cpu_online_mask)
smpboot_unpark_thread(plug_thread, cpu);
cpumask_copy(old, new);
mutex_unlock(&smpboot_threads_lock);
put_online_cpus();
+
+ free_cpumask_var(tmp);
+
+ return 0;
}
EXPORT_SYMBOL_GPL(smpboot_update_cpumask_percpu_thread);
diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index a4e1c9a2e769..f3702cf7582b 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -57,7 +57,6 @@ int __read_mostly sysctl_softlockup_all_cpu_backtrace;
#else
#define sysctl_softlockup_all_cpu_backtrace 0
#endif
-static cpumask_var_t watchdog_cpumask_for_smpboot;
static cpumask_var_t watchdog_cpumask;
unsigned long *watchdog_cpumask_bits;
@@ -706,8 +705,12 @@ static int watchdog_enable_all_cpus(void)
err = smpboot_register_percpu_thread(&watchdog_threads);
if (err)
pr_err("Failed to create watchdog threads, disabled\n");
- else
+ else {
+ if (smpboot_update_cpumask_percpu_thread(
+ &watchdog_threads, watchdog_cpumask))
+ pr_err("Failed to set cpumask for watchdog threads\n");
watchdog_running = 1;
+ }
} else {
/*
* Enable/disable the lockup detectors or
@@ -911,12 +914,10 @@ void __init lockup_detector_init(void)
{
set_sample_period();
- /* One cpumask is allocated for smpboot to own. */
- alloc_cpumask_var(&watchdog_cpumask_for_smpboot, GFP_KERNEL);
- watchdog_threads.cpumask = watchdog_cpumask_for_smpboot;
-
- /* Another cpumask is allocated for /proc to use. */
- alloc_cpumask_var(&watchdog_cpumask, GFP_KERNEL);
+ if (!alloc_cpumask_var(&watchdog_cpumask, GFP_KERNEL)) {
+ pr_err("Failed to allocate cpumask for watchdog");
+ return;
+ }
watchdog_cpumask_bits = cpumask_bits(watchdog_cpumask);
#ifdef CONFIG_NO_HZ_FULL
--
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com
This change allows some cores to be excluded from running the
smp_hotplug_thread tasks. The following commit, which updates
kernel/watchdog.c to use this functionality, is the motivating
example; more information on the motivation is provided there.
A new smp_hotplug_thread field is introduced, "cpumask": a cpumask
managed by the smpboot subsystem that indicates whether the given
smp_hotplug_thread should run on a given core. The cpumask is
checked when deciding whether to unpark the thread.
To limit the cpumask to a subset of cpu_possible_mask, you must call
smpboot_update_cpumask_percpu_thread() after registering.
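For illustration, a rough sketch of how a client now drives the API
(hypothetical names; the watchdog change in the following patch is the
real example):

static struct smp_hotplug_thread my_threads = {
	/* .store, .thread_fn, .thread_comm, etc., as usual */
};

static int my_setup(const struct cpumask *allowed)
{
	int err;

	/* Registering creates and unparks threads on all online cpus... */
	err = smpboot_register_percpu_thread(&my_threads);
	if (err)
		return err;

	/* ...then narrowing the mask parks the threads we don't want. */
	return smpboot_update_cpumask_percpu_thread(&my_threads, allowed);
}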
Signed-off-by: Chris Metcalf <[email protected]>
---
v9: move cpumask into smpboot_hotplug_thread and don't let the
client initialize it either [Frederic]
use alloc_cpumask_var, not a locked static cpumask [Frederic]
v8: make cpumask only updated by smpboot subsystem [Frederic]
v7: change from valid_cpu() callback to optional cpumask field
park smpboot threads rather than just not creating them
v6: change from an "exclude" data pointer to a more generic
valid_cpu() callback [Frederic]
v5: switch from watchdog_exclude to watchdog_cpumask [Frederic]
simplify the smp_hotplug_thread API to watchdog [Frederic]
include/linux/smpboot.h | 5 +++++
kernel/smpboot.c | 55 ++++++++++++++++++++++++++++++++++++++++++++++++-
2 files changed, 59 insertions(+), 1 deletion(-)
diff --git a/include/linux/smpboot.h b/include/linux/smpboot.h
index d600afb21926..7c42153edfac 100644
--- a/include/linux/smpboot.h
+++ b/include/linux/smpboot.h
@@ -27,6 +27,8 @@ struct smpboot_thread_data;
* @pre_unpark: Optional unpark function, called before the thread is
* unparked (cpu online). This is not guaranteed to be
* called on the target cpu of the thread. Careful!
+ * @cpumask: Internal state. To update which threads are unparked,
+ * call smpboot_update_cpumask_percpu_thread().
* @selfparking: Thread is not parked by the park function.
* @thread_comm: The base name of the thread
*/
@@ -41,11 +43,14 @@ struct smp_hotplug_thread {
void (*park)(unsigned int cpu);
void (*unpark)(unsigned int cpu);
void (*pre_unpark)(unsigned int cpu);
+ struct cpumask cpumask;
bool selfparking;
const char *thread_comm;
};
int smpboot_register_percpu_thread(struct smp_hotplug_thread *plug_thread);
void smpboot_unregister_percpu_thread(struct smp_hotplug_thread *plug_thread);
+int smpboot_update_cpumask_percpu_thread(struct smp_hotplug_thread *plug_thread,
+ const struct cpumask *);
#endif
diff --git a/kernel/smpboot.c b/kernel/smpboot.c
index c697f73d82d6..0d131daf3e7f 100644
--- a/kernel/smpboot.c
+++ b/kernel/smpboot.c
@@ -232,7 +232,8 @@ void smpboot_unpark_threads(unsigned int cpu)
mutex_lock(&smpboot_threads_lock);
list_for_each_entry(cur, &hotplug_threads, list)
- smpboot_unpark_thread(cur, cpu);
+ if (cpumask_test_cpu(cpu, &cur->cpumask))
+ smpboot_unpark_thread(cur, cpu);
mutex_unlock(&smpboot_threads_lock);
}
@@ -258,6 +259,15 @@ static void smpboot_destroy_threads(struct smp_hotplug_thread *ht)
{
unsigned int cpu;
+ /* Unpark any threads that were voluntarily parked. */
+ for_each_cpu_not(cpu, &ht->cpumask) {
+ if (cpu_online(cpu)) {
+ struct task_struct *tsk = *per_cpu_ptr(ht->store, cpu);
+ if (tsk)
+ kthread_unpark(tsk);
+ }
+ }
+
/* We need to destroy also the parked threads of offline cpus */
for_each_possible_cpu(cpu) {
struct task_struct *tsk = *per_cpu_ptr(ht->store, cpu);
@@ -281,6 +291,7 @@ int smpboot_register_percpu_thread(struct smp_hotplug_thread *plug_thread)
unsigned int cpu;
int ret = 0;
+ cpumask_copy(&plug_thread->cpumask, cpu_possible_mask);
get_online_cpus();
mutex_lock(&smpboot_threads_lock);
for_each_online_cpu(cpu) {
@@ -316,6 +327,48 @@ void smpboot_unregister_percpu_thread(struct smp_hotplug_thread *plug_thread)
}
EXPORT_SYMBOL_GPL(smpboot_unregister_percpu_thread);
+/**
+ * smpboot_update_cpumask_percpu_thread - Adjust which per_cpu hotplug threads stay parked
+ * @plug_thread: Hotplug thread descriptor
+ * @new: Revised mask to use
+ *
+ * The cpumask field in the smp_hotplug_thread must not be updated directly
+ * by the client, but only by calling this function.
+ */
+int smpboot_update_cpumask_percpu_thread(struct smp_hotplug_thread *plug_thread,
+ const struct cpumask *new)
+{
+ struct cpumask *old = &plug_thread->cpumask;
+ cpumask_var_t tmp;
+ unsigned int cpu;
+
+ if (!alloc_cpumask_var(&tmp, GFP_KERNEL))
+ return -ENOMEM;
+
+ get_online_cpus();
+ mutex_lock(&smpboot_threads_lock);
+
+ /* Park threads that were exclusively enabled on the old mask. */
+ cpumask_andnot(tmp, old, new);
+ for_each_cpu_and(cpu, tmp, cpu_online_mask)
+ smpboot_park_thread(plug_thread, cpu);
+
+ /* Unpark threads that are exclusively enabled on the new mask. */
+ cpumask_andnot(tmp, new, old);
+ for_each_cpu_and(cpu, tmp, cpu_online_mask)
+ smpboot_unpark_thread(plug_thread, cpu);
+
+ cpumask_copy(old, new);
+
+ mutex_unlock(&smpboot_threads_lock);
+ put_online_cpus();
+
+ free_cpumask_var(tmp);
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(smpboot_update_cpumask_percpu_thread);
+
static DEFINE_PER_CPU(atomic_t, cpu_hotplug_state) = ATOMIC_INIT(CPU_POST_DEAD);
/*
--
2.1.2
Change the default behavior of watchdog so it only runs on the
housekeeping cores when nohz_full is enabled at build and boot time.
Allow modifying the set of cores the watchdog is currently running
on with a new kernel.watchdog_cpumask sysctl.
If we allowed the watchdog to run on nohz_full cores, the timer
interrupts and scheduler work would prevent the desired tickless
operation on those cores. But if we disable the watchdog globally,
then the housekeeping cores can't benefit from the watchdog
functionality. So we allow disabling it only on some cores.
See Documentation/lockup-watchdogs.txt for more information.
Acked-by: Don Zickus <[email protected]>
Signed-off-by: Chris Metcalf <[email protected]>
---
v9: use new, new semantics of smpboot_update_cpumask_percpu_thread() [Frederic]
add and use for_each_watchdog_cpu() [Uli]
check alloc_cpumask_var for failure [Chai Wen]
v8: use new semantics of smpboot_update_cpumask_percpu_thread() [Frederic]
improve documentation in "Documentation/" and in changelog [akpm]
v7: use cpumask field instead of valid_cpu() callback
v6: use alloc_cpumask_var() [Sasha Levin]
switch from watchdog_exclude to watchdog_cpumask [Frederic]
simplify the smp_hotplug_thread API to watchdog [Frederic]
add Don's Acked-by
Documentation/lockup-watchdogs.txt | 18 +++++++++++
Documentation/sysctl/kernel.txt | 15 +++++++++
include/linux/nmi.h | 3 ++
kernel/sysctl.c | 7 +++++
kernel/watchdog.c | 63 +++++++++++++++++++++++++++++++++++---
5 files changed, 101 insertions(+), 5 deletions(-)
diff --git a/Documentation/lockup-watchdogs.txt b/Documentation/lockup-watchdogs.txt
index ab0baa692c13..22dd6af2e4bd 100644
--- a/Documentation/lockup-watchdogs.txt
+++ b/Documentation/lockup-watchdogs.txt
@@ -61,3 +61,21 @@ As explained above, a kernel knob is provided that allows
administrators to configure the period of the hrtimer and the perf
event. The right value for a particular environment is a trade-off
between fast response to lockups and detection overhead.
+
+By default, the watchdog runs on all online cores. However, on a
+kernel configured with NO_HZ_FULL, by default the watchdog runs only
+on the housekeeping cores, not the cores specified in the "nohz_full"
+boot argument. If we allowed the watchdog to run by default on
+the "nohz_full" cores, we would have to run timer ticks to activate
+the scheduler, which would prevent the "nohz_full" functionality
+from protecting the user code on those cores from the kernel.
+Of course, disabling it by default on the nohz_full cores means that
+when those cores do enter the kernel, by default we will not be
+able to detect if they lock up. However, allowing the watchdog
+to continue to run on the housekeeping (non-tickless) cores means
+that we will continue to detect lockups properly on those cores.
+
+In either case, the set of cores excluded from running the watchdog
+may be adjusted via the kernel.watchdog_cpumask sysctl. For
+nohz_full cores, this may be useful for debugging a case where the
+kernel seems to be hanging on the nohz_full cores.
diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index c831001c45f1..f1697858d71c 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -923,6 +923,21 @@ and nmi_watchdog.
==============================================================
+watchdog_cpumask:
+
+This value can be used to control on which cpus the watchdog may run.
+The default cpumask is all possible cores, but if NO_HZ_FULL is
+enabled in the kernel config, and cores are specified with the
+nohz_full= boot argument, those cores are excluded by default.
+Offline cores can be included in this mask, and if the core is later
+brought online, the watchdog will be started based on the mask value.
+
+Typically this value would only be touched in the nohz_full case
+to re-enable cores that by default were not running the watchdog,
+if a kernel lockup was suspected on those cores.
+
+==============================================================
+
watchdog_thresh:
This value can be used to control the frequency of hrtimer and NMI
diff --git a/include/linux/nmi.h b/include/linux/nmi.h
index 3d46fb4708e0..f94da0e65dea 100644
--- a/include/linux/nmi.h
+++ b/include/linux/nmi.h
@@ -67,6 +67,7 @@ extern int nmi_watchdog_enabled;
extern int soft_watchdog_enabled;
extern int watchdog_user_enabled;
extern int watchdog_thresh;
+extern unsigned long *watchdog_cpumask_bits;
extern int sysctl_softlockup_all_cpu_backtrace;
struct ctl_table;
extern int proc_watchdog(struct ctl_table *, int ,
@@ -77,6 +78,8 @@ extern int proc_soft_watchdog(struct ctl_table *, int ,
void __user *, size_t *, loff_t *);
extern int proc_watchdog_thresh(struct ctl_table *, int ,
void __user *, size_t *, loff_t *);
+extern int proc_watchdog_cpumask(struct ctl_table *, int,
+ void __user *, size_t *, loff_t *);
#endif
#ifdef CONFIG_HAVE_ACPI_APEI_NMI
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 2082b1a88fb9..699571a74e3b 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -881,6 +881,13 @@ static struct ctl_table kern_table[] = {
.extra2 = &one,
},
{
+ .procname = "watchdog_cpumask",
+ .data = &watchdog_cpumask_bits,
+ .maxlen = NR_CPUS,
+ .mode = 0644,
+ .proc_handler = proc_watchdog_cpumask,
+ },
+ {
.procname = "softlockup_panic",
.data = &softlockup_panic,
.maxlen = sizeof(int),
diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 2316f50b07a4..8875717b6616 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -19,6 +19,7 @@
#include <linux/sysctl.h>
#include <linux/smpboot.h>
#include <linux/sched/rt.h>
+#include <linux/tick.h>
#include <asm/irq_regs.h>
#include <linux/kvm_para.h>
@@ -56,6 +57,12 @@ int __read_mostly sysctl_softlockup_all_cpu_backtrace;
#else
#define sysctl_softlockup_all_cpu_backtrace 0
#endif
+static cpumask_var_t watchdog_cpumask;
+unsigned long *watchdog_cpumask_bits;
+
+/* Helper for online, unparked cpus. */
+#define for_each_watchdog_cpu(cpu) \
+ for_each_cpu_and((cpu), cpu_online_mask, watchdog_cpumask)
static int __read_mostly watchdog_running;
static u64 __read_mostly sample_period;
@@ -205,7 +212,7 @@ void touch_all_softlockup_watchdogs(void)
* do we care if a 0 races with a timestamp?
* all it means is the softlock check starts one cycle later
*/
- for_each_online_cpu(cpu)
+ for_each_watchdog_cpu(cpu)
per_cpu(watchdog_touch_ts, cpu) = 0;
}
@@ -612,7 +619,7 @@ void watchdog_nmi_enable_all(void)
return;
get_online_cpus();
- for_each_online_cpu(cpu)
+ for_each_watchdog_cpu(cpu)
watchdog_nmi_enable(cpu);
put_online_cpus();
}
@@ -625,7 +632,7 @@ void watchdog_nmi_disable_all(void)
return;
get_online_cpus();
- for_each_online_cpu(cpu)
+ for_each_watchdog_cpu(cpu)
watchdog_nmi_disable(cpu);
put_online_cpus();
}
@@ -684,7 +691,7 @@ static void update_watchdog_all_cpus(void)
int cpu;
get_online_cpus();
- for_each_online_cpu(cpu)
+ for_each_watchdog_cpu(cpu)
update_watchdog(cpu);
put_online_cpus();
}
@@ -697,8 +704,12 @@ static int watchdog_enable_all_cpus(void)
err = smpboot_register_percpu_thread(&watchdog_threads);
if (err)
pr_err("Failed to create watchdog threads, disabled\n");
- else
+ else {
+ if (smpboot_update_cpumask_percpu_thread(
+ &watchdog_threads, watchdog_cpumask))
+ pr_err("Failed to set cpumask for watchdog threads\n");
watchdog_running = 1;
+ }
} else {
/*
* Enable/disable the lockup detectors or
@@ -869,12 +880,54 @@ out:
mutex_unlock(&watchdog_proc_mutex);
return err;
}
+
+/*
+ * The cpumask is the mask of possible cpus that the watchdog can run
+ * on, not the mask of cpus it is actually running on. This allows the
+ * user to specify a mask that will include cpus that have not yet
+ * been brought online, if desired.
+ */
+int proc_watchdog_cpumask(struct ctl_table *table, int write,
+ void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+ int err;
+
+ mutex_lock(&watchdog_proc_mutex);
+ err = proc_do_large_bitmap(table, write, buffer, lenp, ppos);
+ if (!err && write) {
+ /* Remove impossible cpus to keep sysctl output cleaner. */
+ cpumask_and(watchdog_cpumask, watchdog_cpumask,
+ cpu_possible_mask);
+
+ if (watchdog_enabled && watchdog_thresh)
+ smpboot_update_cpumask_percpu_thread(&watchdog_threads,
+ watchdog_cpumask);
+ }
+ mutex_unlock(&watchdog_proc_mutex);
+ return err;
+}
+
#endif /* CONFIG_SYSCTL */
void __init lockup_detector_init(void)
{
set_sample_period();
+ if (!alloc_cpumask_var(&watchdog_cpumask, GFP_KERNEL)) {
+ pr_err("Failed to allocate cpumask for watchdog");
+ return;
+ }
+ watchdog_cpumask_bits = cpumask_bits(watchdog_cpumask);
+
+#ifdef CONFIG_NO_HZ_FULL
+ if (!cpumask_empty(tick_nohz_full_mask))
+ pr_info("Disabling watchdog on nohz_full cores by default\n");
+ cpumask_andnot(watchdog_cpumask, cpu_possible_mask,
+ tick_nohz_full_mask);
+#else
+ cpumask_copy(watchdog_cpumask, cpu_possible_mask);
+#endif
+
if (watchdog_enabled)
watchdog_enable_all_cpus();
}
--
2.1.2
Allowing watchdog threads to be parked means that we now have the
opportunity of actually seeing persistent parked threads in the output
of /proc's stat and status files. The existing code reported such
threads as "Running", which is kind-of true if you think of the case
where we park them as part of taking cpus offline. But if we allow
parking them indefinitely, "Running" is pretty misleading, so we report
them as "Sleeping" instead.
We could simply report them with a new string, "Parked", but it feels
like it's a bit risky for userspace to see unexpected new values.
The scheduler does report parked tasks with a "P" in debugging output
from sched_show_task() or dump_cpu_task(), but that's a different API.
This change seemed slightly cleaner than updating the task_state_array
to have additional rows. TASK_DEAD should be subsumed by the exit_state
bits; TASK_WAKEKILL is just a modifier; and TASK_WAKING can very
reasonably be reported as "Running" (as it is now). Only TASK_PARKED
shows up with unreasonable output here.
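For context, a sketch of why parked tasks previously showed as "R"
(state values quoted from memory of that era's include/linux/sched.h;
treat the exact numbers as illustrative):

/*
 * TASK_PARKED (0x200) is not part of the TASK_REPORT mask, so for a
 * parked task
 *
 *     ((tsk->state | tsk->exit_state) & TASK_REPORT) == 0,
 *
 * and fls(0) == 0 indexes task_state_array[0], i.e. "R (running)".
 * Mapping parked tasks to TASK_INTERRUPTIBLE (0x1) instead gives
 * fls(1) == 1, i.e. task_state_array[1], "S (sleeping)".
 */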
Signed-off-by: Chris Metcalf <[email protected]>
---
v9: fix to check tsk->state, and to set to TASK_INTERRUPTIBLE
fs/proc/array.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/fs/proc/array.c b/fs/proc/array.c
index a3893b7505b2..2a59d061941e 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -126,6 +126,10 @@ static inline const char *get_task_state(struct task_struct *tsk)
{
unsigned int state = (tsk->state | tsk->exit_state) & TASK_REPORT;
+ /* Treat parked tasks as sleeping. */
+ if (tsk->state == TASK_PARKED)
+ state = TASK_INTERRUPTIBLE;
+
BUILD_BUG_ON(1 + ilog2(TASK_REPORT) != ARRAY_SIZE(task_state_array)-1);
return task_state_array[fls(state)];
--
2.1.2
Chris,
in v9, smpboot_update_cpumask_percpu_thread() allocates 'tmp' mask dynamically.
This allocation can fail and thus the function can now return an error. However,
this error is being ignored by proc_watchdog_cpumask().
+int proc_watchdog_cpumask(struct ctl_table *table, int write,
+ void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+ int err;
+
+ mutex_lock(&watchdog_proc_mutex);
+ err = proc_do_large_bitmap(table, write, buffer, lenp, ppos);
+ if (!err && write) {
+ /* Remove impossible cpus to keep sysctl output cleaner. */
+ cpumask_and(watchdog_cpumask, watchdog_cpumask,
+ cpu_possible_mask);
+
+ if (watchdog_enabled && watchdog_thresh)
+ smpboot_update_cpumask_percpu_thread(&watchdog_threads,
+ watchdog_cpumask);
+ }
+ mutex_unlock(&watchdog_proc_mutex);
+ return err;
+}
You may want to consider handling the error, for example something like:
save watchdog_cpumask because proc_do_large_bitmap() is going to change it
...
err = smpboot_update_cpumask_percpu_thread()
if (err)
restore saved watchdog_cpumask
...
return err so that the user becomes aware of the failure
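A rough C sketch of that approach (hypothetical; 'saved' is a made-up
local, and making it static is safe only because watchdog_proc_mutex
serializes the handler):

int proc_watchdog_cpumask(struct ctl_table *table, int write,
			  void __user *buffer, size_t *lenp, loff_t *ppos)
{
	static struct cpumask saved;	/* protected by watchdog_proc_mutex */
	int err;

	mutex_lock(&watchdog_proc_mutex);
	cpumask_copy(&saved, watchdog_cpumask);
	err = proc_do_large_bitmap(table, write, buffer, lenp, ppos);
	if (!err && write) {
		/* Remove impossible cpus to keep sysctl output cleaner. */
		cpumask_and(watchdog_cpumask, watchdog_cpumask,
			    cpu_possible_mask);

		if (watchdog_enabled && watchdog_thresh) {
			err = smpboot_update_cpumask_percpu_thread(
					&watchdog_threads, watchdog_cpumask);
			if (err)	/* roll back so state stays consistent */
				cpumask_copy(watchdog_cpumask, &saved);
		}
	}
	mutex_unlock(&watchdog_proc_mutex);
	return err;
}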
Regards,
Uli
----- Original Message -----
From: "Chris Metcalf" <[email protected]>
To: "Frederic Weisbecker" <[email protected]>, "Don Zickus" <[email protected]>, "Ingo Molnar" <[email protected]>, "Andrew Morton" <[email protected]>, "Andrew Jones" <[email protected]>, "chai wen" <[email protected]>, "Ulrich Obergfell" <[email protected]>, "Fabian Frederick" <[email protected]>, "Aaron Tomlin" <[email protected]>, "Ben Zhang" <[email protected]>, "Christoph Lameter" <[email protected]>, "Gilad Ben-Yossef" <[email protected]>, "Steven Rostedt" <[email protected]>, [email protected], "Jonathan Corbet" <[email protected]>, [email protected], "Thomas Gleixner" <[email protected]>, "Peter Zijlstra" <[email protected]>
Cc: "Chris Metcalf" <[email protected]>
Sent: Friday, April 17, 2015 8:37:17 PM
Subject: [PATCH v9 2/3] watchdog: add watchdog_cpumask sysctl to assist nohz
[...]
Chris,
I think it would also be nice to check the plausibility of the user input.
+int proc_watchdog_cpumask(struct ctl_table *table, int write,
+ void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+ int err;
+
+ mutex_lock(&watchdog_proc_mutex);
+ err = proc_do_large_bitmap(table, write, buffer, lenp, ppos);
+ if (!err && write) {
+ /* Remove impossible cpus to keep sysctl output cleaner. */
+ cpumask_and(watchdog_cpumask, watchdog_cpumask,
+ cpu_possible_mask);
+
+ if (watchdog_enabled && watchdog_thresh)
+ smpboot_update_cpumask_percpu_thread(&watchdog_threads,
+ watchdog_cpumask);
+ }
+ mutex_unlock(&watchdog_proc_mutex);
+ return err;
+}
I think the user should only be allowed to specify a mask that is a subset of
tick_nohz_full_mask as only those CPUs don't have a watchdog thread by default.
In other words, the user should not be able to interfere with housekeeping CPUs.
For example, add a plausibility check like so:
save watchdog_cpumask because proc_do_large_bitmap() is going to change it
proc_do_large_bitmap()
// return an error if the user-specified mask includes a housekeeping CPU
if (watchdog_cpumask and 'negated tick_nohz_full_mask') {
restore saved watchdog_cpumask
return -EINVAL
}
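A literal rendering of that check (hypothetical sketch; 'saved' is the
snapshot taken before proc_do_large_bitmap(), and Uli withdraws the idea
later in the thread after Don's reply):

	if (!err && write &&
	    !cpumask_subset(watchdog_cpumask, tick_nohz_full_mask)) {
		/* The user-specified mask includes a housekeeping cpu. */
		cpumask_copy(watchdog_cpumask, &saved);
		err = -EINVAL;
	}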
Regards,
Uli
Chris,
in principle the change looks o.k. to me, even though I'm not really familiar
with the watchdog_nmi_disable_all() and watchdog_nmi_enable_all() functions.
It is my understanding that those functions are only called once via 'initcall'
early during kernel startup as shown in the following flow of execution:
kernel_init
{
kernel_init_freeable
{
lockup_detector_init
{
cpumask_andnot(watchdog_cpumask, cpu_possible_mask,tick_nohz_full_mask)
watchdog_enable_all_cpus
smpboot_register_percpu_thread(&watchdog_threads)
smpboot_update_cpumask_percpu_thread(&watchdog_threads,watchdog_cpumask)
// here we make sure that watchdog threads don't run on nohz_full CPUs
// only the watchdog threads of housekeeping CPUs keep on running
}
do_basic_setup
do_initcalls
do_initcall_level
do_one_initcall
fixup_ht_bug // subsys_initcall(fixup_ht_bug)
{
watchdog_nmi_disable_all
// here we disable NMI watchdog only on housekeeping CPUs
for_each_cpu_and(cpu,cpu_online_mask,watchdog_cpumask)
watchdog_nmi_disable
watchdog_nmi_enable_all
// here we enable NMI watchdog only on housekeeping CPUs
for_each_cpu_and(cpu,cpu_online_mask,watchdog_cpumask)
watchdog_nmi_enable
}
}
}
It seems crucial that lockup_detector_init() is executed before fixup_ht_bug().
Regards,
Uli
Chris,
in https://lkml.org/lkml/2015/4/17/616 you stated:
">> + alloc_cpumask_var(&watchdog_cpumask_for_smpboot, GFP_KERNEL);
>
> alloc_cpumask_var could fail?
Good catch; if I get a failure I'll just return early without trying to
start the watchdog, since clearly things are too memory-constrained
to enable that functionality anyway."
Let's assume that (in spite of the memory constraints) the kernel would still
be able to make progress and get to a point where the system will be usable.
In this corner case, the following code would leave a NULL pointer behind in
watchdog_cpumask and in watchdog_cpumask_bits which could subsequently lead
to a crash.
void __init lockup_detector_init(void)
{
set_sample_period();
+ if (!alloc_cpumask_var(&watchdog_cpumask, GFP_KERNEL)) {
+ pr_err("Failed to allocate cpumask for watchdog");
+ return;
+ }
+ watchdog_cpumask_bits = cpumask_bits(watchdog_cpumask);
For example, proc_watchdog_cpumask() and the change that your patch introduces
in watchdog_enable_all_cpus() are not protected against a possible NULL pointer.
I think the code needs to be made safer.
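A minimal guard along those lines might look like this (hypothetical;
the thread settles on static allocation below instead):

	/* Hypothetical early return at the top of proc_watchdog_cpumask()
	 * and of the new code in watchdog_enable_all_cpus(): */
	if (!watchdog_cpumask_bits)
		return -ENODEV;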
Regards,
Uli
On Tue, Apr 21, 2015 at 10:07:00AM -0400, Ulrich Obergfell wrote:
>
> Chris,
>
> I think it would also be nice to check the plausibility of the user input.
>
> +int proc_watchdog_cpumask(struct ctl_table *table, int write,
> + void __user *buffer, size_t *lenp, loff_t *ppos)
> +{
> + int err;
> +
> + mutex_lock(&watchdog_proc_mutex);
> + err = proc_do_large_bitmap(table, write, buffer, lenp, ppos);
> + if (!err && write) {
> + /* Remove impossible cpus to keep sysctl output cleaner. */
> + cpumask_and(watchdog_cpumask, watchdog_cpumask,
> + cpu_possible_mask);
> +
> + if (watchdog_enabled && watchdog_thresh)
> + smpboot_update_cpumask_percpu_thread(&watchdog_threads,
> + watchdog_cpumask);
> + }
> + mutex_unlock(&watchdog_proc_mutex);
> + return err;
> +}
>
> I think the user should only be allowed to specify a mask that is a subset of
> tick_nohz_full_mask as only those CPUs don't have a watchdog thread by default.
> In other words, the user should not be able to interfere with housekeeping CPUs.
Hi Uli,
I am not sure that is necessary. This was supposed to be a debugging
interface for nohz (and possibly other technologies). I think restricting
it to just tick_nohz makes it difficult to try out new things or debug
certain problems.
Personally, I feel anyone who will use this sys interface will need to do so
at their own risk.
Cheers,
Don
>
> For example, add a plausibility check like so:
>
> save watchdog_cpumask because proc_do_large_bitmap() is going to change it
>
> proc_do_large_bitmap()
>
> // return an error if the user-specified mask includes a housekeeping CPU
> if (watchdog_cpumask and 'negated tick_nohz_full_mask') {
> restore saved watchdog_cpumask
> return -EINVAL
> }
>
>
> Regards,
>
> Uli
On Wed, Apr 22, 2015 at 07:02:31AM -0400, Ulrich Obergfell wrote:
>
> Chris,
>
> in https://lkml.org/lkml/2015/4/17/616 you stated:
>
> ">> + alloc_cpumask_var(&watchdog_cpumask_for_smpboot, GFP_KERNEL);
> >
> > alloc_cpumask_var could fail?
>
> Good catch; if I get a failure I'll just return early without trying to
> start the watchdog, since clearly things are too memory-constrained
> to enable that functionality anyway."
>
> Let's assume that (in spite of the memory constraints) the kernel would still
> be able to make progress and get to a point where the system will be usable.
> In this corner case, the following code would leave a NULL pointer behind in
> watchdog_cpumask and in watchdog_cpumask_bits which could subsequently lead
> to a crash.
>
> void __init lockup_detector_init(void)
> {
> set_sample_period();
>
> + if (!alloc_cpumask_var(&watchdog_cpumask, GFP_KERNEL)) {
> + pr_err("Failed to allocate cpumask for watchdog");
> + return;
> + }
> + watchdog_cpumask_bits = cpumask_bits(watchdog_cpumask);
>
> For example, proc_watchdog_cpumask() and the change that your patch introduces
> in watchdog_enable_all_cpus() are not protected against a possible NULL pointer.
> I think the code needs to be made safer.
Or we could just statically allocate it
static DECLARE_BITMAP(watchdog_cpumask, NR_CPUS) __read_mostly;
Cheers,
Don
----- Original Message -----
From: "Don Zickus" <[email protected]>
[...]
> On Tue, Apr 21, 2015 at 10:07:00AM -0400, Ulrich Obergfell wrote:
>>
>> Chris,
>>
[...]
>> I think the user should only be allowed to specify a mask that is a subset of
>> tick_nohz_full_mask as only those CPUs don't have a watchdog thread by default.
>> In other words, the user should not be able to interfere with housekeeping CPUs.
>
> Hi Uli,
>
> I am not sure that is necessary. This was supposed to be a debugging
> interface for nohz (and possibly other technologies). I think restricting
> it to just tick_nohz makes it difficult to try out new things or debug
> certain problems.
>
> Personally, I feel anyone who will use this sys interface will need to do so
> at their own risk.
>
>
> Cheers,
> Don
Don, o.k. - I understand.
Chris, please ignore my idea to add a plausibility check.
Regards,
Uli
I've been out on vacation the last ten days, but picking this up
again now.
I'll wait a bit before putting out a v10, and also address Uli's additional
emails. Meanwhile, who is the right person to eventually pick up this patchset
and push it up to Linus? Frederic, Don, Thomas, akpm? v9 is here:
https://lkml.org/lkml/2015/4/17/697
And I haven't heard any feedback on my fix to /proc/self/stat etc. to
avoid showing the PARKED threads in "R" state (patch 3/3 from that series).
Thanks for any guidance.
On 04/22/2015 11:21 AM, Don Zickus wrote:
> On Wed, Apr 22, 2015 at 07:02:31AM -0400, Ulrich Obergfell wrote:
>> Chris,
>>
>> in https://lkml.org/lkml/2015/4/17/616 you stated:
>>
>> ">> + alloc_cpumask_var(&watchdog_cpumask_for_smpboot, GFP_KERNEL);
>> >
>> > alloc_cpumask_var could fail?
>>
>> Good catch; if I get a failure I'll just return early without trying to
>> start the watchdog, since clearly things are too memory-constrained
>> to enable that functionality anyway."
>>
>> Let's assume that (in spite of the memory constraints) the kernel would still
>> be able to make progress and get to a point where the system will be usable.
>> In this corner case, the following code would leave a NULL pointer behind in
>> watchdog_cpumask and in watchdog_cpumask_bits which could subsequently lead
>> to a crash.
>>
>> void __init lockup_detector_init(void)
>> {
>> set_sample_period();
>>
>> + if (!alloc_cpumask_var(&watchdog_cpumask, GFP_KERNEL)) {
>> + pr_err("Failed to allocate cpumask for watchdog");
>> + return;
>> + }
>> + watchdog_cpumask_bits = cpumask_bits(watchdog_cpumask);
>>
>> For example, proc_watchdog_cpumask() and the change that your patch introduces
>> in watchdog_enable_all_cpus() are not protected against a possible NULL pointer.
>> I think the code needs to be made safer.
> Or we could just statically allocate it
>
> static DECLARE_BITMAP(watchdog_cpumask, NR_CPUS) __read_mostly;
>
> Cheers,
> Don
I think Don's suggestion is best here. It's too intrusive to try to check
for the out-of-memory condition everywhere in the code, just to guard
against the possibility that a system that is already out of memory while
starting the watchdog still has users trying to fiddle with the
/proc/sys/kernel/watchdog* knobs.
The diff against v9 is just this (plus changing watchdog_cpumask to
&watchdog_cpumask in a bunch of places):
diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 8875717b6616..ec742f38c90d 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -57,8 +57,8 @@ int __read_mostly sysctl_softlockup_all_cpu_backtrace;
#else
#define sysctl_softlockup_all_cpu_backtrace 0
#endif
-static cpumask_var_t watchdog_cpumask;
-unsigned long *watchdog_cpumask_bits;
+static struct cpumask watchdog_cpumask __read_mostly;
+unsigned long *watchdog_cpumask_bits = cpumask_bits(&watchdog_cpumask);
/* Helper for online, unparked cpus. */
#define for_each_watchdog_cpu(cpu) \
@@ -913,12 +913,6 @@ void __init lockup_detector_init(void)
{
set_sample_period();
- if (!alloc_cpumask_var(&watchdog_cpumask, GFP_KERNEL)) {
- pr_err("Failed to allocate cpumask for watchdog");
- return;
- }
- watchdog_cpumask_bits = cpumask_bits(watchdog_cpumask);
-
#ifdef CONFIG_NO_HZ_FULL
if (!cpumask_empty(tick_nohz_full_mask))
pr_info("Disabling watchdog on nohz_full cores by default\n");
That said, presumably we need to schedule a cage match between Frederic and Don
to decide on whether it's best to statically allocate cpumasks or not :-)
https://lkml.org/lkml/2015/4/16/416
My sense is that in this case it's appropriate, since it's much harder to
manage the failure case, whereas in the earlier discussion for
smpboot_update_cpumask_percpu_thread() it made sense to just give up and
return a quick ENOMEM. Also, in this case we have no locking issues.
--
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com
cc'ing Andrew
On Mon, Apr 27, 2015 at 04:27:16PM -0400, Chris Metcalf wrote:
> I've been out on vacation the last ten days, but picking this up
> again now.
>
> I'll wait a bit before putting out a v10, and also address Uli's additional
> emails. Meanwhile, who is the right person to eventually pick up this patchset
> and push it up to Linus? Frederic, Don, Thomas, akpm? v9 is here:
I usually resubmit watchdog changes with my signoff to Andrew. But would
just my ACK be ok, Andrew?
Cheers,
Don
>
> https://lkml.org/lkml/2015/4/17/697
>
> And I haven't heard any feedback on my fix to /proc/self/stat etc. to
> avoid showing the PARKED threads in "R" state (patch 3/3 from that series).
>
> Thanks for any guidance.
>
>
> On 04/22/2015 11:21 AM, Don Zickus wrote:
> >On Wed, Apr 22, 2015 at 07:02:31AM -0400, Ulrich Obergfell wrote:
> >>Chris,
> >>
> >>in https://lkml.org/lkml/2015/4/17/616 you stated:
> >>
> >> ">> + alloc_cpumask_var(&watchdog_cpumask_for_smpboot, GFP_KERNEL);
> >> >
> >> > alloc_cpumask_var could fail?
> >>
> >> Good catch; if I get a failure I'll just return early without trying to
> >> start the watchdog, since clearly things are too memory-constrained
> >> to enable that functionality anyway."
> >>
> >>Let's assume that (in spite of the memory constraints) the kernel would still
> >>be able to make progress and get to a point where the system will be usable.
> >>In this corner case, the following code would leave a NULL pointer behind in
> >>watchdog_cpumask and in watchdog_cpumask_bits which could subsequently lead
> >>to a crash.
> >>
> >> void __init lockup_detector_init(void)
> >> {
> >> set_sample_period();
> >>+ if (!alloc_cpumask_var(&watchdog_cpumask, GFP_KERNEL)) {
> >>+ pr_err("Failed to allocate cpumask for watchdog");
> >>+ return;
> >>+ }
> >>+ watchdog_cpumask_bits = cpumask_bits(watchdog_cpumask);
> >>
> >>For example, proc_watchdog_cpumask() and the change that your patch introduces
> >>in watchdog_enable_all_cpus() are not protected against a possible NULL pointer.
> >>I think the code needs to be made safer.
> >Or we could just statically allocate it
> >
> >static DECLARE_BITMAP(watchdog_cpumask, NR_CPUS) __read_mostly;
> >
> >Cheers,
> >Don
>
> I think Don's suggestion is best here. It's too intrusive to try to check
> for the out-of-memory condition everywhere in the code, just to guard
> against the possibility that a system that is already out of memory while
> starting the watchdog still has users trying to fiddle with the
> /proc/sys/kernel/watchdog* knobs.
>
> The diff against v9 is just this (plus changing watchdog_cpumask to
> &watchdog_cpumask in a bunch of places):
>
> diff --git a/kernel/watchdog.c b/kernel/watchdog.c
> index 8875717b6616..ec742f38c90d 100644
> --- a/kernel/watchdog.c
> +++ b/kernel/watchdog.c
> @@ -57,8 +57,8 @@ int __read_mostly sysctl_softlockup_all_cpu_backtrace;
> #else
> #define sysctl_softlockup_all_cpu_backtrace 0
> #endif
> -static cpumask_var_t watchdog_cpumask;
> -unsigned long *watchdog_cpumask_bits;
> +static struct cpumask watchdog_cpumask __read_mostly;
> +unsigned long *watchdog_cpumask_bits = cpumask_bits(&watchdog_cpumask);
> /* Helper for online, unparked cpus. */
> #define for_each_watchdog_cpu(cpu) \
> @@ -913,12 +913,6 @@ void __init lockup_detector_init(void)
> {
> set_sample_period();
> - if (!alloc_cpumask_var(&watchdog_cpumask, GFP_KERNEL)) {
> - pr_err("Failed to allocate cpumask for watchdog");
> - return;
> - }
> - watchdog_cpumask_bits = cpumask_bits(watchdog_cpumask);
> -
> #ifdef CONFIG_NO_HZ_FULL
> if (!cpumask_empty(tick_nohz_full_mask))
> pr_info("Disabling watchdog on nohz_full cores by default\n");
>
> That said, presumably we need to schedule a cage match between Frederic and Don
> to decide on whether it's best to statically allocate cpumasks or not :-)
>
> https://lkml.org/lkml/2015/4/16/416
>
> My sense is that in this case it's appropriate, since it's much harder to
> manage the failure case, whereas in the earlier discussion for
> smpboot_update_cpumask_percpu_thread() it made sense to just give up and
> return a quick ENOMEM. Also, in this case we have no locking issues.
> --
> Chris Metcalf, EZChip Semiconductor
> http://www.ezchip.com
>
On 04/22/2015 04:20 AM, Ulrich Obergfell wrote:
> Chris,
>
> in principle the change looks o.k. to me, even though I'm not really familiar
> with the watchdog_nmi_disable_all() and watchdog_nmi_enable_all() functions.
> It is my understanding that those functions are only called once via 'initcall'
> early during kernel startup as shown in the following flow of execution:
>
> [...]
> It seems crucial that lockup_detector_init() is executed before fixup_ht_bug().
Uli, thanks for doing the follow-up analysis. I didn't know
about the fixup_ht_bug() path, but as you show, it seems to be OK.
We could think about doing some kind of additional paranoia here,
like a wrapper around &watchdog_cpumask that checks some additional
boolean that says whether it's been properly initialized or not.
But I think it's probably OK to leave it as-is; we already had the
potential of issues if any watchdog code was invoked prior to
init_watchdog(), for example due to the sample period being unset.
What do you think?
--
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com
On 04/21/2015 08:32 AM, Ulrich Obergfell wrote:
> Chris,
>
> in v9, smpboot_update_cpumask_percpu_thread() allocates 'tmp' mask dynamically.
> This allocation can fail and thus the function can now return an error. However,
> this error is being ignored by proc_watchdog_cpumask().
Yes, I did that intentionally, because it seemed like a pretty extreme
corner case (not enough memory to allocate one cpumask), and a relatively
unproblematic outcome (we don't actually modify the running set of watchdog
threads the way the /proc knob requested).
The problem with your proposal (to save the old cpumask and put it back on
failure) is that we will almost certainly not be able to do that either
if we can't successfully run smpboot_update_cpumask_percpu_thread(), since
that's exactly the allocation that we're presuming is going to fail internally.
I went down this rathole and decided it wasn't worth worrying about.
Let me know if you think we need to beat on it some more :-)
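To spell out why the rollback doesn't buy much, a save-and-restore
handler would look roughly like this (a hypothetical sketch, not
proposed code; note that the very first step is another cpumask
allocation of exactly the kind we are assuming just failed):

static int watchdog_update_cpumask_with_rollback(void)
{
	cpumask_var_t saved;
	int err;

	/* This allocation can fail for the same reason the update can. */
	if (!alloc_cpumask_var(&saved, GFP_KERNEL))
		return -ENOMEM;
	cpumask_copy(saved, &watchdog_cpumask);

	err = smpboot_update_cpumask_percpu_thread(&watchdog_threads,
						   &watchdog_cpumask);
	if (err)
		/* Probably the same ENOMEM; restore the old request. */
		cpumask_copy(&watchdog_cpumask, saved);

	free_cpumask_var(saved);
	return err;
}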
>
> +int proc_watchdog_cpumask(struct ctl_table *table, int write,
> + void __user *buffer, size_t *lenp, loff_t *ppos)
> +{
> + int err;
> +
> + mutex_lock(&watchdog_proc_mutex);
> + err = proc_do_large_bitmap(table, write, buffer, lenp, ppos);
> + if (!err && write) {
> + /* Remove impossible cpus to keep sysctl output cleaner. */
> + cpumask_and(watchdog_cpumask, watchdog_cpumask,
> + cpu_possible_mask);
> +
> + if (watchdog_enabled && watchdog_thresh)
> + smpboot_update_cpumask_percpu_thread(&watchdog_threads,
> + watchdog_cpumask);
> + }
> + mutex_unlock(&watchdog_proc_mutex);
> + return err;
> +}
>
> You may want to consider handling the error, for example something like:
>
> save watchdog_cpumask because proc_do_large_bitmap() is going to change it
> ...
> err = smpboot_update_cpumask_percpu_thread()
> if (err)
> restore saved watchdog_cpumask
> ...
> return err so that the user becomes aware of the failure
>
>
> Regards,
>
> Uli
>
>
> ----- Original Message -----
> From: "Chris Metcalf" <[email protected]>
> To: "Frederic Weisbecker" <[email protected]>, "Don Zickus" <[email protected]>, "Ingo Molnar" <[email protected]>, "Andrew Morton" <[email protected]>, "Andrew Jones" <[email protected]>, "chai wen" <[email protected]>, "Ulrich Obergfell" <[email protected]>, "Fabian Frederick" <[email protected]>, "Aaron Tomlin" <[email protected]>, "Ben Zhang" <[email protected]>, "Christoph Lameter" <[email protected]>, "Gilad Ben-Yossef" <[email protected]>, "Steven Rostedt" <[email protected]>, [email protected], "Jonathan Corbet" <[email protected]>, [email protected], "Thomas Gleixner" <[email protected]>, "Peter Zijlstra" <[email protected]>
> Cc: "Chris Metcalf" <[email protected]>
> Sent: Friday, April 17, 2015 8:37:17 PM
> Subject: [PATCH v9 2/3] watchdog: add watchdog_cpumask sysctl to assist nohz
>
> Change the default behavior of watchdog so it only runs on the
> housekeeping cores when nohz_full is enabled at build and boot time.
> Allow modifying the set of cores the watchdog is currently running
> on with a new kernel.watchdog_cpumask sysctl.
>
> If we allowed the watchdog to run on nohz_full cores, the timer
> interrupts and scheduler work would prevent the desired tickless
> operation on those cores. But if we disable the watchdog globally,
> then the housekeeping cores can't benefit from the watchdog
> functionality. So we allow disabling it only on some cores.
> See Documentation/lockup-watchdogs.txt for more information.
>
> Acked-by: Don Zickus <[email protected]>
> Signed-off-by: Chris Metcalf <[email protected]>
> ---
> v9: use new, new semantics of smpboot_update_cpumask_percpu_thread() [Frederic]
> add and use for_each_watchdog_cpu() [Uli]
> check alloc_cpumask_var for failure [Chai Wen]
>
> v8: use new semantics of smpboot_update_cpumask_percpu_thread() [Frederic]
> improve documentation in "Documentation/" and in changelog [akpm]
>
> v7: use cpumask field instead of valid_cpu() callback
>
> v6: use alloc_cpumask_var() [Sasha Levin]
> switch from watchdog_exclude to watchdog_cpumask [Frederic]
> simplify the smp_hotplug_thread API to watchdog [Frederic]
> add Don's Acked-by
>
> Documentation/lockup-watchdogs.txt | 18 +++++++++++
> Documentation/sysctl/kernel.txt | 15 +++++++++
> include/linux/nmi.h | 3 ++
> kernel/sysctl.c | 7 +++++
> kernel/watchdog.c | 63 +++++++++++++++++++++++++++++++++++---
> 5 files changed, 101 insertions(+), 5 deletions(-)
>
> diff --git a/Documentation/lockup-watchdogs.txt b/Documentation/lockup-watchdogs.txt
> index ab0baa692c13..22dd6af2e4bd 100644
> --- a/Documentation/lockup-watchdogs.txt
> +++ b/Documentation/lockup-watchdogs.txt
> @@ -61,3 +61,21 @@ As explained above, a kernel knob is provided that allows
> administrators to configure the period of the hrtimer and the perf
> event. The right value for a particular environment is a trade-off
> between fast response to lockups and detection overhead.
> +
> +By default, the watchdog runs on all online cores. However, on a
> +kernel configured with NO_HZ_FULL, by default the watchdog runs only
> +on the housekeeping cores, not the cores specified in the "nohz_full"
> +boot argument. If we allowed the watchdog to run by default on
> +the "nohz_full" cores, we would have to run timer ticks to activate
> +the scheduler, which would prevent the "nohz_full" functionality
> +from protecting the user code on those cores from the kernel.
> +Of course, disabling it by default on the nohz_full cores means that
> +when those cores do enter the kernel, by default we will not be
> +able to detect if they lock up. However, allowing the watchdog
> +to continue to run on the housekeeping (non-tickless) cores means
> +that we will continue to detect lockups properly on those cores.
> +
> +In either case, the set of cores excluded from running the watchdog
> +may be adjusted via the kernel.watchdog_cpumask sysctl. For
> +nohz_full cores, this may be useful for debugging a case where the
> +kernel seems to be hanging on the nohz_full cores.
> diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
> index c831001c45f1..f1697858d71c 100644
> --- a/Documentation/sysctl/kernel.txt
> +++ b/Documentation/sysctl/kernel.txt
> @@ -923,6 +923,21 @@ and nmi_watchdog.
>
> ==============================================================
>
> +watchdog_cpumask:
> +
> +This value can be used to control on which cpus the watchdog may run.
> +The default cpumask is all possible cores, but if NO_HZ_FULL is
> +enabled in the kernel config, and cores are specified with the
> +nohz_full= boot argument, those cores are excluded by default.
> +Offline cores can be included in this mask, and if the core is later
> +brought online, the watchdog will be started based on the mask value.
> +
> +Typically this value would only be touched in the nohz_full case
> +to re-enable cores that by default were not running the watchdog,
> +if a kernel lockup was suspected on those cores.
> +
> +==============================================================
> +
> watchdog_thresh:
>
> This value can be used to control the frequency of hrtimer and NMI
> diff --git a/include/linux/nmi.h b/include/linux/nmi.h
> index 3d46fb4708e0..f94da0e65dea 100644
> --- a/include/linux/nmi.h
> +++ b/include/linux/nmi.h
> @@ -67,6 +67,7 @@ extern int nmi_watchdog_enabled;
> extern int soft_watchdog_enabled;
> extern int watchdog_user_enabled;
> extern int watchdog_thresh;
> +extern unsigned long *watchdog_cpumask_bits;
> extern int sysctl_softlockup_all_cpu_backtrace;
> struct ctl_table;
> extern int proc_watchdog(struct ctl_table *, int ,
> @@ -77,6 +78,8 @@ extern int proc_soft_watchdog(struct ctl_table *, int ,
> void __user *, size_t *, loff_t *);
> extern int proc_watchdog_thresh(struct ctl_table *, int ,
> void __user *, size_t *, loff_t *);
> +extern int proc_watchdog_cpumask(struct ctl_table *, int,
> + void __user *, size_t *, loff_t *);
> #endif
>
> #ifdef CONFIG_HAVE_ACPI_APEI_NMI
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index 2082b1a88fb9..699571a74e3b 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -881,6 +881,13 @@ static struct ctl_table kern_table[] = {
> .extra2 = &one,
> },
> {
> + .procname = "watchdog_cpumask",
> + .data = &watchdog_cpumask_bits,
> + .maxlen = NR_CPUS,
> + .mode = 0644,
> + .proc_handler = proc_watchdog_cpumask,
> + },
> + {
> .procname = "softlockup_panic",
> .data = &softlockup_panic,
> .maxlen = sizeof(int),
> diff --git a/kernel/watchdog.c b/kernel/watchdog.c
> index 2316f50b07a4..8875717b6616 100644
> --- a/kernel/watchdog.c
> +++ b/kernel/watchdog.c
> @@ -19,6 +19,7 @@
> #include <linux/sysctl.h>
> #include <linux/smpboot.h>
> #include <linux/sched/rt.h>
> +#include <linux/tick.h>
>
> #include <asm/irq_regs.h>
> #include <linux/kvm_para.h>
> @@ -56,6 +57,12 @@ int __read_mostly sysctl_softlockup_all_cpu_backtrace;
> #else
> #define sysctl_softlockup_all_cpu_backtrace 0
> #endif
> +static cpumask_var_t watchdog_cpumask;
> +unsigned long *watchdog_cpumask_bits;
> +
> +/* Helper for online, unparked cpus. */
> +#define for_each_watchdog_cpu(cpu) \
> + for_each_cpu_and((cpu), cpu_online_mask, watchdog_cpumask)
>
> static int __read_mostly watchdog_running;
> static u64 __read_mostly sample_period;
> @@ -205,7 +212,7 @@ void touch_all_softlockup_watchdogs(void)
> * do we care if a 0 races with a timestamp?
> * all it means is the softlock check starts one cycle later
> */
> - for_each_online_cpu(cpu)
> + for_each_watchdog_cpu(cpu)
> per_cpu(watchdog_touch_ts, cpu) = 0;
> }
>
> @@ -612,7 +619,7 @@ void watchdog_nmi_enable_all(void)
> return;
>
> get_online_cpus();
> - for_each_online_cpu(cpu)
> + for_each_watchdog_cpu(cpu)
> watchdog_nmi_enable(cpu);
> put_online_cpus();
> }
> @@ -625,7 +632,7 @@ void watchdog_nmi_disable_all(void)
> return;
>
> get_online_cpus();
> - for_each_online_cpu(cpu)
> + for_each_watchdog_cpu(cpu)
> watchdog_nmi_disable(cpu);
> put_online_cpus();
> }
> @@ -684,7 +691,7 @@ static void update_watchdog_all_cpus(void)
> int cpu;
>
> get_online_cpus();
> - for_each_online_cpu(cpu)
> + for_each_watchdog_cpu(cpu)
> update_watchdog(cpu);
> put_online_cpus();
> }
> @@ -697,8 +704,12 @@ static int watchdog_enable_all_cpus(void)
> err = smpboot_register_percpu_thread(&watchdog_threads);
> if (err)
> pr_err("Failed to create watchdog threads, disabled\n");
> - else
> + else {
> + if (smpboot_update_cpumask_percpu_thread(
> + &watchdog_threads, watchdog_cpumask))
> + pr_err("Failed to set cpumask for watchdog threads\n");
> watchdog_running = 1;
> + }
> } else {
> /*
> * Enable/disable the lockup detectors or
> @@ -869,12 +880,54 @@ out:
> mutex_unlock(&watchdog_proc_mutex);
> return err;
> }
> +
> +/*
> + * The cpumask is the mask of possible cpus that the watchdog can run
> + * on, not the mask of cpus it is actually running on. This allows the
> + * user to specify a mask that will include cpus that have not yet
> + * been brought online, if desired.
> + */
> +int proc_watchdog_cpumask(struct ctl_table *table, int write,
> + void __user *buffer, size_t *lenp, loff_t *ppos)
> +{
> + int err;
> +
> + mutex_lock(&watchdog_proc_mutex);
> + err = proc_do_large_bitmap(table, write, buffer, lenp, ppos);
> + if (!err && write) {
> + /* Remove impossible cpus to keep sysctl output cleaner. */
> + cpumask_and(watchdog_cpumask, watchdog_cpumask,
> + cpu_possible_mask);
> +
> + if (watchdog_enabled && watchdog_thresh)
> + smpboot_update_cpumask_percpu_thread(&watchdog_threads,
> + watchdog_cpumask);
> + }
> + mutex_unlock(&watchdog_proc_mutex);
> + return err;
> +}
> +
> #endif /* CONFIG_SYSCTL */
>
> void __init lockup_detector_init(void)
> {
> set_sample_period();
>
> + if (!alloc_cpumask_var(&watchdog_cpumask, GFP_KERNEL)) {
> + pr_err("Failed to allocate cpumask for watchdog");
> + return;
> + }
> + watchdog_cpumask_bits = cpumask_bits(watchdog_cpumask);
> +
> +#ifdef CONFIG_NO_HZ_FULL
> + if (!cpumask_empty(tick_nohz_full_mask))
> + pr_info("Disabling watchdog on nohz_full cores by default\n");
> + cpumask_andnot(watchdog_cpumask, cpu_possible_mask,
> + tick_nohz_full_mask);
> +#else
> + cpumask_copy(watchdog_cpumask, cpu_possible_mask);
> +#endif
> +
> if (watchdog_enabled)
> watchdog_enable_all_cpus();
> }
--
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com
On Tue, 28 Apr 2015 11:17:59 -0400 Don Zickus <[email protected]> wrote:
> cc'ing Andrew
>
> On Mon, Apr 27, 2015 at 04:27:16PM -0400, Chris Metcalf wrote:
> > I've been out on vacation the last ten days, but picking this up
> > again now.
> >
> > I'll wait a bit before putting out a v10, and also address Uli's additional
> > emails. Meanwhile, who is the right person to eventually pick up this patchset
> > and push it up to Linus? Frederic, Don, Thomas, akpm? v9 is here:
>
> I usually resubmit watchdog changes with my signoff to Andrew. But would
> just my ACK be ok, Andrew?
Yep, thanks, I'll take a look.
----- Original Message -----
From: "Chris Metcalf" <[email protected]>
[...]
> On 04/22/2015 04:20 AM, Ulrich Obergfell wrote:
>> Chris,
>>
>> in principle the change looks o.k. to me, even though I'm not really familiar
>> with the watchdog_nmi_disable_all() and watchdog_nmi_enable_all() functions.
>> It is my understanding that those functions are only called once via 'initcall'
>> early during kernel startup as shown in the following flow of execution:
>>
>> [...]
>> It seems crucial that lockup_detector_init() is executed before fixup_ht_bug().
>
> Uli, thanks for doing the follow-up analysis. I didn't know
> about the fixup_ht_bug() path, but as you show, it seems to be OK.
>
> We could think about doing some kind of additional paranoia here,
> like a wrapper around &watchdog_cpumask that checks some additional
> boolean that says whether it's been properly initialized or not.
>
> But I think it's probably OK to leave it as-is; we already had the
> potential of issues if any watchdog code was invoked prior to
> init_watchdog(), for example due to the sample period being unset.
>
> What do you think?
Chris,
I also think it's probably OK to leave it as-is, in particular because
you indicated in http://marc.info/?l=linux-kernel&m=143016646903545&w=2
that you are going to make watchdog_cpumask static instead of allocating
it dynamically.
Regards,
Uli
----- Original Message -----
From: "Chris Metcalf" <[email protected]>
[...]
On 04/21/2015 08:32 AM, Ulrich Obergfell wrote:
>> Chris,
>>
>> in v9, smpboot_update_cpumask_percpu_thread() allocates 'tmp' mask dynamically.
>> This allocation can fail and thus the function can now return an error. However,
>> this error is being ignored by proc_watchdog_cpumask().
>
> Yes, I did that intentionally, because it seemed like a pretty extreme
> corner case (not enough memory to allocate one cpumask), and a relatively
> unproblematic outcome (we don't actually modify the running set of watchdog
> threads the way the /proc knob requested).
>
> The problem with your proposal (to save the old cpumask and put it back on
> failure) is that we will almost certainly not be able to do that either
> if we can't successfully run smpboot_update_cpumask_percpu_thread(),
> since that's exactly the allocation that we're presuming is going to fail
> internally.
>
> I went down this rathole and decided it wasn't worth worrying about.
> Let me know if you think we need to beat on it some more :-)
Chris,
the other handlers for the watchdog parameters in /proc restore the
original value on failure, so I thought it would be nice to make the
error handling consistent in that regard.
On the other hand, the 'watchdog_cpumask' parameter is kind of an
exception in terms of when and how it should be used, and thus it's
probably OK if this interface is less 'user-friendly'. As Don commented
in https://lkml.org/lkml/2015/4/22/325 in reply to my suggestion to add
a plausibility check for 'watchdog_cpumask':
"I am not sure that is necessary. This was supposed to be a debugging
interface for nohz (and possibly other technologies). ... Personally,
I feel anyone who will use this sys interface will need to do so at
their own risk."
So I think we could apply the same rationale here and ignore a possible
error returned by smpboot_update_cpumask_percpu_thread(). Perhaps you
could add a few comment lines to the code.
Don,
please let us know what you think.
Regards,
Uli
On Tue, Apr 28, 2015 at 02:07:59PM -0400, Chris Metcalf wrote:
> On 04/21/2015 08:32 AM, Ulrich Obergfell wrote:
> >Chris,
> >
> >in v9, smpboot_update_cpumask_percpu_thread() allocates 'tmp' mask dynamically.
> >This allocation can fail and thus the function can now return an error. However,
> >this error is being ignored by proc_watchdog_cpumask().
>
> Yes, I did that intentionally, because it seemed like a pretty
> extreme corner case (not enough memory to allocate one cpumask),
> and a relatively unproblematic outcome (we don't actually modify the
> running set of watchdog threads the way the /proc knob requested).
>
> The problem with your proposal (to save the old cpumask and put it back on
> failure) is that we will almost certainly not be able to do that either
> if we can't successfully run smpboot_update_cpumask_percpu_thread(),
> since that's
> exactly the allocation that we're presuming is going to fail internally.
>
> I went down this rathole and decided it wasn't worth worrying about.
> Let me know if you think we need to beat on it some more :-)
Hi Chris,
It seems we should at least tell the user that their request failed.
I think it would be more frustrating for the user to change the mask,
have the system not respond the way they intended, and have no way of
knowing that the configuration did not take effect.
I understand and agree with the difficulty of handling this corner case, but
we can still trap and report it in hopes some day we get smarter. :-)
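For concreteness, the trap-and-report could be as small as this sketch
(the wrapper name is hypothetical; the point is just that the error
gets logged rather than silently dropped):

/* Apply the new mask; if the internal allocation fails, say so. */
static void watchdog_apply_cpumask(void)
{
	int err = smpboot_update_cpumask_percpu_thread(&watchdog_threads,
						       &watchdog_cpumask);
	if (err)
		pr_err("watchdog: failed to apply cpumask update (%d)\n", err);
}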
Cheers,
Don
>
> >
> >+int proc_watchdog_cpumask(struct ctl_table *table, int write,
> >+ void __user *buffer, size_t *lenp, loff_t *ppos)
> >+{
> >+ int err;
> >+
> >+ mutex_lock(&watchdog_proc_mutex);
> >+ err = proc_do_large_bitmap(table, write, buffer, lenp, ppos);
> >+ if (!err && write) {
> >+ /* Remove impossible cpus to keep sysctl output cleaner. */
> >+ cpumask_and(watchdog_cpumask, watchdog_cpumask,
> >+ cpu_possible_mask);
> >+
> >+ if (watchdog_enabled && watchdog_thresh)
> >+ smpboot_update_cpumask_percpu_thread(&watchdog_threads,
> >+ watchdog_cpumask);
> >+ }
> >+ mutex_unlock(&watchdog_proc_mutex);
> >+ return err;
> >+}
> >
> >You may want to consider handling the error, for example something like:
> >
> > save watchdog_cpumask because proc_do_large_bitmap() is going to change it
> > ...
> > err = smpboot_update_cpumask_percpu_thread()
> > if (err)
> > restore saved watchdog_cpumask
> > ...
> > return err so that the user becomes aware of the failure
> >
> >
> >Regards,
> >
> >Uli
> >
> >
> >----- Original Message -----
> >From: "Chris Metcalf" <[email protected]>
> >To: "Frederic Weisbecker" <[email protected]>, "Don Zickus" <[email protected]>, "Ingo Molnar" <[email protected]>, "Andrew Morton" <[email protected]>, "Andrew Jones" <[email protected]>, "chai wen" <[email protected]>, "Ulrich Obergfell" <[email protected]>, "Fabian Frederick" <[email protected]>, "Aaron Tomlin" <[email protected]>, "Ben Zhang" <[email protected]>, "Christoph Lameter" <[email protected]>, "Gilad Ben-Yossef" <[email protected]>, "Steven Rostedt" <[email protected]>, [email protected], "Jonathan Corbet" <[email protected]>, [email protected], "Thomas Gleixner" <[email protected]>, "Peter Zijlstra" <[email protected]>
> >Cc: "Chris Metcalf" <[email protected]>
> >Sent: Friday, April 17, 2015 8:37:17 PM
> >Subject: [PATCH v9 2/3] watchdog: add watchdog_cpumask sysctl to assist nohz
> >
> >Change the default behavior of watchdog so it only runs on the
> >housekeeping cores when nohz_full is enabled at build and boot time.
> >Allow modifying the set of cores the watchdog is currently running
> >on with a new kernel.watchdog_cpumask sysctl.
> >
> >If we allowed the watchdog to run on nohz_full cores, the timer
> >interrupts and scheduler work would prevent the desired tickless
> >operation on those cores. But if we disable the watchdog globally,
> >then the housekeeping cores can't benefit from the watchdog
> >functionality. So we allow disabling it only on some cores.
> >See Documentation/lockup-watchdogs.txt for more information.
> >
> >Acked-by: Don Zickus <[email protected]>
> >Signed-off-by: Chris Metcalf <[email protected]>
> >---
> >v9: use new, new semantics of smpboot_update_cpumask_percpu_thread() [Frederic]
> > add and use for_each_watchdog_cpu() [Uli]
> > check alloc_cpumask_var for failure [Chai Wen]
> >
> >v8: use new semantics of smpboot_update_cpumask_percpu_thread() [Frederic]
> > improve documentation in "Documentation/" and in changelog [akpm]
> >
> >v7: use cpumask field instead of valid_cpu() callback
> >
> >v6: use alloc_cpumask_var() [Sasha Levin]
> > switch from watchdog_exclude to watchdog_cpumask [Frederic]
> > simplify the smp_hotplug_thread API to watchdog [Frederic]
> > add Don's Acked-by
> >
> > Documentation/lockup-watchdogs.txt | 18 +++++++++++
> > Documentation/sysctl/kernel.txt | 15 +++++++++
> > include/linux/nmi.h | 3 ++
> > kernel/sysctl.c | 7 +++++
> > kernel/watchdog.c | 63 +++++++++++++++++++++++++++++++++++---
> > 5 files changed, 101 insertions(+), 5 deletions(-)
> >
> >diff --git a/Documentation/lockup-watchdogs.txt b/Documentation/lockup-watchdogs.txt
> >index ab0baa692c13..22dd6af2e4bd 100644
> >--- a/Documentation/lockup-watchdogs.txt
> >+++ b/Documentation/lockup-watchdogs.txt
> >@@ -61,3 +61,21 @@ As explained above, a kernel knob is provided that allows
> > administrators to configure the period of the hrtimer and the perf
> > event. The right value for a particular environment is a trade-off
> > between fast response to lockups and detection overhead.
> >+
> >+By default, the watchdog runs on all online cores. However, on a
> >+kernel configured with NO_HZ_FULL, by default the watchdog runs only
> >+on the housekeeping cores, not the cores specified in the "nohz_full"
> >+boot argument. If we allowed the watchdog to run by default on
> >+the "nohz_full" cores, we would have to run timer ticks to activate
> >+the scheduler, which would prevent the "nohz_full" functionality
> >+from protecting the user code on those cores from the kernel.
> >+Of course, disabling it by default on the nohz_full cores means that
> >+when those cores do enter the kernel, by default we will not be
> >+able to detect if they lock up. However, allowing the watchdog
> >+to continue to run on the housekeeping (non-tickless) cores means
> >+that we will continue to detect lockups properly on those cores.
> >+
> >+In either case, the set of cores excluded from running the watchdog
> >+may be adjusted via the kernel.watchdog_cpumask sysctl. For
> >+nohz_full cores, this may be useful for debugging a case where the
> >+kernel seems to be hanging on the nohz_full cores.
> >diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
> >index c831001c45f1..f1697858d71c 100644
> >--- a/Documentation/sysctl/kernel.txt
> >+++ b/Documentation/sysctl/kernel.txt
> >@@ -923,6 +923,21 @@ and nmi_watchdog.
> > ==============================================================
> >+watchdog_cpumask:
> >+
> >+This value can be used to control on which cpus the watchdog may run.
> >+The default cpumask is all possible cores, but if NO_HZ_FULL is
> >+enabled in the kernel config, and cores are specified with the
> >+nohz_full= boot argument, those cores are excluded by default.
> >+Offline cores can be included in this mask, and if the core is later
> >+brought online, the watchdog will be started based on the mask value.
> >+
> >+Typically this value would only be touched in the nohz_full case
> >+to re-enable cores that by default were not running the watchdog,
> >+if a kernel lockup was suspected on those cores.
> >+
> >+==============================================================
> >+
> > watchdog_thresh:
> > This value can be used to control the frequency of hrtimer and NMI
> >diff --git a/include/linux/nmi.h b/include/linux/nmi.h
> >index 3d46fb4708e0..f94da0e65dea 100644
> >--- a/include/linux/nmi.h
> >+++ b/include/linux/nmi.h
> >@@ -67,6 +67,7 @@ extern int nmi_watchdog_enabled;
> > extern int soft_watchdog_enabled;
> > extern int watchdog_user_enabled;
> > extern int watchdog_thresh;
> >+extern unsigned long *watchdog_cpumask_bits;
> > extern int sysctl_softlockup_all_cpu_backtrace;
> > struct ctl_table;
> > extern int proc_watchdog(struct ctl_table *, int ,
> >@@ -77,6 +78,8 @@ extern int proc_soft_watchdog(struct ctl_table *, int ,
> > void __user *, size_t *, loff_t *);
> > extern int proc_watchdog_thresh(struct ctl_table *, int ,
> > void __user *, size_t *, loff_t *);
> >+extern int proc_watchdog_cpumask(struct ctl_table *, int,
> >+ void __user *, size_t *, loff_t *);
> > #endif
> > #ifdef CONFIG_HAVE_ACPI_APEI_NMI
> >diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> >index 2082b1a88fb9..699571a74e3b 100644
> >--- a/kernel/sysctl.c
> >+++ b/kernel/sysctl.c
> >@@ -881,6 +881,13 @@ static struct ctl_table kern_table[] = {
> > .extra2 = &one,
> > },
> > {
> >+ .procname = "watchdog_cpumask",
> >+ .data = &watchdog_cpumask_bits,
> >+ .maxlen = NR_CPUS,
> >+ .mode = 0644,
> >+ .proc_handler = proc_watchdog_cpumask,
> >+ },
> >+ {
> > .procname = "softlockup_panic",
> > .data = &softlockup_panic,
> > .maxlen = sizeof(int),
> >diff --git a/kernel/watchdog.c b/kernel/watchdog.c
> >index 2316f50b07a4..8875717b6616 100644
> >--- a/kernel/watchdog.c
> >+++ b/kernel/watchdog.c
> >@@ -19,6 +19,7 @@
> > #include <linux/sysctl.h>
> > #include <linux/smpboot.h>
> > #include <linux/sched/rt.h>
> >+#include <linux/tick.h>
> > #include <asm/irq_regs.h>
> > #include <linux/kvm_para.h>
> >@@ -56,6 +57,12 @@ int __read_mostly sysctl_softlockup_all_cpu_backtrace;
> > #else
> > #define sysctl_softlockup_all_cpu_backtrace 0
> > #endif
> >+static cpumask_var_t watchdog_cpumask;
> >+unsigned long *watchdog_cpumask_bits;
> >+
> >+/* Helper for online, unparked cpus. */
> >+#define for_each_watchdog_cpu(cpu) \
> >+ for_each_cpu_and((cpu), cpu_online_mask, watchdog_cpumask)
> > static int __read_mostly watchdog_running;
> > static u64 __read_mostly sample_period;
> >@@ -205,7 +212,7 @@ void touch_all_softlockup_watchdogs(void)
> > * do we care if a 0 races with a timestamp?
> > * all it means is the softlock check starts one cycle later
> > */
> >- for_each_online_cpu(cpu)
> >+ for_each_watchdog_cpu(cpu)
> > per_cpu(watchdog_touch_ts, cpu) = 0;
> > }
> >@@ -612,7 +619,7 @@ void watchdog_nmi_enable_all(void)
> > return;
> > get_online_cpus();
> >- for_each_online_cpu(cpu)
> >+ for_each_watchdog_cpu(cpu)
> > watchdog_nmi_enable(cpu);
> > put_online_cpus();
> > }
> >@@ -625,7 +632,7 @@ void watchdog_nmi_disable_all(void)
> > return;
> > get_online_cpus();
> >- for_each_online_cpu(cpu)
> >+ for_each_watchdog_cpu(cpu)
> > watchdog_nmi_disable(cpu);
> > put_online_cpus();
> > }
> >@@ -684,7 +691,7 @@ static void update_watchdog_all_cpus(void)
> > int cpu;
> > get_online_cpus();
> >- for_each_online_cpu(cpu)
> >+ for_each_watchdog_cpu(cpu)
> > update_watchdog(cpu);
> > put_online_cpus();
> > }
> >@@ -697,8 +704,12 @@ static int watchdog_enable_all_cpus(void)
> > err = smpboot_register_percpu_thread(&watchdog_threads);
> > if (err)
> > pr_err("Failed to create watchdog threads, disabled\n");
> >- else
> >+ else {
> >+ if (smpboot_update_cpumask_percpu_thread(
> >+ &watchdog_threads, watchdog_cpumask))
> >+ pr_err("Failed to set cpumask for watchdog threads\n");
> > watchdog_running = 1;
> >+ }
> > } else {
> > /*
> > * Enable/disable the lockup detectors or
> >@@ -869,12 +880,54 @@ out:
> > mutex_unlock(&watchdog_proc_mutex);
> > return err;
> > }
> >+
> >+/*
> >+ * The cpumask is the mask of possible cpus that the watchdog can run
> >+ * on, not the mask of cpus it is actually running on. This allows the
> >+ * user to specify a mask that will include cpus that have not yet
> >+ * been brought online, if desired.
> >+ */
> >+int proc_watchdog_cpumask(struct ctl_table *table, int write,
> >+ void __user *buffer, size_t *lenp, loff_t *ppos)
> >+{
> >+ int err;
> >+
> >+ mutex_lock(&watchdog_proc_mutex);
> >+ err = proc_do_large_bitmap(table, write, buffer, lenp, ppos);
> >+ if (!err && write) {
> >+ /* Remove impossible cpus to keep sysctl output cleaner. */
> >+ cpumask_and(watchdog_cpumask, watchdog_cpumask,
> >+ cpu_possible_mask);
> >+
> >+ if (watchdog_enabled && watchdog_thresh)
> >+ smpboot_update_cpumask_percpu_thread(&watchdog_threads,
> >+ watchdog_cpumask);
> >+ }
> >+ mutex_unlock(&watchdog_proc_mutex);
> >+ return err;
> >+}
> >+
> > #endif /* CONFIG_SYSCTL */
> > void __init lockup_detector_init(void)
> > {
> > set_sample_period();
> >+ if (!alloc_cpumask_var(&watchdog_cpumask, GFP_KERNEL)) {
> >+ pr_err("Failed to allocate cpumask for watchdog");
> >+ return;
> >+ }
> >+ watchdog_cpumask_bits = cpumask_bits(watchdog_cpumask);
> >+
> >+#ifdef CONFIG_NO_HZ_FULL
> >+ if (!cpumask_empty(tick_nohz_full_mask))
> >+ pr_info("Disabling watchdog on nohz_full cores by default\n");
> >+ cpumask_andnot(watchdog_cpumask, cpu_possible_mask,
> >+ tick_nohz_full_mask);
> >+#else
> >+ cpumask_copy(watchdog_cpumask, cpu_possible_mask);
> >+#endif
> >+
> > if (watchdog_enabled)
> > watchdog_enable_all_cpus();
> > }
>
> --
> Chris Metcalf, EZChip Semiconductor
> http://www.ezchip.com
>
On Fri, 17 Apr 2015 14:37:16 -0400 Chris Metcalf <[email protected]> wrote:
> This change allows some cores to be excluded from running the
> smp_hotplug_thread tasks. The following commit to update
> kernel/watchdog.c to use this functionality is the motivating
> example, and more information on the motivation is provided there.
>
> A new smp_hotplug_thread field is introduced, "cpumask", which
> is cpumask field managed by the smpboot subsystem that indicates whether
> or not the given smp_hotplug_thread should run on that core; the
> cpumask is checked when deciding whether to unpark the thread.
>
> To limit the cpumask to less than cpu_possible, you must call
> smpboot_update_cpumask_percpu_thread() after registering.
Has Thomas commented on any version of this? t'would be nice.
On Fri, 17 Apr 2015 14:37:17 -0400 Chris Metcalf <[email protected]> wrote:
> Change the default behavior of watchdog so it only runs on the
> housekeeping cores when nohz_full is enabled at build and boot time.
> Allow modifying the set of cores the watchdog is currently running
> on with a new kernel.watchdog_cpumask sysctl.
>
> If we allowed the watchdog to run on nohz_full cores, the timer
> interrupts and scheduler work would prevent the desired tickless
> operation on those cores. But if we disable the watchdog globally,
> then the housekeeping cores can't benefit from the watchdog
> functionality. So we allow disabling it only on some cores.
> See Documentation/lockup-watchdogs.txt for more information.
Could you please expand on the patch motivation? "would prevent the
desired tickless operation on those cores" is quite vague.
Exactly what userspace-visible problem is the current implementation
causing and how does the patchset improve things?
If any of this is quantifiable (wakeups/sec, joules/hour etc) then some
attempt to perform those measurements would also be useful.
On Fri, 17 Apr 2015 14:37:17 -0400 Chris Metcalf <[email protected]> wrote:
> Change the default behavior of watchdog so it only runs on the
> housekeeping cores when nohz_full is enabled at build and boot time.
> Allow modifying the set of cores the watchdog is currently running
> on with a new kernel.watchdog_cpumask sysctl.
>
> If we allowed the watchdog to run on nohz_full cores, the timer
> interrupts and scheduler work would prevent the desired tickless
> operation on those cores. But if we disable the watchdog globally,
> then the housekeeping cores can't benefit from the watchdog
> functionality. So we allow disabling it only on some cores.
> See Documentation/lockup-watchdogs.txt for more information.
>
> ...
>
> +watchdog_cpumask:
> +
> +This value can be used to control on which cpus the watchdog may run.
> +The default cpumask is all possible cores, but if NO_HZ_FULL is
> +enabled in the kernel config, and cores are specified with the
> +nohz_full= boot argument, those cores are excluded by default.
> +Offline cores can be included in this mask, and if the core is later
> +brought online, the watchdog will be started based on the mask value.
> +
> +Typically this value would only be touched in the nohz_full case
> +to re-enable cores that by default were not running the watchdog,
> +if a kernel lockup was suspected on those cores.
Now the reader is wondering "how the heck do I specify a cpumask". Is
it hex encoded? decimal? binary string? Which bit corresponds to
which CPU?
A little example would help things along.
On Fri, 17 Apr 2015 14:37:18 -0400 Chris Metcalf <[email protected]> wrote:
> Allowing watchdog threads to be parked means that we now have the
> opportunity of actually seeing persistent parked threads in the output
> of /proc's stat and status files. The existing code reported such
"/proc's stat" is ambiguous (/proc/stat?). We can remove all doubt by
using full pathnames: /proc/<pid>/stat.
> threads as "Running", which is kind-of true if you think of the case
> where we park them as part of taking cpus offline. But if we allow
> parking them indefinitely, "Running" is pretty misleading, so we report
> them as "Sleeping" instead.
>
> We could simply report them with a new string, "Parked", but it feels
> like it's a bit risky for userspace to see unexpected new values.
> The scheduler does report parked tasks with a "P" in debugging output
> from sched_show_task() or dump_cpu_task(), but that's a different API.
>
> This change seemed slightly cleaner than updating the task_state_array
> to have additional rows. TASK_DEAD should be subsumed by the exit_state
> bits; TASK_WAKEKILL is just a modifier; and TASK_WAKING can very
> reasonably be reported as "Running" (as it is now). Only TASK_PARKED
> shows up with unreasonable output here.
Documentation/filesystems/proc.txt documents /proc/<pid>/status. It
documents "State" explicitly.
> --- a/fs/proc/array.c
> +++ b/fs/proc/array.c
> @@ -126,6 +126,10 @@ static inline const char *get_task_state(struct task_struct *tsk)
> {
> unsigned int state = (tsk->state | tsk->exit_state) & TASK_REPORT;
>
> + /* Treat parked tasks as sleeping. */
> + if (tsk->state == TASK_PARKED)
> + state = TASK_INTERRUPTIBLE;
The comment describes something which is utterly obvious. What it
doesn't describe (and should) is *why* we do this.
On 04/29/2015 06:26 PM, Andrew Morton wrote:
> On Fri, 17 Apr 2015 14:37:16 -0400 Chris Metcalf <[email protected]> wrote:
>
>> This change allows some cores to be excluded from running the
>> smp_hotplug_thread tasks. The following commit to update
>> kernel/watchdog.c to use this functionality is the motivating
>> example, and more information on the motivation is provided there.
>>
>> A new smp_hotplug_thread field is introduced, "cpumask", which
>> is cpumask field managed by the smpboot subsystem that indicates whether
>> or not the given smp_hotplug_thread should run on that core; the
>> cpumask is checked when deciding whether to unpark the thread.
>>
>> To limit the cpumask to less than cpu_possible, you must call
>> smpboot_update_cpumask_percpu_thread() after registering.
> Has Thomas commented on any version of this? t'would be nice.
He offered some specific feedback here:
https://lkml.org/lkml/2015/4/8/684
that I reflected in the subsequent iteration of the patch series.
Thomas, any additional feedback would be appreciated.
--
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com
This patch series allows the watchdog to run by default only
on the housekeeping cores when nohz_full is in effect; this
seems to be a good compromise short of turning it off completely
(since the nohz_full cores can't tolerate a watchdog).
To provide customizability, we add /proc/sys/kernel/watchdog_cpumask
so that the set of cores running the watchdog can be tuned to
different values after bootup.
To implement this customizability, we add a new
smpboot_update_cpumask_percpu_thread() API to the smpboot_thread
subsystem that lets us park or unpark "unwanted" threads.
And now that threads can be parked for long periods of time, we tweak
the /proc/<pid>/stat and /proc/<pid>/status code so parked threads
aren't reported as running, which is otherwise confusing.
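As a quick orientation for reviewers, a hypothetical client of the new
smpboot API looks roughly like this (all the my_* names and the
housekeeping_mask are placeholders, not part of the series):

#include <linux/percpu.h>
#include <linux/smpboot.h>

static DEFINE_PER_CPU(struct task_struct *, my_task_store);
static const struct cpumask *housekeeping_mask;	/* placeholder mask */

static int my_thread_should_run(unsigned int cpu)
{
	return 0;	/* placeholder: nothing to do */
}

static void my_thread_fn(unsigned int cpu)
{
}

static struct smp_hotplug_thread my_threads = {
	.store			= &my_task_store,
	.thread_should_run	= my_thread_should_run,
	.thread_fn		= my_thread_fn,
	.thread_comm		= "mythread/%u",
};

static int __init my_threads_init(void)
{
	int err;

	/* Threads start unparked on all cpus (cpu_possible_mask)... */
	err = smpboot_register_percpu_thread(&my_threads);
	if (err)
		return err;

	/* ...then park the ones outside the mask we care about. */
	return smpboot_update_cpumask_percpu_thread(&my_threads,
						    housekeeping_mask);
}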
v10: improved documentation and comments [akpm]
made watchdog_cpumask a static struct cpumask [Don, Uli]
print warning if update_cpumask fails for watchdog [Don]
v9: move cpumask into smpboot_hotplug_thread and don't let the
client initialize it either [Frederic]
use alloc_cpumask_var, not a locked static cpumask [Frederic]
add and use for_each_watchdog_cpu() [Uli]
check alloc_cpumask_var for failure [Chai Wen]
v8: make cpumask only updated by smpboot subsystem [Frederic]
improve documentation in "Documentation/" and in changelog [akpm]
v7: change from valid_cpu() callback to optional cpumask field
park smpboot threads rather than just not creating them
v6: use alloc_cpumask_var() [Sasha Levin]
change from an "exclude" data pointer to a more generic
valid_cpu() callback [Frederic]
add Don's Acked-by
v5: switch from watchdog_exclude to watchdog_cpumask [Frederic]
simplify the smp_hotplug_thread API to watchdog [Frederic]
Chris Metcalf (3):
smpboot: allow excluding cpus from the smpboot threads
watchdog: add watchdog_cpumask sysctl to assist nohz
procfs: treat parked tasks as sleeping for task state
Documentation/lockup-watchdogs.txt | 18 +++++++++++
Documentation/sysctl/kernel.txt | 21 +++++++++++++
fs/proc/array.c | 8 +++++
include/linux/nmi.h | 3 ++
include/linux/smpboot.h | 5 +++
kernel/smpboot.c | 55 +++++++++++++++++++++++++++++++-
kernel/sysctl.c | 7 +++++
kernel/watchdog.c | 64 +++++++++++++++++++++++++++++++++++---
8 files changed, 175 insertions(+), 6 deletions(-)
--
2.1.2
This change allows some cores to be excluded from running the
smp_hotplug_thread tasks. The following commit to update
kernel/watchdog.c to use this functionality is the motivating
example, and more information on the motivation is provided there.
A new smp_hotplug_thread field is introduced, "cpumask", which
is a cpumask field managed by the smpboot subsystem that indicates whether
or not the given smp_hotplug_thread should run on that core; the
cpumask is checked when deciding whether to unpark the thread.
To limit the cpumask to less than cpu_possible, you must call
smpboot_update_cpumask_percpu_thread() after registering.
Signed-off-by: Chris Metcalf <[email protected]>
---
include/linux/smpboot.h | 5 +++++
kernel/smpboot.c | 55 ++++++++++++++++++++++++++++++++++++++++++++++++-
2 files changed, 59 insertions(+), 1 deletion(-)
diff --git a/include/linux/smpboot.h b/include/linux/smpboot.h
index d600afb21926..7c42153edfac 100644
--- a/include/linux/smpboot.h
+++ b/include/linux/smpboot.h
@@ -27,6 +27,8 @@ struct smpboot_thread_data;
* @pre_unpark: Optional unpark function, called before the thread is
* unparked (cpu online). This is not guaranteed to be
* called on the target cpu of the thread. Careful!
+ * @cpumask: Internal state. To update which threads are unparked,
+ * call smpboot_update_cpumask_percpu_thread().
* @selfparking: Thread is not parked by the park function.
* @thread_comm: The base name of the thread
*/
@@ -41,11 +43,14 @@ struct smp_hotplug_thread {
void (*park)(unsigned int cpu);
void (*unpark)(unsigned int cpu);
void (*pre_unpark)(unsigned int cpu);
+ struct cpumask cpumask;
bool selfparking;
const char *thread_comm;
};
int smpboot_register_percpu_thread(struct smp_hotplug_thread *plug_thread);
void smpboot_unregister_percpu_thread(struct smp_hotplug_thread *plug_thread);
+int smpboot_update_cpumask_percpu_thread(struct smp_hotplug_thread *plug_thread,
+ const struct cpumask *);
#endif
diff --git a/kernel/smpboot.c b/kernel/smpboot.c
index c697f73d82d6..209750ab7031 100644
--- a/kernel/smpboot.c
+++ b/kernel/smpboot.c
@@ -232,7 +232,8 @@ void smpboot_unpark_threads(unsigned int cpu)
mutex_lock(&smpboot_threads_lock);
list_for_each_entry(cur, &hotplug_threads, list)
- smpboot_unpark_thread(cur, cpu);
+ if (cpumask_test_cpu(cpu, &cur->cpumask))
+ smpboot_unpark_thread(cur, cpu);
mutex_unlock(&smpboot_threads_lock);
}
@@ -258,6 +259,15 @@ static void smpboot_destroy_threads(struct smp_hotplug_thread *ht)
{
unsigned int cpu;
+ /* Unpark any threads that were voluntarily parked. */
+ for_each_cpu_not(cpu, &ht->cpumask) {
+ if (cpu_online(cpu)) {
+ struct task_struct *tsk = *per_cpu_ptr(ht->store, cpu);
+ if (tsk)
+ kthread_unpark(tsk);
+ }
+ }
+
/* We need to destroy also the parked threads of offline cpus */
for_each_possible_cpu(cpu) {
struct task_struct *tsk = *per_cpu_ptr(ht->store, cpu);
@@ -281,6 +291,7 @@ int smpboot_register_percpu_thread(struct smp_hotplug_thread *plug_thread)
unsigned int cpu;
int ret = 0;
+ cpumask_copy(&plug_thread->cpumask, cpu_possible_mask);
get_online_cpus();
mutex_lock(&smpboot_threads_lock);
for_each_online_cpu(cpu) {
@@ -316,6 +327,48 @@ void smpboot_unregister_percpu_thread(struct smp_hotplug_thread *plug_thread)
}
EXPORT_SYMBOL_GPL(smpboot_unregister_percpu_thread);
+/**
+ * smpboot_update_cpumask_percpu_thread - Adjust which per_cpu hotplug threads stay parked
+ * @plug_thread: Hotplug thread descriptor
+ * @new: Revised mask to use
+ *
+ * The cpumask field in the smp_hotplug_thread must not be updated directly
+ * by the client, but only by calling this function.
+ */
+int smpboot_update_cpumask_percpu_thread(struct smp_hotplug_thread *plug_thread,
+ const struct cpumask *new)
+{
+ struct cpumask *old = &plug_thread->cpumask;
+ cpumask_var_t tmp;
+ unsigned int cpu;
+
+ if (!alloc_cpumask_var(&tmp, GFP_KERNEL))
+ return -ENOMEM;
+
+ get_online_cpus();
+ mutex_lock(&smpboot_threads_lock);
+
+ /* Park threads that were exclusively enabled on the old mask. */
+ cpumask_andnot(tmp, old, new);
+ for_each_cpu_and(cpu, tmp, cpu_online_mask)
+ smpboot_park_thread(plug_thread, cpu);
+
+ /* Unpark threads that are exclusively enabled on the new mask. */
+ cpumask_andnot(tmp, new, old);
+ for_each_cpu_and(cpu, tmp, cpu_online_mask)
+ smpboot_unpark_thread(plug_thread, cpu);
+
+ cpumask_copy(old, new);
+
+ mutex_unlock(&smpboot_threads_lock);
+ put_online_cpus();
+
+ free_cpumask_var(tmp);
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(smpboot_update_cpumask_percpu_thread);
+
static DEFINE_PER_CPU(atomic_t, cpu_hotplug_state) = ATOMIC_INIT(CPU_POST_DEAD);
/*
--
2.1.2
Change the default behavior of watchdog so it only runs on the
housekeeping cores when nohz_full is enabled at build and boot time.
Allow modifying the set of cores the watchdog is currently running
on with a new kernel.watchdog_cpumask sysctl.
In the current system, the watchdog subsystem runs a periodic timer
that schedules the watchdog kthread to run. However, nohz_full cores
are designed to allow userspace application code running on those cores
to have 100% access to the CPU. So the watchdog system prevents the
nohz_full application code from being able to run the way it wants to,
thus the motivation to suppress the watchdog on nohz_full cores,
which this patchset provides by default.
However, if we disable the watchdog globally, then the housekeeping
cores can't benefit from the watchdog functionality. So we allow
disabling it only on some cores. See Documentation/lockup-watchdogs.txt
for more information.
Acked-by: Don Zickus <[email protected]>
Signed-off-by: Chris Metcalf <[email protected]>
---
Documentation/lockup-watchdogs.txt | 18 +++++++++++
Documentation/sysctl/kernel.txt | 21 +++++++++++++
include/linux/nmi.h | 3 ++
kernel/sysctl.c | 7 +++++
kernel/watchdog.c | 64 +++++++++++++++++++++++++++++++++++---
5 files changed, 108 insertions(+), 5 deletions(-)
diff --git a/Documentation/lockup-watchdogs.txt b/Documentation/lockup-watchdogs.txt
index ab0baa692c13..22dd6af2e4bd 100644
--- a/Documentation/lockup-watchdogs.txt
+++ b/Documentation/lockup-watchdogs.txt
@@ -61,3 +61,21 @@ As explained above, a kernel knob is provided that allows
administrators to configure the period of the hrtimer and the perf
event. The right value for a particular environment is a trade-off
between fast response to lockups and detection overhead.
+
+By default, the watchdog runs on all online cores. However, on a
+kernel configured with NO_HZ_FULL, by default the watchdog runs only
+on the housekeeping cores, not the cores specified in the "nohz_full"
+boot argument. If we allowed the watchdog to run by default on
+the "nohz_full" cores, we would have to run timer ticks to activate
+the scheduler, which would prevent the "nohz_full" functionality
+from protecting the user code on those cores from the kernel.
+Of course, disabling it by default on the nohz_full cores means that
+when those cores do enter the kernel, by default we will not be
+able to detect if they lock up. However, allowing the watchdog
+to continue to run on the housekeeping (non-tickless) cores means
+that we will continue to detect lockups properly on those cores.
+
+In either case, the set of cores excluded from running the watchdog
+may be adjusted via the kernel.watchdog_cpumask sysctl. For
+nohz_full cores, this may be useful for debugging a case where the
+kernel seems to be hanging on the nohz_full cores.
diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index c831001c45f1..e5d528e0c46e 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -923,6 +923,27 @@ and nmi_watchdog.
==============================================================
+watchdog_cpumask:
+
+This value can be used to control on which cpus the watchdog may run.
+The default cpumask is all possible cores, but if NO_HZ_FULL is
+enabled in the kernel config, and cores are specified with the
+nohz_full= boot argument, those cores are excluded by default.
+Offline cores can be included in this mask, and if the core is later
+brought online, the watchdog will be started based on the mask value.
+
+Typically this value would only be touched in the nohz_full case
+to re-enable cores that by default were not running the watchdog,
+if a kernel lockup was suspected on those cores.
+
+The argument value is the standard cpulist format for cpumasks,
+so for example to enable the watchdog on cores 0, 2, 3, and 4 you
+might say:
+
+ echo 0,2-4 > /proc/sys/kernel/watchdog_cpumask
+
+==============================================================
+
watchdog_thresh:
This value can be used to control the frequency of hrtimer and NMI
diff --git a/include/linux/nmi.h b/include/linux/nmi.h
index 3d46fb4708e0..f94da0e65dea 100644
--- a/include/linux/nmi.h
+++ b/include/linux/nmi.h
@@ -67,6 +67,7 @@ extern int nmi_watchdog_enabled;
extern int soft_watchdog_enabled;
extern int watchdog_user_enabled;
extern int watchdog_thresh;
+extern unsigned long *watchdog_cpumask_bits;
extern int sysctl_softlockup_all_cpu_backtrace;
struct ctl_table;
extern int proc_watchdog(struct ctl_table *, int ,
@@ -77,6 +78,8 @@ extern int proc_soft_watchdog(struct ctl_table *, int ,
void __user *, size_t *, loff_t *);
extern int proc_watchdog_thresh(struct ctl_table *, int ,
void __user *, size_t *, loff_t *);
+extern int proc_watchdog_cpumask(struct ctl_table *, int,
+ void __user *, size_t *, loff_t *);
#endif
#ifdef CONFIG_HAVE_ACPI_APEI_NMI
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 2082b1a88fb9..699571a74e3b 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -881,6 +881,13 @@ static struct ctl_table kern_table[] = {
.extra2 = &one,
},
{
+ .procname = "watchdog_cpumask",
+ .data = &watchdog_cpumask_bits,
+ .maxlen = NR_CPUS,
+ .mode = 0644,
+ .proc_handler = proc_watchdog_cpumask,
+ },
+ {
.procname = "softlockup_panic",
.data = &softlockup_panic,
.maxlen = sizeof(int),
diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 2316f50b07a4..436046e2562c 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -19,6 +19,7 @@
#include <linux/sysctl.h>
#include <linux/smpboot.h>
#include <linux/sched/rt.h>
+#include <linux/tick.h>
#include <asm/irq_regs.h>
#include <linux/kvm_para.h>
@@ -56,6 +57,12 @@ int __read_mostly sysctl_softlockup_all_cpu_backtrace;
#else
#define sysctl_softlockup_all_cpu_backtrace 0
#endif
+static struct cpumask watchdog_cpumask __read_mostly;
+unsigned long *watchdog_cpumask_bits = cpumask_bits(&watchdog_cpumask);
+
+/* Helper for online, unparked cpus. */
+#define for_each_watchdog_cpu(cpu) \
+ for_each_cpu_and((cpu), cpu_online_mask, &watchdog_cpumask)
static int __read_mostly watchdog_running;
static u64 __read_mostly sample_period;
@@ -205,7 +212,7 @@ void touch_all_softlockup_watchdogs(void)
* do we care if a 0 races with a timestamp?
* all it means is the softlock check starts one cycle later
*/
- for_each_online_cpu(cpu)
+ for_each_watchdog_cpu(cpu)
per_cpu(watchdog_touch_ts, cpu) = 0;
}
@@ -612,7 +619,7 @@ void watchdog_nmi_enable_all(void)
return;
get_online_cpus();
- for_each_online_cpu(cpu)
+ for_each_watchdog_cpu(cpu)
watchdog_nmi_enable(cpu);
put_online_cpus();
}
@@ -625,7 +632,7 @@ void watchdog_nmi_disable_all(void)
return;
get_online_cpus();
- for_each_online_cpu(cpu)
+ for_each_watchdog_cpu(cpu)
watchdog_nmi_disable(cpu);
put_online_cpus();
}
@@ -684,7 +691,7 @@ static void update_watchdog_all_cpus(void)
int cpu;
get_online_cpus();
- for_each_online_cpu(cpu)
+ for_each_watchdog_cpu(cpu)
update_watchdog(cpu);
put_online_cpus();
}
@@ -697,8 +704,12 @@ static int watchdog_enable_all_cpus(void)
err = smpboot_register_percpu_thread(&watchdog_threads);
if (err)
pr_err("Failed to create watchdog threads, disabled\n");
- else
+ else {
+ if (smpboot_update_cpumask_percpu_thread(
+ &watchdog_threads, &watchdog_cpumask))
+ pr_err("Failed to set cpumask for watchdog threads\n");
watchdog_running = 1;
+ }
} else {
/*
* Enable/disable the lockup detectors or
@@ -869,12 +880,55 @@ out:
mutex_unlock(&watchdog_proc_mutex);
return err;
}
+
+/*
+ * The cpumask is the mask of possible cpus that the watchdog can run
+ * on, not the mask of cpus it is actually running on. This allows the
+ * user to specify a mask that will include cpus that have not yet
+ * been brought online, if desired.
+ */
+int proc_watchdog_cpumask(struct ctl_table *table, int write,
+ void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+ int err;
+
+ mutex_lock(&watchdog_proc_mutex);
+ err = proc_do_large_bitmap(table, write, buffer, lenp, ppos);
+ if (!err && write) {
+ /* Remove impossible cpus to keep sysctl output cleaner. */
+ cpumask_and(&watchdog_cpumask, &watchdog_cpumask,
+ cpu_possible_mask);
+
+ if (watchdog_enabled && watchdog_thresh) {
+ /*
+ * Failure would be due to being unable to allocate
+ * a temporary cpumask, so we are likely not in a
+ * position to do much else to make things better.
+ */
+ if (smpboot_update_cpumask_percpu_thread(
+ &watchdog_threads, &watchdog_cpumask) != 0)
+ pr_err("cpumask update failed\n");
+ }
+ }
+ mutex_unlock(&watchdog_proc_mutex);
+ return err;
+}
+
#endif /* CONFIG_SYSCTL */
void __init lockup_detector_init(void)
{
set_sample_period();
+#ifdef CONFIG_NO_HZ_FULL
+ if (!cpumask_empty(tick_nohz_full_mask))
+ pr_info("Disabling watchdog on nohz_full cores by default\n");
+ cpumask_andnot(&watchdog_cpumask, cpu_possible_mask,
+ tick_nohz_full_mask);
+#else
+ cpumask_copy(&watchdog_cpumask, cpu_possible_mask);
+#endif
+
if (watchdog_enabled)
watchdog_enable_all_cpus();
}
--
2.1.2
Allowing watchdog threads to be parked means that we now have the
opportunity of actually seeing persistent parked threads in the output
of /proc/<pid>/stat and /proc/<pid>/status. The existing code reported
such threads as "Running", which is kind-of true if you think of the case
where we park them as part of taking cpus offline. But if we allow
parking them indefinitely, "Running" is pretty misleading, so we report
them as "Sleeping" instead.
We could simply report them with a new string, "Parked", but it feels
like it's a bit risky for userspace to see unexpected new values; the
output is already documented in Documentation/filesystems/proc.txt,
and it seems like a mistake to change that lightly.
The scheduler does report parked tasks with a "P" in debugging output
from sched_show_task() or dump_cpu_task(), but that's a different API.
Similarly, the trace_ctxwake_* routines report a "P" for parked tasks,
but again, different API.
This change seemed slightly cleaner than updating the task_state_array
to have additional rows. TASK_DEAD should be subsumed by the exit_state
bits; TASK_WAKEKILL is just a modifier; and TASK_WAKING can very
reasonably be reported as "Running" (as it is now). Only TASK_PARKED
shows up with unreasonable output here.
Signed-off-by: Chris Metcalf <[email protected]>
---
fs/proc/array.c | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/fs/proc/array.c b/fs/proc/array.c
index fd02a9ebfc30..3f57dac31ba6 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -126,6 +126,14 @@ static inline const char *get_task_state(struct task_struct *tsk)
{
unsigned int state = (tsk->state | tsk->exit_state) & TASK_REPORT;
+ /*
+ * Parked tasks do not run; they sit in __kthread_parkme().
+ * Without this check, we would report them as running, which is
+ * clearly wrong, so we report them as sleeping instead.
+ */
+ if (tsk->state == TASK_PARKED)
+ state = TASK_INTERRUPTIBLE;
+
BUILD_BUG_ON(1 + ilog2(TASK_REPORT) != ARRAY_SIZE(task_state_array)-1);
return task_state_array[fls(state)];
--
2.1.2
On Thu, Apr 30, 2015 at 03:39:25PM -0400, Chris Metcalf wrote:
> Change the default behavior of watchdog so it only runs on the
> housekeeping cores when nohz_full is enabled at build and boot time.
> Allow modifying the set of cores the watchdog is currently running
> on with a new kernel.watchdog_cpumask sysctl.
>
> In the current system, the watchdog subsystem runs a periodic timer
> that schedules the watchdog kthread to run. However, nohz_full cores
> are designed to allow userspace application code running on those cores
> to have 100% access to the CPU. So the watchdog system prevents the
> nohz_full application code from being able to run the way it wants to,
> thus the motivation to suppress the watchdog on nohz_full cores,
> which this patchset provides by default.
>
> However, if we disable the watchdog globally, then the housekeeping
> cores can't benefit from the watchdog functionality. So we allow
> disabling it only on some cores. See Documentation/lockup-watchdogs.txt
> for more information.
>
> Acked-by: Don Zickus <[email protected]>
> Signed-off-by: Chris Metcalf <[email protected]>
> ---
> Documentation/lockup-watchdogs.txt | 18 +++++++++++
> Documentation/sysctl/kernel.txt | 21 +++++++++++++
> include/linux/nmi.h | 3 ++
> kernel/sysctl.c | 7 +++++
> kernel/watchdog.c | 64 +++++++++++++++++++++++++++++++++++---
> 5 files changed, 108 insertions(+), 5 deletions(-)
>
<snip>
> @@ -697,8 +704,12 @@ static int watchdog_enable_all_cpus(void)
> err = smpboot_register_percpu_thread(&watchdog_threads);
> if (err)
> pr_err("Failed to create watchdog threads, disabled\n");
> - else
> + else {
> + if (smpboot_update_cpumask_percpu_thread(
> + &watchdog_threads, &watchdog_cpumask))
> + pr_err("Failed to set cpumask for watchdog threads\n");
Stupid nitpick, this error message tells us the 'watchdog' threads caused
the cpumask failure, but ....
> watchdog_running = 1;
> + }
> } else {
> /*
> * Enable/disable the lockup detectors or
> @@ -869,12 +880,55 @@ out:
> mutex_unlock(&watchdog_proc_mutex);
> return err;
> }
> +
> +/*
> + * The cpumask is the mask of possible cpus that the watchdog can run
> + * on, not the mask of cpus it is actually running on. This allows the
> + * user to specify a mask that will include cpus that have not yet
> + * been brought online, if desired.
> + */
> +int proc_watchdog_cpumask(struct ctl_table *table, int write,
> + void __user *buffer, size_t *lenp, loff_t *ppos)
> +{
> + int err;
> +
> + mutex_lock(&watchdog_proc_mutex);
> + err = proc_do_large_bitmap(table, write, buffer, lenp, ppos);
> + if (!err && write) {
> + /* Remove impossible cpus to keep sysctl output cleaner. */
> + cpumask_and(&watchdog_cpumask, &watchdog_cpumask,
> + cpu_possible_mask);
> +
> + if (watchdog_enabled && watchdog_thresh) {
> + /*
> + * Failure would be due to being unable to allocate
> + * a temporary cpumask, so we are likely not in a
> + * position to do much else to make things better.
> + */
> + if (smpboot_update_cpumask_percpu_thread(
> + &watchdog_threads, &watchdog_cpumask) != 0)
> + pr_err("cpumask update failed\n");
This one does not. :-( If there is a respin, I would suggest copying the
above message down here.
Cheers,
Don
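As a concrete illustration of the new knob (the exact values here are
illustrative, not taken from the patch): the sysctl is backed by
proc_do_large_bitmap(), so it should accept the usual cpulist syntax.
On a machine booted with nohz_full=1-35, "cat
/proc/sys/kernel/watchdog_cpumask" would show the boot default of "0",
and "echo 0-35 > /proc/sys/kernel/watchdog_cpumask" would turn the
watchdog back on everywhere; whatever is written, reading the file back
shows the current mask with impossible cpus already stripped by the
proc_watchdog_cpumask() handler above.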
On 04/30/2015 04:00 PM, Don Zickus wrote:
> On Thu, Apr 30, 2015 at 03:39:25PM -0400, Chris Metcalf wrote:
>> if (err)
>> pr_err("Failed to create watchdog threads, disabled\n");
>> + else {
>> + if (smpboot_update_cpumask_percpu_thread(
>> + &watchdog_threads, &watchdog_cpumask))
>> + pr_err("Failed to set cpumask for watchdog threads\n");
> Stupid nitpick, this error message tells us the 'watchdog' threads caused
> the cpumask failure, but ....
>
>> + /*
>> + * Failure would be due to being unable to allocate
>> + * a temporary cpumask, so we are likely not in a
>> + * position to do much else to make things better.
>> + */
>> + if (smpboot_update_cpumask_percpu_thread(
>> + &watchdog_threads, &watchdog_cpumask) != 0)
>> + pr_err("cpumask update failed\n");
> This one does not. :-( If there is a respin, I would suggest copying the
> above message down here.
There is that "#define pr_fmt(fmt)" at the top of the file that prefixes
all the messages with "NMI watchdog: ", though. I think that's
sufficient to make it clear what the second message is about.
(The first message I wrote the way I did to be parallel with the
message just before it, if the thread creation failed.)
I could tweak the messages but I think they're reasonable given the
prefix. What do you think?
--
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com
On Thu, Apr 30, 2015 at 03:39:24PM -0400, Chris Metcalf wrote:
> This change allows some cores to be excluded from running the
> smp_hotplug_thread tasks. The following commit to update
> kernel/watchdog.c to use this functionality is the motivating
> example, and more information on the motivation is provided there.
>
> A new smp_hotplug_thread field is introduced, "cpumask", which
> is a cpumask field managed by the smpboot subsystem that indicates whether
> or not the given smp_hotplug_thread should run on that core; the
> cpumask is checked when deciding whether to unpark the thread.
>
> To limit the cpumask to less than cpu_possible, you must call
> smpboot_update_cpumask_percpu_thread() after registering.
>
> Signed-off-by: Chris Metcalf <[email protected]>
> ---
> include/linux/smpboot.h | 5 +++++
> kernel/smpboot.c | 55 ++++++++++++++++++++++++++++++++++++++++++++++++-
> 2 files changed, 59 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/smpboot.h b/include/linux/smpboot.h
> index d600afb21926..7c42153edfac 100644
> --- a/include/linux/smpboot.h
> +++ b/include/linux/smpboot.h
> @@ -27,6 +27,8 @@ struct smpboot_thread_data;
> * @pre_unpark: Optional unpark function, called before the thread is
> * unparked (cpu online). This is not guaranteed to be
> * called on the target cpu of the thread. Careful!
> + * @cpumask: Internal state. To update which threads are unparked,
> + * call smpboot_update_cpumask_percpu_thread().
> * @selfparking: Thread is not parked by the park function.
> * @thread_comm: The base name of the thread
> */
> @@ -41,11 +43,14 @@ struct smp_hotplug_thread {
> void (*park)(unsigned int cpu);
> void (*unpark)(unsigned int cpu);
> void (*pre_unpark)(unsigned int cpu);
> + struct cpumask cpumask;
I believe it should be allocated dynamically, otherwise it gets the size of NR_CPUS
instead of nr_cpus_bits. It's not _that_ much space spared, but I expect
several struct smp_hotplug_thread instances to be registered.
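For reference, the size difference at stake comes from how cpumask_var_t
is defined; this is sketched from include/linux/cpumask.h, so see the
real header for the authoritative definitions:

    #ifdef CONFIG_CPUMASK_OFFSTACK
    /* A pointer; alloc_cpumask_var() sizes the bitmap to the runtime
     * number of cpu bits rather than to the compile-time NR_CPUS. */
    typedef struct cpumask *cpumask_var_t;
    #else
    /* Embedded storage: a full NR_CPUS-bit mask in every user. */
    typedef struct cpumask cpumask_var_t[1];
    #endif

With "struct cpumask cpumask" embedded in smp_hotplug_thread, every
registered descriptor carries NR_CPUS bits of storage; cpumask_var_t
plus alloc_cpumask_var() avoids that cost on CPUMASK_OFFSTACK configs.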
> bool selfparking;
> const char *thread_comm;
> };
>
> int smpboot_register_percpu_thread(struct smp_hotplug_thread *plug_thread);
> void smpboot_unregister_percpu_thread(struct smp_hotplug_thread *plug_thread);
> +int smpboot_update_cpumask_percpu_thread(struct smp_hotplug_thread *plug_thread,
> + const struct cpumask *);
>
> #endif
> diff --git a/kernel/smpboot.c b/kernel/smpboot.c
> index c697f73d82d6..209750ab7031 100644
> --- a/kernel/smpboot.c
> +++ b/kernel/smpboot.c
> @@ -232,7 +232,8 @@ void smpboot_unpark_threads(unsigned int cpu)
>
> mutex_lock(&smpboot_threads_lock);
> list_for_each_entry(cur, &hotplug_threads, list)
> - smpboot_unpark_thread(cur, cpu);
> + if (cpumask_test_cpu(cpu, &cur->cpumask))
> + smpboot_unpark_thread(cur, cpu);
> mutex_unlock(&smpboot_threads_lock);
> }
>
> @@ -258,6 +259,15 @@ static void smpboot_destroy_threads(struct smp_hotplug_thread *ht)
> {
> unsigned int cpu;
>
> + /* Unpark any threads that were voluntarily parked. */
> + for_each_cpu_not(cpu, &ht->cpumask) {
> + if (cpu_online(cpu)) {
> + struct task_struct *tsk = *per_cpu_ptr(ht->store, cpu);
> + if (tsk)
> + kthread_unpark(tsk);
I'm still not clear why we are doing that. kthread_stop() should be able
to handle parked kthreads, otherwise it needs to be fixed.
> + }
> + }
> +
> /* We need to destroy also the parked threads of offline cpus */
> for_each_possible_cpu(cpu) {
> struct task_struct *tsk = *per_cpu_ptr(ht->store, cpu);
> @@ -281,6 +291,7 @@ int smpboot_register_percpu_thread(struct smp_hotplug_thread *plug_thread)
> unsigned int cpu;
> int ret = 0;
>
> + cpumask_copy(&plug_thread->cpumask, cpu_possible_mask);
> get_online_cpus();
> mutex_lock(&smpboot_threads_lock);
> for_each_online_cpu(cpu) {
> @@ -316,6 +327,48 @@ void smpboot_unregister_percpu_thread(struct smp_hotplug_thread *plug_thread)
> }
> EXPORT_SYMBOL_GPL(smpboot_unregister_percpu_thread);
>
> +/**
> + * smpboot_update_cpumask_percpu_thread - Adjust which per_cpu hotplug threads stay parked
> + * @plug_thread: Hotplug thread descriptor
> + * @new: Revised mask to use
> + *
> + * The cpumask field in the smp_hotplug_thread must not be updated directly
> + * by the client, but only by calling this function.
> + */
> +int smpboot_update_cpumask_percpu_thread(struct smp_hotplug_thread *plug_thread,
> + const struct cpumask *new)
> +{
> + struct cpumask *old = &plug_thread->cpumask;
> + cpumask_var_t tmp;
> + unsigned int cpu;
> +
> + if (!alloc_cpumask_var(&tmp, GFP_KERNEL))
> + return -ENOMEM;
> +
> + get_online_cpus();
> + mutex_lock(&smpboot_threads_lock);
> +
> + /* Park threads that were exclusively enabled on the old mask. */
> + cpumask_andnot(tmp, old, new);
> + for_each_cpu_and(cpu, tmp, cpu_online_mask)
> + smpboot_park_thread(plug_thread, cpu);
> +
> + /* Unpark threads that are exclusively enabled on the new mask. */
> + cpumask_andnot(tmp, new, old);
> + for_each_cpu_and(cpu, tmp, cpu_online_mask)
> + smpboot_unpark_thread(plug_thread, cpu);
> +
> + cpumask_copy(old, new);
> +
> + mutex_unlock(&smpboot_threads_lock);
> + put_online_cpus();
> +
> + free_cpumask_var(tmp);
> +
> + return 0;
> +}
> +EXPORT_SYMBOL_GPL(smpboot_update_cpumask_percpu_thread);
> +
> static DEFINE_PER_CPU(atomic_t, cpu_hotplug_state) = ATOMIC_INIT(CPU_POST_DEAD);
>
> /*
> --
> 2.1.2
>
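To make the intended calling convention concrete, a hypothetical client
of the new interface might look like the sketch below. The example_*
names are invented for illustration (the watchdog conversion in the next
patch is the real user), the mask computation assumes CONFIG_NO_HZ_FULL
so that tick_nohz_full_mask exists, and error unwinding of the
registration is elided.

    #include <linux/cpumask.h>
    #include <linux/gfp.h>
    #include <linux/init.h>
    #include <linux/percpu.h>
    #include <linux/smpboot.h>
    #include <linux/tick.h>

    static DEFINE_PER_CPU(struct task_struct *, example_thread_store);

    static int example_should_run(unsigned int cpu)
    {
            return 0;               /* nothing to do in this sketch */
    }

    static void example_fn(unsigned int cpu)
    {
    }

    static struct smp_hotplug_thread example_threads = {
            .store              = &example_thread_store,
            .thread_should_run  = example_should_run,
            .thread_fn          = example_fn,
            .thread_comm        = "example/%u",
    };

    static int __init example_init(void)
    {
            cpumask_var_t mask;
            int err;

            /* Threads are created for all possible cpus, and unparked
             * on the cpus that are currently online... */
            err = smpboot_register_percpu_thread(&example_threads);
            if (err)
                    return err;

            if (!alloc_cpumask_var(&mask, GFP_KERNEL))
                    return -ENOMEM;

            /* ...then restricted, here to the housekeeping cpus. */
            cpumask_andnot(mask, cpu_possible_mask, tick_nohz_full_mask);
            err = smpboot_update_cpumask_percpu_thread(&example_threads,
                                                       mask);

            free_cpumask_var(mask);
            return err;
    }
    core_initcall(example_init);

Note how the update path works as a pair of set differences: threads on
cpus in the old mask but not the new one are parked, threads on cpus in
the new mask but not the old one are unparked, and only online cpus are
touched either way, so the mask may safely name cpus that have not been
brought up yet.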
On Thu, Apr 30, 2015 at 04:09:52PM -0400, Chris Metcalf wrote:
> On 04/30/2015 04:00 PM, Don Zickus wrote:
> >On Thu, Apr 30, 2015 at 03:39:25PM -0400, Chris Metcalf wrote:
> >> if (err)
> >> pr_err("Failed to create watchdog threads, disabled\n");
> >>+ else {
> >>+ if (smpboot_update_cpumask_percpu_thread(
> >>+ &watchdog_threads, &watchdog_cpumask))
> >>+ pr_err("Failed to set cpumask for watchdog threads\n");
> >Stupid nitpick, this error message tells us the 'watchdog' threads caused
> >the cpumask failure, but ....
> >
> >>+ /*
> >>+ * Failure would be due to being unable to allocate
> >>+ * a temporary cpumask, so we are likely not in a
> >>+ * position to do much else to make things better.
> >>+ */
> >>+ if (smpboot_update_cpumask_percpu_thread(
> >>+ &watchdog_threads, &watchdog_cpumask) != 0)
> >>+ pr_err("cpumask update failed\n");
> >This one does not. :-( If there is a respin, I would suggest copying the
> >above message down here.
>
> There is that "#define pr_fmt(fmt)" at the top of the file that prefixes
> all the messages with "NMI watchdog: ", though. I think that's
> sufficient to make it clear what the second message is about.
> (The first message I wrote the way I did to be parallel with the
> message just before it, if the thread creation failed.)
Ah, yes. Nevermind. I keep forgetting about that. :-)
Cheers,
Don
On 05/01/2015 04:53 AM, Frederic Weisbecker wrote:
> On Thu, Apr 30, 2015 at 03:39:24PM -0400, Chris Metcalf wrote:
>> This change allows some cores to be excluded from running the
>> smp_hotplug_thread tasks. The following commit to update
>> kernel/watchdog.c to use this functionality is the motivating
>> example, and more information on the motivation is provided there.
>>
>> A new smp_hotplug_thread field is introduced, "cpumask", which
>> is a cpumask field managed by the smpboot subsystem that indicates whether
>> or not the given smp_hotplug_thread should run on that core; the
>> cpumask is checked when deciding whether to unpark the thread.
>>
>> To limit the cpumask to less than cpu_possible, you must call
>> smpboot_update_cpumask_percpu_thread() after registering.
>>
>> Signed-off-by: Chris Metcalf <[email protected]>
>> ---
>> include/linux/smpboot.h | 5 +++++
>> kernel/smpboot.c | 55 ++++++++++++++++++++++++++++++++++++++++++++++++-
>> 2 files changed, 59 insertions(+), 1 deletion(-)
>>
>> diff --git a/include/linux/smpboot.h b/include/linux/smpboot.h
>> index d600afb21926..7c42153edfac 100644
>> --- a/include/linux/smpboot.h
>> +++ b/include/linux/smpboot.h
>> @@ -27,6 +27,8 @@ struct smpboot_thread_data;
>> * @pre_unpark: Optional unpark function, called before the thread is
>> * unparked (cpu online). This is not guaranteed to be
>> * called on the target cpu of the thread. Careful!
>> + * @cpumask: Internal state. To update which threads are unparked,
>> + * call smpboot_update_cpumask_percpu_thread().
>> * @selfparking: Thread is not parked by the park function.
>> * @thread_comm: The base name of the thread
>> */
>> @@ -41,11 +43,14 @@ struct smp_hotplug_thread {
>> void (*park)(unsigned int cpu);
>> void (*unpark)(unsigned int cpu);
>> void (*pre_unpark)(unsigned int cpu);
>> + struct cpumask cpumask;
> I believe it should be allocated dynamically, otherwise it gets the size of NR_CPUS
> instead of nr_cpus_bits. It's not _that_ much space spared, but I expect
> several struct smp_hotplug_thread instances to be registered.
I'll submit a follow-up patch to do this. I'm assuming this doesn't need to
be rolled as a v11, and can be a stand-alone patch, but I'll do it whichever
way Andrew prefers.
>> + /* Unpark any threads that were voluntarily parked. */
>> + for_each_cpu_not(cpu, &ht->cpumask) {
>> + if (cpu_online(cpu)) {
>> + struct task_struct *tsk = *per_cpu_ptr(ht->store, cpu);
>> + if (tsk)
>> + kthread_unpark(tsk);
> I'm still not clear why we are doing that. kthread_stop() should be able
> to handle parked kthreads, otherwise it needs to be fixed.
Checking without the unpark, it's actually only a problem with nohz_full.
In a system without nohz_full, the kthreads are able to stop even when
they are parked; it's only in the nohz_full case that things wedge.
For example, booting with only cpu 0 as a housekeeping core (and
therefore all watchdogs 1-35 on my 36-core tilegx are parked), and
immediately doing "echo 0 > /proc/sys/kernel/watchdog", I see
(via SysRq ^O-l) the first parked watchdog, on cpu 1, hung with:
frame 0: 0xfffffff7000f2928 lock_hrtimer_base+0xb8/0xc0
frame 1: 0xfffffff7000f2a28 hrtimer_try_to_cancel+0x40/0x170
frame 2: 0xfffffff7000f2a28 hrtimer_try_to_cancel+0x40/0x170
frame 3: 0xfffffff7000f2b98 hrtimer_cancel+0x40/0x68
frame 4: 0xfffffff70014cce0 watchdog_disable+0x50/0x70
frame 5: 0xfffffff70008c2d0 smpboot_thread_fn+0x350/0x438
frame 6: 0xfffffff700084b28 kthread+0x160/0x178
The other cores are all idle.
I have no idea why lock_hrtimer_base() is hanging; perhaps the
hrtimer_cpu_base lock is taken by some other task that is now
scheduled out.
The config does not have NO_HZ_FULL_ALL or NO_HZ_FULL_SYSIDLE
set, and does have RCU_FAST_NO_HZ and RCU_NOCB_CPU_ALL.
I don't really know how to start debugging this, but I do know that
unparking the threads first avoids the issue :-)
--
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com
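For context on where that stack is spinning: watchdog_disable() is the
smp_hotplug_thread .park hook for the watchdog, and it looks
approximately like this in the kernel/watchdog.c of this era (quoted
from memory, so treat it as a sketch):

    static void watchdog_disable(unsigned int cpu)
    {
            struct hrtimer *hrtimer = raw_cpu_ptr(&watchdog_hrtimer);

            watchdog_set_prio(SCHED_NORMAL, 0);
            hrtimer_cancel(hrtimer);        /* frames 0-3 of the hang */
            /* disable the perf event */
            watchdog_nmi_disable(cpu);
    }

hrtimer_cancel() keeps retrying hrtimer_try_to_cancel() until it
succeeds, and hrtimer_try_to_cancel() begins by taking the timer's base
lock, which is exactly where the backtrace shows cpu 1 stuck.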
Frederic Weisbecker observed that we'd be better off dynamically
allocating the cpumask associated with smpboot threads, to save memory.
Signed-off-by: Chris Metcalf <[email protected]>
---
(Dropped a bunch of people off the cc's)
I figured since Andrew had already taken the previous v10 patchset into
the -mm tree, it was better to just submit this as a followup patch.
I can also argue that since it doesn't fix a bug per se, just improves
memory usage, it seems OK to have as a follow-on patch.
That said, I can respin a v11 patch 1/3 instead to give a new single
patch for smpboot; Andrew, just let me know if you'd prefer that.
include/linux/smpboot.h | 2 +-
kernel/smpboot.c | 12 ++++++++----
2 files changed, 9 insertions(+), 5 deletions(-)
diff --git a/include/linux/smpboot.h b/include/linux/smpboot.h
index 7c42153edfac..da3c593f9845 100644
--- a/include/linux/smpboot.h
+++ b/include/linux/smpboot.h
@@ -43,7 +43,7 @@ struct smp_hotplug_thread {
         void                            (*park)(unsigned int cpu);
         void                            (*unpark)(unsigned int cpu);
         void                            (*pre_unpark)(unsigned int cpu);
-        struct cpumask                  cpumask;
+        cpumask_var_t                   cpumask;
         bool                            selfparking;
         const char                      *thread_comm;
 };
diff --git a/kernel/smpboot.c b/kernel/smpboot.c
index 209750ab7031..5e46c2a75d59 100644
--- a/kernel/smpboot.c
+++ b/kernel/smpboot.c
@@ -232,7 +232,7 @@ void smpboot_unpark_threads(unsigned int cpu)

         mutex_lock(&smpboot_threads_lock);
         list_for_each_entry(cur, &hotplug_threads, list)
-                if (cpumask_test_cpu(cpu, &cur->cpumask))
+                if (cpumask_test_cpu(cpu, cur->cpumask))
                         smpboot_unpark_thread(cur, cpu);
         mutex_unlock(&smpboot_threads_lock);
 }
@@ -260,7 +260,7 @@ static void smpboot_destroy_threads(struct smp_hotplug_thread *ht)
         unsigned int cpu;

         /* Unpark any threads that were voluntarily parked. */
-        for_each_cpu_not(cpu, &ht->cpumask) {
+        for_each_cpu_not(cpu, ht->cpumask) {
                 if (cpu_online(cpu)) {
                         struct task_struct *tsk = *per_cpu_ptr(ht->store, cpu);
                         if (tsk)
@@ -291,7 +291,10 @@ int smpboot_register_percpu_thread(struct smp_hotplug_thread *plug_thread)
         unsigned int cpu;
         int ret = 0;

-        cpumask_copy(&plug_thread->cpumask, cpu_possible_mask);
+        if (!alloc_cpumask_var(&plug_thread->cpumask, GFP_KERNEL))
+                return -ENOMEM;
+        cpumask_copy(plug_thread->cpumask, cpu_possible_mask);
+
         get_online_cpus();
         mutex_lock(&smpboot_threads_lock);
         for_each_online_cpu(cpu) {
@@ -324,6 +327,7 @@ void smpboot_unregister_percpu_thread(struct smp_hotplug_thread *plug_thread)
         smpboot_destroy_threads(plug_thread);
         mutex_unlock(&smpboot_threads_lock);
         put_online_cpus();
+        free_cpumask_var(plug_thread->cpumask);
 }
 EXPORT_SYMBOL_GPL(smpboot_unregister_percpu_thread);
@@ -338,7 +342,7 @@ EXPORT_SYMBOL_GPL(smpboot_unregister_percpu_thread);
 int smpboot_update_cpumask_percpu_thread(struct smp_hotplug_thread *plug_thread,
                                          const struct cpumask *new)
 {
-        struct cpumask *old = &plug_thread->cpumask;
+        struct cpumask *old = plug_thread->cpumask;
         cpumask_var_t tmp;
         unsigned int cpu;
--
2.1.2
On Fri, May 01, 2015 at 03:57:51PM -0400, Chris Metcalf wrote:
> On 05/01/2015 04:53 AM, Frederic Weisbecker wrote:
> >>+ /* Unpark any threads that were voluntarily parked. */
> >>+ for_each_cpu_not(cpu, &ht->cpumask) {
> >>+ if (cpu_online(cpu)) {
> >>+ struct task_struct *tsk = *per_cpu_ptr(ht->store, cpu);
> >>+ if (tsk)
> >>+ kthread_unpark(tsk);
> >I'm still not clear why we are doing that. kthread_stop() should be able
> >to handle parked kthreads, otherwise it needs to be fixed.
>
> Checking without the unpark, it's actually only a problem with nohz_full.
> In a system without nohz_full, the kthreads are able to stop even when
> they are parked; it's only in the nohz_full case that things wedge.
Ok. So this isn't a proper fix but a workaround for a bug that we don't
understand yet. In this case I much prefer that you remove this workaround
(I'm talking about this unpark loop) because it hides the issue. And hiding
the bug is the last thing we want if we plan to fix it properly.
>
> For example, booting with only cpu 0 as a housekeeping core (and
> therefore all watchdogs 1-35 on my 36-core tilegx are parked), and
> immediately doing "echo 0 > /proc/sys/kernel/watchdog", I see
> (via SysRq ^O-l) the first parked watchdog, on cpu 1, hung with:
>
> frame 0: 0xfffffff7000f2928 lock_hrtimer_base+0xb8/0xc0
> frame 1: 0xfffffff7000f2a28 hrtimer_try_to_cancel+0x40/0x170
> frame 2: 0xfffffff7000f2a28 hrtimer_try_to_cancel+0x40/0x170
> frame 3: 0xfffffff7000f2b98 hrtimer_cancel+0x40/0x68
> frame 4: 0xfffffff70014cce0 watchdog_disable+0x50/0x70
> frame 5: 0xfffffff70008c2d0 smpboot_thread_fn+0x350/0x438
> frame 6: 0xfffffff700084b28 kthread+0x160/0x178
Have you tried to do that before your patchset?
> The other cores are all idle.
>
> I have no idea why lock_hrtimer_base() is hanging; perhaps the
> hrtimer_cpu_base lock is taken by some other task that is now
> scheduled out.
No, it's a spinlock, tasks can't sleep while holding it. But it looks like
a deadlock.
>
> The config does not have NO_HZ_FULL_ALL or NO_HZ_FULL_SYSIDLE
> set, and does have RCU_FAST_NO_HZ and RCU_NOCB_CPU_ALL.
>
> I don't really know how to start debugging this, but I do know that
> unparking the threads first avoids the issue :-)
Do you have CONFIG_PROVE_LOCKING=y ?
I can't check that myself until the middle of next week.
>
> --
> Chris Metcalf, EZChip Semiconductor
> http://www.ezchip.com
>
On 5/1/2015 5:23 PM, Frederic Weisbecker wrote:
> On Fri, May 01, 2015 at 03:57:51PM -0400, Chris Metcalf wrote:
>
>> For example, booting with only cpu 0 as a housekeeping core (and
>> therefore all watchdogs 1-35 on my 36-core tilegx are parked), and
>> immediately doing "echo 0 > /proc/sys/kernel/watchdog", I see
>> (via SysRq ^O-l) the first parked watchdog, on cpu 1, hung with:
>>
>> frame 0: 0xfffffff7000f2928 lock_hrtimer_base+0xb8/0xc0
>> frame 1: 0xfffffff7000f2a28 hrtimer_try_to_cancel+0x40/0x170
>> frame 2: 0xfffffff7000f2a28 hrtimer_try_to_cancel+0x40/0x170
>> frame 3: 0xfffffff7000f2b98 hrtimer_cancel+0x40/0x68
>> frame 4: 0xfffffff70014cce0 watchdog_disable+0x50/0x70
>> frame 5: 0xfffffff70008c2d0 smpboot_thread_fn+0x350/0x438
>> frame 6: 0xfffffff700084b28 kthread+0x160/0x178
> Have you tried to do that before your patchset?
Yes, it works fine. It requires the presence of the parked threads to trigger the issue.
>> The config does not have NO_HZ_FULL_ALL or NO_HZ_FULL_SYSIDLE
>> set, and does have RCU_FAST_NO_HZ and RCU_NOCB_CPU_ALL.
>>
>> I don't really know how to start debugging this, but I do know that
>> unparking the threads first avoids the issue :-)
> Do you have CONFIG_PROVE_LOCKING=y ?
There seems to be some skew between the community version, which is throwing a
bunch of errors when I enable PROVE_LOCKING, and our internal version where some
things are not yet upstreamed but PROVE_LOCKING works :-)
I'll try to set aside some time to reconcile the two to figure it out.
--
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com
On Mon, May 04, 2015 at 06:06:24PM -0400, Chris Metcalf wrote:
> On 5/1/2015 5:23 PM, Frederic Weisbecker wrote:
> >On Fri, May 01, 2015 at 03:57:51PM -0400, Chris Metcalf wrote:
> >
> >>For example, booting with only cpu 0 as a housekeeping core (and
> >>therefore all watchdogs 1-35 on my 36-core tilegx are parked), and
> >>immediately doing "echo 0 > /proc/sys/kernel/watchdog", I see
> >>(via SysRq ^O-l) the first parked watchdog, on cpu 1, hung with:
> >>
> >> frame 0: 0xfffffff7000f2928 lock_hrtimer_base+0xb8/0xc0
> >> frame 1: 0xfffffff7000f2a28 hrtimer_try_to_cancel+0x40/0x170
> >> frame 2: 0xfffffff7000f2a28 hrtimer_try_to_cancel+0x40/0x170
> >> frame 3: 0xfffffff7000f2b98 hrtimer_cancel+0x40/0x68
> >> frame 4: 0xfffffff70014cce0 watchdog_disable+0x50/0x70
> >> frame 5: 0xfffffff70008c2d0 smpboot_thread_fn+0x350/0x438
> >> frame 6: 0xfffffff700084b28 kthread+0x160/0x178
> >Have you tried to do that before your patchset?
>
> Yes, it works fine. It requires the presence of the parked threads to trigger the issue.
>
> >>The config does not have NO_HZ_FULL_ALL or NO_HZ_FULL_SYSIDLE
> >>set, and does have RCU_FAST_NO_HZ and RCU_NOCB_CPU_ALL.
> >>
> >>I don't really know how to start debugging this, but I do know that
> >>unparking the threads first avoids the issue :-)
> >Do you have CONFIG_PROVE_LOCKING=y ?
>
> There seems to be some skew between the community version, which is throwing a
> bunch of errors when I enable PROVE_LOCKING, and our internal version where some
> things are not yet upstreamed but PROVE_LOCKING works :-)
>
> I'll try to set aside some time to reconcile the two to figure it out.
Hi Chris,
I was digging this thread back up and wondered what happened. It seems like
a v11 was going to materialize?
Cheers,
Don
On 05/01/2015 05:23 PM, Frederic Weisbecker wrote:
> On Fri, May 01, 2015 at 03:57:51PM -0400, Chris Metcalf wrote:
>> On 05/01/2015 04:53 AM, Frederic Weisbecker wrote:
>>>> + /* Unpark any threads that were voluntarily parked. */
>>>> + for_each_cpu_not(cpu, &ht->cpumask) {
>>>> + if (cpu_online(cpu)) {
>>>> + struct task_struct *tsk = *per_cpu_ptr(ht->store, cpu);
>>>> + if (tsk)
>>>> + kthread_unpark(tsk);
>>> I'm still not clear why we are doing that. kthread_stop() should be able
>>> to handle parked kthreads, otherwise it needs to be fixed.
>> Checking without the unpark, it's actually only a problem with nohz_full.
>> In a system without nohz_full, the kthreads are able to stop even when
>> they are parked; it's only in the nohz_full case that things wedge.
>> For example, booting with only cpu 0 as a housekeeping core (and
>> therefore all watchdogs 1-35 on my 36-core tilegx are parked), and
>> immediately doing "echo 0 > /proc/sys/kernel/watchdog", I see
>> (via SysRq ^O-l) the first parked watchdog, on cpu 1, hung with:
>>
>> frame 0: 0xfffffff7000f2928 lock_hrtimer_base+0xb8/0xc0
>> frame 1: 0xfffffff7000f2a28 hrtimer_try_to_cancel+0x40/0x170
>> frame 2: 0xfffffff7000f2a28 hrtimer_try_to_cancel+0x40/0x170
>> frame 3: 0xfffffff7000f2b98 hrtimer_cancel+0x40/0x68
>> frame 4: 0xfffffff70014cce0 watchdog_disable+0x50/0x70
>> frame 5: 0xfffffff70008c2d0 smpboot_thread_fn+0x350/0x438
>> frame 6: 0xfffffff700084b28 kthread+0x160/0x178
I finally had some time to look into this issue some more.
With PROVE_LOCKING enabled (after a fix I'll send to LKML shortly), we
get no warnings, and ^O-d to print locks shows:
Showing all locks held in the system:
3 locks held by watchdog/1/15:
#0: (&(&hp->lock)->rlock){-.....}, at: [<fffffff700620740>] hvc_poll+0xb8/0x4b8
#1: (rcu_read_lock){......}, at: [<fffffff70061d710>] __handle_sysrq+0x0/0x440
#2: (tasklist_lock){.+.+..}, at: [<fffffff7000d7310>] debug_show_all_locks+0xc0/0x350
3 locks held by sh/1732:
#0: (sb_writers#4){.+.+.+}, at: [<fffffff70022f6b8>] vfs_write+0x268/0x2c0
#1: (watchdog_proc_mutex){+.+.+.}, at: [<fffffff70016f368>] proc_watchdog_common+0x78/0x1c8
#2: (smpboot_threads_lock){+.+.+.}, at: [<fffffff700093558>] smpboot_unregister_percpu_thread+0x48/0x88
All the watchdog/1/15 locks are attributable to the fact that it's running
on the same core that ended up handling the "^O-d" request from SysRq.
The sh process from which I ran the echo eventually shows up as "blocked for
more than 120 seconds" and pretty much where you'd expect it to be, waiting
on a completion in kthread_stop() at kthread.c:473.
I instrumented lock_hrtimer_base(), and timer->base is NULL and never gets
set non-NULL, so the loop spins forever. Perhaps something in nohz is preventing
the timer->base from being set?
I'm happy to keep debugging this but I'm not really clear on what could
be going wrong here. Any ideas?
--
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com
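The loop in question is the retry loop in lock_hrtimer_base(), roughly
as follows (paraphrased from the kernel/hrtimer.c of this era):

    static struct hrtimer_clock_base *
    lock_hrtimer_base(const struct hrtimer *timer, unsigned long *flags)
    {
            struct hrtimer_clock_base *base;

            for (;;) {
                    base = timer->base;
                    if (likely(base != NULL)) {
                            raw_spin_lock_irqsave(&base->cpu_base->lock,
                                                  *flags);
                            if (likely(base == timer->base))
                                    return base;
                            /* The timer has migrated to another CPU: */
                            raw_spin_unlock_irqrestore(
                                    &base->cpu_base->lock, *flags);
                    }
                    cpu_relax();
            }
    }

A NULL timer->base is normally only a transient marker while a timer
migrates between bases, so the loop assumes it will become non-NULL
again. One hypothesis consistent with the data above (not verified in
this thread): a zero-initialized per-cpu hrtimer that has never been
through hrtimer_init() also has a NULL base, and a watchdog thread that
was parked from creation never ran its .setup hook, watchdog_enable(),
which is what performs the hrtimer_init() for that cpu. hrtimer_cancel()
on such a cpu would then spin forever, and unparking the threads first,
which lets the setup hook run, would avoid the hang, matching the
behavior reported earlier in the thread.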