Hi,
A few fixes for nohz against -rc1. I'll likely send the non-rfc
part to Ingo tomorrow, unless anybody has concerns.
Thanks.
---
Frederic Weisbecker (6):
vtime: Use consistent clocks among nohz accounting
watchdog: Boot-disable by default on full dynticks
kvm: Move guest entry/exit APIs to context_tracking
kthread: Enable parking requests from setup() and unpark() callbacks
watchdog: Rename confusing state variable
watchdog: Fix internal state with boot user disabled watchdog
Li Zhong (1):
nohz: Fix notifier return val that enforces timekeeping
Steven Rostedt (1):
nohz: Warn if the machine can not perform nohz_full
include/linux/context_tracking.h | 35 +++++++++++++++++++++++++++++++++++
include/linux/kvm_host.h | 37 +------------------------------------
include/linux/nmi.h | 2 +-
include/linux/vtime.h | 4 ++--
kernel/context_tracking.c | 1 -
kernel/sched/core.c | 2 +-
kernel/sched/cputime.c | 6 +++---
kernel/smpboot.c | 6 ++++++
kernel/sysctl.c | 4 ++--
kernel/time/tick-sched.c | 7 ++++++-
kernel/watchdog.c | 34 ++++++++++++++++++++--------------
11 files changed, 77 insertions(+), 61 deletions(-)
--
1.7.5.4
From: Li Zhong <[email protected]>
In tick_nohz_cpu_down_callback(), if the CPU is the one handling
timekeeping, we must return a value that stops the CPU_DOWN_PREPARE
notifier chain and then triggers CPU_DOWN_FAILED on the notifier
callbacks that have already run.
However, traditional errno values are not understood by the notifier
chain unless they are encapsulated using notifier_from_errno().
Hence the current raw -EINVAL is misinterpreted and turned into junk
by notifier_to_errno(), leading the notifier subsystem into undefined
behaviour, such as eventually allowing the CPU to go down anyway.
Fix this by using the standard NOTIFY_BAD instead.
Signed-off-by: Li Zhong <[email protected]>
Reviewed-by: Srivatsa S. Bhat <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Borislav Petkov <[email protected]>
Signed-off-by: Frederic Weisbecker <[email protected]>
---
kernel/time/tick-sched.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index cfc798b..58d8d4d 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -311,7 +311,7 @@ static int __cpuinit tick_nohz_cpu_down_callback(struct notifier_block *nfb,
* we can't safely shutdown that CPU.
*/
if (have_nohz_full_mask && tick_do_timer_cpu == cpu)
- return -EINVAL;
+ return NOTIFY_BAD;
break;
}
return NOTIFY_OK;
--
1.7.5.4
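For reference, the notifier return-value convention the fix relies on
looks roughly like this. This is a hedged sketch with hypothetical names
(example_cpu_notify, example_check), not code from the patch:

#include <linux/cpu.h>
#include <linux/errno.h>
#include <linux/notifier.h>

/* Hypothetical stand-in for "may this CPU go down?" */
static int example_check(void)
{
        return -EINVAL;
}

static int example_cpu_notify(struct notifier_block *nfb,
                              unsigned long action, void *hcpu)
{
        int err;

        switch (action & ~CPU_TASKS_FROZEN) {
        case CPU_DOWN_PREPARE:
                err = example_check();
                if (err)
                        /* Wrap the errno so that notifier_to_errno()
                         * can unwrap it later; a raw errno here would
                         * be mangled, as described above. */
                        return notifier_from_errno(err);
                break;
        }
        return NOTIFY_OK;       /* NOTIFY_BAD works for a plain veto */
}

The hotplug core later unwraps the return value with notifier_to_errno(),
which is why returning a bare -EINVAL gets misinterpreted.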
When the watchdog is disabled through the 'nmi_watchdog=0'
boot parameter, the watchdog_running internal state isn't
cleared. Hence further attempts to enable it through
sysctl/procfs are ignored because the subsystem spuriously
believes it's already running.
Initialize it properly on boot.
Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Don Zickus <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
---
kernel/watchdog.c | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 10776fe..3fa5df20 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -31,7 +31,7 @@
int watchdog_user_enabled = 1;
int __read_mostly watchdog_thresh = 10;
-static int __read_mostly watchdog_running = 1;
+static int __read_mostly watchdog_running;
static u64 __read_mostly sample_period;
static DEFINE_PER_CPU(unsigned long, watchdog_touch_ts);
@@ -545,10 +545,10 @@ void __init lockup_detector_init(void)
{
#ifdef CONFIG_NO_HZ_FULL
watchdog_user_enabled = 0;
- watchdog_running = 0;
pr_warning("Disabled lockup detectors by default because of full dynticks\n");
pr_warning("You can overwrite that with 'sysctl -w kernel.watchdog=1'\n");
#endif
+ watchdog_running = watchdog_user_enabled;
set_sample_period();
if (smpboot_register_percpu_thread(&watchdog_threads)) {
pr_err("Failed to create watchdog threads, disabled\n");
--
1.7.5.4
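Spelled out, the resulting boot flow with the fix applied (a summary
sketch in comment form, not code from the patch):

/*
 * nmi_watchdog=0 boot param      -> watchdog_user_enabled = 0
 * lockup_detector_init()         -> watchdog_running = watchdog_user_enabled (= 0)
 * 'sysctl -w kernel.watchdog=1'  -> proc_dowatchdog() now sees
 *                                   watchdog_running == 0 and really
 *                                   unparks the per-cpu threads
 */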
We have two very conflicting state variable names in the
watchdog:
* watchdog_enabled: This one reflects the user interface. It's
set to 1 by default and can be overridden through boot options
or the sysctl/procfs interface.
* watchdog_disabled: This is the internal toggle state that
tells whether watchdog threads, timers and NMI events are
currently running or not. This state depends on the user
settings and on hardware support. It's a convenient state latch.
We really need clearer names, because the current ones are just
too confusing to encourage deep review.
watchdog_enabled now becomes watchdog_user_enabled to reflect
its purpose as an interface.
watchdog_disabled becomes watchdog_running to suggest its
role as a pure internal state.
Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Don Zickus <[email protected]>
Cc: Peter Zijlstra <[email protected]>
---
include/linux/nmi.h | 2 +-
kernel/sysctl.c | 4 ++--
kernel/watchdog.c | 32 ++++++++++++++++----------------
3 files changed, 19 insertions(+), 19 deletions(-)
diff --git a/include/linux/nmi.h b/include/linux/nmi.h
index db50840..6a45fb5 100644
--- a/include/linux/nmi.h
+++ b/include/linux/nmi.h
@@ -46,7 +46,7 @@ static inline bool trigger_all_cpu_backtrace(void)
#ifdef CONFIG_LOCKUP_DETECTOR
int hw_nmi_is_cpu_stuck(struct pt_regs *);
u64 hw_nmi_get_sample_period(int watchdog_thresh);
-extern int watchdog_enabled;
+extern int watchdog_user_enabled;
extern int watchdog_thresh;
struct ctl_table;
extern int proc_dowatchdog(struct ctl_table *, int ,
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 9edcf45..b080565 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -801,7 +801,7 @@ static struct ctl_table kern_table[] = {
#if defined(CONFIG_LOCKUP_DETECTOR)
{
.procname = "watchdog",
- .data = &watchdog_enabled,
+ .data = &watchdog_user_enabled,
.maxlen = sizeof (int),
.mode = 0644,
.proc_handler = proc_dowatchdog,
@@ -828,7 +828,7 @@ static struct ctl_table kern_table[] = {
},
{
.procname = "nmi_watchdog",
- .data = &watchdog_enabled,
+ .data = &watchdog_user_enabled,
.maxlen = sizeof (int),
.mode = 0644,
.proc_handler = proc_dowatchdog,
diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 7e1a021..10776fe 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -29,9 +29,9 @@
#include <linux/kvm_para.h>
#include <linux/perf_event.h>
-int watchdog_enabled = 1;
+int watchdog_user_enabled = 1;
int __read_mostly watchdog_thresh = 10;
-static int __read_mostly watchdog_disabled;
+static int __read_mostly watchdog_running = 1;
static u64 __read_mostly sample_period;
static DEFINE_PER_CPU(unsigned long, watchdog_touch_ts);
@@ -63,7 +63,7 @@ static int __init hardlockup_panic_setup(char *str)
else if (!strncmp(str, "nopanic", 7))
hardlockup_panic = 0;
else if (!strncmp(str, "0", 1))
- watchdog_enabled = 0;
+ watchdog_user_enabled = 0;
return 1;
}
__setup("nmi_watchdog=", hardlockup_panic_setup);
@@ -82,7 +82,7 @@ __setup("softlockup_panic=", softlockup_panic_setup);
static int __init nowatchdog_setup(char *str)
{
- watchdog_enabled = 0;
+ watchdog_user_enabled = 0;
return 1;
}
__setup("nowatchdog", nowatchdog_setup);
@@ -90,7 +90,7 @@ __setup("nowatchdog", nowatchdog_setup);
/* deprecated */
static int __init nosoftlockup_setup(char *str)
{
- watchdog_enabled = 0;
+ watchdog_user_enabled = 0;
return 1;
}
__setup("nosoftlockup", nosoftlockup_setup);
@@ -158,7 +158,7 @@ void touch_all_softlockup_watchdogs(void)
#ifdef CONFIG_HARDLOCKUP_DETECTOR
void touch_nmi_watchdog(void)
{
- if (watchdog_enabled) {
+ if (watchdog_user_enabled) {
unsigned cpu;
for_each_present_cpu(cpu) {
@@ -347,7 +347,7 @@ static void watchdog_enable(unsigned int cpu)
hrtimer_init(hrtimer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
hrtimer->function = watchdog_timer_fn;
- if (!watchdog_enabled) {
+ if (!watchdog_user_enabled) {
kthread_park(current);
return;
}
@@ -482,8 +482,8 @@ static void watchdog_enable_all_cpus(void)
{
unsigned int cpu;
- if (watchdog_disabled) {
- watchdog_disabled = 0;
+ if (!watchdog_running) {
+ watchdog_running = 1;
for_each_online_cpu(cpu)
kthread_unpark(per_cpu(softlockup_watchdog, cpu));
}
@@ -493,8 +493,8 @@ static void watchdog_disable_all_cpus(void)
{
unsigned int cpu;
- if (!watchdog_disabled) {
- watchdog_disabled = 1;
+ if (watchdog_running) {
+ watchdog_running = 0;
for_each_online_cpu(cpu)
kthread_park(per_cpu(softlockup_watchdog, cpu));
}
@@ -509,7 +509,7 @@ int proc_dowatchdog(struct ctl_table *table, int write,
{
int ret;
- if (watchdog_disabled < 0)
+ if (watchdog_running < 0)
return -ENODEV;
ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
@@ -522,7 +522,7 @@ int proc_dowatchdog(struct ctl_table *table, int write,
* disabled. The 'watchdog_disabled' variable check in
* watchdog_*_all_cpus() function takes care of this.
*/
- if (watchdog_enabled && watchdog_thresh)
+ if (watchdog_user_enabled && watchdog_thresh)
watchdog_enable_all_cpus();
else
watchdog_disable_all_cpus();
@@ -544,14 +544,14 @@ static struct smp_hotplug_thread watchdog_threads = {
void __init lockup_detector_init(void)
{
#ifdef CONFIG_NO_HZ_FULL
- watchdog_enabled = 0;
- watchdog_disabled = 1;
+ watchdog_user_enabled = 0;
+ watchdog_running = 0;
pr_warning("Disabled lockup detectors by default because of full dynticks\n");
pr_warning("You can overwrite that with 'sysctl -w kernel.watchdog=1'\n");
#endif
set_sample_period();
if (smpboot_register_percpu_thread(&watchdog_threads)) {
pr_err("Failed to create watchdog threads, disabled\n");
- watchdog_disabled = -ENODEV;
+ watchdog_running = -ENODEV;
}
}
--
1.7.5.4
When the watchdog code is boot-disabled by the user, for example
through the 'nmi_watchdog=0' boot option, the setup() callback of
the watchdog kthread requests the task to be parked, and that lasts
until the user later re-enables the watchdog through sysctl or procfs.
However the parking request is not well handled when done from
the setup() callback. After ->setup() is called, the generic smpboot
kthread loop directly tries to call the thread function, or waits
for some event if ->thread_should_run() is false.
In the case of the watchdog kthread, ->thread_should_run() returns
false and the kthread goes to sleep, waiting for the watchdog timer
to wake it up. But the timer is not enabled since the user requested
to disable the watchdog. We want the kthread to park instead of
waiting for events that can't happen.
As a result, later unpark requests after a sysctl write through
'sysctl -w kernel.watchdog=1' won't wake up/unpark the task as
expected, since it's not parked anyway, leaving the value modified
without triggering any action.
We could work around this in the watchdog code, for example by
forcing one pass through the thread function that immediately
returns to park.
But supporting parking requests from the ->setup() or ->unpark()
callbacks looks like the proper way to implement cancellation in
general. So let's fix it that way.
Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Srivatsa S. Bhat <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Li Zhong <[email protected]>
Cc: Don Zickus <[email protected]>
---
kernel/smpboot.c | 6 ++++++
1 files changed, 6 insertions(+), 0 deletions(-)
diff --git a/kernel/smpboot.c b/kernel/smpboot.c
index 02fc5c9..3394ed0 100644
--- a/kernel/smpboot.c
+++ b/kernel/smpboot.c
@@ -151,6 +151,12 @@ static int smpboot_thread_fn(void *data)
break;
}
+ /* Check if setup or unpark actually want us to park */
+ if (kthread_should_stop() || kthread_should_park()) {
+ preempt_enable();
+ continue;
+ }
+
if (!ht->thread_should_run(td->cpu)) {
preempt_enable();
schedule();
--
1.7.5.4
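To make the scenario concrete, here is a hedged sketch of a smpboot
client in the watchdog's style; all names and the 'my_enabled' toggle
are hypothetical, not code from the patch:

#include <linux/kthread.h>
#include <linux/percpu.h>
#include <linux/smpboot.h>

static DEFINE_PER_CPU(struct task_struct *, my_task);
static int my_enabled;  /* hypothetical toggle, like watchdog_user_enabled */

static int my_should_run(unsigned int cpu)
{
        return 0;       /* nothing to do until some timer fires */
}

static void my_thread_fn(unsigned int cpu)
{
}

static void my_setup(unsigned int cpu)
{
        /* Boot-disabled: request parking before the first run. Without
         * the fix above, the smpboot loop misses this request and the
         * kthread sleeps in TASK_INTERRUPTIBLE instead of TASK_PARKED,
         * so a later kthread_unpark() has nothing to wake. */
        if (!my_enabled)
                kthread_park(current);
}

static struct smp_hotplug_thread my_threads = {
        .store              = &my_task,
        .thread_should_run  = my_should_run,
        .thread_fn          = my_thread_fn,
        .setup              = my_setup,
        .thread_comm        = "my/%u",
};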
While computing the cputime delta of dynticks CPUs,
we are mixing up clocks of different natures:
* local_clock(), which takes care of unstable clock
sources and fixes them up if needed.
* sched_clock(), which is the weaker version of
local_clock(). It doesn't compute any fixup in case
of an unstable source.
If the clock source is stable, those two clocks are the
same and we can safely compute the difference between
two arbitrary points.
Otherwise it results in random deltas, as sched_clock()
can randomly drift away, backward or forward, from local_clock().
As a consequence, some strange behaviour has been observed
with an unstable TSC, such as a constant, non-progressing
zero cputime ('top' showing no load).
Fix this by only using local_clock(), or its irq safe/remote
equivalent, in vtime code.
Reported-by: Mike Galbraith <[email protected]>
Suggested-by: Mike Galbraith <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Li Zhong <[email protected]>
Cc: Mike Galbraith <[email protected]>
Signed-off-by: Frederic Weisbecker <[email protected]>
---
include/linux/vtime.h | 4 ++--
kernel/sched/core.c | 2 +-
kernel/sched/cputime.c | 6 +++---
3 files changed, 6 insertions(+), 6 deletions(-)
diff --git a/include/linux/vtime.h b/include/linux/vtime.h
index 71a5782..b1dd2db 100644
--- a/include/linux/vtime.h
+++ b/include/linux/vtime.h
@@ -34,7 +34,7 @@ static inline void vtime_user_exit(struct task_struct *tsk)
}
extern void vtime_guest_enter(struct task_struct *tsk);
extern void vtime_guest_exit(struct task_struct *tsk);
-extern void vtime_init_idle(struct task_struct *tsk);
+extern void vtime_init_idle(struct task_struct *tsk, int cpu);
#else
static inline void vtime_account_irq_exit(struct task_struct *tsk)
{
@@ -45,7 +45,7 @@ static inline void vtime_user_enter(struct task_struct *tsk) { }
static inline void vtime_user_exit(struct task_struct *tsk) { }
static inline void vtime_guest_enter(struct task_struct *tsk) { }
static inline void vtime_guest_exit(struct task_struct *tsk) { }
-static inline void vtime_init_idle(struct task_struct *tsk) { }
+static inline void vtime_init_idle(struct task_struct *tsk, int cpu) { }
#endif
#ifdef CONFIG_IRQ_TIME_ACCOUNTING
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 58453b8..e1a27f9 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4745,7 +4745,7 @@ void __cpuinit init_idle(struct task_struct *idle, int cpu)
*/
idle->sched_class = &idle_sched_class;
ftrace_graph_init_idle_task(idle, cpu);
- vtime_init_idle(idle);
+ vtime_init_idle(idle, cpu);
#if defined(CONFIG_SMP)
sprintf(idle->comm, "%s/%d", INIT_TASK_COMM, cpu);
#endif
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index cc2dc3ee..b5ccba2 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -747,17 +747,17 @@ void arch_vtime_task_switch(struct task_struct *prev)
write_seqlock(&current->vtime_seqlock);
current->vtime_snap_whence = VTIME_SYS;
- current->vtime_snap = sched_clock();
+ current->vtime_snap = sched_clock_cpu(smp_processor_id());
write_sequnlock(&current->vtime_seqlock);
}
-void vtime_init_idle(struct task_struct *t)
+void vtime_init_idle(struct task_struct *t, int cpu)
{
unsigned long flags;
write_seqlock_irqsave(&t->vtime_seqlock, flags);
t->vtime_snap_whence = VTIME_SYS;
- t->vtime_snap = sched_clock();
+ t->vtime_snap = sched_clock_cpu(cpu);
write_sequnlock_irqrestore(&t->vtime_seqlock, flags);
}
--
1.7.5.4
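To illustrate the problem the patch fixes, a minimal sketch; both
helpers are hypothetical, for illustration only:

#include <linux/sched.h>

static u64 vtime_delta_bad(u64 snap)
{
        /* 'snap' was taken with local_clock(). On an unstable TSC,
         * sched_clock() drifts freely from local_clock(), so this
         * difference is meaningless and can even be stuck at zero. */
        return sched_clock() - snap;
}

static u64 vtime_delta_good(int cpu, u64 snap)
{
        /* Both points come from the same fixed-up clock, so the
         * delta is sane and monotonic, stable TSC or not. */
        return sched_clock_cpu(cpu) - snap;
}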
When the watchdog runs, it prevents the full dynticks
CPUs from stopping their tick, because the hard lockup
detector uses perf events internally, which in turn
rely on the periodic tick.
Since this is a rather confusing behaviour that is not
easy to track down and identify for those who want to
test CONFIG_NO_HZ_FULL, let's disable the watchdog by
default at boot time when full dynticks is enabled.
The user can still enable it later at runtime using
procfs or sysctl.
Reported-by: Steven Rostedt <[email protected]>
Suggested-by: Peter Zijlstra <[email protected]>
Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Li Zhong <[email protected]>
Cc: Don Zickus <[email protected]>
---
kernel/watchdog.c | 6 ++++++
1 files changed, 6 insertions(+), 0 deletions(-)
diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 05039e3..7e1a021 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -543,6 +543,12 @@ static struct smp_hotplug_thread watchdog_threads = {
void __init lockup_detector_init(void)
{
+#ifdef CONFIG_NO_HZ_FULL
+ watchdog_enabled = 0;
+ watchdog_disabled = 1;
+ pr_warning("Disabled lockup detectors by default because of full dynticks\n");
+ pr_warning("You can overwrite that with 'sysctl -w kernel.watchdog=1'\n");
+#endif
set_sample_period();
if (smpboot_register_percpu_thread(&watchdog_threads)) {
pr_err("Failed to create watchdog threads, disabled\n");
--
1.7.5.4
The kvm_host.h header file doesn't handle inclusion
well when the arch doesn't support KVM.
This results in build failures on such archs when they
want to implement context tracking, because that subsystem
includes kvm_host.h in order to implement the
guest_enter/exit APIs, while the header doesn't handle the
KVM-off case.
To fix this, move the guest_enter()/guest_exit()
declarations and generic implementation to the context
tracking headers. These generic APIs actually belong to
that subsystem, alongside the other domain-boundary tracking
APIs such as user_enter() et al.
KVM now properly becomes a user of this library, not the
buggy other way around.
Reported-by: Kevin Hilman <[email protected]>
Reviewed-by: Kevin Hilman <[email protected]>
Tested-by: Kevin Hilman <[email protected]>
Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Kevin Hilman <[email protected]>
Cc: Marcelo Tosatti <[email protected]>
Cc: Gleb Natapov <[email protected]>
---
include/linux/context_tracking.h | 35 +++++++++++++++++++++++++++++++++++
include/linux/kvm_host.h | 37 +------------------------------------
kernel/context_tracking.c | 1 -
3 files changed, 36 insertions(+), 37 deletions(-)
diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
index 365f4a6..fc09d7b 100644
--- a/include/linux/context_tracking.h
+++ b/include/linux/context_tracking.h
@@ -3,6 +3,7 @@
#include <linux/sched.h>
#include <linux/percpu.h>
+#include <linux/vtime.h>
#include <asm/ptrace.h>
struct context_tracking {
@@ -19,6 +20,26 @@ struct context_tracking {
} state;
};
+static inline void __guest_enter(void)
+{
+ /*
+ * This is running in ioctl context so we can avoid
+ * the call to vtime_account() with its unnecessary idle check.
+ */
+ vtime_account_system(current);
+ current->flags |= PF_VCPU;
+}
+
+static inline void __guest_exit(void)
+{
+ /*
+ * This is running in ioctl context so we can avoid
+ * the call to vtime_account() with its unnecessary idle check.
+ */
+ vtime_account_system(current);
+ current->flags &= ~PF_VCPU;
+}
+
#ifdef CONFIG_CONTEXT_TRACKING
DECLARE_PER_CPU(struct context_tracking, context_tracking);
@@ -35,6 +56,9 @@ static inline bool context_tracking_active(void)
extern void user_enter(void);
extern void user_exit(void);
+extern void guest_enter(void);
+extern void guest_exit(void);
+
static inline enum ctx_state exception_enter(void)
{
enum ctx_state prev_ctx;
@@ -57,6 +81,17 @@ extern void context_tracking_task_switch(struct task_struct *prev,
static inline bool context_tracking_in_user(void) { return false; }
static inline void user_enter(void) { }
static inline void user_exit(void) { }
+
+static inline void guest_enter(void)
+{
+ __guest_enter();
+}
+
+static inline void guest_exit(void)
+{
+ __guest_exit();
+}
+
static inline enum ctx_state exception_enter(void) { return 0; }
static inline void exception_exit(enum ctx_state prev_ctx) { }
static inline void context_tracking_task_switch(struct task_struct *prev,
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index f0eea07..8db53cf 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -23,6 +23,7 @@
#include <linux/ratelimit.h>
#include <linux/err.h>
#include <linux/irqflags.h>
+#include <linux/context_tracking.h>
#include <asm/signal.h>
#include <linux/kvm.h>
@@ -760,42 +761,6 @@ static inline int kvm_iommu_unmap_guest(struct kvm *kvm)
}
#endif
-static inline void __guest_enter(void)
-{
- /*
- * This is running in ioctl context so we can avoid
- * the call to vtime_account() with its unnecessary idle check.
- */
- vtime_account_system(current);
- current->flags |= PF_VCPU;
-}
-
-static inline void __guest_exit(void)
-{
- /*
- * This is running in ioctl context so we can avoid
- * the call to vtime_account() with its unnecessary idle check.
- */
- vtime_account_system(current);
- current->flags &= ~PF_VCPU;
-}
-
-#ifdef CONFIG_CONTEXT_TRACKING
-extern void guest_enter(void);
-extern void guest_exit(void);
-
-#else /* !CONFIG_CONTEXT_TRACKING */
-static inline void guest_enter(void)
-{
- __guest_enter();
-}
-
-static inline void guest_exit(void)
-{
- __guest_exit();
-}
-#endif /* !CONFIG_CONTEXT_TRACKING */
-
static inline void kvm_guest_enter(void)
{
unsigned long flags;
diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
index 65349f0..85bdde1 100644
--- a/kernel/context_tracking.c
+++ b/kernel/context_tracking.c
@@ -15,7 +15,6 @@
*/
#include <linux/context_tracking.h>
-#include <linux/kvm_host.h>
#include <linux/rcupdate.h>
#include <linux/sched.h>
#include <linux/hardirq.h>
--
1.7.5.4
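For context, here is roughly how the moved APIs are consumed from KVM's
side; a simplified sketch, not taken from any particular arch:

#include <linux/kvm_host.h>

static int example_vcpu_run(struct kvm_vcpu *vcpu)
{
        /* kvm_guest_enter() calls guest_enter(), which now lives in
         * context_tracking.h: it either notifies the context tracking
         * subsystem or falls back to the inline __guest_enter()
         * accounting when CONFIG_CONTEXT_TRACKING=n. */
        kvm_guest_enter();

        /* ... switch to guest mode, run until the next exit ... */

        kvm_guest_exit();
        return 0;
}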
From: Steven Rostedt <[email protected]>
If the user configures NO_HZ_FULL and defines nohz_full=XXX on the
kernel command line, or enables NO_HZ_FULL_ALL, but nohz fails
due to the machine having an unstable clock, warn about it.
We do not want users thinking that they are getting the benefit
of nohz when their machine cannot support it.
Signed-off-by: Steven Rostedt <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Signed-off-by: Frederic Weisbecker <[email protected]>
---
kernel/time/tick-sched.c | 5 +++++
1 files changed, 5 insertions(+), 0 deletions(-)
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index bc67d42..cfc798b 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -178,6 +178,11 @@ static bool can_stop_full_tick(void)
*/
if (!sched_clock_stable) {
trace_tick_stop(0, "unstable sched clock\n");
+ /*
+ * Don't allow the user to think they can get
+ * full NO_HZ with this machine.
+ */
+ WARN_ONCE(1, "NO_HZ FULL will not work with unstable sched clock");
return false;
}
#endif
--
1.7.5.4
On Mon, May 20, 2013 at 06:01:51PM +0200, Frederic Weisbecker wrote:
> When the watchdog runs, it prevents the full dynticks
> CPUs from stopping their tick because the hard lockup
> detector uses perf events internally, which in turn
> rely on the periodic tick.
>
> Since this is a rather confusing behaviour that is not
> easy to track down and identify for those who want to
> test CONFIG_NO_HZ_FULL, let's default disable the
> watchdog on boot time when full dynticks is enabled.
>
> The user can still enable it later on runtime using
> proc or sysctl.
I thought Peter committed a patch to perf so that this isn't needed any
more?
Cheers,
Don
>
> Reported-by: Steven Rostedt <[email protected]>
> Suggested-by: Peter Zijlstra <[email protected]>
> Signed-off-by: Frederic Weisbecker <[email protected]>
> Cc: Steven Rostedt <[email protected]>
> Cc: Paul E. McKenney <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Li Zhong <[email protected]>
> Cc: Don Zickus <[email protected]>
> ---
> kernel/watchdog.c | 6 ++++++
> 1 files changed, 6 insertions(+), 0 deletions(-)
>
> diff --git a/kernel/watchdog.c b/kernel/watchdog.c
> index 05039e3..7e1a021 100644
> --- a/kernel/watchdog.c
> +++ b/kernel/watchdog.c
> @@ -543,6 +543,12 @@ static struct smp_hotplug_thread watchdog_threads = {
>
> void __init lockup_detector_init(void)
> {
> +#ifdef CONFIG_NO_HZ_FULL
> + watchdog_enabled = 0;
> + watchdog_disabled = 1;
> + pr_warning("Disabled lockup detectors by default because of full dynticks\n");
> + pr_warning("You can overwrite that with 'sysctl -w kernel.watchdog=1'\n");
> +#endif
> set_sample_period();
> if (smpboot_register_percpu_thread(&watchdog_threads)) {
> pr_err("Failed to create watchdog threads, disabled\n");
> --
> 1.7.5.4
>
On Mon, May 20, 2013 at 06:01:55PM +0200, Frederic Weisbecker wrote:
> We have two very conflicting state variable names in the
> watchdog:
>
> * watchdog_enabled: This one reflects the user interface. It's
> set to 1 by default and can be overriden with boot options
> or sysctl/procfs interface.
>
> * watchdog_disabled: This is the internal toggle state that
> tells if watchdog threads, timers and NMI events are currently
> running or not. This state depends on the user settings and
> hardware support. It's a convenient state latch.
>
> Now we really need to find clearer names because those
> are just too confusing to encourage deep review.
>
> watchdog_enabled now becomes watchdog_user_enabled to reflect
> its purpose as an interface.
>
> watchdog_disabled becomes watchdog_running to suggest its
> role as a pure internal state.
Heh. Thanks for the cleanup.
Acked-by: Don Zickus <[email protected]>
>
> Signed-off-by: Frederic Weisbecker <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Don Zickus <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> ---
> include/linux/nmi.h | 2 +-
> kernel/sysctl.c | 4 ++--
> kernel/watchdog.c | 32 ++++++++++++++++----------------
> 3 files changed, 19 insertions(+), 19 deletions(-)
>
> diff --git a/include/linux/nmi.h b/include/linux/nmi.h
> index db50840..6a45fb5 100644
> --- a/include/linux/nmi.h
> +++ b/include/linux/nmi.h
> @@ -46,7 +46,7 @@ static inline bool trigger_all_cpu_backtrace(void)
> #ifdef CONFIG_LOCKUP_DETECTOR
> int hw_nmi_is_cpu_stuck(struct pt_regs *);
> u64 hw_nmi_get_sample_period(int watchdog_thresh);
> -extern int watchdog_enabled;
> +extern int watchdog_user_enabled;
> extern int watchdog_thresh;
> struct ctl_table;
> extern int proc_dowatchdog(struct ctl_table *, int ,
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index 9edcf45..b080565 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -801,7 +801,7 @@ static struct ctl_table kern_table[] = {
> #if defined(CONFIG_LOCKUP_DETECTOR)
> {
> .procname = "watchdog",
> - .data = &watchdog_enabled,
> + .data = &watchdog_user_enabled,
> .maxlen = sizeof (int),
> .mode = 0644,
> .proc_handler = proc_dowatchdog,
> @@ -828,7 +828,7 @@ static struct ctl_table kern_table[] = {
> },
> {
> .procname = "nmi_watchdog",
> - .data = &watchdog_enabled,
> + .data = &watchdog_user_enabled,
> .maxlen = sizeof (int),
> .mode = 0644,
> .proc_handler = proc_dowatchdog,
> diff --git a/kernel/watchdog.c b/kernel/watchdog.c
> index 7e1a021..10776fe 100644
> --- a/kernel/watchdog.c
> +++ b/kernel/watchdog.c
> @@ -29,9 +29,9 @@
> #include <linux/kvm_para.h>
> #include <linux/perf_event.h>
>
> -int watchdog_enabled = 1;
> +int watchdog_user_enabled = 1;
> int __read_mostly watchdog_thresh = 10;
> -static int __read_mostly watchdog_disabled;
> +static int __read_mostly watchdog_running = 1;
> static u64 __read_mostly sample_period;
>
> static DEFINE_PER_CPU(unsigned long, watchdog_touch_ts);
> @@ -63,7 +63,7 @@ static int __init hardlockup_panic_setup(char *str)
> else if (!strncmp(str, "nopanic", 7))
> hardlockup_panic = 0;
> else if (!strncmp(str, "0", 1))
> - watchdog_enabled = 0;
> + watchdog_user_enabled = 0;
> return 1;
> }
> __setup("nmi_watchdog=", hardlockup_panic_setup);
> @@ -82,7 +82,7 @@ __setup("softlockup_panic=", softlockup_panic_setup);
>
> static int __init nowatchdog_setup(char *str)
> {
> - watchdog_enabled = 0;
> + watchdog_user_enabled = 0;
> return 1;
> }
> __setup("nowatchdog", nowatchdog_setup);
> @@ -90,7 +90,7 @@ __setup("nowatchdog", nowatchdog_setup);
> /* deprecated */
> static int __init nosoftlockup_setup(char *str)
> {
> - watchdog_enabled = 0;
> + watchdog_user_enabled = 0;
> return 1;
> }
> __setup("nosoftlockup", nosoftlockup_setup);
> @@ -158,7 +158,7 @@ void touch_all_softlockup_watchdogs(void)
> #ifdef CONFIG_HARDLOCKUP_DETECTOR
> void touch_nmi_watchdog(void)
> {
> - if (watchdog_enabled) {
> + if (watchdog_user_enabled) {
> unsigned cpu;
>
> for_each_present_cpu(cpu) {
> @@ -347,7 +347,7 @@ static void watchdog_enable(unsigned int cpu)
> hrtimer_init(hrtimer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
> hrtimer->function = watchdog_timer_fn;
>
> - if (!watchdog_enabled) {
> + if (!watchdog_user_enabled) {
> kthread_park(current);
> return;
> }
> @@ -482,8 +482,8 @@ static void watchdog_enable_all_cpus(void)
> {
> unsigned int cpu;
>
> - if (watchdog_disabled) {
> - watchdog_disabled = 0;
> + if (!watchdog_running) {
> + watchdog_running = 1;
> for_each_online_cpu(cpu)
> kthread_unpark(per_cpu(softlockup_watchdog, cpu));
> }
> @@ -493,8 +493,8 @@ static void watchdog_disable_all_cpus(void)
> {
> unsigned int cpu;
>
> - if (!watchdog_disabled) {
> - watchdog_disabled = 1;
> + if (watchdog_running) {
> + watchdog_running = 0;
> for_each_online_cpu(cpu)
> kthread_park(per_cpu(softlockup_watchdog, cpu));
> }
> @@ -509,7 +509,7 @@ int proc_dowatchdog(struct ctl_table *table, int write,
> {
> int ret;
>
> - if (watchdog_disabled < 0)
> + if (watchdog_running < 0)
> return -ENODEV;
>
> ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
> @@ -522,7 +522,7 @@ int proc_dowatchdog(struct ctl_table *table, int write,
> * disabled. The 'watchdog_disabled' variable check in
> * watchdog_*_all_cpus() function takes care of this.
> */
> - if (watchdog_enabled && watchdog_thresh)
> + if (watchdog_user_enabled && watchdog_thresh)
> watchdog_enable_all_cpus();
> else
> watchdog_disable_all_cpus();
> @@ -544,14 +544,14 @@ static struct smp_hotplug_thread watchdog_threads = {
> void __init lockup_detector_init(void)
> {
> #ifdef CONFIG_NO_HZ_FULL
> - watchdog_enabled = 0;
> - watchdog_disabled = 1;
> + watchdog_user_enabled = 0;
> + watchdog_running = 0;
> pr_warning("Disabled lockup detectors by default because of full dynticks\n");
> pr_warning("You can overwrite that with 'sysctl -w kernel.watchdog=1'\n");
> #endif
> set_sample_period();
> if (smpboot_register_percpu_thread(&watchdog_threads)) {
> pr_err("Failed to create watchdog threads, disabled\n");
> - watchdog_disabled = -ENODEV;
> + watchdog_running = -ENODEV;
> }
> }
> --
> 1.7.5.4
>
On Mon, May 20, 2013 at 06:01:56PM +0200, Frederic Weisbecker wrote:
> When the watchdog is disabled through the 'nmi_watchdog=0'
> boot parameter, the watchdog_running internal state isn't
> cleared. Hence further enablements through sysctl/procfs
> are ignored because the subsystem spuriously thinks it's
> already running.
>
> Initialize it properly on boot.
Looks fine to me. Can't think of a case where this breaks something.
Acked-by: Don Zickus <[email protected]>
>
> Signed-off-by: Frederic Weisbecker <[email protected]>
> Cc: Don Zickus <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> ---
> kernel/watchdog.c | 4 ++--
> 1 files changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/watchdog.c b/kernel/watchdog.c
> index 10776fe..3fa5df20 100644
> --- a/kernel/watchdog.c
> +++ b/kernel/watchdog.c
> @@ -31,7 +31,7 @@
>
> int watchdog_user_enabled = 1;
> int __read_mostly watchdog_thresh = 10;
> -static int __read_mostly watchdog_running = 1;
> +static int __read_mostly watchdog_running;
> static u64 __read_mostly sample_period;
>
> static DEFINE_PER_CPU(unsigned long, watchdog_touch_ts);
> @@ -545,10 +545,10 @@ void __init lockup_detector_init(void)
> {
> #ifdef CONFIG_NO_HZ_FULL
> watchdog_user_enabled = 0;
> - watchdog_running = 0;
> pr_warning("Disabled lockup detectors by default because of full dynticks\n");
> pr_warning("You can overwrite that with 'sysctl -w kernel.watchdog=1'\n");
> #endif
> + watchdog_running = watchdog_user_enabled;
> set_sample_period();
> if (smpboot_register_percpu_thread(&watchdog_threads)) {
> pr_err("Failed to create watchdog threads, disabled\n");
> --
> 1.7.5.4
>
On Mon, May 20, 2013 at 01:52:29PM -0400, Don Zickus wrote:
> On Mon, May 20, 2013 at 06:01:51PM +0200, Frederic Weisbecker wrote:
> > When the watchdog runs, it prevents the full dynticks
> > CPUs from stopping their tick because the hard lockup
> > detector uses perf events internally, which in turn
> > rely on the periodic tick.
> >
> > Since this is a rather confusing behaviour that is not
> > easy to track down and identify for those who want to
> > test CONFIG_NO_HZ_FULL, let's default disable the
> > watchdog on boot time when full dynticks is enabled.
> >
> > The user can still enable it later on runtime using
> > proc or sysctl.
>
> I thought Peter committed a patch to perf so that this isn't needed any
> more?
There is the patchset that converts perf_event_task_tick() to an hrtimer.
But I'm not sure it covers everything. Also this will be for the next merge
window. And in the end this moves the problem elsewhere: we still have timer
interrupts, they just come as specific hrtimers, although maybe less often;
I haven't looked at the patchset yet.
On Mon, May 20, 2013 at 9:31 PM, Frederic Weisbecker <[email protected]> wrote:
> When the watchdog code is boot-disabled by the user, for example
> through the 'nmi_watchdog=0' boot option, the setup() callback of
> the watchdog kthread requests to park the task, and that until the
> user later re-enables the watchdog through sysctl or procfs.
>
> However the parking request is not well handled when done from
> the setup() callback. After ->setup() is called, the generic smpboot
> kthread loop directly tries to call the thread function or wait
> for some event if ->thread_should_run() is false.
>
> In the case of the watchdog kthread, ->thread_should_run() returns
> false and the kthread goes to sleep and wait for the watchdog timer
> to wake it up. But the timer is not enabled since the user requested
> to disable the watchdog. We want the kthread to park instead of waiting
> for events that can't happen.
>
> As a result, later unpark requests after sysctl write through
> 'sysctl -w kernel.watchdog=1' won't wake up/unpark the task as
> expected, since it's not parked anyway, leaving the value modified
> without triggering any action.
Out of curiosity, this can happen only for a short period of time, right?
After some time this will work, as the thread gets back in action
after the schedule() call. Then sysctl and procfs will work, I think.
>
> We could workaround some solution in the watchdog code like forcing
> one pass to the thread function and immediately return to park.
>
> But supporting parking requests from ->setup() or ->unpark()
> callbacks look like proper way to implement cancellation in
> general. So let's fix it that way.
>
> Signed-off-by: Frederic Weisbecker <[email protected]>
> Cc: Srivatsa S. Bhat <[email protected]>
> Cc: Steven Rostedt <[email protected]>
> Cc: Paul E. McKenney <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Borislav Petkov <[email protected]>
> Cc: Li Zhong <[email protected]>
> Cc: Don Zickus <[email protected]>
> ---
> kernel/smpboot.c | 6 ++++++
> 1 files changed, 6 insertions(+), 0 deletions(-)
>
> diff --git a/kernel/smpboot.c b/kernel/smpboot.c
> index 02fc5c9..3394ed0 100644
> --- a/kernel/smpboot.c
> +++ b/kernel/smpboot.c
> @@ -151,6 +151,12 @@ static int smpboot_thread_fn(void *data)
> break;
> }
>
> + /* Check if setup or unpark actually want us to park */
> + if (kthread_should_stop() || kthread_should_park()) {
> + preempt_enable();
> + continue;
> + }
> +
> if (!ht->thread_should_run(td->cpu)) {
> preempt_enable();
> schedule();
> --
> 1.7.5.4
>
On 05/20/2013 09:31 PM, Frederic Weisbecker wrote:
> When the watchdog code is boot-disabled by the user, for example
> through the 'nmi_watchdog=0' boot option, the setup() callback of
> the watchdog kthread requests to park the task, and that until the
> user later re-enables the watchdog through sysctl or procfs.
>
> However the parking request is not well handled when done from
> the setup() callback. After ->setup() is called, the generic smpboot
> kthread loop directly tries to call the thread function or wait
> for some event if ->thread_should_run() is false.
>
> In the case of the watchdog kthread, ->thread_should_run() returns
> false and the kthread goes to sleep and wait for the watchdog timer
> to wake it up. But the timer is not enabled since the user requested
> to disable the watchdog. We want the kthread to park instead of waiting
> for events that can't happen.
>
> As a result, later unpark requests after sysctl write through
> 'sysctl -w kernel.watchdog=1' won't wake up/unpark the task as
> expected, since it's not parked anyway, leaving the value modified
> without triggering any action.
>
> We could workaround some solution in the watchdog code like forcing
> one pass to the thread function and immediately return to park.
>
> But supporting parking requests from ->setup() or ->unpark()
unpark() can already do a proper park, because immediately after
coming out of the parked state, the 'continue' statement helps
re-evaluate the stop/park condition.
So this fix is only for the ->setup() case.
> callbacks look like proper way to implement cancellation in
> general. So let's fix it that way.
>
Sounds good to me.
Reviewed-by: Srivatsa S. Bhat <[email protected]>
But I wonder what nmi_watchdog=0 should actually mean, semantically..
The current code works as if the user asked us not to run the watchdog
threads, but it could as well be interpreted as if the user does not
want to run *any* watchdog-related *code*. In that case, ideally we
should *unregister* the watchdog threads, instead of just parking them.
And when the user enables them again via sysctl/procfs, we should
register the watchdog threads with the smpboot infrastructure.
I'm not saying that the current semantics is wrong, but if we really
implement it the other way I proposed above, then we won't have to deal
with weird issues like ->setup() code wanting to park, and watchdog
threads unparking and parking themselves on every CPU hotplug operation,
despite the fact that the user specified nmi_watchdog=0 on the kernel
command line.
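Concretely, I imagine something like the following rough (untested!)
sketch, where 'watchdog_registered' is a new flag:

static int watchdog_registered;

static void watchdog_set_enabled(bool enable)
{
        if (enable && !watchdog_registered) {
                if (!smpboot_register_percpu_thread(&watchdog_threads))
                        watchdog_registered = 1;
        } else if (!enable && watchdog_registered) {
                /* Tears the per-cpu threads down entirely, so there is
                 * no park/unpark dance on CPU hotplug while disabled. */
                smpboot_unregister_percpu_thread(&watchdog_threads);
                watchdog_registered = 0;
        }
}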
Regards,
Srivatsa S. Bhat
> Signed-off-by: Frederic Weisbecker <[email protected]>
> Cc: Srivatsa S. Bhat <[email protected]>
> Cc: Steven Rostedt <[email protected]>
> Cc: Paul E. McKenney <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Borislav Petkov <[email protected]>
> Cc: Li Zhong <[email protected]>
> Cc: Don Zickus <[email protected]>
> ---
> kernel/smpboot.c | 6 ++++++
> 1 files changed, 6 insertions(+), 0 deletions(-)
>
> diff --git a/kernel/smpboot.c b/kernel/smpboot.c
> index 02fc5c9..3394ed0 100644
> --- a/kernel/smpboot.c
> +++ b/kernel/smpboot.c
> @@ -151,6 +151,12 @@ static int smpboot_thread_fn(void *data)
> break;
> }
>
> + /* Check if setup or unpark actually want us to park */
> + if (kthread_should_stop() || kthread_should_park()) {
> + preempt_enable();
> + continue;
> + }
> +
> if (!ht->thread_should_run(td->cpu)) {
> preempt_enable();
> schedule();
>
On 05/21/2013 11:04 AM, anish singh wrote:
> On Mon, May 20, 2013 at 9:31 PM, Frederic Weisbecker <[email protected]> wrote:
>> When the watchdog code is boot-disabled by the user, for example
>> through the 'nmi_watchdog=0' boot option, the setup() callback of
>> the watchdog kthread requests to park the task, and that until the
>> user later re-enables the watchdog through sysctl or procfs.
>>
>> However the parking request is not well handled when done from
>> the setup() callback. After ->setup() is called, the generic smpboot
>> kthread loop directly tries to call the thread function or wait
>> for some event if ->thread_should_run() is false.
>>
>> In the case of the watchdog kthread, ->thread_should_run() returns
>> false and the kthread goes to sleep and wait for the watchdog timer
>> to wake it up. But the timer is not enabled since the user requested
>> to disable the watchdog. We want the kthread to park instead of waiting
>> for events that can't happen.
>>
>> As a result, later unpark requests after sysctl write through
>> 'sysctl -w kernel.watchdog=1' won't wake up/unpark the task as
>> expected, since it's not parked anyway, leaving the value modified
>> without triggering any action.
> Out of curiosity, this can happen only for short period of time right?After
> some time this will work right as the thread get back in action
> after the schedule call.Then sysctl and procfs will work I think.
kthread_unpark() can wake up a task only if the task is in TASK_PARKED
state. But since the above task would be in TASK_INTERRUPTIBLE state
(since it is not parked), kthread_unpark() will be powerless to do anything.
Regards,
Srivatsa S. Bhat
On Tue, May 21, 2013 at 1:19 PM, Srivatsa S. Bhat
<[email protected]> wrote:
> On 05/21/2013 11:04 AM, anish singh wrote:
>> On Mon, May 20, 2013 at 9:31 PM, Frederic Weisbecker <[email protected]> wrote:
>>> When the watchdog code is boot-disabled by the user, for example
>>> through the 'nmi_watchdog=0' boot option, the setup() callback of
>>> the watchdog kthread requests to park the task, and that until the
>>> user later re-enables the watchdog through sysctl or procfs.
>>>
>>> However the parking request is not well handled when done from
>>> the setup() callback. After ->setup() is called, the generic smpboot
>>> kthread loop directly tries to call the thread function or wait
>>> for some event if ->thread_should_run() is false.
>>>
>>> In the case of the watchdog kthread, ->thread_should_run() returns
>>> false and the kthread goes to sleep and wait for the watchdog timer
>>> to wake it up. But the timer is not enabled since the user requested
>>> to disable the watchdog. We want the kthread to park instead of waiting
>>> for events that can't happen.
>>>
>>> As a result, later unpark requests after sysctl write through
>>> 'sysctl -w kernel.watchdog=1' won't wake up/unpark the task as
>>> expected, since it's not parked anyway, leaving the value modified
>>> without triggering any action.
>> Out of curiosity, this can happen only for short period of time right?After
>> some time this will work right as the thread get back in action
>> after the schedule call.Then sysctl and procfs will work I think.
>
> kthread_unpark() can wake up a task only if the task is in TASK_PARKED
> state. But since the above task would be in TASK_INTERRUPTIBLE state
> (since it is not parked), kthread_unpark() will be powerless to do anything.
Yes, but this will happen only for a short period of time, right?
After the schedule() call, the below code gets executed in the while() loop.
if (kthread_should_park()) {
        __set_current_state(TASK_RUNNING);
        preempt_enable();
        if (ht->park && td->status == HP_THREAD_ACTIVE) {
                BUG_ON(td->cpu != smp_processor_id());
                ht->park(td->cpu);
                td->status = HP_THREAD_PARKED;
        }
        kthread_parkme();
        /* We might have been woken for stop */
        continue;
}
As we have already called kthread_park(), the above if() condition becomes
true and it will park the thread, wouldn't it? But this will happen after
the schedule() call, which is not right, as mentioned by Frederic.
On 05/21/2013 02:28 PM, anish singh wrote:
> On Tue, May 21, 2013 at 1:19 PM, Srivatsa S. Bhat
> <[email protected]> wrote:
>> On 05/21/2013 11:04 AM, anish singh wrote:
>>> On Mon, May 20, 2013 at 9:31 PM, Frederic Weisbecker <[email protected]> wrote:
>>>> When the watchdog code is boot-disabled by the user, for example
>>>> through the 'nmi_watchdog=0' boot option, the setup() callback of
>>>> the watchdog kthread requests to park the task, and that until the
>>>> user later re-enables the watchdog through sysctl or procfs.
>>>>
>>>> However the parking request is not well handled when done from
>>>> the setup() callback. After ->setup() is called, the generic smpboot
>>>> kthread loop directly tries to call the thread function or wait
>>>> for some event if ->thread_should_run() is false.
>>>>
>>>> In the case of the watchdog kthread, ->thread_should_run() returns
>>>> false and the kthread goes to sleep and wait for the watchdog timer
>>>> to wake it up. But the timer is not enabled since the user requested
>>>> to disable the watchdog. We want the kthread to park instead of waiting
>>>> for events that can't happen.
>>>>
>>>> As a result, later unpark requests after sysctl write through
>>>> 'sysctl -w kernel.watchdog=1' won't wake up/unpark the task as
>>>> expected, since it's not parked anyway, leaving the value modified
>>>> without triggering any action.
>>> Out of curiosity, this can happen only for short period of time right?After
>>> some time this will work right as the thread get back in action
>>> after the schedule call.Then sysctl and procfs will work I think.
>>
>> kthread_unpark() can wake up a task only if the task is in TASK_PARKED
>> state. But since the above task would be in TASK_INTERRUPTIBLE state
>> (since it is not parked), kthread_unpark() will be powerless to do anything.
> Yes but this will happen only for a short period of time right?
> After the schdule() call the below code gets executed in while() loop.
>
schedule() is not just another function call from which you'll simply
return "after a short period of time". We have set the task's state to
TASK_INTERRUPTIBLE at the beginning of the loop. So the call to schedule()
will kick the task out of the runqueue, since it is not runnable any more.
You need somebody to do a wake_up_process() or something similar to make
the task runnable again, after which it gets back to the runqueue and gets
a chance to run. But the only one who can do wake_up_process() for this
watchdog task in this case, is the kthread_unpark() code, which needs the
task to be in TASK_PARKED state, not TASK_INTERRUPTIBLE. So the call to
wake_up_state() returns without doing anything, leaving the watchdog task
non-runnable. This is what Frederic referred to, when he said that
enabling nmi watchdog via sysctl/procfs won't do anything.
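To be precise, the wake-up in the unpark path is conditional on the
task state; paraphrasing kernel/kthread.c:

        wake_up_state(k, TASK_PARKED);  /* wakes k only if its state is TASK_PARKED */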
> if (kthread_should_park()) {
> __set_current_state(TASK_RUNNING);
> preempt_enable();
> if (ht->park && td->status == HP_THREAD_ACTIVE) {
> BUG_ON(td->cpu != smp_processor_id());
> ht->park(td->cpu);
> td->status = HP_THREAD_PARKED;
> }
> kthread_parkme();
> /* We might have been woken for stop */
> continue;
> }
>
> As we have already called kthread_park this above if() condition gets true and
> it will park the thread wouldn't it?But this will happen after the schedule
> call which is not right as mentioned by fredrick.
>
Regards,
Srivatsa S. Bhat
2013/5/21, anish singh <[email protected]>:
> On Tue, May 21, 2013 at 1:19 PM, Srivatsa S. Bhat
> <[email protected]> wrote:
>> On 05/21/2013 11:04 AM, anish singh wrote:
>>> On Mon, May 20, 2013 at 9:31 PM, Frederic Weisbecker <[email protected]>
>>> wrote:
>>>> When the watchdog code is boot-disabled by the user, for example
>>>> through the 'nmi_watchdog=0' boot option, the setup() callback of
>>>> the watchdog kthread requests to park the task, and that until the
>>>> user later re-enables the watchdog through sysctl or procfs.
>>>>
>>>> However the parking request is not well handled when done from
>>>> the setup() callback. After ->setup() is called, the generic smpboot
>>>> kthread loop directly tries to call the thread function or wait
>>>> for some event if ->thread_should_run() is false.
>>>>
>>>> In the case of the watchdog kthread, ->thread_should_run() returns
>>>> false and the kthread goes to sleep and wait for the watchdog timer
>>>> to wake it up. But the timer is not enabled since the user requested
>>>> to disable the watchdog. We want the kthread to park instead of waiting
>>>> for events that can't happen.
>>>>
>>>> As a result, later unpark requests after sysctl write through
>>>> 'sysctl -w kernel.watchdog=1' won't wake up/unpark the task as
>>>> expected, since it's not parked anyway, leaving the value modified
>>>> without triggering any action.
>>> Out of curiosity, this can happen only for short period of time
>>> right?After
>>> some time this will work right as the thread get back in action
>>> after the schedule call.Then sysctl and procfs will work I think.
>>
>> kthread_unpark() can wake up a task only if the task is in TASK_PARKED
>> state. But since the above task would be in TASK_INTERRUPTIBLE state
>> (since it is not parked), kthread_unpark() will be powerless to do
>> anything.
> Yes but this will happen only for a short period of time right?
> After the schdule() call the below code gets executed in while() loop.
>
> if (kthread_should_park()) {
> __set_current_state(TASK_RUNNING);
> preempt_enable();
> if (ht->park && td->status == HP_THREAD_ACTIVE) {
> BUG_ON(td->cpu != smp_processor_id());
> ht->park(td->cpu);
> td->status = HP_THREAD_PARKED;
> }
> kthread_parkme();
> /* We might have been woken for stop */
> continue;
> }
>
> As we have already called kthread_park this above if() condition gets true
> and
> it will park the thread wouldn't it?But this will happen after the schedule
> call which is not right as mentioned by fredrick.
>
But we are scheduling in TASK_INTERRUPTIBLE mode, so we are going to
sleep until some other thread wakes us. But we are
unparked instead of being woken. This just has no effect because
kthread_unpark() is a no-op on non-parked kthreads.
On 20.05.2013 18:01, Frederic Weisbecker wrote:
> While computing the cputime delta of dynticks CPUs,
> we are mixing up clocks of differents natures:
[...]
> As a consequence, some strange behaviour with unstable tsc
> has been observed such as non progressing constant zero cputime.
> (The 'top' command showing no load).
This happens for example on my trusty ThinkPad X200s (family 6 model 23
stepping 10 Core 2 duo), seriously confusing its user (me :-).
> Fix this by only using local_clock(), or its irq safe/remote
> equivalent, in vtime code.
>
> Reported-by: Mike Galbraith <[email protected]>
> Suggested-by: Mike Galbraith <[email protected]>
> Cc: Steven Rostedt <[email protected]>
> Cc: Paul E. McKenney <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Borislav Petkov <[email protected]>
> Cc: Li Zhong <[email protected]>
> Cc: Mike Galbraith <[email protected]>
> Signed-off-by: Frederic Weisbecker <[email protected]>
FWIW:
Tested-by: Stefan Seyfried <[email protected]>
This patch fixes the 0% CPU issue on openSUSE Factory kernels for me.
Best regards,
Stefan
--
Stefan Seyfried
Linux Consultant & Developer -- GPG Key: 0x731B665B
B1 Systems GmbH
Osterfeldstraße 7 / 85088 Vohburg / http://www.b1-systems.de
GF: Ralph Dehner / Unternehmenssitz: Vohburg / AG: Ingolstadt,HRB 3537
On Mon, 2013-06-03 at 11:47 +0200, Stefan Seyfried wrote:
> FWIW:
> Tested-by: Stefan Seyfried <[email protected]>
FWIW, we can always use testers. Thanks for testing!
-- Steve
On Mon, Jun 03, 2013 at 11:47:17AM +0200, Stefan Seyfried wrote:
> On 20.05.2013 18:01, Frederic Weisbecker wrote:
> > While computing the cputime delta of dynticks CPUs,
> > we are mixing up clocks of differents natures:
>
> [...]
>
> > As a consequence, some strange behaviour with unstable tsc
> > has been observed such as non progressing constant zero cputime.
> > (The 'top' command showing no load).
>
> This happens for example on my trusty ThinkPad X200s (family 6 model 23
> stepping 10 Core 2 duo), seriously confusing its user (me :-).
>
> > Fix this by only using local_clock(), or its irq safe/remote
> > equivalent, in vtime code.
> >
> > Reported-by: Mike Galbraith <[email protected]>
> > Suggested-by: Mike Galbraith <[email protected]>
> > Cc: Steven Rostedt <[email protected]>
> > Cc: Paul E. McKenney <[email protected]>
> > Cc: Ingo Molnar <[email protected]>
> > Cc: Thomas Gleixner <[email protected]>
> > Cc: Peter Zijlstra <[email protected]>
> > Cc: Borislav Petkov <[email protected]>
> > Cc: Li Zhong <[email protected]>
> > Cc: Mike Galbraith <[email protected]>
> > Signed-off-by: Frederic Weisbecker <[email protected]>
>
> FWIW:
> Tested-by: Stefan Seyfried <[email protected]>
>
> This patch fixes the 0% CPU issue on openSUSE Factory kernels for me.
Thanks! The patch has been committed already so I can't add your Tested-by:
but feedback on testing is always appreciated.
On 03.06.2013 21:48, Frederic Weisbecker wrote:
> On Mon, Jun 03, 2013 at 11:47:17AM +0200, Stefan Seyfried wrote:
>> FWIW:
>> Tested-by: Stefan Seyfried <[email protected]>
>>
>> This patch fixes the 0% CPU issue on openSUSE Factory kernels for me.
>
> Thanks! The patch has been committed already so I can't add your Tested-by:
> but feedbacks on testing are always appeciated.
But it did not end up in Linus' tree yet. That would be more important
for me than the credits in the commit message :-)
Thanks,
Stefan
--
Stefan Seyfried
Linux Consultant & Developer -- GPG Key: 0x731B665B
B1 Systems GmbH
Osterfeldstraße 7 / 85088 Vohburg / http://www.b1-systems.de
GF: Ralph Dehner / Unternehmenssitz: Vohburg / AG: Ingolstadt,HRB 3537
On Mon, Jun 03, 2013 at 09:51:37PM +0200, Stefan Seyfried wrote:
> On 03.06.2013 21:48, Frederic Weisbecker wrote:
> > On Mon, Jun 03, 2013 at 11:47:17AM +0200, Stefan Seyfried wrote:
> >> FWIW:
> >> Tested-by: Stefan Seyfried <[email protected]>
> >>
> >> This patch fixes the 0% CPU issue on openSUSE Factory kernels for me.
> >
> > Thanks! The patch has been committed already so I can't add your Tested-by:
> > but feedbacks on testing are always appeciated.
>
> But it did not end up in Linus' tree yet. That would be more important
> for me than the credits in the commit message :-)
It will soon :)
>
> Thanks,
>
> Stefan
> --
> Stefan Seyfried
> Linux Consultant & Developer -- GPG Key: 0x731B665B
>
> B1 Systems GmbH
> Osterfeldstraße 7 / 85088 Vohburg / http://www.b1-systems.de
> GF: Ralph Dehner / Unternehmenssitz: Vohburg / AG: Ingolstadt,HRB 3537
2013/5/21 Srivatsa S. Bhat <[email protected]>:
> On 05/20/2013 09:31 PM, Frederic Weisbecker wrote:
>> When the watchdog code is boot-disabled by the user, for example
>> through the 'nmi_watchdog=0' boot option, the setup() callback of
>> the watchdog kthread requests to park the task, and that until the
>> user later re-enables the watchdog through sysctl or procfs.
>>
>> However the parking request is not well handled when done from
>> the setup() callback. After ->setup() is called, the generic smpboot
>> kthread loop directly tries to call the thread function or wait
>> for some event if ->thread_should_run() is false.
>>
>> In the case of the watchdog kthread, ->thread_should_run() returns
>> false and the kthread goes to sleep and wait for the watchdog timer
>> to wake it up. But the timer is not enabled since the user requested
>> to disable the watchdog. We want the kthread to park instead of waiting
>> for events that can't happen.
>>
>> As a result, later unpark requests after sysctl write through
>> 'sysctl -w kernel.watchdog=1' won't wake up/unpark the task as
>> expected, since it's not parked anyway, leaving the value modified
>> without triggering any action.
>>
>> We could workaround some solution in the watchdog code like forcing
>> one pass to the thread function and immediately return to park.
>>
>> But supporting parking requests from ->setup() or ->unpark()
>
> unpark() can already do a proper park, because immediately after
> coming out of the parked state, the 'continue' statement helps
> re-evaluate the stop/park condition.
Right.
>
> So this fix is only for the ->setup() case.
>
>> callbacks look like proper way to implement cancellation in
>> general. So let's fix it that way.
>>
>
> Sounds good to me.
>
> Reviewed-by: Srivatsa S. Bhat <[email protected]>
>
> But I wonder what nmi_watchdog=0 should actually mean, semantically..
> The current code works as if the user asked us not to run the watchdog
> threads, but it could as well be interpreted as if the user does not
> want to run *any* watchdog-related *code*. In that case, ideally we
> should *unregister* the watchdog threads, instead of just parking them.
> And when the user enables them again via sysctl/procfs, we should
> register the watchdog threads with the smpboot infrastructure.
>
> I'm not saying that the current semantics is wrong, but if we really
> implement it the other way I proposed above, then we won't have to deal
> with weird issues like ->setup() code wanting to park, and watchdog
> threads unparking and parking themselves on every CPU hotplug operation,
> despite the fact that the user specified nmi_watchdog=0 on the kernel
> command line.
That's an interesting point. It seems that using smpboot
register/unregister operations for sysctl control would be more
appropriate and less complicated. I'm going to give it a try.
Thanks!