The timer interrupt handles several things: preemption,
timekeeping, RCU, etc.

However, it is sometimes simply useless, such as when a task
runs alone, and even more so when it runs in userspace, since
RCU doesn't need the tick at all in that case.

HPC workloads should benefit from such timer deactivation, and
perhaps the real-time world as well, since far fewer interrupts
to handle means shorter critical sections.
It works through the procfs interface:
echo 1 > /proc/self/nohz
With the following constraints:

- A CPU can have only one nohz task
- A nohz task must be affine to a single CPU, and that affinity
  can't change while the task is in this mode
- This can only be written to /proc/self; however, allowing it
  to be set from another task should be possible later.
You need to migrate irqs away from the nohz CPU manually from
userspace, and likewise for other tasks. If a non-nohz task is
running on the same CPU as a nohz task, the tick can't be stopped.
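For illustration, here is a minimal user of the interface (a
sketch with error handling trimmed; the CPU number is arbitrary):

	#define _GNU_SOURCE
	#include <sched.h>
	#include <fcntl.h>
	#include <unistd.h>

	int main(void)
	{
		cpu_set_t set;
		int fd;

		/* A nohz task must be affine to a single CPU. */
		CPU_ZERO(&set);
		CPU_SET(1, &set);
		if (sched_setaffinity(0, sizeof(set), &set) < 0)
			return 1;

		/* Then flip the nohz attribute on ourselves. */
		fd = open("/proc/self/nohz", O_WRONLY);
		if (fd < 0)
			return 1;
		if (write(fd, "1", 1) < 0)
			return 1;
		close(fd);

		/* Spin in userspace: the tick should stop if we run alone. */
		for (;;)
			;
	}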
I can provide the tools I'm using to test it if you want.
Note this depends on the rcu spurious softirq fixes in Paul's
queue for .38
I'm also using a hack that makes init affine to the first CPU
at boot, so that all userspace tasks end up on the first CPU,
except kernel threads and tasks that change their affinity
explicitly (this is not sched isolation). This prevents tasks
from setting up timers on random CPUs on which we'll later
want to run a nohz task. But this can probably be fixed in
another way, like unbinding these timers or so, which
probably requires a detailed audit.
Any comments are welcome.
You can fetch from:
git://git.kernel.org/pub/scm/linux/kernel/git/frederic/random-tracing.git
sched/nohz-task
Frederic Weisbecker (15):
nohz_task: New mask for cpus having nohz task
nohz_task: Avoid nohz task cpu as non-idle timer target
nohz_task: Make tick stop and restart callable outside idle
nohz_task: Stop the tick when the nohz task runs alone
nohz_task: Restart the tick when another task compete on the cpu
nohz_task: Keep the tick if rcu needs it
nohz_task: Restart tick when RCU forces nohz task cpu quiescent state
smp: Don't warn if irq are disabled but we don't wait for the ipi
rcu: Make rcu_enter,exit_nohz() callable from irq
nohz_task: Enter in extended quiescent state when in userspace
x86: Nohz task support
clocksource: Ignore nohz task cpu in clocksource watchdog
sched: Protect nohz task cpu affinity
nohz_task: Clear nohz task attribute on exit()
nohz_task: Procfs interface
arch/Kconfig | 7 ++
arch/x86/Kconfig | 1 +
arch/x86/include/asm/thread_info.h | 10 ++-
arch/x86/kernel/ptrace.c | 10 +++
arch/x86/kernel/traps.c | 22 ++++--
arch/x86/mm/fault.c | 13 +++-
fs/proc/base.c | 80 +++++++++++++++++++++
include/linux/cpumask.h | 8 ++
include/linux/rcupdate.h | 1 +
include/linux/sched.h | 9 +++
include/linux/tick.h | 26 +++++++-
kernel/cpu.c | 15 ++++
kernel/exit.c | 3 +
kernel/rcutree.c | 127 +++++++++++++++------------------
kernel/rcutree.h | 12 ++--
kernel/sched.c | 135 ++++++++++++++++++++++++++++++++++-
kernel/smp.c | 2 +-
kernel/softirq.c | 4 +-
kernel/time/Kconfig | 7 ++
kernel/time/clocksource.c | 10 ++-
kernel/time/tick-sched.c | 138 +++++++++++++++++++++++++++++++++--
21 files changed, 535 insertions(+), 105 deletions(-)
--
1.7.3.2
Make the nohz tick stop and restart APIs callable outside idle
and from interrupts.

Don't re-enable interrupts unconditionally, and only enter/exit
the RCU extended quiescent state when we are idle.
Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Lai Jiangshan <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Anton Blanchard <[email protected]>
Cc: Tim Pepper <[email protected]>
---
kernel/time/tick-sched.c | 17 +++++++++++------
1 files changed, 11 insertions(+), 6 deletions(-)
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 3e216e0..e706fa8 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -405,12 +405,14 @@ void tick_nohz_stop_sched_tick(int inidle)
* the scheduler tick in nohz_restart_sched_tick.
*/
if (!ts->tick_stopped) {
- select_nohz_load_balancer(1);
+ if (!current->pid)
+ select_nohz_load_balancer(1);
ts->idle_tick = hrtimer_get_expires(&ts->sched_timer);
ts->tick_stopped = 1;
ts->idle_jiffies = last_jiffies;
- rcu_enter_nohz();
+ if (!current->pid)
+ rcu_enter_nohz();
}
ts->idle_sleeps++;
@@ -500,12 +502,14 @@ void tick_nohz_restart_sched_tick(void)
{
int cpu = smp_processor_id();
struct tick_sched *ts = &per_cpu(tick_cpu_sched, cpu);
+ unsigned long flags;
#ifndef CONFIG_VIRT_CPU_ACCOUNTING
unsigned long ticks;
#endif
ktime_t now;
- local_irq_disable();
+ local_irq_save(flags);
+
if (ts->idle_active || (ts->inidle && ts->tick_stopped))
now = ktime_get();
@@ -514,13 +518,14 @@ void tick_nohz_restart_sched_tick(void)
if (!ts->inidle || !ts->tick_stopped) {
ts->inidle = 0;
- local_irq_enable();
+ local_irq_restore(flags);
return;
}
ts->inidle = 0;
- rcu_exit_nohz();
+ if (!current->pid)
+ rcu_exit_nohz();
/* Update jiffies first */
select_nohz_load_balancer(0);
@@ -550,7 +555,7 @@ void tick_nohz_restart_sched_tick(void)
tick_nohz_restart(ts, now);
- local_irq_enable();
+ local_irq_restore(flags);
}
static int tick_nohz_reprogram(struct tick_sched *ts, ktime_t now)
--
1.7.3.2
If a new task is enqueued on the same CPU as a nohz task, the
tick must be restarted in case it was stopped for the nohz task,
so that preemption can work again.
Do this remotely using a specific IPI.
Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Lai Jiangshan <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Anton Blanchard <[email protected]>
Cc: Tim Pepper <[email protected]>
---
include/linux/sched.h | 2 ++
kernel/sched.c | 45 ++++++++++++++++++++++++++++++++++++++++++++-
2 files changed, 46 insertions(+), 1 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 858a876..f80088a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2550,8 +2550,10 @@ extern void task_oncpu_function_call(struct task_struct *p,
void (*func) (void *info), void *info);
#ifdef CONFIG_NO_HZ_TASK
+extern void smp_send_update_nohz_task_cpu(int cpu);
extern int nohz_task_can_stop_tick(void);
#else
+static inline void smp_send_update_nohz_task_cpu(int cpu) { }
static inline int nohz_task_can_stop_tick(void) { return 0; }
#endif
diff --git a/kernel/sched.c b/kernel/sched.c
index e9cdd7a..6dbae46 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2446,6 +2446,27 @@ static void update_avg(u64 *avg, u64 sample)
#ifdef CONFIG_NO_HZ_TASK
DEFINE_PER_CPU(int, task_nohz_mode);
+/* Must be called with irq disabled! */
+static void nohz_task_cpu_update(void *unused)
+{
+ struct rq *rq;
+ int cpu;
+
+ if (!__get_cpu_var(task_nohz_mode))
+ return;
+
+ /*
+ * Look at nr_running lockless. At worst, the new task was deactivated
+ * and we just exit without doing anything.
+ */
+ rq = this_rq();
+ cpu = smp_processor_id();
+ if (rq->nr_running > 1) {
+ __get_cpu_var(task_nohz_mode) = 0;
+ tick_nohz_restart_sched_tick();
+ }
+}
+
int nohz_task_can_stop_tick(void)
{
struct rq *rq = this_rq();
@@ -2455,6 +2476,12 @@ int nohz_task_can_stop_tick(void)
return 1;
}
+
+void smp_send_update_nohz_task_cpu(int cpu)
+{
+ smp_call_function_single(cpu, nohz_task_cpu_update,
+ NULL, 0);
+}
#endif
static inline void ttwu_activate(struct task_struct *p, struct rq *rq,
@@ -2477,6 +2504,8 @@ static inline void ttwu_activate(struct task_struct *p, struct rq *rq,
static inline void ttwu_post_activation(struct task_struct *p, struct rq *rq,
int wake_flags, bool success)
{
+ int cpu = cpu_of(rq);
+
trace_sched_wakeup(p, success);
check_preempt_curr(rq, p, wake_flags);
@@ -2498,7 +2527,21 @@ static inline void ttwu_post_activation(struct task_struct *p, struct rq *rq,
#endif
/* if a worker is waking up, notify workqueue */
if ((p->flags & PF_WQ_WORKER) && success)
- wq_worker_waking_up(p, cpu_of(rq));
+ wq_worker_waking_up(p, cpu);
+
+#ifdef CONFIG_NO_HZ_TASK
+ /*
+ * CHECKME:
+ * Ideally, we need to check if the target has a nohz task
+ * and only send the IPI if so. But there is nothing but
+ * a racy way to do that. Or can we assume at that point
+ * of the wake up that if cpu_has_nohz_task(cpu) is 0, then
+ * it's ok, even if it has a task about to switch to nohz
+ * task mode?
+ */
+ if (rq->nr_running == 2)
+ smp_send_update_nohz_task_cpu(cpu);
+#endif
}
/**
--
1.7.3.2
In order to be able to enter/exit an RCU extended quiescent
state from interrupt context, we need to make rcu_enter_nohz()
and rcu_exit_nohz() callable from interrupts.

So, this proposes a new implementation of the RCU nohz fast-path
helpers, where rcu_enter_nohz() or rcu_exit_nohz() can be called
between rcu_irq_enter() and rcu_irq_exit() while keeping the
existing semantics.
We maintain three per-CPU fields:

- nohz indicates that we entered extended quiescent state mode;
  we may or may not be in an interrupt even when this is set.

- irq_nest indicates we are in an irq. It is incremented on irq
  entry and decremented on irq exit. This includes NMIs.

- qs_seq is incremented every time we observe a true extended
  quiescent state:

  * when rcu_enter_nohz() is called and we are not in an irq;
  * when we exit the outermost nesting irq while in nohz mode
    (rcu_enter_nohz() was called without a pairing
    rcu_exit_nohz() yet).
From those three fields we can deduce extended quiescent states,
as we did before, on top of snapshots and comparisons.

If nohz == 1 and irq_nest == 0, we are in a quiescent state. qs_seq
keeps track of elapsed extended quiescent states, which is useful
to compare snapshots of the rcu nohz state.
This is experimental and does not take care of barriers yet.
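For review convenience, the detection logic condenses into the
following pseudo-C (a restatement of the patch below, not extra
code):

	static int in_extended_qs(struct rcu_dynticks *d)
	{
		/* Quiescent right now: nohz mode, outside any irq/NMI. */
		return d->nohz && !d->irq_nest;
	}

	static int passed_extended_qs(struct rcu_dynticks *curr,
				      struct rcu_dynticks *snap)
	{
		/* Quiescent at snapshot time or at re-check time... */
		if (in_extended_qs(curr) || in_extended_qs(snap))
			return 1;
		/* ...or at least one full extended QS elapsed in between. */
		return local_read(&curr->qs_seq) != local_read(&snap->qs_seq);
	}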
Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Lai Jiangshan <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Anton Blanchard <[email protected]>
Cc: Tim Pepper <[email protected]>
---
kernel/rcutree.c | 103 ++++++++++++++++++++++-------------------------------
kernel/rcutree.h | 12 +++----
2 files changed, 48 insertions(+), 67 deletions(-)
diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index ed6aba3..1ac1a61 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -129,10 +129,7 @@ void rcu_note_context_switch(int cpu)
}
#ifdef CONFIG_NO_HZ
-DEFINE_PER_CPU(struct rcu_dynticks, rcu_dynticks) = {
- .dynticks_nesting = 1,
- .dynticks = 1,
-};
+DEFINE_PER_CPU(struct rcu_dynticks, rcu_dynticks);
#endif /* #ifdef CONFIG_NO_HZ */
static int blimit = 10; /* Maximum callbacks per softirq. */
@@ -272,16 +269,15 @@ static int rcu_implicit_offline_qs(struct rcu_data *rdp)
*/
void rcu_enter_nohz(void)
{
- unsigned long flags;
struct rcu_dynticks *rdtp;
- smp_mb(); /* CPUs seeing ++ must see prior RCU read-side crit sects */
- local_irq_save(flags);
+ preempt_disable();
rdtp = &__get_cpu_var(rcu_dynticks);
- rdtp->dynticks++;
- rdtp->dynticks_nesting--;
- WARN_ON_ONCE(rdtp->dynticks & 0x1);
- local_irq_restore(flags);
+ WARN_ON_ONCE(rdtp->nohz);
+ rdtp->nohz = 1;
+ if (!rdtp->irq_nest)
+ local_inc(&rdtp->qs_seq);
+ preempt_enable();
}
/*
@@ -292,16 +288,13 @@ void rcu_enter_nohz(void)
*/
void rcu_exit_nohz(void)
{
- unsigned long flags;
struct rcu_dynticks *rdtp;
- local_irq_save(flags);
+ preempt_disable();
rdtp = &__get_cpu_var(rcu_dynticks);
- rdtp->dynticks++;
- rdtp->dynticks_nesting++;
- WARN_ON_ONCE(!(rdtp->dynticks & 0x1));
- local_irq_restore(flags);
- smp_mb(); /* CPUs seeing ++ must see later RCU read-side crit sects */
+ WARN_ON_ONCE(!rdtp->nohz);
+ rdtp->nohz = 0;
+ preempt_enable();
}
/**
@@ -313,13 +306,7 @@ void rcu_exit_nohz(void)
*/
void rcu_nmi_enter(void)
{
- struct rcu_dynticks *rdtp = &__get_cpu_var(rcu_dynticks);
-
- if (rdtp->dynticks & 0x1)
- return;
- rdtp->dynticks_nmi++;
- WARN_ON_ONCE(!(rdtp->dynticks_nmi & 0x1));
- smp_mb(); /* CPUs seeing ++ must see later RCU read-side crit sects */
+ rcu_irq_enter();
}
/**
@@ -331,13 +318,7 @@ void rcu_nmi_enter(void)
*/
void rcu_nmi_exit(void)
{
- struct rcu_dynticks *rdtp = &__get_cpu_var(rcu_dynticks);
-
- if (rdtp->dynticks & 0x1)
- return;
- smp_mb(); /* CPUs seeing ++ must see prior RCU read-side crit sects */
- rdtp->dynticks_nmi++;
- WARN_ON_ONCE(rdtp->dynticks_nmi & 0x1);
+ rcu_irq_exit();
}
/**
@@ -350,11 +331,7 @@ void rcu_irq_enter(void)
{
struct rcu_dynticks *rdtp = &__get_cpu_var(rcu_dynticks);
- if (rdtp->dynticks_nesting++)
- return;
- rdtp->dynticks++;
- WARN_ON_ONCE(!(rdtp->dynticks & 0x1));
- smp_mb(); /* CPUs seeing ++ must see later RCU read-side crit sects */
+ rdtp->irq_nest++;
}
/**
@@ -368,11 +345,11 @@ void rcu_irq_exit(void)
{
struct rcu_dynticks *rdtp = &__get_cpu_var(rcu_dynticks);
- if (--rdtp->dynticks_nesting)
+ if (--rdtp->irq_nest)
return;
- smp_mb(); /* CPUs seeing ++ must see prior RCU read-side crit sects */
- rdtp->dynticks++;
- WARN_ON_ONCE(rdtp->dynticks & 0x1);
+
+ if (rdtp->nohz)
+ local_inc(&rdtp->qs_seq);
/* If the interrupt queued a callback, get out of dyntick mode. */
if (__get_cpu_var(rcu_sched_data).nxtlist ||
@@ -390,15 +367,19 @@ void rcu_irq_exit(void)
static int dyntick_save_progress_counter(struct rcu_data *rdp)
{
int ret;
- int snap;
- int snap_nmi;
+ int snap_nohz;
+ int snap_irq_nest;
+ long snap_qs_seq;
- snap = rdp->dynticks->dynticks;
- snap_nmi = rdp->dynticks->dynticks_nmi;
+ snap_nohz = rdp->dynticks->nohz;
+ snap_irq_nest = rdp->dynticks->irq_nest;
+ snap_qs_seq = local_read(&rdp->dynticks->qs_seq);
smp_mb(); /* Order sampling of snap with end of grace period. */
- rdp->dynticks_snap = snap;
- rdp->dynticks_nmi_snap = snap_nmi;
- ret = ((snap & 0x1) == 0) && ((snap_nmi & 0x1) == 0);
+ rdp->dynticks_snap.nohz = snap_nohz;
+ rdp->dynticks_snap.irq_nest = snap_irq_nest;
+ local_set(&rdp->dynticks_snap.qs_seq, snap_qs_seq);
+
+ ret = (snap_nohz && !snap_irq_nest);
if (ret)
rdp->dynticks_fqs++;
return ret;
@@ -412,15 +393,10 @@ static int dyntick_save_progress_counter(struct rcu_data *rdp)
*/
static int rcu_implicit_dynticks_qs(struct rcu_data *rdp)
{
- long curr;
- long curr_nmi;
- long snap;
- long snap_nmi;
+ struct rcu_dynticks curr, snap;
- curr = rdp->dynticks->dynticks;
+ curr = *rdp->dynticks;
snap = rdp->dynticks_snap;
- curr_nmi = rdp->dynticks->dynticks_nmi;
- snap_nmi = rdp->dynticks_nmi_snap;
smp_mb(); /* force ordering with cpu entering/leaving dynticks. */
/*
@@ -431,14 +407,21 @@ static int rcu_implicit_dynticks_qs(struct rcu_data *rdp)
* read-side critical section that started before the beginning
* of the current RCU grace period.
*/
- if ((curr != snap || (curr & 0x1) == 0) &&
- (curr_nmi != snap_nmi || (curr_nmi & 0x1) == 0)) {
- rdp->dynticks_fqs++;
- return 1;
- }
+ if (curr.nohz && !curr.irq_nest)
+ goto dynticks_qs;
+
+ if (snap.nohz && !snap.irq_nest)
+ goto dynticks_qs;
+
+ if (local_read(&curr.qs_seq) != local_read(&snap.qs_seq))
+ goto dynticks_qs;
/* Go check for the CPU being offline. */
return rcu_implicit_offline_qs(rdp);
+
+dynticks_qs:
+ rdp->dynticks_fqs++;
+ return 1;
}
#endif /* #ifdef CONFIG_SMP */
diff --git a/kernel/rcutree.h b/kernel/rcutree.h
index 91d4170..215e431 100644
--- a/kernel/rcutree.h
+++ b/kernel/rcutree.h
@@ -27,6 +27,7 @@
#include <linux/threads.h>
#include <linux/cpumask.h>
#include <linux/seqlock.h>
+#include <asm/local.h>
/*
* Define shape of hierarchy based on NR_CPUS and CONFIG_RCU_FANOUT.
@@ -79,11 +80,9 @@
* Dynticks per-CPU state.
*/
struct rcu_dynticks {
- int dynticks_nesting; /* Track nesting level, sort of. */
- int dynticks; /* Even value for dynticks-idle, else odd. */
- int dynticks_nmi; /* Even value for either dynticks-idle or */
- /* not in nmi handler, else odd. So this */
- /* remains even for nmi from irq handler. */
+ int nohz;
+ local_t qs_seq;
+ int irq_nest;
};
/*
@@ -212,8 +211,7 @@ struct rcu_data {
#ifdef CONFIG_NO_HZ
/* 3) dynticks interface. */
struct rcu_dynticks *dynticks; /* Shared per-CPU dynticks state. */
- int dynticks_snap; /* Per-GP tracking for dynticks. */
- int dynticks_nmi_snap; /* Per-GP tracking for dynticks_nmi. */
+ struct rcu_dynticks dynticks_snap;
#endif /* #ifdef CONFIG_NO_HZ */
/* 4) reasons this CPU needed to be kicked by force_quiescent_state */
--
1.7.3.2
This implements the /proc/pid/nohz file that enables the
nohz attribute of a task.
Synchronization is enforced so that:
- A CPU can have only one nohz task
- A nohz task can only be affine to a single CPU
For now, this can only be written through /proc/self, but allowing
it from another task would probably be a good idea and wouldn't
add much complexity to the code.
Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Lai Jiangshan <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Anton Blanchard <[email protected]>
Cc: Tim Pepper <[email protected]>
---
fs/proc/base.c | 80 ++++++++++++++++++++++++++++++++++++++++++++++
include/linux/sched.h | 1 +
include/linux/tick.h | 1 +
kernel/sched.c | 43 ++++++++++++++++++++++++
kernel/time/Kconfig | 6 ++--
kernel/time/tick-sched.c | 12 +++++++
6 files changed, 140 insertions(+), 3 deletions(-)
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 1828451..9a01978 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -83,6 +83,7 @@
#include <linux/pid_namespace.h>
#include <linux/fs_struct.h>
#include <linux/slab.h>
+#include <linux/tick.h>
#include "internal.h"
/* NOTE:
@@ -1295,6 +1296,82 @@ static const struct file_operations proc_sessionid_operations = {
};
#endif
+#ifdef CONFIG_NO_HZ_TASK
+static ssize_t proc_nohz_read(struct file *file, char __user *buf,
+ size_t count, loff_t *ppos)
+{
+ struct task_struct *task = get_proc_task(file->f_path.dentry->d_inode);
+ char buffer[PROC_NUMBUF];
+ int val = 0;
+ size_t len;
+
+ if (!task)
+ return -ESRCH;
+
+ if (test_tsk_thread_flag(task, TIF_NOHZ))
+ val = 1;
+
+ put_task_struct(task);
+
+ len = snprintf(buffer, sizeof(buffer), "%d\n", val);
+
+ return simple_read_from_buffer(buf, count, ppos, buffer, len);
+}
+
+
+static ssize_t proc_nohz_write(struct file *file, const char __user *buf,
+ size_t count, loff_t *ppos)
+{
+ struct inode *inode = file->f_path.dentry->d_inode;
+ struct task_struct *task;
+ char buffer[PROC_NUMBUF];
+ long val;
+ int err = 0;
+
+ memset(buffer, 0, sizeof(buffer));
+
+ if (count > sizeof(buffer) - 1)
+ count = sizeof(buffer) - 1;
+
+ if (copy_from_user(buffer, buf, count)) {
+ err = -EFAULT;
+ goto out;
+ }
+
+ err = strict_strtol(strstrip(buffer), 0, &val);
+
+ if (err || (val != 0 && val != 1)) {
+ err = -EINVAL;
+ goto out;
+ }
+
+ rcu_read_lock();
+ task = pid_task(proc_pid(inode), PIDTYPE_PID);
+ if (task != current) {
+ rcu_read_unlock();
+ err = -EPERM;
+ goto out;
+ }
+ rcu_read_unlock();
+
+ if (val == 1)
+ err = tick_nohz_task_set();
+ else
+ tick_nohz_task_clear();
+
+out:
+ return err < 0 ? err : count;
+}
+
+
+static const struct file_operations proc_nohz_operations = {
+ .read = proc_nohz_read,
+ .write = proc_nohz_write,
+ .llseek = generic_file_llseek,
+};
+#endif /* CONFIG_NO_HZ_TASK */
+
+
#ifdef CONFIG_FAULT_INJECTION
static ssize_t proc_fault_inject_read(struct file * file, char __user * buf,
size_t count, loff_t *ppos)
@@ -2784,6 +2861,9 @@ static const struct pid_entry tgid_base_stuff[] = {
REG("loginuid", S_IWUSR|S_IRUGO, proc_loginuid_operations),
REG("sessionid", S_IRUGO, proc_sessionid_operations),
#endif
+#ifdef CONFIG_NO_HZ_TASK
+ REG("nohz", S_IWUSR|S_IRUGO, proc_nohz_operations),
+#endif
#ifdef CONFIG_FAULT_INJECTION
REG("make-it-fail", S_IRUGO|S_IWUSR, proc_fault_inject_operations),
#endif
diff --git a/include/linux/sched.h b/include/linux/sched.h
index f80088a..0e2e5c9 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2552,6 +2552,7 @@ extern void task_oncpu_function_call(struct task_struct *p,
#ifdef CONFIG_NO_HZ_TASK
extern void smp_send_update_nohz_task_cpu(int cpu);
extern int nohz_task_can_stop_tick(void);
+extern int sched_task_set_nohz(void);
#else
static inline void smp_send_update_nohz_task_cpu(int cpu) { }
static inline int nohz_task_can_stop_tick(void) { return 0; }
diff --git a/include/linux/tick.h b/include/linux/tick.h
index 37af961..5364438 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -137,6 +137,7 @@ extern void tick_nohz_task_enter_kernel(void);
extern void tick_nohz_task_exit_kernel(void);
extern void tick_nohz_task_enter_exception(struct pt_regs *regs);
extern void tick_nohz_task_exit_exception(struct pt_regs *regs);
+extern int tick_nohz_task_set(void);
extern void tick_nohz_task_clear(void);
extern int tick_nohz_task_mode(void);
diff --git a/kernel/sched.c b/kernel/sched.c
index bd0a41f..d553a47 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2491,6 +2491,49 @@ void smp_send_update_nohz_task_cpu(int cpu)
smp_call_function_single(cpu, nohz_task_cpu_update,
NULL, 0);
}
+
+int sched_task_set_nohz(void)
+{
+ int cpu;
+ struct rq *rq;
+ int err = -EBUSY;
+ unsigned long flags;
+
+ get_online_cpus();
+
+ /* We need to serialize against set_cpus_allowed() */
+ rq = task_rq_lock(current, &flags);
+
+ /* A nohz task must be affine to a single cpu */
+ if (cpumask_weight(&current->cpus_allowed) != 1)
+ goto out;
+
+ cpu = smp_processor_id();
+
+ if (!cpu_online(cpu))
+ goto out;
+
+ /* A CPU must have a single nohz task */
+ if (cpu_has_nohz_task(cpu))
+ goto out;
+
+ /*
+ * We need to keep at least one CPU without nohz task
+ * for several background jobs.
+ */
+ if (cpumask_weight(cpu_online_mask) -
+ cpumask_weight(cpu_has_nohz_task_mask) == 1)
+ goto out;
+
+ set_cpu_has_nohz_task(cpu, 1);
+ set_thread_flag(TIF_NOHZ);
+ err = 0;
+out:
+ task_rq_unlock(rq, &flags);
+ put_online_cpus();
+
+ return err;
+}
#endif
static inline void ttwu_activate(struct task_struct *p, struct rq *rq,
diff --git a/kernel/time/Kconfig b/kernel/time/Kconfig
index a460cee..dfb10db 100644
--- a/kernel/time/Kconfig
+++ b/kernel/time/Kconfig
@@ -31,6 +31,6 @@ config NO_HZ_TASK
bool "Tickless task"
depends on HAVE_NO_HZ_TASK && NO_HZ && SMP && HIGH_RES_TIMERS
help
- When a task runs alone on a CPU and switches into this mode,
- the timer interrupt will only trigger when it is strictly
- needed.
+ This implements the /proc/self/nohz interface. When a task
+ runs alone on a CPU and switches into this mode, the timer
+ interrupt will only trigger when it is strictly needed.
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 06379eb..f408803 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -720,6 +720,18 @@ void tick_check_idle(int cpu)
}
#ifdef CONFIG_NO_HZ_TASK
+int tick_nohz_task_set(void)
+{
+ /*
+ * Only current can set this from procfs, so no possible
+ * race.
+ */
+ if (test_thread_flag(TIF_NOHZ))
+ return 0;
+
+ return sched_task_set_nohz();
+}
+
void tick_nohz_task_clear(void)
{
int cpu = raw_smp_processor_id();
--
1.7.3.2
The comment in smp_call_function_single() says it wants irqs to
be enabled, otherwise it may deadlock.

I can't find the reason for that though, except if we had to wait
for a self-triggered IPI; but we execute the local IPI by just
calling the function in place.

When in doubt, only suppress the warning when we are not waiting
for the IPI to complete, as that case should really not raise
any deadlock.
Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Lai Jiangshan <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Anton Blanchard <[email protected]>
Cc: Tim Pepper <[email protected]>
---
kernel/smp.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)
diff --git a/kernel/smp.c b/kernel/smp.c
index 12ed8b0..886a406 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -289,7 +289,7 @@ int smp_call_function_single(int cpu, smp_call_func_t func, void *info,
* send smp call function interrupt to this cpu and as such deadlocks
* can't happen.
*/
- WARN_ON_ONCE(cpu_online(this_cpu) && irqs_disabled()
+ WARN_ON_ONCE(cpu_online(this_cpu) && wait && irqs_disabled()
&& !oops_in_progress);
if (cpu == this_cpu) {
--
1.7.3.2
Clear the nohz task attribute when a task exits: clear the CPU
mask bit and restart the tick if necessary.
Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Lai Jiangshan <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Anton Blanchard <[email protected]>
Cc: Tim Pepper <[email protected]>
---
include/linux/tick.h | 2 ++
kernel/exit.c | 3 +++
kernel/time/tick-sched.c | 20 ++++++++++++++++++++
3 files changed, 25 insertions(+), 0 deletions(-)
diff --git a/include/linux/tick.h b/include/linux/tick.h
index a704bb7..37af961 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -137,6 +137,7 @@ extern void tick_nohz_task_enter_kernel(void);
extern void tick_nohz_task_exit_kernel(void);
extern void tick_nohz_task_enter_exception(struct pt_regs *regs);
extern void tick_nohz_task_exit_exception(struct pt_regs *regs);
+extern void tick_nohz_task_clear(void);
extern int tick_nohz_task_mode(void);
#else /* !NO_HZ_TASK */
@@ -144,6 +145,7 @@ static inline void tick_nohz_task_enter_kernel(void) { }
static inline void tick_nohz_task_exit_kernel(void) { }
static inline void tick_nohz_task_enter_exception(struct pt_regs *regs) { }
static inline void tick_nohz_task_exit_exception(struct pt_regs *regs) { }
+static inline void tick_nohz_task_clear(void) { }
static inline int tick_nohz_task_mode(void) { return 0; }
#endif /* !NO_HZ_TASK */
diff --git a/kernel/exit.c b/kernel/exit.c
index 676149a..250d832 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -51,6 +51,7 @@
#include <trace/events/sched.h>
#include <linux/hw_breakpoint.h>
#include <linux/oom.h>
+#include <linux/tick.h>
#include <asm/uaccess.h>
#include <asm/unistd.h>
@@ -1013,6 +1014,8 @@ NORET_TYPE void do_exit(long code)
*/
perf_event_exit_task(tsk);
+ tick_nohz_task_clear();
+
exit_notify(tsk, group_dead);
#ifdef CONFIG_NUMA
task_lock(tsk);
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 9a4aa39..06379eb 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -720,6 +720,26 @@ void tick_check_idle(int cpu)
}
#ifdef CONFIG_NO_HZ_TASK
+void tick_nohz_task_clear(void)
+{
+ int cpu = raw_smp_processor_id();
+
+ if (!test_thread_flag(TIF_NOHZ))
+ return;
+
+ set_cpu_has_nohz_task(cpu, 0);
+ clear_tsk_thread_flag(current, TIF_NOHZ);
+
+ local_irq_disable();
+
+ if (__get_cpu_var(task_nohz_mode))
+ tick_nohz_restart_sched_tick();
+
+ __get_cpu_var(task_nohz_mode) = 0;
+
+ local_irq_enable();
+}
+
DEFINE_PER_CPU(int, nohz_task_ext_qs);
void tick_nohz_task_exit_kernel(void)
--
1.7.3.2
A nohz task can safely enter extended quiescent state when it
goes to userspace; this avoids a remote CPU having to interrupt
the nohz task just to note quiescent states.

We enter extended quiescent state when:

- A nohz task resumes userspace and is the only task running on
  the CPU (we check whether the local CPU is in nohz mode, which
  means no other task competes for that CPU). If the tick is
  still running, entering extended QS is deferred to the second
  case:

- The tick stops, after verifying that the current task is a
  nohz one, is the only task running on the CPU, and runs in
  userspace.

We exit extended quiescent state when:

- A nohz task enters the kernel and is the only task running on
  the CPU. Again we check whether the local CPU is in nohz mode.
  If the tick is still running, we were not in an extended QS
  and we don't do anything.

- The tick restarts because a new task has been enqueued.
Whether the nohz task is in userspace is tracked by the per-CPU
nohz_task_ext_qs variable.

Architectures need to provide a backend that notifies userspace
entry/exit in order to support this mode: they must implement the
TIF_NOHZ flag, which switches syscalls to the slow path, and
notify exception entry/exit.

We don't need to handle irqs or NMIs, as those are already handled
by RCU through the rcu_irq_enter()/rcu_nmi_enter() helpers.
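Condensed, the transitions described above look like this (a
restatement of the hooks below, not additional code):

	/* Return to userspace (syscall/exception exit): */
	__get_cpu_var(nohz_task_ext_qs) = 1;
	if (__get_cpu_var(task_nohz_mode))	/* tick already stopped */
		rcu_enter_nohz();		/* else deferred to tick stop */

	/* Entry into the kernel (syscall/exception entry): */
	__get_cpu_var(nohz_task_ext_qs) = 0;
	if (__get_cpu_var(task_nohz_mode))	/* we really were in extended QS */
		rcu_exit_nohz();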
Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Lai Jiangshan <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Anton Blanchard <[email protected]>
Cc: Tim Pepper <[email protected]>
---
arch/Kconfig | 4 +++
include/linux/tick.h | 16 ++++++++++-
kernel/sched.c | 3 ++
kernel/time/tick-sched.c | 61 +++++++++++++++++++++++++++++++++++++++++++++-
4 files changed, 81 insertions(+), 3 deletions(-)
diff --git a/arch/Kconfig b/arch/Kconfig
index e631791..d1ebea3 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -177,5 +177,9 @@ config HAVE_ARCH_JUMP_LABEL
config HAVE_NO_HZ_TASK
bool
+ help
+ Provides the necessary hooks for a task wanting to enter nohz
+ mode while running alone on a CPU: a thread flag for syscall
+ hooks and exception entry/exit hooks.
source "kernel/gcov/Kconfig"
diff --git a/include/linux/tick.h b/include/linux/tick.h
index 7465a47..a704bb7 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -8,6 +8,7 @@
#include <linux/clockchips.h>
#include <linux/percpu-defs.h>
+#include <asm/ptrace.h>
#ifdef CONFIG_GENERIC_CLOCKEVENTS
@@ -130,10 +131,21 @@ extern u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time);
#ifdef CONFIG_NO_HZ_TASK
DECLARE_PER_CPU(int, task_nohz_mode);
+DECLARE_PER_CPU(int, nohz_task_ext_qs);
+
+extern void tick_nohz_task_enter_kernel(void);
+extern void tick_nohz_task_exit_kernel(void);
+extern void tick_nohz_task_enter_exception(struct pt_regs *regs);
+extern void tick_nohz_task_exit_exception(struct pt_regs *regs);
extern int tick_nohz_task_mode(void);
-#else
+
+#else /* !NO_HZ_TASK */
+static inline void tick_nohz_task_enter_kernel(void) { }
+static inline void tick_nohz_task_exit_kernel(void) { }
+static inline void tick_nohz_task_enter_exception(struct pt_regs *regs) { }
+static inline void tick_nohz_task_exit_exception(struct pt_regs *regs) { }
static inline int tick_nohz_task_mode(void) { return 0; }
-#endif
+#endif /* !NO_HZ_TASK */
# else /* !NO_HZ */
static inline void tick_nohz_stop_sched_tick(int inidle) { }
diff --git a/kernel/sched.c b/kernel/sched.c
index b99f192..4412493 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2464,6 +2464,9 @@ static void nohz_task_cpu_update(void *unused)
if (rq->nr_running > 1 || rcu_pending(cpu) || rcu_needs_cpu(cpu)) {
__get_cpu_var(task_nohz_mode) = 0;
tick_nohz_restart_sched_tick();
+
+ if (__get_cpu_var(nohz_task_ext_qs))
+ rcu_exit_nohz();
}
}
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 88011b9..9a4aa39 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -720,6 +720,62 @@ void tick_check_idle(int cpu)
}
#ifdef CONFIG_NO_HZ_TASK
+DEFINE_PER_CPU(int, nohz_task_ext_qs);
+
+void tick_nohz_task_exit_kernel(void)
+{
+ unsigned long flags;
+
+ if (!test_thread_flag(TIF_NOHZ))
+ return;
+
+ local_irq_save(flags);
+
+ __get_cpu_var(nohz_task_ext_qs) = 1;
+ /*
+ * Only enter extended QS if the tick is not running.
+ * Otherwise the tick will handle that later when it
+ * will decide to stop.
+ */
+ if (__get_cpu_var(task_nohz_mode))
+ rcu_enter_nohz();
+
+ local_irq_restore(flags);
+}
+
+void tick_nohz_task_enter_kernel(void)
+{
+ unsigned long flags;
+
+ if (!test_thread_flag(TIF_NOHZ))
+ return;
+
+ local_irq_save(flags);
+
+ __get_cpu_var(nohz_task_ext_qs) = 0;
+ /*
+ * If the tick was running, then we weren't in
+ * rcu extended period. Only exit extended QS
+ * if we were in such state.
+ */
+ if (__get_cpu_var(task_nohz_mode))
+ rcu_exit_nohz();
+
+ local_irq_restore(flags);
+}
+
+void tick_nohz_task_enter_exception(struct pt_regs *regs)
+{
+ if (user_mode(regs))
+ tick_nohz_task_enter_kernel();
+}
+
+void tick_nohz_task_exit_exception(struct pt_regs *regs)
+{
+ if (user_mode(regs))
+ tick_nohz_task_exit_kernel();
+}
+
int tick_nohz_task_mode(void)
{
return __get_cpu_var(task_nohz_mode);
@@ -730,8 +786,11 @@ static void tick_nohz_task_stop_tick(void)
if (!test_thread_flag(TIF_NOHZ) || __get_cpu_var(task_nohz_mode))
return;
- if (nohz_task_can_stop_tick())
+ if (nohz_task_can_stop_tick()) {
__get_cpu_var(task_nohz_mode) = 1;
+ if (__get_cpu_var(nohz_task_ext_qs))
+ rcu_enter_nohz();
+ }
}
#else
static void tick_nohz_task_stop_tick(void) { }
--
1.7.3.2
Don't allow a nohz task's CPU affinity to be changed: we want
nohz tasks to be bound to a single CPU, and we want that
affinity to stay fixed.
Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Lai Jiangshan <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Anton Blanchard <[email protected]>
Cc: Tim Pepper <[email protected]>
---
kernel/sched.c | 7 +++++++
1 files changed, 7 insertions(+), 0 deletions(-)
diff --git a/kernel/sched.c b/kernel/sched.c
index 4412493..bd0a41f 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -5712,6 +5712,13 @@ again:
goto out;
}
+ /* Nohz tasks must keep their affinity */
+ if (test_tsk_thread_flag(p, TIF_NOHZ) &&
+ !cpumask_equal(&p->cpus_allowed, new_mask)) {
+ ret = -EBUSY;
+ goto out;
+ }
+
if (p->sched_class->set_cpus_allowed)
p->sched_class->set_cpus_allowed(p, new_mask);
else {
--
1.7.3.2
Implement the thread flag, syscall hooks and exception hooks for
nohz task support.
Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Lai Jiangshan <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Anton Blanchard <[email protected]>
Cc: Tim Pepper <[email protected]>
---
arch/x86/Kconfig | 1 +
arch/x86/include/asm/thread_info.h | 10 +++++++---
arch/x86/kernel/ptrace.c | 10 ++++++++++
arch/x86/kernel/traps.c | 22 +++++++++++++++-------
arch/x86/mm/fault.c | 13 +++++++++++--
5 files changed, 44 insertions(+), 12 deletions(-)
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index e330da2..37f59f2 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -65,6 +65,7 @@ config X86
select HAVE_SPARSE_IRQ
select GENERIC_IRQ_PROBE
select GENERIC_PENDING_IRQ if SMP
+ select HAVE_NO_HZ_TASK
config INSTRUCTION_DECODER
def_bool (KPROBES || PERF_EVENTS)
diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index f0b6e5d..eb6784c 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -87,6 +87,7 @@ struct thread_info {
#define TIF_NOTSC 16 /* TSC is not accessible in userland */
#define TIF_IA32 17 /* 32bit process */
#define TIF_FORK 18 /* ret_from_fork */
+#define TIF_NOHZ 19 /* in nohz userspace mode */
#define TIF_MEMDIE 20 /* is terminating due to OOM killer */
#define TIF_DEBUG 21 /* uses debug registers */
#define TIF_IO_BITMAP 22 /* uses I/O bitmap */
@@ -110,6 +111,7 @@ struct thread_info {
#define _TIF_NOTSC (1 << TIF_NOTSC)
#define _TIF_IA32 (1 << TIF_IA32)
#define _TIF_FORK (1 << TIF_FORK)
+#define _TIF_NOHZ (1 << TIF_NOHZ)
#define _TIF_DEBUG (1 << TIF_DEBUG)
#define _TIF_IO_BITMAP (1 << TIF_IO_BITMAP)
#define _TIF_FREEZE (1 << TIF_FREEZE)
@@ -121,12 +123,13 @@ struct thread_info {
/* work to do in syscall_trace_enter() */
#define _TIF_WORK_SYSCALL_ENTRY \
(_TIF_SYSCALL_TRACE | _TIF_SYSCALL_EMU | _TIF_SYSCALL_AUDIT | \
- _TIF_SECCOMP | _TIF_SINGLESTEP | _TIF_SYSCALL_TRACEPOINT)
+ _TIF_SECCOMP | _TIF_SINGLESTEP | _TIF_SYSCALL_TRACEPOINT | \
+ _TIF_NOHZ)
/* work to do in syscall_trace_leave() */
#define _TIF_WORK_SYSCALL_EXIT \
(_TIF_SYSCALL_TRACE | _TIF_SYSCALL_AUDIT | _TIF_SINGLESTEP | \
- _TIF_SYSCALL_TRACEPOINT)
+ _TIF_SYSCALL_TRACEPOINT | _TIF_NOHZ)
/* work to do on interrupt/exception return */
#define _TIF_WORK_MASK \
@@ -136,7 +139,8 @@ struct thread_info {
/* work to do on any return to user space */
#define _TIF_ALLWORK_MASK \
- ((0x0000FFFF & ~_TIF_SECCOMP) | _TIF_SYSCALL_TRACEPOINT)
+ ((0x0000FFFF & ~_TIF_SECCOMP) | _TIF_SYSCALL_TRACEPOINT | \
+ _TIF_NOHZ)
/* Only used for 64 bit */
#define _TIF_DO_NOTIFY_MASK \
diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
index 45892dc..b8b1c38 100644
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -21,6 +21,7 @@
#include <linux/signal.h>
#include <linux/perf_event.h>
#include <linux/hw_breakpoint.h>
+#include <linux/tick.h>
#include <asm/uaccess.h>
#include <asm/pgtable.h>
@@ -1351,6 +1352,9 @@ asmregparm long syscall_trace_enter(struct pt_regs *regs)
{
long ret = 0;
+ /* Notify nohz task syscall early so the rest can use rcu */
+ tick_nohz_task_enter_kernel();
+
/*
* If we stepped into a sysenter/syscall insn, it trapped in
* kernel mode; do_debug() cleared TF and set TIF_SINGLESTEP.
@@ -1412,4 +1416,10 @@ asmregparm void syscall_trace_leave(struct pt_regs *regs)
!test_thread_flag(TIF_SYSCALL_EMU);
if (step || test_thread_flag(TIF_SYSCALL_TRACE))
tracehook_report_syscall_exit(regs, step);
+
+ /*
+ * Notify nohz task exit syscall at last so the rest can
+ * use rcu.
+ */
+ tick_nohz_task_exit_kernel();
}
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index cb838ca..b531c52 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -26,6 +26,7 @@
#include <linux/sched.h>
#include <linux/timer.h>
#include <linux/init.h>
+#include <linux/tick.h>
#include <linux/bug.h>
#include <linux/nmi.h>
#include <linux/mm.h>
@@ -459,24 +460,28 @@ void restart_nmi(void)
/* May run on IST stack. */
dotraplinkage void __kprobes do_int3(struct pt_regs *regs, long error_code)
{
+ tick_nohz_task_enter_exception(regs);
+
#ifdef CONFIG_KGDB_LOW_LEVEL_TRAP
if (kgdb_ll_trap(DIE_INT3, "int3", regs, error_code, 3, SIGTRAP)
== NOTIFY_STOP)
- return;
+ goto exit;
#endif /* CONFIG_KGDB_LOW_LEVEL_TRAP */
#ifdef CONFIG_KPROBES
if (notify_die(DIE_INT3, "int3", regs, error_code, 3, SIGTRAP)
== NOTIFY_STOP)
- return;
+ goto exit;
#else
if (notify_die(DIE_TRAP, "int3", regs, error_code, 3, SIGTRAP)
== NOTIFY_STOP)
- return;
+ goto exit;
#endif
preempt_conditional_sti(regs);
do_trap(3, SIGTRAP, "int3", regs, error_code, NULL);
preempt_conditional_cli(regs);
+exit:
+ tick_nohz_task_exit_exception(regs);
}
#ifdef CONFIG_X86_64
@@ -537,6 +542,8 @@ dotraplinkage void __kprobes do_debug(struct pt_regs *regs, long error_code)
unsigned long dr6;
int si_code;
+ tick_nohz_task_enter_exception(regs);
+
get_debugreg(dr6, 6);
/* Filter out all the reserved bits which are preset to 1 */
@@ -552,7 +559,7 @@ dotraplinkage void __kprobes do_debug(struct pt_regs *regs, long error_code)
/* Catch kmemcheck conditions first of all! */
if ((dr6 & DR_STEP) && kmemcheck_trap(regs))
- return;
+ goto exit;
/* DR6 may or may not be cleared by the CPU */
set_debugreg(0, 6);
@@ -567,7 +574,7 @@ dotraplinkage void __kprobes do_debug(struct pt_regs *regs, long error_code)
if (notify_die(DIE_DEBUG, "debug", regs, PTR_ERR(&dr6), error_code,
SIGTRAP) == NOTIFY_STOP)
- return;
+ goto exit;
/* It's safe to allow irq's after DR6 has been saved */
preempt_conditional_sti(regs);
@@ -576,7 +583,7 @@ dotraplinkage void __kprobes do_debug(struct pt_regs *regs, long error_code)
handle_vm86_trap((struct kernel_vm86_regs *) regs,
error_code, 1);
preempt_conditional_cli(regs);
- return;
+ goto exit;
}
/*
@@ -596,7 +603,8 @@ dotraplinkage void __kprobes do_debug(struct pt_regs *regs, long error_code)
send_sigtrap(tsk, regs, error_code, si_code);
preempt_conditional_cli(regs);
- return;
+exit:
+ tick_nohz_task_exit_exception(regs);
}
/*
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 7d90ceb..2b23b22 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -12,6 +12,7 @@
#include <linux/mmiotrace.h> /* kmmio_handler, ... */
#include <linux/perf_event.h> /* perf_sw_event */
#include <linux/hugetlb.h> /* hstate_index_to_shift */
+#include <linux/tick.h>
#include <asm/traps.h> /* dotraplinkage, ... */
#include <asm/pgalloc.h> /* pgd_*(), ... */
@@ -949,8 +950,8 @@ static int fault_in_kernel_space(unsigned long address)
* and the problem, and then passes it off to one of the appropriate
* routines.
*/
-dotraplinkage void __kprobes
-do_page_fault(struct pt_regs *regs, unsigned long error_code)
+static void __kprobes
+__do_page_fault(struct pt_regs *regs, unsigned long error_code)
{
struct vm_area_struct *vma;
struct task_struct *tsk;
@@ -1158,3 +1159,11 @@ good_area:
up_read(&mm->mmap_sem);
}
+
+dotraplinkage void __kprobes
+do_page_fault(struct pt_regs *regs, unsigned long error_code)
+{
+ tick_nohz_task_enter_exception(regs);
+ __do_page_fault(regs, error_code);
+ tick_nohz_task_exit_exception(regs);
+}
--
1.7.3.2
The clocksource watchdog should probably make an exception for
nohz task CPUs, which want to be interrupted as little as
possible. However, we probably need to warn the user about that.
Another solution would be to make this timer deferrable.
Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Lai Jiangshan <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Anton Blanchard <[email protected]>
Cc: Tim Pepper <[email protected]>
---
kernel/time/clocksource.c | 10 +++++++---
1 files changed, 7 insertions(+), 3 deletions(-)
diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index c18d7ef..9e62a97 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -302,9 +302,13 @@ static void clocksource_watchdog(unsigned long data)
* Cycle through CPUs to check if the CPUs stay synchronized
* to each other.
*/
- next_cpu = cpumask_next(raw_smp_processor_id(), cpu_online_mask);
- if (next_cpu >= nr_cpu_ids)
- next_cpu = cpumask_first(cpu_online_mask);
+ next_cpu = raw_smp_processor_id();
+ do {
+ next_cpu = cpumask_next(next_cpu, cpu_online_mask);
+ if (next_cpu >= nr_cpu_ids)
+ next_cpu = cpumask_first(cpu_online_mask);
+ } while (cpu_has_nohz_task(next_cpu));
+
watchdog_timer.expires += WATCHDOG_INTERVAL;
add_timer_on(&watchdog_timer, next_cpu);
out:
--
1.7.3.2
If a CPU is in nohz mode because a nohz task is running, it is
not able to note quiescent states requested by other CPUs.

In that case, restart the tick remotely to force quiescent states
on the nohz task CPUs.
Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Lai Jiangshan <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Anton Blanchard <[email protected]>
Cc: Tim Pepper <[email protected]>
---
kernel/rcutree.c | 21 +++++++++++++++------
kernel/sched.c | 2 +-
2 files changed, 16 insertions(+), 7 deletions(-)
diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index 44dce3f..ed6aba3 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -227,6 +227,8 @@ static struct rcu_node *rcu_get_root(struct rcu_state *rsp)
*/
static int rcu_implicit_offline_qs(struct rcu_data *rdp)
{
+ int has_nohz_task;
+
/*
* If the CPU is offline, it is in a quiescent state. We can
* trust its state not to change because interrupts are disabled.
@@ -236,15 +238,22 @@ static int rcu_implicit_offline_qs(struct rcu_data *rdp)
return 1;
}
+ has_nohz_task = cpu_has_nohz_task(rdp->cpu);
+
/* If preemptable RCU, no point in sending reschedule IPI. */
- if (rdp->preemptable)
+ if (rdp->preemptable && !has_nohz_task)
return 0;
- /* The CPU is online, so send it a reschedule IPI. */
- if (rdp->cpu != smp_processor_id())
- smp_send_reschedule(rdp->cpu);
- else
- set_need_resched();
+ if (!has_nohz_task) {
+ /* The CPU is online, so send it a reschedule IPI. */
+ if (rdp->cpu != smp_processor_id())
+ smp_send_reschedule(rdp->cpu);
+ else
+ set_need_resched();
+ } else {
+ smp_send_update_nohz_task_cpu(rdp->cpu);
+ }
+
rdp->resched_ipi++;
return 0;
}
diff --git a/kernel/sched.c b/kernel/sched.c
index 45bd6e2..b99f192 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2461,7 +2461,7 @@ static void nohz_task_cpu_update(void *unused)
*/
rq = this_rq();
cpu = smp_processor_id();
- if (rq->nr_running > 1) {
+ if (rq->nr_running > 1 || rcu_pending(cpu) || rcu_needs_cpu(cpu)) {
__get_cpu_var(task_nohz_mode) = 0;
tick_nohz_restart_sched_tick();
}
--
1.7.3.2
When a nohz task is running, don't stop the tick if RCU
needs the CPU to notify a quiescent state or if it has
callbacks to handle.
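Condensed, the predicate implemented below becomes:

	/* The tick may stop only if nothing competes and RCU is quiet. */
	can_stop = rq->nr_running <= 1 &&
		   !rcu_pending(cpu) && !rcu_needs_cpu(cpu);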
Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Lai Jiangshan <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Anton Blanchard <[email protected]>
Cc: Tim Pepper <[email protected]>
---
include/linux/rcupdate.h | 1 +
kernel/rcutree.c | 3 +--
kernel/sched.c | 6 ++++++
3 files changed, 8 insertions(+), 2 deletions(-)
diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index 03cda7b..262d48b 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -122,6 +122,7 @@ extern void rcu_init(void);
extern void rcu_sched_qs(int cpu);
extern void rcu_bh_qs(int cpu);
extern void rcu_check_callbacks(int cpu, int user);
+extern int rcu_pending(int cpu);
struct notifier_block;
#ifdef CONFIG_NO_HZ
diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index ccdc04c..44dce3f 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -149,7 +149,6 @@ module_param(rcu_cpu_stall_suppress, int, 0644);
#endif /* #ifdef CONFIG_RCU_CPU_STALL_DETECTOR */
static void force_quiescent_state(struct rcu_state *rsp, int relaxed);
-static int rcu_pending(int cpu);
/*
* Return the number of RCU-sched batches processed thus far for debug & stats.
@@ -1634,7 +1633,7 @@ static int __rcu_pending(struct rcu_state *rsp, struct rcu_data *rdp)
* by the current CPU, returning 1 if so. This function is part of the
* RCU implementation; it is -not- an exported member of the RCU API.
*/
-static int rcu_pending(int cpu)
+int rcu_pending(int cpu)
{
return __rcu_pending(&rcu_sched_state, &per_cpu(rcu_sched_data, cpu)) ||
__rcu_pending(&rcu_bh_state, &per_cpu(rcu_bh_data, cpu)) ||
diff --git a/kernel/sched.c b/kernel/sched.c
index 6dbae46..45bd6e2 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2470,10 +2470,16 @@ static void nohz_task_cpu_update(void *unused)
int nohz_task_can_stop_tick(void)
{
struct rq *rq = this_rq();
+ int cpu;
if (rq->nr_running > 1)
return 0;
+ cpu = smp_processor_id();
+
+ if (rcu_pending(cpu) || rcu_needs_cpu(cpu))
+ return 0;
+
return 1;
}
--
1.7.3.2
A nohz task is a non-idle task that tries to shut down the tick
while it is running, under certain conditions.

This brings a new cpu_has_nohz_task_mask cpumask that keeps track
of the CPUs that have a nohz task. This is a 1:1 mapping: a nohz
task is affine to a single CPU and can't be moved elsewhere, and
a CPU can have only one nohz task.

This tracking will be useful later for RCU, or when we need to
find an idle CPU target for a timer.
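As an illustration, a hypothetical consumer of the mask (the
helper name is made up, but it mirrors how later patches in this
series use cpu_has_nohz_task()):

	/* Prefer CPUs without a nohz task when placing unbound work. */
	static int pick_housekeeping_cpu(void)
	{
		int cpu;

		for_each_online_cpu(cpu) {
			if (!cpu_has_nohz_task(cpu))
				return cpu;
		}
		/* Can't happen: at least one CPU keeps its tick. */
		return raw_smp_processor_id();
	}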
Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Lai Jiangshan <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Anton Blanchard <[email protected]>
Cc: Tim Pepper <[email protected]>
---
arch/Kconfig | 3 +++
include/linux/cpumask.h | 8 ++++++++
kernel/cpu.c | 15 +++++++++++++++
kernel/time/Kconfig | 7 +++++++
4 files changed, 33 insertions(+), 0 deletions(-)
diff --git a/arch/Kconfig b/arch/Kconfig
index 8bf0fa65..e631791 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -175,4 +175,7 @@ config HAVE_PERF_EVENTS_NMI
config HAVE_ARCH_JUMP_LABEL
bool
+config HAVE_NO_HZ_TASK
+ bool
+
source "kernel/gcov/Kconfig"
diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
index bae6fe2..6c4801c 100644
--- a/include/linux/cpumask.h
+++ b/include/linux/cpumask.h
@@ -100,6 +100,13 @@ extern const struct cpumask *const cpu_active_mask;
#define cpu_active(cpu) ((cpu) == 0)
#endif
+#ifdef CONFIG_NO_HZ_TASK
+extern const struct cpumask *const cpu_has_nohz_task_mask;
+#define cpu_has_nohz_task(cpu) cpumask_test_cpu((cpu), cpu_has_nohz_task_mask)
+#else
+#define cpu_has_nohz_task(cpu) 0
+#endif
+
/* verify cpu argument to cpumask_* operators */
static inline unsigned int cpumask_check(unsigned int cpu)
{
@@ -671,6 +678,7 @@ void set_cpu_possible(unsigned int cpu, bool possible);
void set_cpu_present(unsigned int cpu, bool present);
void set_cpu_online(unsigned int cpu, bool online);
void set_cpu_active(unsigned int cpu, bool active);
+void set_cpu_has_nohz_task(unsigned int cpu, bool has_nohz_task);
void init_cpu_present(const struct cpumask *src);
void init_cpu_possible(const struct cpumask *src);
void init_cpu_online(const struct cpumask *src);
diff --git a/kernel/cpu.c b/kernel/cpu.c
index f6e726f..bc9a93e 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -540,6 +540,11 @@ static DECLARE_BITMAP(cpu_active_bits, CONFIG_NR_CPUS) __read_mostly;
const struct cpumask *const cpu_active_mask = to_cpumask(cpu_active_bits);
EXPORT_SYMBOL(cpu_active_mask);
+#ifdef CONFIG_NO_HZ_TASK
+static DECLARE_BITMAP(cpu_has_nohz_task_bits, CONFIG_NR_CPUS) __read_mostly;
+const struct cpumask *const cpu_has_nohz_task_mask = to_cpumask(cpu_has_nohz_task_bits);
+#endif
+
void set_cpu_possible(unsigned int cpu, bool possible)
{
if (possible)
@@ -572,6 +577,16 @@ void set_cpu_active(unsigned int cpu, bool active)
cpumask_clear_cpu(cpu, to_cpumask(cpu_active_bits));
}
+#ifdef CONFIG_NO_HZ_TASK
+void set_cpu_has_nohz_task(unsigned int cpu, bool has_nohz_task)
+{
+ if (has_nohz_task)
+ cpumask_set_cpu(cpu, to_cpumask(cpu_has_nohz_task_bits));
+ else
+ cpumask_clear_cpu(cpu, to_cpumask(cpu_has_nohz_task_bits));
+}
+#endif
+
void init_cpu_present(const struct cpumask *src)
{
cpumask_copy(to_cpumask(cpu_present_bits), src);
diff --git a/kernel/time/Kconfig b/kernel/time/Kconfig
index f06a8a3..a460cee 100644
--- a/kernel/time/Kconfig
+++ b/kernel/time/Kconfig
@@ -27,3 +27,10 @@ config GENERIC_CLOCKEVENTS_BUILD
default y
depends on GENERIC_CLOCKEVENTS || GENERIC_CLOCKEVENTS_MIGR
+config NO_HZ_TASK
+ bool "Tickless task"
+ depends on HAVE_NO_HZ_TASK && NO_HZ && SMP && HIGH_RES_TIMERS
+ help
+ When a task runs alone on a CPU and switches into this mode,
+ the timer interrupt will only trigger when it is strictly
+ needed.
--
1.7.3.2
From the timer interrupt, check whether the current task is a
nohz task running alone on the CPU, and stop the tick if this
is the case.
Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Lai Jiangshan <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Anton Blanchard <[email protected]>
Cc: Tim Pepper <[email protected]>
---
include/linux/sched.h | 6 ++++++
include/linux/tick.h | 11 ++++++++++-
kernel/sched.c | 14 ++++++++++++++
kernel/softirq.c | 4 ++--
kernel/time/tick-sched.c | 30 ++++++++++++++++++++++++++++--
5 files changed, 60 insertions(+), 5 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 2c79e92..858a876 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2549,6 +2549,12 @@ static inline void inc_syscw(struct task_struct *tsk)
extern void task_oncpu_function_call(struct task_struct *p,
void (*func) (void *info), void *info);
+#ifdef CONFIG_NO_HZ_TASK
+extern int nohz_task_can_stop_tick(void);
+#else
+static inline int nohz_task_can_stop_tick(void) { return 0; }
+#endif
+
#ifdef CONFIG_MM_OWNER
extern void mm_update_next_owner(struct mm_struct *mm);
diff --git a/include/linux/tick.h b/include/linux/tick.h
index b232ccc..7465a47 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -7,6 +7,7 @@
#define _LINUX_TICK_H
#include <linux/clockchips.h>
+#include <linux/percpu-defs.h>
#ifdef CONFIG_GENERIC_CLOCKEVENTS
@@ -126,7 +127,15 @@ extern void tick_nohz_restart_sched_tick(void);
extern ktime_t tick_nohz_get_sleep_length(void);
extern u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time);
extern u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time);
-# else
+
+#ifdef CONFIG_NO_HZ_TASK
+DECLARE_PER_CPU(int, task_nohz_mode);
+extern int tick_nohz_task_mode(void);
+#else
+static inline int tick_nohz_task_mode(void) { return 0; }
+#endif
+
+# else /* !NO_HZ */
static inline void tick_nohz_stop_sched_tick(int inidle) { }
static inline void tick_nohz_restart_sched_tick(void) { }
static inline ktime_t tick_nohz_get_sleep_length(void)
diff --git a/kernel/sched.c b/kernel/sched.c
index 2cd6823..e9cdd7a 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2443,6 +2443,20 @@ static void update_avg(u64 *avg, u64 sample)
}
#endif
+#ifdef CONFIG_NO_HZ_TASK
+DEFINE_PER_CPU(int, task_nohz_mode);
+
+int nohz_task_can_stop_tick(void)
+{
+ struct rq *rq = this_rq();
+
+ if (rq->nr_running > 1)
+ return 0;
+
+ return 1;
+}
+#endif
+
static inline void ttwu_activate(struct task_struct *p, struct rq *rq,
bool is_sync, bool is_migrate, bool is_local,
unsigned long en_flags)
diff --git a/kernel/softirq.c b/kernel/softirq.c
index 18f4be0..e24c456a 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -297,7 +297,7 @@ void irq_enter(void)
int cpu = smp_processor_id();
rcu_irq_enter();
- if (idle_cpu(cpu) && !in_interrupt()) {
+ if ((idle_cpu(cpu) || tick_nohz_task_mode()) && !in_interrupt()) {
/*
* Prevent raise_softirq from needlessly waking up ksoftirqd
* here, as softirq will be serviced on return from interrupt.
@@ -330,7 +330,7 @@ void irq_exit(void)
rcu_irq_exit();
#ifdef CONFIG_NO_HZ
/* Make sure that timer wheel updates are propagated */
- if (idle_cpu(smp_processor_id()) && !in_interrupt() && !need_resched())
+ if ((idle_cpu(smp_processor_id()) || tick_nohz_task_mode()) && !in_interrupt() && !need_resched())
tick_nohz_stop_sched_tick(0);
#endif
preempt_enable_no_resched();
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index e706fa8..88011b9 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -274,7 +274,7 @@ void tick_nohz_stop_sched_tick(int inidle)
* updated. Thus, it must not be called in the event we are called from
* irq_exit() with the prior state different than idle.
*/
- if (!inidle && !ts->inidle)
+ if (!inidle && !ts->inidle && !tick_nohz_task_mode())
goto end;
/*
@@ -510,6 +510,11 @@ void tick_nohz_restart_sched_tick(void)
local_irq_save(flags);
+ if (tick_nohz_task_mode()) {
+ local_irq_restore(flags);
+ return;
+ }
+
if (ts->idle_active || (ts->inidle && ts->tick_stopped))
now = ktime_get();
@@ -714,10 +719,29 @@ void tick_check_idle(int cpu)
tick_check_nohz(cpu);
}
+#ifdef CONFIG_NO_HZ_TASK
+int tick_nohz_task_mode(void)
+{
+ return __get_cpu_var(task_nohz_mode);
+}
+
+static void tick_nohz_task_stop_tick(void)
+{
+ if (!test_thread_flag(TIF_NOHZ) || __get_cpu_var(task_nohz_mode))
+ return;
+
+ if (nohz_task_can_stop_tick())
+ __get_cpu_var(task_nohz_mode) = 1;
+}
+#else
+static void tick_nohz_task_stop_tick(void) { }
+#endif /* CONFIG_NO_HZ_TASK */
+
/*
* High resolution timer specific code
*/
#ifdef CONFIG_HIGH_RES_TIMERS
+
/*
* We rearm the timer until we get disabled by the idle code.
* Called with interrupts disabled and timer->base->cpu_base->lock held.
@@ -738,7 +762,7 @@ static enum hrtimer_restart tick_sched_timer(struct hrtimer *timer)
* this duty, then the jiffies update is still serialized by
* xtime_lock.
*/
- if (unlikely(tick_do_timer_cpu == TICK_DO_TIMER_NONE))
+ if (unlikely(tick_do_timer_cpu == TICK_DO_TIMER_NONE) && !test_thread_flag(TIF_NOHZ))
tick_do_timer_cpu = cpu;
#endif
@@ -767,6 +791,8 @@ static enum hrtimer_restart tick_sched_timer(struct hrtimer *timer)
profile_tick(CPU_PROFILING);
}
+ tick_nohz_task_stop_tick();
+
hrtimer_forward(timer, now, tick_period);
return HRTIMER_RESTART;
--
1.7.3.2
Unbound timers are preferably targeted at a non-idle cpu. When
possible though, prioritize idle cpus over nohz task cpus,
because the main point of a nohz task is to avoid unnecessary
timer interrupts.
Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Lai Jiangshan <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Anton Blanchard <[email protected]>
Cc: Tim Pepper <[email protected]>
---
kernel/sched.c | 17 +++++++++++++++--
1 files changed, 15 insertions(+), 2 deletions(-)
diff --git a/kernel/sched.c b/kernel/sched.c
index dc91a4d..2cd6823 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1209,16 +1209,29 @@ static void resched_cpu(int cpu)
int get_nohz_timer_target(void)
{
int cpu = smp_processor_id();
+ int fallback = -1;
int i;
struct sched_domain *sd;
for_each_domain(cpu, sd) {
- for_each_cpu(i, sched_domain_span(sd))
+ for_each_cpu(i, sched_domain_span(sd)) {
+ if (cpu_has_nohz_task(i))
+ continue;
+
if (!idle_cpu(i))
return i;
+
+ if (fallback == -1 || i == cpu)
+ fallback = i;
+ }
}
- return cpu;
+
+ if (fallback == -1)
+ fallback = cpu;
+
+ return fallback;
}
+
/*
* When add_timer_on() enqueues a timer into the timer wheel of an
* idle CPU then this timer might expire before the next timer event
--
1.7.3.2
On Mon, 2010-12-20 at 16:24 +0100, Frederic Weisbecker wrote:
> This implements the /proc/pid/nohz file that enables the
> nohz attribute of a task.
>
> Synchronization is enforced so that:
>
> - A CPU can have only one nohz task
Why?
> - A nohz task can be only affine to a single CPU
Why?
> For now this is only possible to write on /proc/self but probably
> allowing it from another task would be a good idea and wouldn't
> increase so much the complexity of the code.
ptrace rules might match that.
On Mon, 2010-12-20 at 16:24 +0100, Frederic Weisbecker wrote:
> The timer interrupt handles several things like preemption,
> timekeeping, rcu, etc...
>
> However it appears that sometimes it is simply useless like
> when a task runs alone and even more when it is in userspace
> as RCU doesn't need it at all in such case.
>
> It appears that HPC workload would get some win of such timer
> deactivation, and perhaps also the Real Time world as this
> minimizes the critical sections due to way less interrupts to
> handle.
>
> It works through the procfs interface:
>
> echo 1 > /proc/self/nohz
I wonder if we could just have this happen automatically.
>
> With the following constraints:
>
> - A cpu can have only one nohz task
> - A nohz task must be affine to a single CPU. That affinity can't
> change while the task is in this mode
If the above is the case, perhaps we could have this disable HZ on that
CPU.
> - This must be written in /proc/self only, however further
> plans to allow than to be set from another task should be
> possible.
>
> You need to migrate irqs manually from userspace, same
> for tasks. If a non nohz task is running on the same cpu
> than a nohz task, the tick can't be stopped.
So interrupts must not be routed to this CPU?
>
> I can provide you the tools I'm using to test it if you
> want.
>
> Note this depends on the rcu spurious softirq fixes in Paul's
> queue for .38
>
> I'm also using a hack to make init affine to the first CPU
> on boot so that all userspace tasks end up to the first CPU
> except kernel threads and tasks that change their affinity
> explicitly (this is not sched isolation). This avoids any
> task to set up timers to random CPUs on which we'll later
> want to run a nohz task. But probably this can be fixed
> with another way, like unbinding these timers or so. This
> probably require a detailed audit.
Have you looked at "tuna"?
>
> Any comments are welcome.
Now, as I was saying: if only a single running task is on a given CPU,
and it is affined there, and no timers are set for wakeups on that CPU,
could we possibly set this to be NOHZ automatically?
Just a thought.
-- Steve
On Mon, 2010-12-20 at 16:24 +0100, Frederic Weisbecker wrote:
> Unbound timers are preferably targeted for non idle cpu. If
> possible though, prioritize idle cpus over nohz task cpus,
> because the main point of nohz task is to avoid unnecessary
> timer interrupts.
Oh is it?
I'd very much expect the cpu that arms the timer to get the interrupt. I
mean, if the task doesn't want to get interrupted by timers,
_DON'T_USE_TIMERS_ to begin with.
So no, don't much like this at all.
On Mon, 2010-12-20 at 16:24 +0100, Frederic Weisbecker wrote:
> + if (!current->pid)
I'm not exactly sure that's the sanest way to test if you're the idle
thread...
On Mon, 2010-12-20 at 16:24 +0100, Frederic Weisbecker wrote:
> Check from the timer interrupt that we are a nohz task running
> alone in the CPU and stop the tick if this is the case.
>
Does this verify that the tick has no other work to do?
I see no list of things the tick does and a checklist that everything it
does is indeed superfluous.
On Mon, 2010-12-20 at 16:24 +0100, Frederic Weisbecker wrote:
> +#ifdef CONFIG_NO_HZ_TASK
> + /*
> + * CHECKME:
> + * Ideally, we need to check if the target has a nohz task
> + * and only send the IPI if so. But there is nothing but
> + * a racy way to do that. Or can we assume at that point
> + * of the wake up that if cpu_has_nohz_task(cpu) is 0, then
> + * it's ok, even if it has a task about to switch to nohz
> + * task mode?
> + */
> + if (rq->nr_running == 2)
> + smp_send_update_nohz_task_cpu(cpu);
> +#endif
This is the wrong place, use ttwu_activate(), since activate_task() is
the thing that pokes at nr_running.
On Mon, Dec 20, 2010 at 04:42:24PM +0100, Peter Zijlstra wrote:
> On Mon, 2010-12-20 at 16:24 +0100, Frederic Weisbecker wrote:
> > This implements the /proc/pid/nohz file that enables the
> > nohz attribute of a task.
> >
> > Synchronization is enforced so that:
> >
> > - A CPU can have only one nohz task
>
> Why?
This is because of the hooks we have on entering/exiting userspace.
The "wants to enter extended quiescent state" variable (nohz_task_ext_qs)
is per CPU and applies to any nohz task.
If A and B are nohz tasks bound to the same CPU:
A enters userspace and says it can enter extended quiescent state
(nohz_task_ext_qs = 1).
B preempts it and enters the kernel, hence saying that it doesn't want
extended quiescent state (nohz_task_ext_qs = 0). B sleeps, and we return
to A, which said it wants extended quiescent state, but the per-cpu
var has been clobbered (nohz_task_ext_qs == 0).
But this can be solved using a per-task variable. I just thought it
wouldn't be very useful to have two nohz tasks on the same CPU, but actually
why not.
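Something like this per-task variant, roughly (just a sketch; the
structure and helpers below are illustrative, not code from this series):

/*
 * Sketch only: move the "wants extended QS" state from a per-cpu
 * variable to the task itself, so two nohz tasks sharing a CPU can't
 * clobber each other's state across preemption.
 */
struct nohz_task_state {
	int ext_qs;	/* this task wants extended QS while in userspace */
};

/* on return to userspace: per task, so B preempting A is harmless */
static void nohz_task_enter_user(struct nohz_task_state *s)
{
	s->ext_qs = 1;
}

/* on kernel entry */
static void nohz_task_enter_kernel(struct nohz_task_state *s)
{
	s->ext_qs = 0;
}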
> > - A nohz task can be only affine to a single CPU
>
> Why?
Same problem: we need to make some things per task. That's fixable,
but it may add a bit of complexity, and since I couldn't find
a use case for migratable nohz tasks, I didn't handle that case.
Should I?
> > For now this is only possible to write on /proc/self but probably
> > allowing it from another task would be a good idea and wouldn't
> > increase so much the complexity of the code.
>
> ptrace rules might match that.
You think I should use the ptrace interface? Hmm, dunno if it's
appropriate.
On Mon, 2010-12-20 at 16:24 +0100, Frederic Weisbecker wrote:
> @@ -1634,7 +1633,7 @@ static int __rcu_pending(struct rcu_state *rsp, struct rcu_data *rdp)
> * by the current CPU, returning 1 if so. This function is part of the
> * RCU implementation; it is -not- an exported member of the RCU API.
> */
> -static int rcu_pending(int cpu)
> +int rcu_pending(int cpu)
/me wonders about that comment.
> {
> return __rcu_pending(&rcu_sched_state, &per_cpu(rcu_sched_data, cpu)) ||
> __rcu_pending(&rcu_bh_state, &per_cpu(rcu_bh_data, cpu)) ||
> diff --git a/kernel/sched.c b/kernel/sched.c
> index 6dbae46..45bd6e2 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -2470,10 +2470,16 @@ static void nohz_task_cpu_update(void *unused)
> int nohz_task_can_stop_tick(void)
> {
> struct rq *rq = this_rq();
> + int cpu;
>
> if (rq->nr_running > 1)
> return 0;
>
> + cpu = smp_processor_id();
> +
> + if (rcu_pending(cpu) || rcu_needs_cpu(cpu))
> + return 0;
Arguably, rcu_needs_cpu() should imply rcu_pending(), because if there's
work still to be done, it needs the cpu, hmm?
> return 1;
> }
>
This patch also implies you broke stuff with #4 because it would put the
machine to sleep while RCU still had bits to do, not very nice.
On Mon, 2010-12-20 at 16:24 +0100, Frederic Weisbecker wrote:
> If a cpu is in nohz mode due to a nohz task running, then
> it is not able to notify quiescent states requested by other
> CPUs.
>
> Then restart the tick to remotely force the quiescent states on the
> nohz task cpus.
-ENOPARSE.. if it's in NOHZ state, it couldn't possibly need to
participate in the quiescent state machine because the cpu is in a
quiescent state and has 0 RCU activity.
On Mon, 2010-12-20 at 16:24 +0100, Frederic Weisbecker wrote:
> The comment in smp_call_function_single() says it wants the irqs
> to be enabled otherwise it may deadlock.
>
> I can't find the reason for that though, except if we had to wait
> for a self triggered IPI but we execute the local IPI by just
> calling the function in place.
>
> In doubt, only suppress the warning if we are not waiting for the
> IPI to complete as it should really not raise any deadlock.
>
> Signed-off-by: Frederic Weisbecker <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Paul E. McKenney <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Steven Rostedt <[email protected]>
> Cc: Lai Jiangshan <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Anton Blanchard <[email protected]>
> Cc: Tim Pepper <[email protected]>
> ---
> kernel/smp.c | 2 +-
> 1 files changed, 1 insertions(+), 1 deletions(-)
>
> diff --git a/kernel/smp.c b/kernel/smp.c
> index 12ed8b0..886a406 100644
> --- a/kernel/smp.c
> +++ b/kernel/smp.c
> @@ -289,7 +289,7 @@ int smp_call_function_single(int cpu, smp_call_func_t func, void *info,
> * send smp call function interrupt to this cpu and as such deadlocks
> * can't happen.
> */
> - WARN_ON_ONCE(cpu_online(this_cpu) && irqs_disabled()
> + WARN_ON_ONCE(cpu_online(this_cpu) && wait && irqs_disabled()
> && !oops_in_progress);
>
> if (cpu == this_cpu) {
You just deadlocked the machine.. note how you can still wait on the
previous csd in csd_lock().
On Mon, 2010-12-20 at 16:47 +0100, Peter Zijlstra wrote:
> On Mon, 2010-12-20 at 16:24 +0100, Frederic Weisbecker wrote:
> > Unbound timers are preferably targeted for non idle cpu. If
> > possible though, prioritize idle cpus over nohz task cpus,
> > because the main point of nohz task is to avoid unnecessary
> > timer interrupts.
>
> Oh is it?
>
> I'd very much expect the cpu that arms the timer to get the interrupt. I
> mean, if the task doesn't want to get interrupted by timers,
> _DON'T_USE_TIMERS_ to begin with.
>
> So no, don't much like this at all.
I think this comes from other tasks on other CPUs that are using timers.
Although, I'm not sure what causes an "unbound" timer to happen. I
thought timers usually go off on the CPU that asked for it to go off.
-- Steve
On Mon, 2010-12-20 at 11:06 -0500, Steven Rostedt wrote:
> On Mon, 2010-12-20 at 16:47 +0100, Peter Zijlstra wrote:
> > On Mon, 2010-12-20 at 16:24 +0100, Frederic Weisbecker wrote:
> > > Unbound timers are preferably targeted for non idle cpu. If
> > > possible though, prioritize idle cpus over nohz task cpus,
> > > because the main point of nohz task is to avoid unnecessary
> > > timer interrupts.
> >
> > Oh is it?
> >
> > I'd very much expect the cpu that arms the timer to get the interrupt. I
> > mean, if the task doesn't want to get interrupted by timers,
> > _DON'T_USE_TIMERS_ to begin with.
> >
> > So no, don't much like this at all.
>
> I think this comes from other tasks on other CPUs that are using timers.
Tasks on other CPUs should not cause timers on this CPU, _if_ that does
happen, fix that.
> Although, I'm not sure what causes an "unbound" timer to happen. I
> thought timers usually go off on the CPU that asked for it to go off.
They do, except if you enable some weird power management feature that
migrates timers around so as to let CPUs sleep longer. But I doubt
that's the reason for this here, and if it is, just disable that.
On Mon, 2010-12-20 at 16:57 +0100, Frederic Weisbecker wrote:
> Should I?
Well yes, this interface of explicitly marking a task and cpu as
task_no_hz is kinda restrictive and useless.
When I run 4 cpu-bound tasks on a quad-core I shouldn't have to do
anything to benefit from this.
I don't see why having this cpumask is restricting you in any way,
user-space tasks don't migrate around, that all happens in kernel space.
Also, I'm not quite happy with the pure userspace restriction, but at
least I see why you did that, even though you didn't mention it.
On Mon, 2010-12-20 at 16:24 +0100, Frederic Weisbecker wrote:
> (we check if the local cpu is in nohz mode, which means
> no other task compete on that CPU)
You keep repeating that definition, but it's not true.. It means there is
no work for the tick to do; the tick does _tons_ more besides
preemption, so nr_running==1 is necessary but not sufficient.
On Mon, 2010-12-20 at 16:48 +0100, Peter Zijlstra wrote:
> On Mon, 2010-12-20 at 16:24 +0100, Frederic Weisbecker wrote:
> > + if (!current->pid)
>
> I'm not exactly sure that's the sanest way to test if you're the idle
> thread...
Would
if (idle_cpu(cpu))
be better?
-- Steve
On Mon, 2010-12-20 at 16:24 +0100, Frederic Weisbecker wrote:
>
> Implement the thread flag, syscalls and exception hooks for
> nohz task support.
>
I saw:
- syscall
- do_int3
- do_debug (int1)
- #PF
So where's all other interrupts?
On Mon, 2010-12-20 at 11:19 -0500, Steven Rostedt wrote:
> On Mon, 2010-12-20 at 16:48 +0100, Peter Zijlstra wrote:
> > On Mon, 2010-12-20 at 16:24 +0100, Frederic Weisbecker wrote:
> > > + if (!current->pid)
> >
> > I'm not exactly sure that's the sanest way to test if you're the idle
> > thread...
>
> Would
>
> if (idle_cpu(cpu))
>
> be better?
Lots ;-)
On Mon, 2010-12-20 at 16:24 +0100, Frederic Weisbecker wrote:
>
> The watchdog should probably make an exception for nohz task cpus
> that want to be interrupted the least possible.
>
> However we probably need to warn the user about that.
>
> Another solution would be to make this timer defferable.
Nah, both skipping a cpu and making the timer deferrable render the
watchdog useless, just disable the thing.
On Mon, 2010-12-20 at 16:24 +0100, Frederic Weisbecker wrote:
> Don't allow to change a nohz task cpu affinity as we want them
> to be bound to a single CPU and we want this affinity not to
> change.
>
> Signed-off-by: Frederic Weisbecker <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Paul E. McKenney <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Steven Rostedt <[email protected]>
> Cc: Lai Jiangshan <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Anton Blanchard <[email protected]>
> Cc: Tim Pepper <[email protected]>
> ---
> kernel/sched.c | 7 +++++++
> 1 files changed, 7 insertions(+), 0 deletions(-)
>
> diff --git a/kernel/sched.c b/kernel/sched.c
> index 4412493..bd0a41f 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -5712,6 +5712,13 @@ again:
> goto out;
> }
>
> + /* Nohz tasks must keep their affinity */
> + if (test_tsk_thread_flag(p, TIF_NOHZ) &&
> + !cpumask_equal(&p->cpus_allowed, new_mask)) {
> + ret = -EBUSY;
> + goto out;
> + }
> +
> if (p->sched_class->set_cpus_allowed)
> p->sched_class->set_cpus_allowed(p, new_mask);
> else {
NAK, this is really way too restrictive.
On Mon, 2010-12-20 at 16:24 +0100, Frederic Weisbecker wrote:
>
> Clear the nohz task attribute when a task exits, clear the cpu
> mask and restart the tick if necessary.
>
I'm not quite sure this all makes sense, I mean, we're going from 1 to 0
tasks, right?
On Mon, 2010-12-20 at 16:24 +0100, Frederic Weisbecker wrote:
> Any comments are welcome.
Not a single patch touches any of the accounting work done by the tick;
current NO_HZ mode is _barely_ sufficient for the nr_running==0 case, and it
definitely needs work for nr_running==1.
On Mon, 2010-12-20 at 17:28 +0100, Peter Zijlstra wrote:
> On Mon, 2010-12-20 at 16:24 +0100, Frederic Weisbecker wrote:
> > Don't allow to change a nohz task cpu affinity as we want them
> > to be bound to a single CPU and we want this affinity not to
> > change.
> >
> > Signed-off-by: Frederic Weisbecker <[email protected]>
> > ---
> > kernel/sched.c | 7 +++++++
> > 1 files changed, 7 insertions(+), 0 deletions(-)
> >
> > diff --git a/kernel/sched.c b/kernel/sched.c
> > index 4412493..bd0a41f 100644
> > --- a/kernel/sched.c
> > +++ b/kernel/sched.c
> > @@ -5712,6 +5712,13 @@ again:
> > goto out;
> > }
> >
> > + /* Nohz tasks must keep their affinity */
> > + if (test_tsk_thread_flag(p, TIF_NOHZ) &&
> > + !cpumask_equal(&p->cpus_allowed, new_mask)) {
> > + ret = -EBUSY;
> > + goto out;
> > + }
> > +
> > if (p->sched_class->set_cpus_allowed)
> > p->sched_class->set_cpus_allowed(p, new_mask);
> > else {
>
> NAK, this is really way too restrictive.
Agreed, the better solution is to disable the nohz from the task. If the
user just changed its affinity (or something else did), disable the nohz.
Maybe you can add a printk or warning, but I'm not sure about that
either.
-- Steve
On Mon, Dec 20, 2010 at 10:44:46AM -0500, Steven Rostedt wrote:
> On Mon, 2010-12-20 at 16:24 +0100, Frederic Weisbecker wrote:
> > The timer interrupt handles several things like preemption,
> > timekeeping, rcu, etc...
> >
> > However it appears that sometimes it is simply useless like
> > when a task runs alone and even more when it is in userspace
> > as RCU doesn't need it at all in such case.
> >
> > It appears that HPC workload would get some win of such timer
> > deactivation, and perhaps also the Real Time world as this
> > minimizes the critical sections due to way less interrupts to
> > handle.
> >
> > It works through the procfs interface:
> >
> > echo 1 > /proc/self/nohz
>
> I wounder if we could just have this happen automatically.
But this would add some global overhead, especially in the syscall
path as we need to take the slow path to hook userspace resume/exit.
> > - This must be written in /proc/self only, however further
> > plans to allow than to be set from another task should be
> > possible.
> >
> > You need to migrate irqs manually from userspace, same
> > for tasks. If a non nohz task is running on the same cpu
> > than a nohz task, the tick can't be stopped.
>
> So interrupts must not be set to this CPU?
No, it's just that the point is to minimize interrupts. If you want
that on a cpu you can use a nohz task, but you still have to
migrate irqs to another CPU if you want to truly minimize
the interrupts on a nohz task cpu.
> >
> > I can provide you the tools I'm using to test it if you
> > want.
> >
> > Note this depends on the rcu spurious softirq fixes in Paul's
> > queue for .38
> >
> > I'm also using a hack to make init affine to the first CPU
> > on boot so that all userspace tasks end up to the first CPU
> > except kernel threads and tasks that change their affinity
> > explicitly (this is not sched isolation). This avoids any
> > task to set up timers to random CPUs on which we'll later
> > want to run a nohz task. But probably this can be fixed
> > with another way, like unbinding these timers or so. This
> > probably require a detailed audit.
>
> Have you looked at "tuna"?
No, I'm discovering this, I'll have a look. I'm not sure this
can fix the randomly bound timer issue though.
> > Any comments are welcome.
>
> Now as I was saying. If only a single running task is on a given CPU,
> and it is affined there. If no timers are set for wakeups on that CPU.
> Could we possible set this to be NOHZ automatically?
>
> Just a thought.
So, we still need the syscall slow path hooks.
On Mon, Dec 20, 2010 at 04:51:39PM +0100, Peter Zijlstra wrote:
> On Mon, 2010-12-20 at 16:24 +0100, Frederic Weisbecker wrote:
> > Check from the timer interrupt that we are a nohz task running
> > alone in the CPU and stop the tick if this is the case.
> >
> Does this verify that the tick has no other work to do?
>
> I see no list of things the tick does and a checklist that everything it
> does is indeed superfluous.
In a subsequent patch we check if rcu also needs the tick.
For the rest, tick_nohz_stop_sched_tick() knows what to
do: keep the next tick or switch to nohz.
Hm?
On Mon, Dec 20, 2010 at 04:53:21PM +0100, Peter Zijlstra wrote:
> On Mon, 2010-12-20 at 16:24 +0100, Frederic Weisbecker wrote:
> > +#ifdef CONFIG_NO_HZ_TASK
> > + /*
> > + * CHECKME:
> > + * Ideally, we need to check if the target has a nohz task
> > + * and only send the IPI if so. But there is nothing but
> > + * a racy way to do that. Or can we assume at that point
> > + * of the wake up that if cpu_has_nohz_task(cpu) is 0, then
> > + * it's ok, even if it has a task about to switch to nohz
> > + * task mode?
> > + */
> > + if (rq->nr_running == 2)
> > + smp_send_update_nohz_task_cpu(cpu);
> > +#endif
>
> This is the wrong place, use ttwu_activate(), since activate_task() is
> the thing that pokes at nr_running.
Ok, will do.
Thanks.
On Mon, Dec 20, 2010 at 04:58:20PM +0100, Peter Zijlstra wrote:
> On Mon, 2010-12-20 at 16:24 +0100, Frederic Weisbecker wrote:
>
> > @@ -1634,7 +1633,7 @@ static int __rcu_pending(struct rcu_state *rsp, struct rcu_data *rdp)
> > * by the current CPU, returning 1 if so. This function is part of the
> > * RCU implementation; it is -not- an exported member of the RCU API.
> > */
> > -static int rcu_pending(int cpu)
> > +int rcu_pending(int cpu)
>
> /me wonders about that comment.
Yeah I'll need to update that.
> > {
> > return __rcu_pending(&rcu_sched_state, &per_cpu(rcu_sched_data, cpu)) ||
> > __rcu_pending(&rcu_bh_state, &per_cpu(rcu_bh_data, cpu)) ||
> > diff --git a/kernel/sched.c b/kernel/sched.c
> > index 6dbae46..45bd6e2 100644
> > --- a/kernel/sched.c
> > +++ b/kernel/sched.c
> > @@ -2470,10 +2470,16 @@ static void nohz_task_cpu_update(void *unused)
> > int nohz_task_can_stop_tick(void)
> > {
> > struct rq *rq = this_rq();
> > + int cpu;
> >
> > if (rq->nr_running > 1)
> > return 0;
> >
> > + cpu = smp_processor_id();
> > +
> > + if (rcu_pending(cpu) || rcu_needs_cpu(cpu))
> > + return 0;
>
> Arguable, rcu_needs_cpu() should imply rcu_pending(), because if there's
> work still to be done, it needs the cpu, hmm?
We certainly need to change the naming there.
rcu_needs_cpu() checks if we need to do something with the local callbacks.
rcu_pending() checks if the current CPU needs to notify quiescent states
because a new grace period has started.
So now that rcu_pending() is exported, we probably need to refine the naming:
rcu_callbacks_pending() and rcu_grace_period_pending(), or something like
this.
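For instance (a pure naming sketch over the existing functions; none of
this exists in the tree):

/* Thin renaming wrappers splitting the two meanings described above. */
static inline int rcu_callbacks_pending(int cpu)
{
	return rcu_needs_cpu(cpu);	/* local callbacks still queued */
}

static inline int rcu_grace_period_pending(int cpu)
{
	return rcu_pending(cpu);	/* core wants a QS report from this cpu */
}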
> > return 1;
> > }
> >
>
> This patch also implies you broke stuff with #4 because it would put the
> machine to sleep while RCU still had bits to do, not very nice.
Nope, the new config can only be built after [RFC PATCH 11/15] x86: Nohz task support.
I know I split up the patches in an unusual way, but I did that on purpose:
I wanted a fine-grained patchset so that it's more reviewable than a big
"core support" / "arch support" dual-patch style.
But I ensured the new config cannot be enabled before it's entirely buildable
and has no known bugs.
On Mon, Dec 20, 2010 at 05:02:09PM +0100, Peter Zijlstra wrote:
> On Mon, 2010-12-20 at 16:24 +0100, Frederic Weisbecker wrote:
> > If a cpu is in nohz mode due to a nohz task running, then
> > it is not able to notify quiescent states requested by other
> > CPUs.
> >
> > Then restart the tick to remotely force the quiescent states on the
> > nohz task cpus.
>
> -ENOPARSE.. if its in NOHZ state, it couldn't possibly need to
> participate in the quiescent state machine because the cpu is in a
> quiescent state and has 0 RCU activity.
But it can be in nohz state while in the kernel, in which case it can
have RCU activity.
On Mon, Dec 20, 2010 at 05:03:59PM +0100, Peter Zijlstra wrote:
> On Mon, 2010-12-20 at 16:24 +0100, Frederic Weisbecker wrote:
> > The comment in smp_call_function_single() says it wants the irqs
> > to be enabled otherwise it may deadlock.
> >
> > I can't find the reason for that though, except if we had to wait
> > for a self triggered IPI but we execute the local IPI by just
> > calling the function in place.
> >
> > In doubt, only suppress the warning if we are not waiting for the
> > IPI to complete as it should really not raise any deadlock.
> >
> > Signed-off-by: Frederic Weisbecker <[email protected]>
> > Cc: Thomas Gleixner <[email protected]>
> > Cc: Peter Zijlstra <[email protected]>
> > Cc: Paul E. McKenney <[email protected]>
> > Cc: Ingo Molnar <[email protected]>
> > Cc: Steven Rostedt <[email protected]>
> > Cc: Lai Jiangshan <[email protected]>
> > Cc: Andrew Morton <[email protected]>
> > Cc: Anton Blanchard <[email protected]>
> > Cc: Tim Pepper <[email protected]>
> > ---
> > kernel/smp.c | 2 +-
> > 1 files changed, 1 insertions(+), 1 deletions(-)
> >
> > diff --git a/kernel/smp.c b/kernel/smp.c
> > index 12ed8b0..886a406 100644
> > --- a/kernel/smp.c
> > +++ b/kernel/smp.c
> > @@ -289,7 +289,7 @@ int smp_call_function_single(int cpu, smp_call_func_t func, void *info,
> > * send smp call function interrupt to this cpu and as such deadlocks
> > * can't happen.
> > */
> > - WARN_ON_ONCE(cpu_online(this_cpu) && irqs_disabled()
> > + WARN_ON_ONCE(cpu_online(this_cpu) && wait && irqs_disabled()
> > && !oops_in_progress);
> >
> > if (cpu == this_cpu) {
>
> You just deadlocked the machine.. note how you can still wait on the
> previous csd in csd_lock().
Ah right.
I should then use __smp_call_function_single().
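Roughly like this, I guess (a sketch only; the csd and callback names are
made up, and your csd-reuse caveat still applies if the IPI can be re-sent
before the previous one has been serviced):

#include <linux/percpu.h>
#include <linux/smp.h>

static void update_nohz_task_cpu(void *unused)
{
	/* re-evaluate whether the tick can stay stopped on this cpu */
}

static DEFINE_PER_CPU(struct call_single_data, nohz_task_csd) = {
	.func	= update_nohz_task_cpu,
};

static void smp_send_update_nohz_task_cpu(int cpu)
{
	/* wait == 0: fire and forget, no spinning with irqs disabled */
	__smp_call_function_single(cpu, &per_cpu(nohz_task_csd, cpu), 0);
}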
On Tue, 21 Dec 2010 00:49:38 +0100
Frederic Weisbecker <[email protected]> wrote:
> Nope, the new config can only be built after [RFC PATCH 11/15] x86: Nohz task support
>
> I know I split up the patches in some unusual way but I did that on purpose:
> I wanted to have a finegrained patchset so that it's more reviewable than a big
> "core support" - "arch support" dual patch based style.
>
> But I ensured the new config can not be enabled before it's entirely buildable
> and has no known bugs.
Which is a nice thought (it helped me to understand the patches - article
forthcoming), but there is a downside: if anybody tries to bisect a
problem, they'll end up at #11. This stuff is sufficiently tricky that it
would be nice to be able to bisect a little closer to the patch which
actually introduced a bug.
jon
On Mon, Dec 20, 2010 at 04:47:58PM +0100, Peter Zijlstra wrote:
> On Mon, 2010-12-20 at 16:24 +0100, Frederic Weisbecker wrote:
> > Unbound timers are preferably targeted for non idle cpu. If
> > possible though, prioritize idle cpus over nohz task cpus,
> > because the main point of nohz task is to avoid unnecessary
> > timer interrupts.
>
> Oh is it?
>
> I'd very much expect the cpu that arms the timer to get the interrupt. I
> mean, if the task doesn't want to get interrupted by timers,
> _DON'T_USE_TIMERS_ to begin with.
>
> So no, don't much like this at all.
I suspect TIMER_NOT_PINNED was introduced to save some power by
avoiding waking up idle cpus.
It is used by mod_timer(), schedule_timeout() and mod_timer_pending(),
so it's widely used and removing it could have a deep impact on
power.
On Mon, Dec 20, 2010 at 05:12:04PM +0100, Peter Zijlstra wrote:
> On Mon, 2010-12-20 at 11:06 -0500, Steven Rostedt wrote:
> > On Mon, 2010-12-20 at 16:47 +0100, Peter Zijlstra wrote:
> > > On Mon, 2010-12-20 at 16:24 +0100, Frederic Weisbecker wrote:
> > > > Unbound timers are preferably targeted for non idle cpu. If
> > > > possible though, prioritize idle cpus over nohz task cpus,
> > > > because the main point of nohz task is to avoid unnecessary
> > > > timer interrupts.
> > >
> > > Oh is it?
> > >
> > > I'd very much expect the cpu that arms the timer to get the interrupt. I
> > > mean, if the task doesn't want to get interrupted by timers,
> > > _DON'T_USE_TIMERS_ to begin with.
> > >
> > > So no, don't much like this at all.
> >
> > I think this comes from other tasks on other CPUs that are using timers.
>
> Tasks on other CPUs should not cause timers on this CPU, _if_ that does
> happen, fix that.
>
> > Although, I'm not sure what causes an "unbound" timer to happen. I
> > thought timers usually go off on the CPU that asked for it to go off.
>
> They do, except if you enable some weird power management feature that
> migrates timers around so as to let CPUs sleep longer. But I doubt
> that's the reason for this here, and if it is, just disable that.
That seems to me to be the reason: avoiding waking up idle cpus.
I can certainly deactivate TIMER_NOT_PINNED and make it a no-op
if CONFIG_NO_HZ_TASK.
But I'm not sure why we would want to do this.
On Mon, Dec 20, 2010 at 05:16:39PM +0100, Peter Zijlstra wrote:
> On Mon, 2010-12-20 at 16:57 +0100, Frederic Weisbecker wrote:
>
> > Should I?
>
> Well yes, this interface of explicitly marking a task and cpu as
> task_no_hz is kinda restrictive and useless.
Yeah indeed. I made the mistake of focusing on the HPC-specific workflow,
or rather what I imagine as the HPC-specific workflow: a single task
per cpu doing intensive work.
But this should also work without any tweaks to the affinity.
> When I run 4 cpu-bound tasks on a quad-core I shouldn't have to do
> anything to benefit from this.
Yeah, exactly. If the scheduler load balancer does the appropriate
balancing between CPUs, having only one task running on each should
happen often enough.
And the user can optimize that by playing with irq and task affinity.
We still need to do the echo 1 > /proc/pid/nohz though.
We still need to do the echo 1 > /proc/pid/nohz though.
> I don't see why having this cpumask is restricting you in any way,
> user-space tasks don't migrate around, that all happens in kernel space.
The cpumask is useful to find unbound timer targets and for RCU to know if it
should send the specific IPI. Ah, and also to keep at least one
CPU that has no nohz task to handle the timekeeping.
- For the unbound targets, we are discussing that elsewhere; that's one
reason for which we need to keep a CPU without a nohz task, so that it
can handle those unbound timers. But if there is no such CPU, we can
just fall back as we did before.
- RCU can unconditionally send the specific IPI, which can fall back to
executing the simple resched IPI callback if no nohz task is running
on the CPU.
- The last reason to keep at least one CPU without a nohz task is then
the timekeeping. But again, if every CPU has a nohz task, we can
fall back to a random one.
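The mask usage is roughly this shape (a sketch; cpu_has_nohz_task()
matches the accessor used in the patches, while the mask name and the
fallback helper are illustrative):

#include <linux/cpumask.h>
#include <linux/smp.h>

static struct cpumask nohz_task_cpu_mask;

static inline int cpu_has_nohz_task(int cpu)
{
	return cpumask_test_cpu(cpu, &nohz_task_cpu_mask);
}

/* unbound timers and the timekeeping duty prefer a tick-keeping cpu */
static int pick_tickful_cpu(void)
{
	int cpu;

	for_each_online_cpu(cpu)
		if (!cpu_has_nohz_task(cpu))
			return cpu;

	/* every cpu has a nohz task: fall back to any of them */
	return smp_processor_id();
}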
> Also, I'm not quite happy with the pure userspace restriction, but at
> least I see why you did that event though you didn't mention that.
What do you mean? The fact that kernel threads cannot be nohz tasks?
On Mon, Dec 20, 2010 at 05:18:40PM +0100, Peter Zijlstra wrote:
> On Mon, 2010-12-20 at 16:24 +0100, Frederic Weisbecker wrote:
> > (we check if the local cpu is in nohz mode, which means
> > no other task compete on that CPU)
>
> You keep repeating that definition, but its not true.. It means there is
> not work for the tick to do, the tick does _tons_ more besides
> preemption, so nr_running==1 is necessary but not sufficient.
Sure but the point is that if the tick is not running, it means
that the nohz task is the only task running on that CPU.
Now indeed there are other reasons for the tick to restart like RCU
or the effective nohz mode to physically happen or to be delayed,
which is decided by tick_nohz_stop_sched_tick().
On Mon, Dec 20, 2010 at 05:23:19PM +0100, Peter Zijlstra wrote:
> On Mon, 2010-12-20 at 16:24 +0100, Frederic Weisbecker wrote:
> >
> > Implement the thread flag, syscalls and exception hooks for
> > nohz task support.
> >
>
> I saw:
> - syscall
> - do_int3
> - do_debug (int1)
> - #PF
>
> So where's all other interrupts?
No need to handle them.
We have:
rcu_irq_enter(), rcu_irq_exit(), rcu_nmi_enter() and rcu_nmi_exit(),
and they already act as pauses in the extended quiescent state, which
is enough for our needs.
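i.e. every interrupt is already bracketed like this (paraphrasing the
existing irq path, nothing new):

/* paraphrase of the existing irq entry/exit bracketing, not new code */
void example_irq(void)
{
	rcu_irq_enter();	/* pauses the extended QS if we were in one */
	/* ... handler runs, RCU read-side sections are safe here ... */
	rcu_irq_exit();		/* resumes the extended QS */
}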
On Mon, Dec 20, 2010 at 05:25:11PM +0100, Peter Zijlstra wrote:
> On Mon, 2010-12-20 at 11:19 -0500, Steven Rostedt wrote:
> > On Mon, 2010-12-20 at 16:48 +0100, Peter Zijlstra wrote:
> > > On Mon, 2010-12-20 at 16:24 +0100, Frederic Weisbecker wrote:
> > > > + if (!current->pid)
> > >
> > > I'm not exactly sure that's the sanest way to test if you're the idle
> > > thread...
> >
> > Would
> >
> > if (idle_cpu(cpu))
> >
> > be better?
>
> Lots ;-)
I wondered about that too, will update.
Thanks.
On Tue, 2010-12-21 at 00:33 +0100, Frederic Weisbecker wrote:
>
> > I wounder if we could just have this happen automatically.
>
> But this would add some global overhead, especially in the syscall
> path as we need to take the slow path to hook userspace resume/exit.
>
I guess we could measure the overhead. See if the timer causes more
overhead than the added overhead of the syscall, or not.
-- Steve
On Mon, Dec 20, 2010 at 05:27:38PM +0100, Peter Zijlstra wrote:
> On Mon, 2010-12-20 at 16:24 +0100, Frederic Weisbecker wrote:
> >
> > The watchdog should probably make an exception for nohz task cpus
> > that want to be interrupted the least possible.
> >
> > However we probably need to warn the user about that.
> >
> > Another solution would be to make this timer defferable.
>
> Nah, both skipping a cpu and making the timer deferrable render the
> watchdog useless, just disable the thing.
Thomas seemed to prefer we keep it but just ignore the nohz task cpus.
I mean, when we talked about that at the time, the notion of nohz was
purely per cpu.
Now we wish the notion of a nohz task cpu could disappear, so
we should indeed deactivate the thing for that config.
Would be nice to have Thomas's opinion too.
On Mon, Dec 20, 2010 at 05:30:28PM +0100, Peter Zijlstra wrote:
> On Mon, 2010-12-20 at 16:24 +0100, Frederic Weisbecker wrote:
> >
> > Clear the nohz task attribute when a task exits, clear the cpu
> > mask and restart the tick if necessary.
> >
> I'm not quite sure this all makes sense, I mean, we're going from 1 to 0
> tasks, right?
Not necessarily: other tasks can be on the runqueue while that nohz task
exits. Or we can be alone, in which case the tick might be stopped and
we need to restart it, because the rq->nr_running > 1 check won't make
much sense anymore without the nohz task, and if a new task gets enqueued,
the tick won't restart until a second one gets in.
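So the exit path roughly needs (a sketch; the helper name and the mask are
illustrative, task_nohz_mode is the per-cpu flag from the patches):

#include <linux/sched.h>
#include <linux/tick.h>

/* runs on the exiting task's cpu */
static void nohz_task_exit(struct task_struct *p)
{
	int cpu = task_cpu(p);

	clear_tsk_thread_flag(p, TIF_NOHZ);
	cpumask_clear_cpu(cpu, &nohz_task_cpu_mask);

	/*
	 * If the tick was stopped on behalf of this task, restart it:
	 * otherwise a single newly enqueued task would run tickless
	 * without being entitled to.
	 */
	if (per_cpu(task_nohz_mode, cpu)) {
		per_cpu(task_nohz_mode, cpu) = 0;
		tick_nohz_restart_sched_tick();
	}
}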
On Mon, Dec 20, 2010 at 05:35:51PM +0100, Peter Zijlstra wrote:
> On Mon, 2010-12-20 at 16:24 +0100, Frederic Weisbecker wrote:
> > Any comments are welcome.
>
> Not a single patch touches any of the accounting work done by the tick,
> current NO_HZ more is _barely_ sufficient for the nr_running==0 case, it
> definitely needs work for nr_running==1.
I'm not sure what you mean. Is it about timekeeping? Note we keep a cpu to
handle that. Hmm, that makes me think: if it goes idle, we may be screwed...
On Mon, Dec 20, 2010 at 12:05:30PM -0500, Steven Rostedt wrote:
> On Mon, 2010-12-20 at 17:28 +0100, Peter Zijlstra wrote:
> > On Mon, 2010-12-20 at 16:24 +0100, Frederic Weisbecker wrote:
> > > Don't allow to change a nohz task cpu affinity as we want them
> > > to be bound to a single CPU and we want this affinity not to
> > > change.
> > >
> > > Signed-off-by: Frederic Weisbecker <[email protected]>
> > > ---
> > > kernel/sched.c | 7 +++++++
> > > 1 files changed, 7 insertions(+), 0 deletions(-)
> > >
> > > diff --git a/kernel/sched.c b/kernel/sched.c
> > > index 4412493..bd0a41f 100644
> > > --- a/kernel/sched.c
> > > +++ b/kernel/sched.c
> > > @@ -5712,6 +5712,13 @@ again:
> > > goto out;
> > > }
> > >
> > > + /* Nohz tasks must keep their affinity */
> > > + if (test_tsk_thread_flag(p, TIF_NOHZ) &&
> > > + !cpumask_equal(&p->cpus_allowed, new_mask)) {
> > > + ret = -EBUSY;
> > > + goto out;
> > > + }
> > > +
> > > if (p->sched_class->set_cpus_allowed)
> > > p->sched_class->set_cpus_allowed(p, new_mask);
> > > else {
> >
> > NAK, this is really way too restrictive.
>
> Agreed, the better solution is to disable the nohz from the task. If the
> use just changed its affinity (or something else did), disable the nohz.
> Maybe you can add a printk or warning, but I'm not sure about that
> either.
Right. Or even better: don't force the nohz task to be affine to
a single cpu.
On Mon, Dec 20, 2010 at 05:12:59PM -0700, Jonathan Corbet wrote:
> On Tue, 21 Dec 2010 00:49:38 +0100
> Frederic Weisbecker <[email protected]> wrote:
>
> > Nope, the new config can only be built after [RFC PATCH 11/15] x86: Nohz task support
> >
> > I know I split up the patches in some unusual way but I did that on purpose:
> > I wanted to have a finegrained patchset so that it's more reviewable than a big
> > "core support" - "arch support" dual patch based style.
> >
> > But I ensured the new config can not be enabled before it's entirely buildable
> > and has no known bugs.
>
> Which is a nice thought (it helped me to understand the patches - article
> forthcoming)
Oh great :)
> but there is a downside: if anybody tries to bisect a
> problem, they'll end up at #11. This stuff is sufficiently tricky that it
> would be nice to be able to bisect a little closer to the patch which
> actually introduced a bug.
Yeah indeed. I think I could split that up another way while keeping the
fine-grained progression (if any).
I could enable the interface and the CONFIG a bit sooner, once we have support
for tick stop and restart handling basic things like woken up tasks and rcu
pending work checks.
And then handle the rcu extended quiescent states in userspace later.
It works without that part: the nohz task would just be regularly interrupted by
rcu IPIs to note the quiescent states.
There are lots of changes to be done anyway, so I'm sure there will be enough
new takes of this for me to try more patchset split-up flavours ;-)
On Mon, Dec 20, 2010 at 08:36:39PM -0500, Steven Rostedt wrote:
> On Tue, 2010-12-21 at 00:33 +0100, Frederic Weisbecker wrote:
> >
> > > I wounder if we could just have this happen automatically.
> >
> > But this would add some global overhead, especially in the syscall
> > path as we need to take the slow path to hook userspace resume/exit.
> >
>
> I guess we could measure the overhead. See if the timer causes more
> overhead than the added overhead of the syscall, or not.
Indeed, but maybe let's stage that first as a conditional attribute,
and if it gives better results most of the time, then make it
the default.
Hm?
On Tue, 2010-12-21 at 00:33 +0100, Frederic Weisbecker wrote:
> So, we still need the syscalls slow path hooks.
Make it a cpuset property or something.
On Tue, 2010-12-21 at 00:37 +0100, Frederic Weisbecker wrote:
> On Mon, Dec 20, 2010 at 04:51:39PM +0100, Peter Zijlstra wrote:
> > On Mon, 2010-12-20 at 16:24 +0100, Frederic Weisbecker wrote:
> > > Check from the timer interrupt that we are a nohz task running
> > > alone in the CPU and stop the tick if this is the case.
> > >
> > Does this verify that the tick has no other work to do?
> >
> > I see no list of things the tick does and a checklist that everything it
> > does is indeed superfluous.
>
> In a subsequent patch we check if rcu also needs the tick.
> For the rest, tick_nohz_stop_sched_tick() knows what to
> do: keep the next tick or switch to nohz.
>
> Hm?
No, and that worries me, you don't even seem to know what the tick does
and yet you're working on stopping it.
On Tue, 2010-12-21 at 00:52 +0100, Frederic Weisbecker wrote:
> On Mon, Dec 20, 2010 at 05:02:09PM +0100, Peter Zijlstra wrote:
> > On Mon, 2010-12-20 at 16:24 +0100, Frederic Weisbecker wrote:
> > > If a cpu is in nohz mode due to a nohz task running, then
> > > it is not able to notify quiescent states requested by other
> > > CPUs.
> > >
> > > Then restart the tick to remotely force the quiescent states on the
> > > nohz task cpus.
> >
> > -ENOPARSE.. if its in NOHZ state, it couldn't possibly need to
> > participate in the quiescent state machine because the cpu is in a
> > quiescent state and has 0 RCU activity.
>
>
> But it can be in nohz state in the kernel in which case it can have
> any RCU activity.
That still doesn't make sense.. if you're in nohz state there shouldn't
be any rcu activity, otherwise it's not nohz, is it?
On Tue, 2010-12-21 at 01:13 +0100, Frederic Weisbecker wrote:
> On Mon, Dec 20, 2010 at 04:47:58PM +0100, Peter Zijlstra wrote:
> > On Mon, 2010-12-20 at 16:24 +0100, Frederic Weisbecker wrote:
> > > Unbound timers are preferably targeted for non idle cpu. If
> > > possible though, prioritize idle cpus over nohz task cpus,
> > > because the main point of nohz task is to avoid unnecessary
> > > timer interrupts.
> >
> > Oh is it?
> >
> > I'd very much expect the cpu that arms the timer to get the interrupt. I
> > mean, if the task doesn't want to get interrupted by timers,
> > _DON'T_USE_TIMERS_ to begin with.
> >
> > So no, don't much like this at all.
>
> I suspect TIMER_NOT_PINNED has been introduced to save some power by
> avoiding to wake up idle cpus.
>
> This is used by mod_timer(), schedule_timeout(), mod_timer_pending()
> So that's widely used and removing that could have a deep impact on
> power.
Yeah, so? Who said task_nohz had to have the bestest power savings
around? It's not a laptop feature by any means. Simply disable the thing:
echo 0 > /proc/sys/kernel/timer_migration, or better yet, teach it about
your cpuset constraints and avoid it migrating timers into your set and
keep timers pinned within the set.
On Tue, 2010-12-21 at 01:20 +0100, Frederic Weisbecker wrote:
> I can certainly deactivate TIMER_NOT_PINNED and make it a no-op
> if CONFIG_NO_HZ_TASK.
Why would you want to make that a compile time switch? Dynamic is
trivial to do.
On Tue, 2010-12-21 at 02:27 +0100, Frederic Weisbecker wrote:
> On Mon, Dec 20, 2010 at 05:18:40PM +0100, Peter Zijlstra wrote:
> > On Mon, 2010-12-20 at 16:24 +0100, Frederic Weisbecker wrote:
> > > (we check if the local cpu is in nohz mode, which means
> > > no other task compete on that CPU)
> >
> > You keep repeating that definition, but its not true.. It means there is
> > not work for the tick to do, the tick does _tons_ more besides
> > preemption, so nr_running==1 is necessary but not sufficient.
>
> Sure but the point is that if the tick is not running, it means
> that the nohz task is the only task running on that CPU.
No, that too isn't true, the cpu could be idle.
> Now indeed there are other reasons for the tick to restart like RCU
> or the effective nohz mode to physically happen or to be delayed,
> which is decided by tick_nohz_stop_sched_tick().
You really really really badly need to read through the whole tick path
and look at all the things it does, put them in a list, then look at the
current nohz code, mark those it deals with, then go through the nohz
code again and find the nr_running==0 assumptions, then make sure you've
covered everything.
I'm very confident you'll find a number of things to fix. At the very
least your current patch set totally forgets about task runtime
accounting (account_process_tick()), the existing NO_HZ doesn't need to
worry about that because the system is idle so all it needs to do is add
idle ticks when it wakes up (and possibly steal time for the virt muck).
You also miss the profile_tick(), and you need to go through the load
accounting muck (both of them) to see if there are any nr_running==0
assumptions there.
You also need to fix the perf counter list rotation stuff, and again,
check if no_hz load-balancing can deal with the nr_running==1 situation.
Now, there might be more, this is just a quick one-minute scan through
the tick code, but the fact that nowhere in this whole patch-set is even
a mention of these things makes me worry about your whole approach.
On Tue, 2010-12-21 at 02:30 +0100, Frederic Weisbecker wrote:
> On Mon, Dec 20, 2010 at 05:23:19PM +0100, Peter Zijlstra wrote:
> > On Mon, 2010-12-20 at 16:24 +0100, Frederic Weisbecker wrote:
> > >
> > > Implement the thread flag, syscalls and exception hooks for
> > > nohz task support.
> > >
> >
> > I saw:
> > - syscall
> > - do_int3
> > - do_debug (int1)
> > - #PF
> >
> > So where's all other interrupts?
>
> No need to handle them.
>
> We have:
>
> rcu_irq_enter() rcu_irq_exit() rcu_nmi_enter() rcu_nmi_exit()
> and they already act as pauses into extended quiescent states, which
> is enough for our needs.
Oh, and RCU is the only thing you need to worry about is it?
On Tue, 2010-12-21 at 02:48 +0100, Frederic Weisbecker wrote:
> On Mon, Dec 20, 2010 at 05:30:28PM +0100, Peter Zijlstra wrote:
> > On Mon, 2010-12-20 at 16:24 +0100, Frederic Weisbecker wrote:
> > >
> > > Clear the nohz task attribute when a task exits, clear the cpu
> > > mask and restart the tick if necessary.
> > >
> > I'm not quite sure this all makes sense, I mean, we're going from 1 to 0
> > tasks, right?
>
> Not necessarily, other tasks can be on the runqueue while that nohz task
> exits, or we can be alone in which case the tick might be stopped and
> we need to restart it because rq->nr_running > 1 won't make much sense
> anymore without the nohz task and if a new task gets enqueued, the tick
> won't restart until a second one gets in.
Urgh, so that mask is set even if you're not currently in that mode?
That's 'interesting'..
On Mon, Dec 20, 2010 at 04:58:20PM +0100, Peter Zijlstra wrote:
> On Mon, 2010-12-20 at 16:24 +0100, Frederic Weisbecker wrote:
>
> > @@ -1634,7 +1633,7 @@ static int __rcu_pending(struct rcu_state *rsp, struct rcu_data *rdp)
> > * by the current CPU, returning 1 if so. This function is part of the
> > * RCU implementation; it is -not- an exported member of the RCU API.
> > */
> > -static int rcu_pending(int cpu)
> > +int rcu_pending(int cpu)
>
> /me wonders about that comment.
>
> > {
> > return __rcu_pending(&rcu_sched_state, &per_cpu(rcu_sched_data, cpu)) ||
> > __rcu_pending(&rcu_bh_state, &per_cpu(rcu_bh_data, cpu)) ||
> > diff --git a/kernel/sched.c b/kernel/sched.c
> > index 6dbae46..45bd6e2 100644
> > --- a/kernel/sched.c
> > +++ b/kernel/sched.c
> > @@ -2470,10 +2470,16 @@ static void nohz_task_cpu_update(void *unused)
> > int nohz_task_can_stop_tick(void)
> > {
> > struct rq *rq = this_rq();
> > + int cpu;
> >
> > if (rq->nr_running > 1)
> > return 0;
> >
> > + cpu = smp_processor_id();
> > +
> > + if (rcu_pending(cpu) || rcu_needs_cpu(cpu))
> > + return 0;
>
> Arguable, rcu_needs_cpu() should imply rcu_pending(), because if there's
> work still to be done, it needs the cpu, hmm?
There are two cases:
1. This CPU has callbacks. In this case, rcu_pending() returns 1.
2. The RCU core needs something from this CPU. In this case,
rcu_pending() returns 1.
The trick is that in dyntick-idle mode, if we have #2 but not #1, other
CPUs can (and will) act on the dyntick-idle CPU's behalf. However, when
there is a task running, that task might do system calls, which can
queue callbacks and can contain RCU read-side critical sections, neither
of which can happen in dyntick-idle mode.
So the one-task-running-on-this-CPU case above does need special
handling.
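In other words, restating the check quoted above (the helper name below
is made up):

static int rcu_wants_the_tick(int cpu)
{
	/* case 1: this CPU has callbacks queued that must be advanced */
	if (rcu_needs_cpu(cpu))
		return 1;
	/* case 2: the RCU core needs a QS report or other work from us */
	if (rcu_pending(cpu))
		return 1;
	return 0;
}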
> > return 1;
> > }
> >
>
> This patch also implies you broke stuff with #4 because it would put the
> machine to sleep while RCU still had bits to do, not very nice.
Hmmm... I need to look at this after getting some sleep.
Thanx, Paul
On Tue, 2010-12-21 at 02:24 +0100, Frederic Weisbecker wrote:
>
> > Also, I'm not quite happy with the pure userspace restriction, but at
> > least I see why you did that event though you didn't mention that.
>
> What do you mean? The fact that kernel threads can not be nohz task?
No, that you key off kernel/user boundary transitions. Arguably one
could simply allow system calls and page faults to happen without
restarting the tick; then again, RCU is very pervasive these days, so I'm
not quite sure you can actually make that happen.
On Tue, Dec 21, 2010 at 08:34:38AM +0100, Peter Zijlstra wrote:
> On Tue, 2010-12-21 at 00:33 +0100, Frederic Weisbecker wrote:
> > So, we still need the syscalls slow path hooks.
>
> Make it a cpuset property or something.
Hm, but I don't understand how cpusets can help with that.
You mean I could use a cpuset to define a range of cpus on which
I can run nohz tasks? And then set the TIF flags on those threads
and so on?
On Tue, Dec 21, 2010 at 08:35:58AM +0100, Peter Zijlstra wrote:
> On Tue, 2010-12-21 at 00:37 +0100, Frederic Weisbecker wrote:
> > On Mon, Dec 20, 2010 at 04:51:39PM +0100, Peter Zijlstra wrote:
> > > On Mon, 2010-12-20 at 16:24 +0100, Frederic Weisbecker wrote:
> > > > Check from the timer interrupt that we are a nohz task running
> > > > alone in the CPU and stop the tick if this is the case.
> > > >
> > > Does this verify that the tick has no other work to do?
> > >
> > > I see no list of things the tick does and a checklist that everything it
> > > does is indeed superfluous.
> >
> > In a subsequent patch we check if rcu also needs the tick.
> > For the rest, tick_nohz_stop_sched_tick() knows what to
> > do: keep the next tick or switch to nohz.
> >
> > Hm?
>
> No, and that worries me, you don't even seem to know what the tick does
> and yet you're working on stopping it.
Yeah, I've been focusing a lot on preemption and rcu because they seemed
to be the trickiest parts of the thing, and in the meantime I just forgot
the rest or tagged it too quickly as details to deal with later. And I
finally forgot them :)
Bah, it's a first take, it can only improve ;)
On Tue, Dec 21, 2010 at 08:41:14AM +0100, Peter Zijlstra wrote:
> On Tue, 2010-12-21 at 00:52 +0100, Frederic Weisbecker wrote:
> > On Mon, Dec 20, 2010 at 05:02:09PM +0100, Peter Zijlstra wrote:
> > > On Mon, 2010-12-20 at 16:24 +0100, Frederic Weisbecker wrote:
> > > > If a cpu is in nohz mode due to a nohz task running, then
> > > > it is not able to notify quiescent states requested by other
> > > > CPUs.
> > > >
> > > > Then restart the tick to remotely force the quiescent states on the
> > > > nohz task cpus.
> > >
> > > -ENOPARSE.. if its in NOHZ state, it couldn't possibly need to
> > > participate in the quiescent state machine because the cpu is in a
> > > quiescent state and has 0 RCU activity.
> >
> >
> > But it can be in nohz state in the kernel in which case it can have
> > any RCU activity.
>
> That still doesn't make sense.. if you're in nohz state there shouldn't
> be any rcu activity, otherwise its not nohz is it?
>
>
nohz only means that the tick is stopped: we no longer have a
periodic event, only a purely on-demand one.
Now rcu takes appropriate action according to that new state.
If the cpu is idle or in userspace, then it can enter an extended
quiescent state. If not, then it can't.
So nohz and extended QS must be two different things now.
On Tue, Dec 21, 2010 at 08:50:47AM +0100, Peter Zijlstra wrote:
> On Tue, 2010-12-21 at 01:13 +0100, Frederic Weisbecker wrote:
> > On Mon, Dec 20, 2010 at 04:47:58PM +0100, Peter Zijlstra wrote:
> > > On Mon, 2010-12-20 at 16:24 +0100, Frederic Weisbecker wrote:
> > > > Unbound timers are preferably targeted for non idle cpu. If
> > > > possible though, prioritize idle cpus over nohz task cpus,
> > > > because the main point of nohz task is to avoid unnecessary
> > > > timer interrupts.
> > >
> > > Oh is it?
> > >
> > > I'd very much expect the cpu that arms the timer to get the interrupt. I
> > > mean, if the task doesn't want to get interrupted by timers,
> > > _DON'T_USE_TIMERS_ to begin with.
> > >
> > > So no, don't much like this at all.
> >
> > I suspect TIMER_NOT_PINNED has been introduced to save some power by
> > avoiding to wake up idle cpus.
> >
> > This is used by mod_timer(), schedule_timeout(), mod_timer_pending()
> > So that's widely used and removing that could have a deep impact on
> > power.
>
> Yeah so? Who said task_nohz had to have the bestest power savings
> around? Its not a laptop feature by any means. Simply disable the thing:
> echo 0 > /proc/sys/kernel/timer_migration, or better yet, teach it about
> your cpuset constraints and avoid it migrating timers into your set and
> keep timers pinned within the set.
Right.
So I'll start by deactivating this through /proc/sys/kernel/timer_migration
and worry about making it more automatic later.
On 12/21/2010 01:33 AM, Frederic Weisbecker wrote:
> On Mon, Dec 20, 2010 at 10:44:46AM -0500, Steven Rostedt wrote:
> > On Mon, 2010-12-20 at 16:24 +0100, Frederic Weisbecker wrote:
> > > The timer interrupt handles several things like preemption,
> > > timekeeping, rcu, etc...
> > >
> > > However it appears that sometimes it is simply useless like
> > > when a task runs alone and even more when it is in userspace
> > > as RCU doesn't need it at all in such case.
> > >
> > > It appears that HPC workload would get some win of such timer
> > > deactivation, and perhaps also the Real Time world as this
> > > minimizes the critical sections due to way less interrupts to
> > > handle.
> > >
> > > It works through the procfs interface:
> > >
> > > echo 1> /proc/self/nohz
> >
> > I wounder if we could just have this happen automatically.
>
> But this would add some global overhead, especially in the syscall
> path as we need to take the slow path to hook userspace resume/exit.
That overhead is zero, as there is already a test on thread_info->flags on the
syscall exit path.
(or did I misunderstand you - what's the purpose of the slow path?)
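To illustrate (pseudo-C of the x86 exit check; names approximate, and
assuming TIF_NOHZ is folded into the exit work mask as the x86 patch does):

/* pseudo-C of the asm fast path; ti is the current thread_info */
static void syscall_exit_check(struct thread_info *ti)
{
	if (likely(!(ti->flags & _TIF_WORK_SYSCALL_EXIT)))
		return;	/* fast path: a clear TIF_NOHZ costs nothing */

	/* slow path: process each pending work flag, TIF_NOHZ included */
}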
--
error compiling committee.c: too many arguments to function
On Tue, Dec 21, 2010 at 08:51:42AM +0100, Peter Zijlstra wrote:
> On Tue, 2010-12-21 at 01:20 +0100, Frederic Weisbecker wrote:
> > I can certainly deactivate TIMER_NOT_PINNED and make it a no-op
> > if CONFIG_NO_HZ_TASK.
>
> Why would you want to make that a compile time switch? Dynamic is
> trivial to do.
Yep, I did not know about the proc file at that time.
On 12/21/2010 10:14 AM, Peter Zijlstra wrote:
> On Tue, 2010-12-21 at 02:24 +0100, Frederic Weisbecker wrote:
> >
> > > Also, I'm not quite happy with the pure userspace restriction, but at
> > > least I see why you did that event though you didn't mention that.
> >
> > What do you mean? The fact that kernel threads can not be nohz task?
>
> No, that you key off kernel/user boundary transitions. Arguably one
> could allow simply system calls and page-faults to happen without
> restarting the tick, then again, RCU is very pervasive these days so I'm
> not quite sure you can actually make that happen.
>
For an example of a per-cpu flag that is checked on every exit with zero
additional overhead on the flag clear case, look at TIF_USER_RETURN_NOTIFY.
--
error compiling committee.c: too many arguments to function
On Tue, Dec 21, 2010 at 09:04:54AM +0100, Peter Zijlstra wrote:
> On Tue, 2010-12-21 at 02:27 +0100, Frederic Weisbecker wrote:
> > On Mon, Dec 20, 2010 at 05:18:40PM +0100, Peter Zijlstra wrote:
> > > On Mon, 2010-12-20 at 16:24 +0100, Frederic Weisbecker wrote:
> > > > (we check if the local cpu is in nohz mode, which means
> > > > no other task compete on that CPU)
> > >
> > > You keep repeating that definition, but its not true.. It means there is
> > > not work for the tick to do, the tick does _tons_ more besides
> > > preemption, so nr_running==1 is necessary but not sufficient.
> >
> > Sure but the point is that if the tick is not running, it means
> > that the nohz task is the only task running on that CPU.
>
> No, that too isn't true, the cpu could be idle.
>
> > Now indeed there are other reasons for the tick to restart like RCU
> > or the effective nohz mode to physically happen or to be delayed,
> > which is decided by tick_nohz_stop_sched_tick().
>
> You really really really badly need to read through the whole tick path
> and look at all the things it does, put them in a list, then look at the
> current nohz code, mark those it deals with, then go through the nohz
> code again and find the nr_running==0 assumptions, then make sure you've
> covered everything.
>
> I'm very confident you'll find a number of things to fix. At the very
> least your current patch set totally forgets about task runtime
> accounting (account_process_tick()), the existing NO_HZ doesn't need to
> worry about that because the system is idle so all it needs to do is add
> idle ticks when it wakes up (and possibly steal time for the virt muck).
>
> You also miss the profile_tick(), and you need to go through the load
> accounting muck (both of them) to see if there are any nr_running==0
> assumptions there.
>
> You also need to fix the perf counter list rotation stuff, and again,
> check if no_hz load-balancing can deal with the nr_running==1 situation.
>
> Now, there might be more, this is just a quick one-minute scan through
> the tick code, but the fact that nowhere in this whole patch-set is even
> a mention of these things makes me worry about your whole approach.
Yeah, you're right, I missed a lot of important things. I'll try to
handle them in the next take.
On Tue, Dec 21, 2010 at 09:05:29AM +0100, Peter Zijlstra wrote:
> On Tue, 2010-12-21 at 02:30 +0100, Frederic Weisbecker wrote:
> > On Mon, Dec 20, 2010 at 05:23:19PM +0100, Peter Zijlstra wrote:
> > > On Mon, 2010-12-20 at 16:24 +0100, Frederic Weisbecker wrote:
> > > >
> > > > Implement the thread flag, syscalls and exception hooks for
> > > > nohz task support.
> > > >
> > >
> > > I saw:
> > > - syscall
> > > - do_int3
> > > - do_debug (int1)
> > > - #PF
> > >
> > > So where's all other interrupts?
> >
> > No need to handle them.
> >
> > We have:
> >
> > rcu_irq_enter() rcu_irq_exit() rcu_nmi_enter() rcu_nmi_exit()
> > and they already act as pauses into extended quiescent states, which
> > is enough for our needs.
>
> Oh, and RCU is the only thing you need to worry about is it?
Wrt userspace/kernelspace switches: yes, for now. But perhaps I'll
discover more reasons to hook into that boundary.
On Tue, Dec 21, 2010 at 09:07:45AM +0100, Peter Zijlstra wrote:
> On Tue, 2010-12-21 at 02:48 +0100, Frederic Weisbecker wrote:
> > On Mon, Dec 20, 2010 at 05:30:28PM +0100, Peter Zijlstra wrote:
> > > On Mon, 2010-12-20 at 16:24 +0100, Frederic Weisbecker wrote:
> > > >
> > > > Clear the nohz task attribute when a task exits, clear the cpu
> > > > mask and restart the tick if necessary.
> > > >
> > > I'm not quite sure this all makes sense, I mean, we're going from 1 to 0
> > > tasks, right?
> >
> > Not necessarily, other tasks can be on the runqueue while that nohz task
> > exits, or we can be alone in which case the tick might be stopped and
> > we need to restart it because rq->nr_running > 1 won't make much sense
> > anymore without the nohz task and if a new task gets enqueued, the tick
> > won't restart until a second one gets in.
>
> Urgh, so that mask is set even if you're not currently in that mode?
> That's 'interesting'..
I don't get what you mean.
On Tue, Dec 21, 2010 at 09:14:40AM +0100, Peter Zijlstra wrote:
> On Tue, 2010-12-21 at 02:24 +0100, Frederic Weisbecker wrote:
> >
> > > Also, I'm not quite happy with the pure userspace restriction, but at
> > > least I see why you did that event though you didn't mention that.
> >
> > What do you mean? The fact that kernel threads can not be nohz task?
>
> No, that you key off kernel/user boundary transitions. Arguably one
> could allow simply system calls and page-faults to happen without
> restarting the tick
This is what I do.
> then again, RCU is very pervasive these days so I'm
> not quite sure you can actually make that happen.
No, it looks ok. If we enter the kernel while the tick is stopped, we
just exit the extended quiescent state from RCU's point of view.
But because we don't tick, we don't report quiescent states, and
so RCU may notice an overly long grace period. Then it will
send us the IPI that makes us restart the tick, so that we resume
reporting quiescent states.
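As a rough sketch (smp_call_function_single() is the existing helper
and nohz_task_cpu_update() is from this series; the exact trigger site
inside RCU is illustrative only):

	/* From RCU's force_quiescent_state() path, conceptually:
	 * kick the nohz cpu so it restarts its tick and resumes
	 * reporting quiescent states. */
	smp_call_function_single(cpu, nohz_task_cpu_update, NULL, 0);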
On Tue, 2010-12-21 at 14:22 +0100, Frederic Weisbecker wrote:
> On Tue, Dec 21, 2010 at 08:35:58AM +0100, Peter Zijlstra wrote:
> > No, and that worries me, you don't even seem to know what the tick does
> > and yet you're working on stopping it.
>
> Yeah, I've been focusing a lot on preemption and rcu because they seemed
> to be the trickiest part of the thing, and I in the meantime I just forgot
> the rest or tagged it too quickly as details to deal with later. And I
> finally forgot them :)
>
> Bah it's a first take, it can only improve ;)
Exactly!
This is an RFC, a "release early, release often" approach. And you Cc'd
all the right people. This is not something I expect to make into 2.6.38
though ;-)
-- Steve
On Tue, Dec 21, 2010 at 09:05:29AM +0100, Peter Zijlstra wrote:
> On Tue, 2010-12-21 at 02:30 +0100, Frederic Weisbecker wrote:
> > On Mon, Dec 20, 2010 at 05:23:19PM +0100, Peter Zijlstra wrote:
> > > [...]
>
> Oh, and RCU is the only thing you need to worry about is it?
It seems that I need to hook there for the time accounting now,
to account the time spent in userspace vs. in syscalls, exceptions
and irqs as user or system time. But instead of doing that from the
tick, I need to compute the deltas from the hooks, and also handle
the fact that the tick can be restarted at any time, and so on...
Plus the update_curr() thing and so on...
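Roughly something like this (a sketch only: the per-cpu timestamp,
the hook placement and ns_to_cputime() are hypothetical,
account_system_time() and sched_clock() are the existing helpers):

	static DEFINE_PER_CPU(u64, nohz_kernel_enter_ts);	/* hypothetical */

	/* On kernel entry (syscall/exception hook): */
	void nohz_task_account_enter(void)
	{
		__get_cpu_var(nohz_kernel_enter_ts) = sched_clock();
	}

	/* On return to userspace: charge the whole kernel stay as
	 * system time. ns_to_cputime() stands in for whatever ns ->
	 * cputime conversion is appropriate. */
	void nohz_task_account_exit(void)
	{
		u64 delta = sched_clock() - __get_cpu_var(nohz_kernel_enter_ts);

		account_system_time(current, 0, ns_to_cputime(delta),
				    ns_to_cputime(delta));
	}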
On Tue, Dec 21, 2010 at 09:34:39AM -0500, Steven Rostedt wrote:
> On Tue, 2010-12-21 at 14:22 +0100, Frederic Weisbecker wrote:
> > [...]
>
> Exactly!
>
> This is an RFC, a "release early, release often" approach. And you Cc'd
> all the right people. This is not something I expect to make into 2.6.38
> though ;-)
No, even before posting I was far from expecting it to make .38 ;)
On Tue, Dec 21, 2010 at 08:41:14AM +0100, Peter Zijlstra wrote:
> On Tue, 2010-12-21 at 00:52 +0100, Frederic Weisbecker wrote:
> > On Mon, Dec 20, 2010 at 05:02:09PM +0100, Peter Zijlstra wrote:
> > > On Mon, 2010-12-20 at 16:24 +0100, Frederic Weisbecker wrote:
> > > > If a cpu is in nohz mode due to a nohz task running, then
> > > > it is not able to notify quiescent states requested by other
> > > > CPUs.
> > > >
> > > > Then restart the tick to remotely force the quiescent states on the
> > > > nohz task cpus.
> > >
> > > -ENOPARSE.. if its in NOHZ state, it couldn't possibly need to
> > > participate in the quiescent state machine because the cpu is in a
> > > quiescent state and has 0 RCU activity.
> >
> > But it can be in nohz state in the kernel in which case it can have
> > any RCU activity.
>
> That still doesn't make sense.. if you're in nohz state there shouldn't
> be any rcu activity, otherwise its not nohz is it?
This was one of the outcomes of the LPC session that Frederic presented
at, along with subsequent discussions.
The initial thought was that the kernel would exit nohz mode on each
transition into the kernel, as it currently does for nohz-idle in response
to interrupts and NMIs. However, Frederic found that restarting the
tick on each system call was problematic, as he noted during his LPC
presentation. The alternative is to make CPUs stay out of nohz-task
mode if they have RCU work (as you note above).
However, this also fails if a nohz-task CPU stays in a long-running
system call. It has the tick turned off, so is ignoring RCU. It has
informed RCU that it is in the kernel, so RCU cannot ignore it.
The result would be stalled grace periods.
The solution to this impasse is to note that if there is a grace period
in progress, then there must also be an RCU callback queued on some
CPU somewhere in the system. This CPU cannot be in nohz mode, because
rcu_needs_cpu() won't let the CPU enter nohz mode. This CPU therefore
has the tick running, and can therefore IPI the CPU that is in the
long-running system call. This IPI can replace the tick, but only
when needed.
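In code terms, the stop-tick decision above amounts to something like
this (nohz_task_can_stop_tick() is the name used in Frederic's series;
the body here is only illustrative):

	static int nohz_task_can_stop_tick(void)
	{
		int cpu = smp_processor_id();

		/* Another runnable task needs the tick for preemption. */
		if (this_rq()->nr_running > 1)
			return 0;

		/* Queued callbacks or other pending RCU work keep the
		 * tick so that grace periods keep making progress. */
		if (rcu_needs_cpu(cpu) || rcu_pending(cpu))
			return 0;

		return 1;
	}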
Of course, this IPI is only needed for some strange system call that
chews up several ticks of CPU-bound execution. Or that is being hammered
by interrupts. So these IPIs should be rare, but are required for
correctness in this corner case. In the normal case, there would be
no RCU callback, so there would be no IPIs -- as is desired for the
workloads that want nohz-task.
Make sense?
Thanx, Paul
On Tue, Dec 21, 2010 at 03:56:36PM +0200, Avi Kivity wrote:
> On 12/21/2010 01:33 AM, Frederic Weisbecker wrote:
> >On Mon, Dec 20, 2010 at 10:44:46AM -0500, Steven Rostedt wrote:
> >> On Mon, 2010-12-20 at 16:24 +0100, Frederic Weisbecker wrote:
> >> > [...]
> >> > It works through the procfs interface:
> >> >
> >> > echo 1 > /proc/self/nohz
> >>
> >> I wounder if we could just have this happen automatically.
> >
> >But this would add some global overhead, especially in the syscall
> >path as we need to take the slow path to hook userspace resume/exit.
>
> This is zero as there is already a test on thread_info->flags on the
> syscall exit path.
>
> (or I misunderstood you - what's the purpose of the slow path?)
Yes, but in most cases threads don't have flags that route them to the
syscall slow path, which calls syscall_trace_enter() and
syscall_trace_leave() as proxies; they call the syscall handler
directly.
But nohz tasks must use a TIF_ flag that forces the syscall
slow path, which is where the overhead is. So if you wanted to
make that automatic for every task, you would need to force the
syscall slow path for every task, and that is quite an overhead.
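For illustration, the entry path effectively does something like this
(simplified sketch, not the actual assembly; _TIF_WORK_SYSCALL_ENTRY
and _TIF_WORK_SYSCALL_EXIT are the existing x86 masks, to which this
series would add TIF_NOHZ):

	/* Syscall entry, conceptually: */
	if (unlikely(ti->flags & _TIF_WORK_SYSCALL_ENTRY))
		syscall_trace_enter(regs);	/* slow path */

	/* ... dispatch the syscall handler ... */

	if (unlikely(ti->flags & _TIF_WORK_SYSCALL_EXIT))
		syscall_trace_leave(regs);	/* slow path */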
On Tue, Dec 21, 2010 at 04:00:09PM +0200, Avi Kivity wrote:
> On 12/21/2010 10:14 AM, Peter Zijlstra wrote:
> >[...]
>
> For an example of a per-cpu flag that is checked on every exit with
> zero additional overhead on the flag clear case, look at
> TIF_USER_RETURN_NOTIFY.
Right, but the problem is actually that if we want to automate the nohz
attribute on every task, then you need to have this flag set for
all of these threads.
No problem with that, but if nobody wants the nohz attribute, we don't
need to force that slow path.
On 12/21/2010 07:05 PM, Frederic Weisbecker wrote:
> >
> > For an example of a per-cpu flag that is checked on every exit with
> > zero additional overhead on the flag clear case, look at
> > TIF_USER_RETURN_NOTIFY.
>
> Right, but the problem is actually that if we want to automate the nohz
> attribute on every tasks, then you need you have this flag set for
> all of these threads.
>
> No problem with that, but if nobody wants the nohz attribute, we don't
> need to force that slow path.
When the scheduler detects the task is all alone, it sets the flag; when
it blocks, or if another task joins, it drops the flag (at most one task
per cpu has the flag set).
Does that work?
--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.
On Mon, Dec 20, 2010 at 04:24:16PM +0100, Frederic Weisbecker wrote:
> In order to be able to enter/exit into rcu extended quiescent
> state from interrupt, we need to make rcu_enter_nohz() and
> rcu_exit_nohz() callable from interrupts.
>
> So, this proposes a new implementation of the rcu nohz fast path
> related helpers, where rcu_enter_nohz() or rcu_exit_nohz() can
> be called between rcu_irq_enter() and rcu_irq_exit() while keeping
> the existing semantics.
>
> We maintain three per cpu fields:
>
> - nohz indicates we entered extended quiescent state mode;
> we may or may not be in an interrupt even if that state is set.
>
> - irq_nest indicates we are in an irq. This counter is incremented on
> irq entry and decremented on irq exit. This includes NMIs.
>
> - qs_seq is increased every time we see a true extended quiescent
> state:
> * When we call rcu_enter_nohz() and we are not in an irq.
> * When we exit the outermost nesting irq and we are in
> nohz mode (rcu_enter_nohz() was called without a pairing
> rcu_exit_nohz() yet).
>
> From these three fields we can deduce the extended grace periods like
> we did before, on top of snapshots and comparisons.
>
> If nohz == 1 and irq_nest == 0, we are in a quiescent state. qs_seq
> is used to keep track of elapsed extended quiescent states, useful
> to compare snapshots of rcu nohz state.
>
> This is experimental and does not take care of barriers yet.
Indeed!!!
I will likely be reworking the dyntick interface soon anyway, so will
try to make sure that your requirements are met.
Thanx, Paul
> Signed-off-by: Frederic Weisbecker <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Paul E. McKenney <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Steven Rostedt <[email protected]>
> Cc: Lai Jiangshan <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Anton Blanchard <[email protected]>
> Cc: Tim Pepper <[email protected]>
> ---
> kernel/rcutree.c | 103 ++++++++++++++++++++++-------------------------------
> kernel/rcutree.h | 12 +++----
> 2 files changed, 48 insertions(+), 67 deletions(-)
>
> diff --git a/kernel/rcutree.c b/kernel/rcutree.c
> index ed6aba3..1ac1a61 100644
> --- a/kernel/rcutree.c
> +++ b/kernel/rcutree.c
> @@ -129,10 +129,7 @@ void rcu_note_context_switch(int cpu)
> }
>
> #ifdef CONFIG_NO_HZ
> -DEFINE_PER_CPU(struct rcu_dynticks, rcu_dynticks) = {
> - .dynticks_nesting = 1,
> - .dynticks = 1,
> -};
> +DEFINE_PER_CPU(struct rcu_dynticks, rcu_dynticks);
> #endif /* #ifdef CONFIG_NO_HZ */
>
> static int blimit = 10; /* Maximum callbacks per softirq. */
> @@ -272,16 +269,15 @@ static int rcu_implicit_offline_qs(struct rcu_data *rdp)
> */
> void rcu_enter_nohz(void)
> {
> - unsigned long flags;
> struct rcu_dynticks *rdtp;
>
> - smp_mb(); /* CPUs seeing ++ must see prior RCU read-side crit sects */
> - local_irq_save(flags);
> + preempt_disable();
> rdtp = &__get_cpu_var(rcu_dynticks);
> - rdtp->dynticks++;
> - rdtp->dynticks_nesting--;
> - WARN_ON_ONCE(rdtp->dynticks & 0x1);
> - local_irq_restore(flags);
> + WARN_ON_ONCE(rdtp->nohz);
> + rdtp->nohz = 1;
> + if (!rdtp->irq_nest)
> + local_inc(&rdtp->qs_seq);
> + preempt_enable();
> }
>
> /*
> @@ -292,16 +288,13 @@ void rcu_enter_nohz(void)
> */
> void rcu_exit_nohz(void)
> {
> - unsigned long flags;
> struct rcu_dynticks *rdtp;
>
> - local_irq_save(flags);
> + preempt_disable();
> rdtp = &__get_cpu_var(rcu_dynticks);
> - rdtp->dynticks++;
> - rdtp->dynticks_nesting++;
> - WARN_ON_ONCE(!(rdtp->dynticks & 0x1));
> - local_irq_restore(flags);
> - smp_mb(); /* CPUs seeing ++ must see later RCU read-side crit sects */
> + WARN_ON_ONCE(!rdtp->nohz);
> + rdtp->nohz = 0;
> + preempt_enable();
> }
>
> /**
> @@ -313,13 +306,7 @@ void rcu_exit_nohz(void)
> */
> void rcu_nmi_enter(void)
> {
> - struct rcu_dynticks *rdtp = &__get_cpu_var(rcu_dynticks);
> -
> - if (rdtp->dynticks & 0x1)
> - return;
> - rdtp->dynticks_nmi++;
> - WARN_ON_ONCE(!(rdtp->dynticks_nmi & 0x1));
> - smp_mb(); /* CPUs seeing ++ must see later RCU read-side crit sects */
> + rcu_irq_enter();
> }
>
> /**
> @@ -331,13 +318,7 @@ void rcu_nmi_enter(void)
> */
> void rcu_nmi_exit(void)
> {
> - struct rcu_dynticks *rdtp = &__get_cpu_var(rcu_dynticks);
> -
> - if (rdtp->dynticks & 0x1)
> - return;
> - smp_mb(); /* CPUs seeing ++ must see prior RCU read-side crit sects */
> - rdtp->dynticks_nmi++;
> - WARN_ON_ONCE(rdtp->dynticks_nmi & 0x1);
> + rcu_irq_exit();
> }
>
> /**
> @@ -350,11 +331,7 @@ void rcu_irq_enter(void)
> {
> struct rcu_dynticks *rdtp = &__get_cpu_var(rcu_dynticks);
>
> - if (rdtp->dynticks_nesting++)
> - return;
> - rdtp->dynticks++;
> - WARN_ON_ONCE(!(rdtp->dynticks & 0x1));
> - smp_mb(); /* CPUs seeing ++ must see later RCU read-side crit sects */
> + rdtp->irq_nest++;
> }
>
> /**
> @@ -368,11 +345,11 @@ void rcu_irq_exit(void)
> {
> struct rcu_dynticks *rdtp = &__get_cpu_var(rcu_dynticks);
>
> - if (--rdtp->dynticks_nesting)
> + if (--rdtp->irq_nest)
> return;
> - smp_mb(); /* CPUs seeing ++ must see prior RCU read-side crit sects */
> - rdtp->dynticks++;
> - WARN_ON_ONCE(rdtp->dynticks & 0x1);
> +
> + if (rdtp->nohz)
> + local_inc(&rdtp->qs_seq);
>
> /* If the interrupt queued a callback, get out of dyntick mode. */
> if (__get_cpu_var(rcu_sched_data).nxtlist ||
> @@ -390,15 +367,19 @@ void rcu_irq_exit(void)
> static int dyntick_save_progress_counter(struct rcu_data *rdp)
> {
> int ret;
> - int snap;
> - int snap_nmi;
> + int snap_nohz;
> + int snap_irq_nest;
> + long snap_qs_seq;
>
> - snap = rdp->dynticks->dynticks;
> - snap_nmi = rdp->dynticks->dynticks_nmi;
> + snap_nohz = rdp->dynticks->nohz;
> + snap_irq_nest = rdp->dynticks->irq_nest;
> + snap_qs_seq = local_read(&rdp->dynticks->qs_seq);
> smp_mb(); /* Order sampling of snap with end of grace period. */
> - rdp->dynticks_snap = snap;
> - rdp->dynticks_nmi_snap = snap_nmi;
> - ret = ((snap & 0x1) == 0) && ((snap_nmi & 0x1) == 0);
> + rdp->dynticks_snap.nohz = snap_nohz;
> + rdp->dynticks_snap.irq_nest = snap_irq_nest;
> + local_set(&rdp->dynticks_snap.qs_seq, snap_qs_seq);
> +
> + ret = (snap_nohz && !snap_irq_nest);
> if (ret)
> rdp->dynticks_fqs++;
> return ret;
> @@ -412,15 +393,10 @@ static int dyntick_save_progress_counter(struct rcu_data *rdp)
> */
> static int rcu_implicit_dynticks_qs(struct rcu_data *rdp)
> {
> - long curr;
> - long curr_nmi;
> - long snap;
> - long snap_nmi;
> + struct rcu_dynticks curr, snap;
>
> - curr = rdp->dynticks->dynticks;
> + curr = *rdp->dynticks;
> snap = rdp->dynticks_snap;
> - curr_nmi = rdp->dynticks->dynticks_nmi;
> - snap_nmi = rdp->dynticks_nmi_snap;
> smp_mb(); /* force ordering with cpu entering/leaving dynticks. */
>
> /*
> @@ -431,14 +407,21 @@ static int rcu_implicit_dynticks_qs(struct rcu_data *rdp)
> * read-side critical section that started before the beginning
> * of the current RCU grace period.
> */
> - if ((curr != snap || (curr & 0x1) == 0) &&
> - (curr_nmi != snap_nmi || (curr_nmi & 0x1) == 0)) {
> - rdp->dynticks_fqs++;
> - return 1;
> - }
> + if (curr.nohz && !curr.irq_nest)
> + goto dynticks_qs;
> +
> + if (snap.nohz && !snap.irq_nest)
> + goto dynticks_qs;
> +
> + if (local_read(&curr.qs_seq) != local_read(&snap.qs_seq))
> + goto dynticks_qs;
>
> /* Go check for the CPU being offline. */
> return rcu_implicit_offline_qs(rdp);
> +
> +dynticks_qs:
> + rdp->dynticks_fqs++;
> + return 1;
> }
>
> #endif /* #ifdef CONFIG_SMP */
> diff --git a/kernel/rcutree.h b/kernel/rcutree.h
> index 91d4170..215e431 100644
> --- a/kernel/rcutree.h
> +++ b/kernel/rcutree.h
> @@ -27,6 +27,7 @@
> #include <linux/threads.h>
> #include <linux/cpumask.h>
> #include <linux/seqlock.h>
> +#include <asm/local.h>
>
> /*
> * Define shape of hierarchy based on NR_CPUS and CONFIG_RCU_FANOUT.
> @@ -79,11 +80,9 @@
> * Dynticks per-CPU state.
> */
> struct rcu_dynticks {
> - int dynticks_nesting; /* Track nesting level, sort of. */
> - int dynticks; /* Even value for dynticks-idle, else odd. */
> - int dynticks_nmi; /* Even value for either dynticks-idle or */
> - /* not in nmi handler, else odd. So this */
> - /* remains even for nmi from irq handler. */
> + int nohz;
> + local_t qs_seq;
> + int irq_nest;
> };
>
> /*
> @@ -212,8 +211,7 @@ struct rcu_data {
> #ifdef CONFIG_NO_HZ
> /* 3) dynticks interface. */
> struct rcu_dynticks *dynticks; /* Shared per-CPU dynticks state. */
> - int dynticks_snap; /* Per-GP tracking for dynticks. */
> - int dynticks_nmi_snap; /* Per-GP tracking for dynticks_nmi. */
> + struct rcu_dynticks dynticks_snap;
> #endif /* #ifdef CONFIG_NO_HZ */
>
> /* 4) reasons this CPU needed to be kicked by force_quiescent_state */
> --
> 1.7.3.2
>
On Tue, Dec 21, 2010 at 11:26:35AM -0800, Paul E. McKenney wrote:
> On Mon, Dec 20, 2010 at 04:24:16PM +0100, Frederic Weisbecker wrote:
> > In order to be able to enter/exit into rcu extended quiescent
> > state from interrupt, we need to make rcu_enter_nohz() and
> > rcu_exit_nohz() callable from interrupts.
> > [...]
> >
> > This is experimental and does not take care of barriers yet.
>
> Indeed!!!
>
> I will likely be reworking the dyntick interface soon anyway, so will
> try to make sure that your requirements are met.
Great, thanks!
On Mon, Dec 20, 2010 at 04:24:17PM +0100, Frederic Weisbecker wrote:
> A nohz task can safely enter into extended quiescent state when
> it goes into userspace; this avoids a remote cpu having to force the
> nohz task to be interrupted in order to notify quiescent states.
>
> We enter into an extended quiescent state when:
>
> - A nohz task resumes to userspace and is alone running on the
> CPU (we check if the local cpu is in nohz mode, which means
> no other task competes on that CPU). If the tick is still running
> then entering into extended QS will be done later from the second
> case:
>
> - When the tick stops and verifies that the current task is a nohz one,
> is alone running on the CPU, and runs in userspace.
>
> We exit the extended quiescent state when:
>
> - A nohz task enters the kernel and is alone running on the CPU.
> Again we check if the local cpu is in nohz mode for that. If
> the tick is still running then it means we are not in an extended
> QS and we don't do anything.
>
> - The tick restarts because a new task is enqueued.
>
> Whether the nohz task is in userspace or not is tracked by the
> per cpu nohz_task_ext_qs variable.
>
> Architectures need to provide some backend to notify userspace
> exit/entry in order to support this mode.
> It needs to implement the TIF_NOHZ flag that switches to slow
> path syscall mode and to notify exceptions entry/exit.
>
> We don't need to handle irqs or nmis as those are already handled
> by RCU through rcu_enter_irq/nmi helpers.
One question below...
> Signed-off-by: Frederic Weisbecker <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Paul E. McKenney <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Steven Rostedt <[email protected]>
> Cc: Lai Jiangshan <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Anton Blanchard <[email protected]>
> Cc: Tim Pepper <[email protected]>
> ---
> arch/Kconfig | 4 +++
> include/linux/tick.h | 16 ++++++++++-
> kernel/sched.c | 3 ++
> kernel/time/tick-sched.c | 61 +++++++++++++++++++++++++++++++++++++++++++++-
> 4 files changed, 81 insertions(+), 3 deletions(-)
>
> diff --git a/arch/Kconfig b/arch/Kconfig
> index e631791..d1ebea3 100644
> --- a/arch/Kconfig
> +++ b/arch/Kconfig
> @@ -177,5 +177,9 @@ config HAVE_ARCH_JUMP_LABEL
>
> config HAVE_NO_HZ_TASK
> bool
> + help
> + Features necessary hooks for a task wanting to enter nohz
> + while running alone on a CPU: thread flag for syscall hooks
> + and exceptions entry/exit hooks.
>
> source "kernel/gcov/Kconfig"
> diff --git a/include/linux/tick.h b/include/linux/tick.h
> index 7465a47..a704bb7 100644
> --- a/include/linux/tick.h
> +++ b/include/linux/tick.h
> @@ -8,6 +8,7 @@
>
> #include <linux/clockchips.h>
> #include <linux/percpu-defs.h>
> +#include <asm/ptrace.h>
>
> #ifdef CONFIG_GENERIC_CLOCKEVENTS
>
> @@ -130,10 +131,21 @@ extern u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time);
>
> #ifdef CONFIG_NO_HZ_TASK
> DECLARE_PER_CPU(int, task_nohz_mode);
> +DECLARE_PER_CPU(int, nohz_task_ext_qs);
> +
> +extern void tick_nohz_task_enter_kernel(void);
> +extern void tick_nohz_task_exit_kernel(void);
> +extern void tick_nohz_task_enter_exception(struct pt_regs *regs);
> +extern void tick_nohz_task_exit_exception(struct pt_regs *regs);
> extern int tick_nohz_task_mode(void);
> -#else
> +
> +#else /* !NO_HZ_TASK */
> +static inline void tick_nohz_task_enter_kernel(void) { }
> +static inline void tick_nohz_task_exit_kernel(void) { }
> +static inline void tick_nohz_task_enter_exception(struct pt_regs *regs) { }
> +static inline void tick_nohz_task_exit_exception(struct pt_regs *regs) { }
> static inline int tick_nohz_task_mode(void) { return 0; }
> -#endif
> +#endif /* !NO_HZ_TASK */
>
> # else /* !NO_HZ */
> static inline void tick_nohz_stop_sched_tick(int inidle) { }
> diff --git a/kernel/sched.c b/kernel/sched.c
> index b99f192..4412493 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -2464,6 +2464,9 @@ static void nohz_task_cpu_update(void *unused)
> if (rq->nr_running > 1 || rcu_pending(cpu) || rcu_needs_cpu(cpu)) {
If the task enters a system call in nohz mode, and then that system call
enqueues an RCU callback, this code path will pull that CPU out of nohz
mode, right?
Thanx, Paul
> __get_cpu_var(task_nohz_mode) = 0;
> tick_nohz_restart_sched_tick();
> +
> + if (__get_cpu_var(nohz_task_ext_qs))
> + rcu_exit_nohz();
> }
> }
>
> diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> index 88011b9..9a4aa39 100644
> --- a/kernel/time/tick-sched.c
> +++ b/kernel/time/tick-sched.c
> @@ -720,6 +720,62 @@ void tick_check_idle(int cpu)
> }
>
> #ifdef CONFIG_NO_HZ_TASK
> +DEFINE_PER_CPU(int, nohz_task_ext_qs);
> +
> +void tick_nohz_task_exit_kernel(void)
> +{
> + unsigned long flags;
> +
> + if (!test_thread_flag(TIF_NOHZ))
> + return;
> +
> + local_irq_save(flags);
> +
> + __get_cpu_var(nohz_task_ext_qs) = 1;
> + /*
> + * Only enter extended QS if the tick is not running.
> + * Otherwise the tick will handle that later when it
> + * will decide to stop.
> + */
> + if (__get_cpu_var(task_nohz_mode))
> + rcu_enter_nohz();
> +
> + local_irq_restore(flags);
> +}
> +
> +void tick_nohz_task_enter_kernel(void)
> +{
> + unsigned long flags;
> +
> + if (!test_thread_flag(TIF_NOHZ))
> + return;
> +
> + local_irq_save(flags);
> +
> + __get_cpu_var(nohz_task_ext_qs) = 0;
> + /*
> + * If the tick was running, then we weren't in
> + * rcu extended period. Only exit extended QS
> + * if we were in such state.
> + */
> + if (__get_cpu_var(task_nohz_mode))
> + rcu_exit_nohz();
> +
> + local_irq_restore(flags);
> +}
> +
> +void tick_nohz_task_enter_exception(struct pt_regs *regs)
> +{
> + if (user_mode(regs))
> + tick_nohz_task_enter_kernel();
> +}
> +
> +void tick_nohz_task_exit_exception(struct pt_regs *regs)
> +{
> + if (user_mode(regs))
> + tick_nohz_task_exit_kernel();
> +}
> +
> int tick_nohz_task_mode(void)
> {
> return __get_cpu_var(task_nohz_mode);
> @@ -730,8 +786,11 @@ static void tick_nohz_task_stop_tick(void)
> if (!test_thread_flag(TIF_NOHZ) || __get_cpu_var(task_nohz_mode))
> return;
>
> - if (nohz_task_can_stop_tick())
> + if (nohz_task_can_stop_tick()) {
> __get_cpu_var(task_nohz_mode) = 1;
> + if (__get_cpu_var(nohz_task_ext_qs))
> + rcu_enter_nohz();
> + }
> }
> #else
> static void tick_nohz_task_stop_tick(void) { }
> --
> 1.7.3.2
>
On Tue, Dec 21, 2010 at 08:17:33PM +0200, Avi Kivity wrote:
> On 12/21/2010 07:05 PM, Frederic Weisbecker wrote:
> >>
> >> For an example of a per-cpu flag that is checked on every exit with
> >> zero additional overhead on the flag clear case, look at
> >> TIF_USER_RETURN_NOTIFY.
> >
> >Right, but the problem is actually that if we want to automate the nohz
>attribute on every task, then you need to have this flag set for
> >all of these threads.
> >
> >No problem with that, but if nobody wants the nohz attribute, we don't
> >need to force that slow path.
>
> When the scheduler detects the task is all alone, it sets the flag;
> when it blocks, or if another task joins, it drops the flag (at most
> one task per cpu has the flag set).
>
> Does that work?
Makes sense. And that integrates well with Peter's idea of creating a
new cpuset attribute for the nohz tasks.
But instead of making this detection from the scheduler, I think this
should be done from the tick: if there is only one task running, set
its TIF flag, as in the sketch below.
But anyway, that's an optimisation. We can start with setting that flag
on every task in that cpuset.
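A rough sketch of that tick-driven detection (placement hypothetical;
set_tsk_thread_flag()/clear_tsk_thread_flag() are existing helpers):

	/* From the tick handler, conceptually: */
	static void nohz_task_check_tick(struct rq *rq)
	{
		if (rq->nr_running == 1)
			set_tsk_thread_flag(rq->curr, TIF_NOHZ);
		else
			clear_tsk_thread_flag(rq->curr, TIF_NOHZ);
	}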
On Tue, Dec 21, 2010 at 11:28:49AM -0800, Paul E. McKenney wrote:
> On Mon, Dec 20, 2010 at 04:24:17PM +0100, Frederic Weisbecker wrote:
> > [...]
> > diff --git a/kernel/sched.c b/kernel/sched.c
> > index b99f192..4412493 100644
> > --- a/kernel/sched.c
> > +++ b/kernel/sched.c
> > @@ -2464,6 +2464,9 @@ static void nohz_task_cpu_update(void *unused)
> > if (rq->nr_running > 1 || rcu_pending(cpu) || rcu_needs_cpu(cpu)) {
>
> If the task enters a system call in nohz mode, and then that system call
> enqueues an RCU callback, this code path will pull that CPU out of nohz
> mode, right?
>
> Thanx, Paul
Hmm, no, because this code path is only called after RCU or the scheduler
sends an IPI. And RCU won't call it after it enqueues a callback.
I did not think about that. If every other CPU is in extended quiescent
state, nobody will take care of the grace period completion, unless we are
lucky in the whole GP completion scenario. And at least the current CPU
that enqueues the callbacks is supposed to take care of that grace period
completion, right?
So I guess I need to restart the tick from there too.
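Roughly (task_nohz_mode and tick_nohz_restart_sched_tick() are from
this series; the call site in the callback enqueue path is what would
be new):

	/* In the call_rcu() enqueue path, on a nohz task cpu: */
	if (__get_cpu_var(task_nohz_mode)) {
		__get_cpu_var(task_nohz_mode) = 0;
		tick_nohz_restart_sched_tick();
	}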
On Tue, Dec 21, 2010 at 10:49:43PM +0100, Frederic Weisbecker wrote:
> On Tue, Dec 21, 2010 at 11:28:49AM -0800, Paul E. McKenney wrote:
> > On Mon, Dec 20, 2010 at 04:24:17PM +0100, Frederic Weisbecker wrote:
> > > [...]
> > > diff --git a/kernel/sched.c b/kernel/sched.c
> > > index b99f192..4412493 100644
> > > --- a/kernel/sched.c
> > > +++ b/kernel/sched.c
> > > @@ -2464,6 +2464,9 @@ static void nohz_task_cpu_update(void *unused)
> > > if (rq->nr_running > 1 || rcu_pending(cpu) || rcu_needs_cpu(cpu)) {
> >
> > If the task enters a system call in nohz mode, and then that system call
> > enqueues an RCU callback, this code path will pull that CPU out of nohz
> > mode, right?
> >
> > Thanx, Paul
>
> Hmm, no, because this code path is only called after RCU or the scheduler
> sends an IPI. And RCU won't call it after it enqueues a callback.
>
> I did not think about that. If every other CPU is in extended quiescent
> state, nobody will take care of the grace period completion, unless we are
> lucky in the whole GP completion scenario. And at least the current CPU
> that enqueues the callbacks is supposed to take care of that grace period
> completion, right?
>
> So I guess I need to restart the tick from there too.
Please!!! ;-)
Thanx, Paul
On 12/21/2010 11:08 PM, Frederic Weisbecker wrote:
> On Tue, Dec 21, 2010 at 08:17:33PM +0200, Avi Kivity wrote:
> > On 12/21/2010 07:05 PM, Frederic Weisbecker wrote:
> > > [...]
> >
> > When the scheduler detects the task is all alone, it sets the flag;
> > when it blocks, or if another task joins, it drops the flag (at most
> > one task per cpu has the flag set).
> >
> > Does that work?
>
> Makes sense. And that integrates well with Peter's idea of creating a
> new cpuset attribute for the nohz tasks.
>
> But instead of making this detection from the scheduler, I think this
> should be done from the tick: if there is only one task running, set
> its TIF flag.
>
> But anyway, that's an optimisation. We can start with setting that flag
> on every task in that cpuset.
So long as we start without the new knob.
--
error compiling committee.c: too many arguments to function
On Wed, 2010-12-22 at 11:22 +0200, Avi Kivity wrote:
> > Makes sense. And that integrates well with Peter's idea of creating a
> > new cpuset attribute for the nohz tasks.
> >
> > But instead of making this detection from the scheduler, I think this
> > should be done from the tick: if there is only one task running, set
> > it the TF flag.
> >
> > But anyway, that's an optimisation. We can start with setting that flag
> > on every task in that cpuset.
>
> So long as we start without the new knob.
Right, so one of the things we can do is let the tick disable itself
when it finds there is no pending work left and set the TIF bit when
needed. We should then also rate-limit things so as not to
enable/disable the tick too often, but that would potentially allow us
to do away with all knobs.
On Wed, Dec 22, 2010 at 10:51:44AM +0100, Peter Zijlstra wrote:
> On Wed, 2010-12-22 at 11:22 +0200, Avi Kivity wrote:
> > > Makes sense. And that integrates well with Peter's idea of creating a
> > > new cpuset attribute for the nohz tasks.
> > >
> > > But instead of making this detection from the scheduler, I think this
> > > should be done from the tick: if there is only one task running, set
> > > it the TF flag.
> > >
> > > But anyway, that's an optimisation. We can start with setting that flag
> > > on every task in that cpuset.
> >
> > So long as we start without the new knob.
>
> Right, so one of the things we can do is let the tick disable itself
> when it finds there is no pending work left and set the TIF bit when
> needed.
Right.
Now I'm thinking about potential races. If a tick happens between the end of
the syscall path and the resume to userspace, it can set the TIF flag,
but too late. So the task resumes userspace without being in an
extended QS.
No big deal though, it's easy to fix up.
Another possible race: a task runs alone with the flag set. A new task gets
enqueued, so we send the IPI. When the CPU receives the IPI, is "current"
still the task that was previously in nohz mode, or the freshly enqueued
one? In the second case it becomes hard to clear the flag.
Probably I'll need to hook into the enqueue_task() path to fix that up.
> We should then also rate-limit things so as not to
> enable/disable the tick too often, but that would potentially allow us
> to do away with all knobs.
Right. Before I posted this, I actually had a minimum duration threshold for
the tick: even if we can stop it, wait x more ns, x being an arbitrary
constant (see the sketch below).
But that was complicating things and I wasn't sure there was a real gain.
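For reference, the idea amounted to something like this (sketch only,
all names hypothetical):

	static DEFINE_PER_CPU(u64, last_tick_restart_ns);
	static u64 nohz_task_min_off_delay_ns = 1000000;	/* arbitrary */

	/* Checked before actually stopping the tick: */
	static int nohz_task_rate_limit_ok(void)
	{
		return sched_clock() - __get_cpu_var(last_tick_restart_ns)
			>= nohz_task_min_off_delay_ns;
	}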
On 12/20/2010 11:24 PM, Frederic Weisbecker wrote:
>
> +config NO_HZ_TASK
> + bool "Tickless task"
> + depends on HAVE_NO_HZ_TASK && NO_HZ && SMP && HIGH_RES_TIMERS
> + help
> + When a task runs alone on a CPU and switches into this mode,
> + the timer interrupt will only trigger when it is strictly
> + needed.
Why does it depend on SMP?
On Fri, 2010-12-24 at 16:00 +0800, Lai Jiangshan wrote:
> On 12/20/2010 11:24 PM, Frederic Weisbecker wrote:
> >
> > +config NO_HZ_TASK
> > + bool "Tickless task"
> > + depends on HAVE_NO_HZ_TASK && NO_HZ && SMP && HIGH_RES_TIMERS
> > + help
> > + When a task runs alone on a CPU and switches into this mode,
> > + the timer interrupt will only trigger when it is strictly
> > + needed.
>
> Why does it depend on SMP?
>
I guess that's because you need at least a ticking CPU for timekeeping
and stuff... Right?
Regards,
Dario
--
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------
Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa (Italy)
http://retis.sssup.it/people/faggioli -- [email protected]
On Fri, Dec 24, 2010 at 09:19:56AM +0100, Dario Faggioli wrote:
> On Fri, 2010-12-24 at 16:00 +0800, Lai Jiangshan wrote:
> > On 12/20/2010 11:24 PM, Frederic Weisbecker wrote:
> > >
> > > +config NO_HZ_TASK
> > > + bool "Tickless task"
> > > + depends on HAVE_NO_HZ_TASK && NO_HZ && SMP && HIGH_RES_TIMERS
> > > + help
> > > + When a task runs alone on a CPU and switches into this mode,
> > > + the timer interrupt will only trigger when it is strictly
> > > + needed.
> >
> > Why does it depend on SMP?
> >
> I guess that's because you need at least a ticking CPU for timekeeping
> and stuff... Right?
>
> Regards,
> Dario
Exactly.
In the next version I'll probably drop that requirement. On UP a single task
must still be able to be in nohz mode when it is in userspace, as timekeeping
doesn't matter there until we reenter the kernel.
Dunno, I'll see what I can do.