Ingo,
Please consider the following branch:
git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git
timers/nohz-posix-timers
Posix cpu timers, the perf events tick and the sched clock tick need to be
handled in two parts:
1) Whenever a posix cpu timer is enqueued, an entry is added to the perf
rotation list, or the sched clock is marked unstable, we need to kick the
full dynticks CPUs by way of IPIs so that they reconsider their tick state.
2) Before a full dynticks CPU decides to stop its tick, it must check that no
posix cpu timer is queued, that the perf rotation list is empty and that the
sched clock is stable (a rough sketch of this check follows below).
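To give an idea of what 2) will eventually look like, here is a minimal
sketch of the kind of check a full dynticks CPU could run before stopping
its tick. The helper names (posix_cpu_timers_can_stop_tick(),
perf_event_can_stop_tick(), sched_clock_is_stable()) are only illustrative
assumptions, they are not part of this branch:

	/*
	 * Illustrative sketch only: all the helpers below are hypothetical,
	 * the real infrastructure comes with later patches.
	 */
	static bool can_stop_full_tick(void)
	{
		/* Pending posix cpu timers still need the tick to be sampled */
		if (!posix_cpu_timers_can_stop_tick(current))
			return false;

		/* A non-empty perf rotation list still needs the tick */
		if (!perf_event_can_stop_tick())
			return false;

		/* An unstable sched clock needs the periodic tick to stay sane */
		if (!sched_clock_is_stable())
			return false;

		return true;
	}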
This branch handles the first part (which is the trickiest) for posix cpu timers,
and also lays down the necessary infrastructure to do the same for perf and
sched clock in further pull requests.
As usual, the approach can be further optimized but it's not yet optimization time.
Thanks.
Frederic Weisbecker (3):
nohz: New APIs to re-evaluate the tick on full dynticks CPUs
posix_timers: Defer per process timer stop after timers processing
posix_timers: Kick full dynticks CPUs when a posix cpu timer is armed
include/linux/tick.h | 4 +++
kernel/posix-cpu-timers.c | 35 ++++++++++++++++++++++++++++--
kernel/time/Kconfig | 1 +
kernel/time/tick-sched.c | 51 +++++++++++++++++++++++++++++++++++++++++++++
4 files changed, 88 insertions(+), 3 deletions(-)
--
1.7.5.4
Provide two new helpers to notify the full dynticks CPUs about internal
system changes that may require them to reconsider the state of their
tick. Practical examples include: posix cpu timers, the perf tick and the
sched clock tick.
For now the notification handler, invoked through IPIs, is a stub. It will
be filled in once the tick stop/restart infrastructure gets merged.
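As a usage illustration only (no such call site is added by this patch, the
real users come with the follow-up patches), a subsystem arming a new tick
dependency would do something like:

	/* Illustrative caller, not part of this patch */
	static void arm_some_tick_dependency(void)
	{
		/* ... record the new dependency, e.g. enqueue a timer ... */

		/*
		 * Then kick the full dynticks CPUs so that they re-evaluate
		 * whether they can keep their tick stopped.
		 */
		tick_nohz_full_kick_all();
	}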
Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Chris Metcalf <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Geoff Levand <[email protected]>
Cc: Gilad Ben Yossef <[email protected]>
Cc: Hakan Akkan <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Kevin Hilman <[email protected]>
Cc: Li Zhong <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Paul Gortmaker <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Thomas Gleixner <[email protected]>
---
include/linux/tick.h | 4 +++
kernel/time/Kconfig | 1 +
kernel/time/tick-sched.c | 51 ++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 56 insertions(+), 0 deletions(-)
diff --git a/include/linux/tick.h b/include/linux/tick.h
index b4e3b0c..c2dcfb1 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -159,8 +159,12 @@ static inline u64 get_cpu_iowait_time_us(int cpu, u64 *unused) { return -1; }
#ifdef CONFIG_NO_HZ_FULL
extern int tick_nohz_full_cpu(int cpu);
+extern void tick_nohz_full_kick(void);
+extern void tick_nohz_full_kick_all(void);
#else
static inline int tick_nohz_full_cpu(int cpu) { return 0; }
+static inline void tick_nohz_full_kick(void) { }
+static inline void tick_nohz_full_kick_all(void) { }
#endif
diff --git a/kernel/time/Kconfig b/kernel/time/Kconfig
index 358d601..fbb4c7e 100644
--- a/kernel/time/Kconfig
+++ b/kernel/time/Kconfig
@@ -111,6 +111,7 @@ config NO_HZ_FULL
select RCU_USER_QS
select RCU_NOCB_CPU
select CONTEXT_TRACKING_FORCE
+ select IRQ_WORK
help
Adaptively try to shutdown the tick whenever possible, even when
the CPU is running tasks. Typically this requires running a single
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 369b576..2bcad5b 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -147,6 +147,57 @@ static void tick_sched_handle(struct tick_sched *ts, struct pt_regs *regs)
static cpumask_var_t nohz_full_mask;
bool have_nohz_full_mask;
+/*
+ * Re-evaluate the need for the tick on the current CPU
+ * and restart it if necessary.
+ */
+static void tick_nohz_full_check(void)
+{
+ /*
+ * STUB for now, will be filled with the full tick stop/restart
+ * infrastructure patches
+ */
+}
+
+static void nohz_full_kick_work_func(struct irq_work *work)
+{
+ tick_nohz_full_check();
+}
+
+static DEFINE_PER_CPU(struct irq_work, nohz_full_kick_work) = {
+ .func = nohz_full_kick_work_func,
+};
+
+/*
+ * Kick the current CPU if it's full dynticks in order to force it to
+ * re-evaluate its dependency on the tick and restart it if necessary.
+ */
+void tick_nohz_full_kick(void)
+{
+ if (tick_nohz_full_cpu(smp_processor_id()))
+ irq_work_queue(&__get_cpu_var(nohz_full_kick_work));
+}
+
+static void nohz_full_kick_ipi(void *info)
+{
+ tick_nohz_full_check();
+}
+
+/*
+ * Kick all full dynticks CPUs in order to force these to re-evaluate
+ * their dependency on the tick and restart it if necessary.
+ */
+void tick_nohz_full_kick_all(void)
+{
+ if (!have_nohz_full_mask)
+ return;
+
+ preempt_disable();
+ smp_call_function_many(nohz_full_mask,
+ nohz_full_kick_ipi, NULL, false);
+ preempt_enable();
+}
+
int tick_nohz_full_cpu(int cpu)
{
if (!have_nohz_full_mask)
--
1.7.5.4
Kick the full dynticks CPUs when a posix cpu timer is enqueued by
way of a standard call to posix_cpu_timer_set() or set_process_cpu_timer().
This way they can re-evaluate the state of their tick (and possibly restart
it) in light of the new expiry.
Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Chris Metcalf <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Geoff Levand <[email protected]>
Cc: Gilad Ben Yossef <[email protected]>
Cc: Hakan Akkan <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Kevin Hilman <[email protected]>
Cc: Li Zhong <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Paul Gortmaker <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Thomas Gleixner <[email protected]>
---
kernel/posix-cpu-timers.c | 28 +++++++++++++++++++++++++++-
1 files changed, 27 insertions(+), 1 deletions(-)
diff --git a/kernel/posix-cpu-timers.c b/kernel/posix-cpu-timers.c
index 3b73ae8..a9053ad7 100644
--- a/kernel/posix-cpu-timers.c
+++ b/kernel/posix-cpu-timers.c
@@ -10,6 +10,8 @@
#include <linux/kernel_stat.h>
#include <trace/events/timer.h>
#include <linux/random.h>
+#include <linux/tick.h>
+#include <linux/workqueue.h>
/*
* Called after updating RLIMIT_CPU to run cpu timer and update
@@ -636,6 +638,26 @@ static int cpu_timer_sample_group(const clockid_t which_clock,
return 0;
}
+#ifdef CONFIG_NO_HZ_FULL
+static void nohz_kick_work_fn(struct work_struct *work)
+{
+ tick_nohz_full_kick_all();
+}
+
+static DECLARE_WORK(nohz_kick_work, nohz_kick_work_fn);
+
+/*
+ * We need the IPIs to be sent from sane process context.
+ * The posix cpu timers are always set with irqs disabled.
+ */
+static void posix_cpu_timer_kick_nohz(void)
+{
+ schedule_work(&nohz_kick_work);
+}
+#else
+static inline void posix_cpu_timer_kick_nohz(void) { }
+#endif
+
/*
* Guts of sys_timer_settime for CPU timers.
* This is called with the timer locked and interrupts disabled.
@@ -794,6 +816,8 @@ static int posix_cpu_timer_set(struct k_itimer *timer, int flags,
sample_to_timespec(timer->it_clock,
old_incr, &old->it_interval);
}
+ if (!ret)
+ posix_cpu_timer_kick_nohz();
return ret;
}
@@ -1369,7 +1393,7 @@ void set_process_cpu_timer(struct task_struct *tsk, unsigned int clock_idx,
}
if (!*newval)
- return;
+ goto out;
*newval += now.cpu;
}
@@ -1387,6 +1411,8 @@ void set_process_cpu_timer(struct task_struct *tsk, unsigned int clock_idx,
tsk->signal->cputime_expires.virt_exp = *newval;
break;
}
+out:
+ posix_cpu_timer_kick_nohz();
}
static int do_cpu_nanosleep(const clockid_t which_clock, int flags,
--
1.7.5.4
When we process the posix cpu timers from the tick, we run through
two passes:
1) collect the timers that want to fire and walk through the remaining
ones in order to set the next expiry time.
2) execute the firing timers, possibly reschedule some of them and
thus update the next expiry time again.
If no timers remain in the queue after the firing ones in the
first pass, we may call stop_process_timers() right there. If some firing
timers are then rescheduled in the second pass, this leaves a tiny window
during which we appear to have no process timers running, while we
are actually about to re-enqueue some.
This is a problem if there are full dynticks CPUs around. Running posix CPU
timers is one of the conditions that prevents a full dynticks CPU from
stopping its tick. If such a CPU observes this tiny window, by seeing
sig->cputimer.running == 0, it may wrongly stop its tick (see the schematic below).
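Schematically (a rough illustration of the window described above, not
actual code):

	CPU A (tick, runs the two passes)     Full dynticks CPU B
	---------------------------------     ---------------------------------
	pass 1: collect the firing timers,
	        queue looks empty
	        -> stop_process_timers()
	                                      sees sig->cputimer.running == 0
	                                      -> stops its tick
	pass 2: fire the timers, some get
	        rescheduled and re-enqueued
	                                      tick already stopped, the
	                                      re-enqueued timers may not be
	                                      serviced anymore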
One solution would be to send these CPUs a notification whenever we re-enqueue
timers in the second pass, so that they restart their tick accordingly. But
that looks like a sledgehammer.
The other solution is to defer the call to stop_process_timers() until after
the second pass, once we are really sure that no timers remain in the queue
by the very end of the posix cpu timers processing.
This perhaps implies a little more contention on the threadgroup lock and
sig->cputimer.lock if other CPUs concurrently check the process
timers while we are executing those that fired. But this is probably not a big
issue in practice.
Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Chris Metcalf <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Geoff Levand <[email protected]>
Cc: Gilad Ben Yossef <[email protected]>
Cc: Hakan Akkan <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Kevin Hilman <[email protected]>
Cc: Li Zhong <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Paul Gortmaker <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Thomas Gleixner <[email protected]>
---
kernel/posix-cpu-timers.c | 7 +++++--
1 files changed, 5 insertions(+), 2 deletions(-)
diff --git a/kernel/posix-cpu-timers.c b/kernel/posix-cpu-timers.c
index 8fd709c..3b73ae8 100644
--- a/kernel/posix-cpu-timers.c
+++ b/kernel/posix-cpu-timers.c
@@ -1130,8 +1130,6 @@ static void check_process_timers(struct task_struct *tsk,
sig->cputime_expires.prof_exp = prof_expires;
sig->cputime_expires.virt_exp = virt_expires;
sig->cputime_expires.sched_exp = sched_expires;
- if (task_cputime_zero(&sig->cputime_expires))
- stop_process_timers(sig);
}
/*
@@ -1279,6 +1277,7 @@ void run_posix_cpu_timers(struct task_struct *tsk)
LIST_HEAD(firing);
struct k_itimer *timer, *next;
unsigned long flags;
+ struct signal_struct *sig;
BUG_ON(!irqs_disabled());
@@ -1336,6 +1335,10 @@ void run_posix_cpu_timers(struct task_struct *tsk)
cpu_timer_fire(timer);
spin_unlock(&timer->it_lock);
}
+
+ sig = tsk->signal;
+ if (task_cputime_zero(&sig->cputime_expires))
+ stop_process_timers(sig);
}
/*
--
1.7.5.4
Hi Frederic,
> /*
> @@ -1279,6 +1277,7 @@ void run_posix_cpu_timers(struct task_struct *tsk)
> LIST_HEAD(firing);
> struct k_itimer *timer, *next;
> unsigned long flags;
> + struct signal_struct *sig;
>
> BUG_ON(!irqs_disabled());
>
> @@ -1336,6 +1335,10 @@ void run_posix_cpu_timers(struct task_struct *tsk)
> cpu_timer_fire(timer);
> spin_unlock(&timer->it_lock);
> }
> +
> + sig = tsk->signal;
> + if (task_cputime_zero(&sig->cputime_expires))
> + stop_process_timers(sig);
> }
>
> /*
I have proposed a different modification for the same problem yesterday:
https://lkml.org/lkml/2013/4/17/351
I might be mistaken but I believe that firing timers are not rescheduled
in the current interrupt context. They are going to be rescheduled later
from the task context handling the timer generated signal.
Also, if the cputimer is started by posix_cpu_timer_set() without arming
any timer, the stop_process_timers() function will never be reached.
My methodology for testing the modifications I have proposed is to add a
printk() statement inside thread_group_cputimer(), where the cputimer is
initialized, and then run the
glibc-2.17/rt/tst-cputimer1 unit test.
Before the changes, the cputimer is initialized 4 times!
I was expecting the proposed changes to reduce that number to 2, but it
turned out even better: only 1, because the second wave of posix cpu
timers is set within 1 tick.
If you run the same test with your changes, I think there is a chance it
will not show the same improvement, for the reason I stated above.
To conclude, I would be very interested to get your input on the posix cpu
timer modification proposal that I sent last week. My proposal is either
too complex or I am doing a poor job of explaining it, so since you are
knowledgeable in the area, your input would be very helpful.
https://lkml.org/lkml/2013/4/5/370
Thank you!!!
Olivier
2013/4/19 Olivier Langlois <[email protected]>:
> Hi Frederic,
>> /*
>> @@ -1279,6 +1277,7 @@ void run_posix_cpu_timers(struct task_struct *tsk)
>> LIST_HEAD(firing);
>> struct k_itimer *timer, *next;
>> unsigned long flags;
>> + struct signal_struct *sig;
>>
>> BUG_ON(!irqs_disabled());
>>
>> @@ -1336,6 +1335,10 @@ void run_posix_cpu_timers(struct task_struct *tsk)
>> cpu_timer_fire(timer);
>> spin_unlock(&timer->it_lock);
>> }
>> +
>> + sig = tsk->signal;
>> + if (task_cputime_zero(&sig->cputime_expires))
>> + stop_process_timers(sig);
>> }
>>
>> /*
> I have proposed a different modification for the same problem yesterday:
>
> https://lkml.org/lkml/2013/4/17/351
Hmm, that's not exactly the same problem.
>
> I might be mistaken but I believe that firing timers are not rescheduled
> in the current interrupt context. They are going to be rescheduled later
> from the task context handling the timer generated signal.
No, when the timer fires, it might generate a signal. But it won't
execute that signal right away in the same code path. Instead, after
signal generation, it may reschedule the timer if necessary then look
at the next firing timer in the list. This is all made from the same
timer interrupt context from the same call to run_posix_cpu_timers().
The signal itself is executed asynchronously. Either by interrupting a
syscall, or from the irq return path.
> Also, if the cputimer is started by posix_cpu_timer_set() without arming
> any timer, the stop_process_timers() function will never be reached.
It should. On the next timer tick that interrupts the task(s) that have
the timer armed, run_posix_cpu_timers() will run check_process_timers(),
which may conclude with stop_process_timers() if none is running.
But then this makes me realize my patch is buggy. The fact that I move
stop_process_timers() to the very end of the processing doesn't change
the issue much: another CPU may tick and clear
sig->cputimer.running. And if it is eventually rearmed by a
rescheduled timer, we still have a possible tiny race window against
full dynticks CPUs. Ok, I need to drop that patch.
>
> My methodology for testing the modifications I have proposed is to add a
> printk() statement inside thread_group_cputimer(), where the cputimer is
> initialized, and then run the
>
> glibc-2.17/rt/tst-cputimer1 unit test.
>
> Before the changes, the cputimer is initialized 4 times!
> I was expecting the proposed changes to reduce that number to 2, but it
> turned out even better: only 1, because the second wave of posix cpu
> timers is set within 1 tick.
>
> If you run the same test with your changes, I think there is a chance it
> will not show the same improvement, for the reason I stated above.
>
> To conclude, I would be very interested to get your input on the posix cpu
> timer modification proposal that I sent last week. My proposal is either
> too complex or I am doing a poor job of explaining it, so since you are
> knowledgeable in the area, your input would be very helpful.
>
> https://lkml.org/lkml/2013/4/5/370
Ok I'm looking at it.
Thanks.
2013/4/18 Frederic Weisbecker <[email protected]>:
> Ingo,
>
> Please consider the following branch:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git
> timers/nohz-posix-timers
So please wait for a v2. I need to drop patch 2 and handle
posix_cpu_timer_schedule() another way.
But the other pull request reviewed by Paul is still valid.
Thanks.
On Fri, 2013-04-19 at 14:47 +0200, Frederic Weisbecker wrote:
>
> >
> > I might be mistaken but I believe that firing timers are not rescheduled
> > in the current interrupt context. They are going to be rescheduled later
> > from the task context handling the timer generated signal.
>
> No, when the timer fires, it might generate a signal. But it won't
> execute that signal right away in the same code path. Instead, after
> signal generation, it may reschedule the timer if necessary then look
> at the next firing timer in the list. This is all made from the same
> timer interrupt context from the same call to run_posix_cpu_timers().
> The signal itself is executed asynchronously. Either by interrupting a
> syscall, or from the irq return path.
>
Frederic, be careful with the interpretation, there are 2 locations from
where posix_cpu_timer_schedule() can be called.
Call to posix_cpu_timer_schedule() from cpu_timer_fire() only happens if
the signal isn't sent because it is ignored by the recipient.
Maybe the condition around the posix_cpu_timer_schedule() block inside
cpu_timer_fire() could even be a good candidate for 'unlikely'
qualifier.
IMO, in the more likely scenario, posix_cpu_timer_schedule() will be called
from dequeue_signal(), which runs in a different context than
the interrupt context.
At best, you have an interesting race!
dequeue_signal() is called when delivering a signal, not when it is
generated, right?
If you have a different understanding, then please explain when the call to
posix_cpu_timer_schedule() from dequeue_signal() will happen.
>
>
> IMO, in the more likely scenario, posix_cpu_timer_schedule() will be called
> from dequeue_signal(), which runs in a different context than
> the interrupt context.
>
I'll be even more explicit. Inside dequeue_signal(),
do_schedule_next_timer() is called, which then calls
posix_cpu_timer_schedule().
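Schematically, the two call paths are (paraphrased):

	Path 1 - interrupt context, only when the signal is ignored:
	  run_posix_cpu_timers()
	    cpu_timer_fire()
	      posix_cpu_timer_schedule()

	Path 2 - task context, on signal delivery (the common case):
	  dequeue_signal()
	    do_schedule_next_timer()
	      posix_cpu_timer_schedule()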
On Fri, Apr 26, 2013 at 12:27:59AM -0400, Olivier Langlois wrote:
> On Fri, 2013-04-19 at 14:47 +0200, Frederic Weisbecker wrote:
>
> >
> > >
> > > I might be mistaken but I believe that firing timers are not rescheduled
> > > in the current interrupt context. They are going to be rescheduled later
> > > from the task context handling the timer generated signal.
> >
> > No, when the timer fires, it might generate a signal. But it won't
> > execute that signal right away in the same code path. Instead, after
> > signal generation, it may reschedule the timer if necessary then look
> > at the next firing timer in the list. This is all made from the same
> > timer interrupt context from the same call to run_posix_cpu_timers().
> > The signal itself is executed asynchronously. Either by interrupting a
> > syscall, or from the irq return path.
> >
> Frederic, be careful with the interpretation, there are 2 locations from
> where posix_cpu_timer_schedule() can be called.
>
> Call to posix_cpu_timer_schedule() from cpu_timer_fire() only happens if
> the signal isn't sent because it is ignored by the recipient.
>
> Maybe the condition around the posix_cpu_timer_schedule() block inside
> cpu_timer_fire() could even be a good candidate for 'unlikely'
> qualifier.
Well, cpu_timer_fire() is probably not a fast path. So helping branch
prediction there probably won't have much measurable effect in practice.
>
> IMO, in the more likely scenario, posix_cpu_timer_schedule() will be called
> from dequeue_signal(), which runs in a different context than
> the interrupt context.
Oh you're right! I misunderstood that. So I need to take that into
consideration for the nohz code.
>
> At best, you have an interesting race!
>
> dequeue_signal() is called when delivering a signal, not when it is
> generated, right?
Yeah you're right, sorry for the confusion. I'll reconsider your patches
with that in mind.
Thanks.
>
> > Maybe the condition around the posix_cpu_timer_schedule() block inside
> > cpu_timer_fire() could even be a good candidate for 'unlikely'
> > qualifier.
>
> Well, cpu_timer_fire() is probably not a fast path. So helping branch
> prediction there probably won't have much measurable effect in practice.
>
Frederic, I'm sure that you are right about the measurable effect.
When I proposed the 'unlikely' qualifier, please note that I also
had a documentary purpose in mind.
Would you have looked for the 'likely' path that calls
posix_cpu_timer_schedule() when you modified the code, if the
'unlikely' tag had been present?
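For reference, the kind of annotation I have in mind would look roughly
like this (illustrative only, not a patch, and the condition name is a
made-up placeholder for "the signal was ignored"):

	/* Inside cpu_timer_fire(), illustrative paraphrase only */
	if (unlikely(signal_was_ignored)) {
		/*
		 * Rare path: no signal was queued, so reschedule the
		 * timer directly from here, in interrupt context.
		 */
		posix_cpu_timer_schedule(timer);
	}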
Greetings,
Olivier
On Tue, Apr 30, 2013 at 01:51:58PM -0400, Olivier Langlois wrote:
>
> >
> > > Maybe the condition around the posix_cpu_timer_schedule() block inside
> > > cpu_timer_fire() could even be a good candidate for 'unlikely'
> > > qualifier.
> >
> > Well, cpu_timer_fire() is probably not a fast path. So helping branch
> > prediction there probably won't have much measurable effect in practice.
> >
> Frederic, I'm sure that you are right about the measurable effect.
> When I proposed the 'unlikely' qualifier, please note that I also
> had a documentary purpose in mind.
>
> Would you have looked for the 'likely' path that calls
> posix_cpu_timer_schedule() when you modified the code, if the
> 'unlikely' tag had been present?
It's indeed sometimes a good indicator.
But here it's at the end of a batch of conditional blocks, so it sort
of already suggests itself as an unlikely event.
But if you feel the comment can be improved, don't hesitate to send a patch.
>
> Greetings,
> Olivier