2014-04-02 16:26:15

by Frederic Weisbecker

Subject: [GIT PULL] nohz: Move nohz kick out of scheduler IPI

Ingo, Thomas,

Please pull the timers/nohz-ipi-for-tip branch that can be found at:

git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git
timers/nohz-ipi-for-tip

The 1st post got no objections: https://lkml.org/lkml/2014/3/19/453

I held it up until the block bits from Jens got pulled, because this
set had a dependency on core IPI changes contained in the block branch.

I hope we can still get these two patches for the current merge window,
otherwise I'll have to split the core IPI change residing in the 1st patch
into a separate branch for the next cycle (Linus will kill me if I
merge unrelated topics again: https://lkml.org/lkml/2014/4/1/888).

Now tell me if you think that these changes are too invasive at this stage.
Maybe we can just take the first patch, which only does the IPI code
preparation work. That's the most important part. The 2nd patch can wait.
Or we can just wait another cycle and split things into different branches.
Any way you prefer.


Thanks,
Frederic

-----

When a full dynticks CPU runs in single task mode and a new task gets
enqueued, we notify it through an IPI so that it restarts its tick.

The IPI used here is the scheduler IPI. There are a few reasons for that:
it can be called with interrupts disabled, it can be called
concurrently... This combination of convenient properties isn't yet
offered by the IPI subsystem.

Meanwhile, bloating the scheduler IPI with scheduler-unrelated code this
way is an abuse of that fast path. We certainly don't want to end up with
a big kernel IPI.

So this patchset adds a small helper to the IPI subsystem that allows
queueing an IPI from interrupt-disabled code while handling concurrent
callers as well. Eventually the nohz kick gets converted to this new facility.
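
As a rough caller-side sketch of the resulting API (smp_queue_function_single()
and struct queue_single_data come from patch 1/2, while my_kick and my_qsd
below are made up for illustration):

    static void my_kick(struct queue_single_data *qsd)
    {
            /* Runs on the target CPU, from IPI context */
    }

    static DEFINE_PER_CPU(struct queue_single_data, my_qsd);

    /* Safe with interrupts disabled and safe against concurrent callers */
    smp_queue_function_single(cpu, my_kick, &per_cpu(my_qsd, cpu));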

Partly inspired by a suggestion from Peter Zijlstra.

* Patch 1/2 brings the IPI infrastructure to support this
* Patch 2/2 does the nohz IPI conversion

---

Frederic Weisbecker (2):
smp: Non busy-waiting IPI queue
nohz: Move full nohz kick to its own IPI


include/linux/smp.h | 12 ++++++++++++
include/linux/tick.h | 2 ++
kernel/sched/core.c | 5 +----
kernel/sched/sched.h | 2 +-
kernel/smp.c | 44 ++++++++++++++++++++++++++++++++++++++++++++
kernel/time/tick-sched.c | 20 ++++++++++++++++++++
6 files changed, 80 insertions(+), 5 deletions(-)


2014-04-02 16:26:21

by Frederic Weisbecker

Subject: [PATCH 2/2] nohz: Move full nohz kick to its own IPI

Now that we have smp_queue_function_single(), which can be used to
safely queue IPIs when interrupts are disabled and without worrying
about concurrent callers, let's use it for the full dynticks kick to
notify a CPU that it's exiting single task mode.

This slightly unbloats the scheduler IPI, which the nohz code was abusing
for its cool "callable anywhere/anytime" properties.

Cc: Andrew Morton <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Kevin Hilman <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Signed-off-by: Frederic Weisbecker <[email protected]>
---
include/linux/tick.h | 2 ++
kernel/sched/core.c | 5 +----
kernel/sched/sched.h | 2 +-
kernel/time/tick-sched.c | 20 ++++++++++++++++++++
4 files changed, 24 insertions(+), 5 deletions(-)

diff --git a/include/linux/tick.h b/include/linux/tick.h
index b84773c..9d3fcc2 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -182,6 +182,7 @@ static inline bool tick_nohz_full_cpu(int cpu)
extern void tick_nohz_init(void);
extern void __tick_nohz_full_check(void);
extern void tick_nohz_full_kick(void);
+extern void tick_nohz_full_kick_cpu(int cpu);
extern void tick_nohz_full_kick_all(void);
extern void __tick_nohz_task_switch(struct task_struct *tsk);
#else
@@ -190,6 +191,7 @@ static inline bool tick_nohz_full_enabled(void) { return false; }
static inline bool tick_nohz_full_cpu(int cpu) { return false; }
static inline void __tick_nohz_full_check(void) { }
static inline void tick_nohz_full_kick(void) { }
+static inline void tick_nohz_full_kick_cpu(int cpu) { }
static inline void tick_nohz_full_kick_all(void) { }
static inline void __tick_nohz_task_switch(struct task_struct *tsk) { }
#endif
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 9cae286..e4b344e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1499,9 +1499,7 @@ void scheduler_ipi(void)
*/
preempt_fold_need_resched();

- if (llist_empty(&this_rq()->wake_list)
- && !tick_nohz_full_cpu(smp_processor_id())
- && !got_nohz_idle_kick())
+ if (llist_empty(&this_rq()->wake_list) && !got_nohz_idle_kick())
return;

/*
@@ -1518,7 +1516,6 @@ void scheduler_ipi(void)
* somewhat pessimize the simple resched case.
*/
irq_enter();
- tick_nohz_full_check();
sched_ttwu_pending();

/*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index c9007f2..4771063 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1225,7 +1225,7 @@ static inline void inc_nr_running(struct rq *rq)
if (tick_nohz_full_cpu(rq->cpu)) {
/* Order rq->nr_running write against the IPI */
smp_wmb();
- smp_send_reschedule(rq->cpu);
+ tick_nohz_full_kick_cpu(rq->cpu);
}
}
#endif
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 9f8af69..33a0043 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -230,6 +230,26 @@ void tick_nohz_full_kick(void)
irq_work_queue(&__get_cpu_var(nohz_full_kick_work));
}

+static DEFINE_PER_CPU(struct queue_single_data, nohz_full_kick_qsd);
+
+static void nohz_full_kick_queue(struct queue_single_data *qsd)
+{
+ __tick_nohz_full_check();
+}
+
+void tick_nohz_full_kick_cpu(int cpu)
+{
+ if (!tick_nohz_full_cpu(cpu))
+ return;
+
+ if (cpu == smp_processor_id()) {
+ irq_work_queue(&__get_cpu_var(nohz_full_kick_work));
+ } else {
+ smp_queue_function_single(cpu, nohz_full_kick_queue,
+ &per_cpu(nohz_full_kick_qsd, cpu));
+ }
+}
+
static void nohz_full_kick_ipi(void *info)
{
__tick_nohz_full_check();
--
1.8.3.1

2014-04-02 16:26:18

by Frederic Weisbecker

Subject: [PATCH 1/2] smp: Non busy-waiting IPI queue

Some IPI users, such as the nohz subsystem, need to be able to send
an async IPI (async = not waiting for any other IPI completion) from
contexts with interrupts disabled. And we want the IPI subsystem to handle
concurrent calls by itself.

Currently the nohz machinery uses the scheduler IPI for this purpose
because it can be triggered from any context and doesn't need any
serialization from the caller. But this is an abuse of a scheduler
fast path. We are bloating it with a job that should use its own IPI.

The current set of IPI functions can't be called when interrupts are
disabled, otherwise we risk a deadlock when two CPUs wait for each
other's IPI completion.
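
For illustration, a sketch of that deadlock with the existing API (f and g
are made-up callbacks, wait=1 requests the synchronous behaviour):

    /* CPU 0, interrupts disabled */
    smp_call_function_single(1, f, NULL, 1); /* busy-waits for CPU 1 to run f() */

    /* CPU 1, interrupts disabled, at the same time */
    smp_call_function_single(0, g, NULL, 1); /* busy-waits for CPU 0 to run g() */

    /*
     * Neither CPU can service the other's IPI with interrupts disabled,
     * so both spin forever waiting for completion.
     */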

OTOH smp_call_function_single_async() can be called when interrupts
are disabled. But then it's up to the caller to serialize the given
IPI: it can't be called concurrently without special care.

So we need a version of the async IPI that takes care of concurrent
calls.

The proposed solution is to synchronize the IPI with a specific flag
that prevents the IPI from being sent if it is already pending but not
yet executed. Ordering is maintained such that, if the IPI is not sent
because it's already pending, we guarantee it will see the new state of
the data we expect it to see when it executes.
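
As a caller-side sketch of that guarantee (my_state, my_kick and my_qsd are
made up; the pending flag itself is handled by the cmpxchg()/xchg() pair in
this patch):

    static int my_state;
    static struct queue_single_data my_qsd;

    /* Publish the state first, then try to queue the kick */
    my_state = 1;   /* data my_kick() will look at */
    smp_queue_function_single(cpu, my_kick, &my_qsd);
    /*
     * Even if the cmpxchg() in smp_queue_function_single() finds the IPI
     * already pending, the intent is that the pending handler still
     * observes my_state == 1 once it runs.
     */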

This model is close to the irq_work design. It's also partly inspired by
suggestions from Peter Zijlstra.

Cc: Andrew Morton <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Kevin Hilman <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Signed-off-by: Frederic Weisbecker <[email protected]>
---
include/linux/smp.h | 12 ++++++++++++
kernel/smp.c | 44 ++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 56 insertions(+)

diff --git a/include/linux/smp.h b/include/linux/smp.h
index 633f5ed..155dc86 100644
--- a/include/linux/smp.h
+++ b/include/linux/smp.h
@@ -29,6 +29,18 @@ extern unsigned int total_cpus;
int smp_call_function_single(int cpuid, smp_call_func_t func, void *info,
int wait);

+struct queue_single_data;
+typedef void (*smp_queue_func_t)(struct queue_single_data *qsd);
+
+struct queue_single_data {
+ struct call_single_data data;
+ smp_queue_func_t func;
+ int pending;
+};
+
+int smp_queue_function_single(int cpuid, smp_queue_func_t func,
+ struct queue_single_data *qsd);
+
/*
* Call a function on all processors
*/
diff --git a/kernel/smp.c b/kernel/smp.c
index 06d574e..bfe7b36 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -265,6 +265,50 @@ int smp_call_function_single_async(int cpu, struct call_single_data *csd)
}
EXPORT_SYMBOL_GPL(smp_call_function_single_async);

+void generic_smp_queue_function_single_interrupt(void *info)
+{
+ struct queue_single_data *qsd = info;
+
+ WARN_ON_ONCE(xchg(&qsd->pending, 0) != 1);
+ qsd->func(qsd);
+}
+
+/**
+ * smp_queue_function_single - Queue an asynchronous function to run on a
+ * specific CPU unless it's already pending.
+ * @func: The function to run. This must be fast and non-blocking.
+ * @qsd: The queue_single_data to queue, usually embedded in the object of interest
+ *
+ * Like smp_call_function_single_async() but the call to @func is serialized
+ * and won't be queued if it is already pending. In the latter case, ordering
+ * is still guaranteed such that the pending call will see the new data we
+ * expect it to.
+ *
+ * This must not be called on offline CPUs.
+ *
+ * Returns 0 when @func is successfully queued or already pending, else a negative
+ * status code.
+ */
+int smp_queue_function_single(int cpu, smp_queue_func_t func, struct queue_single_data *qsd)
+{
+ int err;
+
+ if (cmpxchg(&qsd->pending, 0, 1))
+ return 0;
+
+ qsd->func = func;
+
+ preempt_disable();
+ err = generic_exec_single(cpu, &qsd->data, generic_smp_queue_function_single_interrupt, qsd, 0);
+ preempt_enable();
+
+ /* Reset in case of error. This must not be called on offline CPUs */
+ if (err)
+ qsd->pending = 0;
+
+ return err;
+}
+
/*
* smp_call_function_any - Run a function on any of the given cpus
* @mask: The mask of cpus it can run on.
--
1.8.3.1

2014-04-02 18:05:21

by Paul E. McKenney

Subject: Re: [PATCH 1/2] smp: Non busy-waiting IPI queue

On Wed, Apr 02, 2014 at 06:26:05PM +0200, Frederic Weisbecker wrote:
> Some IPI users, such as the nohz subsystem, need to be able to send
> an async IPI (async = not waiting for any other IPI completion) from
> contexts with interrupts disabled. And we want the IPI subsystem to handle
> concurrent calls by itself.
>
> Currently the nohz machinery uses the scheduler IPI for this purpose
> because it can be triggered from any context and doesn't need any
> serialization from the caller. But this is an abuse of a scheduler
> fast path. We are bloating it with a job that should use its own IPI.
>
> The current set of IPI functions can't be called when interrupts are
> disabled, otherwise we risk a deadlock when two CPUs wait for each
> other's IPI completion.
>
> OTOH smp_call_function_single_async() can be called when interrupts
> are disabled. But then it's up to the caller to serialize the given
> IPI: it can't be called concurrently without special care.
>
> So we need a version of the async IPI that takes care of concurrent
> calls.
>
> The proposed solution is to synchronize the IPI with a specific flag
> that prevents the IPI from being sent if it is already pending but not
> yet executed. Ordering is maintained such that, if the IPI is not sent
> because it's already pending, we guarantee it will see the new state of
> the data we expect it to see when it executes.
>
> This model is close to the irq_work design. It's also partly inspired by
> suggestions from Peter Zijlstra.
>
> Cc: Andrew Morton <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Jens Axboe <[email protected]>
> Cc: Kevin Hilman <[email protected]>
> Cc: Paul E. McKenney <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Signed-off-by: Frederic Weisbecker <[email protected]>

Nice, but one question below.

Thanx, Paul

> ---
> include/linux/smp.h | 12 ++++++++++++
> kernel/smp.c | 44 ++++++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 56 insertions(+)
>
> diff --git a/include/linux/smp.h b/include/linux/smp.h
> index 633f5ed..155dc86 100644
> --- a/include/linux/smp.h
> +++ b/include/linux/smp.h
> @@ -29,6 +29,18 @@ extern unsigned int total_cpus;
> int smp_call_function_single(int cpuid, smp_call_func_t func, void *info,
> int wait);
>
> +struct queue_single_data;
> +typedef void (*smp_queue_func_t)(struct queue_single_data *qsd);
> +
> +struct queue_single_data {
> + struct call_single_data data;
> + smp_queue_func_t func;
> + int pending;
> +};
> +
> +int smp_queue_function_single(int cpuid, smp_queue_func_t func,
> + struct queue_single_data *qsd);
> +
> /*
> * Call a function on all processors
> */
> diff --git a/kernel/smp.c b/kernel/smp.c
> index 06d574e..bfe7b36 100644
> --- a/kernel/smp.c
> +++ b/kernel/smp.c
> @@ -265,6 +265,50 @@ int smp_call_function_single_async(int cpu, struct call_single_data *csd)
> }
> EXPORT_SYMBOL_GPL(smp_call_function_single_async);
>
> +void generic_smp_queue_function_single_interrupt(void *info)
> +{
> + struct queue_single_data *qsd = info;
> +
> + WARN_ON_ONCE(xchg(&qsd->pending, 0) != 1);

I am probably missing something here, but shouldn't this function copy
*qsd to a local on-stack variable before doing the above xchg()? What
prevents the following from happening?

o CPU 0 does smp_queue_function_single(), which sets ->pending
and fills in ->func and ->data.

o CPU 1 takes IPI, invoking generic_smp_queue_function_single_interrupt().

o CPU 1 does xchg(), so that ->pending is now zero.

o An attempt to reuse the queue_single_data sees ->pending equal
to zero, so the ->func and ->data are overwritten.

o CPU 1 calls the new ->func with the new ->data (or any of the other
two possible unexpected outcomes), which might not be helpful to
the kernel's actuarial statistics.

So what am I missing?

> + qsd->func(qsd);
> +}
> +
> +/**
> + * smp_queue_function_single - Queue an asynchronous function to run on a
> + * specific CPU unless it's already pending.
> + * @func: The function to run. This must be fast and non-blocking.
> + * @qsd: The queue_single_data to queue, usually embedded in the object of interest
> + *
> + * Like smp_call_function_single_async() but the call to @func is serialized
> + * and won't be queued if it is already pending. In the latter case, ordering
> + * is still guaranteed such that the pending call will see the new data we
> + * expect it to.
> + *
> + * This must not be called on offline CPUs.
> + *
> + * Returns 0 when @func is successfully queued or already pending, else a negative
> + * status code.
> + */
> +int smp_queue_function_single(int cpu, smp_queue_func_t func, struct queue_single_data *qsd)
> +{
> + int err;
> +
> + if (cmpxchg(&qsd->pending, 0, 1))
> + return 0;
> +
> + qsd->func = func;
> +
> + preempt_disable();
> + err = generic_exec_single(cpu, &qsd->data, generic_smp_queue_function_single_interrupt, qsd, 0);
> + preempt_enable();
> +
> + /* Reset in case of error. This must not be called on offline CPUs */
> + if (err)
> + qsd->pending = 0;
> +
> + return err;
> +}
> +
> /*
> * smp_call_function_any - Run a function on any of the given cpus
> * @mask: The mask of cpus it can run on.
> --
> 1.8.3.1
>

2014-04-02 18:30:26

by Frederic Weisbecker

Subject: Re: [PATCH 1/2] smp: Non busy-waiting IPI queue

2014-04-02 20:05 GMT+02:00 Paul E. McKenney <[email protected]>:
> On Wed, Apr 02, 2014 at 06:26:05PM +0200, Frederic Weisbecker wrote:
>> diff --git a/kernel/smp.c b/kernel/smp.c
>> index 06d574e..bfe7b36 100644
>> --- a/kernel/smp.c
>> +++ b/kernel/smp.c
>> @@ -265,6 +265,50 @@ int smp_call_function_single_async(int cpu, struct call_single_data *csd)
>> }
>> EXPORT_SYMBOL_GPL(smp_call_function_single_async);
>>
>> +void generic_smp_queue_function_single_interrupt(void *info)
>> +{
>> + struct queue_single_data *qsd = info;
>> +
>> + WARN_ON_ONCE(xchg(&qsd->pending, 0) != 1);
>
> I am probably missing something here, but shouldn't this function copy
> *qsd to a local on-stack variable before doing the above xchg()? What
> prevents the following from happening?
>
> o CPU 0 does smp_queue_function_single(), which sets ->pending
> and fills in ->func and ->data.
>
> o CPU 1 takes IPI, invoking generic_smp_queue_function_single_interrupt().
>
> o CPU 1 does xchg(), so that ->pending is now zero.
>
> o An attempt to reuse the queue_single_data sees ->pending equal
> to zero, so the ->func and ->data are overwritten.
>
> o CPU 1 calls the new ->func with the new ->data (or any of the other
> two possible unexpected outcomes), which might not be helpful to
> the kernel's actuarial statistics.
>
> So what am I missing?

Ah, I forgot to mention that the function must remain the same for all
calls on a single qsd. And the data is always the qsd itself, so it
stays stable. So that shouldn't be a problem.

But you're right. The fact that we pass the function as an argument of
smp_queue_function_single() suggests that we can pass a different
function across various calls on the same qsd. So that's confusing.
Perhaps changing smp_queue_function_single() such that it only takes
the qsd as an argument would make that clearer? Then it's up to the
caller to initialize the qsd with the constant function. I could
define smp_queue_function_init() for that purpose. Or
DEFINE_QUEUE_FUNCTION_DATA() for static initializers.

How does that sound?
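
FWIW, a rough sketch of what that could look like (both helpers are only
proposed above, so their exact shape here is an assumption):

    /* Static initializer binding a qsd to its constant function */
    #define DEFINE_QUEUE_FUNCTION_DATA(name, function) \
            struct queue_single_data name = { .func = (function) }

    /* Runtime init for objects that embed a qsd */
    static inline void smp_queue_function_init(struct queue_single_data *qsd,
                                               smp_queue_func_t func)
    {
            qsd->func = func;
            qsd->pending = 0;
    }

    /* The call site would then only pass the target CPU and the qsd */
    int smp_queue_function_single(int cpu, struct queue_single_data *qsd);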

2014-04-02 19:01:17

by Paul E. McKenney

Subject: Re: [PATCH 1/2] smp: Non busy-waiting IPI queue

On Wed, Apr 02, 2014 at 08:30:23PM +0200, Frederic Weisbecker wrote:
> 2014-04-02 20:05 GMT+02:00 Paul E. McKenney <[email protected]>:
> > On Wed, Apr 02, 2014 at 06:26:05PM +0200, Frederic Weisbecker wrote:
> >> diff --git a/kernel/smp.c b/kernel/smp.c
> >> index 06d574e..bfe7b36 100644
> >> --- a/kernel/smp.c
> >> +++ b/kernel/smp.c
> >> @@ -265,6 +265,50 @@ int smp_call_function_single_async(int cpu, struct call_single_data *csd)
> >> }
> >> EXPORT_SYMBOL_GPL(smp_call_function_single_async);
> >>
> >> +void generic_smp_queue_function_single_interrupt(void *info)
> >> +{
> >> + struct queue_single_data *qsd = info;
> >> +
> >> + WARN_ON_ONCE(xchg(&qsd->pending, 0) != 1);
> >
> > I am probably missing something here, but shouldn't this function copy
> > *qsd to a local on-stack variable before doing the above xchg()? What
> > prevents the following from happening?
> >
> > o CPU 0 does smp_queue_function_single(), which sets ->pending
> > and fills in ->func and ->data.
> >
> > o CPU 1 takes IPI, invoking generic_smp_queue_function_single_interrupt().
> >
> > o CPU 1 does xchg(), so that ->pending is now zero.
> >
> > o An attempt to reuse the queue_single_data sees ->pending equal
> > to zero, so the ->func and ->data are overwritten.
> >
> > o CPU 1 calls the new ->func with the new ->data (or any of the other
> > two possible unexpected outcomes), which might not be helpful to
> > the kernel's actuarial statistics.
> >
> > So what am I missing?
>
> Ah, I forgot to mention that the function must remain the same for all
> calls on a single qsd. And the data is always the qsd itself, so it
> stays stable. So that shouldn't be a problem.

I did indeed miss that particular constraint. ;-)

> But you're right. The fact that we pass the function as an argument of
> smp_queue_function_single() suggests that we can pass a different
> function across various calls on the same qsd. So that's confusing.
> Perhaps changing smp_queue_function_single() such that it only takes
> the qsd as an argument would make that clearer? Then it's up to the
> caller to initialize the qsd with the constant function. I could
> define smp_queue_function_init() for that purpose. Or
> DEFINE_QUEUE_FUNCTION_DATA() for static initializers.
>
> How does that sound?

Sounds good!

Thanx, Paul