Because we drop cpu_base->lock around calling hrtimer::function, it is
possible for hrtimer_start() to come in between and enqueue the timer.
If hrtimer::function then returns HRTIMER_RESTART we'll hit the BUG_ON
because HRTIMER_STATE_ENQUEUED will be set.
Since the above is a perfectly valid scenario, remove the BUG_ON and
make the enqueue_hrtimer() call conditional on the timer not being
enqueued already.
NOTE: in that concurrent scenario it's entirely common for both sites
to want to modify the hrtimer. Since hrtimers don't provide
serialization themselves, be sure to provide some such that the
hrtimer::function and the hrtimer_start() caller don't both try to
fudge the expiration state at the same time.
To that effect, add a WARN when someone tries to forward an already
enqueued timer.
Fixes: 2d44ae4d7135 ("hrtimer: clean up cpu->base locking tricks")
Cc: Ben Segall <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Paul Turner <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
---
kernel/time/hrtimer.c | 12 +++++++++---
1 file changed, 9 insertions(+), 3 deletions(-)
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -791,6 +791,9 @@ u64 hrtimer_forward(struct hrtimer *time
if (delta.tv64 < 0)
return 0;
+ if (WARN_ON(timer->state & HRTIMER_STATE_ENQUEUED))
+ return 0;
+
if (interval.tv64 < hrtimer_resolution)
interval.tv64 = hrtimer_resolution;
@@ -1131,11 +1134,14 @@ static void __run_hrtimer(struct hrtimer
* Note: We clear the CALLBACK bit after enqueue_hrtimer and
* we do not reprogramm the event hardware. Happens either in
* hrtimer_start_range_ns() or in hrtimer_interrupt()
+ *
+ * Note: Because we dropped the cpu_base->lock above,
+ * hrtimer_start_range_ns() can have popped in and enqueued the timer
+ * for us already.
*/
- if (restart != HRTIMER_NORESTART) {
- BUG_ON(timer->state != HRTIMER_STATE_CALLBACK);
+ if (restart != HRTIMER_NORESTART &&
+ !(timer->state & HRTIMER_STATE_ENQUEUED))
enqueue_hrtimer(timer, base);
- }
WARN_ON_ONCE(!(timer->state & HRTIMER_STATE_CALLBACK));
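For users of self-restarting timers, the serialization the changelog's NOTE
asks for could look roughly like the sketch below. The names (my_state,
my_do_work(), my_timer_kick(), my_period) are made up for illustration and
this is not part of the patch; it is essentially the external active-flag
pattern the thread converges on further down.

#include <linux/kernel.h>
#include <linux/hrtimer.h>
#include <linux/spinlock.h>

struct my_state {
        raw_spinlock_t  lock;
        struct hrtimer  timer;
        ktime_t         period;
        unsigned int    active;
};

/* Assumed helper: does the periodic work, returns true if more is pending. */
static bool my_do_work(struct my_state *s);

static enum hrtimer_restart my_timer_fn(struct hrtimer *t)
{
        struct my_state *s = container_of(t, struct my_state, timer);
        bool more = my_do_work(s);

        raw_spin_lock(&s->lock);
        if (more)
                /* Cannot hit the new WARN: while ->active is set nobody
                 * else enqueues the timer, so it is not ENQUEUED here. */
                hrtimer_forward_now(t, s->period);
        else
                s->active = 0;
        raw_spin_unlock(&s->lock);

        return more ? HRTIMER_RESTART : HRTIMER_NORESTART;
}

static void my_timer_kick(struct my_state *s)
{
        unsigned long flags;

        raw_spin_lock_irqsave(&s->lock, flags);
        if (!s->active) {               /* only (re)arm an idle timer */
                s->active = 1;
                hrtimer_forward_now(&s->timer, s->period);
                hrtimer_start_expires(&s->timer, HRTIMER_MODE_ABS);
        }
        raw_spin_unlock_irqrestore(&s->lock, flags);
}

static void my_timer_init(struct my_state *s, ktime_t period)
{
        raw_spin_lock_init(&s->lock);
        s->period = period;
        s->active = 0;
        hrtimer_init(&s->timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS);
        s->timer.function = my_timer_fn;
}

The key point is that the external (re)start path only forwards and arms the
timer when it knows the timer is idle, so it can never race against the
forward done in the callback.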
On Wed, 15 Apr 2015, Peter Zijlstra wrote:
> hrtimer: Fix race between hrtimer_start() and __run_hrtimer()
I don't think that subject line is correct.
Back in the early hrtimer days we deliberately made the design
decision that this kind of usage is forbidden. The reason for this is
that the hrtimer infrastructure cannot provide proper
serialization. So we thought it would be a sane restriction that
restarting a timer from the callback should not be mixed with
concurrent restarts from a different call site.
So I rather prefer a subject line like this
hrtimer: Allow concurrent hrtimer_start() for self restarting timers
or such.
> Because we drop cpu_base->lock around calling hrtimer::function, it is
> possible for hrtimer_start() to come in between and enqueue the timer.
>
> If hrtimer::function then returns HRTIMER_RESTART we'll hit the BUG_ON
> because HRTIMER_STATE_ENQUEUED will be set.
>
> Since the above is a perfectly valid scenario, remove the BUG_ON and
> make the enqueue_hrtimer() call conditional on the timer not being
> enqueued already.
>
> NOTE: in that concurrent scenario it's entirely common for both sites
> to want to modify the hrtimer. Since hrtimers don't provide
> serialization themselves, be sure to provide some such that the
> hrtimer::function and the hrtimer_start() caller don't both try to
> fudge the expiration state at the same time.
Right.
> To that effect, add a WARN when someone tries to forward an already
> enqueued timer.
The warnon itself is nice, but what about sites which use
hrtimer_set_expires() and hrtimer_start_expires()?
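The sites meant here set an absolute expiry directly and (re)start the timer
without ever calling hrtimer_forward(), so the new WARN_ON would not cover
them; roughly (a sketch, 'foo' and 'next_deadline' are made up):

        /* Expiry is set directly, bypassing hrtimer_forward() and thus
         * the new WARN_ON. */
        hrtimer_set_expires(&foo->timer, next_deadline);
        hrtimer_start_expires(&foo->timer, HRTIMER_MODE_ABS);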
Other than that I can see why you want that ...
Thanks,
tglx
On Wed, Apr 15, 2015 at 12:26:58PM +0200, Thomas Gleixner wrote:
> On Wed, 15 Apr 2015, Peter Zijlstra wrote:
> > hrtimer: Fix race between hrtimer_start() and __run_hrtimer()
>
> I don't think that subject line is correct.
>
> Back in the early hrtimer days we deliberately made the design
> decision that this kind of usage is forbidden. The reason for this is
> that the hrtimer infrastructure cannot provide proper
> serialization. So we thought it would be a sane restriction that
> restarting a timer from the callback should not be mixed with
> concurrent restarts from a different call site.
Ah I was not aware. Until I changed the locking it was possible simply
because everything was serialized by the base lock. So the concurrent
start would either land before the callback or after it but not in the
middle like it can now.
> So I rather prefer a subject line like this
>
> hrtimer: Allow concurrent hrtimer_start() for self restarting timers
>
/me copy/paste, done! :-)
> > To that effect, add a WARN when someone tries to forward an already
> > enqueued timer.
>
> The warnon itself is nice, but what about sites which use
> hrtimer_set_expires() and hrtimer_start_expires()?
They are all inlines; furthermore, forward is the most common way to
change the expiry of periodic / self restarting timers, so it would
gain us the most.
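For illustration, the common self-restarting pattern the WARN_ON is aimed at
looks roughly like this (a sketch, not code from this thread;
do_periodic_work() and my_period are made up):

static enum hrtimer_restart my_periodic_fn(struct hrtimer *t)
{
        do_periodic_work();

        /* hrtimer_forward_now() is where the new WARN_ON fires if a
         * concurrent hrtimer_start() has already enqueued the timer. */
        hrtimer_forward_now(t, my_period);
        return HRTIMER_RESTART;
}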
How about this then?
---
Subject: hrtimer: Allow concurrent hrtimer_start() for self restarting timers
From: Peter Zijlstra <[email protected]>
Date: Tue May 20 15:49:48 CEST 2014
Because we drop cpu_base->lock around calling hrtimer::function, it is
possible for hrtimer_start() to come in between and enqueue the timer.
If hrtimer::function then returns HRTIMER_RESTART we'll hit the BUG_ON
because HRTIMER_STATE_ENQUEUED will be set.
Since the above is a perfectly valid scenario, remove the BUG_ON and
make the enqueue_hrtimer() call conditional on the timer not being
enqueued already.
NOTE: in that concurrent scenario it's entirely common for both sites
to want to modify the hrtimer. Since hrtimers don't provide
serialization themselves, be sure to provide some such that the
hrtimer::function and the hrtimer_start() caller don't both try to
fudge the expiration state at the same time.
To that effect, add a WARN when someone tries to forward an already
enqueued timer, the most common way to change the expiry of self
restarting timers. Ideally we'd put the WARN in everything modifying
the expiry but most of that is inlines and we don't need the bloat.
Fixes: 2d44ae4d7135 ("hrtimer: clean up cpu->base locking tricks")
Cc: Ben Segall <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Paul Turner <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
---
kernel/time/hrtimer.c | 12 +++++++++---
1 file changed, 9 insertions(+), 3 deletions(-)
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -791,6 +791,9 @@ u64 hrtimer_forward(struct hrtimer *time
if (delta.tv64 < 0)
return 0;
+ if (WARN_ON(timer->state & HRTIMER_STATE_ENQUEUED))
+ return 0;
+
if (interval.tv64 < hrtimer_resolution)
interval.tv64 = hrtimer_resolution;
@@ -1131,11 +1134,14 @@ static void __run_hrtimer(struct hrtimer
* Note: We clear the CALLBACK bit after enqueue_hrtimer and
* we do not reprogramm the event hardware. Happens either in
* hrtimer_start_range_ns() or in hrtimer_interrupt()
+ *
+ * Note: Because we dropped the cpu_base->lock above,
+ * hrtimer_start_range_ns() can have popped in and enqueued the timer
+ * for us already.
*/
- if (restart != HRTIMER_NORESTART) {
- BUG_ON(timer->state != HRTIMER_STATE_CALLBACK);
+ if (restart != HRTIMER_NORESTART &&
+ !(timer->state & HRTIMER_STATE_ENQUEUED))
enqueue_hrtimer(timer, base);
- }
WARN_ON_ONCE(!(timer->state & HRTIMER_STATE_CALLBACK));
On Wed, 15 Apr 2015, Peter Zijlstra wrote:
> On Wed, Apr 15, 2015 at 12:26:58PM +0200, Thomas Gleixner wrote:
> > On Wed, 15 Apr 2015, Peter Zijlstra wrote:
> > > hrtimer: Fix race between hrtimer_start() and __run_hrtimer()
> >
> > I don't think that subject line is correct.
> >
> > Back in the early hrtimer days we deliberately made the design
> > decision that this kind of usage is forbidden. The reason for this is
> > that the hrtimer infrastructure cannot provide proper
> > serialization. So we thought it would be a sane restriction that
> > restarting a timer from the callback should not be mixed with
> > concurrent restarts from a different call site.
>
> Ah I was not aware. Until I changed the locking it was possible simply
> because everything was serialized by the base lock. So the concurrent
> start would either land before the callback or after it but not in the
> middle like it can now.
>
> > So I rather prefer a subject line like this
> >
> > hrtimer: Allow concurrent hrtimer_start() for self restarting timers
> >
>
> /me copy/paste, done! :-)
>
> > > To that effect, add a WARN when someone tries to forward an already
> > > enqueued timer.
> >
> > The warnon itself is nice, but what about sites which use
> > hrtimer_set_expires() and hrtimer_start_expires()?
>
> They are all inlines; furthermore, forward is the most common way to
> change the expiry of periodic / self restarting timers, so it would
> gain us the most.
Right. I just wanted to mention it.
> How about this then?
Looks good. Should I add those 3 patches to the other pile of hrtimer
stuff?
Thanks,
tglx
On Wed, Apr 15, 2015 at 01:35:15PM +0200, Thomas Gleixner wrote:
> > How about this then?
>
> Looks good. Should I add those 3 patches to the other pile of hrtimer
> stuff?
Let's wait a little while for pjt/ben to look over the cfs bandwidth
change, but basically yes please.
Commit-ID: 5de2755c8c8b3a6b8414870e2c284914a2b42e4d
Gitweb: http://git.kernel.org/tip/5de2755c8c8b3a6b8414870e2c284914a2b42e4d
Author: Peter Zijlstra <[email protected]>
AuthorDate: Tue, 20 May 2014 15:49:48 +0200
Committer: Thomas Gleixner <[email protected]>
CommitDate: Wed, 22 Apr 2015 17:06:52 +0200
hrtimer: Allow concurrent hrtimer_start() for self restarting timers
Because we drop cpu_base->lock around calling hrtimer::function, it is
possible for hrtimer_start() to come in between and enqueue the timer.
If hrtimer::function then returns HRTIMER_RESTART we'll hit the BUG_ON
because HRTIMER_STATE_ENQUEUED will be set.
Since the above is a perfectly valid scenario, remove the BUG_ON and
make the enqueue_hrtimer() call conditional on the timer not being
enqueued already.
NOTE: in that concurrent scenario it's entirely common for both sites
to want to modify the hrtimer. Since hrtimers don't provide
serialization themselves, be sure to provide some such that the
hrtimer::function and the hrtimer_start() caller don't both try to
fudge the expiration state at the same time.
To that effect, add a WARN when someone tries to forward an already
enqueued timer, the most common way to change the expiry of self
restarting timers. Ideally we'd put the WARN in everything modifying
the expiry but most of that is inlines and we don't need the bloat.
Fixes: 2d44ae4d7135 ("hrtimer: clean up cpu->base locking tricks")
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Ben Segall <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Paul Turner <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Thomas Gleixner <[email protected]>
---
kernel/time/hrtimer.c | 12 +++++++++---
1 file changed, 9 insertions(+), 3 deletions(-)
diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index 3bac942..4adf320 100644
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -799,6 +799,9 @@ u64 hrtimer_forward(struct hrtimer *timer, ktime_t now, ktime_t interval)
if (delta.tv64 < 0)
return 0;
+ if (WARN_ON(timer->state & HRTIMER_STATE_ENQUEUED))
+ return 0;
+
if (interval.tv64 < hrtimer_resolution)
interval.tv64 = hrtimer_resolution;
@@ -1139,11 +1142,14 @@ static void __run_hrtimer(struct hrtimer_cpu_base *cpu_base,
* Note: We clear the CALLBACK bit after enqueue_hrtimer and
* we do not reprogramm the event hardware. Happens either in
* hrtimer_start_range_ns() or in hrtimer_interrupt()
+ *
+ * Note: Because we dropped the cpu_base->lock above,
+ * hrtimer_start_range_ns() can have popped in and enqueued the timer
+ * for us already.
*/
- if (restart != HRTIMER_NORESTART) {
- BUG_ON(timer->state != HRTIMER_STATE_CALLBACK);
+ if (restart != HRTIMER_NORESTART &&
+ !(timer->state & HRTIMER_STATE_ENQUEUED))
enqueue_hrtimer(timer, base);
- }
WARN_ON_ONCE(!(timer->state & HRTIMER_STATE_CALLBACK));
On 04/22/2015 03:15 PM, tip-bot for Peter Zijlstra wrote:
> Commit-ID: 5de2755c8c8b3a6b8414870e2c284914a2b42e4d
> Gitweb: http://git.kernel.org/tip/5de2755c8c8b3a6b8414870e2c284914a2b42e4d
> Author: Peter Zijlstra <[email protected]>
> AuthorDate: Tue, 20 May 2014 15:49:48 +0200
> Committer: Thomas Gleixner <[email protected]>
> CommitDate: Wed, 22 Apr 2015 17:06:52 +0200
>
> hrtimer: Allow concurrent hrtimer_start() for self restarting timers
>
> Because we drop cpu_base->lock around calling hrtimer::function, it is
> possible for hrtimer_start() to come in between and enqueue the timer.
>
> If hrtimer::function then returns HRTIMER_RESTART we'll hit the BUG_ON
> because HRTIMER_STATE_ENQUEUED will be set.
>
> Since the above is a perfectly valid scenario, remove the BUG_ON and
> make the enqueue_hrtimer() call conditional on the timer not being
> enqueued already.
>
> NOTE: in that concurrent scenario it's entirely common for both sites
> to want to modify the hrtimer. Since hrtimers don't provide
> serialization themselves, be sure to provide some such that the
> hrtimer::function and the hrtimer_start() caller don't both try to
> fudge the expiration state at the same time.
>
> To that effect, add a WARN when someone tries to forward an already
> enqueued timer, the most common way to change the expiry of self
> restarting timers. Ideally we'd put the WARN in everything modifying
> the expiry but most of that is inlines and we don't need the bloat.
>
> Fixes: 2d44ae4d7135 ("hrtimer: clean up cpu->base locking tricks")
> Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
> Cc: Ben Segall <[email protected]>
> Cc: Roman Gushchin <[email protected]>
> Cc: Paul Turner <[email protected]>
> Link: http://lkml.kernel.org/r/[email protected]
> Signed-off-by: Thomas Gleixner <[email protected]>
> ---
> kernel/time/hrtimer.c | 12 +++++++++---
> 1 file changed, 9 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
> index 3bac942..4adf320 100644
> --- a/kernel/time/hrtimer.c
> +++ b/kernel/time/hrtimer.c
> @@ -799,6 +799,9 @@ u64 hrtimer_forward(struct hrtimer *timer, ktime_t now, ktime_t interval)
> if (delta.tv64 < 0)
> return 0;
>
> + if (WARN_ON(timer->state & HRTIMER_STATE_ENQUEUED))
> + return 0;
> +
> if (interval.tv64 < hrtimer_resolution)
> interval.tv64 = hrtimer_resolution;
Hey Peter,
I seem to be hitting this warning with the latest -next (2015-05-11):
[3344291.055800] WARNING: CPU: 0 PID: 9422 at kernel/time/hrtimer.c:802 hrtimer_forward+0x1f9/0x330()
[3344291.057883] Modules linked in:
[3344291.058734] CPU: 0 PID: 9422 Comm: trinity-main Tainted: G W 4.1.0-rc3-next-20150511-sasha-00037-g87c65d4-dirty #2199
[3344291.061763] ffff880003a88000 00000000d4a58de9 ffff880021007c88 ffffffffabc833ac
[3344291.063726] 0000000000000000 0000000000000000 ffff880021007cd8 ffffffffa21f0a86
[3344291.065656] ffffffffae6936c8 ffffffffa2386319 ffff880021007cb8 ffff8800211f5f90
[3344291.067590] Call Trace:
[3344291.068085] <IRQ> [<ffffffffabc833ac>] dump_stack+0x4f/0x7b
[3344291.068949] [<ffffffffa21f0a86>] warn_slowpath_common+0xc6/0x120
[3344291.070382] [<ffffffffa21f0cca>] warn_slowpath_null+0x1a/0x20
[3344291.071078] [<ffffffffa2386319>] hrtimer_forward+0x1f9/0x330
[3344291.071834] [<ffffffffa24fc7a0>] perf_mux_hrtimer_handler+0x340/0x750
[3344291.072616] [<ffffffffa238830c>] __hrtimer_run_queues+0x30c/0x10d0
[3344291.075592] [<ffffffffa238a7ac>] hrtimer_interrupt+0x19c/0x480
[3344291.076340] [<ffffffffa20c1bef>] local_apic_timer_interrupt+0x6f/0xc0
[3344291.077156] [<ffffffffabcf3af3>] smp_trace_apic_timer_interrupt+0xe3/0x859
[3344291.077968] [<ffffffffabcf1df0>] trace_apic_timer_interrupt+0x70/0x80
[3344291.078729] <EOI> [<ffffffffa2620890>] ? arch_local_irq_restore+0x10/0x30
[3344291.079582] [<ffffffffa2626ff4>] __slab_alloc+0x704/0x790
[3344291.082482] [<ffffffffa2627324>] kmem_cache_alloc+0x2a4/0x3a0
[3344291.083944] [<ffffffffa27cdf6d>] proc_alloc_inode+0x1d/0x270
[3344291.084631] [<ffffffffa26d1a7b>] alloc_inode+0x5b/0x170
[3344291.085278] [<ffffffffa26d7501>] new_inode_pseudo+0x11/0xd0
[3344291.085950] [<ffffffffa27ce698>] proc_get_inode+0x18/0x580
[3344291.086615] [<ffffffffa27dd946>] proc_lookup_de+0xc6/0x160
[3344291.088717] [<ffffffffa27f168c>] proc_tgid_net_lookup+0x5c/0xa0
[3344291.089435] [<ffffffffa269f470>] lookup_real+0x90/0x120
[3344291.090071] [<ffffffffa26a0183>] __lookup_hash+0xe3/0x100
[3344291.092954] [<ffffffffa26a96d7>] walk_component+0x697/0xc10
[3344291.096353] [<ffffffffa26ae0cf>] path_lookupat+0x13f/0x380
[3344291.097727] [<ffffffffa26ae441>] filename_lookup+0x131/0x1f0
[3344291.098411] [<ffffffffa26b4d11>] user_path_at_empty+0xc1/0x160
[3344291.101545] [<ffffffffa26b4dc1>] user_path_at+0x11/0x20
[3344291.102274] [<ffffffffa268f131>] vfs_fstatat+0xb1/0x140
[3344291.105842] [<ffffffffa2690299>] SYSC_newfstatat+0x79/0xd0
[3344291.108737] [<ffffffffa269041e>] SyS_newfstatat+0xe/0x10
[3344291.109377] [<ffffffffabcf0ee2>] tracesys_phase2+0x88/0x8d
Thanks,
Sasha
On Tue, May 12, 2015 at 09:52:09AM -0400, Sasha Levin wrote:
> Hey Peter,
>
> I seem to be hitting this warning with the latest -next (2015-05-11):
>
> [3344291.055800] WARNING: CPU: 0 PID: 9422 at kernel/time/hrtimer.c:802 hrtimer_forward+0x1f9/0x330()
Indeed.
So I can't seem to come up with a pretty solution :-(
The 'problem' is creating a periodic timer that can stop when there's no
work left and ensuring a (re)start doesn't get lost or lose overruns.
The problem with the current code is that the re-start can come before
the callback does fwd, at which point the fwd will spew the above
warning (rightly so).
Now, conceptually it's easy to detect if you're before or after the fwd
by comparing the expiration time against the current time. Of course,
that's expensive (and racy) because we don't have the current time.
Alternatively one could cache this state inside the timer, but then
everybody pays the overhead of maintaining this extra state, and that is
undesired.
The only other option that I could see is the external timer_active
variable, which I tried to kill before.
So I suppose the below (compile tested) patch should fix things, but
seeing how I've been up since 4am I might just have missed something
obvious :-)
Almost-Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
---
include/linux/perf_event.h | 4 ++++
kernel/events/core.c | 28 +++++++++++++++-------------
kernel/sched/core.c | 12 ------------
kernel/sched/fair.c | 17 +++++++++++++----
kernel/sched/rt.c | 8 +++++++-
kernel/sched/sched.h | 5 ++---
6 files changed, 41 insertions(+), 33 deletions(-)
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index e86f85abeda7..1f5607b220e4 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -566,8 +566,12 @@ struct perf_cpu_context {
struct perf_event_context *task_ctx;
int active_oncpu;
int exclusive;
+
+ raw_spinlock_t hrtimer_lock;
struct hrtimer hrtimer;
ktime_t hrtimer_interval;
+ unsigned int hrtimer_active;
+
struct pmu *unique_pmu;
struct perf_cgroup *cgrp;
};
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 84231a146dd5..3dac1d0c6072 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -752,24 +752,21 @@ perf_cgroup_mark_enabled(struct perf_event *event,
static enum hrtimer_restart perf_mux_hrtimer_handler(struct hrtimer *hr)
{
struct perf_cpu_context *cpuctx;
- enum hrtimer_restart ret = HRTIMER_NORESTART;
int rotations = 0;
WARN_ON(!irqs_disabled());
cpuctx = container_of(hr, struct perf_cpu_context, hrtimer);
-
rotations = perf_rotate_context(cpuctx);
- /*
- * arm timer if needed
- */
- if (rotations) {
+ raw_spin_lock(&cpuctx->hrtimer_lock);
+ if (rotations)
hrtimer_forward_now(hr, cpuctx->hrtimer_interval);
- ret = HRTIMER_RESTART;
- }
+ else
+ cpuctx->hrtimer_active = 0;
+ raw_spin_unlock(&cpuctx->hrtimer_lock);
- return ret;
+ return rotations ? HRTIMER_RESTART : HRTIMER_NORESTART;
}
static void __perf_mux_hrtimer_init(struct perf_cpu_context *cpuctx, int cpu)
@@ -792,7 +789,7 @@ static void __perf_mux_hrtimer_init(struct perf_cpu_context *cpuctx, int cpu)
cpuctx->hrtimer_interval = ns_to_ktime(NSEC_PER_MSEC * interval);
- hrtimer_init(timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL_PINNED);
+ hrtimer_init(timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_PINNED);
timer->function = perf_mux_hrtimer_handler;
}
@@ -800,15 +797,20 @@ static int perf_mux_hrtimer_restart(struct perf_cpu_context *cpuctx)
{
struct hrtimer *timer = &cpuctx->hrtimer;
struct pmu *pmu = cpuctx->ctx.pmu;
+ unsigned long flags;
/* not for SW PMU */
if (pmu->task_ctx_nr == perf_sw_context)
return 0;
- if (hrtimer_is_queued(timer))
- return 0;
+ raw_spin_lock_irqsave(&cpuctx->hrtimer_lock, flags);
+ if (!cpuctx->hrtimer_active) {
+ cpuctx->hrtimer_active = 1;
+ hrtimer_forward_now(timer, cpuctx->hrtimer_interval);
+ hrtimer_start_expires(timer, HRTIMER_MODE_ABS_PINNED);
+ }
+ raw_spin_unlock_irqrestore(&cpuctx->hrtimer_lock, flags);
- hrtimer_start(timer, cpuctx->hrtimer_interval, HRTIMER_MODE_REL_PINNED);
return 0;
}
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 613b61e381ca..bd9182aa74c0 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -90,18 +90,6 @@
#define CREATE_TRACE_POINTS
#include <trace/events/sched.h>
-void start_bandwidth_timer(struct hrtimer *period_timer, ktime_t period)
-{
- /*
- * Do not forward the expiration time of active timers;
- * we do not want to loose an overrun.
- */
- if (!hrtimer_active(period_timer))
- hrtimer_forward_now(period_timer, period);
-
- hrtimer_start_expires(period_timer, HRTIMER_MODE_ABS_PINNED);
-}
-
DEFINE_MUTEX(sched_domains_mutex);
DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8c1510abeefa..2c08783ee95c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3883,8 +3883,9 @@ static void start_cfs_slack_bandwidth(struct cfs_bandwidth *cfs_b)
if (runtime_refresh_within(cfs_b, min_left))
return;
- start_bandwidth_timer(&cfs_b->slack_timer,
- ns_to_ktime(cfs_bandwidth_slack_period));
+ hrtimer_start(&cfs_b->slack_timer,
+ ns_to_ktime(cfs_bandwidth_slack_period),
+ HRTIMER_MODE_REL);
}
/* we know any runtime found here is valid as update_curr() precedes return */
@@ -4025,6 +4026,8 @@ static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
idle = do_sched_cfs_period_timer(cfs_b, overrun);
}
+ if (idle)
+ cfs_b->period_active = 0;
raw_spin_unlock(&cfs_b->lock);
return idle ? HRTIMER_NORESTART : HRTIMER_RESTART;
@@ -4038,7 +4041,7 @@ void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
cfs_b->period = ns_to_ktime(default_cfs_period());
INIT_LIST_HEAD(&cfs_b->throttled_cfs_rq);
- hrtimer_init(&cfs_b->period_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+ hrtimer_init(&cfs_b->period_timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_PINNED);
cfs_b->period_timer.function = sched_cfs_period_timer;
hrtimer_init(&cfs_b->slack_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
cfs_b->slack_timer.function = sched_cfs_slack_timer;
@@ -4052,7 +4055,13 @@ static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq)
void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
{
- start_bandwidth_timer(&cfs_b->period_timer, cfs_b->period);
+ lockdep_assert_held(&cfs_b->lock);
+
+ if (!cfs_b->period_active) {
+ cfs_b->period_active = 1;
+ hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period);
+ hrtimer_start_expires(&cfs_b->period_timer, HRTIMER_MODE_ABS_PINNED);
+ }
}
static void destroy_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 8781a38a636e..7d7093c51f8d 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -31,6 +31,8 @@ static enum hrtimer_restart sched_rt_period_timer(struct hrtimer *timer)
idle = do_sched_rt_period_timer(rt_b, overrun);
raw_spin_lock(&rt_b->rt_runtime_lock);
}
+ if (idle)
+ rt_b->rt_period_active = 0;
raw_spin_unlock(&rt_b->rt_runtime_lock);
return idle ? HRTIMER_NORESTART : HRTIMER_RESTART;
@@ -54,7 +56,11 @@ static void start_rt_bandwidth(struct rt_bandwidth *rt_b)
return;
raw_spin_lock(&rt_b->rt_runtime_lock);
- start_bandwidth_timer(&rt_b->rt_period_timer, rt_b->rt_period);
+ if (!rt_b->rt_period_active) {
+ rt_b->rt_period_active = 1;
+ hrtimer_forward_now(&rt_b->rt_period_timer, rt_b->rt_period);
+ hrtimer_start_expires(&rt_b->rt_period_timer, HRTIMER_MODE_ABS_PINNED);
+ }
raw_spin_unlock(&rt_b->rt_runtime_lock);
}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 460f98648afd..f10a445910c9 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -137,6 +137,7 @@ struct rt_bandwidth {
ktime_t rt_period;
u64 rt_runtime;
struct hrtimer rt_period_timer;
+ unsigned int rt_period_active;
};
void __dl_clear_params(struct task_struct *p);
@@ -221,7 +222,7 @@ struct cfs_bandwidth {
s64 hierarchical_quota;
u64 runtime_expires;
- int idle;
+ int idle, period_active;
struct hrtimer period_timer, slack_timer;
struct list_head throttled_cfs_rq;
@@ -1410,8 +1411,6 @@ static inline void sched_rt_avg_update(struct rq *rq, u64 rt_delta) { }
static inline void sched_avg_update(struct rq *rq) { }
#endif
-extern void start_bandwidth_timer(struct hrtimer *period_timer, ktime_t period);
-
/*
* __task_rq_lock - lock the rq @p resides on.
*/
* Peter Zijlstra <[email protected]> wrote:
> On Tue, May 12, 2015 at 09:52:09AM -0400, Sasha Levin wrote:
> > Hey Peter,
> >
> > I seem to be hitting this warning with the latest -next (2015-05-11):
> >
> > [3344291.055800] WARNING: CPU: 0 PID: 9422 at kernel/time/hrtimer.c:802 hrtimer_forward+0x1f9/0x330()
>
> Indeed.
>
> So I can't seem to come up with a pretty solution :-(
I'm running into this rather frequently as well, in -tip integration
testing, on real hardware, so we need a solution :-/
Thanks,
Ingo
Peter Zijlstra <[email protected]> writes:
> @@ -4038,7 +4041,7 @@ void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
> cfs_b->period = ns_to_ktime(default_cfs_period());
>
> INIT_LIST_HEAD(&cfs_b->throttled_cfs_rq);
> - hrtimer_init(&cfs_b->period_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
> + hrtimer_init(&cfs_b->period_timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_PINNED);
> cfs_b->period_timer.function = sched_cfs_period_timer;
> hrtimer_init(&cfs_b->slack_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
> cfs_b->slack_timer.function = sched_cfs_slack_timer;
> @@ -4052,7 +4055,13 @@ static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq)
>
> void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
> {
> - start_bandwidth_timer(&cfs_b->period_timer, cfs_b->period);
> + lockdep_assert_held(&cfs_b->lock);
> +
> + if (!cfs_b->period_active) {
> + cfs_b->period_active = 1;
> + hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period);
> + hrtimer_start_expires(&cfs_b->period_timer, HRTIMER_MODE_ABS_PINNED);
> + }
> }
>
I don't think the cfsb period timer actually needs to be pinned, even
though it has generally run as such due to start_bandwidth_timer.
The period_active code looks fine.
On 05/13/2015 09:43 AM, Peter Zijlstra wrote:
> On Tue, May 12, 2015 at 09:52:09AM -0400, Sasha Levin wrote:
>> Hey Peter,
>>
>> I seem to be hitting this warning with the latest -next (2015-05-11):
>>
>> [3344291.055800] WARNING: CPU: 0 PID: 9422 at kernel/time/hrtimer.c:802 hrtimer_forward+0x1f9/0x330()
>
> Indeed.
>
> So I can't seem to come up with a pretty solution :-(
>
> The 'problem' is creating a periodic timer that can stop when there's no
> work left and ensuring a (re)start doesn't get lost or lose overruns.
>
> The problem with the current code is that the re-start can come before
> the callback does fwd, at which point the fwd will spew the above
> warning (rightly so).
>
> Now, conceptually it's easy to detect if you're before or after the fwd
> by comparing the expiration time against the current time. Of course,
> that's expensive (and racy) because we don't have the current time.
>
> Alternatively one could cache this state inside the timer, but then
> everybody pays the overhead of maintaining this extra state, and that is
> undesired.
>
> The only other option that I could see is the external timer_active
> variable, which I tried to kill before.
>
> So I suppose the below (compile tested) patch should fix things, but
> seeing how I've been up since 4am I might just have missed something
> obvious :-)
I know exactly how you feel...
> Almost-Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Original warning is gone, but now I get:
[ 281.213520] INFO: trying to register non-static key.
[ 281.214251] the code is fine but needs lockdep annotation.
[ 281.214251] turning off the locking correctness validator.
[ 281.214251] CPU: 0 PID: 10071 Comm: trinity-main Tainted: G W 4.1.0-rc3-next-20150512-sasha-00051-gd40f47b-dirty #2213
[ 281.214251] ffffffffae4a5b20 00000000cbbe31dc ffff88000267f698 ffffffffa7c8e8ab
[ 281.214251] 0000000000000000 0000000000000000 ffff88000267f848 ffffffff9e303567
[ 281.214251] ffff88001d015260 0000000000000004 ffff88001d014d20 0000000000000005
[ 281.214251] Call Trace:
[ 281.214251] dump_stack (lib/dump_stack.c:52)
[ 281.214251] __lock_acquire (kernel/locking/lockdep.c:786 kernel/locking/lockdep.c:3134)
[ 281.214251] ? x86_schedule_events (arch/x86/kernel/cpu/perf_event.c:849 (discriminator 2))
[ 281.214251] ? lockdep_reset_lock (kernel/locking/lockdep.c:3105)
[ 281.214251] ? x86_pmu_commit_txn (arch/x86/kernel/cpu/perf_event.c:1098)
[ 281.214251] ? x86_pmu_commit_txn (arch/x86/kernel/cpu/perf_event.c:1731)
[ 281.214251] ? x86_pmu_cancel_txn (arch/x86/kernel/cpu/perf_event.c:1720)
[ 281.214251] ? perf_event_update_userpage (include/linux/rcupdate.h:917 kernel/events/core.c:4256)
[ 281.214251] lock_acquire (kernel/locking/lockdep.c:3658)
[ 281.214251] ? perf_mux_hrtimer_restart (kernel/events/core.c:807)
[ 281.214251] _raw_spin_lock_irqsave (include/linux/spinlock_api_smp.h:119 kernel/locking/spinlock.c:159)
[ 281.214251] ? perf_mux_hrtimer_restart (kernel/events/core.c:807)
[ 281.214251] perf_mux_hrtimer_restart (kernel/events/core.c:807)
[ 281.214251] group_sched_in (kernel/events/core.c:1963)
[ 281.214251] ? sched_clock_cpu (kernel/sched/clock.c:311)
[ 281.214251] ctx_sched_in (kernel/events/core.c:2725 kernel/events/core.c:2756)
[ 281.214251] perf_event_sched_in (kernel/events/core.c:2025)
[ 281.214251] __perf_install_in_context (kernel/events/core.c:2082)
[ 281.214251] ? perf_mux_hrtimer_handler (kernel/events/core.c:2035)
[ 281.214251] remote_function (kernel/events/core.c:74)
[ 281.214251] ? pmu_dev_release (kernel/events/core.c:66)
[ 281.214251] generic_exec_single (kernel/smp.c:157)
[ 281.214251] ? pmu_dev_release (kernel/events/core.c:66)
[ 281.214251] smp_call_function_single (kernel/smp.c:300)
[ 281.214251] ? generic_exec_single (kernel/smp.c:273)
[ 281.214251] ? __lock_is_held (kernel/locking/lockdep.c:3572)
[ 281.214251] task_function_call (kernel/events/core.c:101)
[ 281.214251] ? remote_function (kernel/events/core.c:92)
[ 281.214251] ? perf_mux_hrtimer_handler (kernel/events/core.c:2035)
[ 281.214251] perf_install_in_context (kernel/events/core.c:2121)
[ 281.214251] ? add_event_to_ctx (kernel/events/core.c:2102)
[ 281.214251] ? alloc_perf_context (kernel/events/core.c:3306)
[ 281.214251] ? perf_event_alloc (kernel/events/core.c:7569)
[ 281.214251] SYSC_perf_event_open (kernel/events/core.c:8133)
[ 281.214251] ? perf_event_alloc (kernel/events/core.c:7856)
[ 281.214251] ? syscall_trace_enter_phase1 (include/linux/context_tracking.h:27 arch/x86/kernel/ptrace.c:1480)
[ 281.214251] SyS_perf_event_open (kernel/events/core.c:7853)
[ 281.214251] tracesys_phase2 (arch/x86/kernel/entry_64.S:346)
Thanks,
Sasha
On Wed, May 13, 2015 at 07:09:40PM -0400, Sasha Levin wrote:
> On 05/13/2015 09:43 AM, Peter Zijlstra wrote:
> > So I suppose the below (compile tested) patch should fix things, but
> > seeing how I've been up since 4am I might just have missed something
> > obvious :-)
>
> I know exactly how you feel...
How about I add a raw_spin_lock_init() for the new lock I added ;-)
This one seems to boot and do simple perf things with lockdep enabled.
---
include/linux/perf_event.h | 4 ++++
kernel/events/core.c | 29 ++++++++++++++++-------------
kernel/sched/core.c | 12 ------------
kernel/sched/fair.c | 17 +++++++++++++----
kernel/sched/rt.c | 8 +++++++-
kernel/sched/sched.h | 5 ++---
6 files changed, 42 insertions(+), 33 deletions(-)
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index e86f85abeda7..1f5607b220e4 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -566,8 +566,12 @@ struct perf_cpu_context {
struct perf_event_context *task_ctx;
int active_oncpu;
int exclusive;
+
+ raw_spinlock_t hrtimer_lock;
struct hrtimer hrtimer;
ktime_t hrtimer_interval;
+ unsigned int hrtimer_active;
+
struct pmu *unique_pmu;
struct perf_cgroup *cgrp;
};
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 84231a146dd5..1c6c2826af1e 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -752,24 +752,21 @@ perf_cgroup_mark_enabled(struct perf_event *event,
static enum hrtimer_restart perf_mux_hrtimer_handler(struct hrtimer *hr)
{
struct perf_cpu_context *cpuctx;
- enum hrtimer_restart ret = HRTIMER_NORESTART;
int rotations = 0;
WARN_ON(!irqs_disabled());
cpuctx = container_of(hr, struct perf_cpu_context, hrtimer);
-
rotations = perf_rotate_context(cpuctx);
- /*
- * arm timer if needed
- */
- if (rotations) {
+ raw_spin_lock(&cpuctx->hrtimer_lock);
+ if (rotations)
hrtimer_forward_now(hr, cpuctx->hrtimer_interval);
- ret = HRTIMER_RESTART;
- }
+ else
+ cpuctx->hrtimer_active = 0;
+ raw_spin_unlock(&cpuctx->hrtimer_lock);
- return ret;
+ return rotations ? HRTIMER_RESTART : HRTIMER_NORESTART;
}
static void __perf_mux_hrtimer_init(struct perf_cpu_context *cpuctx, int cpu)
@@ -792,7 +789,8 @@ static void __perf_mux_hrtimer_init(struct perf_cpu_context *cpuctx, int cpu)
cpuctx->hrtimer_interval = ns_to_ktime(NSEC_PER_MSEC * interval);
- hrtimer_init(timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL_PINNED);
+ raw_spin_lock_init(&cpuctx->hrtimer_lock);
+ hrtimer_init(timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_PINNED);
timer->function = perf_mux_hrtimer_handler;
}
@@ -800,15 +798,20 @@ static int perf_mux_hrtimer_restart(struct perf_cpu_context *cpuctx)
{
struct hrtimer *timer = &cpuctx->hrtimer;
struct pmu *pmu = cpuctx->ctx.pmu;
+ unsigned long flags;
/* not for SW PMU */
if (pmu->task_ctx_nr == perf_sw_context)
return 0;
- if (hrtimer_is_queued(timer))
- return 0;
+ raw_spin_lock_irqsave(&cpuctx->hrtimer_lock, flags);
+ if (!cpuctx->hrtimer_active) {
+ cpuctx->hrtimer_active = 1;
+ hrtimer_forward_now(timer, cpuctx->hrtimer_interval);
+ hrtimer_start_expires(timer, HRTIMER_MODE_ABS_PINNED);
+ }
+ raw_spin_unlock_irqrestore(&cpuctx->hrtimer_lock, flags);
- hrtimer_start(timer, cpuctx->hrtimer_interval, HRTIMER_MODE_REL_PINNED);
return 0;
}
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 613b61e381ca..bd9182aa74c0 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -90,18 +90,6 @@
#define CREATE_TRACE_POINTS
#include <trace/events/sched.h>
-void start_bandwidth_timer(struct hrtimer *period_timer, ktime_t period)
-{
- /*
- * Do not forward the expiration time of active timers;
- * we do not want to loose an overrun.
- */
- if (!hrtimer_active(period_timer))
- hrtimer_forward_now(period_timer, period);
-
- hrtimer_start_expires(period_timer, HRTIMER_MODE_ABS_PINNED);
-}
-
DEFINE_MUTEX(sched_domains_mutex);
DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8c1510abeefa..2c08783ee95c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3883,8 +3883,9 @@ static void start_cfs_slack_bandwidth(struct cfs_bandwidth *cfs_b)
if (runtime_refresh_within(cfs_b, min_left))
return;
- start_bandwidth_timer(&cfs_b->slack_timer,
- ns_to_ktime(cfs_bandwidth_slack_period));
+ hrtimer_start(&cfs_b->slack_timer,
+ ns_to_ktime(cfs_bandwidth_slack_period),
+ HRTIMER_MODE_REL);
}
/* we know any runtime found here is valid as update_curr() precedes return */
@@ -4025,6 +4026,8 @@ static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
idle = do_sched_cfs_period_timer(cfs_b, overrun);
}
+ if (idle)
+ cfs_b->period_active = 0;
raw_spin_unlock(&cfs_b->lock);
return idle ? HRTIMER_NORESTART : HRTIMER_RESTART;
@@ -4038,7 +4041,7 @@ void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
cfs_b->period = ns_to_ktime(default_cfs_period());
INIT_LIST_HEAD(&cfs_b->throttled_cfs_rq);
- hrtimer_init(&cfs_b->period_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+ hrtimer_init(&cfs_b->period_timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_PINNED);
cfs_b->period_timer.function = sched_cfs_period_timer;
hrtimer_init(&cfs_b->slack_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
cfs_b->slack_timer.function = sched_cfs_slack_timer;
@@ -4052,7 +4055,13 @@ static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq)
void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
{
- start_bandwidth_timer(&cfs_b->period_timer, cfs_b->period);
+ lockdep_assert_held(&cfs_b->lock);
+
+ if (!cfs_b->period_active) {
+ cfs_b->period_active = 1;
+ hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period);
+ hrtimer_start_expires(&cfs_b->period_timer, HRTIMER_MODE_ABS_PINNED);
+ }
}
static void destroy_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 8781a38a636e..7d7093c51f8d 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -31,6 +31,8 @@ static enum hrtimer_restart sched_rt_period_timer(struct hrtimer *timer)
idle = do_sched_rt_period_timer(rt_b, overrun);
raw_spin_lock(&rt_b->rt_runtime_lock);
}
+ if (idle)
+ rt_b->rt_period_active = 0;
raw_spin_unlock(&rt_b->rt_runtime_lock);
return idle ? HRTIMER_NORESTART : HRTIMER_RESTART;
@@ -54,7 +56,11 @@ static void start_rt_bandwidth(struct rt_bandwidth *rt_b)
return;
raw_spin_lock(&rt_b->rt_runtime_lock);
- start_bandwidth_timer(&rt_b->rt_period_timer, rt_b->rt_period);
+ if (!rt_b->rt_period_active) {
+ rt_b->rt_period_active = 1;
+ hrtimer_forward_now(&rt_b->rt_period_timer, rt_b->rt_period);
+ hrtimer_start_expires(&rt_b->rt_period_timer, HRTIMER_MODE_ABS_PINNED);
+ }
raw_spin_unlock(&rt_b->rt_runtime_lock);
}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 460f98648afd..f10a445910c9 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -137,6 +137,7 @@ struct rt_bandwidth {
ktime_t rt_period;
u64 rt_runtime;
struct hrtimer rt_period_timer;
+ unsigned int rt_period_active;
};
void __dl_clear_params(struct task_struct *p);
@@ -221,7 +222,7 @@ struct cfs_bandwidth {
s64 hierarchical_quota;
u64 runtime_expires;
- int idle;
+ int idle, period_active;
struct hrtimer period_timer, slack_timer;
struct list_head throttled_cfs_rq;
@@ -1410,8 +1411,6 @@ static inline void sched_rt_avg_update(struct rq *rq, u64 rt_delta) { }
static inline void sched_avg_update(struct rq *rq) { }
#endif
-extern void start_bandwidth_timer(struct hrtimer *period_timer, ktime_t period);
-
/*
* __task_rq_lock - lock the rq @p resides on.
*/
Commit-ID: 4cfafd3082afc707653aeb82e9f8e7b596fbbfd6
Gitweb: http://git.kernel.org/tip/4cfafd3082afc707653aeb82e9f8e7b596fbbfd6
Author: Peter Zijlstra <[email protected]>
AuthorDate: Thu, 14 May 2015 12:23:11 +0200
Committer: Thomas Gleixner <[email protected]>
CommitDate: Mon, 18 May 2015 17:17:42 +0200
sched,perf: Fix periodic timers
In the below two commits (see Fixes) we have periodic timers that can
stop themselves when they're no longer required, but need to be
(re)-started when their idle condition changes.
A further complication is that we want the timer handler to always do
the forward, such that it will always correctly deal with the overruns,
and we do not want to race such that the handler has already decided
to stop but the (external) restart sees the timer as still active, and
we end up with a 'lost' timer.
The problem with the current code is that the re-start can come before
the callback does the forward, at which point the forward from the
callback will WARN about forwarding an enqueued timer.
Now, conceptually it's easy to detect if you're before or after the fwd
by comparing the expiration time against the current time. Of course,
that's expensive (and racy) because we don't have the current time.
Alternatively one could cache this state inside the timer, but then
everybody pays the overhead of maintaining this extra state, and that
is undesired.
The only other option that I could see is the external timer_active
variable, which I tried to kill before. I would love a nicer interface
for this seemingly simple 'problem' but alas.
Fixes: 272325c4821f ("perf: Fix mux_interval hrtimer wreckage")
Fixes: 77a4d1a1b9a1 ("sched: Cleanup bandwidth timers")
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: Sasha Levin <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
---
include/linux/perf_event.h | 4 ++++
kernel/events/core.c | 29 ++++++++++++++++-------------
kernel/sched/core.c | 12 ------------
kernel/sched/fair.c | 17 +++++++++++++----
kernel/sched/rt.c | 8 +++++++-
kernel/sched/sched.h | 5 ++---
6 files changed, 42 insertions(+), 33 deletions(-)
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 61992cf..cf3342a 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -566,8 +566,12 @@ struct perf_cpu_context {
struct perf_event_context *task_ctx;
int active_oncpu;
int exclusive;
+
+ raw_spinlock_t hrtimer_lock;
struct hrtimer hrtimer;
ktime_t hrtimer_interval;
+ unsigned int hrtimer_active;
+
struct pmu *unique_pmu;
struct perf_cgroup *cgrp;
};
diff --git a/kernel/events/core.c b/kernel/events/core.c
index f528829..d9c93f3 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -752,24 +752,21 @@ perf_cgroup_mark_enabled(struct perf_event *event,
static enum hrtimer_restart perf_mux_hrtimer_handler(struct hrtimer *hr)
{
struct perf_cpu_context *cpuctx;
- enum hrtimer_restart ret = HRTIMER_NORESTART;
int rotations = 0;
WARN_ON(!irqs_disabled());
cpuctx = container_of(hr, struct perf_cpu_context, hrtimer);
-
rotations = perf_rotate_context(cpuctx);
- /*
- * arm timer if needed
- */
- if (rotations) {
+ raw_spin_lock(&cpuctx->hrtimer_lock);
+ if (rotations)
hrtimer_forward_now(hr, cpuctx->hrtimer_interval);
- ret = HRTIMER_RESTART;
- }
+ else
+ cpuctx->hrtimer_active = 0;
+ raw_spin_unlock(&cpuctx->hrtimer_lock);
- return ret;
+ return rotations ? HRTIMER_RESTART : HRTIMER_NORESTART;
}
static void __perf_mux_hrtimer_init(struct perf_cpu_context *cpuctx, int cpu)
@@ -792,7 +789,8 @@ static void __perf_mux_hrtimer_init(struct perf_cpu_context *cpuctx, int cpu)
cpuctx->hrtimer_interval = ns_to_ktime(NSEC_PER_MSEC * interval);
- hrtimer_init(timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL_PINNED);
+ raw_spin_lock_init(&cpuctx->hrtimer_lock);
+ hrtimer_init(timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_PINNED);
timer->function = perf_mux_hrtimer_handler;
}
@@ -800,15 +798,20 @@ static int perf_mux_hrtimer_restart(struct perf_cpu_context *cpuctx)
{
struct hrtimer *timer = &cpuctx->hrtimer;
struct pmu *pmu = cpuctx->ctx.pmu;
+ unsigned long flags;
/* not for SW PMU */
if (pmu->task_ctx_nr == perf_sw_context)
return 0;
- if (hrtimer_is_queued(timer))
- return 0;
+ raw_spin_lock_irqsave(&cpuctx->hrtimer_lock, flags);
+ if (!cpuctx->hrtimer_active) {
+ cpuctx->hrtimer_active = 1;
+ hrtimer_forward_now(timer, cpuctx->hrtimer_interval);
+ hrtimer_start_expires(timer, HRTIMER_MODE_ABS_PINNED);
+ }
+ raw_spin_unlock_irqrestore(&cpuctx->hrtimer_lock, flags);
- hrtimer_start(timer, cpuctx->hrtimer_interval, HRTIMER_MODE_REL_PINNED);
return 0;
}
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d8a6196..e84aeb2 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -90,18 +90,6 @@
#define CREATE_TRACE_POINTS
#include <trace/events/sched.h>
-void start_bandwidth_timer(struct hrtimer *period_timer, ktime_t period)
-{
- /*
- * Do not forward the expiration time of active timers;
- * we do not want to loose an overrun.
- */
- if (!hrtimer_active(period_timer))
- hrtimer_forward_now(period_timer, period);
-
- hrtimer_start_expires(period_timer, HRTIMER_MODE_ABS_PINNED);
-}
-
DEFINE_MUTEX(sched_domains_mutex);
DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e3b32eb..69be282 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3870,8 +3870,9 @@ static void start_cfs_slack_bandwidth(struct cfs_bandwidth *cfs_b)
if (runtime_refresh_within(cfs_b, min_left))
return;
- start_bandwidth_timer(&cfs_b->slack_timer,
- ns_to_ktime(cfs_bandwidth_slack_period));
+ hrtimer_start(&cfs_b->slack_timer,
+ ns_to_ktime(cfs_bandwidth_slack_period),
+ HRTIMER_MODE_REL);
}
/* we know any runtime found here is valid as update_curr() precedes return */
@@ -4012,6 +4013,8 @@ static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
idle = do_sched_cfs_period_timer(cfs_b, overrun);
}
+ if (idle)
+ cfs_b->period_active = 0;
raw_spin_unlock(&cfs_b->lock);
return idle ? HRTIMER_NORESTART : HRTIMER_RESTART;
@@ -4025,7 +4028,7 @@ void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
cfs_b->period = ns_to_ktime(default_cfs_period());
INIT_LIST_HEAD(&cfs_b->throttled_cfs_rq);
- hrtimer_init(&cfs_b->period_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+ hrtimer_init(&cfs_b->period_timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_PINNED);
cfs_b->period_timer.function = sched_cfs_period_timer;
hrtimer_init(&cfs_b->slack_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
cfs_b->slack_timer.function = sched_cfs_slack_timer;
@@ -4039,7 +4042,13 @@ static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq)
void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
{
- start_bandwidth_timer(&cfs_b->period_timer, cfs_b->period);
+ lockdep_assert_held(&cfs_b->lock);
+
+ if (!cfs_b->period_active) {
+ cfs_b->period_active = 1;
+ hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period);
+ hrtimer_start_expires(&cfs_b->period_timer, HRTIMER_MODE_ABS_PINNED);
+ }
}
static void destroy_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index b0febf2..e43da53 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -31,6 +31,8 @@ static enum hrtimer_restart sched_rt_period_timer(struct hrtimer *timer)
idle = do_sched_rt_period_timer(rt_b, overrun);
raw_spin_lock(&rt_b->rt_runtime_lock);
}
+ if (idle)
+ rt_b->rt_period_active = 0;
raw_spin_unlock(&rt_b->rt_runtime_lock);
return idle ? HRTIMER_NORESTART : HRTIMER_RESTART;
@@ -54,7 +56,11 @@ static void start_rt_bandwidth(struct rt_bandwidth *rt_b)
return;
raw_spin_lock(&rt_b->rt_runtime_lock);
- start_bandwidth_timer(&rt_b->rt_period_timer, rt_b->rt_period);
+ if (!rt_b->rt_period_active) {
+ rt_b->rt_period_active = 1;
+ hrtimer_forward_now(&rt_b->rt_period_timer, rt_b->rt_period);
+ hrtimer_start_expires(&rt_b->rt_period_timer, HRTIMER_MODE_ABS_PINNED);
+ }
raw_spin_unlock(&rt_b->rt_runtime_lock);
}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 08606a1..f9a58ef 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -131,6 +131,7 @@ struct rt_bandwidth {
ktime_t rt_period;
u64 rt_runtime;
struct hrtimer rt_period_timer;
+ unsigned int rt_period_active;
};
void __dl_clear_params(struct task_struct *p);
@@ -215,7 +216,7 @@ struct cfs_bandwidth {
s64 hierarchical_quota;
u64 runtime_expires;
- int idle;
+ int idle, period_active;
struct hrtimer period_timer, slack_timer;
struct list_head throttled_cfs_rq;
@@ -1406,8 +1407,6 @@ static inline void sched_rt_avg_update(struct rq *rq, u64 rt_delta) { }
static inline void sched_avg_update(struct rq *rq) { }
#endif
-extern void start_bandwidth_timer(struct hrtimer *period_timer, ktime_t period);
-
/*
* __task_rq_lock - lock the rq @p resides on.
*/