2014-02-24 08:06:58

by Mike Galbraith

[permalink] [raw]
Subject: [patch] sched: don't use nutty scale_rt_power() output

Hi Peter,

I wonder if the below makes sense for mainline.

Background: I received some rather surprising news recently, a user of
old 2.6.32 kernels regularly receive log spam stemming from old 208 day
era warnings/protections inserted to prevent explosions from what was at
the time unknown bad juju happening (but don't report logs that look
like graffiti artist with an unlimited supply of spray paint gone mad).

The kernel that emitted the below does NOT contain..
9993bc63 sched/x86: Fix overflow in cyc2ns_offset
..though these folks use kexec fwtw. They're one of those "You update
your kernel IFF world stops spinning" users, so will likely not be
terribly interested in me making their boxen say BUG(), and may even be
doing something naughty that induces it for all I know.

In any case, NOT using nutty output from the intentionally racy function
seems like a good plan no matter who or what makes weird unreproducible
(elsewhere) sh*t happen. Wedging a bent 64 bit peg into 32 bit hole
could make boom, on top of doing funny things to balancing.

sched: don't use nutty scale_rt_power() output

Boxen instructed to gripe if they see nutty cpu_power catch us
trashing it while seriously dazed and confused for an unknown reason.

Dec 18 05:50:56 kernel: [40091179.401405] update_group_power: cpu_power = 3148183471
Dec 18 05:51:01 /usr/sbin/cron[2279]: (root) CMD (/opt/blah/fix_cdr_bin.job >> /opt/blah/fix_cdr_bin.out 2>&1)
Dec 18 05:51:06 kernel: [40091189.455713] update_cpu_power: cpu_power = 19495027282; scale_rt = 19495027282
Dec 18 05:51:16 kernel: [22076800.665578] update_cpu_power: cpu_power = 2671067611; scale_rt = 18428729677871137243
Dec 18 05:51:16 kernel: [40091199.188773] update_cpu_power: cpu_power = 2675064501; scale_rt = 18428729677875134133

Don't do that, make a scary warning instead.

Signed-off-by: Mike Galbraith <[email protected]>
---
kernel/sched/fair.c | 12 +++++++++---
1 file changed, 9 insertions(+), 3 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5305,7 +5305,7 @@ static unsigned long scale_rt_power(int
static void update_cpu_power(struct sched_domain *sd, int cpu)
{
unsigned long weight = sd->span_weight;
- unsigned long power = SCHED_POWER_SCALE;
+ unsigned long power = SCHED_POWER_SCALE, rt_power;
struct sched_group *sdg = sd->groups;

if ((sd->flags & SD_SHARE_CPUPOWER) && weight > 1) {
@@ -5326,8 +5326,14 @@ static void update_cpu_power(struct sche

power >>= SCHED_POWER_SHIFT;

- power *= scale_rt_power(cpu);
- power >>= SCHED_POWER_SHIFT;
+ rt_power = scale_rt_power(cpu);
+ if (unlikely(rt_power > SCHED_POWER_SCALE)) {
+ printk_once(KERN_WARNING "CPU%d Timewarp? rt_power: %lx clock: %Lx rt_avg: %Lx\n",
+ cpu, rt_power, cpu_rq(cpu)->clock, cpu_rq(cpu)->rt_avg);
+ } else {
+ power *= rt_power;
+ power >>= SCHED_POWER_SHIFT;
+ }

if (!power)
power = 1;


2014-02-27 09:40:43

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [patch] sched: don't use nutty scale_rt_power() output

On Mon, Feb 24, 2014 at 09:06:51AM +0100, Mike Galbraith wrote:
> Hi Peter,
>
> I wonder if the below makes sense for mainline.
>
> Background: I received some rather surprising news recently, a user of
> old 2.6.32 kernels regularly receive log spam stemming from old 208 day
> era warnings/protections inserted to prevent explosions from what was at
> the time unknown bad juju happening (but don't report logs that look
> like graffiti artist with an unlimited supply of spray paint gone mad).
>
> The kernel that emitted the below does NOT contain..
> 9993bc63 sched/x86: Fix overflow in cyc2ns_offset
> ..though these folks use kexec fwtw. They're one of those "You update
> your kernel IFF world stops spinning" users, so will likely not be
> terribly interested in me making their boxen say BUG(), and may even be
> doing something naughty that induces it for all I know.
>
> In any case, NOT using nutty output from the intentionally racy function
> seems like a good plan no matter who or what makes weird unreproducible
> (elsewhere) sh*t happen. Wedging a bent 64 bit peg into 32 bit hole
> could make boom, on top of doing funny things to balancing.
>
> sched: don't use nutty scale_rt_power() output
>
> Boxen instructed to gripe if they see nutty cpu_power catch us
> trashing it while seriously dazed and confused for an unknown reason.
>
> Dec 18 05:50:56 kernel: [40091179.401405] update_group_power: cpu_power = 3148183471
> Dec 18 05:51:01 /usr/sbin/cron[2279]: (root) CMD (/opt/blah/fix_cdr_bin.job >> /opt/blah/fix_cdr_bin.out 2>&1)
> Dec 18 05:51:06 kernel: [40091189.455713] update_cpu_power: cpu_power = 19495027282; scale_rt = 19495027282
> Dec 18 05:51:16 kernel: [22076800.665578] update_cpu_power: cpu_power = 2671067611; scale_rt = 18428729677871137243
> Dec 18 05:51:16 kernel: [40091199.188773] update_cpu_power: cpu_power = 2675064501; scale_rt = 18428729677875134133
>
> Don't do that, make a scary warning instead.
>

Yeah, I'm in two minds about that. Crappy clocks can make a whole lot of
missery. Then again, we usually guard against them going backwards.

How about something like so? Most other sites don't complain about
clocks going backwards either, they just deal with it.

---
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5564,6 +5564,7 @@ static unsigned long scale_rt_power(int
{
struct rq *rq = cpu_rq(cpu);
u64 total, available, age_stamp, avg;
+ s64 delta;

/*
* Since we're reading these variables without serialization make sure
@@ -5572,7 +5573,11 @@ static unsigned long scale_rt_power(int
age_stamp = ACCESS_ONCE(rq->age_stamp);
avg = ACCESS_ONCE(rq->rt_avg);

- total = sched_avg_period() + (rq_clock(rq) - age_stamp);
+ delta = rq_clock(rq) - age_stamp;
+ if (unlikely(delta < 0))
+ delta = 0;
+
+ total = sched_avg_period() + delta;

if (unlikely(total < avg)) {
/* Ensures that power won't end up being negative */

2014-02-27 10:39:15

by Mike Galbraith

[permalink] [raw]
Subject: Re: [patch] sched: don't use nutty scale_rt_power() output

On Thu, 2014-02-27 at 10:40 +0100, Peter Zijlstra wrote:
> On Mon, Feb 24, 2014 at 09:06:51AM +0100, Mike Galbraith wrote:
> > Hi Peter,
> >
> > I wonder if the below makes sense for mainline.
> >
> > Background: I received some rather surprising news recently, a user of
> > old 2.6.32 kernels regularly receive log spam stemming from old 208 day
> > era warnings/protections inserted to prevent explosions from what was at
> > the time unknown bad juju happening (but don't report logs that look
> > like graffiti artist with an unlimited supply of spray paint gone mad).
> >
> > The kernel that emitted the below does NOT contain..
> > 9993bc63 sched/x86: Fix overflow in cyc2ns_offset
> > ..though these folks use kexec fwtw. They're one of those "You update
> > your kernel IFF world stops spinning" users, so will likely not be
> > terribly interested in me making their boxen say BUG(), and may even be
> > doing something naughty that induces it for all I know.
> >
> > In any case, NOT using nutty output from the intentionally racy function
> > seems like a good plan no matter who or what makes weird unreproducible
> > (elsewhere) sh*t happen. Wedging a bent 64 bit peg into 32 bit hole
> > could make boom, on top of doing funny things to balancing.
> >
> > sched: don't use nutty scale_rt_power() output
> >
> > Boxen instructed to gripe if they see nutty cpu_power catch us
> > trashing it while seriously dazed and confused for an unknown reason.
> >
> > Dec 18 05:50:56 kernel: [40091179.401405] update_group_power: cpu_power = 3148183471
> > Dec 18 05:51:01 /usr/sbin/cron[2279]: (root) CMD (/opt/blah/fix_cdr_bin.job >> /opt/blah/fix_cdr_bin.out 2>&1)
> > Dec 18 05:51:06 kernel: [40091189.455713] update_cpu_power: cpu_power = 19495027282; scale_rt = 19495027282
> > Dec 18 05:51:16 kernel: [22076800.665578] update_cpu_power: cpu_power = 2671067611; scale_rt = 18428729677871137243
> > Dec 18 05:51:16 kernel: [40091199.188773] update_cpu_power: cpu_power = 2675064501; scale_rt = 18428729677875134133
> >
> > Don't do that, make a scary warning instead.
> >
>
> Yeah, I'm in two minds about that. Crappy clocks can make a whole lot of
> missery. Then again, we usually guard against them going backwards.
>
> How about something like so? Most other sites don't complain about
> clocks going backwards either, they just deal with it.

Yeah, better to warp protect scale_rt_power() directly.

This small set of identical weird ass boxen should be reliable tsc.
They jump back and forth in time by _exactly 208 days_, and do that
straight from boot, and randomly thereafter. Wish I could get my hands
on one of the things, but that ain't gonna happen.

Those boxen have long uptimes, which proves you can survive with a sched
clock that's going completely bonkers, which is kinda surprising to me.
On a busy box, I'd expect some poor victim to eat the mother of all
latency hits.

> ---
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5564,6 +5564,7 @@ static unsigned long scale_rt_power(int
> {
> struct rq *rq = cpu_rq(cpu);
> u64 total, available, age_stamp, avg;
> + s64 delta;
>
> /*
> * Since we're reading these variables without serialization make sure
> @@ -5572,7 +5573,11 @@ static unsigned long scale_rt_power(int
> age_stamp = ACCESS_ONCE(rq->age_stamp);
> avg = ACCESS_ONCE(rq->rt_avg);
>
> - total = sched_avg_period() + (rq_clock(rq) - age_stamp);
> + delta = rq_clock(rq) - age_stamp;
> + if (unlikely(delta < 0))
> + delta = 0;
> +
> + total = sched_avg_period() + delta;
>
> if (unlikely(total < avg)) {
> /* Ensures that power won't end up being negative */

Subject: [tip:sched/core] sched: Make scale_rt_power() deal with backward clocks

Commit-ID: cadefd3d6cc914d95163ba1eda766bfe7ce1e5b7
Gitweb: http://git.kernel.org/tip/cadefd3d6cc914d95163ba1eda766bfe7ce1e5b7
Author: Peter Zijlstra <[email protected]>
AuthorDate: Thu, 27 Feb 2014 10:40:35 +0100
Committer: Ingo Molnar <[email protected]>
CommitDate: Fri, 18 Apr 2014 12:07:21 +0200

sched: Make scale_rt_power() deal with backward clocks

Mike reported that, while unlikely, its entirely possible for
scale_rt_power() to see the time go backwards. This yields rather
'interesting' results.

So like all other sites that deal with clocks; make this one ignore
backward clock movement too.

Reported-by: Mike Galbraith <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Cc: Linus Torvalds <[email protected]>
Cc: [email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched/fair.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7570dd9..5e157f1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5564,6 +5564,7 @@ static unsigned long scale_rt_power(int cpu)
{
struct rq *rq = cpu_rq(cpu);
u64 total, available, age_stamp, avg;
+ s64 delta;

/*
* Since we're reading these variables without serialization make sure
@@ -5572,7 +5573,11 @@ static unsigned long scale_rt_power(int cpu)
age_stamp = ACCESS_ONCE(rq->age_stamp);
avg = ACCESS_ONCE(rq->rt_avg);

- total = sched_avg_period() + (rq_clock(rq) - age_stamp);
+ delta = rq_clock(rq) - age_stamp;
+ if (unlikely(delta < 0))
+ delta = 0;
+
+ total = sched_avg_period() + delta;

if (unlikely(total < avg)) {
/* Ensures that power won't end up being negative */