2014-12-02 22:53:48

by Sasha Levin

Subject: sched: odd values for effective load calculations

Hi all,

I was fuzzing with trinity inside a KVM tools guest, running the latest -next
kernel along with the undefined behaviour sanitizer patch, and hit the following:

[ 787.894288] ================================================================================
[ 787.897074] UBSan: Undefined behaviour in kernel/sched/fair.c:4541:17
[ 787.898981] signed integer overflow:
[ 787.900066] 361516561629678 * 101500 cannot be represented in type 'long long int'
[ 787.900066] CPU: 18 PID: 12958 Comm: trinity-c103 Not tainted 3.18.0-rc6-next-20141201-sasha-00070-g028060a-dirty #1528
[ 787.900066] 0000000000000000 0000000000000000 ffffffff93b0f890 ffff8806e3eff918
[ 787.900066] ffffffff91f1cf26 1ffffffff3c2de73 ffffffff93b0f8a8 ffff8806e3eff938
[ 787.900066] ffffffff91f1fb90 1ffffffff3c2de73 ffffffff93b0f8a8 ffff8806e3eff9f8
[ 787.900066] Call Trace:
[ 787.900066] dump_stack (lib/dump_stack.c:52)
[ 787.900066] ubsan_epilogue (lib/ubsan.c:159)
[ 787.900066] handle_overflow (lib/ubsan.c:191)
[ 787.900066] ? __do_page_fault (arch/x86/mm/fault.c:1220)
[ 787.900066] ? local_clock (kernel/sched/clock.c:392)
[ 787.900066] __ubsan_handle_mul_overflow (lib/ubsan.c:218)
[ 787.900066] select_task_rq_fair (kernel/sched/fair.c:4541 kernel/sched/fair.c:4755)
[ 787.900066] try_to_wake_up (kernel/sched/core.c:1415 kernel/sched/core.c:1724)
[ 787.900066] ? trace_hardirqs_on_thunk (arch/x86/lib/thunk_64.S:33)
[ 787.900066] default_wake_function (kernel/sched/core.c:2979)
[ 787.900066] ? get_parent_ip (kernel/sched/core.c:2559)
[ 787.900066] autoremove_wake_function (kernel/sched/wait.c:295)
[ 787.900066] ? get_parent_ip (kernel/sched/core.c:2559)
[ 787.900066] __wake_up_common (kernel/sched/wait.c:73)
[ 787.900066] __wake_up_sync_key (include/linux/spinlock.h:364 kernel/sched/wait.c:146)
[ 787.900066] pipe_write (fs/pipe.c:466)
[ 787.900066] ? kasan_poison_shadow (mm/kasan/kasan.c:48)
[ 787.900066] ? new_sync_read (fs/read_write.c:480)
[ 787.900066] do_iter_readv_writev (fs/read_write.c:681)
[ 787.900066] compat_do_readv_writev (fs/read_write.c:1029)
[ 787.900066] ? wait_for_partner (fs/pipe.c:340)
[ 787.900066] ? _raw_spin_unlock (./arch/x86/include/asm/preempt.h:95 include/linux/spinlock_api_smp.h:152 kernel/locking/spinlock.c:183)
[ 787.900066] ? __this_cpu_preempt_check (lib/smp_processor_id.c:63)
[ 787.900066] ? syscall_trace_enter_phase1 (include/linux/context_tracking.h:27 arch/x86/kernel/ptrace.c:1486)
[ 787.900066] compat_writev (fs/read_write.c:1145)
[ 787.900066] compat_SyS_writev (fs/read_write.c:1163 fs/read_write.c:1151)
[ 787.900066] ia32_do_call (arch/x86/ia32/ia32entry.S:446)
[ 787.900066] ================================================================================

The values for effective load seem a bit off (and are overflowing!).

One change I did recently was to disable lock debugging, could it be related?


Thanks,
Sasha


2014-12-13 08:30:20

by Ingo Molnar

Subject: Re: sched: odd values for effective load calculations


* Sasha Levin <[email protected]> wrote:

> Hi all,
>
> I was fuzzing with trinity inside a KVM tools guest, running the latest -next
> kernel along with the undefined behaviour sanitizer patch, and hit the following:
>
> [ 787.894288] ================================================================================
> [ 787.897074] UBSan: Undefined behaviour in kernel/sched/fair.c:4541:17
> [ 787.898981] signed integer overflow:
> [ 787.900066] 361516561629678 * 101500 cannot be represented in type 'long long int'
> [ 787.900066] CPU: 18 PID: 12958 Comm: trinity-c103 Not tainted 3.18.0-rc6-next-20141201-sasha-00070-g028060a-dirty #1528
> [ 787.900066] 0000000000000000 0000000000000000 ffffffff93b0f890 ffff8806e3eff918
> [ 787.900066] ffffffff91f1cf26 1ffffffff3c2de73 ffffffff93b0f8a8 ffff8806e3eff938
> [ 787.900066] ffffffff91f1fb90 1ffffffff3c2de73 ffffffff93b0f8a8 ffff8806e3eff9f8
> [ 787.900066] Call Trace:
> [ 787.900066] dump_stack (lib/dump_stack.c:52)
> [ 787.900066] ubsan_epilogue (lib/ubsan.c:159)
> [ 787.900066] handle_overflow (lib/ubsan.c:191)
> [ 787.900066] ? __do_page_fault (arch/x86/mm/fault.c:1220)
> [ 787.900066] ? local_clock (kernel/sched/clock.c:392)
> [ 787.900066] __ubsan_handle_mul_overflow (lib/ubsan.c:218)
> [ 787.900066] select_task_rq_fair (kernel/sched/fair.c:4541 kernel/sched/fair.c:4755)
> [ 787.900066] try_to_wake_up (kernel/sched/core.c:1415 kernel/sched/core.c:1724)
> [ 787.900066] ? trace_hardirqs_on_thunk (arch/x86/lib/thunk_64.S:33)
> [ 787.900066] default_wake_function (kernel/sched/core.c:2979)
> [ 787.900066] ? get_parent_ip (kernel/sched/core.c:2559)
> [ 787.900066] autoremove_wake_function (kernel/sched/wait.c:295)
> [ 787.900066] ? get_parent_ip (kernel/sched/core.c:2559)
> [ 787.900066] __wake_up_common (kernel/sched/wait.c:73)
> [ 787.900066] __wake_up_sync_key (include/linux/spinlock.h:364 kernel/sched/wait.c:146)
> [ 787.900066] pipe_write (fs/pipe.c:466)
> [ 787.900066] ? kasan_poison_shadow (mm/kasan/kasan.c:48)
> [ 787.900066] ? new_sync_read (fs/read_write.c:480)
> [ 787.900066] do_iter_readv_writev (fs/read_write.c:681)
> [ 787.900066] compat_do_readv_writev (fs/read_write.c:1029)
> [ 787.900066] ? wait_for_partner (fs/pipe.c:340)
> [ 787.900066] ? _raw_spin_unlock (./arch/x86/include/asm/preempt.h:95 include/linux/spinlock_api_smp.h:152 kernel/locking/spinlock.c:183)
> [ 787.900066] ? __this_cpu_preempt_check (lib/smp_processor_id.c:63)
> [ 787.900066] ? syscall_trace_enter_phase1 (include/linux/context_tracking.h:27 arch/x86/kernel/ptrace.c:1486)
> [ 787.900066] compat_writev (fs/read_write.c:1145)
> [ 787.900066] compat_SyS_writev (fs/read_write.c:1163 fs/read_write.c:1151)
> [ 787.900066] ia32_do_call (arch/x86/ia32/ia32entry.S:446)
> [ 787.900066] ================================================================================
>
> The values for effective load seem a bit off (and are overflowing!).

It definitely looks like a bug in SMP load balancing!

> One change I did recently was to disable lock debugging, could it be related?

Should be unrelated to lock debugging, other than that certain races
will occur in different patterns with lock debugging on or off.

Thanks,

Ingo

2014-12-15 12:12:45

by Peter Zijlstra

Subject: Re: sched: odd values for effective load calculations


Sorry for the long delay, I was out for a few weeks due to having become
a dad for the second time.

On Sat, Dec 13, 2014 at 09:30:12AM +0100, Ingo Molnar wrote:
> * Sasha Levin <[email protected]> wrote:
>
> > Hi all,
> >
> > I was fuzzing with trinity inside a KVM tools guest, running the latest -next
> > kernel along with the undefined behaviour sanitizer patch, and hit the following:
> >
> > [ 787.894288] ================================================================================
> > [ 787.897074] UBSan: Undefined behaviour in kernel/sched/fair.c:4541:17
> > [ 787.898981] signed integer overflow:
> > [ 787.900066] 361516561629678 * 101500 cannot be represented in type 'long long int'

So that's:

	this_eff_load *= this_load +
			 effective_load(tg, this_cpu, weight, weight);

Going by the numbers the 101500 must be 'this_eff_load', 100 * ~1024
makes that. Which makes the rhs 'large'. Do you have
CONFIG_FAIR_GROUP_SCHED enabled? If so, what kind of cgroup hierarchy
are you using?

In any case, bit sad this doesn't have a register dump included :/

Is this easy to reproduce or something that happened once?

> > The values for effective load seem a bit off (and are overflowing!).
>
> It definitely looks like a bug in SMP load balancing!

Yeah, although theoretically (and somewhat practically) this can be
triggered in more places if you manage to run up the 'weight' with
enough tasks.

That said, it should at worst result in 'funny' balancing behaviour, not
anything else.

2014-12-15 13:14:27

by Peter Zijlstra

Subject: Re: sched: odd values for effective load calculations

On Mon, Dec 15, 2014 at 01:12:27PM +0100, Peter Zijlstra wrote:
>
> Sorry for the long delay, I was out for a few weeks due to having become
> a dad for the second time.
>
> On Sat, Dec 13, 2014 at 09:30:12AM +0100, Ingo Molnar wrote:
> > * Sasha Levin <[email protected]> wrote:
> >
> > > Hi all,
> > >
> > > I was fuzzing with trinity inside a KVM tools guest, running the latest -next
> > > kernel along with the undefined behaviour sanitizer patch, and hit the following:
> > >
> > > [ 787.894288] ================================================================================
> > > [ 787.897074] UBSan: Undefined behaviour in kernel/sched/fair.c:4541:17
> > > [ 787.898981] signed integer overflow:
> > > [ 787.900066] 361516561629678 * 101500 cannot be represented in type 'long long int'
>
> So that's:
>
> this_eff_load *= this_load +
> effective_load(tg, this_cpu, weight, weight);
>
> Going by the numbers the 101500 must be 'this_eff_load', 100 * ~1024
> makes that. Which makes the rhs 'large'. Do you have
> CONFIG_FAIR_GROUP_SCHED enabled? If so, what kind of cgroup hierarchy
> are you using?
>
> In any case, bit sad this doesn't have a register dump included :/

Hmm, I was hoping to be able to see whether it was this_load or the
effective_load() result being silly large, but going by the ASM output
of my compiler this isn't going to be available in registers, it's all
stack spills.

Pinning my hopes on that reproducibility thing :/

2014-12-16 04:52:26

by Sasha Levin

Subject: Re: sched: odd values for effective load calculations

On 12/15/2014 07:12 AM, Peter Zijlstra wrote:
>
> Sorry for the long delay, I was out for a few weeks due to having become
> a dad for the second time.

Congrats! May you be able to sleep at night sooner rather than later.

> On Sat, Dec 13, 2014 at 09:30:12AM +0100, Ingo Molnar wrote:
>> * Sasha Levin <[email protected]> wrote:
>>
>>> Hi all,
>>>
>>> I was fuzzing with trinity inside a KVM tools guest, running the latest -next
>>> kernel along with the undefined behaviour sanitizer patch, and hit the following:
>>>
>>> [ 787.894288] ================================================================================
>>> [ 787.897074] UBSan: Undefined behaviour in kernel/sched/fair.c:4541:17
>>> [ 787.898981] signed integer overflow:
>>> [ 787.900066] 361516561629678 * 101500 cannot be represented in type 'long long int'
>
> So that's:
>
> this_eff_load *= this_load +
> effective_load(tg, this_cpu, weight, weight);
>
> Going by the numbers the 101500 must be 'this_eff_load', 100 * ~1024
> makes that. Which makes the rhs 'large'. Do you have
> CONFIG_FAIR_GROUP_SCHED enabled? If so, what kind of cgroup hierarchy
> are you using?

CONFIG_FAIR_GROUP_SCHED is enabled. There's no cgroup set-up initially,
but I figure that trinity is able to do crazy things here.

> In any case, bit sad this doesn't have a register dump included :/
>
> Is this easy to reproduce or something that happened once?

It's fairly reproducible; I've seen it happen quite a few times. What other
information might be useful?

>>> The values for effective load seem a bit off (and are overflowing!).
>>
>> It definitely looks like a bug in SMP load balancing!
>
> Yeah, although theoretically (and somewhat practical) this can be
> triggered in more places if you manage to run up the 'weight' with
> enough tasks.
>
> That said, it should at worst result in 'funny' balancing behaviour, not
> anything else.

I'm not sure if you've caught up on the RCU stall issue we've been trying
to track down (https://lkml.org/lkml/2014/11/14/656), but could this "funny"
balancing behaviour be "funny" enough to cause a stall?


Thanks,
Sasha

2014-12-16 05:30:05

by Sasha Levin

Subject: Re: sched: odd values for effective load calculations

On 12/15/2014 08:14 AM, Peter Zijlstra wrote:
>>>> [ 787.894288] ================================================================================
>>>> [ 787.897074] UBSan: Undefined behaviour in kernel/sched/fair.c:4541:17
>>>> [ 787.898981] signed integer overflow:
>>>> [ 787.900066] 361516561629678 * 101500 cannot be represented in type 'long long int'
>>
>> So that's:
>>
>> this_eff_load *= this_load +
>> effective_load(tg, this_cpu, weight, weight);
>>
>> Going by the numbers the 101500 must be 'this_eff_load', 100 * ~1024
>> makes that. Which makes the rhs 'large'. Do you have
>> CONFIG_FAIR_GROUP_SCHED enabled? If so, what kind of cgroup hierarchy
>> are you using?
>>
>> In any case, bit sad this doesn't have a register dump included :/
> Hmm, I was hoping to be able to see if it was this_load or the
> effective_load() result being silly large, but going by the ASM output
> of my compiler this isn't going to be available in registers, its all
> stack spills.
>
> Pinning my hopes on that reproducibility thing :/

Okay, yeah, it's very reproducible. I've added:

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index df2cdf7..e1fbe1a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4486,7 +4486,7 @@ static int wake_wide(struct task_struct *p)
 
 static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
 {
-	s64 this_load, load;
+	s64 this_load, load, tmps;
 	s64 this_eff_load, prev_eff_load;
 	int idx, this_cpu, prev_cpu;
 	struct task_group *tg;
@@ -4538,6 +4538,9 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
 	prev_eff_load *= capacity_of(this_cpu);
 
 	if (this_load > 0) {
+		if (__builtin_mul_overflow(this_eff_load, this_load +
+				effective_load(tg, this_cpu, weight, weight), &tmps))
+			printk(KERN_CRIT "%lld %lld %lld", this_eff_load, this_load, effective_load(tg, this_cpu, weight, weight));
 		this_eff_load *= this_load +
 				 effective_load(tg, this_cpu, weight, weight);

And got:

[ 437.511964] 91600 1765238667340524 81
[ 437.512781] ================================================================================
[ 437.516069] UBSan: Undefined behaviour in kernel/sched/fair.c:4544:17
[ 437.517888] signed integer overflow:
[ 437.518721] 1765238667340605 * 91600 cannot be represented in type 'long long int'
[ 437.520051] CPU: 16 PID: 23069 Comm: trinity-c510 Not tainted 3.18.0-next-20141211-sasha-00069-g11f17a7-dirty #1607
[ 437.520153] 0000000000000000 0000000000000000 ffffffffb1f0fc70 ffff880321e8f908
[ 437.520153] ffffffffb0250d8f 1ffffffff78a0603 ffffffffb1f0fc88 ffff880321e8f928
[ 437.520153] ffffffffb0252f96 1ffffffff78a0603 ffffffffb1f0fc88 ffff880321e8f9e8
[ 437.520153] Call Trace:
[ 437.520153] dump_stack (lib/dump_stack.c:52)
[ 437.520153] ubsan_epilogue (lib/ubsan.c:159)
[ 437.520153] handle_overflow (lib/ubsan.c:191)
[ 437.520153] ? printk (./arch/x86/include/asm/preempt.h:95 kernel/printk/printk.c:1863)
[ 437.520153] __ubsan_handle_mul_overflow (lib/ubsan.c:218)
[ 437.520153] select_task_rq_fair (kernel/sched/fair.c:4544 kernel/sched/fair.c:4758)
[ 437.520153] ? get_parent_ip (kernel/sched/core.c:2564)
[ 437.520153] ? get_parent_ip (kernel/sched/core.c:2564)
[ 437.520153] try_to_wake_up (kernel/sched/core.c:1415 kernel/sched/core.c:1729)
[ 437.520153] ? deactivate_slab (include/linux/spinlock.h:349 mm/slub.c:1940)
[ 437.520153] default_wake_function (kernel/sched/core.c:2988)
[ 437.520153] ? get_parent_ip (kernel/sched/core.c:2564)
[ 437.520153] autoremove_wake_function (kernel/sched/wait.c:295)
[ 437.520153] __wake_up_common (kernel/sched/wait.c:73)
[ 437.520153] ? copy_page_to_iter (./arch/x86/include/asm/preempt.h:95 include/linux/uaccess.h:36 include/linux/highmem.h:75 mm/iov_iter.c:180 mm/iov_iter.c:432)
[ 437.520153] __wake_up_sync_key (include/linux/spinlock.h:364 kernel/sched/wait.c:146)
[ 437.520153] pipe_read (fs/pipe.c:317)
[ 437.520153] ? do_sync_readv_writev (fs/read_write.c:396)
[ 437.520153] do_iter_readv_writev (fs/read_write.c:681)
[ 437.520153] do_readv_writev (fs/read_write.c:849)
[ 437.520153] ? pipe_write (fs/pipe.c:231)
[ 437.520153] ? acct_account_cputime (kernel/tsacct.c:168)
[ 437.520153] ? preempt_count_sub (kernel/sched/core.c:2620)
[ 437.520153] ? _raw_spin_unlock (./arch/x86/include/asm/preempt.h:95 include/linux/spinlock_api_smp.h:152 kernel/locking/spinlock.c:183)
[ 437.520153] ? vtime_account_user (kernel/sched/cputime.c:701)
[ 437.520153] vfs_readv (fs/read_write.c:881)
[ 437.520153] SyS_readv (fs/read_write.c:907 fs/read_write.c:898)
[ 437.520153] ? trace_hardirqs_on_thunk (arch/x86/lib/thunk_64.S:33)
[ 437.520153] system_call_fastpath (arch/x86/kernel/entry_64.S:423)
[ 437.520153] ================================================================================

So it's actually 'this_load' going bananas.


Thanks,
Sasha

2014-12-16 10:08:10

by Yuyang Du

Subject: Re: sched: odd values for effective load calculations

On Mon, Dec 15, 2014 at 11:51:46PM -0500, Sasha Levin wrote:
> On 12/15/2014 07:12 AM, Peter Zijlstra wrote:
> >
> > Sorry for the long delay, I was out for a few weeks due to having become
> > a dad for the second time.
>
> Congrats! May you be able to sleep at night sooner rather than later.

Congrats +1.

>
> It's fairly reproducible, I've seen it happen quite a few times. What other
> information might be useful?
>

Sasha, it might be helpful to see whether this_load comes from:

this_load1: this_load = target_load(this_cpu, idx);

or

this_load2: this_load += effective_load(tg, this_cpu, -weight, -weight);

It really does not seem to be this_load1, but the calculation in
effective_load() is a bit too complicated to see where the problem is.

2014-12-16 15:33:53

by Peter Zijlstra

Subject: Re: sched: odd values for effective load calculations

On Mon, Dec 15, 2014 at 11:51:46PM -0500, Sasha Levin wrote:
> I'm not sure if you've caught up on the RCU stall issue we've been trying
> to track down (https://lkml.org/lkml/2014/11/14/656), but could this "funny"
> balancing behaviour be "funny" enough to cause a stall?

Typically not; the worst degenerate modes are either running everything
on one cpu or constantly migrating tasks. Both suck performance-wise,
but are 'valid' modes of operation.

RCU stalls require not actually getting to userspace or otherwise
delaying grace periods for egregious amounts of time. The only way I can
see that happening is a load-balance pass not finishing at all, and then
you'd consistently get stack traces from inside the balancer -- and I'm
not seeing that in the email referenced.

2014-12-16 15:38:10

by Peter Zijlstra

Subject: Re: sched: odd values for effective load calculations

On Tue, Dec 16, 2014 at 12:29:28AM -0500, Sasha Levin wrote:
> On 12/15/2014 08:14 AM, Peter Zijlstra wrote:

> > Pinning my hopes on that reproducibility thing :/
>
> Okay, yeah, it's very reproducible. I've added:
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index df2cdf7..e1fbe1a 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4486,7 +4486,7 @@ static int wake_wide(struct task_struct *p)
>
>  static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
>  {
> -	s64 this_load, load;
> +	s64 this_load, load, tmps;
>  	s64 this_eff_load, prev_eff_load;
>  	int idx, this_cpu, prev_cpu;
>  	struct task_group *tg;
> @@ -4538,6 +4538,9 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
>  	prev_eff_load *= capacity_of(this_cpu);
>
>  	if (this_load > 0) {
> +		if (__builtin_mul_overflow(this_eff_load, this_load +
> +				effective_load(tg, this_cpu, weight, weight), &tmps))
> +			printk(KERN_CRIT "%lld %lld %lld", this_eff_load, this_load, effective_load(tg, this_cpu, weight, weight));
>  		this_eff_load *= this_load +
>  				 effective_load(tg, this_cpu, weight, weight);

Minor nit: in general it would be recommended to evaluate effective_load()
once, not thrice; state might have changed in between the calls and the
results might differ. Still..

> And got:
>
> [ 437.511964] 91600 1765238667340524 81

> So it's actually 'this_load' going bananas.

That is indeed a fairly strong indication it's not effective_load(),
which is good, since that's one hairy piece of cra^Wcode.

Lemme go ponder about this_load.

2014-12-18 10:09:21

by Yuyang Du

Subject: Re: sched: odd values for effective load calculations

On Tue, Dec 16, 2014 at 12:29:28AM -0500, Sasha Levin wrote:
> On 12/15/2014 08:14 AM, Peter Zijlstra wrote:
> >>>> [ 787.894288] ================================================================================
> >>>> [ 787.897074] UBSan: Undefined behaviour in kernel/sched/fair.c:4541:17
> >>>> [ 787.898981] signed integer overflow:
> >>>> [ 787.900066] 361516561629678 * 101500 cannot be represented in type 'long long int'
> >>
> >> So that's:
> >>
> >> this_eff_load *= this_load +
> >> effective_load(tg, this_cpu, weight, weight);
> >>
> >> Going by the numbers the 101500 must be 'this_eff_load', 100 * ~1024
> >> makes that. Which makes the rhs 'large'. Do you have
> >> CONFIG_FAIR_GROUP_SCHED enabled? If so, what kind of cgroup hierarchy
> >> are you using?
> >>
> >> In any case, bit sad this doesn't have a register dump included :/
> > Hmm, I was hoping to be able to see if it was this_load or the
> > effective_load() result being silly large, but going by the ASM output
> > of my compiler this isn't going to be available in registers, its all
> > stack spills.
> >
> > Pinning my hopes on that reproducibility thing :/
>
> Okay, yeah, it's very reproducible. I've added:
>

Hi Sasha,

I tried to reproduce this overflow, but had no luck (trinity has been
running in a KVM VM for an hour)...

The trinity version used is v1.4, simply launched as ./trinity.

Could you detail your setup and procedure?

Thanks,
Yuyang

2014-12-19 08:28:16

by Yuyang Du

Subject: Re: sched: odd values for effective load calculations

On Tue, Dec 16, 2014 at 10:09:48AM +0800, Yuyang Du wrote:
>
> Sasha, it might be helpful to see this_load is from:
>
> this_load1: this_load = target_load(this_cpu, idx);
>
> or
>
> this_load2: this_load += effective_load(tg, this_cpu, -weight, -weight);
>
> It really does not seem to be this_load1, while the calc of effective_load is a bit
> complicated to see what the problem is.

Hi all,

I finally managed to reproduce this, though not with trinity, just by rebooting repeatedly.

Indeed, the problem is from:

this_load2: this_load += effective_load(tg, this_cpu, -weight, -weight);

After digging into effective_load(), the root cause is:

	wl = (w * tg->shares) / W;

If w is negative, it is converted to unsigned long, so the product may
wrap around and end up as an insane number.

I tried this in userspace; interestingly, if we have:

	wl = w * tg->shares;
	wl /= W;

the result is OK, but it is not OK with the two lines combined as in the
original.

Anyway, the following patch can fix this.

---
Subject: [PATCH] sched: Fix long and unsigned long multiplication error in
effective_load

In effective_load(), we have (long w * unsigned long tg->shares) / long W;
when w is negative, it is converted to unsigned long and hence the product
is insanely large. Fix this by casting tg->shares to long.

Reported-by: Sasha Levin <[email protected]>
Signed-off-by: Yuyang Du <[email protected]>
---
kernel/sched/fair.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index df2cdf7..6b99659 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4424,7 +4424,7 @@ static long effective_load(struct task_group *tg, int cpu, long wl, long wg)
 	 * wl = S * s'_i; see (2)
 	 */
 	if (W > 0 && w < W)
-		wl = (w * tg->shares) / W;
+		wl = (w * (long)tg->shares) / W;
 	else
 		wl = tg->shares;

--

2014-12-19 11:20:44

by Peter Zijlstra

Subject: Re: sched: odd values for effective load calculations

On Fri, Dec 19, 2014 at 08:29:56AM +0800, Yuyang Du wrote:
> Subject: [PATCH] sched: Fix long and unsigned long multiplication error in effective_load
>
> In effective_load, we have (long w * unsigned long tg->shares) / long W,
> when w is negative, it is cast to unsigned long and hence the product is
> insanely large. Fix this by casting tg->shares to long.
>
> Reported-by: Sasha Levin <[email protected]>
> Signed-off-by: Yuyang Du <[email protected]>
> ---
> kernel/sched/fair.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index df2cdf7..6b99659 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4424,7 +4424,7 @@ static long effective_load(struct task_group *tg, int cpu, long wl, long wg)
>  	 * wl = S * s'_i; see (2)
>  	 */
>  	if (W > 0 && w < W)
> -		wl = (w * tg->shares) / W;
> +		wl = (w * (long)tg->shares) / W;
>  	else
>  		wl = tg->shares;


Oh, nice! thanks!