2014-07-17 07:58:32

by Peter Zijlstra

Subject: Re: Random panic in load_balance() with 3.16-rc

On Thu, Jul 17, 2014 at 04:31:04PM +0900, Michel Dänzer wrote:
>
> I've been running into the panic captured in the attached picture (hope
> it's legible) randomly while running 3.16-rc4 and -rc5. I haven't
> noticed any pattern as to when it happens; at least once it happened
> while the box was basically sitting idle.
>
> dmesg, .config and /proc/cpuinfo attached as well; let me know if you
> need anything else.
>

Does lkml.kernel.org/r/[email protected] cure things?



2014-07-18 09:30:08

by Michel Dänzer

Subject: Re: Random panic in load_balance() with 3.16-rc

On 17.07.2014 16:58, Peter Zijlstra wrote:
> On Thu, Jul 17, 2014 at 04:31:04PM +0900, Michel Dänzer wrote:
>>
>> I've been running into the panic captured in the attached picture (hope
>> it's legible) randomly while running 3.16-rc4 and -rc5. I haven't
>> noticed any pattern as to when it happens; at least once it happened
>> while the box was basically sitting idle.
>>
>> dmesg, .config and /proc/cpuinfo attached as well; let me know if you
>> need anything else.
>
> Does lkml.kernel.org/r/[email protected] cure things?

Yes, adding back

cpumask_clear(sched_group_cpus(sg));

seems to do the trick, thanks.

There's a long weekend coming up for me, but after that I'll be happy to
test any better fix you guys come up with.


--
Earthling Michel Dänzer | http://www.amd.com
Libre software enthusiast | Mesa and X developer



2014-07-22 06:13:23

by Michel Dänzer

Subject: Re: Random panic in load_balance() with 3.16-rc

On 18.07.2014 18:29, Michel Dänzer wrote:
> On 17.07.2014 16:58, Peter Zijlstra wrote:
>> On Thu, Jul 17, 2014 at 04:31:04PM +0900, Michel Dänzer wrote:
>>>
>>> I've been running into the panic captured in the attached picture (hope
>>> it's legible) randomly while running 3.16-rc4 and -rc5. I haven't
>>> noticed any pattern as to when it happens; at least once it happened
>>> while the box was basically sitting idle.
>>>
>>> dmesg, .config and /proc/cpuinfo attached as well; let me know if you
>>> need anything else.
>>
>> Does lkml.kernel.org/r/[email protected] cure things?
>
> Yes, adding back
>
> cpumask_clear(sched_group_cpus(sg));
>
> seems to do the trick, thanks.

I'm afraid it happened again with 3.16-rc5 plus the above change. It
seemed to last longer than before, but maybe that was just luck.

Going to try 3.16-rc6 now.


--
Earthling Michel Dänzer | http://www.amd.com
Libre software enthusiast | Mesa and X developer



2014-07-23 03:53:31

by Michel Dänzer

Subject: Re: Random panic in load_balance() with 3.16-rc

On 22.07.2014 15:13, Michel Dänzer wrote:
> On 18.07.2014 18:29, Michel Dänzer wrote:
>> On 17.07.2014 16:58, Peter Zijlstra wrote:
>>> On Thu, Jul 17, 2014 at 04:31:04PM +0900, Michel Dänzer wrote:
>>>>
>>>> I've been running into the panic captured in the attached picture (hope
>>>> it's legible) randomly while running 3.16-rc4 and -rc5. I haven't
>>>> noticed any pattern as to when it happens; at least once it happened
>>>> while the box was basically sitting idle.
>>>>
>>>> dmesg, .config and /proc/cpuinfo attached as well; let me know if you
>>>> need anything else.
>>>
>>> Does lkml.kernel.org/r/[email protected] cure things?
>>
>> Yes, adding back
>>
>> cpumask_clear(sched_group_cpus(sg));
>>
>> seems to do the trick, thanks.
>
> I'm afraid it happened again with 3.16-rc5 plus the above change. It
> seemed to last longer than before, but maybe that was just luck.
>
> Going to try 3.16-rc6 now.

Just happened again with the same change on top of 3.16-rc6.

Are there any other potential fixes yet?

I hope this problem is on the radar as a showstopper for 3.16.


--
Earthling Michel Dänzer | http://www.amd.com
Libre software enthusiast | Mesa and X developer



2014-07-23 04:21:43

by Linus Torvalds

Subject: Re: Random panic in load_balance() with 3.16-rc

On Tue, Jul 22, 2014 at 8:53 PM, Michel Dänzer <[email protected]> wrote:
>
> Just happened again with the same change on top of 3.16-rc6.

The (maybe) related bugzilla entry is just odd. Bruno Wolff reports
that the BUG_ON() in his added patch triggers:

+ cpumask_clear(sched_group_cpus(sg));
+ sg->sgc->capacity = 0;
+ BUG_ON(!cpumask_empty(sched_group_cpus(sg)));

where it *just* did a cpumask_clear(), and now the BUG_ON() triggers
that it's no longer empty?

That would imply an allocation error, but all the sched groups seem to
be properly allocated with the proper addition of cpumask_size().

And his config file even has NR_CPUS being 32, so it should be a
single word of bitmap, which triggers all the simple code.

Completely insane, in other words.

Linus
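
To make the point above concrete: with NR_CPUS at 32 the mask fits in a single unsigned long, so a clear followed by an emptiness test touches one word, and the test should not fail unless something else writes to that word in between. The following is a minimal userspace sketch of that sequence. The toy_* names are hypothetical stand-ins, not the kernel's cpumask implementation.

```c
#include <assert.h>
#include <limits.h>
#include <string.h>

/* Hypothetical userspace stand-ins; these are not the kernel's cpumask API. */
#define TOY_NR_CPUS        32u
#define TOY_BITS_PER_LONG  (sizeof(unsigned long) * CHAR_BIT)
#define TOY_MASK_LONGS     ((TOY_NR_CPUS + TOY_BITS_PER_LONG - 1) / TOY_BITS_PER_LONG)

struct toy_cpumask {
    unsigned long bits[TOY_MASK_LONGS];
};

/* Zero every word of the mask. */
static void toy_cpumask_clear(struct toy_cpumask *m)
{
    memset(m->bits, 0, sizeof(m->bits));
}

/* Mark one CPU as present in the mask. */
static void toy_cpumask_set_cpu(unsigned int cpu, struct toy_cpumask *m)
{
    assert(cpu < TOY_NR_CPUS);
    m->bits[cpu / TOY_BITS_PER_LONG] |= 1UL << (cpu % TOY_BITS_PER_LONG);
}

/* Return 1 if no bit below TOY_NR_CPUS is set, 0 otherwise. */
static int toy_cpumask_empty(const struct toy_cpumask *m)
{
    for (unsigned int cpu = 0; cpu < TOY_NR_CPUS; cpu++)
        if (m->bits[cpu / TOY_BITS_PER_LONG] & (1UL << (cpu % TOY_BITS_PER_LONG)))
            return 0;
    return 1;
}

/* Mirrors the clear-then-check order of the debugging patch quoted above. */
static int toy_clear_then_check(struct toy_cpumask *m)
{
    toy_cpumask_clear(m);
    return toy_cpumask_empty(m);
}
```

In this model toy_clear_then_check() can only return 0 if the storage behind the mask is modified by someone else between the two calls, which is why a triggering BUG_ON() points at the allocation or at a concurrent writer rather than at the clear itself.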

2014-07-23 06:49:55

by Peter Zijlstra

Subject: Re: Random panic in load_balance() with 3.16-rc

On Tue, Jul 22, 2014 at 09:21:40PM -0700, Linus Torvalds wrote:
> On Tue, Jul 22, 2014 at 8:53 PM, Michel Dänzer <[email protected]> wrote:
> >
> > Just happened again with the same change on top of 3.16-rc6.
>
> The (maybe) related bugzilla entry is just odd. Bruno Wolff reports
> that the BUG_ON() in his added patch triggers:
>
> + cpumask_clear(sched_group_cpus(sg));
> + sg->sgc->capacity = 0;
> + BUG_ON(!cpumask_empty(sched_group_cpus(sg)));
>
> where it *just* did a cpumask_clear(), and now the BUG_ON() triggers
> that it's no longer empty?
>
> That would imply an allocation error, but all the sched groups seem to
> be properly allocated with the proper addition of cpumask_size().
>
> And his config file even has NR_CPUS being 32, so it should be a
> single word of bitmap, which triggers all the simple code.
>
> Completely insane, in other words.

So we've had this other thread where the same happened:

lkml.kernel.org/r/[email protected]

(pointed Michel to that earlier)

And that seems to be sorted now (just found positive feedback in my
Inbox this morning), it was a question of the arch code supplying
completely 'broken' topology information, and the scheduler trusting it
too much.

The real fix in that thread is:

lkml.kernel.org/r/[email protected]

And I'll also add this to make the scheduler less trusting:

lkml.kernel.org/r/[email protected]

Michel, that's not going to tell us what's wrong with your machine, as
you've not got the ancient dual P4 Xeon Bruno's got. Seeing how your
cpuinfo says:

model name : AMD A10-7850K Radeon R7, 12 Compute Cores 4C+8G

but we can start the same debugging session I suppose.

Could you run with this patch on top:

lkml.kernel.org/r/[email protected]

And provide us with the dmesg after boot?

2014-07-23 08:05:48

by Michel Dänzer

Subject: Re: Random panic in load_balance() with 3.16-rc

On 23.07.2014 15:49, Peter Zijlstra wrote:
> On Tue, Jul 22, 2014 at 09:21:40PM -0700, Linus Torvalds wrote:
>> On Tue, Jul 22, 2014 at 8:53 PM, Michel Dänzer <[email protected]> wrote:
>>>
>>> Just happened again with the same change on top of 3.16-rc6.
>>
>> The (maybe) related bugzilla entry is just odd. Bruno Wolff reports
>> that the BUG_ON() in his added patch triggers:
>>
>> + cpumask_clear(sched_group_cpus(sg));
>> + sg->sgc->capacity = 0;
>> + BUG_ON(!cpumask_empty(sched_group_cpus(sg)));
>>
>> where it *just* did a cpumask_clear(), and now the BUG_ON() triggers
>> that it's no longer empty?
>>
>> That would imply an allocation error, but all the sched groups seem to
>> be properly allocated with the proper addition of cpumask_size().
>>
>> And his config file even has NR_CPUS being 32, so it should be a
>> single word of bitmap, which triggers all the simple code.
>>
>> Completely insane, in other words.
>
> So we've had this other thread where the same happened:
>
> lkml.kernel.org/r/[email protected]
>
> (pointed Michel to that earlier)
>
> And that seems to be sorted now (just found positive feedback in my
> Inbox this morning), it was a question of the arch code supplying
> completely 'broken' topology information, and the scheduler trusting it
> too much.
>
> The real fix in that thread is:
>
> lkml.kernel.org/r/[email protected]
>
> And I'll also add this to make the scheduler less trusting:
>
> lkml.kernel.org/r/[email protected]
>
> Michael, that's not going to tell us what's wrong with your machine, as
> you've not got the ancient dual P4 Xeon Bruno's got. Seeing how your
> cpuinfo says:
>
> model name : AMD A10-7850K Radeon R7, 12 Compute Cores 4C+8G
>
> but we can start the same debugging session I suppose.
>
> Could you run with this patch on top:
>
> lkml.kernel.org/r/[email protected]
>
> And provide us with the dmesg after boot?

Attached. No FAIL messages yet.


--
Earthling Michel Dänzer | http://www.amd.com
Libre software enthusiast | Mesa and X developer


Attachments:
dmesg.txt (81.58 kB)

2014-07-23 08:28:28

by Peter Zijlstra

Subject: Re: Random panic in load_balance() with 3.16-rc

On Wed, Jul 23, 2014 at 05:05:24PM +0900, Michel Dänzer wrote:
> On 23.07.2014 15:49, Peter Zijlstra wrote:
> Attached. No FAIL messages yet.

> [ 0.467570] __sdt_alloc: allocated ffff8802155ea4c0 with cpus:
> [ 0.467574] __sdt_alloc: allocated ffff8802155ea3c0 with cpus:
> [ 0.467576] __sdt_alloc: allocated ffff8802155ea2c0 with cpus:
> [ 0.467577] __sdt_alloc: allocated ffff8802155ea1c0 with cpus:
> [ 0.467582] __sdt_alloc: allocated ffff8802155ea0c0 with cpus:
> [ 0.467589] __sdt_alloc: allocated ffff880215798f40 with cpus:
> [ 0.467591] __sdt_alloc: allocated ffff880215798e40 with cpus:
> [ 0.467593] __sdt_alloc: allocated ffff880215798d40 with cpus:
> [ 0.467599] __sdt_alloc: allocated ffff880215798c40 with cpus:
> [ 0.467600] __sdt_alloc: allocated ffff880215798b40 with cpus:
> [ 0.467602] __sdt_alloc: allocated ffff880215798a40 with cpus:
> [ 0.467604] __sdt_alloc: allocated ffff880215798940 with cpus:
> [ 0.467627] build_sched_domain: cpu: 0 level: SMT cpu_map: 0-3 tl->mask: 0-1
> [ 0.467629] build_sched_domain: cpu: 0 level: MC cpu_map: 0-3 tl->mask: 0-3
> [ 0.467631] build_sched_domain: cpu: 1 level: SMT cpu_map: 0-3 tl->mask: 0-1
> [ 0.467632] build_sched_domain: cpu: 1 level: MC cpu_map: 0-3 tl->mask: 0-3
> [ 0.467634] build_sched_domain: cpu: 2 level: SMT cpu_map: 0-3 tl->mask: 2-3
> [ 0.467635] build_sched_domain: cpu: 2 level: MC cpu_map: 0-3 tl->mask: 0-3
> [ 0.467637] build_sched_domain: cpu: 3 level: SMT cpu_map: 0-3 tl->mask: 2-3
> [ 0.467638] build_sched_domain: cpu: 3 level: MC cpu_map: 0-3 tl->mask: 0-3
> [ 0.467640] build_sched_groups: got group ffff8802155ea4c0 with cpus:
> [ 0.467642] build_sched_groups: got group ffff8802155ea3c0 with cpus:
> [ 0.467643] build_sched_groups: got group ffff8802155ea0c0 with cpus:
> [ 0.467644] build_sched_groups: got group ffff880215798e40 with cpus:
> [ 0.467646] build_sched_groups: got group ffff8802155ea2c0 with cpus:
> [ 0.467647] build_sched_groups: got group ffff8802155ea1c0 with cpus:

Hmm, indeed. And given that I don't see how the cpumask_clear() can make
any difference for you. And your topology information is 'correct'.

Of course, the other thing that patch did is clear sgp->power (now
sgc->capacity). So does adding that back cure things for you?

If it does, we've got to go figure out what's wrong with the sgc
assignments or so.

---
 kernel/sched/core.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7bc599dc4aa4..0c83265cf7c6 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5857,6 +5857,7 @@ build_sched_groups(struct sched_domain *sd, int cpu)
 			continue;
 
 		group = get_group(i, sdd, &sg);
+		sg->sgc->capacity = 0;
 		cpumask_setall(sched_group_mask(sg));
 
 		for_each_cpu(j, span) {
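
One way to read 'correct' for the topology in the dmesg excerpt of this message is that every CPU appears in its own masks, each SMT mask sits inside the MC mask, and masks at the same level are either identical or disjoint. The following is a rough userspace sketch of such a check, with the masks transcribed from the build_sched_domain lines. The toy_* names and the choice of checks are assumptions for illustration; this is not what the scheduler or the robustness patch actually does.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model: one bit per CPU (at most 32 CPUs), bit n set = CPU n in mask. */
struct toy_topology {
    unsigned int smt_mask;  /* tl->mask printed at level SMT */
    unsigned int mc_mask;   /* tl->mask printed at level MC  */
};

/* Returns 1 if the per-CPU masks are mutually consistent, 0 otherwise. */
static int toy_topology_sane(const struct toy_topology *t, size_t ncpus)
{
    for (size_t i = 0; i < ncpus; i++) {
        unsigned int self = 1u << i;

        /* Each CPU must appear in its own masks. */
        if (!(t[i].smt_mask & self) || !(t[i].mc_mask & self))
            return 0;
        /* The SMT siblings must all sit inside the MC span. */
        if (t[i].smt_mask & ~t[i].mc_mask)
            return 0;
        /* Masks at one level must be identical or disjoint across CPUs. */
        for (size_t j = 0; j < ncpus; j++) {
            unsigned int both = t[i].smt_mask & t[j].smt_mask;
            if (both && t[i].smt_mask != t[j].smt_mask)
                return 0;
            both = t[i].mc_mask & t[j].mc_mask;
            if (both && t[i].mc_mask != t[j].mc_mask)
                return 0;
        }
    }
    return 1;
}

/* Masks transcribed from the dmesg lines: SMT 0-1 and 2-3, MC 0-3. */
static const struct toy_topology toy_dmesg_topology[4] = {
    { 0x3, 0xf }, { 0x3, 0xf }, { 0xc, 0xf }, { 0xc, 0xf },
};
```

Under these assumed rules the reported masks pass, which is consistent with the observation that the topology is not the problem on this machine.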

2014-07-23 09:25:42

by Peter Zijlstra

Subject: Re: Random panic in load_balance() with 3.16-rc

On Wed, Jul 23, 2014 at 10:28:19AM +0200, Peter Zijlstra wrote:

> Of course, the other thing that patch did is clear sgp->power (now
> sgc->capacity).

Hmm, re-reading the thread there isn't a clear confirmation it's this
patch at all. Could you perhaps bisect this to either verify it is
indeed that patch we're talking about:

caffcdd8d27b ("sched: Do not zero sg->cpumask and sg->sgp->power in build_sched_groups()")

or find which patch is causing this.

2014-07-23 09:31:36

by Michel Dänzer

Subject: Re: Random panic in load_balance() with 3.16-rc

On 23.07.2014 18:25, Peter Zijlstra wrote:
> On Wed, Jul 23, 2014 at 10:28:19AM +0200, Peter Zijlstra wrote:
>
>> Of course, the other thing that patch did is clear sgp->power (now
>> sgc->capacity).
>
> Hmm, re-reading the thread there isn't a clear confirmation its this
> patch at all. Could you perhaps bisect this to either verify it is
> indeed that patch we're talking about:
>
> caffcdd8d27b ("sched: Do not zero sg->cpumask and sg->sgp->power in build_sched_groups()")
>
> or find which patch is causing this.

It can take a long time for the problem to occur, so I need to run at
least for one or two days to be at least somewhat sure a given kernel is
not affected.

I'll try reproducing the problem with your previous suggestions first,
but if I manage to do that, I guess there's no alternative to bisecting...


--
Earthling Michel Dänzer | http://www.amd.com
Libre software enthusiast | Mesa and X developer

2014-07-23 09:45:48

by Dietmar Eggemann

Subject: Re: Random panic in load_balance() with 3.16-rc

On 23/07/14 10:31, Michel Dänzer wrote:
> On 23.07.2014 18:25, Peter Zijlstra wrote:
>> On Wed, Jul 23, 2014 at 10:28:19AM +0200, Peter Zijlstra wrote:
>>
>>> Of course, the other thing that patch did is clear sgp->power (now
>>> sgc->capacity).
>>
>> Hmm, re-reading the thread there isn't a clear confirmation its this
>> patch at all. Could you perhaps bisect this to either verify it is
>> indeed that patch we're talking about:
>>
>> caffcdd8d27b ("sched: Do not zero sg->cpumask and sg->sgp->power in build_sched_groups()")
>>
>> or find which patch is causing this.
>
> It can take a long time for the problem to occur, so I need to run at
> least for one or two days to be at least somewhat sure a given kernel is
> not affected.

Doesn't the picture showing the captured panic reveal more information?
I haven't seen it myself; I just saw Peter's reply to your email:

https://lkml.org/lkml/2014/7/17/100

>
> I'll try reproducing the problem with your previous suggestions first,
> but if I manage to do that, I guess there's no alternative to bisecting...
>
>

2014-07-23 10:52:35

by Peter Zijlstra

Subject: Re: Random panic in load_balance() with 3.16-rc

On Wed, Jul 23, 2014 at 06:31:26PM +0900, Michel Dänzer wrote:
> On 23.07.2014 18:25, Peter Zijlstra wrote:
> > On Wed, Jul 23, 2014 at 10:28:19AM +0200, Peter Zijlstra wrote:
> >
> >> Of course, the other thing that patch did is clear sgp->power (now
> >> sgc->capacity).
> >
> > Hmm, re-reading the thread there isn't a clear confirmation its this
> > patch at all. Could you perhaps bisect this to either verify it is
> > indeed that patch we're talking about:
> >
> > caffcdd8d27b ("sched: Do not zero sg->cpumask and sg->sgp->power in build_sched_groups()")
> >
> > or find which patch is causing this.
>
> It can take a long time for the problem to occur, so I need to run at
> least for one or two days to be at least somewhat sure a given kernel is
> not affected.

Ah, ok, that's unfortunate :/

2014-07-23 11:11:13

by Peter Zijlstra

Subject: Re: Random panic in load_balance() with 3.16-rc

On Wed, Jul 23, 2014 at 10:45:46AM +0100, Dietmar Eggemann wrote:
> Doesn't the picture showing the captured panic reveal more information.
> Haven't seen it myself, I just saw Peter's reply to your email

It's a general protection fault from somewhere in load_balance(); I sent
you the picture.

It would help to get addr2line of the RIP I suppose.

Michel provided a config, so lemme go try and build that, maybe my gcc
will generate similar code to his and the function offset is enough
clue.


2014-07-23 11:30:25

by Peter Zijlstra

Subject: Re: Random panic in load_balance() with 3.16-rc

On Wed, Jul 23, 2014 at 01:11:10PM +0200, Peter Zijlstra wrote:
> On Wed, Jul 23, 2014 at 10:45:46AM +0100, Dietmar Eggemann wrote:
> > Doesn't the picture showing the captured panic reveal more information.
> > Haven't seen it myself, I just saw Peter's reply to your email
>
> Its a general protection fault from somewhere in load_balance(), I send
> you the picture.
>
> It would help to get addr2line of the RIP I suppose.
>
> Michel provided a config, so lemme go try and build that, maybe my gcc
> will generate similar code to his and the function offset is enough
> clue.

So the code section says the faulting instruction is:

f3 a5

followed by:

48 89 c7 85 50 ff ff

or so.

My compiled code is 'different', the function is shorter, but there's a
f3 a5 somewhere not too far short of +d7 at +a8. I have (objdump -SD):

    35a8:	f3 a5                	rep movsl %ds:(%rsi),%es:(%rdi)

	for_each_cpu_and(i, sched_group_cpus(group), env->cpus) {
		unsigned long capacity, capacity_factor, wl;
		enum fbq_type rt;

		rq = cpu_rq(i);
    35aa:	48 c7 c1 00 00 00 00 	mov    $0x0,%rcx

And that's the only part that could possibly match.

That looks like the start of find_busiest_queue(). I'm not entirely sure
what the rep movsl is operating on, lemme try and figure that out.

2014-07-23 14:25:09

by Peter Zijlstra

Subject: Re: Random panic in load_balance() with 3.16-rc

On Wed, Jul 23, 2014 at 01:30:21PM +0200, Peter Zijlstra wrote:
> On Wed, Jul 23, 2014 at 01:11:10PM +0200, Peter Zijlstra wrote:
> > On Wed, Jul 23, 2014 at 10:45:46AM +0100, Dietmar Eggemann wrote:
> > > Doesn't the picture showing the captured panic reveal more information.
> > > Haven't seen it myself, I just saw Peter's reply to your email
> >
> > Its a general protection fault from somewhere in load_balance(), I send
> > you the picture.
> >
> > It would help to get addr2line of the RIP I suppose.
> >
> > Michel provided a config, so lemme go try and build that, maybe my gcc
> > will generate similar code to his and the function offset is enough
> > clue.
>
> So the code section says the faulting instruction is:
>
> f3 a5
>
> followed by:
>
> 48 89 c7 85 50 ff ff
>
> or so.
>
> My compiled code is 'different', the function is shorter, but there's a
> f3 a5 somewhere not too far short of +d7 at +a8. I have (objdump -SD):
>
> 35a8: f3 a5 rep movsl %ds:(%rsi),%es:(%rdi)
>
> for_each_cpu_and(i, sched_group_cpus(group), env->cpus) {
> unsigned long capacity, capacity_factor, wl;
> enum fbq_type rt;
>
> rq = cpu_rq(i);
> 35aa: 48 c7 c1 00 00 00 00 mov $0x0,%rcx
>
> And that's the only part that could possibly match.
>
> That looks like the start of find_busiest_queue(). I'm not entirely sure
> what the rep movsl is operating on, lemme try and figure that out.

Ah, this appears to be load_balance()'s:

cpumask_copy(cpus, cpu_active_mask);

Which totally doesn't make sense, both src and dst are static storage.
Dst is the most interesting since it's per-cpu storage, but still.

No way either of those should generate a #GP. Puzzled.

2014-07-23 14:38:14

by Michel Dänzer

Subject: Re: Random panic in load_balance() with 3.16-rc

On 23.07.2014 23:24, Peter Zijlstra wrote:
> On Wed, Jul 23, 2014 at 01:30:21PM +0200, Peter Zijlstra wrote:
>> On Wed, Jul 23, 2014 at 01:11:10PM +0200, Peter Zijlstra wrote:
>>> On Wed, Jul 23, 2014 at 10:45:46AM +0100, Dietmar Eggemann wrote:
>>>> Doesn't the picture showing the captured panic reveal more information.
>>>> Haven't seen it myself, I just saw Peter's reply to your email
>>>
>>> Its a general protection fault from somewhere in load_balance(), I send
>>> you the picture.
>>>
>>> It would help to get addr2line of the RIP I suppose.
>>>
>>> Michel provided a config, so lemme go try and build that, maybe my gcc
>>> will generate similar code to his and the function offset is enough
>>> clue.
>>
>> So the code section says the faulting instruction is:
>>
>> f3 a5
>>
>> followed by:
>>
>> 48 89 c7 85 50 ff ff
>>
>> or so.
>>
>> My compiled code is 'different', the function is shorter, but there's a
>> f3 a5 somewhere not too far short of +d7 at +a8. I have (objdump -SD):
>>
>> 35a8: f3 a5 rep movsl %ds:(%rsi),%es:(%rdi)
>>
>> for_each_cpu_and(i, sched_group_cpus(group), env->cpus) {
>> unsigned long capacity, capacity_factor, wl;
>> enum fbq_type rt;
>>
>> rq = cpu_rq(i);
>> 35aa: 48 c7 c1 00 00 00 00 mov $0x0,%rcx
>>
>> And that's the only part that could possibly match.
>>
>> That looks like the start of find_busiest_queue(). I'm not entirely sure
>> what the rep movsl is operating on, lemme try and figure that out.
>
> Ah, this appears to be load_balance()'s:
>
> cpumask_copy(cpus, cpu_active_mask);

Right, according to addr2line it's the memcpy in bitmap_copy().


> Which totally doesn't make sense, both src and dst are static storage.
> Dst is the most interesting since its per-cpu storage, but still.
>
> No way either of those should generate a #GP. Puzzled.

Could it be the memcpy length being off or something like that?


--
Earthling Michel Dänzer | http://www.amd.com
Libre software enthusiast | Mesa and X developer
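
On the length question raised in this message: as far as I understand bitmap_copy(), the byte count handed to memcpy() is derived only from the number of bits, rounded up to whole unsigned longs. The sketch below models that arithmetic in userspace. The toy_* names are hypothetical, and which bit count the kernel in this report actually used is not established in the thread.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical userspace model; names are illustrative, not kernel API. */
#define TOY_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Number of unsigned longs needed to hold nbits bits (rounded up). */
static size_t toy_bits_to_longs(size_t nbits)
{
    return (nbits + TOY_BITS_PER_LONG - 1) / TOY_BITS_PER_LONG;
}

/* Byte length handed to memcpy() for a bitmap of nbits bits. */
static size_t toy_bitmap_copy_len(size_t nbits)
{
    return toy_bits_to_longs(nbits) * sizeof(unsigned long);
}

/* Copy a bitmap in whole words; returns the byte count that was copied. */
static size_t toy_bitmap_copy(unsigned long *dst, const unsigned long *src,
                              size_t nbits)
{
    size_t len = toy_bitmap_copy_len(nbits);

    memcpy(dst, src, len);
    return len;
}

/* How many 4-byte moves a 'rep movsl' would need for that length. */
static size_t toy_movsl_count(size_t nbits)
{
    return toy_bitmap_copy_len(nbits) / 4;
}
```

For a 4-CPU mask this gives one unsigned long, so the copy length itself would only be off if the bit count feeding it were wrong; the source and destination pointers remain the other inputs worth checking in the register dump.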

2014-07-23 15:51:36

by Linus Torvalds

Subject: Re: Random panic in load_balance() with 3.16-rc

On Wed, Jul 23, 2014 at 7:24 AM, Peter Zijlstra <[email protected]> wrote:
>
> No way either of those should generate a #GP. Puzzled.

I haven't seen the full oops, can you forward the screenshot? The
exact register state might give some clues.

Linus

2014-07-24 07:18:59

by Michel Dänzer

Subject: Re: Random panic in load_balance() with 3.16-rc

On 23.07.2014 18:31, Michel Dänzer wrote:
> On 23.07.2014 18:25, Peter Zijlstra wrote:
>> On Wed, Jul 23, 2014 at 10:28:19AM +0200, Peter Zijlstra wrote:
>>
>>> Of course, the other thing that patch did is clear sgp->power (now
>>> sgc->capacity).
>>
>> Hmm, re-reading the thread there isn't a clear confirmation its this
>> patch at all. Could you perhaps bisect this to either verify it is
>> indeed that patch we're talking about:
>>
>> caffcdd8d27b ("sched: Do not zero sg->cpumask and sg->sgp->power in build_sched_groups()")
>>
>> or find which patch is causing this.
>
> It can take a long time for the problem to occur, so I need to run at
> least for one or two days to be at least somewhat sure a given kernel is
> not affected.
>
> I'll try reproducing the problem with your previous suggestions first,

Just happened again, with your robustness patch and setting
sg->sgc->capacity = 0.

> but if I manage to do that, I guess there's no alternative to bisecting...

I hope the assembly output I sent earlier helps, I'm afraid bisecting
this could be painful.


--
Earthling Michel Dänzer | http://www.amd.com
Libre software enthusiast | Mesa and X developer

2014-07-24 07:52:02

by Peter Zijlstra

Subject: Re: Random panic in load_balance() with 3.16-rc

On Thu, Jul 24, 2014 at 04:18:48PM +0900, Michel Dänzer wrote:
> On 23.07.2014 18:31, Michel Dänzer wrote:
> > On 23.07.2014 18:25, Peter Zijlstra wrote:
> >> On Wed, Jul 23, 2014 at 10:28:19AM +0200, Peter Zijlstra wrote:
> >>
> >>> Of course, the other thing that patch did is clear sgp->power (now
> >>> sgc->capacity).
> >>
> >> Hmm, re-reading the thread there isn't a clear confirmation its this
> >> patch at all. Could you perhaps bisect this to either verify it is
> >> indeed that patch we're talking about:
> >>
> >> caffcdd8d27b ("sched: Do not zero sg->cpumask and sg->sgp->power in build_sched_groups()")
> >>
> >> or find which patch is causing this.
> >
> > It can take a long time for the problem to occur, so I need to run at
> > least for one or two days to be at least somewhat sure a given kernel is
> > not affected.
> >
> > I'll try reproducing the problem with your previous suggestions first,
>
> Just happened again, with your robustness patch and setting
> sg->sgc->capacity = 0.

Yeah, that pretty much confirms it's not that patch :/

> > but if I manage to do that, I guess there's no alternative to bisecting...
>
> I hope the assembly output I sent earlier helps, I'm afraid bisecting
> this could be painful.

Yeah, lemme go have a look...

2014-07-24 09:55:28

by Peter Zijlstra

Subject: Re: Random panic in load_balance() with 3.16-rc

On Thu, Jul 24, 2014 at 09:51:57AM +0200, Peter Zijlstra wrote:
> > I hope the assembly output I sent earlier helps, I'm afraid bisecting
> > this could be painful.
>
> Yeah, lemme go have a look...

So I'm not seeing it, the cpus value is kept at -136(%rbp), so
-128(%rbp) comes after and that's struct lb_env env. And -140(%rbp)
comes before and that ends up being @idle.

The compiler likes to spill for sure, but aside from stupid I don't
see it doing wrong in the relatively short code from function start to
the rep movsl.

It does a rep stosl on -128(%rbp) and then fills it out, but none of
that looks to stomp on our -136(%rbp) value. And the -140(%rbp) thing is
only written to once, and while that is done after the 136 thing it's a
single movl and that's not going to clobber anything.

And the fault happens before we pass @env around, so there's no chance
someone writes before it either.

So I'm still entirely clueless..
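
The slot reasoning in this message can be checked with simple interval arithmetic. The sketch below treats each stack slot as a byte range relative to %rbp and tests for overlap. The offsets come from the message; the sizes (4 bytes for the movl target, 8 bytes for the pointer, and an upper bound of 128 bytes for env) are assumptions for x86-64, and the toy_* names are hypothetical.

```c
#include <assert.h>

/* Hypothetical model of a stack slot: byte range [off, off + size) relative to %rbp. */
struct toy_slot {
    long off;   /* offset from %rbp, negative for locals */
    long size;  /* size in bytes */
};

/* Returns 1 if the two byte ranges share at least one byte. */
static int toy_slots_overlap(struct toy_slot a, struct toy_slot b)
{
    return a.off < b.off + b.size && b.off < a.off + a.size;
}

/* Offsets from the message; sizes are assumptions for x86-64. */
static const struct toy_slot toy_idle = { -140, 4 };   /* written with a single movl */
static const struct toy_slot toy_cpus = { -136, 8 };   /* pointer-sized cpus value */
static const struct toy_slot toy_env  = { -128, 128 }; /* upper bound up to %rbp */
```

With these sizes none of the three ranges overlap. A store wider than 4 bytes at -140(%rbp) would reach into the cpus slot, but the message notes that only a single movl is issued there.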