2008-06-21 19:24:35

by Vegard Nossum

[permalink] [raw]
Subject: v2.6.26-rc7: BUG task_struct: Poison overwritten

Hi,

I don't really know what to do about this. The stacktrace is utterly
uninteresting, too. Maybe you can extract some information from it?
I was running lmbench and hotplug/-unplugging CPUs, some network-
exercising code, and a grep on /sys.

I have the config and vmlinux with debugging symbols if this is needed.

Thanks.


Vegard


=============================================================================
BUG task_struct: Poison overwritten
-----------------------------------------------------------------------------

INFO: 0xeb2f3018-0xeb2f302b. First byte 0x35 instead of 0x6b
INFO: Allocated in copy_process+0x70/0x1090 age=326 cpu=1 pid=2109
INFO: Freed in free_task+0x2c/0x30 age=283 cpu=0 pid=19516
INFO: Slab 0xc1974480 objects=8 used=7 fp=0xeb2f2fd0 flags=0x400020c3
INFO: Object 0xeb2f2fd0 @offset=12240 fp=0x00000000

Bytes b4 0xeb2f2fc0: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZZZZZZZZZ
Object 0xeb2f2fd0: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
Object 0xeb2f2fe0: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
Object 0xeb2f2ff0: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
Object 0xeb2f3000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
Object 0xeb2f3010: 6b 6b 6b 6b 6b 6b 6b 6b 35 cc 4a 6a 1f 08 00 00 kkkkkkkk5<CC>Jj....
Object 0xeb2f3020: 35 cc 4a 6a 6c 6b 6b 6b 6a 6b 6b eb 6b 6b 6b 6b 5<CC>Jjlkkkjkk<EB>kkkk
Object 0xeb2f3030: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
Object 0xeb2f3040: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
Redzone 0xeb2f3f88: bb bb bb bb <BB><BB><BB><BB>
Padding 0xeb2f3fb0: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZZZZZZZZZ
Pid: 1326, comm: udevd Not tainted 2.6.26-rc7 #25
[<c018f3fa>] print_trailer+0xaa/0xe0
[<c018f4cb>] check_bytes_and_report+0x9b/0xc0
[<c018f8b3>] check_object+0x193/0x1f0
[<c019047b>] __slab_alloc+0x44b/0x5d0
[<c01909e3>] kmem_cache_alloc+0xb3/0xe0
[<c01345c0>] ? copy_process+0x70/0x1090
[<c01345c0>] ? copy_process+0x70/0x1090
[<c01345c0>] copy_process+0x70/0x1090
[<c017ff49>] ? remove_vma+0x49/0x60
[<c01356d7>] do_fork+0x57/0x280
[<c0180a8a>] ? do_munmap+0x18a/0x1f0
[<c0180b36>] ? sys_munmap+0x46/0x60
[<c010652f>] sys_clone+0x2f/0x40
[<c010831b>] sysenter_past_esp+0x78/0xd1
=======================
FIX task_struct: Restoring 0xeb2f3018-0xeb2f302b=0x6b

FIX task_struct: Marking all objects used


2008-06-21 19:29:00

by Vegard Nossum

[permalink] [raw]
Subject: Re: v2.6.26-rc7: BUG task_struct: Poison overwritten

Oops, seems there was another one a bit earlier (about 5 minutes) that I
didn't notice. I don't think it helps that much, but here it is:


=============================================================================
BUG task_struct: Poison overwritten
-----------------------------------------------------------------------------

INFO: 0xf53ab018-0xf53ab02b. First byte 0x71 instead of 0x6b
INFO: Allocated in copy_process+0x70/0x1090 age=110 cpu=1 pid=28664
INFO: Freed in free_task+0x2c/0x30 age=68 cpu=0 pid=28667
INFO: Slab 0xc1ba6cc0 objects=8 used=5 fp=0xf53aafd0 flags=0x400020c3
INFO: Object 0xf53aafd0 @offset=12240 fp=0xf53acfb0

Bytes b4 0xf53aafc0: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZZZZZZZZZ
Object 0xf53aafd0: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
Object 0xf53aafe0: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
Object 0xf53aaff0: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
Object 0xf53ab000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
Object 0xf53ab010: 6b 6b 6b 6b 6b 6b 6b 6b 71 19 6f be dd 07 00 00 kkkkkkkkq.o<BE><DD>...
Object 0xf53ab020: 71 19 6f be 6b 6b 6b 6b 6a 6b 6b eb 6b 6b 6b 6b q.o<BE>kkkkjkk<EB>kkkk
Object 0xf53ab030: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
Object 0xf53ab040: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
Redzone 0xf53abf88: bb bb bb bb <BB><BB><BB><BB>

Padding 0xf53abfb0: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZZZZZZZZZ
Pid: 1326, comm: udevd Not tainted 2.6.26-rc7 #25
[<c018f3fa>] print_trailer+0xaa/0xe0
[<c018f4cb>] check_bytes_and_report+0x9b/0xc0
[<c018f8b3>] check_object+0x193/0x1f0
[<c019047b>] __slab_alloc+0x44b/0x5d0
[<c01909e3>] kmem_cache_alloc+0xb3/0xe0
[<c01345c0>] ? copy_process+0x70/0x1090
[<c01345c0>] ? copy_process+0x70/0x1090
[<c01345c0>] copy_process+0x70/0x1090
[<c017ff49>] ? remove_vma+0x49/0x60
[<c01356d7>] do_fork+0x57/0x280
[<c0180a8a>] ? do_munmap+0x18a/0x1f0
[<c0180b36>] ? sys_munmap+0x46/0x60
[<c010652f>] sys_clone+0x2f/0x40
[<c010831b>] sysenter_past_esp+0x78/0xd1
=======================
FIX task_struct: Restoring 0xf53ab018-0xf53ab02b=0x6b

FIX task_struct: Marking all objects used

2008-06-21 20:41:20

by Vegard Nossum

[permalink] [raw]
Subject: Re: v2.6.26-rc7: BUG task_struct: Poison overwritten

On Sat, Jun 21, 2008 at 9:28 PM, Vegard Nossum <[email protected]> wrote:
> Oops, seems there was another one a bit earlier (about 5 minutes) that I
> didn't notice. I don't think it helps that much, but here it is:

I actually got a third one too, but it's similar to the first two.

>
>
> =============================================================================
> BUG task_struct: Poison overwritten
> -----------------------------------------------------------------------------
>
> INFO: 0xf53ab018-0xf53ab02b. First byte 0x71 instead of 0x6b
> INFO: Allocated in copy_process+0x70/0x1090 age=110 cpu=1 pid=28664
> INFO: Freed in free_task+0x2c/0x30 age=68 cpu=0 pid=28667
> INFO: Slab 0xc1ba6cc0 objects=8 used=5 fp=0xf53aafd0 flags=0x400020c3
> INFO: Object 0xf53aafd0 @offset=12240 fp=0xf53acfb0
>
> Bytes b4 0xf53aafc0: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZZZZZZZZZ
> Object 0xf53aafd0: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
> Object 0xf53aafe0: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
> Object 0xf53aaff0: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
> Object 0xf53ab000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
> Object 0xf53ab010: 6b 6b 6b 6b 6b 6b 6b 6b 71 19 6f be dd 07 00 00 kkkkkkkkq.o<BE><DD>...
> Object 0xf53ab020: 71 19 6f be 6b 6b 6b 6b 6a 6b 6b eb 6b 6b 6b 6b q.o<BE>kkkkjkk<EB>kkkk

So what to notice is that this is offset hex(0xf53ab018-0xf53aafc0) =
'0x58L' from the beginning of the object (would be nice to have SLUB
print that too, btw), which corresponds to (struct
task_struct).se.vruntime (the "se" is a struct sched_entity). I'm
putting Ingo and Peter on the Cc.

What I find odd is that only some of the bytes in there are wrong,
take the stray "eb" in the last line (above), for example. And these
variables around offset 0x58 from the struct task struct are all u64s.
Is it possible that the corruption comes from somewhere else?

(Does the number look like a valid vruntime, for example?)

For the record,

$ grep SCHED .config
CONFIG_HAVE_UNSTABLE_SCHED_CLOCK=y
CONFIG_GROUP_SCHED=y
CONFIG_FAIR_GROUP_SCHED=y
CONFIG_RT_GROUP_SCHED=y
CONFIG_USER_SCHED=y
# CONFIG_CGROUP_SCHED is not set
CONFIG_SCHED_NO_NO_OMIT_FRAME_POINTER=y
CONFIG_SCHED_SMT=y
CONFIG_SCHED_MC=y
CONFIG_SCHED_HRTICK=y
CONFIG_SCHED_DEBUG=y
CONFIG_SCHEDSTATS=y


Vegard

--
"The animistic metaphor of the bug that maliciously sneaked in while
the programmer was not looking is intellectually dishonest as it
disguises that the error is the programmer's own creation."
-- E. W. Dijkstra, EWD1036

2008-06-21 21:22:17

by Peter Zijlstra

[permalink] [raw]
Subject: Re: v2.6.26-rc7: BUG task_struct: Poison overwritten

On Sat, 2008-06-21 at 22:41 +0200, Vegard Nossum wrote:
> On Sat, Jun 21, 2008 at 9:28 PM, Vegard Nossum <[email protected]> wrote:
> > Oops, seems there was another one a bit earlier (about 5 minutes) that I
> > didn't notice. I don't think it helps that much, but here it is:
>
> I actually got a third one too, but it's similar to the first two.
>
> >
> >
> > =============================================================================
> > BUG task_struct: Poison overwritten
> > -----------------------------------------------------------------------------
> >
> > INFO: 0xf53ab018-0xf53ab02b. First byte 0x71 instead of 0x6b
> > INFO: Allocated in copy_process+0x70/0x1090 age=110 cpu=1 pid=28664
> > INFO: Freed in free_task+0x2c/0x30 age=68 cpu=0 pid=28667
> > INFO: Slab 0xc1ba6cc0 objects=8 used=5 fp=0xf53aafd0 flags=0x400020c3
> > INFO: Object 0xf53aafd0 @offset=12240 fp=0xf53acfb0
> >
> > Bytes b4 0xf53aafc0: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZZZZZZZZZ
> > Object 0xf53aafd0: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
> > Object 0xf53aafe0: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
> > Object 0xf53aaff0: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
> > Object 0xf53ab000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
> > Object 0xf53ab010: 6b 6b 6b 6b 6b 6b 6b 6b 71 19 6f be dd 07 00 00 kkkkkkkkq.o<BE><DD>...
> > Object 0xf53ab020: 71 19 6f be 6b 6b 6b 6b 6a 6b 6b eb 6b 6b 6b 6b q.o<BE>kkkkjkk<EB>kkkk
>
> So what to notice is that this is offset hex(0xf53ab018-0xf53aafc0) =
> '0x58L' from the beginning of the object (would be nice to have SLUB
> print that too, btw), which corresponds to (struct
> task_struct).se.vruntime (the "se" is a struct sched_entity). I'm
> putting Ingo and Peter on the Cc.
>
> What I find odd is that only some of the bytes in there are wrong,
> take the stray "eb" in the last line (above), for example. And these
> variables around offset 0x58 from the struct task struct are all u64s.
> Is it possible that the corruption comes from somewhere else?
>
> (Does the number look like a valid vruntime, for example?)

Hard to tell, its a virtual timeline so pretty much any value is a valid
value - you'd need a contemporary vruntime from the same cfs_rq to
compare against.

But it looks like there might be some cpu hotplug race with group
scheduling - heiko (s390) and avi (x86_64) reported some cpu hotplug
crashes. We're still looking into those.

> For the record,
>
> $ grep SCHED .config
> CONFIG_HAVE_UNSTABLE_SCHED_CLOCK=y
> CONFIG_GROUP_SCHED=y
> CONFIG_FAIR_GROUP_SCHED=y
> CONFIG_RT_GROUP_SCHED=y
> CONFIG_USER_SCHED=y
> # CONFIG_CGROUP_SCHED is not set
> CONFIG_SCHED_NO_NO_OMIT_FRAME_POINTER=y
> CONFIG_SCHED_SMT=y
> CONFIG_SCHED_MC=y
> CONFIG_SCHED_HRTICK=y
> CONFIG_SCHED_DEBUG=y
> CONFIG_SCHEDSTATS=y



2008-06-21 22:57:17

by Vegard Nossum

[permalink] [raw]
Subject: Re: v2.6.26-rc7: BUG task_struct: Poison overwritten

On Sat, Jun 21, 2008 at 11:21 PM, Peter Zijlstra <[email protected]> wrote:
> But it looks like there might be some cpu hotplug race with group
> scheduling - heiko (s390) and avi (x86_64) reported some cpu hotplug
> crashes. We're still looking into those.

Thanks.

I was poking around in kernel/sched.c and noticed something odd: In
migrate_dead(), we have this code:

/*
* Drop lock around migration; if someone else moves it,
* that's OK. No task can be added to this CPU, so iteration is
* fine.
*/
spin_unlock_irq(&rq->lock);
move_task_off_dead_cpu(dead_cpu, p);
spin_lock_irq(&rq->lock);

which is fine in itself, I guess. But spin_unlock_irq() will enable
interrupts. And move_task_off_dead_cpu() has this comment:

/*
* Figure out where task on dead CPU should go, use force if necessary.
* NOTE: interrupts should be disabled by the caller
*/
static void move_task_off_dead_cpu(int dead_cpu, struct task_struct *p)
{

...but here, interrupts will not be disabled. On the other hand
__migrate_task_irq() (called by move_task_off_dead_cpu()) calls
local_irq_disable() itself... What do you think of this? Is the
comment wrong? Or is there a difference between "interrupts" and
"local_irq"?


Vegard

--
"The animistic metaphor of the bug that maliciously sneaked in while
the programmer was not looking is intellectually dishonest as it
disguises that the error is the programmer's own creation."
-- E. W. Dijkstra, EWD1036

2008-06-22 07:55:57

by Vegard Nossum

[permalink] [raw]
Subject: Re: v2.6.26-rc7: BUG task_struct: Poison overwritten

On Sat, Jun 21, 2008 at 10:41 PM, Vegard Nossum <[email protected]> wrote:
>> INFO: 0xf53ab018-0xf53ab02b. First byte 0x71 instead of 0x6b
>> INFO: Allocated in copy_process+0x70/0x1090 age=110 cpu=1 pid=28664
>> INFO: Freed in free_task+0x2c/0x30 age=68 cpu=0 pid=28667
>> INFO: Slab 0xc1ba6cc0 objects=8 used=5 fp=0xf53aafd0 flags=0x400020c3
>> INFO: Object 0xf53aafd0 @offset=12240 fp=0xf53acfb0
>>
>> Bytes b4 0xf53aafc0: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZZZZZZZZZ
>> Object 0xf53aafd0: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
>> Object 0xf53aafe0: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
>> Object 0xf53aaff0: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
>> Object 0xf53ab000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
>> Object 0xf53ab010: 6b 6b 6b 6b 6b 6b 6b 6b 71 19 6f be dd 07 00 00 kkkkkkkkq.o<BE><DD>...
>> Object 0xf53ab020: 71 19 6f be 6b 6b 6b 6b 6a 6b 6b eb 6b 6b 6b 6b q.o<BE>kkkkjkk<EB>kkkk
>
> So what to notice is that this is offset hex(0xf53ab018-0xf53aafc0) =
> '0x58L' from the beginning of the object (would be nice to have SLUB
> print that too, btw), which corresponds to (struct
> task_struct).se.vruntime (the "se" is a struct sched_entity). I'm
> putting Ingo and Peter on the Cc.

Oops, this calculation is bogus. The start of the object is
0xf53aafd0. This corresponds to 0x48L or .se.exec_start. So it seems
that exec_start, sum_exec_runtime, and vruntime have all been written
after the object was freed.

I guess the high number of 6bs in there is simply because the poison
was read, used for some calculation, then written back with not all
bits modified by the calculation.


Vegard

--
"The animistic metaphor of the bug that maliciously sneaked in while
the programmer was not looking is intellectually dishonest as it
disguises that the error is the programmer's own creation."
-- E. W. Dijkstra, EWD1036

2008-06-23 11:44:22

by Oleg Nesterov

[permalink] [raw]
Subject: Re: v2.6.26-rc7: BUG task_struct: Poison overwritten

On 06/22, Vegard Nossum wrote:
>
> I was poking around in kernel/sched.c and noticed something odd: In
> migrate_dead(), we have this code:
>
> /*
> * Drop lock around migration; if someone else moves it,
> * that's OK. No task can be added to this CPU, so iteration is
> * fine.
> */
> spin_unlock_irq(&rq->lock);
> move_task_off_dead_cpu(dead_cpu, p);
> spin_lock_irq(&rq->lock);
>
> which is fine in itself, I guess. But spin_unlock_irq() will enable
> interrupts. And move_task_off_dead_cpu() has this comment:
>
> /*
> * Figure out where task on dead CPU should go, use force if necessary.
> * NOTE: interrupts should be disabled by the caller
> */
> static void move_task_off_dead_cpu(int dead_cpu, struct task_struct *p)
> {
>
> ...but here, interrupts will not be disabled. On the other hand
> __migrate_task_irq() (called by move_task_off_dead_cpu()) calls
> local_irq_disable() itself... What do you think of this? Is the
> comment wrong? Or is there a difference between "interrupts" and
> "local_irq"?

Yes, the comment is wrong, thanks.

It wasn't updated by

"do CPU_DEAD migrating under read_lock(tasklist) instead of write_lock_irq(tasklist)"
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=f7b4cddcc5aca03e80e357360c9424dfba5056c2

Previously move_task_off_dead_cpu() was called under write_lock_irq(tasklist),
and we can't take tasklist for writing without disabling irqs.

If you don't see other problems, could you make a patch to fix the comment?

Oleg.