2012-06-14 08:03:29

by Andrei Vagin

Subject: general protection fault on finalizing task

Hello,

I'm developing CRIU (criu.org) and got this GP fault. I have seen it a few
times, always with the same stack trace.
It does not reproduce on 3.4.0-rc4+.

general protection fault: 0000 [#1] SMP
CPU 0
Modules linked in: udp_diag bridge stp llc ipv6 ext4 jbd2 dm_mirror
dm_region_hash dm_log dm_mod pcspkr virtio_balloon 8139too 8139cp mii
i2c_piix4 i2c_core ext3 jbd mbcache virtio_blk virtio_pci virtio_ring
virtio pata_acpi ata_generic ata_piix floppy [last unloaded:
scsi_wait_scan]

Pid: 1647, comm: crtools Not tainted 3.5.0-rc2+ #203 Red Hat KVM
RIP: 0010:[<ffffffff811b453a>] [<ffffffff811b453a>] d_hash_and_lookup+0x2a/0x70
RSP: 0018:ffff88001651bd28 EFLAGS: 00010246
RAX: 0000000000003531 RBX: ffff88001651bd68 RCX: 0000000000000010
RDX: 0000000000000000 RSI: 0000000000000002 RDI: 0000000000003531
RBP: ffff88001651bd38 R08: 000000000000fffa R09: 0000000000000002
R10: 0000000000000000 R11: 000000000000fffd R12: 6b6b6b6b6b6b6b6b
R13: ffff88001a3b3db0 R14: ffff88001651bd68 R15: 000000000000000f
FS: 00007ff80c4a2700(0000) GS:ffff88001f800000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007ff80c4ac000 CR3: 0000000001a0b000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process crtools (pid: 1647, threadinfo ffff88001651a000, task ffff880017154c40)
Stack:
ffff88001651bd78 0000000000000001 ffff88001651bdc8 ffffffff812050c0
ffff8800185b44b0 ffff88001721e4a0 ffff88001721e4a0 0000000f81057b6c
0000000200003531 ffff88001651bd78 ffff880032003531 0000000000000246
Call Trace:
[<ffffffff812050c0>] proc_flush_task+0xa0/0x1e0
[<ffffffff81057c0e>] release_task+0xce/0x690
[<ffffffff81057b6c>] ? release_task+0x2c/0x690
[<ffffffff810622c2>] exit_ptrace+0x102/0x140
[<ffffffff81059c64>] do_exit+0x214/0xa70
[<ffffffff81553cbb>] ? _raw_read_unlock+0x2b/0x50
[<ffffffff8105a51b>] do_group_exit+0x5b/0xd0
[<ffffffff8105a5a7>] sys_exit_group+0x17/0x20
[<ffffffff8155cee9>] system_call_fastpath+0x16/0x1b
Code: 00 55 48 89 e5 48 83 ec 10 48 89 1c 24 4c 89 64 24 08 66 66 66
66 90 48 89 f3 49 89 fc 8b 76 04 48 8b 7b 08 e8 58 0c ff ff 89 03 <41>
f6 04 24 01 75 1f 48 89 de 4c 89 e7 e8 64 ff ff ff 48 8b 1c
RIP [<ffffffff811b453a>] d_hash_and_lookup+0x2a/0x70
RSP <ffff88001651bd28>
---[ end trace 250bb1fa95f4b805 ]---
Fixing recursive fault but reboot is needed!
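
A note on reading the dump above: R12 holds 6b6b6b6b6b6b6b6b, which matches the byte that the kernel's slab debugging writes over freed objects (POISON_FREE, 0x6b, in include/linux/poison.h). The bytes at the <41> marker decode to testb $0x1,(%r12), so the fault appears to be a dereference of R12, and because that value is not a canonical x86-64 address the CPU raises a general protection fault rather than a page fault. Taken together this suggests a use-after-free; that is an inference from the dump, not something the report states. A minimal sketch of both checks, with hypothetical helper names and assuming 48-bit virtual addresses:

```c
#include <stdbool.h>
#include <stdint.h>

/* Byte written over freed slab objects when slab poisoning is enabled. */
#define POISON_FREE 0x6b

/* True if every byte of the register value is the slab free-poison byte. */
bool looks_like_freed_slab(uint64_t reg)
{
	for (int i = 0; i < 8; i++) {
		if (((reg >> (8 * i)) & 0xff) != POISON_FREE)
			return false;
	}
	return true;
}

/*
 * True if addr is canonical under 48-bit virtual addressing:
 * bits 63..47 must be all zeros or all ones.
 */
bool is_canonical_48(uint64_t addr)
{
	uint64_t top = addr >> 47; /* the 17 most significant bits */

	return top == 0 || top == 0x1ffff;
}
```

Applied to the dump, looks_like_freed_slab() is true only for R12, while stack pointers such as RBX pass is_canonical_48() and R12 does not.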

Steps to reproduce:
* # git clone git://github.com/avagin/crtools.git -b gp-3.5
* # cd crtools
* # make && make -C test
* # while :; do bash test/zdtm.sh pidns/static/session00 || break; done
* Wait a few seconds

session00 is a test case that checks that session IDs are restored correctly.
It creates about 10 processes in a separate pidns; some of them wait for
children, and the others wait on a read from a pipe. crtools freezes these
processes, dumps their state, and then kills them.

The bug reproduces when crtools tries to kill the tasks (at that moment
crtools is attached to these tasks via ptrace).
The pseudocode looks like:
for_each_task(pid) {
        kill(pid, SIGKILL);
        ptrace(PTRACE_DETACH, pid, NULL, NULL);
}
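
The loop above can be sketched as compilable C. This is only an illustration under stated assumptions, not the crtools code: the traced tasks here are the tracer's own children (crtools attaches to unrelated tasks in another pidns), no pid namespace is involved, and the helper names are hypothetical. PTRACE_DETACH may fail with ESRCH once the task is already dying, so its result is ignored.

```c
#include <signal.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork one child that sleeps until it is killed. Returns its pid, or -1. */
static pid_t spawn_sleeper(void)
{
	pid_t pid = fork();

	if (pid == 0) {
		for (;;)
			pause();
	}
	return pid;
}

/* Attach to pid and wait for the resulting stop. Returns 0 on success. */
static int attach_task(pid_t pid)
{
	int status;

	if (ptrace(PTRACE_ATTACH, pid, NULL, NULL) < 0)
		return -1;
	if (waitpid(pid, &status, 0) != pid || !WIFSTOPPED(status))
		return -1;
	return 0;
}

/*
 * Spawn n sleepers, attach to each, then run the kill + detach loop.
 * Returns the number of children reaped as killed by SIGKILL, or -1.
 * Error paths do not clean up already spawned children (sketch only).
 */
int kill_traced_tasks(int n)
{
	pid_t pids[64];
	int i, status, killed = 0;

	if (n < 0 || n > 64)
		return -1;

	for (i = 0; i < n; i++) {
		pids[i] = spawn_sleeper();
		if (pids[i] < 0 || attach_task(pids[i]) < 0)
			return -1;
	}

	/* The loop from the report: kill first, then detach. */
	for (i = 0; i < n; i++) {
		kill(pids[i], SIGKILL);
		ptrace(PTRACE_DETACH, pids[i], NULL, NULL); /* result ignored */
	}

	for (i = 0; i < n; i++) {
		if (waitpid(pids[i], &status, 0) == pids[i] &&
		    WIFSIGNALED(status) && WTERMSIG(status) == SIGKILL)
			killed++;
	}
	return killed;
}
```

A call such as kill_traced_tasks(10) returns the number of children that were reaped after dying from SIGKILL. It is not expected to trigger the fault by itself, since the report involves tasks in a separate pidns.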


2012-06-14 16:03:27

by Oleg Nesterov

Subject: Re: general protection fault on finalizing task

Hi Andrey,

On 06/14, Andrey Vagin wrote:
>
> Hello,
>
> I'm developing CRIU (criu.org) and got this GP fault. I have seen it a few
> times, always with the same stack trace.
> It's not reproduced on 3.4.0-rc4+.
>
> general protection fault: 0000 [#1] SMP
> CPU 0
> Modules linked in: udp_diag bridge stp llc ipv6 ext4 jbd2 dm_mirror
> dm_region_hash dm_log dm_mod pcspkr virtio_balloon 8139too 8139cp mii
> i2c_piix4 i2c_core ext3 jbd mbcache virtio_blk virtio_pci virtio_ring
> virtio pata_acpi ata_generic ata_piix floppy [last unloaded:
> scsi_wait_scan]
>
> Pid: 1647, comm: crtools Not tainted 3.5.0-rc2+ #203 Red Hat KVM
> RIP: 0010:[<ffffffff811b453a>] [<ffffffff811b453a>] d_hash_and_lookup+0x2a/0x70

Could you please re-test with these

http://marc.info/?l=linux-mm-commits&m=133962463616232
http://marc.info/?l=linux-mm-commits&m=133962463616231

patches applied?



2012-06-14 20:37:57

by Andrei Vagin

Subject: Re: general protection fault on finalizing task

Oleg, thank you for the response. I'm going to test your patches.

FYI: I bisected this problem.

# git bisect bad
Bisecting: 0 revisions left to test after this (roughly 0 steps)
[3208450488ae724196f1efffc457e4265957c04e] pidns: use task_active_pid_ns in do_notify_parent

commit 3208450488ae724196f1efffc457e4265957c04e
Author: Eric W. Biederman <[email protected]>
Date: Thu May 31 16:26:39 2012 -0700

pidns: use task_active_pid_ns in do_notify_parent

Using task_active_pid_ns is more robust because it works even after we
have called exit_namespaces. This change allows us to have parent
processes that are zombies. Normally a zombie parent processes is crazy
and the last thing you would want to have but in the case of not letting
the init process of a pid namespace be reaped until all of it's children
are dead and reaped a zombie parent process is exactly what we want.

Signed-off-by: Eric W. Biederman <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Pavel Emelyanov <[email protected]>
Cc: Cyrill Gorcunov <[email protected]>
Cc: Louis Rilling <[email protected]>
Cc: Mike Galbraith <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
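
For context on the log above: git bisect narrows a range of commits by binary search, and each "Bisecting: N revisions left to test after this" line names the commit that git has checked out for the next test round; a finished run instead prints "<sha> is the first bad commit". A minimal sketch of the search itself, assuming a linear history indexed from oldest to newest and a hypothetical test callback (git itself works on the full commit graph):

```c
#include <stddef.h>

/* Hypothetical test callback: nonzero if revision `rev` shows the bug. */
typedef int (*is_bad_fn)(size_t rev, void *ctx);

/*
 * `good` is a revision known to be good, `bad` one known to be bad,
 * with good < bad. Returns the index of the first bad revision,
 * assuming every revision after it is bad as well.
 */
size_t first_bad(size_t good, size_t bad, is_bad_fn is_bad, void *ctx)
{
	while (bad - good > 1) {
		size_t mid = good + (bad - good) / 2;

		if (is_bad(mid, ctx))
			bad = mid;
		else
			good = mid;
	}
	return bad;
}

/* Example predicate: everything at or after the index in *ctx is bad. */
int bad_from(size_t rev, void *ctx)
{
	return rev >= *(const size_t *)ctx;
}
```

The search relies on each verdict being correct. With a bug that only shows up after the test has looped for a while, a single wrong verdict moves the reported commit away from the real one.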



2012-06-14 21:05:54

by Andrei Vagin

Subject: Re: general protection fault on finalizing task

>
> Could you please re-test with these
>
> http://marc.info/?l=linux-mm-commits&m=133962463616232
> http://marc.info/?l=linux-mm-commits&m=133962463616231
>
> patches applied?

Yes. They fixed the bug. Thanks.

2012-06-14 22:28:28

by Andrew Morton

Subject: Re: general protection fault on finalizing task

On Fri, 15 Jun 2012 01:05:51 +0400 Andrew Wagin <[email protected]> wrote:

> >
> > Could you please re-test with these
> >
> > http://marc.info/?l=linux-mm-commits&m=133962463616232
> > http://marc.info/?l=linux-mm-commits&m=133962463616231
> >
> > patches applied?
>
> Yes. They fixed the bug. Thanks.

OK, thanks. I didn't actually have those queued for 3.5. Do now.
I'll get them into Linus next week.

2012-06-15 09:44:01

by Oleg Nesterov

Subject: Re: general protection fault on finalizing task

Hi Andrew,

Thanks a lot for testing. I guess we need these fixes in 3.5.

But I am puzzled...

On 06/15, Andrew Wagin wrote:
>
> FYI: I bisected this problem.
>
> # git bisect bad
> Bisecting: 0 revisions left to test after this (roughly 0 steps)
> [3208450488ae724196f1efffc457e4265957c04e] pidns: use task_active_pid_ns in do_notify_parent
>
> commit 3208450488ae724196f1efffc457e4265957c04e
> Author: Eric W. Biederman <[email protected]>
> Date: Thu May 31 16:26:39 2012 -0700
>
> pidns: use task_active_pid_ns in do_notify_parent

Impossible ;) I think. I'd say it should be the next change

00c10bc13cdb58447d6bb2a003afad7bd60f5a5f
"pidns: make killed children autoreap"

which is fine by itself, but makes the problem (hopefully fixed
by -mm patches) more visible.

Oleg.