2013-07-06 00:30:57

by Simon Kirby

[permalink] [raw]
Subject: [3.10] Oopses in kmem_cache_allocate() via prepare_creds()

We saw two Oopses overnight on two separate boxes that seem possibly
related, but both are weird. These boxes typically run btrfs for rsync
snapshot backups (and usually Oops in btrfs ;), but not this time!
backup02 was running 3.10-rc6 plus btrfs-next at the time, and backup03
was running 3.10 release plus btrfs-next from yesterday. Full kern.log
and .config at http://0x.ca/sim/ref/3.10/

backup02's first Oops:

BUG: unable to handle kernel paging request at 0000000100000000
IP: [<ffffffff81124beb>] kmem_cache_alloc+0x4b/0x110
PGD 1f54f7067 PUD 0
Oops: 0000 [#1] SMP
Modules linked in: ipmi_devintf ipmi_si ipmi_msghandler aoe microcode serio_raw bnx2 evdev
CPU: 0 PID: 23112 Comm: ionice Not tainted 3.10.0-rc6-hw+ #46
Hardware name: Dell Inc. PowerEdge 2950/0NH278, BIOS 2.7.0 10/30/2010
task: ffff8802c3f08000 ti: ffff8801b4876000 task.ti: ffff8801b4876000
RIP: 0010:[<ffffffff81124beb>] [<ffffffff81124beb>] kmem_cache_alloc+0x4b/0x110
RSP: 0018:ffff8801b4877e88 EFLAGS: 00010206
RAX: 0000000000000000 RBX: ffff8802c3f08000 RCX: 00000000017f040e
RDX: 00000000017f040d RSI: 00000000000000d0 RDI: ffffffff8107a503
RBP: ffff8801b4877ec8 R08: 0000000000016a80 R09: 0000000000000000
R10: 00007fff025fe120 R11: 0000000000000246 R12: 00000000000000d0
R13: ffff88042d8019c0 R14: 0000000100000000 R15: 00007fc3588ee97f
FS: 0000000000000000(0000) GS:ffff88043fc00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000100000000 CR3: 0000000409d68000 CR4: 00000000000007f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Stack:
ffff8801b4877ed8 ffffffff8112a1bc ffff8800985acd20 ffff8802c3f08000
0000000000000001 00007fc3588ee334 00007fc358af5758 00007fc3588ee97f
ffff8801b4877ee8 ffffffff8107a503 ffff8801b4877ee8 ffffffffffffffea
Call Trace:
[<ffffffff8112a1bc>] ? __fput+0x12c/0x240
[<ffffffff8107a503>] prepare_creds+0x23/0x150
[<ffffffff811272d4>] SyS_faccessat+0x34/0x1f0
[<ffffffff811274a3>] SyS_access+0x13/0x20
[<ffffffff8179e7a9>] system_call_fastpath+0x16/0x1b
Code: 75 f0 4c 89 7d f8 49 8b 4d 00 65 48 03 0c 25 68 da 00 00 48 8b 51 08 4c 8b 31 4d 85 f6 74 5f 49 63 45 20 4d 8b 45 00 48 8d 4a 01 <49> 8b 1c 06 4c 89 f0 65 49 0f c7 08 0f 94 c0 84 c0 74 c8 49 63
RIP [<ffffffff81124beb>] kmem_cache_alloc+0x4b/0x110
RSP <ffff8801b4877e88>
CR2: 0000000100000000
---[ end trace 744477356cd98306 ]---

backup03's first Oops:

BUG: unable to handle kernel paging request at ffff880502efc240
IP: [<ffffffff81124c4b>] kmem_cache_alloc+0x4b/0x110
PGD 1d3a067 PUD 0
Oops: 0000 [#1] SMP
Modules linked in: aoe ipmi_devintf ipmi_msghandler bnx2 microcode serio_raw evdev
CPU: 6 PID: 14066 Comm: perl Not tainted 3.10.0-hw+ #2
Hardware name: Dell Inc. PowerEdge R510/0DPRKF, BIOS 1.11.0 07/23/2012
task: ffff88040111c3b0 ti: ffff8803c23ae000 task.ti: ffff8803c23ae000
RIP: 0010:[<ffffffff81124c4b>] [<ffffffff81124c4b>] kmem_cache_alloc+0x4b/0x110
RSP: 0018:ffff8803c23afd90 EFLAGS: 00010282
RAX: 0000000000000000 RBX: ffff88040111c3b0 RCX: 000000000002a76e
RDX: 000000000002a76d RSI: 00000000000000d0 RDI: ffffffff8107a4e3
RBP: ffff8803c23afdd0 R08: 0000000000016a80 R09: 00000000ffffffff
R10: fffffffffffffffe R11: ffffffffffffffd0 R12: 00000000000000d0
R13: ffff88041d403980 R14: ffff880502efc240 R15: ffff88010e375a40
FS: 00007f2cae496700(0000) GS:ffff88041f2c0000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: ffff880502efc240 CR3: 00000001e0ced000 CR4: 00000000000007e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Stack:
ffff8803c23afe98 ffff8803c23afdb8 ffffffff81133811 ffff88040111c3b0
ffff88010e375a40 0000000001200011 00007f2cae4969d0 ffff88010e375a40
ffff8803c23afdf0 ffffffff8107a4e3 ffffffff81b49b80 0000000001200011
Call Trace:
[<ffffffff81133811>] ? final_putname+0x21/0x50
[<ffffffff8107a4e3>] prepare_creds+0x23/0x150
[<ffffffff8107ab11>] copy_creds+0x31/0x160
[<ffffffff8101a97b>] ? unlazy_fpu+0x9b/0xb0
[<ffffffff8104ef09>] copy_process.part.49+0x239/0x1390
[<ffffffff81143312>] ? __alloc_fd+0x42/0x100
[<ffffffff81050134>] do_fork+0xa4/0x320
[<ffffffff81131b77>] ? __do_pipe_flags+0x77/0xb0
[<ffffffff81143426>] ? __fd_install+0x26/0x60
[<ffffffff81050431>] SyS_clone+0x11/0x20
[<ffffffff817ad849>] stub_clone+0x69/0x90
[<ffffffff817ad569>] ? system_call_fastpath+0x16/0x1b
Code: 75 f0 4c 89 7d f8 49 8b 4d 00 65 48 03 0c 25 68 da 00 00 48 8b 51 08 4c 8b 31 4d 85 f6 74 5f 49 63 45 20 4d 8b 45 00 48 8d 4a 01 <49> 8b 1c 06 4c 89 f0 65 49 0f c7 08 0f 94 c0 84 c0 74 c8 49 63
RIP [<ffffffff81124c4b>] kmem_cache_alloc+0x4b/0x110
RSP <ffff8803c23afd90>
CR2: ffff880502efc240
---[ end trace 956d153150ecc57f ]---

Simon-


2013-07-06 08:27:44

by Pekka Enberg

[permalink] [raw]
Subject: Re: [3.10] Oopses in kmem_cache_allocate() via prepare_creds()

On Sat, Jul 6, 2013 at 3:09 AM, Simon Kirby <[email protected]> wrote:
> We saw two Oopses overnight on two separate boxes that seem possibly
> related, but both are weird. These boxes typically run btrfs for rsync
> snapshot backups (and usually Oops in btrfs ;), but not this time!
> backup02 was running 3.10-rc6 plus btrfs-next at the time, and backup03
> was running 3.10 release plus btrfs-next from yesterday. Full kern.log
> and .config at http://0x.ca/sim/ref/3.10/
>
> backup02's first Oops:
>
> BUG: unable to handle kernel paging request at 0000000100000000
> IP: [<ffffffff81124beb>] kmem_cache_alloc+0x4b/0x110
> PGD 1f54f7067 PUD 0
> Oops: 0000 [#1] SMP
> Modules linked in: ipmi_devintf ipmi_si ipmi_msghandler aoe microcode serio_raw bnx2 evdev
> CPU: 0 PID: 23112 Comm: ionice Not tainted 3.10.0-rc6-hw+ #46
> Hardware name: Dell Inc. PowerEdge 2950/0NH278, BIOS 2.7.0 10/30/2010
> task: ffff8802c3f08000 ti: ffff8801b4876000 task.ti: ffff8801b4876000
> RIP: 0010:[<ffffffff81124beb>] [<ffffffff81124beb>] kmem_cache_alloc+0x4b/0x110
> RSP: 0018:ffff8801b4877e88 EFLAGS: 00010206
> RAX: 0000000000000000 RBX: ffff8802c3f08000 RCX: 00000000017f040e
> RDX: 00000000017f040d RSI: 00000000000000d0 RDI: ffffffff8107a503
> RBP: ffff8801b4877ec8 R08: 0000000000016a80 R09: 0000000000000000
> R10: 00007fff025fe120 R11: 0000000000000246 R12: 00000000000000d0
> R13: ffff88042d8019c0 R14: 0000000100000000 R15: 00007fc3588ee97f
> FS: 0000000000000000(0000) GS:ffff88043fc00000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> CR2: 0000000100000000 CR3: 0000000409d68000 CR4: 00000000000007f0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Stack:
> ffff8801b4877ed8 ffffffff8112a1bc ffff8800985acd20 ffff8802c3f08000
> 0000000000000001 00007fc3588ee334 00007fc358af5758 00007fc3588ee97f
> ffff8801b4877ee8 ffffffff8107a503 ffff8801b4877ee8 ffffffffffffffea
> Call Trace:
> [<ffffffff8112a1bc>] ? __fput+0x12c/0x240
> [<ffffffff8107a503>] prepare_creds+0x23/0x150
> [<ffffffff811272d4>] SyS_faccessat+0x34/0x1f0
> [<ffffffff811274a3>] SyS_access+0x13/0x20
> [<ffffffff8179e7a9>] system_call_fastpath+0x16/0x1b
> Code: 75 f0 4c 89 7d f8 49 8b 4d 00 65 48 03 0c 25 68 da 00 00 48 8b 51 08 4c 8b 31 4d 85 f6 74 5f 49 63 45 20 4d 8b 45 00 48 8d 4a 01 <49> 8b 1c 06 4c 89 f0 65 49 0f c7 08 0f 94 c0 84 c0 74 c8 49 63
> RIP [<ffffffff81124beb>] kmem_cache_alloc+0x4b/0x110
> RSP <ffff8801b4877e88>
> CR2: 0000000100000000
> ---[ end trace 744477356cd98306 ]---
>
> backup03's first Oops:
>
> BUG: unable to handle kernel paging request at ffff880502efc240
> IP: [<ffffffff81124c4b>] kmem_cache_alloc+0x4b/0x110
> PGD 1d3a067 PUD 0
> Oops: 0000 [#1] SMP
> Modules linked in: aoe ipmi_devintf ipmi_msghandler bnx2 microcode serio_raw evdev
> CPU: 6 PID: 14066 Comm: perl Not tainted 3.10.0-hw+ #2
> Hardware name: Dell Inc. PowerEdge R510/0DPRKF, BIOS 1.11.0 07/23/2012
> task: ffff88040111c3b0 ti: ffff8803c23ae000 task.ti: ffff8803c23ae000
> RIP: 0010:[<ffffffff81124c4b>] [<ffffffff81124c4b>] kmem_cache_alloc+0x4b/0x110
> RSP: 0018:ffff8803c23afd90 EFLAGS: 00010282
> RAX: 0000000000000000 RBX: ffff88040111c3b0 RCX: 000000000002a76e
> RDX: 000000000002a76d RSI: 00000000000000d0 RDI: ffffffff8107a4e3
> RBP: ffff8803c23afdd0 R08: 0000000000016a80 R09: 00000000ffffffff
> R10: fffffffffffffffe R11: ffffffffffffffd0 R12: 00000000000000d0
> R13: ffff88041d403980 R14: ffff880502efc240 R15: ffff88010e375a40
> FS: 00007f2cae496700(0000) GS:ffff88041f2c0000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> CR2: ffff880502efc240 CR3: 00000001e0ced000 CR4: 00000000000007e0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Stack:
> ffff8803c23afe98 ffff8803c23afdb8 ffffffff81133811 ffff88040111c3b0
> ffff88010e375a40 0000000001200011 00007f2cae4969d0 ffff88010e375a40
> ffff8803c23afdf0 ffffffff8107a4e3 ffffffff81b49b80 0000000001200011
> Call Trace:
> [<ffffffff81133811>] ? final_putname+0x21/0x50
> [<ffffffff8107a4e3>] prepare_creds+0x23/0x150
> [<ffffffff8107ab11>] copy_creds+0x31/0x160
> [<ffffffff8101a97b>] ? unlazy_fpu+0x9b/0xb0
> [<ffffffff8104ef09>] copy_process.part.49+0x239/0x1390
> [<ffffffff81143312>] ? __alloc_fd+0x42/0x100
> [<ffffffff81050134>] do_fork+0xa4/0x320
> [<ffffffff81131b77>] ? __do_pipe_flags+0x77/0xb0
> [<ffffffff81143426>] ? __fd_install+0x26/0x60
> [<ffffffff81050431>] SyS_clone+0x11/0x20
> [<ffffffff817ad849>] stub_clone+0x69/0x90
> [<ffffffff817ad569>] ? system_call_fastpath+0x16/0x1b
> Code: 75 f0 4c 89 7d f8 49 8b 4d 00 65 48 03 0c 25 68 da 00 00 48 8b 51 08 4c 8b 31 4d 85 f6 74 5f 49 63 45 20 4d 8b 45 00 48 8d 4a 01 <49> 8b 1c 06 4c 89 f0 65 49 0f c7 08 0f 94 c0 84 c0 74 c8 49 63
> RIP [<ffffffff81124c4b>] kmem_cache_alloc+0x4b/0x110
> RSP <ffff8803c23afd90>
> CR2: ffff880502efc240
> ---[ end trace 956d153150ecc57f ]---

Looks like slab corruption to me.

Please try reproducing with "slub_debug" passed as a kernel parameter.
It should give us some more debug output for catching the caller
that's messing up slab.

Btw, there are some btrfs related lockup warnings in the logs so I'm
also CC'ing Chris.

Pekka

2013-08-19 20:36:01

by Simon Kirby

[permalink] [raw]
Subject: Re: [3.10] Oopses in kmem_cache_allocate() via prepare_creds()

On Sat, Jul 06, 2013 at 11:27:38AM +0300, Pekka Enberg wrote:

> On Sat, Jul 6, 2013 at 3:09 AM, Simon Kirby <[email protected]> wrote:
> > We saw two Oopses overnight on two separate boxes that seem possibly
> > related, but both are weird. These boxes typically run btrfs for rsync
> > snapshot backups (and usually Oops in btrfs ;), but not this time!
> > backup02 was running 3.10-rc6 plus btrfs-next at the time, and backup03
> > was running 3.10 release plus btrfs-next from yesterday. Full kern.log
> > and .config at http://0x.ca/sim/ref/3.10/
> >
> > backup02's first Oops:
> >
> > BUG: unable to handle kernel paging request at 0000000100000000
> > IP: [<ffffffff81124beb>] kmem_cache_alloc+0x4b/0x110
> > PGD 1f54f7067 PUD 0
> > Oops: 0000 [#1] SMP
> > Modules linked in: ipmi_devintf ipmi_si ipmi_msghandler aoe microcode serio_raw bnx2 evdev
> > CPU: 0 PID: 23112 Comm: ionice Not tainted 3.10.0-rc6-hw+ #46
> > Hardware name: Dell Inc. PowerEdge 2950/0NH278, BIOS 2.7.0 10/30/2010
> > task: ffff8802c3f08000 ti: ffff8801b4876000 task.ti: ffff8801b4876000
> > RIP: 0010:[<ffffffff81124beb>] [<ffffffff81124beb>] kmem_cache_alloc+0x4b/0x110
> > RSP: 0018:ffff8801b4877e88 EFLAGS: 00010206
> > RAX: 0000000000000000 RBX: ffff8802c3f08000 RCX: 00000000017f040e
> > RDX: 00000000017f040d RSI: 00000000000000d0 RDI: ffffffff8107a503
> > RBP: ffff8801b4877ec8 R08: 0000000000016a80 R09: 0000000000000000
> > R10: 00007fff025fe120 R11: 0000000000000246 R12: 00000000000000d0
> > R13: ffff88042d8019c0 R14: 0000000100000000 R15: 00007fc3588ee97f
> > FS: 0000000000000000(0000) GS:ffff88043fc00000(0000) knlGS:0000000000000000
> > CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> > CR2: 0000000100000000 CR3: 0000000409d68000 CR4: 00000000000007f0
> > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> > Stack:
> > ffff8801b4877ed8 ffffffff8112a1bc ffff8800985acd20 ffff8802c3f08000
> > 0000000000000001 00007fc3588ee334 00007fc358af5758 00007fc3588ee97f
> > ffff8801b4877ee8 ffffffff8107a503 ffff8801b4877ee8 ffffffffffffffea
> > Call Trace:
> > [<ffffffff8112a1bc>] ? __fput+0x12c/0x240
> > [<ffffffff8107a503>] prepare_creds+0x23/0x150
> > [<ffffffff811272d4>] SyS_faccessat+0x34/0x1f0
> > [<ffffffff811274a3>] SyS_access+0x13/0x20
> > [<ffffffff8179e7a9>] system_call_fastpath+0x16/0x1b
> > Code: 75 f0 4c 89 7d f8 49 8b 4d 00 65 48 03 0c 25 68 da 00 00 48 8b 51 08 4c 8b 31 4d 85 f6 74 5f 49 63 45 20 4d 8b 45 00 48 8d 4a 01 <49> 8b 1c 06 4c 89 f0 65 49 0f c7 08 0f 94 c0 84 c0 74 c8 49 63
> > RIP [<ffffffff81124beb>] kmem_cache_alloc+0x4b/0x110
> > RSP <ffff8801b4877e88>
> > CR2: 0000000100000000
> > ---[ end trace 744477356cd98306 ]---
> >
> > backup03's first Oops:
> >
> > BUG: unable to handle kernel paging request at ffff880502efc240
> > IP: [<ffffffff81124c4b>] kmem_cache_alloc+0x4b/0x110
> > PGD 1d3a067 PUD 0
> > Oops: 0000 [#1] SMP
> > Modules linked in: aoe ipmi_devintf ipmi_msghandler bnx2 microcode serio_raw evdev
> > CPU: 6 PID: 14066 Comm: perl Not tainted 3.10.0-hw+ #2
> > Hardware name: Dell Inc. PowerEdge R510/0DPRKF, BIOS 1.11.0 07/23/2012
> > task: ffff88040111c3b0 ti: ffff8803c23ae000 task.ti: ffff8803c23ae000
> > RIP: 0010:[<ffffffff81124c4b>] [<ffffffff81124c4b>] kmem_cache_alloc+0x4b/0x110
> > RSP: 0018:ffff8803c23afd90 EFLAGS: 00010282
> > RAX: 0000000000000000 RBX: ffff88040111c3b0 RCX: 000000000002a76e
> > RDX: 000000000002a76d RSI: 00000000000000d0 RDI: ffffffff8107a4e3
> > RBP: ffff8803c23afdd0 R08: 0000000000016a80 R09: 00000000ffffffff
> > R10: fffffffffffffffe R11: ffffffffffffffd0 R12: 00000000000000d0
> > R13: ffff88041d403980 R14: ffff880502efc240 R15: ffff88010e375a40
> > FS: 00007f2cae496700(0000) GS:ffff88041f2c0000(0000) knlGS:0000000000000000
> > CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> > CR2: ffff880502efc240 CR3: 00000001e0ced000 CR4: 00000000000007e0
> > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> > Stack:
> > ffff8803c23afe98 ffff8803c23afdb8 ffffffff81133811 ffff88040111c3b0
> > ffff88010e375a40 0000000001200011 00007f2cae4969d0 ffff88010e375a40
> > ffff8803c23afdf0 ffffffff8107a4e3 ffffffff81b49b80 0000000001200011
> > Call Trace:
> > [<ffffffff81133811>] ? final_putname+0x21/0x50
> > [<ffffffff8107a4e3>] prepare_creds+0x23/0x150
> > [<ffffffff8107ab11>] copy_creds+0x31/0x160
> > [<ffffffff8101a97b>] ? unlazy_fpu+0x9b/0xb0
> > [<ffffffff8104ef09>] copy_process.part.49+0x239/0x1390
> > [<ffffffff81143312>] ? __alloc_fd+0x42/0x100
> > [<ffffffff81050134>] do_fork+0xa4/0x320
> > [<ffffffff81131b77>] ? __do_pipe_flags+0x77/0xb0
> > [<ffffffff81143426>] ? __fd_install+0x26/0x60
> > [<ffffffff81050431>] SyS_clone+0x11/0x20
> > [<ffffffff817ad849>] stub_clone+0x69/0x90
> > [<ffffffff817ad569>] ? system_call_fastpath+0x16/0x1b
> > Code: 75 f0 4c 89 7d f8 49 8b 4d 00 65 48 03 0c 25 68 da 00 00 48 8b 51 08 4c 8b 31 4d 85 f6 74 5f 49 63 45 20 4d 8b 45 00 48 8d 4a 01 <49> 8b 1c 06 4c 89 f0 65 49 0f c7 08 0f 94 c0 84 c0 74 c8 49 63
> > RIP [<ffffffff81124c4b>] kmem_cache_alloc+0x4b/0x110
> > RSP <ffff8803c23afd90>
> > CR2: ffff880502efc240
> > ---[ end trace 956d153150ecc57f ]---
>
> Looks like slab corruption to me.
>
> Please try reproducing with "slub_debug" passed as a kernel parameter.
> It should give us some more debug output for catching the caller
> that's messing up slab.
>
> Btw, there are some btrfs related lockup warnings in the logs so I'm
> also CC'ing Chris.

So, with slub_debug, we are seeing "Poison overwritten" on two separate
boxes (we have four running roughly the same NFS-to-btrfs snapshot backup
tasks). One does it about weekly, the other has only done it once. The
alloc/free traces are always the same -- always alloc_pipe_info and
free_pipe_info. This is seen on 3.10 and (now) 3.11-rc4:

=============================================================================
BUG kmalloc-192 (Not tainted): Poison overwritten
-----------------------------------------------------------------------------

Disabling lock debugging due to kernel taint
INFO: 0xffff880090f19e7c-0xffff880090f19e7c. First byte 0x6c instead of 0x6b
INFO: Allocated in alloc_pipe_info+0x1f/0xb0 age=15 cpu=6 pid=21914
__slab_alloc.constprop.66+0x35b/0x3a0
kmem_cache_alloc_trace+0xa0/0x100
alloc_pipe_info+0x1f/0xb0
create_pipe_files+0x41/0x1f0
__do_pipe_flags+0x3c/0xb0
SyS_pipe2+0x1b/0xa0
SyS_pipe+0xb/0x10
system_call_fastpath+0x16/0x1b
INFO: Freed in free_pipe_info+0x6a/0x70 age=14 cpu=6 pid=21914
__slab_free+0x2d/0x2df
kfree+0xfd/0x130
free_pipe_info+0x6a/0x70
pipe_release+0x94/0xf0
__fput+0xa7/0x230
____fput+0x9/0x10
task_work_run+0x97/0xd0
do_notify_resume+0x66/0x70
int_signal+0x12/0x17
INFO: Slab 0xffffea000243c600 objects=31 used=31 fp=0x (null) flags=0x4000000000004080
INFO: Object 0xffff880090f19e78 @offset=7800 fp=0xffff880090f1b6d8

Bytes b4 ffff880090f19e68: 11 a2 b0 07 01 00 00 00 5a 5a 5a 5a 5a 5a 5a 5a ........ZZZZZZZZ
Object ffff880090f19e78: 6b 6b 6b 6b 6c 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkklkkkkkkkkkkk
Object ffff880090f19e88: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
Object ffff880090f19e98: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
Object ffff880090f19ea8: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
Object ffff880090f19eb8: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
Object ffff880090f19ec8: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
Object ffff880090f19ed8: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
Object ffff880090f19ee8: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
Object ffff880090f19ef8: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
Object ffff880090f19f08: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
Object ffff880090f19f18: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
Object ffff880090f19f28: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b a5 kkkkkkkkkkkkkkk.
Redzone ffff880090f19f38: bb bb bb bb bb bb bb bb ........
Padding ffff880090f1a078: 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZ
CPU: 6 PID: 21914 Comm: perl Tainted: G B 3.11.0-rc4-hw+ #48
Hardware name: Dell Inc. PowerEdge R510/0DPRKF, BIOS 1.11.0 07/23/2012
ffff880090f19e78 ffff8800a0f03c98 ffffffff817af54c 0000000000000007
ffff88041d404900 ffff8800a0f03cc8 ffffffff81131c89 ffff880090f19e7d
ffff88041d404900 000000000000006b ffff880090f19e78 ffff8800a0f03d18
Call Trace:
[<ffffffff817af54c>] dump_stack+0x46/0x58
[<ffffffff81131c89>] print_trailer+0xf9/0x160
[<ffffffff81131e22>] check_bytes_and_report+0xe2/0x120
[<ffffffff81132027>] check_object+0x1c7/0x240
[<ffffffff8113fd9f>] ? alloc_pipe_info+0x1f/0xb0
[<ffffffff817abaae>] alloc_debug_processing+0x153/0x168
[<ffffffff817abe1e>] __slab_alloc.constprop.66+0x35b/0x3a0
[<ffffffff8113fd9f>] ? alloc_pipe_info+0x1f/0xb0
[<ffffffff811333a0>] kmem_cache_alloc_trace+0xa0/0x100
[<ffffffff8114f26d>] ? inode_init_always+0xed/0x1b0
[<ffffffff8113fd9f>] alloc_pipe_info+0x1f/0xb0
[<ffffffff811402c1>] create_pipe_files+0x41/0x1f0
[<ffffffff811404ac>] __do_pipe_flags+0x3c/0xb0
[<ffffffff81152206>] ? __fd_install+0x26/0x60
[<ffffffff8114057b>] SyS_pipe2+0x1b/0xa0
[<ffffffff8114060b>] SyS_pipe+0xb/0x10
[<ffffffff817bce69>] system_call_fastpath+0x16/0x1b
FIX kmalloc-192: Restoring 0xffff880090f19e7c-0xffff880090f19e7c=0x6b

FIX kmalloc-192: Marking all objects used

This and more traces posted here: http://0x.ca/sim/ref/3.11-rc4/

Is there anything more we should turn on to get more information?
CONFIG_EFENCE? :)

Simon-

Subject: Re: [3.10] Oopses in kmem_cache_allocate() via prepare_creds()

On Mon, 19 Aug 2013, Simon Kirby wrote:

> Object ffff880090f19e78: 6b 6b 6b 6b 6c 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkklkkkkkkkkkkk

This looks like an increment after free in the second 32 bit value of the
structure. First 32 bit value's poison is unchanged.

> CONFIG_EFENCE? :)

I think the developers should have enough information now to fix the
issue. Find the struct definition. Look for the second 32 bit value and
find out where that value is incremented.

2013-08-19 21:16:40

by Linus Torvalds

[permalink] [raw]
Subject: Re: [3.10] Oopses in kmem_cache_allocate() via prepare_creds()

On Mon, Aug 19, 2013 at 1:29 PM, Christoph Lameter <[email protected]> wrote:
> On Mon, 19 Aug 2013, Simon Kirby wrote:
>
>> [... ] The
>> alloc/free traces are always the same -- always alloc_pipe_info and
>> free_pipe_info. This is seen on 3.10 and (now) 3.11-rc4:
>>
>> Object ffff880090f19e78: 6b 6b 6b 6b 6c 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkklkkkkkkkkkkk
>
> This looks like an increment after free in the second 32 bit value of the
> structure. First 32 bit value's poison is unchanged.

Ugh. If that is "struct pipe_inode_info" and I read it right, that's
the "wait_lock" spinlock that is part of the mutex.

Doing a "spin_lock()" could indeed cause an increment operation. But
it still sounds like a very odd case. And even for some wild pointer
I'd then expect the spin_unlock to also happen, and to then increment
the next byte (or word) too. More importantly, for a mutex, I'd expect
the *other* fields to be corrupted too (the "waiter" field etc). That
is, unless we're still spinning waiting for the mutex, but with that
value we shouldn't, as far as I can see.

But it kind of does match at least one of your oopses that you had
before using slab debugging: one of them had a pointer that should
have been NULL that was 0000000100000000. Which again is "increment
the second 32-bit word", and could be explained by the slab entry
being re-used for another allocation that just happened to have a
pointer in the first 8 bytes instead.

And I think the timing is interesting, and there is data to back up
the fact that it is that mutex field: the field was introduced by
commit 72b0d9aacb89 ("pipe: don't use ->i_mutex"), which was merged
into 3.10-rc1. So it matches the timing Simon sees. So while I think
the pipe mutex spinlock field is a bit odd,

Al Viro added to the participants list. Because that
pipe->mutex->mutex_lock corruption doesn't really make sense to me,
but there are certainly interesting coincidences wrt timing.

Simon - it *might* be interesting to do this with DEBUG_PAGEALLOC, and
make the pipe_inode_info allocations use a full page instead of a
kmalloc() in order to trigger that way. So now it uses

pipe = kzalloc(sizeof(struct pipe_inode_info), GFP_KERNEL);

and

kfree(pipe);

in alloc_pipe_info/free_pipe_info respectively, could you make it use

pipe = (void *)get_zeroed_page(GFP_KERNEL);

and

free_page((unsigned long)pipe);

instead respectively, and then enable DEBUG_PAGEALLOC? That *should*
trigger an exception on the actual bad access, if it really is this
pipe_inode_info that is having problems..

Linus

2013-08-19 21:24:47

by Chris Mason

[permalink] [raw]
Subject: Re: [3.10] Oopses in kmem_cache_allocate() via prepare_creds()

Quoting Linus Torvalds (2013-08-19 17:16:36)
> On Mon, Aug 19, 2013 at 1:29 PM, Christoph Lameter <[email protected]> wrote:
> > On Mon, 19 Aug 2013, Simon Kirby wrote:
> >
> >> [... ] The
> >> alloc/free traces are always the same -- always alloc_pipe_info and
> >> free_pipe_info. This is seen on 3.10 and (now) 3.11-rc4:
> >>
> >> Object ffff880090f19e78: 6b 6b 6b 6b 6c 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkklkkkkkkkkkkk
> >
> > This looks like an increment after free in the second 32 bit value of the
> > structure. First 32 bit value's poison is unchanged.
>
> Ugh. If that is "struct pipe_inode_info" and I read it right, that's
> the "wait_lock" spinlock that is part of the mutex.
>
> Doing a "spin_lock()" could indeed cause an increment operation. But
> it still sounds like a very odd case. And even for some wild pointer
> I'd then expect the spin_unlock to also happen, and to then increment
> the next byte (or word) too. More importantly, for a mutex, I'd expect
> the *other* fields to be corrupted too (the "waiter" field etc). That
> is, unless we're still spinning waiting for the mutex, but with that
> value we shouldn't, as far as I can see.
>

Simon, is this box doing btrfs send/receive? If so, it's probably where
this pipe is coming from.

Linus' CONFIG_DEBUG_PAGE_ALLOC suggestions are going to be the fastest
way to find it, I can give you a patch if it'll help.

It would be nice if you could trigger this on plain 3.11-rcX instead of
btrfs-next.

-chris

2013-08-19 23:31:41

by Simon Kirby

[permalink] [raw]
Subject: Re: [3.10] Oopses in kmem_cache_allocate() via prepare_creds()

On Mon, Aug 19, 2013 at 05:24:41PM -0400, Chris Mason wrote:

> Quoting Linus Torvalds (2013-08-19 17:16:36)
> > On Mon, Aug 19, 2013 at 1:29 PM, Christoph Lameter <[email protected]> wrote:
> > > On Mon, 19 Aug 2013, Simon Kirby wrote:
> > >
> > >> [... ] The
> > >> alloc/free traces are always the same -- always alloc_pipe_info and
> > >> free_pipe_info. This is seen on 3.10 and (now) 3.11-rc4:
> > >>
> > >> Object ffff880090f19e78: 6b 6b 6b 6b 6c 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkklkkkkkkkkkkk
> > >
> > > This looks like an increment after free in the second 32 bit value of the
> > > structure. First 32 bit value's poison is unchanged.
> >
> > Ugh. If that is "struct pipe_inode_info" and I read it right, that's
> > the "wait_lock" spinlock that is part of the mutex.
> >
> > Doing a "spin_lock()" could indeed cause an increment operation. But
> > it still sounds like a very odd case. And even for some wild pointer
> > I'd then expect the spin_unlock to also happen, and to then increment
> > the next byte (or word) too. More importantly, for a mutex, I'd expect
> > the *other* fields to be corrupted too (the "waiter" field etc). That
> > is, unless we're still spinning waiting for the mutex, but with that
> > value we shouldn't, as far as I can see.
> >
>
> Simon, is this box doing btrfs send/receive? If so, it's probably where
> this pipe is coming from.

No, not for some time (a few kernel versions ago).

> Linus' CONFIG_DEBUG_PAGE_ALLOC suggestions are going to be the fastest
> way to find it, I can give you a patch if it'll help.

I presume it's just:

diff --git a/fs/pipe.c b/fs/pipe.c
index d2c45e1..30d5b8d 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -780,7 +780,7 @@ struct pipe_inode_info *alloc_pipe_info(void)
{
struct pipe_inode_info *pipe;

- pipe = kzalloc(sizeof(struct pipe_inode_info), GFP_KERNEL);
+ pipe = (void *)get_zeroed_page(GFP_KERNEL);
if (pipe) {
pipe->bufs = kzalloc(sizeof(struct pipe_buffer) * PIPE_DEF_BUFFERS, GFP_KERNEL);
if (pipe->bufs) {
@@ -790,7 +790,7 @@ struct pipe_inode_info *alloc_pipe_info(void)
mutex_init(&pipe->mutex);
return pipe;
}
- kfree(pipe);
+ free_page((unsigned long)pipe);
}

return NULL;
@@ -808,7 +808,7 @@ void free_pipe_info(struct pipe_inode_info *pipe)
if (pipe->tmp_page)
__free_page(pipe->tmp_page);
kfree(pipe->bufs);
- kfree(pipe);
+ free_page((unsigned long)pipe);
}

static struct vfsmount *pipe_mnt __read_mostly;

...and CONFIG_DEBUG_PAGEALLOC enabled.

> It would be nice if you could trigger this on plain 3.11-rcX instead of
> btrfs-next.

On 3.10 it was with some btrfs-next pulled in, but the 3.11-rc4 traces
were from 3.11-rc4 with just some of our local patches:

> git diff --stat v3.11-rc4..master
firmware/Makefile | 4 +-
firmware/bnx2/bnx2-mips-06-6.2.3.fw.ihex | 5804 ++++++++++++++++++++++
firmware/bnx2/bnx2-mips-09-6.2.1b.fw.ihex | 6496 +++++++++++++++++++++++++
kernel/acct.c | 21 +-
net/sunrpc/auth.c | 2 +-
net/sunrpc/clnt.c | 10 +
net/sunrpc/xprt.c | 8 +-
7 files changed, 12335 insertions(+), 10 deletions(-)

None of them look relevant, but I'm building vanilla -rc4 with
CONFIG_DEBUG_PAGEALLOC and the patch above.

Simon-

2013-08-20 04:06:54

by Al Viro

[permalink] [raw]
Subject: Re: [3.10] Oopses in kmem_cache_allocate() via prepare_creds()

On Mon, Aug 19, 2013 at 02:16:36PM -0700, Linus Torvalds wrote:
> On Mon, Aug 19, 2013 at 1:29 PM, Christoph Lameter <[email protected]> wrote:
> > On Mon, 19 Aug 2013, Simon Kirby wrote:
> >
> >> [... ] The
> >> alloc/free traces are always the same -- always alloc_pipe_info and
> >> free_pipe_info. This is seen on 3.10 and (now) 3.11-rc4:
> >>
> >> Object ffff880090f19e78: 6b 6b 6b 6b 6c 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkklkkkkkkkkkkk
> >
> > This looks like an increment after free in the second 32 bit value of the
> > structure. First 32 bit value's poison is unchanged.
>
> Ugh. If that is "struct pipe_inode_info" and I read it right, that's
> the "wait_lock" spinlock that is part of the mutex.
>
> Doing a "spin_lock()" could indeed cause an increment operation. But
> it still sounds like a very odd case. And even for some wild pointer
> I'd then expect the spin_unlock to also happen, and to then increment
> the next byte (or word) too. More importantly, for a mutex, I'd expect
> the *other* fields to be corrupted too (the "waiter" field etc). That
> is, unless we're still spinning waiting for the mutex, but with that
> value we shouldn't, as far as I can see.

Point... I would probably start with CONFIG_DEBUG_MUTEXES and see if
it screams about mutex magic, etc. having been buggered. FWIW, pipe is
neither a FIFO nor an internal per-task one - it's a usual pipe(2) one,
which should've excluded most of the weird codepaths...

2013-08-20 07:17:55

by Ian Applegate

[permalink] [raw]
Subject: Re: [3.10] Oopses in kmem_cache_allocate() via prepare_creds()

We are also seeing this or a similar issue. On a fairly widespread
deployment of 3.10.1 & 3.10.6 this occurred fairly consistently on the
order of 36 days (combined MTBF.)

[28974.739774] ------------[ cut here ]------------
[28974.744980] kernel BUG at mm/slub.c:3352!
[28974.749502] invalid opcode: 0000 [#1] SMP
[28974.754143] Modules linked in: ipmi_devintf ipmi_si ipmi_msghandler
dm_mod md_mod nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter
ip6table_raw ip6_tables nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack
iptable_filter xt_tcpudp xt_CT nf_conntrack xt_multiport iptable_raw
ip_tables x_tables rpcsec_gss_krb5 auth_rpcgss oid_registry nfsv4 fuse
nfsv3 nfs_acl nfs fscache lockd sunrpc bonding tcp_cubic sg sfc(O) mtd
mdio igb dca i2c_algo_bit i2c_core ptp pps_core coretemp kvm_intel kvm
crc32c_intel aesni_intel ablk_helper cryptd lrw gf128mul glue_helper
aes_x86_64 acpi_cpufreq evdev sd_mod crc_t10dif snd_pcm tpm_tis
microcode tpm snd_timer tpm_bios snd soundcore snd_page_alloc pcspkr
ahci libahci uhci_hcd ehci_pci ehci_hcd lpc_ich libata mfd_core
usbcore usb_common hpsa scsi_mod mperf button processor thermal_sys
[28974.835407] CPU: 17 PID: 21400 Comm: nginx-fl Tainted: G
O 3.10.1-cloudflare-trace+ #2
[28974.845307] Hardware name: HP ProLiant DL180 G6 , BIOS O20 01/24/2011
[28974.852653] task: ffff8805bea7b390 ti: ffff8809a5f0e000 task.ti:
ffff8809a5f0e000
[28974.861095] RIP: 0010:[<ffffffff810ebcb2>] [<ffffffff810ebcb2>]
kfree+0x59/0xe7
[28974.869453] RSP: 0018:ffff8809a5f0fe30 EFLAGS: 00010246
[28974.875434] RAX: 0000000000000000 RBX: ffff880b21e79d40 RCX: 0000000000000028
[28974.883458] RDX: 006ffc0000080068 RSI: 000000000001183c RDI: ffffea002c879e40
[28974.891483] RBP: ffffea002c879e40 R08: 0000000000017100 R09: ffffea0009ea0480
[28974.899507] R10: ffff880ae9746250 R11: 0000000000000000 R12: ffffffff81119456
[28974.907533] R13: 000000000000000a R14: ffff8809a5f0ff48 R15: 0000000000013b80
[28974.915558] FS: 00007fb36c115710(0000) GS:ffff880627d60000(0000)
knlGS:0000000000000000
[28974.924681] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[28974.931149] CR2: ffffffffff600400 CR3: 0000000761b93000 CR4: 00000000000007e0
[28974.939173] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[28974.947190] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[28974.955205] Stack:
[28974.957486] ffff880a3e72a700 000000000001183c 0000000000000000
ffffffff81119456
[28974.965873] ffff8809a5f0fe48 0000000000000000 ffff880b21e79d40
000000000000c350
[28974.974253] ffff8809a5f0fec0 ffff8802b4781bc0 0000000000000200
ffffffff811d96eb
[28974.982646] Call Trace:
[28974.985419] [<ffffffff81119456>] ? do_readv_writev+0xfc/0x135
[28974.991987] [<ffffffff811d96eb>] ? ep_poll+0x215/0x286
[28974.997875] [<ffffffff8112ccb4>] ? fget_light+0x2e/0x7c
[28975.003861] [<ffffffff811182fd>] ? fdget+0x16/0x1c
[28975.009359] [<ffffffff8111956c>] ? SyS_readv+0x5a/0xb0
[28975.015245] [<ffffffff811da1ae>] ? SyS_epoll_wait+0x86/0xc1
[28975.021622] [<ffffffff814e4d92>] ? system_call_fastpath+0x16/0x1b
[28975.028578] Code: 48 83 fb 10 0f 86 aa 00 00 00 48 89 df e8 8c de
ff ff 48 89 c7 48 89 c5 e8 b5 d7 ff ff 85 c0 75 22 48 f7 45 00 00 c0
00 00 75 02 <0f> 0b 48 89 ef e8 a8 d7 ff ff 5b 48 89 ef 89 c6 5d 41 5c
e9 9a
[28975.050651] RIP [<ffffffff810ebcb2>] kfree+0x59/0xe7
[28975.056354] RSP <ffff8809a5f0fe30>
[28975.062479] ---[ end trace 29f90372a2c3b0ac ]---

The machine hobbles along until all processes crash at this point.

[14064.855658] =============================================================================
[14064.864884] BUG kmalloc-192 (Tainted: G W ): Poison overwritten
[14064.872424] -----------------------------------------------------------------------------
[14064.872424]
[14064.883322] Disabling lock debugging due to kernel taint
[14064.889304] INFO: 0xffff880930a54824-0xffff880930a54824. First byte
0x6c instead of 0x6b
[14064.898433] INFO: Allocated in alloc_pipe_info+0x17/0x94 age=80
cpu=8 pid=28897
[14064.906684] set_track+0x5c/0xd7
[14064.910332] alloc_debug_processing+0x76/0xfd
[14064.915250] __slab_alloc+0x3e0/0x417
[14064.919386] alloc_inode+0x26/0x6c
[14064.923232] alloc_pipe_info+0x17/0x94
[14064.927467] kmem_cache_alloc_trace+0xbe/0x14e
[14064.932486] alloc_pipe_info+0x17/0x94
[14064.936721] alloc_pipe_info+0x17/0x94
[14064.940956] create_pipe_files+0x3c/0x1e5
[14064.945481] __do_pipe_flags+0x23/0xa7
[14064.949718] SyS_pipe2+0x18/0x86
[14064.953374] system_call_fastpath+0x16/0x1b
[14064.958098] INFO: Freed in pipe_release+0xc4/0xcd age=76 cpu=23 pid=28897
[14064.965740] set_track+0x5c/0xd7
[14064.969389] free_debug_processing+0x11d/0x1a9
[14064.974391] __slab_free+0x32/0x30a
[14064.978334] free_pages_prepare+0x104/0x128
[14064.983053] pipe_release+0xc4/0xcd
[14064.986991] __fput+0xe1/0x1e4
[14064.990443] task_work_run+0x7b/0x8f
[14064.994482] do_notify_resume+0x59/0x68
[14064.998816] int_signal+0x12/0x17
[14065.002561] INFO: Slab 0xffffea0024c29500 objects=31 used=28
fp=0xffff880930a570c0 flags=0x6ffc0000004080
[14065.013341] INFO: Object 0xffff880930a54820 @offset=2080
fp=0xffff880930a55040
[14065.023177] Bytes b4 ffff880930a54810: 4d fa 14 00 01 00 00 00 5a
5a 5a 5a 5a 5a 5a 5a M.......ZZZZZZZZ
[14065.033864] Object ffff880930a54820: 6b 6b 6b 6b 6c 6b 6b 6b 6b 6b
6b 6b 6b 6b 6b 6b kkkklkkkkkkkkkkk
[14065.044354] Object ffff880930a54830: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
[14065.054845] Object ffff880930a54840: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
[14065.065335] Object ffff880930a54850: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
[14065.075827] Object ffff880930a54860: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
[14065.086321] Object ffff880930a54870: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
[14065.096806] Object ffff880930a54880: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
[14065.107298] Object ffff880930a54890: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
[14065.117788] Object ffff880930a548a0: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
[14065.128278] Object ffff880930a548b0: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
[14065.138770] Object ffff880930a548c0: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
[14065.149262] Object ffff880930a548d0: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
6b 6b 6b 6b 6b a5 kkkkkkkkkkkkkkk.
[14065.159751] Redzone ffff880930a548e0: bb bb bb bb bb bb bb bb
........
[14065.169559] Padding ffff880930a54a20: 5a 5a 5a 5a 5a 5a 5a 5a
ZZZZZZZZ
[14065.179372] CPU: 10 PID: 7969 Comm: nginx-cache Tainted: G B W
3.10.6-cloudflare-debug+ #5
[14065.189469] Hardware name: HP ProLiant DL180 G6 , BIOS O20 01/24/2011
[14065.196807] 0000000000000000 ffffffff810ee8d2 ffff880930a54820
ffffffff81827fe7
[14065.205185] ffffea0024c29500 ffff880627804900 ffff880930a54820
ffffea0024c29500
[14065.213576] ffffffff81223ebb ffff880930a54820 ffffea0024c29500
ffffffff810ee9fe
[14065.221965] Call Trace:
[14065.224743] [<ffffffff810ee8d2>] ? check_bytes_and_report+0xa7/0xf4
[14065.231902] [<ffffffff81223ebb>] ? kmem_alloc+0x43/0x6d
[14065.237890] [<ffffffff810ee9fe>] ? check_object+0xdf/0x19e
[14065.244170] [<ffffffff81223e9d>] ? kmem_alloc+0x25/0x6d
[14065.250154] [<ffffffff814e834d>] ? alloc_debug_processing+0x5c/0xfd
[14065.257310] [<ffffffff814e87ce>] ? __slab_alloc+0x3e0/0x417
[14065.263687] [<ffffffff81223e9d>] ? kmem_alloc+0x25/0x6d
[14065.269671] [<ffffffff810efe41>] ? __kmalloc+0x15b/0x191
[14065.275755] [<ffffffff81223e9d>] ? kmem_alloc+0x25/0x6d
[14065.281733] [<ffffffff81223e9d>] ? kmem_alloc+0x25/0x6d
[14065.287709] [<ffffffff8124ce2f>] ? xfs_idata_realloc+0xd5/0xf7
[14065.294378] [<ffffffff810efde0>] ? __kmalloc+0xfa/0x191
[14065.300366] [<ffffffff81223e9d>] ? kmem_alloc+0x25/0x6d
[14065.306354] [<ffffffff81223e9d>] ? kmem_alloc+0x25/0x6d
[14065.312340] [<ffffffff8125c4ab>] ? xfs_log_commit_cil+0xe8/0x3ce
[14065.319202] [<ffffffff812589a1>] ? xfs_trans_commit+0x5c/0x1c5
[14065.325873] [<ffffffff81222ed4>] ? xfs_create+0x404/0x527
[14065.332055] [<ffffffff8121d07d>] ? xfs_vn_mknod+0xb4/0x15e
[14065.338334] [<ffffffff8112d010>] ? vfs_create+0x5f/0x84
[14065.344318] [<ffffffff8112e0f6>] ? do_last+0x54b/0x9b7
[14065.350206] [<ffffffff8112e624>] ? path_openat+0xc2/0x327
[14065.356386] [<ffffffff81123b89>] ? do_readv_writev+0x119/0x135
[14065.363053] [<ffffffff8112eb2c>] ? do_filp_open+0x2a/0x6e
[14065.369232] [<ffffffff8112bda0>] ? getname_flags.part.32+0x22/0x10a
[14065.376386] [<ffffffff8113791b>] ? __alloc_fd+0xcc/0xdb
[14065.382373] [<ffffffff81122803>] ? do_sys_open+0x5c/0xe0
[14065.388457] [<ffffffff814f43d2>] ? system_call_fastpath+0x16/0x1b
[14065.395413] FIX kmalloc-192: Restoring
0xffff880930a54824-0xffff880930a54824=0x6b
[14065.395413]
[14065.405533] FIX kmalloc-192: Marking all objects used

On Mon, Aug 19, 2013 at 9:06 PM, Al Viro <[email protected]> wrote:
> On Mon, Aug 19, 2013 at 02:16:36PM -0700, Linus Torvalds wrote:
>> On Mon, Aug 19, 2013 at 1:29 PM, Christoph Lameter <[email protected]> wrote:
>> > On Mon, 19 Aug 2013, Simon Kirby wrote:
>> >
>> >> [... ] The
>> >> alloc/free traces are always the same -- always alloc_pipe_info and
>> >> free_pipe_info. This is seen on 3.10 and (now) 3.11-rc4:
>> >>
>> >> Object ffff880090f19e78: 6b 6b 6b 6b 6c 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkklkkkkkkkkkkk
>> >
>> > This looks like an increment after free in the second 32 bit value of the
>> > structure. First 32 bit value's poison is unchanged.
>>
>> Ugh. If that is "struct pipe_inode_info" and I read it right, that's
>> the "wait_lock" spinlock that is part of the mutex.
>>
>> Doing a "spin_lock()" could indeed cause an increment operation. But
>> it still sounds like a very odd case. And even for some wild pointer
>> I'd then expect the spin_unlock to also happen, and to then increment
>> the next byte (or word) too. More importantly, for a mutex, I'd expect
>> the *other* fields to be corrupted too (the "waiter" field etc). That
>> is, unless we're still spinning waiting for the mutex, but with that
>> value we shouldn't, as far as I can see.
>
> Point... I would probably start with CONFIG_DEBUG_MUTEXES and see if
> it screams about mutex magic, etc. having been buggered. FWIW, pipe is
> neither a FIFO nor an internal per-task one - it's a usual pipe(2) one,
> which should've excluded most of the weird codepaths...
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

2013-08-20 07:21:50

by Al Viro

[permalink] [raw]
Subject: Re: [3.10] Oopses in kmem_cache_allocate() via prepare_creds()

On Tue, Aug 20, 2013 at 12:17:52AM -0700, Ian Applegate wrote:
> We are also seeing this or a similar issue. On a fairly widespread
> deployment of 3.10.1 & 3.10.6 this occurred fairly consistently on the
> order of 36 days (combined MTBF.)

Do you have any boxen with CONFIG_DEBUG_MUTEXES among those? What
kind of setup do you have on those, BTW?

2013-08-20 07:51:13

by Ian Applegate

[permalink] [raw]
Subject: Re: [3.10] Oopses in kmem_cache_allocate() via prepare_creds()

Unfortunately no boxen with CONFIG_DEBUG_MUTEXES among them. I can
enable on a few and should have some results within the day. These
mainly serve (quite a bit of) HTTP/S cache traffic.

On Tue, Aug 20, 2013 at 12:21 AM, Al Viro <[email protected]> wrote:
> On Tue, Aug 20, 2013 at 12:17:52AM -0700, Ian Applegate wrote:
>> We are also seeing this or a similar issue. On a fairly widespread
>> deployment of 3.10.1 & 3.10.6 this occurred fairly consistently on the
>> order of 36 days (combined MTBF.)
>
> Do you have any boxen with CONFIG_DEBUG_MUTEXES among those? What
> kind of setup do you have on those, BTW?
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

2013-09-03 20:43:11

by Simon Kirby

[permalink] [raw]
Subject: Re: [3.10] Oopses in kmem_cache_allocate() via prepare_creds()

On Mon, Aug 19, 2013 at 04:31:38PM -0700, Simon Kirby wrote:

> On Mon, Aug 19, 2013 at 05:24:41PM -0400, Chris Mason wrote:
>
> > Quoting Linus Torvalds (2013-08-19 17:16:36)
> > > On Mon, Aug 19, 2013 at 1:29 PM, Christoph Lameter <[email protected]> wrote:
> > > > On Mon, 19 Aug 2013, Simon Kirby wrote:
> > > >
> > > >> [... ] The
> > > >> alloc/free traces are always the same -- always alloc_pipe_info and
> > > >> free_pipe_info. This is seen on 3.10 and (now) 3.11-rc4:
> > > >>
> > > >> Object ffff880090f19e78: 6b 6b 6b 6b 6c 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkklkkkkkkkkkkk
> > > >
> > > > This looks like an increment after free in the second 32 bit value of the
> > > > structure. First 32 bit value's poison is unchanged.
> > >
> > > Ugh. If that is "struct pipe_inode_info" and I read it right, that's
> > > the "wait_lock" spinlock that is part of the mutex.
> > >
> > > Doing a "spin_lock()" could indeed cause an increment operation. But
> > > it still sounds like a very odd case. And even for some wild pointer
> > > I'd then expect the spin_unlock to also happen, and to then increment
> > > the next byte (or word) too. More importantly, for a mutex, I'd expect
> > > the *other* fields to be corrupted too (the "waiter" field etc). That
> > > is, unless we're still spinning waiting for the mutex, but with that
> > > value we shouldn't, as far as I can see.
> > >
> >
> > Simon, is this box doing btrfs send/receive? If so, it's probably where
> > this pipe is coming from.
>
> No, not for some time (a few kernel versions ago).
>
> > Linus' CONFIG_DEBUG_PAGE_ALLOC suggestions are going to be the fastest
> > way to find it, I can give you a patch if it'll help.
>
> I presume it's just:
>
> diff --git a/fs/pipe.c b/fs/pipe.c
> index d2c45e1..30d5b8d 100644
> --- a/fs/pipe.c
> +++ b/fs/pipe.c
> @@ -780,7 +780,7 @@ struct pipe_inode_info *alloc_pipe_info(void)
> {
> struct pipe_inode_info *pipe;
>
> - pipe = kzalloc(sizeof(struct pipe_inode_info), GFP_KERNEL);
> + pipe = (void *)get_zeroed_page(GFP_KERNEL);
> if (pipe) {
> pipe->bufs = kzalloc(sizeof(struct pipe_buffer) * PIPE_DEF_BUFFERS, GFP_KERNEL);
> if (pipe->bufs) {
> @@ -790,7 +790,7 @@ struct pipe_inode_info *alloc_pipe_info(void)
> mutex_init(&pipe->mutex);
> return pipe;
> }
> - kfree(pipe);
> + free_page((unsigned long)pipe);
> }
>
> return NULL;
> @@ -808,7 +808,7 @@ void free_pipe_info(struct pipe_inode_info *pipe)
> if (pipe->tmp_page)
> __free_page(pipe->tmp_page);
> kfree(pipe->bufs);
> - kfree(pipe);
> + free_page((unsigned long)pipe);
> }
>
> static struct vfsmount *pipe_mnt __read_mostly;
>
> ...and CONFIG_DEBUG_PAGEALLOC enabled.
>
> > It would be nice if you could trigger this on plain 3.11-rcX instead of
> > btrfs-next.
>
> On 3.10 it was with some btrfs-next pulled in, but the 3.11-rc4 traces
> were from 3.11-rc4 with just some of our local patches:
>
> > git diff --stat v3.11-rc4..master
> firmware/Makefile | 4 +-
> firmware/bnx2/bnx2-mips-06-6.2.3.fw.ihex | 5804 ++++++++++++++++++++++
> firmware/bnx2/bnx2-mips-09-6.2.1b.fw.ihex | 6496 +++++++++++++++++++++++++
> kernel/acct.c | 21 +-
> net/sunrpc/auth.c | 2 +-
> net/sunrpc/clnt.c | 10 +
> net/sunrpc/xprt.c | 8 +-
> 7 files changed, 12335 insertions(+), 10 deletions(-)
>
> None of them look relevant, but I'm building vanilla -rc4 with
> CONFIG_DEBUG_PAGEALLOC and the patch above.

Stock 3.11-rc4 plus the above get_zeroed_page() for pipe allocations has
been running since August 19th on a few btrfs boxes. It has been quiet
until a few days ago, where we hit this:

BUG: soft lockup - CPU#5 stuck for 22s! [btrfs-cleaner:5871]
Modules linked in: ipmi_devintf ipmi_si ipmi_msghandler aoe serio_raw bnx2 evdev
CPU: 5 PID: 5871 Comm: btrfs-cleaner Not tainted 3.11.0-rc4-hw+ #48
Hardware name: Dell Inc. PowerEdge 2950/0NH278, BIOS 2.7.0 10/30/2010
task: ffff8804261117d0 ti: ffff8804120d8000 task.ti: ffff8804120d8000
RIP: 0010:[<ffffffff817b4c8c>] [<ffffffff817b4c8c>] _raw_spin_unlock_irqrestore+0xc/0x20
RSP: 0018:ffff8804120d98b8 EFLAGS: 00000296
RAX: ffff88042d800a80 RBX: ffffffff81130ffc RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000296 RDI: 0000000000000296
RBP: ffff8804120d98b8 R08: ffff8804120d9840 R09: 0000000000000000
R10: ffff8803c6f47400 R11: ffff88042d807180 R12: ffff8803c6f474a8
R13: 00000000000000cc R14: ffffffff8101f6da R15: ffff8804120d9828
FS: 0000000000000000(0000) GS:ffff88043fd40000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: ffffffffff600400 CR3: 00000001a9fc4000 CR4: 00000000000007e0
Stack:
ffff8804120d9958 ffffffff817abea8 ffff8804120d98f8 ffffffff813733d0
00000034ffffffff 0000000000000296 ffff88004edd99d0 0000000000000010
ffff8804120d9918 ffffffff81319211 0000000000000004 ffff88004edd99d0
Call Trace:
[<ffffffff817abea8>] __slab_free+0x45/0x2df
[<ffffffff813733d0>] ? btrfs_set_lock_blocking_rw+0xa0/0xe0
[<ffffffff81319211>] ? btrfs_set_path_blocking+0x31/0x70
[<ffffffff8131f15b>] ? btrfs_search_old_slot+0x75b/0x980
[<ffffffff8139899e>] ? ulist_free+0x1e/0x30
[<ffffffff811336ed>] kfree+0xfd/0x130
[<ffffffff8139899e>] ulist_free+0x1e/0x30
[<ffffffff81396b96>] __resolve_indirect_refs+0x436/0x5b0
[<ffffffff8139757c>] find_parent_nodes+0x47c/0x6c0
[<ffffffff81397851>] btrfs_find_all_roots+0x91/0x100
[<ffffffff8139b7e8>] btrfs_qgroup_account_ref+0x148/0x500
[<ffffffff81328016>] btrfs_delayed_refs_qgroup_accounting+0xa6/0x100
[<ffffffff8132f6f9>] btrfs_run_delayed_refs+0x49/0x5b0
[<ffffffff81328460>] ? walk_down_proc+0x120/0x2d0
[<ffffffff8135e359>] ? free_extent_buffer+0x59/0xb0
[<ffffffff8132c741>] ? walk_up_tree+0xe1/0x1e0
[<ffffffff8133f03a>] btrfs_should_end_transaction+0x4a/0x70
[<ffffffff8132df2b>] btrfs_drop_snapshot+0x3db/0x6a0
[<ffffffff8138e726>] ? btrfs_kill_all_delayed_nodes+0x56/0xf0
[<ffffffff817b4bee>] ? _raw_spin_lock+0xe/0x30
[<ffffffff81340b06>] btrfs_clean_one_deleted_snapshot+0xc6/0x130
[<ffffffff81337ab0>] cleaner_kthread+0x120/0x180
[<ffffffff81337990>] ? transaction_kthread+0x210/0x210
[<ffffffff8107af5b>] kthread+0xbb/0xc0
[<ffffffff81010000>] ? xen_smp_intr_init+0x100/0x220
[<ffffffff8107aea0>] ? kthread_freezable_should_stop+0x60/0x60
[<ffffffff817bcdbc>] ret_from_fork+0x7c/0xb0
[<ffffffff8107aea0>] ? kthread_freezable_should_stop+0x60/0x60
Code: 66 0f c1 07 0f b6 d4 38 c2 74 0d 0f 1f 40 00 f3 90 0f b6 07 38 d0 75 f7 5d c3 0f 1f 44 00 00 55 48 89 e5 80 07 01 48 89 f7 57 9d <0f> 1f 44 00 00 5d c3 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 55

Perhaps something is still be wrong with the btrfs ulist stuff, even with
35f0399db6658f465b00893bdd13b992a0acfef0?

I was thinking of going back to the regular slab pipe allocations and
DEBUG_MUTEXES, but it may not be pipe-related at this point...

Simon-