2006-05-26 21:39:21

by Dave Jones

[permalink] [raw]
Subject: .17rc5 cfq slab corruption.

Was playing with googles new picasa toy, which hammered the disks
hunting out every image file it could find, when this popped out:

Slab corruption: (Not tainted) start=ffff810012b998c8, len=168
Redzone: 0x5a2cf071/0x5a2cf071.
Last user: [<ffffffff8032c319>](cfq_free_io_context+0x2f/0x74)
090: 10 bd 28 1b 00 81 ff ff 6b 6b 6b 6b 6b 6b 6b 6b
Prev obj: start=ffff810012b99808, len=168
Redzone: 0x5a2cf071/0x5a2cf071.
Last user: [<ffffffff8032c319>](cfq_free_io_context+0x2f/0x74)
000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
Next obj: start=ffff810012b99988, len=168
Redzone: 0x5a2cf071/0x5a2cf071.
Last user: [<ffffffff8032c319>](cfq_free_io_context+0x2f/0x74)
000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b

Dave

--
http://www.codemonkey.org.uk


2006-05-26 23:57:35

by Andrew Morton

[permalink] [raw]
Subject: Re: .17rc5 cfq slab corruption.

Dave Jones <[email protected]> wrote:
>
> Was playing with googles new picasa toy, which hammered the disks
> hunting out every image file it could find, when this popped out:
>
> Slab corruption: (Not tainted) start=ffff810012b998c8, len=168
> Redzone: 0x5a2cf071/0x5a2cf071.
> Last user: [<ffffffff8032c319>](cfq_free_io_context+0x2f/0x74)
> 090: 10 bd 28 1b 00 81 ff ff 6b 6b 6b 6b 6b 6b 6b 6b
> Prev obj: start=ffff810012b99808, len=168
> Redzone: 0x5a2cf071/0x5a2cf071.
> Last user: [<ffffffff8032c319>](cfq_free_io_context+0x2f/0x74)
> 000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
> 010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
> Next obj: start=ffff810012b99988, len=168
> Redzone: 0x5a2cf071/0x5a2cf071.
> Last user: [<ffffffff8032c319>](cfq_free_io_context+0x2f/0x74)
> 000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
> 010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
>

Just to compare notes...

I hit the below a few times last week. That was with the subsystem-trees
part of the mm lineup applied, but it doesn't have much in the way of CFQ
alterations.


audit(1147884480.106:2): selinux=0 auid=4294967295
general protection fault: 0000 [1] ipv6 ppdev autofs4 hidp rfcomm l2cap bluetooth sunrpc dm_mirror dm_mod video sony_acpi button battery asus_acpi ac lp parport_pc parport nvram snd_hda_intel snd_hda_codec i2c_i801 sr_mod snd_seq_dummy i2c_core sg snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device ehci_hcd snd_pcm_oss ohci1394 ieee1394 snd_mixer_oss uhci_hcd snd_pcm snd_timer snd soundcore snd_page_alloc hw_random ext3 jbd ata_piix libata sd_mod scsi_mod
Pid: 15, comm: kblockd/1 Not tainted 2.6.17-rc4-mm1 #3
RIP: 0010:[<ffffffff802f0cc3>] <ffffffff802f0cc3>{cfq_dispatch_requests+816}
RSP: 0018:ffff81012f4dfd48 EFLAGS: 00010087
RAX: ffff81012c89ab58 RBX: ffff81012edd8060 RCX: 00000001009985e6
RDX: 0000000000000001 RSI: ffff81012edd8068 RDI: 0000000000000000
RBP: ffff81012edd8000 R08: ffff81012edd8088 R09: 0000000000000000
R10: ffff81012f91d900 R11: ffff81012edd8000 R12: 0002000500030002
R13: 0000000000000004 R14: 0000000000000000 R15: ffff81012f91d900
FS: 0000000000000000(0000) GS:ffff81012fcb5cc0(0000) knlGS:0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000003052576460 CR3: 0000000119422000 CR4: 00000000000006e0
Process kblockd/1 (pid: 15, threadinfo ffff81012f4de000, task ffff81012f4d9080)
Stack: ffffffff880027a8 0000000000000000 ffff81012f91d800 ffff81012eecb550
ffff81012eecb550 ffff81012eecb748 ffff81012f91d900 ffffffff802e5e51
ffff81012f91da30 ffff8101157030a8
Call Trace: <ffffffff880027a8>{:scsi_mod:scsi_done+0}
<ffffffff802e5e51>{elv_next_request+321} <ffffffff88008174>{:scsi_mod:scsi_request_fn+115}
<ffffffff802f1208>{cfq_kick_queue+0} <ffffffff802f1287>{cfq_kick_queue+127}
<ffffffff8023f269>{run_workqueue+159} <ffffffff8023fc51>{worker_thread+0}
<ffffffff802425b0>{keventd_create_kthread+0} <ffffffff8023fd5a>{worker_thread+265}
<ffffffff8022904d>{default_wake_function+0} <ffffffff802425b0>{keventd_create_kthread+0}
<ffffffff802425b0>{keventd_create_kthread+0} <ffffffff802428df>{kthread+254}
<ffffffff8020a14a>{child_rip+8} <ffffffff802425b0>{keventd_create_kthread+0}
<ffffffff8043fb47>{thread_return+0} <ffffffff802427e1>{kthread+0}
<ffffffff8020a142>{child_rip+0}

Code: 49 8b 54 24 20 83 7b 20 01 48 19 c0 48 83 e0 fc 8b 84 30 e8

2006-05-27 02:56:25

by Dave Jones

[permalink] [raw]
Subject: Re: .17rc5 cfq slab corruption.

On Fri, May 26, 2006 at 05:39:15PM -0400, Dave Jones wrote:
> Was playing with googles new picasa toy, which hammered the disks
> hunting out every image file it could find, when this popped out:
>
> Slab corruption: (Not tainted) start=ffff810012b998c8, len=168
> Redzone: 0x5a2cf071/0x5a2cf071.
> Last user: [<ffffffff8032c319>](cfq_free_io_context+0x2f/0x74)
> 090: 10 bd 28 1b 00 81 ff ff 6b 6b 6b 6b 6b 6b 6b 6b
> Prev obj: start=ffff810012b99808, len=168
> Redzone: 0x5a2cf071/0x5a2cf071.
> Last user: [<ffffffff8032c319>](cfq_free_io_context+0x2f/0x74)
> 000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
> 010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
> Next obj: start=ffff810012b99988, len=168
> Redzone: 0x5a2cf071/0x5a2cf071.
> Last user: [<ffffffff8032c319>](cfq_free_io_context+0x2f/0x74)
> 000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
> 010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b

After a reboot, I just hit this again. This time whilst the box was
mostly idle (just picking up some email via fetchmail)

Slab corruption: (Not tainted) start=ffff81003dcde8c8, len=168
Redzone: 0x5a2cf071/0x5a2cf071.
Last user: [<ffffffff8032bb5f>](cfq_free_io_context+0x2f/0x74)
090: d0 1a 65 3b 00 81 ff ff 6b 6b 6b 6b 6b 6b 6b 6b
Prev obj: start=ffff81003dcde808, len=168
Redzone: 0x170fc2a5/0x170fc2a5.
Last user: [<ffffffff8021ce1c>](cfq_set_request+0x3bb/0x41f)
000: 00 00 00 00 00 00 00 00 01 00 00 00 5a 5a 5a 5a
010: 00 00 00 00 00 00 00 00 08 78 44 3c 00 81 ff ff
Next obj: start=ffff81003dcde988, len=168
Redzone: 0x170fc2a5/0x170fc2a5.
Last user: [<ffffffff8021ce1c>](cfq_set_request+0x3bb/0x41f)
000: 88 e6 cd 3d 00 81 ff ff 00 00 00 00 5a 5a 5a 5a
010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00


What's interesting is the 00 81 ff ff part of the corruption
is there in both cases. Anyone have any clues what this could be?

Dave
--
http://www.codemonkey.org.uk

2006-05-27 03:03:04

by Dave Jones

[permalink] [raw]
Subject: Re: .17rc5 cfq slab corruption.

On Fri, May 26, 2006 at 10:56:23PM -0400, Dave Jones wrote:
> On Fri, May 26, 2006 at 05:39:15PM -0400, Dave Jones wrote:
> > Was playing with googles new picasa toy, which hammered the disks
> > hunting out every image file it could find, when this popped out:
> >
> > Slab corruption: (Not tainted) start=ffff810012b998c8, len=168
> > Redzone: 0x5a2cf071/0x5a2cf071.
> > Last user: [<ffffffff8032c319>](cfq_free_io_context+0x2f/0x74)
> > 090: 10 bd 28 1b 00 81 ff ff 6b 6b 6b 6b 6b 6b 6b 6b
> > Prev obj: start=ffff810012b99808, len=168
> > Redzone: 0x5a2cf071/0x5a2cf071.
> > Last user: [<ffffffff8032c319>](cfq_free_io_context+0x2f/0x74)
> > 000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
> > 010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
> > Next obj: start=ffff810012b99988, len=168
> > Redzone: 0x5a2cf071/0x5a2cf071.
> > Last user: [<ffffffff8032c319>](cfq_free_io_context+0x2f/0x74)
> > 000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
> > 010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
>
> After a reboot, I just hit this again. This time whilst the box was
> mostly idle (just picking up some email via fetchmail)
>
> Slab corruption: (Not tainted) start=ffff81003dcde8c8, len=168
> Redzone: 0x5a2cf071/0x5a2cf071.
> Last user: [<ffffffff8032bb5f>](cfq_free_io_context+0x2f/0x74)
> 090: d0 1a 65 3b 00 81 ff ff 6b 6b 6b 6b 6b 6b 6b 6b
> Prev obj: start=ffff81003dcde808, len=168
> Redzone: 0x170fc2a5/0x170fc2a5.
> Last user: [<ffffffff8021ce1c>](cfq_set_request+0x3bb/0x41f)
> 000: 00 00 00 00 00 00 00 00 01 00 00 00 5a 5a 5a 5a
> 010: 00 00 00 00 00 00 00 00 08 78 44 3c 00 81 ff ff
> Next obj: start=ffff81003dcde988, len=168
> Redzone: 0x170fc2a5/0x170fc2a5.
> Last user: [<ffffffff8021ce1c>](cfq_set_request+0x3bb/0x41f)
> 000: 88 e6 cd 3d 00 81 ff ff 00 00 00 00 5a 5a 5a 5a
> 010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>
>
> What's interesting is the 00 81 ff ff part of the corruption
> is there in both cases. Anyone have any clues what this could be?

<sudden realisation> oh, it's high 32 bits of the address, duh.
Back to head-scratching.

Dave

--
http://www.codemonkey.org.uk

2006-05-27 07:07:37

by Jens Axboe

[permalink] [raw]
Subject: Re: .17rc5 cfq slab corruption.

On Fri, May 26 2006, Andrew Morton wrote:
> Dave Jones <[email protected]> wrote:
> >
> > Was playing with googles new picasa toy, which hammered the disks
> > hunting out every image file it could find, when this popped out:
> >
> > Slab corruption: (Not tainted) start=ffff810012b998c8, len=168
> > Redzone: 0x5a2cf071/0x5a2cf071.
> > Last user: [<ffffffff8032c319>](cfq_free_io_context+0x2f/0x74)
> > 090: 10 bd 28 1b 00 81 ff ff 6b 6b 6b 6b 6b 6b 6b 6b
> > Prev obj: start=ffff810012b99808, len=168
> > Redzone: 0x5a2cf071/0x5a2cf071.
> > Last user: [<ffffffff8032c319>](cfq_free_io_context+0x2f/0x74)
> > 000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
> > 010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
> > Next obj: start=ffff810012b99988, len=168
> > Redzone: 0x5a2cf071/0x5a2cf071.
> > Last user: [<ffffffff8032c319>](cfq_free_io_context+0x2f/0x74)
> > 000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
> > 010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
> >
>
> Just to compare notes...
>
> I hit the below a few times last week. That was with the subsystem-trees
> part of the mm lineup applied, but it doesn't have much in the way of CFQ
> alterations.
>
>
> audit(1147884480.106:2): selinux=0 auid=4294967295
> general protection fault: 0000 [1] ipv6 ppdev autofs4 hidp rfcomm l2cap bluetooth sunrpc dm_mirror dm_mod video sony_acpi button battery asus_acpi ac lp parport_pc parport nvram snd_hda_intel snd_hda_codec i2c_i801 sr_mod snd_seq_dummy i2c_core sg snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device ehci_hcd snd_pcm_oss ohci1394 ieee1394 snd_mixer_oss uhci_hcd snd_pcm snd_timer snd soundcore snd_page_alloc hw_random ext3 jbd ata_piix libata sd_mod scsi_mod
> Pid: 15, comm: kblockd/1 Not tainted 2.6.17-rc4-mm1 #3
> RIP: 0010:[<ffffffff802f0cc3>] <ffffffff802f0cc3>{cfq_dispatch_requests+816}
> RSP: 0018:ffff81012f4dfd48 EFLAGS: 00010087
> RAX: ffff81012c89ab58 RBX: ffff81012edd8060 RCX: 00000001009985e6
> RDX: 0000000000000001 RSI: ffff81012edd8068 RDI: 0000000000000000
> RBP: ffff81012edd8000 R08: ffff81012edd8088 R09: 0000000000000000
> R10: ffff81012f91d900 R11: ffff81012edd8000 R12: 0002000500030002
> R13: 0000000000000004 R14: 0000000000000000 R15: ffff81012f91d900
> FS: 0000000000000000(0000) GS:ffff81012fcb5cc0(0000) knlGS:0000000000000000
> CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
> CR2: 0000003052576460 CR3: 0000000119422000 CR4: 00000000000006e0
> Process kblockd/1 (pid: 15, threadinfo ffff81012f4de000, task ffff81012f4d9080)
> Stack: ffffffff880027a8 0000000000000000 ffff81012f91d800 ffff81012eecb550
> ffff81012eecb550 ffff81012eecb748 ffff81012f91d900 ffffffff802e5e51
> ffff81012f91da30 ffff8101157030a8
> Call Trace: <ffffffff880027a8>{:scsi_mod:scsi_done+0}
> <ffffffff802e5e51>{elv_next_request+321} <ffffffff88008174>{:scsi_mod:scsi_request_fn+115}
> <ffffffff802f1208>{cfq_kick_queue+0} <ffffffff802f1287>{cfq_kick_queue+127}
> <ffffffff8023f269>{run_workqueue+159} <ffffffff8023fc51>{worker_thread+0}
> <ffffffff802425b0>{keventd_create_kthread+0} <ffffffff8023fd5a>{worker_thread+265}
> <ffffffff8022904d>{default_wake_function+0} <ffffffff802425b0>{keventd_create_kthread+0}
> <ffffffff802425b0>{keventd_create_kthread+0} <ffffffff802428df>{kthread+254}
> <ffffffff8020a14a>{child_rip+8} <ffffffff802425b0>{keventd_create_kthread+0}
> <ffffffff8043fb47>{thread_return+0} <ffffffff802427e1>{kthread+0}
> <ffffffff8020a142>{child_rip+0}
>
> Code: 49 8b 54 24 20 83 7b 20 01 48 19 c0 48 83 e0 fc 8b 84 30 e8

Pretty baffling... cfq has been hammered pretty thoroughly over the
last months and _nothing_ has shown up except some performance anomalies
that are now fixed. Since daves case (at least) seems to be
use-after-free, I'll see if I can reproduce with some contrived case.
I'm asuming that picasa forks and exits a lot with submitted io in
between than may not have finished at exit.

--
Jens Axboe

2006-05-27 13:31:34

by Dave Jones

[permalink] [raw]
Subject: Re: .17rc5 cfq slab corruption.

On Sat, May 27, 2006 at 09:07:24AM +0200, Jens Axboe wrote:
> On Fri, May 26 2006, Andrew Morton wrote:
> > Dave Jones <[email protected]> wrote:
> > >
> > > Was playing with googles new picasa toy, which hammered the disks
> > > hunting out every image file it could find, when this popped out:
> > >
> > > Slab corruption: (Not tainted) start=ffff810012b998c8, len=168
> > > Redzone: 0x5a2cf071/0x5a2cf071.
> > > Last user: [<ffffffff8032c319>](cfq_free_io_context+0x2f/0x74)
> > > 090: 10 bd 28 1b 00 81 ff ff 6b 6b 6b 6b 6b 6b 6b 6b
> > > Prev obj: start=ffff810012b99808, len=168
> > > Redzone: 0x5a2cf071/0x5a2cf071.
> > > Last user: [<ffffffff8032c319>](cfq_free_io_context+0x2f/0x74)
> > > 000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
> > > 010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
> > > Next obj: start=ffff810012b99988, len=168
> > > Redzone: 0x5a2cf071/0x5a2cf071.
> > > Last user: [<ffffffff8032c319>](cfq_free_io_context+0x2f/0x74)
> > > 000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
> > > 010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
>
> Pretty baffling... cfq has been hammered pretty thoroughly over the
> last months and _nothing_ has shown up except some performance anomalies
> that are now fixed. Since daves case (at least) seems to be
> use-after-free, I'll see if I can reproduce with some contrived case.
> I'm asuming that picasa forks and exits a lot with submitted io in
> between than may not have finished at exit.

The second time I hit it, was actually during boot up.

Dave

--
http://www.codemonkey.org.uk

2006-05-27 14:52:44

by Jens Axboe

[permalink] [raw]
Subject: Re: .17rc5 cfq slab corruption.

On Sat, May 27 2006, Dave Jones wrote:
> On Sat, May 27, 2006 at 09:07:24AM +0200, Jens Axboe wrote:
> > On Fri, May 26 2006, Andrew Morton wrote:
> > > Dave Jones <[email protected]> wrote:
> > > >
> > > > Was playing with googles new picasa toy, which hammered the disks
> > > > hunting out every image file it could find, when this popped out:
> > > >
> > > > Slab corruption: (Not tainted) start=ffff810012b998c8, len=168
> > > > Redzone: 0x5a2cf071/0x5a2cf071.
> > > > Last user: [<ffffffff8032c319>](cfq_free_io_context+0x2f/0x74)
> > > > 090: 10 bd 28 1b 00 81 ff ff 6b 6b 6b 6b 6b 6b 6b 6b
> > > > Prev obj: start=ffff810012b99808, len=168
> > > > Redzone: 0x5a2cf071/0x5a2cf071.
> > > > Last user: [<ffffffff8032c319>](cfq_free_io_context+0x2f/0x74)
> > > > 000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
> > > > 010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
> > > > Next obj: start=ffff810012b99988, len=168
> > > > Redzone: 0x5a2cf071/0x5a2cf071.
> > > > Last user: [<ffffffff8032c319>](cfq_free_io_context+0x2f/0x74)
> > > > 000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
> > > > 010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
> >
> > Pretty baffling... cfq has been hammered pretty thoroughly over the
> > last months and _nothing_ has shown up except some performance anomalies
> > that are now fixed. Since daves case (at least) seems to be
> > use-after-free, I'll see if I can reproduce with some contrived case.
> > I'm asuming that picasa forks and exits a lot with submitted io in
> > between than may not have finished at exit.
>
> The second time I hit it, was actually during boot up.

Boot up sees a lot of that same pattern.

--
Jens Axboe

2006-05-30 13:15:33

by Jens Axboe

[permalink] [raw]
Subject: Re: .17rc5 cfq slab corruption.

On Sat, May 27 2006, Dave Jones wrote:
> On Sat, May 27, 2006 at 09:07:24AM +0200, Jens Axboe wrote:
> > On Fri, May 26 2006, Andrew Morton wrote:
> > > Dave Jones <[email protected]> wrote:
> > > >
> > > > Was playing with googles new picasa toy, which hammered the disks
> > > > hunting out every image file it could find, when this popped out:
> > > >
> > > > Slab corruption: (Not tainted) start=ffff810012b998c8, len=168
> > > > Redzone: 0x5a2cf071/0x5a2cf071.
> > > > Last user: [<ffffffff8032c319>](cfq_free_io_context+0x2f/0x74)
> > > > 090: 10 bd 28 1b 00 81 ff ff 6b 6b 6b 6b 6b 6b 6b 6b
> > > > Prev obj: start=ffff810012b99808, len=168
> > > > Redzone: 0x5a2cf071/0x5a2cf071.
> > > > Last user: [<ffffffff8032c319>](cfq_free_io_context+0x2f/0x74)
> > > > 000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
> > > > 010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
> > > > Next obj: start=ffff810012b99988, len=168
> > > > Redzone: 0x5a2cf071/0x5a2cf071.
> > > > Last user: [<ffffffff8032c319>](cfq_free_io_context+0x2f/0x74)
> > > > 000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
> > > > 010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
> >
> > Pretty baffling... cfq has been hammered pretty thoroughly over the
> > last months and _nothing_ has shown up except some performance anomalies
> > that are now fixed. Since daves case (at least) seems to be
> > use-after-free, I'll see if I can reproduce with some contrived case.
> > I'm asuming that picasa forks and exits a lot with submitted io in
> > between than may not have finished at exit.
>
> The second time I hit it, was actually during boot up.

Dave, do you have any io scheduler switching going on?

--
Jens Axboe

2006-05-30 13:45:00

by Dave Jones

[permalink] [raw]
Subject: Re: .17rc5 cfq slab corruption.

On Tue, May 30, 2006 at 03:17:28PM +0200, Jens Axboe wrote:
> On Sat, May 27 2006, Dave Jones wrote:
> > On Sat, May 27, 2006 at 09:07:24AM +0200, Jens Axboe wrote:
> > > On Fri, May 26 2006, Andrew Morton wrote:
> > > > Dave Jones <[email protected]> wrote:
> > > > >
> > > > > Was playing with googles new picasa toy, which hammered the disks
> > > > > hunting out every image file it could find, when this popped out:
> > > > >
> > > > > Slab corruption: (Not tainted) start=ffff810012b998c8, len=168
> > > > > Redzone: 0x5a2cf071/0x5a2cf071.
> > > > > Last user: [<ffffffff8032c319>](cfq_free_io_context+0x2f/0x74)
> > > > > 090: 10 bd 28 1b 00 81 ff ff 6b 6b 6b 6b 6b 6b 6b 6b
> > > > > Prev obj: start=ffff810012b99808, len=168
> > > > > Redzone: 0x5a2cf071/0x5a2cf071.
> > > > > Last user: [<ffffffff8032c319>](cfq_free_io_context+0x2f/0x74)
> > > > > 000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
> > > > > 010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
> > > > > Next obj: start=ffff810012b99988, len=168
> > > > > Redzone: 0x5a2cf071/0x5a2cf071.
> > > > > Last user: [<ffffffff8032c319>](cfq_free_io_context+0x2f/0x74)
> > > > > 000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
> > > > > 010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
> > >
> > > Pretty baffling... cfq has been hammered pretty thoroughly over the
> > > last months and _nothing_ has shown up except some performance anomalies
> > > that are now fixed. Since daves case (at least) seems to be
> > > use-after-free, I'll see if I can reproduce with some contrived case.
> > > I'm asuming that picasa forks and exits a lot with submitted io in
> > > between than may not have finished at exit.
> >
> > The second time I hit it, was actually during boot up.
>
> Dave, do you have any io scheduler switching going on?

Nope, everything set to use CFQ as default, and left that way.

Dave

--
http://www.codemonkey.org.uk

2006-05-30 13:49:13

by Jens Axboe

[permalink] [raw]
Subject: Re: .17rc5 cfq slab corruption.

On Tue, May 30 2006, Dave Jones wrote:
> On Tue, May 30, 2006 at 03:17:28PM +0200, Jens Axboe wrote:
> > On Sat, May 27 2006, Dave Jones wrote:
> > > On Sat, May 27, 2006 at 09:07:24AM +0200, Jens Axboe wrote:
> > > > On Fri, May 26 2006, Andrew Morton wrote:
> > > > > Dave Jones <[email protected]> wrote:
> > > > > >
> > > > > > Was playing with googles new picasa toy, which hammered the disks
> > > > > > hunting out every image file it could find, when this popped out:
> > > > > >
> > > > > > Slab corruption: (Not tainted) start=ffff810012b998c8, len=168
> > > > > > Redzone: 0x5a2cf071/0x5a2cf071.
> > > > > > Last user: [<ffffffff8032c319>](cfq_free_io_context+0x2f/0x74)
> > > > > > 090: 10 bd 28 1b 00 81 ff ff 6b 6b 6b 6b 6b 6b 6b 6b
> > > > > > Prev obj: start=ffff810012b99808, len=168
> > > > > > Redzone: 0x5a2cf071/0x5a2cf071.
> > > > > > Last user: [<ffffffff8032c319>](cfq_free_io_context+0x2f/0x74)
> > > > > > 000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
> > > > > > 010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
> > > > > > Next obj: start=ffff810012b99988, len=168
> > > > > > Redzone: 0x5a2cf071/0x5a2cf071.
> > > > > > Last user: [<ffffffff8032c319>](cfq_free_io_context+0x2f/0x74)
> > > > > > 000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
> > > > > > 010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
> > > >
> > > > Pretty baffling... cfq has been hammered pretty thoroughly over the
> > > > last months and _nothing_ has shown up except some performance anomalies
> > > > that are now fixed. Since daves case (at least) seems to be
> > > > use-after-free, I'll see if I can reproduce with some contrived case.
> > > > I'm asuming that picasa forks and exits a lot with submitted io in
> > > > between than may not have finished at exit.
> > >
> > > The second time I hit it, was actually during boot up.
> >
> > Dave, do you have any io scheduler switching going on?
>
> Nope, everything set to use CFQ as default, and left that way.

Hrmpf ok, I had hoped perhaps something in your init scripts modified
the scheduler value.

--
Jens Axboe

2006-05-30 13:52:30

by Dave Jones

[permalink] [raw]
Subject: Re: .17rc5 cfq slab corruption.

On Tue, May 30, 2006 at 03:50:59PM +0200, Jens Axboe wrote:

> > > > > Pretty baffling... cfq has been hammered pretty thoroughly over the
> > > > > last months and _nothing_ has shown up except some performance anomalies
> > > > > that are now fixed. Since daves case (at least) seems to be
> > > > > use-after-free, I'll see if I can reproduce with some contrived case.
> > > > > I'm asuming that picasa forks and exits a lot with submitted io in
> > > > > between than may not have finished at exit.
> > > >
> > > > The second time I hit it, was actually during boot up.
> > >
> > > Dave, do you have any io scheduler switching going on?
> >
> > Nope, everything set to use CFQ as default, and left that way.
>
> Hrmpf ok, I had hoped perhaps something in your init scripts modified
> the scheduler value.

grep doesn't show anything in init scripts, and ttbomk, hald isn't messing
with this. (Actually I'm seeing it trigger before that gets started
anyway, so that can't be it).

Dave

--
http://www.codemonkey.org.uk

2006-05-30 14:13:40

by Dave Jones

[permalink] [raw]
Subject: Re: .17rc5 cfq slab corruption.

On Tue, May 30, 2006 at 09:52:24AM -0400, Dave Jones wrote:
> On Tue, May 30, 2006 at 03:50:59PM +0200, Jens Axboe wrote:
>
> > > > > > Pretty baffling... cfq has been hammered pretty thoroughly over the
> > > > > > last months and _nothing_ has shown up except some performance anomalies
> > > > > > that are now fixed. Since daves case (at least) seems to be
> > > > > > use-after-free, I'll see if I can reproduce with some contrived case.
> > > > > > I'm asuming that picasa forks and exits a lot with submitted io in
> > > > > > between than may not have finished at exit.
> > > > >
> > > > > The second time I hit it, was actually during boot up.
> > > >
> > > > Dave, do you have any io scheduler switching going on?
> > >
> > > Nope, everything set to use CFQ as default, and left that way.
> >
> > Hrmpf ok, I had hoped perhaps something in your init scripts modified
> > the scheduler value.
>
> grep doesn't show anything in init scripts, and ttbomk, hald isn't messing
> with this. (Actually I'm seeing it trigger before that gets started
> anyway, so that can't be it).

In case it makes a difference to help with reproducing -- the hardware of this
system is:

two sata drives sda1/sdb1 in single-volume device-mapper configuration as VolGroup00-LogVol00
pata cd drive on /dev/hdc
usb memory stick on sdc1

Nothing too out of the ordinary there..

Dave

--
http://www.codemonkey.org.uk

2006-05-30 16:12:54

by Dave Jones

[permalink] [raw]
Subject: Re: .17rc5 cfq slab corruption.

On Tue, May 30, 2006 at 03:17:28PM +0200, Jens Axboe wrote:
> On Sat, May 27 2006, Dave Jones wrote:
> > On Sat, May 27, 2006 at 09:07:24AM +0200, Jens Axboe wrote:
> > > On Fri, May 26 2006, Andrew Morton wrote:
> > > > Dave Jones <[email protected]> wrote:
> > > > >
> > > > > Was playing with googles new picasa toy, which hammered the disks
> > > > > hunting out every image file it could find, when this popped out:
> > > > >
> > > > > Slab corruption: (Not tainted) start=ffff810012b998c8, len=168
> > > > > Redzone: 0x5a2cf071/0x5a2cf071.
> > > > > Last user: [<ffffffff8032c319>](cfq_free_io_context+0x2f/0x74)
> > > > > 090: 10 bd 28 1b 00 81 ff ff 6b 6b 6b 6b 6b 6b 6b 6b
> > > > > Prev obj: start=ffff810012b99808, len=168
> > > > > Redzone: 0x5a2cf071/0x5a2cf071.
> > > > > Last user: [<ffffffff8032c319>](cfq_free_io_context+0x2f/0x74)
> > > > > 000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
> > > > > 010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
> > > > > Next obj: start=ffff810012b99988, len=168
> > > > > Redzone: 0x5a2cf071/0x5a2cf071.
> > > > > Last user: [<ffffffff8032c319>](cfq_free_io_context+0x2f/0x74)
> > > > > 000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
> > > > > 010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
> > >
> > > Pretty baffling... cfq has been hammered pretty thoroughly over the
> > > last months and _nothing_ has shown up except some performance anomalies
> > > that are now fixed. Since daves case (at least) seems to be
> > > use-after-free, I'll see if I can reproduce with some contrived case.
> > > I'm asuming that picasa forks and exits a lot with submitted io in
> > > between than may not have finished at exit.
> >
> > The second time I hit it, was actually during boot up.
>
> Dave, do you have any io scheduler switching going on?

Here's something interesting (possibly unrelated).
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=193534

I added this patch to our devel kernel (based on 17rc5-git5 right now)

It's similar to the list_head debugging patch from -mm

--- linux-2.6.12/include/linux/list.h~ 2005-08-08 15:34:50.000000000 -0400
+++ linux-2.6.12/include/linux/list.h 2005-08-08 15:35:22.000000000 -0400
@@ -5,7 +5,9 @@

#include <linux/stddef.h>
#include <linux/prefetch.h>
+#include <linux/kernel.h>
#include <asm/system.h>
+#include <asm/bug.h>

/*
* These are non-NULL pointers that will result in page faults
@@ -52,6 +52,16 @@ static inline void __list_add(struct lis
struct list_head *prev,
struct list_head *next)
{
+ if (next->prev != prev) {
+ printk("List corruption. next->prev should be %p, but was %p\n",
+ prev, next->prev);
+ BUG();
+ }
+ if (prev->next != next) {
+ printk("List corruption. prev->next should be %p, but was %p\n",
+ next, prev->next);
+ BUG();
+ }
next->prev = new;
new->next = next;
new->prev = prev;
@@ -162,6 +162,16 @@ static inline void __list_del(struct lis
*/
static inline void list_del(struct list_head *entry)
{
+ if (entry->prev->next != entry) {
+ printk("List corruption. prev->next should be %p, but was %p\n",
+ entry, entry->prev->next);
+ BUG();
+ }
+ if (entry->next->prev != entry) {
+ printk("List corruption. next->prev should be %p, but was %p\n",
+ entry, entry->next->prev);
+ BUG();
+ }
__list_del(entry->prev, entry->next);
entry->next = LIST_POISON1;
entry->prev = LIST_POISON2;


And then it turned up this:

List corruption. next->prev should be f74a5e2c, but was ea7ed31c
Pointing at cfq_set_request.

Now, *anything* could have corrupted that list, not necessarily cfq,
but it's something of a coincidence.

Dave

--
http://www.codemonkey.org.uk

2006-05-30 16:47:23

by Jens Axboe

[permalink] [raw]
Subject: Re: .17rc5 cfq slab corruption.

On Tue, May 30 2006, Dave Jones wrote:
> On Tue, May 30, 2006 at 03:17:28PM +0200, Jens Axboe wrote:
> > On Sat, May 27 2006, Dave Jones wrote:
> > > On Sat, May 27, 2006 at 09:07:24AM +0200, Jens Axboe wrote:
> > > > On Fri, May 26 2006, Andrew Morton wrote:
> > > > > Dave Jones <[email protected]> wrote:
> > > > > >
> > > > > > Was playing with googles new picasa toy, which hammered the disks
> > > > > > hunting out every image file it could find, when this popped out:
> > > > > >
> > > > > > Slab corruption: (Not tainted) start=ffff810012b998c8, len=168
> > > > > > Redzone: 0x5a2cf071/0x5a2cf071.
> > > > > > Last user: [<ffffffff8032c319>](cfq_free_io_context+0x2f/0x74)
> > > > > > 090: 10 bd 28 1b 00 81 ff ff 6b 6b 6b 6b 6b 6b 6b 6b
> > > > > > Prev obj: start=ffff810012b99808, len=168
> > > > > > Redzone: 0x5a2cf071/0x5a2cf071.
> > > > > > Last user: [<ffffffff8032c319>](cfq_free_io_context+0x2f/0x74)
> > > > > > 000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
> > > > > > 010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
> > > > > > Next obj: start=ffff810012b99988, len=168
> > > > > > Redzone: 0x5a2cf071/0x5a2cf071.
> > > > > > Last user: [<ffffffff8032c319>](cfq_free_io_context+0x2f/0x74)
> > > > > > 000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
> > > > > > 010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
> > > >
> > > > Pretty baffling... cfq has been hammered pretty thoroughly over the
> > > > last months and _nothing_ has shown up except some performance anomalies
> > > > that are now fixed. Since daves case (at least) seems to be
> > > > use-after-free, I'll see if I can reproduce with some contrived case.
> > > > I'm asuming that picasa forks and exits a lot with submitted io in
> > > > between than may not have finished at exit.
> > >
> > > The second time I hit it, was actually during boot up.
> >
> > Dave, do you have any io scheduler switching going on?
>
> Here's something interesting (possibly unrelated).
> https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=193534
>
> I added this patch to our devel kernel (based on 17rc5-git5 right now)
>
> It's similar to the list_head debugging patch from -mm
>
> --- linux-2.6.12/include/linux/list.h~ 2005-08-08 15:34:50.000000000 -0400
> +++ linux-2.6.12/include/linux/list.h 2005-08-08 15:35:22.000000000 -0400
> @@ -5,7 +5,9 @@
>
> #include <linux/stddef.h>
> #include <linux/prefetch.h>
> +#include <linux/kernel.h>
> #include <asm/system.h>
> +#include <asm/bug.h>
>
> /*
> * These are non-NULL pointers that will result in page faults
> @@ -52,6 +52,16 @@ static inline void __list_add(struct lis
> struct list_head *prev,
> struct list_head *next)
> {
> + if (next->prev != prev) {
> + printk("List corruption. next->prev should be %p, but was %p\n",
> + prev, next->prev);
> + BUG();
> + }
> + if (prev->next != next) {
> + printk("List corruption. prev->next should be %p, but was %p\n",
> + next, prev->next);
> + BUG();
> + }
> next->prev = new;
> new->next = next;
> new->prev = prev;
> @@ -162,6 +162,16 @@ static inline void __list_del(struct lis
> */
> static inline void list_del(struct list_head *entry)
> {
> + if (entry->prev->next != entry) {
> + printk("List corruption. prev->next should be %p, but was %p\n",
> + entry, entry->prev->next);
> + BUG();
> + }
> + if (entry->next->prev != entry) {
> + printk("List corruption. next->prev should be %p, but was %p\n",
> + entry, entry->next->prev);
> + BUG();
> + }
> __list_del(entry->prev, entry->next);
> entry->next = LIST_POISON1;
> entry->prev = LIST_POISON2;
>
>
> And then it turned up this:
>
> List corruption. next->prev should be f74a5e2c, but was ea7ed31c
> Pointing at cfq_set_request.

I think I'm missing a piece of this - what list was corrupted, in what
function did it trigger?


--
Jens Axboe

2006-05-30 16:57:09

by Dave Jones

[permalink] [raw]
Subject: Re: .17rc5 cfq slab corruption.

On Tue, May 30, 2006 at 06:49:18PM +0200, Jens Axboe wrote:

> > List corruption. next->prev should be f74a5e2c, but was ea7ed31c
> > Pointing at cfq_set_request.
>
> I think I'm missing a piece of this - what list was corrupted, in what
> function did it trigger?

If you look at the attachment in the bugzilla url in my previous msg,
you'll see this:

ay 30 05:31:33 mandril kernel: List corruption. next->prev should be f74a5e2c, but was ea7ed31c
May 30 05:31:33 mandril kernel: ------------[ cut here ]------------
May 30 05:31:33 mandril kernel: kernel BUG at include/linux/list.h:58!
May 30 05:31:33 mandril kernel: invalid opcode: 0000 [#1]
May 30 05:31:33 mandril kernel: SMP
May 30 05:31:33 mandril kernel: last sysfs file: /devices/pci0000:00/0000:00:1f.3/i2c-0/0-002e/pwm3
May 30 05:31:33 mandril kernel: Modules linked in: iptable_filter ipt_DSCP iptable_mangle ip_tables x_tables eeprom lm85 hwmon_vid hwmon i2c_isa ipv6 nls_utf8 loop dm_mirror dm_mod video button battery ac lp parport_pc parport ehci_hcd uhci_hcd floppy snd_intel8x0 snd_ac97_codec snd_ac97_bus sg snd_seq_dummy matroxfb_base snd_seq_oss snd_seq_midi_event matroxfb_DAC1064 snd_seq matroxfb_accel matroxfb_Ti3026 3w_9xxx matroxfb_g450 snd_seq_device g450_pll matroxfb_misc snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd e1000 soundcore snd_page_alloc i2c_i801 i2c_core ext3 jbd 3w_xxxx ata_piix libata sd_mod scsi_mod
May 30 05:31:33 mandril kernel: CPU: 0
May 30 05:31:33 mandril kernel: EIP: 0060:[<c04e3310>] Not tainted VLI
May 30 05:31:33 mandril kernel: EFLAGS: 00210292 (2.6.16-1.2227_FC6 #1)
May 30 05:31:33 mandril kernel: EIP is at cfq_set_request+0x202/0x3ff
May 30 05:31:33 mandril kernel: eax: 00000044 ebx: f605af58 ecx: c06c77d0 edx: 00200282
May 30 05:31:33 mandril kernel: esi: f4c31d58 edi: f74a5e2c ebp: f4c31da8 esp: c57c1abc
May 30 05:31:33 mandril kernel: ds: 007b es: 007b ss: 0068
May 30 05:31:33 mandril kernel: Process httpd (pid: 30323, threadinfo=c57c1000 task=f1bf4550)
May 30 05:31:33 mandril kernel: Stack: c0624ac7 f74a5e2c ea7ed31c d0c71c34 f74cb25c f74a5d04 f1bf4550 00000000
May 30 05:31:33 mandril kernel: 00007673 f3822398 00000000 c04e310e f74cb25c d0c71c34 f74cb26c c04d7342
May 30 05:31:33 mandril kernel: 00000010 f74cb25c 00000000 c04d9da8 00000010 d954b84c c046ca2b 0142781d
May 30 05:31:33 mandril kernel: Call Trace:
May 30 05:31:33 mandril kernel: <c04e310e> cfq_set_request+0x0/0x3ff <c04d7342> elv_set_request+0x1e/0x2d
May 30 05:31:33 mandril kernel: <c04d9da8> get_request+0x164/0x31d <c046ca2b> __find_get_block+0x161/0x16b
May 30 05:31:33 mandril kernel: <c04da58e> get_request_wait+0x1b/0x15e <c046ca6e> __getblk+0x39/0x257
May 30 05:31:33 mandril kernel: <c04dbeeb> __make_request+0x322/0x3cb <c0465b7a> cache_alloc_debugcheck_after+0x134/0x13e
May 30 05:31:33 mandril kernel: <f8951626> ext3_get_blocks_handle+0x921/0x954 [ext3] <c04d96f6> generic_make_request+0x2b7/0x2c7
May 30 05:31:33 mandril kernel: <c046c6f9> __find_get_block_slow+0x11c/0x128 <c0465b7a> cache_alloc_debugcheck_after+0x134/0x13e
May 30 05:31:33 mandril kernel: <c046ca2b> __find_get_block+0x161/0x16b <c0465b13> cache_alloc_debugcheck_after+0xcd/0x13e
May 30 05:31:33 mandril kernel: <c044ec05> mempool_alloc+0x37/0xd3 <c04db30c> submit_bio+0xb7/0xbe
May 30 05:31:33 mandril kernel: <c044ec05> mempool_alloc+0x37/0xd3 <f8951774> ext3_getblk+0x11b/0x2a6 [ext3]
May 30 05:31:33 mandril kernel: <c046f5f2> bio_alloc_bioset+0x9b/0xf3 <c046c3da> submit_bh+0xe4/0x102
May 30 05:31:33 mandril kernel: <c046d3d5> ll_rw_block+0x8d/0xa9 <f89527a3> ext3_bread+0x39/0x83 [ext3]
May 30 05:31:33 mandril kernel: <f895571c> ext3_find_entry+0x174/0x58d [ext3] <c0406470> do_IRQ+0x79/0x84
May 30 05:31:33 mandril kernel: <c040485e> common_interrupt+0x1a/0x20 <c04390ef> debug_mutex_add_waiter+0x97/0xa9
May 30 05:31:33 mandril kernel: <c047fd1e> d_alloc+0x4b/0x206 <f8956f38> ext3_lookup+0x21/0x77 [ext3]
May 30 05:31:33 mandril kernel: <c0476e95> do_lookup+0xb2/0x14d <c0478c03> __link_path_walk+0x894/0xd31
May 30 05:31:33 mandril kernel: <c04790e9> link_path_walk+0x49/0xbd <c0435994> autoremove_wake_function+0x0/0x35
May 30 05:31:33 mandril kernel: <c0479484> do_path_lookup+0x1e3/0x251 <c047823b> getname+0x91/0x98
May 30 05:31:33 mandril kernel: <c0479c7d> __user_walk_fd+0x2f/0x40 <c04736ab> vfs_stat_fd+0x19/0x40
May 30 05:31:33 mandril kernel: <c0435994> autoremove_wake_function+0x0/0x35 <c047375f> sys_stat64+0xf/0x23
May 30 05:31:33 mandril kernel: <c0406470> do_IRQ+0x79/0x84 <c0403e1f> syscall_call+0x7/0xb
May 30 05:31:33 mandril kernel: Code: 8b 7c 24 14 8b 98 28 01 00 00 81 c7 28 01 00 00 8b 43 04 39 f8 74 46 89 7c 24 04 89 44 24 08 c7 04 24 c7 4a 62 c0 e8 44 1e f4 ff <0f>
0b 3a 00 b2 4a 62 c0 8b 54 24 14 8b 82 28 01 00 00 39 d8 74

Not sure which list is being pointed at. I'll see if I can decipher
more after lunch.

Dave

--
http://www.codemonkey.org.uk

2006-05-30 17:02:46

by Jens Axboe

[permalink] [raw]
Subject: Re: .17rc5 cfq slab corruption.

On Tue, May 30 2006, Dave Jones wrote:
> On Tue, May 30, 2006 at 06:49:18PM +0200, Jens Axboe wrote:
>
> > > List corruption. next->prev should be f74a5e2c, but was ea7ed31c
> > > Pointing at cfq_set_request.
> >
> > I think I'm missing a piece of this - what list was corrupted, in what
> > function did it trigger?
>
> If you look at the attachment in the bugzilla url in my previous msg,
> you'll see this:
>
> ay 30 05:31:33 mandril kernel: List corruption. next->prev should be f74a5e2c, but was ea7ed31c
> May 30 05:31:33 mandril kernel: ------------[ cut here ]------------
> May 30 05:31:33 mandril kernel: kernel BUG at include/linux/list.h:58!
> May 30 05:31:33 mandril kernel: invalid opcode: 0000 [#1]
> May 30 05:31:33 mandril kernel: SMP
> May 30 05:31:33 mandril kernel: last sysfs file: /devices/pci0000:00/0000:00:1f.3/i2c-0/0-002e/pwm3
> May 30 05:31:33 mandril kernel: Modules linked in: iptable_filter ipt_DSCP iptable_mangle ip_tables x_tables eeprom lm85 hwmon_vid hwmon i2c_isa ipv6 nls_utf8 loop dm_mirror dm_mod video button battery ac lp parport_pc parport ehci_hcd uhci_hcd floppy snd_intel8x0 snd_ac97_codec snd_ac97_bus sg snd_seq_dummy matroxfb_base snd_seq_oss snd_seq_midi_event matroxfb_DAC1064 snd_seq matroxfb_accel matroxfb_Ti3026 3w_9xxx matroxfb_g450 snd_seq_device g450_pll matroxfb_misc snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd e1000 soundcore snd_page_alloc i2c_i801 i2c_core ext3 jbd 3w_xxxx ata_piix libata sd_mod scsi_mod
> May 30 05:31:33 mandril kernel: CPU: 0
> May 30 05:31:33 mandril kernel: EIP: 0060:[<c04e3310>] Not tainted VLI
> May 30 05:31:33 mandril kernel: EFLAGS: 00210292 (2.6.16-1.2227_FC6 #1)
> May 30 05:31:33 mandril kernel: EIP is at cfq_set_request+0x202/0x3ff

Just do a l *cfq_set_request+0x202 from gdb if you have
CONFIG_DEBUG_INFO enabled in your vmlinux.

--
Jens Axboe

2006-05-30 18:47:30

by Jens Axboe

[permalink] [raw]
Subject: Re: .17rc5 cfq slab corruption.

On Tue, May 30 2006, Jens Axboe wrote:
> On Tue, May 30 2006, Dave Jones wrote:
> > On Tue, May 30, 2006 at 06:49:18PM +0200, Jens Axboe wrote:
> >
> > > > List corruption. next->prev should be f74a5e2c, but was ea7ed31c
> > > > Pointing at cfq_set_request.
> > >
> > > I think I'm missing a piece of this - what list was corrupted, in what
> > > function did it trigger?
> >
> > If you look at the attachment in the bugzilla url in my previous msg,
> > you'll see this:
> >
> > ay 30 05:31:33 mandril kernel: List corruption. next->prev should be f74a5e2c, but was ea7ed31c
> > May 30 05:31:33 mandril kernel: ------------[ cut here ]------------
> > May 30 05:31:33 mandril kernel: kernel BUG at include/linux/list.h:58!
> > May 30 05:31:33 mandril kernel: invalid opcode: 0000 [#1]
> > May 30 05:31:33 mandril kernel: SMP
> > May 30 05:31:33 mandril kernel: last sysfs file: /devices/pci0000:00/0000:00:1f.3/i2c-0/0-002e/pwm3
> > May 30 05:31:33 mandril kernel: Modules linked in: iptable_filter ipt_DSCP iptable_mangle ip_tables x_tables eeprom lm85 hwmon_vid hwmon i2c_isa ipv6 nls_utf8 loop dm_mirror dm_mod video button battery ac lp parport_pc parport ehci_hcd uhci_hcd floppy snd_intel8x0 snd_ac97_codec snd_ac97_bus sg snd_seq_dummy matroxfb_base snd_seq_oss snd_seq_midi_event matroxfb_DAC1064 snd_seq matroxfb_accel matroxfb_Ti3026 3w_9xxx matroxfb_g450 snd_seq_device g450_pll matroxfb_misc snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd e1000 soundcore snd_page_alloc i2c_i801 i2c_core ext3 jbd 3w_xxxx ata_piix libata sd_mod scsi_mod
> > May 30 05:31:33 mandril kernel: CPU: 0
> > May 30 05:31:33 mandril kernel: EIP: 0060:[<c04e3310>] Not tainted VLI
> > May 30 05:31:33 mandril kernel: EFLAGS: 00210292 (2.6.16-1.2227_FC6 #1)
> > May 30 05:31:33 mandril kernel: EIP is at cfq_set_request+0x202/0x3ff
>
> Just do a l *cfq_set_request+0x202 from gdb if you have
> CONFIG_DEBUG_INFO enabled in your vmlinux.

Doh, found it. Dave, please try and reproduce with this applied:

---

[PATCH] cfq-iosched: kill list entry on dead cic

We must remember to drop the cic circular list when we prune
a dead entry.

Signed-off-by: Jens Axboe <[email protected]>

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 2540dfa..309e0f4 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -1477,6 +1477,7 @@ cfq_drop_dead_cic(struct io_context *ioc
{
read_lock(&cfq_exit_lock);
rb_erase(&cic->rb_node, &ioc->cic_root);
+ list_del_init(&cic->queue_list);
read_unlock(&cfq_exit_lock);
kmem_cache_free(cfq_ioc_pool, cic);
atomic_dec(&ioc_count);

--
Jens Axboe

2006-05-30 18:50:22

by Jens Axboe

[permalink] [raw]
Subject: Re: .17rc5 cfq slab corruption.

On Tue, May 30 2006, Jens Axboe wrote:
> On Tue, May 30 2006, Jens Axboe wrote:
> > On Tue, May 30 2006, Dave Jones wrote:
> > > On Tue, May 30, 2006 at 06:49:18PM +0200, Jens Axboe wrote:
> > >
> > > > > List corruption. next->prev should be f74a5e2c, but was ea7ed31c
> > > > > Pointing at cfq_set_request.
> > > >
> > > > I think I'm missing a piece of this - what list was corrupted, in what
> > > > function did it trigger?
> > >
> > > If you look at the attachment in the bugzilla url in my previous msg,
> > > you'll see this:
> > >
> > > ay 30 05:31:33 mandril kernel: List corruption. next->prev should be f74a5e2c, but was ea7ed31c
> > > May 30 05:31:33 mandril kernel: ------------[ cut here ]------------
> > > May 30 05:31:33 mandril kernel: kernel BUG at include/linux/list.h:58!
> > > May 30 05:31:33 mandril kernel: invalid opcode: 0000 [#1]
> > > May 30 05:31:33 mandril kernel: SMP
> > > May 30 05:31:33 mandril kernel: last sysfs file: /devices/pci0000:00/0000:00:1f.3/i2c-0/0-002e/pwm3
> > > May 30 05:31:33 mandril kernel: Modules linked in: iptable_filter ipt_DSCP iptable_mangle ip_tables x_tables eeprom lm85 hwmon_vid hwmon i2c_isa ipv6 nls_utf8 loop dm_mirror dm_mod video button battery ac lp parport_pc parport ehci_hcd uhci_hcd floppy snd_intel8x0 snd_ac97_codec snd_ac97_bus sg snd_seq_dummy matroxfb_base snd_seq_oss snd_seq_midi_event matroxfb_DAC1064 snd_seq matroxfb_accel matroxfb_Ti3026 3w_9xxx matroxfb_g450 snd_seq_device g450_pll matroxfb_misc snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd e1000 soundcore snd_page_alloc i2c_i801 i2c_core ext3 jbd 3w_xxxx ata_piix libata sd_mod scsi_mod
> > > May 30 05:31:33 mandril kernel: CPU: 0
> > > May 30 05:31:33 mandril kernel: EIP: 0060:[<c04e3310>] Not tainted VLI
> > > May 30 05:31:33 mandril kernel: EFLAGS: 00210292 (2.6.16-1.2227_FC6 #1)
> > > May 30 05:31:33 mandril kernel: EIP is at cfq_set_request+0x202/0x3ff
> >
> > Just do a l *cfq_set_request+0x202 from gdb if you have
> > CONFIG_DEBUG_INFO enabled in your vmlinux.
>
> Doh, found it. Dave, please try and reproduce with this applied:

Nevermind, that's not it either. Damn. Stay tuned.

--
Jens Axboe

2006-05-30 19:09:29

by Jens Axboe

[permalink] [raw]
Subject: Re: .17rc5 cfq slab corruption.

On Tue, May 30 2006, Jens Axboe wrote:
> On Tue, May 30 2006, Jens Axboe wrote:
> > On Tue, May 30 2006, Jens Axboe wrote:
> > > On Tue, May 30 2006, Dave Jones wrote:
> > > > On Tue, May 30, 2006 at 06:49:18PM +0200, Jens Axboe wrote:
> > > >
> > > > > > List corruption. next->prev should be f74a5e2c, but was ea7ed31c
> > > > > > Pointing at cfq_set_request.
> > > > >
> > > > > I think I'm missing a piece of this - what list was corrupted, in what
> > > > > function did it trigger?
> > > >
> > > > If you look at the attachment in the bugzilla url in my previous msg,
> > > > you'll see this:
> > > >
> > > > ay 30 05:31:33 mandril kernel: List corruption. next->prev should be f74a5e2c, but was ea7ed31c
> > > > May 30 05:31:33 mandril kernel: ------------[ cut here ]------------
> > > > May 30 05:31:33 mandril kernel: kernel BUG at include/linux/list.h:58!
> > > > May 30 05:31:33 mandril kernel: invalid opcode: 0000 [#1]
> > > > May 30 05:31:33 mandril kernel: SMP
> > > > May 30 05:31:33 mandril kernel: last sysfs file: /devices/pci0000:00/0000:00:1f.3/i2c-0/0-002e/pwm3
> > > > May 30 05:31:33 mandril kernel: Modules linked in: iptable_filter ipt_DSCP iptable_mangle ip_tables x_tables eeprom lm85 hwmon_vid hwmon i2c_isa ipv6 nls_utf8 loop dm_mirror dm_mod video button battery ac lp parport_pc parport ehci_hcd uhci_hcd floppy snd_intel8x0 snd_ac97_codec snd_ac97_bus sg snd_seq_dummy matroxfb_base snd_seq_oss snd_seq_midi_event matroxfb_DAC1064 snd_seq matroxfb_accel matroxfb_Ti3026 3w_9xxx matroxfb_g450 snd_seq_device g450_pll matroxfb_misc snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd e1000 soundcore snd_page_alloc i2c_i801 i2c_core ext3 jbd 3w_xxxx ata_piix libata sd_mod scsi_mod
> > > > May 30 05:31:33 mandril kernel: CPU: 0
> > > > May 30 05:31:33 mandril kernel: EIP: 0060:[<c04e3310>] Not tainted VLI
> > > > May 30 05:31:33 mandril kernel: EFLAGS: 00210292 (2.6.16-1.2227_FC6 #1)
> > > > May 30 05:31:33 mandril kernel: EIP is at cfq_set_request+0x202/0x3ff
> > >
> > > Just do a l *cfq_set_request+0x202 from gdb if you have
> > > CONFIG_DEBUG_INFO enabled in your vmlinux.
> >
> > Doh, found it. Dave, please try and reproduce with this applied:
>
> Nevermind, that's not it either. Damn. Stay tuned.

Try this instead, please.

[PATCH] cfq-iosched: fixup locking and ->queue_list list management

- Drop cic from the list when seen as dead.
- Fixup the locking, just use a simple spinlock.

Signed-off-by: Jens Axboe <[email protected]>

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 2540dfa..11ce6aa 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -33,7 +33,7 @@ #define CFQ_SLICE_SCALE (5)

#define CFQ_KEY_ASYNC (0)

-static DEFINE_RWLOCK(cfq_exit_lock);
+static DEFINE_SPINLOCK(cfq_exit_lock);

/*
* for the hash of cfqq inside the cfqd
@@ -1284,7 +1284,7 @@ static void cfq_exit_io_context(struct i
/*
* put the reference this task is holding to the various queues
*/
- read_lock_irqsave(&cfq_exit_lock, flags);
+ spin_lock_irqsave(&cfq_exit_lock, flags);

n = rb_first(&ioc->cic_root);
while (n != NULL) {
@@ -1294,7 +1294,7 @@ static void cfq_exit_io_context(struct i
n = rb_next(n);
}

- read_unlock_irqrestore(&cfq_exit_lock, flags);
+ spin_unlock_irqrestore(&cfq_exit_lock, flags);
}

static struct cfq_io_context *
@@ -1400,17 +1400,17 @@ static int cfq_ioc_set_ioprio(struct io_
struct cfq_io_context *cic;
struct rb_node *n;

- write_lock(&cfq_exit_lock);
+ spin_lock(&cfq_exit_lock);

n = rb_first(&ioc->cic_root);
while (n != NULL) {
cic = rb_entry(n, struct cfq_io_context, rb_node);
-
+
changed_ioprio(cic);
n = rb_next(n);
}

- write_unlock(&cfq_exit_lock);
+ spin_unlock(&cfq_exit_lock);

return 0;
}
@@ -1475,9 +1475,10 @@ out:
static void
cfq_drop_dead_cic(struct io_context *ioc, struct cfq_io_context *cic)
{
- read_lock(&cfq_exit_lock);
+ spin_lock(&cfq_exit_lock);
rb_erase(&cic->rb_node, &ioc->cic_root);
- read_unlock(&cfq_exit_lock);
+ list_del_init(&cic->queue_list);
+ spin_unlock(&cfq_exit_lock);
kmem_cache_free(cfq_ioc_pool, cic);
atomic_dec(&ioc_count);
}
@@ -1545,11 +1546,11 @@ restart:
BUG();
}

- read_lock(&cfq_exit_lock);
+ spin_lock(&cfq_exit_lock);
rb_link_node(&cic->rb_node, parent, p);
rb_insert_color(&cic->rb_node, &ioc->cic_root);
list_add(&cic->queue_list, &cfqd->cic_list);
- read_unlock(&cfq_exit_lock);
+ spin_unlock(&cfq_exit_lock);
}

/*
@@ -2187,7 +2188,7 @@ static void cfq_exit_queue(elevator_t *e

cfq_shutdown_timer_wq(cfqd);

- write_lock(&cfq_exit_lock);
+ spin_lock(&cfq_exit_lock);
spin_lock_irq(q->queue_lock);

if (cfqd->active_queue)
@@ -2210,7 +2211,7 @@ static void cfq_exit_queue(elevator_t *e
}

spin_unlock_irq(q->queue_lock);
- write_unlock(&cfq_exit_lock);
+ spin_unlock(&cfq_exit_lock);

cfq_shutdown_timer_wq(cfqd);


--
Jens Axboe

2006-05-30 19:24:14

by Dave Jones

[permalink] [raw]
Subject: Re: .17rc5 cfq slab corruption.

On Tue, May 30, 2006 at 09:11:28PM +0200, Jens Axboe wrote:

> > > > Just do a l *cfq_set_request+0x202 from gdb if you have
> > > > CONFIG_DEBUG_INFO enabled in your vmlinux.
> > >
> > > Doh, found it. Dave, please try and reproduce with this applied:
> >
> > Nevermind, that's not it either. Damn. Stay tuned.
>
> Try this instead, please.

Heh, I was waiting forever to get that debuginfo package downloaded & unpacked.
I'll throw this into my next build, and see what happens.
Is this likely to fix the slab corruption bug I first reported,
or the list corruption as seen in the bugzilla, or (optimistically) both ?

Thanks,

Dave

--
http://www.codemonkey.org.uk

2006-05-30 19:25:18

by Jens Axboe

[permalink] [raw]
Subject: Re: .17rc5 cfq slab corruption.

On Tue, May 30 2006, Dave Jones wrote:
> On Tue, May 30, 2006 at 09:11:28PM +0200, Jens Axboe wrote:
>
> > > > > Just do a l *cfq_set_request+0x202 from gdb if you have
> > > > > CONFIG_DEBUG_INFO enabled in your vmlinux.
> > > >
> > > > Doh, found it. Dave, please try and reproduce with this applied:
> > >
> > > Nevermind, that's not it either. Damn. Stay tuned.
> >
> > Try this instead, please.
>
> Heh, I was waiting forever to get that debuginfo package downloaded & unpacked.
> I'll throw this into my next build, and see what happens.
> Is this likely to fix the slab corruption bug I first reported,
> or the list corruption as seen in the bugzilla, or (optimistically) both ?

Both, it's the same bug.

--
Jens Axboe

2006-05-30 19:28:38

by OGAWA Hirofumi

[permalink] [raw]
Subject: Re: .17rc5 cfq slab corruption.

Jens Axboe <[email protected]> writes:

> On Tue, May 30 2006, Jens Axboe wrote:
>> On Tue, May 30 2006, Jens Axboe wrote:
>> > On Tue, May 30 2006, Jens Axboe wrote:
>> > > On Tue, May 30 2006, Dave Jones wrote:
>> > > > On Tue, May 30, 2006 at 06:49:18PM +0200, Jens Axboe wrote:
>> > > >
>> > > > > > List corruption. next->prev should be f74a5e2c, but was ea7ed31c
>> > > > > > Pointing at cfq_set_request.
>> > > > >
>> > > > > I think I'm missing a piece of this - what list was corrupted, in what
>> > > > > function did it trigger?
>> > > >
>> > > > If you look at the attachment in the bugzilla url in my previous msg,
>> > > > you'll see this:
>> > > >
>> > > > ay 30 05:31:33 mandril kernel: List corruption. next->prev should be f74a5e2c, but was ea7ed31c
>> > > > May 30 05:31:33 mandril kernel: ------------[ cut here ]------------
>> > > > May 30 05:31:33 mandril kernel: kernel BUG at include/linux/list.h:58!
>> > > > May 30 05:31:33 mandril kernel: invalid opcode: 0000 [#1]
>> > > > May 30 05:31:33 mandril kernel: SMP
>> > > > May 30 05:31:33 mandril kernel: last sysfs file: /devices/pci0000:00/0000:00:1f.3/i2c-0/0-002e/pwm3
>> > > > May 30 05:31:33 mandril kernel: Modules linked in: iptable_filter ipt_DSCP iptable_mangle ip_tables x_tables eeprom lm85 hwmon_vid hwmon i2c_isa ipv6 nls_utf8 loop dm_mirror dm_mod video button battery ac lp parport_pc parport ehci_hcd uhci_hcd floppy snd_intel8x0 snd_ac97_codec snd_ac97_bus sg snd_seq_dummy matroxfb_base snd_seq_oss snd_seq_midi_event matroxfb_DAC1064 snd_seq matroxfb_accel matroxfb_Ti3026 3w_9xxx matroxfb_g450 snd_seq_device g450_pll matroxfb_misc snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd e1000 soundcore snd_page_alloc i2c_i801 i2c_core ext3 jbd 3w_xxxx ata_piix libata sd_mod scsi_mod
>> > > > May 30 05:31:33 mandril kernel: CPU: 0
>> > > > May 30 05:31:33 mandril kernel: EIP: 0060:[<c04e3310>] Not tainted VLI
>> > > > May 30 05:31:33 mandril kernel: EFLAGS: 00210292 (2.6.16-1.2227_FC6 #1)
>> > > > May 30 05:31:33 mandril kernel: EIP is at cfq_set_request+0x202/0x3ff
>> > >
>> > > Just do a l *cfq_set_request+0x202 from gdb if you have
>> > > CONFIG_DEBUG_INFO enabled in your vmlinux.
>> >
>> > Doh, found it. Dave, please try and reproduce with this applied:
>>
>> Nevermind, that's not it either. Damn. Stay tuned.
>
> Try this instead, please.

Umm.. don't we need this line?

static void cfq_free_io_context(struct io_context *ioc)
{
struct cfq_io_context *__cic;
struct rb_node *n;
int freed = 0;

while ((n = rb_first(&ioc->cic_root)) != NULL) {
__cic = rb_entry(n, struct cfq_io_context, rb_node);
rb_erase(&__cic->rb_node, &ioc->cic_root);
list_del(&__cic->queue_list);
^^^^^^^^ <---- this line
kmem_cache_free(cfq_ioc_pool, __cic);
freed++;
}

if (atomic_sub_and_test(freed, &ioc_count) && ioc_gone)
complete(ioc_gone);
}
--
OGAWA Hirofumi <[email protected]>

2006-05-30 19:40:19

by Jens Axboe

[permalink] [raw]
Subject: Re: .17rc5 cfq slab corruption.

On Wed, May 31 2006, OGAWA Hirofumi wrote:
> Jens Axboe <[email protected]> writes:
>
> > On Tue, May 30 2006, Jens Axboe wrote:
> >> On Tue, May 30 2006, Jens Axboe wrote:
> >> > On Tue, May 30 2006, Jens Axboe wrote:
> >> > > On Tue, May 30 2006, Dave Jones wrote:
> >> > > > On Tue, May 30, 2006 at 06:49:18PM +0200, Jens Axboe wrote:
> >> > > >
> >> > > > > > List corruption. next->prev should be f74a5e2c, but was ea7ed31c
> >> > > > > > Pointing at cfq_set_request.
> >> > > > >
> >> > > > > I think I'm missing a piece of this - what list was corrupted, in what
> >> > > > > function did it trigger?
> >> > > >
> >> > > > If you look at the attachment in the bugzilla url in my previous msg,
> >> > > > you'll see this:
> >> > > >
> >> > > > ay 30 05:31:33 mandril kernel: List corruption. next->prev should be f74a5e2c, but was ea7ed31c
> >> > > > May 30 05:31:33 mandril kernel: ------------[ cut here ]------------
> >> > > > May 30 05:31:33 mandril kernel: kernel BUG at include/linux/list.h:58!
> >> > > > May 30 05:31:33 mandril kernel: invalid opcode: 0000 [#1]
> >> > > > May 30 05:31:33 mandril kernel: SMP
> >> > > > May 30 05:31:33 mandril kernel: last sysfs file: /devices/pci0000:00/0000:00:1f.3/i2c-0/0-002e/pwm3
> >> > > > May 30 05:31:33 mandril kernel: Modules linked in: iptable_filter ipt_DSCP iptable_mangle ip_tables x_tables eeprom lm85 hwmon_vid hwmon i2c_isa ipv6 nls_utf8 loop dm_mirror dm_mod video button battery ac lp parport_pc parport ehci_hcd uhci_hcd floppy snd_intel8x0 snd_ac97_codec snd_ac97_bus sg snd_seq_dummy matroxfb_base snd_seq_oss snd_seq_midi_event matroxfb_DAC1064 snd_seq matroxfb_accel matroxfb_Ti3026 3w_9xxx matroxfb_g450 snd_seq_device g450_pll matroxfb_misc snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd e1000 soundcore snd_page_alloc i2c_i801 i2c_core ext3 jbd 3w_xxxx ata_piix libata sd_mod scsi_mod
> >> > > > May 30 05:31:33 mandril kernel: CPU: 0
> >> > > > May 30 05:31:33 mandril kernel: EIP: 0060:[<c04e3310>] Not tainted VLI
> >> > > > May 30 05:31:33 mandril kernel: EFLAGS: 00210292 (2.6.16-1.2227_FC6 #1)
> >> > > > May 30 05:31:33 mandril kernel: EIP is at cfq_set_request+0x202/0x3ff
> >> > >
> >> > > Just do a l *cfq_set_request+0x202 from gdb if you have
> >> > > CONFIG_DEBUG_INFO enabled in your vmlinux.
> >> >
> >> > Doh, found it. Dave, please try and reproduce with this applied:
> >>
> >> Nevermind, that's not it either. Damn. Stay tuned.
> >
> > Try this instead, please.
>
> Umm.. don't we need this line?
>
> static void cfq_free_io_context(struct io_context *ioc)
> {
> struct cfq_io_context *__cic;
> struct rb_node *n;
> int freed = 0;
>
> while ((n = rb_first(&ioc->cic_root)) != NULL) {
> __cic = rb_entry(n, struct cfq_io_context, rb_node);
> rb_erase(&__cic->rb_node, &ioc->cic_root);
> list_del(&__cic->queue_list);
> ^^^^^^^^ <---- this line
> kmem_cache_free(cfq_ioc_pool, __cic);
> freed++;
> }
>
> if (atomic_sub_and_test(freed, &ioc_count) && ioc_gone)
> complete(ioc_gone);

Yep, looks like that is missing as well. Care to send a proper patch and
I'll shove it in, too.

--
Jens Axboe

2006-05-30 19:46:58

by Jens Axboe

[permalink] [raw]
Subject: Re: .17rc5 cfq slab corruption.

On Tue, May 30 2006, Jens Axboe wrote:
> On Wed, May 31 2006, OGAWA Hirofumi wrote:
> > Jens Axboe <[email protected]> writes:
> >
> > > On Tue, May 30 2006, Jens Axboe wrote:
> > >> On Tue, May 30 2006, Jens Axboe wrote:
> > >> > On Tue, May 30 2006, Jens Axboe wrote:
> > >> > > On Tue, May 30 2006, Dave Jones wrote:
> > >> > > > On Tue, May 30, 2006 at 06:49:18PM +0200, Jens Axboe wrote:
> > >> > > >
> > >> > > > > > List corruption. next->prev should be f74a5e2c, but was ea7ed31c
> > >> > > > > > Pointing at cfq_set_request.
> > >> > > > >
> > >> > > > > I think I'm missing a piece of this - what list was corrupted, in what
> > >> > > > > function did it trigger?
> > >> > > >
> > >> > > > If you look at the attachment in the bugzilla url in my previous msg,
> > >> > > > you'll see this:
> > >> > > >
> > >> > > > ay 30 05:31:33 mandril kernel: List corruption. next->prev should be f74a5e2c, but was ea7ed31c
> > >> > > > May 30 05:31:33 mandril kernel: ------------[ cut here ]------------
> > >> > > > May 30 05:31:33 mandril kernel: kernel BUG at include/linux/list.h:58!
> > >> > > > May 30 05:31:33 mandril kernel: invalid opcode: 0000 [#1]
> > >> > > > May 30 05:31:33 mandril kernel: SMP
> > >> > > > May 30 05:31:33 mandril kernel: last sysfs file: /devices/pci0000:00/0000:00:1f.3/i2c-0/0-002e/pwm3
> > >> > > > May 30 05:31:33 mandril kernel: Modules linked in: iptable_filter ipt_DSCP iptable_mangle ip_tables x_tables eeprom lm85 hwmon_vid hwmon i2c_isa ipv6 nls_utf8 loop dm_mirror dm_mod video button battery ac lp parport_pc parport ehci_hcd uhci_hcd floppy snd_intel8x0 snd_ac97_codec snd_ac97_bus sg snd_seq_dummy matroxfb_base snd_seq_oss snd_seq_midi_event matroxfb_DAC1064 snd_seq matroxfb_accel matroxfb_Ti3026 3w_9xxx matroxfb_g450 snd_seq_device g450_pll matroxfb_misc snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd e1000 soundcore snd_page_alloc i2c_i801 i2c_core ext3 jbd 3w_xxxx ata_piix libata sd_mod scsi_mod
> > >> > > > May 30 05:31:33 mandril kernel: CPU: 0
> > >> > > > May 30 05:31:33 mandril kernel: EIP: 0060:[<c04e3310>] Not tainted VLI
> > >> > > > May 30 05:31:33 mandril kernel: EFLAGS: 00210292 (2.6.16-1.2227_FC6 #1)
> > >> > > > May 30 05:31:33 mandril kernel: EIP is at cfq_set_request+0x202/0x3ff
> > >> > >
> > >> > > Just do a l *cfq_set_request+0x202 from gdb if you have
> > >> > > CONFIG_DEBUG_INFO enabled in your vmlinux.
> > >> >
> > >> > Doh, found it. Dave, please try and reproduce with this applied:
> > >>
> > >> Nevermind, that's not it either. Damn. Stay tuned.
> > >
> > > Try this instead, please.
> >
> > Umm.. don't we need this line?
> >
> > static void cfq_free_io_context(struct io_context *ioc)
> > {
> > struct cfq_io_context *__cic;
> > struct rb_node *n;
> > int freed = 0;
> >
> > while ((n = rb_first(&ioc->cic_root)) != NULL) {
> > __cic = rb_entry(n, struct cfq_io_context, rb_node);
> > rb_erase(&__cic->rb_node, &ioc->cic_root);
> > list_del(&__cic->queue_list);
> > ^^^^^^^^ <---- this line
> > kmem_cache_free(cfq_ioc_pool, __cic);
> > freed++;
> > }
> >
> > if (atomic_sub_and_test(freed, &ioc_count) && ioc_gone)
> > complete(ioc_gone);
>
> Yep, looks like that is missing as well. Care to send a proper patch and
> I'll shove it in, too.

Spoke too quickly, it's not needed. The cic will already have gone
through the exit function, so it must already have been pruned from the
list. It was a little head scratching though, if we really did need that
it should have shown very quickly I think.

--
Jens Axboe

2006-05-30 19:49:54

by OGAWA Hirofumi

[permalink] [raw]
Subject: Re: .17rc5 cfq slab corruption.

Jens Axboe <[email protected]> writes:

> Yep, looks like that is missing as well. Care to send a proper patch and
> I'll shove it in, too.

Sorry.
--
OGAWA Hirofumi <[email protected]>

[PATCH] Fix missing list_del(&__cic->queue_list)

We should kill cic from cfqd->cic_list before freeing it.

Signed-off-by: OGAWA Hirofumi <[email protected]>
---

block/cfq-iosched.c | 1 +
1 file changed, 1 insertion(+)

diff -puN block/cfq-iosched.c~cfq-missing-list_del block/cfq-iosched.c
--- linux-2.6/block/cfq-iosched.c~cfq-missing-list_del 2006-05-31 04:42:36.000000000 +0900
+++ linux-2.6-hirofumi/block/cfq-iosched.c 2006-05-31 04:42:55.000000000 +0900
@@ -1225,6 +1225,7 @@ static void cfq_free_io_context(struct i
while ((n = rb_first(&ioc->cic_root)) != NULL) {
__cic = rb_entry(n, struct cfq_io_context, rb_node);
rb_erase(&__cic->rb_node, &ioc->cic_root);
+ list_del(&__cic->queue_list);
kmem_cache_free(cfq_ioc_pool, __cic);
freed++;
}
_