Not sure what exactly happened here. Was running fsx on ext3 over a 2-disk raid0,
and running a yum update. Box locked up solid after a few minutes...
http://www.codemonkey.org.uk/junk/DSC00747.JPG
The unwinder getting stuck meant I lost the top of the trace though.
(Backporting the .19 fixes to .18 is on my todo list, unless someone
beats me to it and they end up in -stable.)
Will try to reproduce with a serial console hooked up.
Dave
--
http://www.codemonkey.org.uk
<Cc'd Eric as he's been looking into this>
On Mon, Oct 02, 2006 at 03:47:11PM -0400, Dave Jones wrote:
> Not sure what exactly happened here. Was running fsx on ext3 over a 2-disk raid0,
> and running a yum update. Box locked up solid after a few minutes...
> http://www.codemonkey.org.uk/junk/DSC00747.JPG
>
> The unwinder getting stuck meant I lost the top of the trace though.
> (Backporting the .19 fixes to .18 is on my todo list, unless someone
> beats me to it and they end up in -stable.)
>
> Will try to reproduce with a serial console hooked up.
So I managed to reproduce it with an 'fsx foo' and a
'fsstress -d . -r -n 100000 -p 20 -r'. This time I grabbed it from
a vanilla 2.6.18 with none of the Fedora patches..
I'll give 2.6.18-git a try next.
Dave
----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at fs/buffer.c:2791
invalid opcode: 0000 [1] SMP
CPU 1
Modules linked in: hidp l2cap bluetooth nfs lockd nfs_acl sunrpc ipv6 dm_mirror dm_mod video sbs i2c_ec button battery asus_acpi ac parport_pc lp parport snd_hda_intel snd_hda_codec snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq sr_mod snd_seq_device cdrom intel_rng snd_pcm_oss sg snd_mixer_oss snd_pcm shpchp floppy serio_raw pcspkr i2c_i801 ohci1394 ieee1394 snd_timer snd e1000 i2c_core soundcore snd_page_alloc sata_sil ahci libata sd_mod scsi_mod raid0 ext3 jbd ehci_hcd ohci_hcd uhci_hcd
Pid: 408, comm: kjournald Not tainted 2.6.18 #1
RIP: 0010:[<ffffffff8021b425>] [<ffffffff8021b425>] submit_bh+0x29/0x124
RSP: 0018:ffff81007ebcbd40 EFLAGS: 00010246
RAX: 0000000000000005 RBX: ffff810063dd0ec8 RCX: 8000000000000000
RDX: ffff81007f1f5430 RSI: ffff810063dd0ec8 RDI: 0000000000000001
RBP: ffff81007ebcbd60 R08: 0000000000800000 R09: 0000000000000003
R10: ffff810068621510 R11: 0000000000000400 R12: ffff81007f7f48c8
R13: 0000000000000001 R14: 0000000000000033 R15: 0000000000000080
FS: 0000000000000000(0000) GS:ffff810037ff1cc0(0000) knlGS:0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00002b76d5365000 CR3: 00000000658e3000 CR4: 00000000000006e0
Process kjournald (pid: 408, threadinfo ffff81007ebca000, task ffff810037f52040)
Stack: 0000000000000003 ffff810063dd0ec8 ffff81007f7f48c8 0000000000000003
ffff81007ebcbda0 ffffffff80217ca1 ffff81007aa572a0 ffff810063f8d400
ffff810064145478 ffff81007f1ea208 ffff81007aa572a0 0000000000000080
Call Trace:
[<ffffffff80217ca1>] ll_rw_block+0xa6/0xcd
[<ffffffff88035991>] :jbd:journal_commit_transaction+0x40b/0x10ce
[<ffffffff8803a033>] :jbd:kjournald+0xc7/0x222
[<ffffffff80236089>] kthread+0x100/0x136
[<ffffffff802624a0>] child_rip+0xa/0x12
DWARF2 unwinder stuck at child_rip+0xa/0x12
Leftover inexact backtrace:
[<ffffffff80268c22>] _spin_unlock_irq+0x2b/0x31
[<ffffffff80261adc>] restore_args+0x0/0x30
[<ffffffff80250ec3>] run_workqueue+0x19/0xfa
[<ffffffff80250ec3>] run_workqueue+0x19/0xfa
[<ffffffff80235f89>] kthread+0x0/0x136
[<ffffffff80262496>] child_rip+0x0/0x12
Code: 0f 0b 68 c8 86 49 80 c2 e7 0a 48 83 7b 38 00 75 0a 0f 0b 68
RIP [<ffffffff8021b425>] submit_bh+0x29/0x124
RSP <ffff81007ebcbd40>
<0>general protection fault: 0000 [2] SMP
CPU 1
Modules linked in: hidp l2cap bluetooth nfs lockd nfs_acl sunrpc ipv6 dm_mirror dm_mod video sbs i2c_ec button battery asus_acpi ac parport_pc lp parport snd_hda_intel snd_hda_codec snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq sr_mod snd_seq_device cdrom intel_rng snd_pcm_oss sg snd_mixer_oss snd_pcm shpchp floppy serio_raw pcspkr i2c_i801 ohci1394 ieee1394 snd_timer snd e1000 i2c_core soundcore snd_page_alloc sata_sil ahci libata sd_mod scsi_mod raid0 ext3 jbd ehci_hcd ohci_hcd uhci_hcd
Pid: 0, comm: swapper Not tainted 2.6.18 #1
RIP: 0010:[<ffffffff8028e1a1>] [<ffffffff8028e1a1>] task_rq_lock+0x2b/0x74
RSP: 0018:ffff810037e17df0 EFLAGS: 00010006
RAX: 6b6b6b6b6b6b6b6b RBX: ffffffff80aae480 RCX: ffff81007f1ea5a8
RDX: ffff81007ee71080 RSI: ffff810037e17e78 RDI: ffff810037f52040
RBP: ffff810037e17e10 R08: ffff810037e17eb0 R09: 0000000000000001
R10: 0000000000000001 R11: ffffffff8029995d R12: ffffffff80aae480
R13: ffff810037e17e78 R14: ffff810037f52040 R15: 0000000000000100
FS: 0000000000000000(0000) GS:ffff810037ff1cc0(0000) knlGS:0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00002b76d535d000 CR3: 000000007e290000 CR4: 00000000000006e0
Process swapper (pid: 0, threadinfo ffff810037e10000, task ffff81007ee71080)
Stack: 000000000000000f ffff810037e08000 ffff810037f52040 ffffffff880398d9
ffff810037e17eb0 ffffffff80249e98 ffff810037e08000 000000007ee71080
ffffffff80268c22 0000000200000001 0000000000000000 0000000100000000
Call Trace:
[<ffffffff80249e98>] try_to_wake_up+0x27/0x421
[<ffffffff8028e3ce>] wake_up_process+0x10/0x12
[<ffffffff880398e2>] :jbd:commit_timeout+0x9/0xb
[<ffffffff80299a67>] run_timer_softirq+0x14c/0x1d5
[<ffffffff80212724>] __do_softirq+0x68/0xf5
[<ffffffff802627f8>] call_softirq+0x1c/0x28
DWARF2 unwinder stuck at call_softirq+0x1c/0x28
Leftover inexact backtrace:
<IRQ> [<ffffffff80270c65>] do_softirq+0x39/0x9f
[<ffffffff802962a3>] irq_exit+0x57/0x59
[<ffffffff8027b063>] smp_apic_timer_interrupt+0x5d/0x62
[<ffffffff8025b784>] mwait_idle+0x0/0x54
[<ffffffff8026216f>] apic_timer_interrupt+0x6b/0x70
<EOI> [<ffffffff80266026>] __sched_text_start+0xaa6/0xadd
[<ffffffff8025b7c3>] mwait_idle+0x3f/0x54
[<ffffffff8025b78d>] mwait_idle+0x9/0x54
[<ffffffff8024c916>] cpu_idle+0xa2/0xc5
[<ffffffff8027a674>] start_secondary+0x468/0x477
Code: 8b 40 18 48 8b 04 c5 c0 59 a6 80 4c 03 60 08 4c 89 e7 e8 5e
RIP [<ffffffff8028e1a1>] task_rq_lock+0x2b/0x74
RSP <ffff810037e17df0>
<3>BUG: sleeping function called from invalid context at kernel/rwsem.c:20
in_atomic():1, irqs_disabled():1
Call Trace:
[<ffffffff8026f956>] show_trace+0xae/0x336
[<ffffffff8026fbf3>] dump_stack+0x15/0x17
[<ffffffff8020bb01>] __might_sleep+0xb2/0xb4
[<ffffffff802a5160>] down_read+0x1d/0x4a
[<ffffffff8029cf62>] blocking_notifier_call_chain+0x1b/0x41
[<ffffffff80293e10>] profile_task_exit+0x15/0x17
[<ffffffff80215a74>] do_exit+0x25/0x96a
[<ffffffff8026fe21>] kernel_math_error+0x0/0x96
[<ffff810037e17d48>]
DWARF2 unwinder stuck at 0xffff810037e17d48
Leftover inexact backtrace:
<IRQ> [<ffffffff8026993f>] do_general_protection+0x10a/0x115
[<ffffffff8808bab2>] :scsi_mod:scsi_run_queue+0x1ab/0x1ba
[<ffffffff802622d1>] error_exit+0x0/0x96
[<ffffffff8029995d>] run_timer_softirq+0x42/0x1d5
[<ffffffff8028e1a1>] task_rq_lock+0x2b/0x74
[<ffffffff880398d9>] :jbd:commit_timeout+0x0/0xb
[<ffffffff80249e98>] try_to_wake_up+0x27/0x421
[<ffffffff80268c22>] _spin_unlock_irq+0x2b/0x31
[<ffffffff80268c22>] _spin_unlock_irq+0x2b/0x31
[<ffffffff880398d9>] :jbd:commit_timeout+0x0/0xb
[<ffffffff880398d9>] :jbd:commit_timeout+0x0/0xb
[<ffffffff8028e3ce>] wake_up_process+0x10/0x12
[<ffffffff880398e2>] :jbd:commit_timeout+0x9/0xb
[<ffffffff80299a67>] run_timer_softirq+0x14c/0x1d5
[<ffffffff80212724>] __do_softirq+0x68/0xf5
[<ffffffff802627f8>] call_softirq+0x1c/0x28
[<ffffffff80270c65>] do_softirq+0x39/0x9f
[<ffffffff802962a3>] irq_exit+0x57/0x59
[<ffffffff8027b063>] smp_apic_timer_interrupt+0x5d/0x62
[<ffffffff8025b784>] mwait_idle+0x0/0x54
[<ffffffff8026216f>] apic_timer_interrupt+0x6b/0x70
<EOI> [<ffffffff80266026>] __sched_text_start+0xaa6/0xadd
[<ffffffff8025b7c3>] mwait_idle+0x3f/0x54
[<ffffffff8025b78d>] mwait_idle+0x9/0x54
[<ffffffff8024c916>] cpu_idle+0xa2/0xc5
[<ffffffff8027a674>] start_secondary+0x468/0x477
Kernel panic - not syncing: Aiee, killing interrupt handler!
--
http://www.codemonkey.org.uk
Dave Jones wrote:
> So I managed to reproduce it with an 'fsx foo' and a
> 'fsstress -d . -r -n 100000 -p 20 -r'. This time I grabbed it from
> a vanilla 2.6.18 with none of the Fedora patches..
>
> I'll give 2.6.18-git a try next.
>
> Dave
>
> ----------- [cut here ] --------- [please bite here ] ---------
> Kernel BUG at fs/buffer.c:2791
I had thought/hoped that this was fixed by Jan's patch at
http://lkml.org/lkml/2006/9/7/236 from the thread started at
http://lkml.org/lkml/2006/9/1/149, but it seems maybe not. Dave hit this bug
first by going through that new codepath....
-Eric
On Tue, 03 Oct 2006 00:43:01 -0500
Eric Sandeen <[email protected]> wrote:
> Dave Jones wrote:
>
> > So I managed to reproduce it with an 'fsx foo' and a
> > 'fsstress -d . -r -n 100000 -p 20 -r'. This time I grabbed it from
> > a vanilla 2.6.18 with none of the Fedora patches..
> >
> > I'll give 2.6.18-git a try next.
> >
> > Dave
> >
> > ----------- [cut here ] --------- [please bite here ] ---------
> > Kernel BUG at fs/buffer.c:2791
>
> I had thought/hoped that this was fixed by Jan's patch at
> http://lkml.org/lkml/2006/9/7/236 from the thread started at
> http://lkml.org/lkml/2006/9/1/149, but it seems maybe not. Dave hit this bug
> first by going through that new codepath....
Yes, Jan's patch is supposed to fix that !buffer_mapped() assertion. iirc,
Badari was hitting that BUG and was able to confirm that Jan's patch
(3998b9301d3d55be8373add22b6bc5e11c1d9b71 in post-2.6.18 mainline) fixed
it.
On Mon, Oct 02, 2006 at 11:19:45PM -0700, Andrew Morton wrote:
> On Tue, 03 Oct 2006 00:43:01 -0500
> Eric Sandeen <[email protected]> wrote:
>
> > Dave Jones wrote:
> >
> > > So I managed to reproduce it with an 'fsx foo' and a
> > > 'fsstress -d . -r -n 100000 -p 20 -r'. This time I grabbed it from
> > > a vanilla 2.6.18 with none of the Fedora patches..
> > >
> > > I'll give 2.6.18-git a try next.
> > >
> > > Dave
> > >
> > > ----------- [cut here ] --------- [please bite here ] ---------
> > > Kernel BUG at fs/buffer.c:2791
> >
> > I had thought/hoped that this was fixed by Jan's patch at
> > http://lkml.org/lkml/2006/9/7/236 from the thread started at
> > http://lkml.org/lkml/2006/9/1/149, but it seems maybe not. Dave hit this bug
> > first by going through that new codepath....
>
> Yes, Jan's patch is supposed to fix that !buffer_mapped() assertion. iirc,
> Badari was hitting that BUG and was able to confirm that Jan's patch
> (3998b9301d3d55be8373add22b6bc5e11c1d9b71 in post-2.6.18 mainline) fixed
> it.
Ok, this afternoon I was definitely running a kernel with that patch in it,
and managed to get a trace (it was the one from the top of this thread
that unfortunately got truncated).
Now, I can't reproduce it on a plain 2.6.18+that patch.
I'll leave the stress test running overnight, and see if anything
falls out in the morning.
Dave
--
http://www.codemonkey.org.uk
On Tue, Oct 03, 2006 at 02:40:30AM -0400, Dave Jones wrote:
> > > > ----------- [cut here ] --------- [please bite here ] ---------
> > > > Kernel BUG at fs/buffer.c:2791
> > >
> > > I had thought/hoped that this was fixed by Jan's patch at
> > > http://lkml.org/lkml/2006/9/7/236 from the thread started at
> > > http://lkml.org/lkml/2006/9/1/149, but it seems maybe not. Dave hit this bug
> > > first by going through that new codepath....
> >
> > Yes, Jan's patch is supposed to fix that !buffer_mapped() assertion. iirc,
> > Badari was hitting that BUG and was able to confirm that Jan's patch
> > (3998b9301d3d55be8373add22b6bc5e11c1d9b71 in post-2.6.18 mainline) fixed
> > it.
>
> Ok, this afternoon I was definitely running a kernel with that patch in it,
> and managed to get a trace (it was the one from the top of this thread
> that unfortunately got truncated).
>
> Now, I can't reproduce it on a plain 2.6.18+that patch.
> I'll leave the stress test running overnight, and see if anything
> falls out in the morning.
Been chugging away for 10 hrs now without repeating that incident. Hmm.
That patch looks like good -stable material. I'll keep digging to
see if I can somehow reproduce the problem I saw with the patch applied,
but in the absence of something better, I think we should go with it.
One thing that did happen in the 10hrs was fsx-over-NFS spewed some
nasty looking trace. I'll post that separately next.
Dave
--
http://www.codemonkey.org.uk
Andrew Morton wrote:
> On Tue, 03 Oct 2006 00:43:01 -0500
> Eric Sandeen <[email protected]> wrote:
>
>> Dave Jones wrote:
>>
>>> So I managed to reproduce it with an 'fsx foo' and a
>>> 'fsstress -d . -r -n 100000 -p 20 -r'. This time I grabbed it from
>>> a vanilla 2.6.18 with none of the Fedora patches..
>>>
>>> I'll give 2.6.18-git a try next.
>>>
>>> Dave
>>>
>>> ----------- [cut here ] --------- [please bite here ] ---------
>>> Kernel BUG at fs/buffer.c:2791
>> I had thought/hoped that this was fixed by Jan's patch at
>> http://lkml.org/lkml/2006/9/7/236 from the thread started at
>> http://lkml.org/lkml/2006/9/1/149, but it seems maybe not. Dave hit this bug
>> first by going through that new codepath....
>
> Yes, Jan's patch is supposed to fix that !buffer_mapped() assertion. iirc,
> Badari was hitting that BUG and was able to confirm that Jan's patch
> (3998b9301d3d55be8373add22b6bc5e11c1d9b71 in post-2.6.18 mainline) fixed
> it.
Looking at some BH traces*, it appears that what Dave hit is a truncate
racing with a sync...
truncate ...
ext3_invalidatepage
journal_invalidatepage
journal_unmap_buffer

going off at the same time as

sync ...
journal_dirty_data
sync_dirty_buffer
submit_bh <-- finds unmapped buffer, boom.
I'm not sure what should be coordinating this, and I'm not sure why
we've not yet seen it on a stock kernel, but only FC6... I haven't found
anything in FC6 that looks like it may affect this.
-Eric
*http://people.redhat.com/esandeen/traces/davej_ext3_oops1.txt
Eric Sandeen wrote:
>>> I had thought/hoped that this was fixed by Jan's patch at
>>> http://lkml.org/lkml/2006/9/7/236 from the thread started at
>>> http://lkml.org/lkml/2006/9/1/149, but it seems maybe not. Dave hit this bug
>>> first by going through that new codepath....
>> Yes, Jan's patch is supposed to fix that !buffer_mapped() assertion. iirc,
>> Badari was hitting that BUG and was able to confirm that Jan's patch
>> (3998b9301d3d55be8373add22b6bc5e11c1d9b71 in post-2.6.18 mainline) fixed
>> it.
>
> Looking at some BH traces*, it appears that what Dave hit is a truncate
> racing with a sync...
(oh btw this is -with- the above patch from Jan in place...)
-Eric
> truncate ...
> ext3_invalidate_page
> journal_invalidatepage
> journal_unmap_buffer
>
> going off at the same time as
>
> sync ...
> journal_dirty_data
> sync_dirty_buffer
> submit_bh <-- finds unmapped buffer, boom.
>
> I'm not sure what should be coordinating this, and I'm not sure why
> we've not yet seen it on a stock kernel, but only FC6... I haven't found
> anything in FC6 that looks like it may affect this.
>
> -Eric
>
> *http://people.redhat.com/esandeen/traces/davej_ext3_oops1.txt
On Mon, 2006-10-09 at 14:46 -0500, Eric Sandeen wrote:
> Andrew Morton wrote:
> > On Tue, 03 Oct 2006 00:43:01 -0500
> > Eric Sandeen <[email protected]> wrote:
> >
> >> Dave Jones wrote:
> >>
> >>> So I managed to reproduce it with an 'fsx foo' and a
> >>> 'fsstress -d . -r -n 100000 -p 20 -r'. This time I grabbed it from
> >>> a vanilla 2.6.18 with none of the Fedora patches..
> >>>
> >>> I'll give 2.6.18-git a try next.
> >>>
> >>> Dave
> >>>
> >>> ----------- [cut here ] --------- [please bite here ] ---------
> >>> Kernel BUG at fs/buffer.c:2791
> >> I had thought/hoped that this was fixed by Jan's patch at
> >> http://lkml.org/lkml/2006/9/7/236 from the thread started at
> >> http://lkml.org/lkml/2006/9/1/149, but it seems maybe not. Dave hit this bug
> >> first by going through that new codepath....
> >
> > Yes, Jan's patch is supposed to fix that !buffer_mapped() assertion. iirc,
> > Badari was hitting that BUG and was able to confirm that Jan's patch
> > (3998b9301d3d55be8373add22b6bc5e11c1d9b71 in post-2.6.18 mainline) fixed
> > it.
>
> Looking at some BH traces*, it appears that what Dave hit is a truncate
> racing with a sync...
>
> truncate ...
> ext3_invalidate_page
> journal_invalidatepage
> journal_unmap_buffer
>
> going off at the same time as
>
> sync ...
> journal_dirty_data
> sync_dirty_buffer
> submit_bh <-- finds unmapped buffer, boom.
>
I don't understand how this can happen...
journal_unmap_buffer() is zapping the buffer since it's not attached to any
transaction.
journal_unmap_buffer():[fs/jbd/transaction.c:1789] not on any
transaction: zap
b_state:0x10402f b_jlist:BJ_None cpu:0 b_count:3 b_blocknr:52735707
b_jbd:1 b_frozen_data:0000000000000000
b_committed_data:0000000000000000
b_transaction:0 b_next_transaction:0 b_cp_transaction:0
b_trans_is_running:0
b_trans_is_comitting:0 b_jcount:2 pg_dirty:1
journal_dirty_data() would do submit_bh() ONLY if it's part of the older
transaction.
I need to take a closer look to understand the race.
BTW, is this a 1k or 2k filesystem? How easy is it to reproduce the
problem?
Thanks,
Badari
On Mon, 2006-10-09 14:46:30 -0500, Eric Sandeen <[email protected]> wrote:
> Andrew Morton wrote:
> > On Tue, 03 Oct 2006 00:43:01 -0500
> > Eric Sandeen <[email protected]> wrote:
> > > Dave Jones wrote:
> > > > So I managed to reproduce it with an 'fsx foo' and a
> > > > 'fsstress -d . -r -n 100000 -p 20 -r'. This time I grabbed it from
> > > > a vanilla 2.6.18 with none of the Fedora patches..
> > > >
> > > > I'll give 2.6.18-git a try next.
> > > >
> > > > ----------- [cut here ] --------- [please bite here ] ---------
> > > > Kernel BUG at fs/buffer.c:2791
> > > I had thought/hoped that this was fixed by Jan's patch at
> > > http://lkml.org/lkml/2006/9/7/236 from the thread started at
> > > http://lkml.org/lkml/2006/9/1/149, but it seems maybe not. Dave hit this bug
> > > first by going through that new codepath....
> >
> > Yes, Jan's patch is supposed to fix that !buffer_mapped() assertion. iirc,
> > Badari was hitting that BUG and was able to confirm that Jan's patch
> > (3998b9301d3d55be8373add22b6bc5e11c1d9b71 in post-2.6.18 mainline) fixed
> > it.
>
> Looking at some BH traces*, it appears that what Dave hit is a truncate
> racing with a sync...
>
> truncate ...
> ext3_invalidate_page
> journal_invalidatepage
> journal_unmap_buffer
>
> going off at the same time as
>
> sync ...
> journal_dirty_data
> sync_dirty_buffer
> submit_bh <-- finds unmapped buffer, boom.
Is this possibly related to the issues that are discussed in another
thread? We're seeing problems while unlinking large files (we usually get
it within some hours with 200MB files, but couldn't yet reproduce it
with 20MB).
MfG, JBG
--
Jan-Benedict Glaw [email protected] +49-172-7608481
Signature of: Everything will be fine! ...and today it's already getting a little better.
the second :
On Mon, Oct 09, 2006 at 02:59:25PM -0700, Badari Pulavarty wrote:
> journal_dirty_data() would do submit_bh() ONLY if it's part of the older
> transaction.
>
> I need to take a closer look to understand the race.
>
> BTW, is this a 1k or 2k filesystem?
(18:41:11:davej@gelk:~)$ sudo tune2fs -l /dev/md0 | grep size
Block size: 1024
Fragment size: 1024
Inode size: 128
(18:41:16:davej@gelk:~)$
> How easy is it to reproduce the problem?
I can reproduce it within a few hours of stressing, but only
on that one box. I've not figured out exactly what's so
special about it yet (though the 1k block thing may be it).
I had been thinking it was a raid0 only thing, as none of
my other boxes have that.
I'm not entirely sure how it got set up that way either.
The Fedora installer being too smart for its own good perhaps.
Dave
--
http://www.codemonkey.org.uk
> On Mon, 2006-10-09 14:46:30 -0500, Eric Sandeen <[email protected]> wrote:
> > Andrew Morton wrote:
> > > On Tue, 03 Oct 2006 00:43:01 -0500
> > > Eric Sandeen <[email protected]> wrote:
> > > > Dave Jones wrote:
> > > > > So I managed to reproduce it with an 'fsx foo' and a
> > > > > 'fsstress -d . -r -n 100000 -p 20 -r'. This time I grabbed it from
> > > > > a vanilla 2.6.18 with none of the Fedora patches..
> > > > >
> > > > > I'll give 2.6.18-git a try next.
> > > > >
> > > > > ----------- [cut here ] --------- [please bite here ] ---------
> > > > > Kernel BUG at fs/buffer.c:2791
> > > > I had thought/hoped that this was fixed by Jan's patch at
> > > > http://lkml.org/lkml/2006/9/7/236 from the thread started at
> > > > http://lkml.org/lkml/2006/9/1/149, but it seems maybe not. Dave hit this bug
> > > > first by going through that new codepath....
> > >
> > > Yes, Jan's patch is supposed to fix that !buffer_mapped() assertion. iirc,
> > > Badari was hitting that BUG and was able to confirm that Jan's patch
> > > (3998b9301d3d55be8373add22b6bc5e11c1d9b71 in post-2.6.18 mainline) fixed
> > > it.
> >
> > Looking at some BH traces*, it appears that what Dave hit is a truncate
> > racing with a sync...
> >
> > truncate ...
> > ext3_invalidate_page
> > journal_invalidatepage
> > journal_unmap_buffer
> >
> > going off at the same time as
> >
> > sync ...
> > journal_dirty_data
> > sync_dirty_buffer
> > submit_bh <-- finds unmapped buffer, boom.
>
> Is this possibly related to the issues that are discussed in another
> thread? We're seeing problems while unlinking large files (usually get
> it within some hours with 200MB files, but couldn't yet reproduce it
> with 20MB.)
I don't think this is related (BTW: I've run your test for 5 hours
without any luck ;( Maybe I'll try again for a longer time...).
Honza
--
Jan Kara <[email protected]>
SuSE CR Labs
> On Mon, Oct 09, 2006 at 02:59:25PM -0700, Badari Pulavarty wrote:
>
> > journal_dirty_data() would do submit_bh() ONLY if it's part of the older
> > transaction.
> >
> > I need to take a closer look to understand the race.
> >
> > BTW, is this a 1k or 2k filesystem?
>
> (18:41:11:davej@gelk:~)$ sudo tune2fs -l /dev/md0 | grep size
> Block size: 1024
> Fragment size: 1024
> Inode size: 128
> (18:41:16:davej@gelk:~)$
>
> > How easy is it to reproduce the problem?
>
> I can reproduce it within a few hours of stressing, but only
> on that one box. I've not figured out exactly what's so
> special about it yet (though the 1k block thing may be it).
> I had been thinking it was a raid0 only thing, as none of
> my other boxes have that.
>
> I'm not entirely sure how it got set up that way either.
> The Fedora installer being too smart for its own good perhaps.
I think it's really the 1KB block size that makes it happen.
I've looked at journal_dirty_data() code and I think the following can
happen:
sync() eventually ends up in journal_dirty_data(bh) as Eric writes.
There it finds a dirty buffer attached to the committing transaction, so it drops
all locks and calls sync_dirty_buffer(bh).
Now in another process, the file is truncated so that 'bh' ends up just after EOF.
As we have 1kb buffers, it can happen that bh is in the partially
truncated page. The buffer is marked unmapped and clean. But a moment later the page
is marked dirty and msync() is called. That eventually calls
set_page_dirty() and all buffers in the page are marked dirty.
The first process now wakes up, locks the buffer, clears the dirty bit
and does submit_bh() - Oops.
This is essentially the same problem Badari found but in a different
place. There are two places that are arguably wrong...
1) We mark a buffer dirty after EOF. But actually that may be needed -
or what is the expected behaviour when we write into an mmapped file after
EOF, then extend the file and do msync()?
2) We submit a buffer after EOF for IO. This should clearly be avoided,
but getting the needed info from the bh is really ugly...
What we could do is: instead of calling sync_dirty_buffer() we do
something like:
lock_buffer(bh);
jbd_lock_bh_state(bh);
if (buffer_jbd(bh) && jh == bh2jh(bh) &&
    jh->b_transaction == journal->j_committing_transaction) {
        jbd_unlock_bh_state(bh);
        if (test_clear_buffer_dirty(bh)) {
                get_bh(bh);
                bh->b_end_io = end_buffer_write_sync;
                submit_bh(WRITE, bh);
                wait_on_buffer(bh);
        } else
                unlock_buffer(bh);
} else {
        /* buffer no longer belongs to the committing transaction: skip it */
        jbd_unlock_bh_state(bh);
        unlock_buffer(bh);
}
That should deal with the problem... A much more adventurous change would
be to remove the syncing code altogether - the new commit code makes
sure to write out each buffer just once, so the livelock should not
happen now. But then we'd have to put the running transaction in
j_next_transaction and refile data buffers instead of unfiling them.
That should actually give quite some performance improvement when
intensively rewriting files. But I guess that is a JBD2 matter ;).
Honza
--
Jan Kara <[email protected]>
SuSE CR Labs
On Tue, 2006-10-10 15:16:03 +0200, Jan Kara <[email protected]> wrote:
> > Is this possibly related to the issues that are discussed in another
> > thread? We're seeing problems while unlinking large files (usually get
> > it within some hours with 200MB files, but couldn't yet reproduce it
> > with 20MB.)
> I don't think this is related (BTW: I've run your test for 5 hours
> without any luck ;( Maybe I'll try again for some longer time...).
Looking at it, I also think it's a different thing. Just to add, I
couldn't trigger the bug with a 10 MB file in a day, but it failed again
using a 100MB file.
So file size seems to matter somehow.
MfG, JBG
--
Jan-Benedict Glaw [email protected] +49-172-7608481
Signature of: When I'm awake, I dream.
the second :
On Tue, 10 Oct 2006 16:11:45 +0200
Jan Kara <[email protected]> wrote:
> > On Mon, Oct 09, 2006 at 02:59:25PM -0700, Badari Pulavarty wrote:
> >
> > > journal_dirty_data() would do submit_bh() ONLY if it's part of the older
> > > transaction.
> > >
> > > I need to take a closer look to understand the race.
> > >
> > > BTW, is this a 1k or 2k filesystem?
> >
> > (18:41:11:davej@gelk:~)$ sudo tune2fs -l /dev/md0 | grep size
> > Block size: 1024
> > Fragment size: 1024
> > Inode size: 128
> > (18:41:16:davej@gelk:~)$
> >
> > > How easy is it to reproduce the problem?
> >
> > I can reproduce it within a few hours of stressing, but only
> > on that one box. I've not figured out exactly what's so
> > special about it yet (though the 1k block thing may be it).
> > I had been thinking it was a raid0 only thing, as none of
> > my other boxes have that.
> >
> > I'm not entirely sure how it got set up that way either.
> > The Fedora installer being too smart for its own good perhaps.
> I think it's really the 1KB block size that makes it happen.
> I've looked at journal_dirty_data() code and I think the following can
> happen:
> sync() eventually ends up in journal_dirty_data(bh) as Eric writes.
> There it finds a dirty buffer attached to the committing transaction, so it drops
> all locks and calls sync_dirty_buffer(bh).
> Now in another process, the file is truncated so that 'bh' ends up just after EOF.
> As we have 1kb buffers, it can happen that bh is in the partially
> truncated page. The buffer is marked unmapped and clean. But a moment later the page
> is marked dirty and msync() is called. That eventually calls
> set_page_dirty() and all buffers in the page are marked dirty.
> The first process now wakes up, locks the buffer, clears the dirty bit
> and does submit_bh() - Oops.
>
> This is essentially the same problem Badari found but in a different
> place. There are two places that are arguably wrong...
> 1) We mark buffer dirty after EOF. But actually that may be needed -
> or what is the expected behaviour when we write into mmapped file after
> EOF, then extend the file and do msync()?
yup.
> 2) We submit a buffer after EOF for IO. This should be clearly avoided
> but getting the needed info from bh is really ugly...
Things like __block_write_full_page() avoid this by checking the block's
offset against i_size. (Not racy against truncate-down because the page is
locked, not racy against truncate-up because the bh is zero and
up-to-date).
But for jbd writeout we don't hold the page lock, so checking against
bh->b_page->mapping->host->i_size is a bit racy.
Hm. But we do lock the buffers in journal_invalidatepage(), so checking
i_size after locking the buffer in the writeout path might get us there.
Jan Kara wrote:
> I think it's really the 1KB block size that makes it happen.
> I've looked at journal_dirty_data() code and I think the following can
> happen:
> sync() eventually ends up in journal_dirty_data(bh) as Eric writes.
> There it finds a dirty buffer attached to the committing transaction, so it drops
> all locks and calls sync_dirty_buffer(bh).
> Now in another process, the file is truncated so that 'bh' ends up just after EOF.
> As we have 1kb buffers, it can happen that bh is in the partially
> truncated page. The buffer is marked unmapped and clean. But a moment later the page
> is marked dirty and msync() is called. That eventually calls
> set_page_dirty() and all buffers in the page are marked dirty.
> The first process now wakes up, locks the buffer, clears the dirty bit
> and does submit_bh() - Oops.
Hm, just FWIW I have a couple traces* of the buffer getting unmapped
-before- journal_submit_data_buffers ever even finds it...
journal_submit_data_buffers():[fs/jbd/commit.c:242] needs writeout,
adding to array pid 1836
b_state:0x114025 b_jlist:BJ_SyncData cpu:0 b_count:2 b_blocknr:27130
b_jbd:1 b_frozen_data:0000000000000000
b_committed_data:0000000000000000
b_transaction:1 b_next_transaction:0 b_cp_transaction:0
b_trans_is_running:0
b_trans_is_comitting:1 b_jcount:0 pg_dirty:0
so it's already unmapped at this point. Could
journal_submit_data_buffers benefit from some buffer_mapped checks? Or
is that just a bandaid too late...
-Eric
*http://people.redhat.com/esandeen/traces/eric_ext3_oops1.txt
http://people.redhat.com/esandeen/traces/eric_ext3_oops2.txt
On Tue, 2006-10-10 at 17:03 -0500, Eric Sandeen wrote:
> Jan Kara wrote:
>
> > I think it's really the 1KB block size that makes it happen.
> > I've looked at journal_dirty_data() code and I think the following can
> > happen:
> > sync() eventually ends up in journal_dirty_data(bh) as Eric writes.
> > There it finds a dirty buffer attached to the committing transaction, so it drops
> > all locks and calls sync_dirty_buffer(bh).
> > Now in another process, the file is truncated so that 'bh' ends up just after EOF.
> > As we have 1kb buffers, it can happen that bh is in the partially
> > truncated page. The buffer is marked unmapped and clean. But a moment later the page
> > is marked dirty and msync() is called. That eventually calls
> > set_page_dirty() and all buffers in the page are marked dirty.
> > The first process now wakes up, locks the buffer, clears the dirty bit
> > and does submit_bh() - Oops.
>
> Hm, just FWIW I have a couple traces* of the buffer getting unmapped
> -before- journal_submit_data_buffers ever even finds it...
>
> journal_submit_data_buffers():[fs/jbd/commit.c:242] needs writeout,
> adding to array pid 1836
> b_state:0x114025 b_jlist:BJ_SyncData cpu:0 b_count:2 b_blocknr:27130
> b_jbd:1 b_frozen_data:0000000000000000
> b_committed_data:0000000000000000
> b_transaction:1 b_next_transaction:0 b_cp_transaction:0
> b_trans_is_running:0
> b_trans_is_comitting:1 b_jcount:0 pg_dirty:0
>
> so it's already unmapped at this point. Could
> journal_submit_data_buffers benefit from some buffer_mapped checks? Or
> is that just a bandaid too late...
Hmm..

b_state: 0x114025 has the 0x20 bit set, which means BH_Mapped. Isn't it?
Thanks,
Badari
Badari Pulavarty wrote:
> On Tue, 2006-10-10 at 17:03 -0500, Eric Sandeen wrote:
>> Jan Kara wrote:
>>
>>> I think it's really the 1KB block size that makes it happen.
>>> I've looked at journal_dirty_data() code and I think the following can
>>> happen:
>>> sync() eventually ends up in journal_dirty_data(bh) as Eric writes.
>>> There it finds a dirty buffer attached to the committing transaction, so it drops
>>> all locks and calls sync_dirty_buffer(bh).
>>> Now in another process, the file is truncated so that 'bh' ends up just after EOF.
>>> As we have 1kb buffers, it can happen that bh is in the partially
>>> truncated page. The buffer is marked unmapped and clean. But a moment later the page
>>> is marked dirty and msync() is called. That eventually calls
>>> set_page_dirty() and all buffers in the page are marked dirty.
>>> The first process now wakes up, locks the buffer, clears the dirty bit
>>> and does submit_bh() - Oops.
>> Hm, just FWIW I have a couple traces* of the buffer getting unmapped
>> -before- journal_submit_data_buffers ever even finds it...
>>
>> journal_submit_data_buffers():[fs/jbd/commit.c:242] needs writeout,
>> adding to array pid 1836
>> b_state:0x114025 b_jlist:BJ_SyncData cpu:0 b_count:2 b_blocknr:27130
>> b_jbd:1 b_frozen_data:0000000000000000
>> b_committed_data:0000000000000000
>> b_transaction:1 b_next_transaction:0 b_cp_transaction:0
>> b_trans_is_running:0
>> b_trans_is_comitting:1 b_jcount:0 pg_dirty:0
>>
>> so it's already unmapped at this point. Could
>> journal_submit_data_buffers benefit from some buffer_mapped checks? Or
>> is that just a bandaid too late...
>
> Hmm..
>
> b_state: 0x114025
> ^
> means BH_Mapped. Isn't it ?
Whoops, I pasted in the wrong one, I guess, from earlier in the trace. Here are
the ones I was looking at:
journal_submit_data_buffers():[fs/jbd/commit.c:242] needs writeout, adding to
array pid 1690
b_state:0x104005 b_jlist:BJ_SyncData cpu:0 b_count:2 b_blocknr:30045
b_jbd:1 b_frozen_data:0000000000000000 b_committed_data:0000000000000000
b_transaction:1 b_next_transaction:0 b_cp_transaction:0 b_trans_is_running:0
b_trans_is_comitting:1 b_jcount:0 pg_dirty:1
and
journal_submit_data_buffers():[fs/jbd/commit.c:242] needs writeout, adding to
array pid 1836
b_state:0x114005 b_jlist:BJ_SyncData cpu:1 b_count:2 b_blocknr:27130
b_jbd:1 b_frozen_data:0000000000000000 b_committed_data:0000000000000000
b_transaction:1 b_next_transaction:0 b_cp_transaction:0 b_trans_is_running:0
b_trans_is_comitting:1 b_jcount:0 pg_dirty:1
-Eric
> Badari Pulavarty wrote:
> >On Tue, 2006-10-10 at 17:03 -0500, Eric Sandeen wrote:
> >>Jan Kara wrote:
> >>
> >>> I think it's really the 1KB block size that makes it happen.
> >>>I've looked at journal_dirty_data() code and I think the following can
> >>>happen:
> >>> sync() eventually ends up in journal_dirty_data(bh) as Eric writes.
> >>>There it finds a dirty buffer attached to the committing transaction. So it
> >>>drops
> >>>all locks and calls sync_dirty_buffer(bh).
> >>> Now in other process, file is truncated so that 'bh' gets just after
> >>> EOF.
> >>>As we have 1kb buffers, it can happen that bh is in the partially
> >>>truncated page. Buffer is marked unmapped and clean. But in a moment the
> >>>page
> >>>is marked dirty and msync() is called. That eventually calls
> >>>set_page_dirty() and all buffers in the page are marked dirty.
> >>> The first process now wakes up, locks the buffer, clears the dirty bit
> >>>and does submit_bh() - Oops.
> >>Hm, just FWIW I have a couple traces* of the buffer getting unmapped
> >>-before- journal_submit_data_buffers ever even finds it...
> >>
> >> journal_submit_data_buffers():[fs/jbd/commit.c:242] needs writeout,
> >>adding to array pid 1836
> >> b_state:0x114025 b_jlist:BJ_SyncData cpu:0 b_count:2 b_blocknr:27130
> >> b_jbd:1 b_frozen_data:0000000000000000
> >>b_committed_data:0000000000000000
> >> b_transaction:1 b_next_transaction:0 b_cp_transaction:0
> >>b_trans_is_running:0
> >> b_trans_is_comitting:1 b_jcount:0 pg_dirty:0
> >>
> >>so it's already unmapped at this point. Could
> >>journal_submit_data_buffers benefit from some buffer_mapped checks? Or
> >>is that just a bandaid too late...
> >
> >Hmm..
> >
> >b_state: 0x114025
> > ^
> >means BH_Mapped. Isn't it ?
>
> Whoops, I pasted in the wrong one, I guess, from earlier in the trace.
> Here are the ones I was looking at:
>
> journal_submit_data_buffers():[fs/jbd/commit.c:242] needs writeout, adding
> to array pid 1690
> b_state:0x104005 b_jlist:BJ_SyncData cpu:0 b_count:2 b_blocknr:30045
> b_jbd:1 b_frozen_data:0000000000000000
> b_committed_data:0000000000000000
> b_transaction:1 b_next_transaction:0 b_cp_transaction:0
> b_trans_is_running:0
> b_trans_is_comitting:1 b_jcount:0 pg_dirty:1
>
> and
>
> journal_submit_data_buffers():[fs/jbd/commit.c:242] needs writeout, adding
> to array pid 1836
> b_state:0x114005 b_jlist:BJ_SyncData cpu:1 b_count:2 b_blocknr:27130
> b_jbd:1 b_frozen_data:0000000000000000
> b_committed_data:0000000000000000
> b_transaction:1 b_next_transaction:0 b_cp_transaction:0
> b_trans_is_running:0
> b_trans_is_comitting:1 b_jcount:0 pg_dirty:1
Umm, but these two traces confuse me:
1) They are different traces than those you wrote about initially,
aren't they? Because here we would not call sync_dirty_buffer() from
journal_dirty_data().
BTW: Does this buffer trace lead to that Oops in submit_bh()? I guess not
as the buffer is not dirty...
Am I right that now there are no Oopses because of buffers submitted
from the commit code? Only those from journal_dirty_data()?
2) Those buffers have strange states - they are !mapped, !dirty (so this
is fine) but they are also JBD_Dirty and in BJ_SyncData. This is really
strange! Either it is ordered data buffer and should not be JBD_Dirty or
it is not ordered data buffer and then it should not be in BJ_SyncData!
The second buffer even has JBD_JWrite set so it really was a metadata buffer
under commit and later something took it from the committing transaction
(without clearing the JWrite bit) and filed it in the SyncData list...
Umm, this reminds me of something... <looks into commit code> Oh no, who could
write that BJ_Forget list handling.. me? ;)
I think the problem is in my change to BJ_Forget list handling - if we
find out buffer has b_next_transaction set, we just refile it
(previously we BUGged). That is just fine, except when we are in the
ordered mode and the buffer is reallocated as data. Then we refile the
buffer to BJ_Metadata or BJ_Reserved list, instead of BJ_SyncData. What
then happens is uncertain but probably something gets (rightfully so)
confused and a JBD_Dirty buffer gets onto the BJ_SyncData list. This bug does
not seem to cause those problems with unmapped buffers but still we
should fix it as it is asking for trouble.
Honza
--
Jan Kara <[email protected]>
SuSE CR Labs
Jan Kara wrote:
> Umm, but these two traces confuse me:
> 1) They are different traces than those you wrote about initially,
> aren't they? Because here we would not call sync_dirty_buffer() from
> journal_dirty_data().
> BTW: Does this buffer trace lead to that Oops in submit_bh()? I guess not
> as the buffer is not dirty...
They do wind up at the same oops, from the same "testcase" (i.e. beat the tar
out of the filesystem with multiple fsx's and fsstress...)
The buffer is not dirty at that tracepoint because it has just done
if (locked && test_clear_buffer_dirty(bh)) {
prior to the tracepoint...
> Am I right that now there are no Oopses because of buffers submitted
> from the commit code? Only those from journal_dirty_data()?
Well, it's hard to sort out which thread is doing what; I added a pid to each
tracepoint to try to make that a little easier. Both of the traces I posted
seem to be journal_submit_data_buffers leading to an unmapped buffer submitted
in submit_bh.
> 2) Those buffers have strange states - they are !mapped, !dirty (so this
> is fine)
Well, they were just undirtied in journal_submit_data_buffers.
See the whole traces at
http://people.redhat.com/esandeen/traces/eric_ext3_oops1.txt
http://people.redhat.com/esandeen/traces/eric_ext3_oops2.txt
I will try to reproduce with fewer threads, to see if we can find a simpler
sequence of events...
As an aside, when we do journal_unmap_buffer... should it stay on t_sync_datalist?
Thanks,
-Eric
> but they are also JBD_Dirty and in BJ_SyncData. This is really
> strange! Either it is ordered data buffer and should not be JBD_Dirty or
> it is not ordered data buffer and then it should not be in BJ_SyncData!
> The second buffer even has JBD_JWrite set so it really was a metadata buffer
> under commit and later something took it from the committing transaction
> (without clearing the JWrite bit) and filed it in the SyncData list...
> Umm, this reminds me something... <looks into commit code> Oh no, who could
> write that BJ_Forget list handling.. me? ;)
>
> I think the problem is in my change to BJ_Forget list handling - if we
> find out buffer has b_next_transaction set, we just refile it
> (previously we BUGged). That is just fine, except when we are in the
> ordered mode and the buffer is reallocated as data. Then we refile the
> buffer to BJ_Metadata or BJ_Reserved list, instead of BJ_SyncData. What
> then happens is uncertain but probably something gets (rightfully so)
> confused and a JBD_Dirty buffer gets onto the BJ_SyncData list. This bug does
> not seem to cause those problems with unmapped buffers but still we
> should fix it as it is asking for trouble.
>
> Honza
> Jan Kara wrote:
>
> > Umm, but these two traces confuse me:
> >1) They are different traces than those you wrote about initially,
> >aren't they? Because here we would not call sync_dirty_buffer() from
> >journal_dirty_data().
> > BTW: Does this buffer trace lead to that Oops in submit_bh()? I guess not
> >as the buffer is not dirty...
>
> They do wind up at the same oops, from the same "testcase" (i.e. beat the
> tar out of the filesystem with multiple fsx's and fsstress...)
>
> The buffer is not dirty at that tracepoint because it has just done
> if (locked && test_clear_buffer_dirty(bh)) {
> prior to the tracepoint...
Oh, I see. OK.
>
> See the whole traces at
>
> http://people.redhat.com/esandeen/traces/eric_ext3_oops1.txt
> http://people.redhat.com/esandeen/traces/eric_ext3_oops2.txt
Hmm, those traces look really useful. I just have to digest them ;).
> As an aside, when we do journal_unmap_buffer... should it stay on
> t_sync_datalist?
Yes, it should and it seems it really was removed from it at some
point. Only later journal_dirty_data() came and filed it back to the
BJ_SyncData list. And the buffer remained unmapped till the commit time
and then *bang*... It may even be a race in ext3 itself that it called
journal_dirty_data() on an unmapped buffer but I have to read some more
code.
Bye
Honza
--
Jan Kara <[email protected]>
SuSE CR Labs
On Wed, 2006-10-11 at 16:22 +0200, Jan Kara wrote:
> > Jan Kara wrote:
> >
> > > Umm, but these two traces confuse me:
> > >1) They are different traces than those you wrote about initially,
> > >aren't they? Because here we would not call sync_dirty_buffer() from
> > >journal_dirty_data().
> > > BTW: Does this buffer trace lead to that Oops in submit_bh()? I guess not
> > >as the buffer is not dirty...
> >
> > They do wind up at the same oops, from the same "testcase" (i.e. beat the
> > tar out of the filesystem with multiple fsx's and fsstress...)
> >
> > The buffer is not dirty at that tracepoint because it has just done
> > if (locked && test_clear_buffer_dirty(bh)) {
> > prior to the tracepoint...
> Oh, I see. OK.
>
> >
> > See the whole traces at
> >
> > http://people.redhat.com/esandeen/traces/eric_ext3_oops1.txt
> > http://people.redhat.com/esandeen/traces/eric_ext3_oops2.txt
> Hmm, those traces look really useful. I just have to digest them ;).
>
> > As an aside, when we do journal_unmap_buffer... should it stay on
> > t_sync_datalist?
> Yes, it should and it seems it really was removed from it at some
> point. Only later journal_dirty_data() came and filed it back to the
> BJ_SyncData list. And the buffer remained unmapped till the commit time
> and then *bang*... It may even be a race in ext3 itself that it called
> journal_dirty_data() on an unmapped buffer but I have to read some more
> code.
>
Yes, calling journal_dirty_data() on an unmapped buffer can definitely
happen. (The only thing I am not sure about is why it doesn't happen with a
simple testcase like dirtying only a part of a page on a 1k filesystem;
I am not sure why we need journal_unmap_buffer() in the sequence.)
Here is what I think is happening..
journal_unmap_buffer() - cleaned the buffer, since it's outside EOF, but
it's a part of the same page. So it remained on the page->buffers
list. (At this time it's not part of any transaction.)
Then, ordered_commit_write() called journal_dirty_data() and we added
all these buffers to BJ_SyncData list. (at this time buffer is clean -
not dirty).
Now msync() called __set_page_dirty_buffers() and dirtied *all* the
buffers attached to this page.
journal_submit_data_buffers() got around to this buffer and tried to
submit the buffer...
Andrew is right - the only option for us is to check the file size in the
writeout path and skip the buffers beyond EOF.
Thanks,
Badari
Badari Pulavarty wrote:
> Here is what I think is happening..
>
> journal_unmap_buffer() - cleaned the buffer, since it's outside EOF, but
> it's a part of the same page. So it remained on the page->buffers
> list. (At this time it's not part of any transaction.)
>
> Then, ordered_commit_write() called journal_dirty_data() and we added
> all these buffers to BJ_SyncData list. (at this time buffer is clean -
> not dirty).
>
> Now msync() called __set_page_dirty_buffers() and dirtied *all* the
> buffers attached to this page.
>
> journal_submit_data_buffers() got around to this buffer and tried to
> submit the buffer...
This seems about right, but one thing bothers me in the traces; it seems like
there is some locking that is missing. In
http://people.redhat.com/esandeen/traces/eric_ext3_oops1.txt
for example, it looks like journal_dirty_data gets started, but then the
buffer_head is acted on by journal_unmap_buffer, which decides this buffer is
part of the running transaction, past EOF, and clears mapped, dirty, etc. Then
journal_dirty_data picks up again, decides that the buffer is not on the right
list (now BJ_None) and puts it back on BJ_SyncData. Then it gets picked up by
journal_submit_data_buffers and submitted, and oops.
Talking with Stephen, it seemed like the page lock should synchronize these
threads, but I've found that we can get to journal_dirty_data acting on the
buffer heads w/o having the page locked...
I'm still digging, and, er, grasping at straws here... Am I off base?
-Eric
> Andrew is right - the only option for us is to check the file size in the
> writeout path and skip the buffers beyond EOF.
>
> Thanks,
> Badari
>
Eric Sandeen wrote:
> Badari Pulavarty wrote:
>
>> Here is what I think is happening..
>>
>> journal_unmap_buffer() - cleaned the buffer, since it's outside EOF, but
>> it's a part of the same page. So it remained on the page->buffers
>> list. (At this time it's not part of any transaction.)
>>
>> Then, ordered_commit_write() called journal_dirty_data() and we added
>> all these buffers to BJ_SyncData list. (at this time buffer is clean -
>> not dirty).
>>
>> Now msync() called __set_page_dirty_buffers() and dirtied *all* the
>> buffers attached to this page.
>>
>> journal_submit_data_buffers() got around to this buffer and tried to
>> submit the buffer...
>
> This seems about right, but one thing bothers me in the traces; it
> seems like there is some locking that is missing. In
> http://people.redhat.com/esandeen/traces/eric_ext3_oops1.txt
> for example, it looks like journal_dirty_data gets started, but then
> the buffer_head is acted on by journal_unmap_buffer, which decides
> this buffer is part of the running transaction, past EOF, and clears
> mapped, dirty, etc. Then journal_dirty_data picks up again, decides
> that the buffer is not on the right list (now BJ_None) and puts it
> back on BJ_SyncData. Then it gets picked up by
> journal_submit_data_buffers and submitted, and oops.
>
> Talking with Stephen, it seemed like the page lock should synchronize
> these threads, but I've found that we can get to journal_dirty_data
> acting on the buffer heads w/o having the page locked...
>
> I'm still digging, and, er, grasping at straws here... Am I off base?
>
> -Eric
>
>
>> Andrew is right - the only option for us is to check the file size in the
>> writeout path and skip the buffers beyond EOF.
>>
>> Thanks,
>> Badari
>>
Here's another data point for your consideration. I've been seeing this
error since I started running 2.6.18, I assumed it was hardware, so I've
tried 3 different disks, a PATA and 2 SATA drives, with VIA and Promise
controllers, the error has occurred on all of them. I see the error
infrequently, always when downloading lots of small files from Usenet
and building, copying and deleting large (200 - 300 MB). I haven't ever
had an oops/panic, just this error. When I run fsck, I always see a
single message that "deleted inode nnn has zero dtime". I hope this will
be useful.
Oct 11 20:37:32 Godzilla kernel: EXT3-fs error (device hda5):
ext3_free_blocks_sb: bit already cleared for block 4740550
Oct 11 20:37:32 Godzilla kernel: Aborting journal on device hda5.
Oct 11 20:37:32 Godzilla kernel: EXT3-fs error (device hda5) in
ext3_free_blocks_sb: Journal has aborted
Oct 11 20:37:32 Godzilla kernel: EXT3-fs error (device hda5) in
ext3_free_blocks_sb: Journal has aborted
Oct 11 20:37:32 Godzilla kernel: EXT3-fs error (device hda5) in
ext3_reserve_inode_write: Journal has aborted
Oct 11 20:37:32 Godzilla kernel: EXT3-fs error (device hda5) in
ext3_truncate: Journal has aborted
Oct 11 20:37:32 Godzilla kernel: EXT3-fs error (device hda5) in
ext3_reserve_inode_write: Journal has aborted
Oct 11 20:37:32 Godzilla kernel: EXT3-fs error (device hda5) in
ext3_orphan_del: Journal has aborted
Oct 11 20:37:32 Godzilla kernel: EXT3-fs error (device hda5) in
ext3_reserve_inode_write: Journal has aborted
Oct 11 20:37:32 Godzilla kernel: EXT3-fs error (device hda5) in
ext3_delete_inode: Journal has aborted
Oct 11 20:37:32 Godzilla kernel: __journal_remove_journal_head: freeing
b_committed_data
Oct 11 20:37:32 Godzilla kernel: __journal_remove_journal_head: freeing
b_committed_data
Oct 11 20:37:32 Godzilla kernel: ext3_abort called.
Oct 11 20:37:32 Godzilla kernel: EXT3-fs error (device hda5):
ext3_journal_start_sb: Detected aborted journal
Oct 11 20:37:32 Godzilla kernel: Remounting filesystem read-only
On Wed, 2006-10-11 21:34:13 -0700, John Wendel <[email protected]> wrote:
> Oct 11 20:37:32 Godzilla kernel: EXT3-fs error (device hda5): ext3_free_blocks_sb: bit already cleared for block 4740550
> Oct 11 20:37:32 Godzilla kernel: Aborting journal on device hda5.
> Oct 11 20:37:32 Godzilla kernel: EXT3-fs error (device hda5) in ext3_free_blocks_sb: Journal has aborted
> Oct 11 20:37:32 Godzilla kernel: EXT3-fs error (device hda5) in ext3_free_blocks_sb: Journal has aborted
> Oct 11 20:37:32 Godzilla kernel: EXT3-fs error (device hda5) in ext3_reserve_inode_write: Journal has aborted
> Oct 11 20:37:32 Godzilla kernel: EXT3-fs error (device hda5) in ext3_truncate: Journal has aborted
> Oct 11 20:37:32 Godzilla kernel: EXT3-fs error (device hda5) in ext3_reserve_inode_write: Journal has aborted
> Oct 11 20:37:32 Godzilla kernel: EXT3-fs error (device hda5) in ext3_orphan_del: Journal has aborted
> Oct 11 20:37:32 Godzilla kernel: EXT3-fs error (device hda5) in ext3_reserve_inode_write: Journal has aborted
> Oct 11 20:37:32 Godzilla kernel: EXT3-fs error (device hda5) in ext3_delete_inode: Journal has aborted
> Oct 11 20:37:32 Godzilla kernel: __journal_remove_journal_head: freeing b_committed_data
> Oct 11 20:37:32 Godzilla kernel: __journal_remove_journal_head: freeing b_committed_data
> Oct 11 20:37:32 Godzilla kernel: ext3_abort called.
> Oct 11 20:37:32 Godzilla kernel: EXT3-fs error (device hda5): ext3_journal_start_sb: Detected aborted journal
> Oct 11 20:37:32 Godzilla kernel: Remounting filesystem read-only
This looks very much like the issue I see.
MfG, JBG
--
Jan-Benedict Glaw [email protected] +49-172-7608481
Signature of: http://catb.org/~esr/faqs/smart-questions.html
> Badari Pulavarty wrote:
>
> >Here is what I think is happening..
> >
> >journal_unmap_buffer() - cleaned the buffer, since it's outside EOF, but
> >it's a part of the same page. So it remained on the page->buffers
> >list. (At this time it's not part of any transaction.)
> >
> >Then, ordered_commit_write() called journal_dirty_data() and we added
> >all these buffers to BJ_SyncData list. (at this time buffer is clean -
> >not dirty).
> >
> >Now msync() called __set_page_dirty_buffers() and dirtied *all* the
> >buffers attached to this page.
> >
> >journal_submit_data_buffers() got around to this buffer and tried to
> >submit the buffer...
Yes, this is certainly one we need to fix.
> This seems about right, but one thing bothers me in the traces; it seems
> like there is some locking that is missing. In
> http://people.redhat.com/esandeen/traces/eric_ext3_oops1.txt
> for example, it looks like journal_dirty_data gets started, but then the
> buffer_head is acted on by journal_unmap_buffer, which decides this buffer
> is part of the running transaction, past EOF, and clears mapped, dirty,
It's part of the committing transaction.
> etc. Then journal_dirty_data picks up again, decides that the buffer is
> not on the right list (now BJ_None) and puts it back on BJ_SyncData. Then
> it gets picked up by journal_submit_data_buffers and submitted, and oops.
Now it is put on the running transaction's BJ_SyncData list. But
otherwise you're right.
> Talking with Stephen, it seemed like the page lock should synchronize these
> threads, but I've found that we can get to journal_dirty_data acting on the
> buffer heads w/o having the page locked...
Yes, PageLock should protect us. Where can we call
journal_dirty_data() without PageLock? I see the following callers:
ext3_ordered_commit_write - should have PageLock
ext3_ordered_writepage - has PageLock
ext3_block_truncate_page - has PageLock
And those are all the callers from ext3. Am I missing something?
Honza
--
Jan Kara <[email protected]>
SuSE CR Labs
Jan Kara wrote:
>> Talking with Stephen, it seemed like the page lock should synchronize these
>> threads, but I've found that we can get to journal_dirty_data acting on the
>> buffer heads w/o having the page locked...
> Yes, PageLock should protect us. Where can we call
> journal_dirty_data() without PageLock? I see the following callers:
> ext3_ordered_commit_write - should have PageLock
> ext3_ordered_writepage - has PageLock
> ext3_block_truncate_page - has PageLock
>
> And those are all the callers from ext3. Am I missing something?
>
> Honza
I put an assert about the page being locked in journal_dirty_data, and hit it
right away. I'll look a bit more but I think this is how I got there:
ext3_ordered_writepage <-- assert PageLocked
...
block_write_full_page
__block_write_full_page
unlock_page()
...
walk_page_buffers
journal_dirty_data_fn
ext3_journal_dirty_data
journal_dirty_data <-- find page unlocked
there's a comment in ext3_ordered_writepage:
/*
* The page can become unlocked at any point now, and
* truncate can then come in and change things. So we
* can't touch *page from now on. But *page_bufs is
* safe due to elevated refcount.
*/
-Eric
On Thu, 12 Oct 2006 14:28:20 +0200
Jan Kara <[email protected]> wrote:
> Where can we call
> journal_dirty_data() without PageLock?
block_write_full_page() will unlock the page, so ext3_writepage()
will run journal_dirty_data_fn() against an unlocked page.
I haven't looked into the exact details of the race, but it should
be addressable via jbd_lock_bh_state() or j_list_lock coverage.
Andrew Morton wrote:
> On Thu, 12 Oct 2006 14:28:20 +0200
> Jan Kara <[email protected]> wrote:
>
>> Where can we call
>> journal_dirty_data() without PageLock?
>
> block_write_full_page() will unlock the page, so ext3_writepage()
> will run journal_dirty_data_fn() against an unlocked page.
>
> I haven't looked into the exact details of the race, but it should
> be addressable via jbd_lock_bh_state() or j_list_lock coverage.
Yep, that's what I've been hashing out with Stephen today...
In one of my cases journal_dirty_data has dropped & re-acquired the
bh_state lock and j_list_lock, and journal_unmap_buffer has come along
in the meantime.
So it looks like we are missing some state tests, i.e. buffer_mapped(),
at a couple points after we acquire jbd_lock_bh_state().
-Eric
Andrew Morton wrote:
> On Thu, 12 Oct 2006 14:28:20 +0200
> Jan Kara <[email protected]> wrote:
>
>
>> Where can we call
>> journal_dirty_data() without PageLock?
>>
>
> block_write_full_page() will unlock the page, so ext3_writepage()
> will run journal_dirty_data_fn() against an unlocked page.
>
> I haven't looked into the exact details of the race, but it should
> be addressable via jbd_lock_bh_state() or j_list_lock coverage
I'm testing with something like this now; seem sane?
journal_dirty_data & journal_unmap_data both do
jbd_lock_bh_state(bh) close to the top... journal_dirty_data_fn has checked
buffer_mapped before getting into journal_dirty_data, but that state may
change before the lock is grabbed. Similarly re-check after we drop the lock.
-Eric
Index: linux-2.6.18-1.2737.fc6/fs/jbd/transaction.c
===================================================================
--- linux-2.6.18-1.2737.fc6.orig/fs/jbd/transaction.c
+++ linux-2.6.18-1.2737.fc6/fs/jbd/transaction.c
@@ -967,6 +967,13 @@ int journal_dirty_data(handle_t *handle,
*/
jbd_lock_bh_state(bh);
spin_lock(&journal->j_list_lock);
+
+ /* Now that we have bh_state locked, are we really still mapped? */
+ if (!buffer_mapped(bh)) {
+ JBUFFER_TRACE(jh, "unmapped, bailing out");
+ goto no_journal;
+ }
+
if (jh->b_transaction) {
JBUFFER_TRACE(jh, "has transaction");
if (jh->b_transaction != handle->h_transaction) {
@@ -1028,6 +1036,11 @@ int journal_dirty_data(handle_t *handle,
sync_dirty_buffer(bh);
jbd_lock_bh_state(bh);
spin_lock(&journal->j_list_lock);
+ /* Since we dropped the lock... */
+ if (!buffer_mapped(bh)) {
+ JBUFFER_TRACE(jh, "Got unmapped");
+ goto no_journal;
+ }
/* The buffer may become locked again at any
time if it is redirtied */
}
Eric Sandeen wrote:
> Andrew Morton wrote:
>
>
>> On Thu, 12 Oct 2006 14:28:20 +0200
>> Jan Kara <[email protected]> wrote:
>>
>>
>>
>>> Where can we call
>>> journal_dirty_data() without PageLock?
>>>
>>>
>> block_write_full_page() will unlock the page, so ext3_writepage()
>> will run journal_dirty_data_fn() against an unlocked page.
>>
>> I haven't looked into the exact details of the race, but it should
>> be addressable via jbd_lock_bh_state() or j_list_lock coverage
>>
> I'm testing with something like this now; seem sane?
>
> journal_dirty_data & journal_unmap_data both do
> jbd_lock_bh_state(bh) close to the top... journal_dirty_data_fn has checked
> buffer_mapped before getting into journal_dirty_data, but that state may
> change before the lock is grabbed. Similarly re-check after we drop the lock.
>
>
This is exactly the solution I proposed earlier (to check
buffer_mapped() before calling submit_bh()).
But at that time, Jan pointed out that the whole handling is wrong.
But if this is the only case we need to handle, I am okay with this band
aid :)
Thanks,
Badari
Badari Pulavarty wrote:
> This is exactly the solution I proposed earlier (to check
> buffer_mapped() before calling submit_bh()).
> But at that time, Jan pointed out that the whole handling is wrong.
>
> But if this is the only case we need to handle, I am okay with this band
> aid :)
Doh!
And we come full circle... ok let me go reread that thread, it got long
enough I had to swap out... :)
-Eric
Eric Sandeen wrote:
> Badari Pulavarty wrote:
>
>
>> This is exactly the solution I proposed earlier (to check
>> buffer_mapped() before calling submit_bh()).
>> But at that time, Jan pointed out that the whole handling is wrong.
>>
>> But if this is the only case we need to handle, I am okay with this band
>> aid :)
>>
>
> Doh!
>
> And we come full circle... ok let me go reread that thread, it got long
> enough I had to swap out... :)
Don't bother. Let's see if this is the only case that needs fixing and
move forward..
Thanks,
Badari
> Eric Sandeen wrote:
> >Andrew Morton wrote:
> >
> >
> >>On Thu, 12 Oct 2006 14:28:20 +0200
> >>Jan Kara <[email protected]> wrote:
> >>
> >>
> >>
> >>>Where can we call
> >>>journal_dirty_data() without PageLock?
> >>>
> >>>
> >>block_write_full_page() will unlock the page, so ext3_writepage()
> >>will run journal_dirty_data_fn() against an unlocked page.
> >>
> >>I haven't looked into the exact details of the race, but it should
> >>be addressable via jbd_lock_bh_state() or j_list_lock coverage
> >>
> >I'm testing with something like this now; seem sane?
> >
> >journal_dirty_data & journal_unmap_data both do
> >jbd_lock_bh_state(bh) close to the top... journal_dirty_data_fn has
> >checked buffer_mapped before getting into journal_dirty_data, but that
> >state may
> >change before the lock is grabbed. Similarly re-check after we drop the
> >lock.
> >
> >
> This is exactly the solution I proposed earlier (to check
> buffer_mapped() before calling submit_bh()).
> But at that time, Jan pointed out that the whole handling is wrong.
Yes, and it was. However it turned out that there are more problems
than I thought ;).
> But if this is the only case we need to handle, I am okay with this band
> aid :)
I think Eric's patch may be a part of it. But we still need to check whether
the buffer is not after EOF before submitting it (or better said just
after we manage to lock the buffer). Because while we are waiting for
the buffer lock, journal_unmap_buffer() can still come and steal the
buffer - at least the write-out in journal_dirty_data() definitely needs
the check if I haven't overlooked something.
Honza
--
Jan Kara <[email protected]>
SuSE CR Labs
Jan Kara wrote:
>> This is exactly the solution I proposed earlier (to check
>> buffer_mapped() before calling submit_bh()).
>> But at that time, Jan pointed out that the whole handling is wrong.
> Yes, and it was. However it turned out that there are more problems
> than I thought ;).
>
>> But if this is the only case we need to handle, I am okay with this band
>> aid :)
> I think Eric's patch may be a part of it. But we still need to check whether
> the buffer is not after EOF before submitting it (or better said just
> after we manage to lock the buffer). Because while we are waiting for
> the buffer lock, journal_unmap_buffer() can still come and steal the
> buffer - at least the write-out in journal_dirty_data() definitely needs
> the check if I haven't overlooked something.
Ok, let me think on that today. My first reaction is that if we have
the bh state lock and pay attention to mapped in journal_dirty_data(),
then any blocks past EOF which have gotten unmapped by
journal_unmap_buffer will be recognized as such (because they are now
unmapped... without needing to check for past EOF...) and we'll be fine.
As a datapoint, davej's stresstest (several fsx's and fsstresses)
survived an overnight run on his box, which used to panic in < 2 hrs.
Survived about 6 hours on my box until I intentionally stopped it; my
box had added a write/truncate test in a loop, with a bunch of periodic
syncs as well....
-Eric
> Jan Kara wrote:
>
> >> This is exactly the solution I proposed earlier (to check
> >> buffer_mapped() before calling submit_bh()).
> >> But at that time, Jan pointed out that the whole handling is wrong.
> > Yes, and it was. However it turned out that there are more problems
> > than I thought ;).
> >
> >> But if this is the only case we need to handle, I am okay with this band
> >> aid :)
> > I think Eric's patch may be a part of it. But we still need to check whether
> > the buffer is not after EOF before submitting it (or better said just
> > after we manage to lock the buffer). Because while we are waiting for
> > the buffer lock, journal_unmap_buffer() can still come and steal the
> > buffer - at least the write-out in journal_dirty_data() definitely needs
> > the check if I haven't overlooked something.
>
> Ok, let me think on that today. My first reaction is that if we have
> the bh state lock and pay attention to mapped in journal_dirty_data(),
> then any blocks past EOF which have gotten unmapped by
> journal_unmap_buffer will be recognized as such (because they are now
> unmapped... without needing to check for past EOF...) and we'll be fine.
Hmm, yes, you're right. If we do the test in journal_dirty_data() we
should not file an unmapped buffer onto the transaction's list and hence
we should be safe. Fine. In case we eventually hit the assertion, we can
think further ;).
> As a data point, davej's stress test (several fsx's and fsstresses)
> survived an overnight run on his box, which used to panic in < 2 hrs.
> It survived about 6 hours on my box until I intentionally stopped it; my
> box had added a write/truncate test in a loop, with a bunch of periodic
> syncs as well....
Perfect :).
Honza
--
Jan Kara <[email protected]>
SuSE CR Labs
Jan Kara wrote:
>>> I think Eric's patch may be a part of it. But we still need to check whether
>>> the buffer is not after EOF before submitting it (or better said just
>>> after we manage to lock the buffer). Because while we are waiting for
>>> the buffer lock, journal_unmap_buffer() can still come and steal the
>>> buffer - at least the write-out in journal_dirty_data() definitely needs
>>> the check if I haven't overlooked something.
>> Ok, let me think on that today. My first reaction is that if we have
>> the bh state lock and pay attention to mapped in journal_dirty_data(),
>> then any blocks past EOF which have gotten unmapped by
>> journal_unmap_buffer will be recognized as such (because they are now
>> unmapped... without needing to check for past EOF...) and we'll be fine.
> Hmm, yes, you're right. If we do the test in journal_dirty_data() we
> should not file an unmapped buffer onto the transaction's list and hence
> we should be safe. Fine. In case we eventually hit the assertion, we can
> think further ;).
Awww no way that can possibly happen right? :)
>> As a data point, davej's stress test (several fsx's and fsstresses)
>> survived an overnight run on his box, which used to panic in < 2 hrs.
>> It survived about 6 hours on my box until I intentionally stopped it; my
>> box had added a write/truncate test in a loop, with a bunch of periodic
>> syncs as well....
> Perfect :).
Ok, thanks Jan!
I'll send a patch for -mm as a new thread, this one has gotten buried
pretty deep.
-Eric