2014-06-30 21:30:50

by Matteo Croce

Subject: ext4: journal has aborted

I was web surfing and using gimp when:

EXT4-fs error (device sda2): ext4_mb_generate_buddy:756: group 199,
9414 clusters in bitmap, 9500 in gd; block bitmap corrupt.
Aborting journal on device sda2-8.
EXT4-fs (sda2): Remounting filesystem read-only
------------[ cut here ]------------
WARNING: CPU: 6 PID: 4134 at fs/ext4/ext4_jbd2.c:259
__ext4_handle_dirty_metadata+0x18e/0x1d0()
Modules linked in: snd_hda_codec_hdmi snd_hda_codec_realtek
snd_hda_codec_generic ecb uvcvideo videobuf2_vmalloc videobuf2_memops
videobuf2_core videodev ath3k btusb rts5139(C) ctr ccm iTCO_wdt bnep
rfcomm bluetooth nls_iso8859_1 vfat fat arc4 intel_rapl
x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm
snd_hda_intel snd_hda_controller snd_hda_codec snd_hwdep snd_pcm
aesni_intel aes_x86_64 snd_seq_midi snd_seq_midi_event ath9k led_class
glue_helper ath9k_common lrw gf128mul ath9k_hw ablk_helper cryptd ath
mac80211 snd_rawmidi snd_seq cfg80211 radeon microcode rfkill
snd_timer snd_seq_device sr_mod psmouse r8169 snd cdrom i915 lpc_ich
soundcore ttm mii mfd_core drm_kms_helper drm intel_gtt agpgart
ehci_pci mei_me xhci_hcd tpm_infineon ehci_hcd video mei wmi tpm
backlight
CPU: 6 PID: 4134 Comm: gimp-2.8 Tainted: G C 3.15.0 #6
0000000000000009 ffffffff813acbdd 0000000000000000 ffffffff8103de3d
ffff8802365231a0 00000000ffffffe2 0000000000000000 ffff8800b90816c0
ffffffff814205a0 ffffffff8118879e 0000000000000005 ffff8802365231a0
Call Trace:
[<ffffffff813acbdd>] ? dump_stack+0x41/0x51
[<ffffffff8103de3d>] ? warn_slowpath_common+0x6d/0x90
[<ffffffff8118879e>] ? __ext4_handle_dirty_metadata+0x18e/0x1d0
[<ffffffff8116e130>] ? ext4_dirty_inode+0x20/0x50
[<ffffffff811903e9>] ? ext4_free_blocks+0x539/0xa40
[<ffffffff8118468b>] ? ext4_ext_remove_space+0x83b/0xe60
[<ffffffff81186a58>] ? ext4_ext_truncate+0x98/0xc0
[<ffffffff8116c985>] ? ext4_truncate+0x2b5/0x300
[<ffffffff8116d3d8>] ? ext4_evict_inode+0x3d8/0x410
[<ffffffff81114a46>] ? evict+0xa6/0x160
[<ffffffff81109346>] ? do_unlinkat+0x186/0x2a0
[<ffffffff8110e51e>] ? SyS_getdents+0xde/0x100
[<ffffffff8110e1d0>] ? fillonedir+0xd0/0xd0
[<ffffffff813b2626>] ? system_call_fastpath+0x1a/0x1f
---[ end trace 795411398e41fbcb ]---
EXT4: jbd2_journal_dirty_metadata failed: handle type 5 started at
line 241, credits 91/91, errcode -30
EXT4: jbd2_journal_dirty_metadata failed: handle type 5 started at
line 241, credits 91/91, errcode -30<2>EXT4-fs error (device sda2) in
ext4_free_blocks:4867: Journal has aborted
EXT4-fs error (device sda2): ext4_ext_rm_leaf:2731: inode #8257653:
block 6520936: comm gimp-2.8: journal_dirty_metadata failed: handle
type 5 started at line 241, credits 91/91, errcode -30
EXT4-fs error (device sda2) in ext4_ext_remove_space:3018: Journal has aborted
EXT4-fs error (device sda2) in ext4_ext_truncate:4666: Journal has aborted
EXT4-fs error (device sda2) in ext4_reserve_inode_write:4877: Journal
has aborted
EXT4-fs error (device sda2) in ext4_truncate:3788: Journal has aborted
EXT4-fs error (device sda2) in ext4_reserve_inode_write:4877: Journal
has aborted
EXT4-fs error (device sda2) in ext4_orphan_del:2684: Journal has aborted
EXT4-fs error (device sda2) in ext4_reserve_inode_write:4877: Journal
has aborted


2014-07-01 06:56:09

by David Jander

Subject: Re: ext4: journal has aborted


Hi,

On Mon, 30 Jun 2014 23:30:10 +0200
Matteo Croce <[email protected]> wrote:

> I was web surfing and using gimp when:
>
> EXT4-fs error (device sda2): ext4_mb_generate_buddy:756: group 199,
> 9414 clusters in bitmap, 9500 in gd; block bitmap corrupt.

I was about to post a related question to this list. I am also seeing these
kinds of errors when using ext4 on the latest mainline (I first saw this when
testing 3.15, and it is still there in 3.16-rc3).
It happens almost instantly when power-cycling the system (unclean shutdown).
The next time the system boots, I get these errors.

AFAICT, you are using a pretty recent kernel. Which version exactly?

> Aborting journal on device sda2-8.
> EXT4-fs (sda2): Remounting filesystem read-only
> ------------[ cut here ]------------
> WARNING: CPU: 6 PID: 4134 at fs/ext4/ext4_jbd2.c:259
> __ext4_handle_dirty_metadata+0x18e/0x1d0()
> Modules linked in: snd_hda_codec_hdmi snd_hda_codec_realtek
> snd_hda_codec_generic ecb uvcvideo videobuf2_vmalloc videobuf2_memops
> videobuf2_core videodev ath3k btusb rts5139(C) ctr ccm iTCO_wdt bnep
> rfcomm bluetooth nls_iso8859_1 vfat fat arc4 intel_rapl
> x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm
> snd_hda_intel snd_hda_controller snd_hda_codec snd_hwdep snd_pcm
> aesni_intel aes_x86_64 snd_seq_midi snd_seq_midi_event ath9k led_class
> glue_helper ath9k_common lrw gf128mul ath9k_hw ablk_helper cryptd ath
> mac80211 snd_rawmidi snd_seq cfg80211 radeon microcode rfkill
> snd_timer snd_seq_device sr_mod psmouse r8169 snd cdrom i915 lpc_ich
> soundcore ttm mii mfd_core drm_kms_helper drm intel_gtt agpgart
> ehci_pci mei_me xhci_hcd tpm_infineon ehci_hcd video mei wmi tpm
> backlight
> CPU: 6 PID: 4134 Comm: gimp-2.8 Tainted: G C 3.15.0 #6
> 0000000000000009 ffffffff813acbdd 0000000000000000 ffffffff8103de3d
> ffff8802365231a0 00000000ffffffe2 0000000000000000 ffff8800b90816c0
> ffffffff814205a0 ffffffff8118879e 0000000000000005 ffff8802365231a0
> Call Trace:
> [<ffffffff813acbdd>] ? dump_stack+0x41/0x51
> [<ffffffff8103de3d>] ? warn_slowpath_common+0x6d/0x90
> [<ffffffff8118879e>] ? __ext4_handle_dirty_metadata+0x18e/0x1d0
> [<ffffffff8116e130>] ? ext4_dirty_inode+0x20/0x50
> [<ffffffff811903e9>] ? ext4_free_blocks+0x539/0xa40
> [<ffffffff8118468b>] ? ext4_ext_remove_space+0x83b/0xe60
> [<ffffffff81186a58>] ? ext4_ext_truncate+0x98/0xc0
> [<ffffffff8116c985>] ? ext4_truncate+0x2b5/0x300
> [<ffffffff8116d3d8>] ? ext4_evict_inode+0x3d8/0x410
> [<ffffffff81114a46>] ? evict+0xa6/0x160
> [<ffffffff81109346>] ? do_unlinkat+0x186/0x2a0
> [<ffffffff8110e51e>] ? SyS_getdents+0xde/0x100
> [<ffffffff8110e1d0>] ? fillonedir+0xd0/0xd0
> [<ffffffff813b2626>] ? system_call_fastpath+0x1a/0x1f
> ---[ end trace 795411398e41fbcb ]---
> EXT4: jbd2_journal_dirty_metadata failed: handle type 5 started at
> line 241, credits 91/91, errcode -30
> EXT4: jbd2_journal_dirty_metadata failed: handle type 5 started at
> line 241, credits 91/91, errcode -30<2>EXT4-fs error (device sda2) in
> ext4_free_blocks:4867: Journal has aborted
> EXT4-fs error (device sda2): ext4_ext_rm_leaf:2731: inode #8257653:
> block 6520936: comm gimp-2.8: journal_dirty_metadata failed: handle
> type 5 started at line 241, credits 91/91, errcode -30
> EXT4-fs error (device sda2) in ext4_ext_remove_space:3018: Journal has
> aborted EXT4-fs error (device sda2) in ext4_ext_truncate:4666: Journal has
> aborted EXT4-fs error (device sda2) in ext4_reserve_inode_write:4877: Journal
> has aborted
> EXT4-fs error (device sda2) in ext4_truncate:3788: Journal has aborted
> EXT4-fs error (device sda2) in ext4_reserve_inode_write:4877: Journal
> has aborted
> EXT4-fs error (device sda2) in ext4_orphan_del:2684: Journal has aborted
> EXT4-fs error (device sda2) in ext4_reserve_inode_write:4877: Journal
> has aborted

I did not get these errors. I suspect this may be a consequence of FS
corruption due to a bug in ext4.

Here's why I suspect a bug:

I am running the latest git head (3.16-rc3+ as of yesterday) on an ARM system
with eMMC flash. The eMMC is formatted in SLC mode ("enhanced" mode according
to eMMC 4.41) and "reliable-writes" are enabled, so power-cycling should not
cause FS corruption in the presence of a journal.

I can format the eMMC device either as EXT3 or EXT4 for the test. After
formatting and writing the rootfs to the partition I can boot successfully in
either situation. Once booted from eMMC, I start bonnie++ (to just stress the
FS for a while), and after a minute or so the board is power-cycled while
bonnie++ is still running.

Next time I boot the situation is this:

With EXT3: All seems fine, journal is replayed, no errors. I can repeat this as
many times as I want, FS stays consistent.

With EXT4: After just one power cycle I start getting this:

[ 7.603871] EXT4-fs error (device mmcblk0p2): ext4_mb_generate_buddy:757: group 1, 8542 clusters in bitmap, 8550 in gd; block bitmap corrupt.
[ 7.616743] JBD2: Spotted dirty metadata buffer (dev = mmcblk0p2, blocknr = 0). There's a risk of filesystem corruption in case of system crash.

If I continue the test, it doesn't take long before serious corruption starts
occurring.

Again, with EXT3 I am unable to detect any problems.
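
For reference, one cycle of the test looks roughly like this (device and
path names here are just examples):

  mkfs.ext4 /dev/mmcblk0p2          # or mkfs.ext3 for the comparison runs
  # ...write the rootfs to the partition, boot from it, and then:
  bonnie++ -d /home/test -u root &  # stress the filesystem for a while
  sleep 60
  # ...and power-cycle the board externally while bonnie++ is still running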

Best regards,

--
David Jander
Protonic Holland.

2014-07-01 08:01:17

by Matteo Croce

Subject: Re: ext4: journal has aborted

2014-07-01 8:26 GMT+02:00 David Jander <[email protected]>:
>
> Hi,
>
> On Mon, 30 Jun 2014 23:30:10 +0200
> Matteo Croce <[email protected]> wrote:
>
>> I was web surfing and using gimp when:
>>
>> EXT4-fs error (device sda2): ext4_mb_generate_buddy:756: group 199,
>> 9414 clusters in bitmap, 9500 in gd; block bitmap corrupt.
>
> I was about to post a related question to this list. I am also seeing these
> kind of errors when using ext4 on latest mainline (I began testing with 3.15
> where I saw this and now in 3.16-rc3 it is still there).
> It happens almost instantly when power-cycling the system (unclean shutdown).
> The next time the system boots, I get these errors.

Indeed, I had a power failure during the last boot, but it took about an hour
after the unclean mount for the journal to abort.

> AFAICT, you are using a pretty recent kernel. Which version exactly?

vanilla 3.15.0

>> Aborting journal on device sda2-8.
>> EXT4-fs (sda2): Remounting filesystem read-only
>> ------------[ cut here ]------------
>> WARNING: CPU: 6 PID: 4134 at fs/ext4/ext4_jbd2.c:259
>> __ext4_handle_dirty_metadata+0x18e/0x1d0()
>> Modules linked in: snd_hda_codec_hdmi snd_hda_codec_realtek
>> snd_hda_codec_generic ecb uvcvideo videobuf2_vmalloc videobuf2_memops
>> videobuf2_core videodev ath3k btusb rts5139(C) ctr ccm iTCO_wdt bnep
>> rfcomm bluetooth nls_iso8859_1 vfat fat arc4 intel_rapl
>> x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm
>> snd_hda_intel snd_hda_controller snd_hda_codec snd_hwdep snd_pcm
>> aesni_intel aes_x86_64 snd_seq_midi snd_seq_midi_event ath9k led_class
>> glue_helper ath9k_common lrw gf128mul ath9k_hw ablk_helper cryptd ath
>> mac80211 snd_rawmidi snd_seq cfg80211 radeon microcode rfkill
>> snd_timer snd_seq_device sr_mod psmouse r8169 snd cdrom i915 lpc_ich
>> soundcore ttm mii mfd_core drm_kms_helper drm intel_gtt agpgart
>> ehci_pci mei_me xhci_hcd tpm_infineon ehci_hcd video mei wmi tpm
>> backlight
>> CPU: 6 PID: 4134 Comm: gimp-2.8 Tainted: G C 3.15.0 #6
>> 0000000000000009 ffffffff813acbdd 0000000000000000 ffffffff8103de3d
>> ffff8802365231a0 00000000ffffffe2 0000000000000000 ffff8800b90816c0
>> ffffffff814205a0 ffffffff8118879e 0000000000000005 ffff8802365231a0
>> Call Trace:
>> [<ffffffff813acbdd>] ? dump_stack+0x41/0x51
>> [<ffffffff8103de3d>] ? warn_slowpath_common+0x6d/0x90
>> [<ffffffff8118879e>] ? __ext4_handle_dirty_metadata+0x18e/0x1d0
>> [<ffffffff8116e130>] ? ext4_dirty_inode+0x20/0x50
>> [<ffffffff811903e9>] ? ext4_free_blocks+0x539/0xa40
>> [<ffffffff8118468b>] ? ext4_ext_remove_space+0x83b/0xe60
>> [<ffffffff81186a58>] ? ext4_ext_truncate+0x98/0xc0
>> [<ffffffff8116c985>] ? ext4_truncate+0x2b5/0x300
>> [<ffffffff8116d3d8>] ? ext4_evict_inode+0x3d8/0x410
>> [<ffffffff81114a46>] ? evict+0xa6/0x160
>> [<ffffffff81109346>] ? do_unlinkat+0x186/0x2a0
>> [<ffffffff8110e51e>] ? SyS_getdents+0xde/0x100
>> [<ffffffff8110e1d0>] ? fillonedir+0xd0/0xd0
>> [<ffffffff813b2626>] ? system_call_fastpath+0x1a/0x1f
>> ---[ end trace 795411398e41fbcb ]---
>> EXT4: jbd2_journal_dirty_metadata failed: handle type 5 started at
>> line 241, credits 91/91, errcode -30
>> EXT4: jbd2_journal_dirty_metadata failed: handle type 5 started at
>> line 241, credits 91/91, errcode -30<2>EXT4-fs error (device sda2) in
>> ext4_free_blocks:4867: Journal has aborted
>> EXT4-fs error (device sda2): ext4_ext_rm_leaf:2731: inode #8257653:
>> block 6520936: comm gimp-2.8: journal_dirty_metadata failed: handle
>> type 5 started at line 241, credits 91/91, errcode -30
>> EXT4-fs error (device sda2) in ext4_ext_remove_space:3018: Journal has
>> aborted EXT4-fs error (device sda2) in ext4_ext_truncate:4666: Journal has
>> aborted EXT4-fs error (device sda2) in ext4_reserve_inode_write:4877: Journal
>> has aborted
>> EXT4-fs error (device sda2) in ext4_truncate:3788: Journal has aborted
>> EXT4-fs error (device sda2) in ext4_reserve_inode_write:4877: Journal
>> has aborted
>> EXT4-fs error (device sda2) in ext4_orphan_del:2684: Journal has aborted
>> EXT4-fs error (device sda2) in ext4_reserve_inode_write:4877: Journal
>> has aborted
>
> I did not get these errors. I suspect this may be a consequence of FS
> corruption due to a bug in etx4.
>
> Here's why I suspect a bug:
>
> I am running latest git head (3.16-rc3+ as of yesterday) on an ARM system with
> eMMC flash. The eMMC is formatted in SLC mode ("enhanced" mode according to
> eMMC 4.41) and "reliable-writes" are enabled, so power-cycling should not
> cause FS corruption in presence of a journal.
>
> I can format the eMMC device either as EXT3 or EXT4 for the test. After
> formatting and writing the rootfs to the partition I can boot successfully in
> either situation. Once booted from eMMC, I start bonnie++ (to just stress the
> FS for a while), and after a minute or so the board is power-cycled while
> bonnie++ is still running.
>
> Next time I boot the situation is this:
>
> With EXT3: All seems fine, journal is replayed, no errors. I can repeat this as
> many times as I want, FS stays consistent.
>
> With EXT4: After just one power cycle I start getting this:
>
> [ 7.603871] EXT4-fs error (device mmcblk0p2): ext4_mb_generate_buddy:757: group 1, 8542 clusters in bitmap, 8550 in gd; block bitmap corrupt.
> [ 7.616743] JBD2: Spotted dirty metadata buffer (dev = mmcblk0p2, blocknr = 0). There's a risk of filesystem corruption in case of system crash.
>
> If I continue the test, it doesn't take long and serious corruption starts
> occurring.
>
> Again, with EXT3 I am unable to detect any problems.
>
> Best regards,
>
> --
> David Jander
> Protonic Holland.



--
Matteo Croce
OpenWrt Developer

2014-07-01 08:42:22

by Darrick J. Wong

Subject: Re: ext4: journal has aborted

On Tue, Jul 01, 2014 at 08:26:19AM +0200, David Jander wrote:
>
> Hi,
>
> On Mon, 30 Jun 2014 23:30:10 +0200
> Matteo Croce <[email protected]> wrote:
>
> > I was web surfing and using gimp when:
> >
> > EXT4-fs error (device sda2): ext4_mb_generate_buddy:756: group 199,
> > 9414 clusters in bitmap, 9500 in gd; block bitmap corrupt.
>
> I was about to post a related question to this list. I am also seeing these
> kind of errors when using ext4 on latest mainline (I began testing with 3.15
> where I saw this and now in 3.16-rc3 it is still there).
> It happens almost instantly when power-cycling the system (unclean shutdown).
> The next time the system boots, I get these errors.
>
> AFAICT, you are using a pretty recent kernel. Which version exactly?
>
> > Aborting journal on device sda2-8.
> > EXT4-fs (sda2): Remounting filesystem read-only

Matteo, could you please post the full dmesg log somewhere? I'm interested in
what happens before all this happens, because...

> > ------------[ cut here ]------------
> > WARNING: CPU: 6 PID: 4134 at fs/ext4/ext4_jbd2.c:259
> > __ext4_handle_dirty_metadata+0x18e/0x1d0()
> > Modules linked in: snd_hda_codec_hdmi snd_hda_codec_realtek
> > snd_hda_codec_generic ecb uvcvideo videobuf2_vmalloc videobuf2_memops
> > videobuf2_core videodev ath3k btusb rts5139(C) ctr ccm iTCO_wdt bnep
> > rfcomm bluetooth nls_iso8859_1 vfat fat arc4 intel_rapl
> > x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm
> > snd_hda_intel snd_hda_controller snd_hda_codec snd_hwdep snd_pcm
> > aesni_intel aes_x86_64 snd_seq_midi snd_seq_midi_event ath9k led_class
> > glue_helper ath9k_common lrw gf128mul ath9k_hw ablk_helper cryptd ath
> > mac80211 snd_rawmidi snd_seq cfg80211 radeon microcode rfkill
> > snd_timer snd_seq_device sr_mod psmouse r8169 snd cdrom i915 lpc_ich
> > soundcore ttm mii mfd_core drm_kms_helper drm intel_gtt agpgart
> > ehci_pci mei_me xhci_hcd tpm_infineon ehci_hcd video mei wmi tpm
> > backlight
> > CPU: 6 PID: 4134 Comm: gimp-2.8 Tainted: G C 3.15.0 #6
> > 0000000000000009 ffffffff813acbdd 0000000000000000 ffffffff8103de3d
> > ffff8802365231a0 00000000ffffffe2 0000000000000000 ffff8800b90816c0
> > ffffffff814205a0 ffffffff8118879e 0000000000000005 ffff8802365231a0
> > Call Trace:
> > [<ffffffff813acbdd>] ? dump_stack+0x41/0x51
> > [<ffffffff8103de3d>] ? warn_slowpath_common+0x6d/0x90
> > [<ffffffff8118879e>] ? __ext4_handle_dirty_metadata+0x18e/0x1d0
> > [<ffffffff8116e130>] ? ext4_dirty_inode+0x20/0x50
> > [<ffffffff811903e9>] ? ext4_free_blocks+0x539/0xa40
> > [<ffffffff8118468b>] ? ext4_ext_remove_space+0x83b/0xe60
> > [<ffffffff81186a58>] ? ext4_ext_truncate+0x98/0xc0
> > [<ffffffff8116c985>] ? ext4_truncate+0x2b5/0x300
> > [<ffffffff8116d3d8>] ? ext4_evict_inode+0x3d8/0x410
> > [<ffffffff81114a46>] ? evict+0xa6/0x160
> > [<ffffffff81109346>] ? do_unlinkat+0x186/0x2a0
> > [<ffffffff8110e51e>] ? SyS_getdents+0xde/0x100
> > [<ffffffff8110e1d0>] ? fillonedir+0xd0/0xd0
> > [<ffffffff813b2626>] ? system_call_fastpath+0x1a/0x1f
> > ---[ end trace 795411398e41fbcb ]---
> > EXT4: jbd2_journal_dirty_metadata failed: handle type 5 started at
> > line 241, credits 91/91, errcode -30
> > EXT4: jbd2_journal_dirty_metadata failed: handle type 5 started at
> > line 241, credits 91/91, errcode -30<2>EXT4-fs error (device sda2) in
> > ext4_free_blocks:4867: Journal has aborted
> > EXT4-fs error (device sda2): ext4_ext_rm_leaf:2731: inode #8257653:
> > block 6520936: comm gimp-2.8: journal_dirty_metadata failed: handle
> > type 5 started at line 241, credits 91/91, errcode -30
> > EXT4-fs error (device sda2) in ext4_ext_remove_space:3018: Journal has
> > aborted EXT4-fs error (device sda2) in ext4_ext_truncate:4666: Journal has
> > aborted EXT4-fs error (device sda2) in ext4_reserve_inode_write:4877: Journal
> > has aborted
> > EXT4-fs error (device sda2) in ext4_truncate:3788: Journal has aborted
> > EXT4-fs error (device sda2) in ext4_reserve_inode_write:4877: Journal
> > has aborted
> > EXT4-fs error (device sda2) in ext4_orphan_del:2684: Journal has aborted
> > EXT4-fs error (device sda2) in ext4_reserve_inode_write:4877: Journal
> > has aborted
>
> I did not get these errors. I suspect this may be a consequence of FS
> corruption due to a bug in etx4.
>
> Here's why I suspect a bug:
>
> I am running latest git head (3.16-rc3+ as of yesterday) on an ARM system with
> eMMC flash. The eMMC is formatted in SLC mode ("enhanced" mode according to
> eMMC 4.41) and "reliable-writes" are enabled, so power-cycling should not
> cause FS corruption in presence of a journal.
>
> I can format the eMMC device either as EXT3 or EXT4 for the test. After
> formatting and writing the rootfs to the partition I can boot successfully in
> either situation. Once booted from eMMC, I start bonnie++ (to just stress the
> FS for a while), and after a minute or so the board is power-cycled while
> bonnie++ is still running.
>
> Next time I boot the situation is this:
>
> With EXT3: All seems fine, journal is replayed, no errors. I can repeat this as
> many times as I want, FS stays consistent.
>
> With EXT4: After just one power cycle I start getting this:
>
> [ 7.603871] EXT4-fs error (device mmcblk0p2): ext4_mb_generate_buddy:757: group 1, 8542 clusters in bitmap, 8550 in gd; block bitmap corrupt.
> [ 7.616743] JBD2: Spotted dirty metadata buffer (dev = mmcblk0p2, blocknr = 0). There's a risk of filesystem corruption in case of system crash.

I've been seeing this same set of symptoms with 3.15.0 on various SSDs (Samsung
840 Pro, Crucial M4). It seems that something (upstart?) is holding some file
or other open during poweroff, which means that the root fs can't be unmounted
or even remounted read-only. I also noticed that the next time the system comes
up, the kernel tells me that it has to process the inode orphan list as part of
recovery.

Shortly after the orphan list gets processed, I get that message and the FS
goes ro. A subsequent fsck run reveals that the block bitmap is indeed
incorrect in that block group, and when I bd the blocks that are incorrect in
the bitmap, I see what could be some kind of upstart log file. Either way, I
suspect some bug in orphan processing.
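
(For anyone following along: "bd" is debugfs's block_dump command. A minimal
example, using the device and a block number from the report above:

  debugfs -R "bd 6520936" /dev/sda2   # dump the raw block contents, read-only
)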

<shrug> I don't know if this is specific to SSDs or spinning rust. Right now
I've simply rigged the initramfs to e2fsck -p the root fs before mounting it,
which seems(?) to have patched around it for now.
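
(A minimal sketch of that kind of hook, assuming a Debian-style
initramfs-tools setup where $ROOT holds the root device in local-premount
scripts and e2fsck has been copied into the initramfs image; the script
name and path below are hypothetical:

  #!/bin/sh
  # /etc/initramfs-tools/scripts/local-premount/fsck-root
  PREREQ=""
  prereqs() { echo "$PREREQ"; }
  case "$1" in prereqs) prereqs; exit 0 ;; esac
  # preen the root fs before the initramfs mounts it; -p fixes only safe problems
  [ -b "$ROOT" ] && /sbin/e2fsck -p "$ROOT"
)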

> If I continue the test, it doesn't take long and serious corruption starts
> occurring.

You're getting actual FS data corruption too? Or more of those messages?

--D
>
> Again, with EXT3 I am unable to detect any problems.
>
> Best regards,
>
> --
> David Jander
> Protonic Holland.

2014-07-01 08:55:52

by Matteo Croce

Subject: Re: ext4: journal has aborted

2014-07-01 10:42 GMT+02:00 Darrick J. Wong <[email protected]>:
> On Tue, Jul 01, 2014 at 08:26:19AM +0200, David Jander wrote:
>>
>> Hi,
>>
>> On Mon, 30 Jun 2014 23:30:10 +0200
>> Matteo Croce <[email protected]> wrote:
>>
>> > I was web surfing and using gimp when:
>> >
>> > EXT4-fs error (device sda2): ext4_mb_generate_buddy:756: group 199,
>> > 9414 clusters in bitmap, 9500 in gd; block bitmap corrupt.
>>
>> I was about to post a related question to this list. I am also seeing these
>> kind of errors when using ext4 on latest mainline (I began testing with 3.15
>> where I saw this and now in 3.16-rc3 it is still there).
>> It happens almost instantly when power-cycling the system (unclean shutdown).
>> The next time the system boots, I get these errors.
>>
>> AFAICT, you are using a pretty recent kernel. Which version exactly?
>>
>> > Aborting journal on device sda2-8.
>> > EXT4-fs (sda2): Remounting filesystem read-only
>
> Matteo, could you please post the full dmesg log somewhere? I'm interested in
> what happens before all this happens, because...

I've rebooted the notebook twice

>> > ------------[ cut here ]------------
>> > WARNING: CPU: 6 PID: 4134 at fs/ext4/ext4_jbd2.c:259
>> > __ext4_handle_dirty_metadata+0x18e/0x1d0()
>> > Modules linked in: snd_hda_codec_hdmi snd_hda_codec_realtek
>> > snd_hda_codec_generic ecb uvcvideo videobuf2_vmalloc videobuf2_memops
>> > videobuf2_core videodev ath3k btusb rts5139(C) ctr ccm iTCO_wdt bnep
>> > rfcomm bluetooth nls_iso8859_1 vfat fat arc4 intel_rapl
>> > x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm
>> > snd_hda_intel snd_hda_controller snd_hda_codec snd_hwdep snd_pcm
>> > aesni_intel aes_x86_64 snd_seq_midi snd_seq_midi_event ath9k led_class
>> > glue_helper ath9k_common lrw gf128mul ath9k_hw ablk_helper cryptd ath
>> > mac80211 snd_rawmidi snd_seq cfg80211 radeon microcode rfkill
>> > snd_timer snd_seq_device sr_mod psmouse r8169 snd cdrom i915 lpc_ich
>> > soundcore ttm mii mfd_core drm_kms_helper drm intel_gtt agpgart
>> > ehci_pci mei_me xhci_hcd tpm_infineon ehci_hcd video mei wmi tpm
>> > backlight
>> > CPU: 6 PID: 4134 Comm: gimp-2.8 Tainted: G C 3.15.0 #6
>> > 0000000000000009 ffffffff813acbdd 0000000000000000 ffffffff8103de3d
>> > ffff8802365231a0 00000000ffffffe2 0000000000000000 ffff8800b90816c0
>> > ffffffff814205a0 ffffffff8118879e 0000000000000005 ffff8802365231a0
>> > Call Trace:
>> > [<ffffffff813acbdd>] ? dump_stack+0x41/0x51
>> > [<ffffffff8103de3d>] ? warn_slowpath_common+0x6d/0x90
>> > [<ffffffff8118879e>] ? __ext4_handle_dirty_metadata+0x18e/0x1d0
>> > [<ffffffff8116e130>] ? ext4_dirty_inode+0x20/0x50
>> > [<ffffffff811903e9>] ? ext4_free_blocks+0x539/0xa40
>> > [<ffffffff8118468b>] ? ext4_ext_remove_space+0x83b/0xe60
>> > [<ffffffff81186a58>] ? ext4_ext_truncate+0x98/0xc0
>> > [<ffffffff8116c985>] ? ext4_truncate+0x2b5/0x300
>> > [<ffffffff8116d3d8>] ? ext4_evict_inode+0x3d8/0x410
>> > [<ffffffff81114a46>] ? evict+0xa6/0x160
>> > [<ffffffff81109346>] ? do_unlinkat+0x186/0x2a0
>> > [<ffffffff8110e51e>] ? SyS_getdents+0xde/0x100
>> > [<ffffffff8110e1d0>] ? fillonedir+0xd0/0xd0
>> > [<ffffffff813b2626>] ? system_call_fastpath+0x1a/0x1f
>> > ---[ end trace 795411398e41fbcb ]---
>> > EXT4: jbd2_journal_dirty_metadata failed: handle type 5 started at
>> > line 241, credits 91/91, errcode -30
>> > EXT4: jbd2_journal_dirty_metadata failed: handle type 5 started at
>> > line 241, credits 91/91, errcode -30<2>EXT4-fs error (device sda2) in
>> > ext4_free_blocks:4867: Journal has aborted
>> > EXT4-fs error (device sda2): ext4_ext_rm_leaf:2731: inode #8257653:
>> > block 6520936: comm gimp-2.8: journal_dirty_metadata failed: handle
>> > type 5 started at line 241, credits 91/91, errcode -30
>> > EXT4-fs error (device sda2) in ext4_ext_remove_space:3018: Journal has
>> > aborted EXT4-fs error (device sda2) in ext4_ext_truncate:4666: Journal has
>> > aborted EXT4-fs error (device sda2) in ext4_reserve_inode_write:4877: Journal
>> > has aborted
>> > EXT4-fs error (device sda2) in ext4_truncate:3788: Journal has aborted
>> > EXT4-fs error (device sda2) in ext4_reserve_inode_write:4877: Journal
>> > has aborted
>> > EXT4-fs error (device sda2) in ext4_orphan_del:2684: Journal has aborted
>> > EXT4-fs error (device sda2) in ext4_reserve_inode_write:4877: Journal
>> > has aborted
>>
>> I did not get these errors. I suspect this may be a consequence of FS
>> corruption due to a bug in etx4.
>>
>> Here's why I suspect a bug:
>>
>> I am running latest git head (3.16-rc3+ as of yesterday) on an ARM system with
>> eMMC flash. The eMMC is formatted in SLC mode ("enhanced" mode according to
>> eMMC 4.41) and "reliable-writes" are enabled, so power-cycling should not
>> cause FS corruption in presence of a journal.

I have a Samsung SSD 840 PRO

>> I can format the eMMC device either as EXT3 or EXT4 for the test. After
>> formatting and writing the rootfs to the partition I can boot successfully in
>> either situation. Once booted from eMMC, I start bonnie++ (to just stress the
>> FS for a while), and after a minute or so the board is power-cycled while
>> bonnie++ is still running.
>>
>> Next time I boot the situation is this:
>>
>> With EXT3: All seems fine, journal is replayed, no errors. I can repeat this as
>> many times as I want, FS stays consistent.
>>
>> With EXT4: After just one power cycle I start getting this:
>>
>> [ 7.603871] EXT4-fs error (device mmcblk0p2): ext4_mb_generate_buddy:757: group 1, 8542 clusters in bitmap, 8550 in gd; block bitmap corrupt.
>> [ 7.616743] JBD2: Spotted dirty metadata buffer (dev = mmcblk0p2, blocknr = 0). There's a risk of filesystem corruption in case of system crash.
>
> I've been seeing this same set of symptoms with 3.15.0 on various SSDs (Samsung
> 840 Pro, Crucial M4). It seems that something (upstart?) is holding open some
> file or other during poweroff, which means that the root fs can't be unmounted
> or even remounted rw. I also noticed that the next time the system comes up,
> the kernel tells me that it has to process the inode orphan list as part of
> recovery.
>
> Shortly after the orphan list gets processed, I get that message and the FS
> goes ro. A subsequent fsck run reveals that the block bitmap is indeed
> incorrect in that block group, and when I bd the blocks that are incorrect in
> the bitmap, I see what could be some kind of upstart log file. Either way, I
> suspect some bug in orphan processing.
>
> <shrug> I don't know if this is specific to SSDs or spinning rust. Right now
> I've simply rigged the initramfs to e2fsck -p the root fs before mounting it,
> which seems(?) to have patched around it for now.
>
>> If I continue the test, it doesn't take long and serious corruption starts
>> occurring.
>
> You're getting actual FS data corruption too? Or more of those messages?

Actually, it seems that there is no corruption

> --D
>>
>> Again, with EXT3 I am unable to detect any problems.
>>
>> Best regards,
>>
>> --
>> David Jander
>> Protonic Holland.



--
Matteo Croce
OpenWrt Developer

2014-07-01 09:02:53

by Darrick J. Wong

Subject: Re: ext4: journal has aborted

On Tue, Jul 01, 2014 at 08:26:19AM +0200, David Jander wrote:
>
> Hi,
>
> On Mon, 30 Jun 2014 23:30:10 +0200
> Matteo Croce <[email protected]> wrote:
>
> > I was web surfing and using gimp when:
> >
> > EXT4-fs error (device sda2): ext4_mb_generate_buddy:756: group 199,
> > 9414 clusters in bitmap, 9500 in gd; block bitmap corrupt.
>
> I was about to post a related question to this list. I am also seeing these
> kind of errors when using ext4 on latest mainline (I began testing with 3.15
> where I saw this and now in 3.16-rc3 it is still there).
> It happens almost instantly when power-cycling the system (unclean shutdown).
> The next time the system boots, I get these errors.

Hmm... David, could you also send me your dmesg when this happens? (I hope you
can figure out a way to stash it somewhere...)

I'm particularly interested in what the FS might do with regard to post-crash
recovery.

--D

>
> AFAICT, you are using a pretty recent kernel. Which version exactly?
>
> > Aborting journal on device sda2-8.
> > EXT4-fs (sda2): Remounting filesystem read-only
> > ------------[ cut here ]------------
> > WARNING: CPU: 6 PID: 4134 at fs/ext4/ext4_jbd2.c:259
> > __ext4_handle_dirty_metadata+0x18e/0x1d0()
> > Modules linked in: snd_hda_codec_hdmi snd_hda_codec_realtek
> > snd_hda_codec_generic ecb uvcvideo videobuf2_vmalloc videobuf2_memops
> > videobuf2_core videodev ath3k btusb rts5139(C) ctr ccm iTCO_wdt bnep
> > rfcomm bluetooth nls_iso8859_1 vfat fat arc4 intel_rapl
> > x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm
> > snd_hda_intel snd_hda_controller snd_hda_codec snd_hwdep snd_pcm
> > aesni_intel aes_x86_64 snd_seq_midi snd_seq_midi_event ath9k led_class
> > glue_helper ath9k_common lrw gf128mul ath9k_hw ablk_helper cryptd ath
> > mac80211 snd_rawmidi snd_seq cfg80211 radeon microcode rfkill
> > snd_timer snd_seq_device sr_mod psmouse r8169 snd cdrom i915 lpc_ich
> > soundcore ttm mii mfd_core drm_kms_helper drm intel_gtt agpgart
> > ehci_pci mei_me xhci_hcd tpm_infineon ehci_hcd video mei wmi tpm
> > backlight
> > CPU: 6 PID: 4134 Comm: gimp-2.8 Tainted: G C 3.15.0 #6
> > 0000000000000009 ffffffff813acbdd 0000000000000000 ffffffff8103de3d
> > ffff8802365231a0 00000000ffffffe2 0000000000000000 ffff8800b90816c0
> > ffffffff814205a0 ffffffff8118879e 0000000000000005 ffff8802365231a0
> > Call Trace:
> > [<ffffffff813acbdd>] ? dump_stack+0x41/0x51
> > [<ffffffff8103de3d>] ? warn_slowpath_common+0x6d/0x90
> > [<ffffffff8118879e>] ? __ext4_handle_dirty_metadata+0x18e/0x1d0
> > [<ffffffff8116e130>] ? ext4_dirty_inode+0x20/0x50
> > [<ffffffff811903e9>] ? ext4_free_blocks+0x539/0xa40
> > [<ffffffff8118468b>] ? ext4_ext_remove_space+0x83b/0xe60
> > [<ffffffff81186a58>] ? ext4_ext_truncate+0x98/0xc0
> > [<ffffffff8116c985>] ? ext4_truncate+0x2b5/0x300
> > [<ffffffff8116d3d8>] ? ext4_evict_inode+0x3d8/0x410
> > [<ffffffff81114a46>] ? evict+0xa6/0x160
> > [<ffffffff81109346>] ? do_unlinkat+0x186/0x2a0
> > [<ffffffff8110e51e>] ? SyS_getdents+0xde/0x100
> > [<ffffffff8110e1d0>] ? fillonedir+0xd0/0xd0
> > [<ffffffff813b2626>] ? system_call_fastpath+0x1a/0x1f
> > ---[ end trace 795411398e41fbcb ]---
> > EXT4: jbd2_journal_dirty_metadata failed: handle type 5 started at
> > line 241, credits 91/91, errcode -30
> > EXT4: jbd2_journal_dirty_metadata failed: handle type 5 started at
> > line 241, credits 91/91, errcode -30<2>EXT4-fs error (device sda2) in
> > ext4_free_blocks:4867: Journal has aborted
> > EXT4-fs error (device sda2): ext4_ext_rm_leaf:2731: inode #8257653:
> > block 6520936: comm gimp-2.8: journal_dirty_metadata failed: handle
> > type 5 started at line 241, credits 91/91, errcode -30
> > EXT4-fs error (device sda2) in ext4_ext_remove_space:3018: Journal has
> > aborted EXT4-fs error (device sda2) in ext4_ext_truncate:4666: Journal has
> > aborted EXT4-fs error (device sda2) in ext4_reserve_inode_write:4877: Journal
> > has aborted
> > EXT4-fs error (device sda2) in ext4_truncate:3788: Journal has aborted
> > EXT4-fs error (device sda2) in ext4_reserve_inode_write:4877: Journal
> > has aborted
> > EXT4-fs error (device sda2) in ext4_orphan_del:2684: Journal has aborted
> > EXT4-fs error (device sda2) in ext4_reserve_inode_write:4877: Journal
> > has aborted
>
> I did not get these errors. I suspect this may be a consequence of FS
> corruption due to a bug in etx4.
>
> Here's why I suspect a bug:
>
> I am running latest git head (3.16-rc3+ as of yesterday) on an ARM system with
> eMMC flash. The eMMC is formatted in SLC mode ("enhanced" mode according to
> eMMC 4.41) and "reliable-writes" are enabled, so power-cycling should not
> cause FS corruption in presence of a journal.
>
> I can format the eMMC device either as EXT3 or EXT4 for the test. After
> formatting and writing the rootfs to the partition I can boot successfully in
> either situation. Once booted from eMMC, I start bonnie++ (to just stress the
> FS for a while), and after a minute or so the board is power-cycled while
> bonnie++ is still running.
>
> Next time I boot the situation is this:
>
> With EXT3: All seems fine, journal is replayed, no errors. I can repeat this as
> many times as I want, FS stays consistent.
>
> With EXT4: After just one power cycle I start getting this:
>
> [ 7.603871] EXT4-fs error (device mmcblk0p2): ext4_mb_generate_buddy:757: group 1, 8542 clusters in bitmap, 8550 in gd; block bitmap corrupt.
> [ 7.616743] JBD2: Spotted dirty metadata buffer (dev = mmcblk0p2, blocknr = 0). There's a risk of filesystem corruption in case of system crash.
>
> If I continue the test, it doesn't take long and serious corruption starts
> occurring.
>
> Again, with EXT3 I am unable to detect any problems.
>
> Best regards,
>
> --
> David Jander
> Protonic Holland.

2014-07-01 12:07:30

by Jaehoon Chung

Subject: Re: ext4: journal has aborted

Hi,

I am interested in this problem because I have also run into the same thing.
Is it a journal problem?

I am using Linux version 3.16.0-rc3.

[ 3.866449] EXT4-fs error (device mmcblk0p13): ext4_mb_generate_buddy:756: group 0, 20490 clusters in bitmap, 20488 in gd; block bitmap corrupt.
[ 3.877937] Aborting journal on device mmcblk0p13-8.
[ 3.885025] Kernel panic - not syncing: EXT4-fs (device mmcblk0p13): panic forced after error
[ 3.885025]
[ 3.891751] CPU: 1 PID: 1488 Comm: mv Not tainted 3.16.0-rc3-00206-gc0f4eea0-dirty #2
[ 3.899594] [<c0016258>] (unwind_backtrace) from [<c00127d4>] (show_stack+0x10/0x14)
[ 3.907294] [<c00127d4>] (show_stack) from [<c051f458>] (dump_stack+0x78/0xc8)
[ 3.914495] [<c051f458>] (dump_stack) from [<c051cd28>] (panic+0x8c/0x1f0)
[ 3.921354] [<c051cd28>] (panic) from [<c018eb38>] (ext4_handle_error+0x68/0xa8)
[ 3.928730] [<c018eb38>] (ext4_handle_error) from [<c018f584>] (__ext4_grp_locked_error+0x88/0x1c0)
[ 3.937759] [<c018f584>] (__ext4_grp_locked_error) from [<c01a1e04>] (ext4_mb_generate_buddy+0x260/0x398)
[ 3.947306] [<c01a1e04>] (ext4_mb_generate_buddy) from [<c01a3d20>] (ext4_mb_init_cache+0x2fc/0x7fc)
[ 3.956433] [<c01a3d20>] (ext4_mb_init_cache) from [<c01a43d8>] (ext4_mb_init_group+0x1b8/0x234)
[ 3.965186] [<c01a43d8>] (ext4_mb_init_group) from [<c01a4950>] (ext4_mb_load_buddy+0x34c/0x380)
[ 3.973952] [<c01a4950>] (ext4_mb_load_buddy) from [<c01a9918>] (ext4_free_blocks+0x47c/0xd30)
[ 3.982545] [<c01a9918>] (ext4_free_blocks) from [<c019b440>] (ext4_ext_remove_space+0xae0/0x1474)
[ 3.991486] [<c019b440>] (ext4_ext_remove_space) from [<c019df68>] (ext4_ext_truncate+0xb4/0xd4)
[ 4.000251] [<c019df68>] (ext4_ext_truncate) from [<c0178ac4>] (ext4_truncate+0x37c/0x43c)
[ 4.008497] [<c0178ac4>] (ext4_truncate) from [<c017ba90>] (ext4_evict_inode+0x334/0x4f4)
[ 4.016660] [<c017ba90>] (ext4_evict_inode) from [<c01229b0>] (evict+0x94/0x158)
[ 4.024036] [<c01229b0>] (evict) from [<c011f1fc>] (__dentry_kill+0x1dc/0x22c)
[ 4.031238] [<c011f1fc>] (__dentry_kill) from [<c011f924>] (dput+0xa8/0x1ac)
[ 4.038282] [<c011f924>] (dput) from [<c0119a1c>] (SyS_renameat2+0x250/0x36c)
[ 4.045396] [<c0119a1c>] (SyS_renameat2) from [<c0119b7c>] (SyS_rename+0x28/0x30)
[ 4.052859] [<c0119b7c>] (SyS_rename) from [<c000eea0>] (ret_fast_syscall+0x0/0x3c)

I am using an eMMC card. I would like to know where to start debugging this problem.
If anybody has a solution, please let me know.

Best Regards,
Jaehoon Chung

On 07/01/2014 05:42 PM, Darrick J. Wong wrote:
> On Tue, Jul 01, 2014 at 08:26:19AM +0200, David Jander wrote:
>>
>> Hi,
>>
>> On Mon, 30 Jun 2014 23:30:10 +0200
>> Matteo Croce <[email protected]> wrote:
>>
>>> I was web surfing and using gimp when:
>>>
>>> EXT4-fs error (device sda2): ext4_mb_generate_buddy:756: group 199,
>>> 9414 clusters in bitmap, 9500 in gd; block bitmap corrupt.
>>
>> I was about to post a related question to this list. I am also seeing these
>> kind of errors when using ext4 on latest mainline (I began testing with 3.15
>> where I saw this and now in 3.16-rc3 it is still there).
>> It happens almost instantly when power-cycling the system (unclean shutdown).
>> The next time the system boots, I get these errors.
>>
>> AFAICT, you are using a pretty recent kernel. Which version exactly?
>>
>>> Aborting journal on device sda2-8.
>>> EXT4-fs (sda2): Remounting filesystem read-only
>
> Matteo, could you please post the full dmesg log somewhere? I'm interested in
> what happens before all this happens, because...
>
>>> ------------[ cut here ]------------
>>> WARNING: CPU: 6 PID: 4134 at fs/ext4/ext4_jbd2.c:259
>>> __ext4_handle_dirty_metadata+0x18e/0x1d0()
>>> Modules linked in: snd_hda_codec_hdmi snd_hda_codec_realtek
>>> snd_hda_codec_generic ecb uvcvideo videobuf2_vmalloc videobuf2_memops
>>> videobuf2_core videodev ath3k btusb rts5139(C) ctr ccm iTCO_wdt bnep
>>> rfcomm bluetooth nls_iso8859_1 vfat fat arc4 intel_rapl
>>> x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm
>>> snd_hda_intel snd_hda_controller snd_hda_codec snd_hwdep snd_pcm
>>> aesni_intel aes_x86_64 snd_seq_midi snd_seq_midi_event ath9k led_class
>>> glue_helper ath9k_common lrw gf128mul ath9k_hw ablk_helper cryptd ath
>>> mac80211 snd_rawmidi snd_seq cfg80211 radeon microcode rfkill
>>> snd_timer snd_seq_device sr_mod psmouse r8169 snd cdrom i915 lpc_ich
>>> soundcore ttm mii mfd_core drm_kms_helper drm intel_gtt agpgart
>>> ehci_pci mei_me xhci_hcd tpm_infineon ehci_hcd video mei wmi tpm
>>> backlight
>>> CPU: 6 PID: 4134 Comm: gimp-2.8 Tainted: G C 3.15.0 #6
>>> 0000000000000009 ffffffff813acbdd 0000000000000000 ffffffff8103de3d
>>> ffff8802365231a0 00000000ffffffe2 0000000000000000 ffff8800b90816c0
>>> ffffffff814205a0 ffffffff8118879e 0000000000000005 ffff8802365231a0
>>> Call Trace:
>>> [<ffffffff813acbdd>] ? dump_stack+0x41/0x51
>>> [<ffffffff8103de3d>] ? warn_slowpath_common+0x6d/0x90
>>> [<ffffffff8118879e>] ? __ext4_handle_dirty_metadata+0x18e/0x1d0
>>> [<ffffffff8116e130>] ? ext4_dirty_inode+0x20/0x50
>>> [<ffffffff811903e9>] ? ext4_free_blocks+0x539/0xa40
>>> [<ffffffff8118468b>] ? ext4_ext_remove_space+0x83b/0xe60
>>> [<ffffffff81186a58>] ? ext4_ext_truncate+0x98/0xc0
>>> [<ffffffff8116c985>] ? ext4_truncate+0x2b5/0x300
>>> [<ffffffff8116d3d8>] ? ext4_evict_inode+0x3d8/0x410
>>> [<ffffffff81114a46>] ? evict+0xa6/0x160
>>> [<ffffffff81109346>] ? do_unlinkat+0x186/0x2a0
>>> [<ffffffff8110e51e>] ? SyS_getdents+0xde/0x100
>>> [<ffffffff8110e1d0>] ? fillonedir+0xd0/0xd0
>>> [<ffffffff813b2626>] ? system_call_fastpath+0x1a/0x1f
>>> ---[ end trace 795411398e41fbcb ]---
>>> EXT4: jbd2_journal_dirty_metadata failed: handle type 5 started at
>>> line 241, credits 91/91, errcode -30
>>> EXT4: jbd2_journal_dirty_metadata failed: handle type 5 started at
>>> line 241, credits 91/91, errcode -30<2>EXT4-fs error (device sda2) in
>>> ext4_free_blocks:4867: Journal has aborted
>>> EXT4-fs error (device sda2): ext4_ext_rm_leaf:2731: inode #8257653:
>>> block 6520936: comm gimp-2.8: journal_dirty_metadata failed: handle
>>> type 5 started at line 241, credits 91/91, errcode -30
>>> EXT4-fs error (device sda2) in ext4_ext_remove_space:3018: Journal has
>>> aborted EXT4-fs error (device sda2) in ext4_ext_truncate:4666: Journal has
>>> aborted EXT4-fs error (device sda2) in ext4_reserve_inode_write:4877: Journal
>>> has aborted
>>> EXT4-fs error (device sda2) in ext4_truncate:3788: Journal has aborted
>>> EXT4-fs error (device sda2) in ext4_reserve_inode_write:4877: Journal
>>> has aborted
>>> EXT4-fs error (device sda2) in ext4_orphan_del:2684: Journal has aborted
>>> EXT4-fs error (device sda2) in ext4_reserve_inode_write:4877: Journal
>>> has aborted
>>
>> I did not get these errors. I suspect this may be a consequence of FS
>> corruption due to a bug in etx4.
>>
>> Here's why I suspect a bug:
>>
>> I am running latest git head (3.16-rc3+ as of yesterday) on an ARM system with
>> eMMC flash. The eMMC is formatted in SLC mode ("enhanced" mode according to
>> eMMC 4.41) and "reliable-writes" are enabled, so power-cycling should not
>> cause FS corruption in presence of a journal.
>>
>> I can format the eMMC device either as EXT3 or EXT4 for the test. After
>> formatting and writing the rootfs to the partition I can boot successfully in
>> either situation. Once booted from eMMC, I start bonnie++ (to just stress the
>> FS for a while), and after a minute or so the board is power-cycled while
>> bonnie++ is still running.
>>
>> Next time I boot the situation is this:
>>
>> With EXT3: All seems fine, journal is replayed, no errors. I can repeat this as
>> many times as I want, FS stays consistent.
>>
>> With EXT4: After just one power cycle I start getting this:
>>
>> [ 7.603871] EXT4-fs error (device mmcblk0p2): ext4_mb_generate_buddy:757: group 1, 8542 clusters in bitmap, 8550 in gd; block bitmap corrupt.
>> [ 7.616743] JBD2: Spotted dirty metadata buffer (dev = mmcblk0p2, blocknr = 0). There's a risk of filesystem corruption in case of system crash.
>
> I've been seeing this same set of symptoms with 3.15.0 on various SSDs (Samsung
> 840 Pro, Crucial M4). It seems that something (upstart?) is holding open some
> file or other during poweroff, which means that the root fs can't be unmounted
> or even remounted rw. I also noticed that the next time the system comes up,
> the kernel tells me that it has to process the inode orphan list as part of
> recovery.
>
> Shortly after the orphan list gets processed, I get that message and the FS
> goes ro. A subsequent fsck run reveals that the block bitmap is indeed
> incorrect in that block group, and when I bd the blocks that are incorrect in
> the bitmap, I see what could be some kind of upstart log file. Either way, I
> suspect some bug in orphan processing.
>
> <shrug> I don't know if this is specific to SSDs or spinning rust. Right now
> I've simply rigged the initramfs to e2fsck -p the root fs before mounting it,
> which seems(?) to have patched around it for now.
>
>> If I continue the test, it doesn't take long and serious corruption starts
>> occurring.
>
> You're getting actual FS data corruption too? Or more of those messages?
>
> --D
>>
>> Again, with EXT3 I am unable to detect any problems.
>>
>> Best regards,
>>
>> --
>> David Jander
>> Protonic Holland.


2014-07-01 13:50:00

by David Jander

Subject: Re: ext4: journal has aborted


Hi Jaehoon,

On Tue, 01 Jul 2014 21:07:27 +0900
Jaehoon Chung <[email protected]> wrote:

> Hi,
>
> i have interesting for this problem..Because i also found the same problem..
> Is it Journal problem?
>
> I used the Linux version 3.16.0-rc3.
>
> [ 3.866449] EXT4-fs error (device mmcblk0p13): ext4_mb_generate_buddy:756: group 0, 20490 clusters in bitmap, 20488 in gd; block bitmap corrupt.
> [ 3.877937] Aborting journal on device mmcblk0p13-8.
> [ 3.885025] Kernel panic - not syncing: EXT4-fs (device mmcblk0p13): panic forced after error
> [ 3.885025]
> [ 3.891751] CPU: 1 PID: 1488 Comm: mv Not tainted 3.16.0-rc3-00206-gc0f4eea0-dirty #2
> [ 3.899594] [<c0016258>] (unwind_backtrace) from [<c00127d4>] (show_stack+0x10/0x14)
> [ 3.907294] [<c00127d4>] (show_stack) from [<c051f458>] (dump_stack+0x78/0xc8)
> [ 3.914495] [<c051f458>] (dump_stack) from [<c051cd28>] (panic+0x8c/0x1f0)
> [ 3.921354] [<c051cd28>] (panic) from [<c018eb38>] (ext4_handle_error+0x68/0xa8)
> [ 3.928730] [<c018eb38>] (ext4_handle_error) from [<c018f584>] (__ext4_grp_locked_error+0x88/0x1c0)
> [ 3.937759] [<c018f584>] (__ext4_grp_locked_error) from [<c01a1e04>] (ext4_mb_generate_buddy+0x260/0x398)
> [ 3.947306] [<c01a1e04>] (ext4_mb_generate_buddy) from [<c01a3d20>] (ext4_mb_init_cache+0x2fc/0x7fc)
> [ 3.956433] [<c01a3d20>] (ext4_mb_init_cache) from [<c01a43d8>] (ext4_mb_init_group+0x1b8/0x234)
> [ 3.965186] [<c01a43d8>] (ext4_mb_init_group) from [<c01a4950>] (ext4_mb_load_buddy+0x34c/0x380)
> [ 3.973952] [<c01a4950>] (ext4_mb_load_buddy) from [<c01a9918>] (ext4_free_blocks+0x47c/0xd30)
> [ 3.982545] [<c01a9918>] (ext4_free_blocks) from [<c019b440>] (ext4_ext_remove_space+0xae0/0x1474)
> [ 3.991486] [<c019b440>] (ext4_ext_remove_space) from [<c019df68>] (ext4_ext_truncate+0xb4/0xd4)
> [ 4.000251] [<c019df68>] (ext4_ext_truncate) from [<c0178ac4>] (ext4_truncate+0x37c/0x43c)
> [ 4.008497] [<c0178ac4>] (ext4_truncate) from [<c017ba90>] (ext4_evict_inode+0x334/0x4f4)
> [ 4.016660] [<c017ba90>] (ext4_evict_inode) from [<c01229b0>] (evict+0x94/0x158)
> [ 4.024036] [<c01229b0>] (evict) from [<c011f1fc>] (__dentry_kill+0x1dc/0x22c)
> [ 4.031238] [<c011f1fc>] (__dentry_kill) from [<c011f924>] (dput+0xa8/0x1ac)
> [ 4.038282] [<c011f924>] (dput) from [<c0119a1c>] (SyS_renameat2+0x250/0x36c)
> [ 4.045396] [<c0119a1c>] (SyS_renameat2) from [<c0119b7c>] (SyS_rename+0x28/0x30)
> [ 4.052859] [<c0119b7c>] (SyS_rename) from [<c000eea0>] (ret_fast_syscall+0x0/0x3c)
>
> I used with the eMMC card. I want to know the debugging point for this problem.
> If anybody have solution, let me know, plz.

You can try disabling the mount option "errors=panic"; I suspect the actual
error is harmless.
OTOH, you then cannot see the stack trace anymore.
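
(For example, the error behaviour can be changed persistently with tune2fs,
or per mount via fstab; the device name is just an example:

  tune2fs -e remount-ro /dev/mmcblk0p13   # or "-e continue"
  # or in /etc/fstab:
  #   /dev/mmcblk0p13  /data  ext4  defaults,errors=remount-ro  0  2
)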

AFAICS, it seems to originate in ext4_ext_remove_space() in all the cases I
have seen, and it happens more quickly if the filesystem was not unmounted
properly, so maybe the bug is related to the journal code's accounting of
free space. Let's wait and see what the experts say about this...

Regards,

--
David Jander
Protonic Holland.

2014-07-01 15:58:26

by Theodore Ts'o

Subject: Re: ext4: journal has aborted

On Tue, Jul 01, 2014 at 09:07:27PM +0900, Jaehoon Chung wrote:
> Hi,
>
> i have interesting for this problem..Because i also found the same problem..
> Is it Journal problem?
>
> I used the Linux version 3.16.0-rc3.
>
> [ 3.866449] EXT4-fs error (device mmcblk0p13): ext4_mb_generate_buddy:756: group 0, 20490 clusters in bitmap, 20488 in gd; block bitmap corrupt.
> [ 3.877937] Aborting journal on device mmcblk0p13-8.
> [ 3.885025] Kernel panic - not syncing: EXT4-fs (device mmcblk0p13): panic forced after error

This message means that the file system has detected an inconsistency
--- specifically, that the number of blocks marked as in use in the
allocation bitmap is different from what is in the block group
descriptors.

The file system has been marked to force a panic after an error, at
which point e2fsck will be able to repair the inconsistency.
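
(For instance, with the filesystem unmounted, something like this will first
show the mismatch without changing anything and then repair it; the device
name is just an example:

  e2fsck -fn /dev/mmcblk0p13    # forced, read-only check: report problems only
  e2fsck -fy /dev/mmcblk0p13    # forced check, fix the bitmaps automatically
)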

What's not clear is *why* this happened. It can happen simply
because of a hardware problem. (In particular, not all mmc flash
devices handle power failures gracefully.) Or it could be a cosmic
ray, or it might be a kernel bug.

Normally I would chalk this up to a hardware bug, but it's possible
that it is a kernel bug. If people can reliably reproduce the problem
where no power failures or other unclean shutdowns were involved
(since the last time the file system was checked using e2fsck) then
that would be really interesting.

We should probably also change the message so that it is a bit
more understandable to people who aren't ext4 developers.

- Ted

2014-07-01 16:14:48

by Lukas Czerner

Subject: Re: ext4: journal has aborted

On Tue, 1 Jul 2014, Theodore Ts'o wrote:

> Date: Tue, 1 Jul 2014 11:58:12 -0400
> From: Theodore Ts'o <[email protected]>
> To: Jaehoon Chung <[email protected]>
> Cc: Darrick J. Wong <[email protected]>,
> Matteo Croce <[email protected]>, David Jander <[email protected]>,
> [email protected]
> Subject: Re: ext4: journal has aborted
>
> On Tue, Jul 01, 2014 at 09:07:27PM +0900, Jaehoon Chung wrote:
> > Hi,
> >
> > i have interesting for this problem..Because i also found the same problem..
> > Is it Journal problem?
> >
> > I used the Linux version 3.16.0-rc3.
> >
> > [ 3.866449] EXT4-fs error (device mmcblk0p13): ext4_mb_generate_buddy:756: group 0, 20490 clusters in bitmap, 20488 in gd; block bitmap corrupt.
> > [ 3.877937] Aborting journal on device mmcblk0p13-8.
> > [ 3.885025] Kernel panic - not syncing: EXT4-fs (device mmcblk0p13): panic forced after error
>
> This message means that the file system has detected an inconsistency
> --- specifically, that the number of blocks marked as in use in the
> allocation bbitmap is different from what is in the block group
> descriptors.
>
> The file system has been marked to force a panic after an error, at
> which point e2fsck will be able to repair the inconsistency.
>
> What's not clear is *how* the why this happened. It can happen simply
> because of a hardware problem. (In particular, not all mmc flash
> devices handle power failures gracefully.) Or it could be a cosmic,
> ray, or it might be a kernel bug.
>
> Normally I would chalk this up to a hardware bug, bug it's possible
> that it is a kernel bug. If people can reliably reproduce the problem
> where no power failures or other unclean shutdowns were involved
> (since the last time file system has been checked using e2fsck) then
> that would be realy interesting.
>
> We should probably also change the message so the message is a bit
> more understanding to people who aren't ext4 developers.

Yes, we probably should :) Maybe we could make it clearer that there
might be a wide variety of reasons for this, and that one of the
likely causes might be hardware related.

-Lukas

>
> - Ted

2014-07-01 16:36:49

by Eric Whitney

Subject: Re: ext4: journal has aborted

* Theodore Ts'o <[email protected]>:
> On Tue, Jul 01, 2014 at 09:07:27PM +0900, Jaehoon Chung wrote:
> > Hi,
> >
> > i have interesting for this problem..Because i also found the same problem..
> > Is it Journal problem?
> >
> > I used the Linux version 3.16.0-rc3.
> >
> > [ 3.866449] EXT4-fs error (device mmcblk0p13): ext4_mb_generate_buddy:756: group 0, 20490 clusters in bitmap, 20488 in gd; block bitmap corrupt.
> > [ 3.877937] Aborting journal on device mmcblk0p13-8.
> > [ 3.885025] Kernel panic - not syncing: EXT4-fs (device mmcblk0p13): panic forced after error
>
> This message means that the file system has detected an inconsistency
> --- specifically, that the number of blocks marked as in use in the
> allocation bbitmap is different from what is in the block group
> descriptors.
>
> The file system has been marked to force a panic after an error, at
> which point e2fsck will be able to repair the inconsistency.
>
> What's not clear is *how* the why this happened. It can happen simply
> because of a hardware problem. (In particular, not all mmc flash
> devices handle power failures gracefully.) Or it could be a cosmic,
> ray, or it might be a kernel bug.
>
> Normally I would chalk this up to a hardware bug, bug it's possible
> that it is a kernel bug. If people can reliably reproduce the problem
> where no power failures or other unclean shutdowns were involved
> (since the last time file system has been checked using e2fsck) then
> that would be realy interesting.

Hi Ted:

I saw a similar failure during a 3.16-rc3 (plus ext4 stable fixes plus the
msync patch) regression run on the Pandaboard this morning. A generic/068 hang
on data_journal required a reboot for recovery (an old bug, though rarer lately).
On reboot, the root filesystem - default 4K, and on an SD card - went read-only
after the same sort of bad block bitmap / journal abort sequence. Rebooting
forced an fsck that cleared up the problem. The target test filesystem was on
a USB-attached disk, and it did not exhibit the same problems on recovery.

So, it looks like there might be more than just hardware involved here,
although eMMC/flash might be a common denominator. I'll see if I can come up
with a reliable reproducer once the regression pass is finished if someone
doesn't beat me to it.

Eric


>
> We should probably also change the message so the message is a bit
> more understanding to people who aren't ext4 developers.
>
> - Ted

2014-07-02 08:34:44

by Matteo Croce

Subject: Re: ext4: journal has aborted

A similar issue on an x86 router:

# dmesg
Initializing cgroup subsys cpu
Linux version 3.15.0-alix (root@alix) (gcc version 4.8.3 (Debian
4.8.3-2) ) #2 Mon Jun 9 16:54:44 CEST 2014
KERNEL supported cpus:
AMD AuthenticAMD
e820: BIOS-provided physical RAM map:
BIOS-e820: [mem 0x0000000000000000-0x000000000009ffff] usable
BIOS-e820: [mem 0x00000000000f0000-0x00000000000fffff] reserved
BIOS-e820: [mem 0x0000000000100000-0x000000000fffffff] usable
BIOS-e820: [mem 0x00000000fff00000-0x00000000ffffffff] reserved
Notice: NX (Execute Disable) protection missing in CPU!
e820: update [mem 0x00000000-0x00000fff] usable ==> reserved
e820: remove [mem 0x000a0000-0x000fffff] usable
e820: last_pfn = 0x10000 max_arch_pfn = 0x100000
initial memory mapped: [mem 0x00000000-0x017fffff]
Base memory trampoline at [c009b000] 9b000 size 16384
init_memory_mapping: [mem 0x00000000-0x000fffff]
[mem 0x00000000-0x000fffff] page 4k
init_memory_mapping: [mem 0x0fc00000-0x0fffffff]
[mem 0x0fc00000-0x0fffffff] page 2M
init_memory_mapping: [mem 0x08000000-0x0fbfffff]
[mem 0x08000000-0x0fbfffff] page 2M
init_memory_mapping: [mem 0x00100000-0x07ffffff]
[mem 0x00100000-0x003fffff] page 4k
[mem 0x00400000-0x07ffffff] page 2M
256MB LOWMEM available.
mapped low ram: 0 - 10000000
low ram: 0 - 10000000
Zone ranges:
DMA [mem 0x00001000-0x00ffffff]
Normal [mem 0x01000000-0x0fffffff]
Movable zone start for each node
Early memory node ranges
node 0: [mem 0x00001000-0x0009ffff]
node 0: [mem 0x00100000-0x0fffffff]
On node 0 totalpages: 65439
DMA zone: 32 pages used for memmap
DMA zone: 0 pages reserved
DMA zone: 3999 pages, LIFO batch:0
Normal zone: 480 pages used for memmap
Normal zone: 61440 pages, LIFO batch:15
e820: [mem 0x10000000-0xffefffff] available for PCI devices
pcpu-alloc: s0 r0 d32768 u32768 alloc=1*32768
pcpu-alloc: [0] 0
Built 1 zonelists in Zone order, mobility grouping on. Total pages: 64927
Kernel command line: BOOT_IMAGE=/boot/vmlinuz-3.15.0-alix
root=/dev/sda1 ro console=ttyS0,115200 panic=1 init=/bin/systemd
PID hash table entries: 1024 (order: 0, 4096 bytes)
Dentry cache hash table entries: 32768 (order: 5, 131072 bytes)
Inode-cache hash table entries: 16384 (order: 4, 65536 bytes)
Initializing CPU#0
Memory: 255748K/261756K available (2409K kernel code, 151K rwdata,
628K rodata, 160K init, 236K bss, 6008K reserved)
virtual kernel memory layout:
fixmap : 0xfffe5000 - 0xfffff000 ( 104 kB)
vmalloc : 0xd0800000 - 0xfffe3000 ( 759 MB)
lowmem : 0xc0000000 - 0xd0000000 ( 256 MB)
.init : 0xc1320000 - 0xc1348000 ( 160 kB)
.data : 0xc125aaef - 0xc131ece0 ( 784 kB)
.text : 0xc1000000 - 0xc125aaef (2410 kB)
Checking if this processor honours the WP bit even in supervisor mode...Ok.
SLUB: HWalign=32, Order=0-3, MinObjects=0, CPUs=1, Nodes=1
NR_IRQS:16 nr_irqs:16 16
CPU 0 irqstacks, hard=cf808000 soft=cf80a000
console [ttyS0] enabled
tsc: Fast TSC calibration using PIT
tsc: Detected 498.030 MHz processor
Calibrating delay loop (skipped), value calculated using timer
frequency.. 996.06 BogoMIPS (lpj=4980300)
pid_max: default: 32768 minimum: 301
Mount-cache hash table entries: 1024 (order: 0, 4096 bytes)
Mountpoint-cache hash table entries: 1024 (order: 0, 4096 bytes)
Last level iTLB entries: 4KB 0, 2MB 0, 4MB 0
Last level dTLB entries: 4KB 0, 2MB 0, 4MB 0, 1GB 0
tlb_flushall_shift: -1
CPU: Geode(TM) Integrated Processor by AMD PCS (fam: 06, model: 0a,
stepping: 02)
Performance Events:
no APIC, boot with the "lapic" boot parameter to force-enable it.
no hardware sampling interrupt available.
Broken PMU hardware detected, using software events only.
Failed to access perfctr msr (MSR c0010004 is 0)
devtmpfs: initialized
NET: Registered protocol family 16
cpuidle: using governor ladder
cpuidle: using governor menu
PCI: PCI BIOS revision 2.10 entry at 0xfced9, last bus=0
PCI: Using configuration type 1 for base access
SCSI subsystem initialized
libata version 3.00 loaded.
PCI: Probing PCI hardware
PCI: root bus 00: using default resources
PCI: Probing PCI hardware (bus 00)
PCI host bridge to bus 0000:00
pci_bus 0000:00: root bus resource [io 0x0000-0xffff]
pci_bus 0000:00: root bus resource [mem 0x00000000-0xffffffff]
pci_bus 0000:00: No busn resource found for root bus, will use [bus 00-ff]
pci 0000:00:01.0: [1022:2080] type 00 class 0x060000
pci 0000:00:01.0: reg 0x10: [io 0xac1c-0xac1f]
pci 0000:00:01.2: [1022:2082] type 00 class 0x101000
pci 0000:00:01.2: reg 0x10: [mem 0xefff4000-0xefff7fff]
pci 0000:00:09.0: [1106:3053] type 00 class 0x020000
pci 0000:00:09.0: reg 0x10: [io 0x1000-0x10ff]
pci 0000:00:09.0: reg 0x14: [mem 0xe0000000-0xe00000ff]
pci 0000:00:09.0: supports D1 D2
pci 0000:00:09.0: PME# supported from D0 D1 D2 D3hot D3cold
pci 0000:00:0c.0: [168c:0029] type 00 class 0x028000
pci 0000:00:0c.0: reg 0x10: [mem 0xe0040000-0xe004ffff]
pci 0000:00:0c.0: PME# supported from D0 D3hot
pci 0000:00:0f.0: [1022:2090] type 00 class 0x060100
pci 0000:00:0f.0: reg 0x10: [io 0x6000-0x6007]
pci 0000:00:0f.0: reg 0x14: [io 0x6100-0x61ff]
pci 0000:00:0f.0: reg 0x18: [io 0x6200-0x623f]
pci 0000:00:0f.0: reg 0x20: [io 0x9d00-0x9d7f]
pci 0000:00:0f.0: reg 0x24: [io 0x9c00-0x9c3f]
pci 0000:00:0f.2: [1022:209a] type 00 class 0x010180
pci 0000:00:0f.2: reg 0x20: [io 0xff00-0xff0f]
pci 0000:00:0f.2: legacy IDE quirk: reg 0x10: [io 0x01f0-0x01f7]
pci 0000:00:0f.2: legacy IDE quirk: reg 0x14: [io 0x03f6]
pci 0000:00:0f.2: legacy IDE quirk: reg 0x18: [io 0x0170-0x0177]
pci 0000:00:0f.2: legacy IDE quirk: reg 0x1c: [io 0x0376]
pci 0000:00:0f.4: [1022:2094] type 00 class 0x0c0310
pci 0000:00:0f.4: reg 0x10: [mem 0xefffe000-0xefffefff]
pci 0000:00:0f.4: PME# supported from D0 D3hot D3cold
pci 0000:00:0f.5: [1022:2095] type 00 class 0x0c0320
pci 0000:00:0f.5: reg 0x10: [mem 0xefffd000-0xefffdfff]
pci 0000:00:0f.5: PME# supported from D0 D3hot D3cold
pci_bus 0000:00: busn_res: [bus 00-ff] end is updated to 00
PCI: pci_cache_line_size set to 32 bytes
Switched to clocksource pit
pci_bus 0000:00: resource 4 [io 0x0000-0xffff]
pci_bus 0000:00: resource 5 [mem 0x00000000-0xffffffff]
NET: Registered protocol family 2
TCP established hash table entries: 2048 (order: 1, 8192 bytes)
TCP bind hash table entries: 2048 (order: 1, 8192 bytes)
TCP: Hash tables configured (established 2048 bind 2048)
TCP: reno registered
UDP hash table entries: 256 (order: 0, 4096 bytes)
UDP-Lite hash table entries: 256 (order: 0, 4096 bytes)
NET: Registered protocol family 1
platform rtc_cmos: registered platform RTC device (no PNP device found)
futex hash table entries: 256 (order: -1, 3072 bytes)
msgmni has been set to 499
Block layer SCSI generic (bsg) driver version 0.4 loaded (major 252)
io scheduler noop registered
io scheduler deadline registered (default)
Serial: 8250/16550 driver, 1 ports, IRQ sharing disabled
serial8250: ttyS0 at I/O 0x3f8 (irq = 4, base_baud = 921600) is a NS16550A
scsi0 : pata_cs5536
scsi1 : pata_cs5536
ata1: PATA max UDMA/100 cmd 0x1f0 ctl 0x3f6 bmdma 0xff00 irq 14
ata2: DUMMY
rtc_cmos rtc_cmos: rtc core: registered rtc_cmos as rtc0
rtc_cmos rtc_cmos: alarms up to one day, 114 bytes nvram
TCP: cubic registered
NET: Registered protocol family 10
NET: Registered protocol family 17
rtc_cmos rtc_cmos: setting system clock to 2000-01-01 00:00:04 UTC (946684804)
ata1.00: CFA: , 20101012, max UDMA/100
ata1.00: 62537328 sectors, multi 0: LBA
ata1.00: limited to UDMA/33 due to 40-wire cable
ata1.00: configured for UDMA/33
scsi 0:0:0:0: Direct-Access ATA 2010 PQ: 0 ANSI: 5
sd 0:0:0:0: [sda] 62537328 512-byte logical blocks: (32.0 GB/29.8 GiB)
sd 0:0:0:0: [sda] Write Protect is off
sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
sd 0:0:0:0: [sda] Write cache: disabled, read cache: enabled, doesn't
support DPO or FUA
sda: sda1 sda2
sd 0:0:0:0: [sda] Attached SCSI disk
EXT4-fs (sda1): couldn't mount as ext3 due to feature incompatibilities
EXT4-fs (sda1): couldn't mount as ext2 due to feature incompatibilities
EXT4-fs (sda1): INFO: recovery required on readonly filesystem
EXT4-fs (sda1): write access will be enabled during recovery
Switched to clocksource tsc
EXT4-fs (sda1): orphan cleanup on readonly fs
EXT4-fs (sda1): 1 orphan inode deleted
EXT4-fs (sda1): recovery complete
EXT4-fs (sda1): mounted filesystem with ordered data mode. Opts: (null)
VFS: Mounted root (ext4 filesystem) readonly on device 8:1.
devtmpfs: mounted
Freeing unused kernel memory: 160K (c1320000 - c1348000)
Write protecting the kernel text: 2412k
Write protecting the kernel read-only data: 632k
systemd[1]: systemd 204 running in system mode. (+PAM +LIBWRAP +AUDIT
+SELINUX +IMA +SYSVINIT +LIBCRYPTSETUP +GCRYPT +ACL +XZ)
systemd[1]: Inserted module 'autofs4'
systemd[1]: Set hostname to <alix>.
random: systemd urandom read with 30 bits of entropy available
systemd[1]: Cannot add dependency job for unit
display-manager.service, ignoring: Unit display-manager.service failed
to load: No such file or directory. See system logs and 'systemctl
status display-manager.service' for details.
systemd[1]: Expecting device dev-ttyS0.device...
systemd[1]: Starting Forward Password Requests to Wall Directory Watch.
systemd[1]: Started Forward Password Requests to Wall Directory Watch.
systemd[1]: Starting Syslog Socket.
systemd[1]: Listening on Syslog Socket.
systemd[1]: Starting Delayed Shutdown Socket.
systemd[1]: Listening on Delayed Shutdown Socket.
systemd[1]: Starting /dev/initctl Compatibility Named Pipe.
systemd[1]: Listening on /dev/initctl Compatibility Named Pipe.
systemd[1]: Starting Dispatch Password Requests to Console Directory Watch.
systemd[1]: Started Dispatch Password Requests to Console Directory Watch.
systemd[1]: Starting Encrypted Volumes.
systemd[1]: Reached target Encrypted Volumes.
systemd[1]: Starting udev Kernel Socket.
systemd[1]: Listening on udev Kernel Socket.
systemd[1]: Starting udev Control Socket.
systemd[1]: Listening on udev Control Socket.
systemd[1]: Set up automount Arbitrary Executable File Formats File
System Automount Point.
systemd[1]: Starting Journal Socket.
systemd[1]: Listening on Journal Socket.
systemd[1]: Starting Syslog.
systemd[1]: Reached target Syslog.
systemd[1]: Mounted Huge Pages File System.
systemd[1]: Started Set Up Additional Binary Formats.
systemd[1]: Starting Create static device nodes in /dev...
systemd[1]: Starting Apply Kernel Variables...
systemd[1]: Starting Load Kernel Modules...
systemd[1]: Starting udev Coldplug all Devices...
systemd[1]: Starting Journal Service...
systemd[1]: Started Journal Service.
systemd[1]: Mounted POSIX Message Queue File System.
systemd[1]: Expecting device dev-sda2.device...
systemd[1]: Starting File System Check on Root Device...
cs5535-smb cs5535-smb: SCx200 device 'CS5535 ACB0' registered
cs5535-mfgpt cs5535-mfgpt: reserved resource region [io 0x6200-0x623f]
cs5535-mfgpt cs5535-mfgpt: 8 MFGPT timers available
cs5535-mfd 0000:00:0f.0: 5 devices registered.
systemd[1]: Started Create static device nodes in /dev.
systemd[1]: Started Apply Kernel Variables.
cs5535-mfgpt cs5535-mfgpt: registered timer 0
cs5535-clockevt: Registering MFGPT timer as a clock event, using IRQ 7
systemd[1]: Starting udev Kernel Device Manager...
systemd-udevd[317]: starting version 204
EXT4-fs (sda1): re-mounted. Opts: errors=remount-ro
cs5535-gpio cs5535-gpio: reserved resource region [io 0x6100-0x61ff]
via_rhine: v1.10-LK1.5.1 2010-10-09 Written by Donald Becker
via-rhine 0000:00:09.0 eth0: VIA Rhine III (Management Adapter) at
0xe0000000, 00:0d:b9:19:4c:bc, IRQ 10
via-rhine 0000:00:09.0 eth0: MII PHY found at address 1, status 0x7849
advertising 05e1 Link 0000
AMD Geode RNG detected
geode-aes: GEODE AES engine enabled.
cfg80211: Calling CRDA to update world regulatory domain
usbcore: registered new interface driver usbfs
usbcore: registered new interface driver hub
usbcore: registered new device driver usb
ehci_hcd: USB 2.0 'Enhanced' Host Controller (EHCI) Driver
ohci_hcd: USB 1.1 'Open' Host Controller (OHCI) Driver
ohci-pci: OHCI PCI platform driver
ohci-pci 0000:00:0f.4: OHCI PCI host controller
ohci-pci 0000:00:0f.4: new USB bus registered, assigned bus number 1
ohci-pci 0000:00:0f.4: irq 12, io mem 0xefffe000
ehci-pci: EHCI PCI platform driver
hub 1-0:1.0: USB hub found
hub 1-0:1.0: 4 ports detected
ehci-pci 0000:00:0f.5: EHCI Host Controller
ehci-pci 0000:00:0f.5: new USB bus registered, assigned bus number 2
ehci-pci 0000:00:0f.5: irq 12, io mem 0xefffd000
cfg80211: World regulatory domain updated:
cfg80211: DFS Master region: unset
cfg80211: (start_freq - end_freq @ bandwidth), (max_antenna_gain,
max_eirp), (dfs_cac_time)
cfg80211: (2402000 KHz - 2472000 KHz @ 40000 KHz), (N/A, 2000 mBm), (N/A)
cfg80211: (2457000 KHz - 2482000 KHz @ 40000 KHz), (N/A, 2000 mBm), (N/A)
cfg80211: (2474000 KHz - 2494000 KHz @ 20000 KHz), (N/A, 2000 mBm), (N/A)
cfg80211: (5170000 KHz - 5250000 KHz @ 80000 KHz), (N/A, 2000 mBm), (N/A)
cfg80211: (5735000 KHz - 5835000 KHz @ 80000 KHz), (N/A, 2000 mBm), (N/A)
cfg80211: (57240000 KHz - 63720000 KHz @ 2160000 KHz), (N/A, 0 mBm), (N/A)
ehci-pci 0000:00:0f.5: USB 2.0 started, EHCI 1.00
hub 2-0:1.0: USB hub found
hub 2-0:1.0: 4 ports detected
hub 1-0:1.0: USB hub found
hub 1-0:1.0: 4 ports detected
ath: EEPROM regdomain: 0x0
ath: EEPROM indicates default country code should be used
ath: doing EEPROM country->regdmn map search
ath: country maps to regdmn code: 0x3a
ath: Country alpha2 being used: US
ath: Regpair used: 0x3a
ieee80211 phy0: Selected rate control algorithm 'minstrel_ht'
ieee80211 phy0: Atheros AR9280 Rev:2 mem=0xd0940000, irq=9
IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
cfg80211: Calling CRDA for country: US
cfg80211: Regulatory domain changed to country: US
cfg80211: DFS Master region: unset
cfg80211: (start_freq - end_freq @ bandwidth), (max_antenna_gain,
max_eirp), (dfs_cac_time)
cfg80211: (2402000 KHz - 2472000 KHz @ 40000 KHz), (N/A, 3000 mBm), (N/A)
cfg80211: (5170000 KHz - 5250000 KHz @ 80000 KHz), (N/A, 1700 mBm), (N/A)
cfg80211: (5250000 KHz - 5330000 KHz @ 80000 KHz), (N/A, 2300 mBm), (0 s)
cfg80211: (5735000 KHz - 5835000 KHz @ 80000 KHz), (N/A, 3000 mBm), (N/A)
cfg80211: (57240000 KHz - 63720000 KHz @ 2160000 KHz), (N/A, 4000 mBm), (N/A)
Adding 858932k swap on /dev/sda2. Priority:-1 extents:1 across:858932k
random: nonblocking pool is initialized
IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
EXT4-fs error (device sda1): ext4_mb_generate_buddy:756: group 114,
24855 clusters in bitmap, 24856 in gd; block bitmap corrupt.
Aborting journal on device sda1-8.
EXT4-fs (sda1): Remounting filesystem read-only
# e2fsck -fy /dev/sda1
e2fsck 1.42.10 (18-May-2014)
/dev/sda1: recovering journal
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Free blocks count wrong (6708644, counted=6708655).
Fix? yes

Free inodes count wrong (1752623, counted=1752627).
Fix? yes


/dev/sda1: ***** FILE SYSTEM WAS MODIFIED *****
/dev/sda1: ***** REBOOT LINUX *****
/dev/sda1: 147917/1900544 files (0.1% non-contiguous), 893521/7602176 blocks

2014-07-01 18:36 GMT+02:00 Eric Whitney <[email protected]>:
> * Theodore Ts'o <[email protected]>:
>> On Tue, Jul 01, 2014 at 09:07:27PM +0900, Jaehoon Chung wrote:
>> > Hi,
>> >
>> > i have interesting for this problem..Because i also found the same problem..
>> > Is it Journal problem?
>> >
>> > I used the Linux version 3.16.0-rc3.
>> >
>> > [ 3.866449] EXT4-fs error (device mmcblk0p13): ext4_mb_generate_buddy:756: group 0, 20490 clusters in bitmap, 20488 in gd; block bitmap corrupt.
>> > [ 3.877937] Aborting journal on device mmcblk0p13-8.
>> > [ 3.885025] Kernel panic - not syncing: EXT4-fs (device mmcblk0p13): panic forced after error
>>
>> This message means that the file system has detected an inconsistency
>> --- specifically, that the number of blocks marked as in use in the
>> allocation bbitmap is different from what is in the block group
>> descriptors.
>>
>> The file system has been marked to force a panic after an error, at
>> which point e2fsck will be able to repair the inconsistency.
>>
>> What's not clear is *how* the why this happened. It can happen simply
>> because of a hardware problem. (In particular, not all mmc flash
>> devices handle power failures gracefully.) Or it could be a cosmic,
>> ray, or it might be a kernel bug.
>>
>> Normally I would chalk this up to a hardware bug, bug it's possible
>> that it is a kernel bug. If people can reliably reproduce the problem
>> where no power failures or other unclean shutdowns were involved
>> (since the last time file system has been checked using e2fsck) then
>> that would be realy interesting.
>
> Hi Ted:
>
> I saw a similar failure during 3.16-rc3 (plus ext4 stable fixes plus msync
> patch) regression on the Pandaboard this morning. A generic/068 hang
> on data_journal required a reboot for recovery (old bug, though rarer lately).
> On reboot, the root filesystem - default 4K, and on an SD card - went ro
> after the same sort of bad block bitmap / journal abort sequence. Rebooting
> forced a fsck that cleared up the problem. The target test filesystem was on
> a USB-attached disk, and it did not exhibit the same problems on recovery.
>
> So, it looks like there might be more than just hardware involved here,
> although eMMC/flash might be a common denominator. I'll see if I can come up
> with a reliable reproducer once the regression pass is finished if someone
> doesn't beat me to it.
>
> Eric
>
>
>>
>> We should probably also change the message so the message is a bit
>> more understanding to people who aren't ext4 developers.
>>
>> - Ted
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
>> the body of a message to [email protected]
>> More majordomo info at http://vger.kernel.org/majordomo-info.html



--
Matteo Croce
OpenWrt Developer

2014-07-02 09:44:19

by David Jander

[permalink] [raw]
Subject: Re: ext4: journal has aborted


Hi Ted,

On Tue, 1 Jul 2014 11:58:12 -0400
"Theodore Ts'o" <[email protected]> wrote:

> On Tue, Jul 01, 2014 at 09:07:27PM +0900, Jaehoon Chung wrote:
> > Hi,
> >
> > i have interesting for this problem..Because i also found the same problem..
> > Is it Journal problem?
> >
> > I used the Linux version 3.16.0-rc3.
> >
> > [ 3.866449] EXT4-fs error (device mmcblk0p13): ext4_mb_generate_buddy:756: group 0, 20490 clusters in bitmap, 20488 in gd; block bitmap corrupt.
> > [ 3.877937] Aborting journal on device mmcblk0p13-8.
> > [ 3.885025] Kernel panic - not syncing: EXT4-fs (device mmcblk0p13): panic forced after error
>
> This message means that the file system has detected an inconsistency
> --- specifically, that the number of blocks marked as in use in the
> allocation bbitmap is different from what is in the block group
> descriptors.
>
> The file system has been marked to force a panic after an error, at
> which point e2fsck will be able to repair the inconsistency.
>
> What's not clear is *how* the why this happened. It can happen simply
> because of a hardware problem. (In particular, not all mmc flash
> devices handle power failures gracefully.) Or it could be a cosmic,
> ray, or it might be a kernel bug.

I understand all this.

> Normally I would chalk this up to a hardware bug, bug it's possible
> that it is a kernel bug. If people can reliably reproduce the problem
> where no power failures or other unclean shutdowns were involved
> (since the last time file system has been checked using e2fsck) then
> that would be realy interesting.

If you read my first reply to Matteo, you would have noticed that I can
reliably reproduce this bug with ext4 and also that I can be pretty confident
that this is NOT a hardware issue. Here's (again) why:

The eMMC device supports eMMC 4.41 and is configured with all the "hardening"
features necessary for embedded systems that boot from eMMC: Enhanced mode is
active (SLC NAND mode) and reliable-writes are turned on. This means that (at
least by design) when a power cut occurs it is guaranteed that:

1.- The sector currently being written will be either in the old state or in
the new (re-written) state, but never "in-between" or in an unstable state
(which is what happens with regular MLC NAND flash).

2.- No other sectors of the flash may be affected by write interruptions on
one sector.

So power-cuts should always end up just requiring a journal-replay on next
mount. No real corruption should ever occur this way. Right?
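
(As a sanity check, one can also read this configuration back from Linux with
mmc-utils; this is just a sketch, and the exact field names in the EXT_CSD
dump vary between mmc-utils versions and eMMC revisions:)

# Dump the EXT_CSD registers and look at the reliable-write / enhanced
# (SLC) area settings; /dev/mmcblk1 is the eMMC device on this board.
mmc extcsd read /dev/mmcblk1 | grep -iE 'rel|enh|partition'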

I have been testing with both EXT3 and EXT4 on this device and I only see
problems when using EXT4. Furthermore, the reproduction procedure _ALWAYS_
triggers this error (with almost 100% reliability) when using EXT4, and until
now I have not been able to use this procedure to corrupt or otherwise harm
an EXT3 filesystem beyond simply requiring a journal replay on the next boot.

Please tell me what you want me to test to continue investigating. I am
convinced this is a kernel-bug, but I'd be happy if you managed to prove me
wrong. I could even try git bisecting if you think this could help, but if I
have to go too far back in time to find a working version I will get into
trouble getting the kernel to boot on my hardware without patching a lot of
things on each iteration....

> We should probably also change the message so the message is a bit
> more understanding to people who aren't ext4 developers.

That would be nice, but not really necessary. Let's focus on finding the bug
and fixing it instead.

Best regards,

--
David Jander
Protonic Holland.

2014-07-02 10:17:47

by David Jander

[permalink] [raw]
Subject: Re: ext4: journal has aborted


Hi Eric,

On Tue, 1 Jul 2014 12:36:46 -0400
Eric Whitney <[email protected]> wrote:

> * Theodore Ts'o <[email protected]>:
> > On Tue, Jul 01, 2014 at 09:07:27PM +0900, Jaehoon Chung wrote:
> > > Hi,
> > >
> > > i have interesting for this problem..Because i also found the same problem..
> > > Is it Journal problem?
> > >
> > > I used the Linux version 3.16.0-rc3.
> > >
> > > [ 3.866449] EXT4-fs error (device mmcblk0p13): ext4_mb_generate_buddy:756: group 0, 20490 clusters in bitmap, 20488 in gd; block bitmap corrupt.
> > > [ 3.877937] Aborting journal on device mmcblk0p13-8.
> > > [ 3.885025] Kernel panic - not syncing: EXT4-fs (device mmcblk0p13): panic forced after error
> >
> > This message means that the file system has detected an inconsistency
> > --- specifically, that the number of blocks marked as in use in the
> > allocation bbitmap is different from what is in the block group
> > descriptors.
> >
> > The file system has been marked to force a panic after an error, at
> > which point e2fsck will be able to repair the inconsistency.
> >
> > What's not clear is *how* the why this happened. It can happen simply
> > because of a hardware problem. (In particular, not all mmc flash
> > devices handle power failures gracefully.) Or it could be a cosmic,
> > ray, or it might be a kernel bug.
> >
> > Normally I would chalk this up to a hardware bug, bug it's possible
> > that it is a kernel bug. If people can reliably reproduce the problem
> > where no power failures or other unclean shutdowns were involved
> > (since the last time file system has been checked using e2fsck) then
> > that would be realy interesting.
>
> Hi Ted:
>
> I saw a similar failure during 3.16-rc3 (plus ext4 stable fixes plus msync
> patch) regression on the Pandaboard this morning. A generic/068 hang
> on data_journal required a reboot for recovery (old bug, though rarer lately).
> On reboot, the root filesystem - default 4K, and on an SD card - went ro
> after the same sort of bad block bitmap / journal abort sequence. Rebooting
> forced a fsck that cleared up the problem. The target test filesystem was on
> a USB-attached disk, and it did not exhibit the same problems on recovery.

Please be careful about drawing conclusions from regular SD cards and USB
sticks used for mass storage. Unlike hardened eMMC (4.41+), these COTS
mass-storage devices are not meant for intensive use and can quite easily
corrupt data all by themselves. I've seen it happen many times already.

> So, it looks like there might be more than just hardware involved here,
> although eMMC/flash might be a common denominator. I'll see if I can come up
> with a reliable reproducer once the regression pass is finished if someone
> doesn't beat me to it.

I agree that there is a strong correlation with flash-based storage, but I
cannot explain why this factor would make a difference. How do flash-based
block devices look any different to ext4 than spinning-disk media (besides
trim support)?

Best regards,

--
David Jander
Protonic Holland.

2014-07-02 10:20:15

by Matteo Croce

[permalink] [raw]
Subject: Re: ext4: journal has aborted

2014-07-02 12:17 GMT+02:00 David Jander <[email protected]>:
>
> Hi Eric,
>
> On Tue, 1 Jul 2014 12:36:46 -0400
> Eric Whitney <[email protected]> wrote:
>
>> * Theodore Ts'o <[email protected]>:
>> > On Tue, Jul 01, 2014 at 09:07:27PM +0900, Jaehoon Chung wrote:
>> > > Hi,
>> > >
>> > > i have interesting for this problem..Because i also found the same problem..
>> > > Is it Journal problem?
>> > >
>> > > I used the Linux version 3.16.0-rc3.
>> > >
>> > > [ 3.866449] EXT4-fs error (device mmcblk0p13): ext4_mb_generate_buddy:756: group 0, 20490 clusters in bitmap, 20488 in gd; block bitmap corrupt.
>> > > [ 3.877937] Aborting journal on device mmcblk0p13-8.
>> > > [ 3.885025] Kernel panic - not syncing: EXT4-fs (device mmcblk0p13): panic forced after error
>> >
>> > This message means that the file system has detected an inconsistency
>> > --- specifically, that the number of blocks marked as in use in the
>> > allocation bbitmap is different from what is in the block group
>> > descriptors.
>> >
>> > The file system has been marked to force a panic after an error, at
>> > which point e2fsck will be able to repair the inconsistency.
>> >
>> > What's not clear is *how* the why this happened. It can happen simply
>> > because of a hardware problem. (In particular, not all mmc flash
>> > devices handle power failures gracefully.) Or it could be a cosmic,
>> > ray, or it might be a kernel bug.
>> >
>> > Normally I would chalk this up to a hardware bug, bug it's possible
>> > that it is a kernel bug. If people can reliably reproduce the problem
>> > where no power failures or other unclean shutdowns were involved
>> > (since the last time file system has been checked using e2fsck) then
>> > that would be realy interesting.
>>
>> Hi Ted:
>>
>> I saw a similar failure during 3.16-rc3 (plus ext4 stable fixes plus msync
>> patch) regression on the Pandaboard this morning. A generic/068 hang
>> on data_journal required a reboot for recovery (old bug, though rarer lately).
>> On reboot, the root filesystem - default 4K, and on an SD card - went ro
>> after the same sort of bad block bitmap / journal abort sequence. Rebooting
>> forced a fsck that cleared up the problem. The target test filesystem was on
>> a USB-attached disk, and it did not exhibit the same problems on recovery.
>
> Please be careful about conclusions from regular SD cards and USB sticks for
> mass-storage. Unlike hardened eMMC (4.41+), these COTS mass-storage devices
> are not meant for intensive use and can perfectly easily corrupt data out of
> themselves. I've seen it happening many times already.
>
>> So, it looks like there might be more than just hardware involved here,
>> although eMMC/flash might be a common denominator. I'll see if I can come up
>> with a reliable reproducer once the regression pass is finished if someone
>> doesn't beat me to it.
>
> I agree that there is a strong correlation towards flash-based storage, but I
> cannot explain why this factor would make a difference. How are flash-based
> block-devices different to ext4 than spinning-disk media (besides trim
> support)?

maybe the zero access time can trigger some race condition?

> Best regards,
>
> --
> David Jander
> Protonic Holland.



--
Matteo Croce
OpenWrt Developer

2014-07-03 13:43:47

by Theodore Ts'o

[permalink] [raw]
Subject: Re: ext4: journal has aborted

On Tue, Jul 01, 2014 at 10:55:11AM +0200, Matteo Croce wrote:
> 2014-07-01 10:42 GMT+02:00 Darrick J. Wong <[email protected]>:
>
> I have a Samsung SSD 840 PRO

Matteo,

You said you were seeing these problems on 3.15. Was it
*not* happening for you when you used an older kernel? If so, that
would help give us a basis for doing a bisection search.


Using the kvm-xfstests infrastructure, I've been trying to reproduce
the problem as follows:

./kvm-xfstests --no-log -c 4k generic/075 ; e2fsck -p /dev/heap/test-4k ; e2fsck -f /dev/heap/test-4k

xfstests generic/075 runs fsx, which does a fair amount of block
allocation and deallocation; then, after the test finishes, it first
replays the journal (e2fsck -p) and then forces a fsck run on the
test disk that I use for the run.

After I launch this, in a separate window, I do this:

sleep 60 ; killall qemu-system-x86_64

This kills the qemu process midway through the fsx test, and then I
see if I can find a problem. I haven't had a chance to automate this
yet, but my intention is to set it up so that I can run it on a
ramdisk or an SSD, to more closely approximate what people
are reporting on flash-based media.
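
(Something along these lines is what I have in mind for automating it;
untested as written, and it uses the second e2fsck's exit status as a rough
signal that the forced pass had to fix something:)

# Repeat the kill-mid-run cycle until the forced fsck finds damage.
for i in $(seq 1 50); do
    ./kvm-xfstests --no-log -c 4k generic/075 &
    sleep 60
    killall qemu-system-x86_64
    wait
    e2fsck -p /dev/heap/test-4k        # replay the journal
    e2fsck -fy /dev/heap/test-4k || { echo "reproduced on iteration $i"; break; }
done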

So far, I haven't been able to reproduce the problem. If after doing
this a large number of times it can't be reproduced (especially if it
can't be reproduced on an SSD), then it would lead us to believe that
one of two things is the cause: (a) the CACHE FLUSH command isn't
properly getting sent to the device in some cases, or (b) there really
is a hardware problem with the flash device in question.
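
(For (a), one crude check would be to trace the block device while the
workload runs and see whether flush requests ever reach the driver; with
blktrace, a flush shows up as an 'F' in the RWBS column of the blkparse
output. Using David's device name as an example:)

# Trace all requests hitting the eMMC device; flushes carry an 'F' flag.
blktrace -d /dev/mmcblk1 -o - | blkparse -i -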

Cheers,

- Ted

2014-07-03 14:15:46

by David Jander

[permalink] [raw]
Subject: Re: ext4: journal has aborted


Hi Ted,

On Thu, 3 Jul 2014 09:43:38 -0400
"Theodore Ts'o" <[email protected]> wrote:

> On Tue, Jul 01, 2014 at 10:55:11AM +0200, Matteo Croce wrote:
> > 2014-07-01 10:42 GMT+02:00 Darrick J. Wong <[email protected]>:
> >
> > I have a Samsung SSD 840 PRO
>
> Matteo,
>
> For you, you said you were seeing these problems on 3.15. Was it
> *not* happening for you when you used an older kernel? If so, that
> would help us try to provide the basis of trying to do a bisection
> search.

I also tested with 3.15, and there too I see the same problem.

> Using the kvm-xfstests infrastructure, I've been trying to reproduce
> the problem as follows:
>
> ./kvm-xfstests --no-log -c 4k generic/075 ; e2fsck -p /dev/heap/test-4k ; e2fsck -f /dev/heap/test-4k
>
> xfstests geneeric/075 runs fsx which does a fair amount of block
> allocation deallocations, and then after the test finishes, it first
> replays the journal (e2fsck -p) and then forces a fsck run on the
> test disk that I use for the run.
>
> After I launch this, in a separate window, I do this:
>
> sleep 60 ; killall qemu-system-x86_64
>
> This kills the qemu process midway through the fsx test, and then I
> see if I can find a problem. I haven't had a chance to automate this
> yet, and it is my intention to try to set this up where I can run this
> on a ramdisk or a SSD, so I can more closely approximate what people
> are reporting on flash-based media.
>
> So far, I haven't been able to reproduce the problem. If after doing
> a large number of times, it can't be reproduced (especially if it
> can't be reproduced on an SSD), then it would lead us to believe that
> one of two things is the cause. (a) The CACHE FLUSH command isn't
> properly getting sent to the device in some cases, or (b) there really
> is a hardware problem with the flash device in question.

Could (a) be caused by a bug in the mmc subsystem or in the MMC peripheral
driver? Can you explain why I don't see any problems with EXT3?

I can't rule out possibility (b) because I cannot prove it, but I will
try to see if I can do the same test on an SSD which I happen to have on that
platform. That should be able to rule out problems with the eMMC chip and
its driver, right?

Do you know a way to investigate (a) (CACHE FLUSH not being sent correctly)?

I left the system running (it started from a dirty EXT4 partition), and I have
seen the following error pop up after a few minutes. The system is not doing
much (some syslog activity maybe, but not much more):

[ 303.072983] EXT4-fs (mmcblk1p2): error count: 4
[ 303.077558] EXT4-fs (mmcblk1p2): initial error at 1404216838: ext4_mb_generate_buddy:756
[ 303.085690] EXT4-fs (mmcblk1p2): last error at 1404388969: ext4_mb_generate_buddy:757

What does that mean?

Best regards,

--
David Jander
Protonic Holland.

2014-07-03 14:46:54

by Theodore Ts'o

[permalink] [raw]
Subject: Re: ext4: journal has aborted

On Thu, Jul 03, 2014 at 04:15:51PM +0200, David Jander wrote:
>
> Could (a) be caused by a bug in the mmc subsystem or in the MMC peripheral
> driver? Can you explain why I don't see any problems with EXT3?

It's possible. I seem to recall a bug related to the mmc subsystem
that was causing file system corruption after power failure across
multiple file systems --- xfs and reiserfs were mentioned, as I
recall. I *thought* the problem was fixed, and then backported if
necessary. Hmm... Here's where that bug was reported:

https://lkml.org/lkml/2014/6/12/19

... but I haven't found the fix yet.

Now, this would be quite different from the bug Matteo was seeing,
since he has a Samsung SSD which is *not* a MMC device.

As far as why you aren't seeing a problem with ext3: it doesn't have
the same sort of paranoid checks that ext4 has, so it's less likely to
catch certain problems at runtime. If you ran fsck on an ext3 file
system, and it was corrupt, of course that would show the problem.

> I left the system running (it started from a dirty EXT4 partition), and I am
> seen the following error pop up after a few minutes. The system is not doing
> much (some syslog activity maybe, but not much more):
>
> [ 303.072983] EXT4-fs (mmcblk1p2): error count: 4
> [ 303.077558] EXT4-fs (mmcblk1p2): initial error at 1404216838: ext4_mb_generate_buddy:756
> [ 303.085690] EXT4-fs (mmcblk1p2): last error at 1404388969: ext4_mb_generate_buddy:757
>
> What does that mean?

This means that the file system has errors that haven't been fixed by an
fsck. The first error occurred at:

% date -d @1404216838
Tue Jul 1 08:13:58 EDT 2014

and the most recent error occurred at:

% date -d @1404388969
Thu Jul 3 08:02:49 EDT 2014

The error count information should have gotten cleared by e2fsck, so
long as you are using a version of e2fsck newer than 1.41.13, released
in December 2010.
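
(If you want to look at the saved error information directly, something like
the following should work with a recent e2fsprogs; the exact field names may
differ between versions:)

# Show the error-tracking fields recorded in the ext4 superblock.
dumpe2fs -h /dev/mmcblk1p2 | grep -i error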

So if it has not been cleared, and you've since rebooted, that's an
indication that e2fsck isn't getting run at boot. If you haven't
rebooted yet, then about once a day, you'll see that message in your
syslog. It's there so that people know that their file system has
problems, and that they *really* should get it unmounted and checked
before they lose more data....

The reason I added this is that there were systems where people
weren't noticing that they had been running with a corrupted file
system for days, weeks, months, etc., and then would complain that
they had lost lots of data. By putting something in the logs once a
day, hopefully it would reduce the chance of this happening. (And if
they had configured their file system to panic when an error was
detected, via "tune2fs -e panic /dev/sdXX", so long as their init
scripts were properly configured, the file system should have been
repaired after the reboot.)

- Ted

2014-07-03 14:57:22

by Dmitry Monakhov

[permalink] [raw]
Subject: Re: ext4: journal has aborted

On Thu, 3 Jul 2014 16:15:51 +0200, David Jander <[email protected]> wrote:
>
> Hi Ted,
>
> On Thu, 3 Jul 2014 09:43:38 -0400
> "Theodore Ts'o" <[email protected]> wrote:
>
> > On Tue, Jul 01, 2014 at 10:55:11AM +0200, Matteo Croce wrote:
> > > 2014-07-01 10:42 GMT+02:00 Darrick J. Wong <[email protected]>:
> > >
> > > I have a Samsung SSD 840 PRO
> >
> > Matteo,
> >
> > For you, you said you were seeing these problems on 3.15. Was it
> > *not* happening for you when you used an older kernel? If so, that
> > would help us try to provide the basis of trying to do a bisection
> > search.
>
> I also tested with 3.15, and there too I see the same problem.
>
> > Using the kvm-xfstests infrastructure, I've been trying to reproduce
> > the problem as follows:
> >
> > ./kvm-xfstests --no-log -c 4k generic/075 ; e2fsck -p /dev/heap/test-4k ; e2fsck -f /dev/heap/test-4k
> >
> > xfstests geneeric/075 runs fsx which does a fair amount of block
> > allocation deallocations, and then after the test finishes, it first
> > replays the journal (e2fsck -p) and then forces a fsck run on the
> > test disk that I use for the run.
> >
> > After I launch this, in a separate window, I do this:
> >
> > sleep 60 ; killall qemu-system-x86_64
> >
> > This kills the qemu process midway through the fsx test, and then I
> > see if I can find a problem. I haven't had a chance to automate this
> > yet, and it is my intention to try to set this up where I can run this
> > on a ramdisk or a SSD, so I can more closely approximate what people
> > are reporting on flash-based media.
> >
> > So far, I haven't been able to reproduce the problem. If after doing
> > a large number of times, it can't be reproduced (especially if it
> > can't be reproduced on an SSD), then it would lead us to believe that
> > one of two things is the cause. (a) The CACHE FLUSH command isn't
> > properly getting sent to the device in some cases, or (b) there really
> > is a hardware problem with the flash device in question.
>
> Could (a) be caused by a bug in the mmc subsystem or in the MMC peripheral
> driver? Can you explain why I don't see any problems with EXT3?
>
> I can't discard the possibility of (b) because I cannot prove it, but I will
> try to see if I can do the same test on a SSD which I happen to have on that
> platform. That should be able to rule out problems with the eMMC chip and
> -driver, right?
>
> Do you know a way to investigate (a) (CACHE FLUSH not being sent correctly)?
>
> I left the system running (it started from a dirty EXT4 partition), and I am
> seen the following error pop up after a few minutes. The system is not doing
> much (some syslog activity maybe, but not much more):
>
> [ 303.072983] EXT4-fs (mmcblk1p2): error count: 4
> [ 303.077558] EXT4-fs (mmcblk1p2): initial error at 1404216838: ext4_mb_generate_buddy:756
> [ 303.085690] EXT4-fs (mmcblk1p2): last error at 1404388969: ext4_mb_generate_buddy:757
>
> What does that mean?
This means that it found a previous error recorded in ext4's internal error
log, which is normal because your fs was corrupted before. It would be
reasonable to recreate the filesystem from scratch.

In order to understand whether it is a regression in the eMMC driver, it is
reasonable to run an integrity test against the device itself. You can run
any integrity test you like; for example, just run the fio job
"fio disk-verify2.fio" (see attachment). IMPORTANT: this script will
destroy the data on the test partition. If it fails with errors like
"verify: bad magic header XXX", then it is definitely a driver issue.

If my theory is true and it is a storage driver issue, then JBD complains
simply because it does care about its data (it does integrity checks).
Can you also create btrfs on that partition, perform some I/O
activity, and run fsck after that? You will likely see similar corruption.
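
(Roughly like this, assuming a btrfs-progs new enough to have "btrfs check";
on older versions use btrfsck instead. This also wipes the partition:)

mkfs.btrfs -f /dev/mmcblk1p2
mount /dev/mmcblk1p2 /mnt
# ... generate some write load here, e.g. untar a kernel tree or run fsstress ...
umount /mnt
btrfs check /dev/mmcblk1p2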

>
> Best regards,
>
> --
> David Jander
> Protonic Holland.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

2014-07-03 17:14:38

by Eric Whitney

[permalink] [raw]
Subject: Re: ext4: journal has aborted

* Matteo Croce <[email protected]>:
> 2014-07-02 12:17 GMT+02:00 David Jander <[email protected]>:
> >
> > Hi Eric,
> >
> > On Tue, 1 Jul 2014 12:36:46 -0400
> > Eric Whitney <[email protected]> wrote:
> >
> >> * Theodore Ts'o <[email protected]>:
> >> > On Tue, Jul 01, 2014 at 09:07:27PM +0900, Jaehoon Chung wrote:
> >> > > Hi,
> >> > >
> >> > > i have interesting for this problem..Because i also found the same problem..
> >> > > Is it Journal problem?
> >> > >
> >> > > I used the Linux version 3.16.0-rc3.
> >> > >
> >> > > [ 3.866449] EXT4-fs error (device mmcblk0p13): ext4_mb_generate_buddy:756: group 0, 20490 clusters in bitmap, 20488 in gd; block bitmap corrupt.
> >> > > [ 3.877937] Aborting journal on device mmcblk0p13-8.
> >> > > [ 3.885025] Kernel panic - not syncing: EXT4-fs (device mmcblk0p13): panic forced after error
> >> >
> >> > This message means that the file system has detected an inconsistency
> >> > --- specifically, that the number of blocks marked as in use in the
> >> > allocation bbitmap is different from what is in the block group
> >> > descriptors.
> >> >
> >> > The file system has been marked to force a panic after an error, at
> >> > which point e2fsck will be able to repair the inconsistency.
> >> >
> >> > What's not clear is *how* the why this happened. It can happen simply
> >> > because of a hardware problem. (In particular, not all mmc flash
> >> > devices handle power failures gracefully.) Or it could be a cosmic,
> >> > ray, or it might be a kernel bug.
> >> >
> >> > Normally I would chalk this up to a hardware bug, bug it's possible
> >> > that it is a kernel bug. If people can reliably reproduce the problem
> >> > where no power failures or other unclean shutdowns were involved
> >> > (since the last time file system has been checked using e2fsck) then
> >> > that would be realy interesting.
> >>
> >> Hi Ted:
> >>
> >> I saw a similar failure during 3.16-rc3 (plus ext4 stable fixes plus msync
> >> patch) regression on the Pandaboard this morning. A generic/068 hang
> >> on data_journal required a reboot for recovery (old bug, though rarer lately).
> >> On reboot, the root filesystem - default 4K, and on an SD card - went ro
> >> after the same sort of bad block bitmap / journal abort sequence. Rebooting
> >> forced a fsck that cleared up the problem. The target test filesystem was on
> >> a USB-attached disk, and it did not exhibit the same problems on recovery.
> >
> > Please be careful about conclusions from regular SD cards and USB sticks for
> > mass-storage. Unlike hardened eMMC (4.41+), these COTS mass-storage devices
> > are not meant for intensive use and can perfectly easily corrupt data out of
> > themselves. I've seen it happening many times already.
> >
> >> So, it looks like there might be more than just hardware involved here,
> >> although eMMC/flash might be a common denominator. I'll see if I can come up
> >> with a reliable reproducer once the regression pass is finished if someone
> >> doesn't beat me to it.

I've not found a reproducer that doesn't involve an unclean shutdown, which
is what Ted's looking for.

However, I've noted a behavioral change between 3.14 and 3.15-rc3 that
might be of interest in connection with the failure scenario described
above using xfstests generic/068. It's possible that this change would
make filesystem damage caused by an unclean shutdown more likely or more
noticeable, and perhaps it's in play for the power fail/cycle cases
described in this thread.

FWIW, I've also been able to reproduce that failure scenario on an x86_64 KVM
with raw virtio disks alone. It's just a lot harder to get there with that
configuration - many more trials required.

The change is that the root filesystem sustains damage reported as -

EXT4-fs error (device mmcblk0p3): ext4_mb_generate_buddy:757: group 65, 1243 clusters in bitmap, 1244 in gd; block bitmap corrupt.
Aborting journal on device mmcblk0p3-8.
EXT4-fs error (device mmcblk0p3): ext4_journal_check_start:56: Detected aborted journal
EXT4-fs (mmcblk0p3): Remounting filesystem read-only

- when generic/068 is run on a separate test filesystem that forces an ext4
failure requiring a power cycle / reset to recover from a hung reboot attempt.
This doesn't happen in 3.14 on either my x86_64 or ARM test systems.
Generally, the root filesystem doesn't appear to be affected at all or is
minimally affected (does not require fsck to fully recover) in 3.14, whereas
a fsck is usually required to recover the root in 3.15-rc3.
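
(For reference, the failing case is the data_journal configuration of that
test; under the kvm-xfstests wrapper used earlier in this thread the
equivalent invocation would presumably be something like the following,
although I hit it on real hardware rather than under KVM:)

./kvm-xfstests -c data_journal generic/068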

My attempts to bisect further into 3.15-rc1 to 3.15-rc3 haven't gone well as
yet - other kernel problems are making it difficult to work in there.

Eric


> >
> > I agree that there is a strong correlation towards flash-based storage, but I
> > cannot explain why this factor would make a difference. How are flash-based
> > block-devices different to ext4 than spinning-disk media (besides trim
> > support)?
>
> maybe the zero access time can trigger some race condition?
>
> > Best regards,
> >
> > --
> > David Jander
> > Protonic Holland.
>
>
>
> --
> Matteo Croce
> OpenWrt Developer

2014-07-03 23:17:57

by Theodore Ts'o

[permalink] [raw]
Subject: Re: ext4: journal has aborted

On Thu, Jul 03, 2014 at 01:14:34PM -0400, Eric Whitney wrote:
>
> FWIW, I've also been able to reproduce that failure scenario on an
> x86_64 KVM with raw virtio disks alone. It's just a lot harder to
> get there with that configuration - many more trials required.

What configuration are you using when you see the failure using x86_64
KVM with raw virtio? Is this using kvm-xfstests, or some other setup?
And which file system is getting corrupted?

Because with kvm-xfstests the root file system is a read-only
snapshot, and the "mmcblk0p3" device which you show before is clearly
not the virtio device....

- Ted

2014-07-04 09:40:24

by David Jander

[permalink] [raw]
Subject: Re: ext4: journal has aborted


Hi Dmitry,

On Thu, 03 Jul 2014 18:58:48 +0400
Dmitry Monakhov <[email protected]> wrote:

> On Thu, 3 Jul 2014 16:15:51 +0200, David Jander <[email protected]> wrote:
> >
> > Hi Ted,
> >
> > On Thu, 3 Jul 2014 09:43:38 -0400
> > "Theodore Ts'o" <[email protected]> wrote:
> >
> > > On Tue, Jul 01, 2014 at 10:55:11AM +0200, Matteo Croce wrote:
> > > > 2014-07-01 10:42 GMT+02:00 Darrick J. Wong <[email protected]>:
> > > >
> > > > I have a Samsung SSD 840 PRO
> > >
> > > Matteo,
> > >
> > > For you, you said you were seeing these problems on 3.15. Was it
> > > *not* happening for you when you used an older kernel? If so, that
> > > would help us try to provide the basis of trying to do a bisection
> > > search.
> >
> > I also tested with 3.15, and there too I see the same problem.
> >
> > > Using the kvm-xfstests infrastructure, I've been trying to reproduce
> > > the problem as follows:
> > >
> > > ./kvm-xfstests --no-log -c 4k generic/075 ; e2fsck -p /dev/heap/test-4k ; e2fsck -f /dev/heap/test-4k
> > >
> > > xfstests geneeric/075 runs fsx which does a fair amount of block
> > > allocation deallocations, and then after the test finishes, it first
> > > replays the journal (e2fsck -p) and then forces a fsck run on the
> > > test disk that I use for the run.
> > >
> > > After I launch this, in a separate window, I do this:
> > >
> > > sleep 60 ; killall qemu-system-x86_64
> > >
> > > This kills the qemu process midway through the fsx test, and then I
> > > see if I can find a problem. I haven't had a chance to automate this
> > > yet, and it is my intention to try to set this up where I can run this
> > > on a ramdisk or a SSD, so I can more closely approximate what people
> > > are reporting on flash-based media.
> > >
> > > So far, I haven't been able to reproduce the problem. If after doing
> > > a large number of times, it can't be reproduced (especially if it
> > > can't be reproduced on an SSD), then it would lead us to believe that
> > > one of two things is the cause. (a) The CACHE FLUSH command isn't
> > > properly getting sent to the device in some cases, or (b) there really
> > > is a hardware problem with the flash device in question.
> >
> > Could (a) be caused by a bug in the mmc subsystem or in the MMC peripheral
> > driver? Can you explain why I don't see any problems with EXT3?
> >
> > I can't discard the possibility of (b) because I cannot prove it, but I will
> > try to see if I can do the same test on a SSD which I happen to have on that
> > platform. That should be able to rule out problems with the eMMC chip and
> > -driver, right?
> >
> > Do you know a way to investigate (a) (CACHE FLUSH not being sent correctly)?
> >
> > I left the system running (it started from a dirty EXT4 partition), and I am
> > seen the following error pop up after a few minutes. The system is not doing
> > much (some syslog activity maybe, but not much more):
> >
> > [ 303.072983] EXT4-fs (mmcblk1p2): error count: 4
> > [ 303.077558] EXT4-fs (mmcblk1p2): initial error at 1404216838: ext4_mb_generate_buddy:756
> > [ 303.085690] EXT4-fs (mmcblk1p2): last error at 1404388969: ext4_mb_generate_buddy:757
> >
> > What does that mean?
> This means that it found previous error in internal ext4's log. Which is
> normal because your fs was corrupted before. It is reasonable to
> recreate filesystem from very beginning.
>
> In order to understand whenever it is regression in eMMC driver it is
> reasonable to run integrity test for a device itself. You can run
> any integrity test you like, For example just run a fio's job
> "fio disk-verify2.fio" (see attachment), IMPORTANT this script will
> destroy data on test partition. If it failed with errors like
> follows "verify: bad magic header XXX" than it is definitely a drivers issue.

I have been trying to run fio on my board with your configuration file, but I
am having problems, and since I am not familiar with fio at all, I can't
really figure out what's wrong. My eMMC device is only 916MiB in size, so I
edited the last part to be:

offset_increment=100M
size=100M

Is that ok?

I still get error messages complaining about blocksize though. Here is the
output I get (can't really make sense of it):

# ./fio ../disk-verify2.fio
Multiple writers may overwrite blocks that belong to other jobs. This can cause verification failures.
/dev/mmcblk1p2: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32
...
fio-2.1.10-49-gf302
Starting 4 processes
fio: blocksize too large for data set
fio: blocksize too large for data set
fio: blocksize too large for data set
fio: io_u.c:1315: __get_io_u: Assertion `io_u->flags & IO_U_F_FREE' failed.ta 00m:00s]
fio: pid=7612, got signal=6

/dev/mmcblk1p2: (groupid=0, jobs=1): err= 0: pid=7612: Fri Jul 4 09:31:15 2014
lat (msec) : 4=0.19%, 10=0.19%, 20=0.19%, 50=0.85%, 100=1.23%
lat (msec) : 250=56.01%, 500=37.18%, 750=1.14%
cpu : usr=0.00%, sys=0.00%, ctx=0, majf=0, minf=0
IO depths : 1=0.1%, 2=0.2%, 4=0.4%, 8=0.8%, 16=1.5%, 32=97.1%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued : total=r=33/w=1024/d=0, short=r=0/w=0/d=0
latency : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):

Disk stats (read/write):
mmcblk1: ios=11/1025, merge=0/0, ticks=94/6671, in_queue=7121, util=96.12%
fio: file hash not empty on exit


This assertion bugs me. Is it due to the previous errors ("blocksize too large
for data set") or is it because my eMMC drive/kernel is seriously screwed?

Help please!

> If my theory is true and it is storage's driver issue than JBD complain
> simply because it do care about it's data (it does integrity checks).
> Can you also create btrfs on that partition and performs some io
> activity and run fsck after that. You likely will see similar corruption

Best regards,

--
David Jander
Protonic Holland.

2014-07-04 10:17:25

by Dmitry Monakhov

[permalink] [raw]
Subject: Re: ext4: journal has aborted

On Fri, 4 Jul 2014 11:40:31 +0200, David Jander <[email protected]> wrote:
>
> Hi Dmitry,
>
> On Thu, 03 Jul 2014 18:58:48 +0400
> Dmitry Monakhov <[email protected]> wrote:
>
> > On Thu, 3 Jul 2014 16:15:51 +0200, David Jander <[email protected]> wrote:
> > >
> > > Hi Ted,
> > >
> > > On Thu, 3 Jul 2014 09:43:38 -0400
> > > "Theodore Ts'o" <[email protected]> wrote:
> > >
> > > > On Tue, Jul 01, 2014 at 10:55:11AM +0200, Matteo Croce wrote:
> > > > > 2014-07-01 10:42 GMT+02:00 Darrick J. Wong <[email protected]>:
> > > > >
> > > > > I have a Samsung SSD 840 PRO
> > > >
> > > > Matteo,
> > > >
> > > > For you, you said you were seeing these problems on 3.15. Was it
> > > > *not* happening for you when you used an older kernel? If so, that
> > > > would help us try to provide the basis of trying to do a bisection
> > > > search.
> > >
> > > I also tested with 3.15, and there too I see the same problem.
> > >
> > > > Using the kvm-xfstests infrastructure, I've been trying to reproduce
> > > > the problem as follows:
> > > >
> > > > ./kvm-xfstests --no-log -c 4k generic/075 ; e2fsck -p /dev/heap/test-4k ; e2fsck -f /dev/heap/test-4k
> > > >
> > > > xfstests geneeric/075 runs fsx which does a fair amount of block
> > > > allocation deallocations, and then after the test finishes, it first
> > > > replays the journal (e2fsck -p) and then forces a fsck run on the
> > > > test disk that I use for the run.
> > > >
> > > > After I launch this, in a separate window, I do this:
> > > >
> > > > sleep 60 ; killall qemu-system-x86_64
> > > >
> > > > This kills the qemu process midway through the fsx test, and then I
> > > > see if I can find a problem. I haven't had a chance to automate this
> > > > yet, and it is my intention to try to set this up where I can run this
> > > > on a ramdisk or a SSD, so I can more closely approximate what people
> > > > are reporting on flash-based media.
> > > >
> > > > So far, I haven't been able to reproduce the problem. If after doing
> > > > a large number of times, it can't be reproduced (especially if it
> > > > can't be reproduced on an SSD), then it would lead us to believe that
> > > > one of two things is the cause. (a) The CACHE FLUSH command isn't
> > > > properly getting sent to the device in some cases, or (b) there really
> > > > is a hardware problem with the flash device in question.
> > >
> > > Could (a) be caused by a bug in the mmc subsystem or in the MMC peripheral
> > > driver? Can you explain why I don't see any problems with EXT3?
> > >
> > > I can't discard the possibility of (b) because I cannot prove it, but I will
> > > try to see if I can do the same test on a SSD which I happen to have on that
> > > platform. That should be able to rule out problems with the eMMC chip and
> > > -driver, right?
> > >
> > > Do you know a way to investigate (a) (CACHE FLUSH not being sent correctly)?
> > >
> > > I left the system running (it started from a dirty EXT4 partition), and I am
> > > seen the following error pop up after a few minutes. The system is not doing
> > > much (some syslog activity maybe, but not much more):
> > >
> > > [ 303.072983] EXT4-fs (mmcblk1p2): error count: 4
> > > [ 303.077558] EXT4-fs (mmcblk1p2): initial error at 1404216838: ext4_mb_generate_buddy:756
> > > [ 303.085690] EXT4-fs (mmcblk1p2): last error at 1404388969: ext4_mb_generate_buddy:757
> > >
> > > What does that mean?
> > This means that it found previous error in internal ext4's log. Which is
> > normal because your fs was corrupted before. It is reasonable to
> > recreate filesystem from very beginning.
> >
> > In order to understand whenever it is regression in eMMC driver it is
> > reasonable to run integrity test for a device itself. You can run
> > any integrity test you like, For example just run a fio's job
> > "fio disk-verify2.fio" (see attachment), IMPORTANT this script will
> > destroy data on test partition. If it failed with errors like
> > follows "verify: bad magic header XXX" than it is definitely a drivers issue.
>
> I have been trying to run fio on my board with your configuration file, but I
> am having problems, and since I am not familiar with fio at all, I can't
> really figure out what's wrong. My eMMC device is only 916MiB in size, so I
> edited the last part to be:
>
> offset_increment=100M
> size=100M
>
> Is that ok?
>
> I still get error messages complaining about blocksize though. Here is the
> output I get (can't really make sense of it):
>
> # ./fio ../disk-verify2.fio
> Multiple writers may overwrite blocks that belong to other jobs. This can cause verification failures.
> /dev/mmcblk1p2: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32
> ...
> fio-2.1.10-49-gf302
> Starting 4 processes
> fio: blocksize too large for data set
> fio: blocksize too large for data set
> fio: blocksize too large for data set
> fio: io_u.c:1315: __get_io_u: Assertion `io_u->flags & IO_U_F_FREE' failed.ta 00m:00s]
> fio: pid=7612, got signal=6
>
> /dev/mmcblk1p2: (groupid=0, jobs=1): err= 0: pid=7612: Fri Jul 4 09:31:15 2014
> lat (msec) : 4=0.19%, 10=0.19%, 20=0.19%, 50=0.85%, 100=1.23%
> lat (msec) : 250=56.01%, 500=37.18%, 750=1.14%
> cpu : usr=0.00%, sys=0.00%, ctx=0, majf=0, minf=0
> IO depths : 1=0.1%, 2=0.2%, 4=0.4%, 8=0.8%, 16=1.5%, 32=97.1%, >=64=0.0%
> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> issued : total=r=33/w=1024/d=0, short=r=0/w=0/d=0
> latency : target=0, window=0, percentile=100.00%, depth=32
>
> Run status group 0 (all jobs):
>
> Disk stats (read/write):
> mmcblk1: ios=11/1025, merge=0/0, ticks=94/6671, in_queue=7121, util=96.12%
> fio: file hash not empty on exit
>
>
> This assertion bugs me. Is it due to the previous errors ("blocksize too large
> for data set") or is is because my eMMC drive/kernel is seriously screwed?
>
> Help please!
Ohhh. Actually this is axboe's crap; the recent fio version is broken.
Please use the known-good commit ffa93ca9d8d37ef:

git clone git://git.kernel.dk/fio.git
cd fio
git checkout -b b2.0.13 ffa93ca9d8d37ef
make -j4
./fio ../disk-verify2.fio
>
> > If my theory is true and it is storage's driver issue than JBD complain
> > simply because it do care about it's data (it does integrity checks).
> > Can you also create btrfs on that partition and performs some io
> > activity and run fsck after that. You likely will see similar corruption
>
> Best regards,
>
> --
> David Jander
> Protonic Holland.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

2014-07-04 11:04:15

by Jaehoon Chung

[permalink] [raw]
Subject: Re: ext4: journal has aborted

Hi, David.

On 07/04/2014 06:40 PM, David Jander wrote:
>
> Hi Dmitry,
>
> On Thu, 03 Jul 2014 18:58:48 +0400
> Dmitry Monakhov <[email protected]> wrote:
>
>> On Thu, 3 Jul 2014 16:15:51 +0200, David Jander <[email protected]> wrote:
>>>
>>> Hi Ted,
>>>
>>> On Thu, 3 Jul 2014 09:43:38 -0400
>>> "Theodore Ts'o" <[email protected]> wrote:
>>>
>>>> On Tue, Jul 01, 2014 at 10:55:11AM +0200, Matteo Croce wrote:
>>>>> 2014-07-01 10:42 GMT+02:00 Darrick J. Wong <[email protected]>:
>>>>>
>>>>> I have a Samsung SSD 840 PRO
>>>>
>>>> Matteo,
>>>>
>>>> For you, you said you were seeing these problems on 3.15. Was it
>>>> *not* happening for you when you used an older kernel? If so, that
>>>> would help us try to provide the basis of trying to do a bisection
>>>> search.
>>>
>>> I also tested with 3.15, and there too I see the same problem.
>>>
>>>> Using the kvm-xfstests infrastructure, I've been trying to reproduce
>>>> the problem as follows:
>>>>
>>>> ./kvm-xfstests --no-log -c 4k generic/075 ; e2fsck -p /dev/heap/test-4k ; e2fsck -f /dev/heap/test-4k
>>>>
>>>> xfstests geneeric/075 runs fsx which does a fair amount of block
>>>> allocation deallocations, and then after the test finishes, it first
>>>> replays the journal (e2fsck -p) and then forces a fsck run on the
>>>> test disk that I use for the run.
>>>>
>>>> After I launch this, in a separate window, I do this:
>>>>
>>>> sleep 60 ; killall qemu-system-x86_64
>>>>
>>>> This kills the qemu process midway through the fsx test, and then I
>>>> see if I can find a problem. I haven't had a chance to automate this
>>>> yet, and it is my intention to try to set this up where I can run this
>>>> on a ramdisk or a SSD, so I can more closely approximate what people
>>>> are reporting on flash-based media.
>>>>
>>>> So far, I haven't been able to reproduce the problem. If after doing
>>>> a large number of times, it can't be reproduced (especially if it
>>>> can't be reproduced on an SSD), then it would lead us to believe that
>>>> one of two things is the cause. (a) The CACHE FLUSH command isn't
>>>> properly getting sent to the device in some cases, or (b) there really
>>>> is a hardware problem with the flash device in question.
>>>
>>> Could (a) be caused by a bug in the mmc subsystem or in the MMC peripheral
>>> driver? Can you explain why I don't see any problems with EXT3?
>>>
>>> I can't discard the possibility of (b) because I cannot prove it, but I will
>>> try to see if I can do the same test on a SSD which I happen to have on that
>>> platform. That should be able to rule out problems with the eMMC chip and
>>> -driver, right?
>>>
>>> Do you know a way to investigate (a) (CACHE FLUSH not being sent correctly)?
>>>
>>> I left the system running (it started from a dirty EXT4 partition), and I am
>>> seen the following error pop up after a few minutes. The system is not doing
>>> much (some syslog activity maybe, but not much more):
>>>
>>> [ 303.072983] EXT4-fs (mmcblk1p2): error count: 4
>>> [ 303.077558] EXT4-fs (mmcblk1p2): initial error at 1404216838: ext4_mb_generate_buddy:756
>>> [ 303.085690] EXT4-fs (mmcblk1p2): last error at 1404388969: ext4_mb_generate_buddy:757
>>>
>>> What does that mean?
>> This means that it found previous error in internal ext4's log. Which is
>> normal because your fs was corrupted before. It is reasonable to
>> recreate filesystem from very beginning.
>>
>> In order to understand whenever it is regression in eMMC driver it is
>> reasonable to run integrity test for a device itself. You can run
>> any integrity test you like, For example just run a fio's job
>> "fio disk-verify2.fio" (see attachment), IMPORTANT this script will
>> destroy data on test partition. If it failed with errors like
>> follows "verify: bad magic header XXX" than it is definitely a drivers issue.
>
> I have been trying to run fio on my board with your configuration file, but I
> am having problems, and since I am not familiar with fio at all, I can't
> really figure out what's wrong. My eMMC device is only 916MiB in size, so I
> edited the last part to be:

Which eMMC host controller did you use?

>
> offset_increment=100M
> size=100M
>
> Is that ok?
>
> I still get error messages complaining about blocksize though. Here is the
> output I get (can't really make sense of it):
>
> # ./fio ../disk-verify2.fio
> Multiple writers may overwrite blocks that belong to other jobs. This can cause verification failures.
> /dev/mmcblk1p2: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32
> ...
> fio-2.1.10-49-gf302
> Starting 4 processes
> fio: blocksize too large for data set
> fio: blocksize too large for data set
> fio: blocksize too large for data set
> fio: io_u.c:1315: __get_io_u: Assertion `io_u->flags & IO_U_F_FREE' failed.ta 00m:00s]
> fio: pid=7612, got signal=6
>
> /dev/mmcblk1p2: (groupid=0, jobs=1): err= 0: pid=7612: Fri Jul 4 09:31:15 2014
> lat (msec) : 4=0.19%, 10=0.19%, 20=0.19%, 50=0.85%, 100=1.23%
> lat (msec) : 250=56.01%, 500=37.18%, 750=1.14%
> cpu : usr=0.00%, sys=0.00%, ctx=0, majf=0, minf=0
> IO depths : 1=0.1%, 2=0.2%, 4=0.4%, 8=0.8%, 16=1.5%, 32=97.1%, >=64=0.0%
> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> issued : total=r=33/w=1024/d=0, short=r=0/w=0/d=0
> latency : target=0, window=0, percentile=100.00%, depth=32
>
> Run status group 0 (all jobs):
>
> Disk stats (read/write):
> mmcblk1: ios=11/1025, merge=0/0, ticks=94/6671, in_queue=7121, util=96.12%
> fio: file hash not empty on exit
>
>
> This assertion bugs me. Is it due to the previous errors ("blocksize too large
> for data set") or is is because my eMMC drive/kernel is seriously screwed?
>
> Help please!
>
>> If my theory is true and it is storage's driver issue than JBD complain
>> simply because it do care about it's data (it does integrity checks).
>> Can you also create btrfs on that partition and performs some io
>> activity and run fsck after that. You likely will see similar corruption
>
> Best regards,
>


2014-07-04 11:27:55

by David Jander

[permalink] [raw]
Subject: Re: ext4: journal has aborted

On Fri, 04 Jul 2014 14:17:21 +0400
Dmitry Monakhov <[email protected]> wrote:

> On Fri, 4 Jul 2014 11:40:31 +0200, David Jander <[email protected]> wrote:
> >
> > Hi Dmitry,
> >
> > On Thu, 03 Jul 2014 18:58:48 +0400
> > Dmitry Monakhov <[email protected]> wrote:
> >
> > > On Thu, 3 Jul 2014 16:15:51 +0200, David Jander <[email protected]> wrote:
> > > >
> > > > Hi Ted,
> > > >
> > > > On Thu, 3 Jul 2014 09:43:38 -0400
> > > > "Theodore Ts'o" <[email protected]> wrote:
> > > >
> > > > > On Tue, Jul 01, 2014 at 10:55:11AM +0200, Matteo Croce wrote:
> > > > > > 2014-07-01 10:42 GMT+02:00 Darrick J. Wong <[email protected]>:
> > > > > >
> > > > > > I have a Samsung SSD 840 PRO
> > > > >
> > > > > Matteo,
> > > > >
> > > > > For you, you said you were seeing these problems on 3.15. Was it
> > > > > *not* happening for you when you used an older kernel? If so, that
> > > > > would help us try to provide the basis of trying to do a bisection
> > > > > search.
> > > >
> > > > I also tested with 3.15, and there too I see the same problem.
> > > >
> > > > > Using the kvm-xfstests infrastructure, I've been trying to reproduce
> > > > > the problem as follows:
> > > > >
> > > > > ./kvm-xfstests --no-log -c 4k generic/075 ; e2fsck -p /dev/heap/test-4k ; e2fsck -f /dev/heap/test-4k
> > > > >
> > > > > xfstests geneeric/075 runs fsx which does a fair amount of block
> > > > > allocation deallocations, and then after the test finishes, it first
> > > > > replays the journal (e2fsck -p) and then forces a fsck run on the
> > > > > test disk that I use for the run.
> > > > >
> > > > > After I launch this, in a separate window, I do this:
> > > > >
> > > > > sleep 60 ; killall qemu-system-x86_64
> > > > >
> > > > > This kills the qemu process midway through the fsx test, and then I
> > > > > see if I can find a problem. I haven't had a chance to automate this
> > > > > yet, and it is my intention to try to set this up where I can run this
> > > > > on a ramdisk or a SSD, so I can more closely approximate what people
> > > > > are reporting on flash-based media.
> > > > >
> > > > > So far, I haven't been able to reproduce the problem. If after doing
> > > > > a large number of times, it can't be reproduced (especially if it
> > > > > can't be reproduced on an SSD), then it would lead us to believe that
> > > > > one of two things is the cause. (a) The CACHE FLUSH command isn't
> > > > > properly getting sent to the device in some cases, or (b) there really
> > > > > is a hardware problem with the flash device in question.
> > > >
> > > > Could (a) be caused by a bug in the mmc subsystem or in the MMC peripheral
> > > > driver? Can you explain why I don't see any problems with EXT3?
> > > >
> > > > I can't discard the possibility of (b) because I cannot prove it, but I will
> > > > try to see if I can do the same test on a SSD which I happen to have on that
> > > > platform. That should be able to rule out problems with the eMMC chip and
> > > > -driver, right?
> > > >
> > > > Do you know a way to investigate (a) (CACHE FLUSH not being sent correctly)?
> > > >
> > > > I left the system running (it started from a dirty EXT4 partition), and I am
> > > > seen the following error pop up after a few minutes. The system is not doing
> > > > much (some syslog activity maybe, but not much more):
> > > >
> > > > [ 303.072983] EXT4-fs (mmcblk1p2): error count: 4
> > > > [ 303.077558] EXT4-fs (mmcblk1p2): initial error at 1404216838: ext4_mb_generate_buddy:756
> > > > [ 303.085690] EXT4-fs (mmcblk1p2): last error at 1404388969: ext4_mb_generate_buddy:757
> > > >
> > > > What does that mean?
> > > This means that it found previous error in internal ext4's log. Which is
> > > normal because your fs was corrupted before. It is reasonable to
> > > recreate filesystem from very beginning.
> > >
> > > In order to understand whenever it is regression in eMMC driver it is
> > > reasonable to run integrity test for a device itself. You can run
> > > any integrity test you like, For example just run a fio's job
> > > "fio disk-verify2.fio" (see attachment), IMPORTANT this script will
> > > destroy data on test partition. If it failed with errors like
> > > follows "verify: bad magic header XXX" than it is definitely a drivers issue.
> >
> > I have been trying to run fio on my board with your configuration file, but I
> > am having problems, and since I am not familiar with fio at all, I can't
> > really figure out what's wrong. My eMMC device is only 916MiB in size, so I
> > edited the last part to be:
> >
> > offset_increment=100M
> > size=100M
> >
> > Is that ok?
> >
> > I still get error messages complaining about blocksize though. Here is the
> > output I get (can't really make sense of it):
> >
> > # ./fio ../disk-verify2.fio
> > Multiple writers may overwrite blocks that belong to other jobs. This can cause verification failures.
> > /dev/mmcblk1p2: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32
> > ...
> > fio-2.1.10-49-gf302
> > Starting 4 processes
> > fio: blocksize too large for data set
> > fio: blocksize too large for data set
> > fio: blocksize too large for data set
> > fio: io_u.c:1315: __get_io_u: Assertion `io_u->flags & IO_U_F_FREE' failed.ta 00m:00s]
> > fio: pid=7612, got signal=6
> >
> > /dev/mmcblk1p2: (groupid=0, jobs=1): err= 0: pid=7612: Fri Jul 4 09:31:15 2014
> > lat (msec) : 4=0.19%, 10=0.19%, 20=0.19%, 50=0.85%, 100=1.23%
> > lat (msec) : 250=56.01%, 500=37.18%, 750=1.14%
> > cpu : usr=0.00%, sys=0.00%, ctx=0, majf=0, minf=0
> > IO depths : 1=0.1%, 2=0.2%, 4=0.4%, 8=0.8%, 16=1.5%, 32=97.1%, >=64=0.0%
> > submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> > complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> > issued : total=r=33/w=1024/d=0, short=r=0/w=0/d=0
> > latency : target=0, window=0, percentile=100.00%, depth=32
> >
> > Run status group 0 (all jobs):
> >
> > Disk stats (read/write):
> > mmcblk1: ios=11/1025, merge=0/0, ticks=94/6671, in_queue=7121, util=96.12%
> > fio: file hash not empty on exit
> >
> >
> > This assertion bugs me. Is it due to the previous errors ("blocksize too large
> > for data set") or is is because my eMMC drive/kernel is seriously screwed?
> >
> > Help please!
> Ohhh. Actually this is axboe's crap. Recent fio versions are broken.
> Please use the known-good commit ffa93ca9d8d37ef:
> git clone git://git.kernel.dk/fio.git
> cd fio
> git checkout -b b2.0.13 ffa93ca9d8d37ef
> make -j4
> ./fio disk-verify2.fio

Ok, that fixed it... :-)

Here is the output I am getting... AFAICS no problems on the raw device. Is
this sufficient testing, Ted?

# ./fio ../disk-verify2.fio
Multiple writers may overwrite blocks that belong to other jobs. This can cause verification failures.
/dev/mmcblk1p2: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32
...
/dev/mmcblk1p2: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32
fio-2.0.13
Starting 4 processes
Jobs: 4 (f=4): [wVww] [100.0% done] [339K/523K/0K /s] [84 /130 /0 iops] [eta 00m:00s]
/dev/mmcblk1p2: (groupid=0, jobs=1): err= 0: pid=1157: Fri Jul 4 11:22:53 2014
write: io=448828KB, bw=127664 B/s, iops=31 , runt=3600049msec
slat (msec): min=1 , max=1821 , avg=30.33, stdev=66.98
clat (usec): min=52 , max=5463.3K, avg=910506.27, stdev=554911.89
lat (msec): min=5 , max=5531 , avg=940.84, stdev=567.60
clat percentiles (msec):
| 1.00th=[ 159], 5.00th=[ 253], 10.00th=[ 351], 20.00th=[ 465],
| 30.00th=[ 578], 40.00th=[ 685], 50.00th=[ 791], 60.00th=[ 914],
| 70.00th=[ 1074], 80.00th=[ 1270], 90.00th=[ 1614], 95.00th=[ 1958],
| 99.00th=[ 2802], 99.50th=[ 3163], 99.90th=[ 4228], 99.95th=[ 4555],
| 99.99th=[ 5211]
bw (KB/s) : min= 3, max= 944, per=37.55%, avg=139.69, stdev=93.70
lat (usec) : 100=0.01%
lat (msec) : 2=0.01%, 4=0.61%, 10=4.33%, 20=5.93%, 50=16.04%
lat (msec) : 100=18.92%, 250=4.74%, 500=9.38%, 750=12.09%, 1000=9.87%
lat (msec) : 2000=15.58%, >=2000=2.51%
cpu : usr=0.38%, sys=0.59%, ctx=332741, majf=0, minf=0
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
issued : total=r=0/w=111616/d=0, short=r=112207/w=0/d=0
/dev/mmcblk1p2: (groupid=0, jobs=1): err= 0: pid=1158: Fri Jul 4 11:22:53 2014
read : io=352016KB, bw=946262 B/s, iops=231 , runt=380935msec
slat (msec): min=2 , max=12334 , avg=261.58, stdev=527.14
clat (msec): min=18 , max=6145 , avg=59.95, stdev=221.07
lat (msec): min=21 , max=12360 , avg=321.53, stdev=570.56
clat percentiles (msec):
| 1.00th=[ 22], 5.00th=[ 23], 10.00th=[ 23], 20.00th=[ 25],
| 30.00th=[ 25], 40.00th=[ 26], 50.00th=[ 27], 60.00th=[ 28],
| 70.00th=[ 29], 80.00th=[ 34], 90.00th=[ 43], 95.00th=[ 198],
| 99.00th=[ 865], 99.50th=[ 1500], 99.90th=[ 3687], 99.95th=[ 3916],
| 99.99th=[ 6128]
write: io=352072KB, bw=111995 B/s, iops=27 , runt=3219084msec
slat (msec): min=1 , max=12214 , avg=36.47, stdev=175.23
clat (usec): min=20 , max=12325K, avg=224666.84, stdev=487725.70
lat (msec): min=2 , max=12333 , avg=261.14, stdev=527.05
clat percentiles (usec):
| 1.00th=[ 25], 5.00th=[ 28], 10.00th=[ 8032], 20.00th=[21376],
| 30.00th=[38144], 40.00th=[57088], 50.00th=[84480], 60.00th=[132096],
| 70.00th=[224256], 80.00th=[288768], 90.00th=[477184], 95.00th=[782336],
| 99.00th=[2072576], 99.50th=[2932736], 99.90th=[5865472], 99.95th=[7503872],
| 99.99th=[12255232]
bw (KB/s) : min= 0, max= 8, per=1.13%, avg= 4.20, stdev= 2.05
lat (usec) : 50=3.49%, 100=0.08%
lat (msec) : 4=0.20%, 10=2.37%, 20=3.64%, 50=54.33%, 100=9.55%
lat (msec) : 250=12.29%, 500=8.59%, 750=2.25%, 1000=1.04%, 2000=1.49%
lat (msec) : >=2000=0.69%
cpu : usr=0.28%, sys=0.38%, ctx=199155, majf=0, minf=0
IO depths : 1=7.1%, 2=14.3%, 4=28.6%, 8=50.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=0.0%, 8=0.0%, 16=100.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued : total=r=0/w=88004/d=0, short=r=88018/w=0/d=0
/dev/mmcblk1p2: (groupid=0, jobs=1): err= 0: pid=1159: Fri Jul 4 11:22:53 2014
read : io=270648KB, bw=164244 B/s, iops=40 , runt=1687389msec
slat (msec): min=1 , max=2338 , avg=28.22, stdev=72.10
clat (msec): min=1 , max=2747 , avg=24.77, stdev=68.06
lat (msec): min=3 , max=2753 , avg=52.99, stdev=105.59
clat percentiles (msec):
| 1.00th=[ 3], 5.00th=[ 3], 10.00th=[ 4], 20.00th=[ 6],
| 30.00th=[ 6], 40.00th=[ 7], 50.00th=[ 7], 60.00th=[ 8],
| 70.00th=[ 9], 80.00th=[ 10], 90.00th=[ 35], 95.00th=[ 178],
| 99.00th=[ 269], 99.50th=[ 367], 99.90th=[ 734], 99.95th=[ 914],
| 99.99th=[ 1942]
write: io=270648KB, bw=144929 B/s, iops=35 , runt=1912261msec
slat (msec): min=1 , max=2338 , avg=28.04, stdev=72.08
clat (usec): min=20 , max=4560 , avg=27.39, stdev=40.17
lat (msec): min=1 , max=2338 , avg=28.07, stdev=72.08
clat percentiles (usec):
| 1.00th=[ 22], 5.00th=[ 23], 10.00th=[ 23], 20.00th=[ 24],
| 30.00th=[ 24], 40.00th=[ 25], 50.00th=[ 25], 60.00th=[ 26],
| 70.00th=[ 26], 80.00th=[ 27], 90.00th=[ 30], 95.00th=[ 41],
| 99.00th=[ 54], 99.50th=[ 65], 99.90th=[ 81], 99.95th=[ 98],
| 99.99th=[ 2160]
bw (KB/s) : min= 1, max= 8, per=1.40%, avg= 5.22, stdev= 1.68
lat (usec) : 50=49.15%, 100=0.83%, 250=0.01%, 500=0.01%
lat (msec) : 2=0.31%, 4=7.00%, 10=33.14%, 20=2.84%, 50=2.10%
lat (msec) : 100=0.89%, 250=2.99%, 500=0.61%, 750=0.09%, 1000=0.03%
lat (msec) : 2000=0.02%, >=2000=0.01%
cpu : usr=0.34%, sys=0.58%, ctx=179182, majf=0, minf=0
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued : total=r=0/w=67662/d=0, short=r=67662/w=0/d=0
/dev/mmcblk1p2: (groupid=0, jobs=1): err= 0: pid=1160: Fri Jul 4 11:22:53 2014
read : io=269304KB, bw=161896 B/s, iops=39 , runt=1703357msec
slat (msec): min=1 , max=2787 , avg=28.12, stdev=69.99
clat (msec): min=1 , max=2876 , avg=25.13, stdev=70.53
lat (msec): min=3 , max=3065 , avg=53.26, stdev=105.59
clat percentiles (msec):
| 1.00th=[ 3], 5.00th=[ 3], 10.00th=[ 4], 20.00th=[ 6],
| 30.00th=[ 6], 40.00th=[ 7], 50.00th=[ 7], 60.00th=[ 8],
| 70.00th=[ 9], 80.00th=[ 10], 90.00th=[ 36], 95.00th=[ 178],
| 99.00th=[ 269], 99.50th=[ 375], 99.90th=[ 742], 99.95th=[ 1012],
| 99.99th=[ 1844]
write: io=269304KB, bw=145424 B/s, iops=35 , runt=1896288msec
slat (msec): min=1 , max=2787 , avg=27.94, stdev=69.97
clat (usec): min=20 , max=4580 , avg=27.51, stdev=44.68
lat (msec): min=1 , max=2787 , avg=27.98, stdev=69.97
clat percentiles (usec):
| 1.00th=[ 22], 5.00th=[ 23], 10.00th=[ 23], 20.00th=[ 24],
| 30.00th=[ 24], 40.00th=[ 25], 50.00th=[ 25], 60.00th=[ 25],
| 70.00th=[ 26], 80.00th=[ 27], 90.00th=[ 30], 95.00th=[ 40],
| 99.00th=[ 54], 99.50th=[ 65], 99.90th=[ 82], 99.95th=[ 105],
| 99.99th=[ 2160]
bw (KB/s) : min= 1, max= 8, per=1.42%, avg= 5.30, stdev= 1.56
lat (usec) : 50=49.18%, 100=0.79%, 250=0.01%, 500=0.01%
lat (msec) : 2=0.45%, 4=6.66%, 10=33.29%, 20=2.93%, 50=2.03%
lat (msec) : 100=0.86%, 250=3.03%, 500=0.63%, 750=0.09%, 1000=0.02%
lat (msec) : 2000=0.02%, >=2000=0.01%
cpu : usr=0.35%, sys=0.58%, ctx=179806, majf=0, minf=0
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued : total=r=0/w=67326/d=0, short=r=67326/w=0/d=0

Run status group 0 (all jobs):
READ: io=891968KB, aggrb=523KB/s, minb=158KB/s, maxb=924KB/s, mint=380935msec, maxt=1703357msec
WRITE: io=1309.5MB, aggrb=372KB/s, minb=109KB/s, maxb=142KB/s, mint=1896288msec, maxt=3600049msec

Disk stats (read/write):
mmcblk1: ios=514970/335209, merge=2162/0, ticks=17103230/10385442, in_queue=27486452, util=100.00%


Best regards,

--
David Jander
Protonic Holland.

2014-07-04 11:32:41

by David Jander

[permalink] [raw]
Subject: Re: ext4: journal has aborted

On Fri, 04 Jul 2014 20:04:13 +0900
Jaehoon Chung <[email protected]> wrote:

> Hi, David.
>
> On 07/04/2014 06:40 PM, David Jander wrote:
> >
> > Hi Dmitry,
> >
> > On Thu, 03 Jul 2014 18:58:48 +0400
> > Dmitry Monakhov <[email protected]> wrote:
> >
> >> On Thu, 3 Jul 2014 16:15:51 +0200, David Jander <[email protected]> wrote:
> >>>
> >>> Hi Ted,
> >>>
> >>> On Thu, 3 Jul 2014 09:43:38 -0400
> >>> "Theodore Ts'o" <[email protected]> wrote:
> >>>
> >>>> On Tue, Jul 01, 2014 at 10:55:11AM +0200, Matteo Croce wrote:
> >>>>> 2014-07-01 10:42 GMT+02:00 Darrick J. Wong <[email protected]>:
> >>>>>
> >>>>> I have a Samsung SSD 840 PRO
> >>>>
> >>>> Matteo,
> >>>>
> >>>> For you, you said you were seeing these problems on 3.15. Was it
> >>>> *not* happening for you when you used an older kernel? If so, that
> >>>> would help us try to provide the basis of trying to do a bisection
> >>>> search.
> >>>
> >>> I also tested with 3.15, and there too I see the same problem.
> >>>
> >>>> Using the kvm-xfstests infrastructure, I've been trying to reproduce
> >>>> the problem as follows:
> >>>>
> >>>> ./kvm-xfstests --no-log -c 4k generic/075 ; e2fsck -p /dev/heap/test-4k ; e2fsck -f /dev/heap/test-4k
> >>>>
> >>>> xfstests geneeric/075 runs fsx which does a fair amount of block
> >>>> allocation deallocations, and then after the test finishes, it first
> >>>> replays the journal (e2fsck -p) and then forces a fsck run on the
> >>>> test disk that I use for the run.
> >>>>
> >>>> After I launch this, in a separate window, I do this:
> >>>>
> >>>> sleep 60 ; killall qemu-system-x86_64
> >>>>
> >>>> This kills the qemu process midway through the fsx test, and then I
> >>>> see if I can find a problem. I haven't had a chance to automate this
> >>>> yet, and it is my intention to try to set this up where I can run this
> >>>> on a ramdisk or a SSD, so I can more closely approximate what people
> >>>> are reporting on flash-based media.
> >>>>
> >>>> So far, I haven't been able to reproduce the problem. If after doing
> >>>> a large number of times, it can't be reproduced (especially if it
> >>>> can't be reproduced on an SSD), then it would lead us to believe that
> >>>> one of two things is the cause. (a) The CACHE FLUSH command isn't
> >>>> properly getting sent to the device in some cases, or (b) there really
> >>>> is a hardware problem with the flash device in question.
> >>>
> >>> Could (a) be caused by a bug in the mmc subsystem or in the MMC peripheral
> >>> driver? Can you explain why I don't see any problems with EXT3?
> >>>
> >>> I can't discard the possibility of (b) because I cannot prove it, but I will
> >>> try to see if I can do the same test on a SSD which I happen to have on that
> >>> platform. That should be able to rule out problems with the eMMC chip and
> >>> -driver, right?
> >>>
> >>> Do you know a way to investigate (a) (CACHE FLUSH not being sent correctly)?
> >>>
> >>> I left the system running (it started from a dirty EXT4 partition), and I am
> >>> seen the following error pop up after a few minutes. The system is not doing
> >>> much (some syslog activity maybe, but not much more):
> >>>
> >>> [ 303.072983] EXT4-fs (mmcblk1p2): error count: 4
> >>> [ 303.077558] EXT4-fs (mmcblk1p2): initial error at 1404216838: ext4_mb_generate_buddy:756
> >>> [ 303.085690] EXT4-fs (mmcblk1p2): last error at 1404388969: ext4_mb_generate_buddy:757
> >>>
> >>> What does that mean?
> >> This means that it found previous error in internal ext4's log. Which is
> >> normal because your fs was corrupted before. It is reasonable to
> >> recreate filesystem from very beginning.
> >>
> >> In order to understand whenever it is regression in eMMC driver it is
> >> reasonable to run integrity test for a device itself. You can run
> >> any integrity test you like, For example just run a fio's job
> >> "fio disk-verify2.fio" (see attachment), IMPORTANT this script will
> >> destroy data on test partition. If it failed with errors like
> >> follows "verify: bad magic header XXX" than it is definitely a drivers issue.
> >
> > I have been trying to run fio on my board with your configuration file, but I
> > am having problems, and since I am not familiar with fio at all, I can't
> > really figure out what's wrong. My eMMC device is only 916MiB in size, so I
> > edited the last part to be:
>
> Which eMMC host controller did you use?

compatible = "fsl,imx6q-usdhc";

Basically a Freescale i.MX6, which uses the drivers/mmc/host/sdhci-esdhc-imx.c
driver, which (as the name says) is SDHCI-compliant.

> > offset_increment=100M
> > size=100M
> >
> > Is that ok?
> >
> > I still get error messages complaining about blocksize though. Here is the
> > output I get (can't really make sense of it):
> >
> > # ./fio ../disk-verify2.fio
> > Multiple writers may overwrite blocks that belong to other jobs. This can cause verification failures.
> > /dev/mmcblk1p2: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32
> > ...
> > fio-2.1.10-49-gf302
> > Starting 4 processes
> > fio: blocksize too large for data set
> > fio: blocksize too large for data set
> > fio: blocksize too large for data set
> > fio: io_u.c:1315: __get_io_u: Assertion `io_u->flags & IO_U_F_FREE' failed.ta 00m:00s]
> > fio: pid=7612, got signal=6
> >
> > /dev/mmcblk1p2: (groupid=0, jobs=1): err= 0: pid=7612: Fri Jul 4 09:31:15 2014
> > lat (msec) : 4=0.19%, 10=0.19%, 20=0.19%, 50=0.85%, 100=1.23%
> > lat (msec) : 250=56.01%, 500=37.18%, 750=1.14%
> > cpu : usr=0.00%, sys=0.00%, ctx=0, majf=0, minf=0
> > IO depths : 1=0.1%, 2=0.2%, 4=0.4%, 8=0.8%, 16=1.5%, 32=97.1%, >=64=0.0%
> > submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> > complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> > issued : total=r=33/w=1024/d=0, short=r=0/w=0/d=0
> > latency : target=0, window=0, percentile=100.00%, depth=32
> >
> > Run status group 0 (all jobs):
> >
> > Disk stats (read/write):
> > mmcblk1: ios=11/1025, merge=0/0, ticks=94/6671, in_queue=7121, util=96.12%
> > fio: file hash not empty on exit
> >
> >
> > This assertion bugs me. Is it due to the previous errors ("blocksize too large
> > for data set") or is is because my eMMC drive/kernel is seriously screwed?
> >
> > Help please!
> >
> >> If my theory is true and it is storage's driver issue than JBD complain
> >> simply because it do care about it's data (it does integrity checks).
> >> Can you also create btrfs on that partition and performs some io
> >> activity and run fsck after that. You likely will see similar corruption
> >
> > Best regards,
> >
>

Best regards,

--
David Jander
Protonic Holland.

2014-07-04 12:20:35

by Theodore Ts'o

[permalink] [raw]
Subject: Re: ext4: journal has aborted

On Fri, Jul 04, 2014 at 01:28:02PM +0200, David Jander wrote:
>
> Here is the output I am getting... AFAICS no problems on the raw device. Is
> this sufficient testing, Ted?

I'm not sure what theory Dmitry was trying to pursue when he requested
that you run the fio test. Dmitry?


Please note that at this point there may be multiple causes with
similar symptoms that are showing up. So just because one person
reports one set of data points, such as someone claiming they've seen
this without a power drop to the storage device, it does not follow
that all of the problems were caused by flaky I/O to the device.

Right now, there are multiple theories floating around --- and it may
be that more than one of them are true (i.e., there may be multiple
bugs here). Some of the possibilities, which again, may not be
mutually exclusive:

1) Some kind of eMMC driver bug, which is possibly causing the CACHE
FLUSH command not to be sent.

2) Some kind of hardware problem involving flash translation layers
not having durable transactions of their flash metadata across power
failures.

3) Some kind of ext4/jbd2 bug, recently introduced, where we are
modifying some ext4 metadata (either the block allocation bitmap or
block group summary statistics) outside of a valid transaction handle.

4) Some other kind of hard-to-reproduce race or wild pointer which is
sometimes corrupting fs data structures.


If someone has an easy-to-reproduce failure case, the first step is to
do a very rough bisection test. Does the easy-to-reproduce failure go
away if you use 3.14? 3.12? Also, if you can describe in great
detail your hardware and software configuration, and under what
circumstances the problem reproduces, and when it doesn't, that would
also be critical. Whether you are doing just a reset or a power cycle
when an unclean shutdown is involved might also be important.

And at this point, because I'm getting very suspicious that there may
be more than one root cause, we should try to keep the debugging of
one person's reproduction, such as David's, separate from another's,
such as Matteo's. It may be that they ultimately have the same root
cause, and so if one person is able to get an interesting reproduction
result, it would be great for the other person to try running the same
experiment on their hardware/software configuration. But what we must
not do is assume that one person's experiment is automatically
applicable to other circumstances.

Cheers,

- Ted

2014-07-04 12:38:54

by Dmitry Monakhov

[permalink] [raw]
Subject: Re: ext4: journal has aborted

On Fri, 4 Jul 2014 08:20:22 -0400, Theodore Ts'o <[email protected]> wrote:
> On Fri, Jul 04, 2014 at 01:28:02PM +0200, David Jander wrote:
> >
> > Here is the output I am getting... AFAICS no problems on the raw device. Is
> > this sufficient testing, Ted?
>
> I'm not sure what theory Dmitry was trying to pursue when he requested
> that you run the fio test. Dmitry?
Because at this moment we have some complex storage+fs interaction,
my idea was to simply isolate the raw device case and run an integrity test on that storage.
fio/libaio is a trivial and easy way to do it (except that it does not issue the
flush cmd). Unfortunately, according to David the test finished w/o any
error. So my theory about a broken storage driver was not confirmed.

>
>
> Please note that at this point there may be multiple causes with
> similar symptoms that are showing up. So just because one person
> reports one set of data points, such as someone claiming they've seen
> this without a power drop to the storage device, that therefore all of
> the problems were caused by flaky I/O to the device.
>
> Right now, there are multiple theories floating around --- and it may
> be that more than one of them are true (i.e., there may be multiple
> bugs here). Some of the possibilities, which again, may not be
> mutually exclusive:
>
> 1) Some kind of eMMC driver bug, which is possibly causing the CACHE
> FLUSH command not to be sent.
>
> 2) Some kind of hardware problem involving flash translation layers
> not having durable transactions of their flash metadata across power
> failures.
>
> 3) Some kind of ext4/jbd2 bug, recently introduced, where we are
> modifying some ext4 metadata (either the block allocation bitmap or
> block group summary statistics) outside of a valid transaction handle.
>
> 4) Some other kind of hard-to-reproduce race or wild pointer which is
> sometimes corrupting fs data structures.
>
>
> If someone has a easy to reproduce failure case, the first step is to
> do a very rough bisection test. Does the easy-to-reproduce failure go
> away if you use 3.14? 3.12? Also, if you can describe in great
> detail your hardware and software configuration, and under what
> circumstances the problem reproduces, and when it doesn't, that would
> also be critical. Whether you are just doing reset or a power cycle
> if an unclean shutdown is involved, might also be important.
>
> And at this point, because I'm getting very suspicious that there may
> be more than one root cause, we should try to keep the debugging of
> one person's reproduction, such as David's, separate from another's,
> such as Matteo's. It may be that there ultimately have the same root
> cause, and so if one person is able to get an interesting reproduction
> result, it would be great for the other person to try running the same
> experiment on their hardware/software configuration. But what we must
> not do is assume that one person's experiment is automatically
> applicable to other circumstances.
>
> Cheers,
>
> - Ted

2014-07-04 13:45:52

by David Jander

[permalink] [raw]
Subject: Re: ext4: journal has aborted


Hi Ted, Dmitry,

On Fri, 4 Jul 2014 08:20:22 -0400
"Theodore Ts'o" <[email protected]> wrote:

> On Fri, Jul 04, 2014 at 01:28:02PM +0200, David Jander wrote:
> >
> > Here is the output I am getting... AFAICS no problems on the raw device. Is
> > this sufficient testing, Ted?
>
> I'm not sure what theory Dmitry was trying to pursue when he requested
> that you run the fio test. Dmitry?
>
>
> Please note that at this point there may be multiple causes with
> similar symptoms that are showing up. So just because one person
> reports one set of data points, such as someone claiming they've seen
> this without a power drop to the storage device, that therefore all of
> the problems were caused by flaky I/O to the device.
>
> Right now, there are multiple theories floating around --- and it may
> be that more than one of them are true (i.e., there may be multiple
> bugs here). Some of the possibilities, which again, may not be
> mutually exclusive:
>
> 1) Some kind of eMMC driver bug, which is possibly causing the CACHE
> FLUSH command not to be sent.

How can I investigate this? According to the fio tests I ran and the
explanation Dmitry gave, I conclude that incorrect sending of CACHE-FLUSH
commands is the only thing left to be ruled out on the eMMC driver front,
right?

> 2) Some kind of hardware problem involving flash translation layers
> not having durable transactions of their flash metadata across power
> failures.

That would be like blaming Micron (the eMMC part manufacturer) for faulty
firmware... could be, but how can we test this?

> 3) Some kind of ext4/jbd2 bug, recently introduced, where we are
> modifying some ext4 metadata (either the block allocation bitmap or
> block group summary statistics) outside of a valid transaction handle.

I think I have some more evidence to support this case:

Until now, I never ran fsck at all! I know that this is not a good idea
to do in a production environment, but I am only testing right now, and in
theory it should not be necessary, right?

What I did this time was to run fsck.ext3 or fsck.ext4 (depending on FS
format of course) once every one or two power cycles.

So effectively, what I did amounts to this:

CASE 1: fsck on every power-cycle:

1.- Boot from clean filesystem
2.- Run the following command line:
$ cp -a /usr . & bonnie\+\+ -r 32 -u 100:100 & bonnie\+\+ -r 32 -u 102:102

3.- Hit CTRL+Z (to stop the second bonnie++ process)
4.- Execute "sync"
5.- While "sync" was running, cut off the power supply.
6.- Turn on power and boot from external medium
7.- Run fsck.ext3/4 on eMMC device
8.- Repeat

In this case, there was a minor difference between the fsck output of the
two filesystems:

EXT4 was always something like this:

# fsck.ext4 /dev/mmcblk1p2
e2fsck 1.42.5 (29-Jul-2012)
rootfs: recovering journal
Setting free inodes count to 37692 (was 37695)
Setting free blocks count to 136285 (was 136291)
rootfs: clean, 7140/44832 files, 42915/179200 blocks

While for EXT3 the output did not contain the "Setting free * count..."
messages:

# fsck.ext3 -p /dev/mmcblk1p2
rootfs: recovering journal
rootfs: clean, 4895/44832 files, 36473/179200 blocks



CASE 2: fsck on every other power-cycle:

Same as CASE 1 steps 1...5 and then:
6.- Turn on power and boot again from dirty internal eMMC without running fsck.
7.- Repeat steps 2...5 one more time
8.- Perform steps 6...8 from CASE 1.

With this test, the following difference became apparent:

With EXT3: fsck.ext3 did the same as in CASE 1

With EXT4: I get a long list of errors that are being fixed.
It starts like this:


# fsck.ext4 /dev/mmcblk1p2
e2fsck 1.42.5 (29-Jul-2012)
rootfs: recovering journal
rootfs contains a file system with errors, check forced.
Pass 1: Checking inodes, blocks, and sizes
Inode 4591, i_blocks is 16, should be 8. Fix<y>? yes
Inode 4594, i_blocks is 16, should be 8. Fix<y>? yes
Inode 4595, i_blocks is 16, should be 8. Fix<y>? yes
Inode 4596, i_blocks is 16, should be 8. Fix<y>? yes
Inode 4597, i_blocks is 16, should be 8. Fix<y>? yes
Inode 4598, i_blocks is 16, should be 8. Fix<y>? yes
Inode 4599, i_blocks is 16, should be 8. Fix<y>? yes
Inode 4600, i_blocks is 16, should be 8. Fix<y>? yes
Inode 4601, i_blocks is 16, should be 8. Fix<y>? yes
Inode 4602, i_blocks is 16, should be 8. Fix<y>? yes
Inode 4603, i_blocks is 16, should be 8. Fix<y>? yes
...
...
Eventually I pressed CTRL+C and restarted fsck with the option "-p", because
this list was getting a little long.
...
...

# fsck.ext4 -p /dev/mmcblk1p2
rootfs contains a file system with errors, check forced.
rootfs: Inode 5391, i_blocks is 32, should be 16. FIXED.
rootfs: Inode 5392, i_blocks is 16, should be 8. FIXED.
rootfs: Inode 5393, i_blocks is 48, should be 24. FIXED.
rootfs: Inode 5394, i_blocks is 32, should be 16. FIXED.
rootfs: Inode 5395, i_blocks is 16, should be 8. FIXED.
...
...
rootfs: Inode 5854, i_blocks is 240, should be 120. FIXED.
rootfs: Inode 5857, i_blocks is 576, should be 288. FIXED.
rootfs: Inode 5860, i_blocks is 512, should be 256. FIXED.
rootfs: Inode 5863, i_blocks is 656, should be 328. FIXED.
rootfs: Inode 5866, i_blocks is 480, should be 240. FIXED.
rootfs: Inode 5869, i_blocks is 176, should be 88. FIXED.
rootfs: Inode 5872, i_blocks is 336, should be 168. FIXED.
rootfs: 11379/44832 files (0.1% non-contiguous), 70010/179200 blocks
#

> 4) Some other kind of hard-to-reproduce race or wild pointer which is
> sometimes corrupting fs data structures.

I don't have such a hard time reproducing it... but it does take quite some
time (booting several times, re-installing, testing, etc...)

> If someone has a easy to reproduce failure case, the first step is to
> do a very rough bisection test. Does the easy-to-reproduce failure go
> away if you use 3.14? 3.12? Also, if you can describe in great
> detail your hardware and software configuration, and under what
> circumstances the problem reproduces, and when it doesn't, that would
> also be critical. Whether you are just doing reset or a power cycle
> if an unclean shutdown is involved, might also be important.

Until now, I have always done a power-cycle, but I can try to check whether
I am able to reproduce the problem with just a "shutdown -f" (AFAIK, this
does NOT sync filesystems, right?)

I will try to check 3.14 and 3.12 (if 3.14 still seems buggy). It could take
quite a while until I have results... certainly not before Monday.

> And at this point, because I'm getting very suspicious that there may
> be more than one root cause, we should try to keep the debugging of
> one person's reproduction, such as David's, separate from another's,
> such as Matteo's. It may be that there ultimately have the same root
> cause, and so if one person is able to get an interesting reproduction
> result, it would be great for the other person to try running the same
> experiment on their hardware/software configuration. But what we must
> not do is assume that one person's experiment is automatically
> applicable to other circumstances.

I agree.

Best regards,

--
David Jander
Protonic Holland.

2014-07-04 18:45:46

by Theodore Ts'o

[permalink] [raw]
Subject: Re: ext4: journal has aborted

On Fri, Jul 04, 2014 at 03:45:59PM +0200, David Jander wrote:
> > 1) Some kind of eMMC driver bug, which is possibly causing the CACHE
> > FLUSH command not to be sent.
>
> How can I investigate this? According to the fio tests I ran and the
> explanation Dmitry gave, I conclude that incorrectly sending of CACHE-FLUSH
> commands is the only thing left to be discarded on the eMMC driver front,
> right?

Can you try using an older kernel? The report that I quoted from
John Stultz (https://lkml.org/lkml/2014/6/12/19) indicated that it was
a problem that showed up in "recent kernels", and a bisection search
seemed to point towards an unknown problem in the eMMC driver.
Quoting from https://lkml.org/lkml/2014/6/12/762:

"However, despite many many reboots the last good commit in my
branch - bb5cba40dc7f079ea7ee3ae760b7c388b6eb5fc3 (mmc: block:
Fixup busy detection while...) doesn't ever show the issue. While
the immediately following commit which bisect found -
e7f3d22289e4307b3071cc18b1d8ecc6598c0be4 (mmc: mmci: Handle CMD
irq before DATA irq) always does.

The immensely frustrating part is while backing that single change off
from its commit sha always makes the issue go away, reverting that
change from on top of v3.15 doesn't. The issue persists....."

> > 2) Some kind of hardware problem involving flash translation layers
> > not having durable transactions of their flash metadata across power
> > failures.
>
> That would be like blaming Micron (the eMMC part manufacturer) for faulty
> firmware... could be, but how can we test this?

The problem is that people who write these programs end up doing
one-offs, as opposed to something that is well packaged and stands the
test of time. But basically what we want is a program that writes to
sequential blocks in a block device with the following information:

*) a timestamp (seconds and microseconds from gettimeofday)
*) a 64-bit generation number (which is randomly
generated and the same for each run of the program)
*) a 32-bit sequence number (starts at zero and
increments once per block)
*) a 32-bit "sync" number which is incremented each time
fsync(2) is called while writing to the disk
*) the sector number where the data was written
*) a CRC of the above information
*) some random pattern to fill the rest of the 512 or 4k block,
depending on the physical sector size

The program would then write this information to each 512 or 4k block,
and periodically, say every N blocks (where N should be configurable),
it would call fsync(), which on an open block device should result in
a CACHE FLUSH command being sent.

Ideally, the sequence number and sync number would be output via a
serial console, or maybe over an ssh connection, so you can see the
last sequence number and sync number that were written to the disk
before you yank the power cord out of the wall.
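
To make that concrete, here is a minimal sketch of such a writer, in
Python. It is purely illustrative: the 4k block size, the record
layout, and the names and arguments below are all just assumptions,
not an existing tool.

#!/usr/bin/env python3
# writer.py -- illustrative sketch only, not a packaged tool.
# Writes a self-describing record into every 4k block of a block
# device and calls fsync() every N blocks, which on an open block
# device should result in a CACHE FLUSH being sent.
import os, struct, sys, time, random, zlib

BLKSZ = 4096                          # assume 4k physical sectors
HDR = struct.Struct("<QQQIIQ")        # sec, usec, generation, seq, sync, sector

def run(dev, nblocks, fsync_every):
    gen = random.getrandbits(64)      # random, fixed for this run
    print("generation=%d" % gen, flush=True)
    fd = os.open(dev, os.O_WRONLY)
    filler = os.urandom(BLKSZ)        # random pattern padding each block
    sync_no = 0
    for seq in range(nblocks):
        now = time.time()
        sector = seq * (BLKSZ // 512)
        hdr = HDR.pack(int(now), int((now % 1) * 1e6), gen, seq, sync_no, sector)
        crc = zlib.crc32(hdr) & 0xffffffff
        block = (hdr + struct.pack("<I", crc) + filler)[:BLKSZ]
        os.pwrite(fd, block, seq * BLKSZ)
        if (seq + 1) % fsync_every == 0:
            os.fsync(fd)              # should trigger a CACHE FLUSH
            sync_no += 1
            # report progress so the last durable seq/sync number is
            # visible on the serial console before the power is cut
            print("seq=%d sync=%d" % (seq, sync_no), flush=True)
    os.fsync(fd)
    os.close(fd)

if __name__ == "__main__":
    # e.g.  ./writer.py /dev/mmcblk1p2 100000 256
    run(sys.argv[1], int(sys.argv[2]), int(sys.argv[3]))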

Then you have to write another program which scans the disk and makes
sure that all of the blocks up until the last sync number that was
output were in fact preserved faithfully on the disk after the power
outage.

If not, you know you have a problem.
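
A matching checker, again only a sketch under the same assumed layout,
would take the last "seq=... sync=..." line that made it to the
console before the power was cut and verify every block up to that
point:

#!/usr/bin/env python3
# checker.py -- illustrative counterpart to the writer sketch above.
# Re-reads the device after the power failure and verifies every block
# up to the last sequence number that was reported as synced.
import os, struct, sys, zlib

BLKSZ = 4096
HDR = struct.Struct("<QQQIIQ")

def check(dev, last_seq, expected_gen):
    fd = os.open(dev, os.O_RDONLY)
    bad = 0
    for seq in range(last_seq + 1):
        block = os.pread(fd, BLKSZ, seq * BLKSZ)
        hdr = block[:HDR.size]
        (crc,) = struct.unpack_from("<I", block, HDR.size)
        sec, usec, gen, bseq, sync_no, sector = HDR.unpack(hdr)
        if crc != (zlib.crc32(hdr) & 0xffffffff) or bseq != seq:
            print("corrupt block at seq %d (sector %d)"
                  % (seq, seq * (BLKSZ // 512)))
            bad += 1
        elif gen != expected_gen:
            # valid record, but from an older run: the device gave us
            # back stale data instead of the acknowledged write
            print("stale block at seq %d (generation %d)" % (seq, gen))
            bad += 1
    os.close(fd)
    print("%d bad blocks out of %d checked" % (bad, last_seq + 1))

if __name__ == "__main__":
    # e.g.  ./checker.py /dev/mmcblk1p2 76799 1234567890123456789
    check(sys.argv[1], int(sys.argv[2]), int(sys.argv[3]))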

For bonus credit, you can also have a mode where the disk is
pre-written with some known repeating pattern, such as 0xDEADBEEF, and
then see if you have some sectors that do not contain either the
expected 0xDEADBEEF, or the test pattern described above.
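
The prefill pass for that bonus mode could be as simple as this
(again, just an illustrative sketch):

#!/usr/bin/env python3
# prefill.py -- illustrative prefill for the "bonus credit" mode:
# stamp the device with a known repeating pattern, so the checker can
# flag sectors that hold neither the pattern nor a valid test record.
import os, sys

BLKSZ = 4096
PATTERN = b"\xde\xad\xbe\xef" * (BLKSZ // 4)

def prefill(dev):
    fd = os.open(dev, os.O_WRONLY)
    size = os.lseek(fd, 0, os.SEEK_END)   # block devices report their size
    for off in range(0, size - (size % BLKSZ), BLKSZ):
        os.pwrite(fd, PATTERN, off)
    os.fsync(fd)
    os.close(fd)

if __name__ == "__main__":
    prefill(sys.argv[1])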

Finding such sectors indicates a potential problem where, as power
drops, the memory can sometimes go insane before the hard drive and
the DMA engine manage to stop. (As the saying goes, death is not an
event but rather a process, and different components inside the
computer can die at different times.) SGI discovered this problem very
early on, which is why SGI machines have extra-large capacitors on
their power supplies, plus a power-fail interrupt, so that when they
lose power, the system can abort DMA transfers before allowing the
power to shut down. Otherwise, a file system which is doing logical
journalling can end up with crap getting written on top of some
critical file system data block, which replaying the journal won't
fix, since the journal just contains things like "update the mtime to
be this value".

Since ext3 and ext4 do physical block journalling, we don't suffer
from this problem: we always write the journal first, and the journal
is flushed to disk before we update the metadata blocks. On a power
failure, if there is a corrupted fs metadata block due to the power
failure causing garbage to be written to the disk, the damage gets
repaired automatically when the journal is replayed. And this is a
good thing, since PC class hardware generally doesn't have power fail
interrupts. OTOH, ext3/4 is not as efficient as file systems such as
XFS that do logical journalling. Life is full of tradeoffs....

> EXT4 was always something like this:
>
> # fsck.ext4 /dev/mmcblk1p2
> e2fsck 1.42.5 (29-Jul-2012)
> rootfs: recovering journal
> Setting free inodes count to 37692 (was 37695)
> Setting free blocks count to 136285 (was 136291)
> rootfs: clean, 7140/44832 files, 42915/179200 blocks

This is normal, and not an indication of a problem. In ext4, we no
longer update the free blocks and free inodes count in the superblock,
except when we unmount the file system. Otherwise on large CPU
systems, needing to grab the superblock lock to update these values on
each block/inode allocation and deallocation becomes a massive
scalability bottleneck.

Instead, at mount time, since we need to read in the block group
descriptors *anyway*, we just sum the free blocks/inodes in each block
group descriptor, and update the global counters then. So the fact
that the global free blocks and free inodes values in the superblock
are not up to date after a crash is normal, and expected. This is why
e2fsck didn't ask for permission before fixing this issue, which it
would have when you run e2fsck without either the -p or -y option.
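
In rough pseudo-code terms (a conceptual sketch only, not the actual
kernel code), the mount-time step amounts to:

# Conceptual sketch: at mount time ext4 effectively rebuilds the
# global counters from the per-group descriptors, so a stale value in
# the superblock after a crash is harmless.
def recompute_global_counters(group_descriptors):
    free_blocks = sum(gd["free_blocks"] for gd in group_descriptors)
    free_inodes = sum(gd["free_inodes"] for gd in group_descriptors)
    return free_blocks, free_inodes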

> CASE 2: fsck on every other power-cycle:
>
> Same as CASE 1 steps 1...5 and then:
> 6.- Turn on power and boot again from dirty internal eMMC without running fsck.
> 7.- Repeat steps 2...5 one more time
> 8.- Perform steps 6...8 from CASE 1.
>
> With this test, the following difference became apparent:
>
> With EXT3: fsck.ext3 did the same as in CASE 1
>
> With EXT4: I get a long list of errors that are being fixed.
> It starts like this:
>
> # fsck.ext4 /dev/mmcblk1p2
> e2fsck 1.42.5 (29-Jul-2012)
> rootfs: recovering journal
> rootfs contains a file system with errors, check forced.
> Pass 1: Checking inodes, blocks, and sizes
> Inode 4591, i_blocks is 16, should be 8. Fix<y>? yes
> Inode 4594, i_blocks is 16, should be 8. Fix<y>? yes
> Inode 4595, i_blocks is 16, should be 8. Fix<y>? yes
> Inode 4596, i_blocks is 16, should be 8. Fix<y>? yes
> Inode 4597, i_blocks is 16, should be 8. Fix<y>? yes
> Inode 4598, i_blocks is 16, should be 8. Fix<y>? yes
> Inode 4599, i_blocks is 16, should be 8. Fix<y>? yes
> Inode 4600, i_blocks is 16, should be 8. Fix<y>? yes
> Inode 4601, i_blocks is 16, should be 8. Fix<y>? yes
> Inode 4602, i_blocks is 16, should be 8. Fix<y>? yes
> Inode 4603, i_blocks is 16, should be 8. Fix<y>? yes

*This* is a problem. And if you're seeing it for such a very large
number of cases, it makes it unlikely to me that it is an ext4/jbd2
bug. For one thing, if it's so blatant, I should be able to replicate
it very easily when using my "kill qemu" test.

What this makes me suspect is that you may be getting an older version
of an inode table block. That's why the above test procedure I
outlined included the timestamp and the generation number in each
block. If after a power failure you start seeing an old generation
number, then that's a hint that the FTL did something funky.

And the advantage of using a test program that writes to the raw
eMMC flash is that it cuts the file system completely out of the
picture....

Cheers,

- Ted

2014-07-04 20:48:05

by Eric Whitney

[permalink] [raw]
Subject: Re: ext4: journal has aborted

* Theodore Ts'o <[email protected]>:
> On Thu, Jul 03, 2014 at 01:14:34PM -0400, Eric Whitney wrote:
> >
> > FWIW, I've also been able to reproduce that failure scenario on an
> > x86_64 KVM with raw virtio disks alone. It's just a lot harder to
> > get there with that configuration - many more trials required.
>
> What configuration are you using when you see the failure using x86_64
> KVM with raw virtio? Is this using kvm-xfstests, or some other setup?
> And which file system is getting corrupted?
>
> Because with kvm-xfstests the root file system is a read-only
> snapshot, and the "mmcblk0p3" device which you show before is clearly
> not the virtio device....
>

I'd meant the x86_64 note (paragraph beginning with FWIW) as an aside, and that
regrettably caused confusion. Here's a more formal report, this time largely
limited to an x86_64 case where eMMC is not involved.

x86_64 KVM case:

Base distro installed - Ubuntu 13.10
Kernel version - mainline 3.15-rc3
e2fsprogs - master branch, bb9cca2ca9
xfstests - master branch, 45d1fac130
xfsprogs - master branch, 03e956b252
xfstests-bld - master branch, 723c23e1b2 (about a year old)

(e2fsprogs, xfstests, and xfsprogs are relatively recent - 1 to 2 months old,
and latest as of the availability of 3.15-rc1.)

The test scenario involves running xfstest generic/068 using the runtests.sh
script from xfstests-bld within a KVM constructed with libvirt tools and where
all test components were built and installed by hand. (kvm-xfstests was not
used here.)

The virtual machine running the test used a raw virtio disk for the root (vda)
and for the test filesystem (vdc). (Two other virtio disks were also
attached but not used for this test - vdb and vdd.)

vda contained a vanilla 4k ext4 filesystem as provided by the distro. vdc
contained a 4k ext4 filesystem mounted with the data=journal and
block_validity mount options, newly created and mounted before each attempt
to execute the test.

The failure sequence:

1) running generic/068 on vdc forces a BUG -

[ 1976.238967] kernel BUG at fs/buffer.c:3017!
[ 1976.239070] invalid opcode: 0000 [#1] SMP
[ 1976.239150] Modules linked in: kvm_intel kvm psmouse microcode i2c_piix4 serio_raw virtio_balloon
[ 1976.239286] CPU: 0 PID: 11705 Comm: xfs_io Not tainted 3.15.0-rc3-emmc+ #1
[ 1976.239380] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[ 1976.239472] task: ffff88007624a110 ti: ffff8800768da000 task.ti: ffff8800768da000
[ 1976.239568] RIP: 0010:[<ffffffff811d849c>] [<ffffffff811d849c>] _submit_bh+0x18c/0x210
[ 1976.239682] RSP: 0018:ffff8800768dbd38 EFLAGS: 00010246
[ 1976.239763] RAX: 000000000022c005 RBX: ffff88006e3e3e38 RCX: 0000000000000000
[ 1976.242063] RDX: 0000000000000000 RSI: ffff88006e3e3e38 RDI: 0000000000000411
[ 1976.242759] RBP: ffff8800768dbd58 R08: 0000000000000000 R09: 0000000000000000
[ 1976.242759] R10: 00000000000360b0 R11: 0000000225c17d03 R12: ffff88006e3e3e38
[ 1976.242759] R13: ffff8800768dbe14 R14: ffff880067648800 R15: ffff88006e023410
[ 1976.242759] FS: 00007f2e03b85740(0000) GS:ffff88007fc00000(0000) knlGS:0000000000000000
[ 1976.242759] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 1976.242759] CR2: 000000000061da00 CR3: 0000000076a68000 CR4: 00000000000006f0
[ 1976.242759] Stack:
[ 1976.242759] ffff88006e3e3e38 0000000000000411 ffff8800768dbe14 ffff880067648800
[ 1976.242759] ffff8800768dbd78 ffffffff811d8f71 0000000000000001 ffff880067648b00
[ 1976.242759] ffff8800768dbde0 ffffffff8129a67d 0000000091827364 ffff8800768dbd90
[ 1976.242759] Call Trace:
[ 1976.242759] [<ffffffff811d8f71>] write_dirty_buffer+0x51/0x70
[ 1976.242759] [<ffffffff8129a67d>] __flush_batch+0x4d/0xa0
[ 1976.242759] [<ffffffff8129ac5c>] jbd2_log_do_checkpoint+0x1dc/0x480
[ 1976.242759] [<ffffffff8129f574>] jbd2_journal_flush+0x94/0x180
[ 1976.242759] [<ffffffff8126277e>] ext4_freeze+0x3e/0x80
[ 1976.242759] [<ffffffff811a6a88>] freeze_super+0xb8/0x130
[ 1976.242759] [<ffffffff811b7d3e>] do_vfs_ioctl+0x2ce/0x520
[ 1976.242759] [<ffffffff816d3689>] ? retint_swapgs+0xe/0x13
[ 1976.242759] [<ffffffff810a09ed>] ? trace_hardirqs_on_caller+0xfd/0x1c0
[ 1976.242759] [<ffffffff811b8011>] SyS_ioctl+0x81/0xa0
[ 1976.242759] [<ffffffff816dbf92>] system_call_fastpath+0x16/0x1b
[ 1976.242759] Code: 5d 41 5e 5d c3 66 2e 0f 1f 84 00 00 00 00 00 40 f6 c7 01 0f 84 e3 fe ff ff f0 41 80 64 24 01 f7 e9 d7 fe ff ff 0f 0b 0f 0b 0f 0b <0f> 0b 0f 0b 48 8b 53 78 c1 e0 09 41 f6 c5 01 89 43 28\
89 42 08
[ 1976.242759] RIP [<ffffffff811d849c>] _submit_bh+0x18c/0x210
[ 1976.242759] RSP <ffff8800768dbd38>
[ 1976.268124] ---[ end trace c0ad2d4fabce4179 ]---

(Note that this snippet was extracted from the kernel log - /var/log/kern.log
on the root filesystem, and was therefore written successfully.)

2) Components of the test are then left in the D state (fsstress blocked on
wchan sync_inodes_sb and rm on vfs_unlink according to ps), and the test's
controlling shell is unresponsive.

3) Rebooting the test system then leads to a hang as the system is shutting
down - unresponsive console, shutdown does not make progress. (Nothing new
here.)

4) Resetting the virtual machine leads to a reboot, which in most (but not
all) cases results in a read-only root filesystem. (This is where the
filesystem damage that I mentioned shows up.)

From the dmesg log after reboot (since the kernel log can't be written once
the root filesystem is read only) -

[ 1.030238] EXT4-fs (vda1): INFO: recovery required on readonly filesystem
[ 1.031162] EXT4-fs (vda1): write access will be enabled during recovery
[ 1.432329] tsc: Refined TSC clocksource calibration: 3000.110 MHz
[ 1.472223] EXT4-fs (vda1): recovery complete
[ 1.477099] EXT4-fs (vda1): mounted filesystem with ordered data mode. Opts: (null)
[ 1.598232] random: init urandom read with 38 bits of entropy available
[ 2.079563] Adding 1046524k swap on /dev/vda5. Priority:-1 extents:1 across:1046524k FS
[ 2.305267] systemd-udevd[263]: starting version 204
[ 2.440193] Switched to clocksource tsc
[ 2.441306] EXT4-fs (vda1): re-mounted. Opts: errors=remount-ro
[ 2.626895] piix4_smbus 0000:00:01.3: SMBus Host Controller at 0xb100, revision 0
[ 2.729328] audit: type=1400 audit(1404499589.548:2): apparmor="STATUS" operation="profile_load" name="/sbin/dhclient" pid=314 comm="apparmor_parser"
[ 2.729338] audit: type=1400 audit(1404499589.548:3): apparmor="STATUS" operation="profile_load" name="/usr/lib/NetworkManager/nm-dhcp-client.action" pid=314 comm="apparmor_parser"
[ 2.729342] audit: type=1400 audit(1404499589.548:4): apparmor="STATUS" operation="profile_load" name="/usr/lib/connman/scripts/dhclient-script" pid=314 comm="apparmor_parser"
[ 2.729803] audit: type=1400 audit(1404499589.548:5): apparmor="STATUS" operation="profile_replace" name="/usr/lib/NetworkManager/nm-dhcp-client.action" pid=314 comm="apparmor_parser"
[ 2.729810] audit: type=1400 audit(1404499589.548:6): apparmor="STATUS" operation="profile_replace" name="/usr/lib/connman/scripts/dhclient-script" pid=314 comm="apparmor_parser"
[ 2.729997] audit: type=1400 audit(1404499589.548:7): apparmor="STATUS" operation="profile_replace" name="/usr/lib/connman/scripts/dhclient-script" pid=314 comm="apparmor_parser"
[ 2.853872] microcode: CPU0 sig=0x623, pf=0x0, revision=0x1
[ 2.858650] microcode: CPU1 sig=0x623, pf=0x0, revision=0x1
[ 2.871491] microcode: Microcode Update Driver: v2.00 <[email protected]>, Peter Oruba
[ 3.780154] input: ImExPS/2 Generic Explorer Mouse as /devices/platform/i8042/serio1/input/input3
[ 4.012541] init: failsafe main process (628) killed by TERM signal
[ 4.300987] audit: type=1400 audit(1404499591.120:8): apparmor="STATUS" operation="profile_replace" name="/sbin/dhclient" pid=749 comm="apparmor_parser"
[ 4.300997] audit: type=1400 audit(1404499591.120:9): apparmor="STATUS" operation="profile_replace" name="/usr/lib/NetworkManager/nm-dhcp-client.action" pid=749 comm="apparmor_parser"
[ 4.301003] audit: type=1400 audit(1404499591.120:10): apparmor="STATUS" operation="profile_replace" name="/usr/lib/connman/scripts/dhclient-script" pid=749 comm="apparmor_parser"
[ 4.361772] EXT4-fs error (device vda1): ext4_mb_generate_buddy:756: group 42, 2591 clusters in bitmap, 2592 in gd; block bitmap corrupt.
[ 4.361908] Aborting journal on device vda1-8.
[ 4.362198] EXT4-fs (vda1): Remounting filesystem read-only

(Note that this snippet is complete from the first indication of successful
recovery of the root filesystem on reboot down to the apparent subsequent
discovery of block bitmap corruption and journal abort.)

5) A second reboot attempt forces fsck to run and successfully recover the
root filesystem, whereupon the test system resumes normal operation. (I've
never observed a failed recovery attempt at this step.)

This same sequence occurs in 3.14 (and has for several past releases) with the
exception of the block bitmap corruption indication and automated fsck
recovery on reboot here at the end. That appears to be new in 3.15 (and at
least as early there as -rc3).

It's about an order of magnitude harder for me to reproduce this on my
x86_64 KVM than on my Pandaboard test system. It requires 100-200 generic/068
trials to trigger the failure on x86_64 as opposed to roughly 10 on the
Pandaboard (this is a measure of the relative difficulty in triggering the
generic/068 failure only - the block bitmap corruption then generally but not
always occurs). So far, I've not been able to reproduce the failure on 3.14 on
either test system.

To be clear, it's not that I think the generic/068 failure is directly
related to the block bitmap corruption on the root filesystem - it simply
appears to be a (really awkward) way to trigger that behavior.

Eric







2014-07-04 22:46:49

by Dave Chinner

[permalink] [raw]
Subject: Re: ext4: journal has aborted

On Fri, Jul 04, 2014 at 02:45:39PM -0400, Theodore Ts'o wrote:
> On Fri, Jul 04, 2014 at 03:45:59PM +0200, David Jander wrote:
> > > 1) Some kind of eMMC driver bug, which is possibly causing the CACHE
> > > FLUSH command not to be sent.
> >
> > How can I investigate this? According to the fio tests I ran and the
> > explanation Dmitry gave, I conclude that incorrect sending of CACHE-FLUSH
> > commands is the only thing left to be ruled out on the eMMC driver front,
> > right?
>
> Can you try using an older kernel? The report that I quoted from
> John Stultz (https://lkml.org/lkml/2014/6/12/19) indicated that it was
> a problem that showed up in "recent kernels", and a bisection search
> seemed to point towards an unknown problem in the eMMC driver.
> Quoting from https://lkml.org/lkml/2014/6/12/762:
>
> "However, despite many many reboots the last good commit in my
> branch - bb5cba40dc7f079ea7ee3ae760b7c388b6eb5fc3 (mmc: block:
> Fixup busy detection while...) doesn't ever show the issue. While
> the immediately following commit which bisect found -
> e7f3d22289e4307b3071cc18b1d8ecc6598c0be4 (mmc: mmci: Handle CMD
> irq before DATA irq) always does.
>
> The immensely frustrating part is while backing that single change off
> from its commit sha always makes the issue go away, reverting that
> change from on top of v3.15 doesn't. The issue persists....."
>
> > > 2) Some kind of hardware problem involving flash translation layers
> > > not having durable transactions of their flash metadata across power
> > > failures.
> >
> > That would be like blaming Micron (the eMMC part manufacturer) for faulty
> > firmware... could be, but how can we test this?
>
> The problem is that people who write these programs end up doing
> one-offs, as opposed to something that is well packaged and stands the
> test of time. But basically what we want is a program that writes to
> sequential blocks in a block device with the following information:
>
> *) a timestamp (seconds and microseconds from gettimeofday)
> *) a 64-bit generation number (which is randomly
> generated and the same for each run of the program)
> *) a 32-bit sequence number (starts at zero and
> increments once per block)
> *) a 32-bit "sync" number which is written after each time
> fsync(2) is called while writing to the disk
> *) the sector number where the data was written
> *) a CRC of the above information
> *) some random pattern to fill the rest of the 512 or 4k block,
> depending on the physical sector size

genstream + checkstream.

http://oss.sgi.com/projects/nfs/testtools/

Cheers,

Dave.
--
Dave Chinner
[email protected]

2014-07-05 02:30:40

by Dmitry Monakhov

[permalink] [raw]
Subject: Re: ext4: journal has aborted

On Fri, 4 Jul 2014 14:45:39 -0400, Theodore Ts'o <[email protected]> wrote:
> On Fri, Jul 04, 2014 at 03:45:59PM +0200, David Jander wrote:
> > > 1) Some kind of eMMC driver bug, which is possibly causing the CACHE
> > > FLUSH command not to be sent.
> >
> > How can I investigate this? According to the fio tests I ran and the
> > explanation Dmitry gave, I conclude that incorrect sending of CACHE-FLUSH
> > commands is the only thing left to be ruled out on the eMMC driver front,
> > right?
>
> Can you try using an older kernel? The report that I quoted from
> John Stultz (https://lkml.org/lkml/2014/6/12/19) indicated that it was
> a problem that showed up in "recent kernels", and a bisection search
> seemed to point towards an unknown problem in the eMMC driver.
> Quoting from https://lkml.org/lkml/2014/6/12/762:
>
> "However, despite many many reboots the last good commit in my
> branch - bb5cba40dc7f079ea7ee3ae760b7c388b6eb5fc3 (mmc: block:
> Fixup busy detection while...) doesn't ever show the issue. While
> the immediately following commit which bisect found -
> e7f3d22289e4307b3071cc18b1d8ecc6598c0be4 (mmc: mmci: Handle CMD
> irq before DATA irq) always does.
>
> The immensely frustrating part is while backing that single change off
> from its commit sha always makes the issue go away, reverting that
> change from on top of v3.15 doesn't. The issue persists....."
>
> > > 2) Some kind of hardware problem involving flash translation layers
> > > not having durable transactions of their flash metadata across power
> > > failures.
> >
> > That would be like blaming Micron (the eMMC part manufacturer) for faulty
> > firmware... could be, but how can we test this?
>
> The problem is that people who write these programs end up doing
> one-offs, as opposed to something that is well packaged and stands the
> test of time. But basically what we want is a program that writes to
> sequential blocks in a block device with the following information:
>
> *) a timestamp (seconds and microseconds from gettimeofday)
> *) a 64-bit generation number (which is randomly
> generated and the same for each run of the program)
> *) a 32-bit sequence number (starts at zero and
> increments once per block)
> *) a 32-bit "sync" number which is written after each time
> fsync(2) is called while writing to the disk
> *) the sector number where the data was written
> *) a CRC of the above information
> *) some random pattern to fill the rest of the 512 or 4k block,
> depending on the physical sector size

I've posted a similar utility (hwflush-check) here:
http://oss.sgi.com/archives/xfs/2013-05/msg00443.html
We use it for testing the power-failure tolerance of SSDs.
It has a client and a server:
# On the server side, start listening for incoming connections
# Stage 0:
hwflush-check -l
############
# On the tested node
# Stage 1: Create and allocate all files.
hwflush-check -s $SERVER_IP -d $SCRATCH_MNT --prep
# Stage 2: Start the test and let it run for a while
hwflush-check -s $SERVER_IP -d $SCRATCH_MNT & sleep 20; wall "it is time to cut the power cord"
# Stage 3: On the next boot, check that all data was successfully flushed.
hwflush-check -s $SERVER_IP -d $SCRATCH_MNT --check

It is reasonable to repeat steps 2 and 3 several times.
>
> The program would then write this information to each 512 or 4k block,
> and periodically, say every N blocks (where N should be configurable),
> it would call fsync() which on an open block device, should result in
> a CACHE FLUSH command being sent.
>
> Ideally, the sequence number and sync number would be output via a
> serial console, or maybe over an ssh connection, so you can see what
> was the last sequence number and sync number that was written before
> the disk, before you yank the power cord out of the wall.
>
> Then you have to write another program which scans the disk and makes
> sure that all of the blocks up until the last sync number that was
> output were in fact preserved faithfully on the disk after the power
> outage.
>
> If not, you know you have a problem.
>
> For bonus credit, you can also have a mode where the disk is
> pre-written with some known repeating pattern, such as 0xDEADBEEF, and
> then see if you have some sectors that do not contain either the
> expected 0xDEADBEEF, or the test pattern described above.
>
> That indicates a potential problem where as power drops, sometimes the
> memory can go insane before the hard drive and the DMA engine manages
> to stop. (Since as the saying goes, death is not an event, but rather
> a process, and different components inside the computer can die at
> different times.) SGI discovered this problem very early on, which is
> why SGI machines have extra large capacitors on their power supplies,
> plus a power fail interrupt, so that when they lose power, the system
> can abort DMA transfers before allowing the power to shut down.
> Otherwise, a file system which is doing logical journalling can end up
> with crap getting written on top of some critical file system data
> block, and replaying the journal won't help, since the journal
> just contains things like "update the mtime to be this value".
>
> Since ext3 and ext4 do physical block journalling, we don't suffer
> from this problem since we always write the journal first, and the
> journal is flushed to disk before we update the metadata blocks. On a
> power failure, when we replay the journal, if there is a corrupted fs
> metadata block due to the power failure causing garbage to be written
> to the disk, the damage can get repaired automatically when the
> journal is replayed. And this is a good thing, since PC class
> hardware generally don't have power fail interrupts. OTOH, ext3/4 is
> not as efficient as file systems such as XFS that do logical
> journalling. Life is full of tradeoffs....
>
> > EXT4 was always something like this:
> >
> > # fsck.ext4 /dev/mmcblk1p2
> > e2fsck 1.42.5 (29-Jul-2012)
> > rootfs: recovering journal
> > Setting free inodes count to 37692 (was 37695)
> > Setting free blocks count to 136285 (was 136291)
> > rootfs: clean, 7140/44832 files, 42915/179200 blocks
>
> This is normal, and not an indication of a problem. In ext4, we
> no longer update the free blocks and free inodes count in the superblock,
> except when we unmount the file system. Otherwise on large CPU
> systems, needing to grab the superblock lock to update these values on
> each block/inode allocation and deallocation becomes a massive
> scalability bottleneck.
>
> Instead, at mount time, since we need to read in the block group
> descriptors *anyway*, we just sum the free blocks/inodes in each block
> group descriptor, and update the global counters then. So the fact
> that the global free blocks and free inodes values in the superblock
> are not up to date after a crash is normal, and expected. This is why
> e2fsck didn't ask for permission before fixing this issue, which it
> would have when you run e2fsck without either the -p or -y option.
>
> > CASE 2: fsck on every other power-cycle:
> >
> > Same as CASE 1 steps 1...5 and then:
> > 6.- Turn on power and boot again from dirty internal eMMC without running fsck.
> > 7.- Repeat steps 2...5 one more time
> > 8.- Perform steps 6...8 from CASE 1.
> >
> > With this test, the following difference became apparent:
> >
> > With EXT3: fsck.ext3 did the same as in CASE 1
> >
> > With EXT4: I get a long list of errors that are being fixed.
> > It starts like this:
> >
> > # fsck.ext4 /dev/mmcblk1p2
> > e2fsck 1.42.5 (29-Jul-2012)
> > rootfs: recovering journal
> > rootfs contains a file system with errors, check forced.
> > Pass 1: Checking inodes, blocks, and sizes
> > Inode 4591, i_blocks is 16, should be 8. Fix<y>? yes
> > Inode 4594, i_blocks is 16, should be 8. Fix<y>? yes
> > Inode 4595, i_blocks is 16, should be 8. Fix<y>? yes
> > Inode 4596, i_blocks is 16, should be 8. Fix<y>? yes
> > Inode 4597, i_blocks is 16, should be 8. Fix<y>? yes
> > Inode 4598, i_blocks is 16, should be 8. Fix<y>? yes
> > Inode 4599, i_blocks is 16, should be 8. Fix<y>? yes
> > Inode 4600, i_blocks is 16, should be 8. Fix<y>? yes
> > Inode 4601, i_blocks is 16, should be 8. Fix<y>? yes
> > Inode 4602, i_blocks is 16, should be 8. Fix<y>? yes
> > Inode 4603, i_blocks is 16, should be 8. Fix<y>? yes
>
> *This* is a problem. And if you're seeing it for such a very large
> number of cases, it makes it unlikely to me that it is an ext4/jbd2
> bug. For one thing, if it's so blatant, I should be able to replicate
> it very easily when using my "kill qemu" test.
>
> What this makes me suspect is that you may be getting an older version
> of an inode table block. That's why the above test procedure I
> outlined included the timestamp and the generation number in each
> block. If after a power failure you start seeing an old generation
> number, then that's a hint that the FTL did something funky.
>
> And the advantage of using a test program that writes to the raw
> eMMC flash is that it cuts the file system completely out of the
> picture....
>
> Cheers,
>
> - Ted

2014-07-05 20:36:22

by Theodore Ts'o

[permalink] [raw]
Subject: Re: ext4: journal has aborted

I've been auditing the ext4 code looking for a potential bug in our
journalling code that might result in fs corruption after a
crash/power failure.

I did find one such instance: http://patchwork.ozlabs.org/patch/367346/

However, I'm fairly sure this is not the cause of most of the problems
folks have been complaining about. For one thing, this bug has been
around for a long, long time (since v3.5 or so). For another, hitting
it is incredibly rare.

It would only be noticeable if you crash immediately (within 5 seconds
or so) after allocating the first inode in a block group where both
the inode allocation bitmap and block allocation bitmap are
uninitialized (i.e., no inodes or blocks have ever been allocated from
that block group since the file system was first initialized).

So for people who are indicating that they are seeing problems much
more frequently in 3.15, but not in older kernels, it's almost
certainly something else going on. But if you can reliably reproduce
the problem, you might want to try applying the patch. It certainly
won't hurt, and if it does help, please let me know. :-)

Cheers,

- Ted

2014-07-07 12:16:52

by David Jander

[permalink] [raw]
Subject: Re: ext4: journal has aborted


Dear Ted,

On Fri, 4 Jul 2014 14:45:39 -0400
"Theodore Ts'o" <[email protected]> wrote:

> On Fri, Jul 04, 2014 at 03:45:59PM +0200, David Jander wrote:
> > > 1) Some kind of eMMC driver bug, which is possibly causing the CACHE
> > > FLUSH command not to be sent.
> >
> > How can I investigate this? According to the fio tests I ran and the
> > explanation Dmitry gave, I conclude that incorrect sending of CACHE-FLUSH
> > commands is the only thing left to be ruled out on the eMMC driver front,
> > right?
>
> Can you try using an older kernel? The report that I quoted from
> John Stultz (https://lkml.org/lkml/2014/6/12/19) indicated that it was
> a problem that showed up in "recent kernels", and a bisection search
> seemed to point towards an unknown problem in the eMMC driver.
> Quoting from https://lkml.org/lkml/2014/6/12/762:
>
> "However, despite many many reboots the last good commit in my
> branch - bb5cba40dc7f079ea7ee3ae760b7c388b6eb5fc3 (mmc: block:
> Fixup busy detection while...) doesn't ever show the issue. While
> the immediately following commit which bisect found -
> e7f3d22289e4307b3071cc18b1d8ecc6598c0be4 (mmc: mmci: Handle CMD
> irq before DATA irq) always does.
>
> The immensely frustrating part is while backing that single change off
> from its commit sha always makes the issue go away, reverting that
> change from on top of v3.15 doesn't. The issue persists....."

Next thing I am going to do is check 3.14 and/or 3.12, but first...

> > > 2) Some kind of hardware problem involving flash translation layers
> > > not having durable transactions of their flash metadata across power
> > > failures.
> >
> > That would be like blaming Micron (the eMMC part manufacturer) for faulty
> > firmware... could be, but how can we test this?
>
> The problem is that people who write these programs end up doing
> one-offs, as opposed to something that is well packaged and stands the
> test of time. But basically what we want is a program that writes to
> sequential blocks in a block device with the following information:
>
> *) a timestamp (seconds and microseconds from gettimeofday)
> *) a 64-bit generation number (which is randomly
> generated and the same for each run of the program)
> *) a 32-bit sequence number (starts at zero and
> increments once per block)
> *) a 32-bit "sync" number which is written after each time
> fsync(2) is called while writing to the disk
> *) the sector number where the data was written
> *) a CRC of the above information
> *) some random pattern to fill the rest of the 512 or 4k block,
> depending on the physical sector size
>
> The program would then write this information to each 512 or 4k block,
> and periodically, say every N blocks (where N should be configurable),
> it would call fsync() which on an open block device, should result in
> a CACHE FLUSH command being sent.
>
> Ideally, the sequence number and sync number would be output via a
> serial console, or maybe over an ssh connection, so you can see what
> was the last sequence number and sync number that was written before
> the disk, before you yank the power cord out of the wall.
>
> Then you have to write another program which scans the disk and makes
> sure that all of the blocks up until the last sync number that was
> output were in fact preserved faithfully on the disk after the power
> outage.
>
> If not, you know you have a problem.
>
> For bonus credit, you can also have a mode where the disk is
> pre-written with some known repeating pattern, such as 0xDEADBEEF, and
> then see if you have some sectors that do not contain either the
> expected 0xDEADBEEF, or the test pattern described above.

Ok, I wrote a simple python test script (see the end of this e-mail) that
implements your idea. The tool first checks the device to see how far it got
last time, analyzing and dumping a corrupt block if one is found; it then
rewrites the whole device with the background pattern (0xdeadbeef), syncs, and
starts filling sectors from the beginning with records consisting of a
sequence number (= sector index), a 64-bit pseudo-random number, a timestamp,
a CRC, and fill data (0x55). The sync interval is an input parameter and is
the interval (in sectors) between calls to fsync() (yes, I took care to flush
the python file objects first!). The sector number after each sync is written
to stdout, which is logged on a serial port.
The next time the program is run (after a power-cycle), data must be correctly
written to at least that sector, and beyond it only sectors containing 100%
background fill may exist. This is checked for.
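
For orientation before the full listing, here is a minimal sketch of the write
side of such a tool (this is not filldev.py itself - the field layout, sync
interval, and names below are simplified assumptions):

#!/usr/bin/env python
# Minimal sketch: write CRC-protected, sequence-numbered 512-byte records to a
# raw device, fsync'ing every SYNC_INTERVAL sectors and reporting the last
# synced sector so it can be compared against what survives a power cut.
import os, sys
from time import time
from zlib import crc32

SECTOR = 512
SYNC_INTERVAL = 128      # assumption: fsync every 128 sectors

def make_record(seq, generation):
    # sequence number, generation number, timestamp, CRC, then 0x55 fill
    hdr = "%010d %016x %17.6f" % (seq, generation, time())
    hdr += " %08x\n" % (crc32(hdr) & 0xffffffff)
    return hdr + "\x55" * (SECTOR - len(hdr))

def fill(dev, nsectors, generation):
    f = open(dev, 'r+b')
    for seq in xrange(nsectors):
        f.write(make_record(seq, generation))
        if seq % SYNC_INTERVAL == 0:
            f.flush()
            os.fsync(f.fileno())   # on a raw block device this should issue a CACHE FLUSH
            print "synced up to sector", seq
    f.close()

if __name__ == "__main__":
    # e.g.: python sketch.py /dev/mmcblk1p3 524288 1234567890abcdef
    fill(sys.argv[1], int(sys.argv[2]), int(sys.argv[3], 16))

The check pass is the mirror image: re-read every sector, verify the CRC,
sequence number and generation, and require that all records up to the last
sector reported as synced before the power cut are intact, with only
background fill beyond the point where writing stopped.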

I have been running this script repeatedly with different sync-interval values
on a 256 MiB partition of the eMMC device. Until now, I have not been able to
produce a sector with corrupt data, nor evidence of CACHE_FLUSH not being
processed correctly.

The tool itself is quite hacky (I wrote it this morning), but AFAICS it should
work correctly for our purposes.

> That indicates a potential problem where as power drops, sometimes the
> memory can go insane before the hard drive and the DMA engine manages
> to stop. (Since as the saying goes, death is not an event, but rather
> a process, and different components inside the computer can die at
> different times.) SGI discovered this problem very early on, which is
> why SGI machines have extra large capacitors on their power supplies,

Hahaha, yes, I think I heard that story before, but I honestly always thought
it was a joke or a myth, since it is pretty stupid. As for our hardware, it
does have proper brown-out detection and asserts "RESET" pins on all
peripherals that have one as soon as any power rail comes out of range. This
would have been the correct way of fixing SGI's "problem" IMHO.

> plus a power fail interrupt, so that when they lose power, the system
> can abort DMA transfers before allowing the power to shut down.

A power-fail interrupt would be helpful to make sure there is no ongoing write
or erase activity in the FLASH chips when power finally drops, but then the
eMMC device itself should have support for it. Instead of that, they do
actually guarantee that power interruptions are harmless through the
"reliable-writes" feature in the FTL firmware. The question is if we can
indeed trust that firmware.

> Otherwise, a file system which is doing logical journalling can end up
> with crap getting written on top of some critical file system data
> block, and replaying the journal won't help, since the journal
> just contains things like "update the mtime to be this value".

I am not familiar with filesystems that use this type of journalling, and I
wonder what the use of it would be anyway, but I am not an expert in
filesystems....

> Since ext3 and ext4 do physical block journalling, we don't suffer
> from this problem since we always write the journal first, and the
> journal is flushed to disk before we update the metadata blocks.

That I understand. That's why a working CACHE_FLUSH is so important, right?

> On a
> power failure, when we replay the journal, if there is a corrupted fs
> metadata block due to the power failure causing garbage to be written
> to the disk, the damage can get repaired automatically when the
> journal is replayed. And this is a good thing, since PC class
> hardware generally don't have power fail interrupts. OTOH, ext3/4 is
> not as efficient as file systems such as XFS that do logical
> journalling. Life is full of tradeoffs....

I never dared to use anything other than trusty ol' ext3/4.... or JFFS2
and UBIFS when dealing with bare flash devices, of course.

> > EXT4 was always something like this:
> >
> > # fsck.ext4 /dev/mmcblk1p2
> > e2fsck 1.42.5 (29-Jul-2012)
> > rootfs: recovering journal
> > Setting free inodes count to 37692 (was 37695)
> > Setting free blocks count to 136285 (was 136291)
> > rootfs: clean, 7140/44832 files, 42915/179200 blocks
>
> This is normal, and not an indication of a problem. In ext4, we
> no longer update the free blocks and free inodes count in the superblock,
> except when we unmount the file system. Otherwise on large CPU
> systems, needing to grab the superblock lock to update these values on
> each block/inode allocation and deallocation becomes a massive
> scalability bottleneck.
>
> Instead, at mount time, since we need to read in the block group
> descriptors *anyway*, we just sum the free blocks/inodes in each block
> group descriptor, and update the global counters then. So the fact
> that the global free blocks and free inodes values in the superblock
> are not up to date after a crash is normal, and expected. This is why
> e2fsck didn't ask for permission before fixing this issue, which it
> would have when you run e2fsck without either the -p or -y option.

Ok, thanks a lot for this detailed explanation!

> > CASE 2: fsck on every other power-cycle:
> >
> > Same as CASE 1 steps 1...5 and then:
> > 6.- Turn on power and boot again from dirty internal eMMC without running fsck.
> > 7.- Repeat steps 2...5 one more time
> > 8.- Perform steps 6...8 from CASE 1.
> >
> > With this test, the following difference became apparent:
> >
> > With EXT3: fsck.ext3 did the same as in CASE 1
> >
> > With EXT4: I get a long list of errors that are being fixed.
> > It starts like this:
> >
> > # fsck.ext4 /dev/mmcblk1p2
> > e2fsck 1.42.5 (29-Jul-2012)
> > rootfs: recovering journal
> > rootfs contains a file system with errors, check forced.
> > Pass 1: Checking inodes, blocks, and sizes
> > Inode 4591, i_blocks is 16, should be 8. Fix<y>? yes
> > Inode 4594, i_blocks is 16, should be 8. Fix<y>? yes
> > Inode 4595, i_blocks is 16, should be 8. Fix<y>? yes
> > Inode 4596, i_blocks is 16, should be 8. Fix<y>? yes
> > Inode 4597, i_blocks is 16, should be 8. Fix<y>? yes
> > Inode 4598, i_blocks is 16, should be 8. Fix<y>? yes
> > Inode 4599, i_blocks is 16, should be 8. Fix<y>? yes
> > Inode 4600, i_blocks is 16, should be 8. Fix<y>? yes
> > Inode 4601, i_blocks is 16, should be 8. Fix<y>? yes
> > Inode 4602, i_blocks is 16, should be 8. Fix<y>? yes
> > Inode 4603, i_blocks is 16, should be 8. Fix<y>? yes
>
> *This* is a problem. And if you're seeing it for such a very large
> number of cases, it makes it unlikely to me that it is an ext4/jbd2
> bug. For one thing, if it's so blatant, I should be able to replicate
> it very easily when using my "kill qemu" test.
>
> What this makes me suspect is that you may be getting an older version
> of an inode table block. That's why the above test procedure I
> outlined included the timestamp and the generation number in each
> block. If after a power failure you start seeing an old generation
> number, then that's a hint that the FTL did something funky.

So far I have failed miserably at trying to prove the FTL wrong (or funky
at least ;-).

> And the advantage of using a test program that writes to the raw
> eMMC flash is that it cuts the file system completely out of the
> picture....

Any suggestion on how to improve my eMMC device test script (see below)?

Best regards,

--
David Jander
Protonic Holland.

filldev.py:
#!/usr/bin/env python
# vim: tabstop=4

import os, sys, stat
from time import time
from zlib import crc32
import subprocess

class Perc(object):
    def __init__(self, size):
        self.size = size
        self.last = -1

    def update(self, n):
        perc = int((n * 1000) / self.size)
        if perc != self.last:
            sys.stdout.write("\rProgress : %4.1f %% " % (perc/10.0))
            sys.stdout.flush()
            self.last = perc

def devstat(dev):
    # Return size of dev, if it's a file or a device
    size = os.stat(dev).st_size
    if not size:
        # It's probably a device...
        txt, _ = subprocess.Popen("blockdev --getsize " + dev,
                shell=True, stdout=subprocess.PIPE).communicate()
        size = int(txt) * 512
    return size

def initdev(dev):
    size = devstat(dev)
    f = open(dev, 'r+b')
    data = "\xde\xad\xbe\xef" * (512 / 4)
    blocks = size / len(data)
    p = Perc(blocks)
    print "Filling whole device with background pattern:"
    for i in xrange(blocks):
        f.write(data)
        p.update(i)
    p.update(i+1)
    f.close()
    os.system("sync")
    print "\nDone."

class PatternGenerator(object):
    def __init__(self):
        self.index = 0
        self.rand = 18274091872761

    def nextrand(self):
        ret = self.rand * 128273237635 + 45845
        self.rand = ret % (2**32)
        return (ret >> 16) & 0xffff

    def nextrand64(self):
        ret = self.nextrand() | (self.nextrand() << 16)
        return ret | (self.nextrand() << 32) | (self.nextrand() << 48)

    def timestamp(self):
        return "%17.6f" % time()

    def make_record(self):
        ret = "%010d %17.6f %16x" % (self.index, time(), self.nextrand64())
        ret += " %08x \n\0" % (2**31 + crc32(ret))
        self.index += 1
        fill = "\x55" * (512 - len(ret))
        return ret + fill

    def check_record_crc(self, rec):
        try:
            data, crc, rest = rec.rsplit(" ", 2)
        except ValueError:
            return False
        ncrc = (2**31 + crc32(data))
        crc = int(crc, 16)
        return crc == ncrc

    def sync(self):
        self.f.flush()
        os.fsync(self.f)
        os.system("sync")

class PatternWriter(PatternGenerator):
    def __init__(self, dev, syncinterval):
        PatternGenerator.__init__(self)
        self.size = devstat(dev)
        self.blocks = self.size / 512
        self.f = open(dev, 'r+b')
        self.perc = Perc(self.blocks)
        self.syncinterval = syncinterval

    def write_one(self):
        self.f.write(self.make_record())
        if (self.index % self.syncinterval) == 0:
            self.sync()
            print "Synced at block", self.index
        self.perc.update(self.index)

    def run(self):
        print "Writing data to device..."
        while self.blocks > self.index:
            self.write_one()
        print "\nDone."

class PatternChecker(PatternGenerator):
    def __init__(self, dev):
        PatternGenerator.__init__(self)
        self.size = devstat(dev)
        self.blocks = self.size / 512
        self.f = open(dev, 'rb')
        self.perc = Perc(self.blocks)
        self.ti = -1

    def read_one(self):
        return self.f.read(512)

    def analyze_block(self, blk):
        data = "\xde\xad\xbe\xef" * (512 / 4)
        self.index += 1
        if blk == data:
            return 1
        print "Possibly corrupt block detected!"
        print "Block data:", repr(blk)
        return 2

    def check_block(self, blk):
        if not self.check_record_crc(blk):
            print "\nCRC error on block", self.index
            ret = self.analyze_block(blk)
            if ret == 1:
                print "Block contents are 100% fill pattern."
                print "Checking rest of device for fill pattern..."
            return ret
        idx, ti, rnd, crc, rest = blk.split()
        idx = int(idx)
        ti = float(ti)
        rnd = int(rnd, 16)
        if idx != self.index:
            print "\nIndex number error on block", self.index
            print "Block data:", repr(blk)
            return 3
        if ti < self.ti:
            print "\nTimestamp error on block", self.index
            print "Block data:", repr(blk)
            return 4
        self.ti = ti
        if rnd != self.nextrand64():
            # FIXME: This can never be true, so remove this check?
            print "\nRandom number incorrect on block", self.index
            print "Block data:", repr(blk)
            return 5
        self.index += 1
        return 0

    def run(self):
        print "Checking device contents..."
        checkfill = False
        while self.blocks > self.index:
            blk = self.read_one()
            if checkfill:
                ret = self.analyze_block(blk)
            else:
                ret = self.check_block(blk)
            if ret == 1:
                checkfill = True
            elif ret:
                sys.exit(ret)
            self.perc.update(self.index)
        print "\nDone."
        return 0

if __name__ == "__main__":
    if len(sys.argv) < 3:
        print "Syntax: filldev <device> <sync-interval>"
        sys.exit(2)
    dev = sys.argv[1]
    sint = int(sys.argv[2], 0)
    c = PatternChecker(dev)
    c.run()
    initdev(dev)
    w = PatternWriter(dev, sint)
    w.run()





2014-07-07 15:53:16

by Theodore Ts'o

[permalink] [raw]
Subject: Re: ext4: journal has aborted

An update from today's ext4 concall. Eric Whitney can fairly reliably
reproduce this on his Panda board with 3.15, and definitely not on
3.14. So at this point there seems to be at least some kind of 3.15
regression going on here, regardless of whether it's in the eMMC
driver or the ext4 code. (It also means that the bug fix I found is
irrelevant for the purposes of working this issue, since that one is much
harder to hit, and that bug has been around since long before 3.14.)

The problem in terms of narrowing it down any further is that the
Pandaboard is running into RCU bugs which makes it hard to test the
early 3.15-rcX kernels. There is some indication that the bug showed
up in the ext4 patches which Linus pulled at the beginning of
3.15-rc3. However, due to the ARM (or at least Pandaboard) RCU bugs,
it's not possible to bisect test this on the Pandaboard.

And on the x86_64, it takes most of a day to confirm the absence of a
test failure. (Although this is with a HDD, so assuming that we don't
have an eMMC as well as an ext4 regression in 3.15, it seems likely
that the problem is with some kind of ext4 regression sometime between
3.14 and 3.15.)

So we are making progress, but it's slow. Hopefully we'll know more in
the near future.

Thanks to everyone who has been working on this bug!

Cheers,

- Ted

2014-07-07 22:31:41

by Darrick J. Wong

[permalink] [raw]
Subject: Re: ext4: journal has aborted

Hi all,

Whatever this bug is, I hit it on a spinny disk with 3.15.0.

Disk: APPLE HDD HTS541010A9E662 (Hitachi SATA disk in a Mac Mini)
Kernel: 3.15.0 (fairly standard build; didn't have problems w/ 3.14)
OS: Ubuntu 14.04 + e2fsprogs 1.42.10 (May 2014)

That machine hadn't been rebooted for a couple of weeks when I rebooted it to
put 3.15.4 on it (the ptrace bug fix). When 3.15.4 came up, I saw this in
dmesg:

[ 31.717188] EXT4-fs (dm-3): recovery complete
[ 31.721056] EXT4-fs (dm-3): mounted filesystem with ordered data mode. Opts: discard
<snip>
[ 213.150808] EXT4-fs error (device dm-3): ext4_mb_generate_buddy:756: group 4803, 12227 clusters in bitmap, 12217 in gd; block bitmap corrupt.
[ 213.150822] JBD2: Spotted dirty metadata buffer (dev = dm-3, blocknr = 0). There's a risk of filesystem corruption in case of system crash.

After which I umounted the FS and ran e2fsck -n:

# e2fsck -fn /dev/dm-3 -C0
e2fsck 1.42.10 (18-May-2014)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Block bitmap differences: -(157397995--157398003) -(157398006--157398014)
-(157403147--157403184) -(157403237--157403242) -(157403258--157403273)
-(157403280--157403505) -(157403508--157403647)
Fix? no

dumpe2fs shows this for BG 4803:

Group 4803: (Blocks 157384704-157417471) [INODE_UNINIT, ITABLE_ZEROED]
Checksum 0xfbd1, unused inodes 8192
Block bitmap at 157286403 (bg #4800 + 3), Inode bitmap at 157286419 (bg #4800 + 19)
Inode table at 157287968-157288479 (bg #4800 + 1568)
12233 free blocks, 8192 free inodes, 0 directories, 8192 unused inodes
Free blocks: 157384770, 157384840-157384888, 157385059-157385087,
157385213-157385215, 157385283-157385285, 157385408, 157385469-157385471,
157385779-157385780, 157385846, 157385916-157385918, 157386119-157386175,
157386235-157386239, 157386563-157386571, 157386641-157386687,
157387037-157387062, 157387331, 157387531-157387555, 157387624-157387646,
157387725-157387775, 157388100-157388130, 157388523-157388543,
157388942-157389055, 157389258-157389300, 157389379-157389428,
157389795-157390137, 157390294-157390591, 157390798-157390847,
157391360-157391524, 157391594-157391871, 157391950-157392009,
157392073-157392309, 157392455-157392536, 157392604-157393151,
157393423-157394310, 157394377-157394431, 157395200-157395261,
157395320-157395342, 157395408-157395455, 157395968-157396438,
157396440-157396735, 157396799-157397005, 157397007-157397994,
157398004-157398005, 157398015, 157398272-157398527, 157399040-157399104,
157399106-157399757, 157399759-157400094, 157400177-157400190,
157400210-157400223, 157400404-157400419, 157400459-157400494,
157400571-157400831, 157401600-157401696, 157401698-157402124,
157402126-157402833, 157402835-157402836, 157402839-157403146,
157403185-157403214, 157403226-157403236, 157403243-157403257,
157403274-157403279, 157403506-157403507, 157403648-157403865,
157403867-157404241, 157404250-157404671, 157405440-157405695,
157406464-157406564, 157406710-157406764, 157406768-157407086,
157407088-157407231, 157407271-157407487, 157407744-157407975,
157408034-157408149, 157408672-157408676, 157408736-157408766,
157409000-157409023, 157409280-157409785, 157412864-157413119
Free inodes: 39346177-39354368

I then had the funny idea to copy all of those blocks out of the filesystem to
see what had previously been stored there. Of the blocks that weren't
unintelligible binary crud, the recognizable ones (according to file(1))
mostly appear to be HTML, CSS, and JPEG files from Firefox's browser cache.
There's nothing obviously wrong with Fx's cache directory, though.
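
A minimal sketch of one way such block ranges can be pulled off the device for
inspection with file(1) (the device path, 4k block size, and example ranges
are assumptions based loosely on the dumpe2fs/e2fsck output above, not
necessarily how it was actually done):

#!/usr/bin/env python
# Sketch: copy ext4 block ranges (e.g. from a dumpe2fs "Free blocks:" list)
# out of a block device so the old contents can be examined with file(1).
DEVICE = "/dev/dm-3"              # assumption: the device from the report above
BLOCK_SIZE = 4096                 # assumption: 4k filesystem blocks
RANGES = [(157397995, 157398003), # example ranges from the e2fsck output above
          (157398006, 157398014)]

dev = open(DEVICE, "rb")
for start, end in RANGES:
    out = open("blocks-%d-%d.bin" % (start, end), "wb")
    dev.seek(start * BLOCK_SIZE)
    # the listed ranges are inclusive, hence the +1
    out.write(dev.read((end - start + 1) * BLOCK_SIZE))
    out.close()
dev.close()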

Anyway, I'll keep digging.

--D

On Mon, Jul 07, 2014 at 11:53:10AM -0400, Theodore Ts'o wrote:
> An update from today's ext4 concall. Eric Whitney can fairly reliably
> reproduce this on his Panda board with 3.15, and definitely not on
> 3.14. So at this point there seems to be at least some kind of 3.15
> regression going on here, regardless of whether it's in the eMMC
> driver or the ext4 code. (It also means that the bug fix I found is
> irrelevant for the purposes of working this issue, since that's a much
> harder to hit, and that bug has been around long before 3.14.)
>
> The problem in terms of narrowing it down any further is that the
> Pandaboard is running into RCU bugs which makes it hard to test the
> early 3.15-rcX kernels. There is some indication that the bug showed
> up in the ext4 patches which Linus pulled at the beginning of
> 3.15-rc3. However, due to the ARM (or at least Pandaboard) RCU bugs,
> it's not possible to bisect test this on the Pandaboard.
>
> And on the x86_64, it takes most of a day to confirm the absence of a
> test failure. (Although this is with a HDD, so assuming that we don't
> have an eMMC as well as an ext4 regression in 3.15, it seems likely
> that the problem is with some kind of ext4 regression sometime between
> 3.14 and 3.15.
>
> So we are making progress, but it's slow. Hopefuly we'll know more in
> the near future.
>
> Thanks to everyone who has been working on this bug!
>
> Cheers,
>
> - Ted

2014-07-07 22:56:25

by Theodore Ts'o

[permalink] [raw]
Subject: Re: ext4: journal has aborted

On Mon, Jul 07, 2014 at 11:53:10AM -0400, Theodore Ts'o wrote:
> An update from today's ext4 concall. Eric Whitney can fairly reliably
> reproduce this on his Panda board with 3.15, and definitely not on
> 3.14. So at this point there seems to be at least some kind of 3.15
> regression going on here, regardless of whether it's in the eMMC
> driver or the ext4 code. (It also means that the bug fix I found is
> irrelevant for the purposes of working this issue, since that's a much
> harder to hit, and that bug has been around long before 3.14.)
>
> The problem in terms of narrowing it down any further is that the
> Pandaboard is running into RCU bugs which makes it hard to test the
> early 3.15-rcX kernels.....

In the hopes of making it easy to bisect, I've created a kernel branch
which starts with 3.14, and then adds on all of the ext4-related
commits since then. You can find it at:

git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git test-mb_generate_buddy-failure

https://git.kernel.org/cgit/linux/kernel/git/tytso/ext4.git/log/?h=test-mb_generate_buddy-failure

Eric, can you see if you can repro the failure on your Panda Board?
If you can, try doing a bisection search on this series:

git bisect start
git bisect good v3.14
git bisect bad test-mb_generate_buddy-failure

Hopefully if it is caused by one of the commits in this series, we'll
be able to pinpoint it this way.

Thanks!!

- Ted

2014-07-10 18:57:53

by Eric Whitney

[permalink] [raw]
Subject: Re: ext4: journal has aborted

* Theodore Ts'o <[email protected]>:
> On Mon, Jul 07, 2014 at 11:53:10AM -0400, Theodore Ts'o wrote:
> > An update from today's ext4 concall. Eric Whitney can fairly reliably
> > reproduce this on his Panda board with 3.15, and definitely not on
> > 3.14. So at this point there seems to be at least some kind of 3.15
> > regression going on here, regardless of whether it's in the eMMC
> > driver or the ext4 code. (It also means that the bug fix I found is
> > irrelevant for the purposes of working this issue, since that's a much
> > harder to hit, and that bug has been around long before 3.14.)
> >
> > The problem in terms of narrowing it down any further is that the
> > Pandaboard is running into RCU bugs which makes it hard to test the
> > early 3.15-rcX kernels.....
>
> In the hopes of making it easy to bisect, I've created a kernel branch
> which starts with 3.14, and then adds on all of the ext4-related
> commits since then. You can find it at:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git test-mb_generate_buddy-failure
>
> https://git.kernel.org/cgit/linux/kernel/git/tytso/ext4.git/log/?h=test-mb_generate_buddy-failure
>
> Eric, can you see if you can repro the failure on your Panda Board?
> If you can, try doing a bisection search on these series:
>
> git bisect start
> git bisect good v3.14
> git bisect bad test-mb_generate_buddy-failure
>
> Hopefully if it is caused by one of the commits in this series, we'll
> be able to pin point it this way.

First, the good news (with luck):

My testing currently suggests that the patch causing this regression was
pulled into 3.15-rc3 -

007649375f6af242d5b1df2c15996949714303ba
ext4: initialize multi-block allocator before checking block descriptors

Bisection by selectively reverting ext4 commits in -rc3 identified this patch
while running on the Pandaboard. I'm still using generic/068 as my reproducer.
It occasionally yields a false negative, but it has passed 10 consecutive
trials on my revert/bisect kernel derived from 3.15-rc3. Given the frequency
of false negatives I've seen, I'm reasonably confident in that result. I'm
going to run another series with just that patch reverted on 3.16-rc3.

Looking at the patch, the call to ext4_mb_init() was hoisted above the code
performing journal recovery in ext4_fill_super(). The regression occurs only
after journal recovery on the root filesystem.

Secondly:

Thanks for that git tree! However, I discovered that the same "RCU bug" I
thought I was seeing on the Panda was also visible on the x86_64 KVM, and
it was actually just RCU noticing stalls. These also occurred when using
your git tree as well as on mainline 3.15-rc1 and 3.15-rc2 and during
bisection attempts on 3.15-rc3 within the ext4 patches, and had the effect of
masking the regression on the root filesystem. The test system would lock up
completely - no console response - and made it impossible to force the reboot
which was required to set up the failure. Hence the reversion approach, since
RCU does not report stalls in 3.15-rc3 (final).

Eric



>
> Thanks!!
>
> - Ted

2014-07-10 20:01:45

by Darrick J. Wong

[permalink] [raw]
Subject: Re: ext4: journal has aborted

On Thu, Jul 10, 2014 at 02:57:48PM -0400, Eric Whitney wrote:
> * Theodore Ts'o <[email protected]>:
> > On Mon, Jul 07, 2014 at 11:53:10AM -0400, Theodore Ts'o wrote:
> > > An update from today's ext4 concall. Eric Whitney can fairly reliably
> > > reproduce this on his Panda board with 3.15, and definitely not on
> > > 3.14. So at this point there seems to be at least some kind of 3.15
> > > regression going on here, regardless of whether it's in the eMMC
> > > driver or the ext4 code. (It also means that the bug fix I found is
> > > irrelevant for the purposes of working this issue, since that's a much
> > > harder to hit, and that bug has been around long before 3.14.)
> > >
> > > The problem in terms of narrowing it down any further is that the
> > > Pandaboard is running into RCU bugs which makes it hard to test the
> > > early 3.15-rcX kernels.....
> >
> > In the hopes of making it easy to bisect, I've created a kernel branch
> > which starts with 3.14, and then adds on all of the ext4-related
> > commits since then. You can find it at:
> >
> > git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git test-mb_generate_buddy-failure
> >
> > https://git.kernel.org/cgit/linux/kernel/git/tytso/ext4.git/log/?h=test-mb_generate_buddy-failure
> >
> > Eric, can you see if you can repro the failure on your Panda Board?
> > If you can, try doing a bisection search on these series:
> >
> > git bisect start
> > git bisect good v3.14
> > git bisect bad test-mb_generate_buddy-failure
> >
> > Hopefully if it is caused by one of the commits in this series, we'll
> > be able to pin point it this way.
>
> First, the good news (with luck):
>
> My testing currently suggests that the patch causing this regression was
> pulled into 3.15-rc3 -
>
> 007649375f6af242d5b1df2c15996949714303ba
> ext4: initialize multi-block allocator before checking block descriptors
>
> Bisection by selectively reverting ext4 commits in -rc3 identified this patch
> while running on the Pandaboard. I'm still using generic/068 as my reproducer.
> It occasionally yields a false negative, but it has passed 10 consecutive
> trials on my revert/bisect kernel derived from 3.15-rc3. Given the frequency
> of false negatives I've seen, I'm reasonably confident in that result. I'm
> going to run another series with just that patch reverted on 3.16-rc3.
>
> Looking at the patch, the call to ext4_mb_init() was hoisted above the code
> performing journal recovery in ext4_fill_super(). The regression occurs only
> after journal recovery on the root filesystem.

Thanks for finding the culprit! :)

Can you apply this patch, build with CONFIG_EXT4FS_DEBUG=y, and see if an
FS will mount without crashing? This was the cruddy patch I sent in (and later
killed) that fixed the crash on mount with EXT4FS_DEBUG in a somewhat silly
way. Maybe it's appropriate now.
http://www.spinics.net/lists/linux-ext4/msg43287.html

--D

>
> Secondly:
>
> Thanks for that git tree! However, I discovered that the same "RCU bug" I
> thought I was seeing on the Panda was also visible on the x86_64 KVM, and
> it was actually just RCU noticing stalls. These also occurred when using
> your git tree as well as on mainline 3.15-rc1 and 3.15-rc2 and during
> bisection attempts on 3.15-rc3 within the ext4 patches, and had the effect of
> masking the regression on the root filesystem. The test system would lock up
> completely - no console response - and made it impossible to force the reboot
> which was required to set up the failure. Hence the reversion approach, since
> RCU does not report stalls in 3.15-rc3 (final).
>
> Eric
>
>
>
> >
> > Thanks!!
> >
> > - Ted

2014-07-10 21:31:54

by Matteo Croce

[permalink] [raw]
Subject: Re: ext4: journal has aborted

Will do, thanks!

2014-07-10 22:01 GMT+02:00 Darrick J. Wong <[email protected]>:
> On Thu, Jul 10, 2014 at 02:57:48PM -0400, Eric Whitney wrote:
>> * Theodore Ts'o <[email protected]>:
>> > On Mon, Jul 07, 2014 at 11:53:10AM -0400, Theodore Ts'o wrote:
>> > > An update from today's ext4 concall. Eric Whitney can fairly reliably
>> > > reproduce this on his Panda board with 3.15, and definitely not on
>> > > 3.14. So at this point there seems to be at least some kind of 3.15
>> > > regression going on here, regardless of whether it's in the eMMC
>> > > driver or the ext4 code. (It also means that the bug fix I found is
>> > > irrelevant for the purposes of working this issue, since that's a much
>> > > harder to hit, and that bug has been around long before 3.14.)
>> > >
>> > > The problem in terms of narrowing it down any further is that the
>> > > Pandaboard is running into RCU bugs which makes it hard to test the
>> > > early 3.15-rcX kernels.....
>> >
>> > In the hopes of making it easy to bisect, I've created a kernel branch
>> > which starts with 3.14, and then adds on all of the ext4-related
>> > commits since then. You can find it at:
>> >
>> > git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git test-mb_generate_buddy-failure
>> >
>> > https://git.kernel.org/cgit/linux/kernel/git/tytso/ext4.git/log/?h=test-mb_generate_buddy-failure
>> >
>> > Eric, can you see if you can repro the failure on your Panda Board?
>> > If you can, try doing a bisection search on these series:
>> >
>> > git bisect start
>> > git bisect good v3.14
>> > git bisect bad test-mb_generate_buddy-failure
>> >
>> > Hopefully if it is caused by one of the commits in this series, we'll
>> > be able to pin point it this way.
>>
>> First, the good news (with luck):
>>
>> My testing currently suggests that the patch causing this regression was
>> pulled into 3.15-rc3 -
>>
>> 007649375f6af242d5b1df2c15996949714303ba
>> ext4: initialize multi-block allocator before checking block descriptors
>>
>> Bisection by selectively reverting ext4 commits in -rc3 identified this patch
>> while running on the Pandaboard. I'm still using generic/068 as my reproducer.
>> It occasionally yields a false negative, but it has passed 10 consecutive
>> trials on my revert/bisect kernel derived from 3.15-rc3. Given the frequency
>> of false negatives I've seen, I'm reasonably confident in that result. I'm
>> going to run another series with just that patch reverted on 3.16-rc3.
>>
>> Looking at the patch, the call to ext4_mb_init() was hoisted above the code
>> performing journal recovery in ext4_fill_super(). The regression occurs only
>> after journal recovery on the root filesystem.
>
> Thanks for finding the culprit! :)
>
> Can you apply this patch, build with CONFIG_EXT4FS_DEBUG=y, and see if an
> FS will mount without crashing? This was the cruddy patch I sent in (and later
> killed) that fixed the crash on mount with EXT4FS_DEBUG in a somewhat silly
> way. Maybe it's appropriate now.
> http://www.spinics.net/lists/linux-ext4/msg43287.html
>
> --D
>
>>
>> Secondly:
>>
>> Thanks for that git tree! However, I discovered that the same "RCU bug" I
>> thought I was seeing on the Panda was also visible on the x86_64 KVM, and
>> it was actually just RCU noticing stalls. These also occurred when using
>> your git tree as well as on mainline 3.15-rc1 and 3.15-rc2 and during
>> bisection attempts on 3.15-rc3 within the ext4 patches, and had the effect of
>> masking the regression on the root filesystem. The test system would lock up
>> completely - no console response - and made it impossible to force the reboot
>> which was required to set up the failure. Hence the reversion approach, since
>> RCU does not report stalls in 3.15-rc3 (final).
>>
>> Eric
>>
>>
>>
>> >
>> > Thanks!!
>> >
>> > - Ted



--
Matteo Croce
OpenWrt Developer

2014-07-10 22:32:52

by Theodore Ts'o

[permalink] [raw]
Subject: Re: ext4: journal has aborted

To be clear, what you would need to do is to revert commit
007649375f6af242d5b1df2c15996949714303ba to prevent the fs corruption.
Darrick's patch is one that tries to fix the problem addressed by that
commit in a different fashion.

Quite frankly, reverting the commit, which is causing real damage, is
far more important to me right now than what to do in order to allow
CONFIG_EXT4FS_DEBUG to work (which is nice, but it's only something
that file system developers need, and to be honest I can't remember
the last time I've used said config option). But if we know that
Darrick's fix works, I'm willing to push that to Linus at the same
time that I push a revert of 007649375f6af242d5b1df2c15996949714303ba.

Cheers,

- Ted

On Thu, Jul 10, 2014 at 11:31:14PM +0200, Matteo Croce wrote:
> Will do, thanks!
>
> 2014-07-10 22:01 GMT+02:00 Darrick J. Wong <[email protected]>:
> > On Thu, Jul 10, 2014 at 02:57:48PM -0400, Eric Whitney wrote:
> >> * Theodore Ts'o <[email protected]>:
> >> > On Mon, Jul 07, 2014 at 11:53:10AM -0400, Theodore Ts'o wrote:
> >> > > An update from today's ext4 concall. Eric Whitney can fairly reliably
> >> > > reproduce this on his Panda board with 3.15, and definitely not on
> >> > > 3.14. So at this point there seems to be at least some kind of 3.15
> >> > > regression going on here, regardless of whether it's in the eMMC
> >> > > driver or the ext4 code. (It also means that the bug fix I found is
> >> > > irrelevant for the purposes of working this issue, since that's a much
> >> > > harder to hit, and that bug has been around long before 3.14.)
> >> > >
> >> > > The problem in terms of narrowing it down any further is that the
> >> > > Pandaboard is running into RCU bugs which makes it hard to test the
> >> > > early 3.15-rcX kernels.....
> >> >
> >> > In the hopes of making it easy to bisect, I've created a kernel branch
> >> > which starts with 3.14, and then adds on all of the ext4-related
> >> > commits since then. You can find it at:
> >> >
> >> > git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git test-mb_generate_buddy-failure
> >> >
> >> > https://git.kernel.org/cgit/linux/kernel/git/tytso/ext4.git/log/?h=test-mb_generate_buddy-failure
> >> >
> >> > Eric, can you see if you can repro the failure on your Panda Board?
> >> > If you can, try doing a bisection search on these series:
> >> >
> >> > git bisect start
> >> > git bisect good v3.14
> >> > git bisect bad test-mb_generate_buddy-failure
> >> >
> >> > Hopefully if it is caused by one of the commits in this series, we'll
> >> > be able to pin point it this way.
> >>
> >> First, the good news (with luck):
> >>
> >> My testing currently suggests that the patch causing this regression was
> >> pulled into 3.15-rc3 -
> >>
> >> 007649375f6af242d5b1df2c15996949714303ba
> >> ext4: initialize multi-block allocator before checking block descriptors
> >>
> >> Bisection by selectively reverting ext4 commits in -rc3 identified this patch
> >> while running on the Pandaboard. I'm still using generic/068 as my reproducer.
> >> It occasionally yields a false negative, but it has passed 10 consecutive
> >> trials on my revert/bisect kernel derived from 3.15-rc3. Given the frequency
> >> of false negatives I've seen, I'm reasonably confident in that result. I'm
> >> going to run another series with just that patch reverted on 3.16-rc3.
> >>
> >> Looking at the patch, the call to ext4_mb_init() was hoisted above the code
> >> performing journal recovery in ext4_fill_super(). The regression occurs only
> >> after journal recovery on the root filesystem.
> >
> > Thanks for finding the culprit! :)
> >
> > Can you apply this patch, build with CONFIG_EXT4FS_DEBUG=y, and see if an
> > FS will mount without crashing? This was the cruddy patch I sent in (and later
> > killed) that fixed the crash on mount with EXT4FS_DEBUG in a somewhat silly
> > way. Maybe it's appropriate now.
> > http://www.spinics.net/lists/linux-ext4/msg43287.html
> >
> > --D
> >
> >>
> >> Secondly:
> >>
> >> Thanks for that git tree! However, I discovered that the same "RCU bug" I
> >> thought I was seeing on the Panda was also visible on the x86_64 KVM, and
> >> it was actually just RCU noticing stalls. These also occurred when using
> >> your git tree as well as on mainline 3.15-rc1 and 3.15-rc2 and during
> >> bisection attempts on 3.15-rc3 within the ext4 patches, and had the effect of
> >> masking the regression on the root filesystem. The test system would lock up
> >> completely - no console response - and made it impossible to force the reboot
> >> which was required to set up the failure. Hence the reversion approach, since
> >> RCU does not report stalls in 3.15-rc3 (final).
> >>
> >> Eric
> >>
> >>
> >>
> >> >
> >> > Thanks!!
> >> >
> >> > - Ted
>
>
>
> --
> Matteo Croce
> OpenWrt Developer

2014-07-10 23:29:13

by Azat Khuzhin

[permalink] [raw]
Subject: Re: ext4: journal has aborted

On Thu, Jul 10, 2014 at 02:57:48PM -0400, Eric Whitney wrote:
> * Theodore Ts'o <[email protected]>:
> > On Mon, Jul 07, 2014 at 11:53:10AM -0400, Theodore Ts'o wrote:
> > > An update from today's ext4 concall. Eric Whitney can fairly reliably
> > > reproduce this on his Panda board with 3.15, and definitely not on
> > > 3.14. So at this point there seems to be at least some kind of 3.15
> > > regression going on here, regardless of whether it's in the eMMC
> > > driver or the ext4 code. (It also means that the bug fix I found is
> > > irrelevant for the purposes of working this issue, since that's a much
> > > harder to hit, and that bug has been around long before 3.14.)
> > >
> > > The problem in terms of narrowing it down any further is that the
> > > Pandaboard is running into RCU bugs which makes it hard to test the
> > > early 3.15-rcX kernels.....
> >
> > In the hopes of making it easy to bisect, I've created a kernel branch
> > which starts with 3.14, and then adds on all of the ext4-related
> > commits since then. You can find it at:
> >
> > git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git test-mb_generate_buddy-failure
> >
> > https://git.kernel.org/cgit/linux/kernel/git/tytso/ext4.git/log/?h=test-mb_generate_buddy-failure
> >
> > Eric, can you see if you can repro the failure on your Panda Board?
> > If you can, try doing a bisection search on these series:
> >
> > git bisect start
> > git bisect good v3.14
> > git bisect bad test-mb_generate_buddy-failure
> >
> > Hopefully if it is caused by one of the commits in this series, we'll
> > be able to pin point it this way.
>
> First, the good news (with luck):
>
> My testing currently suggests that the patch causing this regression was
> pulled into 3.15-rc3 -
>
> 007649375f6af242d5b1df2c15996949714303ba
> ext4: initialize multi-block allocator before checking block descriptors
>
> Bisection by selectively reverting ext4 commits in -rc3 identified this patch
> while running on the Pandaboard. I'm still using generic/068 as my reproducer.
> It occasionally yields a false negative, but it has passed 10 consecutive
> trials on my revert/bisect kernel derived from 3.15-rc3. Given the frequency
> of false negatives I've seen, I'm reasonably confident in that result. I'm
> going to run another series with just that patch reverted on 3.16-rc3.
>
> Looking at the patch, the call to ext4_mb_init() was hoisted above the code
> performing journal recovery in ext4_fill_super(). The regression occurs only
> after journal recovery on the root filesystem.

Oops, nice catch!

I'm very sorry for this. When these problems began, I rechecked my patch,
but didn't find this. (I should test more next time!)
But I don't understand why this triggers only on the root fs?

It will be greate if ext4 will have an BUG_ON() for this case, to avoid
futher bugs, something like this:
$ git diff fs/ext4/mballoc.c
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 59e3162..8dfc999 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -832,6 +832,8 @@ static int ext4_mb_init_cache(struct page *page, char *incore)
 
 	inode = page->mapping->host;
 	sb = inode->i_sb;
+	BUG_ON(EXT4_HAS_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_RECOVER));
+
 	ngroups = ext4_get_groups_count(sb);
 	blocksize = 1 << inode->i_blkbits;
 	blocks_per_page = PAGE_CACHE_SIZE / blocksize;
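
A standalone sketch of what that check amounts to; the feature macros are the
real ones from that era's ext4 headers, the helper name is made up, and a
WARN_ON may be a gentler choice than a BUG_ON here:

	/*
	 * Sketch only: EXT4_FEATURE_INCOMPAT_RECOVER stays set in the on-disk
	 * superblock until journal replay completes, so the buddy-cache code
	 * should never observe it once the mount ordering is correct.
	 */
	static inline void ext4_mb_warn_if_unrecovered(struct super_block *sb)
	{
		WARN_ON(EXT4_HAS_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_RECOVER));
	}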

Thanks, and please accept my apologies to those who have got a corrupted fs.

>
> Secondly:
>
> Thanks for that git tree! However, I discovered that the same "RCU bug" I
> thought I was seeing on the Panda was also visible on the x86_64 KVM, and
> it was actually just RCU noticing stalls. These also occurred when using
> your git tree as well as on mainline 3.15-rc1 and 3.15-rc2 and during
> bisection attempts on 3.15-rc3 within the ext4 patches, and had the effect of
> masking the regression on the root filesystem. The test system would lock up
> completely - no console response - and made it impossible to force the reboot
> which was required to set up the failure. Hence the reversion approach, since
> RCU does not report stalls in 3.15-rc3 (final).
>
> Eric
>
>
>
> >
> > Thanks!!
> >
> > - Ted

2014-07-11 00:13:51

by Darrick J. Wong

[permalink] [raw]
Subject: Re: ext4: journal has aborted

On Thu, Jul 10, 2014 at 06:32:45PM -0400, Theodore Ts'o wrote:
> To be clear, what you would need to do is to revert commit
> 007649375f6af242d5b1df2c15996949714303ba to prevent the fs corruption.
> Darrick's patch is one that tries to fix the problem addressed by that
> commit in a different fashion.
>
> Quite frankly, reverting the commit, which is causing real damage, is
> far more important to me right now than what to do in order to allow
> CONFIG_EXT4FS_DEBUG to work (which is nice, but it's only something
> that file system developers need, and to be honest I can't remember
> the last time I've used said config option). But if we know that
> Darrick's fix works, I'm willing to push that to Linus at the same
> time that I push a revert of 007649375f6af242d5b1df2c15996949714303ba

Reverting the 007649375... patch doesn't seem to create any obvious regressions
on my test box (though again, I was never able to reproduce it as consistently
as Eric W.).

Tossing in the [1] patch also fixes the crash when CONFIG_EXT4_DEBUG=y on
3.16-rc4. I'd say it's safe to send both to Linus and stable.

If anyone experiences problems that I'm not seeing, please yell loudly and
soon!

--D

[1] http://www.spinics.net/lists/linux-ext4/msg43287.html
>
> Cheers,
>
> - Ted
>
> On Thu, Jul 10, 2014 at 11:31:14PM +0200, Matteo Croce wrote:
> > Will do, thanks!
> >
> > 2014-07-10 22:01 GMT+02:00 Darrick J. Wong <[email protected]>:
> > > On Thu, Jul 10, 2014 at 02:57:48PM -0400, Eric Whitney wrote:
> > >> * Theodore Ts'o <[email protected]>:
> > >> > On Mon, Jul 07, 2014 at 11:53:10AM -0400, Theodore Ts'o wrote:
> > >> > > An update from today's ext4 concall. Eric Whitney can fairly reliably
> > >> > > reproduce this on his Panda board with 3.15, and definitely not on
> > >> > > 3.14. So at this point there seems to be at least some kind of 3.15
> > >> > > regression going on here, regardless of whether it's in the eMMC
> > >> > > driver or the ext4 code. (It also means that the bug fix I found is
> > >> > > irrelevant for the purposes of working this issue, since that's a much
> > >> > > harder bug to hit, and that bug has been around since long before 3.14.)
> > >> > >
> > >> > > The problem in terms of narrowing it down any further is that the
> > >> > > Pandaboard is running into RCU bugs which makes it hard to test the
> > >> > > early 3.15-rcX kernels.....
> > >> >
> > >> > In the hopes of making it easy to bisect, I've created a kernel branch
> > >> > which starts with 3.14, and then adds on all of the ext4-related
> > >> > commits since then. You can find it at:
> > >> >
> > >> > git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git test-mb_generate_buddy-failure
> > >> >
> > >> > https://git.kernel.org/cgit/linux/kernel/git/tytso/ext4.git/log/?h=test-mb_generate_buddy-failure
> > >> >
> > >> > Eric, can you see if you can repro the failure on your Panda Board?
> > >> > If you can, try doing a bisection search on this series:
> > >> >
> > >> > git bisect start
> > >> > git bisect good v3.14
> > >> > git bisect bad test-mb_generate_buddy-failure
> > >> >
> > >> > Hopefully if it is caused by one of the commits in this series, we'll
> > >> > be able to pinpoint it this way.
> > >>
> > >> First, the good news (with luck):
> > >>
> > >> My testing currently suggests that the patch causing this regression was
> > >> pulled into 3.15-rc3 -
> > >>
> > >> 007649375f6af242d5b1df2c15996949714303ba
> > >> ext4: initialize multi-block allocator before checking block descriptors
> > >>
> > >> Bisection by selectively reverting ext4 commits in -rc3 identified this patch
> > >> while running on the Pandaboard. I'm still using generic/068 as my reproducer.
> > >> It occasionally yields a false negative, but it has passed 10 consecutive
> > >> trials on my revert/bisect kernel derived from 3.15-rc3. Given the frequency
> > >> of false negatives I've seen, I'm reasonably confident in that result. I'm
> > >> going to run another series with just that patch reverted on 3.16-rc3.
> > >>
> > >> Looking at the patch, the call to ext4_mb_init() was hoisted above the code
> > >> performing journal recovery in ext4_fill_super(). The regression occurs only
> > >> after journal recovery on the root filesystem.
> > >
> > > Thanks for finding the culprit! :)
> > >
> > > Can you apply this patch, build with CONFIG_EXT4FS_DEBUG=y, and see if an
> > > FS will mount without crashing? This was the cruddy patch I sent in (and later
> > > killed) that fixed the crash on mount with EXT4FS_DEBUG in a somewhat silly
> > > way. Maybe it's appropriate now.
> > > http://www.spinics.net/lists/linux-ext4/msg43287.html
> > >
> > > --D
> > >
> > >>
> > >> Secondly:
> > >>
> > >> Thanks for that git tree! However, I discovered that the same "RCU bug" I
> > >> thought I was seeing on the Panda was also visible on the x86_64 KVM, and
> > >> it was actually just RCU noticing stalls. These also occurred when using
> > >> your git tree as well as on mainline 3.15-rc1 and 3.15-rc2 and during
> > >> bisection attempts on 3.15-rc3 within the ext4 patches, and had the effect of
> > >> masking the regression on the root filesystem. The test system would lock up
> > >> completely - no console response - and made it impossible to force the reboot
> > >> which was required to set up the failure. Hence the reversion approach, since
> > >> RCU does not report stalls in 3.15-rc3 (final).
> > >>
> > >> Eric
> > >>
> > >>
> > >>
> > >> >
> > >> > Thanks!!
> > >> >
> > >> > - Ted
> >
> >
> >
> > --
> > Matteo Croce
> > OpenWrt Developer

2014-07-11 00:45:12

by Eric Whitney

[permalink] [raw]
Subject: Re: ext4: journal has aborted

* Darrick J. Wong <[email protected]>:
> On Thu, Jul 10, 2014 at 06:32:45PM -0400, Theodore Ts'o wrote:
> > To be clear, what you would need to do is to revert commit
> > 007649375f6af242d5b1df2c15996949714303ba to prevent the fs corruption.
> > Darrick's patch is one that tries to fix the problem addressed by that
> > commit in a different fashion.
> >
> > Quite frankly, reverting the commit, which is causing real damage, is
> > far more important to me right now than what to do in order to allow
> > CONFIG_EXT4FS_DEBUG to work (which is nice, but it's only something
> > that file system developers need, and to be honest I can't remember
> > the last time I've used said config option). But if we know that
> > Darrick's fix works, I'm willing to push that to Linus at the same
> > time that I push a revert of 007649375f6af242d5b1df2c15996949714303ba
>
> Reverting the 007649375... patch doesn't seem to create any obvious regressions
> on my test box (though again, I was never able to reproduce it as consistently
> as Eric W.).
>
> Tossing in the [1] patch also fixes the crash when CONFIG_EXT4_DEBUG=y on
> 3.16-rc4. I'd say it's safe to send both to Linus and stable.
>
> If anyone experiences problems that I'm not seeing, please yell loudly and
> soon!
>

Reverting the suspect patch - 007649375f - on 3.16-rc3 and running on the
Panda yielded 10 successive "successful" generic/068 failures (no block
bitmap trouble on reboot). So, it looks like that patch is all of it.

Running the same test scenario on Darrick's patch (CONFIG_EXT4FS_DEBUG =>
CONFIG_EXT4_DEBUG) applied to 3.16-rc3 led to exactly the same result.
No panics, BUGs, or other misbehavior whether generic/068 completed
successfully or failed (that test was used here simply because it was
convenient), and no trouble on boot, etc.

Let me know if anything else is needed.

Eric

> --D
>
> [1] http://www.spinics.net/lists/linux-ext4/msg43287.html
> >
> > Cheers,
> >
> > - Ted
> >
> > On Thu, Jul 10, 2014 at 11:31:14PM +0200, Matteo Croce wrote:
> > > Will do, thanks!
> > >
> > > 2014-07-10 22:01 GMT+02:00 Darrick J. Wong <[email protected]>:
> > > > On Thu, Jul 10, 2014 at 02:57:48PM -0400, Eric Whitney wrote:
> > > >> * Theodore Ts'o <[email protected]>:
> > > >> > On Mon, Jul 07, 2014 at 11:53:10AM -0400, Theodore Ts'o wrote:
> > > >> > > An update from today's ext4 concall. Eric Whitney can fairly reliably
> > > >> > > reproduce this on his Panda board with 3.15, and definitely not on
> > > >> > > 3.14. So at this point there seems to be at least some kind of 3.15
> > > >> > > regression going on here, regardless of whether it's in the eMMC
> > > >> > > driver or the ext4 code. (It also means that the bug fix I found is
> > > >> > > irrelevant for the purposes of working this issue, since that's a much
> > > >> > > harder bug to hit, and that bug has been around since long before 3.14.)
> > > >> > >
> > > >> > > The problem in terms of narrowing it down any further is that the
> > > >> > > Pandaboard is running into RCU bugs which makes it hard to test the
> > > >> > > early 3.15-rcX kernels.....
> > > >> >
> > > >> > In the hopes of making it easy to bisect, I've created a kernel branch
> > > >> > which starts with 3.14, and then adds on all of the ext4-related
> > > >> > commits since then. You can find it at:
> > > >> >
> > > >> > git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git test-mb_generate_buddy-failure
> > > >> >
> > > >> > https://git.kernel.org/cgit/linux/kernel/git/tytso/ext4.git/log/?h=test-mb_generate_buddy-failure
> > > >> >
> > > >> > Eric, can you see if you can repro the failure on your Panda Board?
> > > >> > If you can, try doing a bisection search on this series:
> > > >> >
> > > >> > git bisect start
> > > >> > git bisect good v3.14
> > > >> > git bisect bad test-mb_generate_buddy-failure
> > > >> >
> > > >> > Hopefully if it is caused by one of the commits in this series, we'll
> > > >> > be able to pinpoint it this way.
> > > >>
> > > >> First, the good news (with luck):
> > > >>
> > > >> My testing currently suggests that the patch causing this regression was
> > > >> pulled into 3.15-rc3 -
> > > >>
> > > >> 007649375f6af242d5b1df2c15996949714303ba
> > > >> ext4: initialize multi-block allocator before checking block descriptors
> > > >>
> > > >> Bisection by selectively reverting ext4 commits in -rc3 identified this patch
> > > >> while running on the Pandaboard. I'm still using generic/068 as my reproducer.
> > > >> It occasionally yields a false negative, but it has passed 10 consecutive
> > > >> trials on my revert/bisect kernel derived from 3.15-rc3. Given the frequency
> > > >> of false negatives I've seen, I'm reasonably confident in that result. I'm
> > > >> going to run another series with just that patch reverted on 3.16-rc3.
> > > >>
> > > >> Looking at the patch, the call to ext4_mb_init() was hoisted above the code
> > > >> performing journal recovery in ext4_fill_super(). The regression occurs only
> > > >> after journal recovery on the root filesystem.
> > > >
> > > > Thanks for finding the culprit! :)
> > > >
> > > > Can you apply this patch, build with CONFIG_EXT4FS_DEBUG=y, and see if an
> > > > FS will mount without crashing? This was the cruddy patch I sent in (and later
> > > > killed) that fixed the crash on mount with EXT4FS_DEBUG in a somewhat silly
> > > > way. Maybe it's appropriate now.
> > > > http://www.spinics.net/lists/linux-ext4/msg43287.html
> > > >
> > > > --D
> > > >
> > > >>
> > > >> Secondly:
> > > >>
> > > >> Thanks for that git tree! However, I discovered that the same "RCU bug" I
> > > >> thought I was seeing on the Panda was also visible on the x86_64 KVM, and
> > > >> it was actually just RCU noticing stalls. These also occurred when using
> > > >> your git tree as well as on mainline 3.15-rc1 and 3.15-rc2 and during
> > > >> bisection attempts on 3.15-rc3 within the ext4 patches, and had the effect of
> > > >> masking the regression on the root filesystem. The test system would lock up
> > > >> completely - no console response - and made it impossible to force the reboot
> > > >> which was required to set up the failure. Hence the reversion approach, since
> > > >> RCU does not report stalls in 3.15-rc3 (final).
> > > >>
> > > >> Eric
> > > >>
> > > >>
> > > >>
> > > >> >
> > > >> > Thanks!!
> > > >> >
> > > >> > - Ted
> > >
> > >
> > >
> > > --
> > > Matteo Croce
> > > OpenWrt Developer

2014-07-11 08:50:37

by Jaehoon Chung

[permalink] [raw]
Subject: Re: ext4: journal has aborted

On 07/11/2014 09:45 AM, Eric Whitney wrote:
> * Darrick J. Wong <[email protected]>:
>> On Thu, Jul 10, 2014 at 06:32:45PM -0400, Theodore Ts'o wrote:
>>> To be clear, what you would need to do is to revert commit
>>> 007649375f6af242d5b1df2c15996949714303ba to prevent the fs corruption.
>>> Darrick's patch is one that tries to fix the problem addressed by that
>>> commit in a different fashion.
>>>
>>> Quite frankly, reverting the commit, which is causing real damage, is
>>> far more important to me right now than what to do in order to allow
>>> CONFIG_EXT4FS_DEBUG to work (which is nice, but it's only something
>>> that file system developers need, and to be honest I can't remember
>>> the last time I've used said config option). But if we know that
>>> Darrick's fix works, I'm willing to push that to Linus at the same
>>> time that I push a revert of 007649375f6af242d5b1df2c15996949714303ba
>>
>> Reverting the 007649375... patch doesn't seem to create any obvious regressions
>> on my test box (though again, I was never able to reproduce it as consistently
>> as Eric W.).
>>
>> Tossing in the [1] patch also fixes the crash when CONFIG_EXT4_DEBUG=y on
>> 3.16-rc4. I'd say it's safe to send both to Linus and stable.
>>
>> If anyone experiences problems that I'm not seeing, please yell loudly and
>> soon!
>>
>
> Reverting the suspect patch - 007649375f - on 3.16-rc3 and running on the
> Panda yielded 10 successive "successful" generic/068 failures (no block
> bitmap trouble on reboot). So, it looks like that patch is all of it.
In my case, after reverting it, I didn't see the block bitmap corruption problem on my Exynos board.
Before reverting it, the problem occurred almost every time I tried to reboot.
(Kernel version is 3.16-rc4, and an eMMC 5.0 card is used.)

Best Regards,
Jaehoon Chung
>
> Running the same test scenario on Darrick's patch (CONFIG_EXT4FS_DEBUG =>
> CONFIG_EXT4_DEBUG) applied to 3.16-rc3 led to exactly the same result.
> No panics, BUGs, or other misbehavior whether generic/068 completed
> successfully or failed (that test was used here simply because it was
> convenient), and no trouble on boot, etc.
>
> Let me know if anything else is needed.
>
> Eric
>
>> --D
>>
>> [1] http://www.spinics.net/lists/linux-ext4/msg43287.html
>>>
>>> Cheers,
>>>
>>> - Ted
>>>
>>> On Thu, Jul 10, 2014 at 11:31:14PM +0200, Matteo Croce wrote:
>>>> Will do, thanks!
>>>>
>>>> 2014-07-10 22:01 GMT+02:00 Darrick J. Wong <[email protected]>:
>>>>> On Thu, Jul 10, 2014 at 02:57:48PM -0400, Eric Whitney wrote:
>>>>>> * Theodore Ts'o <[email protected]>:
>>>>>>> On Mon, Jul 07, 2014 at 11:53:10AM -0400, Theodore Ts'o wrote:
>>>>>>>> An update from today's ext4 concall. Eric Whitney can fairly reliably
>>>>>>>> reproduce this on his Panda board with 3.15, and definitely not on
>>>>>>>> 3.14. So at this point there seems to be at least some kind of 3.15
>>>>>>>> regression going on here, regardless of whether it's in the eMMC
>>>>>>>> driver or the ext4 code. (It also means that the bug fix I found is
>>>>>>>> irrelevant for the purposes of working this issue, since that's a much
>>>>>>>> harder bug to hit, and that bug has been around since long before 3.14.)
>>>>>>>>
>>>>>>>> The problem in terms of narrowing it down any further is that the
>>>>>>>> Pandaboard is running into RCU bugs which makes it hard to test the
>>>>>>>> early 3.15-rcX kernels.....
>>>>>>>
>>>>>>> In the hopes of making it easy to bisect, I've created a kernel branch
>>>>>>> which starts with 3.14, and then adds on all of the ext4-related
>>>>>>> commits since then. You can find it at:
>>>>>>>
>>>>>>> git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git test-mb_generate_buddy-failure
>>>>>>>
>>>>>>> https://git.kernel.org/cgit/linux/kernel/git/tytso/ext4.git/log/?h=test-mb_generate_buddy-failure
>>>>>>>
>>>>>>> Eric, can you see if you can repro the failure on your Panda Board?
>>>>>>> If you can, try doing a bisection search on this series:
>>>>>>>
>>>>>>> git bisect start
>>>>>>> git bisect good v3.14
>>>>>>> git bisect bad test-mb_generate_buddy-failure
>>>>>>>
>>>>>>> Hopefully if it is caused by one of the commits in this series, we'll
>>>>>>> be able to pinpoint it this way.
>>>>>>
>>>>>> First, the good news (with luck):
>>>>>>
>>>>>> My testing currently suggests that the patch causing this regression was
>>>>>> pulled into 3.15-rc3 -
>>>>>>
>>>>>> 007649375f6af242d5b1df2c15996949714303ba
>>>>>> ext4: initialize multi-block allocator before checking block descriptors
>>>>>>
>>>>>> Bisection by selectively reverting ext4 commits in -rc3 identified this patch
>>>>>> while running on the Pandaboard. I'm still using generic/068 as my reproducer.
>>>>>> It occasionally yields a false negative, but it has passed 10 consecutive
>>>>>> trials on my revert/bisect kernel derived from 3.15-rc3. Given the frequency
>>>>>> of false negatives I've seen, I'm reasonably confident in that result. I'm
>>>>>> going to run another series with just that patch reverted on 3.16-rc3.
>>>>>>
>>>>>> Looking at the patch, the call to ext4_mb_init() was hoisted above the code
>>>>>> performing journal recovery in ext4_fill_super(). The regression occurs only
>>>>>> after journal recovery on the root filesystem.
>>>>>
>>>>> Thanks for finding the culprit! :)
>>>>>
>>>>> Can you apply this patch, build with CONFIG_EXT4FS_DEBUG=y, and see if an
>>>>> FS will mount without crashing? This was the cruddy patch I sent in (and later
>>>>> killed) that fixed the crash on mount with EXT4FS_DEBUG in a somewhat silly
>>>>> way. Maybe it's appropriate now.
>>>>> http://www.spinics.net/lists/linux-ext4/msg43287.html
>>>>>
>>>>> --D
>>>>>
>>>>>>
>>>>>> Secondly:
>>>>>>
>>>>>> Thanks for that git tree! However, I discovered that the same "RCU bug" I
>>>>>> thought I was seeing on the Panda was also visible on the x86_64 KVM, and
>>>>>> it was actually just RCU noticing stalls. These also occurred when using
>>>>>> your git tree as well as on mainline 3.15-rc1 and 3.15-rc2 and during
>>>>>> bisection attempts on 3.15-rc3 within the ext4 patches, and had the effect of
>>>>>> masking the regression on the root filesystem. The test system would lock up
>>>>>> completely - no console response - and made it impossible to force the reboot
>>>>>> which was required to set up the failure. Hence the reversion approach, since
>>>>>> RCU does not report stalls in 3.15-rc3 (final).
>>>>>>
>>>>>> Eric
>>>>>>
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> Thanks!!
>>>>>>>
>>>>>>> - Ted
>>>>
>>>>
>>>>
>>>> --
>>>> Matteo Croce
>>>> OpenWrt Developer
>


2014-07-11 11:44:07

by Theodore Ts'o

[permalink] [raw]
Subject: Re: ext4: journal has aborted

On Thu, Jul 10, 2014 at 08:45:08PM -0400, Eric Whitney wrote:
>
> Reverting the suspect patch - 007649375f - on 3.16-rc3 and running on the
> Panda yielded 10 successive "successful" generic/068 failures (no block
> bitmap trouble on reboot). So, it looks like that patch is all of it.

Thanks again Eric, for finding this!!

> Running the same test scenario on Darrick's patch (CONFIG_EXT4FS_DEBUG =>
> CONFIG_EXT4_DEBUG) applied to 3.16-rc3 led to exactly the same result.
> No panics, BUGs, or other misbehavior whether generic/068 completed
> successfully or failed (that test was used here simply because it was
> convenient), and no trouble on boot, etc.

I've been looking more closely at the change in question, and I suspect
the real fix is that we should move these lines:

	ext4_free_blocks_count_set(sbi->s_es,
				   EXT4_C2B(sbi, ext4_count_free_clusters(sb)));
	sbi->s_es->s_free_inodes_count =
		cpu_to_le32(ext4_count_free_inodes(sb));

after journal recovery is run. Not that it really matters since so very
little (and nothing for normal file system operation, including the
statfs(2) system call) actually depends on the free blocks count in
the superblock --- instead we use the percpu "s_freeclusters_counter" and
"s_dirtyclusters_counter" fields for scalability reasons --- but if we
*are* going to set these fields in the on-disk superblock, we should
wait until *after* we have updated the allocation bitmaps before we
start counting the free blocks in those bitmaps.
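
A minimal sketch of that reordering, assuming journal replay happens via
ext4_load_journal() in ext4_fill_super(); the feature check and the error
label are illustrative, not the fix that actually went in:

	/* Replay the journal first, so the on-disk bitmaps are current... */
	if (EXT4_HAS_COMPAT_FEATURE(sb, EXT4_FEATURE_COMPAT_HAS_JOURNAL)) {
		err = ext4_load_journal(sb, es, journal_devnum);
		if (err)
			goto failed_mount;
	}

	/* ...and only then derive the superblock counters from those bitmaps. */
	ext4_free_blocks_count_set(sbi->s_es,
				   EXT4_C2B(sbi, ext4_count_free_clusters(sb)));
	sbi->s_es->s_free_inodes_count =
		cpu_to_le32(ext4_count_free_inodes(sb));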

Cheers,

- Ted

2014-07-15 06:31:01

by David Jander

[permalink] [raw]
Subject: Re: ext4: journal has aborted


Dear Eric, Ted, ...

On Thu, 10 Jul 2014 20:45:08 -0400
Eric Whitney <[email protected]> wrote:

> * Darrick J. Wong <[email protected]>:
> > On Thu, Jul 10, 2014 at 06:32:45PM -0400, Theodore Ts'o wrote:
> > > To be clear, what you would need to do is to revert commit
> > > 007649375f6af242d5b1df2c15996949714303ba to prevent the fs corruption.
> > > Darrick's patch is one that tries to fix the problem addressed by that
> > > commit in a different fashion.
> > >
> > > Quite frankly, reverting the commit, which is causing real damage, is
> > > far more important to me right now than what to do in order to allow
> > > CONFIG_EXT4FS_DEBUG to work (which is nice, but it's only something
> > > that file system developers need, and to be honest I can't remember
> > > the last time I've used said config option). But if we know that
> > > Darrick's fix works, I'm willing to push that to Linus at the same
> > > time that I push a revert of 007649375f6af242d5b1df2c15996949714303ba
> >
> > Reverting the 007649375... patch doesn't seem to create any obvious regressions
> > on my test box (though again, I was never able to reproduce it as consistently
> > as Eric W.).
> >
> > Tossing in the [1] patch also fixes the crash when CONFIG_EXT4_DEBUG=y on
> > 3.16-rc4. I'd say it's safe to send both to Linus and stable.
> >
> > If anyone experiences problems that I'm not seeing, please yell loudly and
> > soon!
> >
>
> Reverting the suspect patch - 007649375f - on 3.16-rc3 and running on the
> Panda yielded 10 successive "successful" generic/068 failures (no block
> bitmap trouble on reboot). So, it looks like that patch is all of it.
>
> Running the same test scenario on Darrick's patch (CONFIG_EXT4FS_DEBUG =>
> CONFIG_EXT4_DEBUG) applied to 3.16-rc3 led to exactly the same result.
> No panics, BUGs, or other misbehavior whether generic/068 completed
> successfully or failed (that test was used here simply because it was
> convenient), and no trouble on boot, etc.

Thanks a lot to everyone for the excellent work hunting this down!
I can confirm that on 3.16-rc5 the problem is gone. At least after extensive
testing yesterday, I have not been able to reproduce it. No ext4 error
messages and no filesystem corruption anymore!

Best regards,

--
David Jander
Protonic Holland.