2012-10-15 17:46:11

by Toralf Förster

Subject: EXT4-fs error w/ external USB drive

Even with current stable kernel 3.6.2 I sometimes get those syslog messages :


2012-10-15T19:37:58.401+02:00 n22 kernel: EXT4-fs error (device sdb3): ext4_mb_generate_buddy:741: group 436, 22902 clusters in bitmap, 22901 in gd
2012-10-15T19:37:58.417+02:00 n22 kernel: JBD2: Spotted dirty metadata buffer (dev = sdb3, blocknr = 0). There's a risk of filesystem corruption in case of system crash.
2012-10-15T19:38:03.335+02:00 n22 kernel: EXT4-fs error (device sdb3): mb_free_blocks:1300: group 436, block 14294559:freeing already freed block (bit 7711)

when I run an almost stable Gentoo booted from an external USB disk.
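
As a cross-check on the mb_free_blocks line above, the group number and bit offset can be derived from the block number. This is a minimal sketch, assuming (the report does not state it) a 4 KiB block size, hence 32768 blocks per group, and a first data block of 0. The function name is made up for illustration.

```python
# Assumed geometry (not confirmed in the report): 4 KiB blocks,
# hence 8 * 4096 = 32768 blocks per group, and first data block 0.
BLOCKS_PER_GROUP = 32768
FIRST_DATA_BLOCK = 0


def locate_block(block, blocks_per_group=BLOCKS_PER_GROUP,
                 first_data_block=FIRST_DATA_BLOCK):
    """Return (group, bit) for an absolute block number."""
    return divmod(block - first_data_block, blocks_per_group)


if __name__ == "__main__":
    group, bit = locate_block(14294559)
    print(f"group {group}, bit {bit}")
```

Under those assumptions the result is group 436, bit 7711, which agrees with the numbers in the log line.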
I didn't configure anything too exotic; only this is worth mentioning:

$> cat /proc/sys/vm/dirty_writeback_centisecs
1000
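
For reference, this value is expressed in hundredths of a second. A small sketch of the conversion follows, assuming the stock kernel default of 500 (5 seconds) for comparison; the helper names are hypothetical.

```python
from pathlib import Path

# Stock default for vm.dirty_writeback_centisecs (assumed here: 500, i.e. 5 s).
DEFAULT_CENTISECS = 500


def writeback_interval_seconds(centisecs):
    """Convert a dirty_writeback_centisecs value to seconds."""
    return int(centisecs) / 100.0


def read_writeback_interval(path="/proc/sys/vm/dirty_writeback_centisecs"):
    """Read the current writeback wakeup interval in seconds (Linux only)."""
    return writeback_interval_seconds(Path(path).read_text().strip())


if __name__ == "__main__":
    print(writeback_interval_seconds(1000))               # value from this report
    print(writeback_interval_seconds(DEFAULT_CENTISECS))  # assumed default
```

So the setting shown here wakes the writeback threads every 10 seconds rather than every 5.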


--
MfG/Sincerely
Toralf Förster
pgp finger print: 7B1A 07F4 EC82 0F90 D4C2 8936 872A E508 7DB6 9DA3


Attachments:
.config (79.81 kB)

2012-10-19 21:07:29

by Theodore Ts'o

Subject: Re: EXT4-fs error w/ external USB drive

On Mon, Oct 15, 2012 at 07:46:02PM +0200, Toralf Förster wrote:
> Even with current stable kernel 3.6.2 I sometimes get those syslog messages :
>
>
> 2012-10-15T19:37:58.401+02:00 n22 kernel: EXT4-fs error (device sdb3): ext4_mb_generate_buddy:741: group 436, 22902 clusters in bitmap, 22901 in gd

Have you run e2fsck to clean up the file system corruption? If you
have, do you continually get these errors afterwards?

You say this is an external USB disk; is there any possibility of the
disk getting unmounted uncleanly due to the cable getting pulled out
while the disk is still mounted, and then the disk getting remounted
w/o having e2fsck run on the disk?

- Ted

2012-10-22 18:39:41

by Toralf Förster

Subject: Re: EXT4-fs error w/ external USB drive

On 10/19/2012 11:07 PM, Theodore Ts'o wrote:
> On Mon, Oct 15, 2012 at 07:46:02PM +0200, Toralf Förster wrote:
>> Even with current stable kernel 3.6.2 I sometimes get those syslog messages :
>>
>>
>> 2012-10-15T19:37:58.401+02:00 n22 kernel: EXT4-fs error (device sdb3): ext4_mb_generate_buddy:741: group 436, 22902 clusters in bitmap, 22901 in gd
>
> Have you run e2fsck to clean up the file system corruption? If you
> have, do you continually get these errors afterwards?

Well, I got it yesterday too :

n22 ~ # zgrep ext4_mb_generate_buddy /var/log/messages-201210*
/var/log/messages-20121021.gz:2012-10-15T19:05:39.189+02:00 n22 kernel: EXT4-fs error (device sdb3): ext4_mb_generate_buddy:741: group 774, 27157 clusters in bitmap, 27052 in gd
/var/log/messages-20121021.gz:2012-10-15T19:37:58.401+02:00 n22 kernel: EXT4-fs error (device sdb3): ext4_mb_generate_buddy:741: group 436, 22902 clusters in bitmap, 22901 in gd
/var/log/messages-20121021.gz:2012-10-15T19:56:05.301+02:00 n22 kernel: EXT4-fs error (device sdb3): ext4_mb_generate_buddy:741: group 1233, 11981 clusters in bitmap, 11974 in gd
/var/log/messages-20121021.gz:2012-10-15T19:56:18.601+02:00 n22 kernel: EXT4-fs error (device sdb3): ext4_mb_generate_buddy:741: group 484, 28817 clusters in bitmap, 28101 in gd

I rebooted the system and forced a run of fsck after these lines too.
I'll check periodically whether it happens again.
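
To see how far the bitmap and the group descriptor disagree in each of the lines above, the messages can be parsed mechanically. This is a rough sketch; the regular expression is written against the message format shown in this thread only, and the function name is invented. The sample lines are the four from the zgrep output with the file name and timestamp trimmed.

```python
import re

# Matches the ext4_mb_generate_buddy error format seen in the log above.
PATTERN = re.compile(
    r"EXT4-fs error \(device (?P<dev>[^)\s]+)\): ext4_mb_generate_buddy:\d+: "
    r"group (?P<group>\d+), (?P<bitmap>\d+) clusters in bitmap, "
    r"(?P<gd>\d+) in gd"
)


def bitmap_mismatches(lines):
    """Yield (device, group, bitmap_count - gd_count) for each matching line."""
    for line in lines:
        m = PATTERN.search(line)
        if m:
            yield (m["dev"], int(m["group"]),
                   int(m["bitmap"]) - int(m["gd"]))


# Lines from the zgrep output above, with file name and timestamp trimmed.
SAMPLE = [
    "n22 kernel: EXT4-fs error (device sdb3): ext4_mb_generate_buddy:741: group 774, 27157 clusters in bitmap, 27052 in gd",
    "n22 kernel: EXT4-fs error (device sdb3): ext4_mb_generate_buddy:741: group 436, 22902 clusters in bitmap, 22901 in gd",
    "n22 kernel: EXT4-fs error (device sdb3): ext4_mb_generate_buddy:741: group 1233, 11981 clusters in bitmap, 11974 in gd",
    "n22 kernel: EXT4-fs error (device sdb3): ext4_mb_generate_buddy:741: group 484, 28817 clusters in bitmap, 28101 in gd",
]

if __name__ == "__main__":
    for dev, group, delta in bitmap_mismatches(SAMPLE):
        print(f"{dev} group {group}: bitmap exceeds gd by {delta}")
```

For these four lines the differences are 105, 1, 7 and 716 clusters, so the mismatch is not a constant off-by-one.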

> You say this is an external USB disk; is there any possibility of the
> disk getting unmounted uncleanly due to the cable getting pulled out
> while the disk is still mounted, and then the disk getting remounted
> w/o having e2fsck run on the disk?

For the first occurrence probably yes (better : I dunno), but yesterday definitely not.



--
MfG/Sincerely
Toralf Förster
pgp finger print: 7B1A 07F4 EC82 0F90 D4C2 8936 872A E508 7DB6 9DA3

2012-10-24 01:11:32

by Theodore Ts'o

Subject: Re: EXT4-fs error w/ external USB drive

Toralf,

Are you using any kind of special mount options on your usb stick?

Thanks,

- Ted

2012-10-24 17:34:09

by Toralf Förster

Subject: Re: EXT4-fs error w/ external USB drive

On 10/24/2012 03:11 AM, Theodore Ts'o wrote:
> Toralf,
>
> Are you using any kind of special mount options on your usb stick?
>
> Thanks,
>
> - Ted
nope

tfoerste@n22 ~/devel/linux $ grep ext4 /etc/fstab
/dev/sdb3 / ext4 noatime 0 1


--
MfG/Sincerely
Toralf Förster
pgp finger print: 7B1A 07F4 EC82 0F90 D4C2 8936 872A E508 7DB6 9DA3

2012-10-24 18:35:11

by Theodore Ts'o

Subject: Re: EXT4-fs error w/ external USB drive

On Wed, Oct 24, 2012 at 07:31:57PM +0200, Toralf Förster wrote:
> > Are you using any kind of special mount options on your usb stick?
> >
> nope

Thanks, we're trying to get a reliable repro of this failure, and so
every bit of data helps... I've cc'ed you on the other thread, and if
you could try the second patch I sent out last night (and let me know
when/if the WARN_ON triggers), I'd really appreciate it.

Thanks again,

- Ted

2012-10-25 16:39:46

by Toralf Förster

Subject: Re: EXT4-fs error w/ external USB drive

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 10/24/2012 08:35 PM, Theodore Ts'o wrote:
> Thanks, we're trying to get a reliable repro of this failure, and
> so every bit of data helps... I've cc'ed you on the other thread,
> and if you could try the second patch I sent out last night (and
> let me know when/if the WARN_ON triggers), I'd really appreciate
> it.
>
> Thanks again,
>
> - Ted
>
I'm now running a vanilla 3.6.3 + your patch.

After a lot of file operations (Gentoo emerging, kernel build, git
pulls, ...) I suspended the system (the one with the external USB
drive) with s2disk yesterday, woke it up today, rebooted it,
and had to manually repair the file system, because the automatic fsck
gave up.

Most of what landed in /lost+found, however, was only temporary data
from installing a Gentoo package on the 14th of September, so nothing
was seriously lost AFAICS.

Nevertheless there's another Linux system I have (64bit RH EL,internal
drive), where with kernel 3.5.4-1.el6.elrepo.x86_64 EXT4 errors occurred.
I attached the whole appropriate section of /var/log/message.


- --
MfG/Sincerely
Toralf Förster
pgp finger print: 7B1A 07F4 EC82 0F90 D4C2 8936 872A E508 7DB6 9DA3
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.19 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://www.enigmail.net/

iEYEARECAAYFAlCJa0IACgkQhyrlCH22naO/tgCgxCgSVSVqLEFYPKvkxZgSCJSB
4/UAoLikH/LuOdq1yOXaS5ODTijYVE7j
=pC1B
-----END PGP SIGNATURE-----


Attachments:
err.txt (175.09 kB)
err.txt.sig (72.00 B)

2012-10-25 18:20:11

by Theodore Ts'o

Subject: Re: EXT4-fs error w/ external USB drive

On Thu, Oct 25, 2012 at 06:39:30PM +0200, Toralf Förster wrote:
> After a lot of file operations (Gentoo emerging, kernel build, git
> pulls, ...) I suspended the system (the one with the external USB
> drive) with s2disk yesterday, woke it up today, rebooted it,
> and had to manually repair the file system, because the automatic fsck
> gave up.

OK, I'm going to send another patch series which I hope you can
test to see if it reduces the rate at which this happens.

> Nevertheless there's another Linux system I have (64bit RH EL,internal
> drive), where with kernel 3.5.4-1.el6.elrepo.x86_64 EXT4 errors occurred.
> I attached the whole appropriate section of /var/log/message.

I don't have easy access to the RHEL kernel sources, and so I don't
know which patches were applied. Specifically, I'd really like to
know if the commit represented by 14b4ed22a6 is in RHEL
3.5.4-1.el6.elrepo. Also, I'd like to know which line number was
reflected here, which was the first EXT4-fs error:

> Sep 26 09:26:54 x kernel: EXT4-fs error (device dm-1) in ext4_new_inode:938: IO failure

This was from fs/ext4/ialloc.c line 938, and there are two
ext4_std_error() calls that this could represent, which is why having
the exact kernel sources from this RHEL kernel would be useful. (I'd also
suggest opening a RHEL support ticket if you have a support contract,
since that way Red Hat can track this issue, and that way Eric can
count the work he's been doing on this fire drill as supporting a
customer. :-)

What's a bit unfortunate is that there were no other error messages
before this line. So we can't know for sure what caused or returned
the -EIO error code. I *suspect* it was this, which would
indicate a corrupted inode bitmap:

        if (insert_inode_locked(inode) < 0) {
                /*
                 * Likely a bitmap corruption causing inode to be allocated
                 * twice.
                 */
                err = -EIO;
                goto fail;
        }
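
To illustrate why a corrupted inode bitmap would end in -EIO on this path, here is a toy model in Python. It is not kernel code: the class and method names are invented, and it only mimics the idea that the allocator trusts the bitmap while a separate table tracks the inodes that are actually live.

```python
import errno


class ToyInodeAllocator:
    """Toy model: a bitmap the allocator trusts plus a table of live inodes."""

    def __init__(self, count):
        self.bitmap = [False] * count   # True means "inode number in use"
        self.live = set()               # stand-in for the in-memory inode hash

    def new_inode(self):
        """Allocate the first inode whose bit is clear; return number or -errno."""
        for ino, used in enumerate(self.bitmap):
            if not used:
                self.bitmap[ino] = True
                if ino in self.live:
                    # Bitmap said "free" but the inode is live: the analogue
                    # of insert_inode_locked() failing in the snippet above.
                    return -errno.EIO
                self.live.add(ino)
                return ino
        return -errno.ENOSPC


if __name__ == "__main__":
    alloc = ToyInodeAllocator(4)
    first = alloc.new_inode()       # hands out toy inode 0
    alloc.bitmap[first] = False     # simulate corruption: bit wrongly cleared
    print(alloc.new_inode() == -errno.EIO)
```

In this model the second allocation picks the same number again because the bitmap claims it is free, and the duplicate is only caught when the inode is inserted into the live table.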

Do you know if this external disk could have suffered from a cable
pull, or a flaky cable, or some kind of unclean shutdown/power failure
before it rebooted? That would be an interesting data point.

For the future, we need to add some better error reporting for
failures such as this. In addition, I have a recent change we made at
work that I should get upstream which avoids allocating from a block
group once we notice a corruption (currently just for the block
allocations, but I think we should do this for inode allocations as
well), to minimize the chances of lost data once we notice that the
block/inode allocation bitmap can't be trusted. This avoids data loss
in the case where users are using the default errors=continue instead
of errors=panic or errors=remount-ro.

Speaking of which, for your production RHEL server, you might want to
seriously consider errors=panic for any critical file system volume.
This allows the file system to get corrected via e2fsck, and prevents
the server from stumbling along, possibly causing more data loss due
to a fs corruption.
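
As a rough aid for auditing which policy a given mount entry requests, the following sketch extracts the errors= value from a comma-separated mount option string (for example the fourth field of an /etc/fstab line). The function name is hypothetical. When no errors= option is present, the behavior comes from the default stored in the superblock, which can be changed with tune2fs -e.

```python
VALID_POLICIES = {"continue", "remount-ro", "panic"}


def error_policy(options):
    """Return the errors= policy named in a mount option string,
    or None if none is given (the superblock default then applies)."""
    policy = None
    for opt in options.split(","):
        key, _, value = opt.strip().partition("=")
        if key == "errors" and value in VALID_POLICIES:
            policy = value   # the last occurrence wins in this sketch
    return policy


if __name__ == "__main__":
    print(error_policy("noatime"))               # options from the fstab in this thread
    print(error_policy("noatime,errors=panic"))  # hypothetical hardened entry
```

For the fstab entry quoted earlier in the thread this returns None, meaning no explicit policy is requested at mount time.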

Regards,

- Ted

2012-10-25 18:27:42

by Eric Sandeen

Subject: Re: EXT4-fs error w/ external USB drive

On 10/25/12 1:20 PM, Theodore Ts'o wrote:
> On Thu, Oct 25, 2012 at 06:39:30PM +0200, Toralf Förster wrote:
>> After a lot of file operations (Gentoo emerging, kernel build, git
>> pulls, ...) I suspended the system (the one with the external USB
>> drive) with s2disk yesterday, woke it up today, rebooted it,
>> and had to manually repair the file system, because the automatic fsck
>> gave up.
>
> OK, I'm going to send another patch series which I hope you can
> test to see if it reduces the rate at which this happens.
>
>> Nevertheless there's another Linux system I have (64bit RH EL,internal
>> drive), where with kernel 3.5.4-1.el6.elrepo.x86_64 EXT4 errors occurred.
>> I attached the whole appropriate section of /var/log/message.
>
> I don't have easy access to the RHEL kernel sources, and so I don't

Just FWIW, that's a 3rd party kernel, not something Red Hat
ships. (see "elrepo")

-Eric

2012-10-26 14:30:20

by Toralf Förster

Subject: Re: EXT4-fs error w/ external USB drive

On 10/24/2012 08:35 PM, Theodore Ts'o wrote:
> On Wed, Oct 24, 2012 at 07:31:57PM +0200, Toralf Förster wrote:
>>> Are you using any kind of special mount options on your usb stick?
>>>
>> nope
>
> Thanks, we're trying to get a reliable repro of this failure, and so
> every bit of data helps... I've cc'ed you on the other thread, and if
> you could try the second patch I sent out last night (and let me know
> when/if the WARN_ON triggers), I'd really appreciate it.
>
> Thanks again,
>
> - Ted
>
Well, here it is :

2012-10-25T21:05:28.000+02:00 n22 sudo: tfoerste : TTY=pts/2 ; PWD=/home/tfoerste/virtual/uml ; USER=root ; COMMAND=/bin/su -
2012-10-25T21:05:28.000+02:00 n22 sudo: pam_unix(sudo:session): session opened for user root by tfoerste(uid=0)
2012-10-25T21:05:28.000+02:00 n22 su[18998]: Successful su for root by root
2012-10-25T21:05:28.000+02:00 n22 su[18998]: + /dev/pts/2 root:root
2012-10-25T21:05:28.000+02:00 n22 su[18998]: pam_unix(su:session): session opened for user root by tfoerste(uid=0)
2012-10-25T21:05:44.880+02:00 n22 kernel: EXT4-fs (loop0): mounted filesystem with ordered data mode. Opts: (null)
2012-10-25T21:07:07.218+02:00 n22 kernel: JBD2: jbd2_mark_journal_empty bug workaround (79, 80)
2012-10-25T21:07:07.218+02:00 n22 kernel: ------------[ cut here ]------------
2012-10-25T21:07:07.218+02:00 n22 kernel: WARNING: at fs/jbd2/journal.c:1364 jbd2_mark_journal_empty+0xef/0x110()
2012-10-25T21:07:07.218+02:00 n22 kernel: Hardware name: 4180F65
2012-10-25T21:07:07.218+02:00 n22 kernel: Modules linked in: bluetooth cpufreq_stats loop ipt_MASQUERADE xt_owner xt_multiport ipt_REJECT xt_recent xt_tcpudp xt_mac nf_conntrack_ftp xt_state xt_limit xt_LOG iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_conntrack iptable_filter ip_tables x_tables af_packet pppoe pppox ppp_generic slhc bridge stp llc tun msr i915 coretemp cfbfillrect cfbimgblt i2c_algo_bit cfbcopyarea fbcon bitblit snd_hda_codec_conexant softcursor font snd_hda_intel snd_hda_codec kvm_intel snd_pcm intel_agp 8250_pci intel_gtt drm_kms_helper snd_page_alloc snd_timer kvm 8250 drm thinkpad_acpi nvram uvcvideo snd serial_core agpgart fb sdhci_pci usblp videobuf2_vmalloc e1000e videobuf2_memops videobuf2_core videodev i2c_i801 tpm_tis soundcore hwmon arc4 sdhci i2c_core tpm fbdev iwldvm acpi_cpufreq mac80211 mperf ac psmouse iwlwifi battery button cfg80211 rfkill evdev mmc_core processor video tpm_bios thermal wmi xts gf128mul aesni_intel ablk_helper cryptd aes_i586 aes_generic cbc fuse nfs lockd sunrpc dm_crypt dm_mod hid_monterey hid_microsoft hid_logitech hid_ezkey hid_cypress hid_chicony hid_cherry hid_belkin hid_apple hid_a4tech hid_generic usbhid hid sr_mod cdrom sg [last unloaded: microcode]
2012-10-25T21:07:07.218+02:00 n22 kernel: Pid: 19040, comm: umount Not tainted 3.6.3 #8
2012-10-25T21:07:07.218+02:00 n22 kernel: Call Trace:
2012-10-25T21:07:07.218+02:00 n22 kernel: [<c10323e2>] warn_slowpath_common+0x72/0xa0
2012-10-25T21:07:07.218+02:00 n22 kernel: [<c11e253f>] ? jbd2_mark_journal_empty+0xef/0x110
2012-10-25T21:07:07.218+02:00 n22 kernel: [<c11e253f>] ? jbd2_mark_journal_empty+0xef/0x110
2012-10-25T21:07:07.223+02:00 n22 kernel: [<c1032432>] warn_slowpath_null+0x22/0x30
2012-10-25T21:07:07.223+02:00 n22 kernel: [<c11e253f>] jbd2_mark_journal_empty+0xef/0x110
2012-10-25T21:07:07.223+02:00 n22 kernel: [<c11e272e>] jbd2_journal_destroy+0x1ce/0x1f0
2012-10-25T21:07:07.223+02:00 n22 kernel: [<c10529c0>] ? add_wait_queue+0x50/0x50
2012-10-25T21:07:07.223+02:00 n22 kernel: [<c11b079a>] ext4_put_super+0x4a/0x2e0
2012-10-25T21:07:07.223+02:00 n22 kernel: [<c1127652>] ? dispose_list+0x32/0x40
2012-10-25T21:07:07.223+02:00 n22 kernel: [<c1127c0f>] ? evict_inodes+0x8f/0xe0
2012-10-25T21:07:07.223+02:00 n22 kernel: [<c1112521>] generic_shutdown_super+0x51/0xd0
2012-10-25T21:07:07.223+02:00 n22 kernel: [<c10e5b45>] ? pcpu_free_area+0x145/0x190
2012-10-25T21:07:07.223+02:00 n22 kernel: [<c11125c9>] kill_block_super+0x29/0x70
2012-10-25T21:07:07.224+02:00 n22 kernel: [<c1112800>] deactivate_locked_super+0x30/0x90
2012-10-25T21:07:07.224+02:00 n22 kernel: [<c1113247>] deactivate_super+0x47/0x60
2012-10-25T21:07:07.224+02:00 n22 kernel: [<c112a41d>] mntput_no_expire+0xcd/0x120
2012-10-25T21:07:07.224+02:00 n22 kernel: [<c112b1ca>] sys_umount+0x6a/0x330
2012-10-25T21:07:07.224+02:00 n22 kernel: [<c112b4ae>] sys_oldumount+0x1e/0x20
2012-10-25T21:07:07.224+02:00 n22 kernel: [<c13ad753>] sysenter_do_call+0x12/0x22
2012-10-25T21:07:07.224+02:00 n22 kernel: ---[ end trace 8e7416a7368818fe ]---
2012-10-25T21:07:07.000+02:00 n22 su[18998]: pam_unix(su:session): session closed for user root
2012-10-25T21:07:07.000+02:00 n22 sudo: pam_unix(sudo:session): session closed for user root


--
MfG/Sincerely
Toralf Förster
pgp finger print: 7B1A 07F4 EC82 0F90 D4C2 8936 872A E508 7DB6 9DA3