2014-07-31 19:57:41

by Andy Isaacson

Subject: ext4_mb_generate_buddy: 18745 clusters in bitmap, 18746 in gd; block bitmap corrupt

3.15.5 amd64, ext4 rootfs on LVM on LUKS on Samsung SSD 840 EVO on
Thinkpad T440s.

System has been quite stable for ~9 months, always running a very recent
stable tree.

The kernel panicked this morning, probably due to an external drive
triggering UAS errors in 3.15 (but alas, the syslog didn't make it to
disk). The system remained powered on for >30 seconds after the panic
before I finally shut it down by holding down the power button, so there
should not have been any writes in flight to the SSD.

After reboot, rootfs was deeply unhappy:

[ 7.248400] EXT4-fs (dm-1): INFO: recovery required on readonly filesystem
[ 7.248404] EXT4-fs (dm-1): write access will be enabled during recovery
[ 7.303580] EXT4-fs (dm-1): orphan cleanup on readonly fs
[ 7.326277] EXT4-fs (dm-1): 10 orphan inodes deleted
[ 7.326280] EXT4-fs (dm-1): recovery complete
[ 7.380065] EXT4-fs (dm-1): mounted filesystem with ordered data mode. Opts: (null)
...
[ 8.829221] EXT4-fs (dm-1): re-mounted. Opts: errors=remount-ro
...
[ 39.354383] EXT4-fs error (device dm-1): ext4_mb_generate_buddy:756: group 835, 18745 clusters in bitmap, 18746 in gd; block bitmap corrupt.
[ 39.354389] Aborting journal on device dm-1-8.
[ 39.354478] EXT4-fs (dm-1): Remounting filesystem read-only
[ 39.354485] ------------[ cut here ]------------
[ 39.354517] WARNING: CPU: 0 PID: 2312 at fs/ext4/ext4_jbd2.c:259 __ext4_handle_dirty_metadata+0xf4/0x1a4 [ext4]()
[ 39.354519] Modules linked in: snd_hda_codec_realtek snd_hda_codec_hdmi snd_hda_codec_generic nls_utf8 nls_cp437 vfat fat ext2 joydev uvcvideo videobuf2_vmalloc videobuf2_memops videobuf2_core videodev arc4 media ecb btusb bluetooth 6lowpan_iphc x86_pkg_temp_thermal intel_rapl kvm_intel iwlmvm kvm mac80211 pcspkr psmouse evdev serio_raw iwlwifi snd_hda_intel snd_hda_controller cfg80211 i2c_i801 snd_hda_codec snd_hwdep snd_pcm snd_seq i915 snd_seq_device thinkpad_acpi snd_timer nvram tpm_tis rfkill battery tpm ac drm_kms_helper drm snd video acpi_cpufreq intel_gtt shpchp i2c_algo_bit intel_smartconnect i2c_core soundcore button processor loop fuse autofs4 ext4 crc16 jbd2 mbcache hid_generic usbhid hid dm_crypt dm_mod sg sd_mod crc_t10dif crct10dif_generic crct10dif_common rtsx_pci_sdmmc mmc_core ahci e1000e ptp pps_core aesni_intel libahci aes_x86_64 glue_helper libata lrw gf128mul ablk_helper cryptd scsi_mod ehci_pci ehci_hcd xhci_hcd rtsx_pci mfd_core usbcore thermal usb_common thermal_sys
[ 39.354598] CPU: 0 PID: 2312 Comm: systemd-tmpfile Not tainted 3.15.5 #19
[ 39.354600] Hardware name: LENOVO 20AQCTO1WW/20AQCTO1WW, BIOS GJET61WW (2.11 ) 10/02/2013
[ 39.354602] 0000000000000000 ffff880213c67b78 ffffffff81378c2a 0000000000000000
[ 39.354605] ffff880213c67bb0 ffffffff8103dc62 ffffffffa03a3d33 ffff8800d607eea0
[ 39.354608] 00000000ffffffe2 0000000000000000 ffff8800d60a3030 ffff880213c67bc0
[ 39.354611] Call Trace:
[ 39.354617] [<ffffffff81378c2a>] dump_stack+0x45/0x56
[ 39.354621] [<ffffffff8103dc62>] warn_slowpath_common+0x7f/0x98
[ 39.354643] [<ffffffffa03a3d33>] ? __ext4_handle_dirty_metadata+0xf4/0x1a4 [ext4]
[ 39.354648] [<ffffffff8103dd2e>] warn_slowpath_null+0x1a/0x1c
[ 39.354666] [<ffffffffa03a3d33>] __ext4_handle_dirty_metadata+0xf4/0x1a4 [ext4]
[ 39.354686] [<ffffffffa03aa380>] ext4_free_blocks+0x713/0x809 [ext4]
[ 39.354704] [<ffffffffa03a0639>] ext4_ext_remove_space+0x698/0xbdc [ext4]
[ 39.354723] [<ffffffffa03af7b1>] ? __es_remove_extent+0x46/0x27d [ext4]
[ 39.354741] [<ffffffffa03a246f>] ext4_ext_truncate+0x89/0xad [ext4]
[ 39.354756] [<ffffffffa0383024>] ext4_truncate+0x199/0x281 [ext4]
[ 39.354770] [<ffffffffa038379b>] ext4_evict_inode+0x1a7/0x2d0 [ext4]
[ 39.354775] [<ffffffff8113f390>] evict+0xa8/0x14c
[ 39.354778] [<ffffffff8113fa75>] iput+0x12d/0x136
[ 39.354783] [<ffffffff81136d5b>] do_unlinkat+0x14e/0x1f4
[ 39.354788] [<ffffffff8112bfe9>] ? ____fput+0xe/0x10
[ 39.354794] [<ffffffff8105659d>] ? task_work_run+0x87/0x98
[ 39.354798] [<ffffffff81137b98>] SyS_unlinkat+0x29/0x2b
[ 39.354802] [<ffffffff81137b98>] ? SyS_unlinkat+0x29/0x2b
[ 39.354807] [<ffffffff8137d0d2>] system_call_fastpath+0x16/0x1b
[ 39.354810] ---[ end trace 80365b8da4738adc ]---
[ 39.354814] EXT4: jbd2_journal_dirty_metadata failed: handle type 5 started at line 241, credits 91/89, errcode -30
[ 39.354817] EXT4: jbd2_journal_dirty_metadata failed: handle type 5 started at line 241, credits 91/89, errcode -30<2>[ 39.354821] EXT4-fs error (device dm-1) in ext4_free_blocks:4867: Journal has aborted
[ 39.354906] EXT4-fs error (device dm-1) in ext4_reserve_inode_write:4879: Journal has aborted
[ 39.354976] EXT4-fs error (device dm-1) in ext4_reserve_inode_write:4879: Journal has aborted
[ 39.355042] EXT4-fs error (device dm-1) in ext4_ext_remove_space:3018: Journal has aborted
[ 39.355109] EXT4-fs error (device dm-1) in ext4_ext_truncate:4666: Journal has aborted
[ 39.355179] EXT4-fs error (device dm-1) in ext4_reserve_inode_write:4879: Journal has aborted
[ 39.355248] EXT4-fs error (device dm-1) in ext4_truncate:3790: Journal has aborted
[ 39.355314] EXT4-fs error (device dm-1) in ext4_reserve_inode_write:4879: Journal has aborted
[ 39.355382] EXT4-fs error (device dm-1) in ext4_orphan_del:2684: Journal has aborted
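
For context on the first error above: the "N clusters in bitmap, M in gd" message reports a mismatch between the number of free (clear) bits found in the group's on-disk block bitmap and the free count recorded in its group descriptor. The following is only a rough Python sketch of that comparison, not the kernel code; the function names and the toy 16-cluster group are made up for illustration.

```python
def count_free_clusters(bitmap: bytes, clusters_per_group: int) -> int:
    """Count clear bits (free clusters) among the first clusters_per_group bits.

    Bit i lives in byte i // 8 at bit position i % 8; a set bit means
    "in use" and a clear bit means "free".
    """
    free = 0
    for i in range(clusters_per_group):
        if not (bitmap[i // 8] >> (i % 8)) & 1:
            free += 1
    return free


def check_group(bitmap: bytes, clusters_per_group: int, gd_free: int):
    """Return a mismatch message, or None if bitmap and descriptor agree."""
    in_bitmap = count_free_clusters(bitmap, clusters_per_group)
    if in_bitmap != gd_free:
        return (f"{in_bitmap} clusters in bitmap, {gd_free} in gd; "
                "block bitmap corrupt")
    return None


# Toy group (hypothetical data): 16 clusters, 5 of them marked in use.
toy_bitmap = bytes([0b00001111, 0b00000001])
print(check_group(toy_bitmap, 16, gd_free=11))  # counts agree
print(check_group(toy_bitmap, 16, gd_free=12))  # off by one, as in the log
```

In the log above the two counts for group 835 differ by exactly one cluster (18745 vs. 18746).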


Rebooted again and rootfs came up dirty, of course, but journal seems
sadder than expected:

[ 12.465200] EXT4-fs (dm-1): warning: mounting fs with errors, running e2fsck is recommended
[ 12.465403] EXT4-fs (dm-1): re-mounted. Opts: errors=remount-ro
[ 12.504024] systemd-journald[230]: Received request to flush runtime journal from PID 1
[ 12.506433] EXT4-fs error (device dm-1): ext4_free_inode:323: comm systemd-tmpfile: bit already cleared for inode 3801146
[ 12.506527] Aborting journal on device dm-1-8.
[ 12.506950] EXT4-fs (dm-1): Remounting filesystem read-only
[ 12.506957] EXT4-fs error (device dm-1) in ext4_evict_inode:310: IO failure
[ 12.506991] EXT4-fs error (device dm-1): mb_free_blocks:1441: group 464, block 15212940:freeing already freed block (bit 8588); block bitmap corrupt.
[ 12.507004] EXT4-fs error (device dm-1): ext4_mb_generate_buddy:756: group 464, 24180 clusters in bitmap, 24181 in gd; block bitmap corrupt.
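
The group and bit numbers in these messages can be cross-checked with simple arithmetic. The sketch below assumes 32768 blocks per group and a first data block of 0 (typical for ext4 with 4 KiB blocks; neither value is printed in the logs), and derives the inodes-per-group figure from the inode and block totals that e2fsck reports for this filesystem (29966336 inodes, 119841792 blocks). The helper names are hypothetical.

```python
BLOCKS_PER_GROUP = 32768      # assumption: default for 4 KiB blocks
TOTAL_BLOCKS = 119841792      # from the e2fsck summary line
TOTAL_INODES = 29966336       # from the e2fsck summary line

# Number of block groups, rounding up the last partial group.
GROUPS = -(-TOTAL_BLOCKS // BLOCKS_PER_GROUP)
INODES_PER_GROUP = TOTAL_INODES // GROUPS


def locate_block(block, blocks_per_group=BLOCKS_PER_GROUP, first_data_block=0):
    """Return (group, bit offset within the group's block bitmap)."""
    return divmod(block - first_data_block, blocks_per_group)


def locate_inode(ino, inodes_per_group=INODES_PER_GROUP):
    """Return (group, bit offset); inode numbers start at 1."""
    return divmod(ino - 1, inodes_per_group)


print(GROUPS, INODES_PER_GROUP)   # 3658 8192
print(locate_block(15212940))     # (464, 8588)
print(locate_inode(3801146))      # (464, 57)
```

Under those assumptions, block 15212940 maps to group 464, bit 8588, matching the mb_free_blocks message, and inode 3801146 also falls in group 464.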


fsck claims to have fixed it but on reboot it blows up the same way:

e2fsck 1.42.11 (09-Jul-2014)
/dev/mapper/t440s-root: recovering journal
/dev/mapper/t440s-root contains a file system with errors, check forced.
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Unconnected directory inode 3801092 (/tmp/???)
Connect to /lost+found<y>? yes
Unconnected directory inode 3801093 (/tmp/???)
Connect to /lost+found<y>? yes
Unconnected directory inode 3801106 (/tmp/???)
Connect to /lost+found<y>? yes
Unconnected directory inode 3801107 (/lost+found/#3801106/???)
Connect to /lost+found<y>? yes
Unconnected directory inode 3801111 (/tmp/???)
Connect to /lost+found<y>? yes
Unconnected directory inode 3801116 (/tmp/???)
Connect to /lost+found<y>? yes
Unconnected directory inode 3801118 (/tmp/???)
Connect to /lost+found<y>? yes
Pass 4: Checking reference counts
Inode 3801089 ref count is 61, should be 42. Fix<y>? yes
Inode 3801092 ref count is 3, should be 2. Fix<y>? yes
Inode 3801093 ref count is 3, should be 2. Fix<y>? yes
Unattached inode 3801099
Connect to /lost+found<y>? yes
Inode 3801099 ref count is 2, should be 1. Fix<y>? yes
Unattached inode 3801103
Connect to /lost+found<y>? yes
Inode 3801103 ref count is 2, should be 1. Fix<y>? yes
Inode 3801106 ref count is 3, should be 2. Fix<y>? yes
Inode 3801107 ref count is 3, should be 2. Fix<y>? yes
Inode 3801111 ref count is 3, should be 2. Fix<y>? yes
Unattached inode 3801112
Connect to /lost+found<y>? yes
Inode 3801112 ref count is 2, should be 1. Fix<y>? yes
Inode 3801116 ref count is 3, should be 2. Fix<y>? yes
Inode 3801118 ref count is 3, should be 2. Fix<y>? yes

Pass 5: Checking group summary information
Block bitmap differences: -(15212585--15212586) -(15212756--15212757) -15212761 -15212765 -15212883 -15212886 -(15212888--15212891) -15212905 -15212907 -15212911 -(15212923--15212924) -15212938 -15212940 -15213385 +15237175 +(27371328--27371391) +(27427126--27427191) +(27427648--27427711) +82127850
Fix<y>? yes
Free blocks count wrong for group #464 (24160, counted=24180).
Fix<y>? yes
Free blocks count wrong for group #465 (25520, counted=25827).
Fix<y>? yes
Free blocks count wrong for group #835 (18809, counted=18745).
Fix<y>? yes
Free blocks count wrong for group #837 (23154, counted=23024).
Fix<y>? yes
Free blocks count wrong for group #2506 (28536, counted=28535).
Fix<y>? yes
Free blocks count wrong for group #2842 (2415, counted=2478).
Fix<y>? yes
Free blocks count wrong for group #2844 (27816, counted=28135).
Fix<y>? yes
Free blocks count wrong (108044209, counted=108044918).
Fix<y>? yes
Inode bitmap differences: -3801122 -3801126 -(3801128--3801129) -3801134 -3801137 -(3801139--3801142) -3801146 -(3801149--3801150) -(3801152--3801154) -3801158 -3801160 -3801168 -(3801176--3801179) -(3801182--3801183) -3801186 -3801189 -3801193 -(3801199--3801200) -(3801203--3801205) -(3801208--3801211) -(3801213--3801214) -3801216 -3801220 -(3801223--3801224) -3801226 -(3801228--3801232) -(3801238--3801239) -3801738 -3801753 -3801755 -(3801758--3801759) -(3801762--3801763) -3801769 -3801792 -(3801805--3801806) -3801809 -(3801813--3801817) -3801822 -(3801826--3801828) -(3801832--3801834) -(3801836--3801837) -(3801842--3801843) -3801848 -3801853 -3801857 -(3801863--3801864) -3801871 -(3801873--3801876) -3801879 -3801881 -3801883 -3801885 -(3801888--3801889) -(3801891--3801892) -(3801896--3801897) -3801899 -(3801901--3801902) -(3801905--3801906) -(3801909--3801910) -3801912 -3801914 -(3801920--3801921) -(3801923--3801924) -3801926 -3802690 -3805907
Fix<y>? yes
Free inodes count wrong for group #464 (6581, counted=6696).
Fix<y>? yes
Directories count wrong for group #464 (366, counted=346).
Fix<y>? yes
Free inodes count wrong (29348331, counted=29348445).
Fix<y>? yes

/dev/mapper/t440s-root: ***** FILE SYSTEM WAS MODIFIED *****
/dev/mapper/t440s-root: ***** REBOOT LINUX *****
/dev/mapper/t440s-root: 617891/29966336 files (0.7% non-contiguous), 11796874/119841792 blocks
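
To see how the Pass 5 block bitmap differences line up with the per-group free-count corrections, the list can be bucketed by block group. This is a small sketch under the same assumption as before (32768 blocks per group, not shown in the output); in e2fsck's notation "-" marks a block flagged in-use in the bitmap but not actually used, and "+" a block that is used but not flagged. The function name is made up.

```python
import re
from collections import Counter

# Block bitmap differences copied from the Pass 5 output above.
DIFFS = (
    "-(15212585--15212586) -(15212756--15212757) -15212761 -15212765 "
    "-15212883 -15212886 -(15212888--15212891) -15212905 -15212907 "
    "-15212911 -(15212923--15212924) -15212938 -15212940 -15213385 "
    "+15237175 +(27371328--27371391) +(27427126--27427191) "
    "+(27427648--27427711) +82127850"
)

BLOCKS_PER_GROUP = 32768  # assumption: default for 4 KiB blocks


def per_group_delta(diffs, blocks_per_group=BLOCKS_PER_GROUP):
    """Net change in in-use blocks per group (+ = newly marked, - = cleared)."""
    delta = Counter()
    pattern = r"([+-])\(?(\d+)(?:--(\d+))?\)?"
    for sign, start, end in re.findall(pattern, diffs):
        first = int(start)
        last = int(end) if end else first
        for block in range(first, last + 1):
            delta[block // blocks_per_group] += 1 if sign == "+" else -1
    return dict(delta)


for group, change in sorted(per_group_delta(DIFFS).items()):
    print(f"group {group}: {change:+d} blocks in use")
```

For groups 464, 835, 837 and 2506 the result (-20, +64, +130, +1 blocks in use) mirrors the free-count corrections e2fsck printed (+20, -64, -130, -1 free). Groups 465, 2842 and 2844 are not explained by the bitmap differences alone.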


After fsck reports clean, reboot still shows failures:


[ 7.378361] EXT4-fs (dm-1): INFO: recovery required on readonly filesystem
[ 7.378365] EXT4-fs (dm-1): write access will be enabled during recovery
[ 7.384663] EXT4-fs (dm-1): recovery complete
[ 7.386479] EXT4-fs (dm-1): mounted filesystem with ordered data mode. Opts: (null)

[ 7.710694] EXT4-fs (dm-1): re-mounted. Opts: errors=remount-ro

[ 9.820974] EXT4-fs error (device dm-1): ext4_mb_generate_buddy:756: group 465, 29923 clusters in bitmap, 29922 in gd; block bitmap corrupt.
[ 9.820975] Aborting journal on device dm-1-8.
[ 9.821614] EXT4-fs (dm-1): Remounting filesystem read-only


Similar problems recur on every reboot.

SMART stats on the SSD do not indicate any signs of failing hardware:

Device Model: Samsung SSD 840 EVO 500GB
Serial Number: S1DHNSAD929048M
LU WWN Device Id: 5 002538 8a00452f8
Firmware Version: EXT0BB0Q
User Capacity: 500,107,862,016 bytes [500 GB]
Sector Size: 512 bytes logical/physical
Rotation Rate: Solid State Device
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: ACS-2, ATA8-ACS T13/1699-D revision 4c
SATA Version is: SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Thu Jul 31 12:36:59 2014 PDT
...
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       1693
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       165
177 Wear_Leveling_Count     0x0013   099   099   000    Pre-fail  Always       -       2
179 Used_Rsvd_Blk_Cnt_Tot   0x0013   100   100   010    Pre-fail  Always       -       0
181 Program_Fail_Cnt_Total  0x0032   100   100   010    Old_age   Always       -       0
182 Erase_Fail_Count_Total  0x0032   100   100   010    Old_age   Always       -       0
183 Runtime_Bad_Block       0x0013   100   100   010    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0032   069   053   000    Old_age   Always       -       31
195 Hardware_ECC_Recovered  0x001a   200   200   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x003e   100   100   000    Old_age   Always       -       0
235 Unknown_Attribute       0x0012   099   099   000    Old_age   Always       -       7
241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always       -       2102932957
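
As a rough sanity check on drive wear, the raw SMART values can be converted to more familiar units. The sketch assumes that Total_LBAs_Written counts 512-byte logical sectors (the sector size reported above); that interpretation is common but is not confirmed by the smartctl output itself.

```python
SECTOR_BYTES = 512                    # logical sector size from smartctl
TOTAL_LBAS_WRITTEN = 2102932957       # attribute 241 raw value
USER_CAPACITY_BYTES = 500107862016    # "User Capacity" from smartctl
POWER_ON_HOURS = 1693                 # attribute 9 raw value

# Assumption: one LBA written == one 512-byte sector written.
bytes_written = TOTAL_LBAS_WRITTEN * SECTOR_BYTES
drive_writes = bytes_written / USER_CAPACITY_BYTES
power_on_days = POWER_ON_HOURS / 24

print(f"host writes:   {bytes_written / 1e12:.2f} TB")
print(f"drive writes:  {drive_writes:.2f}x capacity")
print(f"power-on time: {power_on_days:.1f} days")
```

Under that assumption the drive has absorbed a little over 1 TB of host writes, about two full-capacity passes.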

-andy


2014-07-31 20:33:04

by Andy Isaacson

Subject: Re: ext4_mb_generate_buddy: 18745 clusters in bitmap, 18746 in gd; block bitmap corrupt

3.14.9 boots just fine after a fsck.

-andy


2014-07-31 22:53:11

by Andy Isaacson

Subject: Re: ext4_mb_generate_buddy: 18745 clusters in bitmap, 18746 in gd; block bitmap corrupt

Ran 3.14.9 long enough to pull and build 3.15.7, which then booted
successfully where 3.15.5 had failed several times in a row.

-andy

On Thu, Jul 31, 2014 at 01:33:03PM -0700, Andy Isaacson wrote:
> 3.14.9 boots just fine after a fsck.
>
> -andy
>
> On Thu, Jul 31, 2014 at 12:51:38PM -0700, Andy Isaacson wrote:
> > 3.15.5 amd64, ext4 rootfs on LVM on LUKS on Samsung SSD 840 EVO on
> > Thinkpad T440s.
> >
> > System has been quite stable for ~9 months, always running a very recent
> > stable tree.
> >
> > The kernel panicked this morning, probably due to an external drive
> > triggering UAS errors in 3.15 (but alas, the syslog didn't make it to
> > disk). The system remained powered on for >30 seconds after the panic
> > before I finally shut it down by holding down the power button, so there
> > should not have been any writes in flight to the SSD.
> >
> > After reboot, rootfs was deeply unhappy:
> >
> > [ 7.248400] EXT4-fs (dm-1): INFO: recovery required on readonly filesystem
> > [ 7.248404] EXT4-fs (dm-1): write access will be enabled during recovery
> > [ 7.303580] EXT4-fs (dm-1): orphan cleanup on readonly fs
> > [ 7.326277] EXT4-fs (dm-1): 10 orphan inodes deleted
> > [ 7.326280] EXT4-fs (dm-1): recovery complete
> > [ 7.380065] EXT4-fs (dm-1): mounted filesystem with ordered data mode. Opts: (null)
> > ...
> > [ 8.829221] EXT4-fs (dm-1): re-mounted. Opts: errors=remount-ro
> > ...
> > [ 39.354383] EXT4-fs error (device dm-1): ext4_mb_generate_buddy:756: group 835, 18745 clusters in bitmap, 18746 in gd; block bitmap corrupt.
> > [ 39.354389] Aborting journal on device dm-1-8.
> > [ 39.354478] EXT4-fs (dm-1): Remounting filesystem read-only
> > [ 39.354485] ------------[ cut here ]------------
> > [ 39.354517] WARNING: CPU: 0 PID: 2312 at fs/ext4/ext4_jbd2.c:259 __ext4_handle_dirty_metadata+0xf4/0x1a4 [ext4]()
> > [ 39.354519] Modules linked in: snd_hda_codec_realtek snd_hda_codec_hdmi snd_hda_codec_generic nls_utf8 nls_cp437 vfat fat ext2 joydev uvcvideo videobuf2_vmalloc videobuf2_memops videobuf2_core videodev arc4 media ecb btusb bluetooth 6lowpan_iphc x86_pkg_temp_thermal intel_rapl kvm_intel iwlmvm kvm mac80211 pcspkr psmouse evdev serio_raw iwlwifi snd_hda_intel snd_hda_controller cfg80211 i2c_i801 snd_hda_codec snd_hwdep snd_pcm snd_seq i915 snd_seq_device thinkpad_acpi snd_timer nvram tpm_tis rfkill battery tpm ac drm_kms_helper drm snd video acpi_cpufreq intel_gtt shpchp i2c_algo_bit intel_smartconnect i2c_core soundcore button processor loop fuse autofs4 ext4 crc16 jbd2 mbcache hid_generic usbhid hid dm_crypt dm_mod sg sd_mod crc_t10dif crct10dif_generic crct10dif_common rtsx_pci_sdmmc mmc_core ahci e1000e ptp pps_core aesni_intel libahci aes_x86_64 glue_helper libata lrw gf128mul ablk_helper cryptd scsi_mod ehci_pci ehci_hcd xhci_hcd rtsx_pci mfd_core usbcore thermal usb_common thermal_sys
> > [ 39.354598] CPU: 0 PID: 2312 Comm: systemd-tmpfile Not tainted 3.15.5 #19
> > [ 39.354600] Hardware name: LENOVO 20AQCTO1WW/20AQCTO1WW, BIOS GJET61WW (2.11 ) 10/02/2013
> > [ 39.354602] 0000000000000000 ffff880213c67b78 ffffffff81378c2a 0000000000000000
> > [ 39.354605] ffff880213c67bb0 ffffffff8103dc62 ffffffffa03a3d33 ffff8800d607eea0
> > [ 39.354608] 00000000ffffffe2 0000000000000000 ffff8800d60a3030 ffff880213c67bc0
> > [ 39.354611] Call Trace:
> > [ 39.354617] [<ffffffff81378c2a>] dump_stack+0x45/0x56
> > [ 39.354621] [<ffffffff8103dc62>] warn_slowpath_common+0x7f/0x98
> > [ 39.354643] [<ffffffffa03a3d33>] ? __ext4_handle_dirty_metadata+0xf4/0x1a4 [ext4]
> > [ 39.354648] [<ffffffff8103dd2e>] warn_slowpath_null+0x1a/0x1c
> > [ 39.354666] [<ffffffffa03a3d33>] __ext4_handle_dirty_metadata+0xf4/0x1a4 [ext4]
> > [ 39.354686] [<ffffffffa03aa380>] ext4_free_blocks+0x713/0x809 [ext4]
> > [ 39.354704] [<ffffffffa03a0639>] ext4_ext_remove_space+0x698/0xbdc [ext4]
> > [ 39.354723] [<ffffffffa03af7b1>] ? __es_remove_extent+0x46/0x27d [ext4]
> > [ 39.354741] [<ffffffffa03a246f>] ext4_ext_truncate+0x89/0xad [ext4]
> > [ 39.354756] [<ffffffffa0383024>] ext4_truncate+0x199/0x281 [ext4]
> > [ 39.354770] [<ffffffffa038379b>] ext4_evict_inode+0x1a7/0x2d0 [ext4]
> > [ 39.354775] [<ffffffff8113f390>] evict+0xa8/0x14c
> > [ 39.354778] [<ffffffff8113fa75>] iput+0x12d/0x136
> > [ 39.354783] [<ffffffff81136d5b>] do_unlinkat+0x14e/0x1f4
> > [ 39.354788] [<ffffffff8112bfe9>] ? ____fput+0xe/0x10
> > [ 39.354794] [<ffffffff8105659d>] ? task_work_run+0x87/0x98
> > [ 39.354798] [<ffffffff81137b98>] SyS_unlinkat+0x29/0x2b
> > [ 39.354802] [<ffffffff81137b98>] ? SyS_unlinkat+0x29/0x2b
> > [ 39.354807] [<ffffffff8137d0d2>] system_call_fastpath+0x16/0x1b
> > [ 39.354810] ---[ end trace 80365b8da4738adc ]---
> > [ 39.354814] EXT4: jbd2_journal_dirty_metadata failed: handle type 5 started at line 241, credits 91/89, errcode -30
> > [ 39.354817] EXT4: jbd2_journal_dirty_metadata failed: handle type 5 started at line 241, credits 91/89, errcode -30<2>[ 39.354821] EXT4-fs error (device dm-1) in ext4_free_blocks:4867: Journal has aborted
> > [ 39.354906] EXT4-fs error (device dm-1) in ext4_reserve_inode_write:4879: Journal has aborted
> > [ 39.354976] EXT4-fs error (device dm-1) in ext4_reserve_inode_write:4879: Journal has aborted
> > [ 39.355042] EXT4-fs error (device dm-1) in ext4_ext_remove_space:3018: Journal has aborted
> > [ 39.355109] EXT4-fs error (device dm-1) in ext4_ext_truncate:4666: Journal has aborted
> > [ 39.355179] EXT4-fs error (device dm-1) in ext4_reserve_inode_write:4879: Journal has aborted
> > [ 39.355248] EXT4-fs error (device dm-1) in ext4_truncate:3790: Journal has aborted
> > [ 39.355314] EXT4-fs error (device dm-1) in ext4_reserve_inode_write:4879: Journal has aborted
> > [ 39.355382] EXT4-fs error (device dm-1) in ext4_orphan_del:2684: Journal has aborted
> >
> >
> > Rebooted again and rootfs came up dirty, of course, but journal seems
> > sadder than expected:
> >
> > [ 12.465200] EXT4-fs (dm-1): warning: mounting fs with errors, running e2fsck is recommended
> > [ 12.465403] EXT4-fs (dm-1): re-mounted. Opts: errors=remount-ro
> > [ 12.504024] systemd-journald[230]: Received request to flush runtime journal from PID 1
> > [ 12.506433] EXT4-fs error (device dm-1): ext4_free_inode:323: comm systemd-tmpfile: bit already cleared for inode 3801146
> > [ 12.506527] Aborting journal on device dm-1-8.
> > [ 12.506950] EXT4-fs (dm-1): Remounting filesystem read-only
> > [ 12.506957] EXT4-fs error (device dm-1) in ext4_evict_inode:310: IO failure
> > [ 12.506991] EXT4-fs error (device dm-1): mb_free_blocks:1441: group 464, block 15212940:freeing already freed block (bit 8588); block bitmap corrupt.
> > [ 12.507004] EXT4-fs error (device dm-1): ext4_mb_generate_buddy:756: group 464, 24180 clusters in bitmap, 24181 in gd; block bitmap corrupt.
> >
> >
> > fsck claims to have fixed it but on reboot it blows up the same way:
> >
> > e2fsck 1.42.11 (09-Jul-2014)
> > /dev/mapper/t440s-root: recovering journal
> > /dev/mapper/t440s-root contains a file system with errors, check forced.
> > Pass 1: Checking inodes, blocks, and sizes
> > Pass 2: Checking directory structure
> > Pass 3: Checking directory connectivity
> > Unconnected directory inode 3801092 (/tmp/???)
> > Connect to /lost+found<y>? yes
> > Unconnected directory inode 3801093 (/tmp/???)
> > Connect to /lost+found<y>? yes
> > Unconnected directory inode 3801106 (/tmp/???)
> > Connect to /lost+found<y>? yes
> > Unconnected directory inode 3801107 (/lost+found/#3801106/???)
> > Connect to /lost+found<y>? yes
> > Unconnected directory inode 3801111 (/tmp/???)
> > Connect to /lost+found<y>? yes
> > Unconnected directory inode 3801116 (/tmp/???)
> > Connect to /lost+found<y>? yes
> > Unconnected directory inode 3801118 (/tmp/???)
> > Connect to /lost+found<y>? yes
> > Pass 4: Checking reference counts
> > Inode 3801089 ref count is 61, should be 42. Fix<y>? yes
> > Inode 3801092 ref count is 3, should be 2. Fix<y>? yes
> > Inode 3801093 ref count is 3, should be 2. Fix<y>? yes
> > Unattached inode 3801099
> > Connect to /lost+found<y>? yes
> > Inode 3801099 ref count is 2, should be 1. Fix<y>? yes
> > Unattached inode 3801103
> > Connect to /lost+found<y>? yes
> > Inode 3801103 ref count is 2, should be 1. Fix<y>? yes
> > Inode 3801106 ref count is 3, should be 2. Fix<y>? yes
> > Inode 3801107 ref count is 3, should be 2. Fix<y>? yes
> > Inode 3801111 ref count is 3, should be 2. Fix<y>? yes
> > Unattached inode 3801112
> > Connect to /lost+found<y>? yes
> > Inode 3801112 ref count is 2, should be 1. Fix<y>? yes
> > Inode 3801116 ref count is 3, should be 2. Fix<y>? yes
> > Inode 3801118 ref count is 3, should be 2. Fix<y>? yes
> >
> > Pass 5: Checking group summary information
> > Block bitmap differences: -(15212585--15212586) -(15212756--15212757) -15212761 -15212765 -15212883 -15212886 -(15212888--15212891) -15212905 -15212907 -15212911 -(15212923--15212924) -15212938 -15212940 -15213385 +15237175 +(27371328--27371391) +(27427126--27427191) +(27427648--27427711) +82127850
> > Fix<y>? yes
> > Free blocks count wrong for group #464 (24160, counted=24180).
> > Fix<y>? yes
> > Free blocks count wrong for group #465 (25520, counted=25827).
> > Fix<y>? yes
> > Free blocks count wrong for group #835 (18809, counted=18745).
> > Fix<y>? yes
> > Free blocks count wrong for group #837 (23154, counted=23024).
> > Fix<y>? yes
> > Free blocks count wrong for group #2506 (28536, counted=28535).
> > Fix<y>? yes
> > Free blocks count wrong for group #2842 (2415, counted=2478).
> > Fix<y>? yes
> > Free blocks count wrong for group #2844 (27816, counted=28135).
> > Fix<y>? yes
> > Free blocks count wrong (108044209, counted=108044918).
> > Fix<y>? yes
> > Inode bitmap differences: -3801122 -3801126 -(3801128--3801129) -3801134 -3801137 -(3801139--3801142) -3801146 -(3801149--3801150) -(3801152--3801154) -3801158 -3801160 -3801168 -(3801176--3801179) -(3801182--3801183) -3801186 -3801189 -3801193 -(3801199--3801200) -(3801203--3801205) -(3801208--3801211) -(3801213--3801214) -3801216 -3801220 -(3801223--3801224) -3801226 -(3801228--3801232) -(3801238--3801239) -3801738 -3801753 -3801755 -(3801758--3801759) -(3801762--3801763) -3801769 -3801792 -(3801805--3801806) -3801809 -(3801813--3801817) -3801822 -(3801826--3801828) -(3801832--3801834) -(3801836--3801837) -(3801842--3801843) -3801848 -3801853 -3801857 -(3801863--3801864) -3801871 -(3801873--3801876) -3801879 -3801881 -3801883 -3801885 -(3801888--3801889) -(3801891--3801892) -(3801896--3801897) -3801899 -(3801901--3801902) -(3801905--3801906) -(3801909--3801910) -3801912 -3801914 -(3801920--3801921) -(3801923--3801924) -3801926 -3802690 -3805907
> > Fix<y>? yes
> > Free inodes count wrong for group #464 (6581, counted=6696).
> > Fix<y>? yes
> > Directories count wrong for group #464 (366, counted=346).
> > Fix<y>? yes
> > Free inodes count wrong (29348331, counted=29348445).
> > Fix<y>? yes
> >
> > /dev/mapper/t440s-root: ***** FILE SYSTEM WAS MODIFIED *****
> > /dev/mapper/t440s-root: ***** REBOOT LINUX *****
> > /dev/mapper/t440s-root: 617891/29966336 files (0.7% non-contiguous), 11796874/119841792 blocks
> >
> >
> > After fsck reports clean, reboot still shows failures:
> >
> >
> > [ 7.378361] EXT4-fs (dm-1): INFO: recovery required on readonly filesystem
> > [ 7.378365] EXT4-fs (dm-1): write access will be enabled during recovery
> > [ 7.384663] EXT4-fs (dm-1): recovery complete
> > [ 7.386479] EXT4-fs (dm-1): mounted filesystem with ordered data mode. Opts: (null)
> >
> > [ 7.710694] EXT4-fs (dm-1): re-mounted. Opts: errors=remount-ro
> >
> > [ 9.820974] EXT4-fs error (device dm-1): ext4_mb_generate_buddy:756: group 465, 29923 clusters in bitmap, 29922 in gd; block bitmap corrupt.
> > [ 9.820975] Aborting journal on device dm-1-8.
> > [ 9.821614] EXT4-fs (dm-1): Remounting filesystem read-only
> >
> >
> > Similar problems recur on every reboot.
> >
> > SMART stats on the SSD do not indicate any signs of failing hardware:
> >
> > Device Model: Samsung SSD 840 EVO 500GB
> > Serial Number: S1DHNSAD929048M
> > LU WWN Device Id: 5 002538 8a00452f8
> > Firmware Version: EXT0BB0Q
> > User Capacity: 500,107,862,016 bytes [500 GB]
> > Sector Size: 512 bytes logical/physical
> > Rotation Rate: Solid State Device
> > Device is: Not in smartctl database [for details use: -P showall]
> > ATA Version is: ACS-2, ATA8-ACS T13/1699-D revision 4c
> > SATA Version is: SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
> > Local Time is: Thu Jul 31 12:36:59 2014 PDT
> > ...
> > ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
> > 5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
> > 9 Power_On_Hours 0x0032 099 099 000 Old_age Always - 1693
> > 12 Power_Cycle_Count 0x0032 099 099 000 Old_age Always - 165
> > 177 Wear_Leveling_Count 0x0013 099 099 000 Pre-fail Always - 2
> > 179 Used_Rsvd_Blk_Cnt_Tot 0x0013 100 100 010 Pre-fail Always - 0
> > 181 Program_Fail_Cnt_Total 0x0032 100 100 010 Old_age Always - 0
> > 182 Erase_Fail_Count_Total 0x0032 100 100 010 Old_age Always - 0
> > 183 Runtime_Bad_Block 0x0013 100 100 010 Pre-fail Always - 0
> > 187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
> > 190 Airflow_Temperature_Cel 0x0032 069 053 000 Old_age Always - 31
> > 195 Hardware_ECC_Recovered 0x001a 200 200 000 Old_age Always - 0
> > 199 UDMA_CRC_Error_Count 0x003e 100 100 000 Old_age Always - 0
> > 235 Unknown_Attribute 0x0012 099 099 000 Old_age Always - 7
> > 241 Total_LBAs_Written 0x0032 099 099 000 Old_age Always - 2102932957
> >
> > -andy
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> > the body of a message to [email protected]
> > More majordomo info at http://vger.kernel.org/majordomo-info.html

2014-07-31 23:30:11

by Eric Whitney

Subject: Re: ext4_mb_generate_buddy: 18745 clusters in bitmap, 18746 in gd; block bitmap corrupt

It's likely your problem was fixed by a commit in 3.15.6. The symptoms you
describe are very familiar:

f9ae9cf5d7 - ext4: revert commit which was causing fs corruption after journal
replays

Eric
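
To check whether a saved dmesg shows the same symptom, here is a minimal sketch that pulls the ext4_mb_generate_buddy mismatch lines out of a log and prints the bitmap and group descriptor counts side by side. The regular expression is only modelled on the messages quoted in this thread, and the helper name find_mismatches is hypothetical; other kernel versions may word the message differently.

```python
import re

# Pattern modelled on the ext4_mb_generate_buddy messages quoted in this
# thread (an assumption: other kernels may format the line differently).
MISMATCH_RE = re.compile(
    r"EXT4-fs error \(device (?P<dev>[^)]+)\): ext4_mb_generate_buddy:\d+: "
    r"group (?P<group>\d+), (?P<bitmap>\d+) clusters in bitmap, "
    r"(?P<gd>\d+) in gd"
)


def find_mismatches(log_text):
    """Return (device, group, bitmap_count, gd_count) for each mismatch line."""
    results = []
    for m in MISMATCH_RE.finditer(log_text):
        results.append(
            (m.group("dev"), int(m.group("group")),
             int(m.group("bitmap")), int(m.group("gd")))
        )
    return results


# Two lines copied from the logs in this thread.
sample = """\
[ 39.354383] EXT4-fs error (device dm-1): ext4_mb_generate_buddy:756: group 835, 18745 clusters in bitmap, 18746 in gd; block bitmap corrupt.
[ 9.820974] EXT4-fs error (device dm-1): ext4_mb_generate_buddy:756: group 465, 29923 clusters in bitmap, 29922 in gd; block bitmap corrupt.
"""

for dev, group, bitmap, gd in find_mismatches(sample):
    # diff is the bitmap count minus the group descriptor count
    print(f"{dev} group {group}: bitmap={bitmap} gd={gd} diff={bitmap - gd}")
```

In both sample lines the two counts differ by a single cluster, which is the off-by-one pattern reported above.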


* Andy Isaacson <[email protected]>:
> Ran with 3.14.9 long enough to pull and build, then 3.15.7 booted
> successfully where 3.15.5 had failed several times in a row.
>
> -andy
>
> On Thu, Jul 31, 2014 at 01:33:03PM -0700, Andy Isaacson wrote:
> > 3.14.9 boots just fine after a fsck.
> >
> > -andy