From: Andy Isaacson
Subject: ext4_mb_generate_buddy: 18745 clusters in bitmap, 18746 in gd; block bitmap corrupt
Date: Thu, 31 Jul 2014 12:51:38 -0700
Message-ID: <20140731195138.GA22842@hexapodia.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 8BIT
To: Ext4 Developers List
Return-path:
Received: from straum.hexapodia.org ([192.235.78.53]:43825 "EHLO straum.hexapodia.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750836AbaGaT5l convert rfc822-to-8bit (ORCPT ); Thu, 31 Jul 2014 15:57:41 -0400
Content-Disposition: inline
Sender: linux-ext4-owner@vger.kernel.org
List-ID:

3.15.5 amd64, ext4 rootfs on LVM on LUKS on Samsung SSD 840 EVO on Thinkpad T440s. System has been quite stable for ~9 months, always running a very recent stable tree.

The kernel panicked this morning, probably due to an external drive triggering UAS errors in 3.15 (but the syslog didn't make it to disk, alas). The system remained powered on for >30 seconds after the panic; finally I shut it down by holding down the power button. So there should not have been any writes in flight to the SSD.

After reboot, rootfs was deeply unhappy:

[ 7.248400] EXT4-fs (dm-1): INFO: recovery required on readonly filesystem
[ 7.248404] EXT4-fs (dm-1): write access will be enabled during recovery
[ 7.303580] EXT4-fs (dm-1): orphan cleanup on readonly fs
[ 7.326277] EXT4-fs (dm-1): 10 orphan inodes deleted
[ 7.326280] EXT4-fs (dm-1): recovery complete
[ 7.380065] EXT4-fs (dm-1): mounted filesystem with ordered data mode. Opts: (null)
...
[ 8.829221] EXT4-fs (dm-1): re-mounted. Opts: errors=remount-ro
...
[ 39.354383] EXT4-fs error (device dm-1): ext4_mb_generate_buddy:756: group 835, 18745 clusters in bitmap, 18746 in gd; block bitmap corrupt.
[ 39.354389] Aborting journal on device dm-1-8.
[ 39.354478] EXT4-fs (dm-1): Remounting filesystem read-only
[ 39.354485] ------------[ cut here ]------------
[ 39.354517] WARNING: CPU: 0 PID: 2312 at fs/ext4/ext4_jbd2.c:259 __ext4_handle_dirty_metadata+0xf4/0x1a4 [ext4]()
[ 39.354519] Modules linked in: snd_hda_codec_realtek snd_hda_codec_hdmi snd_hda_codec_generic nls_utf8 nls_cp437 vfat fat ext2 joydev uvcvideo videobuf2_vmalloc videobuf2_memops videobuf2_core videodev arc4 media ecb btusb bluetooth 6lowpan_iphc x86_pkg_temp_thermal intel_rapl kvm_intel iwlmvm kvm mac80211 pcspkr psmouse evdev serio_raw iwlwifi snd_hda_intel snd_hda_controller cfg80211 i2c_i801 snd_hda_codec snd_hwdep snd_pcm snd_seq i915 snd_seq_device thinkpad_acpi snd_timer nvram tpm_tis rfkill battery tpm ac drm_kms_helper drm snd video acpi_cpufreq intel_gtt shpchp i2c_algo_bit intel_smartconnect i2c_core soundcore button processor loop fuse autofs4 ext4 crc16 jbd2 mbcache hid_generic usbhid hid dm_crypt dm_mod sg sd_mod crc_t10dif crct10dif_generic crct10dif_common rtsx_pci_sdmmc mmc_core ahci e1000e ptp pps_core aesni_intel libahci aes_x86_64 glue_helper libata lrw gf128mul ablk_helper cryptd scsi_mod ehci_pci ehci_hcd xhci_hcd rtsx_pci mfd_core usbcore thermal usb_common thermal_sys
[ 39.354598] CPU: 0 PID: 2312 Comm: systemd-tmpfile Not tainted 3.15.5 #19
[ 39.354600] Hardware name: LENOVO 20AQCTO1WW/20AQCTO1WW, BIOS GJET61WW (2.11 ) 10/02/2013
[ 39.354602] 0000000000000000 ffff880213c67b78 ffffffff81378c2a 0000000000000000
[ 39.354605] ffff880213c67bb0 ffffffff8103dc62 ffffffffa03a3d33 ffff8800d607eea0
[ 39.354608] 00000000ffffffe2 0000000000000000 ffff8800d60a3030 ffff880213c67bc0
[ 39.354611] Call Trace:
[ 39.354617] [] dump_stack+0x45/0x56
[ 39.354621] [] warn_slowpath_common+0x7f/0x98
[ 39.354643] [] ? __ext4_handle_dirty_metadata+0xf4/0x1a4 [ext4]
[ 39.354648] [] warn_slowpath_null+0x1a/0x1c
[ 39.354666] [] __ext4_handle_dirty_metadata+0xf4/0x1a4 [ext4]
[ 39.354686] [] ext4_free_blocks+0x713/0x809 [ext4]
[ 39.354704] [] ext4_ext_remove_space+0x698/0xbdc [ext4]
[ 39.354723] [] ? __es_remove_extent+0x46/0x27d [ext4]
[ 39.354741] [] ext4_ext_truncate+0x89/0xad [ext4]
[ 39.354756] [] ext4_truncate+0x199/0x281 [ext4]
[ 39.354770] [] ext4_evict_inode+0x1a7/0x2d0 [ext4]
[ 39.354775] [] evict+0xa8/0x14c
[ 39.354778] [] iput+0x12d/0x136
[ 39.354783] [] do_unlinkat+0x14e/0x1f4
[ 39.354788] [] ? ____fput+0xe/0x10
[ 39.354794] [] ? task_work_run+0x87/0x98
[ 39.354798] [] SyS_unlinkat+0x29/0x2b
[ 39.354802] [] ? SyS_unlinkat+0x29/0x2b
[ 39.354807] [] system_call_fastpath+0x16/0x1b
[ 39.354810] ---[ end trace 80365b8da4738adc ]---
[ 39.354814] EXT4: jbd2_journal_dirty_metadata failed: handle type 5 started at line 241, credits 91/89, errcode -30
[ 39.354817] EXT4: jbd2_journal_dirty_metadata failed: handle type 5 started at line 241, credits 91/89, errcode -30<2>[ 39.354821] EXT4-fs error (device dm-1) in ext4_free_blocks:4867: Journal has aborted
[ 39.354906] EXT4-fs error (device dm-1) in ext4_reserve_inode_write:4879: Journal has aborted
[ 39.354976] EXT4-fs error (device dm-1) in ext4_reserve_inode_write:4879: Journal has aborted
[ 39.355042] EXT4-fs error (device dm-1) in ext4_ext_remove_space:3018: Journal has aborted
[ 39.355109] EXT4-fs error (device dm-1) in ext4_ext_truncate:4666: Journal has aborted
[ 39.355179] EXT4-fs error (device dm-1) in ext4_reserve_inode_write:4879: Journal has aborted
[ 39.355248] EXT4-fs error (device dm-1) in ext4_truncate:3790: Journal has aborted
[ 39.355314] EXT4-fs error (device dm-1) in ext4_reserve_inode_write:4879: Journal has aborted
[ 39.355382] EXT4-fs error (device dm-1) in ext4_orphan_del:2684: Journal has aborted

Rebooted again and rootfs came up dirty, of course, but journal seems sadder than expected:

[ 12.465200] EXT4-fs (dm-1): warning: mounting fs with errors, running e2fsck is recommended
[ 12.465403] EXT4-fs (dm-1): re-mounted. Opts: errors=remount-ro
[ 12.504024] systemd-journald[230]: Received request to flush runtime journal from PID 1
[ 12.506433] EXT4-fs error (device dm-1): ext4_free_inode:323: comm systemd-tmpfile: bit already cleared for inode 3801146
[ 12.506527] Aborting journal on device dm-1-8.
[ 12.506950] EXT4-fs (dm-1): Remounting filesystem read-only
[ 12.506957] EXT4-fs error (device dm-1) in ext4_evict_inode:310: IO failure
[ 12.506991] EXT4-fs error (device dm-1): mb_free_blocks:1441: group 464, block 15212940:freeing already freed block (bit 8588); block bitmap corrupt.
[ 12.507004] EXT4-fs error (device dm-1): ext4_mb_generate_buddy:756: group 464, 24180 clusters in bitmap, 24181 in gd; block bitmap corrupt.

fsck claims to have fixed it but on reboot it blows up the same way:

e2fsck 1.42.11 (09-Jul-2014)
/dev/mapper/t440s-root: recovering journal
/dev/mapper/t440s-root contains a file system with errors, check forced.
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Unconnected directory inode 3801092 (/tmp/???)
Connect to /lost+found? yes
Unconnected directory inode 3801093 (/tmp/???)
Connect to /lost+found? yes
Unconnected directory inode 3801106 (/tmp/???)
Connect to /lost+found? yes
Unconnected directory inode 3801107 (/lost+found/#3801106/???)
Connect to /lost+found? yes
Unconnected directory inode 3801111 (/tmp/???)
Connect to /lost+found? yes
Unconnected directory inode 3801116 (/tmp/???)
Connect to /lost+found? yes
Unconnected directory inode 3801118 (/tmp/???)
Connect to /lost+found? yes
Pass 4: Checking reference counts
Inode 3801089 ref count is 61, should be 42. Fix? yes
Inode 3801092 ref count is 3, should be 2. Fix? yes
Inode 3801093 ref count is 3, should be 2. Fix? yes
Unattached inode 3801099
Connect to /lost+found? yes
Inode 3801099 ref count is 2, should be 1. Fix? yes
Unattached inode 3801103
Connect to /lost+found? yes
Inode 3801103 ref count is 2, should be 1. Fix? yes
Inode 3801106 ref count is 3, should be 2. Fix? yes
Inode 3801107 ref count is 3, should be 2. Fix? yes
Inode 3801111 ref count is 3, should be 2. Fix? yes
Unattached inode 3801112
Connect to /lost+found? yes
Inode 3801112 ref count is 2, should be 1. Fix? yes
Inode 3801116 ref count is 3, should be 2. Fix? yes
Inode 3801118 ref count is 3, should be 2. Fix? yes
Pass 5: Checking group summary information
Block bitmap differences: -(15212585--15212586) -(15212756--15212757) -15212761 -15212765 -15212883 -15212886 -(15212888--15212891) -15212905 -15212907 -15212911 -(15212923--15212924) -15212938 -15212940 -15213385 +15237175 +(27371328--27371391) +(27427126--27427191) +(27427648--27427711) +82127850
Fix? yes
Free blocks count wrong for group #464 (24160, counted=24180).
Fix? yes
Free blocks count wrong for group #465 (25520, counted=25827).
Fix? yes
Free blocks count wrong for group #835 (18809, counted=18745).
Fix? yes
Free blocks count wrong for group #837 (23154, counted=23024).
Fix? yes
Free blocks count wrong for group #2506 (28536, counted=28535).
Fix? yes
Free blocks count wrong for group #2842 (2415, counted=2478).
Fix? yes
Free blocks count wrong for group #2844 (27816, counted=28135).
Fix? yes
Free blocks count wrong (108044209, counted=108044918).
Fix? yes
Inode bitmap differences: -3801122 -3801126 -(3801128--3801129) -3801134 -3801137 -(3801139--3801142) -3801146 -(3801149--3801150) -(3801152--3801154) -3801158 -3801160 -3801168 -(3801176--3801179) -(3801182--3801183) -3801186 -3801189 -3801193 -(3801199--3801200) -(3801203--3801205) -(3801208--3801211) -(3801213--3801214) -3801216 -3801220 -(3801223--3801224) -3801226 -(3801228--3801232) -(3801238--3801239) -3801738 -3801753 -3801755 -(3801758--3801759) -(3801762--3801763) -3801769 -3801792 -(3801805--3801806) -3801809 -(3801813--3801817) -3801822 -(3801826--3801828) -(3801832--3801834) -(3801836--3801837) -(3801842--3801843) -3801848 -3801853 -3801857 -(3801863--3801864) -3801871 -(3801873--3801876) -3801879 -3801881 -3801883 -3801885 -(3801888--3801889) -(3801891--3801892) -(3801896--3801897) -3801899 -(3801901--3801902) -(3801905--3801906) -(3801909--3801910) -3801912 -3801914 -(3801920--3801921) -(3801923--3801924) -3801926 -3802690 -3805907
Fix? yes
Free inodes count wrong for group #464 (6581, counted=6696).
Fix? yes
Directories count wrong for group #464 (366, counted=346).
Fix? yes
Free inodes count wrong (29348331, counted=29348445).
Fix? yes

/dev/mapper/t440s-root: ***** FILE SYSTEM WAS MODIFIED *****
/dev/mapper/t440s-root: ***** REBOOT LINUX *****
/dev/mapper/t440s-root: 617891/29966336 files (0.7% non-contiguous), 11796874/119841792 blocks

After fsck reports clean, reboot still shows failures:

[ 7.378361] EXT4-fs (dm-1): INFO: recovery required on readonly filesystem
[ 7.378365] EXT4-fs (dm-1): write access will be enabled during recovery
[ 7.384663] EXT4-fs (dm-1): recovery complete
[ 7.386479] EXT4-fs (dm-1): mounted filesystem with ordered data mode. Opts: (null)
[ 7.710694] EXT4-fs (dm-1): re-mounted. Opts: errors=remount-ro
[ 9.820974] EXT4-fs error (device dm-1): ext4_mb_generate_buddy:756: group 465, 29923 clusters in bitmap, 29922 in gd; block bitmap corrupt.
[ 9.820975] Aborting journal on device dm-1-8.
[ 9.821614] EXT4-fs (dm-1): Remounting filesystem read-only

Similar problems recur on every reboot.

SMART stats on the SSD do not indicate any signs of failing hardware:

Device Model:     Samsung SSD 840 EVO 500GB
Serial Number:    S1DHNSAD929048M
LU WWN Device Id: 5 002538 8a00452f8
Firmware Version: EXT0BB0Q
User Capacity:    500,107,862,016 bytes [500 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-2, ATA8-ACS T13/1699-D revision 4c
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Thu Jul 31 12:36:59 2014 PDT
...
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       1693
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       165
177 Wear_Leveling_Count     0x0013   099   099   000    Pre-fail  Always       -       2
179 Used_Rsvd_Blk_Cnt_Tot   0x0013   100   100   010    Pre-fail  Always       -       0
181 Program_Fail_Cnt_Total  0x0032   100   100   010    Old_age   Always       -       0
182 Erase_Fail_Count_Total  0x0032   100   100   010    Old_age   Always       -       0
183 Runtime_Bad_Block       0x0013   100   100   010    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0032   069   053   000    Old_age   Always       -       31
195 Hardware_ECC_Recovered  0x001a   200   200   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x003e   100   100   000    Old_age   Always       -       0
235 Unknown_Attribute       0x0012   099   099   000    Old_age   Always       -       7
241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always       -       2102932957

-andy
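
For anyone cross-checking the group numbers in the kernel and e2fsck messages above, here is a minimal sketch of the arithmetic. It is not part of the original report. It assumes 32768 blocks per group (the usual value for 4 KiB blocks), a first data block of 0, and 8192 inodes per group; the last figure is inferred from the e2fsck summary (29966336 inodes over 3658 groups) rather than read from the superblock, so treat all three constants as assumptions. The helper names are made up for illustration.

```python
# Assumed geometry (not stated in the report; confirm with dumpe2fs -h):
BLOCKS_PER_GROUP = 32768   # 4 KiB blocks, one bitmap block per group
INODES_PER_GROUP = 8192    # 29966336 inodes / 3658 groups
FIRST_DATA_BLOCK = 0

def block_location(block):
    """Return (group, bit-in-group-bitmap) for an absolute block number."""
    return divmod(block - FIRST_DATA_BLOCK, BLOCKS_PER_GROUP)

def inode_group(ino):
    """Return the block group holding an inode (inode numbers start at 1)."""
    return (ino - 1) // INODES_PER_GROUP

# Block from the mb_free_blocks message: "group 464, block 15212940 ... (bit 8588)"
print(block_location(15212940))   # (464, 8588)

# Inode from the ext4_free_inode message
print(inode_group(3801146))       # 464

# A few blocks from the e2fsck "Block bitmap differences" list
for blk in (15237175, 27371328, 27427126, 82127850):
    print(blk, "-> group", block_location(blk)[0])
```

Under those assumptions the block and the inode named in the second boot's errors both land in group 464, and the sampled e2fsck bitmap differences fall in groups 465, 835, 837 and 2506, which matches the groups whose free-block counts e2fsck corrected.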