From: Andy Isaacson
Subject: Re: ext4_mb_generate_buddy: 18745 clusters in bitmap, 18746 in gd; block bitmap corrupt
Date: Thu, 31 Jul 2014 13:33:03 -0700
Message-ID: <20140731203303.GB22842@hexapodia.org>
References: <20140731195138.GA22842@hexapodia.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 8BIT
To: Ext4 Developers List
Return-path:
Received: from straum.hexapodia.org ([192.235.78.53]:43841 "EHLO straum.hexapodia.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751280AbaGaUdE convert rfc822-to-8bit (ORCPT ); Thu, 31 Jul 2014 16:33:04 -0400
Content-Disposition: inline
In-Reply-To: <20140731195138.GA22842@hexapodia.org>
Sender: linux-ext4-owner@vger.kernel.org
List-ID:

3.14.9 boots just fine after a fsck.

-andy

On Thu, Jul 31, 2014 at 12:51:38PM -0700, Andy Isaacson wrote:
> 3.15.5 amd64, ext4 rootfs on LVM on LUKS on Samsung SSD 840 EVO on
> Thinkpad T440s.
>
> System has been quite stable for ~9 months, always running a very recent
> stable tree.
>
> kernel panicked this morning probably due to an external drive
> triggering UAS errors in 3.15 (but the syslog didn't make it to disk
> alas). The system remained powered on for >30 seconds after the panic,
> finally I shut down by holding down the power button. So there should
> not have been any writes in flight to the SSD.
>
> After reboot, rootfs was deeply unhappy:
>
> [ 7.248400] EXT4-fs (dm-1): INFO: recovery required on readonly filesystem
> [ 7.248404] EXT4-fs (dm-1): write access will be enabled during recovery
> [ 7.303580] EXT4-fs (dm-1): orphan cleanup on readonly fs
> [ 7.326277] EXT4-fs (dm-1): 10 orphan inodes deleted
> [ 7.326280] EXT4-fs (dm-1): recovery complete
> [ 7.380065] EXT4-fs (dm-1): mounted filesystem with ordered data mode. Opts: (null)
> ...
> [ 8.829221] EXT4-fs (dm-1): re-mounted. Opts: errors=remount-ro
> ...
> [ 39.354383] EXT4-fs error (device dm-1): ext4_mb_generate_buddy:756: group 835, 18745 clusters in bitmap, 18746 in gd; block bitmap corrupt.
> [ 39.354389] Aborting journal on device dm-1-8.
> [ 39.354478] EXT4-fs (dm-1): Remounting filesystem read-only
> [ 39.354485] ------------[ cut here ]------------
> [ 39.354517] WARNING: CPU: 0 PID: 2312 at fs/ext4/ext4_jbd2.c:259 __ext4_handle_dirty_metadata+0xf4/0x1a4 [ext4]()
> [ 39.354519] Modules linked in: snd_hda_codec_realtek snd_hda_codec_hdmi snd_hda_codec_generic nls_utf8 nls_cp437 vfat fat ext2 joydev uvcvideo videobuf2_vmalloc videobuf2_memops videobuf2_core videodev arc4 media ecb btusb bluetooth 6lowpan_iphc x86_pkg_temp_thermal intel_rapl kvm_intel iwlmvm kvm mac80211 pcspkr psmouse evdev serio_raw iwlwifi snd_hda_intel snd_hda_controller cfg80211 i2c_i801 snd_hda_codec snd_hwdep snd_pcm snd_seq i915 snd_seq_device thinkpad_acpi snd_timer nvram tpm_tis rfkill battery tpm ac drm_kms_helper drm snd video acpi_cpufreq intel_gtt shpchp i2c_algo_bit intel_smartconnect i2c_core soundcore button processor loop fuse autofs4 ext4 crc16 jbd2 mbcache hid_generic usbhid hid dm_crypt dm_mod sg sd_mod crc_t10dif crct10dif_generic crct10dif_common rtsx_pci_sdmmc mmc_core ahci e1000e ptp pps_core aesni_intel libahci aes_x86_64 glue_helper libata lrw gf128mul ablk_helper cryptd scsi_mod ehci_pci ehci_hcd xhci_hcd rtsx_pci mfd_core usbcore thermal usb_common thermal_sys
> [ 39.354598] CPU: 0 PID: 2312 Comm: systemd-tmpfile Not tainted 3.15.5 #19
> [ 39.354600] Hardware name: LENOVO 20AQCTO1WW/20AQCTO1WW, BIOS GJET61WW (2.11 ) 10/02/2013
> [ 39.354602] 0000000000000000 ffff880213c67b78 ffffffff81378c2a 0000000000000000
> [ 39.354605] ffff880213c67bb0 ffffffff8103dc62 ffffffffa03a3d33 ffff8800d607eea0
> [ 39.354608] 00000000ffffffe2 0000000000000000 ffff8800d60a3030 ffff880213c67bc0
> [ 39.354611] Call Trace:
> [ 39.354617] [] dump_stack+0x45/0x56
> [ 39.354621] [] warn_slowpath_common+0x7f/0x98
> [ 39.354643] [] ? __ext4_handle_dirty_metadata+0xf4/0x1a4 [ext4]
> [ 39.354648] [] warn_slowpath_null+0x1a/0x1c
> [ 39.354666] [] __ext4_handle_dirty_metadata+0xf4/0x1a4 [ext4]
> [ 39.354686] [] ext4_free_blocks+0x713/0x809 [ext4]
> [ 39.354704] [] ext4_ext_remove_space+0x698/0xbdc [ext4]
> [ 39.354723] [] ? __es_remove_extent+0x46/0x27d [ext4]
> [ 39.354741] [] ext4_ext_truncate+0x89/0xad [ext4]
> [ 39.354756] [] ext4_truncate+0x199/0x281 [ext4]
> [ 39.354770] [] ext4_evict_inode+0x1a7/0x2d0 [ext4]
> [ 39.354775] [] evict+0xa8/0x14c
> [ 39.354778] [] iput+0x12d/0x136
> [ 39.354783] [] do_unlinkat+0x14e/0x1f4
> [ 39.354788] [] ? ____fput+0xe/0x10
> [ 39.354794] [] ? task_work_run+0x87/0x98
> [ 39.354798] [] SyS_unlinkat+0x29/0x2b
> [ 39.354802] [] ? SyS_unlinkat+0x29/0x2b
> [ 39.354807] [] system_call_fastpath+0x16/0x1b
> [ 39.354810] ---[ end trace 80365b8da4738adc ]---
> [ 39.354814] EXT4: jbd2_journal_dirty_metadata failed: handle type 5 started at line 241, credits 91/89, errcode -30
> [ 39.354817] EXT4: jbd2_journal_dirty_metadata failed: handle type 5 started at line 241, credits 91/89, errcode -30<2>[ 39.354821] EXT4-fs error (device dm-1) in ext4_free_blocks:4867: Journal has aborted
> [ 39.354906] EXT4-fs error (device dm-1) in ext4_reserve_inode_write:4879: Journal has aborted
> [ 39.354976] EXT4-fs error (device dm-1) in ext4_reserve_inode_write:4879: Journal has aborted
> [ 39.355042] EXT4-fs error (device dm-1) in ext4_ext_remove_space:3018: Journal has aborted
> [ 39.355109] EXT4-fs error (device dm-1) in ext4_ext_truncate:4666: Journal has aborted
> [ 39.355179] EXT4-fs error (device dm-1) in ext4_reserve_inode_write:4879: Journal has aborted
> [ 39.355248] EXT4-fs error (device dm-1) in ext4_truncate:3790: Journal has aborted
> [ 39.355314] EXT4-fs error (device dm-1) in ext4_reserve_inode_write:4879: Journal has aborted
> [ 39.355382] EXT4-fs error (device dm-1) in ext4_orphan_del:2684: Journal has aborted
>
>
> Rebooted again and rootfs came up dirty, of course, but journal seems
> sadder than expected:
>
> [ 12.465200] EXT4-fs (dm-1): warning: mounting fs with errors, running e2fsck is recommended
> [ 12.465403] EXT4-fs (dm-1): re-mounted. Opts: errors=remount-ro
> [ 12.504024] systemd-journald[230]: Received request to flush runtime journal from PID 1
> [ 12.506433] EXT4-fs error (device dm-1): ext4_free_inode:323: comm systemd-tmpfile: bit already cleared for inode 3801146
> [ 12.506527] Aborting journal on device dm-1-8.
> [ 12.506950] EXT4-fs (dm-1): Remounting filesystem read-only
> [ 12.506957] EXT4-fs error (device dm-1) in ext4_evict_inode:310: IO failure
> [ 12.506991] EXT4-fs error (device dm-1): mb_free_blocks:1441: group 464, block 15212940:freeing already freed block (bit 8588); block bitmap corrupt.
> [ 12.507004] EXT4-fs error (device dm-1): ext4_mb_generate_buddy:756: group 464, 24180 clusters in bitmap, 24181 in gd; block bitmap corrupt.
>
>
> fsck claims to have fixed it but on reboot it blows up the same way:
>
> e2fsck 1.42.11 (09-Jul-2014)
> /dev/mapper/t440s-root: recovering journal
> /dev/mapper/t440s-root contains a file system with errors, check forced.
> Pass 1: Checking inodes, blocks, and sizes
> Pass 2: Checking directory structure
> Pass 3: Checking directory connectivity
> Unconnected directory inode 3801092 (/tmp/???)
> Connect to /lost+found? yes
> Unconnected directory inode 3801093 (/tmp/???)
> Connect to /lost+found? yes
> Unconnected directory inode 3801106 (/tmp/???)
> Connect to /lost+found? yes
> Unconnected directory inode 3801107 (/lost+found/#3801106/???)
> Connect to /lost+found? yes
> Unconnected directory inode 3801111 (/tmp/???)
> Connect to /lost+found? yes
> Unconnected directory inode 3801116 (/tmp/???)
> Connect to /lost+found? yes
> Unconnected directory inode 3801118 (/tmp/???)
> Connect to /lost+found? yes
> Pass 4: Checking reference counts
> Inode 3801089 ref count is 61, should be 42. Fix? yes
> Inode 3801092 ref count is 3, should be 2. Fix? yes
> Inode 3801093 ref count is 3, should be 2. Fix? yes
> Unattached inode 3801099
> Connect to /lost+found? yes
> Inode 3801099 ref count is 2, should be 1. Fix? yes
> Unattached inode 3801103
> Connect to /lost+found? yes
> Inode 3801103 ref count is 2, should be 1. Fix? yes
> Inode 3801106 ref count is 3, should be 2. Fix? yes
> Inode 3801107 ref count is 3, should be 2. Fix? yes
> Inode 3801111 ref count is 3, should be 2. Fix? yes
> Unattached inode 3801112
> Connect to /lost+found? yes
> Inode 3801112 ref count is 2, should be 1. Fix? yes
> Inode 3801116 ref count is 3, should be 2. Fix? yes
> Inode 3801118 ref count is 3, should be 2. Fix? yes
>
> Pass 5: Checking group summary information
> Block bitmap differences: -(15212585--15212586) -(15212756--15212757) -15212761 -15212765 -15212883 -15212886 -(15212888--15212891) -15212905 -15212907 -15212911 -(15212923--15212924) -15212938 -15212940 -15213385 +15237175 +(27371328--27371391) +(27427126--27427191) +(27427648--27427711) +82127850
> Fix? yes
> Free blocks count wrong for group #464 (24160, counted=24180).
> Fix? yes
> Free blocks count wrong for group #465 (25520, counted=25827).
> Fix? yes
> Free blocks count wrong for group #835 (18809, counted=18745).
> Fix? yes
> Free blocks count wrong for group #837 (23154, counted=23024).
> Fix? yes
> Free blocks count wrong for group #2506 (28536, counted=28535).
> Fix? yes
> Free blocks count wrong for group #2842 (2415, counted=2478).
> Fix? yes
> Free blocks count wrong for group #2844 (27816, counted=28135).
> Fix? yes
> Free blocks count wrong (108044209, counted=108044918).
> Fix? yes
> Inode bitmap differences: -3801122 -3801126 -(3801128--3801129) -3801134 -3801137 -(3801139--3801142) -3801146 -(3801149--3801150) -(3801152--3801154) -3801158 -3801160 -3801168 -(3801176--3801179) -(3801182--3801183) -3801186 -3801189 -3801193 -(3801199--3801200) -(3801203--3801205) -(3801208--3801211) -(3801213--3801214) -3801216 -3801220 -(3801223--3801224) -3801226 -(3801228--3801232) -(3801238--3801239) -3801738 -3801753 -3801755 -(3801758--3801759) -(3801762--3801763) -3801769 -3801792 -(3801805--3801806) -3801809 -(3801813--3801817) -3801822 -(3801826--3801828) -(3801832--3801834) -(3801836--3801837) -(3801842--3801843) -3801848 -3801853 -3801857 -(3801863--3801864) -3801871 -(3801873--3801876) -3801879 -3801881 -3801883 -3801885 -(3801888--3801889) -(3801891--3801892) -(3801896--3801897) -3801899 -(3801901--3801902) -(3801905--3801906) -(3801909--3801910) -3801912 -3801914 -(3801920--3801921) -(3801923--3801924) -3801926 -3802690 -3805907
> Fix? yes
> Free inodes count wrong for group #464 (6581, counted=6696).
> Fix? yes
> Directories count wrong for group #464 (366, counted=346).
> Fix? yes
> Free inodes count wrong (29348331, counted=29348445).
> Fix? yes
>
> /dev/mapper/t440s-root: ***** FILE SYSTEM WAS MODIFIED *****
> /dev/mapper/t440s-root: ***** REBOOT LINUX *****
> /dev/mapper/t440s-root: 617891/29966336 files (0.7% non-contiguous), 11796874/119841792 blocks
>
>
> After fsck reports clean, reboot still shows failures:
>
>
> [ 7.378361] EXT4-fs (dm-1): INFO: recovery required on readonly filesystem
> [ 7.378365] EXT4-fs (dm-1): write access will be enabled during recovery
> [ 7.384663] EXT4-fs (dm-1): recovery complete
> [ 7.386479] EXT4-fs (dm-1): mounted filesystem with ordered data mode. Opts: (null)
>
> [ 7.710694] EXT4-fs (dm-1): re-mounted. Opts: errors=remount-ro
>
> [ 9.820974] EXT4-fs error (device dm-1): ext4_mb_generate_buddy:756: group 465, 29923 clusters in bitmap, 29922 in gd; block bitmap corrupt.
> [ 9.820975] Aborting journal on device dm-1-8.
> [ 9.821614] EXT4-fs (dm-1): Remounting filesystem read-only
>
>
> Similar repeated problems repeat on every reboot.
>
> SMART stats on the SSD do not indicate any signs of failing hardware:
>
> Device Model: Samsung SSD 840 EVO 500GB
> Serial Number: S1DHNSAD929048M
> LU WWN Device Id: 5 002538 8a00452f8
> Firmware Version: EXT0BB0Q
> User Capacity: 500,107,862,016 bytes [500 GB]
> Sector Size: 512 bytes logical/physical
> Rotation Rate: Solid State Device
> Device is: Not in smartctl database [for details use: -P showall]
> ATA Version is: ACS-2, ATA8-ACS T13/1699-D revision 4c
> SATA Version is: SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
> Local Time is: Thu Jul 31 12:36:59 2014 PDT
> ...
> ID# ATTRIBUTE_NAME          FLAG   VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
>   5 Reallocated_Sector_Ct   0x0033 100   100   010    Pre-fail Always  -           0
>   9 Power_On_Hours          0x0032 099   099   000    Old_age  Always  -           1693
>  12 Power_Cycle_Count       0x0032 099   099   000    Old_age  Always  -           165
> 177 Wear_Leveling_Count     0x0013 099   099   000    Pre-fail Always  -           2
> 179 Used_Rsvd_Blk_Cnt_Tot   0x0013 100   100   010    Pre-fail Always  -           0
> 181 Program_Fail_Cnt_Total  0x0032 100   100   010    Old_age  Always  -           0
> 182 Erase_Fail_Count_Total  0x0032 100   100   010    Old_age  Always  -           0
> 183 Runtime_Bad_Block       0x0013 100   100   010    Pre-fail Always  -           0
> 187 Reported_Uncorrect      0x0032 100   100   000    Old_age  Always  -           0
> 190 Airflow_Temperature_Cel 0x0032 069   053   000    Old_age  Always  -           31
> 195 Hardware_ECC_Recovered  0x001a 200   200   000    Old_age  Always  -           0
> 199 UDMA_CRC_Error_Count    0x003e 100   100   000    Old_age  Always  -           0
> 235 Unknown_Attribute       0x0012 099   099   000    Old_age  Always  -           7
> 241 Total_LBAs_Written      0x0032 099   099   000    Old_age  Always  -           2102932957
>
> -andy
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html