From: "Darrick J. Wong" <darrick.wong@oracle.com>
Subject: Re: e2fsck not fixing deleted inode referenced errors?
Date: Tue, 30 Sep 2014 12:29:55 -0700
Message-ID: <20140930192955.GB9942@birch.djwong.org>
References: <542AEED4.5050303@bitsync.net>
 <20140930183012.GA9942@birch.djwong.org>
 <542AF9B8.2090800@bitsync.net>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Cc: linux-ext4@vger.kernel.org
To: Zlatko Calusic <zcalusic@bitsync.net>
Content-Disposition: inline
In-Reply-To: <542AF9B8.2090800@bitsync.net>
Sender: linux-ext4-owner@vger.kernel.org

On Tue, Sep 30, 2014 at 08:43:04PM +0200, Zlatko Calusic wrote:
> On 30.09.2014 20:30, Darrick J. Wong wrote:
> >On Tue, Sep 30, 2014 at 07:56:36PM +0200, Zlatko Calusic wrote:
> >>Hope this is the right list to ask this question.
> >>
> >>I have an ext4 filesystem that has a few errors like this:
> >>
> >>Sep 30 19:14:09 atlas kernel: EXT4-fs error (device md2):
> >>ext4_lookup:1448: inode #7913865: comm find: deleted inode
> >>referenced: 7912058
> >>Sep 30 19:14:09 atlas kernel: EXT4-fs error (device md2):
> >>ext4_lookup:1448: inode #7913865: comm find: deleted inode
> >>referenced: 7912055
> >>
> >>Yet, when I run e2fsck -fy on it, I have a clean run, no errors are
> >>found and/or fixed. Is this the expected behaviour? What am I
> >>supposed to do to get rid of errors like the above?
> >
> >[I should hope not.]
> >
> >>The filesystem is on a md mirror device, the kernel is 3.17.0-rc7,
> >>e2fsprogs 1.42.12-1 (Debian sid). Could md device somehow interfere? I
> >>ran md check yesterday, but there were no errors.
> >>
> >>BTW, this all started when I got ata2.00: failed command: FLUSH
> >>CACHE EXT error yesterday morning. I did several runs of e2fsck
> >>before the filesystem came up clean, yet errors like the above are
> >>popping constantly.
> >
> >Normally that kernel message only happens if a dir refers to an inode with
> >link_count and mode set to 0.
> >
> >Is the disk attached to ata2.00 one of the RAID1 mirrors?  What was the full
> >error message, and does smartctl -a report anything?
> 
> Yes, it is part of the mirror:
> 
> ata2.00: ATA-8: WDC WD1002FBYS-02A6B0, 03.00C06, max UDMA/133
> ata2.00: 1953525168 sectors, multi 0: LBA48 NCQ (depth 31/32), AA
> ata2.00: configured for UDMA/133
> 
> md2 : active raid1 sdb2[0] sda2[1]
>       976229760 blocks [2/2] [UU]
>       bitmap: 0/8 pages [0KB], 65536KB chunk
> 
> Full error message from the kernel log, together with data check I
> did in the evening:
> 
> Sep 29 05:07:51 atlas kernel: ata2.00: exception Emask 0x10 SAct 0x0
> SErr 0x4010000 action 0xe frozen
> Sep 29 05:07:51 atlas kernel: ata2.00: irq_stat 0x00400040,
> connection status changed
> Sep 29 05:07:51 atlas kernel: ata2: SError: { PHYRdyChg DevExch }
> Sep 29 05:07:51 atlas kernel: ata2.00: failed command: FLUSH CACHE EXT
> Sep 29 05:07:51 atlas kernel: ata2.00: cmd
> ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0\x0a         res
> 40/00:f4:e2:7f:14/00:00:3a:00:00/40 Emask 0x10 (ATA bus error)
> Sep 29 05:07:51 atlas kernel: ata2.00: status: { DRDY }
> Sep 29 05:07:51 atlas kernel: ata2: hard resetting link
> Sep 29 05:07:57 atlas kernel: ata2: link is slow to respond, please
> be patient (ready=0)
> Sep 29 05:08:00 atlas kernel: ata2: SATA link up 3.0 Gbps (SStatus
> 123 SControl 300)
> Sep 29 05:08:00 atlas kernel: ata2.00: configured for UDMA/133
> Sep 29 05:08:00 atlas kernel: ata2.00: retrying FLUSH 0xea Emask 0x10
> Sep 29 05:08:00 atlas kernel: ata2: EH complete
> Sep 29 05:37:36 atlas kernel: EXT4-fs error (device md2):
> ext4_mb_generate_buddy:757: group 1783, block bitmap and bg
> descriptor inconsistent: 8218 vs 9292 free clusters
> Sep 29 05:37:36 atlas kernel: JBD2: Spotted dirty metadata buffer
> (dev = md2, blocknr = 0). There's a risk of filesystem corruption in
> case of system crash.
> Sep 29 16:03:43 atlas kernel: EXT4-fs error (device md2):
> ext4_mb_generate_buddy:757: group 995, block bitmap and bg
> descriptor inconsistent: 15932 vs 15939 free clusters
> Sep 29 16:03:43 atlas kernel: EXT4-fs error (device md2):
> ext4_mb_generate_buddy:757: group 1732, block bitmap and bg
> descriptor inconsistent: 5055 vs 5705 free clusters
> Sep 29 19:24:01 atlas kernel: md: data-check of RAID array md2
> Sep 29 19:24:01 atlas kernel: md: minimum _guaranteed_  speed: 1000
> KB/sec/disk.
> Sep 29 19:24:01 atlas kernel: md: using maximum available idle IO
> bandwidth (but not more than 200000 KB/sec) for data-check.
> Sep 29 19:24:01 atlas kernel: md: using 128k window, over a total of
> 976229760k.
> Sep 29 22:37:53 atlas kernel: md: md2: data-check done.
> 
> 
> Later on I did several (at least 3) e2fsck runs until the filesystem
> finally was clean of errors. Only to stumble upon new errors today
> that can't be fixed with e2fsck anymore. :(
> 
> >
> >It would be interesting to see what "debugfs -R 'stat <7912058>' /dev/md2"
> >returns.
> 
> Inode: 7912058   Type: regular    Mode:  0644   Flags: 0x80000
> Generation: 252726504    Version: 0x00000000:00000001
> User:     0   Group:     0   Size: 0
> File ACL: 0    Directory ACL: 0
> Links: 0   Blockcount: 0
> Fragment:  Address: 0    Number: 0    Size: 0
>  ctime: 0x5428ccf9:667449f0 -- Mon Sep 29 05:07:37 2014
>  atime: 0x5428ccf9:65fa3740 -- Mon Sep 29 05:07:37 2014
>  mtime: 0x5428ccf9:667449f0 -- Mon Sep 29 05:07:37 2014
> crtime: 0x53451666:d35246b0 -- Wed Apr  9 11:44:06 2014
> dtime: 0x5428ccf9 -- Mon Sep 29 05:07:37 2014
> Size of extra inode fields: 28
> EXTENTS:

Huh.  This looks like a normal deleted file... just to ensure we're sane,
what's the output of:

debugfs -R 'ls <7913865>' /dev/md2
debugfs -R 'ncheck 7913865' /dev/md2

Hoping 7913865 -> /ext/backup/atlas/usr/lib/x86_64-linux-gnu/imlib2/filters
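
If more of these turn up, a small script can pull the inode pairs out of the
kernel log and print the matching debugfs invocations. This is only a sketch:
the function names are made up for illustration, the regex assumes the
"deleted inode referenced" message format quoted earlier in this thread
(possibly wrapped across lines, but without mail quote markers), and /dev/md2
is simply the device from this report.

```python
import re

# Matches the ext4_lookup message quoted earlier in this thread; \s+ tolerates
# the line wrapping that syslog viewers and mail clients introduce.
PATTERN = re.compile(
    r"inode\s+#(\d+):\s+comm\s+\S+:\s+deleted\s+inode\s+referenced:\s+(\d+)"
)

def deleted_inode_refs(log_text):
    """Return sorted, de-duplicated (directory inode, referenced inode) pairs."""
    return sorted({(int(d), int(i)) for d, i in PATTERN.findall(log_text)})

def debugfs_commands(pairs, device="/dev/md2"):
    """Build the debugfs commands to run by hand for each pair."""
    cmds = []
    # One 'ls' and one 'ncheck' per directory inode that holds a bad entry.
    for dir_ino in sorted({d for d, _ in pairs}):
        cmds.append(f"debugfs -R 'ls <{dir_ino}>' {device}")
        cmds.append(f"debugfs -R 'ncheck {dir_ino}' {device}")
    # One 'stat' per inode the kernel complained about.
    for _, ino in pairs:
        cmds.append(f"debugfs -R 'stat <{ino}>' {device}")
    return cmds

if __name__ == "__main__":
    import sys
    # Usage: feed the kernel log on stdin, review the printed commands.
    for cmd in debugfs_commands(deleted_inode_refs(sys.stdin.read())):
        print(cmd)
```

The script only prints commands; nothing is run against the device until you
paste them yourself.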

> At this time there seems to be 7 such files. Here's what it looks like:
> 
> {atlas} [/ext/backup/atlas/usr/lib/x86_64-linux-gnu/imlib2/filters]# ls -la
> ls: cannot access colormod.so: Input/output error
> ls: cannot access bumpmap.so: Input/output error
> ls: cannot access bumpmap.la: Input/output error
> ls: cannot access testfilter.la: Input/output error
> ls: cannot access testfilter.so: Input/output error
> ls: cannot access colormod.la: Input/output error
> total 8
> drwxr-xr-x 2 root root 4096 Sep 28 11:10 .
> drwxr-xr-x 4 root root 4096 Sep 14  2013 ..
> -????????? ? ?    ?       ?            ? bumpmap.la
> -????????? ? ?    ?       ?            ? bumpmap.so
> -????????? ? ?    ?       ?            ? colormod.la
> -????????? ? ?    ?       ?            ? colormod.so
> -????????? ? ?    ?       ?            ? testfilter.la
> -????????? ? ?    ?       ?            ? testfilter.so
> {atlas} [/ext/backup/atlas/usr/lib/x86_64-linux-gnu/imlib2/filters]# cd
> {atlas} [~]# umount /ext
> {atlas} [~]# time e2fsck -fy /dev/md2
> e2fsck 1.42.12 (29-Aug-2014)
> Pass 1: Checking inodes, blocks, and sizes
> Pass 2: Checking directory structure
> Pass 3: Checking directory connectivity
> Pass 4: Checking reference counts
> Pass 5: Checking group summary information
> /dev/md2: 3863428/61022208 files (0.7% non-contiguous),
> 231256220/244057440 blocks
> e2fsck -fy /dev/md2  9.57s user 2.05s system 5% cpu 3:14.40 total

By any chance did you save the e2fsck logs?

Digging through the e2fsck source code, the only way an inode gets marked used
is if i_links_count > 0 or ... badblocks thinks the inode table block is bad.
What does this say?

debugfs -R 'stat <1>' /dev/md2
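
For comparing several of these dumps, the stat output format quoted above can
be checked mechanically. The sketch below pulls out the Links, Mode and dtime
fields as debugfs printed them and flags an inode as "looks deleted" when
Links is 0 and a dtime is recorded. That rule is a simplification for
illustration, not the actual e2fsck or kernel logic, and the function names
are hypothetical.

```python
import re

def parse_stat(text):
    """Extract link count, mode and dtime from 'debugfs stat' output."""
    # Strip mail quote markers ("> ") so dumps pasted from a reply parse too.
    lines = [re.sub(r"^(>\s?)+", "", ln) for ln in text.splitlines()]
    body = "\n".join(lines)
    links = re.search(r"\bLinks:\s*(\d+)", body)
    mode = re.search(r"\bMode:\s*([0-7]+)", body)
    dtime = re.search(r"^\s*dtime:\s*(0x[0-9a-fA-F]+)", body, re.MULTILINE)
    return {
        "links": int(links.group(1)) if links else None,
        "mode": int(mode.group(1), 8) if mode else None,      # octal permission bits
        "dtime": int(dtime.group(1), 16) if dtime else None,  # seconds since epoch
    }

def looks_deleted(fields):
    """Simplified rule: no links left and a deletion time recorded."""
    return fields["links"] == 0 and bool(fields["dtime"])
```

An inode that the directory still references but that passes looks_deleted()
is the combination the kernel message is complaining about.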

> Tried to delete that directory - impossible, i/o errors. I'll try to
> reboot now to see if anything changes...

In theory we can use debugfs to clear the directory and then run e2fsck to
clean up, but let's sanity-check the world before we resort to that. :)

--D
> 
> Thanks for your help.
> -- 
> Zlatko
> 