2014-09-30 18:05:49

by Zlatko Calusic

Subject: e2fsck not fixing deleted inode referenced errors?

Hope this is the right list to ask this question.

I have an ext4 filesystem that has a few errors like this:

Sep 30 19:14:09 atlas kernel: EXT4-fs error (device md2):
ext4_lookup:1448: inode #7913865: comm find: deleted inode referenced:
7912058
Sep 30 19:14:09 atlas kernel: EXT4-fs error (device md2):
ext4_lookup:1448: inode #7913865: comm find: deleted inode referenced:
7912055

Yet, when I run e2fsck -fy on it, I have a clean run; no errors are
found or fixed. Is this the expected behaviour? What am I supposed
to do to get rid of errors like the above?

The filesystem is on an md mirror device, the kernel is 3.17.0-rc7,
e2fsprogs 1.42.12-1 (Debian sid). Could the md device somehow interfere? I ran
an md check yesterday, but there were no errors.

BTW, this all started when I got an ata2.00: failed command: FLUSH CACHE
EXT error yesterday morning. I did several runs of e2fsck before the
filesystem came up clean, yet errors like the above keep popping up constantly.

Thanks for any info. [and please Cc:, I'm not subscribed]

--
Zlatko



2014-09-30 18:30:24

by Darrick J. Wong

Subject: Re: e2fsck not fixing deleted inode referenced errors?

On Tue, Sep 30, 2014 at 07:56:36PM +0200, Zlatko Calusic wrote:
> Hope this is the right list to ask this question.
>
> I have an ext4 filesystem that has a few errors like this:
>
> Sep 30 19:14:09 atlas kernel: EXT4-fs error (device md2):
> ext4_lookup:1448: inode #7913865: comm find: deleted inode
> referenced: 7912058
> Sep 30 19:14:09 atlas kernel: EXT4-fs error (device md2):
> ext4_lookup:1448: inode #7913865: comm find: deleted inode
> referenced: 7912055
>
> Yet, when I run e2fsck -fy on it, I have a clean run, no errors are
> found and/or fixed. Is this the expected behaviour? What am I
> supposed to do to get rid of errors like the above?

[I should hope not.]

> The filesystem is on a md mirror device, the kernel is 3.17.0-rc7,
> e2fsprogs 1.42.12-1 (Debian sid). Could md device somehow interfere? I
> ran md check yesterday, but there were no errors.
>
> BTW, this all started when I got ata2.00: failed command: FLUSH
> CACHE EXT error yesterday morning. I did several runs of e2fsck
> before the filesystem came up clean, yet errors like the above are
> popping constantly.

Normally that kernel message only happens if a dir refers to an inode with
link_count and mode set to 0.

Is the disk attached to ata2.00 one of the RAID1 mirrors? What was the full
error message, and does smartctl -a report anything?

It would be interesting to see what "debugfs -R 'stat <7912058>' /dev/md2"
returns.

--D

>
> Thanks for any info. [and please Cc:, I'm not subscribed]
>
> --
> Zlatko
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

2014-09-30 18:43:08

by Zlatko Calusic

Subject: Re: e2fsck not fixing deleted inode referenced errors?

On 30.09.2014 20:30, Darrick J. Wong wrote:
> On Tue, Sep 30, 2014 at 07:56:36PM +0200, Zlatko Calusic wrote:
>> Hope this is the right list to ask this question.
>>
>> I have an ext4 filesystem that has a few errors like this:
>>
>> Sep 30 19:14:09 atlas kernel: EXT4-fs error (device md2):
>> ext4_lookup:1448: inode #7913865: comm find: deleted inode
>> referenced: 7912058
>> Sep 30 19:14:09 atlas kernel: EXT4-fs error (device md2):
>> ext4_lookup:1448: inode #7913865: comm find: deleted inode
>> referenced: 7912055
>>
>> Yet, when I run e2fsck -fy on it, I have a clean run, no errors are
>> found and/or fixed. Is this the expected behaviour? What am I
>> supposed to do to get rid of errors like the above?
>
> [I should hope not.]
>
>> The filesystem is on a md mirror device, the kernel is 3.17.0-rc7,
>> e2fsprogs 1.42.12-1 (Debian sid). Could md device somehow interfere? I
>> ran md check yesterday, but there were no errors.
>>
>> BTW, this all started when I got ata2.00: failed command: FLUSH
>> CACHE EXT error yesterday morning. I did several runs of e2fsck
>> before the filesystem came up clean, yet errors like the above are
>> popping constantly.
>
> Normally that kernel message only happens if a dir refers to an inode with
> link_count and mode set to 0.
>
> Is the disk attached to ata2.00 one of the RAID1 mirrors? What was the full
> error message, and does smartctl -a report anything?

Yes, it is part of the mirror:

ata2.00: ATA-8: WDC WD1002FBYS-02A6B0, 03.00C06, max UDMA/133
ata2.00: 1953525168 sectors, multi 0: LBA48 NCQ (depth 31/32), AA
ata2.00: configured for UDMA/133

md2 : active raid1 sdb2[0] sda2[1]
976229760 blocks [2/2] [UU]
bitmap: 0/8 pages [0KB], 65536KB chunk

The full error message from the kernel log, together with the data check I
did in the evening:

Sep 29 05:07:51 atlas kernel: ata2.00: exception Emask 0x10 SAct 0x0
SErr 0x4010000 action 0xe frozen
Sep 29 05:07:51 atlas kernel: ata2.00: irq_stat 0x00400040, connection
status changed
Sep 29 05:07:51 atlas kernel: ata2: SError: { PHYRdyChg DevExch }
Sep 29 05:07:51 atlas kernel: ata2.00: failed command: FLUSH CACHE EXT
Sep 29 05:07:51 atlas kernel: ata2.00: cmd
ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0\x0a res
40/00:f4:e2:7f:14/00:00:3a:00:00/40 Emask 0x10 (ATA bus error)
Sep 29 05:07:51 atlas kernel: ata2.00: status: { DRDY }
Sep 29 05:07:51 atlas kernel: ata2: hard resetting link
Sep 29 05:07:57 atlas kernel: ata2: link is slow to respond, please be
patient (ready=0)
Sep 29 05:08:00 atlas kernel: ata2: SATA link up 3.0 Gbps (SStatus 123
SControl 300)
Sep 29 05:08:00 atlas kernel: ata2.00: configured for UDMA/133
Sep 29 05:08:00 atlas kernel: ata2.00: retrying FLUSH 0xea Emask 0x10
Sep 29 05:08:00 atlas kernel: ata2: EH complete
Sep 29 05:37:36 atlas kernel: EXT4-fs error (device md2):
ext4_mb_generate_buddy:757: group 1783, block bitmap and bg descriptor
inconsistent: 8218 vs 9292 free clusters
Sep 29 05:37:36 atlas kernel: JBD2: Spotted dirty metadata buffer (dev =
md2, blocknr = 0). There's a risk of filesystem corruption in case of
system crash.
Sep 29 16:03:43 atlas kernel: EXT4-fs error (device md2):
ext4_mb_generate_buddy:757: group 995, block bitmap and bg descriptor
inconsistent: 15932 vs 15939 free clusters
Sep 29 16:03:43 atlas kernel: EXT4-fs error (device md2):
ext4_mb_generate_buddy:757: group 1732, block bitmap and bg descriptor
inconsistent: 5055 vs 5705 free clusters
Sep 29 19:24:01 atlas kernel: md: data-check of RAID array md2
Sep 29 19:24:01 atlas kernel: md: minimum _guaranteed_ speed: 1000
KB/sec/disk.
Sep 29 19:24:01 atlas kernel: md: using maximum available idle IO
bandwidth (but not more than 200000 KB/sec) for data-check.
Sep 29 19:24:01 atlas kernel: md: using 128k window, over a total of
976229760k.
Sep 29 22:37:53 atlas kernel: md: md2: data-check done.


Later on I did several (at least 3) e2fsck runs until the filesystem
finally came up clean. Only to stumble upon new errors today that
e2fsck can't fix anymore. :(

>
> It would be interesting to see what "debugfs -R 'stat <7912058>' /dev/md2"
> returns.

Inode: 7912058 Type: regular Mode: 0644 Flags: 0x80000
Generation: 252726504 Version: 0x00000000:00000001
User: 0 Group: 0 Size: 0
File ACL: 0 Directory ACL: 0
Links: 0 Blockcount: 0
Fragment: Address: 0 Number: 0 Size: 0
ctime: 0x5428ccf9:667449f0 -- Mon Sep 29 05:07:37 2014
atime: 0x5428ccf9:65fa3740 -- Mon Sep 29 05:07:37 2014
mtime: 0x5428ccf9:667449f0 -- Mon Sep 29 05:07:37 2014
crtime: 0x53451666:d35246b0 -- Wed Apr 9 11:44:06 2014
dtime: 0x5428ccf9 -- Mon Sep 29 05:07:37 2014
Size of extra inode fields: 28
EXTENTS:

At this time there seem to be 7 such files. Here's what it looks like:

{atlas} [/ext/backup/atlas/usr/lib/x86_64-linux-gnu/imlib2/filters]# ls -la
ls: cannot access colormod.so: Input/output error
ls: cannot access bumpmap.so: Input/output error
ls: cannot access bumpmap.la: Input/output error
ls: cannot access testfilter.la: Input/output error
ls: cannot access testfilter.so: Input/output error
ls: cannot access colormod.la: Input/output error
total 8
drwxr-xr-x 2 root root 4096 Sep 28 11:10 .
drwxr-xr-x 4 root root 4096 Sep 14 2013 ..
-????????? ? ? ? ? ? bumpmap.la
-????????? ? ? ? ? ? bumpmap.so
-????????? ? ? ? ? ? colormod.la
-????????? ? ? ? ? ? colormod.so
-????????? ? ? ? ? ? testfilter.la
-????????? ? ? ? ? ? testfilter.so
{atlas} [/ext/backup/atlas/usr/lib/x86_64-linux-gnu/imlib2/filters]# cd
{atlas} [~]# umount /ext
{atlas} [~]# time e2fsck -fy /dev/md2
e2fsck 1.42.12 (29-Aug-2014)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
/dev/md2: 3863428/61022208 files (0.7% non-contiguous),
231256220/244057440 blocks
e2fsck -fy /dev/md2 9.57s user 2.05s system 5% cpu 3:14.40 total


I tried to delete that directory - impossible, I/O errors. I'll try to
reboot now to see if anything changes...

Thanks for your help.
--
Zlatko


2014-09-30 19:30:06

by Darrick J. Wong

Subject: Re: e2fsck not fixing deleted inode referenced errors?

On Tue, Sep 30, 2014 at 08:43:04PM +0200, Zlatko Calusic wrote:
> On 30.09.2014 20:30, Darrick J. Wong wrote:
> >On Tue, Sep 30, 2014 at 07:56:36PM +0200, Zlatko Calusic wrote:
> >>Hope this is the right list to ask this question.
> >>
> >>I have an ext4 filesystem that has a few errors like this:
> >>
> >>Sep 30 19:14:09 atlas kernel: EXT4-fs error (device md2):
> >>ext4_lookup:1448: inode #7913865: comm find: deleted inode
> >>referenced: 7912058
> >>Sep 30 19:14:09 atlas kernel: EXT4-fs error (device md2):
> >>ext4_lookup:1448: inode #7913865: comm find: deleted inode
> >>referenced: 7912055
> >>
> >>Yet, when I run e2fsck -fy on it, I have a clean run, no errors are
> >>found and/or fixed. Is this the expected behaviour? What am I
> >>supposed to do to get rid of errors like the above?
> >
> >[I should hope not.]
> >
> >>The filesystem is on a md mirror device, the kernel is 3.17.0-rc7,
> >>e2fsprogs 1.42.12-1 (Debian sid). Could md device somehow interfere? I
> >>ran md check yesterday, but there were no errors.
> >>
> >>BTW, this all started when I got ata2.00: failed command: FLUSH
> >>CACHE EXT error yesterday morning. I did several runs of e2fsck
> >>before the filesystem came up clean, yet errors like the above are
> >>popping constantly.
> >
> >Normally that kernel message only happens if a dir refers to an inode with
> >link_count and mode set to 0.
> >
> >Is the disk attached to ata2.00 one of the RAID1 mirrors? What was the full
> >error message, and does smartctl -a report anything?
>
> Yes, it is part of the mirror:
>
> ata2.00: ATA-8: WDC WD1002FBYS-02A6B0, 03.00C06, max UDMA/133
> ata2.00: 1953525168 sectors, multi 0: LBA48 NCQ (depth 31/32), AA
> ata2.00: configured for UDMA/133
>
> md2 : active raid1 sdb2[0] sda2[1]
> 976229760 blocks [2/2] [UU]
> bitmap: 0/8 pages [0KB], 65536KB chunk
>
> Full error message from the kernel log, together with data check I
> did in the evening:
>
> Sep 29 05:07:51 atlas kernel: ata2.00: exception Emask 0x10 SAct 0x0
> SErr 0x4010000 action 0xe frozen
> Sep 29 05:07:51 atlas kernel: ata2.00: irq_stat 0x00400040,
> connection status changed
> Sep 29 05:07:51 atlas kernel: ata2: SError: { PHYRdyChg DevExch }
> Sep 29 05:07:51 atlas kernel: ata2.00: failed command: FLUSH CACHE EXT
> Sep 29 05:07:51 atlas kernel: ata2.00: cmd
> ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0\x0a res
> 40/00:f4:e2:7f:14/00:00:3a:00:00/40 Emask 0x10 (ATA bus error)
> Sep 29 05:07:51 atlas kernel: ata2.00: status: { DRDY }
> Sep 29 05:07:51 atlas kernel: ata2: hard resetting link
> Sep 29 05:07:57 atlas kernel: ata2: link is slow to respond, please
> be patient (ready=0)
> Sep 29 05:08:00 atlas kernel: ata2: SATA link up 3.0 Gbps (SStatus
> 123 SControl 300)
> Sep 29 05:08:00 atlas kernel: ata2.00: configured for UDMA/133
> Sep 29 05:08:00 atlas kernel: ata2.00: retrying FLUSH 0xea Emask 0x10
> Sep 29 05:08:00 atlas kernel: ata2: EH complete
> Sep 29 05:37:36 atlas kernel: EXT4-fs error (device md2):
> ext4_mb_generate_buddy:757: group 1783, block bitmap and bg
> descriptor inconsistent: 8218 vs 9292 free clusters
> Sep 29 05:37:36 atlas kernel: JBD2: Spotted dirty metadata buffer
> (dev = md2, blocknr = 0). There's a risk of filesystem corruption in
> case of system crash.
> Sep 29 16:03:43 atlas kernel: EXT4-fs error (device md2):
> ext4_mb_generate_buddy:757: group 995, block bitmap and bg
> descriptor inconsistent: 15932 vs 15939 free clusters
> Sep 29 16:03:43 atlas kernel: EXT4-fs error (device md2):
> ext4_mb_generate_buddy:757: group 1732, block bitmap and bg
> descriptor inconsistent: 5055 vs 5705 free clusters
> Sep 29 19:24:01 atlas kernel: md: data-check of RAID array md2
> Sep 29 19:24:01 atlas kernel: md: minimum _guaranteed_ speed: 1000
> KB/sec/disk.
> Sep 29 19:24:01 atlas kernel: md: using maximum available idle IO
> bandwidth (but not more than 200000 KB/sec) for data-check.
> Sep 29 19:24:01 atlas kernel: md: using 128k window, over a total of
> 976229760k.
> Sep 29 22:37:53 atlas kernel: md: md2: data-check done.
>
>
> Later on I did several (at least 3) e2fsck runs until the filesystem
> finally was clean of errors. Only to stumble upon new errors today
> that can't be fixed with e2fsck anymore. :(
>
> >
> >It would be interesting to see what "debugfs -R 'stat <7912058>' /dev/md2"
> >returns.
>
> Inode: 7912058 Type: regular Mode: 0644 Flags: 0x80000
> Generation: 252726504 Version: 0x00000000:00000001
> User: 0 Group: 0 Size: 0
> File ACL: 0 Directory ACL: 0
> Links: 0 Blockcount: 0
> Fragment: Address: 0 Number: 0 Size: 0
> ctime: 0x5428ccf9:667449f0 -- Mon Sep 29 05:07:37 2014
> atime: 0x5428ccf9:65fa3740 -- Mon Sep 29 05:07:37 2014
> mtime: 0x5428ccf9:667449f0 -- Mon Sep 29 05:07:37 2014
> crtime: 0x53451666:d35246b0 -- Wed Apr 9 11:44:06 2014
> dtime: 0x5428ccf9 -- Mon Sep 29 05:07:37 2014
> Size of extra inode fields: 28
> EXTENTS:

Huh. This looks like a normal deleted file... just to ensure we're sane,
what's the output of:

debugfs -R 'ls <7913865>' /dev/md2
debugfs -R 'ncheck 7913865' /dev/md2

Hoping 7913865 -> /ext/backup/atlas/usr/lib/x86_64-linux-gnu/imlib2/filters

> At this time there seems to be 7 such files. Here's what it looks like:
>
> {atlas} [/ext/backup/atlas/usr/lib/x86_64-linux-gnu/imlib2/filters]# ls -la
> ls: cannot access colormod.so: Input/output error
> ls: cannot access bumpmap.so: Input/output error
> ls: cannot access bumpmap.la: Input/output error
> ls: cannot access testfilter.la: Input/output error
> ls: cannot access testfilter.so: Input/output error
> ls: cannot access colormod.la: Input/output error
> total 8
> drwxr-xr-x 2 root root 4096 Sep 28 11:10 .
> drwxr-xr-x 4 root root 4096 Sep 14 2013 ..
> -????????? ? ? ? ? ? bumpmap.la
> -????????? ? ? ? ? ? bumpmap.so
> -????????? ? ? ? ? ? colormod.la
> -????????? ? ? ? ? ? colormod.so
> -????????? ? ? ? ? ? testfilter.la
> -????????? ? ? ? ? ? testfilter.so
> {atlas} [/ext/backup/atlas/usr/lib/x86_64-linux-gnu/imlib2/filters]# cd
> {atlas} [~]# umount /ext
> {atlas} [~]# time e2fsck -fy /dev/md2
> e2fsck 1.42.12 (29-Aug-2014)
> Pass 1: Checking inodes, blocks, and sizes
> Pass 2: Checking directory structure
> Pass 3: Checking directory connectivity
> Pass 4: Checking reference counts
> Pass 5: Checking group summary information
> /dev/md2: 3863428/61022208 files (0.7% non-contiguous),
> 231256220/244057440 blocks
> e2fsck -fy /dev/md2 9.57s user 2.05s system 5% cpu 3:14.40 total

By any chance did you save the e2fsck logs?

Digging through the e2fsck source code, the only way an inode gets marked used
is if i_link_count > 0 or ... badblocks thinks the inode table block is bad.
What does this say?

debugfs -R 'stat <1>' /dev/md2

> Tried to delete that directory - impossible, i/o errors. I'll try to
> reboot now to see if anything changes...

In theory we can use debugfs to clear the directory and then run e2fsck to
clean up, but let's sanity-check the world before we resort to that. :)

--D
>
> Thanks for your help.
> --
> Zlatko
>

2014-09-30 19:54:15

by Theodore Ts'o

Subject: Re: e2fsck not fixing deleted inode referenced errors?

On Tue, Sep 30, 2014 at 08:43:04PM +0200, Zlatko Calusic wrote:
> Full error message from the kernel log, together with data check I did in
> the evening:
>
> Sep 29 05:07:51 atlas kernel: ata2.00: exception Emask 0x10 SAct 0x0 SErr
> 0x4010000 action 0xe frozen
> Sep 29 05:07:51 atlas kernel: ata2.00: irq_stat 0x00400040, connection
> status changed
> Sep 29 05:07:51 atlas kernel: ata2: SError: { PHYRdyChg DevExch }
> Sep 29 05:07:51 atlas kernel: ata2.00: failed command: FLUSH CACHE EXT
> Sep 29 05:07:51 atlas kernel: ata2.00: cmd
> ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0\x0a res
> 40/00:f4:e2:7f:14/00:00:3a:00:00/40 Emask 0x10 (ATA bus error)
> Sep 29 05:07:51 atlas kernel: ata2.00: status: { DRDY }
> Sep 29 05:07:51 atlas kernel: ata2: hard resetting link
> Sep 29 05:07:57 atlas kernel: ata2: link is slow to respond, please be
> patient (ready=0)
> Sep 29 05:08:00 atlas kernel: ata2: SATA link up 3.0 Gbps (SStatus 123
> SControl 300)
> Sep 29 05:08:00 atlas kernel: ata2.00: configured for UDMA/133
> Sep 29 05:08:00 atlas kernel: ata2.00: retrying FLUSH 0xea Emask 0x10
> Sep 29 05:08:00 atlas kernel: ata2: EH complete

That looks really bad; it sounds like you have a hardware error on at
least one of your disks. Have you tried running badblocks on
both disks to make sure the disk isn't flagging more bad blocks, and
then resynchronizing the RAID 1 array? Then try running e2fsck again.
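A read-only scan along those lines might look like the following sketch (my construction, not a command from the thread). badblocks without -w or -n only reads, so pointing it at the raw mirror members is safe even while the array is assembled; here the commands are only printed so they can be reviewed first:

```shell
# Print (rather than run) a read-only badblocks invocation for each
# mirror member named in this thread. -s shows progress, -v reports
# each unreadable block; the output files stay empty if a disk is clean.
for dev in /dev/sda2 /dev/sdb2; do
  printf 'badblocks -sv %s > %s.bad\n' "$dev" "$(basename "$dev")"
done
```

An empty sda2.bad and sdb2.bad after running these would mean no unreadable sectors were found.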

- Ted

2014-09-30 20:10:13

by Zlatko Calusic

Subject: Re: e2fsck not fixing deleted inode referenced errors?

On 30.09.2014 21:29, Darrick J. Wong wrote:

> Huh. This looks like a normal deleted file... just to ensure we're sane,
> what's the output of:
>
> debugfs -R 'ls <7913865>' /dev/md2

7913865 (12) . 7913864 (12) .. 7912053 (20) bumpmap.la
7912054 (20) bumpmap.so 7912055 (20) colormod.la
7912056 (20) colormod.so 7912057 (24) testfilter.la
7912058 (3968) testfilter.so

> debugfs -R 'ncheck 7913865' /dev/md2

Inode Pathname
7913865 /backup/atlas/usr/lib/x86_64-linux-gnu/imlib2/filters

>
> Hoping 7913865 -> /ext/backup/atlas/usr/lib/x86_64-linux-gnu/imlib2/filters

It is.

>
> By any chance did you save the e2fsck logs?

I'm afraid not. I was in single-user mode and it scrolled a lot. And later
the log in /var/log/fsck was overwritten by the clean fsck run. :(

I've always thought that a history of fsck logs should be kept... something
like /var/log/syslog.

>
> Digging through the e2fsck source code, the only way an inode gets marked used
> is if i_link_count > 0 or ... badblocks thinks the inode table block is bad.
> What does this say?
>
> debugfs -R 'stat <1>' /dev/md2

Inode: 1 Type: bad type Mode: 0000 Flags: 0x0
Generation: 0 Version: 0x00000000
User: 0 Group: 0 Size: 0
File ACL: 0 Directory ACL: 0
Links: 0 Blockcount: 0
Fragment: Address: 0 Number: 0 Size: 0
ctime: 0x4ddb7efa -- Tue May 24 11:48:42 2011
atime: 0x4ddb7efa -- Tue May 24 11:48:42 2011
mtime: 0x4ddb7efa -- Tue May 24 11:48:42 2011
Size of extra inode fields: 0
BLOCKS:

>
>> Tried to delete that directory - impossible, i/o errors. I'll try to
>> reboot now to see if anything changes...
>
> In theory we can use debugfs to clear the directory and then run e2fsck to
> clean up, but let's sanity-check the world before we resort to that. :)
>

My thought exactly. Now that you've reminded me that debugfs exists, I'm
pretty convinced I could just unlink one directory and one file with it,
run e2fsck, and maybe be done. But if you have any other ideas in the
meantime, let's test them first.

--
Zlatko


2014-09-30 20:27:19

by Zlatko Calusic

Subject: Re: e2fsck not fixing deleted inode referenced errors?

On 30.09.2014 21:54, Theodore Ts'o wrote:
> On Tue, Sep 30, 2014 at 08:43:04PM +0200, Zlatko Calusic wrote:
>> Full error message from the kernel log, together with data check I did in
>> the evening:
>>
>> Sep 29 05:07:51 atlas kernel: ata2.00: exception Emask 0x10 SAct 0x0 SErr
>> 0x4010000 action 0xe frozen
>> Sep 29 05:07:51 atlas kernel: ata2.00: irq_stat 0x00400040, connection
>> status changed
>> Sep 29 05:07:51 atlas kernel: ata2: SError: { PHYRdyChg DevExch }
>> Sep 29 05:07:51 atlas kernel: ata2.00: failed command: FLUSH CACHE EXT
>> Sep 29 05:07:51 atlas kernel: ata2.00: cmd
>> ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0\x0a res
>> 40/00:f4:e2:7f:14/00:00:3a:00:00/40 Emask 0x10 (ATA bus error)
>> Sep 29 05:07:51 atlas kernel: ata2.00: status: { DRDY }
>> Sep 29 05:07:51 atlas kernel: ata2: hard resetting link
>> Sep 29 05:07:57 atlas kernel: ata2: link is slow to respond, please be
>> patient (ready=0)
>> Sep 29 05:08:00 atlas kernel: ata2: SATA link up 3.0 Gbps (SStatus 123
>> SControl 300)
>> Sep 29 05:08:00 atlas kernel: ata2.00: configured for UDMA/133
>> Sep 29 05:08:00 atlas kernel: ata2.00: retrying FLUSH 0xea Emask 0x10
>> Sep 29 05:08:00 atlas kernel: ata2: EH complete
>
> That looks really bad; it sounds like you have a hardware error on at
> least one of your disks. Have you tried running badblocks on
> both disks to make sure the disk isn't flagging more bad blocks, and
> then resynchronizing the RAID 1 array? Then try running e2fsck again.
>

Yep, both disks are pretty old, near the end of their warranty. Yet
the interesting thing is that exactly this error (FLUSH CACHE EXT) has
happened from time to time, say once a year, but it never before got me into
such trouble that e2fsck couldn't save the day after one quick run.

I now remember Darrick also asked for smartctl data. Here it is:

/dev/sda
========
Power_On_Hours 40984

and only 2 SMART READ/WRITE LOG errors in the log, from a long time ago...

ATA Error Count: 2
Error 1 occurred at disk power-on lifetime: 14493 hours (603 days + 21
hours)
Error 2 occurred at disk power-on lifetime: 14493 hours (603 days + 21
hours)

Full: http://pastebin.com/GnQhACXf

/dev/sdb (I believe the disk responsible for the problem)
========
Power_On_Hours 40978

No Errors Logged

Full: http://pastebin.com/nUB2q0Tk

Unless you have other ideas, I will run badblocks. Although, as the ext4 fs
is on /dev/md2, should I run it on /dev/md2 only? Or do you really
mean to run it on /dev/sda2 and /dev/sdb2, the underlying devices? I'm not
sure how md would cope with that.

But I'm pretty sure that it will come out clean. The md check I did
last night would surely have detected bad blocks if there were any. Or not?

Thanks for your help!
--
Zlatko


2014-09-30 20:36:35

by Darrick J. Wong

Subject: Re: e2fsck not fixing deleted inode referenced errors?

On Tue, Sep 30, 2014 at 10:27:12PM +0200, Zlatko Calusic wrote:
> On 30.09.2014 21:54, Theodore Ts'o wrote:
> >On Tue, Sep 30, 2014 at 08:43:04PM +0200, Zlatko Calusic wrote:
> >>Full error message from the kernel log, together with data check I did in
> >>the evening:
> >>
> >>Sep 29 05:07:51 atlas kernel: ata2.00: exception Emask 0x10 SAct 0x0 SErr
> >>0x4010000 action 0xe frozen
> >>Sep 29 05:07:51 atlas kernel: ata2.00: irq_stat 0x00400040, connection
> >>status changed
> >>Sep 29 05:07:51 atlas kernel: ata2: SError: { PHYRdyChg DevExch }
> >>Sep 29 05:07:51 atlas kernel: ata2.00: failed command: FLUSH CACHE EXT
> >>Sep 29 05:07:51 atlas kernel: ata2.00: cmd
> >>ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0\x0a res
> >>40/00:f4:e2:7f:14/00:00:3a:00:00/40 Emask 0x10 (ATA bus error)
> >>Sep 29 05:07:51 atlas kernel: ata2.00: status: { DRDY }
> >>Sep 29 05:07:51 atlas kernel: ata2: hard resetting link
> >>Sep 29 05:07:57 atlas kernel: ata2: link is slow to respond, please be
> >>patient (ready=0)
> >>Sep 29 05:08:00 atlas kernel: ata2: SATA link up 3.0 Gbps (SStatus 123
> >>SControl 300)
> >>Sep 29 05:08:00 atlas kernel: ata2.00: configured for UDMA/133
> >>Sep 29 05:08:00 atlas kernel: ata2.00: retrying FLUSH 0xea Emask 0x10
> >>Sep 29 05:08:00 atlas kernel: ata2: EH complete
> >
> >That looks really bad; it sounds like you have a hardware error on at
> >least one of your disks. Have you tried running badblocks on
> >both disks to make sure the disk isn't flagging more bad blocks, and
> >then resynchronizing the RAID 1 array? Then try running e2fsck again.
> >
>
> Yep, both disks are pretty old, somewhere at the end of warranty.
> Yet the interesting thing is that exactly that error (FLUSH CACHE
> EXT) happened from time to time, say once a year, but never before I
> got in such trouble that e2fsck wouldn't save the day after one
> quick run.
>
> I now remember Darrick also asked for smartctl data. Here it is:
>
> /dev/sda
> ========
> Power_On_Hours 40984
>
> and only 2 SMART READ/WRITE LOG errors in the log from long time ago...
>
> ATA Error Count: 2
> Error 1 occurred at disk power-on lifetime: 14493 hours (603 days +
> 21 hours)
> Error 2 occurred at disk power-on lifetime: 14493 hours (603 days +
> 21 hours)

Yes, that looks like SMART commands colliding, i.e. not a media error.

> Full: http://pastebin.com/GnQhACXf
>
> /dev/sdb (I believe the disk responsible for the problem)

sysfs will tell you:

$ ls -l /sys/block/sd*
lrwxrwxrwx 1 root root 0 Sep 30 13:31 /sys/block/sda ->
../devices/pci0000:00/0000:00:1f.2/ata1/host0/target0:0:0/0:0:0:0/block/sda/

^^^^

(So on this system, sda is ata1.00.)
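The port can also be extracted from the symlink target mechanically; a small sketch (the sample path is the one shown above, and the grep pattern is my assumption about how the path is laid out):

```shell
# Pull the ATA port name out of a /sys/block symlink target.
# On a live system the path would come from: readlink /sys/block/sdb
path='../devices/pci0000:00/0000:00:1f.2/ata1/host0/target0:0:0/0:0:0:0/block/sda'
port=$(printf '%s\n' "$path" | grep -Eo 'ata[0-9]+' | head -n1)
echo "$port"   # ata1
```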

> ========
> Power_On_Hours 40978
>
> No Errors Logged
>
> Full: http://pastebin.com/nUB2q0Tk
>
> Unless you have other ideas, I will run badblocks. Although, as ext4
> fs is on /dev/md2, I think I should run it on /dev/md2 only? Do you
> really mean to run it on /dev/sda2, /dev/sdb2 - underlying devices?
> I'm not sure how MD would cope with it.

Both drives, individually.

md should be fine with that, since you're only running the read test.

It would be fun to be able to poke around with an 'e2image /dev/md2' file, but
given the 1TB volume that might be unmanageable.

> But, I'm pretty sure that it will come out clean. The md check I did
> last night would surely detected bad blocks if there were any. Or
> not?

Let's hope so, unless it's some weird transient error...

--D
>
> Thanks for your help!
> --
> Zlatko
>

2014-09-30 21:34:03

by Zlatko Calusic

Subject: Re: e2fsck not fixing deleted inode referenced errors?

On 30.09.2014 22:36, Darrick J. Wong wrote:
>
> sysfs will tell you:
>
> $ ls -l /sys/block/sd*
> lrwxrwxrwx 1 root root 0 Sep 30 13:31 /sys/block/sda ->
> ../devices/pci0000:00/0000:00:1f.2/ata1/host0/target0:0:0/0:0:0:0/block/sda/
>
> ^^^^
>

Nice, thanks. The error definitely came from /dev/sdb.

>
> Both drives, individually.
>
> md should be fine with that, since you're only running the read test.
>

You're right, of course, what was I thinking. :)

OK, 2 x badblocks already running. I'll report tomorrow how it went.

> It would be fun to be able to poke around with an 'e2image /dev/md2' file, but
> given the 1TB volume that might be unmanageable.
>

Yeah. Together with the fact that the fs has millions of files and there's
not enough space on the other disk, I'll have to pass on that one.

>> But, I'm pretty sure that it will come out clean. The md check I did
>> last night would surely detected bad blocks if there were any. Or
>> not?
>
> Let's hope so, unless it's some weird transient error...
>

The worst kind. :( Dunno. I guess tomorrow I'll try unlinking from debugfs,
then e2fsck. First, we'll see if there are any bad blocks around.

Regards,
--
Zlatko


2014-10-01 06:44:03

by Zlatko Calusic

Subject: Re: e2fsck not fixing deleted inode referenced errors?

On 30.09.2014 22:27, Zlatko Calusic wrote:
>
> Unless you have other ideas, I will run badblocks. Although, as ext4 fs
> is on /dev/md2, I think I should run it on /dev/md2 only? Do you really
> mean to run it on /dev/sda2, /dev/sdb2 - underlying devices? I'm not
> sure how MD would cope with it.
>
> But, I'm pretty sure that it will come out clean. The md check I did
> last night would surely detected bad blocks if there were any. Or not?

So, I ran badblocks on both underlying devices, and both came up clean,
as I suspected. Then I ran debugfs -w /dev/md2 and unlinked the
7 inodes that were problematic. Afterwards e2fsck found those inodes and
connected them to lost+found. A second e2fsck run finished clean. And at
this time I see no errors on the fs.
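For the record, that recovery can be sketched as command strings for the six file entries shown in the earlier ls listing (my reconstruction, not the exact transcript; the filesystem must be unmounted before debugfs -w touches it):

```shell
# Generate the debugfs unlink commands for the entries that gave I/O
# errors, plus the follow-up e2fsck runs. Printed for review, not
# executed here; run them only against an unmounted /dev/md2.
dir=/backup/atlas/usr/lib/x86_64-linux-gnu/imlib2/filters
for f in bumpmap.la bumpmap.so colormod.la colormod.so testfilter.la testfilter.so; do
  printf "debugfs -w -R 'unlink %s/%s' /dev/md2\n" "$dir" "$f"
done
printf 'e2fsck -fy /dev/md2\n'   # first run reattaches orphans to lost+found
printf 'e2fsck -fy /dev/md2\n'   # second run should come back clean
```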

This is definitely a case where e2fsck can't see or repair fs errors
that are real.

Regards,
--
Zlatko