From: Theodore Ts'o
Subject: Re: 4.7.0-rc7 ext4 error in dx_probe
Date: Mon, 18 Jul 2016 09:38:43 -0400
Message-ID: <20160718133843.GA26664@thunk.org>
References: <20160718105707.GA4253@sig21.net>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Cc: linux-ext4@vger.kernel.org, linux-kernel@vger.kernel.org
To: Johannes Stezenbach
Return-path: 
Content-Disposition: inline
In-Reply-To: <20160718105707.GA4253@sig21.net>
Sender: linux-kernel-owner@vger.kernel.org
List-Id: linux-ext4.vger.kernel.org

On Mon, Jul 18, 2016 at 12:57:07PM +0200, Johannes Stezenbach wrote:
>
> I'm running 4.7.0-rc7 with ext4 on lvm on dm-crypt on SSD
> and out of the blue on idle machine the following error
> message appeared:
>
> [373851.683131] EXT4-fs (dm-3): error count since last fsck: 1
> [373851.683151] EXT4-fs (dm-3): initial error at time 1468438194: dx_probe:740: inode 22288562
> [373851.683158] EXT4-fs (dm-3): last error at time 1468438194: dx_probe:740: inode 22288562
>
> inode 22288562 is a directory with ~800 small files in it,
> but AFAICT nothing was accessing it, no cron job running etc.
> No further error message was logged.  Accessing the directory
> and the files in it also gives no further errors.

Yes, these messages get printed once a day if a file system
corruption was detected earlier.  The problem is that people
unfortunately run with their file systems set to errors=continue,
which I sometimes refer to as the "don't worry, be happy" option.
Sometimes this can cause data loss, and because people aren't
checking their logs, it's possible for errors to lurk hidden for a
long time without being noticed.  Then they complain, and the people
who are trying to resolve the bug report might not realize that the
file system has in fact been corrupted for ages.

I strongly suggest that people either (a) run automatic log analysis
software that informs you when a file system corruption was detected,
or (b) run with errors=panic, which will at least force a file system
check when the system reboots, or (c) run with errors=remount-ro,
which keeps the file system from getting further corrupted.

> Searching back in the log at date -d @1468438194 I found:
>
> Jul 13 21:29:54 foo kernel: EXT4-fs error (device dm-3): dx_probe:740: inode #22288562: comm git: Directory index failed checksum
>
> Time to run fsck?  Is it the consequence of a previous crash
> (I had many recently)?

In general, whenever an EXT4-fs error is registered, you probably
want to run fsck right away.  The default of errors=continue was more
appropriate back in the day when servers were pets, and many people
would complain if we just unceremoniously forced a reboot when we
noticed a file system corruption, or remounted the file system
read-only, which could lead to some surprising failures at the
application stack level that could be hard to debug.  But in a
Cloud/Kubernetes world of "servers are cattle", forcing a reboot is
probably the best thing to do, so the fsck can be run right away.

As far as what caused it, I'm not sure.  If you can now look at the
directory without any problems, it's possible that it was caused by a
transient hardware glitch.  That's because when we issue the
"Directory index failed checksum" message, the directory block is
discarded (to avoid potential further file system corruption).  If
you can list the directory now, it would appear that a subsequent
attempt to reread the directory block was successful.
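As an aside, the first/last error information in those once-a-day
messages is kept in the superblock, so you can read it back at any
time with the usual e2fsprogs tools, and you can switch the error
behavior the same way.  A rough sketch along these lines (dm-3 is
just the device name from your report; substitute whatever /dev/mapper
path actually applies on your system):

    # show the recorded error state (first/last error time, function,
    # inode) and the current "Errors behavior" setting
    dumpe2fs -h /dev/dm-3 | grep -i error

    # a crude substitute for real log analysis: look for any EXT4-fs
    # errors that made it into the kernel log
    dmesg | grep "EXT4-fs error"

    # switch the default behavior from errors=continue to remount-ro;
    # an errors=remount-ro mount option in /etc/fstab does the same
    tune2fs -e remount-ro /dev/dm-3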
But in matters like this, I generally advise a "better safe than
sorry" approach: run fsck.  If you are using LVM, you might be able
to create a snapshot and then run fsck -n on the snapshot, to make
sure the file system is OK without needing to shut down your server.
An example of such a script can be found here:

http://git.kernel.org/cgit/fs/ext2/e2fsprogs.git/tree/contrib/e2croncheck

Cheers,

					- Ted
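P.S.  A rough sketch of what such a snapshot check might look like,
loosely modeled on e2croncheck.  The volume group and LV names below
are made-up placeholders; adjust them, the snapshot size, and the
error handling for your setup:

    # create a small snapshot of the logical volume holding the fs
    # ("vg0" and "home" are placeholder names)
    lvcreate -s -L 512M -n home-snap /dev/vg0/home

    # check the snapshot: -f forces a full check, -n opens the file
    # system read-only and answers "no" to all questions
    if e2fsck -fn /dev/vg0/home-snap; then
            echo "file system looks clean"
    else
            echo "e2fsck reported problems; schedule a real fsck" >&2
    fi

    # drop the snapshot
    lvremove -f /dev/vg0/home-snap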