From: Jean-Louis Dupond
To: "Theodore Y. Ts'o"
Cc: linux-ext4@vger.kernel.org
Subject: Re: Filesystem corruption after unreachable storage
Date: Thu, 20 Feb 2020 10:08:44 +0100
Message-ID: <3a7bc899-31d9-51f2-1ea9-b3bef2a98913@dupond.be>
In-Reply-To: <20200124203725.GH147870@mit.edu>
References: <20200124203725.GH147870@mit.edu>
X-Mailing-List: linux-ext4@vger.kernel.org

As the mail seems to have been trashed somewhere, I'll retry :)

Thanks
Jean-Louis

On 24/01/2020 21:37, Theodore Y. Ts'o wrote:
> On Fri, Jan 24, 2020 at 11:57:10AM +0100, Jean-Louis Dupond wrote:
>> There was a short disruption of the SAN, which caused it to be
>> unavailable for 20-25 minutes for the ESXi.
>
> 20-25 minutes is "short"?  I guess it depends on your definition / POV. :-)

Well, the recovery (due to the manual fsck) caused more downtime than
the storage outage itself :)

>
>> What worries me is that almost all of the VM's (out of 500) were
>> showing the same error.
> So that's a bit surprising...

Indeed, that's where I thought something went wrong here!
I've tried to reproduce it, and I was able to trigger the same error
when the SAN recovers BEFORE the VM is shut down.
If I power off the VM first and then recover the SAN, it does an
automatic fsck without problems.
So it really seems things break the moment the VM can write to the SAN
again.
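(As an illustration, not something from the thread itself: one way to see
whether a given guest actually has the error flag recorded in its
superblock, before running e2fsck, is to dump the superblock header. The
device name below is simply the one from the fsck transcript further down:

# dumpe2fs -h /dev/mapper/vg01-root 2>/dev/null | grep -iE 'state|error'

On an affected image I'd expect a "Filesystem state:" of something like
"clean with errors", plus non-zero error count / last-error fields, while
a cleanly shut-down guest should just report "clean".)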
>
>> And even some (+-10) were completely corrupt.
> What do you mean by "completely corrupt"?  Can you send an e2fsck
> transcript of file systems that were "completely corrupt"?

Well, it was moving tons of files to lost+found etc., so those were
really broken. I'll see if I can recover a backup of one in that broken
state. Anyway, this was only a very small percentage, so it worries me
less than the rest :)

>
>> Is there for example a chance that the filesystem gets corrupted the
>> moment the SAN storage was back accessible?
> Hmm... the one possibility I can think of off the top of my head is
> that in order to mark the file system as containing an error, we need
> to write to the superblock.  The head of the linked list of orphan
> inodes is also in the superblock.  If that had gotten modified in the
> intervening 20-25 minutes, it's possible that this would result in
> orphaned inodes not on the linked list, causing that error.
>
> It doesn't explain the more severe cases of corruption, though.

If fixing that would leave us with only 10 corrupt disks instead of 500,
that would be a big win :)

>
>> I also have some snapshot available of a corrupted disk if some
>> additional debugging info is required.
> Before e2fsck was run?  Can you send me a copy of the output of
> dumpe2fs run on that disk, and then transcript of e2fsck -fy run on a
> copy of that snapshot?

Sure:

dumpe2fs -> see attachment

Fsck:
# e2fsck -fy /dev/mapper/vg01-root
e2fsck 1.44.5 (15-Dec-2018)
Pass 1: Checking inodes, blocks, and sizes
Inodes that were part of a corrupted orphan linked list found.  Fix? yes
Inode 165708 was part of the orphaned inode list.  FIXED.
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Block bitmap differences:  -(863328--863355)  Fix? yes
Free blocks count wrong for group #26 (3485, counted=3513).  Fix? yes
Free blocks count wrong (1151169, counted=1151144).  Fix? yes
Inode bitmap differences:  -4401 -165708  Fix? yes
Free inodes count wrong for group #0 (2489, counted=2490).  Fix? yes
Free inodes count wrong for group #20 (1298, counted=1299).  Fix? yes
Free inodes count wrong (395115, counted=395098).  Fix? yes

/dev/mapper/vg01-root: ***** FILE SYSTEM WAS MODIFIED *****
/dev/mapper/vg01-root: 113942/509040 files (0.2% non-contiguous), 882520/2033664 blocks

>
>> It would be great to gather some feedback on how to improve the situation
>> (next to of course have no SAN outage :)).
> Something that you could consider is setting up your system to trigger
> a panic/reboot on a hung task timeout, or when ext4 detects an error
> (see the man page of tune2fs and mke2fs and the -e option for those
> programs).
>
> There are tradeoffs with this, but if you've lost the SAN for 15-30
> minutes, the file systems are going to need to be checked anyway, and
> the machine will certainly not be serving.  So forcing a reboot might
> be the best thing to do.

Going to look into that! Thanks for the info.

>> On KVM for example there is an unlimited timeout (afaik) until the
>> storage is back, and the VM just continues running after storage
>> recovery.
> Well, you can adjust the SCSI timeout, if you want to give that a try....

Does that have other disadvantages? Or is it quite safe to increase the
SCSI timeout?

>
> Cheers,
>
> - Ted
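(For reference, a rough sketch of the knobs discussed above; the device
names are just placeholders taken from this thread and the values are
examples, not tuned recommendations:

# tune2fs -e panic /dev/mapper/vg01-root       <- panic the guest instead of continuing when ext4 hits an error
# sysctl -w kernel.hung_task_timeout_secs=120  <- how long a task may stay blocked before the hung-task detector fires
# sysctl -w kernel.hung_task_panic=1           <- turn the hung-task warning into a panic
# sysctl -w kernel.panic=10                    <- reboot 10 seconds after a panic instead of hanging
# echo 180 > /sys/block/sda/device/timeout     <- per-device SCSI command timeout in seconds (default is usually 30)

The tune2fs setting is stored persistently in the superblock (the same
behaviour can also be requested per mount with -o errors=panic), while
the sysctl and sysfs values are runtime-only and would need sysctl.conf
or a udev rule to survive a reboot.)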