From: Jean-Louis Dupond
To: "Theodore Y. Ts'o"
Cc: linux-ext4@vger.kernel.org
Subject: Re: Filesystem corruption after unreachable storage
Date: Thu, 20 Feb 2020 10:08:44 +0100
Message-ID: <3a7bc899-31d9-51f2-1ea9-b3bef2a98913@dupond.be>
In-Reply-To: <20200124203725.GH147870@mit.edu>
References: <20200124203725.GH147870@mit.edu>
X-Mailing-List: linux-ext4@vger.kernel.org

As the mail seems to have been trashed somewhere, I'll retry :)

Thanks
Jean-Louis

On 24/01/2020 21:37, Theodore Y. Ts'o wrote:
> On Fri, Jan 24, 2020 at 11:57:10AM +0100, Jean-Louis Dupond wrote:
>> There was a short disruption of the SAN, which caused it to be
>> unavailable for 20-25 minutes for the ESXi.
>
> 20-25 minutes is "short"?  I guess it depends on your definition / POV. :-)

Well, the recovery (due to the manual fsck) caused more downtime than
the storage outage itself :)

>
>> What worries me is that almost all of the VM's (out of 500) were
>> showing the same error.
> So that's a bit surprising...

Indeed, that's where I thought something went wrong here!
I've tried to reproduce it, and I was able to trigger the same error
when the SAN recovers BEFORE the VM is shut down.
If I power off the VM first and then recover the SAN, it does an
automatic fsck without problems.
So it really seems things break the moment the VM can write to the SAN
again.
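(As an illustration, not something from the thread itself: one way to see
whether a given guest actually has the error flag recorded in its
superblock, before running e2fsck, is to dump the superblock header. The
device name below is simply the one from the fsck transcript further down:

# dumpe2fs -h /dev/mapper/vg01-root 2>/dev/null | grep -iE 'state|error'

On an affected image I'd expect a "Filesystem state:" of something like
"clean with errors", plus non-zero error count / last-error fields, while
a cleanly shut-down guest should just report "clean".)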
>
>> And even some (+-10) were completely corrupt.
> What do you mean by "completely corrupt"?  Can you send an e2fsck
> transcript of file systems that were "completely corrupt"?

Well, it was moving tons of files to lost+found etc., so those were
really broken. I'll see if I can recover a backup of one in that broken
state. Anyway, this was only a very small percentage, so it worries me
less than the rest :)

>
>> Is there for example a chance that the filesystem gets corrupted the
>> moment the SAN storage was back accessible?
> Hmm... the one possibility I can think of off the top of my head is
> that in order to mark the file system as containing an error, we need
> to write to the superblock.  The head of the linked list of orphan
> inodes is also in the superblock.  If that had gotten modified in the
> intervening 20-25 minutes, it's possible that this would result in
> orphaned inodes not on the linked list, causing that error.
>
> It doesn't explain the more severe cases of corruption, though.

If fixing that would leave us with only 10 corrupt disks instead of 500,
that would be a big win :)

>
>> I also have some snapshot available of a corrupted disk if some
>> additional debugging info is required.
> Before e2fsck was run?  Can you send me a copy of the output of
> dumpe2fs run on that disk, and then transcript of e2fsck -fy run on a
> copy of that snapshot?

Sure:

dumpe2fs -> see attachment

Fsck:
# e2fsck -fy /dev/mapper/vg01-root
e2fsck 1.44.5 (15-Dec-2018)
Pass 1: Checking inodes, blocks, and sizes
Inodes that were part of a corrupted orphan linked list found.  Fix? yes
Inode 165708 was part of the orphaned inode list.  FIXED.
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Block bitmap differences:  -(863328--863355)  Fix? yes
Free blocks count wrong for group #26 (3485, counted=3513).  Fix? yes
Free blocks count wrong (1151169, counted=1151144).  Fix? yes
Inode bitmap differences:  -4401 -165708  Fix? yes
Free inodes count wrong for group #0 (2489, counted=2490).  Fix? yes
Free inodes count wrong for group #20 (1298, counted=1299).  Fix? yes
Free inodes count wrong (395115, counted=395098).  Fix? yes

/dev/mapper/vg01-root: ***** FILE SYSTEM WAS MODIFIED *****
/dev/mapper/vg01-root: 113942/509040 files (0.2% non-contiguous), 882520/2033664 blocks

>
>> It would be great to gather some feedback on how to improve the situation
>> (next to of course have no SAN outage :)).
> Something that you could consider is setting up your system to trigger
> a panic/reboot on a hung task timeout, or when ext4 detects an error
> (see the man page of tune2fs and mke2fs and the -e option for those
> programs).
>
> There are tradeoffs with this, but if you've lost the SAN for 15-30
> minutes, the file systems are going to need to be checked anyway, and
> the machine will certainly not be serving.  So forcing a reboot might
> be the best thing to do.

Going to look into that! Thanks for the info.

>> On KVM for example there is an unlimited timeout (afaik) until the
>> storage is back, and the VM just continues running after storage
>> recovery.
> Well, you can adjust the SCSI timeout, if you want to give that a try....

Does that have other disadvantages? Or is it quite safe to increase the
SCSI timeout?

>
> Cheers,
>
> - Ted
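(For reference, a rough sketch of the knobs discussed above; the device
names are just placeholders taken from this thread and the values are
examples, not tuned recommendations:

# tune2fs -e panic /dev/mapper/vg01-root       <- panic the guest instead of continuing when ext4 hits an error
# sysctl -w kernel.hung_task_timeout_secs=120  <- how long a task may stay blocked before the hung-task detector fires
# sysctl -w kernel.hung_task_panic=1           <- turn the hung-task warning into a panic
# sysctl -w kernel.panic=10                    <- reboot 10 seconds after a panic instead of hanging
# echo 180 > /sys/block/sda/device/timeout     <- per-device SCSI command timeout in seconds (default is usually 30)

The tune2fs setting is stored persistently in the superblock (the same
behaviour can also be requested per mount with -o errors=panic), while
the sysctl and sysfs values are runtime-only and would need sysctl.conf
or a udev rule to survive a reboot.)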