Date: Mon, 9 Mar 2020 18:32:38 -0400
From: "Theodore Y. Ts'o"
To: Jean-Louis Dupond
Cc: linux-ext4@vger.kernel.org
Subject: Re: Filesystem corruption after unreachable storage
Message-ID: <20200309223238.GC4852@mit.edu>
References: <20200124203725.GH147870@mit.edu>
 <3a7bc899-31d9-51f2-1ea9-b3bef2a98913@dupond.be>
 <20200220155022.GA532518@mit.edu>
 <7376c09c-63e3-488f-fcf8-89c81832ef2d@dupond.be>
 <20200225172355.GA14617@mit.edu>
 <50f93ccb-2b2c-15c5-8b08-facc3a25068a@dupond.be>
 <20200309151838.GA4852@mit.edu>
 <93e74f9f-6694-a3e9-4fac-981389522d25@dupond.be>
In-Reply-To: <93e74f9f-6694-a3e9-4fac-981389522d25@dupond.be>
X-Mailing-List: linux-ext4@vger.kernel.org

On Mon, Mar 09, 2020 at 04:33:52PM +0100, Jean-Louis Dupond wrote:
> On 9/03/2020 16:18, Theodore Y. Ts'o wrote:
> > Did the panic happen immediately, or did things hang until the storage
> > recovered, and *then* it rebooted.  Or did the hard reset and reboot
> > happened before the storage network connection was restored?
>
> The panic (well it was just frozen, no stacktrace or automatic reboot) did
> happen *after* storage came back online.
> So nothing happens while the storage is offline, even if we wait until the
> scsi timeout is exceeded (180s * 6).
> It's only when the storage returns that the filesystem goes read-only /
> panic (depending on the error setting).

So I under why the scsi timeout isn't sufficient to keep the panic
from hanging.

> If we do reset the VM before storage is back, the filesystem check just goes
> fine in automatic mode.
> So I think we should (in some cases) not try to update the superblock
> anymore on I/O errors, but just go read-only/panic.
> Cause it seems like updating the superblock makes things worse.

The problem is that from the file system's perspective, we don't know
why the I/O error has happened.  Is it because of timeout, or is it
because of a media error?
In the case where an SSD really was unable to write to a
metadata block, we *do* want to update the superblock.

There is a return status that the block device could send back,
BLK_STS_TIMEOUT, but it's not set by the SCSI layer.  It is set by the
network block device (nbd), but it looks like the SCSI layer just
returns BLK_STS_IOERR if I'm reading the code correctly.

> Or changes could be made to e2fsck to allow automatic repair of this kind of
> error for example?

The fundamental problem is we don't know what "kind of error" has
taken place.  If we did, we could theoretically have some kind of
mount option which means "in case of timeout, reboot the system
without setting some kind of file system error".  But we need to know
that the I/O error was caused by a timeout first.

					- Ted
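
To make the distinction above concrete, here is a rough userspace sketch
of how such a policy might look.  Everything in it is hypothetical: the
enums are stand-ins modeled on BLK_STS_IOERR and BLK_STS_TIMEOUT, not
the kernel's blk_status_t definitions, and handle_write_status() and its
timeout_reboot flag are invented names for the theoretical mount option.
It is not ext4 code.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the block-layer status codes named in
 * the mail.  These are NOT the kernel's blk_status_t values. */
enum io_status {
	IO_STS_OK,
	IO_STS_IOERR,	/* models BLK_STS_IOERR: generic I/O failure */
	IO_STS_TIMEOUT	/* models BLK_STS_TIMEOUT: request timed out */
};

/* What the file system does after a metadata write completes. */
enum fs_action {
	FS_CONTINUE,		/* write succeeded, nothing to do */
	FS_RECORD_ERROR,	/* record the error in the superblock,
				 * then go read-only or panic */
	FS_REBOOT_NO_RECORD	/* reboot without touching the superblock */
};

/*
 * timeout_reboot models the theoretical mount option: on a timeout,
 * reboot without setting a file system error.  Any other failure is
 * treated as a possible media error and is recorded as usual.
 */
enum fs_action handle_write_status(enum io_status st, bool timeout_reboot)
{
	if (st == IO_STS_OK)
		return FS_CONTINUE;
	if (st == IO_STS_TIMEOUT && timeout_reboot)
		return FS_REBOOT_NO_RECORD;
	return FS_RECORD_ERROR;
}
```

Note that handle_write_status(IO_STS_IOERR, true) still yields
FS_RECORD_ERROR.  That is the catch: as long as the lower layer reports
a timeout as a generic I/O error, the option has nothing to act on.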