Date: Thu, 22 Aug 2019 23:02:07 -0400
From: "Theodore Y. Ts'o"
To: ZhangXiaoxu
Cc: adilger.kernel@dilger.ca, linux-ext4@vger.kernel.org
Subject: Re: [PATCH] ext4: Fix entry corruption when disk online and offline frequently
Message-ID: <20190823030207.GC8130@mit.edu>
References: <1557807817-121893-1-git-send-email-zhangxiaoxu5@huawei.com> <20190517225940.GC21961@mit.edu>
In-Reply-To: <20190517225940.GC21961@mit.edu>
X-Mailing-List: linux-ext4@vger.kernel.org

I've applied this patch with the modified patch summary:

    ext4: treat buffers with write errors as containing valid data

Cheers,

					- Ted

On Fri, May 17, 2019 at 06:59:40PM -0400, Theodore Ts'o wrote:
> On Tue, May 14, 2019 at 12:23:37PM +0800, ZhangXiaoxu wrote:
> > I got some errors when I repaired an ext4 volume which is stacked on
> > an iSCSI target:
> >     Entry 'test60' in / (2) has deleted/unused inode 73750.  Clear?
> > It can be reproduced when the network is not good enough.
> >
> > When I debugged this I found that ext4 will read the entry buffer
> > from disk while the buffer is marked with write_io_error.
> >
> > If the buffer is marked with write_io_error, it means it has already
> > been written to the journal, but not checkpointed to disk.  IOW, the
> > journal is newer than the data on disk.
> > If this journal records 'delete test60', it means 'test60' is still
> > in the on-disk metadata.
> >
> > In this case, if we read the buffer from disk successfully and
> > continue creating files, the new journal record will overwrite the
> > journal record for 'delete test60', and the entry is then corrupted.
> >
> > So, use the buffer rather than reading from disk if the buffer is
> > marked with write_io_error.
>
> You've raised a number of issues about how we handle write errors,
> especially when they occur due to a flaky transport --- in your case,
> due to iSCSI.
> As such, your patch isn't wrong, so much as it is
> incomplete.
>
> For example, your assumption that if the buffer is marked
> write_io_error, it's safe to clear write_io_error and reset
> buffer_uptodate assumes that journalling is enabled.  If the file
> system does not have a journal, there is no journal to fall back
> upon.  For file systems which do have a journal, if you are using a
> flaky iSCSI transport, there is no protection from write errors which
> occur when the journal is replayed.  (fs/jbd2/recovery.c simply marks
> the buffer dirty and allows the writeback code to take care of writing
> the buffer.)  This means that the buffer could have write_io_error set
> due to a failure to write the buffer during recovery, in which case
> relying on the journal having an up-to-date copy of the block is
> invalid.
>
> Also, this patch only patches the ext4_bread() path, which is only used
> by directories.  It doesn't deal with metadata reads for allocation
> bitmaps or extent tree blocks.  We are doing this hack for inode table
> blocks, already; perhaps you got the idea to do this for ext4_bread()
> from __ext4_get_inode_loc()?
>
> We could add some kind of callback from the buffer cache layer when an
> asynchronous writeback fails --- or we could use a synchronous write
> in the journal recovery code (which would be bad from a performance
> perspective, but ignore that for the moment) --- however, what do we
> do when we discover that there is an error?  Right now, we do nothing
> until we try to read the inode table block (and after your patch,
> reading a directory block).  Under memory pressure, though, the data
> will get lost and we don't even mark the file system as needing to be
> checked.  We could retry the write, but if it's due to a flaky iSCSI
> or FC transport, this write could fail yet again --- and then what?
>
> So while I could apply this patch, since it doesn't make things worse,
> I want to make sure you are aware that if you have problems with your
> iSCSI device, this patch is far from a complete solution.  At the very
> least, we should handle reads for other metadata blocks.
>
> 						- Ted
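
For anyone reading this thread in the archive, the idea in the patch
description ("use the buffer rather than read from disk if the buffer
is marked with write_io_error") can be sketched in plain userspace C.
This is only a rough model and not the applied patch: struct
model_buffer and model_buffer_usable() are hypothetical names invented
for illustration, and the caveats raised in the quoted mail (no
journal, write failures during journal recovery, other metadata read
paths) are not modeled at all.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical userspace stand-in for a kernel buffer head.
 * Only the two state bits discussed in this thread are modeled.
 */
struct model_buffer {
	bool uptodate;        /* cached contents are considered valid */
	bool write_io_error;  /* an earlier writeback of this buffer failed */
};

/*
 * Return true if the cached contents may be used as-is, false if the
 * caller would have to read the block from disk.
 *
 * Assumption taken from the patch description: a buffer whose
 * writeback failed holds data that is newer than what is on disk, so
 * it is marked valid again instead of being re-read.
 */
bool model_buffer_usable(struct model_buffer *b)
{
	assert(b != NULL);

	if (!b->uptodate && b->write_io_error)
		b->uptodate = true;

	return b->uptodate;
}
```

In this model a buffer with uptodate clear and write_io_error set comes
back as usable, while a buffer with both bits clear still requires a
read from disk.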