Date: Wed, 10 Jun 2020 11:57:39 +0200
From: Jan Kara
To: "zhangyi (F)"
Cc: Jan Kara, linux-ext4@vger.kernel.org, tytso@mit.edu, adilger.kernel@dilger.ca, zhangxiaoxu5@huawei.com
Subject: Re: [PATCH 00/10] ext4: fix inconsistency since reading old metadata from disk
Message-ID: <20200610095739.GE12551@quack2.suse.cz>
References: <20200526071754.33819-1-yi.zhang@huawei.com> <20200608082007.GJ13248@quack2.suse.cz> <20200609121920.GB12551@quack2.suse.cz> <45796804-07f7-2f62-b8c5-db077950d882@huawei.com>
In-Reply-To: <45796804-07f7-2f62-b8c5-db077950d882@huawei.com>
On Wed 10-06-20 16:55:15, zhangyi (F) wrote:
> Hi, Jan.
> 
> On 2020/6/9 20:19, Jan Kara wrote:
> > On Mon 08-06-20 22:39:31, zhangyi (F) wrote:
> >>> On Tue 26-05-20 15:17:44, zhangyi (F) wrote:
> >>>> Background
> >>>> ==========
> >>>>
> >>>> This patch set aims to fix the inconsistency problem which has been
> >>>> discussed and partially fixed in [1].
> >>>>
> >>>> Now, the problem occurs on unstable storage with a flaky transport
> >>>> (e.g. an iSCSI transport may disconnect for a few seconds and
> >>>> reconnect due to a bad network environment): if an async write of
> >>>> metadata fails in the background, the end-write routine in the block
> >>>> layer will clear the buffer's uptodate flag, even though the data in
> >>>> that buffer is actually uptodate. Finally, we may read "old &&
> >>>> inconsistent" metadata from the disk when we get the buffer later,
> >>>> because not only was the uptodate flag cleared but we also do not
> >>>> check the write IO error flag, or even worse the buffer has been
> >>>> freed due to memory pressure.
> >>>>
> >>>> Fortunately, if jbd2 does a checkpoint after the async IO error
> >>>> happens, the checkpoint routine will check the write_io_error flag
> >>>> and abort the journal if it detects an IO error. And in the journal
> >>>> recovery case, the recovery code will invoke sync_blockdev() after
> >>>> recovery completes, so it will also detect the IO error and refuse
> >>>> to mount the filesystem.
> >>>>
> >>>> Current ext4 already deals with this problem in __ext4_get_inode_loc()
> >>>> and commit 7963e5ac90125 ("ext4: treat buffers with write errors as
> >>>> containing valid data"), but it's not enough.
> >>>
> >>> Before we go and complicate ext4 code like this, I'd like to understand
> >>> what the desired outcome is, which doesn't seem to be mentioned here, in
> >>> the commit 7963e5ac90125, or in the discussion you reference. If you
> >>> have a flaky transport that gives you IO errors, IMO it is not the
> >>> business of the filesystem to try to fix that. It just has to make sure
> >>> it properly reports
> >>
> >> If we meet some IO errors due to the flaky transport, IMO the desired
> >> outcome is 1) report the IO error; 2) the ext4 filesystem acts as the
> >> "errors=xxx" configuration specifies; if we set "errors=read-only ||
> >> panic", we expect ext4 to remount read-only or panic immediately to
> >> avoid inconsistency. In brief, the kernel should try its best to
> >> guarantee that the filesystem on disk is consistent; this will reduce
> >> fsck's work (AFAIK, fsck cannot fix those inconsistencies in auto mode
> >> for most cases caused by the async error problem I mentioned), so we
> >> could recover the fs automatically at the next boot.
> > 
> > Good, so I fully agree with your goals. Let's now talk about how to
> > achieve them :)
> > 
> >> But now, in the case of metadata async writeback, (1) is done in
> >> end_buffer_async_write(), but (2) is not guaranteed, because ext4 cannot
> >> detect the metadata write error, and it also cannot remount the
> >> filesystem or invoke panic immediately. Finally, if we read the metadata
> >> on disk and re-write it again, it may lead to on-disk filesystem
> >> inconsistency.
> > 
> > Ah, I see. This was the important bit I was missing. And I think the
> > real problem here is that ext4 cannot detect metadata write errors from
> > async writeback. So my plan would be to detect metadata write errors
> > early and abort the journal and do appropriate errors=xxx handling.
> > And a relatively simple way to do that these days would be to use errseq
> > in the block device's mapping - sb->s_bdev->bd_inode->i_mapping->wb_err -
> > which gets incremented whenever there's a writeback error in the block
> > device mapping. So (probably in ext4_journal_check_start()) we could
> > check whether wb_err is different from the original value we sampled at
> > mount time, and if yes, we know a metadata writeback error has happened
> > and we trigger the error handling. What do you think?
> 
> Thanks a lot for your suggestion, this solution looks good to me. But I
> think adding the 'wb_err' check in ext4_journal_check_start() may be too
> early, see the race condition below (it's just theoretical analysis, I
> have not tested it):
> 
> ext4_journal_start()
>   ext4_journal_check_start()           <-- pass checking
>                               | end_buffer_async_write()
>                               |   mark_buffer_write_io_error()  <-- set b_page
> sb_getblk(bh)                           <-- read old data from disk
> ext4_journal_get_write_access(bh)
> modify this bh                          <-- modify data and lead to inconsistency
> ext4_handle_dirty_metadata(bh)
> 
> So I guess it may still lead to inconsistency. How about adding this check
> to ext4_journal_get_write_access()?

Yes, this also occurred to me later. Adding the check to
ext4_journal_get_write_access() should be safer.

								Honza
-- 
Jan Kara
SUSE Labs, CR
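
[Editor's note] A minimal sketch of the errseq_t-based detection discussed above:
sample the block device mapping's wb_err cursor at mount time, then compare
against it before handing out write access to a metadata buffer, per the thread's
conclusion that ext4_journal_get_write_access() is the safer place. The helper
names and the s_bdev_wb_err field are illustrative assumptions, not the actual
ext4 implementation, and a real version would need locking if it advanced the
stored cursor.

	/*
	 * Sketch only: illustrates the errseq_t idea from the thread above.
	 * The s_bdev_wb_err field and both helpers are hypothetical names.
	 */
	#include <linux/errseq.h>
	#include <linux/blkdev.h>
	#include "ext4.h"

	/* At mount time: remember the bdev mapping's current writeback error cursor. */
	static void ext4_sample_bdev_wb_err(struct super_block *sb)
	{
		EXT4_SB(sb)->s_bdev_wb_err =
			errseq_sample(&sb->s_bdev->bd_inode->i_mapping->wb_err);
	}

	/*
	 * Called before granting write access to a metadata buffer (e.g. from
	 * ext4_journal_get_write_access()).  Returns a negative errno if the
	 * block device mapping saw a writeback error after the value sampled
	 * at mount time; the caller would then trigger the usual errors=xxx
	 * handling (journal abort, remount read-only, panic, ...).
	 */
	static int ext4_check_bdev_wb_err(struct super_block *sb)
	{
		return errseq_check(&sb->s_bdev->bd_inode->i_mapping->wb_err,
				    EXT4_SB(sb)->s_bdev_wb_err);
	}

Note that errseq_check() only compares against the sampled value, so once an
error has happened every subsequent call keeps reporting it; if the intent were
to report each new error exactly once, errseq_check_and_advance() with a lock
protecting the stored cursor would be the alternative.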
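
[Editor's note] For completeness, the mitigation already referenced in the quoted
cover letter (commit 7963e5ac90125, "ext4: treat buffers with write errors as
containing valid data") boils down to the pattern below. This is a paraphrased
sketch rather than a verbatim copy of the ext4 helper: a buffer whose uptodate
flag was cleared only because an async write failed still holds the newest data,
so it should not be re-read from disk.

	#include <linux/buffer_head.h>

	/*
	 * Sketch of the pattern from commit 7963e5ac90125: if the buffer
	 * carries the write-error flag, the write-out failed but the
	 * in-memory contents are still valid, so mark it uptodate instead
	 * of going back to disk and reading old metadata.
	 */
	static bool ext4_buffer_uptodate(struct buffer_head *bh)
	{
		if (!buffer_uptodate(bh) && buffer_write_io_error(bh))
			set_buffer_uptodate(bh);
		return buffer_uptodate(bh);
	}

As the cover letter notes, this only covers reads that go through such a helper;
it does not by itself make ext4 react to the async write error, which is what the
errseq_t check sketched above is for.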