Date: Sun, 29 Dec 2019 09:37:06 -0500
From: "Theodore Y. Ts'o"
To: xiaohui li
Cc: Ext4 Developers List
Subject: Re: the side effect of enlarger max mount count in ext4 superblock
Message-ID: <20191229143706.GA7177@mit.edu>
References: <20191226130935.GA3158@mit.edu>
X-Mailing-List: linux-ext4@vger.kernel.org

On Sun, Dec 29, 2019 at 02:58:21PM +0800, xiaohui li wrote:
>
> shall the e2fsck tool can be divided into two parts ?
> one part only do the full data consistency check work, focus on
> checking if data has inconsistency just when ext4 filesystem has been
> frozen or very few IO activities are going on.
> and the other part can be doing the actual repair work if data
> inconsistent has encountered.

Alas, that's not really practical.  In order to repair a particular
part of the file system, you need to know what the correct value
should be.  And calculating the correct value sometimes requires
global knowledge of the entire file system state.

For example, consider the inode field i_links_count.  For regular
files, the value of this field is the number of references from
directory entries (in other words, links) that point at a particular
inode.  If the correct value is 2 (there are two directory entries
which reference this inode), but it is incorrectly set to 1, then when
the first directory entry is removed with an unlink system call, the
i_links_count will go to zero, and the kernel will free the inode and
its blocks, leaving those blocks to be used by other inodes.  But
there still is a valid reference to that inode, and the potential
result is that one or more files will get corrupted, because blocks
can end up being claimed by different inodes.

So there are a couple of things to learn from this.  First,
determining whether or not the field is corrupted is 99.999% of the
effort.  Once you know the correct value, the repair part is trivial.
So separating the consistency check and repair efforts doesn't make
much sense.

Second, when we are considering the i_links_count for a particular
inode, we have no idea where in the directory tree structure the
directory entries which reference that inode might be located.  So we
have to examine all of the blocks of all directories in order to
determine the value of each inode's i_links_count.  And of course, if
the contents of the directory blocks are changing while you are trying
to calculate the i_links_count for all of the inodes in the file
system, this makes the job incredibly difficult.  Effectively, it also
requires reading all of the metadata blocks and making sure that they
are consistent with each other, and this requires a lot of memory and
a lot of I/O bandwidth.
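To make this concrete, here is a rough sketch of the same kind of
counting pass done in userspace against a mounted tree (the mount
point below is made up, and it needs GNU find).  It only illustrates
the bookkeeping; e2fsck does the equivalent accounting against the
raw, unmounted metadata:

    # Illustration only: count directory-entry references per inode on a
    # mounted tree and compare them against the stored link count.
    MNT=/mnt/test          # made-up mount point
    find "$MNT" -xdev -type f -printf '%i %n\n' |
        awk '{ refs[$1]++; nlink[$1] = $2 }
             END { for (ino in refs)
                       if (refs[ino] != nlink[ino])
                           printf "inode %s: found %d refs, i_links_count says %d\n",
                                  ino, refs[ino], nlink[ino] }'

Even this toy version has to visit every directory entry under the
mount point before it can say anything about any single inode, and it
falls apart as soon as files are created or unlinked underneath it --
which is exactly the problem described above.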
> but i wonder if some problems will happen if doing the full data
> consistency checking online, without ext4 filesystem umount.
> so even if very few io activities are going on, the data checking
> can't be implemented. just because some file data may be in memory,
> not in disk.
> so the data consistency checking only can be started when ext4
> filesystem has been frozen from my viewpoint, at least at this moment,
> file data can be returned back to disk as much as possible.

So we can do this already.  It's called e2scrub[1].  It requires using
dm_snapshot, so we can create a frozen copy of the file system, and
then we check that frozen file system.

[1] https://manpages.debian.org/testing/e2fsprogs/e2scrub.8.en.html

This has tradeoffs.  The first, and most important, is that if any
problems are found, you need to unmount the file system and then rerun
e2fsck on the actual file system (as opposed to the frozen copy) to
actually effectuate the repair.  So if you have a large 100TB RAID
array on which fsck takes hours to run, first of all you need to
reserve enough space in the snapshot partition to save an original
copy of all blocks written to the file system while the e2fsck is
running.  This could potentially be a large amount of storage.

Secondly, if a problem is found, now what?  The current e2scrub sends
an e-mail to the system administrator, requesting that the sysadmin
schedule downtime so the system can be rebooted and e2fsck run on the
unmounted file system so it can be fixed.  If it took hours to detect
that the file system was corrupted, it will take hours to repair the
file system, and the system will be out of service during that time.

I'm not convinced this would work terribly well on an Android device.
E2scrub was designed for enterprise servers that might be running for
years without a reboot, and the goal was to allow a periodic sanity
check (say, every few months) to make sure there weren't any problems
that had accumulated due to cosmic rays flipping bits in the DRAM
(although hopefully all enterprise servers are using ECC memory), etc.

One thing that we could do to optimize things a bit is to enhance
dm_snapshot so that it only makes a copy of the original block if the
I/O indicates that it is a metadata block.  This would reduce the
amount of space that needs to be reserved for the snapshot volume, and
it would reduce the overhead of dm_snapshot while the fsck is running.
This isn't something that has been done, because e2scrub isn't all
that commonly used, and most uses of dm_snapshot want the data blocks
snapshotted as well as the metadata blocks.

So if you are looking for a project, one thing you could perhaps do is
approach the device mapper developers at dm-devel@vger.kernel.org and
try to add this feature to dm_snapshot.  It might be, though, that
getting your Android devices onto the latest kernels and using the
highest quality flash would be a better approach in the long run.

Cheers,

						- Ted
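P.S.  For anyone curious about what e2scrub is doing under the hood,
the core of it can be approximated by hand with LVM.  The volume
group, LV name, and snapshot size below are made up, and the real
script also takes care of locking, free-space checks, and reporting:

    # Assumes /home lives on the (made-up) logical volume /dev/vg0/home.
    lvcreate -s -L 256M -n home.check /dev/vg0/home   # frozen, copy-on-write snapshot
    e2fsck -f -n /dev/vg0/home.check                  # read-only check of the snapshot
    lvremove -f /dev/vg0/home.check                   # throw the snapshot away

If the read-only check finds problems, the repair still has to happen
on the real device with the file system unmounted, which is the
tradeoff described above.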