Date: Wed, 22 Sep 2021 23:57:35 -0400
From: "Theodore Ts'o"
To: "Kiselev, Oleg"
Cc: Andreas Dilger, "linux-ext4@vger.kernel.org"
Subject: Re: [PATCH] mke2fs: Add extended option for prezeroed storage devices
References: <20210921034203.323950-1-sarthakkukreti@google.com> <0A4B11C1-A119-4733-A841-683889E9DC7B@amazon.com>
In-Reply-To: <0A4B11C1-A119-4733-A841-683889E9DC7B@amazon.com>
X-Mailing-List: linux-ext4@vger.kernel.org

On Thu, Sep 23, 2021 at 03:31:00AM +0000, Kiselev, Oleg wrote:
> Wouldn't it make more sense to use "write-same" of 0 instead of
> writing a page of zeros and task the layers that do thin
> provisioning and return 0 on read from unallocated blocks to check
> if a block exists before writing zeros to it?

The problem is we have absolutely no idea what "write-same" of 0 will
actually do in terms of whether it will consume storage for various
thinly provisioned devices.  We also have no idea what the performance
might be.  It might be the same speed as explicitly passing in
zero-filled buffers and sending DMA requests to a hard drive.  (e.g.,
potentially very S-L-O-W.)

That's technically true for "discard" as well, except there's a vague
understanding that discard will generally be faster than writing all
zeros --- it's just that it might also be a no-op, or it might
randomly be a no-op, depending on the phase of the moon, or any other
random variable, including whether "the storage device feels like it
or not".

Bottom line --- unfortunately, the SATA/SCSI standards authors were
mealy-mouthed and made discard something which is completely useless
for our purposes.  And since we don't know anything about the
performance of write same and what it might do from the perspective of
thin-provisioned storage, we can't really depend on it either.

The problem is mke2fs really does need to care about the performance
of discard or write same.  Users want mke2fs to be fast, especially
during the distro installation process.  That's why we implemented the
lazy inode table initialization feature in the first place.  So
reading each block from the inode table to see if it's zero might be
slow, and so we might be better off just doing the lazy itable init
instead.

Hence, I think Sarthak's approach of giving an explicit hint is a good
one.

The other approach we can use is to depend on metadata checksums, and
the fact that a new file system will use a different UUID for the seed
for the checksum.
Unfortunately, in order to make this work well, we need to change
e2fsck so that if the checksum doesn't work out --- especially if all
of the checksums in an inode table block are incorrect --- it presumes
that the inode table block is from an old instance of the file system,
and returns a zero-filled block when reading that inode table block.
(Right now, e2fsck still offers the chance to just fix the checksum, a
holdover from when we were worried there might be bugs in the metadata
checksum code.)

But I don't think the two approaches are mutually exclusive.  The
approach of an explicit hint is "safe" and a lot easier to review.

Cheers,

					- Ted
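
To make the checksum-based idea concrete, here is a rough sketch of
the "all checksums in the block are wrong, so treat the block as
zeroed" rule.  It is only an illustration, not e2fsck code: the record
layout, INODE_SIZE, and every function name below are made up for the
example, and zlib.crc32 stands in for the crc32c that ext4's metadata
checksums actually use.

```python
import uuid
import zlib

INODE_SIZE = 256  # assumed record size for this sketch only


def seeded_checksum(fs_uuid: uuid.UUID, body: bytes) -> int:
    """Checksum of body, seeded with the file system UUID (stand-in CRC)."""
    seed = zlib.crc32(fs_uuid.bytes)
    return zlib.crc32(body, seed)


def make_record(fs_uuid: uuid.UUID, payload: bytes) -> bytes:
    """Build a hypothetical inode record: padded payload + 4-byte checksum."""
    body = payload.ljust(INODE_SIZE - 4, b"\0")
    return body + seeded_checksum(fs_uuid, body).to_bytes(4, "little")


def record_ok(fs_uuid: uuid.UUID, record: bytes) -> bool:
    """True if the stored checksum matches one computed with this UUID."""
    body, stored = record[:-4], int.from_bytes(record[-4:], "little")
    return seeded_checksum(fs_uuid, body) == stored


def read_itable_block(fs_uuid: uuid.UUID, block: bytes) -> bytes:
    """Return the block, or zeros if no record verifies under this UUID.

    If every checksum fails, presume the block is left over from an
    older file system instance and hand back a zero-filled block.
    """
    records = [block[i:i + INODE_SIZE] for i in range(0, len(block), INODE_SIZE)]
    if records and not any(record_ok(fs_uuid, r) for r in records):
        return bytes(len(block))
    return block
```

A block written under the old UUID reads back unchanged when checked
with that UUID, and reads back as zeros when checked with the new
file system's UUID.  A block with at least one record that verifies is
left alone, so real corruption still gets reported the usual way.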