Received: by 2002:a05:7412:31a9:b0:e2:908c:2ebd with SMTP id et41csp6108772rdb; Mon, 18 Sep 2023 04:43:36 -0700 (PDT) X-Google-Smtp-Source: AGHT+IFQnpfWFTSZRqDCo+VPagpZt6pCluXqJhPRxEOeGKzxfPXUtnTbp5cHm9SPLGqDVAJK4VNA X-Received: by 2002:a05:6a21:498b:b0:159:b7dc:2295 with SMTP id ax11-20020a056a21498b00b00159b7dc2295mr7309712pzc.9.1695037415969; Mon, 18 Sep 2023 04:43:35 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1695037415; cv=none; d=google.com; s=arc-20160816; b=IA6Kf/0pCa+oCnGDxLJI4KAQWKLRWfTsAch0/VVKbxAMZoicGmFxn8Zgbg9J2uFSy4 Q4xWTp4SZLzA01BMoqFczLzi7xiqjqhsPHl4dXwK71lJT9QY0BIIRT5HgbYXWeSh+p+Z P1vsOxgzPeSIkhfRxOE3yb3HWaixrD62bmZEDEKo2Goji8GZu6bMWDpbzAJr0xtuoXWU c1/9DG7No/J5WRXzGqUeRfsQPCRM5SYBQS9iwurWdmbgUlbvIYVsPAYE1Y0+xvC7oDAK G1IoaaEyp8nlDEe6Ji3xSpMiW4VcKibh5EcQascEUxMuJJA9eFBKxIEodyqF6bxYIwVz Udhg== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:in-reply-to:content-disposition:mime-version :references:message-id:subject:cc:to:from:date:dkim-signature :dkim-signature; bh=2Py/I/FzMrkGhKFEk1Qy4ED+NcXo3BnZB0csOeT5dGY=; fh=9YeD+/1Lt1hy1WwsFf2xgL+JaQVwRkyLtT/FOmbm2fg=; b=G3Owx26FEImJY7vPNa+UdLt/SDYf95F0zMnPG7B2/h30sHzTUpsvrRry5s8oKgHsPk CWEn+r2wiG4VuQUp8vcuVgso0w39e0xo7XIxdsNXDpUeKSNFq4RHEdRCX+ENFSlhWFpL +3VVWCpQIZ9sey0KLvbqORUQjV+XBSjjEfhKwsill9+TMV425z2kkYWeFVO24Ljjn8R6 l87lfQG3z6Hsiok3V68/kaeswTPTpBfG+iihbBAav+mXUGFjfX34RnSRVURYyQBay2nK Q0874QlQcvRAuinksQg8jqzhlqkgdcOM00eL/UFCfA4iwzefRaOxJ/Z8TTbV+RD8UxN6 JwIg== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@suse.cz header.s=susede2_rsa header.b=xI1hI3rJ; dkim=neutral (no key) header.i=@suse.cz header.s=susede2_ed25519 header.b=RCIgCCRU; spf=pass (google.com: domain of linux-ext4-owner@vger.kernel.org designates 23.128.96.36 as permitted sender) smtp.mailfrom=linux-ext4-owner@vger.kernel.org Return-Path: Received: from pete.vger.email (pete.vger.email. [23.128.96.36]) by mx.google.com with ESMTPS id bd6-20020a056a00278600b0068e3fe64a03si7874927pfb.308.2023.09.18.04.43.35 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 18 Sep 2023 04:43:35 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-ext4-owner@vger.kernel.org designates 23.128.96.36 as permitted sender) client-ip=23.128.96.36; Authentication-Results: mx.google.com; dkim=pass header.i=@suse.cz header.s=susede2_rsa header.b=xI1hI3rJ; dkim=neutral (no key) header.i=@suse.cz header.s=susede2_ed25519 header.b=RCIgCCRU; spf=pass (google.com: domain of linux-ext4-owner@vger.kernel.org designates 23.128.96.36 as permitted sender) smtp.mailfrom=linux-ext4-owner@vger.kernel.org Received: from out1.vger.email (depot.vger.email [IPv6:2620:137:e000::3:0]) by pete.vger.email (Postfix) with ESMTP id 9423E8072F7D; Mon, 18 Sep 2023 00:26:55 -0700 (PDT) X-Virus-Status: Clean X-Virus-Scanned: clamav-milter 0.103.10 at pete.vger.email Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S240097AbjIRH0a (ORCPT + 99 others); Mon, 18 Sep 2023 03:26:30 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:57296 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S240241AbjIRH0R (ORCPT ); Mon, 18 Sep 2023 03:26:17 -0400 Received: from smtp-out2.suse.de (smtp-out2.suse.de [IPv6:2001:67c:2178:6::1d]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 21C03A8; Mon, 18 Sep 2023 00:26:11 -0700 (PDT) Received: from imap2.suse-dmz.suse.de (imap2.suse-dmz.suse.de [192.168.254.74]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature ECDSA (P-521) server-digest SHA512) (No client certificate requested) by smtp-out2.suse.de (Postfix) with ESMTPS id 89E461F88D; Mon, 18 Sep 2023 07:26:09 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=suse.cz; s=susede2_rsa; t=1695021969; h=from:from:reply-to:date:date:message-id:message-id:to:to:cc:cc: mime-version:mime-version:content-type:content-type: in-reply-to:in-reply-to:references:references; bh=2Py/I/FzMrkGhKFEk1Qy4ED+NcXo3BnZB0csOeT5dGY=; b=xI1hI3rJcSO2/N2jCikMCB0oIl8blMlnXuEoKPnq4NTCvdCUgvHjDIUsbk6VbfklnVYZiU mNl23GdUrIWKywB4ocKdDDyJrJ1mP2VfqjI9yiwpL4D+LMQPB5Kzc5tmsTaLC7CMFN+ZI3 UR5vrPIexL79F29FzA/57o95LftZA0A= DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/relaxed; d=suse.cz; s=susede2_ed25519; t=1695021969; h=from:from:reply-to:date:date:message-id:message-id:to:to:cc:cc: mime-version:mime-version:content-type:content-type: in-reply-to:in-reply-to:references:references; bh=2Py/I/FzMrkGhKFEk1Qy4ED+NcXo3BnZB0csOeT5dGY=; b=RCIgCCRU33M4CcRAuwV3q3a4kAbEpqbeaqIRR+bLDY9zrPWwlANvr+mz2H1GfptZf2akOe w12BLqljj6zPsYAQ== Received: from imap2.suse-dmz.suse.de (imap2.suse-dmz.suse.de [192.168.254.74]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature ECDSA (P-521) server-digest SHA512) (No client certificate requested) by imap2.suse-dmz.suse.de (Postfix) with ESMTPS id 7ADA113480; Mon, 18 Sep 2023 07:26:09 +0000 (UTC) Received: from dovecot-director2.suse.de ([192.168.254.65]) by imap2.suse-dmz.suse.de with ESMTPSA id uQb2HZH7B2WFUwAAMHmgww (envelope-from ); Mon, 18 Sep 2023 07:26:09 +0000 Received: by quack3.suse.cz (Postfix, from userid 1000) id 171E5A0759; Mon, 18 Sep 2023 09:26:09 +0200 (CEST) Date: Mon, 18 Sep 2023 09:26:09 +0200 From: Jan Kara To: Ojaswin Mujoo Cc: linux-ext4@vger.kernel.org, Theodore Ts'o , Ritesh Harjani , linux-kernel@vger.kernel.org, Jan Kara Subject: Re: [PATCH v2 1/1] ext4: Mark buffer new if it is unwritten to avoid stale data exposure Message-ID: <20230918072609.teegybfhht23gzzc@quack3> References: <2fe0c7461d7a49eec46a1c83667ae678825d8b76.1694860198.git.ojaswin@linux.ibm.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <2fe0c7461d7a49eec46a1c83667ae678825d8b76.1694860198.git.ojaswin@linux.ibm.com> X-Spam-Status: No, score=-0.9 required=5.0 tests=DKIM_SIGNED,DKIM_VALID, DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI, SPF_HELO_NONE,SPF_PASS autolearn=unavailable autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on pete.vger.email Precedence: bulk List-ID: X-Mailing-List: linux-ext4@vger.kernel.org X-Greylist: Sender passed SPF test, not delayed by milter-greylist-4.6.4 (pete.vger.email [0.0.0.0]); Mon, 18 Sep 2023 00:26:56 -0700 (PDT) On Sat 16-09-23 16:12:13, Ojaswin Mujoo wrote: > ** Short Version ** > > In ext4 with dioread_nolock, we could have a scenario where the bh returned by > get_blocks (ext4_get_block_unwritten()) in __block_write_begin_int() has > UNWRITTEN and MAPPED flag set. Since such a bh does not have NEW flag set we > never zero out the range of bh that is not under write, causing whatever stale > data is present in the folio at that time to be written out to disk. To fix this > mark the buffer as new in ext4_get_block_unwritten(), in case it is unwritten. > > ----- I'm not sure if this separator isn't going to confuse some tools processing patches ;) > ** Long Version ** > > The issue mentioned above was resulting in two different bugs: > > 1. On block size < page size case in ext4, generic/269 was reliably > failing with dioread_nolock. The state of the write was as follows: > > * The write was extending i_size. > * The last block of the file was fallocated and had an unwritten extent > * We were near ENOSPC and hence we were switching to non-delayed alloc > allocation. > > In this case, the back trace that triggers the bug is as follows: > > ext4_da_write_begin() > /* switch to nodelalloc due to low space */ > ext4_write_begin() > ext4_should_dioread_nolock() // true since mount flags still have delalloc > __block_write_begin(..., ext4_get_block_unwritten) > __block_write_begin_int() > for(each buffer head in page) { > /* first iteration, this is bh1 which contains i_size */ > if (!buffer_mapped) > get_block() /* returns bh with only UNWRITTEN and MAPPED */ > /* second iteration, bh2 */ > if (!buffer_mapped) > get_block() /* we fail here, could be ENOSPC */ > } > if (err) > /* > * this would zero out all new buffers and mark them uptodate. > * Since bh1 was never marked new, we skip it here which causes > * the bug later. > */ > folio_zero_new_buffers(); > /* ext4_wrte_begin() error handling */ > ext4_truncate_failed_write() > ext4_truncate() > ext4_block_truncate_page() > __ext4_block_zero_page_range() > if(!buffer_uptodate()) > ext4_read_bh_lock() > ext4_read_bh() -> ... ext4_submit_bh_wbc() > BUG_ON(buffer_unwritten(bh)); /* !!! */ > > 2. The second issue is stale data exposure with page size >= blocksize > with dioread_nolock. The conditions needed for it to happen are same as > the previous issue ie dioread_nolock around ENOSPC condition. The issue > is also similar where in __block_write_begin_int() when we call > ext4_get_block_unwritten() on the buffer_head and the underlying extent > is unwritten, we get an unwritten and mapped buffer head. Since it is > not new, we never zero out the partial range which is not under write, > thus writing stale data to disk. This can be easily observed with the > following reproducer: > > fallocate -l 4k testfile > xfs_io -c "pwrite 2k 2k" testfile > # hexdump output will have stale data in from byte 0 to 2k in testfile > hexdump -C testfile > > NOTE: To trigger this, we need dioread_nolock enabled and write > happening via ext4_write_begin(), which is usually used when we have -o > nodealloc. Since dioread_nolock is disabled with nodelalloc, the only > alternate way to call ext4_write_begin() is to make sure dellayed > alloc switches to nodelalloc (ext4_da_write_begin() calls > ext4_write_begin()). This will usually happen when ext4 is almost full > like the way generic/269 was triggering it in Issue 1 above. This might > make this issue harder to replicate hence for reliable replicate, I used > the below patch to temporarily allow dioread_nolock with nodelalloc and > then mount the disk with -o nodealloc,dioread_nolock. With this you can > hit the stale data issue 100% of times: > > @@ -508,8 +508,8 @@ static inline int ext4_should_dioread_nolock(struct inode *inode) > if (ext4_should_journal_data(inode)) > return 0; > /* temporary fix to prevent generic/422 test failures */ > - if (!test_opt(inode->i_sb, DELALLOC)) > - return 0; > + // if (!test_opt(inode->i_sb, DELALLOC)) > + // return 0; > return 1; > } > > After applying this patch to mark buffer as NEW, both the above issues are > fixed. > > Signed-off-by: Ojaswin Mujoo Looks good to me. Thanks for fixing this! Feel free to add: Reviewed-by: Jan Kara Honza > --- > fs/ext4/inode.c | 14 +++++++++++++- > 1 file changed, 13 insertions(+), 1 deletion(-) > > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c > index 6c490f05e2ba..8b286a800193 100644 > --- a/fs/ext4/inode.c > +++ b/fs/ext4/inode.c > @@ -789,10 +789,22 @@ int ext4_get_block(struct inode *inode, sector_t iblock, > int ext4_get_block_unwritten(struct inode *inode, sector_t iblock, > struct buffer_head *bh_result, int create) > { > + int ret = 0; > + > ext4_debug("ext4_get_block_unwritten: inode %lu, create flag %d\n", > inode->i_ino, create); > - return _ext4_get_block(inode, iblock, bh_result, > + ret = _ext4_get_block(inode, iblock, bh_result, > EXT4_GET_BLOCKS_CREATE_UNWRIT_EXT); > + > + /* > + * If the buffer is marked unwritten, mark it as new to make sure it is > + * zeroed out correctly in case of partial writes. Otherwise, there is > + * a chance of stale data getting exposed. > + */ > + if (ret == 0 && buffer_unwritten(bh_result)) > + set_buffer_new(bh_result); > + > + return ret; > } > > /* Maximum number of blocks we map for direct IO at once. */ > -- > 2.39.3 > -- Jan Kara SUSE Labs, CR