Received: by 2002:a05:7412:31a9:b0:e2:908c:2ebd with SMTP id et41csp5125693rdb; Sat, 16 Sep 2023 04:13:53 -0700 (PDT) X-Google-Smtp-Source: AGHT+IHmIbfDcXaO8h4kk22BTI2S00CPxQ57was1VLQiLH2x8oYNynNjCai32xcm2DnHqYkq5U+g X-Received: by 2002:a05:6a21:199:b0:153:6a8b:8f5d with SMTP id le25-20020a056a21019900b001536a8b8f5dmr5341622pzb.23.1694862833365; Sat, 16 Sep 2023 04:13:53 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1694862833; cv=none; d=google.com; s=arc-20160816; b=m/pEckI9Sye3bDhrUGiN9Cx/5DDfsm5N0KkjTzX8RE7WeJc6v8j+Y85PK0cdhosLHV QUqC0BnFyghKtVfPzAuKUD1VmPWhPQlLwX9f5NZka2AnAUSLnvZ8BsV+qbcIRAFUKxue TAmd+VFJMPlhaQBfeo89Vlg23aq6pqoTAZiB9k5oJ0mwNZXcL9fYS/lPL/XHQkQYz7NP /Ba4hkZ2ApBYMgHQ4PGjLRdlgvjMUDstVmQmv0QPPtqeAYi8tZ4zZCAPCM079km7IOoC Lm1YlMHl7I3ur0Z3V9KBXdnQn3sBmC/4vBAdMQbpAo0euyUKo0Jn3a5AaP9L6Kk5e9u3 9trg== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:in-reply-to:content-disposition:mime-version :references:message-id:subject:cc:to:from:date:dkim-signature; bh=M2UBylssOM+4TEMqiW5J18Y0S1NTDL4uf26/ovomt0Y=; fh=ddtMqXlDoe0M03oQ7zFLM66BHmYWyhNUXmPqHrxSGq8=; b=KEPrzkpmzwQjO5DuXl4XhaBO5ADB2t+Mp1TYg5tm+2qV3p7PfubN4ZhPMqcOwHSvKu ivvFo4InJd1L6CMvl1EWJSB0Ko+cuFWybTik6FTP4oy4qdGHCFo6Ov0Pc9CwmQQsYEWb PDwf5iaQN0dKVlcQSxuzIsYiOCcn5fQY5EJvb8aEe+3XQauFE1eTgYg5TFOBfYxlQBib aR31JLAVW6AMLC/wWt2C3GTROVoOLThesoXCbbvWxOlFTevIM4kiraTxdxyoInzbiHR5 OF+d9IuCYSYkMCDZp/AgezuRl6ofnHGxpUE0tbj4Swjhedwfk4p+5Elhm1/FV0ETRszT gZVQ== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@ibm.com header.s=pp1 header.b=Q+Nbpgig; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::3:2 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=NONE dis=NONE) header.from=ibm.com Return-Path: Received: from agentk.vger.email (agentk.vger.email. [2620:137:e000::3:2]) by mx.google.com with ESMTPS id g184-20020a636bc1000000b00565e865d381si4557789pgc.447.2023.09.16.04.13.52 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Sat, 16 Sep 2023 04:13:53 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::3:2 as permitted sender) client-ip=2620:137:e000::3:2; Authentication-Results: mx.google.com; dkim=pass header.i=@ibm.com header.s=pp1 header.b=Q+Nbpgig; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::3:2 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=NONE dis=NONE) header.from=ibm.com Received: from out1.vger.email (depot.vger.email [IPv6:2620:137:e000::3:0]) by agentk.vger.email (Postfix) with ESMTP id 4AACC805E413; Sat, 16 Sep 2023 04:11:14 -0700 (PDT) X-Virus-Status: Clean X-Virus-Scanned: clamav-milter 0.103.10 at agentk.vger.email Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S239141AbjIPLIx (ORCPT + 99 others); Sat, 16 Sep 2023 07:08:53 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:43044 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S239139AbjIPLIU (ORCPT ); Sat, 16 Sep 2023 07:08:20 -0400 Received: from mx0b-001b2d01.pphosted.com (mx0b-001b2d01.pphosted.com [148.163.158.5]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id AD7551A6; Sat, 16 Sep 2023 04:08:11 -0700 (PDT) Received: from pps.filterd (m0353723.ppops.net [127.0.0.1]) by mx0a-001b2d01.pphosted.com (8.17.1.19/8.17.1.19) with ESMTP id 38GB81HY011667; Sat, 16 Sep 2023 11:08:01 GMT DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=ibm.com; h=date : from : to : cc : subject : message-id : references : mime-version : content-type : in-reply-to; s=pp1; bh=M2UBylssOM+4TEMqiW5J18Y0S1NTDL4uf26/ovomt0Y=; b=Q+NbpgigNWtuc+y2qTlWKHZuJCGbq/gBH9NE6OT8Ao1XO7gkIgF98yw7+D6ulNjhgMfI AH82/ksccPiLr//Jgp/BEQKNDqrGdg7RmitEp7uA6/tqR++YqkGvJvXn+xNli3lRohdI OOD9lQHRcpi4Jh8jgynEPGukvuxi0K4M5KvGitSSOkSM5LHoBrmGY408u09aT/Qn36Hp rLWmhH5y3joh0dUAyJdKgqcpahnEJIS/PxmKJDDqtF57Z5QNWplnS+peKubi0R5NmJ6H gXh/dqncRsBms9JyJdLYULU0vsxBlBjhifAElLUGbRbqsfrj7yXmvzx/hcbV21ddoRzs Ag== Received: from ppma13.dal12v.mail.ibm.com (dd.9e.1632.ip4.static.sl-reverse.com [50.22.158.221]) by mx0a-001b2d01.pphosted.com (PPS) with ESMTPS id 3t558ke76w-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Sat, 16 Sep 2023 11:08:01 +0000 Received: from pps.filterd (ppma13.dal12v.mail.ibm.com [127.0.0.1]) by ppma13.dal12v.mail.ibm.com (8.17.1.19/8.17.1.19) with ESMTP id 38G9RZN9002352; Sat, 16 Sep 2023 11:02:59 GMT Received: from smtprelay03.fra02v.mail.ibm.com ([9.218.2.224]) by ppma13.dal12v.mail.ibm.com (PPS) with ESMTPS id 3t158m9dn2-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Sat, 16 Sep 2023 11:02:59 +0000 Received: from smtpav03.fra02v.mail.ibm.com (smtpav03.fra02v.mail.ibm.com [10.20.54.102]) by smtprelay03.fra02v.mail.ibm.com (8.14.9/8.14.9/NCO v10.0) with ESMTP id 38GB2vg610814068 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK); Sat, 16 Sep 2023 11:02:57 GMT Received: from smtpav03.fra02v.mail.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id CAE4120043; Sat, 16 Sep 2023 11:02:57 +0000 (GMT) Received: from smtpav03.fra02v.mail.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id 7830B20040; Sat, 16 Sep 2023 11:02:47 +0000 (GMT) Received: from li-bb2b2a4c-3307-11b2-a85c-8fa5c3a69313.ibm.com (unknown [9.43.102.18]) by smtpav03.fra02v.mail.ibm.com (Postfix) with ESMTPS; Sat, 16 Sep 2023 11:02:47 +0000 (GMT) Date: Sat, 16 Sep 2023 16:32:39 +0530 From: Ojaswin Mujoo To: Jan Kara Cc: linux-ext4@vger.kernel.org, "Theodore Ts'o" , Ritesh Harjani , linux-kernel@vger.kernel.org, Alexander Viro , Christian Brauner , linux-fsdevel@vger.kernel.org Subject: Re: [PATCH 1/1] ext4: Mark buffer new if it is unwritten to avoid stale data exposure Message-ID: References: <20230905135629.wdjl2b6s3pzh7idg@quack3> <20230914141920.lw2nlpzhcxwuz2y6@quack3> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20230914141920.lw2nlpzhcxwuz2y6@quack3> X-TM-AS-GCONF: 00 X-Proofpoint-GUID: 2rX1_xqXeBZpimiX6ou2TEtl3RXShPEQ X-Proofpoint-ORIG-GUID: 2rX1_xqXeBZpimiX6ou2TEtl3RXShPEQ X-Proofpoint-Virus-Version: vendor=baseguard engine=ICAP:2.0.267,Aquarius:18.0.980,Hydra:6.0.601,FMLib:17.11.176.26 definitions=2023-09-15_20,2023-09-15_01,2023-05-22_02 X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0 mlxscore=0 bulkscore=0 mlxlogscore=946 suspectscore=0 phishscore=0 adultscore=0 malwarescore=0 priorityscore=1501 clxscore=1015 lowpriorityscore=0 impostorscore=0 spamscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2308100000 definitions=main-2309160095 X-Spam-Status: No, score=-0.8 required=5.0 tests=DKIM_SIGNED,DKIM_VALID, HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI,SPF_HELO_NONE, SPF_PASS autolearn=unavailable autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on agentk.vger.email Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org X-Greylist: Sender passed SPF test, not delayed by milter-greylist-4.6.4 (agentk.vger.email [0.0.0.0]); Sat, 16 Sep 2023 04:11:14 -0700 (PDT) On Thu, Sep 14, 2023 at 04:19:20PM +0200, Jan Kara wrote: > Hello Ojaswin, > > On Thu 14-09-23 17:24:52, Ojaswin Mujoo wrote: > > On Tue, Sep 05, 2023 at 03:56:29PM +0200, Jan Kara wrote: > > > On Tue 05-09-23 15:58:01, Ojaswin Mujoo wrote: > > > > ** Short Version ** > > > > > > > > In ext4 with dioread_nolock, we could have a scenario where the bh returned by > > > > get_blocks (ext4_get_block_unwritten()) in __block_write_begin_int() has > > > > UNWRITTEN and MAPPED flag set. Since such a bh does not have NEW flag set we > > > > never zero out the range of bh that is not under write, causing whatever stale > > > > data is present in the folio at that time to be written out to disk. To fix this > > > > mark the buffer as new in _ext4_get_block(), in case it is unwritten. > > > > > > > > ----- > > > > ** Long Version ** > > > > > > > > The issue mentioned above was resulting in two different bugs: > > > > > > > > 1. On block size < page size case in ext4, generic/269 was reliably > > > > failing with dioread_nolock. The state of the write was as follows: > > > > > > > > * The write was extending i_size. > > > > * The last block of the file was fallocated and had an unwritten extent > > > > * We were near ENOSPC and hence we were switching to non-delayed alloc > > > > allocation. > > > > > > > > In this case, the back trace that triggers the bug is as follows: > > > > > > > > ext4_da_write_begin() > > > > /* switch to nodelalloc due to low space */ > > > > ext4_write_begin() > > > > ext4_should_dioread_nolock() // true since mount flags still have delalloc > > > > __block_write_begin(..., ext4_get_block_unwritten) > > > > __block_write_begin_int() > > > > for(each buffer head in page) { > > > > /* first iteration, this is bh1 which contains i_size */ > > > > if (!buffer_mapped) > > > > get_block() /* returns bh with only UNWRITTEN and MAPPED */ > > > > /* second iteration, bh2 */ > > > > if (!buffer_mapped) > > > > get_block() /* we fail here, could be ENOSPC */ > > > > } > > > > if (err) > > > > /* > > > > * this would zero out all new buffers and mark them uptodate. > > > > * Since bh1 was never marked new, we skip it here which causes > > > > * the bug later. > > > > */ > > > > folio_zero_new_buffers(); > > > > /* ext4_wrte_begin() error handling */ > > > > ext4_truncate_failed_write() > > > > ext4_truncate() > > > > ext4_block_truncate_page() > > > > __ext4_block_zero_page_range() > > > > if(!buffer_uptodate()) > > > > ext4_read_bh_lock() > > > > ext4_read_bh() -> ... ext4_submit_bh_wbc() > > > > BUG_ON(buffer_unwritten(bh)); /* !!! */ > > > > > > > > 2. The second issue is stale data exposure with page size >= blocksize > > > > with dioread_nolock. The conditions needed for it to happen are same as > > > > the previous issue ie dioread_nolock around ENOSPC condition. The issue > > > > is also similar where in __block_write_begin_int() when we call > > > > ext4_get_block_unwritten() on the buffer_head and the underlying extent > > > > is unwritten, we get an unwritten and mapped buffer head. Since it is > > > > not new, we never zero out the partial range which is not under write, > > > > thus writing stale data to disk. This can be easily observed with the > > > > following reporducer: > > > > > > > > fallocate -l 4k testfile > > > > xfs_io -c "pwrite 2k 2k" testfile > > > > # hexdump output will have stale data in from byte 0 to 2k in testfile > > > > hexdump -C testfile > > > > > > > > NOTE: To trigger this, we need dioread_nolock enabled and write > > > > happening via ext4_write_begin(), which is usually used when we have -o > > > > nodealloc. Since dioread_nolock is disabled with nodelalloc, the only > > > > alternate way to call ext4_write_begin() is to fill make sure dellayed > > > > alloc switches to nodelalloc (ext4_da_write_begin() calls > > > > ext4_write_begin()). This will usually happen when FS is almost full > > > > like the way generic/269 was triggering it in Issue 1 above. This might > > > > make this issue harder to replicate hence for reliable replicate, I used > > > > the below patch to temporarily allow dioread_nolock with nodelalloc and > > > > then mount the disk with -o nodealloc,dioread_nolock. With this you can > > > > hit the stale data issue 100% of times: > > > > > > > > @@ -508,8 +508,8 @@ static inline int ext4_should_dioread_nolock(struct inode *inode) > > > > if (ext4_should_journal_data(inode)) > > > > return 0; > > > > /* temporary fix to prevent generic/422 test failures */ > > > > - if (!test_opt(inode->i_sb, DELALLOC)) > > > > - return 0; > > > > + // if (!test_opt(inode->i_sb, DELALLOC)) > > > > + // return 0; > > > > return 1; > > > > } > > > > > > > > ------- > > > > > > > > After applying this patch to mark buffer as NEW, both the above issues are > > > > fixed. > > > > > > > > Signed-off-by: Ojaswin Mujoo > > > > > > Good catch! But I'm wondering whether this is really the right fix. For > > > example in ext4_block_truncate_page() shouldn't we rather be checking > > > whether the buffer isn't unwritten and if yes then bail because there's > > > nothing to zero out in the block? That would seem like a more logical > > > and robust solution of the first problem to me. > > > > So I was looking into this to understand the code paths and it seems > > like ext4_truncate doesn't really impose that a unwritten buffer does > > not have any data in its corresponding folio, which might sometimes be > > the case. > > > > For example, imagine a case where we get the last block of a file via > > ext4_da_get_block_prep() which returns a bh that is unwritten, mapped > > and new. During the write, we'll copy data in this folio and then > > adjust i_size in write_end, release the folio lock and ultimately the > > inode_lock(). > > > > In this intermediate state, before writeback happens, the buffer is > > unwritten but has data which will be written. At this point, if we call > > ext4_block_truncate_page() and have the logic to exit early for bh_unwritten, the > > we will never actually zero out the folio which might cause stale data to be > > written during writeback (?) > > Actually we will. truncate_inode_pages_range() ends up calling > truncate_inode_partial_folio() which zeros out the tail of the partial > page. I think you are confusing two different things. One is zeroing of > partial page cache pages - that is generally handled by the generic > truncate code - and another one is zeroing of on-disk partial blocks - that > is handled by the filesystem itself. The contents on on-disk blocks does > not matter as long as they are marked as unwritten in the extent tree > (their contents is random anyway) and therefore __ext4_block_zero_page_range() > has nothing to do in that case. > > > Now, most of the calls to ext4_block_truncate_page() happen via ext4_truncate ( like via ext4_setattr, > > ext4_truncate_failed_write() etc) call truncate_inode_pages() which > > seems to handle zeroing the partial folio beyond i_size. However, if we > > add the logic to skip unwritten blocks in this function then: > > > > 1. We create an implicit dependency in ext4_block_truncate_page() that > > the folio needs to not have any data if its unwritten ie some other > > function has taken care of by some other function called before it. > > Yes, this dependency already exists today because when blocksize < pagesize > the zeroing happening in __ext4_block_zero_page_range() may be a subset of > what gets zeroed by truncate_inode_partial_folio(). Still we zero the page > in __ext4_block_zero_page_range() for the case when the page was not cached > at all and we've just loaded it from the disk. > > > 2. Additionally, that other function will also need to mark the relevant > > buffer dirty, which is done in this function. > > AFAICT there's no need to mark the buffer dirty - the whole point is we > don't need to touch the on-disk contents if the block is unwritten... > > > 3. There are some other paths that call ext4_block_truncate_page() > > without turncating the pagecache like ext4_zero_range(). Im not sure if > > this will cause any issues as I've not gone through the function > > completely though but yes, other functions that later call truncate > > in future might need to keep this implicit dependency in mind. > > Indeed, this is a good catch. So we either need to make both sites calling > ext4_zero_partial_blocks() to use truncate_pagecache_range() for the whole > range including partial blocks or we need to zero out the page cache > (without bringing the page uptodate or marking it dirty) in > ext4_zero_partial_blocks() even if the buffer is unwritten. I don't have a > strong preference either way. Okay, got it so basically we depend on the truncate_inode_* family of functions to zero out the partial pagecache anyways so once we have fixed the other call sites like ext4_zero_partial_blocks() we should be good to skip unwritten buffers in ext4_block_zero_page_range(). Thanks for the explanation, let me go through this and prepare a fix. Since this might take a bit more time, I've submitted the v2 fix to stale data exposure issue and will fix this one separately. Thanks, Ojaswin > > Honza > -- > Jan Kara > SUSE Labs, CR