Received: by 2002:a05:6358:11c7:b0:104:8066:f915 with SMTP id i7csp5747331rwl; Tue, 4 Apr 2023 03:06:12 -0700 (PDT) X-Google-Smtp-Source: AKy350ZQAlta6XsL0VkuOuA/ar2biIl/LxL73yAjmxDZEsHV8VTIShh7anhQCiSMks0i7EJaUFEm X-Received: by 2002:a05:6402:45:b0:4fc:709f:7ab0 with SMTP id f5-20020a056402004500b004fc709f7ab0mr2105059edu.0.1680602772256; Tue, 04 Apr 2023 03:06:12 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1680602772; cv=none; d=google.com; s=arc-20160816; b=hepEsXYeXgMXpYzqajy9Kc/oApEJzcNcUBfXxc0Hfzz1eteV6HpQikV7kISH26ta6Y DgkpNrdZ4H7srvbs11HOywbGyaw/0y9tzGVzDvJayK9mSzj4+V5Rbg47giYaQf9JN5lp KdgK7uN8sN6ppVbb24p/yVdjOI+n+tkg8NjmyhEGisWTOF5qe61G+WvS/SgOW1iOEwGA VyZ3kekDREWktM1O4rL9ddC869D35Jb/rYbEXUPS+RUL28hQakPtX67/ayfeRz5l/Yi1 IGzyFhieh67/Q80uypEzv30HsyTAlXRGWfguHpBqfE+MYDwRdLkokvvsmzVAC+ajv+nO wdyQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:in-reply-to:content-transfer-encoding :content-disposition:mime-version:references:message-id:subject:cc :to:from:date:dkim-signature:dkim-signature; bh=p0AB/drouMtDRwvcSkIGBTa/IBApGxJYPT6bYboWjO4=; b=fieAQF38SE6D5kZBgsNvWbV2LxlOyB5uog7dpV8o1xEiIUk/LdZXvy/8/P+sywq9mr hifLisMQRJybs0CVSPaDWUJUEr/nqzYtkI6v6oiY3QnRASjnXFrTXsvS2Cypv22CXx1F cEsQnGuD70+oLeBdTQi0IVVhgS1HSluMNJNDaCeHSEHjb/A78mFzHhZfpiW2KkiiCjxO otkZyhcXk5EmROsTs9dsLE6BZbAU5krX2j6E1lGQ6Gd2nz1JSIOoFimdz7I5WzNwWiME lUyrKezWr8J0AwBwEOA+yxhYvqsXJ3YJ7Uulz/ut3TUJv/CIYYTIwc9MeuLIJNlN1Fcc eR5g== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@suse.cz header.s=susede2_rsa header.b=CiwoJwzM; dkim=neutral (no key) header.i=@suse.cz header.s=susede2_ed25519 header.b=KKdeodU3; spf=pass (google.com: domain of linux-ext4-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-ext4-owner@vger.kernel.org Return-Path: Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id i5-20020a056402054500b005022100f550si10907147edx.686.2023.04.04.03.05.42; Tue, 04 Apr 2023 03:06:12 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-ext4-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@suse.cz header.s=susede2_rsa header.b=CiwoJwzM; dkim=neutral (no key) header.i=@suse.cz header.s=susede2_ed25519 header.b=KKdeodU3; spf=pass (google.com: domain of linux-ext4-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-ext4-owner@vger.kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S233917AbjDDKEo (ORCPT + 99 others); Tue, 4 Apr 2023 06:04:44 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:33070 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S233365AbjDDKEn (ORCPT ); Tue, 4 Apr 2023 06:04:43 -0400 Received: from smtp-out1.suse.de (smtp-out1.suse.de [195.135.220.28]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 9DC741BD9; Tue, 4 Apr 2023 03:04:41 -0700 (PDT) Received: from imap2.suse-dmz.suse.de (imap2.suse-dmz.suse.de [192.168.254.74]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature ECDSA (P-521) server-digest SHA512) (No client certificate requested) by smtp-out1.suse.de (Postfix) with ESMTPS id 5C9FD228E3; Tue, 4 Apr 2023 10:04:40 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=suse.cz; s=susede2_rsa; t=1680602680; h=from:from:reply-to:date:date:message-id:message-id:to:to:cc:cc: mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=p0AB/drouMtDRwvcSkIGBTa/IBApGxJYPT6bYboWjO4=; b=CiwoJwzMO0jOjX0KFkCr4dql9rxdBCULSTMxNtECHzXk0s8TLDPIjKYJHsgKaFLubKSJzD Uvu6F7Id9sKUQkvAJeGiVC+9WNwqquIXDexmB8KrHnus/YeW5BNEyU8usVPlPifNmLHKir Crv8n+crVSdTm6nWyMF60yui22CtByw= DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/relaxed; d=suse.cz; s=susede2_ed25519; t=1680602680; h=from:from:reply-to:date:date:message-id:message-id:to:to:cc:cc: mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=p0AB/drouMtDRwvcSkIGBTa/IBApGxJYPT6bYboWjO4=; b=KKdeodU3fiYTbBKoRQlTDvi3af12U/SmQtJcTJyMflgDanPv8vT/IeaQalMX27fNK8wbUV ElDlhQ2Rn/sscrBg== Received: from imap2.suse-dmz.suse.de (imap2.suse-dmz.suse.de [192.168.254.74]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature ECDSA (P-521) server-digest SHA512) (No client certificate requested) by imap2.suse-dmz.suse.de (Postfix) with ESMTPS id 467B713920; Tue, 4 Apr 2023 10:04:40 +0000 (UTC) Received: from dovecot-director2.suse.de ([192.168.254.65]) by imap2.suse-dmz.suse.de with ESMTPSA id ysw/ETj2K2TsUAAAMHmgww (envelope-from ); Tue, 04 Apr 2023 10:04:40 +0000 Received: by quack3.suse.cz (Postfix, from userid 1000) id C63CAA0732; Tue, 4 Apr 2023 12:04:39 +0200 (CEST) Date: Tue, 4 Apr 2023 12:04:39 +0200 From: Jan Kara To: Baokun Li Cc: Jan Kara , linux-ext4@vger.kernel.org, tytso@mit.edu, adilger.kernel@dilger.ca, ritesh.list@gmail.com, linux-kernel@vger.kernel.org, yi.zhang@huawei.com, yangerkun@huawei.com, yukuai3@huawei.com Subject: Re: [PATCH] ext4: only update i_reserved_data_blocks on successful block allocation Message-ID: <20230404100439.wrtlibaiu5nsvdem@quack3> References: <20230325063443.1839558-1-libaokun1@huawei.com> <20230327124700.mnldh4sosp3ptbls@quack3> <20230328100037.vy23wsnl437ujdoh@quack3> <7410b9be-da2d-57e0-c4f8-19900df2c440@huawei.com> <20230329162228.evbppkkcbbbnaeeo@quack3> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: X-Spam-Status: No, score=-2.5 required=5.0 tests=DKIM_SIGNED,DKIM_VALID, DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_MED,SPF_HELO_NONE,SPF_PASS autolearn=unavailable autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-ext4@vger.kernel.org On Mon 03-04-23 22:02:56, Baokun Li wrote: > On 2023/3/30 0:22, Jan Kara wrote: > > On Wed 29-03-23 15:23:19, Baokun Li wrote: > > > On 2023/3/28 18:00, Jan Kara wrote: > > > > On Mon 27-03-23 21:09:42, Baokun Li wrote: > > > > > On 2023/3/27 20:47, Jan Kara wrote: > > > > > > On Sat 25-03-23 14:34:43, Baokun Li wrote: > > > > > > > In our fault injection test, we create an ext4 file, migrate it to > > > > > > > non-extent based file, then punch a hole and finally trigger a WARN_ON > > > > > > > in the ext4_da_update_reserve_space(): > > > > > > > > > > > > > > EXT4-fs warning (device sda): ext4_da_update_reserve_space:369: > > > > > > > ino 14, used 11 with only 10 reserved data blocks > > > > > > > > > > > > > > When writing back a non-extent based file, if we enable delalloc, the > > > > > > > number of reserved blocks will be subtracted from the number of blocks > > > > > > > mapped by ext4_ind_map_blocks(), and the extent status tree will be > > > > > > > updated. We update the extent status tree by first removing the old > > > > > > > extent_status and then inserting the new extent_status. If the block range > > > > > > > we remove happens to be in an extent, then we need to allocate another > > > > > > > extent_status with ext4_es_alloc_extent(). > > > > > > > > > > > > > > use old to remove to add new > > > > > > > |----------|------------|------------| > > > > > > > old extent_status > > > > > > > > > > > > > > The problem is that the allocation of a new extent_status failed due to a > > > > > > > fault injection, and __es_shrink() did not get free memory, resulting in > > > > > > > a return of -ENOMEM. Then do_writepages() retries after receiving -ENOMEM, > > > > > > > we map to the same extent again, and the number of reserved blocks is again > > > > > > > subtracted from the number of blocks in that extent. Since the blocks in > > > > > > > the same extent are subtracted twice, we end up triggering WARN_ON at > > > > > > > ext4_da_update_reserve_space() because used > ei->i_reserved_data_blocks. > > > > > > Hum, but this second call to ext4_map_blocks() should find already allocated > > > > > > blocks in the indirect block and thus should not be subtracting > > > > > > ei->i_reserved_data_blocks for the second time. What am I missing? > > > > > > > > > > > > Honza > > > > > > > > > > > ext4_map_blocks > > > > > ? 1. Lookup extent status tree firstly > > > > > ?????? goto found; > > > > > ? 2. get the block without requesting a new file system block. > > > > > found: > > > > > ? 3. ceate and map the block > > > > > > > > > > When we call ext4_map_blocks() for the second time, we directly find the > > > > > corresponding blocks in the extent status tree, and then go directly to step > > > > > 3, > > > > > because our flag is brand new and therefore does not contain EXT4_MAP_MAPPED > > > > > but contains EXT4_GET_BLOCKS_CREATE, thus subtracting > > > > > ei->i_reserved_data_blocks > > > > > for the second time. > > > > Ah, I see. Thanks for explanation. But then the problem is deeper than just > > > > a mismatch in number of reserved delalloc block. The problem really is that > > > > if extent status tree update fails, we have inconsistency between what is > > > > stored in the extent status tree and what is stored on disk. And that can > > > > cause even data corruption issues in some cases. > > > The scenario we encountered was this: > > > ``` > > > write: > > > ??? ext4_es_insert_delayed_block > > > ??? [0/16) 576460752303423487 (U,D) > > > writepages: > > > ?? ?alloc lblk 11 pblk 35328 > > > ?? ?[0/16) 576460752303423487 (U,D) > > > ?? ?-- remove block 11 from extent > > > ?? ?? [0/11) 576460752303423487 (U,D,R)? +? (Newly allocated)[12/4) > > > 549196775151 (U,D,R) > > > ?? ?? --Failure to allocate memory for a new extent will undo as: > > > ?? ???? ??? [0/16) 576460752303423487 (U,D,R) > > Yes, this is what I was expecting. So now extent status tree is > > inconsistent with the on-disk allocation info because the block 11 is > > already allocated on disk but recorded as unallocated in the extent status > > tree. > > Yes! There is an inconsistency here, but do_writepages finds that the > writeback returns -ENOMEM and keeps retrying until it succeeds, at which > point the above inconsistency does not exist. Well, do_writepages() will not retry if wbc->sync_mode == WB_SYNC_NONE. So the inconsistency can stay for a long time. > > If the similar problem happened say when we punch a hole into a middle of a > > written extent and so block on disk got freed but extent status tree would > > still record it as allocated, user would be able to access freed block thus > > potentially exposing sensitive data. > > ext4_punch_hole > ? // remove extents in extents status tree > ? ext4_es_remove_extent > ? // remove extents tree on disk > ? ext4_ext_remove_space > > In this scenario, we always try to delete the extents in the in-memory > extents status tree first, and then delete the extents tree on disk. So > even if we fail in deleting extents in memory, there is no inconsistency, > am I missing something? No, you are right, this case is safe. Still I suspect inconsistencies with extent status tree can cause more problems and possibly stale data exposure. > > > ?? ?-- if success insert block 11 to extent status tree > > > ?? ?? [0/11) 576460752303423487 (U,D,R) + (Newly allocated)[11/1) 35328 (W) > > > + [12/4) 549196775151 (U,D,R) > > > > > > U: UNWRITTEN > > > D: DELAYED > > > W: WRITTEN > > > R: REFERENCED > > > ``` > > > > > > When we fail to allocate a new extent, we don't map buffer and we don't do > > > io_submit, so why is the extent tree in memory inconsistent with the one > > > stored on disk? Am I missing something? > > > > > > I would appreciate it if you could explain under what cases and what kind of > > > data corruption issues can be caused. > > See above. > > > > > > And this should also fix the problem you've hit because in case of > > > > allocation failure we may just end up with removed extent from the extent > > > > status tree and thus we refetch info from the disk and find out blocks are > > > > already allocated. > > > Reloading extent tree from disk I don't quite understand here, how do we > > > handle reserved blocks? could you explain it in more detail? > > > > > > Logically, I think it is still necessary to update i_reserved_data_blocks > > > only after a successful allocation. This is also done in > > > ext4_ext_map_blocks(). > > I guess there is some misunderstanding here. Both with > > ext4_ext_map_blocks() and ext4_ind_map_blocks() we end up updating > > i_reserved_data_blocks only after the blocks are successfully allocated and > > inserted in the respective data structure but *before* updating extent > > status tree. If extent status tree operation fails, we currently get > > inconsistency between extent status tree and on-disk info in both cases > > AFAICS. Am I missing something? > > Yes, our code is indeed designed to only update the number of reserved > blocks after the block allocation is complete. We have different > treatment for extent based file and non-extent based file in commit > 5f634d064c70 ("ext4: Fix quota accounting error with fallocate"). > > For extent based file, we update the number of reserved blocks before the > "got_allocated_blocks" tag after the blocks are successfully allocated in > ext4_ext_map_blocks(). > > For the non-extent based file we update the number of reserved blocks > after ext4_ind_map_blocks() is executed, which leads to the problem that > when we call ext4_ind_map_blocks() to create a block, it does not always > create a block. For example, if the extents status tree we encountered > earlier does not match the extents tree on disk, this is of course a > problem in itself, but in terms of code logic, updating the number of > reserved blocks as ext4_ext_map_blocks() does can prevent us from trying > to create a block and not creating it, resulting in an incorrect number > of reserved blocks. I see, thanks for explanation! Indeed it may be good to cleanup this code so that indirect block and extent based inodes are handled in the same way. Honza -- Jan Kara SUSE Labs, CR