Date: Tue, 4 Oct 2022 11:26:43 -0700
From: "Darrick J. Wong"
To: "Gotou, Yasunori/五島 康文"
Cc: "Yang, Xiao/杨 晓", Brian Foster, "hch@infradead.org",
 "Ruan, Shiyang/阮 世阳", "linux-kernel@vger.kernel.org",
 "linux-xfs@vger.kernel.org", "nvdimm@lists.linux.dev",
 "linux-fsdevel@vger.kernel.org", "david@fromorbit.com",
 zwisler@kernel.org, Jeff Moyer, dm-devel@redhat.com, toshi.kani@hpe.com
Subject: Re: [PATCH] xfs: fail dax mount if reflink is enabled on a partition
References: <7fdc9e88-f255-6edb-7964-a5a82e9b1292@fujitsu.com>
 <76ea04b4-bad7-8cb3-d2c6-4ad49def4e05@fujitsu.com>
 <1444b9b5-363a-163c-0513-55d1ea951799@fujitsu.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Mon, Oct 03, 2022 at 09:12:46PM -0700, Gotou, Yasunori/五島 康文 wrote:
> On 2022/10/03 17:12, Darrick J. Wong wrote:
> > On Fri, Sep 30, 2022 at 09:56:41AM +0900, Gotou, Yasunori/五島 康文 wrote:
> > > Hello everyone,
> > >
> > > On 2022/09/20 11:38, Yang, Xiao/杨 晓 wrote:
> > > > Hi Darrick, Brian and Christoph
> > > >
> > > > Ping. I hope to get your feedback.
> > > >
> > > > 1) I have confirmed that the following patch set did not change the test
> > > > result of generic/470 with thin-volume. Besides, I didn't see any
> > > > failure when running generic/470 based on normal PMEM device instead of
> > > > thin-volume.
> > > > https://lore.kernel.org/linux-xfs/20211129102203.2243509-1-hch@lst.de/
> > > >
> > > > 2) I can reproduce the failure of generic/482 without thin-volume.
> > > >
> > > > 3) Is it necessary to make thin-volume support DAX. Is there any use
> > > > case for the requirement?
> > >
> > >
> > > Though I asked other place(*), I really want to know the usecase of
> > > dm-thin-volume with DAX and reflink.
> > >
> > >
> > > In my understanding, dm-thin-volume seems to provide similar feature like
> > > reflink of xfs. Both feature provide COW update to reduce usage of
> > > its region, and snapshot feature, right?
> > >
> > > I found that docker seems to select one of them (or other feature which
> > > supports COW). Then user don't need to use thin-volume and reflink at same
> > > time.
> > >
> > > Database which uses FS-DAX may want to use snapshot for its data of FS-DAX,
> > > its user seems to be satisfied with reflink or thin-volume.
> > >
> > > So I could not find on what use-case user would like to use dm-thin-volume
> > > and reflink at same time.
> > >
> > > The only possibility is that the user has mistakenly configured dm-thinpool
> > > and reflink to be used at the same time, but if that is the case, it seems
> > > to be better for the user to disable one or the other.
> > >
> > > I really wonder why dm-thin-volume must be used with reflink and FS-DAX.
> >
> > There isn't a hard requirement between fsdax and dm-thinp. The /test/
> > needs dm-logwrites to check that write page faults on a MAP_SYNC
> > mmapping are persisted directly to disk. dm-logwrites requires a fast
> > way to zero an entire device for correct operation of the replay step,
> > and thinp is the only way to guarantee that.
>
> Thank you for your answer. But I still feel something is strange.
> Though dm-thinp may be good way to execute the test correctly,

Yep.

> I suppose it seems to be likely a kind of workaround to pass the test,
> it may not be really required for actual users.

Exactly correct. Real users should /never/ set up this kind of (test
scaffolding|insanity) to use fsdax.

> Could you tell me why passing test by workaround is so necessary?

Notice this line in generic/470:

	$XFS_IO_PROG -t -c "truncate $LEN" -c "mmap -S 0 $LEN" -c "mwrite 0 $LEN" \
		-c "log_writes -d $LOGWRITES_NAME -m preunmap" \
		-f $SCRATCH_MNT/test

The second xfs_io command creates a MAP_SYNC mmap of the
SCRATCH_MNT/test file, and the third command memcpy's bytes to the
mapping to invoke the write page fault handler.

The fourth command tells the dm-logwrites driver for $LOGWRITES_NAME
(aka the block device containing the mounted XFS filesystem) to create a
mark called "preunmap". This mark captures the exact state of the block
device immediately after the write faults complete, so that we can come
back to it later. There are a few things to note here:

  (1) We did not tell the fs to persist anything;
  (2) We can't use dm-snapshot here, because dm-snapshot will flush the
      fs (I think?); and
  (3) The fs is still mounted, so the state of the block device at the
      mark reflects a dirty XFS with a log that must be replayed.

The next thing the test does is unmount the fs, remove the dm-logwrites
driver to stop recording, and check the fs:

	_log_writes_unmount
	_log_writes_remove
	_dmthin_check_fs

This ensures that the post-umount fs is consistent. Now we want to roll
back to the place we marked to see if the mwrite data made it to pmem.
It *should* have, since we asked for a MAP_SYNC mapping on a fsdax
filesystem recorded on a pmem device:

	# check pre-unmap state
	_log_writes_replay_log preunmap $DMTHIN_VOL_DEV
	_dmthin_mount

dm-logwrites can't actually roll backwards in time to a mark, since it
only records new disk contents. It /can/ however roll forward from
whatever point it began recording writes to the mark, so that's what it
does.

However -- remember note (3) from earlier.
When we _dmthin_mount after replaying the log to the "preunmap" mark,
XFS will see the dirty XFS log and try to recover the XFS log.

This is where the replay problems crop up. The XFS log records a
monotonically increasing sequence number (LSN) with every log update,
and when updates are written into the filesystem, that LSN is also
written into the filesystem block. Log recovery also replays updates
into the filesystem, but with the added behavior that it skips a block
replay if the block's LSN is higher than the transaction being replayed.
IOWs, we never replay older block contents over newer block contents.

For dm-logwrites this is a major problem, because there could be more
filesystem updates written to the XFS log after the mark is made. LSNs
will then be handed out like this:

mkfs_lsn                 preunmap_lsn             umount_lsn
  |                           |                      |
  |--------------------------||----------|-----------|
                             |           |
                         xxx_lsn     yyy_lsn

Let's say that a new metadata block "BBB" was created in update "xxx"
immediately before the preunmap mark was made. Per (1), we didn't flush
the filesystem before taking the mark, which means that the new block's
contents exist only in the log at this point.

Let us further say that the new block was again changed in update "yyy",
where preunmap_lsn < yyy_lsn <= umount_lsn. Clearly, yyy_lsn > xxx_lsn.
yyy_lsn is written to the block at unmount, because unmounting flushes
the log clean before it completes. This is the first time that BBB ever
gets written.

_log_writes_replay_log begins replaying the block device from mkfs_lsn
towards preunmap_lsn. When it's done, it will have a log that reflects
all the changes up to preunmap_lsn. Recall however that BBB isn't
written until after the preunmap mark, which means that dm-logwrites has
no record of BBB before preunmap_lsn, so dm-logwrites replay won't touch
BBB. At this point, the block header for BBB has a UUID that matches
the filesystem, but an LSN (yyy_lsn) that is beyond preunmap_lsn.

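To make the skip rule concrete, here is a toy model of it. This is only
a sketch in Python and has nothing to do with the real XFS recovery
code: LogItem, recover() and the numeric LSNs are all made up for
illustration, and a block that is missing from the dict stands in for a
block that reads back as zeroes.

```python
from dataclasses import dataclass


@dataclass
class LogItem:
    """Hypothetical stand-in for one logged block update."""
    lsn: int
    block: str
    contents: str


def recover(disk, log):
    """Replay log items onto disk in LSN order.

    disk maps block name -> (lsn, contents); a missing key models a
    block that reads back as zeroes. A replay is skipped when the
    on-disk LSN is higher than the LSN of the item being replayed.
    Returns the list of block names whose replay was skipped.
    """
    skipped = []
    for item in sorted(log, key=lambda i: i.lsn):
        on_disk = disk.get(item.block)
        if on_disk is not None and on_disk[0] > item.lsn:
            skipped.append(item.block)
            continue
        disk[item.block] = (item.lsn, item.contents)
    return skipped


# Made-up LSNs that respect xxx_lsn < preunmap_lsn < yyy_lsn <= umount_lsn.
XXX_LSN, PREUNMAP_LSN, YYY_LSN, UMOUNT_LSN = 10, 11, 20, 30

# The log as it exists at the preunmap mark: only update xxx touches BBB.
log_at_mark = [LogItem(XXX_LSN, "BBB", "xxx version")]

# Device not reinitialized: BBB still carries what unmount wrote.
stale_disk = {"BBB": (YYY_LSN, "yyy version")}
stale_skipped = recover(stale_disk, log_at_mark)

# Device that reads back zeroes: BBB has no valid header at all.
clean_disk = {}
clean_skipped = recover(clean_disk, log_at_mark)
```

With the stale device, stale_skipped names BBB and the block keeps its
yyy contents; with the zeroed device nothing is skipped and BBB ends up
with the xxx contents, which is what the state at the mark should be.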
XFS log recovery starts up, and finds transaction xxx. It will read BBB
from disk, but then it will see that it has an LSN of yyy_lsn. This is
larger than xxx_lsn, so it concludes that BBB is newer than the log and
moves on to the next log item. No other log items touch BBB, so
recovery finishes, and now we have a filesystem containing one metadata
block (BBB) from the future. This is an inconsistent filesystem, and
has caused failures in the tests that use logwrites.

To work around this problem, all we really need to do is reinitialize
the entire block device to known contents at mkfs time. This can be
done expensively by writing zeroes to the entire block device, or it can
be done cheaply by (a) issuing DISCARD to the whole block device at the
start of the test and (b) ensuring that reads after a discard always
produce zeroes. mkfs.xfs already does (a), so the test merely has to
ensure (b).

dm-thinp is the only software solution that provides (b), so that's why
this test layers dm-logwrites on top of dm-thinp on top of $SCRATCH_DEV.
This combination used to work, but with the pending pmem/blockdev
divorce, this strategy is no longer feasible.

I think the only way to fix this test is (a) revert all of Christoph's
changes so far and scuttle the divorce; or (b) change this test like so:

 1. Create a large sparse file on $TEST_DIR and losetup that sparse
    file. The resulting loop device will not have dax capability.

 2. Set up the dmthin/dmlogwrites stack on top of this loop device.

 3. Call mkfs.xfs with the SCRATCH_DEV (which hopefully is a pmem
    device) as the realtime device, and set the daxinherit and rtinherit
    flags on the root directory. The result is a filesystem with a data
    section that the kernel will treat as a regular block device, a
    realtime section backed by pmem, and the necessary flags to make
    sure that the test file will actually get fsdax mode.

 4. Acknowledge that we no longer have any way to test MAP_SYNC
    functionality on ext4, which means that generic/470 has to move to
    tests/xfs/.

--D

> Thanks,
>
>
> >
> > --D
> >
> > > If my understanding is something wrong, please correct me.
> > >
> > > (*)https://lore.kernel.org/all/TYWPR01MB1008258F474CA2295B4CD3D9B90549@TYWPR01MB10082.jpnprd01.prod.outlook.com/