From: Amir Goldstein
Date: Wed, 25 Oct 2023 13:41:46 +0300
Subject: Re: [PATCH RFC 2/9] timekeeping: new interfaces for multigrain timestamp handing
To: Dave Chinner
Cc: Jeff Layton, Linus Torvalds, Kent Overstreet, Christian Brauner,
    Alexander Viro, John Stultz, Thomas Gleixner, Stephen Boyd,
    Chandan Babu R, "Darrick J. Wong", "Theodore Ts'o", Andreas Dilger,
    Chris Mason, Josef Bacik, David Sterba, Hugh Dickins, Andrew Morton,
    Jan Kara, David Howells, linux-fsdevel@vger.kernel.org,
    linux-kernel@vger.kernel.org, linux-xfs@vger.kernel.org,
    linux-ext4@vger.kernel.org, linux-btrfs@vger.kernel.org,
    linux-mm@kvack.org, linux-nfs@vger.kernel.org
References: <0a1a847af4372e62000b259e992850527f587205.camel@kernel.org>
    <61b32a4093948ae1ae8603688793f07de764430f.camel@kernel.org>
X-Mailing-List: linux-ext4@vger.kernel.org

On Wed, Oct 25, 2023 at 11:05 AM Dave Chinner wrote:
>
> On Tue, Oct 24, 2023 at 02:40:06PM -0400, Jeff Layton wrote:
> > On Tue, 2023-10-24 at 10:08 +0300, Amir Goldstein wrote:
> > > On Tue, Oct 24, 2023 at 6:40 AM Dave Chinner wrote:
> > > >
> > > > On Mon, Oct 23, 2023 at 02:18:12PM -1000, Linus Torvalds wrote:
> > > > > On Mon, 23 Oct 2023 at 13:26, Dave Chinner wrote:
> > > > > >
> > > > > > The problem is the first read request after a modification has been
> > > > > > made. That is causing relatime to see mtime > atime and triggering
> > > > > > an atime update. XFS sees this, does an atime update, and in
> > > > > > committing that persistent inode metadata update, it calls
> > > > > > inode_maybe_inc_iversion(force = false) to check if an iversion
> > > > > > update is necessary. The VFS sees I_VERSION_QUERIED, and so it bumps
> > > > > > i_version and tells XFS to persist it.
> > > > >
> > > > > Could we perhaps just have a mode where we don't increment i_version
> > > > > for just atime updates?
> > > > >
> > > > > Maybe we don't even need a mode, and could just decide that atime
> > > > > updates aren't i_version updates at all?
> > > >
> > > > We do that already - in memory atime updates don't bump i_version at
> > > > all. The issue is the rare persistent atime update requests that
> > > > still happen - they are the ones that trigger an i_version bump on
> > > > XFS, and one of the relatime heuristics tickle this specific issue.
> > > >
> > > > If we push the problematic persistent atime updates to be in-memory
> > > > updates only, then the whole problem with i_version goes away....
> > > >
> > > > > Yes, yes, it's obviously technically a "inode modification", but does
> > > > > anybody actually *want* atime updates with no actual other changes to
> > > > > be version events?
> > > >
> > > > Well, yes, there was. That's why we defined i_version in the on disk
> > > > format this way well over a decade ago. It was part of some deep
> > > > dark magical HSM beans that allowed the application to combine
> > > > multiple scans for different inode metadata changes into a single
> > > > pass. atime changes was one of the things it needed to know about
> > > > for tiering and space scavenging purposes....
> > > >
> > >
> > > But if this is such an ancient mystical program, why do we have to
> > > keep this XFS behavior in the present?
> > > BTW, is this the same HSM whose DMAPI ioctls were deprecated
> > > a few years back?
>
> Drop the attitude, Amir.
>
> That "ancient mystical program" is this:
>
> https://buy.hpe.com/us/en/enterprise-solutions/high-performance-computing-solutions/high-performance-computing-storage-solutions/hpc-storage-solutions/hpe-data-management-framework-7/p/1010144088
>

Sorry for the attitude Dave, I somehow got the impression that you
were talking about a hypothetical old program that may be out of use.
I believe that Jeff and Linus got the same impression...

> Yup, that product is backed by a proprietary descendent of the Irix
> XFS code base XFS that is DMAPI enabled and still in use today. It's
> called HPE XFS these days....
>

What do you mean?
Do you mean that the HPE product uses patched XFS?
If so, why is that an upstream concern?

Upstream xfs indeed preserves di_dmstate and di_dmevmask, but it does not
change those state members when file changes happen.

So if mounting an HPE XFS disk with an upstream kernel is not going to
record DMAPI state changes, does it matter if upstream xfs does not
update di_changecount on atime change?

Maybe I did not understand the situation w.r.t. HPE XFS.

> > > I mean, I understand that you do not want to change the behavior of
> > > i_version update without an opt-in config or mount option - let the distro
> > > make that choice.
> > > But calling this an "on-disk format change" is a very long stretch.
>
> Telling the person who created, defined and implemented the on disk
> format that they don't know what constitutes a change of that
> on-disk format seems kinda Dunning-Kruger to me....
>

OK. I will choose my words more carefully:

I still do not understand, from everything that you have told us so far,
including the mention of the specific product above, why not updating
di_changecount on atime update constitutes an on-disk format change and
not a runtime behavior change.

You also did not address my comment that, to the best of my code reading
abilities, xfs_repair does not update di_changecount on any inode changes.

> There are *lots* of ways that di_changecount is now incompatible
> with the VFS change counter. That's now defined as "i_version should
> only change when [cm]time is changed".
>
> di_changecount is defined to be a count of the number of changes
> made to the attributes of the inode. It's not just atime at issue
> here - we bump di_changecount when make any inode change, including
> background work that does not otherwise change timestamps. e.g.
> allocation at writeback time, unwritten extent conversion, on-disk
> EOF extension at IO completion, removal of speculative
> pre-allocation beyond EOF, etc.
>

I see. Does xfs update ctime on all those inode block map changes?

> IOWs, di_changecount was never defined as a linux "i_version"
> counter, regardless of the fact we originally we able to implement
> i_version with it - all extra bumps to di_changecount were not
> important to the users of i_version for about a decade.
>
> Unfortunately, the new i_version definition is very much
> incompatible with the existing di_changecount definition and that's
> the underlying problem here. i.e. the problem is not that we bump
> i_version on atime, it's that di_changecount is now completely
> incompatible with the new i_version change semantics.
>
> To implement the new i_version semantics exactly, we need to add a
> new field to the inode to hold this information.
> If we change the on disk format like this, then the atime
> problems go away because the new field would not get updated on
> atime updates. We'd still be bumping di_changecount on atime
> updates, though, because that's what is required by the on-disk
> format.
>

I fully agree with you that we should avoid an on-disk format change.
This is exactly the reason that I'm insisting on clarifying exactly how
this semantic change of di_changecount is going to break existing
applications that run on an upstream kernel.

Thanks,
Amir.
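
For readers following the thread, the sequence quoted at the top (relatime sees mtime > atime on the first read after a write, the atime update is persisted, and a previously queried i_version gets bumped) can be modeled with a small toy. This is only a sketch, not kernel code: `ToyInode`, `write_file`, `read_file` and `relatime_needs_update` are hypothetical names, the `queried` boolean stands in for the I_VERSION_QUERIED flag, the 24-hour staleness rule is included as an assumption about relatime, and every relatime-triggered atime update is assumed to be persistent.

```python
from dataclasses import dataclass

DAY = 24 * 60 * 60  # seconds; assumed relatime staleness window


@dataclass
class ToyInode:
    """Toy stand-in for an inode (hypothetical, not the kernel's struct inode)."""
    atime: int = 0
    mtime: int = 0
    ctime: int = 0
    i_version: int = 0
    queried: bool = False  # models the I_VERSION_QUERIED flag

    def query_iversion(self) -> int:
        """An observer samples the counter and marks it as seen."""
        self.queried = True
        return self.i_version

    def maybe_inc_iversion(self, force: bool = False) -> bool:
        """Bump only if forced, or if the counter was queried since the last bump."""
        if not force and not self.queried:
            return False
        self.i_version += 1
        self.queried = False
        return True


def relatime_needs_update(inode: ToyInode, now: int) -> bool:
    """relatime-style check: atime is older than mtime/ctime, or is stale."""
    return (inode.mtime > inode.atime
            or inode.ctime > inode.atime
            or now - inode.atime >= DAY)


def write_file(inode: ToyInode, now: int) -> bool:
    """Model a data write: mtime and ctime change, counter may bump."""
    inode.mtime = inode.ctime = now
    return inode.maybe_inc_iversion(force=False)


def read_file(inode: ToyInode, now: int) -> bool:
    """Model a read whose atime update is persisted. Returns True if i_version bumped."""
    if not relatime_needs_update(inode, now):
        return False
    inode.atime = now
    return inode.maybe_inc_iversion(force=False)


if __name__ == "__main__":
    inode = ToyInode()
    inode.query_iversion()            # an observer starts watching
    write_file(inode, now=100)        # real change: counter bumps
    seen = inode.query_iversion()     # observer samples after the write
    bumped = read_file(inode, now=101)
    print("first read bumped i_version:", bumped)
    print("observer saw", seen, "now", inode.query_iversion())
```

In this model the first read after the write changes the counter even though no data, mtime or ctime changed, while a second read shortly afterwards does not. That is the behavior the thread is debating.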