From: Amir Goldstein
Date: Sat, 23 Sep 2023 09:36:15 +0300
Subject: Re: [GIT PULL v2] timestamp fixes
To: Linus Torvalds
Cc: Jeff Layton, Christian Brauner, linux-fsdevel@vger.kernel.org,
    linux-kernel@vger.kernel.org, Jan Kara, "Darrick J. Wong"

On Sat, Sep 23, 2023 at 3:43 AM Linus Torvalds wrote:
>
> On Thu, 21 Sept 2023 at 11:51, Jeff Layton wrote:
> >
> > We have many, many inodes though, and 12 bytes per adds up!
>
> That was my thinking, but honestly, who knows what other alignment
> issues might eat up some - or all - of the theoretical 12 bytes.
>
> It might be, for example, that the inode is already some aligned size,
> and that the allocation alignment means that the size wouldn't
> *really* shrink at all.
>
> So I just want to make clear that I think the 12 bytes isn't
> necessarily there. Maybe you'd get it, maybe it would be hidden by
> other things.
>
> My biggest impetus was really that whole abuse of a type that I
> already disliked for other reasons.
>
> > I'm on board with the idea, but...that's likely to be as big a patch
> > series as the ctime overhaul was. In fact, it'll touch a lot of the same
> > code. I can take a stab at that in the near future though.
>
> Yea, it's likely to be fairly big and invasive. That was one of the
> reasons for my suggested "inode_time()" macro hack: using the macro
> argument concatenation is really a hack to "gather" the pieces based
> on name, and while it's odd and not a very typical kernel model, I
> think doing it that way might allow the conversion to be slightly less
> painful.
>
> You'd obviously have to have the same kind of thing for assignment.
>
> Without that kind of name-based hack, you'd have to create all these
> random helper functions that just do the same thing over and over for
> the different times, which seems really annoying.
>
> > Since we're on the subject...another thing that bothers me with all of
> > the timestamp handling is that we don't currently try to mitigate "torn
> > reads" across the two different words. It seems like you could fetch a
> > tv_sec value and then get a tv_nsec value that represents an entirely
> > different timestamp if there are stores between them.
>
> Hmm. I think that's an issue that we have always had in theory, and
> have ignored because it's simply not problematic in practice, and
> fixing it is *hugely* painful.
>
> I suspect we'd have to use some kind of sequence lock for it (to make
> reads be cheap), and while it's _possible_ that having the separate
> accessor functions for reading/writing those times might help things
> out, I suspect the reading/writing happens for the different times (ie
> atime/mtime/ctime) together often enough that you might want to have
> the locking done at an outer level, and _not_ do it at the accessor
> level.
>
> So I suspect this is a completely separate issue (ie even an accessor
> doesn't make the "hugely painful" go away). And probably not worth
> worrying about *unless* somebody decides that they really really care
> about the race.
>
> That said, one thing that *could* help is if people decide that the
> right format for inode times is to just have one 64-bit word that has
> "sufficient resolution". That's what we did for "kernel time", ie
> "ktime_t" is a 64-bit nanosecond count, and by being just a single
> value, it avoids not just the horrible padding with 'struct
> timespec64', it is also dense _and_ can be accessed as one atomic
> value.
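
To make the "one 64-bit word" point above concrete, here is a minimal
userspace sketch in C11. It is not kernel code: ns_time_t, struct
toy_inode and the toy_inode_*() accessors are hypothetical names made up
for illustration, and it assumes a platform where 64-bit atomics are
lock-free. Because the whole timestamp lives in one word, a reader can
never pair the seconds of one store with the nanoseconds of another.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <time.h>

#define NSEC_PER_SEC 1000000000LL

/* Hypothetical single-word timestamp: nanoseconds since the Unix epoch. */
typedef int64_t ns_time_t;

/* Pack seconds + nanoseconds into one signed 64-bit nanosecond count. */
ns_time_t ns_time_from_timespec(struct timespec ts)
{
    assert(ts.tv_nsec >= 0 && ts.tv_nsec < NSEC_PER_SEC);
    return (ns_time_t)ts.tv_sec * NSEC_PER_SEC + ts.tv_nsec;
}

/* Unpack, using floor division so pre-1970 values keep tv_nsec in range. */
struct timespec ns_time_to_timespec(ns_time_t t)
{
    struct timespec ts;
    int64_t sec = t / NSEC_PER_SEC;
    int64_t nsec = t % NSEC_PER_SEC;

    if (nsec < 0) {
        sec -= 1;
        nsec += NSEC_PER_SEC;
    }
    ts.tv_sec = (time_t)sec;
    ts.tv_nsec = (long)nsec;
    return ts;
}

/* Toy stand-in for an inode that keeps one timestamp in a single word. */
struct toy_inode {
    _Atomic ns_time_t mtime;
};

/* One atomic store publishes seconds and nanoseconds together. */
void toy_inode_set_mtime(struct toy_inode *inode, struct timespec ts)
{
    atomic_store_explicit(&inode->mtime, ns_time_from_timespec(ts),
                          memory_order_relaxed);
}

/* One atomic load: the result always comes from a single store. */
struct timespec toy_inode_get_mtime(struct toy_inode *inode)
{
    return ns_time_to_timespec(
        atomic_load_explicit(&inode->mtime, memory_order_relaxed));
}
```

This only protects a single timestamp. Reading atime/mtime/ctime as a
consistent group would still need locking at an outer level, as the
quoted text says.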

Just pointing out that xfs has already changed its on-disk timestamp
format to this model (a.k.a. bigtime), but the in-core inode still uses
the timespec64, of course.
The vfs can take inspiration from this model.

>
> Sadly, that "sufficient resolution" couldn't be nanoseconds, because
> 64-bit nanoseconds isn't enough of a spread. It's fine for the kernel
> time, because 2**63 nanoseconds is 292 years, so it moved the "year
> 2038" problem to "year 2262".

Note that xfs_bigtime_to_unix(XFS_BIGTIME_TIME_MAX) is in year
2486, not year 2262, because there was no need to use the 64bit to
go backwards to year 1678.

>
> And that's ok when we're talking about times that are kernel running
> times and we have a couple of centuries to say "ok, we'll need to make
> it be a bigger type", but when you save the values to disk, things are
> different. I suspect filesystem people are *not* willing to deal with
> a "year 2262" issue.
>

Apparently, they are willing to handle the "year 2486" issue ;)

> But if we were to say that "a tenth of microsecond resolution is
> sufficient for inode timestamps", then suddenly 64 bits is *enormous*.
> So we could do a
>
>     // tenth of a microseconds since Jan 1, 1970
>     typedef s64 fstime_t;
>
> and have a nice dense timestamp format with reasonable - but not
> nanosecond - accuracy. Now that 292 year range has become 29,247
> years, and filesystem people *might* find the "year-31k" problem
> acceptable.
>
> I happen to think that "100ns timestamp resolution on files is
> sufficient" is a very reasonable statement, but I suspect that we'll
> still find lots of people who say "that's completely unacceptable"
> both to that resolution, and to the 31k-year problem.
>

I am guessing that you are aware of the Windows/SMB FILETIME standard,
which is 64bit in 100ns units (since 1601).
So the 31k-year "problem" is very widespread already.
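
The year figures traded above can be checked with a few lines of C. This
is a rough sketch, not code from the kernel or xfs: last_unix_second()
and approx_year() are hypothetical helpers, and the year is estimated
with a mean Gregorian year of 365.2425 days. It also assumes that the
xfs bigtime counter is an unsigned 64-bit nanosecond count whose zero
sits at the minimum of the signed 32-bit range (December 1901), and that
only the positive half of a signed 64-bit FILETIME value is used.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL
#define SEC_PER_YEAR 31556952LL /* mean Gregorian year, 365.2425 days */

/*
 * Last representable time, in seconds since the Unix epoch, for a counter
 * holding up to 'max_ticks' ticks of 'tick_ns' nanoseconds each, whose
 * zero point lies 'epoch_unix_sec' seconds away from 1970-01-01.
 */
int64_t last_unix_second(uint64_t max_ticks, uint64_t tick_ns,
                         int64_t epoch_unix_sec)
{
    assert(tick_ns > 0 && NSEC_PER_SEC % tick_ns == 0);
    uint64_t ticks_per_sec = NSEC_PER_SEC / tick_ns;

    assert(max_ticks / ticks_per_sec <= (uint64_t)INT64_MAX);
    return (int64_t)(max_ticks / ticks_per_sec) + epoch_unix_sec;
}

/* Approximate calendar year for a non-negative Unix time. */
int64_t approx_year(int64_t unix_sec)
{
    assert(unix_sec >= 0);
    return 1970 + unix_sec / SEC_PER_YEAR;
}
```

With these helpers, approx_year(last_unix_second(INT64_MAX, 1, 0)) gives
2262 for a signed nanosecond count, and
approx_year(last_unix_second(UINT64_MAX, 1, INT32_MIN)) gives 2486 under
the bigtime assumption above. A signed count of 100ns ticks from 1970
ends around year 31197, and the same count from 1601 (an offset of
-11644473600 seconds) ends around year 30828. The 29,247-year span
in the quoted text corresponds to 365-day years; with the mean Gregorian
year it is about 29,227.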

But the resolution change is counter to the purpose of multigrain
timestamps - if two syscalls updated the same or two different inodes
within a 100ns tick, apparently, there are some workloads that care to
know about it, and the fs needs to store this information persistently.

Thanks,
Amir.