From: David Howells
To: Jeff Layton
Cc: dhowells@redhat.com, Dave Chinner, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH 1/4] statx: Add a system call to make enhanced file info available
Date: Fri, 18 Nov 2016 18:04:16 +0000
Message-ID: <7019.1479492256@warthog.procyon.org.uk>
In-Reply-To: <1479489454.7629.1.camel@poochiereds.net>

Jeff Layton wrote:

> > We've already been through that.  I wanted to call it stx_data_version but
> > that got argued down to stx_version.  The problem is that what the version
> > number means is entirely filesystem dependent, and it might not just reflect
> > changes in the data.
> >
>
> It had better not just reflect data changes.
>
> knfsd populates the NFSv4 change attribute from inode->i_version. It
> _must_ have changed between subsequent queries if either the data or
> metadata has changed (basically whenever you would update either the
> ctime or the mtime).

No, I think it *should* just reflect the data changes - otherwise you have to
burn your cached data unnecessarily.

> > > So if stx_version is intended to export the internal filesystem
> > > inode change counter (i.e. inode->i_version) then let's call it that:
> > > stx_modification_count. It's clear and unambiguous as to what it
> > > represents, especially as this counter is more than just a "data
> > > modification" counter - inode metadata modifications will also
> > > cause it to change....
> >
> > I disagree that it's unambiguous.  It works like mtime, right?
>
> More like ctime + mtime mashed together.

Isn't ctime updated every time mtime is?  In that case stx_change_count would
be a better name.

> > Which wouldn't be of use for certain filesystems.  An example of this
> > would be AFS, where it's incremented by 1 each time a write is committed,
> > but is not updated for metadata changes.  This is what matters for data
> > caching.
> >
>
> No. Basically the rules are that if something in the inode data or
> metadata changed, then it must be a "larger" value (also accounting for
> wraparound). So you also need to change it (usually by incrementing it)
> when doing namespace changes that involve it (renames, unlinks, etc.).

That's entirely filesystem dependent.  A better rule is this: if you do a
write and then compare the data version you got back with the version you had
before, and it has increased by exactly one, then there were no other writes
between your last retrieval of the attributes and your write that just got
committed.  Admittedly, this assumes that the server serialises writes to a
particular file.  If the value is merely guaranteed to increase, this
mechanism can't tell you whether another write slipped in, so the version is
of limited value.
> Adding new fields in later piecemeal patches allows us to demonstrate
> that that concept actually works.

You're probably right, but the downside is that we really need some way to
find out what's supported.  On the other hand, we probably need that anyway,
hence my suggestion of an fsinfo() syscall as well.

> > You really think we're going to have accurate timestamps with a resolution
> > of a millionth of a nanosecond?  This means you're going to be doing a
> > 64-bit division every time you want a nanosecond timestamp.
> ...
>
> Could contemporary machines get away with just shifting down by 32
> bits?

A better way would probably be to have:

    struct timestamp {
        __u64   seconds;
        __u32   nanoseconds;
        __u32   femtoseconds;
    };

where you effectively add all the fields together with appropriate
multipliers.

But I still wonder if we really are going to move to femtosecond timestamps,
given that, to be useful, they would need clock frequencies well in excess of
1 THz (a one-femtosecond period corresponds to 1 PHz).  Even attoseconds is
probably unnecessary, given that clock frequencies don't seem to be moving
much beyond a few GHz, though it's reasonable that we could have a timestamp
counter that has an attosecond period - it's just that the processing time
needed to deal with it seems likely to render it unnecessary.

David