Received: by 2002:a05:6359:c8b:b0:c7:702f:21d4 with SMTP id go11csp3491082rwb; Mon, 3 Oct 2022 16:14:49 -0700 (PDT) X-Google-Smtp-Source: AMsMyM4oYAm+dP7pGQjMWe57utGh7dR/wNcV1SYYRZJ+8Xi/dZ+dUU7CaOl0tC/mjWQS1HYr3tNa X-Received: by 2002:a17:90a:6503:b0:207:cd0e:bf1f with SMTP id i3-20020a17090a650300b00207cd0ebf1fmr14673069pjj.25.1664838888933; Mon, 03 Oct 2022 16:14:48 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1664838888; cv=none; d=google.com; s=arc-20160816; b=FcPVfIFKzaV1iHnu5O3uA1CeMPrOBZcMpCUMCbO/ZFbpY4q0Jg0Rka9iyjWY4Lg086 1VBrki59n4e4S8dwc6KhyzzKUIH8mIoUWllh/+0nxLErJ5DqPGKVgZBwGNZlj3vv1Vd8 tAQzfdy338juSMJ3/yYPIwOmhK0ogbEJob9CVyFaYj+XKlCq4hWlQ0NhhLhQrXYgU9U7 bgnC8dtDihqYYUVt9iJgSgN8ODQW/sknW33GC/ul/uPwtkUNofK9kw3NLNNA5KUibUQY dNqV6Q11XKX6nRRsL1Rf6pjsfNvsExWimFD6FjaJtOKcbV5tMpjg62TvI0EtrFVr3ISu 6fFg== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:message-id:date:references:in-reply-to:subject :cc:to:from:mime-version:content-transfer-encoding:dkim-signature :dkim-signature; bh=V1nQZ58TJ+GEmzZcpyTOK5ayREv1t5fIjSbrNpFEZ4o=; b=ESe+pP+JrmV1smnir99bijq9K1Sfbk9uxXfVnLF+PFqVHr+sLrEIxdWwDT5nT451K4 BPftIl9Ksu5RZHCwZlDwndi33CkSQRlJF3j3SxNm0uvbCaKoF6PT8/71jF4pCBrKGHpE nbe5ufSqEKhgT9ET3C89lGQAgE3Q3+XZT5f0XPn52JUnT/9t3w6giF3XZ8CNPf8jHhuZ 9tvxNSKseujSPBYI4oIqkP2X8ER6IM8spVrlcM+MjGAazk84fA9Oog3wDjFz/VajG+bg xGBlUUhzIhxYRQsJDSd4pM22BEm7oVg+2IfC9aQeZP0bFu9y4EahHB4hQUkwQHaKsGKy gIqA== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@suse.de header.s=susede2_rsa header.b="yimo5/6E"; dkim=neutral (no key) header.i=@suse.de; spf=pass (google.com: domain of linux-ext4-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-ext4-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=suse.de Return-Path: Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id v10-20020a170902b7ca00b001713df0c8b2si11092518plz.213.2022.10.03.16.14.31; Mon, 03 Oct 2022 16:14:48 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-ext4-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@suse.de header.s=susede2_rsa header.b="yimo5/6E"; dkim=neutral (no key) header.i=@suse.de; spf=pass (google.com: domain of linux-ext4-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-ext4-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=suse.de Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229697AbiJCW46 (ORCPT + 99 others); Mon, 3 Oct 2022 18:56:58 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:37066 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229549AbiJCW45 (ORCPT ); Mon, 3 Oct 2022 18:56:57 -0400 Received: from smtp-out1.suse.de (smtp-out1.suse.de [195.135.220.28]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 3DCAB31373; Mon, 3 Oct 2022 15:56:55 -0700 (PDT) Received: from imap2.suse-dmz.suse.de (imap2.suse-dmz.suse.de [192.168.254.74]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature ECDSA (P-521) server-digest SHA512) (No client certificate requested) by smtp-out1.suse.de (Postfix) with ESMTPS id AC70821902; Mon, 3 Oct 2022 22:56:53 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=suse.de; s=susede2_rsa; t=1664837813; h=from:from:reply-to:date:date:message-id:message-id:to:to:cc:cc: mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=V1nQZ58TJ+GEmzZcpyTOK5ayREv1t5fIjSbrNpFEZ4o=; b=yimo5/6EB/XmhspbZuEjyIRw3HYb3bw8toqAtzev2eUQkTn/03jPZw7eZ/PQk4+m4i7XKR /P7hp7Rh9MPX3ISKaj08hoFSFB5nZGLSrQZAQ0tf7AriKJ34pb0PaXqYz68VYIIXSXFXHU A8NmMmj/VcMLYYmVhJTBBt+nwRVmZBs= DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/relaxed; d=suse.de; s=susede2_ed25519; t=1664837813; h=from:from:reply-to:date:date:message-id:message-id:to:to:cc:cc: mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=V1nQZ58TJ+GEmzZcpyTOK5ayREv1t5fIjSbrNpFEZ4o=; b=jrlkmi0B7QBCeHc3w4JoJ77FztOga2B/4GULpyIMCqDJyJ85OhF+N3KBk6kStvshIiqYZK R0gpizHBYQ/QCNDA== Received: from imap2.suse-dmz.suse.de (imap2.suse-dmz.suse.de [192.168.254.74]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature ECDSA (P-521) server-digest SHA512) (No client certificate requested) by imap2.suse-dmz.suse.de (Postfix) with ESMTPS id E65291332F; Mon, 3 Oct 2022 22:56:46 +0000 (UTC) Received: from dovecot-director2.suse.de ([192.168.254.65]) by imap2.suse-dmz.suse.de with ESMTPSA id geubJ65oO2NBDwAAMHmgww (envelope-from ); Mon, 03 Oct 2022 22:56:46 +0000 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable MIME-Version: 1.0 From: "NeilBrown" To: "Amir Goldstein" Cc: "Jeff Layton" , tytso@mit.edu, adilger.kernel@dilger.ca, djwong@kernel.org, david@fromorbit.com, trondmy@hammerspace.com, viro@zeniv.linux.org.uk, zohar@linux.ibm.com, xiubli@redhat.com, chuck.lever@oracle.com, lczerner@redhat.com, jack@suse.cz, bfields@fieldses.org, brauner@kernel.org, fweimer@redhat.com, linux-btrfs@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, ceph-devel@vger.kernel.org, linux-ext4@vger.kernel.org, linux-nfs@vger.kernel.org, linux-xfs@vger.kernel.org Subject: Re: [PATCH v6 8/9] vfs: update times after copying data in __generic_file_write_iter In-reply-to: References: <20220930111840.10695-1-jlayton@kernel.org>, <20220930111840.10695-9-jlayton@kernel.org>, , , Date: Tue, 04 Oct 2022 09:56:42 +1100 Message-id: <166483780286.14457.1388505585556274283@noble.neil.brown.name> X-Spam-Status: No, score=-4.4 required=5.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_MED,SPF_HELO_NONE, SPF_PASS autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-ext4@vger.kernel.org On Tue, 04 Oct 2022, Amir Goldstein wrote: > On Mon, Oct 3, 2022 at 4:01 PM Jeff Layton wrote: > > > > On Sun, 2022-10-02 at 10:08 +0300, Amir Goldstein wrote: > > > On Fri, Sep 30, 2022 at 2:30 PM Jeff Layton wrote: > > > > > > > > The c/mtime and i_version currently get updated before the data is > > > > copied (or a DIO write is issued), which is problematic for NFS. > > > > > > > > READ+GETATTR can race with a write (even a local one) in such a way as > > > > to make the client associate the state of the file with the wrong cha= nge > > > > attribute. That association can persist indefinitely if the file sees= no > > > > further changes. > > > > > > > > Move the setting of times to the bottom of the function in > > > > __generic_file_write_iter and only update it if something was > > > > successfully written. > > > > > > > > > > This solution is wrong for several reasons: > > > > > > 1. There is still file_update_time() in ->page_mkwrite() so you haven't > > > solved the problem completely > > > > Right. I don't think there is a way to solve the problem vs. mmap. > > Userland can write to a writeable mmap'ed page at any time and we'd > > never know. We have to specifically carve out mmap as an exception here. > > I'll plan to add something to the manpage patch for this. > > > > > 2. The other side of the coin is that post crash state is more likely t= o end > > > up data changes without mtime/ctime change > > > > > > > Is this really something filesystems rely on? I suppose the danger is > > that some cached data gets written to disk before the write returns and > > the inode on disk never gets updated. > > > > But...isn't that a danger now? Some of the cached data could get written > > out and the updated inode just never makes it to disk before a crash > > (AFAIU). I'm not sure that this increases our exposure to that problem. > > > > >=20 > You are correct that that danger exists, but it only exists for overwriting > to allocated blocks. >=20 > For writing to new blocks, mtime change is recorded in transaction > before the block mapping is recorded in transaction so there is no > danger in this case (before your patch). >=20 > Also, observing size change without observing mtime change > after crash seems like a very bad outcome that may be possible > after your change. >=20 > These are just a few cases that I could think of, they may be filesystem > dependent, but my gut feeling is that if you remove the time update before > the operation, that has been like that forever, a lot of s#!t is going to f= loat > for various filesystems and applications. >=20 > And it is not one of those things that are discovered during rc or even > stable kernel testing - they are discovered much later when users start to > realize their applications got bogged up after crash, so it feels like to me > like playing with fire. >=20 > > > If I read the problem description correctly, then a solution that inval= idates > > > the NFS cache before AND after the write would be acceptable. Right? > > > Would an extra i_version bump after the write solve the race? > > > > > > > I based this patch on Neil's assertion that updating the time before an > > operation was pointless if we were going to do it afterward. The NFS > > client only really cares about seeing it change after a write. > > >=20 > Pointless to NFS client maybe. > Whether or not this is not changing user behavior for other applications > is up to you to prove and I doubt that you can prove it because I doubt > that it is true. >=20 > > Doing both would be fine from a correctness standpoint, and in most > > cases, the second would be a no-op anyway since a query would have to > > race in between the two for that to happen. > > > > FWIW, I think we should update the m/ctime and version at the same time. > > If the version changes, then there is always the potential that a timer > > tick has occurred. So, that would translate to a second call to > > file_update_time in here. > > > > The downside of bumping the times/version both before and after is that > > these are hot codepaths, and we'd be adding extra operations there. Even > > in the case where nothing has changed, we'd have to call > > inode_needs_update_time a second time for every write. Is that worth the > > cost? >=20 > Is there a practical cost for iversion bump AFTER write as I suggested? > If you NEED m/ctime update AFTER write and iversion update is not enough > then I did not understand from your commit message why that is. >=20 > Thanks, > Amir. >=20 Maybe we should split i_version updates from ctime updates. While it isn't true that ctime updates have happened before the write "forever" it has been true since 2.3.43[1] which is close to forever. For ctime there doesn't appear to be a strong specification of when the change happens, so history provides a good case for leaving it before. For i_version we want to provide clear and unambiguous semantics. Performing 2 updates makes the specification muddy. So I would prefer a single update for i_version, performed after the change becomes visible. If that means it has to be separate from ctime, then so be it. NeilBrown [1]: https://git.kernel.org/pub/scm/linux/kernel/git/history/history.git/com= mit/?id=3D636b38438001a00b25f23e38747a91cb8428af29