Received: by 2002:a05:6358:489b:b0:bb:da1:e618 with SMTP id x27csp5995426rwn; Mon, 12 Sep 2022 18:48:38 -0700 (PDT) X-Google-Smtp-Source: AA6agR5sn0bvBc1uCoYmE/YUvNbHNGtiQBqdiSvqCRCzqz9H+2lEEaB3a3qf9pWWlK5clMNjq9L0 X-Received: by 2002:a65:5184:0:b0:439:14cb:fbe4 with SMTP id h4-20020a655184000000b0043914cbfbe4mr5254486pgq.166.1663033718273; Mon, 12 Sep 2022 18:48:38 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1663033718; cv=none; d=google.com; s=arc-20160816; b=LLxofGjiIGQKT0fNwvGWLDIE77zYIr2WLuAlJ6UZWzsrN6q4gZrr4kgb1GYu5i4WDR U49ByQSf/18b33HwLv6OB5cGdIONwiFDZZOQlZ4rm3BkisD1C5cDQXDFlFbFBdfmTQf6 JyMtNdwCdj4PhH+FBuiMfAkoaWPUPUJglYVPYgNqB2re74qnkxL8xiOAi3kvdJ9jvuh4 y0vQnA59wEEeBGiBVt3SQnTLVhVi5aryC6xEXpBx2rD3D1rYJvZ8u3ReIuC1eNDt45IG txl1etCOD5i3SwBDIsZXocit782wioOxZNqxomtA6t5yAlPrxRobrzrlS8ogIoSZbWw1 b8oQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:message-id:date:references:in-reply-to:subject :cc:to:from:mime-version:content-transfer-encoding:dkim-signature :dkim-signature; bh=jQXS62ARbgeMHE4zS0cb/E0qsyN+7SPamJzzkekPkYQ=; b=CqwJ3+CIEa/zD4pEuIn3ZPHG1MzmIMnu9us2lzFpOWzLAQlso1dTS/oQ/JSOFpruTl sY3WRMtvT+nfoP+DO/IYYBCMczJOJcxDSYhF+MkLcqJ5TFPj8Bu24b4FOu6qtMVuBz8D HHObYIzEoNAIAEkCw//uVcaE0kBv484Mq5sDqnWbec9H4rxs3nkIOSKrBDIF+YAUFE6Y Ctmm8XXBNePoXmVgGGFcugThdMxrjd03l4/mLfYeGjVcyjCb67b7i1mVMhF+vF3WMWy0 6YlcqIRTxoih6TbGZA61+7n31uKE9SNkk7QNuBuHoXoNpRIx7otVLlnE0kzKUmLCkPMe dczw== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@suse.de header.s=susede2_rsa header.b=0oqI5Kqu; dkim=neutral (no key) header.i=@suse.de header.s=susede2_ed25519 header.b=mqeakveC; spf=pass (google.com: domain of linux-ext4-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-ext4-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=suse.de Return-Path: Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id v12-20020a63480c000000b0041c5b90f8cbsi10562687pga.850.2022.09.12.18.48.17; Mon, 12 Sep 2022 18:48:38 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-ext4-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@suse.de header.s=susede2_rsa header.b=0oqI5Kqu; dkim=neutral (no key) header.i=@suse.de header.s=susede2_ed25519 header.b=mqeakveC; spf=pass (google.com: domain of linux-ext4-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-ext4-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=suse.de Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229533AbiIMBmJ (ORCPT + 99 others); Mon, 12 Sep 2022 21:42:09 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:45212 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229453AbiIMBmI (ORCPT ); Mon, 12 Sep 2022 21:42:08 -0400 Received: from smtp-out1.suse.de (smtp-out1.suse.de [195.135.220.28]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 7C0AE46D9A; Mon, 12 Sep 2022 18:42:04 -0700 (PDT) Received: from imap2.suse-dmz.suse.de (imap2.suse-dmz.suse.de [192.168.254.74]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature ECDSA (P-521) server-digest SHA512) (No client certificate requested) by smtp-out1.suse.de (Postfix) with ESMTPS id B1AE634435; Tue, 13 Sep 2022 01:42:01 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=suse.de; s=susede2_rsa; t=1663033321; h=from:from:reply-to:date:date:message-id:message-id:to:to:cc:cc: mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=jQXS62ARbgeMHE4zS0cb/E0qsyN+7SPamJzzkekPkYQ=; b=0oqI5KqujSsYt7OcPMrkd8raLljXMDOjOw0JWU49e0OTkyrT07Ox6npF0kwZOotzcdmDQk Kxzv9Q3cwVSk4pe07CtkvYgiYYnwXZaVebXViXl0nLcYX0OvVs36pslpkPGA9qA7kIHq8s 6QVpyoZC+VNDbh2ffaTv712fWTGRAqk= DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/relaxed; d=suse.de; s=susede2_ed25519; t=1663033321; h=from:from:reply-to:date:date:message-id:message-id:to:to:cc:cc: mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=jQXS62ARbgeMHE4zS0cb/E0qsyN+7SPamJzzkekPkYQ=; b=mqeakveCsYWbMxGgdf2U4X84OMkcEdq0hJu/JTWKO5bTENPJiLaOgcXmER6MF6gE6CXHt4 4hHlDTvEuZLB6MDw== Received: from imap2.suse-dmz.suse.de (imap2.suse-dmz.suse.de [192.168.254.74]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature ECDSA (P-521) server-digest SHA512) (No client certificate requested) by imap2.suse-dmz.suse.de (Postfix) with ESMTPS id E8796139B3; Tue, 13 Sep 2022 01:41:53 +0000 (UTC) Received: from dovecot-director2.suse.de ([192.168.254.65]) by imap2.suse-dmz.suse.de with ESMTPSA id MYXIJeHfH2PFFwAAMHmgww (envelope-from ); Tue, 13 Sep 2022 01:41:53 +0000 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: 7bit MIME-Version: 1.0 From: "NeilBrown" To: "Dave Chinner" Cc: "Jeff Layton" , "Trond Myklebust" , "zohar@linux.ibm.com" , "djwong@kernel.org" , "xiubli@redhat.com" , "brauner@kernel.org" , "linux-xfs@vger.kernel.org" , "linux-api@vger.kernel.org" , "bfields@fieldses.org" , "fweimer@redhat.com" , "linux-kernel@vger.kernel.org" , "chuck.lever@oracle.com" , "linux-man@vger.kernel.org" , "linux-nfs@vger.kernel.org" , "tytso@mit.edu" , "viro@zeniv.linux.org.uk" , "jack@suse.cz" , "linux-ext4@vger.kernel.org" , "linux-btrfs@vger.kernel.org" , "linux-fsdevel@vger.kernel.org" , "adilger.kernel@dilger.ca" , "lczerner@redhat.com" , "ceph-devel@vger.kernel.org" Subject: Re: [man-pages RFC PATCH v4] statx, inode: document the new STATX_INO_VERSION field In-reply-to: <20220913011518.GE3600936@dread.disaster.area> References: <91e31d20d66d6f47fe12c80c34b1cffdfc202b6a.camel@hammerspace.com>, <166268467103.30452.1687952324107257676@noble.neil.brown.name>, <166268566751.30452.13562507405746100242@noble.neil.brown.name>, <29a6c2e78284e7947ddedf71e5cb9436c9330910.camel@hammerspace.com>, <8d638cb3c63b0d2da8679b5288d1622fdb387f83.camel@hammerspace.com>, <166270570118.30452.16939807179630112340@noble.neil.brown.name>, <33d058be862ccc0ccaf959f2841a7e506e51fd1f.camel@kernel.org>, <166285038617.30452.11636397081493278357@noble.neil.brown.name>, <2e34a7d4e1a3474d80ee0402ed3bc0f18792443a.camel@kernel.org>, <166302538820.30452.7783524836504548113@noble.neil.brown.name>, <20220913011518.GE3600936@dread.disaster.area> Date: Tue, 13 Sep 2022 11:41:47 +1000 Message-id: <166303330794.30452.2450184433123024585@noble.neil.brown.name> X-Spam-Status: No, score=-4.4 required=5.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_MED,SPF_HELO_NONE, SPF_PASS,T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-ext4@vger.kernel.org On Tue, 13 Sep 2022, Dave Chinner wrote: > On Tue, Sep 13, 2022 at 09:29:48AM +1000, NeilBrown wrote: > > On Mon, 12 Sep 2022, Jeff Layton wrote: > > > On Sun, 2022-09-11 at 08:53 +1000, NeilBrown wrote: > > > > This could be expensive. > > > > > > > > There is not currently any locking around O_DIRECT writes. You cannot > > > > synchronise with them. > > > > > > > > > > AFAICT, DIO write() implementations in btrfs, ext4, and xfs all hold > > > inode_lock_shared across the I/O. That was why patch #8 takes the > > > inode_lock (exclusive) across the getattr. > > > > Looking at ext4_dio_write_iter() it certain does take > > inode_lock_shared() before starting the write and in some cases it > > requests, using IOMAP_DIO_FORCE_WAIT, that imap_dio_rw() should wait for > > the write to complete. But not in all cases. > > So I don't think it always holds the shared lock across all direct IO. > > To serialise against dio writes, one must: > > // Lock the inode exclusively to block new DIO submissions > inode_lock(inode); > > // Wait for all in flight DIO reads and writes to complete > inode_dio_wait(inode); > > This is how truncate, fallocate, etc serialise against AIO+DIO which > do not hold the inode lock across the entire IO. These have to > serialise aginst DIO reads, too, because we can't have IO in > progress over a range of the file that we are about to free.... > > > > Taking inode_lock_shared is sufficient to block out buffered and DAX > > > writes. DIO writes sometimes only take the shared lock (e.g. when the > > > data is already properly aligned). If we want to ensure the getattr > > > doesn't run while _any_ writes are running, we'd need the exclusive > > > lock. > > > > But the exclusive lock is bad for scalability. > > Serilisation against DIO is -expensive- and -slow-. It's not a > solution for what is supposed to be a fast unlocked read-only > operation like statx(). > > > > Maybe that's overkill, though it seems like we could have a race like > > > this without taking inode_lock across the getattr: > > > > > > reader writer > > > ----------------------------------------------------------------- > > > i_version++ > > > getattr > > > read > > > DIO write to backing store > > > > > > > This is why I keep saying that the i_version increment must be after the > > write, not before it. > > Sure, but that ignores the reason why we actually need to bump > i_version *before* we submit a DIO write. DIO write invalidates the > page cache over the range of the write, so any racing read will > re-populate the page cache during the DIO write. and DIO reads can also get data at some intermediate state of the write. So what? i_version cannot provide coherent caching. You needs locks and call-backs and such for that. i_version (as used by NFS) only aims for approximate caching. Specifically: 1/ max-age caching. The i_version is polled when the age of the cache reaches some preset value, and the cache is purged/reloaded if needed, and the age reset to 0. 2/ close-to-open caching. There are well-defined events where the i_version must reflect preceding events by that the same client (close and unlock) and well defined events when the client will check the i_version before trusting any cache (open and lock). There is absolutely no need ever to update the i_version *before* making a change. If you really think there is, please provide a sequence of events for two different actors/observes where having the update before the change will provide a useful benefit. Thanks, NeilBrown > > Hence buffered reads can return before the DIO write has completed, > and the contents of the read can contain, none, some or all of the > contents of the DIO write. Hence i_version has to be incremented > before the DIO write is submitted so that racing getattrs will > indicate that the local caches have been invalidated and that data > needs to be refetched. > > But, yes, to actually be safe here, we *also* should be bumping > i_version on DIO write on DIO write completion so that racing > i_version and data reads that occur *after* the initial i_version > bump are invalidated immediately. > > IOWs, to avoid getattr/read races missing stale data invalidations > during DIO writes, we really need to bump i_version both _before and > after_ DIO write submission. > > It's corner cases like this where "i_version should only be bumped > when ctime changes" fails completely. i.e. there are concurrent IO > situations which can only really be handled correctly by bumping > i_version whenever either in-memory and/or on-disk persistent data/ > metadata state changes occur..... > > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com >