Received: by 2002:a05:7412:419a:b0:f3:1519:9f41 with SMTP id i26csp807632rdh; Thu, 23 Nov 2023 20:33:40 -0800 (PST) X-Google-Smtp-Source: AGHT+IEdkXUp6wxZyUZsLjM3eDqLsBo5mQw8U1kn2jyYjBq2v3aK+YeCe/KdECs35nqNeCT6t0kI X-Received: by 2002:a17:902:d50d:b0:1cc:70dd:62e7 with SMTP id b13-20020a170902d50d00b001cc70dd62e7mr1881804plg.32.1700800419685; Thu, 23 Nov 2023 20:33:39 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1700800419; cv=none; d=google.com; s=arc-20160816; b=XVeN1ADuLrRQTGdX42I9iE1afdNjaN1tweL7Md26l/YsGla46JnnHHRtaldjEuwrH2 I8+SLOJ/xgDpp1QTMAed+6xLccyiCSJsBmH2LT6JB6BrJLdlUgBJWffBcpRUlOpZrDtH AasDx+lhGxQfPKUtbaUOAA8kAul1J1Ge0sCmn0GITMb9+3SHcJLcnUzWtEtshZ4eSqwB oTfZ8y+SO+Ay8GRN24Ud/LQ/jO0Zhgd4S0cwoiDJtEqfxazbCW5lK7W+J1JzhCFqQnTp fLNDH/n3VELiPoymEQMxdlhXX4b3rqnoWffKevGNGEALD+P/kyXEyLOe9xKpeAvcVVLF RU1g== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:in-reply-to:content-disposition:mime-version :references:message-id:subject:cc:to:from:date; bh=eULuJrxaEcKt1s9QTY8Z4iEikDlYQ4UeH32DWQUcL4U=; fh=1UvAp7GUQSaqrdARWG8zoUWPhj4hWkACxxyVSa4vxWY=; b=m4v1vkZ3uB5T8jbI6y2y945nNfBtMwF1tNwn/z992lBWD9zkWvZUBY76dwna4nUF/3 T50UjFspP4pquKAl0rRHGb+CcUgFQgV4olGt42WjdFW5oQ/TRS73kTEEpiROSNSB5FJS qQzsNhV7xwObr1MvzRm1ONRFB3AvZkepKwkIWyRwHwUYRr491YxBR8iZkY/mq3S3ITJf ijL8dZ5vS6hy+U+XHVwL7CUAbUqiIXdDiSGvw2YGYnSuZ7+pK9Mw0zqia/VxJFWNPznR MYedSeBbWnLKYny6As90RvQrqmUZzU057LoXxc++VgEkbS/RPvl4XrlbSXtiKf0JXlyo 3I0A== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.35 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Return-Path: Received: from groat.vger.email (groat.vger.email. [23.128.96.35]) by mx.google.com with ESMTPS id v3-20020a170902d68300b001c61acd5be6si2564890ply.209.2023.11.23.20.33.38 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Thu, 23 Nov 2023 20:33:39 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.35 as permitted sender) client-ip=23.128.96.35; Authentication-Results: mx.google.com; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.35 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Received: from out1.vger.email (depot.vger.email [IPv6:2620:137:e000::3:0]) by groat.vger.email (Postfix) with ESMTP id D01158366263; Thu, 23 Nov 2023 20:33:34 -0800 (PST) X-Virus-Status: Clean X-Virus-Scanned: clamav-milter 0.103.11 at groat.vger.email Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S231255AbjKXEdO (ORCPT + 99 others); Thu, 23 Nov 2023 23:33:14 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:51212 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S231217AbjKXEdL (ORCPT ); Thu, 23 Nov 2023 23:33:11 -0500 Received: from smtp-out2.suse.de (smtp-out2.suse.de [IPv6:2a07:de40:b251:101:10:150:64:2]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id C48D010E4; Thu, 23 Nov 2023 20:33:12 -0800 (PST) Received: from imap2.dmz-prg2.suse.org (imap2.dmz-prg2.suse.org [IPv6:2a07:de40:b281:104:10:150:64:98]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256) (No client certificate requested) by smtp-out2.suse.de (Postfix) with ESMTPS id 201F31FCE0; Thu, 23 Nov 2023 14:43:12 +0000 (UTC) Received: from imap2.dmz-prg2.suse.org (localhost [127.0.0.1]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256) (No client certificate requested) by imap2.dmz-prg2.suse.org (Postfix) with ESMTPS id 4604613AB5; Thu, 23 Nov 2023 12:17:33 +0000 (UTC) Received: from dovecot-director2.suse.de ([10.150.64.162]) by imap2.dmz-prg2.suse.org with ESMTPSA id cZrJEN1CX2UCaAAAn2gu4w (envelope-from ); Thu, 23 Nov 2023 12:17:33 +0000 Received: by quack3.suse.cz (Postfix, from userid 1000) id 04CBFA07DC; Thu, 23 Nov 2023 10:50:44 +0100 (CET) Date: Thu, 23 Nov 2023 10:50:44 +0100 From: Jan Kara To: Al Viro Cc: linux-fsdevel@vger.kernel.org, Linus Torvalds , Mo Zou , Jan Kara , linux-kernel@vger.kernel.org Subject: Re: [PATCH 7/9] rename(): fix the locking of subdirectories Message-ID: <20231123095044.gtuuyhphgwbrxgni@quack3> References: <20231122193028.GE38156@ZenIV> <20231122193652.419091-1-viro@zeniv.linux.org.uk> <20231122193652.419091-7-viro@zeniv.linux.org.uk> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20231122193652.419091-7-viro@zeniv.linux.org.uk> X-Spam-Level: Authentication-Results: smtp-out2.suse.de; none X-Rspamd-Server: rspamd2 X-Spamd-Result: default: False [-4.00 / 50.00]; REPLY(-4.00)[] X-Spam-Score: -4.00 X-Rspamd-Queue-Id: 201F31FCE0 X-Spam-Status: No, score=-0.8 required=5.0 tests=HEADER_FROM_DIFFERENT_DOMAINS, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,T_SCC_BODY_TEXT_LINE autolearn=unavailable autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on groat.vger.email Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org X-Greylist: Sender passed SPF test, not delayed by milter-greylist-4.6.4 (groat.vger.email [0.0.0.0]); Thu, 23 Nov 2023 20:33:35 -0800 (PST) On Wed 22-11-23 19:36:50, Al Viro wrote: > We should never lock two subdirectories without having taken > ->s_vfs_rename_mutex; inode pointer order or not, the "order" proposed > in 28eceeda130f "fs: Lock moved directories" is not transitive, with > the usual consequences. > > The rationale for locking renamed subdirectory in all cases was > the possibility of race between rename modifying .. in a subdirectory to > reflect the new parent and another thread modifying the same subdirectory. > For a lot of filesystems that's not a problem, but for some it can lead > to trouble (e.g. the case when short directory contents is kept in the > inode, but creating a file in it might push it across the size limit > and copy its contents into separate data block(s)). > > However, we need that only in case when the parent does change - > otherwise ->rename() doesn't need to do anything with .. entry in the > first place. Some instances are lazy and do a tautological update anyway, > but it's really not hard to avoid. > > Amended locking rules for rename(): > find the parent(s) of source and target > if source and target have the same parent > lock the common parent > else > lock ->s_vfs_rename_mutex > lock both parents, in ancestor-first order; if neither > is an ancestor of another, lock the parent of source > first. > find the source and target. > if source and target have the same parent > if operation is an overwriting rename of a subdirectory > lock the target subdirectory > else > if source is a subdirectory > lock the source > if target is a subdirectory > lock the target > lock non-directories involved, in inode pointer order if both > source and target are such. > > That way we are guaranteed that parents are locked (for obvious reasons), > that any renamed non-directory is locked (nfsd relies upon that), > that any victim is locked (emptiness check needs that, among other things) > and subdirectory that changes parent is locked (needed to protect the update > of .. entries). We are also guaranteed that any operation locking more > than one directory either takes ->s_vfs_rename_mutex or locks a parent > followed by its child. > > Cc: stable@vger.kernel.org > Fixes: 28eceeda130f "fs: Lock moved directories" > Signed-off-by: Al Viro Looks good to me and thanks for fixing this! Feel free to add: Reviewed-by: Jan Kara Honza > --- > .../filesystems/directory-locking.rst | 29 ++++----- > Documentation/filesystems/locking.rst | 5 +- > Documentation/filesystems/porting.rst | 18 ++++++ > fs/namei.c | 60 ++++++++++++------- > 4 files changed, 74 insertions(+), 38 deletions(-) > > diff --git a/Documentation/filesystems/directory-locking.rst b/Documentation/filesystems/directory-locking.rst > index dccd61c7c5c3..193c22687851 100644 > --- a/Documentation/filesystems/directory-locking.rst > +++ b/Documentation/filesystems/directory-locking.rst > @@ -22,13 +22,16 @@ exclusive. > 3) object removal. Locking rules: caller locks parent, finds victim, > locks victim and calls the method. Locks are exclusive. > > -4) rename() that is _not_ cross-directory. Locking rules: caller locks the > -parent and finds source and target. We lock both (provided they exist). If we > -need to lock two inodes of different type (dir vs non-dir), we lock directory > -first. If we need to lock two inodes of the same type, lock them in inode > -pointer order. Then call the method. All locks are exclusive. > -NB: we might get away with locking the source (and target in exchange > -case) shared. > +4) rename() that is _not_ cross-directory. Locking rules: caller locks > +the parent and finds source and target. Then we decide which of the > +source and target need to be locked. Source needs to be locked if it's a > +non-directory; target - if it's a non-directory or about to be removed. > +Take the locks that need to be taken, in inode pointer order if need > +to take both (that can happen only when both source and target are > +non-directories - the source because it wouldn't be locked otherwise > +and the target because mixing directory and non-directory is allowed > +only with RENAME_EXCHANGE, and that won't be removing the target). > +After the locks had been taken, call the method. All locks are exclusive. > > 5) link creation. Locking rules: > > @@ -44,20 +47,17 @@ rules: > > * lock the filesystem > * lock parents in "ancestors first" order. If one is not ancestor of > - the other, lock them in inode pointer order. > + the other, lock the parent of source first. > * find source and target. > * if old parent is equal to or is a descendent of target > fail with -ENOTEMPTY > * if new parent is equal to or is a descendent of source > fail with -ELOOP > - * Lock both the source and the target provided they exist. If we > - need to lock two inodes of different type (dir vs non-dir), we lock > - the directory first. If we need to lock two inodes of the same type, > - lock them in inode pointer order. > + * Lock subdirectories involved (source before target). > + * Lock non-directories involved, in inode pointer order. > * call the method. > > -All ->i_rwsem are taken exclusive. Again, we might get away with locking > -the source (and target in exchange case) shared. > +All ->i_rwsem are taken exclusive. > > The rules above obviously guarantee that all directories that are going to be > read, modified or removed by method will be locked by caller. > @@ -67,6 +67,7 @@ If no directory is its own ancestor, the scheme above is deadlock-free. > > Proof: > > +[XXX: will be updated once we are done massaging the lock_rename()] > First of all, at any moment we have a linear ordering of the > objects - A < B iff (A is an ancestor of B) or (B is not an ancestor > of A and ptr(A) < ptr(B)). > diff --git a/Documentation/filesystems/locking.rst b/Documentation/filesystems/locking.rst > index 7be2900806c8..bd12f2f850ad 100644 > --- a/Documentation/filesystems/locking.rst > +++ b/Documentation/filesystems/locking.rst > @@ -101,7 +101,7 @@ symlink: exclusive > mkdir: exclusive > unlink: exclusive (both) > rmdir: exclusive (both)(see below) > -rename: exclusive (all) (see below) > +rename: exclusive (both parents, some children) (see below) > readlink: no > get_link: no > setattr: exclusive > @@ -123,6 +123,9 @@ get_offset_ctx no > Additionally, ->rmdir(), ->unlink() and ->rename() have ->i_rwsem > exclusive on victim. > cross-directory ->rename() has (per-superblock) ->s_vfs_rename_sem. > + ->unlink() and ->rename() have ->i_rwsem exclusive on all non-directories > + involved. > + ->rename() has ->i_rwsem exclusive on any subdirectory that changes parent. > > See Documentation/filesystems/directory-locking.rst for more detailed discussion > of the locking scheme for directory operations. > diff --git a/Documentation/filesystems/porting.rst b/Documentation/filesystems/porting.rst > index 878e72b2f8b7..9100969e7de6 100644 > --- a/Documentation/filesystems/porting.rst > +++ b/Documentation/filesystems/porting.rst > @@ -1061,3 +1061,21 @@ export_operations ->encode_fh() no longer has a default implementation to > encode FILEID_INO32_GEN* file handles. > Filesystems that used the default implementation may use the generic helper > generic_encode_ino32_fh() explicitly. > + > +--- > + > +**mandatory** > + > +If ->rename() update of .. on cross-directory move needs an exclusion with > +directory modifications, do *not* lock the subdirectory in question in your > +->rename() - it's done by the caller now [that item should've been added in > +28eceeda130f "fs: Lock moved directories"]. > + > +--- > + > +**mandatory** > + > +On same-directory ->rename() the (tautological) update of .. is not protected > +by any locks; just don't do it if the old parent is the same as the new one. > +We really can't lock two subdirectories in same-directory rename - not without > +deadlocks. > diff --git a/fs/namei.c b/fs/namei.c > index 71c13b2990b4..29bafbdb44ca 100644 > --- a/fs/namei.c > +++ b/fs/namei.c > @@ -3021,20 +3021,14 @@ static struct dentry *lock_two_directories(struct dentry *p1, struct dentry *p2) > p = d_ancestor(p2, p1); > if (p) { > inode_lock_nested(p2->d_inode, I_MUTEX_PARENT); > - inode_lock_nested(p1->d_inode, I_MUTEX_CHILD); > + inode_lock_nested(p1->d_inode, I_MUTEX_PARENT2); > return p; > } > > p = d_ancestor(p1, p2); > - if (p) { > - inode_lock_nested(p1->d_inode, I_MUTEX_PARENT); > - inode_lock_nested(p2->d_inode, I_MUTEX_CHILD); > - return p; > - } > - > - lock_two_inodes(p1->d_inode, p2->d_inode, > - I_MUTEX_PARENT, I_MUTEX_PARENT2); > - return NULL; > + inode_lock_nested(p1->d_inode, I_MUTEX_PARENT); > + inode_lock_nested(p2->d_inode, I_MUTEX_PARENT2); > + return p; > } > > /* > @@ -4716,11 +4710,12 @@ SYSCALL_DEFINE2(link, const char __user *, oldname, const char __user *, newname > * > * a) we can get into loop creation. > * b) race potential - two innocent renames can create a loop together. > - * That's where 4.4 screws up. Current fix: serialization on > + * That's where 4.4BSD screws up. Current fix: serialization on > * sb->s_vfs_rename_mutex. We might be more accurate, but that's another > * story. > - * c) we have to lock _four_ objects - parents and victim (if it exists), > - * and source. > + * c) we may have to lock up to _four_ objects - parents and victim (if it exists), > + * and source (if it's a non-directory or a subdirectory that moves to > + * different parent). > * And that - after we got ->i_mutex on parents (until then we don't know > * whether the target exists). Solution: try to be smart with locking > * order for inodes. We rely on the fact that tree topology may change > @@ -4752,6 +4747,7 @@ int vfs_rename(struct renamedata *rd) > bool new_is_dir = false; > unsigned max_links = new_dir->i_sb->s_max_links; > struct name_snapshot old_name; > + bool lock_old_subdir, lock_new_subdir; > > if (source == target) > return 0; > @@ -4805,15 +4801,32 @@ int vfs_rename(struct renamedata *rd) > take_dentry_name_snapshot(&old_name, old_dentry); > dget(new_dentry); > /* > - * Lock all moved children. Moved directories may need to change parent > - * pointer so they need the lock to prevent against concurrent > - * directory changes moving parent pointer. For regular files we've > - * historically always done this. The lockdep locking subclasses are > - * somewhat arbitrary but RENAME_EXCHANGE in particular can swap > - * regular files and directories so it's difficult to tell which > - * subclasses to use. > + * Lock children. > + * The source subdirectory needs to be locked on cross-directory > + * rename or cross-directory exchange since its parent changes. > + * The target subdirectory needs to be locked on cross-directory > + * exchange due to parent change and on any rename due to becoming > + * a victim. > + * Non-directories need locking in all cases (for NFS reasons); > + * they get locked after any subdirectories (in inode address order). > + * > + * NOTE: WE ONLY LOCK UNRELATED DIRECTORIES IN CROSS-DIRECTORY CASE. > + * NEVER, EVER DO THAT WITHOUT ->s_vfs_rename_mutex. > */ > - lock_two_inodes(source, target, I_MUTEX_NORMAL, I_MUTEX_NONDIR2); > + lock_old_subdir = new_dir != old_dir; > + lock_new_subdir = new_dir != old_dir || !(flags & RENAME_EXCHANGE); > + if (is_dir) { > + if (lock_old_subdir) > + inode_lock_nested(source, I_MUTEX_CHILD); > + if (target && (!new_is_dir || lock_new_subdir)) > + inode_lock(target); > + } else if (new_is_dir) { > + if (lock_new_subdir) > + inode_lock_nested(target, I_MUTEX_CHILD); > + inode_lock(source); > + } else { > + lock_two_nondirectories(source, target); > + } > > error = -EPERM; > if (IS_SWAPFILE(source) || (target && IS_SWAPFILE(target))) > @@ -4861,8 +4874,9 @@ int vfs_rename(struct renamedata *rd) > d_exchange(old_dentry, new_dentry); > } > out: > - inode_unlock(source); > - if (target) > + if (!is_dir || lock_old_subdir) > + inode_unlock(source); > + if (target && (!new_is_dir || lock_new_subdir)) > inode_unlock(target); > dput(new_dentry); > if (!error) { > -- > 2.39.2 > -- Jan Kara SUSE Labs, CR