Received: by 2002:a05:6a10:413:0:0:0:0 with SMTP id 19csp2428212pxp; Mon, 21 Mar 2022 20:08:02 -0700 (PDT) X-Google-Smtp-Source: ABdhPJw/efN4GHoze0H8G+DIlw7weEvIEJV4ar8xWC3N24cW4a8q5+re/okVOq/GNVyw76E7gFMI X-Received: by 2002:a17:902:8644:b0:153:9f01:2090 with SMTP id y4-20020a170902864400b001539f012090mr15651713plt.101.1647918482533; Mon, 21 Mar 2022 20:08:02 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1647918482; cv=none; d=google.com; s=arc-20160816; b=PZsKA/UbRV0wfxld4vLX0ZWN0GknBvA5PimJxm/ZrRCYaso0xviDPnpMFLD4cO4rgF vj9SXiugYdFfVRnrS5Jmo8sav9LhErDUbnN/DvGoYRPjZBjoIcsIOVFK5aHX3kwVThtB IyTgpbZv/+9HF0gKdOwZZ1kQbRonoJpbQbVmk95Fwgel4OPjA/+F1MaYTSXdtW9c/S7j CNaHVqtNXdiQaqSV8C5QmpOMfqJHqxoaLZTHOy0woYvJcjCinHh14yVKxjeleEHAv8WH uPsY6G1gH3cY9uxs5hTf9W739knhGwDeksn7Rtb30itgy3WKdYi2xmiW2YzCkZVU5WPQ MzZQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:in-reply-to:content-disposition :mime-version:references:message-id:subject:cc:to:from:date; bh=imPtpQHF+XnuX31KUbJjEptuiXQNro/SG94J8PrMCz4=; b=TLedD7wWzdA/SxJhQ9sqo2F4NNo6eNgas2ub5P1lx0zNOwWvW2eZoUc+HQCZb9zU0x EUPG8pfud87/bxri5mXuAGhLWYgVY9zfyzbzobKMsvJYdHbqBifKHH1y5EaMSUScm2AD JsgF2eN2DBlcOAyFrJAFtw0wqwsPsdeTHi2hqjfzFcrUumr6ddRPhrKQCJVbRDmQyTAn p4X95fs1ergXu4aDezn02fzehww1TJRQ/TZf9qV7d06jpwaZ7qRvy3/OXUoE5KoW+Fx4 i47mEShK+fQusKMBzFHOkx6hkzgwGQBR0dA8JQprB+XYhawlfqWW6pV/lsEJlyWejQGt m2BQ== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Return-Path: Received: from lindbergh.monkeyblade.net (lindbergh.monkeyblade.net. [2620:137:e000::1:18]) by mx.google.com with ESMTPS id s2-20020a056a00194200b004fa3a8e002csi9983645pfk.227.2022.03.21.20.08.01 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 21 Mar 2022 20:08:02 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:18 as permitted sender) client-ip=2620:137:e000::1:18; Authentication-Results: mx.google.com; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by lindbergh.monkeyblade.net (Postfix) with ESMTP id E31A0201A9; Mon, 21 Mar 2022 19:41:05 -0700 (PDT) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S235546AbiCVCm0 (ORCPT + 99 others); Mon, 21 Mar 2022 22:42:26 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:60730 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S235473AbiCVCmZ (ORCPT ); Mon, 21 Mar 2022 22:42:25 -0400 Received: from zeniv-ca.linux.org.uk (zeniv-ca.linux.org.uk [IPv6:2607:5300:60:148a::1]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 11BE413F53 for ; Mon, 21 Mar 2022 19:40:58 -0700 (PDT) Received: from viro by zeniv-ca.linux.org.uk with local (Exim 4.94.2 #2 (Red Hat Linux)) id 1nWUS6-00FB9i-7g; Tue, 22 Mar 2022 02:40:54 +0000 Date: Tue, 22 Mar 2022 02:40:54 +0000 From: Al Viro To: Tejun Heo Cc: Imran Khan , gregkh@linuxfoundation.org, akpm@linux-foundation.org, linux-kernel@vger.kernel.org Subject: Re: [RESEND PATCH v7 7/8] kernfs: Replace per-fs rwsem with hashed rwsems. Message-ID: References: <20220317072612.163143-1-imran.f.khan@oracle.com> <20220317072612.163143-8-imran.f.khan@oracle.com> <536f2392-45d2-2f43-5e9d-01ef50e33126@oracle.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Sender: Al Viro X-Spam-Status: No, score=-1.9 required=5.0 tests=BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI,RDNS_NONE, SPF_HELO_NONE,T_SCC_BODY_TEXT_LINE autolearn=no autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, Mar 21, 2022 at 09:20:06AM -1000, Tejun Heo wrote: > Hello, > > On Mon, Mar 21, 2022 at 05:55:53PM +0000, Al Viro wrote: > > Why bother with rwsem, when we don't need anything blocking under it? > > DEFINE_RWLOCK instead of DEFINE_SPINLOCK and don't make it static. > > Oh I mean, in case the common readers get way too hot, percpu_rwsem is a > relatively easy way to shift the burder from the readers to the writers. I > doubt we'll need that. > > > kernfs_walk_ns() - this is fucking insane; on the surface, it needs to > > be exclusive due to the use of the same static buffer. It uses that > > buffer to generate a pathname, *THEN* walks over it with strsep(). > > That's an... interesting approach, for the lack of other printable > > terms - we walk the chain of ancestors, concatenating their names > > into a buffer and separating those names with slashes, then we walk > > that buffer, searching for slashes... WTF? > > It takes the @parent to walk string @path from. Where does it generate the > pathname? Sorry, misread that thing - the reason it copies the damn thing at all is the use of strsep(). Yecch... Rule of the thumb regarding strsep() use, be it in kernel or in the userland: don't. It's almost never the right primitive to use. Lookups should use qstr; it has both the length and place for hash. Switch kernfs_find_ns() to that (and lift the calculation of length into the callers that do not have it - note that kernfs_iop_lookup() does) and you don't need the strsep() shite (or copying) anymore. That would allow for kernfs_walk_ns() to take kernfs_rename_lock shared. HOWEVER, that's not the only lock needed there and this patchset is broken in that respect - it locks the starting node, then walks the path. Complete with lookups in rbtrees of children in the descendents of that node and those are *not* locked. > > Wait a sec; what happens if e.g. kernfs_path_from_node() races with > > __kernfs_remove()? We do _not_ clear ->parent, but we do drop references > > that used to pin what it used to point to, unless I'm misreading that > > code... Or is it somehow prevented by drain-related logics? Seeing > > that it seems to be possible to have kernfs_path_from_node() called from > > an interrupt context, that could be delicate... > > kernfs_remove() is akin to freeing of the node and all its descendants. The > caller shouldn't be racing that against any other operations in the subtree. That's interesting... My impression had been that some of these functions could be called from interrupt contexts (judging by the spin_lock_irqsave() in there). What kind of async contexts those are, and what do you use to make sure they don't leak into overlap with kernfs_remove()? In particular, if you ever use those from RCU callbacks, you need to make sure that you have a grace period somewhere; the wait_event() you have in kernfs_drain() won't do it. I've tried to track the callchains that could lead there, but it gets hairy fast.