From: Frank van der Linden
Date: Mon, 17 Apr 2023 16:07:20 -0700
Subject: Re: [PATCH] nfsd: don't use GFP_KERNEL from nfsd_getxattr()/nfsd_listxattr()
To: Dave Chinner
Cc: Jeff Layton, Tetsuo Handa, Chuck Lever III, Linux NFS Mailing List, linux-fsdevel
On Mon, Apr 17, 2023 at 3:25 PM Frank van der Linden wrote:
>
> On Sun, Apr 16, 2023 at 4:38 PM Dave Chinner wrote:
> >
> > On Sun, Apr 16, 2023 at 07:51:41AM -0400, Jeff Layton wrote:
> > > On Sun, 2023-04-16 at 08:21 +0900, Tetsuo Handa wrote:
> > > > On 2023/04/16 3:40, Jeff Layton wrote:
> > > > > On Sun, 2023-04-16 at 02:11 +0900, Tetsuo Handa wrote:
> > > > > > On 2023/04/16 1:13, Chuck Lever III wrote:
> > > > > > > > On Apr 15, 2023, at 7:07 AM, Tetsuo Handa wrote:
> > > > > > > >
> > > > > > > > Since GFP_KERNEL is GFP_NOFS | __GFP_FS, usage like GFP_KERNEL | GFP_NOFS
> > > > > > > > does not make sense. Drop the __GFP_FS flag in order to avoid deadlock.
> > > > > > >
> > > > > > > The server side threads run in process context. GFP_KERNEL
> > > > > > > is safe to use here -- as Jeff said, this code is not in
> > > > > > > the server's reclaim path. Plenty of other call sites in
> > > > > > > the NFS server code use GFP_KERNEL.
> > > > > >
> > > > > > GFP_KERNEL memory allocation calls the filesystem's shrinker functions
> > > > > > because of the __GFP_FS flag. My understanding is
> > > > > >
> > > > > >   Whether this code is in the memory reclaim path or not is irrelevant.
> > > > > >   Whether the memory reclaim path might hold a lock or not is relevant.
> > > > > >
> > > > > > Therefore, the question is: does nfsd hold i_rwsem during the memory reclaim path?
> > > > >
> > > > > No. At the time of these allocations, the i_rwsem is not held.
> > > >
> > > > Excuse me? nfsd_getxattr()/nfsd_listxattr() _are_ holding i_rwsem
> > > > via inode_lock_shared(inode) before the kvmalloc(GFP_KERNEL | GFP_NOFS) allocation.
> > > > That's why
> > > >
> > > >   /*
> > > >    * We're holding i_rwsem - use GFP_NOFS.
> > > >    */
> > > >
> > > > is explicitly there on the nfsd_listxattr() side.
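For reference, the shape Tetsuo is describing looks roughly like this (a
simplified sketch of the nfsd_getxattr() structure, not the actual
fs/nfsd/vfs.c source; the vfs_getxattr() arguments are elided since they
differ between kernel versions):

    /*
     * In include/linux/gfp.h:
     *   GFP_NOFS   == (__GFP_RECLAIM | __GFP_IO)
     *   GFP_KERNEL == (__GFP_RECLAIM | __GFP_IO | __GFP_FS)
     * so GFP_KERNEL | GFP_NOFS is just GFP_KERNEL: the __GFP_FS bit,
     * which lets reclaim call back into filesystem shrinkers, stays set.
     */
    inode_lock_shared(inode);                    /* takes i_rwsem shared   */
    len = vfs_getxattr(..., NULL, 0);            /* probe the value length */
    buf = kvmalloc(len, GFP_KERNEL | GFP_NOFS);  /* == GFP_KERNEL          */
    err = vfs_getxattr(..., buf, len);           /* fetch the value        */
    inode_unlock_shared(inode);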
> >
> > You can do GFP_KERNEL allocations holding the i_rwsem just fine.
> > All that it requires is that the caller holds a reference to the inode,
> > and at that point inode reclaim should skip the given inode without
> > ever locking it.
> >
> > Of course, lockdep can't handle the "referenced inode lock ->
> > fsreclaim -> unreferenced inode lock" pattern at all. It throws out
> > false positives when it detects this because it's not aware of the
> > fact that reference counts prevent inode lock recursion based
> > deadlocks in the vfs inode cache shrinker.
> >
> > If a custom, non-vfs shrinker is walking inodes that have no
> > references and taking i_rwsem in a way that can block, without first
> > checking whether it is safe to lock the inode in a deadlock-free
> > manner, it is doing the wrong thing and the custom shrinker needs
> > to be fixed.
> >
> > > > If the memory reclaim path (directly or indirectly via a locking dependency)
> > > > involves inode_lock_shared(inode)/inode_lock(inode), it is not safe to use
> > > > the __GFP_FS flag.
> > >
> > > (cc'ing Frank V. who wrote this code and -fsdevel)
> > >
> > > I stand corrected! You're absolutely right that it's taking the i_rwsem
> > > for read there. That seems pretty weird, actually. I don't believe we
> > > need to hold the inode_lock to call vfs_getxattr or vfs_listxattr, and
> > > certainly nothing else under there requires it.
> > >
> > > Frank, was there some reason you decided you needed the inode_lock
> > > there? It looks like, under the hood, the xattr code requires you to take
> > > it for write in setxattr and removexattr, but you don't need it at all
> > > in getxattr or listxattr. Go figure.
> >
> > IIRC, the filesystem can't take the i_rwsem for get/listxattr
> > because the lookup contexts may already hold the i_rwsem. I think
> > this is largely a problem caused by LSMs (e.g. IMA) needing to
> > access security xattrs in paths where the inode is already
> > locked.
> >
> > > Longer term, I wonder what the inode_lock is protecting in setxattr and
> > > removexattr operations, given that get and list don't require it?
> > > These are always delegated to the filesystem driver -- there is no
> > > generic xattr implementation.
> >
> > Serialising updates against each other. xattr modifications are
> > supposed to be "atomic" from the perspective of the user - they see
> > either the entire old or the entire new xattr, never a partial value.
> >
> > FWIW, XFS uses its internal metadata rwsem for access/update
> > serialisation of xattrs because we can't rely on the i_rwsem for
> > reliable serialisation. I'm guessing that most journalling
> > filesystems do something similar.
> >
> > Cheers,
> >
> > Dave.
> > --
> > Dave Chinner
> > david@fromorbit.com
>
> Sorry about the slow response - thanks, Chuck, for adding my
> correct/updated email address.
>
> As Dave notes, you need the inode_lock to get an atomic view of an xattr.
> Since both nfsd_getxattr and nfsd_listxattr do the standard trick of
> querying the xattr length with a NULL buf argument (just getting the
> length back), allocating a buffer of the right size, and then querying
> again, they need to hold the inode lock to avoid having the xattr
> changed out from under them while doing that.
>
> From that then flows the requirement that __GFP_FS could cause problems
> while holding i_rwsem, so I added GFP_NOFS.
>
> If any of these requirements have changed since then, obviously this
> should be fixed. But I don't think they have, right?
>
> - Frank

Forgot to mention - I know there is vfs_getxattr_alloc, but that doesn't
hold any lock across the query-alloc-requery sequence, so I didn't use it.
The locking in fs/xattr.c does seem to be inconsistent in this regard, as
was mentioned in this thread.

Arguably, the chances of an xattr being changed while you're querying it
are very low (and it doesn't seem to have caused any problems all this
time), but being correct doesn't cost much here.

- Frank
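A closing note on the unlocked alternative Frank mentions: without i_rwsem
held across the probe and the read, the value can change in between, so an
unlocked query-alloc-requery has to retry when the second call reports the
buffer has become too small. A minimal sketch of that retry pattern (the
helper names are hypothetical, allocation-failure handling is elided, and
this is not the actual vfs_getxattr_alloc code):

    for (;;) {
            len = xattr_probe_len(inode, name);       /* hypothetical: size query      */
            buf = kvmalloc(len, GFP_KERNEL);          /* no i_rwsem held, __GFP_FS ok  */
            err = xattr_read(inode, name, buf, len);  /* hypothetical: fill the buffer */
            if (err != -ERANGE)                       /* done, or a real error         */
                    break;
            kvfree(buf);                              /* value grew meanwhile; retry   */
    }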