Date: Fri, 22 Jul 2022 09:32:37 -0400
From: Josef Bacik
To: Anna Schumaker
Cc: Dave Chinner, Chuck Lever III, Linux NFS Mailing List, linux-fsdevel
Subject: Re: [PATCH v3 6/6] NFSD: Repeal and replace the READ_PLUS implementation
References: <20220715184433.838521-1-anna@kernel.org>
 <20220715184433.838521-7-anna@kernel.org>
 <20220718011552.GK3600936@dread.disaster.area>

On Fri, Jul 22, 2022 at 08:45:13AM -0400, Anna Schumaker wrote:
> On Thu, Jul 21, 2022 at 4:48 PM Josef Bacik wrote:
> >
> > On Tue, Jul 19, 2022 at 04:46:50PM -0400, Anna Schumaker wrote:
> > > On Sun, Jul 17, 2022 at 9:16 PM Dave Chinner wrote:
> > > >
> > > > On Fri, Jul 15, 2022 at 07:08:13PM +0000, Chuck Lever III wrote:
> > > > > > On Jul 15, 2022, at 2:44 PM, Anna Schumaker wrote:
> > > > > >
> > > > > > From: Anna Schumaker
> > > > > >
> > > > > > Rather than relying on the underlying filesystem to tell us where hole
> > > > > > and data segments are through vfs_llseek(), let's instead do the hole
> > > > > > compression ourselves. This has a few advantages over the old
> > > > > > implementation:
> > > > > >
> > > > > > 1) A single call to the underlying filesystem through nfsd_readv() means
> > > > > > the file can't change from underneath us in the middle of encoding.
> > > >
> > > > Hi Anna,
> > > >
> > > > I'm assuming you mean the vfs_llseek(SEEK_HOLE) call at the start
> > > > of nfsd4_encode_read_plus_data() that is used to trim the data that
> > > > has already been read out of the file?
> > >
> > > There is also the vfs_llseek(SEEK_DATA) call at the start of
> > > nfsd4_encode_read_plus_hole(). They are used to determine the length
> > > of the current hole or data segment.
> > >
> > > > What's the problem with racing with a hole punch here? All it does
> > > > is shorten the read data returned to match the new hole, so all it's
> > > > doing is making the returned data "more correct".
> > >
> > > The problem is we call vfs_llseek() potentially many times when
> > > encoding a single reply to READ_PLUS. nfsd4_encode_read_plus() has a
> > > loop where we alternate between hole and data segments until we've
> > > encoded the requested number of bytes. My attempts at locking the file
> > > have resulted in a deadlock since vfs_llseek() also locks the file, so
> > > the file could change from underneath us during each iteration of the
> > > loop.
> > >
> > > > OTOH, if something allocates over a hole that the read filled with
> > > > zeros, what's the problem with occasionally returning zeros as data?
> > > > Regardless, if this has raced with a write to the file that filled
> > > > that hole, we're already returning stale data/hole information to
> > > > the client regardless of whether we trim it or not....
> > > >
> > > > i.e. I can't see a correctness or data integrity problem here that
> > > > doesn't already exist, and I have my doubts that hole
> > > > punching/filling racing with reads happens often enough to create a
> > > > performance or bandwidth problem OTW. Hence I've really got no idea
> > > > what the problem that needs to be solved here is.
> > > >
> > > > Can you explain what symptoms a user would see of the problem that
> > > > this change solves?
> > >
> > > This fixes xfstests generic/091 and generic/263, along with this
> > > reported bug: https://bugzilla.kernel.org/show_bug.cgi?id=215673
> > >
> > > > > > 2) A single call to the underlying filesystem also means that the
> > > > > > underlying filesystem only needs to synchronize cached and on-disk
> > > > > > data one time instead of potentially many, speeding up the reply.
> > > >
> > > > SEEK_HOLE/DATA doesn't require cached data to be sync'd to disk to
> > > > be coherent - that's only a problem FIEMAP has (and syncing cached
> > > > data doesn't fix the TOCTOU coherency issue!). i.e. SEEK_HOLE/DATA
> > > > will check the page cache for data if appropriate (e.g. unwritten
> > > > disk extents may have data in memory over the top of them) instead
> > > > of syncing data to disk.
> > >
> > > For some reason, btrfs on virtual hardware has terrible performance
> > > numbers when using vfs_llseek() with files that are already in the
> > > server's cache. I think it had something to do with how they lock
> > > extents, and some extra work that needs to be done if the file already
> > > exists in the server's memory, but it's been a few years since I've
> > > gone into their code to figure out where the slowdown is coming from.
> > > See this section of my performance results wiki page:
> > > https://wiki.linux-nfs.org/wiki/index.php/Read_Plus_May_2022#BTRFS_3
> >
> > I just did this locally, once in my test VMs and once on my continuous
> > performance testing hardware, and I'm not able to reproduce the numbers,
> > so I think I'm doing something wrong.
> >
> > My test is stupid, I just dd a 5GiB file, come behind it and punch holes
> > every other 4k. Then I umount and remount, SEEK_DATA+SEEK_HOLE through
> > the whole file, and then do it again so I have uncached and cached runs.
> > The numbers I'm seeing are equivalent to ext4/xfs. Granted, on my VM I
> > had to redo the test because I had lockdep and other debugging on, which
> > makes us look stupid because of the extent locking stuff, but with it
> > off everything appears normal.
> >
> > Does this more or less mirror your testing? Looking at our code we're
> > not doing anything inherently stupid, so I'm not entirely sure what
> > could be the problem. Thanks,
>
> I think that's pretty close to what the server is doing with the
> current code, except the NFS server would also do a read for every
> data segment it found. I've been using `vmtouch` to make sure the file
> doesn't get evicted from the server's page cache for my cached data.

I messed with it some more and I can't get it to be slow. If you trip over
something like this again, just give a shout and I'd be happy to dig in.
Thanks,

Josef
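
For anyone wanting to reproduce this: the test described above boils down
to the same alternating SEEK_DATA/SEEK_HOLE walk that
nfsd4_encode_read_plus() does on the server. Below is a minimal,
self-contained sketch reconstructed from the description in this thread -
it is not the actual test script or nfsd code, and the path and 256MiB
size are placeholder assumptions (the runs above used a 5GiB file):

/*
 * Sketch: fill a file, punch a hole in every other 4k block, then walk
 * the data/hole segments with lseek(SEEK_DATA)/lseek(SEEK_HOLE), the
 * same pattern the READ_PLUS encoder loop relies on.
 *
 * Linux-only (fallocate, SEEK_DATA/SEEK_HOLE). Path and sizes are
 * illustrative assumptions, not values taken from the thread.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

#define FILE_SIZE (256ULL * 1024 * 1024)	/* thread used 5GiB */
#define BLOCK     4096ULL

int main(void)
{
	const char *path = "testfile";		/* placeholder path */
	static char buf[1 << 20];
	struct timespec t0, t1;
	unsigned long long off, segments = 0;
	off_t pos, data, hole;
	int fd;

	fd = open(path, O_CREAT | O_TRUNC | O_RDWR, 0644);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* The "dd" step: fill the file with non-zero data. */
	memset(buf, 0xab, sizeof(buf));
	for (off = 0; off < FILE_SIZE; off += sizeof(buf)) {
		if (write(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf)) {
			perror("write");
			return 1;
		}
	}

	/* Punch a hole in every other 4k block. */
	for (off = 0; off < FILE_SIZE; off += 2 * BLOCK) {
		if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
			      off, BLOCK) < 0) {
			perror("fallocate");	/* fs may lack hole punch */
			return 1;
		}
	}

	/*
	 * Walk the segments, alternating SEEK_DATA/SEEK_HOLE the way
	 * nfsd4_encode_read_plus() alternates data and hole encoding.
	 */
	clock_gettime(CLOCK_MONOTONIC, &t0);
	pos = 0;
	for (;;) {
		data = lseek(fd, pos, SEEK_DATA);
		if (data < 0)		/* ENXIO: no more data */
			break;
		hole = lseek(fd, data, SEEK_HOLE);
		if (hole < 0)
			break;
		segments++;
		pos = hole;
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);

	printf("%llu data segments in %.3f ms\n", segments,
	       (t1.tv_sec - t0.tv_sec) * 1e3 +
	       (t1.tv_nsec - t0.tv_nsec) / 1e6);
	close(fd);
	return 0;
}

Doing the walk once right after a fresh mount and then a second time gives
the uncached vs. cached comparison; per the numbers Anna linked, the btrfs
slowdown only showed up with the file already in the server's page cache.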