From: NeilBrown
To: Chuck Lever
Date: Tue, 02 Feb 2021 10:27:25 +1100
Cc: Linux NFS Mailing List
Subject: Re: releasing result pages in svc_xprt_release()

On Mon, Feb 01 2021, Chuck Lever wrote:

>> On Jan 31, 2021, at 6:45 PM, NeilBrown wrote:
>>
>> On Fri, Jan 29 2021, Chuck Lever wrote:
>>>
>>> What's your opinion?
>>
>> To form a coherent opinion, I would need to know what that problem is.
>> I certainly accept that there could be performance problems in releasing
>> and re-allocating pages which might be resolved by batching, or by copying,
>> or by better tracking.  But without knowing what hot-spot you want to
>> cool down, I cannot think about how that fits into the big picture.
>> So: what exactly is the problem that you see?
>
> The problem is that each 1MB NFS READ, for example, hands 257 pages
> back to the page allocator, then allocates another 257 pages.  One page
> is not terrible, but 510+ allocator calls for every RPC begins to get
> costly.
>
> Also, remember that both allocating and freeing a page means an irqsave
> spin lock -- that will serialize all other page allocations, including
> allocation done by other nfsd threads.
>
> So I'd like to lower page allocator contention and the rate at which
> IRQs are disabled and enabled when the NFS server becomes busy, as it
> might with several 25 GbE NICs, for instance.
>
> Holding onto the same pages means we can make better use of TLB
> entries -- fewer TLB flushes is always a good thing.
>
> I know that the network folks at Red Hat have been staring hard at
> reducing memory allocation in the stack for several years.  I recall
> that Matthew Wilcox recently made similar improvements to the block
> layer.
>
> With the advent of 100GbE and Optane-like durable storage, the balance
> of memory allocation cost to device latency has shifted so that
> superfluous memory allocation is noticeable.
>
> At first I thought of creating a page allocator API that could grab
> or free an array of pages while taking the allocator locks once.  But
> now I wonder if there are opportunities to reduce the amount of page
> allocator traffic.

Thanks.  This helps me a lot.

I wonder if there is some low-hanging fruit here.

If I read the code correctly (which is not certain, but what I see does
seem to agree with vague memories of how it all works), we currently do
a lot of wasted alloc/frees for zero-copy reads.

We allocate lots of pages and store the pointers in ->rq_respages
(i.e. ->rq_pages).  Then nfsd_splice_actor frees many of those pages
and replaces the pointers with pointers to page-cache pages.  Then we
release those page-cache pages.  We need to have allocated them, but we
don't need to free them.

We can add some new array for storing them, have nfsd_splice_actor
move them to that array, and have svc_alloc_arg() move pages back from
the store rather than re-allocating them.

Or maybe something even more sophisticated where we only move them out
of the store when we actually need them.  Having the RDMA layer return
pages when they are finished with might help.  You might even be able
to use atomics (cmpxchg) to handle the contention.  But I'm not
convinced it would be worth it.

I *really* like your idea of a batch-API for page-alloc and page-free.
This would likely be useful for other users, and it would be worth
writing more code to get peak performance - things such as per-cpu
queues of returned pages and so forth (which presumably already exist).

I cannot be sure that the batch-API would be better than a focused API
just for RDMA -> NFSD.  But my guess is that it would be at least
nearly as good, and would likely get a lot more eyes on the code.
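
To make that "new array" idea a bit more concrete, here is a rough,
uncompiled sketch.  None of these names exist today - svc_page_store,
svc_store_page() and svc_fetch_page() are invented for illustration -
and the real thing would presumably hang off struct svc_rqst next to
->rq_pages:

        /*
         * Uncompiled sketch only.  svc_page_store, svc_store_page() and
         * svc_fetch_page() are invented names; nothing like this is in
         * the tree yet.
         */
        #include <linux/gfp.h>
        #include <linux/mm.h>

        #define SVC_PAGE_STORE_MAX 260  /* roughly one RPC's worth of pages */

        struct svc_page_store {
                unsigned int    count;
                struct page     *pages[SVC_PAGE_STORE_MAX];
        };

        /*
         * nfsd_splice_actor would call this instead of freeing the
         * pre-allocated response page it is about to replace with a
         * page-cache page.
         */
        static void svc_store_page(struct svc_page_store *store,
                                   struct page *page)
        {
                if (store->count < SVC_PAGE_STORE_MAX)
                        store->pages[store->count++] = page;
                else
                        put_page(page);         /* store full - just free it */
        }

        /*
         * svc_alloc_arg() would call this instead of alloc_page(), so a
         * page parked by the splice path never goes back to the
         * allocator at all.
         */
        static struct page *svc_fetch_page(struct svc_page_store *store)
        {
                if (store->count)
                        return store->pages[--store->count];
                return alloc_page(GFP_KERNEL);
        }

Whether the store should be per-rqstp (no locking at all) or shared
between nfsd threads (more reuse, but then you are back to the cmpxchg
games mentioned above) is probably the interesting question.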
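
And for the batch API, I imagine something with roughly this shape -
again the names are invented, and the trivial loops below obviously buy
nothing by themselves; the point of a real implementation would be to
take the allocator locks (and do the irqsave) once per batch rather
than once per page:

        /* Sketch only - no such interface exists in the tree today. */
        #include <linux/gfp.h>
        #include <linux/mm.h>

        /* Returns the number of pages actually allocated (may be < nr). */
        static unsigned int alloc_pages_batch(gfp_t gfp, unsigned int nr,
                                              struct page **pages)
        {
                unsigned int i;

                for (i = 0; i < nr; i++) {
                        pages[i] = alloc_page(gfp);
                        if (!pages[i])
                                break;
                }
                return i;
        }

        static void free_pages_batch(unsigned int nr, struct page **pages)
        {
                unsigned int i;

                for (i = 0; i < nr; i++)
                        put_page(pages[i]);
        }

svc_alloc_arg() could then top up ->rq_pages with a single call, and
svc_xprt_release() could hand the whole response array back with
another.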
NeilBrown