Date: Tue, 16 Mar 2021 06:26:58 -0400
From: Johannes Weiner
To: Arjun Roy
Cc: akpm@linux-foundation.org, davem@davemloft.net, netdev@vger.kernel.org,
    linux-kernel@vger.kernel.org, cgroups@vger.kernel.org, linux-mm@kvack.org,
    arjunroy@google.com, shakeelb@google.com, edumazet@google.com,
    soheil@google.com, kuba@kernel.org, mhocko@kernel.org,
    shy828301@gmail.com, guro@fb.com
Subject: Re: [mm, net-next v2] mm: net: memcg accounting for TCP rx zerocopy
References: <20210316041645.144249-1-arjunroy.kdev@gmail.com>
In-Reply-To: <20210316041645.144249-1-arjunroy.kdev@gmail.com>

Hello,

On Mon, Mar 15, 2021 at 09:16:45PM -0700, Arjun Roy wrote:
> From: Arjun Roy
>
> TCP zerocopy receive is used by high performance network applications
> to further scale.
> For RX zerocopy, the memory containing the network data filled by the
> network driver is directly mapped into the address space of high
> performance applications. To keep the TLB cost low, these applications
> unmap the network memory in big batches. So, this memory can remain
> mapped for a long time. This can cause a memory isolation issue, as
> this memory becomes unaccounted after getting mapped into the
> application address space. This patch adds the memcg accounting for
> such memory.
>
> Accounting the network memory comes with its own unique challenges.
> The high performance NIC drivers use page pooling to reuse the pages
> to eliminate/reduce expensive setup steps like IOMMU. These drivers
> keep an extra reference on the pages, and thus we can not depend on
> the page reference for the uncharging. The page in the pool may keep a
> memcg pinned for an arbitrarily long time, or may get used by another
> memcg.

The page pool knows when a page is unmapped again and becomes available
for recycling, right? Essentially the 'free' phase of that private
allocator. That's where the uncharge should be done.

For one, it's more aligned with the usual memcg charge lifetime rules.
But it also avoids adding what is essentially a private driver callback
to the generic file unmapping path. Finally, it will eliminate the need
for making up a new charge type (MEMCG_DATA_SOCK) and allow using the
standard kmem charging API.

> This patch decouples the uncharging of the page from the refcnt and
> associates it with the map count, i.e. the page gets uncharged when
> the last address space unmaps it. Now the question is, what if the
> driver drops its reference while the page is still mapped? That is
> fine, as the address space also holds a reference to the page, i.e.
> the reference count can not drop to zero before the map count.
>
> Signed-off-by: Arjun Roy
> Co-developed-by: Shakeel Butt
> Signed-off-by: Shakeel Butt
> Signed-off-by: Eric Dumazet
> Signed-off-by: Soheil Hassas Yeganeh
> ---
>
> Changelog since v1:
> - Pages accounted for in this manner are now tracked via MEMCG_SOCK.
> - v1 allowed for a brief period of double-charging; now we have a
>   brief period of under-charging to avoid undue memory pressure.

I'm afraid we'll have to go back to v1. Let's address the issues raised
with it:

1. The NR_FILE_MAPPED accounting.

It is longstanding Linux behavior that driver pages mapped into
userspace are accounted as file pages, because userspace is actually
doing mmap() against a driver file/fd (as opposed to an anon mmap).
That is how they show up in vmstat, in meminfo, and in the per-process
stats. There is no reason to make memcg deviate from this. If we don't
like it, it should be taken on by changing vm_insert_page() - not by
tricking rmap into thinking these aren't memcg pages and then fixing it
up with additional special-cased accounting callbacks.

v1 did this right: it charged the pages the way we handle all other
userspace pages - before rmap - and then let the generic VM code do the
accounting for us with the cgroup-aware vmstat infrastructure.

2. The double charging.

Could you elaborate how much we're talking about in any given batch? Is
this a problem worth worrying about?

The way I see it, any conflict here is caused by the pages being
counted in the SOCK counter already, but not actually *tracked* on a
per-page basis. If it's worth addressing, we should look into fixing
the root cause over there first if possible, before trying to work
around it here.

The newly-added GFP_NOFAIL is especially worrisome. The pages should be
charged before we make promises to userspace, not force-charged when
it's too late.

We have sk context when charging the inserted pages. Can we uncharge
MEMCG_SOCK after each batch of inserts?
That's only 32 pages' worth of overcharging, so not more than the
regular charge batch memcg is using.

An even better way would be to do charge stealing, where we reuse the
existing MEMCG_SOCK charges and don't have to get any new ones at all -
just set up page->memcg and remove the charge from the sk.

But yeah, it depends a bit on whether this is a practical concern.

Thanks,
Johannes