References: <20210316041645.144249-1-arjunroy.kdev@gmail.com>
From: Arjun Roy
Date: Tue, 23 Mar 2021 11:42:05 -0700
Subject: Re: [mm, net-next v2] mm: net: memcg accounting for TCP rx zerocopy
To: Johannes Weiner
Cc: Arjun Roy, Andrew Morton, David Miller, netdev,
    Linux Kernel Mailing List, Cgroups, Linux MM, Shakeel Butt,
    Eric Dumazet, Soheil Hassas Yeganeh, Jakub Kicinski, Michal Hocko,
    Yang Shi, Roman Gushchin
X-Mailing-List: linux-kernel@vger.kernel.org

On Tue, Mar 23, 2021 at 10:01 AM Johannes Weiner wrote:
>
> On Mon, Mar 22, 2021 at 02:35:11PM -0700, Arjun Roy wrote:
> > To make sure we're on the same page, then, here's a tentative
> > mechanism - I'd rather get buy in before spending too much time on
> > something that wouldn't pass muster afterwards.
> >
> > A) An opt-in mechanism, that a driver needs to explicitly support, in
> > order to get properly accounted receive zerocopy.
>
> Yep, opt-in makes sense. That allows piece-by-piece conversion and
> avoids us having to have a flag day.
>
> > B) Failure to opt-in (e.g. unchanged old driver) can either lead to
> > unaccounted zerocopy (ie. existing behaviour) or, optionally,
> > effectively disabled zerocopy (ie. any call to zerocopy will return
> > something like EINVAL) (perhaps controlled via some sysctl, which
> > either lets zerocopy through or not with/without accounting).
>
> I'd suggest letting it fail gracefully (i.e. no -EINVAL) to not
> disturb existing/working setups during the transition period. But the
> exact policy is easy to change later on if we change our minds on it.
>
> > The proposed mechanism would involve:
> > 1) Some way of marking a page as being allocated by a driver that has
> > decided to opt into this mechanism. Say, a page flag, or a memcg flag.
>
> Right. I would stress it should not be a memcg flag or any direct
> channel from the network to memcg, as this would limit its usefulness
> while having the same maintenance overhead.
>
> It should make the network page a first class MM citizen - like an LRU
> page or a slab page - which can be accounted and introspected as such,
> including from the memcg side.
>
> So definitely a page flag. Works for me.
>
> > 2) A callback provided by the driver, that takes a struct page*, and
> > returns a boolean. The value of the boolean being true indicates that
> > any and all refs on the page are held by the driver. False means there
> > exists at least one reference that is not held by the driver.
>
> I was thinking the PageNetwork flag would cover this, but maybe I'm
> missing something?
>

The main reason for a driver callback is to handle whatever
driver-specific behaviour needs to be handled (ie. while a driver may
use code from net/core/page_pool.c, it also may roll its own arbitrary
behaviour and data structures). And because it's not necessarily the
case that a driver would take exactly 1 ref of its own on the page.
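Just to make that concrete, the rough shape I'm imagining for the
opt-in hook is something like the below - purely illustrative, none of
these names exist in the kernel today:

/*
 * Illustrative sketch only: neither this struct nor the helper below
 * exist today; they are just meant to show the shape of the opt-in
 * being discussed.
 */
struct netmem_ops {
	/*
	 * Return true if every remaining reference on @page is held by
	 * the driver itself (so the memcg charge can be dropped); false
	 * if some other party still holds a reference.
	 */
	bool (*all_refs_held)(struct page *page);
};

/*
 * Driver opt-in: mark @page as PageNetwork() and associate it with the
 * driver's ops so the free path knows who to ask about it.
 */
void netmem_page_init(struct page *page, const struct netmem_ops *ops);

The ops pointer would presumably live in the PageNetwork() side of the
struct page union, which ties into the space question further down.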
> > 3) A branch in put_page() that, for pages marked thus, will consult
> > the driver callback and if it returns true, will uncharge the memcg
> > for the page.
>
> The way I picture it, put_page() (and release_pages) should do this:
>
> void __put_page(struct page *page)
> {
> 	if (is_zone_device_page(page)) {
> 		put_dev_pagemap(page->pgmap);
>
> 		/*
> 		 * The page belongs to the device that created pgmap. Do
> 		 * not return it to page allocator.
> 		 */
> 		return;
> 	}
> +
> +	if (PageNetwork(page)) {
> +		put_page_network(page);
> +		/* Page belongs to the network stack, not the page allocator */
> +		return;
> +	}
>
> 	if (unlikely(PageCompound(page)))
> 		__put_compound_page(page);
> 	else
> 		__put_single_page(page);
> }
>
> where put_page_network() is the network-side callback that uncharges
> the page.
>
> (..and later can be extended to do all kinds of things when informed
> that the page has been freed: update statistics (mod_page_state), put
> it on a private network freelist, or ClearPageNetwork() and give it
> back to the page allocator etc.
>

Yes, this is more or less what I had in mind, though put_page_network()
would also need to avail itself of the callback mentioned previously.
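Roughly, and assuming the hypothetical netmem_ops / page_netmem_ops()
helpers from the sketch earlier, put_page_network() might look
something like this - just a sketch, and the exact memcg uncharge entry
point is still TBD:

static void put_page_network(struct page *page)
{
	/*
	 * Hypothetical lookup of the opting-in driver's ops, e.g. stashed
	 * in the PageNetwork() words of struct page.
	 */
	const struct netmem_ops *ops = page_netmem_ops(page);

	/*
	 * Only drop the memcg charge once the driver confirms that all
	 * remaining references on the page are its own.
	 */
	if (ops->all_refs_held(page))
		mem_cgroup_uncharge(page);	/* exact uncharge call TBD */

	/* For starters, retain the current silent recycling behaviour. */
	set_page_count(page, 1);
}

(The set_page_count() at the end is just the silent-recycling
placeholder you mention next.)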
> But for starters it can set_page_count(page, 1) after the uncharge to
> retain the current silent recycling behavior.)
>

This would be one example of where the driver could conceivably have >1
ref for whatever reason
(https://github.com/torvalds/linux/blob/master/drivers/net/ethernet/mellanox/mlx4/en_rx.c#L495)
where it looks like it could take 2 refs on a page, perhaps storing
2 x 1500B packets on a single 4KB page.

> > The anonymous struct you defined above is part of a union that I think
> > normally is one qword in length (well, could be more depending on the
> > typedefs I saw there) and I think that can be co-opted to provide the
> > driver callback - though, it might require growing the struct by one
> > more qword since there may be drivers like mlx5 that are already using
> > the field already in there for dma_addr.
>
> The page cache / anonymous struct it's shared with is 5 words (double
> linked list pointers, mapping, index, private), and the network struct
> is currently one word, so you can add 4 words to a PageNetwork() page
> without increasing the size of struct page. That should be plenty of
> space to store auxiliary data for drivers, right?
>

Ah, I think I was looking more narrowly at an older version of the
struct. The new one is much easier to parse. :) 4 words should be
plenty, I think.

> > Anyways, the callback could then be used by the driver to handle the
> > other accounting quirks you mentioned, without needing to scan the
> > full pool.
>
> Right.
>
> > Of course there are corner cases and such to properly account for, but
> > I just wanted to provide a really rough sketch to see if this
> > (assuming it were properly implemented) was what you had in mind. If
> > so I can put together a v3 patch.
>
> Yeah, makes perfect sense. We can keep iterating like this any time
> you feel you accumulate too many open questions. Not just for MM but
> also for the networking folks - although I suspect that the first step
> would be mostly about the MM infrastructure, and I'm not sure how much
> they care about the internals there ;)
>
> > Per my response to Andrew earlier, this would make it even more
> > confusing whether this is to be applied against net-next or mm trees.
> > But that's a bridge to cross when we get to it.
>
> The mm tree includes -next, so it should be a safe development target
> for the time being.
>
> I would then decide it based on how many changes your patch interacts
> with on either side. Changes to struct page and the put path are not
> very frequent, so I suspect it'll be easy to rebase to net-next and
> route everything through there. And if there are heavy changes on both
> sides, the -mm tree is the better route anyway.
>
> Does that sound reasonable?

This sounds good to me. To summarize then, it seems to me that we're on
the same page now. I'll put together a tentative v3 such that:

1. It uses pre-charging, as previously discussed.
2. It uses a page flag to delineate pages of a certain networking sort
   (ie. this mechanism).
3. It avails itself of up to 4 words of data inside struct page, inside
   the networking specific struct.
4. And it sets up this opt-in lifecycle notification for drivers that
   choose to use it, falling back to existing behaviour without.

Thanks,
-Arjun