Date: Fri, 13 Dec 2019 09:48:45 +0100
From: Jesper Dangaard Brouer
To: Yunsheng Lin
Cc: "Li,Rongqing", Saeed Mahameed, "ilias.apalodimas@linaro.org",
 "jonathan.lemon@gmail.com", "netdev@vger.kernel.org", "mhocko@kernel.org",
 "peterz@infradead.org", Greg Kroah-Hartman, "bhelgaas@google.com",
 "linux-kernel@vger.kernel.org", Björn Töpel, brouer@redhat.com
Subject: Re: [PATCH][v2] page_pool: handle page recycle for NUMA_NO_NODE condition
Message-ID: <20191213094845.56fb42a4@carbon>
In-Reply-To: <079a0315-efea-9221-8538-47decf263684@huawei.com>
References: <1575624767-3343-1-git-send-email-lirongqing@baidu.com>
 <9fecbff3518d311ec7c3aee9ae0315a73682a4af.camel@mellanox.com>
 <20191211194933.15b53c11@carbon>
 <831ed886842c894f7b2ffe83fe34705180a86b3b.camel@mellanox.com>
 <0a252066-fdc3-a81d-7a36-8f49d2babc01@huawei.com>
 <20191212111831.2a9f05d3@carbon>
 <7c555cb1-6beb-240d-08f8-7044b9087fe4@huawei.com>
 <1d4f10f4c0f1433bae658df8972a904f@baidu.com>
 <079a0315-efea-9221-8538-47decf263684@huawei.com>
On Fri, 13 Dec 2019 14:53:37 +0800
Yunsheng Lin wrote:

> On 2019/12/13 14:27, Li,Rongqing wrote:
> >>
> >> It is good to allocate the rx page close to both the cpu and the
> >> device, but if both goals cannot be reached, maybe we choose to
> >> allocate pages that are close to the cpu?
> >>
> > I think that is true.
> >
> > If it is true, we can remove pool->p.nid, replace alloc_pages_node
> > with alloc_pages in __page_pool_alloc_pages_slow, and change
> > pool_page_reusable so that page_to_nid(page) is checked against
> > numa_mem_id().

No, as I explained before, you cannot use numa_mem_id() in
pool_page_reusable, because the recycle call can happen from/on a
remote CPU (and numa_mem_id() uses the local CPU to determine the
NUMA id).

Today we already have XDP cpumap redirect, and we are working on
extending this to SKBs, which will increase the likelihood even more.
I don't think we want to skip recycling just because a remote
NUMA-node CPU happened to touch the packet?

(Based on the optimizations done for Facebook, which is the reason we
added this.)  What seems to matter is the NUMA node of the CPU that
runs the RX NAPI-polling; this is the first CPU that touches the
packet memory and the struct-page memory.  For these drivers, it is
also the same "RX-CPU" that allocates the new pages (to refill the
RX-ring), and these "new" pages can come from the page_pool.

With this in mind, it does seem strange that we are doing the check on
the "free"/recycle code path, which can run on remote CPUs...

> > Since alloc_pages hints to use pages from the current node, and
> > __page_pool_alloc_pages_slow will often be called in NAPI polling
> > when recycling fails, after some cycles the pages will come from
> > the local memory node.

You are basically saying that the NUMA check should be moved to
allocation time, as that runs on the RX-CPU (NAPI), and eventually,
after some time, the pages will come from the correct NUMA node.

I think we can do that, and only affect the semi-fast-path.  We just
need to handle that recycled pages in the ptr_ring can be from the
wrong NUMA node.  In __page_pool_get_cached(), when consuming pages
from the ptr_ring (__ptr_ring_consume_batched), we can then evict
pages from the wrong NUMA node.

For the pool->alloc.cache we either accept that it will eventually be
emptied after some time (it is only in a 100% XDP_DROP workload that
it will keep reusing the same pages), or we simply clear the
pool->alloc.cache when calling page_pool_update_nid().

> Yes, if allocation and recycling are in the same NAPI polling context.

Which is a false assumption.

> As pointed out by Saeed and Ilias, the allocation and recycling may
> not be happening in the same NAPI polling context, see:
>
> "In the current code base if they are only called under NAPI this
> might be true. On the page_pool skb recycling patches though (yes
> we'll eventually send those :)) this is called from kfree_skb()."
>
> So this may need some additional attention.

Yes, as explained above.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer
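
A minimal C sketch of the allocation-side approach described above, for
illustration only.  It is not the actual patch from this thread: field
names (pool->ring, pool->alloc, pool->p.nid) follow net/core/page_pool.c
as of this discussion, the in_serving_softirq() handling and tracing of
the real __page_pool_get_cached() are left out, and
__page_pool_return_page() is assumed to be available for releasing a
page back to the page allocator.

/*
 * Hedged sketch, not the patch from this thread: do the NUMA check on
 * the allocation/refill side instead of the recycle side.
 */
static struct page *__page_pool_get_cached(struct page_pool *pool)
{
	struct ptr_ring *r = &pool->ring;
	struct page *page;

	/* Lockless per-CPU cache: pages here are trusted to be local */
	if (likely(pool->alloc.count))
		return pool->alloc.cache[--pool->alloc.count];

	/* Refill from the ptr_ring, evicting pages that were recycled
	 * on a remote NUMA node instead of handing them to the driver.
	 */
	spin_lock(&r->consumer_lock);
	while (pool->alloc.count < PP_ALLOC_CACHE_REFILL) {
		page = __ptr_ring_consume(r);
		if (!page)
			break;
		if (likely(page_to_nid(page) == pool->p.nid))
			pool->alloc.cache[pool->alloc.count++] = page;
		else
			__page_pool_return_page(pool, page);
	}
	spin_unlock(&r->consumer_lock);

	return pool->alloc.count ?
		pool->alloc.cache[--pool->alloc.count] : NULL;
}

/*
 * Variant of the second option mentioned above: flush the lockless
 * cache when the pool's NUMA node changes, so a 100% XDP_DROP workload
 * does not keep reusing pages from the old node forever.
 */
void page_pool_update_nid(struct page_pool *pool, int new_nid)
{
	WRITE_ONCE(pool->p.nid, new_nid);

	while (pool->alloc.count)
		__page_pool_return_page(pool,
				pool->alloc.cache[--pool->alloc.count]);
}

The page_pool_update_nid() variant corresponds to the "simply clear the
pool->alloc.cache" idea; flushing the lockless cache like this is only
safe if the call happens from the RX NAPI context that owns that cache,
which is the open question raised by Yunsheng above.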