From: Alexander Duyck
Date: Mon, 5 Nov 2018 07:44:00 -0800
Subject: Re: [PATCH 1/2] mm/page_alloc: free order-0 pages through PCP in page_frag_free()
To: aaron.lu@intel.com
Cc: linux-mm, LKML, Netdev, Andrew Morton, Paweł Staszewski,
    Jesper Dangaard Brouer, Eric Dumazet, Tariq Toukan,
    ilias.apalodimas@linaro.org, yoel@kviknet.dk, Mel Gorman,
    Saeed Mahameed, Michal Hocko, Vlastimil Babka,
    dave.hansen@linux.intel.com
In-Reply-To: <20181105085820.6341-1-aaron.lu@intel.com>

On Mon, Nov 5, 2018 at 12:58 AM Aaron Lu wrote:
>
> page_frag_free() calls __free_pages_ok() to free the page back to
> Buddy. This is OK for high order page, but for order-0 pages, it
> misses the optimization opportunity of using Per-Cpu-Pages and can
> cause zone lock contention when called frequently.
>
> Paweł Staszewski recently shared his result of 'how Linux kernel
> handles normal traffic'[1] and from perf data, Jesper Dangaard Brouer
> found the lock contention comes from page allocator:
>
>   mlx5e_poll_tx_cq
>   |
>    --16.34%--napi_consume_skb
>              |
>              |--12.65%--__free_pages_ok
>              |          |
>              |           --11.86%--free_one_page
>              |                     |
>              |                     |--10.10%--queued_spin_lock_slowpath
>              |                     |
>              |                      --0.65%--_raw_spin_lock
>              |
>              |--1.55%--page_frag_free
>              |
>               --1.44%--skb_release_data
>
> Jesper explained how it happened: mlx5 driver RX-page recycle
> mechanism is not effective in this workload and pages have to go
> through the page allocator. The lock contention happens during
> mlx5 DMA TX completion cycle. And the page allocator cannot keep
> up at these speeds.[2]
>
> I thought that __free_pages_ok() are mostly freeing high order
> pages and thought this is an lock contention for high order pages
> but Jesper explained in detail that __free_pages_ok() here are
> actually freeing order-0 pages because mlx5 is using order-0 pages
> to satisfy its page pool allocation request.[3]
>
> The free path as pointed out by Jesper is:
> skb_free_head()
>   -> skb_free_frag()
>     -> skb_free_frag()
>       -> page_frag_free()
> And the pages being freed on this path are order-0 pages.
>
> Fix this by doing similar things as in __page_frag_cache_drain() -
> send the being freed page to PCP if it's an order-0 page, or
> directly to Buddy if it is a high order page.
>
> With this change, Paweł hasn't noticed lock contention yet in
> his workload and Jesper has noticed a 7% performance improvement
> using a micro benchmark and lock contention is gone.
>
> [1]: https://www.spinics.net/lists/netdev/msg531362.html
> [2]: https://www.spinics.net/lists/netdev/msg531421.html
> [3]: https://www.spinics.net/lists/netdev/msg531556.html
> Reported-by: Paweł Staszewski
> Analysed-by: Jesper Dangaard Brouer
> Signed-off-by: Aaron Lu
> ---
>  mm/page_alloc.c | 10 ++++++++--
>  1 file changed, 8 insertions(+), 2 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index ae31839874b8..91a9a6af41a2 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4555,8 +4555,14 @@ void page_frag_free(void *addr)
>  {
>         struct page *page = virt_to_head_page(addr);
>
> -       if (unlikely(put_page_testzero(page)))
> -               __free_pages_ok(page, compound_order(page));
> +       if (unlikely(put_page_testzero(page))) {
> +               unsigned int order = compound_order(page);
> +
> +               if (order == 0)
> +                       free_unref_page(page);
> +               else
> +                       __free_pages_ok(page, order);
> +       }
>  }
>  EXPORT_SYMBOL(page_frag_free);
>

One thing I would suggest for Pawel to try would be to reduce the Tx
qdisc size on his transmitting interfaces, reduce the Tx ring size,
and possibly increase the Tx interrupt rate. Ideally we shouldn't have
too many packets in flight, and I suspect that is the issue Pawel is
seeing that leads to the page pool allocator freeing up the memory. I
know we like to batch things, but processing too many Tx buffers in
one batch leads to us eating up too much memory and causing evictions
from the cache. Ideally the Rx and Tx rings and queues should be sized
as small as possible while still allowing us to process up to our NAPI
budget. Usually I run things with a 128 Rx / 128 Tx setup and then
reduce the Tx queue length so we don't have more buffers stored there
than we can place in the Tx ring. Then we can avoid the extra thrash
of having to pull/push memory into and out of the freelists.
Essentially the issue here ends up being another form of bufferbloat.

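To put rough numbers on the sizing rule above, here is a minimal
userspace sketch in C. It is not driver or kernel code. The helper
names are hypothetical, and the model (buffers in flight = qdisc
length + Tx ring size) is a simplifying assumption made only for
illustration.

```c
/*
 * Illustrative model only; both helpers are hypothetical and do not
 * correspond to any kernel or driver API.
 */

/* Cap the qdisc (Tx queue) length so it never holds more buffers
 * than the Tx ring can accept. */
static unsigned long clamp_qdisc_len(unsigned long qdisc_len,
                                     unsigned long tx_ring)
{
    return qdisc_len > tx_ring ? tx_ring : qdisc_len;
}

/* Assumed worst case: every qdisc slot and every Tx ring slot holds
 * a buffer that has not been freed yet. */
static unsigned long tx_buffers_in_flight(unsigned long qdisc_len,
                                          unsigned long tx_ring)
{
    return qdisc_len + tx_ring;
}
```

With an assumed queue length of 1000 and an assumed 1024-entry ring,
tx_buffers_in_flight(1000, 1024) is 2024 buffers. With a 128-entry
ring and the queue length capped by clamp_qdisc_len(1000, 128), it
drops to 256.
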
With that said, this change should be mostly harmless and does address
the fact that we can have both regular order-0 pages and page frags
used for skb->head.

Acked-by: Alexander Duyck