Date: Tue, 3 Nov 2020 20:35:00 +0000
From: Matthew Wilcox
To: Dongli Zhang
Cc: linux-mm@kvack.org, netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
    akpm@linux-foundation.org, davem@davemloft.net, kuba@kernel.org,
    aruna.ramakrishna@oracle.com, bert.barbe@oracle.com,
    rama.nichanamatlu@oracle.com, venkat.x.venkatsubra@oracle.com,
    manjunath.b.patil@oracle.com, joe.jin@oracle.com, srinivas.eeda@oracle.com
Subject: Re: [PATCH 1/1] mm: avoid re-using pfmemalloc page in page_frag_alloc()
Message-ID: <20201103203500.GG27442@casper.infradead.org>
References: <20201103193239.1807-1-dongli.zhang@oracle.com>
In-Reply-To: <20201103193239.1807-1-dongli.zhang@oracle.com>

On Tue, Nov 03, 2020 at 11:32:39AM -0800, Dongli Zhang wrote:
> The ethernet driver may allocate an skb (and skb->data) via
> napi_alloc_skb(). This ends up in page_frag_alloc(), which allocates
> skb->data from page_frag_cache->va.
>
> Under memory pressure, page_frag_cache->va may be allocated as a
> pfmemalloc page. As a result, skb->pfmemalloc is always true, since
> skb->data comes from page_frag_cache->va. The skb will be dropped if
> the sock (receiver) does not have SOCK_MEMALLOC. This is expected
> behaviour under memory pressure.
>
> However, once the kernel is no longer under memory pressure (suppose a
> large number of pages has just been reclaimed), page_frag_alloc() may
> still re-use the prior pfmemalloc page_frag_cache->va to allocate
> skb->data. As a result, skb->pfmemalloc stays true until
> page_frag_cache->va is re-allocated, even though the kernel is no
> longer under memory pressure.
>
> Here is how the kernel runs into the issue.
>
> 1. The kernel is under memory pressure and the allocation of
> PAGE_FRAG_CACHE_MAX_ORDER in __page_frag_cache_refill() fails.
> Instead, a pfmemalloc page is allocated for page_frag_cache->va.
>
> 2. All skb->data from page_frag_cache->va (pfmemalloc) will have
> skb->pfmemalloc=true. The skb will always be dropped by a sock without
> SOCK_MEMALLOC. This is expected behaviour.
>
> 3. Suppose a large number of pages is reclaimed and the kernel is no
> longer under memory pressure. We expect that the skb->pfmemalloc drops
> will stop happening.
>
> 4. Unfortunately, page_frag_alloc() does not proactively re-allocate
> page_frag_cache->va and will always re-use the prior pfmemalloc page.
> So skb->pfmemalloc stays true even though the kernel is no longer
> under memory pressure.
>
> Therefore, this patch always checks for, and tries to avoid, re-using
> a pfmemalloc page for page_frag_cache->va.
>
> Cc: Aruna Ramakrishna
> Cc: Bert Barbe
> Cc: Rama Nichanamatlu
> Cc: Venkat Venkatsubra
> Cc: Manjunath Patil
> Cc: Joe Jin
> Cc: SRINIVAS
> Signed-off-by: Dongli Zhang
> ---
>  mm/page_alloc.c | 10 ++++++++++
>  1 file changed, 10 insertions(+)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 23f5066bd4a5..291df2f9f8f3 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5075,6 +5075,16 @@ void *page_frag_alloc(struct page_frag_cache *nc,
>  	struct page *page;
>  	int offset;
>  
> +	/*
> +	 * Try to avoid re-using a pfmemalloc page because the kernel may
> +	 * have recovered from the memory pressure situation at any time.
> +	 */
> +	if (unlikely(nc->va && nc->pfmemalloc)) {
> +		page = virt_to_page(nc->va);
> +		__page_frag_cache_drain(page, nc->pagecnt_bias);
> +		nc->va = NULL;
> +	}

I think this is the wrong way to solve this problem. Instead, we should
use up this page, but refuse to recycle it. How about something like
this (not even compile tested):

+++ b/mm/page_alloc.c
@@ -5139,6 +5139,10 @@ void *page_frag_alloc(struct page_frag_cache *nc,
 		if (!page_ref_sub_and_test(page, nc->pagecnt_bias))
 			goto refill;
+		if (nc->pfmemalloc) {
+			free_the_page(page);
+			goto refill;
+		}
 #if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
 		/* if size can vary use size else just use PAGE_SIZE */