From: John Hubbard
To: Andrew Morton
CC: Al Viro, Christoph Hellwig, Dan Williams, Dave Chinner, Ira Weiny,
    Jan Kara, Jason Gunthorpe, Jonathan Corbet, Jérôme Glisse,
    "Kirill A. Shutemov", Michal Hocko, Mike Kravetz, Shuah Khan,
    Vlastimil Babka, Matthew Wilcox, LKML, John Hubbard,
    "Kirill A. Shutemov"
Shutemov" Subject: [PATCH v6 07/12] mm/gup: page->hpage_pinned_refcount: exact pin counts for huge pages Date: Mon, 10 Feb 2020 16:15:31 -0800 Message-ID: <20200211001536.1027652-8-jhubbard@nvidia.com> X-Mailer: git-send-email 2.25.0 In-Reply-To: <20200211001536.1027652-1-jhubbard@nvidia.com> References: <20200211001536.1027652-1-jhubbard@nvidia.com> MIME-Version: 1.0 X-NVConfidentiality: public Content-Transfer-Encoding: quoted-printable Content-Type: text/plain DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=nvidia.com; s=n1; t=1581380111; bh=OT3RBMIOTYuZ2ESmgRWT99LM4mnzHujzb1/CcKG658o=; h=X-PGP-Universal:From:To:CC:Subject:Date:Message-ID:X-Mailer: In-Reply-To:References:MIME-Version:X-NVConfidentiality: Content-Transfer-Encoding:Content-Type; b=oExQ1fZIJtierFROJ1RC30C8N4WYwMPBSyPa0yCuoSl1kkNFBwk/x8eAoPQiNr92t MM4f8kwKN8cJ2WwhTX7Yio2YwHAKEHGrLlgBABnMDaGUQsL+1SV8inrDDgRLButV8I qFR4vYcCZUZIJB+8FLYC0qj5c1q6e5g+Xjet0rRqRCyX/WE2SvZLWYaDjsXqZhvOh3 iEYa5LlTPVK4mUNweaxrcsqJX9wDB6qGjvYtDYWX2gLHnyL+iTsefD6SEaJZ7pDgP5 GCcYmdfawUCeBZt1r5Rx47cnb1J0xHiGouKf7A/JXjtQoqxquHI1A7gkHyeVf8k4hm 3UQaY77nMTIdw== Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org For huge pages (and in fact, any compound page), the GUP_PIN_COUNTING_BIAS scheme tends to overflow too easily, each tail page increments the head page->_refcount by GUP_PIN_COUNTING_BIAS (1024). That limits the number of huge pages that can be pinned. This patch removes that limitation, by using an exact form of pin counting for compound pages of order > 1. The "order > 1" is required because this approach uses the 3rd struct page in the compound page, and order 1 compound pages only have two pages, so that won't work there. A new struct page field, hpage_pinned_refcount, has been added, replacing a padding field in the union (so no new space is used). This enhancement also has a useful side effect: huge pages and compound pages (of order > 1) do not suffer from the "potential false positives" problem that is discussed in the page_dma_pinned() comment block. That is because these compound pages have extra space for tracking things, so they get exact pin counts instead of overloading page->_refcount. Documentation/core-api/pin_user_pages.rst is updated accordingly. Acked-by: Kirill A. Shutemov Reviewed-by: Jan Kara Suggested-by: Jan Kara Signed-off-by: John Hubbard --- Documentation/core-api/pin_user_pages.rst | 40 +++++------- include/linux/mm.h | 26 ++++++++ include/linux/mm_types.h | 7 +- mm/gup.c | 78 ++++++++++++++++++++--- mm/hugetlb.c | 6 ++ mm/page_alloc.c | 2 + mm/rmap.c | 6 ++ 7 files changed, 133 insertions(+), 32 deletions(-) diff --git a/Documentation/core-api/pin_user_pages.rst b/Documentation/core= -api/pin_user_pages.rst index 9829345428f8..7e5dd8b1b3f2 100644 --- a/Documentation/core-api/pin_user_pages.rst +++ b/Documentation/core-api/pin_user_pages.rst @@ -52,8 +52,22 @@ Which flags are set by each wrapper =20 For these pin_user_pages*() functions, FOLL_PIN is OR'd in with whatever g= up flags the caller provides. The caller is required to pass in a non-null st= ruct -pages* array, and the function then pin pages by incrementing each by a sp= ecial -value. For now, that value is +1, just like get_user_pages*().:: +pages* array, and the function then pins pages by incrementing each by a s= pecial +value: GUP_PIN_COUNTING_BIAS. + +For huge pages (and in fact, any compound page of more than 2 pages), the +GUP_PIN_COUNTING_BIAS scheme is not used. 
diff --git a/Documentation/core-api/pin_user_pages.rst b/Documentation/core-api/pin_user_pages.rst
index 9829345428f8..7e5dd8b1b3f2 100644
--- a/Documentation/core-api/pin_user_pages.rst
+++ b/Documentation/core-api/pin_user_pages.rst
@@ -52,8 +52,22 @@ Which flags are set by each wrapper
 
 For these pin_user_pages*() functions, FOLL_PIN is OR'd in with whatever gup
 flags the caller provides. The caller is required to pass in a non-null struct
-pages* array, and the function then pin pages by incrementing each by a special
-value. For now, that value is +1, just like get_user_pages*().::
+pages* array, and the function then pins pages by incrementing each by a special
+value: GUP_PIN_COUNTING_BIAS.
+
+For huge pages (and in fact, any compound page of more than 2 pages), the
+GUP_PIN_COUNTING_BIAS scheme is not used. Instead, an exact form of pin counting
+is achieved, by using the 3rd struct page in the compound page. A new struct
+page field, hpage_pinned_refcount, has been added in order to support this.
+
+This approach for compound pages avoids the counting upper limit problems that
+are discussed below. Those limitations would have been aggravated severely by
+huge pages, because each tail page adds a refcount to the head page. And in
+fact, testing revealed that, without a separate hpage_pinned_refcount field,
+page overflows were seen in some huge page stress tests.
+
+This also means that huge pages and compound pages (of order > 1) do not suffer
+from the false positives problem that is mentioned below.::
 
 Function
 --------
@@ -99,27 +113,6 @@ pages:
 This also leads to limitations: there are only 31-10==21 bits available for a
 counter that increments 10 bits at a time.
 
-TODO: for 1GB and larger huge pages, this is cutting it close. That's because
-when pin_user_pages() follows such pages, it increments the head page by "1"
-(where "1" used to mean "+1" for get_user_pages(), but now means "+1024" for
-pin_user_pages()) for each tail page. So if you have a 1GB huge page:
-
-* There are 256K (18 bits) worth of 4 KB tail pages.
-* There are 21 bits available to count up via GUP_PIN_COUNTING_BIAS (that is,
-  10 bits at a time)
-* There are 21 - 18 == 3 bits available to count. Except that there aren't,
-  because you need to allow for a few normal get_page() calls on the head page,
-  as well. Fortunately, the approach of using addition, rather than "hard"
-  bitfields, within page->_refcount, allows for sharing these bits gracefully.
-  But we're still looking at about 8 references.
-
-This, however, is a missing feature more than anything else, because it's easily
-solved by addressing an obvious inefficiency in the original get_user_pages()
-approach of retrieving pages: stop treating all the pages as if they were
-PAGE_SIZE. Retrieve huge pages as huge pages. The callers need to be aware of
-this, so some work is required. Once that's in place, this limitation mostly
-disappears from view, because there will be ample refcounting range available.
-
 * Callers must specifically request "dma-pinned tracking of pages". In other
   words, just calling get_user_pages() will not suffice; a new set of functions,
   pin_user_page() and related, must be used.
@@ -228,5 +221,6 @@ References
 * `Some slow progress on get_user_pages() (Apr 2, 2019) `_
 * `DMA and get_user_pages() (LPC: Dec 12, 2018) `_
 * `The trouble with get_user_pages() (Apr 30, 2018) `_
+* `LWN kernel index: get_user_pages() `_
 
 John Hubbard, October, 2019
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8d4f9f4094f4..2f9ca976402b 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -770,6 +770,24 @@ static inline unsigned int compound_order(struct page *page)
 	return page[1].compound_order;
 }
 
+static inline bool hpage_pincount_available(struct page *page)
+{
+	/*
+	 * Can the page->hpage_pinned_refcount field be used? That field is in
+	 * the 3rd page of the compound page, so the smallest (2-page) compound
+	 * pages cannot support it.
+	 */
+	page = compound_head(page);
+	return PageCompound(page) && compound_order(page) > 1;
+}
+
+static inline int compound_pincount(struct page *page)
+{
+	VM_BUG_ON_PAGE(!hpage_pincount_available(page), page);
+	page = compound_head(page);
+	return atomic_read(compound_pincount_ptr(page));
+}
+
 static inline void set_compound_order(struct page *page, unsigned int order)
 {
 	page[1].compound_order = order;
@@ -1084,6 +1102,11 @@ void unpin_user_pages(struct page **pages, unsigned long npages);
  * refcounts, and b) all the callers of this routine are expected to be able to
  * deal gracefully with a false positive.
  *
+ * For huge pages, the result will be exactly correct. That's because we have
+ * more tracking data available: the 3rd struct page in the compound page is
+ * used to track the pincount (instead of using the GUP_PIN_COUNTING_BIAS
+ * scheme).
+ *
  * For more information, please see Documentation/vm/pin_user_pages.rst.
  *
  * @page: pointer to page to be queried.
@@ -1092,6 +1115,9 @@ void unpin_user_pages(struct page **pages, unsigned long npages);
  */
 static inline bool page_maybe_dma_pinned(struct page *page)
 {
+	if (hpage_pincount_available(page))
+		return compound_pincount(page) > 0;
+
 	/*
 	 * page_ref_count() is signed. If that refcount overflows, then
 	 * page_ref_count() returns a negative value, and callers will avoid
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index c28911c3afa8..dd555e6d23f3 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -137,7 +137,7 @@ struct page {
 		};
 		struct {	/* Second tail page of compound page */
 			unsigned long _compound_pad_1;	/* compound_head */
-			unsigned long _compound_pad_2;
+			atomic_t hpage_pinned_refcount;
 			/* For both global and memcg */
 			struct list_head deferred_list;
 		};
@@ -226,6 +226,11 @@ static inline atomic_t *compound_mapcount_ptr(struct page *page)
 	return &page[1].compound_mapcount;
 }
 
+static inline atomic_t *compound_pincount_ptr(struct page *page)
+{
+	return &page[2].hpage_pinned_refcount;
+}
+
 /*
  * Used for sizing the vmemmap region on some architectures
  */
diff --git a/mm/gup.c b/mm/gup.c
index a2356482e1ea..4d0d94405639 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -29,6 +29,22 @@ struct follow_page_context {
 	unsigned int page_mask;
 };
 
+static void hpage_pincount_add(struct page *page, int refs)
+{
+	VM_BUG_ON_PAGE(!hpage_pincount_available(page), page);
+	VM_BUG_ON_PAGE(page != compound_head(page), page);
+
+	atomic_add(refs, compound_pincount_ptr(page));
+}
+
+static void hpage_pincount_sub(struct page *page, int refs)
+{
+	VM_BUG_ON_PAGE(!hpage_pincount_available(page), page);
+	VM_BUG_ON_PAGE(page != compound_head(page), page);
+
+	atomic_sub(refs, compound_pincount_ptr(page));
+}
+
 /*
  * Return the compound head page with ref appropriately incremented,
  * or NULL if that failed.
@@ -70,8 +86,25 @@ static __maybe_unused struct page *try_grab_compound_head(struct page *page,
 	if (flags & FOLL_GET)
 		return try_get_compound_head(page, refs);
 	else if (flags & FOLL_PIN) {
-		refs *= GUP_PIN_COUNTING_BIAS;
-		return try_get_compound_head(page, refs);
+		/*
+		 * When pinning a compound page of order > 1 (which is what
+		 * hpage_pincount_available() checks for), use an exact count to
+		 * track it, via hpage_pincount_add/_sub().
+		 *
+		 * However, be sure to *also* increment the normal page refcount
+		 * field at least once, so that the page really is pinned.
+		 */
+		if (!hpage_pincount_available(page))
+			refs *= GUP_PIN_COUNTING_BIAS;
+
+		page = try_get_compound_head(page, refs);
+		if (!page)
+			return NULL;
+
+		if (hpage_pincount_available(page))
+			hpage_pincount_add(page, refs);
+
+		return page;
 	}
 
 	WARN_ON_ONCE(1);
@@ -106,12 +139,25 @@ bool __must_check try_grab_page(struct page *page, unsigned int flags)
 	if (flags & FOLL_GET)
 		return try_get_page(page);
 	else if (flags & FOLL_PIN) {
+		int refs = 1;
+
 		page = compound_head(page);
 
 		if (WARN_ON_ONCE(page_ref_count(page) <= 0))
 			return false;
 
-		page_ref_add(page, GUP_PIN_COUNTING_BIAS);
+		if (hpage_pincount_available(page))
+			hpage_pincount_add(page, 1);
+		else
+			refs = GUP_PIN_COUNTING_BIAS;
+
+		/*
+		 * Similar to try_grab_compound_head(): even if using the
+		 * hpage_pincount_add/_sub() routines, be sure to
+		 * *also* increment the normal page refcount field at least
+		 * once, so that the page really is pinned.
+		 */
+		page_ref_add(page, refs);
 	}
 
 	return true;
@@ -120,12 +166,17 @@ bool __must_check try_grab_page(struct page *page, unsigned int flags)
 #ifdef CONFIG_DEV_PAGEMAP_OPS
 static bool __unpin_devmap_managed_user_page(struct page *page)
 {
-	int count;
+	int count, refs = 1;
 
 	if (!page_is_devmap_managed(page))
 		return false;
 
-	count = page_ref_sub_return(page, GUP_PIN_COUNTING_BIAS);
+	if (hpage_pincount_available(page))
+		hpage_pincount_sub(page, 1);
+	else
+		refs = GUP_PIN_COUNTING_BIAS;
+
+	count = page_ref_sub_return(page, refs);
 
 	/*
 	 * devmap page refcounts are 1-based, rather than 0-based: if
@@ -157,6 +208,8 @@ static bool __unpin_devmap_managed_user_page(struct page *page)
  */
 void unpin_user_page(struct page *page)
 {
+	int refs = 1;
+
 	page = compound_head(page);
 
 	/*
@@ -168,7 +221,12 @@ void unpin_user_page(struct page *page)
 	if (__unpin_devmap_managed_user_page(page))
 		return;
 
-	if (page_ref_sub_and_test(page, GUP_PIN_COUNTING_BIAS))
+	if (hpage_pincount_available(page))
+		hpage_pincount_sub(page, 1);
+	else
+		refs = GUP_PIN_COUNTING_BIAS;
+
+	if (page_ref_sub_and_test(page, refs))
 		__put_page(page);
 }
 EXPORT_SYMBOL(unpin_user_page);
@@ -2200,8 +2258,12 @@ static int record_subpages(struct page *page, unsigned long addr,
 
 static void put_compound_head(struct page *page, int refs, unsigned int flags)
 {
-	if (flags & FOLL_PIN)
-		refs *= GUP_PIN_COUNTING_BIAS;
+	if (flags & FOLL_PIN) {
+		if (hpage_pincount_available(page))
+			hpage_pincount_sub(page, refs);
+		else
+			refs *= GUP_PIN_COUNTING_BIAS;
+	}
 
 	VM_BUG_ON_PAGE(page_ref_count(page) < refs, page);
 	/*
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index ba1de6bc1402..3d31a235b53d 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1009,6 +1009,9 @@ static void destroy_compound_gigantic_page(struct page *page,
 	struct page *p = page + 1;
 
 	atomic_set(compound_mapcount_ptr(page), 0);
+	if (hpage_pincount_available(page))
+		atomic_set(compound_pincount_ptr(page), 0);
+
 	for (i = 1; i < nr_pages; i++, p = mem_map_next(p, page, i)) {
 		clear_compound_head(p);
 		set_page_refcounted(p);
@@ -1287,6 +1290,9 @@ static void prep_compound_gigantic_page(struct page *page, unsigned int order)
 		set_compound_head(p, page);
 	}
 	atomic_set(compound_mapcount_ptr(page), -1);
+
+	if (hpage_pincount_available(page))
+		atomic_set(compound_pincount_ptr(page), 0);
 }
 
 /*
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3c4eb750a199..b2fe61035b7a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -689,6 +689,8 @@ void prep_compound_page(struct page *page, unsigned int order)
 		set_compound_head(p, page);
 	}
 	atomic_set(compound_mapcount_ptr(page), -1);
+	if (hpage_pincount_available(page))
+		atomic_set(compound_pincount_ptr(page), 0);
 }
 
 #ifdef CONFIG_DEBUG_PAGEALLOC
diff --git a/mm/rmap.c b/mm/rmap.c
index b3e381919835..e45b9b991e2f 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1178,6 +1178,9 @@ void page_add_new_anon_rmap(struct page *page,
 		VM_BUG_ON_PAGE(!PageTransHuge(page), page);
 		/* increment count (starts at -1) */
 		atomic_set(compound_mapcount_ptr(page), 0);
+		if (hpage_pincount_available(page))
+			atomic_set(compound_pincount_ptr(page), 0);
+
 		__inc_node_page_state(page, NR_ANON_THPS);
 	} else {
 		/* Anon THP always mapped first with PMD */
@@ -1974,6 +1977,9 @@ void hugepage_add_new_anon_rmap(struct page *page,
 {
 	BUG_ON(address < vma->vm_start || address >= vma->vm_end);
 	atomic_set(compound_mapcount_ptr(page), 0);
+	if (hpage_pincount_available(page))
+		atomic_set(compound_pincount_ptr(page), 0);
+
 	__page_set_anon_rmap(page, vma, address, 1);
 }
 #endif /* CONFIG_HUGETLB_PAGE */
-- 
2.25.0
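
To put numbers on the overflow problem that motivated this change, a
back-of-the-envelope calculation (an illustration only: it assumes a 1 GB
huge page, i.e. 256K 4 KB subpages, pinned in full, and ignores incidental
get_page() references on the head page):

#include <stdio.h>

int main(void)
{
	const long long refcount_max  = 1LL << 31;  /* page->_refcount is a 32-bit signed counter */
	const long long bias          = 1024;       /* GUP_PIN_COUNTING_BIAS */
	const long long tail_pages_1g = 1LL << 18;  /* 256K 4 KB subpages in a 1 GB huge page */

	/* Old scheme: each subpage pin adds the full bias to the head page. */
	printf("old scheme, full-page pins before overflow: ~%lld\n",
	       refcount_max / (bias * tail_pages_1g));          /* ~8 */

	/*
	 * New scheme: the head page gains only one reference per subpage;
	 * the exact pin count lives in hpage_pinned_refcount instead.
	 */
	printf("new scheme, full-page pins before overflow: ~%lld\n",
	       refcount_max / tail_pages_1g);                   /* ~8192 */
	return 0;
}

The ~8 figure matches the estimate in the TODO text that this patch removes
from pin_user_pages.rst.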