Subject: Re: [RFC PATCH 4/5] mm, hugetlb: get rid of surplus page accounting tricks
To: Michal Hocko
Cc: linux-mm@kvack.org, Naoya Horiguchi, Andrew Morton, LKML
References: <20171204140117.7191-1-mhocko@kernel.org> <20171204140117.7191-5-mhocko@kernel.org> <8b07431a-547e-4330-4276-570ef861bb35@oracle.com> <20171214075011.GD16951@dhcp22.suse.cz>
From: Mike Kravetz
Message-ID: <83d90b3a-eef0-d4aa-6733-b5c9e5370d4b@oracle.com>
Date: Thu, 14 Dec 2017 12:58:55 -0800
In-Reply-To: <20171214075011.GD16951@dhcp22.suse.cz>

On 12/13/2017 11:50 PM, Michal Hocko wrote:
> On Wed 13-12-17 16:45:55, Mike Kravetz wrote:
>> On 12/04/2017 06:01 AM, Michal Hocko wrote:
>>> From: Michal Hocko
>>>
>>> alloc_surplus_huge_page increases the pool size and the number of
>>> surplus pages opportunistically to prevent races with pool size
>>> changes. See d1c3fb1f8f29 ("hugetlb: introduce nr_overcommit_hugepages
>>> sysctl") for more details.
>>>
>>> The resulting code is unnecessarily hairy, causes code duplication, and
>>> doesn't allow the allocation paths to be shared. Moreover, pool size
>>> changes tend to be rare, so optimizing for them is not really
>>> reasonable. Simplify the code and allow a fresh surplus page to be
>>> allocated as long as we are under the overcommit limit, then recheck
>>> the condition after the allocation and drop the new page if the
>>> situation has changed. This should provide a reasonable guarantee that
>>> abrupt allocation requests will not go far over the limit.
>>>
>>> If we consider races with the pool shrinking and enlarging, then we
>>> should be reasonably safe as well. In the first case we are off by one
>>> in the worst case, and the second case should work OK because the page
>>> is not yet visible. We can waste CPU cycles on the allocation, but that
>>> should be acceptable for a relatively rare condition.
>>>
>>> Signed-off-by: Michal Hocko
>>> ---
>>>  mm/hugetlb.c | 60 +++++++++++++++++++++---------------------------------------
>>>  1 file changed, 21 insertions(+), 39 deletions(-)
>>>
>>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>>> index a1b8b2888ec9..0c7dc269b6c0 100644
>>> --- a/mm/hugetlb.c
>>> +++ b/mm/hugetlb.c
>>> @@ -1538,62 +1538,44 @@ int dissolve_free_huge_pages(unsigned long start_pfn, unsigned long end_pfn)
>>>  static struct page *__alloc_surplus_huge_page(struct hstate *h, gfp_t gfp_mask,
>>>                  int nid, nodemask_t *nmask)
>>>  {
>>> -        struct page *page;
>>> -        unsigned int r_nid;
>>> +        struct page *page = NULL;
>>>
>>>          if (hstate_is_gigantic(h))
>>>                  return NULL;
>>>
>>> -        /*
>>> -         * Assume we will successfully allocate the surplus page to
>>> -         * prevent racing processes from causing the surplus to exceed
>>> -         * overcommit
>>> -         *
>>> -         * This however introduces a different race, where a process B
>>> -         * tries to grow the static hugepage pool while alloc_pages() is
>>> -         * called by process A. B will only examine the per-node
>>> -         * counters in determining if surplus huge pages can be
>>> -         * converted to normal huge pages in adjust_pool_surplus(). A
>>> -         * won't be able to increment the per-node counter, until the
>>> -         * lock is dropped by B, but B doesn't drop hugetlb_lock until
>>> -         * no more huge pages can be converted from surplus to normal
>>> -         * state (and doesn't try to convert again). Thus, we have a
>>> -         * case where a surplus huge page exists, the pool is grown, and
>>> -         * the surplus huge page still exists after, even though it
>>> -         * should just have been converted to a normal huge page. This
>>> -         * does not leak memory, though, as the hugepage will be freed
>>> -         * once it is out of use. It also does not allow the counters to
>>> -         * go out of whack in adjust_pool_surplus() as we don't modify
>>> -         * the node values until we've gotten the hugepage and only the
>>> -         * per-node value is checked there.
>>> -         */
>>>          spin_lock(&hugetlb_lock);
>>> -        if (h->surplus_huge_pages >= h->nr_overcommit_huge_pages) {
>>> -                spin_unlock(&hugetlb_lock);
>>> -                return NULL;
>>> -        } else {
>>> -                h->nr_huge_pages++;
>>> -                h->surplus_huge_pages++;
>>> -        }
>>> +        if (h->surplus_huge_pages >= h->nr_overcommit_huge_pages)
>>> +                goto out_unlock;
>>>          spin_unlock(&hugetlb_lock);
>>>
>>>          page = __hugetlb_alloc_buddy_huge_page(h, gfp_mask, nid, nmask);
>>> +        if (!page)
>>> +                goto out_unlock;
>>>
>>>          spin_lock(&hugetlb_lock);
>>> -        if (page) {
>>> +        /*
>>> +         * We could have raced with the pool size change.
>>> +         * Double check that and simply deallocate the new page
>>> +         * if we would end up overcommiting the surpluses. Abuse
>>> +         * temporary page to workaround the nasty free_huge_page
>>> +         * codeflow
>>> +         */
>>> +        if (h->surplus_huge_pages >= h->nr_overcommit_huge_pages) {
>>> +                SetPageHugeTemporary(page);
>>> +                put_page(page);
>>> +                page = NULL;
>>> +        } else {
>>> +                h->surplus_huge_pages_node[page_to_nid(page)]++;
>>> +                h->surplus_huge_pages++;
>>>                  INIT_LIST_HEAD(&page->lru);
>>>                  r_nid = page_to_nid(page);
>>>                  set_compound_page_dtor(page, HUGETLB_PAGE_DTOR);
>>>                  set_hugetlb_cgroup(page, NULL);
>>> -                /*
>>> -                 * We incremented the global counters already
>>> -                 */
>>>                  h->nr_huge_pages_node[r_nid]++;
>>>                  h->surplus_huge_pages_node[r_nid]++;
>>> -        } else {
>>> -                h->nr_huge_pages--;
>>> -                h->surplus_huge_pages--;
>>
>> In the case of a successful surplus allocation, the following counters
>> are incremented:
>>
>>   h->surplus_huge_pages_node[page_to_nid(page)]++;
>>   h->surplus_huge_pages++;
>>   h->nr_huge_pages_node[r_nid]++;
>>   h->surplus_huge_pages_node[r_nid]++;
>>
>> It looks like the per-node surplus_huge_pages_node counter is
>> incremented twice, and the global nr_huge_pages counter is not
>> incremented at all.
>>
>> Also, you removed r_nid, so I'm guessing this will not compile?
>
> Oops, a hiccup during the rebase/split-up. The following code removes
> all of this, so I hadn't noticed. Thanks for catching that!
>
> The incremental diff:
> ---
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 41d2d9082f0d..3c16cde72ceb 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1565,8 +1565,10 @@ static struct page *__alloc_surplus_huge_page(struct hstate *h, gfp_t gfp_mask,
>                  put_page(page);
>                  page = NULL;
>          } else {
> -                h->surplus_huge_pages_node[page_to_nid(page)]++;
> +                int r_nid;
> +
>                  h->surplus_huge_pages++;
> +                h->nr_huge_pages++;
>                  INIT_LIST_HEAD(&page->lru);
>                  r_nid = page_to_nid(page);
>                  set_compound_page_dtor(page, HUGETLB_PAGE_DTOR);
>

With the incremental diff,

Reviewed-by: Mike Kravetz

-- 
Mike Kravetz
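
For reference, folding the incremental diff into the quoted hunk gives
roughly the following re-check sequence under hugetlb_lock. This is a
reconstruction from the two hunks above, not a separately posted patch;
the out_unlock label, the final unlock, and the return path lie outside
both quoted hunks, and the in-code comment is abridged here.

        spin_lock(&hugetlb_lock);
        /*
         * We could have raced with the pool size change.
         * Double check that and simply deallocate the new page
         * if we would end up overcommiting the surpluses.
         */
        if (h->surplus_huge_pages >= h->nr_overcommit_huge_pages) {
                /* over the limit after all: drop the freshly allocated page */
                SetPageHugeTemporary(page);
                put_page(page);
                page = NULL;
        } else {
                int r_nid;

                /* exactly one global and one per-node increment per counter */
                h->surplus_huge_pages++;
                h->nr_huge_pages++;
                INIT_LIST_HEAD(&page->lru);
                r_nid = page_to_nid(page);
                set_compound_page_dtor(page, HUGETLB_PAGE_DTOR);
                set_hugetlb_cgroup(page, NULL);
                h->nr_huge_pages_node[r_nid]++;
                h->surplus_huge_pages_node[r_nid]++;
        }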