Subject: Re: [PATCH v2 3/4] mm: add find_alloc_contig_pages() interface
To: Vlastimil Babka, linux-mm@kvack.org, linux-kernel@vger.kernel.org,
    linux-api@vger.kernel.org
Cc: Reinette Chatre, Michal Hocko, Christopher Lameter, Guy Shattah,
    Anshuman Khandual, Michal Nazarewicz, David Nellans, Laura Abbott,
    Pavel Machek, Dave Hansen, Andrew Morton
References: <20180503232935.22539-1-mike.kravetz@oracle.com>
 <20180503232935.22539-4-mike.kravetz@oracle.com>
From: Mike Kravetz
Message-ID: <57dfd52c-22a5-5546-f8f3-848f21710cc1@oracle.com>
Date: Mon, 21 May 2018 16:48:43 -0700
On 05/21/2018 01:54 AM, Vlastimil Babka wrote:
> On 05/04/2018 01:29 AM, Mike Kravetz wrote:
>> find_alloc_contig_pages() is a new interface that attempts to locate
>> and allocate a contiguous range of pages. It is provided as a more
>
> How about dropping the 'find_' from the name, so it's more like other
> allocator functions? All of them have to 'find' the free pages in some
> sense.

Sure

>
>> convenient interface than alloc_contig_range() which is currently
>> used by CMA and gigantic huge pages.
>>
>> When attempting to allocate a range of pages, migration is employed
>> if possible. There is no guarantee that the routine will succeed.
>> So, the user must be prepared for failure and have a fall back plan.
>>
>> Signed-off-by: Mike Kravetz
>> ---
>>  include/linux/gfp.h |  12 +++++
>>  mm/page_alloc.c     | 136 +++++++++++++++++++++++++++++++++++++++++++++++++++-
>>  2 files changed, 146 insertions(+), 2 deletions(-)
>>
>> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
>> index 86a0d06463ab..b0d11777d487 100644
>> --- a/include/linux/gfp.h
>> +++ b/include/linux/gfp.h
>> @@ -573,6 +573,18 @@ static inline bool pm_suspended_storage(void)
>>  extern int alloc_contig_range(unsigned long start, unsigned long end,
>>                                unsigned migratetype, gfp_t gfp_mask);
>>  extern void free_contig_range(unsigned long pfn, unsigned long nr_pages);
>> +extern struct page *find_alloc_contig_pages(unsigned long nr_pages, gfp_t gfp,
>> +                                            int nid, nodemask_t *nodemask);
>> +extern void free_contig_pages(struct page *page, unsigned long nr_pages);
>> +#else
>> +static inline struct page *find_alloc_contig_pages(unsigned long nr_pages,
>> +                        gfp_t gfp, int nid, nodemask_t *nodemask)
>> +{
>> +        return NULL;
>> +}
>> +static inline void free_contig_pages(struct page *page, unsigned long nr_pages)
>> +{
>> +}
>>  #endif
>>
>>  #ifdef CONFIG_CMA
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index cb1a5e0be6ee..d0a2d0da9eae 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -67,6 +67,7 @@
>>  #include
>>  #include
>>  #include
>> +#include
>>
>>  #include
>>  #include
>> @@ -7913,8 +7914,12 @@ int alloc_contig_range(unsigned long start, unsigned long end,
>>
>>          /* Make sure the range is really isolated. */
>>          if (test_pages_isolated(outer_start, end, false)) {
>> -                pr_info_ratelimited("%s: [%lx, %lx) PFNs busy\n",
>> -                        __func__, outer_start, end);
>> +#ifdef MIGRATE_CMA
>> +                /* Only print messages for CMA allocations */
>> +                if (migratetype == MIGRATE_CMA)
>
> I think is_migrate_cma() can be used to avoid the #ifdef.
>

Thanks. I missed that and did not want to create something new.
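Something along these lines (untested sketch, only to show the #ifdef
going away, since is_migrate_cma() already evaluates to false when
CONFIG_CMA is not set):

        /* Make sure the range is really isolated. */
        if (test_pages_isolated(outer_start, end, false)) {
                /* Only print messages for CMA allocations */
                if (is_migrate_cma(migratetype))
                        pr_info_ratelimited("%s: [%lx, %lx) PFNs busy\n",
                                            __func__, outer_start, end);
                ret = -EBUSY;
                goto done;
        }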
>> +                        pr_info_ratelimited("%s: [%lx, %lx) PFNs busy\n",
>> +                                __func__, outer_start, end);
>> +#endif
>>                  ret = -EBUSY;
>>                  goto done;
>>          }
>> @@ -7950,6 +7955,133 @@ void free_contig_range(unsigned long pfn, unsigned long nr_pages)
>>          }
>>          WARN(count != 0, "%ld pages are still in use!\n", count);
>>  }
>> +
>> +/*
>> + * Only check for obvious pfn/pages which can not be used/migrated. The
>> + * migration code will do the final check. Under stress, this minimal set
>> + * has been observed to provide the best results. The checks can be expanded
>> + * if needed.
>
> Hm I kind of doubt this is optimal, it doesn't test almost anything
> besides basic validity, so it won't exclude ranges where the allocation
> will fail. I will write more in a reply to the header where complexity
> is discussed.
>

Ok. This 'appeared' to work best in testing where I had all CPUs in
tight loops calling this new interface to allocate and then free
contiguous pages. I was somewhat surprised at the result, and it may
just be due to the nature of my testing.

>> + */
>> +static bool contig_pfn_range_valid(struct zone *z, unsigned long start_pfn,
>> +                                   unsigned long nr_pages)
>> +{
>> +        unsigned long i, end_pfn = start_pfn + nr_pages;
>> +        struct page *page;
>> +
>> +        for (i = start_pfn; i < end_pfn; i++) {
>> +                if (!pfn_valid(i))
>> +                        return false;
>> +
>> +                page = pfn_to_online_page(i);
>> +
>> +                if (page_zone(page) != z)
>> +                        return false;
>> +
>> +        }
>> +
>> +        return true;
>> +}
>> +
>> +/*
>> + * Search for and attempt to allocate contiguous allocations greater than
>> + * MAX_ORDER.
>> + */
>> +static struct page *__alloc_contig_pages_nodemask(gfp_t gfp,
>> +                                                  unsigned long order,
>> +                                                  int nid, nodemask_t *nodemask)
>> +{
>> +        unsigned long nr_pages, pfn, flags;
>> +        struct page *ret_page = NULL;
>> +        struct zonelist *zonelist;
>> +        struct zoneref *z;
>> +        struct zone *zone;
>> +        int rc;
>> +
>> +        nr_pages = 1 << order;
>> +        zonelist = node_zonelist(nid, gfp);
>> +        for_each_zone_zonelist_nodemask(zone, z, zonelist, gfp_zone(gfp),
>> +                                        nodemask) {
>> +                pgdat_resize_lock(zone->zone_pgdat, &flags);
>> +                pfn = ALIGN(zone->zone_start_pfn, nr_pages);
>> +                while (zone_spans_pfn(zone, pfn + nr_pages - 1)) {
>> +                        if (contig_pfn_range_valid(zone, pfn, nr_pages)) {
>> +                                struct page *page = pfn_to_online_page(pfn);
>> +                                unsigned int migratetype;
>> +
>> +                                /*
>> +                                 * All pageblocks in range must be of same
>> +                                 * migrate type.
>> +                                 */
>> +                                migratetype = get_pageblock_migratetype(page);
>> +                                pgdat_resize_unlock(zone->zone_pgdat, &flags);
>> +
>> +                                rc = alloc_contig_range(pfn, pfn + nr_pages,
>> +                                                        migratetype, gfp);
>> +                                if (!rc) {
>> +                                        ret_page = pfn_to_page(pfn);
>> +                                        return ret_page;
>> +                                }
>> +                                pgdat_resize_lock(zone->zone_pgdat, &flags);
>> +                        }
>> +                        pfn += nr_pages;
>> +                }
>> +                pgdat_resize_unlock(zone->zone_pgdat, &flags);
>> +        }
>> +
>> +        return ret_page;
>> +}
>> +
>> +/**
>> + * find_alloc_contig_pages() -- attempt to find and allocate a contiguous
>> + *                              range of pages
>> + * @nr_pages:  number of pages to find/allocate
>> + * @gfp:       gfp mask used to limit search as well as during compaction
>> + * @nid:       target node
>> + * @nodemask:  mask of other possible nodes
>> + *
>> + * Pages can be freed with a call to free_contig_pages(), or by manually
>> + * calling __free_page() for each page allocated.
>> + *
>> + * Return: pointer to 'order' pages on success, or NULL if not successful.
>> + */
>> +struct page *find_alloc_contig_pages(unsigned long nr_pages, gfp_t gfp,
>> +                                     int nid, nodemask_t *nodemask)
>> +{
>> +        unsigned long i, alloc_order, order_pages;
>> +        struct page *pages;
>> +
>> +        /*
>> +         * Underlying allocators perform page order sized allocations.
>> +         */
>> +        alloc_order = get_count_order(nr_pages);
>
> So if takes arbitrary nr_pages but convert it to order anyway? I think
> that's rather suboptimal and wasteful... e.g. a range could be skipped
> because some of the pages added by rounding cannot be migrated away.

Yes. My idea with this series was to use existing allocators which are
all order based. Let me think about how to do allocation for an
arbitrary number of pages.

- For sizes less than MAX_ORDER we rely on the buddy allocator, so we
  are pretty much stuck with order sized allocation. However,
  allocations of this size are not really interesting as you can call
  existing routines directly.
- For sizes greater than MAX_ORDER, we know that the allocation size
  will be at least pageblock sized. So, the isolate/migrate scheme can
  still be used for full pageblocks. We can then use direct migration
  for the remaining pages. This does complicate things a bit.

I'm guessing that most (?all?) allocations will be order based. The use
cases I am aware of (hugetlbfs, Intel Cache Pseudo-Locking, RDMA) are
all order based. However, as commented in the previous version, taking
arbitrary nr_pages makes the interface more future proof.
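Just to illustrate the interface from the caller's side (a hypothetical
example only, not part of this series), a non power of two count would
be passed straight through and freed with the same count:

static int example_user(void)
{
        /* hypothetical caller, for illustration only */
        unsigned long nr_pages = 1000;  /* not a power of two */
        struct page *page;

        page = find_alloc_contig_pages(nr_pages, GFP_KERNEL,
                                       numa_node_id(), NULL);
        if (!page)
                return -ENOMEM;         /* be prepared to fall back */

        /* ... use the physically contiguous range starting at 'page' ... */

        free_contig_pages(page, nr_pages);
        return 0;
}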
-- 
Mike Kravetz

>
> Vlastimil
>
>> +        if (alloc_order < MAX_ORDER) {
>> +                pages = __alloc_pages_nodemask(gfp, (unsigned int)alloc_order,
>> +                                               nid, nodemask);
>> +                split_page(pages, alloc_order);
>> +        } else {
>> +                pages = __alloc_contig_pages_nodemask(gfp, alloc_order, nid,
>> +                                                      nodemask);
>> +        }
>> +
>> +        if (pages) {
>> +                /*
>> +                 * More pages than desired could have been allocated due to
>> +                 * rounding up to next page order. Free any excess pages.
>> +                 */
>> +                order_pages = 1UL << alloc_order;
>> +                for (i = nr_pages; i < order_pages; i++)
>> +                        __free_page(pages + i);
>> +        }
>> +
>> +        return pages;
>> +}
>> +EXPORT_SYMBOL_GPL(find_alloc_contig_pages);
>> +
>> +void free_contig_pages(struct page *page, unsigned long nr_pages)
>> +{
>> +        free_contig_range(page_to_pfn(page), nr_pages);
>> +}
>> +EXPORT_SYMBOL_GPL(free_contig_pages);
>>  #endif
>>
>>  #if defined CONFIG_MEMORY_HOTPLUG || defined CONFIG_CMA
>>
>