Message-ID: <53462094.6000407@jp.fujitsu.com>
Date: Thu, 10 Apr 2014 13:39:48 +0900
From: Yasuaki Ishimatsu
To: Luiz Capitulino
Subject: Re: [PATCH 5/5] hugetlb: add support for gigantic page allocation at runtime
References: <1396983740-26047-1-git-send-email-lcapitulino@redhat.com>
 <1396983740-26047-6-git-send-email-lcapitulino@redhat.com>
 <53449759.6040207@jp.fujitsu.com>
 <20140409135614.0fb55016@redhat.com>
In-Reply-To: <20140409135614.0fb55016@redhat.com>
X-Mailing-List: linux-kernel@vger.kernel.org

(2014/04/10 2:56), Luiz Capitulino wrote:
> On Wed, 9 Apr 2014 09:42:01 +0900
> Yasuaki Ishimatsu wrote:
>
>> (2014/04/09 4:02), Luiz Capitulino wrote:
>>> HugeTLB is limited to allocating hugepages whose size is less than
>>> MAX_ORDER order. This is so because HugeTLB allocates hugepages via
>>> the buddy allocator. Gigantic pages (that is, pages whose size is
>>> greater than MAX_ORDER order) have to be allocated at boottime.
>>>
>>> However, boottime allocation has at least two serious problems. First,
>>> it doesn't support NUMA and second, gigantic pages allocated at
>>> boottime can't be freed.
>>>
>>> This commit solves both issues by adding support for allocating gigantic
>>> pages during runtime. It works just like regular sized hugepages,
>>> meaning that the interface in sysfs is the same, it supports NUMA,
>>> and gigantic pages can be freed.
>>>
>>> For example, on x86_64 gigantic pages are 1GB big. To allocate two 1G
>>> gigantic pages on node 1, one can do:
>>>
>>>   # echo 2 > \
>>>     /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages
>>>
>>> And to free them all:
>>>
>>>   # echo 0 > \
>>>     /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages
>>>
>>> The one problem with gigantic page allocation at runtime is that it
>>> can't be serviced by the buddy allocator. To overcome that problem, this
>>> commit scans all zones from a node looking for a large enough contiguous
>>> region. When one is found, it's allocated by using CMA, that is, we call
>>> alloc_contig_range() to do the actual allocation. For example, on x86_64
>>> we scan all zones looking for a 1GB contiguous region. When one is found,
>>> it's allocated by alloc_contig_range().
>>>
>>> One expected issue with that approach is that such gigantic contiguous
>>> regions tend to vanish as runtime goes by. The best way to avoid this for
>>> now is to make gigantic page allocations very early during system boot, say
>>> from an init script. Other possible optimizations include using compaction,
>>> which is supported by CMA but is not explicitly used by this commit.
>>>
>>> It's also important to note the following:
>>>
>>> 1. Gigantic pages allocated at boottime by the hugepages= command-line
>>>    option can be freed at runtime just fine
>>>
>>> 2. This commit adds support for gigantic pages only to x86_64. The
>>>    reason is that I don't have access to nor experience with other archs.
>>>    The code is arch independent though, so it should be simple to add
>>>    support to different archs
>>>
>>> 3. I didn't add support for hugepage overcommit, that is, allocating
>>>    a gigantic page on demand when
>>>    /proc/sys/vm/nr_overcommit_hugepages > 0. The reason is that I don't
>>>    think it's reasonable to do the hard and long work required for
>>>    allocating a gigantic page at fault time. But it should be simple
>>>    to add this if wanted
>>>
>>> Signed-off-by: Luiz Capitulino
>>> ---
>>>  mm/hugetlb.c | 158 ++++++++++++++++++++++++++++++++++++++++++++++++++++++-----
>>>  1 file changed, 147 insertions(+), 11 deletions(-)
>>>
>>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>>> index 9dded98..2258045 100644
>>> --- a/mm/hugetlb.c
>>> +++ b/mm/hugetlb.c
>>> @@ -679,11 +679,141 @@ static int hstate_next_node_to_free(struct hstate *h, nodemask_t *nodes_allowed)
>>>  		((node = hstate_next_node_to_free(hs, mask)) || 1);	\
>>>  		nr_nodes--)
>>>
>>> +#if defined(CONFIG_CMA) && defined(CONFIG_X86_64)
>>> +static void destroy_compound_gigantic_page(struct page *page,
>>> +					unsigned long order)
>>> +{
>>> +	int i;
>>> +	int nr_pages = 1 << order;
>>> +	struct page *p = page + 1;
>>> +
>>> +	for (i = 1; i < nr_pages; i++, p = mem_map_next(p, page, i)) {
>>> +		__ClearPageTail(p);
>>> +		set_page_refcounted(p);
>>> +		p->first_page = NULL;
>>> +	}
>>> +
>>> +	set_compound_order(page, 0);
>>> +	__ClearPageHead(page);
>>> +}
>>> +
>>> +static void free_gigantic_page(struct page *page, unsigned order)
>>> +{
>>> +	free_contig_range(page_to_pfn(page), 1 << order);
>>> +}
>>> +
>>> +static int __alloc_gigantic_page(unsigned long start_pfn, unsigned long count)
>>> +{
>>> +	unsigned long end_pfn = start_pfn + count;
>>> +	return alloc_contig_range(start_pfn, end_pfn, MIGRATE_MOVABLE);
>>> +}
>>> +
>>> +static bool pfn_range_valid_gigantic(unsigned long start_pfn,
>>> +				unsigned long nr_pages)
>>> +{
>>> +	unsigned long i, end_pfn = start_pfn + nr_pages;
>>> +	struct page *page;
>>> +
>>> +	for (i = start_pfn; i < end_pfn; i++) {
>>> +		if (!pfn_valid(i))
>>> +			return false;
>>> +
>>> +		page = pfn_to_page(i);
>>> +
>>> +		if (PageReserved(page))
>>> +			return false;
>>> +
>>> +		if (page_count(page) > 0)
>>> +			return false;
>>> +
>>> +		if (PageHuge(page))
>>> +			return false;
>>> +	}
>>> +
>>> +	return true;
>>> +}
>>> +
>>> +static struct page *alloc_gigantic_page(int nid, unsigned order)
>>> +{
>>> +	unsigned long nr_pages = 1 << order;
>>> +	unsigned long ret, pfn, flags;
>>> +	struct zone *z;
>>> +
>>> +	z = NODE_DATA(nid)->node_zones;
>>> +	for (; z - NODE_DATA(nid)->node_zones < MAX_NR_ZONES; z++) {
>>> +		spin_lock_irqsave(&z->lock, flags);
>>> +
>>> +		pfn = ALIGN(z->zone_start_pfn, nr_pages);
>>> +		for (; pfn < zone_end_pfn(z); pfn += nr_pages) {
>>
>>> +			if (pfn_range_valid_gigantic(pfn, nr_pages)) {
>>
>> How about this? It can reduce the indentation level:
>>
>>     if (!pfn_range_valid_gigantic(...))
>>         continue;
>>
>> And I think the following check is necessary:
>>
>>     if (pfn + nr_pages >= zone_end_pfn(z))
>>         break;
>
> You're right that we have to check if the whole range is within the zone,
> but the check above is off-by-one. What about the following:
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 01c0068..b01cdeb 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -734,6 +734,13 @@ static bool pfn_range_valid_gigantic(unsigned long start_pfn,
>  	return true;
>  }
>
> +static bool zone_spans_last_pfn(const struct zone *zone,
> +				unsigned long start_pfn, unsigned long nr_pages)
> +{
> +	unsigned long last_pfn = start_pfn + nr_pages - 1;
> +	return zone_spans_pfn(zone, last_pfn);
> +}
> +
>  static struct page *alloc_gigantic_page(int nid, unsigned order)
>  {
>  	unsigned long nr_pages = 1 << order;
> @@ -745,7 +752,7 @@ static struct page *alloc_gigantic_page(int nid, unsigned order)
>  		spin_lock_irqsave(&z->lock, flags);
>
>  		pfn = ALIGN(z->zone_start_pfn, nr_pages);
> -		for (; pfn < zone_end_pfn(z); pfn += nr_pages) {
> +		while (zone_spans_last_pfn(z, pfn, nr_pages)) {
>  			if (pfn_range_valid_gigantic(pfn, nr_pages)) {
>  				/*
>  				 * We release the zone lock here because
> @@ -760,6 +767,7 @@ static struct page *alloc_gigantic_page(int nid, unsigned order)
>  				return pfn_to_page(pfn);
>  			spin_lock_irqsave(&z->lock, flags);
>  		}
> +		pfn += nr_pages;
>  	}
>
>  	spin_unlock_irqrestore(&z->lock, flags);
>
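To spell the boundary condition out with made-up numbers (a hypothetical zone
end and the x86_64 1GB page, so nr_pages == 0x40000 pfns of 4KB each):

	/*
	 * Hypothetical zone with zone_end_pfn(z) == 0x140000, so its last
	 * valid pfn is 0x13ffff.  The aligned candidate range starting at
	 * pfn == 0x100000 covers 0x100000..0x13ffff and fits entirely in
	 * the zone, yet:
	 *
	 *   pfn + nr_pages >= zone_end_pfn(z)  ->  0x140000 >= 0x140000
	 *       would break out and skip it (the off-by-one), while
	 *
	 *   zone_spans_last_pfn(z, pfn, nr_pages)
	 *       -> zone_spans_pfn(z, 0x13ffff) is true, so it is still tried.
	 */

So the while () form still tries the last aligned range that fits entirely in
the zone, which the >= check would have skipped.
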
Looks good to me.

Thanks,
Yasuaki Ishimatsu
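
As an aside, here is a minimal user-space sketch (not part of the patch) of
consuming one of these runtime-allocated 1GB pages. It assumes a 1GB page has
already been reserved through the nr_hugepages file shown in the changelog,
and it defines MAP_HUGE_1GB from MAP_HUGE_SHIFT in case the libc headers do
not provide it:

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#ifndef MAP_HUGE_SHIFT
#define MAP_HUGE_SHIFT	26
#endif
#ifndef MAP_HUGE_1GB
#define MAP_HUGE_1GB	(30 << MAP_HUGE_SHIFT)	/* log2(1GB) encoded in the flags */
#endif

int main(void)
{
	size_t len = 1UL << 30;		/* one 1GB gigantic page */
	void *p;

	/*
	 * MAP_HUGETLB requests a hugetlb mapping; MAP_HUGE_1GB selects the
	 * 1GB pool instead of the default hugepage size, i.e. the pages
	 * reserved via the sysfs knob above.
	 */
	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_HUGE_1GB,
		 -1, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	memset(p, 0, len);	/* touch it so the gigantic page is faulted in */
	munmap(p, len);
	return 0;
}

Without the MAP_HUGE_1GB size encoding the mapping would be backed by the
default hugepage size pool rather than the 1GB gigantic pages.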