Date: Wed, 9 Apr 2014 13:56:14 -0400
From: Luiz Capitulino
To: Yasuaki Ishimatsu
Subject: Re: [PATCH 5/5] hugetlb: add support for gigantic page allocation at runtime
Message-ID: <20140409135614.0fb55016@redhat.com>
In-Reply-To: <53449759.6040207@jp.fujitsu.com>
References: <1396983740-26047-1-git-send-email-lcapitulino@redhat.com>
	<1396983740-26047-6-git-send-email-lcapitulino@redhat.com>
	<53449759.6040207@jp.fujitsu.com>
Organization: Red Hat

On Wed, 9 Apr 2014 09:42:01 +0900
Yasuaki Ishimatsu wrote:

> (2014/04/09 4:02), Luiz Capitulino wrote:
> > HugeTLB is limited to allocating hugepages whose size is less than
> > MAX_ORDER order. This is so because HugeTLB allocates hugepages via
> > the buddy allocator. Gigantic pages (that is, pages whose size is
> > greater than MAX_ORDER order) have to be allocated at boottime.
> >
> > However, boottime allocation has at least two serious problems. First,
> > it doesn't support NUMA and second, gigantic pages allocated at
> > boottime can't be freed.
> >
> > This commit solves both issues by adding support for allocating gigantic
> > pages during runtime. It works just like regular sized hugepages,
> > meaning that the interface in sysfs is the same, it supports NUMA,
> > and gigantic pages can be freed.
> >
> > For example, on x86_64 gigantic pages are 1GB big. To allocate two 1G
> > gigantic pages on node 1, one can do:
> >
> >   # echo 2 > \
> >     /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages
> >
> > And to free them all:
> >
> >   # echo 0 > \
> >     /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages
> >
> > The one problem with gigantic page allocation at runtime is that it
> > can't be serviced by the buddy allocator. To overcome that problem, this
> > commit scans all zones from a node looking for a large enough contiguous
> > region. When one is found, it's allocated by using CMA, that is, we call
> > alloc_contig_range() to do the actual allocation. For example, on x86_64
> > we scan all zones looking for a 1GB contiguous region. When one is found,
> > it's allocated by alloc_contig_range().
> >
> > One expected issue with that approach is that such gigantic contiguous
> > regions tend to vanish as runtime goes by. The best way to avoid this for
> > now is to make gigantic page allocations very early during system boot, say
> > from an init script. Other possible optimizations include using compaction,
> > which is supported by CMA but is not explicitly used by this commit.
> >
> > It's also important to note the following:
> >
> > 1. Gigantic pages allocated at boottime by the hugepages= command-line
> >    option can be freed at runtime just fine
> >
> > 2. This commit adds support for gigantic pages only to x86_64. The
> >    reason is that I don't have access to nor experience with other archs.
> >    The code is arch independent though, so it should be simple to add
> >    support for other archs
> >
> > 3. I didn't add support for hugepage overcommit, that is, allocating
> >    a gigantic page on demand when
> >    /proc/sys/vm/nr_overcommit_hugepages > 0. The reason is that I don't
> >    think it's reasonable to do the hard and long work required for
> >    allocating a gigantic page at fault time. But it should be simple
> >    to add this if wanted
> >
> > Signed-off-by: Luiz Capitulino
> > ---
> >  mm/hugetlb.c | 158 ++++++++++++++++++++++++++++++++++++++++++++++++++++++-----
> >  1 file changed, 147 insertions(+), 11 deletions(-)
> >
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 9dded98..2258045 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -679,11 +679,141 @@ static int hstate_next_node_to_free(struct hstate *h, nodemask_t *nodes_allowed)
> >  		((node = hstate_next_node_to_free(hs, mask)) || 1);	\
> >  		nr_nodes--)
> >
> > +#if defined(CONFIG_CMA) && defined(CONFIG_X86_64)
> > +static void destroy_compound_gigantic_page(struct page *page,
> > +					unsigned long order)
> > +{
> > +	int i;
> > +	int nr_pages = 1 << order;
> > +	struct page *p = page + 1;
> > +
> > +	for (i = 1; i < nr_pages; i++, p = mem_map_next(p, page, i)) {
> > +		__ClearPageTail(p);
> > +		set_page_refcounted(p);
> > +		p->first_page = NULL;
> > +	}
> > +
> > +	set_compound_order(page, 0);
> > +	__ClearPageHead(page);
> > +}
> > +
> > +static void free_gigantic_page(struct page *page, unsigned order)
> > +{
> > +	free_contig_range(page_to_pfn(page), 1 << order);
> > +}
> > +
> > +static int __alloc_gigantic_page(unsigned long start_pfn, unsigned long count)
> > +{
> > +	unsigned long end_pfn = start_pfn + count;
> > +	return alloc_contig_range(start_pfn, end_pfn, MIGRATE_MOVABLE);
> > +}
> > +
> > +static bool pfn_range_valid_gigantic(unsigned long start_pfn,
> > +				unsigned long nr_pages)
> > +{
> > +	unsigned long i, end_pfn = start_pfn + nr_pages;
> > +	struct page *page;
> > +
> > +	for (i = start_pfn; i < end_pfn; i++) {
> > +		if (!pfn_valid(i))
> > +			return false;
> > +
> > +		page = pfn_to_page(i);
> > +
> > +		if (PageReserved(page))
> > +			return false;
> > +
> > +		if (page_count(page) > 0)
> > +			return false;
> > +
> > +		if (PageHuge(page))
> > +			return false;
> > +	}
> > +
> > +	return true;
> > +}
> > +
> > +static struct page *alloc_gigantic_page(int nid, unsigned order)
> > +{
> > +	unsigned long nr_pages = 1 << order;
> > +	unsigned long ret, pfn, flags;
> > +	struct zone *z;
> > +
> > +	z = NODE_DATA(nid)->node_zones;
> > +	for (; z - NODE_DATA(nid)->node_zones < MAX_NR_ZONES; z++) {
> > +		spin_lock_irqsave(&z->lock, flags);
> > +
> > +		pfn = ALIGN(z->zone_start_pfn, nr_pages);
> > +		for (; pfn < zone_end_pfn(z); pfn += nr_pages) {
> > +			if (pfn_range_valid_gigantic(pfn, nr_pages)) {
>
> How about this? It can reduce the indentation level:
>
> 	if (!pfn_range_valid_gigantic(...))
> 		continue;
>
> And I think the following check is necessary:
>
> 	if (pfn + nr_pages >= zone_end_pfn(z))
> 		break;

You're right that we have to check that the whole range is within the
zone, but the check above is off-by-one: zone_end_pfn() is exclusive, so
a range whose last pfn is zone_end_pfn(z) - 1 (that is, one where
pfn + nr_pages == zone_end_pfn(z)) still fits entirely within the zone,
yet that check would reject it.
What about the following:

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 01c0068..b01cdeb 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -734,6 +734,13 @@ static bool pfn_range_valid_gigantic(unsigned long start_pfn,
 	return true;
 }
 
+static bool zone_spans_last_pfn(const struct zone *zone,
+			unsigned long start_pfn, unsigned long nr_pages)
+{
+	unsigned long last_pfn = start_pfn + nr_pages - 1;
+	return zone_spans_pfn(zone, last_pfn);
+}
+
 static struct page *alloc_gigantic_page(int nid, unsigned order)
 {
 	unsigned long nr_pages = 1 << order;
@@ -745,7 +752,7 @@ static struct page *alloc_gigantic_page(int nid, unsigned order)
 		spin_lock_irqsave(&z->lock, flags);
 
 		pfn = ALIGN(z->zone_start_pfn, nr_pages);
-		for (; pfn < zone_end_pfn(z); pfn += nr_pages) {
+		while (zone_spans_last_pfn(z, pfn, nr_pages)) {
 			if (pfn_range_valid_gigantic(pfn, nr_pages)) {
 				/*
 				 * We release the zone lock here because
@@ -760,6 +767,7 @@ static struct page *alloc_gigantic_page(int nid, unsigned order)
 				return pfn_to_page(pfn);
 			spin_lock_irqsave(&z->lock, flags);
 		}
+		pfn += nr_pages;
 	}
 
 	spin_unlock_irqrestore(&z->lock, flags);
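
[Editor's note: to make the boundary condition discussed above concrete, here is a
minimal userspace sketch, not kernel code. The fake_zone struct and the pfn numbers
are illustrative assumptions; zone_spans_pfn()/zone_spans_last_pfn() mirror the
semantics used in the thread, where zone_end_pfn() is exclusive. It shows that for
a candidate range ending exactly at the zone boundary, the "pfn + nr_pages >=
zone_end_pfn(z)" test rejects a range that the last-pfn test correctly accepts.]

/*
 * Userspace sketch of the off-by-one: compile with "gcc -o offbyone offbyone.c".
 * The struct and values below are made up for illustration only.
 */
#include <stdbool.h>
#include <stdio.h>

struct fake_zone {
	unsigned long zone_start_pfn;
	unsigned long zone_end_pfn;	/* exclusive, like the kernel's zone_end_pfn() */
};

/* Mirrors zone_spans_pfn(): true if pfn lies inside [start, end) */
static bool zone_spans_pfn(const struct fake_zone *z, unsigned long pfn)
{
	return z->zone_start_pfn <= pfn && pfn < z->zone_end_pfn;
}

/* The check proposed in the reply: does the range's *last* pfn fit? */
static bool zone_spans_last_pfn(const struct fake_zone *z,
				unsigned long start_pfn, unsigned long nr_pages)
{
	unsigned long last_pfn = start_pfn + nr_pages - 1;
	return zone_spans_pfn(z, last_pfn);
}

int main(void)
{
	/* A zone spanning pfns [0, 1048576), i.e. 4GB worth of 4KB pages */
	struct fake_zone z = { .zone_start_pfn = 0, .zone_end_pfn = 1048576 };
	unsigned long nr_pages = 262144;	/* one 1GB gigantic page */

	/* The last aligned candidate range ends exactly at the zone boundary. */
	unsigned long pfn = z.zone_end_pfn - nr_pages;

	/* The ">=" test rejects this range even though it fits entirely. */
	bool rejected_by_ge = (pfn + nr_pages >= z.zone_end_pfn);

	/* The last-pfn test accepts it, since its last pfn is zone_end_pfn - 1. */
	bool accepted_by_span = zone_spans_last_pfn(&z, pfn, nr_pages);

	printf("range [%lu, %lu): '>=' check rejects it: %s, "
	       "zone_spans_last_pfn() accepts it: %s\n",
	       pfn, pfn + nr_pages,
	       rejected_by_ge ? "yes" : "no",
	       accepted_by_span ? "yes" : "no");
	return 0;
}

Run, the sketch reports that the ">=" check discards the final 1GB candidate
range while the last-pfn check keeps it, which is the off-by-one the reply and
the zone_spans_last_pfn() helper in the diff address.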