2014-02-10 17:38:27

by Luiz Capitulino

Subject: [PATCH 0/4] hugetlb: add hugepagesnid= command-line option

HugeTLB command-line option hugepages= allows the user to specify how many
huge pages should be allocated at boot. On NUMA systems, this argument
automatically distributes huge pages allocation among nodes, which can
be undesirable.

The hugepagesnid= option introduced by this commit allows the user
to specify which NUMA nodes should be used to allocate boot-time HugeTLB
pages. For example, hugepagesnid=0,2,2G will allocate two 2G huge pages
from node 0 only. More details on patch 3/4 and patch 4/4.

Luiz Capitulino (4):
memblock: memblock_virt_alloc_internal(): alloc from specified node
only
memblock: add memblock_virt_alloc_nid_nopanic()
hugetlb: add hugepagesnid= command-line option
hugetlb: hugepagesnid=: add 1G huge page support

Documentation/kernel-parameters.txt | 8 +++
arch/x86/mm/hugetlbpage.c | 35 ++++++++++++
include/linux/bootmem.h | 4 ++
include/linux/hugetlb.h | 2 +
mm/hugetlb.c | 103 ++++++++++++++++++++++++++++++++++++
mm/memblock.c | 41 ++++++++++++--
6 files changed, 190 insertions(+), 3 deletions(-)

--
1.8.1.4


2014-02-10 17:28:18

by Luiz Capitulino

Subject: [PATCH 3/4] hugetlb: add hugepagesnid= command-line option

From: Luiz capitulino <[email protected]>

The HugeTLB command-line option hugepages= allows the user to specify
how many huge pages should be allocated at boot-time. On NUMA systems,
this option will try to automatically distribute the allocation equally
among nodes. For example, if you have a 2-node NUMA system and allocate
200 huge pages, then hugepages= will try to allocate 100 huge pages from
node0 and 100 from node1.

The hugepagesnid= option introduced by this commit allows the user
to specify the nodes from which huge pages are allocated. For example,
if you have a 2-node NUMA system and want 300 2M huge pages to be
allocated from node1, you could do:

hugepagesnid=1,300,2M

Or, say you want node0 to allocate 100 huge pages and node1 to
allocate 200 huge pages, you do:

hugepagesnid=0,100,2M hugepagesnid=1,200,2M

This commit adds support only for 2M huge pages, next commit will
add support for 1G huge pages.

Signed-off-by: Luiz Capitulino <[email protected]>
---
Documentation/kernel-parameters.txt | 8 ++++
arch/x86/mm/hugetlbpage.c | 35 +++++++++++++++++
include/linux/hugetlb.h | 2 +
mm/hugetlb.c | 77 +++++++++++++++++++++++++++++++++++++
4 files changed, 122 insertions(+)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 7116fda..3cbe950 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -1125,6 +1125,14 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
registers. Default set by CONFIG_HPET_MMAP_DEFAULT.

hugepages= [HW,X86-32,IA-64] HugeTLB pages to allocate at boot.
+ hugepagesnid= [X86-64] HugeTLB pages to allocate at boot on a NUMA system.
+ Format: <nid>,<nrpages>,<size>
+ nid: NUMA node id to allocate pages from
+ nrpages: number of huge pages to allocate
+ size: huge page size (see hugepagesz below)
+ This argument can be specified multiple times for different
+ NUMA nodes, but it shouldn't be mixed with hugepages= and
+ hugepagesz=.
hugepagesz= [HW,IA-64,PPC,X86-64] The size of the HugeTLB pages.
On x86-64 and powerpc, this option can be specified
multiple times interleaved with hugepages= to reserve
diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
index 8c9f647..91c5c98 100644
--- a/arch/x86/mm/hugetlbpage.c
+++ b/arch/x86/mm/hugetlbpage.c
@@ -188,4 +188,39 @@ static __init int setup_hugepagesz(char *opt)
return 1;
}
__setup("hugepagesz=", setup_hugepagesz);
+
+static __init int setup_hugepagesnid(char *opt)
+{
+ unsigned long order, ps, nid, nr_pages;
+ char size_str[3];
+
+ size_str[2] = '\0';
+ if (sscanf(opt, "%lu,%lu,%c%c",
+ &nid, &nr_pages, &size_str[0], &size_str[1]) < 4) {
+ printk(KERN_ERR "hugepagesnid: failed to parse arguments\n");
+ return 0;
+ }
+
+ if (!nr_pages) {
+ printk(KERN_ERR
+ "hugepagesnid: zero number of pages, ignoring\n");
+ return 0;
+ }
+
+ ps = memparse(size_str, NULL);
+ if (ps == PMD_SIZE) {
+ order = PMD_SHIFT - PAGE_SHIFT;
+ } else if (ps == PUD_SIZE && cpu_has_gbpages) {
+ order = PUD_SHIFT - PAGE_SHIFT;
+ } else {
+ printk(KERN_ERR "hugepagesnid: Unsupported page size %lu M\n",
+ ps >> 20);
+ return 0;
+ }
+
+ hugetlb_add_nrpages_nid(order, nid, nr_pages);
+ return 1;
+}
+__setup("hugepagesnid=", setup_hugepagesnid);
+
#endif
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 8c43cc4..aae2f9b 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -274,6 +274,8 @@ struct page *alloc_huge_page_noerr(struct vm_area_struct *vma,
int __init alloc_bootmem_huge_page(struct hstate *h);

void __init hugetlb_add_hstate(unsigned order);
+void __init hugetlb_add_nrpages_nid(unsigned order, unsigned long nid,
+ unsigned long nr_pages);
struct hstate *size_to_hstate(unsigned long size);

#ifndef HUGE_MAX_HSTATE
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index c01cb9f..439c3b7 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -46,6 +46,7 @@ __initdata LIST_HEAD(huge_boot_pages);
static struct hstate * __initdata parsed_hstate;
static unsigned long __initdata default_hstate_max_huge_pages;
static unsigned long __initdata default_hstate_size;
+static unsigned long __initdata boot_alloc_nodes[HUGE_MAX_HSTATE][MAX_NUMNODES];

/*
* Protects updates to hugepage_freelists, hugepage_activelist, nr_huge_pages,
@@ -1348,6 +1349,50 @@ static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
h->max_huge_pages = i;
}

+static unsigned long __init alloc_huge_pages_nid(struct hstate *h,
+ int nid,
+ unsigned long nr_pages)
+{
+ unsigned long i;
+ struct page *page;
+
+ for (i = 0; i < nr_pages; i++) {
+ page = alloc_fresh_huge_page_node(h, nid);
+ if (!page) {
+ count_vm_event(HTLB_BUDDY_PGALLOC_FAIL);
+ break;
+ }
+ count_vm_event(HTLB_BUDDY_PGALLOC);
+ }
+
+ return i;
+}
+
+static unsigned long __init alloc_huge_pages_nodes(struct hstate *h)
+{
+ unsigned long i, *entry, ret = 0;
+
+ for (i = 0; i < MAX_NUMNODES; i++) {
+ entry = &boot_alloc_nodes[hstate_index(h)][i];
+ if (*entry > 0)
+ ret += alloc_huge_pages_nid(h, i, *entry);
+ }
+
+ return ret;
+}
+
+static void __init hugetlb_init_hstates_nodes(void)
+{
+ struct hstate *h;
+ unsigned long ret;
+
+ for_each_hstate(h)
+ if (h->order < MAX_ORDER) {
+ ret = alloc_huge_pages_nodes(h);
+ h->max_huge_pages += ret;
+ }
+}
+
static void __init hugetlb_init_hstates(void)
{
struct hstate *h;
@@ -1966,6 +2011,7 @@ static int __init hugetlb_init(void)
default_hstate.max_huge_pages = default_hstate_max_huge_pages;

hugetlb_init_hstates();
+ hugetlb_init_hstates_nodes();
gather_bootmem_prealloc();
report_hugepages();

@@ -2005,6 +2051,37 @@ void __init hugetlb_add_hstate(unsigned order)
parsed_hstate = h;
}

+void __init hugetlb_add_nrpages_nid(unsigned order, unsigned long nid,
+ unsigned long nr_pages)
+{
+ struct hstate *h;
+ unsigned long *p;
+
+ if (parsed_hstate) {
+ printk(KERN_WARNING
+ "hugepagesnid: hugepagesz= specified, ignoring\n");
+ return;
+ }
+
+ for_each_hstate(h)
+ if (h->order == order)
+ break;
+
+ if (h->order != order) {
+ hugetlb_add_hstate(order);
+ parsed_hstate = NULL;
+ }
+
+ p = &boot_alloc_nodes[hstate_index(h)][nid];
+ if (*p != 0) {
+ printk(KERN_WARNING
+ "hugepagesnid: node %lu already specified, ignoring\n", nid);
+ return;
+ }
+
+ *p = nr_pages;
+}
+
static int __init hugetlb_nrpages_setup(char *s)
{
unsigned long *mhp;
--
1.8.1.4

2014-02-10 17:28:16

by Luiz Capitulino

Subject: [PATCH 4/4] hugetlb: hugepagesnid=: add 1G huge page support

From: Luiz capitulino <[email protected]>

Signed-off-by: Luiz capitulino <[email protected]>
---
mm/hugetlb.c | 26 ++++++++++++++++++++++++++
1 file changed, 26 insertions(+)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 439c3b7..d759321 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2051,6 +2051,29 @@ void __init hugetlb_add_hstate(unsigned order)
parsed_hstate = h;
}

+static void __init hugetlb_hstate_alloc_pages_nid(struct hstate *h,
+ int nid,
+ unsigned long nr_pages)
+{
+ struct huge_bootmem_page *m;
+ unsigned long i;
+ void *addr;
+
+ for (i = 0; i < nr_pages; i++) {
+ addr = memblock_virt_alloc_nid_nopanic(
+ huge_page_size(h), huge_page_size(h),
+ 0, BOOTMEM_ALLOC_ACCESSIBLE, nid);
+ if (!addr)
+ break;
+ m = addr;
+ BUG_ON((unsigned long)virt_to_phys(m) & (huge_page_size(h) - 1));
+ list_add(&m->list, &huge_boot_pages);
+ m->hstate = h;
+ }
+
+ h->max_huge_pages += i;
+}
+
void __init hugetlb_add_nrpages_nid(unsigned order, unsigned long nid,
unsigned long nr_pages)
{
@@ -2080,6 +2103,9 @@ void __init hugetlb_add_nrpages_nid(unsigned order, unsigned long nid,
}

*p = nr_pages;
+
+ if (h->order >= MAX_ORDER)
+ hugetlb_hstate_alloc_pages_nid(h, nid, nr_pages);
}

static int __init hugetlb_nrpages_setup(char *s)
--
1.8.1.4

2014-02-10 17:28:13

by Luiz Capitulino

Subject: [PATCH 1/4] memblock: memblock_virt_alloc_internal(): alloc from specified node only

From: Luiz capitulino <[email protected]>

If an allocation from the node specified by the nid argument fails,
memblock_virt_alloc_internal() automatically tries to allocate memory
from other nodes.

This is fine if the caller doesn't care which node is going to allocate
the memory. However, there are cases where the caller wants memory to
be allocated from the specified node only. If that's not possible, then
memblock_virt_alloc_internal() should just fail.

This commit adds a new flags argument to memblock_virt_alloc_internal()
where the caller can control this behavior.

Signed-off-by: Luiz capitulino <[email protected]>
---
mm/memblock.c | 10 +++++++---
1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/mm/memblock.c b/mm/memblock.c
index 39a31e7..b0c7b2e 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -1028,6 +1028,8 @@ phys_addr_t __init memblock_alloc_try_nid(phys_addr_t size, phys_addr_t align, i
return memblock_alloc_base(size, align, MEMBLOCK_ALLOC_ACCESSIBLE);
}

+#define ALLOC_SPECIFIED_NODE_ONLY 0x1
+
/**
* memblock_virt_alloc_internal - allocate boot memory block
* @size: size of memory block to be allocated in bytes
@@ -1058,7 +1060,7 @@ phys_addr_t __init memblock_alloc_try_nid(phys_addr_t size, phys_addr_t align, i
static void * __init memblock_virt_alloc_internal(
phys_addr_t size, phys_addr_t align,
phys_addr_t min_addr, phys_addr_t max_addr,
- int nid)
+ int nid, unsigned int flags)
{
phys_addr_t alloc;
void *ptr;
@@ -1085,6 +1087,8 @@ again:
nid);
if (alloc)
goto done;
+ else if (flags & ALLOC_SPECIFIED_NODE_ONLY)
+ goto error;

if (nid != NUMA_NO_NODE) {
alloc = memblock_find_in_range_node(size, align, min_addr,
@@ -1145,7 +1149,7 @@ void * __init memblock_virt_alloc_try_nid_nopanic(
__func__, (u64)size, (u64)align, nid, (u64)min_addr,
(u64)max_addr, (void *)_RET_IP_);
return memblock_virt_alloc_internal(size, align, min_addr,
- max_addr, nid);
+ max_addr, nid, 0);
}

/**
@@ -1177,7 +1181,7 @@ void * __init memblock_virt_alloc_try_nid(
__func__, (u64)size, (u64)align, nid, (u64)min_addr,
(u64)max_addr, (void *)_RET_IP_);
ptr = memblock_virt_alloc_internal(size, align,
- min_addr, max_addr, nid);
+ min_addr, max_addr, nid, 0);
if (ptr)
return ptr;

--
1.8.1.4

2014-02-10 17:29:04

by Luiz Capitulino

Subject: [PATCH 2/4] memblock: add memblock_virt_alloc_nid_nopanic()

From: Luiz capitulino <[email protected]>

This function tries to allocate memory from the specified node only (vs.
automatically trying other nodes on failure).

This is going to be used by the HugeTLB boot-time allocation code in the
next commits.

Signed-off-by: Luiz capitulino <[email protected]>
---
include/linux/bootmem.h | 4 ++++
mm/memblock.c | 31 +++++++++++++++++++++++++++++++
2 files changed, 35 insertions(+)

diff --git a/include/linux/bootmem.h b/include/linux/bootmem.h
index db51fe4..6961b11 100644
--- a/include/linux/bootmem.h
+++ b/include/linux/bootmem.h
@@ -153,6 +153,10 @@ extern void *__alloc_bootmem_low_node(pg_data_t *pgdat,
void *memblock_virt_alloc_try_nid_nopanic(phys_addr_t size,
phys_addr_t align, phys_addr_t min_addr,
phys_addr_t max_addr, int nid);
+void * __init memblock_virt_alloc_nid_nopanic(
+ phys_addr_t size, phys_addr_t align,
+ phys_addr_t min_addr, phys_addr_t max_addr,
+ int nid);
void *memblock_virt_alloc_try_nid(phys_addr_t size, phys_addr_t align,
phys_addr_t min_addr, phys_addr_t max_addr, int nid);
void __memblock_free_early(phys_addr_t base, phys_addr_t size);
diff --git a/mm/memblock.c b/mm/memblock.c
index b0c7b2e..9130d4b 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -1153,6 +1153,37 @@ void * __init memblock_virt_alloc_try_nid_nopanic(
}

/**
+ * memblock_virt_alloc_nid_nopanic - allocate boot memory block from
+ * specified node
+ * @size: size of memory block to be allocated in bytes
+ * @align: alignment of the region and block's size
+ * @min_addr: the lower bound of the memory region from where the allocation
+ * is preferred (phys address)
+ * @max_addr: the upper bound of the memory region from where the allocation
+ * is preferred (phys address), or %BOOTMEM_ALLOC_ACCESSIBLE to
+ * allocate only from memory limited by memblock.current_limit value
+ * @nid: nid of the free area to find, %NUMA_NO_NODE for any node
+ *
+ * This function tries to allocate memory from @nid only. If @nid doesn't
+ * have enough memory, this function returns failure.
+ *
+ * RETURNS:
+ * Virtual address of allocated memory block on success, NULL on failure.
+ */
+void * __init memblock_virt_alloc_nid_nopanic(
+ phys_addr_t size, phys_addr_t align,
+ phys_addr_t min_addr, phys_addr_t max_addr,
+ int nid)
+{
+ memblock_dbg("%s: %llu bytes align=0x%llx nid=%d from=0x%llx max_addr=0x%llx %pF\n",
+ __func__, (u64)size, (u64)align, nid, (u64)min_addr,
+ (u64)max_addr, (void *)_RET_IP_);
+ return memblock_virt_alloc_internal(size, align, min_addr,
+ max_addr, nid,
+ ALLOC_SPECIFIED_NODE_ONLY);
+}
+
+/**
* memblock_virt_alloc_try_nid - allocate boot memory block with panicking
* @size: size of memory block to be allocated in bytes
* @align: alignment of the region and block's size
--
1.8.1.4

2014-02-10 17:56:12

by Luiz Capitulino

Subject: Re: [PATCH 0/4] hugetlb: add hugepagesnid= command-line option

On Mon, 10 Feb 2014 12:27:44 -0500
Luiz Capitulino <[email protected]> wrote:

> The hugepagesnid= option introduced by this commit allows the user
> to specify which NUMA nodes should be used to allocate boot-time HugeTLB
> pages. For example, hugepagesnid=0,2,2G will allocate two 2G huge pages
> from node 0 only. More details on patch 3/4 and patch 4/4.

s/2G/1G

I repeatedly made this mistake even when testing... For some reason my
brain insists on typing "2,2G" instead of "2,1G".

2014-02-10 23:13:58

by Andrew Morton

Subject: Re: [PATCH 0/4] hugetlb: add hugepagesnid= command-line option

On Mon, 10 Feb 2014 12:27:44 -0500 Luiz Capitulino <[email protected]> wrote:

> HugeTLB command-line option hugepages= allows the user to specify how many
> huge pages should be allocated at boot. On NUMA systems, this argument
> automatically distributes huge pages allocation among nodes, which can
> be undesirable.

Grumble. "can be undesirable" is the entire reason for the entire
patchset. We need far, far more detail than can be conveyed in three
words, please!

> The hugepagesnid= option introduced by this commit allows the user
> to specify which NUMA nodes should be used to allocate boot-time HugeTLB
> pages. For example, hugepagesnid=0,2,2G will allocate two 2G huge pages
> from node 0 only. More details on patch 3/4 and patch 4/4.
>

2014-02-10 23:27:34

by Andrew Morton

Subject: Re: [PATCH 3/4] hugetlb: add hugepagesnid= command-line option

On Mon, 10 Feb 2014 12:27:47 -0500 Luiz Capitulino <[email protected]> wrote:

> From: Luiz capitulino <[email protected]>
>
> The HugeTLB command-line option hugepages= allows the user to specify
> how many huge pages should be allocated at boot-time. On NUMA systems,
> this option will try to automatically distribute the allocation equally
> among nodes. For example, if you have a 2-node NUMA system and allocate
> 200 huge pages, then hugepages= will try to allocate 100 huge pages from
> node0 and 100 from node1.
>
> The hugepagesnid= option introduced by this commit allows the user
> to specify the nodes from which huge pages are allocated. For example,
> if you have a 2-node NUMA system and want 300 2M huge pages to be
> allocated from node1, you could do:
>
> hugepagesnid=1,300,2M
>
> Or, say you want node0 to allocate 100 huge pages and node1 to
> allocate 200 huge pages, you do:
>
> hugepagesnid=0,100,2M hugepagesnid=1,200,2M
>
> This commit adds support only for 2M huge pages, next commit will
> add support for 1G huge pages.
>
> ...
>
> --- a/arch/x86/mm/hugetlbpage.c
> +++ b/arch/x86/mm/hugetlbpage.c
> @@ -188,4 +188,39 @@ static __init int setup_hugepagesz(char *opt)
> return 1;
> }
> __setup("hugepagesz=", setup_hugepagesz);
> +
> +static __init int setup_hugepagesnid(char *opt)
> +{
> + unsigned long order, ps, nid, nr_pages;
> + char size_str[3];
> +
> + size_str[2] = '\0';
> + if (sscanf(opt, "%lu,%lu,%c%c",
> + &nid, &nr_pages, &size_str[0], &size_str[1]) < 4) {
> + printk(KERN_ERR "hugepagesnid: failed to parse arguments\n");
> + return 0;
> + }

This will blow up if passed size=16M. We can expect that powerpc (at
least) will want to copy (or generalise) this code. It would be better
to avoid such restrictions at the outset.
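For illustration, a parse that accepts size suffixes of any digit width ("2M", "16M", "1G") rather than the fixed two-character sscanf — a userspace sketch, where the function name and the inline suffix handling are simplified stand-ins for the kernel's memparse():

```c
#include <stdlib.h>

/* Sketch: parse <nid>,<nrpages>,<size> without assuming the size field
 * is exactly two characters. Returns 0 on success, -1 on parse error.
 * In the kernel, the suffix handling below would be memparse(). */
static int parse_hugepagesnid(const char *opt, unsigned long *nid,
                              unsigned long *nr_pages, unsigned long *size)
{
    char *end;

    *nid = strtoul(opt, &end, 10);
    if (*end != ',')
        return -1;
    *nr_pages = strtoul(end + 1, &end, 10);
    if (*end != ',')
        return -1;
    *size = strtoul(end + 1, &end, 10);
    switch (*end) {              /* memparse()-style suffixes */
    case 'G': case 'g': *size <<= 30; break;
    case 'M': case 'm': *size <<= 20; break;
    case 'K': case 'k': *size <<= 10; break;
    default:
        return -1;               /* simplification: suffix required here */
    }
    return 0;
}
```

With this shape, "0,16,16M" parses the same way "0,16,2M" does, which addresses the powerpc-style sizes Andrew mentions.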

> + if (!nr_pages) {
> + printk(KERN_ERR
> + "hugepagesnid: zero number of pages, ignoring\n");

The code contains rather a lot of these awkward wordwraps. Try using
pr_err() and you'll find the result quite pleasing!

> + return 0;
> + }
> +
> + ps = memparse(size_str, NULL);
> + if (ps == PMD_SIZE) {
> + order = PMD_SHIFT - PAGE_SHIFT;
> + } else if (ps == PUD_SIZE && cpu_has_gbpages) {
> + order = PUD_SHIFT - PAGE_SHIFT;
> + } else {
> + printk(KERN_ERR "hugepagesnid: Unsupported page size %lu M\n",
> + ps >> 20);
> + return 0;
> + }
> +
> + hugetlb_add_nrpages_nid(order, nid, nr_pages);
> + return 1;
> +}
> +__setup("hugepagesnid=", setup_hugepagesnid);
> +
> #endif
>
> ...
>
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -46,6 +46,7 @@ __initdata LIST_HEAD(huge_boot_pages);
> static struct hstate * __initdata parsed_hstate;
> static unsigned long __initdata default_hstate_max_huge_pages;
> static unsigned long __initdata default_hstate_size;
> +static unsigned long __initdata boot_alloc_nodes[HUGE_MAX_HSTATE][MAX_NUMNODES];
>
> /*
> * Protects updates to hugepage_freelists, hugepage_activelist, nr_huge_pages,
> @@ -1348,6 +1349,50 @@ static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
> h->max_huge_pages = i;
> }
>
> +static unsigned long __init alloc_huge_pages_nid(struct hstate *h,
> + int nid,
> + unsigned long nr_pages)
> +{
> + unsigned long i;
> + struct page *page;
> +
> + for (i = 0; i < nr_pages; i++) {
> + page = alloc_fresh_huge_page_node(h, nid);
> + if (!page) {
> + count_vm_event(HTLB_BUDDY_PGALLOC_FAIL);
> + break;
> + }
> + count_vm_event(HTLB_BUDDY_PGALLOC);
> + }
> +
> + return i;
> +}
> +
> +static unsigned long __init alloc_huge_pages_nodes(struct hstate *h)
> +{
> + unsigned long i, *entry, ret = 0;
> +
> + for (i = 0; i < MAX_NUMNODES; i++) {
> + entry = &boot_alloc_nodes[hstate_index(h)][i];
> + if (*entry > 0)
> + ret += alloc_huge_pages_nid(h, i, *entry);
> + }
> +
> + return ret;
> +}
> +
> +static void __init hugetlb_init_hstates_nodes(void)
> +{
> + struct hstate *h;
> + unsigned long ret;
> +
> + for_each_hstate(h)
> + if (h->order < MAX_ORDER) {
> + ret = alloc_huge_pages_nodes(h);
> + h->max_huge_pages += ret;
> + }
> +}

The patch adds code to mm/hugetlb.c which only x86 will use. I guess
that's OK medium-term if we expect other architectures will use it.
But if other architectures use it, setup_hugepagesnid() was in the wrong
directory.

Can I ask that you poke the ppc/arm/ia64/etc people, see whether they
will adopt this? Explaining the overall justification better in [patch
0/n] would help this discussion!

> static void __init hugetlb_init_hstates(void)
> {
> struct hstate *h;
> @@ -1966,6 +2011,7 @@ static int __init hugetlb_init(void)
> default_hstate.max_huge_pages = default_hstate_max_huge_pages;
>
> hugetlb_init_hstates();
> + hugetlb_init_hstates_nodes();
> gather_bootmem_prealloc();
> report_hugepages();
>
> @@ -2005,6 +2051,37 @@ void __init hugetlb_add_hstate(unsigned order)
> parsed_hstate = h;
> }
>
> +void __init hugetlb_add_nrpages_nid(unsigned order, unsigned long nid,
> + unsigned long nr_pages)

Using an unsigned long for a NUMA node ID is overkill, surely.

> +{
> + struct hstate *h;
> + unsigned long *p;
> +
> + if (parsed_hstate) {
> + printk(KERN_WARNING
> + "hugepagesnid: hugepagesz= specified, ignoring\n");
> + return;
> + }
> +
> + for_each_hstate(h)
> + if (h->order == order)
> + break;
> +
> + if (h->order != order) {
> + hugetlb_add_hstate(order);
> + parsed_hstate = NULL;
> + }
> +
> + p = &boot_alloc_nodes[hstate_index(h)][nid];
> + if (*p != 0) {
> + printk(KERN_WARNING
> + "hugepagesnid: node %lu already specified, ignoring\n", nid);
> + return;
> + }
> +
> + *p = nr_pages;
> +}
> +

2014-02-10 23:27:48

by Andrew Morton

Subject: Re: [PATCH 1/4] memblock: memblock_virt_alloc_internal(): alloc from specified node only

On Mon, 10 Feb 2014 12:27:45 -0500 Luiz Capitulino <[email protected]> wrote:

> From: Luiz capitulino <[email protected]>
>
> If an allocation from the node specified by the nid argument fails,
> memblock_virt_alloc_internal() automatically tries to allocate memory
> from other nodes.
>
> This is fine if the caller doesn't care which node is going to allocate
> the memory. However, there are cases where the caller wants memory to
> be allocated from the specified node only. If that's not possible, then
> memblock_virt_alloc_internal() should just fail.
>
> This commit adds a new flags argument to memblock_virt_alloc_internal()
> where the caller can control this behavior.
>
> ...
>
> --- a/mm/memblock.c
> +++ b/mm/memblock.c
> @@ -1028,6 +1028,8 @@ phys_addr_t __init memblock_alloc_try_nid(phys_addr_t size, phys_addr_t align, i
> return memblock_alloc_base(size, align, MEMBLOCK_ALLOC_ACCESSIBLE);
> }
>
> +#define ALLOC_SPECIFIED_NODE_ONLY 0x1
> +
> /**
> * memblock_virt_alloc_internal - allocate boot memory block
> * @size: size of memory block to be allocated in bytes
> @@ -1058,7 +1060,7 @@ phys_addr_t __init memblock_alloc_try_nid(phys_addr_t size, phys_addr_t align, i
> static void * __init memblock_virt_alloc_internal(
> phys_addr_t size, phys_addr_t align,
> phys_addr_t min_addr, phys_addr_t max_addr,
> - int nid)
> + int nid, unsigned int flags)
> {
> phys_addr_t alloc;
> void *ptr;
> @@ -1085,6 +1087,8 @@ again:
> nid);
> if (alloc)
> goto done;
> + else if (flags & ALLOC_SPECIFIED_NODE_ONLY)
> + goto error;

"else" is unneeded.

> if (nid != NUMA_NO_NODE) {
> alloc = memblock_find_in_range_node(size, align, min_addr,
>
> ...
>

2014-02-10 23:30:35

by Andrew Morton

Subject: Re: [PATCH 4/4] hugetlb: hugepagesnid=: add 1G huge page support

On Mon, 10 Feb 2014 12:27:48 -0500 Luiz Capitulino <[email protected]> wrote:

>
> ...
>
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -2051,6 +2051,29 @@ void __init hugetlb_add_hstate(unsigned order)
> parsed_hstate = h;
> }
>
> +static void __init hugetlb_hstate_alloc_pages_nid(struct hstate *h,
> + int nid,
> + unsigned long nr_pages)
> +{
> + struct huge_bootmem_page *m;
> + unsigned long i;
> + void *addr;
> +
> + for (i = 0; i < nr_pages; i++) {
> + addr = memblock_virt_alloc_nid_nopanic(
> + huge_page_size(h), huge_page_size(h),
> + 0, BOOTMEM_ALLOC_ACCESSIBLE, nid);
> + if (!addr)
> + break;
> + m = addr;
> + BUG_ON((unsigned long)virt_to_phys(m) & (huge_page_size(h) - 1));

IS_ALIGNED()?

> + list_add(&m->list, &huge_boot_pages);
> + m->hstate = h;
> + }
> +
> + h->max_huge_pages += i;
> +}
> +
> void __init hugetlb_add_nrpages_nid(unsigned order, unsigned long nid,
> unsigned long nr_pages)
> {

Please cc Yinghai Lu <[email protected]> on these patches - he
understands memblock well and is a strong reviewer.

2014-02-11 02:54:24

by David Rientjes

Subject: Re: [PATCH 0/4] hugetlb: add hugepagesnid= command-line option

On Mon, 10 Feb 2014, Luiz Capitulino wrote:

> HugeTLB command-line option hugepages= allows the user to specify how many
> huge pages should be allocated at boot. On NUMA systems, this argument
> automatically distributes huge pages allocation among nodes, which can
> be undesirable.
>

And when hugepages can no longer be allocated on a node because it is too
small, the remaining hugepages are distributed over nodes with memory
available, correct?

> The hugepagesnid= option introduced by this commit allows the user
> to specify which NUMA nodes should be used to allocate boot-time HugeTLB
> pages. For example, hugepagesnid=0,2,2G will allocate two 2G huge pages
> from node 0 only. More details on patch 3/4 and patch 4/4.
>

Strange, it would seem better to just reserve as many hugepages as you
want so that you get the desired number on each node and then free the
ones you don't need at runtime.

That probably doesn't work because we can't free very large hugepages that
are reserved at boot; would fixing that issue reduce the need for this
patchset?

2014-02-11 03:58:18

by Davidlohr Bueso

Subject: Re: [PATCH 0/4] hugetlb: add hugepagesnid= command-line option

On Mon, 2014-02-10 at 15:13 -0800, Andrew Morton wrote:
> On Mon, 10 Feb 2014 12:27:44 -0500 Luiz Capitulino <[email protected]> wrote:
>
> > HugeTLB command-line option hugepages= allows the user to specify how many
> > huge pages should be allocated at boot. On NUMA systems, this argument
> > automatically distributes huge pages allocation among nodes, which can
> > be undesirable.
>
> Grumble. "can be undesirable" is the entire reason for the entire
> patchset. We need far, far more detail than can be conveyed in three
> words, please!

One (not so real-world) scenario that comes right to mind which can
benefit from such a feature is the ability to study socket/node scaling
for hugepage-aware applications. Yes, we do have numactl to bind
programs to resources, but I don't mind having a way of finer-grained
hugetlb allocation, especially if it doesn't hurt anything.

Thanks,
Davidlohr

2014-02-11 09:20:25

by Mel Gorman

Subject: Re: [PATCH 1/4] memblock: memblock_virt_alloc_internal(): alloc from specified node only

On Mon, Feb 10, 2014 at 12:27:45PM -0500, Luiz Capitulino wrote:
> From: Luiz capitulino <[email protected]>
>
> If an allocation from the node specified by the nid argument fails,
> memblock_virt_alloc_internal() automatically tries to allocate memory
> from other nodes.
>
> This is fine if the caller doesn't care which node is going to allocate
> the memory. However, there are cases where the caller wants memory to
> be allocated from the specified node only. If that's not possible, then
> memblock_virt_alloc_internal() should just fail.
>
> This commit adds a new flags argument to memblock_virt_alloc_internal()
> where the caller can control this behavior.
>
> Signed-off-by: Luiz capitulino <[email protected]>
> ---
> mm/memblock.c | 10 +++++++---
> 1 file changed, 7 insertions(+), 3 deletions(-)
>
> diff --git a/mm/memblock.c b/mm/memblock.c
> index 39a31e7..b0c7b2e 100644
> --- a/mm/memblock.c
> +++ b/mm/memblock.c
> @@ -1028,6 +1028,8 @@ phys_addr_t __init memblock_alloc_try_nid(phys_addr_t size, phys_addr_t align, i
> return memblock_alloc_base(size, align, MEMBLOCK_ALLOC_ACCESSIBLE);
> }
>
> +#define ALLOC_SPECIFIED_NODE_ONLY 0x1
> +

It's not a perfect fit but you could use gfp_t and GFP_THISNODE. The
meaning of the flag is recognised and while you are not using it with a
page allocator, we already use GFP flags with the slab allocator without
confusion.
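A toy model of the fallback policy under discussion (the flag value, pools, and function are illustrative, not kernel code): a GFP_THISNODE-style flag suppresses cross-node fallback, which is exactly the behavior the patch's ALLOC_SPECIFIED_NODE_ONLY adds.

```c
#define GFP_THISNODE 0x1u   /* stand-in value, not the kernel's */
#define NR_NODES 2

static long pool[NR_NODES] = { 0, 4096 }; /* bytes free per node */

/* Try @nid first; fall back to other nodes only when GFP_THISNODE is
 * not set. Returns the node allocated from, or -1 on failure. */
static int alloc_from(int nid, long size, unsigned int flags)
{
    if (pool[nid] >= size) {
        pool[nid] -= size;
        return nid;
    }
    if (flags & GFP_THISNODE)
        return -1;          /* no fallback allowed */
    for (int i = 0; i < NR_NODES; i++) {
        if (i != nid && pool[i] >= size) {
            pool[i] -= size;
            return i;
        }
    }
    return -1;
}
```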

--
Mel Gorman
SUSE Labs

2014-02-11 09:25:24

by Mel Gorman

Subject: Re: [PATCH 0/4] hugetlb: add hugepagesnid= command-line option

On Mon, Feb 10, 2014 at 06:54:20PM -0800, David Rientjes wrote:
> On Mon, 10 Feb 2014, Luiz Capitulino wrote:
>
> > HugeTLB command-line option hugepages= allows the user to specify how many
> > huge pages should be allocated at boot. On NUMA systems, this argument
> > automatically distributes huge pages allocation among nodes, which can
> > be undesirable.
> >
>
> And when hugepages can no longer be allocated on a node because it is too
> small, the remaining hugepages are distributed over nodes with memory
> available, correct?
>
> > The hugepagesnid= option introduced by this commit allows the user
> > to specify which NUMA nodes should be used to allocate boot-time HugeTLB
> > pages. For example, hugepagesnid=0,2,2G will allocate two 2G huge pages
> > from node 0 only. More details on patch 3/4 and patch 4/4.
> >
>
> Strange, it would seem better to just reserve as many hugepages as you
> want so that you get the desired number on each node and then free the
> ones you don't need at runtime.
>

Or take a stab at allocating 1G pages at runtime. It would require
finding properly aligned 1Gs worth of contiguous MAX_ORDER_NR_PAGES at
runtime. I would expect it would only work very early in the lifetime of
the system but if the user is willing to use kernel parameters to
allocate them then it should not be an issue.

--
Mel Gorman
SUSE Labs

2014-02-11 15:26:39

by Luiz Capitulino

Subject: Re: [PATCH 3/4] hugetlb: add hugepagesnid= command-line option

On Mon, 10 Feb 2014 15:27:29 -0800
Andrew Morton <[email protected]> wrote:

> On Mon, 10 Feb 2014 12:27:47 -0500 Luiz Capitulino <[email protected]> wrote:
>
> > From: Luiz capitulino <[email protected]>
> >
> > The HugeTLB command-line option hugepages= allows the user to specify
> > how many huge pages should be allocated at boot-time. On NUMA systems,
> > this option will try to automatically distribute the allocation equally
> > among nodes. For example, if you have a 2-node NUMA system and allocate
> > 200 huge pages, then hugepages= will try to allocate 100 huge pages from
> > node0 and 100 from node1.
> >
> > The hugepagesnid= option introduced by this commit allows the user
> > to specify the nodes from which huge pages are allocated. For example,
> > if you have a 2-node NUMA system and want 300 2M huge pages to be
> > allocated from node1, you could do:
> >
> > hugepagesnid=1,300,2M
> >
> > Or, say you want node0 to allocate 100 huge pages and node1 to
> > allocate 200 huge pages, you do:
> >
> > hugepagesnid=0,100,2M hugepagesnid=1,200,2M
> >
> > This commit adds support only for 2M huge pages, next commit will
> > add support for 1G huge pages.
> >
> > ...
> >
> > --- a/arch/x86/mm/hugetlbpage.c
> > +++ b/arch/x86/mm/hugetlbpage.c
> > @@ -188,4 +188,39 @@ static __init int setup_hugepagesz(char *opt)
> > return 1;
> > }
> > __setup("hugepagesz=", setup_hugepagesz);
> > +
> > +static __init int setup_hugepagesnid(char *opt)
> > +{
> > + unsigned long order, ps, nid, nr_pages;
> > + char size_str[3];
> > +
> > + size_str[2] = '\0';
> > + if (sscanf(opt, "%lu,%lu,%c%c",
> > + &nid, &nr_pages, &size_str[0], &size_str[1]) < 4) {
> > + printk(KERN_ERR "hugepagesnid: failed to parse arguments\n");
> > + return 0;
> > + }
>
> This will blow up if passed size=16M. We can expect that powerpc (at
> least) will want to copy (or generalise) this code. It would be better
> to avoid such restrictions at the outset.
>
> > + if (!nr_pages) {
> > + printk(KERN_ERR
> > + "hugepagesnid: zero number of pages, ignoring\n");
>
> The code contains rather a lot of these awkward wordwraps. Try using
> pr_err() and you'll find the result quite pleasing!

Will make both changes.
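For what it's worth, the fixed two-character size buffer is the only thing that breaks on 16M; handing the tail of the option string straight to memparse() avoids any width assumption. Below is a userspace sketch of that parsing logic; my_memparse() is only a stand-in for the kernel's memparse(), and in-kernel code would use memparse() and pr_err() directly:

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Userspace stand-in for the kernel's memparse(): parse a number with an
 * optional K/M/G suffix. In-kernel code would just call memparse().
 */
static unsigned long long my_memparse(const char *s, char **retptr)
{
	unsigned long long v = strtoull(s, retptr, 0);

	switch (**retptr) {
	case 'G': case 'g': v <<= 10; /* fall through */
	case 'M': case 'm': v <<= 10; /* fall through */
	case 'K': case 'k': v <<= 10; (*retptr)++; break;
	}
	return v;
}

/*
 * Parse "nid,nr_pages,size" without a fixed-width size buffer, so 2M,
 * 16M and 1G all work. Returns 0 on success, -1 on malformed input.
 */
static int parse_hugepagesnid(const char *opt, unsigned long *nid,
			      unsigned long *nr_pages, unsigned long long *ps)
{
	char *end;

	*nid = strtoul(opt, &end, 10);
	if (*end != ',')
		return -1;
	*nr_pages = strtoul(end + 1, &end, 10);
	if (*end != ',' || !*nr_pages)
		return -1;
	*ps = my_memparse(end + 1, &end);
	return *end ? -1 : 0;
}
```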

> > + return 0;
> > + }
> > +
> > + ps = memparse(size_str, NULL);
> > + if (ps == PMD_SIZE) {
> > + order = PMD_SHIFT - PAGE_SHIFT;
> > + } else if (ps == PUD_SIZE && cpu_has_gbpages) {
> > + order = PUD_SHIFT - PAGE_SHIFT;
> > + } else {
> > + printk(KERN_ERR "hugepagesnid: Unsupported page size %lu M\n",
> > + ps >> 20);
> > + return 0;
> > + }
> > +
> > + hugetlb_add_nrpages_nid(order, nid, nr_pages);
> > + return 1;
> > +}
> > +__setup("hugepagesnid=", setup_hugepagesnid);
> > +
> > #endif
> >
> > ...
> >
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -46,6 +46,7 @@ __initdata LIST_HEAD(huge_boot_pages);
> > static struct hstate * __initdata parsed_hstate;
> > static unsigned long __initdata default_hstate_max_huge_pages;
> > static unsigned long __initdata default_hstate_size;
> > +static unsigned long __initdata boot_alloc_nodes[HUGE_MAX_HSTATE][MAX_NUMNODES];
> >
> > /*
> > * Protects updates to hugepage_freelists, hugepage_activelist, nr_huge_pages,
> > @@ -1348,6 +1349,50 @@ static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
> > h->max_huge_pages = i;
> > }
> >
> > +static unsigned long __init alloc_huge_pages_nid(struct hstate *h,
> > + int nid,
> > + unsigned long nr_pages)
> > +{
> > + unsigned long i;
> > + struct page *page;
> > +
> > + for (i = 0; i < nr_pages; i++) {
> > + page = alloc_fresh_huge_page_node(h, nid);
> > + if (!page) {
> > + count_vm_event(HTLB_BUDDY_PGALLOC_FAIL);
> > + break;
> > + }
> > + count_vm_event(HTLB_BUDDY_PGALLOC);
> > + }
> > +
> > + return i;
> > +}
> > +
> > +static unsigned long __init alloc_huge_pages_nodes(struct hstate *h)
> > +{
> > + unsigned long i, *entry, ret = 0;
> > +
> > + for (i = 0; i < MAX_NUMNODES; i++) {
> > + entry = &boot_alloc_nodes[hstate_index(h)][i];
> > + if (*entry > 0)
> > + ret += alloc_huge_pages_nid(h, i, *entry);
> > + }
> > +
> > + return ret;
> > +}
> > +
> > +static void __init hugetlb_init_hstates_nodes(void)
> > +{
> > + struct hstate *h;
> > + unsigned long ret;
> > +
> > + for_each_hstate(h)
> > + if (h->order < MAX_ORDER) {
> > + ret = alloc_huge_pages_nodes(h);
> > + h->max_huge_pages += ret;
> > + }
> > +}
>
> The patch adds code to mm/hugetlb.c which only x86 will use. I guess
> that's OK medium-term if we expect other architectures will use it.

Yes, we do.

> But if other architectures use it, setup_hugepagesnid() was in the wrong
> directory.

The page size parameter is arch-dependent. I can think of two options:

1. Let setup_hugepagesnid() do all the work (which is what this series
does). Not ideal, because it will cause some code duplication when other
archs implement it

2. Use existing helpers, or add this one to mm/hugetlb.c:

mm/parse_hugepagesnid_str(const char *str, int *ps, int *nid, int *nr_pages);

To be called from arch-dependent code (eg. setup_hugepagesnid()). So, the
arch-dependent code calls parse_hugepagesnid_str(), does the page size
checks and then calls hugetlb_add_nrpages_nid()

> Can I ask that you poke the ppc/arm/ia64/etc people, see whether they
> will adopt this? Explaining the overall justification better in [patch
> 0/n] would help this discussion!

Absolutely.

>
> > static void __init hugetlb_init_hstates(void)
> > {
> > struct hstate *h;
> > @@ -1966,6 +2011,7 @@ static int __init hugetlb_init(void)
> > default_hstate.max_huge_pages = default_hstate_max_huge_pages;
> >
> > hugetlb_init_hstates();
> > + hugetlb_init_hstates_nodes();
> > gather_bootmem_prealloc();
> > report_hugepages();
> >
> > @@ -2005,6 +2051,37 @@ void __init hugetlb_add_hstate(unsigned order)
> > parsed_hstate = h;
> > }
> >
> > +void __init hugetlb_add_nrpages_nid(unsigned order, unsigned long nid,
> > + unsigned long nr_pages)
>
> Using an unsigned long for a NUMA node ID is overkill, surely.

OK.

>
> > +{
> > + struct hstate *h;
> > + unsigned long *p;
> > +
> > + if (parsed_hstate) {
> > + printk(KERN_WARNING
> > + "hugepagesnid: hugepagesz= specified, ignoring\n");
> > + return;
> > + }
> > +
> > + for_each_hstate(h)
> > + if (h->order == order)
> > + break;
> > +
> > + if (h->order != order) {
> > + hugetlb_add_hstate(order);
> > + parsed_hstate = NULL;
> > + }
> > +
> > + p = &boot_alloc_nodes[hstate_index(h)][nid];
> > + if (*p != 0) {
> > + printk(KERN_WARNING
> > + "hugepagesnid: node %lu already specified, ignoring\n", nid);
> > + return;
> > + }
> > +
> > + *p = nr_pages;
> > +}
> > +
>

2014-02-11 15:27:21

by Marcelo Tosatti

Subject: Re: [PATCH 0/4] hugetlb: add hugepagesnid= command-line option

On Tue, Feb 11, 2014 at 09:25:14AM +0000, Mel Gorman wrote:
> On Mon, Feb 10, 2014 at 06:54:20PM -0800, David Rientjes wrote:
> > On Mon, 10 Feb 2014, Luiz Capitulino wrote:
> >
> > > HugeTLB command-line option hugepages= allows the user to specify how many
> > > huge pages should be allocated at boot. On NUMA systems, this argument
> > > automatically distributes huge pages allocation among nodes, which can
> > > be undesirable.
> > >
> >
> > And when hugepages can no longer be allocated on a node because it is too
> > small, the remaining hugepages are distributed over nodes with memory
> > available, correct?
> >
> > > The hugepagesnid= option introduced by this commit allows the user
> > > to specify which NUMA nodes should be used to allocate boot-time HugeTLB
> > > pages. For example, hugepagesnid=0,2,2G will allocate two 2G huge pages
> > > from node 0 only. More details on patch 3/4 and patch 4/4.
> > >
> >
> > Strange, it would seem better to just reserve as many hugepages as you
> > want so that you get the desired number on each node and then free the
> > ones you don't need at runtime.

You have to know the behaviour of the allocator, and rely on that
to allocate the exact number of 1G hugepages on a particular node.

Is that desirable, in contrast with specifying the exact number, and
location, of hugepages to be allocated?

> Or take a stab at allocating 1G pages at runtime. It would require
> finding properly aligned 1Gs worth of contiguous MAX_ORDER_NR_PAGES at
> runtime. I would expect it would only work very early in the lifetime of
> the system but if the user is willing to use kernel parameters to
> allocate them then it should not be an issue.

Can be an improvement on top of the current patchset? Certain use-cases
require allocation guarantees (even if that requires kernel parameters).

2014-02-11 15:27:24

by Luiz Capitulino

Subject: Re: [PATCH 4/4] hugetlb: hugepagesnid=: add 1G huge page support

On Mon, 10 Feb 2014 15:30:32 -0800
Andrew Morton <[email protected]> wrote:

> On Mon, 10 Feb 2014 12:27:48 -0500 Luiz Capitulino <[email protected]> wrote:
>
> >
> > ...
> >
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -2051,6 +2051,29 @@ void __init hugetlb_add_hstate(unsigned order)
> > parsed_hstate = h;
> > }
> >
> > +static void __init hugetlb_hstate_alloc_pages_nid(struct hstate *h,
> > + int nid,
> > + unsigned long nr_pages)
> > +{
> > + struct huge_bootmem_page *m;
> > + unsigned long i;
> > + void *addr;
> > +
> > + for (i = 0; i < nr_pages; i++) {
> > + addr = memblock_virt_alloc_nid_nopanic(
> > + huge_page_size(h), huge_page_size(h),
> > + 0, BOOTMEM_ALLOC_ACCESSIBLE, nid);
> > + if (!addr)
> > + break;
> > + m = addr;
> > + BUG_ON((unsigned long)virt_to_phys(m) & (huge_page_size(h) - 1));
>
> IS_ALIGNED()?
>
> > + list_add(&m->list, &huge_boot_pages);
> > + m->hstate = h;
> > + }
> > +
> > + h->max_huge_pages += i;
> > +}
> > +
> > void __init hugetlb_add_nrpages_nid(unsigned order, unsigned long nid,
> > unsigned long nr_pages)
> > {
>
> Please cc Yinghai Lu <[email protected]> on these patches - he
> understands memblock well and is a strong reviewer.

Will do for v2, with your comments addressed.

2014-02-11 15:36:47

by Luiz Capitulino

Subject: Re: [PATCH 0/4] hugetlb: add hugepagesnid= command-line option

On Mon, 10 Feb 2014 18:54:20 -0800 (PST)
David Rientjes <[email protected]> wrote:

> On Mon, 10 Feb 2014, Luiz Capitulino wrote:
>
> > HugeTLB command-line option hugepages= allows the user to specify how many
> > huge pages should be allocated at boot. On NUMA systems, this argument
> > automatically distributes huge pages allocation among nodes, which can
> > be undesirable.
> >
>
> And when hugepages can no longer be allocated on a node because it is too
> small, the remaining hugepages are distributed over nodes with memory
> available, correct?

No. hugepagesnid= tries to obey what was specified by the user as much as
possible. So, if you specify that 10 1G huge pages should be allocated from
node0 but only 7 1G pages can actually be allocated, then hugepagesnid= will
allocate just those 7 from node0 and won't spill the remainder over to other
nodes.

> > The hugepagesnid= option introduced by this commit allows the user
> > to specify which NUMA nodes should be used to allocate boot-time HugeTLB
> > pages. For example, hugepagesnid=0,2,2G will allocate two 2G huge pages
> > from node 0 only. More details on patch 3/4 and patch 4/4.
> >
>
> Strange, it would seem better to just reserve as many hugepages as you
> want so that you get the desired number on each node and then free the
> ones you don't need at runtime.

You mean, for example, if I have a 2 node system and want 2 1G huge pages
from node 1, then I have to allocate 4 1G huge pages and then free 2 pages
on node 0 after boot? That seems very cumbersome to me. Besides, what if
node0 needs this memory during boot?

> That probably doesn't work because we can't free very large hugepages that
> are reserved at boot, would fixing that issue reduce the need for this
> patchset?

I don't think so.

2014-02-11 15:42:27

by Luiz Capitulino

Subject: Re: [PATCH 0/4] hugetlb: add hugepagesnid= command-line option

On Mon, 10 Feb 2014 15:13:54 -0800
Andrew Morton <[email protected]> wrote:

> On Mon, 10 Feb 2014 12:27:44 -0500 Luiz Capitulino <[email protected]> wrote:
>
> > HugeTLB command-line option hugepages= allows the user to specify how many
> > huge pages should be allocated at boot. On NUMA systems, this argument
> > automatically distributes huge pages allocation among nodes, which can
> > be undesirable.
>
> Grumble. "can be undesirable" is the entire reason for the entire
> patchset. We need far, far more detail than can be conveyed in three
> words, please!

Right, sorry for that. I'll improve this for v2, but a better introduction
for the series would be something like the following.

Today, HugeTLB provides support for controlling allocation of persistent
huge pages on a NUMA system through sysfs. So, for example, if a sysadmin
wants to allocate 300 2M huge pages on node 1, s/he can do:

echo 300 > /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages

This works as long as you have enough contiguous pages, which may work
for 2M pages, but is harder for 1G huge pages. For those, it's better or even
required to reserve them at boot.

To this end we have the hugepages= command-line option, which works but misses
the per node control. This option evenly distributes huge pages among nodes.
However, we have users who want more flexibility. They want to be able to
specify something like: allocate 2 1G huge pages from node0 and 4 1G huge
pages from node1. This is what this series implements.

It's basically per node allocation control for 1G huge pages, but it's
important to note that this series is not intrusive. All it does is set
the initial per node allocation. All the functions and data structures added
by this series are used only once at boot; after that they are discarded and
rest in oblivion.

2014-02-11 17:10:40

by Mel Gorman

Subject: Re: [PATCH 0/4] hugetlb: add hugepagesnid= command-line option

On Tue, Feb 11, 2014 at 01:26:29PM -0200, Marcelo Tosatti wrote:
> > Or take a stab at allocating 1G pages at runtime. It would require
> > finding properly aligned 1Gs worth of contiguous MAX_ORDER_NR_PAGES at
> > runtime. I would expect it would only work very early in the lifetime of
> > the system but if the user is willing to use kernel parameters to
> > allocate them then it should not be an issue.
>
> Can be an improvement on top of the current patchset? Certain use-cases
> require allocation guarantees (even if that requires kernel parameters).
>

Sure, they're not mutually exclusive. It would just avoid the need to
create a new kernel parameter and use the existing interfaces.

--
Mel Gorman
SUSE Labs

2014-02-11 20:19:27

by Marcelo Tosatti

Subject: Re: [PATCH 0/4] hugetlb: add hugepagesnid= command-line option

On Tue, Feb 11, 2014 at 05:10:35PM +0000, Mel Gorman wrote:
> On Tue, Feb 11, 2014 at 01:26:29PM -0200, Marcelo Tosatti wrote:
> > > Or take a stab at allocating 1G pages at runtime. It would require
> > > finding properly aligned 1Gs worth of contiguous MAX_ORDER_NR_PAGES at
> > > runtime. I would expect it would only work very early in the lifetime of
> > > the system but if the user is willing to use kernel parameters to
> > > allocate them then it should not be an issue.
> >
> > Can be an improvement on top of the current patchset? Certain use-cases
> > require allocation guarantees (even if that requires kernel parameters).
> >
>
> Sure, they're not mutually exclusive. It would just avoid the need to
> create a new kernel parameter and use the existing interfaces.

Yes, the problem is there is no guarantee, is there?

2014-02-11 21:17:36

by Andi Kleen

Subject: Re: [PATCH 0/4] hugetlb: add hugepagesnid= command-line option

On Mon, Feb 10, 2014 at 12:27:44PM -0500, Luiz Capitulino wrote:
> HugeTLB command-line option hugepages= allows the user to specify how many
> huge pages should be allocated at boot. On NUMA systems, this argument
> automatically distributes huge pages allocation among nodes, which can
> be undesirable.
>
> The hugepagesnid= option introduced by this commit allows the user
> to specify which NUMA nodes should be used to allocate boot-time HugeTLB
> pages. For example, hugepagesnid=0,2,2G will allocate two 2G huge pages
> from node 0 only. More details on patch 3/4 and patch 4/4.

The syntax seems very confusing. Can you make that more obvious?

-Andi

2014-02-11 21:31:28

by Luiz Capitulino

Subject: Re: [PATCH 0/4] hugetlb: add hugepagesnid= command-line option

On Tue, 11 Feb 2014 22:17:32 +0100
Andi Kleen <[email protected]> wrote:

> On Mon, Feb 10, 2014 at 12:27:44PM -0500, Luiz Capitulino wrote:
> > HugeTLB command-line option hugepages= allows the user to specify how many
> > huge pages should be allocated at boot. On NUMA systems, this argument
> > automatically distributes huge pages allocation among nodes, which can
> > be undesirable.
> >
> > The hugepagesnid= option introduced by this commit allows the user
> > to specify which NUMA nodes should be used to allocate boot-time HugeTLB
> > pages. For example, hugepagesnid=0,2,2G will allocate two 2G huge pages
> > from node 0 only. More details on patch 3/4 and patch 4/4.
>
> The syntax seems very confusing. Can you make that more obvious?

I guess that my bad description in this email may have contributed to
making it look confusing.

The real syntax is hugepagesnid=nid,nr-pages,size. Which looks straightforward
to me. I honestly can't think of anything better than that, but I'm open for
suggestions.

2014-02-12 02:37:15

by Andi Kleen

Subject: Re: [PATCH 0/4] hugetlb: add hugepagesnid= command-line option

> The real syntax is hugepagesnid=nid,nr-pages,size. Which looks straightforward
> to me. I honestly can't think of anything better than that, but I'm open for
> suggestions.

hugepages_node=nid:nr-pages:size,... ?

-Andi

--
[email protected] -- Speaking for myself only.

2014-02-12 03:45:18

by Yasuaki Ishimatsu

Subject: Re: [PATCH 3/4] hugetlb: add hugepagesnid= command-line option

(2014/02/11 2:27), Luiz Capitulino wrote:
> From: Luiz capitulino <[email protected]>
>
> The HugeTLB command-line option hugepages= allows the user to specify
> how many huge pages should be allocated at boot-time. On NUMA systems,
> this option will try to automatically distribute the allocation equally
> among nodes. For example, if you have a 2-node NUMA system and allocate
> 200 huge pages, then hugepages= will try to allocate 100 huge pages from
> node0 and 100 from node1.
>
> The hugepagesnid= option introduced by this commit allows the user
> to specify the nodes from which huge pages are allocated. For example,
> if you have a 2-node NUMA system and want 300 2M huge pages to be
> allocated from node1, you could do:
>
> hugepagesnid=1,300,2M
>
> Or, say you want node0 to allocate 100 huge pages and node1 to
> allocate 200 huge pages, you do:
>
> hugepagesnid=0,100,2M hugepagesnid=1,200,2M
>
> This commit adds support only for 2M huge pages, next commit will
> add support for 1G huge pages.
>
> Signed-off-by: Luiz Capitulino <[email protected]>
> ---
> Documentation/kernel-parameters.txt | 8 ++++
> arch/x86/mm/hugetlbpage.c | 35 +++++++++++++++++
> include/linux/hugetlb.h | 2 +
> mm/hugetlb.c | 77 +++++++++++++++++++++++++++++++++++++
> 4 files changed, 122 insertions(+)
>
> diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
> index 7116fda..3cbe950 100644
> --- a/Documentation/kernel-parameters.txt
> +++ b/Documentation/kernel-parameters.txt
> @@ -1125,6 +1125,14 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
> registers. Default set by CONFIG_HPET_MMAP_DEFAULT.
>
> hugepages= [HW,X86-32,IA-64] HugeTLB pages to allocate at boot.
> + hugepagesnid= [X86-64] HugeTLB pages to allocate at boot on a NUMA system.
> + Format: <nid>,<nrpages>,<size>
> + nid: NUMA node id to allocate pages from
> + nrpages: number of huge pages to allocate
> > + size: huge pages size (see hugepagesz below)
> + This argument can be specified multiple times for different
> + NUMA nodes, but it shouldn't be mixed with hugepages= and
> + hugepagesz=.
> hugepagesz= [HW,IA-64,PPC,X86-64] The size of the HugeTLB pages.
> On x86-64 and powerpc, this option can be specified
> multiple times interleaved with hugepages= to reserve
> diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
> index 8c9f647..91c5c98 100644
> --- a/arch/x86/mm/hugetlbpage.c
> +++ b/arch/x86/mm/hugetlbpage.c
> @@ -188,4 +188,39 @@ static __init int setup_hugepagesz(char *opt)
> return 1;
> }
> __setup("hugepagesz=", setup_hugepagesz);
> +
> +static __init int setup_hugepagesnid(char *opt)
> +{
> + unsigned long order, ps, nid, nr_pages;
> + char size_str[3];
> +
> + size_str[2] = '\0';
> + if (sscanf(opt, "%lu,%lu,%c%c",
> + &nid, &nr_pages, &size_str[0], &size_str[1]) < 4) {
> + printk(KERN_ERR "hugepagesnid: failed to parse arguments\n");
> + return 0;
> + }
> +
> + if (!nr_pages) {
> + printk(KERN_ERR
> + "hugepagesnid: zero number of pages, ignoring\n");
> + return 0;
> + }
> +
> + ps = memparse(size_str, NULL);
> + if (ps == PMD_SIZE) {
> + order = PMD_SHIFT - PAGE_SHIFT;
> + } else if (ps == PUD_SIZE && cpu_has_gbpages) {
> + order = PUD_SHIFT - PAGE_SHIFT;
> + } else {
> + printk(KERN_ERR "hugepagesnid: Unsupported page size %lu M\n",
> + ps >> 20);
> + return 0;
> + }

You must check whether nid is valid. When nid is MAX_NUMNODES or more,
hugetlb_add_nrpages_nid() writes nr_pages to the wrong memory region.

Thanks,
Yasuaki Ishimatsu

> +
> + hugetlb_add_nrpages_nid(order, nid, nr_pages);
> + return 1;
> +}
> +__setup("hugepagesnid=", setup_hugepagesnid);
> +
> #endif
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 8c43cc4..aae2f9b 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -274,6 +274,8 @@ struct page *alloc_huge_page_noerr(struct vm_area_struct *vma,
> int __init alloc_bootmem_huge_page(struct hstate *h);
>
> void __init hugetlb_add_hstate(unsigned order);
> +void __init hugetlb_add_nrpages_nid(unsigned order, unsigned long nid,
> + unsigned long nr_pages);
> struct hstate *size_to_hstate(unsigned long size);
>
> #ifndef HUGE_MAX_HSTATE
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index c01cb9f..439c3b7 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -46,6 +46,7 @@ __initdata LIST_HEAD(huge_boot_pages);
> static struct hstate * __initdata parsed_hstate;
> static unsigned long __initdata default_hstate_max_huge_pages;
> static unsigned long __initdata default_hstate_size;
> +static unsigned long __initdata boot_alloc_nodes[HUGE_MAX_HSTATE][MAX_NUMNODES];
>
> /*
> * Protects updates to hugepage_freelists, hugepage_activelist, nr_huge_pages,
> @@ -1348,6 +1349,50 @@ static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
> h->max_huge_pages = i;
> }
>
> +static unsigned long __init alloc_huge_pages_nid(struct hstate *h,
> + int nid,
> + unsigned long nr_pages)
> +{
> + unsigned long i;
> + struct page *page;
> +
> + for (i = 0; i < nr_pages; i++) {
> + page = alloc_fresh_huge_page_node(h, nid);
> + if (!page) {
> + count_vm_event(HTLB_BUDDY_PGALLOC_FAIL);
> + break;
> + }
> + count_vm_event(HTLB_BUDDY_PGALLOC);
> + }
> +
> + return i;
> +}
> +
> +static unsigned long __init alloc_huge_pages_nodes(struct hstate *h)
> +{
> + unsigned long i, *entry, ret = 0;
> +
> + for (i = 0; i < MAX_NUMNODES; i++) {
> + entry = &boot_alloc_nodes[hstate_index(h)][i];
> + if (*entry > 0)
> + ret += alloc_huge_pages_nid(h, i, *entry);
> + }
> +
> + return ret;
> +}
> +
> +static void __init hugetlb_init_hstates_nodes(void)
> +{
> + struct hstate *h;
> + unsigned long ret;
> +
> + for_each_hstate(h)
> + if (h->order < MAX_ORDER) {
> + ret = alloc_huge_pages_nodes(h);
> + h->max_huge_pages += ret;
> + }
> +}
> +
> static void __init hugetlb_init_hstates(void)
> {
> struct hstate *h;
> @@ -1966,6 +2011,7 @@ static int __init hugetlb_init(void)
> default_hstate.max_huge_pages = default_hstate_max_huge_pages;
>
> hugetlb_init_hstates();
> + hugetlb_init_hstates_nodes();
> gather_bootmem_prealloc();
> report_hugepages();
>
> @@ -2005,6 +2051,37 @@ void __init hugetlb_add_hstate(unsigned order)
> parsed_hstate = h;
> }
>
> +void __init hugetlb_add_nrpages_nid(unsigned order, unsigned long nid,
> + unsigned long nr_pages)
> +{
> + struct hstate *h;
> + unsigned long *p;
> +
> + if (parsed_hstate) {
> + printk(KERN_WARNING
> + "hugepagesnid: hugepagesz= specified, ignoring\n");
> + return;
> + }
> +
> + for_each_hstate(h)
> + if (h->order == order)
> + break;
> +
> + if (h->order != order) {
> + hugetlb_add_hstate(order);
> + parsed_hstate = NULL;
> + }
> +
> + p = &boot_alloc_nodes[hstate_index(h)][nid];
> + if (*p != 0) {
> + printk(KERN_WARNING
> + "hugepagesnid: node %lu already specified, ignoring\n", nid);
> + return;
> + }
> +
> + *p = nr_pages;
> +}
> +
> static int __init hugetlb_nrpages_setup(char *s)
> {
> unsigned long *mhp;
>

2014-02-12 03:59:44

by David Rientjes

Subject: Re: [PATCH 0/4] hugetlb: add hugepagesnid= command-line option

On Tue, 11 Feb 2014, Luiz Capitulino wrote:

> > > HugeTLB command-line option hugepages= allows the user to specify how many
> > > huge pages should be allocated at boot. On NUMA systems, this argument
> > > automatically distributes huge pages allocation among nodes, which can
> > > be undesirable.
> > >
> >
> > And when hugepages can no longer be allocated on a node because it is too
> > small, the remaining hugepages are distributed over nodes with memory
> > available, correct?
>
> No. hugepagesnid= tries to obey what was specified by the uses as much as
> possible.

I'm referring to what I quoted above, the hugepages= parameter. I'm
saying that using existing functionality you can reserve an excess of
hugepages and then free unneeded hugepages at runtime to get the desired
amount allocated only on a specific node.

> > Strange, it would seem better to just reserve as many hugepages as you
> > want so that you get the desired number on each node and then free the
> > ones you don't need at runtime.
>
> You mean, for example, if I have a 2 node system and want 2 1G huge pages
> from node 1, then I have to allocate 4 1G huge pages and then free 2 pages
> on node 0 after boot? That seems very cumbersome to me. Besides, what if
> node0 needs this memory during boot?
>

All of this functionality, including the current hugepages= reservation at
boot, needs to show that it can't be done as late as an initscript doing the
reservation at runtime, when userspace first becomes available and
fragmentation is at its lowest level.

I don't see any justification given in the patchset that suggests you
can't simply do this in an initscript if it is possible to allocate 1GB
pages at runtime. If it's too late because of oom, then your userspace is
going to oom anyway if you reserve the hugepages at boot; if it's too late
because of fragmentation, let's work on that issue (and justify why
things like movablecore= don't work for you).

2014-02-12 04:02:00

by David Rientjes

Subject: Re: [PATCH 0/4] hugetlb: add hugepagesnid= command-line option

On Wed, 12 Feb 2014, Andi Kleen wrote:

> > The real syntax is hugepagesnid=nid,nr-pages,size. Which looks straightforward
> > to me. I honestly can't think of anything better than that, but I'm open for
> > suggestions.
>
> hugepages_node=nid:nr-pages:size,... ?
>

I think that if we actually want this support, it should behave like
hugepages= and hugepagesz=, i.e. you specify a hugepagesnode= and, if
present, all remaining hugepages= and hugepagesz= parameters act only on
that node unless overridden by another hugepagesnode=.
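If I read the proposal right, hugepagesnode= would act as a sticky modifier rather than carrying the full spec itself. A rough userspace sketch of those semantics follows; all the names and the tiny per-node table are hypothetical, just to make the parsing order concrete:

```c
#include <assert.h>

#define DEMO_MAX_NODES 8

static int current_nid = -1;		/* -1: no node selected yet */
static unsigned long node_nr_pages[DEMO_MAX_NODES];

/* hugepagesnode=<nid>: select the node following parameters act on */
static void setup_hugepagesnode(int nid)
{
	current_nid = nid;
}

/* hugepages=<n>: applies to the currently selected node, if any */
static void setup_hugepages(unsigned long nr)
{
	if (current_nid >= 0 && current_nid < DEMO_MAX_NODES)
		node_nr_pages[current_nid] += nr;
	/* else: today's behaviour, interleave across all nodes (not shown) */
}
```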

2014-02-12 10:39:30

by Mel Gorman

Subject: Re: [PATCH 0/4] hugetlb: add hugepagesnid= command-line option

On Tue, Feb 11, 2014 at 06:15:57PM -0200, Marcelo Tosatti wrote:
> On Tue, Feb 11, 2014 at 05:10:35PM +0000, Mel Gorman wrote:
> > On Tue, Feb 11, 2014 at 01:26:29PM -0200, Marcelo Tosatti wrote:
> > > > Or take a stab at allocating 1G pages at runtime. It would require
> > > > finding properly aligned 1Gs worth of contiguous MAX_ORDER_NR_PAGES at
> > > > runtime. I would expect it would only work very early in the lifetime of
> > > > the system but if the user is willing to use kernel parameters to
> > > > allocate them then it should not be an issue.
> > >
> > > Can be an improvement on top of the current patchset? Certain use-cases
> > > require allocation guarantees (even if that requires kernel parameters).
> > >
> >
> > Sure, they're not mutually exclusive. It would just avoid the need to
> > create a new kernel parameter and use the existing interfaces.
>
> Yes, the problem is there is no guarantee, is there?
>

There is no guarantee anyway and early in the lifetime of the system there
is going to be very little difference in success rates. In case there is a
misunderstanding here, I'm not looking to NAK a series that adds another
kernel parameter. If it was me, I would have tried runtime allocation
first to avoid adding a new interface but it's a personal preference.

--
Mel Gorman
SUSE Labs

2014-02-12 19:33:56

by Luiz Capitulino

Subject: Re: [PATCH 0/4] hugetlb: add hugepagesnid= command-line option

On Wed, 12 Feb 2014 03:37:11 +0100
Andi Kleen <[email protected]> wrote:

> > The real syntax is hugepagesnid=nid,nr-pages,size. Which looks straightforward
> > to me. I honestly can't think of anything better than that, but I'm open for
> > suggestions.
>
> hugepages_node=nid:nr-pages:size,... ?

Looks good, I'll consider using it for v2.
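A userspace sketch of how the proposed colon/comma syntax could parse; parse_size() stands in for the kernel's memparse(), and the struct and limits here are made up for the demo:

```c
#include <stdlib.h>

struct node_spec {
	unsigned long nid, nr_pages;
	unsigned long long size;
};

/* Tiny stand-in for the kernel's memparse() (K/M/G suffixes). */
static unsigned long long parse_size(const char *s, char **end)
{
	unsigned long long v = strtoull(s, end, 0);

	switch (**end) {
	case 'G': v <<= 10; /* fall through */
	case 'M': v <<= 10; /* fall through */
	case 'K': v <<= 10; (*end)++; break;
	}
	return v;
}

/*
 * Parse the proposed "nid:nr-pages:size,..." syntax into an array.
 * Returns the number of specs parsed, or -1 on malformed input.
 */
static int parse_hugepages_node(const char *opt, struct node_spec *out, int max)
{
	int n = 0;
	char *end;

	while (*opt && n < max) {
		out[n].nid = strtoul(opt, &end, 10);
		if (*end != ':')
			return -1;
		out[n].nr_pages = strtoul(end + 1, &end, 10);
		if (*end != ':')
			return -1;
		out[n].size = parse_size(end + 1, &end);
		n++;
		if (*end == ',')
			opt = end + 1;
		else if (*end == '\0')
			return n;
		else
			return -1;
	}
	return *opt ? -1 : n;
}
```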

2014-02-12 19:39:39

by Luiz Capitulino

Subject: Re: [PATCH 3/4] hugetlb: add hugepagesnid= command-line option

On Wed, 12 Feb 2014 12:44:36 +0900
Yasuaki Ishimatsu <[email protected]> wrote:

> (2014/02/11 2:27), Luiz Capitulino wrote:
> > From: Luiz capitulino <[email protected]>
> >
> > The HugeTLB command-line option hugepages= allows the user to specify
> > how many huge pages should be allocated at boot-time. On NUMA systems,
> > this option will try to automatically distribute the allocation equally
> > among nodes. For example, if you have a 2-node NUMA system and allocate
> > 200 huge pages, then hugepages= will try to allocate 100 huge pages from
> > node0 and 100 from node1.
> >
> > The hugepagesnid= option introduced by this commit allows the user
> > to specify the nodes from which huge pages are allocated. For example,
> > if you have a 2-node NUMA system and want 300 2M huge pages to be
> > allocated from node1, you could do:
> >
> > hugepagesnid=1,300,2M
> >
> > Or, say you want node0 to allocate 100 huge pages and node1 to
> > allocate 200 huge pages, you do:
> >
> > hugepagesnid=0,100,2M hugepagesnid=1,200,2M
> >
> > This commit adds support only for 2M huge pages, next commit will
> > add support for 1G huge pages.
> >
> > Signed-off-by: Luiz Capitulino <[email protected]>
> > ---
> > Documentation/kernel-parameters.txt | 8 ++++
> > arch/x86/mm/hugetlbpage.c | 35 +++++++++++++++++
> > include/linux/hugetlb.h | 2 +
> > mm/hugetlb.c | 77 +++++++++++++++++++++++++++++++++++++
> > 4 files changed, 122 insertions(+)
> >
> > diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
> > index 7116fda..3cbe950 100644
> > --- a/Documentation/kernel-parameters.txt
> > +++ b/Documentation/kernel-parameters.txt
> > @@ -1125,6 +1125,14 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
> > registers. Default set by CONFIG_HPET_MMAP_DEFAULT.
> >
> > hugepages= [HW,X86-32,IA-64] HugeTLB pages to allocate at boot.
> > + hugepagesnid= [X86-64] HugeTLB pages to allocate at boot on a NUMA system.
> > + Format: <nid>,<nrpages>,<size>
> > + nid: NUMA node id to allocate pages from
> > + nrpages: number of huge pages to allocate
> > + size: huge pages size (see hugepagesz below)
> > + This argument can be specified multiple times for different
> > + NUMA nodes, but it shouldn't be mixed with hugepages= and
> > + hugepagesz=.
> > hugepagesz= [HW,IA-64,PPC,X86-64] The size of the HugeTLB pages.
> > On x86-64 and powerpc, this option can be specified
> > multiple times interleaved with hugepages= to reserve
> > diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
> > index 8c9f647..91c5c98 100644
> > --- a/arch/x86/mm/hugetlbpage.c
> > +++ b/arch/x86/mm/hugetlbpage.c
> > @@ -188,4 +188,39 @@ static __init int setup_hugepagesz(char *opt)
> > return 1;
> > }
> > __setup("hugepagesz=", setup_hugepagesz);
> > +
> > +static __init int setup_hugepagesnid(char *opt)
> > +{
> > + unsigned long order, ps, nid, nr_pages;
> > + char size_str[3];
> > +
> > + size_str[2] = '\0';
> > + if (sscanf(opt, "%lu,%lu,%c%c",
> > + &nid, &nr_pages, &size_str[0], &size_str[1]) < 4) {
> > + printk(KERN_ERR "hugepagesnid: failed to parse arguments\n");
> > + return 0;
> > + }
> > +
> > + if (!nr_pages) {
> > + printk(KERN_ERR
> > + "hugepagesnid: zero number of pages, ignoring\n");
> > + return 0;
> > + }
> > +
> > + ps = memparse(size_str, NULL);
> > + if (ps == PMD_SIZE) {
> > + order = PMD_SHIFT - PAGE_SHIFT;
> > + } else if (ps == PUD_SIZE && cpu_has_gbpages) {
> > + order = PUD_SHIFT - PAGE_SHIFT;
> > + } else {
> > + printk(KERN_ERR "hugepagesnid: Unsupported page size %lu M\n",
> > + ps >> 20);
> > + return 0;
> > + }
>
> You must check whether nid is valid. When nid is MAX_NUMNODES or more,
> hugetlb_add_nrpages_nid() writes nr_pages to the wrong memory region.

Correct. I'll fix it, thanks for reviewing.
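The check itself is small; a userspace sketch of the bounds test that would need to run before indexing boot_alloc_nodes[][]. MAX_NUMNODES is a placeholder value here (the kernel derives it from CONFIG_NODES_SHIFT), and in-kernel code would likely also want node_online(nid):

```c
#include <assert.h>

#define MAX_NUMNODES 64	/* placeholder; set by CONFIG_NODES_SHIFT in-kernel */

/* Reject out-of-range nids before they index boot_alloc_nodes[][]. */
static int hugepagesnid_valid_nid(unsigned long nid)
{
	return nid < MAX_NUMNODES;
}
```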

>
> Thanks,
> Yasuaki Ishimatsu
>
> > +
> > + hugetlb_add_nrpages_nid(order, nid, nr_pages);
> > + return 1;
> > +}
> > +__setup("hugepagesnid=", setup_hugepagesnid);
> > +
> > #endif
> > diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> > index 8c43cc4..aae2f9b 100644
> > --- a/include/linux/hugetlb.h
> > +++ b/include/linux/hugetlb.h
> > @@ -274,6 +274,8 @@ struct page *alloc_huge_page_noerr(struct vm_area_struct *vma,
> > int __init alloc_bootmem_huge_page(struct hstate *h);
> >
> > void __init hugetlb_add_hstate(unsigned order);
> > +void __init hugetlb_add_nrpages_nid(unsigned order, unsigned long nid,
> > + unsigned long nr_pages);
> > struct hstate *size_to_hstate(unsigned long size);
> >
> > #ifndef HUGE_MAX_HSTATE
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index c01cb9f..439c3b7 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -46,6 +46,7 @@ __initdata LIST_HEAD(huge_boot_pages);
> > static struct hstate * __initdata parsed_hstate;
> > static unsigned long __initdata default_hstate_max_huge_pages;
> > static unsigned long __initdata default_hstate_size;
> > +static unsigned long __initdata boot_alloc_nodes[HUGE_MAX_HSTATE][MAX_NUMNODES];
> >
> > /*
> > * Protects updates to hugepage_freelists, hugepage_activelist, nr_huge_pages,
> > @@ -1348,6 +1349,50 @@ static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
> > h->max_huge_pages = i;
> > }
> >
> > +static unsigned long __init alloc_huge_pages_nid(struct hstate *h,
> > + int nid,
> > + unsigned long nr_pages)
> > +{
> > + unsigned long i;
> > + struct page *page;
> > +
> > + for (i = 0; i < nr_pages; i++) {
> > + page = alloc_fresh_huge_page_node(h, nid);
> > + if (!page) {
> > + count_vm_event(HTLB_BUDDY_PGALLOC_FAIL);
> > + break;
> > + }
> > + count_vm_event(HTLB_BUDDY_PGALLOC);
> > + }
> > +
> > + return i;
> > +}
> > +
> > +static unsigned long __init alloc_huge_pages_nodes(struct hstate *h)
> > +{
> > + unsigned long i, *entry, ret = 0;
> > +
> > + for (i = 0; i < MAX_NUMNODES; i++) {
> > + entry = &boot_alloc_nodes[hstate_index(h)][i];
> > + if (*entry > 0)
> > + ret += alloc_huge_pages_nid(h, i, *entry);
> > + }
> > +
> > + return ret;
> > +}
> > +
> > +static void __init hugetlb_init_hstates_nodes(void)
> > +{
> > + struct hstate *h;
> > + unsigned long ret;
> > +
> > + for_each_hstate(h)
> > + if (h->order < MAX_ORDER) {
> > + ret = alloc_huge_pages_nodes(h);
> > + h->max_huge_pages += ret;
> > + }
> > +}
> > +
> > static void __init hugetlb_init_hstates(void)
> > {
> > struct hstate *h;
> > @@ -1966,6 +2011,7 @@ static int __init hugetlb_init(void)
> > default_hstate.max_huge_pages = default_hstate_max_huge_pages;
> >
> > hugetlb_init_hstates();
> > + hugetlb_init_hstates_nodes();
> > gather_bootmem_prealloc();
> > report_hugepages();
> >
> > @@ -2005,6 +2051,37 @@ void __init hugetlb_add_hstate(unsigned order)
> > parsed_hstate = h;
> > }
> >
> > +void __init hugetlb_add_nrpages_nid(unsigned order, unsigned long nid,
> > + unsigned long nr_pages)
> > +{
> > + struct hstate *h;
> > + unsigned long *p;
> > +
> > + if (parsed_hstate) {
> > + printk(KERN_WARNING
> > + "hugepagesnid: hugepagesz= specified, ignoring\n");
> > + return;
> > + }
> > +
> > + for_each_hstate(h)
> > + if (h->order == order)
> > + break;
> > +
> > + if (h->order != order) {
> > + hugetlb_add_hstate(order);
> > + parsed_hstate = NULL;
> > + }
> > +
> > + p = &boot_alloc_nodes[hstate_index(h)][nid];
> > + if (*p != 0) {
> > + printk(KERN_WARNING
> > + "hugepagesnid: node %lu already specified, ignoring\n", nid);
> > + return;
> > + }
> > +
> > + *p = nr_pages;
> > +}
> > +
> > static int __init hugetlb_nrpages_setup(char *s)
> > {
> > unsigned long *mhp;
> >
>
>

2014-02-12 20:09:33

by Luiz Capitulino

Subject: Re: [PATCH 0/4] hugetlb: add hugepagesnid= command-line option

On Tue, 11 Feb 2014 19:59:40 -0800 (PST)
David Rientjes <[email protected]> wrote:

> On Tue, 11 Feb 2014, Luiz Capitulino wrote:
>
> > > > HugeTLB command-line option hugepages= allows the user to specify how many
> > > > huge pages should be allocated at boot. On NUMA systems, this argument
> > > > automatically distributes huge pages allocation among nodes, which can
> > > > be undesirable.
> > > >
> > >
> > > And when hugepages can no longer be allocated on a node because it is too
> > > small, the remaining hugepages are distributed over nodes with memory
> > > available, correct?
> >
No. hugepagesnid= tries to obey what was specified by the user as much as
possible.
>
> I'm referring to what I quoted above, the hugepages= parameter.

Oh, OK.

> I'm
> saying that using existing functionality you can reserve an excess of
> hugepages and then free unneeded hugepages at runtime to get the desired
> amount allocated only on a specific node.

I got that part. I just think this is not a good solution, as I explained
below.

> > > Strange, it would seem better to just reserve as many hugepages as you
> > > want so that you get the desired number on each node and then free the
> > > ones you don't need at runtime.
> >
> > You mean, for example, if I have a 2 node system and want 2 1G huge pages
> > from node 1, then I have to allocate 4 1G huge pages and then free 2 pages
> > on node 0 after boot? That seems very cumbersome to me. Besides, what if
> > node0 needs this memory during boot?
> >
>
> All of this functionality, including the current hugepages= reservation at
> boot, needs to show that it can't be done as late as when you could run an
> initscript to do the reservation at runtime and fragmentation is at its
> lowest level when userspace first becomes available.

It's not that it can't. The point is that for 1G huge pages it's more
reliable to allocate them as early as possible during the kernel boot
process. I'm all for having/improving 1G allocation support at run-time,
and volunteer to help with that effort, but that's something that can
(and IMO should) be done on top of this series.

> I don't see any justification given in the patchset that suggests you
> can't simply do this in an initscript if it is possible to allocate 1GB
> pages at runtime. If it's too late because of oom, then your userspace is
> going to oom anyway if you reserve the hugepages at boot; if it's too late
> because of fragmentation, let's work on that issue (and justification why
> things like movablecore= don't work for you).
>