On a NUMA system, HugeTLB provides support for allocating per-node huge pages
through sysfs. For example, to allocate 300 2M huge pages on node1, one can do:
echo 300 > /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages
This works as long as you have enough contiguous pages. Which may work for
2M pages, but for 1G huge pages this is likely to fail due to fragmentation
or won't even work actually, as allocating more than MAX_ORDER pages at runtime
doesn't seem to work out of the box for some archs. For 1G huge pages it's
better or even required to reserve them during the kernel boot, which is when
the allocation is more likely to succeed.
To this end we have the hugepages= command-line option, which works but misses
the per node allocation support. This option evenly distributes huge pages
among nodes on a NUMA system. This behavior is very limiting and unflexible.
There are use-cases where users wants to be able to specify which nodes 1G
huge pages should be allocated from.
This series addresses this problem by adding a new kernel comand-line option
called hugepages_node=, which users can use to configure initial huge page
allocation on NUMA. The new option syntax is:
hugepages_node=nid:nr_pages:size,...
For example, this command-line:
hugepages_node=0:300:2M,1:4:1G
Allocates 300 2M huge pages from node0 and 4 1G huge pages from node1.
hugepages_node= is non-intrusive (it doesn't touch any core HugeTLB code).
Indeed, all functions and the array added by this series are run only once
and discarded after boot. All the hugepages_node= option does it to set
initial huge page allocation among NUMA nodes.
Changelog:
o v2
- Change syntax to hugepages_node=nid:nr_pages:size,... [Andi Kleen]
- Several small improvements [Andrew Morton]
- Validate node index [Yasuaki Ishimatsu]
- Use GFP_THISNODE [Mel Gorman]
- Fold 2MB support patch with 1GB support patch
- Improve logs and intro email
Luiz capitulino (4):
memblock: memblock_virt_alloc_internal(): add __GFP_THISNODE flag
support
memblock: add memblock_virt_alloc_nid_nopanic()
hugetlb: add parse_pagesize_str()
hugetlb: add hugepages_node= command-line option
Documentation/kernel-parameters.txt | 8 +++
arch/x86/mm/hugetlbpage.c | 77 ++++++++++++++++++++++++--
include/linux/bootmem.h | 4 ++
include/linux/hugetlb.h | 2 +
mm/hugetlb.c | 106 ++++++++++++++++++++++++++++++++++++
mm/memblock.c | 44 ++++++++++++++-
6 files changed, 232 insertions(+), 9 deletions(-)
--
1.8.1.4
From: Luiz capitulino <[email protected]>
Currently, if an allocation from the node specified by the nid argument
fails, memblock_virt_alloc_internal() automatically tries to allocate memory
from other nodes.
This is fine if the caller don't care about which node is going to allocate
the memory. However, there are cases where the caller wants memory to be
allocated from the specified node only. If that's not possible, then
memblock_virt_alloc_internal() should just fail.
This commit adds a new flags argument to memblock_virt_alloc_internal()
where the caller can control this behavior. The flags argument is of type
gfp_t, so that we can (re-)use the __GFP_THISNODE definition.
Signed-off-by: Luiz capitulino <[email protected]>
---
mm/memblock.c | 14 +++++++++++---
1 file changed, 11 insertions(+), 3 deletions(-)
diff --git a/mm/memblock.c b/mm/memblock.c
index 39a31e7..f3821ef 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -1035,12 +1035,18 @@ phys_addr_t __init memblock_alloc_try_nid(phys_addr_t size, phys_addr_t align, i
* @min_addr: the lower bound of the memory region to allocate (phys address)
* @max_addr: the upper bound of the memory region to allocate (phys address)
* @nid: nid of the free area to find, %NUMA_NO_NODE for any node
+ * @flags: control how memory is allocated
*
* The @min_addr limit is dropped if it can not be satisfied and the allocation
* will fall back to memory below @min_addr. Also, allocation may fall back
* to any node in the system if the specified node can not
* hold the requested memory.
*
+ * The @flags argument is one of:
+ *
+ * %__GFP_THISNODE: memory is allocated from the node specified by @nid only.
+ * If that fails, an error is returned
+ *
* The allocation is performed from memory region limited by
* memblock.current_limit if @max_addr == %BOOTMEM_ALLOC_ACCESSIBLE.
*
@@ -1058,7 +1064,7 @@ phys_addr_t __init memblock_alloc_try_nid(phys_addr_t size, phys_addr_t align, i
static void * __init memblock_virt_alloc_internal(
phys_addr_t size, phys_addr_t align,
phys_addr_t min_addr, phys_addr_t max_addr,
- int nid)
+ int nid, gfp_t flags)
{
phys_addr_t alloc;
void *ptr;
@@ -1085,6 +1091,8 @@ again:
nid);
if (alloc)
goto done;
+ if (flags & __GFP_THISNODE)
+ goto error;
if (nid != NUMA_NO_NODE) {
alloc = memblock_find_in_range_node(size, align, min_addr,
@@ -1145,7 +1153,7 @@ void * __init memblock_virt_alloc_try_nid_nopanic(
__func__, (u64)size, (u64)align, nid, (u64)min_addr,
(u64)max_addr, (void *)_RET_IP_);
return memblock_virt_alloc_internal(size, align, min_addr,
- max_addr, nid);
+ max_addr, nid, 0);
}
/**
@@ -1177,7 +1185,7 @@ void * __init memblock_virt_alloc_try_nid(
__func__, (u64)size, (u64)align, nid, (u64)min_addr,
(u64)max_addr, (void *)_RET_IP_);
ptr = memblock_virt_alloc_internal(size, align,
- min_addr, max_addr, nid);
+ min_addr, max_addr, nid, 0);
if (ptr)
return ptr;
--
1.8.1.4
From: Luiz capitulino <[email protected]>
The HugeTLB command-line option hugepages= allows a user to specify how
many huge pages should be allocated at boot. This option is needed because
it improves reliability when allocating 1G huge pages, which are better
allocated as early as possible due to fragmentation.
However, hugepages= has a limitation. On NUMA systems, hugepages= will
automatically distribute memory allocation equally among nodes. For
example, if you have a 2-node NUMA system and allocate 200 huge pages,
than hugepages= will try to allocate 100 huge pages from node0 and 100
from node1.
This is very unflexible, as it doesn't allow you to specify which nodes
the huge pages should be allocated from. For example, there are use-cases
where the user wants to specify that a 1GB huge page should be allocated
from node 2 or that 300 2MB huge pages should be allocated from node 0.
The hugepages_node= command-line option introduced by this commit allows
just that.
The syntax is:
hugepages_node=nid:nr_pages:size,...
For example, to allocate one 1GB page from node 2, you can do:
hugepages_node=2:1:1G
Or, to allocate 300 2MB huge pages from node 0 and 5 1GB huge pages from node 1:
hugepages_node=0:300:2M,1:5:1G
Also, please note the following:
- All the hugepages_node= option does is to set initial memory allocation
distribution among nodes. It doesn't do anything intrusive. All functions
and the array added by this commit are run onlt once at boot and discarded
thereafter
- This commit adds support only for the x86 architecture. Adding support for
other archs is welcome and should be simple, it's just a matter of porting
setup_hugepages_node()
- When an error is encountered while parsing, allocations are obeyed
up to the error and parsing is then aborted. This is simplest way to
deal with errors
- Mixing hugepages_node= and hugepages= options is not supported
Signed-off-by: Luiz capitulino <[email protected]>
---
Documentation/kernel-parameters.txt | 8 +++
arch/x86/mm/hugetlbpage.c | 52 ++++++++++++++++++
include/linux/hugetlb.h | 2 +
mm/hugetlb.c | 106 ++++++++++++++++++++++++++++++++++++
4 files changed, 168 insertions(+)
diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 7116fda..bbceb73 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -1133,6 +1133,14 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
(when the CPU supports the "pdpe1gb" cpuinfo flag)
Note that 1GB pages can only be allocated at boot time
using hugepages= and not freed afterwards.
+ hugepages_node= [X86-64] HugeTLB pages to allocate at boot on a NUMA system.
+ Format: <nid>:<nrpages>:<size>,...
+ nid: NUMA node id to allocate pages from
+ nrpages: number of huge pages to allocate
+ size: huge pages size (same as hugepagesz= above)
+ On error, allocations are obeyed up to the error and then parsing
+ is aborted. This option shouldn't be mixed with other hugepages=
+ options.
hvc_iucv= [S390] Number of z/VM IUCV hypervisor console (HVC)
terminal devices. Valid values: 0..8
diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
index 968db71..88318818 100644
--- a/arch/x86/mm/hugetlbpage.c
+++ b/arch/x86/mm/hugetlbpage.c
@@ -201,4 +201,56 @@ static __init int setup_hugepagesz(char *opt)
return 1;
}
__setup("hugepagesz=", setup_hugepagesz);
+
+/*
+ * Format is "nid:nr_pages:size,...". Example: "0:300:2M,1:10:1G"
+ *
+ * To make error handling simple we halt parsing on the first error,
+ * which means that the user's allocation requests will be obeyed until
+ * an error is found (if any).
+ */
+static __init int setup_hugepages_node(char *opt)
+{
+ unsigned nid, nr_pages, order;
+ char *p, size_str[16];
+ int ret;
+
+ do {
+ p = strchr(opt, ',');
+ if (p)
+ *p = '\0';
+
+ ret = sscanf(opt, "%u:%u:%15s", &nid, &nr_pages, size_str);
+ if (ret != 3) {
+ pr_err("hugepages_node: bad syntax, aborting\n");
+ return 0;
+ }
+
+ if (nid >= MAX_NUMNODES) {
+ pr_err("hugepages_node: invalid numa node: %u, "
+ "aborting\n", nid);
+ return 0;
+ }
+
+ ret = parse_pagesize_str(size_str, &order);
+ if (ret < 0) {
+ pr_err("hugepages_node: unsupported page size: %s, "
+ "aborting\n", size_str);
+ return 0;
+ }
+
+ ret = hugetlb_boot_alloc_add_nid(nid, nr_pages, order);
+ if (ret < 0)
+ return 0;
+
+ if (p) {
+ *p = ',';
+ opt = ++p;
+ }
+ } while (p);
+
+ return 1;
+}
+__setup("hugepages_node=", setup_hugepages_node);
+
#endif
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 8c43cc4..2c1c01a 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -274,6 +274,8 @@ struct page *alloc_huge_page_noerr(struct vm_area_struct *vma,
int __init alloc_bootmem_huge_page(struct hstate *h);
void __init hugetlb_add_hstate(unsigned order);
+int __init hugetlb_boot_alloc_add_nid(unsigned nid, unsigned nr_pages,
+ unsigned order);
struct hstate *size_to_hstate(unsigned long size);
#ifndef HUGE_MAX_HSTATE
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index c01cb9f..3e9e929 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -46,6 +46,7 @@ __initdata LIST_HEAD(huge_boot_pages);
static struct hstate * __initdata parsed_hstate;
static unsigned long __initdata default_hstate_max_huge_pages;
static unsigned long __initdata default_hstate_size;
+static unsigned int __initdata boot_alloc_nodes[HUGE_MAX_HSTATE][MAX_NUMNODES];
/*
* Protects updates to hugepage_freelists, hugepage_activelist, nr_huge_pages,
@@ -1348,6 +1349,50 @@ static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
h->max_huge_pages = i;
}
+static unsigned long __init alloc_huge_pages_nid(struct hstate *h,
+ int nid,
+ unsigned long nr_pages)
+{
+ unsigned long i;
+ struct page *page;
+
+ for (i = 0; i < nr_pages; i++) {
+ page = alloc_fresh_huge_page_node(h, nid);
+ if (!page) {
+ count_vm_event(HTLB_BUDDY_PGALLOC_FAIL);
+ break;
+ }
+ count_vm_event(HTLB_BUDDY_PGALLOC);
+ }
+
+ return i;
+}
+
+static unsigned __init alloc_huge_pages_nodes(struct hstate *h)
+{
+ unsigned i, *entry, ret = 0;
+
+ for (i = 0; i < MAX_NUMNODES; i++) {
+ entry = &boot_alloc_nodes[hstate_index(h)][i];
+ if (*entry > 0)
+ ret += alloc_huge_pages_nid(h, i, *entry);
+ }
+
+ return ret;
+}
+
+static void __init hugetlb_init_hstates_nodes(void)
+{
+ struct hstate *h;
+ unsigned ret;
+
+ for_each_hstate(h)
+ if (h->order < MAX_ORDER) {
+ ret = alloc_huge_pages_nodes(h);
+ h->max_huge_pages += ret;
+ }
+}
+
static void __init hugetlb_init_hstates(void)
{
struct hstate *h;
@@ -1966,6 +2011,7 @@ static int __init hugetlb_init(void)
default_hstate.max_huge_pages = default_hstate_max_huge_pages;
hugetlb_init_hstates();
+ hugetlb_init_hstates_nodes();
gather_bootmem_prealloc();
report_hugepages();
@@ -2005,6 +2051,66 @@ void __init hugetlb_add_hstate(unsigned order)
parsed_hstate = h;
}
+static void __init hugetlb_hstate_alloc_pages_nid(struct hstate *h,
+ int nid,
+ unsigned long nr_pages)
+{
+ struct huge_bootmem_page *m;
+ unsigned long i;
+ void *addr;
+
+ for (i = 0; i < nr_pages; i++) {
+ addr = memblock_virt_alloc_nid_nopanic(
+ huge_page_size(h), huge_page_size(h),
+ 0, BOOTMEM_ALLOC_ACCESSIBLE, nid);
+ if (!addr)
+ break;
+ m = addr;
+ BUG_ON(!IS_ALIGNED((unsigned long) virt_to_phys(m),
+ huge_page_size(h)));
+ list_add(&m->list, &huge_boot_pages);
+ m->hstate = h;
+ }
+
+ h->max_huge_pages += i;
+}
+
+int __init hugetlb_boot_alloc_add_nid(unsigned nid, unsigned nr_pages,
+ unsigned order)
+{
+ struct hstate *h;
+ unsigned *p;
+
+ if (parsed_hstate) {
+ pr_err("hugepages_node: hugepagesz has been specified, "
+ "aborting\n");
+ return -1;
+ }
+
+ for_each_hstate(h)
+ if (h->order == order)
+ break;
+
+ if (h->order != order) {
+ hugetlb_add_hstate(order);
+ parsed_hstate = NULL;
+ }
+
+ p = &boot_alloc_nodes[hstate_index(h)][nid];
+ if (*p != 0) {
+ pr_err("hugepages_node: node %u already specified, "
+ "aborting\n", nid);
+ return -1;
+ }
+
+ *p = nr_pages;
+
+ if (h->order >= MAX_ORDER)
+ hugetlb_hstate_alloc_pages_nid(h, nid, nr_pages);
+
+ return 0;
+}
+
static int __init hugetlb_nrpages_setup(char *s)
{
unsigned long *mhp;
--
1.8.1.4
From: Luiz capitulino <[email protected]>
This commit moves current setup_hugepagez() logic to function called
parse_pagesize_str(), so that it can be used by the next commit.
There should be no functional changes, except for the following:
- When calling memparse(), setup_hugepagesz() was passing the retptr
argument, but the result was unused. So parse_pagesize_str() pass NULL
instead
- Change printk(KERN_ERR) to pr_err() and make the error message a little
bit nicer for bad command-lines like "hugepagesz=1X"
Signed-off-by: Luiz capitulino <[email protected]>
---
arch/x86/mm/hugetlbpage.c | 25 +++++++++++++++++++------
1 file changed, 19 insertions(+), 6 deletions(-)
diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
index 8c9f647..968db71 100644
--- a/arch/x86/mm/hugetlbpage.c
+++ b/arch/x86/mm/hugetlbpage.c
@@ -173,18 +173,31 @@ hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
#endif /* CONFIG_HUGETLB_PAGE */
#ifdef CONFIG_X86_64
-static __init int setup_hugepagesz(char *opt)
+static __init int parse_pagesize_str(char *str, unsigned *order)
{
- unsigned long ps = memparse(opt, &opt);
+ unsigned long ps = memparse(str, NULL);
if (ps == PMD_SIZE) {
- hugetlb_add_hstate(PMD_SHIFT - PAGE_SHIFT);
+ *order = PMD_SHIFT - PAGE_SHIFT;
} else if (ps == PUD_SIZE && cpu_has_gbpages) {
- hugetlb_add_hstate(PUD_SHIFT - PAGE_SHIFT);
+ *order = PUD_SHIFT - PAGE_SHIFT;
} else {
- printk(KERN_ERR "hugepagesz: Unsupported page size %lu M\n",
- ps >> 20);
+ /* invalid page size */
+ return -1;
+ }
+
+ return 0;
+}
+
+static __init int setup_hugepagesz(char *opt)
+{
+ unsigned order;
+
+ if (parse_pagesize_str(opt, &order)) {
+ pr_err("hugepagesz: Unsupported page size %s\n", opt);
return 0;
}
+
+ hugetlb_add_hstate(order);
return 1;
}
__setup("hugepagesz=", setup_hugepagesz);
--
1.8.1.4
From: Luiz capitulino <[email protected]>
This function tries to allocate memory from the specified node only (vs.
falling back to other nodes).
This is going to be used by HugeTLB boot-time allocation code in next
commits.
Signed-off-by: Luiz capitulino <[email protected]>
---
include/linux/bootmem.h | 4 ++++
mm/memblock.c | 30 ++++++++++++++++++++++++++++++
2 files changed, 34 insertions(+)
diff --git a/include/linux/bootmem.h b/include/linux/bootmem.h
index db51fe4..6961b11 100644
--- a/include/linux/bootmem.h
+++ b/include/linux/bootmem.h
@@ -153,6 +153,10 @@ extern void *__alloc_bootmem_low_node(pg_data_t *pgdat,
void *memblock_virt_alloc_try_nid_nopanic(phys_addr_t size,
phys_addr_t align, phys_addr_t min_addr,
phys_addr_t max_addr, int nid);
+void * __init memblock_virt_alloc_nid_nopanic(
+ phys_addr_t size, phys_addr_t align,
+ phys_addr_t min_addr, phys_addr_t max_addr,
+ int nid);
void *memblock_virt_alloc_try_nid(phys_addr_t size, phys_addr_t align,
phys_addr_t min_addr, phys_addr_t max_addr, int nid);
void __memblock_free_early(phys_addr_t base, phys_addr_t size);
diff --git a/mm/memblock.c b/mm/memblock.c
index f3821ef..2a5d72e 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -1157,6 +1157,36 @@ void * __init memblock_virt_alloc_try_nid_nopanic(
}
/**
+ * memblock_virt_alloc_nid_nopanic - allocate boot memory block from
+ * specified node
+ * @size: size of memory block to be allocated in bytes
+ * @align: alignment of the region and block's size
+ * @min_addr: the lower bound of the memory region from where the allocation
+ * is preferred (phys address)
+ * @max_addr: the upper bound of the memory region from where the allocation
+ * is preferred (phys address), or %BOOTMEM_ALLOC_ACCESSIBLE to
+ * allocate only from memory limited by memblock.current_limit value
+ * @nid: nid of the free area to find, %NUMA_NO_NODE for any node
+ *
+ * This function tries to allocate memory from @nid only. It @nid doesn't
+ * have enough memory, this function returns failure.
+ *
+ * RETURNS:
+ * Virtual address of allocated memory block on success, NULL on failure.
+ */
+void * __init memblock_virt_alloc_nid_nopanic(
+ phys_addr_t size, phys_addr_t align,
+ phys_addr_t min_addr, phys_addr_t max_addr,
+ int nid)
+{
+ memblock_dbg("%s: %llu bytes align=0x%llx nid=%d from=0x%llx max_addr=0x%llx %pF\n",
+ __func__, (u64)size, (u64)align, nid, (u64)min_addr,
+ (u64)max_addr, (void *)_RET_IP_);
+ return memblock_virt_alloc_internal(size, align, min_addr,
+ max_addr, nid, __GFP_THISNODE);
+}
+
+/**
* memblock_virt_alloc_try_nid - allocate boot memory block with panicking
* @size: size of memory block to be allocated in bytes
* @align: alignment of the region and block's size
--
1.8.1.4
On Thu, 13 Feb 2014, Luiz Capitulino wrote:
> From: Luiz capitulino <[email protected]>
>
> The HugeTLB command-line option hugepages= allows a user to specify how
> many huge pages should be allocated at boot. This option is needed because
> it improves reliability when allocating 1G huge pages, which are better
> allocated as early as possible due to fragmentation.
>
> However, hugepages= has a limitation. On NUMA systems, hugepages= will
> automatically distribute memory allocation equally among nodes. For
> example, if you have a 2-node NUMA system and allocate 200 huge pages,
> than hugepages= will try to allocate 100 huge pages from node0 and 100
> from node1.
>
> This is very unflexible, as it doesn't allow you to specify which nodes
> the huge pages should be allocated from. For example, there are use-cases
> where the user wants to specify that a 1GB huge page should be allocated
> from node 2 or that 300 2MB huge pages should be allocated from node 0.
>
> The hugepages_node= command-line option introduced by this commit allows
> just that.
>
> The syntax is:
>
> hugepages_node=nid:nr_pages:size,...
>
Again, I think this syntax is horrendous and doesn't couple well with the
other hugepage-related kernel command line options. We already have
hugepages= and hugepagesz= which you can interleave on the command line to
get 100 2M hugepages and 10 1GB hugepages, for example.
This patchset is simply introducing another variable to the matter: the
node that the hugepages should be allocated on. So just introduce a
hugepagesnode= parameter to couple with the others so you can do
hugepagesz=<size> hugepagesnode=<nid> hugepages=<#>
instead of having completely confusing interfaces where you want to do
hugepages_node=1:1:1G for a 1GB hugepage on page 1 (and try remembering
which "1" means what, yuck) and "hugepagesz=1GB hugepages=1" if you're
indifferent to the node.
On Fri, 14 Feb 2014 15:14:22 -0800 (PST)
David Rientjes <[email protected]> wrote:
> On Thu, 13 Feb 2014, Luiz Capitulino wrote:
>
> > From: Luiz capitulino <[email protected]>
> >
> > The HugeTLB command-line option hugepages= allows a user to specify how
> > many huge pages should be allocated at boot. This option is needed because
> > it improves reliability when allocating 1G huge pages, which are better
> > allocated as early as possible due to fragmentation.
> >
> > However, hugepages= has a limitation. On NUMA systems, hugepages= will
> > automatically distribute memory allocation equally among nodes. For
> > example, if you have a 2-node NUMA system and allocate 200 huge pages,
> > than hugepages= will try to allocate 100 huge pages from node0 and 100
> > from node1.
> >
> > This is very unflexible, as it doesn't allow you to specify which nodes
> > the huge pages should be allocated from. For example, there are use-cases
> > where the user wants to specify that a 1GB huge page should be allocated
> > from node 2 or that 300 2MB huge pages should be allocated from node 0.
> >
> > The hugepages_node= command-line option introduced by this commit allows
> > just that.
> >
> > The syntax is:
> >
> > hugepages_node=nid:nr_pages:size,...
> >
>
> Again, I think this syntax is horrendous and doesn't couple well with the
> other hugepage-related kernel command line options. We already have
> hugepages= and hugepagesz= which you can interleave on the command line to
> get 100 2M hugepages and 10 1GB hugepages, for example.
>
> This patchset is simply introducing another variable to the matter: the
> node that the hugepages should be allocated on. So just introduce a
> hugepagesnode= parameter to couple with the others so you can do
>
> hugepagesz=<size> hugepagesnode=<nid> hugepages=<#>
That was my first try but it turned out really bad. First, for every node
you specify you need three options. So, if you want to setup memory for
three nodes you'll need to specify nine options. And it gets worse, because
hugepagesz= and hugepages= have strict ordering (which is a mistake, IMHO) so
you have to specify them in the right order otherwise things don't work as
expected and you have no idea why (have been there myself).
IMO, hugepages_node=<nid>:<nr_pages>:<size>,... is good enough. It's concise,
and don't depend on any other option to function. Also, there are lots of other
kernel command-line options that require you to specify multiple fields, so
it's not like hugepages_node= is totally different in that regard.
>
> instead of having completely confusing interfaces where you want to do
> hugepages_node=1:1:1G for a 1GB hugepage on page 1 (and try remembering
> which "1" means what, yuck) and "hugepagesz=1GB hugepages=1" if you're
> indifferent to the node.
>
On Fri, 14 Feb 2014, Luiz Capitulino wrote:
> > Again, I think this syntax is horrendous and doesn't couple well with the
> > other hugepage-related kernel command line options. We already have
> > hugepages= and hugepagesz= which you can interleave on the command line to
> > get 100 2M hugepages and 10 1GB hugepages, for example.
> >
> > This patchset is simply introducing another variable to the matter: the
> > node that the hugepages should be allocated on. So just introduce a
> > hugepagesnode= parameter to couple with the others so you can do
> >
> > hugepagesz=<size> hugepagesnode=<nid> hugepages=<#>
>
> That was my first try but it turned out really bad. First, for every node
> you specify you need three options.
Just like you need two options today to specify a number of hugepages of a
particular non-default size. You only need to use hugepagesz= or
hugepagenode= if you want a non-default size or a specify a particular
node.
> So, if you want to setup memory for
> three nodes you'll need to specify nine options.
And you currently need six if you want to specify three different hugepage
sizes (?). But who really specifies three different hugepage sizes on the
command line that are needed to be reserved at boot?
If that's really the usecase, it seems like you want the old
CONFIG_PAGE_SHIFT patch.
> And it gets worse, because
> hugepagesz= and hugepages= have strict ordering (which is a mistake, IMHO) so
> you have to specify them in the right order otherwise things don't work as
> expected and you have no idea why (have been there myself).
>
How is that difficult? hugepages= is the "noun", hugepagesz= is the
"adjective". hugepages=100 hugepagesz=1G hugepages=4 makes perfect sense
to me, and I actually don't allocate hugepages on the command line, nor
have I looked at Documentation/kernel-parameters.txt to check if I'm
constructing it correctly. It just makes sense and once you learn it it's
just natural.
> IMO, hugepages_node=<nid>:<nr_pages>:<size>,... is good enough. It's concise,
> and don't depend on any other option to function. Also, there are lots of other
> kernel command-line options that require you to specify multiple fields, so
> it's not like hugepages_node= is totally different in that regard.
>
I doubt Andrew is going to want a completely different format for hugepage
allocations that want to specify a node and have to deal with people who
say hugepages_node=2:1:1G and constantly have to lookup if it's 2
hugepages on node 1 or 1 hugepage on node 2.
On Sat, 15 Feb 2014 02:06:38 -0800 (PST)
David Rientjes <[email protected]> wrote:
> On Fri, 14 Feb 2014, Luiz Capitulino wrote:
>
> > > Again, I think this syntax is horrendous and doesn't couple well with the
> > > other hugepage-related kernel command line options. We already have
> > > hugepages= and hugepagesz= which you can interleave on the command line to
> > > get 100 2M hugepages and 10 1GB hugepages, for example.
> > >
> > > This patchset is simply introducing another variable to the matter: the
> > > node that the hugepages should be allocated on. So just introduce a
> > > hugepagesnode= parameter to couple with the others so you can do
> > >
> > > hugepagesz=<size> hugepagesnode=<nid> hugepages=<#>
> >
> > That was my first try but it turned out really bad. First, for every node
> > you specify you need three options.
>
> Just like you need two options today to specify a number of hugepages of a
> particular non-default size. You only need to use hugepagesz= or
> hugepagenode= if you want a non-default size or a specify a particular
> node.
>
> > So, if you want to setup memory for
> > three nodes you'll need to specify nine options.
>
> And you currently need six if you want to specify three different hugepage
> sizes (?). But who really specifies three different hugepage sizes on the
> command line that are needed to be reserved at boot?
hugepages= and hugepages_node= are similar, but have different semantics.
hugepagesz= and hugepages= create a pool of huge pages of the specified size.
This means that the number of times you specify those options are limited by
the number of different huge pages sizes an arch supports. For x86_64 for
example, this limit is two so one would not specify those options more than
two times. And this doesn't count default_hugepagesz=, which allows you to
drop one hugepagesz= option.
hugepages_node= allows you to allocate huge pages per node, so the number of
times you can specify this option is limited by the number of nodes. Also,
hugepages_node= create the pools, if necessary (at least one will be). For
this reason I think it makes a lot of sense to have different options.
> If that's really the usecase, it seems like you want the old
> CONFIG_PAGE_SHIFT patch.
>
> > And it gets worse, because
> > hugepagesz= and hugepages= have strict ordering (which is a mistake, IMHO) so
> > you have to specify them in the right order otherwise things don't work as
> > expected and you have no idea why (have been there myself).
> >
>
> How is that difficult? hugepages= is the "noun", hugepagesz= is the
> "adjective". hugepages=100 hugepagesz=1G hugepages=4 makes perfect sense
> to me, and I actually don't allocate hugepages on the command line, nor
> have I looked at Documentation/kernel-parameters.txt to check if I'm
> constructing it correctly. It just makes sense and once you learn it it's
> just natural.
>
> > IMO, hugepages_node=<nid>:<nr_pages>:<size>,... is good enough. It's concise,
> > and don't depend on any other option to function. Also, there are lots of other
> > kernel command-line options that require you to specify multiple fields, so
> > it's not like hugepages_node= is totally different in that regard.
> >
>
> I doubt Andrew is going to want a completely different format for hugepage
> allocations that want to specify a node and have to deal with people who
> say hugepages_node=2:1:1G and constantly have to lookup if it's 2
> hugepages on node 1 or 1 hugepage on node 2.
Andrew?
On Mon, 17 Feb 2014, Luiz Capitulino wrote:
> hugepages= and hugepages_node= are similar, but have different semantics.
>
> hugepagesz= and hugepages= create a pool of huge pages of the specified size.
> This means that the number of times you specify those options are limited by
> the number of different huge pages sizes an arch supports. For x86_64 for
> example, this limit is two so one would not specify those options more than
> two times. And this doesn't count default_hugepagesz=, which allows you to
> drop one hugepagesz= option.
>
> hugepages_node= allows you to allocate huge pages per node, so the number of
> times you can specify this option is limited by the number of nodes. Also,
> hugepages_node= create the pools, if necessary (at least one will be). For
> this reason I think it makes a lot of sense to have different options.
>
I understand you may want to add as much code as you can to the boot code
so that you can parse all this information in short-form, and it's
understood that it's possible to specify a different number of varying
hugepage sizes on individual nodes, but let's come back down to reality
here.
Lacking from your entire patchset is a specific example of what you want
to do. So I think we're all guessing what exactly your usecase is and we
aren't getting any help. Are you really suggesting that a customer wants
to allocate 4 1GB hugepages on node 0, 12 2MB hugepages on node 0, 6 1GB
hugepages on node 1, 24 2MB hugepages on node 1, 2 1GB hugepages on node
2, 100 2MB hugepages on node 3, etc? Please.
If that's actually the usecase then I'll renew my objection to the entire
patchset and say you want to add the ability to dynamically allocate 1GB
pages and free them at runtime early in initscripts. If something is
going to be added to init code in the kernel then it better be trivial
since all this can be duplicated in userspace if you really want to be
fussy about it.
On Sat, 2014-02-15 at 02:06 -0800, David Rientjes wrote:
> On Fri, 14 Feb 2014, Luiz Capitulino wrote:
>
> > > Again, I think this syntax is horrendous and doesn't couple well with the
> > > other hugepage-related kernel command line options. We already have
> > > hugepages= and hugepagesz= which you can interleave on the command line to
> > > get 100 2M hugepages and 10 1GB hugepages, for example.
> > >
> > > This patchset is simply introducing another variable to the matter: the
> > > node that the hugepages should be allocated on. So just introduce a
> > > hugepagesnode= parameter to couple with the others so you can do
> > >
> > > hugepagesz=<size> hugepagesnode=<nid> hugepages=<#>
> >
> > That was my first try but it turned out really bad. First, for every node
> > you specify you need three options.
>
> Just like you need two options today to specify a number of hugepages of a
> particular non-default size. You only need to use hugepagesz= or
> hugepagenode= if you want a non-default size or a specify a particular
> node.
>
> > So, if you want to setup memory for
> > three nodes you'll need to specify nine options.
>
> And you currently need six if you want to specify three different hugepage
> sizes (?). But who really specifies three different hugepage sizes on the
> command line that are needed to be reserved at boot?
>
> If that's really the usecase, it seems like you want the old
> CONFIG_PAGE_SHIFT patch.
>
> > And it gets worse, because
> > hugepagesz= and hugepages= have strict ordering (which is a mistake, IMHO) so
> > you have to specify them in the right order otherwise things don't work as
> > expected and you have no idea why (have been there myself).
> >
>
> How is that difficult? hugepages= is the "noun", hugepagesz= is the
> "adjective". hugepages=100 hugepagesz=1G hugepages=4 makes perfect sense
> to me, and I actually don't allocate hugepages on the command line, nor
> have I looked at Documentation/kernel-parameters.txt to check if I'm
> constructing it correctly. It just makes sense and once you learn it it's
> just natural.
This can get annoying _really_ fast for larger systems.
> > IMO, hugepages_node=<nid>:<nr_pages>:<size>,... is good enough. It's concise,
> > and don't depend on any other option to function. Also, there are lots of other
> > kernel command-line options that require you to specify multiple fields, so
> > it's not like hugepages_node= is totally different in that regard.
> >
Agreed.
>
> I doubt Andrew is going to want a completely different format for hugepage
> allocations that want to specify a node and have to deal with people who
> say hugepages_node=2:1:1G and constantly have to lookup if it's 2
> hugepages on node 1 or 1 hugepage on node 2.
I guess most users won't even be aware of this new parameter and those
who really care will have the choice.
On Mon, Feb 17, 2014 at 03:23:16PM -0800, David Rientjes wrote:
> On Mon, 17 Feb 2014, Luiz Capitulino wrote:
>
> > hugepages= and hugepages_node= are similar, but have different semantics.
> >
> > hugepagesz= and hugepages= create a pool of huge pages of the specified size.
> > This means that the number of times you specify those options are limited by
> > the number of different huge pages sizes an arch supports. For x86_64 for
> > example, this limit is two so one would not specify those options more than
> > two times. And this doesn't count default_hugepagesz=, which allows you to
> > drop one hugepagesz= option.
> >
> > hugepages_node= allows you to allocate huge pages per node, so the number of
> > times you can specify this option is limited by the number of nodes. Also,
> > hugepages_node= create the pools, if necessary (at least one will be). For
> > this reason I think it makes a lot of sense to have different options.
> >
>
> I understand you may want to add as much code as you can to the boot code
> so that you can parse all this information in short-form, and it's
> understood that it's possible to specify a different number of varying
> hugepage sizes on individual nodes, but let's come back down to reality
> here.
>
> Lacking from your entire patchset is a specific example of what you want
> to do. So I think we're all guessing what exactly your usecase is and we
> aren't getting any help. Are you really suggesting that a customer wants
> to allocate 4 1GB hugepages on node 0, 12 2MB hugepages on node 0, 6 1GB
> hugepages on node 1, 24 2MB hugepages on node 1, 2 1GB hugepages on node
> 2, 100 2MB hugepages on node 3, etc? Please.
Customer has 32GB machine. He wants 8 1GB pages for his performance
critical application on node0 (KVM guest), and other guests and
pagecache etc. using the remaining 26GB of memory.
> If that's actually the usecase then I'll renew my objection to the entire
> patchset and say you want to add the ability to dynamically allocate 1GB
> pages and free them at runtime early in initscripts. If something is
> going to be added to init code in the kernel then it better be trivial
> since all this can be duplicated in userspace if you really want to be
> fussy about it.
Not sure what is the point here. The command line interface addition
being proposed is simple, is it not?
On Tue, 18 Feb 2014, Marcelo Tosatti wrote:
> > Lacking from your entire patchset is a specific example of what you want
> > to do. So I think we're all guessing what exactly your usecase is and we
> > aren't getting any help. Are you really suggesting that a customer wants
> > to allocate 4 1GB hugepages on node 0, 12 2MB hugepages on node 0, 6 1GB
> > hugepages on node 1, 24 2MB hugepages on node 1, 2 1GB hugepages on node
> > 2, 100 2MB hugepages on node 3, etc? Please.
>
> Customer has 32GB machine. He wants 8 1GB pages for his performance
> critical application on node0 (KVM guest), and other guests and
> pagecache etc. using the remaining 26GB of memory.
>
Wow, is that it? (This still doesn't present a clear picture since we
don't know how much memory is on node 0, though.)
So Luiz's example of setting up different size hugepages on three
different nodes requiring nine kernel command line parameters doesn't even
have a legitimate usecase today.
Back to the original comment on this patchset, forgetting all this
parameter parsing stuff, if you had the ability to free 1GB pages at
runtime then your problem is already solved, correct? If that 32GB
machine has two nodes, then doing "hugepagesz=1G hugepages=16" will boot
just fine and then an init script frees the 8 1GB pages on node1.
It gets trickier if there are four nodes and each node is 8GB. Then you'd
be ooming the machine if you did "hugepagesz=1G hugepages=32". You could
actually do "hugepagesz=1G hugepages=29" and free the hugepages on
everything except for node 0, but I feel like movablecore= would be a
better option.
So why not just work on making 1GB pages dynamically allocatable and
freeable at runtime? It feels like it would be a much more heavily used
feature than a special command line parameter for a single customer.
> > If that's actually the usecase then I'll renew my objection to the
> > entire patchset and say you want to add the ability to dynamically
> > allocate 1GB pages and free them at runtime early in initscripts. If
> > something is going to be added to init code in the kernel then it
> > better be trivial since all this can be duplicated in userspace if you
> > really want to be fussy about it.
>
> Not sure what is the point here. The command line interface addition
> being proposed is simple, is it not?
>
You can't specify an interleave behavior with Luiz's command line
interface so now we'd have two different interfaces for allocating
hugepage sizes depending on whether you're specifying a node or not.
It's "hugepagesz=1G hugepages=16" vs "hugepage_node=1:16:1G" (and I'd have
to look at previous messages in this thread to see if that means 16 1GB
pages on node 1 or 1 1GB pages on node 16.)
On Tue, Feb 18, 2014 at 02:16:42PM -0800, David Rientjes wrote:
> On Tue, 18 Feb 2014, Marcelo Tosatti wrote:
>
> > > Lacking from your entire patchset is a specific example of what you want
> > > to do. So I think we're all guessing what exactly your usecase is and we
> > > aren't getting any help. Are you really suggesting that a customer wants
> > > to allocate 4 1GB hugepages on node 0, 12 2MB hugepages on node 0, 6 1GB
> > > hugepages on node 1, 24 2MB hugepages on node 1, 2 1GB hugepages on node
> > > 2, 100 2MB hugepages on node 3, etc? Please.
> >
> > Customer has 32GB machine. He wants 8 1GB pages for his performance
> > critical application on node0 (KVM guest), and other guests and
> > pagecache etc. using the remaining 26GB of memory.
> >
>
> Wow, is that it? (This still doesn't present a clear picture since we
> don't know how much memory is on node 0, though.)
>
> So Luiz's example of setting up different size hugepages on three
> different nodes requiring nine kernel command line parameters doesn't even
> have a legitimate usecase today.
>
> Back to the original comment on this patchset, forgetting all this
> parameter parsing stuff, if you had the ability to free 1GB pages at
> runtime then your problem is already solved, correct? If that 32GB
> machine has two nodes, then doing "hugepagesz=1G hugepages=16" will boot
> just fine and then an init script frees the 8 1GB pages on node1.
>
> It gets trickier if there are four nodes and each node is 8GB. Then you'd
> be ooming the machine if you did "hugepagesz=1G hugepages=32". You could
> actually do "hugepagesz=1G hugepages=29" and free the hugepages on
> everything except for node 0, but I feel like movablecore= would be a
> better option.
>
> So why not just work on making 1GB pages dynamically allocatable and
> freeable at runtime? It feels like it would be a much more heavily used
> feature than a special command line parameter for a single customer.
David,
We agree that, in the future, we'd like to provide the ability to
dynamically allocate and free 1GB pages at runtime.
Extending the kernel command line interface is a first step.
Do you have a concrete objection to that first step ?
> > > If that's actually the usecase then I'll renew my objection to the
> > > entire patchset and say you want to add the ability to dynamically
> > > allocate 1GB pages and free them at runtime early in initscripts. If
> > > something is going to be added to init code in the kernel then it
> > > better be trivial since all this can be duplicated in userspace if you
> > > really want to be fussy about it.
> >
> > Not sure what is the point here. The command line interface addition
> > being proposed is simple, is it not?
> >
>
> You can't specify an interleave behavior with Luiz's command line
> interface so now we'd have two different interfaces for allocating
> hugepage sizes depending on whether you're specifying a node or not.
> It's "hugepagesz=1G hugepages=16" vs "hugepage_node=1:16:1G" (and I'd have
> to look at previous messages in this thread to see if that means 16 1GB
> pages on node 1 or 1 1GB pages on node 16.)
What syntax do you prefer and why ?
On Wed, 19 Feb 2014, Marcelo Tosatti wrote:
> We agree that, in the future, we'd like to provide the ability to
> dynamically allocate and free 1GB pages at runtime.
>
> Extending the kernel command line interface is a first step.
>
> Do you have a concrete objection to that first step ?
>
Yes, my concrete objection is that the command line interface is
unnecessary if you can dynamically allocate and free 1GB pages at runtime
unless memory will be so fragmented that it cannot be done when userspace
is brought up. That is not your use case, thus this support is not
needed. I think Mel also brought up this point.
There's no "first step" about it, this is unnecessary for your use case if
you can do it at runtime. I'm not sure what's so surprising about this.
> > You can't specify an interleave behavior with Luiz's command line
> > interface so now we'd have two different interfaces for allocating
> > hugepage sizes depending on whether you're specifying a node or not.
> > It's "hugepagesz=1G hugepages=16" vs "hugepage_node=1:16:1G" (and I'd have
> > to look at previous messages in this thread to see if that means 16 1GB
> > pages on node 1 or 1 1GB pages on node 16.)
>
> What syntax do you prefer and why ?
>
I'm not sure it's interesting to talk about since this patchset is
unnecessary if you can do it at runtime, but since "hugepagesz=" and
"hugepages=" have existed for many kernel releases, we must maintain
backwards compatibility. Thus, it seems, the easiest addition would have
been "hugepagesnode=" which I've mentioned several times, there's no
reason to implement yet another command line option purely as a shorthand
which hugepage_node=1:2:1G is and in a very cryptic way.
On Wed, 19 Feb 2014 19:46:41 -0800 (PST)
David Rientjes <[email protected]> wrote:
> On Wed, 19 Feb 2014, Marcelo Tosatti wrote:
>
> > We agree that, in the future, we'd like to provide the ability to
> > dynamically allocate and free 1GB pages at runtime.
> >
> > Extending the kernel command line interface is a first step.
> >
> > Do you have a concrete objection to that first step ?
> >
>
> Yes, my concrete objection is that the command line interface is
> unnecessary if you can dynamically allocate and free 1GB pages at runtime
> unless memory will be so fragmented that it cannot be done when userspace
> is brought up. That is not your use case, thus this support is not
Yes it is. The early boot is the most reliable moment to allocate huge pages
and we want to take advantage from that.
I would understand your position against this series if it was intrusive
or if it was changing the code in an undesirable way, but the code it adds
is completely self-contained and runs only once at boot. It gives us just
one more choice for huge pages allocation at boot. I don't see the huge
problem here.
> needed. I think Mel also brought up this point.
>
> There's no "first step" about it, this is unnecessary for your use case if
> you can do it at runtime. I'm not sure what's so surprising about this.
>
> > > You can't specify an interleave behavior with Luiz's command line
> > > interface so now we'd have two different interfaces for allocating
> > > hugepage sizes depending on whether you're specifying a node or not.
> > > It's "hugepagesz=1G hugepages=16" vs "hugepage_node=1:16:1G" (and I'd have
> > > to look at previous messages in this thread to see if that means 16 1GB
> > > pages on node 1 or 1 1GB pages on node 16.)
> >
> > What syntax do you prefer and why ?
> >
>
> I'm not sure it's interesting to talk about since this patchset is
> unnecessary if you can do it at runtime, but since "hugepagesz=" and
> "hugepages=" have existed for many kernel releases, we must maintain
> backwards compatibility. Thus, it seems, the easiest addition would have
> been "hugepagesnode=" which I've mentioned several times, there's no
> reason to implement yet another command line option purely as a shorthand
> which hugepage_node=1:2:1G is and in a very cryptic way.
>
On Wed, 19 Feb 2014, Luiz Capitulino wrote:
> > Yes, my concrete objection is that the command line interface is
> > unnecessary if you can dynamically allocate and free 1GB pages at runtime
> > unless memory will be so fragmented that it cannot be done when userspace
> > is brought up. That is not your use case, thus this support is not
>
> Yes it is. The early boot is the most reliable moment to allocate huge pages
> and we want to take advantage from that.
>
Your use case is 8GB of hugepages on a 32GB machine. It shouldn't be
necessary to do that at boot.
Thanks.
On Wed, 19 Feb 2014 20:51:55 -0800 (PST)
David Rientjes <[email protected]> wrote:
> On Wed, 19 Feb 2014, Luiz Capitulino wrote:
>
> > > Yes, my concrete objection is that the command line interface is
> > > unnecessary if you can dynamically allocate and free 1GB pages at runtime
> > > unless memory will be so fragmented that it cannot be done when userspace
> > > is brought up. That is not your use case, thus this support is not
> >
> > Yes it is. The early boot is the most reliable moment to allocate huge pages
> > and we want to take advantage from that.
> >
>
> Your use case is 8GB of hugepages on a 32GB machine. It shouldn't be
> necessary to do that at boot.
That's shortsighted because it's tied to a particular machine. The same
customer asked for more flexibility, too.
Look, we're also looking forward to allocating 1G huge pages from user-space.
We actually agree here. What we're suggesting is having _both_, the
command-line option (which offers higher reliability and is a low hanging
fruit right now) _and_ later we add support to allocate 1G huge pages from
user-space. No loss here, that's the maximum benefit for all users.
On Wed, Feb 19, 2014 at 07:46:41PM -0800, David Rientjes wrote:
> On Wed, 19 Feb 2014, Marcelo Tosatti wrote:
>
> > We agree that, in the future, we'd like to provide the ability to
> > dynamically allocate and free 1GB pages at runtime.
> >
> > Extending the kernel command line interface is a first step.
> >
> > Do you have a concrete objection to that first step ?
> >
>
> Yes, my concrete objection is that the command line interface is
> unnecessary if you can dynamically allocate and free 1GB pages at runtime
> unless memory will be so fragmented that it cannot be done when userspace
> is brought up. That is not your use case, thus this support is not
> needed. I think Mel also brought up this point.
>
> There's no "first step" about it, this is unnecessary for your use case if
> you can do it at runtime. I'm not sure what's so surprising about this.
Mel has clearly has no objection to the command line. You can also
allocate 2M pages at runtime, and that is no reason for "hugepages="
interface to not exist.
There is a number of parameters that are modifiable via the kernel
command line, so following your reasoning, they should all be removed,
because it can be done at runtime.
> > > You can't specify an interleave behavior with Luiz's command line
> > > interface so now we'd have two different interfaces for allocating
> > > hugepage sizes depending on whether you're specifying a node or not.
> > > It's "hugepagesz=1G hugepages=16" vs "hugepage_node=1:16:1G" (and I'd have
> > > to look at previous messages in this thread to see if that means 16 1GB
> > > pages on node 1 or 1 1GB pages on node 16.)
> >
> > What syntax do you prefer and why ?
> >
>
> I'm not sure it's interesting to talk about since this patchset is
> unnecessary if you can do it at runtime, but since "hugepagesz=" and
> "hugepages=" have existed for many kernel releases, we must maintain
> backwards compatibility.
Yes, we'd like to maintain backwards compatibility.
> Thus, it seems, the easiest addition would have
> been "hugepagesnode=" which I've mentioned several times, there's no
> reason to implement yet another command line option purely as a shorthand
> which hugepage_node=1:2:1G is and in a very cryptic way.
Can you state your suggestion clearly (or point to such messages), and
list the advantages of it versus the proposed patch ?
On Wed, Feb 19, 2014 at 07:46:41PM -0800, David Rientjes wrote:
> On Wed, 19 Feb 2014, Marcelo Tosatti wrote:
>
> > We agree that, in the future, we'd like to provide the ability to
> > dynamically allocate and free 1GB pages at runtime.
> >
> > Extending the kernel command line interface is a first step.
> >
> > Do you have a concrete objection to that first step ?
> >
>
> Yes, my concrete objection is that the command line interface is
> unnecessary if you can dynamically allocate and free 1GB pages at runtime
> unless memory will be so fragmented that it cannot be done when userspace
> is brought up. That is not your use case, thus this support is not
> needed. I think Mel also brought up this point.
>
> There's no "first step" about it, this is unnecessary for your use case if
> you can do it at runtime. I'm not sure what's so surprising about this.
>
> > > You can't specify an interleave behavior with Luiz's command line
> > > interface so now we'd have two different interfaces for allocating
> > > hugepage sizes depending on whether you're specifying a node or not.
> > > It's "hugepagesz=1G hugepages=16" vs "hugepage_node=1:16:1G" (and I'd have
> > > to look at previous messages in this thread to see if that means 16 1GB
> > > pages on node 1 or 1 1GB pages on node 16.)
> >
> > What syntax do you prefer and why ?
> >
>
> I'm not sure it's interesting to talk about since this patchset is
> unnecessary if you can do it at runtime, but since "hugepagesz=" and
> "hugepages=" have existed for many kernel releases, we must maintain
> backwards compatibility. Thus, it seems, the easiest addition would have
> been "hugepagesnode=" which I've mentioned several times, there's no
> reason to implement yet another command line option purely as a shorthand
> which hugepage_node=1:2:1G is and in a very cryptic way.
There is one point from Davidlohr Bueso in favour of the proposed
command line interface. Did you consider that aspect?
On Thu, 20 Feb 2014, Marcelo Tosatti wrote:
> Mel has clearly has no objection to the command line. You can also
> allocate 2M pages at runtime, and that is no reason for "hugepages="
> interface to not exist.
>
The "hugepages=" interface does exist and for good reason, when
fragmentation is such that you cannot allocate that number of hugepages at
runtime easily. That's lacking from your use case: why can't your
customer do it from an initscript? So far, all you've said is that your
customer wants 8 1GB hugepages on node 0 for a 32GB machine.
> There is a number of parameters that are modifiable via the kernel
> command line, so following your reasoning, they should all be removed,
> because it can be done at runtime.
>
1GB is of such granularity that you'd typically either be (a) oom so that
your userspace couldn't even start, or (b) have enough memory such that
userspace would be able to start and allocate them dynamically through an
initscript.
> Yes, we'd like to maintain backwards compatibility.
>
Good, see below.
> > Thus, it seems, the easiest addition would have
> > been "hugepagesnode=" which I've mentioned several times, there's no
> > reason to implement yet another command line option purely as a shorthand
> > which hugepage_node=1:2:1G is and in a very cryptic way.
>
> Can you state your suggestion clearly (or point to such messages), and
> list the advantages of it versus the proposed patch ?
>
My suggestion was posted on the same day this patchset was posted:
http://marc.info/?l=linux-kernel&m=139241967514884 it would be helpful if
you read the thread before asking for something that has been repeated
over and over.
There's no need to implement a shorthand that combines a few kernel
command line options.
That's not the issue, anymore, though, since there's no need for the
patchset to begin with if you can dynamically allocate 1GB hugepages at
runtime. If your customer wanted 4096 2MB hugepages on node 0 instead of
8 1GB hugepages on node 0, we'd not be having this conversation.
Do I really need to do your work for you and work on 1GB hugepages at
runtime, which many more people would be interested in? Or are we just
seeking the easiest way out here with something that shuts the customer up
and leaves a kernel command line option that we'll need to maintain to
avoid breaking backwards compatibility in the future?
On Thu, 20 Feb 2014, Marcelo Tosatti wrote:
> > I'm not sure it's interesting to talk about since this patchset is
> > unnecessary if you can do it at runtime, but since "hugepagesz=" and
> > "hugepages=" have existed for many kernel releases, we must maintain
> > backwards compatibility. Thus, it seems, the easiest addition would have
> > been "hugepagesnode=" which I've mentioned several times, there's no
> > reason to implement yet another command line option purely as a shorthand
> > which hugepage_node=1:2:1G is and in a very cryptic way.
>
> There is one point from Davidlohr Bueso in favour of the proposed
> command line interface. Did you consider that aspect?
>
I did before he posted it, in
http://marc.info/?l=linux-kernel&m=139267940609315. I don't think "large
machines" open up the use case for 4 1GB hugepages on node 0, 12 2MB
hugepages on node 0, 6 1GB hugepages on node 1, 24 2MB hugepages on node
1, 2 1GB hugepages on node 2, 100 2MB hugepages on node 3, etc.
On Thu, Feb 20, 2014 at 03:15:46PM -0800, David Rientjes wrote:
> On Thu, 20 Feb 2014, Marcelo Tosatti wrote:
>
> > Mel has clearly has no objection to the command line. You can also
> > allocate 2M pages at runtime, and that is no reason for "hugepages="
> > interface to not exist.
> >
>
> The "hugepages=" interface does exist and for good reason, when
> fragmentation is such that you cannot allocate that number of hugepages at
> runtime easily. That's lacking from your use case: why can't your
> customer do it from an initscript? So far, all you've said is that your
> customer wants 8 1GB hugepages on node 0 for a 32GB machine.
See below about particular hugepages distribution.
> > There is a number of parameters that are modifiable via the kernel
> > command line, so following your reasoning, they should all be removed,
> > because it can be done at runtime.
> >
>
> 1GB is of such granularity that you'd typically either be (a) oom so that
> your userspace couldn't even start, or (b) have enough memory such that
> userspace would be able to start and allocate them dynamically through an
> initscript.
There are a number of kernel command line parameters which can be
modified in runtime as well.
> > Yes, we'd like to maintain backwards compatibility.
> >
>
> Good, see below.
>
> > > Thus, it seems, the easiest addition would have
> > > been "hugepagesnode=" which I've mentioned several times, there's no
> > > reason to implement yet another command line option purely as a shorthand
> > > which hugepage_node=1:2:1G is and in a very cryptic way.
> >
> > Can you state your suggestion clearly (or point to such messages), and
> > list the advantages of it versus the proposed patch ?
> >
>
> My suggestion was posted on the same day this patchset was posted:
> http://marc.info/?l=linux-kernel&m=139241967514884 it would be helpful if
> you read the thread before asking for something that has been repeated
> over and over.
Please, repeat it. Compare your suggestion to the proposed interface.
Copy & paste if necessary. You have not given a single technical point
so far.
There is not ONE technical point in this message:
http://marc.info/?l=linux-kernel&m=139241967514884
You are asking what is the use-case.
> There's no need to implement a shorthand that combines a few kernel
> command line options.
>
> That's not the issue, anymore, though, since there's no need for the
> patchset to begin with if you can dynamically allocate 1GB hugepages at
> runtime. If your customer wanted 4096 2MB hugepages on node 0 instead of
> 8 1GB hugepages on node 0, we'd not be having this conversation.
A particular distribution is irrelevant. What you want is a non default
distribution of 1GB hugepages.
Can you agree with that ? (forget about particular values, please).
> Do I really need to do your work for you and work on 1GB hugepages at
> runtime, which many more people would be interested in?
Our interest is satistified by this patch. The interest below is
satisfied by the patch.
> Or are we just
> seeking the easiest way out here with something that shuts the customer up
> and leaves a kernel command line option that we'll need to maintain to
> avoid breaking backwards compatibility in the future?
We'd like to backport the minimal amount of code (minimal amount of
code = kernel command line interface).
USE CASE FOR "hugetlb: add hugepages_node= command-line option" PATCH.
Requirement1) Assume N 1GB hugepages on node A. No other hugepages on any node
are desired (because that memory must be free, for other uses).
Requirement2) Assume that failure to allocate N 1GB hugepages is not acceptable.
For this pair of requirements, THERE IS NO DIFFERENCE AT ALL whether
1GB pages are allocated in userspace or via kernel command line.
All it matters is that pages are allocated accordingly to Requirement1).
Feel free to replace page distribution on Requirement1) by your
imagination of how people can make use of computers.
Now, your suggestion is to allocate pages during runtime. OK, can you
explain what are the advantages of doing so (for use case above) ?
Or for any use case which _MUST ALLOCATE_ a given number of 1GB hugepages ?
"I object to a new kernel command line parameter because that
parameter can be specified during runtime" is not a valid objection to me.
It seems you are interested in writing code to allocate 1GB pages during
runtime? Sure, go ahead, no one is stopping you.
On Thu, 20 Feb 2014 15:15:46 -0800 (PST)
David Rientjes <[email protected]> wrote:
> Do I really need to do your work for you and work on 1GB hugepages at
> runtime, which many more people would be interested in? Or are we just
> seeking the easiest way out here with something that shuts the customer up
> and leaves a kernel command line option that we'll need to maintain to
> avoid breaking backwards compatibility in the future?
We're seeking a pragmatic solution.
I've said many times in this thread that we're also interested on being
able to allocate 1GB at runtime and would work on it on top of the
command-line option, which is ready, works and solves a real world problem.
On Thu, 20 Feb 2014, Marcelo Tosatti wrote:
> > 1GB is of such granularity that you'd typically either be (a) oom so that
> > your userspace couldn't even start, or (b) have enough memory such that
> > userspace would be able to start and allocate them dynamically through an
> > initscript.
>
> There are a number of kernel command line parameters which can be
> modified in runtime as well.
>
We could also make the kernel command line implement a shell scripting
language of your choice. There's no technical objection to it, of course
you can do it, but is it in the interest of the kernel in terms of
maintainability?
> You are asking what is the use-case.
>
I'm asking what the use case is because it's still not explained. You say
a customer wants 8 1GB hugepages on node 0 on a 32GB machine. Perfectly
understandable. The only thing missing, and is practically begging to be
answered in this thread, is why must it be done on the command line? That
would be the justification for the patchset. Andrew asked for Luiz to
elaborate originally and even today the use case is not well described.
If you're asking for a maintenance burden to be accepted forever, it seems
like as part of your due diligence that you would show why it must be done
that way. Being the easiest or "pragmatic" is not it, there is a much
larger set of people who would be interested in dynamic allocation, myself
and Google included.
> A particular distribution is irrelevant. What you want is a non default
> distribution of 1GB hugepages.
>
> Can you agree with that ? (forget about particular values, please).
>
I agree that your customer wants a non-default distribution of 1GB
hugepages, yes, that's clear. The questions that have not been answered:
why must it be done this way as opposed to runtime? If 1GB hugepages
could be dynamically allocated, would your customer be able to use it? If
not, why not? If dynamic allocation resolves all the issues, then is this
patchset a needless maintenance burden if we had such support today?
On Fri, Feb 21, 2014 at 02:07:08AM -0800, David Rientjes wrote:
> On Thu, 20 Feb 2014, Marcelo Tosatti wrote:
>
> > > 1GB is of such granularity that you'd typically either be (a) oom so that
> > > your userspace couldn't even start, or (b) have enough memory such that
> > > userspace would be able to start and allocate them dynamically through an
> > > initscript.
> >
> > There are a number of kernel command line parameters which can be
> > modified in runtime as well.
> >
>
> We could also make the kernel command line implement a shell scripting
> language of your choice. There's no technical objection to it, of course
> you can do it, but is it in the interest of the kernel in terms of
> maintainability?
>
> > You are asking what is the use-case.
> >
>
> I'm asking what the use case is because it's still not explained. You say
> a customer wants 8 1GB hugepages on node 0 on a 32GB machine. Perfectly
> understandable. The only thing missing, and is practically begging to be
> answered in this thread, is why must it be done on the command line? That
> would be the justification for the patchset. Andrew asked for Luiz to
> elaborate originally and even today the use case is not well described.
It is explained. You deleted it while replying (feel free to ask for
more information there and it will be provided).
> If you're asking for a maintenance burden to be accepted forever, it seems
> like as part of your due diligence that you would show why it must be done
> that way.
It does not have to be maintained forever. See
27be457000211a6903968dfce06d5f73f051a217 for one or git log for many
commands which have been removed.
If it becomes a maintenance burden, it can be removed.
> Being the easiest or "pragmatic" is not it, there is a much
> larger set of people who would be interested in dynamic allocation, myself
> and Google included.
OK.
> > A particular distribution is irrelevant. What you want is a non default
> > distribution of 1GB hugepages.
> >
> > Can you agree with that ? (forget about particular values, please).
> >
>
> I agree that your customer wants a non-default distribution of 1GB
> hugepages, yes, that's clear. The questions that have not been answered:
> why must it be done this way as opposed to runtime? If 1GB hugepages
> could be dynamically allocated, would your customer be able to use it? If
> not, why not? If dynamic allocation resolves all the issues, then is this
> patchset a needless maintenance burden if we had such support today?
It must be done this way because:
1) its the only interface which is easily backportable.
2) it improves the kernel command line interface from incomplete
(lacking the ability to specify node<->page correlation), to
a complete interface.
And also, the existance of the command line interface does not interfere in
any way with the dynamic allocation in userspace (just as you can
allocate 2M pages via kernel command line _and_ allocate during
runtime).
On Fri, 21 Feb 2014, Marcelo Tosatti wrote:
> > I agree that your customer wants a non-default distribution of 1GB
> > hugepages, yes, that's clear. The questions that have not been answered:
> > why must it be done this way as opposed to runtime? If 1GB hugepages
> > could be dynamically allocated, would your customer be able to use it? If
> > not, why not? If dynamic allocation resolves all the issues, then is this
> > patchset a needless maintenance burden if we had such support today?
>
> It must be done this way because:
>
> 1) its the only interface which is easily backportable.
>
There's no pending patchset that adds dynamic allocation of GB hugepages
so you can't comment on what is easily backportable and which isn't.
> 2) it improves the kernel command line interface from incomplete
> (lacking the ability to specify node<->page correlation), to
> a complete interface.
>
If GB hugepages can be allocated dynamically, I really think we should be
able to remove hugepagesz= entirely for x86 after a few years of
supporting it for backwards compatibility, even though Linus has insisted
that we never break userspace in the past (which should discourage us
from adding additional command line interfaces which are obsoleted in the
future, such as in this case).
Still waiting on an answer to whether your customer would be able to
dynamically allocate 1GB hugepages at runtime if we had such support and,
if not, please show why not?
> > 2) it improves the kernel command line interface from incomplete
> > (lacking the ability to specify node<->page correlation), to
> > a complete interface.
> >
>
> If GB hugepages can be allocated dynamically, I really think we should be
> able to remove hugepagesz= entirely for x86 after a few years of
> supporting it for backwards compatibility, even though Linus has insisted
That doesn't make any sense. Why break a perfectly fine interface?
-Andi
On Fri, 21 Feb 2014, Andi Kleen wrote:
> > > 2) it improves the kernel command line interface from incomplete
> > > (lacking the ability to specify node<->page correlation), to
> > > a complete interface.
> > >
> >
> > If GB hugepages can be allocated dynamically, I really think we should be
> > able to remove hugepagesz= entirely for x86 after a few years of
> > supporting it for backwards compatibility, even though Linus has insisted
>
> That doesn't make any sense. Why break a perfectly fine interface?
>
I think doing hugepagesz= and not default_hugepagesz= is more of a hack
just because we lack support for dynamically allocating some class of
hugepage sizes and this is the only way to currently do it; if we had
support for doing it at runtime then that hack probably isn't needed. You
would still be able to do default_hugepagesz=1G and allocate a ton of them
when fragmentation is a concern and it can only truly be done at boot.
Even then, with such a large size it doesn't seem absolutely necessary
since you'd either be (a) oom as a result of all those hugepages or (b)
there would be enough memory for initscripts to do this at runtime, this
isn't the case with 2MB.
But, like I said, I'm not sure we'd ever be able to totally remove it
because of backwards compatibility, but the point is that nobody would
have to use it anymore as a hack for 1GB.
> But, like I said, I'm not sure we'd ever be able to totally remove it
> because of backwards compatibility, but the point is that nobody would
> have to use it anymore as a hack for 1GB.
Again it's a perfectly fine and widely used interface. Any talk of removing
it is wrong.
-Andi
On Mon, 17 Feb 2014 21:47:36 -0800 Davidlohr Bueso <[email protected]> wrote:
> > How is that difficult? hugepages= is the "noun", hugepagesz= is the
> > "adjective". hugepages=100 hugepagesz=1G hugepages=4 makes perfect sense
> > to me, and I actually don't allocate hugepages on the command line, nor
> > have I looked at Documentation/kernel-parameters.txt to check if I'm
> > constructing it correctly. It just makes sense and once you learn it it's
> > just natural.
>
> This can get annoying _really_ fast for larger systems.
Yes, I do prefer the syntax Luiz is proposing.
But I think it would be better if it made hugepages= and hugepagesz=
obsolete, so we can emit a printk if people use those, telling them
to migrate because the old options are going away.
Something like
hugepages_node=1:4:1G
and
hugepages_node=:16:1G
?
> But I think it would be better if it made hugepages= and hugepagesz=
> obsolete, so we can emit a printk if people use those, telling them
> to migrate because the old options are going away.
Not sure why everyone wants to break existing systems. These options
have existed for many years, you cannot not just remove them.
Also the old options are totally fine and work adequately for the
vast majority of users who do not need to control node assignment
fine grained.
-Andi
On Sat, 2014-02-22 at 05:03 +0100, Andi Kleen wrote:
> > But I think it would be better if it made hugepages= and hugepagesz=
> > obsolete, so we can emit a printk if people use those, telling them
> > to migrate because the old options are going away.
>
> Not sure why everyone wants to break existing systems. These options
> have existed for many years, you cannot not just remove them.
>
> Also the old options are totally fine and work adequately for the
> vast majority of users who do not need to control node assignment
> fine grained.
Yes, please, why can't both options just coexist.
On Sat, 22 Feb 2014 05:03:50 +0100 Andi Kleen <[email protected]> wrote:
> > But I think it would be better if it made hugepages= and hugepagesz=
> > obsolete, so we can emit a printk if people use those, telling them
> > to migrate because the old options are going away.
>
> Not sure why everyone wants to break existing systems. These options
> have existed for many years, you cannot not just remove them.
Because we care about the quality of kernel interfaces and
implementation. Five years? We'll still be around then.
> Also the old options are totally fine and work adequately for the
> vast majority of users who do not need to control node assignment
> fine grained.
It will be old, unneeded cruft. Don't accumulate cruft. If we
possibly can remove the old stuff, we should. It's what we do. Maybe
in five years we'll find we can't remove them. We'll see. At least we
tried.
Hopefully by then none of these interfaces will be in use anyway.