2007-01-25 21:37:22

by David Rientjes

Subject: [patch -mm 1/5] x86_64: configurable fake numa node sizes

Extends the numa=fake x86_64 command-line option to allow for configurable
node sizes. These nodes can be used in conjunction with cpusets for
coarse memory resource management.

The old command-line option is still supported:
numa=fake=32 gives 32 fake NUMA nodes, ignoring the NUMA setup of the
actual machine.

But now you may configure your system for the node sizes of your choice:
numa=fake=2*512,1024,2*256
gives two 512M nodes, one 1024M node, two 256M nodes, and
the rest of system memory to a sixth node.

Cc: Andi Kleen <[email protected]>
Signed-off-by: David Rientjes <[email protected]>
---
Documentation/x86_64/boot-options.txt | 9 +-
arch/x86_64/mm/numa.c | 249 +++++++++++++++++++--------------
include/asm-x86_64/mmzone.h | 2 +-
3 files changed, 150 insertions(+), 110 deletions(-)

diff --git a/Documentation/x86_64/boot-options.txt b/Documentation/x86_64/boot-options.txt
index 625a21d..6ccdb5e 100644
--- a/Documentation/x86_64/boot-options.txt
+++ b/Documentation/x86_64/boot-options.txt
@@ -149,7 +149,14 @@ NUMA

numa=noacpi Don't parse the SRAT table for NUMA setup

- numa=fake=X Fake X nodes and ignore NUMA setup of the actual machine.
+ numa=fake=CMDLINE
+ If a number, fakes CMDLINE nodes and ignores NUMA setup of the
+ actual machine. Otherwise, system memory is configured
+ depending on the sizes and coefficients listed. For example:
+ numa=fake=2*512,1024,4*256
+ gives two 512M nodes, a 1024M node, and four 256M nodes. If
+ the last character of CMDLINE is a comma, the remaining system
+ memory is not allocated to an additional node.

numa=hotadd=percent
Only allow hotadd memory to preallocate page structures upto
diff --git a/arch/x86_64/mm/numa.c b/arch/x86_64/mm/numa.c
index 9ff3141..0417921 100644
--- a/arch/x86_64/mm/numa.c
+++ b/arch/x86_64/mm/numa.c
@@ -276,125 +276,160 @@ void __init numa_init_array(void)

#ifdef CONFIG_NUMA_EMU
/* Numa emulation */
-int numa_fake __initdata = 0;
+#define E820_ADDR_HOLE_SIZE(start, end) \
+ (e820_hole_size((start) >> PAGE_SHIFT, (end) >> PAGE_SHIFT) << \
+ PAGE_SHIFT)
+char *cmdline __initdata;

/*
- * This function is used to find out if the start and end correspond to
- * different zones.
+ * Sets up nid to range from addr to addr + size. If the end boundary is
+ * greater than max_addr, then max_addr is used instead. The return value is 0
+ * if there is additional memory left for allocation past addr and -1 otherwise.
+ * addr is adjusted to be at the end of the node.
*/
-int zone_cross_over(unsigned long start, unsigned long end)
+static int setup_node_range(int nid, struct bootnode *nodes, u64 *addr,
+ u64 size, u64 max_addr)
{
- if ((start < (MAX_DMA32_PFN << PAGE_SHIFT)) &&
- (end >= (MAX_DMA32_PFN << PAGE_SHIFT)))
- return 1;
- return 0;
+ int ret = 0;
+ nodes[nid].start = *addr;
+ *addr += size;
+ if (*addr >= max_addr) {
+ *addr = max_addr;
+ ret = -1;
+ }
+ nodes[nid].end = *addr;
+ node_set_online(nid);
+ printk(KERN_INFO "Faking node %d at %016Lx-%016Lx (%LuMB)\n", nid,
+ nodes[nid].start, nodes[nid].end,
+ (nodes[nid].end - nodes[nid].start) >> 20);
+ return ret;
}

-static int __init numa_emulation(unsigned long start_pfn, unsigned long end_pfn)
+/*
+ * Splits num_nodes nodes up equally starting at node_start. The return value
+ * is the number of nodes split up and addr is adjusted to be at the end of the
+ * last node allocated.
+ */
+static int split_nodes_equally(struct bootnode *nodes, u64 *addr, u64 max_addr,
+ int node_start, int num_nodes)
{
- int i, big;
- struct bootnode nodes[MAX_NUMNODES];
- unsigned long sz, old_sz;
- unsigned long hole_size;
- unsigned long start, end;
- unsigned long max_addr = (end_pfn << PAGE_SHIFT);
-
- start = (start_pfn << PAGE_SHIFT);
- hole_size = e820_hole_size(start, max_addr);
- sz = (max_addr - start - hole_size) / numa_fake;
-
- /* Kludge needed for the hash function */
-
- old_sz = sz;
- /*
- * Round down to the nearest FAKE_NODE_MIN_SIZE.
- */
- sz &= FAKE_NODE_MIN_HASH_MASK;
+ unsigned int big;
+ u64 size;
+ int i;

+ if (num_nodes <= 0)
+ return -1;
+ if (num_nodes > MAX_NUMNODES)
+ num_nodes = MAX_NUMNODES;
+ size = (max_addr - *addr - E820_ADDR_HOLE_SIZE(*addr, max_addr)) /
+ num_nodes;
/*
- * We ensure that each node is at least 64MB big. Smaller than this
- * size can cause VM hiccups.
+ * Calculate the number of big nodes that can be allocated as a result
+ * of consolidating the leftovers.
*/
- if (sz == 0) {
- printk(KERN_INFO "Not enough memory for %d nodes. Reducing "
- "the number of nodes\n", numa_fake);
- numa_fake = (max_addr - start - hole_size) / FAKE_NODE_MIN_SIZE;
- printk(KERN_INFO "Number of fake nodes will be = %d\n",
- numa_fake);
- sz = FAKE_NODE_MIN_SIZE;
+ big = ((size & ~FAKE_NODE_MIN_HASH_MASK) * num_nodes) /
+ FAKE_NODE_MIN_SIZE;
+
+ /* Round down to nearest FAKE_NODE_MIN_SIZE. */
+ size &= FAKE_NODE_MIN_HASH_MASK;
+ if (!size) {
+ printk(KERN_ERR "Not enough memory for each node. "
+ "NUMA emulation disabled.\n");
+ return -1;
}
- /*
- * Find out how many nodes can get an extra NODE_MIN_SIZE granule.
- * This logic ensures the extra memory gets distributed among as many
- * nodes as possible (as compared to one single node getting all that
- * extra memory.
- */
- big = ((old_sz - sz) * numa_fake) / FAKE_NODE_MIN_SIZE;
- printk(KERN_INFO "Fake node Size: %luMB hole_size: %luMB big nodes: "
- "%d\n",
- (sz >> 20), (hole_size >> 20), big);
- memset(&nodes,0,sizeof(nodes));
- end = start;
- for (i = 0; i < numa_fake; i++) {
- /*
- * In case we are not able to allocate enough memory for all
- * the nodes, we reduce the number of fake nodes.
- */
- if (end >= max_addr) {
- numa_fake = i - 1;
- break;
- }
- start = nodes[i].start = end;
- /*
- * Final node can have all the remaining memory.
- */
- if (i == numa_fake-1)
- sz = max_addr - start;
- end = nodes[i].start + sz;
- /*
- * Fir "big" number of nodes get extra granule.
- */
+
+ for (i = node_start; i < num_nodes + node_start; i++) {
+ u64 end = *addr + size;
if (i < big)
end += FAKE_NODE_MIN_SIZE;
/*
- * Iterate over the range to ensure that this node gets at
- * least sz amount of RAM (excluding holes)
+ * The final node can have the remaining system RAM. Other
+ * nodes receive roughly the same amount of available pages.
*/
- while ((end - start - e820_hole_size(start, end)) < sz) {
- end += FAKE_NODE_MIN_SIZE;
- if (end >= max_addr)
- break;
+ if (i == num_nodes + node_start - 1)
+ end = max_addr;
+ else
+ while (end - *addr - E820_ADDR_HOLE_SIZE(*addr, end) <
+ size)
+ end += FAKE_NODE_MIN_SIZE;
+ if (setup_node_range(i, nodes, addr, end - *addr, max_addr) < 0)
+ break;
+ }
+ return i - node_start + 1;
+}
+
+/*
+ * Sets up the system RAM area from start_pfn to end_pfn according to the
+ * numa=fake command-line option.
+ */
+static int __init numa_emulation(unsigned long start_pfn, unsigned long end_pfn)
+{
+ struct bootnode nodes[MAX_NUMNODES];
+ u64 addr = start_pfn << PAGE_SHIFT;
+ u64 max_addr = end_pfn << PAGE_SHIFT;
+ unsigned int coeff;
+ unsigned int num = 0;
+ int num_nodes = 0;
+ u64 size;
+ int i;
+
+ memset(&nodes, 0, sizeof(nodes));
+ /*
+ * If the numa=fake command-line is just a single number N, split the
+ * system RAM into N fake nodes.
+ */
+ if (!strchr(cmdline, '*') && !strchr(cmdline, ',')) {
+ num_nodes = split_nodes_equally(nodes, &addr, max_addr, 0,
+ simple_strtol(cmdline, NULL, 0));
+ if (num_nodes < 0)
+ return num_nodes;
+ goto out;
+ }
+
+ /* Parse the command line. */
+ for (coeff = 1; ; cmdline++) {
+ if (*cmdline && isdigit(*cmdline)) {
+ num = num * 10 + *cmdline - '0';
+ continue;
}
- /*
- * Look at the next node to make sure there is some real memory
- * to map. Bad things happen when the only memory present
- * in a zone on a fake node is IO hole.
- */
- while (e820_hole_size(end, end + FAKE_NODE_MIN_SIZE) > 0) {
- if (zone_cross_over(start, end + sz)) {
- end = (MAX_DMA32_PFN << PAGE_SHIFT);
- break;
+ if (*cmdline == '*')
+ coeff = num;
+ if (!*cmdline || *cmdline == ',') {
+ /*
+ * Round down to the nearest FAKE_NODE_MIN_SIZE.
+ * Command-line coefficients are in megabytes.
+ */
+ size = (num << 20) & FAKE_NODE_MIN_HASH_MASK;
+ if (size) {
+ for (i = 0; i < coeff; i++, num_nodes++)
+ if (setup_node_range(num_nodes, nodes,
+ &addr, size, max_addr) < 0)
+ goto done;
+ coeff = 1;
}
- if (end >= max_addr)
- break;
- end += FAKE_NODE_MIN_SIZE;
}
- if (end > max_addr)
- end = max_addr;
- nodes[i].end = end;
- printk(KERN_INFO "Faking node %d at %016Lx-%016Lx (%LuMB)\n",
- i,
- nodes[i].start, nodes[i].end,
- (nodes[i].end - nodes[i].start) >> 20);
- node_set_online(i);
- }
- memnode_shift = compute_hash_shift(nodes, numa_fake);
- if (memnode_shift < 0) {
- memnode_shift = 0;
- printk(KERN_ERR "No NUMA hash function found. Emulation disabled.\n");
- return -1;
- }
- for_each_online_node(i) {
+ if (!*cmdline)
+ break;
+ num = 0;
+ }
+done:
+ if (!num_nodes)
+ return -1;
+ /* Fill remainder of system RAM with a final node, if appropriate. */
+ if (addr < max_addr && *(cmdline - 1) != ',') {
+ setup_node_range(num_nodes, nodes, &addr, max_addr - addr,
+ max_addr);
+ num_nodes++;
+ }
+out:
+ memnode_shift = compute_hash_shift(nodes, num_nodes);
+ if (memnode_shift < 0) {
+ memnode_shift = 0;
+ printk(KERN_ERR "No NUMA hash function found. NUMA emulation "
+ "disabled.\n");
+ return -1;
+ }
+ for_each_online_node(i) {
e820_register_active_regions(i, nodes[i].start >> PAGE_SHIFT,
nodes[i].end >> PAGE_SHIFT);
setup_node_bootmem(i, nodes[i].start, nodes[i].end);
@@ -402,14 +437,15 @@ static int __init numa_emulation(unsigned long start_pfn, unsigned long end_pfn)
numa_init_array();
return 0;
}
-#endif
+#undef E820_ADDR_HOLE_SIZE
+#endif /* CONFIG_NUMA_EMU */

void __init numa_initmem_init(unsigned long start_pfn, unsigned long end_pfn)
{
int i;

#ifdef CONFIG_NUMA_EMU
- if (numa_fake && !numa_emulation(start_pfn, end_pfn))
+ if (cmdline && !numa_emulation(start_pfn, end_pfn))
return;
#endif

@@ -502,11 +538,8 @@ static __init int numa_setup(char *opt)
if (!strncmp(opt,"off",3))
numa_off = 1;
#ifdef CONFIG_NUMA_EMU
- if(!strncmp(opt, "fake=", 5)) {
- numa_fake = simple_strtoul(opt+5,NULL,0); ;
- if (numa_fake >= MAX_NUMNODES)
- numa_fake = MAX_NUMNODES;
- }
+ if (!strncmp(opt, "fake=", 5))
+ cmdline = opt + 5;
#endif
#ifdef CONFIG_ACPI_NUMA
if (!strncmp(opt,"noacpi",6))
diff --git a/include/asm-x86_64/mmzone.h b/include/asm-x86_64/mmzone.h
index fb558fb..19a8937 100644
--- a/include/asm-x86_64/mmzone.h
+++ b/include/asm-x86_64/mmzone.h
@@ -49,7 +49,7 @@ extern int pfn_valid(unsigned long pfn);

#ifdef CONFIG_NUMA_EMU
#define FAKE_NODE_MIN_SIZE (64*1024*1024)
-#define FAKE_NODE_MIN_HASH_MASK (~(FAKE_NODE_MIN_SIZE - 1ul))
+#define FAKE_NODE_MIN_HASH_MASK (~(FAKE_NODE_MIN_SIZE - 1uL))
#endif

#endif
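
To make the new syntax concrete, here is a minimal userspace sketch of how a size list such as "2*512,1024,2*256" expands into node sizes. It is not the kernel code from the patch: fake_numa_layout() and GRANULE_MB are hypothetical names, sizes are in megabytes, e820 holes are ignored, and the plain-number form (numa=fake=32) is not handled. Unlike the posted loop, the sketch counts a node that was clamped at the end of RAM.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdint.h>

#define GRANULE_MB 64u  /* assumption: mirrors FAKE_NODE_MIN_SIZE (64M) */

/*
 * Hypothetical helper, not kernel code: expand a numa=fake size list into
 * per-node sizes in megabytes. total_mb is the RAM to carve up.
 * Returns the number of entries written to out[] (at most max_nodes).
 */
static size_t fake_numa_layout(const char *spec, uint64_t total_mb,
                               uint64_t *out, size_t max_nodes)
{
    uint64_t used = 0, num = 0, coeff = 1;
    size_t n = 0;
    const char *p;

    assert(spec && out);
    for (p = spec; ; p++) {
        if (isdigit((unsigned char)*p)) {
            num = num * 10 + (uint64_t)(*p - '0');
            continue;
        }
        if (*p == '*') {
            /* The number before '*' is a repeat count. */
            coeff = num;
        } else if (*p == ',' || *p == '\0') {
            /* The number before ',' or the end is a size; round down. */
            uint64_t size = num - num % GRANULE_MB;

            for (uint64_t i = 0; size && i < coeff && n < max_nodes; i++) {
                uint64_t left = total_mb - used;

                if (left == 0)
                    break;
                /* Clamp the node at the end of RAM. */
                out[n] = size < left ? size : left;
                used += out[n++];
            }
            coeff = 1;
        }
        if (*p == '\0')
            break;
        num = 0;
    }
    /* Leftover RAM becomes one final node unless the list ends in ','. */
    if (n > 0 && n < max_nodes && used < total_mb && p[-1] != ',')
        out[n++] = total_mb - used;
    return n;
}
```

On a hypothetical 4096M machine, "2*512,1024,2*256" would come out as six entries (512, 512, 1024, 256, 256, and a 1536M remainder node), while "4*512," would stop at four nodes because of the trailing comma.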


2007-01-25 21:37:29

by David Rientjes

Subject: [patch -mm 3/5] x86_64: fixed-size remaining fake nodes

Extends the numa=fake x86_64 command-line option to split the remaining
system memory into nodes of fixed size. Any leftover memory is allocated
to a final node unless the command-line ends with a comma.

For example:
numa=fake=2*512,*128 gives two 512M nodes and the remaining system
memory is split into nodes of 128M each.

Cc: Andi Kleen <[email protected]>
Signed-off-by: David Rientjes <[email protected]>
---
Documentation/x86_64/boot-options.txt | 15 ++++++----
arch/x86_64/mm/numa.c | 47 ++++++++++++++++++++++++++-------
2 files changed, 46 insertions(+), 16 deletions(-)

diff --git a/Documentation/x86_64/boot-options.txt b/Documentation/x86_64/boot-options.txt
index 0721416..9917b9f 100644
--- a/Documentation/x86_64/boot-options.txt
+++ b/Documentation/x86_64/boot-options.txt
@@ -153,12 +153,15 @@ NUMA
If a number, fakes CMDLINE nodes and ignores NUMA setup of the
actual machine. Otherwise, system memory is configured
depending on the sizes and coefficients listed. For example:
- numa=fake=2*512,1024,4*256
- gives two 512M nodes, a 1024M node, and four 256M nodes. If
- the last character of CMDLINE is a *, the remaining system
- memory is divided up equally among its previous coefficient.
- If the last character is a comma, the remaining system
- memory is not allocated to an additional node.
+ numa=fake=2*512,1024,4*256,*128
+ gives two 512M nodes, a 1024M node, four 256M nodes, and the
+ rest split into 128M chunks. If the last character of CMDLINE
+ is a *, the remaining memory is divided up equally among its
+ coefficient:
+ numa=fake=2*512,2*
+ gives two 512M nodes and the rest split into two nodes. If
+ the last character is a comma, the remaining system memory is
+ not allocated to an additional node.

numa=hotadd=percent
Only allow hotadd memory to preallocate page structures upto
diff --git a/arch/x86_64/mm/numa.c b/arch/x86_64/mm/numa.c
index 3344d60..2ee228b 100644
--- a/arch/x86_64/mm/numa.c
+++ b/arch/x86_64/mm/numa.c
@@ -359,6 +359,21 @@ static int split_nodes_equally(struct bootnode *nodes, u64 *addr, u64 max_addr,
}

/*
+ * Splits the remaining system RAM into chunks of size. The remaining memory is
+ * always assigned to a final node and can be asymmetric. Returns the number of
+ * nodes split.
+ */
+static int split_nodes_by_size(struct bootnode *nodes, u64 *addr, u64 max_addr,
+ int node_start, u64 size)
+{
+ int i = node_start;
+ size = (size << 20) & FAKE_NODE_MIN_HASH_MASK;
+ while (!setup_node_range(i++, nodes, addr, size, max_addr))
+ ;
+ return i - node_start;
+}
+
+/*
* Sets up the system RAM area from start_pfn to end_pfn according to the
* numa=fake command-line option.
*/
@@ -367,9 +382,10 @@ static int __init numa_emulation(unsigned long start_pfn, unsigned long end_pfn)
struct bootnode nodes[MAX_NUMNODES];
u64 addr = start_pfn << PAGE_SHIFT;
u64 max_addr = end_pfn << PAGE_SHIFT;
- unsigned int coeff;
- unsigned int num = 0;
int num_nodes = 0;
+ int coeff_flag;
+ int coeff = -1;
+ int num = 0;
u64 size;
int i;

@@ -387,29 +403,34 @@ static int __init numa_emulation(unsigned long start_pfn, unsigned long end_pfn)
}

/* Parse the command line. */
- for (coeff = 1; ; cmdline++) {
+ for (coeff_flag = 0; ; cmdline++) {
if (*cmdline && isdigit(*cmdline)) {
num = num * 10 + *cmdline - '0';
continue;
}
- if (*cmdline == '*')
- coeff = num;
+ if (*cmdline == '*') {
+ if (num > 0)
+ coeff = num;
+ coeff_flag = 1;
+ }
if (!*cmdline || *cmdline == ',') {
+ if (!coeff_flag)
+ coeff = 1;
/*
* Round down to the nearest FAKE_NODE_MIN_SIZE.
* Command-line coefficients are in megabytes.
*/
size = (num << 20) & FAKE_NODE_MIN_HASH_MASK;
- if (size) {
+ if (size)
for (i = 0; i < coeff; i++, num_nodes++)
if (setup_node_range(num_nodes, nodes,
&addr, size, max_addr) < 0)
goto done;
- coeff = 1;
- }
+ if (!*cmdline)
+ break;
+ coeff_flag = 0;
+ coeff = -1;
}
- if (!*cmdline)
- break;
num = 0;
}
done:
@@ -417,6 +438,12 @@ done:
return -1;
/* Fill remainder of system RAM, if appropriate. */
if (addr < max_addr) {
+ if (coeff_flag && coeff < 0) {
+ /* Split remaining nodes into num-sized chunks */
+ num_nodes += split_nodes_by_size(nodes, &addr, max_addr,
+ num_nodes, num);
+ goto out;
+ }
switch (*(cmdline - 1)) {
case '*':
/* Split remaining nodes into coeff chunks */
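
The fixed-size split can be sketched in userspace as follows. This is only an illustration under stated assumptions: split_by_size_mb() and GRANULE_MB are hypothetical names, sizes are in megabytes, and e820 holes are ignored. From my reading of the posted hunk, split_nodes_by_size() appears to have neither a guard for a size that rounds down to zero (anything below 64M) nor a bound on the node index, so the sketch adds both explicitly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define GRANULE_MB 64u  /* assumption: mirrors FAKE_NODE_MIN_SIZE (64M) */

/*
 * Hypothetical helper, not kernel code: carve remaining_mb into nodes of
 * size_mb each (rounded down to the granule). The last node receives
 * whatever is left and may be smaller. Returns the number of nodes.
 */
static size_t split_by_size_mb(uint64_t remaining_mb, uint64_t size_mb,
                               uint64_t *out, size_t max_nodes)
{
    uint64_t size = size_mb - size_mb % GRANULE_MB;
    size_t n = 0;

    assert(out);
    /* A zero step would never consume memory, so refuse it up front. */
    if (size == 0)
        return 0;
    while (remaining_mb > 0 && n < max_nodes) {
        uint64_t take = size < remaining_mb ? size : remaining_mb;

        out[n++] = take;
        remaining_mb -= take;
    }
    return n;
}
```

With 1000M remaining and a 128M chunk size this yields seven 128M nodes plus a final 104M node. If max_nodes is reached first, the rest of the memory is simply left unassigned in this sketch.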

2007-01-25 21:37:42

by David Rientjes

Subject: [patch -mm 2/5] x86_64: split remaining fake nodes equally

Extends the numa=fake x86_64 command-line option to split the remaining
system memory into equal-sized nodes.

For example:
numa=fake=2*512,4* gives two 512M nodes and the remaining system
memory is split into four approximately equal
chunks.

Cc: Andi Kleen <[email protected]>
Signed-off-by: David Rientjes <[email protected]>
---
Documentation/x86_64/boot-options.txt | 4 +++-
arch/x86_64/mm/numa.c | 24 +++++++++++++++++++-----
2 files changed, 22 insertions(+), 6 deletions(-)

diff --git a/Documentation/x86_64/boot-options.txt b/Documentation/x86_64/boot-options.txt
index 6ccdb5e..0721416 100644
--- a/Documentation/x86_64/boot-options.txt
+++ b/Documentation/x86_64/boot-options.txt
@@ -155,7 +155,9 @@ NUMA
depending on the sizes and coefficients listed. For example:
numa=fake=2*512,1024,4*256
gives two 512M nodes, a 1024M node, and four 256M nodes. If
- the last character of CMDLINE is a comma, the remaining system
+ the last character of CMDLINE is a *, the remaining system
+ memory is divided up equally among its previous coefficient.
+ If the last character is a comma, the remaining system
memory is not allocated to an additional node.

numa=hotadd=percent
diff --git a/arch/x86_64/mm/numa.c b/arch/x86_64/mm/numa.c
index 0417921..3344d60 100644
--- a/arch/x86_64/mm/numa.c
+++ b/arch/x86_64/mm/numa.c
@@ -415,11 +415,25 @@ static int __init numa_emulation(unsigned long start_pfn, unsigned long end_pfn)
done:
if (!num_nodes)
return -1;
- /* Fill remainder of system RAM with a final node, if appropriate. */
- if (addr < max_addr && *(cmdline - 1) != ',') {
- setup_node_range(num_nodes, nodes, &addr, max_addr - addr,
- max_addr);
- num_nodes++;
+ /* Fill remainder of system RAM, if appropriate. */
+ if (addr < max_addr) {
+ switch (*(cmdline - 1)) {
+ case '*':
+ /* Split remaining nodes into coeff chunks */
+ if (coeff <= 0)
+ break;
+ num_nodes += split_nodes_equally(nodes, &addr, max_addr,
+ num_nodes, coeff);
+ break;
+ case ',':
+ /* Do not allocate remaining system RAM */
+ break;
+ default:
+ /* Give one final node */
+ setup_node_range(num_nodes, nodes, &addr,
+ max_addr - addr, max_addr);
+ num_nodes++;
+ }
}
out:
memnode_shift = compute_hash_shift(nodes, num_nodes);
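
The arithmetic behind "approximately equal" can be sketched in userspace like this. It follows the idea in split_nodes_equally() (round each share down to a 64M granule, hand the rounded-off granules to the first few nodes, and let the last node absorb the rest) but is not the kernel code: split_equally_mb() and GRANULE_MB are hypothetical names, sizes are in megabytes, and e820 holes are ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define GRANULE_MB 64u  /* assumption: mirrors FAKE_NODE_MIN_SIZE (64M) */

/*
 * Hypothetical helper, not kernel code: split remaining_mb into n nodes.
 * Returns 0 on success, or -1 if n is zero or each node would be smaller
 * than one granule. out[] must have room for n entries.
 */
static int split_equally_mb(uint64_t remaining_mb, size_t n, uint64_t *out)
{
    uint64_t share, base, big, used = 0;
    size_t i;

    assert(out);
    if (n == 0)
        return -1;
    share = remaining_mb / n;
    base = share - share % GRANULE_MB;      /* round down to a granule */
    if (base == 0)
        return -1;
    /* Number of nodes that can receive one extra granule. */
    big = (share % GRANULE_MB) * n / GRANULE_MB;
    for (i = 0; i < n; i++) {
        uint64_t size = base + (i < big ? GRANULE_MB : 0);

        /* The final node takes whatever memory is left. */
        if (i == n - 1)
            size = remaining_mb - used;
        out[i] = size;
        used += size;
    }
    return 0;
}
```

For example, 1000M split four ways gives a 250M share, a 192M base, and three "big" nodes, so the result is 256, 256, 256, and 232.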

2007-01-25 21:38:00

by David Rientjes

Subject: [patch -mm 5/5] x86_64: fake numa for cpusets document

Create a document to explain how to use numa=fake in conjunction with
cpusets for coarse memory resource management.

This is an attempt to get more awareness and testing for this feature.

Cc: Andi Kleen <[email protected]>
Signed-off-by: David Rientjes <[email protected]>
---
Documentation/x86_64/fake-numa-for-cpusets | 66 ++++++++++++++++++++++++++++
1 files changed, 66 insertions(+), 0 deletions(-)
create mode 100644 Documentation/x86_64/fake-numa-for-cpusets

diff --git a/Documentation/x86_64/fake-numa-for-cpusets b/Documentation/x86_64/fake-numa-for-cpusets
new file mode 100644
index 0000000..d1a985c
--- /dev/null
+++ b/Documentation/x86_64/fake-numa-for-cpusets
@@ -0,0 +1,66 @@
+Using numa=fake and CPUSets for Resource Management
+Written by David Rientjes <[email protected]>
+
+This document describes how the numa=fake x86_64 command-line option can be used
+in conjunction with cpusets for coarse memory management. Using this feature,
+you can create fake NUMA nodes that represent contiguous chunks of memory and
+assign them to cpusets and their attached tasks. This is a way of limiting the
+amount of system memory that is available to a certain class of tasks.
+
+For more information on the features of cpusets, see Documentation/cpusets.txt.
+There are a number of different configurations you can use for your needs. For
+more information on the numa=fake command line option and its various ways of
+configuring fake nodes, see Documentation/x86_64/boot-options.txt.
+
+For the purposes of this introduction, we'll assume a very primitive NUMA
+emulation setup of "numa=fake=4*512,". This will split our system memory into
+four equal chunks of 512M each that we can now use to assign to cpusets. As
+you become more familiar with using this combination for resource control,
+you'll determine a better setup to minimize the number of nodes you have to deal
+with.
+
+A machine may be split as follows with "numa=fake=4*512," as reported by dmesg:
+
+ Faking node 0 at 0000000000000000-0000000020000000 (512MB)
+ Faking node 1 at 0000000020000000-0000000040000000 (512MB)
+ Faking node 2 at 0000000040000000-0000000060000000 (512MB)
+ Faking node 3 at 0000000060000000-0000000080000000 (512MB)
+ ...
+ On node 0 totalpages: 130975
+ On node 1 totalpages: 131072
+ On node 2 totalpages: 131072
+ On node 3 totalpages: 131072
+
+Now following the instructions for mounting the cpusets filesystem from
+Documentation/cpusets.txt, you can assign fake nodes (i.e. contiguous memory
+address spaces) to individual cpusets:
+
+ [root@xroads /]# mkdir exampleset
+ [root@xroads /]# mount -t cpuset none exampleset
+ [root@xroads /]# mkdir exampleset/ddset
+ [root@xroads /]# cd exampleset/ddset
+ [root@xroads /exampleset/ddset]# echo 0-1 > cpus
+ [root@xroads /exampleset/ddset]# echo 0-1 > mems
+
+Now this cpuset, 'ddset', will only be allowed access to fake nodes 0 and 1 for
+memory allocations (1G).
+
+You can now assign tasks to these cpusets to limit the memory resources
+available to them according to the fake nodes assigned as mems:
+
+ [root@xroads /exampleset/ddset]# echo $$ > tasks
+ [root@xroads /exampleset/ddset]# dd if=/dev/zero of=tmp bs=1024 count=1G
+ [1] 13425
+
+Notice the difference between the system memory usage as reported by
+/proc/meminfo between the restricted cpuset case above and the unrestricted
+case (i.e. running the same 'dd' command without assigning it to a fake NUMA
+cpuset):
+ Unrestricted Restricted
+ MemTotal: 3091900 kB 3091900 kB
+ MemFree: 42113 kB 1513236 kB
+
+This allows for coarse memory management for the tasks you assign to particular
+cpusets. Since cpusets can form a hierarchy, you can create some pretty
+interesting combinations of use-cases for various classes of tasks for your
+memory management needs.
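
The "(1G)" figure in the document above is just the number of nodes listed in mems times the fake node size. A small userspace sketch of that calculation, assuming all fake nodes have the same nominal size and that the list uses the comma-and-range form shown in the example ("0-1"); count_nodes_in_list() and cpuset_mem_ceiling_mb() are hypothetical helpers, not part of any kernel or cpuset interface:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical helper: count the nodes named by a list such as "0-1" or
 * "0,2-3". Overlapping ranges are counted twice. Returns -1 on malformed
 * input.
 */
static long count_nodes_in_list(const char *list)
{
    const char *p = list;
    long count = 0;

    assert(list);
    while (*p) {
        char *end;
        long lo, hi;

        lo = strtol(p, &end, 10);
        if (end == p || lo < 0)
            return -1;
        hi = lo;
        p = end;
        if (*p == '-') {
            /* A range "lo-hi" names hi - lo + 1 nodes. */
            p++;
            hi = strtol(p, &end, 10);
            if (end == p || hi < lo)
                return -1;
            p = end;
        }
        count += hi - lo + 1;
        if (*p == ',')
            p++;
        else if (*p != '\0' && *p != '\n')
            return -1;
        else
            break;
    }
    return count;
}

/* Upper bound, in megabytes, on memory reachable through the listed nodes. */
static int64_t cpuset_mem_ceiling_mb(const char *mems, int64_t node_mb)
{
    long n = count_nodes_in_list(mems);

    return n < 0 ? -1 : (int64_t)n * node_mb;
}
```

For mems set to "0-1" with 512M nodes this gives 1024M. It is only an upper bound: as the dmesg excerpt shows, node 0 reports slightly fewer pages than the others.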

2007-01-25 21:47:57

by David Rientjes

Subject: [patch -mm 4/5] x86_64: fake numa function annotations

Mark the new numa=fake x86_64 helper functions, setup_node_range(),
split_nodes_equally(), and split_nodes_by_size() as __init.

Cc: Andi Kleen <[email protected]>
Signed-off-by: David Rientjes <[email protected]>
---
arch/x86_64/mm/numa.c | 13 +++++++------
1 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/arch/x86_64/mm/numa.c b/arch/x86_64/mm/numa.c
index 2ee228b..5d8fee6 100644
--- a/arch/x86_64/mm/numa.c
+++ b/arch/x86_64/mm/numa.c
@@ -287,8 +287,8 @@ char *cmdline __initdata;
* if there is additional memory left for allocation past addr and -1 otherwise.
* addr is adjusted to be at the end of the node.
*/
-static int setup_node_range(int nid, struct bootnode *nodes, u64 *addr,
- u64 size, u64 max_addr)
+static int __init setup_node_range(int nid, struct bootnode *nodes, u64 *addr,
+ u64 size, u64 max_addr)
{
int ret = 0;
nodes[nid].start = *addr;
@@ -310,8 +310,9 @@ static int setup_node_range(int nid, struct bootnode *nodes, u64 *addr,
* is the number of nodes split up and addr is adjusted to be at the end of the
* last node allocated.
*/
-static int split_nodes_equally(struct bootnode *nodes, u64 *addr, u64 max_addr,
- int node_start, int num_nodes)
+static int __init split_nodes_equally(struct bootnode *nodes, u64 *addr,
+ u64 max_addr, int node_start,
+ int num_nodes)
{
unsigned int big;
u64 size;
@@ -363,8 +364,8 @@ static int split_nodes_equally(struct bootnode *nodes, u64 *addr, u64 max_addr,
* always assigned to a final node and can be asymmetric. Returns the number of
* nodes split.
*/
-static int split_nodes_by_size(struct bootnode *nodes, u64 *addr, u64 max_addr,
- int node_start, u64 size)
+static int __init split_nodes_by_size(struct bootnode *nodes, u64 *addr,
+ u64 max_addr, int node_start, u64 size)
{
int i = node_start;
size = (size << 20) & FAKE_NODE_MIN_HASH_MASK;

2007-01-29 13:39:22

by Andi Kleen

Subject: Re: [patch -mm 3/5] x86_64: fixed-size remaining fake nodes

On Thursday 25 January 2007 22:37, David Rientjes wrote:
> Any leftover memory is allocated
> to a final node unless the command-line ends with a comma.

That sounds like syntactical vinegar and a nasty trap. Remember
that Venus probe that got lost because of a wrong comma.
Can you find some nicer syntax for that please?

Also it's pretty complex. Are there use cases for all of this?

-Andi

2007-01-29 18:39:10

by David Rientjes

Subject: Re: [patch -mm 3/5] x86_64: fixed-size remaining fake nodes

On Mon, 29 Jan 2007, Andi Kleen wrote:

> On Thursday 25 January 2007 22:37, David Rientjes wrote:
> > Any leftover memory is allocated
> > to a final node unless the command-line ends with a comma.
>
> That sounds like syntactical vinegar and a nasty trap. Remember
> that Venus probe that got lost because of a wrong comma.
> Can you find some nicer syntax for that please?
>

The only other appropriate syntax that comes to mind is perhaps a
command-line that ends with a 0. For example, numa=fake=2*512,0 would
allocate two 512M nodes and nothing for the remaining RAM.

> Also it's pretty complex. Are there use cases for all of this?
>

There are. Configurable node sizes (i.e. 'numa=fake=512,4*128', etc) are
the major concept and help to avoid the overhead associated with something
like 64 nodes of 64M each on a 4G machine. We've seen some inefficiencies
with scanning through so many zone lists on page_alloc when we encounter a
full node. Additional forms such as 'numa=fake=2*512,*128' are meant
more for machines where you're unsure of the total system RAM in the
first place but want to make sure you get the node sizes you need.

David

2007-01-30 01:20:11

by David Rientjes

Subject: Re: [patch -mm 3/5] x86_64: fixed-size remaining fake nodes

On Mon, 29 Jan 2007, David Rientjes wrote:

> On Mon, 29 Jan 2007, Andi Kleen wrote:
>
> > On Thursday 25 January 2007 22:37, David Rientjes wrote:
> > > Any leftover memory is allocated
> > > to a final node unless the command-line ends with a comma.
> >
> > That sounds like syntactical vinegar and a nasty trap. Remember
> > that Venus probe that got lost because of a wrong comma.
> > Can you find some nicer syntax for that please?
> >
>

I agree it's not a good idea to prevent the remaining RAM from being
allocated to an additional node. It was helpful in testing and the
gathering of benchmarks for the purpose of memory management, but not for
real-world cases. It's been removed.

David