2004-10-12 15:08:00

by Christoph Lameter

Subject: NUMA: Patch for node based swapping

In a NUMA system, individual nodes may run out of memory. This may occur
even by merely reading files, which clutters node memory with cached
pages from those files.

However, as long as the system as a whole has enough memory available,
kswapd is not run at all. This means that a process allocating memory while
running on a node that has no memory left will get memory allocated from
other nodes, which is inefficient. It would be better if kswapd threw out
some pages (maybe some of the cached pages from files that have only been
read once) to reclaim memory in the node.

The following patch checks the memory usage after each allocation in a
zone. If the free memory in the zone falls below a certain minimum, kswapd is
started for that zone alone.

The minimum may be controlled through /proc/sys/vm/node_swap.
By default node_swap is set to 100, which means that kswapd will be run on
a zone if less than 10% of its pages are available after an allocation.

Nick Piggin has a much better overall solution in the overhaul of the
memory subsystem that he is working on. I hope this patch may provide
a solution until Nick's patch gets into the kernel.

Index: linux-2.6.9-rc4/mm/page_alloc.c
===================================================================
--- linux-2.6.9-rc4.orig/mm/page_alloc.c 2004-10-10 19:57:03.000000000 -0700
+++ linux-2.6.9-rc4/mm/page_alloc.c 2004-10-11 12:54:51.000000000 -0700
@@ -41,6 +41,9 @@
long nr_swap_pages;
int numnodes = 1;
int sysctl_lower_zone_protection = 0;
+#ifdef CONFIG_NUMA
+int sysctl_node_swap = 100; /* invoke kswapd when local node memory lower than 10% */
+#endif

EXPORT_SYMBOL(totalram_pages);
EXPORT_SYMBOL(nr_swap_pages);
@@ -483,6 +486,13 @@
p = &z->pageset[cpu];
if (pg == orig) {
z->pageset[cpu].numa_hit++;
+ /*
+ * If zone allocation leaves less than a (sysctl_node_swap * 10) %
+ * of the zone free then invoke kswapd.
+ * (to make it efficient we do (pages * sysctl_node_swap) / 1024))
+ */
+ if (z->free_pages < (z->present_pages * sysctl_node_swap) >> 10)
+ wakeup_kswapd(z);
} else {
p->numa_miss++;
zonelist->zones[0]->pageset[cpu].numa_foreign++;
Index: linux-2.6.9-rc4/kernel/sysctl.c
===================================================================
--- linux-2.6.9-rc4.orig/kernel/sysctl.c 2004-10-10 19:57:03.000000000 -0700
+++ linux-2.6.9-rc4/kernel/sysctl.c 2004-10-11 12:54:51.000000000 -0700
@@ -65,6 +65,9 @@
extern int min_free_kbytes;
extern int printk_ratelimit_jiffies;
extern int printk_ratelimit_burst;
+#ifdef CONFIG_NUMA
+extern int sysctl_node_swap;
+#endif

#if defined(CONFIG_X86_LOCAL_APIC) && defined(__i386__)
int unknown_nmi_panic;
@@ -800,7 +803,17 @@
.extra1 = &zero,
},
#endif
- { .ctl_name = 0 }
+#ifdef CONFIG_NUMA
+ {
+ .ctl_name = VM_NODE_SWAP,
+ .procname = "node_swap",
+ .data = &sysctl_node_swap,
+ .maxlen = sizeof(sysctl_node_swap),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec
+ },
+#endif
+ { .ctl_name = 0 }
};

static ctl_table proc_table[] = {
Index: linux-2.6.9-rc4/include/linux/sysctl.h
===================================================================
--- linux-2.6.9-rc4.orig/include/linux/sysctl.h 2004-10-10 19:58:05.000000000 -0700
+++ linux-2.6.9-rc4/include/linux/sysctl.h 2004-10-11 12:54:51.000000000 -0700
@@ -167,6 +167,7 @@
VM_HUGETLB_GROUP=25, /* permitted hugetlb group */
VM_VFS_CACHE_PRESSURE=26, /* dcache/icache reclaim pressure */
VM_LEGACY_VA_LAYOUT=27, /* legacy/compatibility virtual address space layout */
+ VM_NODE_SWAP=28, /* Swap local node memory limit (in % *10) */
};


Index: linux-2.6.9-rc4/mm/vmscan.c
===================================================================
--- linux-2.6.9-rc4.orig/mm/vmscan.c 2004-10-10 19:57:04.000000000 -0700
+++ linux-2.6.9-rc4/mm/vmscan.c 2004-10-11 12:54:51.000000000 -0700
@@ -1168,9 +1168,11 @@
*/
void wakeup_kswapd(struct zone *zone)
{
+ extern int sysctl_node_swap;
+
if (zone->present_pages == 0)
return;
- if (zone->free_pages > zone->pages_low)
+ if (zone->free_pages > (zone->present_pages * sysctl_node_swap) >> 10 && zone->free_pages > zone->pages_low)
return;
if (!waitqueue_active(&zone->zone_pgdat->kswapd_wait))
return;


2004-10-12 15:17:42

by Martin J. Bligh

Subject: Re: NUMA: Patch for node based swapping

> In a NUMA systems single nodes may run out of memory. This may occur even
> by only reading from files which will clutter node memory with cached
> pages from the file.
>
> However, as long as the system as a whole does have enough memory
> available, kswapd is not run at all. This means that a process allocating
> memory and running on a node that has no memory left, will get memory
> allocated from other nodes which is inefficient to handle. It would be
> better if kswapd would throw out some pages (maybe some of the cached
> pages from files that have only once been read) to reclaim memory in the
> node.
>
> The following patch checks the memory usage after each allocation in a
> zone. If the allocation in a zone falls below a certain minimum, kswapd is
> started for that zone alone.

I agree it's a problem, but you really don't want to go kicking pages out
to disk when we have free memory - the solution is, I think, to migrate
the least-recently used pages out to the other node, not all the way to
disk. The page relocate stuff from the defrag code being proposed may help
(if they fix it not to go via swap ;-)). I'll try to find some time to
look at it again.

M.

PS, might be possible to add a mechanism to ask kswapd to reclaim some
cache pages without doing swapout, but I fear messing with the delicate
balance of the universe - cache vs user.

2004-10-12 15:22:57

by Jan-Benedict Glaw

Subject: Re: NUMA: Patch for node based swapping

On Tue, 2004-10-12 08:02:40 -0700, Christoph Lameter <[email protected]>
wrote in message <[email protected]>:
> --- linux-2.6.9-rc4.orig/mm/page_alloc.c 2004-10-10 19:57:03.000000000 -0700
> +++ linux-2.6.9-rc4/mm/page_alloc.c 2004-10-11 12:54:51.000000000 -0700
> @@ -483,6 +486,13 @@
> p = &z->pageset[cpu];
> if (pg == orig) {
> z->pageset[cpu].numa_hit++;
> + /*
> + * If zone allocation leaves less than a (sysctl_node_swap * 10) %
> + * of the zone free then invoke kswapd.
> + * (to make it efficient we do (pages * sysctl_node_swap) / 1024))
> + */
> + if (z->free_pages < (z->present_pages * sysctl_node_swap) >> 10)
> + wakeup_kswapd(z);
> } else {
> p->numa_miss++;
> zonelist->zones[0]->pageset[cpu].numa_foreign++;

Shouldn't the comment read "less than (sysctl_node_swap / 10) %",
because the value in sysctl_node_swap is actually percent*10, so you
need the reverse action here?!

MfG, JBG

--
Jan-Benedict Glaw [email protected] . +49-172-7608481 _ O _
"Eine Freie Meinung in einem Freien Kopf | Gegen Zensur | Gegen Krieg _ _ O
fuer einen Freien Staat voll Freier Bürger" | im Internet! | im Irak! O O O
ret = do_actions((curr | FREE_SPEECH) & ~(NEW_COPYRIGHT_LAW | DRM | TCPA));



2004-10-12 15:29:47

by Rik van Riel

Subject: Re: NUMA: Patch for node based swapping

On Tue, 12 Oct 2004, Christoph Lameter wrote:

> The minimum may be controlled through /proc/sys/vm/node_swap.
> By default node_swap is set to 100 which means that kswapd will be run on
> a zone if less than 10% are available after allocation.

That sounds like an extraordinarily bad idea for eg. AMD64
systems, which have a very low numa factor.

--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan

2004-10-12 15:42:00

by Christoph Lameter

Subject: Re: NUMA: Patch for node based swapping

On Tue, 12 Oct 2004, Rik van Riel wrote:
> On Tue, 12 Oct 2004, Christoph Lameter wrote:
> > The minimum may be controlled through /proc/sys/vm/node_swap.
> > By default node_swap is set to 100 which means that kswapd will be run on
> > a zone if less than 10% are available after allocation.
> That sounds like an extraordinarily bad idea for eg. AMD64
> systems, which have a very low numa factor.

Any other suggestions?

2004-10-12 15:41:24

by Christoph Lameter

Subject: Re: NUMA: Patch for node based swapping

On Tue, 12 Oct 2004, Martin J. Bligh wrote:

> PS, might be possible to add a mechanism to ask kswapd to reclaim some
> cache pages without doing swapout, but I fear of messing with the delicate
> balance of the universe - cache vs user.

That is also my concern. I think the patch is useful to address the
immediate issue.

2004-10-12 15:52:52

by Rik van Riel

Subject: Re: NUMA: Patch for node based swapping

On Tue, 12 Oct 2004, Christoph Lameter wrote:

> Any other suggestions?

Since this is meant as a stop gap patch, waiting for a real
solution, and is only relevant for big (and rare) systems,
it would be an idea to at least leave it off by default.

I think it would be safe to assume that a $100k system has
a system administrator looking after it, while a $5k AMD64
whitebox might not have somebody watching its performance.

--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan

2004-10-12 19:34:16

by Anton Blanchard

Subject: Re: NUMA: Patch for node based swapping


> That sounds like an extraordinarily bad idea for eg. AMD64
> systems, which have a very low numa factor.

Same with ppc64.

Anton

2004-10-12 19:57:42

by Andi Kleen

Subject: Re: NUMA: Patch for node based swapping

Rik van Riel <[email protected]> writes:

> On Tue, 12 Oct 2004, Christoph Lameter wrote:
>
>> The minimum may be controlled through /proc/sys/vm/node_swap.
>> By default node_swap is set to 100 which means that kswapd will be run on
>> a zone if less than 10% are available after allocation.
>
> That sounds like an extraordinarily bad idea for eg. AMD64
> systems, which have a very low numa factor.

As an optional sysctl it makes sense even on AMD64. On some benchmarks
you see the differences between local and remote memory very clearly.

-Andi

2004-10-12 20:20:40

by Christoph Lameter

Subject: Re: NUMA: Patch for node based swapping

On Tue, 12 Oct 2004, Rik van Riel wrote:

> On Tue, 12 Oct 2004, Christoph Lameter wrote:
>
> > Any other suggestions?
>
> Since this is meant as a stop gap patch, waiting for a real
> solution, and is only relevant for big (and rare) systems,
> it would be an idea to at least leave it off by default.
>
> I think it would be safe to assume that a $100k system has
> a system administrator looking after it, while a $5k AMD64
> whitebox might not have somebody watching its performance.

Ok. Will do that then. Should I submit the patch to Andrew?

2004-10-12 21:33:52

by Ray Bryant

Subject: Re: NUMA: Patch for node based swapping

This patch is a bad idea and should not be merged into the mainline.

(1) On bids for large SGI machines, we often see the requirement that
more than 90% of main memory be allocatable to user programs. If, as
suggested, one were to set /proc/sys/vm/node_swap to 10%, then any
allocation (e. g. allocation of a page cache page) will kick off kswapd
once the customer has allocated more than 90% of storage. The result is
that kswapd will be more or less constantly running on every node in the
system. Since that same 90% requirement is often used to size the amount
of memory purchased to run the customer's primary application, we have a
recipe for providing poor performance for that principal application. As
a result we will likely end up disabling this feature on those large SGI
machines, were it to end up in one of our kernels.

Setting the node_swap limit to less than 10% would keep this from happening,
of course, but in this case the improvement gained is marginal and likely
not worth the effort.

(2) In HPC applications, it is not sufficient to get "mostly" local storage.
Quite often such applications "settle in" on a set of nodes and sit there and
compute for an extremely long time. Any imbalance in execution times between
the threads of such an application (e. g. due to one thread having one or more
pages located on a remote node) results in the entire application being slowed
down. (A parallel application often runs only as quickly as its slowest thread.)

The application people running benchmarks for our systems insist that 100% of
the storage they request as local actually be backed by local storage.
Getting 98% of that figure is not acceptable. Because this patch kicks off
kswapd asynchronously from the storage request, the current page being
allocated can still end up being allocated off node. If one tries to solve
this problem by setting the threshold lower (say at 20% of main memory), then
when the benchmark allocates 90% of memory we end up back in a situation
described above where any storage allocation will cause kswapd to run.
(Remember that even in an idle system, Linux is constantly scribbling stuff
out to disk -- so there are always allocations going on by way of
__alloc_pages().) Even then there is no guarantee that kswapd will be able to
free up storage quickly enough to keep ahead of allocations. (The real
problem here, of course, is that clean page cache pages can fill up the node
and cause off-node allocations to occur; we would like to free those instead.)

I have patches that I am currently working on to do the latter instead of
the approach of this patch, and once we get those working I'd prefer to
see those included in the mainline instead of this solution.
--
Best Regards,
Ray
-----------------------------------------------
Ray Bryant
512-453-9679 (work) 512-507-7807 (cell)
[email protected] [email protected]
The box said: "Requires Windows 98 or better",
so I installed Linux.
-----------------------------------------------

2004-10-13 11:00:48

by Nick Piggin

Subject: Re: NUMA: Patch for node based swapping

Christoph Lameter wrote:
> On Tue, 12 Oct 2004, Rik van Riel wrote:
>
>
>>On Tue, 12 Oct 2004, Christoph Lameter wrote:
>>
>>
>>>Any other suggestions?
>>
>>Since this is meant as a stop gap patch, waiting for a real
>>solution, and is only relevant for big (and rare) systems,
>>it would be an idea to at least leave it off by default.
>>
>>I think it would be safe to assume that a $100k system has
>>a system administrator looking after it, while a $5k AMD64
>>whitebox might not have somebody watching its performance.
>
>
> Ok. Will do that then. Should I submit the patch to Andrew?
>

I can't see the harm in sending it after 2.6.9 if it defaults
to off (maybe also make it CONFIG_NUMA).

OTOH, if it is going to be painful to remove later on, then
maybe leave it local to your tree.

It's true that I have something a bit more sophisticated in
the pipe, but it is going to be an uphill battle to get it
and everything it depends on merged - so don't count on it for
2.6.10 :P

2004-10-13 15:15:20

by Christoph Lameter

Subject: NUMA: Patch for node based swapping V2

This was discussed yesterday on linux-mm.

Changelog:
* NUMA: Add the ability to invoke kswapd on a node if local memory falls below
a certain threshold. A node may fill up its memory by simply copying a file,
which fills the node's memory with cached pages. The node's memory will
currently only be reclaimed if all nodes in the system fall below a certain
threshold. Until then, processes on the node will only be allocated
off-node memory. Invoking kswapd on a node fixes this situation until
a better solution can be found.
* The threshold may be set in /proc/sys/vm/node_swap, in percent * 10. The
threshold is zero by default, which means that node swapping is off.

Index: linux-2.6.9-rc4/mm/page_alloc.c
===================================================================
--- linux-2.6.9-rc4.orig/mm/page_alloc.c 2004-10-10 19:57:03.000000000 -0700
+++ linux-2.6.9-rc4/mm/page_alloc.c 2004-10-13 07:58:57.000000000 -0700
@@ -41,6 +41,19 @@
long nr_swap_pages;
int numnodes = 1;
int sysctl_lower_zone_protection = 0;
+#ifdef CONFIG_NUMA
+/*
+ * sysctl_node_swap is a percentage of the pages available
+ * in a zone, multiplied by 10. If the available pages
+ * in a zone drop below this limit then kswapd is invoked
+ * for this zone alone. This results in the reclaiming
+ * of local memory. Local memory may be filled up by simply
+ * reading a file. If local memory is not available, off-node
+ * memory will be allocated to the process, which makes all
+ * memory accesses less efficient than they could be.
+ */
+int sysctl_node_swap = 0;
+#endif

EXPORT_SYMBOL(totalram_pages);
EXPORT_SYMBOL(nr_swap_pages);
@@ -483,6 +496,14 @@
p = &z->pageset[cpu];
if (pg == orig) {
z->pageset[cpu].numa_hit++;
+ /*
+ * If zone allocation has left less than
+ * (sysctl_node_swap / 10) % of the zone free invoke kswapd.
+ * (the page limit is obtained through (pages*limit)/1024 to
+ * make the calculation more efficient)
+ */
+ if (z->free_pages < (z->present_pages * sysctl_node_swap) >> 10)
+ wakeup_kswapd(z);
} else {
p->numa_miss++;
zonelist->zones[0]->pageset[cpu].numa_foreign++;
Index: linux-2.6.9-rc4/kernel/sysctl.c
===================================================================
--- linux-2.6.9-rc4.orig/kernel/sysctl.c 2004-10-10 19:57:03.000000000 -0700
+++ linux-2.6.9-rc4/kernel/sysctl.c 2004-10-11 12:54:51.000000000 -0700
@@ -65,6 +65,9 @@
extern int min_free_kbytes;
extern int printk_ratelimit_jiffies;
extern int printk_ratelimit_burst;
+#ifdef CONFIG_NUMA
+extern int sysctl_node_swap;
+#endif

#if defined(CONFIG_X86_LOCAL_APIC) && defined(__i386__)
int unknown_nmi_panic;
@@ -800,7 +803,17 @@
.extra1 = &zero,
},
#endif
- { .ctl_name = 0 }
+#ifdef CONFIG_NUMA
+ {
+ .ctl_name = VM_NODE_SWAP,
+ .procname = "node_swap",
+ .data = &sysctl_node_swap,
+ .maxlen = sizeof(sysctl_node_swap),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec
+ },
+#endif
+ { .ctl_name = 0 }
};

static ctl_table proc_table[] = {
Index: linux-2.6.9-rc4/include/linux/sysctl.h
===================================================================
--- linux-2.6.9-rc4.orig/include/linux/sysctl.h 2004-10-10 19:58:05.000000000 -0700
+++ linux-2.6.9-rc4/include/linux/sysctl.h 2004-10-11 12:54:51.000000000 -0700
@@ -167,6 +167,7 @@
VM_HUGETLB_GROUP=25, /* permitted hugetlb group */
VM_VFS_CACHE_PRESSURE=26, /* dcache/icache reclaim pressure */
VM_LEGACY_VA_LAYOUT=27, /* legacy/compatibility virtual address space layout */
+ VM_NODE_SWAP=28, /* Swap local node memory limit (in % *10) */
};


Index: linux-2.6.9-rc4/mm/vmscan.c
===================================================================
--- linux-2.6.9-rc4.orig/mm/vmscan.c 2004-10-10 19:57:04.000000000 -0700
+++ linux-2.6.9-rc4/mm/vmscan.c 2004-10-11 12:54:51.000000000 -0700
@@ -1168,9 +1168,11 @@
*/
void wakeup_kswapd(struct zone *zone)
{
+ extern int sysctl_node_swap;
+
if (zone->present_pages == 0)
return;
- if (zone->free_pages > zone->pages_low)
+ if (zone->free_pages > (zone->present_pages * sysctl_node_swap) >> 10 && zone->free_pages > zone->pages_low)
return;
if (!waitqueue_active(&zone->zone_pgdat->kswapd_wait))
return;