LinuxLists.cc - [PATCH/RFC] A method for clearing out page cache

[permalink] [raw]

Subject: Re: [PATCH/RFC] A method for clearing out page cache

Questions concerning this page cache patch that Martin submitted,
as a merge of something originally written by Ray Bryant.

The following patch is not really a patch. It is a few questions, a
couple minor space tweaks, and a never compiled nor tested rewrite of
proc_do_toss_page_cache_nodes() to try to make it look a little
prettier.

Some of the issues are cosmetic, but some I suspect warrant competent
response by Martin or Ray, before this goes into *-mm, such as some
questions as to whether locking is adequate, or a kmalloc() size might
be forced huge by the user. And my suggested rewrite changes the kernel
API in one error case, so better to decide that matter before it is
too widely used.

Specifically:

1) A couple of kmalloc's are done using lengths that
so far as I could tell, came straight from user land.
Never let the user size a kernel malloc without limit,
as it makes it way too easy to ask for something huge,
and give the kernel indigestion. If the lengths in
question are actually limited, then nevermind (or comment
in a terse one-liner, for worry warts such as myself).

2) Beware that this patch depends on the cpuset patch:
new-bitmap-list-format-for-cpusets.patch
which is still in *-mm only, for the routines
bitmap_scnlistprintf/bitmap_parselist.

3) Should the maxlen of a nodemask for the sysctl
handler for proc_do_toss_page_cache_nodes be the byte
length of the kernels internal binary nodemask, or
a reasonable upper bound on the max length of the
ascii representation thereof, which is about the value:
100 + 6 * MAX_NUMNODES
when using the bitmap_scnlistprintf/bitmap_parselist
format.

4) A couple of existing blank lines were nuked by this
patch - I restored them. I though them to be nice blank lines ;).

5) The requirement to read the string in one read(2) syscall
seemed like it might be draconian. If the available
apparatus supports it, better to allocate the ascii buffer
on the open for read, let the reads (and seeks) feast on
that buffer, using f_pos as it should be used, and freeing
the buffer on the close. Mind you, I have no idea if the
sysctl.c apparatus conveniently supports this.

6) The kernel header bitops.h is no longer needed by sysctl.c,
following my (uncompiled, untested) rewrite.

7) Instead of two counters to track how many threads remained
to be waited for, toss_done and nodes_to_toss, my rewrite
just has one: num_toss_threads_active. It bumps that value
once each kthread it starts, decrements it as each thread
finishes, and waits for it to get back to zero in the loop.

8) Several changes in the rewrite of proc_do_toss_page_cache_nodes():
- rename 'retval' to 'ret' (more common, shorter)
- nuke bitmap and use nodemask routines
- dont error if some nodes offline (general idea is to
either do something useful and claim success, or do
nothing at all, and complain of error, but dont both
do something useful and complain.)
- convert to a single return, at bottom of function
- XXX Comment: doesn't this code require locking node_online_map?
- Remove unused 'started'
- Remove no longer used 'i'
- Remove no longer used 'errors'
- Replace 3 line bitop for loop with one line for_each_node_mask
- Replace 15 lines of 'validity checking' with one line check
for node being online

9) Comment - dont we need to protect the kernel global variable
toss_page_cache_nodes from simulaneous access by two tasks?

Index: 2.6.11-rc4/include/linux/sysctl.h
===================================================================
--- 2.6.11-rc4.orig/include/linux/sysctl.h 2005-02-14 18:26:28.000000000 -0800
+++ 2.6.11-rc4/include/linux/sysctl.h 2005-02-14 18:27:31.000000000 -0800
@@ -803,6 +803,7 @@ extern int proc_doulongvec_ms_jiffies_mi
struct file *, void __user *, size_t *, loff_t *);
extern int proc_dobitmap_list(ctl_table *table, int, struct file *,
void __user *, size_t *, loff_t *);
+
extern int do_sysctl (int __user *name, int nlen,
void __user *oldval, size_t __user *oldlenp,
void __user *newval, size_t newlen);
Index: 2.6.11-rc4/kernel/sysctl.c
===================================================================
--- 2.6.11-rc4.orig/kernel/sysctl.c 2005-02-14 18:26:28.000000000 -0800
+++ 2.6.11-rc4/kernel/sysctl.c 2005-02-14 18:27:46.000000000 -0800
@@ -42,7 +42,6 @@
#include <linux/dcache.h>
#include <linux/syscalls.h>
#include <linux/bitmap.h>
-#include <linux/bitops.h>
#include <linux/nodemask.h>

#include <asm/uaccess.h>
@@ -839,6 +838,8 @@ static ctl_table vm_table[] = {
.ctl_name = VM_TOSS_PAGE_CACHE_NODES,
.procname = "toss_page_cache_nodes",
.data = &toss_page_cache_nodes,
+/* XXX Should this be the length of the binary nodemask,
+ or of its ascii representation? */
.maxlen = sizeof(nodemask_t),
.mode = 0644,
.proc_handler = &proc_do_toss_page_cache_nodes,
@@ -1993,6 +1994,9 @@ int proc_dobitmap_list(ctl_table *table,
if (write) {
if (!table->maxlen || !table->data)
return -EPERM;
+/* XXX If this *lenp is direct from user space, it needs to be
+ * bounded to avoid a denial of service attack - asking for
+ * a huge buffer. */
if ((buff = kmalloc(*lenp + 1, GFP_KERNEL)) == 0)
return -ENOMEM;
if (copy_from_user(buff, buffer, *lenp))
@@ -2005,8 +2009,14 @@ int proc_dobitmap_list(ctl_table *table,
} else {
if (!table->maxlen || !table->data)
return -EPERM;
+/* XXX The following requirement seems draconian. Shouldn't this
+ * buffer be allocated on the open, read using normal f_pos
+ * arithmetic, and free'd on the close? */
/* we require the user to read the string in one operation */
if (filp->f_pos == 0) {
+/* XXX If this *lenp is direct from user space, it needs to be
+ * bounded to avoid a denial of service attack - asking for
+ * a huge buffer. */
if ((buff = kmalloc(*lenp, GFP_KERNEL)) == 0)
return -ENOMEM;
retval = bitmap_scnlistprintf(buff, (*lenp)-1,
@@ -2085,6 +2095,7 @@ int proc_doulongvec_ms_jiffies_minmax(ct
return -ENOSYS;
}

+
#endif /* CONFIG_PROC_FS */

Index: 2.6.11-rc4/mm/vmscan.c
===================================================================
--- 2.6.11-rc4.orig/mm/vmscan.c 2005-02-14 18:26:28.000000000 -0800
+++ 2.6.11-rc4/mm/vmscan.c 2005-02-14 18:41:07.000000000 -0800
@@ -904,7 +904,7 @@ void toss_page_cache_pages_node(int node
return;
}

-static atomic_t toss_done;
+static atomic_t num_toss_threads_active;

int toss_pages_thread(void *arg)
{
@@ -912,10 +912,11 @@ int toss_pages_thread(void *arg)

if (node_online(node))
toss_page_cache_pages_node(node);
- atomic_inc(&toss_done);
+ atomic_dec(&num_toss_threads_active);
do_exit(0);
}

+/* XXX What protects toss_page_cache_nodes from simultaneous access? */
nodemask_t toss_page_cache_nodes = NODE_MASK_NONE;

/*
@@ -927,68 +928,39 @@ nodemask_t toss_page_cache_nodes = NODE_
int proc_do_toss_page_cache_nodes(ctl_table *table, int write, struct file *filep,
void __user *buffer, size_t *lenp, loff_t *ppos)
{
- int i, errors=0;
- int retval, node, started, nodes_to_toss;
- /*
- * grumble. so many bitmap routines, so many different types,
- * such a fussy C-compiler that likes to warn you about these.
- */
- union {
- volatile void *vv;
- const unsigned long *cu;
- } bitmap;
-
- retval = proc_dobitmap_list(table, write, filep, buffer, lenp, ppos);
- if (retval < 0)
- return retval;
+ int ret, node;
+
+ ret = proc_dobitmap_list(table, write, filep, buffer, lenp, ppos);
+ if (ret < 0)
+ goto done;
+ ret = 0;

if (!write)
- return 0;
+ goto done;

- /* do some validity checking */
- bitmap.vv = (volatile void *) &toss_page_cache_nodes;
- for (i = 0; i < num_online_nodes(); i++) {
- if (test_bit(i, bitmap.vv) && !node_online(i)) {
- errors++;
- clear_bit(i, bitmap.vv);
- }
- }
- if (num_online_nodes() < MAX_NUMNODES) {
- for (i = num_online_nodes(); i < MAX_NUMNODES; i++) {
- if (test_bit(i, bitmap.vv)) {
- errors++;
- clear_bit(i, bitmap.vv);
- }
- }
+/* XXX Dont we need to lock node_online_map here somehow? */
+ /* if no online nodes specified - error */
+ if (!nodes_intersect(toss_page_cache_nodes, node_online_map)) {
+ ret = -EINVAL;
+ goto done;
}

/* create kernel threads to go toss the page cache pages */
- atomic_set(&toss_done, 0);
- nodes_to_toss = bitmap_weight(bitmap.cu, MAX_NUMNODES);
- started = 0;
- for (node = find_first_bit(bitmap.cu, MAX_NUMNODES);
- node < MAX_NUMNODES;
- node = find_next_bit(bitmap.cu, MAX_NUMNODES, node+1)) {
- kthread_run(&toss_pages_thread, (void *)(long)node,
- "toss_thread");
- started++;
+ atomic_set(&num_toss_threads_active, 0);
+ for_each_node_mask(node, toss_page_cache_nodes) {
+ if (node_online(node)) {
+ kthread_run(&toss_pages_thread, node, "toss_thread");
+ atomic_inc(&num_toss_threads_active);
+ }
}

/* wait for the kernel threads to complete */
- while (atomic_read(&toss_done) < nodes_to_toss) {
+ while (atomic_read(&num_toss_threads_active) > 0) {
__set_current_state(TASK_INTERRUPTIBLE);
schedule_timeout(10);
}
-
- /*
- * if user just did "echo 0-31 > /proc/sys/vm/toss_page_cache_nodes"
- * but fewer nodes than that exist, there is no good way to get the error
- * code back to them anyway, so let the page cache toss occur
- * on nodes that do exist and return -EINVAL afterwards.
- */
- if (errors)
- return -EINVAL;
- return 0;
+done:
+ return ret;
}
#endif

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373, 1.925.600.0401

2005-02-16 19:56:09

by Martin Hicks

[permalink] [raw]

Subject: Re: [PATCH/RFC] A method for clearing out page cache

On Mon, Feb 14, 2005 at 07:37:04PM -0800, Paul Jackson wrote:
> Questions concerning this page cache patch that Martin submitted,
> as a merge of something originally written by Ray Bryant.
>
> The following patch is not really a patch. It is a few questions, a
> couple minor space tweaks, and a never compiled nor tested rewrite of
> proc_do_toss_page_cache_nodes() to try to make it look a little
> prettier.

Thanks for the review Paul. I'll take a harder look at your feedback
and reply.

--
Martin Hicks || Silicon Graphics Inc. || [email protected]

2005-02-21 19:33:39

by Martin Hicks

[permalink] [raw]

Subject: Re: [PATCH/RFC] A method for clearing out page cache

Hi,

I've made a bunch of changes that Paul suggested. I've also responded
to his concerns further down. Paul correctly pointed out that this
patch uses some helper functions that are part of the cpusets patch. I
should have mentioned this before.

The major changes are:

- Cleanup proc_dobitmask_list() a bit more, including adding bounds
checking on *lenp.

- An important bugfix in vmscan.c around line 390. Go to the
keep_locked label, not the "keep" label.

- Add locking in proc_do_toss_page_cache_nodes() to protect the global
nodemask_t from getting corrupted.

- Change a few functions to "static"

- Paul Jackson's suggested changes to greatly simplify
proc_do_toss_page_cache_nodes()

The patch is inlined at the end of the mail.

On Mon, Feb 14, 2005 at 07:37:04PM -0800, Paul Jackson wrote:
>
> 1) A couple of kmalloc's are done using lengths that
> so far as I could tell, came straight from user land.

Okay, I've stuck in maximums that are based on MAX_NUMNODES.

>
> 2) Beware that this patch depends on the cpuset patch:
> new-bitmap-list-format-for-cpusets.patch
> which is still in *-mm only, for the routines
> bitmap_scnlistprintf/bitmap_parselist.

Thanks. I hadn't realized that.

> 3) Should the maxlen of a nodemask for the sysctl
> handler for proc_do_toss_page_cache_nodes be the byte
> length of the kernels internal binary nodemask, or

It is the byte length of the kernel's bitmask struct.

> 5) The requirement to read the string in one read(2) syscall
> seemed like it might be draconian. If the available

But that's the way the rest of the sysctl read functions work. There's
no safe way that I can see to ensure that the data doesn't change in
between two consecutive read calls.

> 9) Comment - dont we need to protect the kernel global variable
> toss_page_cache_nodes from simulaneous access by two tasks?

yes, I protected this with a semaphore.

mh

--
Martin Hicks Wild Open Source Inc.
[email protected] 613-266-2296

This patch introduces a new sysctl for NUMA systems that tries to drop
as much of the page cache as possible from a set of nodes. The
motivation for this patch is for setting up High Performance Computing
jobs, where initial memory placement is very important to overall
performance.

Signed-off-by: Martin Hicks <[email protected]>
Signed-off-by: Ray Bryant <[email protected]>

[mort@tomahawk patches]$ diffstat toss_page_cache_nodes_v2.patch
include/linux/sysctl.h | 3 +
kernel/sysctl.c | 95 ++++++++++++++++++++++++++++++++++++++++++++
mm/vmscan.c | 105 ++++++++++++++++++++++++++++++++++++++++++++++++-
3 files changed, 201 insertions(+), 2 deletions(-)

Index: linux-2.6.10/include/linux/sysctl.h
===================================================================
--- linux-2.6.10.orig/include/linux/sysctl.h 2005-02-16 12:43:19.000000000 -0800
+++ linux-2.6.10/include/linux/sysctl.h 2005-02-19 10:36:41.000000000 -0800
@@ -170,6 +170,7 @@
VM_VFS_CACHE_PRESSURE=26, /* dcache/icache reclaim pressure */
VM_LEGACY_VA_LAYOUT=27, /* legacy/compatibility virtual address space layout */
VM_SWAP_TOKEN_TIMEOUT=28, /* default time for token time out */
+ VM_TOSS_PAGE_CACHE_NODES=29, /* nodemask_t: nodes to free page cache on */
};

@@ -803,6 +804,8 @@
void __user *, size_t *, loff_t *);
extern int proc_doulongvec_ms_jiffies_minmax(ctl_table *table, int,
struct file *, void __user *, size_t *, loff_t *);
+extern int proc_dobitmap_list(ctl_table *table, int, struct file *,
+ void __user *, size_t *, loff_t *);

extern int do_sysctl (int __user *name, int nlen,
void __user *oldval, size_t __user *oldlenp,
Index: linux-2.6.10/kernel/sysctl.c
===================================================================
--- linux-2.6.10.orig/kernel/sysctl.c 2005-02-16 12:43:19.000000000 -0800
+++ linux-2.6.10/kernel/sysctl.c 2005-02-21 10:49:18.000000000 -0800
@@ -41,6 +41,8 @@
#include <linux/limits.h>
#include <linux/dcache.h>
#include <linux/syscalls.h>
+#include <linux/bitmap.h>
+#include <linux/nodemask.h>

#include <asm/uaccess.h>
#include <asm/processor.h>
@@ -72,6 +74,12 @@
void __user *, size_t *, loff_t *);
#endif

+#ifdef CONFIG_NUMA
+extern nodemask_t toss_page_cache_nodes;
+extern int proc_do_toss_page_cache_nodes(ctl_table *, int, struct file *,
+ void __user *, size_t *, loff_t *);
+#endif
+
/* this is needed for the proc_dointvec_minmax for [fs_]overflow UID and GID */
static int maxolduid = 65535;
static int minolduid;
@@ -836,6 +844,16 @@
.strategy = &sysctl_jiffies,
},
#endif
+#ifdef CONFIG_NUMA
+ {
+ .ctl_name = VM_TOSS_PAGE_CACHE_NODES,
+ .procname = "toss_page_cache_nodes",
+ .data = &toss_page_cache_nodes,
+ .maxlen = sizeof(nodemask_t),
+ .mode = 0644,
+ .proc_handler = &proc_do_toss_page_cache_nodes,
+ },
+#endif
{ .ctl_name = 0 }
};

@@ -2071,6 +2089,83 @@
do_proc_dointvec_userhz_jiffies_conv,NULL);
}

+/**
+ * proc_dobitmap_list -- read/write a bitmap list in ascii format
+ * @table: the sysctl table
+ * @write: %TRUE if this is a write to the sysctl file
+ * @filp: the file structure
+ * @buffer: the user buffer
+ * @lenp: the size of the user buffer
+ *
+ * Reads/writes a bitmap specified in "list" format. That is
+ * reads a list of comma separated items, where each item is
+ * either a bit number, or a range of bit numbers separated by
+ * a "-". E. g., 3,5-12,14. Converts this to a bitmap where
+ * each of the specified bits is set.
+ *
+ * Restrictions: user is required to read output string in one
+ * read operation.
+ *
+ * If the bitmap is being read by the user process, it is copied
+ * and a newline '\n' is added. It is truncated if the buffer is
+ * not large enough.
+ *
+ * Returns 0 on success.
+ */
+int proc_dobitmap_list(ctl_table *table, int write, struct file *filp,
+ void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+ char *buf;
+ int retval;
+ int len;
+
+ if (!table->maxlen || !table->data || !*lenp ||
+ (*ppos && !write)) {
+ *lenp = 0;
+ return 0;
+ }
+
+ if (write) {
+ /* This is a generous upper bound on the size of the
+ * ascii text input length */
+ if (*lenp > 4*MAX_NUMNODES)
+ return -EMSGSIZE;
+ len = *lenp + 1;
+ } else {
+ len = 4*MAX_NUMNODES;
+ }
+
+ buf = kmalloc(len, GFP_KERNEL);
+ if (!buf)
+ return -ENOMEM;
+
+ if (write) {
+ if (copy_from_user(buf, buffer, *lenp))
+ return -EFAULT;
+ buf[len] = 0; /* nul-terminate */
+ retval = bitmap_parselist(buf, (unsigned long *)table->data,
+ table->maxlen*sizeof(unsigned long));
+ } else {
+ /* parse the bitmap but leave room for '\n' */
+ len = bitmap_scnlistprintf(buf, len-1,
+ (const unsigned long *) table->data,
+ table->maxlen*sizeof(unsigned long));
+ /* Insert '\n' if we have something to return */
+ if (len)
+ buf[len++] = '\n';
+ /* the length requested is less than what's available */
+ if (*lenp < len)
+ len = *lenp;
+ if (copy_to_user(buffer, buf, len))
+ return -EFAULT;
+ *ppos += len;
+ *lenp = len;
+ retval = 0;
+ }
+ kfree(buf);
+ return retval;
+}
+
#else /* CONFIG_PROC_FS */

int proc_dostring(ctl_table *table, int write, struct file *filp,
Index: linux-2.6.10/mm/vmscan.c
===================================================================
--- linux-2.6.10.orig/mm/vmscan.c 2005-02-16 12:43:19.000000000 -0800
+++ linux-2.6.10/mm/vmscan.c 2005-02-21 10:39:36.000000000 -0800
@@ -33,6 +33,8 @@
#include <linux/cpuset.h>
#include <linux/notifier.h>
#include <linux/rwsem.h>
+#include <linux/sysctl.h>
+#include <linux/kthread.h>

#include <asm/tlbflush.h>
#include <asm/div64.h>
@@ -66,6 +68,9 @@
/* How many pages shrink_cache() should reclaim */
int nr_to_reclaim;

+ /* Can we reclaim mapped pages? */
+ int may_reclaim_mapped;
+
/* Ask shrink_caches, or shrink_zone to scan at this priority */
unsigned int priority;

@@ -384,6 +389,9 @@
if (page_mapped(page) || PageSwapCache(page))
sc->nr_scanned++;

+ if (page_mapped(page) && !sc->may_reclaim_mapped)
+ goto keep_locked;
+
if (PageWriteback(page))
goto keep_locked;

@@ -725,7 +733,7 @@
* Now use this metric to decide whether to start moving mapped memory
* onto the inactive list.
*/
- if (swap_tendency >= 100)
+ if (swap_tendency >= 100 && sc->may_reclaim_mapped)
reclaim_mapped = 1;

while (!list_empty(&l_hold)) {
@@ -889,7 +897,98 @@
shrink_zone(zone, sc);
}
}
-
+
+#ifdef CONFIG_NUMA
+/*
+ * Scan this node and release all clean page cache pages
+ */
+static void toss_page_cache_pages_node(int node)
+{
+ int i;
+ struct scan_control sc;
+ struct zone *z;
+
+ sc.gfp_mask = 0;
+ sc.may_reclaim_mapped = 0;
+ sc.priority = DEF_PRIORITY;
+ for (i = 0; i < MAX_NR_ZONES; i++) {
+ z = &NODE_DATA(node)->node_zones[i];
+ if (!z->present_pages)
+ continue;
+ sc.nr_to_scan = z->nr_active + z->nr_inactive;
+ sc.nr_mapped = read_page_state(nr_mapped);
+ sc.nr_scanned = 0;
+ sc.nr_reclaimed = 0;
+
+ refill_inactive_zone(z, &sc);
+ shrink_cache(z, &sc);
+ }
+ return;
+}
+
+static DECLARE_MUTEX(toss_page_cache_nodes_sem);
+nodemask_t toss_page_cache_nodes;
+static atomic_t num_toss_threads_active;
+
+static int toss_pages_thread(void *arg)
+{
+ int node = (int)(long)arg;
+
+ if (node_online(node))
+ toss_page_cache_pages_node(node);
+ atomic_dec(&num_toss_threads_active);
+ do_exit(0);
+}
+
+/*
+ * wrapper routine for proc_dobitmap_list that also calls
+ * toss_page_cache_pages_node() for each node set in bitmap
+ * (when called for a write operation). Read operations
+ * don't do anything beyond what proc_dobitmap_list() does.
+ */
+int proc_do_toss_page_cache_nodes(ctl_table *table, int write, struct file *filep,
+ void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+ int ret, node;
+
+ down_interruptible(&toss_page_cache_nodes_sem);
+ ret = proc_dobitmap_list(table, write, filep, buffer, lenp, ppos);
+ if (ret < 0)
+ goto done;
+ ret = 0;
+
+ if (!write)
+ goto done;
+
+ /* if no online nodes specified - error */
+ if (!nodes_intersects(toss_page_cache_nodes, node_online_map)) {
+ ret = -EINVAL;
+ goto done;
+ }
+
+ /* create kernel threads to go toss the page cache pages */
+ atomic_set(&num_toss_threads_active, 0);
+ for_each_node_mask(node, toss_page_cache_nodes) {
+ if (node_online(node)) {
+ kthread_run(&toss_pages_thread, (void *)(long)node,
+ "toss_thread");
+ atomic_inc(&num_toss_threads_active);
+ } else {
+ node_clear(node, toss_page_cache_nodes);
+ }
+ }
+
+ /* wait for the kernel threads to complete */
+ while (atomic_read(&num_toss_threads_active) > 0) {
+ __set_current_state(TASK_INTERRUPTIBLE);
+ schedule_timeout(10);
+ }
+done:
+ up(&toss_page_cache_nodes_sem);
+ return ret;
+}
+#endif
+
/*
* This is the main entry point to direct page reclaim.
*
@@ -915,6 +1014,7 @@
int i;

sc.gfp_mask = gfp_mask;
+ sc.may_reclaim_mapped = 1;
sc.may_writepage = 0;

inc_page_state(allocstall);
@@ -1014,6 +1114,7 @@
total_scanned = 0;
total_reclaimed = 0;
sc.gfp_mask = GFP_KERNEL;
+ sc.may_reclaim_mapped = 1;
sc.may_writepage = 0;
sc.nr_mapped = read_page_state(nr_mapped);

2005-02-21 21:42:40

[permalink] [raw]

Subject: Re: [PATCH/RFC] A method for clearing out page cache

Martin Hicks <[email protected]> wrote:
>
> This patch introduces a new sysctl for NUMA systems that tries to drop
> as much of the page cache as possible from a set of nodes. The
> motivation for this patch is for setting up High Performance Computing
> jobs, where initial memory placement is very important to overall
> performance.

- Using a write to /proc for this seems a bit hacky. Why not simply add
a new system call for it?

- Starting a kernel thread for each node might be overkill. Yes, it
would take longer if one process was to do all the work, but does this
operation need to be very fast?

If it does, then userspace could arrange for that concurrency by
starting a number of processes to perform the toss, each with a different
nodemask.

- Dropping "as much pagecache as possible" might be a bit crude. I
wonder if we should pass in some additional parameter which specifies how
much of the node's pagecache should be removed.

Or, better, specify how much free memory we will actually require on
this node. The syscall terminates when it determines that enough
pagecache has been removed.

- To make the syscall more general, we should be able to reclaim mapped
pagecache and anonymous memory as well.

So what it comes down to is

sys_free_node_memory(long node_id, long pages_to_make_free, long what_to_free)

where `what_to_free' consists of a bunch of bitflags (unmapped pagecache,
mapped pagecache, anonymous memory, slab, ...).

2005-02-21 21:52:59

by Nish Aravamudan

[permalink] [raw]

Subject: Re: [PATCH/RFC] A method for clearing out page cache

On Mon, 21 Feb 2005 14:27:21 -0500, Martin Hicks
<[email protected]> wrote:
>
> Hi,
>
> I've made a bunch of changes that Paul suggested. I've also responded
> to his concerns further down. Paul correctly pointed out that this
> patch uses some helper functions that are part of the cpusets patch. I
> should have mentioned this before.

<snip>

> This patch introduces a new sysctl for NUMA systems that tries to drop
> as much of the page cache as possible from a set of nodes. The
> motivation for this patch is for setting up High Performance Computing
> jobs, where initial memory placement is very important to overall
> performance.

<snip>

> + /* wait for the kernel threads to complete */
> + while (atomic_read(&num_toss_threads_active) > 0) {
> + __set_current_state(TASK_INTERRUPTIBLE);
> + schedule_timeout(10);
> + }

<snip>

Would it be possible to use msleep_interruptible() here? Or is it a
strict check every 10 ticks, regardless of HZ? Could a comment be
inserted indicating
which is the case?

Thanks,
Nish

2005-02-21 22:13:53

[permalink] [raw]

Subject: Re: [PATCH/RFC] A method for clearing out page cache

Andrew wrote:
> sys_free_node_memory(long node_id, long pages_to_make_free, long what_to_free)
> ...
> - To make the syscall more general, we should be able to reclaim mapped
> pagecache and anonymous memory as well.

sys_free_node_memory() - nice.

Does it make sense to also have it be able to free up slab cache,
calling shrink_slab()?

Did you mean to pass a nodemask, or a single node id? Passing a single
node id is easier - we've shown that it is difficult to pass bitmaps
across the user/kernel boundary without confusions. But if only a
single node id is passed, then you get the thread per node that you just
argued was sometimes overkill.

I'd prefer the single node id, because it's easier to get right.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373, 1.925.600.0401

2005-02-21 22:23:51

[permalink] [raw]

Subject: Re: [PATCH/RFC] A method for clearing out page cache

Andrew Morton wrote:
> Martin Hicks <[email protected]> wrote:
>
>>This patch introduces a new sysctl for NUMA systems that tries to drop
>> as much of the page cache as possible from a set of nodes. The
>> motivation for this patch is for setting up High Performance Computing
>> jobs, where initial memory placement is very important to overall
>> performance.
>
>
> - Using a write to /proc for this seems a bit hacky. Why not simply add
> a new system call for it?
>

We did it this way because it was easier to get it into SLES9 that way.
But there is no particular reason that we couldn't use a system call.
It's just that we figured adding system calls is hard.

> - Starting a kernel thread for each node might be overkill. Yes, it
> would take longer if one process was to do all the work, but does this
> operation need to be very fast?
>

It is possible that this call might need to be executed at the start of
each batch job in the system. The reason for using a kernel thread was
that there was no good way to start concurrency due to a write to /proc.

> If it does, then userspace could arrange for that concurrency by
> starting a number of processes to perform the toss, each with a different
> nodemask.
>

That works fine as well if we can get a system call number assigned and
avoids the hackiness of both /proc and the kernel threads.

> - Dropping "as much pagecache as possible" might be a bit crude. I
> wonder if we should pass in some additional parameter which specifies how
> much of the node's pagecache should be removed.
>
> Or, better, specify how much free memory we will actually require on
> this node. The syscall terminates when it determines that enough
> pagecache has been removed.

Our thoughts exactly. This is clearly a "big hammer" and we want to
make a lighter hammer to free up a certain number of pages. Indeed,
we would like to have these calls occur automatically from __alloc_pages()
when we try to allocate local storage and find that there isn't any.
For our workloads, we want to free up unmapped, clean pagecache, if that
is what is keeping us from allocating a local page. Not all workloads
want that, however, so we would probably use a sysctl() to enable/disable
this.

However, the first step is to do this manually from user space.

>
> - To make the syscall more general, we should be able to reclaim mapped
> pagecache and anonymous memory as well.
>
>
> So what it comes down to is
>
> sys_free_node_memory(long node_id, long pages_to_make_free, long what_to_free)
>
> where `what_to_free' consists of a bunch of bitflags (unmapped pagecache,
> mapped pagecache, anonymous memory, slab, ...).

Do we have to implement all of those or just allow for the possibility of that
being implemented in the future? E. g. in our case we'd just implement the
bit that says "unmapped pagecache".

> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>

--
Best Regards,
Ray
-----------------------------------------------
Ray Bryant
512-453-9679 (work) 512-507-7807 (cell)
[email protected] [email protected]
The box said: "Requires Windows 98 or better",
so I installed Linux.
-----------------------------------------------

2005-02-21 22:28:28

[permalink] [raw]

Subject: Re: [PATCH/RFC] A method for clearing out page cache

Paul Jackson <[email protected]> wrote:
>
> Andrew wrote:
> > sys_free_node_memory(long node_id, long pages_to_make_free, long what_to_free)
> > ...
> > - To make the syscall more general, we should be able to reclaim mapped
> > pagecache and anonymous memory as well.
>
> sys_free_node_memory() - nice.
>
> Does it make sense to also have it be able to free up slab cache,
> calling shrink_slab()?

Yes, I suggested that slab be one of the `what_to_free' flags. (Some of
this may be tricky to implement. But a good interface with an
initially-crappy implementation is OK ;)

> Did you mean to pass a nodemask, or a single node id? Passing a single
> node id is easier - we've shown that it is difficult to pass bitmaps
> across the user/kernel boundary without confusions. But if only a
> single node id is passed, then you get the thread per node that you just
> argued was sometimes overkill.

I meant a single node ID. With a bitmap, the kernel needs to futz around
scanning the bitmap, launching kernel threads, etc.

I'm proposing that there be no kernel threads at all. If you have four nodes:

for i in 0 1 2 3
do
call-sys_free_node_memory $i -1 -1 &
done

> I'd prefer the single node id, because it's easier to get right.

yup.

2005-02-21 22:41:35

[permalink] [raw]

Subject: Re: [PATCH/RFC] A method for clearing out page cache

Ray Bryant <[email protected]> wrote:
>
> Andrew Morton wrote:
> > Martin Hicks <[email protected]> wrote:
> >
> >>This patch introduces a new sysctl for NUMA systems that tries to drop
> >> as much of the page cache as possible from a set of nodes. The
> >> motivation for this patch is for setting up High Performance Computing
> >> jobs, where initial memory placement is very important to overall
> >> performance.
> >
> >
> > - Using a write to /proc for this seems a bit hacky. Why not simply add
> > a new system call for it?
> >
>
> We did it this way because it was easier to get it into SLES9 that way.
> But there is no particular reason that we couldn't use a system call.
> It's just that we figured adding system calls is hard.

aarggh. This is why you should target kernel.org kernels first. Now we
risk ending up with poor old suse carrying an obsolete interface and
application developers have to be able to cater for both interfaces.

> > If it does, then userspace could arrange for that concurrency by
> > starting a number of processes to perform the toss, each with a different
> > nodemask.
> >
>
> That works fine as well if we can get a system call number assigned and
> avoids the hackiness of both /proc and the kernel threads.

syscall numbers are per-arch. We don't need to assign a syscall number for
this one - we can surely have this ready for 2.6.12. Simply include i386
and ia64 in the initial patch and other architectures will catch up pretty
quickly. (It would be nice to generate patches for the arch maintainers,
however).

> > - Dropping "as much pagecache as possible" might be a bit crude. I
> > wonder if we should pass in some additional parameter which specifies how
> > much of the node's pagecache should be removed.
> >
> > Or, better, specify how much free memory we will actually require on
> > this node. The syscall terminates when it determines that enough
> > pagecache has been removed.
>
> Our thoughts exactly. This is clearly a "big hammer" and we want to
> make a lighter hammer to free up a certain number of pages. Indeed,
> we would like to have these calls occur automatically from __alloc_pages()
> when we try to allocate local storage and find that there isn't any.
> For our workloads, we want to free up unmapped, clean pagecache, if that
> is what is keeping us from allocating a local page. Not all workloads
> want that, however, so we would probably use a sysctl() to enable/disable
> this.
>
> However, the first step is to do this manually from user space.

Yup. The thing is, lots of people want this feature for various reasons.
Not just numerical-computing-users-on-NUMA. We should get it right for
them too.

Especially kernel developers, who have various nasty userspace tools which
will manually reclaim pagecache. But non-kernel-developers will use it
too, when they think the VM is screwing them over ;)

I think Solaris used to have such a tool - /usr/etc/chill, although I
don't know if it had kernel support.

> >
> > - To make the syscall more general, we should be able to reclaim mapped
> > pagecache and anonymous memory as well.
> >
> >
> > So what it comes down to is
> >
> > sys_free_node_memory(long node_id, long pages_to_make_free, long what_to_free)
> >
> > where `what_to_free' consists of a bunch of bitflags (unmapped pagecache,
> > mapped pagecache, anonymous memory, slab, ...).
>
> Do we have to implement all of those or just allow for the possibility of that
> being implemented in the future? E. g. in our case we'd just implement the
> bit that says "unmapped pagecache".

Well... please take a look at what's involved. It should just be a matter
of sprinkling a few test such as

+ if (sc->mode & SC_RECLAIM_SLAB) {
...
+ }

into the existing code. If things turn nasty then we can take another look
at it.

2005-02-21 22:56:34

[permalink] [raw]

Subject: Re: [PATCH/RFC] A method for clearing out page cache

Andrew Morton wrote:
> Ray Bryant <[email protected]> wrote:

>>
>>We did it this way because it was easier to get it into SLES9 that way.
>>But there is no particular reason that we couldn't use a system call.
>>It's just that we figured adding system calls is hard.
>
>
> aarggh. This is why you should target kernel.org kernels first. Now we
> risk ending up with poor old suse carrying an obsolete interface and
> application developers have to be able to cater for both interfaces.
>

I agree, but time-to-market decisions overrode that. Anyway, everyone
uses a program called "bcfree" to actually do the buffer-cache freeing,
so changing the interface is not as bad as all that.

Let us put something together along these lines and we will get back to you.

Thanks,
--
Best Regards,
Ray
-----------------------------------------------
Ray Bryant
512-453-9679 (work) 512-507-7807 (cell)
[email protected] [email protected]
The box said: "Requires Windows 98 or better",
so I installed Linux.
-----------------------------------------------

2005-02-21 23:53:31

[permalink] [raw]

Subject: Re: [PATCH/RFC] A method for clearing out page cache

Andrew wrote:
> Yes, I ... [clarifies pj's various confusions]

Yup - all sounds good - thanks.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373, 1.925.600.0401

2005-02-22 07:53:23

by Ingo Molnar

[permalink] [raw]

Subject: Re: [PATCH/RFC] A method for clearing out page cache

* Andrew Morton <[email protected]> wrote:

> > However, the first step is to do this manually from user space.
>
> Yup. The thing is, lots of people want this feature for various
> reasons. Not just numerical-computing-users-on-NUMA. We should get
> it right for them too.
>
> Especially kernel developers, who have various nasty userspace tools
> which will manually reclaim pagecache. But non-kernel-developers will
> use it too, when they think the VM is screwing them over ;)

app designers very frequently think that the VM gets its act wrong (most
of the time for the wrong reasons), and the last thing we want to enable
them is to hack real problems around. How are we supposed to debug VM
problems where one player periodically flushes the whole pagecache? If
that flushing, when disabled, 'results in the app being broken' (_if_
the app gives any option to disable the flushing). Providing APIs to
flush system caches, sysctl or syscall, is the road to VM madness.

If the goal is to override the pagecache (and other kernel caches) on a
given node then for God's sake, think a bit harder. E.g. enable users to
specify an 'allocation priority' of some sort, which kicks out the
pagecache on the local node - or something like that. Giving a
half-assed tool to clean out one aspect of the system caches will only
muddy the waters, with no real road back to sanity.

Ingo

2005-02-22 08:07:39

[permalink] [raw]

Subject: Re: [PATCH/RFC] A method for clearing out page cache

Ingo Molnar <[email protected]> wrote:
>
> app designers very frequently think that the VM gets its act wrong (most
> of the time for the wrong reasons), and the last thing we want to enable
> them is to hack real problems around.

Not really. Memory reclaim tries to predict the future and expects some
sort of "average" workload. For some workloads that prediction is
hopelessly wrong. Although we could surely provide manual hinting
machinery which is less crude than this proposal.

> . enable users to
> specify an 'allocation priority' of some sort, which kicks out the
> pagecache on the local node - or something like that.

Yes, that would be preferable - I don't know what the difficulty is with
that. sys_set_mempolicy() should provide a sufficiently good hint.

2005-02-22 08:26:35

by Ingo Molnar

[permalink] [raw]

Subject: Re: [PATCH/RFC] A method for clearing out page cache

* Andrew Morton <[email protected]> wrote:

> > . enable users to
> > specify an 'allocation priority' of some sort, which kicks out the
> > pagecache on the local node - or something like that.
>
> Yes, that would be preferable - I don't know what the difficulty is
> with that. sys_set_mempolicy() should provide a sufficiently good
> hint.

yes. I'm not against some flushing mechanism for debugging or test
purposes (it can be useful to start from a new, clean state - and as
such the sysctl for root only and depending on KERNEL_DEBUG is probably
better than an explicit syscall), but the idea to give a flushing API to
applications is bad i believe.

It is the 'easy and incorrect path' to a number of NUMA (and non-NUMA)
VM problems and i fear that it will destroy the evolution of VM
priority/placement/affinity APIs (NUMAlib, etc.).

At least making it sufficiently painful to use (via the originally
proposed root-only sysctl) could still preserve some of the incentive to
provide a clean solution for applications. 'Time to market' constraints
should not be considered when adding core mechanisms.

Ingo

2005-02-22 11:28:08

[permalink] [raw]

Subject: Re: [PATCH/RFC] A method for clearing out page cache

Ingo wrote:
> app designers very frequently think that the VM gets its act wrong (most
> of the time for the wrong reasons),

As Martin wrote, when he submitted this patch:
> The motivation for this patch is for setting up High Performance
> Computing jobs, where initial memory placement is very important to
> overall performance.

Any left over cache is wrong, for this situation. The only right
answer, no fault of the VM that it can't predict such, is to clear the
past cache and ensure that all allocations are satisfied with node-local
memory, and no page out delays, for all the threads in such tightly
coupled jobs. These jobs have been sized to use every ounce of CPU and
Memory from sometimes hundreds of nodes, and for hours or days, using
tightly coupled MPI and OpenMP codes. Any misplaced pages (off the
local node) or paging delays quickly leads to erratic and reduced
performance.

Flushing all the cache like this hurts any normal load that has any
continuity of working set, and such flushing is not cheap. I have not
observed much interest in doing this, outside of appropriate use when
starting up a big HPC app, as described above, or the test and debug
situations that you mention. For certain HPC apps, it can be essential
to repeatable job performance.

Granted, this might not be for most systems. Perhaps a CONFIG option,
so that by default this worked on builds for big honkin numa boxes, but
was an -ENOSYS error on ordinary sized systems? Though I prefer not to
create artificial distinctions between configurations, without good
reason, perhaps this is such a reason.

Making the API ugly won't reduce its use much, rather just increase code
maintenance costs a bit, and breed a few more bugs. Those who think
they want this will find a way to do it. If something's worth doing,
it's worth doing cleanly.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373, 1.925.600.0401

2005-02-22 17:26:56

[permalink] [raw]

Subject: Re: [PATCH/RFC] A method for clearing out page cache

Ingo Molnar wrote:
> * Andrew Morton <[email protected]> wrote:
>
>
>>> . enable users to
>>> specify an 'allocation priority' of some sort, which kicks out the
>>> pagecache on the local node - or something like that.
>>
>>Yes, that would be preferable - I don't know what the difficulty is
>>with that. sys_set_mempolicy() should provide a sufficiently good
>>hint.
>
>
> yes. I'm not against some flushing mechanism for debugging or test
> purposes (it can be useful to start from a new, clean state - and as
> such the sysctl for root only and depending on KERNEL_DEBUG is probably
> better than an explicit syscall), but the idea to give a flushing API to
> applications is bad i believe.
>

We're pretty agnostic about this. I agree that if we were to make this
a system call, then it should be restricted to root. Or make it a
sysctl. Whichever way you guys want to go is fine with us.

> It is the 'easy and incorrect path' to a number of NUMA (and non-NUMA)
> VM problems and i fear that it will destroy the evolution of VM
> priority/placement/affinity APIs (NUMAlib, etc.).
>

I have two observations about this:

(1) It is our intent to use the infrastructure provided by this patch
as the basis for an automatic (i. e. included with the VM) approach
that selectively removes unused page cache pages before spilling
off node. We just figured it would be easier to get the
infrastructure in place first.

(2) If a sufficiently well behaved application knows in advance how
much free memory it needs per node, then it makes sense to provide
a mechanism for the application to request this, rather than for
the VM to try to puzzle this out later. Automatic algorithms in
the VM are never perfect; they should be reserved to work in those
cases where the application(s) either cooperate in such a way to
make memory demands impossible to predict, or the application
programmer can't (or can't take the time to) predict how much
memory the application will use.

> At least making it sufficiently painful to use (via the originally
> proposed root-only sysctl) could still preserve some of the incentive to
> provide a clean solution for applications. 'Time to market' constraints
> should not be considered when adding core mechanisms.
>
> Ingo
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>

--
Best Regards,
Ray
-----------------------------------------------
Ray Bryant
512-453-9679 (work) 512-507-7807 (cell)
[email protected] [email protected]
The box said: "Requires Windows 98 or better",
so I installed Linux.
-----------------------------------------------

2005-02-22 18:46:04

[permalink] [raw]

Subject: Re: [PATCH/RFC] A method for clearing out page cache

Paul Jackson <[email protected]> wrote:
>
> As Martin wrote, when he submitted this patch:
> > The motivation for this patch is for setting up High Performance
> > Computing jobs, where initial memory placement is very important to
> > overall performance.
>
> Any left over cache is wrong, for this situation.

So... Cannot the applicaiton remove all its pagecache with posix_fadvise()
prior to exitting?

2005-02-22 18:59:19

by Martin Hicks

[permalink] [raw]

Subject: Re: [PATCH/RFC] A method for clearing out page cache

On Tue, Feb 22, 2005 at 10:45:35AM -0800, Andrew Morton wrote:
> Paul Jackson <[email protected]> wrote:
> >
> > As Martin wrote, when he submitted this patch:
> > > The motivation for this patch is for setting up High Performance
> > > Computing jobs, where initial memory placement is very important to
> > > overall performance.
> >
> > Any left over cache is wrong, for this situation.
>
> So... Cannot the applicaiton remove all its pagecache with posix_fadvise()
> prior to exitting?

I think Paul's referring to pagecache (as well as other caches) that are
on the node from other uses, not necessarily another HPC job that has
recently terminated.

mh

--
Martin Hicks || Silicon Graphics Inc. || [email protected]

2005-02-22 19:01:26

[permalink] [raw]

Subject: Re: [PATCH/RFC] A method for clearing out page cache

Andrew Morton wrote:
> Paul Jackson <[email protected]> wrote:
>
>> As Martin wrote, when he submitted this patch:
>> > The motivation for this patch is for setting up High Performance
>> > Computing jobs, where initial memory placement is very important to
>> > overall performance.
>>
>> Any left over cache is wrong, for this situation.
>
>
> So... Cannot the applicaiton remove all its pagecache with posix_fadvise()
> prior to exitting?
>

Even if we modified all applications to do this, it still wouldn't help for
dirty page cache, which would eventually become cleaned, and hang around long
after the application has departed.

But the previous statement has a false hypothesis, namely, that we could
change all applications to do this.

--
Best Regards,
Ray
-----------------------------------------------
Ray Bryant
512-453-9679 (work) 512-507-7807 (cell)
[email protected] [email protected]
The box said: "Requires Windows 98 or better",
so I installed Linux.
-----------------------------------------------

2005-02-23 00:14:43