2005-03-20 01:51:31

by Paul Jackson

Subject: [Patch] cpusets policy kill no swap

Question:

Should I call oom_kill_process(), oom_kill_task(),
or __oom_kill_task(), when the current task decides
that it is better to die than to swap, so calls the
routine mm/oom_kill.c:oom_attempt_suicide() that this
patch adds, below?

My best guess is that either oom_kill_task() or
__oom_kill_task() is fine, depending on whether I
want to take down the rest of the tasks sharing the
offending task's mm.

Andrew will probably not want to pick up this patch, at least
until this question is answered.

Review and feedback are welcome.

===

This patch adds a boolean flag 'policy_kill_no_swap' to each
cpuset. If this flag is set in some cpuset, and if a task
attached to that cpuset tries to allocate a page of memory and
gets far enough in the __alloc_pages() code that the _next_
step would be to wake the swapper (call wakeup_kswapd) then
instead of that, the task is killed immediately.

For normal desktop usage, this makes almost no sense, and so
of course the default setting of 'policy_kill_no_swap' is false
(zero).

For certain HPC apps on big iron NUMA, this policy has been found
to be essential. In these cases, the app has been carefully
sized to fit, tightly, on the CPUs and Memory Nodes allowed
to it. The app may consume dozens or hundreds of nodes, using
up almost all the main memory on each node, running for hours
or days in very tight coupling. An app that would use
policy_kill_no_swap must not swap, for two reasons:

1) If it starts to swap, then the required performance is
hopelessly and totally lost. The customer requires the
application to come down at that point.

2) If it starts to swap, it starts to impact the performance
of other apps elsewhere on the system. This was much worse
on Linux 2.4 kernels, where the swapper was not well behaved.
But even on Linux 2.6 kernels, additional i/o and kernel
work is invoked, which has unpredictable impact on the
remaining system performance.

The implementation is simple enough. Each cpuset directory has
one more special file, 'policy_kill_no_swap', containing a zero
(false) or one (true). The default is false. The value is
inherited by newly created sub-cpusets.
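
As a rough illustration of how user code might set and read back such a flag file, here is a minimal user-space sketch. It is not part of the patch. The helper names cpuset_write_flag() and cpuset_read_flag() are made up for this example, and any path passed to them (for instance a policy_kill_no_swap file under a mounted cpuset directory) is an assumption that only holds on a kernel carrying this patch.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical helper: write "1" or "0" to a cpuset boolean flag file.
 * Returns 0 on success, -1 on failure.
 */
static int cpuset_write_flag(const char *path, int enabled)
{
	FILE *f;
	int ok;

	assert(path != NULL);
	f = fopen(path, "w");
	if (!f)
		return -1;
	ok = (fputs(enabled ? "1" : "0", f) >= 0);
	if (fclose(f) != 0)
		ok = 0;
	return ok ? 0 : -1;
}

/*
 * Hypothetical helper: read a cpuset boolean flag file.
 * Returns 0 or 1 for the flag value, -1 if the file cannot be
 * opened or does not start with '0' or '1'.
 */
static int cpuset_read_flag(const char *path)
{
	FILE *f;
	int c;

	assert(path != NULL);
	f = fopen(path, "r");
	if (!f)
		return -1;
	c = fgetc(f);
	fclose(f);
	if (c == '0')
		return 0;
	if (c == '1')
		return 1;
	return -1;
}
```

A caller would pass the full path of the flag file inside the cpuset directory of interest. Since the value is inherited only at creation time, setting it on a parent would not be expected to change already existing sub-cpusets.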

A hook is added to mm/page_alloc.c:__alloc_pages(), just before
the wakeup_kswapd() logic, which checks this flag in the current
task's cpuset, and kills the process if the flag is set, with
an explanatory printk. A new routine, oom_attempt_suicide(),
is added to mm/oom_kill.c, to handle the killing. If this
happens, the task never gets to the point of invoking the
swapper.

This mechanism differs from a general purpose out-of-memory
killer in various ways, including:

* An oom-killer tries to score the bad guy, to avoid shooting
the innocent little task that just happened to ask for one
page too many.
* The policy_kill_no_swap hook kills the current requester.
* It takes severe memory pressure to wake up an oom-killer.
* The policy_kill_no_swap hook triggers on the slightest
pressure that exceeds readily free memory.
* The oom-killer can be useful on a general purpose system.
* The policy_kill_no_swap hook is only useful for carefully
tuned apps running on dedicated nodes on large systems.

In short - simple enough, but quite specialized.

This patch has been built, booted and tested for function on
an ia64 SN2 platform. It has been built with and without
CONFIG_CPUSETS enabled on an i386 platform.

It would not surprise me if a few more such cpuset policy flags
showed up over the next year, to affect scheduling or allocation
for all tasks in a cpuset. However, I have no more such flags
queued up, or even with an agreed design, at this time.

Signed-off-by: Paul Jackson <[email protected]>

Index: 2.6.12-pj/Documentation/cpusets.txt
===================================================================
--- 2.6.12-pj.orig/Documentation/cpusets.txt 2005-03-19 01:10:46.000000000 -0800
+++ 2.6.12-pj/Documentation/cpusets.txt 2005-03-19 14:41:21.000000000 -0800
@@ -166,6 +166,7 @@ containing the following files describin
- mems: list of Memory Nodes in that cpuset
- cpu_exclusive flag: is cpu placement exclusive?
- mem_exclusive flag: is memory placement exclusive?
+ - policy_kill_no_swap: kill task if its memory use would wake swapper
- tasks: list of tasks (by pid) attached to that cpuset

New cpusets are created using the mkdir system call or shell
@@ -333,7 +334,7 @@ Now you want to do something with this c

In this directory you can find several files:
# ls
-cpus cpu_exclusive mems mem_exclusive tasks
+cpu_exclusive cpus mem_exclusive mems policy_kill_no_swap tasks

Reading them will give you information about the state of this cpuset:
the CPUs and Memory Nodes it can use, the processes that are using
Index: 2.6.12-pj/include/linux/cpuset.h
===================================================================
--- 2.6.12-pj.orig/include/linux/cpuset.h 2005-03-19 00:38:05.000000000 -0800
+++ 2.6.12-pj/include/linux/cpuset.h 2005-03-19 02:06:41.000000000 -0800
@@ -24,6 +24,7 @@ void cpuset_update_current_mems_allowed(
void cpuset_restrict_to_mems_allowed(unsigned long *nodes);
int cpuset_zonelist_valid_mems_allowed(struct zonelist *zl);
int cpuset_zone_allowed(struct zone *z);
+int cpuset_is_kill_no_swap(void);
extern struct file_operations proc_cpuset_operations;
extern char *cpuset_task_status_allowed(struct task_struct *task, char *buffer);

@@ -53,6 +54,11 @@ static inline int cpuset_zone_allowed(st
return 1;
}

+static inline int cpuset_is_kill_no_swap(void)
+{
+ return 0;
+}
+
static inline char *cpuset_task_status_allowed(struct task_struct *task,
char *buffer)
{
Index: 2.6.12-pj/include/linux/swap.h
===================================================================
--- 2.6.12-pj.orig/include/linux/swap.h 2005-03-19 00:38:05.000000000 -0800
+++ 2.6.12-pj/include/linux/swap.h 2005-03-19 14:59:04.000000000 -0800
@@ -149,6 +149,7 @@ struct swap_list_t {

/* linux/mm/oom_kill.c */
extern void out_of_memory(int gfp_mask);
+extern void oom_attempt_suicide(const char *msg);

/* linux/mm/memory.c */
extern void swapin_readahead(swp_entry_t, unsigned long, struct vm_area_struct *);
Index: 2.6.12-pj/kernel/cpuset.c
===================================================================
--- 2.6.12-pj.orig/kernel/cpuset.c 2005-03-19 01:10:43.000000000 -0800
+++ 2.6.12-pj/kernel/cpuset.c 2005-03-19 16:22:00.000000000 -0800
@@ -83,6 +83,7 @@ struct cpuset {
typedef enum {
CS_CPU_EXCLUSIVE,
CS_MEM_EXCLUSIVE,
+ CS_KILL_NO_SWAP,
CS_REMOVED,
CS_NOTIFY_ON_RELEASE
} cpuset_flagbits_t;
@@ -98,6 +99,11 @@ static inline int is_mem_exclusive(const
return !!test_bit(CS_MEM_EXCLUSIVE, &cs->flags);
}

+static inline int is_kill_no_swap(const struct cpuset *cs)
+{
+ return !!test_bit(CS_KILL_NO_SWAP, &cs->flags);
+}
+
static inline int is_removed(const struct cpuset *cs)
{
return !!test_bit(CS_REMOVED, &cs->flags);
@@ -643,8 +649,7 @@ static int update_nodemask(struct cpuset

/*
* update_flag - read a 0 or a 1 in a file and update associated flag
- * bit: the bit to update (CS_CPU_EXCLUSIVE, CS_MEM_EXCLUSIVE,
- * CS_NOTIFY_ON_RELEASE)
+ * bit: which cpuset_flagbits_t bit (CS_*) to update
* cs: the cpuset to update
* buf: the buffer where we read the 0 or 1
*/
@@ -736,6 +741,7 @@ typedef enum {
FILE_MEMLIST,
FILE_CPU_EXCLUSIVE,
FILE_MEM_EXCLUSIVE,
+ FILE_KILL_NO_SWAP,
FILE_NOTIFY_ON_RELEASE,
FILE_TASKLIST,
} cpuset_filetype_t;
@@ -783,6 +789,9 @@ static ssize_t cpuset_common_file_write(
case FILE_MEM_EXCLUSIVE:
retval = update_flag(CS_MEM_EXCLUSIVE, cs, buffer);
break;
+ case FILE_KILL_NO_SWAP:
+ retval = update_flag(CS_KILL_NO_SWAP, cs, buffer);
+ break;
case FILE_NOTIFY_ON_RELEASE:
retval = update_flag(CS_NOTIFY_ON_RELEASE, cs, buffer);
break;
@@ -884,6 +893,9 @@ static ssize_t cpuset_common_file_read(s
case FILE_MEM_EXCLUSIVE:
*s++ = is_mem_exclusive(cs) ? '1' : '0';
break;
+ case FILE_KILL_NO_SWAP:
+ *s++ = is_kill_no_swap(cs) ? '1' : '0';
+ break;
case FILE_NOTIFY_ON_RELEASE:
*s++ = notify_on_release(cs) ? '1' : '0';
break;
@@ -1210,6 +1222,11 @@ static struct cftype cft_mem_exclusive =
.private = FILE_MEM_EXCLUSIVE,
};

+static struct cftype cft_kill_no_swap = {
+ .name = "policy_kill_no_swap",
+ .private = FILE_KILL_NO_SWAP,
+};
+
static struct cftype cft_notify_on_release = {
.name = "notify_on_release",
.private = FILE_NOTIFY_ON_RELEASE,
@@ -1227,6 +1244,8 @@ static int cpuset_populate_dir(struct de
return err;
if ((err = cpuset_add_file(cs_dentry, &cft_mem_exclusive)) < 0)
return err;
+ if ((err = cpuset_add_file(cs_dentry, &cft_kill_no_swap)) < 0)
+ return err;
if ((err = cpuset_add_file(cs_dentry, &cft_notify_on_release)) < 0)
return err;
if ((err = cpuset_add_file(cs_dentry, &cft_tasks)) < 0)
@@ -1257,6 +1276,8 @@ static long cpuset_create(struct cpuset
cs->flags = 0;
if (notify_on_release(parent))
set_bit(CS_NOTIFY_ON_RELEASE, &cs->flags);
+ if (is_kill_no_swap(parent))
+ set_bit(CS_KILL_NO_SWAP, &cs->flags);
cs->cpus_allowed = CPU_MASK_NONE;
cs->mems_allowed = NODE_MASK_NONE;
atomic_set(&cs->count, 0);
@@ -1501,6 +1522,14 @@ int cpuset_zone_allowed(struct zone *z)
}

/*
+ * Should current task be killed to avoid kicking swapper?
+ */
+int cpuset_is_kill_no_swap(void)
+{
+ return current->cpuset && is_kill_no_swap(current->cpuset);
+}
+
+/*
* proc_cpuset_show()
* - Print tasks cpuset path into seq_file.
* - Used for /proc/<pid>/cpuset.
Index: 2.6.12-pj/mm/oom_kill.c
===================================================================
--- 2.6.12-pj.orig/mm/oom_kill.c 2005-03-19 00:38:04.000000000 -0800
+++ 2.6.12-pj/mm/oom_kill.c 2005-03-19 16:37:10.000000000 -0800
@@ -15,6 +15,7 @@
* kernel subsystems and hints as to where to find out what things do.
*/

+#include <linux/interrupt.h>
#include <linux/mm.h>
#include <linux/sched.h>
#include <linux/swap.h>
@@ -290,3 +291,32 @@ retry:
__set_current_state(TASK_INTERRUPTIBLE);
schedule_timeout(1);
}
+
+/**
+ * oom_attempt_suicide - Unless we're in interrupt or a 'special' malloc,
+ * or already trying to exit, die.
+ */
+
+void oom_attempt_suicide(const char *msg)
+{
+ struct mm_struct *mm;
+
+ if (in_interrupt())
+ return;
+ if (current->flags & PF_MEMALLOC)
+ return;
+ if (test_tsk_thread_flag(current, TIF_MEMDIE))
+ return;
+ if (current->flags & PF_EXITING)
+ return;
+
+ printk(KERN_CRIT "Killing process %d (%s) - %s\n",
+ current->pid, current->comm, msg);
+ read_lock(&tasklist_lock);
+ mm = oom_kill_task(current);
+ read_unlock(&tasklist_lock);
+ if (mm)
+ mmput(mm);
+ __set_current_state(TASK_INTERRUPTIBLE);
+ schedule_timeout(1);
+}
Index: 2.6.12-pj/mm/page_alloc.c
===================================================================
--- 2.6.12-pj.orig/mm/page_alloc.c 2005-03-19 00:38:04.000000000 -0800
+++ 2.6.12-pj/mm/page_alloc.c 2005-03-19 14:58:38.000000000 -0800
@@ -774,6 +774,9 @@ __alloc_pages(unsigned int gfp_mask, uns
goto got_pg;
}

+ if (cpuset_is_kill_no_swap())
+ oom_attempt_suicide("cpuset policy_kill_no_swap set");
+
for (i = 0; (z = zones[i]) != NULL; i++)
wakeup_kswapd(z, order);


--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373, 1.925.600.0401


2005-03-20 06:59:57

by Andrew Morton

Subject: Re: [Patch] cpusets policy kill no swap

Paul Jackson <[email protected]> wrote:
>
> This mechanism differs from a general purpose out-of-memory
> killer in various ways, including:
>
> * An oom-killer tries to score the bad guy, to avoid shooting
> the innocent little task that just happened to ask for one
> page too many.
> * The policy_kill_no_swap hook kills the current requester.
> * It takes severe memory pressure to wake up an oom-killer.
> * The policy_kill_no_swap hook triggers on the slightest
> pressure that exceeds readily free memory.
> * The oom-killer can be useful on a general purpose system.
> * The policy_kill_no_swap hook is only useful for carefully
> tuned apps running on dedicated nodes on large systems.
>

There are a lot of reasons why we would wake kswapd apart from starting
swapout. Such as to reclaim clean pagecache or some dcache+icache.

> In short - simple enough, but quite specialized.

Way too specialised, I suspect. Is it not possible to have a little
userspace daemon which monitors the long-running application's rss and
whacks it if the rss gets too large?

The patch you have simply kills the process when all the eligible zones
reach their upper watermark. Again, we can probably determine that state
from userspace right now. If not, it would be simple enough to add the
required info to /proc somewhere.
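
A minimal sketch of the check such a userspace daemon might perform, assuming it reads the text of /proc/<pid>/status and compares the VmRSS field against a limit. The function names parse_vmrss_kb() and rss_over_limit() and the limit parameter are invented for this illustration; nothing here comes from the patch.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: extract the VmRSS value (in kB) from the text
 * of /proc/<pid>/status. Returns -1 if no usable VmRSS line is found
 * (kernel threads, for example, have none).
 */
static long parse_vmrss_kb(const char *status_text)
{
	const char *line = status_text;
	long kb;

	assert(status_text != NULL);
	while (line && *line) {
		if (strncmp(line, "VmRSS:", 6) == 0) {
			/* %ld skips the tab and spaces before the number */
			if (sscanf(line + 6, "%ld", &kb) == 1)
				return kb;
			return -1;
		}
		/* advance to the start of the next line, if any */
		line = strchr(line, '\n');
		if (line)
			line++;
	}
	return -1;
}

/*
 * Hypothetical policy check: returns 1 if the task's rss exceeds
 * limit_kb, 0 otherwise (including when rss cannot be determined).
 */
static int rss_over_limit(const char *status_text, long limit_kb)
{
	long kb = parse_vmrss_kb(status_text);

	return kb >= 0 && kb > limit_kb;
}
```

A daemon built on this would periodically read the status file of each monitored pid and, when rss_over_limit() returns 1, send the task a signal with kill(2). How often to poll is exactly the cost/latency trade-off a real deployment would have to measure.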

2005-03-20 07:11:38

by Paul Jackson

Subject: Re: [Patch] cpusets policy kill no swap

Interesting comments, Andrew. Thanks.

It will likely be a couple of days before
I respond to them. I suspect a couple
of us SGI folks should powwow first.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373, 1.925.600.0401

2005-03-22 23:10:39

by Paul Jackson

Subject: Re: [Patch] cpusets policy kill no swap

Thanks Andrew - you're right. Drop this patch in /dev/null.

* I will look around for some way that user code can
detect that a task has provoked swapping, or propose
a small patch, perhaps to /proc, for that, if need be.

* I agree that the action, killing a task or whatever, can
and should be instigated by user level code. The kernel
provides the essential mechanisms; user code decides the
policy, and elaborates the mechanisms.

* I'm concerned that polling some /proc state will either be too
wasteful of cycles (if we poll fast) or have too much delay to
trigger (if we poll slow). Though I need some real numbers,
to see if this is a real problem. It was definitely a problem
in a past life, but that may not apply here. The Linux 2.6
swapper is much more NUMA friendly.

Note, however, that something like rlimit, used to impose
other limits on task resource consumption, depends on specific
kernel hooks to catch the violation (using too much memory,
say) rather than insisting that user space code scan /proc
information looking for violators. The former is just far
more efficient than the latter.

* I'm still casting about for appropriate mechanisms (if polling
some /proc data is not adequate) to:
1) enable user space code to control some kernel trigger
that fires when a task causes more swapping than the
setting allows (something like rlimit?), and
2) an economical mechanism for the kernel to deliver such
events back to user space (call_usermodehelper or
satisfying a read on a special file?).

If you, or any lurker, has further thoughts, they would be
welcome.
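
To make the polling option concrete, here is a rough sketch of how user code might detect swap-out activity between two samples of /proc/vmstat, using the pswpout counter. Note the assumption: that counter is system-wide, so this shows that swapping happened, not which task provoked it, which is the gap the bullets above are concerned with. The function names vmstat_counter() and swapout_delta() are made up for this example.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: look up one "name value" line in the text of
 * /proc/vmstat. Stores the value and returns 0 on success, -1 if the
 * counter is missing or malformed.
 */
static int vmstat_counter(const char *text, const char *name,
			  unsigned long long *value)
{
	const char *line = text;
	size_t len;

	assert(text != NULL && name != NULL && value != NULL);
	len = strlen(name);
	while (line && *line) {
		/* require a space after the name so "pswp" != "pswpout" */
		if (strncmp(line, name, len) == 0 && line[len] == ' ') {
			if (sscanf(line + len, "%llu", value) == 1)
				return 0;
			return -1;
		}
		line = strchr(line, '\n');
		if (line)
			line++;
	}
	return -1;
}

/*
 * Hypothetical helper: pages swapped out between two vmstat samples.
 * Returns 0 and stores the delta, or -1 if either sample lacks the
 * counter or the counter went backwards.
 */
static int swapout_delta(const char *before, const char *after,
			 unsigned long long *delta)
{
	unsigned long long a, b;

	if (vmstat_counter(before, "pswpout", &a) < 0 ||
	    vmstat_counter(after, "pswpout", &b) < 0 || b < a)
		return -1;
	*delta = b - a;
	return 0;
}
```

A poller would read /proc/vmstat into a buffer at each interval and act when the delta is nonzero. Whether that is cheap enough, or fast enough to trigger, would need the real numbers mentioned above.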

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373, 1.925.600.0401