2005-09-01 09:16:43

by Paul Jackson

Subject: [PATCH 0/4] cpusets mems_allowed constrain GFP_KERNEL, oom killer

The following patch set is proposed for inclusion in 2.6.14.

This patch extends the use of the cpuset attribute 'mem_exclusive'
to support cpuset configurations that:
1) allow GFP_KERNEL allocations to come from a potentially larger
set of memory nodes than GFP_USER allocations, and
2) constrain the oom killer to tasks running in cpusets in
a specified subtree of the cpuset hierarchy.

Here's an example usage scenario. For a few hours or more, a large
NUMA system at a University is to be divided in two halves, with a
bunch of student jobs running in half the system under some form
of batch manager, and with a big research project running in the
other half. Each of the student jobs is placed in a small cpuset, but
should share the classic Unix time share facilities, such as buffered
pages of files in /bin and /usr/lib. The big research project wants no
interference whatsoever from the student jobs, and has highly tuned,
unusual memory and i/o patterns that are intended to make full use of
all the main memory on the nodes available to it.

In this example, we have two big sibling cpusets, one of which is
further divided into a more dynamic set of child cpusets.

We want kernel memory allocations constrained by the two big cpusets,
and user allocations constrained by the smaller child cpusets where
present. And we require that the oom killer not operate across the two
halves of this system, or else the first time a student job runs amuck,
the big research project will likely be first in line to get shot.

Tweaking /proc/<pid>/oom_adj is not ideal -- if the big research
project really does run amuck allocating memory, it should be shot,
not some other task outside the research project's mem_exclusive cpuset.

I propose to extend the use of the 'mem_exclusive' flag of cpusets
to manage such scenarios. Let memory allocations for user space
(GFP_USER) be constrained by a task's current cpuset, but let memory
allocations for kernel space (GFP_KERNEL) be constrained by the
nearest mem_exclusive ancestor of the current cpuset, even though
kernel space allocations will still _prefer_ to remain within the
current task's cpuset, if memory is easily available.

Let the oom killer be constrained to consider only tasks that are in
overlapping mem_exclusive cpusets (it won't help much to kill a task
that normally cannot allocate memory on any of the same nodes as the
ones on which the current task can allocate).
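
To make the intended behavior concrete, here is a rough sketch of the
node check, boiled down from patch (3) of this series. The helper name
is made up, and locking, the zonelist walk and the exiting-task case
are all omitted; treat it as an illustration of the policy, not as the
code that patch (3) actually applies.

	/* Sketch only - see cpuset_zone_allowed() in patch (3). */
	static int node_allowed_sketch(int node, unsigned int gfp_mask)
	{
		const struct cpuset *cs = current->cpuset;

		if (in_interrupt())
			return 1;	/* whole system is fair game */
		if (node_isset(node, current->mems_allowed))
			return 1;	/* within the current task's cpuset */
		if (gfp_mask & __GFP_HARDWALL)
			return 0;	/* GFP_USER stops at the cpuset wall */

		/* GFP_KERNEL may fall back to the nearest mem_exclusive ancestor. */
		while (!is_mem_exclusive(cs) && cs->parent)
			cs = cs->parent;
		return node_isset(node, cs->mems_allowed);
	}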

The current constraints imposed on setting mem_exclusive are unchanged.
A cpuset may only be mem_exclusive if its parent is also mem_exclusive,
and a mem_exclusive cpuset may not overlap any of its siblings'
memory nodes.
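
For what it's worth, here is one way the University scenario above
could be set up against the cpuset virtual file system, assuming it
is mounted at /dev/cpuset. This is only a sketch - the cpuset names
and the cpu and memory node numbers are made up for illustration.

	#include <fcntl.h>
	#include <stdio.h>
	#include <string.h>
	#include <sys/stat.h>
	#include <unistd.h>

	/* Write a small string into one cpuset control file. */
	static void put(const char *path, const char *val)
	{
		int fd = open(path, O_WRONLY);

		if (fd < 0 || write(fd, val, strlen(val)) < 0)
			perror(path);
		if (fd >= 0)
			close(fd);
	}

	int main(void)
	{
		/* Research half: one big mem_exclusive cpuset. */
		mkdir("/dev/cpuset/research", 0755);
		put("/dev/cpuset/research/cpus", "0-31");
		put("/dev/cpuset/research/mems", "0-15");
		put("/dev/cpuset/research/mem_exclusive", "1");

		/* Student half: a mem_exclusive parent for the batch manager ... */
		mkdir("/dev/cpuset/students", 0755);
		put("/dev/cpuset/students/cpus", "32-63");
		put("/dev/cpuset/students/mems", "16-31");
		put("/dev/cpuset/students/mem_exclusive", "1");

		/* ... and small, non-exclusive child cpusets, one per student job. */
		mkdir("/dev/cpuset/students/job1", 0755);
		put("/dev/cpuset/students/job1/cpus", "32-33");
		put("/dev/cpuset/students/job1/mems", "16");

		return 0;
	}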

This patch was presented on linux-mm in early July 2005, though did not
generate much feedback at that time. It has been built for a variety of
arches using cross tools, and built, booted and tested for function
on SN2 (ia64).

There are 4 patches in this set:
1) Some minor cleanup, and some improvements to the code layout
of one routine to make subsequent patches cleaner.
2) Add another GFP flag - __GFP_HARDWALL. It marks memory
requests for USER space, which are tightly confined by the
current task's cpuset.
3) Now memory requests (such as KERNEL) that are not marked HARDWALL
can, if short on memory, look in the potentially larger pool of memory
defined by the nearest mem_exclusive ancestor cpuset of the current
task's cpuset.
4) Finally, modify the oom killer to skip any task whose mem_exclusive
cpuset doesn't overlap ours.

Patch (1), the one time I looked on an SN2 (ia64) build, actually saved
32 bytes of kernel text space. Patch (2) has no effect on the size
of kernel text space (it just adds a preprocessor flag). Patches (3)
and (4) added about 600 bytes each of kernel text space, mostly in
kernel/cpuset.c, which matters only if CONFIG_CPUSETS is enabled.


--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373


2005-09-01 09:09:16

by Paul Jackson

Subject: [PATCH 1/4] cpusets oom_kill tweaks

This patch applies a few comment and code cleanups to mm/oom_kill.c
prior to applying a few small patches to improve cpuset management of
memory placement.

The comment changed in oom_kill.c was seriously misleading. The code
layout change in select_bad_process() makes room for adding another
condition on which a process can be spared the oom killer (see the
subsequent cpuset_nodes_overlap patch for this addition).

Also fixed a couple of typos and spellos that bugged me while I was here.

This patch should have no material effect.

Signed-off-by: Paul Jackson <[email protected]>

Index: linux-2.6.13-mem_exclusive_oom/mm/oom_kill.c
===================================================================
--- linux-2.6.13-mem_exclusive_oom.orig/mm/oom_kill.c
+++ linux-2.6.13-mem_exclusive_oom/mm/oom_kill.c
@@ -6,8 +6,8 @@
* for goading me into coding this file...
*
* The routines in this file are used to kill a process when
- * we're seriously out of memory. This gets called from kswapd()
- * in linux/mm/vmscan.c when we really run out of memory.
+ * we're seriously out of memory. This gets called from __alloc_pages()
+ * in mm/page_alloc.c when we really run out of memory.
*
* Since we won't call these routines often (on a well-configured
* machine) this file will double as a 'coding guide' and a signpost
@@ -26,7 +26,7 @@
/**
* oom_badness - calculate a numeric value for how bad this task has been
* @p: task struct of which task we should calculate
- * @p: current uptime in seconds
+ * @uptime: current uptime in seconds
*
* The formula used is relatively simple and documented inline in the
* function. The main rationale is that we want to select a good task
@@ -57,9 +57,9 @@ unsigned long badness(struct task_struct

/*
* Processes which fork a lot of child processes are likely
- * a good choice. We add the vmsize of the childs if they
+ * a good choice. We add the vmsize of the children if they
* have an own mm. This prevents forking servers to flood the
- * machine with an endless amount of childs
+ * machine with an endless amount of children
*/
list_for_each(tsk, &p->children) {
struct task_struct *chld;
@@ -143,28 +143,32 @@ static struct task_struct * select_bad_p
struct timespec uptime;

do_posix_clock_monotonic_gettime(&uptime);
- do_each_thread(g, p)
- /* skip the init task with pid == 1 */
- if (p->pid > 1 && p->oomkilladj != OOM_DISABLE) {
- unsigned long points;
+ do_each_thread(g, p) {
+ unsigned long points;
+ int releasing;

- /*
- * This is in the process of releasing memory so wait it
- * to finish before killing some other task by mistake.
- */
- if ((unlikely(test_tsk_thread_flag(p, TIF_MEMDIE)) || (p->flags & PF_EXITING)) &&
- !(p->flags & PF_DEAD))
- return ERR_PTR(-1UL);
- if (p->flags & PF_SWAPOFF)
- return p;
-
- points = badness(p, uptime.tv_sec);
- if (points > maxpoints || !chosen) {
- chosen = p;
- maxpoints = points;
- }
+ /* skip the init task with pid == 1 */
+ if (p->pid == 1)
+ continue;
+ if (p->oomkilladj == OOM_DISABLE)
+ continue;
+ /*
+ * This is in the process of releasing memory so wait for it
+ * to finish before killing some other task by mistake.
+ */
+ releasing = test_tsk_thread_flag(p, TIF_MEMDIE) ||
+ p->flags & PF_EXITING;
+ if (releasing && !(p->flags & PF_DEAD))
+ return ERR_PTR(-1UL);
+ if (p->flags & PF_SWAPOFF)
+ return p;
+
+ points = badness(p, uptime.tv_sec);
+ if (points > maxpoints || !chosen) {
+ chosen = p;
+ maxpoints = points;
}
- while_each_thread(g, p);
+ } while_each_thread(g, p);
return chosen;
}

@@ -189,7 +193,8 @@ static void __oom_kill_task(task_t *p)
return;
}
task_unlock(p);
- printk(KERN_ERR "Out of Memory: Killed process %d (%s).\n", p->pid, p->comm);
+ printk(KERN_ERR "Out of Memory: Killed process %d (%s).\n",
+ p->pid, p->comm);

/*
* We give our sacrificial lamb high priority and access to

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373

2005-09-01 09:09:17

by Paul Jackson

Subject: [PATCH 2/4] cpusets new __GFP_HARDWALL flag

Add another GFP flag: __GFP_HARDWALL.

A subsequent "cpuset_zone_allowed" patch will use this flag to mark
GFP_USER allocations, and distinguish them from GFP_KERNEL allocations.

Allocations (such as GFP_USER) marked GFP_HARDWALL are constrained to
the current task's cpuset. Other allocations (such as GFP_KERNEL) can
steal from the possibly larger nearest mem_exclusive cpuset ancestor,
if memory is tight on every node in the current cpuset.

This patch collides with Mel Gorman's patch to reduce fragmentation
in the standard buddy allocator, which adds two GFP flags. This was
discussed on linux-mm in July. Most likely, one of his flags for
user reclaimable memory can be the same as my __GFP_HARDWALL flag,
under some generic name meaning it marks user address space memory.

Signed-off-by: Paul Jackson <[email protected]>

Index: linux-2.6.13-mem_exclusive_oom/include/linux/gfp.h
===================================================================
--- linux-2.6.13-mem_exclusive_oom.orig/include/linux/gfp.h
+++ linux-2.6.13-mem_exclusive_oom/include/linux/gfp.h
@@ -40,6 +40,7 @@ struct vm_area_struct;
#define __GFP_ZERO 0x8000u /* Return zeroed page on success */
#define __GFP_NOMEMALLOC 0x10000u /* Don't use emergency reserves */
#define __GFP_NORECLAIM 0x20000u /* No realy zone reclaim during allocation */
+#define __GFP_HARDWALL 0x40000u /* Enforce hardwall cpuset memory allocs */

#define __GFP_BITS_SHIFT 20 /* Room for 20 __GFP_FOO bits */
#define __GFP_BITS_MASK ((1 << __GFP_BITS_SHIFT) - 1)
@@ -48,14 +49,15 @@ struct vm_area_struct;
#define GFP_LEVEL_MASK (__GFP_WAIT|__GFP_HIGH|__GFP_IO|__GFP_FS| \
__GFP_COLD|__GFP_NOWARN|__GFP_REPEAT| \
__GFP_NOFAIL|__GFP_NORETRY|__GFP_NO_GROW|__GFP_COMP| \
- __GFP_NOMEMALLOC|__GFP_NORECLAIM)
+ __GFP_NOMEMALLOC|__GFP_NORECLAIM|__GFP_HARDWALL)

#define GFP_ATOMIC (__GFP_HIGH)
#define GFP_NOIO (__GFP_WAIT)
#define GFP_NOFS (__GFP_WAIT | __GFP_IO)
#define GFP_KERNEL (__GFP_WAIT | __GFP_IO | __GFP_FS)
-#define GFP_USER (__GFP_WAIT | __GFP_IO | __GFP_FS)
-#define GFP_HIGHUSER (__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_HIGHMEM)
+#define GFP_USER (__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_HARDWALL)
+#define GFP_HIGHUSER (__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_HARDWALL | \
+ __GFP_HIGHMEM)

/* Flag - indicates that the buffer will be suitable for DMA. Ignored on some
platforms, used as appropriate on others */

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373

2005-09-01 09:09:39

by Paul Jackson

Subject: [PATCH 3/4] cpusets formalize intermediate GFP_KERNEL containment

This patch depends on the previous patches cpuset_gfp_hardwall_flag
and cpuset_mm_alloc_oom_fixes.

This patch makes use of the previously underutilized cpuset flag
'mem_exclusive' to provide what amounts to another layer of
memory placement resolution. With this patch, there are now the
following four layers of memory placement available:

1) The whole system (interrupt and GFP_ATOMIC allocations can use this),
2) The nearest enclosing mem_exclusive cpuset (GFP_KERNEL allocations can use),
3) The current task's cpuset (GFP_USER allocations constrained to here), and
4) Specific node placement, using mbind and set_mempolicy.

These nest: each layer is a subset of the previous (either the same,
or contained within it).
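
Layer (4) is nothing new; it is what userspace already gets from the
memory policy system calls. A minimal, illustrative example follows
(the node number is made up and must of course lie within the task's
cpuset; link with -lnuma from the numactl package):

	#include <numaif.h>	/* set_mempolicy(), mbind(), MPOL_BIND */
	#include <stdio.h>
	#include <sys/mman.h>

	int main(void)
	{
		unsigned long nodemask = 1UL << 2;	/* made-up node 2 */
		unsigned long len = 1UL << 20;
		void *buf;

		/* Bind all future allocations of this task to node 2. */
		if (set_mempolicy(MPOL_BIND, &nodemask, 8 * sizeof(nodemask)))
			perror("set_mempolicy");

		/* mbind() does the same for one specific mapping. */
		buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (buf != MAP_FAILED &&
		    mbind(buf, len, MPOL_BIND, &nodemask, 8 * sizeof(nodemask), 0))
			perror("mbind");

		return 0;
	}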

Layer (2) above is new, with this patch. The call used to
check whether a zone (its node, actually) is in a cpuset (in its
mems_allowed, actually) is extended to take a gfp_mask argument,
and its logic is extended, in the case that __GFP_HARDWALL is not
set in the flag bits, to look up the cpuset hierarchy for the nearest
enclosing mem_exclusive cpuset, to determine if placement is allowed.
The definition of GFP_USER, which used to be identical to GFP_KERNEL,
is changed to also set the __GFP_HARDWALL bit, in the previous
cpuset_gfp_hardwall_flag patch.

GFP_ATOMIC and GFP_KERNEL allocations will stay within the current
task's cpuset, so long as some node therein has memory to spare, but
will escape to the larger layer if need be.

The intended use is to allow something like a batch manager to
handle several jobs, each job in its own cpuset, but using common
kernel memory for caches and such. Swapper and oom_kill activity is
also constrained to Layer (2). A task in or below one mem_exclusive
cpuset should not cause swapping on nodes in another non-overlapping
mem_exclusive cpuset, nor provoke oom_killing of a task in another
such cpuset. Heavy use of kernel memory for i/o caching and such
by one job should not impact the memory available to jobs in other
non-overlapping mem_exclusive cpusets.

This patch enables providing hardwall, inescapable cpusets for
memory allocations of each job, while sharing kernel memory
allocations between several jobs, in an enclosing mem_exclusive
cpuset.

Like Dinakar's patch earlier to enable administering sched domains
using the cpu_exclusive flag, this patch also provides a useful
meaning to a cpuset flag that had previously done nothing much
useful other than restrict what cpuset configurations were allowed.

Signed-off-by: Paul Jackson <[email protected]>

Index: linux-2.6.13-mem_exclusive_oom/Documentation/cpusets.txt
===================================================================
--- linux-2.6.13-mem_exclusive_oom.orig/Documentation/cpusets.txt
+++ linux-2.6.13-mem_exclusive_oom/Documentation/cpusets.txt
@@ -60,6 +60,18 @@ all of the cpus in the system. This remo
load balancing code trying to pull tasks outside of the cpu exclusive
cpuset only to be prevented by the tasks' cpus_allowed mask.

+A cpuset that is mem_exclusive restricts kernel allocations for
+page, buffer and other data commonly shared by the kernel across
+multiple users. All cpusets, whether mem_exclusive or not, restrict
+allocations of memory for user space. This enables configuring a
+system so that several independent jobs can share common kernel
+data, such as file system pages, while isolating each jobs user
+allocation in its own cpuset. To do this, construct a large
+mem_exclusive cpuset to hold all the jobs, and construct child,
+non-mem_exclusive cpusets for each individual job. Only a small
+amount of typical kernel memory, such as requests from interrupt
+handlers, is allowed to be taken outside even a mem_exclusive cpuset.
+
User level code may create and destroy cpusets by name in the cpuset
virtual file system, manage the attributes and permissions of these
cpusets and which CPUs and Memory Nodes are assigned to each cpuset,
Index: linux-2.6.13-mem_exclusive_oom/include/linux/cpuset.h
===================================================================
--- linux-2.6.13-mem_exclusive_oom.orig/include/linux/cpuset.h
+++ linux-2.6.13-mem_exclusive_oom/include/linux/cpuset.h
@@ -23,7 +23,7 @@ void cpuset_init_current_mems_allowed(vo
void cpuset_update_current_mems_allowed(void);
void cpuset_restrict_to_mems_allowed(unsigned long *nodes);
int cpuset_zonelist_valid_mems_allowed(struct zonelist *zl);
-int cpuset_zone_allowed(struct zone *z);
+extern int cpuset_zone_allowed(struct zone *z, unsigned int __nocast gfp_mask);
extern struct file_operations proc_cpuset_operations;
extern char *cpuset_task_status_allowed(struct task_struct *task, char *buffer);

@@ -48,7 +48,8 @@ static inline int cpuset_zonelist_valid_
return 1;
}

-static inline int cpuset_zone_allowed(struct zone *z)
+static inline int cpuset_zone_allowed(struct zone *z,
+ unsigned int __nocast gfp_mask)
{
return 1;
}
Index: linux-2.6.13-mem_exclusive_oom/mm/page_alloc.c
===================================================================
--- linux-2.6.13-mem_exclusive_oom.orig/mm/page_alloc.c
+++ linux-2.6.13-mem_exclusive_oom/mm/page_alloc.c
@@ -806,11 +806,14 @@ __alloc_pages(unsigned int __nocast gfp_
classzone_idx = zone_idx(zones[0]);

restart:
- /* Go through the zonelist once, looking for a zone with enough free */
+ /*
+ * Go through the zonelist once, looking for a zone with enough free.
+ * See also cpuset_zone_allowed() comment in kernel/cpuset.c.
+ */
for (i = 0; (z = zones[i]) != NULL; i++) {
int do_reclaim = should_reclaim_zone(z, gfp_mask);

- if (!cpuset_zone_allowed(z))
+ if (!cpuset_zone_allowed(z, __GFP_HARDWALL))
continue;

/*
@@ -845,6 +848,7 @@ zone_reclaim_retry:
*
* This is the last chance, in general, before the goto nopage.
* Ignore cpuset if GFP_ATOMIC (!wait) rather than fail alloc.
+ * See also cpuset_zone_allowed() comment in kernel/cpuset.c.
*/
for (i = 0; (z = zones[i]) != NULL; i++) {
if (!zone_watermark_ok(z, order, z->pages_min,
@@ -852,7 +856,7 @@ zone_reclaim_retry:
gfp_mask & __GFP_HIGH))
continue;

- if (wait && !cpuset_zone_allowed(z))
+ if (wait && !cpuset_zone_allowed(z, gfp_mask))
continue;

page = buffered_rmqueue(z, order, gfp_mask);
@@ -867,7 +871,7 @@ zone_reclaim_retry:
if (!(gfp_mask & __GFP_NOMEMALLOC)) {
/* go through the zonelist yet again, ignoring mins */
for (i = 0; (z = zones[i]) != NULL; i++) {
- if (!cpuset_zone_allowed(z))
+ if (!cpuset_zone_allowed(z, gfp_mask))
continue;
page = buffered_rmqueue(z, order, gfp_mask);
if (page)
@@ -903,7 +907,7 @@ rebalance:
gfp_mask & __GFP_HIGH))
continue;

- if (!cpuset_zone_allowed(z))
+ if (!cpuset_zone_allowed(z, gfp_mask))
continue;

page = buffered_rmqueue(z, order, gfp_mask);
@@ -922,7 +926,7 @@ rebalance:
classzone_idx, 0, 0))
continue;

- if (!cpuset_zone_allowed(z))
+ if (!cpuset_zone_allowed(z, __GFP_HARDWALL))
continue;

page = buffered_rmqueue(z, order, gfp_mask);
Index: linux-2.6.13-mem_exclusive_oom/kernel/cpuset.c
===================================================================
--- linux-2.6.13-mem_exclusive_oom.orig/kernel/cpuset.c
+++ linux-2.6.13-mem_exclusive_oom/kernel/cpuset.c
@@ -1611,17 +1611,81 @@ int cpuset_zonelist_valid_mems_allowed(s
return 0;
}

+/*
+ * nearest_exclusive_ancestor() - Returns the nearest mem_exclusive
+ * ancestor to the specified cpuset. Call while holding cpuset_sem.
+ * If no ancestor is mem_exclusive (an unusual configuration), then
+ * returns the root cpuset.
+ */
+static const struct cpuset *nearest_exclusive_ancestor(const struct cpuset *cs)
+{
+ while (!is_mem_exclusive(cs) && cs->parent)
+ cs = cs->parent;
+ return cs;
+}
+
/**
- * cpuset_zone_allowed - is zone z allowed in current->mems_allowed
- * @z: zone in question
+ * cpuset_zone_allowed - Can we allocate memory on zone z's memory node?
+ * @z: is this zone on an allowed node?
+ * @gfp_mask: memory allocation flags (we use __GFP_HARDWALL)
*
- * Is zone z allowed in current->mems_allowed, or is
- * the CPU in interrupt context? (zone is always allowed in this case)
- */
-int cpuset_zone_allowed(struct zone *z)
+ * If we're in interrupt, yes, we can always allocate. If zone
+ * z's node is in our tasks mems_allowed, yes. If it's not a
+ * __GFP_HARDWALL request and this zone's node is in the nearest
+ * mem_exclusive cpuset ancestor to this tasks cpuset, yes.
+ * Otherwise, no.
+ *
+ * GFP_USER allocations are marked with the __GFP_HARDWALL bit,
+ * and do not allow allocations outside the current tasks cpuset.
+ * GFP_KERNEL allocations are not so marked, so can escape to the
+ * nearest mem_exclusive ancestor cpuset.
+ *
+ * Scanning up parent cpusets requires cpuset_sem. The __alloc_pages()
+ * routine only calls here with __GFP_HARDWALL bit _not_ set if
+ * it's a GFP_KERNEL allocation, and all nodes in the current tasks
+ * mems_allowed came up empty on the first pass over the zonelist.
+ * So only GFP_KERNEL allocations, if all nodes in the cpuset are
+ * short of memory, might require taking the cpuset_sem semaphore.
+ *
+ * The first loop over the zonelist in mm/page_alloc.c:__alloc_pages()
+ * calls here with __GFP_HARDWALL always set in gfp_mask, enforcing
+ * hardwall cpusets - no allocation on a node outside the cpuset is
+ * allowed (unless in interrupt, of course).
+ *
+ * The second loop doesn't even call here for GFP_ATOMIC requests
+ * (if the __alloc_pages() local variable 'wait' is set). That check
+ * and the checks below have the combined effect in the second loop of
+ * the __alloc_pages() routine that:
+ * in_interrupt - any node ok (current task context irrelevant)
+ * GFP_ATOMIC - any node ok
+ * GFP_KERNEL - any node in enclosing mem_exclusive cpuset ok
+ * GFP_USER - only nodes in current tasks mems allowed ok.
+ **/
+
+int cpuset_zone_allowed(struct zone *z, unsigned int __nocast gfp_mask)
{
- return in_interrupt() ||
- node_isset(z->zone_pgdat->node_id, current->mems_allowed);
+ int node; /* node that zone z is on */
+ const struct cpuset *cs; /* current cpuset ancestors */
+ int allowed = 1; /* is allocation in zone z allowed? */
+
+ if (in_interrupt())
+ return 1;
+ node = z->zone_pgdat->node_id;
+ if (node_isset(node, current->mems_allowed))
+ return 1;
+ if (gfp_mask & __GFP_HARDWALL) /* If hardwall request, stop here */
+ return 0;
+
+ /* Not hardwall and node outside mems_allowed: scan up cpusets */
+ down(&cpuset_sem);
+ cs = current->cpuset;
+ if (!cs)
+ goto done; /* current task exiting */
+ cs = nearest_exclusive_ancestor(cs);
+ allowed = node_isset(node, cs->mems_allowed);
+done:
+ up(&cpuset_sem);
+ return allowed;
}

/*
Index: linux-2.6.13-mem_exclusive_oom/mm/vmscan.c
===================================================================
--- linux-2.6.13-mem_exclusive_oom.orig/mm/vmscan.c
+++ linux-2.6.13-mem_exclusive_oom/mm/vmscan.c
@@ -890,7 +890,7 @@ shrink_caches(struct zone **zones, struc
if (zone->present_pages == 0)
continue;

- if (!cpuset_zone_allowed(zone))
+ if (!cpuset_zone_allowed(zone, __GFP_HARDWALL))
continue;

zone->temp_priority = sc->priority;
@@ -938,7 +938,7 @@ int try_to_free_pages(struct zone **zone
for (i = 0; zones[i] != NULL; i++) {
struct zone *zone = zones[i];

- if (!cpuset_zone_allowed(zone))
+ if (!cpuset_zone_allowed(zone, __GFP_HARDWALL))
continue;

zone->temp_priority = DEF_PRIORITY;
@@ -984,7 +984,7 @@ out:
for (i = 0; zones[i] != 0; i++) {
struct zone *zone = zones[i];

- if (!cpuset_zone_allowed(zone))
+ if (!cpuset_zone_allowed(zone, __GFP_HARDWALL))
continue;

zone->prev_priority = zone->temp_priority;
@@ -1254,7 +1254,7 @@ void wakeup_kswapd(struct zone *zone, in
return;
if (pgdat->kswapd_max_order < order)
pgdat->kswapd_max_order = order;
- if (!cpuset_zone_allowed(zone))
+ if (!cpuset_zone_allowed(zone, __GFP_HARDWALL))
return;
if (!waitqueue_active(&zone->zone_pgdat->kswapd_wait))
return;

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373

2005-09-01 09:09:48

by Paul Jackson

Subject: [PATCH 4/4] cpusets confine oom_killer to mem_exclusive cpuset

With the above groundwork in place, the real motivation for this cpuset
mem_exclusive patch series reduces to a trivial patch. This patch depends
on the previous cpuset_zone_allowed patch and its prerequisites.

This patch keeps a task in or under one mem_exclusive cpuset from
provoking an oom kill of a task under a non-overlapping mem_exclusive
cpuset. Since only interrupt and GFP_ATOMIC allocations are allowed
to escape mem_exclusive containment, there is little to gain from
oom killing a task under a non-overlapping mem_exclusive cpuset, as
almost all kernel and user memory allocation must come from disjoint
memory nodes.

This patch enables configuring a system so that a runaway job under
one mem_exclusive cpuset cannot cause the killing of a job in another
such cpuset that might be using very high compute and memory resources
for a prolonged time.

Signed-off-by: Paul Jackson <[email protected]>

Index: linux-2.6.13-mem_exclusive_oom/include/linux/cpuset.h
===================================================================
--- linux-2.6.13-mem_exclusive_oom.orig/include/linux/cpuset.h
+++ linux-2.6.13-mem_exclusive_oom/include/linux/cpuset.h
@@ -24,6 +24,7 @@ void cpuset_update_current_mems_allowed(
void cpuset_restrict_to_mems_allowed(unsigned long *nodes);
int cpuset_zonelist_valid_mems_allowed(struct zonelist *zl);
extern int cpuset_zone_allowed(struct zone *z, unsigned int __nocast gfp_mask);
+extern int cpuset_excl_nodes_overlap(const struct task_struct *p);
extern struct file_operations proc_cpuset_operations;
extern char *cpuset_task_status_allowed(struct task_struct *task, char *buffer);

@@ -54,6 +55,11 @@ static inline int cpuset_zone_allowed(st
return 1;
}

+static inline int cpuset_excl_nodes_overlap(const struct task_struct *p)
+{
+ return 1;
+}
+
static inline char *cpuset_task_status_allowed(struct task_struct *task,
char *buffer)
{
Index: linux-2.6.13-mem_exclusive_oom/kernel/cpuset.c
===================================================================
--- linux-2.6.13-mem_exclusive_oom.orig/kernel/cpuset.c
+++ linux-2.6.13-mem_exclusive_oom/kernel/cpuset.c
@@ -1688,6 +1688,39 @@ done:
return allowed;
}

+/**
+ * cpuset_excl_nodes_overlap - Do we overlap @p's mem_exclusive ancestors?
+ * @p: pointer to task_struct of some other task.
+ *
+ * Description: Return true if the nearest mem_exclusive ancestor
+ * cpusets of tasks @p and current overlap. Used by oom killer to
+ * determine if task @p's memory usage might impact the memory
+ * available to the current task.
+ *
+ * Acquires cpuset_sem - not suitable for calling from a fast path.
+ **/
+
+int cpuset_excl_nodes_overlap(const struct task_struct *p)
+{
+ const struct cpuset *cs1, *cs2; /* my and p's cpuset ancestors */
+ int overlap = 0; /* do cpusets overlap? */
+
+ down(&cpuset_sem);
+ cs1 = current->cpuset;
+ if (!cs1)
+ goto done; /* current task exiting */
+ cs2 = p->cpuset;
+ if (!cs2)
+ goto done; /* task p is exiting */
+ cs1 = nearest_exclusive_ancestor(cs1);
+ cs2 = nearest_exclusive_ancestor(cs2);
+ overlap = nodes_intersects(cs1->mems_allowed, cs2->mems_allowed);
+done:
+ up(&cpuset_sem);
+
+ return overlap;
+}
+
/*
* proc_cpuset_show()
* - Print tasks cpuset path into seq_file.
Index: linux-2.6.13-mem_exclusive_oom/mm/oom_kill.c
===================================================================
--- linux-2.6.13-mem_exclusive_oom.orig/mm/oom_kill.c
+++ linux-2.6.13-mem_exclusive_oom/mm/oom_kill.c
@@ -20,6 +20,7 @@
#include <linux/swap.h>
#include <linux/timex.h>
#include <linux/jiffies.h>
+#include <linux/cpuset.h>

/* #define DEBUG */

@@ -152,6 +153,10 @@ static struct task_struct * select_bad_p
continue;
if (p->oomkilladj == OOM_DISABLE)
continue;
+ /* If p's nodes don't overlap ours, it won't help to kill p. */
+ if (!cpuset_excl_nodes_overlap(p))
+ continue;
+
/*
* This is in the process of releasing memory so wait for it
* to finish before killing some other task by mistake.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373

2005-09-01 09:39:47

by Coywolf Qi Hunt

Subject: Re: [PATCH 1/4] cpusets oom_kill tweaks

On 9/1/05, Paul Jackson <[email protected]> wrote:
> This patch applies a few comment and code cleanups to mm/oom_kill.c
> prior to applying a few small patches to improve cpuset management of
> memory placement.
>
> The comment changed in oom_kill.c was seriously misleading. The code
> layout change in select_bad_process() makes room for adding another
> condition on which a process can be spared the oom killer (see the
> subsequent cpuset_nodes_overlap patch for this addition).
>
> Also a couple typos and spellos that bugged me, while I was here.
>
> This patch should have no material affect.

Why bother adding the variable `releasing'?
--
Coywolf Qi Hunt
http://sosdg.org/~coywolf/

2005-09-01 09:58:45

by Paul Jackson

Subject: Re: [PATCH 1/4] cpusets oom_kill tweaks

Coywolf wrote:
> Why bother ...

The line length in characters was getting too long, the logic was
getting too convoluted, and the comment only applied to an unobvious
portion of the line.

Providing a name for the logical condition that a complicated
expression computes is one of the ways I find useful to make
code easier to read, and to resolve problems such as those above.

My primary goal in writing code is to minimize the time and effort
it will take a typical reader to properly understand the code.
I write first and foremost for humans.
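
For what it's worth, the hunk in question reads like this before and
after the change:

	/* Before: one long condition; the comment covers only part of it. */
	if ((unlikely(test_tsk_thread_flag(p, TIF_MEMDIE)) || (p->flags & PF_EXITING)) &&
			!(p->flags & PF_DEAD))
		return ERR_PTR(-1UL);

	/* After: the interesting sub-condition gets a name. */
	releasing = test_tsk_thread_flag(p, TIF_MEMDIE) ||
			p->flags & PF_EXITING;
	if (releasing && !(p->flags & PF_DEAD))
		return ERR_PTR(-1UL);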

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2005-09-01 10:49:20

by Coywolf Qi Hunt

Subject: Re: [PATCH 1/4] cpusets oom_kill tweaks

On 9/1/05, Paul Jackson <[email protected]> wrote:
> Coywolf wrote:
> > Why bother ...
>
> The line length in characters was getting too long, the logic was

Yeah. That long line bugged me too when I was writing my lca oom-killer patch.

> getting too convoluted, and the comment only applied to an unobvious
> portion of the line.
>
> Providing a name for the logical condition that a complicated
> expression computes is one of the ways I find useful to make
> code easier to read, and to resolve problems such as those above.

Maybe.

>
> My primary goal in writing code is to minimize the time and effort
> it will take a typical reader to properly understand the code.
> I write first and foremost for humans.

Hmm, I really wish the xfs guys would follow that too.
--
Coywolf Qi Hunt
http://sosdg.org/~coywolf/

2005-09-06 08:09:27

by Paul Jackson

Subject: Re: [PATCH 0/4] cpusets mems_allowed constrain GFP_KERNEL, oom killer

Andrew,

Please throw away the following 4 patches in 2.6.13-mm1:

cpusets-oom_kill-tweaks.patch
cpusets-new-__gfp_hardwall-flag.patch
cpusets-formalize-intermediate-gfp_kernel-containment.patch
cpusets-confine-oom_killer-to-mem_exclusive-cpuset.patch

You will see almost the same patches come back at you, in another
week, after I first send some patches to rework handling the global
cpuset semaphore cpuset_sem.

My code reading leads me to think there is a rare lockup possibility
here, where a task already holding cpuset_sem could try to get it
again in the new cpuset_zone_allowed() code.

Only systems actively manipulating cpusets have any chance of seeing
this.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2005-09-06 22:47:52

by Paul Jackson

Subject: Re: [PATCH 0/4] cpusets mems_allowed constrain GFP_KERNEL, oom killer

Andrew,

Yesterday, I wrote:
> Please throw away the following 4 patches in 2.6.13-mm1:
>
> cpusets-oom_kill-tweaks.patch
> cpusets-new-__gfp_hardwall-flag.patch
> cpusets-formalize-intermediate-gfp_kernel-containment.patch
> cpusets-confine-oom_killer-to-mem_exclusive-cpuset.patch


Looks like you sent these 4 patches on to Linus before you saw
the above.

That's fine, no problem. I will work on top of these patches
now that they are in Linus's tree.

Just forget I asked you to throw them away.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401