2009-01-07 18:41:28

by Balbir Singh

[permalink] [raw]
Subject: [RFC][PATCH 0/4] Memory controller soft limit patches


Here is v1 of the new soft limit implementation. Soft limits are a new feature
for the memory resource controller; something similar has existed in the
group scheduler in the form of shares. We'll compare shares and soft limits
below. I've had soft limit implementations earlier, but I've discarded those
approaches in favour of this one.

Soft limits are the most useful feature to have for environments where
the administrator wants to overcommit the system, such that only on memory
contention do the limits become active. The current soft limits implementation
provides a soft_limit_in_bytes interface for the memory controller and not
for the memory+swap controller. The implementation maintains an RB-Tree of groups
that exceed their soft limit and starts reclaiming from the group that
exceeds this limit by the maximum amount.

This is an RFC implementation and is not meant for inclusion.

TODOs

1. The shares interface is not yet implemented and the current soft limit
implementation is not yet hierarchy aware. The end goal is to add
a shares interface on top of soft limits and to maintain shares in
a manner similar to the group scheduler.
2. The current implementation maintains the delta from the soft limit
and pushes groups back to their soft limits; a ratio of delta/soft_limit
would be more useful.
3. It would be nice to have a more targeted reclaim interface (in terms of
pages to reclaim), so that groups are pushed back close to their soft
limits.

Tests
-----

I've run two memory-intensive workloads with differing soft limits and
seen that they are pushed back to their soft limits on contention. Their usage
settled at their soft limit plus whatever additional memory they were able to grab
on the system.

Please review, comment.

Series
------

memcg-soft-limit-documentation.patch
memcg-add-soft-limit-interface.patch
memcg-organize-over-soft-limit-groups.patch
memcg-soft-limit-reclaim-on-contention.patch

--
Balbir


2009-01-07 18:41:44

by Balbir Singh

[permalink] [raw]
Subject: [RFC][PATCH 1/4] Memory controller soft limit documentation

From: Balbir Singh <[email protected]>

Add documentation for soft limit feature support.

Signed-off-by: Balbir Singh <[email protected]>
---

Documentation/controllers/memory.txt | 28 ++++++++++++++++++++++++-
1 file changed, 27 insertions(+), 1 deletion(-)

diff -puN Documentation/controllers/memory.txt~memcg-soft-limit-documentation Documentation/controllers/memory.txt
--- a/Documentation/controllers/memory.txt~memcg-soft-limit-documentation
+++ a/Documentation/controllers/memory.txt
@@ -360,7 +360,33 @@ cgroups created below it.

NOTE2: This feature can be enabled/disabled per subtree.

-7. TODO
+7. Soft limits
+
+Soft limits allow for greater sharing of memory. The idea behind soft limits
+is to allow control groups to use as much memory as needed, provided
+
+a. There is no memory contention
+b. They do not exceed their hard limit
+
+When the system detects memory contention (through do_try_to_free_pages(),
+while allocating), control groups are pushed back to their soft limits if
+possible. If the soft limit of each control group is very high, they are
+pushed back as much as possible to make sure that one control group does not
+starve the others.
+
+7.1 Interface
+
+Soft limits can be set up by using the following commands (in this example we
+assume a soft limit of 256 megabytes)
+
+# echo 256M > memory.soft_limit_in_bytes
+
+If we want to change this to 1G, we can at any time use
+
+# echo 1G > memory.soft_limit_in_bytes
+
+
+8. TODO

1. Add support for accounting huge pages (as a separate controller)
2. Make per-cgroup scanner reclaim not-shared pages first
_

--
Balbir

2009-01-07 18:42:05

by Balbir Singh

[permalink] [raw]
Subject: [RFC][PATCH 2/4] Memory controller soft limit interface

From: Balbir Singh <[email protected]>

Add an interface to allow getting and setting of soft limits. Soft limits for the
memory plus swap controller (memsw) are currently not supported. Resource counters
have been enhanced to support soft limits and a new type, RES_SOFT_LIMIT, has been
added. Unlike hard limits, soft limits can be set directly and do not need any
reclaim or checks before being set to a new value.

Signed-off-by: Balbir Singh <[email protected]>
---

include/linux/res_counter.h | 39 ++++++++++++++++++++++++++++++++++
kernel/res_counter.c | 3 ++
mm/memcontrol.c | 20 +++++++++++++++++
3 files changed, 62 insertions(+)

diff -puN mm/memcontrol.c~memcg-add-soft-limit-interface mm/memcontrol.c
--- a/mm/memcontrol.c~memcg-add-soft-limit-interface
+++ a/mm/memcontrol.c
@@ -1811,6 +1811,20 @@ static int mem_cgroup_write(struct cgrou
else
ret = mem_cgroup_resize_memsw_limit(memcg, val);
break;
+ case RES_SOFT_LIMIT:
+ ret = res_counter_memparse_write_strategy(buffer, &val);
+ if (ret)
+ break;
+ /*
+ * For memsw, soft limits are hard to implement in terms
+ * of semantics. For now, we support soft limits for
+ * control without swap
+ */
+ if (type == _MEM)
+ ret = res_counter_set_soft_limit(&memcg->res, val);
+ else
+ ret = -EINVAL;
+ break;
default:
ret = -EINVAL; /* should be BUG() ? */
break;
@@ -2010,6 +2024,12 @@ static struct cftype mem_cgroup_files[]
.read_u64 = mem_cgroup_read,
},
{
+ .name = "soft_limit_in_bytes",
+ .private = MEMFILE_PRIVATE(_MEM, RES_SOFT_LIMIT),
+ .write_string = mem_cgroup_write,
+ .read_u64 = mem_cgroup_read,
+ },
+ {
.name = "failcnt",
.private = MEMFILE_PRIVATE(_MEM, RES_FAILCNT),
.trigger = mem_cgroup_reset,
diff -puN include/linux/res_counter.h~memcg-add-soft-limit-interface include/linux/res_counter.h
--- a/include/linux/res_counter.h~memcg-add-soft-limit-interface
+++ a/include/linux/res_counter.h
@@ -35,6 +35,10 @@ struct res_counter {
*/
unsigned long long limit;
/*
+ * the limit that usage can exceed
+ */
+ unsigned long long soft_limit;
+ /*
* the number of unsuccessful attempts to consume the resource
*/
unsigned long long failcnt;
@@ -85,6 +89,7 @@ enum {
RES_MAX_USAGE,
RES_LIMIT,
RES_FAILCNT,
+ RES_SOFT_LIMIT,
};

/*
@@ -130,6 +135,28 @@ static inline bool res_counter_limit_che
return false;
}

+/**
+ * Get the difference between the usage and the soft limit
+ * @cnt: The counter
+ *
+ * Returns 0 if usage is less than or equal to the soft limit;
+ * otherwise, the difference between usage and the soft limit.
+ */
+static inline unsigned long long
+res_counter_soft_limit_excess(struct res_counter *cnt)
+{
+ unsigned long long excess;
+ unsigned long flags;
+
+ spin_lock_irqsave(&cnt->lock, flags);
+ if (cnt->usage <= cnt->soft_limit)
+ excess = 0;
+ else
+ excess = cnt->usage - cnt->soft_limit;
+ spin_unlock_irqrestore(&cnt->lock, flags);
+ return excess;
+}
+
/*
* Helper function to detect if the cgroup is within it's limit or
* not. It's currently called from cgroup_rss_prepare()
@@ -178,4 +205,16 @@ static inline int res_counter_set_limit(
return ret;
}

+static inline int
+res_counter_set_soft_limit(struct res_counter *cnt,
+ unsigned long long soft_limit)
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(&cnt->lock, flags);
+ cnt->soft_limit = soft_limit;
+ spin_unlock_irqrestore(&cnt->lock, flags);
+ return 0;
+}
+
#endif
diff -puN kernel/res_counter.c~memcg-add-soft-limit-interface kernel/res_counter.c
--- a/kernel/res_counter.c~memcg-add-soft-limit-interface
+++ a/kernel/res_counter.c
@@ -19,6 +19,7 @@ void res_counter_init(struct res_counter
{
spin_lock_init(&counter->lock);
counter->limit = (unsigned long long)LLONG_MAX;
+ counter->soft_limit = (unsigned long long)LLONG_MAX;
counter->parent = parent;
}

@@ -101,6 +102,8 @@ res_counter_member(struct res_counter *c
return &counter->limit;
case RES_FAILCNT:
return &counter->failcnt;
+ case RES_SOFT_LIMIT:
+ return &counter->soft_limit;
};

BUG();
_

--
Balbir

2009-01-07 18:42:26

by Balbir Singh

[permalink] [raw]
Subject: [RFC][PATCH 3/4] Memory controller soft limit organize cgroups

From: Balbir Singh <[email protected]>

This patch introduces an RB-Tree for storing memory cgroups that are over their
soft limit. The overall goal is to

1. Add a memory cgroup to the RB-Tree when the soft limit is exceeded.
We are careful about updates; updates take place only after a particular
time interval has passed.
2. Remove the node from the RB-Tree when the usage goes below the soft
limit.

The next set of patches will exploit the RB-Tree to get the group that is
over its soft limit by the largest amount and reclaim from it, when we
face memory contention.

Signed-off-by: Balbir Singh <[email protected]>
---

mm/memcontrol.c | 78 ++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 78 insertions(+)

diff -puN mm/memcontrol.c~memcg-organize-over-soft-limit-groups mm/memcontrol.c
--- a/mm/memcontrol.c~memcg-organize-over-soft-limit-groups
+++ a/mm/memcontrol.c
@@ -28,6 +28,7 @@
#include <linux/bit_spinlock.h>
#include <linux/rcupdate.h>
#include <linux/mutex.h>
+#include <linux/rbtree.h>
#include <linux/slab.h>
#include <linux/swap.h>
#include <linux/spinlock.h>
@@ -119,6 +120,13 @@ struct mem_cgroup_lru_info {
};

/*
+ * Cgroups above their limits are maintained in an RB-Tree, independent of
+ * their hierarchy representation
+ */
+static struct rb_root mem_cgroup_soft_limit_exceeded_groups;
+static DEFINE_MUTEX(memcg_soft_limit_tree_mutex);
+
+/*
* The memory controller data structure. The memory controller controls both
* page cache and RSS per cgroup. We would eventually like to provide
* statistics based on the statistics developed by Rik Van Riel for clock-pro,
@@ -166,12 +174,18 @@ struct mem_cgroup {

unsigned int swappiness;

+ struct rb_node mem_cgroup_node;
+ unsigned long long usage_in_excess;
+ unsigned long last_tree_update;
+
/*
* statistics. This must be placed at the end of memcg.
*/
struct mem_cgroup_stat stat;
};

+#define MEM_CGROUP_TREE_UPDATE_INTERVAL (HZ)
+
enum charge_type {
MEM_CGROUP_CHARGE_TYPE_CACHE = 0,
MEM_CGROUP_CHARGE_TYPE_MAPPED,
@@ -203,6 +217,39 @@ pcg_default_flags[NR_CHARGE_TYPE] = {
static void mem_cgroup_get(struct mem_cgroup *mem);
static void mem_cgroup_put(struct mem_cgroup *mem);

+static void mem_cgroup_insert_exceeded(struct mem_cgroup *mem)
+{
+ struct rb_node **p = &mem_cgroup_soft_limit_exceeded_groups.rb_node;
+ struct rb_node *parent = NULL;
+ struct mem_cgroup *mem_node;
+
+ mutex_lock(&memcg_soft_limit_tree_mutex);
+ while (*p) {
+ parent = *p;
+ mem_node = rb_entry(parent, struct mem_cgroup, mem_cgroup_node);
+ if (mem->usage_in_excess < mem_node->usage_in_excess)
+ p = &(*p)->rb_left;
+ /*
+ * We can't avoid mem cgroups that are over their soft
+ * limit by the same amount
+ */
+ else if (mem->usage_in_excess >= mem_node->usage_in_excess)
+ p = &(*p)->rb_right;
+ }
+ rb_link_node(&mem->mem_cgroup_node, parent, p);
+ rb_insert_color(&mem->mem_cgroup_node,
+ &mem_cgroup_soft_limit_exceeded_groups);
+ mem->last_tree_update = jiffies;
+ mutex_unlock(&memcg_soft_limit_tree_mutex);
+}
+
+static void mem_cgroup_remove_exceeded(struct mem_cgroup *mem)
+{
+ mutex_lock(&memcg_soft_limit_tree_mutex);
+ rb_erase(&mem->mem_cgroup_node, &mem_cgroup_soft_limit_exceeded_groups);
+ mutex_unlock(&memcg_soft_limit_tree_mutex);
+}
+
static void mem_cgroup_charge_statistics(struct mem_cgroup *mem,
struct page_cgroup *pc,
bool charge)
@@ -917,6 +964,10 @@ static void __mem_cgroup_commit_charge(s
struct page_cgroup *pc,
enum charge_type ctype)
{
+ unsigned long long prev_usage_in_excess, new_usage_in_excess;
+ bool updated_tree = false;
+ unsigned long next_update;
+
/* try_charge() can return NULL to *memcg, taking care of it. */
if (!mem)
return;
@@ -937,6 +988,30 @@ static void __mem_cgroup_commit_charge(s
mem_cgroup_charge_statistics(mem, pc, true);

unlock_page_cgroup(pc);
+
+ mem_cgroup_get(mem);
+ prev_usage_in_excess = mem->usage_in_excess;
+ new_usage_in_excess = res_counter_soft_limit_excess(&mem->res);
+
+ next_update = mem->last_tree_update + MEM_CGROUP_TREE_UPDATE_INTERVAL;
+ if (new_usage_in_excess && time_after(jiffies, next_update)) {
+ if (prev_usage_in_excess)
+ mem_cgroup_remove_exceeded(mem);
+ mem_cgroup_insert_exceeded(mem);
+ updated_tree = true;
+ } else if (prev_usage_in_excess && !new_usage_in_excess) {
+ mem_cgroup_remove_exceeded(mem);
+ updated_tree = true;
+ }
+
+ if (updated_tree) {
+ mutex_lock(&memcg_soft_limit_tree_mutex);
+ mem->last_tree_update = jiffies;
+ mem->usage_in_excess = new_usage_in_excess;
+ mutex_unlock(&memcg_soft_limit_tree_mutex);
+ }
+ mem_cgroup_put(mem);
+
}

/**
@@ -2218,6 +2293,7 @@ mem_cgroup_create(struct cgroup_subsys *
if (cont->parent == NULL) {
enable_swap_cgroup();
parent = NULL;
+ mem_cgroup_soft_limit_exceeded_groups = RB_ROOT;
} else {
parent = mem_cgroup_from_cont(cont->parent);
mem->use_hierarchy = parent->use_hierarchy;
@@ -2231,6 +2307,8 @@ mem_cgroup_create(struct cgroup_subsys *
res_counter_init(&mem->memsw, NULL);
}
mem->last_scanned_child = NULL;
+ mem->usage_in_excess = 0;
+ mem->last_tree_update = 0; /* Yes, time begins at 0 here */
spin_lock_init(&mem->reclaim_param_lock);

if (parent)
_

--
Balbir

2009-01-07 18:42:46

by Balbir Singh

[permalink] [raw]
Subject: [RFC][PATCH 4/4] Memory controller soft limit reclaim on contention

From: Balbir Singh <[email protected]>

This patch allows reclaim from memory cgroups on contention (via the
__alloc_pages_internal() path). If an order greater than 0 is specified, we
fall back on try_to_free_pages() anyway.

Memory cgroup soft limit reclaim finds the group that exceeds its soft limit
by the largest amount, reclaims pages from it, and then reinserts the
cgroup into its correct place in the rbtree.

Signed-off-by: Balbir Singh <[email protected]>
---

include/linux/memcontrol.h | 1
mm/memcontrol.c | 82 +++++++++++++++++++++++++++++++++--
mm/page_alloc.c | 10 +++-
3 files changed, 88 insertions(+), 5 deletions(-)

diff -puN mm/memcontrol.c~memcg-soft-limit-reclaim-on-contention mm/memcontrol.c
--- a/mm/memcontrol.c~memcg-soft-limit-reclaim-on-contention
+++ a/mm/memcontrol.c
@@ -177,6 +177,7 @@ struct mem_cgroup {
struct rb_node mem_cgroup_node;
unsigned long long usage_in_excess;
unsigned long last_tree_update;
+ bool on_tree;

/*
* statistics. This must be placed at the end of memcg.
@@ -184,7 +185,7 @@ struct mem_cgroup {
struct mem_cgroup_stat stat;
};

-#define MEM_CGROUP_TREE_UPDATE_INTERVAL (HZ)
+#define MEM_CGROUP_TREE_UPDATE_INTERVAL (HZ/4)

enum charge_type {
MEM_CGROUP_CHARGE_TYPE_CACHE = 0,
@@ -217,13 +218,15 @@ pcg_default_flags[NR_CHARGE_TYPE] = {
static void mem_cgroup_get(struct mem_cgroup *mem);
static void mem_cgroup_put(struct mem_cgroup *mem);

-static void mem_cgroup_insert_exceeded(struct mem_cgroup *mem)
+static void __mem_cgroup_insert_exceeded(struct mem_cgroup *mem)
{
struct rb_node **p = &mem_cgroup_soft_limit_exceeded_groups.rb_node;
struct rb_node *parent = NULL;
struct mem_cgroup *mem_node;

- mutex_lock(&memcg_soft_limit_tree_mutex);
+ if (mem->on_tree)
+ return;
+
while (*p) {
parent = *p;
mem_node = rb_entry(parent, struct mem_cgroup, mem_cgroup_node);
@@ -240,16 +243,54 @@ static void mem_cgroup_insert_exceeded(s
rb_insert_color(&mem->mem_cgroup_node,
&mem_cgroup_soft_limit_exceeded_groups);
mem->last_tree_update = jiffies;
+ mem->on_tree = true;
+}
+
+static void __mem_cgroup_remove_exceeded(struct mem_cgroup *mem)
+{
+ if (!mem->on_tree)
+ return;
+ rb_erase(&mem->mem_cgroup_node, &mem_cgroup_soft_limit_exceeded_groups);
+ mem->on_tree = false;
+}
+
+static void mem_cgroup_insert_exceeded(struct mem_cgroup *mem)
+{
+ mutex_lock(&memcg_soft_limit_tree_mutex);
+ __mem_cgroup_insert_exceeded(mem);
mutex_unlock(&memcg_soft_limit_tree_mutex);
}

static void mem_cgroup_remove_exceeded(struct mem_cgroup *mem)
{
mutex_lock(&memcg_soft_limit_tree_mutex);
- rb_erase(&mem->mem_cgroup_node, &mem_cgroup_soft_limit_exceeded_groups);
+ __mem_cgroup_remove_exceeded(mem);
mutex_unlock(&memcg_soft_limit_tree_mutex);
}

+static struct mem_cgroup *mem_cgroup_get_largest_soft_limit_exceeding_node(void)
+{
+ struct rb_node *rightmost = NULL;
+ struct mem_cgroup *mem = NULL;
+
+ mutex_lock(&memcg_soft_limit_tree_mutex);
+ rightmost = rb_last(&mem_cgroup_soft_limit_exceeded_groups);
+ if (!rightmost)
+ goto done; /* Nothing to reclaim from */
+
+ mem = rb_entry(rightmost, struct mem_cgroup, mem_cgroup_node);
+ mem_cgroup_get(mem);
+ /*
+ * Remove the node now but someone else can add it back;
+ * we will add it back at the end of reclaim to its correct
+ * position in the tree.
+ */
+ __mem_cgroup_remove_exceeded(mem);
+done:
+ mutex_unlock(&memcg_soft_limit_tree_mutex);
+ return mem;
+}
+
static void mem_cgroup_charge_statistics(struct mem_cgroup *mem,
struct page_cgroup *pc,
bool charge)
@@ -1795,6 +1836,37 @@ try_to_free:
goto out;
}

+unsigned long mem_cgroup_soft_limit_reclaim(gfp_t gfp_mask)
+{
+ unsigned long nr_reclaimed = 0;
+ struct mem_cgroup *mem;
+
+ do {
+ mem = mem_cgroup_get_largest_soft_limit_exceeding_node();
+ if (!mem)
+ break;
+ if (mem_cgroup_is_obsolete(mem)) {
+ mem_cgroup_put(mem);
+ continue;
+ }
+ nr_reclaimed +=
+ try_to_free_mem_cgroup_pages(mem, gfp_mask, false,
+ get_swappiness(mem));
+ mutex_lock(&memcg_soft_limit_tree_mutex);
+ mem->usage_in_excess = res_counter_soft_limit_excess(&mem->res);
+ /*
+ * We need to remove and reinsert the node in its correct
+ * position
+ */
+ __mem_cgroup_remove_exceeded(mem);
+ if (mem->usage_in_excess)
+ __mem_cgroup_insert_exceeded(mem);
+ mutex_unlock(&memcg_soft_limit_tree_mutex);
+ mem_cgroup_put(mem);
+ } while (!nr_reclaimed);
+ return nr_reclaimed;
+}
+
int mem_cgroup_force_empty_write(struct cgroup *cont, unsigned int event)
{
return mem_cgroup_force_empty(mem_cgroup_from_cont(cont), true);
@@ -2309,6 +2381,8 @@ mem_cgroup_create(struct cgroup_subsys *
mem->last_scanned_child = NULL;
mem->usage_in_excess = 0;
mem->last_tree_update = 0; /* Yes, time begins at 0 here */
+ mem->on_tree = false;
+
spin_lock_init(&mem->reclaim_param_lock);

if (parent)
diff -puN mm/vmscan.c~memcg-soft-limit-reclaim-on-contention mm/vmscan.c
diff -puN include/linux/memcontrol.h~memcg-soft-limit-reclaim-on-contention include/linux/memcontrol.h
--- a/include/linux/memcontrol.h~memcg-soft-limit-reclaim-on-contention
+++ a/include/linux/memcontrol.h
@@ -117,6 +117,7 @@ static inline bool mem_cgroup_disabled(v
}

extern bool mem_cgroup_oom_called(struct task_struct *task);
+extern unsigned long mem_cgroup_soft_limit_reclaim(gfp_t gfp_mask);

#else /* CONFIG_CGROUP_MEM_RES_CTLR */
struct mem_cgroup;
diff -puN mm/page_alloc.c~memcg-soft-limit-reclaim-on-contention mm/page_alloc.c
--- a/mm/page_alloc.c~memcg-soft-limit-reclaim-on-contention
+++ a/mm/page_alloc.c
@@ -1582,7 +1582,15 @@ nofail_alloc:
reclaim_state.reclaimed_slab = 0;
p->reclaim_state = &reclaim_state;

- did_some_progress = try_to_free_pages(zonelist, order, gfp_mask);
+ did_some_progress = mem_cgroup_soft_limit_reclaim(gfp_mask);
+ /*
+ * If we made no progress or need higher-order allocations,
+ * try_to_free_pages() is still our best bet, since mem_cgroup
+ * reclaim does not handle freeing pages greater than order 0
+ */
+ if (!did_some_progress || order)
+ did_some_progress = try_to_free_pages(zonelist, order,
+ gfp_mask);

p->reclaim_state = NULL;
p->flags &= ~PF_MEMALLOC;
_

--
Balbir

2009-01-07 19:00:47

by Dhaval Giani

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/4] Memory controller soft limit patches

On Thu, Jan 08, 2009 at 12:11:10AM +0530, Balbir Singh wrote:
>
> Here is v1 of the new soft limit implementation. Soft limits is a new feature
> for the memory resource controller, something similar has existed in the
> group scheduler in the form of shares. We'll compare shares and soft limits
> below. I've had soft limit implementations earlier, but I've discarded those
> approaches in favour of this one.
>
> Soft limits are the most useful feature to have for environments where
> the administrator wants to overcommit the system, such that only on memory
> contention do the limits become active. The current soft limits implementation
> provides a soft_limit_in_bytes interface for the memory controller and not
> for memory+swap controller. The implementation maintains an RB-Tree of groups
> that exceed their soft limit and starts reclaiming from the group that
> exceeds this limit by the maximum amount.
>
> This is an RFC implementation and is not meant for inclusion
>
> TODOs
>
> 1. The shares interface is not yet implemented, the current soft limit
> implementation is not yet hierarchy aware. The end goal is to add
> a shares interface on top of soft limits and to maintain shares in
> a manner similar to the group scheduler

Just to clarify, when there is no contention, you want to share memory
proportionally?

thanks,
--
regards,
Dhaval

2009-01-08 00:31:56

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/4] Memory controller soft limit patches

On Thu, 08 Jan 2009 00:11:10 +0530
Balbir Singh <[email protected]> wrote:

>
> Here is v1 of the new soft limit implementation. Soft limits is a new feature
> for the memory resource controller, something similar has existed in the
> group scheduler in the form of shares. We'll compare shares and soft limits
> below. I've had soft limit implementations earlier, but I've discarded those
> approaches in favour of this one.
>
> Soft limits are the most useful feature to have for environments where
> the administrator wants to overcommit the system, such that only on memory
> contention do the limits become active. The current soft limits implementation
> provides a soft_limit_in_bytes interface for the memory controller and not
> for memory+swap controller. The implementation maintains an RB-Tree of groups
> that exceed their soft limit and starts reclaiming from the group that
> exceeds this limit by the maximum amount.
>
> This is an RFC implementation and is not meant for inclusion
>
The core implementation seems simple and the feature sounds good.
But before reviewing the details, 3 points.

1. Please fix the current bugs in hierarchy management before adding a new feature.
AFAIK, OOM-kill under hierarchy is broken. (I have patches but am waiting for
the merge window to close.)
I wonder whether there will be some others. Are the lockdep errors which
Nishimura reported all fixed now?

2. You insert reclaim-by-soft-limit into alloc_pages(). But to do this,
you have to pass the zonelist to try_to_free_mem_cgroup_pages() and have to modify
try_to_free_mem_cgroup_pages().
2-a) If not, when the memory request is for gfp_mask==GFP_DMA or the allocation
is under a cpuset, memory reclaim will not work correctly.
2-b) try_to_free_mem_cgroup_pages() cannot do a good job for order > 1 allocations.

Please try fake-numa (or a real NUMA machine) and cpuset.

3. If you want to insert hooks into the "generic" page allocator, it's better to add CC to
Rik van Riel and Kosaki Motohiro, at least.

To be honest, I myself don't like adding a hook to alloc_pages() directly.
Can we implement the soft-limit call like kswapd (or in kswapd())?
i.e. in a more moderate way?
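
For illustration, a rough sketch of such a kswapd-style approach (the thread,
wait queue and pressure flag below are made-up names; only
mem_cgroup_soft_limit_reclaim() comes from this series, and <linux/kthread.h>
and <linux/wait.h> are assumed):

==
static atomic_t memcg_softlimit_pressure = ATOMIC_INIT(0);
static DECLARE_WAIT_QUEUE_HEAD(memcg_softlimitd_wait);

/* background thread, started once via kthread_run(memcg_softlimitd, ...) */
static int memcg_softlimitd(void *unused)
{
	while (!kthread_should_stop()) {
		wait_event_interruptible(memcg_softlimitd_wait,
				kthread_should_stop() ||
				atomic_read(&memcg_softlimit_pressure));
		if (kthread_should_stop())
			break;
		atomic_set(&memcg_softlimit_pressure, 0);
		/* push groups back towards their soft limits */
		mem_cgroup_soft_limit_reclaim(GFP_KERNEL);
	}
	return 0;
}

/* the page allocator slow path would only need to kick the thread */
static void wake_memcg_softlimitd(void)
{
	atomic_set(&memcg_softlimit_pressure, 1);
	wake_up_interruptible(&memcg_softlimitd_wait);
}
==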

A happy new year,

-Kame



> TODOs
>
> 1. The shares interface is not yet implemented, the current soft limit
> implementation is not yet hierarchy aware. The end goal is to add
> a shares interface on top of soft limits and to maintain shares in
> a manner similar to the group scheduler
> 2. The current implementation maintains the delta from the soft limit
> and pushes back groups to their soft limits, a ratio of delta/soft_limit
> is more useful
> 3. It would be nice to have a more targeted reclaim interface (in terms of
> pages to reclaim), so that groups are pushed back close to their soft
> limits.
>
> Tests
> -----
>
> I've run two memory intensive workloads with differing soft limits and
> seen that they are pushed back to their soft limit on contention. Their usage
> was their soft limit plus additional memory that they were able to grab
> on the system.
>
> Please review, comment.
>
> Series
> ------
>
> memcg-soft-limit-documentation.patch
> memcg-add-soft-limit-interface.patch
> memcg-organize-over-soft-limit-groups.patch
> memcg-soft-limit-reclaim-on-contention.patch
>
> --
> Balbir
>

2009-01-08 00:38:19

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/4] Memory controller soft limit patches

On Thu, 8 Jan 2009 00:26:27 +0530
Dhaval Giani <[email protected]> wrote:

> On Thu, Jan 08, 2009 at 12:11:10AM +0530, Balbir Singh wrote:
> >
> > Here is v1 of the new soft limit implementation. Soft limits is a new feature
> > for the memory resource controller, something similar has existed in the
> > group scheduler in the form of shares. We'll compare shares and soft limits
> > below. I've had soft limit implementations earlier, but I've discarded those
> > approaches in favour of this one.
> >
> > Soft limits are the most useful feature to have for environments where
> > the administrator wants to overcommit the system, such that only on memory
> > contention do the limits become active. The current soft limits implementation
> > provides a soft_limit_in_bytes interface for the memory controller and not
> > for memory+swap controller. The implementation maintains an RB-Tree of groups
> > that exceed their soft limit and starts reclaiming from the group that
> > exceeds this limit by the maximum amount.
> >
> > This is an RFC implementation and is not meant for inclusion
> >
> > TODOs
> >
> > 1. The shares interface is not yet implemented, the current soft limit
> > implementation is not yet hierarchy aware. The end goal is to add
> > a shares interface on top of soft limits and to maintain shares in
> > a manner similar to the group scheduler
>
> Just to clarify, when there is no contention, you want to share memory
> proportionally?
>
I don't like adding "share" as the kernel interface of memcg.
We used "bytes" for the (hard) limit. Please just use "bytes".

Thanks,
-Kame

2009-01-08 01:13:14

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [RFC][PATCH 3/4] Memory controller soft limit organize cgroups

On Thu, 08 Jan 2009 00:11:28 +0530
Balbir Singh <[email protected]> wrote:

> From: Balbir Singh <[email protected]>
>
> This patch introduces a RB-Tree for storing memory cgroups that are over their
> soft limit. The overall goal is to
>
> 1. Add a memory cgroup to the RB-Tree when the soft limit is exceeded.
> We are careful about updates, updates take place only after a particular
> time interval has passed
> 2. We remove the node from the RB-Tree when the usage goes below the soft
> limit
>
> The next set of patches will exploit the RB-Tree to get the group that is
> over its soft limit by the largest amount and reclaim from it, when we
> face memory contention.
>

Hmm, could you clarify the following?

- The usage of memory at insertion and the usage of memory at reclaim are different.
  So, this *sorted* order in the RB-tree isn't the best order in general.
  Why don't you sort this at memory-reclaim time, dynamically?
- Considering the above, the RB-tree could end up looking like

           +30M   (the amount over the soft limit is 30M)
          /    \
      -15M      +60M
  ?

At least, please remove the node at uncharge() when the usage goes down.

Thanks,
-Kame




> Signed-off-by: Balbir Singh <[email protected]>
> ---
>
> mm/memcontrol.c | 78 ++++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 78 insertions(+)
>
> diff -puN mm/memcontrol.c~memcg-organize-over-soft-limit-groups mm/memcontrol.c
> --- a/mm/memcontrol.c~memcg-organize-over-soft-limit-groups
> +++ a/mm/memcontrol.c
> @@ -28,6 +28,7 @@
> #include <linux/bit_spinlock.h>
> #include <linux/rcupdate.h>
> #include <linux/mutex.h>
> +#include <linux/rbtree.h>
> #include <linux/slab.h>
> #include <linux/swap.h>
> #include <linux/spinlock.h>
> @@ -119,6 +120,13 @@ struct mem_cgroup_lru_info {
> };
>
> /*
> + * Cgroups above their limits are maintained in a RB-Tree, independent of
> + * their hierarchy representation
> + */
> +static struct rb_root mem_cgroup_soft_limit_exceeded_groups;
> +static DEFINE_MUTEX(memcg_soft_limit_tree_mutex);
> +
> +/*
> * The memory controller data structure. The memory controller controls both
> * page cache and RSS per cgroup. We would eventually like to provide
> * statistics based on the statistics developed by Rik Van Riel for clock-pro,
> @@ -166,12 +174,18 @@ struct mem_cgroup {
>
> unsigned int swappiness;
>
> + struct rb_node mem_cgroup_node;
> + unsigned long long usage_in_excess;
> + unsigned long last_tree_update;
> +
> /*
> * statistics. This must be placed at the end of memcg.
> */
> struct mem_cgroup_stat stat;
> };
>
> +#define MEM_CGROUP_TREE_UPDATE_INTERVAL (HZ)
> +
> enum charge_type {
> MEM_CGROUP_CHARGE_TYPE_CACHE = 0,
> MEM_CGROUP_CHARGE_TYPE_MAPPED,
> @@ -203,6 +217,39 @@ pcg_default_flags[NR_CHARGE_TYPE] = {
> static void mem_cgroup_get(struct mem_cgroup *mem);
> static void mem_cgroup_put(struct mem_cgroup *mem);
>
> +static void mem_cgroup_insert_exceeded(struct mem_cgroup *mem)
> +{
> + struct rb_node **p = &mem_cgroup_soft_limit_exceeded_groups.rb_node;
> + struct rb_node *parent = NULL;
> + struct mem_cgroup *mem_node;
> +
> + mutex_lock(&memcg_soft_limit_tree_mutex);
> + while (*p) {
> + parent = *p;
> + mem_node = rb_entry(parent, struct mem_cgroup, mem_cgroup_node);
> + if (mem->usage_in_excess < mem_node->usage_in_excess)
> + p = &(*p)->rb_left;
> + /*
> + * We can't avoid mem cgroups that are over their soft
> + * limit by the same amount
> + */
> + else if (mem->usage_in_excess >= mem_node->usage_in_excess)
> + p = &(*p)->rb_right;
> + }
> + rb_link_node(&mem->mem_cgroup_node, parent, p);
> + rb_insert_color(&mem->mem_cgroup_node,
> + &mem_cgroup_soft_limit_exceeded_groups);
> + mem->last_tree_update = jiffies;
> + mutex_unlock(&memcg_soft_limit_tree_mutex);
> +}
> +
> +static void mem_cgroup_remove_exceeded(struct mem_cgroup *mem)
> +{
> + mutex_lock(&memcg_soft_limit_tree_mutex);
> + rb_erase(&mem->mem_cgroup_node, &mem_cgroup_soft_limit_exceeded_groups);
> + mutex_unlock(&memcg_soft_limit_tree_mutex);
> +}
> +
> static void mem_cgroup_charge_statistics(struct mem_cgroup *mem,
> struct page_cgroup *pc,
> bool charge)
> @@ -917,6 +964,10 @@ static void __mem_cgroup_commit_charge(s
> struct page_cgroup *pc,
> enum charge_type ctype)
> {
> + unsigned long long prev_usage_in_excess, new_usage_in_excess;
> + bool updated_tree = false;
> + unsigned long next_update;
> +
> /* try_charge() can return NULL to *memcg, taking care of it. */
> if (!mem)
> return;
> @@ -937,6 +988,30 @@ static void __mem_cgroup_commit_charge(s
> mem_cgroup_charge_statistics(mem, pc, true);
>
> unlock_page_cgroup(pc);
> +
> + mem_cgroup_get(mem);
> + prev_usage_in_excess = mem->usage_in_excess;
> + new_usage_in_excess = res_counter_soft_limit_excess(&mem->res);
> +
> + next_update = mem->last_tree_update + MEM_CGROUP_TREE_UPDATE_INTERVAL;
> + if (new_usage_in_excess && time_after(jiffies, next_update)) {
> + if (prev_usage_in_excess)
> + mem_cgroup_remove_exceeded(mem);
> + mem_cgroup_insert_exceeded(mem);
> + updated_tree = true;
> + } else if (prev_usage_in_excess && !new_usage_in_excess) {
> + mem_cgroup_remove_exceeded(mem);
> + updated_tree = true;
> + }
> +
> + if (updated_tree) {
> + mutex_lock(&memcg_soft_limit_tree_mutex);
> + mem->last_tree_update = jiffies;
> + mem->usage_in_excess = new_usage_in_excess;
> + mutex_unlock(&memcg_soft_limit_tree_mutex);
> + }
> + mem_cgroup_put(mem);
> +
> }
>
> /**
> @@ -2218,6 +2293,7 @@ mem_cgroup_create(struct cgroup_subsys *
> if (cont->parent == NULL) {
> enable_swap_cgroup();
> parent = NULL;
> + mem_cgroup_soft_limit_exceeded_groups = RB_ROOT;
> } else {
> parent = mem_cgroup_from_cont(cont->parent);
> mem->use_hierarchy = parent->use_hierarchy;
> @@ -2231,6 +2307,8 @@ mem_cgroup_create(struct cgroup_subsys *
> res_counter_init(&mem->memsw, NULL);
> }
> mem->last_scanned_child = NULL;
> + mem->usage_in_excess = 0;
> + mem->last_tree_update = 0; /* Yes, time begins at 0 here */
> spin_lock_init(&mem->reclaim_param_lock);
>
> if (parent)
> _
>
> --
> Balbir
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>

2009-01-08 03:47:04

by Balbir Singh

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/4] Memory controller soft limit patches

* KAMEZAWA Hiroyuki <[email protected]> [2009-01-08 09:37:00]:

> On Thu, 8 Jan 2009 00:26:27 +0530
> Dhaval Giani <[email protected]> wrote:
>
> > On Thu, Jan 08, 2009 at 12:11:10AM +0530, Balbir Singh wrote:
> > >
> > > Here is v1 of the new soft limit implementation. Soft limits is a new feature
> > > for the memory resource controller, something similar has existed in the
> > > group scheduler in the form of shares. We'll compare shares and soft limits
> > > below. I've had soft limit implementations earlier, but I've discarded those
> > > approaches in favour of this one.
> > >
> > > Soft limits are the most useful feature to have for environments where
> > > the administrator wants to overcommit the system, such that only on memory
> > > contention do the limits become active. The current soft limits implementation
> > > provides a soft_limit_in_bytes interface for the memory controller and not
> > > for memory+swap controller. The implementation maintains an RB-Tree of groups
> > > that exceed their soft limit and starts reclaiming from the group that
> > > exceeds this limit by the maximum amount.
> > >
> > > This is an RFC implementation and is not meant for inclusion
> > >
> > > TODOs
> > >
> > > 1. The shares interface is not yet implemented, the current soft limit
> > > implementation is not yet hierarchy aware. The end goal is to add
> > > a shares interface on top of soft limits and to maintain shares in
> > > a manner similar to the group scheduler
> >
> > Just to clarify, when there is no contention, you want to share memory
> > proportionally?
> >
> I don't like to add "share" as the kernel interface of memcg.
> We used "bytes" to do (hard) limit. Please just use "bytes".
>

Yes, we'll have soft limits in bytes, but for a hierarchical view,
shares do make a lot of sense. The user can use whichever interface
suits them best.

--
Balbir

2009-01-08 04:00:00

by Balbir Singh

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/4] Memory controller soft limit patches

* KAMEZAWA Hiroyuki <[email protected]> [2009-01-08 09:30:40]:

> On Thu, 08 Jan 2009 00:11:10 +0530
> Balbir Singh <[email protected]> wrote:
>
> >
> > Here is v1 of the new soft limit implementation. Soft limits is a new feature
> > for the memory resource controller, something similar has existed in the
> > group scheduler in the form of shares. We'll compare shares and soft limits
> > below. I've had soft limit implementations earlier, but I've discarded those
> > approaches in favour of this one.
> >
> > Soft limits are the most useful feature to have for environments where
> > the administrator wants to overcommit the system, such that only on memory
> > contention do the limits become active. The current soft limits implementation
> > provides a soft_limit_in_bytes interface for the memory controller and not
> > for memory+swap controller. The implementation maintains an RB-Tree of groups
> > that exceed their soft limit and starts reclaiming from the group that
> > exceeds this limit by the maximum amount.
> >
> > This is an RFC implementation and is not meant for inclusion
> >
> Core implementation seems simple and the feature sounds good.

Thanks!

> But, before reviewing into details, 3 points.
>
> 1. please fix current bugs on hierarchy management, before new feature.
> AFAIK, OOM-Kill under hierarchy is broken. (I have patches but waits for
> merge window close.)

I've not hit the OOM-kill issue under hierarchy so far; is the OOM
killer selecting a bad task to kill? I'll debug/reproduce the issue.
I am not posting these patches for inclusion; fixing bugs is
definitely the highest priority.

> I wonder there will be some others. Lockdep error which Nishimura reported
> are all fixed now ?

I run all my kernels and tests with lockdep enabled; I did not see any
lockdep errors show up.

>
> 2. You inserts reclaim-by-soft-limit into alloc_pages(). But, to do this,
> you have to pass zonelist to try_to_free_mem_cgroup_pages() and have to modify
> try_to_free_mem_cgroup_pages().
> 2-a) If not, when the memory request is for gfp_mask==GFP_DMA or allocation
> is under a cpuset, memory reclaim will not work correctly.

The idea behind adding the code in alloc_pages() is to detect
contention and trim mem cgroups down if they have grown beyond their
soft limit.

> 2-b) try_to_free_mem_cgroup_pages() cannot do good work for order > 1 allocation.
>
> Please try fake-numa (or real NUMA machine) and cpuset.

Yes, order > 1 is documented in the patch and you can see the code as
well. Your suggestion is to look at the gfp_mask as well; I'll do
that.
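
For example, a check along these lines at the top of
mem_cgroup_soft_limit_reclaim() is one possible shape for that (only a
sketch; the exact condition is an assumption, not something from the
posted patches):

	/*
	 * Sketch: memcg reclaim is zone-agnostic, so skip it for
	 * zone-restricted requests (e.g. GFP_DMA) and let the caller
	 * fall back to try_to_free_pages().
	 */
	if (gfp_zone(gfp_mask) < ZONE_NORMAL)
		return 0;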

>
> 3. If you want to insert hooks to "generic" page allocator, it's better to add CC to
> Rik van Riel, Kosaki Motohiro, at least.

Sure, I'll do that in the next patchset.

>
> To be honest, I myself don't like to add a hook to alloc_pages() directly.
> Can we implement the soft-limit call like kswapd (or on kswapd()) ?
> i.e. in moderate way ?
>

Yes, that might be another point to experiment with, I'll try that in
the next iteration.


> A happy new year,
>

A very happy new year to you as well.

> -Kame
>

--
Balbir

2009-01-08 04:22:55

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/4] Memory controller soft limit patches

On Thu, 8 Jan 2009 09:29:30 +0530
Balbir Singh <[email protected]> wrote:

> * KAMEZAWA Hiroyuki <[email protected]> [2009-01-08 09:30:40]:
>
> > On Thu, 08 Jan 2009 00:11:10 +0530
> > Balbir Singh <[email protected]> wrote:
> >
> > >
> > > Here is v1 of the new soft limit implementation. Soft limits is a new feature
> > > for the memory resource controller, something similar has existed in the
> > > group scheduler in the form of shares. We'll compare shares and soft limits
> > > below. I've had soft limit implementations earlier, but I've discarded those
> > > approaches in favour of this one.
> > >
> > > Soft limits are the most useful feature to have for environments where
> > > the administrator wants to overcommit the system, such that only on memory
> > > contention do the limits become active. The current soft limits implementation
> > > provides a soft_limit_in_bytes interface for the memory controller and not
> > > for memory+swap controller. The implementation maintains an RB-Tree of groups
> > > that exceed their soft limit and starts reclaiming from the group that
> > > exceeds this limit by the maximum amount.
> > >
> > > This is an RFC implementation and is not meant for inclusion
> > >
> > Core implementation seems simple and the feature sounds good.
>
> Thanks!
>
> > But, before reviewing into details, 3 points.
> >
> > 1. please fix current bugs on hierarchy management, before new feature.
> > AFAIK, OOM-Kill under hierarchy is broken. (I have patches but waits for
> > merge window close.)
>
> I've not hit the OOM-kill issue under hierarchy so far, is the OOM
> killer selecting a bad task to kill? I'll debug/reproduce the issue.
> I am not posting these patches for inclusion, fixing bugs is
> definitely the highest priority.
>
Assume the following hierarchy.

    group_A/      limit=100M   usage=1M
        group_01/ no limit     usage=1M
        group_02/ no limit     usage=98M (has a memory leak)

Q. What happens when a task in group_02 causes an OOM?
A. A task in group_A dies.


That is my problem. (As I said, I'll post a patch.) This is my homework for the month.
(I'll use CSS_ID to fix this.)
And this allows my logic that checks "Is this OOM from memcg?" to be skipped,
and makes the system panic if vm.panic_on_oom==1.





> > I wonder there will be some others. Lockdep error which Nishimura reported
> > are all fixed now ?
>
> I run all my kernels and tests with lockdep enabled, I did not see any
> lockdep errors showing up.
>
ok.

> >
> > 2. You inserts reclaim-by-soft-limit into alloc_pages(). But, to do this,
> > you have to pass zonelist to try_to_free_mem_cgroup_pages() and have to modify
> > try_to_free_mem_cgroup_pages().
> > 2-a) If not, when the memory request is for gfp_mask==GFP_DMA or allocation
> > is under a cpuset, memory reclaim will not work correctly.
>
> The idea behind adding the code in alloc_pages() is to detect
> contention and trim mem cgroups down, if they have grown beyond their
> soft limit
>
Allowing the usual direct reclaim to go on and just waking up
"balance_soft_limit_daemon()" would be enough.

> > 2-b) try_to_free_mem_cgroup_pages() cannot do good work for order > 1 allocation.
> >
> > Please try fake-numa (or real NUMA machine) and cpuset.
>
> Yes, order > 1 is documented in the patch and you can see the code as
> well. Your suggestion is to look at the gfp_mask as well, I'll do
> that.
>
and zonelist/nodemask.

The generic try_to_free_pages() doesn't have a nodemask argument, but it checks the cpuset.

In shrink_zones().
==
		/*
		 * Take care memory controller reclaiming has small influence
		 * to global LRU.
		 */
		if (scan_global_lru(sc)) {
			if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
				continue;
			note_zone_scanning_priority(zone, priority);

			if (zone_is_all_unreclaimable(zone) &&
						priority != DEF_PRIORITY)
				continue;	/* Let kswapd poll it */
			sc->all_unreclaimable = 0;
		} else {
			/*
			 * Ignore cpuset limitation here. We just want to reduce
			 * # of used pages by us regardless of memory shortage.
			 */
			sc->all_unreclaimable = 0;
			mem_cgroup_note_reclaim_priority(sc->mem_cgroup,
							priority);
		}
==
This is because "reclaim by memcg" can happen even if there is enough memory;
try_to_free_mem_cgroup_pages() is called when a limit is hit.

So, there will be some issues to be fixed if you want to use
try_to_free_mem_cgroup_pages() for recovering from a "memory shortage".
I think the above is one such issue. Some more assumptions will break.
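
For illustration, one possible shape of that change (a sketch only, not a
tested patch; the extra zonelist argument is the point, and the scan_control
setup is assumed to stay as in the current function):

==
unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
					   struct zonelist *zonelist,
					   gfp_t gfp_mask, bool noswap,
					   unsigned int swappiness)
{
	struct scan_control sc = {
		.gfp_mask = gfp_mask,
		.may_writepage = !laptop_mode,
		.may_swap = !noswap,
		.swap_cluster_max = SWAP_CLUSTER_MAX,
		.swappiness = swappiness,
		.order = 0,
		.mem_cgroup = mem_cont,
		.isolate_pages = mem_cgroup_isolate_pages,
	};

	/* the zonelist now comes from the caller (cpuset/GFP_DMA aware) */
	return do_try_to_free_pages(zonelist, &sc);
}
==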

-Kame

2009-01-08 04:26:18

by Balbir Singh

[permalink] [raw]
Subject: Re: [RFC][PATCH 3/4] Memory controller soft limit organize cgroups

* KAMEZAWA Hiroyuki <[email protected]> [2009-01-08 10:11:48]:

> On Thu, 08 Jan 2009 00:11:28 +0530
> Balbir Singh <[email protected]> wrote:
>
> > From: Balbir Singh <[email protected]>
> >
> > This patch introduces a RB-Tree for storing memory cgroups that are over their
> > soft limit. The overall goal is to
> >
> > 1. Add a memory cgroup to the RB-Tree when the soft limit is exceeded.
> > We are careful about updates, updates take place only after a particular
> > time interval has passed
> > 2. We remove the node from the RB-Tree when the usage goes below the soft
> > limit
> >
> > The next set of patches will exploit the RB-Tree to get the group that is
> > over its soft limit by the largest amount and reclaim from it, when we
> > face memory contention.
> >
>
> Hmm, Could you clarify following ?
>
> - Usage of memory at insertion and usage of memory at reclaim is different.
> So, this *sorted* order by RB-tree isn't the best order in general.

True, but we update the tree frequently, at an interval of HZ/4.
Updating at every page fault sounded like overkill, and building the
entire tree at reclaim time is overkill too.

> Why don't you sort this at memory-reclaim dynamically ?
> - Considering above, the look of RB tree can be
>
> +30M (an amount over soft limit is 30M)
> / \
> -15M +60M

We don't have elements below their soft limit in the tree

> ?
>
> At least, please remove the node at uncharge() when the usage goes down.
>

We do remove the group from the tree if it goes under its soft limit at
commit_charge(). I thought I had the same code in uncharge(), but clearly
that is missing. Thanks, I'll add it there.
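
For reference, the uncharge-side removal could be as small as something like
this (a sketch only; the placement at the end of __mem_cgroup_uncharge_common()
is an assumption, and only the helpers from this series are used):

	if (mem && mem->usage_in_excess &&
	    !res_counter_soft_limit_excess(&mem->res)) {
		/* usage fell below the soft limit, drop us from the tree */
		mem_cgroup_remove_exceeded(mem);
		mutex_lock(&memcg_soft_limit_tree_mutex);
		mem->usage_in_excess = 0;
		mutex_unlock(&memcg_soft_limit_tree_mutex);
	}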


> Thanks,
> -Kame

--
Balbir

2009-01-08 04:30:16

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [RFC][PATCH 3/4] Memory controller soft limit organize cgroups

On Thu, 8 Jan 2009 09:55:58 +0530
Balbir Singh <[email protected]> wrote:

> * KAMEZAWA Hiroyuki <[email protected]> [2009-01-08 10:11:48]:
> > Hmm, Could you clarify following ?
> >
> > - Usage of memory at insertion and usage of memory at reclaim is different.
> > So, this *sorted* order by RB-tree isn't the best order in general.
>
> True, but we frequently update the tree at an interval of HZ/4.
> Updating at every page fault sounded like an overkill and building the
> entire tree at reclaim is an overkill too.
>
"sort" is not necessary.
If this feature is implemented as background daemon,
just select the worst one at each iteration is enough.


> > Why don't you sort this at memory-reclaim dynamically ?
> > - Considering above, the look of RB tree can be
> >
> > +30M (an amount over soft limit is 30M)
> > / \
> > -15M +60M
>
> We don't have elements below their soft limit in the tree
>
> > ?
> >
> > At least, please remove the node at uncharge() when the usage goes down.
> >
>
> We do remove the tree if it goes under its soft limit at commit_charge,
> I thought I had the same code in uncharge(), but clearly that is
> missing. Thanks, I'll add it there.
>

Ah, ok. I missed it. Thank you for the clarification.

Regards,
-Kame

2009-01-08 04:41:21

by Balbir Singh

[permalink] [raw]
Subject: Re: [RFC][PATCH 3/4] Memory controller soft limit organize cgroups

* KAMEZAWA Hiroyuki <[email protected]> [2009-01-08 13:28:55]:

> On Thu, 8 Jan 2009 09:55:58 +0530
> Balbir Singh <[email protected]> wrote:
>
> > * KAMEZAWA Hiroyuki <[email protected]> [2009-01-08 10:11:48]:
> > > Hmm, Could you clarify following ?
> > >
> > > - Usage of memory at insertion and usage of memory at reclaim is different.
> > > So, this *sorted* order by RB-tree isn't the best order in general.
> >
> > True, but we frequently update the tree at an interval of HZ/4.
> > Updating at every page fault sounded like an overkill and building the
> > entire tree at reclaim is an overkill too.
> >
> "sort" is not necessary.
> If this feature is implemented as background daemon,
> just select the worst one at each iteration is enough.

OK, definitely an alternative worth considering, but the trade-off is
lazy building (your suggestion), which involves actively examining the
usage of all cgroups (O(c), where c is the number of cgroups, which can
be quite large) versus building the tree as and when faults occur,
throttled by some interval.

--
Balbir

2009-01-08 04:47:19

by Daisuke Nishimura

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/4] Memory controller soft limit patches

> > 1. please fix current bugs on hierarchy management, before new feature.
> > AFAIK, OOM-Kill under hierarchy is broken. (I have patches but waits for
> > merge window close.)
>
> I've not hit the OOM-kill issue under hierarchy so far, is the OOM
> killer selecting a bad task to kill? I'll debug/reproduce the issue.
> I am not posting these patches for inclusion, fixing bugs is
> definitely the highest priority.
>
I agree.

Just FYI, I have several bug fix patches for the current memcg (that is, for .29).
I've been testing them now, and it survives my test (rmdir after task move
under memory pressure, and page migration) without big problems (except oom) for hours
in both the use_hierarchy==0 and use_hierarchy==1 cases.

> > I wonder there will be some others. Lockdep error which Nishimura reported
> > are all fixed now ?
>
> I run all my kernels and tests with lockdep enabled, I did not see any
> lockdep errors showing up.
>
I think Paul's hierarchy_mutex patches fixed the deadlock; I haven't seen
the deadlock after the patch.
(Although it may cause another deadlock when other subsystems are added.)


Thanks,
Daisuke Nishimura.

2009-01-08 04:58:42

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [RFC][PATCH 3/4] Memory controller soft limit organize cgroups

On Thu, 8 Jan 2009 10:11:08 +0530
Balbir Singh <[email protected]> wrote:

> * KAMEZAWA Hiroyuki <[email protected]> [2009-01-08 13:28:55]:
>
> > On Thu, 8 Jan 2009 09:55:58 +0530
> > Balbir Singh <[email protected]> wrote:
> >
> > > * KAMEZAWA Hiroyuki <[email protected]> [2009-01-08 10:11:48]:
> > > > Hmm, Could you clarify following ?
> > > >
> > > > - Usage of memory at insertion and usage of memory at reclaim is different.
> > > > So, this *sorted* order by RB-tree isn't the best order in general.
> > >
> > > True, but we frequently update the tree at an interval of HZ/4.
> > > Updating at every page fault sounded like an overkill and building the
> > > entire tree at reclaim is an overkill too.
> > >
> > "sort" is not necessary.
> > If this feature is implemented as background daemon,
> > just select the worst one at each iteration is enough.
>
> OK, definitely an alternative worth considering, but the trade-off is
> lazy building (your suggestion), which involves actively seeing the
> usage of all cgroups (and if they are large, O(c), c is number of
> cgroups can be quite a bit) versus building the tree as and when the
> fault occurs and controlled by some interval.
>
I don't think there will ever be "thousands" of memcgs. O(c) is not so bad
if it's done in the background.

But the usual cost of calling res_counter_soft_limit_excess(&mem->res) is big...
This maintenance cost of the tree is always paid, even while there is no
memory shortage.

BTW,
- The mutex is bad. Can you take a mutex while __GFP_WAIT is unset?

- What happens when a big uncharge() occurs and no new charge() happens?
  Please add, at least:

	mem = mem_cgroup_get_largest_soft_limit_exceeding_node();
	if (mem is still over its soft limit)
		do reclaim....
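
Concretely, inside the mem_cgroup_soft_limit_reclaim() loop from patch 4 that
could look something like this (a sketch using only the helpers from this
series; the exact placement is an assumption):

	mem = mem_cgroup_get_largest_soft_limit_exceeding_node();
	if (!mem)
		break;
	if (!res_counter_soft_limit_excess(&mem->res)) {
		/* a big uncharge already pushed us below the soft limit */
		mem_cgroup_put(mem);
		continue;
	}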

-Kame

2009-01-14 01:46:19

by Paul Menage

[permalink] [raw]
Subject: Re: [RFC][PATCH 1/4] Memory controller soft limit documentation

On Wed, Jan 7, 2009 at 10:41 AM, Balbir Singh <[email protected]> wrote:
> -7. TODO
> +7. Soft limits
> +
> +Soft limits allow for greater sharing of memory. The idea behind soft limits
> +is to allow control groups to use as much of the memory as needed, provided
> +
> +a. There is no memory contention
> +b. They do not exceed their hard limit
> +
> +When the system detects memory contention (through do_try_to_free_pages(),
> +while allocating), control groups are pushed back to their soft limits if
> +possible. If the soft limit of each control group is very high, they are
> +pushed back as much as possible to make sure that one control group does not
> +starve the others.

Can you give an example here of how to implement the following setup:

- we have a high-priority latency-sensitive server job A and a bunch
of low-priority batch jobs B, C and D

- each job *may* need up to 2GB of memory, but generally each tends to
use <1GB of memory

- we want to run all four jobs on a 4GB machine

- we don't want A to ever have to wait for memory to be reclaimed (as
it's serving latency-sensitive queries), so the kernel should be
squashing B/C/D down *before* memory actually runs out.

Is this possible with the proposed hard/soft limit setup? Or do we
need some additional support for keeping a pool of pre-reserved free
memory available?

Paul

2009-01-14 05:30:29

by Balbir Singh

[permalink] [raw]
Subject: Re: [RFC][PATCH 1/4] Memory controller soft limit documentation

* Paul Menage <[email protected]> [2009-01-13 17:45:54]:

> On Wed, Jan 7, 2009 at 10:41 AM, Balbir Singh <[email protected]> wrote:
> > -7. TODO
> > +7. Soft limits
> > +
> > +Soft limits allow for greater sharing of memory. The idea behind soft limits
> > +is to allow control groups to use as much of the memory as needed, provided
> > +
> > +a. There is no memory contention
> > +b. They do not exceed their hard limit
> > +
> > +When the system detects memory contention (through do_try_to_free_pages(),
> > +while allocating), control groups are pushed back to their soft limits if
> > +possible. If the soft limit of each control group is very high, they are
> > +pushed back as much as possible to make sure that one control group does not
> > +starve the others.
>
> Can you give an example here of how to implement the following setup:
>
> - we have a high-priority latency-sensitive server job A and a bunch
> of low-priority batch jobs B, C and D
>
> - each job *may* need up to 2GB of memory, but generally each tends to
> use <1GB of memory
>
> - we want to run all four jobs on a 4GB machine
>
> - we don't want A to ever have to wait for memory to be reclaimed (as
> it's serving latency-sensitive queries), so the kernel should be
> squashing B/C/D down *before* memory actually runs out.
>
> Is this possible with the proposed hard/soft limit setup? Or do we
> need some additional support for keeping a pool of pre-reserved free
> memory available?

This is a more complex scenario. It sounds like B, C and D should be
hard-limited to 2G or another value, depending on how much you want to
pre-reserve for A (B, C and D should all be in the same cgroup). Then
you want to use soft limits within the B/C/D cgroup. You don't want to
hard-limit A; just set up a 2G soft limit for it.

The notion of prioritized jobs and reservations does not exist yet, but
once we support soft limits and overcommit via soft limits, we could
consider what design aspects would help with it.

--
Balbir