From: Balbir Singh <[email protected]>
New Feature: Soft limits for memory resource controller.
Here is v9 of the new soft limit implementation. Soft limits is a new feature
for the memory resource controller, something similar has existed in the
group scheduler in the form of shares. The CPU controllers interpretation
of shares is very different though.
Soft limits are the most useful feature to have for environments where
the administrator wants to overcommit the system, such that only on memory
contention do the limits become active. The current soft limits implementation
provides a soft_limit_in_bytes interface for the memory controller and not
for memory+swap controller. The implementation maintains an RB-Tree of groups
that exceed their soft limit and starts reclaiming from the group that
exceeds this limit by the maximum amount.
v9 attempts to address several review comments for v8 by Kamezawa, including
moving over to an event based approach for soft limit rb tree management,
simplification of data structure names and many others. Comments not
addressed have been answered via email or I've added comments in the code.
TODOs
1. The current implementation maintains the delta from the soft limit
and pushes back groups to their soft limits, a ratio of delta/soft_limit
might be more useful
Tests
-----
I've run two memory intensive workloads with differing soft limits and
seen that they are pushed back to their soft limit on contention. Their usage
was their soft limit plus additional memory that they were able to grab
on the system. Soft limit can take a while before we see the expected
results.
I ran overhead tests (reaim) and found no significant overhead as a result
of these patches.
Please review, comment.
Series
------
memcg-soft-limits-documentation.patch
memcg-soft-limits-interface.patch
memcg-soft-limits-organize.patch
memcg-soft-limits-refactor-reclaim-bits
memcg-soft-limits-reclaim-on-contention.patch
--
Balbir
Feature: Add documentation for soft limits
From: Balbir Singh <[email protected]>
Signed-off-by: Balbir Singh <[email protected]>
---
Documentation/cgroups/memory.txt | 37 ++++++++++++++++++++++++++++++++++++-
1 files changed, 36 insertions(+), 1 deletions(-)
diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index ab0a021..b871f25 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -379,7 +379,42 @@ cgroups created below it.
NOTE2: This feature can be enabled/disabled per subtree.
-7. TODO
+7. Soft limits
+
+Soft limits allow for greater sharing of memory. The idea behind soft limits
+is to allow control groups to use as much of the memory as needed, provided
+
+a. There is no memory contention
+b. They do not exceed their hard limit
+
+When the system detects memory contention or low memory control groups
+are pushed back to their soft limits. If the soft limit of each control
+group is very high, they are pushed back as much as possible to make
+sure that one control group does not starve the others of memory.
+
+Please note that soft limits is a best effort feature, it comes with
+no guarantees, but it does its best to make sure that when memory is
+heavily contended for, memory is allocated based on the soft limit
+hints/setup. Currently soft limit based reclaim is setup such that
+it gets invoked from balance_pgdat (kswapd).
+
+7.1 Interface
+
+Soft limits can be setup by using the following commands (in this example we
+assume a soft limit of 256 megabytes)
+
+# echo 256M > memory.soft_limit_in_bytes
+
+If we want to change this to 1G, we can at any time use
+
+# echo 1G > memory.soft_limit_in_bytes
+
+NOTE1: Soft limits take effect over a long period of time, since they involve
+ reclaiming memory for balancing between memory cgroups
+NOTE2: It is recommended to set the soft limit always below the hard limit,
+ otherwise the hard limit will take precedence.
+
+8. TODO
1. Add support for accounting huge pages (as a separate controller)
2. Make per-cgroup scanner reclaim not-shared pages first
--
Balbir
Feature: Add soft limits interface to resource counters
From: Balbir Singh <[email protected]>
Changelog v8..v1 (yeah this patch changed only in v8)
1. Change ULONGLOGMAX to RESOURCE_MAX while initializing the soft limit
counter
Changelog v2...v1
1. Add support for res_counter_check_soft_limit_locked. This is used
by the hierarchy code.
Add an interface to allow get/set of soft limits. Soft limits for memory plus
swap controller (memsw) is currently not supported. Resource counters have
been enhanced to support soft limits and new type RES_SOFT_LIMIT has been
added. Unlike hard limits, soft limits can be directly set and do not
need any reclaim or checks before setting them to a newer value.
Kamezawa-San raised a question as to whether soft limit should belong
to res_counter. Since all resources understand the basic concepts of
hard and soft limits, it is justified to add soft limits here. Soft limits
are a generic resource usage feature, even file system quotas support
soft limits.
Signed-off-by: Balbir Singh <[email protected]>
---
include/linux/res_counter.h | 58 +++++++++++++++++++++++++++++++++++++++++++
kernel/res_counter.c | 3 ++
mm/memcontrol.c | 20 +++++++++++++++
3 files changed, 81 insertions(+), 0 deletions(-)
diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
index 511f42f..fcb9884 100644
--- a/include/linux/res_counter.h
+++ b/include/linux/res_counter.h
@@ -35,6 +35,10 @@ struct res_counter {
*/
unsigned long long limit;
/*
+ * the limit that usage can be exceed
+ */
+ unsigned long long soft_limit;
+ /*
* the number of unsuccessful attempts to consume the resource
*/
unsigned long long failcnt;
@@ -87,6 +91,7 @@ enum {
RES_MAX_USAGE,
RES_LIMIT,
RES_FAILCNT,
+ RES_SOFT_LIMIT,
};
/*
@@ -132,6 +137,36 @@ static inline bool res_counter_limit_check_locked(struct res_counter *cnt)
return false;
}
+static inline bool res_counter_soft_limit_check_locked(struct res_counter *cnt)
+{
+ if (cnt->usage < cnt->soft_limit)
+ return true;
+
+ return false;
+}
+
+/**
+ * Get the difference between the usage and the soft limit
+ * @cnt: The counter
+ *
+ * Returns 0 if usage is less than or equal to soft limit
+ * The difference between usage and soft limit, otherwise.
+ */
+static inline unsigned long long
+res_counter_soft_limit_excess(struct res_counter *cnt)
+{
+ unsigned long long excess;
+ unsigned long flags;
+
+ spin_lock_irqsave(&cnt->lock, flags);
+ if (cnt->usage <= cnt->soft_limit)
+ excess = 0;
+ else
+ excess = cnt->usage - cnt->soft_limit;
+ spin_unlock_irqrestore(&cnt->lock, flags);
+ return excess;
+}
+
/*
* Helper function to detect if the cgroup is within it's limit or
* not. It's currently called from cgroup_rss_prepare()
@@ -147,6 +182,17 @@ static inline bool res_counter_check_under_limit(struct res_counter *cnt)
return ret;
}
+static inline bool res_counter_check_under_soft_limit(struct res_counter *cnt)
+{
+ bool ret;
+ unsigned long flags;
+
+ spin_lock_irqsave(&cnt->lock, flags);
+ ret = res_counter_soft_limit_check_locked(cnt);
+ spin_unlock_irqrestore(&cnt->lock, flags);
+ return ret;
+}
+
static inline void res_counter_reset_max(struct res_counter *cnt)
{
unsigned long flags;
@@ -180,4 +226,16 @@ static inline int res_counter_set_limit(struct res_counter *cnt,
return ret;
}
+static inline int
+res_counter_set_soft_limit(struct res_counter *cnt,
+ unsigned long long soft_limit)
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(&cnt->lock, flags);
+ cnt->soft_limit = soft_limit;
+ spin_unlock_irqrestore(&cnt->lock, flags);
+ return 0;
+}
+
#endif
diff --git a/kernel/res_counter.c b/kernel/res_counter.c
index e1338f0..bcdabf3 100644
--- a/kernel/res_counter.c
+++ b/kernel/res_counter.c
@@ -19,6 +19,7 @@ void res_counter_init(struct res_counter *counter, struct res_counter *parent)
{
spin_lock_init(&counter->lock);
counter->limit = RESOURCE_MAX;
+ counter->soft_limit = RESOURCE_MAX;
counter->parent = parent;
}
@@ -101,6 +102,8 @@ res_counter_member(struct res_counter *counter, int member)
return &counter->limit;
case RES_FAILCNT:
return &counter->failcnt;
+ case RES_SOFT_LIMIT:
+ return &counter->soft_limit;
};
BUG();
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 4da70c9..3c9292b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2106,6 +2106,20 @@ static int mem_cgroup_write(struct cgroup *cont, struct cftype *cft,
else
ret = mem_cgroup_resize_memsw_limit(memcg, val);
break;
+ case RES_SOFT_LIMIT:
+ ret = res_counter_memparse_write_strategy(buffer, &val);
+ if (ret)
+ break;
+ /*
+ * For memsw, soft limits are hard to implement in terms
+ * of semantics, for now, we support soft limits for
+ * control without swap
+ */
+ if (type == _MEM)
+ ret = res_counter_set_soft_limit(&memcg->res, val);
+ else
+ ret = -EINVAL;
+ break;
default:
ret = -EINVAL; /* should be BUG() ? */
break;
@@ -2359,6 +2373,12 @@ static struct cftype mem_cgroup_files[] = {
.read_u64 = mem_cgroup_read,
},
{
+ .name = "soft_limit_in_bytes",
+ .private = MEMFILE_PRIVATE(_MEM, RES_SOFT_LIMIT),
+ .write_string = mem_cgroup_write,
+ .read_u64 = mem_cgroup_read,
+ },
+ {
.name = "failcnt",
.private = MEMFILE_PRIVATE(_MEM, RES_FAILCNT),
.trigger = mem_cgroup_reset,
--
Balbir
Feature: Organize cgroups over soft limit in a RB-Tree
From: Balbir Singh <[email protected]>
Changelog v9...v8
1. Time based update is gone, we now use events to update the RB-Tree
Took code from Kame's patches - added his signed-off-by
2. Names have been simplified, soft_limit from the middle name of data
structures are gone
3. Additional uses of lookup_page_cgroup are now gone
4. Locks don't have to be held with irq's disabled
Changelog v8...v7
1. Make the data structures per zone per node
Changelog v7...v6
1. Refactor the check and update logic. The goal is to allow the
check logic to be modular, so that it can be revisited in the future
if something more appropriate is found to be useful.
Changelog v6...v5
1. Update the key before inserting into RB tree. Without the current change
it could take an additional iteration to get the key correct.
Changelog v5...v4
1. res_counter_uncharge has an additional parameter to indicate if the
counter was over its soft limit, before uncharge.
Changelog v4...v3
1. Optimizations to ensure we don't uncessarily get res_counter values
2. Fixed a bug in usage of time_after()
Changelog v3...v2
1. Add only the ancestor to the RB-Tree
2. Use css_tryget/css_put instead of mem_cgroup_get/mem_cgroup_put
Changelog v2...v1
1. Add support for hierarchies
2. The res_counter that is highest in the hierarchy is returned on soft
limit being exceeded. Since we do hierarchical reclaim and add all
groups exceeding their soft limits, this approach seems to work well
in practice.
This patch introduces a RB-Tree for storing memory cgroups that are over their
soft limit. The overall goal is to
1. Add a memory cgroup to the RB-Tree when the soft limit is exceeded.
We are careful about updates, updates take place only after a particular
time interval has passed
2. We remove the node from the RB-Tree when the usage goes below the soft
limit
The next set of patches will exploit the RB-Tree to get the group that is
over its soft limit by the largest amount and reclaim from it, when we
face memory contention.
Signed-off-by: Balbir Singh <[email protected]>
Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
include/linux/res_counter.h | 6 +
kernel/res_counter.c | 18 ++-
mm/memcontrol.c | 302 +++++++++++++++++++++++++++++++++++++------
3 files changed, 279 insertions(+), 47 deletions(-)
diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
index fcb9884..731af71 100644
--- a/include/linux/res_counter.h
+++ b/include/linux/res_counter.h
@@ -114,7 +114,8 @@ void res_counter_init(struct res_counter *counter, struct res_counter *parent);
int __must_check res_counter_charge_locked(struct res_counter *counter,
unsigned long val);
int __must_check res_counter_charge(struct res_counter *counter,
- unsigned long val, struct res_counter **limit_fail_at);
+ unsigned long val, struct res_counter **limit_fail_at,
+ struct res_counter **soft_limit_at);
/*
* uncharge - tell that some portion of the resource is released
@@ -127,7 +128,8 @@ int __must_check res_counter_charge(struct res_counter *counter,
*/
void res_counter_uncharge_locked(struct res_counter *counter, unsigned long val);
-void res_counter_uncharge(struct res_counter *counter, unsigned long val);
+void res_counter_uncharge(struct res_counter *counter, unsigned long val,
+ bool *was_soft_limit_excess);
static inline bool res_counter_limit_check_locked(struct res_counter *cnt)
{
diff --git a/kernel/res_counter.c b/kernel/res_counter.c
index bcdabf3..88faec2 100644
--- a/kernel/res_counter.c
+++ b/kernel/res_counter.c
@@ -37,17 +37,27 @@ int res_counter_charge_locked(struct res_counter *counter, unsigned long val)
}
int res_counter_charge(struct res_counter *counter, unsigned long val,
- struct res_counter **limit_fail_at)
+ struct res_counter **limit_fail_at,
+ struct res_counter **soft_limit_fail_at)
{
int ret;
unsigned long flags;
struct res_counter *c, *u;
*limit_fail_at = NULL;
+ if (soft_limit_fail_at)
+ *soft_limit_fail_at = NULL;
local_irq_save(flags);
for (c = counter; c != NULL; c = c->parent) {
spin_lock(&c->lock);
ret = res_counter_charge_locked(c, val);
+ /*
+ * With soft limits, we return the highest ancestor
+ * that exceeds its soft limit
+ */
+ if (soft_limit_fail_at &&
+ !res_counter_soft_limit_check_locked(c))
+ *soft_limit_fail_at = c;
spin_unlock(&c->lock);
if (ret < 0) {
*limit_fail_at = c;
@@ -75,7 +85,8 @@ void res_counter_uncharge_locked(struct res_counter *counter, unsigned long val)
counter->usage -= val;
}
-void res_counter_uncharge(struct res_counter *counter, unsigned long val)
+void res_counter_uncharge(struct res_counter *counter, unsigned long val,
+ bool *was_soft_limit_excess)
{
unsigned long flags;
struct res_counter *c;
@@ -83,6 +94,9 @@ void res_counter_uncharge(struct res_counter *counter, unsigned long val)
local_irq_save(flags);
for (c = counter; c != NULL; c = c->parent) {
spin_lock(&c->lock);
+ if (was_soft_limit_excess)
+ *was_soft_limit_excess =
+ !res_counter_soft_limit_check_locked(c);
res_counter_uncharge_locked(c, val);
spin_unlock(&c->lock);
}
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 3c9292b..219b060 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -29,6 +29,7 @@
#include <linux/rcupdate.h>
#include <linux/limits.h>
#include <linux/mutex.h>
+#include <linux/rbtree.h>
#include <linux/slab.h>
#include <linux/swap.h>
#include <linux/spinlock.h>
@@ -54,6 +55,7 @@ static int really_do_swap_account __initdata = 1; /* for remember boot option*/
#endif
static DEFINE_MUTEX(memcg_tasklist); /* can be hold under cgroup_mutex */
+#define SOFTLIMIT_EVENTS_THRESH (1000)
/*
* Statistics for memory cgroup.
@@ -67,6 +69,7 @@ enum mem_cgroup_stat_index {
MEM_CGROUP_STAT_MAPPED_FILE, /* # of pages charged as file rss */
MEM_CGROUP_STAT_PGPGIN_COUNT, /* # of pages paged in */
MEM_CGROUP_STAT_PGPGOUT_COUNT, /* # of pages paged out */
+ MEM_CGROUP_STAT_EVENTS, /* sum of pagein + pageout for internal use */
MEM_CGROUP_STAT_NSTATS,
};
@@ -79,6 +82,20 @@ struct mem_cgroup_stat {
struct mem_cgroup_stat_cpu cpustat[0];
};
+static inline void
+__mem_cgroup_stat_reset_safe(struct mem_cgroup_stat_cpu *stat,
+ enum mem_cgroup_stat_index idx)
+{
+ stat->count[idx] = 0;
+}
+
+static inline s64
+__mem_cgroup_stat_read_local(struct mem_cgroup_stat_cpu *stat,
+ enum mem_cgroup_stat_index idx)
+{
+ return stat->count[idx];
+}
+
/*
* For accounting under irq disable, no need for increment preempt count.
*/
@@ -118,6 +135,10 @@ struct mem_cgroup_per_zone {
unsigned long count[NR_LRU_LISTS];
struct zone_reclaim_stat reclaim_stat;
+ struct rb_node tree_node; /* RB tree node */
+ unsigned long long usage_in_excess;/* Set to the value by which */
+ /* the soft limit is exceeded*/
+ bool on_tree;
};
/* Macro for accessing counter */
#define MEM_CGROUP_ZSTAT(mz, idx) ((mz)->count[(idx)])
@@ -131,6 +152,26 @@ struct mem_cgroup_lru_info {
};
/*
+ * Cgroups above their limits are maintained in a RB-Tree, independent of
+ * their hierarchy representation
+ */
+
+struct mem_cgroup_tree_per_zone {
+ struct rb_root rb_root;
+ spinlock_t lock;
+};
+
+struct mem_cgroup_tree_per_node {
+ struct mem_cgroup_tree_per_zone rb_tree_per_zone[MAX_NR_ZONES];
+};
+
+struct mem_cgroup_tree {
+ struct mem_cgroup_tree_per_node *rb_tree_per_node[MAX_NUMNODES];
+};
+
+static struct mem_cgroup_tree soft_limit_tree __read_mostly;
+
+/*
* The memory controller data structure. The memory controller controls both
* page cache and RSS per cgroup. We would eventually like to provide
* statistics based on the statistics developed by Rik Van Riel for clock-pro,
@@ -215,6 +256,152 @@ static void mem_cgroup_get(struct mem_cgroup *mem);
static void mem_cgroup_put(struct mem_cgroup *mem);
static struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *mem);
+static struct mem_cgroup_per_zone *
+mem_cgroup_zoneinfo(struct mem_cgroup *mem, int nid, int zid)
+{
+ return &mem->info.nodeinfo[nid]->zoneinfo[zid];
+}
+
+static struct mem_cgroup_per_zone *
+page_cgroup_zoneinfo(struct page_cgroup *pc)
+{
+ struct mem_cgroup *mem = pc->mem_cgroup;
+ int nid = page_cgroup_nid(pc);
+ int zid = page_cgroup_zid(pc);
+
+ if (!mem)
+ return NULL;
+
+ return mem_cgroup_zoneinfo(mem, nid, zid);
+}
+
+static struct mem_cgroup_tree_per_zone *
+soft_limit_tree_node_zone(int nid, int zid)
+{
+ return &soft_limit_tree.rb_tree_per_node[nid]->rb_tree_per_zone[zid];
+}
+
+static struct mem_cgroup_tree_per_zone *
+soft_limit_tree_from_page(struct page *page)
+{
+ int nid = page_to_nid(page);
+ int zid = page_zonenum(page);
+
+ return &soft_limit_tree.rb_tree_per_node[nid]->rb_tree_per_zone[zid];
+}
+
+static void
+mem_cgroup_insert_exceeded(struct mem_cgroup *mem,
+ struct mem_cgroup_per_zone *mz,
+ struct mem_cgroup_tree_per_zone *mctz)
+{
+ struct rb_node **p = &mctz->rb_root.rb_node;
+ struct rb_node *parent = NULL;
+ struct mem_cgroup_per_zone *mz_node;
+
+ if (mz->on_tree)
+ return;
+
+ mz->usage_in_excess = res_counter_soft_limit_excess(&mem->res);
+ spin_lock(&mctz->lock);
+ while (*p) {
+ parent = *p;
+ mz_node = rb_entry(parent, struct mem_cgroup_per_zone,
+ tree_node);
+ if (mz->usage_in_excess < mz_node->usage_in_excess)
+ p = &(*p)->rb_left;
+ /*
+ * We can't avoid mem cgroups that are over their soft
+ * limit by the same amount
+ */
+ else if (mz->usage_in_excess >= mz_node->usage_in_excess)
+ p = &(*p)->rb_right;
+ }
+ rb_link_node(&mz->tree_node, parent, p);
+ rb_insert_color(&mz->tree_node, &mctz->rb_root);
+ mz->on_tree = true;
+ spin_unlock(&mctz->lock);
+}
+
+static void
+mem_cgroup_remove_exceeded(struct mem_cgroup *mem,
+ struct mem_cgroup_per_zone *mz,
+ struct mem_cgroup_tree_per_zone *mctz)
+{
+ spin_lock(&mctz->lock);
+ rb_erase(&mz->tree_node, &mctz->rb_root);
+ mz->on_tree = false;
+ spin_unlock(&mctz->lock);
+}
+
+static bool mem_cgroup_soft_limit_check(struct mem_cgroup *mem,
+ bool over_soft_limit)
+{
+ bool ret = false;
+ int cpu = get_cpu();
+ s64 val;
+ struct mem_cgroup_stat_cpu *cpustat;
+
+ if (!over_soft_limit)
+ return ret;
+
+ cpustat = &mem->stat.cpustat[cpu];
+ val = __mem_cgroup_stat_read_local(cpustat, MEM_CGROUP_STAT_EVENTS);
+ if (unlikely(val > SOFTLIMIT_EVENTS_THRESH)) {
+ __mem_cgroup_stat_reset_safe(cpustat, MEM_CGROUP_STAT_EVENTS);
+ ret = true;
+ }
+ return ret;
+}
+
+static void mem_cgroup_update_tree(struct mem_cgroup *mem, struct page *page)
+{
+ unsigned long long prev_usage_in_excess, new_usage_in_excess;
+ bool updated_tree = false;
+ struct mem_cgroup_per_zone *mz;
+ struct mem_cgroup_tree_per_zone *mctz;
+
+ mz = mem_cgroup_zoneinfo(mem, page_to_nid(page), page_zonenum(page));
+ mctz = soft_limit_tree_from_page(page);
+
+ /*
+ * We do updates in lazy mode, mem's are removed
+ * lazily from the per-zone, per-node rb tree
+ */
+ prev_usage_in_excess = mz->usage_in_excess;
+
+ new_usage_in_excess = res_counter_soft_limit_excess(&mem->res);
+ if (prev_usage_in_excess) {
+ mem_cgroup_remove_exceeded(mem, mz, mctz);
+ updated_tree = true;
+ }
+ if (!new_usage_in_excess)
+ goto done;
+ mem_cgroup_insert_exceeded(mem, mz, mctz);
+
+done:
+ if (updated_tree) {
+ spin_lock(&mctz->lock);
+ mz->usage_in_excess = new_usage_in_excess;
+ spin_unlock(&mctz->lock);
+ }
+}
+
+static void mem_cgroup_remove_from_trees(struct mem_cgroup *mem)
+{
+ int node, zone;
+ struct mem_cgroup_per_zone *mz;
+ struct mem_cgroup_tree_per_zone *mctz;
+
+ for_each_node_state(node, N_POSSIBLE) {
+ for (zone = 0; zone < MAX_NR_ZONES; zone++) {
+ mz = mem_cgroup_zoneinfo(mem, node, zone);
+ mctz = soft_limit_tree_node_zone(node, zone);
+ mem_cgroup_remove_exceeded(mem, mz, mctz);
+ }
+ }
+}
+
static void mem_cgroup_charge_statistics(struct mem_cgroup *mem,
struct page_cgroup *pc,
bool charge)
@@ -236,28 +423,10 @@ static void mem_cgroup_charge_statistics(struct mem_cgroup *mem,
else
__mem_cgroup_stat_add_safe(cpustat,
MEM_CGROUP_STAT_PGPGOUT_COUNT, 1);
+ __mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_EVENTS, 1);
put_cpu();
}
-static struct mem_cgroup_per_zone *
-mem_cgroup_zoneinfo(struct mem_cgroup *mem, int nid, int zid)
-{
- return &mem->info.nodeinfo[nid]->zoneinfo[zid];
-}
-
-static struct mem_cgroup_per_zone *
-page_cgroup_zoneinfo(struct page_cgroup *pc)
-{
- struct mem_cgroup *mem = pc->mem_cgroup;
- int nid = page_cgroup_nid(pc);
- int zid = page_cgroup_zid(pc);
-
- if (!mem)
- return NULL;
-
- return mem_cgroup_zoneinfo(mem, nid, zid);
-}
-
static unsigned long mem_cgroup_get_local_zonestat(struct mem_cgroup *mem,
enum lru_list idx)
{
@@ -972,11 +1141,11 @@ done:
*/
static int __mem_cgroup_try_charge(struct mm_struct *mm,
gfp_t gfp_mask, struct mem_cgroup **memcg,
- bool oom)
+ bool oom, struct page *page)
{
- struct mem_cgroup *mem, *mem_over_limit;
+ struct mem_cgroup *mem, *mem_over_limit, *mem_over_soft_limit;
int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
- struct res_counter *fail_res;
+ struct res_counter *fail_res, *soft_fail_res = NULL;
if (unlikely(test_thread_flag(TIF_MEMDIE))) {
/* Don't account this! */
@@ -1006,16 +1175,17 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
int ret;
bool noswap = false;
- ret = res_counter_charge(&mem->res, PAGE_SIZE, &fail_res);
+ ret = res_counter_charge(&mem->res, PAGE_SIZE, &fail_res,
+ &soft_fail_res);
if (likely(!ret)) {
if (!do_swap_account)
break;
ret = res_counter_charge(&mem->memsw, PAGE_SIZE,
- &fail_res);
+ &fail_res, NULL);
if (likely(!ret))
break;
/* mem+swap counter fails */
- res_counter_uncharge(&mem->res, PAGE_SIZE);
+ res_counter_uncharge(&mem->res, PAGE_SIZE, NULL);
noswap = true;
mem_over_limit = mem_cgroup_from_res_counter(fail_res,
memsw);
@@ -1053,13 +1223,23 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
goto nomem;
}
}
+ /*
+ * Insert just the ancestor, we should trickle down to the correct
+ * cgroup for reclaim, since the other nodes will be below their
+ * soft limit
+ */
+ if (soft_fail_res) {
+ mem_over_soft_limit =
+ mem_cgroup_from_res_counter(soft_fail_res, res);
+ if (mem_cgroup_soft_limit_check(mem_over_soft_limit, true))
+ mem_cgroup_update_tree(mem_over_soft_limit, page);
+ }
return 0;
nomem:
css_put(&mem->css);
return -ENOMEM;
}
-
/*
* A helper function to get mem_cgroup from ID. must be called under
* rcu_read_lock(). The caller must check css_is_removed() or some if
@@ -1126,9 +1306,9 @@ static void __mem_cgroup_commit_charge(struct mem_cgroup *mem,
lock_page_cgroup(pc);
if (unlikely(PageCgroupUsed(pc))) {
unlock_page_cgroup(pc);
- res_counter_uncharge(&mem->res, PAGE_SIZE);
+ res_counter_uncharge(&mem->res, PAGE_SIZE, NULL);
if (do_swap_account)
- res_counter_uncharge(&mem->memsw, PAGE_SIZE);
+ res_counter_uncharge(&mem->memsw, PAGE_SIZE, NULL);
css_put(&mem->css);
return;
}
@@ -1205,7 +1385,7 @@ static int mem_cgroup_move_account(struct page_cgroup *pc,
if (pc->mem_cgroup != from)
goto out;
- res_counter_uncharge(&from->res, PAGE_SIZE);
+ res_counter_uncharge(&from->res, PAGE_SIZE, NULL);
mem_cgroup_charge_statistics(from, pc, false);
page = pc->page;
@@ -1225,7 +1405,7 @@ static int mem_cgroup_move_account(struct page_cgroup *pc,
}
if (do_swap_account)
- res_counter_uncharge(&from->memsw, PAGE_SIZE);
+ res_counter_uncharge(&from->memsw, PAGE_SIZE, NULL);
css_put(&from->css);
css_get(&to->css);
@@ -1259,7 +1439,7 @@ static int mem_cgroup_move_parent(struct page_cgroup *pc,
parent = mem_cgroup_from_cont(pcg);
- ret = __mem_cgroup_try_charge(NULL, gfp_mask, &parent, false);
+ ret = __mem_cgroup_try_charge(NULL, gfp_mask, &parent, false, page);
if (ret || !parent)
return ret;
@@ -1289,9 +1469,9 @@ uncharge:
/* drop extra refcnt by try_charge() */
css_put(&parent->css);
/* uncharge if move fails */
- res_counter_uncharge(&parent->res, PAGE_SIZE);
+ res_counter_uncharge(&parent->res, PAGE_SIZE, NULL);
if (do_swap_account)
- res_counter_uncharge(&parent->memsw, PAGE_SIZE);
+ res_counter_uncharge(&parent->memsw, PAGE_SIZE, NULL);
return ret;
}
@@ -1316,7 +1496,7 @@ static int mem_cgroup_charge_common(struct page *page, struct mm_struct *mm,
prefetchw(pc);
mem = memcg;
- ret = __mem_cgroup_try_charge(mm, gfp_mask, &mem, true);
+ ret = __mem_cgroup_try_charge(mm, gfp_mask, &mem, true, page);
if (ret || !mem)
return ret;
@@ -1435,14 +1615,14 @@ int mem_cgroup_try_charge_swapin(struct mm_struct *mm,
if (!mem)
goto charge_cur_mm;
*ptr = mem;
- ret = __mem_cgroup_try_charge(NULL, mask, ptr, true);
+ ret = __mem_cgroup_try_charge(NULL, mask, ptr, true, page);
/* drop extra refcnt from tryget */
css_put(&mem->css);
return ret;
charge_cur_mm:
if (unlikely(!mm))
mm = &init_mm;
- return __mem_cgroup_try_charge(mm, mask, ptr, true);
+ return __mem_cgroup_try_charge(mm, mask, ptr, true, page);
}
static void
@@ -1479,7 +1659,7 @@ __mem_cgroup_commit_charge_swapin(struct page *page, struct mem_cgroup *ptr,
* This recorded memcg can be obsolete one. So, avoid
* calling css_tryget
*/
- res_counter_uncharge(&memcg->memsw, PAGE_SIZE);
+ res_counter_uncharge(&memcg->memsw, PAGE_SIZE, NULL);
mem_cgroup_put(memcg);
}
rcu_read_unlock();
@@ -1500,9 +1680,9 @@ void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *mem)
return;
if (!mem)
return;
- res_counter_uncharge(&mem->res, PAGE_SIZE);
+ res_counter_uncharge(&mem->res, PAGE_SIZE, NULL);
if (do_swap_account)
- res_counter_uncharge(&mem->memsw, PAGE_SIZE);
+ res_counter_uncharge(&mem->memsw, PAGE_SIZE, NULL);
css_put(&mem->css);
}
@@ -1516,6 +1696,7 @@ __mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype)
struct page_cgroup *pc;
struct mem_cgroup *mem = NULL;
struct mem_cgroup_per_zone *mz;
+ bool soft_limit_excess = false;
if (mem_cgroup_disabled())
return NULL;
@@ -1554,9 +1735,9 @@ __mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype)
break;
}
- res_counter_uncharge(&mem->res, PAGE_SIZE);
+ res_counter_uncharge(&mem->res, PAGE_SIZE, &soft_limit_excess);
if (do_swap_account && (ctype != MEM_CGROUP_CHARGE_TYPE_SWAPOUT))
- res_counter_uncharge(&mem->memsw, PAGE_SIZE);
+ res_counter_uncharge(&mem->memsw, PAGE_SIZE, NULL);
mem_cgroup_charge_statistics(mem, pc, false);
ClearPageCgroupUsed(pc);
@@ -1570,6 +1751,8 @@ __mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype)
mz = page_cgroup_zoneinfo(pc);
unlock_page_cgroup(pc);
+ if (mem_cgroup_soft_limit_check(mem, soft_limit_excess))
+ mem_cgroup_update_tree(mem, page);
/* at swapout, this memcg will be accessed to record to swap */
if (ctype != MEM_CGROUP_CHARGE_TYPE_SWAPOUT)
css_put(&mem->css);
@@ -1645,7 +1828,7 @@ void mem_cgroup_uncharge_swap(swp_entry_t ent)
* We uncharge this because swap is freed.
* This memcg can be obsolete one. We avoid calling css_tryget
*/
- res_counter_uncharge(&memcg->memsw, PAGE_SIZE);
+ res_counter_uncharge(&memcg->memsw, PAGE_SIZE, NULL);
mem_cgroup_put(memcg);
}
rcu_read_unlock();
@@ -1674,7 +1857,8 @@ int mem_cgroup_prepare_migration(struct page *page, struct mem_cgroup **ptr)
unlock_page_cgroup(pc);
if (mem) {
- ret = __mem_cgroup_try_charge(NULL, GFP_KERNEL, &mem, false);
+ ret = __mem_cgroup_try_charge(NULL, GFP_KERNEL, &mem, false,
+ page);
css_put(&mem->css);
}
*ptr = mem;
@@ -2177,6 +2361,7 @@ static int mem_cgroup_reset(struct cgroup *cont, unsigned int event)
res_counter_reset_failcnt(&mem->memsw);
break;
}
+
return 0;
}
@@ -2472,6 +2657,7 @@ static int alloc_mem_cgroup_per_zone_info(struct mem_cgroup *mem, int node)
mz = &pn->zoneinfo[zone];
for_each_lru(l)
INIT_LIST_HEAD(&mz->lists[l]);
+ mz->usage_in_excess = 0;
}
return 0;
}
@@ -2517,6 +2703,7 @@ static void __mem_cgroup_free(struct mem_cgroup *mem)
{
int node;
+ mem_cgroup_remove_from_trees(mem);
free_css_id(&mem_cgroup_subsys, &mem->css);
for_each_node_state(node, N_POSSIBLE)
@@ -2565,6 +2752,31 @@ static void __init enable_swap_cgroup(void)
}
#endif
+static int mem_cgroup_soft_limit_tree_init(void)
+{
+ struct mem_cgroup_tree_per_node *rtpn;
+ struct mem_cgroup_tree_per_zone *rtpz;
+ int tmp, node, zone;
+
+ for_each_node_state(node, N_POSSIBLE) {
+ tmp = node;
+ if (!node_state(node, N_NORMAL_MEMORY))
+ tmp = -1;
+ rtpn = kzalloc_node(sizeof(*rtpn), GFP_KERNEL, tmp);
+ if (!rtpn)
+ return 1;
+
+ soft_limit_tree.rb_tree_per_node[node] = rtpn;
+
+ for (zone = 0; zone < MAX_NR_ZONES; zone++) {
+ rtpz = &rtpn->rb_tree_per_zone[zone];
+ rtpz->rb_root = RB_ROOT;
+ spin_lock_init(&rtpz->lock);
+ }
+ }
+ return 0;
+}
+
static struct cgroup_subsys_state * __ref
mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
{
@@ -2579,11 +2791,15 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
for_each_node_state(node, N_POSSIBLE)
if (alloc_mem_cgroup_per_zone_info(mem, node))
goto free_out;
+
/* root ? */
if (cont->parent == NULL) {
enable_swap_cgroup();
parent = NULL;
root_mem_cgroup = mem;
+ if (mem_cgroup_soft_limit_tree_init())
+ goto free_out;
+
} else {
parent = mem_cgroup_from_cont(cont->parent);
mem->use_hierarchy = parent->use_hierarchy;
--
Balbir
Impact: Refactor mem_cgroup_hierarchical_reclaim()
From: Balbir Singh <[email protected]>
This patch refactors the arguments passed to
mem_cgroup_hierarchical_reclaim() into flags, so that new parameters don't
have to be passed as we make the reclaim routine more flexible
Signed-off-by: Balbir Singh <[email protected]>
---
mm/memcontrol.c | 25 +++++++++++++++++++------
1 files changed, 19 insertions(+), 6 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 219b060..1421576 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -252,6 +252,14 @@ enum charge_type {
#define MEMFILE_TYPE(val) (((val) >> 16) & 0xffff)
#define MEMFILE_ATTR(val) ((val) & 0xffff)
+/*
+ * Reclaim flags for mem_cgroup_hierarchical_reclaim
+ */
+#define MEM_CGROUP_RECLAIM_NOSWAP_BIT 0x0
+#define MEM_CGROUP_RECLAIM_NOSWAP (1 << MEM_CGROUP_RECLAIM_NOSWAP_BIT)
+#define MEM_CGROUP_RECLAIM_SHRINK_BIT 0x1
+#define MEM_CGROUP_RECLAIM_SHRINK (1 << MEM_CGROUP_RECLAIM_SHRINK_BIT)
+
static void mem_cgroup_get(struct mem_cgroup *mem);
static void mem_cgroup_put(struct mem_cgroup *mem);
static struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *mem);
@@ -1031,11 +1039,14 @@ mem_cgroup_select_victim(struct mem_cgroup *root_mem)
* If shrink==true, for avoiding to free too much, this returns immedieately.
*/
static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
- gfp_t gfp_mask, bool noswap, bool shrink)
+ gfp_t gfp_mask,
+ unsigned long reclaim_options)
{
struct mem_cgroup *victim;
int ret, total = 0;
int loop = 0;
+ bool noswap = reclaim_options & MEM_CGROUP_RECLAIM_NOSWAP;
+ bool shrink = reclaim_options & MEM_CGROUP_RECLAIM_SHRINK;
/* If memsw_is_minimum==1, swap-out is of-no-use. */
if (root_mem->memsw_is_minimum)
@@ -1173,7 +1184,7 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
while (1) {
int ret;
- bool noswap = false;
+ unsigned long flags = 0;
ret = res_counter_charge(&mem->res, PAGE_SIZE, &fail_res,
&soft_fail_res);
@@ -1186,7 +1197,7 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
break;
/* mem+swap counter fails */
res_counter_uncharge(&mem->res, PAGE_SIZE, NULL);
- noswap = true;
+ flags |= MEM_CGROUP_RECLAIM_NOSWAP;
mem_over_limit = mem_cgroup_from_res_counter(fail_res,
memsw);
} else
@@ -1198,7 +1209,7 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
goto nomem;
ret = mem_cgroup_hierarchical_reclaim(mem_over_limit, gfp_mask,
- noswap, false);
+ flags);
if (ret)
continue;
@@ -1993,7 +2004,7 @@ static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
break;
progress = mem_cgroup_hierarchical_reclaim(memcg, GFP_KERNEL,
- false, true);
+ MEM_CGROUP_RECLAIM_SHRINK);
curusage = res_counter_read_u64(&memcg->res, RES_USAGE);
/* Usage is reduced ? */
if (curusage >= oldusage)
@@ -2045,7 +2056,9 @@ static int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg,
if (!ret)
break;
- mem_cgroup_hierarchical_reclaim(memcg, GFP_KERNEL, true, true);
+ mem_cgroup_hierarchical_reclaim(memcg, GFP_KERNEL,
+ MEM_CGROUP_RECLAIM_NOSWAP |
+ MEM_CGROUP_RECLAIM_SHRINK);
curusage = res_counter_read_u64(&memcg->memsw, RES_USAGE);
/* Usage is reduced ? */
if (curusage >= oldusage)
--
Balbir
Feature: Implement reclaim from groups over their soft limit
From: Balbir Singh <[email protected]>
Changelog v9 ..v8
1. Changed the order of css_tryget()
2. Don't need to check for excess being greater than ULONG_MAX anymore.
Changelog v8 ..v7
1. Soft limit reclaim takes an order parameter and does no reclaim for
order > 0. This ensures that we don't do double reclaim for order > 0
2. Make the data structures more scalable, move the reclaim logic
to a new function mem_cgroup_shrink_node_zone that does per node
per zone reclaim.
3. Reclaim has moved back to kswapd (balance_pgdat)
Changelog v7...v6
1. Refactored out reclaim_options patch into a separate patch
2. Added additional checks for all swap off condition in
mem_cgroup_hierarchical_reclaim()
Changelog v6...v5
1. Reclaim arguments to hierarchical reclaim have been merged into one
parameter called reclaim_options.
2. Check if we failed to reclaim from one cgroup during soft reclaim, if
so move on to the next one. This can be very useful if the zonelist
passed to soft limit reclaim has no allocations from the selected
memory cgroup
3. Coding style cleanups
Changelog v5...v4
1. Throttling is removed, earlier we throttled tasks over their soft limit
2. Reclaim has been moved back to __alloc_pages_internal, several experiments
and tests showed that it was the best place to reclaim memory. kswapd has
a different goal, that does not work with a single soft limit for the memory
cgroup.
3. Soft limit reclaim is more targetted and the pages reclaim depend on the
amount by which the soft limit is exceeded.
Changelog v4...v3
1. soft_reclaim is now called from balance_pgdat
2. soft_reclaim is aware of nodes and zones
3. A mem_cgroup will be throttled if it is undergoing soft limit reclaim
and at the same time trying to allocate pages and exceed its soft limit.
4. A new mem_cgroup_shrink_zone() routine has been added to shrink zones
particular to a mem cgroup.
Changelog v3...v2
1. Convert several arguments to hierarchical reclaim to flags, thereby
consolidating them
2. The reclaim for soft limits is now triggered from kswapd
3. try_to_free_mem_cgroup_pages() now accepts an optional zonelist argument
Changelog v2...v1
1. Added support for hierarchical soft limits
This patch allows reclaim from memory cgroups on contention (via the
direct reclaim path).
memory cgroup soft limit reclaim finds the group that exceeds its soft limit
by the largest number of pages and reclaims pages from it and then reinserts the
cgroup into its correct place in the rbtree.
Added additional checks to mem_cgroup_hierarchical_reclaim() to detect
long loops in case all swap is turned off. The code has been refactored
and the loop check (loop < 2) has been enhanced for soft limits. For soft
limits, we try to do more targetted reclaim. Instead of bailing out after
two loops, the routine now reclaims memory proportional to the size by
which the soft limit is exceeded. The proportion has been empirically
determined.
Signed-off-by: Balbir Singh <[email protected]>
---
include/linux/memcontrol.h | 11 ++
include/linux/swap.h | 5 +
mm/memcontrol.c | 213 +++++++++++++++++++++++++++++++++++++++++---
mm/vmscan.c | 45 +++++++++
4 files changed, 259 insertions(+), 15 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index e46a073..5296675 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -118,6 +118,9 @@ static inline bool mem_cgroup_disabled(void)
extern bool mem_cgroup_oom_called(struct task_struct *task);
void mem_cgroup_update_mapped_file_stat(struct page *page, int val);
+unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
+ gfp_t gfp_mask, int nid,
+ int zid);
#else /* CONFIG_CGROUP_MEM_RES_CTLR */
struct mem_cgroup;
@@ -276,6 +279,14 @@ static inline void mem_cgroup_update_mapped_file_stat(struct page *page,
{
}
+static inline
+unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
+ gfp_t gfp_mask, int nid,
+ int zid, int priority)
+{
+ return 0;
+}
+
#endif /* CONFIG_CGROUP_MEM_CONT */
#endif /* _LINUX_MEMCONTROL_H */
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 6c990e6..4c78fea 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -217,6 +217,11 @@ extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem,
gfp_t gfp_mask, bool noswap,
unsigned int swappiness);
+extern unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
+ gfp_t gfp_mask, bool noswap,
+ unsigned int swappiness,
+ struct zone *zone,
+ int nid);
extern int __isolate_lru_page(struct page *page, int mode, int file);
extern unsigned long shrink_all_memory(unsigned long nr_pages);
extern int vm_swappiness;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 1421576..1d85362 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -139,6 +139,8 @@ struct mem_cgroup_per_zone {
unsigned long long usage_in_excess;/* Set to the value by which */
/* the soft limit is exceeded*/
bool on_tree;
+ struct mem_cgroup *mem; /* Back pointer, we cannot */
+ /* use container_of */
};
/* Macro for accessing counter */
#define MEM_CGROUP_ZSTAT(mz, idx) ((mz)->count[(idx)])
@@ -228,6 +230,13 @@ struct mem_cgroup {
struct mem_cgroup_stat stat;
};
+/*
+ * Maximum loops in mem_cgroup_hierarchical_reclaim(), used for soft
+ * limit reclaim to prevent infinite loops, if they ever occur.
+ */
+#define MEM_CGROUP_MAX_RECLAIM_LOOPS (100)
+#define MEM_CGROUP_MAX_SOFT_LIMIT_RECLAIM_LOOPS (2)
+
enum charge_type {
MEM_CGROUP_CHARGE_TYPE_CACHE = 0,
MEM_CGROUP_CHARGE_TYPE_MAPPED,
@@ -259,6 +268,8 @@ enum charge_type {
#define MEM_CGROUP_RECLAIM_NOSWAP (1 << MEM_CGROUP_RECLAIM_NOSWAP_BIT)
#define MEM_CGROUP_RECLAIM_SHRINK_BIT 0x1
#define MEM_CGROUP_RECLAIM_SHRINK (1 << MEM_CGROUP_RECLAIM_SHRINK_BIT)
+#define MEM_CGROUP_RECLAIM_SOFT_BIT 0x2
+#define MEM_CGROUP_RECLAIM_SOFT (1 << MEM_CGROUP_RECLAIM_SOFT_BIT)
static void mem_cgroup_get(struct mem_cgroup *mem);
static void mem_cgroup_put(struct mem_cgroup *mem);
@@ -299,7 +310,7 @@ soft_limit_tree_from_page(struct page *page)
}
static void
-mem_cgroup_insert_exceeded(struct mem_cgroup *mem,
+__mem_cgroup_insert_exceeded(struct mem_cgroup *mem,
struct mem_cgroup_per_zone *mz,
struct mem_cgroup_tree_per_zone *mctz)
{
@@ -311,7 +322,6 @@ mem_cgroup_insert_exceeded(struct mem_cgroup *mem,
return;
mz->usage_in_excess = res_counter_soft_limit_excess(&mem->res);
- spin_lock(&mctz->lock);
while (*p) {
parent = *p;
mz_node = rb_entry(parent, struct mem_cgroup_per_zone,
@@ -328,6 +338,26 @@ mem_cgroup_insert_exceeded(struct mem_cgroup *mem,
rb_link_node(&mz->tree_node, parent, p);
rb_insert_color(&mz->tree_node, &mctz->rb_root);
mz->on_tree = true;
+}
+
+static void
+__mem_cgroup_remove_exceeded(struct mem_cgroup *mem,
+ struct mem_cgroup_per_zone *mz,
+ struct mem_cgroup_tree_per_zone *mctz)
+{
+ if (!mz->on_tree)
+ return;
+ rb_erase(&mz->tree_node, &mctz->rb_root);
+ mz->on_tree = false;
+}
+
+static void
+mem_cgroup_insert_exceeded(struct mem_cgroup *mem,
+ struct mem_cgroup_per_zone *mz,
+ struct mem_cgroup_tree_per_zone *mctz)
+{
+ spin_lock(&mctz->lock);
+ __mem_cgroup_insert_exceeded(mem, mz, mctz);
spin_unlock(&mctz->lock);
}
@@ -337,8 +367,7 @@ mem_cgroup_remove_exceeded(struct mem_cgroup *mem,
struct mem_cgroup_tree_per_zone *mctz)
{
spin_lock(&mctz->lock);
- rb_erase(&mz->tree_node, &mctz->rb_root);
- mz->on_tree = false;
+ __mem_cgroup_remove_exceeded(mem, mz, mctz);
spin_unlock(&mctz->lock);
}
@@ -410,6 +439,47 @@ static void mem_cgroup_remove_from_trees(struct mem_cgroup *mem)
}
}
+static inline unsigned long mem_cgroup_get_excess(struct mem_cgroup *mem)
+{
+ return res_counter_soft_limit_excess(&mem->res) >> PAGE_SHIFT;
+}
+
+static struct mem_cgroup_per_zone *
+__mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_zone *mctz)
+{
+ struct rb_node *rightmost = NULL;
+ struct mem_cgroup_per_zone *mz = NULL;
+
+retry:
+ rightmost = rb_last(&mctz->rb_root);
+ if (!rightmost)
+ goto done; /* Nothing to reclaim from */
+
+ mz = rb_entry(rightmost, struct mem_cgroup_per_zone, tree_node);
+ /*
+ * Remove the node now but someone else can add it back,
+ * we will to add it back at the end of reclaim to its correct
+ * position in the tree.
+ */
+ __mem_cgroup_remove_exceeded(mz->mem, mz, mctz);
+ if (!res_counter_soft_limit_excess(&mz->mem->res) ||
+ !css_tryget(&mz->mem->css))
+ goto retry;
+done:
+ return mz;
+}
+
+static struct mem_cgroup_per_zone *
+mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_zone *mctz)
+{
+ struct mem_cgroup_per_zone *mz;
+
+ spin_lock(&mctz->lock);
+ mz = __mem_cgroup_largest_soft_limit_node(mctz);
+ spin_unlock(&mctz->lock);
+ return mz;
+}
+
static void mem_cgroup_charge_statistics(struct mem_cgroup *mem,
struct page_cgroup *pc,
bool charge)
@@ -1039,6 +1109,7 @@ mem_cgroup_select_victim(struct mem_cgroup *root_mem)
* If shrink==true, for avoiding to free too much, this returns immedieately.
*/
static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
+ struct zone *zone,
gfp_t gfp_mask,
unsigned long reclaim_options)
{
@@ -1047,23 +1118,49 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
int loop = 0;
bool noswap = reclaim_options & MEM_CGROUP_RECLAIM_NOSWAP;
bool shrink = reclaim_options & MEM_CGROUP_RECLAIM_SHRINK;
+ bool check_soft = reclaim_options & MEM_CGROUP_RECLAIM_SOFT;
+ unsigned long excess = mem_cgroup_get_excess(root_mem);
/* If memsw_is_minimum==1, swap-out is of-no-use. */
if (root_mem->memsw_is_minimum)
noswap = true;
- while (loop < 2) {
+ while (1) {
victim = mem_cgroup_select_victim(root_mem);
- if (victim == root_mem)
+ if (victim == root_mem) {
loop++;
+ if (loop >= 2) {
+ /*
+ * If we have not been able to reclaim
+ * anything, it might because there are
+ * no reclaimable pages under this hierarchy
+ */
+ if (!check_soft || !total)
+ break;
+ /*
+ * We want to do more targetted reclaim.
+ * excess >> 2 is not to excessive so as to
+ * reclaim too much, nor too less that we keep
+ * coming back to reclaim from this cgroup
+ */
+ if (total >= (excess >> 2) ||
+ (loop > MEM_CGROUP_MAX_RECLAIM_LOOPS))
+ break;
+ }
+ }
if (!mem_cgroup_local_usage(&victim->stat)) {
/* this cgroup's local usage == 0 */
css_put(&victim->css);
continue;
}
/* we use swappiness of local cgroup */
- ret = try_to_free_mem_cgroup_pages(victim, gfp_mask, noswap,
- get_swappiness(victim));
+ if (check_soft)
+ ret = mem_cgroup_shrink_node_zone(victim, gfp_mask,
+ noswap, get_swappiness(victim), zone,
+ zone->zone_pgdat->node_id);
+ else
+ ret = try_to_free_mem_cgroup_pages(victim, gfp_mask,
+ noswap, get_swappiness(victim));
css_put(&victim->css);
/*
* At shrinking usage, we can't check we should stop here or
@@ -1073,7 +1170,10 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
if (shrink)
return ret;
total += ret;
- if (mem_cgroup_check_under_limit(root_mem))
+ if (check_soft) {
+ if (res_counter_check_under_soft_limit(&root_mem->res))
+ return total;
+ } else if (mem_cgroup_check_under_limit(root_mem))
return 1 + total;
}
return total;
@@ -1208,8 +1308,8 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
if (!(gfp_mask & __GFP_WAIT))
goto nomem;
- ret = mem_cgroup_hierarchical_reclaim(mem_over_limit, gfp_mask,
- flags);
+ ret = mem_cgroup_hierarchical_reclaim(mem_over_limit, NULL,
+ gfp_mask, flags);
if (ret)
continue;
@@ -2003,8 +2103,9 @@ static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
if (!ret)
break;
- progress = mem_cgroup_hierarchical_reclaim(memcg, GFP_KERNEL,
- MEM_CGROUP_RECLAIM_SHRINK);
+ progress = mem_cgroup_hierarchical_reclaim(memcg, NULL,
+ GFP_KERNEL,
+ MEM_CGROUP_RECLAIM_SHRINK);
curusage = res_counter_read_u64(&memcg->res, RES_USAGE);
/* Usage is reduced ? */
if (curusage >= oldusage)
@@ -2056,7 +2157,7 @@ static int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg,
if (!ret)
break;
- mem_cgroup_hierarchical_reclaim(memcg, GFP_KERNEL,
+ mem_cgroup_hierarchical_reclaim(memcg, NULL, GFP_KERNEL,
MEM_CGROUP_RECLAIM_NOSWAP |
MEM_CGROUP_RECLAIM_SHRINK);
curusage = res_counter_read_u64(&memcg->memsw, RES_USAGE);
@@ -2069,6 +2170,88 @@ static int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg,
return ret;
}
+unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
+ gfp_t gfp_mask, int nid,
+ int zid)
+{
+ unsigned long nr_reclaimed = 0;
+ struct mem_cgroup_per_zone *mz, *next_mz = NULL;
+ unsigned long reclaimed;
+ int loop = 0;
+ struct mem_cgroup_tree_per_zone *mctz;
+
+ if (order > 0)
+ return 0;
+
+ mctz = soft_limit_tree_node_zone(nid, zid);
+ /*
+ * This loop can run a while, specially if mem_cgroup's continuously
+ * keep exceeding their soft limit and putting the system under
+ * pressure
+ */
+ do {
+ if (next_mz)
+ mz = next_mz;
+ else
+ mz = mem_cgroup_largest_soft_limit_node(mctz);
+ if (!mz)
+ break;
+
+ reclaimed = mem_cgroup_hierarchical_reclaim(mz->mem, zone,
+ gfp_mask,
+ MEM_CGROUP_RECLAIM_SOFT);
+ nr_reclaimed += reclaimed;
+ spin_lock(&mctz->lock);
+
+ /*
+ * If we failed to reclaim anything from this memory cgroup
+ * it is time to move on to the next cgroup
+ */
+ next_mz = NULL;
+ if (!reclaimed) {
+ do {
+ /*
+ * By the time we get the soft_limit lock
+ * again, someone might have aded the
+ * group back on the RB tree. Iterate to
+ * make sure we get a different mem.
+ * mem_cgroup_largest_soft_limit_node returns
+ * NULL if no other cgroup is present on
+ * the tree
+ */
+ next_mz =
+ __mem_cgroup_largest_soft_limit_node(mctz);
+ } while (next_mz == mz);
+ }
+ mz->usage_in_excess =
+ res_counter_soft_limit_excess(&mz->mem->res);
+ __mem_cgroup_remove_exceeded(mz->mem, mz, mctz);
+ /*
+ * One school of thought says that we should not add
+ * back the node to the tree if reclaim returns 0.
+ * But our reclaim could return 0, simply because due
+ * to priority we are exposing a smaller subset of
+ * memory to reclaim from. Consider this as a longer
+ * term TODO.
+ */
+ if (mz->usage_in_excess)
+ __mem_cgroup_insert_exceeded(mz->mem, mz, mctz);
+ spin_unlock(&mctz->lock);
+ css_put(&mz->mem->css);
+ loop++;
+ /*
+ * Could not reclaim anything and there are no more
+ * mem cgroups to try or we seem to be looping without
+ * reclaiming anything.
+ */
+ if (!nr_reclaimed &&
+ (next_mz == NULL ||
+ loop > MEM_CGROUP_MAX_SOFT_LIMIT_RECLAIM_LOOPS))
+ break;
+ } while (!nr_reclaimed);
+ return nr_reclaimed;
+}
+
/*
* This routine traverse page_cgroup in given list and drop them all.
* *And* this routine doesn't reclaim page itself, just removes page_cgroup.
@@ -2671,6 +2854,8 @@ static int alloc_mem_cgroup_per_zone_info(struct mem_cgroup *mem, int node)
for_each_lru(l)
INIT_LIST_HEAD(&mz->lists[l]);
mz->usage_in_excess = 0;
+ mz->on_tree = false;
+ mz->mem = mem;
}
return 0;
}
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 86dc0c3..3e95aca 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1780,11 +1780,45 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
+ gfp_t gfp_mask, bool noswap,
+ unsigned int swappiness,
+ struct zone *zone, int nid)
+{
+ struct scan_control sc = {
+ .may_writepage = !laptop_mode,
+ .may_unmap = 1,
+ .may_swap = !noswap,
+ .swap_cluster_max = SWAP_CLUSTER_MAX,
+ .swappiness = swappiness,
+ .order = 0,
+ .mem_cgroup = mem,
+ .isolate_pages = mem_cgroup_isolate_pages,
+ };
+ nodemask_t nm = nodemask_of_node(nid);
+
+ sc.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
+ (GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK);
+ sc.nodemask = &nm;
+ sc.nr_reclaimed = 0;
+ sc.nr_scanned = 0;
+ /*
+ * NOTE: Although we can get the priority field, using it
+ * here is not a good idea, since it limits the pages we can scan.
+ * if we don't reclaim here, the shrink_zone from balance_pgdat
+ * will pick up pages from other mem cgroup's as well. We hack
+ * the priority and make it zero.
+ */
+ shrink_zone(0, zone, &sc);
+ return sc.nr_reclaimed;
+}
+
unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
gfp_t gfp_mask,
bool noswap,
unsigned int swappiness)
{
+ struct zonelist *zonelist;
struct scan_control sc = {
.may_writepage = !laptop_mode,
.may_unmap = 1,
@@ -1796,7 +1830,6 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
.isolate_pages = mem_cgroup_isolate_pages,
.nodemask = NULL, /* we don't care the placement */
};
- struct zonelist *zonelist;
sc.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
(GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK);
@@ -1918,6 +1951,7 @@ loop_again:
for (i = 0; i <= end_zone; i++) {
struct zone *zone = pgdat->node_zones + i;
int nr_slab;
+ int nid, zid;
if (!populated_zone(zone))
continue;
@@ -1932,6 +1966,15 @@ loop_again:
temp_priority[i] = priority;
sc.nr_scanned = 0;
note_zone_scanning_priority(zone, priority);
+
+ nid = pgdat->node_id;
+ zid = zone_idx(zone);
+ /*
+ * Call soft limit reclaim before calling shrink_zone.
+ * For now we ignore the return value
+ */
+ mem_cgroup_soft_limit_reclaim(zone, order, sc.gfp_mask,
+ nid, zid);
/*
* We put equal pressure on every zone, unless one
* zone has way too many pages free already.
--
Balbir
* Balbir Singh <[email protected]> [2009-07-10 18:29:50]:
>
> From: Balbir Singh <[email protected]>
>
> New Feature: Soft limits for memory resource controller.
>
> Here is v9 of the new soft limit implementation. Soft limits is a new feature
> for the memory resource controller, something similar has existed in the
> group scheduler in the form of shares. The CPU controllers interpretation
> of shares is very different though.
>
If there are no objections to these patches, could we pick them up for
testing in mmotm.
--
Balbir
On Wed, 15 Jul 2009 09:38:11 +0530
Balbir Singh <[email protected]> wrote:
> * Balbir Singh <[email protected]> [2009-07-10 18:29:50]:
>
> >
> > From: Balbir Singh <[email protected]>
> >
> > New Feature: Soft limits for memory resource controller.
> >
> > Here is v9 of the new soft limit implementation. Soft limits is a new feature
> > for the memory resource controller, something similar has existed in the
> > group scheduler in the form of shares. The CPU controllers interpretation
> > of shares is very different though.
> >
>
> If there are no objections to these patches, could we pick them up for
> testing in mmotm.
>
If any, will be fixed up in mmotm. About behavior, I don't have more things
than I've said. (dealying kswapd is not very good.)
But plz discuss with Vladislav Buzov about implementation details of [2..3/5].
==
[PATCH 1/2] Resource usage threshold notification addition to res_counter (v3)
It seems there are multiple functionalities you can shere with them.
- hierarchical threshold check
- callback or notify agaisnt threshold.
etc..
I'm very happy if all messy things around res_counter+hierarchy are sorted out
before diving into melting pot. I hope both of you have nice interfaces and
keep res_counter neat.
Thanks,
-Kame
* KAMEZAWA Hiroyuki <[email protected]> [2009-07-15 13:33:24]:
> On Wed, 15 Jul 2009 09:38:11 +0530
> Balbir Singh <[email protected]> wrote:
>
> > * Balbir Singh <[email protected]> [2009-07-10 18:29:50]:
> >
> > >
> > > From: Balbir Singh <[email protected]>
> > >
> > > New Feature: Soft limits for memory resource controller.
> > >
> > > Here is v9 of the new soft limit implementation. Soft limits is a new feature
> > > for the memory resource controller, something similar has existed in the
> > > group scheduler in the form of shares. The CPU controllers interpretation
> > > of shares is very different though.
> > >
> >
> > If there are no objections to these patches, could we pick them up for
> > testing in mmotm.
> >
>
> If any, will be fixed up in mmotm. About behavior, I don't have more things
> than I've said. (dealying kswapd is not very good.)
>
> But plz discuss with Vladislav Buzov about implementation details of [2..3/5].
> ==
> [PATCH 1/2] Resource usage threshold notification addition to res_counter (v3)
>
> It seems there are multiple functionalities you can shere with them.
>
> - hierarchical threshold check
> - callback or notify agaisnt threshold.
> etc..
>
> I'm very happy if all messy things around res_counter+hierarchy are sorted out
> before diving into melting pot. I hope both of you have nice interfaces and
> keep res_counter neat.
>
I do see scope for reuse, but I've not yet gotten to reviewing v3 of
the patches. I will, I could potentially get him to base his patches
on top of this. One of the interesting things that Paul pointed out
was of global state.
--
Balbir
> * Balbir Singh <[email protected]> [2009-07-10 18:29:50]:
>
> >
> > From: Balbir Singh <[email protected]>
> >
> > New Feature: Soft limits for memory resource controller.
> >
> > Here is v9 of the new soft limit implementation. Soft limits is a new feature
> > for the memory resource controller, something similar has existed in the
> > group scheduler in the form of shares. The CPU controllers interpretation
> > of shares is very different though.
> >
>
> If there are no objections to these patches, could we pick them up for
> testing in mmotm.
Sorry, I haven't review this patch series. please give me few days.
* KOSAKI Motohiro <[email protected]> [2009-07-15 14:28:19]:
> > * Balbir Singh <[email protected]> [2009-07-10 18:29:50]:
> >
> > >
> > > From: Balbir Singh <[email protected]>
> > >
> > > New Feature: Soft limits for memory resource controller.
> > >
> > > Here is v9 of the new soft limit implementation. Soft limits is a new feature
> > > for the memory resource controller, something similar has existed in the
> > > group scheduler in the form of shares. The CPU controllers interpretation
> > > of shares is very different though.
> > >
> >
> > If there are no objections to these patches, could we pick them up for
> > testing in mmotm.
>
> Sorry, I haven't review this patch series. please give me few days.
>
Sure, could you see if the reclaim bits are better now.
--
Thanks,
Balbir
very sorry for the delaying.
> @@ -1918,6 +1951,7 @@ loop_again:
> for (i = 0; i <= end_zone; i++) {
> struct zone *zone = pgdat->node_zones + i;
> int nr_slab;
> + int nid, zid;
>
> if (!populated_zone(zone))
> continue;
> @@ -1932,6 +1966,15 @@ loop_again:
> temp_priority[i] = priority;
> sc.nr_scanned = 0;
> note_zone_scanning_priority(zone, priority);
> +
> + nid = pgdat->node_id;
> + zid = zone_idx(zone);
> + /*
> + * Call soft limit reclaim before calling shrink_zone.
> + * For now we ignore the return value
> + */
> + mem_cgroup_soft_limit_reclaim(zone, order, sc.gfp_mask,
> + nid, zid);
> /*
> * We put equal pressure on every zone, unless one
> * zone has way too many pages free already.
In this part:
Acked-by: KOSAKI Motohiro <[email protected]>
* Balbir Singh <[email protected]> [2009-07-10 18:29:50]:
>
> From: Balbir Singh <[email protected]>
>
> New Feature: Soft limits for memory resource controller.
>
> Here is v9 of the new soft limit implementation. Soft limits is a new feature
> for the memory resource controller, something similar has existed in the
> group scheduler in the form of shares. The CPU controllers interpretation
> of shares is very different though.
>
> Soft limits are the most useful feature to have for environments where
> the administrator wants to overcommit the system, such that only on memory
> contention do the limits become active. The current soft limits implementation
> provides a soft_limit_in_bytes interface for the memory controller and not
> for memory+swap controller. The implementation maintains an RB-Tree of groups
> that exceed their soft limit and starts reclaiming from the group that
> exceeds this limit by the maximum amount.
>
> v9 attempts to address several review comments for v8 by Kamezawa, including
> moving over to an event based approach for soft limit rb tree management,
> simplification of data structure names and many others. Comments not
> addressed have been answered via email or I've added comments in the code.
>
> TODOs
>
> 1. The current implementation maintains the delta from the soft limit
> and pushes back groups to their soft limits, a ratio of delta/soft_limit
> might be more useful
>
Hi, Andrew,
Could you please pick up this patchset for testing in -mm, both
Kamezawa-San and Kosaki-San have looked at the patches. I think they
are ready for testing in mmotm.
--
Balbir
* KOSAKI Motohiro <[email protected]> [2009-07-21 00:20:38]:
>
> very sorry for the delaying.
>
No problem, thanks for the review and ack
>
> > @@ -1918,6 +1951,7 @@ loop_again:
> > for (i = 0; i <= end_zone; i++) {
> > struct zone *zone = pgdat->node_zones + i;
> > int nr_slab;
> > + int nid, zid;
> >
> > if (!populated_zone(zone))
> > continue;
> > @@ -1932,6 +1966,15 @@ loop_again:
> > temp_priority[i] = priority;
> > sc.nr_scanned = 0;
> > note_zone_scanning_priority(zone, priority);
> > +
> > + nid = pgdat->node_id;
> > + zid = zone_idx(zone);
> > + /*
> > + * Call soft limit reclaim before calling shrink_zone.
> > + * For now we ignore the return value
> > + */
> > + mem_cgroup_soft_limit_reclaim(zone, order, sc.gfp_mask,
> > + nid, zid);
> > /*
> > * We put equal pressure on every zone, unless one
> > * zone has way too many pages free already.
>
>
> In this part:
> Acked-by: KOSAKI Motohiro <[email protected]>
>
>
>
--
Balbir
On Mon, 20 Jul 2009 21:18:59 +0530
Balbir Singh <[email protected]> wrote:
> * Balbir Singh <[email protected]> [2009-07-10 18:29:50]:
>
> >
> > From: Balbir Singh <[email protected]>
> >
> > New Feature: Soft limits for memory resource controller.
> >
> > Here is v9 of the new soft limit implementation. Soft limits is a new feature
> > for the memory resource controller, something similar has existed in the
> > group scheduler in the form of shares. The CPU controllers interpretation
> > of shares is very different though.
> >
> > Soft limits are the most useful feature to have for environments where
> > the administrator wants to overcommit the system, such that only on memory
> > contention do the limits become active. The current soft limits implementation
> > provides a soft_limit_in_bytes interface for the memory controller and not
> > for memory+swap controller. The implementation maintains an RB-Tree of groups
> > that exceed their soft limit and starts reclaiming from the group that
> > exceeds this limit by the maximum amount.
> >
> > v9 attempts to address several review comments for v8 by Kamezawa, including
> > moving over to an event based approach for soft limit rb tree management,
> > simplification of data structure names and many others. Comments not
> > addressed have been answered via email or I've added comments in the code.
> >
> > TODOs
> >
> > 1. The current implementation maintains the delta from the soft limit
> > and pushes back groups to their soft limits, a ratio of delta/soft_limit
> > might be more useful
> >
>
>
> Hi, Andrew,
>
> Could you please pick up this patchset for testing in -mm, both
> Kamezawa-San and Kosaki-San have looked at the patches. I think they
> are ready for testing in mmotm.
>
ok, plz go. But please consider to rewrite res_coutner related part in more
generic style, allowing mulitple threshold & callbacks without overheads.
Thanks,
-Kame