2008-06-13 09:22:10

by Kamezawa Hiroyuki

Subject: [PATCH 0/6] memcg: hierarchy updates (v4)

Hi, this is v4 of the memcg hierarchy series, though I changed the title.
Thank you for the many replies to v3.

This is against 2.6.26-rc5-mm3.

I rearranged the patch stack and rewrote much of the code.
I think I have addressed most of the advice in this version. If I missed
anything, please point it out again, sorry.

Balbir, I'd like to write a generic infrastructure that allows both you and me to
implement what we want. So please review the patches with that in mind: check
how res_counter is used and whether my code could have a bad effect on
what you want.

Changelog:
- rearranged patch stack.
- "limit change" handling is divided.
- moves basic res_counter handling to res_counter from memcg.

Short description of patches.
- [1/6] ...a callback for change-of-limit support to res_counter.
- [2/6] ...make use of change-of-limit support in memcg.
- [3/6] ...a special case of implicit change-of-limit at rmdir()
- [4/6] ...a hierarchy support infrastructure for res_counter.
- [5/6] ...HARDWALL hierarchy support in res_counter.
- [6/6] ...HARDWALL hierarchy in memcg.

It seems rc5-mm3 needs more testing, and I will not be able to answer e-mail quickly.
Please check when you have free time ;)

Anyway, I'd like to push [1/6] and [2/6] first. The others will be scheduled
later.

Thanks,
-Kame




2008-06-13 09:24:31

by Kamezawa Hiroyuki

Subject: [PATCH 1/6] res_counter: handle limit change

Add support for a shrink-usage-at-limit-change feature to res_counter.
memcg will use this to drop pages.

Change log: xxx -> v4 (new file.)
- cut out the limit-change part from the hierarchy patch set.
- add a "retry_count" argument to shrink_usage(). This means res_counter
doesn't have to set a default retry loop count.
- res_counter_check_under_val() is added to support subsystems.
- res_counter_init() is now res_counter_init_ops(cnt, NULL)
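
As an illustration of the callback contract described above, here is a minimal userspace sketch, not the kernel code. All names prefixed with toy_ are hypothetical, locking is omitted, and the sketch assumes the intended behavior is to re-check usage after every successful callback and to write the new limit only once usage fits under it.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Userspace model only: no locking; every toy_ name is hypothetical. */
struct toy_counter;

struct toy_ops {
	/* Ask the owner to reduce usage toward val; returns 0 on success. */
	int (*shrink_usage)(struct toy_counter *cnt, unsigned long long val,
			    int retry_count);
};

struct toy_counter {
	unsigned long long usage;
	unsigned long long limit;
	struct toy_ops ops;
};

/* Hypothetical callback: frees at most 4 units per call, gives up after 5 retries. */
static int toy_shrink(struct toy_counter *cnt, unsigned long long val,
		      int retry_count)
{
	unsigned long long excess;

	if (retry_count > 5)
		return -EBUSY;
	if (cnt->usage <= val)
		return 0;
	excess = cnt->usage - val;
	cnt->usage -= excess < 4 ? excess : 4;
	return 0;
}

/*
 * Change the limit. The limit is written only once usage fits under it;
 * otherwise the callback is asked to reclaim and the check is repeated.
 * The retry policy lives in the callback, not in this loop.
 */
static int toy_resize_limit(struct toy_counter *cnt, unsigned long long val)
{
	int retry_count = 0;
	int ret;

	assert(cnt->ops.shrink_usage != NULL);
	for (;;) {
		if (cnt->usage <= val) {
			cnt->limit = val;
			return 0;
		}
		ret = cnt->ops.shrink_usage(cnt, val, retry_count);
		if (ret)
			return ret;
		retry_count++;
	}
}
```

Passing retry_count into the callback is what lets each subsystem choose its own give-up point: the loop itself has no built-in bound and simply stops when the callback reports an error.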

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>

---
Documentation/controllers/resource_counter.txt | 19 +++++-
include/linux/res_counter.h | 33 ++++++++++-
kernel/res_counter.c | 74 ++++++++++++++++++++++++-
3 files changed, 121 insertions(+), 5 deletions(-)

Index: linux-2.6.26-rc5-mm3/include/linux/res_counter.h
===================================================================
--- linux-2.6.26-rc5-mm3.orig/include/linux/res_counter.h
+++ linux-2.6.26-rc5-mm3/include/linux/res_counter.h
@@ -21,6 +21,13 @@
* the helpers described beyond
*/

+struct res_counter;
+struct res_counter_ops {
+ /* called when the subsystem has to reduce the usage. */
+ int (*shrink_usage)(struct res_counter *cnt, unsigned long long val,
+ int retry_count);
+};
+
struct res_counter {
/*
* the current resource consumption level
@@ -39,6 +46,10 @@ struct res_counter {
*/
unsigned long long failcnt;
/*
+ * registered callbacks etc...for res_counter.
+ */
+ struct res_counter_ops ops;
+ /*
* the lock to protect all of the above.
* the routines below consider this to be IRQ-safe
*/
@@ -82,7 +93,13 @@ enum {
* helpers for accounting
*/

-void res_counter_init(struct res_counter *counter);
+void res_counter_init_ops(struct res_counter *counter,
+ struct res_counter_ops *ops);
+
+static inline void res_counter_init(struct res_counter *counter)
+{
+ res_counter_init_ops(counter, NULL);
+}

/*
* charge - try to consume more resource.
@@ -136,6 +153,20 @@ static inline bool res_counter_check_und
return ret;
}

+static inline bool res_counter_check_under_val(struct res_counter *cnt,
+ unsigned long long val)
+{
+ bool ret = false;
+ unsigned long flags;
+
+ spin_lock_irqsave(&cnt->lock, flags);
+ if (cnt->usage <= val)
+ ret = true;
+ spin_unlock_irqrestore(&cnt->lock, flags);
+
+ return ret;
+}
+
static inline void res_counter_reset_max(struct res_counter *cnt)
{
unsigned long flags;
Index: linux-2.6.26-rc5-mm3/kernel/res_counter.c
===================================================================
--- linux-2.6.26-rc5-mm3.orig/kernel/res_counter.c
+++ linux-2.6.26-rc5-mm3/kernel/res_counter.c
@@ -14,10 +14,22 @@
#include <linux/res_counter.h>
#include <linux/uaccess.h>

-void res_counter_init(struct res_counter *counter)
+/**
+ * res_counter_init_ops -- initialize res_counter.
+ * @counter: the res_counter to be initialized
+ * @ops: the res_counter_ops for this res_counter. This argument can be NULL
+ * and is copied.
+ *
+ * init spinlock and set limit to be very very big value.
+ */
+
+void res_counter_init_ops(struct res_counter *counter,
+ struct res_counter_ops *ops)
{
spin_lock_init(&counter->lock);
counter->limit = (unsigned long long)LLONG_MAX;
+ if (ops)
+ counter->ops = *ops;
}

int res_counter_charge_locked(struct res_counter *counter, unsigned long val)
@@ -102,6 +114,46 @@ u64 res_counter_read_u64(struct res_coun
return *res_counter_member(counter, member);
}

+/*
+ * Called when the limit changes if res_counter has ops->shrink_usage.
+ * This function uses shrink usage to below new limit. returns 0 at success.
+ */
+
+static int res_counter_resize_limit(struct res_counter *cnt,
+ unsigned long long val)
+{
+ int retry_count = 0;
+ int ret = -EBUSY;
+ unsigned long flags;
+
+ BUG_ON(!cnt->ops.shrink_usage);
+ while (1) {
+ spin_lock_irqsave(&cnt->lock, flags);
+ if (cnt->usage <= val) {
+ cnt->limit = val;
+ ret = 0;
+ spin_unlock_irqrestore(&cnt->lock, flags);
+ break;
+ }
+ BUG_ON(val > cnt->limit);
+ spin_unlock_irqrestore(&cnt->lock, flags);
+
+ /*
+ * Rest before calling callback().... rest after callback
+ * tends to add difference between the result of callback and
+ * the check in next loop.
+ */
+ cond_resched();
+
+ ret = cnt->ops.shrink_usage(cnt, val, retry_count);
+ if (!ret)
+ break;
+ retry_count++;
+ }
+ return ret;
+}
+
+
ssize_t res_counter_write(struct res_counter *counter, int member,
const char __user *userbuf, size_t nbytes, loff_t *pos,
int (*write_strategy)(char *st_buf, unsigned long long *val))
@@ -133,11 +185,29 @@ ssize_t res_counter_write(struct res_cou
if (*end != '\0')
goto out_free;
}
+ switch (member) {
+ case RES_LIMIT:
+ if (counter->ops.shrink_usage) {
+ ret = res_counter_resize_limit(counter, tmp);
+ goto done;
+ }
+ break;
+ default:
+ /*
+ * Considering future implementation, we'll have to handle
+ * other members and "fallback" will not work well. So, we
+ * avoid to make use of "default" here.
+ */
+ break;
+ }
spin_lock_irqsave(&counter->lock, flags);
val = res_counter_member(counter, member);
*val = tmp;
spin_unlock_irqrestore(&counter->lock, flags);
- ret = nbytes;
+ ret = 0;
+done:
+ if (!ret)
+ ret = nbytes;
out_free:
kfree(buf);
out:
Index: linux-2.6.26-rc5-mm3/Documentation/controllers/resource_counter.txt
===================================================================
--- linux-2.6.26-rc5-mm3.orig/Documentation/controllers/resource_counter.txt
+++ linux-2.6.26-rc5-mm3/Documentation/controllers/resource_counter.txt
@@ -39,7 +39,11 @@ to work with it.
The failcnt stands for "failures counter". This is the number of
resource allocation attempts that failed.

- c. spinlock_t lock
+ e. res_counter_ops.
+ Callbacks for helping resource_counter per each subsystem.
+ - shrink_usage() .... called at limit change (decrease).
+
+ f. spinlock_t lock

Protects changes of the above values.

@@ -141,8 +145,19 @@ counter fields. They are recommended to
failcnt reset to zero


+5. res_counter_ops (Callbacks)

-5. Usage example
+ res_counter_ops is for implementing feedback control from res_counter
+ to the subsystem. Each callback has its own purpose, and a subsystem doesn't
+ need to provide all callbacks. Just implement the necessary ones.
+
+ - shrink_usage(res_counter, newlimit, retry)
+ Called for reducing usage to newlimit, retry is incremented per
+ loop. (See memory resource controller as example.)
+ Returns 0 at success. Any error code is acceptable but -EBUSY will be
+ suitable to show "the kernel can't shrink usage."
+
+6. Usage example

a. Declare a task group (take a look at cgroups subsystem for this) and
fold a res_counter into it

2008-06-13 09:24:50

by Kamezawa Hiroyuki

Subject: [PATCH 2/6] memcg: handle limit change

Add callback for resize_limit().

After this patch, memcg's usage will be reduced to the new limit.
If it cannot be, -EBUSY will be returned to the write() syscall.

This patch also tries to free all pages at force_empty by reusing
the shrink function.
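
The retry behavior described above can be modelled in userspace as follows. This is a sketch under stated assumptions, not the memcg code: toy_group, toy_reclaim() and TOY_RETRIES are hypothetical stand-ins for the memory cgroup, the page reclaim call and the retry bound. It shows why returning 0 on "no progress" still terminates: the caller keeps incrementing retry_count, and the callback refuses once the bound is exceeded.

```c
#include <assert.h>
#include <errno.h>

#define TOY_RETRIES 5	/* hypothetical retry bound */

/* Hypothetical stand-in for a memory cgroup: only part of usage is reclaimable. */
struct toy_group {
	unsigned long long usage;
	unsigned long long reclaimable;
};

/* Frees up to 8 reclaimable units; returns the amount freed (0 = no progress). */
static unsigned long long toy_reclaim(struct toy_group *g)
{
	unsigned long long n = g->reclaimable < 8 ? g->reclaimable : 8;

	if (n > g->usage)
		n = g->usage;
	g->reclaimable -= n;
	g->usage -= n;
	return n;
}

/* Callback model: bounded by retry_count, loops while reclaim makes progress. */
static int toy_shrink_usage_to(struct toy_group *g, unsigned long long val,
			       int retry_count)
{
	if (retry_count > TOY_RETRIES)
		return -EBUSY;
	for (;;) {
		if (g->usage <= val)
			return 0;
		if (toy_reclaim(g) == 0)
			return 0;	/* no progress; caller re-checks and retries */
	}
}

/* force_empty model: reuse the callback with a target of 0. */
static int toy_shrink_all(struct toy_group *g)
{
	int retry_count = 0;
	int ret = 0;

	while (!ret && g->usage > 0) {
		ret = toy_shrink_usage_to(g, 0, retry_count);
		retry_count++;
	}
	return ret;
}
```

With a fully reclaimable group the first call already drains usage to 0; with a partly reclaimable one the loop ends with -EBUSY after the bound is exceeded, leaving the unreclaimable remainder charged.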

Change log: xxx -> v4
- cut out from the memcg hierarchy patch set.
- added retry_count as a new argument.

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>

---
Documentation/controllers/memory.txt | 3 --
mm/memcontrol.c | 47 ++++++++++++++++++++++++++++++++---
2 files changed, 45 insertions(+), 5 deletions(-)

Index: linux-2.6.26-rc5-mm3/mm/memcontrol.c
===================================================================
--- linux-2.6.26-rc5-mm3.orig/mm/memcontrol.c
+++ linux-2.6.26-rc5-mm3/mm/memcontrol.c
@@ -779,6 +779,44 @@ int mem_cgroup_shrink_usage(struct mm_st
}

/*
+ * A callback for shrinking limit, Always GFP_KERNEL.
+ */
+int mem_cgroup_shrink_usage_to(struct res_counter *res, unsigned long long val,
+ int retry_count)
+{
+ struct mem_cgroup *memcg = container_of(res, struct mem_cgroup, res);
+
+ if (retry_count > MEM_CGROUP_RECLAIM_RETRIES)
+ return -EBUSY;
+
+retry:
+ if (res_counter_check_under_val(res, val))
+ return 0;
+
+ cond_resched();
+ if (try_to_free_mem_cgroup_pages(memcg, GFP_KERNEL) == 0)
+ return 0; /* no progress...*/
+
+ goto retry;
+}
+
+/*
+ * Must be called when there are no users of this cgroup.
+ */
+static void memcg_shrink_usage_all(struct mem_cgroup *memcg)
+{
+ int retry_count = 0;
+ int ret = 0;
+
+ while (!ret && !res_counter_check_under_val(&memcg->res, 0)) {
+ ret = mem_cgroup_shrink_usage_to(&memcg->res, 0, retry_count);
+ retry_count++;
+ }
+
+ return;
+}
+
+/*
* This routine traverse page_cgroup in given list and drop them all.
* *And* this routine doesn't reclaim page itself, just removes page_cgroup.
*/
@@ -835,9 +873,10 @@ static int mem_cgroup_force_empty(struct
* active_list <-> inactive_list while we don't take a lock.
* So, we have to do loop here until all lists are empty.
*/
- while (mem->res.usage > 0) {
+ while (!res_counter_check_under_val(&mem->res, 0)) {
if (atomic_read(&mem->css.cgroup->count) > 0)
goto out;
+ memcg_shrink_usage_all(mem);
for_each_node_state(node, N_POSSIBLE)
for (zid = 0; zid < MAX_NR_ZONES; zid++) {
struct mem_cgroup_per_zone *mz;
@@ -1046,13 +1085,15 @@ static void mem_cgroup_free(struct mem_c
vfree(mem);
}

+struct res_counter_ops root_ops = {
+ .shrink_usage = mem_cgroup_shrink_usage_to,
+};

static struct cgroup_subsys_state *
mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
{
struct mem_cgroup *mem;
int node;
-
if (unlikely((cont->parent) == NULL)) {
mem = &init_mem_cgroup;
page_cgroup_cache = KMEM_CACHE(page_cgroup, SLAB_PANIC);
@@ -1062,7 +1103,7 @@ mem_cgroup_create(struct cgroup_subsys *
return ERR_PTR(-ENOMEM);
}

- res_counter_init(&mem->res);
+ res_counter_init_ops(&mem->res, &root_ops);

for_each_node_state(node, N_POSSIBLE)
if (alloc_mem_cgroup_per_zone_info(mem, node))
Index: linux-2.6.26-rc5-mm3/Documentation/controllers/memory.txt
===================================================================
--- linux-2.6.26-rc5-mm3.orig/Documentation/controllers/memory.txt
+++ linux-2.6.26-rc5-mm3/Documentation/controllers/memory.txt
@@ -242,8 +242,7 @@ rmdir() if there are no tasks.
1. Add support for accounting huge pages (as a separate controller)
2. Make per-cgroup scanner reclaim not-shared pages first
3. Teach controller to account for shared-pages
-4. Start reclamation when the limit is lowered
-5. Start reclamation in the background when the limit is
+4. Start reclamation in the background when the limit is
not yet hit but the usage is getting closer

Summary

2008-06-13 09:26:19

by Kamezawa Hiroyuki

Subject: [PATCH 3/6] memcg: reset limit at rmdir

Reset res_counter's limit to 0.
Typically called when a subsystem that uses res_counter is deleted.

Change log: xxx -> v4 (new file)
- cut out from the memcg hierarchy patch set (v3).

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>

---
include/linux/res_counter.h | 2 ++
kernel/res_counter.c | 11 +++++++++++
2 files changed, 13 insertions(+)

Index: linux-2.6.26-rc5-mm3/include/linux/res_counter.h
===================================================================
--- linux-2.6.26-rc5-mm3.orig/include/linux/res_counter.h
+++ linux-2.6.26-rc5-mm3/include/linux/res_counter.h
@@ -117,6 +117,8 @@ int __must_check res_counter_charge_lock
int __must_check res_counter_charge(struct res_counter *counter,
unsigned long val);

+int res_counter_reset_limit(struct res_counter *counter);
+
/*
* uncharge - tell that some portion of the resource is released
*
Index: linux-2.6.26-rc5-mm3/kernel/res_counter.c
===================================================================
--- linux-2.6.26-rc5-mm3.orig/kernel/res_counter.c
+++ linux-2.6.26-rc5-mm3/kernel/res_counter.c
@@ -153,6 +153,17 @@ static int res_counter_resize_limit(stru
return ret;
}

+/**
+ * res_counter_reset_limit - reset limit to be 0.
+ * @res: the res_counter to be reset.
+ *
+ * res_counter->limit is resized to be 0. return 0 at success.
+ */
+
+int res_counter_reset_limit(struct res_counter *res)
+{
+ return res_counter_resize_limit(res, 0);
+}

ssize_t res_counter_write(struct res_counter *counter, int member,
const char __user *userbuf, size_t nbytes, loff_t *pos,

2008-06-13 09:28:31

by Kamezawa Hiroyuki

Subject: [PATCH 4/6] res_counter: basic hierarchy support

Add hierarchy support to res_counter. This patch itself supports only the
"No Hierarchy" model, as the default/basic hierarchy system.

Changelog: v3 -> v4.
- cut out from hardwall hierarchy patch set.
- just support "No hierarchy" model.
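
The inheritance and set-ops rules introduced by this patch can be sketched in userspace as below. This is an illustrative model, not the kernel code: the toy_ names are hypothetical, and TOY_OTHER_MODEL is an invented second model value that exists only so the rejection path can be exercised (this patch itself defines just the no-hierarchy model).

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* TOY_OTHER_MODEL is hypothetical; it stands for any non-default model. */
enum toy_model {
	TOY_NO_HIERARCHY,
	TOY_OTHER_MODEL,
};

struct toy_hops {
	enum toy_model hierarchy_model;
};

struct toy_hcounter {
	struct toy_hcounter *parent;	/* NULL for the root */
	struct toy_hops ops;
};

/* A new counter copies its parent's ops; a root starts with no hierarchy. */
static void toy_init_hierarchy(struct toy_hcounter *c,
			       struct toy_hcounter *parent)
{
	c->ops.hierarchy_model = TOY_NO_HIERARCHY;
	if (parent)
		c->ops = parent->ops;	/* copied by value, not shared */
	c->parent = parent;
}

/* Ops may change only for a root or under a no-hierarchy parent. */
static int toy_set_ops(struct toy_hcounter *c, const struct toy_hops *ops)
{
	struct toy_hcounter *parent = c->parent;

	if (parent && parent->ops.hierarchy_model != TOY_NO_HIERARCHY)
		return -EINVAL;
	c->ops = *ops;
	return 0;
}
```

Because ops are copied at creation time, changing a parent's model later does not silently change existing children; the rule in toy_set_ops() only prevents a child from opting out of a model its parent already enforces.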

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
Documentation/controllers/resource_counter.txt | 27 +++++-
include/linux/res_counter.h | 15 +++
kernel/res_counter.c | 107 ++++++++++++++++++++-----
mm/memcontrol.c | 1
4 files changed, 129 insertions(+), 21 deletions(-)

Index: linux-2.6.26-rc5-mm3/include/linux/res_counter.h
===================================================================
--- linux-2.6.26-rc5-mm3.orig/include/linux/res_counter.h
+++ linux-2.6.26-rc5-mm3/include/linux/res_counter.h
@@ -21,8 +21,13 @@
* the helpers described beyond
*/

+enum res_cont_hierarchy_model {
+ RES_CONT_NO_HIERARCHY,
+};
+
struct res_counter;
struct res_counter_ops {
+ enum res_cont_hierarchy_model hierarchy_model;
/* called when the subsystem has to reduce the usage. */
int (*shrink_usage)(struct res_counter *cnt, unsigned long long val,
int retry_count);
@@ -46,6 +51,10 @@ struct res_counter {
*/
unsigned long long failcnt;
/*
+ * parent of this counter in hierarchy. if root, this is NULL.
+ */
+ struct res_counter *parent;
+ /*
* registered callbacks etc...for res_counter.
*/
struct res_counter_ops ops;
@@ -101,6 +110,12 @@ static inline void res_counter_init(stru
res_counter_init_ops(counter, NULL);
}

+void res_counter_init_hierarchy(struct res_counter *counter,
+ struct res_counter *parent);
+
+int res_counter_set_ops(struct res_counter *counter,
+ struct res_counter_ops *ops);
+
/*
* charge - try to consume more resource.
*
Index: linux-2.6.26-rc5-mm3/kernel/res_counter.c
===================================================================
--- linux-2.6.26-rc5-mm3.orig/kernel/res_counter.c
+++ linux-2.6.26-rc5-mm3/kernel/res_counter.c
@@ -30,8 +30,70 @@ void res_counter_init_ops(struct res_cou
counter->limit = (unsigned long long)LLONG_MAX;
if (ops)
counter->ops = *ops;
+ counter->parent = NULL;
+}
+
+void __res_counter_init_hierarchy_core(struct res_counter *counter)
+{
+ switch (counter->ops.hierarchy_model) {
+ case RES_CONT_NO_HIERARCHY:
+ counter->limit = (unsigned long long)LLONG_MAX;
+ break;
+ default:
+ break;
+ }
+ return;
+}
+
+
+/**
+ * res_counter_init_hierarchy() -- initialize res_counter under some hierarchy.
+ * @counter: a counter will be initialized.
+ * @parent: parent of counter.
+ *
+ * parent->ops is copied to counter->ops and counter will be initialized
+ * to be suitable style for the hierarchy model.
+ */
+void res_counter_init_hierarchy(struct res_counter *counter,
+ struct res_counter *parent)
+{
+ struct res_counter_ops *ops = NULL;
+
+ if (parent)
+ ops = &parent->ops;
+ res_counter_init_ops(counter, ops);
+ counter->parent = parent;
+
+ __res_counter_init_hierarchy_core(counter);
}

+/**
+ * res_counter_set_ops() -- replace counter->ops with the passed ops.
+ * @counter: the counter whose ops are set.
+ * @ops: res_counter_ops
+ *
+ * This operation is allowed only when there is no parent or parent's
+ * hierarchy_model == RES_CONT_NO_HIERARCHY. returns 0 at success.
+ */
+
+int res_counter_set_ops(struct res_counter *counter,
+ struct res_counter_ops *ops)
+{
+ struct res_counter *parent;
+ /*
+ * This operation is allowed only when parents's hierarchy
+ * is NO_HIERARCHY or this is ROOT.
+ */
+ parent = counter->parent;
+ if (parent && parent->ops.hierarchy_model != RES_CONT_NO_HIERARCHY)
+ return -EINVAL;
+
+ counter->ops = *ops;
+
+ return 0;
+}
+
+
int res_counter_charge_locked(struct res_counter *counter, unsigned long val)
{
if (counter->usage + val > counter->limit) {
@@ -125,30 +187,39 @@ static int res_counter_resize_limit(stru
int retry_count = 0;
int ret = -EBUSY;
unsigned long flags;
+ enum model = RES_CONT_NO_HIERARCHY;

BUG_ON(!cnt->ops.shrink_usage);
- while (1) {
- spin_lock_irqsave(&cnt->lock, flags);
- if (cnt->usage <= val) {
- cnt->limit = val;
- ret = 0;
- spin_unlock_irqrestore(&cnt->lock, flags);
- break;
- }
- BUG_ON(val > cnt->limit);
- spin_unlock_irqrestore(&cnt->lock, flags);

+ switch (model) {
+ case RES_CONT_NO_HIERARCHY:
/*
- * Rest before calling callback().... rest after callback
- * tends to add difference between the result of callback and
- * the check in next loop.
+ * shrink usage to be below the new limit.
*/
- cond_resched();
+ while (1) {
+ spin_lock_irqsave(&cnt->lock, flags);
+ if (cnt->usage <= val) {
+ cnt->limit = val;
+ ret = 0;
+ }
+ spin_unlock_irqrestore(&cnt->lock, flags);
+ if (!ret)
+ break;
+ /*
+ * Rest before calling callback().... rest after
+ * callback tends to add difference between the result
+ * of callback and the check in next loop.
+ */
+ cond_resched();

- ret = cnt->ops.shrink_usage(cnt, val, retry_count);
- if (!ret)
- break;
- retry_count++;
+ ret = cnt->ops.shrink_usage(cnt, val, retry_count);
+ if (!ret)
+ break;
+ retry_count++;
+ }
+ break;
+ default:
+ BUG();
}
return ret;
}
Index: linux-2.6.26-rc5-mm3/Documentation/controllers/resource_counter.txt
===================================================================
--- linux-2.6.26-rc5-mm3.orig/Documentation/controllers/resource_counter.txt
+++ linux-2.6.26-rc5-mm3/Documentation/controllers/resource_counter.txt
@@ -39,11 +39,14 @@ to work with it.
The failcnt stands for "failures counter". This is the number of
resource allocation attempts that failed.

- e. res_counter_ops.
+ e. parent
+ parent of this res_counter under hierarchy.
+
+ f. res_counter_ops.
Callbacks for helping resource_counter per each subsystem.
- shrink_usage() .... called at limit change (decrease).

- f. spinlock_t lock
+ g. spinlock_t lock

Protects changes of the above values.

@@ -157,7 +160,25 @@ counter fields. They are recommended to
Returns 0 at success. Any error code is acceptable but -EBUSY will be
suitable to show "the kernel can't shrink usage."

-6. Usage example
+6. Hierarchy
+
+ Groups of res_counter can be controlled under some tree (cgroup tree).
+ Taking the tree into account, res_counter can be under some hierarchical
+ control. The res_counter itself supports hierarchy_model and calls
+ registered callbacks at suitable events.
+
+ To keep the hierarchy sane, the hierarchy_model of a res_counter can be
+ changed only when its parent's hierarchy_model is RES_CONT_NO_HIERARCHY.
+ res_counter doesn't count its children by itself, so a subsystem must
+ check for itself that there are no children when that matters.
+ (We don't want to fully duplicate cgroup's hierarchy. The cost of remembering
+ the parent is cheap.)
+
+ a. Independent hierarchy (RES_CONT_NO_HIERARCHY) model
+ This is no relationship between parent and children.
+
+
+7. Usage example

a. Declare a task group (take a look at cgroups subsystem for this) and
fold a res_counter into it
Index: linux-2.6.26-rc5-mm3/mm/memcontrol.c
===================================================================
--- linux-2.6.26-rc5-mm3.orig/mm/memcontrol.c
+++ linux-2.6.26-rc5-mm3/mm/memcontrol.c
@@ -1086,6 +1086,7 @@ static void mem_cgroup_free(struct mem_c
}

struct res_counter_ops root_ops = {
+ .hierarchy_model = RES_CONT_NO_HIERARCHY,
.shrink_usage = mem_cgroup_shrink_usage_to,
};

2008-06-13 09:31:31

by Kamezawa Hiroyuki

Subject: [PATCH 5/6] res_counter: HARDWALL hierarchy

This patch adds a new hierarchy model, called Hardwall Hierarchy, to res_counter.

Change log v3 -> v4.
- restructured the whole set, cut out from memcg hierarchy patch set.
- just handles HardWall Hierarchy.
- renamed variables and functions, again.

HardWall implements the following model:
- A cgroup's tree means a hierarchy of resources.
- All of a child's resources are moved from its parent.
- The resource moved to children is charged as the parent's usage.
- The resource moves when child->limit is changed.
- The sum of the resources for children and the parent's own usage is limited by "limit".

This implies
- No dynamic automatic hierarchy balancing in the kernel.
- Each resource is isolated completely.
- The kernel just supports resource-move-at-change-in-limit.
- The user (middleware) is responsible for keeping the hierarchy well balanced.
Good balance can be achieved by changing limits from user land.
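
The accounting rules above can be made concrete with a small userspace model. It is a sketch under stated assumptions, not the patch's code: the toy_ names are hypothetical, locking is omitted, and reclaim through shrink_usage() is left out, so the sketch fails with -EBUSY where the kernel would first try to shrink usage.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Userspace model only: no locking, no reclaim; toy_ names are hypothetical. */
struct toy_wall {
	unsigned long long usage;		/* own usage + resources lent to children */
	unsigned long long limit;
	unsigned long long used_by_children;	/* sum of the children's limits */
	struct toy_wall *parent;
};

/* Changing a child's limit moves resources between the child and its parent. */
static int toy_set_child_limit(struct toy_wall *child, unsigned long long val)
{
	struct toy_wall *parent = child->parent;
	unsigned long long diff;

	assert(parent != NULL);
	if (val > child->limit) {
		/* Borrow: the extra amount is charged to the parent's usage. */
		diff = val - child->limit;
		if (parent->usage > parent->limit ||
		    diff > parent->limit - parent->usage)
			return -EBUSY;
		parent->usage += diff;
		parent->used_by_children += diff;
	} else {
		/* Return: only possible once the child's usage fits the new limit. */
		if (child->usage > val)
			return -EBUSY;
		diff = child->limit - val;
		assert(parent->used_by_children >= diff);
		assert(parent->usage >= diff);
		parent->usage -= diff;
		parent->used_by_children -= diff;
	}
	child->limit = val;
	return 0;
}
```

For example, with a parent limit of 100 and 10 units of its own usage, giving one child a limit of 60 raises the parent's usage to 70, so a second child can be given at most 30 until the first child's limit is lowered.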

I think a hierarchy has 4 characteristics.

- fairness ... how fairness is kept under the policy

- performance ... should be _fast_. multi-level resource balancing tends
to use a large amount of CPU and can cause soft lockups.

- predictability ... resource management is usually used for resource
isolation. the kernel must not break the isolation or the
user's ability to predict an application's progress.

- flexibility ... some sophisticated dynamic resource balancing with
soft limits is welcome when the user doesn't want strict
resource isolation or cannot correctly estimate how much
they need.

This model allows resources to move only between a parent and its children.
A resource is moved to a child when the child declares the amount of resource
it will use (via its limit).
Automatic resource balancing is not supported in this code. This means
this model is useful when a user wants strict resource isolation under
a hierarchy.

- fairness ... ??? no resource sharing. works as specified by users.
- performance ... good. each resource is encapsulated at its own level.
- predictability ... good. resources are completely isolated. balancing only
occurs when a limit changes.
- flexibility ... bad. no flexibility and no scheduling at the kernel level.
needs middleware's help.

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>

---
Documentation/controllers/resource_counter.txt | 9 +
include/linux/res_counter.h | 6 +
kernel/res_counter.c | 140 ++++++++++++++++++++++++-
3 files changed, 154 insertions(+), 1 deletion(-)

Index: linux-2.6.26-rc5-mm3/include/linux/res_counter.h
===================================================================
--- linux-2.6.26-rc5-mm3.orig/include/linux/res_counter.h
+++ linux-2.6.26-rc5-mm3/include/linux/res_counter.h
@@ -23,6 +23,7 @@

enum res_cont_hierarchy_model {
RES_CONT_NO_HIERARCHY,
+ RES_CONT_HARDWALL_HIERARCHY,
};

struct res_counter;
@@ -55,6 +56,10 @@ struct res_counter {
*/
struct res_counter *parent;
/*
+ * resources assigned to children.
+ */
+ unsigned long long used_by_children;
+ /*
* registered callbacks etc...for res_counter.
*/
struct res_counter_ops ops;
@@ -96,6 +101,7 @@ enum {
RES_MAX_USAGE,
RES_LIMIT,
RES_FAILCNT,
+ RES_USED_BY_CHILDREN,
};

/*
Index: linux-2.6.26-rc5-mm3/kernel/res_counter.c
===================================================================
--- linux-2.6.26-rc5-mm3.orig/kernel/res_counter.c
+++ linux-2.6.26-rc5-mm3/kernel/res_counter.c
@@ -39,6 +39,10 @@ void __res_counter_init_hierarchy_core(s
case RES_CONT_NO_HIERARCHY:
counter->limit = (unsigned long long)LLONG_MAX;
break;
+ case RES_CONT_HARDWALL_HIERARCHY:
+ counter->limit = 0;
+ counter->used_by_children = 0;
+ break;
default:
break;
}
@@ -148,6 +152,8 @@ res_counter_member(struct res_counter *c
return &counter->limit;
case RES_FAILCNT:
return &counter->failcnt;
+ case RES_USED_BY_CHILDREN:
+ return &counter->used_by_children;
};

BUG();
@@ -177,6 +183,114 @@ u64 res_counter_read_u64(struct res_coun
}

/*
+ * Move resource from a parent to a child.
+ * parent->usage += val
+ * parent->used_by_children += val
+ * child->limit += val
+ * To do this, ops->shrink_usage() is called against parent.
+ *
+ * Returns 0 at success.
+ * Returns -EBUSY or return code of ops->shrink_usage().
+ */
+static int res_counter_borrow_resource(struct res_counter *child,
+ unsigned long long val)
+{
+ struct res_counter *parent = child->parent;
+ unsigned long flags;
+ unsigned long long diff;
+ int ret;
+ int retry_count = 0;
+
+ BUG_ON(!parent);
+
+ spin_lock_irqsave(&child->lock, flags);
+ diff = val - child->limit;
+ spin_unlock_irqrestore(&child->lock, flags);
+
+ while (1) {
+ ret = -EBUSY;
+ spin_lock_irqsave(&parent->lock, flags);
+ if (parent->usage + diff <= parent->limit) {
+ parent->used_by_children += diff;
+ parent->usage += diff;
+ break;
+ }
+ spin_unlock_irqrestore(&parent->lock, flags);
+
+ if (!parent->ops.shrink_usage)
+ goto fail;
+ cond_resched();
+ ret = parent->ops.shrink_usage(parent, val, retry_count);
+ if (ret)
+ goto fail;
+ retry_count++;
+ }
+ ret = 0;
+ spin_unlock_irqrestore(&parent->lock, flags);
+
+ spin_lock_irqsave(&child->lock, flags);
+ child->limit = val;
+ spin_unlock_irqrestore(&child->lock, flags);
+fail:
+ return ret;
+}
+
+
+/*
+ * Move resource from a child to a parent.
+ * parent->usage -= val
+ * parent->used_by_children -= val
+ * child->limit -= val
+ * To do this, ops->shrink_usage() is called against child.
+ *
+ * Returns 0 at success.
+ * Returns -EBUSY or return code passed by ops->shrink_usage().
+ */
+
+static int res_counter_return_resource(struct res_counter *child,
+ unsigned long long val)
+
+{
+ unsigned long flags;
+ struct res_counter *parent = child->parent;
+ int retry_count = 0;
+ unsigned long long diff;
+ int ret;
+
+ BUG_ON(!parent);
+
+ while (1) {
+ ret = -EBUSY;
+ spin_lock_irqsave(&child->lock, flags);
+ if (child->usage <= val) {
+ diff = child->limit - val;
+ child->limit = val;
+ break;
+ }
+ spin_unlock_irqrestore(&child->lock, flags);
+
+ if (!child->ops.shrink_usage)
+ goto fail;
+
+ ret = child->ops.shrink_usage(child, val, retry_count);
+ if (ret)
+ goto fail;
+ retry_count++;
+ }
+ ret = 0;
+ spin_unlock_irqrestore(&child->lock, flags);
+
+ spin_lock_irqsave(&parent->lock, flags);
+ BUG_ON(parent->used_by_children < val);
+ BUG_ON(parent->usage < val);
+ parent->used_by_children -= diff;
+ parent->usage -= diff;
+ spin_unlock_irqrestore(&parent->lock, flags);
+fail:
+ return ret;
+}
+
+/*
* Called when the limit changes if res_counter has ops->shrink_usage.
* This function uses shrink usage to below new limit. returns 0 at success.
*/
@@ -187,10 +301,15 @@ static int res_counter_resize_limit(stru
int retry_count = 0;
int ret = -EBUSY;
unsigned long flags;
- enum model = RES_CONT_NO_HIERARCHY;
+ enum res_cont_hierarchy_model model = RES_CONT_NO_HIERARCHY;
+ struct res_counter *parent;

BUG_ON(!cnt->ops.shrink_usage);

+ parent = cnt->parent;
+ if (parent)
+ model = parent->ops.hierarchy_model;
+
switch (model) {
case RES_CONT_NO_HIERARCHY:
/*
@@ -218,6 +337,25 @@ static int res_counter_resize_limit(stru
retry_count++;
}
break;
+ case RES_CONT_HARDWALL_HIERARCHY:
+ /*
+ * Both of increasing/decreasing limit have to interact with
+ * parent.
+ */
+ {
+ int direction;
+ spin_lock_irqsave(&cnt->lock, flags);
+ if (val > cnt->limit)
+ direction = 1; /* increase */
+ else
+ direction = 0; /* decrease */
+ spin_unlock_irqrestore(&cnt->lock, flags);
+ if (direction)
+ ret = res_counter_borrow_resource(cnt, val);
+ else
+ ret = res_counter_return_resource(cnt, val);
+ }
+ break;
default:
BUG();
}
Index: linux-2.6.26-rc5-mm3/Documentation/controllers/resource_counter.txt
===================================================================
--- linux-2.6.26-rc5-mm3.orig/Documentation/controllers/resource_counter.txt
+++ linux-2.6.26-rc5-mm3/Documentation/controllers/resource_counter.txt
@@ -177,6 +177,15 @@ counter fields. They are recommended to
a. Independent hierarchy (RES_CONT_NO_HIERARCHY) model
This is no relationship between parent and children.

+ b. Strict Hard-limit (RES_CONT_HARDWALL_HIERARCHY) model
+ This model allows strict resource isolation under hierarchy.
+ The rule is.
+ - A cgroup's tree means hierarchy of resource.
+ - All child's resource is moved from its parents.
+ - The resource moved to children is charged as parent's usage.
+ - The resource moves when child->limit is changed.
+ - The sum of resource for children and its own usage is limited by "limit".
+ See controllers/memory.txt if unsure. There will be an example.

7. Usage example

2008-06-13 09:32:37

by Kamezawa Hiroyuki

Subject: [PATCH 6/6] memcg: HARDWALL hierarchy

Support hardwall hierarchy (and no-hierarchy) in memcg.

Change log: v3->v4
- cut out from memcg hierarchy patch set v4.
- no major changes, but some functions were moved to res_counter
and made more generic.

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>

---
Documentation/controllers/memory.txt | 57 +++++++++++++++++++++++++++++-
mm/memcontrol.c | 65 +++++++++++++++++++++++++++++++++--
2 files changed, 118 insertions(+), 4 deletions(-)

Index: linux-2.6.26-rc5-mm3/mm/memcontrol.c
===================================================================
--- linux-2.6.26-rc5-mm3.orig/mm/memcontrol.c
+++ linux-2.6.26-rc5-mm3/mm/memcontrol.c
@@ -941,6 +941,48 @@ static int mem_force_empty_write(struct
return mem_cgroup_force_empty(mem_cgroup_from_cont(cont));
}

+
+static u64 mem_cgroup_hierarchy_read(struct cgroup *cgrp, struct cftype *cft)
+{
+ struct mem_cgroup *mem;
+
+ mem = mem_cgroup_from_cont(cgrp);
+
+ return mem->res.ops.hierarchy_model;
+}
+
+static int
+mem_cgroup_hierarchy_write(struct cgroup *cgrp, struct cftype *cft, u64 val)
+{
+ struct mem_cgroup *mem;
+ struct res_counter_ops ops;
+ int ret = -EBUSY;
+
+ mem = mem_cgroup_from_cont(cgrp);
+
+ if (!list_empty(&cgrp->children))
+ return ret;
+
+ switch ((int)val) {
+ case RES_CONT_NO_HIERARCHY:
+ ops.hierarchy_model = RES_CONT_NO_HIERARCHY;
+ ops.shrink_usage = mem_cgroup_shrink_usage_to;
+ ret = res_counter_set_ops(&mem->res, &ops);
+ break;
+ case RES_CONT_HARDWALL_HIERARCHY:
+ ops.hierarchy_model = RES_CONT_HARDWALL_HIERARCHY;
+ ops.shrink_usage = mem_cgroup_shrink_usage_to;
+ ret = res_counter_set_ops(&mem->res, &ops);
+ break;
+ default:
+ ret = -EINVAL;
+ break;
+ }
+
+ return ret;
+}
+
+
static const struct mem_cgroup_stat_desc {
const char *msg;
u64 unit;
@@ -951,6 +993,9 @@ static const struct mem_cgroup_stat_desc
[MEM_CGROUP_STAT_PGPGOUT_COUNT] = {"pgpgout", 1, },
};

+
+
+
static int mem_control_stat_show(struct cgroup *cont, struct cftype *cft,
struct cgroup_map_cb *cb)
{
@@ -1024,6 +1069,16 @@ static struct cftype mem_cgroup_files[]
.name = "stat",
.read_map = mem_control_stat_show,
},
+ {
+ .name = "used_by_children",
+ .private = RES_USED_BY_CHILDREN,
+ .read_u64 = mem_cgroup_read,
+ },
+ {
+ .name = "hierarchy_model",
+ .write_u64 = mem_cgroup_hierarchy_write,
+ .read_u64 = mem_cgroup_hierarchy_read,
+ },
};

static int alloc_mem_cgroup_per_zone_info(struct mem_cgroup *mem, int node)
@@ -1093,18 +1148,23 @@ struct res_counter_ops root_ops = {
static struct cgroup_subsys_state *
mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
{
- struct mem_cgroup *mem;
+ struct mem_cgroup *mem, *parent;
int node;
if (unlikely((cont->parent) == NULL)) {
mem = &init_mem_cgroup;
page_cgroup_cache = KMEM_CACHE(page_cgroup, SLAB_PANIC);
+ parent = NULL;
} else {
mem = mem_cgroup_alloc();
if (!mem)
return ERR_PTR(-ENOMEM);
+ parent = mem_cgroup_from_cont(cont->parent);
}

- res_counter_init_ops(&mem->res, &root_ops);
+ if (!parent)
+ res_counter_init_ops(&mem->res, &root_ops);
+ else
+ res_counter_init_hierarchy(&mem->res, &parent->res);

for_each_node_state(node, N_POSSIBLE)
if (alloc_mem_cgroup_per_zone_info(mem, node))
@@ -1124,6 +1184,7 @@ static void mem_cgroup_pre_destroy(struc
{
struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
mem_cgroup_force_empty(mem);
+ res_counter_reset_limit(&mem->res);
}

static void mem_cgroup_destroy(struct cgroup_subsys *ss,
Index: linux-2.6.26-rc5-mm3/Documentation/controllers/memory.txt
===================================================================
--- linux-2.6.26-rc5-mm3.orig/Documentation/controllers/memory.txt
+++ linux-2.6.26-rc5-mm3/Documentation/controllers/memory.txt
@@ -154,7 +154,7 @@ The memory controller uses the following

0. Configuration

-a. Enable CONFIG_CGROUPS
+a. Enable CONFIG_CGROUPS
b. Enable CONFIG_RESOURCE_COUNTERS
c. Enable CONFIG_CGROUP_MEM_RES_CTLR

@@ -237,7 +237,58 @@ cgroup might have some charge associated
tasks have migrated away from it. Such charges are automatically dropped at
rmdir() if there are no tasks.

-5. TODO
+5. Supported Hierarchy Model
+
+Currently, the memory controller supports the following hierarchy models in
+the kernel. (See also resource_counter.txt.)
+
+Two files are related to hierarchy:
+ - memory.hierarchy_model
+ - memory.for_children
+
+Basic rules:
+ - Hierarchy can be set per cgroup.
+ - A child inherits its parent's hierarchy model at creation.
+ - A child can change its hierarchy model only when the parent's model is
+ NO_HIERARCHY and the child has no children.
+
+
+5.1 NO_HIERARCHY
+ - Each cgroup is independent of the others.
+ - When memory.hierarchy_model is 0, NO_HIERARCHY is used.
+ Under this model, there are no controls based on the cgroup tree.
+ This is the default model of the root cgroup.
+
+5.2 HARDWALL_HIERARCHY
+ - A child is an isolated portion of the parent.
+ - When memory.hierarchy_model is 1, HARDWALL_HIERARCHY is used.
+ In this model, a child's limit is charged as the parent's usage.
+
+ Hard-Wall Hierarchy Example)
+ 1) Assume a cgroup with a 1GB limit (and no tasks belonging to it yet).
+ - group A limit=1G, usage=0M.
+
+ 2) create groups B and C under A.
+ - group A limit=1G, usage=0M, for_children=0M
+ - group B limit=0M, usage=0M.
+ - group C limit=0M, usage=0M.
+
+ 3) increase group B's limit to 300M.
+ - group A limit=1G, usage=300M, for_children=300M
+ - group B limit=300M, usage=0M.
+ - group C limit=0M, usage=0M.
+
+ 4) increase group C's limit to 500M
+ - group A limit=1G, usage=800M, for_children=800M
+ - group B limit=300M, usage=0M.
+ - group C limit=500M, usage=0M.
+
+ 5) reduce group B's limit to 100M
+ - group A limit=1G, usage=600M, for_children=600M.
+ - group B limit=100M, usage=0M.
+ - group C limit=500M, usage=0M.
+
+6. TODO

1. Add support for accounting huge pages (as a separate controller)
2. Make per-cgroup scanner reclaim not-shared pages first
@@ -274,3 +325,5 @@ References
http://lkml.org/lkml/2007/8/17/69
12. Corbet, Jonathan, Controlling memory use in cgroups,
http://lwn.net/Articles/243795/
+
+ LocalWords: lru CONFIG CTLR
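
The hard-wall example in the memory.txt hunk above can be followed with a small user-space model. This is only a sketch under assumptions: struct hw_counter and hw_set_limit() are hypothetical names that do not appear in the patch, values are in megabytes, there is no locking, and a child whose usage exceeds the new limit is simply refused with -EBUSY instead of being shrunk.

```c
#include <errno.h>

/* Hypothetical user-space model; this is not the kernel's res_counter. */
struct hw_counter {
	unsigned long long limit;        /* what this group may use */
	unsigned long long usage;        /* own usage plus limits given to children */
	unsigned long long for_children; /* sum of the children's limits */
	struct hw_counter *parent;       /* NULL for the top-level group */
};

/*
 * Change a child's limit under the hard-wall model. Raising the limit
 * charges the difference to the parent's usage; lowering it gives the
 * difference back. Returns 0 on success, or -EBUSY if the parent has no
 * room left or the child already uses more than the new limit.
 */
int hw_set_limit(struct hw_counter *c, unsigned long long new_limit)
{
	struct hw_counter *p = c->parent;
	unsigned long long delta;

	if (c->usage > new_limit)
		return -EBUSY;

	if (p && new_limit > c->limit) {
		delta = new_limit - c->limit;
		if (delta > p->limit - p->usage)
			return -EBUSY;
		p->usage += delta;
		p->for_children += delta;
	} else if (p) {
		delta = c->limit - new_limit;
		p->usage -= delta;
		p->for_children -= delta;
	}

	c->limit = new_limit;
	return 0;
}
```

Replaying steps 3) to 5) of the example with group A at 1024 leaves A's usage and for_children at 300, 800 and then 600. A later attempt to raise C to 1000 is refused, because only 424 of A's limit is still free.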

2008-06-16 06:44:16

by Pavel Emelyanov

[permalink] [raw]
Subject: Re: [PATCH 1/6] res_counter: handle limit change

KAMEZAWA Hiroyuki wrote:
> Add a support to shrink_usage_at_limit_change feature to res_counter.
> memcg will use this to drop pages.
>
> Change log: xxx -> v4 (new file.)
> - cut out the limit-change part from hierarchy patch set.
> - add "retry_count" arguments to shrink_usage(). This allows that we don't
> have to set the default retry loop count.
> - res_counter_check_under_val() is added to support subsystem.
> - res_counter_init() is res_counter_init_ops(cnt, NULL)
>
> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
>
> ---
> Documentation/controllers/resource_counter.txt | 19 +++++-
> include/linux/res_counter.h | 33 ++++++++++-
> kernel/res_counter.c | 74 ++++++++++++++++++++++++-
> 3 files changed, 121 insertions(+), 5 deletions(-)
>
> Index: linux-2.6.26-rc5-mm3/include/linux/res_counter.h
> ===================================================================
> --- linux-2.6.26-rc5-mm3.orig/include/linux/res_counter.h
> +++ linux-2.6.26-rc5-mm3/include/linux/res_counter.h
> @@ -21,6 +21,13 @@
> * the helpers described beyond
> */
>
> +struct res_counter;
> +struct res_counter_ops {
> + /* called when the subsystem has to reduce the usage. */
> + int (*shrink_usage)(struct res_counter *cnt, unsigned long long val,
> + int retry_count);
> +};
> +
> struct res_counter {
> /*
> * the current resource consumption level
> @@ -39,6 +46,10 @@ struct res_counter {
> */
> unsigned long long failcnt;
> /*
> + * registered callbacks etc...for res_counter.
> + */
> + struct res_counter_ops ops;
> + /*

Why would we need such? All res_counter.limit updates come via the appropriate
cgroup's files, so it can do whatever it needs w/o any callbacks?

And (if we definitely need one) isn't it better to make it a
struct res_counter_ops *ops;
pointer?

> * the lock to protect all of the above.
> * the routines below consider this to be IRQ-safe
> */
> @@ -82,7 +93,13 @@ enum {
> * helpers for accounting
> */
>
> -void res_counter_init(struct res_counter *counter);
> +void res_counter_init_ops(struct res_counter *counter,
> + struct res_counter_ops *ops);
> +
> +static inline void res_counter_init(struct res_counter *counter)
> +{
> + res_counter_init_ops(counter, NULL);
> +}
>
> /*
> * charge - try to consume more resource.
> @@ -136,6 +153,20 @@ static inline bool res_counter_check_und
> return ret;
> }
>
> +static inline bool res_counter_check_under_val(struct res_counter *cnt,
> + unsigned long long val)
> +{
> + bool ret = false;
> + unsigned long flags;
> +
> + spin_lock_irqsave(&cnt->lock, flags);
> + if (cnt->usage <= val)
> + ret = true;
> + spin_unlock_irqrestore(&cnt->lock, flags);
> +
> + return ret;
> +}
> +
> static inline void res_counter_reset_max(struct res_counter *cnt)
> {
> unsigned long flags;
> Index: linux-2.6.26-rc5-mm3/kernel/res_counter.c
> ===================================================================
> --- linux-2.6.26-rc5-mm3.orig/kernel/res_counter.c
> +++ linux-2.6.26-rc5-mm3/kernel/res_counter.c
> @@ -14,10 +14,22 @@
> #include <linux/res_counter.h>
> #include <linux/uaccess.h>
>
> -void res_counter_init(struct res_counter *counter)
> +/**
> + * res_counter_init_ops -- initialize res_counter.
> + * @counter: the res_counter to be initialized
> + * @ops: the res_counter_ops for this res_counter. This argument can be NULL
> + * and is copied.
> + *
> + * init spinlock and set limit to be very very big value.
> + */
> +
> +void res_counter_init_ops(struct res_counter *counter,
> + struct res_counter_ops *ops)
> {
> spin_lock_init(&counter->lock);
> counter->limit = (unsigned long long)LLONG_MAX;
> + if (ops)
> + counter->ops = *ops;
> }
>
> int res_counter_charge_locked(struct res_counter *counter, unsigned long val)
> @@ -102,6 +114,46 @@ u64 res_counter_read_u64(struct res_coun
> return *res_counter_member(counter, member);
> }
>
> +/*
> + * Called when the limit changes if res_counter has ops->shrink_usage.
> + * This function shrinks usage to below the new limit. Returns 0 on success.
> + */
> +
> +static int res_counter_resize_limit(struct res_counter *cnt,
> + unsigned long long val)
> +{
> + int retry_count = 0;
> + int ret = -EBUSY;
> + unsigned long flags;
> +
> + BUG_ON(!cnt->ops.shrink_usage);
> + while (1) {
> + spin_lock_irqsave(&cnt->lock, flags);
> + if (cnt->usage <= val) {
> + cnt->limit = val;
> + ret = 0;
> + spin_unlock_irqrestore(&cnt->lock, flags);
> + break;
> + }
> + BUG_ON(val > cnt->limit);
> + spin_unlock_irqrestore(&cnt->lock, flags);
> +
> + /*
> + * Rest before calling callback().... rest after callback
> + * tends to add difference between the result of callback and
> + * the check in next loop.
> + */
> + cond_resched();
> +
> + ret = cnt->ops.shrink_usage(cnt, val, retry_count);
> + if (ret)
> + break;
> + retry_count++;
> + }
> + return ret;
> +}
> +
> +
> ssize_t res_counter_write(struct res_counter *counter, int member,
> const char __user *userbuf, size_t nbytes, loff_t *pos,
> int (*write_strategy)(char *st_buf, unsigned long long *val))
> @@ -133,11 +185,29 @@ ssize_t res_counter_write(struct res_cou
> if (*end != '\0')
> goto out_free;
> }
> + switch (member) {
> + case RES_LIMIT:
> + if (counter->ops.shrink_usage) {
> + ret = res_counter_resize_limit(counter, tmp);
> + goto done;
> + }
> + break;
> + default:
> + /*
> + * Considering future implementation, we'll have to handle
> + * other members and "fallback" will not work well. So, we
> + * avoid to make use of "default" here.
> + */
> + break;
> + }
> spin_lock_irqsave(&counter->lock, flags);
> val = res_counter_member(counter, member);
> *val = tmp;
> spin_unlock_irqrestore(&counter->lock, flags);
> - ret = nbytes;
> + ret = 0;
> +done:
> + if (!ret)
> + ret = nbytes;
> out_free:
> kfree(buf);
> out:
> Index: linux-2.6.26-rc5-mm3/Documentation/controllers/resource_counter.txt
> ===================================================================
> --- linux-2.6.26-rc5-mm3.orig/Documentation/controllers/resource_counter.txt
> +++ linux-2.6.26-rc5-mm3/Documentation/controllers/resource_counter.txt
> @@ -39,7 +39,11 @@ to work with it.
> The failcnt stands for "failures counter". This is the number of
> resource allocation attempts that failed.
>
> - c. spinlock_t lock
> + e. res_counter_ops.
> + Callbacks for helping resource_counter per each subsystem.
> + - shrink_usage() .... called at limit change (decrease).
> +
> + f. spinlock_t lock
>
> Protects changes of the above values.
>
> @@ -141,8 +145,19 @@ counter fields. They are recommended to
> failcnt reset to zero
>
>
> +5. res_counter_ops (Callbacks)
>
> -5. Usage example
> + res_counter_ops is for implementing feedback control from res_counter
> + to subsystem. Each callback has its own purpose, and a subsystem does not
> + need to provide all of them. Just implement the necessary ones.
> +
> + - shrink_usage(res_counter, newlimit, retry)
> + Called for reducing usage to newlimit, retry is incremented per
> + loop. (See memory resource controller as example.)
> + Returns 0 at success. Any error code is acceptable but -EBUSY will be
> + suitable to show "the kernel can't shrink usage."
> +
> +6. Usage example
>
> a. Declare a task group (take a look at cgroups subsystem for this) and
> fold a res_counter into it
>
>

2008-06-16 07:40:12

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: Re: [PATCH 1/6] res_counter: handle limit change

----- Original Message -----
>> + * registered callbacks etc...for res_counter.
>> + */
>> + struct res_counter_ops ops;
>> + /*
>
Now, a write to the limit is done in the following path:
sys_write() -> write_func of subsys -> write in res_counter ->
strategy callback -> set limit -> return

Because the strategy callback is called in res_counter, without another
callback we can only do something after set-limit. So res_counter should
call another callback before set-limit if it can fail.

>Why would we need such? All res_counter.limit update comes via the appropiate
>cgroup's files, so it can do whatever it needs w/o any callbacks?
>

First reason is that this allows us to implement a generic algorithm to
handle limit changes. Second is that the generic algorithm can be a stack of
functions. I don't like to pass function pointers through several layers
of functions. (And this design allows the code to be much easier to read.
My first version used a function pointer argument but it was verrry ugly.)

I think if I did it all in memcg, someone would comment, "why do that
all in memcg? please implement a generic one to avoid code duplication"

>And (if we definitely need one) isn't it better to make it a
> struct res_counter_ops *ops;
>pointer?
>
My first version did that. When I added hierarchy_model to ops (see later
patch), I made use of a copy of ops. But maybe you're right. Keeping
res_counter small is important. I'll use a pointer in v5.

Thanks,
-Kame-

2008-06-16 07:55:11

by Pavel Emelyanov

[permalink] [raw]
Subject: Re: [PATCH 1/6] res_counter: handle limit change

[email protected] wrote:
> ----- Original Message -----
>>> + * registered callbacks etc...for res_counter.
>>> + */
>>> + struct res_counter_ops ops;
>>> + /*
> Now, write to limit is done in following path.
> sys_write() -> write_func of subsys -> write in res_counter ->
> strategy callback -> set limit -> return
>
> Because stragety callback is called in res_counter, we can only do
> something after set-limit without callback. So res_counter should call
> another callback before set-limit if it can fail.
>
>> Why would we need such? All res_counter.limit update comes via the appropiate
>> cgroup's files, so it can do whatever it needs w/o any callbacks?
>>
>
> First reason is that this allows us to implement generic algorithm to
> handle limit change. Second is that generic algorithm can be a stack of
> functions. I don't like to pass function pointers through several stack
> of functions. (And this design allow the code to be much easier to read.
> My first version used an argument of function pointer but it was verrry ugly.)
>
> I think when I did all in memcg, someone will comment that "why do that
> all in memcg ? please implement generic one to avoid code duplication"

Hm... But we're choosing between

sys_write->xxx_cgroup_write->res_counter_set_limit->xxx_cgroup_call

and

sys_write->xxx_cgroup_write->res_counter_set_limit
->xxx_cgroup_call

With the sizeof(void *)-bytes difference in res_counter, no?

>> And (if we definitely need one) isn't it better to make it a
>> struct res_counter_ops *ops;
>> pointer?
>>
> My first version did that. When I added hierarchy_model to ops(see later patch
> ), I made use of copy of ops. But maybe you're right. Keeping
> res_counter small is important. I'll use pointer in v5.
>
> Thanks,
> -Kame-
>
>

2008-06-16 08:18:20

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: Re: [PATCH 1/6] res_counter: handle limit change

----- Original Message -----

>> I think when I did all in memcg, someone will comment that "why do that
>> all in memcg ? please implement generic one to avoid code duplication"
>
>Hm... But we're choosing between
>
>sys_write->xxx_cgroup_write->res_counter_set_limit->xxx_cgroup_call
>
>and
>
>sys_write->xxx_cgroup_write->res_counter_set_limit
> ->xxx_cgroup_call
>
>With the sizeof(void *)-bytes difference in res_counter, no?
>
I can't catch what you mean. What is res_counter_set_limit here?
(my patch's?) And what is sizeof(void *)-bytes?

Is it so strange to add following algorithm in res_counter?
==
set_limit -> fail -> shrink -> set limit -> fail ->shrink
-> success -> return 0
==
I think this is enough generic.
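
The set_limit -> fail -> shrink loop above can be sketched in user space as follows. This is only an illustration under assumptions: struct counter, counter_resize_limit() and demo_shrink() are hypothetical names, locking and cond_resched() are left out, and the callback signature merely mirrors shrink_usage() from the patch.

```c
#include <errno.h>

struct counter;

/*
 * Hypothetical callback type: try to bring usage down towards target.
 * Returns 0 if the caller should re-check, or a negative errno to give up.
 */
typedef int (*shrink_fn)(struct counter *cnt, unsigned long long target,
			 int retry_count);

struct counter {
	unsigned long long usage;
	unsigned long long limit;
	shrink_fn shrink_usage;
};

/*
 * Generic part: install the new limit only once usage fits under it,
 * asking the subsystem to shrink in between.
 */
int counter_resize_limit(struct counter *cnt, unsigned long long val)
{
	int retry_count = 0;
	int ret;

	for (;;) {
		if (cnt->usage <= val) {
			cnt->limit = val;
			return 0;
		}
		if (!cnt->shrink_usage)
			return -EBUSY;
		ret = cnt->shrink_usage(cnt, val, retry_count);
		if (ret)
			return ret;	/* subsystem gave up, limit unchanged */
		retry_count++;
	}
}

/*
 * Subsystem part (example only): drop at most 64 units per call and
 * give up with -EBUSY after five attempts.
 */
int demo_shrink(struct counter *cnt, unsigned long long target,
		int retry_count)
{
	unsigned long long excess = cnt->usage - target;

	if (retry_count > 4)
		return -EBUSY;
	cnt->usage -= excess < 64 ? excess : 64;
	return 0;
}
```

In this shape the retry policy lives in the callback, driven by retry_count, so the generic loop needs no default retry limit of its own.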

Thanks,
-Kame


2008-06-16 08:29:00

by Pavel Emelyanov

[permalink] [raw]
Subject: Re: [PATCH 1/6] res_counter: handle limit change

[email protected] wrote:
> ----- Original Message -----
>
>>> I think when I did all in memcg, someone will comment that "why do that
>>> all in memcg ? please implement generic one to avoid code duplication"
>> Hm... But we're choosing between
>>
>> sys_write->xxx_cgroup_write->res_counter_set_limit->xxx_cgroup_call
>>
>> and
>>
>> sys_write->xxx_cgroup_write->res_counter_set_limit
>> ->xxx_cgroup_call
>>
>> With the sizeof(void *)-bytes difference in res_counter, no?
>>
> I can't catch what you mean. What is res_counter_set_limit here ?

It's res_counter_resize_limit from your patch, sorry for the confusion.

> (my patche's ?) and what is sizeof(void *)-bytes ?

I meant, that we have to add 4 bytes (8 on 64-bit arches) on the
struct res_counter to store the pointer on the res_counter_ops.
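
The size argument can be checked with a small user-space comparison. This is a sketch under assumptions: the three structs below are simplified stand-ins (only usage and limit are kept, and the ops struct carries shrink_usage plus the hierarchy_model field from the later patch), and the exact numbers depend on the ABI.

```c
#include <stdio.h>

struct model_counter;

/* Stand-in for res_counter_ops with the two members seen in this series. */
struct model_ops {
	int (*shrink_usage)(struct model_counter *cnt,
			    unsigned long long val, int retry_count);
	int hierarchy_model;
};

/* Counter without any ops at all, as a baseline. */
struct counter_bare {
	unsigned long long usage;
	unsigned long long limit;
};

/* Counter that embeds a copy of the ops, as in v4 of the patch. */
struct counter_with_embedded_ops {
	unsigned long long usage;
	unsigned long long limit;
	struct model_ops ops;
};

/* Counter that stores only a pointer to shared ops. */
struct counter_with_pointer {
	unsigned long long usage;
	unsigned long long limit;
	const struct model_ops *ops;
};

/* Print the three sizes so the per-counter cost can be compared. */
void print_counter_sizes(void)
{
	printf("bare:     %zu bytes\n", sizeof(struct counter_bare));
	printf("embedded: %zu bytes\n",
	       sizeof(struct counter_with_embedded_ops));
	printf("pointer:  %zu bytes\n", sizeof(struct counter_with_pointer));
}
```

The pointer variant costs at least one pointer per counter, while the embedded variant grows with every member added to the ops struct.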

> Is it so strange to add following algorithm in res_counter?
> ==
> set_limit -> fail -> shrink -> set limit -> fail ->shrink
> -> success -> return 0
> ==
> I think this is enough generic.

It is, but my point is - we're calling the set_limit (this is a
res_counter_resize_limit from your patch, sorry for the confusion again)
routine right from the cgroup's write callback and thus can call
the desired "ops->shrink_usage" directly, w/o additional level of
indirection.

> Thanks,
> -Kame
>
>
>
>

2008-06-16 08:32:31

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: Re: [PATCH 1/6] res_counter: handle limit change

----- Original Message -----

>[email protected] wrote:
>> ----- Original Message -----
>>
>>>> I think when I did all in memcg, someone will comment that "why do that
>>>> all in memcg ? please implement generic one to avoid code duplication"
>>> Hm... But we're choosing between
>>>
>>> sys_write->xxx_cgroup_write->res_counter_set_limit->xxx_cgroup_call
>>>
>>> and
>>>
>>> sys_write->xxx_cgroup_write->res_counter_set_limit
>>> ->xxx_cgroup_call
>>>
>>> With the sizeof(void *)-bytes difference in res_counter, no?
>>>
>> I can't catch what you mean. What is res_counter_set_limit here ?
>
>It's res_counter_resize_limit from your patch, sorry for the confusion.
>
>> (my patche's ?) and what is sizeof(void *)-bytes ?
>
>I meant, that we have to add 4 bytes (8 on 64-bit arches) on the
>struct res_counter to store the pointer on the res_counter_ops.
>
Okay, maybe all you want is "don't increase the size of res_counter"

>> Is it so strange to add following algorithm in res_counter?
>> ==
>> set_limit -> fail -> shrink -> set limit -> fail ->shrink
>> -> success -> return 0
>> ==
>> I think this is enough generic.
>
>It is, but my point is - we're calling the set_limit (this is a
>res_counter_resize_limit from your patch, sorry for the confusion again)
>routine right from the cgroup's write callback and thus can call
>the desired "ops->shrink_usage" directly, w/o additional level of
>indirection.
>
Hmm, to do that, I'd like to remove strategy function from res_counter.
Ok?

Thanks,
-Kame

2008-06-16 08:51:37

by Pavel Emelyanov

[permalink] [raw]
Subject: Re: [PATCH 1/6] res_counter: handle limit change

[email protected] wrote:
> ----- Original Message -----
>
>> [email protected] wrote:
>>> ----- Original Message -----
>>>
>>>>> I think when I did all in memcg, someone will comment that "why do that
>>>>> all in memcg ? please implement generic one to avoid code duplication"
>>>> Hm... But we're choosing between
>>>>
>>>> sys_write->xxx_cgroup_write->res_counter_set_limit->xxx_cgroup_call
>>>>
>>>> and
>>>>
>>>> sys_write->xxx_cgroup_write->res_counter_set_limit
>>>> ->xxx_cgroup_call
>>>>
>>>> With the sizeof(void *)-bytes difference in res_counter, no?
>>>>
>>> I can't catch what you mean. What is res_counter_set_limit here ?
>> It's res_counter_resize_limit from your patch, sorry for the confusion.
>>
>>> (my patche's ?) and what is sizeof(void *)-bytes ?
>> I meant, that we have to add 4 bytes (8 on 64-bit arches) on the
>> struct res_counter to store the pointer on the res_counter_ops.
>>
> Okay, maye all you want is "don't increase the size of res_counter"

Actually no, what I want is not to add a level of indirection when
it is not required.

But keeping res_counter as small as possible is also my wish. :)

>>> Is it so strange to add following algorithm in res_counter?
>>> ==
>>> set_limit -> fail -> shrink -> set limit -> fail ->shrink
>>> -> success -> return 0
>>> ==
>>> I think this is enough generic.
>> It is, but my point is - we're calling the set_limit (this is a
>> res_counter_resize_limit from your patch, sorry for the confusion again)
>> routine right from the cgroup's write callback and thus can call
>> the desired "ops->shrink_usage" directly, w/o additional level of
>> indirection.
>>
> Hmm, to do that, I'd like to remove strategy function from res_counter.

Oops... I'm looking at 2.6.26-rc5-mm1's res_counter and don't see such.
I tried to follow the changes in res_counter, but it looks like I've
already missed something.

What do you mean by "strategy function from res_counter"?

> Ok?
>
> Thanks,
> -Kame
>

2008-06-16 08:54:32

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: Re: Re: [PATCH 1/6] res_counter: handle limit change

----- Original Message -----
>>> I think when I did all in memcg, someone will comment that "why do that
>>> all in memcg ? please implement generic one to avoid code duplication"
>>
>>Hm... But we're choosing between
>>
>>sys_write->xxx_cgroup_write->res_counter_set_limit->xxx_cgroup_call
>>
>>and
>>
>>sys_write->xxx_cgroup_write->res_counter_set_limit
>> ->xxx_cgroup_call
>>
>>With the sizeof(void *)-bytes difference in res_counter, no?
>>
>I can't catch what you mean. What is res_counter_set_limit here ?
>(my patche's ?) and what is sizeof(void *)-bytes ?
>
>Is it so strange to add following algorithm in res_counter?
>==
>set_limit -> fail -> shrink -> set limit -> fail ->shrink
>-> success -> return 0
>==
>I think this is enough generic.
>
This was previous request from Paul. (to hierarchy patch set)

http://marc.info/?l=linux-mm&m=121257010530546&w=2

I think this version meets his request. and I like this.

I don't want to waste more weeks. Then, what is bad ?
removing res_counter_ops is okay ?

Thanks,
-Kame

2008-06-16 08:58:00

by Balbir Singh

[permalink] [raw]
Subject: Re: [PATCH 1/6] res_counter: handle limit change

KAMEZAWA Hiroyuki wrote:
> Add a support to shrink_usage_at_limit_change feature to res_counter.
> memcg will use this to drop pages.
>
> Change log: xxx -> v4 (new file.)
> - cut out the limit-change part from hierarchy patch set.
> - add "retry_count" arguments to shrink_usage(). This allows that we don't
> have to set the default retry loop count.
> - res_counter_check_under_val() is added to support subsystem.
> - res_counter_init() is res_counter_init_ops(cnt, NULL)
>
> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
>

Does shrink_usage() really belong to res_counters? Could a task limiter, a
CPU/IO bandwidth controller use this callback? Resource Counters were designed
to be generic and work across controllers. Isn't the memory controller a better
place for such ops?

--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL

2008-06-16 09:02:20

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: Re: [PATCH 1/6] res_counter: handle limit change

----- Original Message -----
>> Okay, maye all you want is "don't increase the size of res_counter"
>
>Actually no, what I want is not to put indirections level when
>not required.
>
"not required" ? I think you miss the point that this patch implements some
feedback algorithm in res_counter. If res_counter doesn't support it,
okay, I'll do it in memcg. But please see this request from Paul on the
previous version:
http://marc.info/?l=linux-mm&m=121257010530546&w=2
And what benefits can we get by implementing feedback per subcgroups?

>But keeping res_counter as small as possible is also my wish. :)
>
>>>> Is it so strange to add following algorithm in res_counter?
>>>> ==
>>>> set_limit -> fail -> shrink -> set limit -> fail ->shrink
>>>> -> success -> return 0
>>>> ==
>>>> I think this is enough generic.
>>> It is, but my point is - we're calling the set_limit (this is a
>>> res_counter_resize_limit from your patch, sorry for the confusion again)
>>> routine right from the cgroup's write callback and thus can call
>>> the desired "ops->shrink_usage" directly, w/o additional level of
>>> indirection.
>>>
>> Hmm, to do that, I'd like to remove strategy function from res_counter.
>
>Oops... I'm looking at 2.6.26-rc5-mm1's res_counter and don't see such.
>I tried to follow the changes in res_counter, but it looks like I've
>already missed something.
>
>What do you mean by "strategy function from res_counter"?
>
Please ignore. my confusion.
"don't call res_counter_write() at set limit" is ok.

Thanks,
-Kame

2008-06-16 09:04:55

by Pavel Emelyanov

[permalink] [raw]
Subject: Re: [PATCH 1/6] res_counter: handle limit change

[email protected] wrote:
> ----- Original Message -----
>>>> I think when I did all in memcg, someone will comment that "why do that
>>>> all in memcg ? please implement generic one to avoid code duplication"
>>> Hm... But we're choosing between
>>>
>>> sys_write->xxx_cgroup_write->res_counter_set_limit->xxx_cgroup_call
>>>
>>> and
>>>
>>> sys_write->xxx_cgroup_write->res_counter_set_limit
>>> ->xxx_cgroup_call
>>>
>>> With the sizeof(void *)-bytes difference in res_counter, no?
>>>
>> I can't catch what you mean. What is res_counter_set_limit here ?
>> (my patche's ?) and what is sizeof(void *)-bytes ?
>>
>> Is it so strange to add following algorithm in res_counter?
>> ==
>> set_limit -> fail -> shrink -> set limit -> fail ->shrink
>> -> success -> return 0
>> ==
>> I think this is enough generic.
>>
> This was previous request from Paul. (to hierarchy patch set)
>
> http://marc.info/?l=linux-mm&m=121257010530546&w=2
>
> I think this version meets his request. and I like this.
>
> I don't want to waste more weeks. Then, what is bad ?
> removing res_counter_ops is okay ?

Yes. I'd prefer seeing this logic in memory controller w/o additional
hacks in res_counter.

> Thanks,
> -Kame
>
>
>

2008-06-16 09:05:28

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: Re: [PATCH 1/6] res_counter: handle limit change

----- Original Message -----
>KAMEZAWA Hiroyuki wrote:
>> Add a support to shrink_usage_at_limit_change feature to res_counter.
>> memcg will use this to drop pages.
>>
>> Change log: xxx -> v4 (new file.)
>> - cut out the limit-change part from hierarchy patch set.
>> - add "retry_count" arguments to shrink_usage(). This allows that we don't
>> have to set the default retry loop count.
>> - res_counter_check_under_val() is added to support subsystem.
>> - res_counter_init() is res_counter_init_ops(cnt, NULL)
>>
>> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
>>
>
>Does shrink_usage() really belong to res_counters? Could a task limiter, a
>CPU/IO bandwidth controller use this callback? Resource Counters were designed
>to be generic and work across controllers. Isn't the memory controller a better
>place for such ops.
>
Definitely No. I think counters which cannot be shrunk should return -EBUSY
from shrink_usage() when they cannot do it.

Thanks,
-Kame


2008-06-16 09:14:35

by Pavel Emelyanov

[permalink] [raw]
Subject: Re: [PATCH 1/6] res_counter: handle limit change

Balbir Singh wrote:
> KAMEZAWA Hiroyuki wrote:
>> Add a support to shrink_usage_at_limit_change feature to res_counter.
>> memcg will use this to drop pages.
>>
>> Change log: xxx -> v4 (new file.)
>> - cut out the limit-change part from hierarchy patch set.
>> - add "retry_count" arguments to shrink_usage(). This allows that we don't
>> have to set the default retry loop count.
>> - res_counter_check_under_val() is added to support subsystem.
>> - res_counter_init() is res_counter_init_ops(cnt, NULL)
>>
>> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
>>
>
> Does shrink_usage() really belong to res_counters? Could a task limiter, a
> CPU/IO bandwidth controller use this callback? Resource Counters were designed
> to be generic and work across controllers. Isn't the memory controller a better
> place for such ops.

Well, this was my point, I just couldn't express this in an understandable manner.
We're discussing this right now :)

2008-06-16 12:31:25

by Balbir Singh

[permalink] [raw]
Subject: Re: [PATCH 1/6] res_counter: handle limit change

[email protected] wrote:
> ----- Original Message -----
>> KAMEZAWA Hiroyuki wrote:
>>> Add a support to shrink_usage_at_limit_change feature to res_counter.
>>> memcg will use this to drop pages.
>>>
>>> Change log: xxx -> v4 (new file.)
>>> - cut out the limit-change part from hierarchy patch set.
>>> - add "retry_count" arguments to shrink_usage(). This allows that we don't
>>> have to set the default retry loop count.
>>> - res_counter_check_under_val() is added to support subsystem.
>>> - res_counter_init() is res_counter_init_ops(cnt, NULL)
>>>
>>> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
>>>
>> Does shrink_usage() really belong to res_counters? Could a task limiter, a
>> CPU/IO bandwidth controller use this callback? Resource Counters were designed
>> to be generic and work across controllers. Isn't the memory controller a better
>> place for such ops.
>>
> Definitely No. I think counters which cannot be shrink should return -EBUSY
> by shrink_usage() when it cannot do it.

Wouldn't that be all counters except for the memory controller RSS counter? I
can't see anyone besides the memory controller supporting shrink_usage().

--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL

2008-06-16 13:27:48

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: Re: [PATCH 1/6] res_counter: handle limit change

----- Original Message -----

>> Definitely No. I think counters which cannot be shrink should return -EBUSY
>> by shrink_usage() when it cannot do it.
>
>Wouldn't that be all counters except for the memory controller RSS counter? I
>can't see anyone besides the memory controller supporting shrink_usage().
>
Slab_counter is a candidate. But ok, if nobody likes this,
I'll abandon the whole thing and rewrite it as v3.

And considering your point, my high-low-watermark patch set should be
implemented within memcg, and adding high/low to res_counter is too bad.
I'll change my plan. But res_counter is less useful than I thought ;)
Besides, it doesn't support any feedback; it just restricts the access to
parameters.

BTW, I believe current res_counter's behavior of returning success
in the usage > limit case is very bad. I'd like to return -EBUSY.
What do you think?
(And I also think res_counter_charge returning -ENOMEM is a BUG. It should be
-EBUSY.)

Thanks,
-Kame

2008-06-20 05:09:53

by Paul Menage

[permalink] [raw]
Subject: Re: [PATCH 1/6] res_counter: handle limit change

On Fri, Jun 13, 2008 at 2:29 AM, KAMEZAWA Hiroyuki
<[email protected]> wrote:
> Add a support to shrink_usage_at_limit_change feature to res_counter.
> memcg will use this to drop pages.

Sorry for the delay in looking at this.

I think the basic idea is great.

>
> Change log: xxx -> v4 (new file.)
> - cut out the limit-change part from hierarchy patch set.
> - add "retry_count" arguments to shrink_usage(). This allows that we don't
> have to set the default retry loop count.
> - res_counter_check_under_val() is added to support subsystem.
> - res_counter_init() is res_counter_init_ops(cnt, NULL)
>
> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
>
> ---
> Documentation/controllers/resource_counter.txt | 19 +++++-
> include/linux/res_counter.h | 33 ++++++++++-
> kernel/res_counter.c | 74 ++++++++++++++++++++++++-
> 3 files changed, 121 insertions(+), 5 deletions(-)
>
> Index: linux-2.6.26-rc5-mm3/include/linux/res_counter.h
> ===================================================================
> --- linux-2.6.26-rc5-mm3.orig/include/linux/res_counter.h
> +++ linux-2.6.26-rc5-mm3/include/linux/res_counter.h
> @@ -21,6 +21,13 @@
> * the helpers described beyond
> */
>
> +struct res_counter;
> +struct res_counter_ops {
> + /* called when the subsystem has to reduce the usage. */
> + int (*shrink_usage)(struct res_counter *cnt, unsigned long long val,
> + int retry_count);
> +};

We should also add the limit/usage write strategy function in here too.


> +
> struct res_counter {
> /*
> * the current resource consumption level
> @@ -39,6 +46,10 @@ struct res_counter {
> */
> unsigned long long failcnt;
> /*
> + * registered callbacks etc...for res_counter.
> + */
> + struct res_counter_ops ops;
> + /*

As Pavel mentioned, a pointer would be better here.
> -void res_counter_init(struct res_counter *counter);
> +void res_counter_init_ops(struct res_counter *counter,
> + struct res_counter_ops *ops);
> +
> +static inline void res_counter_init(struct res_counter *counter)
> +{
> + res_counter_init_ops(counter, NULL);
> +}

I would rather just see res_counter_init() take an ops parameter, and
update the (few) users of res_counter.
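
A minimal user-space sketch of the two suggestions above (ops held by pointer, and res_counter_init() taking the ops directly) might look like the following. It is not the patch's code: the field list is simplified, locking is omitted, and noop_shrink_usage() and res_counter_can_shrink() are hypothetical names used only for illustration.

```c
#include <limits.h>

struct res_counter;

/* Callback table; a subsystem would normally share one instance. */
struct res_counter_ops {
	int (*shrink_usage)(struct res_counter *cnt, unsigned long long val,
			    int retry_count);
};

/* Simplified counter: lock and other fields are left out of this sketch. */
struct res_counter {
	unsigned long long usage;
	unsigned long long limit;
	unsigned long long failcnt;
	const struct res_counter_ops *ops;	/* may be a null pointer */
};

/* Assumed signature: the ops pointer is passed at init time. */
void res_counter_init(struct res_counter *counter,
		      const struct res_counter_ops *ops)
{
	counter->usage = 0;
	counter->limit = ULLONG_MAX;	/* "no limit" in this sketch */
	counter->failcnt = 0;
	counter->ops = ops;
}

/* Hypothetical helper: with a pointer, both levels must be checked. */
int res_counter_can_shrink(const struct res_counter *counter)
{
	return counter->ops && counter->ops->shrink_usage;
}

/* Hypothetical callback that pretends shrinking always succeeds. */
int noop_shrink_usage(struct res_counter *cnt, unsigned long long val,
		      int retry_count)
{
	(void)cnt;
	(void)val;
	(void)retry_count;
	return 0;
}
```

With a pointer, every counter of a subsystem can share one ops table, and a subsystem without callbacks simply passes a null pointer.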


> +static int res_counter_resize_limit(struct res_counter *cnt,
> + unsigned long long val)
> +{
> + int retry_count = 0;
> + int ret = -EBUSY;
> + unsigned long flags;
> +
> + BUG_ON(!cnt->ops.shrink_usage);

As others have pointed out, there are some subsystems where usage
can't be shrunk. Maybe provide a "res_counter_unshrinkable()" function
that always returns -EBUSY and can be used by subsystems that can't
handle shrinking?
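
As a rough sketch of that idea, assuming the callback signature from the quoted header, such a stub could be as small as this. Only the name res_counter_unshrinkable() comes from the suggestion; the body and the unshrinkable_ops table are assumptions made for illustration.

```c
#include <errno.h>

struct res_counter;

struct res_counter_ops {
	int (*shrink_usage)(struct res_counter *cnt, unsigned long long val,
			    int retry_count);
};

/*
 * Stub for subsystems whose usage cannot be reduced on demand.
 * It never touches the counter and always reports -EBUSY.
 */
int res_counter_unshrinkable(struct res_counter *cnt, unsigned long long val,
			     int retry_count)
{
	(void)cnt;
	(void)val;
	(void)retry_count;
	return -EBUSY;
}

/* Hypothetical ops table such a subsystem could register. */
const struct res_counter_ops unshrinkable_ops = {
	.shrink_usage = res_counter_unshrinkable,
};
```

Every counter would then have a valid shrink_usage callback, so the BUG_ON() check could never fire for a subsystem that merely cannot shrink.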

> @@ -133,11 +185,29 @@ ssize_t res_counter_write(struct res_cou
> if (*end != '\0')
> goto out_free;
> }
> + switch (member) {
> + case RES_LIMIT:
> + if (counter->ops.shrink_usage) {
> + ret = res_counter_resize_limit(counter, tmp);
> + goto done;
> + }
> + break;
> + default:
> + /*
> + * Considering future implementation, we'll have to handle
> + * other members and "fallback" will not work well. So, we
> + * avoid to make use of "default" here.
> + */
> + break;
> + }

Would this be simpler as just

	if (member == RES_LIMIT && counter->ops.shrink_usage) {
		ret = res_counter_resize_limit(counter, tmp);
	} else {
		...
	}

Paul

2008-06-23 22:30:45

by Randy Dunlap

Subject: Re: [PATCH 6/6] memcg: HARDWALL hierarchy

On Fri, 13 Jun 2008 18:37:41 +0900 KAMEZAWA Hiroyuki wrote:

> Support hardwall hierarchy (and no-hierarchy) in memcg.
>
> Change log: v3->v4
> - cut out from memcg hierarchy patch set v4.
> - no major changes, but some amount of functions are moved to res_counter.
> and be more gneric.
>
> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
>
> ---
> Documentation/controllers/memory.txt | 57 +++++++++++++++++++++++++++++-
> mm/memcontrol.c | 65 +++++++++++++++++++++++++++++++++--
> 2 files changed, 118 insertions(+), 4 deletions(-)
>
> Index: linux-2.6.26-rc5-mm3/Documentation/controllers/memory.txt
> ===================================================================
> --- linux-2.6.26-rc5-mm3.orig/Documentation/controllers/memory.txt
> +++ linux-2.6.26-rc5-mm3/Documentation/controllers/memory.txt
> @@ -154,7 +154,7 @@ The memory controller uses the following
>
> 0. Configuration

I apologize if you have already corrected these. I'm a bit behind
on doc reviews.


> -a. Enable CONFIG_CGROUPS
> +a. Enable CONFESS_CGROUPS

Really? Looks odd and backwards.

> b. Enable CONFIG_RESOURCE_COUNTERS
> c. Enable CONFIG_CGROUP_MEM_RES_CTLR
>
> @@ -237,7 +237,58 @@ cgroup might have some charge associated
> tasks have migrated away from it. Such charges are automatically dropped at
> rmdir() if there are no tasks.
>
> -5. TODO
> +5. Supported Hierarchy Model
> +
> +Currently, memory controller supports following models of hierarchy in the
> +kernel. (see also resource_counter.txt)
> +
> +2 files are related to hierarchy.
> + - memory.hierarchy_model
> + - memory.for_children
> +
> +Basic Rule.
> + - Hierarchy can be set per cgroup.
> + - A child inherits parent's hierarchy model at creation.
> + - A child can change its hierarchy only when the parent's hierarchy is
> + NO_HIERARCY and it has no children.

NO_HIERARCHY

> +
> +
> +5.1. NO_HIERARCHY
> + - Each cgroup is independent from other ones.
> + - When memory.hierarchy_model is 0, NO_HIERARCHY is used.
> + Under this model, there is no controls based on tree of cgroups.

there are no controls

> + This is the default model of root cgroup.
> +
> +5.2 HARDWALL_HIERARCHY
> + - A child is a isolated portion of the parent.

is an

> + - When memory.hierarchy_model is 1, HARDWALL_HIERARCHY is used.
> + In this model a child's limit is charged as parent's usage.
> +
> + Hard-Wall Hierarchy Example)

Drop ')'.

> + 1) Assume a cgroup with 1GB limits. (and no tasks belongs to this, now)
> + - group_A limit=1G,usage=0M.

, usage=0M.

> +
> + 2) create group B, C under A.
> + - group A limit=1G, usage=0M, for_childre=0M

for_children=0M

> + - group B limit=0M, usage=0M.
> + - group C limit=0M, usage=0M.
> +
> + 3) increase group B's limit to 300M.
> + - group A limit=1G, usage=300M, for_children=300M
> + - group B limit=300M, usage=0M.
> + - group C limit=0M, usage=0M.
> +
> + 4) increase group C's limit to 500M
> + - group A limit=1G, usage=800M, for_children=800M
> + - group B limit=300M, usage=0M.
> + - group C limit=500M, usage=0M.
> +
> + 5) reduce group B's limit to 100M
> + - group A limit=1G, usage=600M, for_children=600M.
> + - group B limit=100M, usage=0M.
> + - group C limit=500M, usage=0M.
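
The arithmetic in steps 1) to 5) can be reproduced with a small user-space model. This is only a sketch of the rule "a child's limit is charged as parent's usage"; struct group, hardwall_set_limit() and the choice of -EBUSY for a refused change are assumptions, not the patch's code. Sizes are in MB.

```c
#include <errno.h>

/* One node of the hierarchy; all values are in MB. */
struct group {
	struct group *parent;
	unsigned long long limit;
	unsigned long long usage;
	unsigned long long for_children;
};

/*
 * Change g's limit under the hard-wall model: the difference is
 * charged to (or returned from) the parent's usage and for_children.
 */
int hardwall_set_limit(struct group *g, unsigned long long new_limit)
{
	struct group *p = g->parent;
	unsigned long long delta;

	if (new_limit < g->usage)
		return -EBUSY;	/* shrinking usage is not modeled here */

	if (p && new_limit > g->limit) {
		delta = new_limit - g->limit;
		if (p->usage + delta > p->limit)
			return -EBUSY;	/* parent has no room left */
		p->usage += delta;
		p->for_children += delta;
	} else if (p) {
		delta = g->limit - new_limit;
		p->usage -= delta;
		p->for_children -= delta;
	}
	g->limit = new_limit;
	return 0;
}
```

Starting from A with limit=1024 and B, C with limit=0, setting B to 300, C to 500 and then B to 100 leaves A's usage at 300, 800 and 600 in turn, matching the example.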


---
~Randy
Linux Plumbers Conference, 17-19 September 2008, Portland, Oregon USA
http://linuxplumbersconf.org/

2008-06-23 22:41:18

by Randy Dunlap

Subject: Re: [PATCH 1/6] res_counter: handle limit change

On Fri, 13 Jun 2008 18:29:24 +0900 KAMEZAWA Hiroyuki wrote:

> Add a support to shrink_usage_at_limit_change feature to res_counter.
> memcg will use this to drop pages.
>
> Change log: xxx -> v4 (new file.)
> - cut out the limit-change part from hierarchy patch set.
> - add "retry_count" arguments to shrink_usage(). This allows that we don't
> have to set the default retry loop count.
> - res_counter_check_under_val() is added to support subsystem.
> - res_counter_init() is res_counter_init_ops(cnt, NULL)
>
> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
>
> ---
> Documentation/controllers/resource_counter.txt | 19 +++++-
> include/linux/res_counter.h | 33 ++++++++++-
> kernel/res_counter.c | 74 ++++++++++++++++++++++++-
> 3 files changed, 121 insertions(+), 5 deletions(-)
>
> Index: linux-2.6.26-rc5-mm3/Documentation/controllers/resource_counter.txt
> ===================================================================
> --- linux-2.6.26-rc5-mm3.orig/Documentation/controllers/resource_counter.txt
> +++ linux-2.6.26-rc5-mm3/Documentation/controllers/resource_counter.txt
> @@ -141,8 +145,19 @@ counter fields. They are recommended to
> failcnt reset to zero
>
>
> +5. res_counter_ops (Callbacks)
>
> -5. Usage example
> + res_counter_ops is for implementing feedback control from res_counter
> + to subsystem. Each one has each own purpose and the subsystem doesn't

isn't

> + necessary to provide all callbacks. Just implement necessary ones.

required

> +
> + - shrink_usage(res_counter, newlimit, retry)
> + Called for reducing usage to newlimit, retry is incremented per
> + loop. (See memory resource controller as example.)
> + Returns 0 at success. Any error code is acceptable but -EBUSY will be
> + suitable to show "the kernel can't shrink usage."
> +
> +6. Usage example
>
> a. Declare a task group (take a look at cgroups subsystem for this) and
> fold a res_counter into it
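
To make the shrink_usage() contract described above concrete, here is a user-space sketch of a caller that retries until usage fits under the new limit. It is not the patch's res_counter_resize_limit(): locking is omitted, and resize_limit_sketch(), demo_shrink_usage() and the DEMO_* constants are hypothetical.

```c
#include <errno.h>

struct res_counter;

struct res_counter_ops {
	int (*shrink_usage)(struct res_counter *cnt, unsigned long long val,
			    int retry_count);
};

/* Simplified counter with the ops embedded, as in the quoted patch. */
struct res_counter {
	unsigned long long usage;
	unsigned long long limit;
	struct res_counter_ops ops;
};

/* Call the callback until usage <= val, bumping retry_count per loop. */
int resize_limit_sketch(struct res_counter *cnt, unsigned long long val)
{
	int retry_count = 0;
	int ret;

	while (cnt->usage > val) {
		if (!cnt->ops.shrink_usage)
			return -EBUSY;
		ret = cnt->ops.shrink_usage(cnt, val, retry_count);
		if (ret)
			return ret;	/* callback gave up */
		retry_count++;
	}
	cnt->limit = val;
	return 0;
}

#define DEMO_CHUNK	100ULL	/* units reclaimed per call */
#define DEMO_MAX_RETRY	3	/* calls allowed before giving up */

/* Demo callback: reclaims up to DEMO_CHUNK, then refuses with -EBUSY. */
int demo_shrink_usage(struct res_counter *cnt, unsigned long long val,
		      int retry_count)
{
	unsigned long long excess = cnt->usage - val;

	if (retry_count >= DEMO_MAX_RETRY)
		return -EBUSY;
	cnt->usage -= excess < DEMO_CHUNK ? excess : DEMO_CHUNK;
	return 0;
}
```

Passing retry_count to the callback lets each subsystem decide for itself when to stop trying, which is why no default loop count is needed in res_counter.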


---
~Randy

2008-06-23 22:51:20

by Randy Dunlap

Subject: Re: [PATCH 4/6] res_counter: basic hierarchy support

On Fri, 13 Jun 2008 18:34:02 +0900 KAMEZAWA Hiroyuki wrote:

> Add a hierarhy support to res_counter. This patch itself just supports
> "No Hierarchy" hierarchy, as a default/basic hierarchy system.
>
> Changelog: v3 -> v4.
> - cut out from hardwall hierarchy patch set.
> - just support "No hierarchy" model.
>
> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
> ---
> Documentation/controllers/resource_counter.txt | 27 +++++-
> include/linux/res_counter.h | 15 +++
> kernel/res_counter.c | 107 ++++++++++++++++++++-----
> mm/memcontrol.c | 1
> 4 files changed, 129 insertions(+), 21 deletions(-)
>
> Index: linux-2.6.26-rc5-mm3/kernel/res_counter.c
> ===================================================================
> --- linux-2.6.26-rc5-mm3.orig/kernel/res_counter.c
> +++ linux-2.6.26-rc5-mm3/kernel/res_counter.c
> +
>
> +/**
> + * res_counter_set_ops() -- reset res->counter.ops to be passed ops.
> + * @coutner: a counter to be set ops.

typo:
* @counter:

> + * @ops: res_counter_ops
> + *
> + * This operations is allowed only when there is no parent or parent's
> + * hierarchy_model == RES_CONT_NO_HIERARCHY. returns 0 at success.
> + */
> +
> +int res_counter_set_ops(struct res_counter *counter,
> + struct res_counter_ops *ops)
> +{
> + struct res_counter *parent;
> + /*
> + * This operation is allowed only when parents's hierarchy
> + * is NO_HIERARCHY or this is ROOT.
> + */
> + parent = counter->parent;
> + if (parent && parent->ops.hierarchy_model != RES_CONT_NO_HIERARCHY)
> + return -EINVAL;
> +
> + counter->ops = *ops;
> +
> + return 0;
> +}
> +
> +
> int res_counter_charge_locked(struct res_counter *counter, unsigned long val)
> {
> if (counter->usage + val > counter->limit) {

> Index: linux-2.6.26-rc5-mm3/Documentation/controllers/resource_counter.txt
> ===================================================================
> --- linux-2.6.26-rc5-mm3.orig/Documentation/controllers/resource_counter.txt
> +++ linux-2.6.26-rc5-mm3/Documentation/controllers/resource_counter.txt
> @@ -39,11 +39,14 @@ to work with it.
> The failcnt stands for "failures counter". This is the number of
> resource allocation attempts that failed.
>
> - e. res_counter_ops.
> + e. parent
> + parent of this res_counter under hierarchy.
> +
> + f. res_counter_ops.
> Callbacks for helping resource_counter per each subsystem.
> - shrink_usage() .... called at limit change (decrease).
>
> - f. spinlock_t lock
> + g. spinlock_t lock
>
> Protects changes of the above values.
>
> @@ -157,7 +160,25 @@ counter fields. They are recommended to
> Returns 0 at success. Any error code is acceptable but -EBUSY will be
> suitable to show "the kernel can't shrink usage."
>
> -6. Usage example
> +6. Hierarchy
> +
> + Groups of res_counter can be controlled under some tree (cgroup tree).
> + Taking the tree into account, res_counter can be under some hierarchical
> + control. THe res_counter itself supports hierarchy_model and calls

The

> + registered callbacks at suitable events.
> +
> + For keeping sanity of hierarchy, hierarchy_model of a res_counter can be
> + changed when parent's hierarchy_model is RES_CONT_NO_HIERARCHY.
> + res_counter doesn't count # of children by itself but some subysystem should
> + be aware that it has no children if necessary.
> + (don't want to fully duplicate cgroup's hierarchy. Cost of remembering parent
> + is cheap.)
> +
> + a. Independent hierarchy (RES_CONT_NO_HIERARCHY) model
> + This is no relationship between parent and children.
> +
> +
> +7. Usage example
>
> a. Declare a task group (take a look at cgroups subsystem for this) and
> fold a res_counter into it


---
~Randy

2008-06-24 03:31:41

by Kamezawa Hiroyuki

Subject: Re: [PATCH 6/6] memcg: HARDWALL hierarchy



On Mon, 23 Jun 2008 15:29:41 -0700
Randy Dunlap <[email protected]> wrote:

> On Fri, 13 Jun 2008 18:37:41 +0900 KAMEZAWA Hiroyuki wrote:
>
> > Support hardwall hierarchy (and no-hierarchy) in memcg.
> >
> > Change log: v3->v4
> > - cut out from memcg hierarchy patch set v4.
> > - no major changes, but some amount of functions are moved to res_counter.
> > and be more gneric.
> >
> > Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
> >
> > ---
> > Documentation/controllers/memory.txt | 57 +++++++++++++++++++++++++++++-
> > mm/memcontrol.c | 65 +++++++++++++++++++++++++++++++++--
> > 2 files changed, 118 insertions(+), 4 deletions(-)
> >
> > Index: linux-2.6.26-rc5-mm3/Documentation/controllers/memory.txt
> > ===================================================================
> > --- linux-2.6.26-rc5-mm3.orig/Documentation/controllers/memory.txt
> > +++ linux-2.6.26-rc5-mm3/Documentation/controllers/memory.txt
> > @@ -154,7 +154,7 @@ The memory controller uses the following
> >
> > 0. Configuration
>
> I apologize if you have already corrected these. I'm a bit behind
> on doc reviews.
>
Thank you for all your help. I'll check my patch set again, especially
for plural 's' and 'a/an'.


>
> > -a. Enable CONFIG_CGROUPS
> > +a. Enable CONFESS_CGROUPS
>
> Really? Looks odd and backwards.
>
my mistake..

Thanks,
-Kame

> > b. Enable CONFIG_RESOURCE_COUNTERS
> > c. Enable CONFIG_CGROUP_MEM_RES_CTLR
> >
> > @@ -237,7 +237,58 @@ cgroup might have some charge associated
> > tasks have migrated away from it. Such charges are automatically dropped at
> > rmdir() if there are no tasks.
> >
> > -5. TODO
> > +5. Supported Hierarchy Model
> > +
> > +Currently, memory controller supports following models of hierarchy in the
> > +kernel. (see also resource_counter.txt)
> > +
> > +2 files are related to hierarchy.
> > + - memory.hierarchy_model
> > + - memory.for_children
> > +
> > +Basic Rule.
> > + - Hierarchy can be set per cgroup.
> > + - A child inherits parent's hierarchy model at creation.
> > + - A child can change its hierarchy only when the parent's hierarchy is
> > + NO_HIERARCY and it has no children.
>
> NO_HIERARCHY
>
sure

> > +
> > +
> > +5.1. NO_HIERARCHY
> > + - Each cgroup is independent from other ones.
> > + - When memory.hierarchy_model is 0, NO_HIERARCHY is used.
> > + Under this model, there is no controls based on tree of cgroups.
>
> there are no controls
>
Oh, thanks.

> > + This is the default model of root cgroup.
> > +
> > +5.2 HARDWALL_HIERARCHY
> > + - A child is a isolated portion of the parent.
>
> is an
>
> > + - When memory.hierarchy_model is 1, HARDWALL_HIERARCHY is used.
> > + In this model a child's limit is charged as parent's usage.
> > +
> > + Hard-Wall Hierarchy Example)
>
> Drop ')'.
>
> > + 1) Assume a cgroup with 1GB limits. (and no tasks belongs to this, now)
> > + - group_A limit=1G,usage=0M.
>
> , usage=0M.
>
> > +
> > + 2) create group B, C under A.
> > + - group A limit=1G, usage=0M, for_childre=0M
>
> for_children=0M
>
> > + - group B limit=0M, usage=0M.
> > + - group C limit=0M, usage=0M.
> > +
> > + 3) increase group B's limit to 300M.
> > + - group A limit=1G, usage=300M, for_children=300M
> > + - group B limit=300M, usage=0M.
> > + - group C limit=0M, usage=0M.
> > +
> > + 4) increase group C's limit to 500M
> > + - group A limit=1G, usage=800M, for_children=800M
> > + - group B limit=300M, usage=0M.
> > + - group C limit=500M, usage=0M.
> > +
> > + 5) reduce group B's limit to 100M
> > + - group A limit=1G, usage=600M, for_children=600M.
> > + - group B limit=100M, usage=0M.
> > + - group C limit=500M, usage=0M.
>
>
> ---
> ~Randy
> Linux Plumbers Conference, 17-19 September 2008, Portland, Oregon USA
> http://linuxplumbersconf.org/
>