2008-06-04 04:53:09

by Kamezawa Hiroyuki

Subject: [RFC][PATCH 0/2] memcg: hierarchy support (v3)

Hi, this is the third version.

While the code changes themselves are small, the whole _tone_ of the code
has changed. I'm not in a hurry; any comments are welcome.

Based on 2.6.26-rc2-mm1 + the memcg patches in the -mm queue.

Changes from v2:
- Named as HardWall policy.
- Rewrote the code to be easier to read; changed some function names.
- Added documentation text.
- Supported the hierarchy_model parameter.
  Now, no_hierarchy and hardwall_hierarchy are implemented.

HardWall Policy:
- designed for strict resource isolation under hierarchy.
Automatic load balancing between cgroups can break the user's
assumptions even if it is implemented very well.
- parent overcommits all children
parent->usage = resources used by itself + resources moved to children.
Of course, parent->limit >= parent->usage.
- when a child's limit is set, the resource moves.
- no automatic resource moving between parent <-> child.

Example)
1) Assume a cgroup with a 1GB limit. (and no tasks belong to it yet)
- group_A limit=1G,usage=0M.

2) create groups B and C under A.
- group A limit=1G, usage=0M
- group B limit=0M, usage=0M.
- group C limit=0M, usage=0M.

3) increase group B's limit to 300M.
- group A limit=1G, usage=300M.
- group B limit=300M, usage=0M.
- group C limit=0M, usage=0M.

4) increase group C's limit to 500M
- group A limit=1G, usage=800M.
- group B limit=300M, usage=0M.
- group C limit=500M, usage=0M.

5) reduce group B's limit to 100M
- group A limit=1G, usage=600M.
- group B limit=100M, usage=0M.
- group C limit=500M, usage=0M.


Thanks,
-Kame


2008-06-04 04:56:21

by Kamezawa Hiroyuki

Subject: [RFC][PATCH 1/2] memcg: res_counter hierarchy

Simple hard-wall hierarchy support for res_counter.

Changelog v2->v3
- changed the names and arguments of functions.
- rewrote the code to be easier to read.
- named as HardWall hierarchy.

This implements the following model:
- A cgroup tree represents a hierarchy of resources.
- All of a child's resources are moved from its parent.
- Resources moved to children are charged as the parent's usage.
- Resources move when child->limit is changed.
- The sum of the resources for children plus the counter's own usage is
bounded by "limit".

This implies:
- No dynamic automatic hierarchy balancing in the kernel.
- Each resource is isolated completely.
- The kernel just supports resource-move-at-change-in-limit.
- The user (middleware) is responsible for keeping the hierarchy well
balanced. Good balance can be achieved by changing limits from user land.


Background:
Recently, resource isolation techniques such as hardware virtualization
have become widely used, and hierarchical resource isolation can already
be done by running cgroups on top of them. But supporting hierarchy
management in cgroups itself has advantages in performance, unity, and
management cost.

Other OSes have good resource management, and some of them support
hierarchical resource management. We wonder what kind of hierarchy policy
is good for Linux. There is another point: a hierarchical system can be
implemented by kernel and user-land co-operation, so there are various
choices for what to do in the kernel, from doing everything in the kernel
to exporting proper interfaces to user land. Middleware tends to be used
for such management; I hope there will be an open-source one.

When supporting hierarchy in cgroups, several characteristics of a
hierarchy policy have to be considered. Some users need automatic
balancing between groups.

- fairness ... how fairness is kept under the policy.

- performance ... should be _fast_. Multi-level resource balancing tends
to use a large amount of CPU time and can cause soft lockups.

- predictability ... resource management is usually used for resource
isolation. The kernel must not break the isolation or the user's
ability to predict an application's progress.

- flexibility ... sophisticated dynamic resource balancing with
soft limits is welcome when the user doesn't want strict
resource isolation or cannot correctly estimate how much
resource they need.

Hard Wall Hierarchy.

This patch implements a hard-wall model of hierarchy for resources.
It works well for users who want strict resource isolation.

This model allows moving resources only between a parent and its children.
A resource is moved to a child when the child declares the amount of
resources to be used (via its limit).
Automatic resource balancing is not supported in this code.
(But users can balance things non-automatically by changing limits
dynamically.)

- fairness ... good. No resource sharing; works as specified by users.
- performance ... good. Each resource is encapsulated at its own level.
- predictability ... good. Resources are completely isolated. Balancing only
occurs when a limit is changed.
- flexibility ... bad. No flexibility or scheduling at the kernel level;
needs middleware's help.

Considerations:
- This implementation uses "limit" == "currently available resource".
This should be revisited when the Soft-Limit model is implemented.

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>

---
Documentation/controllers/resource_counter.txt | 41 +++++++++
include/linux/res_counter.h | 90 +++++++++++++++++++-
kernel/res_counter.c | 112 +++++++++++++++++++++++--
3 files changed, 235 insertions(+), 8 deletions(-)

Index: temp-2.6.26-rc2-mm1/include/linux/res_counter.h
===================================================================
--- temp-2.6.26-rc2-mm1.orig/include/linux/res_counter.h
+++ temp-2.6.26-rc2-mm1/include/linux/res_counter.h
@@ -38,6 +38,16 @@ struct res_counter {
* the number of unsuccessful attempts to consume the resource
*/
unsigned long long failcnt;
+
+ /*
+ * hierarchy support: the parent of this resource.
+ */
+ struct res_counter *parent;
+ /*
+ * the amount of resources assigned to children.
+ */
+ unsigned long long for_children;
+
/*
* the lock to protect all of the above.
* the routines below consider this to be IRQ-safe
@@ -63,9 +73,20 @@ u64 res_counter_read_u64(struct res_coun
ssize_t res_counter_read(struct res_counter *counter, int member,
const char __user *buf, size_t nbytes, loff_t *pos,
int (*read_strategy)(unsigned long long val, char *s));
+
+/*
+ * An interface for setting a res_counter's member (e.g. limit).
+ * The new parameter is passed via *buf and translated by write_strategy().
+ * Then it is applied to the member under the control of set_strategy().
+ * Both write_strategy() and set_strategy() can be NULL; see res_counter.c.
+ */
+
ssize_t res_counter_write(struct res_counter *counter, int member,
- const char __user *buf, size_t nbytes, loff_t *pos,
- int (*write_strategy)(char *buf, unsigned long long *val));
+ const char __user *buf, size_t nbytes, loff_t *pos,
+ int (*write_strategy)(char *buf, unsigned long long *val),
+ int (*set_strategy)(struct res_counter *res, unsigned long long val,
+ int what),
+ );

/*
* the field descriptors. one for each member of res_counter
@@ -76,15 +97,33 @@ enum {
RES_MAX_USAGE,
RES_LIMIT,
RES_FAILCNT,
+ RES_FOR_CHILDREN,
};

/*
* helpers for accounting
*/

+/*
+ * initialize res_counter.
+ * @counter : the counter
+ *
+ * initialize res_counter and set the default limit to a very big value (unlimited)
+ */
+
void res_counter_init(struct res_counter *counter);

/*
+ * initialize res_counter under hierarchy.
+ * @counter : the counter
+ * @parent : the parent of the counter
+ *
+ * initialize res_counter, set the default limit to 0, and set "parent".
+ */
+void res_counter_init_hierarchy(struct res_counter *counter,
+ struct res_counter *parent);
+
+/*
* charge - try to consume more resource.
*
* @counter: the counter
@@ -153,4 +192,51 @@ static inline void res_counter_reset_fai
cnt->failcnt = 0;
spin_unlock_irqrestore(&cnt->lock, flags);
}
+
+/**
+ * Move resources from a parent to a child.
+ * At success,
+ * parent->usage += val.
+ * parent->for_children += val.
+ * child->limit += val.
+ *
+ * @child: the entity whose limit is raised. The parent is child->parent.
+ * @val: the amount of resource to be moved.
+ * @callback: called when the parent's free resource is not enough for the
+ * move. This can be NULL if no callback is necessary.
+ * @retry: limit on the number of callback retries.
+ * -1 means retry forever. yield() is called between retries.
+ * Returns 0 on success, !0 on failure.
+ *
+ * The callback returns 0 on success, !0 on failure.
+ *
+ */
+
+int res_counter_move_resource(struct res_counter *child,
+ unsigned long long val,
+ int (*callback)(struct res_counter *res, unsigned long long val),
+ int retry);
+
+
+/**
+ * Return resource to its parent.
+ * At success,
+ * parent->usage -= val.
+ * parent->for_children -= val.
+ * child->limit -= val.
+ *
+ * @child: the entry to resize. The parent is child->parent.
+ * @val : how much the child repays to the parent. -1 means 'all'.
+ * @callback: a callback for decreasing the child's resource usage before
+ * returning. If NULL, just decreases the child's limit.
+ * @retry: number of callback retries when freeing resources.
+ * -1 means retry forever. yield() is called between retries.
+ * Returns 0 on success.
+ */
+
+int res_counter_return_resource(struct res_counter *child,
+ unsigned long long val,
+ int (*callback)(struct res_counter *res, unsigned long long val),
+ int retry);
+
#endif
Index: temp-2.6.26-rc2-mm1/Documentation/controllers/resource_counter.txt
===================================================================
--- temp-2.6.26-rc2-mm1.orig/Documentation/controllers/resource_counter.txt
+++ temp-2.6.26-rc2-mm1/Documentation/controllers/resource_counter.txt
@@ -44,6 +44,13 @@ to work with it.
Protects changes of the above values.


+ f. struct res_counter *parent
+
+ Parent res_counter under hierarchy.
+
+ g. unsigned long long for_children
+
+ Resources assigned to children. This is included in usage.

2. Basic accounting routines

@@ -179,3 +186,37 @@ counter fields. They are recommended to
still can help with it).

c. Compile and run :)
+
+
+6. Hierarchy
+ a. No Hierarchy
+ Each cgroup uses its own private resources.
+
+ b. Hard-wall Hierarchy
+ A simple hierarchical tree system for resource isolation.
+ Allows moving resources only between a parent and its children.
+ A parent can move its resources to children and records the amount in
+ the for_children member. A child can get new resources only from its
+ parent. A child's limit is the amount of resource moved from its parent.
+
+ When "val" is added to a child,
+ parent->usage += val
+ parent->for_children += val
+ child->limit += val
+ When a child returns "val" of its resources,
+ parent->usage -= val
+ parent->for_children -= val
+ child->limit -= val.
+
+ This implements resource isolation among groups. This works very well
+ when you want strict resource isolation.
+
+ Usage Hint:
+ This looks like static resource assignment, but dynamic re-assignment
+ can be done by resetting the "limit" of groups. When you treat "limit"
+ as the amount of _currently_ allowed resource, a sophisticated resource
+ management system based on strict resource isolation can be implemented.
+
+ c. Soft-wall Hierarchy
+ TBD.
+
Index: temp-2.6.26-rc2-mm1/kernel/res_counter.c
===================================================================
--- temp-2.6.26-rc2-mm1.orig/kernel/res_counter.c
+++ temp-2.6.26-rc2-mm1/kernel/res_counter.c
@@ -20,6 +20,14 @@ void res_counter_init(struct res_counter
counter->limit = (unsigned long long)LLONG_MAX;
}

+void res_counter_init_hierarchy(struct res_counter *counter,
+ struct res_counter *parent)
+{
+ spin_lock_init(&counter->lock);
+ counter->limit = 0;
+ counter->parent = parent;
+}
+
int res_counter_charge_locked(struct res_counter *counter, unsigned long val)
{
if (counter->usage + val > counter->limit) {
@@ -74,6 +82,8 @@ res_counter_member(struct res_counter *c
return &counter->limit;
case RES_FAILCNT:
return &counter->failcnt;
+ case RES_FOR_CHILDREN:
+ return &counter->for_children;
};

BUG();
@@ -104,7 +114,9 @@ u64 res_counter_read_u64(struct res_coun

ssize_t res_counter_write(struct res_counter *counter, int member,
const char __user *userbuf, size_t nbytes, loff_t *pos,
- int (*write_strategy)(char *st_buf, unsigned long long *val))
+ int (*write_strategy)(char *st_buf, unsigned long long *val),
+ int (*set_strategy)(struct res_counter *res,
+ unsigned long long val, int what))
{
int ret;
char *buf, *end;
@@ -133,13 +145,101 @@ ssize_t res_counter_write(struct res_cou
if (*end != '\0')
goto out_free;
}
- spin_lock_irqsave(&counter->lock, flags);
- val = res_counter_member(counter, member);
- *val = tmp;
- spin_unlock_irqrestore(&counter->lock, flags);
- ret = nbytes;
+ if (set_strategy) {
+ ret = set_strategy(res, tmp, member);
+ if (!ret)
+ ret = nbytes;
+ } else {
+ spin_lock_irqsave(&counter->lock, flags);
+ val = res_counter_member(counter, member);
+ *val = tmp;
+ spin_unlock_irqrestore(&counter->lock, flags);
+ ret = nbytes;
+ }
out_free:
kfree(buf);
out:
return ret;
}
+
+
+int res_counter_move_resource(struct res_counter *child,
+ unsigned long long val,
+ int (*callback)(struct res_counter *res, unsigned long long val),
+ int retry)
+{
+ struct res_counter *parent = child->parent;
+ unsigned long flags;
+
+ BUG_ON(!parent);
+
+ while (1) {
+ spin_lock_irqsave(&parent->lock, flags);
+ if (parent->usage + val < parent->limit) {
+ parent->for_children += val;
+ parent->usage += val;
+ break;
+ }
+ spin_unlock_irqrestore(&parent->lock, flags);
+
+ if (!retry || !callback)
+ goto fail;
+ /* -1 means infinite loop */
+ if (retry != -1)
+ --retry;
+ yield();
+ callback(parent, val);
+ }
+ spin_unlock_irqrestore(&parent->lock, flags);
+
+ spin_lock_irqsave(&child->lock, flags);
+ child->limit += val;
+ spin_unlock_irqrestore(&child->lock, flags);
+ return 0;
+fail:
+ return 1;
+}
+
+
+int res_counter_return_resource(struct res_counter *child,
+ unsigned long long val,
+ int (*callback)(struct res_counter *res, unsigned long long val),
+ int retry)
+{
+ unsigned long flags;
+ struct res_counter *parent = child->parent;
+
+ BUG_ON(!parent);
+
+ while (1) {
+ spin_lock_irqsave(&child->lock, flags);
+ if (val == (unsigned long long) -1) {
+ val = child->limit;
+ child->limit = 0;
+ break;
+ } else if (child->usage <= child->limit - val) {
+ child->limit -= val;
+ break;
+ }
+ spin_unlock_irqrestore(&child->lock, flags);
+
+ if (!retry)
+ goto fail;
+ /* -1 means infinite loop */
+ if (retry != -1)
+ --retry;
+ yield();
+ callback(parent, val);
+ }
+ spin_unlock_irqrestore(&child->lock, flags);
+
+ spin_lock_irqsave(&parent->lock, flags);
+ BUG_ON(parent->for_children < val);
+ BUG_ON(parent->usage < val);
+ parent->for_children -= val;
+ parent->usage -= val;
+ spin_unlock_irqrestore(&parent->lock, flags);
+ return 0;
+fail:
+ return 1;
+}

2008-06-04 04:58:22

by Kamezawa Hiroyuki

Subject: [RFC][PATCH 2/2] memcg: hardwall hierarchy for memcg

Hard-Wall hierarchy support for memcg.
- a new member, hierarchy_model, is added to memcg.

Only the root cgroup can modify this, and only when it has no children.

Adds the following functions to support the HARDWALL hierarchy:
- try to reclaim memory when "limit" is changed.
- try to reclaim all memory at force_empty.
- return resources to the parent at destroy.

Changelog v2->v3
- added documentation.
- hierarchy_model parameter is added.


Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>

---
Documentation/controllers/memory.txt | 27 +++++-
mm/memcontrol.c | 156 ++++++++++++++++++++++++++++++++++-
2 files changed, 178 insertions(+), 5 deletions(-)

Index: temp-2.6.26-rc2-mm1/mm/memcontrol.c
===================================================================
--- temp-2.6.26-rc2-mm1.orig/mm/memcontrol.c
+++ temp-2.6.26-rc2-mm1/mm/memcontrol.c
@@ -137,6 +137,8 @@ struct mem_cgroup {
struct mem_cgroup_lru_info info;

int prev_priority; /* for recording reclaim priority */
+
+ int hierarchy_model; /* used hierarchical policy */
/*
* statistics.
*/
@@ -144,6 +146,10 @@ struct mem_cgroup {
};
static struct mem_cgroup init_mem_cgroup;

+
+#define MEMCG_NO_HIERARCHY (0)
+#define MEMCG_HARDWALL_HIERARCHY (1)
+
/*
* We use the lower bit of the page->page_cgroup pointer as a bit spin
* lock. We need to ensure that page->page_cgroup is at least two
@@ -792,6 +798,89 @@ int mem_cgroup_shrink_usage(struct mm_st
}

/*
+ * Memory Controller hierarchy support.
+ */
+
+/*
+ * shrink usage so that res->usage + val <= res->limit.
+ */
+
+int memcg_shrink_val(struct res_counter *cnt, unsigned long long val)
+{
+ struct mem_cgroup *memcg = container_of(cnt, struct mem_cgroup, res);
+ unsigned long flags;
+ int ret = 1;
+ int progress = 1;
+
+retry:
+ spin_lock_irqsave(&cnt->lock, flags);
+ /* Need to shrink ? */
+ if (cnt->usage + val <= cnt->limit)
+ ret = 0;
+ spin_unlock_irqrestore(&cnt->lock, flags);
+
+ if (!ret)
+ return 0;
+
+ if (!progress)
+ return 1;
+ progress = try_to_free_mem_cgroup_pages(memcg, GFP_KERNEL);
+
+ goto retry;
+}
+
+/*
+ * For Hard Wall Hierarchy.
+ */
+
+int mem_cgroup_resize_callback(struct res_counter *cnt,
+ unsigned long long val, int what)
+{
+ unsigned long flags, borrow;
+ unsigned long long diffs;
+ int ret = 0;
+
+ BUG_ON(what != RES_LIMIT);
+
+ /* Is this under hierarchy ? */
+ if (!cnt->parent) {
+ spin_lock_irqsave(&cnt->lock, flags);
+ cnt->limit = val;
+ spin_unlock_irqrestore(&cnt->lock, flags);
+ return 0;
+ }
+
+ spin_lock_irqsave(&cnt->lock, flags);
+ if (val > cnt->limit) {
+ diffs = val - cnt->limit;
+ borrow = 1;
+ } else {
+ diffs = cnt->limit - val;
+ borrow = 0;
+ }
+ spin_unlock_irqrestore(&cnt->lock, flags);
+
+ if (borrow)
+ ret = res_counter_move_resource(cnt,diffs,
+ memcg_shrink_val,
+ MEM_CGROUP_RECLAIM_RETRIES);
+ else
+ ret = res_counter_return_resource(cnt, diffs,
+ memcg_shrink_val,
+ MEM_CGROUP_RECLAIM_RETRIES);
+ return ret;
+}
+
+
+void memcg_shrink_all(struct mem_cgroup *mem)
+{
+ unsigned long long val;
+
+ val = res_counter_read_u64(&mem->res, RES_LIMIT);
+ memcg_shrink_val(&mem->res, val);
+}
+
+/*
* This routine traverse page_cgroup in given list and drop them all.
* *And* this routine doesn't reclaim page itself, just removes page_cgroup.
*/
@@ -848,6 +937,8 @@ static int mem_cgroup_force_empty(struct
if (mem_cgroup_subsys.disabled)
return 0;

+ memcg_shrink_all(mem);
+
css_get(&mem->css);
/*
* page reclaim code (kswapd etc..) will move pages between
@@ -896,11 +987,44 @@ static ssize_t mem_cgroup_write(struct c
struct file *file, const char __user *userbuf,
size_t nbytes, loff_t *ppos)
{
- return res_counter_write(&mem_cgroup_from_cont(cont)->res,
+ struct mem_cgroup *memcg = mem_cgroup_from_cont(cont);
+
+ if (cft->private != RES_LIMIT
+ || !cont->parent
+ || memcg->hierarchy_model == MEMCG_NO_HIERARCHY)
+ return res_counter_write(&memcg->res, cft->private, userbuf,
+ nbytes, ppos, mem_cgroup_write_strategy, NULL);
+ else
+ return res_counter_write(&memcg->res,
cft->private, userbuf, nbytes, ppos,
- mem_cgroup_write_strategy);
+ mem_cgroup_write_strategy,
+ mem_cgroup_resize_callback);
+}
+
+
+static u64 mem_cgroup_read_hierarchy(struct cgroup *cgrp, struct cftype *cft)
+{
+ struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
+ return memcg->hierarchy_model;
+}
+
+static int mem_cgroup_write_hierarchy(struct cgroup *cgrp, struct cftype *cft,
+ u64 val)
+{
+ struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
+ /* changing the policy is allowed only for the root cgroup with no children */
+ if (cgrp->parent)
+ return -EINVAL;
+ if (!list_empty(&cgrp->children))
+ return -EINVAL;
+ if (val == 0 || val == 1) {
+ memcg->hierarchy_model = val;
+ return 0;
+ }
+ return -EINVAL;
}

+
static int mem_cgroup_reset(struct cgroup *cont, unsigned int event)
{
struct mem_cgroup *mem;
@@ -992,6 +1116,16 @@ static struct cftype mem_cgroup_files[]
.name = "stat",
.read_map = mem_control_stat_show,
},
+ {
+ .name = "hierarchy_model",
+ .read_u64 = mem_cgroup_read_hierarchy,
+ .write_u64 = mem_cgroup_write_hierarchy,
+ },
+ {
+ .name = "assigned_to_child",
+ .private = RES_FOR_CHILDREN,
+ .read_u64 = mem_cgroup_read,
+ },
};

static int alloc_mem_cgroup_per_zone_info(struct mem_cgroup *mem, int node)
@@ -1056,19 +1190,27 @@ static void mem_cgroup_free(struct mem_c
static struct cgroup_subsys_state *
mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
{
- struct mem_cgroup *mem;
+ struct mem_cgroup *mem, *parent;
int node;

if (unlikely((cont->parent) == NULL)) {
mem = &init_mem_cgroup;
page_cgroup_cache = KMEM_CACHE(page_cgroup, SLAB_PANIC);
+ parent = NULL;
} else {
mem = mem_cgroup_alloc();
if (!mem)
return ERR_PTR(-ENOMEM);
+ parent = mem_cgroup_from_cont(cont->parent);
}

- res_counter_init(&mem->res);
+ if (!parent || parent->hierarchy_model == MEMCG_NO_HIERARCHY) {
+ res_counter_init(&mem->res);
+ mem->hierarchy_model = MEMCG_NO_HIERARCHY;
+ } else {
+ res_counter_init_hierarchy(&mem->res, &parent->res);
+ mem->hierarchy_model = parent->hierarchy_model;
+ }

for_each_node_state(node, N_POSSIBLE)
if (alloc_mem_cgroup_per_zone_info(mem, node))
@@ -1096,6 +1238,12 @@ static void mem_cgroup_destroy(struct cg
int node;
struct mem_cgroup *mem = mem_cgroup_from_cont(cont);

+ if (cont->parent &&
+ mem->hierarchy_model == MEMCG_HARDWALL_HIERARCHY) {
+ /* we did what we could... just return what we borrowed */
+ res_counter_return_resource(&mem->res, -1, NULL, 0);
+ }
+
for_each_node_state(node, N_POSSIBLE)
free_mem_cgroup_per_zone_info(mem, node);

Index: temp-2.6.26-rc2-mm1/Documentation/controllers/memory.txt
===================================================================
--- temp-2.6.26-rc2-mm1.orig/Documentation/controllers/memory.txt
+++ temp-2.6.26-rc2-mm1/Documentation/controllers/memory.txt
@@ -237,12 +237,37 @@ cgroup might have some charge associated
tasks have migrated away from it. Such charges are automatically dropped at
rmdir() if there are no tasks.

-5. TODO
+5. Hierarchy Model
+ The kernel supports the following hierarchy models.
+ (Your middleware may support others built on top of these.)
+
+ 5-a. Independent Hierarchy
+ There is no relationship between any cgroups, even between a parent and
+ its children. This is the default mode. To use this hierarchy, write 0
+ to the root cgroup's memory.hierarchy_model:
+ echo 0 > .../memory.hierarchy_model
+
+ 5-b. Hardwall Hierarchy.
+ The resource has to be moved from the parent to the child before use.
+ When a child's limit is set to 'val', val of the resource is moved from
+ the parent to the child, and the parent's usage += val.
+ The amount assigned to children is reported by the file
+
+ - memory.assigned_to_child
+
+ This policy doesn't provide sophisticated automatic resource balancing in
+ the kernel, but it is very good for strict resource isolation. Users
+ get high predictability of application behavior if this is used in a
+ proper environment.
+
+
+6. TODO

1. Add support for accounting huge pages (as a separate controller)
2. Make per-cgroup scanner reclaim not-shared pages first
3. Teach controller to account for shared-pages
4. Start reclamation when the limit is lowered
+ (this is already done in Hardwall Hierarchy)
5. Start reclamation in the background when the limit is
not yet hit but the usage is getting closer

2008-06-04 06:44:43

by Li Zefan

Subject: Re: [RFC][PATCH 2/2] memcg: hardwall hierarchy for memcg

KAMEZAWA Hiroyuki wrote:
> Hard-Wall hierarchy support for memcg.
> - new member hierarchy_model is added to memcg.
>
> Only the root cgroup can modify this, and only when it has no children.
>
> Adds following functions for supporting HARDWALL hierarchy.
> - try to reclaim memory at the change of "limit".
> - try to reclaim all memory at force_empty
> - returns resources to the parent at destroy.
>
> Changelog v2->v3
> - added documentation.
> - hierarchy_model parameter is added.
>
>
> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
>
> ---
> Documentation/controllers/memory.txt | 27 +++++-
> mm/memcontrol.c | 156 ++++++++++++++++++++++++++++++++++-
> 2 files changed, 178 insertions(+), 5 deletions(-)
>
> Index: temp-2.6.26-rc2-mm1/mm/memcontrol.c
> ===================================================================
> --- temp-2.6.26-rc2-mm1.orig/mm/memcontrol.c
> +++ temp-2.6.26-rc2-mm1/mm/memcontrol.c
> @@ -137,6 +137,8 @@ struct mem_cgroup {
> struct mem_cgroup_lru_info info;
>
> int prev_priority; /* for recording reclaim priority */
> +
> + int hierarchy_model; /* used hierarchical policy */

hierarchy_model can be a global value instead of a per-cgroup value.

2008-06-04 06:49:10

by Kamezawa Hiroyuki

Subject: Re: [RFC][PATCH 2/2] memcg: hardwall hierarchy for memcg

On Wed, 04 Jun 2008 14:42:21 +0800
Li Zefan <[email protected]> wrote:

> KAMEZAWA Hiroyuki wrote:
> > Hard-Wall hierarchy support for memcg.
> > - new member hierarchy_model is added to memcg.
> >
> > Only the root cgroup can modify this, and only when it has no children.
> >
> > Adds following functions for supporting HARDWALL hierarchy.
> > - try to reclaim memory at the change of "limit".
> > - try to reclaim all memory at force_empty
> > - returns resources to the parent at destroy.
> >
> > Changelog v2->v3
> > - added documentation.
> > - hierarchy_model parameter is added.
> >
> >
> > Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
> >
> > ---
> > Documentation/controllers/memory.txt | 27 +++++-
> > mm/memcontrol.c | 156 ++++++++++++++++++++++++++++++++++-
> > 2 files changed, 178 insertions(+), 5 deletions(-)
> >
> > Index: temp-2.6.26-rc2-mm1/mm/memcontrol.c
> > ===================================================================
> > --- temp-2.6.26-rc2-mm1.orig/mm/memcontrol.c
> > +++ temp-2.6.26-rc2-mm1/mm/memcontrol.c
> > @@ -137,6 +137,8 @@ struct mem_cgroup {
> > struct mem_cgroup_lru_info info;
> >
> > int prev_priority; /* for recording reclaim priority */
> > +
> > + int hierarchy_model; /* used hierarchical policy */
>
> hierarchy_model can be a global value instead of per cgroup value.
>
Ah, hmm... yes. Thank you for pointing that out.

Regards,
-Kame

2008-06-04 06:56:43

by Li Zefan

Subject: Re: [RFC][PATCH 1/2] memcg: res_counter hierarchy

> Index: temp-2.6.26-rc2-mm1/Documentation/controllers/resource_counter.txt
> ===================================================================
> --- temp-2.6.26-rc2-mm1.orig/Documentation/controllers/resource_counter.txt
> +++ temp-2.6.26-rc2-mm1/Documentation/controllers/resource_counter.txt
> @@ -44,6 +44,13 @@ to work with it.
> Protects changes of the above values.
>
>
> + f. struct res_counter *parent
> +
> + Parent res_counter under hierarchy.
> +
> + g. unsigned long long for_children
> +
> + Resources assigned to children. This is included in usage.

Since parent and for_children are also protected by res_counter->lock,
the above text should appear before 'e. spinlock_t lock'.

2008-06-04 06:58:45

by Kamezawa Hiroyuki

Subject: Re: [RFC][PATCH 1/2] memcg: res_counter hierarchy

On Wed, 04 Jun 2008 14:54:12 +0800
Li Zefan <[email protected]> wrote:

> > Index: temp-2.6.26-rc2-mm1/Documentation/controllers/resource_counter.txt
> > ===================================================================
> > --- temp-2.6.26-rc2-mm1.orig/Documentation/controllers/resource_counter.txt
> > +++ temp-2.6.26-rc2-mm1/Documentation/controllers/resource_counter.txt
> > @@ -44,6 +44,13 @@ to work with it.
> > Protects changes of the above values.
> >
> >
> > + f. struct res_counter *parent
> > +
> > + Parent res_counter under hierarchy.
> > +
> > + g. unsigned long long for_children
> > +
> > + Resources assigned to children. This is included in usage.
>
> Since parent and for_children are also protected by res_count->lock,
> the above text should appear before 'e. spinlock_t lock'.
>
ok.

Thanks,
-Kame

2008-06-04 07:20:59

by YAMAMOTO Takashi

Subject: Re: [RFC][PATCH 1/2] memcg: res_counter hierarchy

> ssize_t res_counter_write(struct res_counter *counter, int member,
> - const char __user *buf, size_t nbytes, loff_t *pos,
> - int (*write_strategy)(char *buf, unsigned long long *val));
> + const char __user *buf, size_t nbytes, loff_t *pos,
> + int (*write_strategy)(char *buf, unsigned long long *val),
> + int (*set_strategy)(struct res_counter *res, unsigned long long val,
> + int what),

this comma seems surplus.

> + );

> +int res_counter_return_resource(struct res_counter *child,
> + unsigned long long val,
> + int (*callback)(struct res_counter *res, unsigned long long val),
> + int retry)
> +{

> + callback(parent, val);

s/parent/child/

YAMAMOTO Takashi

2008-06-04 07:27:41

by Kamezawa Hiroyuki

Subject: Re: [RFC][PATCH 1/2] memcg: res_counter hierarchy

On Wed, 4 Jun 2008 16:20:48 +0900 (JST)
[email protected] (YAMAMOTO Takashi) wrote:

> > ssize_t res_counter_write(struct res_counter *counter, int member,
> > - const char __user *buf, size_t nbytes, loff_t *pos,
> > - int (*write_strategy)(char *buf, unsigned long long *val));
> > + const char __user *buf, size_t nbytes, loff_t *pos,
> > + int (*write_strategy)(char *buf, unsigned long long *val),
> > + int (*set_strategy)(struct res_counter *res, unsigned long long val,
> > + int what),
>
> this comma seems surplus.
>
Ouch, I thought I had fixed this... maybe a patch refresh problem. Thanks.


> > + );
>
> > +int res_counter_return_resource(struct res_counter *child,
> > + unsigned long long val,
> > + int (*callback)(struct res_counter *res, unsigned long long val),
> > + int retry)
> > +{
>
> > + callback(parent, val);
>
> s/parent/child/
>
Hmm..yes.

Thanks,
-Kame

2008-06-04 08:59:51

by Paul Menage

Subject: Re: [RFC][PATCH 0/2] memcg: hierarchy support (v3)

Hi Kame,

I like the idea of keeping the kernel simple, and moving more of the
intelligence to userspace.

It may need the kernel to expose a bit more in the way of VM details,
such as memory pressure, OOM notifications, etc, but as long as
userspace can respond quickly to memory imbalance, it should work
fine. We're doing something a bit similar using cpusets and fake NUMA
at Google - the principle of juggling memory between cpusets is the
same, but the granularity is much worse :-)

On Tue, Jun 3, 2008 at 9:58 PM, KAMEZAWA Hiroyuki
<[email protected]> wrote:
> - supported hierarchy_model parameter.
> Now, no_hierarchy and hardwall_hierarchy is implemented.

Should we try to support hierarchy and non-hierarchy cgroups in the
same tree? Maybe we should just enforce the restrictions that:

- the hierarchy mode can't be changed on a cgroup if you have children
or any non-zero usage/limit
- a cgroup inherits its parent's hierarchy mode.


> - parent overcommits all children

I'm not sure that "overcommits" is the right word here - specifically,
the model ensures that a parent can't overcommit its children beyond
its limit.

Paul

2008-06-04 09:00:15

by Paul Menage

Subject: Re: [RFC][PATCH 1/2] memcg: res_counter hierarchy

On Tue, Jun 3, 2008 at 10:01 PM, KAMEZAWA Hiroyuki
<[email protected]> wrote:
> int ret;
> char *buf, *end;
> @@ -133,13 +145,101 @@ ssize_t res_counter_write(struct res_cou
> if (*end != '\0')
> goto out_free;
> }
> - spin_lock_irqsave(&counter->lock, flags);
> - val = res_counter_member(counter, member);
> - *val = tmp;
> - spin_unlock_irqrestore(&counter->lock, flags);
> - ret = nbytes;
> + if (set_strategy) {
> + ret = set_strategy(res, tmp, member);
> + if (!ret)
> + ret = nbytes;
> + } else {
> + spin_lock_irqsave(&counter->lock, flags);
> + val = res_counter_member(counter, member);
> + *val = tmp;
> + spin_unlock_irqrestore(&counter->lock, flags);
> + ret = nbytes;
> + }

I think that the hierarchy/reclaim handling that you currently have in
the memory controller should be here; the memory controller should
just be able to pass a reference to try_to_free_mem_cgroup_pages() and
have everything else handled by res_counter.

Paul

2008-06-04 09:00:39

by Paul Menage

Subject: Re: [RFC][PATCH 2/2] memcg: hardwall hierarchy for memcg

On Tue, Jun 3, 2008 at 10:03 PM, KAMEZAWA Hiroyuki
<[email protected]> wrote:
> @@ -792,6 +798,89 @@ int mem_cgroup_shrink_usage(struct mm_st
> }
>
> /*
> + * Memory Controller hierarchy support.
> + */
> +
> +/*
> + * shrink usage to be res->usage + val < res->limit.
> + */
> +
> +int memcg_shrink_val(struct res_counter *cnt, unsigned long long val)
> +{
> + struct mem_cgroup *memcg = container_of(cnt, struct mem_cgroup, res);
> + unsigned long flags;
> + int ret = 1;
> + int progress = 1;
> +
> +retry:
> + spin_lock_irqsave(&cnt->lock, flags);
> + /* Need to shrink ? */
> + if (cnt->usage + val <= cnt->limit)
> + ret = 0;
> + spin_unlock_irqrestore(&cnt->lock, flags);

Can't this logic be in res_counter itself? I.e. the callback can
assume that some shrinking needs to be done, and should just do it and
return. The res_counter can handle retrying if necessary.

> +/*
> + * For Hard Wall Hierarchy.
> + */
> +
> +int mem_cgroup_resize_callback(struct res_counter *cnt,
> + unsigned long long val, int what)
> +{
> + unsigned long flags, borrow;
> + unsigned long long diffs;
> + int ret = 0;
> +
> + BUG_ON(what != RES_LIMIT);
> +
> + /* Is this under hierarchy ? */
> + if (!cnt->parent) {
> + spin_lock_irqsave(&cnt->lock, flags);
> + cnt->limit = val;
> + spin_unlock_irqrestore(&cnt->lock, flags);
> + return 0;
> + }
> +
> + spin_lock_irqsave(&cnt->lock, flags);
> + if (val > cnt->limit) {
> + diffs = val - cnt->limit;
> + borrow = 1;
> + } else {
> + diffs = cnt->limit - val;
> + borrow = 0;
> + }
> + spin_unlock_irqrestore(&cnt->lock, flags);
> +
> + if (borrow)
> + ret = res_counter_move_resource(cnt,diffs,
> + memcg_shrink_val,
> + MEM_CGROUP_RECLAIM_RETRIES);
> + else
> + ret = res_counter_return_resource(cnt, diffs,
> + memcg_shrink_val,
> + MEM_CGROUP_RECLAIM_RETRIES);
> + return ret;
> +}

Again, a lot of this function seems like generic logic that should be
in res_counter. The only bit that's memory specific is the
memcg_shrink_val, and maybe that could just be passed when creating
the res_counter. Perhaps we should have a res_counter_ops structure
with operations like "parse" for parsing strings into numbers
(currently called "write_strategy") and "reclaim" for trying to shrink
the usage.
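Paul's res_counter_ops idea can be sketched as a small user-space model. Everything below is hypothetical (the structure name, its fields, and the toy reclaim routine are illustrative, not actual kernel code); it only shows how res_counter could drive the retry loop itself while the controller supplies parse/reclaim callbacks.

```c
#include <assert.h>
#include <stddef.h>

struct res_counter;

/* Hypothetical ops table: the controller supplies only parsing and
 * reclaim; res_counter drives everything else. */
struct res_counter_ops {
	/* parse a user string into a value (was "write_strategy") */
	int (*parse)(const char *buf, unsigned long long *val);
	/* try to shrink usage; return 0 on progress, nonzero on failure */
	int (*reclaim)(struct res_counter *res, unsigned long long val);
};

struct res_counter {
	unsigned long long usage;
	unsigned long long limit;
	const struct res_counter_ops *ops;
};

/* res_counter itself retries reclaim until the new limit fits,
 * instead of each controller open-coding the loop. */
static int res_counter_set_limit(struct res_counter *res,
				 unsigned long long new_limit, int retries)
{
	while (res->usage > new_limit) {
		if (retries-- <= 0 || !res->ops || !res->ops->reclaim)
			return -1;
		if (res->ops->reclaim(res, 0))
			return -1;
	}
	res->limit = new_limit;
	return 0;
}

/* toy reclaim standing in for try_to_free_mem_cgroup_pages():
 * frees 10 units per call. */
static int toy_reclaim(struct res_counter *res, unsigned long long val)
{
	(void)val;
	if (res->usage < 10)
		return 1;
	res->usage -= 10;
	return 0;
}
```

With a split like this, mem_cgroup_write() would shrink to filling in an ops table, which matches the suggestion that only the reclaim entry point is memory-specific.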

> @@ -896,11 +987,44 @@ static ssize_t mem_cgroup_write(struct c
> struct file *file, const char __user *userbuf,
> size_t nbytes, loff_t *ppos)
> {
> - return res_counter_write(&mem_cgroup_from_cont(cont)->res,
> + struct mem_cgroup *memcg = mem_cgroup_from_cont(cont);
> +
> + if (cft->private != RES_LIMIT
> + || !cont->parent
> + || memcg->hierarchy_model == MEMCG_NO_HIERARCHY)

The res_counter already knows whether it has a parent, so these checks
shouldn't be necessary.

> @@ -1096,6 +1238,12 @@ static void mem_cgroup_destroy(struct cg
> int node;
> struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
>
> + if (cont->parent &&
> + mem->hierarchy_model == MEMCG_HARDWALL_HIERARCHY) {
> + /* we did what we can...just returns what we borrow */
> + res_counter_return_resource(&mem->res, -1, NULL, 0);
> + }
> +

Should we also re-account any remaining child usage to the parent?

Paul

2008-06-04 09:10:16

by Kamezawa Hiroyuki

Subject: Re: [RFC][PATCH 0/2] memcg: hierarchy support (v3)

On Wed, 4 Jun 2008 01:59:32 -0700
"Paul Menage" <[email protected]> wrote:

> Hi Kame,
>
> I like the idea of keeping the kernel simple, and moving more of the
> intelligence to userspace.
>
thanks.

> It may need the kernel to expose a bit more in the way of VM details,
> such as memory pressure, OOM notifications, etc, but as long as
> userspace can respond quickly to memory imbalance, it should work
> fine. We're doing something a bit similar using cpusets and fake NUMA
> at Google - the principle of juggling memory between cpusets is the
> same, but the granularity is much worse :-)
>
Yes, the next problem is adding interfaces, but we have to investigate
what is essential first.


> On Tue, Jun 3, 2008 at 9:58 PM, KAMEZAWA Hiroyuki
> <[email protected]> wrote:
> > - supported hierarchy_model parameter.
> > Now, no_hierarchy and hardwall_hierarchy is implemented.
>
> Should we try to support hierarchy and non-hierarchy cgroups in the
> same tree? Maybe we should just enforce the restrictions that:
>
> - the hierarchy mode can't be changed on a cgroup if you have children
> or any non-zero usage/limit
> - a cgroup inherits its parent's hierarchy mode.
>
Ah, my patch does that (I think); my explanation was bad.

- a mem cgroup's mode can be changed only on a ROOT node which has no children.
- a child inherits its parent's mode.

Thanks,
-Kame

2008-06-04 09:13:18

by Kamezawa Hiroyuki

Subject: Re: [RFC][PATCH 1/2] memcg: res_counter hierarchy

On Wed, 4 Jun 2008 01:59:31 -0700
"Paul Menage" <[email protected]> wrote:

> On Tue, Jun 3, 2008 at 10:01 PM, KAMEZAWA Hiroyuki
> <[email protected]> wrote:
> > int ret;
> > char *buf, *end;
> > @@ -133,13 +145,101 @@ ssize_t res_counter_write(struct res_cou
> > if (*end != '\0')
> > goto out_free;
> > }
> > - spin_lock_irqsave(&counter->lock, flags);
> > - val = res_counter_member(counter, member);
> > - *val = tmp;
> > - spin_unlock_irqrestore(&counter->lock, flags);
> > - ret = nbytes;
> > + if (set_strategy) {
> > + ret = set_strategy(res, tmp, member);
> > + if (!ret)
> > + ret = nbytes;
> > + } else {
> > + spin_lock_irqsave(&counter->lock, flags);
> > + val = res_counter_member(counter, member);
> > + *val = tmp;
> > + spin_unlock_irqrestore(&counter->lock, flags);
> > + ret = nbytes;
> > + }
>
> I think that the hierarchy/reclaim handling that you currently have in
> the memory controller should be here; the memory controller should
> just be able to pass a reference to try_to_free_mem_cgroup_pages() and
> have everything else handled by res_counter.
>
Sounds reasonable. I'll redesign the whole thing as much as possible. I think I can do more.

Thanks,
-Kame

2008-06-04 09:16:05

by Paul Menage

Subject: Re: [RFC][PATCH 0/2] memcg: hierarchy support (v3)

On Wed, Jun 4, 2008 at 2:15 AM, KAMEZAWA Hiroyuki
<[email protected]> wrote:
>> Should we try to support hierarchy and non-hierarchy cgroups in the
>> same tree? Maybe we should just enforce the restrictions that:
>>
>> - the hierarchy mode can't be changed on a cgroup if you have children
>> or any non-zero usage/limit
>> - a cgroup inherits its parent's hierarchy mode.
>>
> Ah, my patch does that (I think); my explanation was bad.
>
> - a mem cgroup's mode can be changed only on a ROOT node which has no children.
> - a child inherits its parent's mode.

But if it can only be changed for the root cgroup when it has no
children, that implies that all cgroups must have the same mode. I'm
suggesting that we allow non-root cgroups to change their mode, as
long as:

- they have no children

- they don't have any limit charged to their parent (which means that
either they have a zero limit, or they have no parent, or they're not
in hierarchy mode)

Paul
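The two conditions above can be written as a small predicate. This is just an illustrative model (the struct and field names below are invented, not the actual memcg code):

```c
#include <stdbool.h>
#include <stddef.h>

/* Illustrative model of a cgroup for the mode-change rule above. */
struct group {
	struct group *parent;
	int nr_children;
	bool hierarchy_mode;		/* currently in hardwall mode? */
	unsigned long long limit;	/* limit charged to the parent */
};

/* A group may switch modes only if it has no children and no limit
 * currently charged to its parent (zero limit, no parent, or not in
 * hierarchy mode). */
static bool can_change_mode(const struct group *g)
{
	if (g->nr_children > 0)
		return false;
	if (g->limit == 0 || g->parent == NULL || !g->hierarchy_mode)
		return true;
	return false;
}
```

This is more permissive than a root-only rule: a leaf with a zero limit can flip modes even deep in the tree.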

2008-06-04 09:21:20

by Kamezawa Hiroyuki

Subject: Re: [RFC][PATCH 2/2] memcg: hardwall hierarhcy for memcg

On Wed, 4 Jun 2008 01:59:12 -0700
"Paul Menage" <[email protected]> wrote:

> On Tue, Jun 3, 2008 at 10:03 PM, KAMEZAWA Hiroyuki
> <[email protected]> wrote:
> > @@ -792,6 +798,89 @@ int mem_cgroup_shrink_usage(struct mm_st
> > }
> >
> > /*
> > + * Memory Controller hierarchy support.
> > + */
> > +
> > +/*
> > + * shrink usage to be res->usage + val < res->limit.
> > + */
> > +
> > +int memcg_shrink_val(struct res_counter *cnt, unsigned long long val)
> > +{
> > + struct mem_cgroup *memcg = container_of(cnt, struct mem_cgroup, res);
> > + unsigned long flags;
> > + int ret = 1;
> > + int progress = 1;
> > +
> > +retry:
> > + spin_lock_irqsave(&cnt->lock, flags);
> > + /* Need to shrink ? */
> > + if (cnt->usage + val <= cnt->limit)
> > + ret = 0;
> > + spin_unlock_irqrestore(&cnt->lock, flags);
>
> Can't this logic be in res_counter itself? I.e. the callback can
> assume that some shrinking needs to be done, and should just do it and
> return. The res_counter can handle retrying if necessary.
>
Hmm, OK. Maybe all I have to do is define "what the callback has to do"
and move this check into res_counter.


> > +/*
> > + * For Hard Wall Hierarchy.
> > + */
> > +
> > +int mem_cgroup_resize_callback(struct res_counter *cnt,
> > + unsigned long long val, int what)
> > +{
> > + unsigned long flags, borrow;
> > + unsigned long long diffs;
> > + int ret = 0;
> > +
> > + BUG_ON(what != RES_LIMIT);
> > +
> > + /* Is this under hierarchy ? */
> > + if (!cnt->parent) {
> > + spin_lock_irqsave(&cnt->lock, flags);
> > + cnt->limit = val;
> > + spin_unlock_irqrestore(&cnt->lock, flags);
> > + return 0;
> > + }
> > +
> > + spin_lock_irqsave(&cnt->lock, flags);
> > + if (val > cnt->limit) {
> > + diffs = val - cnt->limit;
> > + borrow = 1;
> > + } else {
> > + diffs = cnt->limit - val;
> > + borrow = 0;
> > + }
> > + spin_unlock_irqrestore(&cnt->lock, flags);
> > +
> > + if (borrow)
> > + ret = res_counter_move_resource(cnt,diffs,
> > + memcg_shrink_val,
> > + MEM_CGROUP_RECLAIM_RETRIES);
> > + else
> > + ret = res_counter_return_resource(cnt, diffs,
> > + memcg_shrink_val,
> > + MEM_CGROUP_RECLAIM_RETRIES);
> > + return ret;
> > +}
>
> Again, a lot of this function seems like generic logic that should be
> in res_counter. The only bit that's memory specific is the
> memcg_shrink_val, and maybe that could just be passed when creating
> the res_counter. Perhaps we should have a res_counter_ops structure
> with operations like "parse" for parsing strings into numbers
> (currently called "write_strategy") and "reclaim" for trying to shrink
> the usage.
>
ok, will try.


> > @@ -896,11 +987,44 @@ static ssize_t mem_cgroup_write(struct c
> > struct file *file, const char __user *userbuf,
> > size_t nbytes, loff_t *ppos)
> > {
> > - return res_counter_write(&mem_cgroup_from_cont(cont)->res,
> > + struct mem_cgroup *memcg = mem_cgroup_from_cont(cont);
> > +
> > + if (cft->private != RES_LIMIT
> > + || !cont->parent
> > + || memcg->hierarchy_model == MEMCG_NO_HIERARCHY)
>
> The res_counter already knows whether it has a parent, so these checks
> shouldn't be necessary.
>
ok, will check in res_counter itself.

> > @@ -1096,6 +1238,12 @@ static void mem_cgroup_destroy(struct cg
> > int node;
> > struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
> >
> > + if (cont->parent &&
> > + mem->hierarchy_model == MEMCG_HARDWALL_HIERARCHY) {
> > + /* we did what we can...just returns what we borrow */
> > + res_counter_return_resource(&mem->res, -1, NULL, 0);
> > + }
> > +
>
> Should we also re-account any remaining child usage to the parent?
>
When this is called, there are no processes in this group. So the
resources remaining at this level are
- file cache
- swap cache (if shared)
- shmem

And the biggest usage will be the file cache.
So, I don't think it's necessary to move the child's usage to the parent
in a hurry. But maybe shmem is worth moving.

I'd like to revisit this when I implement the "usage move at task move"
logic. (Currently, memory usage doesn't move to the new cgroup at task_attach.)

That will help me implement the "move remaining usage to the parent" logic
in a clean way.

Thanks,
-Kame

2008-06-04 09:27:12

by Kamezawa Hiroyuki

Subject: Re: [RFC][PATCH 0/2] memcg: hierarchy support (v3)

On Wed, 4 Jun 2008 02:15:32 -0700
"Paul Menage" <[email protected]> wrote:

> On Wed, Jun 4, 2008 at 2:15 AM, KAMEZAWA Hiroyuki
> <[email protected]> wrote:
> >> Should we try to support hierarchy and non-hierarchy cgroups in the
> >> same tree? Maybe we should just enforce the restrictions that:
> >>
> >> - the hierarchy mode can't be changed on a cgroup if you have children
> >> or any non-zero usage/limit
> >> - a cgroup inherits its parent's hierarchy mode.
> >>
> > Ah, my patch does that (I think); my explanation was bad.
> >
> > - a mem cgroup's mode can be changed only on a ROOT node which has no children.
> > - a child inherits its parent's mode.
>
> But if it can only be changed for the root cgroup when it has no
> children, that implies that all cgroups must have the same mode. I'm
> suggesting that we allow non-root cgroups to change their mode, as
> long as:
>
> - they have no children
>
> - they don't have any limit charged to their parent (which means that
> either they have a zero limit, or they have no parent, or they're not
> in hierarchy mode)
>
Hmm, I got your point. Your suggestion seems reasonable.
I'll try that logic in the next version.

Thanks,
-Kame

2008-06-04 12:33:44

by Daisuke Nishimura

Subject: Re: [RFC][PATCH 2/2] memcg: hardwall hierarhcy for memcg

Hi.

> @@ -848,6 +937,8 @@ static int mem_cgroup_force_empty(struct
> if (mem_cgroup_subsys.disabled)
> return 0;
>
> + memcg_shrink_all(mem);
> +
> css_get(&mem->css);
> /*
> * page reclaim code (kswapd etc..) will move pages between

Shouldn't it be called after verifying that no tasks remain
in this group?

If called via mem_cgroup_pre_destroy, it has already been verified
that no tasks remain, but if called via
mem_force_empty_write, some tasks may remain, and
this means many, many pages are swapped out, doesn't it?


Thanks,
Daisuke Nishimura.

2008-06-04 12:55:12

by Daisuke Nishimura

Subject: Re: [RFC][PATCH 2/2] memcg: hardwall hierarhcy for memcg

> > > @@ -1096,6 +1238,12 @@ static void mem_cgroup_destroy(struct cg
> > > int node;
> > > struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
> > >
> > > + if (cont->parent &&
> > > + mem->hierarchy_model == MEMCG_HARDWALL_HIERARCHY) {
> > > + /* we did what we can...just returns what we borrow */
> > > + res_counter_return_resource(&mem->res, -1, NULL, 0);
> > > + }
> > > +
> >
> > Should we also re-account any remaining child usage to the parent?
> >
> When this is called, there are no process in this group. Then, remaining
> resources in this level is
> - file cache
> - swap cache (if shared)
> - shmem
>
> And the biggest usage will be "file cache".
> So, I don't think it's necessary to move child's usage to the parent,
> in hurry. But maybe shmem is worth to be moved.
>
> I'd like to revisit this when I implements "usage move at task move"
> logic. (currenty, memory usage doesn't move to new cgroup at task_attach.)
>
> It will help me to implement the logic "move remaining usage to the parent"
> in clean way.
>

I agree that "usage move at task move" is needed before
"move remaining usage to the parent".


Thanks,
Daisuke Nishimura.

2008-06-04 23:59:20

by Kamezawa Hiroyuki

Subject: Re: [RFC][PATCH 2/2] memcg: hardwall hierarhcy for memcg

On Wed, 4 Jun 2008 21:32:35 +0900
Daisuke Nishimura <[email protected]> wrote:

> Hi.
>
> > @@ -848,6 +937,8 @@ static int mem_cgroup_force_empty(struct
> > if (mem_cgroup_subsys.disabled)
> > return 0;
> >
> > + memcg_shrink_all(mem);
> > +
> > css_get(&mem->css);
> > /*
> > * page reclaim code (kswapd etc..) will move pages between
>
> Shouldn't it be called after verifying that no tasks remain
> in this group?
>
> If called via mem_cgroup_pre_destroy, it has already been verified
> that no tasks remain, but if called via
> mem_force_empty_write, some tasks may remain, and
> this means many, many pages are swapped out, doesn't it?
>
You're right. I misunderstood where the number of remaining tasks is checked.

Thanks,
-Kame


>
> Thanks,
> Daisuke Nishimura.
>

2008-06-09 09:30:53

by Balbir Singh

Subject: Re: [RFC][PATCH 0/2] memcg: hierarchy support (v3)

KAMEZAWA Hiroyuki wrote:
> Hi, this is third version.
>
> While small changes in codes, the whole _tone_ of code is changed.
> I'm not in hurry, any comments are welcome.
>
> based on 2.6.26-rc2-mm1 + memcg patches in -mm queue.
>

Hi, Kamezawa-San,

Sorry for the delay in responding. As we discussed last time, I'd prefer a
shares-based approach for hierarchical memcg management. I'll review/try these
patches and provide more feedback.


--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL

2008-06-09 09:49:34

by Balbir Singh

Subject: Re: [RFC][PATCH 1/2] memcg: res_counter hierarchy

KAMEZAWA Hiroyuki wrote:
> A simple hard-wall hierarchy support for res_counter.
>
> Changelog v2->v3
> - changed the name and arguments of functions.
> - rewrote to be read easily.
> - named as HardWall hierarchy.
>
> This implements the following model:
> - A cgroup tree represents a hierarchy of resources.
> - All of a child's resource is moved from its parent.
> - The resource moved to children is charged as the parent's usage.
> - The resource moves when child->limit is changed.
> - The sum of the resource for children and the group's own usage is limited by "limit".
>
> This implies
> - No dynamic automatic hierarchy balancing in the kernel.
> - Each resource is isolated completely.
> - The kernel just supports resource-move-at-change-in-limit.
> - The user (middleware) is responsible for keeping the hierarchy well balanced.

We'd definitely like to see a user level tool/application as a demo of how this
can be achieved.

> Good balance can be achieved by changing limit from user land.
>
>
> Background:
> Recently, resource isolation techniques have become widely used,
> i.e. hardware virtualization. We can do hierarchical resource isolation
> by using cgroups on top of it. But supporting hierarchy management in cgroups
> has some advantages in performance, unity, and cost of management.
>
> Other OSes have good resource management; they support some kind of
> hierarchical resource management. We wonder what kind of hierarchy policy
> is good for Linux. And there is another point: a hierarchical system can be
> implemented by kernel and user-land co-operation, so there are various
> choices for what to do in the kernel: do everything in the kernel, or
> export proper interfaces to user-land. Middleware tends to be used for
> management. I hope there will be an Open Source one.
>
> In supporting hierarchy in cgroups, several characteristics of a
> hierarchy policy can be considered. Some users need automatic balancing
> between several groups.
>
> - fairness ... how fairness is kept under the policy
>
> - performance ... should be _fast_. multi-level resource balancing tends
> to use a large amount of CPU and can cause soft lockups.
>
> - predictability ... resource management is usually used for resource
> isolation. the kernel must not break the isolation, or users'
> ability to predict an application's progress.
>
> - flexibility ... some sophisticated dynamic resource balancing with
> soft limits is welcome when the user doesn't want strict
> resource isolation or cannot estimate correctly how much
> they want.

Soft limits have been on my plate for a while now. I'll take a crack at it. At
the moment the statistics are a bit of a worry, since users/administrators need
good statistics to take further action.

>
> Hard Wall Hierarchy.
>
> This patch implements a hard-wall model of hierarchy for resources.
> Works well for users who want strict resource isolation.
>
> This model allows moving resources only between a parent and its children.
> A resource is moved to a child when the child declares the amount of
> resources to be used (via its limit).

The other reason for preferring a shares-based approach is that it will be
more in line with the CPU controller's interfaces.


> Automatic resource balancing is not supported in this code.
> (But users can do non-automatic by changing limit dynamically.)
>
> - fairness ... good. no resource sharing. works as specified by users.
> - performance ... good. each resource is encapsulated at its own level.
> - predictability ... good. resources are completely isolated. balancing only
> occurs at the event of changes in limit.
> - flexibility ... bad. no flexibility and scheduling at the kernel level.
> need middleware's help.
>
> Considerations:
> - This implementation uses "limit" == "current_available_resource".
> This should be revisited when Soft-Limit one is implemented.
>
> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
>
> ---
> Documentation/controllers/resource_counter.txt | 41 +++++++++
> include/linux/res_counter.h | 90 +++++++++++++++++++-
> kernel/res_counter.c | 112 +++++++++++++++++++++++--
> 3 files changed, 235 insertions(+), 8 deletions(-)
>
> Index: temp-2.6.26-rc2-mm1/include/linux/res_counter.h
> ===================================================================
> --- temp-2.6.26-rc2-mm1.orig/include/linux/res_counter.h
> +++ temp-2.6.26-rc2-mm1/include/linux/res_counter.h
> @@ -38,6 +38,16 @@ struct res_counter {
> * the number of unsuccessful attempts to consume the resource
> */
> unsigned long long failcnt;
> +
> + /*
> + * hierarchy support: the parent of this resource.
> + */
> + struct res_counter *parent;
> + /*
> + * the amount of resources assigned to children.
> + */
> + unsigned long long for_children;
> +

I would prefer a better name: lent_out? reserved_for_children?
borrowed_by_children?

> /*
> * the lock to protect all of the above.
> * the routines below consider this to be IRQ-safe
> @@ -63,9 +73,20 @@ u64 res_counter_read_u64(struct res_coun
> ssize_t res_counter_read(struct res_counter *counter, int member,
> const char __user *buf, size_t nbytes, loff_t *pos,
> int (*read_strategy)(unsigned long long val, char *s));
> +
> +/*
> + * An interface for setting res_counter's member (ex. limit)
> + * the new parameter is passed by *buf and translated by write_strategy().
> + * Then, it is applied to member under the control of set_strategy().
> + * If write_strategy() and set_strategy() can be NULL. see res_counter.c
> + */
> +
> ssize_t res_counter_write(struct res_counter *counter, int member,
> - const char __user *buf, size_t nbytes, loff_t *pos,
> - int (*write_strategy)(char *buf, unsigned long long *val));
> + const char __user *buf, size_t nbytes, loff_t *pos,
> + int (*write_strategy)(char *buf, unsigned long long *val),
> + int (*set_strategy)(struct res_counter *res, unsigned long long val,
> + int what));
>
> /*
> * the field descriptors. one for each member of res_counter
> @@ -76,15 +97,33 @@ enum {
> RES_MAX_USAGE,
> RES_LIMIT,
> RES_FAILCNT,
> + RES_FOR_CHILDREN,

RES_BORROWED? RES_BORROWED_BY_CHILDREN?

> };
>
> /*
> * helpers for accounting
> */
>
> +/*
> + * initialize res_counter.
> + * @counter : the counter
> + *
> + * initialize res_counter and set default limit to very big value(unlimited)
> + */
> +
> void res_counter_init(struct res_counter *counter);
>
> /*
> + * initialize res_counter under hierarchy.
> + * @counter : the counter
> + * @parent : the parent of the counter
> + *
> + * initialize res_counter and set default limit to 0. and set "parent".
> + */
> +void res_counter_init_hierarchy(struct res_counter *counter,
> + struct res_counter *parent);
> +
> +/*
> * charge - try to consume more resource.
> *
> * @counter: the counter
> @@ -153,4 +192,51 @@ static inline void res_counter_reset_fai
> cnt->failcnt = 0;
> spin_unlock_irqrestore(&cnt->lock, flags);
> }
> +
> +/**
> + * Move resources from a parent to a child.
> + * At success,
> + * parent->usage += val.
> + * parent->for_children += val.
> + * child->limit += val.
> + *
> + * @child: an entity to set res->limit. The parent is child->parent.
> + * @val: the amount of resource to be moved.
> + * @callback: called when the parent's free resource is not enough to be moved.
> + * this can be NULL if no callback is necessary.
> + * @retry: limit for the number of trying to callback.
> + * -1 means infinite loop. At each retry, yield() is called.
> + * Returns 0 at success, !0 at failure.
> + *
> + * The callback returns 0 at success, !0 at failure.
> + *
> + */
> +
> +int res_counter_move_resource(struct res_counter *child,
> + unsigned long long val,
> + int (*callback)(struct res_counter *res, unsigned long long val),
> + int retry);
> +
> +
> +/**
> + * Return resource to its parent.
> + * At success,
> + * parent->usage -= val.
> + * parent->for_children -= val.
> + * child->limit -= val.
> + *
> + * @child: entry to resize. The parent is child->parent.
> + * @val : How much does child repay to parent ? -1 means 'all'
> + * @callback: A callback for decreasing resource usage of child before
> + * returning. If NULL, just deceases child's limit.
> + * @retry: # of retries at calling callback for freeing resource.
> + * -1 means infinite loop. At each retry, yield() is called.
> + * Returns 0 at success.
> + */
> +
> +int res_counter_return_resource(struct res_counter *child,
> + unsigned long long val,
> + int (*callback)(struct res_counter *res, unsigned long long val),
> + int retry);
> +
> #endif
> Index: temp-2.6.26-rc2-mm1/Documentation/controllers/resource_counter.txt
> ===================================================================
> --- temp-2.6.26-rc2-mm1.orig/Documentation/controllers/resource_counter.txt
> +++ temp-2.6.26-rc2-mm1/Documentation/controllers/resource_counter.txt
> @@ -44,6 +44,13 @@ to work with it.
> Protects changes of the above values.
>
>
> + f. struct res_counter *parent
> +
> + Parent res_counter under hierarchy.
> +
> + g. unsigned long long for_children
> +
> + Resources assigned to children. This is included in usage.
>
> 2. Basic accounting routines
>
> @@ -179,3 +186,37 @@ counter fields. They are recommended to
> still can help with it).
>
> c. Compile and run :)
> +
> +
> +6. Hierarchy
> + a. No Hierarchy
> + each cgroup can use its own private resource.
> +
> + b. Hard-wall Hierarchy
> + A simple hierarchical tree system for resource isolation.
> + Allows moving resources only between a parent and its children.
> + A parent can move its resource to children and remember the amount to
> + for_children member. A child can get new resource only from its parent.
> + Limit of a child is the amount of resource which is moved from its parent.
> +

OK, after reading this I am totally sure I want a shares-based interface. Limits
are not shared like this.

A child and a parent should both be capable of having a limit of 1G, but they
could use different share factors to govern how much each child will get.
Doing it this way breaks limit semantics.


> + When add "val" to a child,
> + parent->usage += val
> + parent->for_children += val
> + child->limit += val
> + When a child returns its resource
> + parent->usage -= val
> + parent->for_children -= val
> + child->limit -= val.
> +
> + This implements resource isolation among each group. This works very well
> + when you want to use strict resource isolation.
> +
> + Usage Hint:
> + This seems for static resource assignment but dynamic resource re-assignment
> + can be done by resetting "limit" of groups. When you consider "limit" as
> + the amount of allowed _current_ resource, a sophisticated resource management
> + system based on strict resource isolation can be implemented.
> +
> +c. Soft-wall Hierarchy
> + TBD.
> +
> Index: temp-2.6.26-rc2-mm1/kernel/res_counter.c
> ===================================================================
> --- temp-2.6.26-rc2-mm1.orig/kernel/res_counter.c
> +++ temp-2.6.26-rc2-mm1/kernel/res_counter.c
> @@ -20,6 +20,14 @@ void res_counter_init(struct res_counter
> counter->limit = (unsigned long long)LLONG_MAX;
> }
>
> +void res_counter_init_hierarchy(struct res_counter *counter,
> + struct res_counter *parent)
> +{
> + spin_lock_init(&counter->lock);
> + counter->limit = 0;
> + counter->parent = parent;
> +}
> +
> int res_counter_charge_locked(struct res_counter *counter, unsigned long val)
> {
> if (counter->usage + val > counter->limit) {
> @@ -74,6 +82,8 @@ res_counter_member(struct res_counter *c
> return &counter->limit;
> case RES_FAILCNT:
> return &counter->failcnt;
> + case RES_FOR_CHILDREN:
> + return &counter->for_children;
> };
>
> BUG();
> @@ -104,7 +114,9 @@ u64 res_counter_read_u64(struct res_coun
>
> ssize_t res_counter_write(struct res_counter *counter, int member,
> const char __user *userbuf, size_t nbytes, loff_t *pos,
> - int (*write_strategy)(char *st_buf, unsigned long long *val))
> + int (*write_strategy)(char *st_buf, unsigned long long *val),
> + int (*set_strategy)(struct res_counter *res,
> + unsigned long long val, int what))
> {
> int ret;
> char *buf, *end;
> @@ -133,13 +145,101 @@ ssize_t res_counter_write(struct res_cou
> if (*end != '\0')
> goto out_free;
> }
> - spin_lock_irqsave(&counter->lock, flags);
> - val = res_counter_member(counter, member);
> - *val = tmp;
> - spin_unlock_irqrestore(&counter->lock, flags);
> - ret = nbytes;
> + if (set_strategy) {
> + ret = set_strategy(res, tmp, member);


I'm afraid I don't understand set_strategy and its purpose.

> + if (!ret)
> + ret = nbytes;
> + } else {
> + spin_lock_irqsave(&counter->lock, flags);
> + val = res_counter_member(counter, member);
> + *val = tmp;
> + spin_unlock_irqrestore(&counter->lock, flags);
> + ret = nbytes;
> + }
> out_free:
> kfree(buf);
> out:
> return ret;
> }
> +
> +
> +int res_counter_move_resource(struct res_counter *child,
> + unsigned long long val,
> + int (*callback)(struct res_counter *res, unsigned long long val),
> + int retry)
> +{
> + struct res_counter *parent = child->parent;
> + unsigned long flags;
> +
> + BUG_ON(!parent);
> +
> + while (1) {
> + spin_lock_irqsave(&parent->lock, flags);
> + if (parent->usage + val < parent->limit) {
> + parent->for_children += val;
> + parent->usage += val;
> + break;
> + }
> + spin_unlock_irqrestore(&parent->lock, flags);
> +
> + if (!retry || !callback)
> + goto fail;
> + /* -1 means infinite loop */
> + if (retry != -1)
> + --retry;

I don't like the idea of spinning in an infinite loop; I would prefer to fail
instead of burning CPU cycles.

> + yield();
> + callback(parent, val);

This code is not very understandable. Why do we yield before callback?

> + }
> + spin_unlock_irqrestore(&parent->lock, flags);
> +
> + spin_lock_irqsave(&child->lock, flags);
> + child->limit += val;
> + spin_unlock_irqrestore(&child->lock, flags);
> + return 0;
> +fail:
> + return 1;
> +}
> +
> +
> +int res_counter_return_resource(struct res_counter *child,
> + unsigned long long val,
> + int (*callback)(struct res_counter *res, unsigned long long val),
> + int retry)
> +{
> + unsigned long flags;
> + struct res_counter *parent = child->parent;
> +
> + BUG_ON(!parent);
> +
> + while (1) {
> + spin_lock_irqsave(&child->lock, flags);
> + if (val == (unsigned long long) -1) {
> + val = child->limit;
> + child->limit = 0;
> + break;
> + } else if (child->usage <= child->limit - val) {
> + child->limit -= val;
> + break;
> + }
> + spin_unlock_irqrestore(&child->lock, flags);
> +
> + if (!retry)
> + goto fail;
> + /* -1 means infinite loop */
> + if (retry != -1)
> + --retry;
> + yield();
> + callback(parent, val);

Ditto comments as above.

> + }
> + spin_unlock_irqrestore(&child->lock, flags);
> +
> + spin_lock_irqsave(&parent->lock, flags);
> + BUG_ON(parent->for_children < val);
> + BUG_ON(parent->usage < val);
> + parent->for_children -= val;
> + parent->usage -= val;
> + spin_unlock_irqrestore(&parent->lock, flags);
> + return 0;
> +fail:
> + return 1;
> +}
>

--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
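The hard-wall accounting quoted above (move and return between parent and child) can be modeled in a few lines of user-space C without the locking. The names below are illustrative, not the patch's actual code, and the example replays the 1G/300M/500M scenario from the cover letter.

```c
#include <assert.h>
#include <stddef.h>

struct counter {
	unsigned long long usage;
	unsigned long long limit;
	unsigned long long for_children;
	struct counter *parent;
};

/* Move "val" of resource from parent to child, as in the quoted doc:
 * it is charged to the parent and raises the child's limit. */
static int move_resource(struct counter *child, unsigned long long val)
{
	struct counter *parent = child->parent;

	if (!parent || parent->usage + val > parent->limit)
		return 1;	/* parent has no room */
	parent->usage += val;
	parent->for_children += val;
	child->limit += val;
	return 0;
}

/* Return "val" of resource from child to parent. */
static int return_resource(struct counter *child, unsigned long long val)
{
	struct counter *parent = child->parent;

	if (!parent || val > child->limit || child->usage > child->limit - val)
		return 1;	/* child still uses that much */
	child->limit -= val;
	parent->usage -= val;
	parent->for_children -= val;
	return 0;
}
```

The invariant is that parent->usage always includes parent->for_children, so a parent can never hand out more than its own limit.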

2008-06-09 09:50:14

by Kamezawa Hiroyuki

Subject: Re: [RFC][PATCH 0/2] memcg: hierarchy support (v3)

On Mon, 09 Jun 2008 15:00:22 +0530
Balbir Singh <[email protected]> wrote:

> KAMEZAWA Hiroyuki wrote:
> > Hi, this is third version.
> >
> > While small changes in codes, the whole _tone_ of code is changed.
> > I'm not in hurry, any comments are welcome.
> >
> > based on 2.6.26-rc2-mm1 + memcg patches in -mm queue.
> >
>
> Hi, Kamezawa-San,
>
> Sorry for the delay in responding. Like we discussed last time, I'd prefer a
> shares based approach for hierarchial memcg management. I'll review/try these
> patches and provide more feedback.
>
Hi,

I'm now totally re-arranging patches, so just see concepts.

In previous e-mail, I thought that there was a difference between 'your share'
and 'my share'. So, please explain again ?

My 'share' has the following characteristics:

- works as a soft limit, not a hard limit.
- no limit is applied when there is no high memory pressure.
- resource usage is kept proportionally fair to each group's share (priority)
under memory pressure.

If you want to work on this, I can put it aside for a while and do other important
patches, like background reclaim, the mlock limiter, guarantees, etc., because
hierarchy is not very high on my priority list (but it seemed better to do this
before other misc work, so I did).

Anyway, we have to test the new LRU (RvR LRU) first in the next -mm ;)

Thanks,
-Kame

2008-06-09 10:14:52

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [RFC][PATCH 1/2] memcg: res_counter hierarchy

On Mon, 09 Jun 2008 15:18:47 +0530
Balbir Singh <[email protected]> wrote:

> KAMEZAWA Hiroyuki wrote:
> > A simple hard-wall hierarhcy support for res_counter.
> >
> > Changelog v2->v3
> > - changed the name and arguments of functions.
> > - rewrote to be read easily.
> > - named as HardWall hierarchy.
> >
> > This implements following model
> > - A cgroup's tree means hierarchy of resource.
> > - All child's resource is moved from its parents.
> > - The resource moved to children is charged as parent's usage.
> > - The resource moves when child->limit is changed.
> > - The sum of resource for children and its own usage is limited by "limit".
> >
> > This implies
> > - No dynamic automatic hierarhcy balancing in the kernel.
> > - Each resource is isolated completely.
> > - The kernel just supports resource-move-at-change-in-limit.
> > - The user (middle-ware) is responsible to make hierarhcy balanced well.
>
> We'd definitely like to see a user level tool/application as a demo of how this
> can be achieved.
>
I don't have one now; I'll write one when I have time. Do you need it now?
Hmm... maybe I (we) need some more patches first, to implement useful statistics
and a notifier to middleware.



> > Good balance can be achieved by changing limit from user land.
> >
> >
> > Background:
> > Recently, there are popular resource isolation technique widely used,
> > i.e. Hardware-Virtualization. We can do hierarchical resource isolation
> > by using cgroup on it. But supporting hierarchy management in croups
> > has some advantages of performance, unity and costs of management.
> >
> > There are good resource management in other OSs, they support some kind of
> > hierarchical resource management. We wonder what kind of hierarchy policy
> > is good for Linux. And there is an another point. Hierarchical system can be
> > implemented by the kernel and user-land co-operation. So, there are various
> > choices to do in the kernel. Doing all in the kernel or export some proper
> > interfaces to the user-land. Middle-wares are tend to be used for management.
> > I hope there will be Open Source one.
> >
> > At supporting hierarchy in cgroup, several aspects of characteristics of
> > policy of hierarchy can be considered. Some needs automatic balancing
> > between several groups.
> >
> > - fairness ... how fairness is kept under policy
> >
> > - performance ... should be _fast_. multi-level resource balancing tend
> > to use much amount of CPU and can cause soft lockup.
> >
> > - predictability ... resource management are usually used for resource
> > isolation. the kernel must not break the isolation and
> > predictability of users against application's progress.
> >
> > - flexibility ... some sophisticated dynamic resource balancing with
> > soft-limit is welcomed when the user doesn't want strict
> > resource isolation or when the user cannot estimate how much
> > they want correctly.
>
> Soft limits has been on my plate for a while now. I'll take a crack at it. At
> the moment the statistics is a bit of a worry, since users/administrators need
> good statistics to take further action.
>
Yes, the statistics are not sufficient yet.



> >
> > Hard Wall Hierarchy.
> >
> > This patch implements a hard-wall model of hierarchy for resources.
> > Works well for users who want strict resource isolation.
> >
> > This model allows the move of resource only between a parent and its children.
> > The resource is moved to a child when it declares the amount of resources to be
> > used. (by limit)
>
> The other reason for preferring a shares based approach is that, the it will be
> more in line with the CPU controllers interfaces.
>

You have to consider the major difference in the nature of CPU and memory:
we have to reclaim the resource with some feedback among several cgroups.
But OK, if it can be implemented in a simple way,
I have no objections as long as the cost is very low. My concern is only performance
(and maintenance).


> > Index: temp-2.6.26-rc2-mm1/include/linux/res_counter.h
> > ===================================================================
> > --- temp-2.6.26-rc2-mm1.orig/include/linux/res_counter.h
> > +++ temp-2.6.26-rc2-mm1/include/linux/res_counter.h
> > @@ -38,6 +38,16 @@ struct res_counter {
> > * the number of unsuccessful attempts to consume the resource
> > */
> > unsigned long long failcnt;
> > +
> > + /*
> > + * hierarchy support: the parent of this resource.
> > + */
> > + struct res_counter *parent;
> > + /*
> > + * the amount of resources assigned to children.
> > + */
> > + unsigned long long for_children;
> > +
>
> I would prefer to use a better name, lent_out? reserved_for_children?
> borrowed_by_children?
>
ok. use other names.



> > /*
> > * the lock to protect all of the above.
> > * the routines below consider this to be IRQ-safe
> > @@ -63,9 +73,20 @@ u64 res_counter_read_u64(struct res_coun
> > ssize_t res_counter_read(struct res_counter *counter, int member,
> > const char __user *buf, size_t nbytes, loff_t *pos,
> > int (*read_strategy)(unsigned long long val, char *s));
> > +
> > +/*
> > + * An interface for setting res_counter's member (ex. limit)
> > + * the new parameter is passed by *buf and translated by write_strategy().
> > + * Then, it is applied to member under the control of set_strategy().
> > + * If write_strategy() and set_strategy() can be NULL. see res_counter.c
> > + */
> > +
> > ssize_t res_counter_write(struct res_counter *counter, int member,
> > - const char __user *buf, size_t nbytes, loff_t *pos,
> > - int (*write_strategy)(char *buf, unsigned long long *val));
> > + const char __user *buf, size_t nbytes, loff_t *pos,
> > + int (*write_strategy)(char *buf, unsigned long long *val),
> > + int (*set_strategy)(struct res_counter *res, unsigned long long val,
> > + int what),
> > + );
> >
> > /*
> > * the field descriptors. one for each member of res_counter
> > @@ -76,15 +97,33 @@ enum {
> > RES_MAX_USAGE,
> > RES_LIMIT,
> > RES_FAILCNT,
> > + RES_FOR_CHILDREN,
>
> RES_BORROWED? RES_BORROWED_BY_CHILDREN?
>
ok, again.

> > };
> >
> > /*
> > * helpers for accounting
> > */
> >
> > +/*
> > + * initialize res_counter.
> > + * @counter : the counter
> > + *
> > + * initialize res_counter and set default limit to very big value(unlimited)
> > + */
> > +
> > void res_counter_init(struct res_counter *counter);
> >
> > /*
> > + * initialize res_counter under hierarchy.
> > + * @counter : the counter
> > + * @parent : the parent of the counter
> > + *
> > + * initialize res_counter and set default limit to 0. and set "parent".
> > + */
> > +void res_counter_init_hierarchy(struct res_counter *counter,
> > + struct res_counter *parent);
> > +
> > +/*
> > * charge - try to consume more resource.
> > *
> > * @counter: the counter
> > @@ -153,4 +192,51 @@ static inline void res_counter_reset_fai
> > cnt->failcnt = 0;
> > spin_unlock_irqrestore(&cnt->lock, flags);
> > }
> > +
> > +/**
> > + * Move resources from a parent to a child.
> > + * At success,
> > + * parent->usage += val.
> > + * parent->for_children += val.
> > + * child->limit += val.
> > + *
> > + * @child: an entity to set res->limit. The parent is child->parent.
> > + * @val: the amount of resource to be moved.
> > + * @callback: called when the parent's free resource is not enough to be moved.
> > + * this can be NULL if no callback is necessary.
> > + * @retry: limit for the number of trying to callback.
> > + * -1 means infinite loop. At each retry, yield() is called.
> > + * Returns 0 at success, !0 at failure.
> > + *
> > + * The callback returns 0 at success, !0 at failure.
> > + *
> > + */
> > +
> > +int res_counter_move_resource(struct res_counter *child,
> > + unsigned long long val,
> > + int (*callback)(struct res_counter *res, unsigned long long val),
> > + int retry);
> > +
> > +
> > +/**
> > + * Return resource to its parent.
> > + * At success,
> > + * parent->usage -= val.
> > + * parent->for_children -= val.
> > + * child->limit -= val.
> > + *
> > + * @child: entry to resize. The parent is child->parent.
> > + * @val : How much does child repay to parent ? -1 means 'all'
> > + * @callback: A callback for decreasing resource usage of child before
> > + * returning. If NULL, just deceases child's limit.
> > + * @retry: # of retries at calling callback for freeing resource.
> > + * -1 means infinite loop. At each retry, yield() is called.
> > + * Returns 0 at success.
> > + */
> > +
> > +int res_counter_return_resource(struct res_counter *child,
> > + unsigned long long val,
> > + int (*callback)(struct res_counter *res, unsigned long long val),
> > + int retry);
> > +
> > #endif
> > Index: temp-2.6.26-rc2-mm1/Documentation/controllers/resource_counter.txt
> > ===================================================================
> > --- temp-2.6.26-rc2-mm1.orig/Documentation/controllers/resource_counter.txt
> > +++ temp-2.6.26-rc2-mm1/Documentation/controllers/resource_counter.txt
> > @@ -44,6 +44,13 @@ to work with it.
> > Protects changes of the above values.
> >
> >
> > + f. struct res_counter *parent
> > +
> > + Parent res_counter under hierarchy.
> > +
> > + g. unsigned long long for_children
> > +
> > + Resources assigned to children. This is included in usage.
> >
> > 2. Basic accounting routines
> >
> > @@ -179,3 +186,37 @@ counter fields. They are recommended to
> > still can help with it).
> >
> > c. Compile and run :)
> > +
> > +
> > +6. Hierarchy
> > + a. No Hierarchy
> > + each cgroup can use its own private resource.
> > +
> > + b. Hard-wall Hierarhcy
> > + A simple hierarchical tree system for resource isolation.
> > + Allows moving resources only between a parent and its children.
> > + A parent can move its resource to children and remember the amount to
> > + for_children member. A child can get new resource only from its parent.
> > + Limit of a child is the amount of resource which is moved from its parent.
> > +
>
> OK, after reading this I am totally sure I want a shares based interface. Limits
> are not shared like this.
>
> A child and a parent should both be capable of having a limit of 1G, but they
> could use different shares factors to govern, how much each children will get.
> Doing it this way, breaks limit semantics.
>
Not easy to use, from my point of view. Can we use 'share' in a proper way
on a machine without swap?


>
> > + When add "val" to a child,
> > + parent->usage += val
> > + parent->for_children += val
> > + child->limit += val
> > + When a child returns its resource
> > + parent->usage -= val
> > + parent->for_children -= val
> > + child->limit -= val.
> > +
> > + This implements resource isolation among each group. This works very well
> > + when you want to use strict resource isolation.
> > +
> > + Usage Hint:
> > + This seems for static resource assignment but dynamic resource re-assignment
> > + can be done by resetting "limit" of groups. When you consider "limit" as
> > + the amount of allowed _current_ resource, a sophisticated resource management
> > + system based on strict resource isolation can be implemented.
> > +
> > +c. Soft-wall Hierarchy
> > + TBD.
> > +
> > Index: temp-2.6.26-rc2-mm1/kernel/res_counter.c
> > ===================================================================
> > --- temp-2.6.26-rc2-mm1.orig/kernel/res_counter.c
> > +++ temp-2.6.26-rc2-mm1/kernel/res_counter.c
> > @@ -20,6 +20,14 @@ void res_counter_init(struct res_counter
> > counter->limit = (unsigned long long)LLONG_MAX;
> > }
> >
> > +void res_counter_init_hierarchy(struct res_counter *counter,
> > + struct res_counter *parent)
> > +{
> > + spin_lock_init(&counter->lock);
> > + counter->limit = 0;
> > + counter->parent = parent;
> > +}
> > +
> > int res_counter_charge_locked(struct res_counter *counter, unsigned long val)
> > {
> > if (counter->usage + val > counter->limit) {
> > @@ -74,6 +82,8 @@ res_counter_member(struct res_counter *c
> > return &counter->limit;
> > case RES_FAILCNT:
> > return &counter->failcnt;
> > + case RES_FOR_CHILDREN:
> > + return &counter->for_children;
> > };
> >
> > BUG();
> > @@ -104,7 +114,9 @@ u64 res_counter_read_u64(struct res_coun
> >
> > ssize_t res_counter_write(struct res_counter *counter, int member,
> > const char __user *userbuf, size_t nbytes, loff_t *pos,
> > - int (*write_strategy)(char *st_buf, unsigned long long *val))
> > + int (*write_strategy)(char *st_buf, unsigned long long *val),
> > + int (*set_strategy)(struct res_counter *res,
> > + unsigned long long val, int what))
> > {
> > int ret;
> > char *buf, *end;
> > @@ -133,13 +145,101 @@ ssize_t res_counter_write(struct res_cou
> > if (*end != '\0')
> > goto out_free;
> > }
> > - spin_lock_irqsave(&counter->lock, flags);
> > - val = res_counter_member(counter, member);
> > - *val = tmp;
> > - spin_unlock_irqrestore(&counter->lock, flags);
> > - ret = nbytes;
> > + if (set_strategy) {
> > + ret = set_strategy(res, tmp, member);
>
>
> I'm afraid, I don't understand the set_strategy and it's purpose.
>
Sorry. I have now rewritten this and removed it.



> > + if (!ret)
> > + ret = nbytes;
> > + } else {
> > + spin_lock_irqsave(&counter->lock, flags);
> > + val = res_counter_member(counter, member);
> > + *val = tmp;
> > + spin_unlock_irqrestore(&counter->lock, flags);
> > + ret = nbytes;
> > + }
> > out_free:
> > kfree(buf);
> > out:
> > return ret;
> > }
> > +
> > +
> > +int res_counter_move_resource(struct res_counter *child,
> > + unsigned long long val,
> > + int (*callback)(struct res_counter *res, unsigned long long val),
> > + int retry)
> > +{
> > + struct res_counter *parent = child->parent;
> > + unsigned long flags;
> > +
> > + BUG_ON(!parent);
> > +
> > + while (1) {
> > + spin_lock_irqsave(&parent->lock, flags);
> > + if (parent->usage + val < parent->limit) {
> > + parent->for_children += val;
> > + parent->usage += val;
> > + break;
> > + }
> > + spin_unlock_irqrestore(&parent->lock, flags);
> > +
> > + if (!retry || !callback)
> > + goto failed;
> > + /* -1 means infinite loop */
> > + if (retry != -1)
> > + --retry;
>
> I don't like the idea of spinning in an infinite loop, I would prefer to fail
> things instead of burning CPU cycles.
>
OK, I will remove the "-1" case.




> > + yield();
> > + callback(parent, val);
>
> This code is not very understandable. Why do we yield before callback?
>

Calling yield() after callback() would mean that, by the time we recheck, the
res_counter's state could be far different from the state right after the callback.
So we yield before the callback and recheck the res_counter soon after it runs.

Thanks,
-Kame

2008-06-09 10:34:21

by Balbir Singh

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/2] memcg: hierarchy support (v3)

KAMEZAWA Hiroyuki wrote:
> On Mon, 09 Jun 2008 15:00:22 +0530
> Balbir Singh <[email protected]> wrote:
>
>> KAMEZAWA Hiroyuki wrote:
>>> Hi, this is third version.
>>>
>>> While small changes in codes, the whole _tone_ of code is changed.
>>> I'm not in hurry, any comments are welcome.
>>>
>>> based on 2.6.26-rc2-mm1 + memcg patches in -mm queue.
>>>
>> Hi, Kamezawa-San,
>>
>> Sorry for the delay in responding. Like we discussed last time, I'd prefer a
>> shares based approach for hierarchial memcg management. I'll review/try these
>> patches and provide more feedback.
>>
> Hi,
>
> I'm now totally re-arranging patches, so just see concepts.
>
> In previous e-mail, I thought that there was a difference between 'your share'
> and 'my share'. So, please explain again ?
>
> My 'share' has following characteristics.
>
> - work as soft-limit. not hard-limit.
> - no limit when there are not high memory pressure.
> - resource usage will be proportionally fair to each group's share (priority)
> under memory pressure.
>

My share is very similar to yours.

A group might have a share of 100% and a hard limit of 1G. In this case the hard
limit applies if the system has more than 1G of memory. I think of hard limit as
the final controlling factor and shares are suggestive.

Yes, my shares also have the same factors, but can be overridden by hard limits.


> If you want to work on this, I can stop this for a while and do other important
> patches, like background reclaim, mlock limitter, guarantee, etc.. because my
> priority to hierarchy is not very high (but it seems better to do this before
> other misc works, so I did.).
>

I do, but I don't want to stop you from doing it. The mlock limiter is definitely
important, along with some control for large pages. Hierarchy is definitely
important, since we cannot add other major functionality without first solving
this problem. After that, high on my list are:

1. Soft limits
2. Performance/space trade-offs

> Anyway, we have to test the new LRU (RvR LRU) at first in the next -mm ;)

Yes :) I just saw that going in


--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL

2008-06-09 10:38:30

by Balbir Singh

[permalink] [raw]
Subject: Re: [RFC][PATCH 1/2] memcg: res_counter hierarchy

KAMEZAWA Hiroyuki wrote:
> On Mon, 09 Jun 2008 15:18:47 +0530
> Balbir Singh <[email protected]> wrote:
>
>> KAMEZAWA Hiroyuki wrote:
>>> A simple hard-wall hierarhcy support for res_counter.
>>>
>>> Changelog v2->v3
>>> - changed the name and arguments of functions.
>>> - rewrote to be read easily.
>>> - named as HardWall hierarchy.
>>>
>>> This implements following model
>>> - A cgroup's tree means hierarchy of resource.
>>> - All child's resource is moved from its parents.
>>> - The resource moved to children is charged as parent's usage.
>>> - The resource moves when child->limit is changed.
>>> - The sum of resource for children and its own usage is limited by "limit".
>>>
>>> This implies
>>> - No dynamic automatic hierarhcy balancing in the kernel.
>>> - Each resource is isolated completely.
>>> - The kernel just supports resource-move-at-change-in-limit.
>>> - The user (middle-ware) is responsible to make hierarhcy balanced well.
>> We'd definitely like to see a user level tool/application as a demo of how this
>> can be achieved.
>>
> I don't have one, now. I'll write one when I have time. Need now ?
> Hmm...maybe I(we) need some more patches to implement useful statistics,
> notifier to middlewares.
>

Yes, we need more useful statistics.

>
>
>>> Good balance can be achieved by changing limit from user land.
>>>
>>>
>>> Background:
>>> Recently, there are popular resource isolation technique widely used,
>>> i.e. Hardware-Virtualization. We can do hierarchical resource isolation
>>> by using cgroup on it. But supporting hierarchy management in croups
>>> has some advantages of performance, unity and costs of management.
>>>
>>> There are good resource management in other OSs, they support some kind of
>>> hierarchical resource management. We wonder what kind of hierarchy policy
>>> is good for Linux. And there is an another point. Hierarchical system can be
>>> implemented by the kernel and user-land co-operation. So, there are various
>>> choices to do in the kernel. Doing all in the kernel or export some proper
>>> interfaces to the user-land. Middle-wares are tend to be used for management.
>>> I hope there will be Open Source one.
>>>
>>> At supporting hierarchy in cgroup, several aspects of characteristics of
>>> policy of hierarchy can be considered. Some needs automatic balancing
>>> between several groups.
>>>
>>> - fairness ... how fairness is kept under policy
>>>
>>> - performance ... should be _fast_. multi-level resource balancing tend
>>> to use much amount of CPU and can cause soft lockup.
>>>
>>> - predictability ... resource management are usually used for resource
>>> isolation. the kernel must not break the isolation and
>>> predictability of users against application's progress.
>>>
>>> - flexibility ... some sophisticated dynamic resource balancing with
>>> soft-limit is welcomed when the user doesn't want strict
>>> resource isolation or when the user cannot estimate how much
>>> they want correctly.
>> Soft limits has been on my plate for a while now. I'll take a crack at it. At
>> the moment the statistics is a bit of a worry, since users/administrators need
>> good statistics to take further action.
>>
> Yes, statistics is not enough now.
>
>
>
>>> Hard Wall Hierarchy.
>>>
>>> This patch implements a hard-wall model of hierarchy for resources.
>>> Works well for users who want strict resource isolation.
>>>
>>> This model allows the move of resource only between a parent and its children.
>>> The resource is moved to a child when it declares the amount of resources to be
>>> used. (by limit)
>> The other reason for preferring a shares based approach is that, the it will be
>> more in line with the CPU controllers interfaces.
>>
>
> You have to think of the major difference of tha nature of CPU and Memory.
> We have to reclaim the resource with some feedbacks among sevral cgroups.
> But ok, if it's can be implemented in simple way.
> I have no objections if cost is very low. My concern is only performance.
> (and maintenance)
>

True, I don't see hierarchy as adding too much additional cost.

>
>>> Index: temp-2.6.26-rc2-mm1/include/linux/res_counter.h
>>> ===================================================================
>>> --- temp-2.6.26-rc2-mm1.orig/include/linux/res_counter.h
>>> +++ temp-2.6.26-rc2-mm1/include/linux/res_counter.h
>>> @@ -38,6 +38,16 @@ struct res_counter {
>>> * the number of unsuccessful attempts to consume the resource
>>> */
>>> unsigned long long failcnt;
>>> +
>>> + /*
>>> + * hierarchy support: the parent of this resource.
>>> + */
>>> + struct res_counter *parent;
>>> + /*
>>> + * the amount of resources assigned to children.
>>> + */
>>> + unsigned long long for_children;
>>> +
>> I would prefer to use a better name, lent_out? reserved_for_children?
>> borrowed_by_children?
>>
> ok. use other names.
>
>
>
>>> /*
>>> * the lock to protect all of the above.
>>> * the routines below consider this to be IRQ-safe
>>> @@ -63,9 +73,20 @@ u64 res_counter_read_u64(struct res_coun
>>> ssize_t res_counter_read(struct res_counter *counter, int member,
>>> const char __user *buf, size_t nbytes, loff_t *pos,
>>> int (*read_strategy)(unsigned long long val, char *s));
>>> +
>>> +/*
>>> + * An interface for setting res_counter's member (ex. limit)
>>> + * the new parameter is passed by *buf and translated by write_strategy().
>>> + * Then, it is applied to member under the control of set_strategy().
>>> + * If write_strategy() and set_strategy() can be NULL. see res_counter.c
>>> + */
>>> +
>>> ssize_t res_counter_write(struct res_counter *counter, int member,
>>> - const char __user *buf, size_t nbytes, loff_t *pos,
>>> - int (*write_strategy)(char *buf, unsigned long long *val));
>>> + const char __user *buf, size_t nbytes, loff_t *pos,
>>> + int (*write_strategy)(char *buf, unsigned long long *val),
>>> + int (*set_strategy)(struct res_counter *res, unsigned long long val,
>>> + int what),
>>> + );
>>>
>>> /*
>>> * the field descriptors. one for each member of res_counter
>>> @@ -76,15 +97,33 @@ enum {
>>> RES_MAX_USAGE,
>>> RES_LIMIT,
>>> RES_FAILCNT,
>>> + RES_FOR_CHILDREN,
>> RES_BORROWED? RES_BORROWED_BY_CHILDREN?
>>
> ok, again.
>
>>> };
>>>
>>> /*
>>> * helpers for accounting
>>> */
>>>
>>> +/*
>>> + * initialize res_counter.
>>> + * @counter : the counter
>>> + *
>>> + * initialize res_counter and set default limit to very big value(unlimited)
>>> + */
>>> +
>>> void res_counter_init(struct res_counter *counter);
>>>
>>> /*
>>> + * initialize res_counter under hierarchy.
>>> + * @counter : the counter
>>> + * @parent : the parent of the counter
>>> + *
>>> + * initialize res_counter and set default limit to 0. and set "parent".
>>> + */
>>> +void res_counter_init_hierarchy(struct res_counter *counter,
>>> + struct res_counter *parent);
>>> +
>>> +/*
>>> * charge - try to consume more resource.
>>> *
>>> * @counter: the counter
>>> @@ -153,4 +192,51 @@ static inline void res_counter_reset_fai
>>> cnt->failcnt = 0;
>>> spin_unlock_irqrestore(&cnt->lock, flags);
>>> }
>>> +
>>> +/**
>>> + * Move resources from a parent to a child.
>>> + * At success,
>>> + * parent->usage += val.
>>> + * parent->for_children += val.
>>> + * child->limit += val.
>>> + *
>>> + * @child: an entity to set res->limit. The parent is child->parent.
>>> + * @val: the amount of resource to be moved.
>>> + * @callback: called when the parent's free resource is not enough to be moved.
>>> + * this can be NULL if no callback is necessary.
>>> + * @retry: limit for the number of trying to callback.
>>> + * -1 means infinite loop. At each retry, yield() is called.
>>> + * Returns 0 at success, !0 at failure.
>>> + *
>>> + * The callback returns 0 at success, !0 at failure.
>>> + *
>>> + */
>>> +
>>> +int res_counter_move_resource(struct res_counter *child,
>>> + unsigned long long val,
>>> + int (*callback)(struct res_counter *res, unsigned long long val),
>>> + int retry);
>>> +
>>> +
>>> +/**
>>> + * Return resource to its parent.
>>> + * At success,
>>> + * parent->usage -= val.
>>> + * parent->for_children -= val.
>>> + * child->limit -= val.
>>> + *
>>> + * @child: entry to resize. The parent is child->parent.
>>> + * @val : How much does child repay to parent ? -1 means 'all'
>>> + * @callback: A callback for decreasing resource usage of child before
>>> + * returning. If NULL, just deceases child's limit.
>>> + * @retry: # of retries at calling callback for freeing resource.
>>> + * -1 means infinite loop. At each retry, yield() is called.
>>> + * Returns 0 at success.
>>> + */
>>> +
>>> +int res_counter_return_resource(struct res_counter *child,
>>> + unsigned long long val,
>>> + int (*callback)(struct res_counter *res, unsigned long long val),
>>> + int retry);
>>> +
>>> #endif
>>> Index: temp-2.6.26-rc2-mm1/Documentation/controllers/resource_counter.txt
>>> ===================================================================
>>> --- temp-2.6.26-rc2-mm1.orig/Documentation/controllers/resource_counter.txt
>>> +++ temp-2.6.26-rc2-mm1/Documentation/controllers/resource_counter.txt
>>> @@ -44,6 +44,13 @@ to work with it.
>>> Protects changes of the above values.
>>>
>>>
>>> + f. struct res_counter *parent
>>> +
>>> + Parent res_counter under hierarchy.
>>> +
>>> + g. unsigned long long for_children
>>> +
>>> + Resources assigned to children. This is included in usage.
>>>
>>> 2. Basic accounting routines
>>>
>>> @@ -179,3 +186,37 @@ counter fields. They are recommended to
>>> still can help with it).
>>>
>>> c. Compile and run :)
>>> +
>>> +
>>> +6. Hierarchy
>>> + a. No Hierarchy
>>> + each cgroup can use its own private resource.
>>> +
>>> + b. Hard-wall Hierarhcy
>>> + A simple hierarchical tree system for resource isolation.
>>> + Allows moving resources only between a parent and its children.
>>> + A parent can move its resource to children and remember the amount to
>>> + for_children member. A child can get new resource only from its parent.
>>> + Limit of a child is the amount of resource which is moved from its parent.
>>> +
>> OK, after reading this I am totally sure I want a shares based interface. Limits
>> are not shared like this.
>>
>> A child and a parent should both be capable of having a limit of 1G, but they
>> could use different shares factors to govern, how much each children will get.
>> Doing it this way, breaks limit semantics.
>>
> Not easy to use in my point of view. Can we use 'share' in proper way
> on no-swap machine ?
>

Not sure I understand your question. Share represents the share of available
resources.

>
>>> + When add "val" to a child,
>>> + parent->usage += val
>>> + parent->for_children += val
>>> + child->limit += val
>>> + When a child returns its resource
>>> + parent->usage -= val
>>> + parent->for_children -= val
>>> + child->limit -= val.
>>> +
>>> + This implements resource isolation among each group. This works very well
>>> + when you want to use strict resource isolation.
>>> +
>>> + Usage Hint:
>>> + This seems for static resource assignment but dynamic resource re-assignment
>>> + can be done by resetting "limit" of groups. When you consider "limit" as
>>> + the amount of allowed _current_ resource, a sophisticated resource management
>>> + system based on strict resource isolation can be implemented.
>>> +
>>> +c. Soft-wall Hierarchy
>>> + TBD.
>>> +
>>> Index: temp-2.6.26-rc2-mm1/kernel/res_counter.c
>>> ===================================================================
>>> --- temp-2.6.26-rc2-mm1.orig/kernel/res_counter.c
>>> +++ temp-2.6.26-rc2-mm1/kernel/res_counter.c
>>> @@ -20,6 +20,14 @@ void res_counter_init(struct res_counter
>>> counter->limit = (unsigned long long)LLONG_MAX;
>>> }
>>>
>>> +void res_counter_init_hierarchy(struct res_counter *counter,
>>> + struct res_counter *parent)
>>> +{
>>> + spin_lock_init(&counter->lock);
>>> + counter->limit = 0;
>>> + counter->parent = parent;
>>> +}
>>> +
>>> int res_counter_charge_locked(struct res_counter *counter, unsigned long val)
>>> {
>>> if (counter->usage + val > counter->limit) {
>>> @@ -74,6 +82,8 @@ res_counter_member(struct res_counter *c
>>> return &counter->limit;
>>> case RES_FAILCNT:
>>> return &counter->failcnt;
>>> + case RES_FOR_CHILDREN:
>>> + return &counter->for_children;
>>> };
>>>
>>> BUG();
>>> @@ -104,7 +114,9 @@ u64 res_counter_read_u64(struct res_coun
>>>
>>> ssize_t res_counter_write(struct res_counter *counter, int member,
>>> const char __user *userbuf, size_t nbytes, loff_t *pos,
>>> - int (*write_strategy)(char *st_buf, unsigned long long *val))
>>> + int (*write_strategy)(char *st_buf, unsigned long long *val),
>>> + int (*set_strategy)(struct res_counter *res,
>>> + unsigned long long val, int what))
>>> {
>>> int ret;
>>> char *buf, *end;
>>> @@ -133,13 +145,101 @@ ssize_t res_counter_write(struct res_cou
>>> if (*end != '\0')
>>> goto out_free;
>>> }
>>> - spin_lock_irqsave(&counter->lock, flags);
>>> - val = res_counter_member(counter, member);
>>> - *val = tmp;
>>> - spin_unlock_irqrestore(&counter->lock, flags);
>>> - ret = nbytes;
>>> + if (set_strategy) {
>>> + ret = set_strategy(res, tmp, member);
>>
>> I'm afraid, I don't understand the set_strategy and it's purpose.
>>
> Sorry. I'm now rewritten and removed this.
>
>

OK

>
>>> + if (!ret)
>>> + ret = nbytes;
>>> + } else {
>>> + spin_lock_irqsave(&counter->lock, flags);
>>> + val = res_counter_member(counter, member);
>>> + *val = tmp;
>>> + spin_unlock_irqrestore(&counter->lock, flags);
>>> + ret = nbytes;
>>> + }
>>> out_free:
>>> kfree(buf);
>>> out:
>>> return ret;
>>> }
>>> +
>>> +
>>> +int res_counter_move_resource(struct res_counter *child,
>>> + unsigned long long val,
>>> + int (*callback)(struct res_counter *res, unsigned long long val),
>>> + int retry)
>>> +{
>>> + struct res_counter *parent = child->parent;
>>> + unsigned long flags;
>>> +
>>> + BUG_ON(!parent);
>>> +
>>> + while (1) {
>>> + spin_lock_irqsave(&parent->lock, flags);
>>> + if (parent->usage + val < parent->limit) {
>>> + parent->for_children += val;
>>> + parent->usage += val;
>>> + break;
>>> + }
>>> + spin_unlock_irqrestore(&parent->lock, flags);
>>> +
>>> + if (!retry || !callback)
>>> + goto failed;
>>> + /* -1 means infinite loop */
>>> + if (retry != -1)
>>> + --retry;
>> I don't like the idea of spinning in an infinite loop, I would prefer to fail
>> things instead of burning CPU cycles.
>>
> ok, will remove "-1" case.
>
>
>
>
>>> + yield();
>>> + callback(parent, val);
>> This code is not very understandable. Why do we yield before callback?
>>
>
> yield() after callback() means that the res_counter's state may be
> far different from the state just after the callback.
> So, we have to yield before the callback and check the res_counter sooner.
>

But does yield() get us any guarantees of seeing the state change?

--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
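The HardWall accounting reviewed in this thread can be sketched in user space. The following is a hypothetical model, not the kernel code: res_counter is reduced to plain fields, locking and the reclaim callback are omitted, and the success check uses `<=` where the patch uses a strict `<`. The names mirror the patch only for readability.

```c
#include <assert.h>

/* Minimal user-space model of the HardWall res_counter (hypothetical;
 * no locking, no reclaim callback). */
struct res_counter {
	unsigned long long usage, limit, for_children;
	struct res_counter *parent;
};

/* Move 'val' of resource from the parent to the child, as
 * res_counter_move_resource() does on success. */
static int move_resource(struct res_counter *child, unsigned long long val)
{
	struct res_counter *parent = child->parent;

	if (parent->usage + val > parent->limit)
		return 1;			/* parent has no room */
	parent->usage += val;
	parent->for_children += val;
	child->limit += val;
	return 0;
}

/* Return 'val' of resource to the parent, as
 * res_counter_return_resource() does on success.
 * Assumes val <= child->limit (unsigned arithmetic). */
static int return_resource(struct res_counter *child, unsigned long long val)
{
	struct res_counter *parent = child->parent;

	if (child->usage > child->limit - val)
		return 1;			/* child still uses the resource */
	child->limit -= val;
	parent->usage -= val;
	parent->for_children -= val;
	return 0;
}
```

With a parent limited to 1000 units, moving 300 to a child raises the parent's usage to 300; returning 200 brings it back to 100 — the same arithmetic as the group A/B/C example in patch 0/2.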

2008-06-09 10:57:15

by Balbir Singh

[permalink] [raw]
Subject: Re: [RFC][PATCH 2/2] memcg: hardwall hierarchy for memcg

KAMEZAWA Hiroyuki wrote:
> Hard-Wall hierarchy support for memcg.
> - new member hierarchy_model is added to memcg.
>
> Only root cgroup can modify this only when there is no children.
>

Sounds like the correct thing to do, and it also maintains compatibility.

> Adds following functions for supporting HARDWALL hierarchy.
> - try to reclaim memory at the change of "limit".
> - try to reclaim all memory at force_empty
> - returns resources to the parent at destroy.
>
> Changelog v2->v3
> - added documentation.
> - hierarchy_model parameter is added.
>
>
> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
>
> ---
> Documentation/controllers/memory.txt | 27 +++++-
> mm/memcontrol.c | 156 ++++++++++++++++++++++++++++++++++-
> 2 files changed, 178 insertions(+), 5 deletions(-)
>
> Index: temp-2.6.26-rc2-mm1/mm/memcontrol.c
> ===================================================================
> --- temp-2.6.26-rc2-mm1.orig/mm/memcontrol.c
> +++ temp-2.6.26-rc2-mm1/mm/memcontrol.c
> @@ -137,6 +137,8 @@ struct mem_cgroup {
> struct mem_cgroup_lru_info info;
>
> int prev_priority; /* for recording reclaim priority */
> +
> + int hierarchy_model; /* used hierarchical policy */

Could we enumerate what these policies are?

> /*
> * statistics.
> */
> @@ -144,6 +146,10 @@ struct mem_cgroup {
> };
> static struct mem_cgroup init_mem_cgroup;
>
> +
> +#define MEMCG_NO_HIERARCHY (0)

Nice to indent with the #define below

> +#define MEMCG_HARDWALL_HIERARCHY (1)
> +
> /*
> * We use the lower bit of the page->page_cgroup pointer as a bit spin
> * lock. We need to ensure that page->page_cgroup is at least two
> @@ -792,6 +798,89 @@ int mem_cgroup_shrink_usage(struct mm_st
> }
>
> /*
> + * Memory Controller hierarchy support.
> + */
> +
> +/*
> + * shrink usage to be res->usage + val < res->limit.
> + */
> +
> +int memcg_shrink_val(struct res_counter *cnt, unsigned long long val)
> +{
> + struct mem_cgroup *memcg = container_of(cnt, struct mem_cgroup, res);
> + unsigned long flags;
> + int ret = 1;
> + int progress = 1;
> +
> +retry:
> + spin_lock_irqsave(&cnt->lock, flags);
> + /* Need to shrink ? */
> + if (cnt->usage + val <= cnt->limit)
> + ret = 0;
> + spin_unlock_irqrestore(&cnt->lock, flags);
> +

We have res_counter_check_under_limit(), may be we could re-use that here by
adding parameters.

> + if (!ret)
> + return 0;
> +
> + if (!progress)
> + return 1;
> + progress = try_to_free_mem_cgroup_pages(memcg, GFP_KERNEL);
> +
> + goto retry;
> +}
> +
> +/*
> + * For Hard Wall Hierarchy.
> + */
> +
> +int mem_cgroup_resize_callback(struct res_counter *cnt,
> + unsigned long long val, int what)
> +{
> + unsigned long flags, borrow;
> + unsigned long long diffs;
> + int ret = 0;
> +
> + BUG_ON(what != RES_LIMIT);
> +
> + /* Is this under hierarchy ? */
> + if (!cnt->parent) {
> + spin_lock_irqsave(&cnt->lock, flags);
> + cnt->limit = val;

I know callback is called from the two functions specified in patch 1/2 (move
and return resource). I don't understand why it is OK to force the limit to be
val, if it is the root node?

> + spin_unlock_irqrestore(&cnt->lock, flags);
> + return 0;
> + }
> +
> + spin_lock_irqsave(&cnt->lock, flags);
> + if (val > cnt->limit) {
> + diffs = val - cnt->limit;
> + borrow = 1;
> + } else {
> + diffs = cnt->limit - val;
> + borrow = 0;
> + }
> + spin_unlock_irqrestore(&cnt->lock, flags);
> +
> + if (borrow)
> + ret = res_counter_move_resource(cnt, diffs,
> + memcg_shrink_val,
> + MEM_CGROUP_RECLAIM_RETRIES);
> + else
> + ret = res_counter_return_resource(cnt, diffs,
> + memcg_shrink_val,
> + MEM_CGROUP_RECLAIM_RETRIES);
> + return ret;
> +}
> +
> +
> +void memcg_shrink_all(struct mem_cgroup *mem)
> +{

memcg_shrink_to_limit?

> + unsigned long long val;
> +
> + val = res_counter_read_u64(&mem->res, RES_LIMIT);
> + memcg_shrink_val(&mem->res, val);
> +}
> +
> +/*
> * This routine traverses page_cgroup in the given list and drops them all.
> * *And* this routine doesn't reclaim the page itself, just removes page_cgroup.
> */
> @@ -848,6 +937,8 @@ static int mem_cgroup_force_empty(struct
> if (mem_cgroup_subsys.disabled)
> return 0;
>
> + memcg_shrink_all(mem);
> +
> css_get(&mem->css);
> /*
> * page reclaim code (kswapd etc..) will move pages between
> @@ -896,11 +987,44 @@ static ssize_t mem_cgroup_write(struct c
> struct file *file, const char __user *userbuf,
> size_t nbytes, loff_t *ppos)
> {
> - return res_counter_write(&mem_cgroup_from_cont(cont)->res,
> + struct mem_cgroup *memcg = mem_cgroup_from_cont(cont);
> +
> + if (cft->private != RES_LIMIT
> + || !cont->parent
> + || memcg->hierarchy_model == MEMCG_NO_HIERARCHY)
> + return res_counter_write(&memcg->res, cft->private, userbuf,
> + nbytes, ppos, mem_cgroup_write_strategy, NULL);
> + else
> + return res_counter_write(&memcg->res,
> cft->private, userbuf, nbytes, ppos,
> - mem_cgroup_write_strategy);
> + mem_cgroup_write_strategy,
> + mem_cgroup_resize_callback);
> +}
> +
> +
> +static u64 mem_cgroup_read_hierarchy(struct cgroup *cgrp, struct cftype *cft)
> +{
> + struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
> + return memcg->hierarchy_model;
> +}
> +
> +static int mem_cgroup_write_hierarchy(struct cgroup *cgrp, struct cftype *cft,
> + u64 val)
> +{
> + struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
> + /* changing the policy is allowed only for the ROOT cgroup with no children */
> + if (cgrp->parent)
> + return -EINVAL;
> + if (!list_empty(&cgrp->children))
> + return -EINVAL;
> + if (val == 0 || val == 1) {
> + memcg->hierarchy_model = val;
> + return 0;
> + }
> + return -EINVAL;
> }
>
> +
> static int mem_cgroup_reset(struct cgroup *cont, unsigned int event)
> {
> struct mem_cgroup *mem;
> @@ -992,6 +1116,16 @@ static struct cftype mem_cgroup_files[]
> .name = "stat",
> .read_map = mem_control_stat_show,
> },
> + {
> + .name = "hierarchy_model",
> + .read_u64 = mem_cgroup_read_hierarchy,
> + .write_u64 = mem_cgroup_write_hierarchy,
> + },
> + {
> + .name = "assigned_to_child",
> + .private = RES_FOR_CHILDREN,
> + .read_u64 = mem_cgroup_read,
> + },
> };
>
> static int alloc_mem_cgroup_per_zone_info(struct mem_cgroup *mem, int node)
> @@ -1056,19 +1190,27 @@ static void mem_cgroup_free(struct mem_c
> static struct cgroup_subsys_state *
> mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
> {
> - struct mem_cgroup *mem;
> + struct mem_cgroup *mem, *parent;
> int node;
>
> if (unlikely((cont->parent) == NULL)) {
> mem = &init_mem_cgroup;
> page_cgroup_cache = KMEM_CACHE(page_cgroup, SLAB_PANIC);
> + parent = NULL;
> } else {
> mem = mem_cgroup_alloc();
> if (!mem)
> return ERR_PTR(-ENOMEM);
> + parent = mem_cgroup_from_cont(cont->parent);
> }
>
> - res_counter_init(&mem->res);
> + if (!parent || parent->hierarchy_model == MEMCG_NO_HIERARCHY) {
> + res_counter_init(&mem->res);
> + mem->hierarchy_model = MEMCG_NO_HIERARCHY;
> + } else {
> + res_counter_init_hierarchy(&mem->res, &parent->res);
> + mem->hierarchy_model = parent->hierarchy_model;
> + }
>
> for_each_node_state(node, N_POSSIBLE)
> if (alloc_mem_cgroup_per_zone_info(mem, node))
> @@ -1096,6 +1238,12 @@ static void mem_cgroup_destroy(struct cg
> int node;
> struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
>
> + if (cont->parent &&
> + mem->hierarchy_model == MEMCG_HARDWALL_HIERARCHY) {
> + /* we did what we could... just return what we borrowed */
> + res_counter_return_resource(&mem->res, -1, NULL, 0);
> + }
> +
> for_each_node_state(node, N_POSSIBLE)
> free_mem_cgroup_per_zone_info(mem, node);
>
> Index: temp-2.6.26-rc2-mm1/Documentation/controllers/memory.txt
> ===================================================================
> --- temp-2.6.26-rc2-mm1.orig/Documentation/controllers/memory.txt
> +++ temp-2.6.26-rc2-mm1/Documentation/controllers/memory.txt
> @@ -237,12 +237,37 @@ cgroup might have some charge associated
> tasks have migrated away from it. Such charges are automatically dropped at
> rmdir() if there are no tasks.
>
> -5. TODO
> +5. Hierarchy Model
> + the kernel supports following kinds of hierarchy models.
> + (your middle-ware may support others based on this.)
> +
> + 5-a. Independent Hierarchy
> + There are no relationship between any cgroups, even among a parent and
> + children. This is the default mode. To use this hierarchy, write 0
> + to root cgroup's memory.hierarchy_model
> + echo 0 > .../memory.hierarchy_model.
> +
> + 5-b. Hardwall Hierarchy.
> + The resource has to be moved from the parent to the child before use it.
> + When a child's limit is set to 'val', val of the resource is moved from
> + the parent to the child. the parent's usage += val.
> + The amount of children's usage is reported by the file
> +
> + - memory.assigned_to_child
> +
> + This policy doesn't provide sophisticated automatic resource balancing in
> + the kernel. But this is very good for strict resource isolation. Users
> + can get high predictability of behavior of applications if this is used
> + under proper environments.
> +
> +
> +6. TODO
>
> 1. Add support for accounting huge pages (as a separate controller)
> 2. Make per-cgroup scanner reclaim not-shared pages first
> 3. Teach controller to account for shared-pages
> 4. Start reclamation when the limit is lowered
> + (this is already done in Hardwall Hierarchy)

Excellent! May be we should split this patch out?

> 5. Start reclamation in the background when the limit is
> not yet hit but the usage is getting closer
>
>


--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL

2008-06-09 12:03:45

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: Re: [RFC][PATCH 1/2] memcg: res_counter hierarchy

----- Original Message -----
way, breaks limit semantics.
>>>
>> Not easy to use in my point of view. Can we use 'share' in proper way
>> on no-swap machine ?
>>
>
>Not sure I understand your question. Share represents the share of available
>resources.
>

If there is no swap, you cannot reclaim anonymous pages and shared memory.
Then, the kernel has to abandon any kind of auto-balancing somewhere.
(Just an example. Things will be more complicated when we consider
mlocked pages and the swap-resource-controller.)


>> yield() after callback() means that the res_counter's state may be
>> far different from the state just after the callback.
>> So, we have to yield before the callback and check the res_counter sooner.
>>
>
>But does yield() get us any guarantees of seeing the state change?
>
Hmm, maybe my explanation was bad.

in following sequence
1.callback()
2.yield()
3.check usage again
Elapsed time between 1->3 is big.

in following
1.yield()
2.callback()
3.check usage again
Elapsed time between 2->3 is small.

There is an option to implement "changing the limit gradually".

Thanks,
-Kame
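The ordering Kame describes can be sketched in user space. A hypothetical, much-simplified model of the retry loop in res_counter_move_resource(): because yield() comes before the callback, the re-check at the loop head runs immediately after the callback has changed the counter's state. reclaim() here is a stand-in that frees a fixed 100 units per call; none of this is the kernel code.

```c
#include <assert.h>
#include <sched.h>

struct counter { unsigned long long usage, limit; };

static int fits(struct counter *c, unsigned long long val)
{
	return c->usage + val <= c->limit;
}

/* Stand-in for the reclaim callback: frees up to 100 units per call. */
static void reclaim(struct counter *c)
{
	c->usage -= c->usage < 100 ? c->usage : 100;
}

static int retry_move(struct counter *c, unsigned long long val, int retry)
{
	while (!fits(c, val)) {
		if (!retry--)
			return 1;	/* give up instead of spinning forever */
		sched_yield();		/* yield FIRST ...                     */
		reclaim(c);		/* ... then reclaim, so the re-check
					 * at the loop head follows at once.   */
	}
	return 0;
}
```

Swapping the two calls (reclaim, then yield) would let arbitrary time pass between reclaiming and re-checking, which is the gap Kame wants to keep small.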

2008-06-09 12:10:44

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: Re: [RFC][PATCH 2/2] memcg: hardwall hierarchy for memcg

----- Original Message -----

>KAMEZAWA Hiroyuki wrote:
>> Hard-Wall hierarchy support for memcg.
>> - new member hierarchy_model is added to memcg.
>>
>> Only root cgroup can modify this only when there is no children.
>>
>
>Sounds like the correct thing to do, and it also maintains compatibility.
>
Paul suggested another way. (please see his mail, sorry)

So, I changed this behavior as follows:
"a cgroup's hierarchy mode can be changed when
 - the parent's hierarchy mode is "no hierarchy"
 - it has no children"

>> Adds following functions for supporting HARDWALL hierarchy.
>> - try to reclaim memory at the change of "limit".
>> - try to reclaim all memory at force_empty
>> - returns resources to the parent at destroy.
>>
>> Changelog v2->v3
>> - added documentation.
>> - hierarchy_model parameter is added.
>>
>>
>> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
>>
>> ---
>> Documentation/controllers/memory.txt | 27 +++++-
>> mm/memcontrol.c | 156 ++++++++++++++++++++++++++++++++++-
>> 2 files changed, 178 insertions(+), 5 deletions(-)
>>
>> Index: temp-2.6.26-rc2-mm1/mm/memcontrol.c
>> ===================================================================
>> --- temp-2.6.26-rc2-mm1.orig/mm/memcontrol.c
>> +++ temp-2.6.26-rc2-mm1/mm/memcontrol.c
>> @@ -137,6 +137,8 @@ struct mem_cgroup {
>> struct mem_cgroup_lru_info info;
>>
>> int prev_priority; /* for recording reclaim priority */
>> +
>> + int hierarchy_model; /* used hierarchical policy */
>
>Could we enumerate what these policies are?
>
Sure,

>> /*
>> * statistics.
>> */
>> @@ -144,6 +146,10 @@ struct mem_cgroup {
>> };
>> static struct mem_cgroup init_mem_cgroup;
>>
>> +
>> +#define MEMCG_NO_HIERARCHY (0)
>
>Nice to indent with the #define below
>
OK, but I use another style in my current set (not posted yet).

>> +#define MEMCG_HARDWALL_HIERARCHY (1)
>> +
>> /*
>> * We use the lower bit of the page->page_cgroup pointer as a bit spin
>> * lock. We need to ensure that page->page_cgroup is at least two
>> @@ -792,6 +798,89 @@ int mem_cgroup_shrink_usage(struct mm_st
>> }
>>
>> /*
>> + * Memory Controller hierarchy support.
>> + */
>> +
>> +/*
>> + * shrink usage to be res->usage + val < res->limit.
>> + */
>> +
>> +int memcg_shrink_val(struct res_counter *cnt, unsigned long long val)
>> +{
>> + struct mem_cgroup *memcg = container_of(cnt, struct mem_cgroup, res);
>> + unsigned long flags;
>> + int ret = 1;
>> + int progress = 1;
>> +
>> +retry:
>> + spin_lock_irqsave(&cnt->lock, flags);
>> + /* Need to shrink ? */
>> + if (cnt->usage + val <= cnt->limit)
>> + ret = 0;
>> + spin_unlock_irqrestore(&cnt->lock, flags);
>> +
>
>We have res_counter_check_under_limit(), may be we could re-use that here by
>adding parameters.
>
sure


>> + if (!ret)
>> + return 0;
>> +
>> + if (!progress)
>> + return 1;
>> + progress = try_to_free_mem_cgroup_pages(memcg, GFP_KERNEL);
>> +
>> + goto retry;
>> +}
>> +
>> +/*
>> + * For Hard Wall Hierarchy.
>> + */
>> +
>> +int mem_cgroup_resize_callback(struct res_counter *cnt,
>> + unsigned long long val, int what)
>> +{
>> + unsigned long flags, borrow;
>> + unsigned long long diffs;
>> + int ret = 0;
>> +
>> + BUG_ON(what != RES_LIMIT);
>> +
>> + /* Is this under hierarchy ? */
>> + if (!cnt->parent) {
>> + spin_lock_irqsave(&cnt->lock, flags);
>> + cnt->limit = val;
>
>I know callback is called from the two functions specified in patch 1/2 (move
>and return resource). I don't understand why it is OK to force the limit to b
e
>val, if it is the root node?
>
Because it has no parent, there is no resource move.


>> + spin_unlock_irqrestore(&cnt->lock, flags);
>> + return 0;
>> + }
>> +
>> + spin_lock_irqsave(&cnt->lock, flags);
>> + if (val > cnt->limit) {
>> + diffs = val - cnt->limit;
>> + borrow = 1;
>> + } else {
>> + diffs = cnt->limit - val;
>> + borrow = 0;
>> + }
>> + spin_unlock_irqrestore(&cnt->lock, flags);
>> +
>> + if (borrow)
>> + ret = res_counter_move_resource(cnt, diffs,
>> + memcg_shrink_val,
>> + MEM_CGROUP_RECLAIM_RETRIES);
>> + else
>> + ret = res_counter_return_resource(cnt, diffs,
>> + memcg_shrink_val,
>> + MEM_CGROUP_RECLAIM_RETRIES);
>> + return ret;
>> +}
>> +
>> +
>> +void memcg_shrink_all(struct mem_cgroup *mem)
>> +{
>
>memcg_shrink_to_limit?
>
OK, I'll use a more precise name.

>> + unsigned long long val;
>> +
>> + val = res_counter_read_u64(&mem->res, RES_LIMIT);
>> + memcg_shrink_val(&mem->res, val);
>> +}
>> +
>> +/*
>> * This routine traverses page_cgroup in the given list and drops them all.
>> * *And* this routine doesn't reclaim the page itself, just removes page_cgroup.
>> */
>> @@ -848,6 +937,8 @@ static int mem_cgroup_force_empty(struct
>> if (mem_cgroup_subsys.disabled)
>> return 0;
>>
>> + memcg_shrink_all(mem);
>> +
>> css_get(&mem->css);
>> /*
>> * page reclaim code (kswapd etc..) will move pages between
>> @@ -896,11 +987,44 @@ static ssize_t mem_cgroup_write(struct c
>> struct file *file, const char __user *userbuf,
>> size_t nbytes, loff_t *ppos)
>> {
>> - return res_counter_write(&mem_cgroup_from_cont(cont)->res,
>> + struct mem_cgroup *memcg = mem_cgroup_from_cont(cont);
>> +
>> + if (cft->private != RES_LIMIT
>> + || !cont->parent
>> + || memcg->hierarchy_model == MEMCG_NO_HIERARCHY)
>> + return res_counter_write(&memcg->res, cft->private, userbuf,
>> + nbytes, ppos, mem_cgroup_write_strategy, NULL);
>> + else
>> + return res_counter_write(&memcg->res,
>> cft->private, userbuf, nbytes, ppos,
>> - mem_cgroup_write_strategy);
>> + mem_cgroup_write_strategy,
>> + mem_cgroup_resize_callback);
>> +}
>> +
>> +
>> +static u64 mem_cgroup_read_hierarchy(struct cgroup *cgrp, struct cftype *cft)
>> +{
>> + struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
>> + return memcg->hierarchy_model;
>> +}
>> +
>> +static int mem_cgroup_write_hierarchy(struct cgroup *cgrp, struct cftype *cft,
>> + u64 val)
>> +{
>> + struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
>> + /* changing the policy is allowed only for the ROOT cgroup with no children */
>> + if (cgrp->parent)
>> + return -EINVAL;
>> + if (!list_empty(&cgrp->children))
>> + return -EINVAL;
>> + if (val == 0 || val == 1) {
>> + memcg->hierarchy_model = val;
>> + return 0;
>> + }
>> + return -EINVAL;
>> }
>>
>> +
>> static int mem_cgroup_reset(struct cgroup *cont, unsigned int event)
>> {
>> struct mem_cgroup *mem;
>> @@ -992,6 +1116,16 @@ static struct cftype mem_cgroup_files[]
>> .name = "stat",
>> .read_map = mem_control_stat_show,
>> },
>> + {
>> + .name = "hierarchy_model",
>> + .read_u64 = mem_cgroup_read_hierarchy,
>> + .write_u64 = mem_cgroup_write_hierarchy,
>> + },
>> + {
>> + .name = "assigned_to_child",
>> + .private = RES_FOR_CHILDREN,
>> + .read_u64 = mem_cgroup_read,
>> + },
>> };
>>
>> static int alloc_mem_cgroup_per_zone_info(struct mem_cgroup *mem, int node)
>> @@ -1056,19 +1190,27 @@ static void mem_cgroup_free(struct mem_c
>> static struct cgroup_subsys_state *
>> mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
>> {
>> - struct mem_cgroup *mem;
>> + struct mem_cgroup *mem, *parent;
>> int node;
>>
>> if (unlikely((cont->parent) == NULL)) {
>> mem = &init_mem_cgroup;
>> page_cgroup_cache = KMEM_CACHE(page_cgroup, SLAB_PANIC);
>> + parent = NULL;
>> } else {
>> mem = mem_cgroup_alloc();
>> if (!mem)
>> return ERR_PTR(-ENOMEM);
>> + parent = mem_cgroup_from_cont(cont->parent);
>> }
>>
>> - res_counter_init(&mem->res);
>> + if (!parent || parent->hierarchy_model == MEMCG_NO_HIERARCHY) {
>> + res_counter_init(&mem->res);
>> + mem->hierarchy_model = MEMCG_NO_HIERARCHY;
>> + } else {
>> + res_counter_init_hierarchy(&mem->res, &parent->res);
>> + mem->hierarchy_model = parent->hierarchy_model;
>> + }
>>
>> for_each_node_state(node, N_POSSIBLE)
>> if (alloc_mem_cgroup_per_zone_info(mem, node))
>> @@ -1096,6 +1238,12 @@ static void mem_cgroup_destroy(struct cg
>> int node;
>> struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
>>
>> + if (cont->parent &&
>> + mem->hierarchy_model == MEMCG_HARDWALL_HIERARCHY) {
>> + /* we did what we could... just return what we borrowed */
>> + res_counter_return_resource(&mem->res, -1, NULL, 0);
>> + }
>> +
>> for_each_node_state(node, N_POSSIBLE)
>> free_mem_cgroup_per_zone_info(mem, node);
>>
>> Index: temp-2.6.26-rc2-mm1/Documentation/controllers/memory.txt
>> ===================================================================
>> --- temp-2.6.26-rc2-mm1.orig/Documentation/controllers/memory.txt
>> +++ temp-2.6.26-rc2-mm1/Documentation/controllers/memory.txt
>> @@ -237,12 +237,37 @@ cgroup might have some charge associated
>> tasks have migrated away from it. Such charges are automatically dropped at
>> rmdir() if there are no tasks.
>>
>> -5. TODO
>> +5. Hierarchy Model
>> + the kernel supports following kinds of hierarchy models.
>> + (your middle-ware may support others based on this.)
>> +
>> + 5-a. Independent Hierarchy
>> + There are no relationship between any cgroups, even among a parent and
>> + children. This is the default mode. To use this hierarchy, write 0
>> + to root cgroup's memory.hierarchy_model
>> + echo 0 > .../memory.hierarchy_model.
>> +
>> + 5-b. Hardwall Hierarchy.
>> + The resource has to be moved from the parent to the child before use it.
>> + When a child's limit is set to 'val', val of the resource is moved from
>> + the parent to the child. the parent's usage += val.
>> + The amount of children's usage is reported by the file
>> +
>> + - memory.assigned_to_child
>> +
>> + This policy doesn't provide sophisticated automatic resource balancing in
>> + the kernel. But this is very good for strict resource isolation. Users
>> + can get high predictability of behavior of applications if this is used
>> + under proper environments.
>> +
>> +
>> +6. TODO
>>
>> 1. Add support for accounting huge pages (as a separate controller)
>> 2. Make per-cgroup scanner reclaim not-shared pages first
>> 3. Teach controller to account for shared-pages
>> 4. Start reclamation when the limit is lowered
>> + (this is already done in Hardwall Hierarchy)
>
>Excellent! May be we should split this patch out?
>
I'm now rearranging the set to do that :)

Thank you for comments.

Regards,
-Kame
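The shrink loop discussed above (memcg_shrink_val, with the limit check that Balbir suggests folding into a res_counter helper) can also be modeled in user space. This is a hypothetical sketch: reclaim() stands in for try_to_free_mem_cgroup_pages() and frees up to 64 units per call; the loop gives up once reclaim stops making progress, mirroring the retry:/goto structure of the patch.

```c
#include <assert.h>

struct counter { unsigned long long usage, limit; };

/* Stand-in for try_to_free_mem_cgroup_pages(): frees up to 64 units,
 * returns non-zero while it makes progress. */
static int reclaim(struct counter *c)
{
	unsigned long long freed = c->usage < 64 ? c->usage : 64;

	c->usage -= freed;
	return freed > 0;
}

/* Shrink usage until usage + val fits under the limit; fail when
 * reclaim makes no more progress. */
static int shrink_val(struct counter *c, unsigned long long val)
{
	int progress = 1;

	while (c->usage + val > c->limit) {
		if (!progress)
			return 1;	/* cannot shrink enough */
		progress = reclaim(c);
	}
	return 0;
}
```

Starting from usage 500 with limit 512, making room for 100 more units takes two reclaim passes; asking for 600 fails once usage has been reclaimed to zero and no further progress is possible.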

2008-06-11 23:25:38

by Randy Dunlap

[permalink] [raw]
Subject: Re: [RFC][PATCH 1/2] memcg: res_counter hierarchy

On Wed, 4 Jun 2008 14:01:53 +0900 KAMEZAWA Hiroyuki wrote:

> A simple hard-wall hierarchy support for res_counter.
>
> Changelog v2->v3
> - changed the name and arguments of functions.
> - rewrote to be read easily.
> - named as HardWall hierarchy.

> ---
> Documentation/controllers/resource_counter.txt | 41 +++++++++
> include/linux/res_counter.h | 90 +++++++++++++++++++-
> kernel/res_counter.c | 112 +++++++++++++++++++++++--
> 3 files changed, 235 insertions(+), 8 deletions(-)
>
> Index: temp-2.6.26-rc2-mm1/include/linux/res_counter.h
> ===================================================================
> --- temp-2.6.26-rc2-mm1.orig/include/linux/res_counter.h
> +++ temp-2.6.26-rc2-mm1/include/linux/res_counter.h
> @@ -76,15 +97,33 @@ enum {
> RES_MAX_USAGE,
> RES_LIMIT,
> RES_FAILCNT,
> + RES_FOR_CHILDREN,
> };
>
> /*
> * helpers for accounting
> */
>
> +/*
> + * initialize res_counter.
> + * @counter : the counter
> + *
> + * initialize res_counter and set default limit to very big value(unlimited)
> + */
> +
> void res_counter_init(struct res_counter *counter);

For these non-static (non-private) functions, please use kernel-doc notation
(see Documentation/kernel-doc-nano-HOWTO.txt and/or examples in other source files).
Also, we prefer for the function documentation to be above its definition (implementation)
rather than above its declaration, so the kernel-doc should be moved to .c files
instead of living in .h files.


>
> /*
> + * initialize res_counter under hierarchy.
> + * @counter : the counter
> + * @parent : the parent of the counter
> + *
> + * initialize res_counter and set default limit to 0. and set "parent".
> + */
> +void res_counter_init_hierarchy(struct res_counter *counter,
> + struct res_counter *parent);
> +
> +/*
> * charge - try to consume more resource.
> *
> * @counter: the counter
> @@ -153,4 +192,51 @@ static inline void res_counter_reset_fai
> cnt->failcnt = 0;
> spin_unlock_irqrestore(&cnt->lock, flags);
> }
> +
> +/**
> + * Move resources from a parent to a child.
> + * At success,
> + * parent->usage += val.
> + * parent->for_children += val.
> + * child->limit += val.
> + *
> + * @child: an entity to set res->limit. The parent is child->parent.
> + * @val: the amount of resource to be moved.
> + * @callback: called when the parent's free resource is not enough to be moved.
> + * this can be NULL if no callback is necessary.
> + * @retry: limit for the number of trying to callback.
> + * -1 means infinite loop. At each retry, yield() is called.
> + * Returns 0 at success, !0 at failure.
> + *
> + * The callback returns 0 at success, !0 at failure.
> + *
> + */
> +
> +int res_counter_move_resource(struct res_counter *child,
> + unsigned long long val,
> + int (*callback)(struct res_counter *res, unsigned long long val),
> + int retry);
> +
> +
> +/**
> + * Return resource to its parent.
> + * At success,
> + * parent->usage -= val.
> + * parent->for_children -= val.
> + * child->limit -= val.
> + *
> + * @child: entry to resize. The parent is child->parent.
> + * @val: how much the child repays to the parent; -1 means 'all'.
> + * @callback: A callback for decreasing resource usage of the child before
> + * returning. If NULL, just decreases the child's limit.
> + * @retry: # of retries at calling callback for freeing resource.
> + * -1 means infinite loop. At each retry, yield() is called.
> + * Returns 0 at success.
> + */
> +
> +int res_counter_return_resource(struct res_counter *child,
> + unsigned long long val,
> + int (*callback)(struct res_counter *res, unsigned long long val),
> + int retry);
> +
> #endif
> Index: temp-2.6.26-rc2-mm1/Documentation/controllers/resource_counter.txt
> ===================================================================
> --- temp-2.6.26-rc2-mm1.orig/Documentation/controllers/resource_counter.txt
> +++ temp-2.6.26-rc2-mm1/Documentation/controllers/resource_counter.txt
> @@ -179,3 +186,37 @@ counter fields. They are recommended to
> still can help with it).
>
> c. Compile and run :)
> +
> +
> +6. Hierarchy
> + a. No Hierarchy
> + each cgroup can use its own private resource.
> +
> + b. Hard-wall Hierarchy
> + A simple hierarchical tree system for resource isolation.
> + Allows moving resources only between a parent and its children.
> + A parent can move its resource to its children and remembers the amount in
> + the for_children member. A child can get new resource only from its parent.
> + A child's limit is the amount of resource moved from its parent.
> +
> + When add "val" to a child,
> + parent->usage += val
> + parent->for_children += val
> + child->limit += val
> + When a child returns its resource
> + parent->usage -= val
> + parent->for_children -= val
> + child->limit -= val.
> +
> + This implements resource isolation among each group. This works very well
> + when you want to use strict resource isolation.
> +
> + Usage Hint:
> + This seems for static resource assignment but dynamic resource re-assignment

seems to be?

> + can be done by resetting "limit" of groups. When you consider "limit" as
> + the amount of allowed _current_ resource, a sophisticated resource management
> + system based on strict resource isolation can be implemented.
> +
> +c. Soft-wall Hierarchy
> + TBD.
> +
> Index: temp-2.6.26-rc2-mm1/kernel/res_counter.c
> ===================================================================
> --- temp-2.6.26-rc2-mm1.orig/kernel/res_counter.c
> +++ temp-2.6.26-rc2-mm1/kernel/res_counter.c
> @@ -20,6 +20,14 @@ void res_counter_init(struct res_counter
> counter->limit = (unsigned long long)LLONG_MAX;
> }
>
> +void res_counter_init_hierarchy(struct res_counter *counter,
> + struct res_counter *parent)
> +{
> + spin_lock_init(&counter->lock);
> + counter->limit = 0;
> + counter->parent = parent;
> +}
> +
> int res_counter_charge_locked(struct res_counter *counter, unsigned long val)
> {
> if (counter->usage + val > counter->limit) {
> @@ -74,6 +82,8 @@ res_counter_member(struct res_counter *c
> return &counter->limit;
> case RES_FAILCNT:
> return &counter->failcnt;
> + case RES_FOR_CHILDREN:
> + return &counter->for_children;
> };
>
> BUG();
> @@ -104,7 +114,9 @@ u64 res_counter_read_u64(struct res_coun
>
> ssize_t res_counter_write(struct res_counter *counter, int member,
> const char __user *userbuf, size_t nbytes, loff_t *pos,
> - int (*write_strategy)(char *st_buf, unsigned long long *val))
> + int (*write_strategy)(char *st_buf, unsigned long long *val),
> + int (*set_strategy)(struct res_counter *res,
> + unsigned long long val, int what))
> {
> int ret;
> char *buf, *end;
> @@ -133,13 +145,101 @@ ssize_t res_counter_write(struct res_cou
> if (*end != '\0')
> goto out_free;
> }
> - spin_lock_irqsave(&counter->lock, flags);
> - val = res_counter_member(counter, member);
> - *val = tmp;
> - spin_unlock_irqrestore(&counter->lock, flags);
> - ret = nbytes;
> + if (set_strategy) {
> + ret = set_strategy(res, tmp, member);
> + if (!ret)
> + ret = nbytes;
> + } else {
> + spin_lock_irqsave(&counter->lock, flags);
> + val = res_counter_member(counter, member);
> + *val = tmp;
> + spin_unlock_irqrestore(&counter->lock, flags);
> + ret = nbytes;
> + }
> out_free:
> kfree(buf);
> out:
> return ret;
> }
> +
> +
> +int res_counter_move_resource(struct res_counter *child,
> + unsigned long long val,
> + int (*callback)(struct res_counter *res, unsigned long long val),
> + int retry)
> +{
> + struct res_counter *parent = child->parent;
> + unsigned long flags;
> +
> + BUG_ON(!parent);
> +
> + while (1) {
> + spin_lock_irqsave(&parent->lock, flags);
> + if (parent->usage + val < parent->limit) {
> + parent->for_children += val;
> + parent->usage += val;
> + break;
> + }
> + spin_unlock_irqrestore(&parent->lock, flags);
> +
> + if (!retry || !callback)
> + goto fail;
> + /* -1 means infinite loop */
> + if (retry != -1)
> + --retry;
> + yield();
> + callback(parent, val);
> + }
> + spin_unlock_irqrestore(&parent->lock, flags);
> +
> + spin_lock_irqsave(&child->lock, flags);
> + child->limit += val;
> + spin_unlock_irqrestore(&child->lock, flags);
> + return 0;
> +fail:
> + return 1;
> +}
> +
> +
> +int res_counter_return_resource(struct res_counter *child,
> + unsigned long long val,
> + int (*callback)(struct res_counter *res, unsigned long long val),
> + int retry)
> +{
> + unsigned long flags;
> + struct res_counter *parent = child->parent;
> +
> + BUG_ON(!parent);
> +
> + while (1) {
> + spin_lock_irqsave(&child->lock, flags);
> + if (val == (unsigned long long) -1) {
> + val = child->limit;
> + child->limit = 0;
> + break;
> + } else if (child->usage <= child->limit - val) {
> + child->limit -= val;
> + break;
> + }
> + spin_unlock_irqrestore(&child->lock, flags);
> +
> + if (!retry)
> + goto fail;
> + /* -1 means infinite loop */
> + if (retry != -1)
> + --retry;
> + yield();
> + callback(parent, val);
> + }
> + spin_unlock_irqrestore(&child->lock, flags);
> +
> + spin_lock_irqsave(&parent->lock, flags);
> + BUG_ON(parent->for_children < val);
> + BUG_ON(parent->usage < val);
> + parent->for_children -= val;
> + parent->usage -= val;
> + spin_unlock_irqrestore(&parent->lock, flags);
> + return 0;
> +fail:
> + return 1;
> +}
> --


---
~Randy
"Daemon" is an old piece of jargon from the UNIX operating system,
where it referred to a piece of low-level utility software, a
fundamental part of the operating system.
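As a concrete illustration of the kernel-doc notation Randy asks for, the res_counter_init() comment could look roughly like this. This is a hypothetical sketch in plain C — the struct is reduced to the fields used here, and in the real patch the comment would sit above the definition in kernel/res_counter.c rather than in the header:

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

struct res_counter {
	unsigned long long usage, limit, failcnt;
	struct res_counter *parent;
};

/**
 * res_counter_init - initialize a res_counter
 * @counter: the counter to initialize
 *
 * Initializes the counter with no parent and sets the default limit
 * to LLONG_MAX, i.e. effectively unlimited.
 */
static void res_counter_init(struct res_counter *counter)
{
	counter->usage = 0;
	counter->failcnt = 0;
	counter->parent = NULL;
	counter->limit = (unsigned long long)LLONG_MAX;
}
```

The `/** ... name - short description ... @param: ...` layout is what the kernel-doc tooling parses when generating documentation.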

2008-06-11 23:28:57

by Randy Dunlap

[permalink] [raw]
Subject: Re: [RFC][PATCH 2/2] memcg: hardwall hierarchy for memcg

On Wed, 4 Jun 2008 14:03:29 +0900 KAMEZAWA Hiroyuki wrote:

> ---
> Documentation/controllers/memory.txt | 27 +++++-
> mm/memcontrol.c | 156 ++++++++++++++++++++++++++++++++++-
> 2 files changed, 178 insertions(+), 5 deletions(-)
>

> Index: temp-2.6.26-rc2-mm1/Documentation/controllers/memory.txt
> ===================================================================
> --- temp-2.6.26-rc2-mm1.orig/Documentation/controllers/memory.txt
> +++ temp-2.6.26-rc2-mm1/Documentation/controllers/memory.txt
> @@ -237,12 +237,37 @@ cgroup might have some charge associated
> tasks have migrated away from it. Such charges are automatically dropped at
> rmdir() if there are no tasks.
>
> -5. TODO
> +5. Hierarchy Model
> + the kernel supports following kinds of hierarchy models.

The kernel supports the following kinds ....

> + (your middle-ware may support others based on this.)

(Your
> +
> + 5-a. Independent Hierarchy
> + There are no relationship between any cgroups, even among a parent and

is

> + children. This is the default mode. To use this hierarchy, write 0
> + to root cgroup's memory.hierarchy_model
> + echo 0 > .../memory.hierarchy_model.
> +
> + 5-b. Hardwall Hierarchy.
> + The resource has to be moved from the parent to the child before use it.

using it.

> + When a child's limit is set to 'val', val of the resource is moved from
> + the parent to the child. the parent's usage += val.

The parent's usage is incremented by val.
(if that's what you mean)

> + The amount of children's usage is reported by the file
> +
> + - memory.assigned_to_child
> +
> + This policy doesn't provide sophisticated automatic resource balancing in
> + the kernel. But this is very good for strict resource isolation. Users

kernel, but this ...

> + can get high predictability of behavior of applications if this is used
> + under proper environments.
> +
> +
> +6. TODO
>
> 1. Add support for accounting huge pages (as a separate controller)
> 2. Make per-cgroup scanner reclaim not-shared pages first
> 3. Teach controller to account for shared-pages
> 4. Start reclamation when the limit is lowered
> + (this is already done in Hardwall Hierarchy)
> 5. Start reclamation in the background when the limit is
> not yet hit but the usage is getting closer
> --


2008-06-12 04:54:48

by Kamezawa Hiroyuki

Subject: Re: [RFC][PATCH 1/2] memcg: res_counter hierarchy

On Wed, 11 Jun 2008 16:24:27 -0700
Randy Dunlap <[email protected]> wrote:
> > /*
> > * helpers for accounting
> > */
> >
> > +/*
> > + * initialize res_counter.
> > + * @counter : the counter
> > + *
> > + * initialize res_counter and set default limit to very big value(unlimited)
> > + */
> > +
> > void res_counter_init(struct res_counter *counter);
>
> For these non-static (non-private) functions, please use kernel-doc notation
> (see Documentation/kernel-doc-nano-HOWTO.txt and/or examples in other source files).
> Also, we prefer for the function documentation to be above its definition (implementation)
> rather than above its declaration, so the kernel-doc should be moved to .c files
> instead of living in .h files.
>
Ah, sorry. I'll do that in the next version. Maybe I should move the other
comments in res_counter.h to res_counter.c as well.


>
> >
> > /*
> > + * initialize res_counter under hierarchy.
> > + * @counter : the counter
> > + * @parent : the parent of the counter
> > + *
> > + * initialize res_counter and set default limit to 0. and set "parent".
> > + */
> > +void res_counter_init_hierarchy(struct res_counter *counter,
> > + struct res_counter *parent);
> > +
> > +/*
> > * charge - try to consume more resource.
> > *
> > * @counter: the counter
> > @@ -153,4 +192,51 @@ static inline void res_counter_reset_fai
> > cnt->failcnt = 0;
> > spin_unlock_irqrestore(&cnt->lock, flags);
> > }
> > +
> > +/**
> > + * Move resources from a parent to a child.
> > + * At success,
> > + * parent->usage += val.
> > + * parent->for_children += val.
> > + * child->limit += val.
> > + *
> > + * @child: an entity to set res->limit. The parent is child->parent.
> > + * @val: the amount of resource to be moved.
> > + * @callback: called when the parent's free resource is not enough to be moved.
> > + * this can be NULL if no callback is necessary.
> > + * @retry: limit for the number of trying to callback.
> > + * -1 means infinite loop. At each retry, yield() is called.
> > + * Returns 0 at success, !0 at failure.
> > + *
> > + * The callback returns 0 at success, !0 at failure.
> > + *
> > + */
> > +
> > +int res_counter_move_resource(struct res_counter *child,
> > + unsigned long long val,
> > + int (*callback)(struct res_counter *res, unsigned long long val),
> > + int retry);
> > +
> > +
> > +/**
> > + * Return resource to its parent.
> > + * At success,
> > + * parent->usage -= val.
> > + * parent->for_children -= val.
> > + * child->limit -= val.
> > + *
> > + * @child: entry to resize. The parent is child->parent.
> > + * @val : How much does child repay to parent ? -1 means 'all'
> > + * @callback: A callback for decreasing resource usage of child before
> > + * returning. If NULL, just decreases child's limit.
> > + * @retry: # of retries at calling callback for freeing resource.
> > + * -1 means infinite loop. At each retry, yield() is called.
> > + * Returns 0 at success.
> > + */
> > +
> > +int res_counter_return_resource(struct res_counter *child,
> > + unsigned long long val,
> > + int (*callback)(struct res_counter *res, unsigned long long val),
> > + int retry);
> > +
> > #endif
> > Index: temp-2.6.26-rc2-mm1/Documentation/controllers/resource_counter.txt
> > ===================================================================
> > --- temp-2.6.26-rc2-mm1.orig/Documentation/controllers/resource_counter.txt
> > +++ temp-2.6.26-rc2-mm1/Documentation/controllers/resource_counter.txt
> > @@ -179,3 +186,37 @@ counter fields. They are recommended to
> > still can help with it).
> >
> > c. Compile and run :)
> > +
> > +
> > +6. Hierarchy
> > + a. No Hierarchy
> > + each cgroup can use its own private resource.
> > +
> > + b. Hard-wall Hierarchy
> > + A simple hierarchical tree system for resource isolation.
> > + Allows moving resources only between a parent and its children.
> > + A parent can move its resource to children and remember the amount to
> > + for_children member. A child can get new resource only from its parent.
> > + Limit of a child is the amount of resource which is moved from its parent.
> > +
> > + When add "val" to a child,
> > + parent->usage += val
> > + parent->for_children += val
> > + child->limit += val
> > + When a child returns its resource
> > + parent->usage -= val
> > + parent->for_children -= val
> > + child->limit -= val.
> > +
> > + This implements resource isolation among each group. This works very well
> > + when you want to use strict resource isolation.
> > +
> > + Usage Hint:
> > + This seems for static resource assignment but dynamic resource re-assignment
>
> seems to be?
>
will fix.

Thank you for the review!

Thanks,
-Kame

2008-06-12 04:55:24

by Kamezawa Hiroyuki

Subject: Re: [RFC][PATCH 2/2] memcg: hardwall hierarhcy for memcg

Thank you. Will fix.

-Kame
