2012-05-11 09:43:37

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: [PATCH v3][0/6] memcg: prevent -ENOMEM in pre_destroy()

Hi, here is v3 based on memcg-devel tree.
git://github.com/mstsxfx/memcg-devel.git

This patch series avoids -ENOMEM from pre_destroy(), which is called
at rmdir(). After this series, charges are moved to root (if
use_hierarchy==0) or to the parent (if use_hierarchy==1), and we'll
no longer see -ENOMEM in rmdir() of a cgroup.

v2 included some patches other than the ones handling the -ENOMEM problem,
so I split it up. I'd like to post the others in a different series later.
No logical changes in general; v3 should be cleaner than v2.

0001 ....fix error code in memcg-hugetlb
0002 ....add res_counter_uncharge_until
0003 ....use res_counter_uncharge_until in memcg
0004 ....move charges to root if use_hierarchy==0
0005 ....cleanup for mem_cgroup_move_account()
0006 ....remove __must_check from res_counter_charge_nofail() (from Costa's slub accounting series).

Regards,
-Kame


2012-05-11 09:47:23

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: [PATCH v3 1/6] memcg: fix error code in hugetlb_force_memcg_empty()

The conditions are handled as -EBUSY, _now_.

Reviewed-by: Aneesh Kumar K.V <[email protected]>
Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
mm/hugetlb.c | 5 ++++-
1 files changed, 4 insertions(+), 1 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 1d3c8ea9..824f07b 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1922,8 +1922,11 @@ int hugetlb_force_memcg_empty(struct cgroup *cgroup)
int ret = 0, idx = 0;

do {
- if (cgroup_task_count(cgroup) || !list_empty(&cgroup->children))
+ if (cgroup_task_count(cgroup)
+ || !list_empty(&cgroup->children)) {
+ ret = -EBUSY;
goto out;
+ }
/*
* If the task doing the cgroup_rmdir got a signal
* we don't really need to loop till the hugetlb resource
--
1.7.4.1

2012-05-11 09:49:06

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: [PATCH 2/6] add res_counter_uncharge_until()

From: Frederic Weisbecker <[email protected]>

At killing res_counter which is a child of other counter,
we need to do
res_counter_uncharge(child, xxx)
res_counter_charge(parent, xxx)

This is not atomic and wasting cpu. This patch adds
res_counter_uncharge_until(). This function's uncharge propagates
to ancestors until specified res_counter.

res_counter_uncharge_until(child, parent, xxx)

Now, ops is atomic and efficient.

Changelog since v2
- removed unnecessary lines.
- Fixed 'From' , this patch comes from his series. Please signed-off-by if good.

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
Documentation/cgroups/resource_counter.txt | 8 ++++++++
include/linux/res_counter.h | 3 +++
kernel/res_counter.c | 10 ++++++++--
3 files changed, 19 insertions(+), 2 deletions(-)

diff --git a/Documentation/cgroups/resource_counter.txt b/Documentation/cgroups/resource_counter.txt
index 95b24d7..703103a 100644
--- a/Documentation/cgroups/resource_counter.txt
+++ b/Documentation/cgroups/resource_counter.txt
@@ -92,6 +92,14 @@ to work with it.

The _locked routines imply that the res_counter->lock is taken.

+ f. void res_counter_uncharge_until
+ (struct res_counter *rc, struct res_counter *top,
+ unsigned long val)
+
+ Almost the same as res_counter_uncharge() but propagation of uncharge
+ stops when rc == top. This is useful when killing a res_counter in a
+ child cgroup.
+
2.1 Other accounting routines

There are more routines that may help you with common needs, like
diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
index da81af0..d11c1cd 100644
--- a/include/linux/res_counter.h
+++ b/include/linux/res_counter.h
@@ -135,6 +135,9 @@ int __must_check res_counter_charge_nofail(struct res_counter *counter,
void res_counter_uncharge_locked(struct res_counter *counter, unsigned long val);
void res_counter_uncharge(struct res_counter *counter, unsigned long val);

+void res_counter_uncharge_until(struct res_counter *counter,
+ struct res_counter *top,
+ unsigned long val);
/**
* res_counter_margin - calculate chargeable space of a counter
* @cnt: the counter
diff --git a/kernel/res_counter.c b/kernel/res_counter.c
index d508363..d9ea45e 100644
--- a/kernel/res_counter.c
+++ b/kernel/res_counter.c
@@ -99,13 +99,15 @@ void res_counter_uncharge_locked(struct res_counter *counter, unsigned long val)
counter->usage -= val;
}

-void res_counter_uncharge(struct res_counter *counter, unsigned long val)
+void res_counter_uncharge_until(struct res_counter *counter,
+ struct res_counter *top,
+ unsigned long val)
{
unsigned long flags;
struct res_counter *c;

local_irq_save(flags);
- for (c = counter; c != NULL; c = c->parent) {
+ for (c = counter; c != top; c = c->parent) {
spin_lock(&c->lock);
res_counter_uncharge_locked(c, val);
spin_unlock(&c->lock);
@@ -113,6 +115,10 @@ void res_counter_uncharge(struct res_counter *counter, unsigned long val)
local_irq_restore(flags);
}

+void res_counter_uncharge(struct res_counter *counter, unsigned long val)
+{
+ res_counter_uncharge_until(counter, NULL, val);
+}

static inline unsigned long long *
res_counter_member(struct res_counter *counter, int member)
--
1.7.4.1

2012-05-11 09:50:17

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: [PATCH v3 3/6] memcg: use res_counter_uncharge_until in move_parent()

By using res_counter_uncharge_until(), we can avoid a race and
unnecessary charging.

Changelog since v2:
- a coding style change in __mem_cgroup_cancel_local_charge()
- fixed typos.

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
mm/memcontrol.c | 63 ++++++++++++++++++++++++++++++++++++------------------
1 files changed, 42 insertions(+), 21 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 09109be..cb90be1 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2446,6 +2446,24 @@ static void __mem_cgroup_cancel_charge(struct mem_cgroup *memcg,
}

/*
+ * Cancel charges in this cgroup....doesn't propagate to parent cgroup.
+ * This is useful when moving usage to parent cgroup.
+ */
+static void __mem_cgroup_cancel_local_charge(struct mem_cgroup *memcg,
+ unsigned int nr_pages)
+{
+ unsigned long bytes = nr_pages * PAGE_SIZE;
+
+ if (mem_cgroup_is_root(memcg))
+ return;
+
+ res_counter_uncharge_until(&memcg->res, memcg->res.parent, bytes);
+ if (do_swap_account)
+ res_counter_uncharge_until(&memcg->memsw,
+ memcg->memsw.parent, bytes);
+}
+
+/*
* A helper function to get mem_cgroup from ID. must be called under
* rcu_read_lock(). The caller must check css_is_removed() or some if
* it's concern. (dropping refcnt from swap can be called against removed
@@ -2711,16 +2729,28 @@ static int mem_cgroup_move_parent(struct page *page,
nr_pages = hpage_nr_pages(page);

parent = mem_cgroup_from_cont(pcg);
- ret = __mem_cgroup_try_charge(NULL, gfp_mask, nr_pages, &parent, false);
- if (ret)
- goto put_back;
+ if (!parent->use_hierarchy) {
+ ret = __mem_cgroup_try_charge(NULL,
+ gfp_mask, nr_pages, &parent, false);
+ if (ret)
+ goto put_back;
+ }

if (nr_pages > 1)
flags = compound_lock_irqsave(page);

- ret = mem_cgroup_move_account(page, nr_pages, pc, child, parent, true);
- if (ret)
- __mem_cgroup_cancel_charge(parent, nr_pages);
+ if (parent->use_hierarchy) {
+ ret = mem_cgroup_move_account(page, nr_pages,
+ pc, child, parent, false);
+ if (!ret)
+ __mem_cgroup_cancel_local_charge(child, nr_pages);
+ } else {
+ ret = mem_cgroup_move_account(page, nr_pages,
+ pc, child, parent, true);
+
+ if (ret)
+ __mem_cgroup_cancel_charge(parent, nr_pages);
+ }

if (nr_pages > 1)
compound_unlock_irqrestore(page, flags);
@@ -3324,6 +3354,7 @@ int mem_cgroup_move_hugetlb_parent(int idx, struct cgroup *cgroup,
struct cgroup *pcgrp = cgroup->parent;
struct mem_cgroup *parent = mem_cgroup_from_cont(pcgrp);
struct mem_cgroup *memcg = mem_cgroup_from_cont(cgroup);
+ struct res_counter *counter;

if (!get_page_unless_zero(page))
goto out;
@@ -3334,28 +3365,18 @@ int mem_cgroup_move_hugetlb_parent(int idx, struct cgroup *cgroup,
goto err_out;

csize = PAGE_SIZE << compound_order(page);
- /*
- * If we have use_hierarchy set we can never fail here. So instead of
- * using res_counter_uncharge use the open-coded variant which just
- * uncharge the child res_counter. The parent will retain the charge.
- */
- if (parent->use_hierarchy) {
- unsigned long flags;
- struct res_counter *counter;
-
- counter = &memcg->hugepage[idx];
- spin_lock_irqsave(&counter->lock, flags);
- res_counter_uncharge_locked(counter, csize);
- spin_unlock_irqrestore(&counter->lock, flags);
- } else {
+ /* If parent->use_hierarchy == 0, we need to charge parent */
+ if (!parent->use_hierarchy) {
ret = res_counter_charge(&parent->hugepage[idx],
csize, &fail_res);
if (ret) {
ret = -EBUSY;
goto err_out;
}
- res_counter_uncharge(&memcg->hugepage[idx], csize);
}
+ counter = &memcg->hugepage[idx];
+ res_counter_uncharge_until(counter, counter->parent, csize);
+
pc->mem_cgroup = parent;
err_out:
unlock_page_cgroup(pc);
--
1.7.4.1

2012-05-11 09:51:22

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: [PATCH v3 4/6] memcg: move charges to root cgroup if use_hierarchy=0.

Now, at removal of a cgroup, ->pre_destroy() is called and moves charges
to the parent cgroup. A major reason for -EBUSY returned by ->pre_destroy()
is that the 'moving' hits the parent's resource limitation. It happens only
when use_hierarchy=0.

With use_hierarchy=0, all cgroups should be flat. So, nothing can justify
moving charges to the parent...parent and children are in a flat
configuration, not hierarchical.

This patch moves charges to the root cgroup at rmdir/force_empty if
use_hierarchy==0. This much simplifies rmdir() and reduces errors
in ->pre_destroy.

Changelog since v2:
- use parent_mem_cgroup()
- updated Documenation

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
Documentation/cgroups/memory.txt | 13 ++++++----
mm/memcontrol.c | 49 +++++++++++++------------------------
2 files changed, 25 insertions(+), 37 deletions(-)

diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index 730e222a..8d0de70 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -393,14 +393,15 @@ cgroup might have some charge associated with it, even though all
tasks have migrated away from it. (because we charge against pages, not
against tasks.)

-Such charges are freed or moved to their parent. At moving, both of RSS
-and CACHES are moved to parent.
-rmdir() may return -EBUSY if freeing/moving fails. See 5.1 also.
+We move the stats to root (if use_hierarchy==0) or to the parent (if
+use_hierarchy==1); the charge itself is unchanged except for uncharging
+from the child.

Charges recorded in swap information is not updated at removal of cgroup.
Recorded information is discarded and a cgroup which uses swap (swapcache)
will be charged as a new owner of it.

+About use_hierarchy, see Section 6.

5. Misc. interfaces.

@@ -413,13 +414,15 @@ will be charged as a new owner of it.

Almost all pages tracked by this memory cgroup will be unmapped and freed.
Some pages cannot be freed because they are locked or in-use. Such pages are
- moved to parent and this cgroup will be empty. This may return -EBUSY if
- VM is too busy to free/move all pages immediately.
moved to parent (if use_hierarchy==1) or root (if use_hierarchy==0) and this
+ cgroup will be empty.

Typical use case of this interface is that calling this before rmdir().
Because rmdir() moves all pages to parent, some out-of-use page caches can be
moved to the parent. If you want to avoid that, force_empty will be useful.

+ About use_hierarchy, see Section 6.
+
5.2 stat file

memory.stat file includes following statistics
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index cb90be1..f007c17 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2709,15 +2709,13 @@ static int mem_cgroup_move_parent(struct page *page,
struct mem_cgroup *child,
gfp_t gfp_mask)
{
- struct cgroup *cg = child->css.cgroup;
- struct cgroup *pcg = cg->parent;
struct mem_cgroup *parent;
unsigned int nr_pages;
unsigned long uninitialized_var(flags);
int ret;

/* Is ROOT ? */
- if (!pcg)
+ if (mem_cgroup_is_root(child))
return -EINVAL;

ret = -EBUSY;
@@ -2728,33 +2726,23 @@ static int mem_cgroup_move_parent(struct page *page,

nr_pages = hpage_nr_pages(page);

- parent = mem_cgroup_from_cont(pcg);
- if (!parent->use_hierarchy) {
- ret = __mem_cgroup_try_charge(NULL,
- gfp_mask, nr_pages, &parent, false);
- if (ret)
- goto put_back;
- }
+ parent = parent_mem_cgroup(child);
+ /*
+ * If no parent, move charges to root cgroup.
+ */
+ if (!parent)
+ parent = root_mem_cgroup;

if (nr_pages > 1)
flags = compound_lock_irqsave(page);

- if (parent->use_hierarchy) {
- ret = mem_cgroup_move_account(page, nr_pages,
- pc, child, parent, false);
- if (!ret)
- __mem_cgroup_cancel_local_charge(child, nr_pages);
- } else {
- ret = mem_cgroup_move_account(page, nr_pages,
- pc, child, parent, true);
-
- if (ret)
- __mem_cgroup_cancel_charge(parent, nr_pages);
- }
+ ret = mem_cgroup_move_account(page, nr_pages,
+ pc, child, parent, false);
+ if (!ret)
+ __mem_cgroup_cancel_local_charge(child, nr_pages);

if (nr_pages > 1)
compound_unlock_irqrestore(page, flags);
-put_back:
putback_lru_page(page);
put:
put_page(page);
@@ -3351,9 +3339,8 @@ int mem_cgroup_move_hugetlb_parent(int idx, struct cgroup *cgroup,
struct page_cgroup *pc;
int csize, ret = 0;
struct res_counter *fail_res;
- struct cgroup *pcgrp = cgroup->parent;
- struct mem_cgroup *parent = mem_cgroup_from_cont(pcgrp);
struct mem_cgroup *memcg = mem_cgroup_from_cont(cgroup);
+ struct mem_cgroup *parent = parent_mem_cgroup(memcg);
struct res_counter *counter;

if (!get_page_unless_zero(page))
@@ -3366,13 +3353,11 @@ int mem_cgroup_move_hugetlb_parent(int idx, struct cgroup *cgroup,

csize = PAGE_SIZE << compound_order(page);
/* If parent->use_hierarchy == 0, we need to charge parent */
- if (!parent->use_hierarchy) {
- ret = res_counter_charge(&parent->hugepage[idx],
- csize, &fail_res);
- if (ret) {
- ret = -EBUSY;
- goto err_out;
- }
+ if (!parent) {
+ parent = root_mem_cgroup;
+ /* root has no limit */
+ res_counter_charge_nofail(&parent->hugepage[idx],
+ csize, &fail_res);
}
counter = &memcg->hugepage[idx];
res_counter_uncharge_until(counter, counter->parent, csize);
--
1.7.4.1

2012-05-11 09:52:27

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: [PATCH v3 5/6] memcg: don't uncharge in mem_cgroup_move_account


Now, all caller passes 'false' for 'bool uncharge', remove the argument.

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
mm/memcontrol.c | 20 ++++++--------------
1 files changed, 6 insertions(+), 14 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index f007c17..fcb0095 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2615,23 +2615,19 @@ void mem_cgroup_split_huge_fixup(struct page *head)
* @pc: page_cgroup of the page.
* @from: mem_cgroup which the page is moved from.
* @to: mem_cgroup which the page is moved to. @from != @to.
- * @uncharge: whether we should call uncharge and css_put against @from.
*
* The caller must confirm following.
* - page is not on LRU (isolate_page() is useful.)
* - compound_lock is held when nr_pages > 1
*
- * This function doesn't do "charge" nor css_get to new cgroup. It should be
- * done by a caller(__mem_cgroup_try_charge would be useful). If @uncharge is
- * true, this function does "uncharge" from old cgroup, but it doesn't if
- * @uncharge is false, so a caller should do "uncharge".
+ * This function doesn't do "charge" to new cgroup and doesn't do "uncharge"
+ * from old cgroup.
*/
static int mem_cgroup_move_account(struct page *page,
unsigned int nr_pages,
struct page_cgroup *pc,
struct mem_cgroup *from,
- struct mem_cgroup *to,
- bool uncharge)
+ struct mem_cgroup *to)
{
unsigned long flags;
int ret;
@@ -2673,9 +2669,6 @@ static int mem_cgroup_move_account(struct page *page,
preempt_enable();
}
mem_cgroup_charge_statistics(from, anon, -nr_pages);
- if (uncharge)
- /* This is not "cancel", but cancel_charge does all we need. */
- __mem_cgroup_cancel_charge(from, nr_pages);

/* caller should have done css_get */
pc->mem_cgroup = to;
@@ -2737,7 +2730,7 @@ static int mem_cgroup_move_parent(struct page *page,
flags = compound_lock_irqsave(page);

ret = mem_cgroup_move_account(page, nr_pages,
- pc, child, parent, false);
+ pc, child, parent);
if (!ret)
__mem_cgroup_cancel_local_charge(child, nr_pages);

@@ -5757,8 +5750,7 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
if (!isolate_lru_page(page)) {
pc = lookup_page_cgroup(page);
if (!mem_cgroup_move_account(page, HPAGE_PMD_NR,
- pc, mc.from, mc.to,
- false)) {
+ pc, mc.from, mc.to)) {
mc.precharge -= HPAGE_PMD_NR;
mc.moved_charge += HPAGE_PMD_NR;
}
@@ -5788,7 +5780,7 @@ retry:
goto put;
pc = lookup_page_cgroup(page);
if (!mem_cgroup_move_account(page, 1, pc,
- mc.from, mc.to, false)) {
+ mc.from, mc.to)) {
mc.precharge--;
/* we uncharge from mc.from later. */
mc.moved_charge++;
--
1.7.4.1

2012-05-11 09:55:07

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: [PATCH v3 6/6] remove __must_check for res_counter_charge_nofail()

I picked this up from Costa's slub memcg series. For fixing added warning
by patch 4.
==
From: Glauber Costa <[email protected]>
Subject: [PATCH 6/6] remove __must_check for res_counter_charge_nofail()

Since we will succeed with the allocation no matter what, there
isn't the need to use __must_check with it. It can very well
be optional.

Signed-off-by: Glauber Costa <[email protected]>
Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
include/linux/res_counter.h | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
index d11c1cd..c6b0368 100644
--- a/include/linux/res_counter.h
+++ b/include/linux/res_counter.h
@@ -119,7 +119,7 @@ int __must_check res_counter_charge_locked(struct res_counter *counter,
unsigned long val);
int __must_check res_counter_charge(struct res_counter *counter,
unsigned long val, struct res_counter **limit_fail_at);
-int __must_check res_counter_charge_nofail(struct res_counter *counter,
+int res_counter_charge_nofail(struct res_counter *counter,
unsigned long val, struct res_counter **limit_fail_at);

/*
--
1.7.4.1

2012-05-11 21:17:58

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH v3 1/6] memcg: fix error code in hugetlb_force_memcg_empty()

On Fri, 11 May 2012 18:45:18 +0900
KAMEZAWA Hiroyuki <[email protected]> wrote:

> The conditions are handled as -EBUSY, _now_.

The changelog is poor. I rewrote it to

: hugetlb_force_memcg_empty() incorrectly returns 0 (success) when the
: cgroup is found to be busy. Return -EBUSY instead.

But it still doesn't tell us the end-user-visible effects of the bug.
It should.

2012-05-11 21:19:49

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 2/6] add res_counter_uncharge_until()

On Fri, 11 May 2012 18:47:06 +0900
KAMEZAWA Hiroyuki <[email protected]> wrote:

> From: Frederic Weisbecker <[email protected]>
>
> At killing res_counter which is a child of other counter,
> we need to do
> res_counter_uncharge(child, xxx)
> res_counter_charge(parent, xxx)
>
> This is not atomic and wasting cpu. This patch adds
> res_counter_uncharge_until(). This function's uncharge propagates
> to ancestors until specified res_counter.
>
> res_counter_uncharge_until(child, parent, xxx)
>
> Now, ops is atomic and efficient.
>
> Changelog since v2
> - removed unnecessary lines.
> - Fixed 'From' , this patch comes from his series. Please signed-off-by if good.
>
> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>

Frederic's Signed-off-by: is unavailable?

2012-05-14 01:09:35

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH v3 1/6] memcg: fix error code in hugetlb_force_memcg_empty()

(2012/05/12 6:17), Andrew Morton wrote:

> On Fri, 11 May 2012 18:45:18 +0900
> KAMEZAWA Hiroyuki <[email protected]> wrote:
>
>> The conditions are handled as -EBUSY, _now_.
>
> The changelog is poor. I rewrote it to
>
> : hugetlb_force_memcg_empty() incorrectly returns 0 (success) when the
> : cgroup is found to be busy. Return -EBUSY instead.
>
> But it still doesn't tell us the end-user-visible effects of the bug.
> It should.
>


Ah, sorry. How about this ?



The force_empty interface allows making the memcg empty, and works only
when the cgroup doesn't include any tasks.

# echo 0 > /cgroup/xxxx/memory.force_empty

If the cgroup isn't empty, force_empty does nothing and returns -EBUSY in
the usual memcg code, memcontrol.c. But the hugetlb implementation is
inconsistent with it and returns 0 while doing nothing. Fix it to return
-EBUSY.

Thanks,
-Kame

2012-05-14 01:12:39

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH 2/6] add res_counter_uncharge_until()

(2012/05/12 6:19), Andrew Morton wrote:

> On Fri, 11 May 2012 18:47:06 +0900
> KAMEZAWA Hiroyuki <[email protected]> wrote:
>
>> From: Frederic Weisbecker <[email protected]>
>>
>> At killing res_counter which is a child of other counter,
>> we need to do
>> res_counter_uncharge(child, xxx)
>> res_counter_charge(parent, xxx)
>>
>> This is not atomic and wasting cpu. This patch adds
>> res_counter_uncharge_until(). This function's uncharge propagates
>> to ancestors until specified res_counter.
>>
>> res_counter_uncharge_until(child, parent, xxx)
>>
>> Now, ops is atomic and efficient.
>>
>> Changelog since v2
>> - removed unnecessary lines.
>> - Fixed 'From' , this patch comes from his series. Please signed-off-by if good.
>>
>> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
>
> Frederic's Signed-off-by: is unavailable?
>

I didn't add his Signed-off-by because I modified his original patch a little...
I dropped res_counter_charge_until() because it's not used in this series,
I have no justification for adding it.
The idea of res_counter_uncharge_until() is from his patch.

Thanks,
-Kame

2012-05-14 10:08:34

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [PATCH 2/6] add res_counter_uncharge_until()

2012/5/14 KAMEZAWA Hiroyuki <[email protected]>:
> (2012/05/12 6:19), Andrew Morton wrote:
>
>> On Fri, 11 May 2012 18:47:06 +0900
>> KAMEZAWA Hiroyuki <[email protected]> wrote:
>>
>>> From: Frederic Weisbecker <[email protected]>
>>>
>>> At killing res_counter which is a child of other counter,
>>> we need to do
>>> res_counter_uncharge(child, xxx)
>>> res_counter_charge(parent, xxx)
>>>
>>> This is not atomic and wasting cpu. This patch adds
>>> res_counter_uncharge_until(). This function's uncharge propagates
>>> to ancestors until specified res_counter.
>>>
>>> res_counter_uncharge_until(child, parent, xxx)
>>>
>>> Now, ops is atomic and efficient.
>>>
>>> Changelog since v2
>>> - removed unnecessary lines.
>>> - Fixed 'From' , this patch comes from his series. Please signed-off-by if good.
>>>
>>> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
>>
>> Frederic's Signed-off-by: is unavailable?
>>
>
> I didn't add his Signed-off-by because I modified his original patch a little...
> I dropped res_counter_charge_until() because it's not used in this series,
> I have no justification for adding it.
> The idea of res_counter_uncharge_until() is from his patch.

The property of Signed-off-by is that as long as you
carry/relay/modify a patch, you add your
own signed-off-by. But you can't remove the signed off by of somebody
in the chain.

Even if you did a change in the patch, you need to preserve the chain.

There may be some special cases with "Original-patch-from:" tags used when
one heavily inspire from a patch without taking much of its original code.

2012-05-14 10:34:48

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH 2/6] add res_counter_uncharge_until()

(2012/05/14 19:08), Frederic Weisbecker wrote:

> 2012/5/14 KAMEZAWA Hiroyuki <[email protected]>:
>> (2012/05/12 6:19), Andrew Morton wrote:
>>
>>> On Fri, 11 May 2012 18:47:06 +0900
>>> KAMEZAWA Hiroyuki <[email protected]> wrote:
>>>
>>>> From: Frederic Weisbecker <[email protected]>
>>>>
>>>> At killing res_counter which is a child of other counter,
>>>> we need to do
>>>> res_counter_uncharge(child, xxx)
>>>> res_counter_charge(parent, xxx)
>>>>
>>>> This is not atomic and wasting cpu. This patch adds
>>>> res_counter_uncharge_until(). This function's uncharge propagates
>>>> to ancestors until specified res_counter.
>>>>
>>>> res_counter_uncharge_until(child, parent, xxx)
>>>>
>>>> Now, ops is atomic and efficient.
>>>>
>>>> Changelog since v2
>>>> - removed unnecessary lines.
>>>> - Fixed 'From' , this patch comes from his series. Please signed-off-by if good.
>>>>
>>>> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
>>>
>>> Frederic's Signed-off-by: is unavailable?
>>>
>>
>> I didn't add his Signed-off-by because I modified his original patch a little...
>> I dropped res_counter_charge_until() because it's not used in this series,
>> I have no justification for adding it.
>> The idea of res_counter_uncharge_until() is from his patch.
>
> The property of Signed-off-by is that as long as you
> carry/relay/modify a patch, you add your
> own signed-off-by. But you can't remove the signed off by of somebody
> in the chain.
>

> Even if you did a change in the patch, you need to preserve the chain.
>

Oh, sorry.

> There may be some special cases with "Original-patch-from:" tags used when
> one heavily inspire from a patch without taking much of its original code.
>


Is this ok ?

==
[PATCH 2/6] memcg: add res_counter_uncharge_until()

From: Frederic Weisbecker <[email protected]>

At killing res_counter which is a child of other counter,
we need to do
res_counter_uncharge(child, xxx)
res_counter_charge(parent, xxx)

This is not atomic and wasting cpu. This patch adds
res_counter_uncharge_until(). This function's uncharge propagates
to ancestors until specified res_counter.

res_counter_uncharge_until(child, parent, xxx)

Now, ops is atomic and efficient.

Changelog since v2
- removed unnecessary lines.
- added 'From' , this patch comes from his one.

Signed-off-by: Frederic Weisbecker <[email protected]>
Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
Documentation/cgroups/resource_counter.txt | 8 ++++++++
include/linux/res_counter.h | 3 +++
kernel/res_counter.c | 10 ++++++++--
3 files changed, 19 insertions(+), 2 deletions(-)

diff --git a/Documentation/cgroups/resource_counter.txt b/Documentation/cgroups/resource_counter.txt
index 95b24d7..703103a 100644
--- a/Documentation/cgroups/resource_counter.txt
+++ b/Documentation/cgroups/resource_counter.txt
@@ -92,6 +92,14 @@ to work with it.

The _locked routines imply that the res_counter->lock is taken.

+ f. void res_counter_uncharge_until
+ (struct res_counter *rc, struct res_counter *top,
+ unsigned long val)
+
+ Almost the same as res_counter_uncharge() but propagation of uncharge
+ stops when rc == top. This is useful when killing a res_counter in a
+ child cgroup.
+
2.1 Other accounting routines

There are more routines that may help you with common needs, like
diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
index da81af0..d11c1cd 100644
--- a/include/linux/res_counter.h
+++ b/include/linux/res_counter.h
@@ -135,6 +135,9 @@ int __must_check res_counter_charge_nofail(struct res_counter *counter,
void res_counter_uncharge_locked(struct res_counter *counter, unsigned long val);
void res_counter_uncharge(struct res_counter *counter, unsigned long val);

+void res_counter_uncharge_until(struct res_counter *counter,
+ struct res_counter *top,
+ unsigned long val);
/**
* res_counter_margin - calculate chargeable space of a counter
* @cnt: the counter
diff --git a/kernel/res_counter.c b/kernel/res_counter.c
index d508363..d9ea45e 100644
--- a/kernel/res_counter.c
+++ b/kernel/res_counter.c
@@ -99,13 +99,15 @@ void res_counter_uncharge_locked(struct res_counter *counter, unsigned long val)
counter->usage -= val;
}

-void res_counter_uncharge(struct res_counter *counter, unsigned long val)
+void res_counter_uncharge_until(struct res_counter *counter,
+ struct res_counter *top,
+ unsigned long val)
{
unsigned long flags;
struct res_counter *c;

local_irq_save(flags);
- for (c = counter; c != NULL; c = c->parent) {
+ for (c = counter; c != top; c = c->parent) {
spin_lock(&c->lock);
res_counter_uncharge_locked(c, val);
spin_unlock(&c->lock);
@@ -113,6 +115,10 @@ void res_counter_uncharge(struct res_counter *counter, unsigned long val)
local_irq_restore(flags);
}

+void res_counter_uncharge(struct res_counter *counter, unsigned long val)
+{
+ res_counter_uncharge_until(counter, NULL, val);
+}

static inline unsigned long long *
res_counter_member(struct res_counter *counter, int member)
--
1.7.4.1



2012-05-14 10:56:41

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [PATCH 2/6] add res_counter_uncharge_until()

2012/5/14 KAMEZAWA Hiroyuki <[email protected]>:
>> There may be some special cases with "Original-patch-from:" tags used when
>> one heavily inspire from a patch without taking much of its original code.
>>
>
>
> Is this ok ?

Yep, or even better since I plan to use my company's address now to
sign my patches:

Signed-off-by: Frederic Weisbecker <[email protected]>

Thanks!

2012-05-14 18:16:03

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH v3 1/6] memcg: fix error code in hugetlb_force_memcg_empty()

On Fri, May 11, 2012 at 06:45:18PM +0900, KAMEZAWA Hiroyuki wrote:
> - if (cgroup_task_count(cgroup) || !list_empty(&cgroup->children))
> + if (cgroup_task_count(cgroup)
> + || !list_empty(&cgroup->children)) {
> + ret = -EBUSY;
> goto out;

Why break the line? It doesn't go over 80 col.

Thanks.

--
tejun

2012-05-14 18:17:32

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 2/6] add res_counter_uncharge_until()

On Mon, May 14, 2012 at 07:32:42PM +0900, KAMEZAWA Hiroyuki wrote:
> [PATCH 2/6] memcg: add res_counter_uncharge_until()
>
> From: Frederic Weisbecker <[email protected]>
>
> At killing res_counter which is a child of other counter,
> we need to do
> res_counter_uncharge(child, xxx)
> res_counter_charge(parent, xxx)
>
> This is not atomic and wasting cpu. This patch adds
> res_counter_uncharge_until(). This function's uncharge propagates
> to ancestors until specified res_counter.
>
> res_counter_uncharge_until(child, parent, xxx)
>
> Now, ops is atomic and efficient.
>
> Changelog since v2
> - removed unnecessary lines.
> - added 'From' , this patch comes from his one.
>
> Signed-off-by: Frederic Weisbecker <[email protected]>
> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>

Acked-by: Tejun Heo <[email protected]>

Thanks.

--
tejun

2012-05-14 18:32:25

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH v3 1/6] memcg: fix error code in hugetlb_force_memcg_empty()

On Mon, May 14, 2012 at 11:15:56AM -0700, Tejun Heo wrote:
> On Fri, May 11, 2012 at 06:45:18PM +0900, KAMEZAWA Hiroyuki wrote:
> > - if (cgroup_task_count(cgroup) || !list_empty(&cgroup->children))
> > + if (cgroup_task_count(cgroup)
> > + || !list_empty(&cgroup->children)) {
> > + ret = -EBUSY;
> > goto out;
>
> Why break the line? It doesn't go over 80 col.

Ooh, it does. Sorry, my bad. But still, isn't it more usual to leave
the operator in the preceding line and align the start of the second
line with the first? ie.

if (cgroup_task_count(cgroup) ||
!list_empty(&cgroup->children)) {

Thanks.

--
tejun

2012-05-14 20:09:32

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH v3 6/6] remove __must_check for res_counter_charge_nofail()

On Fri, May 11, 2012 at 06:53:08PM +0900, KAMEZAWA Hiroyuki wrote:
> I picked this up from Costa's slub memcg series. For fixing added warning
> by patch 4.
> ==
> From: Glauber Costa <[email protected]>
> Subject: [PATCH 6/6] remove __must_check for res_counter_charge_nofail()
>
> Since we will succeed with the allocation no matter what, there
> isn't the need to use __must_check with it. It can very well
> be optional.
>
> Signed-off-by: Glauber Costa <[email protected]>
> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>

For 3-6,

Reviewed-by: Tejun Heo <[email protected]>

Thanks a lot for doing this. This doesn't solve all the failure paths,
though; e.g. what about -EINTR failures from lock contention?
pre_destroy() would probably need delay-and-retry logic with
WARN_ON_ONCE() on !-EINTR failures.

Thank you.

--
tejun
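The delay-and-retry shape suggested above could look roughly like the following userspace sketch; `pre_destroy_retry`, `call_pre_destroy` and `fake_pre_destroy` are hypothetical names for illustration, and the kernel version would sleep between attempts and use the real WARN_ON_ONCE():

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/*
 * Sketch of the suggested pre_destroy() retry loop: -EINTR (lock
 * contention) is simply retried, while any other error is unexpected
 * and reported once.  'call_pre_destroy' stands in for the real
 * callback invocation.
 */
static int pre_destroy_retry(int (*call_pre_destroy)(void *), void *arg,
			     int max_tries)
{
	int ret = -EINVAL;
	int warned = 0;
	int i;

	for (i = 0; i < max_tries; i++) {
		ret = call_pre_destroy(arg);
		if (ret == 0)
			break;
		if (ret != -EINTR && !warned) {
			/* WARN_ON_ONCE() equivalent in this sketch */
			fprintf(stderr, "unexpected pre_destroy error %d\n", ret);
			warned = 1;
		}
		/* the kernel version would sleep / back off here */
	}
	return ret;
}

/* Demo callback: fails with -EINTR while *countdown is positive. */
static int fake_pre_destroy(void *arg)
{
	int *countdown = arg;

	if (*countdown > 0) {
		(*countdown)--;
		return -EINTR;
	}
	return 0;
}
```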

2012-05-14 20:14:45

by Tejun Heo

Subject: Re: [PATCH v3 4/6] memcg: move charges to root cgroup if use_hierarchy=0.

On Fri, May 11, 2012 at 06:49:22PM +0900, KAMEZAWA Hiroyuki wrote:
> @@ -3351,9 +3339,8 @@ int mem_cgroup_move_hugetlb_parent(int idx, struct cgroup *cgroup,
>  	struct page_cgroup *pc;
>  	int csize, ret = 0;
>  	struct res_counter *fail_res;
> -	struct cgroup *pcgrp = cgroup->parent;
> -	struct mem_cgroup *parent = mem_cgroup_from_cont(pcgrp);
>  	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgroup);
> +	struct mem_cgroup *parent = parent_mem_cgroup(memcg);
>  	struct res_counter *counter;
>
>  	if (!get_page_unless_zero(page))
> @@ -3366,13 +3353,11 @@ int mem_cgroup_move_hugetlb_parent(int idx, struct cgroup *cgroup,
>
>  	csize = PAGE_SIZE << compound_order(page);
>  	/* If parent->use_hierarchy == 0, we need to charge parent */
> -	if (!parent->use_hierarchy) {
> -		ret = res_counter_charge(&parent->hugepage[idx],
> -					 csize, &fail_res);
> -		if (ret) {
> -			ret = -EBUSY;
> -			goto err_out;
> -		}
> +	if (!parent) {
> +		parent = root_mem_cgroup;
> +		/* root has no limit */
> +		res_counter_charge_nofail(&parent->hugepage[idx],
> +					  csize, &fail_res);
>  	}
>  	counter = &memcg->hugepage[idx];
>  	res_counter_uncharge_until(counter, counter->parent, csize);

This function can simply return 0 now, so no point in having int
return. Make it return void?

Also, follow-up patches to cleanup -ENOMEM failure handling in the
callers would be nice.

Thanks.

--
tejun
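The move-to-parent flow in the quoted hunk can be modelled in userspace C. This is a sketch only, not the kernel code: `move_charge_to_parent` is a hypothetical name, and the model assumes a counter's `parent` pointer is NULL exactly when use_hierarchy == 0:

```c
#include <assert.h>
#include <stddef.h>

struct res_counter {
	long usage;
	struct res_counter *parent;
};

/* Root has no limit, so a charge against it can never fail. */
static void res_counter_charge_nofail(struct res_counter *c, long val)
{
	for (; c != NULL; c = c->parent)
		c->usage += val;
}

/* Uncharge 'val' from 'counter' up to, but not including, 'until'. */
static void res_counter_uncharge_until(struct res_counter *counter,
				       struct res_counter *until, long val)
{
	for (; counter != until; counter = counter->parent)
		counter->usage -= val;
}

/*
 * Move 'val' of charge out of 'child'.  With use_hierarchy == 1 the
 * ancestors already account the charge, so dropping the child level is
 * enough; with use_hierarchy == 0 (no parent) the charge is first moved
 * to the root counter, which cannot fail.  Hence no -ENOMEM path.
 */
static void move_charge_to_parent(struct res_counter *child,
				  struct res_counter *root, long val)
{
	if (!child->parent)
		res_counter_charge_nofail(root, val);
	res_counter_uncharge_until(child, child->parent, val);
}
```

Either way the move cannot fail, which is the point of the series: rmdir() no longer sees -ENOMEM from pre_destroy().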

2012-05-15 00:04:39

by Kamezawa Hiroyuki

Subject: Re: [PATCH v3 6/6] remove __must_check for res_counter_charge_nofail()

(2012/05/15 5:09), Tejun Heo wrote:

> On Fri, May 11, 2012 at 06:53:08PM +0900, KAMEZAWA Hiroyuki wrote:
>> I picked this up from Costa's slub memcg series, to fix the warning
>> added by patch 4.
>> ==
>> From: Glauber Costa <[email protected]>
>> Subject: [PATCH 6/6] remove __must_check for res_counter_charge_nofail()
>>
>> Since we will succeed with the allocation no matter what, there
>> is no need to use __must_check with it. It can very well
>> be optional.
>>
>> Signed-off-by: Glauber Costa <[email protected]>
>> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
>
> For 3-6,
>
> Reviewed-by: Tejun Heo <[email protected]>
>
> Thanks a lot for doing this. This doesn't solve all the failure paths,
> though; e.g. what about -EINTR failures from lock contention?
> pre_destroy() would probably need delay-and-retry logic with
> WARN_ON_ONCE() on !-EINTR failures.
>


Yes, I'll do more work. I tend to split the series, sorry.

Thanks,
-Kame

2012-05-15 00:06:29

by Kamezawa Hiroyuki

Subject: Re: [PATCH v3 4/6] memcg: move charges to root cgroup if use_hierarchy=0.

(2012/05/15 5:14), Tejun Heo wrote:

> On Fri, May 11, 2012 at 06:49:22PM +0900, KAMEZAWA Hiroyuki wrote:
>> @@ -3351,9 +3339,8 @@ int mem_cgroup_move_hugetlb_parent(int idx, struct cgroup *cgroup,
>>  	struct page_cgroup *pc;
>>  	int csize, ret = 0;
>>  	struct res_counter *fail_res;
>> -	struct cgroup *pcgrp = cgroup->parent;
>> -	struct mem_cgroup *parent = mem_cgroup_from_cont(pcgrp);
>>  	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgroup);
>> +	struct mem_cgroup *parent = parent_mem_cgroup(memcg);
>>  	struct res_counter *counter;
>>
>>  	if (!get_page_unless_zero(page))
>> @@ -3366,13 +3353,11 @@ int mem_cgroup_move_hugetlb_parent(int idx, struct cgroup *cgroup,
>>
>>  	csize = PAGE_SIZE << compound_order(page);
>>  	/* If parent->use_hierarchy == 0, we need to charge parent */
>> -	if (!parent->use_hierarchy) {
>> -		ret = res_counter_charge(&parent->hugepage[idx],
>> -					 csize, &fail_res);
>> -		if (ret) {
>> -			ret = -EBUSY;
>> -			goto err_out;
>> -		}
>> +	if (!parent) {
>> +		parent = root_mem_cgroup;
>> +		/* root has no limit */
>> +		res_counter_charge_nofail(&parent->hugepage[idx],
>> +					  csize, &fail_res);
>>  	}
>>  	counter = &memcg->hugepage[idx];
>>  	res_counter_uncharge_until(counter, counter->parent, csize);
>
> This function can simply return 0 now, so no point in having int
> return. Make it return void?
>
> Also, follow-up patches to cleanup -ENOMEM failure handling in the
> callers would be nice.
>
Sure, I'll check it.


Thanks,
-Kame

2012-05-15 01:12:39

by Kamezawa Hiroyuki

Subject: Re: [PATCH v3 1/6] memcg: fix error code in hugetlb_force_memcg_empty()

(2012/05/15 3:32), Tejun Heo wrote:

> On Mon, May 14, 2012 at 11:15:56AM -0700, Tejun Heo wrote:
>> On Fri, May 11, 2012 at 06:45:18PM +0900, KAMEZAWA Hiroyuki wrote:
>>> -		if (cgroup_task_count(cgroup) || !list_empty(&cgroup->children))
>>> +		if (cgroup_task_count(cgroup)
>>> +		    || !list_empty(&cgroup->children)) {
>>> +			ret = -EBUSY;
>>> 			goto out;
>>
>> Why break the line? It doesn't go over 80 col.
>
> Ooh, it does. Sorry, my bad. But still, isn't it more usual to leave
> the operator in the preceding line and align the start of the second
> line with the first? ie.
>
> 	if (cgroup_task_count(cgroup) ||
> 	    !list_empty(&cgroup->children)) {
>


How about this?
==
From: KAMEZAWA Hiroyuki <[email protected]>
Date: Fri, 27 Apr 2012 13:19:19 +0900
Subject: [PATCH] memcg: fix error code in hugetlb_force_memcg_empty()

Changelog:
- clean up.
Reviewed-by: Aneesh Kumar K.V <[email protected]>
Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
mm/hugetlb.c | 5 ++++-
1 files changed, 4 insertions(+), 1 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 1d3c8ea9..82ec623 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1922,8 +1922,11 @@ int hugetlb_force_memcg_empty(struct cgroup *cgroup)
 	int ret = 0, idx = 0;

 	do {
-		if (cgroup_task_count(cgroup) || !list_empty(&cgroup->children))
+		if (cgroup_task_count(cgroup) ||
+		    !list_empty(&cgroup->children)) {
+			ret = -EBUSY;
 			goto out;
+		}
 		/*
 		 * If the task doing the cgroup_rmdir got a signal
 		 * we don't really need to loop till the hugetlb resource
--
1.7.4.1


2012-05-15 15:12:26

by Tejun Heo

Subject: Re: [PATCH v3 1/6] memcg: fix error code in hugetlb_force_memcg_empty()

On Tue, May 15, 2012 at 10:10:34AM +0900, KAMEZAWA Hiroyuki wrote:
> (2012/05/15 3:32), Tejun Heo wrote:
>
> > On Mon, May 14, 2012 at 11:15:56AM -0700, Tejun Heo wrote:
> >> On Fri, May 11, 2012 at 06:45:18PM +0900, KAMEZAWA Hiroyuki wrote:
> >>> -		if (cgroup_task_count(cgroup) || !list_empty(&cgroup->children))
> >>> +		if (cgroup_task_count(cgroup)
> >>> +		    || !list_empty(&cgroup->children)) {
> >>> +			ret = -EBUSY;
> >>> 			goto out;
> >>
> >> Why break the line? It doesn't go over 80 col.
> >
> > Ooh, it does. Sorry, my bad. But still, isn't it more usual to leave
> > the operator in the preceding line and align the start of the second
> > line with the first? ie.
> >
> > 	if (cgroup_task_count(cgroup) ||
> > 	    !list_empty(&cgroup->children)) {
> >
>
>
> How about this?
> ==
> From: KAMEZAWA Hiroyuki <[email protected]>
> Date: Fri, 27 Apr 2012 13:19:19 +0900
> Subject: [PATCH] memcg: fix error code in hugetlb_force_memcg_empty()
>
> Changelog:
> - clean up.
> Reviewed-by: Aneesh Kumar K.V <[email protected]>
> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>

Heh, it was a nitpick anyway. Please feel free to add my reviewed-by
for the whole series.

Thank you!

--
tejun

2012-06-21 20:20:49

by Tejun Heo

Subject: Re: [PATCH v3][0/6] memcg: prevent -ENOMEM in pre_destroy()

On Fri, May 11, 2012 at 06:41:36PM +0900, KAMEZAWA Hiroyuki wrote:
> Hi, here is v3 based on memcg-devel tree.
> git://github.com/mstsxfx/memcg-devel.git
>
> This patch series is for avoiding -ENOMEM at calling pre_destroy()
> which is called at rmdir(). After this patch, charges will be moved
> to root (if use_hierarchy==0) or parent (if use_hierarchy==1), and
> we'll not see -ENOMEM in rmdir() of cgroup.
>
> v2 included some other patches than ones for handling -ENOMEM problem,
> but I divided it. I'd like to post others in different series, later.
> No logical changes in general, maybe v3 is cleaner than v2.
>
> 0001 ....fix error code in memcg-hugetlb
> 0002 ....add res_counter_uncharge_until
> 0003 ....use res_counter_uncharge_until in memcg
> 0004 ....move charges to root if use_hierarchy==0
> 0005 ....cleanup for mem_cgroup_move_account()
> 0006 ....remove warning of res_counter_uncharge_nofail (from Costa's slub accounting series).

KAME, how is this progressing? Is it stuck on anything?

Thank you very much.

--
tejun

2012-06-21 23:29:46

by Kamezawa Hiroyuki

Subject: Re: [PATCH v3][0/6] memcg: prevent -ENOMEM in pre_destroy()

(2012/06/22 5:20), Tejun Heo wrote:
> On Fri, May 11, 2012 at 06:41:36PM +0900, KAMEZAWA Hiroyuki wrote:
>> Hi, here is v3 based on memcg-devel tree.
>> git://github.com/mstsxfx/memcg-devel.git
>>
>> This patch series is for avoiding -ENOMEM at calling pre_destroy()
>> which is called at rmdir(). After this patch, charges will be moved
>> to root (if use_hierarchy==0) or parent (if use_hierarchy==1), and
>> we'll not see -ENOMEM in rmdir() of cgroup.
>>
>> v2 included some other patches than ones for handling -ENOMEM problem,
>> but I divided it. I'd like to post others in different series, later.
>> No logical changes in general, maybe v3 is cleaner than v2.
>>
>> 0001 ....fix error code in memcg-hugetlb
>> 0002 ....add res_counter_uncharge_until
>> 0003 ....use res_counter_uncharge_until in memcg
>> 0004 ....move charges to root if use_hierarchy==0
>> 0005 ....cleanup for mem_cgroup_move_account()
>> 0006 ....remove warning of res_counter_uncharge_nofail (from Costa's slub accounting series).
>
> KAME, how is this progressing? Is it stuck on anything?
>

I think I have finished 80% of the work, and the patches are in the -mm
stack now. They'll be visible in -next soon.

The remaining 20% of the work depends on a modification to the cgroup layer.

What do you think of this patch? (It is not tested yet... so it may
have problems...) I think the callers of pre_destroy() are not so many...

==
From a28db946f91f3509d25779e8c5db249506cc4b07 Mon Sep 17 00:00:00 2001
From: KAMEZAWA Hiroyuki <[email protected]>
Date: Fri, 22 Jun 2012 08:38:38 +0900
Subject: [PATCH] cgroup: keep cgroup_mutex() while calling ->pre_destroy()

In the past, memcg's pre_destroy() was very slow because of the possibility
of page reclaim in it, so cgroup_mutex was released before calling the
pre_destroy() callbacks. Now it is fast enough: memcg just scans the list
and moves pages to another cgroup, and no memory reclaim happens.
So we can keep holding cgroup_mutex there.

By holding the lock, we can avoid the following cases:
1. a new task is attached while rmdir() runs.
2. a new child cgroup is created while rmdir() runs.
3. a task is attached to the cgroup and removed from it before
the css count is checked, so ->destroy() is called even though
the task left some charges behind.

(3 is the worst case... though I think it will not happen in the real world.)

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
kernel/cgroup.c | 3 +--
1 files changed, 1 insertions(+), 2 deletions(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index caff6a1..a5b6df1 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -4171,7 +4171,6 @@ again:
 		mutex_unlock(&cgroup_mutex);
 		return -EBUSY;
 	}
-	mutex_unlock(&cgroup_mutex);

 	/*
 	 * In general, subsystem has no css->refcnt after pre_destroy(). But
@@ -4190,11 +4189,11 @@ again:
 	 */
 	ret = cgroup_call_pre_destroy(cgrp);
 	if (ret) {
+		mutex_unlock(&cgroup_mutex);
 		clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
 		return ret;
 	}

-	mutex_lock(&cgroup_mutex);
 	parent = cgrp->parent;
 	if (atomic_read(&cgrp->count) || !list_empty(&cgrp->children)) {
 		clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
--
1.7.4.1

2012-06-27 17:58:24

by Tejun Heo

Subject: Re: [PATCH v3][0/6] memcg: prevent -ENOMEM in pre_destroy()

Hello, KAME.

On Fri, Jun 22, 2012 at 08:27:25AM +0900, Kamezawa Hiroyuki wrote:
> The remaining 20% of the work depends on a modification to the cgroup layer.
>
> What do you think of this patch? (It is not tested yet... so it may
> have problems...) I think the callers of pre_destroy() are not so many...
>
> ==
> From a28db946f91f3509d25779e8c5db249506cc4b07 Mon Sep 17 00:00:00 2001
> From: KAMEZAWA Hiroyuki <[email protected]>
> Date: Fri, 22 Jun 2012 08:38:38 +0900
> Subject: [PATCH] cgroup: keep cgroup_mutex() while calling ->pre_destroy()
>
> In the past, memcg's pre_destroy() was very slow because of the possibility
> of page reclaim in it, so cgroup_mutex was released before calling the
> pre_destroy() callbacks. Now it is fast enough: memcg just scans the list
> and moves pages to another cgroup, and no memory reclaim happens.
> So we can keep holding cgroup_mutex there.
>
> By holding the lock, we can avoid the following cases:
> 1. a new task is attached while rmdir() runs.
> 2. a new child cgroup is created while rmdir() runs.
> 3. a task is attached to the cgroup and removed from it before
> the css count is checked, so ->destroy() is called even though
> the task left some charges behind.
>
> (3 is the worst case... though I think it will not happen in the real world.)

Ooh, once memcg drops the __DEPRECATED_clear_css_refs, cgroup_rmdir()
will mark the cgroup dead before it starts calling pre_destroy(), and none
of the above will happen.

Thanks.

--
tejun

2012-06-28 08:36:17

by Kamezawa Hiroyuki

Subject: Re: [PATCH v3][0/6] memcg: prevent -ENOMEM in pre_destroy()

(2012/06/28 2:58), Tejun Heo wrote:
> Hello, KAME.
>
> On Fri, Jun 22, 2012 at 08:27:25AM +0900, Kamezawa Hiroyuki wrote:
>> The remaining 20% of the work depends on a modification to the cgroup layer.
>>
>> What do you think of this patch? (It is not tested yet... so it may
>> have problems...) I think the callers of pre_destroy() are not so many...
>>
>> ==
>> From a28db946f91f3509d25779e8c5db249506cc4b07 Mon Sep 17 00:00:00 2001
>> From: KAMEZAWA Hiroyuki <[email protected]>
>> Date: Fri, 22 Jun 2012 08:38:38 +0900
>> Subject: [PATCH] cgroup: keep cgroup_mutex() while calling ->pre_destroy()
>>
>> In the past, memcg's pre_destroy() was very slow because of the possibility
>> of page reclaim in it, so cgroup_mutex was released before calling the
>> pre_destroy() callbacks. Now it is fast enough: memcg just scans the list
>> and moves pages to another cgroup, and no memory reclaim happens.
>> So we can keep holding cgroup_mutex there.
>>
>> By holding the lock, we can avoid the following cases:
>> 1. a new task is attached while rmdir() runs.
>> 2. a new child cgroup is created while rmdir() runs.
>> 3. a task is attached to the cgroup and removed from it before
>> the css count is checked, so ->destroy() is called even though
>> the task left some charges behind.
>>
>> (3 is the worst case... though I think it will not happen in the real world.)
>
> Ooh, once memcg drops the __DEPRECATED_clear_css_refs, cgroup_rmdir()
> will mark the cgroup dead before it starts calling pre_destroy(), and none
> of the above will happen.
>

Hm, threads which touch memcg should hold a reference on the memcg rather
than on the css, right? IIUC, one reason is references from kswapd etc... hm, I'll check it.

Thanks,
-Kame

2012-06-28 16:06:29

by Tejun Heo

Subject: Re: [PATCH v3][0/6] memcg: prevent -ENOMEM in pre_destroy()

Hello, KAME.

On Thu, Jun 28, 2012 at 1:33 AM, Kamezawa Hiroyuki
<[email protected]> wrote:
>> Ooh, once memcg drops the __DEPRECATED_clear_css_refs, cgroup_rmdir()
>> will mark the cgroup dead before it starts calling pre_destroy(), and none
>> of the above will happen.
>>
>
> Hm, threads which touch memcg should hold a reference on the memcg rather
> than on the css, right?
> IIUC, one reason is references from kswapd etc... hm, I'll check it.

Not sure I'm following. I meant that css_tryget() will always fail
once the pre_destroy() calls begin for the cgroup, so no new child or
reference can be created for it after that point.

Thanks.

--
tejun
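The css_tryget() behaviour Tejun describes can be modelled with a small userspace sketch; the real implementation uses atomic operations and a refcount bias rather than a plain flag, so this only illustrates the invariant, not the kernel code:

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace model: once the group is marked dead, no new reference
 * can be taken, so rmdir() cannot race with new users or children. */
struct css_ref {
	int count;
	bool dead;
};

/* Take a reference only if the group is still alive. */
static bool css_tryget(struct css_ref *css)
{
	if (css->dead)
		return false;
	css->count++;
	return true;
}

/* Modelled as called by cgroup_rmdir() before pre_destroy() runs. */
static void css_mark_dead(struct css_ref *css)
{
	css->dead = true;
}
```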