(I'll be away from my office soon, and I won't be responsive for the next
3 days.)
I'm working on converting memcg to use cgroup->id, and then we can kill css_id.
Now memcg has its own refcnt, so when a cgroup is destroyed, the memcg can
still be alive. This patchset converts memcg to always use css_get/put, so
memcg will have the same life cycle as its corresponding cgroup, and then
it's always safe for memcg to use cgroup->id.
The historical reason that memcg didn't use css_get in some cases is that
a cgroup couldn't be removed while there were still css refs. The situation
has changed: rmdir of a cgroup now succeeds regardless of css refs, but the
cgroup won't be freed until the css refcount drops to 0.
This is an early post, and it's NOT TESTED. I just want to see if the changes
are fine in general.
By the way, after this patchset I think we no longer need to free memcg via
RCU, because the cgroup is already freed in an RCU callback.
Note this patchset is based on a few memcg fixes I sent (which haven't been
accepted yet).
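To illustrate the lifecycle rule this series relies on, here is a userspace sketch (not kernel code; the struct and helpers are simplified stand-ins for the real css machinery, with plain ints instead of kernel atomics):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Userspace model of the new rule: rmdir (offline) succeeds even while
 * css refs remain, but the object is only freed when the last css_put()
 * drops the refcount to 0.
 */
struct css_model {
	int refcnt;
	bool online;
	bool freed;
};

static void css_get(struct css_model *css)
{
	css->refcnt++;
}

static void css_put(struct css_model *css)
{
	if (--css->refcnt == 0)
		css->freed = true;	/* stands in for the RCU free */
}

static void cgroup_rmdir(struct css_model *css)
{
	css->online = false;	/* succeeds regardless of refcnt */
}
```

With this rule in place, a memcg that pins itself via css_get/css_put has exactly the same lifetime as its cgroup, which is what makes cgroup->id safe to use.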
--
kernel/cgroup.c | 10 ++++++++
mm/memcontrol.c | 129 ++++++++++++++++++++++++++++++++++++++-------------------------------------------------------
2 files changed, 62 insertions(+), 77 deletions(-)
Use css_get()/css_put() instead of mem_cgroup_get()/mem_cgroup_put().
Signed-off-by: Li Zefan <[email protected]>
---
mm/memcontrol.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 43ca91d..dafacb8 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3191,7 +3191,7 @@ void memcg_release_cache(struct kmem_cache *s)
list_del(&s->memcg_params->list);
mutex_unlock(&memcg->slab_caches_mutex);
- mem_cgroup_put(memcg);
+ css_put(&memcg->css);
out:
kfree(s->memcg_params);
}
@@ -3350,16 +3350,18 @@ static struct kmem_cache *memcg_create_kmem_cache(struct mem_cgroup *memcg,
mutex_lock(&memcg_cache_mutex);
new_cachep = cachep->memcg_params->memcg_caches[idx];
- if (new_cachep)
+ if (new_cachep) {
+ css_put(&memcg->css);
goto out;
+ }
new_cachep = kmem_cache_dup(memcg, cachep);
if (new_cachep == NULL) {
new_cachep = cachep;
+ css_put(&memcg->css);
goto out;
}
- mem_cgroup_get(memcg);
atomic_set(&new_cachep->memcg_params->nr_pages , 0);
cachep->memcg_params->memcg_caches[idx] = new_cachep;
@@ -3449,8 +3451,6 @@ static void memcg_create_cache_work_func(struct work_struct *w)
cw = container_of(w, struct create_work, work);
memcg_create_kmem_cache(cw->memcg, cw->cachep);
- /* Drop the reference gotten when we enqueued. */
- css_put(&cw->memcg->css);
kfree(cw);
}
--
1.8.0.2
Use css_get/put instead of mem_cgroup_get/put.
Signed-off-by: Li Zefan <[email protected]>
---
mm/memcontrol.c | 26 ++++++++++++++++----------
1 file changed, 16 insertions(+), 10 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 877551d..ad576e8 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4137,12 +4137,12 @@ __mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype,
unlock_page_cgroup(pc);
/*
* even after unlock, we have memcg->res.usage here and this memcg
- * will never be freed.
+ * will never be freed, so it's safe to call css_get().
*/
memcg_check_events(memcg, page);
if (do_swap_account && ctype == MEM_CGROUP_CHARGE_TYPE_SWAPOUT) {
mem_cgroup_swap_statistics(memcg, true);
- mem_cgroup_get(memcg);
+ css_get(&memcg->css);
}
/*
* Migration does not charge the res_counter for the
@@ -4242,7 +4242,7 @@ mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent, bool swapout)
/*
* record memcg information, if swapout && memcg != NULL,
- * mem_cgroup_get() was called in uncharge().
+ * css_get() was called in uncharge().
*/
if (do_swap_account && swapout && memcg)
swap_cgroup_record(ent, css_id(&memcg->css));
@@ -4273,7 +4273,7 @@ void mem_cgroup_uncharge_swap(swp_entry_t ent)
if (!mem_cgroup_is_root(memcg))
res_counter_uncharge(&memcg->memsw, PAGE_SIZE);
mem_cgroup_swap_statistics(memcg, false);
- mem_cgroup_put(memcg);
+ css_put(&memcg->css);
}
rcu_read_unlock();
}
@@ -4307,11 +4307,14 @@ static int mem_cgroup_move_swap_account(swp_entry_t entry,
* This function is only called from task migration context now.
* It postpones res_counter and refcount handling till the end
* of task migration(mem_cgroup_clear_mc()) for performance
- * improvement. But we cannot postpone mem_cgroup_get(to)
- * because if the process that has been moved to @to does
- * swap-in, the refcount of @to might be decreased to 0.
+ * improvement. But we cannot postpone css_get(to) because if
+ * the process that has been moved to @to does swap-in, the
+ * refcount of @to might be decreased to 0.
+ *
+ * We are in attach() phase, so the cgroup is guaranteed to be
+ * alive, so we can just call css_get().
*/
- mem_cgroup_get(to);
+ css_get(&to->css);
return 0;
}
return -EINVAL;
@@ -6597,6 +6600,7 @@ static void __mem_cgroup_clear_mc(void)
{
struct mem_cgroup *from = mc.from;
struct mem_cgroup *to = mc.to;
+ int i;
/* we must uncharge all the leftover precharges from mc.to */
if (mc.precharge) {
@@ -6617,7 +6621,9 @@ static void __mem_cgroup_clear_mc(void)
if (!mem_cgroup_is_root(mc.from))
res_counter_uncharge(&mc.from->memsw,
PAGE_SIZE * mc.moved_swap);
- __mem_cgroup_put(mc.from, mc.moved_swap);
+
+ for (i = 0; i < mc.moved_swap; i++)
+ css_put(&mc.from->css);
if (!mem_cgroup_is_root(mc.to)) {
/*
@@ -6627,7 +6633,7 @@ static void __mem_cgroup_clear_mc(void)
res_counter_uncharge(&mc.to->res,
PAGE_SIZE * mc.moved_swap);
}
- /* we've already done mem_cgroup_get(mc.to) */
+ /* we've already done css_get(mc.to) */
mc.moved_swap = 0;
}
memcg_oom_recover(from);
--
1.8.0.2
Suppose we rmdir a cgroup while there are still css refs; this cgroup won't
be freed. Then we rmdir the parent cgroup, and the parent is freed because
its css refcount has drained to 0. Now it would be a disaster if the child
cgroup tries to access its freed parent.
Make sure this can't happen.
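The pinning scheme can be sketched in userspace like this (hypothetical model types; dget/dput are reduced to a plain counter):

```c
#include <assert.h>
#include <stddef.h>

/*
 * Model of the fix: each child takes one ref on its parent's dentry at
 * creation and drops it when the child is freed, so the parent's dentry
 * (and hence the parent) outlives all of its children.
 */
struct cgrp_model {
	int dentry_count;		/* stands in for dentry->d_count */
	struct cgrp_model *parent;
};

static void cgroup_create_model(struct cgrp_model *cgrp,
				struct cgrp_model *parent)
{
	cgrp->dentry_count = 1;
	cgrp->parent = parent;
	if (parent)
		parent->dentry_count++;	/* dget(parent->dentry) */
}

static void cgroup_free_model(struct cgrp_model *cgrp)
{
	if (cgrp->parent)
		cgrp->parent->dentry_count--;	/* dput(parent->dentry) */
}
```

As long as a child exists, the parent's dentry count stays above the child's contribution, so the parent can never be torn down first.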
Signed-off-by: Li Zefan <[email protected]>
---
kernel/cgroup.c | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index fa54b92..78204bc 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -888,6 +888,13 @@ static void cgroup_free_fn(struct work_struct *work)
mutex_unlock(&cgroup_mutex);
/*
+ * We get a ref to the parent's dentry, and put the ref when
+ * this cgroup is being freed, so it's guaranteed that the
+ * parent won't be destroyed before its children.
+ */
+ dput(cgrp->parent->dentry);
+
+ /*
* Drop the active superblock reference that we took when we
* created the cgroup
*/
@@ -4171,6 +4178,9 @@ static long cgroup_create(struct cgroup *parent, struct dentry *dentry,
for_each_subsys(root, ss)
dget(dentry);
+ /* hold a ref to the parent's dentry */
+ dget(parent->dentry);
+
/* creation succeeded, notify subsystems */
for_each_subsys(root, ss) {
err = online_css(ss, cgrp);
--
1.8.0.2
The cgroup core guarantees it's always safe to access the parent.
Signed-off-by: Li Zefan <[email protected]>
---
mm/memcontrol.c | 14 +-------------
1 file changed, 1 insertion(+), 13 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ad576e8..45129cd 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6124,12 +6124,8 @@ static void mem_cgroup_get(struct mem_cgroup *memcg)
static void __mem_cgroup_put(struct mem_cgroup *memcg, int count)
{
- if (atomic_sub_and_test(count, &memcg->refcnt)) {
- struct mem_cgroup *parent = parent_mem_cgroup(memcg);
+ if (atomic_sub_and_test(count, &memcg->refcnt))
call_rcu(&memcg->rcu_freeing, free_rcu);
- if (parent)
- mem_cgroup_put(parent);
- }
}
static void mem_cgroup_put(struct mem_cgroup *memcg)
@@ -6229,14 +6225,6 @@ mem_cgroup_css_online(struct cgroup *cont)
res_counter_init(&memcg->res, &parent->res);
res_counter_init(&memcg->memsw, &parent->memsw);
res_counter_init(&memcg->kmem, &parent->kmem);
-
- /*
- * We increment refcnt of the parent to ensure that we can
- * safely access it on res_counter_charge/uncharge.
- * This refcnt will be decremented when freeing this
- * mem_cgroup(see mem_cgroup_put).
- */
- mem_cgroup_get(parent);
} else {
res_counter_init(&memcg->res, NULL);
res_counter_init(&memcg->memsw, NULL);
--
1.8.0.2
Use css_get/put instead of mem_cgroup_get/put.
We can't do a simple replacement, because here mem_cgroup_put()
is called during mem_cgroup_css_free(), while mem_cgroup_css_free()
won't be called until the css refcount goes down to 0.
Instead we increment the css refcount in mem_cgroup_css_offline(), and
then check whether there are still kmem charges. If not, the css refcount
is decremented right away; otherwise it is decremented when the kmem
charges go down to 0.
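The handshake above can be modeled in userspace as follows (a sketch only: a plain flag and ints stand in for the atomic mark-dead bit and the res_counter, whereas the real code uses atomic bitops and res_counter reads):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Model of the offline/uncharge handshake: offline takes a css ref and
 * marks kmem dead; whoever sees the dead flag with charges at 0 --
 * offline itself, or the last uncharge -- clears it and drops that ref.
 */
struct memcg_model {
	int css_refcnt;
	bool kmem_dead;
	long kmem_usage;
};

static bool test_and_clear_dead(struct memcg_model *m)
{
	bool was_dead = m->kmem_dead;

	m->kmem_dead = false;
	return was_dead;
}

static void kmem_css_offline(struct memcg_model *m)
{
	m->css_refcnt++;		/* css_get() before marking dead */
	m->kmem_dead = true;		/* memcg_kmem_mark_dead() */
	if (m->kmem_usage != 0)
		return;			/* the last uncharge drops the ref */
	if (test_and_clear_dead(m))
		m->css_refcnt--;	/* css_put() */
}

static void kmem_uncharge(struct memcg_model *m, long size)
{
	m->kmem_usage -= size;
	if (m->kmem_usage != 0)
		return;
	if (test_and_clear_dead(m))
		m->css_refcnt--;	/* css_put() from the last uncharge */
}
```

Because only one side can win the test-and-clear, the extra reference taken at offline is dropped exactly once, whether the charges were already 0 or drain later.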
Signed-off-by: Li Zefan <[email protected]>
---
mm/memcontrol.c | 49 ++++++++++++++++++++++++++-----------------------
1 file changed, 26 insertions(+), 23 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index dafacb8..877551d 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3004,7 +3004,7 @@ static void memcg_uncharge_kmem(struct mem_cgroup *memcg, u64 size)
return;
if (memcg_kmem_test_and_clear_dead(memcg))
- mem_cgroup_put(memcg);
+ css_put(&memcg->css);
}
void memcg_cache_list_add(struct mem_cgroup *memcg, struct kmem_cache *cachep)
@@ -5089,14 +5089,6 @@ static int memcg_update_kmem_limit(struct cgroup *cont, u64 val)
* starts accounting before all call sites are patched
*/
memcg_kmem_set_active(memcg);
-
- /*
- * kmem charges can outlive the cgroup. In the case of slab
- * pages, for instance, a page contain objects from various
- * processes, so it is unfeasible to migrate them away. We
- * need to reference count the memcg because of that.
- */
- mem_cgroup_get(memcg);
} else
ret = res_counter_set_limit(&memcg->kmem, val);
out:
@@ -5129,12 +5121,11 @@ static int memcg_propagate_kmem(struct mem_cgroup *memcg)
goto out;
/*
- * destroy(), called if we fail, will issue static_key_slow_inc() and
- * mem_cgroup_put() if kmem is enabled. We have to either call them
- * unconditionally, or clear the KMEM_ACTIVE flag. I personally find
- * this more consistent, since it always leads to the same destroy path
+ * destroy(), called if we fail, will issue static_key_slow_dec() if
+ * kmem is enabled. We have to either call them unconditionally, or
+ * clear the KMEM_ACTIVE flag. I personally find this more consistent,
+ * since it always leads to the same destroy path
*/
- mem_cgroup_get(memcg);
static_key_slow_inc(&memcg_kmem_enabled_key);
mutex_lock(&set_limit_mutex);
@@ -5823,23 +5814,33 @@ static int memcg_init_kmem(struct mem_cgroup *memcg, struct cgroup_subsys *ss)
return mem_cgroup_sockets_init(memcg, ss);
};
-static void kmem_cgroup_destroy(struct mem_cgroup *memcg)
+static void kmem_cgroup_css_offline(struct mem_cgroup *memcg)
{
- mem_cgroup_sockets_destroy(memcg);
+ /*
+ * kmem charges can outlive the cgroup. In the case of slab
+ * pages, for instance, a page contain objects from various
+ * processes, so it is unfeasible to migrate them away. We
+ * need to reference count the memcg because of that.
+ */
+ css_get(&memcg->css);
+ /*
+ * We need to call css_get() first, because memcg_uncharge_kmem()
+ * will call css_put() if it sees the memcg is dead.
+ */
memcg_kmem_mark_dead(memcg);
if (res_counter_read_u64(&memcg->kmem, RES_USAGE) != 0)
return;
/*
- * Charges already down to 0, undo mem_cgroup_get() done in the charge
- * path here, being careful not to race with memcg_uncharge_kmem: it is
- * possible that the charges went down to 0 between mark_dead and the
- * res_counter read, so in that case, we don't need the put
+ * Charges already down to 0, undo the css_get() done previously, being
+ * careful not to race with memcg_uncharge_kmem: it is possible that
+ * the charges went down to 0 between mark_dead and the res_counter
+ * read, so in that case, we don't need the put
*/
if (memcg_kmem_test_and_clear_dead(memcg))
- mem_cgroup_put(memcg);
+ css_put(&memcg->css);
}
#else
static int memcg_init_kmem(struct mem_cgroup *memcg, struct cgroup_subsys *ss)
@@ -5847,7 +5848,7 @@ static int memcg_init_kmem(struct mem_cgroup *memcg, struct cgroup_subsys *ss)
return 0;
}
-static void kmem_cgroup_destroy(struct mem_cgroup *memcg)
+static void kmem_cgroup_css_offline(struct mem_cgroup *memcg)
{
}
#endif
@@ -6274,6 +6275,8 @@ static void mem_cgroup_css_offline(struct cgroup *cont)
{
struct mem_cgroup *memcg = mem_cgroup_from_cont(cont);
+ kmem_cgroup_css_offline(memcg);
+
mem_cgroup_invalidate_reclaim_iterators(memcg);
mem_cgroup_reparent_charges(memcg);
mem_cgroup_destroy_all_caches(memcg);
@@ -6283,7 +6286,7 @@ static void mem_cgroup_css_free(struct cgroup *cont)
{
struct mem_cgroup *memcg = mem_cgroup_from_cont(cont);
- kmem_cgroup_destroy(memcg);
+ mem_cgroup_sockets_destroy(memcg);
mem_cgroup_put(memcg);
}
--
1.8.0.2
Now memcg has the same life cycle as the corresponding cgroup.
Kill the useless refcnt.
Signed-off-by: Li Zefan <[email protected]>
---
mm/memcontrol.c | 24 +-----------------------
1 file changed, 1 insertion(+), 23 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 45129cd..9714a16 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -297,8 +297,6 @@ struct mem_cgroup {
bool oom_lock;
atomic_t under_oom;
- atomic_t refcnt;
-
int swappiness;
/* OOM-Killer disable */
int oom_kill_disable;
@@ -501,9 +499,6 @@ enum res_type {
*/
static DEFINE_MUTEX(memcg_create_mutex);
-static void mem_cgroup_get(struct mem_cgroup *memcg);
-static void mem_cgroup_put(struct mem_cgroup *memcg);
-
static inline
struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *s)
{
@@ -6117,22 +6112,6 @@ static void free_rcu(struct rcu_head *rcu_head)
schedule_work(&memcg->work_freeing);
}
-static void mem_cgroup_get(struct mem_cgroup *memcg)
-{
- atomic_inc(&memcg->refcnt);
-}
-
-static void __mem_cgroup_put(struct mem_cgroup *memcg, int count)
-{
- if (atomic_sub_and_test(count, &memcg->refcnt))
- call_rcu(&memcg->rcu_freeing, free_rcu);
-}
-
-static void mem_cgroup_put(struct mem_cgroup *memcg)
-{
- __mem_cgroup_put(memcg, 1);
-}
-
/*
* Returns the parent mem_cgroup in memcgroup hierarchy with hierarchy enabled.
*/
@@ -6192,7 +6171,6 @@ mem_cgroup_css_alloc(struct cgroup *cont)
memcg->last_scanned_node = MAX_NUMNODES;
INIT_LIST_HEAD(&memcg->oom_notify);
- atomic_set(&memcg->refcnt, 1);
memcg->move_charge_at_immigrate = 0;
mutex_init(&memcg->thresholds_lock);
spin_lock_init(&memcg->move_lock);
@@ -6279,7 +6257,7 @@ static void mem_cgroup_css_free(struct cgroup *cont)
mem_cgroup_sockets_destroy(memcg);
- mem_cgroup_put(memcg);
+ call_rcu(&memcg->rcu_freeing, free_rcu);
}
#ifdef CONFIG_MMU
--
1.8.0.2
Use css_get/css_put instead of mem_cgroup_get/put.
Note, if at the same time someone is moving @current to a different
cgroup and removing the old cgroup, css_tryget() may return false,
and sock->sk_cgrp won't be initialized.
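A userspace sketch of the tryget pattern (model types only; the real code does this under rcu_read_lock()):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Model of the pattern above: css_tryget() only succeeds while the
 * cgroup is online, so a socket racing with rmdir simply ends up with
 * sk_cgrp == NULL, and release only puts a ref it actually took.
 */
struct css_model {
	int refcnt;
	bool online;
};

static bool css_tryget(struct css_model *css)
{
	if (!css->online)
		return false;
	css->refcnt++;
	return true;
}

struct sock_model {
	struct css_model *sk_cgrp;
};

static void sock_update_model(struct sock_model *sk, struct css_model *css)
{
	if (css_tryget(css))		/* may fail during rmdir */
		sk->sk_cgrp = css;
}

static void sock_release_model(struct sock_model *sk)
{
	if (sk->sk_cgrp)		/* put only if we hold a ref */
		sk->sk_cgrp->refcnt--;
}
```

Since release is guarded by sk_cgrp, a failed tryget cannot drive the refcount negative.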
Signed-off-by: Li Zefan <[email protected]>
---
mm/memcontrol.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 23d0f6e..43ca91d 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -536,15 +536,15 @@ void sock_update_memcg(struct sock *sk)
*/
if (sk->sk_cgrp) {
BUG_ON(mem_cgroup_is_root(sk->sk_cgrp->memcg));
- mem_cgroup_get(sk->sk_cgrp->memcg);
+ css_get(&sk->sk_cgrp->memcg->css);
return;
}
rcu_read_lock();
memcg = mem_cgroup_from_task(current);
cg_proto = sk->sk_prot->proto_cgroup(memcg);
- if (!mem_cgroup_is_root(memcg) && memcg_proto_active(cg_proto)) {
- mem_cgroup_get(memcg);
+ if (!mem_cgroup_is_root(memcg) &&
+ memcg_proto_active(cg_proto) && css_tryget(&memcg->css)) {
sk->sk_cgrp = cg_proto;
}
rcu_read_unlock();
@@ -558,7 +558,7 @@ void sock_release_memcg(struct sock *sk)
struct mem_cgroup *memcg;
WARN_ON(!sk->sk_cgrp->memcg);
memcg = sk->sk_cgrp->memcg;
- mem_cgroup_put(memcg);
+ css_put(&sk->sk_cgrp->memcg->css);
}
}
--
1.8.0.2
On 04/03/2013 01:11 PM, Li Zefan wrote:
> The historical reason that memcg didn't use css_get in some cases is that
> a cgroup couldn't be removed while there were still css refs. The situation
> has changed: rmdir of a cgroup now succeeds regardless of css refs, but the
> cgroup won't be freed until the css refcount drops to 0.
>
> This is an early post, and it's NOT TESTED. I just want to see if the changes
> are fine in general.
Well, from my PoV, this will in general greatly simplify memcg lifecycle
management.
Indeed, I can remember no other reason for those complications other
than the problem with removing directories. If cgroup no longer has
this limitation, I would much rather see your approach go in.
I will look into the patches individually soon.
On 04/03/2013 01:11 PM, Li Zefan wrote:
> Use css_get/css_put instead of mem_cgroup_get/put.
>
> Note, if at the same time someone is moving @current to a different
> cgroup and removing the old cgroup, css_tryget() may return false,
> and sock->sk_cgrp won't be initialized.
>
> Signed-off-by: Li Zefan <[email protected]>
> ---
> mm/memcontrol.c | 8 ++++----
> 1 file changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 23d0f6e..43ca91d 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -536,15 +536,15 @@ void sock_update_memcg(struct sock *sk)
> */
> if (sk->sk_cgrp) {
> BUG_ON(mem_cgroup_is_root(sk->sk_cgrp->memcg));
> - mem_cgroup_get(sk->sk_cgrp->memcg);
> + css_get(&sk->sk_cgrp->memcg->css);
> return;
> }
>
> rcu_read_lock();
> memcg = mem_cgroup_from_task(current);
> cg_proto = sk->sk_prot->proto_cgroup(memcg);
> - if (!mem_cgroup_is_root(memcg) && memcg_proto_active(cg_proto)) {
> - mem_cgroup_get(memcg);
> + if (!mem_cgroup_is_root(memcg) &&
> + memcg_proto_active(cg_proto) && css_tryget(&memcg->css)) {
> sk->sk_cgrp = cg_proto;
> }
What happens if this tryget fails? Won't we leak a reference here? We
will put regardless when the socket is released, and this may go
negative. No?
On 04/03/2013 01:12 PM, Li Zefan wrote:
> Use css_get()/css_put() instead of mem_cgroup_get()/mem_cgroup_put().
>
> Signed-off-by: Li Zefan <[email protected]>
> ---
> mm/memcontrol.c | 10 +++++-----
> 1 file changed, 5 insertions(+), 5 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 43ca91d..dafacb8 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -3191,7 +3191,7 @@ void memcg_release_cache(struct kmem_cache *s)
> list_del(&s->memcg_params->list);
> mutex_unlock(&memcg->slab_caches_mutex);
>
> - mem_cgroup_put(memcg);
> + css_put(&memcg->css);
> out:
> kfree(s->memcg_params);
> }
> @@ -3350,16 +3350,18 @@ static struct kmem_cache *memcg_create_kmem_cache(struct mem_cgroup *memcg,
>
> mutex_lock(&memcg_cache_mutex);
> new_cachep = cachep->memcg_params->memcg_caches[idx];
> - if (new_cachep)
> + if (new_cachep) {
> + css_put(&memcg->css);
> goto out;
> + }
>
> new_cachep = kmem_cache_dup(memcg, cachep);
> if (new_cachep == NULL) {
> new_cachep = cachep;
> + css_put(&memcg->css);
> goto out;
> }
>
> - mem_cgroup_get(memcg);
> atomic_set(&new_cachep->memcg_params->nr_pages , 0);
>
> cachep->memcg_params->memcg_caches[idx] = new_cachep;
> @@ -3449,8 +3451,6 @@ static void memcg_create_cache_work_func(struct work_struct *w)
>
> cw = container_of(w, struct create_work, work);
> memcg_create_kmem_cache(cw->memcg, cw->cachep);
> - /* Drop the reference gotten when we enqueued. */
> - css_put(&cw->memcg->css);
> kfree(cw);
> }
>
>
At first look, this one seems all right.
On Wed 03-04-13 16:58:48, Glauber Costa wrote:
> On 04/03/2013 01:11 PM, Li Zefan wrote:
> > Use css_get/css_put instead of mem_cgroup_get/put.
> >
> > Note, if at the same time someone is moving @current to a different
> > cgroup and removing the old cgroup, css_tryget() may return false,
> > and sock->sk_cgrp won't be initialized.
> >
> > Signed-off-by: Li Zefan <[email protected]>
> > ---
> > mm/memcontrol.c | 8 ++++----
> > 1 file changed, 4 insertions(+), 4 deletions(-)
> >
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 23d0f6e..43ca91d 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -536,15 +536,15 @@ void sock_update_memcg(struct sock *sk)
> > */
> > if (sk->sk_cgrp) {
> > BUG_ON(mem_cgroup_is_root(sk->sk_cgrp->memcg));
> > - mem_cgroup_get(sk->sk_cgrp->memcg);
> > + css_get(&sk->sk_cgrp->memcg->css);
I am not sure I understand this one. So we have a group here (which means
that somebody already took a reference on it, right?) and we are taking
another reference. If this is released by sock_release_memcg then who
releases the previous one? It is not directly related to the patch
because this has been done previously already. Could you clarify,
Glauber, please?
> > return;
> > }
> >
> > rcu_read_lock();
> > memcg = mem_cgroup_from_task(current);
> > cg_proto = sk->sk_prot->proto_cgroup(memcg);
> > - if (!mem_cgroup_is_root(memcg) && memcg_proto_active(cg_proto)) {
> > - mem_cgroup_get(memcg);
> > + if (!mem_cgroup_is_root(memcg) &&
> > + memcg_proto_active(cg_proto) && css_tryget(&memcg->css)) {
> > sk->sk_cgrp = cg_proto;
> > }
>
> What happens if this tryget fails ? Won't we leak a reference here? We
> will put regardless when the socket is released, and this may go
> negative. No?
AFAICS sock_release_memcg releases the reference only if sk->sk_cgrp is
set, and that one wouldn't be set if css_tryget failed.
--
Michal Hocko
SUSE Labs
On Wed 03-04-13 17:12:21, Li Zefan wrote:
> Use css_get()/css_put() instead of mem_cgroup_get()/mem_cgroup_put().
>
> Signed-off-by: Li Zefan <[email protected]>
> ---
> mm/memcontrol.c | 10 +++++-----
> 1 file changed, 5 insertions(+), 5 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 43ca91d..dafacb8 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -3191,7 +3191,7 @@ void memcg_release_cache(struct kmem_cache *s)
> list_del(&s->memcg_params->list);
> mutex_unlock(&memcg->slab_caches_mutex);
>
> - mem_cgroup_put(memcg);
> + css_put(&memcg->css);
> out:
> kfree(s->memcg_params);
> }
> @@ -3350,16 +3350,18 @@ static struct kmem_cache *memcg_create_kmem_cache(struct mem_cgroup *memcg,
>
> mutex_lock(&memcg_cache_mutex);
> new_cachep = cachep->memcg_params->memcg_caches[idx];
> - if (new_cachep)
> + if (new_cachep) {
> + css_put(&memcg->css);
> goto out;
> + }
>
> new_cachep = kmem_cache_dup(memcg, cachep);
> if (new_cachep == NULL) {
> new_cachep = cachep;
> + css_put(&memcg->css);
> goto out;
> }
>
> - mem_cgroup_get(memcg);
> atomic_set(&new_cachep->memcg_params->nr_pages , 0);
>
> cachep->memcg_params->memcg_caches[idx] = new_cachep;
> @@ -3449,8 +3451,6 @@ static void memcg_create_cache_work_func(struct work_struct *w)
>
> cw = container_of(w, struct create_work, work);
> memcg_create_kmem_cache(cw->memcg, cw->cachep);
> - /* Drop the reference gotten when we enqueued. */
> - css_put(&cw->memcg->css);
> kfree(cw);
> }
You are putting references but I do not see a single css_{try}get
here. /me puzzled.
--
Michal Hocko
SUSE Labs
On Wed, Apr 03, 2013 at 05:11:15PM +0800, Li Zefan wrote:
> (I'll be away from my office soon, and I won't be responsive for the next
> 3 days.)
>
> I'm working on converting memcg to use cgroup->id, and then we can kill css_id.
>
> Now memcg has its own refcnt, so when a cgroup is destroyed, the memcg can
> still be alive. This patchset converts memcg to always use css_get/put, so
> memcg will have the same life cycle as its corresponding cgroup, and then
> it's always safe for memcg to use cgroup->id.
>
> The historical reason that memcg didn't use css_get in some cases is that
> a cgroup couldn't be removed while there were still css refs. The situation
> has changed: rmdir of a cgroup now succeeds regardless of css refs, but the
> cgroup won't be freed until the css refcount drops to 0.
Hallelujah. This one is egregious if you take a step back and look at
what happened as a whole. cgroup core had this weird object life-time
management rule where it forced draining of all css refs on cgroup
destruction, which is very unusual and puts unnecessary restrictions
on css object usages in controllers.
As the restriction wasn't too nice, memcg went ahead and created its
own object, which basically is the same as the css but has a different
life-time rule, and did refcounting for both objects. Bah....
So, yeah, let's please get rid of this abomination. It shouldn't have
existed from the beginning.
Thanks a lot for doing this!
--
tejun
On Wed 03-04-13 17:12:36, Li Zefan wrote:
> Use css_get/put instead of mem_cgroup_get/put.
>
> We can't do a simple replacement, because here mem_cgroup_put()
> is called during mem_cgroup_css_free(), while mem_cgroup_css_free()
> won't be called until the css refcount goes down to 0.
>
> Instead we increment the css refcount in mem_cgroup_css_offline(), and
> then check whether there are still kmem charges. If not, the css refcount
> is decremented right away; otherwise it is decremented when the kmem
> charges go down to 0.
Yes, the patch makes sense and I actually like it. We just have to be
sure to document the reference counting as much as possible (see
below). I also think that we need a memory barrier in the offline code.
> Signed-off-by: Li Zefan <[email protected]>
> ---
> mm/memcontrol.c | 49 ++++++++++++++++++++++++++-----------------------
> 1 file changed, 26 insertions(+), 23 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index dafacb8..877551d 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -3004,7 +3004,7 @@ static void memcg_uncharge_kmem(struct mem_cgroup *memcg, u64 size)
> return;
>
/*
* releases a reference taken in kmem_cgroup_css_offline in case
* this last uncharge is racing with the offlining code or it is
* outliving the memcg existence.
* The memory barrier imposed by test&clear is paired with the
* explicit one in kmem_cgroup_css_offline.
*/
> if (memcg_kmem_test_and_clear_dead(memcg))
> - mem_cgroup_put(memcg);
> + css_put(&memcg->css);
> }
>
> void memcg_cache_list_add(struct mem_cgroup *memcg, struct kmem_cache *cachep)
> @@ -5089,14 +5089,6 @@ static int memcg_update_kmem_limit(struct cgroup *cont, u64 val)
> * starts accounting before all call sites are patched
> */
> memcg_kmem_set_active(memcg);
> -
> - /*
> - * kmem charges can outlive the cgroup. In the case of slab
> - * pages, for instance, a page contain objects from various
> - * processes, so it is unfeasible to migrate them away. We
> - * need to reference count the memcg because of that.
> - */
> - mem_cgroup_get(memcg);
> } else
> ret = res_counter_set_limit(&memcg->kmem, val);
> out:
> @@ -5129,12 +5121,11 @@ static int memcg_propagate_kmem(struct mem_cgroup *memcg)
> goto out;
>
> /*
> - * destroy(), called if we fail, will issue static_key_slow_inc() and
> - * mem_cgroup_put() if kmem is enabled. We have to either call them
> - * unconditionally, or clear the KMEM_ACTIVE flag. I personally find
> - * this more consistent, since it always leads to the same destroy path
> + * destroy(), called if we fail, will issue static_key_slow_dec() if
> + * kmem is enabled. We have to either call them unconditionally, or
> + * clear the KMEM_ACTIVE flag. I personally find this more consistent,
> + * since it always leads to the same destroy path
> */
Can we make this comment more clear while you are changing it, please? E.g.
/*
* __mem_cgroup_free will issue static_key_slow_dec because this
* memcg is active already. If the later initialization fails
* then the cgroup core triggers the cleanup so we do not have
* to do it here.
*/
> - mem_cgroup_get(memcg);
> static_key_slow_inc(&memcg_kmem_enabled_key);
>
> mutex_lock(&set_limit_mutex);
> @@ -5823,23 +5814,33 @@ static int memcg_init_kmem(struct mem_cgroup *memcg, struct cgroup_subsys *ss)
> return mem_cgroup_sockets_init(memcg, ss);
> };
>
> -static void kmem_cgroup_destroy(struct mem_cgroup *memcg)
> +static void kmem_cgroup_css_offline(struct mem_cgroup *memcg)
> {
> - mem_cgroup_sockets_destroy(memcg);
> + /*
> + * kmem charges can outlive the cgroup. In the case of slab
> + * pages, for instance, a page contain objects from various
> + * processes, so it is unfeasible to migrate them away. We
> + * need to reference count the memcg because of that.
> + */
I would prefer if we could merge all three comments in this function
into a single one. What about something like the following?
/*
* kmem charges can outlive the cgroup. In the case of slab
* pages, for instance, a page can contain objects from various
* processes. As we avoid taking a reference for every such
* allocation we have to be careful when doing uncharge
* (see memcg_uncharge_kmem) and here during offlining.
* The idea is that only the _last_ uncharge which sees
* the dead memcg will drop the last reference. An additional
* reference is taken here before the group is marked dead
* which is then paired with css_put during uncharge resp. here.
* Although this might sound strange as this path is called when
* the reference has already dropped down to 0 and shouldn't be
* incremented anymore (css_tryget would fail) we do not have
* other options because of the kmem allocations lifetime.
*/
> + css_get(&memcg->css);
I think that you need a write memory barrier here because neither
css_get nor memcg_kmem_mark_dead implies one. memcg_uncharge_kmem uses
memcg_kmem_test_and_clear_dead, which implies a full memory barrier, so
it should see the elevated reference count. No?
> + /*
> + * We need to call css_get() first, because memcg_uncharge_kmem()
> + * will call css_put() if it sees the memcg is dead.
> + */
> memcg_kmem_mark_dead(memcg);
>
> if (res_counter_read_u64(&memcg->kmem, RES_USAGE) != 0)
> return;
>
> /*
> - * Charges already down to 0, undo mem_cgroup_get() done in the charge
> - * path here, being careful not to race with memcg_uncharge_kmem: it is
> - * possible that the charges went down to 0 between mark_dead and the
> - * res_counter read, so in that case, we don't need the put
> + * Charges already down to 0, undo the css_get() done previously, being
> + * careful not to race with memcg_uncharge_kmem: it is possible that
> + * the charges went down to 0 between mark_dead and the res_counter
> + * read, so in that case, we don't need the put
> */
> if (memcg_kmem_test_and_clear_dead(memcg))
> - mem_cgroup_put(memcg);
> + css_put(&memcg->css);
> }
> #else
> static int memcg_init_kmem(struct mem_cgroup *memcg, struct cgroup_subsys *ss)
> @@ -5847,7 +5848,7 @@ static int memcg_init_kmem(struct mem_cgroup *memcg, struct cgroup_subsys *ss)
> return 0;
> }
>
> -static void kmem_cgroup_destroy(struct mem_cgroup *memcg)
> +static void kmem_cgroup_css_offline(struct mem_cgroup *memcg)
> {
> }
> #endif
> @@ -6274,6 +6275,8 @@ static void mem_cgroup_css_offline(struct cgroup *cont)
> {
> struct mem_cgroup *memcg = mem_cgroup_from_cont(cont);
>
> + kmem_cgroup_css_offline(memcg);
> +
> mem_cgroup_invalidate_reclaim_iterators(memcg);
> mem_cgroup_reparent_charges(memcg);
> mem_cgroup_destroy_all_caches(memcg);
> @@ -6283,7 +6286,7 @@ static void mem_cgroup_css_free(struct cgroup *cont)
> {
> struct mem_cgroup *memcg = mem_cgroup_from_cont(cont);
>
> - kmem_cgroup_destroy(memcg);
> + mem_cgroup_sockets_destroy(memcg);
>
> mem_cgroup_put(memcg);
> }
> --
> 1.8.0.2
--
Michal Hocko
SUSE Labs
On Wed 03-04-13 17:12:54, Li Zefan wrote:
> Use css_get/put instead of mem_cgroup_get/put.
>
> Signed-off-by: Li Zefan <[email protected]>
Looks good to me. Swapped-in pages still use css_tryget, so they would
fall back to recharge in __mem_cgroup_try_charge_swapin if the group was
removed.
Acked-by: Michal Hocko <[email protected]>
Thanks!
> ---
> mm/memcontrol.c | 26 ++++++++++++++++----------
> 1 file changed, 16 insertions(+), 10 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 877551d..ad576e8 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -4137,12 +4137,12 @@ __mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype,
> unlock_page_cgroup(pc);
> /*
> * even after unlock, we have memcg->res.usage here and this memcg
> - * will never be freed.
> + * will never be freed, so it's safe to call css_get().
> */
> memcg_check_events(memcg, page);
> if (do_swap_account && ctype == MEM_CGROUP_CHARGE_TYPE_SWAPOUT) {
> mem_cgroup_swap_statistics(memcg, true);
> - mem_cgroup_get(memcg);
> + css_get(&memcg->css);
> }
> /*
> * Migration does not charge the res_counter for the
> @@ -4242,7 +4242,7 @@ mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent, bool swapout)
>
> /*
> * record memcg information, if swapout && memcg != NULL,
> - * mem_cgroup_get() was called in uncharge().
> + * css_get() was called in uncharge().
> */
> if (do_swap_account && swapout && memcg)
> swap_cgroup_record(ent, css_id(&memcg->css));
> @@ -4273,7 +4273,7 @@ void mem_cgroup_uncharge_swap(swp_entry_t ent)
> if (!mem_cgroup_is_root(memcg))
> res_counter_uncharge(&memcg->memsw, PAGE_SIZE);
> mem_cgroup_swap_statistics(memcg, false);
> - mem_cgroup_put(memcg);
> + css_put(&memcg->css);
> }
> rcu_read_unlock();
> }
> @@ -4307,11 +4307,14 @@ static int mem_cgroup_move_swap_account(swp_entry_t entry,
> * This function is only called from task migration context now.
> * It postpones res_counter and refcount handling till the end
> * of task migration(mem_cgroup_clear_mc()) for performance
> - * improvement. But we cannot postpone mem_cgroup_get(to)
> - * because if the process that has been moved to @to does
> - * swap-in, the refcount of @to might be decreased to 0.
> + * improvement. But we cannot postpone css_get(to) because if
> + * the process that has been moved to @to does swap-in, the
> + * refcount of @to might be decreased to 0.
> + *
> + * We are in attach() phase, so the cgroup is guaranteed to be
> + * alive, so we can just call css_get().
> */
> - mem_cgroup_get(to);
> + css_get(&to->css);
> return 0;
> }
> return -EINVAL;
> @@ -6597,6 +6600,7 @@ static void __mem_cgroup_clear_mc(void)
> {
> struct mem_cgroup *from = mc.from;
> struct mem_cgroup *to = mc.to;
> + int i;
>
> /* we must uncharge all the leftover precharges from mc.to */
> if (mc.precharge) {
> @@ -6617,7 +6621,9 @@ static void __mem_cgroup_clear_mc(void)
> if (!mem_cgroup_is_root(mc.from))
> res_counter_uncharge(&mc.from->memsw,
> PAGE_SIZE * mc.moved_swap);
> - __mem_cgroup_put(mc.from, mc.moved_swap);
> +
> + for (i = 0; i < mc.moved_swap; i++)
> + css_put(&mc.from->css);
>
> if (!mem_cgroup_is_root(mc.to)) {
> /*
> @@ -6627,7 +6633,7 @@ static void __mem_cgroup_clear_mc(void)
> res_counter_uncharge(&mc.to->res,
> PAGE_SIZE * mc.moved_swap);
> }
> - /* we've already done mem_cgroup_get(mc.to) */
> + /* we've already done css_get(mc.to) */
> mc.moved_swap = 0;
> }
> memcg_oom_recover(from);
> --
> 1.8.0.2
--
Michal Hocko
SUSE Labs
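[To make the per-page scheme in the patch above concrete: each swapped-out page takes one css reference, and __mem_cgroup_clear_mc() drops one reference per moved swap entry instead of the old batched __mem_cgroup_put(n). A minimal userspace model, with illustrative names that are not the kernel's types:]

```c
/* Userspace sketch of the css refcounting this patch switches to:
 * one css_get() per swapped-out page, and a loop of css_put() calls
 * when the pending moved_swap count is cleared. All names here are
 * illustrative stand-ins, not the kernel's. */
#include <stdatomic.h>

struct css_model {
	atomic_int refcnt;
};

static void css_get_model(struct css_model *css)
{
	atomic_fetch_add(&css->refcnt, 1);
}

/* Returns 1 when the last reference is dropped. */
static int css_put_model(struct css_model *css)
{
	return atomic_fetch_sub(&css->refcnt, 1) == 1;
}

/* Mirror of the loop in __mem_cgroup_clear_mc(): drop one reference
 * per moved swap entry rather than subtracting the whole batch. */
static void clear_moved_swap(struct css_model *from, int moved_swap)
{
	for (int i = 0; i < moved_swap; i++)
		css_put_model(from);
}
```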
On Wed 03-04-13 17:13:08, Li Zefan wrote:
> Suppose we rmdir a cgroup and there're still css refs, this cgroup won't
> be freed. Then we rmdir the parent cgroup, and the parent is freed due
> to css ref draining to 0. Now it would be a disaster if the child cgroup
> tries to access its parent.
Hmm, I am not sure which layer is the correct one to handle this - cgroup
core or memcg. But we have already enforced that in mem_cgroup_css_online,
where we take an additional reference to the memcg.
Handling it in the memcg code would have the advantage of limiting the
additional reference to use_hierarchy cases only, which is sufficient
as we never touch the parent otherwise (parent_mem_cgroup).
So I think this patch should just take a css reference to the parent
during online and drop it from mem_cgroup_css_free.
If there are more controllers that need this, then it should be handled by
the cgroup core of course.
> Make sure this won't happen.
>
> Signed-off-by: Li Zefan <[email protected]>
> ---
> kernel/cgroup.c | 10 ++++++++++
> 1 file changed, 10 insertions(+)
>
> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index fa54b92..78204bc 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -888,6 +888,13 @@ static void cgroup_free_fn(struct work_struct *work)
> mutex_unlock(&cgroup_mutex);
>
> /*
> + * We get a ref to the parent's dentry, and put the ref when
> + * this cgroup is being freed, so it's guaranteed that the
> + * parent won't be destroyed before its children.
> + */
> + dput(cgrp->parent->dentry);
> +
> + /*
> * Drop the active superblock reference that we took when we
> * created the cgroup
> */
> @@ -4171,6 +4178,9 @@ static long cgroup_create(struct cgroup *parent, struct dentry *dentry,
> for_each_subsys(root, ss)
> dget(dentry);
>
> + /* hold a ref to the parent's dentry */
> + dget(parent->dentry);
> +
> /* creation succeeded, notify subsystems */
> for_each_subsys(root, ss) {
> err = online_css(ss, cgrp);
> --
> 1.8.0.2
--
Michal Hocko
SUSE Labs
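[The invariant this patch enforces - a child pins its parent's dentry, so a parent's refcount cannot drop to zero while a child still exists - can be sketched as a toy userspace model. The struct and helper names below are illustrative, not the kernel's:]

```c
/* Toy model of the parent-pinning rule: node_create() mirrors the
 * dget(parent->dentry) in cgroup_create(), node_free() mirrors the
 * dput(cgrp->parent->dentry) in cgroup_free_fn(). Illustrative only. */
#include <stddef.h>

struct node_model {
	int refcnt;                 /* stands in for the dentry refcount */
	struct node_model *parent;
};

static void node_create(struct node_model *n, struct node_model *parent)
{
	n->refcnt = 1;
	n->parent = parent;
	if (parent)
		parent->refcnt++;   /* child pins the parent */
}

/* Drop this node's own reference (and its hold on the parent).
 * Returns 1 if the node could actually be freed. */
static int node_free(struct node_model *n)
{
	if (n->parent)
		n->parent->refcnt--;
	return --n->refcnt == 0;
}
```

With this, "rmdir parent, then rmdir child" is safe: the parent's count only reaches zero after the last child is freed.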
On Wed 03-04-13 17:11:15, Li Zefan wrote:
> (I'll be off from my office soon, and I won't be responsive in the following
> 3 days.)
>
> I'm working on converting memcg to use cgroup->id, and then we can kill css_id.
>
> Now memcg has its own refcnt, so when a cgroup is destroyed, the memcg can
> still be alive. This patchset converts memcg to always use css_get/put, so
> memcg will have the same life cycle as its corresponding cgroup, and then
> it's always safe for memcg to use cgroup->id.
>
> The historical reason that memcg didn't use css_get in some cases, is that
> cgroup couldn't be removed if there're still css refs. The situation has
> changed so that rmdir a cgroup will succeed regardless css refs, but won't
> be freed until css refs goes down to 0.
>
> This is an early post, and it's NOT TESTED. I just want to see if the changes
> are fine in general.
yes, I like the approach and it looks correct as well (some minor things
mentioned in the patches). Thanks a lot, Li! This will make our lives much
easier. The separate ref counting was a PITA, especially after the
introduction of kmem accounting, which made its usage even trickier.
> btw, after this patchset I think we don't need to free memcg via RCU, because
> cgroup is already freed in RCU callback.
But this depends on changes waiting in the for-3.10 branch, right?
Anyway, I think we should be safe with the workqueue-based releasing as
well once mem_cgroup_{get,put} are gone, right?
> Note this patchset is based on a few memcg fixes I sent (but hasn't been
> accepted)
>
> --
> kernel/cgroup.c | 10 ++++++++
> mm/memcontrol.c | 129 ++++++++++++++++++++++++++++++++++++++-------------------------------------------------------
> 2 files changed, 62 insertions(+), 77 deletions(-)
--
Michal Hocko
SUSE Labs
Hey,
On Thu, Apr 04, 2013 at 01:37:50PM +0200, Michal Hocko wrote:
> On Wed 03-04-13 17:13:08, Li Zefan wrote:
> > Suppose we rmdir a cgroup and there're still css refs, this cgroup won't
> > be freed. Then we rmdir the parent cgroup, and the parent is freed due
> > to css ref draining to 0. Now it would be a disaster if the child cgroup
> > tries to access its parent.
>
> Hmm, I am not sure what is the correct layer for this to handle - cgroup
> core or memcg. But we have enforced that in mem_cgroup_css_online where
> we take an additional reference to the memcg.
>
> Handling it in the memcg code would have an advantage of limiting an
> additional reference only to use_hierarchy cases which is sufficient
> as we never touch the parent otherwise (parent_mem_cgroup).
But what harm does an additional reference do? And given that there
are cgroup core interfaces which access ->parent, I think it'd be a
good idea for the parent to always exist while there are children.
Thanks.
--
tejun
On Thu 04-04-13 06:53:53, Tejun Heo wrote:
> Hey,
>
> On Thu, Apr 04, 2013 at 01:37:50PM +0200, Michal Hocko wrote:
> > On Wed 03-04-13 17:13:08, Li Zefan wrote:
> > > Suppose we rmdir a cgroup and there're still css refs, this cgroup won't
> > > be freed. Then we rmdir the parent cgroup, and the parent is freed due
> > > to css ref draining to 0. Now it would be a disaster if the child cgroup
> > > tries to access its parent.
> >
> > Hmm, I am not sure what is the correct layer for this to handle - cgroup
> > core or memcg. But we have enforced that in mem_cgroup_css_online where
> > we take an additional reference to the memcg.
> >
> > Handling it in the memcg code would have an advantage of limiting an
> > additional reference only to use_hierarchy cases which is sufficient
> > as we never touch the parent otherwise (parent_mem_cgroup).
>
> But what harm does an additional reference do?
No harm at all. I just wanted to be sure that this is not yet another
"for memcg" hack. So if this is useful for other controllers then I have
no objections of course.
> And given that there are cgroup core interfaces which access ->parent,
> I think it'd be a good idea that parent always exists while there are
> children.
>
> Thanks.
--
Michal Hocko
SUSE Labs
On Thu, Apr 04, 2013 at 05:20:28PM +0200, Michal Hocko wrote:
> > But what harm does an additional reference do?
>
> No harm at all. I just wanted to be sure that this is not yet another
> "for memcg" hack. So if this is useful for other controllers then I have
> no objections of course.
I think it makes sense in general, so let's do it in cgroup core. I
suppose it'd be easier for this to be routed together with other memcg
changes?
Thanks.
--
tejun
On Thu 04-04-13 08:22:13, Tejun Heo wrote:
> On Thu, Apr 04, 2013 at 05:20:28PM +0200, Michal Hocko wrote:
> > > But what harm does an additional reference do?
> >
> > No harm at all. I just wanted to be sure that this is not yet another
> > "for memcg" hack. So if this is useful for other controllers then I have
> > no objections of course.
>
> I think it makes sense in general, so let's do it in cgroup core.
Ok
> I suppose it'd be easier for this to be routed together with other
> memcg changes?
Probably yes.
> Thanks.
>
> --
> tejun
--
Michal Hocko
SUSE Labs
On Wed 03-04-13 17:13:08, Li Zefan wrote:
> Suppose we rmdir a cgroup and there're still css refs, this cgroup won't
> be freed. Then we rmdir the parent cgroup, and the parent is freed due
> to css ref draining to 0. Now it would be a disaster if the child cgroup
> tries to access its parent.
>
> Make sure this won't happen.
>
> Signed-off-by: Li Zefan <[email protected]>
OK, after the clarification from Tejun that this is helpful for other
controllers as well, I think it is better than memcg-specific handling.
Reviewed-by: Michal Hocko <[email protected]>
> ---
> kernel/cgroup.c | 10 ++++++++++
> 1 file changed, 10 insertions(+)
>
> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index fa54b92..78204bc 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -888,6 +888,13 @@ static void cgroup_free_fn(struct work_struct *work)
> mutex_unlock(&cgroup_mutex);
>
> /*
> + * We get a ref to the parent's dentry, and put the ref when
> + * this cgroup is being freed, so it's guaranteed that the
> + * parent won't be destroyed before its children.
> + */
> + dput(cgrp->parent->dentry);
> +
> + /*
> * Drop the active superblock reference that we took when we
> * created the cgroup
> */
> @@ -4171,6 +4178,9 @@ static long cgroup_create(struct cgroup *parent, struct dentry *dentry,
> for_each_subsys(root, ss)
> dget(dentry);
>
> + /* hold a ref to the parent's dentry */
> + dget(parent->dentry);
> +
> /* creation succeeded, notify subsystems */
> for_each_subsys(root, ss) {
> err = online_css(ss, cgrp);
> --
> 1.8.0.2
--
Michal Hocko
SUSE Labs
On Wed 03-04-13 17:13:21, Li Zefan wrote:
> The cgroup core guarantees it's always safe to access the parent.
>
> Signed-off-by: Li Zefan <[email protected]>
I would just prefer if you put a comment in the place where we used to
take a reference to the parent, so that it is more obvious in the code.
See below.
Acked-by: Michal Hocko <[email protected]>
> ---
> mm/memcontrol.c | 14 +-------------
> 1 file changed, 1 insertion(+), 13 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index ad576e8..45129cd 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -6124,12 +6124,8 @@ static void mem_cgroup_get(struct mem_cgroup *memcg)
>
> static void __mem_cgroup_put(struct mem_cgroup *memcg, int count)
> {
> - if (atomic_sub_and_test(count, &memcg->refcnt)) {
> - struct mem_cgroup *parent = parent_mem_cgroup(memcg);
> + if (atomic_sub_and_test(count, &memcg->refcnt))
> call_rcu(&memcg->rcu_freeing, free_rcu);
> - if (parent)
> - mem_cgroup_put(parent);
> - }
> }
>
> static void mem_cgroup_put(struct mem_cgroup *memcg)
> @@ -6229,14 +6225,6 @@ mem_cgroup_css_online(struct cgroup *cont)
> res_counter_init(&memcg->res, &parent->res);
> res_counter_init(&memcg->memsw, &parent->memsw);
> res_counter_init(&memcg->kmem, &parent->kmem);
> -
> - /*
> - * We increment refcnt of the parent to ensure that we can
> - * safely access it on res_counter_charge/uncharge.
> - * This refcnt will be decremented when freeing this
> - * mem_cgroup(see mem_cgroup_put).
> - */
> - mem_cgroup_get(parent);
Maybe a little note would be nice:
/*
 * No need to take a reference to the parent because the cgroup
 * core guarantees its existence.
 */
> } else {
> res_counter_init(&memcg->res, NULL);
> res_counter_init(&memcg->memsw, NULL);
> --
> 1.8.0.2
>
--
Michal Hocko
SUSE Labs
On Wed 03-04-13 17:14:11, Li Zefan wrote:
> Now memcg has the same life cycle as the corresponding cgroup.
> Kill the useless refcnt.
>
> Signed-off-by: Li Zefan <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Thanks!
> ---
> mm/memcontrol.c | 24 +-----------------------
> 1 file changed, 1 insertion(+), 23 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 45129cd..9714a16 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -297,8 +297,6 @@ struct mem_cgroup {
> bool oom_lock;
> atomic_t under_oom;
>
> - atomic_t refcnt;
> -
> int swappiness;
> /* OOM-Killer disable */
> int oom_kill_disable;
> @@ -501,9 +499,6 @@ enum res_type {
> */
> static DEFINE_MUTEX(memcg_create_mutex);
>
> -static void mem_cgroup_get(struct mem_cgroup *memcg);
> -static void mem_cgroup_put(struct mem_cgroup *memcg);
> -
> static inline
> struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *s)
> {
> @@ -6117,22 +6112,6 @@ static void free_rcu(struct rcu_head *rcu_head)
> schedule_work(&memcg->work_freeing);
> }
>
> -static void mem_cgroup_get(struct mem_cgroup *memcg)
> -{
> - atomic_inc(&memcg->refcnt);
> -}
> -
> -static void __mem_cgroup_put(struct mem_cgroup *memcg, int count)
> -{
> - if (atomic_sub_and_test(count, &memcg->refcnt))
> - call_rcu(&memcg->rcu_freeing, free_rcu);
> -}
> -
> -static void mem_cgroup_put(struct mem_cgroup *memcg)
> -{
> - __mem_cgroup_put(memcg, 1);
> -}
> -
> /*
> * Returns the parent mem_cgroup in memcgroup hierarchy with hierarchy enabled.
> */
> @@ -6192,7 +6171,6 @@ mem_cgroup_css_alloc(struct cgroup *cont)
>
> memcg->last_scanned_node = MAX_NUMNODES;
> INIT_LIST_HEAD(&memcg->oom_notify);
> - atomic_set(&memcg->refcnt, 1);
> memcg->move_charge_at_immigrate = 0;
> mutex_init(&memcg->thresholds_lock);
> spin_lock_init(&memcg->move_lock);
> @@ -6279,7 +6257,7 @@ static void mem_cgroup_css_free(struct cgroup *cont)
>
> mem_cgroup_sockets_destroy(memcg);
>
> - mem_cgroup_put(memcg);
> + call_rcu(&memcg->rcu_freeing, free_rcu);
> }
>
> #ifdef CONFIG_MMU
> --
> 1.8.0.2
--
Michal Hocko
SUSE Labs
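[For reference, the semantics being deleted here: the old __mem_cgroup_put() used atomic_sub_and_test() and only scheduled the RCU free when the count hit zero, whereas after this patch the free is unconditional in css_free. A hedged userspace model of the old behavior (names illustrative):]

```c
/* Model of the refcounting this patch removes: subtract @count and
 * free only when the counter reaches zero, as atomic_sub_and_test()
 * did in __mem_cgroup_put(). Illustrative, not the kernel code. */
#include <stdatomic.h>
#include <stdbool.h>

struct memcg_model {
	atomic_int refcnt;
	bool freed;
};

static void old_mem_cgroup_put(struct memcg_model *m, int count)
{
	/* fetch_sub returns the old value; old == count means we
	 * just dropped the counter to zero */
	if (atomic_fetch_sub(&m->refcnt, count) == count)
		m->freed = true;   /* stands in for call_rcu(free_rcu) */
}
```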
(2013/04/03 18:11), Li Zefan wrote:
> Use css_get/css_put instead of mem_cgroup_get/put.
>
> Note, if at the same time someone is moving @current to a different
> cgroup and removing the old cgroup, css_tryget() may return false,
> and sock->sk_cgrp won't be initialized.
>
> Signed-off-by: Li Zefan <[email protected]>
Thank you.
Acked-by: KAMEZAWA Hiroyuki <[email protected]>
(2013/04/03 18:12), Li Zefan wrote:
> Use css_get()/css_put() instead of mem_cgroup_get()/mem_cgroup_put().
>
> Signed-off-by: Li Zefan <[email protected]>
> ---
> mm/memcontrol.c | 10 +++++-----
> 1 file changed, 5 insertions(+), 5 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 43ca91d..dafacb8 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -3191,7 +3191,7 @@ void memcg_release_cache(struct kmem_cache *s)
> list_del(&s->memcg_params->list);
> mutex_unlock(&memcg->slab_caches_mutex);
>
> - mem_cgroup_put(memcg);
> + css_put(&memcg->css);
> out:
> kfree(s->memcg_params);
> }
> @@ -3350,16 +3350,18 @@ static struct kmem_cache *memcg_create_kmem_cache(struct mem_cgroup *memcg,
>
> mutex_lock(&memcg_cache_mutex);
> new_cachep = cachep->memcg_params->memcg_caches[idx];
> - if (new_cachep)
> + if (new_cachep) {
> + css_put(&memcg->css);
> goto out;
> + }
Where is the matching css_get() for this done?
Thanks,
-Kame
(2013/04/03 18:12), Li Zefan wrote:
> Use css_get/put instead of mem_cgroup_get/put.
>
> Signed-off-by: Li Zefan <[email protected]>
> ---
> mm/memcontrol.c | 26 ++++++++++++++++----------
> 1 file changed, 16 insertions(+), 10 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 877551d..ad576e8 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -4137,12 +4137,12 @@ __mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype,
> unlock_page_cgroup(pc);
> /*
> * even after unlock, we have memcg->res.usage here and this memcg
> - * will never be freed.
> + * will never be freed, so it's safe to call css_get().
> */
> memcg_check_events(memcg, page);
> if (do_swap_account && ctype == MEM_CGROUP_CHARGE_TYPE_SWAPOUT) {
> mem_cgroup_swap_statistics(memcg, true);
> - mem_cgroup_get(memcg);
> + css_get(&memcg->css);
> }
> /*
> * Migration does not charge the res_counter for the
> @@ -4242,7 +4242,7 @@ mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent, bool swapout)
>
> /*
> * record memcg information, if swapout && memcg != NULL,
> - * mem_cgroup_get() was called in uncharge().
> + * css_get() was called in uncharge().
> */
> if (do_swap_account && swapout && memcg)
> swap_cgroup_record(ent, css_id(&memcg->css));
> @@ -4273,7 +4273,7 @@ void mem_cgroup_uncharge_swap(swp_entry_t ent)
> if (!mem_cgroup_is_root(memcg))
> res_counter_uncharge(&memcg->memsw, PAGE_SIZE);
> mem_cgroup_swap_statistics(memcg, false);
> - mem_cgroup_put(memcg);
> + css_put(&memcg->css);
> }
> rcu_read_unlock();
> }
> @@ -4307,11 +4307,14 @@ static int mem_cgroup_move_swap_account(swp_entry_t entry,
> * This function is only called from task migration context now.
> * It postpones res_counter and refcount handling till the end
> * of task migration(mem_cgroup_clear_mc()) for performance
> - * improvement. But we cannot postpone mem_cgroup_get(to)
> - * because if the process that has been moved to @to does
> - * swap-in, the refcount of @to might be decreased to 0.
> + * improvement. But we cannot postpone css_get(to) because if
> + * the process that has been moved to @to does swap-in, the
> + * refcount of @to might be decreased to 0.
> + *
> + * We are in attach() phase, so the cgroup is guaranteed to be
> + * alive, so we can just call css_get().
> */
> - mem_cgroup_get(to);
> + css_get(&to->css);
> return 0;
> }
> return -EINVAL;
> @@ -6597,6 +6600,7 @@ static void __mem_cgroup_clear_mc(void)
> {
> struct mem_cgroup *from = mc.from;
> struct mem_cgroup *to = mc.to;
> + int i;
>
> /* we must uncharge all the leftover precharges from mc.to */
> if (mc.precharge) {
> @@ -6617,7 +6621,9 @@ static void __mem_cgroup_clear_mc(void)
> if (!mem_cgroup_is_root(mc.from))
> res_counter_uncharge(&mc.from->memsw,
> PAGE_SIZE * mc.moved_swap);
> - __mem_cgroup_put(mc.from, mc.moved_swap);
> +
> + for (i = 0; i < mc.moved_swap; i++)
> + css_put(&mc.from->css);
>
> if (!mem_cgroup_is_root(mc.to)) {
> /*
> @@ -6627,7 +6633,7 @@ static void __mem_cgroup_clear_mc(void)
> res_counter_uncharge(&mc.to->res,
> PAGE_SIZE * mc.moved_swap);
> }
> - /* we've already done mem_cgroup_get(mc.to) */
> + /* we've already done css_get(mc.to) */
> mc.moved_swap = 0;
> }
> memcg_oom_recover(from);
>
Acked-by: KAMEZAWA Hiroyuki <[email protected]>
(2013/04/03 18:13), Li Zefan wrote:
> Suppose we rmdir a cgroup and there're still css refs, this cgroup won't
> be freed. Then we rmdir the parent cgroup, and the parent is freed due
> to css ref draining to 0. Now it would be a disaster if the child cgroup
> tries to access its parent.
>
> Make sure this won't happen.
>
> Signed-off-by: Li Zefan <[email protected]>
> ---
> kernel/cgroup.c | 10 ++++++++++
> 1 file changed, 10 insertions(+)
>
> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index fa54b92..78204bc 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -888,6 +888,13 @@ static void cgroup_free_fn(struct work_struct *work)
> mutex_unlock(&cgroup_mutex);
>
> /*
> + * We get a ref to the parent's dentry, and put the ref when
> + * this cgroup is being freed, so it's guaranteed that the
> + * parent won't be destroyed before its children.
> + */
> + dput(cgrp->parent->dentry);
> +
> + /*
> * Drop the active superblock reference that we took when we
> * created the cgroup
> */
> @@ -4171,6 +4178,9 @@ static long cgroup_create(struct cgroup *parent, struct dentry *dentry,
> for_each_subsys(root, ss)
> dget(dentry);
>
> + /* hold a ref to the parent's dentry */
> + dget(parent->dentry);
> +
> /* creation succeeded, notify subsystems */
> for_each_subsys(root, ss) {
> err = online_css(ss, cgrp);
>
Acked-by: KAMEZAWA Hiroyuki <[email protected]>
On 04/03/2013 07:29 PM, Michal Hocko wrote:
> On Wed 03-04-13 16:58:48, Glauber Costa wrote:
>> On 04/03/2013 01:11 PM, Li Zefan wrote:
>>> Use css_get/css_put instead of mem_cgroup_get/put.
>>>
>>> Note, if at the same time someone is moving @current to a different
>>> cgroup and removing the old cgroup, css_tryget() may return false,
>>> and sock->sk_cgrp won't be initialized.
>>>
>>> Signed-off-by: Li Zefan <[email protected]>
>>> ---
>>> mm/memcontrol.c | 8 ++++----
>>> 1 file changed, 4 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>>> index 23d0f6e..43ca91d 100644
>>> --- a/mm/memcontrol.c
>>> +++ b/mm/memcontrol.c
>>> @@ -536,15 +536,15 @@ void sock_update_memcg(struct sock *sk)
>>> */
>>> if (sk->sk_cgrp) {
>>> BUG_ON(mem_cgroup_is_root(sk->sk_cgrp->memcg));
>>> - mem_cgroup_get(sk->sk_cgrp->memcg);
>>> + css_get(&sk->sk_cgrp->memcg->css);
>
> I am not sure I understand this one. So we have a goup here (which means
> that somebody already took a reference on it, right?) and we are taking
> another reference. If this is released by sock_release_memcg then who
> releases the previous one? It is not directly related to the patch
> because this has been done previously already. Could you clarify
> Glauber, please?
This should be documented in the commit that introduced this, and it was
one of the first bugs I handled with this code.
Bottom line: we can create sockets normally, and those will have process
context. But we can also create sockets by cloning existing sockets. To
the best of my knowledge, this is done by things like accept().
Because those sockets are a clone of their ancestors, they also belong
to a workload that should be limited. Also note that because they have
cgroup context, we will eventually try to put them. So we need to grab
an extra reference.
sock_update_memcg() is always called at socket creation, and the
original structures are filled with zeroes. Therefore cloning is the
*only* path that takes us here with sk->sk_cgrp filled.
My comment right above this excerpt states:
/* Socket cloning can throw us here with sk_cgrp already
* filled. It won't however, necessarily happen from
* process context. So the test for root memcg given
* the current task's memcg won't help us in this case.
*
* Respecting the original socket's memcg is a better
* decision in this case.
*/
>
>>> return;
>>> }
>>>
>>> rcu_read_lock();
>>> memcg = mem_cgroup_from_task(current);
>>> cg_proto = sk->sk_prot->proto_cgroup(memcg);
>>> - if (!mem_cgroup_is_root(memcg) && memcg_proto_active(cg_proto)) {
>>> - mem_cgroup_get(memcg);
>>> + if (!mem_cgroup_is_root(memcg) &&
>>> + memcg_proto_active(cg_proto) && css_tryget(&memcg->css)) {
>>> sk->sk_cgrp = cg_proto;
>>> }
>>
>> What happens if this tryget fails ? Won't we leak a reference here? We
>> will put regardless when the socket is released, and this may go
>> negative. No?
>
> AFAICS sock_release_memcg releases the reference only if sk->sk_cgrp and
> that one wouldn't be set if css_tryget fails.
>
Yes, this is totally fine. I was actually thinking of the same socket
cloning case I mentioned above. We cannot fail that path, because we
already have an "implicit" reference; we just need to officially mark it.
Failing here is indeed fine: future sockets cloned from this socket will
have a NULL cgroup context as well.
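[The two paths discussed here - clone with sk_cgrp already set taking an unconditional reference, versus a fresh socket that only keeps a cgroup pointer if css_tryget() succeeds - can be modeled in userspace. All names below are illustrative stand-ins for the sock_update_memcg() logic, not the kernel's API:]

```c
/* Sketch of the update logic under discussion: the clone path cannot
 * fail (the ancestor's reference is already implicit), while the
 * normal-creation path may race with cgroup removal and must tryget.
 * Illustrative names only. */
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct css_m  { atomic_int refcnt; bool dead; };
struct sock_m { struct css_m *sk_cgrp; };

static bool css_tryget_m(struct css_m *css)
{
	if (css->dead)          /* group is being removed */
		return false;
	atomic_fetch_add(&css->refcnt, 1);
	return true;
}

static void sock_update_m(struct sock_m *sk, struct css_m *task_css)
{
	if (sk->sk_cgrp) {
		/* clone path: ancestor socket already pinned this
		 * group, so an unconditional get is safe */
		atomic_fetch_add(&sk->sk_cgrp->refcnt, 1);
		return;
	}
	/* fresh socket: keep the pointer only if tryget succeeds */
	if (css_tryget_m(task_css))
		sk->sk_cgrp = task_css;
}
```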
On 04/04/2013 07:22 PM, Tejun Heo wrote:
> On Thu, Apr 04, 2013 at 05:20:28PM +0200, Michal Hocko wrote:
>>> But what harm does an additional reference do?
>>
>> No harm at all. I just wanted to be sure that this is not yet another
>> "for memcg" hack. So if this is useful for other controllers then I have
>> no objections of course.
>
> I think it makes sense in general, so let's do it in cgroup core. I
> suppose it'd be easier for this to be routed together with other memcg
> changes?
>
> Thanks.
>
You guys seem to have settled this already, but FWIW I agree with Tejun
here. It makes sense from a design point of view for a cgroup to pin its
parent. cgroup core it is.
(2013/04/03 18:13), Li Zefan wrote:
> The cgroup core guarantees it's always safe to access the parent.
>
> Signed-off-by: Li Zefan <[email protected]>
Acked-by: KAMEZAWA Hiroyuki <[email protected]>
(2013/04/03 18:14), Li Zefan wrote:
> Now memcg has the same life cycle as the corresponding cgroup.
> Kill the useless refcnt.
>
> Signed-off-by: Li Zefan <[email protected]>
very very very nice. Thank you.
Acked-by: KAMEZAWA Hiroyuki <[email protected]>
> * __mem_cgroup_free will issue static_key_slow_dec because this
> * memcg is active already. If the later initialization fails
> * then the cgroup core triggers the cleanup so we do not have
> * to do it here.
> */
>> - mem_cgroup_get(memcg);
>> static_key_slow_inc(&memcg_kmem_enabled_key);
>>
>> mutex_lock(&set_limit_mutex);
>> @@ -5823,23 +5814,33 @@ static int memcg_init_kmem(struct mem_cgroup *memcg, struct cgroup_subsys *ss)
>> return mem_cgroup_sockets_init(memcg, ss);
>> };
>>
>> -static void kmem_cgroup_destroy(struct mem_cgroup *memcg)
>> +static void kmem_cgroup_css_offline(struct mem_cgroup *memcg)
>> {
>> - mem_cgroup_sockets_destroy(memcg);
>> + /*
>> + * kmem charges can outlive the cgroup. In the case of slab
>> + * pages, for instance, a page contain objects from various
>> + * processes, so it is unfeasible to migrate them away. We
>> + * need to reference count the memcg because of that.
>> + */
>
> I would prefer if we could merge all three comments in this function
> into a single one. What about something like the following?
> /*
> * kmem charges can outlive the cgroup. In the case of slab
> * pages, for instance, a page can contain objects from various
> * processes. As we prevent from taking a reference for every
> * such allocation we have to be careful when doing uncharge
> * (see memcg_uncharge_kmem) and here during offlining.
> * The idea is that only the _last_ uncharge which sees
> * the dead memcg will drop the last reference. An additional
> * reference is taken here before the group is marked dead
> * which is then paired with css_put during uncharge resp. here.
> * Although this might sound strange as this path is called when
> * the reference has already dropped down to 0 and shouldn't be
> * incremented anymore (css_tryget would fail) we do not have
> * other options because of the kmem allocations lifetime.
> */
>> + css_get(&memcg->css);
>
> I think that you need a write memory barrier here because neither
> css_get nor memcg_kmem_mark_dead implies one. memcg_uncharge_kmem uses
> memcg_kmem_test_and_clear_dead which imply a full memory barrier but it
> should see the elevated reference count. No?
>
We don't use barriers for any other kind of reference counting. What is
different here?
On 04/03/2013 01:12 PM, Li Zefan wrote:
> Use css_get/put instead of mem_cgroup_get/put.
>
> We can't do a simple replacement, because here mem_cgroup_put()
> is called during mem_cgroup_css_free(), while mem_cgroup_css_free()
> won't be called until css refcnt goes down to 0.
>
> Instead we increment css refcnt in mem_cgroup_css_offline(), and
> then check if there's still kmem charges. If not, css refcnt will
> be decremented, otherwise the refcnt will be decremented when
> kmem charges goes down to 0.
That is okay, it should work.
On 04/03/2013 07:31 PM, Michal Hocko wrote:
> On Wed 03-04-13 17:12:21, Li Zefan wrote:
>> Use css_get()/css_put() instead of mem_cgroup_get()/mem_cgroup_put().
>>
>> Signed-off-by: Li Zefan <[email protected]>
>> ---
>> mm/memcontrol.c | 10 +++++-----
>> 1 file changed, 5 insertions(+), 5 deletions(-)
>>
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index 43ca91d..dafacb8 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -3191,7 +3191,7 @@ void memcg_release_cache(struct kmem_cache *s)
>> list_del(&s->memcg_params->list);
>> mutex_unlock(&memcg->slab_caches_mutex);
>>
>> - mem_cgroup_put(memcg);
>> + css_put(&memcg->css);
>> out:
>> kfree(s->memcg_params);
>> }
>> @@ -3350,16 +3350,18 @@ static struct kmem_cache *memcg_create_kmem_cache(struct mem_cgroup *memcg,
>>
>> mutex_lock(&memcg_cache_mutex);
>> new_cachep = cachep->memcg_params->memcg_caches[idx];
>> - if (new_cachep)
>> + if (new_cachep) {
>> + css_put(&memcg->css);
>> goto out;
>> + }
>>
>> new_cachep = kmem_cache_dup(memcg, cachep);
>> if (new_cachep == NULL) {
>> new_cachep = cachep;
>> + css_put(&memcg->css);
>> goto out;
>> }
>>
>> - mem_cgroup_get(memcg);
>> atomic_set(&new_cachep->memcg_params->nr_pages , 0);
>>
>> cachep->memcg_params->memcg_caches[idx] = new_cachep;
>> @@ -3449,8 +3451,6 @@ static void memcg_create_cache_work_func(struct work_struct *w)
>>
>> cw = container_of(w, struct create_work, work);
>> memcg_create_kmem_cache(cw->memcg, cw->cachep);
>> - /* Drop the reference gotten when we enqueued. */
>> - css_put(&cw->memcg->css);
>> kfree(cw);
>> }
>
> You are putting references but I do not see any single css_{try}get
> here. /me puzzled.
>
There are two things being done in this code:
First, we acquire a css reference to make sure that the underlying cgroup
will not go away. That is a short-lived reference, and it is put as
soon as the cache is created.
At this point, we acquire a long-lived per-cache memcg reference count
to guarantee that the memcg will still be alive.
so it is:
enqueue: css_get
create : memcg_get, css_put
destroy: css_put
If I understand Li's patch correctly, he is not touching the first
css_get, only turning that into the long lived reference (which was not
possible before, since that would prevent rmdir).
Then he only needs to get rid of the memcg_get, change the memcg_put to
css_put, and get rid of the now extra css_put.
He is issuing extra css_puts in memcg_create_kmem_cache, but only in
failure paths. So the code reads as:
* css_get on enqueue (already done, so not shown in patch)
* if it fails, css_put
* if it succeeds, don't do anything. This is already the long-lived
reference count. put it at release time.
The code looks correct, and of course much simpler, thanks to the
use of a single reference.
Li, am I right in my understanding that this is your intention?
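[Glauber's enqueue/create/release sequence above can be condensed into a toy model: one get on enqueue, which either becomes the long-lived per-cache reference or is put back in the failure paths of memcg_create_kmem_cache(). Names are illustrative, not the kernel's:]

```c
/* Toy model of the single-reference flow described above:
 *   enqueue:  css_get
 *   create:   css_put on failure, otherwise keep the ref
 *   release:  css_put (memcg_release_cache)
 * Illustrative only. */
#include <stdbool.h>

struct memcg_m { int refcnt; };

static void enqueue_cache_creation(struct memcg_m *m)
{
	m->refcnt++;                 /* css_get() before queueing work */
}

/* Returns true if a new cache now owns the enqueue reference. */
static bool create_kmem_cache(struct memcg_m *m, bool already_exists,
			      bool dup_fails)
{
	if (already_exists || dup_fails) {
		m->refcnt--;         /* css_put() in the failure paths */
		return false;
	}
	return true;                 /* ref becomes long-lived */
}

static void release_cache(struct memcg_m *m)
{
	m->refcnt--;                 /* css_put() at cache release */
}
```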
On Fri 05-04-13 12:08:40, Glauber Costa wrote:
> On 04/03/2013 07:29 PM, Michal Hocko wrote:
> > On Wed 03-04-13 16:58:48, Glauber Costa wrote:
> >> On 04/03/2013 01:11 PM, Li Zefan wrote:
> >>> Use css_get/css_put instead of mem_cgroup_get/put.
> >>>
> >>> Note, if at the same time someone is moving @current to a different
> >>> cgroup and removing the old cgroup, css_tryget() may return false,
> >>> and sock->sk_cgrp won't be initialized.
> >>>
> >>> Signed-off-by: Li Zefan <[email protected]>
> >>> ---
> >>> mm/memcontrol.c | 8 ++++----
> >>> 1 file changed, 4 insertions(+), 4 deletions(-)
> >>>
> >>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> >>> index 23d0f6e..43ca91d 100644
> >>> --- a/mm/memcontrol.c
> >>> +++ b/mm/memcontrol.c
> >>> @@ -536,15 +536,15 @@ void sock_update_memcg(struct sock *sk)
> >>> */
> >>> if (sk->sk_cgrp) {
> >>> BUG_ON(mem_cgroup_is_root(sk->sk_cgrp->memcg));
> >>> - mem_cgroup_get(sk->sk_cgrp->memcg);
> >>> + css_get(&sk->sk_cgrp->memcg->css);
> >
> > I am not sure I understand this one. So we have a group here (which means
> > that somebody already took a reference on it, right?) and we are taking
> > another reference. If this is released by sock_release_memcg then who
> > releases the previous one? It is not directly related to the patch
> > because this has been done previously already. Could you clarify
> > Glauber, please?
>
> This should be documented in the commit that introduced this, and it was
> one of the first bugs I've handled with this code.
>
> Bottom line, we can create sockets normally, and those will have process
> context. But we also can create sockets by cloning existing sockets. To
> the best of my knowledge, this is done by things like accept().
>
> Because those sockets are a clone of their ancestors, they also belong
> to a workload that should be limited. Also note that because they have
> cgroup context, we will eventually try to put them. So we need to grab
> an extra reference.
>
> sock_update_memcg is always called at socket creation, and the
> original structures are filled with zeroes. Therefore cloning is the
> *only* path that takes us here with sk->sk_cgrp already filled.
OK, I guess I understand.
Thanks for the clarification, Glauber!
--
Michal Hocko
SUSE Labs
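Glauber's explanation of the two paths into sock_update_memcg (fresh socket vs. clone via accept()) can be sketched as follows. This is a hedged user-space model only: memcg_tryget() is a stand-in for css_tryget(), there is no real RCU, and the names are invented for illustration:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model of the two paths into sock_update_memcg described above.
 * "memcg" here is just a refcounted stub; no real RCU or networking. */
struct memcg { atomic_int refcnt; };
struct sock  { struct memcg *sk_cgrp; };

static bool memcg_tryget(struct memcg *m)
{
    int old = atomic_load(&m->refcnt);
    while (old > 0)
        if (atomic_compare_exchange_weak(&m->refcnt, &old, old + 1))
            return true;     /* got a ref; fails once refcnt hit 0 */
    return false;
}

static void sock_update(struct sock *sk, struct memcg *current_memcg)
{
    if (sk->sk_cgrp) {
        /* clone path (e.g. accept()): the parent socket already holds
         * a ref; the clone needs its own, hence the extra get */
        atomic_fetch_add(&sk->sk_cgrp->refcnt, 1);
        return;
    }
    /* creation path: racy vs. rmdir, so tryget may legitimately fail,
     * in which case sk_cgrp simply stays NULL */
    if (memcg_tryget(current_memcg))
        sk->sk_cgrp = current_memcg;
}
```

The clone path never needs tryget because the copied sk_cgrp pointer proves a reference is already held, while the creation path must tolerate tryget failing against a dying cgroup -- exactly the caveat in Li's changelog.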
On Wed 03-04-13 17:11:37, Li Zefan wrote:
> Use css_get/css_put instead of mem_cgroup_get/put.
>
> Note, if at the same time someone is moving @current to a different
> cgroup and removing the old cgroup, css_tryget() may return false,
> and sock->sk_cgrp won't be initialized.
>
> Signed-off-by: Li Zefan <[email protected]>
Acked-by: Michal Hocko <[email protected]>
> ---
> mm/memcontrol.c | 8 ++++----
> 1 file changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 23d0f6e..43ca91d 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -536,15 +536,15 @@ void sock_update_memcg(struct sock *sk)
> */
> if (sk->sk_cgrp) {
> BUG_ON(mem_cgroup_is_root(sk->sk_cgrp->memcg));
> - mem_cgroup_get(sk->sk_cgrp->memcg);
> + css_get(&sk->sk_cgrp->memcg->css);
> return;
> }
>
> rcu_read_lock();
> memcg = mem_cgroup_from_task(current);
> cg_proto = sk->sk_prot->proto_cgroup(memcg);
> - if (!mem_cgroup_is_root(memcg) && memcg_proto_active(cg_proto)) {
> - mem_cgroup_get(memcg);
> + if (!mem_cgroup_is_root(memcg) &&
> + memcg_proto_active(cg_proto) && css_tryget(&memcg->css)) {
> sk->sk_cgrp = cg_proto;
> }
> rcu_read_unlock();
> @@ -558,7 +558,7 @@ void sock_release_memcg(struct sock *sk)
> struct mem_cgroup *memcg;
> WARN_ON(!sk->sk_cgrp->memcg);
> memcg = sk->sk_cgrp->memcg;
> - mem_cgroup_put(memcg);
> + css_put(&sk->sk_cgrp->memcg->css);
> }
> }
>
> --
> 1.8.0.2
--
Michal Hocko
SUSE Labs
>
> OK, I guess I understand.
>
> Thanks for the clarification, Glauber!
You're welcome, Michal!
On Fri 05-04-13 14:28:12, Glauber Costa wrote:
> On 04/03/2013 07:31 PM, Michal Hocko wrote:
> > On Wed 03-04-13 17:12:21, Li Zefan wrote:
> >> Use css_get()/css_put() instead of mem_cgroup_get()/mem_cgroup_put().
> >>
> >> Signed-off-by: Li Zefan <[email protected]>
> >> ---
> >> mm/memcontrol.c | 10 +++++-----
> >> 1 file changed, 5 insertions(+), 5 deletions(-)
> >>
> >> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> >> index 43ca91d..dafacb8 100644
> >> --- a/mm/memcontrol.c
> >> +++ b/mm/memcontrol.c
> >> @@ -3191,7 +3191,7 @@ void memcg_release_cache(struct kmem_cache *s)
> >> list_del(&s->memcg_params->list);
> >> mutex_unlock(&memcg->slab_caches_mutex);
> >>
> >> - mem_cgroup_put(memcg);
> >> + css_put(&memcg->css);
> >> out:
> >> kfree(s->memcg_params);
> >> }
> >> @@ -3350,16 +3350,18 @@ static struct kmem_cache *memcg_create_kmem_cache(struct mem_cgroup *memcg,
> >>
> >> mutex_lock(&memcg_cache_mutex);
> >> new_cachep = cachep->memcg_params->memcg_caches[idx];
> >> - if (new_cachep)
> >> + if (new_cachep) {
> >> + css_put(&memcg->css);
> >> goto out;
> >> + }
> >>
> >> new_cachep = kmem_cache_dup(memcg, cachep);
> >> if (new_cachep == NULL) {
> >> new_cachep = cachep;
> >> + css_put(&memcg->css);
> >> goto out;
> >> }
> >>
> >> - mem_cgroup_get(memcg);
> >> atomic_set(&new_cachep->memcg_params->nr_pages , 0);
> >>
> >> cachep->memcg_params->memcg_caches[idx] = new_cachep;
> >> @@ -3449,8 +3451,6 @@ static void memcg_create_cache_work_func(struct work_struct *w)
> >>
> >> cw = container_of(w, struct create_work, work);
> >> memcg_create_kmem_cache(cw->memcg, cw->cachep);
> >> - /* Drop the reference gotten when we enqueued. */
> >> - css_put(&cw->memcg->css);
> >> kfree(cw);
> >> }
> >
> > You are putting references but I do not see any single css_{try}get
> > here. /me puzzled.
> >
>
> There are two things being done in this code:
> First, we acquired a css_ref to make sure that the underlying cgroup
> would not go away. That is a short lived reference, and it is put as
> soon as the cache is created.
> At this point, we acquire a long-lived per-cache memcg reference count
> to guarantee that the memcg will still be alive.
>
> so it is:
>
> enqueue: css_get
> create : memcg_get, css_put
> destroy: css_put
>
> If I understand Li's patch correctly, he is not touching the first
> css_get, only turning that into the long lived reference (which was not
> possible before, since that would prevent rmdir).
>
> Then he only needs to get rid of the memcg_get, change the memcg_put to
> css_put, and get rid of the now extra css_put.
>
> He is issuing extra css_puts in memcg_create_kmem_cache, but only in
> failure paths. So the code reads as:
> * css_get on enqueue (already done, so not shown in patch)
> * if it fails, css_put
> * if it succeeds, don't do anything. This is already the long-lived
> reference count. put it at release time.
OK, this makes more sense now. It is __memcg_create_cache_enqueue which
takes the reference and it is not put after this because it replaced
mem_cgroup reference counting.
Li, please put something along these lines into the changelog. This is
really tricky and easy to misunderstand.
You can put my Acked-by then.
> The code looks correct, and of course, extremely simpler due to the
> use of a single reference.
>
> Li, am I right in my understanding that this is your intention?
>
> --
--
Michal Hocko
SUSE Labs
On Fri 05-04-13 14:51:10, KAMEZAWA Hiroyuki wrote:
> (2013/04/03 18:12), Li Zefan wrote:
> > Use css_get()/css_put() instead of mem_cgroup_get()/mem_cgroup_put().
> >
> > Signed-off-by: Li Zefan <[email protected]>
> > ---
> > mm/memcontrol.c | 10 +++++-----
> > 1 file changed, 5 insertions(+), 5 deletions(-)
> >
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 43ca91d..dafacb8 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -3191,7 +3191,7 @@ void memcg_release_cache(struct kmem_cache *s)
> > list_del(&s->memcg_params->list);
> > mutex_unlock(&memcg->slab_caches_mutex);
> >
> > - mem_cgroup_put(memcg);
> > + css_put(&memcg->css);
> > out:
> > kfree(s->memcg_params);
> > }
> > @@ -3350,16 +3350,18 @@ static struct kmem_cache *memcg_create_kmem_cache(struct mem_cgroup *memcg,
> >
> > mutex_lock(&memcg_cache_mutex);
> > new_cachep = cachep->memcg_params->memcg_caches[idx];
> > - if (new_cachep)
> > + if (new_cachep) {
> > + css_put(&memcg->css);
> > goto out;
> > + }
>
> Where is css_get() against this done?
As glauber explained in another email in this thread. It was
__memcg_create_cache_enqueue which took the reference.
--
Michal Hocko
SUSE Labs
On Fri 05-04-13 14:19:30, Glauber Costa wrote:
>
> > * __mem_cgroup_free will issue static_key_slow_dec because this
> > * memcg is active already. If the later initialization fails
> > * then the cgroup core triggers the cleanup so we do not have
> > * to do it here.
> > */
> >> - mem_cgroup_get(memcg);
> >> static_key_slow_inc(&memcg_kmem_enabled_key);
> >>
> >> mutex_lock(&set_limit_mutex);
> >> @@ -5823,23 +5814,33 @@ static int memcg_init_kmem(struct mem_cgroup *memcg, struct cgroup_subsys *ss)
> >> return mem_cgroup_sockets_init(memcg, ss);
> >> };
> >>
> >> -static void kmem_cgroup_destroy(struct mem_cgroup *memcg)
> >> +static void kmem_cgroup_css_offline(struct mem_cgroup *memcg)
> >> {
> >> - mem_cgroup_sockets_destroy(memcg);
> >> + /*
> >> + * kmem charges can outlive the cgroup. In the case of slab
> >> + * pages, for instance, a page contain objects from various
> >> + * processes, so it is unfeasible to migrate them away. We
> >> + * need to reference count the memcg because of that.
> >> + */
> >
> > I would prefer if we could merge all three comments in this function
> > into a single one. What about something like the following?
> > /*
> > * kmem charges can outlive the cgroup. In the case of slab
> > * pages, for instance, a page contains objects from various
> > * processes. As we refrain from taking a reference for every
> > * such allocation we have to be careful when doing uncharge
> > * (see memcg_uncharge_kmem) and here during offlining.
> > * The idea is that only the _last_ uncharge which sees
> > * the dead memcg will drop the last reference. An additional
> > * reference is taken here before the group is marked dead
> > * which is then paired with css_put during uncharge resp. here.
> > * Although this might sound strange as this path is called when
> > * the reference has already dropped down to 0 and shouldn't be
> > * incremented anymore (css_tryget would fail) we do not have
> > * other options because of the kmem allocations lifetime.
> > */
> >> + css_get(&memcg->css);
> >
> > I think that you need a write memory barrier here because css_get
> > nor memcg_kmem_mark_dead implies it. memcg_uncharge_kmem uses
> > memcg_kmem_test_and_clear_dead which imply a full memory barrier but it
> > should see the elevated reference count. No?
> >
>
> We don't use barriers for any other kind of reference counting. What is
> different here?
Now we need to make sure that the racing uncharge sees an elevated
reference count before the group is marked dead. Otherwise we could see
a dead group with ref count == 0, no?
--
Michal Hocko
SUSE Labs
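The ordering concern Michal raises -- a racing uncharge must not observe the dead mark without also observing the elevated reference count -- can be made explicit with C11 release/acquire atomics. These are stand-ins for css_get, memcg_kmem_mark_dead, and memcg_kmem_test_and_clear_dead, not the real kernel primitives; whether the kernel needs an explicit barrier there is exactly the open question in the thread:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Sketch of the ordering discussed above, in C11-atomic terms. */
struct memcg {
    atomic_int  refcnt;
    atomic_bool dead;
};

static void kmem_offline(struct memcg *m)
{
    atomic_fetch_add_explicit(&m->refcnt, 1, memory_order_relaxed);
    /* release ordering: any uncharge that observes dead == true is
     * guaranteed to also observe the elevated refcnt above -- the
     * effect Michal says a plain css_get would not imply by itself */
    atomic_store_explicit(&m->dead, true, memory_order_release);
}

static bool uncharge(struct memcg *m)
{
    /* acquire pairs with the release store; the exchange also gives
     * the "test and clear", so only the _last_ uncharge that sees the
     * dead mark drops the extra reference taken at offline time */
    if (atomic_exchange_explicit(&m->dead, false, memory_order_acq_rel))
        return atomic_fetch_sub(&m->refcnt, 1) == 1;  /* last ref gone? */
    return false;
}
```

With this pairing a dead group can never be seen with refcnt == 0 before the final uncharge, which is the invariant the proposed comment block tries to document.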
>>> You are putting references but I do not see any single css_{try}get
>>> here. /me puzzled.
>>>
>>
>> There are two things being done in this code:
>> First, we acquired a css_ref to make sure that the underlying cgroup
>> would not go away. That is a short lived reference, and it is put as
>> soon as the cache is created.
>> At this point, we acquire a long-lived per-cache memcg reference count
>> to guarantee that the memcg will still be alive.
>>
>> so it is:
>>
>> enqueue: css_get
>> create : memcg_get, css_put
>> destroy: css_put
>>
>> If I understand Li's patch correctly, he is not touching the first
>> css_get, only turning that into the long lived reference (which was not
>> possible before, since that would prevent rmdir).
>>
>> Then he only needs to get rid of the memcg_get, change the memcg_put to
>> css_put, and get rid of the now extra css_put.
>>
>> He is issuing extra css_puts in memcg_create_kmem_cache, but only in
>> failure paths. So the code reads as:
>> * css_get on enqueue (already done, so not shown in patch)
>> * if it fails, css_put
>> * if it succeeds, don't do anything. This is already the long-lived
>> reference count. put it at release time.
>
> OK, this makes more sense now. It is __memcg_create_cache_enqueue which
> takes the reference and it is not put after this because it replaced
> mem_cgroup reference counting.
> Li, please put something along these lines into the changelog. This is
> really tricky and easy to get misunderstand.
>
Yeah, I think I'll just steal Glauber's explanation as the changelog.
> You can put my Acked-by then.
>
Thanks!
>> The code looks correct, and of course, extremely simpler due to the
>> use of a single reference.
>>
>> Li, am I right in my understanding that this is your intention?
>>
On 2013/4/4 20:00, Michal Hocko wrote:
> On Wed 03-04-13 17:11:15, Li Zefan wrote:
>> (I'll be off from my office soon, and I won't be responsive in the following
>> 3 days.)
>>
>> I'm working on converting memcg to use cgroup->id, and then we can kill css_id.
>>
>> Now memcg has its own refcnt, so when a cgroup is destroyed, the memcg can
>> still be alive. This patchset converts memcg to always use css_get/put, so
>> memcg will have the same life cycle as its corresponding cgroup, and then
>> it's always safe for memcg to use cgroup->id.
>>
>> The historical reason that memcg didn't use css_get in some cases, is that
>> cgroup couldn't be removed if there're still css refs. The situation has
> >> changed so that rmdir of a cgroup will succeed regardless of css refs, but it won't
> >> be freed until css refs go down to 0.
>>
>> This is an early post, and it's NOT TESTED. I just want to see if the changes
>> are fine in general.
>
> yes, I like the approach and it looks correct as well (some minor things
> mentioned in the patches). Thanks a lot Li! This will make our lives much
> easier. The separate ref counting was a PITA, especially after the
> introduction of kmem accounting, which made its usage even trickier.
>
>> btw, after this patchset I think we don't need to free memcg via RCU, because
>> cgroup is already freed in RCU callback.
>
> But this depends on changes waiting in the for-3.10 branch, right?
What changes? memcg changes or cgroup core changes? I don't think this depends
on anything in cgroup 3.10 branch.
> Anyway, I think we should be safe with the workqueue based releasing as
> well once mem_cgroup_{get,put} are gone, right?
>
cgroup calls mem_cgroup_css_free() in a work function, so it seems memcg doesn't
need to use RCU or a workqueue in mem_cgroup_css_free().
Hi,
I'm rebasing this patchset against latest linux-next, and it conflicts with
"[PATCH v2] memcg: debugging facility to access dangling memcgs." slightly.
That is a debugging patch and will never be pushed into mainline, so should I
still base this patchset on that debugging patch?
Also that patch needs update (and can be simplified) after this patchset:
- move memcg_dangling_add() to mem_cgroup_css_offline()
- remove memcg->memcg_name, and use cgroup_path() in mem_cgroup_dangling_read()?
On 2013/4/3 17:11, Li Zefan wrote:
> (I'll be off from my office soon, and I won't be responsive in the following
> 3 days.)
>
> I'm working on converting memcg to use cgroup->id, and then we can kill css_id.
>
> Now memcg has its own refcnt, so when a cgroup is destroyed, the memcg can
> still be alive. This patchset converts memcg to always use css_get/put, so
> memcg will have the same life cycle as its corresponding cgroup, and then
> it's always safe for memcg to use cgroup->id.
>
> The historical reason that memcg didn't use css_get in some cases, is that
> cgroup couldn't be removed if there're still css refs. The situation has
> changed so that rmdir of a cgroup will succeed regardless of css refs, but it won't
> be freed until css refs go down to 0.
>
> This is an early post, and it's NOT TESTED. I just want to see if the changes
> are fine in general.
>
> btw, after this patchset I think we don't need to free memcg via RCU, because
> cgroup is already freed in RCU callback.
>
> Note this patchset is based on a few memcg fixes I sent (but hasn't been
> accepted)
>
> --
> kernel/cgroup.c | 10 ++++++++
> mm/memcontrol.c | 129 ++++++++++++++++++++++++++++++++++++++-------------------------------------------------------
> 2 files changed, 62 insertions(+), 77 deletions(-)
>
On Sun 07-04-13 16:44:07, Li Zefan wrote:
> Hi,
>
> I'm rebasing this patchset against latest linux-next, and it conflicts with
> "[PATCH v2] memcg: debugging facility to access dangling memcgs." slightly.
>
> That is a debugging patch and will never be pushed into mainline, so should I
> still base this patchset on that debugging patch?
Could you split CONFIG_MEMCG_DEBUG_ASYNC_DESTROY changes into a separate
patch so that Andrew could put it on top of the mentioned patch?
> Also that patch needs update (and can be simplified) after this patchset:
> - move memcg_dangling_add() to mem_cgroup_css_offline()
> - remove memcg->memcg_name, and use cgroup_path() in mem_cgroup_dangling_read()?
Every improvement is welcome.
Thanks
--
Michal Hocko
SUSE Labs
On Sun 07-04-13 14:00:24, Li Zefan wrote:
> On 2013/4/4 20:00, Michal Hocko wrote:
> > On Wed 03-04-13 17:11:15, Li Zefan wrote:
> >> (I'll be off from my office soon, and I won't be responsive in the following
> >> 3 days.)
> >>
> >> I'm working on converting memcg to use cgroup->id, and then we can kill css_id.
> >>
> >> Now memcg has its own refcnt, so when a cgroup is destroyed, the memcg can
> >> still be alive. This patchset converts memcg to always use css_get/put, so
> >> memcg will have the same life cycle as its corresponding cgroup, and then
> >> it's always safe for memcg to use cgroup->id.
> >>
> >> The historical reason that memcg didn't use css_get in some cases, is that
> >> cgroup couldn't be removed if there're still css refs. The situation has
> >> changed so that rmdir of a cgroup will succeed regardless of css refs, but it won't
> >> be freed until css refs go down to 0.
> >>
> >> This is an early post, and it's NOT TESTED. I just want to see if the changes
> >> are fine in general.
> >
> > yes, I like the approach and it looks correct as well (some minor things
> > mentioned in the patches). Thanks a lot Li! This will make our lives much
> > easier. The separate ref counting was a PITA, especially after the
> > introduction of kmem accounting, which made its usage even trickier.
> >
> >> btw, after this patchset I think we don't need to free memcg via RCU, because
> >> cgroup is already freed in RCU callback.
> >
> > But this depends on changes waiting in the for-3.10 branch, right?
>
> What changes? memcg changes or cgroup core changes? I don't think this depends
> on anything in cgroup 3.10 branch.
cgroup (be445626 e.g.), but now I've noticed that those are already
merged in Linus' tree.
FYI: I've cherry-picked them to my -mm git tree.
--
Michal Hocko
SUSE Labs
On 04/07/2013 12:44 PM, Li Zefan wrote:
> Hi,
>
> I'm rebasing this patchset against latest linux-next, and it conflicts with
> "[PATCH v2] memcg: debugging facility to access dangling memcgs." slightly.
>
> That is a debugging patch and will never be pushed into mainline, so should I
> still base this patchset on that debugging patch?
>
It will conflict as well with my shrinking patches: I will still keep
the memcgs in the dangling list, but that will have nothing to do with
debugging. So I will split that patch into a list management part, which
will be used, and a debugging part (with the file + the debugging
information).
I will be happy to rebase it on top of your series.
> Also that patch needs update (and can be simplified) after this patchset:
> - move memcg_dangling_add() to mem_cgroup_css_offline()
> - remove memcg->memcg_name, and use cgroup_path() in mem_cgroup_dangling_read()?
>
Don't worry about it. If it is just this one patch conflicting, I
would advise Andrew to remove it, and I will provide another (maybe two,
already split up) version.