Changes since v1:
- wrote better changelogs and added Acked-by and Reviewed-by tags
- revised some comments as suggested by Michal
- added a wmb() in kmem_cgroup_css_offline(), pointed out by Michal
- fixed a bug which caused a css_put() to never be called
Now memcg has its own refcnt, so when a cgroup is destroyed, the memcg can
still be alive. This patchset converts memcg to always use css_get/put, so
memcg will have the same life cycle as its corresponding cgroup.
The historical reason that memcg didn't use css_get in some cases is that
a cgroup couldn't be removed while there were still css refs. The situation
has changed: rmdir'ing a cgroup now succeeds regardless of css refs, but the
cgroup won't be freed until its css refcount drops to 0.
Since the introduction of kmemcg, the memcg refcnt handling has grown even more
complicated. This patchset greatly simplifies memcg's life cycle management.
Also, after those changes, we can convert memcg to use cgroup->id, and then
we can kill css_id.
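For a concrete picture of where the series ends up, here is a rough sketch
(assembled from the later patches purely for illustration, not a separate
change): memcg keeps no refcount of its own, long-lived references are plain
css references, and the object is freed from the css_free callback once the
last css reference is gone.

	/* illustrative sketch only -- see the "kill memcg refcnt" and
	 * "don't need to free memcg via RCU or workqueue" patches for
	 * the real code */
	static void mem_cgroup_css_free(struct cgroup *cont)
	{
		struct mem_cgroup *memcg = mem_cgroup_from_cont(cont);

		mem_cgroup_sockets_destroy(memcg);
		__mem_cgroup_free(memcg);	/* css refcnt is already 0 here */
	}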
This patchset is based on linux-next but with "memcg: debugging facility to access dangling memcgs."
excluded.
The first 4 patches are bug fixes that should go into 3.9, and the rest are
for 3.10. The extra patch 13/12 is for the dangling memcg debugging patch.
You'll see 2 small conflicts when you apply that debugging patch on top
of this patchset. Just move memcg_dangling_add() to mem_cgroup_css_offline()
and move memcg_dangling_free() to mem_cgroup_css_free().
Li Zefan (10):
memcg: take reference before releasing rcu_read_lock
memcg: avoid accessing memcg after releasing reference
memcg: use css_get() in sock_update_memcg()
memcg: don't use mem_cgroup_get() when creating a kmemcg cache
memcg: use css_get/put when charging/uncharging kmem
memcg: use css_get/put for swap memcg
cgroup: make sure parent won't be destroyed before its children
memcg: don't need to get a reference to the parent
memcg: kill memcg refcnt
memcg: don't need to free memcg via RCU or workqueue
Michal Hocko (2):
Revert "memcg: avoid dangling reference count in creation failure."
memcg, kmem: fix reference count handling on the error path
---
kernel/cgroup.c | 10 +++
mm/memcontrol.c | 267 ++++++++++++++++++++++++++++------------------------------------------
2 files changed, 116 insertions(+), 161 deletions(-)
The memcg is not referenced, so it can be destroyed at any time right
after we exit the RCU read-side critical section, and thus it's not safe
to access it.
To fix this, call css_tryget() to take a reference while we're still
inside the RCU read-side critical section.
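Schematically, the pattern after the fix looks like this (a simplified
sketch, not the actual hunk below; the real code also checks the cache
array and memcg_can_account_kmem() first):

	rcu_read_lock();
	memcg = mem_cgroup_from_task(rcu_dereference(current->mm->owner));
	/* pin the memcg before leaving the RCU section */
	if (!css_tryget(&memcg->css)) {
		/* the cgroup is being destroyed; fall back to the root cache */
		rcu_read_unlock();
		return cachep;
	}
	rcu_read_unlock();
	/* memcg is now safe to use; the put is done later, in the workqueue */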
Signed-off-by: Li Zefan <[email protected]>
Acked-by: Glauber Costa <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Acked-by: KAMEZAWA Hiroyuki <[email protected]>
---
mm/memcontrol.c | 63 ++++++++++++++++++++++++++++++---------------------------
1 file changed, 33 insertions(+), 30 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d280016..e054ac0 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3460,7 +3460,6 @@ static void memcg_create_cache_work_func(struct work_struct *w)
/*
* Enqueue the creation of a per-memcg kmem_cache.
- * Called with rcu_read_lock.
*/
static void __memcg_create_cache_enqueue(struct mem_cgroup *memcg,
struct kmem_cache *cachep)
@@ -3468,12 +3467,8 @@ static void __memcg_create_cache_enqueue(struct mem_cgroup *memcg,
struct create_work *cw;
cw = kmalloc(sizeof(struct create_work), GFP_NOWAIT);
- if (cw == NULL)
- return;
-
- /* The corresponding put will be done in the workqueue. */
- if (!css_tryget(&memcg->css)) {
- kfree(cw);
+ if (cw == NULL) {
+ css_put(&memcg->css);
return;
}
@@ -3529,10 +3524,9 @@ struct kmem_cache *__memcg_kmem_get_cache(struct kmem_cache *cachep,
rcu_read_lock();
memcg = mem_cgroup_from_task(rcu_dereference(current->mm->owner));
- rcu_read_unlock();
if (!memcg_can_account_kmem(memcg))
- return cachep;
+ goto out;
idx = memcg_cache_id(memcg);
@@ -3541,29 +3535,38 @@ struct kmem_cache *__memcg_kmem_get_cache(struct kmem_cache *cachep,
* code updating memcg_caches will issue a write barrier to match this.
*/
read_barrier_depends();
- if (unlikely(cachep->memcg_params->memcg_caches[idx] == NULL)) {
- /*
- * If we are in a safe context (can wait, and not in interrupt
- * context), we could be be predictable and return right away.
- * This would guarantee that the allocation being performed
- * already belongs in the new cache.
- *
- * However, there are some clashes that can arrive from locking.
- * For instance, because we acquire the slab_mutex while doing
- * kmem_cache_dup, this means no further allocation could happen
- * with the slab_mutex held.
- *
- * Also, because cache creation issue get_online_cpus(), this
- * creates a lock chain: memcg_slab_mutex -> cpu_hotplug_mutex,
- * that ends up reversed during cpu hotplug. (cpuset allocates
- * a bunch of GFP_KERNEL memory during cpuup). Due to all that,
- * better to defer everything.
- */
- memcg_create_cache_enqueue(memcg, cachep);
- return cachep;
+ if (likely(cachep->memcg_params->memcg_caches[idx])) {
+ cachep = cachep->memcg_params->memcg_caches[idx];
+ goto out;
}
- return cachep->memcg_params->memcg_caches[idx];
+ /* The corresponding put will be done in the workqueue. */
+ if (!css_tryget(&memcg->css))
+ goto out;
+ rcu_read_unlock();
+
+ /*
+ * If we are in a safe context (can wait, and not in interrupt
+ * context), we could be be predictable and return right away.
+ * This would guarantee that the allocation being performed
+ * already belongs in the new cache.
+ *
+ * However, there are some clashes that can arrive from locking.
+ * For instance, because we acquire the slab_mutex while doing
+ * kmem_cache_dup, this means no further allocation could happen
+ * with the slab_mutex held.
+ *
+ * Also, because cache creation issue get_online_cpus(), this
+ * creates a lock chain: memcg_slab_mutex -> cpu_hotplug_mutex,
+ * that ends up reversed during cpu hotplug. (cpuset allocates
+ * a bunch of GFP_KERNEL memory during cpuup). Due to all that,
+ * better to defer everything.
+ */
+ memcg_create_cache_enqueue(memcg, cachep);
+ return cachep;
+out:
+ rcu_read_unlock();
+ return cachep;
}
EXPORT_SYMBOL(__memcg_kmem_get_cache);
--
1.8.0.2
mem_cgroup_put() may drop the last reference to the memcg, but we still
take memcg->slab_caches_mutex right after it. This might cause a
use-after-free bug.
Signed-off-by: Li Zefan <[email protected]>
Acked-by: KAMEZAWA Hiroyuki <[email protected]>
Acked-by: Michal Hocko <[email protected]>
---
mm/memcontrol.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e054ac0..2364f4e 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3192,12 +3192,12 @@ void memcg_release_cache(struct kmem_cache *s)
root = s->memcg_params->root_cache;
root->memcg_params->memcg_caches[id] = NULL;
- mem_cgroup_put(memcg);
mutex_lock(&memcg->slab_caches_mutex);
list_del(&s->memcg_params->list);
mutex_unlock(&memcg->slab_caches_mutex);
+ mem_cgroup_put(memcg);
out:
kfree(s->memcg_params);
}
--
1.8.0.2
From: Michal Hocko <[email protected]>
This reverts commit e4715f01be697a3730c78f8ffffb595591d6a88c
mem_cgroup_put is hierarchy aware, so mem_cgroup_put(memcg) already drops
an additional reference from all parents; the additional
mem_cgroup_put(parent) therefore potentially causes a use-after-free.
Signed-off-by: Li Zefan <[email protected]>
Signed-off-by: Michal Hocko <[email protected]>
---
mm/memcontrol.c | 2 --
1 file changed, 2 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 2364f4e..44cec72 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6250,8 +6250,6 @@ mem_cgroup_css_online(struct cgroup *cont)
* call __mem_cgroup_free, so return directly
*/
mem_cgroup_put(memcg);
- if (parent->use_hierarchy)
- mem_cgroup_put(parent);
}
return error;
}
--
1.8.0.2
From: Michal Hocko <[email protected]>
mem_cgroup_css_online calls mem_cgroup_put if memcg_init_kmem fails.
This is not correct, because only memcg_propagate_kmem takes an
additional reference; mem_cgroup_sockets_init is allowed to fail as well
(although no current implementation fails), but it doesn't take any
reference. This all suggests that it should be memcg_propagate_kmem that
cleans up after itself, so this patch moves mem_cgroup_put over there.
Unfortunately this is not that easy (as pointed out by Li Zefan), because
memcg_kmem_mark_dead marks the group dead (KMEM_ACCOUNTED_DEAD) if it
is marked active (KMEM_ACCOUNTED_ACTIVE), which is the case even if
memcg_propagate_kmem fails, so the additional reference is dropped in
that case in kmem_cgroup_destroy, which means the reference would be
dropped two times.
The easiest way then would be to simply remove mem_cgroup_put from
mem_cgroup_css_online and rely on kmem_cgroup_destroy doing the right
thing.
Signed-off-by: Li Zefan <[email protected]>
Signed-off-by: Michal Hocko <[email protected]>
Cc: <[email protected]> # 3.8.x
---
mm/memcontrol.c | 8 --------
1 file changed, 8 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 44cec72..e65eaac 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6243,14 +6243,6 @@ mem_cgroup_css_online(struct cgroup *cont)
error = memcg_init_kmem(memcg, &mem_cgroup_subsys);
mutex_unlock(&memcg_create_mutex);
- if (error) {
- /*
- * We call put now because our (and parent's) refcnts
- * are already in place. mem_cgroup_put() will internally
- * call __mem_cgroup_free, so return directly
- */
- mem_cgroup_put(memcg);
- }
return error;
}
--
1.8.0.2
Use css_get/css_put instead of mem_cgroup_get/put.
Note, if at the same time someone is moving @current to a different
cgroup and removing the old cgroup, css_tryget() may return false,
and sock->sk_cgrp won't be initialized, which is fine.
Signed-off-by: Li Zefan <[email protected]>
Acked-by: KAMEZAWA Hiroyuki <[email protected]>
Acked-by: Michal Hocko <[email protected]>
---
mm/memcontrol.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e65eaac..bbf5bf3 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -536,15 +536,15 @@ void sock_update_memcg(struct sock *sk)
*/
if (sk->sk_cgrp) {
BUG_ON(mem_cgroup_is_root(sk->sk_cgrp->memcg));
- mem_cgroup_get(sk->sk_cgrp->memcg);
+ css_get(&sk->sk_cgrp->memcg->css);
return;
}
rcu_read_lock();
memcg = mem_cgroup_from_task(current);
cg_proto = sk->sk_prot->proto_cgroup(memcg);
- if (!mem_cgroup_is_root(memcg) && memcg_proto_active(cg_proto)) {
- mem_cgroup_get(memcg);
+ if (!mem_cgroup_is_root(memcg) &&
+ memcg_proto_active(cg_proto) && css_tryget(&memcg->css)) {
sk->sk_cgrp = cg_proto;
}
rcu_read_unlock();
@@ -558,7 +558,7 @@ void sock_release_memcg(struct sock *sk)
struct mem_cgroup *memcg;
WARN_ON(!sk->sk_cgrp->memcg);
memcg = sk->sk_cgrp->memcg;
- mem_cgroup_put(memcg);
+ css_put(&sk->sk_cgrp->memcg->css);
}
}
--
1.8.0.2
Use css_get()/css_put() instead of mem_cgroup_get()/mem_cgroup_put().
There are two things being done in the current code:
First, we acquire a css ref to make sure that the underlying cgroup
does not go away. That is a short-lived reference, and it is put as
soon as the cache is created.
At this point, we acquire a long-lived per-cache memcg reference count
to guarantee that the memcg will still be alive.
So the flow is:
enqueue: css_get
create : memcg_get, css_put
destroy: memcg_put
So we only need to get rid of the memcg_get, change the memcg_put to
css_put, and get rid of the now-extra css_put.
(This changelog is basically written by Glauber)
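After this patch (together with the earlier __memcg_kmem_get_cache()
change), the resulting flow is roughly the following illustrative summary;
it is not a real function, and the identifiers come from the hunks below:

	/* enqueue: __memcg_kmem_get_cache() pins the memcg for the worker */
	if (!css_tryget(&memcg->css))
		goto out;				/* cgroup going away */
	memcg_create_cache_enqueue(memcg, cachep);

	/* create: memcg_create_kmem_cache() either hands the reference over
	 * to the new cache, or drops it */
	new_cachep = cachep->memcg_params->memcg_caches[idx];
	if (new_cachep) {
		css_put(&memcg->css);			/* already created */
		goto out;
	}
	new_cachep = kmem_cache_dup(memcg, cachep);
	if (new_cachep == NULL) {
		css_put(&memcg->css);			/* creation failed */
		goto out;
	}

	/* destroy: memcg_release_cache() drops the per-cache reference */
	css_put(&memcg->css);

So the css reference taken at enqueue time always has exactly one owner:
first the pending work item, then the cache itself, unless it is put on
one of the failure/duplicate paths.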
Signed-off-by: Li Zefan <[email protected]>
Acked-by: Michal Hocko <[email protected]>
---
mm/memcontrol.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index bbf5bf3..c308ea0 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3197,7 +3197,7 @@ void memcg_release_cache(struct kmem_cache *s)
list_del(&s->memcg_params->list);
mutex_unlock(&memcg->slab_caches_mutex);
- mem_cgroup_put(memcg);
+ css_put(&memcg->css);
out:
kfree(s->memcg_params);
}
@@ -3356,16 +3356,18 @@ static struct kmem_cache *memcg_create_kmem_cache(struct mem_cgroup *memcg,
mutex_lock(&memcg_cache_mutex);
new_cachep = cachep->memcg_params->memcg_caches[idx];
- if (new_cachep)
+ if (new_cachep) {
+ css_put(&memcg->css);
goto out;
+ }
new_cachep = kmem_cache_dup(memcg, cachep);
if (new_cachep == NULL) {
new_cachep = cachep;
+ css_put(&memcg->css);
goto out;
}
- mem_cgroup_get(memcg);
atomic_set(&new_cachep->memcg_params->nr_pages , 0);
cachep->memcg_params->memcg_caches[idx] = new_cachep;
@@ -3453,8 +3455,6 @@ static void memcg_create_cache_work_func(struct work_struct *w)
cw = container_of(w, struct create_work, work);
memcg_create_kmem_cache(cw->memcg, cw->cachep);
- /* Drop the reference gotten when we enqueued. */
- css_put(&cw->memcg->css);
kfree(cw);
}
--
1.8.0.2
Use css_get/put instead of mem_cgroup_get/put.
We can't do a simple replacement, because here mem_cgroup_put()
is called during mem_cgroup_css_free(), while mem_cgroup_css_free()
won't be called until the css refcnt goes down to 0.
Instead we increment the css refcnt in mem_cgroup_css_offline(), and
then check whether there are still kmem charges. If not, the css refcnt
is decremented immediately; otherwise it won't be decremented until the
kmem charges go down to 0.
v2:
- added wmb() in kmem_cgroup_css_offline(), pointed out by Michal
- revised comments as suggested by Michal
- fixed to check if kmem is activated in kmem_cgroup_css_offline()
Signed-off-by: Li Zefan <[email protected]>
---
mm/memcontrol.c | 66 +++++++++++++++++++++++++++++++++++----------------------
1 file changed, 41 insertions(+), 25 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c308ea0..7be796c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3003,8 +3003,16 @@ static void memcg_uncharge_kmem(struct mem_cgroup *memcg, u64 size)
if (res_counter_uncharge(&memcg->kmem, size))
return;
+ /*
+ * Releases a reference taken in kmem_cgroup_css_offline in case
+ * this last uncharge is racing with the offlining code or it is
+ * outliving the memcg existence.
+ *
+ * The memory barrier imposed by test&clear is paired with the
+ * explicit one in kmem_cgroup_css_offline.
+ */
if (memcg_kmem_test_and_clear_dead(memcg))
- mem_cgroup_put(memcg);
+ css_put(&memcg->css);
}
void memcg_cache_list_add(struct mem_cgroup *memcg, struct kmem_cache *cachep)
@@ -5090,14 +5098,6 @@ static int memcg_update_kmem_limit(struct cgroup *cont, u64 val)
* starts accounting before all call sites are patched
*/
memcg_kmem_set_active(memcg);
-
- /*
- * kmem charges can outlive the cgroup. In the case of slab
- * pages, for instance, a page contain objects from various
- * processes, so it is unfeasible to migrate them away. We
- * need to reference count the memcg because of that.
- */
- mem_cgroup_get(memcg);
} else
ret = res_counter_set_limit(&memcg->kmem, val);
out:
@@ -5130,12 +5130,10 @@ static int memcg_propagate_kmem(struct mem_cgroup *memcg)
goto out;
/*
- * destroy(), called if we fail, will issue static_key_slow_inc() and
- * mem_cgroup_put() if kmem is enabled. We have to either call them
- * unconditionally, or clear the KMEM_ACTIVE flag. I personally find
- * this more consistent, since it always leads to the same destroy path
+ * __mem_cgroup_free() will issue static_key_slow_dec() because this
+ * memcg is active already. If the later initialization fails then the
+ * cgroup core triggers the cleanup so we do not have to do it here.
*/
- mem_cgroup_get(memcg);
static_key_slow_inc(&memcg_kmem_enabled_key);
mutex_lock(&set_limit_mutex);
@@ -5818,23 +5816,39 @@ static int memcg_init_kmem(struct mem_cgroup *memcg, struct cgroup_subsys *ss)
return mem_cgroup_sockets_init(memcg, ss);
};
-static void kmem_cgroup_destroy(struct mem_cgroup *memcg)
+static void kmem_cgroup_css_offline(struct mem_cgroup *memcg)
{
- mem_cgroup_sockets_destroy(memcg);
+ if (!memcg_kmem_is_active(memcg))
+ return;
+ /*
+ * kmem charges can outlive the cgroup. In the case of slab
+ * pages, for instance, a page contain objects from various
+ * processes. As we prevent from taking a reference for every
+ * such allocation we have to be careful when doing uncharge
+ * (see memcg_uncharge_kmem) and here during offlining.
+ *
+ * The idea is that that only the _last_ uncharge which sees
+ * the dead memcg will drop the last reference. An additional
+ * reference is taken here before the group is marked dead
+ * which is then paired with css_put during uncharge resp. here.
+ *
+ * Although this might sound strange as this path is called when
+ * the reference has already dropped down to 0 and shouldn't be
+ * incremented anymore (css_tryget would fail) we do not have
+ * other options because of the kmem allocations lifetime.
+ */
+ css_get(&memcg->css);
+
+ /* see comment in memcg_uncharge_kmem() */
+ wmb();
memcg_kmem_mark_dead(memcg);
if (res_counter_read_u64(&memcg->kmem, RES_USAGE) != 0)
return;
- /*
- * Charges already down to 0, undo mem_cgroup_get() done in the charge
- * path here, being careful not to race with memcg_uncharge_kmem: it is
- * possible that the charges went down to 0 between mark_dead and the
- * res_counter read, so in that case, we don't need the put
- */
if (memcg_kmem_test_and_clear_dead(memcg))
- mem_cgroup_put(memcg);
+ css_put(&memcg->css);
}
#else
static int memcg_init_kmem(struct mem_cgroup *memcg, struct cgroup_subsys *ss)
@@ -5842,7 +5856,7 @@ static int memcg_init_kmem(struct mem_cgroup *memcg, struct cgroup_subsys *ss)
return 0;
}
-static void kmem_cgroup_destroy(struct mem_cgroup *memcg)
+static void kmem_cgroup_css_offline(struct mem_cgroup *memcg)
{
}
#endif
@@ -6268,6 +6282,8 @@ static void mem_cgroup_css_offline(struct cgroup *cont)
{
struct mem_cgroup *memcg = mem_cgroup_from_cont(cont);
+ kmem_cgroup_css_offline(memcg);
+
mem_cgroup_invalidate_reclaim_iterators(memcg);
mem_cgroup_reparent_charges(memcg);
mem_cgroup_destroy_all_caches(memcg);
@@ -6277,7 +6293,7 @@ static void mem_cgroup_css_free(struct cgroup *cont)
{
struct mem_cgroup *memcg = mem_cgroup_from_cont(cont);
- kmem_cgroup_destroy(memcg);
+ mem_cgroup_sockets_destroy(memcg);
mem_cgroup_put(memcg);
}
--
1.8.0.2
Use css_get/put instead of mem_cgroup_get/put. A simple replacement
will do.
The historical reason that memcg has its own refcnt instead of always
using css_get/put is that a cgroup couldn't be removed while there were
still css refs, so css refs couldn't be used as long-lived references. The
situation has changed: rmdir'ing a cgroup now succeeds regardless of css
refs, but the cgroup won't be freed until its css refcount drops to 0.
Signed-off-by: Li Zefan <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Acked-by: KAMEZAWA Hiroyuki <[email protected]>
---
mm/memcontrol.c | 26 ++++++++++++++++----------
1 file changed, 16 insertions(+), 10 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 7be796c..7fdd69d 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4149,12 +4149,12 @@ __mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype,
unlock_page_cgroup(pc);
/*
* even after unlock, we have memcg->res.usage here and this memcg
- * will never be freed.
+ * will never be freed, so it's safe to call css_get().
*/
memcg_check_events(memcg, page);
if (do_swap_account && ctype == MEM_CGROUP_CHARGE_TYPE_SWAPOUT) {
mem_cgroup_swap_statistics(memcg, true);
- mem_cgroup_get(memcg);
+ css_get(&memcg->css);
}
/*
* Migration does not charge the res_counter for the
@@ -4254,7 +4254,7 @@ mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent, bool swapout)
/*
* record memcg information, if swapout && memcg != NULL,
- * mem_cgroup_get() was called in uncharge().
+ * css_get() was called in uncharge().
*/
if (do_swap_account && swapout && memcg)
swap_cgroup_record(ent, css_id(&memcg->css));
@@ -4285,7 +4285,7 @@ void mem_cgroup_uncharge_swap(swp_entry_t ent)
if (!mem_cgroup_is_root(memcg))
res_counter_uncharge(&memcg->memsw, PAGE_SIZE);
mem_cgroup_swap_statistics(memcg, false);
- mem_cgroup_put(memcg);
+ css_put(&memcg->css);
}
rcu_read_unlock();
}
@@ -4319,11 +4319,14 @@ static int mem_cgroup_move_swap_account(swp_entry_t entry,
* This function is only called from task migration context now.
* It postpones res_counter and refcount handling till the end
* of task migration(mem_cgroup_clear_mc()) for performance
- * improvement. But we cannot postpone mem_cgroup_get(to)
- * because if the process that has been moved to @to does
- * swap-in, the refcount of @to might be decreased to 0.
+ * improvement. But we cannot postpone css_get(to) because if
+ * the process that has been moved to @to does swap-in, the
+ * refcount of @to might be decreased to 0.
+ *
+ * We are in attach() phase, so the cgroup is guaranteed to be
+ * alive, so we can just call css_get().
*/
- mem_cgroup_get(to);
+ css_get(&to->css);
return 0;
}
return -EINVAL;
@@ -6604,6 +6607,7 @@ static void __mem_cgroup_clear_mc(void)
{
struct mem_cgroup *from = mc.from;
struct mem_cgroup *to = mc.to;
+ int i;
/* we must uncharge all the leftover precharges from mc.to */
if (mc.precharge) {
@@ -6624,7 +6628,9 @@ static void __mem_cgroup_clear_mc(void)
if (!mem_cgroup_is_root(mc.from))
res_counter_uncharge(&mc.from->memsw,
PAGE_SIZE * mc.moved_swap);
- __mem_cgroup_put(mc.from, mc.moved_swap);
+
+ for (i = 0; i < mc.moved_swap; i++)
+ css_put(&mc.from->css);
if (!mem_cgroup_is_root(mc.to)) {
/*
@@ -6634,7 +6640,7 @@ static void __mem_cgroup_clear_mc(void)
res_counter_uncharge(&mc.to->res,
PAGE_SIZE * mc.moved_swap);
}
- /* we've already done mem_cgroup_get(mc.to) */
+ /* we've already done css_get(mc.to) */
mc.moved_swap = 0;
}
memcg_oom_recover(from);
--
1.8.0.2
Suppose we rmdir a cgroup while there are still css refs; this cgroup won't
be freed. Then we rmdir the parent cgroup, and the parent is freed
immediately because its css refcount drops to 0. Now it would be a disaster
if the still-alive child cgroup tried to access its freed parent.
Make sure this won't happen.
Signed-off-by: Li Zefan <[email protected]>
Reviewed-by: Michal Hocko <[email protected]>
Acked-by: KAMEZAWA Hiroyuki <[email protected]>
---
kernel/cgroup.c | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 06aeb42..7ee3bdf 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -888,6 +888,13 @@ static void cgroup_free_fn(struct work_struct *work)
mutex_unlock(&cgroup_mutex);
/*
+ * We get a ref to the parent's dentry, and put the ref when
+ * this cgroup is being freed, so it's guaranteed that the
+ * parent won't be destroyed before its children.
+ */
+ dput(cgrp->parent->dentry);
+
+ /*
* Drop the active superblock reference that we took when we
* created the cgroup
*/
@@ -4170,6 +4177,9 @@ static long cgroup_create(struct cgroup *parent, struct dentry *dentry,
for_each_subsys(root, ss)
dget(dentry);
+ /* hold a ref to the parent's dentry */
+ dget(parent->dentry);
+
/* creation succeeded, notify subsystems */
for_each_subsys(root, ss) {
err = online_css(ss, cgrp);
--
1.8.0.2
The cgroup core guarantees it's always safe to access the parent.
v2:
- added a comment in mem_cgroup_put() as suggested by Michal
- removed mem_cgroup_get(), otherwise gcc will warn that it's not used
Signed-off-by: Li Zefan <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Acked-by: KAMEZAWA Hiroyuki <[email protected]>
---
mm/memcontrol.c | 19 +++----------------
1 file changed, 3 insertions(+), 16 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 7fdd69d..9ca5a99 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -501,7 +501,6 @@ enum res_type {
*/
static DEFINE_MUTEX(memcg_create_mutex);
-static void mem_cgroup_get(struct mem_cgroup *memcg);
static void mem_cgroup_put(struct mem_cgroup *memcg);
static inline
@@ -6125,19 +6124,10 @@ static void free_rcu(struct rcu_head *rcu_head)
schedule_work(&memcg->work_freeing);
}
-static void mem_cgroup_get(struct mem_cgroup *memcg)
-{
- atomic_inc(&memcg->refcnt);
-}
-
static void __mem_cgroup_put(struct mem_cgroup *memcg, int count)
{
- if (atomic_sub_and_test(count, &memcg->refcnt)) {
- struct mem_cgroup *parent = parent_mem_cgroup(memcg);
+ if (atomic_sub_and_test(count, &memcg->refcnt))
call_rcu(&memcg->rcu_freeing, free_rcu);
- if (parent)
- mem_cgroup_put(parent);
- }
}
static void mem_cgroup_put(struct mem_cgroup *memcg)
@@ -6239,12 +6229,9 @@ mem_cgroup_css_online(struct cgroup *cont)
res_counter_init(&memcg->kmem, &parent->kmem);
/*
- * We increment refcnt of the parent to ensure that we can
- * safely access it on res_counter_charge/uncharge.
- * This refcnt will be decremented when freeing this
- * mem_cgroup(see mem_cgroup_put).
+ * No need to take a reference to the parent because cgroup
+ * core guarantees its existence.
*/
- mem_cgroup_get(parent);
} else {
res_counter_init(&memcg->res, NULL);
res_counter_init(&memcg->memsw, NULL);
--
1.8.0.2
Now memcg has the same life cycle as its corresponding cgroup.
Kill the useless refcnt.
Signed-off-by: Li Zefan <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Acked-by: KAMEZAWA Hiroyuki <[email protected]>
---
mm/memcontrol.c | 18 +-----------------
1 file changed, 1 insertion(+), 17 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 9ca5a99..a6d44bc 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -297,8 +297,6 @@ struct mem_cgroup {
bool oom_lock;
atomic_t under_oom;
- atomic_t refcnt;
-
int swappiness;
/* OOM-Killer disable */
int oom_kill_disable;
@@ -501,8 +499,6 @@ enum res_type {
*/
static DEFINE_MUTEX(memcg_create_mutex);
-static void mem_cgroup_put(struct mem_cgroup *memcg);
-
static inline
struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *s)
{
@@ -6124,17 +6120,6 @@ static void free_rcu(struct rcu_head *rcu_head)
schedule_work(&memcg->work_freeing);
}
-static void __mem_cgroup_put(struct mem_cgroup *memcg, int count)
-{
- if (atomic_sub_and_test(count, &memcg->refcnt))
- call_rcu(&memcg->rcu_freeing, free_rcu);
-}
-
-static void mem_cgroup_put(struct mem_cgroup *memcg)
-{
- __mem_cgroup_put(memcg, 1);
-}
-
/*
* Returns the parent mem_cgroup in memcgroup hierarchy with hierarchy enabled.
*/
@@ -6194,7 +6179,6 @@ mem_cgroup_css_alloc(struct cgroup *cont)
memcg->last_scanned_node = MAX_NUMNODES;
INIT_LIST_HEAD(&memcg->oom_notify);
- atomic_set(&memcg->refcnt, 1);
memcg->move_charge_at_immigrate = 0;
mutex_init(&memcg->thresholds_lock);
spin_lock_init(&memcg->move_lock);
@@ -6285,7 +6269,7 @@ static void mem_cgroup_css_free(struct cgroup *cont)
mem_cgroup_sockets_destroy(memcg);
- mem_cgroup_put(memcg);
+ call_rcu(&memcg->rcu_freeing, free_rcu);
}
#ifdef CONFIG_MMU
--
1.8.0.2
Now that memcg has the same life cycle as its corresponding cgroup,
we don't have to save the cgroup path name in memcg->memcg_name.
Signed-off-by: Li Zefan <[email protected]>
---
mm/memcontrol.c | 65 +++++++++++++++++++++------------------------------------
1 file changed, 24 insertions(+), 41 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index aeab1d3..06e995e 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -306,20 +306,12 @@ struct mem_cgroup {
struct list_head dead;
};
- union {
- /*
- * Should we move charges of a task when a task is moved into
- * this mem_cgroup ? And what type of charges should we move ?
- */
- unsigned long move_charge_at_immigrate;
+ /*
+ * Should we move charges of a task when a task is moved into
+ * this mem_cgroup ? And what type of charges should we move ?
+ */
+ unsigned long move_charge_at_immigrate;
- /*
- * We are no longer concerned about moving charges after memcg
- * is dead. So we will fill this up with its name, to aid
- * debugging.
- */
- char *memcg_name;
- };
/*
* set > 0 if pages under this cgroup are moving to other cgroup.
*/
@@ -381,36 +373,10 @@ static inline void memcg_dangling_free(struct mem_cgroup *memcg)
mutex_lock(&dangling_memcgs_mutex);
list_del(&memcg->dead);
mutex_unlock(&dangling_memcgs_mutex);
- free_pages((unsigned long)memcg->memcg_name, 0);
}
static inline void memcg_dangling_add(struct mem_cgroup *memcg)
{
- /*
- * cgroup.c will do page-sized allocations most of the time,
- * so we'll just follow the pattern. Also, __get_free_pages
- * is a better interface than kmalloc for us here, because
- * we'd like this memory to be always billed to the root cgroup,
- * not to the process removing the memcg. While kmalloc would
- * require us to wrap it into memcg_stop/resume_kmem_account,
- * with __get_free_pages we just don't pass the memcg flag.
- */
- memcg->memcg_name = (char *)__get_free_pages(GFP_KERNEL, 0);
-
- /*
- * we will, in general, just ignore failures. No need to go crazy,
- * being this just a debugging interface. It is nice to copy a memcg
- * name over, but if we (unlikely) can't, just the address will do
- */
- if (!memcg->memcg_name)
- goto add_list;
-
- if (cgroup_path(memcg->css.cgroup, memcg->memcg_name, PAGE_SIZE) < 0) {
- free_pages((unsigned long)memcg->memcg_name, 0);
- memcg->memcg_name = NULL;
- }
-
-add_list:
INIT_LIST_HEAD(&memcg->dead);
mutex_lock(&dangling_memcgs_mutex);
list_add(&memcg->dead, &dangling_memcgs);
@@ -5188,12 +5154,28 @@ static int mem_cgroup_dangling_read(struct cgroup *cont, struct cftype *cft,
struct seq_file *m)
{
struct mem_cgroup *memcg;
+ char *memcg_name;
+ int ret;
+
+ /*
+ * cgroup.c will do page-sized allocations most of the time,
+ * so we'll just follow the pattern. Also, __get_free_pages
+ * is a better interface than kmalloc for us here, because
+ * we'd like this memory to be always billed to the root cgroup,
+ * not to the process removing the memcg. While kmalloc would
+ * require us to wrap it into memcg_stop/resume_kmem_account,
+ * with __get_free_pages we just don't pass the memcg flag.
+ */
+ memcg_name = (char *)__get_free_pages(GFP_KERNEL, 0);
+ if (!memcg_name)
+ return -ENOMEM;
mutex_lock(&dangling_memcgs_mutex);
list_for_each_entry(memcg, &dangling_memcgs, dead) {
- if (memcg->memcg_name)
- seq_printf(m, "%s:\n", memcg->memcg_name);
+ ret = cgroup_path(memcg->css.cgroup, memcg_name, PAGE_SIZE);
+ if (!ret)
+ seq_printf(m, "%s:\n", memcg_name);
else
seq_printf(m, "%p (name lost):\n", memcg);
@@ -5203,6 +5185,7 @@ static int mem_cgroup_dangling_read(struct cgroup *cont, struct cftype *cft,
}
mutex_unlock(&dangling_memcgs_mutex);
+ free_pages((unsigned long)memcg_name, 0);
return 0;
}
#endif
--
1.8.0.2
Now memcg has the same life cycle as its corresponding cgroup, and
a cgroup is freed via RCU and then mem_cgroup_css_free() is called
in a work function, so we can simply call __mem_cgroup_free() in
mem_cgroup_css_free().
This actually reverts 59927fb984de1703c67bc640c3e522d8b5276c73
("memcg: free mem_cgroup by RCU to fix oops").
Cc: Hugh Dickins <[email protected]>
Signed-off-by: Li Zefan <[email protected]>
---
mm/memcontrol.c | 51 +++++----------------------------------------------
1 file changed, 5 insertions(+), 46 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index a6d44bc..5aa6e91 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -261,28 +261,10 @@ struct mem_cgroup {
*/
struct res_counter res;
- union {
- /*
- * the counter to account for mem+swap usage.
- */
- struct res_counter memsw;
-
- /*
- * rcu_freeing is used only when freeing struct mem_cgroup,
- * so put it into a union to avoid wasting more memory.
- * It must be disjoint from the css field. It could be
- * in a union with the res field, but res plays a much
- * larger part in mem_cgroup life than memsw, and might
- * be of interest, even at time of free, when debugging.
- * So share rcu_head with the less interesting memsw.
- */
- struct rcu_head rcu_freeing;
- /*
- * We also need some space for a worker in deferred freeing.
- * By the time we call it, rcu_freeing is no longer in use.
- */
- struct work_struct work_freeing;
- };
+ /*
+ * the counter to account for mem+swap usage.
+ */
+ struct res_counter memsw;
/*
* the counter to account for kernel memory usage.
@@ -6097,29 +6079,6 @@ static void __mem_cgroup_free(struct mem_cgroup *memcg)
vfree(memcg);
}
-
-/*
- * Helpers for freeing a kmalloc()ed/vzalloc()ed mem_cgroup by RCU,
- * but in process context. The work_freeing structure is overlaid
- * on the rcu_freeing structure, which itself is overlaid on memsw.
- */
-static void free_work(struct work_struct *work)
-{
- struct mem_cgroup *memcg;
-
- memcg = container_of(work, struct mem_cgroup, work_freeing);
- __mem_cgroup_free(memcg);
-}
-
-static void free_rcu(struct rcu_head *rcu_head)
-{
- struct mem_cgroup *memcg;
-
- memcg = container_of(rcu_head, struct mem_cgroup, rcu_freeing);
- INIT_WORK(&memcg->work_freeing, free_work);
- schedule_work(&memcg->work_freeing);
-}
-
/*
* Returns the parent mem_cgroup in memcgroup hierarchy with hierarchy enabled.
*/
@@ -6269,7 +6228,7 @@ static void mem_cgroup_css_free(struct cgroup *cont)
mem_cgroup_sockets_destroy(memcg);
- call_rcu(&memcg->rcu_freeing, free_rcu);
+ __mem_cgroup_free(memcg);
}
#ifdef CONFIG_MMU
--
1.8.0.2
On Mon 08-04-13 14:34:35, Li Zefan wrote:
> Use css_get/put instead of mem_cgroup_get/put.
>
> We can't do a simple replacement, because here mem_cgroup_put()
> is called during mem_cgroup_css_free(), while mem_cgroup_css_free()
> won't be called until css refcnt goes down to 0.
>
> Instead we increment css refcnt in mem_cgroup_css_offline(), and
> then check if there's still kmem charges. If not, css refcnt will
> be decremented immediately, otherwise the refcnt won't be decremented
> when kmem charges goes down to 0.
>
> v2:
> - added wmb() in kmem_cgroup_css_offline(), pointed out by Michal
> - revised comments as suggested by Michal
> - fixed to check if kmem is activated in kmem_cgroup_css_offline()
>
> Signed-off-by: Li Zefan <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Thanks!
> ---
> mm/memcontrol.c | 66 +++++++++++++++++++++++++++++++++++----------------------
> 1 file changed, 41 insertions(+), 25 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index c308ea0..7be796c 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -3003,8 +3003,16 @@ static void memcg_uncharge_kmem(struct mem_cgroup *memcg, u64 size)
> if (res_counter_uncharge(&memcg->kmem, size))
> return;
>
> + /*
> + * Releases a reference taken in kmem_cgroup_css_offline in case
> + * this last uncharge is racing with the offlining code or it is
> + * outliving the memcg existence.
> + *
> + * The memory barrier imposed by test&clear is paired with the
> + * explicit one in kmem_cgroup_css_offline.
> + */
> if (memcg_kmem_test_and_clear_dead(memcg))
> - mem_cgroup_put(memcg);
> + css_put(&memcg->css);
> }
>
> void memcg_cache_list_add(struct mem_cgroup *memcg, struct kmem_cache *cachep)
> @@ -5090,14 +5098,6 @@ static int memcg_update_kmem_limit(struct cgroup *cont, u64 val)
> * starts accounting before all call sites are patched
> */
> memcg_kmem_set_active(memcg);
> -
> - /*
> - * kmem charges can outlive the cgroup. In the case of slab
> - * pages, for instance, a page contain objects from various
> - * processes, so it is unfeasible to migrate them away. We
> - * need to reference count the memcg because of that.
> - */
> - mem_cgroup_get(memcg);
> } else
> ret = res_counter_set_limit(&memcg->kmem, val);
> out:
> @@ -5130,12 +5130,10 @@ static int memcg_propagate_kmem(struct mem_cgroup *memcg)
> goto out;
>
> /*
> - * destroy(), called if we fail, will issue static_key_slow_inc() and
> - * mem_cgroup_put() if kmem is enabled. We have to either call them
> - * unconditionally, or clear the KMEM_ACTIVE flag. I personally find
> - * this more consistent, since it always leads to the same destroy path
> + * __mem_cgroup_free() will issue static_key_slow_dec() because this
> + * memcg is active already. If the later initialization fails then the
> + * cgroup core triggers the cleanup so we do not have to do it here.
> */
> - mem_cgroup_get(memcg);
> static_key_slow_inc(&memcg_kmem_enabled_key);
>
> mutex_lock(&set_limit_mutex);
> @@ -5818,23 +5816,39 @@ static int memcg_init_kmem(struct mem_cgroup *memcg, struct cgroup_subsys *ss)
> return mem_cgroup_sockets_init(memcg, ss);
> };
>
> -static void kmem_cgroup_destroy(struct mem_cgroup *memcg)
> +static void kmem_cgroup_css_offline(struct mem_cgroup *memcg)
> {
> - mem_cgroup_sockets_destroy(memcg);
> + if (!memcg_kmem_is_active(memcg))
> + return;
>
> + /*
> + * kmem charges can outlive the cgroup. In the case of slab
> + * pages, for instance, a page contain objects from various
> + * processes. As we prevent from taking a reference for every
> + * such allocation we have to be careful when doing uncharge
> + * (see memcg_uncharge_kmem) and here during offlining.
> + *
> + * The idea is that that only the _last_ uncharge which sees
> + * the dead memcg will drop the last reference. An additional
> + * reference is taken here before the group is marked dead
> + * which is then paired with css_put during uncharge resp. here.
> + *
> + * Although this might sound strange as this path is called when
> + * the reference has already dropped down to 0 and shouldn't be
> + * incremented anymore (css_tryget would fail) we do not have
> + * other options because of the kmem allocations lifetime.
> + */
> + css_get(&memcg->css);
> +
> + /* see comment in memcg_uncharge_kmem() */
> + wmb();
> memcg_kmem_mark_dead(memcg);
>
> if (res_counter_read_u64(&memcg->kmem, RES_USAGE) != 0)
> return;
>
> - /*
> - * Charges already down to 0, undo mem_cgroup_get() done in the charge
> - * path here, being careful not to race with memcg_uncharge_kmem: it is
> - * possible that the charges went down to 0 between mark_dead and the
> - * res_counter read, so in that case, we don't need the put
> - */
> if (memcg_kmem_test_and_clear_dead(memcg))
> - mem_cgroup_put(memcg);
> + css_put(&memcg->css);
> }
> #else
> static int memcg_init_kmem(struct mem_cgroup *memcg, struct cgroup_subsys *ss)
> @@ -5842,7 +5856,7 @@ static int memcg_init_kmem(struct mem_cgroup *memcg, struct cgroup_subsys *ss)
> return 0;
> }
>
> -static void kmem_cgroup_destroy(struct mem_cgroup *memcg)
> +static void kmem_cgroup_css_offline(struct mem_cgroup *memcg)
> {
> }
> #endif
> @@ -6268,6 +6282,8 @@ static void mem_cgroup_css_offline(struct cgroup *cont)
> {
> struct mem_cgroup *memcg = mem_cgroup_from_cont(cont);
>
> + kmem_cgroup_css_offline(memcg);
> +
> mem_cgroup_invalidate_reclaim_iterators(memcg);
> mem_cgroup_reparent_charges(memcg);
> mem_cgroup_destroy_all_caches(memcg);
> @@ -6277,7 +6293,7 @@ static void mem_cgroup_css_free(struct cgroup *cont)
> {
> struct mem_cgroup *memcg = mem_cgroup_from_cont(cont);
>
> - kmem_cgroup_destroy(memcg);
> + mem_cgroup_sockets_destroy(memcg);
>
> mem_cgroup_put(memcg);
> }
> --
> 1.8.0.2
--
Michal Hocko
SUSE Labs
On Mon 08-04-13 14:36:32, Li Zefan wrote:
> Now memcg has the same life cycle with its corresponding cgroup, and
> a cgroup is freed via RCU and then mem_cgroup_css_free() is called
> in a work function, so we can simply call __mem_cgroup_free() in
> mem_cgroup_css_free().
>
> This actually reverts 59927fb984de1703c67bc640c3e522d8b5276c73
> ("memcg: free mem_cgroup by RCU to fix oops").
>
> Cc: Hugh Dickins <[email protected]>
> Signed-off-by: Li Zefan <[email protected]>
OK, makes sense after the previous changes.
Acked-by: Michal Hocko <[email protected]>
> ---
> mm/memcontrol.c | 51 +++++----------------------------------------------
> 1 file changed, 5 insertions(+), 46 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index a6d44bc..5aa6e91 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -261,28 +261,10 @@ struct mem_cgroup {
> */
> struct res_counter res;
>
> - union {
> - /*
> - * the counter to account for mem+swap usage.
> - */
> - struct res_counter memsw;
> -
> - /*
> - * rcu_freeing is used only when freeing struct mem_cgroup,
> - * so put it into a union to avoid wasting more memory.
> - * It must be disjoint from the css field. It could be
> - * in a union with the res field, but res plays a much
> - * larger part in mem_cgroup life than memsw, and might
> - * be of interest, even at time of free, when debugging.
> - * So share rcu_head with the less interesting memsw.
> - */
> - struct rcu_head rcu_freeing;
> - /*
> - * We also need some space for a worker in deferred freeing.
> - * By the time we call it, rcu_freeing is no longer in use.
> - */
> - struct work_struct work_freeing;
> - };
> + /*
> + * the counter to account for mem+swap usage.
> + */
> + struct res_counter memsw;
>
> /*
> * the counter to account for kernel memory usage.
> @@ -6097,29 +6079,6 @@ static void __mem_cgroup_free(struct mem_cgroup *memcg)
> vfree(memcg);
> }
>
> -
> -/*
> - * Helpers for freeing a kmalloc()ed/vzalloc()ed mem_cgroup by RCU,
> - * but in process context. The work_freeing structure is overlaid
> - * on the rcu_freeing structure, which itself is overlaid on memsw.
> - */
> -static void free_work(struct work_struct *work)
> -{
> - struct mem_cgroup *memcg;
> -
> - memcg = container_of(work, struct mem_cgroup, work_freeing);
> - __mem_cgroup_free(memcg);
> -}
> -
> -static void free_rcu(struct rcu_head *rcu_head)
> -{
> - struct mem_cgroup *memcg;
> -
> - memcg = container_of(rcu_head, struct mem_cgroup, rcu_freeing);
> - INIT_WORK(&memcg->work_freeing, free_work);
> - schedule_work(&memcg->work_freeing);
> -}
> -
> /*
> * Returns the parent mem_cgroup in memcgroup hierarchy with hierarchy enabled.
> */
> @@ -6269,7 +6228,7 @@ static void mem_cgroup_css_free(struct cgroup *cont)
>
> mem_cgroup_sockets_destroy(memcg);
>
> - call_rcu(&memcg->rcu_freeing, free_rcu);
> + __mem_cgroup_free(memcg);
> }
>
> #ifdef CONFIG_MMU
> --
> 1.8.0.2
--
Michal Hocko
SUSE Labs
On Mon 08-04-13 14:36:52, Li Zefan wrote:
[...]
> @@ -5188,12 +5154,28 @@ static int mem_cgroup_dangling_read(struct cgroup *cont, struct cftype *cft,
> struct seq_file *m)
> {
> struct mem_cgroup *memcg;
> + char *memcg_name;
> + int ret;
The interface is only for debugging, all right, but that doesn't mean we
should allocate a buffer for each read. Why cannot we simply use
cgroup_path for seq_printf directly? Can we still race with the group
rename?
> +
> + /*
> + * cgroup.c will do page-sized allocations most of the time,
> + * so we'll just follow the pattern. Also, __get_free_pages
> + * is a better interface than kmalloc for us here, because
> + * we'd like this memory to be always billed to the root cgroup,
> + * not to the process removing the memcg. While kmalloc would
> + * require us to wrap it into memcg_stop/resume_kmem_account,
> + * with __get_free_pages we just don't pass the memcg flag.
> + */
> + memcg_name = (char *)__get_free_pages(GFP_KERNEL, 0);
> + if (!memcg_name)
> + return -ENOMEM;
>
> mutex_lock(&dangling_memcgs_mutex);
>
> list_for_each_entry(memcg, &dangling_memcgs, dead) {
> - if (memcg->memcg_name)
> - seq_printf(m, "%s:\n", memcg->memcg_name);
> + ret = cgroup_path(memcg->css.cgroup, memcg_name, PAGE_SIZE);
> + if (!ret)
> + seq_printf(m, "%s:\n", memcg_name);
> else
> seq_printf(m, "%p (name lost):\n", memcg);
>
> @@ -5203,6 +5185,7 @@ static int mem_cgroup_dangling_read(struct cgroup *cont, struct cftype *cft,
> }
>
> mutex_unlock(&dangling_memcgs_mutex);
> + free_pages((unsigned long)memcg_name, 0);
> return 0;
> }
> #endif
> --
> 1.8.0.2
--
Michal Hocko
SUSE Labs
On Mon, Apr 08, 2013 at 02:35:02PM +0800, Li Zefan wrote:
> Suppose we rmdir a cgroup and there're still css refs, this cgroup won't
> be freed. Then we rmdir the parent cgroup, and the parent is freed
> immediately due to css ref draining to 0. Now it would be a disaster if
> the still-alive child cgroup tries to access its parent.
>
> Make sure this won't happen.
>
> Signed-off-by: Li Zefan <[email protected]>
> Reviewed-by: Michal Hocko <[email protected]>
> Acked-by: KAMEZAWA Hiroyuki <[email protected]>
Acked-by: Tejun Heo <[email protected]>
Thanks.
--
tejun
On 2013/4/8 22:25, Michal Hocko wrote:
> On Mon 08-04-13 14:36:52, Li Zefan wrote:
> [...]
>> @@ -5188,12 +5154,28 @@ static int mem_cgroup_dangling_read(struct cgroup *cont, struct cftype *cft,
>> struct seq_file *m)
>> {
>> struct mem_cgroup *memcg;
>> + char *memcg_name;
>> + int ret;
>
> The interface is only for debugging, all right, but that doesn't mean we
> should allocate a buffer for each read. Why cannot we simply use
> cgroup_path for seq_printf directly? Can we still race with the group
> rename?
Because cgroup_path() requires the caller to pass a buffer to it.
>
>> +
>> + /*
>> + * cgroup.c will do page-sized allocations most of the time,
>> + * so we'll just follow the pattern. Also, __get_free_pages
>> + * is a better interface than kmalloc for us here, because
>> + * we'd like this memory to be always billed to the root cgroup,
>> + * not to the process removing the memcg. While kmalloc would
>> + * require us to wrap it into memcg_stop/resume_kmem_account,
>> + * with __get_free_pages we just don't pass the memcg flag.
>> + */
>> + memcg_name = (char *)__get_free_pages(GFP_KERNEL, 0);
>> + if (!memcg_name)
>> + return -ENOMEM;
>>
>> mutex_lock(&dangling_memcgs_mutex);
>>
>> list_for_each_entry(memcg, &dangling_memcgs, dead) {
>> - if (memcg->memcg_name)
>> - seq_printf(m, "%s:\n", memcg->memcg_name);
>> + ret = cgroup_path(memcg->css.cgroup, memcg_name, PAGE_SIZE);
>> + if (!ret)
>> + seq_printf(m, "%s:\n", memcg_name);
>> else
>> seq_printf(m, "%p (name lost):\n", memcg);
>>
>> @@ -5203,6 +5185,7 @@ static int mem_cgroup_dangling_read(struct cgroup *cont, struct cftype *cft,
>> }
>>
>> mutex_unlock(&dangling_memcgs_mutex);
>> + free_pages((unsigned long)memcg_name, 0);
>> return 0;
>> }
>> #endif
(2013/04/08 15:33), Li Zefan wrote:
> From: Michal Hocko <[email protected]>
>
> This reverts commit e4715f01be697a3730c78f8ffffb595591d6a88c
>
> mem_cgroup_put is hierarchy aware so mem_cgroup_put(memcg) already drops
> an additional reference from all parents so the additional
> mem_cgrroup_put(parent) potentially causes use-after-free.
>
> Signed-off-by: Li Zefan <[email protected]>
> Signed-off-by: Michal Hocko <[email protected]>
Acked-by: KAMEZAWA Hiroyuki <[email protected]>
(2013/04/08 15:33), Li Zefan wrote:
> From: Michal Hocko <[email protected]>
>
> mem_cgroup_css_online calls mem_cgroup_put if memcg_init_kmem
> fails. This is not correct because only memcg_propagate_kmem takes an
> additional reference while mem_cgroup_sockets_init is allowed to fail as
> well (although no current implementation fails) but it doesn't take any
> reference. This all suggests that it should be memcg_propagate_kmem that
> should clean up after itself so this patch moves mem_cgroup_put over
> there.
>
> Unfortunately this is not that easy (as pointed out by Li Zefan) because
> memcg_kmem_mark_dead marks the group dead (KMEM_ACCOUNTED_DEAD) if it
> is marked active (KMEM_ACCOUNTED_ACTIVE) which is the case even if
> memcg_propagate_kmem fails so the additional reference is dropped in
> that case in kmem_cgroup_destroy which means that the reference would be
> dropped two times.
>
> The easiest way then would be to simply remove mem_cgrroup_put from
> mem_cgroup_css_online and rely on kmem_cgroup_destroy doing the right
> thing.
>
> Signed-off-by: Li Zefan <[email protected]>
> Signed-off-by: Michal Hocko <[email protected]>
> Cc: <[email protected]> # 3.8.x
Acked-by: KAMEZAWA Hiroyuki <[email protected]>
(2013/04/08 15:34), Li Zefan wrote:
> Use css_get()/css_put() instead of mem_cgroup_get()/mem_cgroup_put().
>
> There are two things being done in the current code:
>
> First, we acquired a css_ref to make sure that the underlying cgroup
> would not go away. That is a short lived reference, and it is put as
> soon as the cache is created.
>
> At this point, we acquire a long-lived per-cache memcg reference count
> to guarantee that the memcg will still be alive.
>
> so it is:
>
> enqueue: css_get
> create : memcg_get, css_put
> destroy: memcg_put
>
> So we only need to get rid of the memcg_get, change the memcg_put to
> css_put, and get rid of the now extra css_put.
>
> (This changelog is basically written by Glauber)
>
> Signed-off-by: Li Zefan <[email protected]>
> Acked-by: Michal Hocko <[email protected]>
Acked-by: KAMEZAWA Hiroyuki <[email protected]>
(2013/04/08 15:34), Li Zefan wrote:
> Use css_get/put instead of mem_cgroup_get/put.
>
> We can't do a simple replacement, because here mem_cgroup_put()
> is called during mem_cgroup_css_free(), while mem_cgroup_css_free()
> won't be called until css refcnt goes down to 0.
>
> Instead we increment css refcnt in mem_cgroup_css_offline(), and
> then check if there's still kmem charges. If not, css refcnt will
> be decremented immediately, otherwise the refcnt won't be decremented
> when kmem charges goes down to 0.
>
> v2:
> - added wmb() in kmem_cgroup_css_offline(), pointed out by Michal
> - revised comments as suggested by Michal
> - fixed to check if kmem is activated in kmem_cgroup_css_offline()
>
> Signed-off-by: Li Zefan <[email protected]>
Acked-by: KAMEZAWA Hiroyuki <[email protected]>
(2013/04/08 15:36), Li Zefan wrote:
> Now memcg has the same life cycle with its corresponding cgroup, and
> a cgroup is freed via RCU and then mem_cgroup_css_free() is called
> in a work function, so we can simply call __mem_cgroup_free() in
> mem_cgroup_css_free().
>
> This actually reverts 59927fb984de1703c67bc640c3e522d8b5276c73
> ("memcg: free mem_cgroup by RCU to fix oops").
>
> Cc: Hugh Dickins <[email protected]>
> Signed-off-by: Li Zefan <[email protected]>
Very nice.
Acked-by: KAMEZAWA Hiroyuki <[email protected]>
(2013/04/08 15:36), Li Zefan wrote:
> Now memcg has the same life cycle as its corresponding cgroup,
> we don't have to save the cgroup path name in memcg->memcg_name.
>
> Signed-off-by: Li Zefan <[email protected]>
> ---
> mm/memcontrol.c | 65 +++++++++++++++++++++------------------------------------
> 1 file changed, 24 insertions(+), 41 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index aeab1d3..06e995e 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -306,20 +306,12 @@ struct mem_cgroup {
> struct list_head dead;
> };
>
> - union {
> - /*
> - * Should we move charges of a task when a task is moved into
> - * this mem_cgroup ? And what type of charges should we move ?
> - */
> - unsigned long move_charge_at_immigrate;
> + /*
> + * Should we move charges of a task when a task is moved into
> + * this mem_cgroup ? And what type of charges should we move ?
> + */
> + unsigned long move_charge_at_immigrate;
>
> - /*
> - * We are no longer concerned about moving charges after memcg
> - * is dead. So we will fill this up with its name, to aid
> - * debugging.
> - */
> - char *memcg_name;
> - };
> /*
> * set > 0 if pages under this cgroup are moving to other cgroup.
> */
> @@ -381,36 +373,10 @@ static inline void memcg_dangling_free(struct mem_cgroup *memcg)
> mutex_lock(&dangling_memcgs_mutex);
> list_del(&memcg->dead);
> mutex_unlock(&dangling_memcgs_mutex);
> - free_pages((unsigned long)memcg->memcg_name, 0);
> }
>
> static inline void memcg_dangling_add(struct mem_cgroup *memcg)
> {
> - /*
> - * cgroup.c will do page-sized allocations most of the time,
> - * so we'll just follow the pattern. Also, __get_free_pages
> - * is a better interface than kmalloc for us here, because
> - * we'd like this memory to be always billed to the root cgroup,
> - * not to the process removing the memcg. While kmalloc would
> - * require us to wrap it into memcg_stop/resume_kmem_account,
> - * with __get_free_pages we just don't pass the memcg flag.
> - */
> - memcg->memcg_name = (char *)__get_free_pages(GFP_KERNEL, 0);
> -
> - /*
> - * we will, in general, just ignore failures. No need to go crazy,
> - * being this just a debugging interface. It is nice to copy a memcg
> - * name over, but if we (unlikely) can't, just the address will do
> - */
> - if (!memcg->memcg_name)
> - goto add_list;
> -
> - if (cgroup_path(memcg->css.cgroup, memcg->memcg_name, PAGE_SIZE) < 0) {
> - free_pages((unsigned long)memcg->memcg_name, 0);
> - memcg->memcg_name = NULL;
> - }
> -
> -add_list:
> INIT_LIST_HEAD(&memcg->dead);
> mutex_lock(&dangling_memcgs_mutex);
> list_add(&memcg->dead, &dangling_memcgs);
> @@ -5188,12 +5154,28 @@ static int mem_cgroup_dangling_read(struct cgroup *cont, struct cftype *cft,
> struct seq_file *m)
> {
> struct mem_cgroup *memcg;
> + char *memcg_name;
> + int ret;
> +
> + /*
> + * cgroup.c will do page-sized allocations most of the time,
> + * so we'll just follow the pattern. Also, __get_free_pages
> + * is a better interface than kmalloc for us here, because
> + * we'd like this memory to be always billed to the root cgroup,
> + * not to the process removing the memcg. While kmalloc would
> + * require us to wrap it into memcg_stop/resume_kmem_account,
> + * with __get_free_pages we just don't pass the memcg flag.
> + */
> + memcg_name = (char *)__get_free_pages(GFP_KERNEL, 0);
> + if (!memcg_name)
> + return -ENOMEM;
>
> mutex_lock(&dangling_memcgs_mutex);
>
> list_for_each_entry(memcg, &dangling_memcgs, dead) {
> - if (memcg->memcg_name)
> - seq_printf(m, "%s:\n", memcg->memcg_name);
> + ret = cgroup_path(memcg->css.cgroup, memcg_name, PAGE_SIZE);
> + if (!ret)
> + seq_printf(m, "%s:\n", memcg_name);
> else
> seq_printf(m, "%p (name lost):\n", memcg);
>
I'm sorry for the dumb question ... when does this error happen?
Can we get ENAMETOOLONG even with a PAGE_SIZE (>= 4096 bytes) buffer?
Thanks,
-Kame
>> @@ -5188,12 +5154,28 @@ static int mem_cgroup_dangling_read(struct cgroup *cont, struct cftype *cft,
>> struct seq_file *m)
>> {
>> struct mem_cgroup *memcg;
>> + char *memcg_name;
>> + int ret;
>> +
>> + /*
>> + * cgroup.c will do page-sized allocations most of the time,
>> + * so we'll just follow the pattern. Also, __get_free_pages
>> + * is a better interface than kmalloc for us here, because
>> + * we'd like this memory to be always billed to the root cgroup,
>> + * not to the process removing the memcg. While kmalloc would
>> + * require us to wrap it into memcg_stop/resume_kmem_account,
>> + * with __get_free_pages we just don't pass the memcg flag.
>> + */
>> + memcg_name = (char *)__get_free_pages(GFP_KERNEL, 0);
>> + if (!memcg_name)
>> + return -ENOMEM;
>>
>> mutex_lock(&dangling_memcgs_mutex);
>>
>> list_for_each_entry(memcg, &dangling_memcgs, dead) {
>> - if (memcg->memcg_name)
>> - seq_printf(m, "%s:\n", memcg->memcg_name);
>> + ret = cgroup_path(memcg->css.cgroup, memcg_name, PAGE_SIZE);
>> + if (!ret)
>> + seq_printf(m, "%s:\n", memcg_name);
>> else
>> seq_printf(m, "%p (name lost):\n", memcg);
>>
>
> I'm sorry for dawm question ...when this error happens ?
> We may get ENAMETOOLONG even with PAGE_SIZE(>=4096bytes) buffer ?
>
It does no harm to check the return value, and then we don't have to
worry about whether cgroup_path() will be changed to return some other
errno, like ENOMEM, in the future.
(2013/04/09 12:18), Li Zefan wrote:
>>> @@ -5188,12 +5154,28 @@ static int mem_cgroup_dangling_read(struct cgroup *cont, struct cftype *cft,
>>> struct seq_file *m)
>>> {
>>> struct mem_cgroup *memcg;
>>> + char *memcg_name;
>>> + int ret;
>>> +
>>> + /*
>>> + * cgroup.c will do page-sized allocations most of the time,
>>> + * so we'll just follow the pattern. Also, __get_free_pages
>>> + * is a better interface than kmalloc for us here, because
>>> + * we'd like this memory to be always billed to the root cgroup,
>>> + * not to the process removing the memcg. While kmalloc would
>>> + * require us to wrap it into memcg_stop/resume_kmem_account,
>>> + * with __get_free_pages we just don't pass the memcg flag.
>>> + */
>>> + memcg_name = (char *)__get_free_pages(GFP_KERNEL, 0);
>>> + if (!memcg_name)
>>> + return -ENOMEM;
>>>
>>> mutex_lock(&dangling_memcgs_mutex);
>>>
>>> list_for_each_entry(memcg, &dangling_memcgs, dead) {
>>> - if (memcg->memcg_name)
>>> - seq_printf(m, "%s:\n", memcg->memcg_name);
>>> + ret = cgroup_path(memcg->css.cgroup, memcg_name, PAGE_SIZE);
>>> + if (!ret)
>>> + seq_printf(m, "%s:\n", memcg_name);
>>> else
>>> seq_printf(m, "%p (name lost):\n", memcg);
>>>
>>
>> I'm sorry for dawm question ...when this error happens ?
>> We may get ENAMETOOLONG even with PAGE_SIZE(>=4096bytes) buffer ?
>>
>
> It does no harm to check the return value, and we don't have to
> worry about if cgroup_path() will be changed to return some other
> errno like ENOMEM in the future.
>
Hmm, but the name is not lost, right?
How about returning an error rather than producing a mixture of lines in different formats?
Thanks,
-Kame
On Tue, Apr 09, 2013 at 11:18:21AM +0800, Li Zefan wrote:
> >> - if (memcg->memcg_name)
> >> - seq_printf(m, "%s:\n", memcg->memcg_name);
> >> + ret = cgroup_path(memcg->css.cgroup, memcg_name, PAGE_SIZE);
> >> + if (!ret)
> >> + seq_printf(m, "%s:\n", memcg_name);
> >> else
> >> seq_printf(m, "%p (name lost):\n", memcg);
> >>
> >
> > I'm sorry for dawm question ...when this error happens ?
> > We may get ENAMETOOLONG even with PAGE_SIZE(>=4096bytes) buffer ?
> >
>
> It does no harm to check the return value, and we don't have to
> worry about if cgroup_path() will be changed to return some other
> errno like ENOMEM in the future.
Maybe change the function to return the length of the path regardless
of the specified buffer length? ie. as in snprintf()?
--
tejun
On 04/09/2013 07:18 AM, Li Zefan wrote:
>>> @@ -5188,12 +5154,28 @@ static int mem_cgroup_dangling_read(struct cgroup *cont, struct cftype *cft,
>>> struct seq_file *m)
>>> {
>>> struct mem_cgroup *memcg;
>>> + char *memcg_name;
>>> + int ret;
>>> +
>>> + /*
>>> + * cgroup.c will do page-sized allocations most of the time,
>>> + * so we'll just follow the pattern. Also, __get_free_pages
>>> + * is a better interface than kmalloc for us here, because
>>> + * we'd like this memory to be always billed to the root cgroup,
>>> + * not to the process removing the memcg. While kmalloc would
>>> + * require us to wrap it into memcg_stop/resume_kmem_account,
>>> + * with __get_free_pages we just don't pass the memcg flag.
>>> + */
>>> + memcg_name = (char *)__get_free_pages(GFP_KERNEL, 0);
>>> + if (!memcg_name)
>>> + return -ENOMEM;
>>>
>>> mutex_lock(&dangling_memcgs_mutex);
>>>
>>> list_for_each_entry(memcg, &dangling_memcgs, dead) {
>>> - if (memcg->memcg_name)
>>> - seq_printf(m, "%s:\n", memcg->memcg_name);
>>> + ret = cgroup_path(memcg->css.cgroup, memcg_name, PAGE_SIZE);
>>> + if (!ret)
>>> + seq_printf(m, "%s:\n", memcg_name);
>>> else
>>> seq_printf(m, "%p (name lost):\n", memcg);
>>>
>>
>> I'm sorry for dawm question ...when this error happens ?
>> We may get ENAMETOOLONG even with PAGE_SIZE(>=4096bytes) buffer ?
>>
>
> It does no harm to check the return value, and we don't have to
> worry about if cgroup_path() will be changed to return some other
> errno like ENOMEM in the future.
>
I feel more comfortable with the check as well. This is a debugging
interface after all. People tend to resort to this when things go wrong.
So at this point, the fewer assumptions we make, the better.
On 04/09/2013 07:46 AM, Kamezawa Hiroyuki wrote:
> (2013/04/09 12:18), Li Zefan wrote:
>>>> @@ -5188,12 +5154,28 @@ static int mem_cgroup_dangling_read(struct cgroup *cont, struct cftype *cft,
>>>> struct seq_file *m)
>>>> {
>>>> struct mem_cgroup *memcg;
>>>> + char *memcg_name;
>>>> + int ret;
>>>> +
>>>> + /*
>>>> + * cgroup.c will do page-sized allocations most of the time,
>>>> + * so we'll just follow the pattern. Also, __get_free_pages
>>>> + * is a better interface than kmalloc for us here, because
>>>> + * we'd like this memory to be always billed to the root cgroup,
>>>> + * not to the process removing the memcg. While kmalloc would
>>>> + * require us to wrap it into memcg_stop/resume_kmem_account,
>>>> + * with __get_free_pages we just don't pass the memcg flag.
>>>> + */
>>>> + memcg_name = (char *)__get_free_pages(GFP_KERNEL, 0);
>>>> + if (!memcg_name)
>>>> + return -ENOMEM;
>>>>
>>>> mutex_lock(&dangling_memcgs_mutex);
>>>>
>>>> list_for_each_entry(memcg, &dangling_memcgs, dead) {
>>>> - if (memcg->memcg_name)
>>>> - seq_printf(m, "%s:\n", memcg->memcg_name);
>>>> + ret = cgroup_path(memcg->css.cgroup, memcg_name, PAGE_SIZE);
>>>> + if (!ret)
>>>> + seq_printf(m, "%s:\n", memcg_name);
>>>> else
>>>> seq_printf(m, "%p (name lost):\n", memcg);
>>>>
>>>
>>> I'm sorry for dawm question ...when this error happens ?
>>> We may get ENAMETOOLONG even with PAGE_SIZE(>=4096bytes) buffer ?
>>>
>>
>> It does no harm to check the return value, and we don't have to
>> worry about if cgroup_path() will be changed to return some other
>> errno like ENOMEM in the future.
>>
> Hmm. but the name is not lost, right ?
> How about returning error rather than making a mixture of lines in different formats ?
>
I still think it would be preferable to dump as much informative data as
we can. Even if the name is lost, a lot of info can still be present
(from the caches, etc).
The format also should not matter. Nobody will be script-processing this.