2014-07-07 12:00:30

by Vladimir Davydov

Subject: [PATCH -mm 0/8] memcg: reparent kmem on css offline

Hi,

This patch set introduces re-parenting of kmem charges on memcg css
offline. The idea behind it is simple: instead of pointing from kmem
objects (kmem caches, non-slab kmem pages) directly to the memcg which
they are charged against, we make them point to a proxy object,
mem_cgroup_kmem_context, which, in turn, points to the memcg it belongs
to. As a result, on memcg offline it's enough to re-parent only the
memcg's mem_cgroup_kmem_context.

Note, reparented kmem contexts will hang around as long as there is at
least one object accounted to them, but since they are small (especially
compared to struct mem_cgroup), that's no big deal.
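
To illustrate the idea, here is a stripped-down sketch of the indirection
(the real struct gains a few more fields, and the reparenting path takes
memcg_slab_mutex and handles already-reparented contexts - see the
individual patches):

	struct mem_cgroup_kmem_context {
		struct mem_cgroup *memcg; /* owner, or its ancestor after reparenting */
		atomic_long_t refcnt;     /* one ref per object charged to it */
	};

	/* kmem objects point to a context instead of a memcg, e.g. */
	page->memcg_kmem_ctx = ctx;
	cachep->memcg_params->ctx = ctx;

	/* so on css offline it's enough to switch a single pointer */
	ctx->memcg = parent_mem_cgroup(memcg);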

Reviews are appreciated.

Thanks,

Vladimir Davydov (8):
memcg: add pointer from memcg_cache_params to owner cache
memcg: keep all children of each root cache on a list
slab: guarantee unique kmem cache naming
slub: remove kmemcg id from create_unique_id
memcg: rework non-slab kmem pages charge path
memcg: introduce kmem context
memcg: move some kmem definitions upper
memcg: reparent kmem context on memcg offline

include/linux/memcontrol.h | 88 ++++-----
include/linux/mm_types.h | 6 +
include/linux/slab.h | 17 +-
mm/memcontrol.c | 468 +++++++++++++++++++++-----------------------
mm/page_alloc.c | 22 ++-
mm/slab.c | 40 ++--
mm/slab_common.c | 64 ++++--
mm/slub.c | 45 ++---
8 files changed, 382 insertions(+), 368 deletions(-)

--
1.7.10.4


2014-07-07 12:00:34

by Vladimir Davydov

Subject: [PATCH -mm 3/8] slab: guarantee unique kmem cache naming

Unique names are necessary to avoid sysfs name clashes in SLUB's
implementation. Currently we give per memcg caches unique names by
appending the memcg name and id to the root cache's name. However, this
won't be enough once kmem cache re-parenting is introduced, because then
the memcg id and name can be reused, resulting in a name conflict. To
solve this, let's allocate a unique id for each per memcg cache and use
it in cache names.
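
For illustration, a per memcg cache name now looks like this (the root
cache and cgroup names below are made up):

	cache_name = kasprintf(GFP_KERNEL, "%s(%d:%s)(%d)", root_cache->name,
			       memcg_cache_id(memcg), memcg_name, id);
	/* e.g. "dentry(3:foo)(42)", where 42 comes from memcg_cache_ida
	 * and is not released until slab_kmem_cache_release(), so it
	 * can't be reused while the cache exists, even after reparenting */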

Signed-off-by: Vladimir Davydov <[email protected]>
---
include/linux/slab.h | 2 ++
mm/slab_common.c | 36 ++++++++++++++++++++++++++++++------
2 files changed, 32 insertions(+), 6 deletions(-)

diff --git a/include/linux/slab.h b/include/linux/slab.h
index 640e6a655d51..c6680a885910 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -529,6 +529,7 @@ static __always_inline void *kmalloc_node(size_t size, gfp_t flags, int node)
* @root_cache: pointer to the global, root cache, this cache was derived from
* @siblings: list_head for the list of all child caches of the root_cache
* @refcnt: reference counter
+ * @id: unique id
* @dead: set to true when owner memcg is turned offline
* @unregister_work: worker to destroy the cache
*/
@@ -547,6 +548,7 @@ struct memcg_cache_params {
struct kmem_cache *root_cache;
struct list_head siblings;
atomic_long_t refcnt;
+ int id;
bool dead;
struct work_struct unregister_work;
};
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 95a8f772b0d1..20ec4d47c161 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -261,6 +261,15 @@ EXPORT_SYMBOL(kmem_cache_create);

#ifdef CONFIG_MEMCG_KMEM
/*
+ * To avoid sysfs name conflicts all kmem caches must be uniquely named.
+ * Appending cgroup id and name to per memcg caches is not enough, because they
+ * can be reused before cache is destroyed. So we assign a unique id to each
+ * per memcg cache. The ids are used for creating unique names and are
+ * allocated by the ida defined below.
+ */
+static DEFINE_IDA(memcg_cache_ida);
+
+/*
* memcg_create_kmem_cache - Create a cache for a memory cgroup.
* @memcg: The memory cgroup the new cache is for.
* @root_cache: The parent of the new cache.
@@ -276,27 +285,32 @@ struct kmem_cache *memcg_create_kmem_cache(struct mem_cgroup *memcg,
{
struct kmem_cache *s = NULL;
char *cache_name;
+ int id;

get_online_cpus();
get_online_mems();

mutex_lock(&slab_mutex);

- cache_name = kasprintf(GFP_KERNEL, "%s(%d:%s)", root_cache->name,
- memcg_cache_id(memcg), memcg_name);
- if (!cache_name)
+ id = ida_simple_get(&memcg_cache_ida, 0, 0, GFP_KERNEL);
+ if (id < 0)
goto out_unlock;

+ cache_name = kasprintf(GFP_KERNEL, "%s(%d:%s)(%d)", root_cache->name,
+ memcg_cache_id(memcg), memcg_name, id);
+ if (!cache_name)
+ goto out_free_id;
+
s = do_kmem_cache_create(cache_name, root_cache->object_size,
root_cache->size, root_cache->align,
root_cache->flags, root_cache->ctor,
memcg, root_cache);
if (IS_ERR(s)) {
- kfree(cache_name);
s = NULL;
- goto out_unlock;
+ goto out_free_name;
}

+ s->memcg_params->id = id;
list_add(&s->memcg_params->siblings,
&root_cache->memcg_params->children);

@@ -307,6 +321,12 @@ out_unlock:
put_online_cpus();

return s;
+
+out_free_name:
+ kfree(cache_name);
+out_free_id:
+ ida_simple_remove(&memcg_cache_ida, id);
+ goto out_unlock;
}

static int memcg_cleanup_cache_params(struct kmem_cache *s)
@@ -330,6 +350,11 @@ static int memcg_cleanup_cache_params(struct kmem_cache *s)

void slab_kmem_cache_release(struct kmem_cache *s)
{
+#ifdef CONFIG_MEMCG_KMEM
+ if (!is_root_cache(s))
+ ida_simple_remove(&memcg_cache_ida, s->memcg_params->id);
+#endif
+ memcg_free_cache_params(s);
kfree(s->name);
kmem_cache_free(kmem_cache, s);
}
@@ -365,7 +390,6 @@ void kmem_cache_destroy(struct kmem_cache *s)
if (s->flags & SLAB_DESTROY_BY_RCU)
rcu_barrier();

- memcg_free_cache_params(s);
#ifdef SLAB_SUPPORTS_SYSFS
sysfs_slab_remove(s);
#else
--
1.7.10.4

2014-07-07 12:00:39

by Vladimir Davydov

Subject: [PATCH -mm 4/8] slub: remove kmemcg id from create_unique_id

This function is never called for memcg caches, because they are
unmergeable, so remove the dead code.

Signed-off-by: Vladimir Davydov <[email protected]>
---
mm/slub.c | 6 ------
1 file changed, 6 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index 1821e2096cbb..81f3823f3e03 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -5153,12 +5153,6 @@ static char *create_unique_id(struct kmem_cache *s)
*p++ = '-';
p += sprintf(p, "%07d", s->size);

-#ifdef CONFIG_MEMCG_KMEM
- if (!is_root_cache(s))
- p += sprintf(p, "-%08d",
- memcg_cache_id(s->memcg_params->memcg));
-#endif
-
BUG_ON(p > name + ID_STR_LENGTH - 1);
return name;
}
--
1.7.10.4

2014-07-07 12:00:46

by Vladimir Davydov

Subject: [PATCH -mm 8/8] memcg: reparent kmem context on memcg offline

Currently, mem_cgroup_kmem_context holds a reference to the owner memcg,
so a memcg will hang around after offline until the last kmem object
charged to it is freed. This is bad, because the mem_cgroup struct is
huge, and we actually don't need most of its fields to uncharge kmem
after css offline.

This patch introduces kmem charges reparenting. The implementation is
trivial: we simply make mem_cgroup_kmem_context point to the parent memcg
when its owner is taken offline. Ongoing kmem uncharges can still go to
the old mem cgroup for some time, but by the time css free is called all
kmem uncharge paths must have been switched to the parent memcg.

Note the difference between mem/memsw charges reparenting, where we walk
over all charged pages and fix their page cgroups, and kmem reparenting,
where we only switch the memcg pointer in the kmem context. As a result,
if everything goes right, on css free we will have mem res counter usage
equal to 0, but kmem res counter usage can still be positive, because we
don't uncharge kmem from the dead memcg. In this regard, kmem
reparenting doesn't look like "real" reparenting - we don't actually
move charges, and we don't release the kmem context (and kmem caches)
until the last object is released. However, introducing "real" kmem
reparenting would require tracking all charged pages, which is not done
currently (slub doesn't track full slabs; pages allocated with
alloc_kmem_pages aren't tracked), and changing this would impact
performance. So we prefer to re-parent kmem contexts instead of dealing
with individual charges - fortunately, the kmem context struct is tiny,
and having it linger after memcg death is no big deal.
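
For example, once a context has been reparented, an uncharge of a
non-slab kmem page just follows the updated pointer and hits the
parent's res counters (simplified from the hunk below, RCU locking
omitted):

	void __memcg_uncharge_kmem_pages(struct mem_cgroup_kmem_context *ctx,
					 int order)
	{
		/* ctx->memcg points to the parent after memcg_reparent_kmem() */
		memcg_uncharge_kmem(ctx->memcg, PAGE_SIZE << order);
		memcg_put_kmem_context(ctx);
	}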

Signed-off-by: Vladimir Davydov <[email protected]>
---
mm/memcontrol.c | 161 +++++++++++++++++++++++++++++++++----------------------
1 file changed, 96 insertions(+), 65 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 21cf15184ad8..e2f8dd669063 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -269,17 +269,29 @@ struct mem_cgroup_event {
};

struct mem_cgroup_kmem_context {
- struct mem_cgroup *memcg;
+ /*
+ * the memcg pointer is updated on css offline under memcg_slab_mutex,
+ * so it is safe to access it inside an RCU critical section or with
+ * the memcg_slab_mutex held
+ */
+ struct mem_cgroup __rcu *memcg;
atomic_long_t refcnt;
/*
* true if accounting is enabled
*/
bool active;
/*
+ * list of contexts re-parented to the memcg; protected by
+ * memcg_slab_mutex
+ */
+ struct list_head list;
+ /*
* analogous to slab_common's slab_caches list, but per-memcg;
* protected by memcg_slab_mutex
*/
struct list_head slab_caches;
+
+ struct work_struct destroy_work;
};

static void mem_cgroup_threshold(struct mem_cgroup *memcg);
@@ -447,10 +459,11 @@ memcg_get_kmem_context(struct mem_cgroup *memcg)
static inline void memcg_put_kmem_context(struct mem_cgroup_kmem_context *ctx)
{
if (unlikely(atomic_long_dec_and_test(&ctx->refcnt)))
- css_put(&ctx->memcg->css); /* drop the reference taken in
- * kmem_cgroup_css_offline */
+ schedule_work(&ctx->destroy_work);
}

+static void memcg_kmem_context_destroy_func(struct work_struct *work);
+
static int memcg_alloc_kmem_context(struct mem_cgroup *memcg)
{
struct mem_cgroup_kmem_context *ctx;
@@ -462,7 +475,9 @@ static int memcg_alloc_kmem_context(struct mem_cgroup *memcg)
ctx->memcg = memcg;
atomic_long_set(&ctx->refcnt, 1);
ctx->active = false;
+ INIT_LIST_HEAD(&ctx->list);
INIT_LIST_HEAD(&ctx->slab_caches);
+ INIT_WORK(&ctx->destroy_work, memcg_kmem_context_destroy_func);

memcg->kmem_ctx = ctx;
return 0;
@@ -470,20 +485,9 @@ static int memcg_alloc_kmem_context(struct mem_cgroup *memcg)

static void memcg_release_kmem_context(struct mem_cgroup *memcg)
{
- kfree(memcg->kmem_ctx);
-}
-
-static void disarm_kmem_keys(struct mem_cgroup *memcg)
-{
- if (memcg_kmem_is_active(memcg)) {
- static_key_slow_dec(&memcg_kmem_enabled_key);
+ if (memcg_kmem_is_active(memcg))
ida_simple_remove(&kmem_limited_groups, memcg->kmemcg_id);
- }
- /*
- * This check can't live in kmem destruction function,
- * since the charges will outlive the cgroup
- */
- WARN_ON(res_counter_read_u64(&memcg->kmem, RES_USAGE) != 0);
+ memcg_put_kmem_context(memcg->kmem_ctx);
}
#else
static int memcg_alloc_kmem_context(struct mem_cgroup *memcg)
@@ -494,10 +498,6 @@ static int memcg_alloc_kmem_context(struct mem_cgroup *memcg)
static void memcg_release_kmem_context(struct mem_cgroup *memcg)
{
}
-
-static void disarm_kmem_keys(struct mem_cgroup *memcg)
-{
-}
#endif /* CONFIG_MEMCG_KMEM */

/* Stuffs for move charges at task migration. */
@@ -692,12 +692,6 @@ static void disarm_sock_keys(struct mem_cgroup *memcg)
}
#endif

-static void disarm_static_keys(struct mem_cgroup *memcg)
-{
- disarm_sock_keys(memcg);
- disarm_kmem_keys(memcg);
-}
-
static void drain_all_stock_async(struct mem_cgroup *memcg);

static struct mem_cgroup_per_zone *
@@ -2809,6 +2803,25 @@ static inline bool memcg_can_account_kmem(struct mem_cgroup *memcg)
memcg_kmem_is_active(memcg);
}

+static void memcg_kmem_context_destroy_func(struct work_struct *work)
+{
+ struct mem_cgroup_kmem_context *ctx = container_of(work,
+ struct mem_cgroup_kmem_context, destroy_work);
+
+ if (ctx->active)
+ static_key_slow_dec(&memcg_kmem_enabled_key);
+
+ mutex_lock(&memcg_slab_mutex);
+ BUG_ON(!list_empty(&ctx->slab_caches));
+ if (!list_empty(&ctx->list)) {
+ BUG_ON(ctx->memcg->kmem_ctx == ctx);
+ list_del(&ctx->list);
+ }
+ mutex_unlock(&memcg_slab_mutex);
+
+ kfree(ctx);
+}
+
#ifdef CONFIG_SLABINFO
static int mem_cgroup_slabinfo_read(struct seq_file *m, void *v)
{
@@ -3067,8 +3080,14 @@ static void memcg_unregister_cache(struct kmem_cache *cachep)
memcg = cachep->memcg_params->ctx->memcg;
id = memcg_cache_id(memcg);

- BUG_ON(root_cache->memcg_params->memcg_caches[id] != cachep);
- root_cache->memcg_params->memcg_caches[id] = NULL;
+ /*
+ * If the cache being unregistered is active (i.e. its initial owner is
+ * alive), we have to clear its slot in the root cache's array.
+ * Otherwise, the slot has already been cleared by
+ * memcg_unregister_all_caches.
+ */
+ if (root_cache->memcg_params->memcg_caches[id] == cachep)
+ root_cache->memcg_params->memcg_caches[id] = NULL;

list_del(&cachep->memcg_params->list);

@@ -3139,6 +3158,7 @@ static void memcg_unregister_all_caches(struct mem_cgroup *memcg)
{
struct memcg_cache_params *params, *tmp;
LIST_HEAD(empty_caches);
+ int id = memcg_cache_id(memcg);

if (!memcg_kmem_is_active(memcg))
return;
@@ -3147,10 +3167,14 @@ static void memcg_unregister_all_caches(struct mem_cgroup *memcg)
list_for_each_entry_safe(params, tmp,
&memcg->kmem_ctx->slab_caches, list) {
struct kmem_cache *cachep = params->cachep;
+ struct kmem_cache *root_cache = params->root_cache;

memcg_cache_mark_dead(cachep);
kmem_cache_shrink(cachep);

+ BUG_ON(root_cache->memcg_params->memcg_caches[id] != cachep);
+ root_cache->memcg_params->memcg_caches[id] = NULL;
+
if (atomic_long_dec_and_test(&cachep->memcg_params->refcnt))
list_move(&cachep->memcg_params->list, &empty_caches);
}
@@ -3239,18 +3263,28 @@ static void memcg_schedule_register_cache(struct mem_cgroup *memcg,
int __memcg_charge_slab(struct kmem_cache *cachep, gfp_t gfp, int order)
{
int res;
+ struct mem_cgroup *memcg;
+
+ rcu_read_lock();
+ do {
+ memcg = cachep->memcg_params->ctx->memcg;
+ } while (!css_tryget(&memcg->css));
+ rcu_read_unlock();

- res = memcg_charge_kmem(cachep->memcg_params->ctx->memcg, gfp,
- PAGE_SIZE << order);
+ res = memcg_charge_kmem(memcg, gfp, PAGE_SIZE << order);
if (!res)
atomic_long_inc(&cachep->memcg_params->refcnt);
+
+ css_put(&memcg->css);
return res;
}

void __memcg_uncharge_slab(struct kmem_cache *cachep, int order)
{
+ rcu_read_lock();
memcg_uncharge_kmem(cachep->memcg_params->ctx->memcg,
PAGE_SIZE << order);
+ rcu_read_unlock();

if (unlikely(atomic_long_dec_and_test(&cachep->memcg_params->refcnt)))
/* see memcg_unregister_all_caches */
@@ -3369,13 +3403,43 @@ int __memcg_charge_kmem_pages(gfp_t gfp, int order,

void __memcg_uncharge_kmem_pages(struct mem_cgroup_kmem_context *ctx, int order)
{
+ rcu_read_lock();
memcg_uncharge_kmem(ctx->memcg, PAGE_SIZE << order);
+ rcu_read_unlock();
+
memcg_put_kmem_context(ctx);
}
+
+static inline void memcg_reparent_kmem(struct mem_cgroup *memcg)
+{
+ struct mem_cgroup *parent = parent_mem_cgroup(memcg);
+ struct mem_cgroup_kmem_context *ctx = memcg->kmem_ctx;
+ struct mem_cgroup_kmem_context *p;
+
+ mutex_lock(&memcg_slab_mutex);
+
+ /* first reparent all ctxs that were reparented to this ctx earlier */
+ list_for_each_entry(p, &ctx->list, list) {
+ BUG_ON(p->memcg != memcg);
+ p->memcg = parent;
+ }
+ list_splice(&ctx->list, &parent->kmem_ctx->list);
+
+ /* now reparent this ctx itself */
+ BUG_ON(ctx->memcg != memcg);
+ ctx->memcg = parent;
+ list_add(&ctx->list, &parent->kmem_ctx->list);
+
+ mutex_unlock(&memcg_slab_mutex);
+}
#else
static inline void memcg_unregister_all_caches(struct mem_cgroup *memcg)
{
}
+
+static inline void memcg_reparent_kmem(struct mem_cgroup *memcg)
+{
+}
#endif /* CONFIG_MEMCG_KMEM */

#ifdef CONFIG_TRANSPARENT_HUGEPAGE
@@ -4947,34 +5011,6 @@ static void memcg_destroy_kmem(struct mem_cgroup *memcg)
{
mem_cgroup_sockets_destroy(memcg);
}
-
-static void kmem_cgroup_css_offline(struct mem_cgroup *memcg)
-{
- if (!memcg_kmem_is_active(memcg))
- return;
-
- /*
- * kmem charges can outlive the cgroup. In the case of slab
- * pages, for instance, a page contain objects from various
- * processes. As we prevent from taking a reference for every
- * such allocation we have to be careful when doing uncharge
- * (see memcg_uncharge_kmem) and here during offlining.
- *
- * The idea is that that only the _last_ uncharge which sees
- * the dead memcg will drop the last reference. An additional
- * reference is taken here before the group is marked dead
- * which is then paired with css_put during uncharge resp. here.
- *
- * Although this might sound strange as this path is called from
- * css_offline() when the referencemight have dropped down to 0 and
- * shouldn't be incremented anymore (css_tryget_online() would
- * fail) we do not have other options because of the kmem
- * allocations lifetime.
- */
- css_get(&memcg->css);
-
- memcg_put_kmem_context(memcg->kmem_ctx);
-}
#else
static int memcg_init_kmem(struct mem_cgroup *memcg, struct cgroup_subsys *ss)
{
@@ -4984,10 +5020,6 @@ static int memcg_init_kmem(struct mem_cgroup *memcg, struct cgroup_subsys *ss)
static void memcg_destroy_kmem(struct mem_cgroup *memcg)
{
}
-
-static void kmem_cgroup_css_offline(struct mem_cgroup *memcg)
-{
-}
#endif

/*
@@ -5442,7 +5474,7 @@ static void __mem_cgroup_free(struct mem_cgroup *memcg)
* to move this code around, and make sure it is outside
* the cgroup_lock.
*/
- disarm_static_keys(memcg);
+ disarm_sock_keys(memcg);

memcg_release_kmem_context(memcg);
kfree(memcg);
@@ -5604,8 +5636,6 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
}
spin_unlock(&memcg->event_list_lock);

- kmem_cgroup_css_offline(memcg);
-
mem_cgroup_invalidate_reclaim_iterators(memcg);

/*
@@ -5616,6 +5646,7 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
mem_cgroup_reparent_charges(mem_cgroup_from_css(iter));

memcg_unregister_all_caches(memcg);
+ memcg_reparent_kmem(memcg);
vmpressure_cleanup(&memcg->vmpressure);
}

--
1.7.10.4

2014-07-07 12:00:44

by Vladimir Davydov

Subject: [PATCH -mm 7/8] memcg: move some kmem definitions upper

I need this in the next patch.

Signed-off-by: Vladimir Davydov <[email protected]>
---
mm/memcontrol.c | 114 +++++++++++++++++++++++++++----------------------------
1 file changed, 56 insertions(+), 58 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index fb25575bdb22..21cf15184ad8 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -391,6 +391,45 @@ struct mem_cgroup {
};

#ifdef CONFIG_MEMCG_KMEM
+/*
+ * This will be the memcg's index in each cache's ->memcg_params->memcg_caches.
+ * The main reason for not using cgroup id for this:
+ * this works better in sparse environments, where we have a lot of memcgs,
+ * but only a few kmem-limited. Or also, if we have, for instance, 200
+ * memcgs, and none but the 200th is kmem-limited, we'd have to have a
+ * 200 entry array for that.
+ *
+ * The current size of the caches array is stored in
+ * memcg_limited_groups_array_size. It will double each time we have to
+ * increase it.
+ */
+static DEFINE_IDA(kmem_limited_groups);
+int memcg_limited_groups_array_size;
+
+/*
+ * MIN_SIZE is different than 1, because we would like to avoid going through
+ * the alloc/free process all the time. In a small machine, 4 kmem-limited
+ * cgroups is a reasonable guess. In the future, it could be a parameter or
+ * tunable, but that is strictly not necessary.
+ *
+ * MAX_SIZE should be as large as the number of cgrp_ids. Ideally, we could get
+ * this constant directly from cgroup, but it is understandable that this is
+ * better kept as an internal representation in cgroup.c. In any case, the
+ * cgrp_id space is not getting any smaller, and we don't have to necessarily
+ * increase ours as well if it increases.
+ */
+#define MEMCG_CACHES_MIN_SIZE 4
+#define MEMCG_CACHES_MAX_SIZE MEM_CGROUP_ID_MAX
+
+/*
+ * A lot of the calls to the cache allocation functions are expected to be
+ * inlined by the compiler. Since the calls to memcg_kmem_get_cache are
+ * conditional to this static branch, we'll have to allow modules that does
+ * kmem_cache_alloc and the such to see this symbol as well
+ */
+struct static_key memcg_kmem_enabled_key;
+EXPORT_SYMBOL(memcg_kmem_enabled_key);
+
static bool memcg_kmem_is_active(struct mem_cgroup *memcg)
{
return memcg->kmem_ctx->active;
@@ -433,6 +472,19 @@ static void memcg_release_kmem_context(struct mem_cgroup *memcg)
{
kfree(memcg->kmem_ctx);
}
+
+static void disarm_kmem_keys(struct mem_cgroup *memcg)
+{
+ if (memcg_kmem_is_active(memcg)) {
+ static_key_slow_dec(&memcg_kmem_enabled_key);
+ ida_simple_remove(&kmem_limited_groups, memcg->kmemcg_id);
+ }
+ /*
+ * This check can't live in kmem destruction function,
+ * since the charges will outlive the cgroup
+ */
+ WARN_ON(res_counter_read_u64(&memcg->kmem, RES_USAGE) != 0);
+}
#else
static int memcg_alloc_kmem_context(struct mem_cgroup *memcg)
{
@@ -442,6 +494,10 @@ static int memcg_alloc_kmem_context(struct mem_cgroup *memcg)
static void memcg_release_kmem_context(struct mem_cgroup *memcg)
{
}
+
+static void disarm_kmem_keys(struct mem_cgroup *memcg)
+{
+}
#endif /* CONFIG_MEMCG_KMEM */

/* Stuffs for move charges at task migration. */
@@ -636,64 +692,6 @@ static void disarm_sock_keys(struct mem_cgroup *memcg)
}
#endif

-#ifdef CONFIG_MEMCG_KMEM
-/*
- * This will be the memcg's index in each cache's ->memcg_params->memcg_caches.
- * The main reason for not using cgroup id for this:
- * this works better in sparse environments, where we have a lot of memcgs,
- * but only a few kmem-limited. Or also, if we have, for instance, 200
- * memcgs, and none but the 200th is kmem-limited, we'd have to have a
- * 200 entry array for that.
- *
- * The current size of the caches array is stored in
- * memcg_limited_groups_array_size. It will double each time we have to
- * increase it.
- */
-static DEFINE_IDA(kmem_limited_groups);
-int memcg_limited_groups_array_size;
-
-/*
- * MIN_SIZE is different than 1, because we would like to avoid going through
- * the alloc/free process all the time. In a small machine, 4 kmem-limited
- * cgroups is a reasonable guess. In the future, it could be a parameter or
- * tunable, but that is strictly not necessary.
- *
- * MAX_SIZE should be as large as the number of cgrp_ids. Ideally, we could get
- * this constant directly from cgroup, but it is understandable that this is
- * better kept as an internal representation in cgroup.c. In any case, the
- * cgrp_id space is not getting any smaller, and we don't have to necessarily
- * increase ours as well if it increases.
- */
-#define MEMCG_CACHES_MIN_SIZE 4
-#define MEMCG_CACHES_MAX_SIZE MEM_CGROUP_ID_MAX
-
-/*
- * A lot of the calls to the cache allocation functions are expected to be
- * inlined by the compiler. Since the calls to memcg_kmem_get_cache are
- * conditional to this static branch, we'll have to allow modules that does
- * kmem_cache_alloc and the such to see this symbol as well
- */
-struct static_key memcg_kmem_enabled_key;
-EXPORT_SYMBOL(memcg_kmem_enabled_key);
-
-static void disarm_kmem_keys(struct mem_cgroup *memcg)
-{
- if (memcg_kmem_is_active(memcg)) {
- static_key_slow_dec(&memcg_kmem_enabled_key);
- ida_simple_remove(&kmem_limited_groups, memcg->kmemcg_id);
- }
- /*
- * This check can't live in kmem destruction function,
- * since the charges will outlive the cgroup
- */
- WARN_ON(res_counter_read_u64(&memcg->kmem, RES_USAGE) != 0);
-}
-#else
-static void disarm_kmem_keys(struct mem_cgroup *memcg)
-{
-}
-#endif /* CONFIG_MEMCG_KMEM */
-
static void disarm_static_keys(struct mem_cgroup *memcg)
{
disarm_sock_keys(memcg);
--
1.7.10.4

2014-07-07 12:00:42

by Vladimir Davydov

Subject: [PATCH -mm 6/8] memcg: introduce kmem context

Currently, each kmem allocation keeps a pointer to the memcg which it
was charged to. It's kmem_cache->memcg_params->memcg in case of kmalloc
and page->memcg in case of alloc_kmem_pages. As a result, to re-parent
those charges on memcg offline we have to fix all the pointers so that
they point to the parent memcg. However, we can't always iterate
over all kmem allocations, e.g. pages allocated with alloc_kmem_pages
are not tracked. We could link all such allocations to per memcg lists,
but there's a simpler solution.

This patch introduces the mem_cgroup_kmem_context struct, which works as
a proxy between kmem objects and the memcg which they are charged
against. It has a pointer to the owner memcg and a reference counter.
Each kmem allocation holds a reference to the kmem context object
instead of pointing directly to the memcg, so to re-parent all kmem
charges it's enough to change the memcg pointer in the kmem context
struct.

The kmem context also allows us to get rid of the KMEM_ACCOUNTED_DEAD
flag, which was used to initiate memcg destruction on the last uncharge,
because now each charge (each kmem cache and each non-slab kmem page)
holds a reference to the context, which in turn holds a reference to the
memcg, preventing it from going away.
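
Schematically, the charge/uncharge paths now pin and release the context
rather than the memcg css (abridged from the hunks below):

	/* charge: take a reference on the context */
	page->memcg_kmem_ctx = memcg_get_kmem_context(memcg);

	/* uncharge: follow the context to whatever memcg it points to now */
	memcg_uncharge_kmem(page->memcg_kmem_ctx->memcg, PAGE_SIZE << order);
	memcg_put_kmem_context(page->memcg_kmem_ctx);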

Signed-off-by: Vladimir Davydov <[email protected]>
---
include/linux/memcontrol.h | 31 ++++++----
include/linux/mm_types.h | 6 +-
include/linux/slab.h | 6 +-
mm/memcontrol.c | 143 +++++++++++++++++++++++++-------------------
mm/page_alloc.c | 18 +++---
5 files changed, 113 insertions(+), 91 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 33077215b8d4..5a38c9c49392 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -29,6 +29,7 @@ struct page_cgroup;
struct page;
struct mm_struct;
struct kmem_cache;
+struct mem_cgroup_kmem_context;

/*
* The corresponding mem_cgroup_stat_names is defined in mm/memcontrol.c,
@@ -440,8 +441,10 @@ static inline bool memcg_kmem_enabled(void)
* conditions, but because they are pretty simple, they are expected to be
* fast.
*/
-int __memcg_charge_kmem_pages(gfp_t gfp, int order, struct mem_cgroup **memcg);
-void __memcg_uncharge_kmem_pages(struct mem_cgroup *memcg, int order);
+int __memcg_charge_kmem_pages(gfp_t gfp, int order,
+ struct mem_cgroup_kmem_context **ctxp);
+void __memcg_uncharge_kmem_pages(struct mem_cgroup_kmem_context *ctx,
+ int order);

int memcg_cache_id(struct mem_cgroup *memcg);

@@ -464,7 +467,7 @@ void __memcg_cleanup_cache_params(struct kmem_cache *s);
* memcg_charge_kmem_pages: verify if a kmem page allocation is allowed.
* @gfp: the gfp allocation flags.
* @order: allocation order.
- * @memcg: a pointer to the memcg this was charged against.
+ * @ctxp: a pointer to the memcg kmem context this was charged against.
*
* The function tries to charge a kmem page allocation to the memory cgroup
* which the current task belongs to. It should be used for accounting non-slab
@@ -475,9 +478,10 @@ void __memcg_cleanup_cache_params(struct kmem_cache *s);
* if this allocation is not to be accounted to any memcg.
*/
static inline int
-memcg_charge_kmem_pages(gfp_t gfp, int order, struct mem_cgroup **memcg)
+memcg_charge_kmem_pages(gfp_t gfp, int order,
+ struct mem_cgroup_kmem_context **ctxp)
{
- *memcg = NULL;
+ *ctxp = NULL;

if (!memcg_kmem_enabled())
return 0;
@@ -498,22 +502,22 @@ memcg_charge_kmem_pages(gfp_t gfp, int order, struct mem_cgroup **memcg)
if (unlikely(fatal_signal_pending(current)))
return 0;

- return __memcg_charge_kmem_pages(gfp, order, memcg);
+ return __memcg_charge_kmem_pages(gfp, order, ctxp);
}

/**
* memcg_uncharge_kmem_pages: uncharge a kmem page allocation
- * @memcg: the memcg the allocation is charged to.
+ * @ctx: the memcg kmem context the allocation was charged against.
* @order: allocation order.
*
* The function is used to uncharge kmem page allocations charged using
* memcg_charge_kmem_pages.
*/
static inline void
-memcg_uncharge_kmem_pages(struct mem_cgroup *memcg, int order)
+memcg_uncharge_kmem_pages(struct mem_cgroup_kmem_context *ctx, int order)
{
- if (memcg_kmem_enabled() && memcg)
- __memcg_uncharge_kmem_pages(memcg, order);
+ if (memcg_kmem_enabled() && ctx)
+ __memcg_uncharge_kmem_pages(ctx, order);
}

/**
@@ -547,14 +551,15 @@ static inline bool memcg_kmem_enabled(void)
}

static inline int
-memcg_charge_kmem_pages(gfp_t gfp, int order, struct mem_cgroup **memcg)
+memcg_charge_kmem_pages(gfp_t gfp, int order,
+ struct mem_cgroup_kmem_context **ctxp)
{
- *memcg = NULL;
+ *ctxp = NULL;
return 0;
}

static inline void
-memcg_uncharge_kmem_pages(struct mem_cgroup *memcg, int order)
+memcg_uncharge_kmem_pages(struct mem_cgroup_kmem_context *ctx, int order)
{
}

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 4656c02fcd1d..e1c8466c8d90 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -23,7 +23,7 @@
#define AT_VECTOR_SIZE (2*(AT_VECTOR_SIZE_ARCH + AT_VECTOR_SIZE_BASE + 1))

struct address_space;
-struct mem_cgroup;
+struct mem_cgroup_kmem_context;

#define USE_SPLIT_PTE_PTLOCKS (NR_CPUS >= CONFIG_SPLIT_PTLOCK_CPUS)
#define USE_SPLIT_PMD_PTLOCKS (USE_SPLIT_PTE_PTLOCKS && \
@@ -168,8 +168,8 @@ struct page {
struct kmem_cache *slab_cache; /* SL[AU]B: Pointer to slab */

/* for non-slab kmem pages (see alloc_kmem_pages):
- * memcg which the page is charged to */
- struct mem_cgroup *memcg;
+ * memcg kmem context which the page was charged against */
+ struct mem_cgroup_kmem_context *memcg_kmem_ctx;

struct page *first_page; /* Compound tail pages */
};
diff --git a/include/linux/slab.h b/include/linux/slab.h
index c6680a885910..c3e85aeeb556 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -524,8 +524,8 @@ static __always_inline void *kmalloc_node(size_t size, gfp_t flags, int node)
* Child caches will hold extra metadata needed for its operation. Fields are:
*
* @cachep: cache which this struct is for
- * @memcg: pointer to the memcg this cache belongs to
- * @list: list_head for the list of all caches in this memcg
+ * @ctx: pointer to the memcg kmem context this cache belongs to
+ * @list: list_head for the list of all caches in the context
* @root_cache: pointer to the global, root cache, this cache was derived from
* @siblings: list_head for the list of all child caches of the root_cache
* @refcnt: reference counter
@@ -543,7 +543,7 @@ struct memcg_cache_params {
};
struct {
struct kmem_cache *cachep;
- struct mem_cgroup *memcg;
+ struct mem_cgroup_kmem_context *ctx;
struct list_head list;
struct kmem_cache *root_cache;
struct list_head siblings;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 4b155ebf1973..fb25575bdb22 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -268,6 +268,20 @@ struct mem_cgroup_event {
struct work_struct remove;
};

+struct mem_cgroup_kmem_context {
+ struct mem_cgroup *memcg;
+ atomic_long_t refcnt;
+ /*
+ * true if accounting is enabled
+ */
+ bool active;
+ /*
+ * analogous to slab_common's slab_caches list, but per-memcg;
+ * protected by memcg_slab_mutex
+ */
+ struct list_head slab_caches;
+};
+
static void mem_cgroup_threshold(struct mem_cgroup *memcg);
static void mem_cgroup_oom_notify(struct mem_cgroup *memcg);

@@ -305,7 +319,6 @@ struct mem_cgroup {
* Should the accounting and control be hierarchical, per subtree?
*/
bool use_hierarchy;
- unsigned long kmem_account_flags; /* See KMEM_ACCOUNTED_*, below */

bool oom_lock;
atomic_t under_oom;
@@ -357,11 +370,9 @@ struct mem_cgroup {
struct cg_proto tcp_mem;
#endif
#if defined(CONFIG_MEMCG_KMEM)
- /* analogous to slab_common's slab_caches list, but per-memcg;
- * protected by memcg_slab_mutex */
- struct list_head memcg_slab_caches;
/* Index in the kmem_cache->memcg_params->memcg_caches array */
int kmemcg_id;
+ struct mem_cgroup_kmem_context *kmem_ctx;
#endif

int last_scanned_node;
@@ -379,40 +390,59 @@ struct mem_cgroup {
/* WARNING: nodeinfo must be the last member here */
};

-/* internal only representation about the status of kmem accounting. */
-enum {
- KMEM_ACCOUNTED_ACTIVE, /* accounted by this cgroup itself */
- KMEM_ACCOUNTED_DEAD, /* dead memcg with pending kmem charges */
-};
-
#ifdef CONFIG_MEMCG_KMEM
-static inline void memcg_kmem_set_active(struct mem_cgroup *memcg)
+static bool memcg_kmem_is_active(struct mem_cgroup *memcg)
{
- set_bit(KMEM_ACCOUNTED_ACTIVE, &memcg->kmem_account_flags);
+ return memcg->kmem_ctx->active;
}

-static bool memcg_kmem_is_active(struct mem_cgroup *memcg)
+static inline struct mem_cgroup_kmem_context *
+memcg_get_kmem_context(struct mem_cgroup *memcg)
{
- return test_bit(KMEM_ACCOUNTED_ACTIVE, &memcg->kmem_account_flags);
+ struct mem_cgroup_kmem_context *ctx = memcg->kmem_ctx;
+
+ atomic_long_inc(&ctx->refcnt);
+ return ctx;
}

-static void memcg_kmem_mark_dead(struct mem_cgroup *memcg)
+static inline void memcg_put_kmem_context(struct mem_cgroup_kmem_context *ctx)
{
- /*
- * Our caller must use css_get() first, because memcg_uncharge_kmem()
- * will call css_put() if it sees the memcg is dead.
- */
- smp_wmb();
- if (test_bit(KMEM_ACCOUNTED_ACTIVE, &memcg->kmem_account_flags))
- set_bit(KMEM_ACCOUNTED_DEAD, &memcg->kmem_account_flags);
+ if (unlikely(atomic_long_dec_and_test(&ctx->refcnt)))
+ css_put(&ctx->memcg->css); /* drop the reference taken in
+ * kmem_cgroup_css_offline */
}

-static bool memcg_kmem_test_and_clear_dead(struct mem_cgroup *memcg)
+static int memcg_alloc_kmem_context(struct mem_cgroup *memcg)
{
- return test_and_clear_bit(KMEM_ACCOUNTED_DEAD,
- &memcg->kmem_account_flags);
+ struct mem_cgroup_kmem_context *ctx;
+
+ ctx = kmalloc(sizeof(*ctx), GFP_KERNEL);
+ if (!ctx)
+ return -ENOMEM;
+
+ ctx->memcg = memcg;
+ atomic_long_set(&ctx->refcnt, 1);
+ ctx->active = false;
+ INIT_LIST_HEAD(&ctx->slab_caches);
+
+ memcg->kmem_ctx = ctx;
+ return 0;
}
-#endif
+
+static void memcg_release_kmem_context(struct mem_cgroup *memcg)
+{
+ kfree(memcg->kmem_ctx);
+}
+#else
+static int memcg_alloc_kmem_context(struct mem_cgroup *memcg)
+{
+ return 0;
+}
+
+static void memcg_release_kmem_context(struct mem_cgroup *memcg)
+{
+}
+#endif /* CONFIG_MEMCG_KMEM */

/* Stuffs for move charges at task migration. */
/*
@@ -2793,7 +2823,7 @@ static int mem_cgroup_slabinfo_read(struct seq_file *m, void *v)
print_slabinfo_header(m);

mutex_lock(&memcg_slab_mutex);
- list_for_each_entry(params, &memcg->memcg_slab_caches, list)
+ list_for_each_entry(params, &memcg->kmem_ctx->slab_caches, list)
cache_show(params->cachep, m);
mutex_unlock(&memcg_slab_mutex);

@@ -2843,21 +2873,7 @@ static void memcg_uncharge_kmem(struct mem_cgroup *memcg, u64 size)
res_counter_uncharge(&memcg->res, size);
if (do_swap_account)
res_counter_uncharge(&memcg->memsw, size);
-
- /* Not down to 0 */
- if (res_counter_uncharge(&memcg->kmem, size))
- return;
-
- /*
- * Releases a reference taken in kmem_cgroup_css_offline in case
- * this last uncharge is racing with the offlining code or it is
- * outliving the memcg existence.
- *
- * The memory barrier imposed by test&clear is paired with the
- * explicit one in memcg_kmem_mark_dead().
- */
- if (memcg_kmem_test_and_clear_dead(memcg))
- css_put(&memcg->css);
+ res_counter_uncharge(&memcg->kmem, size);
}

/*
@@ -2974,12 +2990,11 @@ int memcg_alloc_cache_params(struct mem_cgroup *memcg, struct kmem_cache *s,

if (memcg) {
s->memcg_params->cachep = s;
- s->memcg_params->memcg = memcg;
+ s->memcg_params->ctx = memcg_get_kmem_context(memcg);
s->memcg_params->root_cache = root_cache;
atomic_long_set(&s->memcg_params->refcnt, 1);
INIT_WORK(&s->memcg_params->unregister_work,
memcg_unregister_cache_func);
- css_get(&memcg->css);
} else {
s->memcg_params->is_root_cache = true;
INIT_LIST_HEAD(&s->memcg_params->children);
@@ -2993,7 +3008,7 @@ void memcg_free_cache_params(struct kmem_cache *s)
if (!s->memcg_params)
return;
if (!s->memcg_params->is_root_cache)
- css_put(&s->memcg_params->memcg->css);
+ memcg_put_kmem_context(s->memcg_params->ctx);
kfree(s->memcg_params);
}

@@ -3027,7 +3042,7 @@ static void memcg_register_cache(struct mem_cgroup *memcg,
if (!cachep)
return;

- list_add(&cachep->memcg_params->list, &memcg->memcg_slab_caches);
+ list_add(&cachep->memcg_params->list, &memcg->kmem_ctx->slab_caches);

/*
* Since readers won't lock (see cache_from_memcg_idx()), we need a
@@ -3051,7 +3066,7 @@ static void memcg_unregister_cache(struct kmem_cache *cachep)
BUG_ON(is_root_cache(cachep));

root_cache = cachep->memcg_params->root_cache;
- memcg = cachep->memcg_params->memcg;
+ memcg = cachep->memcg_params->ctx->memcg;
id = memcg_cache_id(memcg);

BUG_ON(root_cache->memcg_params->memcg_caches[id] != cachep);
@@ -3131,7 +3146,8 @@ static void memcg_unregister_all_caches(struct mem_cgroup *memcg)
return;

mutex_lock(&memcg_slab_mutex);
- list_for_each_entry_safe(params, tmp, &memcg->memcg_slab_caches, list) {
+ list_for_each_entry_safe(params, tmp,
+ &memcg->kmem_ctx->slab_caches, list) {
struct kmem_cache *cachep = params->cachep;

memcg_cache_mark_dead(cachep);
@@ -3226,7 +3242,7 @@ int __memcg_charge_slab(struct kmem_cache *cachep, gfp_t gfp, int order)
{
int res;

- res = memcg_charge_kmem(cachep->memcg_params->memcg, gfp,
+ res = memcg_charge_kmem(cachep->memcg_params->ctx->memcg, gfp,
PAGE_SIZE << order);
if (!res)
atomic_long_inc(&cachep->memcg_params->refcnt);
@@ -3235,7 +3251,8 @@ int __memcg_charge_slab(struct kmem_cache *cachep, gfp_t gfp, int order)

void __memcg_uncharge_slab(struct kmem_cache *cachep, int order)
{
- memcg_uncharge_kmem(cachep->memcg_params->memcg, PAGE_SIZE << order);
+ memcg_uncharge_kmem(cachep->memcg_params->ctx->memcg,
+ PAGE_SIZE << order);

if (unlikely(atomic_long_dec_and_test(&cachep->memcg_params->refcnt)))
/* see memcg_unregister_all_caches */
@@ -3304,7 +3321,8 @@ out:
return cachep;
}

-int __memcg_charge_kmem_pages(gfp_t gfp, int order, struct mem_cgroup **_memcg)
+int __memcg_charge_kmem_pages(gfp_t gfp, int order,
+ struct mem_cgroup_kmem_context **ctxp)
{
struct mem_cgroup *memcg;
int ret;
@@ -3345,15 +3363,16 @@ int __memcg_charge_kmem_pages(gfp_t gfp, int order, struct mem_cgroup **_memcg)

ret = memcg_charge_kmem(memcg, gfp, PAGE_SIZE << order);
if (!ret)
- *_memcg = memcg;
+ *ctxp = memcg_get_kmem_context(memcg);

css_put(&memcg->css);
return ret;
}

-void __memcg_uncharge_kmem_pages(struct mem_cgroup *memcg, int order)
+void __memcg_uncharge_kmem_pages(struct mem_cgroup_kmem_context *ctx, int order)
{
- memcg_uncharge_kmem(memcg, PAGE_SIZE << order);
+ memcg_uncharge_kmem(ctx->memcg, PAGE_SIZE << order);
+ memcg_put_kmem_context(ctx);
}
#else
static inline void memcg_unregister_all_caches(struct mem_cgroup *memcg)
@@ -4182,7 +4201,6 @@ static int __memcg_activate_kmem(struct mem_cgroup *memcg,
goto out_rmid;

memcg->kmemcg_id = memcg_id;
- INIT_LIST_HEAD(&memcg->memcg_slab_caches);

/*
* We couldn't have accounted to this cgroup, because it hasn't got the
@@ -4197,7 +4215,7 @@ static int __memcg_activate_kmem(struct mem_cgroup *memcg,
* guarantee no one starts accounting before all call sites are
* patched.
*/
- memcg_kmem_set_active(memcg);
+ memcg->kmem_ctx->active = true;
out:
memcg_resume_kmem_account();
return err;
@@ -4957,13 +4975,7 @@ static void kmem_cgroup_css_offline(struct mem_cgroup *memcg)
*/
css_get(&memcg->css);

- memcg_kmem_mark_dead(memcg);
-
- if (res_counter_read_u64(&memcg->kmem, RES_USAGE) != 0)
- return;
-
- if (memcg_kmem_test_and_clear_dead(memcg))
- css_put(&memcg->css);
+ memcg_put_kmem_context(memcg->kmem_ctx);
}
#else
static int memcg_init_kmem(struct mem_cgroup *memcg, struct cgroup_subsys *ss)
@@ -5433,6 +5445,8 @@ static void __mem_cgroup_free(struct mem_cgroup *memcg)
* the cgroup_lock.
*/
disarm_static_keys(memcg);
+
+ memcg_release_kmem_context(memcg);
kfree(memcg);
}

@@ -5481,6 +5495,9 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
if (!memcg)
return ERR_PTR(error);

+ if (memcg_alloc_kmem_context(memcg))
+ goto free_out;
+
for_each_node(node)
if (alloc_mem_cgroup_per_zone_info(memcg, node))
goto free_out;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f4090a582caf..39097a46b60c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2902,32 +2902,32 @@ EXPORT_SYMBOL(free_pages);
struct page *alloc_kmem_pages(gfp_t gfp_mask, unsigned int order)
{
struct page *page;
- struct mem_cgroup *memcg;
+ struct mem_cgroup_kmem_context *ctx;

- if (memcg_charge_kmem_pages(gfp_mask, order, &memcg) != 0)
+ if (memcg_charge_kmem_pages(gfp_mask, order, &ctx) != 0)
return NULL;
page = alloc_pages(gfp_mask, order);
if (!page) {
- memcg_uncharge_kmem_pages(memcg, order);
+ memcg_uncharge_kmem_pages(ctx, order);
return NULL;
}
- page->memcg = memcg;
+ page->memcg_kmem_ctx = ctx;
return page;
}

struct page *alloc_kmem_pages_node(int nid, gfp_t gfp_mask, unsigned int order)
{
struct page *page;
- struct mem_cgroup *memcg;
+ struct mem_cgroup_kmem_context *ctx;

- if (memcg_charge_kmem_pages(gfp_mask, order, &memcg) != 0)
+ if (memcg_charge_kmem_pages(gfp_mask, order, &ctx) != 0)
return NULL;
page = alloc_pages_node(nid, gfp_mask, order);
if (!page) {
- memcg_uncharge_kmem_pages(memcg, order);
+ memcg_uncharge_kmem_pages(ctx, order);
return NULL;
}
- page->memcg = memcg;
+ page->memcg_kmem_ctx = ctx;
return page;
}

@@ -2937,7 +2937,7 @@ struct page *alloc_kmem_pages_node(int nid, gfp_t gfp_mask, unsigned int order)
*/
void __free_kmem_pages(struct page *page, unsigned int order)
{
- memcg_uncharge_kmem_pages(page->memcg, order);
+ memcg_uncharge_kmem_pages(page->memcg_kmem_ctx, order);
__free_pages(page, order);
}

--
1.7.10.4

2014-07-07 12:01:39

by Vladimir Davydov

Subject: [PATCH -mm 5/8] memcg: rework non-slab kmem pages charge path

Currently, charging a non-slab kmem page involves two functions:
memcg_kmem_newpage_charge and memcg_kmem_commit_charge. The former is
called before allocating a page to charge it to the current cgroup,
while the latter saves the memcg the new page was charged to in the
page's page_cgroup.

Actually, there's no need to use page_cgroups for kmem pages. Such
pages are allocated when the user would like to kmalloc but falls back
to alloc_page because the allocation order is too large, so the user
won't touch the internal page struct fields, and we can safely use one
of them to store a pointer to the memcg holding the charge instead of
using page_cgroups, just like SL[AU]B does.

This will make the code cleaner and allow us to get rid of
memcg_kmem_commit_charge.
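
For example, the free path becomes a plain field dereference with no
page_cgroup lookup (taken, slightly annotated, from the page_alloc.c
hunk below):

	void __free_kmem_pages(struct page *page, unsigned int order)
	{
		/* page->memcg was set by alloc_kmem_pages() on a successful charge */
		memcg_uncharge_kmem_pages(page->memcg, order);
		__free_pages(page, order);
	}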

Signed-off-by: Vladimir Davydov <[email protected]>
---
include/linux/memcontrol.h | 79 +++++++++++++++++---------------------------
include/linux/mm_types.h | 6 ++++
mm/memcontrol.c | 70 ++++-----------------------------------
mm/page_alloc.c | 22 ++++++++----
4 files changed, 57 insertions(+), 120 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 5b0fbba00b01..33077215b8d4 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -440,11 +440,8 @@ static inline bool memcg_kmem_enabled(void)
* conditions, but because they are pretty simple, they are expected to be
* fast.
*/
-bool __memcg_kmem_newpage_charge(gfp_t gfp, struct mem_cgroup **memcg,
- int order);
-void __memcg_kmem_commit_charge(struct page *page,
- struct mem_cgroup *memcg, int order);
-void __memcg_kmem_uncharge_pages(struct page *page, int order);
+int __memcg_charge_kmem_pages(gfp_t gfp, int order, struct mem_cgroup **memcg);
+void __memcg_uncharge_kmem_pages(struct mem_cgroup *memcg, int order);

int memcg_cache_id(struct mem_cgroup *memcg);

@@ -464,22 +461,26 @@ void __memcg_uncharge_slab(struct kmem_cache *cachep, int order);
void __memcg_cleanup_cache_params(struct kmem_cache *s);

/**
- * memcg_kmem_newpage_charge: verify if a new kmem allocation is allowed.
+ * memcg_charge_kmem_pages: verify if a kmem page allocation is allowed.
* @gfp: the gfp allocation flags.
- * @memcg: a pointer to the memcg this was charged against.
* @order: allocation order.
+ * @memcg: a pointer to the memcg this was charged against.
*
- * returns true if the memcg where the current task belongs can hold this
- * allocation.
+ * The function tries to charge a kmem page allocation to the memory cgroup
+ * which the current task belongs to. It should be used for accounting non-slab
+ * kmem pages allocations (see alloc_kmem_pages). For slab allocations
+ * memcg_charge_slab is used.
*
- * We return true automatically if this allocation is not to be accounted to
- * any memcg.
+ * Returns 0 on success, -ENOMEM on failure. Note we skip charging and return 0
+ * if this allocation is not to be accounted to any memcg.
*/
-static inline bool
-memcg_kmem_newpage_charge(gfp_t gfp, struct mem_cgroup **memcg, int order)
+static inline int
+memcg_charge_kmem_pages(gfp_t gfp, int order, struct mem_cgroup **memcg)
{
+ *memcg = NULL;
+
if (!memcg_kmem_enabled())
- return true;
+ return 0;

/*
* __GFP_NOFAIL allocations will move on even if charging is not
@@ -489,47 +490,30 @@ memcg_kmem_newpage_charge(gfp_t gfp, struct mem_cgroup **memcg, int order)
* and won't be worth the trouble.
*/
if (gfp & __GFP_NOFAIL)
- return true;
+ return 0;
if (in_interrupt() || (!current->mm) || (current->flags & PF_KTHREAD))
- return true;
+ return 0;

/* If the test is dying, just let it go. */
if (unlikely(fatal_signal_pending(current)))
- return true;
+ return 0;

- return __memcg_kmem_newpage_charge(gfp, memcg, order);
-}
-
-/**
- * memcg_kmem_uncharge_pages: uncharge pages from memcg
- * @page: pointer to struct page being freed
- * @order: allocation order.
- *
- * there is no need to specify memcg here, since it is embedded in page_cgroup
- */
-static inline void
-memcg_kmem_uncharge_pages(struct page *page, int order)
-{
- if (memcg_kmem_enabled())
- __memcg_kmem_uncharge_pages(page, order);
+ return __memcg_charge_kmem_pages(gfp, order, memcg);
}

/**
- * memcg_kmem_commit_charge: embeds correct memcg in a page
- * @page: pointer to struct page recently allocated
- * @memcg: the memcg structure we charged against
+ * memcg_uncharge_kmem_pages: uncharge a kmem page allocation
+ * @memcg: the memcg the allocation is charged to.
* @order: allocation order.
*
- * Needs to be called after memcg_kmem_newpage_charge, regardless of success or
- * failure of the allocation. if @page is NULL, this function will revert the
- * charges. Otherwise, it will commit the memcg given by @memcg to the
- * corresponding page_cgroup.
+ * The function is used to uncharge kmem page allocations charged using
+ * memcg_charge_kmem_pages.
*/
static inline void
-memcg_kmem_commit_charge(struct page *page, struct mem_cgroup *memcg, int order)
+memcg_uncharge_kmem_pages(struct mem_cgroup *memcg, int order)
{
if (memcg_kmem_enabled() && memcg)
- __memcg_kmem_commit_charge(page, memcg, order);
+ __memcg_uncharge_kmem_pages(memcg, order);
}

/**
@@ -562,18 +546,15 @@ static inline bool memcg_kmem_enabled(void)
return false;
}

-static inline bool
-memcg_kmem_newpage_charge(gfp_t gfp, struct mem_cgroup **memcg, int order)
-{
- return true;
-}
-
-static inline void memcg_kmem_uncharge_pages(struct page *page, int order)
+static inline int
+memcg_charge_kmem_pages(gfp_t gfp, int order, struct mem_cgroup **memcg)
{
+ *memcg = NULL;
+ return 0;
}

static inline void
-memcg_kmem_commit_charge(struct page *page, struct mem_cgroup *memcg, int order)
+memcg_uncharge_kmem_pages(struct mem_cgroup *memcg, int order)
{
}

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index a6236cff3c31..4656c02fcd1d 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -23,6 +23,7 @@
#define AT_VECTOR_SIZE (2*(AT_VECTOR_SIZE_ARCH + AT_VECTOR_SIZE_BASE + 1))

struct address_space;
+struct mem_cgroup;

#define USE_SPLIT_PTE_PTLOCKS (NR_CPUS >= CONFIG_SPLIT_PTLOCK_CPUS)
#define USE_SPLIT_PMD_PTLOCKS (USE_SPLIT_PTE_PTLOCKS && \
@@ -165,6 +166,11 @@ struct page {
#endif
#endif
struct kmem_cache *slab_cache; /* SL[AU]B: Pointer to slab */
+
+ /* for non-slab kmem pages (see alloc_kmem_pages):
+ * memcg which the page is charged to */
+ struct mem_cgroup *memcg;
+
struct page *first_page; /* Compound tail pages */
};

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 4dedb67787c7..4b155ebf1973 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3304,28 +3304,11 @@ out:
return cachep;
}

-/*
- * We need to verify if the allocation against current->mm->owner's memcg is
- * possible for the given order. But the page is not allocated yet, so we'll
- * need a further commit step to do the final arrangements.
- *
- * It is possible for the task to switch cgroups in this mean time, so at
- * commit time, we can't rely on task conversion any longer. We'll then use
- * the handle argument to return to the caller which cgroup we should commit
- * against. We could also return the memcg directly and avoid the pointer
- * passing, but a boolean return value gives better semantics considering
- * the compiled-out case as well.
- *
- * Returning true means the allocation is possible.
- */
-bool
-__memcg_kmem_newpage_charge(gfp_t gfp, struct mem_cgroup **_memcg, int order)
+int __memcg_charge_kmem_pages(gfp_t gfp, int order, struct mem_cgroup **_memcg)
{
struct mem_cgroup *memcg;
int ret;

- *_memcg = NULL;
-
/*
* Disabling accounting is only relevant for some specific memcg
* internal allocations. Therefore we would initially not have such
@@ -3351,14 +3334,13 @@ __memcg_kmem_newpage_charge(gfp_t gfp, struct mem_cgroup **_memcg, int order)
* allocations are extremely rare but can happen, for instance, for the
* cache arrays. We bring this test here.
*/
- if (!current->mm || current->memcg_kmem_skip_account)
- return true;
+ if (current->memcg_kmem_skip_account)
+ return 0;

memcg = get_mem_cgroup_from_mm(current->mm);
-
if (!memcg_can_account_kmem(memcg)) {
css_put(&memcg->css);
- return true;
+ return 0;
}

ret = memcg_charge_kmem(memcg, gfp, PAGE_SIZE << order);
@@ -3366,51 +3348,11 @@ __memcg_kmem_newpage_charge(gfp_t gfp, struct mem_cgroup **_memcg, int order)
*_memcg = memcg;

css_put(&memcg->css);
- return (ret == 0);
-}
-
-void __memcg_kmem_commit_charge(struct page *page, struct mem_cgroup *memcg,
- int order)
-{
- struct page_cgroup *pc;
-
- VM_BUG_ON(mem_cgroup_is_root(memcg));
-
- /* The page allocation failed. Revert */
- if (!page) {
- memcg_uncharge_kmem(memcg, PAGE_SIZE << order);
- return;
- }
- /*
- * The page is freshly allocated and not visible to any
- * outside callers yet. Set up pc non-atomically.
- */
- pc = lookup_page_cgroup(page);
- pc->mem_cgroup = memcg;
- pc->flags = PCG_USED;
+ return ret;
}

-void __memcg_kmem_uncharge_pages(struct page *page, int order)
+void __memcg_uncharge_kmem_pages(struct mem_cgroup *memcg, int order)
{
- struct mem_cgroup *memcg = NULL;
- struct page_cgroup *pc;
-
-
- pc = lookup_page_cgroup(page);
- if (!PageCgroupUsed(pc))
- return;
-
- memcg = pc->mem_cgroup;
- pc->flags = 0;
-
- /*
- * We trust that only if there is a memcg associated with the page, it
- * is a valid allocation
- */
- if (!memcg)
- return;
-
- VM_BUG_ON_PAGE(mem_cgroup_is_root(memcg), page);
memcg_uncharge_kmem(memcg, PAGE_SIZE << order);
}
#else
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4351dd972803..f4090a582caf 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2902,24 +2902,32 @@ EXPORT_SYMBOL(free_pages);
struct page *alloc_kmem_pages(gfp_t gfp_mask, unsigned int order)
{
struct page *page;
- struct mem_cgroup *memcg = NULL;
+ struct mem_cgroup *memcg;

- if (!memcg_kmem_newpage_charge(gfp_mask, &memcg, order))
+ if (memcg_charge_kmem_pages(gfp_mask, order, &memcg) != 0)
return NULL;
page = alloc_pages(gfp_mask, order);
- memcg_kmem_commit_charge(page, memcg, order);
+ if (!page) {
+ memcg_uncharge_kmem_pages(memcg, order);
+ return NULL;
+ }
+ page->memcg = memcg;
return page;
}

struct page *alloc_kmem_pages_node(int nid, gfp_t gfp_mask, unsigned int order)
{
struct page *page;
- struct mem_cgroup *memcg = NULL;
+ struct mem_cgroup *memcg;

- if (!memcg_kmem_newpage_charge(gfp_mask, &memcg, order))
+ if (memcg_charge_kmem_pages(gfp_mask, order, &memcg) != 0)
return NULL;
page = alloc_pages_node(nid, gfp_mask, order);
- memcg_kmem_commit_charge(page, memcg, order);
+ if (!page) {
+ memcg_uncharge_kmem_pages(memcg, order);
+ return NULL;
+ }
+ page->memcg = memcg;
return page;
}

@@ -2929,7 +2937,7 @@ struct page *alloc_kmem_pages_node(int nid, gfp_t gfp_mask, unsigned int order)
*/
void __free_kmem_pages(struct page *page, unsigned int order)
{
- memcg_kmem_uncharge_pages(page, order);
+ memcg_uncharge_kmem_pages(page->memcg, order);
__free_pages(page, order);
}

--
1.7.10.4

2014-07-07 12:00:29

by Vladimir Davydov

Subject: [PATCH -mm 1/8] memcg: add pointer from memcg_cache_params to owner cache

We don't keep a pointer to the owner kmem cache in the
memcg_cache_params struct, because we can always get the cache by
reading the slot corresponding to the owner memcg in the root cache's
memcg_caches array (see memcg_params_to_cache). However, this won't work
once kmem cache re-parenting is introduced, because the slot in the root
cache's array will then have to be cleared on css offline.
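
With the back pointer, getting the cache for a given memcg_cache_params
becomes a plain dereference instead of an indirect lookup through the
root cache's array, roughly:

	/* before: */
	cachep = cache_from_memcg_idx(params->root_cache,
				      memcg_cache_id(params->memcg));
	/* after: */
	cachep = params->cachep;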

Signed-off-by: Vladimir Davydov <[email protected]>
---
include/linux/slab.h | 2 ++
mm/memcontrol.c | 25 +++++--------------------
2 files changed, 7 insertions(+), 20 deletions(-)

diff --git a/include/linux/slab.h b/include/linux/slab.h
index 68b1feaba9d6..8bc62d5ef903 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -523,6 +523,7 @@ static __always_inline void *kmalloc_node(size_t size, gfp_t flags, int node)
*
* Child caches will hold extra metadata needed for its operation. Fields are:
*
+ * @cachep: cache which this struct is for
* @memcg: pointer to the memcg this cache belongs to
* @list: list_head for the list of all caches in this memcg
* @root_cache: pointer to the global, root cache, this cache was derived from
@@ -536,6 +537,7 @@ struct memcg_cache_params {
union {
struct kmem_cache *memcg_caches[0];
struct {
+ struct kmem_cache *cachep;
struct mem_cgroup *memcg;
struct list_head list;
struct kmem_cache *root_cache;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d1b311687769..98b43a8125b9 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2781,19 +2781,6 @@ static inline bool memcg_can_account_kmem(struct mem_cgroup *memcg)
memcg_kmem_is_active(memcg);
}

-/*
- * This is a bit cumbersome, but it is rarely used and avoids a backpointer
- * in the memcg_cache_params struct.
- */
-static struct kmem_cache *memcg_params_to_cache(struct memcg_cache_params *p)
-{
- struct kmem_cache *cachep;
-
- VM_BUG_ON(p->is_root_cache);
- cachep = p->root_cache;
- return cache_from_memcg_idx(cachep, memcg_cache_id(p->memcg));
-}
-
#ifdef CONFIG_SLABINFO
static int mem_cgroup_slabinfo_read(struct seq_file *m, void *v)
{
@@ -2807,7 +2794,7 @@ static int mem_cgroup_slabinfo_read(struct seq_file *m, void *v)

mutex_lock(&memcg_slab_mutex);
list_for_each_entry(params, &memcg->memcg_slab_caches, list)
- cache_show(memcg_params_to_cache(params), m);
+ cache_show(params->cachep, m);
mutex_unlock(&memcg_slab_mutex);

return 0;
@@ -2982,6 +2969,7 @@ int memcg_alloc_cache_params(struct mem_cgroup *memcg, struct kmem_cache *s,
return -ENOMEM;

if (memcg) {
+ s->memcg_params->cachep = s;
s->memcg_params->memcg = memcg;
s->memcg_params->root_cache = root_cache;
atomic_long_set(&s->memcg_params->refcnt, 1);
@@ -3072,10 +3060,9 @@ static void memcg_unregister_cache_func(struct work_struct *work)
{
struct memcg_cache_params *params =
container_of(work, struct memcg_cache_params, unregister_work);
- struct kmem_cache *cachep = memcg_params_to_cache(params);

mutex_lock(&memcg_slab_mutex);
- memcg_unregister_cache(cachep);
+ memcg_unregister_cache(params->cachep);
mutex_unlock(&memcg_slab_mutex);
}

@@ -3140,7 +3127,6 @@ int __memcg_cleanup_cache_params(struct kmem_cache *s)

static void memcg_unregister_all_caches(struct mem_cgroup *memcg)
{
- struct kmem_cache *cachep;
struct memcg_cache_params *params, *tmp;
LIST_HEAD(empty_caches);

@@ -3149,7 +3135,7 @@ static void memcg_unregister_all_caches(struct mem_cgroup *memcg)

mutex_lock(&memcg_slab_mutex);
list_for_each_entry_safe(params, tmp, &memcg->memcg_slab_caches, list) {
- cachep = memcg_params_to_cache(params);
+ struct kmem_cache *cachep = params->cachep;

memcg_cache_mark_dead(cachep);
kmem_cache_shrink(cachep);
@@ -3173,8 +3159,7 @@ static void memcg_unregister_all_caches(struct mem_cgroup *memcg)
while (!list_empty(&empty_caches)) {
params = list_first_entry(&empty_caches,
struct memcg_cache_params, list);
- cachep = memcg_params_to_cache(params);
- memcg_unregister_cache(cachep);
+ memcg_unregister_cache(params->cachep);
}
mutex_unlock(&memcg_slab_mutex);
}
--
1.7.10.4

2014-07-07 12:02:46

by Vladimir Davydov

Subject: [PATCH -mm 2/8] memcg: keep all children of each root cache on a list

Sometimes we need to iterate over all child caches of a particular root
cache, e.g. when we are destroying it. Currently each root cache keeps
pointers to its children in its memcg_cache_params::memcg_caches array,
so a child can only be found by enumerating all active kmemcg ids and
dereferencing the corresponding array slots. However, this is going to
change when memcg cache reparenting is introduced: only active (not
dead) caches will reside in this array. So let's link all child caches
of the same root cache into a list on memcg_cache_params.
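
A minimal sketch of the resulting layout and iteration (illustrative
only, assuming CONFIG_MEMCG_KMEM and the fields added by this series;
the helper name is made up):

#include <linux/list.h>
#include <linux/slab.h>

/*
 * Each child cache links itself into its root cache's ->children list
 * via ->siblings, so the children can be walked without consulting the
 * memcg_caches array at all.  slab_mutex (from mm/slab.h) keeps the
 * list stable.
 */
static void for_each_memcg_child_cache(struct kmem_cache *root,
                                       void (*fn)(struct kmem_cache *))
{
        struct memcg_cache_params *params;

        lockdep_assert_held(&slab_mutex);
        list_for_each_entry(params, &root->memcg_params->children, siblings)
                fn(params->cachep);
}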

Signed-off-by: Vladimir Davydov <[email protected]>
---
include/linux/memcontrol.h | 2 +-
include/linux/slab.h | 7 ++++++-
mm/memcontrol.c | 27 ++++++++++++---------------
mm/slab.c | 40 +++++++++++++++++++++++-----------------
mm/slab_common.c | 30 +++++++++++++++++-------------
mm/slub.c | 39 +++++++++++++++++++++++----------------
6 files changed, 82 insertions(+), 63 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 806b8fa15c5f..5b0fbba00b01 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -461,7 +461,7 @@ __memcg_kmem_get_cache(struct kmem_cache *cachep, gfp_t gfp);
int __memcg_charge_slab(struct kmem_cache *cachep, gfp_t gfp, int order);
void __memcg_uncharge_slab(struct kmem_cache *cachep, int order);

-int __memcg_cleanup_cache_params(struct kmem_cache *s);
+void __memcg_cleanup_cache_params(struct kmem_cache *s);

/**
* memcg_kmem_newpage_charge: verify if a new kmem allocation is allowed.
diff --git a/include/linux/slab.h b/include/linux/slab.h
index 8bc62d5ef903..640e6a655d51 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -527,6 +527,7 @@ static __always_inline void *kmalloc_node(size_t size, gfp_t flags, int node)
* @memcg: pointer to the memcg this cache belongs to
* @list: list_head for the list of all caches in this memcg
* @root_cache: pointer to the global, root cache, this cache was derived from
+ * @siblings: list_head for the list of all child caches of the root_cache
* @refcnt: reference counter
* @dead: set to true when owner memcg is turned offline
* @unregister_work: worker to destroy the cache
@@ -535,12 +536,16 @@ struct memcg_cache_params {
bool is_root_cache;
struct rcu_head rcu_head;
union {
- struct kmem_cache *memcg_caches[0];
+ struct {
+ struct list_head children;
+ struct kmem_cache *memcg_caches[0];
+ };
struct {
struct kmem_cache *cachep;
struct mem_cgroup *memcg;
struct list_head list;
struct kmem_cache *root_cache;
+ struct list_head siblings;
atomic_long_t refcnt;
bool dead;
struct work_struct unregister_work;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 98b43a8125b9..4dedb67787c7 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2915,6 +2915,10 @@ int memcg_update_cache_size(struct kmem_cache *s, int num_groups)
return -ENOMEM;

new_params->is_root_cache = true;
+ INIT_LIST_HEAD(&new_params->children);
+ if (cur_params)
+ list_replace(&cur_params->children,
+ &new_params->children);

/*
* There is the chance it will be bigger than
@@ -2976,8 +2980,10 @@ int memcg_alloc_cache_params(struct mem_cgroup *memcg, struct kmem_cache *s,
INIT_WORK(&s->memcg_params->unregister_work,
memcg_unregister_cache_func);
css_get(&memcg->css);
- } else
+ } else {
s->memcg_params->is_root_cache = true;
+ INIT_LIST_HEAD(&s->memcg_params->children);
+ }

return 0;
}
@@ -3105,24 +3111,15 @@ static inline void memcg_resume_kmem_account(void)
current->memcg_kmem_skip_account--;
}

-int __memcg_cleanup_cache_params(struct kmem_cache *s)
+void __memcg_cleanup_cache_params(struct kmem_cache *s)
{
- struct kmem_cache *c;
- int i, failed = 0;
+ struct memcg_cache_params *params, *tmp;

mutex_lock(&memcg_slab_mutex);
- for_each_memcg_cache_index(i) {
- c = cache_from_memcg_idx(s, i);
- if (!c)
- continue;
-
- memcg_unregister_cache(c);
-
- if (cache_from_memcg_idx(s, i))
- failed++;
- }
+ list_for_each_entry_safe(params, tmp,
+ &s->memcg_params->children, siblings)
+ memcg_unregister_cache(params->cachep);
mutex_unlock(&memcg_slab_mutex);
- return failed;
}

static void memcg_unregister_all_caches(struct mem_cgroup *memcg)
diff --git a/mm/slab.c b/mm/slab.c
index e7763dba3570..159bddfabaee 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -3821,29 +3821,35 @@ static int __do_tune_cpucache(struct kmem_cache *cachep, int limit,
return alloc_kmem_cache_node(cachep, gfp);
}

+static void memcg_do_tune_cpucache(struct kmem_cache *cachep, int limit,
+ int batchcount, int shared, gfp_t gfp)
+{
+#ifdef CONFIG_MEMCG_KMEM
+ struct memcg_cache_params *params;
+
+ if (!cachep->memcg_params ||
+ !cachep->memcg_params->is_root_cache)
+ return;
+
+ lockdep_assert_held(&slab_mutex);
+ list_for_each_entry(params,
+ &cachep->memcg_params->children, siblings) {
+ /* return value determined by the parent cache only */
+ __do_tune_cpucache(params->cachep, limit,
+ batchcount, shared, gfp);
+ }
+#endif
+}
+
static int do_tune_cpucache(struct kmem_cache *cachep, int limit,
int batchcount, int shared, gfp_t gfp)
{
int ret;
- struct kmem_cache *c = NULL;
- int i = 0;

ret = __do_tune_cpucache(cachep, limit, batchcount, shared, gfp);
-
- if (slab_state < FULL)
- return ret;
-
- if ((ret < 0) || !is_root_cache(cachep))
- return ret;
-
- VM_BUG_ON(!mutex_is_locked(&slab_mutex));
- for_each_memcg_cache_index(i) {
- c = cache_from_memcg_idx(cachep, i);
- if (c)
- /* return value determined by the parent cache only */
- __do_tune_cpucache(c, limit, batchcount, shared, gfp);
- }
-
+ if (!ret)
+ memcg_do_tune_cpucache(cachep, limit,
+ batchcount, shared, gfp);
return ret;
}

diff --git a/mm/slab_common.c b/mm/slab_common.c
index d31c4bacc6a2..95a8f772b0d1 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -294,8 +294,12 @@ struct kmem_cache *memcg_create_kmem_cache(struct mem_cgroup *memcg,
if (IS_ERR(s)) {
kfree(cache_name);
s = NULL;
+ goto out_unlock;
}

+ list_add(&s->memcg_params->siblings,
+ &root_cache->memcg_params->children);
+
out_unlock:
mutex_unlock(&slab_mutex);

@@ -307,17 +311,15 @@ out_unlock:

static int memcg_cleanup_cache_params(struct kmem_cache *s)
{
- int rc;
-
if (!s->memcg_params ||
!s->memcg_params->is_root_cache)
return 0;

mutex_unlock(&slab_mutex);
- rc = __memcg_cleanup_cache_params(s);
+ __memcg_cleanup_cache_params(s);
mutex_lock(&slab_mutex);

- return rc;
+ return !list_empty(&s->memcg_params->children);
}
#else
static int memcg_cleanup_cache_params(struct kmem_cache *s)
@@ -354,6 +356,10 @@ void kmem_cache_destroy(struct kmem_cache *s)
}

list_del(&s->list);
+#ifdef CONFIG_MEMCG_KMEM
+ if (!is_root_cache(s))
+ list_del(&s->memcg_params->siblings);
+#endif

mutex_unlock(&slab_mutex);
if (s->flags & SLAB_DESTROY_BY_RCU)
@@ -692,20 +698,17 @@ void slab_stop(struct seq_file *m, void *p)
static void
memcg_accumulate_slabinfo(struct kmem_cache *s, struct slabinfo *info)
{
- struct kmem_cache *c;
+#ifdef CONFIG_MEMCG_KMEM
+ struct memcg_cache_params *params;
struct slabinfo sinfo;
- int i;

- if (!is_root_cache(s))
+ if (!s->memcg_params ||
+ !s->memcg_params->is_root_cache)
return;

- for_each_memcg_cache_index(i) {
- c = cache_from_memcg_idx(s, i);
- if (!c)
- continue;
-
+ list_for_each_entry(params, &s->memcg_params->children, siblings) {
memset(&sinfo, 0, sizeof(sinfo));
- get_slabinfo(c, &sinfo);
+ get_slabinfo(params->cachep, &sinfo);

info->active_slabs += sinfo.active_slabs;
info->num_slabs += sinfo.num_slabs;
@@ -713,6 +716,7 @@ memcg_accumulate_slabinfo(struct kmem_cache *s, struct slabinfo *info)
info->active_objs += sinfo.active_objs;
info->num_objs += sinfo.num_objs;
}
+#endif
}

int cache_show(struct kmem_cache *s, struct seq_file *m)
diff --git a/mm/slub.c b/mm/slub.c
index 6641a8fc63d1..1821e2096cbb 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3706,6 +3706,23 @@ static struct kmem_cache *find_mergeable(size_t size, size_t align,
return NULL;
}

+static void memcg_slab_merge(struct kmem_cache *s, size_t size)
+{
+#ifdef CONFIG_MEMCG_KMEM
+ struct kmem_cache *c;
+ struct memcg_cache_params *params;
+
+ if (!s->memcg_params)
+ return;
+
+ list_for_each_entry(params, &s->memcg_params->children, siblings) {
+ c = params->cachep;
+ c->object_size = s->object_size;
+ c->inuse = max_t(int, c->inuse, ALIGN(size, sizeof(void *)));
+ }
+#endif
+}
+
struct kmem_cache *
__kmem_cache_alias(const char *name, size_t size, size_t align,
unsigned long flags, void (*ctor)(void *))
@@ -3714,9 +3731,6 @@ __kmem_cache_alias(const char *name, size_t size, size_t align,

s = find_mergeable(size, align, flags, name, ctor);
if (s) {
- int i;
- struct kmem_cache *c;
-
s->refcount++;

/*
@@ -3726,14 +3740,7 @@ __kmem_cache_alias(const char *name, size_t size, size_t align,
s->object_size = max(s->object_size, (int)size);
s->inuse = max_t(int, s->inuse, ALIGN(size, sizeof(void *)));

- for_each_memcg_cache_index(i) {
- c = cache_from_memcg_idx(s, i);
- if (!c)
- continue;
- c->object_size = s->object_size;
- c->inuse = max_t(int, c->inuse,
- ALIGN(size, sizeof(void *)));
- }
+ memcg_slab_merge(s, size);

if (sysfs_slab_alias(s, name)) {
s->refcount--;
@@ -4984,7 +4991,7 @@ static ssize_t slab_attr_store(struct kobject *kobj,
err = attribute->store(s, buf, len);
#ifdef CONFIG_MEMCG_KMEM
if (slab_state >= FULL && err >= 0 && is_root_cache(s)) {
- int i;
+ struct memcg_cache_params *params;

mutex_lock(&slab_mutex);
if (s->max_attr_size < len)
@@ -5007,10 +5014,10 @@ static ssize_t slab_attr_store(struct kobject *kobj,
* directly either failed or succeeded, in which case we loop
* through the descendants with best-effort propagation.
*/
- for_each_memcg_cache_index(i) {
- struct kmem_cache *c = cache_from_memcg_idx(s, i);
- if (c)
- attribute->store(c, buf, len);
+ if (s->memcg_params) {
+ list_for_each_entry(params,
+ &s->memcg_params->children, siblings)
+ attribute->store(params->cachep, buf, len);
}
mutex_unlock(&slab_mutex);
}
--
1.7.10.4

2014-07-07 14:25:34

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH -mm 0/8] memcg: reparent kmem on css offline

Hi Vladimir,

On Mon, Jul 07, 2014 at 04:00:05PM +0400, Vladimir Davydov wrote:
> Hi,
>
> This patch set introduces re-parenting of kmem charges on memcg css
> offline. The idea lying behind it is very simple - instead of pointing
> from kmem objects (kmem caches, non-slab kmem pages) directly to the
> memcg which they are charged against, we make them point to a proxy
> object, mem_cgroup_kmem_context, which, in turn, points to the memcg
> which it belongs to. As a result on memcg offline, it's enough to only
> re-parent the memcg's mem_cgroup_kmem_context.

The motivation for this was to clear out all references to a memcg by
the time it's offlined, so that the unreachable css can be freed soon.

However, recent cgroup core changes further disconnected the css from
the cgroup object itself, so it's no longer as urgent to free the css.

In addition, Tejun made offlined css iterable and split css_tryget()
and css_tryget_online(), which would allow memcg to pin the css until
the last charge is gone while continuing to iterate and reclaim it on
hierarchical pressure, even after it was offlined.

This would obviate the need for reparenting as a whole, not just kmem
pages, but even remaining page cache. Michal already obsoleted the
force_empty knob that reparents as a fallback, and whether the cache
pages are in the parent or in a ghost css after cgroup deletion does
not make a real difference from a user point of view, they still get
reclaimed when the parent experiences pressure.

You could then reap dead slab caches as part of the regular per-memcg
slab scanning in reclaim, without having to resort to auxiliary lists,
vmpressure events etc.

I think it would save us a lot of code and complexity. You want
per-memcg slab scanning *anyway*, all we'd have to change in the
existing code would be to pin the css until the LRUs and kmem caches
are truly empty, and switch mem_cgroup_iter() to css_tryget().
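
In concrete terms, something along these lines (a rough sketch only,
with made-up helper names, as if they lived in mm/memcontrol.c):

/*
 * Kmem charges pin the css directly; the last uncharge drops the
 * reference, so an offline css simply lingers until its charges drain.
 * css_tryget() succeeds even after offlining, unlike
 * css_tryget_online().
 */
static bool memcg_kmem_pin(struct mem_cgroup *memcg)
{
        return css_tryget(&memcg->css);
}

static void memcg_kmem_unpin(struct mem_cgroup *memcg)
{
        css_put(&memcg->css);   /* the final put releases the ghost css */
}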

Would this make sense to you?

by Christoph Lameter

[permalink] [raw]

Subject: Re: [PATCH -mm 2/8] memcg: keep all children of each root cache on a list

On Mon, 7 Jul 2014, Vladimir Davydov wrote:

> diff --git a/mm/slab_common.c b/mm/slab_common.c
> index d31c4bacc6a2..95a8f772b0d1 100644
> --- a/mm/slab_common.c
> +++ b/mm/slab_common.c
> @@ -294,8 +294,12 @@ struct kmem_cache *memcg_create_kmem_cache(struct mem_cgroup *memcg,
> if (IS_ERR(s)) {
> kfree(cache_name);
> s = NULL;
> + goto out_unlock;
> }
>
> + list_add(&s->memcg_params->siblings,
> + &root_cache->memcg_params->children);
> +
> out_unlock:
> mutex_unlock(&slab_mutex);
>

If there is an error then s is set to NULL. And then
the list_add is done dereferencing s?

2014-07-07 15:40:21

by Vladimir Davydov

[permalink] [raw]
Subject: Re: [PATCH -mm 0/8] memcg: reparent kmem on css offline

Hi Johannes,

On Mon, Jul 07, 2014 at 10:25:06AM -0400, Johannes Weiner wrote:
> Hi Vladimir,
>
> On Mon, Jul 07, 2014 at 04:00:05PM +0400, Vladimir Davydov wrote:
> > Hi,
> >
> > This patch set introduces re-parenting of kmem charges on memcg css
> > offline. The idea lying behind it is very simple - instead of pointing
> > from kmem objects (kmem caches, non-slab kmem pages) directly to the
> > memcg which they are charged against, we make them point to a proxy
> > object, mem_cgroup_kmem_context, which, in turn, points to the memcg
> > which it belongs to. As a result on memcg offline, it's enough to only
> > re-parent the memcg's mem_cgroup_kmem_context.
>
> The motivation for this was to clear out all references to a memcg by
> the time it's offlined, so that the unreachable css can be freed soon.
>
> However, recent cgroup core changes further disconnected the css from
> the cgroup object itself, so it's no longer as urgent to free the css.
>
> In addition, Tejun made offlined css iterable and split css_tryget()
> and css_tryget_online(), which would allow memcg to pin the css until
> the last charge is gone while continuing to iterate and reclaim it on
> hierarchical pressure, even after it was offlined.
>
> This would obviate the need for reparenting as a whole, not just kmem
> pages, but even remaining page cache. Michal already obsoleted the
> force_empty knob that reparents as a fallback, and whether the cache
> pages are in the parent or in a ghost css after cgroup deletion does
> not make a real difference from a user point of view, they still get
> reclaimed when the parent experiences pressure.

So, that means there's no need for a proxy object between kmem objects
and the memcg they are charged against (mem_cgroup_kmem_context in this
patch set), because it's now OK to pin the css from kmem allocations.
Furthermore, there will be no need to reparent per-memcg list_lrus when
they are introduced. That's nice!

> You could then reap dead slab caches as part of the regular per-memcg
> slab scanning in reclaim, without having to resort to auxiliary lists,
> vmpressure events etc.

Do you mean adding a per memcg shrinker that will call kmem_cache_shrink
for all memcg caches on memcg/global pressure?
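
A rough sketch of what such a hook could do, reusing memcg_slab_mutex
and the per-memcg cache list from this series (the helper name is made
up):

static void memcg_shrink_kmem_caches(struct mem_cgroup *memcg)
{
        struct memcg_cache_params *params;

        /* Ask every cache of this memcg to release its empty slabs. */
        mutex_lock(&memcg_slab_mutex);
        list_for_each_entry(params, &memcg->memcg_slab_caches, list)
                kmem_cache_shrink(params->cachep);
        mutex_unlock(&memcg_slab_mutex);
}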

Actually I recently made dead caches self-destructive at the cost of
slowing down kfrees to dead caches (see
https://www.lwn.net/Articles/602330/, it's already in the mmotm tree) so
no dead cache reaping is necessary. Do you think we still need it?

> I think it would save us a lot of code and complexity. You want
> per-memcg slab scanning *anyway*, all we'd have to change in the
> existing code would be to pin the css until the LRUs and kmem caches
> are truly empty, and switch mem_cgroup_iter() to css_tryget().
>
> Would this make sense to you?

Hmm, interesting. Thank you for such a thorough explanation.

One question. Do we still need to free mem_cgroup->kmemcg_id on css
offline so that it can be reused by new kmem-active cgroups (currently
we don't)?

If we don't free it, the root_cache->memcg_params->memcg_caches arrays
may become really huge due to lots of dead csses holding on to their ids.
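
For illustration, freeing the id on offline could look roughly like
this (hypothetical names, assuming an ida-backed id allocator):

#include <linux/idr.h>

static DEFINE_IDA(kmemcg_ida);          /* hypothetical id allocator */

static void memcg_release_kmemcg_id(struct mem_cgroup *memcg)
{
        /* Returning the id keeps the per-root memcg_caches arrays small. */
        ida_simple_remove(&kmemcg_ida, memcg->kmemcg_id);
        memcg->kmemcg_id = -1;
}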

Thanks.

2014-07-07 15:45:55

by Vladimir Davydov

[permalink] [raw]
Subject: Re: [PATCH -mm 2/8] memcg: keep all children of each root cache on a list

On Mon, Jul 07, 2014 at 10:24:35AM -0500, Christoph Lameter wrote:
> On Mon, 7 Jul 2014, Vladimir Davydov wrote:
>
> > diff --git a/mm/slab_common.c b/mm/slab_common.c
> > index d31c4bacc6a2..95a8f772b0d1 100644
> > --- a/mm/slab_common.c
> > +++ b/mm/slab_common.c
> > @@ -294,8 +294,12 @@ struct kmem_cache *memcg_create_kmem_cache(struct mem_cgroup *memcg,
> > if (IS_ERR(s)) {
> > kfree(cache_name);
> > s = NULL;
> > + goto out_unlock;
> > }
> >
> > + list_add(&s->memcg_params->siblings,
> > + &root_cache->memcg_params->children);
> > +
> > out_unlock:
> > mutex_unlock(&slab_mutex);
> >
>
> If there is an error then s is set to NULL. And then
> the list_add is done dereferencing s?

No, we skip list_add on error. I think you missed "goto out_unlock"
right after "s = NULL" (btw do_kmem_cache_create never returns NULL).

Thanks.

2014-07-07 17:14:23

by Vladimir Davydov

[permalink] [raw]
Subject: Re: [PATCH -mm 0/8] memcg: reparent kmem on css offline

07.07.2014 18:25, Johannes Weiner:
> In addition, Tejun made offlined css iterable and split css_tryget()
> and css_tryget_online(), which would allow memcg to pin the css until
> the last charge is gone while continuing to iterate and reclaim it on
> hierarchical pressure, even after it was offlined.

One more question.

With reparenting enabled, the number of cgroups (lruvecs) that must be
iterated on global reclaim is bounded by the number of live containers,
while without reparenting it's practically unbounded, isn't it? Won't
that become a source of latency spikes?

Thanks.

2014-07-08 22:05:35

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH -mm 0/8] memcg: reparent kmem on css offline

On Mon, Jul 07, 2014 at 07:40:08PM +0400, Vladimir Davydov wrote:
> On Mon, Jul 07, 2014 at 10:25:06AM -0400, Johannes Weiner wrote:
> > You could then reap dead slab caches as part of the regular per-memcg
> > slab scanning in reclaim, without having to resort to auxiliary lists,
> > vmpressure events etc.
>
> Do you mean adding a per memcg shrinker that will call kmem_cache_shrink
> for all memcg caches on memcg/global pressure?
>
> Actually I recently made dead caches self-destructive at the cost of
> slowing down kfrees to dead caches (see
> https://www.lwn.net/Articles/602330/, it's already in the mmotm tree) so
> no dead cache reaping is necessary. Do you think we still need it?
>
> > I think it would save us a lot of code and complexity. You want
> > per-memcg slab scanning *anyway*, all we'd have to change in the
> > existing code would be to pin the css until the LRUs and kmem caches
> > are truly empty, and switch mem_cgroup_iter() to css_tryget().
> >
> > Would this make sense to you?
>
> Hmm, interesting. Thank you for such a thorough explanation.
>
> One question. Do we still need to free mem_cgroup->kmemcg_id on css
> offline so that it can be reused by new kmem-active cgroups (currently
> we don't)?
>
> If we don't free it, the root_cache->memcg_params->memcg_caches arrays
> may become really huge due to lots of dead csses holding on to their ids.

We only need the O(1) access of the array for allocation - not frees
and reclaim, right?

So with your self-destruct code, can we prune caches of dead css and
then just remove them from the array? Or move them from the array to
a per-memcg linked list that can be scanned on memcg memory pressure?
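
A very rough sketch of the second variant (hypothetical helper, not
part of this series):

static void memcg_park_dead_cache(struct kmem_cache *root,
                                  struct kmem_cache *cachep, int id,
                                  struct list_head *dead_caches)
{
        lockdep_assert_held(&memcg_slab_mutex);
        /* Drop the O(1) allocation slot... */
        root->memcg_params->memcg_caches[id] = NULL;
        /* ...and park the cache where memcg pressure can find it. */
        list_move(&cachep->memcg_params->list, dead_caches);
}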

2014-07-08 22:19:15

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH -mm 0/8] memcg: reparent kmem on css offline

On Mon, Jul 07, 2014 at 09:14:15PM +0400, Vladimir Davydov wrote:
> 07.07.2014 18:25, Johannes Weiner:
> >In addition, Tejun made offlined css iterable and split css_tryget()
> >and css_tryget_online(), which would allow memcg to pin the css until
> >the last charge is gone while continuing to iterate and reclaim it on
> >hierarchical pressure, even after it was offlined.
>
> One more question.
>
> With reparenting enabled, the number of cgroups (lruvecs) that must be
> iterated on global reclaim is bounded by the number of live containers,
> while without reparenting it's practically unbounded, isn't it? Won't
> that become a source of latency spikes?

It might deteriorate a little bit, but it is a self-correcting problem
as soon as memory pressure kicks in. Creating and destroying cgroups is
serialized at a global level, so I would expect the cost of doing that
at a high rate to become a problem before the csses become an issue for
the reclaim scanner.

At some point we will probably have to make the global reclaim cgroup
walk in shrink_zone() intermittent, but I'm not aware of any problems
with it so far.

2014-07-09 07:26:15

by Vladimir Davydov

[permalink] [raw]
Subject: Re: [PATCH -mm 0/8] memcg: reparent kmem on css offline

On Tue, Jul 08, 2014 at 06:05:19PM -0400, Johannes Weiner wrote:
> On Mon, Jul 07, 2014 at 07:40:08PM +0400, Vladimir Davydov wrote:
> > On Mon, Jul 07, 2014 at 10:25:06AM -0400, Johannes Weiner wrote:
> > > You could then reap dead slab caches as part of the regular per-memcg
> > > slab scanning in reclaim, without having to resort to auxiliary lists,
> > > vmpressure events etc.
> >
> > Do you mean adding a per memcg shrinker that will call kmem_cache_shrink
> > for all memcg caches on memcg/global pressure?
> >
> > Actually I recently made dead caches self-destructive at the cost of
> > slowing down kfrees to dead caches (see
> > https://www.lwn.net/Articles/602330/, it's already in the mmotm tree) so
> > no dead cache reaping is necessary. Do you think we still need it?
> >
> > > I think it would save us a lot of code and complexity. You want
> > > per-memcg slab scanning *anyway*, all we'd have to change in the
> > > existing code would be to pin the css until the LRUs and kmem caches
> > > are truly empty, and switch mem_cgroup_iter() to css_tryget().
> > >
> > > Would this make sense to you?
> >
> > Hmm, interesting. Thank you for such a thorough explanation.
> >
> > One question. Do we still need to free mem_cgroup->kmemcg_id on css
> > offline so that it can be reused by new kmem-active cgroups (currently
> > we don't)?
> >
> > If we don't free it, the root_cache->memcg_params->memcg_caches arrays
> > may become really huge due to lots of dead csses holding on to their ids.
>
> We only need the O(1) access of the array for allocation - not frees
> and reclaim, right?

Yes.

> So with your self-destruct code, can we prune caches of dead css and
> then just remove them from the array? Or move them from the array to
> a per-memcg linked list that can be scanned on memcg memory pressure?

This shouldn't be a problem. Will do that.

Actually, I now doubt whether we need self-destruct at all. I don't
really like it, because its implementation is rather ugly and, what is
worse, it slows down kfree for dead caches noticeably. The SLAB
maintainers don't seem to be fond of it either. Maybe we'd better drop
it in favour of shrinking dead caches on memory pressure?

Then *empty* dead caches will linger until memory pressure reaps them,
which looks a bit strange, because there's absolutely no reason to keep
them around that long. However, the code will be simpler, and kfrees to
dead caches will proceed at the same speed as kfrees to active caches.

Thanks.