2012-11-01 12:07:53

by Glauber Costa

Subject: [PATCH v6 00/29] kmem controller for memcg.

Hi,

This work introduces the kernel memory controller for memcg. Unlike previous
submissions, this includes the whole controller, comprising slab and stack
memory.

Slab-specific considerations: I've modified the kmem_cache_free() mechanism
so that the code lives in a single location, which is then inlined into all
interested instances of kmem_cache_free(). Following this logic, there is
little to simplify in kmalloc/kmem_cache_alloc, which already do only this.
The attribute propagation code was kept, since integrating it would still
depend on quite a bit of infrastructure.



*v6 - joint submission slab + stack.
- fixed race conditions with cache destruction.
- unified code for kmem_cache_free cache derivation in mm/slab.h.
- changed memcg_css_id to memcg_cache_id, now not equal to css_id.
- remove extra memcg_kmem_get_cache in the kmalloc path.
- fixed a bug with slab that would occur with some invocations of
kmem_cache_free
- use post_create() for memcg kmem_accounted_flag propagation.

*v5: - changed charged order, kmem charged first.
- minor nits and comments merged.

*v4: - kmem_accounted can no longer become unlimited
- kmem_accounted can no longer become limited, if group has children.
- documentation moved to this patchset
- more style changes
- css_get in charge path to ensure task won't move during charge
*v3:
- Changed function names to match memcg's
- avoid doing get/put in charge/uncharge path
- revert back to keeping the account enabled after it is first activated

Numbers can be found at https://lkml.org/lkml/2012/9/13/239

A (throwaway) git tree with them is placed at:

git://git.kernel.org/pub/scm/linux/kernel/git/glommer/memcg.git kmemcg-slab


The kernel memory limitation mechanism for memcg concerns itself with
disallowing potentially non-reclaimable allocations from happening in excessive
quantities by a particular set of tasks (a cgroup). Those allocations could
create pressure that affects the behavior of a different and unrelated set of
tasks.

Its basic working mechanism consists of annotating interesting allocations
with the __GFP_KMEMCG flag. When this flag is set, the allocating task will
have its memcg identified and charged. Once a specific limit is reached,
further allocations will be denied.
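
For illustration, the annotation side looks roughly like this (a sketch of the
pattern used by the stack patch later in this series; free_memcg_kmem_pages()
is one of the helpers introduced by this set):

    /* Charged to the allocating task's memcg; denied once the kmem limit is hit. */
    page = alloc_pages_node(node, GFP_KERNEL | __GFP_KMEMCG, THREAD_SIZE_ORDER);
    ...
    /* The free side must use the memcg-aware helper so the charge is returned. */
    free_memcg_kmem_pages((unsigned long)page_address(page), THREAD_SIZE_ORDER);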

As of this work, pages allocated on behalf of the slab allocator and stack
memory are tracked. Other kinds of memory, like spurious calls to
__get_free_pages, vmalloc, page tables, etc., are not tracked. Besides the
memcg cost that may come with those allocations - which other allocations may
rightfully want to avoid - memory needs to be somehow traceable back to a task
in order to be accounted by memcg. This may be trivial - as for the stack - or
a bit complicated, requiring extra work - as in the case of the slab.
IOW, which memory to track is always a complexity tradeoff. We believe stack +
slab provides enough coverage of the relevant kernel memory most of the time.

Tracking accuracy depends on how well we can track memory back to a specific
task. Memory allocated for the stack is always accurately tracked, since stack
memory trivially belongs to a task and is never shared. For the slab, the
accuracy depends on the amount of object sharing between tasks in different
cgroups (like memcg does for shmem, the kernel memory controller operates on a
first-touch basis). Workloads such as OS containers usually have a very low
amount of sharing, and will therefore present high accuracy.

One example of problematic pressure that can be prevented by this work is
a fork bomb conducted in a shell. We prevent it by noting that tasks use a
limited number of stack pages. Seen this way, a fork bomb is just a special
case of resource abuse. If the offender is unable to grab more pages for the
stack, no new tasks can be created.

There are also other things the general mechanism protects against. For
example, using too much pinned dentry and inode cache, by touching files and
leaving them in memory forever.

In fact, a simple:

while true; do mkdir x; cd x; done

can halt your system easily, because the file system limits are hard to reach
(big disks), but the kernel memory is not. Those are examples, but the list
certainly doesn't stop here.

An important use case for all of this is people offering hosting services
through containers. In a physical box we can put a limit on some resources,
like the total number of processes or threads. But in an environment where
each independent user gets its own piece of the machine, we don't want a
potentially malicious user to destroy good users' services.

This might be true for systemd as well, which now groups services inside
cgroups. They generally want to put forward a set of guarantees that limits the
running service in a variety of ways, so that if it becomes badly behaved, it
won't interfere with the rest of the system.

There is, of course, a cost for that. To mitigate it, static branches are
used. This code will only be enabled after the first user of this service
configures any kmem limit, guaranteeing near-zero overhead even if a large
number of (non-kmem-limited) memcgs are deployed.
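
The guard itself is a single static-key test, so until then callers only pay
a patched-out jump. A sketch (memcg_kmem_enabled() is defined in the
memcontrol.h hunks of this series):

    static inline bool memcg_kmem_enabled(void)
    {
            return static_key_false(&memcg_kmem_enabled_key);
    }

    /* hot paths then bail out early: */
    if (!memcg_kmem_enabled())
            return cachep;  /* accounting not enabled, nothing to do */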

Behavior depends on the values of memory.limit_in_bytes (U), and
memory.kmem.limit_in_bytes (K):

U != 0, K = unlimited:
This is the standard memcg limitation mechanism already present before kmem
accounting. Kernel memory is completely ignored.

U != 0, K < U:
Kernel memory is a subset of the user memory. This setup is useful in
deployments where the total amount of memory per-cgroup is overcommitted.
Overcommitting kernel memory limits is definitely not recommended, since the
box can still run out of non-reclaimable memory.
In this case, the admin could set up K so that the sum of all groups is
never greater than the total memory, and freely set U at the cost of QoS.

U != 0, K >= U:
Kmem charges will also be fed to the user counter, and reclaim will be
triggered for the cgroup for both kinds of memory. This setup gives the
admin a unified view of memory, and it is also useful for people who just
want to track kernel memory usage.


Glauber Costa (27):
memcg: change defines to an enum
kmem accounting basic infrastructure
Add a __GFP_KMEMCG flag
memcg: kmem controller infrastructure
mm: Allocate kernel pages to the right memcg
res_counter: return amount of charges after res_counter_uncharge
memcg: kmem accounting lifecycle management
memcg: use static branches when code not in use
memcg: allow a memcg with kmem charges to be destructed.
execute the whole memcg freeing in free_worker
protect architectures where THREAD_SIZE >= PAGE_SIZE against fork
bombs
Add documentation about the kmem controller
slab/slub: struct memcg_params
slab: annotate on-slab caches nodelist locks
consider a memcg parameter in kmem_create_cache
Allocate memory for memcg caches whenever a new memcg appears
memcg: infrastructure to match an allocation to the right cache
memcg: skip memcg kmem allocations in specified code regions
sl[au]b: always get the cache from its page in kmem_cache_free
sl[au]b: Allocate objects from memcg cache
memcg: destroy memcg caches
memcg/sl[au]b Track all the memcg children of a kmem_cache.
memcg/sl[au]b: shrink dead caches
Aggregate memcg cache values in slabinfo
slab: propagate tunables values
slub: slub-specific propagation changes.
Add slab-specific documentation about the kmem controller

Suleiman Souhlal (2):
memcg: Make it possible to use the stock for more than one page.
memcg: Reclaim when more than one page needed.

Documentation/cgroups/memory.txt | 66 +-
Documentation/cgroups/resource_counter.txt | 7 +-
include/linux/gfp.h | 6 +-
include/linux/memcontrol.h | 203 ++++
include/linux/res_counter.h | 12 +-
include/linux/sched.h | 1 +
include/linux/slab.h | 48 +
include/linux/slab_def.h | 3 +
include/linux/slub_def.h | 9 +-
include/linux/thread_info.h | 2 +
include/trace/events/gfpflags.h | 1 +
init/Kconfig | 2 +-
kernel/fork.c | 4 +-
kernel/res_counter.c | 20 +-
mm/memcontrol.c | 1562 ++++++++++++++++++++++++----
mm/page_alloc.c | 35 +
mm/slab.c | 93 +-
mm/slab.h | 137 ++-
mm/slab_common.c | 118 ++-
mm/slob.c | 2 +-
mm/slub.c | 124 ++-
21 files changed, 2171 insertions(+), 284 deletions(-)

--
1.7.11.7


2012-11-01 12:07:59

by Glauber Costa

Subject: [PATCH v6 13/29] protect architectures where THREAD_SIZE >= PAGE_SIZE against fork bombs

Because those architectures will draw their stacks directly from the
page allocator, rather than the slab cache, we can directly pass the
__GFP_KMEMCG flag and issue the corresponding free_pages.

This code path is taken when the architecture doesn't define
CONFIG_ARCH_THREAD_INFO_ALLOCATOR (only ia64 seems to), and has
THREAD_SIZE >= PAGE_SIZE. Luckily, most - if not all - of the remaining
architectures fall in this category.

This will guarantee that every stack page is accounted to the memcg the
process currently lives in, and that the allocations will fail if they go
over the limit.

For the time being, I am defining a new variant of THREADINFO_GFP, so as not
to mess with the other path. Once the slab is also tracked by memcg, we can
get rid of that flag.

Tested to successfully protect against :(){ :|:& };:

Signed-off-by: Glauber Costa <[email protected]>
Acked-by: Frederic Weisbecker <[email protected]>
Acked-by: Kamezawa Hiroyuki <[email protected]>
Reviewed-by: Michal Hocko <[email protected]>
CC: Christoph Lameter <[email protected]>
CC: Pekka Enberg <[email protected]>
CC: Johannes Weiner <[email protected]>
CC: Suleiman Souhlal <[email protected]>
CC: Tejun Heo <[email protected]>
---
include/linux/thread_info.h | 2 ++
kernel/fork.c | 4 ++--
2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/include/linux/thread_info.h b/include/linux/thread_info.h
index ccc1899..e7e0473 100644
--- a/include/linux/thread_info.h
+++ b/include/linux/thread_info.h
@@ -61,6 +61,8 @@ extern long do_no_restart_syscall(struct restart_block *parm);
# define THREADINFO_GFP (GFP_KERNEL | __GFP_NOTRACK)
#endif

+#define THREADINFO_GFP_ACCOUNTED (THREADINFO_GFP | __GFP_KMEMCG)
+
/*
* flag set/clear/test wrappers
* - pass TIF_xxxx constants to these functions
diff --git a/kernel/fork.c b/kernel/fork.c
index 03b86f1..3ec055b 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -146,7 +146,7 @@ void __weak arch_release_thread_info(struct thread_info *ti)
static struct thread_info *alloc_thread_info_node(struct task_struct *tsk,
int node)
{
- struct page *page = alloc_pages_node(node, THREADINFO_GFP,
+ struct page *page = alloc_pages_node(node, THREADINFO_GFP_ACCOUNTED,
THREAD_SIZE_ORDER);

return page ? page_address(page) : NULL;
@@ -154,7 +154,7 @@ static struct thread_info *alloc_thread_info_node(struct task_struct *tsk,

static inline void free_thread_info(struct thread_info *ti)
{
- free_pages((unsigned long)ti, THREAD_SIZE_ORDER);
+ free_memcg_kmem_pages((unsigned long)ti, THREAD_SIZE_ORDER);
}
# else
static struct kmem_cache *thread_info_cache;
--
1.7.11.7

2012-11-01 12:08:06

by Glauber Costa

Subject: [PATCH v6 05/29] Add a __GFP_KMEMCG flag

This flag is used to indicate to the callees that this allocation is a
kernel allocation in process context, and should be accounted to
current's memcg. It takes the numerical place of the recently removed
__GFP_NO_KSWAPD.

[ v4: make flag unconditional, also declare it in trace code ]

Signed-off-by: Glauber Costa <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Acked-by: Rik van Riel <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Acked-by: Kamezawa Hiroyuki <[email protected]>
Acked-by: Michal Hocko <[email protected]>
CC: Christoph Lameter <[email protected]>
CC: Pekka Enberg <[email protected]>
CC: Suleiman Souhlal <[email protected]>
CC: Tejun Heo <[email protected]>
---
include/linux/gfp.h | 3 ++-
include/trace/events/gfpflags.h | 1 +
2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 6418418..5effbd4 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -31,6 +31,7 @@ struct vm_area_struct;
#define ___GFP_THISNODE 0x40000u
#define ___GFP_RECLAIMABLE 0x80000u
#define ___GFP_NOTRACK 0x200000u
+#define ___GFP_KMEMCG 0x400000u
#define ___GFP_OTHER_NODE 0x800000u
#define ___GFP_WRITE 0x1000000u

@@ -87,7 +88,7 @@ struct vm_area_struct;

#define __GFP_OTHER_NODE ((__force gfp_t)___GFP_OTHER_NODE) /* On behalf of other node */
#define __GFP_WRITE ((__force gfp_t)___GFP_WRITE) /* Allocator intends to dirty page */
-
+#define __GFP_KMEMCG ((__force gfp_t)___GFP_KMEMCG) /* Allocation comes from a memcg-accounted resource */
/*
* This may seem redundant, but it's a way of annotating false positives vs.
* allocations that simply cannot be supported (e.g. page tables).
diff --git a/include/trace/events/gfpflags.h b/include/trace/events/gfpflags.h
index 9391706..730df12 100644
--- a/include/trace/events/gfpflags.h
+++ b/include/trace/events/gfpflags.h
@@ -36,6 +36,7 @@
{(unsigned long)__GFP_RECLAIMABLE, "GFP_RECLAIMABLE"}, \
{(unsigned long)__GFP_MOVABLE, "GFP_MOVABLE"}, \
{(unsigned long)__GFP_NOTRACK, "GFP_NOTRACK"}, \
+ {(unsigned long)__GFP_KMEMCG, "GFP_KMEMCG"}, \
{(unsigned long)__GFP_OTHER_NODE, "GFP_OTHER_NODE"} \
) : "GFP_NOWAIT"

--
1.7.11.7

2012-11-01 12:08:32

by Glauber Costa

Subject: [PATCH v6 20/29] memcg: skip memcg kmem allocations in specified code regions

This patch creates a mechanism that skips memcg accounting of allocations
made during certain pieces of our core code. It basically works in the same
way as preempt_disable()/preempt_enable(): by marking a region under which
all allocations will be accounted to the root memcg.

We need this to prevent races in early cache creation, when we allocate
data using caches that are not necessarily created yet.
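
A sketch of how the resulting pair is used (it mirrors the
memcg_create_kmem_cache() hunk further down in this patch):

    memcg_stop_kmem_account();
    /* anything allocated in here is accounted to the root memcg */
    new_cachep = kmem_cache_dup(memcg, cachep);
    memcg_resume_kmem_account();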

[ v2: wrap the whole enqueue process, INIT_WORK can alloc memory ]

Signed-off-by: Glauber Costa <[email protected]>
CC: Christoph Lameter <[email protected]>
CC: Pekka Enberg <[email protected]>
CC: Michal Hocko <[email protected]>
CC: Kamezawa Hiroyuki <[email protected]>
CC: Johannes Weiner <[email protected]>
CC: Suleiman Souhlal <[email protected]>
CC: Tejun Heo <[email protected]>
---
include/linux/sched.h | 1 +
mm/memcontrol.c | 63 +++++++++++++++++++++++++++++++++++++++++++++++++--
2 files changed, 62 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 0d907e1..9fad6c1 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1581,6 +1581,7 @@ struct task_struct {
unsigned long nr_pages; /* uncharged usage */
unsigned long memsw_nr_pages; /* uncharged mem+swap usage */
} memcg_batch;
+ unsigned int memcg_kmem_skip_account;
#endif
#ifdef CONFIG_HAVE_HW_BREAKPOINT
atomic_t ptrace_bp_refcnt;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 318dc67..ff42586 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2994,6 +2994,41 @@ out:
kfree(s->memcg_params);
}

+/*
+ * During the creation a new cache, we need to disable our accounting mechanism
+ * altogether. This is true even if we are not creating, but rather just
+ * enqueing new caches to be created.
+ *
+ * This is because that process will trigger allocations; some visible, like
+ * explicit kmallocs to auxiliary data structures, name strings and internal
+ * cache structures; some well concealed, like INIT_WORK() that can allocate
+ * objects during debug.
+ *
+ * If any allocation happens during memcg_kmem_get_cache, we will recurse back
+ * to it. This may not be a bounded recursion: since the first cache creation
+ * failed to complete (waiting on the allocation), we'll just try to create the
+ * cache again, failing at the same point.
+ *
+ * memcg_kmem_get_cache is prepared to abort after seeing a positive count of
+ * memcg_kmem_skip_account. So we enclose anything that might allocate memory
+ * inside the following two functions.
+ */
+static inline void memcg_stop_kmem_account(void)
+{
+ if (!current->mm)
+ return;
+
+ current->memcg_kmem_skip_account++;
+}
+
+static inline void memcg_resume_kmem_account(void)
+{
+ if (!current->mm)
+ return;
+
+ current->memcg_kmem_skip_account--;
+}
+
static char *memcg_cache_name(struct mem_cgroup *memcg, struct kmem_cache *s)
{
char *name;
@@ -3052,7 +3087,10 @@ static struct kmem_cache *memcg_create_kmem_cache(struct mem_cgroup *memcg,
if (new_cachep)
goto out;

+ /* Don't block progress to enqueue caches for internal infrastructure */
+ memcg_stop_kmem_account();
new_cachep = kmem_cache_dup(memcg, cachep);
+ memcg_resume_kmem_account();

if (new_cachep == NULL) {
new_cachep = cachep;
@@ -3094,8 +3132,8 @@ static void memcg_create_cache_work_func(struct work_struct *w)
* Enqueue the creation of a per-memcg kmem_cache.
* Called with rcu_read_lock.
*/
-static void memcg_create_cache_enqueue(struct mem_cgroup *memcg,
- struct kmem_cache *cachep)
+static void __memcg_create_cache_enqueue(struct mem_cgroup *memcg,
+ struct kmem_cache *cachep)
{
struct create_work *cw;

@@ -3116,6 +3154,24 @@ static void memcg_create_cache_enqueue(struct mem_cgroup *memcg,
schedule_work(&cw->work);
}

+static void memcg_create_cache_enqueue(struct mem_cgroup *memcg,
+ struct kmem_cache *cachep)
+{
+ /*
+ * We need to stop accounting when we kmalloc, because if the
+ * corresponding kmalloc cache is not yet created, the first allocation
+ * in __memcg_create_cache_enqueue will recurse.
+ *
+ * However, it is better to enclose the whole function. Depending on
+ * the debugging options enabled, INIT_WORK(), for instance, can
+ * trigger an allocation. This too, will make us recurse. Because at
+ * this point we can't allow ourselves back into memcg_kmem_get_cache,
+ * the safest choice is to do it like this, wrapping the whole function.
+ */
+ memcg_stop_kmem_account();
+ __memcg_create_cache_enqueue(memcg, cachep);
+ memcg_resume_kmem_account();
+}
/*
* Return the kmem_cache we're supposed to use for a slab allocation.
* We try to use the current memcg's version of the cache.
@@ -3138,6 +3194,9 @@ struct kmem_cache *__memcg_kmem_get_cache(struct kmem_cache *cachep,
VM_BUG_ON(!cachep->memcg_params);
VM_BUG_ON(!cachep->memcg_params->is_root_cache);

+ if (!current->mm || current->memcg_kmem_skip_account)
+ return cachep;
+
rcu_read_lock();
memcg = mem_cgroup_from_task(rcu_dereference(current->mm->owner));
rcu_read_unlock();
--
1.7.11.7

2012-11-01 12:07:58

by Glauber Costa

Subject: [PATCH v6 02/29] memcg: Reclaim when more than one page needed.

From: Suleiman Souhlal <[email protected]>

mem_cgroup_do_charge() was written before kmem accounting, and expects
three cases: being called for 1 page, being called for a stock of 32
pages, or being called for a hugepage. If we call it for 2 or 3 pages (and
both the stack and several slabs used in process creation are such, at
least with the debug options I had), it assumes it is being called for the
stock and just retries without reclaiming.

Fix that by passing down a minsize argument in addition to the csize.

And what to do about that (csize == PAGE_SIZE && ret) retry? If it's
needed at all (and presumably it is, since it's there, perhaps to handle
races), then it should be extended to more than PAGE_SIZE, yet how far?
And should there be a retry count limit, and if so, what? For now, retry up
to COSTLY_ORDER (as page_alloc.c does) and make sure not to do it if
__GFP_NORETRY is set.

[v4: fixed nr pages calculation pointed out by Christoph Lameter ]

Signed-off-by: Suleiman Souhlal <[email protected]>
Signed-off-by: Glauber Costa <[email protected]>
Acked-by: Kamezawa Hiroyuki <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Acked-by: David Rientjes <[email protected]>
CC: Tejun Heo <[email protected]>
---
mm/memcontrol.c | 16 +++++++++-------
1 file changed, 9 insertions(+), 7 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 4a1abe9..aa0d9b0 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2226,7 +2226,8 @@ enum {
};

static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
- unsigned int nr_pages, bool oom_check)
+ unsigned int nr_pages, unsigned int min_pages,
+ bool oom_check)
{
unsigned long csize = nr_pages * PAGE_SIZE;
struct mem_cgroup *mem_over_limit;
@@ -2249,18 +2250,18 @@ static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
} else
mem_over_limit = mem_cgroup_from_res_counter(fail_res, res);
/*
- * nr_pages can be either a huge page (HPAGE_PMD_NR), a batch
- * of regular pages (CHARGE_BATCH), or a single regular page (1).
- *
* Never reclaim on behalf of optional batching, retry with a
* single page instead.
*/
- if (nr_pages == CHARGE_BATCH)
+ if (nr_pages > min_pages)
return CHARGE_RETRY;

if (!(gfp_mask & __GFP_WAIT))
return CHARGE_WOULDBLOCK;

+ if (gfp_mask & __GFP_NORETRY)
+ return CHARGE_NOMEM;
+
ret = mem_cgroup_reclaim(mem_over_limit, gfp_mask, flags);
if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
return CHARGE_RETRY;
@@ -2273,7 +2274,7 @@ static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
* unlikely to succeed so close to the limit, and we fall back
* to regular pages anyway in case of failure.
*/
- if (nr_pages == 1 && ret)
+ if (nr_pages <= (1 << PAGE_ALLOC_COSTLY_ORDER) && ret)
return CHARGE_RETRY;

/*
@@ -2408,7 +2409,8 @@ again:
nr_oom_retries = MEM_CGROUP_RECLAIM_RETRIES;
}

- ret = mem_cgroup_do_charge(memcg, gfp_mask, batch, oom_check);
+ ret = mem_cgroup_do_charge(memcg, gfp_mask, batch, nr_pages,
+ oom_check);
switch (ret) {
case CHARGE_OK:
break;
--
1.7.11.7

2012-11-01 12:08:49

by Glauber Costa

Subject: [PATCH v6 24/29] memcg/sl[au]b Track all the memcg children of a kmem_cache.

This enables us to remove all the children of a kmem_cache being
destroyed if, for example, the kernel module it's being used in
gets unloaded. Otherwise, the children will still point to the
destroyed parent.

[ v6: cancel pending work before destroying child cache ]

Signed-off-by: Suleiman Souhlal <[email protected]>
Signed-off-by: Glauber Costa <[email protected]>
CC: Christoph Lameter <[email protected]>
CC: Pekka Enberg <[email protected]>
CC: Michal Hocko <[email protected]>
CC: Kamezawa Hiroyuki <[email protected]>
CC: Johannes Weiner <[email protected]>
CC: Tejun Heo <[email protected]>
---
include/linux/memcontrol.h | 5 +++++
mm/memcontrol.c | 49 ++++++++++++++++++++++++++++++++++++++++++++--
mm/slab_common.c | 3 +++
3 files changed, 55 insertions(+), 2 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 7d59852..d5511cc 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -447,6 +447,7 @@ struct kmem_cache *
__memcg_kmem_get_cache(struct kmem_cache *cachep, gfp_t gfp);

void mem_cgroup_destroy_cache(struct kmem_cache *cachep);
+void kmem_cache_destroy_memcg_children(struct kmem_cache *s);

/**
* memcg_kmem_newpage_charge: verify if a new kmem allocation is allowed.
@@ -594,6 +595,10 @@ memcg_kmem_get_cache(struct kmem_cache *cachep, gfp_t gfp)
{
return cachep;
}
+
+static inline void kmem_cache_destroy_memcg_children(struct kmem_cache *s)
+{
+}
#endif /* CONFIG_MEMCG_KMEM */
#endif /* _LINUX_MEMCONTROL_H */

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b6c725d..31da8bc 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2741,6 +2741,8 @@ static void __mem_cgroup_commit_charge(struct mem_cgroup *memcg,
memcg_check_events(memcg, page);
}

+static DEFINE_MUTEX(set_limit_mutex);
+
#ifdef CONFIG_MEMCG_KMEM
static inline bool memcg_can_account_kmem(struct mem_cgroup *memcg)
{
@@ -3153,6 +3155,51 @@ out:
return new_cachep;
}

+void kmem_cache_destroy_memcg_children(struct kmem_cache *s)
+{
+ struct kmem_cache *c;
+ int i;
+
+ if (!s->memcg_params)
+ return;
+ if (!s->memcg_params->is_root_cache)
+ return;
+
+ /*
+ * If the cache is being destroyed, we trust that there is no one else
+ * requesting objects from it. Even if there are, the sanity checks in
+ * kmem_cache_destroy should caught this ill-case.
+ *
+ * Still, we don't want anyone else freeing memcg_caches under our
+ * noses, which can happen if a new memcg comes to life. As usual,
+ * we'll take the set_limit_mutex to protect ourselves against this.
+ */
+ mutex_lock(&set_limit_mutex);
+ for (i = 0; i < memcg_limited_groups_array_size; i++) {
+ c = s->memcg_params->memcg_caches[i];
+ if (!c)
+ continue;
+
+ /*
+ * We will now manually delete the caches, so to avoid races
+ * we need to cancel all pending destruction workers and
+ * proceed with destruction ourselves.
+ *
+ * kmem_cache_destroy() will call kmem_cache_shrink internally,
+ * and that could spawn the workers again: it is likely that
+ * the cache still have active pages until this very moment.
+ * This would lead us back to mem_cgroup_destroy_cache.
+ *
+ * But that will not execute at all if the "dead" flag is not
+ * set, so flip it down to guarantee we are in control.
+ */
+ c->memcg_params->dead = false;
+ cancel_delayed_work_sync(&c->memcg_params->destroy);
+ kmem_cache_destroy(c);
+ }
+ mutex_unlock(&set_limit_mutex);
+}
+
struct create_work {
struct mem_cgroup *memcg;
struct kmem_cache *cachep;
@@ -4263,8 +4310,6 @@ void mem_cgroup_print_bad_page(struct page *page)
}
#endif

-static DEFINE_MUTEX(set_limit_mutex);
-
static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
unsigned long long val)
{
diff --git a/mm/slab_common.c b/mm/slab_common.c
index b76a74c..04215a5 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -221,6 +221,9 @@ EXPORT_SYMBOL(kmem_cache_create);

void kmem_cache_destroy(struct kmem_cache *s)
{
+ /* Destroy all the children caches if we aren't a memcg cache */
+ kmem_cache_destroy_memcg_children(s);
+
get_online_cpus();
mutex_lock(&slab_mutex);
s->refcount--;
--
1.7.11.7

2012-11-01 12:09:20

by Glauber Costa

Subject: [PATCH v6 23/29] memcg: destroy memcg caches

This patch implements destruction of memcg caches. Right now,
only caches where our reference counter is the last remaining are
deleted. If there are any other reference counters around, we just
leave the caches lying around until they go away.

When that happens, a destruction function is called from the cache
code. Caches are only destroyed in process context, so we queue them
up for later processing in the general case.
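
The flow, condensed from the memcg_release_pages() and
mem_cgroup_destroy_cache() hunks below:

    /* slab side: the last page of a dead memcg cache is released */
    if (atomic_sub_and_test(1 << order, &s->memcg_params->nr_pages))
            mem_cgroup_destroy_cache(s);

    /* memcg side: defer the actual kmem_cache_destroy() to a workqueue,
     * since we might be in a context that cannot sleep */
    schedule_work(&cachep->memcg_params->destroy);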

[ v5: removed cachep backpointer ]

Signed-off-by: Glauber Costa <[email protected]>
CC: Christoph Lameter <[email protected]>
CC: Pekka Enberg <[email protected]>
CC: Michal Hocko <[email protected]>
CC: Kamezawa Hiroyuki <[email protected]>
CC: Johannes Weiner <[email protected]>
CC: Suleiman Souhlal <[email protected]>
CC: Tejun Heo <[email protected]>
---
include/linux/memcontrol.h | 2 ++
include/linux/slab.h | 8 ++++++
mm/memcontrol.c | 63 ++++++++++++++++++++++++++++++++++++++++++++++
mm/slab.c | 3 +++
mm/slab.h | 23 +++++++++++++++++
mm/slub.c | 7 +++++-
6 files changed, 105 insertions(+), 1 deletion(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index d77d88d..7d59852 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -446,6 +446,8 @@ void memcg_update_array_size(int num_groups);
struct kmem_cache *
__memcg_kmem_get_cache(struct kmem_cache *cachep, gfp_t gfp);

+void mem_cgroup_destroy_cache(struct kmem_cache *cachep);
+
/**
* memcg_kmem_newpage_charge: verify if a new kmem allocation is allowed.
* @gfp: the gfp allocation flags.
diff --git a/include/linux/slab.h b/include/linux/slab.h
index 892367e..ef2314e 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -181,6 +181,7 @@ unsigned int kmem_cache_size(struct kmem_cache *);
#define ARCH_SLAB_MINALIGN __alignof__(unsigned long long)
#endif

+#include <linux/workqueue.h>
/*
* This is the main placeholder for memcg-related information in kmem caches.
* struct kmem_cache will hold a pointer to it, so the memory cost while
@@ -198,6 +199,10 @@ unsigned int kmem_cache_size(struct kmem_cache *);
* @memcg: pointer to the memcg this cache belongs to
* @list: list_head for the list of all caches in this memcg
* @root_cache: pointer to the global, root cache, this cache was derived from
+ * @dead: set to true after the memcg dies; the cache may still be around.
+ * @nr_pages: number of pages that belongs to this cache.
+ * @destroy: worker to be called whenever we are ready, or believe we may be
+ * ready, to destroy this cache.
*/
struct memcg_cache_params {
bool is_root_cache;
@@ -207,6 +212,9 @@ struct memcg_cache_params {
struct mem_cgroup *memcg;
struct list_head list;
struct kmem_cache *root_cache;
+ bool dead;
+ atomic_t nr_pages;
+ struct work_struct destroy;
};
};
};
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index f7773ed..b6c725d 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2748,6 +2748,19 @@ static inline bool memcg_can_account_kmem(struct mem_cgroup *memcg)
(memcg->kmem_account_flags & KMEM_ACCOUNTED_MASK);
}

+/*
+ * This is a bit cumbersome, but it is rarely used and avoids a backpointer
+ * in the memcg_cache_params struct.
+ */
+static struct kmem_cache *memcg_params_to_cache(struct memcg_cache_params *p)
+{
+ struct kmem_cache *cachep;
+
+ VM_BUG_ON(p->is_root_cache);
+ cachep = p->root_cache;
+ return cachep->memcg_params->memcg_caches[memcg_cache_id(p->memcg)];
+}
+
static int memcg_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp, u64 size)
{
struct res_counter *fail_res;
@@ -3029,6 +3042,31 @@ static inline void memcg_resume_kmem_account(void)
current->memcg_kmem_skip_account--;
}

+static void kmem_cache_destroy_work_func(struct work_struct *w)
+{
+ struct kmem_cache *cachep;
+ struct memcg_cache_params *p;
+
+ p = container_of(w, struct memcg_cache_params, destroy);
+
+ cachep = memcg_params_to_cache(p);
+
+ if (!atomic_read(&cachep->memcg_params->nr_pages))
+ kmem_cache_destroy(cachep);
+}
+
+void mem_cgroup_destroy_cache(struct kmem_cache *cachep)
+{
+ if (!cachep->memcg_params->dead)
+ return;
+
+ /*
+ * We have to defer the actual destroying to a workqueue, because
+ * we might currently be in a context that cannot sleep.
+ */
+ schedule_work(&cachep->memcg_params->destroy);
+}
+
static char *memcg_cache_name(struct mem_cgroup *memcg, struct kmem_cache *s)
{
char *name;
@@ -3102,6 +3140,7 @@ static struct kmem_cache *memcg_create_kmem_cache(struct mem_cgroup *memcg,

mem_cgroup_get(memcg);
new_cachep->memcg_params->root_cache = cachep;
+ atomic_set(&new_cachep->memcg_params->nr_pages , 0);

cachep->memcg_params->memcg_caches[idx] = new_cachep;
/*
@@ -3120,6 +3159,25 @@ struct create_work {
struct work_struct work;
};

+static void mem_cgroup_destroy_all_caches(struct mem_cgroup *memcg)
+{
+ struct kmem_cache *cachep;
+ struct memcg_cache_params *params;
+
+ if (!memcg_kmem_is_active(memcg))
+ return;
+
+ mutex_lock(&memcg->slab_caches_mutex);
+ list_for_each_entry(params, &memcg->memcg_slab_caches, list) {
+ cachep = memcg_params_to_cache(params);
+ cachep->memcg_params->dead = true;
+ INIT_WORK(&cachep->memcg_params->destroy,
+ kmem_cache_destroy_work_func);
+ schedule_work(&cachep->memcg_params->destroy);
+ }
+ mutex_unlock(&memcg->slab_caches_mutex);
+}
+
static void memcg_create_cache_work_func(struct work_struct *w)
{
struct create_work *cw;
@@ -3335,6 +3393,10 @@ void __memcg_kmem_uncharge_pages(struct page *page, int order)
VM_BUG_ON(mem_cgroup_is_root(memcg));
memcg_uncharge_kmem(memcg, PAGE_SIZE << order);
}
+#else
+static inline void mem_cgroup_destroy_all_caches(struct mem_cgroup *memcg)
+{
+}
#endif /* CONFIG_MEMCG_KMEM */

#ifdef CONFIG_TRANSPARENT_HUGEPAGE
@@ -5950,6 +6012,7 @@ static int mem_cgroup_pre_destroy(struct cgroup *cont)
{
struct mem_cgroup *memcg = mem_cgroup_from_cont(cont);

+ mem_cgroup_destroy_all_caches(memcg);
return mem_cgroup_force_empty(memcg, false);
}

diff --git a/mm/slab.c b/mm/slab.c
index c5d6937..15bb502 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -1943,6 +1943,7 @@ static void *kmem_getpages(struct kmem_cache *cachep, gfp_t flags, int nodeid)
if (page->pfmemalloc)
SetPageSlabPfmemalloc(page + i);
}
+ memcg_bind_pages(cachep, cachep->gfporder);

if (kmemcheck_enabled && !(cachep->flags & SLAB_NOTRACK)) {
kmemcheck_alloc_shadow(page, cachep->gfporder, flags, nodeid);
@@ -1979,6 +1980,8 @@ static void kmem_freepages(struct kmem_cache *cachep, void *addr)
__ClearPageSlab(page);
page++;
}
+
+ memcg_release_pages(cachep, cachep->gfporder);
if (current->reclaim_state)
current->reclaim_state->reclaimed_slab += nr_freed;
free_memcg_kmem_pages((unsigned long)addr, cachep->gfporder);
diff --git a/mm/slab.h b/mm/slab.h
index fb1c4c4..3ef41e1 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -109,6 +109,21 @@ static inline bool cache_match_memcg(struct kmem_cache *cachep,
(cachep->memcg_params->memcg == memcg);
}

+static inline void memcg_bind_pages(struct kmem_cache *s, int order)
+{
+ if (!is_root_cache(s))
+ atomic_add(1 << order, &s->memcg_params->nr_pages);
+}
+
+static inline void memcg_release_pages(struct kmem_cache *s, int order)
+{
+ if (is_root_cache(s))
+ return;
+
+ if (atomic_sub_and_test((1 << order), &s->memcg_params->nr_pages))
+ mem_cgroup_destroy_cache(s);
+}
+
static inline bool slab_equal_or_root(struct kmem_cache *s,
struct kmem_cache *p)
{
@@ -127,6 +142,14 @@ static inline bool cache_match_memcg(struct kmem_cache *cachep,
return true;
}

+static inline void memcg_bind_pages(struct kmem_cache *s, int order)
+{
+}
+
+static inline void memcg_release_pages(struct kmem_cache *s, int order)
+{
+}
+
static inline bool slab_equal_or_root(struct kmem_cache *s,
struct kmem_cache *p)
{
diff --git a/mm/slub.c b/mm/slub.c
index 8778370..ffc8ede 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1344,6 +1344,7 @@ static struct page *new_slab(struct kmem_cache *s, gfp_t flags, int node)
void *start;
void *last;
void *p;
+ int order;

BUG_ON(flags & GFP_SLAB_BUG_MASK);

@@ -1352,7 +1353,9 @@ static struct page *new_slab(struct kmem_cache *s, gfp_t flags, int node)
if (!page)
goto out;

+ order = compound_order(page);
inc_slabs_node(s, page_to_nid(page), page->objects);
+ memcg_bind_pages(s, order);
page->slab_cache = s;
__SetPageSlab(page);
if (page->pfmemalloc)
@@ -1361,7 +1364,7 @@ static struct page *new_slab(struct kmem_cache *s, gfp_t flags, int node)
start = page_address(page);

if (unlikely(s->flags & SLAB_POISON))
- memset(start, POISON_INUSE, PAGE_SIZE << compound_order(page));
+ memset(start, POISON_INUSE, PAGE_SIZE << order);

last = start;
for_each_object(p, s, start, page->objects) {
@@ -1402,6 +1405,8 @@ static void __free_slab(struct kmem_cache *s, struct page *page)

__ClearPageSlabPfmemalloc(page);
__ClearPageSlab(page);
+
+ memcg_release_pages(s, order);
reset_page_mapcount(page);
if (current->reclaim_state)
current->reclaim_state->reclaimed_slab += pages;
--
1.7.11.7

2012-11-01 12:09:15

by Glauber Costa

Subject: [PATCH v6 29/29] Add slab-specific documentation about the kmem controller

Signed-off-by: Glauber Costa <[email protected]>
CC: Christoph Lameter <[email protected]>
CC: Pekka Enberg <[email protected]>
CC: Michal Hocko <[email protected]>
CC: Kamezawa Hiroyuki <[email protected]>
CC: Johannes Weiner <[email protected]>
CC: Suleiman Souhlal <[email protected]>
CC: Tejun Heo <[email protected]>
---
Documentation/cgroups/memory.txt | 7 +++++++
1 file changed, 7 insertions(+)

diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index 206853b..9d9938d 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -301,6 +301,13 @@ to trigger slab reclaim when those limits are reached.
kernel memory, we prevent new processes from being created when the kernel
memory usage is too high.

+* slab pages: pages allocated by the SLAB or SLUB allocator are tracked. A copy
+of each kmem_cache is created everytime the cache is touched by the first time
+from inside the memcg. The creation is done lazily, so some objects can still be
+skipped while the cache is being created. All objects in a slab page should
+belong to the same memcg. This only fails to hold when a task is migrated to a
+different memcg during the page allocation by the cache.
+
* sockets memory pressure: some sockets protocols have memory pressure
thresholds. The Memory Controller allows them to be controlled individually
per cgroup, instead of globally.
--
1.7.11.7

2012-11-01 12:09:13

by Glauber Costa

Subject: [PATCH v6 26/29] Aggregate memcg cache values in slabinfo

When we create caches in memcgs, we need to display their usage
information somewhere. We'll adopt a scheme similar to /proc/meminfo,
with aggregate totals shown in the global file, and per-group
information stored in the group itself.

For the time being, only reads are allowed in the per-group cache.

Signed-off-by: Glauber Costa <[email protected]>
CC: Christoph Lameter <[email protected]>
CC: Pekka Enberg <[email protected]>
CC: Michal Hocko <[email protected]>
CC: Kamezawa Hiroyuki <[email protected]>
CC: Johannes Weiner <[email protected]>
CC: Suleiman Souhlal <[email protected]>
CC: Tejun Heo <[email protected]>
---
include/linux/memcontrol.h | 8 ++++++++
include/linux/slab.h | 4 ++++
mm/memcontrol.c | 30 +++++++++++++++++++++++++++++-
mm/slab.h | 27 +++++++++++++++++++++++++++
mm/slab_common.c | 44 ++++++++++++++++++++++++++++++++++++++++----
5 files changed, 108 insertions(+), 5 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index d5511cc..c780dd6 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -413,6 +413,11 @@ static inline void sock_release_memcg(struct sock *sk)

#ifdef CONFIG_MEMCG_KMEM
extern struct static_key memcg_kmem_enabled_key;
+
+extern int memcg_limited_groups_array_size;
+#define for_each_memcg_cache_index(_idx) \
+ for ((_idx) = 0; i < memcg_limited_groups_array_size; (_idx)++)
+
static inline bool memcg_kmem_enabled(void)
{
return static_key_false(&memcg_kmem_enabled_key);
@@ -550,6 +555,9 @@ memcg_kmem_get_cache(struct kmem_cache *cachep, gfp_t gfp)
return __memcg_kmem_get_cache(cachep, gfp);
}
#else
+#define for_each_memcg_cache_index(_idx) \
+ for (; NULL; )
+
static inline bool memcg_kmem_enabled(void)
{
return false;
diff --git a/include/linux/slab.h b/include/linux/slab.h
index 0df42db..1232c7f 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -221,6 +221,10 @@ struct memcg_cache_params {

int memcg_update_all_caches(int num_memcgs);

+struct seq_file;
+int cache_show(struct kmem_cache *s, struct seq_file *m);
+void print_slabinfo_header(struct seq_file *m);
+
/*
* Common kmalloc functions provided by all allocators
*/
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 6e2575a..35f5cb3 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -570,7 +570,8 @@ static void disarm_sock_keys(struct mem_cgroup *memcg)
* increase it.
*/
static struct ida kmem_limited_groups;
-static int memcg_limited_groups_array_size;
+int memcg_limited_groups_array_size;
+
/*
* MIN_SIZE is different than 1, because we would like to avoid going through
* the alloc/free process all the time. In a small machine, 4 kmem-limited
@@ -2763,6 +2764,27 @@ static struct kmem_cache *memcg_params_to_cache(struct memcg_cache_params *p)
return cachep->memcg_params->memcg_caches[memcg_cache_id(p->memcg)];
}

+#ifdef CONFIG_SLABINFO
+static int mem_cgroup_slabinfo_read(struct cgroup *cont, struct cftype *cft,
+ struct seq_file *m)
+{
+ struct mem_cgroup *memcg = mem_cgroup_from_cont(cont);
+ struct memcg_cache_params *params;
+
+ if (!memcg_can_account_kmem(memcg))
+ return -EIO;
+
+ print_slabinfo_header(m);
+
+ mutex_lock(&memcg->slab_caches_mutex);
+ list_for_each_entry(params, &memcg->memcg_slab_caches, list)
+ cache_show(memcg_params_to_cache(params), m);
+ mutex_unlock(&memcg->slab_caches_mutex);
+
+ return 0;
+}
+#endif
+
static int memcg_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp, u64 size)
{
struct res_counter *fail_res;
@@ -5801,6 +5823,12 @@ static struct cftype mem_cgroup_files[] = {
.trigger = mem_cgroup_reset,
.read = mem_cgroup_read,
},
+#ifdef CONFIG_SLABINFO
+ {
+ .name = "kmem.slabinfo",
+ .read_seq_string = mem_cgroup_slabinfo_read,
+ },
+#endif
#endif
{ }, /* terminate */
};
diff --git a/mm/slab.h b/mm/slab.h
index 3ef41e1..08ef468 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -130,6 +130,23 @@ static inline bool slab_equal_or_root(struct kmem_cache *s,
return (p == s) ||
(s->memcg_params && (p == s->memcg_params->root_cache));
}
+
+/*
+ * We use suffixes to the name in memcg because we can't have caches
+ * created in the system with the same name. But when we print them
+ * locally, better refer to them with the base name
+ */
+static inline const char *cache_name(struct kmem_cache *s)
+{
+ if (!is_root_cache(s))
+ return s->memcg_params->root_cache->name;
+ return s->name;
+}
+
+static inline struct kmem_cache *cache_from_memcg(struct kmem_cache *s, int idx)
+{
+ return s->memcg_params->memcg_caches[idx];
+}
#else
static inline bool is_root_cache(struct kmem_cache *s)
{
@@ -155,6 +172,16 @@ static inline bool slab_equal_or_root(struct kmem_cache *s,
{
return true;
}
+
+static inline const char *cache_name(struct kmem_cache *s)
+{
+ return s->name;
+}
+
+static inline struct kmem_cache *cache_from_memcg(struct kmem_cache *s, int idx)
+{
+ return NULL;
+}
#endif

static inline struct kmem_cache *cache_from_obj(struct kmem_cache *s, void *x)
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 04215a5..9a6f421 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -258,7 +258,7 @@ int slab_is_available(void)
}

#ifdef CONFIG_SLABINFO
-static void print_slabinfo_header(struct seq_file *m)
+void print_slabinfo_header(struct seq_file *m)
{
/*
* Output format version, so at least we can change it
@@ -302,16 +302,43 @@ static void s_stop(struct seq_file *m, void *p)
mutex_unlock(&slab_mutex);
}

-static int s_show(struct seq_file *m, void *p)
+static void
+memcg_accumulate_slabinfo(struct kmem_cache *s, struct slabinfo *info)
+{
+ struct kmem_cache *c;
+ struct slabinfo sinfo;
+ int i;
+
+ if (!is_root_cache(s))
+ return;
+
+ for_each_memcg_cache_index(i) {
+ c = cache_from_memcg(s, i);
+ if (!c)
+ continue;
+
+ memset(&sinfo, 0, sizeof(sinfo));
+ get_slabinfo(c, &sinfo);
+
+ info->active_slabs += sinfo.active_slabs;
+ info->num_slabs += sinfo.num_slabs;
+ info->shared_avail += sinfo.shared_avail;
+ info->active_objs += sinfo.active_objs;
+ info->num_objs += sinfo.num_objs;
+ }
+}
+
+int cache_show(struct kmem_cache *s, struct seq_file *m)
{
- struct kmem_cache *s = list_entry(p, struct kmem_cache, list);
struct slabinfo sinfo;

memset(&sinfo, 0, sizeof(sinfo));
get_slabinfo(s, &sinfo);

+ memcg_accumulate_slabinfo(s, &sinfo);
+
seq_printf(m, "%-17s %6lu %6lu %6u %4u %4d",
- s->name, sinfo.active_objs, sinfo.num_objs, s->size,
+ cache_name(s), sinfo.active_objs, sinfo.num_objs, s->size,
sinfo.objects_per_slab, (1 << sinfo.cache_order));

seq_printf(m, " : tunables %4u %4u %4u",
@@ -323,6 +350,15 @@ static int s_show(struct seq_file *m, void *p)
return 0;
}

+static int s_show(struct seq_file *m, void *p)
+{
+ struct kmem_cache *s = list_entry(p, struct kmem_cache, list);
+
+ if (!is_root_cache(s))
+ return 0;
+ return cache_show(s, m);
+}
+
/*
* slabinfo_op - iterator that generates /proc/slabinfo
*
--
1.7.11.7

2012-11-01 12:09:12

by Glauber Costa

Subject: [PATCH v6 27/29] slab: propagate tunables values

SLAB allows us to tune a particular cache behavior with tunables.
When creating a new memcg cache copy, we'd like to preserve any tunables
the parent cache already had.

This could be done by an explicit call to do_tune_cpucache() after the
cache is created. But this is not very convenient now that the caches are
created from common code, since this function is SLAB-specific.

Another method of doing that is taking advantage of the fact that
do_tune_cpucache() is always called from enable_cpucache(), which is
called at cache initialization. We can just preset the values, and
then things work as expected.

It can also happen that a root cache has its tunables updated during
normal system operation. In this case, we will propagate the change to
all caches that are already active.

This change will require us to move the assignment of root_cache in
memcg_params a bit earlier. We need this to be already set - which
memcg_register_cache() will do - when we reach __kmem_cache_create().

Signed-off-by: Glauber Costa <[email protected]>
CC: Christoph Lameter <[email protected]>
CC: Pekka Enberg <[email protected]>
CC: Michal Hocko <[email protected]>
CC: Kamezawa Hiroyuki <[email protected]>
CC: Johannes Weiner <[email protected]>
CC: Suleiman Souhlal <[email protected]>
CC: Tejun Heo <[email protected]>
---
include/linux/memcontrol.h | 8 +++++---
include/linux/slab.h | 2 +-
mm/memcontrol.c | 10 ++++++----
mm/slab.c | 44 +++++++++++++++++++++++++++++++++++++++++---
mm/slab.h | 12 ++++++++++++
mm/slab_common.c | 7 ++++---
6 files changed, 69 insertions(+), 14 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index c780dd6..c91e3c1 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -441,7 +441,8 @@ void __memcg_kmem_commit_charge(struct page *page,
void __memcg_kmem_uncharge_pages(struct page *page, int order);

int memcg_cache_id(struct mem_cgroup *memcg);
-int memcg_register_cache(struct mem_cgroup *memcg, struct kmem_cache *s);
+int memcg_register_cache(struct mem_cgroup *memcg, struct kmem_cache *s,
+ struct kmem_cache *root_cache);
void memcg_release_cache(struct kmem_cache *cachep);
void memcg_cache_list_add(struct mem_cgroup *memcg, struct kmem_cache *cachep);

@@ -583,8 +584,9 @@ static inline int memcg_cache_id(struct mem_cgroup *memcg)
return -1;
}

-static inline int memcg_register_cache(struct mem_cgroup *memcg,
- struct kmem_cache *s)
+static inline int
+memcg_register_cache(struct mem_cgroup *memcg, struct kmem_cache *s,
+ struct kmem_cache *root_cache)
{
return 0;
}
diff --git a/include/linux/slab.h b/include/linux/slab.h
index 1232c7f..81ee767 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -128,7 +128,7 @@ struct kmem_cache *kmem_cache_create(const char *, size_t, size_t,
void (*)(void *));
struct kmem_cache *
kmem_cache_create_memcg(struct mem_cgroup *, const char *, size_t, size_t,
- unsigned long, void (*)(void *));
+ unsigned long, void (*)(void *), struct kmem_cache *);
void kmem_cache_destroy(struct kmem_cache *);
int kmem_cache_shrink(struct kmem_cache *);
void kmem_cache_free(struct kmem_cache *, void *);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 35f5cb3..7d14fbd 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2981,7 +2981,8 @@ int memcg_update_cache_size(struct kmem_cache *s, int num_groups)
return 0;
}

-int memcg_register_cache(struct mem_cgroup *memcg, struct kmem_cache *s)
+int memcg_register_cache(struct mem_cgroup *memcg, struct kmem_cache *s,
+ struct kmem_cache *root_cache)
{
size_t size = sizeof(struct memcg_cache_params);

@@ -2995,8 +2996,10 @@ int memcg_register_cache(struct mem_cgroup *memcg, struct kmem_cache *s)
if (!s->memcg_params)
return -ENOMEM;

- if (memcg)
+ if (memcg) {
s->memcg_params->memcg = memcg;
+ s->memcg_params->root_cache = root_cache;
+ }
return 0;
}

@@ -3162,7 +3165,7 @@ static struct kmem_cache *kmem_cache_dup(struct mem_cgroup *memcg,
return NULL;

new = kmem_cache_create_memcg(memcg, name, s->object_size, s->align,
- (s->flags & ~SLAB_PANIC), s->ctor);
+ (s->flags & ~SLAB_PANIC), s->ctor, s);

if (new)
new->allocflags |= __GFP_KMEMCG;
@@ -3206,7 +3209,6 @@ static struct kmem_cache *memcg_create_kmem_cache(struct mem_cgroup *memcg,
}

mem_cgroup_get(memcg);
- new_cachep->memcg_params->root_cache = cachep;
atomic_set(&new_cachep->memcg_params->nr_pages , 0);

cachep->memcg_params->memcg_caches[idx] = new_cachep;
diff --git a/mm/slab.c b/mm/slab.c
index 15bb502..628a88e 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -4110,7 +4110,7 @@ static void do_ccupdate_local(void *info)
}

/* Always called with the slab_mutex held */
-static int do_tune_cpucache(struct kmem_cache *cachep, int limit,
+static int __do_tune_cpucache(struct kmem_cache *cachep, int limit,
int batchcount, int shared, gfp_t gfp)
{
struct ccupdate_struct *new;
@@ -4153,12 +4153,48 @@ static int do_tune_cpucache(struct kmem_cache *cachep, int limit,
return alloc_kmemlist(cachep, gfp);
}

+static int do_tune_cpucache(struct kmem_cache *cachep, int limit,
+ int batchcount, int shared, gfp_t gfp)
+{
+ int ret;
+ struct kmem_cache *c = NULL;
+ int i = 0;
+
+ ret = __do_tune_cpucache(cachep, limit, batchcount, shared, gfp);
+
+ if (slab_state < FULL)
+ return ret;
+
+ if ((ret < 0) || !is_root_cache(cachep))
+ return ret;
+
+ for_each_memcg_cache_index(i) {
+ c = cache_from_memcg(cachep, i);
+ if (c)
+ /* return value determined by the parent cache only */
+ __do_tune_cpucache(c, limit, batchcount, shared, gfp);
+ }
+
+ return ret;
+}
+
/* Called with slab_mutex held always */
static int enable_cpucache(struct kmem_cache *cachep, gfp_t gfp)
{
int err;
- int limit, shared;
+ int limit = 0;
+ int shared = 0;
+ int batchcount = 0;
+
+ if (!is_root_cache(cachep)) {
+ struct kmem_cache *root = memcg_root_cache(cachep);
+ limit = root->limit;
+ shared = root->shared;
+ batchcount = root->batchcount;
+ }

+ if (limit && shared && batchcount)
+ goto skip_setup;
/*
* The head array serves three purposes:
* - create a LIFO ordering, i.e. return objects that are cache-warm
@@ -4200,7 +4236,9 @@ static int enable_cpucache(struct kmem_cache *cachep, gfp_t gfp)
if (limit > 32)
limit = 32;
#endif
- err = do_tune_cpucache(cachep, limit, (limit + 1) / 2, shared, gfp);
+ batchcount = (limit + 1) / 2;
+skip_setup:
+ err = do_tune_cpucache(cachep, limit, batchcount, shared, gfp);
if (err)
printk(KERN_ERR "enable_cpucache failed for %s, error %d.\n",
cachep->name, -err);
diff --git a/mm/slab.h b/mm/slab.h
index 08ef468..9fc04b0 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -147,6 +147,13 @@ static inline struct kmem_cache *cache_from_memcg(struct kmem_cache *s, int idx)
{
return s->memcg_params->memcg_caches[idx];
}
+
+static inline struct kmem_cache *memcg_root_cache(struct kmem_cache *s)
+{
+ if (is_root_cache(s))
+ return s;
+ return s->memcg_params->root_cache;
+}
#else
static inline bool is_root_cache(struct kmem_cache *s)
{
@@ -182,6 +189,11 @@ static inline struct kmem_cache *cache_from_memcg(struct kmem_cache *s, int idx)
{
return NULL;
}
+
+static inline struct kmem_cache *memcg_root_cache(struct kmem_cache *s)
+{
+ return s;
+}
#endif

static inline struct kmem_cache *cache_from_obj(struct kmem_cache *s, void *x)
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 9a6f421..34743d8 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -136,7 +136,8 @@ out:

struct kmem_cache *
kmem_cache_create_memcg(struct mem_cgroup *memcg, const char *name, size_t size,
- size_t align, unsigned long flags, void (*ctor)(void *))
+ size_t align, unsigned long flags, void (*ctor)(void *),
+ struct kmem_cache *parent_cache)
{
struct kmem_cache *s = NULL;
int err = 0;
@@ -165,7 +166,7 @@ kmem_cache_create_memcg(struct mem_cgroup *memcg, const char *name, size_t size,
s->align = align;
s->ctor = ctor;

- if (memcg_register_cache(memcg, s)) {
+ if (memcg_register_cache(memcg, s, parent_cache)) {
kmem_cache_free(kmem_cache, s);
err = -ENOMEM;
goto out_locked;
@@ -215,7 +216,7 @@ struct kmem_cache *
kmem_cache_create(const char *name, size_t size, size_t align,
unsigned long flags, void (*ctor)(void *))
{
- return kmem_cache_create_memcg(NULL, name, size, align, flags, ctor);
+ return kmem_cache_create_memcg(NULL, name, size, align, flags, ctor, NULL);
}
EXPORT_SYMBOL(kmem_cache_create);

--
1.7.11.7

2012-11-01 12:10:34

by Glauber Costa

Subject: [PATCH v6 18/29] Allocate memory for memcg caches whenever a new memcg appears

Every cache that is considered a root cache (basically the "original" caches,
tied to the root memcg/no-memcg) will have an array that should be large enough
to store a cache pointer for each memcg in the system.

Theoretically, this can be as large as the css_id space allows, which is
currently in the 64k pointers range. Most of the time, we won't be using that
much.

What goes in this patch is a simple scheme to dynamically allocate such an
array, in order to minimize memory usage for memcg caches. Because we would
also like to avoid allocations all the time, at least for now, the array will
only grow. It will tend to be big enough to hold the maximum number of
kmem-limited memcgs ever achieved.

We'll allocate it to be a minimum of 64 kmem-limited memcgs. When we have more
than that, we'll start doubling the size of this array every time the limit is
reached.

Because we are only considering kmem limited memcgs, a natural point for this
to happen is when we write to the limit. At that point, we already have
set_limit_mutex held, so that will become our natural synchronization
mechanism.

Signed-off-by: Glauber Costa <[email protected]>
CC: Christoph Lameter <[email protected]>
CC: Pekka Enberg <[email protected]>
CC: Michal Hocko <[email protected]>
CC: Kamezawa Hiroyuki <[email protected]>
CC: Johannes Weiner <[email protected]>
CC: Suleiman Souhlal <[email protected]>
CC: Tejun Heo <[email protected]>
---
include/linux/memcontrol.h | 2 +
mm/memcontrol.c | 210 +++++++++++++++++++++++++++++++++++++++++----
mm/slab_common.c | 28 ++++++
3 files changed, 224 insertions(+), 16 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index ea1e66f..49f5e4f 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -440,6 +440,8 @@ int memcg_register_cache(struct mem_cgroup *memcg, struct kmem_cache *s);
void memcg_release_cache(struct kmem_cache *cachep);
void memcg_cache_list_add(struct mem_cgroup *memcg, struct kmem_cache *cachep);

+int memcg_update_cache_size(struct kmem_cache *s, int num_groups);
+void memcg_update_array_size(int num_groups);
/**
* memcg_kmem_newpage_charge: verify if a new kmem allocation is allowed.
* @gfp: the gfp allocation flags.
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index fb5b1e6..eb873af 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -376,6 +376,11 @@ static void memcg_kmem_set_activated(struct mem_cgroup *memcg)
set_bit(KMEM_ACCOUNTED_ACTIVATED, &memcg->kmem_account_flags);
}

+static void memcg_kmem_clear_activated(struct mem_cgroup *memcg)
+{
+ clear_bit(KMEM_ACCOUNTED_ACTIVATED, &memcg->kmem_account_flags);
+}
+
static void memcg_kmem_mark_dead(struct mem_cgroup *memcg)
{
if (test_bit(KMEM_ACCOUNTED_ACTIVE, &memcg->kmem_account_flags))
@@ -547,12 +552,48 @@ static void disarm_sock_keys(struct mem_cgroup *memcg)
#endif

#ifdef CONFIG_MEMCG_KMEM
+/*
+ * This will be the memcg's index in each cache's ->memcg_params->memcg_caches.
+ * There are two main reasons for not using the css_id for this:
+ * 1) this works better in sparse environments, where we have a lot of memcgs,
+ * but only a few kmem-limited. Or also, if we have, for instance, 200
+ * memcgs, and none but the 200th is kmem-limited, we'd have to have a
+ * 200 entry array for that.
+ *
+ * 2) In order not to violate the cgroup API, we would like to do all memory
+ * allocation in ->create(). At that point, we haven't yet allocated the
+ * css_id. Having a separate index prevents us from messing with the cgroup
+ * core for this
+ *
+ * The current size of the caches array is stored in
+ * memcg_limited_groups_array_size. It will double each time we have to
+ * increase it.
+ */
+static struct ida kmem_limited_groups;
+static int memcg_limited_groups_array_size;
+/*
+ * MIN_SIZE is different than 1, because we would like to avoid going through
+ * the alloc/free process all the time. In a small machine, 4 kmem-limited
+ * cgroups is a reasonable guess. In the future, it could be a parameter or
+ * tunable, but that is strictly not necessary.
+ *
+ * MAX_SIZE should be as large as the number of css_ids. Ideally, we could get
+ * this constant directly from cgroup, but it is understandable that this is
+ * better kept as an internal representation in cgroup.c. In any case, the
+ * css_id space is not getting any smaller, and we don't have to necessarily
+ * increase ours as well if it increases.
+ */
+#define MEMCG_CACHES_MIN_SIZE 4
+#define MEMCG_CACHES_MAX_SIZE 65535
+
struct static_key memcg_kmem_enabled_key;

static void disarm_kmem_keys(struct mem_cgroup *memcg)
{
- if (memcg_kmem_is_active(memcg))
+ if (memcg_kmem_is_active(memcg)) {
static_key_slow_dec(&memcg_kmem_enabled_key);
+ ida_simple_remove(&kmem_limited_groups, memcg->kmemcg_id);
+ }
/*
* This check can't live in kmem destruction function,
* since the charges will outlive the cgroup
@@ -2782,6 +2823,120 @@ int memcg_cache_id(struct mem_cgroup *memcg)
return memcg ? memcg->kmemcg_id : -1;
}

+/*
+ * This ends up being protected by the set_limit mutex, during normal
+ * operation, because that is its main call site.
+ *
+ * But when we create a new cache, we can call this as well if its parent
+ * is kmem-limited. That will have to hold set_limit_mutex as well.
+ */
+int memcg_update_cache_sizes(struct mem_cgroup *memcg)
+{
+ int num, ret;
+
+ num = ida_simple_get(&kmem_limited_groups,
+ 0, MEMCG_CACHES_MAX_SIZE, GFP_KERNEL);
+ if (num < 0)
+ return num;
+ /*
+ * After this point, kmem_accounted (that we test atomically in
+ * the beginning of this conditional), is no longer 0. This
+ * guarantees only one process will set the following boolean
+ * to true. We don't need test_and_set because we're protected
+ * by the set_limit_mutex anyway.
+ */
+ memcg_kmem_set_activated(memcg);
+
+ ret = memcg_update_all_caches(num+1);
+ if (ret) {
+ ida_simple_remove(&kmem_limited_groups, num);
+ memcg_kmem_clear_activated(memcg);
+ return ret;
+ }
+
+ memcg->kmemcg_id = num;
+ INIT_LIST_HEAD(&memcg->memcg_slab_caches);
+ mutex_init(&memcg->slab_caches_mutex);
+ return 0;
+}
+
+static size_t memcg_caches_array_size(int num_groups)
+{
+ ssize_t size;
+ if (num_groups <= 0)
+ return 0;
+
+ size = 2 * num_groups;
+ if (size < MEMCG_CACHES_MIN_SIZE)
+ size = MEMCG_CACHES_MIN_SIZE;
+ else if (size > MEMCG_CACHES_MAX_SIZE)
+ size = MEMCG_CACHES_MAX_SIZE;
+
+ return size;
+}
+
+/*
+ * We should update the current array size iff all caches updates succeed. This
+ * can only be done from the slab side. The slab mutex needs to be held when
+ * calling this.
+ */
+void memcg_update_array_size(int num)
+{
+ if (num > memcg_limited_groups_array_size)
+ memcg_limited_groups_array_size = memcg_caches_array_size(num);
+}
+
+int memcg_update_cache_size(struct kmem_cache *s, int num_groups)
+{
+ struct memcg_cache_params *cur_params = s->memcg_params;
+
+ VM_BUG_ON(s->memcg_params && !s->memcg_params->is_root_cache);
+
+ if (num_groups > memcg_limited_groups_array_size) {
+ int i;
+ ssize_t size = memcg_caches_array_size(num_groups);
+
+ size *= sizeof(void *);
+ size += sizeof(struct memcg_cache_params);
+
+ s->memcg_params = kzalloc(size, GFP_KERNEL);
+ if (!s->memcg_params) {
+ s->memcg_params = cur_params;
+ return -ENOMEM;
+ }
+
+ s->memcg_params->is_root_cache = true;
+
+ /*
+ * There is the chance it will be bigger than
+ * memcg_limited_groups_array_size, if we failed an allocation
+ * in a cache, in which case all caches updated before it, will
+ * have a bigger array.
+ *
+ * But if that is the case, the data after
+ * memcg_limited_groups_array_size is certainly unused
+ */
+ for (i = 0; i < memcg_limited_groups_array_size; i++) {
+ if (!cur_params->memcg_caches[i])
+ continue;
+ s->memcg_params->memcg_caches[i] =
+ cur_params->memcg_caches[i];
+ }
+
+ /*
+ * Ideally, we would wait until all caches succeed, and only
+ * then free the old one. But this is not worth the extra
+ * pointer per-cache we'd have to have for this.
+ *
+ * It is not a big deal if some caches are left with a size
+ * bigger than the others. And all updates will reset this
+ * anyway.
+ */
+ kfree(cur_params);
+ }
+ return 0;
+}
+
int memcg_register_cache(struct mem_cgroup *memcg, struct kmem_cache *s)
{
size_t size = sizeof(struct memcg_cache_params);
@@ -2789,6 +2944,9 @@ int memcg_register_cache(struct mem_cgroup *memcg, struct kmem_cache *s)
if (!memcg_kmem_enabled())
return 0;

+ if (!memcg)
+ size += memcg_limited_groups_array_size * sizeof(void *);
+
s->memcg_params = kzalloc(size, GFP_KERNEL);
if (!s->memcg_params)
return -ENOMEM;
@@ -4291,14 +4449,11 @@ static int memcg_update_kmem_limit(struct cgroup *cont, u64 val)
ret = res_counter_set_limit(&memcg->kmem, val);
VM_BUG_ON(ret);

- /*
- * After this point, kmem_accounted (that we test atomically in
- * the beginning of this conditional), is no longer 0. This
- * guarantees only one process will set the following boolean
- * to true. We don't need test_and_set because we're protected
- * by the set_limit_mutex anyway.
- */
- memcg_kmem_set_activated(memcg);
+ ret = memcg_update_cache_sizes(memcg);
+ if (ret) {
+ res_counter_set_limit(&memcg->kmem, RESOURCE_MAX);
+ goto out;
+ }
must_inc_static_branch = true;
/*
* kmem charges can outlive the cgroup. In the case of slab
@@ -4337,11 +4492,13 @@ out:
return ret;
}

-static void memcg_propagate_kmem(struct mem_cgroup *memcg)
+static int memcg_propagate_kmem(struct mem_cgroup *memcg)
{
+ int ret = 0;
struct mem_cgroup *parent = parent_mem_cgroup(memcg);
if (!parent)
- return;
+ goto out;
+
memcg->kmem_account_flags = parent->kmem_account_flags;
#ifdef CONFIG_MEMCG_KMEM
/*
@@ -4354,11 +4511,24 @@ static void memcg_propagate_kmem(struct mem_cgroup *memcg)
* It is a lot simpler just to do static_key_slow_inc() on every child
* that is accounted.
*/
- if (memcg_kmem_is_active(memcg)) {
- mem_cgroup_get(memcg);
- static_key_slow_inc(&memcg_kmem_enabled_key);
- }
+ if (!memcg_kmem_is_active(memcg))
+ goto out;
+
+ /*
+ * destroy(), called if we fail, will issue static_key_slow_inc() and
+ * mem_cgroup_put() if kmem is enabled. We have to either call them
+ * unconditionally, or clear the KMEM_ACTIVE flag. I personally find
+ * this more consistent, since it always leads to the same destroy path
+ */
+ mem_cgroup_get(memcg);
+ static_key_slow_inc(&memcg_kmem_enabled_key);
+
+ mutex_lock(&set_limit_mutex);
+ ret = memcg_update_cache_sizes(memcg);
+ mutex_unlock(&set_limit_mutex);
#endif
+out:
+ return ret;
}

/*
@@ -5040,8 +5210,15 @@ static int mem_cgroup_oom_control_write(struct cgroup *cgrp,
#ifdef CONFIG_MEMCG_KMEM
static int memcg_init_kmem(struct mem_cgroup *memcg, struct cgroup_subsys *ss)
{
+ int ret;
+
memcg->kmemcg_id = -1;
- memcg_propagate_kmem(memcg);
+ ret = memcg_propagate_kmem(memcg);
+ if (ret)
+ return ret;
+
+ if (mem_cgroup_is_root(memcg))
+ ida_init(&kmem_limited_groups);

return mem_cgroup_sockets_init(memcg, ss);
};
@@ -5444,6 +5621,7 @@ mem_cgroup_create(struct cgroup *cont)
res_counter_init(&memcg->res, &parent->res);
res_counter_init(&memcg->memsw, &parent->memsw);
res_counter_init(&memcg->kmem, &parent->kmem);
+
/*
* We increment refcnt of the parent to ensure that we can
* safely access it on res_counter_charge/uncharge.
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 0578731..b76a74c 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -81,6 +81,34 @@ static inline int kmem_cache_sanity_check(struct mem_cgroup *memcg,
}
#endif

+#ifdef CONFIG_MEMCG_KMEM
+int memcg_update_all_caches(int num_memcgs)
+{
+ struct kmem_cache *s;
+ int ret = 0;
+ mutex_lock(&slab_mutex);
+
+ list_for_each_entry(s, &slab_caches, list) {
+ if (!is_root_cache(s))
+ continue;
+
+ ret = memcg_update_cache_size(s, num_memcgs);
+ /*
+ * See comment in memcontrol.c, memcg_update_cache_size:
+ * Instead of freeing the memory, we'll just leave the caches
+ * up to this point in an updated state.
+ */
+ if (ret)
+ goto out;
+ }
+
+ memcg_update_array_size(num_memcgs);
+out:
+ mutex_unlock(&slab_mutex);
+ return ret;
+}
+#endif
+
/*
* kmem_cache_create - Create a cache.
* @name: A string which is used in /proc/slabinfo to identify this cache.
--
1.7.11.7

2012-11-01 12:10:33

by Glauber Costa

[permalink] [raw]
Subject: [PATCH v6 21/29] sl[au]b: always get the cache from its page in kmem_cache_free

struct page already has this information. If we start chaining caches,
this information will always be more trustworthy than whatever is passed
into the function.

[ v3: added parent testing with VM_BUG_ON ]
[ v4: make it faster when kmemcg not in use ]
[ v6: move it to slab.h ]
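
As a rough illustration of the idea, here is a minimal userspace sketch (not
kernel code: the struct layouts are stand-ins, and virt_to_head_page() is
modeled by storing the page pointer directly in the object):

#include <stdio.h>

struct kmem_cache {
	const char *name;
	struct kmem_cache *root_cache;	/* NULL for a root cache */
};

struct page { struct kmem_cache *slab_cache; };

/* stand-in object: in the kernel, virt_to_head_page(x) finds the page */
struct object { struct page *page; };

static struct kmem_cache *cache_from_obj(struct kmem_cache *s,
					 struct object *x)
{
	struct kmem_cache *cachep = x->page->slab_cache;

	/* the page's cache is either s itself or a per-memcg child of s */
	if (cachep == s || cachep->root_cache == s)
		return cachep;

	fprintf(stderr, "Wrong slab cache. %s but object is from %s\n",
		s->name, cachep->name);
	return s;
}

int main(void)
{
	struct kmem_cache root = { "dentry", NULL };
	struct kmem_cache child = { "dentry(2:foo)", &root };
	struct page pg = { &child };
	struct object obj = { &pg };

	/* caller frees against the root cache; we rederive the child */
	printf("freed from: %s\n", cache_from_obj(&root, &obj)->name);
	return 0;
}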

Signed-off-by: Glauber Costa <[email protected]>
CC: Christoph Lameter <[email protected]>
CC: Pekka Enberg <[email protected]>
CC: Michal Hocko <[email protected]>
CC: Kamezawa Hiroyuki <[email protected]>
CC: Johannes Weiner <[email protected]>
CC: Suleiman Souhlal <[email protected]>
CC: Tejun Heo <[email protected]>
---
include/linux/memcontrol.h | 5 +++++
mm/slab.c | 6 +++++-
mm/slab.h | 39 +++++++++++++++++++++++++++++++++++++++
mm/slob.c | 2 +-
mm/slub.c | 15 +++------------
5 files changed, 53 insertions(+), 14 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 16bff74..d77d88d 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -547,6 +547,11 @@ memcg_kmem_get_cache(struct kmem_cache *cachep, gfp_t gfp)
return __memcg_kmem_get_cache(cachep, gfp);
}
#else
+static inline bool memcg_kmem_enabled(void)
+{
+ return false;
+}
+
static inline bool
memcg_kmem_newpage_charge(gfp_t gfp, struct mem_cgroup **memcg, int order)
{
diff --git a/mm/slab.c b/mm/slab.c
index dcc05f5..de9cc0d 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -87,7 +87,6 @@
*/

#include <linux/slab.h>
-#include "slab.h"
#include <linux/mm.h>
#include <linux/poison.h>
#include <linux/swap.h>
@@ -128,6 +127,8 @@

#include "internal.h"

+#include "slab.h"
+
/*
* DEBUG - 1 for kmem_cache_create() to honour; SLAB_RED_ZONE & SLAB_POISON.
* 0 for faster, smaller code (especially in the critical paths).
@@ -3946,6 +3947,9 @@ EXPORT_SYMBOL(__kmalloc);
void kmem_cache_free(struct kmem_cache *cachep, void *objp)
{
unsigned long flags;
+ cachep = cache_from_obj(cachep, objp);
+ if (!cachep)
+ return;

local_irq_save(flags);
debug_check_no_locks_freed(objp, cachep->object_size);
diff --git a/mm/slab.h b/mm/slab.h
index 22eb5aa2..fb1c4c4 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -108,6 +108,13 @@ static inline bool cache_match_memcg(struct kmem_cache *cachep,
return (is_root_cache(cachep) && !memcg) ||
(cachep->memcg_params->memcg == memcg);
}
+
+static inline bool slab_equal_or_root(struct kmem_cache *s,
+ struct kmem_cache *p)
+{
+ return (p == s) ||
+ (s->memcg_params && (p == s->memcg_params->root_cache));
+}
#else
static inline bool is_root_cache(struct kmem_cache *s)
{
@@ -119,5 +126,37 @@ static inline bool cache_match_memcg(struct kmem_cache *cachep,
{
return true;
}
+
+static inline bool slab_equal_or_root(struct kmem_cache *s,
+ struct kmem_cache *p)
+{
+ return true;
+}
#endif
+
+static inline struct kmem_cache *cache_from_obj(struct kmem_cache *s, void *x)
+{
+ struct kmem_cache *cachep;
+ struct page *page;
+
+ /*
+ * When kmemcg is not being used, both assignments should return the
+ * same value. but we don't want to pay the assignment price in that
+ * case. If it is not compiled in, the compiler should be smart enough
+ * to not do even the assignment. In that case, slab_equal_or_root
+ * will also be a constant.
+ */
+ if (!memcg_kmem_enabled() && !unlikely(s->flags & SLAB_DEBUG_FREE))
+ return s;
+
+ page = virt_to_head_page(x);
+ cachep = page->slab_cache;
+ if (slab_equal_or_root(cachep, s))
+ return cachep;
+
+ pr_err("%s: Wrong slab cache. %s but object is from %s\n",
+ __FUNCTION__, cachep->name, s->name);
+ WARN_ON_ONCE(1);
+ return s;
+}
#endif
diff --git a/mm/slob.c b/mm/slob.c
index 3edfeaa..c86ee32 100644
--- a/mm/slob.c
+++ b/mm/slob.c
@@ -59,7 +59,6 @@

#include <linux/kernel.h>
#include <linux/slab.h>
-#include "slab.h"

#include <linux/mm.h>
#include <linux/swap.h> /* struct reclaim_state */
@@ -74,6 +73,7 @@

#include <linux/atomic.h>

+#include "slab.h"
/*
* slob_block has a field 'units', which indicates size of block if +ve,
* or offset of next block if -ve (in SLOB_UNITs).
diff --git a/mm/slub.c b/mm/slub.c
index a105bdc..6ff2bdb 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2609,19 +2609,10 @@ redo:

void kmem_cache_free(struct kmem_cache *s, void *x)
{
- struct page *page;
-
- page = virt_to_head_page(x);
-
- if (kmem_cache_debug(s) && page->slab_cache != s) {
- pr_err("kmem_cache_free: Wrong slab cache. %s but object"
- " is from %s\n", page->slab_cache->name, s->name);
- WARN_ON_ONCE(1);
+ s = cache_from_obj(s, x);
+ if (!s)
return;
- }
-
- slab_free(s, page, x, _RET_IP_);
-
+ slab_free(s, virt_to_head_page(x), x, _RET_IP_);
trace_kmem_cache_free(_RET_IP_, x);
}
EXPORT_SYMBOL(kmem_cache_free);
--
1.7.11.7

2012-11-01 12:10:31

by Glauber Costa

[permalink] [raw]
Subject: [PATCH v6 22/29] sl[au]b: Allocate objects from memcg cache

We are able to match a cache allocation to a particular memcg. If the
task doesn't change groups during the allocation itself - a rare event -
this will give us a good picture of which group is the first to touch a
cache page.

This patch uses the now available infrastructure by calling
memcg_kmem_get_cache() before all the cache allocations.

[ v6: simplified kmalloc relay code ]
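
A minimal userspace sketch of the relay pattern follows (memcg_kmem_get_cache()
here is a stand-in stub; the real hook may also defer cache creation and return
the cache it was passed):

#include <stdio.h>

struct kmem_cache {
	const char *name;
	struct kmem_cache *memcg_copy;	/* per-memcg copy, if any */
};

/* stub for the real hook: pick the per-memcg copy when one exists */
static struct kmem_cache *memcg_kmem_get_cache(struct kmem_cache *s)
{
	return s->memcg_copy ? s->memcg_copy : s;
}

/* every allocation entry point relays through the hook first */
static void *slab_alloc(struct kmem_cache *s)
{
	s = memcg_kmem_get_cache(s);
	printf("allocating from %s\n", s->name);
	return NULL;	/* actual allocation elided */
}

int main(void)
{
	struct kmem_cache child = { "dentry(2:foo)", NULL };
	struct kmem_cache root = { "dentry", &child };

	slab_alloc(&root);	/* prints the per-memcg cache name */
	return 0;
}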

Signed-off-by: Glauber Costa <[email protected]>
CC: Christoph Lameter <[email protected]>
CC: Pekka Enberg <[email protected]>
CC: Michal Hocko <[email protected]>
CC: Kamezawa Hiroyuki <[email protected]>
CC: Johannes Weiner <[email protected]>
CC: Suleiman Souhlal <[email protected]>
CC: Tejun Heo <[email protected]>
---
include/linux/slub_def.h | 5 ++++-
mm/memcontrol.c | 3 +++
mm/slab.c | 6 +++++-
mm/slub.c | 7 ++++---
4 files changed, 16 insertions(+), 5 deletions(-)

diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
index 961e72e..364ba6c 100644
--- a/include/linux/slub_def.h
+++ b/include/linux/slub_def.h
@@ -225,7 +225,10 @@ void *__kmalloc(size_t size, gfp_t flags);
static __always_inline void *
kmalloc_order(size_t size, gfp_t flags, unsigned int order)
{
- void *ret = (void *) __get_free_pages(flags | __GFP_COMP, order);
+ void *ret;
+
+ flags |= (__GFP_COMP | __GFP_KMEMCG);
+ ret = (void *) __get_free_pages(flags, order);
kmemleak_alloc(ret, size, 1, flags);
return ret;
}
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ff42586..f7773ed 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3059,6 +3059,9 @@ static struct kmem_cache *kmem_cache_dup(struct mem_cgroup *memcg,
new = kmem_cache_create_memcg(memcg, name, s->object_size, s->align,
(s->flags & ~SLAB_PANIC), s->ctor);

+ if (new)
+ new->allocflags |= __GFP_KMEMCG;
+
kfree(name);
return new;
}
diff --git a/mm/slab.c b/mm/slab.c
index de9cc0d..c5d6937 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -1981,7 +1981,7 @@ static void kmem_freepages(struct kmem_cache *cachep, void *addr)
}
if (current->reclaim_state)
current->reclaim_state->reclaimed_slab += nr_freed;
- free_pages((unsigned long)addr, cachep->gfporder);
+ free_memcg_kmem_pages((unsigned long)addr, cachep->gfporder);
}

static void kmem_rcu_free(struct rcu_head *head)
@@ -3547,6 +3547,8 @@ __cache_alloc_node(struct kmem_cache *cachep, gfp_t flags, int nodeid,
if (slab_should_failslab(cachep, flags))
return NULL;

+ cachep = memcg_kmem_get_cache(cachep, flags);
+
cache_alloc_debugcheck_before(cachep, flags);
local_irq_save(save_flags);

@@ -3632,6 +3634,8 @@ __cache_alloc(struct kmem_cache *cachep, gfp_t flags, void *caller)
if (slab_should_failslab(cachep, flags))
return NULL;

+ cachep = memcg_kmem_get_cache(cachep, flags);
+
cache_alloc_debugcheck_before(cachep, flags);
local_irq_save(save_flags);
objp = __do_cache_alloc(cachep, flags);
diff --git a/mm/slub.c b/mm/slub.c
index 6ff2bdb..8778370 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1405,7 +1405,7 @@ static void __free_slab(struct kmem_cache *s, struct page *page)
reset_page_mapcount(page);
if (current->reclaim_state)
current->reclaim_state->reclaimed_slab += pages;
- __free_pages(page, order);
+ __free_memcg_kmem_pages(page, order);
}

#define need_reserve_slab_rcu \
@@ -2321,6 +2321,7 @@ static __always_inline void *slab_alloc(struct kmem_cache *s,
if (slab_pre_alloc_hook(s, gfpflags))
return NULL;

+ s = memcg_kmem_get_cache(s, gfpflags);
redo:

/*
@@ -3353,7 +3354,7 @@ static void *kmalloc_large_node(size_t size, gfp_t flags, int node)
struct page *page;
void *ptr = NULL;

- flags |= __GFP_COMP | __GFP_NOTRACK;
+ flags |= __GFP_COMP | __GFP_NOTRACK | __GFP_KMEMCG;
page = alloc_pages_node(node, flags, get_order(size));
if (page)
ptr = page_address(page);
@@ -3459,7 +3460,7 @@ void kfree(const void *x)
if (unlikely(!PageSlab(page))) {
BUG_ON(!PageCompound(page));
kmemleak_free(x);
- __free_pages(page, compound_order(page));
+ __free_memcg_kmem_pages(page, compound_order(page));
return;
}
slab_free(page->slab_cache, page, object, _RET_IP_);
--
1.7.11.7

2012-11-01 12:10:27

by Glauber Costa

[permalink] [raw]
Subject: [PATCH v6 19/29] memcg: infrastructure to match an allocation to the right cache

The page allocator is able to bind a page to a memcg when it is
allocated. But for the caches, we'd like to have as many objects as
possible in a page belonging to the same cache.

This is done in this patch by calling memcg_kmem_get_cache at the
beginning of every allocation function. This routine is patched out by
static branches when the kernel memory controller is not being used.

It assumes that the task allocating, which determines the memcg in the
page allocator, belongs to the same cgroup throughout the whole process.
Misaccounting can happen if the task calls memcg_kmem_get_cache() while
belonging to a cgroup, and later changes to another. This is considered
acceptable, and should only happen upon task migration.

Before the cache is created by the memcg core, there is also a possible
imbalance: the task belongs to a memcg, but the cache being allocated
from is the global cache, since the child cache is not yet guaranteed to
be ready. This case is also fine, since the GFP_KMEMCG flag will not be
passed and the page allocator will not attempt any cgroup accounting.

[ v4: use a standard workqueue mechanism, create right away if
possible, index from cache side ]
[ v6: fixed issues pointed out by JoonSoo Kim, revert the
cache synchronous allocation ]
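
A simplified userspace model of the selection logic, assuming a small
fixed-size per-root-cache array indexed by the memcg cache id (the deferred
creation is reduced to a print here; the kernel defers it to a workqueue):

#include <stdio.h>

#define MEMCG_CACHES 4

struct kmem_cache {
	const char *name;
	struct kmem_cache *memcg_caches[MEMCG_CACHES]; /* root cache only */
};

static struct kmem_cache *memcg_kmem_get_cache(struct kmem_cache *root,
					       int memcg_id)
{
	if (memcg_id < 0)			/* task not kmem-limited */
		return root;

	if (!root->memcg_caches[memcg_id]) {
		/* kernel: enqueue asynchronous cache creation */
		printf("enqueue creation for memcg %d\n", memcg_id);
		return root;			/* use global cache for now */
	}
	return root->memcg_caches[memcg_id];
}

int main(void)
{
	struct kmem_cache child = { "dentry(1:foo)", { 0 } };
	struct kmem_cache root  = { "dentry", { 0 } };

	printf("%s\n", memcg_kmem_get_cache(&root, 1)->name); /* dentry */
	root.memcg_caches[1] = &child;
	printf("%s\n", memcg_kmem_get_cache(&root, 1)->name); /* child */
	return 0;
}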

Signed-off-by: Glauber Costa <[email protected]>
CC: Christoph Lameter <[email protected]>
CC: Pekka Enberg <[email protected]>
CC: Michal Hocko <[email protected]>
CC: Kamezawa Hiroyuki <[email protected]>
CC: Johannes Weiner <[email protected]>
CC: Suleiman Souhlal <[email protected]>
CC: Tejun Heo <[email protected]>
CC: JoonSoo Kim <[email protected]>
---
include/linux/memcontrol.h | 41 +++++++++
init/Kconfig | 2 +-
mm/memcontrol.c | 217 +++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 259 insertions(+), 1 deletion(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 49f5e4f..16bff74 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -442,6 +442,10 @@ void memcg_cache_list_add(struct mem_cgroup *memcg, struct kmem_cache *cachep);

int memcg_update_cache_size(struct kmem_cache *s, int num_groups);
void memcg_update_array_size(int num_groups);
+
+struct kmem_cache *
+__memcg_kmem_get_cache(struct kmem_cache *cachep, gfp_t gfp);
+
/**
* memcg_kmem_newpage_charge: verify if a new kmem allocation is allowed.
* @gfp: the gfp allocation flags.
@@ -511,6 +515,37 @@ memcg_kmem_commit_charge(struct page *page, struct mem_cgroup *memcg, int order)
__memcg_kmem_commit_charge(page, memcg, order);
}

+/**
+ * memcg_kmem_get_cache: selects the correct per-memcg cache for allocation
+ * @cachep: the original global kmem cache
+ * @gfp: allocation flags.
+ *
+ * This function assumes that the task allocating, which determines the memcg
+ * in the page allocator, belongs to the same cgroup throughout the whole
+ * process. Misacounting can happen if the task calls memcg_kmem_get_cache()
+ * while belonging to a cgroup, and later on changes. This is considered
+ * acceptable, and should only happen upon task migration.
+ *
+ * Before the cache is created by the memcg core, there is also a possible
+ * imbalance: the task belongs to a memcg, but the cache being allocated from
+ * is the global cache, since the child cache is not yet guaranteed to be
+ * ready. This case is also fine, since in this case the GFP_KMEMCG will not be
+ * passed and the page allocator will not attempt any cgroup accounting.
+ */
+static __always_inline struct kmem_cache *
+memcg_kmem_get_cache(struct kmem_cache *cachep, gfp_t gfp)
+{
+ if (!memcg_kmem_enabled())
+ return cachep;
+ if (gfp & __GFP_NOFAIL)
+ return cachep;
+ if (in_interrupt() || (!current->mm) || (current->flags & PF_KTHREAD))
+ return cachep;
+ if (unlikely(fatal_signal_pending(current)))
+ return cachep;
+
+ return __memcg_kmem_get_cache(cachep, gfp);
+}
#else
static inline bool
memcg_kmem_newpage_charge(gfp_t gfp, struct mem_cgroup **memcg, int order)
@@ -546,6 +581,12 @@ static inline void memcg_cache_list_add(struct mem_cgroup *memcg,
struct kmem_cache *s)
{
}
+
+static inline struct kmem_cache *
+memcg_kmem_get_cache(struct kmem_cache *cachep, gfp_t gfp)
+{
+ return cachep;
+}
#endif /* CONFIG_MEMCG_KMEM */
#endif /* _LINUX_MEMCONTROL_H */

diff --git a/init/Kconfig b/init/Kconfig
index 5eae85b..5c86bb4 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -741,7 +741,7 @@ config MEMCG_SWAP_ENABLED
then swapaccount=0 does the trick).
config MEMCG_KMEM
bool "Memory Resource Controller Kernel Memory accounting (EXPERIMENTAL)"
- depends on MEMCG && EXPERIMENTAL
+ depends on MEMCG && EXPERIMENTAL && !SLOB
default n
help
The Kernel Memory extension for Memory Resource Controller can limit
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index eb873af..318dc67 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -586,7 +586,14 @@ static int memcg_limited_groups_array_size;
#define MEMCG_CACHES_MIN_SIZE 4
#define MEMCG_CACHES_MAX_SIZE 65535

+/*
+ * A lot of the calls to the cache allocation functions are expected to be
+ * inlined by the compiler. Since the calls to memcg_kmem_get_cache are
+ * conditional to this static branch, we'll have to allow modules that do
+ * kmem_cache_alloc and the like to see this symbol as well
+ */
struct static_key memcg_kmem_enabled_key;
+EXPORT_SYMBOL(memcg_kmem_enabled_key);

static void disarm_kmem_keys(struct mem_cgroup *memcg)
{
@@ -2958,9 +2965,219 @@ int memcg_register_cache(struct mem_cgroup *memcg, struct kmem_cache *s)

void memcg_release_cache(struct kmem_cache *s)
{
+ struct kmem_cache *root;
+ struct mem_cgroup *memcg;
+ int id;
+
+ /*
+ * This happens, for instance, when a root cache goes away before we
+ * add any memcg.
+ */
+ if (!s->memcg_params)
+ return;
+
+ if (s->memcg_params->is_root_cache)
+ goto out;
+
+ memcg = s->memcg_params->memcg;
+ id = memcg_cache_id(memcg);
+
+ root = s->memcg_params->root_cache;
+ root->memcg_params->memcg_caches[id] = NULL;
+ mem_cgroup_put(memcg);
+
+ mutex_lock(&memcg->slab_caches_mutex);
+ list_del(&s->memcg_params->list);
+ mutex_unlock(&memcg->slab_caches_mutex);
+
+out:
kfree(s->memcg_params);
}

+static char *memcg_cache_name(struct mem_cgroup *memcg, struct kmem_cache *s)
+{
+ char *name;
+ struct dentry *dentry;
+
+ rcu_read_lock();
+ dentry = rcu_dereference(memcg->css.cgroup->dentry);
+ rcu_read_unlock();
+
+ BUG_ON(dentry == NULL);
+
+ name = kasprintf(GFP_KERNEL, "%s(%d:%s)", s->name,
+ memcg_cache_id(memcg), dentry->d_name.name);
+
+ return name;
+}
+
+static struct kmem_cache *kmem_cache_dup(struct mem_cgroup *memcg,
+ struct kmem_cache *s)
+{
+ char *name;
+ struct kmem_cache *new;
+
+ name = memcg_cache_name(memcg, s);
+ if (!name)
+ return NULL;
+
+ new = kmem_cache_create_memcg(memcg, name, s->object_size, s->align,
+ (s->flags & ~SLAB_PANIC), s->ctor);
+
+ kfree(name);
+ return new;
+}
+
+/*
+ * This lock protects updaters, not readers. We want readers to be as fast as
+ * they can, and they will either see NULL or a valid cache value. Our model
+ * allows them to see NULL, in which case the root memcg will be selected.
+ *
+ * We need this lock because multiple allocations to the same cache can span
+ * more than one worker. Only one of them can create the cache.
+ */
+static DEFINE_MUTEX(memcg_cache_mutex);
+static struct kmem_cache *memcg_create_kmem_cache(struct mem_cgroup *memcg,
+ struct kmem_cache *cachep)
+{
+ struct kmem_cache *new_cachep;
+ int idx;
+
+ BUG_ON(!memcg_can_account_kmem(memcg));
+
+ idx = memcg_cache_id(memcg);
+
+ mutex_lock(&memcg_cache_mutex);
+ new_cachep = cachep->memcg_params->memcg_caches[idx];
+ if (new_cachep)
+ goto out;
+
+ new_cachep = kmem_cache_dup(memcg, cachep);
+
+ if (new_cachep == NULL) {
+ new_cachep = cachep;
+ goto out;
+ }
+
+ mem_cgroup_get(memcg);
+ new_cachep->memcg_params->root_cache = cachep;
+
+ cachep->memcg_params->memcg_caches[idx] = new_cachep;
+ /*
+ * the readers won't lock, make sure everybody sees the updated value,
+ * so they won't put stuff in the queue again for no reason
+ */
+ wmb();
+out:
+ mutex_unlock(&memcg_cache_mutex);
+ return new_cachep;
+}
+
+struct create_work {
+ struct mem_cgroup *memcg;
+ struct kmem_cache *cachep;
+ struct work_struct work;
+};
+
+static void memcg_create_cache_work_func(struct work_struct *w)
+{
+ struct create_work *cw;
+
+ cw = container_of(w, struct create_work, work);
+ memcg_create_kmem_cache(cw->memcg, cw->cachep);
+ /* Drop the reference gotten when we enqueued. */
+ css_put(&cw->memcg->css);
+ kfree(cw);
+}
+
+/*
+ * Enqueue the creation of a per-memcg kmem_cache.
+ * Called with rcu_read_lock.
+ */
+static void memcg_create_cache_enqueue(struct mem_cgroup *memcg,
+ struct kmem_cache *cachep)
+{
+ struct create_work *cw;
+
+ cw = kmalloc(sizeof(struct create_work), GFP_NOWAIT);
+ if (cw == NULL)
+ return;
+
+ /* The corresponding put will be done in the workqueue. */
+ if (!css_tryget(&memcg->css)) {
+ kfree(cw);
+ return;
+ }
+
+ cw->memcg = memcg;
+ cw->cachep = cachep;
+
+ INIT_WORK(&cw->work, memcg_create_cache_work_func);
+ schedule_work(&cw->work);
+}
+
+/*
+ * Return the kmem_cache we're supposed to use for a slab allocation.
+ * We try to use the current memcg's version of the cache.
+ *
+ * If the cache does not exist yet, if we are the first user of it,
+ * we either create it immediately, if possible, or create it asynchronously
+ * in a workqueue.
+ * In the latter case, we will let the current allocation go through with
+ * the original cache.
+ *
+ * Can't be called in interrupt context or from kernel threads.
+ * This function needs to be called with rcu_read_lock() held.
+ */
+struct kmem_cache *__memcg_kmem_get_cache(struct kmem_cache *cachep,
+ gfp_t gfp)
+{
+ struct mem_cgroup *memcg;
+ int idx;
+
+ VM_BUG_ON(!cachep->memcg_params);
+ VM_BUG_ON(!cachep->memcg_params->is_root_cache);
+
+ rcu_read_lock();
+ memcg = mem_cgroup_from_task(rcu_dereference(current->mm->owner));
+ rcu_read_unlock();
+
+ if (!memcg_can_account_kmem(memcg))
+ return cachep;
+
+ idx = memcg_cache_id(memcg);
+
+ /*
+ * barrier to make sure we're always seeing the up-to-date value. The
+ * code updating memcg_caches will issue a write barrier to match this.
+ */
+ read_barrier_depends();
+ if (unlikely(cachep->memcg_params->memcg_caches[idx] == NULL)) {
+ /*
+ * If we are in a safe context (can wait, and not in interrupt
+ * context), we could be predictable and return right away.
+ * This would guarantee that the allocation being performed
+ * already belongs in the new cache.
+ *
+ * However, there are some clashes that can arrive from locking.
+ * For instance, because we acquire the slab_mutex while doing
+ * kmem_cache_dup, this means no further allocation could happen
+ * with the slab_mutex held.
+ *
+ * Also, because cache creation issues get_online_cpus(), this
+ * creates a lock chain: memcg_slab_mutex -> cpu_hotplug_mutex,
+ * that ends up reversed during cpu hotplug. (cpuset allocates
+ * a bunch of GFP_KERNEL memory during cpuup). Due to all that,
+ * better to defer everything.
+ */
+ memcg_create_cache_enqueue(memcg, cachep);
+ return cachep;
+ }
+
+ return cachep->memcg_params->memcg_caches[idx];
+}
+EXPORT_SYMBOL(__memcg_kmem_get_cache);
+
/*
* We need to verify if the allocation against current->mm->owner's memcg is
* possible for the given order. But the page is not allocated yet, so we'll
--
1.7.11.7

2012-11-01 12:10:24

by Glauber Costa

[permalink] [raw]
Subject: [PATCH v6 25/29] memcg/sl[au]b: shrink dead caches

When we destroy a memcg cache that happens to be empty of objects, it may
take a long time to go away: removing the memcg reference won't destroy
it, because its remaining (empty) slab pages still hold references, and
those pages will stay around until a shrinker is called upon for some
other reason.

In this patch, we call kmem_cache_shrink for all dead caches that cannot
be destroyed because of remaining pages. After shrinking, the cache may
become empty and be freed. If that is not the case, we schedule a lazy
worker to keep trying.

[ v2: also call verify_dead for the slab ]
[ v3: use delayed_work to avoid calling verify_dead at every free]
[ v6: do not spawn worker if work is already pending ]
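
The retry logic reduces to roughly the pattern below (userspace model; the
atomic page counter and the delayed work are replaced by a plain integer and
a comment):

#include <stdio.h>

struct cache { const char *name; int nr_pages; };

static void cache_shrink(struct cache *c)
{
	/* pretend shrinking releases one empty page */
	if (c->nr_pages > 0)
		c->nr_pages--;
}

/* models kmem_cache_destroy_work_func() */
static void destroy_worker(struct cache *c)
{
	if (c->nr_pages != 0) {
		cache_shrink(c);
		if (c->nr_pages == 0)
			return;	/* the page release path re-queues us */
		printf("%s still has %d page(s), retry later\n",
		       c->name, c->nr_pages);
		return;		/* kernel: schedule_delayed_work(..., 60 * HZ) */
	}
	printf("%s destroyed\n", c->name);
}

int main(void)
{
	struct cache c = { "dentry(2:foo)", 2 };

	destroy_worker(&c);	/* shrinks, will retry later */
	destroy_worker(&c);	/* shrinks to zero, re-queued by page release */
	destroy_worker(&c);	/* nothing left, destroy */
	return 0;
}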

Signed-off-by: Glauber Costa <[email protected]>
CC: Christoph Lameter <[email protected]>
CC: Pekka Enberg <[email protected]>
CC: Michal Hocko <[email protected]>
CC: Kamezawa Hiroyuki <[email protected]>
CC: Johannes Weiner <[email protected]>
CC: Suleiman Souhlal <[email protected]>
CC: Tejun Heo <[email protected]>
---
include/linux/slab.h | 2 +-
mm/memcontrol.c | 55 ++++++++++++++++++++++++++++++++++++++++++++++------
2 files changed, 50 insertions(+), 7 deletions(-)

diff --git a/include/linux/slab.h b/include/linux/slab.h
index ef2314e..0df42db 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -214,7 +214,7 @@ struct memcg_cache_params {
struct kmem_cache *root_cache;
bool dead;
atomic_t nr_pages;
- struct work_struct destroy;
+ struct delayed_work destroy;
};
};
};
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 31da8bc..6e2575a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3048,12 +3048,35 @@ static void kmem_cache_destroy_work_func(struct work_struct *w)
{
struct kmem_cache *cachep;
struct memcg_cache_params *p;
+ struct delayed_work *dw = to_delayed_work(w);

- p = container_of(w, struct memcg_cache_params, destroy);
+ p = container_of(dw, struct memcg_cache_params, destroy);

cachep = memcg_params_to_cache(p);

- if (!atomic_read(&cachep->memcg_params->nr_pages))
+ /*
+ * If we get down to 0 after shrink, we could delete right away.
+ * However, memcg_release_pages() already puts us back in the workqueue
+ * in that case. If we proceed deleting, we'll get a dangling
+ * reference, and removing the object from the workqueue in that case
+ * is unnecessary complication. We are not a fast path.
+ *
+ * Note that this case is fundamentally different from racing with
+ * shrink_slab(): if memcg_cgroup_destroy_cache() is called in
+ * kmem_cache_shrink, not only we would be reinserting a dead cache
+ * into the queue, but doing so from inside the worker racing to
+ * destroy it.
+ *
+ * So if we aren't down to zero, we'll just schedule a worker and try
+ * again
+ */
+ if (atomic_read(&cachep->memcg_params->nr_pages) != 0) {
+ kmem_cache_shrink(cachep);
+ if (atomic_read(&cachep->memcg_params->nr_pages) == 0)
+ return;
+ /* Once per minute should be good enough. */
+ schedule_delayed_work(&cachep->memcg_params->destroy, 60 * HZ);
+ } else
kmem_cache_destroy(cachep);
}

@@ -3063,10 +3086,30 @@ void mem_cgroup_destroy_cache(struct kmem_cache *cachep)
return;

/*
+ * There are many ways in which we can get here.
+ *
+ * We can get to a memory-pressure situation while the delayed work is
+ * still pending to run. The vmscan shrinkers can then release all
+ * cache memory and get us to destruction. If this is the case, we'll
+ * be executed twice, which is a bug (the second time will execute over
+ * bogus data). In this case, cancelling the work should be fine.
+ *
+ * But we can also get here from the worker itself, if
+ * kmem_cache_shrink is enough to shake all the remaining objects and
+ * get the page count to 0. In this case, we'll deadlock if we try to
+ * cancel the work (the worker runs with an internal lock held, which
+ * is the same lock we would hold for cancel_delayed_work_sync().)
+ *
+ * Since we can't possibly know who got us here, just refrain from
+ * running if there is already work pending
+ */
+ if (delayed_work_pending(&cachep->memcg_params->destroy))
+ return;
+ /*
* We have to defer the actual destroying to a workqueue, because
* we might currently be in a context that cannot sleep.
*/
- schedule_work(&cachep->memcg_params->destroy);
+ schedule_delayed_work(&cachep->memcg_params->destroy, 0);
}

static char *memcg_cache_name(struct mem_cgroup *memcg, struct kmem_cache *s)
@@ -3218,9 +3261,9 @@ static void mem_cgroup_destroy_all_caches(struct mem_cgroup *memcg)
list_for_each_entry(params, &memcg->memcg_slab_caches, list) {
cachep = memcg_params_to_cache(params);
cachep->memcg_params->dead = true;
- INIT_WORK(&cachep->memcg_params->destroy,
- kmem_cache_destroy_work_func);
- schedule_work(&cachep->memcg_params->destroy);
+ INIT_DELAYED_WORK(&cachep->memcg_params->destroy,
+ kmem_cache_destroy_work_func);
+ schedule_delayed_work(&cachep->memcg_params->destroy, 0);
}
mutex_unlock(&memcg->slab_caches_mutex);
}
--
1.7.11.7

2012-11-01 12:10:21

by Glauber Costa

[permalink] [raw]
Subject: [PATCH v6 28/29] slub: slub-specific propagation changes.

SLUB allows us to tune a particular cache behavior with sysfs-based
tunables. When creating a new memcg cache copy, we'd like to preserve
any tunables the parent cache already had.

This can be done by tapping into the store attribute function provided
by the allocator. We of course don't need to mess with read-only
fields. Since the attributes can have multiple types and are stored
internally by sysfs, the best strategy is to issue a ->show() in the
root cache, and then ->store() in the memcg cache.

The drawback is that sysfs can use up to a page of buffering for show(),
which we are likely not to need but also can't rule out. To avoid always
allocating a page for that, we update the caches at store time with the
maximum attribute size ever stored to the root cache, and then get a
buffer big enough to hold it. The corollary is that if no stores
happened, nothing will be propagated.

It can also happen that a root cache has its tunables updated during
normal system operation. In this case, we will propagate the change to
all caches that are already active.
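
In essence, the propagation amounts to the following (userspace model; the
attribute shown is a plain integer, and show()/store() go through a text
buffer as the sysfs interface does):

#include <stdio.h>
#include <stdlib.h>

struct cache { const char *name; int cpu_partial; };

/* attribute callbacks, text-based like sysfs */
static int show(struct cache *c, char *buf)
{
	return sprintf(buf, "%d\n", c->cpu_partial);
}

static void store(struct cache *c, const char *buf)
{
	c->cpu_partial = atoi(buf);
}

/* copy whatever the root cache currently exposes into the new child */
static void propagate(struct cache *root, struct cache *child)
{
	char buf[64];

	show(root, buf);
	store(child, buf);
}

int main(void)
{
	struct cache root  = { "dentry", 30 };
	struct cache child = { "dentry(2:foo)", 0 };

	propagate(&root, &child);
	printf("%s cpu_partial=%d\n", child.name, child.cpu_partial);
	return 0;
}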

Signed-off-by: Glauber Costa <[email protected]>
CC: Christoph Lameter <[email protected]>
CC: Pekka Enberg <[email protected]>
CC: Michal Hocko <[email protected]>
CC: Kamezawa Hiroyuki <[email protected]>
CC: Johannes Weiner <[email protected]>
CC: Suleiman Souhlal <[email protected]>
CC: Tejun Heo <[email protected]>
---
include/linux/slub_def.h | 1 +
mm/slub.c | 76 +++++++++++++++++++++++++++++++++++++++++++++++-
2 files changed, 76 insertions(+), 1 deletion(-)

diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
index 364ba6c..9db4825 100644
--- a/include/linux/slub_def.h
+++ b/include/linux/slub_def.h
@@ -103,6 +103,7 @@ struct kmem_cache {
#endif
#ifdef CONFIG_MEMCG_KMEM
struct memcg_cache_params *memcg_params;
+ int max_attr_size; /* for propagation, maximum size of a stored attr */
#endif

#ifdef CONFIG_NUMA
diff --git a/mm/slub.c b/mm/slub.c
index ffc8ede..6c39af8 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -203,13 +203,14 @@ enum track_item { TRACK_ALLOC, TRACK_FREE };
static int sysfs_slab_add(struct kmem_cache *);
static int sysfs_slab_alias(struct kmem_cache *, const char *);
static void sysfs_slab_remove(struct kmem_cache *);
-
+static void memcg_propagate_slab_attrs(struct kmem_cache *s);
#else
static inline int sysfs_slab_add(struct kmem_cache *s) { return 0; }
static inline int sysfs_slab_alias(struct kmem_cache *s, const char *p)
{ return 0; }
static inline void sysfs_slab_remove(struct kmem_cache *s) { }

+static inline void memcg_propagate_slab_attrs(struct kmem_cache *s) { }
#endif

static inline void stat(const struct kmem_cache *s, enum stat_item si)
@@ -3955,6 +3956,7 @@ int __kmem_cache_create(struct kmem_cache *s, unsigned long flags)
if (err)
return err;

+ memcg_propagate_slab_attrs(s);
mutex_unlock(&slab_mutex);
err = sysfs_slab_add(s);
mutex_lock(&slab_mutex);
@@ -5180,6 +5182,7 @@ static ssize_t slab_attr_store(struct kobject *kobj,
struct slab_attribute *attribute;
struct kmem_cache *s;
int err;
+ int i __maybe_unused;

attribute = to_slab_attr(attr);
s = to_slab(kobj);
@@ -5188,10 +5191,81 @@ static ssize_t slab_attr_store(struct kobject *kobj,
return -EIO;

err = attribute->store(s, buf, len);
+#ifdef CONFIG_MEMCG_KMEM
+ if (slab_state < FULL)
+ return err;

+ if ((err < 0) || !is_root_cache(s))
+ return err;
+
+ mutex_lock(&slab_mutex);
+ if (s->max_attr_size < len)
+ s->max_attr_size = len;
+
+ for_each_memcg_cache_index(i) {
+ struct kmem_cache *c = cache_from_memcg(s, i);
+ if (c)
+ /* return value determined by the parent cache only */
+ attribute->store(c, buf, len);
+ }
+ mutex_unlock(&slab_mutex);
+#endif
return err;
}

+static void memcg_propagate_slab_attrs(struct kmem_cache *s)
+{
+#ifdef CONFIG_MEMCG_KMEM
+ int i;
+ char *buffer = NULL;
+
+ if (!is_root_cache(s))
+ return;
+
+ /*
+ * This means this cache had no attribute written. Therefore, no point
+ * in copying default values around
+ */
+ if (!s->max_attr_size)
+ return;
+
+ for (i = 0; i < ARRAY_SIZE(slab_attrs); i++) {
+ char mbuf[64];
+ char *buf;
+ struct slab_attribute *attr = to_slab_attr(slab_attrs[i]);
+
+ if (!attr || !attr->store || !attr->show)
+ continue;
+
+ /*
+ * It is really bad that we have to allocate here, so we will
+ * do it only as a fallback. If we actually allocate, though,
+ * we can just use the allocated buffer until the end.
+ *
+ * Most of the slub attributes will tend to be very small in
+ * size, but sysfs allows buffers up to a page, so they can
+ * theoretically happen.
+ */
+ if (buffer)
+ buf = buffer;
+ else if (s->max_attr_size < ARRAY_SIZE(mbuf))
+ buf = mbuf;
+ else {
+ buffer = (char *) get_zeroed_page(GFP_KERNEL);
+ if (WARN_ON(!buffer))
+ continue;
+ buf = buffer;
+ }
+
+ attr->show(s->memcg_params->root_cache, buf);
+ attr->store(s, buf, strlen(buf));
+ }
+
+ if (buffer)
+ free_page((unsigned long)buffer);
+#endif
+}
+
static const struct sysfs_ops slab_sysfs_ops = {
.show = slab_attr_show,
.store = slab_attr_store,
--
1.7.11.7

2012-11-01 12:07:56

by Glauber Costa

[permalink] [raw]
Subject: [PATCH v6 01/29] memcg: Make it possible to use the stock for more than one page.

From: Suleiman Souhlal <[email protected]>

We currently have a percpu stock cache scheme that charges one page at a
time from memcg->res, the user counter. When the kernel memory
controller comes into play, we'll need to charge more than that.

This is because kernel memory allocations will also draw from the user
counter, and can be bigger than a single page, as is the case with the
stack (usually 2 pages) or some higher-order slabs.

[ [email protected]: added a changelog ]
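
A toy model of the batched stock check (single stock, no per-cpu or
preemption handling; CHARGE_BATCH plays the same role as in the patch):

#include <stdbool.h>
#include <stdio.h>

#define CHARGE_BATCH 32U

struct mem_cgroup { const char *name; };

static struct memcg_stock {
	struct mem_cgroup *cached;
	unsigned int nr_pages;
} stock;

static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
{
	if (nr_pages > CHARGE_BATCH)
		return false;		/* too big, go to res_counter */

	if (memcg == stock.cached && stock.nr_pages >= nr_pages) {
		stock.nr_pages -= nr_pages;
		return true;
	}
	return false;			/* wrong memcg or not enough stock */
}

int main(void)
{
	struct mem_cgroup mc = { "foo" };

	stock.cached = &mc;
	stock.nr_pages = 8;

	printf("%d\n", consume_stock(&mc, 2));	/* 1: stack-sized charge */
	printf("%d\n", consume_stock(&mc, 7));	/* 0: only 6 pages left */
	return 0;
}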

Signed-off-by: Suleiman Souhlal <[email protected]>
Signed-off-by: Glauber Costa <[email protected]>
Acked-by: David Rientjes <[email protected]>
Acked-by: Kamezawa Hiroyuki <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
CC: Tejun Heo <[email protected]>
---
mm/memcontrol.c | 28 ++++++++++++++++++----------
1 file changed, 18 insertions(+), 10 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e4e9b18..4a1abe9 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2028,20 +2028,28 @@ struct memcg_stock_pcp {
static DEFINE_PER_CPU(struct memcg_stock_pcp, memcg_stock);
static DEFINE_MUTEX(percpu_charge_mutex);

-/*
- * Try to consume stocked charge on this cpu. If success, one page is consumed
- * from local stock and true is returned. If the stock is 0 or charges from a
- * cgroup which is not current target, returns false. This stock will be
- * refilled.
+/**
+ * consume_stock: Try to consume stocked charge on this cpu.
+ * @memcg: memcg to consume from.
+ * @nr_pages: how many pages to charge.
+ *
+ * The charges will only happen if @memcg matches the current cpu's memcg
+ * stock, and at least @nr_pages are available in that stock. Failure to
+ * service an allocation will refill the stock.
+ *
+ * returns true if successful, false otherwise.
*/
-static bool consume_stock(struct mem_cgroup *memcg)
+static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
{
struct memcg_stock_pcp *stock;
bool ret = true;

+ if (nr_pages > CHARGE_BATCH)
+ return false;
+
stock = &get_cpu_var(memcg_stock);
- if (memcg == stock->cached && stock->nr_pages)
- stock->nr_pages--;
+ if (memcg == stock->cached && stock->nr_pages >= nr_pages)
+ stock->nr_pages -= nr_pages;
else /* need to call res_counter_charge */
ret = false;
put_cpu_var(memcg_stock);
@@ -2340,7 +2348,7 @@ again:
VM_BUG_ON(css_is_removed(&memcg->css));
if (mem_cgroup_is_root(memcg))
goto done;
- if (nr_pages == 1 && consume_stock(memcg))
+ if (consume_stock(memcg, nr_pages))
goto done;
css_get(&memcg->css);
} else {
@@ -2365,7 +2373,7 @@ again:
rcu_read_unlock();
goto done;
}
- if (nr_pages == 1 && consume_stock(memcg)) {
+ if (consume_stock(memcg, nr_pages)) {
/*
* It seems dagerous to access memcg without css_get().
* But considering how consume_stok works, it's not
--
1.7.11.7

2012-11-01 12:12:05

by Glauber Costa

[permalink] [raw]
Subject: [PATCH v6 17/29] consider a memcg parameter in kmem_create_cache

Allow a memcg parameter to be passed during cache creation. When the
slub allocator is being used, it will only merge caches that belong to
the same memcg. We do this by scanning the global list, and then
translating the cache to a memcg-specific cache.

Default function is created as a wrapper, passing NULL to the memcg
version. We only merge caches that belong to the same memcg.

A helper, memcg_css_id, is provided because slub needs a unique cache
name for sysfs. Since this is visible, but not the canonical location
for slab data, the cache name is not used; the css_id should suffice.

[ v2: moved to idr/ida instead of redoing the indexes ]
[ v3: moved call to ida_init away from cgroup creation to fix a bug ]
[ v4: no longer using the index mechanism ]
[ v6: renamed memcg_css_id to memcg_cache_id, and return a proper id ]
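
The merge filter reduces to a check like the one below (userspace sketch;
memcg_params is flattened into a single memcg pointer):

#include <stdbool.h>
#include <stdio.h>

struct mem_cgroup { int id; };

struct kmem_cache {
	const char *name;
	struct mem_cgroup *memcg;	/* NULL for a root cache */
};

static bool cache_match_memcg(struct kmem_cache *cachep,
			      struct mem_cgroup *memcg)
{
	return (!cachep->memcg && !memcg) || cachep->memcg == memcg;
}

int main(void)
{
	struct mem_cgroup mc = { 2 };
	struct kmem_cache root  = { "kmalloc-64", NULL };
	struct kmem_cache child = { "kmalloc-64(2:foo)", &mc };

	/* find_mergeable() skips candidates from a different memcg */
	printf("%d %d\n", cache_match_memcg(&root, NULL),	/* 1 */
			  cache_match_memcg(&child, NULL));	/* 0 */
	return 0;
}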

Signed-off-by: Glauber Costa <[email protected]>
CC: Christoph Lameter <[email protected]>
CC: Pekka Enberg <[email protected]>
CC: Michal Hocko <[email protected]>
CC: Kamezawa Hiroyuki <[email protected]>
CC: Johannes Weiner <[email protected]>
CC: Suleiman Souhlal <[email protected]>
CC: Tejun Heo <[email protected]>
---
include/linux/memcontrol.h | 26 +++++++++++++++++++++++
include/linux/slab.h | 14 ++++++++++++-
mm/memcontrol.c | 51 ++++++++++++++++++++++++++++++++++++++++++++++
mm/slab.h | 23 +++++++++++++++++----
mm/slab_common.c | 42 ++++++++++++++++++++++++++++++--------
mm/slub.c | 19 +++++++++++++----
6 files changed, 157 insertions(+), 18 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 2a2ae05..ea1e66f 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -28,6 +28,7 @@ struct mem_cgroup;
struct page_cgroup;
struct page;
struct mm_struct;
+struct kmem_cache;

/* Stats that can be updated by kernel. */
enum mem_cgroup_page_stat_item {
@@ -434,6 +435,11 @@ void __memcg_kmem_commit_charge(struct page *page,
struct mem_cgroup *memcg, int order);
void __memcg_kmem_uncharge_pages(struct page *page, int order);

+int memcg_cache_id(struct mem_cgroup *memcg);
+int memcg_register_cache(struct mem_cgroup *memcg, struct kmem_cache *s);
+void memcg_release_cache(struct kmem_cache *cachep);
+void memcg_cache_list_add(struct mem_cgroup *memcg, struct kmem_cache *cachep);
+
/**
* memcg_kmem_newpage_charge: verify if a new kmem allocation is allowed.
* @gfp: the gfp allocation flags.
@@ -518,6 +524,26 @@ static inline void
memcg_kmem_commit_charge(struct page *page, struct mem_cgroup *memcg, int order)
{
}
+
+static inline int memcg_cache_id(struct mem_cgroup *memcg)
+{
+ return -1;
+}
+
+static inline int memcg_register_cache(struct mem_cgroup *memcg,
+ struct kmem_cache *s)
+{
+ return 0;
+}
+
+static inline void memcg_release_cache(struct kmem_cache *cachep)
+{
+}
+
+static inline void memcg_cache_list_add(struct mem_cgroup *memcg,
+ struct kmem_cache *s)
+{
+}
#endif /* CONFIG_MEMCG_KMEM */
#endif /* _LINUX_MEMCONTROL_H */

diff --git a/include/linux/slab.h b/include/linux/slab.h
index 8860e08..892367e 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -116,6 +116,7 @@ struct kmem_cache {
};
#endif

+struct mem_cgroup;
/*
* struct kmem_cache related prototypes
*/
@@ -125,6 +126,9 @@ int slab_is_available(void);
struct kmem_cache *kmem_cache_create(const char *, size_t, size_t,
unsigned long,
void (*)(void *));
+struct kmem_cache *
+kmem_cache_create_memcg(struct mem_cgroup *, const char *, size_t, size_t,
+ unsigned long, void (*)(void *));
void kmem_cache_destroy(struct kmem_cache *);
int kmem_cache_shrink(struct kmem_cache *);
void kmem_cache_free(struct kmem_cache *, void *);
@@ -192,15 +196,23 @@ unsigned int kmem_cache_size(struct kmem_cache *);
* Child caches will hold extra metadata needed for its operation. Fields are:
*
* @memcg: pointer to the memcg this cache belongs to
+ * @list: list_head for the list of all caches in this memcg
+ * @root_cache: pointer to the global, root cache, this cache was derived from
*/
struct memcg_cache_params {
bool is_root_cache;
union {
struct kmem_cache *memcg_caches[0];
- struct mem_cgroup *memcg;
+ struct {
+ struct mem_cgroup *memcg;
+ struct list_head list;
+ struct kmem_cache *root_cache;
+ };
};
};

+int memcg_update_all_caches(int num_memcgs);
+
/*
* Common kmalloc functions provided by all allocators
*/
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 5e8962c..fb5b1e6 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -339,6 +339,14 @@ struct mem_cgroup {
#if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_INET)
struct tcp_memcontrol tcp_mem;
#endif
+#if defined(CONFIG_MEMCG_KMEM)
+ /* analogous to slab_common's slab_caches list. per-memcg */
+ struct list_head memcg_slab_caches;
+ /* Not a spinlock, we can take a lot of time walking the list */
+ struct mutex slab_caches_mutex;
+ /* Index in the kmem_cache->memcg_params->memcg_caches array */
+ int kmemcg_id;
+#endif
};

/* internal only representation about the status of kmem accounting. */
@@ -2754,6 +2762,47 @@ static void memcg_uncharge_kmem(struct mem_cgroup *memcg, u64 size)
mem_cgroup_put(memcg);
}

+void memcg_cache_list_add(struct mem_cgroup *memcg, struct kmem_cache *cachep)
+{
+ if (!memcg)
+ return;
+
+ mutex_lock(&memcg->slab_caches_mutex);
+ list_add(&cachep->memcg_params->list, &memcg->memcg_slab_caches);
+ mutex_unlock(&memcg->slab_caches_mutex);
+}
+
+/*
+ * helper for acessing a memcg's index. It will be used as an index in the
+ * child cache array in kmem_cache, and also to derive its name. This function
+ * will return -1 when this is not a kmem-limited memcg.
+ */
+int memcg_cache_id(struct mem_cgroup *memcg)
+{
+ return memcg ? memcg->kmemcg_id : -1;
+}
+
+int memcg_register_cache(struct mem_cgroup *memcg, struct kmem_cache *s)
+{
+ size_t size = sizeof(struct memcg_cache_params);
+
+ if (!memcg_kmem_enabled())
+ return 0;
+
+ s->memcg_params = kzalloc(size, GFP_KERNEL);
+ if (!s->memcg_params)
+ return -ENOMEM;
+
+ if (memcg)
+ s->memcg_params->memcg = memcg;
+ return 0;
+}
+
+void memcg_release_cache(struct kmem_cache *s)
+{
+ kfree(s->memcg_params);
+}
+
/*
* We need to verify if the allocation against current->mm->owner's memcg is
* possible for the given order. But the page is not allocated yet, so we'll
@@ -4991,7 +5040,9 @@ static int mem_cgroup_oom_control_write(struct cgroup *cgrp,
#ifdef CONFIG_MEMCG_KMEM
static int memcg_init_kmem(struct mem_cgroup *memcg, struct cgroup_subsys *ss)
{
+ memcg->kmemcg_id = -1;
memcg_propagate_kmem(memcg);
+
return mem_cgroup_sockets_init(memcg, ss);
};

diff --git a/mm/slab.h b/mm/slab.h
index 5ee1851..22eb5aa2 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -35,12 +35,15 @@ extern struct kmem_cache *kmem_cache;
/* Functions provided by the slab allocators */
extern int __kmem_cache_create(struct kmem_cache *, unsigned long flags);

+struct mem_cgroup;
#ifdef CONFIG_SLUB
-struct kmem_cache *__kmem_cache_alias(const char *name, size_t size,
- size_t align, unsigned long flags, void (*ctor)(void *));
+struct kmem_cache *
+__kmem_cache_alias(struct mem_cgroup *memcg, const char *name, size_t size,
+ size_t align, unsigned long flags, void (*ctor)(void *));
#else
-static inline struct kmem_cache *__kmem_cache_alias(const char *name, size_t size,
- size_t align, unsigned long flags, void (*ctor)(void *))
+static inline struct kmem_cache *
+__kmem_cache_alias(struct mem_cgroup *memcg, const char *name, size_t size,
+ size_t align, unsigned long flags, void (*ctor)(void *))
{ return NULL; }
#endif

@@ -98,11 +101,23 @@ static inline bool is_root_cache(struct kmem_cache *s)
{
return !s->memcg_params || s->memcg_params->is_root_cache;
}
+
+static inline bool cache_match_memcg(struct kmem_cache *cachep,
+ struct mem_cgroup *memcg)
+{
+ return (is_root_cache(cachep) && !memcg) ||
+ (cachep->memcg_params->memcg == memcg);
+}
#else
static inline bool is_root_cache(struct kmem_cache *s)
{
return true;
}

+static inline bool cache_match_memcg(struct kmem_cache *cachep,
+ struct mem_cgroup *memcg)
+{
+ return true;
+}
#endif
#endif
diff --git a/mm/slab_common.c b/mm/slab_common.c
index b705be7..0578731 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -18,6 +18,7 @@
#include <asm/cacheflush.h>
#include <asm/tlbflush.h>
#include <asm/page.h>
+#include <linux/memcontrol.h>

#include "slab.h"

@@ -27,7 +28,8 @@ DEFINE_MUTEX(slab_mutex);
struct kmem_cache *kmem_cache;

#ifdef CONFIG_DEBUG_VM
-static int kmem_cache_sanity_check(const char *name, size_t size)
+static int kmem_cache_sanity_check(struct mem_cgroup *memcg, const char *name,
+ size_t size)
{
struct kmem_cache *s = NULL;

@@ -53,7 +55,13 @@ static int kmem_cache_sanity_check(const char *name, size_t size)
continue;
}

- if (!strcmp(s->name, name)) {
+ /*
+ * For simplicity, we won't check this in the list of memcg
+ * caches. We have control over memcg naming, and if there
+ * aren't duplicates in the global list, there won't be any
+ * duplicates in the memcg lists as well.
+ */
+ if (!memcg && !strcmp(s->name, name)) {
pr_err("%s (%s): Cache name already exists.\n",
__func__, name);
dump_stack();
@@ -66,7 +74,8 @@ static int kmem_cache_sanity_check(const char *name, size_t size)
return 0;
}
#else
-static inline int kmem_cache_sanity_check(const char *name, size_t size)
+static inline int kmem_cache_sanity_check(struct mem_cgroup *memcg,
+ const char *name, size_t size)
{
return 0;
}
@@ -97,8 +106,9 @@ static inline int kmem_cache_sanity_check(const char *name, size_t size)
* as davem.
*/

-struct kmem_cache *kmem_cache_create(const char *name, size_t size, size_t align,
- unsigned long flags, void (*ctor)(void *))
+struct kmem_cache *
+kmem_cache_create_memcg(struct mem_cgroup *memcg, const char *name, size_t size,
+ size_t align, unsigned long flags, void (*ctor)(void *))
{
struct kmem_cache *s = NULL;
int err = 0;
@@ -106,7 +116,7 @@ struct kmem_cache *kmem_cache_create(const char *name, size_t size, size_t align
get_online_cpus();
mutex_lock(&slab_mutex);

- if (!kmem_cache_sanity_check(name, size) == 0)
+ if (!kmem_cache_sanity_check(memcg, name, size) == 0)
goto out_locked;

/*
@@ -117,7 +127,7 @@ struct kmem_cache *kmem_cache_create(const char *name, size_t size, size_t align
*/
flags &= CACHE_CREATE_MASK;

- s = __kmem_cache_alias(name, size, align, flags, ctor);
+ s = __kmem_cache_alias(memcg, name, size, align, flags, ctor);
if (s)
goto out_locked;

@@ -126,6 +136,13 @@ struct kmem_cache *kmem_cache_create(const char *name, size_t size, size_t align
s->object_size = s->size = size;
s->align = align;
s->ctor = ctor;
+
+ if (memcg_register_cache(memcg, s)) {
+ kmem_cache_free(kmem_cache, s);
+ err = -ENOMEM;
+ goto out_locked;
+ }
+
s->name = kstrdup(name, GFP_KERNEL);
if (!s->name) {
kmem_cache_free(kmem_cache, s);
@@ -135,10 +152,9 @@ struct kmem_cache *kmem_cache_create(const char *name, size_t size, size_t align

err = __kmem_cache_create(s, flags);
if (!err) {
-
s->refcount = 1;
list_add(&s->list, &slab_caches);
-
+ memcg_cache_list_add(memcg, s);
} else {
kfree(s->name);
kmem_cache_free(kmem_cache, s);
@@ -166,6 +182,13 @@ out_locked:

return s;
}
+
+struct kmem_cache *
+kmem_cache_create(const char *name, size_t size, size_t align,
+ unsigned long flags, void (*ctor)(void *))
+{
+ return kmem_cache_create_memcg(NULL, name, size, align, flags, ctor);
+}
EXPORT_SYMBOL(kmem_cache_create);

void kmem_cache_destroy(struct kmem_cache *s)
@@ -181,6 +204,7 @@ void kmem_cache_destroy(struct kmem_cache *s)
if (s->flags & SLAB_DESTROY_BY_RCU)
rcu_barrier();

+ memcg_release_cache(s);
kfree(s->name);
kmem_cache_free(kmem_cache, s);
} else {
diff --git a/mm/slub.c b/mm/slub.c
index 259bc2c..a105bdc 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -31,6 +31,7 @@
#include <linux/fault-inject.h>
#include <linux/stacktrace.h>
#include <linux/prefetch.h>
+#include <linux/memcontrol.h>

#include <trace/events/kmem.h>

@@ -3880,7 +3881,7 @@ static int slab_unmergeable(struct kmem_cache *s)
return 0;
}

-static struct kmem_cache *find_mergeable(size_t size,
+static struct kmem_cache *find_mergeable(struct mem_cgroup *memcg, size_t size,
size_t align, unsigned long flags, const char *name,
void (*ctor)(void *))
{
@@ -3916,17 +3917,21 @@ static struct kmem_cache *find_mergeable(size_t size,
if (s->size - size >= sizeof(void *))
continue;

+ if (!cache_match_memcg(s, memcg))
+ continue;
+
return s;
}
return NULL;
}

-struct kmem_cache *__kmem_cache_alias(const char *name, size_t size,
- size_t align, unsigned long flags, void (*ctor)(void *))
+struct kmem_cache *
+__kmem_cache_alias(struct mem_cgroup *memcg, const char *name, size_t size,
+ size_t align, unsigned long flags, void (*ctor)(void *))
{
struct kmem_cache *s;

- s = find_mergeable(size, align, flags, name, ctor);
+ s = find_mergeable(memcg, size, align, flags, name, ctor);
if (s) {
s->refcount++;
/*
@@ -5246,6 +5251,12 @@ static char *create_unique_id(struct kmem_cache *s)
if (p != name + 1)
*p++ = '-';
p += sprintf(p, "%07d", s->size);
+
+#ifdef CONFIG_MEMCG_KMEM
+ if (!is_root_cache(s))
+ p += sprintf(p, "-%08d", memcg_cache_id(s->memcg_params->memcg));
+#endif
+
BUG_ON(p > name + ID_STR_LENGTH - 1);
return name;
}
--
1.7.11.7

2012-11-01 12:12:03

by Glauber Costa

[permalink] [raw]
Subject: [PATCH v6 12/29] execute the whole memcg freeing in free_worker

A lot of the initialization we do in mem_cgroup_create() is done with
softirqs enabled. This includes grabbing a css id, which holds
&ss->id_lock->rlock, and the per-zone trees, which hold
rtpz->lock->rlock. All of those signal to the lockdep mechanism that
those locks can be used in SOFTIRQ-ON-W context. This means that the
freeing of the memcg structure must happen in a compatible context,
otherwise we'll get a deadlock like the one below, caught by lockdep:

[<ffffffff81103095>] free_accounted_pages+0x47/0x4c
[<ffffffff81047f90>] free_task+0x31/0x5c
[<ffffffff8104807d>] __put_task_struct+0xc2/0xdb
[<ffffffff8104dfc7>] put_task_struct+0x1e/0x22
[<ffffffff8104e144>] delayed_put_task_struct+0x7a/0x98
[<ffffffff810cf0e5>] __rcu_process_callbacks+0x269/0x3df
[<ffffffff810cf28c>] rcu_process_callbacks+0x31/0x5b
[<ffffffff8105266d>] __do_softirq+0x122/0x277

This usage pattern could not be triggered before kmem came into play.
With the introduction of kmem stack handling, it is possible that we
call the last mem_cgroup_put() from the task destructor, which is run in
an rcu callback. Such callbacks are run with softirqs disabled, leading
to the offensive usage pattern.

In general, we have little, if any, means to guarantee in which context
the last memcg_put will happen. The best we can do is test it and try to
make sure no invalid context releases are happening. But as we add more
code to memcg, the possible interactions grow in number and expose more
ways to get context conflicts. One thing to keep in mind, is that part
of the freeing process is already deferred to a worker, such as vfree(),
that can only be called from process context.

For the moment, the only two functions we really need moved away are:

* free_css_id(), and
* mem_cgroup_remove_from_trees().

But because the latter accesses per-zone info,
free_mem_cgroup_per_zone_info() needs to be moved as well. With that, we
are left with the per_cpu stats only. Better move it all.

Signed-off-by: Glauber Costa <[email protected]>
Tested-by: Greg Thelen <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Acked-by: KAMEZAWA Hiroyuki <[email protected]>
CC: Johannes Weiner <[email protected]>
CC: Tejun Heo <[email protected]>
---
mm/memcontrol.c | 66 +++++++++++++++++++++++++++++----------------------------
1 file changed, 34 insertions(+), 32 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 61a382a..5e8962c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5212,16 +5212,29 @@ out_free:
}

/*
- * Helpers for freeing a kmalloc()ed/vzalloc()ed mem_cgroup by RCU,
- * but in process context. The work_freeing structure is overlaid
- * on the rcu_freeing structure, which itself is overlaid on memsw.
+ * At destroying mem_cgroup, references from swap_cgroup can remain.
+ * (scanning all at force_empty is too costly...)
+ *
+ * Instead of clearing all references at force_empty, we remember
+ * the number of reference from swap_cgroup and free mem_cgroup when
+ * it goes down to 0.
+ *
+ * Removal of cgroup itself succeeds regardless of refs from swap.
*/
-static void free_work(struct work_struct *work)
+
+static void __mem_cgroup_free(struct mem_cgroup *memcg)
{
- struct mem_cgroup *memcg;
+ int node;
int size = sizeof(struct mem_cgroup);

- memcg = container_of(work, struct mem_cgroup, work_freeing);
+ mem_cgroup_remove_from_trees(memcg);
+ free_css_id(&mem_cgroup_subsys, &memcg->css);
+
+ for_each_node(node)
+ free_mem_cgroup_per_zone_info(memcg, node);
+
+ free_percpu(memcg->stat);
+
/*
* We need to make sure that (at least for now), the jump label
* destruction code runs outside of the cgroup lock. This is because
@@ -5240,38 +5253,27 @@ static void free_work(struct work_struct *work)
vfree(memcg);
}

-static void free_rcu(struct rcu_head *rcu_head)
-{
- struct mem_cgroup *memcg;
-
- memcg = container_of(rcu_head, struct mem_cgroup, rcu_freeing);
- INIT_WORK(&memcg->work_freeing, free_work);
- schedule_work(&memcg->work_freeing);
-}

/*
- * At destroying mem_cgroup, references from swap_cgroup can remain.
- * (scanning all at force_empty is too costly...)
- *
- * Instead of clearing all references at force_empty, we remember
- * the number of reference from swap_cgroup and free mem_cgroup when
- * it goes down to 0.
- *
- * Removal of cgroup itself succeeds regardless of refs from swap.
+ * Helpers for freeing a kmalloc()ed/vzalloc()ed mem_cgroup by RCU,
+ * but in process context. The work_freeing structure is overlaid
+ * on the rcu_freeing structure, which itself is overlaid on memsw.
*/
-
-static void __mem_cgroup_free(struct mem_cgroup *memcg)
+static void free_work(struct work_struct *work)
{
- int node;
+ struct mem_cgroup *memcg;

- mem_cgroup_remove_from_trees(memcg);
- free_css_id(&mem_cgroup_subsys, &memcg->css);
+ memcg = container_of(work, struct mem_cgroup, work_freeing);
+ __mem_cgroup_free(memcg);
+}

- for_each_node(node)
- free_mem_cgroup_per_zone_info(memcg, node);
+static void free_rcu(struct rcu_head *rcu_head)
+{
+ struct mem_cgroup *memcg;

- free_percpu(memcg->stat);
- call_rcu(&memcg->rcu_freeing, free_rcu);
+ memcg = container_of(rcu_head, struct mem_cgroup, rcu_freeing);
+ INIT_WORK(&memcg->work_freeing, free_work);
+ schedule_work(&memcg->work_freeing);
}

static void mem_cgroup_get(struct mem_cgroup *memcg)
@@ -5283,7 +5285,7 @@ static void __mem_cgroup_put(struct mem_cgroup *memcg, int count)
{
if (atomic_sub_and_test(count, &memcg->refcnt)) {
struct mem_cgroup *parent = parent_mem_cgroup(memcg);
- __mem_cgroup_free(memcg);
+ call_rcu(&memcg->rcu_freeing, free_rcu);
if (parent)
mem_cgroup_put(parent);
}
--
1.7.11.7

2012-11-01 12:12:40

by Glauber Costa

[permalink] [raw]
Subject: [PATCH v6 16/29] slab: annotate on-slab caches nodelist locks

We currently provide lockdep annotation for kmalloc caches, and also
for caches that have SLAB_DEBUG_OBJECTS enabled. The reason for this is
that we can quite frequently nest on the l3->list_lock, which is not
trivial to avoid.

My proposal with this patch is to extend this to caches whose slab
management object lives within the slab as well ("on_slab"). The need
for this arose in the context of testing the kmemcg-slab patches. With
that patchset, we can have per-memcg kmalloc caches, so the same path
that leads to nesting between kmalloc caches can then lead to in-memcg
nesting. Because those caches are not annotated, lockdep will trigger a
warning.
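
For context (not part of this patch): lockdep groups every lock
initialized at the same site into one class, so legitimate nesting of
two different caches' list_lock instances looks like recursive locking
unless the nested locks are moved into a class of their own. A minimal,
generic illustration of that kind of annotation; the key and function
names here are made up:

static struct lock_class_key nested_list_lock_key;  /* hypothetical */

static void annotate_nested_lock(spinlock_t *lock)
{
    /*
     * Put this lock instance into its own lockdep class, so taking it
     * while another list_lock is held is not reported as deadlock-prone
     * recursion.
     */
    lockdep_set_class(lock, &nested_list_lock_key);
}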

Signed-off-by: Glauber Costa <[email protected]>
CC: Christoph Lameter <[email protected]>
CC: Pekka Enberg <[email protected]>
CC: David Rientjes <[email protected]>
CC: JoonSoo Kim <[email protected]>
---
mm/slab.c | 34 +++++++++++++++++++++++++++++++++-
1 file changed, 33 insertions(+), 1 deletion(-)

diff --git a/mm/slab.c b/mm/slab.c
index 98b3460..dcc05f5 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -653,6 +653,26 @@ static void init_node_lock_keys(int q)
}
}

+static void on_slab_lock_classes_node(struct kmem_cache *cachep, int q)
+{
+ struct kmem_list3 *l3;
+ l3 = cachep->nodelists[q];
+ if (!l3)
+ return;
+
+ slab_set_lock_classes(cachep, &on_slab_l3_key,
+ &on_slab_alc_key, q);
+}
+
+static inline void on_slab_lock_classes(struct kmem_cache *cachep)
+{
+ int node;
+
+ VM_BUG_ON(OFF_SLAB(cachep));
+ for_each_node(node)
+ on_slab_lock_classes_node(cachep, node);
+}
+
static inline void init_lock_keys(void)
{
int node;
@@ -669,6 +689,14 @@ static inline void init_lock_keys(void)
{
}

+static inline void on_slab_lock_classes(struct kmem_cache *cachep)
+{
+}
+
+static inline void on_slab_lock_classes_node(struct kmem_cache *cachep, int node)
+{
+}
+
static void slab_set_debugobj_lock_classes_node(struct kmem_cache *cachep, int node)
{
}
@@ -1396,6 +1424,9 @@ static int __cpuinit cpuup_prepare(long cpu)
free_alien_cache(alien);
if (cachep->flags & SLAB_DEBUG_OBJECTS)
slab_set_debugobj_lock_classes_node(cachep, node);
+ else if (!OFF_SLAB(cachep) &&
+ !(cachep->flags & SLAB_DESTROY_BY_RCU))
+ on_slab_lock_classes_node(cachep, node);
}
init_node_lock_keys(node);

@@ -2550,7 +2581,8 @@ __kmem_cache_create (struct kmem_cache *cachep, unsigned long flags)
WARN_ON_ONCE(flags & SLAB_DESTROY_BY_RCU);

slab_set_debugobj_lock_classes(cachep);
- }
+ } else if (!OFF_SLAB(cachep) && !(flags & SLAB_DESTROY_BY_RCU))
+ on_slab_lock_classes(cachep);

return 0;
}
--
1.7.11.7

2012-11-01 12:13:05

by Glauber Costa

[permalink] [raw]
Subject: [PATCH v6 11/29] memcg: allow a memcg with kmem charges to be destructed.

Because the ultimate goal of the kmem tracking in memcg is to track slab
pages as well, we can't guarantee that we'll always be able to point a
page to a particular process, and migrate the charges along with it -
since in the common case, a page will contain data belonging to multiple
processes.

Because of that, when we destroy a memcg, all we do to make sure the
destruction will succeed is discount the kmem charges from the user
charges when we try to empty the cgroup.
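
In code terms, the force_empty loop stops looking at the raw user
res_counter usage and instead treats the group as empty once only kmem
charges remain; a condensed sketch of the hunk in this patch (the real
loop also rechecks the move-account result):

u64 usage;

do {
    /* ... move/uncharge user pages ... */
    usage = res_counter_read_u64(&memcg->res, RES_USAGE) -
            res_counter_read_u64(&memcg->kmem, RES_USAGE);
    /* whatever is left in ->res now is kmem: it cannot be migrated and
     * will be uncharged later, so the rmdir may proceed */
} while (usage > 0);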

Signed-off-by: Glauber Costa <[email protected]>
Acked-by: Kamezawa Hiroyuki <[email protected]>
Reviewed-by: Michal Hocko <[email protected]>
CC: Christoph Lameter <[email protected]>
CC: Pekka Enberg <[email protected]>
CC: Johannes Weiner <[email protected]>
CC: Suleiman Souhlal <[email protected]>
CC: Tejun Heo <[email protected]>
---
mm/memcontrol.c | 17 ++++++++++++++++-
1 file changed, 16 insertions(+), 1 deletion(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 403f5a7..61a382a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -545,6 +545,11 @@ static void disarm_kmem_keys(struct mem_cgroup *memcg)
{
if (memcg_kmem_is_active(memcg))
static_key_slow_dec(&memcg_kmem_enabled_key);
+ /*
+ * This check can't live in kmem destruction function,
+ * since the charges will outlive the cgroup
+ */
+ WARN_ON(res_counter_read_u64(&memcg->kmem, RES_USAGE) != 0);
}
#else
static void disarm_kmem_keys(struct mem_cgroup *memcg)
@@ -3995,6 +4000,7 @@ static int mem_cgroup_force_empty(struct mem_cgroup *memcg, bool free_all)
int node, zid, shrink;
int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
struct cgroup *cgrp = memcg->css.cgroup;
+ u64 usage;

css_get(&memcg->css);

@@ -4028,8 +4034,17 @@ move_account:
mem_cgroup_end_move(memcg);
memcg_oom_recover(memcg);
cond_resched();
+ /*
+ * Kernel memory may not necessarily be trackable to a specific
+ * process. So they are not migrated, and therefore we can't
+ * expect their value to drop to 0 here.
+ *
+ * having res filled up with kmem only is enough
+ */
+ usage = res_counter_read_u64(&memcg->res, RES_USAGE) -
+ res_counter_read_u64(&memcg->kmem, RES_USAGE);
/* "ret" should also be checked to ensure all lists are empty. */
- } while (res_counter_read_u64(&memcg->res, RES_USAGE) > 0 || ret);
+ } while (usage > 0 || ret);
out:
css_put(&memcg->css);
return ret;
--
1.7.11.7

2012-11-01 12:13:04

by Glauber Costa

[permalink] [raw]
Subject: [PATCH v6 09/29] memcg: kmem accounting lifecycle management

Because kmem charges can outlive the cgroup, we need to make sure that
we won't free the memcg structure while charges are still in flight.
For reviewing simplicity, the charge functions will issue
mem_cgroup_get() at every charge, and mem_cgroup_put() at every
uncharge.

This can get expensive, however, and we can do better. mem_cgroup_get()
only really needs to be issued once: when the first limit is set. In the
same spirit, we only need to issue mem_cgroup_put() when the last charge
is gone.

We'll need an extra bit in kmem_account_flags for that: KMEM_ACCOUNTED_DEAD.
It will be set when the cgroup dies, if there are charges left in the
group. If there aren't, we can proceed right away.

Our uncharge function will have to test that bit every time the charges
drop to 0. Because that is not the likely outcome of
res_counter_uncharge, this should not impose a big hit on us: it is
certainly much better than a reference count decrease at every
operation.
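
Condensed from the diff, the lifecycle then looks roughly like this
(error paths and the companion res/memsw uncharges omitted):

/* limit-set time: one reference covers all future kmem charges */
mem_cgroup_get(memcg);

/* cgroup destruction */
memcg_kmem_mark_dead(memcg);
if (res_counter_read_u64(&memcg->kmem, RES_USAGE) == 0 &&
    memcg_kmem_test_and_clear_dead(memcg))
    mem_cgroup_put(memcg);      /* no charges left, drop it right away */

/* uncharge path, only when the kmem counter actually hits zero */
if (!res_counter_uncharge(&memcg->kmem, size) &&
    memcg_kmem_test_and_clear_dead(memcg))
    mem_cgroup_put(memcg);      /* last charge of a dead memcg */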

[ v3: merged all lifecycle related patches in one ]
[ v5: changed memcg_kmem_dead's name ]

Signed-off-by: Glauber Costa <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Acked-by: Kamezawa Hiroyuki <[email protected]>
CC: Christoph Lameter <[email protected]>
CC: Pekka Enberg <[email protected]>
CC: Johannes Weiner <[email protected]>
CC: Suleiman Souhlal <[email protected]>
CC: Tejun Heo <[email protected]>
---
mm/memcontrol.c | 57 ++++++++++++++++++++++++++++++++++++++++++++++++++-------
1 file changed, 50 insertions(+), 7 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 1eefb64..91a021a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -344,6 +344,7 @@ struct mem_cgroup {
/* internal only representation about the status of kmem accounting. */
enum {
KMEM_ACCOUNTED_ACTIVE = 0, /* accounted by this cgroup itself */
+ KMEM_ACCOUNTED_DEAD, /* dead memcg with pending kmem charges */
};

#define KMEM_ACCOUNTED_MASK (1 << KMEM_ACCOUNTED_ACTIVE)
@@ -353,6 +354,23 @@ static inline void memcg_kmem_set_active(struct mem_cgroup *memcg)
{
set_bit(KMEM_ACCOUNTED_ACTIVE, &memcg->kmem_account_flags);
}
+
+static bool memcg_kmem_is_active(struct mem_cgroup *memcg)
+{
+ return test_bit(KMEM_ACCOUNTED_ACTIVE, &memcg->kmem_account_flags);
+}
+
+static void memcg_kmem_mark_dead(struct mem_cgroup *memcg)
+{
+ if (test_bit(KMEM_ACCOUNTED_ACTIVE, &memcg->kmem_account_flags))
+ set_bit(KMEM_ACCOUNTED_DEAD, &memcg->kmem_account_flags);
+}
+
+static bool memcg_kmem_test_and_clear_dead(struct mem_cgroup *memcg)
+{
+ return test_and_clear_bit(KMEM_ACCOUNTED_DEAD,
+ &memcg->kmem_account_flags);
+}
#endif

/* Stuffs for move charges at task migration. */
@@ -2691,10 +2709,16 @@ static int memcg_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp, u64 size)

static void memcg_uncharge_kmem(struct mem_cgroup *memcg, u64 size)
{
- res_counter_uncharge(&memcg->kmem, size);
res_counter_uncharge(&memcg->res, size);
if (do_swap_account)
res_counter_uncharge(&memcg->memsw, size);
+
+ /* Not down to 0 */
+ if (res_counter_uncharge(&memcg->kmem, size))
+ return;
+
+ if (memcg_kmem_test_and_clear_dead(memcg))
+ mem_cgroup_put(memcg);
}

/*
@@ -2733,13 +2757,9 @@ __memcg_kmem_newpage_charge(gfp_t gfp, struct mem_cgroup **_memcg, int order)
return true;
}

- mem_cgroup_get(memcg);
-
ret = memcg_charge_kmem(memcg, gfp, PAGE_SIZE << order);
if (!ret)
*_memcg = memcg;
- else
- mem_cgroup_put(memcg);

css_put(&memcg->css);
return (ret == 0);
@@ -2755,7 +2775,6 @@ void __memcg_kmem_commit_charge(struct page *page, struct mem_cgroup *memcg,
/* The page allocation failed. Revert */
if (!page) {
memcg_uncharge_kmem(memcg, PAGE_SIZE << order);
- mem_cgroup_put(memcg);
return;
}

@@ -2796,7 +2815,6 @@ void __memcg_kmem_uncharge_pages(struct page *page, int order)

VM_BUG_ON(mem_cgroup_is_root(memcg));
memcg_uncharge_kmem(memcg, PAGE_SIZE << order);
- mem_cgroup_put(memcg);
}
#endif /* CONFIG_MEMCG_KMEM */

@@ -4180,6 +4198,13 @@ static int memcg_update_kmem_limit(struct cgroup *cont, u64 val)
VM_BUG_ON(ret);

memcg_kmem_set_active(memcg);
+ /*
+ * kmem charges can outlive the cgroup. In the case of slab
+ * pages, for instance, a page contain objects from various
+ * processes, so it is unfeasible to migrate them away. We
+ * need to reference count the memcg because of that.
+ */
+ mem_cgroup_get(memcg);
} else
ret = res_counter_set_limit(&memcg->kmem, val);
out:
@@ -4195,6 +4220,10 @@ static void memcg_propagate_kmem(struct mem_cgroup *memcg)
if (!parent)
return;
memcg->kmem_account_flags = parent->kmem_account_flags;
+#ifdef CONFIG_MEMCG_KMEM
+ if (memcg_kmem_is_active(memcg))
+ mem_cgroup_get(memcg);
+#endif
}

/*
@@ -4883,6 +4912,20 @@ static int memcg_init_kmem(struct mem_cgroup *memcg, struct cgroup_subsys *ss)
static void kmem_cgroup_destroy(struct mem_cgroup *memcg)
{
mem_cgroup_sockets_destroy(memcg);
+
+ memcg_kmem_mark_dead(memcg);
+
+ if (res_counter_read_u64(&memcg->kmem, RES_USAGE) != 0)
+ return;
+
+ /*
+ * Charges already down to 0, undo mem_cgroup_get() done in the charge
+ * path here, being careful not to race with memcg_uncharge_kmem: it is
+ * possible that the charges went down to 0 between mark_dead and the
+ * res_counter read, so in that case, we don't need the put
+ */
+ if (memcg_kmem_test_and_clear_dead(memcg))
+ mem_cgroup_put(memcg);
}
#else
static int memcg_init_kmem(struct mem_cgroup *memcg, struct cgroup_subsys *ss)
--
1.7.11.7

2012-11-01 12:13:02

by Glauber Costa

[permalink] [raw]
Subject: [PATCH v6 14/29] Add documentation about the kmem controller

Signed-off-by: Glauber Costa <[email protected]>
Acked-by: Kamezawa Hiroyuki <[email protected]>
Acked-by: Michal Hocko <[email protected]>
CC: Frederic Weisbecker <[email protected]>
CC: Christoph Lameter <[email protected]>
CC: Pekka Enberg <[email protected]>
CC: Johannes Weiner <[email protected]>
CC: Suleiman Souhlal <[email protected]>
CC: Tejun Heo <[email protected]>
---
Documentation/cgroups/memory.txt | 59 +++++++++++++++++++++++++++++++++++++++-
1 file changed, 58 insertions(+), 1 deletion(-)

diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index c07f7b4..206853b 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -71,6 +71,11 @@ Brief summary of control files.
memory.oom_control # set/show oom controls.
memory.numa_stat # show the number of memory usage per numa node

+ memory.kmem.limit_in_bytes # set/show hard limit for kernel memory
+ memory.kmem.usage_in_bytes # show current kernel memory allocation
+ memory.kmem.failcnt # show the number of kernel memory usage hits limits
+ memory.kmem.max_usage_in_bytes # show max kernel memory usage recorded
+
memory.kmem.tcp.limit_in_bytes # set/show hard limit for tcp buf memory
memory.kmem.tcp.usage_in_bytes # show current tcp buf memory allocation
memory.kmem.tcp.failcnt # show the number of tcp buf memory usage hits limits
@@ -268,20 +273,66 @@ the amount of kernel memory used by the system. Kernel memory is fundamentally
different than user memory, since it can't be swapped out, which makes it
possible to DoS the system by consuming too much of this precious resource.

+Kernel memory won't be accounted at all until limit on a group is set. This
+allows for existing setups to continue working without disruption. The limit
+cannot be set if the cgroup have children, or if there are already tasks in the
+cgroup. Attempting to set the limit under those conditions will return -EBUSY.
+When use_hierarchy == 1 and a group is accounted, its children will
+automatically be accounted regardless of their limit value.
+
+After a group is first limited, it will be kept being accounted until it
+is removed. The memory limitation itself, can of course be removed by writing
+-1 to memory.kmem.limit_in_bytes. In this case, kmem will be accounted, but not
+limited.
+
Kernel memory limits are not imposed for the root cgroup. Usage for the root
-cgroup may or may not be accounted.
+cgroup may or may not be accounted. The memory used is accumulated into
+memory.kmem.usage_in_bytes, or in a separate counter when it makes sense.
+(currently only for tcp).
+The main "kmem" counter is fed into the main counter, so kmem charges will
+also be visible from the user counter.

Currently no soft limit is implemented for kernel memory. It is future work
to trigger slab reclaim when those limits are reached.

2.7.1 Current Kernel Memory resources accounted

+* stack pages: every process consumes some stack pages. By accounting into
+kernel memory, we prevent new processes from being created when the kernel
+memory usage is too high.
+
* sockets memory pressure: some sockets protocols have memory pressure
thresholds. The Memory Controller allows them to be controlled individually
per cgroup, instead of globally.

* tcp memory pressure: sockets memory pressure for the tcp protocol.

+2.7.3 Common use cases
+
+Because the "kmem" counter is fed to the main user counter, kernel memory can
+never be limited completely independently of user memory. Say "U" is the user
+limit, and "K" the kernel limit. There are three possible ways limits can be
+set:
+
+ U != 0, K = unlimited:
+ This is the standard memcg limitation mechanism already present before kmem
+ accounting. Kernel memory is completely ignored.
+
+ U != 0, K < U:
+ Kernel memory is a subset of the user memory. This setup is useful in
+ deployments where the total amount of memory per-cgroup is overcommited.
+ Overcommiting kernel memory limits is definitely not recommended, since the
+ box can still run out of non-reclaimable memory.
+ In this case, the admin could set up K so that the sum of all groups is
+ never greater than the total memory, and freely set U at the cost of his
+ QoS.
+
+ U != 0, K >= U:
+ Since kmem charges will also be fed to the user counter and reclaim will be
+ triggered for the cgroup for both kinds of memory. This setup gives the
+ admin a unified view of memory, and it is also useful for people who just
+ want to track kernel memory usage.
+
3. User Interface

0. Configuration
@@ -290,6 +341,7 @@ a. Enable CONFIG_CGROUPS
b. Enable CONFIG_RESOURCE_COUNTERS
c. Enable CONFIG_MEMCG
d. Enable CONFIG_MEMCG_SWAP (to use swap extension)
+d. Enable CONFIG_MEMCG_KMEM (to use kmem extension)

1. Prepare the cgroups (see cgroups.txt, Why are cgroups needed?)
# mount -t tmpfs none /sys/fs/cgroup
@@ -406,6 +458,11 @@ About use_hierarchy, see Section 6.
Because rmdir() moves all pages to parent, some out-of-use page caches can be
moved to the parent. If you want to avoid that, force_empty will be useful.

+ Also, note that when memory.kmem.limit_in_bytes is set the charges due to
+ kernel pages will still be seen. This is not considered a failure and the
+ write will still return success. In this case, it is expected that
+ memory.kmem.usage_in_bytes == memory.usage_in_bytes.
+
About use_hierarchy, see Section 6.

5.2 stat file
--
1.7.11.7

2012-11-01 12:13:00

by Glauber Costa

[permalink] [raw]
Subject: [PATCH v6 15/29] slab/slub: struct memcg_params

For the kmem slab controller, we need to record some extra
information in the kmem_cache structure.

Signed-off-by: Glauber Costa <[email protected]>
Signed-off-by: Suleiman Souhlal <[email protected]>
CC: Christoph Lameter <[email protected]>
CC: Pekka Enberg <[email protected]>
CC: Michal Hocko <[email protected]>
CC: Kamezawa Hiroyuki <[email protected]>
CC: Johannes Weiner <[email protected]>
CC: Tejun Heo <[email protected]>
---
include/linux/slab.h | 24 ++++++++++++++++++++++++
include/linux/slab_def.h | 3 +++
include/linux/slub_def.h | 3 +++
mm/slab.h | 13 +++++++++++++
4 files changed, 43 insertions(+)

diff --git a/include/linux/slab.h b/include/linux/slab.h
index 0dd2dfa..8860e08 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -178,6 +178,30 @@ unsigned int kmem_cache_size(struct kmem_cache *);
#endif

/*
+ * This is the main placeholder for memcg-related information in kmem caches.
+ * struct kmem_cache will hold a pointer to it, so the memory cost while
+ * disabled is 1 pointer. The runtime cost while enabled, gets bigger than it
+ * would otherwise be if that would be bundled in kmem_cache: we'll need an
+ * extra pointer chase. But the trade off clearly lays in favor of not
+ * penalizing non-users.
+ *
+ * Both the root cache and the child caches will have it. For the root cache,
+ * this will hold a dynamically allocated array large enough to hold
+ * information about the currently limited memcgs in the system.
+ *
+ * Child caches will hold extra metadata needed for its operation. Fields are:
+ *
+ * @memcg: pointer to the memcg this cache belongs to
+ */
+struct memcg_cache_params {
+ bool is_root_cache;
+ union {
+ struct kmem_cache *memcg_caches[0];
+ struct mem_cgroup *memcg;
+ };
+};
+
+/*
* Common kmalloc functions provided by all allocators
*/
void * __must_check __krealloc(const void *, size_t, gfp_t);
diff --git a/include/linux/slab_def.h b/include/linux/slab_def.h
index 36d7031..665afa4 100644
--- a/include/linux/slab_def.h
+++ b/include/linux/slab_def.h
@@ -81,6 +81,9 @@ struct kmem_cache {
*/
int obj_offset;
#endif /* CONFIG_DEBUG_SLAB */
+#ifdef CONFIG_MEMCG_KMEM
+ struct memcg_cache_params *memcg_params;
+#endif

/* 6) per-cpu/per-node data, touched during every alloc/free */
/*
diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
index df448ad..961e72e 100644
--- a/include/linux/slub_def.h
+++ b/include/linux/slub_def.h
@@ -101,6 +101,9 @@ struct kmem_cache {
#ifdef CONFIG_SYSFS
struct kobject kobj; /* For sysfs */
#endif
+#ifdef CONFIG_MEMCG_KMEM
+ struct memcg_cache_params *memcg_params;
+#endif

#ifdef CONFIG_NUMA
/*
diff --git a/mm/slab.h b/mm/slab.h
index 66a62d3..5ee1851 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -92,4 +92,17 @@ void get_slabinfo(struct kmem_cache *s, struct slabinfo *sinfo);
void slabinfo_show_stats(struct seq_file *m, struct kmem_cache *s);
ssize_t slabinfo_write(struct file *file, const char __user *buffer,
size_t count, loff_t *ppos);
+
+#ifdef CONFIG_MEMCG_KMEM
+static inline bool is_root_cache(struct kmem_cache *s)
+{
+ return !s->memcg_params || s->memcg_params->is_root_cache;
+}
+#else
+static inline bool is_root_cache(struct kmem_cache *s)
+{
+ return true;
+}
+
+#endif
#endif
--
1.7.11.7

2012-11-01 12:12:57

by Glauber Costa

[permalink] [raw]
Subject: [PATCH v6 10/29] memcg: use static branches when code not in use

We can use static branches to patch this code in or out when it is not in use.

Because the _ACTIVE bit on kmem_accounted is only set after the
increment is done, we guarantee that the root memcg will always be
selected for kmem charges until all call sites are patched (see
memcg_kmem_enabled). This guarantees that no mischarges are applied.

The static branch decrement happens when the last reference count from
the kmem accounting in memcg dies. This will only happen when the
charges drop down to 0.

When that happens, we need to disable the static branch only on those
memcgs that enabled it. To achieve this, we would be forced to
complicate the code by keeping track of which memcgs actually enabled
limits, and which ones inherited it from their parents.

It is a lot simpler to just do static_key_slow_inc() on every child
that is accounted.
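
The ordering that makes this safe is worth spelling out; a condensed
sketch of what memcg_update_kmem_limit() in the diff below ends up doing
(cgroup_lock and error handling trimmed):

mutex_lock(&set_limit_mutex);
ret = res_counter_set_limit(&memcg->kmem, val);
memcg_kmem_set_activated(memcg);    /* only the first writer gets here */
mutex_unlock(&set_limit_mutex);

/* must run outside cgroup_lock; see the comments in the diff */
static_key_slow_inc(&memcg_kmem_enabled_key);

/* only now do charges start; all call sites are already patched */
memcg_kmem_set_active(memcg);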

[ v4: adapted this patch to the changes in kmem_accounted ]

Signed-off-by: Glauber Costa <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Acked-by: Kamezawa Hiroyuki <[email protected]>
CC: Christoph Lameter <[email protected]>
CC: Pekka Enberg <[email protected]>
CC: Johannes Weiner <[email protected]>
CC: Suleiman Souhlal <[email protected]>
CC: Tejun Heo <[email protected]>
---
include/linux/memcontrol.h | 4 ++-
mm/memcontrol.c | 79 +++++++++++++++++++++++++++++++++++++++++++---
2 files changed, 78 insertions(+), 5 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index e6ca1cf..2a2ae05 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -22,6 +22,7 @@
#include <linux/cgroup.h>
#include <linux/vm_event_item.h>
#include <linux/hardirq.h>
+#include <linux/jump_label.h>

struct mem_cgroup;
struct page_cgroup;
@@ -410,9 +411,10 @@ static inline void sock_release_memcg(struct sock *sk)
#endif /* CONFIG_INET && CONFIG_MEMCG_KMEM */

#ifdef CONFIG_MEMCG_KMEM
+extern struct static_key memcg_kmem_enabled_key;
static inline bool memcg_kmem_enabled(void)
{
- return true;
+ return static_key_false(&memcg_kmem_enabled_key);
}

/*
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 91a021a..403f5a7 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -344,10 +344,13 @@ struct mem_cgroup {
/* internal only representation about the status of kmem accounting. */
enum {
KMEM_ACCOUNTED_ACTIVE = 0, /* accounted by this cgroup itself */
+ KMEM_ACCOUNTED_ACTIVATED, /* static key enabled. */
KMEM_ACCOUNTED_DEAD, /* dead memcg with pending kmem charges */
};

-#define KMEM_ACCOUNTED_MASK (1 << KMEM_ACCOUNTED_ACTIVE)
+/* We account when limit is on, but only after call sites are patched */
+#define KMEM_ACCOUNTED_MASK \
+ ((1 << KMEM_ACCOUNTED_ACTIVE) | (1 << KMEM_ACCOUNTED_ACTIVATED))

#ifdef CONFIG_MEMCG_KMEM
static inline void memcg_kmem_set_active(struct mem_cgroup *memcg)
@@ -360,6 +363,11 @@ static bool memcg_kmem_is_active(struct mem_cgroup *memcg)
return test_bit(KMEM_ACCOUNTED_ACTIVE, &memcg->kmem_account_flags);
}

+static void memcg_kmem_set_activated(struct mem_cgroup *memcg)
+{
+ set_bit(KMEM_ACCOUNTED_ACTIVATED, &memcg->kmem_account_flags);
+}
+
static void memcg_kmem_mark_dead(struct mem_cgroup *memcg)
{
if (test_bit(KMEM_ACCOUNTED_ACTIVE, &memcg->kmem_account_flags))
@@ -530,6 +538,26 @@ static void disarm_sock_keys(struct mem_cgroup *memcg)
}
#endif

+#ifdef CONFIG_MEMCG_KMEM
+struct static_key memcg_kmem_enabled_key;
+
+static void disarm_kmem_keys(struct mem_cgroup *memcg)
+{
+ if (memcg_kmem_is_active(memcg))
+ static_key_slow_dec(&memcg_kmem_enabled_key);
+}
+#else
+static void disarm_kmem_keys(struct mem_cgroup *memcg)
+{
+}
+#endif /* CONFIG_MEMCG_KMEM */
+
+static void disarm_static_keys(struct mem_cgroup *memcg)
+{
+ disarm_sock_keys(memcg);
+ disarm_kmem_keys(memcg);
+}
+
static void drain_all_stock_async(struct mem_cgroup *memcg);

static struct mem_cgroup_per_zone *
@@ -4167,6 +4195,8 @@ static int memcg_update_kmem_limit(struct cgroup *cont, u64 val)
{
int ret = -EINVAL;
#ifdef CONFIG_MEMCG_KMEM
+ bool must_inc_static_branch = false;
+
struct mem_cgroup *memcg = mem_cgroup_from_cont(cont);
/*
* For simplicity, we won't allow this to be disabled. It also can't
@@ -4197,7 +4227,15 @@ static int memcg_update_kmem_limit(struct cgroup *cont, u64 val)
ret = res_counter_set_limit(&memcg->kmem, val);
VM_BUG_ON(ret);

- memcg_kmem_set_active(memcg);
+ /*
+ * After this point, kmem_accounted (that we test atomically in
+ * the beginning of this conditional), is no longer 0. This
+ * guarantees only one process will set the following boolean
+ * to true. We don't need test_and_set because we're protected
+ * by the set_limit_mutex anyway.
+ */
+ memcg_kmem_set_activated(memcg);
+ must_inc_static_branch = true;
/*
* kmem charges can outlive the cgroup. In the case of slab
* pages, for instance, a page contain objects from various
@@ -4210,6 +4248,27 @@ static int memcg_update_kmem_limit(struct cgroup *cont, u64 val)
out:
mutex_unlock(&set_limit_mutex);
cgroup_unlock();
+
+ /*
+ * We are by now familiar with the fact that we can't inc the static
+ * branch inside cgroup_lock. See disarm functions for details. A
+ * worker here is overkill, but also wrong: After the limit is set, we
+ * must start accounting right away. Since this operation can't fail,
+ * we can safely defer it to here - no rollback will be needed.
+ *
+ * The boolean used to control this is also safe, because
+ * KMEM_ACCOUNTED_ACTIVATED guarantees that only one process will be
+ * able to set it to true;
+ */
+ if (must_inc_static_branch) {
+ static_key_slow_inc(&memcg_kmem_enabled_key);
+ /*
+ * setting the active bit after the inc will guarantee no one
+ * starts accounting before all call sites are patched
+ */
+ memcg_kmem_set_active(memcg);
+ }
+
#endif
return ret;
}
@@ -4221,8 +4280,20 @@ static void memcg_propagate_kmem(struct mem_cgroup *memcg)
return;
memcg->kmem_account_flags = parent->kmem_account_flags;
#ifdef CONFIG_MEMCG_KMEM
- if (memcg_kmem_is_active(memcg))
+ /*
+ * When that happen, we need to disable the static branch only on those
+ * memcgs that enabled it. To achieve this, we would be forced to
+ * complicate the code by keeping track of which memcgs were the ones
+ * that actually enabled limits, and which ones got it from its
+ * parents.
+ *
+ * It is a lot simpler just to do static_key_slow_inc() on every child
+ * that is accounted.
+ */
+ if (memcg_kmem_is_active(memcg)) {
mem_cgroup_get(memcg);
+ static_key_slow_inc(&memcg_kmem_enabled_key);
+ }
#endif
}

@@ -5147,7 +5218,7 @@ static void free_work(struct work_struct *work)
* to move this code around, and make sure it is outside
* the cgroup_lock.
*/
- disarm_sock_keys(memcg);
+ disarm_static_keys(memcg);
if (size < PAGE_SIZE)
kfree(memcg);
else
--
1.7.11.7

2012-11-01 12:14:18

by Glauber Costa

[permalink] [raw]
Subject: [PATCH v6 08/29] res_counter: return amount of charges after res_counter_uncharge

It is useful to know how many charges are still left after a call to
res_counter_uncharge. While it is possible to issue a res_counter_read
after uncharge, this can be racy.

If we need, for instance, to take some action when the counters drop
down to 0, only one of the callers should see it. These are the same
semantics as the kernel's atomic variables.

Since the current return value is void, we don't need to worry about
anything breaking due to this change: nobody relied on it, and only
users added from now on will check this value.
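
A minimal usage sketch of the new return value; kmem_last_uncharge() is
a hypothetical stand-in for whatever the caller wants to run exactly
once:

u64 remaining;

remaining = res_counter_uncharge(&memcg->kmem, size);
if (remaining == 0)
    kmem_last_uncharge(memcg);  /* hypothetical: only one caller sees 0 */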

Signed-off-by: Glauber Costa <[email protected]>
Reviewed-by: Michal Hocko <[email protected]>
Acked-by: Kamezawa Hiroyuki <[email protected]>
Acked-by: David Rientjes <[email protected]>
CC: Johannes Weiner <[email protected]>
CC: Suleiman Souhlal <[email protected]>
CC: Tejun Heo <[email protected]>
---
Documentation/cgroups/resource_counter.txt | 7 ++++---
include/linux/res_counter.h | 12 +++++++-----
kernel/res_counter.c | 20 +++++++++++++-------
3 files changed, 24 insertions(+), 15 deletions(-)

diff --git a/Documentation/cgroups/resource_counter.txt b/Documentation/cgroups/resource_counter.txt
index 0c4a344..c4d99ed 100644
--- a/Documentation/cgroups/resource_counter.txt
+++ b/Documentation/cgroups/resource_counter.txt
@@ -83,16 +83,17 @@ to work with it.
res_counter->lock internally (it must be called with res_counter->lock
held). The force parameter indicates whether we can bypass the limit.

- e. void res_counter_uncharge[_locked]
+ e. u64 res_counter_uncharge[_locked]
(struct res_counter *rc, unsigned long val)

When a resource is released (freed) it should be de-accounted
from the resource counter it was accounted to. This is called
- "uncharging".
+ "uncharging". The return value of this function indicate the amount
+ of charges still present in the counter.

The _locked routines imply that the res_counter->lock is taken.

- f. void res_counter_uncharge_until
+ f. u64 res_counter_uncharge_until
(struct res_counter *rc, struct res_counter *top,
unsinged long val)

diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
index 7d7fbe2..4b173b6 100644
--- a/include/linux/res_counter.h
+++ b/include/linux/res_counter.h
@@ -130,14 +130,16 @@ int res_counter_charge_nofail(struct res_counter *counter,
*
* these calls check for usage underflow and show a warning on the console
* _locked call expects the counter->lock to be taken
+ *
+ * returns the total charges still present in @counter.
*/

-void res_counter_uncharge_locked(struct res_counter *counter, unsigned long val);
-void res_counter_uncharge(struct res_counter *counter, unsigned long val);
+u64 res_counter_uncharge_locked(struct res_counter *counter, unsigned long val);
+u64 res_counter_uncharge(struct res_counter *counter, unsigned long val);

-void res_counter_uncharge_until(struct res_counter *counter,
- struct res_counter *top,
- unsigned long val);
+u64 res_counter_uncharge_until(struct res_counter *counter,
+ struct res_counter *top,
+ unsigned long val);
/**
* res_counter_margin - calculate chargeable space of a counter
* @cnt: the counter
diff --git a/kernel/res_counter.c b/kernel/res_counter.c
index ad581aa..7b3d6dc 100644
--- a/kernel/res_counter.c
+++ b/kernel/res_counter.c
@@ -86,33 +86,39 @@ int res_counter_charge_nofail(struct res_counter *counter, unsigned long val,
return __res_counter_charge(counter, val, limit_fail_at, true);
}

-void res_counter_uncharge_locked(struct res_counter *counter, unsigned long val)
+u64 res_counter_uncharge_locked(struct res_counter *counter, unsigned long val)
{
if (WARN_ON(counter->usage < val))
val = counter->usage;

counter->usage -= val;
+ return counter->usage;
}

-void res_counter_uncharge_until(struct res_counter *counter,
- struct res_counter *top,
- unsigned long val)
+u64 res_counter_uncharge_until(struct res_counter *counter,
+ struct res_counter *top,
+ unsigned long val)
{
unsigned long flags;
struct res_counter *c;
+ u64 ret = 0;

local_irq_save(flags);
for (c = counter; c != top; c = c->parent) {
+ u64 r;
spin_lock(&c->lock);
- res_counter_uncharge_locked(c, val);
+ r = res_counter_uncharge_locked(c, val);
+ if (c == counter)
+ ret = r;
spin_unlock(&c->lock);
}
local_irq_restore(flags);
+ return ret;
}

-void res_counter_uncharge(struct res_counter *counter, unsigned long val)
+u64 res_counter_uncharge(struct res_counter *counter, unsigned long val)
{
- res_counter_uncharge_until(counter, NULL, val);
+ return res_counter_uncharge_until(counter, NULL, val);
}

static inline unsigned long long *
--
1.7.11.7

2012-11-01 12:14:20

by Glauber Costa

[permalink] [raw]
Subject: [PATCH v6 06/29] memcg: kmem controller infrastructure

This patch introduces infrastructure for tracking kernel memory pages to
a given memcg. This will happen whenever the caller passes the
__GFP_KMEMCG flag and the task belongs to a memcg other than the root.

In memcontrol.h those functions are wrapped in inline accessors. The
idea is to later patch them with static branches, so we don't incur any
overhead when no memcgs with limited kmem are being used.

Users of this functionality interact with the memcg core code through
the following functions (a short usage sketch follows the list):

memcg_kmem_newpage_charge: will return true if the group can handle the
allocation. At this point, struct page is not
yet allocated.

memcg_kmem_commit_charge: will either revert the charge, if struct page
allocation failed, or embed memcg information
into page_cgroup.

memcg_kmem_uncharge_pages: called at free time, will revert the charge.
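
A caller-side sketch of how the three pair up (this mirrors what the
page allocator does in a later patch of this series; the allocation
call itself is only a placeholder):

struct mem_cgroup *memcg = NULL;
struct page *page;

if (!memcg_kmem_newpage_charge(gfp_mask, &memcg, order))
    return NULL;        /* over the kmem limit: fail the allocation */

page = do_allocate_pages(gfp_mask, order);      /* placeholder */
memcg_kmem_commit_charge(page, memcg, order);   /* reverts if page == NULL */

/* ... and at free time ... */
memcg_kmem_uncharge_pages(page, order);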

[ v2: improved comments and standardized function names ]
[ v3: handle no longer opaque, functions not exported,
even more comments ]
[ v4: reworked Used bit handling and surroundings for more clarity ]
[ v5: simplified code for kmemcg compiled out and core functions in
memcontrol.c, moved kmem code to the middle to avoid forward decls ]

Signed-off-by: Glauber Costa <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Acked-by: Kamezawa Hiroyuki <[email protected]>
CC: Christoph Lameter <[email protected]>
CC: Pekka Enberg <[email protected]>
CC: Johannes Weiner <[email protected]>
CC: Tejun Heo <[email protected]>
---
include/linux/memcontrol.h | 110 +++++++++++++++++++++++++++++
mm/memcontrol.c | 170 +++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 280 insertions(+)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 11ddc7f..e6ca1cf 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -21,6 +21,7 @@
#define _LINUX_MEMCONTROL_H
#include <linux/cgroup.h>
#include <linux/vm_event_item.h>
+#include <linux/hardirq.h>

struct mem_cgroup;
struct page_cgroup;
@@ -407,5 +408,114 @@ static inline void sock_release_memcg(struct sock *sk)
{
}
#endif /* CONFIG_INET && CONFIG_MEMCG_KMEM */
+
+#ifdef CONFIG_MEMCG_KMEM
+static inline bool memcg_kmem_enabled(void)
+{
+ return true;
+}
+
+/*
+ * In general, we'll do everything in our power to not incur in any overhead
+ * for non-memcg users for the kmem functions. Not even a function call, if we
+ * can avoid it.
+ *
+ * Therefore, we'll inline all those functions so that in the best case, we'll
+ * see that kmemcg is off for everybody and proceed quickly. If it is on,
+ * we'll still do most of the flag checking inline. We check a lot of
+ * conditions, but because they are pretty simple, they are expected to be
+ * fast.
+ */
+bool __memcg_kmem_newpage_charge(gfp_t gfp, struct mem_cgroup **memcg,
+ int order);
+void __memcg_kmem_commit_charge(struct page *page,
+ struct mem_cgroup *memcg, int order);
+void __memcg_kmem_uncharge_pages(struct page *page, int order);
+
+/**
+ * memcg_kmem_newpage_charge: verify if a new kmem allocation is allowed.
+ * @gfp: the gfp allocation flags.
+ * @memcg: a pointer to the memcg this was charged against.
+ * @order: allocation order.
+ *
+ * returns true if the memcg where the current task belongs can hold this
+ * allocation.
+ *
+ * We return true automatically if this allocation is not to be accounted to
+ * any memcg.
+ */
+static __always_inline bool
+memcg_kmem_newpage_charge(gfp_t gfp, struct mem_cgroup **memcg, int order)
+{
+ if (!memcg_kmem_enabled())
+ return true;
+
+ /*
+ * __GFP_NOFAIL allocations will move on even if charging is not
+ * possible. Therefore we don't even try, and have this allocation
+ * unaccounted. We could in theory charge it with
+ * res_counter_charge_nofail, but we hope those allocations are rare,
+ * and won't be worth the trouble.
+ */
+ if (!(gfp & __GFP_KMEMCG) || (gfp & __GFP_NOFAIL))
+ return true;
+ if (in_interrupt() || (!current->mm) || (current->flags & PF_KTHREAD))
+ return true;
+
+ /* If the test is dying, just let it go. */
+ if (unlikely(fatal_signal_pending(current)))
+ return true;
+
+ return __memcg_kmem_newpage_charge(gfp, memcg, order);
+}
+
+/**
+ * memcg_kmem_uncharge_pages: uncharge pages from memcg
+ * @page: pointer to struct page being freed
+ * @order: allocation order.
+ *
+ * there is no need to specify memcg here, since it is embedded in page_cgroup
+ */
+static __always_inline void
+memcg_kmem_uncharge_pages(struct page *page, int order)
+{
+ if (memcg_kmem_enabled())
+ __memcg_kmem_uncharge_pages(page, order);
+}
+
+/**
+ * memcg_kmem_commit_charge: embeds correct memcg in a page
+ * @page: pointer to struct page recently allocated
+ * @memcg: the memcg structure we charged against
+ * @order: allocation order.
+ *
+ * Needs to be called after memcg_kmem_newpage_charge, regardless of success or
+ * failure of the allocation. if @page is NULL, this function will revert the
+ * charges. Otherwise, it will commit the memcg given by @memcg to the
+ * corresponding page_cgroup.
+ */
+static __always_inline void
+memcg_kmem_commit_charge(struct page *page, struct mem_cgroup *memcg, int order)
+{
+ if (memcg_kmem_enabled() && memcg)
+ __memcg_kmem_commit_charge(page, memcg, order);
+}
+
+#else
+static inline bool
+memcg_kmem_newpage_charge(gfp_t gfp, struct mem_cgroup **memcg, int order)
+{
+ return true;
+}
+
+static inline void memcg_kmem_uncharge_pages(struct page *page, int order)
+{
+}
+
+static inline void
+memcg_kmem_commit_charge(struct page *page, struct mem_cgroup *memcg, int order)
+{
+}
+#endif /* CONFIG_MEMCG_KMEM */
#endif /* _LINUX_MEMCONTROL_H */

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index df7d6f7..1eefb64 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -10,6 +10,10 @@
* Copyright (C) 2009 Nokia Corporation
* Author: Kirill A. Shutemov
*
+ * Kernel Memory Controller
+ * Copyright (C) 2012 Parallels Inc. and Google Inc.
+ * Authors: Glauber Costa and Suleiman Souhlal
+ *
* This program is free software; you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
* the Free Software Foundation; either version 2 of the License, or
@@ -2630,6 +2634,172 @@ static void __mem_cgroup_commit_charge(struct mem_cgroup *memcg,
memcg_check_events(memcg, page);
}

+#ifdef CONFIG_MEMCG_KMEM
+static inline bool memcg_can_account_kmem(struct mem_cgroup *memcg)
+{
+ return !mem_cgroup_disabled() && !mem_cgroup_is_root(memcg) &&
+ (memcg->kmem_account_flags & KMEM_ACCOUNTED_MASK);
+}
+
+static int memcg_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp, u64 size)
+{
+ struct res_counter *fail_res;
+ struct mem_cgroup *_memcg;
+ int ret = 0;
+ bool may_oom;
+
+ ret = res_counter_charge(&memcg->kmem, size, &fail_res);
+ if (ret)
+ return ret;
+
+ /*
+ * Conditions under which we can wait for the oom_killer. Those are
+ * the same conditions tested by the core page allocator
+ */
+ may_oom = (gfp & __GFP_FS) && !(gfp & __GFP_NORETRY);
+
+ _memcg = memcg;
+ ret = __mem_cgroup_try_charge(NULL, gfp, size >> PAGE_SHIFT,
+ &_memcg, may_oom);
+
+ if (ret == -EINTR) {
+ /*
+ * __mem_cgroup_try_charge() chosed to bypass to root due to
+ * OOM kill or fatal signal. Since our only options are to
+ * either fail the allocation or charge it to this cgroup, do
+ * it as a temporary condition. But we can't fail. From a
+ * kmem/slab perspective, the cache has already been selected,
+ * by mem_cgroup_kmem_get_cache(), so it is too late to change
+ * our minds.
+ *
+ * This condition will only trigger if the task entered
+ * memcg_charge_kmem in a sane state, but was OOM-killed during
+ * __mem_cgroup_try_charge() above. Tasks that were already
+ * dying when the allocation triggers should have been already
+ * directed to the root cgroup in memcontrol.h
+ */
+ res_counter_charge_nofail(&memcg->res, size, &fail_res);
+ if (do_swap_account)
+ res_counter_charge_nofail(&memcg->memsw, size,
+ &fail_res);
+ ret = 0;
+ } else if (ret)
+ res_counter_uncharge(&memcg->kmem, size);
+
+ return ret;
+}
+
+static void memcg_uncharge_kmem(struct mem_cgroup *memcg, u64 size)
+{
+ res_counter_uncharge(&memcg->kmem, size);
+ res_counter_uncharge(&memcg->res, size);
+ if (do_swap_account)
+ res_counter_uncharge(&memcg->memsw, size);
+}
+
+/*
+ * We need to verify if the allocation against current->mm->owner's memcg is
+ * possible for the given order. But the page is not allocated yet, so we'll
+ * need a further commit step to do the final arrangements.
+ *
+ * It is possible for the task to switch cgroups in this mean time, so at
+ * commit time, we can't rely on task conversion any longer. We'll then use
+ * the handle argument to return to the caller which cgroup we should commit
+ * against. We could also return the memcg directly and avoid the pointer
+ * passing, but a boolean return value gives better semantics considering
+ * the compiled-out case as well.
+ *
+ * Returning true means the allocation is possible.
+ */
+bool
+__memcg_kmem_newpage_charge(gfp_t gfp, struct mem_cgroup **_memcg, int order)
+{
+ struct mem_cgroup *memcg;
+ int ret;
+
+ *_memcg = NULL;
+ memcg = try_get_mem_cgroup_from_mm(current->mm);
+
+ /*
+ * very rare case described in mem_cgroup_from_task. Unfortunately there
+ * isn't much we can do without complicating this too much, and it would
+ * be gfp-dependent anyway. Just let it go
+ */
+ if (unlikely(!memcg))
+ return true;
+
+ if (!memcg_can_account_kmem(memcg)) {
+ css_put(&memcg->css);
+ return true;
+ }
+
+ mem_cgroup_get(memcg);
+
+ ret = memcg_charge_kmem(memcg, gfp, PAGE_SIZE << order);
+ if (!ret)
+ *_memcg = memcg;
+ else
+ mem_cgroup_put(memcg);
+
+ css_put(&memcg->css);
+ return (ret == 0);
+}
+
+void __memcg_kmem_commit_charge(struct page *page, struct mem_cgroup *memcg,
+ int order)
+{
+ struct page_cgroup *pc;
+
+ VM_BUG_ON(mem_cgroup_is_root(memcg));
+
+ /* The page allocation failed. Revert */
+ if (!page) {
+ memcg_uncharge_kmem(memcg, PAGE_SIZE << order);
+ mem_cgroup_put(memcg);
+ return;
+ }
+
+ pc = lookup_page_cgroup(page);
+ lock_page_cgroup(pc);
+ pc->mem_cgroup = memcg;
+ SetPageCgroupUsed(pc);
+ unlock_page_cgroup(pc);
+}
+
+void __memcg_kmem_uncharge_pages(struct page *page, int order)
+{
+ struct mem_cgroup *memcg = NULL;
+ struct page_cgroup *pc;
+
+
+ pc = lookup_page_cgroup(page);
+ /*
+ * Fast unlocked return. Theoretically might have changed, have to
+ * check again after locking.
+ */
+ if (!PageCgroupUsed(pc))
+ return;
+
+ lock_page_cgroup(pc);
+ if (PageCgroupUsed(pc)) {
+ memcg = pc->mem_cgroup;
+ ClearPageCgroupUsed(pc);
+ }
+ unlock_page_cgroup(pc);
+
+ /*
+ * We trust that only if there is a memcg associated with the page, it
+ * is a valid allocation
+ */
+ if (!memcg)
+ return;
+
+ VM_BUG_ON(mem_cgroup_is_root(memcg));
+ memcg_uncharge_kmem(memcg, PAGE_SIZE << order);
+ mem_cgroup_put(memcg);
+}
+#endif /* CONFIG_MEMCG_KMEM */
+
#ifdef CONFIG_TRANSPARENT_HUGEPAGE

#define PCGF_NOCOPY_AT_SPLIT (1 << PCG_LOCK | 1 << PCG_MIGRATION)
--
1.7.11.7

2012-11-01 12:14:15

by Glauber Costa

[permalink] [raw]
Subject: [PATCH v6 07/29] mm: Allocate kernel pages to the right memcg

When a process tries to allocate a page with the __GFP_KMEMCG flag, the
page allocator will call the corresponding memcg functions to validate
the allocation. Tasks in the root memcg can always proceed.

To avoid adding markers to the page (and the kmem flag that would
necessarily follow), as well as doing page_cgroup lookups for no reason,
whoever marks its allocations with the __GFP_KMEMCG flag is responsible
for telling the page allocator that this is such an allocation at free
time. This is done by calling __free_memcg_kmem_pages() and
free_memcg_kmem_pages() instead of the plain free_pages() variants.
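
A minimal caller-side sketch, assuming the allocation site both sets
__GFP_KMEMCG and frees through the matching helper (the kernel-stack
patch in this series follows essentially this pattern):

unsigned long addr;

addr = __get_free_pages(GFP_KERNEL | __GFP_KMEMCG, order);
if (!addr)
    return -ENOMEM;
/* ... use the pages ... */
free_memcg_kmem_pages(addr, order); /* tells the allocator these were
                                       __GFP_KMEMCG pages */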

[ v2: inverted test order to avoid a memcg_get leak,
free_accounted_pages simplification ]
[ v4: test for TIF_MEMDIE at newpage_charge ]

Signed-off-by: Glauber Costa <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Acked-by: Kamezawa Hiroyuki <[email protected]>
Acked-by: David Rientjes <[email protected]>
CC: Christoph Lameter <[email protected]>
CC: Pekka Enberg <[email protected]>
CC: Johannes Weiner <[email protected]>
CC: Suleiman Souhlal <[email protected]>
CC: Tejun Heo <[email protected]>
---
include/linux/gfp.h | 3 +++
mm/page_alloc.c | 35 +++++++++++++++++++++++++++++++++++
2 files changed, 38 insertions(+)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 5effbd4..bf98214 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -362,6 +362,9 @@ extern void free_pages(unsigned long addr, unsigned int order);
extern void free_hot_cold_page(struct page *page, int cold);
extern void free_hot_cold_page_list(struct list_head *list, int cold);

+extern void __free_memcg_kmem_pages(struct page *page, unsigned int order);
+extern void free_memcg_kmem_pages(unsigned long addr, unsigned int order);
+
#define __free_page(page) __free_pages((page), 0)
#define free_page(addr) free_pages((addr), 0)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e29912e..3cc0940 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2599,6 +2599,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
int migratetype = allocflags_to_migratetype(gfp_mask);
unsigned int cpuset_mems_cookie;
int alloc_flags = ALLOC_WMARK_LOW|ALLOC_CPUSET;
+ struct mem_cgroup *memcg = NULL;

gfp_mask &= gfp_allowed_mask;

@@ -2617,6 +2618,13 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
if (unlikely(!zonelist->_zonerefs->zone))
return NULL;

+ /*
+ * Will only have any effect when __GFP_KMEMCG is set. This is
+ * verified in the (always inline) callee
+ */
+ if (!memcg_kmem_newpage_charge(gfp_mask, &memcg, order))
+ return NULL;
+
retry_cpuset:
cpuset_mems_cookie = get_mems_allowed();

@@ -2652,6 +2660,8 @@ out:
if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page))
goto retry_cpuset;

+ memcg_kmem_commit_charge(page, memcg, order);
+
return page;
}
EXPORT_SYMBOL(__alloc_pages_nodemask);
@@ -2704,6 +2714,31 @@ void free_pages(unsigned long addr, unsigned int order)

EXPORT_SYMBOL(free_pages);

+/*
+ * __free_memcg_kmem_pages and free_memcg_kmem_pages will free
+ * pages allocated with __GFP_KMEMCG.
+ *
+ * Those pages are accounted to a particular memcg, embedded in the
+ * corresponding page_cgroup. To avoid adding a hit in the allocator to search
+ * for that information only to find out that it is NULL for users who have no
+ * interest in that whatsoever, we provide these functions.
+ *
+ * The caller knows better which flags it relies on.
+ */
+void __free_memcg_kmem_pages(struct page *page, unsigned int order)
+{
+ memcg_kmem_uncharge_pages(page, order);
+ __free_pages(page, order);
+}
+
+void free_memcg_kmem_pages(unsigned long addr, unsigned int order)
+{
+ if (addr != 0) {
+ VM_BUG_ON(!virt_addr_valid((void *)addr));
+ __free_memcg_kmem_pages(virt_to_page((void *)addr), order);
+ }
+}
+
static void *make_alloc_exact(unsigned long addr, unsigned order, size_t size)
{
if (addr) {
--
1.7.11.7

2012-11-01 12:15:12

by Glauber Costa

[permalink] [raw]
Subject: [PATCH v6 03/29] memcg: change defines to an enum

This is just a cleanup patch for clarity of expression. In earlier
submissions, people asked it to be in a separate patch, so here it is.

[ v2: use named enum as type throughout the file as well ]

Signed-off-by: Glauber Costa <[email protected]>
Acked-by: Kamezawa Hiroyuki <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Acked-by: David Rientjes <[email protected]>
CC: Tejun Heo <[email protected]>
---
mm/memcontrol.c | 26 ++++++++++++++++----------
1 file changed, 16 insertions(+), 10 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index aa0d9b0..e3338c7 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -386,9 +386,12 @@ enum charge_type {
};

/* for encoding cft->private value on file */
-#define _MEM (0)
-#define _MEMSWAP (1)
-#define _OOM_TYPE (2)
+enum res_type {
+ _MEM,
+ _MEMSWAP,
+ _OOM_TYPE,
+};
+
#define MEMFILE_PRIVATE(x, val) ((x) << 16 | (val))
#define MEMFILE_TYPE(val) ((val) >> 16 & 0xffff)
#define MEMFILE_ATTR(val) ((val) & 0xffff)
@@ -3915,7 +3918,8 @@ static ssize_t mem_cgroup_read(struct cgroup *cont, struct cftype *cft,
struct mem_cgroup *memcg = mem_cgroup_from_cont(cont);
char str[64];
u64 val;
- int type, name, len;
+ int name, len;
+ enum res_type type;

type = MEMFILE_TYPE(cft->private);
name = MEMFILE_ATTR(cft->private);
@@ -3951,7 +3955,8 @@ static int mem_cgroup_write(struct cgroup *cont, struct cftype *cft,
const char *buffer)
{
struct mem_cgroup *memcg = mem_cgroup_from_cont(cont);
- int type, name;
+ enum res_type type;
+ int name;
unsigned long long val;
int ret;

@@ -4027,7 +4032,8 @@ out:
static int mem_cgroup_reset(struct cgroup *cont, unsigned int event)
{
struct mem_cgroup *memcg = mem_cgroup_from_cont(cont);
- int type, name;
+ int name;
+ enum res_type type;

type = MEMFILE_TYPE(event);
name = MEMFILE_ATTR(event);
@@ -4363,7 +4369,7 @@ static int mem_cgroup_usage_register_event(struct cgroup *cgrp,
struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
struct mem_cgroup_thresholds *thresholds;
struct mem_cgroup_threshold_ary *new;
- int type = MEMFILE_TYPE(cft->private);
+ enum res_type type = MEMFILE_TYPE(cft->private);
u64 threshold, usage;
int i, size, ret;

@@ -4446,7 +4452,7 @@ static void mem_cgroup_usage_unregister_event(struct cgroup *cgrp,
struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
struct mem_cgroup_thresholds *thresholds;
struct mem_cgroup_threshold_ary *new;
- int type = MEMFILE_TYPE(cft->private);
+ enum res_type type = MEMFILE_TYPE(cft->private);
u64 usage;
int i, j, size;

@@ -4524,7 +4530,7 @@ static int mem_cgroup_oom_register_event(struct cgroup *cgrp,
{
struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
struct mem_cgroup_eventfd_list *event;
- int type = MEMFILE_TYPE(cft->private);
+ enum res_type type = MEMFILE_TYPE(cft->private);

BUG_ON(type != _OOM_TYPE);
event = kmalloc(sizeof(*event), GFP_KERNEL);
@@ -4549,7 +4555,7 @@ static void mem_cgroup_oom_unregister_event(struct cgroup *cgrp,
{
struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
struct mem_cgroup_eventfd_list *ev, *tmp;
- int type = MEMFILE_TYPE(cft->private);
+ enum res_type type = MEMFILE_TYPE(cft->private);

BUG_ON(type != _OOM_TYPE);

--
1.7.11.7

2012-11-01 12:15:10

by Glauber Costa

[permalink] [raw]
Subject: [PATCH v6 04/29] kmem accounting basic infrastructure

This patch adds the basic infrastructure for the accounting of kernel
memory. To control that, the following files are created:

* memory.kmem.usage_in_bytes
* memory.kmem.limit_in_bytes
* memory.kmem.failcnt
* memory.kmem.max_usage_in_bytes

They have the same meaning as their user memory counterparts. They
reflect the state of the "kmem" res_counter.

Per-cgroup kmem memory accounting is not enabled until a limit is set
for the group. Once the limit is set, the accounting cannot be disabled
for that group. This means that after the patch is applied, no
behavioral changes exist for whoever is still using memcg to control
their memory usage, until memory.kmem.limit_in_bytes is set for the
first time.

We always account to both the user and the kernel resource counters.
This effectively means that an independent kernel limit is in place
when the kmem limit is set to a lower value than the user limit: for
example, with U = 1G and K = 512M, kernel memory can never exceed 512M,
even though user memory may grow up to 1G. An equal or higher value
means that the user limit will always be hit first, so kmem is
effectively unlimited.

People who want to track kernel memory without limiting it can set this
limit to a very high value that will never be hit (like RESOURCE_MAX
minus one page), or simply equal to the user memory limit.

[ v4: make kmem files part of the main array;
do not allow limit to be set for non-empty cgroups ]
[ v5: cosmetic changes ]
[ v6: name changes and reorganizations, moved memcg_propagate_kmem ]

Signed-off-by: Glauber Costa <[email protected]>
Acked-by: Kamezawa Hiroyuki <[email protected]>
Acked-by: Michal Hocko <[email protected]>
CC: Johannes Weiner <[email protected]>
CC: Tejun Heo <[email protected]>
---
mm/memcontrol.c | 126 ++++++++++++++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 123 insertions(+), 3 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e3338c7..df7d6f7 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -266,6 +266,10 @@ struct mem_cgroup {
};

/*
+ * the counter to account for kernel memory usage.
+ */
+ struct res_counter kmem;
+ /*
* Per cgroup active and inactive list, similar to the
* per zone LRU lists.
*/
@@ -280,6 +284,7 @@ struct mem_cgroup {
* Should the accounting and control be hierarchical, per subtree?
*/
bool use_hierarchy;
+ unsigned long kmem_account_flags; /* See KMEM_ACCOUNTED_*, below */

bool oom_lock;
atomic_t under_oom;
@@ -332,6 +337,20 @@ struct mem_cgroup {
#endif
};

+/* internal only representation about the status of kmem accounting. */
+enum {
+ KMEM_ACCOUNTED_ACTIVE = 0, /* accounted by this cgroup itself */
+};
+
+#define KMEM_ACCOUNTED_MASK (1 << KMEM_ACCOUNTED_ACTIVE)
+
+#ifdef CONFIG_MEMCG_KMEM
+static inline void memcg_kmem_set_active(struct mem_cgroup *memcg)
+{
+ set_bit(KMEM_ACCOUNTED_ACTIVE, &memcg->kmem_account_flags);
+}
+#endif
+
/* Stuffs for move charges at task migration. */
/*
* Types of charges to be moved. "move_charge_at_immitgrate" is treated as a
@@ -390,6 +409,7 @@ enum res_type {
_MEM,
_MEMSWAP,
_OOM_TYPE,
+ _KMEM,
};

#define MEMFILE_PRIVATE(x, val) ((x) << 16 | (val))
@@ -1433,6 +1453,10 @@ done:
res_counter_read_u64(&memcg->memsw, RES_USAGE) >> 10,
res_counter_read_u64(&memcg->memsw, RES_LIMIT) >> 10,
res_counter_read_u64(&memcg->memsw, RES_FAILCNT));
+ printk(KERN_INFO "kmem: usage %llukB, limit %llukB, failcnt %llu\n",
+ res_counter_read_u64(&memcg->kmem, RES_USAGE) >> 10,
+ res_counter_read_u64(&memcg->kmem, RES_LIMIT) >> 10,
+ res_counter_read_u64(&memcg->kmem, RES_FAILCNT));
}

/*
@@ -3940,6 +3964,9 @@ static ssize_t mem_cgroup_read(struct cgroup *cont, struct cftype *cft,
else
val = res_counter_read_u64(&memcg->memsw, name);
break;
+ case _KMEM:
+ val = res_counter_read_u64(&memcg->kmem, name);
+ break;
default:
BUG();
}
@@ -3947,6 +3974,59 @@ static ssize_t mem_cgroup_read(struct cgroup *cont, struct cftype *cft,
len = scnprintf(str, sizeof(str), "%llu\n", (unsigned long long)val);
return simple_read_from_buffer(buf, nbytes, ppos, str, len);
}
+
+static int memcg_update_kmem_limit(struct cgroup *cont, u64 val)
+{
+ int ret = -EINVAL;
+#ifdef CONFIG_MEMCG_KMEM
+ struct mem_cgroup *memcg = mem_cgroup_from_cont(cont);
+ /*
+ * For simplicity, we won't allow this to be disabled. It also can't
+ * be changed if the cgroup has children already, or if tasks had
+ * already joined.
+ *
+ * If tasks join before we set the limit, a person looking at
+ * kmem.usage_in_bytes will have no way to determine when it took
+ * place, which makes the value quite meaningless.
+ *
+ * After it first became limited, changes in the value of the limit are
+ * of course permitted.
+ *
+ * Taking the cgroup_lock is really offensive, but it is so far the only
+ * way to guarantee that no children will appear. There are plenty of
+ * other offenders, and they should all go away. Fine grained locking
+ * is probably the way to go here. When we are fully hierarchical, we
+ * can also get rid of the use_hierarchy check.
+ */
+ cgroup_lock();
+ mutex_lock(&set_limit_mutex);
+ if (!memcg->kmem_account_flags && val != RESOURCE_MAX) {
+ if (cgroup_task_count(cont) || (memcg->use_hierarchy &&
+ !list_empty(&cont->children))) {
+ ret = -EBUSY;
+ goto out;
+ }
+ ret = res_counter_set_limit(&memcg->kmem, val);
+ VM_BUG_ON(ret);
+
+ memcg_kmem_set_active(memcg);
+ } else
+ ret = res_counter_set_limit(&memcg->kmem, val);
+out:
+ mutex_unlock(&set_limit_mutex);
+ cgroup_unlock();
+#endif
+ return ret;
+}
+
+static void memcg_propagate_kmem(struct mem_cgroup *memcg)
+{
+ struct mem_cgroup *parent = parent_mem_cgroup(memcg);
+ if (!parent)
+ return;
+ memcg->kmem_account_flags = parent->kmem_account_flags;
+}
+
/*
* The user of this function is...
* RES_LIMIT.
@@ -3978,8 +4058,12 @@ static int mem_cgroup_write(struct cgroup *cont, struct cftype *cft,
break;
if (type == _MEM)
ret = mem_cgroup_resize_limit(memcg, val);
- else
+ else if (type == _MEMSWAP)
ret = mem_cgroup_resize_memsw_limit(memcg, val);
+ else if (type == _KMEM)
+ ret = memcg_update_kmem_limit(cont, val);
+ else
+ return -EINVAL;
break;
case RES_SOFT_LIMIT:
ret = res_counter_memparse_write_strategy(buffer, &val);
@@ -4045,14 +4129,22 @@ static int mem_cgroup_reset(struct cgroup *cont, unsigned int event)
case RES_MAX_USAGE:
if (type == _MEM)
res_counter_reset_max(&memcg->res);
- else
+ else if (type == _MEMSWAP)
res_counter_reset_max(&memcg->memsw);
+ else if (type == _KMEM)
+ res_counter_reset_max(&memcg->kmem);
+ else
+ return -EINVAL;
break;
case RES_FAILCNT:
if (type == _MEM)
res_counter_reset_failcnt(&memcg->res);
- else
+ else if (type == _MEMSWAP)
res_counter_reset_failcnt(&memcg->memsw);
+ else if (type == _KMEM)
+ res_counter_reset_failcnt(&memcg->kmem);
+ else
+ return -EINVAL;
break;
}

@@ -4614,6 +4706,7 @@ static int mem_cgroup_oom_control_write(struct cgroup *cgrp,
#ifdef CONFIG_MEMCG_KMEM
static int memcg_init_kmem(struct mem_cgroup *memcg, struct cgroup_subsys *ss)
{
+ memcg_propagate_kmem(memcg);
return mem_cgroup_sockets_init(memcg, ss);
};

@@ -4728,6 +4821,31 @@ static struct cftype mem_cgroup_files[] = {
.read = mem_cgroup_read,
},
#endif
+#ifdef CONFIG_MEMCG_KMEM
+ {
+ .name = "kmem.limit_in_bytes",
+ .private = MEMFILE_PRIVATE(_KMEM, RES_LIMIT),
+ .write_string = mem_cgroup_write,
+ .read = mem_cgroup_read,
+ },
+ {
+ .name = "kmem.usage_in_bytes",
+ .private = MEMFILE_PRIVATE(_KMEM, RES_USAGE),
+ .read = mem_cgroup_read,
+ },
+ {
+ .name = "kmem.failcnt",
+ .private = MEMFILE_PRIVATE(_KMEM, RES_FAILCNT),
+ .trigger = mem_cgroup_reset,
+ .read = mem_cgroup_read,
+ },
+ {
+ .name = "kmem.max_usage_in_bytes",
+ .private = MEMFILE_PRIVATE(_KMEM, RES_MAX_USAGE),
+ .trigger = mem_cgroup_reset,
+ .read = mem_cgroup_read,
+ },
+#endif
{ }, /* terminate */
};

@@ -4973,6 +5091,7 @@ mem_cgroup_create(struct cgroup *cont)
if (parent && parent->use_hierarchy) {
res_counter_init(&memcg->res, &parent->res);
res_counter_init(&memcg->memsw, &parent->memsw);
+ res_counter_init(&memcg->kmem, &parent->kmem);
/*
* We increment refcnt of the parent to ensure that we can
* safely access it on res_counter_charge/uncharge.
@@ -4983,6 +5102,7 @@ mem_cgroup_create(struct cgroup *cont)
} else {
res_counter_init(&memcg->res, NULL);
res_counter_init(&memcg->memsw, NULL);
+ res_counter_init(&memcg->kmem, NULL);
/*
* Deeper hierachy with use_hierarchy == false doesn't make
* much sense so let cgroup subsystem know about this
--
1.7.11.7

Subject: Re: [PATCH v6 05/29] Add a __GFP_KMEMCG flag

On Thu, 1 Nov 2012, Glauber Costa wrote:

> This flag is used to indicate to the callees that this allocation is a
> kernel allocation in process context, and should be accounted to
> current's memcg. It takes the numerical place of the recently removed
> __GFP_NO_KSWAPD.

Acked-by: Christoph Lameter <[email protected]>

Subject: Re: [PATCH v6 06/29] memcg: kmem controller infrastructure

On Thu, 1 Nov 2012, Glauber Costa wrote:

> +#ifdef CONFIG_MEMCG_KMEM
> +static inline bool memcg_kmem_enabled(void)
> +{
> + return true;
> +}
> +

Maybe it would be better to do this in the same way that NUMA_BUILD was
done in kernel.h?
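
For reference, the kernel.h pattern being referred to looked roughly
like this at the time; a similar compile-time constant would let
memcg_kmem_enabled() collapse to 0 when CONFIG_MEMCG_KMEM is off,
without an #ifdef at every call site:

#ifdef CONFIG_NUMA
#define NUMA_BUILD 1
#else
#define NUMA_BUILD 0
#endif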


> +static __always_inline bool
> +memcg_kmem_newpage_charge(gfp_t gfp, struct mem_cgroup **memcg, int order)
> +{
> + if (!memcg_kmem_enabled())
> + return true;
> +
> + /*
> + * __GFP_NOFAIL allocations will move on even if charging is not
> + * possible. Therefore we don't even try, and have this allocation
> + * unaccounted. We could in theory charge it with
> + * res_counter_charge_nofail, but we hope those allocations are rare,
> + * and won't be worth the trouble.
> + */
> + if (!(gfp & __GFP_KMEMCG) || (gfp & __GFP_NOFAIL))
> + return true;


> + if (in_interrupt() || (!current->mm) || (current->flags & PF_KTHREAD))
> + return true;

This type of check is repeatedly occurring in various subsystems. Could we
get a function (maybe inline) to do this check?
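
A hypothetical helper of the kind being asked for might look like the
sketch below; the name is made up here, and the condition is simply the
one quoted above:

/*
 * True when the current context cannot be charged to a memcg:
 * interrupts have no task context, and kernel threads own no mm.
 */
static inline bool memcg_charge_context_invalid(void)
{
        return in_interrupt() || !current->mm ||
               (current->flags & PF_KTHREAD);
}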

2012-11-02 00:04:57

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH v6 00/29] kmem controller for memcg.

On Thu, 1 Nov 2012 16:07:16 +0400
Glauber Costa <[email protected]> wrote:

> Hi,
>
> This work introduces the kernel memory controller for memcg. Unlike previous
> submissions, this includes the whole controller, comprised of slab and stack
> memory.

I'm in the middle of (re)reading all this. Meanwhile I'll push it all
out to http://ozlabs.org/~akpm/mmots/ for the crazier testers.

One thing:

> Numbers can be found at https://lkml.org/lkml/2012/9/13/239

You claim in the above that the fork workload is "slab intensive". Or
at least, you seem to - it's a bit fuzzy.

But how slab intensive is it, really?

What is extremely slab intensive is networking. The networking guys
are very sensitive to slab performance. If this hasn't already been
done, could you please determine what impact this has upon networking?
I expect Eric Dumazet, Dave Miller and Tom Herbert could suggest
testing approaches.

2012-11-02 00:05:42

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH v6 11/29] memcg: allow a memcg with kmem charges to be destructed.

On Thu, 1 Nov 2012 16:07:27 +0400
Glauber Costa <[email protected]> wrote:

> Because the ultimate goal of the kmem tracking in memcg is to track slab
> pages as well, we can't guarantee that we'll always be able to point a
> page to a particular process, and migrate the charges along with it -
> since in the common case, a page will contain data belonging to multiple
> processes.
>
> Because of that, when we destroy a memcg, we only make sure the
> destruction will succeed by discounting the kmem charges from the user
> charges when we try to empty the cgroup.

There was a significant conflict with the sched/numa changes in
linux-next, which I resolved as below. Please check it.

static int mem_cgroup_reparent_charges(struct mem_cgroup *memcg)
{
        struct cgroup *cgrp = memcg->css.cgroup;
        int node, zid;
        u64 usage;

        do {
                if (cgroup_task_count(cgrp) || !list_empty(&cgrp->children))
                        return -EBUSY;
                /* This is for making all *used* pages to be on LRU. */
                lru_add_drain_all();
                drain_all_stock_sync(memcg);
                mem_cgroup_start_move(memcg);
                for_each_node_state(node, N_HIGH_MEMORY) {
                        for (zid = 0; zid < MAX_NR_ZONES; zid++) {
                                enum lru_list lru;
                                for_each_lru(lru) {
                                        mem_cgroup_force_empty_list(memcg,
                                                        node, zid, lru);
                                }
                        }
                }
                mem_cgroup_end_move(memcg);
                memcg_oom_recover(memcg);
                cond_resched();

                /*
                 * Kernel memory may not necessarily be trackable to a specific
                 * process. So they are not migrated, and therefore we can't
                 * expect their value to drop to 0 here.
                 * Having res filled up with kmem only is enough.
                 *
                 * This is a safety check because mem_cgroup_force_empty_list
                 * could have raced with mem_cgroup_replace_page_cache callers
                 * so the lru seemed empty but the page could have been added
                 * right after the check. RES_USAGE should be safe as we always
                 * charge before adding to the LRU.
                 */
                usage = res_counter_read_u64(&memcg->res, RES_USAGE) -
                        res_counter_read_u64(&memcg->kmem, RES_USAGE);
        } while (usage > 0);

        return 0;
}

2012-11-02 00:05:53

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH v6 23/29] memcg: destroy memcg caches

On Thu, 1 Nov 2012 16:07:39 +0400
Glauber Costa <[email protected]> wrote:

> This patch implements destruction of memcg caches. Right now,
> only caches where our reference counter is the last remaining are
> deleted. If there are any other reference counters around, we just
> leave the caches lying around until they go away.
>
> When that happen, a destruction function is called from the cache
> code. Caches are only destroyed in process context, so we queue them
> up for later processing in the general case.
>
>
> ...
>
> @@ -5950,6 +6012,7 @@ static int mem_cgroup_pre_destroy(struct cgroup *cont)
> {
> struct mem_cgroup *memcg = mem_cgroup_from_cont(cont);
>
> + mem_cgroup_destroy_all_caches(memcg);
> return mem_cgroup_force_empty(memcg, false);
> }
>

Conflicts with linux-next cgroup changes. Looks pretty simple:


static int mem_cgroup_pre_destroy(struct cgroup *cont)
{
        struct mem_cgroup *memcg = mem_cgroup_from_cont(cont);
        int ret;

        css_get(&memcg->css);
        ret = mem_cgroup_reparent_charges(memcg);
        mem_cgroup_destroy_all_caches(memcg);
        css_put(&memcg->css);

        return ret;
}

2012-11-02 07:41:28

by Glauber Costa

[permalink] [raw]
Subject: Re: [PATCH v6 00/29] kmem controller for memcg.

On 11/02/2012 04:04 AM, Andrew Morton wrote:
> On Thu, 1 Nov 2012 16:07:16 +0400
> Glauber Costa <[email protected]> wrote:
>
>> Hi,
>>
>> This work introduces the kernel memory controller for memcg. Unlike previous
>> submissions, this includes the whole controller, comprised of slab and stack
>> memory.
>
> I'm in the middle of (re)reading all this. Meanwhile I'll push it all
> out to http://ozlabs.org/~akpm/mmots/ for the crazier testers.
>
> One thing:
>
>> Numbers can be found at https://lkml.org/lkml/2012/9/13/239
>
> You claim in the above that the fork worload is 'slab intensive". Or
> at least, you seem to - it's a bit fuzzy.
>
> But how slab intensive is it, really?
>
> What is extremely slab intensive is networking. The networking guys
> are very sensitive to slab performance. If this hasn't already been
> done, could you please determine what impact this has upon networking?
> I expect Eric Dumazet, Dave Miller and Tom Herbert could suggest
> testing approaches.
>

I can test it, but unfortunately I am unlikely to get to prepare a good
environment before Barcelona.

I know, however, that Greg Thelen was testing netperf in his setup.
Greg, do you have any publishable numbers you could share?

2012-11-02 07:46:57

by Glauber Costa

[permalink] [raw]
Subject: Re: [PATCH v6 23/29] memcg: destroy memcg caches

On 11/02/2012 04:05 AM, Andrew Morton wrote:
> On Thu, 1 Nov 2012 16:07:39 +0400
> Glauber Costa <[email protected]> wrote:
>
>> This patch implements destruction of memcg caches. Right now,
>> only caches where our reference counter is the last remaining are
>> deleted. If there are any other reference counters around, we just
>> leave the caches lying around until they go away.
>>
>> When that happen, a destruction function is called from the cache
>> code. Caches are only destroyed in process context, so we queue them
>> up for later processing in the general case.
>>
>>
>> ...
>>
>> @@ -5950,6 +6012,7 @@ static int mem_cgroup_pre_destroy(struct cgroup *cont)
>> {
>> struct mem_cgroup *memcg = mem_cgroup_from_cont(cont);
>>
>> + mem_cgroup_destroy_all_caches(memcg);
>> return mem_cgroup_force_empty(memcg, false);
>> }
>>
>
> Conflicts with linux-next cgroup changes. Looks pretty simple:
>
>
> static int mem_cgroup_pre_destroy(struct cgroup *cont)
> {
> struct mem_cgroup *memcg = mem_cgroup_from_cont(cont);
> int ret;
>
> css_get(&memcg->css);
> ret = mem_cgroup_reparent_charges(memcg);
> mem_cgroup_destroy_all_caches(memcg);
> css_put(&memcg->css);
>
> return ret;
> }
>

There is one significant difference between the code I had and the code
after your fix up.

In my patch, caches were destroyed before the call to
mem_cgroup_force_empty. In the final version, they are destroyed after it.

I am here thinking, but I am not sure if this has any significant
impact... If we run mem_cgroup_destroy_all_caches() before reparenting,
we'll have shrunk a lot of the pending caches, and we will have fewer
pages to reparent. But we only reparent pages in the lru anyway, and
then expect kmem and remaining umem to match. So *in theory* it should
be fine.

Where can I grab your final tree so I can test it and make sure it is
all good?

2012-11-02 07:50:44

by Glauber Costa

[permalink] [raw]
Subject: Re: [PATCH v6 11/29] memcg: allow a memcg with kmem charges to be destructed.

On 11/02/2012 04:05 AM, Andrew Morton wrote:
> On Thu, 1 Nov 2012 16:07:27 +0400
> Glauber Costa <[email protected]> wrote:
>
>> Because the ultimate goal of the kmem tracking in memcg is to track slab
>> pages as well, we can't guarantee that we'll always be able to point a
>> page to a particular process, and migrate the charges along with it -
>> since in the common case, a page will contain data belonging to multiple
>> processes.
>>
>> Because of that, when we destroy a memcg, we only make sure the
>> destruction will succeed by discounting the kmem charges from the user
>> charges when we try to empty the cgroup.
>
> There was a significant conflict with the sched/numa changes in
> linux-next, which I resolved as below. Please check it.
>
> static int mem_cgroup_reparent_charges(struct mem_cgroup *memcg)
> {
> struct cgroup *cgrp = memcg->css.cgroup;
> int node, zid;
> u64 usage;
>
> do {
> if (cgroup_task_count(cgrp) || !list_empty(&cgrp->children))
> return -EBUSY;
> /* This is for making all *used* pages to be on LRU. */
> lru_add_drain_all();
> drain_all_stock_sync(memcg);
> mem_cgroup_start_move(memcg);
> for_each_node_state(node, N_HIGH_MEMORY) {
> for (zid = 0; zid < MAX_NR_ZONES; zid++) {
> enum lru_list lru;
> for_each_lru(lru) {
> mem_cgroup_force_empty_list(memcg,
> node, zid, lru);
> }
> }
> }
> mem_cgroup_end_move(memcg);
> memcg_oom_recover(memcg);
> cond_resched();
>
> /*
> * Kernel memory may not necessarily be trackable to a specific
> * process. So they are not migrated, and therefore we can't
> * expect their value to drop to 0 here.
> * Having res filled up with kmem only is enough.
> *
> * This is a safety check because mem_cgroup_force_empty_list
> * could have raced with mem_cgroup_replace_page_cache callers
> * so the lru seemed empty but the page could have been added
> * right after the check. RES_USAGE should be safe as we always
> * charge before adding to the LRU.
> */
> usage = res_counter_read_u64(&memcg->res, RES_USAGE) -
> res_counter_read_u64(&memcg->kmem, RES_USAGE);
> } while (usage > 0);
>
> return 0;
> }
>
Andrew,

It looks fine.

One thing: reading open code makes it very difficult to spot exactly
what the conflict is. Any chance you could send those in some kind of
diff format?

In this case, it appears to me that the reason for the conflict is that
the loop conditional "while (usage > 0 || ret)" was changed to "while
(usage > 0)". That being the case, and because kmemcg has no business
with that "ret" value, this resolution is appropriate.

2012-11-02 08:30:10

by Pekka Enberg

[permalink] [raw]
Subject: Re: [PATCH v6 00/29] kmem controller for memcg.

On Fri, Nov 2, 2012 at 2:04 AM, Andrew Morton <[email protected]> wrote:
> One thing:
>
>> Numbers can be found at https://lkml.org/lkml/2012/9/13/239
>
> You claim in the above that the fork worload is 'slab intensive". Or
> at least, you seem to - it's a bit fuzzy.
>
> But how slab intensive is it, really?
>
> What is extremely slab intensive is networking. The networking guys
> are very sensitive to slab performance. If this hasn't already been
> done, could you please determine what impact this has upon networking?
> I expect Eric Dumazet, Dave Miller and Tom Herbert could suggest
> testing approaches.

IIRC, networking guys have reduced their dependency on slab
performance recently.

A few simple benchmarks to run are hackbench, netperf, and Christoph's
famous microbenchmarks. The sad reality is that you usually have to
wait a few release cycles before people notice that you've destroyed
the performance of their favourite workload. :-/

2012-11-02 19:26:07

by Joonsoo Kim

[permalink] [raw]
Subject: Re: [PATCH v6 00/29] kmem controller for memcg.

Hello, Glauber.

2012/11/2 Glauber Costa <[email protected]>:
> On 11/02/2012 04:04 AM, Andrew Morton wrote:
>> On Thu, 1 Nov 2012 16:07:16 +0400
>> Glauber Costa <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> This work introduces the kernel memory controller for memcg. Unlike previous
>>> submissions, this includes the whole controller, comprised of slab and stack
>>> memory.
>>
>> I'm in the middle of (re)reading all this. Meanwhile I'll push it all
>> out to http://ozlabs.org/~akpm/mmots/ for the crazier testers.
>>
>> One thing:
>>
>>> Numbers can be found at https://lkml.org/lkml/2012/9/13/239
>>
>> You claim in the above that the fork worload is 'slab intensive". Or
>> at least, you seem to - it's a bit fuzzy.
>>
>> But how slab intensive is it, really?
>>
>> What is extremely slab intensive is networking. The networking guys
>> are very sensitive to slab performance. If this hasn't already been
>> done, could you please determine what impact this has upon networking?
>> I expect Eric Dumazet, Dave Miller and Tom Herbert could suggest
>> testing approaches.
>>
>
> I can test it, but unfortunately I am unlikely to get to prepare a good
> environment before Barcelona.
>
> I know, however, that Greg Thelen was testing netperf in his setup.
> Greg, do you have any publishable numbers you could share?

Below is my humble opinion.
I am worried about the data cache footprint possibly caused by this
patchset, especially by the slab implementation.
If there are several memcg cgroups, each cgroup has its own kmem_caches.
When each group does slab-intensive work, the data cache may overflow
easily and the cache miss rate will be high, which would significantly
decrease system performance.
Are there any results about this?

Thanks.

2012-11-02 20:19:36

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v6 23/29] memcg: destroy memcg caches

On Fri 02-11-12 11:46:42, Glauber Costa wrote:
> On 11/02/2012 04:05 AM, Andrew Morton wrote:
> > On Thu, 1 Nov 2012 16:07:39 +0400
> > Glauber Costa <[email protected]> wrote:
> >
> >> This patch implements destruction of memcg caches. Right now,
> >> only caches where our reference counter is the last remaining are
> >> deleted. If there are any other reference counters around, we just
> >> leave the caches lying around until they go away.
> >>
> >> When that happen, a destruction function is called from the cache
> >> code. Caches are only destroyed in process context, so we queue them
> >> up for later processing in the general case.
> >>
> >>
> >> ...
> >>
> >> @@ -5950,6 +6012,7 @@ static int mem_cgroup_pre_destroy(struct cgroup *cont)
> >> {
> >> struct mem_cgroup *memcg = mem_cgroup_from_cont(cont);
> >>
> >> + mem_cgroup_destroy_all_caches(memcg);
> >> return mem_cgroup_force_empty(memcg, false);
> >> }
> >>
> >
> > Conflicts with linux-next cgroup changes. Looks pretty simple:
> >
> >
> > static int mem_cgroup_pre_destroy(struct cgroup *cont)
> > {
> > struct mem_cgroup *memcg = mem_cgroup_from_cont(cont);
> > int ret;
> >
> > css_get(&memcg->css);
> > ret = mem_cgroup_reparent_charges(memcg);
> > mem_cgroup_destroy_all_caches(memcg);
> > css_put(&memcg->css);
> >
> > return ret;
> > }
> >
>
> There is one significant difference between the code I had and the code
> after your fix up.
>
> In my patch, caches were destroyed before the call to
> mem_cgroup_force_empty. In the final, version, they are destroyed after it.
>
> I am here thinking, but I am not sure if this have any significant
> impact... If we run mem_cgroup_destroy_all_caches() before reparenting,
> we'll have shrunk a lot of the pending caches, and we will have less
> pages to reparent. But we only reparent pages in the lru anyway, and
> then expect kmem and remaining umem to match. So *in theory* it should
> be fine.
>
> Where can I grab your final tree so I can test it and make sure it is
> all good ?

Everything is in the -mm git tree (I tend to take mmots trees if they
compile).

--
Michal Hocko
SUSE Labs

2012-11-02 23:06:45

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH v6 00/29] kmem controller for memcg.

Hey, Joonsoo.

On Sat, Nov 03, 2012 at 04:25:59AM +0900, JoonSoo Kim wrote:
> I am worrying about data cache footprint which is possibly caused by
> this patchset, especially slab implementation.
> If there are several memcg cgroups, each cgroup has it's own kmem_caches.
> When each group do slab-intensive job hard, data cache may be overflowed easily,
> and cache miss rate will be high, therefore this would decrease system
> performance highly.

It would be nice to be able to remove such overhead too, but the
baselines for cgroup implementations (well, at least the ones that I
think are important) in somewhat decreasing priority are...

1. Don't over-complicate the target subsystem.

2. Overhead when cgroup is not used should be minimal. Preferably to
the level of being unnoticeable.

3. Overhead while cgroup is being actively used should be reasonable.

If you wanna split your system into N groups and maintain memory
resource segregation among them, I don't think it's unreasonable to
ask for paying data cache footprint overhead.

So, while improvements would be nice, I wouldn't consider overheads of
this type as a blocker.

Thanks.

--
tejun

2012-11-03 03:37:20

by Greg Thelen

[permalink] [raw]
Subject: Re: [PATCH v6 00/29] kmem controller for memcg.

On Fri, Nov 2, 2012 at 12:41 AM, Glauber Costa <[email protected]> wrote:
> On 11/02/2012 04:04 AM, Andrew Morton wrote:
>> On Thu, 1 Nov 2012 16:07:16 +0400
>> Glauber Costa <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> This work introduces the kernel memory controller for memcg. Unlike previous
>>> submissions, this includes the whole controller, comprised of slab and stack
>>> memory.
>>
>> I'm in the middle of (re)reading all this. Meanwhile I'll push it all
>> out to http://ozlabs.org/~akpm/mmots/ for the crazier testers.
>>
>> One thing:
>>
>>> Numbers can be found at https://lkml.org/lkml/2012/9/13/239
>>
>> You claim in the above that the fork worload is 'slab intensive". Or
>> at least, you seem to - it's a bit fuzzy.
>>
>> But how slab intensive is it, really?
>>
>> What is extremely slab intensive is networking. The networking guys
>> are very sensitive to slab performance. If this hasn't already been
>> done, could you please determine what impact this has upon networking?
>> I expect Eric Dumazet, Dave Miller and Tom Herbert could suggest
>> testing approaches.
>>
>
> I can test it, but unfortunately I am unlikely to get to prepare a good
> environment before Barcelona.
>
> I know, however, that Greg Thelen was testing netperf in his setup.
> Greg, do you have any publishable numbers you could share?

I should have some netperf numbers next week. Sorry I've been distracted by
other projects recently.

2012-11-05 08:15:04

by Glauber Costa

[permalink] [raw]
Subject: Re: [PATCH v6 00/29] kmem controller for memcg.

On 11/03/2012 12:06 AM, Tejun Heo wrote:
> Hey, Joonsoo.
>
> On Sat, Nov 03, 2012 at 04:25:59AM +0900, JoonSoo Kim wrote:
>> I am worrying about data cache footprint which is possibly caused by
>> this patchset, especially slab implementation.
>> If there are several memcg cgroups, each cgroup has it's own kmem_caches.
>> When each group do slab-intensive job hard, data cache may be overflowed easily,
>> and cache miss rate will be high, therefore this would decrease system
>> performance highly.
>
> It would be nice to be able to remove such overhead too, but the
> baselines for cgroup implementations (well, at least the ones that I
> think important) in somewhat decreasing priority are...
>
> 1. Don't over-complicate the target subsystem.
>
> 2. Overhead when cgroup is not used should be minimal. Prefereably to
> the level of being unnoticeable.
>
> 3. Overhead while cgroup is being actively used should be reasonable.
>
> If you wanna split your system into N groups and maintain memory
> resource segregation among them, I don't think it's unreasonable to
> ask for paying data cache footprint overhead.
>
> So, while improvements would be nice, I wouldn't consider overheads of
> this type as a blocker.
>
> Thanks.
>
There is another thing I should add.

We are essentially replicating all the allocator meta-data, so if you
look at it, this is exactly the same thing as workloads that allocate
from different allocators (e.g. a lot of network structures and a lot
of dentries).

In this sense, it really depends on what your comparison point is.
Full containers - the main (but not exclusive) reason for this - are
more or less an alternative to virtual machines. In those, you would
be allocating from a different cache anyway, because you would be
getting your memory through a bunch of address translations. Compared
to that, we do a lot better, since we only change the cache you
allocate from, keeping all the rest unchanged.

2012-11-05 08:18:45

by Glauber Costa

[permalink] [raw]
Subject: Re: [PATCH v6 00/29] kmem controller for memcg.

On 11/02/2012 08:25 PM, JoonSoo Kim wrote:
> Hello, Glauber.
>
> 2012/11/2 Glauber Costa <[email protected]>:
>> On 11/02/2012 04:04 AM, Andrew Morton wrote:
>>> On Thu, 1 Nov 2012 16:07:16 +0400
>>> Glauber Costa <[email protected]> wrote:
>>>
>>>> Hi,
>>>>
>>>> This work introduces the kernel memory controller for memcg. Unlike previous
>>>> submissions, this includes the whole controller, comprised of slab and stack
>>>> memory.
>>>
>>> I'm in the middle of (re)reading all this. Meanwhile I'll push it all
>>> out to http://ozlabs.org/~akpm/mmots/ for the crazier testers.
>>>
>>> One thing:
>>>
>>>> Numbers can be found at https://lkml.org/lkml/2012/9/13/239
>>>
>>> You claim in the above that the fork worload is 'slab intensive". Or
>>> at least, you seem to - it's a bit fuzzy.
>>>
>>> But how slab intensive is it, really?
>>>
>>> What is extremely slab intensive is networking. The networking guys
>>> are very sensitive to slab performance. If this hasn't already been
>>> done, could you please determine what impact this has upon networking?
>>> I expect Eric Dumazet, Dave Miller and Tom Herbert could suggest
>>> testing approaches.
>>>
>>
>> I can test it, but unfortunately I am unlikely to get to prepare a good
>> environment before Barcelona.
>>
>> I know, however, that Greg Thelen was testing netperf in his setup.
>> Greg, do you have any publishable numbers you could share?
>
> Below is my humble opinion.
> I am worrying about data cache footprint which is possibly caused by
> this patchset, especially slab implementation.
> If there are several memcg cgroups, each cgroup has it's own kmem_caches.

I answered the performance part in response to Tejun's response.

Let me just add something here: keep in mind this is not "per memcg",
it is "per memcg that is kernel-memory limited". So in a sense, you
only pay this cost, and allocate from different caches, if you enable
this at runtime.

This should all be documented in the Documentation/ patch. But let me
know if there is anything that needs further clarification.

2012-11-06 00:23:33

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH v6 18/29] Allocate memory for memcg caches whenever a new memcg appears

On Thu, 1 Nov 2012 16:07:34 +0400
Glauber Costa <[email protected]> wrote:

> Every cache that is considered a root cache (basically the "original" caches,
> tied to the root memcg/no-memcg) will have an array that should be large enough
> to store a cache pointer per each memcg in the system.
>
> Theoretically, this is as high as 1 << sizeof(css_id), which is currently in the
> 64k pointers range. Most of the time, we won't be using that much.
>
> What goes in this patch, is a simple scheme to dynamically allocate such an
> array, in order to minimize memory usage for memcg caches. Because we would
> also like to avoid allocations all the time, at least for now, the array will
> only grow. It will tend to be big enough to hold the maximum number of
> kmem-limited memcgs ever achieved.
>
> We'll allocate it to be a minimum of 64 kmem-limited memcgs. When we have more
> than that, we'll start doubling the size of this array every time the limit is
> reached.
>
> Because we are only considering kmem limited memcgs, a natural point for this
> to happen is when we write to the limit. At that point, we already have
> set_limit_mutex held, so that will become our natural synchronization
> mechanism.
>
> ...
>
> +static struct ida kmem_limited_groups;

Could use DEFINE_IDA() here

>
> ...
>
> static int memcg_init_kmem(struct mem_cgroup *memcg, struct cgroup_subsys *ss)
> {
> + int ret;
> +
> memcg->kmemcg_id = -1;
> - memcg_propagate_kmem(memcg);
> + ret = memcg_propagate_kmem(memcg);
> + if (ret)
> + return ret;
> +
> + if (mem_cgroup_is_root(memcg))
> + ida_init(&kmem_limited_groups);

and zap this?
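
What is being suggested, roughly: DEFINE_IDA() (from <linux/idr.h>)
statically initializes the IDA, which would make the ida_init() call in
memcg_init_kmem() unnecessary.

static DEFINE_IDA(kmem_limited_groups);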

2012-11-06 00:28:41

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH v6 19/29] memcg: infrastructure to match an allocation to the right cache

On Thu, 1 Nov 2012 16:07:35 +0400
Glauber Costa <[email protected]> wrote:

> +static __always_inline struct kmem_cache *
> +memcg_kmem_get_cache(struct kmem_cache *cachep, gfp_t gfp)

I still don't understand why this code uses __always_inline so much.

I don't recall seeing the compiler producing out-of-line versions of
"static inline" functions (and perhaps it has special treatment for
functions which were defined in a header file?).

And if the compiler *does* decide to uninline the function, perhaps it
knows best, and the function shouldn't have been declared inline in the
first place.


If it is indeed better to use __always_inline in this code then we have
a heck of a lot of other "static inline" definitions which we need to
convert! So, what's going on here?

2012-11-06 00:33:16

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH v6 20/29] memcg: skip memcg kmem allocations in specified code regions

On Thu, 1 Nov 2012 16:07:36 +0400
Glauber Costa <[email protected]> wrote:

> This patch creates a mechanism that skip memcg allocations during
> certain pieces of our core code. It basically works in the same way
> as preempt_disable()/preempt_enable(): By marking a region under
> which all allocations will be accounted to the root memcg.
>
> We need this to prevent races in early cache creation, when we
> allocate data using caches that are not necessarily created already.
>
> ...
>
> +static inline void memcg_stop_kmem_account(void)
> +{
> + if (!current->mm)
> + return;

It is utterly unobvious to this reader why the code tests ->mm in this
fashion. So we need either smarter readers or a code comment.
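
A sketch of the commented version being asked for; only the ->mm check
appears in the quoted hunk, so the skip-account counter below is an
assumption about how the rest of the function marks the region:

static inline void memcg_stop_kmem_account(void)
{
        /*
         * Kernel threads and other contexts without an mm are never
         * charged to a memcg in the first place, so there is no
         * accounting to suppress for them.
         */
        if (!current->mm)
                return;
        current->memcg_kmem_skip_account++;
}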

2012-11-06 00:40:18

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH v6 23/29] memcg: destroy memcg caches

On Thu, 1 Nov 2012 16:07:39 +0400
Glauber Costa <[email protected]> wrote:

> This patch implements destruction of memcg caches. Right now,
> only caches where our reference counter is the last remaining are
> deleted. If there are any other reference counters around, we just
> leave the caches lying around until they go away.
>
> When that happen, a destruction function is called from the cache
> code. Caches are only destroyed in process context, so we queue them
> up for later processing in the general case.
>
> ...
>
> --- a/include/linux/slab.h
> +++ b/include/linux/slab.h
> @@ -181,6 +181,7 @@ unsigned int kmem_cache_size(struct kmem_cache *);
> #define ARCH_SLAB_MINALIGN __alignof__(unsigned long long)
> #endif
>
> +#include <linux/workqueue.h>

Was there any reason for putting this include 185 lines into the file?

If not, then let's not do it. It reduces readability and increases the
risk that someone will later include the same file (or something it includes)
a second time, to satisfy some dependency at line 100.

2012-11-06 00:48:18

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH v6 25/29] memcg/sl[au]b: shrink dead caches

On Thu, 1 Nov 2012 16:07:41 +0400
Glauber Costa <[email protected]> wrote:

> This means that when we destroy a memcg cache that happened to be empty,
> those caches may take a lot of time to go away: removing the memcg
> reference won't destroy them - because there are pending references, and
> the empty pages will stay there, until a shrinker is called upon for any
> reason.
>
> In this patch, we will call kmem_cache_shrink for all dead caches that
> cannot be destroyed because of remaining pages. After shrinking, it is
> possible that it could be freed. If this is not the case, we'll schedule
> a lazy worker to keep trying.

This patch is really quite nasty. We poll the cache once per minute
trying to shrink then free it? a) it gives rise to concerns that there
will be scenarios where the system could suffer unlimited memory windup
but mainly b) it's just lame.

The kernel doesn't do this sort of thing. The kernel tries to be
precise: in a situation like this we keep track of the number of
outstanding objects and when that falls to zero, we free their
container synchronously. If those objects are normally left floating
around in an allocated but reclaimable state then we can address that
by synchronously freeing them if their container has been destroyed.

Or something like that. If it's something else then fine, but not this.

What do we need to do to fix this?

2012-11-06 00:57:09

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH v6 26/29] Aggregate memcg cache values in slabinfo

On Thu, 1 Nov 2012 16:07:42 +0400
Glauber Costa <[email protected]> wrote:

> When we create caches in memcgs, we need to display their usage
> information somewhere. We'll adopt a scheme similar to /proc/meminfo,
> with aggregate totals shown in the global file, and per-group
> information stored in the group itself.
>
> For the time being, only reads are allowed in the per-group cache.
>
> ...
>
> +#define for_each_memcg_cache_index(_idx) \
> + for ((_idx) = 0; i < memcg_limited_groups_array_size; (_idx)++)

Use of this requires slab_mutex, yes?

Please add a comment, and confirm that all callers do indeed hold the
correct lock.


We could add a mutex_is_locked() check to the macro perhaps, but this
isn't the place to assume the presence of slab_mutex, so it gets messy.

>
> ...
>
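
One way the locking requirement could be documented and checked; the
macro below is the one from the quoted patch (with the loop variable
used consistently), and the call site with lockdep_assert_held() is
purely illustrative:

/* Walk every per-memcg index of a root cache; callers hold slab_mutex. */
#define for_each_memcg_cache_index(_idx) \
        for ((_idx) = 0; (_idx) < memcg_limited_groups_array_size; (_idx)++)

static void walk_memcg_caches(struct kmem_cache *root)
{
        int i;

        lockdep_assert_held(&slab_mutex);
        for_each_memcg_cache_index(i) {
                struct kmem_cache *c = cache_from_memcg(root, i);

                if (c)
                        pr_info("memcg cache %d: %s\n", i, c->name);
        }
}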

2012-11-06 08:04:08

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v6 19/29] memcg: infrastructure to match an allocation to the right cache

On Mon 05-11-12 16:28:37, Andrew Morton wrote:
> On Thu, 1 Nov 2012 16:07:35 +0400
> Glauber Costa <[email protected]> wrote:
>
> > +static __always_inline struct kmem_cache *
> > +memcg_kmem_get_cache(struct kmem_cache *cachep, gfp_t gfp)
>
> I still don't understand why this code uses __always_inline so much.

AFAIU, __always_inline (resp. __attribute__((always_inline))) is the
same thing as inline if optimizations are enabled
(http://ohse.de/uwe/articles/gcc-attributes.html#func-always_inline).
Which is the case for the kernel. I was always wondering why we have
this __always_inline thingy.
It has been introduced back in 2004 by Andi but the commit log doesn't
say much:
"
[PATCH] gcc-3.5 fixes

Trivial gcc-3.5 build fixes.
"
Andi what was the original motivation for this attribute?

> I don't recall seeing the compiler producing out-of-line versions of
> "static inline" functions

and if it decides then __always_inline will not help, right?

--
Michal Hocko
SUSE Labs

2012-11-06 10:54:31

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v6 11/29] memcg: allow a memcg with kmem charges to be destructed.

On Thu 01-11-12 17:05:39, Andrew Morton wrote:
> On Thu, 1 Nov 2012 16:07:27 +0400
> Glauber Costa <[email protected]> wrote:
>
> > Because the ultimate goal of the kmem tracking in memcg is to track slab
> > pages as well, we can't guarantee that we'll always be able to point a
> > page to a particular process, and migrate the charges along with it -
> > since in the common case, a page will contain data belonging to multiple
> > processes.
> >
> > Because of that, when we destroy a memcg, we only make sure the
> > destruction will succeed by discounting the kmem charges from the user
> > charges when we try to empty the cgroup.
>
> There was a significant conflict with the sched/numa changes in
> linux-next,

Just for the record. The conflict was introduced by 2ef37d3f (memcg: Simplify
mem_cgroup_force_empty_list error handling) which came in via Tejun's
tree.
Your resolution looks good to me. Sorry about the trouble.

> which I resolved as below. Please check it.
>
> static int mem_cgroup_reparent_charges(struct mem_cgroup *memcg)
> {
> struct cgroup *cgrp = memcg->css.cgroup;
> int node, zid;
> u64 usage;
>
> do {
> if (cgroup_task_count(cgrp) || !list_empty(&cgrp->children))
> return -EBUSY;
> /* This is for making all *used* pages to be on LRU. */
> lru_add_drain_all();
> drain_all_stock_sync(memcg);
> mem_cgroup_start_move(memcg);
> for_each_node_state(node, N_HIGH_MEMORY) {
> for (zid = 0; zid < MAX_NR_ZONES; zid++) {
> enum lru_list lru;
> for_each_lru(lru) {
> mem_cgroup_force_empty_list(memcg,
> node, zid, lru);
> }
> }
> }
> mem_cgroup_end_move(memcg);
> memcg_oom_recover(memcg);
> cond_resched();
>
> /*
> * Kernel memory may not necessarily be trackable to a specific
> * process. So they are not migrated, and therefore we can't
> * expect their value to drop to 0 here.
> * Having res filled up with kmem only is enough.
> *
> * This is a safety check because mem_cgroup_force_empty_list
> * could have raced with mem_cgroup_replace_page_cache callers
> * so the lru seemed empty but the page could have been added
> * right after the check. RES_USAGE should be safe as we always
> * charge before adding to the LRU.
> */
> usage = res_counter_read_u64(&memcg->res, RES_USAGE) -
> res_counter_read_u64(&memcg->kmem, RES_USAGE);
> } while (usage > 0);
>
> return 0;
> }
>

--
Michal Hocko
SUSE Labs

2012-11-06 19:25:20

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH v6 28/29] slub: slub-specific propagation changes.

On Thu, 1 Nov 2012 16:07:44 +0400
Glauber Costa <[email protected]> wrote:

> SLUB allows us to tune a particular cache behavior with sysfs-based
> tunables. When creating a new memcg cache copy, we'd like to preserve
> any tunables the parent cache already had.
>
> This can be done by tapping into the store attribute function provided
> by the allocator. We of course don't need to mess with read-only
> fields. Since the attributes can have multiple types and are stored
> internally by sysfs, the best strategy is to issue a ->show() in the
> root cache, and then ->store() in the memcg cache.
>
> The drawback of that, is that sysfs can allocate up to a page in
> buffering for show(), that we are likely not to need, but also can't
> guarantee. To avoid always allocating a page for that, we can update the
> caches at store time with the maximum attribute size ever stored to the
> root cache. We will then get a buffer big enough to hold it. The
> corollary to this is that if no stores happened, nothing will be
> propagated.
>
> It can also happen that a root cache has its tunables updated during
> normal system operation. In this case, we will propagate the change to
> all caches that are already active.
>
> ...
>
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -3955,6 +3956,7 @@ int __kmem_cache_create(struct kmem_cache *s, unsigned long flags)
> if (err)
> return err;
>
> + memcg_propagate_slab_attrs(s);
> mutex_unlock(&slab_mutex);
> err = sysfs_slab_add(s);
> mutex_lock(&slab_mutex);
> @@ -5180,6 +5182,7 @@ static ssize_t slab_attr_store(struct kobject *kobj,
> struct slab_attribute *attribute;
> struct kmem_cache *s;
> int err;
> + int i __maybe_unused;
>
> attribute = to_slab_attr(attr);
> s = to_slab(kobj);
> @@ -5188,10 +5191,81 @@ static ssize_t slab_attr_store(struct kobject *kobj,
> return -EIO;
>
> err = attribute->store(s, buf, len);
> +#ifdef CONFIG_MEMCG_KMEM
> + if (slab_state < FULL)
> + return err;
>
> + if ((err < 0) || !is_root_cache(s))
> + return err;
> +
> + mutex_lock(&slab_mutex);
> + if (s->max_attr_size < len)
> + s->max_attr_size = len;
> +
> + for_each_memcg_cache_index(i) {
> + struct kmem_cache *c = cache_from_memcg(s, i);
> + if (c)
> + /* return value determined by the parent cache only */
> + attribute->store(c, buf, len);
> + }
> + mutex_unlock(&slab_mutex);
> +#endif
> return err;
> }

hm, __maybe_unused is an ugly thing. We can avoid it by tweaking the
code a bit:

diff -puN mm/slub.c~slub-slub-specific-propagation-changes-fix mm/slub.c
--- a/mm/slub.c~slub-slub-specific-propagation-changes-fix
+++ a/mm/slub.c
@@ -5175,7 +5175,6 @@ static ssize_t slab_attr_store(struct ko
struct slab_attribute *attribute;
struct kmem_cache *s;
int err;
- int i __maybe_unused;

attribute = to_slab_attr(attr);
s = to_slab(kobj);
@@ -5185,23 +5184,24 @@ static ssize_t slab_attr_store(struct ko

err = attribute->store(s, buf, len);
#ifdef CONFIG_MEMCG_KMEM
- if (slab_state < FULL)
- return err;
+ if (slab_state >= FULL && err >= 0 && is_root_cache(s)) {
+ int i;

- if ((err < 0) || !is_root_cache(s))
- return err;
-
- mutex_lock(&slab_mutex);
- if (s->max_attr_size < len)
- s->max_attr_size = len;
-
- for_each_memcg_cache_index(i) {
- struct kmem_cache *c = cache_from_memcg(s, i);
- if (c)
- /* return value determined by the parent cache only */
- attribute->store(c, buf, len);
+ mutex_lock(&slab_mutex);
+ if (s->max_attr_size < len)
+ s->max_attr_size = len;
+
+ for_each_memcg_cache_index(i) {
+ struct kmem_cache *c = cache_from_memcg(s, i);
+ /*
+ * This function's return value is determined by the
+ * parent cache only
+ */
+ if (c)
+ attribute->store(c, buf, len);
+ }
+ mutex_unlock(&slab_mutex);
}
- mutex_unlock(&slab_mutex);
#endif
return err;
}

Also, the comment in there tells the reader *what the code does*, not
*why it does it*. Why do we ignore the ->store return value for child
caches?

2012-11-07 07:04:30

by Glauber Costa

[permalink] [raw]
Subject: Re: [PATCH v6 19/29] memcg: infrastructure to match an allocation to the right cache

On 11/06/2012 01:28 AM, Andrew Morton wrote:
> On Thu, 1 Nov 2012 16:07:35 +0400
> Glauber Costa <[email protected]> wrote:
>
>> +static __always_inline struct kmem_cache *
>> +memcg_kmem_get_cache(struct kmem_cache *cachep, gfp_t gfp)
>
> I still don't understand why this code uses __always_inline so much.
>
> I don't recall seeing the compiler producing out-of-line versions of
> "static inline" functions (and perhaps it has special treatment for
> functions which were defined in a header file?).
>
> And if the compiler *does* decide to uninline the function, perhaps it
> knows best, and the function shouldn't have been declared inline in the
> first place.
>
>
> If it is indeed better to use __always_inline in this code then we have
> a heck of a lot of other "static inline" definitions whcih we need to
> convert! So, what's going on here?
>

The original motivation is indeed performance related. We want to make
sure it is inlined so it will quickly figure out the "I am not a memcg
user" case and keep going. SLUB, for instance, is full of
__always_inline functions to make sure that the fast path contains
absolutely no function calls. So I was just following that here.

I can remove the marker without a problem and leave it to the compiler,
if you think that is best.

2012-11-07 07:05:23

by Glauber Costa

[permalink] [raw]
Subject: Re: [PATCH v6 18/29] Allocate memory for memcg caches whenever a new memcg appears

On 11/06/2012 01:23 AM, Andrew Morton wrote:
> On Thu, 1 Nov 2012 16:07:34 +0400
> Glauber Costa <[email protected]> wrote:
>
>> Every cache that is considered a root cache (basically the "original" caches,
>> tied to the root memcg/no-memcg) will have an array that should be large enough
>> to store a cache pointer per each memcg in the system.
>>
>> Theoretically, this is as high as 1 << sizeof(css_id), which is currently in the
>> 64k pointers range. Most of the time, we won't be using that much.
>>
>> What goes in this patch, is a simple scheme to dynamically allocate such an
>> array, in order to minimize memory usage for memcg caches. Because we would
>> also like to avoid allocations all the time, at least for now, the array will
>> only grow. It will tend to be big enough to hold the maximum number of
>> kmem-limited memcgs ever achieved.
>>
>> We'll allocate it to be a minimum of 64 kmem-limited memcgs. When we have more
>> than that, we'll start doubling the size of this array every time the limit is
>> reached.
>>
>> Because we are only considering kmem limited memcgs, a natural point for this
>> to happen is when we write to the limit. At that point, we already have
>> set_limit_mutex held, so that will become our natural synchronization
>> mechanism.
>>
>> ...
>>
>> +static struct ida kmem_limited_groups;
>
> Could use DEFINE_IDA() here
>
>>
>> ...
>>
>> static int memcg_init_kmem(struct mem_cgroup *memcg, struct cgroup_subsys *ss)
>> {
>> + int ret;
>> +
>> memcg->kmemcg_id = -1;
>> - memcg_propagate_kmem(memcg);
>> + ret = memcg_propagate_kmem(memcg);
>> + if (ret)
>> + return ret;
>> +
>> + if (mem_cgroup_is_root(memcg))
>> + ida_init(&kmem_limited_groups);
>
> and zap this?
>

Ok.

I am starting to go over your replies now, and a general question:
Since you have already included this in mm, would you like me to
resubmit the series changing things according to your feedback, or
should I send incremental patches?

2012-11-07 07:10:22

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH v6 18/29] Allocate memory for memcg caches whenever a new memcg appears

On Wed, 7 Nov 2012 08:05:10 +0100 Glauber Costa <[email protected]> wrote:

> Since you have already included this in mm, would you like me to
> resubmit the series changing things according to your feedback, or
> should I send incremental patches?

I normally don't care. I do turn replacements into incrementals so
that I and others can see what changed, but that's all scripted.

However in this case the patches have been changed somewhat (mainly
because of sched/numa getting in the way) so incrementals would be nice
if convenient, please.

2012-11-07 07:13:22

by Glauber Costa

[permalink] [raw]
Subject: Re: [PATCH v6 25/29] memcg/sl[au]b: shrink dead caches

On 11/06/2012 01:48 AM, Andrew Morton wrote:
> On Thu, 1 Nov 2012 16:07:41 +0400
> Glauber Costa <[email protected]> wrote:
>
>> This means that when we destroy a memcg cache that happened to be empty,
>> those caches may take a lot of time to go away: removing the memcg
>> reference won't destroy them - because there are pending references, and
>> the empty pages will stay there, until a shrinker is called upon for any
>> reason.
>>
>> In this patch, we will call kmem_cache_shrink for all dead caches that
>> cannot be destroyed because of remaining pages. After shrinking, it is
>> possible that it could be freed. If this is not the case, we'll schedule
>> a lazy worker to keep trying.
>
> This patch is really quite nasty. We poll the cache once per minute
> trying to shrink then free it? a) it gives rise to concerns that there
> will be scenarios where the system could suffer unlimited memory windup
> but mainly b) it's just lame.
>
> The kernel doesn't do this sort of thing. The kernel tries to be
> precise: in a situation like this we keep track of the number of
> outstanding objects and when that falls to zero, we free their
> container synchronously. If those objects are normally left floating
> around in an allocated but reclaimable state then we can address that
> by synchronously freeing them if their container has been destroyed.
>
> Or something like that. If it's something else then fine, but not this.
>
> What do we need to do to fix this?
>
The original patch had an unlikely() test in the free path, conditional
on whether or not the cache is dead, that would then trigger this
destruction if the cache had become empty.

I got several requests to remove it and change it to something like
this, because that is a fast path (I myself think an unlikely branch is
not that bad).

If you think such a test is acceptable, I can bring it back and argue on
the basis of "akpm made me do it!". But meanwhile I will give this extra
thought to see if there is any alternative way I can do it...

2012-11-07 07:13:58

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH v6 19/29] memcg: infrastructure to match an allocation to the right cache

On Wed, 7 Nov 2012 08:04:03 +0100 Glauber Costa <[email protected]> wrote:

> On 11/06/2012 01:28 AM, Andrew Morton wrote:
> > On Thu, 1 Nov 2012 16:07:35 +0400
> > Glauber Costa <[email protected]> wrote:
> >
> >> +static __always_inline struct kmem_cache *
> >> +memcg_kmem_get_cache(struct kmem_cache *cachep, gfp_t gfp)
> >
> > I still don't understand why this code uses __always_inline so much.
> >
> > I don't recall seeing the compiler producing out-of-line versions of
> > "static inline" functions (and perhaps it has special treatment for
> > functions which were defined in a header file?).
> >
> > And if the compiler *does* decide to uninline the function, perhaps it
> > knows best, and the function shouldn't have been declared inline in the
> > first place.
> >
> >
> > If it is indeed better to use __always_inline in this code then we have
> > a heck of a lot of other "static inline" definitions whcih we need to
> > convert! So, what's going on here?
> >
>
> The original motivation is indeed performance related. We want to make
> sure it is inline so it will figure out quickly the "I am not a memcg
> user" case and keep it going. The slub, for instance, is full of
> __always_inline functions to make sure that the fast path contains
> absolutely no function calls. So I was just following this here.

Well. Do we really know that inlining is best in all these cases? And
in future, as the code evolves? If for some reason the compiler
chooses not to inline the function, maybe it was right. Small code
footprint has benefits.

> I can remove the marker without a problem and leave it to the compiler
> if you think it is best

It's a minor thing. But __always_inline is rather specialised and
readers of this code will be wondering why it was done here. Unless we
can actually demonstrate benefit from __always_inline, I'd suggest
following convention here.

2012-11-07 07:16:31

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH v6 25/29] memcg/sl[au]b: shrink dead caches

On Wed, 7 Nov 2012 08:13:08 +0100 Glauber Costa <[email protected]> wrote:

> On 11/06/2012 01:48 AM, Andrew Morton wrote:
> > On Thu, 1 Nov 2012 16:07:41 +0400
> > Glauber Costa <[email protected]> wrote:
> >
> >> This means that when we destroy a memcg cache that happened to be empty,
> >> those caches may take a lot of time to go away: removing the memcg
> >> reference won't destroy them - because there are pending references, and
> >> the empty pages will stay there, until a shrinker is called upon for any
> >> reason.
> >>
> >> In this patch, we will call kmem_cache_shrink for all dead caches that
> >> cannot be destroyed because of remaining pages. After shrinking, it is
> >> possible that it could be freed. If this is not the case, we'll schedule
> >> a lazy worker to keep trying.
> >
> > This patch is really quite nasty. We poll the cache once per minute
> > trying to shrink then free it? a) it gives rise to concerns that there
> > will be scenarios where the system could suffer unlimited memory windup
> > but mainly b) it's just lame.
> >
> > The kernel doesn't do this sort of thing. The kernel tries to be
> > precise: in a situation like this we keep track of the number of
> > outstanding objects and when that falls to zero, we free their
> > container synchronously. If those objects are normally left floating
> > around in an allocated but reclaimable state then we can address that
> > by synchronously freeing them if their container has been destroyed.
> >
> > Or something like that. If it's something else then fine, but not this.
> >
> > What do we need to do to fix this?
> >
> The original patch had a unlikely() test in the free path, conditional
> on whether or not the cache is dead, that would then call this is the
> cache would now be empty.
>
> I got several requests to remove it and change it to something like
> this, because that is a fast path (I myself think an unlikely branch is
> not that bad)
>
> If you think such a test is acceptable, I can bring it back and argue in
> the basis of "akpm made me do it!". But meanwhile I will give this extra
> though to see if there is any alternative way I can do it...

OK, thanks, please do take a look at it.

I'd be interested in seeing the old version of the patch which had this
test-n-branch. Perhaps there's some trick we can pull to lessen its cost.

2012-11-07 09:22:40

by Glauber Costa

[permalink] [raw]
Subject: Re: [PATCH v6 25/29] memcg/sl[au]b: shrink dead caches

On 11/07/2012 08:16 AM, Andrew Morton wrote:
> On Wed, 7 Nov 2012 08:13:08 +0100 Glauber Costa <[email protected]> wrote:
>
>> On 11/06/2012 01:48 AM, Andrew Morton wrote:
>>> On Thu, 1 Nov 2012 16:07:41 +0400
>>> Glauber Costa <[email protected]> wrote:
>>>
>>>> This means that when we destroy a memcg cache that happened to be empty,
>>>> those caches may take a lot of time to go away: removing the memcg
>>>> reference won't destroy them - because there are pending references, and
>>>> the empty pages will stay there, until a shrinker is called upon for any
>>>> reason.
>>>>
>>>> In this patch, we will call kmem_cache_shrink for all dead caches that
>>>> cannot be destroyed because of remaining pages. After shrinking, it is
>>>> possible that it could be freed. If this is not the case, we'll schedule
>>>> a lazy worker to keep trying.
>>>
>>> This patch is really quite nasty. We poll the cache once per minute
>>> trying to shrink then free it? a) it gives rise to concerns that there
>>> will be scenarios where the system could suffer unlimited memory windup
>>> but mainly b) it's just lame.
>>>
>>> The kernel doesn't do this sort of thing. The kernel tries to be
>>> precise: in a situation like this we keep track of the number of
>>> outstanding objects and when that falls to zero, we free their
>>> container synchronously. If those objects are normally left floating
>>> around in an allocated but reclaimable state then we can address that
>>> by synchronously freeing them if their container has been destroyed.
>>>
>>> Or something like that. If it's something else then fine, but not this.
>>>
>>> What do we need to do to fix this?
>>>
>> The original patch had a unlikely() test in the free path, conditional
>> on whether or not the cache is dead, that would then call this is the
>> cache would now be empty.
>>
>> I got several requests to remove it and change it to something like
>> this, because that is a fast path (I myself think an unlikely branch is
>> not that bad)
>>
>> If you think such a test is acceptable, I can bring it back and argue in
>> the basis of "akpm made me do it!". But meanwhile I will give this extra
>> though to see if there is any alternative way I can do it...
>
> OK, thanks, please do take a look at it.
>
> I'd be interested in seeing the old version of the patch which had this
> test-n-branch. Perhaps there's some trick we can pull to lessen its cost.
>
Attached.

This is the last version that used it (well, I believe it is). There
are other unrelated things in this patch that I got rid of. Look for
kmem_cache_verify_dead().

In summary, all calls to the free function would, as a last step, call
kmem_cache_verify_dead(), which would either be an empty placeholder or:

+static inline void kmem_cache_verify_dead(struct kmem_cache *s)
+{
+	if (unlikely(s->memcg_params.dead))
+		schedule_work(&s->memcg_params.cache_shrinker);
+}


cache_shrinker got changed to the destroy worker. So if we are freeing
an object from a cache that is dead, we try to schedule a worker that
will eventually call kmem_cache_shrink(), and hopefully
kmem_cache_destroy() - if that was the last object.
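For reference, a minimal sketch of what such a worker could look like. This is
purely illustrative: the memcg_params layout used below (cache_shrinker, dead,
an nr_pages counter and a cachep back-pointer) is assumed for the example, not
lifted from the patch itself.

#include <linux/slab.h>
#include <linux/workqueue.h>

/* Hypothetical worker: shrink a dead memcg cache and, if that emptied
 * it completely, destroy it. */
static void memcg_cache_shrinker_work(struct work_struct *work)
{
	struct memcg_cache_params *params =
		container_of(work, struct memcg_cache_params, cache_shrinker);
	struct kmem_cache *cachep = params->cachep;	/* assumed back-pointer */

	/* Return empty partial slabs to the page allocator. */
	kmem_cache_shrink(cachep);

	/* Nothing left at all? Then the cache itself can finally go. */
	if (params->dead && atomic_read(&params->nr_pages) == 0)
		kmem_cache_destroy(cachep);
}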


Attachments:
0015-memcg-sl-au-b-shrink-dead-caches.patch (6.66 kB)

2012-11-07 15:55:52

by Sasha Levin

[permalink] [raw]
Subject: Re: [PATCH v6 28/29] slub: slub-specific propagation changes.

On 11/01/2012 08:07 AM, Glauber Costa wrote:
> SLUB allows us to tune a particular cache behavior with sysfs-based
> tunables. When creating a new memcg cache copy, we'd like to preserve
> any tunables the parent cache already had.
>
> This can be done by tapping into the store attribute function provided
> by the allocator. We of course don't need to mess with read-only
> fields. Since the attributes can have multiple types and are stored
> internally by sysfs, the best strategy is to issue a ->show() in the
> root cache, and then ->store() in the memcg cache.
>
> The drawback of that, is that sysfs can allocate up to a page in
> buffering for show(), that we are likely not to need, but also can't
> guarantee. To avoid always allocating a page for that, we can update the
> caches at store time with the maximum attribute size ever stored to the
> root cache. We will then get a buffer big enough to hold it. The
> corollary to this is that if no stores happened, nothing will be
> propagated.
>
> It can also happen that a root cache has its tunables updated during
> normal system operation. In this case, we will propagate the change to
> all caches that are already active.
>
> Signed-off-by: Glauber Costa <[email protected]>
> CC: Christoph Lameter <[email protected]>
> CC: Pekka Enberg <[email protected]>
> CC: Michal Hocko <[email protected]>
> CC: Kamezawa Hiroyuki <[email protected]>
> CC: Johannes Weiner <[email protected]>
> CC: Suleiman Souhlal <[email protected]>
> CC: Tejun Heo <[email protected]>
> ---

Hi guys,

This patch is making lockdep angry! *bark bark*

[ 351.935003] ======================================================
[ 351.937693] [ INFO: possible circular locking dependency detected ]
[ 351.939720] 3.7.0-rc4-next-20121106-sasha-00008-g353b62f #117 Tainted: G W
[ 351.942444] -------------------------------------------------------
[ 351.943528] trinity-child13/6961 is trying to acquire lock:
[ 351.943528] (s_active#43){++++.+}, at: [<ffffffff812f9e11>] sysfs_addrm_finish+0x31/0x60
[ 351.943528]
[ 351.943528] but task is already holding lock:
[ 351.943528] (slab_mutex){+.+.+.}, at: [<ffffffff81228a42>] kmem_cache_destroy+0x22/0xe0
[ 351.943528]
[ 351.943528] which lock already depends on the new lock.
[ 351.943528]
[ 351.943528]
[ 351.943528] the existing dependency chain (in reverse order) is:
[ 351.943528]
-> #1 (slab_mutex){+.+.+.}:
[ 351.960334] [<ffffffff8118536a>] lock_acquire+0x1aa/0x240
[ 351.960334] [<ffffffff83a944d9>] __mutex_lock_common+0x59/0x5a0
[ 351.960334] [<ffffffff83a94a5f>] mutex_lock_nested+0x3f/0x50
[ 351.960334] [<ffffffff81256a6e>] slab_attr_store+0xde/0x110
[ 351.960334] [<ffffffff812f820a>] sysfs_write_file+0xfa/0x150
[ 351.960334] [<ffffffff8127a220>] vfs_write+0xb0/0x180
[ 351.960334] [<ffffffff8127a540>] sys_pwrite64+0x60/0xb0
[ 351.960334] [<ffffffff83a99298>] tracesys+0xe1/0xe6
[ 351.960334]
-> #0 (s_active#43){++++.+}:
[ 351.960334] [<ffffffff811825af>] __lock_acquire+0x14df/0x1ca0
[ 351.960334] [<ffffffff8118536a>] lock_acquire+0x1aa/0x240
[ 351.960334] [<ffffffff812f9272>] sysfs_deactivate+0x122/0x1a0
[ 351.960334] [<ffffffff812f9e11>] sysfs_addrm_finish+0x31/0x60
[ 351.960334] [<ffffffff812fa369>] sysfs_remove_dir+0x89/0xd0
[ 351.960334] [<ffffffff819e1d96>] kobject_del+0x16/0x40
[ 351.960334] [<ffffffff8125ed40>] __kmem_cache_shutdown+0x40/0x60
[ 351.960334] [<ffffffff81228a60>] kmem_cache_destroy+0x40/0xe0
[ 351.960334] [<ffffffff82b21058>] mon_text_release+0x78/0xe0
[ 351.960334] [<ffffffff8127b3b2>] __fput+0x122/0x2d0
[ 351.960334] [<ffffffff8127b569>] ____fput+0x9/0x10
[ 351.960334] [<ffffffff81131b4e>] task_work_run+0xbe/0x100
[ 351.960334] [<ffffffff81110742>] do_exit+0x432/0xbd0
[ 351.960334] [<ffffffff81110fa4>] do_group_exit+0x84/0xd0
[ 351.960334] [<ffffffff8112431d>] get_signal_to_deliver+0x81d/0x930
[ 351.960334] [<ffffffff8106d5aa>] do_signal+0x3a/0x950
[ 351.960334] [<ffffffff8106df1e>] do_notify_resume+0x3e/0x90
[ 351.960334] [<ffffffff83a993aa>] int_signal+0x12/0x17
[ 351.960334]
[ 351.960334] other info that might help us debug this:
[ 351.960334]
[ 351.960334] Possible unsafe locking scenario:
[ 351.960334]
[ 351.960334] CPU0 CPU1
[ 351.960334] ---- ----
[ 351.960334] lock(slab_mutex);
[ 351.960334] lock(s_active#43);
[ 351.960334] lock(slab_mutex);
[ 351.960334] lock(s_active#43);
[ 351.960334]
[ 351.960334] *** DEADLOCK ***
[ 351.960334]
[ 351.960334] 2 locks held by trinity-child13/6961:
[ 351.960334] #0: (mon_lock){+.+.+.}, at: [<ffffffff82b21005>] mon_text_release+0x25/0xe0
[ 351.960334] #1: (slab_mutex){+.+.+.}, at: [<ffffffff81228a42>] kmem_cache_destroy+0x22/0xe0
[ 351.960334]
[ 351.960334] stack backtrace:
[ 351.960334] Pid: 6961, comm: trinity-child13 Tainted: G W 3.7.0-rc4-next-20121106-sasha-00008-g353b62f #117
[ 351.960334] Call Trace:
[ 351.960334] [<ffffffff83a3c736>] print_circular_bug+0x1fb/0x20c
[ 351.960334] [<ffffffff811825af>] __lock_acquire+0x14df/0x1ca0
[ 351.960334] [<ffffffff81184045>] ? debug_check_no_locks_freed+0x185/0x1e0
[ 351.960334] [<ffffffff8118536a>] lock_acquire+0x1aa/0x240
[ 351.960334] [<ffffffff812f9e11>] ? sysfs_addrm_finish+0x31/0x60
[ 351.960334] [<ffffffff812f9272>] sysfs_deactivate+0x122/0x1a0
[ 351.960334] [<ffffffff812f9e11>] ? sysfs_addrm_finish+0x31/0x60
[ 351.960334] [<ffffffff812f9e11>] sysfs_addrm_finish+0x31/0x60
[ 351.960334] [<ffffffff812fa369>] sysfs_remove_dir+0x89/0xd0
[ 351.960334] [<ffffffff819e1d96>] kobject_del+0x16/0x40
[ 351.960334] [<ffffffff8125ed40>] __kmem_cache_shutdown+0x40/0x60
[ 351.960334] [<ffffffff81228a60>] kmem_cache_destroy+0x40/0xe0
[ 351.960334] [<ffffffff82b21058>] mon_text_release+0x78/0xe0
[ 351.960334] [<ffffffff8127b3b2>] __fput+0x122/0x2d0
[ 351.960334] [<ffffffff8127b569>] ____fput+0x9/0x10
[ 351.960334] [<ffffffff81131b4e>] task_work_run+0xbe/0x100
[ 351.960334] [<ffffffff81110742>] do_exit+0x432/0xbd0
[ 351.960334] [<ffffffff811243b9>] ? get_signal_to_deliver+0x8b9/0x930
[ 351.960334] [<ffffffff8117d402>] ? get_lock_stats+0x22/0x70
[ 351.960334] [<ffffffff8117d48e>] ? put_lock_stats.isra.16+0xe/0x40
[ 351.960334] [<ffffffff83a977fb>] ? _raw_spin_unlock_irq+0x2b/0x80
[ 351.960334] [<ffffffff81110fa4>] do_group_exit+0x84/0xd0
[ 351.960334] [<ffffffff8112431d>] get_signal_to_deliver+0x81d/0x930
[ 351.960334] [<ffffffff8117d48e>] ? put_lock_stats.isra.16+0xe/0x40
[ 351.960334] [<ffffffff8106d5aa>] do_signal+0x3a/0x950
[ 351.960334] [<ffffffff811c8b33>] ? rcu_cleanup_after_idle+0x23/0x170
[ 351.960334] [<ffffffff811cc1c4>] ? rcu_eqs_exit_common+0x64/0x3a0
[ 351.960334] [<ffffffff811caa5d>] ? rcu_user_enter+0x10d/0x140
[ 351.960334] [<ffffffff811cc8d5>] ? rcu_user_exit+0xc5/0xf0
[ 351.960334] [<ffffffff8106df1e>] do_notify_resume+0x3e/0x90
[ 351.960334] [<ffffffff83a993aa>] int_signal+0x12/0x17


Thanks,
Sasha

2012-11-07 22:46:15

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH v6 25/29] memcg/sl[au]b: shrink dead caches

On Wed, 7 Nov 2012 10:22:17 +0100
Glauber Costa <[email protected]> wrote:

> >>> container synchronously. If those objects are normally left floating
> >>> around in an allocated but reclaimable state then we can address that
> >>> by synchronously freeing them if their container has been destroyed.
> >>>
> >>> Or something like that. If it's something else then fine, but not this.
> >>>
> >>> What do we need to do to fix this?
> >>>
> >> The original patch had an unlikely() test in the free path, conditional
> >> on whether or not the cache is dead, that would then call this if the
> >> cache would now be empty.
> >>
> >> I got several requests to remove it and change it to something like
> >> this, because that is a fast path (I myself think an unlikely branch is
> >> not that bad)
> >>
> >> If you think such a test is acceptable, I can bring it back and argue on
> >> the basis of "akpm made me do it!". But meanwhile I will give this extra
> >> thought to see if there is any alternative way I can do it...
> >
> > OK, thanks, please do take a look at it.
> >
> > I'd be interested in seeing the old version of the patch which had this
> > test-n-branch. Perhaps there's some trick we can pull to lessen its cost.
> >
> Attached.
>
> This is the last version that used it (well, I believe it is). There are
> other unrelated things in this patch, which I got rid of. Look for
> kmem_cache_verify_dead().
>
> In summary, all calls to the free function would, as a last step, call
> kmem_cache_verify_dead(), which would either be an empty placeholder or:
>
> +static inline void kmem_cache_verify_dead(struct kmem_cache *s)
> +{
> +	if (unlikely(s->memcg_params.dead))
> +		schedule_work(&s->memcg_params.cache_shrinker);
> +}

hm, a few things.

What's up with kmem_cache_shrink? It's global and exported to modules
but its only external caller is some weird and hopelessly poorly
documented site down in drivers/acpi/osl.c. slab and slob implement
kmem_cache_shrink() *only* for acpi! wtf? Let's work out what acpi is
trying to actually do there, then do it properly, then killkillkill!

Secondly, as slab and slub (at least) have the ability to shed cached
memory, why aren't they hooked into the core cache-shrinking machinery.
After all, it's called "shrink_slab"!


If we can fix all that up then I wonder whether this particular patch
needs to exist at all. If the kmem_cache is no longer used then we
can simply leave it floating around in memory and the regular cache
shrinking code out of shrink_slab() will clean up any remaining pages.
The kmem_cache itself can be reclaimed via another shrinker, if
necessary?

2012-11-08 06:52:04

by Glauber Costa

[permalink] [raw]
Subject: Re: [PATCH v6 28/29] slub: slub-specific propagation changes.

On 11/07/2012 04:53 PM, Sasha Levin wrote:
> On 11/01/2012 08:07 AM, Glauber Costa wrote:
>> SLUB allows us to tune a particular cache behavior with sysfs-based
>> tunables. When creating a new memcg cache copy, we'd like to preserve
>> any tunables the parent cache already had.
>>
>> This can be done by tapping into the store attribute function provided
>> by the allocator. We of course don't need to mess with read-only
>> fields. Since the attributes can have multiple types and are stored
>> internally by sysfs, the best strategy is to issue a ->show() in the
>> root cache, and then ->store() in the memcg cache.
>>
>> The drawback of that, is that sysfs can allocate up to a page in
>> buffering for show(), that we are likely not to need, but also can't
>> guarantee. To avoid always allocating a page for that, we can update the
>> caches at store time with the maximum attribute size ever stored to the
>> root cache. We will then get a buffer big enough to hold it. The
>> corollary to this is that if no stores happened, nothing will be
>> propagated.
>>
>> It can also happen that a root cache has its tunables updated during
>> normal system operation. In this case, we will propagate the change to
>> all caches that are already active.
>>
>> Signed-off-by: Glauber Costa <[email protected]>
>> CC: Christoph Lameter <[email protected]>
>> CC: Pekka Enberg <[email protected]>
>> CC: Michal Hocko <[email protected]>
>> CC: Kamezawa Hiroyuki <[email protected]>
>> CC: Johannes Weiner <[email protected]>
>> CC: Suleiman Souhlal <[email protected]>
>> CC: Tejun Heo <[email protected]>
>> ---
>
> Hi guys,
>
> This patch is making lockdep angry! *bark bark*
>
> [ 351.935003] ======================================================
> [ 351.937693] [ INFO: possible circular locking dependency detected ]
> [ 351.939720] 3.7.0-rc4-next-20121106-sasha-00008-g353b62f #117 Tainted: G W
> [ 351.942444] -------------------------------------------------------
> [ 351.943528] trinity-child13/6961 is trying to acquire lock:
> [ 351.943528] (s_active#43){++++.+}, at: [<ffffffff812f9e11>] sysfs_addrm_finish+0x31/0x60
> [ 351.943528]
> [ 351.943528] but task is already holding lock:
> [ 351.943528] (slab_mutex){+.+.+.}, at: [<ffffffff81228a42>] kmem_cache_destroy+0x22/0xe0
> [ 351.943528]
> [ 351.943528] which lock already depends on the new lock.
> [ 351.943528]
> [ 351.943528]
> [ 351.943528] the existing dependency chain (in reverse order) is:
> [ 351.943528]
> -> #1 (slab_mutex){+.+.+.}:
> [ 351.960334] [<ffffffff8118536a>] lock_acquire+0x1aa/0x240
> [ 351.960334] [<ffffffff83a944d9>] __mutex_lock_common+0x59/0x5a0
> [ 351.960334] [<ffffffff83a94a5f>] mutex_lock_nested+0x3f/0x50
> [ 351.960334] [<ffffffff81256a6e>] slab_attr_store+0xde/0x110
> [ 351.960334] [<ffffffff812f820a>] sysfs_write_file+0xfa/0x150
> [ 351.960334] [<ffffffff8127a220>] vfs_write+0xb0/0x180
> [ 351.960334] [<ffffffff8127a540>] sys_pwrite64+0x60/0xb0
> [ 351.960334] [<ffffffff83a99298>] tracesys+0xe1/0xe6
> [ 351.960334]
> -> #0 (s_active#43){++++.+}:
> [ 351.960334] [<ffffffff811825af>] __lock_acquire+0x14df/0x1ca0
> [ 351.960334] [<ffffffff8118536a>] lock_acquire+0x1aa/0x240
> [ 351.960334] [<ffffffff812f9272>] sysfs_deactivate+0x122/0x1a0
> [ 351.960334] [<ffffffff812f9e11>] sysfs_addrm_finish+0x31/0x60
> [ 351.960334] [<ffffffff812fa369>] sysfs_remove_dir+0x89/0xd0
> [ 351.960334] [<ffffffff819e1d96>] kobject_del+0x16/0x40
> [ 351.960334] [<ffffffff8125ed40>] __kmem_cache_shutdown+0x40/0x60
> [ 351.960334] [<ffffffff81228a60>] kmem_cache_destroy+0x40/0xe0
> [ 351.960334] [<ffffffff82b21058>] mon_text_release+0x78/0xe0
> [ 351.960334] [<ffffffff8127b3b2>] __fput+0x122/0x2d0
> [ 351.960334] [<ffffffff8127b569>] ____fput+0x9/0x10
> [ 351.960334] [<ffffffff81131b4e>] task_work_run+0xbe/0x100
> [ 351.960334] [<ffffffff81110742>] do_exit+0x432/0xbd0
> [ 351.960334] [<ffffffff81110fa4>] do_group_exit+0x84/0xd0
> [ 351.960334] [<ffffffff8112431d>] get_signal_to_deliver+0x81d/0x930
> [ 351.960334] [<ffffffff8106d5aa>] do_signal+0x3a/0x950
> [ 351.960334] [<ffffffff8106df1e>] do_notify_resume+0x3e/0x90
> [ 351.960334] [<ffffffff83a993aa>] int_signal+0x12/0x17
> [ 351.960334]
> [ 351.960334] other info that might help us debug this:
> [ 351.960334]
> [ 351.960334] Possible unsafe locking scenario:
> [ 351.960334]
> [ 351.960334] CPU0 CPU1
> [ 351.960334] ---- ----
> [ 351.960334] lock(slab_mutex);
> [ 351.960334] lock(s_active#43);
> [ 351.960334] lock(slab_mutex);
> [ 351.960334] lock(s_active#43);
> [ 351.960334]
> [ 351.960334] *** DEADLOCK ***
> [ 351.960334]
> [ 351.960334] 2 locks held by trinity-child13/6961:
> [ 351.960334] #0: (mon_lock){+.+.+.}, at: [<ffffffff82b21005>] mon_text_release+0x25/0xe0
> [ 351.960334] #1: (slab_mutex){+.+.+.}, at: [<ffffffff81228a42>] kmem_cache_destroy+0x22/0xe0
> [ 351.960334]
> [ 351.960334] stack backtrace:
> [ 351.960334] Pid: 6961, comm: trinity-child13 Tainted: G W 3.7.0-rc4-next-20121106-sasha-00008-g353b62f #117
> [ 351.960334] Call Trace:
> [ 351.960334] [<ffffffff83a3c736>] print_circular_bug+0x1fb/0x20c
> [ 351.960334] [<ffffffff811825af>] __lock_acquire+0x14df/0x1ca0
> [ 351.960334] [<ffffffff81184045>] ? debug_check_no_locks_freed+0x185/0x1e0
> [ 351.960334] [<ffffffff8118536a>] lock_acquire+0x1aa/0x240
> [ 351.960334] [<ffffffff812f9e11>] ? sysfs_addrm_finish+0x31/0x60
> [ 351.960334] [<ffffffff812f9272>] sysfs_deactivate+0x122/0x1a0
> [ 351.960334] [<ffffffff812f9e11>] ? sysfs_addrm_finish+0x31/0x60
> [ 351.960334] [<ffffffff812f9e11>] sysfs_addrm_finish+0x31/0x60
> [ 351.960334] [<ffffffff812fa369>] sysfs_remove_dir+0x89/0xd0
> [ 351.960334] [<ffffffff819e1d96>] kobject_del+0x16/0x40
> [ 351.960334] [<ffffffff8125ed40>] __kmem_cache_shutdown+0x40/0x60
> [ 351.960334] [<ffffffff81228a60>] kmem_cache_destroy+0x40/0xe0
> [ 351.960334] [<ffffffff82b21058>] mon_text_release+0x78/0xe0
> [ 351.960334] [<ffffffff8127b3b2>] __fput+0x122/0x2d0
> [ 351.960334] [<ffffffff8127b569>] ____fput+0x9/0x10
> [ 351.960334] [<ffffffff81131b4e>] task_work_run+0xbe/0x100
> [ 351.960334] [<ffffffff81110742>] do_exit+0x432/0xbd0
> [ 351.960334] [<ffffffff811243b9>] ? get_signal_to_deliver+0x8b9/0x930
> [ 351.960334] [<ffffffff8117d402>] ? get_lock_stats+0x22/0x70
> [ 351.960334] [<ffffffff8117d48e>] ? put_lock_stats.isra.16+0xe/0x40
> [ 351.960334] [<ffffffff83a977fb>] ? _raw_spin_unlock_irq+0x2b/0x80
> [ 351.960334] [<ffffffff81110fa4>] do_group_exit+0x84/0xd0
> [ 351.960334] [<ffffffff8112431d>] get_signal_to_deliver+0x81d/0x930
> [ 351.960334] [<ffffffff8117d48e>] ? put_lock_stats.isra.16+0xe/0x40
> [ 351.960334] [<ffffffff8106d5aa>] do_signal+0x3a/0x950
> [ 351.960334] [<ffffffff811c8b33>] ? rcu_cleanup_after_idle+0x23/0x170
> [ 351.960334] [<ffffffff811cc1c4>] ? rcu_eqs_exit_common+0x64/0x3a0
> [ 351.960334] [<ffffffff811caa5d>] ? rcu_user_enter+0x10d/0x140
> [ 351.960334] [<ffffffff811cc8d5>] ? rcu_user_exit+0xc5/0xf0
> [ 351.960334] [<ffffffff8106df1e>] do_notify_resume+0x3e/0x90
> [ 351.960334] [<ffffffff83a993aa>] int_signal+0x12/0x17
>
>
> Thanks,
> Sasha

Hello Sasha,

May I ask how you triggered this?


2012-11-08 07:13:31

by Glauber Costa

[permalink] [raw]
Subject: Re: [PATCH v6 25/29] memcg/sl[au]b: shrink dead caches

On 11/07/2012 11:46 PM, Andrew Morton wrote:
> On Wed, 7 Nov 2012 10:22:17 +0100
> Glauber Costa <[email protected]> wrote:
>
>>>>> container synchronously. If those objects are normally left floating
>>>>> around in an allocated but reclaimable state then we can address that
>>>>> by synchronously freeing them if their container has been destroyed.
>>>>>
>>>>> Or something like that. If it's something else then fine, but not this.
>>>>>
>>>>> What do we need to do to fix this?
>>>>>
>>>> The original patch had an unlikely() test in the free path, conditional
>>>> on whether or not the cache is dead, that would then call this if the
>>>> cache would now be empty.
>>>>
>>>> I got several requests to remove it and change it to something like
>>>> this, because that is a fast path (I myself think an unlikely branch is
>>>> not that bad)
>>>>
>>>> If you think such a test is acceptable, I can bring it back and argue on
>>>> the basis of "akpm made me do it!". But meanwhile I will give this extra
>>>> thought to see if there is any alternative way I can do it...
>>>
>>> OK, thanks, please do take a look at it.
>>>
>>> I'd be interested in seeing the old version of the patch which had this
>>> test-n-branch. Perhaps there's some trick we can pull to lessen its cost.
>>>
>> Attached.
>>
>> This is the last version that used it (well, I believe it is). There are
>> other unrelated things in this patch, which I got rid of. Look for
>> kmem_cache_verify_dead().
>>
>> In summary, all calls to the free function would, as a last step, call
>> kmem_cache_verify_dead(), which would either be an empty placeholder or:
>>
>> +static inline void kmem_cache_verify_dead(struct kmem_cache *s)
>> +{
>> +	if (unlikely(s->memcg_params.dead))
>> +		schedule_work(&s->memcg_params.cache_shrinker);
>> +}
>
> hm, a few things.
>
> What's up with kmem_cache_shrink? It's global and exported to modules
> but its only external caller is some weird and hopelessly poorly
> documented site down in drivers/acpi/osl.c. slab and slob implement
> kmem_cache_shrink() *only* for acpi! wtf? Let's work out what acpi is
> trying to actually do there, then do it properly, then killkillkill!
>
> Secondly, as slab and slub (at least) have the ability to shed cached
> memory, why aren't they hooked into the core cache-shrinking machinery.
> After all, it's called "shrink_slab"!
>
>
> If we can fix all that up then I wonder whether this particular patch
> needs to exist at all. If the kmem_cache is no longer used then we
> can simply leave it floating around in memory and the regular cache
> shrinking code out of shrink_slab() will clean up any remaining pages.
> The kmem_cache itself can be reclaimed via another shrinker, if
> necessary?
>

So my motivation here is that when you free the last object in a cache,
or even the last object on a specific page, it won't necessarily free
the page.

The page is left there in the system until kmem_cache_shrink() is called.
Because I am taking action on pages, not objects, I would like them to
be released, so I know the cache went down and I can destroy it. As a
matter of fact, slub at least, when kmem_cache_destroy() is explicitly
called, will call flush_slab(), which is pretty much the core of
kmem_cache_shrink().

shrink_slab() will only call into caches with a registered shrinker, so
my fear was that if I don't call kmem_cache_shrink() explicitly, that
memory may not ever be released.

If you have any idea about how to fix that, I am all ears. I don't actually
like this patch very much, it was a PITA to get right =(
I would love to ditch it.

Maybe we can do this from vmscan.c? It would still be calling the same
function, so we don't get any beauty points for that, but at least it
would be done in a place that makes sense.
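
To make the idea concrete, here is a rough sketch of the kind of hook being
discussed. It is only illustrative: the dead_caches list, its lock and the
dead_list linkage are assumptions for the example, and the callback follows
the 3.7-era ->shrink() shrinker interface.

#include <linux/list.h>
#include <linux/mutex.h>
#include <linux/shrinker.h>
#include <linux/slab.h>

static LIST_HEAD(dead_caches);			/* assumed: dead memcg caches */
static DEFINE_MUTEX(dead_caches_lock);		/* assumed: protects the list */

static int dead_cache_shrink(struct shrinker *shrink, struct shrink_control *sc)
{
	struct kmem_cache *s;

	if (!sc->nr_to_scan)
		return 0;

	mutex_lock(&dead_caches_lock);
	/* Trim every dead cache so its empty pages go back to the system. */
	list_for_each_entry(s, &dead_caches, memcg_params.dead_list)
		kmem_cache_shrink(s);
	mutex_unlock(&dead_caches_lock);

	return 0;
}

static struct shrinker dead_cache_shrinker = {
	.shrink	= dead_cache_shrink,
	.seeks	= DEFAULT_SEEKS,
};
/* register_shrinker(&dead_cache_shrinker) at init time would wire this
 * into shrink_slab(), so reclaim pressure also trims dead caches. */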

2012-11-08 11:05:21

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v6 19/29] memcg: infrastructure to match an allocation to the right cache

On Tue 06-11-12 09:03:54, Michal Hocko wrote:
> On Mon 05-11-12 16:28:37, Andrew Morton wrote:
> > On Thu, 1 Nov 2012 16:07:35 +0400
> > Glauber Costa <[email protected]> wrote:
> >
> > > +static __always_inline struct kmem_cache *
> > > +memcg_kmem_get_cache(struct kmem_cache *cachep, gfp_t gfp)
> >
> > I still don't understand why this code uses __always_inline so much.
>
> AFAIU, __always_inline (resp. __attribute__((always_inline))) is the
> same thing as inline if optimizations are enabled
> (http://ohse.de/uwe/articles/gcc-attributes.html#func-always_inline).

And this doesn't tell the whole story, because there is -fearly-inlining,
which is enabled by default and makes a difference when optimizations
are enabled, so __always_inline really enforces inlining.

--
Michal Hocko
SUSE Labs

2012-11-08 14:33:59

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v6 19/29] memcg: infrastructure to match an allocation to the right cache

On Thu 08-11-12 12:05:13, Michal Hocko wrote:
> On Tue 06-11-12 09:03:54, Michal Hocko wrote:
> > On Mon 05-11-12 16:28:37, Andrew Morton wrote:
> > > On Thu, 1 Nov 2012 16:07:35 +0400
> > > Glauber Costa <[email protected]> wrote:
> > >
> > > > +static __always_inline struct kmem_cache *
> > > > +memcg_kmem_get_cache(struct kmem_cache *cachep, gfp_t gfp)
> > >
> > > I still don't understand why this code uses __always_inline so much.
> >
> > AFAIU, __always_inline (resp. __attribute__((always_inline))) is the
> > same thing as inline if optimizations are enabled
> > (http://ohse.de/uwe/articles/gcc-attributes.html#func-always_inline).
>
> And this doesn't tell the whole story, because there is -fearly-inlining,
> which is enabled by default and makes a difference when optimizations
> are enabled, so __always_inline really enforces inlining.

and -fearly-inlining is another doc trap. I have tried with -O2
-fno-early-inlining, and the __always_inline code was inlined with gcc
4.3 and 4.7 while plain inline was ignored, so it really seems that
__always_inline is always inlined; the man page is just a little bit
mean about telling us all the details.
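
A tiny stand-alone test of the kind described above (hypothetical, not taken
from the thread): build it with gcc -O2 -fno-early-inlining and inspect the
generated assembly to see which helper still shows up as a call.

/* Mirrors the kernel's definition of __always_inline for a
 * stand-alone test file. */
#define __always_inline inline __attribute__((always_inline))

/* gcc is free to ignore this plain hint. */
static inline int twice_plain(int x)
{
	return 2 * x;
}

/* always_inline forces the body into the caller regardless. */
static __always_inline int twice_forced(int x)
{
	return 2 * x;
}

int caller(int x)
{
	return twice_plain(x) + twice_forced(x);
}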

--
Michal Hocko
SUSE Labs

Subject: Re: [PATCH v6 25/29] memcg/sl[au]b: shrink dead caches

On Wed, 7 Nov 2012, Andrew Morton wrote:

> What's up with kmem_cache_shrink? It's global and exported to modules
> but its only external caller is some weird and hopelessly poorly
> documented site down in drivers/acpi/osl.c. slab and slob implement
> kmem_cache_shrink() *only* for acpi! wtf? Let's work out what acpi is
> trying to actually do there, then do it properly, then killkillkill!

kmem_cache_shrink is also used internally. It's simply releasing unused
cached objects.

> Secondly, as slab and slub (at least) have the ability to shed cached
> memory, why aren't they hooked into the core cache-shrinking machinery.
> After all, it's called "shrink_slab"!

Because the core cache shrinking needs the slab caches to free up memory
from inodes and dentries. We could call kmem_cache_shrink at the end of
the shrink passes in vmscan. The price would be that the caches would have
to be repopulated when new allocations occur.
>
> If we can fix all that up then I wonder whether this particular patch
> needs to exist at all. If the kmem_cache is no longer used then we
> can simply leave it floating around in memory and the regular cache
> shrinking code out of shrink_slab() will clean up any remaining pages.
> The kmem_cache itself can be reclaimed via another shrinker, if
> necessary?

The kmem_cache can only be released if all its objects (used and unused)
are released. kmem_cache_shrink drops the unused objects on some internal
slab specific list. That may enable us to release the kmem_cache
structure.

2012-11-08 19:21:23

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH v6 25/29] memcg/sl[au]b: shrink dead caches

On Thu, 8 Nov 2012 17:15:36 +0000
Christoph Lameter <[email protected]> wrote:

> On Wed, 7 Nov 2012, Andrew Morton wrote:
>
> > What's up with kmem_cache_shrink? It's global and exported to modules
> > but its only external caller is some weird and hopelessly poorly
> > documented site down in drivers/acpi/osl.c. slab and slob implement
> > kmem_cache_shrink() *only* for acpi! wtf? Let's work out what acpi is
> > trying to actually do there, then do it properly, then killkillkill!
>
> kmem_cache_shrink is also used internally. It's simply releasing unused
> cached objects.

Only in slub. It could be removed outright from the others and
simplified in slub.

> > Secondly, as slab and slub (at least) have the ability to shed cached
> > memory, why aren't they hooked into the core cache-shrinking machinery.
> > After all, it's called "shrink_slab"!
>
> Because the core cache shrinking needs the slab caches to free up memory
> from inodes and dentries. We could call kmem_cache_shrink at the end of
> the shrink passes in vmscan. The price would be that the caches would have
> to be repopulated when new allocations occur.

Well, the shrinker shouldn't strip away all the cache. It will perform
a partial trim, the magnitude of which increases with perceived
external memory pressure.

AFAICT, this is correct and desirable behaviour for shrinking
slab's internal caches.

> >
> > If we can fix all that up then I wonder whether this particular patch
> > needs to exist at all. If the kmem_cache is no longer used then we
> > can simply leave it floating around in memory and the regular cache
> > shrinking code out of shrink_slab() will clean up any remaining pages.
> > The kmem_cache itself can be reclaimed via another shrinker, if
> > necessary?
>
> The kmem_cache can only be released if all its objects (used and unused)
> are released. kmem_cache_shrink drops the unused objects on some internal
> slab specific list. That may enable us to release the kmem_cache
> structure.

2012-11-08 22:31:47

by Glauber Costa

[permalink] [raw]
Subject: Re: [PATCH v6 25/29] memcg/sl[au]b: shrink dead caches

On 11/08/2012 08:21 PM, Andrew Morton wrote:
> On Thu, 8 Nov 2012 17:15:36 +0000
> Christoph Lameter <[email protected]> wrote:
>
>> On Wed, 7 Nov 2012, Andrew Morton wrote:
>>
>>> What's up with kmem_cache_shrink? It's global and exported to modules
>>> but its only external caller is some weird and hopelessly poorly
>>> documented site down in drivers/acpi/osl.c. slab and slob implement
>>> kmem_cache_shrink() *only* for acpi! wtf? Let's work out what acpi is
>>> trying to actually do there, then do it properly, then killkillkill!
>>
>> kmem_cache_shrink is also used internally. It's simply releasing unused
>> cached objects.
>
> Only in slub. It could be removed outright from the others and
> simplified in slub.
>
>>> Secondly, as slab and slub (at least) have the ability to shed cached
>>> memory, why aren't they hooked into the core cache-shrinking machinery.
>>> After all, it's called "shrink_slab"!
>>
>> Because the core cache shrinking needs the slab caches to free up memory
>> from inodes and dentries. We could call kmem_cache_shrink at the end of
>> the shrink passes in vmscan. The price would be that the caches would have
>> to be repopulated when new allocations occur.
>
> Well, the shrinker shouldn't strip away all the cache. It will perform
> a partial trim, the magnitude of which increases with perceived
> external memory pressure.
>
> AFAICT, this is correct and desirable behaviour for shrinking
> slab's internal caches.
>

I believe calling this from shrink_slab() is not a bad idea at all. If
you're all in favour, I'll cook a patch for this soon

2012-11-08 22:40:46

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH v6 25/29] memcg/sl[au]b: shrink dead caches

On Thu, 8 Nov 2012 23:31:16 +0100
Glauber Costa <[email protected]> wrote:

> On 11/08/2012 08:21 PM, Andrew Morton wrote:
> > On Thu, 8 Nov 2012 17:15:36 +0000
> > Christoph Lameter <[email protected]> wrote:
> >
> >> On Wed, 7 Nov 2012, Andrew Morton wrote:
> >>
> >>> What's up with kmem_cache_shrink? It's global and exported to modules
> >>> but its only external caller is some weird and hopelessly poorly
> >>> documented site down in drivers/acpi/osl.c. slab and slob implement
> >>> kmem_cache_shrink() *only* for acpi! wtf? Let's work out what acpi is
> >>> trying to actually do there, then do it properly, then killkillkill!
> >>
> >> kmem_cache_shrink is also used internally. It's simply releasing unused
> >> cached objects.
> >
> > Only in slub. It could be removed outright from the others and
> > simplified in slub.
> >
> >>> Secondly, as slab and slub (at least) have the ability to shed cached
> >>> memory, why aren't they hooked into the core cache-shrinking machinery.
> >>> After all, it's called "shrink_slab"!
> >>
> >> Because the core cache shrinking needs the slab caches to free up memory
> >> from inodes and dentries. We could call kmem_cache_shrink at the end of
> >> the shrink passes in vmscan. The price would be that the caches would have
> >> to be repopulated when new allocations occur.
> >
> > Well, the shrinker shouldn't strip away all the cache. It will perform
> > a partial trim, the magnitude of which increases with perceived
> > external memory pressure.
> >
> > AFAICT, this is correct and desirable behaviour for shrinking
> > slab's internal caches.
> >
>
> I believe calling this from shrink_slab() is not a bad idea at all. If
> you're all in favour, I'll cook a patch for this soon

It sounds like a pretty big change but yes, well worth exploring.

I'd still like to give ACPI a thwap. That kmem_cache_shrink() in
drivers/acpi/osl.c was added unchangelogged in a megapatch
(73459f73e5d1602c59) so it's a mystery. Cc's optimistically added.

2012-11-09 03:37:56

by Sasha Levin

[permalink] [raw]
Subject: Re: [PATCH v6 28/29] slub: slub-specific propagation changes.

On 11/08/2012 01:51 AM, Glauber Costa wrote:
> On 11/07/2012 04:53 PM, Sasha Levin wrote:
>> On 11/01/2012 08:07 AM, Glauber Costa wrote:
>>> SLUB allows us to tune a particular cache behavior with sysfs-based
>>> tunables. When creating a new memcg cache copy, we'd like to preserve
>>> any tunables the parent cache already had.
>>>
>>> This can be done by tapping into the store attribute function provided
>>> by the allocator. We of course don't need to mess with read-only
>>> fields. Since the attributes can have multiple types and are stored
>>> internally by sysfs, the best strategy is to issue a ->show() in the
>>> root cache, and then ->store() in the memcg cache.
>>>
>>> The drawback of that, is that sysfs can allocate up to a page in
>>> buffering for show(), that we are likely not to need, but also can't
>>> guarantee. To avoid always allocating a page for that, we can update the
>>> caches at store time with the maximum attribute size ever stored to the
>>> root cache. We will then get a buffer big enough to hold it. The
>>> corollary to this is that if no stores happened, nothing will be
>>> propagated.
>>>
>>> It can also happen that a root cache has its tunables updated during
>>> normal system operation. In this case, we will propagate the change to
>>> all caches that are already active.
>>>
>>> Signed-off-by: Glauber Costa <[email protected]>
>>> CC: Christoph Lameter <[email protected]>
>>> CC: Pekka Enberg <[email protected]>
>>> CC: Michal Hocko <[email protected]>
>>> CC: Kamezawa Hiroyuki <[email protected]>
>>> CC: Johannes Weiner <[email protected]>
>>> CC: Suleiman Souhlal <[email protected]>
>>> CC: Tejun Heo <[email protected]>
>>> ---
>>
>> Hi guys,
>>
>> This patch is making lockdep angry! *bark bark*
>>
>> [ 351.935003] ======================================================
>> [ 351.937693] [ INFO: possible circular locking dependency detected ]
>> [ 351.939720] 3.7.0-rc4-next-20121106-sasha-00008-g353b62f #117 Tainted: G W
>> [ 351.942444] -------------------------------------------------------
>> [ 351.943528] trinity-child13/6961 is trying to acquire lock:
>> [ 351.943528] (s_active#43){++++.+}, at: [<ffffffff812f9e11>] sysfs_addrm_finish+0x31/0x60
>> [ 351.943528]
>> [ 351.943528] but task is already holding lock:
>> [ 351.943528] (slab_mutex){+.+.+.}, at: [<ffffffff81228a42>] kmem_cache_destroy+0x22/0xe0
>> [ 351.943528]
>> [ 351.943528] which lock already depends on the new lock.
>> [ 351.943528]
>> [ 351.943528]
>> [ 351.943528] the existing dependency chain (in reverse order) is:
>> [ 351.943528]
>> -> #1 (slab_mutex){+.+.+.}:
>> [ 351.960334] [<ffffffff8118536a>] lock_acquire+0x1aa/0x240
>> [ 351.960334] [<ffffffff83a944d9>] __mutex_lock_common+0x59/0x5a0
>> [ 351.960334] [<ffffffff83a94a5f>] mutex_lock_nested+0x3f/0x50
>> [ 351.960334] [<ffffffff81256a6e>] slab_attr_store+0xde/0x110
>> [ 351.960334] [<ffffffff812f820a>] sysfs_write_file+0xfa/0x150
>> [ 351.960334] [<ffffffff8127a220>] vfs_write+0xb0/0x180
>> [ 351.960334] [<ffffffff8127a540>] sys_pwrite64+0x60/0xb0
>> [ 351.960334] [<ffffffff83a99298>] tracesys+0xe1/0xe6
>> [ 351.960334]
>> -> #0 (s_active#43){++++.+}:
>> [ 351.960334] [<ffffffff811825af>] __lock_acquire+0x14df/0x1ca0
>> [ 351.960334] [<ffffffff8118536a>] lock_acquire+0x1aa/0x240
>> [ 351.960334] [<ffffffff812f9272>] sysfs_deactivate+0x122/0x1a0
>> [ 351.960334] [<ffffffff812f9e11>] sysfs_addrm_finish+0x31/0x60
>> [ 351.960334] [<ffffffff812fa369>] sysfs_remove_dir+0x89/0xd0
>> [ 351.960334] [<ffffffff819e1d96>] kobject_del+0x16/0x40
>> [ 351.960334] [<ffffffff8125ed40>] __kmem_cache_shutdown+0x40/0x60
>> [ 351.960334] [<ffffffff81228a60>] kmem_cache_destroy+0x40/0xe0
>> [ 351.960334] [<ffffffff82b21058>] mon_text_release+0x78/0xe0
>> [ 351.960334] [<ffffffff8127b3b2>] __fput+0x122/0x2d0
>> [ 351.960334] [<ffffffff8127b569>] ____fput+0x9/0x10
>> [ 351.960334] [<ffffffff81131b4e>] task_work_run+0xbe/0x100
>> [ 351.960334] [<ffffffff81110742>] do_exit+0x432/0xbd0
>> [ 351.960334] [<ffffffff81110fa4>] do_group_exit+0x84/0xd0
>> [ 351.960334] [<ffffffff8112431d>] get_signal_to_deliver+0x81d/0x930
>> [ 351.960334] [<ffffffff8106d5aa>] do_signal+0x3a/0x950
>> [ 351.960334] [<ffffffff8106df1e>] do_notify_resume+0x3e/0x90
>> [ 351.960334] [<ffffffff83a993aa>] int_signal+0x12/0x17
>> [ 351.960334]
>> [ 351.960334] other info that might help us debug this:
>> [ 351.960334]
>> [ 351.960334] Possible unsafe locking scenario:
>> [ 351.960334]
>> [ 351.960334] CPU0 CPU1
>> [ 351.960334] ---- ----
>> [ 351.960334] lock(slab_mutex);
>> [ 351.960334] lock(s_active#43);
>> [ 351.960334] lock(slab_mutex);
>> [ 351.960334] lock(s_active#43);
>> [ 351.960334]
>> [ 351.960334] *** DEADLOCK ***
>> [ 351.960334]
>> [ 351.960334] 2 locks held by trinity-child13/6961:
>> [ 351.960334] #0: (mon_lock){+.+.+.}, at: [<ffffffff82b21005>] mon_text_release+0x25/0xe0
>> [ 351.960334] #1: (slab_mutex){+.+.+.}, at: [<ffffffff81228a42>] kmem_cache_destroy+0x22/0xe0
>> [ 351.960334]
>> [ 351.960334] stack backtrace:
>> [ 351.960334] Pid: 6961, comm: trinity-child13 Tainted: G W 3.7.0-rc4-next-20121106-sasha-00008-g353b62f #117
>> [ 351.960334] Call Trace:
>> [ 351.960334] [<ffffffff83a3c736>] print_circular_bug+0x1fb/0x20c
>> [ 351.960334] [<ffffffff811825af>] __lock_acquire+0x14df/0x1ca0
>> [ 351.960334] [<ffffffff81184045>] ? debug_check_no_locks_freed+0x185/0x1e0
>> [ 351.960334] [<ffffffff8118536a>] lock_acquire+0x1aa/0x240
>> [ 351.960334] [<ffffffff812f9e11>] ? sysfs_addrm_finish+0x31/0x60
>> [ 351.960334] [<ffffffff812f9272>] sysfs_deactivate+0x122/0x1a0
>> [ 351.960334] [<ffffffff812f9e11>] ? sysfs_addrm_finish+0x31/0x60
>> [ 351.960334] [<ffffffff812f9e11>] sysfs_addrm_finish+0x31/0x60
>> [ 351.960334] [<ffffffff812fa369>] sysfs_remove_dir+0x89/0xd0
>> [ 351.960334] [<ffffffff819e1d96>] kobject_del+0x16/0x40
>> [ 351.960334] [<ffffffff8125ed40>] __kmem_cache_shutdown+0x40/0x60
>> [ 351.960334] [<ffffffff81228a60>] kmem_cache_destroy+0x40/0xe0
>> [ 351.960334] [<ffffffff82b21058>] mon_text_release+0x78/0xe0
>> [ 351.960334] [<ffffffff8127b3b2>] __fput+0x122/0x2d0
>> [ 351.960334] [<ffffffff8127b569>] ____fput+0x9/0x10
>> [ 351.960334] [<ffffffff81131b4e>] task_work_run+0xbe/0x100
>> [ 351.960334] [<ffffffff81110742>] do_exit+0x432/0xbd0
>> [ 351.960334] [<ffffffff811243b9>] ? get_signal_to_deliver+0x8b9/0x930
>> [ 351.960334] [<ffffffff8117d402>] ? get_lock_stats+0x22/0x70
>> [ 351.960334] [<ffffffff8117d48e>] ? put_lock_stats.isra.16+0xe/0x40
>> [ 351.960334] [<ffffffff83a977fb>] ? _raw_spin_unlock_irq+0x2b/0x80
>> [ 351.960334] [<ffffffff81110fa4>] do_group_exit+0x84/0xd0
>> [ 351.960334] [<ffffffff8112431d>] get_signal_to_deliver+0x81d/0x930
>> [ 351.960334] [<ffffffff8117d48e>] ? put_lock_stats.isra.16+0xe/0x40
>> [ 351.960334] [<ffffffff8106d5aa>] do_signal+0x3a/0x950
>> [ 351.960334] [<ffffffff811c8b33>] ? rcu_cleanup_after_idle+0x23/0x170
>> [ 351.960334] [<ffffffff811cc1c4>] ? rcu_eqs_exit_common+0x64/0x3a0
>> [ 351.960334] [<ffffffff811caa5d>] ? rcu_user_enter+0x10d/0x140
>> [ 351.960334] [<ffffffff811cc8d5>] ? rcu_user_exit+0xc5/0xf0
>> [ 351.960334] [<ffffffff8106df1e>] do_notify_resume+0x3e/0x90
>> [ 351.960334] [<ffffffff83a993aa>] int_signal+0x12/0x17
>>
>>
>> Thanks,
>> Sasha
>
> Hello Sasha,
>
> May I ask how did you trigger this ?

Fuzzing with trinity, inside a KVM guest.


Thanks,
Sasha

Subject: Re: [PATCH v6 25/29] memcg/sl[au]b: shrink dead caches

On Thu, 8 Nov 2012, Andrew Morton wrote:

> > kmem_cache_shrink is also used internally. It's simply releasing unused
> > cached objects.
>
> Only in slub. It could be removed outright from the others and
> simplified in slub.

Both SLAB and SLUB must purge their queues before closing/destroying a
cache. There is not much code that can be eliminated.

> > Because the core cache shrinking needs the slab caches to free up memory
> > from inodes and dentries. We could call kmem_cache_shrink at the end of
> > the shrink passes in vmscan. The price would be that the caches would have
> > to be repopulated when new allocations occur.
>
> Well, the shrinker shouldn't strip away all the cache. It will perform
> a partial trim, the magnitude of which increases with perceived
> external memory pressure.

The partial trim of the objects cached by SLAB is performed at 2-second
intervals by the cache reaper.

We are talking here about flushing all
the cached objects from the inode and dentry caches etc. in vmscan, right?

Subject: Re: [PATCH v6 25/29] memcg/sl[au]b: shrink dead caches

On Thu, 8 Nov 2012, Andrew Morton wrote:

> I'd still like to give ACPI a thwap. That kmem_cache_shrink() in
> drivers/acpi/osl.c was added unchangelogged in a megapatch
> (73459f73e5d1602c59) so it's a mystery. Cc's optimistically added.

It does not hurt, though, and releasing cached objects when no objects will
be added to or removed from a slab cache is a good thing to do.

2012-11-14 12:06:40

by Glauber Costa

[permalink] [raw]
Subject: Re: [PATCH v6 28/29] slub: slub-specific propagation changes.

On 11/09/2012 07:37 AM, Sasha Levin wrote:
> On 11/08/2012 01:51 AM, Glauber Costa wrote:
>> On 11/07/2012 04:53 PM, Sasha Levin wrote:
>>> On 11/01/2012 08:07 AM, Glauber Costa wrote:
>>>> SLUB allows us to tune a particular cache behavior with sysfs-based
>>>> tunables. When creating a new memcg cache copy, we'd like to preserve
>>>> any tunables the parent cache already had.
>>>>
>>>> This can be done by tapping into the store attribute function provided
>>>> by the allocator. We of course don't need to mess with read-only
>>>> fields. Since the attributes can have multiple types and are stored
>>>> internally by sysfs, the best strategy is to issue a ->show() in the
>>>> root cache, and then ->store() in the memcg cache.
>>>>
>>>> The drawback of that, is that sysfs can allocate up to a page in
>>>> buffering for show(), that we are likely not to need, but also can't
>>>> guarantee. To avoid always allocating a page for that, we can update the
>>>> caches at store time with the maximum attribute size ever stored to the
>>>> root cache. We will then get a buffer big enough to hold it. The
>>>> corollary to this is that if no stores happened, nothing will be
>>>> propagated.
>>>>
>>>> It can also happen that a root cache has its tunables updated during
>>>> normal system operation. In this case, we will propagate the change to
>>>> all caches that are already active.
>>>>
>>>> Signed-off-by: Glauber Costa <[email protected]>
>>>> CC: Christoph Lameter <[email protected]>
>>>> CC: Pekka Enberg <[email protected]>
>>>> CC: Michal Hocko <[email protected]>
>>>> CC: Kamezawa Hiroyuki <[email protected]>
>>>> CC: Johannes Weiner <[email protected]>
>>>> CC: Suleiman Souhlal <[email protected]>
>>>> CC: Tejun Heo <[email protected]>
>>>> ---
>>>
>>> Hi guys,
>>>
>>> This patch is making lockdep angry! *bark bark*
>>>
>>> [ 351.935003] ======================================================
>>> [ 351.937693] [ INFO: possible circular locking dependency detected ]
>>> [ 351.939720] 3.7.0-rc4-next-20121106-sasha-00008-g353b62f #117 Tainted: G W
>>> [ 351.942444] -------------------------------------------------------
>>> [ 351.943528] trinity-child13/6961 is trying to acquire lock:
>>> [ 351.943528] (s_active#43){++++.+}, at: [<ffffffff812f9e11>] sysfs_addrm_finish+0x31/0x60
>>> [ 351.943528]
>>> [ 351.943528] but task is already holding lock:
>>> [ 351.943528] (slab_mutex){+.+.+.}, at: [<ffffffff81228a42>] kmem_cache_destroy+0x22/0xe0
>>> [ 351.943528]
>>> [ 351.943528] which lock already depends on the new lock.
>>> [ 351.943528]
>>> [ 351.943528]
>>> [ 351.943528] the existing dependency chain (in reverse order) is:
>>> [ 351.943528]
>>> -> #1 (slab_mutex){+.+.+.}:
>>> [ 351.960334] [<ffffffff8118536a>] lock_acquire+0x1aa/0x240
>>> [ 351.960334] [<ffffffff83a944d9>] __mutex_lock_common+0x59/0x5a0
>>> [ 351.960334] [<ffffffff83a94a5f>] mutex_lock_nested+0x3f/0x50
>>> [ 351.960334] [<ffffffff81256a6e>] slab_attr_store+0xde/0x110
>>> [ 351.960334] [<ffffffff812f820a>] sysfs_write_file+0xfa/0x150
>>> [ 351.960334] [<ffffffff8127a220>] vfs_write+0xb0/0x180
>>> [ 351.960334] [<ffffffff8127a540>] sys_pwrite64+0x60/0xb0
>>> [ 351.960334] [<ffffffff83a99298>] tracesys+0xe1/0xe6
>>> [ 351.960334]
>>> -> #0 (s_active#43){++++.+}:
>>> [ 351.960334] [<ffffffff811825af>] __lock_acquire+0x14df/0x1ca0
>>> [ 351.960334] [<ffffffff8118536a>] lock_acquire+0x1aa/0x240
>>> [ 351.960334] [<ffffffff812f9272>] sysfs_deactivate+0x122/0x1a0
>>> [ 351.960334] [<ffffffff812f9e11>] sysfs_addrm_finish+0x31/0x60
>>> [ 351.960334] [<ffffffff812fa369>] sysfs_remove_dir+0x89/0xd0
>>> [ 351.960334] [<ffffffff819e1d96>] kobject_del+0x16/0x40
>>> [ 351.960334] [<ffffffff8125ed40>] __kmem_cache_shutdown+0x40/0x60
>>> [ 351.960334] [<ffffffff81228a60>] kmem_cache_destroy+0x40/0xe0
>>> [ 351.960334] [<ffffffff82b21058>] mon_text_release+0x78/0xe0
>>> [ 351.960334] [<ffffffff8127b3b2>] __fput+0x122/0x2d0
>>> [ 351.960334] [<ffffffff8127b569>] ____fput+0x9/0x10
>>> [ 351.960334] [<ffffffff81131b4e>] task_work_run+0xbe/0x100
>>> [ 351.960334] [<ffffffff81110742>] do_exit+0x432/0xbd0
>>> [ 351.960334] [<ffffffff81110fa4>] do_group_exit+0x84/0xd0
>>> [ 351.960334] [<ffffffff8112431d>] get_signal_to_deliver+0x81d/0x930
>>> [ 351.960334] [<ffffffff8106d5aa>] do_signal+0x3a/0x950
>>> [ 351.960334] [<ffffffff8106df1e>] do_notify_resume+0x3e/0x90
>>> [ 351.960334] [<ffffffff83a993aa>] int_signal+0x12/0x17
>>> [ 351.960334]

First: sorry I took so long, I had some problems on my way back from
Spain...

I just managed to reproduce it by following the call chain. In summary:

1) When we store an attribute, we will call sysfs_get_active(), which
will hold the sd->dep_map lock, where 'sd' is the specific dirent.

2) ->store() is called with that held.

3) ->store() will hold the slab_mutex.

4) While destroying the cache with the slab_mutex held, we will
eventually get to kobject_put(), which deep down in the call chain will
end up in sysfs_addrm_finish(), which can take that lock again.

In summary, creating a kmem-limited memcg, storing an attribute in the
global cache, and then deleting the memcg should trigger this. The funny
thing is that I had a test exactly like this in which it didn't trigger,
and now I know why: I was storing attributes for "dentry", which can
stay around for longer until it completely runs out of objects, which
will depend on the vmscan shrinkers kicking in. Storing to a more
short-lived cache will easily trigger this - thanks!

During __kmem_cache_create, we drop the slab_mutex around
sysfs_slab_add. Although the justification for that is a bit different,
I think this is generally sane and the same could be done here.
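
Purely as an illustration of that pattern (a sketch, not the actual fix being
sent): the point is to let the sysfs/kobject teardown, which ends up waiting
on s_active, run with slab_mutex dropped, so it can no longer nest inside an
attribute ->store() that takes slab_mutex. The function below is hypothetical;
only slab_mutex and __kmem_cache_shutdown() come from the trace above.

#include <linux/list.h>
#include <linux/mutex.h>
#include <linux/slab.h>

extern struct mutex slab_mutex;				/* mm/slab.h */
extern int __kmem_cache_shutdown(struct kmem_cache *s);	/* mm/slab.h */

/* Hypothetical destroy path: sysfs teardown runs outside slab_mutex. */
static void example_kmem_cache_destroy(struct kmem_cache *s)
{
	mutex_lock(&slab_mutex);
	list_del(&s->list);		/* cache can no longer be found */
	mutex_unlock(&slab_mutex);

	/*
	 * The sysfs directory removal (and its s_active wait) now happens
	 * without slab_mutex held, breaking the reported
	 * slab_mutex -> s_active -> slab_mutex cycle.
	 */
	__kmem_cache_shutdown(s);

	mutex_lock(&slab_mutex);
	/* ...any remaining bookkeeping that really needs the mutex... */
	mutex_unlock(&slab_mutex);
}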

I will send a patch for this - and other issues - shortly.

Thanks again, Sasha.