Andrew,
As you requested, here are some fixups and clarifications for the kmemcg series.
It also handles one bug reported by Sasha.
Please note that I didn't touch kmem_cache_shrink(): I believe that deserves a
deeper and more thoughtful solution that will take time to brew. I plan to
address that eventually in the scope of per-memcg kernel memory reclaim.
I did, however, remove the delayed_work in favor of a normal worker. Memory
will stay around for longer, but it will be reclaimed eventually, and given
your objections I believe this is a more desirable trade-off.
Please let me know if there is anything you would like to see different, and
sorry for not handling this earlier.
Glauber Costa (7):
memcg: simplify ida initialization
move include of workqueue.h to top of slab.h file
memcg: remove test for current->mm in memcg_stop/resume_kmem_account
memcg: replace __always_inline with plain inline
memcg: get rid of once-per-second cache shrinking for dead memcgs
memcg: add comments clarifying aspects of cache attribute propagation
slub: drop mutex before deleting sysfs entry
include/linux/memcontrol.h | 12 +++++++++---
include/linux/slab.h | 6 +++---
mm/memcontrol.c | 34 ++++++++++------------------------
mm/slab.c | 1 +
mm/slub.c | 34 +++++++++++++++++++++++++++++-----
5 files changed, 52 insertions(+), 35 deletions(-)
--
1.7.11.7
Suggested by akpm. I originally decided to put it closer to the use of
the work struct, but let's move it to the top of the file instead.
Signed-off-by: Glauber Costa <[email protected]>
CC: Michal Hocko <[email protected]>
CC: Kamezawa Hiroyuki <[email protected]>
CC: Johannes Weiner <[email protected]>
CC: Andrew Morton <[email protected]>
---
include/linux/slab.h | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/include/linux/slab.h b/include/linux/slab.h
index 81ee767..18f8c98 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -11,6 +11,8 @@
#include <linux/gfp.h>
#include <linux/types.h>
+#include <linux/workqueue.h>
+
/*
* Flags to pass to kmem_cache_create().
@@ -180,8 +182,6 @@ unsigned int kmem_cache_size(struct kmem_cache *);
#ifndef ARCH_SLAB_MINALIGN
#define ARCH_SLAB_MINALIGN __alignof__(unsigned long long)
#endif
-
-#include <linux/workqueue.h>
/*
* This is the main placeholder for memcg-related information in kmem caches.
* struct kmem_cache will hold a pointer to it, so the memory cost while
--
1.7.11.7
As suggested by akpm, change the manual initialization of our kmem
index to DEFINE_IDA()
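For reference, a minimal sketch of the before/after initialization (illustrative
only, not the actual hunk; the setup helper is hypothetical):

	/* before: run-time initialization, done once on the root memcg path */
	static struct ida kmem_limited_groups;

	static void memcg_ida_setup(void)	/* hypothetical helper, for illustration */
	{
		ida_init(&kmem_limited_groups);
	}

	/* after: statically initialized, so no run-time call (and no
	 * "am I the root memcg?" check around it) is needed at all */
	static DEFINE_IDA(kmem_limited_groups);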
Signed-off-by: Glauber Costa <[email protected]>
CC: Michal Hocko <[email protected]>
CC: Kamezawa Hiroyuki <[email protected]>
CC: Johannes Weiner <[email protected]>
CC: Andrew Morton <[email protected]>
---
mm/memcontrol.c | 5 +----
1 file changed, 1 insertion(+), 4 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 6136fec..c0c6adf 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -569,7 +569,7 @@ static void disarm_sock_keys(struct mem_cgroup *memcg)
* memcg_limited_groups_array_size. It will double each time we have to
* increase it.
*/
-static struct ida kmem_limited_groups;
+static DEFINE_IDA(kmem_limited_groups);
int memcg_limited_groups_array_size;
/*
@@ -5686,9 +5686,6 @@ static int memcg_init_kmem(struct mem_cgroup *memcg, struct cgroup_subsys *ss)
if (ret)
return ret;
- if (mem_cgroup_is_root(memcg))
- ida_init(&kmem_limited_groups);
-
return mem_cgroup_sockets_init(memcg, ss);
};
--
1.7.11.7
Sasha Levin recently reported a lockdep problem resulting from the
new attribute propagation introduced by the kmemcg series. In short,
slab_mutex will be acquired from within the sysfs attribute store
function. This creates a lock dependency that is later taken in the
reverse order when a cache is destroyed, since destruction occurs
with slab_mutex held and then calls into the sysfs directory removal
function.
In this patch, I propose to adopt a strategy close to what
__kmem_cache_create does before calling sysfs_slab_add, and release the
lock before the call to sysfs_slab_remove. This is pretty much the last
operation in the kmem_cache_shutdown() path, so we could do better by
splitting it out and moving that call alone to a later point in the
common code. That will fit nicely once sysfs handling is consistent
between all caches, but would look weird now.
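Schematically, the inversion lockdep reports below looks like this (a
simplified sketch of the two call paths, not actual code):

	/*
	 * Path A (sysfs write to a slab attribute):
	 *   sysfs_write_file()        holds the s_active reference
	 *     -> slab_attr_store()    takes slab_mutex to propagate to children
	 *
	 * Path B (cache destruction):
	 *   kmem_cache_destroy()          takes slab_mutex
	 *     -> __kmem_cache_shutdown()
	 *       -> sysfs_slab_remove()    waits for s_active to drain
	 *
	 * A holds s_active and wants slab_mutex; B holds slab_mutex and
	 * waits for s_active.  Dropping slab_mutex around sysfs_slab_remove()
	 * breaks the cycle.
	 */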
Lockdep info:
[ 351.935003] ======================================================
[ 351.937693] [ INFO: possible circular locking dependency detected ]
[ 351.939720] 3.7.0-rc4-next-20121106-sasha-00008-g353b62f #117 Tainted: G W
[ 351.942444] -------------------------------------------------------
[ 351.943528] trinity-child13/6961 is trying to acquire lock:
[ 351.943528] (s_active#43){++++.+}, at: [<ffffffff812f9e11>] sysfs_addrm_finish+0x31/0x60
[ 351.943528]
[ 351.943528] but task is already holding lock:
[ 351.943528] (slab_mutex){+.+.+.}, at: [<ffffffff81228a42>] kmem_cache_destroy+0x22/0xe0
[ 351.943528]
[ 351.943528] which lock already depends on the new lock.
[ 351.943528]
[ 351.943528]
[ 351.943528] the existing dependency chain (in reverse order) is:
[ 351.943528]
-> #1 (slab_mutex){+.+.+.}:
[ 351.960334] [<ffffffff8118536a>] lock_acquire+0x1aa/0x240
[ 351.960334] [<ffffffff83a944d9>] __mutex_lock_common+0x59/0x5a0
[ 351.960334] [<ffffffff83a94a5f>] mutex_lock_nested+0x3f/0x50
[ 351.960334] [<ffffffff81256a6e>] slab_attr_store+0xde/0x110
[ 351.960334] [<ffffffff812f820a>] sysfs_write_file+0xfa/0x150
[ 351.960334] [<ffffffff8127a220>] vfs_write+0xb0/0x180
[ 351.960334] [<ffffffff8127a540>] sys_pwrite64+0x60/0xb0
[ 351.960334] [<ffffffff83a99298>] tracesys+0xe1/0xe6
[ 351.960334]
-> #0 (s_active#43){++++.+}:
[ 351.960334] [<ffffffff811825af>] __lock_acquire+0x14df/0x1ca0
[ 351.960334] [<ffffffff8118536a>] lock_acquire+0x1aa/0x240
[ 351.960334] [<ffffffff812f9272>] sysfs_deactivate+0x122/0x1a0
[ 351.960334] [<ffffffff812f9e11>] sysfs_addrm_finish+0x31/0x60
[ 351.960334] [<ffffffff812fa369>] sysfs_remove_dir+0x89/0xd0
[ 351.960334] [<ffffffff819e1d96>] kobject_del+0x16/0x40
[ 351.960334] [<ffffffff8125ed40>] __kmem_cache_shutdown+0x40/0x60
[ 351.960334] [<ffffffff81228a60>] kmem_cache_destroy+0x40/0xe0
[ 351.960334] [<ffffffff82b21058>] mon_text_release+0x78/0xe0
[ 351.960334] [<ffffffff8127b3b2>] __fput+0x122/0x2d0
[ 351.960334] [<ffffffff8127b569>] ____fput+0x9/0x10
[ 351.960334] [<ffffffff81131b4e>] task_work_run+0xbe/0x100
[ 351.960334] [<ffffffff81110742>] do_exit+0x432/0xbd0
[ 351.960334] [<ffffffff81110fa4>] do_group_exit+0x84/0xd0
[ 351.960334] [<ffffffff8112431d>] get_signal_to_deliver+0x81d/0x930
[ 351.960334] [<ffffffff8106d5aa>] do_signal+0x3a/0x950
[ 351.960334] [<ffffffff8106df1e>] do_notify_resume+0x3e/0x90
[ 351.960334] [<ffffffff83a993aa>] int_signal+0x12/0x17
[ 351.960334]
[ 351.960334] other info that might help us debug this:
[ 351.960334]
[ 351.960334] Possible unsafe locking scenario:
[ 351.960334]
[ 351.960334] CPU0 CPU1
[ 351.960334] ---- ----
[ 351.960334] lock(slab_mutex);
[ 351.960334] lock(s_active#43);
[ 351.960334] lock(slab_mutex);
[ 351.960334] lock(s_active#43);
[ 351.960334]
[ 351.960334] *** DEADLOCK ***
[ 351.960334]
[ 351.960334] 2 locks held by trinity-child13/6961:
[ 351.960334] #0: (mon_lock){+.+.+.}, at: [<ffffffff82b21005>] mon_text_release+0x25/0xe0
[ 351.960334] #1: (slab_mutex){+.+.+.}, at: [<ffffffff81228a42>] kmem_cache_destroy+0x22/0xe0
[ 351.960334]
[ 351.960334] stack backtrace:
[ 351.960334] Pid: 6961, comm: trinity-child13 Tainted: G W 3.7.0-rc4-next-20121106-sasha-00008-g353b62f #117
[ 351.960334] Call Trace:
[ 351.960334] [<ffffffff83a3c736>] print_circular_bug+0x1fb/0x20c
[ 351.960334] [<ffffffff811825af>] __lock_acquire+0x14df/0x1ca0
[ 351.960334] [<ffffffff81184045>] ? debug_check_no_locks_freed+0x185/0x1e0
[ 351.960334] [<ffffffff8118536a>] lock_acquire+0x1aa/0x240
[ 351.960334] [<ffffffff812f9e11>] ? sysfs_addrm_finish+0x31/0x60
[ 351.960334] [<ffffffff812f9272>] sysfs_deactivate+0x122/0x1a0
[ 351.960334] [<ffffffff812f9e11>] ? sysfs_addrm_finish+0x31/0x60
[ 351.960334] [<ffffffff812f9e11>] sysfs_addrm_finish+0x31/0x60
[ 351.960334] [<ffffffff812fa369>] sysfs_remove_dir+0x89/0xd0
[ 351.960334] [<ffffffff819e1d96>] kobject_del+0x16/0x40
[ 351.960334] [<ffffffff8125ed40>] __kmem_cache_shutdown+0x40/0x60
[ 351.960334] [<ffffffff81228a60>] kmem_cache_destroy+0x40/0xe0
[ 351.960334] [<ffffffff82b21058>] mon_text_release+0x78/0xe0
[ 351.960334] [<ffffffff8127b3b2>] __fput+0x122/0x2d0
[ 351.960334] [<ffffffff8127b569>] ____fput+0x9/0x10
[ 351.960334] [<ffffffff81131b4e>] task_work_run+0xbe/0x100
[ 351.960334] [<ffffffff81110742>] do_exit+0x432/0xbd0
[ 351.960334] [<ffffffff811243b9>] ? get_signal_to_deliver+0x8b9/0x930
[ 351.960334] [<ffffffff8117d402>] ? get_lock_stats+0x22/0x70
[ 351.960334] [<ffffffff8117d48e>] ? put_lock_stats.isra.16+0xe/0x40
[ 351.960334] [<ffffffff83a977fb>] ? _raw_spin_unlock_irq+0x2b/0x80
[ 351.960334] [<ffffffff81110fa4>] do_group_exit+0x84/0xd0
[ 351.960334] [<ffffffff8112431d>] get_signal_to_deliver+0x81d/0x930
[ 351.960334] [<ffffffff8117d48e>] ? put_lock_stats.isra.16+0xe/0x40
[ 351.960334] [<ffffffff8106d5aa>] do_signal+0x3a/0x950
[ 351.960334] [<ffffffff811c8b33>] ? rcu_cleanup_after_idle+0x23/0x170
[ 351.960334] [<ffffffff811cc1c4>] ? rcu_eqs_exit_common+0x64/0x3a0
[ 351.960334] [<ffffffff811caa5d>] ? rcu_user_enter+0x10d/0x140
[ 351.960334] [<ffffffff811cc8d5>] ? rcu_user_exit+0xc5/0xf0
[ 351.960334] [<ffffffff8106df1e>] do_notify_resume+0x3e/0x90
[ 351.960334] [<ffffffff83a993aa>] int_signal+0x12/0x17
Signed-off-by: Glauber Costa <[email protected]>
Reported-by: Sasha Levin <[email protected]>
CC: Michal Hocko <[email protected]>
CC: Kamezawa Hiroyuki <[email protected]>
CC: Johannes Weiner <[email protected]>
CC: Andrew Morton <[email protected]>
CC: Christoph Lameter <[email protected]>
CC: Pekka Enberg <[email protected]>
---
mm/slub.c | 13 ++++++++++++-
1 file changed, 12 insertions(+), 1 deletion(-)
diff --git a/mm/slub.c b/mm/slub.c
index fead2cd..0769ccc 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3197,8 +3197,19 @@ int __kmem_cache_shutdown(struct kmem_cache *s)
{
int rc = kmem_cache_close(s);
- if (!rc)
+ if (!rc) {
+ /*
+ * We do the same lock strategy around sysfs_slab_add, see
+ * __kmem_cache_create. Because this is pretty much the last
+ * operation we do and the lock will be released shortly after
+ * that in slab_common.c, we could just move sysfs_slab_remove
+ * to a later point in common code. We should do that when we
+ * have a common sysfs framework for all allocators.
+ */
+ mutex_unlock(&slab_mutex);
sysfs_slab_remove(s);
+ mutex_lock(&slab_mutex);
+ }
return rc;
}
--
1.7.11.7
The idea is to do this synchronously and leave any further shrinking to
the reclaim facilities in vmscan.c and/or others. Not actively retrying
the shrink may keep the caches alive for longer, but it removes the ugly
periodic wakeups. One could argue that if the caches have free objects
but are not being shrunk, it is because we don't need that memory yet.
Signed-off-by: Glauber Costa <[email protected]>
CC: Michal Hocko <[email protected]>
CC: Kamezawa Hiroyuki <[email protected]>
CC: Johannes Weiner <[email protected]>
CC: Andrew Morton <[email protected]>
---
include/linux/slab.h | 2 +-
mm/memcontrol.c | 17 +++++++----------
2 files changed, 8 insertions(+), 11 deletions(-)
diff --git a/include/linux/slab.h b/include/linux/slab.h
index 18f8c98..456c327 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -214,7 +214,7 @@ struct memcg_cache_params {
struct kmem_cache *root_cache;
bool dead;
atomic_t nr_pages;
- struct delayed_work destroy;
+ struct work_struct destroy;
};
};
};
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index f9c5981..e3d805f 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3077,9 +3077,8 @@ static void kmem_cache_destroy_work_func(struct work_struct *w)
{
struct kmem_cache *cachep;
struct memcg_cache_params *p;
- struct delayed_work *dw = to_delayed_work(w);
- p = container_of(dw, struct memcg_cache_params, destroy);
+ p = container_of(w, struct memcg_cache_params, destroy);
cachep = memcg_params_to_cache(p);
@@ -3103,8 +3102,6 @@ static void kmem_cache_destroy_work_func(struct work_struct *w)
kmem_cache_shrink(cachep);
if (atomic_read(&cachep->memcg_params->nr_pages) == 0)
return;
- /* Once per minute should be good enough. */
- schedule_delayed_work(&cachep->memcg_params->destroy, 60 * HZ);
} else
kmem_cache_destroy(cachep);
}
@@ -3127,18 +3124,18 @@ void mem_cgroup_destroy_cache(struct kmem_cache *cachep)
* kmem_cache_shrink is enough to shake all the remaining objects and
* get the page count to 0. In this case, we'll deadlock if we try to
* cancel the work (the worker runs with an internal lock held, which
- * is the same lock we would hold for cancel_delayed_work_sync().)
+ * is the same lock we would hold for cancel_work_sync().)
*
* Since we can't possibly know who got us here, just refrain from
* running if there is already work pending
*/
- if (delayed_work_pending(&cachep->memcg_params->destroy))
+ if (work_pending(&cachep->memcg_params->destroy))
return;
/*
* We have to defer the actual destroying to a workqueue, because
* we might currently be in a context that cannot sleep.
*/
- schedule_delayed_work(&cachep->memcg_params->destroy, 0);
+ schedule_work(&cachep->memcg_params->destroy);
}
static char *memcg_cache_name(struct mem_cgroup *memcg, struct kmem_cache *s)
@@ -3261,7 +3258,7 @@ void kmem_cache_destroy_memcg_children(struct kmem_cache *s)
* set, so flip it down to guarantee we are in control.
*/
c->memcg_params->dead = false;
- cancel_delayed_work_sync(&c->memcg_params->destroy);
+ cancel_work_sync(&c->memcg_params->destroy);
kmem_cache_destroy(c);
}
mutex_unlock(&set_limit_mutex);
@@ -3285,9 +3282,9 @@ static void mem_cgroup_destroy_all_caches(struct mem_cgroup *memcg)
list_for_each_entry(params, &memcg->memcg_slab_caches, list) {
cachep = memcg_params_to_cache(params);
cachep->memcg_params->dead = true;
- INIT_DELAYED_WORK(&cachep->memcg_params->destroy,
+ INIT_WORK(&cachep->memcg_params->destroy,
kmem_cache_destroy_work_func);
- schedule_delayed_work(&cachep->memcg_params->destroy, 0);
+ schedule_work(&cachep->memcg_params->destroy);
}
mutex_unlock(&memcg->slab_caches_mutex);
}
--
1.7.11.7
The original reason for the existence of this test was that
memcg_kmem_cache_create could be called either from softirq context
(where memcg_stop/resume_account is not needed) or from process context
(where it is needed). Just skipping the accounting in-function was the
cleanest way to merge both behaviors, because we would try to create
caches right away through memcg_kmem_cache_create whenever the context
allowed us to.
However, the version of the code that finally got merged does not have
this behavior and we always queue up new cache creation. Thus, instead
of a comment explaining why the current->mm test is needed, my proposal
in this patch is to remove memcg_stop/resume_account from the worker
thread and make sure all callers have a valid mm context.
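To illustrate the intended use of the pair (a sketch, not part of the
patch; the allocation shown is only an example):

	/*
	 * memcg_kmem_skip_account is a per-task counter consulted by the
	 * kmem charging path: while it is non-zero, allocations made by
	 * this task are not charged to its memcg.  With this patch both
	 * helpers assume a real process context, hence VM_BUG_ON(!current->mm).
	 */
	void *ptr;

	memcg_stop_kmem_account();		/* current->memcg_kmem_skip_account++ */
	ptr = kmalloc(PAGE_SIZE, GFP_KERNEL);	/* internal allocation, not accounted */
	memcg_resume_kmem_account();		/* current->memcg_kmem_skip_account-- */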
Signed-off-by: Glauber Costa <[email protected]>
CC: Michal Hocko <[email protected]>
CC: Kamezawa Hiroyuki <[email protected]>
CC: Johannes Weiner <[email protected]>
CC: Andrew Morton <[email protected]>
---
mm/memcontrol.c | 12 ++----------
1 file changed, 2 insertions(+), 10 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c0c6adf..f9c5981 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3063,17 +3063,13 @@ out:
*/
static inline void memcg_stop_kmem_account(void)
{
- if (!current->mm)
- return;
-
+ VM_BUG_ON(!current->mm);
current->memcg_kmem_skip_account++;
}
static inline void memcg_resume_kmem_account(void)
{
- if (!current->mm)
- return;
-
+ VM_BUG_ON(!current->mm);
current->memcg_kmem_skip_account--;
}
@@ -3206,11 +3202,7 @@ static struct kmem_cache *memcg_create_kmem_cache(struct mem_cgroup *memcg,
if (new_cachep)
goto out;
- /* Don't block progress to enqueue caches for internal infrastructure */
- memcg_stop_kmem_account();
new_cachep = kmem_cache_dup(memcg, cachep);
- memcg_resume_kmem_account();
-
if (new_cachep == NULL) {
new_cachep = cachep;
goto out;
--
1.7.11.7
Following the pattern found in the allocators, where we do our best to
keep the fast paths free of function calls, all the externally visible
functions for kmemcg were marked __always_inline.
It is fair to say, however, that this should be up to the compiler. We
will still keep as much of the flag testing as we can in memcontrol.h to
give the compiler the option to inline it, but we won't force it.
I tested this with gcc 4.7.2: it still inlines all three functions when
compiling with -O2, and refrains from doing so when compiling with -Os.
This seems like a good behavior.
Signed-off-by: Glauber Costa <[email protected]>
CC: Michal Hocko <[email protected]>
CC: Kamezawa Hiroyuki <[email protected]>
CC: Johannes Weiner <[email protected]>
CC: Andrew Morton <[email protected]>
---
include/linux/memcontrol.h | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index c91e3c1..17d0d41 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -467,7 +467,7 @@ void kmem_cache_destroy_memcg_children(struct kmem_cache *s);
* We return true automatically if this allocation is not to be accounted to
* any memcg.
*/
-static __always_inline bool
+static inline bool
memcg_kmem_newpage_charge(gfp_t gfp, struct mem_cgroup **memcg, int order)
{
if (!memcg_kmem_enabled())
@@ -499,7 +499,7 @@ memcg_kmem_newpage_charge(gfp_t gfp, struct mem_cgroup **memcg, int order)
*
* there is no need to specify memcg here, since it is embedded in page_cgroup
*/
-static __always_inline void
+static inline void
memcg_kmem_uncharge_pages(struct page *page, int order)
{
if (memcg_kmem_enabled())
@@ -517,7 +517,7 @@ memcg_kmem_uncharge_pages(struct page *page, int order)
* charges. Otherwise, it will commit the memcg given by @memcg to the
* corresponding page_cgroup.
*/
-static __always_inline void
+static inline void
memcg_kmem_commit_charge(struct page *page, struct mem_cgroup *memcg, int order)
{
if (memcg_kmem_enabled() && memcg)
--
1.7.11.7
This patch clarifies two aspects of cache attribute propagation.
First, the expected locking context for the for_each_memcg_cache_index
macro in memcontrol.h. The usages already in the codebase are safe. In
mm/slub.c it is trivially safe, because the lock is acquired right
before the loop. In mm/slab.c it is less so: the lock is acquired by an
outer function a few steps back in the stack, so a VM_BUG_ON() is added
to make sure it is indeed safe.
A comment is also added to explain why we return the value of the
parent cache and ignore the children's when we propagate the
attributes.
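As an illustration of the expected calling convention (a sketch using
helpers from this series; root_cache is just a placeholder name):

	int i;

	mutex_lock(&slab_mutex);
	for_each_memcg_cache_index(i) {
		struct kmem_cache *c = cache_from_memcg(root_cache, i);

		if (!c)		/* slot may be empty: this memcg never used the cache */
			continue;
		/* operate on the per-memcg child cache here */
	}
	mutex_unlock(&slab_mutex);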
Signed-off-by: Glauber Costa <[email protected]>
CC: Michal Hocko <[email protected]>
CC: Kamezawa Hiroyuki <[email protected]>
CC: Johannes Weiner <[email protected]>
CC: Andrew Morton <[email protected]>
---
include/linux/memcontrol.h | 6 ++++++
mm/slab.c | 1 +
mm/slub.c | 21 +++++++++++++++++----
3 files changed, 24 insertions(+), 4 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 17d0d41..48eddec 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -415,6 +415,12 @@ static inline void sock_release_memcg(struct sock *sk)
extern struct static_key memcg_kmem_enabled_key;
extern int memcg_limited_groups_array_size;
+
+/*
+ * Helper macro to loop through all memcg-specific caches. Callers must still
+ * check if the cache is valid (it is either valid or NULL).
+ * The slab_mutex must be held when looping through those caches.
+ */
#define for_each_memcg_cache_index(_idx) \
for ((_idx) = 0; i < memcg_limited_groups_array_size; (_idx)++)
diff --git a/mm/slab.c b/mm/slab.c
index 699d1d42..d408bf731 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -4165,6 +4165,7 @@ static int do_tune_cpucache(struct kmem_cache *cachep, int limit,
if ((ret < 0) || !is_root_cache(cachep))
return ret;
+ VM_BUG_ON(!mutex_is_locked(&slab_mutex));
for_each_memcg_cache_index(i) {
c = cache_from_memcg(cachep, i);
if (c)
diff --git a/mm/slub.c b/mm/slub.c
index 56a8db2..fead2cd 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -5199,12 +5199,25 @@ static ssize_t slab_attr_store(struct kobject *kobj,
if (s->max_attr_size < len)
s->max_attr_size = len;
+ /*
+ * This is a best effort propagation, so this function's return
+ * value will be determined by the parent cache only. This is
+ * basically because not all attributes will have a well
+ * defined semantics for rollbacks - most of the actions will
+ * have permanent effects.
+ *
+ * Returning the error value of any of the children that fail
+ * is not 100 % defined, in the sense that users seeing the
+ * error code won't be able to know anything about the state of
+ * the cache.
+ *
+ * Only returning the error code for the parent cache at least
+ * has well defined semantics. The cache being written to
+ * directly either failed or succeeded, in which case we loop
+ * through the descendants with best-effort propagation.
+ */
for_each_memcg_cache_index(i) {
struct kmem_cache *c = cache_from_memcg(s, i);
- /*
- * This function's return value is determined by the
- * parent cache only
- */
if (c)
attribute->store(c, buf, len);
}
--
1.7.11.7
On Thu, 15 Nov 2012, Glauber Costa wrote:
> Andrew,
>
> As you requested, here are some fixups and clarifications for the kmemcg series.
> It also handles one bug reported by Sasha.
>
On the series:
Acked-by: David Rientjes <[email protected]>
(2012/11/15 11:54), Glauber Costa wrote:
> The original reason for the existence of this test, was that
> memcg_kmem_cache_create could be called from either softirq context
> (where memcg_stop/resume_account is not needed), or process context,
> (where memcg_stop/resume_account is needed). Just skipping it
> in-function was the cleanest way to merge both behaviors. The reason for
> that is that we would try to create caches right away through
> memcg_kmem_cache_create if the context would allow us to.
>
> However, the final version of the code that merged did not have this
> behavior and we always queue up new cache creation. Thus, instead of a
> comment explaining why current->mm test is needed, my proposal in this
> patch is to remove memcg_stop/resume_account from the worker thread and
> make sure all callers have a valid mm context.
>
> Signed-off-by: Glauber Costa <[email protected]>
> CC: Michal Hocko <[email protected]>
> CC: Kamezawa Hiroyuki <[email protected]>
> CC: Johannes Weiner <[email protected]>
> CC: Andrew Morton <[email protected]>
Seems ok to me. But do we need the VM_BUG_ON()?
It seems the functions called under memcg_stop_kmem_account() don't access
current->mm...
Anyway.
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
(2012/11/15 11:54), Glauber Costa wrote:
> Following the pattern found in the allocators, where we do our best to
> the fast paths function-call free, all the externally visible functions
> for kmemcg were marked __always_inline.
>
> It is fair to say, however, that this should be up to the compiler. We
> will still keep as much of the flag testing as we can in memcontrol.h to
> give the compiler the option to inline it, but won't force it.
>
> I tested this with 4.7.2, it will inline all three functions anyway when
> compiling with -O2, and will refrain from it when compiling with -Os.
> This seems like a good behavior.
>
> Signed-off-by: Glauber Costa <[email protected]>
> CC: Michal Hocko <[email protected]>
> CC: Kamezawa Hiroyuki <[email protected]>
> CC: Johannes Weiner <[email protected]>
> CC: Andrew Morton <[email protected]>
I'm O.K. with this.
Acked-by: KAMEZAWA Hiroyuki <[email protected]>
(2012/11/15 11:54), Glauber Costa wrote:
> Suggested by akpm. I originally decided to put it closer to the use of
> the work struct, but let's move it to top.
>
> Signed-off-by: Glauber Costa <[email protected]>
> CC: Michal Hocko <[email protected]>
> CC: Kamezawa Hiroyuki <[email protected]>
> CC: Johannes Weiner <[email protected]>
> CC: Andrew Morton <[email protected]>
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
(2012/11/15 11:54), Glauber Costa wrote:
> The idea is to synchronously do it, leaving it up to the shrinking
> facilities in vmscan.c and/or others. Not actively retrying shrinking
> may leave the caches alive for more time, but it will remove the ugly
> wakeups. One would argue that if the caches have free objects but are
> not being shrunk, it is because we don't need that memory yet.
>
> Signed-off-by: Glauber Costa <[email protected]>
> CC: Michal Hocko <[email protected]>
> CC: Kamezawa Hiroyuki <[email protected]>
> CC: Johannes Weiner <[email protected]>
> CC: Andrew Morton <[email protected]>
I agree this patch but can we have a way to see the number of unaccounted
zombie cache usage for debugging ?
Acked-by: KAMEZAWA Hiroyuki <[email protected]>
On 11/15/2012 01:41 PM, Kamezawa Hiroyuki wrote:
> (2012/11/15 11:54), Glauber Costa wrote:
>> The idea is to synchronously do it, leaving it up to the shrinking
>> facilities in vmscan.c and/or others. Not actively retrying shrinking
>> may leave the caches alive for more time, but it will remove the ugly
>> wakeups. One would argue that if the caches have free objects but are
>> not being shrunk, it is because we don't need that memory yet.
>>
>> Signed-off-by: Glauber Costa <[email protected]>
>> CC: Michal Hocko <[email protected]>
>> CC: Kamezawa Hiroyuki <[email protected]>
>> CC: Johannes Weiner <[email protected]>
>> CC: Andrew Morton <[email protected]>
>
> I agree this patch but can we have a way to see the number of unaccounted
> zombie cache usage for debugging ?
>
> Acked-by: KAMEZAWA Hiroyuki <[email protected]>
>
Any particular interface in mind ?
(2012/11/15 22:47), Glauber Costa wrote:
> On 11/15/2012 01:41 PM, Kamezawa Hiroyuki wrote:
>> (2012/11/15 11:54), Glauber Costa wrote:
>>> The idea is to synchronously do it, leaving it up to the shrinking
>>> facilities in vmscan.c and/or others. Not actively retrying shrinking
>>> may leave the caches alive for more time, but it will remove the ugly
>>> wakeups. One would argue that if the caches have free objects but are
>>> not being shrunk, it is because we don't need that memory yet.
>>>
>>> Signed-off-by: Glauber Costa <[email protected]>
>>> CC: Michal Hocko <[email protected]>
>>> CC: Kamezawa Hiroyuki <[email protected]>
>>> CC: Johannes Weiner <[email protected]>
>>> CC: Andrew Morton <[email protected]>
>>
>> I agree this patch but can we have a way to see the number of unaccounted
>> zombie cache usage for debugging ?
>>
>> Acked-by: KAMEZAWA Hiroyuki <[email protected]>
>>
> Any particular interface in mind ?
>
Hmm, it's debug interface and having cgroup file may be bad.....
If it can be seen in bytes or some, /proc/vmstat ?
out_of_track_slabs xxxxxxx. hm ?
Thanks,
-Kame
On 11/16/2012 09:07 AM, Kamezawa Hiroyuki wrote:
> (2012/11/15 22:47), Glauber Costa wrote:
>> On 11/15/2012 01:41 PM, Kamezawa Hiroyuki wrote:
>>> (2012/11/15 11:54), Glauber Costa wrote:
>>>> The idea is to synchronously do it, leaving it up to the shrinking
>>>> facilities in vmscan.c and/or others. Not actively retrying shrinking
>>>> may leave the caches alive for more time, but it will remove the ugly
>>>> wakeups. One would argue that if the caches have free objects but are
>>>> not being shrunk, it is because we don't need that memory yet.
>>>>
>>>> Signed-off-by: Glauber Costa <[email protected]>
>>>> CC: Michal Hocko <[email protected]>
>>>> CC: Kamezawa Hiroyuki <[email protected]>
>>>> CC: Johannes Weiner <[email protected]>
>>>> CC: Andrew Morton <[email protected]>
>>>
>>> I agree this patch but can we have a way to see the number of unaccounted
>>> zombie cache usage for debugging ?
>>>
>>> Acked-by: KAMEZAWA Hiroyuki <[email protected]>
>>>
>> Any particular interface in mind ?
>>
>
> Hmm, it's debug interface and having cgroup file may be bad.....
> If it can be seen in bytes or some, /proc/vmstat ?
>
> out_of_track_slabs xxxxxxx. hm ?
>
I particularly think that, this being a debug interface, it is also
useful to have an indication of which caches are still in place. This is
because the cache itself is the best indication we have about the
specific workload that may be keeping it in memory.
I first thought debugfs could help us probe useful information out of
it, but given all the abuse people have inflicted on debugfs... maybe we
could have a file in the root memcg with that information for all
removed memcgs? If we do that, we can go further and list the memcgs
that are pending due to memsw as well. memory.dangling_memcgs?
(2012/11/16 16:11), Glauber Costa wrote:
> On 11/16/2012 09:07 AM, Kamezawa Hiroyuki wrote:
>> (2012/11/15 22:47), Glauber Costa wrote:
>>> On 11/15/2012 01:41 PM, Kamezawa Hiroyuki wrote:
>>>> (2012/11/15 11:54), Glauber Costa wrote:
>>>>> The idea is to synchronously do it, leaving it up to the shrinking
>>>>> facilities in vmscan.c and/or others. Not actively retrying shrinking
>>>>> may leave the caches alive for more time, but it will remove the ugly
>>>>> wakeups. One would argue that if the caches have free objects but are
>>>>> not being shrunk, it is because we don't need that memory yet.
>>>>>
>>>>> Signed-off-by: Glauber Costa <[email protected]>
>>>>> CC: Michal Hocko <[email protected]>
>>>>> CC: Kamezawa Hiroyuki <[email protected]>
>>>>> CC: Johannes Weiner <[email protected]>
>>>>> CC: Andrew Morton <[email protected]>
>>>>
>>>> I agree this patch but can we have a way to see the number of unaccounted
>>>> zombie cache usage for debugging ?
>>>>
>>>> Acked-by: KAMEZAWA Hiroyuki <[email protected]>
>>>>
>>> Any particular interface in mind ?
>>>
>>
>> Hmm, it's debug interface and having cgroup file may be bad.....
>> If it can be seen in bytes or some, /proc/vmstat ?
>>
>> out_of_track_slabs xxxxxxx. hm ?
>>
>
> I particularly think that, being this a debug interface, it is also
> useful to have an indication of which caches are still in place. This is
> because the cache itself, is the best indication we have about the
> specific workload that may be keeping it in memory.
>
> I first thought debugfs could help us probing useful information out of
> it, but given all the abuse people inflicted in debugfs... maybe we
> could have a file in the root memcg with that information for all
> removed memcgs? If we do that, we can go further and list the memcgs
> that are pending due to memsw as well. memory.dangling_memcgs ?
>
Hm, I'm ok with it... others ?
Thanks,
-Kame
On Fri 16-11-12 16:21:59, KAMEZAWA Hiroyuki wrote:
> (2012/11/16 16:11), Glauber Costa wrote:
> > On 11/16/2012 09:07 AM, Kamezawa Hiroyuki wrote:
> >> (2012/11/15 22:47), Glauber Costa wrote:
> >>> On 11/15/2012 01:41 PM, Kamezawa Hiroyuki wrote:
> >>>> (2012/11/15 11:54), Glauber Costa wrote:
> >>>>> The idea is to synchronously do it, leaving it up to the shrinking
> >>>>> facilities in vmscan.c and/or others. Not actively retrying shrinking
> >>>>> may leave the caches alive for more time, but it will remove the ugly
> >>>>> wakeups. One would argue that if the caches have free objects but are
> >>>>> not being shrunk, it is because we don't need that memory yet.
> >>>>>
> >>>>> Signed-off-by: Glauber Costa <[email protected]>
> >>>>> CC: Michal Hocko <[email protected]>
> >>>>> CC: Kamezawa Hiroyuki <[email protected]>
> >>>>> CC: Johannes Weiner <[email protected]>
> >>>>> CC: Andrew Morton <[email protected]>
> >>>>
> >>>> I agree this patch but can we have a way to see the number of unaccounted
> >>>> zombie cache usage for debugging ?
> >>>>
> >>>> Acked-by: KAMEZAWA Hiroyuki <[email protected]>
> >>>>
> >>> Any particular interface in mind ?
> >>>
> >>
> >> Hmm, it's debug interface and having cgroup file may be bad.....
> >> If it can be seen in bytes or some, /proc/vmstat ?
> >>
> >> out_of_track_slabs xxxxxxx. hm ?
> >>
> >
> > I particularly think that, being this a debug interface, it is also
> > useful to have an indication of which caches are still in place. This is
> > because the cache itself, is the best indication we have about the
> > specific workload that may be keeping it in memory.
> >
> > I first thought debugfs could help us probing useful information out of
> > it, but given all the abuse people inflicted in debugfs... maybe we
> > could have a file in the root memcg with that information for all
> > removed memcgs? If we do that, we can go further and list the memcgs
> > that are pending due to memsw as well. memory.dangling_memcgs ?
> >
>
> Hm, I'm ok with it... others ?
What about memory.kmem.dangling_caches?
--
Michal Hocko
SUSE Labs
On 11/16/2012 06:55 PM, Michal Hocko wrote:
> On Fri 16-11-12 16:21:59, KAMEZAWA Hiroyuki wrote:
>> (2012/11/16 16:11), Glauber Costa wrote:
>>> On 11/16/2012 09:07 AM, Kamezawa Hiroyuki wrote:
>>>> (2012/11/15 22:47), Glauber Costa wrote:
>>>>> On 11/15/2012 01:41 PM, Kamezawa Hiroyuki wrote:
>>>>>> (2012/11/15 11:54), Glauber Costa wrote:
>>>>>>> The idea is to synchronously do it, leaving it up to the shrinking
>>>>>>> facilities in vmscan.c and/or others. Not actively retrying shrinking
>>>>>>> may leave the caches alive for more time, but it will remove the ugly
>>>>>>> wakeups. One would argue that if the caches have free objects but are
>>>>>>> not being shrunk, it is because we don't need that memory yet.
>>>>>>>
>>>>>>> Signed-off-by: Glauber Costa <[email protected]>
>>>>>>> CC: Michal Hocko <[email protected]>
>>>>>>> CC: Kamezawa Hiroyuki <[email protected]>
>>>>>>> CC: Johannes Weiner <[email protected]>
>>>>>>> CC: Andrew Morton <[email protected]>
>>>>>>
>>>>>> I agree this patch but can we have a way to see the number of unaccounted
>>>>>> zombie cache usage for debugging ?
>>>>>>
>>>>>> Acked-by: KAMEZAWA Hiroyuki <[email protected]>
>>>>>>
>>>>> Any particular interface in mind ?
>>>>>
>>>>
>>>> Hmm, it's debug interface and having cgroup file may be bad.....
>>>> If it can be seen in bytes or some, /proc/vmstat ?
>>>>
>>>> out_of_track_slabs xxxxxxx. hm ?
>>>>
>>>
>>> I particularly think that, being this a debug interface, it is also
>>> useful to have an indication of which caches are still in place. This is
>>> because the cache itself, is the best indication we have about the
>>> specific workload that may be keeping it in memory.
>>>
>>> I first thought debugfs could help us probing useful information out of
>>> it, but given all the abuse people inflicted in debugfs... maybe we
>>> could have a file in the root memcg with that information for all
>>> removed memcgs? If we do that, we can go further and list the memcgs
>>> that are pending due to memsw as well. memory.dangling_memcgs ?
>>>
>>
>> Hm, I'm ok with it... others ?
>
> What about memory.kmem.dangling_caches?
>
If that is what it does, sure.
But as I said, kmem is not the only thing that can keep caches in
memory. If we're going for this, maybe we should be more comprehensive
and show it all.