# Why do we need this?
We've noticed that the number of dying cgroups is steadily growing on most
of our hosts in production. The ensuing investigation revealed an issue
in the userspace memory reclaim code [1], in the accounting of kernel stacks [2],
and, as the main culprit, in slab objects.
The underlying problem is quite simple: any page charged to a cgroup holds
a reference to it, so the cgroup can't be released until all charged pages
are gone. If a slab object is actively used by other cgroups, it won't be
reclaimed and will prevent the original cgroup from being released.
Slab objects, and first of all the vfs cache, are shared between cgroups
that use the same underlying filesystem, and, even more importantly, they are
shared between multiple generations of the same workload. So if something runs
periodically in a new cgroup each time (as systemd does), we keep
accumulating dying cgroups.
Strictly speaking, pagecache isn't different here, but there is a key
distinction: we disable protection and apply extra pressure on the LRUs of
dying cgroups, and these LRUs contain all charged pages.
My experiments show that with kernel memory accounting disabled the number
of dying cgroups stabilizes at a relatively small number (~100, depending on
memory pressure and the cgroup creation rate), while with kernel memory
accounting enabled it grows pretty steadily up to several thousand.
Memory cgroups are quite complex and big objects (mostly due to percpu stats),
so this leads to noticeable memory losses. The memory occupied by dying cgroups
is measured in hundreds of megabytes; I've even seen a host with more than 100Gb
of memory wasted on dying cgroups. This leads to performance degradation over
uptime and generally limits the usage of cgroups.
My previous attempt [3] to fix the problem by applying extra pressure on slab
shrinker lists caused regressions with xfs and ext4, and has been reverted [4].
The following attempts to find the right balance [5, 6] were not successful.
So instead of trying to find a balance that may not exist, let's reparent
accounted slab caches to the parent cgroup on cgroup removal.
# Implementation approach
There is, however, a significant problem with reparenting slab memory:
there is no list of charged pages. Some of them are in shrinker lists,
but not all. Introducing a new list is really not an option.
Fortunately, there is a way forward: every slab page has a stable pointer
to the corresponding kmem_cache. So the idea is to reparent kmem_caches
instead of individual slab pages.
It's actually simpler and cheaper, but requires some underlying changes:
1) Make kmem_caches hold a single reference to the memory cgroup,
instead of a separate reference per every slab page.
2) Stop setting the page->mem_cgroup pointer for memcg slab pages and use
the page->kmem_cache->memcg indirection instead (see the sketch after
this list). It's used only on slab page release, so the performance
overhead shouldn't be a big issue.
3) Introduce a refcounter for non-root slab caches. It's required to
be able to destroy kmem_caches when they become empty and to release
the associated memory cgroup.
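To illustrate point 2), here is a condensed preview of the
memcg_from_slab_page() helper added later in the series (mm/slab.h);
note that the field is spelled page->slab_cache in struct page:

static inline struct mem_cgroup *memcg_from_slab_page(struct page *page)
{
	struct kmem_cache *s;

	/* only head slab pages have page->slab_cache set */
	s = READ_ONCE(page->slab_cache);
	if (s && !is_root_cache(s))
		return s->memcg_params.memcg;

	return NULL;
}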
There is a bonus: currently we release all memcg kmem_caches together
with the memory cgroup itself. This patchset allows individual kmem_caches
to be released as soon as they become inactive and free.
Some additional implementation details are provided in corresponding
commit messages.
# Results
Below is the average number of dying cgroups on two groups of our production
hosts. They run some sort of web frontend workload, and the memory pressure
is moderate. As we can see, with kernel memory reparenting the number
stabilizes in the 60s range; with the original version it grows almost
linearly and doesn't show any signs of plateauing. The difference in slab
and percpu usage between the patched and unpatched versions also grows
linearly; in 7 days it exceeded 200Mb.
day            0    1    2    3    4    5    6    7
original      56  362  628  752 1070 1250 1490 1560
patched       23   46   51   55   60   57   67   69
mem diff(Mb)  22   74  123  152  164  182  214  241
# History
v7:
1) refined cover letter and some commit logs
2) dropped patch 1
3) dropped the dying check on kmem_cache creation path
4) dropped __rcu annotation in patch 10, switched to READ_ONCE()/WRITE_ONCE()
where necessary
v6:
1) split biggest patches into parts to make the review easier
2) changed synchronization around the dying flag
3) sysfs entry removal on deactivation is back
4) got rid of redundant rcu wait on kmem_cache release
5) fixed getting memcg pointer in mem_cgroup_from_kmem()
6) fixed a missed smp_rmb()
7) removed redundant CONFIG_SLOB
8) some renames and cosmetic fixes
v5:
1) fixed a compilation warning around missing kmemcg_queue_cache_shutdown()
2) s/rcu_read_lock()/rcu_read_unlock() in memcg_kmem_get_cache()
v4:
1) removed excessive memcg != parent check in memcg_deactivate_kmem_caches()
2) fixed rcu_read_lock() usage in memcg_charge_slab()
3) fixed synchronization around dying flag in kmemcg_queue_cache_shutdown()
4) refreshed test results data
5) reworked PageTail() checks in memcg_from_slab_page()
6) added some comments in multiple places
v3:
1) reworked memcg kmem_cache search on allocation path
2) fixed /proc/kpagecgroup interface
v2:
1) switched to percpu kmem_cache refcounter
2) a reference to kmem_cache is held during the allocation
3) slabs stats are fixed for !MEMCG case (and the refactoring
is separated into a standalone patch)
4) kmem_cache reparenting is performed from the deactivation context
v1:
https://lkml.org/lkml/2019/4/17/1095
# Links
[1]: commit 68600f623d69 ("mm: don't miss the last page because of
round-off error")
[2]: commit 9b6f7e163cd0 ("mm: rework memcg kernel stack accounting")
[3]: commit 172b06c32b94 ("mm: slowly shrink slabs with a relatively
small number of objects")
[4]: commit a9a238e83fbb ("Revert "mm: slowly shrink slabs
with a relatively small number of objects"")
[5]: https://lkml.org/lkml/2019/1/28/1865
[6]: https://marc.info/?l=linux-mm&m=155064763626437&w=2
Roman Gushchin (10):
mm: postpone kmem_cache memcg pointer initialization to
memcg_link_cache()
mm: rename slab delayed deactivation functions and fields
mm: generalize postponed non-root kmem_cache deactivation
mm: introduce __memcg_kmem_uncharge_memcg()
mm: unify SLAB and SLUB page accounting
mm: don't check the dying flag on kmem_cache creation
mm: synchronize access to kmem_cache dying flag using a spinlock
mm: rework non-root kmem_cache lifecycle management
mm: stop setting page->mem_cgroup pointer for slab pages
mm: reparent memcg kmem_caches on cgroup removal
include/linux/memcontrol.h | 10 +++
include/linux/slab.h | 11 +--
mm/list_lru.c | 3 +-
mm/memcontrol.c | 101 ++++++++++++++++-------
mm/slab.c | 25 ++----
mm/slab.h | 143 +++++++++++++++++++++++---------
mm/slab_common.c | 164 ++++++++++++++++++++++---------------
mm/slub.c | 24 +-----
8 files changed, 301 insertions(+), 180 deletions(-)
--
2.21.0
Every slab page charged to a non-root memory cgroup has a pointer
to the memory cgroup and holds a reference to it, which protects
a non-empty memory cgroup from being released. At the same time
the page has a pointer to the corresponding kmem_cache, and also
holds a reference to the kmem_cache. And the kmem_cache by itself
holds a reference to the cgroup.
So there is clearly some redundancy, which allows us to stop setting
the page->mem_cgroup pointer and rely on getting the memcg pointer
indirectly via the kmem_cache. Further, it will make it easier to change
this pointer later, without the need to go over all charged pages.
So let's stop setting the page->mem_cgroup pointer for slab pages,
and stop using the css refcounter directly for protecting
the memory cgroup from going away. Instead rely on the kmem_cache
as an intermediate object.
Make sure that vmstats and shrinker lists keep working as before,
as does the /proc/kpagecgroup interface.
Signed-off-by: Roman Gushchin <[email protected]>
Acked-by: Vladimir Davydov <[email protected]>
---
mm/list_lru.c | 3 +-
mm/memcontrol.c | 12 ++++----
mm/slab.h | 74 ++++++++++++++++++++++++++++++++++++++++---------
3 files changed, 70 insertions(+), 19 deletions(-)
diff --git a/mm/list_lru.c b/mm/list_lru.c
index 927d85be32f6..0f1f6b06b7f3 100644
--- a/mm/list_lru.c
+++ b/mm/list_lru.c
@@ -12,6 +12,7 @@
#include <linux/slab.h>
#include <linux/mutex.h>
#include <linux/memcontrol.h>
+#include "slab.h"
#ifdef CONFIG_MEMCG_KMEM
static LIST_HEAD(list_lrus);
@@ -63,7 +64,7 @@ static __always_inline struct mem_cgroup *mem_cgroup_from_kmem(void *ptr)
if (!memcg_kmem_enabled())
return NULL;
page = virt_to_head_page(ptr);
- return page->mem_cgroup;
+ return memcg_from_slab_page(page);
}
static inline struct list_lru_one *
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 43a42bc3ed3f..25e72779fd33 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -486,7 +486,10 @@ ino_t page_cgroup_ino(struct page *page)
unsigned long ino = 0;
rcu_read_lock();
- memcg = READ_ONCE(page->mem_cgroup);
+ if (PageHead(page) && PageSlab(page))
+ memcg = memcg_from_slab_page(page);
+ else
+ memcg = READ_ONCE(page->mem_cgroup);
while (memcg && !(memcg->css.flags & CSS_ONLINE))
memcg = parent_mem_cgroup(memcg);
if (memcg)
@@ -2807,9 +2810,6 @@ int __memcg_kmem_charge_memcg(struct page *page, gfp_t gfp, int order,
cancel_charge(memcg, nr_pages);
return -ENOMEM;
}
-
- page->mem_cgroup = memcg;
-
return 0;
}
@@ -2832,8 +2832,10 @@ int __memcg_kmem_charge(struct page *page, gfp_t gfp, int order)
memcg = get_mem_cgroup_from_current();
if (!mem_cgroup_is_root(memcg)) {
ret = __memcg_kmem_charge_memcg(page, gfp, order, memcg);
- if (!ret)
+ if (!ret) {
+ page->mem_cgroup = memcg;
__SetPageKmemcg(page);
+ }
}
css_put(&memcg->css);
return ret;
diff --git a/mm/slab.h b/mm/slab.h
index 5d2b8511e6fb..7ead47cb9338 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -255,30 +255,67 @@ static inline struct kmem_cache *memcg_root_cache(struct kmem_cache *s)
return s->memcg_params.root_cache;
}
+/*
+ * Expects a pointer to a slab page. Please note that the PageSlab() check
+ * isn't sufficient, as it returns true also for tail compound slab pages,
+ * which do not have the slab_cache pointer set.
+ * So this function assumes that the page can pass PageHead() and PageSlab()
+ * checks.
+ */
+static inline struct mem_cgroup *memcg_from_slab_page(struct page *page)
+{
+ struct kmem_cache *s;
+
+ s = READ_ONCE(page->slab_cache);
+ if (s && !is_root_cache(s))
+ return s->memcg_params.memcg;
+
+ return NULL;
+}
+
+/*
+ * Charge the slab page belonging to the non-root kmem_cache.
+ * Can be called for non-root kmem_caches only.
+ */
static __always_inline int memcg_charge_slab(struct page *page,
gfp_t gfp, int order,
struct kmem_cache *s)
{
+ struct mem_cgroup *memcg;
+ struct lruvec *lruvec;
int ret;
- if (is_root_cache(s))
- return 0;
-
- ret = memcg_kmem_charge_memcg(page, gfp, order, s->memcg_params.memcg);
+ memcg = s->memcg_params.memcg;
+ ret = memcg_kmem_charge_memcg(page, gfp, order, memcg);
if (ret)
return ret;
+ lruvec = mem_cgroup_lruvec(page_pgdat(page), memcg);
+ mod_lruvec_state(lruvec, cache_vmstat_idx(s), 1 << order);
+
+ /* transfer try_charge() page references to kmem_cache */
percpu_ref_get_many(&s->memcg_params.refcnt, 1 << order);
+ css_put_many(&memcg->css, 1 << order);
return 0;
}
+/*
+ * Uncharge a slab page belonging to a non-root kmem_cache.
+ * Can be called for non-root kmem_caches only.
+ */
static __always_inline void memcg_uncharge_slab(struct page *page, int order,
struct kmem_cache *s)
{
- if (!is_root_cache(s))
- percpu_ref_put_many(&s->memcg_params.refcnt, 1 << order);
- memcg_kmem_uncharge(page, order);
+ struct mem_cgroup *memcg;
+ struct lruvec *lruvec;
+
+ memcg = s->memcg_params.memcg;
+ lruvec = mem_cgroup_lruvec(page_pgdat(page), memcg);
+ mod_lruvec_state(lruvec, cache_vmstat_idx(s), -(1 << order));
+ memcg_kmem_uncharge_memcg(page, order, memcg);
+
+ percpu_ref_put_many(&s->memcg_params.refcnt, 1 << order);
}
extern void slab_init_memcg_params(struct kmem_cache *);
@@ -314,6 +351,11 @@ static inline struct kmem_cache *memcg_root_cache(struct kmem_cache *s)
return s;
}
+static inline struct mem_cgroup *memcg_from_slab_page(struct page *page)
+{
+ return NULL;
+}
+
static inline int memcg_charge_slab(struct page *page, gfp_t gfp, int order,
struct kmem_cache *s)
{
@@ -351,18 +393,24 @@ static __always_inline int charge_slab_page(struct page *page,
gfp_t gfp, int order,
struct kmem_cache *s)
{
- int ret = memcg_charge_slab(page, gfp, order, s);
-
- if (!ret)
- mod_lruvec_page_state(page, cache_vmstat_idx(s), 1 << order);
+ if (is_root_cache(s)) {
+ mod_node_page_state(page_pgdat(page), cache_vmstat_idx(s),
+ 1 << order);
+ return 0;
+ }
- return ret;
+ return memcg_charge_slab(page, gfp, order, s);
}
static __always_inline void uncharge_slab_page(struct page *page, int order,
struct kmem_cache *s)
{
- mod_lruvec_page_state(page, cache_vmstat_idx(s), -(1 << order));
+ if (is_root_cache(s)) {
+ mod_node_page_state(page_pgdat(page), cache_vmstat_idx(s),
+ -(1 << order));
+ return;
+ }
+
memcg_uncharge_slab(page, order, s);
}
--
2.21.0
Let's separate the page counter modification code out of
__memcg_kmem_uncharge() in a way similar to how
__memcg_kmem_charge() and __memcg_kmem_charge_memcg() are split.
This will allow reusing this code later via a new
memcg_kmem_uncharge_memcg() wrapper, which calls
__memcg_kmem_uncharge_memcg() if the memcg_kmem_enabled()
check passes.
Signed-off-by: Roman Gushchin <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
Acked-by: Vladimir Davydov <[email protected]>
---
include/linux/memcontrol.h | 10 ++++++++++
mm/memcontrol.c | 25 +++++++++++++++++--------
2 files changed, 27 insertions(+), 8 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 3ca57bacfdd2..9abf31bbe53a 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1304,6 +1304,8 @@ int __memcg_kmem_charge(struct page *page, gfp_t gfp, int order);
void __memcg_kmem_uncharge(struct page *page, int order);
int __memcg_kmem_charge_memcg(struct page *page, gfp_t gfp, int order,
struct mem_cgroup *memcg);
+void __memcg_kmem_uncharge_memcg(struct mem_cgroup *memcg,
+ unsigned int nr_pages);
extern struct static_key_false memcg_kmem_enabled_key;
extern struct workqueue_struct *memcg_kmem_cache_wq;
@@ -1345,6 +1347,14 @@ static inline int memcg_kmem_charge_memcg(struct page *page, gfp_t gfp,
return __memcg_kmem_charge_memcg(page, gfp, order, memcg);
return 0;
}
+
+static inline void memcg_kmem_uncharge_memcg(struct page *page, int order,
+ struct mem_cgroup *memcg)
+{
+ if (memcg_kmem_enabled())
+ __memcg_kmem_uncharge_memcg(memcg, 1 << order);
+}
+
/*
* helper for accessing a memcg's index. It will be used as an index in the
* child cache array in kmem_cache, and also to derive its name. This function
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index be9344777b29..8eaf553b67f1 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2812,6 +2812,22 @@ int __memcg_kmem_charge(struct page *page, gfp_t gfp, int order)
css_put(&memcg->css);
return ret;
}
+
+/**
+ * __memcg_kmem_uncharge_memcg: uncharge a kmem page
+ * @memcg: memcg to uncharge
+ * @nr_pages: number of pages to uncharge
+ */
+void __memcg_kmem_uncharge_memcg(struct mem_cgroup *memcg,
+ unsigned int nr_pages)
+{
+ if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
+ page_counter_uncharge(&memcg->kmem, nr_pages);
+
+ page_counter_uncharge(&memcg->memory, nr_pages);
+ if (do_memsw_account())
+ page_counter_uncharge(&memcg->memsw, nr_pages);
+}
/**
* __memcg_kmem_uncharge: uncharge a kmem page
* @page: page to uncharge
@@ -2826,14 +2842,7 @@ void __memcg_kmem_uncharge(struct page *page, int order)
return;
VM_BUG_ON_PAGE(mem_cgroup_is_root(memcg), page);
-
- if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
- page_counter_uncharge(&memcg->kmem, nr_pages);
-
- page_counter_uncharge(&memcg->memory, nr_pages);
- if (do_memsw_account())
- page_counter_uncharge(&memcg->memsw, nr_pages);
-
+ __memcg_kmem_uncharge_memcg(memcg, nr_pages);
page->mem_cgroup = NULL;
/* slab pages do not have PageKmemcg flag set */
--
2.21.0
There is no point in checking the root_cache->memcg_params.dying
flag on kmem_cache creation path. New allocations shouldn't be
performed using a dead root kmem_cache, so no new memcg kmem_cache
creation can be scheduled after the flag is set. And if it was
scheduled before, flush_memcg_workqueue() will wait for it anyway.
So let's drop this check to simplify the code.
Signed-off-by: Roman Gushchin <[email protected]>
---
mm/slab_common.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 5e7638f495d1..9383104651cd 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -640,7 +640,7 @@ void memcg_create_kmem_cache(struct mem_cgroup *memcg,
* The memory cgroup could have been offlined while the cache
* creation work was pending.
*/
- if (memcg->kmem_state != KMEM_ONLINE || root_cache->memcg_params.dying)
+ if (memcg->kmem_state != KMEM_ONLINE)
goto out_unlock;
idx = memcg_cache_id(memcg);
--
2.21.0
Currently each charged slab page holds a reference to the cgroup to
which it's charged. Kmem_caches are held by the memcg and are released
together with the memory cgroup. It means that none of the kmem_caches
can be released as long as at least one reference to the memcg exists,
which is very far from optimal.
Let's rework it in a way that allows releasing individual kmem_caches
as soon as the cgroup is offline, the kmem_cache is empty and there
are no pending allocations.
To make it possible, let's introduce a new percpu refcounter for
non-root kmem caches. The counter is initialized to the percpu mode,
and is switched to the atomic mode during kmem_cache deactivation. The
counter is bumped for every charged page and also for every running
allocation. So the kmem_cache can't be released unless all allocations
complete.
To shut down non-active empty kmem_caches, let's reuse the work queue
previously used for kmem_cache deactivation. Once the reference
counter reaches 0, an asynchronous kmem_cache release is scheduled.
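For reference, here is a condensed sketch of where the new refcounter
calls sit. The names follow the hunks below (mm/slab.h, mm/slab_common.c,
mm/memcontrol.c); error handling and the dying-flag check with its locking
are omitted, so treat this as an illustration rather than the actual patch.

/* cache creation (init_memcg_params): the counter starts in percpu mode */
percpu_ref_init(&s->memcg_params.refcnt, kmemcg_cache_shutdown,
		0, GFP_KERNEL);

/* every charged slab page holds 1 << order references
 * (memcg_charge_slab() / memcg_uncharge_slab()) */
percpu_ref_get_many(&s->memcg_params.refcnt, 1 << order);
percpu_ref_put_many(&s->memcg_params.refcnt, 1 << order);

/* every running allocation holds a reference as well
 * (memcg_kmem_get_cache() / memcg_kmem_put_cache()) */
if (percpu_ref_tryget(&s->memcg_params.refcnt))
	/* safe to allocate from this cache */;
percpu_ref_put(&s->memcg_params.refcnt);

/* deactivation (kmemcg_cache_deactivate_after_rcu): switch the counter
 * to atomic mode and drop the initial reference */
percpu_ref_kill(&s->memcg_params.refcnt);

/* release callback: once the counter hits zero, schedule an
 * asynchronous shutdown of the now-empty kmem_cache */
static void kmemcg_cache_shutdown(struct percpu_ref *percpu_ref)
{
	struct kmem_cache *s = container_of(percpu_ref, struct kmem_cache,
					    memcg_params.refcnt);

	s->memcg_params.work_fn = kmemcg_cache_shutdown_fn;
	INIT_WORK(&s->memcg_params.work, kmemcg_workfn);
	queue_work(memcg_kmem_cache_wq, &s->memcg_params.work);
}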
* I used the following simple approach to test the performance
(stolen from another patchset by T. Harding):
time find / -name fname-no-exist
echo 2 > /proc/sys/vm/drop_caches
repeat 10 times
Results:
orig patched
real 0m1.455s real 0m1.355s
user 0m0.206s user 0m0.219s
sys 0m0.855s sys 0m0.807s
real 0m1.487s real 0m1.699s
user 0m0.221s user 0m0.256s
sys 0m0.806s sys 0m0.948s
real 0m1.515s real 0m1.505s
user 0m0.183s user 0m0.215s
sys 0m0.876s sys 0m0.858s
real 0m1.291s real 0m1.380s
user 0m0.193s user 0m0.198s
sys 0m0.843s sys 0m0.786s
real 0m1.364s real 0m1.374s
user 0m0.180s user 0m0.182s
sys 0m0.868s sys 0m0.806s
real 0m1.352s real 0m1.312s
user 0m0.201s user 0m0.212s
sys 0m0.820s sys 0m0.761s
real 0m1.302s real 0m1.349s
user 0m0.205s user 0m0.203s
sys 0m0.803s sys 0m0.792s
real 0m1.334s real 0m1.301s
user 0m0.194s user 0m0.201s
sys 0m0.806s sys 0m0.779s
real 0m1.426s real 0m1.434s
user 0m0.216s user 0m0.181s
sys 0m0.824s sys 0m0.864s
real 0m1.350s real 0m1.295s
user 0m0.200s user 0m0.190s
sys 0m0.842s sys 0m0.811s
So it looks like the difference is not noticeable in this test.
Signed-off-by: Roman Gushchin <[email protected]>
Acked-by: Vladimir Davydov <[email protected]>
---
include/linux/slab.h | 3 +-
mm/memcontrol.c | 50 +++++++++++++++++++++-------
mm/slab.h | 44 +++++++-----------------
mm/slab_common.c | 79 ++++++++++++++++++++++++++------------------
4 files changed, 99 insertions(+), 77 deletions(-)
diff --git a/include/linux/slab.h b/include/linux/slab.h
index 47923c173f30..1b54e5f83342 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -16,6 +16,7 @@
#include <linux/overflow.h>
#include <linux/types.h>
#include <linux/workqueue.h>
+#include <linux/percpu-refcount.h>
/*
@@ -152,7 +153,6 @@ int kmem_cache_shrink(struct kmem_cache *);
void memcg_create_kmem_cache(struct mem_cgroup *, struct kmem_cache *);
void memcg_deactivate_kmem_caches(struct mem_cgroup *);
-void memcg_destroy_kmem_caches(struct mem_cgroup *);
/*
* Please use this macro to create slab caches. Simply specify the
@@ -641,6 +641,7 @@ struct memcg_cache_params {
struct mem_cgroup *memcg;
struct list_head children_node;
struct list_head kmem_caches_node;
+ struct percpu_ref refcnt;
void (*work_fn)(struct kmem_cache *);
union {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 8eaf553b67f1..43a42bc3ed3f 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2672,12 +2672,13 @@ static void memcg_schedule_kmem_cache_create(struct mem_cgroup *memcg,
{
struct memcg_kmem_cache_create_work *cw;
+ if (!css_tryget_online(&memcg->css))
+ return;
+
cw = kmalloc(sizeof(*cw), GFP_NOWAIT | __GFP_NOWARN);
if (!cw)
return;
- css_get(&memcg->css);
-
cw->memcg = memcg;
cw->cachep = cachep;
INIT_WORK(&cw->work, memcg_kmem_cache_create_func);
@@ -2712,6 +2713,7 @@ struct kmem_cache *memcg_kmem_get_cache(struct kmem_cache *cachep)
{
struct mem_cgroup *memcg;
struct kmem_cache *memcg_cachep;
+ struct memcg_cache_array *arr;
int kmemcg_id;
VM_BUG_ON(!is_root_cache(cachep));
@@ -2719,14 +2721,28 @@ struct kmem_cache *memcg_kmem_get_cache(struct kmem_cache *cachep)
if (memcg_kmem_bypass())
return cachep;
- memcg = get_mem_cgroup_from_current();
+ rcu_read_lock();
+
+ if (unlikely(current->active_memcg))
+ memcg = current->active_memcg;
+ else
+ memcg = mem_cgroup_from_task(current);
+
+ if (!memcg || memcg == root_mem_cgroup)
+ goto out_unlock;
+
kmemcg_id = READ_ONCE(memcg->kmemcg_id);
if (kmemcg_id < 0)
- goto out;
+ goto out_unlock;
+
+ arr = rcu_dereference(cachep->memcg_params.memcg_caches);
- memcg_cachep = cache_from_memcg_idx(cachep, kmemcg_id);
- if (likely(memcg_cachep))
- return memcg_cachep;
+ /*
+ * Make sure we will access the up-to-date value. The code updating
+ * memcg_caches issues a write barrier to match the data dependency
+ * barrier inside READ_ONCE() (see memcg_create_kmem_cache()).
+ */
+ memcg_cachep = READ_ONCE(arr->entries[kmemcg_id]);
/*
* If we are in a safe context (can wait, and not in interrupt
@@ -2739,10 +2755,20 @@ struct kmem_cache *memcg_kmem_get_cache(struct kmem_cache *cachep)
* memcg_create_kmem_cache, this means no further allocation
* could happen with the slab_mutex held. So it's better to
* defer everything.
+ *
+ * If the memcg is dying or memcg_cache is about to be released,
+ * don't bother creating new kmem_caches. Because memcg_cachep
+ * is ZEROed as the first step of kmem offlining, we don't need
+ * percpu_ref_tryget_live() here. css_tryget_online() check in
+ * memcg_schedule_kmem_cache_create() will prevent us from
+ * creation of a new kmem_cache.
*/
- memcg_schedule_kmem_cache_create(memcg, cachep);
-out:
- css_put(&memcg->css);
+ if (unlikely(!memcg_cachep))
+ memcg_schedule_kmem_cache_create(memcg, cachep);
+ else if (percpu_ref_tryget(&memcg_cachep->memcg_params.refcnt))
+ cachep = memcg_cachep;
+out_unlock:
+ rcu_read_unlock();
return cachep;
}
@@ -2753,7 +2779,7 @@ struct kmem_cache *memcg_kmem_get_cache(struct kmem_cache *cachep)
void memcg_kmem_put_cache(struct kmem_cache *cachep)
{
if (!is_root_cache(cachep))
- css_put(&cachep->memcg_params.memcg->css);
+ percpu_ref_put(&cachep->memcg_params.refcnt);
}
/**
@@ -3300,7 +3326,7 @@ static void memcg_free_kmem(struct mem_cgroup *memcg)
memcg_offline_kmem(memcg);
if (memcg->kmem_state == KMEM_ALLOCATED) {
- memcg_destroy_kmem_caches(memcg);
+ WARN_ON(!list_empty(&memcg->kmem_caches));
static_branch_dec(&memcg_kmem_enabled_key);
WARN_ON(page_counter_read(&memcg->kmem));
}
diff --git a/mm/slab.h b/mm/slab.h
index 46623a576a3c..5d2b8511e6fb 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -248,31 +248,6 @@ static inline const char *cache_name(struct kmem_cache *s)
return s->name;
}
-/*
- * Note, we protect with RCU only the memcg_caches array, not per-memcg caches.
- * That said the caller must assure the memcg's cache won't go away by either
- * taking a css reference to the owner cgroup, or holding the slab_mutex.
- */
-static inline struct kmem_cache *
-cache_from_memcg_idx(struct kmem_cache *s, int idx)
-{
- struct kmem_cache *cachep;
- struct memcg_cache_array *arr;
-
- rcu_read_lock();
- arr = rcu_dereference(s->memcg_params.memcg_caches);
-
- /*
- * Make sure we will access the up-to-date value. The code updating
- * memcg_caches issues a write barrier to match this (see
- * memcg_create_kmem_cache()).
- */
- cachep = READ_ONCE(arr->entries[idx]);
- rcu_read_unlock();
-
- return cachep;
-}
-
static inline struct kmem_cache *memcg_root_cache(struct kmem_cache *s)
{
if (is_root_cache(s))
@@ -284,14 +259,25 @@ static __always_inline int memcg_charge_slab(struct page *page,
gfp_t gfp, int order,
struct kmem_cache *s)
{
+ int ret;
+
if (is_root_cache(s))
return 0;
- return memcg_kmem_charge_memcg(page, gfp, order, s->memcg_params.memcg);
+
+ ret = memcg_kmem_charge_memcg(page, gfp, order, s->memcg_params.memcg);
+ if (ret)
+ return ret;
+
+ percpu_ref_get_many(&s->memcg_params.refcnt, 1 << order);
+
+ return 0;
}
static __always_inline void memcg_uncharge_slab(struct page *page, int order,
struct kmem_cache *s)
{
+ if (!is_root_cache(s))
+ percpu_ref_put_many(&s->memcg_params.refcnt, 1 << order);
memcg_kmem_uncharge(page, order);
}
@@ -323,12 +309,6 @@ static inline const char *cache_name(struct kmem_cache *s)
return s->name;
}
-static inline struct kmem_cache *
-cache_from_memcg_idx(struct kmem_cache *s, int idx)
-{
- return NULL;
-}
-
static inline struct kmem_cache *memcg_root_cache(struct kmem_cache *s)
{
return s;
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 1e5eaf84bf08..6b7750f7ea33 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -132,6 +132,8 @@ int __kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t nr,
LIST_HEAD(slab_root_caches);
static DEFINE_SPINLOCK(memcg_kmem_wq_lock);
+static void kmemcg_cache_shutdown(struct percpu_ref *percpu_ref);
+
void slab_init_memcg_params(struct kmem_cache *s)
{
s->memcg_params.root_cache = NULL;
@@ -146,6 +148,12 @@ static int init_memcg_params(struct kmem_cache *s,
struct memcg_cache_array *arr;
if (root_cache) {
+ int ret = percpu_ref_init(&s->memcg_params.refcnt,
+ kmemcg_cache_shutdown,
+ 0, GFP_KERNEL);
+ if (ret)
+ return ret;
+
s->memcg_params.root_cache = root_cache;
INIT_LIST_HEAD(&s->memcg_params.children_node);
INIT_LIST_HEAD(&s->memcg_params.kmem_caches_node);
@@ -171,6 +179,8 @@ static void destroy_memcg_params(struct kmem_cache *s)
{
if (is_root_cache(s))
kvfree(rcu_access_pointer(s->memcg_params.memcg_caches));
+ else
+ percpu_ref_exit(&s->memcg_params.refcnt);
}
static void free_memcg_params(struct rcu_head *rcu)
@@ -226,6 +236,7 @@ void memcg_link_cache(struct kmem_cache *s, struct mem_cgroup *memcg)
if (is_root_cache(s)) {
list_add(&s->root_caches_node, &slab_root_caches);
} else {
+ css_get(&memcg->css);
s->memcg_params.memcg = memcg;
list_add(&s->memcg_params.children_node,
&s->memcg_params.root_cache->memcg_params.children);
@@ -241,6 +252,7 @@ static void memcg_unlink_cache(struct kmem_cache *s)
} else {
list_del(&s->memcg_params.children_node);
list_del(&s->memcg_params.kmem_caches_node);
+ css_put(&s->memcg_params.memcg->css);
}
}
#else
@@ -678,7 +690,7 @@ void memcg_create_kmem_cache(struct mem_cgroup *memcg,
}
/*
- * Since readers won't lock (see cache_from_memcg_idx()), we need a
+ * Since readers won't lock (see memcg_kmem_get_cache()), we need a
* barrier here to ensure nobody will see the kmem_cache partially
* initialized.
*/
@@ -703,14 +715,12 @@ static void kmemcg_workfn(struct work_struct *work)
mutex_lock(&slab_mutex);
s->memcg_params.work_fn(s);
+ s->memcg_params.work_fn = NULL;
mutex_unlock(&slab_mutex);
put_online_mems();
put_online_cpus();
-
- /* done, put the ref from kmemcg_cache_deactivate() */
- css_put(&s->memcg_params.memcg->css);
}
static void kmemcg_rcufn(struct rcu_head *head)
@@ -727,10 +737,39 @@ static void kmemcg_rcufn(struct rcu_head *head)
queue_work(memcg_kmem_cache_wq, &s->memcg_params.work);
}
+static void kmemcg_cache_shutdown_fn(struct kmem_cache *s)
+{
+ WARN_ON(shutdown_cache(s));
+}
+
+static void kmemcg_cache_shutdown(struct percpu_ref *percpu_ref)
+{
+ struct kmem_cache *s = container_of(percpu_ref, struct kmem_cache,
+ memcg_params.refcnt);
+ unsigned long flags;
+
+ spin_lock_irqsave(&memcg_kmem_wq_lock, flags);
+ if (s->memcg_params.root_cache->memcg_params.dying)
+ goto unlock;
+
+ WARN_ON(s->memcg_params.work_fn);
+ s->memcg_params.work_fn = kmemcg_cache_shutdown_fn;
+ INIT_WORK(&s->memcg_params.work, kmemcg_workfn);
+ queue_work(memcg_kmem_cache_wq, &s->memcg_params.work);
+
+unlock:
+ spin_unlock_irqrestore(&memcg_kmem_wq_lock, flags);
+}
+
+static void kmemcg_cache_deactivate_after_rcu(struct kmem_cache *s)
+{
+ __kmemcg_cache_deactivate_after_rcu(s);
+ percpu_ref_kill(&s->memcg_params.refcnt);
+}
+
static void kmemcg_cache_deactivate(struct kmem_cache *s)
{
- if (WARN_ON_ONCE(is_root_cache(s)) ||
- WARN_ON_ONCE(s->memcg_params.work_fn))
+ if (WARN_ON_ONCE(is_root_cache(s)))
return;
__kmemcg_cache_deactivate(s);
@@ -744,10 +783,8 @@ static void kmemcg_cache_deactivate(struct kmem_cache *s)
if (s->memcg_params.root_cache->memcg_params.dying)
goto unlock;
- /* pin memcg so that @s doesn't get destroyed in the middle */
- css_get(&s->memcg_params.memcg->css);
-
- s->memcg_params.work_fn = __kmemcg_cache_deactivate_after_rcu;
+ WARN_ON_ONCE(s->memcg_params.work_fn);
+ s->memcg_params.work_fn = kmemcg_cache_deactivate_after_rcu;
call_rcu(&s->memcg_params.rcu_head, kmemcg_rcufn);
unlock:
spin_unlock_irq(&memcg_kmem_wq_lock);
@@ -781,28 +818,6 @@ void memcg_deactivate_kmem_caches(struct mem_cgroup *memcg)
put_online_cpus();
}
-void memcg_destroy_kmem_caches(struct mem_cgroup *memcg)
-{
- struct kmem_cache *s, *s2;
-
- get_online_cpus();
- get_online_mems();
-
- mutex_lock(&slab_mutex);
- list_for_each_entry_safe(s, s2, &memcg->kmem_caches,
- memcg_params.kmem_caches_node) {
- /*
- * The cgroup is about to be freed and therefore has no charges
- * left. Hence, all its caches must be empty by now.
- */
- BUG_ON(shutdown_cache(s));
- }
- mutex_unlock(&slab_mutex);
-
- put_online_mems();
- put_online_cpus();
-}
-
static int shutdown_memcg_caches(struct kmem_cache *s)
{
struct memcg_cache_array *arr;
--
2.21.0
On Tue, Jun 11, 2019 at 04:18:09PM -0700, Roman Gushchin wrote:
> There is no point in checking the root_cache->memcg_params.dying
> flag on kmem_cache creation path. New allocations shouldn't be
> performed using a dead root kmem_cache, so no new memcg kmem_cache
> creation can be scheduled after the flag is set. And if it was
> scheduled before, flush_memcg_workqueue() will wait for it anyway.
>
> So let's drop this check to simplify the code.
>
> Signed-off-by: Roman Gushchin <[email protected]>
Acked-by: Vladimir Davydov <[email protected]>
On Tue, Jun 11, 2019 at 4:18 PM Roman Gushchin <[email protected]> wrote:
>
> There is no point in checking the root_cache->memcg_params.dying
> flag on kmem_cache creation path. New allocations shouldn't be
> performed using a dead root kmem_cache,
Yes, it's the user's responsibility to synchronize the kmem cache
destruction and allocations.
> so no new memcg kmem_cache
> creation can be scheduled after the flag is set. And if it was
> scheduled before, flush_memcg_workqueue() will wait for it anyway.
>
> So let's drop this check to simplify the code.
>
> Signed-off-by: Roman Gushchin <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
On Tue, Jun 11, 2019 at 4:18 PM Roman Gushchin <[email protected]> wrote:
>
> Currently each charged slab page holds a reference to the cgroup to
> which it's charged. Kmem_caches are held by the memcg and are released
> all together with the memory cgroup. It means that none of kmem_caches
> are released unless at least one reference to the memcg exists, which
> is very far from optimal.
>
> Let's rework it in a way that allows releasing individual kmem_caches
> as soon as the cgroup is offline, the kmem_cache is empty and there
> are no pending allocations.
>
> To make it possible, let's introduce a new percpu refcounter for
> non-root kmem caches. The counter is initialized to the percpu mode,
> and is switched to the atomic mode during kmem_cache deactivation. The
> counter is bumped for every charged page and also for every running
> allocation. So the kmem_cache can't be released unless all allocations
> complete.
>
> To shutdown non-active empty kmem_caches, let's reuse the work queue,
> previously used for the kmem_cache deactivation. Once the reference
> counter reaches 0, let's schedule an asynchronous kmem_cache release.
>
> * I used the following simple approach to test the performance
> (stolen from another patchset by T. Harding):
>
> time find / -name fname-no-exist
> echo 2 > /proc/sys/vm/drop_caches
> repeat 10 times
>
> Results:
>
> orig patched
>
> real 0m1.455s real 0m1.355s
> user 0m0.206s user 0m0.219s
> sys 0m0.855s sys 0m0.807s
>
> real 0m1.487s real 0m1.699s
> user 0m0.221s user 0m0.256s
> sys 0m0.806s sys 0m0.948s
>
> real 0m1.515s real 0m1.505s
> user 0m0.183s user 0m0.215s
> sys 0m0.876s sys 0m0.858s
>
> real 0m1.291s real 0m1.380s
> user 0m0.193s user 0m0.198s
> sys 0m0.843s sys 0m0.786s
>
> real 0m1.364s real 0m1.374s
> user 0m0.180s user 0m0.182s
> sys 0m0.868s sys 0m0.806s
>
> real 0m1.352s real 0m1.312s
> user 0m0.201s user 0m0.212s
> sys 0m0.820s sys 0m0.761s
>
> real 0m1.302s real 0m1.349s
> user 0m0.205s user 0m0.203s
> sys 0m0.803s sys 0m0.792s
>
> real 0m1.334s real 0m1.301s
> user 0m0.194s user 0m0.201s
> sys 0m0.806s sys 0m0.779s
>
> real 0m1.426s real 0m1.434s
> user 0m0.216s user 0m0.181s
> sys 0m0.824s sys 0m0.864s
>
> real 0m1.350s real 0m1.295s
> user 0m0.200s user 0m0.190s
> sys 0m0.842s sys 0m0.811s
>
> So it looks like the difference is not noticeable in this test.
>
> Signed-off-by: Roman Gushchin <[email protected]>
> Acked-by: Vladimir Davydov <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
On Tue, Jun 11, 2019 at 4:18 PM Roman Gushchin <[email protected]> wrote:
>
> Every slab page charged to a non-root memory cgroup has a pointer
> to the memory cgroup and holds a reference to it, which protects
> a non-empty memory cgroup from being released. At the same time
> the page has a pointer to the corresponding kmem_cache, and also
> hold a reference to the kmem_cache. And kmem_cache by itself
> holds a reference to the cgroup.
>
> So there is clearly some redundancy, which allows to stop setting
> the page->mem_cgroup pointer and rely on getting memcg pointer
> indirectly via kmem_cache. Further it will allow to change this
> pointer easier, without a need to go over all charged pages.
>
> So let's stop setting page->mem_cgroup pointer for slab pages,
> and stop using the css refcounter directly for protecting
> the memory cgroup from going away. Instead rely on kmem_cache
> as an intermediate object.
>
> Make sure that vmstats and shrinker lists are working as previously,
> as well as /proc/kpagecgroup interface.
>
> Signed-off-by: Roman Gushchin <[email protected]>
> Acked-by: Vladimir Davydov <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
Folks,
I do get errors like the following when running a new testcase in our KVM CI.
The test basically unloads kvm, reloads it with hpage=1 (enable huge page
support for guests on s390), starts a guest with libvirt and hugepages, shuts the
guest down and unloads the kvm module.
WARNING: CPU: 8 PID: 208 at lib/percpu-refcount.c:108 percpu_ref_exit+0x50/0x58
Modules linked in: kvm(-) xt_CHECKSUM xt_MASQUERADE bonding xt_tcpudp ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ip6table_na>
CPU: 8 PID: 208 Comm: kworker/8:1 Not tainted 5.2.0+ #66
Hardware name: IBM 2964 NC9 712 (LPAR)
Workqueue: events sysfs_slab_remove_workfn
Krnl PSW : 0704e00180000000 0000001529746850 (percpu_ref_exit+0x50/0x58)
R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 RI:0 EA:3
Krnl GPRS: 00000000ffff8808 0000001529746740 000003f4e30e8e18 0036008100000000
0000001f00000000 0035008100000000 0000001fb3573ab8 0000000000000000
0000001fbdb6de00 0000000000000000 0000001529f01328 0000001fb3573b00
0000001fbb27e000 0000001fbdb69300 000003e009263d00 000003e009263cd0
Krnl Code: 0000001529746842: f0a0000407fe srp 4(11,%r0),2046,0
0000001529746848: 47000700 bc 0,1792
#000000152974684c: a7f40001 brc 15,152974684e
>0000001529746850: a7f4fff2 brc 15,1529746834
0000001529746854: 0707 bcr 0,%r7
0000001529746856: 0707 bcr 0,%r7
0000001529746858: eb8ff0580024 stmg %r8,%r15,88(%r15)
000000152974685e: a738ffff lhi %r3,-1
Call Trace:
([<000003e009263d00>] 0x3e009263d00)
[<00000015293252ea>] slab_kmem_cache_release+0x3a/0x70
[<0000001529b04882>] kobject_put+0xaa/0xe8
[<000000152918cf28>] process_one_work+0x1e8/0x428
[<000000152918d1b0>] worker_thread+0x48/0x460
[<00000015291942c6>] kthread+0x126/0x160
[<0000001529b22344>] ret_from_fork+0x28/0x30
[<0000001529b2234c>] kernel_thread_starter+0x0/0x10
Last Breaking-Event-Address:
[<000000152974684c>] percpu_ref_exit+0x4c/0x58
---[ end trace b035e7da5788eb09 ]---
I have bisected this to
# first bad commit: [f0a3a24b532d9a7e56a33c5112b2a212ed6ec580] mm: memcg/slab: rework non-root kmem_cache lifecycle management
unmounting /sys/fs/cgroup/memory/ before the test makes the problem go away so
it really seems to be related to the new percpu-refs from this patch.
Any ideas?
Christian
On 21.11.19 12:17, Christian Borntraeger wrote:
> Folks,
>
> I do get errors like the following when running a new testcase in our KVM CI.
> The test basically unloads kvm, reloads with with hpage=1 (enable huge page
> support for guests on s390) start a guest with libvirt and hugepages, shut the
> guest down and unload the kvm module.
It also crashes without large pages. The trigger is really that the time between
"guest is going away" and rmmod kvm is really short.
>
> WARNING: CPU: 8 PID: 208 at lib/percpu-refcount.c:108 percpu_ref_exit+0x50/0x58
> Modules linked in: kvm(-) xt_CHECKSUM xt_MASQUERADE bonding xt_tcpudp ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ip6table_na>
> CPU: 8 PID: 208 Comm: kworker/8:1 Not tainted 5.2.0+ #66
> Hardware name: IBM 2964 NC9 712 (LPAR)
> Workqueue: events sysfs_slab_remove_workfn
> Krnl PSW : 0704e00180000000 0000001529746850 (percpu_ref_exit+0x50/0x58)
> R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 RI:0 EA:3
> Krnl GPRS: 00000000ffff8808 0000001529746740 000003f4e30e8e18 0036008100000000
> 0000001f00000000 0035008100000000 0000001fb3573ab8 0000000000000000
> 0000001fbdb6de00 0000000000000000 0000001529f01328 0000001fb3573b00
> 0000001fbb27e000 0000001fbdb69300 000003e009263d00 000003e009263cd0
> Krnl Code: 0000001529746842: f0a0000407fe srp 4(11,%r0),2046,0
> 0000001529746848: 47000700 bc 0,1792
> #000000152974684c: a7f40001 brc 15,152974684e
> >0000001529746850: a7f4fff2 brc 15,1529746834
> 0000001529746854: 0707 bcr 0,%r7
> 0000001529746856: 0707 bcr 0,%r7
> 0000001529746858: eb8ff0580024 stmg %r8,%r15,88(%r15)
> 000000152974685e: a738ffff lhi %r3,-1
> Call Trace:
> ([<000003e009263d00>] 0x3e009263d00)
> [<00000015293252ea>] slab_kmem_cache_release+0x3a/0x70
> [<0000001529b04882>] kobject_put+0xaa/0xe8
> [<000000152918cf28>] process_one_work+0x1e8/0x428
> [<000000152918d1b0>] worker_thread+0x48/0x460
> [<00000015291942c6>] kthread+0x126/0x160
> [<0000001529b22344>] ret_from_fork+0x28/0x30
> [<0000001529b2234c>] kernel_thread_starter+0x0/0x10
> Last Breaking-Event-Address:
> [<000000152974684c>] percpu_ref_exit+0x4c/0x58
> ---[ end trace b035e7da5788eb09 ]---
>
> I have bisected this to
> # first bad commit: [f0a3a24b532d9a7e56a33c5112b2a212ed6ec580] mm: memcg/slab: rework non-root kmem_cache lifecycle management
>
> unmounting /sys/fs/cgroup/memory/ before the test makes the problem go away so
> it really seems to be related to the new percpu-refs from this patch.
>
> Any ideas?
>
> Christian
>
On Thu, Nov 21, 2019 at 12:17:39PM +0100, Christian Borntraeger wrote:
> Folks,
>
> I do get errors like the following when running a new testcase in our KVM CI.
> The test basically unloads kvm, reloads with with hpage=1 (enable huge page
> support for guests on s390) start a guest with libvirt and hugepages, shut the
> guest down and unload the kvm module.
>
> WARNING: CPU: 8 PID: 208 at lib/percpu-refcount.c:108 percpu_ref_exit+0x50/0x58
> Modules linked in: kvm(-) xt_CHECKSUM xt_MASQUERADE bonding xt_tcpudp ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ip6table_na>
> CPU: 8 PID: 208 Comm: kworker/8:1 Not tainted 5.2.0+ #66
> Hardware name: IBM 2964 NC9 712 (LPAR)
> Workqueue: events sysfs_slab_remove_workfn
> Krnl PSW : 0704e00180000000 0000001529746850 (percpu_ref_exit+0x50/0x58)
> R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 RI:0 EA:3
> Krnl GPRS: 00000000ffff8808 0000001529746740 000003f4e30e8e18 0036008100000000
> 0000001f00000000 0035008100000000 0000001fb3573ab8 0000000000000000
> 0000001fbdb6de00 0000000000000000 0000001529f01328 0000001fb3573b00
> 0000001fbb27e000 0000001fbdb69300 000003e009263d00 000003e009263cd0
> Krnl Code: 0000001529746842: f0a0000407fe srp 4(11,%r0),2046,0
> 0000001529746848: 47000700 bc 0,1792
> #000000152974684c: a7f40001 brc 15,152974684e
> >0000001529746850: a7f4fff2 brc 15,1529746834
> 0000001529746854: 0707 bcr 0,%r7
> 0000001529746856: 0707 bcr 0,%r7
> 0000001529746858: eb8ff0580024 stmg %r8,%r15,88(%r15)
> 000000152974685e: a738ffff lhi %r3,-1
> Call Trace:
> ([<000003e009263d00>] 0x3e009263d00)
> [<00000015293252ea>] slab_kmem_cache_release+0x3a/0x70
> [<0000001529b04882>] kobject_put+0xaa/0xe8
> [<000000152918cf28>] process_one_work+0x1e8/0x428
> [<000000152918d1b0>] worker_thread+0x48/0x460
> [<00000015291942c6>] kthread+0x126/0x160
> [<0000001529b22344>] ret_from_fork+0x28/0x30
> [<0000001529b2234c>] kernel_thread_starter+0x0/0x10
> Last Breaking-Event-Address:
> [<000000152974684c>] percpu_ref_exit+0x4c/0x58
> ---[ end trace b035e7da5788eb09 ]---
>
> I have bisected this to
> # first bad commit: [f0a3a24b532d9a7e56a33c5112b2a212ed6ec580] mm: memcg/slab: rework non-root kmem_cache lifecycle management
>
> unmounting /sys/fs/cgroup/memory/ before the test makes the problem go away so
> it really seems to be related to the new percpu-refs from this patch.
>
> Any ideas?
Hello, Christian!
It seems to be a race between the release of the root kmem_cache (caused by rmmod)
and that of a memcg kmem_cache. Does delaying rmmod for, say, a minute "resolve" the
issue?
I'll take a look at this code path.
Thanks!
On 21.11.19 17:58, Roman Gushchin wrote:
> On Thu, Nov 21, 2019 at 12:17:39PM +0100, Christian Borntraeger wrote:
>> Folks,
>>
>> I do get errors like the following when running a new testcase in our KVM CI.
>> The test basically unloads kvm, reloads with with hpage=1 (enable huge page
>> support for guests on s390) start a guest with libvirt and hugepages, shut the
>> guest down and unload the kvm module.
>>
>> WARNING: CPU: 8 PID: 208 at lib/percpu-refcount.c:108 percpu_ref_exit+0x50/0x58
>> Modules linked in: kvm(-) xt_CHECKSUM xt_MASQUERADE bonding xt_tcpudp ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ip6table_na>
>> CPU: 8 PID: 208 Comm: kworker/8:1 Not tainted 5.2.0+ #66
>> Hardware name: IBM 2964 NC9 712 (LPAR)
>> Workqueue: events sysfs_slab_remove_workfn
>> Krnl PSW : 0704e00180000000 0000001529746850 (percpu_ref_exit+0x50/0x58)
>> R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 RI:0 EA:3
>> Krnl GPRS: 00000000ffff8808 0000001529746740 000003f4e30e8e18 0036008100000000
>> 0000001f00000000 0035008100000000 0000001fb3573ab8 0000000000000000
>> 0000001fbdb6de00 0000000000000000 0000001529f01328 0000001fb3573b00
>> 0000001fbb27e000 0000001fbdb69300 000003e009263d00 000003e009263cd0
>> Krnl Code: 0000001529746842: f0a0000407fe srp 4(11,%r0),2046,0
>> 0000001529746848: 47000700 bc 0,1792
>> #000000152974684c: a7f40001 brc 15,152974684e
>> >0000001529746850: a7f4fff2 brc 15,1529746834
>> 0000001529746854: 0707 bcr 0,%r7
>> 0000001529746856: 0707 bcr 0,%r7
>> 0000001529746858: eb8ff0580024 stmg %r8,%r15,88(%r15)
>> 000000152974685e: a738ffff lhi %r3,-1
>> Call Trace:
>> ([<000003e009263d00>] 0x3e009263d00)
>> [<00000015293252ea>] slab_kmem_cache_release+0x3a/0x70
>> [<0000001529b04882>] kobject_put+0xaa/0xe8
>> [<000000152918cf28>] process_one_work+0x1e8/0x428
>> [<000000152918d1b0>] worker_thread+0x48/0x460
>> [<00000015291942c6>] kthread+0x126/0x160
>> [<0000001529b22344>] ret_from_fork+0x28/0x30
>> [<0000001529b2234c>] kernel_thread_starter+0x0/0x10
>> Last Breaking-Event-Address:
>> [<000000152974684c>] percpu_ref_exit+0x4c/0x58
>> ---[ end trace b035e7da5788eb09 ]---
>>
>> I have bisected this to
>> # first bad commit: [f0a3a24b532d9a7e56a33c5112b2a212ed6ec580] mm: memcg/slab: rework non-root kmem_cache lifecycle management
>>
>> unmounting /sys/fs/cgroup/memory/ before the test makes the problem go away so
>> it really seems to be related to the new percpu-refs from this patch.
>>
>> Any ideas?
>
> Hello, Christian!
>
> It seems to be a race between releasing of the root kmem_cache (caused by rmmod)
> and a memcg kmem_cache. Does delaying rmmod for say, a minute, "resolve" the
> issue?
Yes, rmmod has to be called directly after the guest shutdown to see the issue.
See my 2nd mail.
On Thu, Nov 21, 2019 at 05:59:54PM +0100, Christian Borntraeger wrote:
>
>
> On 21.11.19 17:58, Roman Gushchin wrote:
> > On Thu, Nov 21, 2019 at 12:17:39PM +0100, Christian Borntraeger wrote:
> >> Folks,
> >>
> >> I do get errors like the following when running a new testcase in our KVM CI.
> >> The test basically unloads kvm, reloads with with hpage=1 (enable huge page
> >> support for guests on s390) start a guest with libvirt and hugepages, shut the
> >> guest down and unload the kvm module.
> >>
> >> WARNING: CPU: 8 PID: 208 at lib/percpu-refcount.c:108 percpu_ref_exit+0x50/0x58
> >> Modules linked in: kvm(-) xt_CHECKSUM xt_MASQUERADE bonding xt_tcpudp ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ip6table_na>
> >> CPU: 8 PID: 208 Comm: kworker/8:1 Not tainted 5.2.0+ #66
> >> Hardware name: IBM 2964 NC9 712 (LPAR)
> >> Workqueue: events sysfs_slab_remove_workfn
> >> Krnl PSW : 0704e00180000000 0000001529746850 (percpu_ref_exit+0x50/0x58)
> >> R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 RI:0 EA:3
> >> Krnl GPRS: 00000000ffff8808 0000001529746740 000003f4e30e8e18 0036008100000000
> >> 0000001f00000000 0035008100000000 0000001fb3573ab8 0000000000000000
> >> 0000001fbdb6de00 0000000000000000 0000001529f01328 0000001fb3573b00
> >> 0000001fbb27e000 0000001fbdb69300 000003e009263d00 000003e009263cd0
> >> Krnl Code: 0000001529746842: f0a0000407fe srp 4(11,%r0),2046,0
> >> 0000001529746848: 47000700 bc 0,1792
> >> #000000152974684c: a7f40001 brc 15,152974684e
> >> >0000001529746850: a7f4fff2 brc 15,1529746834
> >> 0000001529746854: 0707 bcr 0,%r7
> >> 0000001529746856: 0707 bcr 0,%r7
> >> 0000001529746858: eb8ff0580024 stmg %r8,%r15,88(%r15)
> >> 000000152974685e: a738ffff lhi %r3,-1
> >> Call Trace:
> >> ([<000003e009263d00>] 0x3e009263d00)
> >> [<00000015293252ea>] slab_kmem_cache_release+0x3a/0x70
> >> [<0000001529b04882>] kobject_put+0xaa/0xe8
> >> [<000000152918cf28>] process_one_work+0x1e8/0x428
> >> [<000000152918d1b0>] worker_thread+0x48/0x460
> >> [<00000015291942c6>] kthread+0x126/0x160
> >> [<0000001529b22344>] ret_from_fork+0x28/0x30
> >> [<0000001529b2234c>] kernel_thread_starter+0x0/0x10
> >> Last Breaking-Event-Address:
> >> [<000000152974684c>] percpu_ref_exit+0x4c/0x58
> >> ---[ end trace b035e7da5788eb09 ]---
> >>
> >> I have bisected this to
> >> # first bad commit: [f0a3a24b532d9a7e56a33c5112b2a212ed6ec580] mm: memcg/slab: rework non-root kmem_cache lifecycle management
> >>
> >> unmounting /sys/fs/cgroup/memory/ before the test makes the problem go away so
> >> it really seems to be related to the new percpu-refs from this patch.
> >>
> >> Any ideas?
> >
> > Hello, Christian!
> >
> > It seems to be a race between releasing of the root kmem_cache (caused by rmmod)
> > and a memcg kmem_cache. Does delaying rmmod for say, a minute, "resolve" the
> > issue?
>
> Yes, rmmod has to be called directly after the guest shutdown to see the issue.
> See my 2nd mail.
I see. Do you know which kmem_cache it is? If not, can you, please,
figure it out?
I tried to reproduce the issue, but wasn't successful so far. So I wonder
what makes your case special.
Thanks!
On Thu, 2019-11-21 at 13:45 -0500, Roman Gushchin wrote:
> On Thu, Nov 21, 2019 at 05:59:54PM +0100, Christian Borntraeger
> wrote:
> >
> >
> > Yes, rmmod has to be called directly after the guest shutdown to
> > see the issue.
> > See my 2nd mail.
>
> I see. Do you know, which kmem_cache it is? If not, can you, please,
> figure it out?
>
> I tried to reproduce the issue, but wasn't successful so far. So I
> wonder
> what can make your case special.
I do not know either, but have a guess.
My guess would be that either the slab object or the
slab page is RCU freed, and the kmem_cache destruction
is called before that RCU callback has completed.
On Thu, Nov 21, 2019 at 12:43:01PM -0800, Rik van Riel wrote:
> On Thu, 2019-11-21 at 13:45 -0500, Roman Gushchin wrote:
> > On Thu, Nov 21, 2019 at 05:59:54PM +0100, Christian Borntraeger
> > wrote:
> > >
> > >
> > > Yes, rmmod has to be called directly after the guest shutdown to
> > > see the issue.
> > > See my 2nd mail.
> >
> > I see. Do you know, which kmem_cache it is? If not, can you, please,
> > figure it out?
> >
> > I tried to reproduce the issue, but wasn't successful so far. So I
> > wonder
> > what can make your case special.
>
> I do not know either, but have a guess.
>
> My guess would be that either the slab object or the
> slab page is RCU freed, and the kmem_cache destruction
> is called before that RCU callback has completed.
>
I have a reproducer, but it requires SLAB_TYPESAFE_BY_RCU to panic.
The only question is whether it's the same or a different issue.
As soon as I have a fix, I'll post it here to test.
Thanks!
On Thu, Nov 21, 2019 at 12:55:28PM -0800, Roman Gushchin wrote:
> On Thu, Nov 21, 2019 at 12:43:01PM -0800, Rik van Riel wrote:
> > On Thu, 2019-11-21 at 13:45 -0500, Roman Gushchin wrote:
> > > On Thu, Nov 21, 2019 at 05:59:54PM +0100, Christian Borntraeger
> > > wrote:
> > > >
> > > >
> > > > Yes, rmmod has to be called directly after the guest shutdown to
> > > > see the issue.
> > > > See my 2nd mail.
> > >
> > > I see. Do you know, which kmem_cache it is? If not, can you, please,
> > > figure it out?
> > >
> > > I tried to reproduce the issue, but wasn't successful so far. So I
> > > wonder
> > > what can make your case special.
> >
> > I do not know either, but have a guess.
> >
> > My guess would be that either the slab object or the
> > slab page is RCU freed, and the kmem_cache destruction
> > is called before that RCU callback has completed.
> >
>
> I've a reproducer, but it requires SLAB_TYPESAFE_BY_RCU to panic.
> The only question is if it's the same or different issues.
> As soon as I'll have a fix, I'll post it here to test.
Ah, no, the issue I've reproduced is already fixed by commit b749ecfaf6c5
("mm: memcg/slab: fix panic in __free_slab() caused by premature memcg pointer release").
Christian, can you, please, confirm that you have this one in your tree?
Also, can you, please, provide your config?
And you mentioned some panics, but didn't send any dmesg messages.
Can you, please, provide them?
Thanks!
On 21.11.19 23:09, Roman Gushchin wrote:
> On Thu, Nov 21, 2019 at 12:55:28PM -0800, Roman Gushchin wrote:
>> On Thu, Nov 21, 2019 at 12:43:01PM -0800, Rik van Riel wrote:
>>> On Thu, 2019-11-21 at 13:45 -0500, Roman Gushchin wrote:
>>>> On Thu, Nov 21, 2019 at 05:59:54PM +0100, Christian Borntraeger
>>>> wrote:
>>>>>
>>>>>
>>>>> Yes, rmmod has to be called directly after the guest shutdown to
>>>>> see the issue.
>>>>> See my 2nd mail.
>>>>
>>>> I see. Do you know, which kmem_cache it is? If not, can you, please,
>>>> figure it out?
I can try to figure out.
>>>>
>>>> I tried to reproduce the issue, but wasn't successful so far. So I
>>>> wonder
>>>> what can make your case special.
>>>
>>> I do not know either, but have a guess.
>>>
>>> My guess would be that either the slab object or the
>>> slab page is RCU freed, and the kmem_cache destruction
>>> is called before that RCU callback has completed.
>>>
>>
>> I've a reproducer, but it requires SLAB_TYPESAFE_BY_RCU to panic.
>> The only question is if it's the same or different issues.
>> As soon as I'll have a fix, I'll post it here to test.
>
> Ah, no, the issue I've reproduced is already fixed by commit b749ecfaf6c5
> ("mm: memcg/slab: fix panic in __free_slab() caused by premature memcg pointer release").
>
> Christian, can you, please, confirm that you have this one in your tree?
The problem still reproduces on the latest next kernel.
>
> Also, can you, please, provide you config?
I can but this is an s390 config.
> And you mentioned some panics, but didn't send any dmesg messages.
> Can you, please, provide them?
Those were all the WARN_ONs from percpu_ref_exit; I just had panic_on_warn enabled.
I attached my config and dmesg for today's next kernel, just in case.
On 21.11.19 19:45, Roman Gushchin wrote:
> I see. Do you know, which kmem_cache it is? If not, can you, please,
> figure it out?
The release function for that ref is kmemcg_cache_shutdown.
On Fri, Nov 22, 2019 at 05:28:46PM +0100, Christian Borntraeger wrote:
> On 21.11.19 19:45, Roman Gushchin wrote:
> > I see. Do you know, which kmem_cache it is? If not, can you, please,
> > figure it out?
>
> The release function for that ref is kmemcg_cache_shutdown.
>
Hi Christian!
Can you, please, test if the following patch resolves the problem?
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 8afa188f6e20..628e5f0ee19e 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -888,6 +888,8 @@ static int shutdown_memcg_caches(struct kmem_cache *s)
static void flush_memcg_workqueue(struct kmem_cache *s)
{
+ bool wait_for_children;
+
spin_lock_irq(&memcg_kmem_wq_lock);
s->memcg_params.dying = true;
spin_unlock_irq(&memcg_kmem_wq_lock);
@@ -904,6 +906,13 @@ static void flush_memcg_workqueue(struct kmem_cache *s)
* previous workitems on workqueue are processed.
*/
flush_workqueue(memcg_kmem_cache_wq);
+
+ mutex_lock(&slab_mutex);
+ wait_for_children = !list_empty(&s->memcg_params.children);
+ mutex_unlock(&slab_mutex);
+
+ if (wait_for_children)
+ rcu_barrier();
}
#else
static inline int shutdown_memcg_caches(struct kmem_cache *s)
--
Thanks!
On 24.11.19 01:39, Roman Gushchin wrote:
> On Fri, Nov 22, 2019 at 05:28:46PM +0100, Christian Borntraeger wrote:
>> On 21.11.19 19:45, Roman Gushchin wrote:
>>> I see. Do you know, which kmem_cache it is? If not, can you, please,
>>> figure it out?
>>
>> The release function for that ref is kmemcg_cache_shutdown.
>>
>
> Hi Christian!
>
> Can you, please, test if the following patch resolves the problem?
Yes, it does.
>
> diff --git a/mm/slab_common.c b/mm/slab_common.c
> index 8afa188f6e20..628e5f0ee19e 100644
> --- a/mm/slab_common.c
> +++ b/mm/slab_common.c
> @@ -888,6 +888,8 @@ static int shutdown_memcg_caches(struct kmem_cache *s)
>
> static void flush_memcg_workqueue(struct kmem_cache *s)
> {
> + bool wait_for_children;
> +
> spin_lock_irq(&memcg_kmem_wq_lock);
> s->memcg_params.dying = true;
> spin_unlock_irq(&memcg_kmem_wq_lock);
> @@ -904,6 +906,13 @@ static void flush_memcg_workqueue(struct kmem_cache *s)
> * previous workitems on workqueue are processed.
> */
> flush_workqueue(memcg_kmem_cache_wq);
> +
> + mutex_lock(&slab_mutex);
> + wait_for_children = !list_empty(&s->memcg_params.children);
> + mutex_unlock(&slab_mutex);
Not sure if (for reading) we really need the mutex.
> +
> + if (wait_for_children)
> + rcu_barrier();
> }
> #else
> static inline int shutdown_memcg_caches(struct kmem_cache *s)
>
> --
>
> Thanks!
>
On Mon, Nov 25, 2019 at 09:00:56AM +0100, Christian Borntraeger wrote:
>
>
> On 24.11.19 01:39, Roman Gushchin wrote:
> > On Fri, Nov 22, 2019 at 05:28:46PM +0100, Christian Borntraeger wrote:
> >> On 21.11.19 19:45, Roman Gushchin wrote:
> >>> I see. Do you know, which kmem_cache it is? If not, can you, please,
> >>> figure it out?
> >>
> >> The release function for that ref is kmemcg_cache_shutdown.
> >>
> >
> > Hi Christian!
> >
> > Can you, please, test if the following patch resolves the problem?
>
> Yes, it does.
Thanks for testing it!
I'll send the patch shortly.
>
>
> >
> > diff --git a/mm/slab_common.c b/mm/slab_common.c
> > index 8afa188f6e20..628e5f0ee19e 100644
> > --- a/mm/slab_common.c
> > +++ b/mm/slab_common.c
> > @@ -888,6 +888,8 @@ static int shutdown_memcg_caches(struct kmem_cache *s)
> >
> > static void flush_memcg_workqueue(struct kmem_cache *s)
> > {
> > + bool wait_for_children;
> > +
> > spin_lock_irq(&memcg_kmem_wq_lock);
> > s->memcg_params.dying = true;
> > spin_unlock_irq(&memcg_kmem_wq_lock);
> > @@ -904,6 +906,13 @@ static void flush_memcg_workqueue(struct kmem_cache *s)
> > * previous workitems on workqueue are processed.
> > */
> > flush_workqueue(memcg_kmem_cache_wq);
> > +
> > + mutex_lock(&slab_mutex);
> > + wait_for_children = !list_empty(&s->memcg_params.children);
> > + mutex_unlock(&slab_mutex);
>
> Not sure if (for reading) we really need the mutex.
Good point!
At this moment the list of children caches can't grow, only shrink.
So if we're reading it without the slab mutex, the worst thing that can
happen is that we'll make an excessive rcu_barrier() call,
which is fine given that the resulting code looks much simpler.
Thanks!