# Why do we need this?
We've noticed that the number of dying cgroups is steadily growing on most
of our hosts in production. The following investigation revealed an issue
in userspace memory reclaim code [1], accounting of kernel stacks [2],
and also the mainreason: slab objects.
The underlying problem is quite simple: any page charged
to a cgroup holds a reference to it, so the cgroup can't be reclaimed unless
all charged pages are gone. If a slab object is actively used by other cgroups,
it won't be reclaimed, and will prevent the origin cgroup from being reclaimed.
Slab objects, and first of all vfs cache, is shared between cgroups, which are
using the same underlying fs, and what's even more important, it's shared
between multiple generations of the same workload. So if something is running
periodically every time in a new cgroup (like how systemd works), we do
accumulate multiple dying cgroups.
Strictly speaking pagecache isn't different here, but there is a key difference:
we disable protection and apply some extra pressure on LRUs of dying cgroups,
and these LRUs contain all charged pages.
My experiments show that with the disabled kernel memory accounting the number
of dying cgroups stabilizes at a relatively small number (~100, depends on
memory pressure and cgroup creation rate), and with kernel memory accounting
it grows pretty steadily up to several thousands.
Memory cgroups are quite complex and big objects (mostly due to percpu stats),
so it leads to noticeable memory losses. Memory occupied by dying cgroups
is measured in hundreds of megabytes. I've even seen a host with more than 100Gb
of memory wasted for dying cgroups. It leads to a degradation of performance
with the uptime, and generally limits the usage of cgroups.
My previous attempt [3] to fix the problem by applying extra pressure on slab
shrinker lists caused a regressions with xfs and ext4, and has been reverted [4].
The following attempts to find the right balance [5, 6] were not successful.
So instead of trying to find a maybe non-existing balance, let's do reparent
the accounted slabs to the parent cgroup on cgroup removal.
# Implementation approach
There is however a significant problem with reparenting of slab memory:
there is no list of charged pages. Some of them are in shrinker lists,
but not all. Introducing of a new list is really not an option.
But fortunately there is a way forward: every slab page has a stable pointer
to the corresponding kmem_cache. So the idea is to reparent kmem_caches
instead of slab pages.
It's actually simpler and cheaper, but requires some underlying changes:
1) Make kmem_caches to hold a single reference to the memory cgroup,
instead of a separate reference per every slab page.
2) Stop setting page->mem_cgroup pointer for memcg slab pages and use
page->kmem_cache->memcg indirection instead. It's used only on
slab page release, so it shouldn't be a big issue.
3) Introduce a refcounter for non-root slab caches. It's required to
be able to destroy kmem_caches when they become empty and release
the associated memory cgroup.
There is a bonus: currently we do release empty kmem_caches on cgroup
removal, however all other are waiting for the releasing of the memory cgroup.
These refactorings allow kmem_caches to be released as soon as they
become inactive and free.
Some additional implementation details are provided in corresponding
commit messages.
# Results
Below is the average number of dying cgroups on two groups of our production
hosts. They do run some sort of web frontend workload, the memory pressure
is moderate. As we can see, with the kernel memory reparenting the number
stabilizes in 50s range; however with the original version it grows almost
linearly and doesn't show any signs of plateauing.
Releasing kmem_caches and memory cgroups created by systemd on startup
releases almost 50Mb immediately, and the difference in slab and percpu
usage between patched and unpatched versions also grows linearly.
In 6 days it reached 200Mb.
day 0 1 2 3 4 5 6
original 39 338 580 827 1098 1349 1574
patched 23 44 45 47 50 46 55
mem diff(Mb) 53 73 99 137 148 182 209
# Links
[1]: commit 68600f623d69 ("mm: don't miss the last page because of
round-off error")
[2]: commit 9b6f7e163cd0 ("mm: rework memcg kernel stack accounting")
[3]: commit 172b06c32b94 ("mm: slowly shrink slabs with a relatively
small number of objects")
[4]: commit a9a238e83fbb ("Revert "mm: slowly shrink slabs
with a relatively small number of objects")
[5]: https://lkml.org/lkml/2019/1/28/1865
[6]: https://marc.info/?l=linux-mm&m=155064763626437&w=2
Roman Gushchin (5):
mm: postpone kmem_cache memcg pointer initialization to
memcg_link_cache()
mm: generalize postponed non-root kmem_cache deactivation
mm: introduce __memcg_kmem_uncharge_memcg()
mm: rework non-root kmem_cache lifecycle management
mm: reparent slab memory on cgroup removal
include/linux/memcontrol.h | 10 +++
include/linux/slab.h | 8 +-
mm/memcontrol.c | 38 ++++++----
mm/slab.c | 21 ++----
mm/slab.h | 64 ++++++++++++++--
mm/slab_common.c | 147 ++++++++++++++++++++-----------------
mm/slub.c | 32 +-------
7 files changed, 185 insertions(+), 135 deletions(-)
--
2.20.1
Initialize kmem_cache->memcg_params.memcg pointer in
memcg_link_cache() rather than in init_memcg_params().
Once kmem_cache will hold a reference to the memory cgroup,
it will simplify the refcounting.
For non-root kmem_caches memcg_link_cache() is always called
before the kmem_cache becomes visible to a user, so it's safe.
Signed-off-by: Roman Gushchin <[email protected]>
---
mm/slab.c | 2 +-
mm/slab.h | 5 +++--
mm/slab_common.c | 14 +++++++-------
mm/slub.c | 2 +-
4 files changed, 12 insertions(+), 11 deletions(-)
diff --git a/mm/slab.c b/mm/slab.c
index b1eefe751d2a..57a332f524cf 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -1268,7 +1268,7 @@ void __init kmem_cache_init(void)
nr_node_ids * sizeof(struct kmem_cache_node *),
SLAB_HWCACHE_ALIGN, 0, 0);
list_add(&kmem_cache->list, &slab_caches);
- memcg_link_cache(kmem_cache);
+ memcg_link_cache(kmem_cache, NULL);
slab_state = PARTIAL;
/*
diff --git a/mm/slab.h b/mm/slab.h
index 43ac818b8592..6a562ca72bca 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -289,7 +289,7 @@ static __always_inline void memcg_uncharge_slab(struct page *page, int order,
}
extern void slab_init_memcg_params(struct kmem_cache *);
-extern void memcg_link_cache(struct kmem_cache *s);
+extern void memcg_link_cache(struct kmem_cache *s, struct mem_cgroup *memcg);
extern void slab_deactivate_memcg_cache_rcu_sched(struct kmem_cache *s,
void (*deact_fn)(struct kmem_cache *));
@@ -344,7 +344,8 @@ static inline void slab_init_memcg_params(struct kmem_cache *s)
{
}
-static inline void memcg_link_cache(struct kmem_cache *s)
+static inline void memcg_link_cache(struct kmem_cache *s,
+ struct mem_cgroup *memcg)
{
}
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 58251ba63e4a..6e00bdf8618d 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -140,13 +140,12 @@ void slab_init_memcg_params(struct kmem_cache *s)
}
static int init_memcg_params(struct kmem_cache *s,
- struct mem_cgroup *memcg, struct kmem_cache *root_cache)
+ struct kmem_cache *root_cache)
{
struct memcg_cache_array *arr;
if (root_cache) {
s->memcg_params.root_cache = root_cache;
- s->memcg_params.memcg = memcg;
INIT_LIST_HEAD(&s->memcg_params.children_node);
INIT_LIST_HEAD(&s->memcg_params.kmem_caches_node);
return 0;
@@ -221,11 +220,12 @@ int memcg_update_all_caches(int num_memcgs)
return ret;
}
-void memcg_link_cache(struct kmem_cache *s)
+void memcg_link_cache(struct kmem_cache *s, struct mem_cgroup *memcg)
{
if (is_root_cache(s)) {
list_add(&s->root_caches_node, &slab_root_caches);
} else {
+ s->memcg_params.memcg = memcg;
list_add(&s->memcg_params.children_node,
&s->memcg_params.root_cache->memcg_params.children);
list_add(&s->memcg_params.kmem_caches_node,
@@ -244,7 +244,7 @@ static void memcg_unlink_cache(struct kmem_cache *s)
}
#else
static inline int init_memcg_params(struct kmem_cache *s,
- struct mem_cgroup *memcg, struct kmem_cache *root_cache)
+ struct kmem_cache *root_cache)
{
return 0;
}
@@ -384,7 +384,7 @@ static struct kmem_cache *create_cache(const char *name,
s->useroffset = useroffset;
s->usersize = usersize;
- err = init_memcg_params(s, memcg, root_cache);
+ err = init_memcg_params(s, root_cache);
if (err)
goto out_free_cache;
@@ -394,7 +394,7 @@ static struct kmem_cache *create_cache(const char *name,
s->refcount = 1;
list_add(&s->list, &slab_caches);
- memcg_link_cache(s);
+ memcg_link_cache(s, memcg);
out:
if (err)
return ERR_PTR(err);
@@ -997,7 +997,7 @@ struct kmem_cache *__init create_kmalloc_cache(const char *name,
create_boot_cache(s, name, size, flags, useroffset, usersize);
list_add(&s->list, &slab_caches);
- memcg_link_cache(s);
+ memcg_link_cache(s, NULL);
s->refcount = 1;
return s;
}
diff --git a/mm/slub.c b/mm/slub.c
index a34fbe1f6ede..2b9244529d76 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -4224,7 +4224,7 @@ static struct kmem_cache * __init bootstrap(struct kmem_cache *static_cache)
}
slab_init_memcg_params(s);
list_add(&s->list, &slab_caches);
- memcg_link_cache(s);
+ memcg_link_cache(s, NULL);
return s;
}
--
2.20.1
Let's separate the page counter modification code out of
__memcg_kmem_uncharge() in a way similar to what
__memcg_kmem_charge() and __memcg_kmem_charge_memcg() work.
This will allow to reuse this code later using a new
memcg_kmem_uncharge_memcg() wrapper, which calls
__memcg_kmem_unchare_memcg() if memcg_kmem_enabled()
check is passed.
Signed-off-by: Roman Gushchin <[email protected]>
---
include/linux/memcontrol.h | 10 ++++++++++
mm/memcontrol.c | 25 +++++++++++++++++--------
2 files changed, 27 insertions(+), 8 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 36bdfe8e5965..deb209510902 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1298,6 +1298,8 @@ int __memcg_kmem_charge(struct page *page, gfp_t gfp, int order);
void __memcg_kmem_uncharge(struct page *page, int order);
int __memcg_kmem_charge_memcg(struct page *page, gfp_t gfp, int order,
struct mem_cgroup *memcg);
+void __memcg_kmem_uncharge_memcg(struct mem_cgroup *memcg,
+ unsigned int nr_pages);
extern struct static_key_false memcg_kmem_enabled_key;
extern struct workqueue_struct *memcg_kmem_cache_wq;
@@ -1339,6 +1341,14 @@ static inline int memcg_kmem_charge_memcg(struct page *page, gfp_t gfp,
return __memcg_kmem_charge_memcg(page, gfp, order, memcg);
return 0;
}
+
+static inline void memcg_kmem_uncharge_memcg(struct page *page, int order,
+ struct mem_cgroup *memcg)
+{
+ if (memcg_kmem_enabled())
+ __memcg_kmem_uncharge_memcg(memcg, 1 << order);
+}
+
/*
* helper for accessing a memcg's index. It will be used as an index in the
* child cache array in kmem_cache, and also to derive its name. This function
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 48a8f1c35176..b2c39f187cbb 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2750,6 +2750,22 @@ int __memcg_kmem_charge(struct page *page, gfp_t gfp, int order)
css_put(&memcg->css);
return ret;
}
+
+/**
+ * __memcg_kmem_uncharge_memcg: uncharge a kmem page
+ * @memcg: memcg to uncharge
+ * @nr_pages: number of pages to uncharge
+ */
+void __memcg_kmem_uncharge_memcg(struct mem_cgroup *memcg,
+ unsigned int nr_pages)
+{
+ if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
+ page_counter_uncharge(&memcg->kmem, nr_pages);
+
+ page_counter_uncharge(&memcg->memory, nr_pages);
+ if (do_memsw_account())
+ page_counter_uncharge(&memcg->memsw, nr_pages);
+}
/**
* __memcg_kmem_uncharge: uncharge a kmem page
* @page: page to uncharge
@@ -2764,14 +2780,7 @@ void __memcg_kmem_uncharge(struct page *page, int order)
return;
VM_BUG_ON_PAGE(mem_cgroup_is_root(memcg), page);
-
- if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
- page_counter_uncharge(&memcg->kmem, nr_pages);
-
- page_counter_uncharge(&memcg->memory, nr_pages);
- if (do_memsw_account())
- page_counter_uncharge(&memcg->memsw, nr_pages);
-
+ __memcg_kmem_uncharge_memcg(memcg, nr_pages);
page->mem_cgroup = NULL;
/* slab pages do not have PageKmemcg flag set */
--
2.20.1
Let's reparent memcg slab memory on memcg offlining. This allows us
to release the memory cgroup without waiting for the last outstanding
kernel object (e.g. dentry used by another application).
So instead of reparenting all accounted slab pages, let's do reparent
a relatively small amount of kmem_caches. Reparenting is performed as
the last part of the deactivation process, so it's guaranteed that all
kmem_caches are not active at this moment.
Since the parent cgroup is already charged, everything we need to do
is to move the kmem_cache to the parent's kmem_caches list,
swap the memcg pointer, bump parent's css refcounter and drop
the cgroup's refcounter. Quite simple.
We can't race with the slab allocation path, and if we race with
deallocation path, it's not a big deal: parent's charge and slab stats
are always correct*, and we don't care anymore about the child usage
and stats. The child cgroup is already offline, so we don't use or
show it anywhere.
* please, look at the comment in kmemcg_cache_deactivate_after_rcu()
for some additional details
Signed-off-by: Roman Gushchin <[email protected]>
---
mm/memcontrol.c | 4 +++-
mm/slab.h | 4 +++-
mm/slab_common.c | 28 ++++++++++++++++++++++++++++
3 files changed, 34 insertions(+), 2 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 87c06e342e05..2f61d13df0c4 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3239,7 +3239,6 @@ static void memcg_free_kmem(struct mem_cgroup *memcg)
if (memcg->kmem_state == KMEM_ALLOCATED) {
WARN_ON(!list_empty(&memcg->kmem_caches));
static_branch_dec(&memcg_kmem_enabled_key);
- WARN_ON(page_counter_read(&memcg->kmem));
}
}
#else
@@ -4651,6 +4650,9 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
/* The following stuff does not apply to the root */
if (!parent) {
+#ifdef CONFIG_MEMCG_KMEM
+ INIT_LIST_HEAD(&memcg->kmem_caches);
+#endif
root_mem_cgroup = memcg;
return &memcg->css;
}
diff --git a/mm/slab.h b/mm/slab.h
index 1f49945f5c1d..be4f04ef65f9 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -329,10 +329,12 @@ static __always_inline void memcg_uncharge_slab(struct page *page, int order,
return;
}
- memcg = s->memcg_params.memcg;
+ rcu_read_lock();
+ memcg = READ_ONCE(s->memcg_params.memcg);
lruvec = mem_cgroup_lruvec(page_pgdat(page), memcg);
mod_lruvec_state(lruvec, idx, -(1 << order));
memcg_kmem_uncharge_memcg(page, order, memcg);
+ rcu_read_unlock();
kmemcg_cache_put_many(s, 1 << order);
}
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 3fdd02979a1c..fc2e86de402f 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -745,7 +745,35 @@ void kmemcg_queue_cache_shutdown(struct kmem_cache *s)
static void kmemcg_cache_deactivate_after_rcu(struct kmem_cache *s)
{
+ struct mem_cgroup *memcg, *parent;
+
__kmemcg_cache_deactivate_after_rcu(s);
+
+ memcg = s->memcg_params.memcg;
+ parent = parent_mem_cgroup(memcg);
+ if (!parent)
+ parent = root_mem_cgroup;
+
+ if (memcg == parent)
+ return;
+
+ /*
+ * Let's reparent the kmem_cache. It's already deactivated, so we
+ * can't race with memcg_charge_slab(). We still can race with
+ * memcg_uncharge_slab(), but it's not a problem. The parent cgroup
+ * is already charged, so it's ok to uncharge either the parent cgroup
+ * directly, either recursively.
+ * The same is true for recursive vmstats. Local vmstats are not use
+ * anywhere, except count_shadow_nodes(). But reparenting will not
+ * cahnge anything for count_shadow_nodes(): on memcg removal
+ * shrinker lists are reparented, so it always returns SHRINK_EMPTY
+ * for non-leaf dead memcgs. For the parent memcgs local slab stats
+ * are always 0 now, so reparenting will not change anything.
+ */
+ list_move(&s->memcg_params.kmem_caches_node, &parent->kmem_caches);
+ s->memcg_params.memcg = parent;
+ css_get(&parent->css);
+ css_put(&memcg->css);
}
static void kmemcg_cache_deactivate(struct kmem_cache *s)
--
2.20.1
This commit makes several important changes in the lifecycle
of a non-root kmem_cache, which also affect the lifecycle
of a memory cgroup.
Currently each charged slab page has a page->mem_cgroup pointer
to the memory cgroup and holds a reference to it.
Kmem_caches are held by the cgroup. On offlining empty kmem_caches
are freed, all other are freed on cgroup release.
So the current scheme can be illustrated as:
page->mem_cgroup->kmem_cache.
To implement the slab memory reparenting we need to invert the scheme
into: page->kmem_cache->mem_cgroup.
Let's make every page to hold a reference to the kmem_cache (we
already have a stable pointer), and make kmem_caches to hold a single
reference to the memory cgroup.
To make this possible we need to introduce a new refcounter
for non-root kmem_caches. It's atomic for now, but can be easily
converted to a percpu counter, had we any performance penalty*.
The initial value is set to 1, and it's decremented on deactivation,
so we never shutdown an active cache.
To shutdown non-active empty kmem_caches, let's reuse the
infrastructure of the RCU-delayed work queue, used previously for
the deactivation. After the generalization, it's perfectly suited
for our needs.
Since now we can release a kmem_cache at any moment after the
deactivation, let's call sysfs_slab_remove() only from the shutdown
path. It makes deactivation path simpler.
Because we don't set the page->mem_cgroup pointer, we need to change
the way how memcg-level stats is working for slab pages. We can't use
mod_lruvec_page_state() helpers anymore, so switch over to
mod_lruvec_state().
* I used the following simple approach to test the performance
(stolen from another patchset by T. Harding):
time find / -name fname-no-exist
echo 2 > /proc/sys/vm/drop_caches
repeat several times
Results (I've chosen best results in several runs):
orig patched
real 0m0.712s 0m0.690s
user 0m0.104s 0m0.101s
sys 0m0.346s 0m0.340s
real 0m0.728s 0m0.723s
user 0m0.114s 0m0.115s
sys 0m0.342s 0m0.338s
real 0m0.685s 0m0.767s
user 0m0.118s 0m0.114s
sys 0m0.343s 0m0.336s
So it looks like the difference is not noticeable in this test.
Signed-off-by: Roman Gushchin <[email protected]>
---
include/linux/slab.h | 2 +-
mm/memcontrol.c | 9 ++++----
mm/slab.c | 15 +-----------
mm/slab.h | 54 +++++++++++++++++++++++++++++++++++++++++---
mm/slab_common.c | 51 +++++++++++++++++------------------------
mm/slub.c | 22 +-----------------
6 files changed, 79 insertions(+), 74 deletions(-)
diff --git a/include/linux/slab.h b/include/linux/slab.h
index 47923c173f30..4daaade76c63 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -152,7 +152,6 @@ int kmem_cache_shrink(struct kmem_cache *);
void memcg_create_kmem_cache(struct mem_cgroup *, struct kmem_cache *);
void memcg_deactivate_kmem_caches(struct mem_cgroup *);
-void memcg_destroy_kmem_caches(struct mem_cgroup *);
/*
* Please use this macro to create slab caches. Simply specify the
@@ -641,6 +640,7 @@ struct memcg_cache_params {
struct mem_cgroup *memcg;
struct list_head children_node;
struct list_head kmem_caches_node;
+ atomic_long_t refcnt;
void (*work_fn)(struct kmem_cache *);
union {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b2c39f187cbb..87c06e342e05 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2719,9 +2719,6 @@ int __memcg_kmem_charge_memcg(struct page *page, gfp_t gfp, int order,
cancel_charge(memcg, nr_pages);
return -ENOMEM;
}
-
- page->mem_cgroup = memcg;
-
return 0;
}
@@ -2744,8 +2741,10 @@ int __memcg_kmem_charge(struct page *page, gfp_t gfp, int order)
memcg = get_mem_cgroup_from_current();
if (!mem_cgroup_is_root(memcg)) {
ret = __memcg_kmem_charge_memcg(page, gfp, order, memcg);
- if (!ret)
+ if (!ret) {
+ page->mem_cgroup = memcg;
__SetPageKmemcg(page);
+ }
}
css_put(&memcg->css);
return ret;
@@ -3238,7 +3237,7 @@ static void memcg_free_kmem(struct mem_cgroup *memcg)
memcg_offline_kmem(memcg);
if (memcg->kmem_state == KMEM_ALLOCATED) {
- memcg_destroy_kmem_caches(memcg);
+ WARN_ON(!list_empty(&memcg->kmem_caches));
static_branch_dec(&memcg_kmem_enabled_key);
WARN_ON(page_counter_read(&memcg->kmem));
}
diff --git a/mm/slab.c b/mm/slab.c
index 14466a73d057..171b21ca617f 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -1389,7 +1389,6 @@ static struct page *kmem_getpages(struct kmem_cache *cachep, gfp_t flags,
int nodeid)
{
struct page *page;
- int nr_pages;
flags |= cachep->allocflags;
@@ -1404,12 +1403,6 @@ static struct page *kmem_getpages(struct kmem_cache *cachep, gfp_t flags,
return NULL;
}
- nr_pages = (1 << cachep->gfporder);
- if (cachep->flags & SLAB_RECLAIM_ACCOUNT)
- mod_lruvec_page_state(page, NR_SLAB_RECLAIMABLE, nr_pages);
- else
- mod_lruvec_page_state(page, NR_SLAB_UNRECLAIMABLE, nr_pages);
-
__SetPageSlab(page);
/* Record if ALLOC_NO_WATERMARKS was set when allocating the slab */
if (sk_memalloc_socks() && page_is_pfmemalloc(page))
@@ -1424,12 +1417,6 @@ static struct page *kmem_getpages(struct kmem_cache *cachep, gfp_t flags,
static void kmem_freepages(struct kmem_cache *cachep, struct page *page)
{
int order = cachep->gfporder;
- unsigned long nr_freed = (1 << order);
-
- if (cachep->flags & SLAB_RECLAIM_ACCOUNT)
- mod_lruvec_page_state(page, NR_SLAB_RECLAIMABLE, -nr_freed);
- else
- mod_lruvec_page_state(page, NR_SLAB_UNRECLAIMABLE, -nr_freed);
BUG_ON(!PageSlab(page));
__ClearPageSlabPfmemalloc(page);
@@ -1438,7 +1425,7 @@ static void kmem_freepages(struct kmem_cache *cachep, struct page *page)
page->mapping = NULL;
if (current->reclaim_state)
- current->reclaim_state->reclaimed_slab += nr_freed;
+ current->reclaim_state->reclaimed_slab += 1 << order;
memcg_uncharge_slab(page, order, cachep);
__free_pages(page, order);
}
diff --git a/mm/slab.h b/mm/slab.h
index 4a261c97c138..1f49945f5c1d 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -173,7 +173,9 @@ void __kmem_cache_release(struct kmem_cache *);
int __kmem_cache_shrink(struct kmem_cache *);
void __kmemcg_cache_deactivate(struct kmem_cache *s);
void __kmemcg_cache_deactivate_after_rcu(struct kmem_cache *s);
+void kmemcg_cache_shutdown(struct kmem_cache *s);
void slab_kmem_cache_release(struct kmem_cache *);
+void kmemcg_queue_cache_shutdown(struct kmem_cache *s);
struct seq_file;
struct file;
@@ -274,19 +276,65 @@ static inline struct kmem_cache *memcg_root_cache(struct kmem_cache *s)
return s->memcg_params.root_cache;
}
+static __always_inline void kmemcg_cache_get_many(struct kmem_cache *s, long n)
+{
+ atomic_long_add(n, &s->memcg_params.refcnt);
+}
+
+static __always_inline void kmemcg_cache_put_many(struct kmem_cache *s, long n)
+{
+ if (atomic_long_sub_and_test(n, &s->memcg_params.refcnt))
+ kmemcg_queue_cache_shutdown(s);
+}
+
static __always_inline int memcg_charge_slab(struct page *page,
gfp_t gfp, int order,
struct kmem_cache *s)
{
- if (is_root_cache(s))
+ int idx = (s->flags & SLAB_RECLAIM_ACCOUNT) ?
+ NR_SLAB_RECLAIMABLE : NR_SLAB_UNRECLAIMABLE;
+ struct mem_cgroup *memcg;
+ struct lruvec *lruvec;
+ int ret;
+
+ if (is_root_cache(s)) {
+ mod_node_page_state(page_pgdat(page), idx, 1 << order);
return 0;
- return memcg_kmem_charge_memcg(page, gfp, order, s->memcg_params.memcg);
+ }
+
+ memcg = s->memcg_params.memcg;
+ ret = memcg_kmem_charge_memcg(page, gfp, order, memcg);
+ if (!ret) {
+ lruvec = mem_cgroup_lruvec(page_pgdat(page), memcg);
+ mod_lruvec_state(lruvec, idx, 1 << order);
+
+ /* transer try_charge() page references to kmem_cache */
+ kmemcg_cache_get_many(s, 1 << order);
+ css_put_many(&memcg->css, 1 << order);
+ }
+
+ return 0;
}
static __always_inline void memcg_uncharge_slab(struct page *page, int order,
struct kmem_cache *s)
{
- memcg_kmem_uncharge(page, order);
+ int idx = (s->flags & SLAB_RECLAIM_ACCOUNT) ?
+ NR_SLAB_RECLAIMABLE : NR_SLAB_UNRECLAIMABLE;
+ struct mem_cgroup *memcg;
+ struct lruvec *lruvec;
+
+ if (is_root_cache(s)) {
+ mod_node_page_state(page_pgdat(page), idx, -(1 << order));
+ return;
+ }
+
+ memcg = s->memcg_params.memcg;
+ lruvec = mem_cgroup_lruvec(page_pgdat(page), memcg);
+ mod_lruvec_state(lruvec, idx, -(1 << order));
+ memcg_kmem_uncharge_memcg(page, order, memcg);
+
+ kmemcg_cache_put_many(s, 1 << order);
}
extern void slab_init_memcg_params(struct kmem_cache *);
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 4e5b4292a763..3fdd02979a1c 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -148,6 +148,7 @@ static int init_memcg_params(struct kmem_cache *s,
s->memcg_params.root_cache = root_cache;
INIT_LIST_HEAD(&s->memcg_params.children_node);
INIT_LIST_HEAD(&s->memcg_params.kmem_caches_node);
+ atomic_long_set(&s->memcg_params.refcnt, 1);
return 0;
}
@@ -225,6 +226,7 @@ void memcg_link_cache(struct kmem_cache *s, struct mem_cgroup *memcg)
if (is_root_cache(s)) {
list_add(&s->root_caches_node, &slab_root_caches);
} else {
+ css_get(&memcg->css);
s->memcg_params.memcg = memcg;
list_add(&s->memcg_params.children_node,
&s->memcg_params.root_cache->memcg_params.children);
@@ -240,6 +242,7 @@ static void memcg_unlink_cache(struct kmem_cache *s)
} else {
list_del(&s->memcg_params.children_node);
list_del(&s->memcg_params.kmem_caches_node);
+ css_put(&s->memcg_params.memcg->css);
}
}
#else
@@ -703,21 +706,19 @@ static void kmemcg_after_rcu_workfn(struct work_struct *work)
s->memcg_params.work_fn(s);
s->memcg_params.work_fn = NULL;
+ kmemcg_cache_put_many(s, 1);
mutex_unlock(&slab_mutex);
put_online_mems();
put_online_cpus();
-
- /* done, put the ref from slab_deactivate_memcg_cache_rcu_sched() */
- css_put(&s->memcg_params.memcg->css);
}
/*
* We need to grab blocking locks. Bounce to ->work. The
* work item shares the space with the RCU head and can't be
- * initialized eariler.
-*/
+ * initialized earlier.
+ */
static void kmemcg_schedule_work_after_rcu(struct rcu_head *head)
{
struct kmem_cache *s = container_of(head, struct kmem_cache,
@@ -727,6 +728,21 @@ static void kmemcg_schedule_work_after_rcu(struct rcu_head *head)
queue_work(memcg_kmem_cache_wq, &s->memcg_params.work);
}
+static void kmemcg_cache_shutdown_after_rcu(struct kmem_cache *s)
+{
+ WARN_ON(shutdown_cache(s));
+}
+
+void kmemcg_queue_cache_shutdown(struct kmem_cache *s)
+{
+ if (s->memcg_params.root_cache->memcg_params.dying)
+ return;
+
+ WARN_ON(s->memcg_params.work_fn);
+ s->memcg_params.work_fn = kmemcg_cache_shutdown_after_rcu;
+ call_rcu(&s->memcg_params.rcu_head, kmemcg_schedule_work_after_rcu);
+}
+
static void kmemcg_cache_deactivate_after_rcu(struct kmem_cache *s)
{
__kmemcg_cache_deactivate_after_rcu(s);
@@ -739,9 +755,6 @@ static void kmemcg_cache_deactivate(struct kmem_cache *s)
if (s->memcg_params.root_cache->memcg_params.dying)
return;
- /* pin memcg so that @s doesn't get destroyed in the middle */
- css_get(&s->memcg_params.memcg->css);
-
WARN_ON_ONCE(s->memcg_params.work_fn);
s->memcg_params.work_fn = kmemcg_cache_deactivate_after_rcu;
call_rcu(&s->memcg_params.rcu_head, kmemcg_schedule_work_after_rcu);
@@ -775,28 +788,6 @@ void memcg_deactivate_kmem_caches(struct mem_cgroup *memcg)
put_online_cpus();
}
-void memcg_destroy_kmem_caches(struct mem_cgroup *memcg)
-{
- struct kmem_cache *s, *s2;
-
- get_online_cpus();
- get_online_mems();
-
- mutex_lock(&slab_mutex);
- list_for_each_entry_safe(s, s2, &memcg->kmem_caches,
- memcg_params.kmem_caches_node) {
- /*
- * The cgroup is about to be freed and therefore has no charges
- * left. Hence, all its caches must be empty by now.
- */
- BUG_ON(shutdown_cache(s));
- }
- mutex_unlock(&slab_mutex);
-
- put_online_mems();
- put_online_cpus();
-}
-
static int shutdown_memcg_caches(struct kmem_cache *s)
{
struct memcg_cache_array *arr;
diff --git a/mm/slub.c b/mm/slub.c
index 195f61785c7d..a28b3b3abf29 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1692,11 +1692,6 @@ static struct page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
if (!page)
return NULL;
- mod_lruvec_page_state(page,
- (s->flags & SLAB_RECLAIM_ACCOUNT) ?
- NR_SLAB_RECLAIMABLE : NR_SLAB_UNRECLAIMABLE,
- 1 << oo_order(oo));
-
inc_slabs_node(s, page_to_nid(page), page->objects);
return page;
@@ -1730,11 +1725,6 @@ static void __free_slab(struct kmem_cache *s, struct page *page)
check_object(s, page, p, SLUB_RED_INACTIVE);
}
- mod_lruvec_page_state(page,
- (s->flags & SLAB_RECLAIM_ACCOUNT) ?
- NR_SLAB_RECLAIMABLE : NR_SLAB_UNRECLAIMABLE,
- -pages);
-
__ClearPageSlabPfmemalloc(page);
__ClearPageSlab(page);
@@ -4037,18 +4027,8 @@ void __kmemcg_cache_deactivate_after_rcu(struct kmem_cache *s)
{
/*
* Called with all the locks held after a sched RCU grace period.
- * Even if @s becomes empty after shrinking, we can't know that @s
- * doesn't have allocations already in-flight and thus can't
- * destroy @s until the associated memcg is released.
- *
- * However, let's remove the sysfs files for empty caches here.
- * Each cache has a lot of interface files which aren't
- * particularly useful for empty draining caches; otherwise, we can
- * easily end up with millions of unnecessary sysfs files on
- * systems which have a lot of memory and transient cgroups.
*/
- if (!__kmem_cache_shrink(s))
- sysfs_slab_remove(s);
+ __kmem_cache_shrink(s);
}
void __kmemcg_cache_deactivate(struct kmem_cache *s)
--
2.20.1
Currently SLUB uses a work scheduled after an RCU grace period
to deactivate a non-root kmem_cache. This mechanism can be reused
for kmem_caches reparenting, but requires some generalization.
Let's decouple all infrastructure (rcu callback, work callback)
from the SLUB-specific code, so it can be used with SLAB as well.
Also, let's rename some functions to make the code look simpler.
All SLAB/SLUB-specific functions start with "__". Remove "deact_"
prefix from the corresponding struct fields.
Here is the graph of a new calling scheme:
kmemcg_cache_deactivate()
__kmemcg_cache_deactivate() SLAB/SLUB-specific
kmemcg_schedule_work_after_rcu() rcu
kmemcg_after_rcu_workfn() work
kmemcg_cache_deactivate_after_rcu()
__kmemcg_cache_deactivate_after_rcu() SLAB/SLUB-specific
instead of:
__kmemcg_cache_deactivate() SLAB/SLUB-specific
slab_deactivate_memcg_cache_rcu_sched() SLUB-only
kmemcg_deactivate_rcufn SLUB-only, rcu
kmemcg_deactivate_workfn SLUB-only, work
kmemcg_cache_deact_after_rcu() SLUB-only
Signed-off-by: Roman Gushchin <[email protected]>
---
include/linux/slab.h | 6 ++---
mm/slab.c | 4 +++
mm/slab.h | 3 ++-
mm/slab_common.c | 62 ++++++++++++++++++++------------------------
mm/slub.c | 8 +-----
5 files changed, 38 insertions(+), 45 deletions(-)
diff --git a/include/linux/slab.h b/include/linux/slab.h
index 9449b19c5f10..47923c173f30 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -642,10 +642,10 @@ struct memcg_cache_params {
struct list_head children_node;
struct list_head kmem_caches_node;
- void (*deact_fn)(struct kmem_cache *);
+ void (*work_fn)(struct kmem_cache *);
union {
- struct rcu_head deact_rcu_head;
- struct work_struct deact_work;
+ struct rcu_head rcu_head;
+ struct work_struct work;
};
};
};
diff --git a/mm/slab.c b/mm/slab.c
index 57a332f524cf..14466a73d057 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -2317,6 +2317,10 @@ void __kmemcg_cache_deactivate(struct kmem_cache *cachep)
{
__kmem_cache_shrink(cachep);
}
+
+void __kmemcg_cache_deactivate_after_rcu(struct kmem_cache *s)
+{
+}
#endif
int __kmem_cache_shutdown(struct kmem_cache *cachep)
diff --git a/mm/slab.h b/mm/slab.h
index 6a562ca72bca..4a261c97c138 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -172,6 +172,7 @@ int __kmem_cache_shutdown(struct kmem_cache *);
void __kmem_cache_release(struct kmem_cache *);
int __kmem_cache_shrink(struct kmem_cache *);
void __kmemcg_cache_deactivate(struct kmem_cache *s);
+void __kmemcg_cache_deactivate_after_rcu(struct kmem_cache *s);
void slab_kmem_cache_release(struct kmem_cache *);
struct seq_file;
@@ -291,7 +292,7 @@ static __always_inline void memcg_uncharge_slab(struct page *page, int order,
extern void slab_init_memcg_params(struct kmem_cache *);
extern void memcg_link_cache(struct kmem_cache *s, struct mem_cgroup *memcg);
extern void slab_deactivate_memcg_cache_rcu_sched(struct kmem_cache *s,
- void (*deact_fn)(struct kmem_cache *));
+ void (*work_fn)(struct kmem_cache *));
#else /* CONFIG_MEMCG_KMEM */
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 6e00bdf8618d..4e5b4292a763 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -691,17 +691,18 @@ void memcg_create_kmem_cache(struct mem_cgroup *memcg,
put_online_cpus();
}
-static void kmemcg_deactivate_workfn(struct work_struct *work)
+static void kmemcg_after_rcu_workfn(struct work_struct *work)
{
struct kmem_cache *s = container_of(work, struct kmem_cache,
- memcg_params.deact_work);
+ memcg_params.work);
get_online_cpus();
get_online_mems();
mutex_lock(&slab_mutex);
- s->memcg_params.deact_fn(s);
+ s->memcg_params.work_fn(s);
+ s->memcg_params.work_fn = NULL;
mutex_unlock(&slab_mutex);
@@ -712,37 +713,28 @@ static void kmemcg_deactivate_workfn(struct work_struct *work)
css_put(&s->memcg_params.memcg->css);
}
-static void kmemcg_deactivate_rcufn(struct rcu_head *head)
+/*
+ * We need to grab blocking locks. Bounce to ->work. The
+ * work item shares the space with the RCU head and can't be
+ * initialized eariler.
+*/
+static void kmemcg_schedule_work_after_rcu(struct rcu_head *head)
{
struct kmem_cache *s = container_of(head, struct kmem_cache,
- memcg_params.deact_rcu_head);
+ memcg_params.rcu_head);
- /*
- * We need to grab blocking locks. Bounce to ->deact_work. The
- * work item shares the space with the RCU head and can't be
- * initialized eariler.
- */
- INIT_WORK(&s->memcg_params.deact_work, kmemcg_deactivate_workfn);
- queue_work(memcg_kmem_cache_wq, &s->memcg_params.deact_work);
+ INIT_WORK(&s->memcg_params.work, kmemcg_after_rcu_workfn);
+ queue_work(memcg_kmem_cache_wq, &s->memcg_params.work);
}
-/**
- * slab_deactivate_memcg_cache_rcu_sched - schedule deactivation after a
- * sched RCU grace period
- * @s: target kmem_cache
- * @deact_fn: deactivation function to call
- *
- * Schedule @deact_fn to be invoked with online cpus, mems and slab_mutex
- * held after a sched RCU grace period. The slab is guaranteed to stay
- * alive until @deact_fn is finished. This is to be used from
- * __kmemcg_cache_deactivate().
- */
-void slab_deactivate_memcg_cache_rcu_sched(struct kmem_cache *s,
- void (*deact_fn)(struct kmem_cache *))
+static void kmemcg_cache_deactivate_after_rcu(struct kmem_cache *s)
{
- if (WARN_ON_ONCE(is_root_cache(s)) ||
- WARN_ON_ONCE(s->memcg_params.deact_fn))
- return;
+ __kmemcg_cache_deactivate_after_rcu(s);
+}
+
+static void kmemcg_cache_deactivate(struct kmem_cache *s)
+{
+ __kmemcg_cache_deactivate(s);
if (s->memcg_params.root_cache->memcg_params.dying)
return;
@@ -750,8 +742,9 @@ void slab_deactivate_memcg_cache_rcu_sched(struct kmem_cache *s,
/* pin memcg so that @s doesn't get destroyed in the middle */
css_get(&s->memcg_params.memcg->css);
- s->memcg_params.deact_fn = deact_fn;
- call_rcu(&s->memcg_params.deact_rcu_head, kmemcg_deactivate_rcufn);
+ WARN_ON_ONCE(s->memcg_params.work_fn);
+ s->memcg_params.work_fn = kmemcg_cache_deactivate_after_rcu;
+ call_rcu(&s->memcg_params.rcu_head, kmemcg_schedule_work_after_rcu);
}
void memcg_deactivate_kmem_caches(struct mem_cgroup *memcg)
@@ -773,7 +766,7 @@ void memcg_deactivate_kmem_caches(struct mem_cgroup *memcg)
if (!c)
continue;
- __kmemcg_cache_deactivate(c);
+ kmemcg_cache_deactivate(c);
arr->entries[idx] = NULL;
}
mutex_unlock(&slab_mutex);
@@ -866,11 +859,12 @@ static void flush_memcg_workqueue(struct kmem_cache *s)
mutex_unlock(&slab_mutex);
/*
- * SLUB deactivates the kmem_caches through call_rcu. Make
+ * SLAB and SLUB deactivate the kmem_caches through call_rcu. Make
* sure all registered rcu callbacks have been invoked.
*/
- if (IS_ENABLED(CONFIG_SLUB))
- rcu_barrier();
+#ifndef CONFIG_SLOB
+ rcu_barrier();
+#endif
/*
* SLAB and SLUB create memcg kmem_caches through workqueue and SLUB
diff --git a/mm/slub.c b/mm/slub.c
index 2b9244529d76..195f61785c7d 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -4033,7 +4033,7 @@ int __kmem_cache_shrink(struct kmem_cache *s)
}
#ifdef CONFIG_MEMCG
-static void kmemcg_cache_deact_after_rcu(struct kmem_cache *s)
+void __kmemcg_cache_deactivate_after_rcu(struct kmem_cache *s)
{
/*
* Called with all the locks held after a sched RCU grace period.
@@ -4059,12 +4059,6 @@ void __kmemcg_cache_deactivate(struct kmem_cache *s)
*/
slub_set_cpu_partial(s, 0);
s->min_partial = 0;
-
- /*
- * s->cpu_partial is checked locklessly (see put_cpu_partial), so
- * we have to make sure the change is visible before shrinking.
- */
- slab_deactivate_memcg_cache_rcu_sched(s, kmemcg_cache_deact_after_rcu);
}
#endif /* CONFIG_MEMCG */
--
2.20.1
On Wed, Apr 17, 2019 at 2:55 PM Roman Gushchin <[email protected]> wrote:
>
> This commit makes several important changes in the lifecycle
> of a non-root kmem_cache, which also affect the lifecycle
> of a memory cgroup.
>
> Currently each charged slab page has a page->mem_cgroup pointer
> to the memory cgroup and holds a reference to it.
> Kmem_caches are held by the cgroup. On offlining empty kmem_caches
> are freed, all other are freed on cgroup release.
No, they are not freed (i.e. destroyed) on offlining, only
deactivated. All memcg kmem_caches are freed/destroyed on memcg's
css_free.
>
> So the current scheme can be illustrated as:
> page->mem_cgroup->kmem_cache.
>
> To implement the slab memory reparenting we need to invert the scheme
> into: page->kmem_cache->mem_cgroup.
>
> Let's make every page to hold a reference to the kmem_cache (we
> already have a stable pointer), and make kmem_caches to hold a single
> reference to the memory cgroup.
What about memcg_kmem_get_cache()? That function assumes that by
taking reference on memcg, it's kmem_caches will stay. I think you
need to get reference on the kmem_cache in memcg_kmem_get_cache()
within the rcu lock where you get the memcg through css_tryget_online.
>
> To make this possible we need to introduce a new refcounter
> for non-root kmem_caches. It's atomic for now, but can be easily
> converted to a percpu counter, had we any performance penalty*.
> The initial value is set to 1, and it's decremented on deactivation,
> so we never shutdown an active cache.
>
> To shutdown non-active empty kmem_caches, let's reuse the
> infrastructure of the RCU-delayed work queue, used previously for
> the deactivation. After the generalization, it's perfectly suited
> for our needs.
>
> Since now we can release a kmem_cache at any moment after the
> deactivation, let's call sysfs_slab_remove() only from the shutdown
> path. It makes deactivation path simpler.
>
> Because we don't set the page->mem_cgroup pointer, we need to change
> the way how memcg-level stats is working for slab pages. We can't use
> mod_lruvec_page_state() helpers anymore, so switch over to
> mod_lruvec_state().
>
> * I used the following simple approach to test the performance
> (stolen from another patchset by T. Harding):
>
> time find / -name fname-no-exist
> echo 2 > /proc/sys/vm/drop_caches
> repeat several times
>
> Results (I've chosen best results in several runs):
>
> orig patched
>
> real 0m0.712s 0m0.690s
> user 0m0.104s 0m0.101s
> sys 0m0.346s 0m0.340s
>
> real 0m0.728s 0m0.723s
> user 0m0.114s 0m0.115s
> sys 0m0.342s 0m0.338s
>
> real 0m0.685s 0m0.767s
> user 0m0.118s 0m0.114s
> sys 0m0.343s 0m0.336s
>
> So it looks like the difference is not noticeable in this test.
>
> Signed-off-by: Roman Gushchin <[email protected]>
> ---
> include/linux/slab.h | 2 +-
> mm/memcontrol.c | 9 ++++----
> mm/slab.c | 15 +-----------
> mm/slab.h | 54 +++++++++++++++++++++++++++++++++++++++++---
> mm/slab_common.c | 51 +++++++++++++++++------------------------
> mm/slub.c | 22 +-----------------
> 6 files changed, 79 insertions(+), 74 deletions(-)
>
> diff --git a/include/linux/slab.h b/include/linux/slab.h
> index 47923c173f30..4daaade76c63 100644
> --- a/include/linux/slab.h
> +++ b/include/linux/slab.h
> @@ -152,7 +152,6 @@ int kmem_cache_shrink(struct kmem_cache *);
>
> void memcg_create_kmem_cache(struct mem_cgroup *, struct kmem_cache *);
> void memcg_deactivate_kmem_caches(struct mem_cgroup *);
> -void memcg_destroy_kmem_caches(struct mem_cgroup *);
>
> /*
> * Please use this macro to create slab caches. Simply specify the
> @@ -641,6 +640,7 @@ struct memcg_cache_params {
> struct mem_cgroup *memcg;
> struct list_head children_node;
> struct list_head kmem_caches_node;
> + atomic_long_t refcnt;
>
> void (*work_fn)(struct kmem_cache *);
> union {
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index b2c39f187cbb..87c06e342e05 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2719,9 +2719,6 @@ int __memcg_kmem_charge_memcg(struct page *page, gfp_t gfp, int order,
> cancel_charge(memcg, nr_pages);
> return -ENOMEM;
> }
> -
> - page->mem_cgroup = memcg;
> -
> return 0;
> }
>
> @@ -2744,8 +2741,10 @@ int __memcg_kmem_charge(struct page *page, gfp_t gfp, int order)
> memcg = get_mem_cgroup_from_current();
> if (!mem_cgroup_is_root(memcg)) {
> ret = __memcg_kmem_charge_memcg(page, gfp, order, memcg);
> - if (!ret)
> + if (!ret) {
> + page->mem_cgroup = memcg;
> __SetPageKmemcg(page);
> + }
> }
> css_put(&memcg->css);
> return ret;
> @@ -3238,7 +3237,7 @@ static void memcg_free_kmem(struct mem_cgroup *memcg)
> memcg_offline_kmem(memcg);
>
> if (memcg->kmem_state == KMEM_ALLOCATED) {
> - memcg_destroy_kmem_caches(memcg);
> + WARN_ON(!list_empty(&memcg->kmem_caches));
> static_branch_dec(&memcg_kmem_enabled_key);
> WARN_ON(page_counter_read(&memcg->kmem));
> }
> diff --git a/mm/slab.c b/mm/slab.c
> index 14466a73d057..171b21ca617f 100644
> --- a/mm/slab.c
> +++ b/mm/slab.c
> @@ -1389,7 +1389,6 @@ static struct page *kmem_getpages(struct kmem_cache *cachep, gfp_t flags,
> int nodeid)
> {
> struct page *page;
> - int nr_pages;
>
> flags |= cachep->allocflags;
>
> @@ -1404,12 +1403,6 @@ static struct page *kmem_getpages(struct kmem_cache *cachep, gfp_t flags,
> return NULL;
> }
>
> - nr_pages = (1 << cachep->gfporder);
> - if (cachep->flags & SLAB_RECLAIM_ACCOUNT)
> - mod_lruvec_page_state(page, NR_SLAB_RECLAIMABLE, nr_pages);
> - else
> - mod_lruvec_page_state(page, NR_SLAB_UNRECLAIMABLE, nr_pages);
> -
> __SetPageSlab(page);
> /* Record if ALLOC_NO_WATERMARKS was set when allocating the slab */
> if (sk_memalloc_socks() && page_is_pfmemalloc(page))
> @@ -1424,12 +1417,6 @@ static struct page *kmem_getpages(struct kmem_cache *cachep, gfp_t flags,
> static void kmem_freepages(struct kmem_cache *cachep, struct page *page)
> {
> int order = cachep->gfporder;
> - unsigned long nr_freed = (1 << order);
> -
> - if (cachep->flags & SLAB_RECLAIM_ACCOUNT)
> - mod_lruvec_page_state(page, NR_SLAB_RECLAIMABLE, -nr_freed);
> - else
> - mod_lruvec_page_state(page, NR_SLAB_UNRECLAIMABLE, -nr_freed);
>
> BUG_ON(!PageSlab(page));
> __ClearPageSlabPfmemalloc(page);
> @@ -1438,7 +1425,7 @@ static void kmem_freepages(struct kmem_cache *cachep, struct page *page)
> page->mapping = NULL;
>
> if (current->reclaim_state)
> - current->reclaim_state->reclaimed_slab += nr_freed;
> + current->reclaim_state->reclaimed_slab += 1 << order;
> memcg_uncharge_slab(page, order, cachep);
> __free_pages(page, order);
> }
> diff --git a/mm/slab.h b/mm/slab.h
> index 4a261c97c138..1f49945f5c1d 100644
> --- a/mm/slab.h
> +++ b/mm/slab.h
> @@ -173,7 +173,9 @@ void __kmem_cache_release(struct kmem_cache *);
> int __kmem_cache_shrink(struct kmem_cache *);
> void __kmemcg_cache_deactivate(struct kmem_cache *s);
> void __kmemcg_cache_deactivate_after_rcu(struct kmem_cache *s);
> +void kmemcg_cache_shutdown(struct kmem_cache *s);
> void slab_kmem_cache_release(struct kmem_cache *);
> +void kmemcg_queue_cache_shutdown(struct kmem_cache *s);
>
> struct seq_file;
> struct file;
> @@ -274,19 +276,65 @@ static inline struct kmem_cache *memcg_root_cache(struct kmem_cache *s)
> return s->memcg_params.root_cache;
> }
>
> +static __always_inline void kmemcg_cache_get_many(struct kmem_cache *s, long n)
> +{
> + atomic_long_add(n, &s->memcg_params.refcnt);
> +}
> +
> +static __always_inline void kmemcg_cache_put_many(struct kmem_cache *s, long n)
> +{
> + if (atomic_long_sub_and_test(n, &s->memcg_params.refcnt))
> + kmemcg_queue_cache_shutdown(s);
> +}
> +
> static __always_inline int memcg_charge_slab(struct page *page,
> gfp_t gfp, int order,
> struct kmem_cache *s)
> {
> - if (is_root_cache(s))
> + int idx = (s->flags & SLAB_RECLAIM_ACCOUNT) ?
> + NR_SLAB_RECLAIMABLE : NR_SLAB_UNRECLAIMABLE;
> + struct mem_cgroup *memcg;
> + struct lruvec *lruvec;
> + int ret;
> +
> + if (is_root_cache(s)) {
> + mod_node_page_state(page_pgdat(page), idx, 1 << order);
> return 0;
> - return memcg_kmem_charge_memcg(page, gfp, order, s->memcg_params.memcg);
> + }
> +
> + memcg = s->memcg_params.memcg;
> + ret = memcg_kmem_charge_memcg(page, gfp, order, memcg);
> + if (!ret) {
> + lruvec = mem_cgroup_lruvec(page_pgdat(page), memcg);
> + mod_lruvec_state(lruvec, idx, 1 << order);
> +
> + /* transer try_charge() page references to kmem_cache */
> + kmemcg_cache_get_many(s, 1 << order);
> + css_put_many(&memcg->css, 1 << order);
> + }
> +
> + return 0;
> }
>
> static __always_inline void memcg_uncharge_slab(struct page *page, int order,
> struct kmem_cache *s)
> {
> - memcg_kmem_uncharge(page, order);
> + int idx = (s->flags & SLAB_RECLAIM_ACCOUNT) ?
> + NR_SLAB_RECLAIMABLE : NR_SLAB_UNRECLAIMABLE;
> + struct mem_cgroup *memcg;
> + struct lruvec *lruvec;
> +
> + if (is_root_cache(s)) {
> + mod_node_page_state(page_pgdat(page), idx, -(1 << order));
> + return;
> + }
> +
> + memcg = s->memcg_params.memcg;
> + lruvec = mem_cgroup_lruvec(page_pgdat(page), memcg);
> + mod_lruvec_state(lruvec, idx, -(1 << order));
> + memcg_kmem_uncharge_memcg(page, order, memcg);
> +
> + kmemcg_cache_put_many(s, 1 << order);
> }
>
> extern void slab_init_memcg_params(struct kmem_cache *);
> diff --git a/mm/slab_common.c b/mm/slab_common.c
> index 4e5b4292a763..3fdd02979a1c 100644
> --- a/mm/slab_common.c
> +++ b/mm/slab_common.c
> @@ -148,6 +148,7 @@ static int init_memcg_params(struct kmem_cache *s,
> s->memcg_params.root_cache = root_cache;
> INIT_LIST_HEAD(&s->memcg_params.children_node);
> INIT_LIST_HEAD(&s->memcg_params.kmem_caches_node);
> + atomic_long_set(&s->memcg_params.refcnt, 1);
> return 0;
> }
>
> @@ -225,6 +226,7 @@ void memcg_link_cache(struct kmem_cache *s, struct mem_cgroup *memcg)
> if (is_root_cache(s)) {
> list_add(&s->root_caches_node, &slab_root_caches);
> } else {
> + css_get(&memcg->css);
> s->memcg_params.memcg = memcg;
> list_add(&s->memcg_params.children_node,
> &s->memcg_params.root_cache->memcg_params.children);
> @@ -240,6 +242,7 @@ static void memcg_unlink_cache(struct kmem_cache *s)
> } else {
> list_del(&s->memcg_params.children_node);
> list_del(&s->memcg_params.kmem_caches_node);
> + css_put(&s->memcg_params.memcg->css);
> }
> }
> #else
> @@ -703,21 +706,19 @@ static void kmemcg_after_rcu_workfn(struct work_struct *work)
>
> s->memcg_params.work_fn(s);
> s->memcg_params.work_fn = NULL;
> + kmemcg_cache_put_many(s, 1);
>
> mutex_unlock(&slab_mutex);
>
> put_online_mems();
> put_online_cpus();
> -
> - /* done, put the ref from slab_deactivate_memcg_cache_rcu_sched() */
> - css_put(&s->memcg_params.memcg->css);
> }
>
> /*
> * We need to grab blocking locks. Bounce to ->work. The
> * work item shares the space with the RCU head and can't be
> - * initialized eariler.
> -*/
> + * initialized earlier.
> + */
> static void kmemcg_schedule_work_after_rcu(struct rcu_head *head)
> {
> struct kmem_cache *s = container_of(head, struct kmem_cache,
> @@ -727,6 +728,21 @@ static void kmemcg_schedule_work_after_rcu(struct rcu_head *head)
> queue_work(memcg_kmem_cache_wq, &s->memcg_params.work);
> }
>
> +static void kmemcg_cache_shutdown_after_rcu(struct kmem_cache *s)
> +{
> + WARN_ON(shutdown_cache(s));
> +}
> +
> +void kmemcg_queue_cache_shutdown(struct kmem_cache *s)
> +{
> + if (s->memcg_params.root_cache->memcg_params.dying)
> + return;
> +
> + WARN_ON(s->memcg_params.work_fn);
> + s->memcg_params.work_fn = kmemcg_cache_shutdown_after_rcu;
> + call_rcu(&s->memcg_params.rcu_head, kmemcg_schedule_work_after_rcu);
> +}
> +
> static void kmemcg_cache_deactivate_after_rcu(struct kmem_cache *s)
> {
> __kmemcg_cache_deactivate_after_rcu(s);
> @@ -739,9 +755,6 @@ static void kmemcg_cache_deactivate(struct kmem_cache *s)
> if (s->memcg_params.root_cache->memcg_params.dying)
> return;
>
> - /* pin memcg so that @s doesn't get destroyed in the middle */
> - css_get(&s->memcg_params.memcg->css);
> -
> WARN_ON_ONCE(s->memcg_params.work_fn);
> s->memcg_params.work_fn = kmemcg_cache_deactivate_after_rcu;
> call_rcu(&s->memcg_params.rcu_head, kmemcg_schedule_work_after_rcu);
> @@ -775,28 +788,6 @@ void memcg_deactivate_kmem_caches(struct mem_cgroup *memcg)
> put_online_cpus();
> }
>
> -void memcg_destroy_kmem_caches(struct mem_cgroup *memcg)
> -{
> - struct kmem_cache *s, *s2;
> -
> - get_online_cpus();
> - get_online_mems();
> -
> - mutex_lock(&slab_mutex);
> - list_for_each_entry_safe(s, s2, &memcg->kmem_caches,
> - memcg_params.kmem_caches_node) {
> - /*
> - * The cgroup is about to be freed and therefore has no charges
> - * left. Hence, all its caches must be empty by now.
> - */
> - BUG_ON(shutdown_cache(s));
> - }
> - mutex_unlock(&slab_mutex);
> -
> - put_online_mems();
> - put_online_cpus();
> -}
> -
> static int shutdown_memcg_caches(struct kmem_cache *s)
> {
> struct memcg_cache_array *arr;
> diff --git a/mm/slub.c b/mm/slub.c
> index 195f61785c7d..a28b3b3abf29 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -1692,11 +1692,6 @@ static struct page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
> if (!page)
> return NULL;
>
> - mod_lruvec_page_state(page,
> - (s->flags & SLAB_RECLAIM_ACCOUNT) ?
> - NR_SLAB_RECLAIMABLE : NR_SLAB_UNRECLAIMABLE,
> - 1 << oo_order(oo));
> -
> inc_slabs_node(s, page_to_nid(page), page->objects);
>
> return page;
> @@ -1730,11 +1725,6 @@ static void __free_slab(struct kmem_cache *s, struct page *page)
> check_object(s, page, p, SLUB_RED_INACTIVE);
> }
>
> - mod_lruvec_page_state(page,
> - (s->flags & SLAB_RECLAIM_ACCOUNT) ?
> - NR_SLAB_RECLAIMABLE : NR_SLAB_UNRECLAIMABLE,
> - -pages);
> -
> __ClearPageSlabPfmemalloc(page);
> __ClearPageSlab(page);
>
> @@ -4037,18 +4027,8 @@ void __kmemcg_cache_deactivate_after_rcu(struct kmem_cache *s)
> {
> /*
> * Called with all the locks held after a sched RCU grace period.
> - * Even if @s becomes empty after shrinking, we can't know that @s
> - * doesn't have allocations already in-flight and thus can't
> - * destroy @s until the associated memcg is released.
> - *
> - * However, let's remove the sysfs files for empty caches here.
> - * Each cache has a lot of interface files which aren't
> - * particularly useful for empty draining caches; otherwise, we can
> - * easily end up with millions of unnecessary sysfs files on
> - * systems which have a lot of memory and transient cgroups.
> */
> - if (!__kmem_cache_shrink(s))
> - sysfs_slab_remove(s);
> + __kmem_cache_shrink(s);
> }
>
> void __kmemcg_cache_deactivate(struct kmem_cache *s)
> --
> 2.20.1
>
On Wed, Apr 17, 2019 at 04:41:01PM -0700, Shakeel Butt wrote:
> On Wed, Apr 17, 2019 at 2:55 PM Roman Gushchin <[email protected]> wrote:
> >
> > This commit makes several important changes in the lifecycle
> > of a non-root kmem_cache, which also affect the lifecycle
> > of a memory cgroup.
> >
> > Currently each charged slab page has a page->mem_cgroup pointer
> > to the memory cgroup and holds a reference to it.
> > Kmem_caches are held by the cgroup. On offlining empty kmem_caches
> > are freed, all other are freed on cgroup release.
>
> No, they are not freed (i.e. destroyed) on offlining, only
> deactivated. All memcg kmem_caches are freed/destroyed on memcg's
> css_free.
You're right, my bad. I was thinking about the corresponding sysfs entry
when was writing it. We try to free it from the deactivation path too.
>
> >
> > So the current scheme can be illustrated as:
> > page->mem_cgroup->kmem_cache.
> >
> > To implement the slab memory reparenting we need to invert the scheme
> > into: page->kmem_cache->mem_cgroup.
> >
> > Let's make every page to hold a reference to the kmem_cache (we
> > already have a stable pointer), and make kmem_caches to hold a single
> > reference to the memory cgroup.
>
> What about memcg_kmem_get_cache()? That function assumes that by
> taking reference on memcg, it's kmem_caches will stay. I think you
> need to get reference on the kmem_cache in memcg_kmem_get_cache()
> within the rcu lock where you get the memcg through css_tryget_online.
Yeah, a very good question.
I believe it's safe because css_tryget_online() guarantees that
the cgroup is online and won't go offline before css_free() in
slab_post_alloc_hook(). I do initialize kmem_cache's refcount to 1
and drop it on offlining, so it protects the online kmem_cache.
Thank you for looking into the patchset!
On Wed, Apr 17, 2019 at 5:39 PM Roman Gushchin <[email protected]> wrote:
>
> On Wed, Apr 17, 2019 at 04:41:01PM -0700, Shakeel Butt wrote:
> > On Wed, Apr 17, 2019 at 2:55 PM Roman Gushchin <[email protected]> wrote:
> > >
> > > This commit makes several important changes in the lifecycle
> > > of a non-root kmem_cache, which also affect the lifecycle
> > > of a memory cgroup.
> > >
> > > Currently each charged slab page has a page->mem_cgroup pointer
> > > to the memory cgroup and holds a reference to it.
> > > Kmem_caches are held by the cgroup. On offlining empty kmem_caches
> > > are freed, all other are freed on cgroup release.
> >
> > No, they are not freed (i.e. destroyed) on offlining, only
> > deactivated. All memcg kmem_caches are freed/destroyed on memcg's
> > css_free.
>
> You're right, my bad. I was thinking about the corresponding sysfs entry
> when was writing it. We try to free it from the deactivation path too.
>
> >
> > >
> > > So the current scheme can be illustrated as:
> > > page->mem_cgroup->kmem_cache.
> > >
> > > To implement the slab memory reparenting we need to invert the scheme
> > > into: page->kmem_cache->mem_cgroup.
> > >
> > > Let's make every page to hold a reference to the kmem_cache (we
> > > already have a stable pointer), and make kmem_caches to hold a single
> > > reference to the memory cgroup.
> >
> > What about memcg_kmem_get_cache()? That function assumes that by
> > taking reference on memcg, it's kmem_caches will stay. I think you
> > need to get reference on the kmem_cache in memcg_kmem_get_cache()
> > within the rcu lock where you get the memcg through css_tryget_online.
>
> Yeah, a very good question.
>
> I believe it's safe because css_tryget_online() guarantees that
> the cgroup is online and won't go offline before css_free() in
> slab_post_alloc_hook(). I do initialize kmem_cache's refcount to 1
> and drop it on offlining, so it protects the online kmem_cache.
>
Let's suppose a thread doing a remote charging calls
memcg_kmem_get_cache() and gets an empty kmem_cache of the remote
memcg having refcnt equal to 1. That thread got a reference on the
remote memcg but no reference on the kmem_cache. Let's suppose that
thread got stuck in the reclaim and scheduled away. In the meantime
that remote memcg got offlined and decremented the refcnt of all of
its kmem_caches. The empty kmem_cache which the thread stuck in
reclaim have pointer to can get deleted and may be using an already
destroyed kmem_cache after coming back from reclaim.
I think the above situation is possible unless the thread gets the
reference on the kmem_cache in memcg_kmem_get_cache().
Shakeel
On Wed, Apr 17, 2019 at 06:55:12PM -0700, Shakeel Butt wrote:
> On Wed, Apr 17, 2019 at 5:39 PM Roman Gushchin <[email protected]> wrote:
> >
> > On Wed, Apr 17, 2019 at 04:41:01PM -0700, Shakeel Butt wrote:
> > > On Wed, Apr 17, 2019 at 2:55 PM Roman Gushchin <[email protected]> wrote:
> > > >
> > > > This commit makes several important changes in the lifecycle
> > > > of a non-root kmem_cache, which also affect the lifecycle
> > > > of a memory cgroup.
> > > >
> > > > Currently each charged slab page has a page->mem_cgroup pointer
> > > > to the memory cgroup and holds a reference to it.
> > > > Kmem_caches are held by the cgroup. On offlining empty kmem_caches
> > > > are freed, all other are freed on cgroup release.
> > >
> > > No, they are not freed (i.e. destroyed) on offlining, only
> > > deactivated. All memcg kmem_caches are freed/destroyed on memcg's
> > > css_free.
> >
> > You're right, my bad. I was thinking about the corresponding sysfs entry
> > when was writing it. We try to free it from the deactivation path too.
> >
> > >
> > > >
> > > > So the current scheme can be illustrated as:
> > > > page->mem_cgroup->kmem_cache.
> > > >
> > > > To implement the slab memory reparenting we need to invert the scheme
> > > > into: page->kmem_cache->mem_cgroup.
> > > >
> > > > Let's make every page to hold a reference to the kmem_cache (we
> > > > already have a stable pointer), and make kmem_caches to hold a single
> > > > reference to the memory cgroup.
> > >
> > > What about memcg_kmem_get_cache()? That function assumes that by
> > > taking reference on memcg, it's kmem_caches will stay. I think you
> > > need to get reference on the kmem_cache in memcg_kmem_get_cache()
> > > within the rcu lock where you get the memcg through css_tryget_online.
> >
> > Yeah, a very good question.
> >
> > I believe it's safe because css_tryget_online() guarantees that
> > the cgroup is online and won't go offline before css_free() in
> > slab_post_alloc_hook(). I do initialize kmem_cache's refcount to 1
> > and drop it on offlining, so it protects the online kmem_cache.
> >
>
> Let's suppose a thread doing a remote charging calls
> memcg_kmem_get_cache() and gets an empty kmem_cache of the remote
> memcg having refcnt equal to 1. That thread got a reference on the
> remote memcg but no reference on the kmem_cache. Let's suppose that
> thread got stuck in the reclaim and scheduled away. In the meantime
> that remote memcg got offlined and decremented the refcnt of all of
> its kmem_caches. The empty kmem_cache which the thread stuck in
> reclaim have pointer to can get deleted and may be using an already
> destroyed kmem_cache after coming back from reclaim.
>
> I think the above situation is possible unless the thread gets the
> reference on the kmem_cache in memcg_kmem_get_cache().
Yes, you're right and I'm writing a nonsense: css_tryget_online()
can't prevent the cgroup from being offlined.
So, the problem with getting a reference in memcg_kmem_get_cache()
is that it's an atomic operation on the hot path, something I'd like
to avoid.
I can make the refcounter percpu, but it'll add some complexity and size
to the kmem_cache object. Still an option, of course.
I wonder if we can use rcu_read_lock() instead, and bump the refcounter
only if we're going into reclaim.
What do you think?
Thanks!
Hello Roman,
On Wed, Apr 17, 2019 at 02:54:29PM -0700, Roman Gushchin wrote:
> There is however a significant problem with reparenting of slab memory:
> there is no list of charged pages. Some of them are in shrinker lists,
> but not all. Introducing of a new list is really not an option.
True, introducing a list of charged pages would negatively affect
SL[AU]B performance since we would need to protect it with some kind
of lock.
>
> But fortunately there is a way forward: every slab page has a stable pointer
> to the corresponding kmem_cache. So the idea is to reparent kmem_caches
> instead of slab pages.
>
> It's actually simpler and cheaper, but requires some underlying changes:
> 1) Make kmem_caches to hold a single reference to the memory cgroup,
> instead of a separate reference per every slab page.
> 2) Stop setting page->mem_cgroup pointer for memcg slab pages and use
> page->kmem_cache->memcg indirection instead. It's used only on
> slab page release, so it shouldn't be a big issue.
> 3) Introduce a refcounter for non-root slab caches. It's required to
> be able to destroy kmem_caches when they become empty and release
> the associated memory cgroup.
Which means an unconditional atomic inc/dec on charge/uncharge paths
AFAIU. Note, we have per cpu batching so charging a kmem page in cgroup
v2 doesn't require an atomic variable modification. I guess you could
use some sort of per cpu ref counting though.
Anyway, releasing mem_cgroup objects, but leaving kmem_cache objects
dangling looks kinda awkward to me. It would be great if we could
release both, but I assume it's hardly possible due to SL[AU]B
complexity.
What about reusing dead cgroups instead? Yeah, it would be kinda unfair,
because a fresh cgroup would get a legacy of objects left from previous
owners, but still, if we delete a cgroup, the workload must be dead and
so apart from a few long-lived objects, there should mostly be cached
objects charged to it, which should be easily released on memory
pressure. Sorry if somebody's asked this question before - I must have
missed that.
Thanks,
Vladimir
On Wed, 17 Apr 2019, Roman Gushchin wrote:
> Let's make every page to hold a reference to the kmem_cache (we
> already have a stable pointer), and make kmem_caches to hold a single
> reference to the memory cgroup.
Ok you are freeing one word in the page struct that can be used for other
purposes now?
On Wed, 17 Apr 2019, Roman Gushchin wrote:
> static __always_inline int memcg_charge_slab(struct page *page,
> gfp_t gfp, int order,
> struct kmem_cache *s)
> {
> - if (is_root_cache(s))
> + int idx = (s->flags & SLAB_RECLAIM_ACCOUNT) ?
> + NR_SLAB_RECLAIMABLE : NR_SLAB_UNRECLAIMABLE;
> + struct mem_cgroup *memcg;
> + struct lruvec *lruvec;
> + int ret;
> +
> + if (is_root_cache(s)) {
> + mod_node_page_state(page_pgdat(page), idx, 1 << order);
Hmmm... This is functionality that is not memcg specific being moved into
a memcg function??? Maybe rename the function to indicate that it is not
memcg specific and add the proper #ifdefs?
> static __always_inline void memcg_uncharge_slab(struct page *page, int order,
> struct kmem_cache *s)
> {
> - memcg_kmem_uncharge(page, order);
> + int idx = (s->flags & SLAB_RECLAIM_ACCOUNT) ?
> + NR_SLAB_RECLAIMABLE : NR_SLAB_UNRECLAIMABLE;
> + struct mem_cgroup *memcg;
> + struct lruvec *lruvec;
> +
> + if (is_root_cache(s)) {
> + mod_node_page_state(page_pgdat(page), idx, -(1 << order));
> + return;
> + }
And again.
On Wed, Apr 17, 2019 at 8:07 PM Roman Gushchin <[email protected]> wrote:
>
> On Wed, Apr 17, 2019 at 06:55:12PM -0700, Shakeel Butt wrote:
> > On Wed, Apr 17, 2019 at 5:39 PM Roman Gushchin <[email protected]> wrote:
> > >
> > > On Wed, Apr 17, 2019 at 04:41:01PM -0700, Shakeel Butt wrote:
> > > > On Wed, Apr 17, 2019 at 2:55 PM Roman Gushchin <[email protected]> wrote:
> > > > >
> > > > > This commit makes several important changes in the lifecycle
> > > > > of a non-root kmem_cache, which also affect the lifecycle
> > > > > of a memory cgroup.
> > > > >
> > > > > Currently each charged slab page has a page->mem_cgroup pointer
> > > > > to the memory cgroup and holds a reference to it.
> > > > > Kmem_caches are held by the cgroup. On offlining empty kmem_caches
> > > > > are freed, all other are freed on cgroup release.
> > > >
> > > > No, they are not freed (i.e. destroyed) on offlining, only
> > > > deactivated. All memcg kmem_caches are freed/destroyed on memcg's
> > > > css_free.
> > >
> > > You're right, my bad. I was thinking about the corresponding sysfs entry
> > > when was writing it. We try to free it from the deactivation path too.
> > >
> > > >
> > > > >
> > > > > So the current scheme can be illustrated as:
> > > > > page->mem_cgroup->kmem_cache.
> > > > >
> > > > > To implement the slab memory reparenting we need to invert the scheme
> > > > > into: page->kmem_cache->mem_cgroup.
> > > > >
> > > > > Let's make every page to hold a reference to the kmem_cache (we
> > > > > already have a stable pointer), and make kmem_caches to hold a single
> > > > > reference to the memory cgroup.
> > > >
> > > > What about memcg_kmem_get_cache()? That function assumes that by
> > > > taking reference on memcg, it's kmem_caches will stay. I think you
> > > > need to get reference on the kmem_cache in memcg_kmem_get_cache()
> > > > within the rcu lock where you get the memcg through css_tryget_online.
> > >
> > > Yeah, a very good question.
> > >
> > > I believe it's safe because css_tryget_online() guarantees that
> > > the cgroup is online and won't go offline before css_free() in
> > > slab_post_alloc_hook(). I do initialize kmem_cache's refcount to 1
> > > and drop it on offlining, so it protects the online kmem_cache.
> > >
> >
> > Let's suppose a thread doing a remote charging calls
> > memcg_kmem_get_cache() and gets an empty kmem_cache of the remote
> > memcg having refcnt equal to 1. That thread got a reference on the
> > remote memcg but no reference on the kmem_cache. Let's suppose that
> > thread got stuck in the reclaim and scheduled away. In the meantime
> > that remote memcg got offlined and decremented the refcnt of all of
> > its kmem_caches. The empty kmem_cache which the thread stuck in
> > reclaim have pointer to can get deleted and may be using an already
> > destroyed kmem_cache after coming back from reclaim.
> >
> > I think the above situation is possible unless the thread gets the
> > reference on the kmem_cache in memcg_kmem_get_cache().
>
> Yes, you're right and I'm writing a nonsense: css_tryget_online()
> can't prevent the cgroup from being offlined.
>
The reason I knew about that race is because I tried something similar
but for different use-case:
https://lkml.org/lkml/2018/3/26/472
> So, the problem with getting a reference in memcg_kmem_get_cache()
> is that it's an atomic operation on the hot path, something I'd like
> to avoid.
>
> I can make the refcounter percpu, but it'll add some complexity and size
> to the kmem_cache object. Still an option, of course.
>
I kind of prefer this option.
> I wonder if we can use rcu_read_lock() instead, and bump the refcounter
> only if we're going into reclaim.
>
> What do you think?
Should it be just reclaim or anything that can reschedule the current thread?
I can tell how we resolve the similar issue for our
eager-kmem_cache-deletion use-case. Our solution (hack) works only for
CONFIG_SLAB (we only use SLAB) and non-preemptible kernel. The
underlying motivation was to reduce the overhead of slab reaper of
traversing thousands of empty offlined kmem caches. CONFIG_SLAB
disables interrupts before accessing the per-cpu caches and reenables
the interrupts if it has to fallback to the page allocation. We use
this window to call memcg_kmem_get_cache() and only increment the
refcnt of kmem_cache if going to the fallback. Thus no need to do
atomic operation on the hot path.
Anyways, I think having percpu refcounter for each memcg kmem_cache is
not that costy for CONFIG_MEMCG_KMEM users and to me that seems like
the most simple solution.
Shakeel
On Thu, Apr 18, 2019 at 01:34:52PM +0000, Christopher Lameter wrote:
> On Wed, 17 Apr 2019, Roman Gushchin wrote:
>
> > Let's make every page to hold a reference to the kmem_cache (we
> > already have a stable pointer), and make kmem_caches to hold a single
> > reference to the memory cgroup.
>
> Ok you are freeing one word in the page struct that can be used for other
> purposes now?
>
Looks so!
On Thu, Apr 18, 2019 at 07:05:24AM -0700, Shakeel Butt wrote:
> On Wed, Apr 17, 2019 at 8:07 PM Roman Gushchin <[email protected]> wrote:
> >
> > On Wed, Apr 17, 2019 at 06:55:12PM -0700, Shakeel Butt wrote:
> > > On Wed, Apr 17, 2019 at 5:39 PM Roman Gushchin <[email protected]> wrote:
> > > >
> > > > On Wed, Apr 17, 2019 at 04:41:01PM -0700, Shakeel Butt wrote:
> > > > > On Wed, Apr 17, 2019 at 2:55 PM Roman Gushchin <[email protected]> wrote:
> > > > > >
> > > > > > This commit makes several important changes in the lifecycle
> > > > > > of a non-root kmem_cache, which also affect the lifecycle
> > > > > > of a memory cgroup.
> > > > > >
> > > > > > Currently each charged slab page has a page->mem_cgroup pointer
> > > > > > to the memory cgroup and holds a reference to it.
> > > > > > Kmem_caches are held by the cgroup. On offlining empty kmem_caches
> > > > > > are freed, all other are freed on cgroup release.
> > > > >
> > > > > No, they are not freed (i.e. destroyed) on offlining, only
> > > > > deactivated. All memcg kmem_caches are freed/destroyed on memcg's
> > > > > css_free.
> > > >
> > > > You're right, my bad. I was thinking about the corresponding sysfs entry
> > > > when was writing it. We try to free it from the deactivation path too.
> > > >
> > > > >
> > > > > >
> > > > > > So the current scheme can be illustrated as:
> > > > > > page->mem_cgroup->kmem_cache.
> > > > > >
> > > > > > To implement the slab memory reparenting we need to invert the scheme
> > > > > > into: page->kmem_cache->mem_cgroup.
> > > > > >
> > > > > > Let's make every page to hold a reference to the kmem_cache (we
> > > > > > already have a stable pointer), and make kmem_caches to hold a single
> > > > > > reference to the memory cgroup.
> > > > >
> > > > > What about memcg_kmem_get_cache()? That function assumes that by
> > > > > taking reference on memcg, it's kmem_caches will stay. I think you
> > > > > need to get reference on the kmem_cache in memcg_kmem_get_cache()
> > > > > within the rcu lock where you get the memcg through css_tryget_online.
> > > >
> > > > Yeah, a very good question.
> > > >
> > > > I believe it's safe because css_tryget_online() guarantees that
> > > > the cgroup is online and won't go offline before css_free() in
> > > > slab_post_alloc_hook(). I do initialize kmem_cache's refcount to 1
> > > > and drop it on offlining, so it protects the online kmem_cache.
> > > >
> > >
> > > Let's suppose a thread doing a remote charging calls
> > > memcg_kmem_get_cache() and gets an empty kmem_cache of the remote
> > > memcg having refcnt equal to 1. That thread got a reference on the
> > > remote memcg but no reference on the kmem_cache. Let's suppose that
> > > thread got stuck in the reclaim and scheduled away. In the meantime
> > > that remote memcg got offlined and decremented the refcnt of all of
> > > its kmem_caches. The empty kmem_cache which the thread stuck in
> > > reclaim have pointer to can get deleted and may be using an already
> > > destroyed kmem_cache after coming back from reclaim.
> > >
> > > I think the above situation is possible unless the thread gets the
> > > reference on the kmem_cache in memcg_kmem_get_cache().
> >
> > Yes, you're right and I'm writing a nonsense: css_tryget_online()
> > can't prevent the cgroup from being offlined.
> >
>
> The reason I knew about that race is because I tried something similar
> but for different use-case:
>
> https://lkml.org/lkml/2018/3/26/472
>
> > So, the problem with getting a reference in memcg_kmem_get_cache()
> > is that it's an atomic operation on the hot path, something I'd like
> > to avoid.
> >
> > I can make the refcounter percpu, but it'll add some complexity and size
> > to the kmem_cache object. Still an option, of course.
> >
>
> I kind of prefer this option.
>
> > I wonder if we can use rcu_read_lock() instead, and bump the refcounter
> > only if we're going into reclaim.
> >
> > What do you think?
>
> Should it be just reclaim or anything that can reschedule the current thread?
>
> I can tell how we resolve the similar issue for our
> eager-kmem_cache-deletion use-case. Our solution (hack) works only for
> CONFIG_SLAB (we only use SLAB) and non-preemptible kernel. The
> underlying motivation was to reduce the overhead of slab reaper of
> traversing thousands of empty offlined kmem caches. CONFIG_SLAB
> disables interrupts before accessing the per-cpu caches and reenables
> the interrupts if it has to fallback to the page allocation. We use
> this window to call memcg_kmem_get_cache() and only increment the
> refcnt of kmem_cache if going to the fallback. Thus no need to do
> atomic operation on the hot path.
>
> Anyways, I think having percpu refcounter for each memcg kmem_cache is
> not that costy for CONFIG_MEMCG_KMEM users and to me that seems like
> the most simple solution.
>
> Shakeel
Ok, sounds like a percpu refcounter is the best option.
I'll try this approach in v2.
Thanks!
On Thu, Apr 18, 2019 at 01:38:44PM +0000, Christopher Lameter wrote:
> On Wed, 17 Apr 2019, Roman Gushchin wrote:
>
> > static __always_inline int memcg_charge_slab(struct page *page,
> > gfp_t gfp, int order,
> > struct kmem_cache *s)
> > {
> > - if (is_root_cache(s))
> > + int idx = (s->flags & SLAB_RECLAIM_ACCOUNT) ?
> > + NR_SLAB_RECLAIMABLE : NR_SLAB_UNRECLAIMABLE;
> > + struct mem_cgroup *memcg;
> > + struct lruvec *lruvec;
> > + int ret;
> > +
> > + if (is_root_cache(s)) {
> > + mod_node_page_state(page_pgdat(page), idx, 1 << order);
>
> Hmmm... This is functionality that is not memcg specific being moved into
> a memcg function??? Maybe rename the function to indicate that it is not
> memcg specific and add the proper #ifdefs?
>
> > static __always_inline void memcg_uncharge_slab(struct page *page, int order,
> > struct kmem_cache *s)
> > {
> > - memcg_kmem_uncharge(page, order);
> > + int idx = (s->flags & SLAB_RECLAIM_ACCOUNT) ?
> > + NR_SLAB_RECLAIMABLE : NR_SLAB_UNRECLAIMABLE;
> > + struct mem_cgroup *memcg;
> > + struct lruvec *lruvec;
> > +
> > + if (is_root_cache(s)) {
> > + mod_node_page_state(page_pgdat(page), idx, -(1 << order));
> > + return;
> > + }
>
> And again.
>
Good point! Will do in v2.
Thanks!
On Thu, Apr 18, 2019 at 11:15:38AM +0300, Vladimir Davydov wrote:
> Hello Roman,
>
> On Wed, Apr 17, 2019 at 02:54:29PM -0700, Roman Gushchin wrote:
> > There is however a significant problem with reparenting of slab memory:
> > there is no list of charged pages. Some of them are in shrinker lists,
> > but not all. Introducing of a new list is really not an option.
>
> True, introducing a list of charged pages would negatively affect
> SL[AU]B performance since we would need to protect it with some kind
> of lock.
>
> >
> > But fortunately there is a way forward: every slab page has a stable pointer
> > to the corresponding kmem_cache. So the idea is to reparent kmem_caches
> > instead of slab pages.
> >
> > It's actually simpler and cheaper, but requires some underlying changes:
> > 1) Make kmem_caches to hold a single reference to the memory cgroup,
> > instead of a separate reference per every slab page.
> > 2) Stop setting page->mem_cgroup pointer for memcg slab pages and use
> > page->kmem_cache->memcg indirection instead. It's used only on
> > slab page release, so it shouldn't be a big issue.
> > 3) Introduce a refcounter for non-root slab caches. It's required to
> > be able to destroy kmem_caches when they become empty and release
> > the associated memory cgroup.
>
> Which means an unconditional atomic inc/dec on charge/uncharge paths
> AFAIU. Note, we have per cpu batching so charging a kmem page in cgroup
> v2 doesn't require an atomic variable modification. I guess you could
> use some sort of per cpu ref counting though.
Yes, looks like I have to switch to the percpu counter (see the thread
with Shakeel).
>
> Anyway, releasing mem_cgroup objects, but leaving kmem_cache objects
> dangling looks kinda awkward to me. It would be great if we could
> release both, but I assume it's hardly possible due to SL[AU]B
> complexity.
Kmem_caches are *much* smaller than memcgs. If the size of kmem_cache
is smaller than the size of objects which are pinning it, I think it's
acceptable. I hope to release all associated percpu memory early to make
it even smaller.
On the other hand memcgs are much larger than typical object which
are pinning it (dentries and inodes). And it rends to grow with new features
being added.
I agree that releasing both would be cool, but I doubt it's possible.
>
> What about reusing dead cgroups instead? Yeah, it would be kinda unfair,
> because a fresh cgroup would get a legacy of objects left from previous
> owners, but still, if we delete a cgroup, the workload must be dead and
> so apart from a few long-lived objects, there should mostly be cached
> objects charged to it, which should be easily released on memory
> pressure. Sorry if somebody's asked this question before - I must have
> missed that.
It's an interesting idea. The problem is that the dying cgroup can be
an almost fully functional cgroup for a long time: it can have associated
sockets, pagecache, kernel objects, etc. It's a part of cgroup tree,
all constraints and limits are still applied, it might have some background
activity.
Thanks!