2020-06-23 17:44:58

by Roman Gushchin

[permalink] [raw]
Subject: [PATCH v7 00/19] The new cgroup slab memory controller

This is v7 of the slab cgroup controller rework.

The patchset moves the accounting from the page level to the object
level. It allows to share slab pages between memory cgroups.
This leads to a significant win in the slab utilization (up to 45%)
and the corresponding drop in the total kernel memory footprint.
The reduced number of unmovable slab pages should also have a positive
effect on the memory fragmentation.

The patchset makes the slab accounting code simpler: there is no more
need in the complicated dynamic creation and destruction of per-cgroup
slab caches, all memory cgroups use a global set of shared slab caches.
The lifetime of slab caches is not more connected to the lifetime
of memory cgroups.

The more precise accounting does require more CPU, however in practice
the difference seems to be negligible. We've been using the new slab
controller in Facebook production for several months with different
workloads and haven't seen any noticeable regressions. What we've seen
were memory savings in order of 1 GB per host (it varied heavily depending
on the actual workload, size of RAM, number of CPUs, memory pressure, etc).

The third version of the patchset added yet another step towards
the simplification of the code: sharing of slab caches between
accounted and non-accounted allocations. It comes with significant
upsides (most noticeable, a complete elimination of dynamic slab caches
creation) but not without some regression risks, so this change sits
on top of the patchset and is not completely merged in. So in the unlikely
event of a noticeable performance regression it can be reverted separately.

The slab memory accounting works in exactly the same way for SLAB and SLUB.
With both allocators the new controller shows significant memory savings,
with SLUB the difference is bigger. On my 16-core desktop machine running
Fedora 32 the size of the slab memory measured after the start of the system
was lower by 58% and 38% with SLUB and SLAB correspondingly.

As an estimation of a potential CPU overhead, below are results of
slab_bulk_test01 test, kindly provided by Jesper D. Brouer. He also
helped with the evaluation of results.
The test can be found here: https://github.com/netoptimizer/prototype-kernel/
The smallest number in each row should be picked for a comparison.

SLUB-patched - bulk-API
- SLUB-patched : bulk_quick_reuse objects=1 : 187 - 90 - 224 cycles(tsc)
- SLUB-patched : bulk_quick_reuse objects=2 : 110 - 53 - 133 cycles(tsc)
- SLUB-patched : bulk_quick_reuse objects=3 : 88 - 95 - 42 cycles(tsc)
- SLUB-patched : bulk_quick_reuse objects=4 : 91 - 85 - 36 cycles(tsc)
- SLUB-patched : bulk_quick_reuse objects=8 : 32 - 66 - 32 cycles(tsc)

SLUB-original - bulk-API
- SLUB-original: bulk_quick_reuse objects=1 : 87 - 87 - 142 cycles(tsc)
- SLUB-original: bulk_quick_reuse objects=2 : 52 - 53 - 53 cycles(tsc)
- SLUB-original: bulk_quick_reuse objects=3 : 42 - 42 - 91 cycles(tsc)
- SLUB-original: bulk_quick_reuse objects=4 : 91 - 37 - 37 cycles(tsc)
- SLUB-original: bulk_quick_reuse objects=8 : 31 - 79 - 76 cycles(tsc)

SLAB-patched - bulk-API
- SLAB-patched : bulk_quick_reuse objects=1 : 67 - 67 - 140 cycles(tsc)
- SLAB-patched : bulk_quick_reuse objects=2 : 55 - 46 - 46 cycles(tsc)
- SLAB-patched : bulk_quick_reuse objects=3 : 93 - 94 - 39 cycles(tsc)
- SLAB-patched : bulk_quick_reuse objects=4 : 35 - 88 - 85 cycles(tsc)
- SLAB-patched : bulk_quick_reuse objects=8 : 30 - 30 - 30 cycles(tsc)

SLAB-original- bulk-API
- SLAB-original: bulk_quick_reuse objects=1 : 143 - 136 - 67 cycles(tsc)
- SLAB-original: bulk_quick_reuse objects=2 : 45 - 46 - 46 cycles(tsc)
- SLAB-original: bulk_quick_reuse objects=3 : 38 - 39 - 39 cycles(tsc)
- SLAB-original: bulk_quick_reuse objects=4 : 35 - 87 - 87 cycles(tsc)
- SLAB-original: bulk_quick_reuse objects=8 : 29 - 66 - 30 cycles(tsc)


v7:
1) rebased on top of Vlastimil's slub improvements, by Andrew
2) page->obj_cgroups is allocated from the same node, by Shakeel
3) perf optimization in get_obj_cgroup_from_current(), by Shakeel
4) added synchronization around allocation of page->obj_cgroups,
by Shakeel
5) fixed kmemleak false positives, by Qian Cai
6) fixed a compiler warning on clang, by Nathan
7) other minor fixes

v6:
1) rebased on top of the mm tree
2) removed a redundant check from cache_from_obj(), suggested by Vlastimil

v5:
1) fixed a build error, spotted by Vlastimil
2) added a comment about memcg->nr_charged_bytes, asked by Johannes
3) added missed acks and reviews

v4:
1) rebased on top of the mm tree, some fixes here and there
2) merged obj_to_index() with slab_index(), suggested by Vlastimil
3) changed objects_per_slab() to a better objects_per_slab_page(),
suggested by Vlastimil
4) other minor fixes and changes

v3:
1) added a patch that switches to a global single set of kmem_caches
2) kmem API clean up dropped, because if has been already merged
3) byte-sized slab vmstat API over page-sized global counters and
bytes-sized memcg/lruvec counters
3) obj_cgroup refcounting simplifications and other minor fixes
4) other minor changes

v2:
1) implemented re-layering and renaming suggested by Johannes,
added his patch to the set. Thanks!
2) fixed the issue discovered by Bharata B Rao. Thanks!
3) added kmem API clean up part
4) added slab/memcg follow-up clean up part
5) fixed a couple of issues discovered by internal testing on FB fleet.
6) added kselftests
7) included metadata into the charge calculation
8) refreshed commit logs, regrouped patches, rebased onto mm tree, etc

v1:
1) fixed a bug in zoneinfo_show_print()
2) added some comments to the subpage charging API, a minor fix
3) separated memory.kmem.slabinfo deprecation into a separate patch,
provided a drgn-based replacement
4) rebased on top of the current mm tree

RFC:
https://lwn.net/Articles/798605/


Johannes Weiner (1):
mm: memcontrol: decouple reference counting from page accounting

Roman Gushchin (18):
mm: memcg: factor out memcg- and lruvec-level changes out of
__mod_lruvec_state()
mm: memcg: prepare for byte-sized vmstat items
mm: memcg: convert vmstat slab counters to bytes
mm: slub: implement SLUB version of obj_to_index()
mm: memcg/slab: obj_cgroup API
mm: memcg/slab: allocate obj_cgroups for non-root slab pages
mm: memcg/slab: save obj_cgroup for non-root slab objects
mm: memcg/slab: charge individual slab objects instead of pages
mm: memcg/slab: deprecate memory.kmem.slabinfo
mm: memcg/slab: move memcg_kmem_bypass() to memcontrol.h
mm: memcg/slab: use a single set of kmem_caches for all accounted
allocations
mm: memcg/slab: simplify memcg cache creation
mm: memcg/slab: remove memcg_kmem_get_cache()
mm: memcg/slab: deprecate slab_root_caches
mm: memcg/slab: remove redundant check in memcg_accumulate_slabinfo()
mm: memcg/slab: use a single set of kmem_caches for all allocations
kselftests: cgroup: add kernel memory accounting tests
tools/cgroup: add memcg_slabinfo.py tool

drivers/base/node.c | 6 +-
fs/proc/meminfo.c | 4 +-
include/linux/memcontrol.h | 85 ++-
include/linux/mm_types.h | 5 +-
include/linux/mmzone.h | 24 +-
include/linux/slab.h | 5 -
include/linux/slab_def.h | 9 +-
include/linux/slub_def.h | 31 +-
include/linux/vmstat.h | 14 +-
kernel/power/snapshot.c | 2 +-
mm/memcontrol.c | 610 +++++++++++--------
mm/oom_kill.c | 2 +-
mm/page_alloc.c | 8 +-
mm/slab.c | 70 +--
mm/slab.h | 370 +++++-------
mm/slab_common.c | 643 +--------------------
mm/slob.c | 12 +-
mm/slub.c | 229 +-------
mm/vmscan.c | 3 +-
mm/vmstat.c | 30 +-
mm/workingset.c | 6 +-
tools/cgroup/memcg_slabinfo.py | 226 ++++++++
tools/testing/selftests/cgroup/.gitignore | 1 +
tools/testing/selftests/cgroup/Makefile | 2 +
tools/testing/selftests/cgroup/test_kmem.c | 382 ++++++++++++
25 files changed, 1380 insertions(+), 1399 deletions(-)
create mode 100644 tools/cgroup/memcg_slabinfo.py
create mode 100644 tools/testing/selftests/cgroup/test_kmem.c

--
2.26.2


2020-06-23 17:45:03

by Roman Gushchin

[permalink] [raw]
Subject: [PATCH v7 16/19] mm: memcg/slab: remove redundant check in memcg_accumulate_slabinfo()

memcg_accumulate_slabinfo() is never called with a non-root kmem_cache as
a first argument, so the is_root_cache(s) check is redundant and can be
removed without any functional change.

Signed-off-by: Roman Gushchin <[email protected]>
Reviewed-by: Vlastimil Babka <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
---
mm/slab_common.c | 3 ---
1 file changed, 3 deletions(-)

diff --git a/mm/slab_common.c b/mm/slab_common.c
index 0618d3595c08..e546778030a2 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -1087,9 +1087,6 @@ memcg_accumulate_slabinfo(struct kmem_cache *s, struct slabinfo *info)
struct kmem_cache *c;
struct slabinfo sinfo;

- if (!is_root_cache(s))
- return;
-
c = memcg_cache(s);
if (c) {
memset(&sinfo, 0, sizeof(sinfo));
--
2.26.2

2020-06-23 17:45:04

by Roman Gushchin

[permalink] [raw]
Subject: [PATCH v7 05/19] mm: memcontrol: decouple reference counting from page accounting

From: Johannes Weiner <[email protected]>

The reference counting of a memcg is currently coupled directly to how
many 4k pages are charged to it. This doesn't work well with Roman's new
slab controller, which maintains pools of objects and doesn't want to keep
an extra balance sheet for the pages backing those objects.

This unusual refcounting design (reference counts usually track pointers
to an object) is only for historical reasons: memcg used to not take any
css references and simply stalled offlining until all charges had been
reparented and the page counters had dropped to zero. When we got rid of
the reparenting requirement, the simple mechanical translation was to take
a reference for every charge.

More historical context can be found in commit e8ea14cc6ead ("mm:
memcontrol: take a css reference for each charged page"), commit
64f219938941 ("mm: memcontrol: remove obsolete kmemcg pinning tricks") and
commit b2052564e66d ("mm: memcontrol: continue cache reclaim from offlined
groups").

The new slab controller exposes the limitations in this scheme, so let's
switch it to a more idiomatic reference counting model based on actual
kernel pointers to the memcg:

- The per-cpu stock holds a reference to the memcg its caching

- User pages hold a reference for their page->mem_cgroup. Transparent
huge pages will no longer acquire tail references in advance, we'll
get them if needed during the split.

- Kernel pages hold a reference for their page->mem_cgroup

- Pages allocated in the root cgroup will acquire and release css
references for simplicity. css_get() and css_put() optimize that.

- The current memcg_charge_slab() already hacked around the per-charge
references; this change gets rid of that as well.

Roman:
1) Rebased on top of the current mm tree: added css_get() in
mem_cgroup_charge(), dropped mem_cgroup_try_charge() part
2) I've reformatted commit references in the commit log to make
checkpatch.pl happy.

Signed-off-by: Johannes Weiner <[email protected]>
Signed-off-by: Roman Gushchin <[email protected]>
Acked-by: Roman Gushchin <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
---
mm/memcontrol.c | 37 +++++++++++++++++++++----------------
mm/slab.h | 2 --
2 files changed, 21 insertions(+), 18 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index a6216f7369b2..3e5597f8dec5 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2094,13 +2094,17 @@ static void drain_stock(struct memcg_stock_pcp *stock)
{
struct mem_cgroup *old = stock->cached;

+ if (!old)
+ return;
+
if (stock->nr_pages) {
page_counter_uncharge(&old->memory, stock->nr_pages);
if (do_memsw_account())
page_counter_uncharge(&old->memsw, stock->nr_pages);
- css_put_many(&old->css, stock->nr_pages);
stock->nr_pages = 0;
}
+
+ css_put(&old->css);
stock->cached = NULL;
}

@@ -2136,6 +2140,7 @@ static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
stock = this_cpu_ptr(&memcg_stock);
if (stock->cached != memcg) { /* reset if necessary */
drain_stock(stock);
+ css_get(&memcg->css);
stock->cached = memcg;
}
stock->nr_pages += nr_pages;
@@ -2594,12 +2599,10 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
page_counter_charge(&memcg->memory, nr_pages);
if (do_memsw_account())
page_counter_charge(&memcg->memsw, nr_pages);
- css_get_many(&memcg->css, nr_pages);

return 0;

done_restock:
- css_get_many(&memcg->css, batch);
if (batch > nr_pages)
refill_stock(memcg, batch - nr_pages);

@@ -2657,8 +2660,6 @@ static void cancel_charge(struct mem_cgroup *memcg, unsigned int nr_pages)
page_counter_uncharge(&memcg->memory, nr_pages);
if (do_memsw_account())
page_counter_uncharge(&memcg->memsw, nr_pages);
-
- css_put_many(&memcg->css, nr_pages);
}
#endif

@@ -2966,6 +2967,7 @@ int __memcg_kmem_charge_page(struct page *page, gfp_t gfp, int order)
if (!ret) {
page->mem_cgroup = memcg;
__SetPageKmemcg(page);
+ return 0;
}
}
css_put(&memcg->css);
@@ -2988,12 +2990,11 @@ void __memcg_kmem_uncharge_page(struct page *page, int order)
VM_BUG_ON_PAGE(mem_cgroup_is_root(memcg), page);
__memcg_kmem_uncharge(memcg, nr_pages);
page->mem_cgroup = NULL;
+ css_put(&memcg->css);

/* slab pages do not have PageKmemcg flag set */
if (PageKmemcg(page))
__ClearPageKmemcg(page);
-
- css_put_many(&memcg->css, nr_pages);
}
#endif /* CONFIG_MEMCG_KMEM */

@@ -3005,13 +3006,16 @@ void __memcg_kmem_uncharge_page(struct page *page, int order)
*/
void mem_cgroup_split_huge_fixup(struct page *head)
{
+ struct mem_cgroup *memcg = head->mem_cgroup;
int i;

if (mem_cgroup_disabled())
return;

- for (i = 1; i < HPAGE_PMD_NR; i++)
- head[i].mem_cgroup = head->mem_cgroup;
+ for (i = 1; i < HPAGE_PMD_NR; i++) {
+ css_get(&memcg->css);
+ head[i].mem_cgroup = memcg;
+ }
}
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */

@@ -5456,7 +5460,10 @@ static int mem_cgroup_move_account(struct page *page,
*/
smp_mb();

- page->mem_cgroup = to; /* caller should have done css_get */
+ css_get(&to->css);
+ css_put(&from->css);
+
+ page->mem_cgroup = to;

__unlock_page_memcg(from);

@@ -6506,6 +6513,7 @@ int mem_cgroup_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask)
if (ret)
goto out_put;

+ css_get(&memcg->css);
commit_charge(page, memcg);

local_irq_disable();
@@ -6560,9 +6568,6 @@ static void uncharge_batch(const struct uncharge_gather *ug)
__this_cpu_add(ug->memcg->vmstats_percpu->nr_page_events, ug->nr_pages);
memcg_check_events(ug->memcg, ug->dummy_page);
local_irq_restore(flags);
-
- if (!mem_cgroup_is_root(ug->memcg))
- css_put_many(&ug->memcg->css, ug->nr_pages);
}

static void uncharge_page(struct page *page, struct uncharge_gather *ug)
@@ -6600,6 +6605,7 @@ static void uncharge_page(struct page *page, struct uncharge_gather *ug)

ug->dummy_page = page;
page->mem_cgroup = NULL;
+ css_put(&ug->memcg->css);
}

static void uncharge_list(struct list_head *page_list)
@@ -6705,8 +6711,8 @@ void mem_cgroup_migrate(struct page *oldpage, struct page *newpage)
page_counter_charge(&memcg->memory, nr_pages);
if (do_memsw_account())
page_counter_charge(&memcg->memsw, nr_pages);
- css_get_many(&memcg->css, nr_pages);

+ css_get(&memcg->css);
commit_charge(newpage, memcg);

local_irq_save(flags);
@@ -6943,8 +6949,7 @@ void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
mem_cgroup_charge_statistics(memcg, page, -nr_entries);
memcg_check_events(memcg, page);

- if (!mem_cgroup_is_root(memcg))
- css_put_many(&memcg->css, nr_entries);
+ css_put(&memcg->css);
}

/**
diff --git a/mm/slab.h b/mm/slab.h
index cc6afddd5632..1e2d80991904 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -401,9 +401,7 @@ static __always_inline int memcg_charge_slab(struct page *page,
lruvec = mem_cgroup_lruvec(memcg, page_pgdat(page));
mod_lruvec_state(lruvec, cache_vmstat_idx(s), nr_pages << PAGE_SHIFT);

- /* transer try_charge() page references to kmem_cache */
percpu_ref_get_many(&s->memcg_params.refcnt, nr_pages);
- css_put_many(&memcg->css, nr_pages);
out:
css_put(&memcg->css);
return ret;
--
2.26.2

2020-06-23 17:45:07

by Roman Gushchin

[permalink] [raw]
Subject: [PATCH v7 13/19] mm: memcg/slab: simplify memcg cache creation

Because the number of non-root kmem_caches doesn't depend on the number of
memory cgroups anymore and is generally not very big, there is no more
need for a dedicated workqueue.

Also, as there is no more need to pass any arguments to the
memcg_create_kmem_cache() except the root kmem_cache, it's possible to
just embed the work structure into the kmem_cache and avoid the dynamic
allocation of the work structure.

This will also simplify the synchronization: for each root kmem_cache
there is only one work. So there will be no more concurrent attempts to
create a non-root kmem_cache for a root kmem_cache: the second and all
following attempts to queue the work will fail.

On the kmem_cache destruction path there is no more need to call the
expensive flush_workqueue() and wait for all pending works to be finished.
Instead, cancel_work_sync() can be used to cancel/wait for only one work.

Signed-off-by: Roman Gushchin <[email protected]>
Reviewed-by: Vlastimil Babka <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
---
include/linux/memcontrol.h | 1 -
mm/memcontrol.c | 48 +-------------------------------------
mm/slab.h | 2 ++
mm/slab_common.c | 22 +++++++++--------
4 files changed, 15 insertions(+), 58 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 11fd18b3d6c6..2ac84dcfc9e5 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1418,7 +1418,6 @@ int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size);
void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size);

extern struct static_key_false memcg_kmem_enabled_key;
-extern struct workqueue_struct *memcg_kmem_cache_wq;

extern int memcg_nr_cache_ids;
void memcg_get_cache_ids(void);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 417070fb0fd0..d23c2bdeea66 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -399,8 +399,6 @@ void memcg_put_cache_ids(void)
*/
DEFINE_STATIC_KEY_FALSE(memcg_kmem_enabled_key);
EXPORT_SYMBOL(memcg_kmem_enabled_key);
-
-struct workqueue_struct *memcg_kmem_cache_wq;
#endif

static int memcg_shrinker_map_size;
@@ -2902,39 +2900,6 @@ static void memcg_free_cache_id(int id)
ida_simple_remove(&memcg_cache_ida, id);
}

-struct memcg_kmem_cache_create_work {
- struct kmem_cache *cachep;
- struct work_struct work;
-};
-
-static void memcg_kmem_cache_create_func(struct work_struct *w)
-{
- struct memcg_kmem_cache_create_work *cw =
- container_of(w, struct memcg_kmem_cache_create_work, work);
- struct kmem_cache *cachep = cw->cachep;
-
- memcg_create_kmem_cache(cachep);
-
- kfree(cw);
-}
-
-/*
- * Enqueue the creation of a per-memcg kmem_cache.
- */
-static void memcg_schedule_kmem_cache_create(struct kmem_cache *cachep)
-{
- struct memcg_kmem_cache_create_work *cw;
-
- cw = kmalloc(sizeof(*cw), GFP_NOWAIT | __GFP_NOWARN);
- if (!cw)
- return;
-
- cw->cachep = cachep;
- INIT_WORK(&cw->work, memcg_kmem_cache_create_func);
-
- queue_work(memcg_kmem_cache_wq, &cw->work);
-}
-
/**
* memcg_kmem_get_cache: select memcg or root cache for allocation
* @cachep: the original global kmem cache
@@ -2951,7 +2916,7 @@ struct kmem_cache *memcg_kmem_get_cache(struct kmem_cache *cachep)

memcg_cachep = READ_ONCE(cachep->memcg_params.memcg_cache);
if (unlikely(!memcg_cachep)) {
- memcg_schedule_kmem_cache_create(cachep);
+ queue_work(system_wq, &cachep->memcg_params.work);
return cachep;
}

@@ -7026,17 +6991,6 @@ static int __init mem_cgroup_init(void)
{
int cpu, node;

-#ifdef CONFIG_MEMCG_KMEM
- /*
- * Kmem cache creation is mostly done with the slab_mutex held,
- * so use a workqueue with limited concurrency to avoid stalling
- * all worker threads in case lots of cgroups are created and
- * destroyed simultaneously.
- */
- memcg_kmem_cache_wq = alloc_workqueue("memcg_kmem_cache", 0, 1);
- BUG_ON(!memcg_kmem_cache_wq);
-#endif
-
cpuhp_setup_state_nocalls(CPUHP_MM_MEMCQ_DEAD, "mm/memctrl:dead", NULL,
memcg_hotplug_cpu_dead);

diff --git a/mm/slab.h b/mm/slab.h
index 673448e5cfb3..66482f8467e7 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -45,12 +45,14 @@ struct kmem_cache {
* @memcg_cache: pointer to memcg kmem cache, used by all non-root memory
* cgroups.
* @root_caches_node: list node for slab_root_caches list.
+ * @work: work struct used to create the non-root cache.
*/
struct memcg_cache_params {
struct kmem_cache *root_cache;

struct kmem_cache *memcg_cache;
struct list_head __root_caches_node;
+ struct work_struct work;
};
#endif /* CONFIG_SLOB */

diff --git a/mm/slab_common.c b/mm/slab_common.c
index 90c25c8db1a5..fe4f5be0e51f 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -132,10 +132,18 @@ int __kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t nr,

LIST_HEAD(slab_root_caches);

+static void memcg_kmem_cache_create_func(struct work_struct *work)
+{
+ struct kmem_cache *cachep = container_of(work, struct kmem_cache,
+ memcg_params.work);
+ memcg_create_kmem_cache(cachep);
+}
+
void slab_init_memcg_params(struct kmem_cache *s)
{
s->memcg_params.root_cache = NULL;
s->memcg_params.memcg_cache = NULL;
+ INIT_WORK(&s->memcg_params.work, memcg_kmem_cache_create_func);
}

static void init_memcg_params(struct kmem_cache *s,
@@ -584,15 +592,9 @@ static int shutdown_memcg_caches(struct kmem_cache *s)
return 0;
}

-static void flush_memcg_workqueue(struct kmem_cache *s)
+static void cancel_memcg_cache_creation(struct kmem_cache *s)
{
- /*
- * SLAB and SLUB create memcg kmem_caches through workqueue and SLUB
- * deactivates the memcg kmem_caches through workqueue. Make sure all
- * previous workitems on workqueue are processed.
- */
- if (likely(memcg_kmem_cache_wq))
- flush_workqueue(memcg_kmem_cache_wq);
+ cancel_work_sync(&s->memcg_params.work);
}
#else
static inline int shutdown_memcg_caches(struct kmem_cache *s)
@@ -600,7 +602,7 @@ static inline int shutdown_memcg_caches(struct kmem_cache *s)
return 0;
}

-static inline void flush_memcg_workqueue(struct kmem_cache *s)
+static inline void cancel_memcg_cache_creation(struct kmem_cache *s)
{
}
#endif /* CONFIG_MEMCG_KMEM */
@@ -619,7 +621,7 @@ void kmem_cache_destroy(struct kmem_cache *s)
if (unlikely(!s))
return;

- flush_memcg_workqueue(s);
+ cancel_memcg_cache_creation(s);

get_online_cpus();
get_online_mems();
--
2.26.2

2020-06-23 17:45:13

by Roman Gushchin

[permalink] [raw]
Subject: [PATCH v7 17/19] mm: memcg/slab: use a single set of kmem_caches for all allocations

Instead of having two sets of kmem_caches: one for system-wide and
non-accounted allocations and the second one shared by all accounted
allocations, we can use just one.

The idea is simple: space for obj_cgroup metadata can be allocated on
demand and filled only for accounted allocations.

It allows to remove a bunch of code which is required to handle kmem_cache
clones for accounted allocations. There is no more need to create them,
accumulate statistics, propagate attributes, etc. It's a quite
significant simplification.

Also, because the total number of slab_caches is reduced almost twice (not
all kmem_caches have a memcg clone), some additional memory savings are
expected. On my devvm it additionally saves about 3.5% of slab memory.

Suggested-by: Johannes Weiner <[email protected]>
Signed-off-by: Roman Gushchin <[email protected]>
Reviewed-by: Vlastimil Babka <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
---
include/linux/slab.h | 2 -
include/linux/slab_def.h | 3 -
include/linux/slub_def.h | 10 --
mm/memcontrol.c | 5 +-
mm/slab.c | 41 +------
mm/slab.h | 185 ++++++++-----------------------
mm/slab_common.c | 230 +--------------------------------------
mm/slub.c | 163 +--------------------------
8 files changed, 62 insertions(+), 577 deletions(-)

diff --git a/include/linux/slab.h b/include/linux/slab.h
index 8b1f91e320f9..24df2393ec03 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -155,8 +155,6 @@ struct kmem_cache *kmem_cache_create_usercopy(const char *name,
void kmem_cache_destroy(struct kmem_cache *);
int kmem_cache_shrink(struct kmem_cache *);

-void memcg_create_kmem_cache(struct kmem_cache *cachep);
-
/*
* Please use this macro to create slab caches. Simply specify the
* name of the structure and maybe some flags that are listed above.
diff --git a/include/linux/slab_def.h b/include/linux/slab_def.h
index ccda7b9669a5..9eb430c163c2 100644
--- a/include/linux/slab_def.h
+++ b/include/linux/slab_def.h
@@ -72,9 +72,6 @@ struct kmem_cache {
int obj_offset;
#endif /* CONFIG_DEBUG_SLAB */

-#ifdef CONFIG_MEMCG
- struct memcg_cache_params memcg_params;
-#endif
#ifdef CONFIG_KASAN
struct kasan_cache kasan_info;
#endif
diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
index f87302dcfe8c..1be0ed5befa1 100644
--- a/include/linux/slub_def.h
+++ b/include/linux/slub_def.h
@@ -108,17 +108,7 @@ struct kmem_cache {
struct list_head list; /* List of slab caches */
#ifdef CONFIG_SYSFS
struct kobject kobj; /* For sysfs */
- struct work_struct kobj_remove_work;
#endif
-#ifdef CONFIG_MEMCG
- struct memcg_cache_params memcg_params;
- /* For propagation, maximum size of a stored attr */
- unsigned int max_attr_size;
-#ifdef CONFIG_SYSFS
- struct kset *memcg_kset;
-#endif
-#endif
-
#ifdef CONFIG_SLAB_FREELIST_HARDENED
unsigned long random;
#endif
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c910fe326ca6..1b858cd18b52 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2826,7 +2826,10 @@ struct mem_cgroup *mem_cgroup_from_obj(void *p)

off = obj_to_index(page->slab_cache, page, p);
objcg = page_obj_cgroups(page)[off];
- return obj_cgroup_memcg(objcg);
+ if (objcg)
+ return obj_cgroup_memcg(objcg);
+
+ return NULL;
}

/* All other pages use page->mem_cgroup */
diff --git a/mm/slab.c b/mm/slab.c
index 23f0376f66ba..ebac5e400ad0 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -1369,11 +1369,7 @@ static struct page *kmem_getpages(struct kmem_cache *cachep, gfp_t flags,
return NULL;
}

- if (charge_slab_page(page, flags, cachep->gfporder, cachep)) {
- __free_pages(page, cachep->gfporder);
- return NULL;
- }
-
+ charge_slab_page(page, flags, cachep->gfporder, cachep);
__SetPageSlab(page);
/* Record if ALLOC_NO_WATERMARKS was set when allocating the slab */
if (sk_memalloc_socks() && page_is_pfmemalloc(page))
@@ -3792,8 +3788,8 @@ static int setup_kmem_cache_nodes(struct kmem_cache *cachep, gfp_t gfp)
}

/* Always called with the slab_mutex held */
-static int __do_tune_cpucache(struct kmem_cache *cachep, int limit,
- int batchcount, int shared, gfp_t gfp)
+static int do_tune_cpucache(struct kmem_cache *cachep, int limit,
+ int batchcount, int shared, gfp_t gfp)
{
struct array_cache __percpu *cpu_cache, *prev;
int cpu;
@@ -3838,30 +3834,6 @@ static int __do_tune_cpucache(struct kmem_cache *cachep, int limit,
return setup_kmem_cache_nodes(cachep, gfp);
}

-static int do_tune_cpucache(struct kmem_cache *cachep, int limit,
- int batchcount, int shared, gfp_t gfp)
-{
- int ret;
- struct kmem_cache *c;
-
- ret = __do_tune_cpucache(cachep, limit, batchcount, shared, gfp);
-
- if (slab_state < FULL)
- return ret;
-
- if ((ret < 0) || !is_root_cache(cachep))
- return ret;
-
- lockdep_assert_held(&slab_mutex);
- c = memcg_cache(cachep);
- if (c) {
- /* return value determined by the root cache only */
- __do_tune_cpucache(c, limit, batchcount, shared, gfp);
- }
-
- return ret;
-}
-
/* Called with slab_mutex held always */
static int enable_cpucache(struct kmem_cache *cachep, gfp_t gfp)
{
@@ -3874,13 +3846,6 @@ static int enable_cpucache(struct kmem_cache *cachep, gfp_t gfp)
if (err)
goto end;

- if (!is_root_cache(cachep)) {
- struct kmem_cache *root = memcg_root_cache(cachep);
- limit = root->limit;
- shared = root->shared;
- batchcount = root->batchcount;
- }
-
if (limit && shared && batchcount)
goto skip_setup;
/*
diff --git a/mm/slab.h b/mm/slab.h
index 46ac1de9a0b7..893d39739ed6 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -30,28 +30,6 @@ struct kmem_cache {
struct list_head list; /* List of all slab caches on the system */
};

-#else /* !CONFIG_SLOB */
-
-/*
- * This is the main placeholder for memcg-related information in kmem caches.
- * Both the root cache and the child cache will have it. Some fields are used
- * in both cases, other are specific to root caches.
- *
- * @root_cache: Common to root and child caches. NULL for root, pointer to
- * the root cache for children.
- *
- * The following fields are specific to root caches.
- *
- * @memcg_cache: pointer to memcg kmem cache, used by all non-root memory
- * cgroups.
- * @work: work struct used to create the non-root cache.
- */
-struct memcg_cache_params {
- struct kmem_cache *root_cache;
-
- struct kmem_cache *memcg_cache;
- struct work_struct work;
-};
#endif /* CONFIG_SLOB */

#ifdef CONFIG_SLAB
@@ -195,7 +173,6 @@ int __kmem_cache_shutdown(struct kmem_cache *);
void __kmem_cache_release(struct kmem_cache *);
int __kmem_cache_shrink(struct kmem_cache *);
void slab_kmem_cache_release(struct kmem_cache *);
-void kmem_cache_shrink_all(struct kmem_cache *s);

struct seq_file;
struct file;
@@ -262,43 +239,6 @@ static inline bool kmem_cache_debug_flags(struct kmem_cache *s, slab_flags_t fla
}

#ifdef CONFIG_MEMCG_KMEM
-static inline bool is_root_cache(struct kmem_cache *s)
-{
- return !s->memcg_params.root_cache;
-}
-
-static inline bool slab_equal_or_root(struct kmem_cache *s,
- struct kmem_cache *p)
-{
- return p == s || p == s->memcg_params.root_cache;
-}
-
-/*
- * We use suffixes to the name in memcg because we can't have caches
- * created in the system with the same name. But when we print them
- * locally, better refer to them with the base name
- */
-static inline const char *cache_name(struct kmem_cache *s)
-{
- if (!is_root_cache(s))
- s = s->memcg_params.root_cache;
- return s->name;
-}
-
-static inline struct kmem_cache *memcg_root_cache(struct kmem_cache *s)
-{
- if (is_root_cache(s))
- return s;
- return s->memcg_params.root_cache;
-}
-
-static inline struct kmem_cache *memcg_cache(struct kmem_cache *s)
-{
- if (is_root_cache(s))
- return s->memcg_params.memcg_cache;
- return NULL;
-}
-
static inline struct obj_cgroup **page_obj_cgroups(struct page *page)
{
/*
@@ -327,8 +267,12 @@ static inline int memcg_alloc_page_obj_cgroups(struct page *page,
if (!vec)
return -ENOMEM;

- kmemleak_not_leak(vec);
- page->obj_cgroups = (struct obj_cgroup **) ((unsigned long)vec | 0x1UL);
+ if (cmpxchg(&page->obj_cgroups, NULL,
+ (struct obj_cgroup **) ((unsigned long)vec | 0x1UL)))
+ kfree(vec);
+ else
+ kmemleak_not_leak(vec);
+
return 0;
}

@@ -347,38 +291,25 @@ static inline size_t obj_full_size(struct kmem_cache *s)
return s->size + sizeof(struct obj_cgroup *);
}

-static inline struct kmem_cache *memcg_slab_pre_alloc_hook(struct kmem_cache *s,
- struct obj_cgroup **objcgp,
- size_t objects, gfp_t flags)
+static inline struct obj_cgroup *memcg_slab_pre_alloc_hook(struct kmem_cache *s,
+ size_t objects,
+ gfp_t flags)
{
- struct kmem_cache *cachep;
struct obj_cgroup *objcg;

if (memcg_kmem_bypass())
- return s;
-
- cachep = READ_ONCE(s->memcg_params.memcg_cache);
- if (unlikely(!cachep)) {
- /*
- * If memcg cache does not exist yet, we schedule it's
- * asynchronous creation and let the current allocation
- * go through with the root cache.
- */
- queue_work(system_wq, &s->memcg_params.work);
- return s;
- }
+ return NULL;

objcg = get_obj_cgroup_from_current();
if (!objcg)
- return s;
+ return NULL;

if (obj_cgroup_charge(objcg, flags, objects * obj_full_size(s))) {
obj_cgroup_put(objcg);
- cachep = NULL;
+ return NULL;
}

- *objcgp = objcg;
- return cachep;
+ return objcg;
}

static inline void mod_objcg_state(struct obj_cgroup *objcg,
@@ -397,15 +328,27 @@ static inline void mod_objcg_state(struct obj_cgroup *objcg,

static inline void memcg_slab_post_alloc_hook(struct kmem_cache *s,
struct obj_cgroup *objcg,
- size_t size, void **p)
+ gfp_t flags, size_t size,
+ void **p)
{
struct page *page;
unsigned long off;
size_t i;

+ if (!objcg)
+ return;
+
+ flags &= ~__GFP_ACCOUNT;
for (i = 0; i < size; i++) {
if (likely(p[i])) {
page = virt_to_head_page(p[i]);
+
+ if (!page_has_obj_cgroups(page) &&
+ memcg_alloc_page_obj_cgroups(page, s, flags)) {
+ obj_cgroup_uncharge(objcg, obj_full_size(s));
+ continue;
+ }
+
off = obj_to_index(s, page, p[i]);
obj_cgroup_get(objcg);
page_obj_cgroups(page)[off] = objcg;
@@ -424,13 +367,19 @@ static inline void memcg_slab_free_hook(struct kmem_cache *s, struct page *page,
struct obj_cgroup *objcg;
unsigned int off;

- if (!memcg_kmem_enabled() || is_root_cache(s))
+ if (!memcg_kmem_enabled())
+ return;
+
+ if (!page_has_obj_cgroups(page))
return;

off = obj_to_index(s, page, p);
objcg = page_obj_cgroups(page)[off];
page_obj_cgroups(page)[off] = NULL;

+ if (!objcg)
+ return;
+
obj_cgroup_uncharge(objcg, obj_full_size(s));
mod_objcg_state(objcg, page_pgdat(page), cache_vmstat_idx(s),
-obj_full_size(s));
@@ -438,35 +387,7 @@ static inline void memcg_slab_free_hook(struct kmem_cache *s, struct page *page,
obj_cgroup_put(objcg);
}

-extern void slab_init_memcg_params(struct kmem_cache *);
-
#else /* CONFIG_MEMCG_KMEM */
-static inline bool is_root_cache(struct kmem_cache *s)
-{
- return true;
-}
-
-static inline bool slab_equal_or_root(struct kmem_cache *s,
- struct kmem_cache *p)
-{
- return s == p;
-}
-
-static inline const char *cache_name(struct kmem_cache *s)
-{
- return s->name;
-}
-
-static inline struct kmem_cache *memcg_root_cache(struct kmem_cache *s)
-{
- return s;
-}
-
-static inline struct kmem_cache *memcg_cache(struct kmem_cache *s)
-{
- return NULL;
-}
-
static inline bool page_has_obj_cgroups(struct page *page)
{
return false;
@@ -487,16 +408,17 @@ static inline void memcg_free_page_obj_cgroups(struct page *page)
{
}

-static inline struct kmem_cache *memcg_slab_pre_alloc_hook(struct kmem_cache *s,
- struct obj_cgroup **objcgp,
- size_t objects, gfp_t flags)
+static inline struct obj_cgroup *memcg_slab_pre_alloc_hook(struct kmem_cache *s,
+ size_t objects,
+ gfp_t flags)
{
return NULL;
}

static inline void memcg_slab_post_alloc_hook(struct kmem_cache *s,
struct obj_cgroup *objcg,
- size_t size, void **p)
+ gfp_t flags, size_t size,
+ void **p)
{
}

@@ -504,11 +426,6 @@ static inline void memcg_slab_free_hook(struct kmem_cache *s, struct page *page,
void *p)
{
}
-
-static inline void slab_init_memcg_params(struct kmem_cache *s)
-{
-}
-
#endif /* CONFIG_MEMCG_KMEM */

static inline struct kmem_cache *virt_to_cache(const void *obj)
@@ -522,27 +439,18 @@ static inline struct kmem_cache *virt_to_cache(const void *obj)
return page->slab_cache;
}

-static __always_inline int charge_slab_page(struct page *page,
- gfp_t gfp, int order,
- struct kmem_cache *s)
+static __always_inline void charge_slab_page(struct page *page,
+ gfp_t gfp, int order,
+ struct kmem_cache *s)
{
- if (memcg_kmem_enabled() && !is_root_cache(s)) {
- int ret;
-
- ret = memcg_alloc_page_obj_cgroups(page, s, gfp);
- if (ret)
- return ret;
- }
-
mod_node_page_state(page_pgdat(page), cache_vmstat_idx(s),
PAGE_SIZE << order);
- return 0;
}

static __always_inline void uncharge_slab_page(struct page *page, int order,
struct kmem_cache *s)
{
- if (memcg_kmem_enabled() && !is_root_cache(s))
+ if (memcg_kmem_enabled())
memcg_free_page_obj_cgroups(page);

mod_node_page_state(page_pgdat(page), cache_vmstat_idx(s),
@@ -554,12 +462,11 @@ static inline struct kmem_cache *cache_from_obj(struct kmem_cache *s, void *x)
struct kmem_cache *cachep;

if (!IS_ENABLED(CONFIG_SLAB_FREELIST_HARDENED) &&
- !memcg_kmem_enabled() &&
!kmem_cache_debug_flags(s, SLAB_CONSISTENCY_CHECKS))
return s;

cachep = virt_to_cache(x);
- if (WARN(cachep && !slab_equal_or_root(cachep, s),
+ if (WARN(cachep && cachep != s,
"%s: Wrong slab cache. %s but object is from %s\n",
__func__, s->name, cachep->name))
print_tracking(cachep, x);
@@ -612,7 +519,7 @@ static inline struct kmem_cache *slab_pre_alloc_hook(struct kmem_cache *s,

if (memcg_kmem_enabled() &&
((flags & __GFP_ACCOUNT) || (s->flags & SLAB_ACCOUNT)))
- return memcg_slab_pre_alloc_hook(s, objcgp, size, flags);
+ *objcgp = memcg_slab_pre_alloc_hook(s, size, flags);

return s;
}
@@ -631,8 +538,8 @@ static inline void slab_post_alloc_hook(struct kmem_cache *s,
s->flags, flags);
}

- if (memcg_kmem_enabled() && !is_root_cache(s))
- memcg_slab_post_alloc_hook(s, objcg, size, p);
+ if (memcg_kmem_enabled())
+ memcg_slab_post_alloc_hook(s, objcg, flags, size, p);
}

#ifndef CONFIG_SLOB
diff --git a/mm/slab_common.c b/mm/slab_common.c
index e546778030a2..a143a8c8f874 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -128,36 +128,6 @@ int __kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t nr,
return i;
}

-#ifdef CONFIG_MEMCG_KMEM
-static void memcg_kmem_cache_create_func(struct work_struct *work)
-{
- struct kmem_cache *cachep = container_of(work, struct kmem_cache,
- memcg_params.work);
- memcg_create_kmem_cache(cachep);
-}
-
-void slab_init_memcg_params(struct kmem_cache *s)
-{
- s->memcg_params.root_cache = NULL;
- s->memcg_params.memcg_cache = NULL;
- INIT_WORK(&s->memcg_params.work, memcg_kmem_cache_create_func);
-}
-
-static void init_memcg_params(struct kmem_cache *s,
- struct kmem_cache *root_cache)
-{
- if (root_cache)
- s->memcg_params.root_cache = root_cache;
- else
- slab_init_memcg_params(s);
-}
-#else
-static inline void init_memcg_params(struct kmem_cache *s,
- struct kmem_cache *root_cache)
-{
-}
-#endif /* CONFIG_MEMCG_KMEM */
-
/*
* Figure out what the alignment of the objects will be given a set of
* flags, a user specified alignment and the size of the objects.
@@ -195,9 +165,6 @@ int slab_unmergeable(struct kmem_cache *s)
if (slab_nomerge || (s->flags & SLAB_NEVER_MERGE))
return 1;

- if (!is_root_cache(s))
- return 1;
-
if (s->ctor)
return 1;

@@ -284,7 +251,6 @@ static struct kmem_cache *create_cache(const char *name,
s->useroffset = useroffset;
s->usersize = usersize;

- init_memcg_params(s, root_cache);
err = __kmem_cache_create(s, flags);
if (err)
goto out_free_cache;
@@ -342,7 +308,6 @@ kmem_cache_create_usercopy(const char *name,

get_online_cpus();
get_online_mems();
- memcg_get_cache_ids();

mutex_lock(&slab_mutex);

@@ -392,7 +357,6 @@ kmem_cache_create_usercopy(const char *name,
out_unlock:
mutex_unlock(&slab_mutex);

- memcg_put_cache_ids();
put_online_mems();
put_online_cpus();

@@ -505,87 +469,6 @@ static int shutdown_cache(struct kmem_cache *s)
return 0;
}

-#ifdef CONFIG_MEMCG_KMEM
-/*
- * memcg_create_kmem_cache - Create a cache for non-root memory cgroups.
- * @root_cache: The parent of the new cache.
- *
- * This function attempts to create a kmem cache that will serve allocation
- * requests going all non-root memory cgroups to @root_cache. The new cache
- * inherits properties from its parent.
- */
-void memcg_create_kmem_cache(struct kmem_cache *root_cache)
-{
- struct kmem_cache *s = NULL;
- char *cache_name;
-
- get_online_cpus();
- get_online_mems();
-
- mutex_lock(&slab_mutex);
-
- if (root_cache->memcg_params.memcg_cache)
- goto out_unlock;
-
- cache_name = kasprintf(GFP_KERNEL, "%s-memcg", root_cache->name);
- if (!cache_name)
- goto out_unlock;
-
- s = create_cache(cache_name, root_cache->object_size,
- root_cache->align,
- root_cache->flags & CACHE_CREATE_MASK,
- root_cache->useroffset, root_cache->usersize,
- root_cache->ctor, root_cache);
- /*
- * If we could not create a memcg cache, do not complain, because
- * that's not critical at all as we can always proceed with the root
- * cache.
- */
- if (IS_ERR(s)) {
- kfree(cache_name);
- goto out_unlock;
- }
-
- /*
- * Since readers won't lock (see memcg_slab_pre_alloc_hook()), we need a
- * barrier here to ensure nobody will see the kmem_cache partially
- * initialized.
- */
- smp_wmb();
- root_cache->memcg_params.memcg_cache = s;
-
-out_unlock:
- mutex_unlock(&slab_mutex);
-
- put_online_mems();
- put_online_cpus();
-}
-
-static int shutdown_memcg_caches(struct kmem_cache *s)
-{
- BUG_ON(!is_root_cache(s));
-
- if (s->memcg_params.memcg_cache)
- WARN_ON(shutdown_cache(s->memcg_params.memcg_cache));
-
- return 0;
-}
-
-static void cancel_memcg_cache_creation(struct kmem_cache *s)
-{
- cancel_work_sync(&s->memcg_params.work);
-}
-#else
-static inline int shutdown_memcg_caches(struct kmem_cache *s)
-{
- return 0;
-}
-
-static inline void cancel_memcg_cache_creation(struct kmem_cache *s)
-{
-}
-#endif /* CONFIG_MEMCG_KMEM */
-
void slab_kmem_cache_release(struct kmem_cache *s)
{
__kmem_cache_release(s);
@@ -600,8 +483,6 @@ void kmem_cache_destroy(struct kmem_cache *s)
if (unlikely(!s))
return;

- cancel_memcg_cache_creation(s);
-
get_online_cpus();
get_online_mems();

@@ -611,10 +492,7 @@ void kmem_cache_destroy(struct kmem_cache *s)
if (s->refcount)
goto out_unlock;

- err = shutdown_memcg_caches(s);
- if (!err)
- err = shutdown_cache(s);
-
+ err = shutdown_cache(s);
if (err) {
pr_err("kmem_cache_destroy %s: Slab cache still has objects\n",
s->name);
@@ -651,33 +529,6 @@ int kmem_cache_shrink(struct kmem_cache *cachep)
}
EXPORT_SYMBOL(kmem_cache_shrink);

-/**
- * kmem_cache_shrink_all - shrink root and memcg caches
- * @s: The cache pointer
- */
-void kmem_cache_shrink_all(struct kmem_cache *s)
-{
- struct kmem_cache *c;
-
- if (!IS_ENABLED(CONFIG_MEMCG_KMEM) || !is_root_cache(s)) {
- kmem_cache_shrink(s);
- return;
- }
-
- get_online_cpus();
- get_online_mems();
- kasan_cache_shrink(s);
- __kmem_cache_shrink(s);
-
- c = memcg_cache(s);
- if (c) {
- kasan_cache_shrink(c);
- __kmem_cache_shrink(c);
- }
- put_online_mems();
- put_online_cpus();
-}
-
bool slab_is_available(void)
{
return slab_state >= UP;
@@ -706,8 +557,6 @@ void __init create_boot_cache(struct kmem_cache *s, const char *name,
s->useroffset = useroffset;
s->usersize = usersize;

- slab_init_memcg_params(s);
-
err = __kmem_cache_create(s, flags);

if (err)
@@ -1081,25 +930,6 @@ void slab_stop(struct seq_file *m, void *p)
mutex_unlock(&slab_mutex);
}

-static void
-memcg_accumulate_slabinfo(struct kmem_cache *s, struct slabinfo *info)
-{
- struct kmem_cache *c;
- struct slabinfo sinfo;
-
- c = memcg_cache(s);
- if (c) {
- memset(&sinfo, 0, sizeof(sinfo));
- get_slabinfo(c, &sinfo);
-
- info->active_slabs += sinfo.active_slabs;
- info->num_slabs += sinfo.num_slabs;
- info->shared_avail += sinfo.shared_avail;
- info->active_objs += sinfo.active_objs;
- info->num_objs += sinfo.num_objs;
- }
-}
-
static void cache_show(struct kmem_cache *s, struct seq_file *m)
{
struct slabinfo sinfo;
@@ -1107,10 +937,8 @@ static void cache_show(struct kmem_cache *s, struct seq_file *m)
memset(&sinfo, 0, sizeof(sinfo));
get_slabinfo(s, &sinfo);

- memcg_accumulate_slabinfo(s, &sinfo);
-
seq_printf(m, "%-17s %6lu %6lu %6u %4u %4d",
- cache_name(s), sinfo.active_objs, sinfo.num_objs, s->size,
+ s->name, sinfo.active_objs, sinfo.num_objs, s->size,
sinfo.objects_per_slab, (1 << sinfo.cache_order));

seq_printf(m, " : tunables %4u %4u %4u",
@@ -1127,8 +955,7 @@ static int slab_show(struct seq_file *m, void *p)

if (p == slab_caches.next)
print_slabinfo_header(m);
- if (is_root_cache(s))
- cache_show(s, m);
+ cache_show(s, m);
return 0;
}

@@ -1153,13 +980,13 @@ void dump_unreclaimable_slab(void)
pr_info("Name Used Total\n");

list_for_each_entry_safe(s, s2, &slab_caches, list) {
- if (!is_root_cache(s) || (s->flags & SLAB_RECLAIM_ACCOUNT))
+ if (s->flags & SLAB_RECLAIM_ACCOUNT)
continue;

get_slabinfo(s, &sinfo);

if (sinfo.num_objs > 0)
- pr_info("%-17s %10luKB %10luKB\n", cache_name(s),
+ pr_info("%-17s %10luKB %10luKB\n", s->name,
(sinfo.active_objs * s->size) / 1024,
(sinfo.num_objs * s->size) / 1024);
}
@@ -1218,53 +1045,6 @@ static int __init slab_proc_init(void)
}
module_init(slab_proc_init);

-#if defined(CONFIG_DEBUG_FS) && defined(CONFIG_MEMCG_KMEM)
-/*
- * Display information about kmem caches that have memcg cache.
- */
-static int memcg_slabinfo_show(struct seq_file *m, void *unused)
-{
- struct kmem_cache *s, *c;
- struct slabinfo sinfo;
-
- mutex_lock(&slab_mutex);
- seq_puts(m, "# <name> <css_id[:dead|deact]> <active_objs> <num_objs>");
- seq_puts(m, " <active_slabs> <num_slabs>\n");
- list_for_each_entry(s, &slab_caches, list) {
- /*
- * Skip kmem caches that don't have the memcg cache.
- */
- if (!s->memcg_params.memcg_cache)
- continue;
-
- memset(&sinfo, 0, sizeof(sinfo));
- get_slabinfo(s, &sinfo);
- seq_printf(m, "%-17s root %6lu %6lu %6lu %6lu\n",
- cache_name(s), sinfo.active_objs, sinfo.num_objs,
- sinfo.active_slabs, sinfo.num_slabs);
-
- c = s->memcg_params.memcg_cache;
- memset(&sinfo, 0, sizeof(sinfo));
- get_slabinfo(c, &sinfo);
- seq_printf(m, "%-17s %4d %6lu %6lu %6lu %6lu\n",
- cache_name(c), root_mem_cgroup->css.id,
- sinfo.active_objs, sinfo.num_objs,
- sinfo.active_slabs, sinfo.num_slabs);
- }
- mutex_unlock(&slab_mutex);
- return 0;
-}
-DEFINE_SHOW_ATTRIBUTE(memcg_slabinfo);
-
-static int __init memcg_slabinfo_init(void)
-{
- debugfs_create_file("memcg_slabinfo", S_IFREG | S_IRUGO,
- NULL, NULL, &memcg_slabinfo_fops);
- return 0;
-}
-
-late_initcall(memcg_slabinfo_init);
-#endif /* CONFIG_DEBUG_FS && CONFIG_MEMCG_KMEM */
#endif /* CONFIG_SLAB || CONFIG_SLUB_DEBUG */

static __always_inline void *__do_krealloc(const void *p, size_t new_size,
diff --git a/mm/slub.c b/mm/slub.c
index 7a5d6861b088..0a2a37fdcc52 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -218,14 +218,10 @@ enum track_item { TRACK_ALLOC, TRACK_FREE };
#ifdef CONFIG_SYSFS
static int sysfs_slab_add(struct kmem_cache *);
static int sysfs_slab_alias(struct kmem_cache *, const char *);
-static void memcg_propagate_slab_attrs(struct kmem_cache *s);
-static void sysfs_slab_remove(struct kmem_cache *s);
#else
static inline int sysfs_slab_add(struct kmem_cache *s) { return 0; }
static inline int sysfs_slab_alias(struct kmem_cache *s, const char *p)
{ return 0; }
-static inline void memcg_propagate_slab_attrs(struct kmem_cache *s) { }
-static inline void sysfs_slab_remove(struct kmem_cache *s) { }
#endif

static inline void stat(const struct kmem_cache *s, enum stat_item si)
@@ -1623,10 +1619,8 @@ static inline struct page *alloc_slab_page(struct kmem_cache *s,
else
page = __alloc_pages_node(node, flags, order);

- if (page && charge_slab_page(page, flags, order, s)) {
- __free_pages(page, order);
- page = NULL;
- }
+ if (page)
+ charge_slab_page(page, flags, order, s);

return page;
}
@@ -3924,7 +3918,6 @@ int __kmem_cache_shutdown(struct kmem_cache *s)
if (n->nr_partial || slabs_node(s, node))
return 1;
}
- sysfs_slab_remove(s);
return 0;
}

@@ -4362,7 +4355,6 @@ static struct kmem_cache * __init bootstrap(struct kmem_cache *static_cache)
p->slab_cache = s;
#endif
}
- slab_init_memcg_params(s);
list_add(&s->list, &slab_caches);
return s;
}
@@ -4418,7 +4410,7 @@ struct kmem_cache *
__kmem_cache_alias(const char *name, unsigned int size, unsigned int align,
slab_flags_t flags, void (*ctor)(void *))
{
- struct kmem_cache *s, *c;
+ struct kmem_cache *s;

s = find_mergeable(size, align, flags, name, ctor);
if (s) {
@@ -4431,12 +4423,6 @@ __kmem_cache_alias(const char *name, unsigned int size, unsigned int align,
s->object_size = max(s->object_size, size);
s->inuse = max(s->inuse, ALIGN(size, sizeof(void *)));

- c = memcg_cache(s);
- if (c) {
- c->object_size = s->object_size;
- c->inuse = max(c->inuse, ALIGN(size, sizeof(void *)));
- }
-
if (sysfs_slab_alias(s, name)) {
s->refcount--;
s = NULL;
@@ -4458,7 +4444,6 @@ int __kmem_cache_create(struct kmem_cache *s, slab_flags_t flags)
if (slab_state <= UP)
return 0;

- memcg_propagate_slab_attrs(s);
err = sysfs_slab_add(s);
if (err)
__kmem_cache_release(s);
@@ -5316,7 +5301,7 @@ static ssize_t shrink_store(struct kmem_cache *s,
const char *buf, size_t length)
{
if (buf[0] == '1')
- kmem_cache_shrink_all(s);
+ kmem_cache_shrink(s);
else
return -EINVAL;
return length;
@@ -5540,99 +5525,9 @@ static ssize_t slab_attr_store(struct kobject *kobj,
return -EIO;

err = attribute->store(s, buf, len);
-#ifdef CONFIG_MEMCG
- if (slab_state >= FULL && err >= 0 && is_root_cache(s)) {
- struct kmem_cache *c;
-
- mutex_lock(&slab_mutex);
- if (s->max_attr_size < len)
- s->max_attr_size = len;
-
- /*
- * This is a best effort propagation, so this function's return
- * value will be determined by the parent cache only. This is
- * basically because not all attributes will have a well
- * defined semantics for rollbacks - most of the actions will
- * have permanent effects.
- *
- * Returning the error value of any of the children that fail
- * is not 100 % defined, in the sense that users seeing the
- * error code won't be able to know anything about the state of
- * the cache.
- *
- * Only returning the error code for the parent cache at least
- * has well defined semantics. The cache being written to
- * directly either failed or succeeded, in which case we loop
- * through the descendants with best-effort propagation.
- */
- c = memcg_cache(s);
- if (c)
- attribute->store(c, buf, len);
- mutex_unlock(&slab_mutex);
- }
-#endif
return err;
}

-static void memcg_propagate_slab_attrs(struct kmem_cache *s)
-{
-#ifdef CONFIG_MEMCG
- int i;
- char *buffer = NULL;
- struct kmem_cache *root_cache;
-
- if (is_root_cache(s))
- return;
-
- root_cache = s->memcg_params.root_cache;
-
- /*
- * This mean this cache had no attribute written. Therefore, no point
- * in copying default values around
- */
- if (!root_cache->max_attr_size)
- return;
-
- for (i = 0; i < ARRAY_SIZE(slab_attrs); i++) {
- char mbuf[64];
- char *buf;
- struct slab_attribute *attr = to_slab_attr(slab_attrs[i]);
- ssize_t len;
-
- if (!attr || !attr->store || !attr->show)
- continue;
-
- /*
- * It is really bad that we have to allocate here, so we will
- * do it only as a fallback. If we actually allocate, though,
- * we can just use the allocated buffer until the end.
- *
- * Most of the slub attributes will tend to be very small in
- * size, but sysfs allows buffers up to a page, so they can
- * theoretically happen.
- */
- if (buffer)
- buf = buffer;
- else if (root_cache->max_attr_size < ARRAY_SIZE(mbuf) &&
- !IS_ENABLED(CONFIG_SLUB_STATS))
- buf = mbuf;
- else {
- buffer = (char *) get_zeroed_page(GFP_KERNEL);
- if (WARN_ON(!buffer))
- continue;
- buf = buffer;
- }
-
- len = attr->show(root_cache, buf);
- if (len > 0)
- attr->store(s, buf, len);
- }
-
- if (buffer)
- free_page((unsigned long)buffer);
-#endif /* CONFIG_MEMCG */
-}
-
static void kmem_cache_release(struct kobject *k)
{
slab_kmem_cache_release(to_slab(k));
@@ -5652,10 +5547,6 @@ static struct kset *slab_kset;

static inline struct kset *cache_kset(struct kmem_cache *s)
{
-#ifdef CONFIG_MEMCG
- if (!is_root_cache(s))
- return s->memcg_params.root_cache->memcg_kset;
-#endif
return slab_kset;
}

@@ -5698,27 +5589,6 @@ static char *create_unique_id(struct kmem_cache *s)
return name;
}

-static void sysfs_slab_remove_workfn(struct work_struct *work)
-{
- struct kmem_cache *s =
- container_of(work, struct kmem_cache, kobj_remove_work);
-
- if (!s->kobj.state_in_sysfs)
- /*
- * For a memcg cache, this may be called during
- * deactivation and again on shutdown. Remove only once.
- * A cache is never shut down before deactivation is
- * complete, so no need to worry about synchronization.
- */
- goto out;
-
-#ifdef CONFIG_MEMCG
- kset_unregister(s->memcg_kset);
-#endif
-out:
- kobject_put(&s->kobj);
-}
-
static int sysfs_slab_add(struct kmem_cache *s)
{
int err;
@@ -5726,8 +5596,6 @@ static int sysfs_slab_add(struct kmem_cache *s)
struct kset *kset = cache_kset(s);
int unmergeable = slab_unmergeable(s);

- INIT_WORK(&s->kobj_remove_work, sysfs_slab_remove_workfn);
-
if (!kset) {
kobject_init(&s->kobj, &slab_ktype);
return 0;
@@ -5764,16 +5632,6 @@ static int sysfs_slab_add(struct kmem_cache *s)
if (err)
goto out_del_kobj;

-#ifdef CONFIG_MEMCG
- if (is_root_cache(s) && memcg_sysfs_enabled) {
- s->memcg_kset = kset_create_and_add("cgroup", NULL, &s->kobj);
- if (!s->memcg_kset) {
- err = -ENOMEM;
- goto out_del_kobj;
- }
- }
-#endif
-
if (!unmergeable) {
/* Setup first alias */
sysfs_slab_alias(s, s->name);
@@ -5787,19 +5645,6 @@ static int sysfs_slab_add(struct kmem_cache *s)
goto out;
}

-static void sysfs_slab_remove(struct kmem_cache *s)
-{
- if (slab_state < FULL)
- /*
- * Sysfs has not been setup yet so no need to remove the
- * cache from sysfs.
- */
- return;
-
- kobject_get(&s->kobj);
- schedule_work(&s->kobj_remove_work);
-}
-
void sysfs_slab_unlink(struct kmem_cache *s)
{
if (slab_state >= FULL)
--
2.26.2

2020-06-23 17:45:17

by Roman Gushchin

[permalink] [raw]
Subject: [PATCH v7 15/19] mm: memcg/slab: deprecate slab_root_caches

Currently there are two lists of kmem_caches:
1) slab_caches, which contains all kmem_caches,
2) slab_root_caches, which contains only root kmem_caches.

And there is some preprocessor magic to have a single list if
CONFIG_MEMCG_KMEM isn't enabled.

It was required earlier because the number of non-root kmem_caches was
proportional to the number of memory cgroups and could reach really big
values. Now, when it cannot exceed the number of root kmem_caches, there
is really no reason to maintain two lists.

We never iterate over the slab_root_caches list on any hot paths, so it's
perfectly fine to iterate over slab_caches and filter out non-root
kmem_caches.

It allows to remove a lot of config-dependent code and two pointers from
the kmem_cache structure.

Signed-off-by: Roman Gushchin <[email protected]>
Reviewed-by: Vlastimil Babka <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
---
mm/slab.c | 1 -
mm/slab.h | 17 -----------------
mm/slab_common.c | 37 ++++++++-----------------------------
mm/slub.c | 1 -
4 files changed, 8 insertions(+), 48 deletions(-)

diff --git a/mm/slab.c b/mm/slab.c
index 84e218fd0bcf..23f0376f66ba 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -1239,7 +1239,6 @@ void __init kmem_cache_init(void)
nr_node_ids * sizeof(struct kmem_cache_node *),
SLAB_HWCACHE_ALIGN, 0, 0);
list_add(&kmem_cache->list, &slab_caches);
- memcg_link_cache(kmem_cache);
slab_state = PARTIAL;

/*
diff --git a/mm/slab.h b/mm/slab.h
index d47430e97ff1..46ac1de9a0b7 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -44,14 +44,12 @@ struct kmem_cache {
*
* @memcg_cache: pointer to memcg kmem cache, used by all non-root memory
* cgroups.
- * @root_caches_node: list node for slab_root_caches list.
* @work: work struct used to create the non-root cache.
*/
struct memcg_cache_params {
struct kmem_cache *root_cache;

struct kmem_cache *memcg_cache;
- struct list_head __root_caches_node;
struct work_struct work;
};
#endif /* CONFIG_SLOB */
@@ -264,11 +262,6 @@ static inline bool kmem_cache_debug_flags(struct kmem_cache *s, slab_flags_t fla
}

#ifdef CONFIG_MEMCG_KMEM
-
-/* List of all root caches. */
-extern struct list_head slab_root_caches;
-#define root_caches_node memcg_params.__root_caches_node
-
static inline bool is_root_cache(struct kmem_cache *s)
{
return !s->memcg_params.root_cache;
@@ -446,14 +439,8 @@ static inline void memcg_slab_free_hook(struct kmem_cache *s, struct page *page,
}

extern void slab_init_memcg_params(struct kmem_cache *);
-extern void memcg_link_cache(struct kmem_cache *s);

#else /* CONFIG_MEMCG_KMEM */
-
-/* If !memcg, all caches are root. */
-#define slab_root_caches slab_caches
-#define root_caches_node list
-
static inline bool is_root_cache(struct kmem_cache *s)
{
return true;
@@ -522,10 +509,6 @@ static inline void slab_init_memcg_params(struct kmem_cache *s)
{
}

-static inline void memcg_link_cache(struct kmem_cache *s)
-{
-}
-
#endif /* CONFIG_MEMCG_KMEM */

static inline struct kmem_cache *virt_to_cache(const void *obj)
diff --git a/mm/slab_common.c b/mm/slab_common.c
index e0a2a904fdd9..0618d3595c08 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -129,9 +129,6 @@ int __kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t nr,
}

#ifdef CONFIG_MEMCG_KMEM
-
-LIST_HEAD(slab_root_caches);
-
static void memcg_kmem_cache_create_func(struct work_struct *work)
{
struct kmem_cache *cachep = container_of(work, struct kmem_cache,
@@ -154,27 +151,11 @@ static void init_memcg_params(struct kmem_cache *s,
else
slab_init_memcg_params(s);
}
-
-void memcg_link_cache(struct kmem_cache *s)
-{
- if (is_root_cache(s))
- list_add(&s->root_caches_node, &slab_root_caches);
-}
-
-static void memcg_unlink_cache(struct kmem_cache *s)
-{
- if (is_root_cache(s))
- list_del(&s->root_caches_node);
-}
#else
static inline void init_memcg_params(struct kmem_cache *s,
struct kmem_cache *root_cache)
{
}
-
-static inline void memcg_unlink_cache(struct kmem_cache *s)
-{
-}
#endif /* CONFIG_MEMCG_KMEM */

/*
@@ -251,7 +232,7 @@ struct kmem_cache *find_mergeable(unsigned int size, unsigned int align,
if (flags & SLAB_NEVER_MERGE)
return NULL;

- list_for_each_entry_reverse(s, &slab_root_caches, root_caches_node) {
+ list_for_each_entry_reverse(s, &slab_caches, list) {
if (slab_unmergeable(s))
continue;

@@ -310,7 +291,6 @@ static struct kmem_cache *create_cache(const char *name,

s->refcount = 1;
list_add(&s->list, &slab_caches);
- memcg_link_cache(s);
out:
if (err)
return ERR_PTR(err);
@@ -505,7 +485,6 @@ static int shutdown_cache(struct kmem_cache *s)
if (__kmem_cache_shutdown(s) != 0)
return -EBUSY;

- memcg_unlink_cache(s);
list_del(&s->list);

if (s->flags & SLAB_TYPESAFE_BY_RCU) {
@@ -749,7 +728,6 @@ struct kmem_cache *__init create_kmalloc_cache(const char *name,

create_boot_cache(s, name, size, flags, useroffset, usersize);
list_add(&s->list, &slab_caches);
- memcg_link_cache(s);
s->refcount = 1;
return s;
}
@@ -1090,12 +1068,12 @@ static void print_slabinfo_header(struct seq_file *m)
void *slab_start(struct seq_file *m, loff_t *pos)
{
mutex_lock(&slab_mutex);
- return seq_list_start(&slab_root_caches, *pos);
+ return seq_list_start(&slab_caches, *pos);
}

void *slab_next(struct seq_file *m, void *p, loff_t *pos)
{
- return seq_list_next(p, &slab_root_caches, pos);
+ return seq_list_next(p, &slab_caches, pos);
}

void slab_stop(struct seq_file *m, void *p)
@@ -1148,11 +1126,12 @@ static void cache_show(struct kmem_cache *s, struct seq_file *m)

static int slab_show(struct seq_file *m, void *p)
{
- struct kmem_cache *s = list_entry(p, struct kmem_cache, root_caches_node);
+ struct kmem_cache *s = list_entry(p, struct kmem_cache, list);

- if (p == slab_root_caches.next)
+ if (p == slab_caches.next)
print_slabinfo_header(m);
- cache_show(s, m);
+ if (is_root_cache(s))
+ cache_show(s, m);
return 0;
}

@@ -1254,7 +1233,7 @@ static int memcg_slabinfo_show(struct seq_file *m, void *unused)
mutex_lock(&slab_mutex);
seq_puts(m, "# <name> <css_id[:dead|deact]> <active_objs> <num_objs>");
seq_puts(m, " <active_slabs> <num_slabs>\n");
- list_for_each_entry(s, &slab_root_caches, root_caches_node) {
+ list_for_each_entry(s, &slab_caches, list) {
/*
* Skip kmem caches that don't have the memcg cache.
*/
diff --git a/mm/slub.c b/mm/slub.c
index 91a1dce932e9..7a5d6861b088 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -4364,7 +4364,6 @@ static struct kmem_cache * __init bootstrap(struct kmem_cache *static_cache)
}
slab_init_memcg_params(s);
list_add(&s->list, &slab_caches);
- memcg_link_cache(s);
return s;
}

--
2.26.2

2020-06-23 17:45:34

by Roman Gushchin

[permalink] [raw]
Subject: [PATCH v7 14/19] mm: memcg/slab: remove memcg_kmem_get_cache()

The memcg_kmem_get_cache() function became really trivial, so let's just
inline it into the single call point: memcg_slab_pre_alloc_hook().

It will make the code less bulky and can also help the compiler to
generate a better code.

Signed-off-by: Roman Gushchin <[email protected]>
Reviewed-by: Vlastimil Babka <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
---
include/linux/memcontrol.h | 2 --
mm/memcontrol.c | 25 +------------------------
mm/slab.h | 11 +++++++++--
mm/slab_common.c | 2 +-
4 files changed, 11 insertions(+), 29 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 2ac84dcfc9e5..5a8b62d075e6 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1403,8 +1403,6 @@ static inline void memcg_set_shrinker_bit(struct mem_cgroup *memcg,
}
#endif

-struct kmem_cache *memcg_kmem_get_cache(struct kmem_cache *cachep);
-
#ifdef CONFIG_MEMCG_KMEM
int __memcg_kmem_charge(struct mem_cgroup *memcg, gfp_t gfp,
unsigned int nr_pages);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d23c2bdeea66..c910fe326ca6 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -393,7 +393,7 @@ void memcg_put_cache_ids(void)

/*
* A lot of the calls to the cache allocation functions are expected to be
- * inlined by the compiler. Since the calls to memcg_kmem_get_cache are
+ * inlined by the compiler. Since the calls to memcg_slab_pre_alloc_hook() are
* conditional to this static branch, we'll have to allow modules that does
* kmem_cache_alloc and the such to see this symbol as well
*/
@@ -2900,29 +2900,6 @@ static void memcg_free_cache_id(int id)
ida_simple_remove(&memcg_cache_ida, id);
}

-/**
- * memcg_kmem_get_cache: select memcg or root cache for allocation
- * @cachep: the original global kmem cache
- *
- * Return the kmem_cache we're supposed to use for a slab allocation.
- *
- * If the cache does not exist yet, if we are the first user of it, we
- * create it asynchronously in a workqueue and let the current allocation
- * go through with the original cache.
- */
-struct kmem_cache *memcg_kmem_get_cache(struct kmem_cache *cachep)
-{
- struct kmem_cache *memcg_cachep;
-
- memcg_cachep = READ_ONCE(cachep->memcg_params.memcg_cache);
- if (unlikely(!memcg_cachep)) {
- queue_work(system_wq, &cachep->memcg_params.work);
- return cachep;
- }
-
- return memcg_cachep;
-}
-
/**
* __memcg_kmem_charge: charge a number of kernel pages to a memcg
* @memcg: memory cgroup to charge
diff --git a/mm/slab.h b/mm/slab.h
index 66482f8467e7..d47430e97ff1 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -364,9 +364,16 @@ static inline struct kmem_cache *memcg_slab_pre_alloc_hook(struct kmem_cache *s,
if (memcg_kmem_bypass())
return s;

- cachep = memcg_kmem_get_cache(s);
- if (is_root_cache(cachep))
+ cachep = READ_ONCE(s->memcg_params.memcg_cache);
+ if (unlikely(!cachep)) {
+ /*
+ * If memcg cache does not exist yet, we schedule it's
+ * asynchronous creation and let the current allocation
+ * go through with the root cache.
+ */
+ queue_work(system_wq, &s->memcg_params.work);
return s;
+ }

objcg = get_obj_cgroup_from_current();
if (!objcg)
diff --git a/mm/slab_common.c b/mm/slab_common.c
index fe4f5be0e51f..e0a2a904fdd9 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -568,7 +568,7 @@ void memcg_create_kmem_cache(struct kmem_cache *root_cache)
}

/*
- * Since readers won't lock (see memcg_kmem_get_cache()), we need a
+ * Since readers won't lock (see memcg_slab_pre_alloc_hook()), we need a
* barrier here to ensure nobody will see the kmem_cache partially
* initialized.
*/
--
2.26.2

2020-07-18 17:40:15

by Guenter Roeck

[permalink] [raw]
Subject: Re: [PATCH v7 17/19] mm: memcg/slab: use a single set of kmem_caches for all allocations

On Tue, Jun 23, 2020 at 10:40:35AM -0700, Roman Gushchin wrote:
> Instead of having two sets of kmem_caches: one for system-wide and
> non-accounted allocations and the second one shared by all accounted
> allocations, we can use just one.
>
> The idea is simple: space for obj_cgroup metadata can be allocated on
> demand and filled only for accounted allocations.
>
> It allows to remove a bunch of code which is required to handle kmem_cache
> clones for accounted allocations. There is no more need to create them,
> accumulate statistics, propagate attributes, etc. It's a quite
> significant simplification.
>
> Also, because the total number of slab_caches is reduced almost twice (not
> all kmem_caches have a memcg clone), some additional memory savings are
> expected. On my devvm it additionally saves about 3.5% of slab memory.
>
> Suggested-by: Johannes Weiner <[email protected]>
> Signed-off-by: Roman Gushchin <[email protected]>
> Reviewed-by: Vlastimil Babka <[email protected]>
> Reviewed-by: Shakeel Butt <[email protected]>

This patch results in:

{standard input}: Assembler messages:
{standard input}:140: Warning: macro instruction expanded into multiple instructions
mm/slub.c: In function 'slab_alloc.constprop':
mm/slub.c:2897:30: error: inlining failed in call to always_inline 'slab_alloc.constprop': recursive inlining
static __always_inline void *slab_alloc(struct kmem_cache *s,

and many similar messages when trying to build mips:64r6el_defconfig
or mips:defconfig. Bisect log attached.

Guenter

---

# bad: [aab7ee9f8ff0110bfcd594b33dc33748dc1baf46] Add linux-next specific files for 20200717
# good: [11ba468877bb23f28956a35e896356252d63c983] Linux 5.8-rc5
git bisect start 'HEAD' 'v5.8-rc5'
# good: [4d55a7a1298d197755c1a0f4512f56917e938a83] Merge remote-tracking branch 'crypto/master'
git bisect good 4d55a7a1298d197755c1a0f4512f56917e938a83
# good: [e63bf5dcce255302e355cb2277a3a39c83752c92] Merge remote-tracking branch 'devicetree/for-next'
git bisect good e63bf5dcce255302e355cb2277a3a39c83752c92
# good: [94d932ec3afb923efd8c736974f8316413175a5b] Merge remote-tracking branch 'thunderbolt/next'
git bisect good 94d932ec3afb923efd8c736974f8316413175a5b
# good: [5ddd2e0dbe8fceb80b0b36bd38a32217be7a04a5] Merge remote-tracking branch 'livepatching/for-next'
git bisect good 5ddd2e0dbe8fceb80b0b36bd38a32217be7a04a5
# bad: [40346f79983caf46fb92f779b0353422d43580a9] ipc/shm.c: Remove the superfluous break
git bisect bad 40346f79983caf46fb92f779b0353422d43580a9
# bad: [0b917599517f71ddef5f7274a8199a33cecd49b2] kasan: update required compiler versions in documentation
git bisect bad 0b917599517f71ddef5f7274a8199a33cecd49b2
# good: [7822c5f77725d5bf4ea48f155b0aa3827db19423] tmpfs: per-superblock i_ino support
git bisect good 7822c5f77725d5bf4ea48f155b0aa3827db19423
# bad: [c5b15b89803e3ed2810be285def5f4836e5ee629] mm, memcg: reclaim more aggressively before high allocator throttling
git bisect bad c5b15b89803e3ed2810be285def5f4836e5ee629
# good: [2b6d98a0b0cb5ff828228c6a094813c4919727da] mm: memcg/slab: remove redundant check in memcg_accumulate_slabinfo()
git bisect good 2b6d98a0b0cb5ff828228c6a094813c4919727da
# bad: [d32b702628530c68b4147d410b4cdf21610e9f93] mm: memcg/percpu: per-memcg percpu memory statistics
git bisect bad d32b702628530c68b4147d410b4cdf21610e9f93
# bad: [b109396be9be1b8fd91fa4c70bd73a0e93722274] percpu: return number of released bytes from pcpu_free_area()
git bisect bad b109396be9be1b8fd91fa4c70bd73a0e93722274
# bad: [6cee58aca5d334ee8195a711e4eb61a05e5f7eb5] kselftests: cgroup: add kernel memory accounting tests
git bisect bad 6cee58aca5d334ee8195a711e4eb61a05e5f7eb5
# bad: [2528f5d4f3c139035dc3adcbfb6c63ca14c840f0] mm: memcg/slab: use a single set of kmem_caches for all allocations
git bisect bad 2528f5d4f3c139035dc3adcbfb6c63ca14c840f0
# first bad commit: [2528f5d4f3c139035dc3adcbfb6c63ca14c840f0] mm: memcg/slab: use a single set of kmem_caches for all allocations

2020-07-18 23:06:15

by Roman Gushchin

[permalink] [raw]
Subject: Re: [PATCH v7 17/19] mm: memcg/slab: use a single set of kmem_caches for all allocations

On Sat, Jul 18, 2020 at 10:24:49AM -0700, Guenter Roeck wrote:
> On Tue, Jun 23, 2020 at 10:40:35AM -0700, Roman Gushchin wrote:
> > Instead of having two sets of kmem_caches: one for system-wide and
> > non-accounted allocations and the second one shared by all accounted
> > allocations, we can use just one.
> >
> > The idea is simple: space for obj_cgroup metadata can be allocated on
> > demand and filled only for accounted allocations.
> >
> > It allows to remove a bunch of code which is required to handle kmem_cache
> > clones for accounted allocations. There is no more need to create them,
> > accumulate statistics, propagate attributes, etc. It's a quite
> > significant simplification.
> >
> > Also, because the total number of slab_caches is reduced almost twice (not
> > all kmem_caches have a memcg clone), some additional memory savings are
> > expected. On my devvm it additionally saves about 3.5% of slab memory.
> >
> > Suggested-by: Johannes Weiner <[email protected]>
> > Signed-off-by: Roman Gushchin <[email protected]>
> > Reviewed-by: Vlastimil Babka <[email protected]>
> > Reviewed-by: Shakeel Butt <[email protected]>
>
> This patch results in:
>
> {standard input}: Assembler messages:
> {standard input}:140: Warning: macro instruction expanded into multiple instructions
> mm/slub.c: In function 'slab_alloc.constprop':
> mm/slub.c:2897:30: error: inlining failed in call to always_inline 'slab_alloc.constprop': recursive inlining
> static __always_inline void *slab_alloc(struct kmem_cache *s,
>
> and many similar messages when trying to build mips:64r6el_defconfig
> or mips:defconfig. Bisect log attached.

Hello, Guenter!

Thank you for reporting the problem!
Actually, I've already fixed it, and Andrew pulled the fix yesterday.
So in few days the fix should appear in the next tree.

Sorry for the inconvenience and please let me know if the problem will persist
after a couple of days.

Roman

2020-08-03 09:01:07

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v7 05/19] mm: memcontrol: decouple reference counting from page accounting

I am sorry for coming late here.

On Tue 23-06-20 10:40:23, Roman Gushchin wrote:
> From: Johannes Weiner <[email protected]>
>
> The reference counting of a memcg is currently coupled directly to how
> many 4k pages are charged to it. This doesn't work well with Roman's new
> slab controller, which maintains pools of objects and doesn't want to keep
> an extra balance sheet for the pages backing those objects.
>
> This unusual refcounting design (reference counts usually track pointers
> to an object) is only for historical reasons: memcg used to not take any
> css references and simply stalled offlining until all charges had been
> reparented and the page counters had dropped to zero. When we got rid of
> the reparenting requirement, the simple mechanical translation was to take
> a reference for every charge.
>
> More historical context can be found in commit e8ea14cc6ead ("mm:
> memcontrol: take a css reference for each charged page"), commit
> 64f219938941 ("mm: memcontrol: remove obsolete kmemcg pinning tricks") and
> commit b2052564e66d ("mm: memcontrol: continue cache reclaim from offlined
> groups").
>
> The new slab controller exposes the limitations in this scheme, so let's
> switch it to a more idiomatic reference counting model based on actual
> kernel pointers to the memcg:
>
> - The per-cpu stock holds a reference to the memcg its caching
>
> - User pages hold a reference for their page->mem_cgroup. Transparent
> huge pages will no longer acquire tail references in advance, we'll
> get them if needed during the split.
>
> - Kernel pages hold a reference for their page->mem_cgroup
>
> - Pages allocated in the root cgroup will acquire and release css
> references for simplicity. css_get() and css_put() optimize that.
>
> - The current memcg_charge_slab() already hacked around the per-charge
> references; this change gets rid of that as well.

just for completeness
- tcp accounting will handle reference in mem_cgroup_sk_{alloc,free}

As all those paths are handling the reference count differently it is
probably good to remind that in a comment:
/* Caller is responsible to hold reference for the existence of the charged object *

for try_charge function.

We will need to be more careful (e.g. http://lkml.kernel.org/r/[email protected])
but considering that the old model doesn't fit with the new slab
accounting as mentioned above this is not really something terrible to
live with.

[...]
> @@ -5456,7 +5460,10 @@ static int mem_cgroup_move_account(struct page *page,
> */
> smp_mb();
>
> - page->mem_cgroup = to; /* caller should have done css_get */
> + css_get(&to->css);
> + css_put(&from->css);
> +
> + page->mem_cgroup = to;
>
> __unlock_page_memcg(from);

What prevents from memcg to be released here?

--
Michal Hocko
SUSE Labs

2020-08-03 15:06:18

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH v7 05/19] mm: memcontrol: decouple reference counting from page accounting

On Mon, Aug 03, 2020 at 11:00:33AM +0200, Michal Hocko wrote:
> On Tue 23-06-20 10:40:23, Roman Gushchin wrote:
> > @@ -5456,7 +5460,10 @@ static int mem_cgroup_move_account(struct page *page,
> > */
> > smp_mb();
> >
> > - page->mem_cgroup = to; /* caller should have done css_get */
> > + css_get(&to->css);
> > + css_put(&from->css);
> > +
> > + page->mem_cgroup = to;
> >
> > __unlock_page_memcg(from);
>
> What prevents from memcg to be released here?

->attach_task() and kill_css() are exclusive through the cgroup_mutex,
so the base ref cannot disappear from under us during this operation.

2020-08-03 15:10:00

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v7 05/19] mm: memcontrol: decouple reference counting from page accounting

On Mon 03-08-20 11:03:49, Johannes Weiner wrote:
> On Mon, Aug 03, 2020 at 11:00:33AM +0200, Michal Hocko wrote:
> > On Tue 23-06-20 10:40:23, Roman Gushchin wrote:
> > > @@ -5456,7 +5460,10 @@ static int mem_cgroup_move_account(struct page *page,
> > > */
> > > smp_mb();
> > >
> > > - page->mem_cgroup = to; /* caller should have done css_get */
> > > + css_get(&to->css);
> > > + css_put(&from->css);
> > > +
> > > + page->mem_cgroup = to;
> > >
> > > __unlock_page_memcg(from);
> >
> > What prevents from memcg to be released here?
>
> ->attach_task() and kill_css() are exclusive through the cgroup_mutex,
> so the base ref cannot disappear from under us during this operation.

OK, is this worth a comment? Reference counting before other operation
on the object always makes me worried and those details are hidden
elsewhere.

Btw. with the follow up fix from Hugh
Acked-by: Michal Hocko <[email protected]>
--
Michal Hocko
SUSE Labs