2019-09-06 08:47:11

by Roman Gushchin

Subject: [PATCH RFC 00/14] The new slab memory controller

The existing slab memory controller is based on the idea of replicating
slab allocator internals for each memory cgroup. This approach promises
a low memory overhead (one pointer per page) and doesn't add too much
code on hot allocation and release paths. But it has a very serious flaw:
it leads to low slab utilization.

Using a drgn* script, I've estimated slab utilization on
a number of machines running different production workloads. In most
cases it was between 45% and 65%, and the best number I've seen was
around 85%. Turning kmem accounting off brings utilization into the
high 90s and frees 30-50% of slab memory. This means that the real price
of the existing slab memory controller is much bigger than a pointer
per page.

The real reason why the existing design leads to low slab utilization
is simple: slab pages are used exclusively by one memory cgroup.
If a cgroup makes only a few allocations of a certain size,
or if some active objects (e.g. dentries) are left behind after the cgroup is
deleted, or if the cgroup contains a single-threaded application which
barely allocates any kernel objects, but does so every time on a new CPU:
in all these cases the resulting slab utilization is very low.
If kmem accounting is off, the kernel can use the free space
on slab pages for other allocations.

Arguably this wasn't an issue back in the days when the kmem controller was
introduced as an opt-in feature, which had to be turned on
individually for each memory cgroup. But now it's turned on by default
on both cgroup v1 and v2, and modern systemd-based systems tend to
create a large number of cgroups.

This patchset provides a new implementation of the slab memory controller,
which aims to achieve much better slab utilization by sharing slab pages
between multiple memory cgroups. Below is a short description of the new
design (more details are in the commit messages).

Accounting is performed per-object instead of per-page. Slab-related
vmstat counters are converted to bytes. Charging is still performed on a
per-page basis, rounding up and remembering the leftovers.
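
To illustrate the rounding/leftover scheme, here is a minimal user-space C
model (the toy_* names and the leftover field are made up for illustration
and are not the patchset's actual API): a byte-sized charge only touches the
page-based counter when the remembered leftover runs out.

#include <stddef.h>

#define TOY_PAGE_SIZE 4096UL

/* Toy model of a memcg: a page-based charge counter plus a byte remainder. */
struct toy_memcg {
        unsigned long charged_pages;    /* what the page-based counter sees */
        unsigned long leftover;         /* pre-charged bytes not consumed yet */
};

/*
 * Charge @bytes worth of objects. Pages are charged in whole units and the
 * unused remainder is remembered, so subsequent small allocations can be
 * served without touching the page counter again.
 */
static void toy_charge(struct toy_memcg *memcg, size_t bytes)
{
        unsigned long pages;

        if (bytes <= memcg->leftover) {
                memcg->leftover -= bytes;
                return;
        }

        bytes -= memcg->leftover;
        pages = (bytes + TOY_PAGE_SIZE - 1) / TOY_PAGE_SIZE;    /* round up */

        memcg->charged_pages += pages;
        memcg->leftover = pages * TOY_PAGE_SIZE - bytes;
}

E.g. charging two 128-byte objects against a fresh toy_memcg charges one page
on the first call and nothing on the second; the real patch does the
equivalent against the memcg page counter.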

Memcg ownership data is stored in a per-slab-page vector: for each slab page
a vector of corresponding size is allocated. To keep slab memory reparenting
working, an intermediate object is used instead of saving a pointer to the
memory cgroup directly. It's simply a pointer to a memcg (which can be
easily changed to point at the parent) with a built-in reference counter.
This scheme allows reparenting all allocated objects without walking them
and updating their memcg pointers.
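
For illustration, a rough user-space sketch of such an intermediate object
(the actual definition is introduced by the "mm: memcg: introduce
mem_cgroup_ptr" patch; the field names and the plain atomic refcount here
are guesses): slab objects take a reference on the intermediate pointer, so
reparenting is a single store instead of a walk over every charged object.

#include <stdatomic.h>

struct toy_memcg;                       /* stand-in for struct mem_cgroup */

/* Refcounted indirection: slab objects point here, not at the memcg. */
struct toy_memcg_ptr {
        struct toy_memcg *memcg;        /* can be switched to the parent */
        atomic_int refcnt;              /* one reference per charged object */
};

static void toy_memcg_ptr_get(struct toy_memcg_ptr *ptr)
{
        atomic_fetch_add(&ptr->refcnt, 1);
}

static void toy_memcg_ptr_put(struct toy_memcg_ptr *ptr)
{
        atomic_fetch_sub(&ptr->refcnt, 1);      /* real code frees at zero */
}

/* Reparenting: one pointer update covers all objects holding @ptr. */
static void toy_reparent(struct toy_memcg_ptr *ptr, struct toy_memcg *parent)
{
        ptr->memcg = parent;                    /* done under RCU in the patch */
}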

Instead of creating an individual set of kmem_caches for each memory cgroup,
two global sets are used: the root set for non-accounted and root-cgroup
allocations, and a second set for all other allocations. This simplifies
the lifetime management of individual kmem_caches: they are destroyed
together with their root counterparts. It also removes a good amount of code
and makes things generally simpler.
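
A sketch of what allocation-time cache selection boils down to under this
scheme (the toy_* names are illustrative, not the patch's actual helpers):
instead of looking up a per-cgroup clone, an accounted allocation simply
uses the cache's single shared memcg counterpart.

#include <stdbool.h>

struct toy_kmem_cache {
        struct toy_kmem_cache *memcg_cache;     /* single shared accounted twin */
};

static struct toy_kmem_cache *
toy_select_cache(struct toy_kmem_cache *root, bool accounted)
{
        /* root-cgroup and non-accounted allocations use the root cache */
        if (!accounted || !root->memcg_cache)
                return root;

        /*
         * All accounted allocations share one counterpart; per-object
         * ownership is tracked separately in the slab page's memcg vector.
         */
        return root->memcg_cache;
}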

The patchset contains a couple of semi-independent parts, which may find
use outside of the slab memory controller too:
1) the subpage charging API, which can be used in the future for accounting
other non-page-sized objects, e.g. percpu allocations;
2) the mem_cgroup_ptr API (refcounted pointers to a memcg), which can be
reused for the efficient reparenting of other objects, e.g. the pagecache.

The patchset has been tested on a number of different workloads in our
production environment. In all cases, it saved hefty amounts of memory:
1) web frontend, 650-700 MB, ~42% of slab memory
2) database cache, 750-800 MB, ~35% of slab memory
3) DNS server, 700 MB, ~36% of slab memory

So far I haven't found any regressions on the tested workloads, but a
potential CPU regression caused by more precise accounting is a concern.

Obviously the amount of saved memory depends on the number of memory cgroups,
uptime and the specific workloads, but overall it feels like the new controller
saves 30-40% of slab memory, sometimes more. Additionally, it should lead
to lower memory fragmentation, simply because of the smaller number of
non-movable pages, and also because there is no longer a need to move all
slab objects to a new set of pages when a workload is restarted in a new
memory cgroup.

* https://github.com/osandov/drgn


Roman Gushchin (14):
mm: memcg: subpage charging API
mm: memcg: introduce mem_cgroup_ptr
mm: vmstat: use s32 for vm_node_stat_diff in struct per_cpu_nodestat
mm: vmstat: convert slab vmstat counter to bytes
mm: memcg/slab: allocate space for memcg ownership data for non-root
slabs
mm: slub: implement SLUB version of obj_to_index()
mm: memcg/slab: save memcg ownership data for non-root slab objects
mm: memcg: move memcg_kmem_bypass() to memcontrol.h
mm: memcg: introduce __mod_lruvec_memcg_state()
mm: memcg/slab: charge individual slab objects instead of pages
mm: memcg: move get_mem_cgroup_from_current() to memcontrol.h
mm: memcg/slab: replace memcg_from_slab_page() with
memcg_from_slab_obj()
mm: memcg/slab: use one set of kmem_caches for all memory cgroups
mm: slab: remove redundant check in memcg_accumulate_slabinfo()

drivers/base/node.c | 11 +-
fs/proc/meminfo.c | 4 +-
include/linux/memcontrol.h | 102 ++++++++-
include/linux/mm_types.h | 5 +-
include/linux/mmzone.h | 12 +-
include/linux/slab.h | 3 +-
include/linux/slub_def.h | 9 +
include/linux/vmstat.h | 8 +
kernel/power/snapshot.c | 2 +-
mm/list_lru.c | 12 +-
mm/memcontrol.c | 431 +++++++++++++++++++++--------------
mm/oom_kill.c | 2 +-
mm/page_alloc.c | 8 +-
mm/slab.c | 37 ++-
mm/slab.h | 300 +++++++++++++------------
mm/slab_common.c | 449 ++++---------------------------------
mm/slob.c | 12 +-
mm/slub.c | 63 ++----
mm/vmscan.c | 3 +-
mm/vmstat.c | 38 +++-
mm/workingset.c | 6 +-
21 files changed, 683 insertions(+), 834 deletions(-)

--
2.21.0


2019-09-06 08:59:16

by Roman Gushchin

Subject: [PATCH RFC 04/14] mm: vmstat: convert slab vmstat counter to bytes

In order to prepare for per-object slab memory accounting,
convert NR_SLAB_RECLAIMABLE and NR_SLAB_UNRECLAIMABLE vmstat
items to bytes.

To make it obvious that these vmstats are in bytes, rename them
to NR_SLAB_RECLAIMABLE_B and NR_SLAB_UNRECLAIMABLE_B (similar to
NR_KERNEL_STACK_KB).

The size of slab memory shouldn't exceed 4GB on 32-bit machines,
so it will fit into the atomic_long_t we use for vmstats.

Signed-off-by: Roman Gushchin <[email protected]>
---
drivers/base/node.c | 11 ++++++++---
fs/proc/meminfo.c | 4 ++--
include/linux/mmzone.h | 10 ++++++++--
include/linux/vmstat.h | 8 ++++++++
kernel/power/snapshot.c | 2 +-
mm/memcontrol.c | 29 ++++++++++++++++-------------
mm/oom_kill.c | 2 +-
mm/page_alloc.c | 8 ++++----
mm/slab.h | 15 ++++++++-------
mm/slab_common.c | 4 ++--
mm/slob.c | 12 ++++++------
mm/slub.c | 8 ++++----
mm/vmscan.c | 3 ++-
mm/vmstat.c | 22 +++++++++++++++++++---
mm/workingset.c | 6 ++++--
15 files changed, 93 insertions(+), 51 deletions(-)

diff --git a/drivers/base/node.c b/drivers/base/node.c
index 296546ffed6c..56664222f3fd 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -368,8 +368,8 @@ static ssize_t node_read_meminfo(struct device *dev,
unsigned long sreclaimable, sunreclaimable;

si_meminfo_node(&i, nid);
- sreclaimable = node_page_state(pgdat, NR_SLAB_RECLAIMABLE);
- sunreclaimable = node_page_state(pgdat, NR_SLAB_UNRECLAIMABLE);
+ sreclaimable = node_page_state_pages(pgdat, NR_SLAB_RECLAIMABLE_B);
+ sunreclaimable = node_page_state_pages(pgdat, NR_SLAB_UNRECLAIMABLE_B);
n = sprintf(buf,
"Node %d MemTotal: %8lu kB\n"
"Node %d MemFree: %8lu kB\n"
@@ -495,9 +495,13 @@ static ssize_t node_read_vmstat(struct device *dev,
int i;
int n = 0;

- for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++)
- n += sprintf(buf+n, "%s %lu\n", vmstat_text[i],
- sum_zone_node_page_state(nid, i));
+ for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++) {
+ unsigned long x = sum_zone_node_page_state(nid, i);
+
+ if (vmstat_item_in_bytes(i))
+ x >>= PAGE_SHIFT;
+ n += sprintf(buf+n, "%s %lu\n", vmstat_text[i], x);
+ }

#ifdef CONFIG_NUMA
for (i = 0; i < NR_VM_NUMA_STAT_ITEMS; i++)
diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
index ac9247371871..87afa8683c1b 100644
--- a/fs/proc/meminfo.c
+++ b/fs/proc/meminfo.c
@@ -53,8 +53,8 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
pages[lru] = global_node_page_state(NR_LRU_BASE + lru);

available = si_mem_available();
- sreclaimable = global_node_page_state(NR_SLAB_RECLAIMABLE);
- sunreclaim = global_node_page_state(NR_SLAB_UNRECLAIMABLE);
+ sreclaimable = global_node_page_state_pages(NR_SLAB_RECLAIMABLE_B);
+ sunreclaim = global_node_page_state_pages(NR_SLAB_UNRECLAIMABLE_B);

show_val_kb(m, "MemTotal: ", i.totalram);
show_val_kb(m, "MemFree: ", i.freeram);
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 6e8233d52971..2dbc2d042ef6 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -215,8 +215,8 @@ enum node_stat_item {
NR_INACTIVE_FILE, /* " " " " " */
NR_ACTIVE_FILE, /* " " " " " */
NR_UNEVICTABLE, /* " " " " " */
- NR_SLAB_RECLAIMABLE, /* Please do not reorder this item */
- NR_SLAB_UNRECLAIMABLE, /* and this one without looking at
+ NR_SLAB_RECLAIMABLE_B, /* Please do not reorder this item */
+ NR_SLAB_UNRECLAIMABLE_B,/* and this one without looking at
* memcg_flush_percpu_vmstats() first. */
NR_ISOLATED_ANON, /* Temporary isolated pages from anon lru */
NR_ISOLATED_FILE, /* Temporary isolated pages from file lru */
@@ -247,6 +247,12 @@ enum node_stat_item {
NR_VM_NODE_STAT_ITEMS
};

+static __always_inline bool vmstat_item_in_bytes(enum node_stat_item item)
+{
+ return (item == NR_SLAB_RECLAIMABLE_B ||
+ item == NR_SLAB_UNRECLAIMABLE_B);
+}
+
/*
* We do arithmetic on the LRU lists in various places in the code,
* so it is important to keep the active lists LRU_ACTIVE higher in
diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index bdeda4b079fe..e5460e597e0c 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -194,6 +194,12 @@ static inline unsigned long global_node_page_state(enum node_stat_item item)
return x;
}

+static inline
+unsigned long global_node_page_state_pages(enum node_stat_item item)
+{
+ return global_node_page_state(item) >> PAGE_SHIFT;
+}
+
static inline unsigned long zone_page_state(struct zone *zone,
enum zone_stat_item item)
{
@@ -234,6 +240,8 @@ extern unsigned long sum_zone_node_page_state(int node,
extern unsigned long sum_zone_numa_state(int node, enum numa_stat_item item);
extern unsigned long node_page_state(struct pglist_data *pgdat,
enum node_stat_item item);
+extern unsigned long node_page_state_pages(struct pglist_data *pgdat,
+ enum node_stat_item item);
#else
#define sum_zone_node_page_state(node, item) global_zone_page_state(item)
#define node_page_state(node, item) global_node_page_state(item)
diff --git a/kernel/power/snapshot.c b/kernel/power/snapshot.c
index 83105874f255..ce9e5686e745 100644
--- a/kernel/power/snapshot.c
+++ b/kernel/power/snapshot.c
@@ -1659,7 +1659,7 @@ static unsigned long minimum_image_size(unsigned long saveable)
{
unsigned long size;

- size = global_node_page_state(NR_SLAB_RECLAIMABLE)
+ size = global_node_page_state_pages(NR_SLAB_RECLAIMABLE_B)
+ global_node_page_state(NR_ACTIVE_ANON)
+ global_node_page_state(NR_INACTIVE_ANON)
+ global_node_page_state(NR_ACTIVE_FILE)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index cb9adb31360e..e4af9810b59e 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -759,13 +759,16 @@ mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_node *mctz)
*/
void __mod_memcg_state(struct mem_cgroup *memcg, int idx, int val)
{
- long x;
+ long x, threshold = MEMCG_CHARGE_BATCH;

if (mem_cgroup_disabled())
return;

+ if (vmstat_item_in_bytes(idx))
+ threshold <<= PAGE_SHIFT;
+
x = val + __this_cpu_read(memcg->vmstats_percpu->stat[idx]);
- if (unlikely(abs(x) > MEMCG_CHARGE_BATCH)) {
+ if (unlikely(abs(x) > threshold)) {
struct mem_cgroup *mi;

/*
@@ -807,7 +810,7 @@ void __mod_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
pg_data_t *pgdat = lruvec_pgdat(lruvec);
struct mem_cgroup_per_node *pn;
struct mem_cgroup *memcg;
- long x;
+ long x, threshold = MEMCG_CHARGE_BATCH;

/* Update node */
__mod_node_page_state(pgdat, idx, val);
@@ -824,8 +827,11 @@ void __mod_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
/* Update lruvec */
__this_cpu_add(pn->lruvec_stat_local->count[idx], val);

+ if (vmstat_item_in_bytes(idx))
+ threshold <<= PAGE_SHIFT;
+
x = val + __this_cpu_read(pn->lruvec_stat_cpu->count[idx]);
- if (unlikely(abs(x) > MEMCG_CHARGE_BATCH)) {
+ if (unlikely(abs(x) > threshold)) {
struct mem_cgroup_per_node *pi;

for (pi = pn; pi; pi = parent_nodeinfo(pi, pgdat->node_id))
@@ -1478,9 +1484,8 @@ static char *memory_stat_format(struct mem_cgroup *memcg)
(u64)memcg_page_state(memcg, MEMCG_KERNEL_STACK_KB) *
1024);
seq_buf_printf(&s, "slab %llu\n",
- (u64)(memcg_page_state(memcg, NR_SLAB_RECLAIMABLE) +
- memcg_page_state(memcg, NR_SLAB_UNRECLAIMABLE)) *
- PAGE_SIZE);
+ (u64)(memcg_page_state(memcg, NR_SLAB_RECLAIMABLE_B) +
+ memcg_page_state(memcg, NR_SLAB_UNRECLAIMABLE_B)));
seq_buf_printf(&s, "sock %llu\n",
(u64)memcg_page_state(memcg, MEMCG_SOCK) *
PAGE_SIZE);
@@ -1514,11 +1519,9 @@ static char *memory_stat_format(struct mem_cgroup *memcg)
PAGE_SIZE);

seq_buf_printf(&s, "slab_reclaimable %llu\n",
- (u64)memcg_page_state(memcg, NR_SLAB_RECLAIMABLE) *
- PAGE_SIZE);
+ (u64)memcg_page_state(memcg, NR_SLAB_RECLAIMABLE_B));
seq_buf_printf(&s, "slab_unreclaimable %llu\n",
- (u64)memcg_page_state(memcg, NR_SLAB_UNRECLAIMABLE) *
- PAGE_SIZE);
+ (u64)memcg_page_state(memcg, NR_SLAB_UNRECLAIMABLE_B));

/* Accumulated memory events */

@@ -3564,8 +3567,8 @@ static void memcg_flush_percpu_vmstats(struct mem_cgroup *memcg, bool slab_only)
int min_idx, max_idx;

if (slab_only) {
- min_idx = NR_SLAB_RECLAIMABLE;
- max_idx = NR_SLAB_UNRECLAIMABLE;
+ min_idx = NR_SLAB_RECLAIMABLE_B;
+ max_idx = NR_SLAB_UNRECLAIMABLE_B;
} else {
min_idx = 0;
max_idx = MEMCG_NR_STAT;
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 314ce1a3cf25..61476bf0f147 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -183,7 +183,7 @@ static bool is_dump_unreclaim_slabs(void)
global_node_page_state(NR_ISOLATED_FILE) +
global_node_page_state(NR_UNEVICTABLE);

- return (global_node_page_state(NR_SLAB_UNRECLAIMABLE) > nr_lru);
+ return (global_node_page_state_pages(NR_SLAB_UNRECLAIMABLE_B) > nr_lru);
}

/**
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c5d62f1c2851..9da8ee92c226 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5100,8 +5100,8 @@ long si_mem_available(void)
* items that are in use, and cannot be freed. Cap this estimate at the
* low watermark.
*/
- reclaimable = global_node_page_state(NR_SLAB_RECLAIMABLE) +
- global_node_page_state(NR_KERNEL_MISC_RECLAIMABLE);
+ reclaimable = global_node_page_state_pages(NR_SLAB_RECLAIMABLE_B) +
+ global_node_page_state(NR_KERNEL_MISC_RECLAIMABLE);
available += reclaimable - min(reclaimable / 2, wmark_low);

if (available < 0)
@@ -5245,8 +5245,8 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
global_node_page_state(NR_FILE_DIRTY),
global_node_page_state(NR_WRITEBACK),
global_node_page_state(NR_UNSTABLE_NFS),
- global_node_page_state(NR_SLAB_RECLAIMABLE),
- global_node_page_state(NR_SLAB_UNRECLAIMABLE),
+ global_node_page_state_pages(NR_SLAB_RECLAIMABLE_B),
+ global_node_page_state_pages(NR_SLAB_UNRECLAIMABLE_B),
global_node_page_state(NR_FILE_MAPPED),
global_node_page_state(NR_SHMEM),
global_zone_page_state(NR_PAGETABLE),
diff --git a/mm/slab.h b/mm/slab.h
index 68e455f2b698..7c5577c2b9ea 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -272,7 +272,7 @@ int __kmem_cache_alloc_bulk(struct kmem_cache *, gfp_t, size_t, void **);
static inline int cache_vmstat_idx(struct kmem_cache *s)
{
return (s->flags & SLAB_RECLAIM_ACCOUNT) ?
- NR_SLAB_RECLAIMABLE : NR_SLAB_UNRECLAIMABLE;
+ NR_SLAB_RECLAIMABLE_B : NR_SLAB_UNRECLAIMABLE_B;
}

#ifdef CONFIG_MEMCG_KMEM
@@ -360,7 +360,7 @@ static __always_inline int memcg_charge_slab(struct page *page,

if (unlikely(!memcg || mem_cgroup_is_root(memcg))) {
mod_node_page_state(page_pgdat(page), cache_vmstat_idx(s),
- (1 << order));
+ (PAGE_SIZE << order));
percpu_ref_get_many(&s->memcg_params.refcnt, 1 << order);
return 0;
}
@@ -370,7 +370,7 @@ static __always_inline int memcg_charge_slab(struct page *page,
goto out;

lruvec = mem_cgroup_lruvec(page_pgdat(page), memcg);
- mod_lruvec_state(lruvec, cache_vmstat_idx(s), 1 << order);
+ mod_lruvec_state(lruvec, cache_vmstat_idx(s), PAGE_SIZE << order);

/* transer try_charge() page references to kmem_cache */
percpu_ref_get_many(&s->memcg_params.refcnt, 1 << order);
@@ -394,11 +394,12 @@ static __always_inline void memcg_uncharge_slab(struct page *page, int order,
memcg = READ_ONCE(s->memcg_params.memcg);
if (likely(!mem_cgroup_is_root(memcg))) {
lruvec = mem_cgroup_lruvec(page_pgdat(page), memcg);
- mod_lruvec_state(lruvec, cache_vmstat_idx(s), -(1 << order));
+ mod_lruvec_state(lruvec, cache_vmstat_idx(s),
+ -(PAGE_SIZE << order));
memcg_kmem_uncharge_memcg(page, order, memcg);
} else {
mod_node_page_state(page_pgdat(page), cache_vmstat_idx(s),
- -(1 << order));
+ -(PAGE_SIZE << order));
}
rcu_read_unlock();

@@ -482,7 +483,7 @@ static __always_inline int charge_slab_page(struct page *page,
{
if (is_root_cache(s)) {
mod_node_page_state(page_pgdat(page), cache_vmstat_idx(s),
- 1 << order);
+ PAGE_SIZE << order);
return 0;
}

@@ -494,7 +495,7 @@ static __always_inline void uncharge_slab_page(struct page *page, int order,
{
if (is_root_cache(s)) {
mod_node_page_state(page_pgdat(page), cache_vmstat_idx(s),
- -(1 << order));
+ -(PAGE_SIZE << order));
return;
}

diff --git a/mm/slab_common.c b/mm/slab_common.c
index c29f03adca91..79695d9c34f3 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -1303,8 +1303,8 @@ void *kmalloc_order(size_t size, gfp_t flags, unsigned int order)
page = alloc_pages(flags, order);
if (likely(page)) {
ret = page_address(page);
- mod_node_page_state(page_pgdat(page), NR_SLAB_UNRECLAIMABLE,
- 1 << order);
+ mod_node_page_state(page_pgdat(page), NR_SLAB_UNRECLAIMABLE_B,
+ PAGE_SIZE << order);
}
ret = kasan_kmalloc_large(ret, size, flags);
/* As ret might get tagged, call kmemleak hook after KASAN. */
diff --git a/mm/slob.c b/mm/slob.c
index fa53e9f73893..8b7b56235438 100644
--- a/mm/slob.c
+++ b/mm/slob.c
@@ -202,8 +202,8 @@ static void *slob_new_pages(gfp_t gfp, int order, int node)
if (!page)
return NULL;

- mod_node_page_state(page_pgdat(page), NR_SLAB_UNRECLAIMABLE,
- 1 << order);
+ mod_node_page_state(page_pgdat(page), NR_SLAB_UNRECLAIMABLE_B,
+ PAGE_SIZE << order);
return page_address(page);
}

@@ -214,8 +214,8 @@ static void slob_free_pages(void *b, int order)
if (current->reclaim_state)
current->reclaim_state->reclaimed_slab += 1 << order;

- mod_node_page_state(page_pgdat(sp), NR_SLAB_UNRECLAIMABLE,
- -(1 << order));
+ mod_node_page_state(page_pgdat(sp), NR_SLAB_UNRECLAIMABLE_B,
+ -(PAGE_SIZE << order));
__free_pages(sp, order);
}

@@ -550,8 +550,8 @@ void kfree(const void *block)
slob_free(m, *m + align);
} else {
unsigned int order = compound_order(sp);
- mod_node_page_state(page_pgdat(sp), NR_SLAB_UNRECLAIMABLE,
- -(1 << order));
+ mod_node_page_state(page_pgdat(sp), NR_SLAB_UNRECLAIMABLE_B,
+ -(PAGE_SIZE << order));
__free_pages(sp, order);

}
diff --git a/mm/slub.c b/mm/slub.c
index c9856a9807f1..0873b77727bf 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3825,8 +3825,8 @@ static void *kmalloc_large_node(size_t size, gfp_t flags, int node)
page = alloc_pages_node(node, flags, order);
if (page) {
ptr = page_address(page);
- mod_node_page_state(page_pgdat(page), NR_SLAB_UNRECLAIMABLE,
- 1 << order);
+ mod_node_page_state(page_pgdat(page), NR_SLAB_UNRECLAIMABLE_B,
+ PAGE_SIZE << order);
}

return kmalloc_large_node_hook(ptr, size, flags);
@@ -3957,8 +3957,8 @@ void kfree(const void *x)

BUG_ON(!PageCompound(page));
kfree_hook(object);
- mod_node_page_state(page_pgdat(page), NR_SLAB_UNRECLAIMABLE,
- -(1 << order));
+ mod_node_page_state(page_pgdat(page), NR_SLAB_UNRECLAIMABLE_B,
+ -(PAGE_SIZE << order));
__free_pages(page, order);
return;
}
diff --git a/mm/vmscan.c b/mm/vmscan.c
index ee47bbcb99b5..283da9a39de2 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4272,7 +4272,8 @@ int node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned int order)
* unmapped file backed pages.
*/
if (node_pagecache_reclaimable(pgdat) <= pgdat->min_unmapped_pages &&
- node_page_state(pgdat, NR_SLAB_RECLAIMABLE) <= pgdat->min_slab_pages)
+ node_page_state_pages(pgdat, NR_SLAB_RECLAIMABLE_B) <=
+ pgdat->min_slab_pages)
return NODE_RECLAIM_FULL;

/*
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 98f43725d910..d04f53997fd2 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -344,6 +344,8 @@ void __mod_node_page_state(struct pglist_data *pgdat, enum node_stat_item item,
x = delta + __this_cpu_read(*p);

t = __this_cpu_read(pcp->stat_threshold);
+ if (vmstat_item_in_bytes(item))
+ t <<= PAGE_SHIFT;

if (unlikely(x > t || x < -t)) {
node_page_state_add(x, pgdat, item);
@@ -555,6 +557,8 @@ static inline void mod_node_state(struct pglist_data *pgdat,
* for all cpus in a node.
*/
t = this_cpu_read(pcp->stat_threshold);
+ if (vmstat_item_in_bytes(item))
+ t <<= PAGE_SHIFT;

o = this_cpu_read(*p);
n = delta + o;
@@ -999,6 +1003,12 @@ unsigned long node_page_state(struct pglist_data *pgdat,
#endif
return x;
}
+
+unsigned long node_page_state_pages(struct pglist_data *pgdat,
+ enum node_stat_item item)
+{
+ return node_page_state(pgdat, item) >> PAGE_SHIFT;
+}
#endif

#ifdef CONFIG_COMPACTION
@@ -1547,10 +1557,13 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
if (is_zone_first_populated(pgdat, zone)) {
seq_printf(m, "\n per-node stats");
for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) {
+ unsigned long x = node_page_state(pgdat, i);
+
+ if (vmstat_item_in_bytes(i))
+ x >>= PAGE_SHIFT;
seq_printf(m, "\n %-12s %lu",
vmstat_text[i + NR_VM_ZONE_STAT_ITEMS +
- NR_VM_NUMA_STAT_ITEMS],
- node_page_state(pgdat, i));
+ NR_VM_NUMA_STAT_ITEMS], x);
}
}
seq_printf(m,
@@ -1679,8 +1692,11 @@ static void *vmstat_start(struct seq_file *m, loff_t *pos)
v += NR_VM_NUMA_STAT_ITEMS;
#endif

- for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
+ for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) {
v[i] = global_node_page_state(i);
+ if (vmstat_item_in_bytes(i))
+ v[i] >>= PAGE_SHIFT;
+ }
v += NR_VM_NODE_STAT_ITEMS;

global_dirty_limits(v + NR_DIRTY_BG_THRESHOLD,
diff --git a/mm/workingset.c b/mm/workingset.c
index c963831d354f..675792387e58 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -430,8 +430,10 @@ static unsigned long count_shadow_nodes(struct shrinker *shrinker,
for (pages = 0, i = 0; i < NR_LRU_LISTS; i++)
pages += lruvec_page_state_local(lruvec,
NR_LRU_BASE + i);
- pages += lruvec_page_state_local(lruvec, NR_SLAB_RECLAIMABLE);
- pages += lruvec_page_state_local(lruvec, NR_SLAB_UNRECLAIMABLE);
+ pages += lruvec_page_state_local(
+ lruvec, NR_SLAB_RECLAIMABLE_B) >> PAGE_SHIFT;
+ pages += lruvec_page_state_local(
+ lruvec, NR_SLAB_UNRECLAIMABLE_B) >> PAGE_SHIFT;
} else
#endif
pages = node_present_pages(sc->nid);
--
2.21.0

2019-09-06 08:59:31

by Roman Gushchin

Subject: [PATCH RFC 05/14] mm: memcg/slab: allocate space for memcg ownership data for non-root slabs

Allocate and release memory for storing the memcg ownership data.
For each slab page, allocate space sufficient for number_of_objects
pointers to struct mem_cgroup_ptr.

The mem_cgroup field of struct page isn't used for slab pages,
so let's use that space to store the pointer to the allocated
vector.

This commit makes sure that the space is ready for use, but nobody
is actually using it yet. Following commits in the series will make use of it.
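
Once later patches start using it, each allocated object's slot in this
vector is found from its offset within the slab page. A simplified sketch of
that indexing (the real helper, obj_to_index(), is added by the SLUB patch
and uses a reciprocal divide rather than the plain division shown here):

/*
 * Illustrative only: map an object address to its slot in the
 * per-slab-page ownership vector.
 */
static unsigned int toy_obj_to_index(unsigned long slab_base,
                                     unsigned int obj_size,
                                     unsigned long obj_addr)
{
        return (obj_addr - slab_base) / obj_size;
}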

Signed-off-by: Roman Gushchin <[email protected]>
---
include/linux/mm_types.h | 5 ++++-
mm/slab.c | 3 ++-
mm/slab.h | 37 ++++++++++++++++++++++++++++++++++++-
mm/slub.c | 2 +-
4 files changed, 43 insertions(+), 4 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 25395481d2ae..510cb170c4b8 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -199,7 +199,10 @@ struct page {
atomic_t _refcount;

#ifdef CONFIG_MEMCG
- struct mem_cgroup *mem_cgroup;
+ union {
+ struct mem_cgroup *mem_cgroup;
+ struct mem_cgroup_ptr **mem_cgroup_vec;
+ };
#endif

/*
diff --git a/mm/slab.c b/mm/slab.c
index 9df370558e5d..f0833f287dcf 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -1369,7 +1369,8 @@ static struct page *kmem_getpages(struct kmem_cache *cachep, gfp_t flags,
return NULL;
}

- if (charge_slab_page(page, flags, cachep->gfporder, cachep)) {
+ if (charge_slab_page(page, flags, cachep->gfporder, cachep,
+ cachep->num)) {
__free_pages(page, cachep->gfporder);
return NULL;
}
diff --git a/mm/slab.h b/mm/slab.h
index 7c5577c2b9ea..16d7ea30a2d3 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -406,6 +406,23 @@ static __always_inline void memcg_uncharge_slab(struct page *page, int order,
percpu_ref_put_many(&s->memcg_params.refcnt, 1 << order);
}

+static inline int memcg_alloc_page_memcg_vec(struct page *page, gfp_t gfp,
+ unsigned int objects)
+{
+ page->mem_cgroup_vec = kmalloc(sizeof(struct mem_cgroup_ptr *) *
+ objects, gfp | __GFP_ZERO);
+ if (!page->mem_cgroup_vec)
+ return -ENOMEM;
+
+ return 0;
+}
+
+static inline void memcg_free_page_memcg_vec(struct page *page)
+{
+ kfree(page->mem_cgroup_vec);
+ page->mem_cgroup_vec = NULL;
+}
+
extern void slab_init_memcg_params(struct kmem_cache *);
extern void memcg_link_cache(struct kmem_cache *s, struct mem_cgroup *memcg);

@@ -455,6 +472,16 @@ static inline void memcg_uncharge_slab(struct page *page, int order,
{
}

+static inline int memcg_alloc_page_memcg_vec(struct page *page, gfp_t gfp,
+ unsigned int objects)
+{
+ return 0;
+}
+
+static inline void memcg_free_page_memcg_vec(struct page *page)
+{
+}
+
static inline void slab_init_memcg_params(struct kmem_cache *s)
{
}
@@ -479,14 +506,21 @@ static inline struct kmem_cache *virt_to_cache(const void *obj)

static __always_inline int charge_slab_page(struct page *page,
gfp_t gfp, int order,
- struct kmem_cache *s)
+ struct kmem_cache *s,
+ unsigned int objects)
{
+ int ret;
+
if (is_root_cache(s)) {
mod_node_page_state(page_pgdat(page), cache_vmstat_idx(s),
PAGE_SIZE << order);
return 0;
}

+ ret = memcg_alloc_page_memcg_vec(page, gfp, objects);
+ if (ret)
+ return ret;
+
return memcg_charge_slab(page, gfp, order, s);
}

@@ -499,6 +533,7 @@ static __always_inline void uncharge_slab_page(struct page *page, int order,
return;
}

+ memcg_free_page_memcg_vec(page);
memcg_uncharge_slab(page, order, s);
}

diff --git a/mm/slub.c b/mm/slub.c
index 0873b77727bf..3014158c100d 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1517,7 +1517,7 @@ static inline struct page *alloc_slab_page(struct kmem_cache *s,
else
page = __alloc_pages_node(node, flags, order);

- if (page && charge_slab_page(page, flags, order, s)) {
+ if (page && charge_slab_page(page, flags, order, s, oo_objects(oo))) {
__free_pages(page, order);
page = NULL;
}
--
2.21.0

2019-09-06 08:59:47

by Roman Gushchin

Subject: [PATCH RFC 08/14] mm: memcg: move memcg_kmem_bypass() to memcontrol.h

To make the memcg_kmem_bypass() function available outside of
memcontrol.c, let's move it to memcontrol.h. The function
is small and fits nicely as a static inline function.

It will be used from the slab code.

Signed-off-by: Roman Gushchin <[email protected]>
---
include/linux/memcontrol.h | 7 +++++++
mm/memcontrol.c | 7 -------
2 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index b9643d758fc9..8f1d7161579f 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1430,6 +1430,13 @@ static inline bool memcg_kmem_enabled(void)
return static_branch_unlikely(&memcg_kmem_enabled_key);
}

+static inline bool memcg_kmem_bypass(void)
+{
+ if (in_interrupt() || !current->mm || (current->flags & PF_KTHREAD))
+ return true;
+ return false;
+}
+
static inline int memcg_kmem_charge(struct page *page, gfp_t gfp, int order)
{
if (memcg_kmem_enabled())
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 761b646eb968..d57f95177aec 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2992,13 +2992,6 @@ static void memcg_schedule_kmem_cache_create(struct mem_cgroup *memcg,
queue_work(memcg_kmem_cache_wq, &cw->work);
}

-static inline bool memcg_kmem_bypass(void)
-{
- if (in_interrupt() || !current->mm || (current->flags & PF_KTHREAD))
- return true;
- return false;
-}
-
/**
* memcg_kmem_get_cache: select the correct per-memcg cache for allocation
* @cachep: the original global kmem cache
--
2.21.0

2019-09-16 12:41:48

by Johannes Weiner

Subject: Re: [PATCH RFC 04/14] mm: vmstat: convert slab vmstat counter to bytes

On Thu, Sep 05, 2019 at 02:45:48PM -0700, Roman Gushchin wrote:
> In order to prepare for per-object slab memory accounting,
> convert NR_SLAB_RECLAIMABLE and NR_SLAB_UNRECLAIMABLE vmstat
> items to bytes.
>
> To make sure that these vmstats are in bytes, rename them
> to NR_SLAB_RECLAIMABLE_B and NR_SLAB_UNRECLAIMABLE_B (similar to
> NR_KERNEL_STACK_KB).
>
> The size of slab memory shouldn't exceed 4Gb on 32-bit machines,
> so it will fit into atomic_long_t we use for vmstats.
>
> Signed-off-by: Roman Gushchin <[email protected]>

Maybe a crazy idea, but instead of mixing bytes and pages, would it be
difficult to account all vmstat items in bytes internally? And provide
two general APIs, byte- and page-based, to update and query the counts,
instead of tying the unit to individual items?

The vmstat_item_in_bytes() conditional shifting is pretty awkward in
code that has a recent history littered with subtle breakages.

The translation helper node_page_state_pages() will yield garbage if
used with the page-based counters, which is another easy to misuse
interface.

We already have many places that multiply with PAGE_SIZE to get the
stats in bytes or kb units.

And _B/_KB suffixes are kinda clunky.

The stats use atomic_long_t, so switching to atomic64_t doesn't make a
difference on 64-bit and is backward compatible with 32-bit.

The per-cpu batch size you have to raise from s8 either way.

It seems to me that would make the code and API a lot simpler and
easier to use / harder to misuse.
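
For what it's worth, a minimal sketch of the kind of dual interface suggested
above (the toy_* names are hypothetical, not an existing kernel API): the
counters are kept in bytes internally, and page-granular callers go through
thin wrappers.

#define TOY_PAGE_SHIFT 12

/* a single per-node counter, kept in bytes internally */
static long toy_node_stat_bytes;

static void toy_mod_node_state_bytes(long bytes)
{
        toy_node_stat_bytes += bytes;   /* the kernel would use atomics/percpu */
}

/* page-based update and query wrappers for existing callers */
static void toy_mod_node_state_pages(long pages)
{
        toy_mod_node_state_bytes(pages << TOY_PAGE_SHIFT);
}

static long toy_node_state_pages(void)
{
        return toy_node_stat_bytes >> TOY_PAGE_SHIFT;
}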

2019-09-17 08:31:27

by Roman Gushchin

Subject: Re: [PATCH RFC 04/14] mm: vmstat: convert slab vmstat counter to bytes

On Mon, Sep 16, 2019 at 02:38:40PM +0200, Johannes Weiner wrote:
> On Thu, Sep 05, 2019 at 02:45:48PM -0700, Roman Gushchin wrote:
> > In order to prepare for per-object slab memory accounting,
> > convert NR_SLAB_RECLAIMABLE and NR_SLAB_UNRECLAIMABLE vmstat
> > items to bytes.
> >
> > To make sure that these vmstats are in bytes, rename them
> > to NR_SLAB_RECLAIMABLE_B and NR_SLAB_UNRECLAIMABLE_B (similar to
> > NR_KERNEL_STACK_KB).
> >
> > The size of slab memory shouldn't exceed 4Gb on 32-bit machines,
> > so it will fit into atomic_long_t we use for vmstats.
> >
> > Signed-off-by: Roman Gushchin <[email protected]>

Hello Johannes!

Thank you for looking into the patchset!

>
> Maybe a crazy idea, but instead of mixing bytes and pages, would it be
> difficult to account all vmstat items in bytes internally? And provide
> two general apis, byte and page based, to update and query the counts,
> instead of tying the unit it to individual items?
>
> The vmstat_item_in_bytes() conditional shifting is pretty awkward in
> code that has a recent history littered with subtle breakages.
>
> The translation helper node_page_state_pages() will yield garbage if
> used with the page-based counters, which is another easy to misuse
> interface.
>
> We already have many places that multiply with PAGE_SIZE to get the
> stats in bytes or kb units.
>
> And _B/_KB suffixes are kinda clunky.
>
> The stats use atomic_long_t, so switching to atomic64_t doesn't make a
> difference on 64-bit and is backward compatible with 32-bit.

I fully agree that having different stats in different units
adds a lot of mess to the code. But I always thought that 64-bit
atomics are slow on 32-bit machines, so it might cause a noticeable
performance regression. Don't you think so?

I'm happy to prepare such a patch(set), only I'd prefer to keep it
separate from this one. It can precede or follow the slab controller
rework; either way will work. The slab controller rework is already not
small, so adding more code (and potential issues) here would only make
the review more complex.

>
> The per-cpu batch size you have to raise from s8 either way.

Yeah, tbh I don't know why those aren't just unsigned long by default.
The space savings are marginal here, and I don't see any other reasons.
It could even be slightly faster to use a larger type.

I tried to keep the patchset as small as possible (at least for
the RFC version), so I tried to avoid any unnecessary changes.
But overall, using s8 or s16 here doesn't make much sense to me.

>
> It seems to me that would make the code and API a lot simpler and
> easier to use / harder to misuse.

2019-09-17 20:18:14

by Waiman Long

Subject: Re: [PATCH RFC 00/14] The new slab memory controller

On 9/5/19 5:45 PM, Roman Gushchin wrote:
> The existing slab memory controller is based on the idea of replicating
> slab allocator internals for each memory cgroup. This approach promises
> a low memory overhead (one pointer per page), and isn't adding too much
> code on hot allocation and release paths. But is has a very serious flaw:
> it leads to a low slab utilization.
>
> Using a drgn* script I've got an estimation of slab utilization on
> a number of machines running different production workloads. In most
> cases it was between 45% and 65%, and the best number I've seen was
> around 85%. Turning kmem accounting off brings it to high 90s. Also
> it brings back 30-50% of slab memory. It means that the real price
> of the existing slab memory controller is way bigger than a pointer
> per page.
>
> The real reason why the existing design leads to a low slab utilization
> is simple: slab pages are used exclusively by one memory cgroup.
> If there are only few allocations of certain size made by a cgroup,
> or if some active objects (e.g. dentries) are left after the cgroup is
> deleted, or the cgroup contains a single-threaded application which is
> barely allocating any kernel objects, but does it every time on a new CPU:
> in all these cases the resulting slab utilization is very low.
> If kmem accounting is off, the kernel is able to use free space
> on slab pages for other allocations.
>
> Arguably it wasn't an issue back to days when the kmem controller was
> introduced and was an opt-in feature, which had to be turned on
> individually for each memory cgroup. But now it's turned on by default
> on both cgroup v1 and v2. And modern systemd-based systems tend to
> create a large number of cgroups.
>
> This patchset provides a new implementation of the slab memory controller,
> which aims to reach a much better slab utilization by sharing slab pages
> between multiple memory cgroups. Below is the short description of the new
> design (more details in commit messages).
>
> Accounting is performed per-object instead of per-page. Slab-related
> vmstat counters are converted to bytes. Charging is performed on page-basis,
> with rounding up and remembering leftovers.
>
> Memcg ownership data is stored in a per-slab-page vector: for each slab page
> a vector of corresponding size is allocated. To keep slab memory reparenting
> working, instead of saving a pointer to the memory cgroup directly an
> intermediate object is used. It's simply a pointer to a memcg (which can be
> easily changed to the parent) with a built-in reference counter. This scheme
> allows to reparent all allocated objects without walking them over and changing
> memcg pointer to the parent.
>
> Instead of creating an individual set of kmem_caches for each memory cgroup,
> two global sets are used: the root set for non-accounted and root-cgroup
> allocations and the second set for all other allocations. This allows to
> simplify the lifetime management of individual kmem_caches: they are destroyed
> with root counterparts. It allows to remove a good amount of code and make
> things generally simpler.
>
> The patchset contains a couple of semi-independent parts, which can find their
> usage outside of the slab memory controller too:
> 1) subpage charging API, which can be used in the future for accounting of
> other non-page-sized objects, e.g. percpu allocations.
> 2) mem_cgroup_ptr API (refcounted pointers to a memcg, can be reused
> for the efficient reparenting of other objects, e.g. pagecache.
>
> The patchset has been tested on a number of different workloads in our
> production. In all cases, it saved hefty amounts of memory:
> 1) web frontend, 650-700 Mb, ~42% of slab memory
> 2) database cache, 750-800 Mb, ~35% of slab memory
> 3) dns server, 700 Mb, ~36% of slab memory
>
> So far I haven't found any regression on all tested workloads, but
> potential CPU regression caused by more precise accounting is a concern.
>
> Obviously the amount of saved memory depend on the number of memory cgroups,
> uptime and specific workloads, but overall it feels like the new controller
> saves 30-40% of slab memory, sometimes more. Additionally, it should lead
> to a lower memory fragmentation, just because of a smaller number of
> non-movable pages and also because there is no more need to move all
> slab objects to a new set of pages when a workload is restarted in a new
> memory cgroup.
>
> * https://github.com/osandov/drgn
>
>
> Roman Gushchin (14):
> mm: memcg: subpage charging API
> mm: memcg: introduce mem_cgroup_ptr
> mm: vmstat: use s32 for vm_node_stat_diff in struct per_cpu_nodestat
> mm: vmstat: convert slab vmstat counter to bytes
> mm: memcg/slab: allocate space for memcg ownership data for non-root
> slabs
> mm: slub: implement SLUB version of obj_to_index()
> mm: memcg/slab: save memcg ownership data for non-root slab objects
> mm: memcg: move memcg_kmem_bypass() to memcontrol.h
> mm: memcg: introduce __mod_lruvec_memcg_state()
> mm: memcg/slab: charge individual slab objects instead of pages
> mm: memcg: move get_mem_cgroup_from_current() to memcontrol.h
> mm: memcg/slab: replace memcg_from_slab_page() with
> memcg_from_slab_obj()
> mm: memcg/slab: use one set of kmem_caches for all memory cgroups
> mm: slab: remove redundant check in memcg_accumulate_slabinfo()
>
> drivers/base/node.c | 11 +-
> fs/proc/meminfo.c | 4 +-
> include/linux/memcontrol.h | 102 ++++++++-
> include/linux/mm_types.h | 5 +-
> include/linux/mmzone.h | 12 +-
> include/linux/slab.h | 3 +-
> include/linux/slub_def.h | 9 +
> include/linux/vmstat.h | 8 +
> kernel/power/snapshot.c | 2 +-
> mm/list_lru.c | 12 +-
> mm/memcontrol.c | 431 +++++++++++++++++++++--------------
> mm/oom_kill.c | 2 +-
> mm/page_alloc.c | 8 +-
> mm/slab.c | 37 ++-
> mm/slab.h | 300 +++++++++++++------------
> mm/slab_common.c | 449 ++++---------------------------------
> mm/slob.c | 12 +-
> mm/slub.c | 63 ++----
> mm/vmscan.c | 3 +-
> mm/vmstat.c | 38 +++-
> mm/workingset.c | 6 +-
> 21 files changed, 683 insertions(+), 834 deletions(-)
>
I can only see the first 9 patches. Patches 10-14 are not there.

Cheers,
Longman

2019-09-18 00:36:30

by Roman Gushchin

Subject: Re: [PATCH RFC 00/14] The new slab memory controller

On Tue, Sep 17, 2019 at 03:48:57PM -0400, Waiman Long wrote:
> On 9/5/19 5:45 PM, Roman Gushchin wrote:
> > The existing slab memory controller is based on the idea of replicating
> > slab allocator internals for each memory cgroup. This approach promises
> > a low memory overhead (one pointer per page), and isn't adding too much
> > code on hot allocation and release paths. But is has a very serious flaw:
> > it leads to a low slab utilization.
> >
> > Using a drgn* script I've got an estimation of slab utilization on
> > a number of machines running different production workloads. In most
> > cases it was between 45% and 65%, and the best number I've seen was
> > around 85%. Turning kmem accounting off brings it to high 90s. Also
> > it brings back 30-50% of slab memory. It means that the real price
> > of the existing slab memory controller is way bigger than a pointer
> > per page.
> >
> > The real reason why the existing design leads to a low slab utilization
> > is simple: slab pages are used exclusively by one memory cgroup.
> > If there are only few allocations of certain size made by a cgroup,
> > or if some active objects (e.g. dentries) are left after the cgroup is
> > deleted, or the cgroup contains a single-threaded application which is
> > barely allocating any kernel objects, but does it every time on a new CPU:
> > in all these cases the resulting slab utilization is very low.
> > If kmem accounting is off, the kernel is able to use free space
> > on slab pages for other allocations.
> >
> > Arguably it wasn't an issue back to days when the kmem controller was
> > introduced and was an opt-in feature, which had to be turned on
> > individually for each memory cgroup. But now it's turned on by default
> > on both cgroup v1 and v2. And modern systemd-based systems tend to
> > create a large number of cgroups.
> >
> > This patchset provides a new implementation of the slab memory controller,
> > which aims to reach a much better slab utilization by sharing slab pages
> > between multiple memory cgroups. Below is the short description of the new
> > design (more details in commit messages).
> >
> > Accounting is performed per-object instead of per-page. Slab-related
> > vmstat counters are converted to bytes. Charging is performed on page-basis,
> > with rounding up and remembering leftovers.
> >
> > Memcg ownership data is stored in a per-slab-page vector: for each slab page
> > a vector of corresponding size is allocated. To keep slab memory reparenting
> > working, instead of saving a pointer to the memory cgroup directly an
> > intermediate object is used. It's simply a pointer to a memcg (which can be
> > easily changed to the parent) with a built-in reference counter. This scheme
> > allows to reparent all allocated objects without walking them over and changing
> > memcg pointer to the parent.
> >
> > Instead of creating an individual set of kmem_caches for each memory cgroup,
> > two global sets are used: the root set for non-accounted and root-cgroup
> > allocations and the second set for all other allocations. This allows to
> > simplify the lifetime management of individual kmem_caches: they are destroyed
> > with root counterparts. It allows to remove a good amount of code and make
> > things generally simpler.
> >
> > The patchset contains a couple of semi-independent parts, which can find their
> > usage outside of the slab memory controller too:
> > 1) subpage charging API, which can be used in the future for accounting of
> > other non-page-sized objects, e.g. percpu allocations.
> > 2) mem_cgroup_ptr API (refcounted pointers to a memcg, can be reused
> > for the efficient reparenting of other objects, e.g. pagecache.
> >
> > The patchset has been tested on a number of different workloads in our
> > production. In all cases, it saved hefty amounts of memory:
> > 1) web frontend, 650-700 Mb, ~42% of slab memory
> > 2) database cache, 750-800 Mb, ~35% of slab memory
> > 3) dns server, 700 Mb, ~36% of slab memory
> >
> > So far I haven't found any regression on all tested workloads, but
> > potential CPU regression caused by more precise accounting is a concern.
> >
> > Obviously the amount of saved memory depend on the number of memory cgroups,
> > uptime and specific workloads, but overall it feels like the new controller
> > saves 30-40% of slab memory, sometimes more. Additionally, it should lead
> > to a lower memory fragmentation, just because of a smaller number of
> > non-movable pages and also because there is no more need to move all
> > slab objects to a new set of pages when a workload is restarted in a new
> > memory cgroup.
> >
> > * https://github.com/osandov/drgn
> >
> >
> > Roman Gushchin (14):
> > mm: memcg: subpage charging API
> > mm: memcg: introduce mem_cgroup_ptr
> > mm: vmstat: use s32 for vm_node_stat_diff in struct per_cpu_nodestat
> > mm: vmstat: convert slab vmstat counter to bytes
> > mm: memcg/slab: allocate space for memcg ownership data for non-root
> > slabs
> > mm: slub: implement SLUB version of obj_to_index()
> > mm: memcg/slab: save memcg ownership data for non-root slab objects
> > mm: memcg: move memcg_kmem_bypass() to memcontrol.h
> > mm: memcg: introduce __mod_lruvec_memcg_state()
> > mm: memcg/slab: charge individual slab objects instead of pages
> > mm: memcg: move get_mem_cgroup_from_current() to memcontrol.h
> > mm: memcg/slab: replace memcg_from_slab_page() with
> > memcg_from_slab_obj()
> > mm: memcg/slab: use one set of kmem_caches for all memory cgroups
> > mm: slab: remove redundant check in memcg_accumulate_slabinfo()
> >
> > drivers/base/node.c | 11 +-
> > fs/proc/meminfo.c | 4 +-
> > include/linux/memcontrol.h | 102 ++++++++-
> > include/linux/mm_types.h | 5 +-
> > include/linux/mmzone.h | 12 +-
> > include/linux/slab.h | 3 +-
> > include/linux/slub_def.h | 9 +
> > include/linux/vmstat.h | 8 +
> > kernel/power/snapshot.c | 2 +-
> > mm/list_lru.c | 12 +-
> > mm/memcontrol.c | 431 +++++++++++++++++++++--------------
> > mm/oom_kill.c | 2 +-
> > mm/page_alloc.c | 8 +-
> > mm/slab.c | 37 ++-
> > mm/slab.h | 300 +++++++++++++------------
> > mm/slab_common.c | 449 ++++---------------------------------
> > mm/slob.c | 12 +-
> > mm/slub.c | 63 ++----
> > mm/vmscan.c | 3 +-
> > mm/vmstat.c | 38 +++-
> > mm/workingset.c | 6 +-
> > 21 files changed, 683 insertions(+), 834 deletions(-)
> >
> I can only see the first 9 patches. Patches 10-14 are not there.

Hm, strange. I'll rebase the patchset on top of the current mm tree and resend.

In the meantime you can find the original patchset here:
https://github.com/rgushchin/linux/tree/new_slab.rfc
or on top of the 5.3 release, which might be better for testing here:
https://github.com/rgushchin/linux/tree/new_slab.rfc.v5.3

Thanks!

2019-09-19 13:57:16

by Suleiman Souhlal

Subject: Re: [PATCH RFC 00/14] The new slab memory controller

On Fri, Sep 6, 2019 at 6:57 AM Roman Gushchin <[email protected]> wrote:
> The patchset has been tested on a number of different workloads in our
> production. In all cases, it saved hefty amounts of memory:
> 1) web frontend, 650-700 Mb, ~42% of slab memory
> 2) database cache, 750-800 Mb, ~35% of slab memory
> 3) dns server, 700 Mb, ~36% of slab memory

Do these workloads cycle through a lot of different memcgs?

For workloads that don't, wouldn't this approach potentially use more
memory? For example, a workload where everything is in one or two
memcgs, and those memcgs last forever.

-- Suleiman

2019-09-19 22:10:00

by Roman Gushchin

Subject: Re: [PATCH RFC 00/14] The new slab memory controller

On Thu, Sep 19, 2019 at 10:39:18PM +0900, Suleiman Souhlal wrote:
> On Fri, Sep 6, 2019 at 6:57 AM Roman Gushchin <[email protected]> wrote:
> > The patchset has been tested on a number of different workloads in our
> > production. In all cases, it saved hefty amounts of memory:
> > 1) web frontend, 650-700 Mb, ~42% of slab memory
> > 2) database cache, 750-800 Mb, ~35% of slab memory
> > 3) dns server, 700 Mb, ~36% of slab memory
>
> Do these workloads cycle through a lot of different memcgs?

Not really, those are just plain services managed by systemd.
They aren't restarted too often, maybe several times per day at most.

Also, there is nothing fb-specific. You can take any modern
distribution (I've tried Fedora 30), boot it up and look at the
amount of slab memory. The numbers are roughly the same.

>
> For workloads that don't, wouldn't this approach potentially use more
> memory? For example, a workload where everything is in one or two
> memcgs, and those memcgs last forever.
>

Yes, it's true: if you have a very small and fixed number of memory cgroups,
in theory the new approach can take ~10% more memory.

I don't think it's such a big problem though: it seems that the majority
of cgroup users have a lot of them, and they are dynamically created and
destroyed by systemd/kubernetes/whatever else.

And if somebody has a very special setup with only 1-2 cgroups, arguably
kernel memory accounting isn't such a big thing for them, so it can simply
be disabled. Am I wrong, and do you have a real-life example?

Thanks!

Roman

2019-09-20 06:57:21

by Suleiman Souhlal

Subject: Re: [PATCH RFC 00/14] The new slab memory controller

On Fri, Sep 20, 2019 at 1:22 AM Roman Gushchin <[email protected]> wrote:
>
> On Thu, Sep 19, 2019 at 10:39:18PM +0900, Suleiman Souhlal wrote:
> > On Fri, Sep 6, 2019 at 6:57 AM Roman Gushchin <[email protected]> wrote:
> > > The patchset has been tested on a number of different workloads in our
> > > production. In all cases, it saved hefty amounts of memory:
> > > 1) web frontend, 650-700 Mb, ~42% of slab memory
> > > 2) database cache, 750-800 Mb, ~35% of slab memory
> > > 3) dns server, 700 Mb, ~36% of slab memory
> >
> > Do these workloads cycle through a lot of different memcgs?
>
> Not really, those are just plain services managed by systemd.
> They aren't restarted too often, maybe several times per day at most.
>
> Also, there is nothing fb-specific. You can take any new modern
> distributive (I've tried Fedora 30), boot it up and look at the
> amount of slab memory. Numbers are roughly the same.

Ah, ok.
These numbers are kind of surprising to me.
Do you know if the savings are similar if you use CONFIG_SLAB instead
of CONFIG_SLUB?

> > For workloads that don't, wouldn't this approach potentially use more
> > memory? For example, a workload where everything is in one or two
> > memcgs, and those memcgs last forever.
> >
>
> Yes, it's true, if you have a very small and fixed number of memory cgroups,
> in theory the new approach can take ~10% more memory.
>
> I don't think it's such a big problem though: it seems that the majority
> of cgroup users have a lot of them, and they are dynamically created and
> destroyed by systemd/kubernetes/whatever else.
>
> And if somebody has a very special setup with only 1-2 cgroups, arguably
> kernel memory accounting isn't such a big thing for them, so it can be simple
> disabled. Am I wrong and do you have a real-life example?

No, I don't have any specific examples.

-- Suleiman

2019-09-20 07:43:08

by Roman Gushchin

Subject: Re: [PATCH RFC 00/14] The new slab memory controller

On Fri, Sep 20, 2019 at 06:10:11AM +0900, Suleiman Souhlal wrote:
> On Fri, Sep 20, 2019 at 1:22 AM Roman Gushchin <[email protected]> wrote:
> >
> > On Thu, Sep 19, 2019 at 10:39:18PM +0900, Suleiman Souhlal wrote:
> > > On Fri, Sep 6, 2019 at 6:57 AM Roman Gushchin <[email protected]> wrote:
> > > > The patchset has been tested on a number of different workloads in our
> > > > production. In all cases, it saved hefty amounts of memory:
> > > > 1) web frontend, 650-700 Mb, ~42% of slab memory
> > > > 2) database cache, 750-800 Mb, ~35% of slab memory
> > > > 3) dns server, 700 Mb, ~36% of slab memory
> > >
> > > Do these workloads cycle through a lot of different memcgs?
> >
> > Not really, those are just plain services managed by systemd.
> > They aren't restarted too often, maybe several times per day at most.
> >
> > Also, there is nothing fb-specific. You can take any new modern
> > distributive (I've tried Fedora 30), boot it up and look at the
> > amount of slab memory. Numbers are roughly the same.
>
> Ah, ok.
> These numbers are kind of surprising to me.
> Do you know if the savings are similar if you use CONFIG_SLAB instead
> of CONFIG_SLUB?

I did only brief testing of the SLAB version: the savings were there, the
numbers were slightly less impressive, but still in the double-digit
percentage range.

>
> > > For workloads that don't, wouldn't this approach potentially use more
> > > memory? For example, a workload where everything is in one or two
> > > memcgs, and those memcgs last forever.
> > >
> >
> > Yes, it's true, if you have a very small and fixed number of memory cgroups,
> > in theory the new approach can take ~10% more memory.
> >
> > I don't think it's such a big problem though: it seems that the majority
> > of cgroup users have a lot of them, and they are dynamically created and
> > destroyed by systemd/kubernetes/whatever else.
> >
> > And if somebody has a very special setup with only 1-2 cgroups, arguably
> > kernel memory accounting isn't such a big thing for them, so it can be simple
> > disabled. Am I wrong and do you have a real-life example?
>
> No, I don't have any specific examples.
>
> -- Suleiman

2019-10-01 15:13:02

by Michal Koutný

Subject: Re: [PATCH RFC 00/14] The new slab memory controller

On Thu, Sep 05, 2019 at 02:45:44PM -0700, Roman Gushchin <[email protected]> wrote:
> Roman Gushchin (14):
> [...]
> mm: memcg/slab: use one set of kmem_caches for all memory cgroups
From that commit's message:

> 6) obsoletes kmem.slabinfo cgroup v1 interface file, as there are
> no per-memcg kmem_caches anymore (empty output is printed)

An empty file would normally mean that no allocations took place in the
particular cgroup, so I find this quite a surprising change for consumers
of these stats.

I understand obtaining the same data efficiently from the proposed
structures is difficult, however, such a change should be avoided. (In
my understanding, obsoleted file ~ not available in v2, however, it
should not disappear from v1.)

Michal



2019-10-02 06:30:01

by Roman Gushchin

Subject: Re: [PATCH RFC 00/14] The new slab memory controller

On Tue, Oct 01, 2019 at 05:12:02PM +0200, Michal Koutný wrote:
> On Thu, Sep 05, 2019 at 02:45:44PM -0700, Roman Gushchin <[email protected]> wrote:
> > Roman Gushchin (14):
> > [...]
> > mm: memcg/slab: use one set of kmem_caches for all memory cgroups
> From that commit's message:
>
> > 6) obsoletes kmem.slabinfo cgroup v1 interface file, as there are
> > no per-memcg kmem_caches anymore (empty output is printed)
>
> The empty file means no allocations took place in the particular cgroup.
> I find this quite a surprising change for consumers of these stats.
>
> I understand obtaining the same data efficiently from the proposed
> structures is difficult, however, such a change should be avoided. (In
> my understanding, obsoleted file ~ not available in v2, however, it
> should not disappear from v1.)

Well, my assumption is that nobody is using this file for anything except
debugging purposes (I might be wrong; if somebody has automation based
on it, please let me know). The number of allocations of each type per memory
cgroup is definitely useful debug information, but currently it barely works
(the displayed numbers mostly show the number of allocated pages, not the
number of active objects). We can support it, but it comes with a price, and
most users don't really need it. So I don't think it's worth it to make all
allocations slower just to keep some debug interface working for some
cgroup v1 users. Do you have examples where it's really useful and worth the
extra CPU cost?

Unfortunately, we can't enable it conditionally, as a user can switch
between cgroup v1 and cgroup v2 memory controllers dynamically.

Thanks!

2019-10-02 13:31:42

by Suleiman Souhlal

Subject: Re: [PATCH RFC 00/14] The new slab memory controller

On Wed, Oct 2, 2019 at 11:09 AM Roman Gushchin <[email protected]> wrote:
>
> On Tue, Oct 01, 2019 at 05:12:02PM +0200, Michal Koutný wrote:
> > On Thu, Sep 05, 2019 at 02:45:44PM -0700, Roman Gushchin <[email protected]> wrote:
> > > Roman Gushchin (14):
> > > [...]
> > > mm: memcg/slab: use one set of kmem_caches for all memory cgroups
> > From that commit's message:
> >
> > > 6) obsoletes kmem.slabinfo cgroup v1 interface file, as there are
> > > no per-memcg kmem_caches anymore (empty output is printed)
> >
> > The empty file means no allocations took place in the particular cgroup.
> > I find this quite a surprising change for consumers of these stats.
> >
> > I understand obtaining the same data efficiently from the proposed
> > structures is difficult, however, such a change should be avoided. (In
> > my understanding, obsoleted file ~ not available in v2, however, it
> > should not disappear from v1.)
>
> Well, my assumption is that nobody is using this file for anything except
> debugging purposes (I might be wrong; if somebody has automation based
> on it, please let me know). The number of allocations of each type per memory
> cgroup is definitely useful debug information, but currently it barely works
> (the displayed numbers mostly show the number of allocated pages, not the number
> of active objects). We can support it, but it comes with a price, and
> most users don't really need it. So I don't think it's worth making all
> allocations slower just to keep some debug interface working for some
> cgroup v1 users. Do you have examples where it's really useful and worth
> the extra cpu cost?
>
> Unfortunately, we can't enable it conditionally, as a user can switch
> between cgroup v1 and cgroup v2 memory controllers dynamically.

kmem.slabinfo has been absolutely invaluable for debugging, in my experience.
I am however not aware of any automation based on it.

Maybe it might be worth adding it to cgroup v2 and have a CONFIG
option to enable it?

-- Suleiman

2019-10-03 10:49:37

by Michal Koutný

[permalink] [raw]
Subject: Re: [PATCH RFC 00/14] The new slab memory controller

On Wed, Oct 02, 2019 at 10:00:07PM +0900, Suleiman Souhlal <[email protected]> wrote:
> kmem.slabinfo has been absolutely invaluable for debugging, in my experience.
> I am however not aware of any automation based on it.
My experience is the same. However, the point is that this has been
exposed for ages, so the safe assumption is that there may be users.

> Maybe it might be worth adding it to cgroup v2 and have a CONFIG
> option to enable it?
I don't think a v2 file is necessary given the cost of obtaining the
information. But I concur with the idea of making the per-object tracking
switchable at boot time or at least CONFIGurable.
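
Something along these lines would do for the boot-time part (a purely
illustrative sketch; the option name and the Kconfig symbol are made up):

/*
 * Hypothetical switch for per-object slab tracking: the default comes
 * from a (made-up) Kconfig option, and it can be overridden on the
 * kernel command line with memcg_slab_tracking=0/1.
 */
#include <linux/init.h>
#include <linux/kernel.h>

static bool memcg_slab_tracking = IS_ENABLED(CONFIG_MEMCG_SLAB_TRACKING);

static int __init memcg_slab_tracking_setup(char *str)
{
	if (kstrtobool(str, &memcg_slab_tracking))
		pr_warn("invalid memcg_slab_tracking= value '%s'\n", str);
	return 1;
}
__setup("memcg_slab_tracking=", memcg_slab_tracking_setup);

The allocation hot path would then check this flag (or, better, a static
key) before touching any per-object tracking data.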

Michal



2019-10-03 16:00:21

by Roman Gushchin

[permalink] [raw]
Subject: Re: [PATCH RFC 00/14] The new slab memory controller

On Thu, Oct 03, 2019 at 12:47:41PM +0200, Michal Koutný wrote:
> On Wed, Oct 02, 2019 at 10:00:07PM +0900, Suleiman Souhlal <[email protected]> wrote:
> > kmem.slabinfo has been absolutely invaluable for debugging, in my experience.
> > I am however not aware of any automation based on it.
> My experience is the same. However, the point is that this has been
> exposed for ages, so the safe assumption is that there may be users.

Yes, but kernel memory accounting was an opt-in feature for years,
and it can also be disabled at boot time (cgroup.memory=nokmem), so displaying
an empty memory.kmem.slabinfo file doesn't break the interface.

>
> > Maybe it might be worth adding it to cgroup v2 and have a CONFIG
> > option to enable it?
> I don't think a v2 file is necessary given the cost of obtaining the
> information. But I concur with the idea of making the per-object tracking
> switchable at boot time or at least CONFIGurable.

As I said, the cost is the same and should be paid in any case,
no matter if cgroup v1 or v2 is used. A user can dynamically switch
between v1 and v2, and there is no way to obtain this information
afterwards, so we need to collect it from scratch.

Another concern I have is that it will require adding a non-trivial amount
of new code (as we no longer create and destroy kmem_caches dynamically).
It's perfectly doable, but I'm not sure we need it badly enough
to postpone merging the main part. But I'm happy to hear
any arguments why that's not true.

Thanks!

2019-12-09 09:29:53

by Bharata B Rao

[permalink] [raw]
Subject: Re: [PATCH 00/16] The new slab memory controller

Hi,

I see the below crash during early boot when I try this patchset on
PowerPC host. I am on new_slab.rfc.v5.3 branch.

BUG: Unable to handle kernel data access at 0x81030236d1814578
Faulting instruction address: 0xc0000000002cc314
Oops: Kernel access of bad area, sig: 11 [#1]
LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA PowerNV
Modules linked in: ip_tables x_tables autofs4 sr_mod cdrom usbhid bnx2x crct10dif_vpmsum crct10dif_common mdio libcrc32c crc32c_vpmsum
CPU: 31 PID: 1752 Comm: keyboard-setup. Not tainted 5.3.0-g9bd85fd72a0c #155
NIP: c0000000002cc314 LR: c0000000002cc2e8 CTR: 0000000000000000
REGS: c000001e40f378b0 TRAP: 0380 Not tainted (5.3.0-g9bd85fd72a0c)
MSR: 900000010280b033 <SF,HV,VEC,VSX,EE,FP,ME,IR,DR,RI,LE,TM[E]> CR: 44022224 XER: 00000000
CFAR: c0000000002c6ad4 IRQMASK: 1
GPR00: c0000000000b8a40 c000001e40f37b40 c000000000ed9600 0000000000000000
GPR04: 0000000000000023 0000000000000010 c000001e40f37b24 c000001e3cba3400
GPR08: 0000000000000020 81030218815f4578 0000001e50220000 0000000000000030
GPR12: 0000000000002200 c000001fff774d80 0000000000000000 00000001072600d8
GPR16: 0000000000000000 c0000000000bbaac 0000000000000000 0000000000000000
GPR20: c000001e40f37c48 0000000000000001 0000000000000000 c000001e3cba3400
GPR24: c000001e40f37dd8 0000000000000000 c000000000fa0d58 0000000000000000
GPR28: c000001e3a080080 c000001e32da0100 0000000000000118 0000000000000010
NIP [c0000000002cc314] __mod_memcg_state+0x58/0xd0
LR [c0000000002cc2e8] __mod_memcg_state+0x2c/0xd0
Call Trace:
[c000001e40f37b90] [c0000000000b8a40] account_kernel_stack+0xa4/0xe4
[c000001e40f37bd0] [c0000000000ba4a4] copy_process+0x2b4/0x16f0
[c000001e40f37cf0] [c0000000000bbaac] _do_fork+0x9c/0x3e4
[c000001e40f37db0] [c0000000000bc030] sys_clone+0x74/0xa8
[c000001e40f37e20] [c00000000000bb34] ppc_clone+0x8/0xc
Instruction dump:
4bffa7e9 2fa30000 409e007c 395efffb 3d000020 2b8a0001 409d0008 39000020
e93d0718 e94d0028 7bde1f24 7d29f214 <7ca9502a> 7fff2a14 7fe9fe76 7d27fa78

Looks like page->mem_cgroup_vec is allocated but not yet initialized
with memcg pointers when we try to access them.

I did get past the crash by initializing the pointers like this
in account_kernel_stack(), but I am pretty sure that this is not the
place to do this:

diff --git a/kernel/fork.c b/kernel/fork.c
index 541fd805fb88..be21419feae2 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -380,13 +380,26 @@ static void account_kernel_stack(struct task_struct *tsk, int account)
* All stack pages are in the same zone and belong to the
* same memcg.
*/
- struct page *first_page = virt_to_page(stack);
+ struct page *first_page = virt_to_head_page((stack));
+ unsigned long off;
+ struct mem_cgroup_ptr *memcg_ptr;
+ struct mem_cgroup *memcg;

mod_zone_page_state(page_zone(first_page), NR_KERNEL_STACK_KB,
THREAD_SIZE / 1024 * account);

- mod_memcg_page_state(first_page, MEMCG_KERNEL_STACK_KB,
+ if (!first_page->mem_cgroup_vec)
+ return;
+ off = obj_to_index(task_struct_cachep, first_page, stack);
+ memcg_ptr = first_page->mem_cgroup_vec[off];
+ if (!memcg_ptr)
+ return;
+ rcu_read_lock();
+ memcg = memcg_ptr->memcg;
+ if (memcg)
+ __mod_memcg_state(memcg, MEMCG_KERNEL_STACK_KB,
account * (THREAD_SIZE / 1024));
+ rcu_read_unlock();
}
}

Regards,
Bharata.

2019-12-09 11:57:42

by Bharata B Rao

[permalink] [raw]
Subject: Re: [PATCH 00/16] The new slab memory controller

On Mon, Dec 09, 2019 at 02:47:52PM +0530, Bharata B Rao wrote:
> Hi,
>
> I see the below crash during early boot when I try this patchset on
> PowerPC host. I am on new_slab.rfc.v5.3 branch.
>
> BUG: Unable to handle kernel data access at 0x81030236d1814578
> Faulting instruction address: 0xc0000000002cc314
> Oops: Kernel access of bad area, sig: 11 [#1]
> LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA PowerNV
> Modules linked in: ip_tables x_tables autofs4 sr_mod cdrom usbhid bnx2x crct10dif_vpmsum crct10dif_common mdio libcrc32c crc32c_vpmsum
> CPU: 31 PID: 1752 Comm: keyboard-setup. Not tainted 5.3.0-g9bd85fd72a0c #155
> NIP: c0000000002cc314 LR: c0000000002cc2e8 CTR: 0000000000000000
> REGS: c000001e40f378b0 TRAP: 0380 Not tainted (5.3.0-g9bd85fd72a0c)
> MSR: 900000010280b033 <SF,HV,VEC,VSX,EE,FP,ME,IR,DR,RI,LE,TM[E]> CR: 44022224 XER: 00000000
> CFAR: c0000000002c6ad4 IRQMASK: 1
> GPR00: c0000000000b8a40 c000001e40f37b40 c000000000ed9600 0000000000000000
> GPR04: 0000000000000023 0000000000000010 c000001e40f37b24 c000001e3cba3400
> GPR08: 0000000000000020 81030218815f4578 0000001e50220000 0000000000000030
> GPR12: 0000000000002200 c000001fff774d80 0000000000000000 00000001072600d8
> GPR16: 0000000000000000 c0000000000bbaac 0000000000000000 0000000000000000
> GPR20: c000001e40f37c48 0000000000000001 0000000000000000 c000001e3cba3400
> GPR24: c000001e40f37dd8 0000000000000000 c000000000fa0d58 0000000000000000
> GPR28: c000001e3a080080 c000001e32da0100 0000000000000118 0000000000000010
> NIP [c0000000002cc314] __mod_memcg_state+0x58/0xd0
> LR [c0000000002cc2e8] __mod_memcg_state+0x2c/0xd0
> Call Trace:
> [c000001e40f37b90] [c0000000000b8a40] account_kernel_stack+0xa4/0xe4
> [c000001e40f37bd0] [c0000000000ba4a4] copy_process+0x2b4/0x16f0
> [c000001e40f37cf0] [c0000000000bbaac] _do_fork+0x9c/0x3e4
> [c000001e40f37db0] [c0000000000bc030] sys_clone+0x74/0xa8
> [c000001e40f37e20] [c00000000000bb34] ppc_clone+0x8/0xc
> Instruction dump:
> 4bffa7e9 2fa30000 409e007c 395efffb 3d000020 2b8a0001 409d0008 39000020
> e93d0718 e94d0028 7bde1f24 7d29f214 <7ca9502a> 7fff2a14 7fe9fe76 7d27fa78
>
> Looks like page->mem_cgroup_vec is allocated but not yet initialized
> with memcg pointers when we try to access them.
>
> I did get past the crash by initializing the pointers like this
> in account_kernel_stack(),

The above is not an accurate description of the hack I showed below.
Essentially I am making sure that I get to the memcg corresponding
to the task_struct_cachep object in the page.

But that still doesn't explain why we don't hit this problem on x86.

> but I am pretty sure that this is not the
> place to do this:
>
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 541fd805fb88..be21419feae2 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -380,13 +380,26 @@ static void account_kernel_stack(struct task_struct *tsk, int account)
> * All stack pages are in the same zone and belong to the
> * same memcg.
> */
> - struct page *first_page = virt_to_page(stack);
> + struct page *first_page = virt_to_head_page((stack));
> + unsigned long off;
> + struct mem_cgroup_ptr *memcg_ptr;
> + struct mem_cgroup *memcg;
>
> mod_zone_page_state(page_zone(first_page), NR_KERNEL_STACK_KB,
> THREAD_SIZE / 1024 * account);
>
> - mod_memcg_page_state(first_page, MEMCG_KERNEL_STACK_KB,
> + if (!first_page->mem_cgroup_vec)
> + return;
> + off = obj_to_index(task_struct_cachep, first_page, stack);
> + memcg_ptr = first_page->mem_cgroup_vec[off];
> + if (!memcg_ptr)
> + return;
> + rcu_read_lock();
> + memcg = memcg_ptr->memcg;
> + if (memcg)
> + __mod_memcg_state(memcg, MEMCG_KERNEL_STACK_KB,
> account * (THREAD_SIZE / 1024));
> + rcu_read_unlock();
> }
> }
>
> Regards,
> Bharata.

2019-12-09 18:05:56

by Roman Gushchin

[permalink] [raw]
Subject: Re: [PATCH 00/16] The new slab memory controller

On Mon, Dec 09, 2019 at 05:26:49PM +0530, Bharata B Rao wrote:
> On Mon, Dec 09, 2019 at 02:47:52PM +0530, Bharata B Rao wrote:
> > Hi,
> >
> > I see the below crash during early boot when I try this patchset on
> > PowerPC host. I am on new_slab.rfc.v5.3 branch.
> >
> > BUG: Unable to handle kernel data access at 0x81030236d1814578
> > Faulting instruction address: 0xc0000000002cc314
> > Oops: Kernel access of bad area, sig: 11 [#1]
> > LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA PowerNV
> > Modules linked in: ip_tables x_tables autofs4 sr_mod cdrom usbhid bnx2x crct10dif_vpmsum crct10dif_common mdio libcrc32c crc32c_vpmsum
> > CPU: 31 PID: 1752 Comm: keyboard-setup. Not tainted 5.3.0-g9bd85fd72a0c #155
> > NIP: c0000000002cc314 LR: c0000000002cc2e8 CTR: 0000000000000000
> > REGS: c000001e40f378b0 TRAP: 0380 Not tainted (5.3.0-g9bd85fd72a0c)
> > MSR: 900000010280b033 <SF,HV,VEC,VSX,EE,FP,ME,IR,DR,RI,LE,TM[E]> CR: 44022224 XER: 00000000
> > CFAR: c0000000002c6ad4 IRQMASK: 1
> > GPR00: c0000000000b8a40 c000001e40f37b40 c000000000ed9600 0000000000000000
> > GPR04: 0000000000000023 0000000000000010 c000001e40f37b24 c000001e3cba3400
> > GPR08: 0000000000000020 81030218815f4578 0000001e50220000 0000000000000030
> > GPR12: 0000000000002200 c000001fff774d80 0000000000000000 00000001072600d8
> > GPR16: 0000000000000000 c0000000000bbaac 0000000000000000 0000000000000000
> > GPR20: c000001e40f37c48 0000000000000001 0000000000000000 c000001e3cba3400
> > GPR24: c000001e40f37dd8 0000000000000000 c000000000fa0d58 0000000000000000
> > GPR28: c000001e3a080080 c000001e32da0100 0000000000000118 0000000000000010
> > NIP [c0000000002cc314] __mod_memcg_state+0x58/0xd0
> > LR [c0000000002cc2e8] __mod_memcg_state+0x2c/0xd0
> > Call Trace:
> > [c000001e40f37b90] [c0000000000b8a40] account_kernel_stack+0xa4/0xe4
> > [c000001e40f37bd0] [c0000000000ba4a4] copy_process+0x2b4/0x16f0
> > [c000001e40f37cf0] [c0000000000bbaac] _do_fork+0x9c/0x3e4
> > [c000001e40f37db0] [c0000000000bc030] sys_clone+0x74/0xa8
> > [c000001e40f37e20] [c00000000000bb34] ppc_clone+0x8/0xc
> > Instruction dump:
> > 4bffa7e9 2fa30000 409e007c 395efffb 3d000020 2b8a0001 409d0008 39000020
> > e93d0718 e94d0028 7bde1f24 7d29f214 <7ca9502a> 7fff2a14 7fe9fe76 7d27fa78
> >
> > Looks like page->mem_cgroup_vec is allocated but not yet initialized
> > with memcg pointers when we try to access them.
> >
> > I did get past the crash by initializing the pointers like this
> > in account_kernel_stack(),
>
> The above is not an accurate description of the hack I showed below.
> Essentially I am making sure that I get to the memcg corresponding
> to the task_struct_cachep object in the page.

Hello, Bharata!

Thank you very much for the report and the patch, it's a good catch,
and the code looks good to me. I'll include the fix into the next
version of the patchset (I can't keep it as a separate fix due to massive
renamings/rewrites).

>
> But that still doesn't explain why we don't hit this problem on x86.

On x86 (and arm64) we're using vmap-based stacks, so the underlying memory is
allocated directly by the page allocator, bypassing the slab allocator.
It depends on CONFIG_VMAP_STACK.

Btw, thank you for looking into the patchset and trying it on powerpc.
Would you mind sharing some results?

Thank you!

Roman

2019-12-10 06:24:54

by Bharata B Rao

[permalink] [raw]
Subject: Re: [PATCH 00/16] The new slab memory controller

On Mon, Dec 09, 2019 at 06:04:22PM +0000, Roman Gushchin wrote:
> On Mon, Dec 09, 2019 at 05:26:49PM +0530, Bharata B Rao wrote:
> > On Mon, Dec 09, 2019 at 02:47:52PM +0530, Bharata B Rao wrote:
> Hello, Bharata!
>
> Thank you very much for the report and the patch, it's a good catch,
> and the code looks good to me. I'll include the fix into the next
> version of the patchset (I can't keep it as a separate fix due to massive
> renamings/rewrites).

Sure, but please note that I did post only the core change without
the associated header includes etc, where I took some short cuts.

>
> >
> > But that still doesn't explain why we don't hit this problem on x86.
>
> On x86 (and arm64) we're using vmap-based stacks, so the underlying memory is
> allocated directly by the page allocator, bypassing the slab allocator.
> It depends on CONFIG_VMAP_STACK.

I turned off CONFIG_VMAP_STACK on x86, but still don't hit any
problems.

$ grep VMAP .config
CONFIG_HAVE_ARCH_HUGE_VMAP=y
CONFIG_HAVE_ARCH_VMAP_STACK=y
# CONFIG_VMAP_STACK is not set

Maybe something else prevents this particular crash on x86?

>
> Btw, thank you for looking into the patchset and trying it on powerpc.
> Would you mind sharing some results?

Sure, I will get back with more results, but initial numbers when running
a small alpine docker image look promising.

With slab patches
# docker stats --no-stream
CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
24bc99d94d91 sleek 0.00% 1MiB / 25MiB 4.00% 1.81kB / 0B 0B / 0B 0

Without slab patches
# docker stats --no-stream
CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
52382f8aaa13 sleek 0.00% 8.688MiB / 25MiB 34.75% 1.53kB / 0B 0B / 0B 0

So that's an improvement of MEM USAGE from 8.688MiB to 1MiB. Note that this
docker container isn't doing anything useful and hence the numbers
aren't representative of any workload.

Regards,
Bharata.

2019-12-10 18:20:52

by Roman Gushchin

[permalink] [raw]
Subject: Re: [PATCH 00/16] The new slab memory controller

On Tue, Dec 10, 2019 at 11:53:08AM +0530, Bharata B Rao wrote:
> On Mon, Dec 09, 2019 at 06:04:22PM +0000, Roman Gushchin wrote:
> > On Mon, Dec 09, 2019 at 05:26:49PM +0530, Bharata B Rao wrote:
> > > On Mon, Dec 09, 2019 at 02:47:52PM +0530, Bharata B Rao wrote:
> > Hello, Bharata!
> >
> > Thank you very much for the report and the patch, it's a good catch,
> > and the code looks good to me. I'll include the fix into the next
> > version of the patchset (I can't keep it as a separate fix due to massive
> > renamings/rewrites).
>
> Sure, but please note that I did post only the core change without
> the associated header includes etc, where I took some short cuts.

Sure, I'll adapt the code for the next version.

>
> >
> > >
> > > But that still doesn't explain why we don't hit this problem on x86.
> >
> > On x86 (and arm64) we're using vmap-based stacks, so the underlying memory is
> > allocated directly by the page allocator, bypassing the slab allocator.
> > It depends on CONFIG_VMAP_STACK.
>
> I turned off CONFIG_VMAP_STACK on x86, but still don't hit any
> problems.

If you look at kernel/fork.c (around line 184), there are two ORed conditions
to bypass the slab allocator:
1) THREAD_SIZE >= PAGE_SIZE
2) CONFIG_VMAP_STACK

I guess the first one is what saves x86 in your case, while on ppc you might
have 64k pages (hard to say without looking at your config).
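
Roughly, the stack allocation path looks like this (a simplified sketch of
the kernel/fork.c logic, with the stack caching and vm_area bookkeeping
trimmed; not the exact code):

#if THREAD_SIZE >= PAGE_SIZE || defined(CONFIG_VMAP_STACK)
static unsigned long *alloc_thread_stack_node(struct task_struct *tsk, int node)
{
#ifdef CONFIG_VMAP_STACK
	/* The stack lives in vmalloc space: pages come straight from the
	 * page allocator, so the slab allocator is never involved. */
	return __vmalloc_node_range(THREAD_SIZE, THREAD_ALIGN,
				    VMALLOC_START, VMALLOC_END,
				    THREADINFO_GFP, PAGE_KERNEL, 0, node,
				    __builtin_return_address(0));
#else
	/* THREAD_SIZE >= PAGE_SIZE: an order-N page allocation,
	 * again bypassing the slab allocator. */
	struct page *page = alloc_pages_node(node, THREADINFO_GFP,
					     THREAD_SIZE_ORDER);

	return page ? page_address(page) : NULL;
#endif
}
#else
/* THREAD_SIZE < PAGE_SIZE and no CONFIG_VMAP_STACK: stacks are carved out
 * of a slab cache, so with this patchset they are accounted per-object
 * via page->mem_cgroup_vec. */
static unsigned long *alloc_thread_stack_node(struct task_struct *tsk, int node)
{
	return kmem_cache_alloc_node(thread_stack_cache, THREADINFO_GFP, node);
}
#endif

So if the stack is smaller than a 64k page and CONFIG_VMAP_STACK is off,
the allocation falls into the slab path, which is why account_kernel_stack()
ends up dealing with a slab object there.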

>
> $ grep VMAP .config
> CONFIG_HAVE_ARCH_HUGE_VMAP=y
> CONFIG_HAVE_ARCH_VMAP_STACK=y
> # CONFIG_VMAP_STACK is not set
>
> Maybe something else prevents this particular crash on x86?

I'm pretty sure it would crash had the stack been allocated using
the slab allocator. I bet everybody is using vmap-based stacks.

>
> >
> > Btw, thank you for looking into the patchset and trying it on powerpc.
> > Would you mind sharing some results?
>
> Sure, I will get back with more results, but initial numbers when running
> a small alpine docker image look promising.
>
> With slab patches
> # docker stats --no-stream
> CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
> 24bc99d94d91 sleek 0.00% 1MiB / 25MiB 4.00% 1.81kB / 0B 0B / 0B 0
>
> Without slab patches
> # docker stats --no-stream
> CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
> 52382f8aaa13 sleek 0.00% 8.688MiB / 25MiB 34.75% 1.53kB / 0B 0B / 0B 0
>
> So that's an improvement of MEM USAGE from 8.688MiB to 1MiB. Note that this
> docker container isn't doing anything useful and hence the numbers
> aren't representative of any workload.

Cool, that's great!

Small containers are where the relative win is the biggest. Of course, it will
decrease with the size of the container, but that's expected.

If you get any additional numbers, please share them. It's really
interesting, especially if you have larger-than-4k pages.

Thank you!

Roman

2020-01-13 08:48:37

by Bharata B Rao

[permalink] [raw]
Subject: Re: [PATCH 00/16] The new slab memory controller

On Tue, Dec 10, 2019 at 06:05:20PM +0000, Roman Gushchin wrote:
> >
> > With slab patches
> > # docker stats --no-stream
> > CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
> > 24bc99d94d91 sleek 0.00% 1MiB / 25MiB 4.00% 1.81kB / 0B 0B / 0B 0
> >
> > Without slab patches
> > # docker stats --no-stream
> > CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
> > 52382f8aaa13 sleek 0.00% 8.688MiB / 25MiB 34.75% 1.53kB / 0B 0B / 0B 0
> >
> > So that's an improvement of MEM USAGE from 8.688MiB to 1MiB. Note that this
> > docker container isn't doing anything useful and hence the numbers
> > aren't representative of any workload.
>
> Cool, that's great!
>
> Small containers are where the relative win is the biggest. Of course, it will
> decrease with the size of the container, but that's expected.
>
> If you get any additional numbers, please share them. It's really
> interesting, especially if you have larger-than-4k pages.

I ran a couple of workloads contained within a memory cgroup and measured
memory.kmem.usage_in_bytes and memory.usage_in_bytes with and without
this patchset on a PowerPC host. I see a significant reduction in
memory.kmem.usage_in_bytes and some reduction in memory.usage_in_bytes.
Before posting the numbers, I would like to get the following clarified:

In the original case, the memory cgroup is charged (including kmem charging)
when a new slab page is allocated. In your patch, the subpage charging is
done in the slab_pre_alloc_hook routine. However, in this case I couldn't find
where exactly the kmem counters are being charged/updated. Hence I wanted to
make sure that the reduction in memory.kmem.usage_in_bytes that I am
seeing is indeed real, and not because kmem accounting was missed out for
slab usage.

Also, I see all non-root allocations are coming from a single set of
kmem_caches. I guess the <kmemcache_name>-memcg caches don't yet show up in
/proc/slabinfo, nor are their stats accumulated into /proc/slabinfo?

Regards,
Bharata.

2020-01-13 15:32:52

by Roman Gushchin

[permalink] [raw]
Subject: Re: [PATCH 00/16] The new slab memory controller

On Mon, Jan 13, 2020 at 02:17:10PM +0530, Bharata B Rao wrote:
> On Tue, Dec 10, 2019 at 06:05:20PM +0000, Roman Gushchin wrote:
> > >
> > > With slab patches
> > > # docker stats --no-stream
> > > CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
> > > 24bc99d94d91 sleek 0.00% 1MiB / 25MiB 4.00% 1.81kB / 0B 0B / 0B 0
> > >
> > > Without slab patches
> > > # docker stats --no-stream
> > > CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
> > > 52382f8aaa13 sleek 0.00% 8.688MiB / 25MiB 34.75% 1.53kB / 0B 0B / 0B 0
> > >
> > > So that's an improvement of MEM USAGE from 8.688MiB to 1MiB. Note that this
> > > docker container isn't doing anything useful and hence the numbers
> > > aren't representative of any workload.
> >
> > Cool, that's great!
> >
> > Small containers are where the relative win is the biggest. Of course, it will
> > decrease with the size of the container, but that's expected.
> >
> > If you get any additional numbers, please share them. It's really
> > interesting, especially if you have larger-than-4k pages.
>
> I ran a couple of workloads contained within a memory cgroup and measured
> memory.kmem.usage_in_bytes and memory.usage_in_bytes with and without
> this patchset on a PowerPC host. I see a significant reduction in
> memory.kmem.usage_in_bytes and some reduction in memory.usage_in_bytes.
> Before posting the numbers, I would like to get the following clarified:
>
> In the original case, the memory cgroup is charged (including kmem charging)
> when a new slab page is allocated. In your patch, the subpage charging is
> done in the slab_pre_alloc_hook routine. However, in this case I couldn't find
> where exactly the kmem counters are being charged/updated. Hence I wanted to
> make sure that the reduction in memory.kmem.usage_in_bytes that I am
> seeing is indeed real, and not because kmem accounting was missed out for
> slab usage.
>
> Also, I see all non-root allocations are coming from a single set of
> kmem_caches. I guess the <kmemcache_name>-memcg caches don't yet show up in
> /proc/slabinfo, nor are their stats accumulated into /proc/slabinfo?

Hello Bharata!

First I'd look at the global slab counters in /proc/meminfo (or vmstat).
These reflect the total system-wide amount of pages used by all slab
memory, and they are accurate.

As for the cgroup-level counters, they are not perfect in the version which
I posted. In general, on cgroup v1 kernel memory is accounted twice: as a part
of total memory (memory.usage_in_bytes) and as a separate value
(memory.kmem.usage_in_bytes). The version of the slab controller which you tested
doesn't support the second one. It also doesn't include the space used by the
accounting meta-data (1 pointer per object) in the accounting.
But overall, the difference in memory.usage_in_bytes values, aside from some margin
(~10% of the difference), is meaningful.

The next version which I'm working on right now (and hope to post in a week or so)
will address these issues.

Thanks!