2021-01-12 10:12:18

by Jann Horn

Subject: SLUB: percpu partial object count is highly inaccurate, causing some memory wastage and maybe also worse tail latencies?

[This is not something I intend to work on myself. But since I
stumbled over this issue, I figured I should at least document/report
it, in case anyone is willing to pick it up.]

Hi!

I was poking around in SLUB internals and noticed that the estimate of
how many free objects exist on a percpu partial list (tracked in
page->pobjects of the first page on the list and exposed to userspace
via /sys/kernel/slab/*/slabs_cpu_partial) is highly inaccurate.

From a naive first look, it seems like SLUB tries to keep roughly up
to cache->cpu_partial free objects around per slab cache and CPU.
cache->cpu_partial depends on the object size; the maximum is 30 (for
objects <256 bytes).
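
For reference, the sizing logic in mm/slub.c looks roughly like the
following sketch (paraphrased from kernels of that era; treat the
helper name and the exact thresholds as illustrative rather than
authoritative):

/*
 * Rough paraphrase of the cpu_partial sizing done when a cache is set up
 * (mm/slub.c around v5.10), shown only to illustrate where the limit of
 * 30 free objects for small-object caches comes from.
 */
static void set_cpu_partial_sketch(struct kmem_cache *s)
{
	if (!kmem_cache_has_cpu_partial(s))
		slub_set_cpu_partial(s, 0);	/* CONFIG_SLUB_CPU_PARTIAL=n */
	else if (s->size >= PAGE_SIZE)
		slub_set_cpu_partial(s, 2);
	else if (s->size >= 1024)
		slub_set_cpu_partial(s, 6);
	else if (s->size >= 256)
		slub_set_cpu_partial(s, 13);
	else
		slub_set_cpu_partial(s, 30);	/* objects smaller than 256 bytes */
}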

However, the accounting of free objects into page->pobjects only
happens when a page is added to the percpu partial list;
page->pobjects is not updated when objects are freed to a page that is
already on the percpu partial list. Pages can be added to the percpu
partial list in two cases:

1. When an object is freed from a page which previously had zero free
objects (via __slab_free()), meaning the page was not previously on
any list.
2. When the percpu partial list was empty and get_partial_node(),
after grabbing a partial page from the node, moves more partial pages
onto the percpu partial list to make the percpu partial list contain
around cache->cpu_partial/2 free objects.

In case 1, pages will almost always be counted as having one free
object, unless a race with a concurrent __slab_free() on the same page
happens, because the code path specifically handles the case where the
number of free objects just became 1.
In case 2, pages will probably often be counted as having many free
objects; however, this case doesn't appear to happen often in
practice, likely partly because pages outside of percpu partial lists
get freed immediately when they become empty.
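
To make case 1 concrete, here is a simplified sketch of the relevant
accounting (paraphrased from put_cpu_partial() in mm/slub.c of that era;
the cmpxchg retry loop, irq handling and statistics are left out, so
read it as an illustration rather than the real code):

/*
 * Simplified sketch of put_cpu_partial(): pobjects is a snapshot taken
 * once, when the page is frozen onto the percpu partial list. In case 1
 * the page is added right after its first object was freed, so the
 * snapshot is almost always 1, and later frees to the page never update
 * it.
 */
static void put_cpu_partial_sketch(struct kmem_cache *s, struct page *page)
{
	struct page *oldpage = this_cpu_read(s->cpu_slab->partial);
	int pages = 0, pobjects = 0;

	if (oldpage) {
		pages = oldpage->pages;
		pobjects = oldpage->pobjects;
		/* Drain if the (inaccurate) estimate already exceeds the limit. */
		if (pobjects > slub_cpu_partial(s)) {
			unfreeze_partials(s, this_cpu_ptr(s->cpu_slab));
			oldpage = NULL;
			pages = pobjects = 0;
		}
	}

	pages++;
	pobjects += page->objects - page->inuse;	/* counted exactly once */

	page->pages = pages;		/* running totals live in the list head page */
	page->pobjects = pobjects;
	page->next = oldpage;
	this_cpu_write(s->cpu_slab->partial, page);
}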


This means that in practice, SLUB actually ends up keeping as many
**pages** on the percpu partial lists as it intends to keep **free
objects** there. To see this, you can append the snippet at the end of
this mail to mm/slub.c; that will add a debugfs entry that lets you
accurately dump the percpu partial lists of all running CPUs (at the
cost of an IPI storm when you read it).

Especially after running some forkbombs multiple times on a VM with a
bunch of CPUs, that will show you that some of the percpu lists look
like this (just showing a single slab's percpu lists on three CPU
cores here) - note the "inuse=0" everywhere:

task_delay_info on 10:
page=fffff9e4444b9b00 base=ffff988bd2e6c000 order=0 partial_pages=24
partial_objects=24 objects=51 inuse=0
page=fffff9e44469f800 base=ffff988bda7e0000 order=0 partial_pages=23
partial_objects=23 objects=51 inuse=0
page=fffff9e444e01380 base=ffff988bf804e000 order=0 partial_pages=22
partial_objects=22 objects=51 inuse=0
page=fffff9e444bdda40 base=ffff988bef769000 order=0 partial_pages=21
partial_objects=21 objects=51 inuse=0
page=fffff9e444c59700 base=ffff988bf165c000 order=0 partial_pages=20
partial_objects=20 objects=51 inuse=0
page=fffff9e44494d280 base=ffff988be534a000 order=0 partial_pages=19
partial_objects=19 objects=51 inuse=0
page=fffff9e444fd2e80 base=ffff988bff4ba000 order=0 partial_pages=18
partial_objects=18 objects=51 inuse=0
page=fffff9e444c47c80 base=ffff988bf11f2000 order=0 partial_pages=17
partial_objects=17 objects=51 inuse=0
page=fffff9e4448ff780 base=ffff988be3fde000 order=0 partial_pages=16
partial_objects=16 objects=51 inuse=0
page=fffff9e4443883c0 base=ffff988bce20f000 order=0 partial_pages=15
partial_objects=15 objects=51 inuse=0
page=fffff9e444eede40 base=ffff988bfbb79000 order=0 partial_pages=14
partial_objects=14 objects=51 inuse=0
page=fffff9e4442febc0 base=ffff988bcbfaf000 order=0 partial_pages=13
partial_objects=13 objects=51 inuse=0
page=fffff9e444e44940 base=ffff988bf9125000 order=0 partial_pages=12
partial_objects=12 objects=51 inuse=0
page=fffff9e4446f72c0 base=ffff988bdbdcb000 order=0 partial_pages=11
partial_objects=11 objects=51 inuse=0
page=fffff9e444dba080 base=ffff988bf6e82000 order=0 partial_pages=10
partial_objects=10 objects=51 inuse=0
page=fffff9e444a23c40 base=ffff988be88f1000 order=0 partial_pages=9
partial_objects=9 objects=51 inuse=0
page=fffff9e444786cc0 base=ffff988bde1b3000 order=0 partial_pages=8
partial_objects=8 objects=51 inuse=0
page=fffff9e444b2cf80 base=ffff988becb3e000 order=0 partial_pages=7
partial_objects=7 objects=51 inuse=0
page=fffff9e444f19cc0 base=ffff988bfc673000 order=0 partial_pages=6
partial_objects=6 objects=51 inuse=0
page=fffff9e444f08fc0 base=ffff988bfc23f000 order=0 partial_pages=5
partial_objects=5 objects=51 inuse=0
page=fffff9e444a0e540 base=ffff988be8395000 order=0 partial_pages=4
partial_objects=4 objects=51 inuse=0
page=fffff9e445127a00 base=ffff988c049e8000 order=0 partial_pages=3
partial_objects=3 objects=51 inuse=0
page=fffff9e44468ae40 base=ffff988bda2b9000 order=0 partial_pages=2
partial_objects=2 objects=51 inuse=0
page=fffff9e44452c9c0 base=ffff988bd4b27000 order=0 partial_pages=1
partial_objects=1 objects=51 inuse=0
task_delay_info on 11:
page=fffff9e444a90c40 base=ffff988bea431000 order=0 partial_pages=22
partial_objects=22 objects=51 inuse=0
page=fffff9e444447040 base=ffff988bd11c1000 order=0 partial_pages=21
partial_objects=21 objects=51 inuse=0
page=fffff9e446505b40 base=ffff988c5416d000 order=0 partial_pages=20
partial_objects=20 objects=51 inuse=0
page=fffff9e444c02500 base=ffff988bf0094000 order=0 partial_pages=19
partial_objects=19 objects=51 inuse=0
page=fffff9e4447ceec0 base=ffff988bdf3bb000 order=0 partial_pages=18
partial_objects=18 objects=51 inuse=0
page=fffff9e444524a40 base=ffff988bd4929000 order=0 partial_pages=17
partial_objects=17 objects=51 inuse=0
page=fffff9e444698700 base=ffff988bda61c000 order=0 partial_pages=16
partial_objects=16 objects=51 inuse=0
page=fffff9e444d23240 base=ffff988bf48c9000 order=0 partial_pages=15
partial_objects=15 objects=51 inuse=0
page=fffff9e4442ccac0 base=ffff988bcb32b000 order=0 partial_pages=14
partial_objects=14 objects=51 inuse=0
page=fffff9e444a03500 base=ffff988be80d4000 order=0 partial_pages=13
partial_objects=13 objects=51 inuse=0
page=fffff9e444582ec0 base=ffff988bd60bb000 order=0 partial_pages=12
partial_objects=12 objects=51 inuse=0
page=fffff9e444e5a340 base=ffff988bf968d000 order=0 partial_pages=11
partial_objects=11 objects=51 inuse=0
page=fffff9e4444bb680 base=ffff988bd2eda000 order=0 partial_pages=10
partial_objects=10 objects=51 inuse=0
page=fffff9e444232100 base=ffff988bc8c84000 order=0 partial_pages=9
partial_objects=9 objects=51 inuse=0
page=fffff9e444acb2c0 base=ffff988beb2cb000 order=0 partial_pages=8
partial_objects=8 objects=51 inuse=0
page=fffff9e44512cdc0 base=ffff988c04b37000 order=0 partial_pages=7
partial_objects=7 objects=51 inuse=0
page=fffff9e44474f040 base=ffff988bdd3c1000 order=0 partial_pages=6
partial_objects=6 objects=51 inuse=0
page=fffff9e4446dee80 base=ffff988bdb7ba000 order=0 partial_pages=5
partial_objects=5 objects=51 inuse=0
page=fffff9e444c19fc0 base=ffff988bf067f000 order=0 partial_pages=4
partial_objects=4 objects=51 inuse=0
page=fffff9e444a07a80 base=ffff988be81ea000 order=0 partial_pages=3
partial_objects=3 objects=51 inuse=0
page=fffff9e444e6ac80 base=ffff988bf9ab2000 order=0 partial_pages=2
partial_objects=2 objects=51 inuse=0
page=fffff9e4442e5ec0 base=ffff988bcb97b000 order=0 partial_pages=1
partial_objects=1 objects=51 inuse=0
task_delay_info on 12:
page=fffff9e4446d0880 base=ffff988bdb422000 order=0 partial_pages=27
partial_objects=27 objects=51 inuse=0
page=fffff9e44485f5c0 base=ffff988be17d7000 order=0 partial_pages=26
partial_objects=26 objects=51 inuse=0
page=fffff9e444ff8180 base=ffff988bffe06000 order=0 partial_pages=25
partial_objects=25 objects=51 inuse=0
page=fffff9e444727d00 base=ffff988bdc9f4000 order=0 partial_pages=24
partial_objects=24 objects=51 inuse=0
page=fffff9e444dc7a80 base=ffff988bf71ea000 order=0 partial_pages=23
partial_objects=23 objects=51 inuse=0
page=fffff9e4443c7600 base=ffff988bcf1d8000 order=0 partial_pages=22
partial_objects=22 objects=51 inuse=0
page=fffff9e444d76580 base=ffff988bf5d96000 order=0 partial_pages=21
partial_objects=21 objects=51 inuse=0
page=fffff9e4446d20c0 base=ffff988bdb483000 order=0 partial_pages=20
partial_objects=20 objects=51 inuse=0
page=fffff9e4448b9c00 base=ffff988be2e70000 order=0 partial_pages=19
partial_objects=19 objects=51 inuse=0
page=fffff9e444781900 base=ffff988bde064000 order=0 partial_pages=18
partial_objects=18 objects=51 inuse=0
page=fffff9e4465dd4c0 base=ffff988c57753000 order=0 partial_pages=17
partial_objects=17 objects=51 inuse=0
page=fffff9e4446f2340 base=ffff988bdbc8d000 order=0 partial_pages=16
partial_objects=16 objects=51 inuse=0
page=fffff9e4449c4f40 base=ffff988be713d000 order=0 partial_pages=15
partial_objects=15 objects=51 inuse=0
page=fffff9e445106b80 base=ffff988c041ae000 order=0 partial_pages=14
partial_objects=14 objects=51 inuse=0
page=fffff9e444b7b9c0 base=ffff988bedee7000 order=0 partial_pages=13
partial_objects=13 objects=51 inuse=0
page=fffff9e44422c400 base=ffff988bc8b10000 order=0 partial_pages=12
partial_objects=12 objects=51 inuse=0
page=fffff9e444eb2240 base=ffff988bfac89000 order=0 partial_pages=11
partial_objects=11 objects=51 inuse=0
page=fffff9e44455ce40 base=ffff988bd5739000 order=0 partial_pages=10
partial_objects=10 objects=51 inuse=0
page=fffff9e44490f440 base=ffff988be43d1000 order=0 partial_pages=9
partial_objects=9 objects=51 inuse=0
page=fffff9e444b640c0 base=ffff988bed903000 order=0 partial_pages=8
partial_objects=8 objects=51 inuse=0
page=fffff9e444877c40 base=ffff988be1df1000 order=0 partial_pages=7
partial_objects=7 objects=51 inuse=0
page=fffff9e444ef72c0 base=ffff988bfbdcb000 order=0 partial_pages=6
partial_objects=6 objects=51 inuse=0
page=fffff9e444a6d040 base=ffff988be9b41000 order=0 partial_pages=5
partial_objects=5 objects=51 inuse=0
page=fffff9e444503340 base=ffff988bd40cd000 order=0 partial_pages=4
partial_objects=4 objects=51 inuse=0
page=fffff9e444e408c0 base=ffff988bf9023000 order=0 partial_pages=3
partial_objects=3 objects=51 inuse=0
page=fffff9e445a55a80 base=ffff988c2956a000 order=0 partial_pages=2
partial_objects=2 objects=51 inuse=0
page=fffff9e444486480 base=ffff988bd2192000 order=0 partial_pages=1
partial_objects=1 objects=51 inuse=0

Even on an old-ish Android phone (Pixel 2), with normal-ish usage, I
see something like 1.5MiB of pages with zero inuse objects stuck in
percpu lists.

I suspect that this may have also contributed to the memory wastage
problem with memory cgroups that was fixed in v5.9
(https://lore.kernel.org/linux-mm/[email protected]/);
meaning that servers with lots of CPU cores running pre-5.9 kernels
with memcg and systemd (which tends to stick every service into its
own memcg) might be even worse off.

It also seems unsurprising to me that flushing ~30 pages out of the
percpu partial caches at once with IRQs disabled would cause tail
latency spikes (as noted by Joonsoo Kim and Christoph Lameter in
commit 345c905d13a4e "slub: Make cpu partial slab support
configurable").

At first I thought that this wasn't a significant issue because SLUB
has a reclaim path that can trim the percpu partial lists; but as it
turns out, that reclaim path is not actually wired up to the page
allocator's reclaim logic. The SLUB reclaim stuff is only triggered by
(very rare) subsystem-specific calls into SLUB for specific slabs and
by sysfs entries. So userland processes will OOM even if SLUB still
has megabytes of entirely unused pages lying around.
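
(For reference, the sysfs trigger is about as simple as this sketch,
paraphrased from the shrink_store() attribute handler in mm/slub.c;
details may differ between kernel versions:)

static ssize_t shrink_store(struct kmem_cache *s, const char *buf, size_t length)
{
	if (buf[0] != '1')
		return -EINVAL;
	/* Flushes cpu slabs, percpu partial lists and empty per-node slabs. */
	kmem_cache_shrink(s);
	return length;
}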

It might be a good idea to figure out whether it is possible to
efficiently keep track of a more accurate count of the free objects on
percpu partial lists; and if not, maybe change the accounting to
explicitly track the number of partial pages, and use limits that are
more appropriate for that? And perhaps the page allocator reclaim path
should also occasionally rip unused pages out of the percpu partial
lists?

If you want to manually try reaping memory from the percpu partial
lists one time, you may want to try writing 1 into all
/sys/kernel/slab/*/shrink and see what effect that has on memory
usage, especially on the "Slab" line in /proc/meminfo. On my laptop
(running some kernel older than 5.9) that caused Slab memory usage to
drop from 652096 kB to 580780 kB, a relative 11% reduction in page
usage by the slab allocator, and an absolute ~70 MiB reduction. On my
(mostly idle) 56-core workstation (running the same kernel), it
reduced slab memory usage by ~13% / ~1.35 GiB:

# for i in {0..5}; do grep Slab /proc/meminfo; sleep 5; done; \
  for file in /sys/kernel/slab/*/shrink; do echo 1 > $file; done; \
  for i in {0..5}; do grep Slab /proc/meminfo; sleep 5; done
Slab: 10902060 kB
Slab: 10902528 kB
Slab: 10902528 kB
Slab: 10902728 kB
Slab: 10902696 kB
Slab: 10902712 kB
Slab: 9485364 kB
Slab: 9424864 kB
Slab: 9429188 kB
Slab: 9430656 kB
Slab: 9431020 kB
Slab: 9431764 kB


=============================
#include <linux/debugfs.h>

#ifdef CONFIG_SLUB_CPU_PARTIAL
/* Iterator state: walk every (cache, possible CPU) pair under slab_mutex. */
struct partial_dump_state {
	int cache_idx;
	struct kmem_cache *cache;
	int cpu;
};

static void *partial_dump_start(struct seq_file *m, loff_t *pos)
{
	struct partial_dump_state *ds = m->private;

	mutex_lock(&slab_mutex);
	ds->cache = list_first_entry(&slab_caches, struct kmem_cache, list);
	if (*pos == 0) {
		ds->cache_idx = 0;
		ds->cpu = cpumask_next(-1, cpu_possible_mask);
	} else {
		int i;

		/* Resuming: re-find the cache we stopped at last time. */
		for (i = 0; i < ds->cache_idx; i++) {
			if (list_is_last(&ds->cache->list, &slab_caches))
				return NULL;
			ds->cache = list_next_entry(ds->cache, list);
		}
	}
	return (void *)1UL;
}

static void *partial_dump_next(struct seq_file *m, void *v, loff_t *pos)
{
	struct partial_dump_state *ds = m->private;

	(*pos)++; /* meaningless but seq_file complains if we don't */
	ds->cpu = cpumask_next(ds->cpu, cpu_possible_mask);
	/* Compare against nr_cpu_ids: the possible mask may be sparse. */
	if (ds->cpu >= nr_cpu_ids) {
		ds->cpu = cpumask_next(-1, cpu_possible_mask);

		ds->cache_idx++;
		if (list_is_last(&ds->cache->list, &slab_caches))
			return NULL;
		ds->cache = list_next_entry(ds->cache, list);
	}
	return (void *)1UL;
}

static void empty_stop(struct seq_file *m, void *p)
{
	struct partial_dump_state *ds = m->private;

	ds->cache = NULL;
	mutex_unlock(&slab_mutex);
}

/* Runs on the target CPU so that its c->partial list can be walked safely. */
static void partial_dump_on_cpu(void *info)
{
	struct seq_file *m = info;
	struct partial_dump_state *ds = m->private;
	struct kmem_cache_cpu *c = per_cpu_ptr(ds->cache->cpu_slab, ds->cpu);
	struct page *page;

	if (WARN_ON(smp_processor_id() != ds->cpu))
		return;

	seq_printf(m, "%s on %d:\n", ds->cache->name, ds->cpu);
	for (page = c->partial; page != NULL; page = page->next) {
		seq_printf(m, "    page=%px base=%px order=%d partial_pages=%d partial_objects=%d objects=%u inuse=%u\n",
			   page,
			   page_address(page),
			   compound_order(page),
			   page->pages,
			   page->pobjects,
			   page->objects,
			   page->inuse);
	}
}

static int partial_dump_show(struct seq_file *m, void *p)
{
	struct partial_dump_state *ds = m->private;

	if (smp_call_function_single(ds->cpu, partial_dump_on_cpu, m, 1))
		seq_printf(m, "%s on %d: not online\n",
			   ds->cache->name, ds->cpu);

	return 0;
}

static const struct seq_operations partial_dump_seq_ops = {
	.start = partial_dump_start,
	.next = partial_dump_next,
	.stop = empty_stop,
	.show = partial_dump_show
};

static int partial_dump_open(struct inode *inode, struct file *file)
{
	return seq_open_private(file, &partial_dump_seq_ops,
				sizeof(struct partial_dump_state));
}

static const struct file_operations partial_dump_ops = {
	.open = partial_dump_open,
	.read = seq_read,
	.llseek = seq_lseek,
	/* seq_open_private() allocated m->private, so free it here */
	.release = seq_release_private
};
#endif

static int __init slub_debugfs_init(void)
{
	struct dentry *dir = debugfs_create_dir("slub", NULL);

#ifdef CONFIG_SLUB_CPU_PARTIAL
	debugfs_create_file("partial_dump", 0400, dir, NULL, &partial_dump_ops);
#endif

	return 0;
}
late_initcall(slub_debugfs_init);
=============================


2021-01-12 10:37:04

by Roman Gushchin

Subject: Re: SLUB: percpu partial object count is highly inaccurate, causing some memory wastage and maybe also worse tail latencies?

On Tue, Jan 12, 2021 at 12:12:47AM +0100, Jann Horn wrote:
> [This is not something I intend to work on myself. But since I
> stumbled over this issue, I figured I should at least document/report
> it, in case anyone is willing to pick it up.]

Hi, Jann!

Thank you very much for this report!

It should be less of a problem since 5.9, but on a machine with
many CPUs a non-trivial amount of memory can still be wasted.

I'll definitely take a look.

Thanks!

by Christoph Lameter

Subject: Re: SLUB: percpu partial object count is highly inaccurate, causing some memory wastage and maybe also worse tail latencies?

On Tue, 12 Jan 2021, Jann Horn wrote:

> [This is not something I intend to work on myself. But since I
> stumbled over this issue, I figured I should at least document/report
> it, in case anyone is willing to pick it up.]

Well yeah all true. There is however a slabinfo tool that has an -s option
to shrink all slabs.

slabinfo -s

So you could put that somewhere that executes if the system is
idle or put it into cron or so.

This is a heavy-handed operation, though. You could switch off per-cpu partial
slabs completely to avoid the issue. Nothing came to mind in the past on
how to solve this without sacrificing significant performance or causing
some system processing at random times while the shrinking runs. No one
wants any of that.

Being able to do it from userspace cleanly shifts the burden to userspace ;-)
You can even do random heavy system processing from user space if you put
a sleep there for random seconds between the shrinking runs.

2021-01-13 19:16:56

by Vlastimil Babka

Subject: Re: SLUB: percpu partial object count is highly inaccurate, causing some memory wastage and maybe also worse tail latencies?

On 1/12/21 12:12 AM, Jann Horn wrote:
> [This is not something I intend to work on myself. But since I
> stumbled over this issue, I figured I should at least document/report
> it, in case anyone is willing to pick it up.]
>
> Hi!

Hi, thanks for saving me a lot of typing!

...

> This means that in practice, SLUB actually ends up keeping as many
> **pages** on the percpu partial lists as it intends to keep **free
> objects** there.

Yes, I concluded the same thing.

...

> I suspect that this may have also contributed to the memory wastage
> problem with memory cgroups that was fixed in v5.9
> (https://lore.kernel.org/linux-mm/[email protected]/);
> meaning that servers with lots of CPU cores running pre-5.9 kernels
> with memcg and systemd (which tends to stick every service into its
> own memcg) might be even worse off.

Very much yes. Investigating an increase in kmemcg usage of a workload between
an older kernel with SLAB and a 5.3-based kernel with SLUB led us to find the
same issue as you did. It doesn't help that slabinfo (global or per-memcg) is also
inaccurate as it cannot count free objects on per-cpu partial slabs and thus
reports them as active. I was aware that some empty slab pages might linger on
per-cpu lists, but only seeing how many were freed after "echo 1 >
.../shrink" made me realize the extent of this.

> It also seems unsurprising to me that flushing ~30 pages out of the
> percpu partial caches at once with IRQs disabled would cause tail
> latency spikes (as noted by Joonsoo Kim and Christoph Lameter in
> commit 345c905d13a4e "slub: Make cpu partial slab support
> configurable").
>
> At first I thought that this wasn't a significant issue because SLUB
> has a reclaim path that can trim the percpu partial lists; but as it
> turns out, that reclaim path is not actually wired up to the page
> allocator's reclaim logic. The SLUB reclaim stuff is only triggered by
> (very rare) subsystem-specific calls into SLUB for specific slabs and
> by sysfs entries. So userland processes will OOM even if SLUB still
> has megabytes of entirely unused pages lying around.

Yeah, we considered wiring the shrinking to memcg OOM, but it's a poor
solution. I'm considering introducing a proper shrinker that would be registered
and work like other shrinkers for reclaimable caches. Then we would make it
memcg-aware in our backport - upstream after v5.9 doesn't need that obviously.

> It might be a good idea to figure out whether it is possible to
> efficiently keep track of a more accurate count of the free objects on

As long as there are some inuse objects, it shouldn't matter much if the slab is
sitting on per-cpu partial list or per-node list, as it can't be freed anyway.
It becomes a real problem only after the slab becomes fully free. If we detected
that in __slab_free() also for already-frozen slabs, we would need to know which
CPU this slab belongs to (currently that's not tracked afaik), and send it an
IPI to do some light version of unfreeze_partials() that would only remove empty
slabs. The trick would be not to cause too many IPI's by this, obviously :/

Actually I'm somewhat wrong above. If a CPU's caches and the per-node partial
list run out of free objects, it's wasteful to allocate new slabs while
almost-empty slabs sit on another CPU's per-cpu partial list.

> percpu partial lists; and if not, maybe change the accounting to
> explicitly track the number of partial pages, and use limits that are

That would probably be the simplest solution. Maybe sufficient upstream, where
the wastage only depends on the number of caches and not memcgs. For pre-5.9 I also
considered limiting the number of pages only for the per-memcg clones :/
Currently writing to the /sys/.../<cache>/cpu_partial file is propagated to all
the clones and root cache.

> more appropriate for that? And perhaps the page allocator reclaim path
> should also occasionally rip unused pages out of the percpu partial
> lists?

That would be best done by a shrinker?

BTW, SLAB does this by periodically reaping its per-cpu and shared arrays from
timers (which works, but is not ideal). Those arrays also can't grow as large as this.


2021-01-14 02:05:53

by Jann Horn

Subject: Re: SLUB: percpu partial object count is highly inaccurate, causing some memory wastage and maybe also worse tail latencies?

On Wed, Jan 13, 2021 at 8:14 PM Vlastimil Babka <[email protected]> wrote:
> On 1/12/21 12:12 AM, Jann Horn wrote:
> It doesn't help that slabinfo (global or per-memcg) is also
> inaccurate as it cannot count free objects on per-cpu partial slabs and thus
> reports them as active.

Maybe SLUB could be taught to track how many objects are in the percpu
machinery, and then print that number separately so that you can at
least know how much data you're missing without having to collect data
with IPIs...

> > It might be a good idea to figure out whether it is possible to
> > efficiently keep track of a more accurate count of the free objects on
>
> As long as there are some inuse objects, it shouldn't matter much if the slab is
> sitting on per-cpu partial list or per-node list, as it can't be freed anyway.
> It becomes a real problem only after the slab becomes fully free. If we detected
> that in __slab_free() also for already-frozen slabs, we would need to know which
> CPU this slab belongs to (currently that's not tracked afaik),

Yeah, but at least on 64-bit systems we still have 32 completely
unused bits in the counter field that's updated via cmpxchg_double on
struct page. (On 32-bit systems the bitfields are also wider than they
strictly need to be, I think, at least if the system has 4K page
size.) So at least on 64-bit systems, we could squeeze a CPU number in
there, and then you'd know to which CPU the page belonged at the time
the object was freed.
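
(For illustration, a rough sketch of what that could look like; the
"cpu" field and its width are hypothetical, and the real struct page
definition in include/linux/mm_types.h is more involved than this:)

/*
 * Sketch of the SLUB portion of struct page's counters word, with a
 * hypothetical "cpu" field squeezed into the upper 32 bits that are
 * currently unused on 64-bit. The whole word is updated together with
 * the freelist pointer via cmpxchg_double().
 */
union slub_counters_sketch {
	unsigned long counters;
	struct {
		unsigned inuse:16;	/* allocated objects in this page */
		unsigned objects:15;	/* total objects in this page */
		unsigned frozen:1;	/* page is owned by some CPU */
		unsigned cpu:16;	/* hypothetical: owning CPU at freeze time */
	};
};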

> and send it an
> IPI to do some light version of unfreeze_partials() that would only remove empty
> slabs. The trick would be not to cause too many IPI's by this, obviously :/

Some brainstorming:

Maybe you could have an atomic counter in kmem_cache_cpu that tracks
the number of empty frozen pages that are associated with a specific
CPU? So the freeing slowpath would do its cmpxchg_double, and if the
new state after a successful cmpxchg_double is "inuse==0 && frozen ==
1" with a valid CPU number, you afterwards do
"atomic_long_inc(&per_cpu_ptr(cache->cpu_slab,
cpu)->empty_partial_pages)". I think it should be possible to
implement that such that the empty_partial_pages count, while not
immediately completely accurate, would be eventually consistent; and
readers on the CPU owning the kmem_cache_cpu should never see a number
that is too large, only one that is too small.

You could additionally have a plain percpu counter, not tied to the
kmem_cache, and increment it by 1<<page_order - then that would track
the amount of memory you could reclaim by sending an IPI to a given
CPU core. Then that threshold could help decide whether it's worth
sending IPIs from SLUB and/or the shrinker?
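
(Again only a sketch of the idea; "empty_partial_pages" and
"reclaimable_partial_pages" are made-up names, and the counters are
shown as atomics for simplicity:)

/* Hypothetical per-CPU count of pages an IPI to that CPU could reclaim. */
static DEFINE_PER_CPU(atomic_long_t, reclaimable_partial_pages);

/*
 * Hypothetical hook in the __slab_free() slowpath, called after a
 * successful cmpxchg_double(). @cpu is the owning CPU recovered from the
 * counters word as discussed above, or -1 if unknown.
 */
static void note_empty_frozen_page(struct kmem_cache *s, struct page *page,
				   unsigned int new_inuse,
				   unsigned int new_frozen, int cpu)
{
	if (new_inuse != 0 || !new_frozen || cpu < 0)
		return;

	/* Eventually-consistent count of empty frozen pages owned by @cpu. */
	atomic_long_inc(&per_cpu_ptr(s->cpu_slab, cpu)->empty_partial_pages);

	/* Track how much memory flushing @cpu's percpu partial list would free. */
	atomic_long_add(1 << compound_order(page),
			per_cpu_ptr(&reclaimable_partial_pages, cpu));
}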

by Christoph Lameter

Subject: Re: SLUB: percpu partial object count is highly inaccurate, causing some memory wastage and maybe also worse tail latencies?

On Wed, 13 Jan 2021, Jann Horn wrote:

> Some brainstorming:
>
> Maybe you could have an atomic counter in kmem_cache_cpu that tracks
> the number of empty frozen pages that are associated with a specific
> CPU? So the freeing slowpath would do its cmpxchg_double, and if the


The latencies of these functions are so low that any additional counter
will have significant performance impacts. An atomic counter would be waay
out there.

> You could additionally have a plain percpu counter, not tied to the

The performance critical counters are already all per cpu. I enhanced the
percpu subsystem specifically to support latency critical operations in
the fast path of the slab allocators.

2021-01-14 09:30:07

by Vlastimil Babka

Subject: SLUB: percpu partial object count is highly inaccurate, causing some memory wastage and maybe also worse tail latencies?

On 1/12/21 5:35 PM, Christoph Lameter wrote:
> On Tue, 12 Jan 2021, Jann Horn wrote:
>
>> [This is not something I intend to work on myself. But since I
>> stumbled over this issue, I figured I should at least document/report
>> it, in case anyone is willing to pick it up.]
>
> Well yeah all true. There is however a slabinfo tool that has an -s option
> to shrink all slabs.
>
> slabinfo -s
>
> So you could put that somewhere that executes if the system is
> idle or put it into cron or so.

Hm, this would be similar to recommending a periodic echo > drop_caches
operation. We actually discourage that (and yeah, some tools do it, and
we now report those in dmesg). I believe the kernel should respond to memory
pressure by itself and not OOM prematurely, including SLUB.

2021-01-18 11:09:48

by Michal Hocko

Subject: Re: SLUB: percpu partial object count is highly inaccurate, causing some memory wastage and maybe also worse tail latencies?

On Thu 14-01-21 10:27:40, Vlastimil Babka wrote:
> On 1/12/21 5:35 PM, Christoph Lameter wrote:
> > On Tue, 12 Jan 2021, Jann Horn wrote:
> >
> >> [This is not something I intend to work on myself. But since I
> >> stumbled over this issue, I figured I should at least document/report
> >> it, in case anyone is willing to pick it up.]
> >
> > Well yeah all true. There is however a slabinfo tool that has an -s option
> > to shrink all slabs.
> >
> > slabinfo -s
> >
> > So you could put that somewhere that executes if the system is
> > idle or put it into cron or so.
>
> Hm, this would be similar to recommending a periodic echo > drop_caches
> operation. We actually discourage that (and yeah, some tools do it, and
> we now report those in dmesg). I believe the kernel should respond to memory
> pressure by itself and not OOM prematurely, including SLUB.

Absolutely agreed! Partial caches are a very deep internal
implementation detail of the allocator, and the admin has no business
fiddling with them. This would only lead to more harm than good.
The comparison to drop_caches is spot on!

--
Michal Hocko
SUSE Labs

by Christoph Lameter

Subject: Re: SLUB: percpu partial object count is highly inaccurate, causing some memory wastage and maybe also worse tail latencies?

On Mon, 18 Jan 2021, Michal Hocko wrote:

> > Hm, this would be similar to recommending a periodic echo > drop_caches
> > operation. We actually discourage that (and yeah, some tools do it, and
> > we now report those in dmesg). I believe the kernel should respond to memory
> > pressure by itself and not OOM prematurely, including SLUB.
>
> Absolutely agreed! Partial caches are a very deep internal
> implementation detail of the allocator, and the admin has no business
> fiddling with them. This would only lead to more harm than good.
> The comparison to drop_caches is spot on!

Really? The maximum allocation here has an upper boundary that depends on
the number of possible partial per-cpu slabs. There is a worst-case
scenario that is not nice and wastes some memory, but it is not an OOM
situation and the system easily recovers from it.

The slab shrinking is not needed but if you are concerned about reclaiming
more memory right now then I guess you may want to run the slab shrink
operation.

Dropping the page cache is bad? Well sometimes you want more free memory
due to a certain operation that needs to be started and where you do not
want the overhead of page cache processing.

You can go crazy and expect magical things from either operation. True.





2021-01-19 04:43:44

by Michal Hocko

Subject: Re: SLUB: percpu partial object count is highly inaccurate, causing some memory wastage and maybe also worse tail latencies?

On Mon 18-01-21 15:46:43, Christoph Lameter wrote:
> On Mon, 18 Jan 2021, Michal Hocko wrote:
>
> > > Hm, this would be similar to recommending a periodic echo > drop_caches
> > > operation. We actually discourage that (and yeah, some tools do it, and
> > > we now report those in dmesg). I believe the kernel should respond to memory
> > > pressure by itself and not OOM prematurely, including SLUB.
> >
> > Absolutely agreed! Partial caches are a very deep internal
> > implementation detail of the allocator, and the admin has no business
> > fiddling with them. This would only lead to more harm than good.
> > The comparison to drop_caches is spot on!
>
> Really? The maximum allocation here has an upper boundary that depends on
> the number of possible partial per-cpu slabs.

And the number of CPUs and caches...

> There is a worst-case
> scenario that is not nice and wastes some memory, but it is not an OOM
> situation and the system easily recovers from it.

There is no pro-active shrinking of those when we are close to OOM,
so we can still go and kill a task while there is quite some memory
sitting in freeable SLUB caches, unless I am missing something.

We have learned about this in a memcg environment on our distribution
kernels, where the problem is amplified by use in memcgs with a small
limit. This is an older kernel, and I would expect current upstream to
behave better with Roman's accounting rework. But it would still be
great if the allocator could manage its caches depending on the memory
demand.

> The slab shrinking is not needed but if you are concerned about reclaiming
> more memory right now then I guess you may want to run the slab shrink
> operation.

Yes, you can do that, in the same way you can shrink the page cache.
Moreover, it is really hard to do that intelligently, because you
would need to watch the system very closely in order to shrink when it
is really needed. That requires a deep understanding of the allocator.

> Dropping the page cache is bad? Well sometimes you want more free memory
> due to a certain operation that needs to be started and where you do not
> want the overhead of page cache processing.

It is not bad if used properly. My experience is that people have
developed an instinct to drop caches whenever something is not quite right
because the Internet has recommended it.

--
Michal Hocko
SUSE Labs

2021-01-21 17:24:37

by Vlastimil Babka

Subject: Re: SLUB: percpu partial object count is highly inaccurate, causing some memory wastage and maybe also worse tail latencies?

On 1/12/21 12:12 AM, Jann Horn wrote:
> At first I thought that this wasn't a significant issue because SLUB
> has a reclaim path that can trim the percpu partial lists; but as it
> turns out, that reclaim path is not actually wired up to the page
> allocator's reclaim logic. The SLUB reclaim stuff is only triggered by
> (very rare) subsystem-specific calls into SLUB for specific slabs and
> by sysfs entries. So in userland processes will OOM even if SLUB still
> has megabytes of entirely unused pages lying around.
>
> It might be a good idea to figure out whether it is possible to
> efficiently keep track of a more accurate count of the free objects on
> percpu partial lists; and if not, maybe change the accounting to
> explicitly track the number of partial pages, and use limits that are
> more appropriate for that? And perhaps the page allocator reclaim path
> should also occasionally rip unused pages out of the percpu partial
> lists?

I'm gonna send an RFC that adds a proper shrinker and thus connects this
shrinking to page reclaim, as a reply to this e-mail.

2021-01-21 17:25:59

by Vlastimil Babka

Subject: [RFC 1/2] mm, vmscan: add priority field to struct shrink_control

Slab reclaim works with a reclaim priority, which influences how much to reclaim
but is not directly passed to individual shrinkers. The next patch introduces a
slab shrinker that uses the priority, so add it to shrink_control and
initialize it appropriately. We can then also remove the parameter from
do_shrink_slab() and trace_mm_shrink_slab_start().

Signed-off-by: Vlastimil Babka <[email protected]>
---
include/linux/shrinker.h | 3 +++
include/trace/events/vmscan.h | 8 +++-----
mm/vmscan.c | 14 ++++++++------
3 files changed, 14 insertions(+), 11 deletions(-)

diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
index 0f80123650e2..1066f052be4f 100644
--- a/include/linux/shrinker.h
+++ b/include/linux/shrinker.h
@@ -29,6 +29,9 @@ struct shrink_control {
*/
unsigned long nr_scanned;

+ /* current reclaim priority */
+ int priority;
+
/* current memcg being shrunk (for memcg aware shrinkers) */
struct mem_cgroup *memcg;
};
diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
index 2070df64958e..d42e480977c6 100644
--- a/include/trace/events/vmscan.h
+++ b/include/trace/events/vmscan.h
@@ -185,11 +185,9 @@ DEFINE_EVENT(mm_vmscan_direct_reclaim_end_template, mm_vmscan_memcg_softlimit_re
TRACE_EVENT(mm_shrink_slab_start,
TP_PROTO(struct shrinker *shr, struct shrink_control *sc,
long nr_objects_to_shrink, unsigned long cache_items,
- unsigned long long delta, unsigned long total_scan,
- int priority),
+ unsigned long long delta, unsigned long total_scan),

- TP_ARGS(shr, sc, nr_objects_to_shrink, cache_items, delta, total_scan,
- priority),
+ TP_ARGS(shr, sc, nr_objects_to_shrink, cache_items, delta, total_scan),

TP_STRUCT__entry(
__field(struct shrinker *, shr)
@@ -212,7 +210,7 @@ TRACE_EVENT(mm_shrink_slab_start,
__entry->cache_items = cache_items;
__entry->delta = delta;
__entry->total_scan = total_scan;
- __entry->priority = priority;
+ __entry->priority = sc->priority;
),

TP_printk("%pS %p: nid: %d objects to shrink %ld gfp_flags %s cache items %ld delta %lld total_scan %ld priority %d",
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 469016222cdb..bc5157625cec 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -410,7 +410,7 @@ EXPORT_SYMBOL(unregister_shrinker);
#define SHRINK_BATCH 128

static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
- struct shrinker *shrinker, int priority)
+ struct shrinker *shrinker)
{
unsigned long freed = 0;
unsigned long long delta;
@@ -439,7 +439,7 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,

total_scan = nr;
if (shrinker->seeks) {
- delta = freeable >> priority;
+ delta = freeable >> shrinkctl->priority;
delta *= 4;
do_div(delta, shrinker->seeks);
} else {
@@ -484,7 +484,7 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
total_scan = freeable * 2;

trace_mm_shrink_slab_start(shrinker, shrinkctl, nr,
- freeable, delta, total_scan, priority);
+ freeable, delta, total_scan);

/*
* Normally, we should not scan less than batch_size objects in one
@@ -562,6 +562,7 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
struct shrink_control sc = {
.gfp_mask = gfp_mask,
.nid = nid,
+ .priority = priority,
.memcg = memcg,
};
struct shrinker *shrinker;
@@ -578,7 +579,7 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
!(shrinker->flags & SHRINKER_NONSLAB))
continue;

- ret = do_shrink_slab(&sc, shrinker, priority);
+ ret = do_shrink_slab(&sc, shrinker);
if (ret == SHRINK_EMPTY) {
clear_bit(i, map->map);
/*
@@ -597,7 +598,7 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
* set_bit() do_shrink_slab()
*/
smp_mb__after_atomic();
- ret = do_shrink_slab(&sc, shrinker, priority);
+ ret = do_shrink_slab(&sc, shrinker);
if (ret == SHRINK_EMPTY)
ret = 0;
else
@@ -666,10 +667,11 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
struct shrink_control sc = {
.gfp_mask = gfp_mask,
.nid = nid,
+ .priority = priority,
.memcg = memcg,
};

- ret = do_shrink_slab(&sc, shrinker, priority);
+ ret = do_shrink_slab(&sc, shrinker);
if (ret == SHRINK_EMPTY)
ret = 0;
freed += ret;
--
2.30.0

2021-01-21 17:26:43

by Vlastimil Babka

Subject: [RFC 2/2] mm, slub: add shrinker to reclaim cached slabs

For performance reasons, SLUB doesn't keep all slabs on shared lists and
doesn't always free slabs immediately after all objects are freed. Namely:

- for each cache and cpu, there might be a "CPU slab" page, partially or fully
free
- with SLUB_CPU_PARTIAL enabled (default y), there might be a number of "percpu
partial slabs" for each cache and cpu, also partially or fully free
- for each cache and numa node, there are slabs on the per-node partial list;
up to 10 of those may be empty

As Jann reports [1], the number of percpu partial slabs should be limited by the
number of free objects (up to 30), but due to imprecise accounting, this can
deteriorate so that there are up to 30 free slabs. He notes:

> Even on an old-ish Android phone (Pixel 2), with normal-ish usage, I
> see something like 1.5MiB of pages with zero inuse objects stuck in
> percpu lists.

My observations match Jann's, and we've seen e.g. cases with 10 free slabs per
cpu. We can also confirm Jann's theory that on kernels pre-kmemcg rewrite (in
v5.9), this issue is amplified as there are separate sets of kmem caches with
cpu caches, per-cpu partial and per-node partial lists for each memcg and cache
that deals with kmemcg-accounted objects.

The cached free slabs can therefore become a memory waste, making memory
pressure higher, causing more reclaim of actually used LRU pages, and even
cause OOM (global, or memcg on older kernels).

SLUB provides __kmem_cache_shrink() that can flush all the abovementioned
slabs, but is currently called only in rare situations, or from a sysfs
handler. The standard way to cooperate with reclaim is to provide a shrinker,
and so this patch adds such a shrinker to call __kmem_cache_shrink()
systematically.

The shrinker design is however atypical. The usual design assumes that a
shrinker can easily count how many objects can be reclaimed, and then reclaim
a given number of objects. For SLUB, determining the number of the various cached
slabs would be a lot of work, and controlling how many to shrink precisely
would be impractical. Instead, the shrinker is based on reclaim priority, and
on lowest priority shrinks a single kmem cache, while on highest it shrinks all
of them. To do that effectively, there's a new list caches_to_shrink where
caches are taken from its head and then moved to tail. Existing slab_caches
list is unaffected so that e.g. /proc/slabinfo order is not disrupted.

This approach should not cause excessive shrinking and IPI storms:

- If there are multiple reclaimers in parallel, only one can proceed, thanks to
mutex_trylock(&slab_mutex). After unlocking, caches that were just shrunk
are at the tail of the list.
- in flush_all(), we actually check if there's anything to flush by a CPU
(has_cpu_slab()) before sending an IPI
- CPU slab deactivation became more efficient with "mm, slub: splice cpu and
page freelists in deactivate_slab()"

The result is that SLUB's per-cpu and per-node caches are trimmed of free
pages, and partially used pages have a higher chance of being either reused or
freed. The trimming effort is controlled by reclaim activity and thus memory
pressure. Before an OOM, a reclaim attempt at highest priority ensures
shrinking all caches. Also being a proper slab shrinker, the shrinking is
now also called as part of the drop_caches sysctl operation.

[1] https://lore.kernel.org/linux-mm/CAG48ez2Qx5K1Cab-m8BdSibp6wLTip6ro4=-umR7BLsEgjEYzA@mail.gmail.com/

Reported-by: Jann Horn <[email protected]>
Signed-off-by: Vlastimil Babka <[email protected]>
---
include/linux/slub_def.h | 1 +
mm/slub.c | 76 +++++++++++++++++++++++++++++++++++++++-
2 files changed, 76 insertions(+), 1 deletion(-)

diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
index dcde82a4434c..6c4eeb30764d 100644
--- a/include/linux/slub_def.h
+++ b/include/linux/slub_def.h
@@ -107,6 +107,7 @@ struct kmem_cache {
unsigned int red_left_pad; /* Left redzone padding size */
const char *name; /* Name (only for display!) */
struct list_head list; /* List of slab caches */
+ struct list_head shrink_list; /* List ordered for shrinking */
#ifdef CONFIG_SYSFS
struct kobject kobj; /* For sysfs */
#endif
diff --git a/mm/slub.c b/mm/slub.c
index c3141aa962be..bba05bd9287a 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -123,6 +123,8 @@ DEFINE_STATIC_KEY_FALSE(slub_debug_enabled);
#endif
#endif

+static LIST_HEAD(caches_to_shrink);
+
static inline bool kmem_cache_debug(struct kmem_cache *s)
{
return kmem_cache_debug_flags(s, SLAB_DEBUG_FLAGS);
@@ -3933,6 +3935,8 @@ int __kmem_cache_shutdown(struct kmem_cache *s)
int node;
struct kmem_cache_node *n;

+ list_del(&s->shrink_list);
+
flush_all(s);
/* Attempt to free all objects */
for_each_kmem_cache_node(s, node, n) {
@@ -3985,6 +3989,69 @@ void kmem_obj_info(struct kmem_obj_info *kpp, void *object, struct page *page)
}
#endif

+static unsigned long count_shrinkable_caches(struct shrinker *shrink,
+ struct shrink_control *sc)
+{
+ /*
+ * Determining how much there is to shrink would be so complex, it's
+ * better to just pretend there always is and scale the actual effort
+ * based on sc->priority.
+ */
+ return shrink->batch;
+}
+
+static unsigned long shrink_caches(struct shrinker *shrink,
+ struct shrink_control *sc)
+{
+ struct kmem_cache *s;
+ int nr_to_shrink;
+ int ret = sc->nr_to_scan / 2;
+
+ nr_to_shrink = DEF_PRIORITY - sc->priority;
+ if (nr_to_shrink < 0)
+ nr_to_shrink = 0;
+
+ nr_to_shrink = 1 << nr_to_shrink;
+ if (sc->priority == 0) {
+ nr_to_shrink = INT_MAX;
+ ret = 0;
+ }
+
+ if (!mutex_trylock(&slab_mutex))
+ return SHRINK_STOP;
+
+ list_for_each_entry(s, &caches_to_shrink, shrink_list) {
+ __kmem_cache_shrink(s);
+ if (--nr_to_shrink == 0) {
+ list_bulk_move_tail(&caches_to_shrink,
+ caches_to_shrink.next,
+ &s->shrink_list);
+ break;
+ }
+ }
+
+ mutex_unlock(&slab_mutex);
+
+ /*
+ * As long as we are not at the highest priority, pretend we freed
+ * something, as we might not have processed all caches. This
+ * should signal that it's worth retrying. Once we are at the highest
+ * priority and shrink the whole list, pretend we didn't free anything,
+ * because there's no point in trying again.
+ *
+ * Note the value is currently ultimately ignored in "normal" reclaim,
+ * but drop_slab_node() which handles drop_caches sysctl works like this.
+ */
+ return ret;
+}
+
+static struct shrinker slub_cache_shrinker = {
+ .count_objects = count_shrinkable_caches,
+ .scan_objects = shrink_caches,
+ .batch = 128,
+ .seeks = 0,
+};
+
/********************************************************************
* Kmalloc subsystem
*******************************************************************/
@@ -4424,6 +4491,8 @@ static struct kmem_cache * __init bootstrap(struct kmem_cache *static_cache)
#endif
}
list_add(&s->list, &slab_caches);
+ list_del(&static_cache->shrink_list);
+ list_add(&s->shrink_list, &caches_to_shrink);
return s;
}

@@ -4480,6 +4549,8 @@ void __init kmem_cache_init(void)

void __init kmem_cache_init_late(void)
{
+ if (!register_shrinker(&slub_cache_shrinker))
+ pr_err("SLUB: failed to register shrinker\n");
}

struct kmem_cache *
@@ -4518,11 +4589,14 @@ int __kmem_cache_create(struct kmem_cache *s, slab_flags_t flags)

/* Mutex is not taken during early boot */
if (slab_state <= UP)
- return 0;
+ goto out;

err = sysfs_slab_add(s);
if (err)
__kmem_cache_release(s);
+out:
+ if (!err)
+ list_add(&s->shrink_list, &caches_to_shrink);

return err;
}
--
2.30.0

2021-01-22 01:09:10

by Roman Gushchin

Subject: Re: [RFC 2/2] mm, slub: add shrinker to reclaim cached slabs

On Thu, Jan 21, 2021 at 06:21:54PM +0100, Vlastimil Babka wrote:
> For performance reasons, SLUB doesn't keep all slabs on shared lists and
> doesn't always free slabs immediately after all objects are freed. Namely:
>
> - for each cache and cpu, there might be a "CPU slab" page, partially or fully
> free
> - with SLUB_CPU_PARTIAL enabled (default y), there might be a number of "percpu
> partial slabs" for each cache and cpu, also partially or fully free
> - for each cache and numa node, there are slabs on the per-node partial list;
> up to 10 of those may be empty
>
> As Jann reports [1], the number of percpu partial slabs should be limited by the
> number of free objects (up to 30), but due to imprecise accounting, this can
> deteriorate so that there are up to 30 free slabs. He notes:
>
> > Even on an old-ish Android phone (Pixel 2), with normal-ish usage, I
> > see something like 1.5MiB of pages with zero inuse objects stuck in
> > percpu lists.
>
> My observations match Jann's, and we've seen e.g. cases with 10 free slabs per
> cpu. We can also confirm Jann's theory that on kernels pre-kmemcg rewrite (in
> v5.9), this issue is amplified as there are separate sets of kmem caches with
> cpu caches, per-cpu partial and per-node partial lists for each memcg and cache
> that deals with kmemcg-accounted objects.
>
> The cached free slabs can therefore become a memory waste, making memory
> pressure higher, causing more reclaim of actually used LRU pages, and even
> cause OOM (global, or memcg on older kernels).
>
> SLUB provides __kmem_cache_shrink() that can flush all the abovementioned
> slabs, but is currently called only in rare situations, or from a sysfs
> handler. The standard way to cooperate with reclaim is to provide a shrinker,
> and so this patch adds such a shrinker to call __kmem_cache_shrink()
> systematically.
>
> The shrinker design is however atypical. The usual design assumes that a
> shrinker can easily count how many objects can be reclaimed, and then reclaim
> a given number of objects. For SLUB, determining the number of the various cached
> slabs would be a lot of work, and controlling how many to shrink precisely
> would be impractical. Instead, the shrinker is based on reclaim priority, and
> on lowest priority shrinks a single kmem cache, while on highest it shrinks all
> of them. To do that effectively, there's a new list caches_to_shrink where
> caches are taken from its head and then moved to tail. Existing slab_caches
> list is unaffected so that e.g. /proc/slabinfo order is not disrupted.
>
> This approach should not cause excessive shrinking and IPI storms:
>
> - If there are multiple reclaimers in parallel, only one can proceed, thanks to
> mutex_trylock(&slab_mutex). After unlocking, caches that were just shrunk
> are at the tail of the list.
> - in flush_all(), we actually check if there's anything to flush by a CPU
> (has_cpu_slab()) before sending an IPI
> - CPU slab deactivation became more efficient with "mm, slub: splice cpu and
> page freelists in deactivate_slab()"
>
> The result is that SLUB's per-cpu and per-node caches are trimmed of free
> pages, and partially used pages have a higher chance of being either reused or
> freed. The trimming effort is controlled by reclaim activity and thus memory
> pressure. Before an OOM, a reclaim attempt at highest priority ensures
> shrinking all caches. Also being a proper slab shrinker, the shrinking is
> now also called as part of the drop_caches sysctl operation.

Hi Vlastimil!

This makes a lot of sense; however, it looks a bit like overkill to me (on 5.9+).
Isn't limiting the number of pages (instead of the number of objects) sufficient on 5.9+?

If not, maybe we can limit the shrinking to the pre-OOM condition?
Do we really need to trip it constantly?

Thanks!

2021-01-26 12:10:43

by Vlastimil Babka

Subject: Re: [RFC 2/2] mm, slub: add shrinker to reclaim cached slabs

On 1/22/21 1:48 AM, Roman Gushchin wrote:
> On Thu, Jan 21, 2021 at 06:21:54PM +0100, Vlastimil Babka wrote:
>
> Hi Vlastimil!
>
> This makes a lot of sense; however, it looks a bit like overkill to me (on 5.9+).
> Isn't limiting the number of pages (instead of the number of objects) sufficient on 5.9+?

It would help, but fundamentally there can still be a lot of memory locked up
with e.g. many CPUs. We should have a way to flush this automatically, like for
other cached things.

> If not, maybe we can limit the shrinking to the pre-OOM condition?
> Do we really need to trip it constantly?

The priority could be reduced, pre-OOM might be too extreme. Why reclaim e.g.
actually used LRU pages instead of unused slab pages?
IMHO a frequently reclaiming system probably doesn't benefit from SLUB's peak
performance at that point anyway...

> Thanks!
>