2022-04-04 23:36:18

by Roman Gushchin

Subject: Re: [RFC] mm/vmscan: add periodic slab shrinker

On Mon, Apr 04, 2022 at 11:09:48AM +1000, Dave Chinner wrote:
> On Sun, Apr 03, 2022 at 08:56:18AM +0800, Hillf Danton wrote:
> > On Sat, 2 Apr 2022 10:54:36 -0700 Roman Gushchin wrote:
> > > Hello Hillf!
> > >
> > Hello Roman,
> >
> > > Thank you for sharing it, really interesting! I'm actually working on
> > > the same problem.
> >
> > Good to know you have some interest in it.
> > Feel free to let me know if you would like to take it over to avoid
> > duplicated work on both sides.
> >
> > >
> > > No code to share yet, but here are some of my thoughts:
> > > 1) If there is a "natural" memory pressure, no additional slab
> > > scanning is needed.
> >
> > Agree - the periodic shrinker can be canceled once kswapd wakes up.
>
> I think we should be waking up per-node kswapd to do the periodic
> shrinking, not adding yet another way of executing (thousands of)
> shrinkers (across hundreds of nodes) from a single threaded context.

I'm on the same page here. Ideally we just need to set an "extra" slab
scanning target somewhere and wake up kswapd.

>
> Indeed, the problem of "no reclaim when there is no memory
> pressure" also affects the page cache, not just shrinkable caches
> and we really don't want periodic reclaim to have a completely
> different behaviour to general memory reclaim.

I agree. Slabs are worse because they fragment memory with unmovable
pages and are generally slower/more expensive to reclaim. But in general
it's the same problem.

>
> i.e. the amount of work that shrinkers need to do in a periodic scan
> is largely determined by the rate of shrinkable cache memory usage
> growth rather than memory reclaim priority as it is now. Hence there
> needs to be different high level "shrinker needs to do X amount of
> work" calculation for periodic reclaim than there is now.
>
> e.g. we calculate a rolling average of the size of the cache and a
> rate of change over a series of polling operations (i.e. calling
> ->scan_count) and then when sustained growth is detected we start
> trying to shrink the cache to limit the rate of growth of the cache.
>
> If the cache keeps growing, then its objects are being repeatedly
> referenced and it *should* keep growing. If it's one-off objects
> that are causing the growth of the cache and so objects are being
> reclaimed by the shrinker, then matching the periodic shrink scan to
> the growth rate will substantially reduce the rate of growth of that
> cache.

A clever idea!
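
If I understand the idea correctly, it's roughly the following (a purely
illustrative sketch; none of these names or thresholds exist in the kernel):

struct periodic_shrink_state {
        unsigned long avg_count;        /* decayed average object count */
        unsigned long last_count;       /* count seen at the previous poll */
};

/*
 * Called once per polling interval with the current object count; returns
 * how many objects the periodic scan should ask the shrinker to consider,
 * or 0 if no sustained growth is detected.
 */
static unsigned long periodic_scan_target(struct periodic_shrink_state *st,
                                          unsigned long count)
{
        unsigned long target = 0;

        /* cheap rolling average: avg = 3/4 * avg + 1/4 * count */
        st->avg_count = (3 * st->avg_count + count) / 4;

        /* sustained growth: above both the average and the previous poll */
        if (count > st->avg_count && count > st->last_count)
                target = count - st->last_count;

        st->last_count = count;
        return target;
}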

It seems like we need to add some stats to the list_lru API or maybe to
the shrinker API (and let list_lru use it), e.g. total/scanned/reclaimed,
maybe with a time decay.

I'm also thinking about:
1) adding a sysfs/debugfs interface to expose shrinkers' current size and
statistics, with an ability to call into the reclaim manually.
2) formalizing a reference bit/counter API at the shrinker level, so that
shrinker users can explicitly mark objects' (re)activation.
3) _maybe_ we need to change the shrinkers API from a number of objects to
bytes, so that releasing a small number of large objects can compete with
releasing many small objects (a rough sketch is below the list). But I'm
not sure.
4) also, some shrinkers (e.g. shadow nodes) lie about the total size of
their objects, and I have an uneasy feeling about this approach.
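
For (3), the kind of thing I have in mind looks roughly like the sketch
below - purely hypothetical structures for illustration, not the existing
shrinker/shrink_control API:

struct shrink_control_bytes {
        unsigned long bytes_to_reclaim;         /* what the caller wants back */
        unsigned long bytes_reclaimed;          /* what the scan released */
};

struct shrinker_bytes_ops {
        /* report the cache size in bytes rather than in objects */
        unsigned long (*count_bytes)(void *cache);
        /*
         * free the coldest objects until ~bytes_to_reclaim is released,
         * whether that means one huge object or thousands of tiny ones
         */
        unsigned long (*scan_bytes)(void *cache,
                                    struct shrink_control_bytes *sc);
};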

Dave, what are your thoughts on these ideas/issues?

>
> And if it's integrated into the existing do_shrink_slab
> calculations, the moment actual memory reclaim calls the shrinker
> the periodic scan calculations can be reset back to zero and it
> starts again...
>
> > > 2) From a power perspective it's better to scan more at once, but
> > > less often.
> >
> > The shrinker proposed is a catapult on the vmscan side without knowing
> > where the cold slab objects are piling up in Dave's backyard but he is
> > free to take different actions than the regular shrinker - IOW this
> > shrinker alone does not make much sense wrt shooting six birds without
> > the stone on the slab owner side.
> >
> > It is currently scanning *every* slab cache at an arbitrary frequency,
> > once every 30 seconds - I am open to a minute or whatever.
>
> Sorry, I don't understand what "Dave's backyard" is or why it would
> ever need to be special cased?

Me neither.

Also, 30 seconds sounds like a perfect way to drain a laptop's battery.
I'd say 1h is a more reasonable scale, but of course it's pure speculation.

>
> > > 3) Maybe we need a feedback loop with the slab allocator: e.g. if slabs
> > > are almost full there is more sense to do a proactive scanning and free
> > > up some memory, otherwise we'll end up allocating more slabs. But it's
> > > tricky.
> >
> > There are 31 bits available in the periodic flag added to shrink control.
> >
> > > 4) If the scanning is not resulting in any memory reclaim, maybe we
> > > should (temporarily) exclude the corresponding shrinker from the scanning.
> >
> > Given the periodic flag, Dave is free to ignore the scan request and the
> > scan result is currently dropped on the vmscan side because what is
> > considered is the cold slab objects that for instance have been inactive
> > for more than 30 seconds in every slab cache, rather than kswapd's cake.
>
> I don't understand how passing a "periodic" flag to individual
> shrinkers is really useful here. How does the shrinker
> implementation use this to determine how much work it needs to do?
>
> i.e. The amount of work a shrinker needs to perform is calculated by
> the high level slab scanning code based on relative cache size and
> reclaim priority. If there's a periodic scanner, it should be
> calculating a specific amount of work for the shrinkers to do way up
> in do_shrink_slab() and then asking the shrinker to perform that
> work in exactly the way it does now - the shrinker itself doesn't
> need to know anything about whether it's a periodic memory reclaim
> scan or whether there's actual memory pressure - it just needs to
> scan the oldest objects in its cache to try to reclaim them.

Again, I'm on the same page here. We already have a mechanism for deferred
work; it, or something similar, can be reused to schedule the extra work.

Thanks!


2022-04-05 07:15:18

by Dave Chinner

Subject: Re: [RFC] mm/vmscan: add periodic slab shrinker

On Mon, Apr 04, 2022 at 12:08:25PM -0700, Roman Gushchin wrote:
> On Mon, Apr 04, 2022 at 11:09:48AM +1000, Dave Chinner wrote:
> > i.e. the amount of work that shrinkers need to do in a periodic scan
> > is largerly determined by the rate of shrinkable cache memory usage
> > growth rather than memory reclaim priority as it is now. Hence there
> > needs to be different high level "shrinker needs to do X amount of
> > work" calculation for periodic reclaim than there is now.
> >
> > e.g. we calculate a rolling average of the size of the cache and a
> > rate of change over a series of polling operations (i.e. calling
> > ->scan_count) and then when sustained growth is detected we start
> > trying to shrink the cache to limit the rate of growth of the cache.
> >
> > If the cache keeps growing, then it's objects are being repeatedly
> > referenced and it *should* keep growing. If it's one-off objects
> > that are causing the growth of the cache and so objects are being
> > reclaimed by the shrinker, then matching the periodic shrink scan to
> > the growth rate will substantially reduce the rate of growth of that
> > cache.
>
> A clever idea!
>
> It seems like we need to add some stats to the list_lru API or maybe to
> the shrinker API (and let list_lru to use it).
> E.g. total/scanned/reclaimed, maybe with a time decay
>
> I'm also thinking about:
> 1) adding a sysfs/debugfs interface to expose shrinkers current size and
> statistics, with an ability to call into the reclaim manually.

I've thought about it, too, and can see where it could be useful.
However, when I consider the list_lru memcg integration, I suspect
it becomes a "can't see the forest for the trees" problem. We're
going to end up with millions of sysfs objects with no obvious way
to navigate, iterate or search them if we just take the naive "sysfs
object + stats per list_lru instance" approach.

Also, if you look at commit 6a6b7b77cc0f ("mm: list_lru: transpose the
array of per-node per-memcg lru lists") that went into 5.18-rc1,
you'll get an idea of the amount of memory overhead just tracking
the list_lru x memcg infrastructure consumes at scale:

I have done a simple test to show the optimization. I created 10k memory
cgroups and mounted 10k filesystems in the system. We use the free command to
show how much memory the system consumes after this operation (there
are 2 NUMA nodes in the system).

+-----------------------+------------------------+
| condition | memory consumption |
+-----------------------+------------------------+
| without this patchset | 24464 MB |
+-----------------------+------------------------+
| after patch 1 | 21957 MB | <--------+
+-----------------------+------------------------+ |
| after patch 10 | 6895 MB | |
+-----------------------+------------------------+ |
| after patch 12 | 4367 MB | |
+-----------------------+------------------------+ |
|
The more the number of nodes, the more obvious the effect---+

If we now add sysfs objects and stats arrays to each of the
list_lrus that we initiate even now on 5.18-rc1, we're going to
massively blow out the memory footprint again.

So, nice idea, but I'm not sure we can make it useful and not
consume huge amounts of memory....

What might be more useful is a way of getting the kernel to tell us
what the, say, 20 biggest slab caches are in the system and then
provide a way to selectively shrink them. We generally don't really
care about tiny caches, just the ones consuming all the memory. This
isn't my idea - Kent has been looking at this because of how useless
OOM kill output is for debugging slab cache based OOM triggers. See
this branch for an example:

https://evilpiepirate.org/git/bcachefs.git/log/?h=shrinker_to_text

Even a sysfs entry that you echo a number into and that returns the
"top N" largest LRU lists would do. Echo -1 into it and it returns every
single one if you want all the information.

The whole "one sysfs file, one value" architecture falls completely
apart when we might have to indexing *millions* of internal
structures with many parameters per structure...
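
For the "top N" query above, the shape I'm thinking of is roughly this
(userspace-style sketch, invented names, with a flat array standing in for
however the kernel would actually walk its registered shrinkers):

#include <stdio.h>
#include <stdlib.h>

struct cache_info {
        const char *name;
        unsigned long nr_objects;
};

/* sort helper: biggest cache first */
static int cmp_cache_size(const void *a, const void *b)
{
        const struct cache_info *ca = a, *cb = b;

        if (ca->nr_objects == cb->nr_objects)
                return 0;
        return ca->nr_objects < cb->nr_objects ? 1 : -1;
}

/* report the n largest caches; n < 0 means "all of them" */
static void report_top_n(struct cache_info *caches, int nr, int n)
{
        qsort(caches, nr, sizeof(*caches), cmp_cache_size);
        if (n < 0 || n > nr)
                n = nr;
        for (int i = 0; i < n; i++)
                printf("%s: %lu objects\n", caches[i].name,
                       caches[i].nr_objects);
}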

> 2) formalizing a reference bit/counter API on the shrinkers level, so that
> shrinker users can explicitly mark objects (re)-"activation".

Not 100% certain what you are referring to here - something to do
with active object rotation? Or an active/inactive list split with periodic
demotion like we have for the page LRUs? Or workingset refault
detection? Can you explain in more detail?

> 3) _maybe_ we need to change the shrinkers API from a number of objects to
> bytes, so that releasing a small number of large objects can compete with
> a releasing on many small objects. But I'm not sure.

I think I suggested something similar a long time ago. We have
shrinkers that track things other than slab objects. e.g. IIRC the
ttm graphics allocator shrinker tracks sets of pages and frees
pages, not slab objects. The XFS buffer cache tracks variable-sized
objects, from 512 bytes to 64KB in length, so the amount of memory
it frees is variable even if the number of handles it scans and
reclaims is fixed and consistent. Other subsystems have different
"non object" shrinker needs as well.

The long and short of it is that two shrinkers might have the same
object count, but one might free 10x the amount of memory than the
other for the same amount of shrinking work. Being able to focus
reclaim work on caches that can free a lot of memory much more
quickly would be a great idea.

It also means that a shrinker that scans a fragmented slab can keep
going until a set number of slab pages have been freed, rather than
a set number of slab objects. We can push reclaim of fragmented slab
caches much harder when necessary if we are reclaiming by freed byte
counts...
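
Roughly speaking, the scan loop would be driven by a byte target instead of
an object target - just a sketch with invented names, where
reclaim_oldest_batch() stands in for whatever mechanism the shrinker uses to
free its oldest objects:

struct my_cache;        /* stand-in for the subsystem's cache */

/* frees up to @nr of the oldest objects, returns bytes released */
unsigned long reclaim_oldest_batch(struct my_cache *cache, unsigned long nr);

static unsigned long shrink_by_bytes(struct my_cache *cache,
                                     unsigned long target_bytes)
{
        unsigned long freed = 0;

        while (freed < target_bytes) {
                unsigned long batch = reclaim_oldest_batch(cache, 128);

                if (!batch)             /* nothing reclaimable left */
                        break;
                freed += batch;
        }
        return freed;                   /* bytes, not objects */
}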

So, yeah, byte-count based reclaim definitely has merit compared to
what we currently do. It's more generic and more flexible...

> 4) also some shrinkers (e.g. shadow nodes) lying about the total size of
> objects and I have an uneasy feeling about this approach.

Yeah, never been a fan of use cases like that, but IIRC it was really
the only option at the time to make sure reclaim is only triggered when
the kernel is absolutely desperate, because we really need the working
set shadow nodes to do their job well when memory is very low...

Cheers,

Dave.
--
Dave Chinner
[email protected]

2022-04-06 11:42:56

by Roman Gushchin

Subject: Re: [RFC] mm/vmscan: add periodic slab shrinker

On Tue, Apr 05, 2022 at 03:17:10PM +1000, Dave Chinner wrote:
> On Mon, Apr 04, 2022 at 12:08:25PM -0700, Roman Gushchin wrote:
> > On Mon, Apr 04, 2022 at 11:09:48AM +1000, Dave Chinner wrote:
> > > i.e. the amount of work that shrinkers need to do in a periodic scan
> > > is largerly determined by the rate of shrinkable cache memory usage
> > > growth rather than memory reclaim priority as it is now. Hence there
> > > needs to be different high level "shrinker needs to do X amount of
> > > work" calculation for periodic reclaim than there is now.
> > >
> > > e.g. we calculate a rolling average of the size of the cache and a
> > > rate of change over a series of polling operations (i.e. calling
> > > ->scan_count) and then when sustained growth is detected we start
> > > trying to shrink the cache to limit the rate of growth of the cache.
> > >
> > > If the cache keeps growing, then it's objects are being repeatedly
> > > referenced and it *should* keep growing. If it's one-off objects
> > > that are causing the growth of the cache and so objects are being
> > > reclaimed by the shrinker, then matching the periodic shrink scan to
> > > the growth rate will substantially reduce the rate of growth of that
> > > cache.
> >
> > A clever idea!
> >
> > It seems like we need to add some stats to the list_lru API or maybe to
> > the shrinker API (and let list_lru to use it).
> > E.g. total/scanned/reclaimed, maybe with a time decay
> >
> > I'm also thinking about:
> > 1) adding a sysfs/debugfs interface to expose shrinkers current size and
> > statistics, with an ability to call into the reclaim manually.
>
> I've thought about it, too, and can see where it could be useful.
> However, when I consider the list_lru memcg integration, I suspect
> it becomes a "can't see the forest for the trees" problem. We're
> going to end up with millions of sysfs objects with no obvious way
> to navigate, iterate or search them if we just take the naive "sysfs
> object + stats per list_lru instance" approach.

Ok, I'll try to put something together and share patches. I assume it would be
useful to have statistics and to be able to trigger the reclaim of a particular
shrinker without creating real memory pressure. I anticipate many interesting
findings in the implementations of individual shrinkers...

>
> Also, if you look at commit 6a6b7b77cc0f mm: ("list_lru: transpose the
> array of per-node per-memcg lru lists") that went into 5.18-rc1,
> you'll get an idea of the amount of memory overhead just tracking
> the list_lru x memcg infrastructure consumes at scale:
>
> I had done a easy test to show the optimization. I create 10k memory
> cgroups and mount 10k filesystems in the systems. We use free command to
> show how many memory does the systems comsumes after this operation (There
> are 2 numa nodes in the system).
>
> +-----------------------+------------------------+
> | condition | memory consumption |
> +-----------------------+------------------------+
> | without this patchset | 24464 MB |
> +-----------------------+------------------------+
> | after patch 1 | 21957 MB | <--------+
> +-----------------------+------------------------+ |
> | after patch 10 | 6895 MB | |
> +-----------------------+------------------------+ |
> | after patch 12 | 4367 MB | |
> +-----------------------+------------------------+ |
> |
> The more the number of nodes, the more obvious the effect---+

Yes, I know, I've reviewed this patchset. However, 10k cgroups _and_ 10k
mounts look a bit artificial. It's hard to believe that there are 100M
LRUs containing hot objects. If most of them are cold, it's actually
a question of how to efficiently reclaim them and free the wasted memory.

>
> If we now add sysfs objects and stats arrays to each of the
> list_lrus that we initiate even now on 5.18-rc1, we're going to
> massively blow out the memory footprint again.

sysfs can be a config option, it doesn't have to be enabled in prod.
But I agree, because of memory constraints we're very limited in the size
of statistics which can be used for reclaim decisions.

>
> So, nice idea, but I'm not sure we can make it useful and not
> consume huge amounts of memory....
>
> What might be more useful is a way of getting the kernel to tell us
> what the, say, 20 biggest slab caches are in the system and then
> provide a way to selectively shrink them. We generally don't really
> care about tiny caches, just the ones consuming all the memory. This
> isn't my idea - Kent has been looking at this because of how useless
> OOM kill output is for debugging slab cache based OOM triggers. See
> this branch for an example:
>
> https://evilpiepirate.org/git/bcachefs.git/log/?h=shrinker_to_text
>
> Even a sysfs entry that you echo a number into and it returns
> the "top N" largest LRU lists. echo -1 into it and it returns every
> single one if you want all the information.
>
> The whole "one sysfs file, one value" architecture falls completely
> apart when we might have to indexing *millions* of internal
> structures with many parameters per structure...

It's a good point, I need to think about what we can do here.

Actually, if it's so hard and inefficient for a sysfs interface,
it also means it's inefficient for the reclaim path. So if we have
an ability to quickly find beefy LRU lists worth shrinking, it might
improve the reclaim efficiency too.

Also, thinking of reducing the memory footprint, it would be great
to combine LRUs belonging to different superblocks (of the same type).
Pagecache analogy: we have an LRU for all pagecache pages, it's not
per-file or per-sb. Not sure how easy/viable it is.

>
> > 2) formalizing a reference bit/counter API on the shrinkers level, so that
> > shrinker users can explicitly mark objects (re)-"activation".
>
> Not 100% certain what you are refering to here - something to do
> with active object rotation? Or an active/inactive list split with period
> demotion like we have for the page LRUs? Or workingset refault
> detection? Can you explain in more detail?

It's all vague thinking at this point, but you got me right: I was thinking
about all of the ideas listed above. In general I'm trying to understand
whether the existing model is sufficient for more or less effective management
of hot/cold objects and whether we can improve it by borrowing some ideas from
the page reclaim code.
Shadow entries would be great but are likely not possible because of the memory
footprint.

>
> > 3) _maybe_ we need to change the shrinkers API from a number of objects to
> > bytes, so that releasing a small number of large objects can compete with
> > a releasing on many small objects. But I'm not sure.
>
> I think I suggested something similar a long time ago. We have
> shrinkers that track things other than slab objects. e.g. IIRC the
> ttm graphics allocator shrinker tracks sets of pages and frees
> pages, not slab objects. The XFS buffer cache tracks variable sized
> objects, from 512 bytes to 64KB in length, so the amount of memory
> it frees is variable even if the number of handles it scans and
> reclaims is fixed and consistent. Other subsystems have different
> "non object" shrinker needs as well.

It's true for many objects, because it's not unusual for kernel objects
to have attached kmallocs or to pin other objects in memory. It's unlikely
we can be 100% accurate, but we might improve the "smoothness" of the
reclaim process.

>
> The long and short of it is that two shrinkers might have the same
> object count, but one might free 10x the amount of memory than the
> other for the same amount of shrinking work. Being able to focus
> reclaim work on caches that can free a lot of memory much more
> quickly would be a great idea.

Right. It's also about the "bytes freed/seeks" ratio: why would we reclaim
expensive small objects if there are big and cheap ones?

>
> It also means that a shrinker that scans a fragmented slab can keep
> going until a set number of slab pages have been freed, rather than
> a set number of slab objects. We can push reclaim of fragmented slab
> caches much harder when necessary if we are reclaiming by freed byte
> counts...

I thought about it, but it's tricky. If all slabs are almost full (fragmentation
is low), we want to shrink aggressively, otherwise we end up allocating
more slabs and fragmenting more memory. If slabs are almost empty (fragmentation
is very high), we also want to shrink more aggressively in the hope of releasing
physical memory, as you said. But setting a target in pages is dangerous: slab caches
can be merged, or slab pages can be pinned by objects belonging to a different
memory cgroup, so it might not be possible to reclaim pages no matter how hard
we're trying.

>
> So, yeah, byte-count based reclaim definitely has merit compared to
> what we currently do. It's more generic and more flexible...

Great, I'm glad we're on the same page here.

Thank you for the answers, I think it's a really useful discussion!

Roman

2022-04-06 11:55:34

by Yang Shi

Subject: Re: [RFC] mm/vmscan: add periodic slab shrinker

On Tue, Apr 5, 2022 at 9:36 AM Roman Gushchin <[email protected]> wrote:
>
> On Tue, Apr 05, 2022 at 03:17:10PM +1000, Dave Chinner wrote:
> > On Mon, Apr 04, 2022 at 12:08:25PM -0700, Roman Gushchin wrote:
> > > On Mon, Apr 04, 2022 at 11:09:48AM +1000, Dave Chinner wrote:
> > > > i.e. the amount of work that shrinkers need to do in a periodic scan
> > > > is largerly determined by the rate of shrinkable cache memory usage
> > > > growth rather than memory reclaim priority as it is now. Hence there
> > > > needs to be different high level "shrinker needs to do X amount of
> > > > work" calculation for periodic reclaim than there is now.
> > > >
> > > > e.g. we calculate a rolling average of the size of the cache and a
> > > > rate of change over a series of polling operations (i.e. calling
> > > > ->scan_count) and then when sustained growth is detected we start
> > > > trying to shrink the cache to limit the rate of growth of the cache.
> > > >
> > > > If the cache keeps growing, then it's objects are being repeatedly
> > > > referenced and it *should* keep growing. If it's one-off objects
> > > > that are causing the growth of the cache and so objects are being
> > > > reclaimed by the shrinker, then matching the periodic shrink scan to
> > > > the growth rate will substantially reduce the rate of growth of that
> > > > cache.
> > >
> > > A clever idea!
> > >
> > > It seems like we need to add some stats to the list_lru API or maybe to
> > > the shrinker API (and let list_lru to use it).
> > > E.g. total/scanned/reclaimed, maybe with a time decay
> > >
> > > I'm also thinking about:
> > > 1) adding a sysfs/debugfs interface to expose shrinkers current size and
> > > statistics, with an ability to call into the reclaim manually.
> >
> > I've thought about it, too, and can see where it could be useful.
> > However, when I consider the list_lru memcg integration, I suspect
> > it becomes a "can't see the forest for the trees" problem. We're
> > going to end up with millions of sysfs objects with no obvious way
> > to navigate, iterate or search them if we just take the naive "sysfs
> > object + stats per list_lru instance" approach.
>
> Ok, I'll try to master something and share patches. I assume it would be useful
> to have statistics and be able to trigger a reclaim of a particular shrinker
> without creating a real memory pressure. I anticipate many interesting
> findings in the implementation of individual shrinkers...
>
> >
> > Also, if you look at commit 6a6b7b77cc0f mm: ("list_lru: transpose the
> > array of per-node per-memcg lru lists") that went into 5.18-rc1,
> > you'll get an idea of the amount of memory overhead just tracking
> > the list_lru x memcg infrastructure consumes at scale:
> >
> > I had done a easy test to show the optimization. I create 10k memory
> > cgroups and mount 10k filesystems in the systems. We use free command to
> > show how many memory does the systems comsumes after this operation (There
> > are 2 numa nodes in the system).
> >
> > +-----------------------+------------------------+
> > | condition | memory consumption |
> > +-----------------------+------------------------+
> > | without this patchset | 24464 MB |
> > +-----------------------+------------------------+
> > | after patch 1 | 21957 MB | <--------+
> > +-----------------------+------------------------+ |
> > | after patch 10 | 6895 MB | |
> > +-----------------------+------------------------+ |
> > | after patch 12 | 4367 MB | |
> > +-----------------------+------------------------+ |
> > |
> > The more the number of nodes, the more obvious the effect---+
>
> Yes, I know, I've reviewed this patchset. However, 10k cgroups _and_ 10k
> mounts look a bit artificial. It's hard to believe that there are 100M
> LRUs containing hot objects. If most of them are cold, it's actually
> a question how to efficiently reclaim them and free the wasted memory.
>
> >
> > If we now add sysfs objects and stats arrays to each of the
> > list_lrus that we initiate even now on 5.18-rc1, we're going to
> > massively blow out the memory footprint again.
>
> sysfs can be a config option, it doesn't have to be enabled in prod.
> But I agree, because of memory constraints we're very limited in the size
> of statistics which can be used for reclaim decisions.
>
> >
> > So, nice idea, but I'm not sure we can make it useful and not
> > consume huge amounts of memory....
> >
> > What might be more useful is a way of getting the kernel to tell us
> > what the, say, 20 biggest slab caches are in the system and then
> > provide a way to selectively shrink them. We generally don't really
> > care about tiny caches, just the ones consuming all the memory. This
> > isn't my idea - Kent has been looking at this because of how useless
> > OOM kill output is for debugging slab cache based OOM triggers. See
> > this branch for an example:
> >
> > https://evilpiepirate.org/git/bcachefs.git/log/?h=shrinker_to_text
> >
> > Even a sysfs entry that you echo a number into and it returns
> > the "top N" largest LRU lists. echo -1 into it and it returns every
> > single one if you want all the information.
> >
> > The whole "one sysfs file, one value" architecture falls completely
> > apart when we might have to indexing *millions* of internal
> > structures with many parameters per structure...
>
> It's a good point, I need to think what can we do here.
>
> Actually, if it's so hard and inefficient for a sysfs interface,
> it also means it's inefficient for the reclaim path. So if we have
> an ability to quickly find beefy LRU lists worth shrinking, it might
> improve the reclaim efficiency too.
>
> Actually, thinking of reducing the memory footprint, it would be great
> to combine LRU's belonging to different superblocks (of the same type).
> Pagecache analogy: we have an LRU for all pagecache pages, it's not
> per-file or per-sb. Not sure how easy/viable it is.
>
> >
> > > 2) formalizing a reference bit/counter API on the shrinkers level, so that
> > > shrinker users can explicitly mark objects (re)-"activation".
> >
> > Not 100% certain what you are refering to here - something to do
> > with active object rotation? Or an active/inactive list split with period
> > demotion like we have for the page LRUs? Or workingset refault
> > detection? Can you explain in more detail?
>
> It's a vague thinking at this point, but you got me right, I was thinking about
> all the listed ideas above. In general I'm trying to understand whether
> the existing model is sufficient for a more or less effective management
> of hot/cold objects and if we can improve it borrowing some ideas from
> the page reclaim code.
> Shadow entries would be great but likely not possible because of the memory
> footprint.
>
> >
> > > 3) _maybe_ we need to change the shrinkers API from a number of objects to
> > > bytes, so that releasing a small number of large objects can compete with
> > > a releasing on many small objects. But I'm not sure.
> >
> > I think I suggested something similar a long time ago. We have
> > shrinkers that track things other than slab objects. e.g. IIRC the
> > ttm graphics allocator shrinker tracks sets of pages and frees
> > pages, not slab objects. The XFS buffer cache tracks variable sized
> > objects, from 512 bytes to 64KB in length, so the amount of memory
> > it frees is variable even if the number of handles it scans and
> > reclaims is fixed and consistent. Other subsystems have different
> > "non object" shrinker needs as well.
>
> It's true for many objects, because it's not unusual for kernel objects
> to have attached kmallocs or pin other objects in the memory. Unlikely
> we can be 100% accurate, but we might improve the "smoothness" of the
> reclaim process.
>
> >
> > The long and short of it is that two shrinkers might have the same
> > object count, but one might free 10x the amount of memory than the
> > other for the same amount of shrinking work. Being able to focus
> > reclaim work on caches that can free a lot of memory much more
> > quickly would be a great idea.
>
> Right. It's also about "bytes freed/seeks" ratio. Why would we reclaim
> expensive small objects if there are big and cheap.
>
> >
> > It also means that a shrinker that scans a fragmented slab can keep
> > going until a set number of slab pages have been freed, rather than
> > a set number of slab objects. We can push reclaim of fragmented slab
> > caches much harder when necessary if we are reclaiming by freed byte
> > counts...
>
> I thought about it, but it's tricky. If all slabs are almost full (fragmentation
> is low), we want to shrink aggressively, otherwise we end up allocating
> more slabs and fragmenting more memory. If slabs are almost empty (fragmentation
> is very high), we also want to shrink more aggressively with a hope to release
> physical memory, as you said. But setting a target in pages is dangerous: slab caches
> can be merged or slab pages can be pinned by objects belonging to a different
> memory cgroup, so it might be not possible to reclaim pages no matter how hard
> we're trying.

Yeah, I agree it doesn't make too much sense to return the
number of reclaimed objects. Other parts of vmscan return a number
of base pages, while slab object sizes vary and may be much
smaller than a page; a dentry, for example, may be 192 bytes.

Another problem which doesn't help the feedback loop is that even if
bytes are returned instead of the number of objects, it doesn't mean
that much memory is actually freed and available for allocation. IMHO
the number of really freed pages should be returned (I do understand
it is not that easy for now), and returning 0 should be fine. The
current logic (returning the number of objects) may feed back something
over-optimistic. I have, at least once or twice, seen a significant
amount of slab caches being shrunk while actually 0 pages were freed.
TBH the new slab controller may make it worse, since a page may be
pinned by objects from other memcgs.
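
Something like the sketch below is what I mean - made-up names purely for
illustration, not a proposal for the actual interface:

struct scan_result {
        unsigned long objects_freed;    /* what the shrinker let go of */
        unsigned long pages_freed;      /* slab pages actually returned */
};

/*
 * Only pages_freed tells the caller whether memory became available for
 * allocation; a large objects_freed with pages_freed == 0 just means we
 * punched holes in slabs that are still pinned by other objects (possibly
 * objects from other memcgs with the new slab controller).
 */
static int scan_made_progress(const struct scan_result *res)
{
        return res->pages_freed > 0;
}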

>
> >
> > So, yeah, byte-count based reclaim definitely has merit compared to
> > what we currently do. It's more generic and more flexible...
>
> Great, I'm glad we're on the same page here.
>
> Thank you for the answers, I think it's a really useful discussion!
>
> Roman
>

2022-04-06 12:25:16

by Matthew Wilcox

Subject: Re: [RFC] mm/vmscan: add periodic slab shrinker

On Tue, Apr 05, 2022 at 01:58:59PM -0700, Yang Shi wrote:
> Yeah, I agree it actually doesn't make too much sense to return the
> number of reclaimed objects. Other part of vmscan returns the number
> of base pages, the sizes of slab objects are varied, it may be much
> smaller than a page, for example, dentry may be 192 bytes.

From the point of view of vmscan, it only cares about the number of pages
freed because it's trying to free pages. But from the point of view of
trying to keep the number of non-useful objects in check, the number of
objects freed is more important, and it doesn't matter whether we ended
up freeing any pages because we made memory available for this slab cache.

2022-04-06 13:28:17

by Roman Gushchin

Subject: Re: [RFC] mm/vmscan: add periodic slab shrinker

On Tue, Apr 05, 2022 at 01:58:59PM -0700, Yang Shi wrote:
> On Tue, Apr 5, 2022 at 9:36 AM Roman Gushchin <[email protected]> wrote:
> >
> > On Tue, Apr 05, 2022 at 03:17:10PM +1000, Dave Chinner wrote:
> > > On Mon, Apr 04, 2022 at 12:08:25PM -0700, Roman Gushchin wrote:
> > > > On Mon, Apr 04, 2022 at 11:09:48AM +1000, Dave Chinner wrote:
> > > > > i.e. the amount of work that shrinkers need to do in a periodic scan
> > > > > is largerly determined by the rate of shrinkable cache memory usage
> > > > > growth rather than memory reclaim priority as it is now. Hence there
> > > > > needs to be different high level "shrinker needs to do X amount of
> > > > > work" calculation for periodic reclaim than there is now.
> > > > >
> > > > > e.g. we calculate a rolling average of the size of the cache and a
> > > > > rate of change over a series of polling operations (i.e. calling
> > > > > ->scan_count) and then when sustained growth is detected we start
> > > > > trying to shrink the cache to limit the rate of growth of the cache.
> > > > >
> > > > > If the cache keeps growing, then it's objects are being repeatedly
> > > > > referenced and it *should* keep growing. If it's one-off objects
> > > > > that are causing the growth of the cache and so objects are being
> > > > > reclaimed by the shrinker, then matching the periodic shrink scan to
> > > > > the growth rate will substantially reduce the rate of growth of that
> > > > > cache.
> > > >
> > > > A clever idea!
> > > >
> > > > It seems like we need to add some stats to the list_lru API or maybe to
> > > > the shrinker API (and let list_lru to use it).
> > > > E.g. total/scanned/reclaimed, maybe with a time decay
> > > >
> > > > I'm also thinking about:
> > > > 1) adding a sysfs/debugfs interface to expose shrinkers current size and
> > > > statistics, with an ability to call into the reclaim manually.
> > >
> > > I've thought about it, too, and can see where it could be useful.
> > > However, when I consider the list_lru memcg integration, I suspect
> > > it becomes a "can't see the forest for the trees" problem. We're
> > > going to end up with millions of sysfs objects with no obvious way
> > > to navigate, iterate or search them if we just take the naive "sysfs
> > > object + stats per list_lru instance" approach.
> >
> > Ok, I'll try to master something and share patches. I assume it would be useful
> > to have statistics and be able to trigger a reclaim of a particular shrinker
> > without creating a real memory pressure. I anticipate many interesting
> > findings in the implementation of individual shrinkers...
> >
> > >
> > > Also, if you look at commit 6a6b7b77cc0f mm: ("list_lru: transpose the
> > > array of per-node per-memcg lru lists") that went into 5.18-rc1,
> > > you'll get an idea of the amount of memory overhead just tracking
> > > the list_lru x memcg infrastructure consumes at scale:
> > >
> > > I had done a easy test to show the optimization. I create 10k memory
> > > cgroups and mount 10k filesystems in the systems. We use free command to
> > > show how many memory does the systems comsumes after this operation (There
> > > are 2 numa nodes in the system).
> > >
> > > +-----------------------+------------------------+
> > > | condition | memory consumption |
> > > +-----------------------+------------------------+
> > > | without this patchset | 24464 MB |
> > > +-----------------------+------------------------+
> > > | after patch 1 | 21957 MB | <--------+
> > > +-----------------------+------------------------+ |
> > > | after patch 10 | 6895 MB | |
> > > +-----------------------+------------------------+ |
> > > | after patch 12 | 4367 MB | |
> > > +-----------------------+------------------------+ |
> > > |
> > > The more the number of nodes, the more obvious the effect---+
> >
> > Yes, I know, I've reviewed this patchset. However, 10k cgroups _and_ 10k
> > mounts look a bit artificial. It's hard to believe that there are 100M
> > LRUs containing hot objects. If most of them are cold, it's actually
> > a question how to efficiently reclaim them and free the wasted memory.
> >
> > >
> > > If we now add sysfs objects and stats arrays to each of the
> > > list_lrus that we initiate even now on 5.18-rc1, we're going to
> > > massively blow out the memory footprint again.
> >
> > sysfs can be a config option, it doesn't have to be enabled in prod.
> > But I agree, because of memory constraints we're very limited in the size
> > of statistics which can be used for reclaim decisions.
> >
> > >
> > > So, nice idea, but I'm not sure we can make it useful and not
> > > consume huge amounts of memory....
> > >
> > > What might be more useful is a way of getting the kernel to tell us
> > > what the, say, 20 biggest slab caches are in the system and then
> > > provide a way to selectively shrink them. We generally don't really
> > > care about tiny caches, just the ones consuming all the memory. This
> > > isn't my idea - Kent has been looking at this because of how useless
> > > OOM kill output is for debugging slab cache based OOM triggers. See
> > > this branch for an example:
> > >
> > > https://evilpiepirate.org/git/bcachefs.git/log/?h=shrinker_to_text
> > >
> > > Even a sysfs entry that you echo a number into and it returns
> > > the "top N" largest LRU lists. echo -1 into it and it returns every
> > > single one if you want all the information.
> > >
> > > The whole "one sysfs file, one value" architecture falls completely
> > > apart when we might have to indexing *millions* of internal
> > > structures with many parameters per structure...
> >
> > It's a good point, I need to think what can we do here.
> >
> > Actually, if it's so hard and inefficient for a sysfs interface,
> > it also means it's inefficient for the reclaim path. So if we have
> > an ability to quickly find beefy LRU lists worth shrinking, it might
> > improve the reclaim efficiency too.
> >
> > Actually, thinking of reducing the memory footprint, it would be great
> > to combine LRU's belonging to different superblocks (of the same type).
> > Pagecache analogy: we have an LRU for all pagecache pages, it's not
> > per-file or per-sb. Not sure how easy/viable it is.
> >
> > >
> > > > 2) formalizing a reference bit/counter API on the shrinkers level, so that
> > > > shrinker users can explicitly mark objects (re)-"activation".
> > >
> > > Not 100% certain what you are refering to here - something to do
> > > with active object rotation? Or an active/inactive list split with period
> > > demotion like we have for the page LRUs? Or workingset refault
> > > detection? Can you explain in more detail?
> >
> > It's a vague thinking at this point, but you got me right, I was thinking about
> > all the listed ideas above. In general I'm trying to understand whether
> > the existing model is sufficient for a more or less effective management
> > of hot/cold objects and if we can improve it borrowing some ideas from
> > the page reclaim code.
> > Shadow entries would be great but likely not possible because of the memory
> > footprint.
> >
> > >
> > > > 3) _maybe_ we need to change the shrinkers API from a number of objects to
> > > > bytes, so that releasing a small number of large objects can compete with
> > > > a releasing on many small objects. But I'm not sure.
> > >
> > > I think I suggested something similar a long time ago. We have
> > > shrinkers that track things other than slab objects. e.g. IIRC the
> > > ttm graphics allocator shrinker tracks sets of pages and frees
> > > pages, not slab objects. The XFS buffer cache tracks variable sized
> > > objects, from 512 bytes to 64KB in length, so the amount of memory
> > > it frees is variable even if the number of handles it scans and
> > > reclaims is fixed and consistent. Other subsystems have different
> > > "non object" shrinker needs as well.
> >
> > It's true for many objects, because it's not unusual for kernel objects
> > to have attached kmallocs or pin other objects in the memory. Unlikely
> > we can be 100% accurate, but we might improve the "smoothness" of the
> > reclaim process.
> >
> > >
> > > The long and short of it is that two shrinkers might have the same
> > > object count, but one might free 10x the amount of memory than the
> > > other for the same amount of shrinking work. Being able to focus
> > > reclaim work on caches that can free a lot of memory much more
> > > quickly would be a great idea.
> >
> > Right. It's also about "bytes freed/seeks" ratio. Why would we reclaim
> > expensive small objects if there are big and cheap.
> >
> > >
> > > It also means that a shrinker that scans a fragmented slab can keep
> > > going until a set number of slab pages have been freed, rather than
> > > a set number of slab objects. We can push reclaim of fragmented slab
> > > caches much harder when necessary if we are reclaiming by freed byte
> > > counts...
> >
> > I thought about it, but it's tricky. If all slabs are almost full (fragmentation
> > is low), we want to shrink aggressively, otherwise we end up allocating
> > more slabs and fragmenting more memory. If slabs are almost empty (fragmentation
> > is very high), we also want to shrink more aggressively with a hope to release
> > physical memory, as you said. But setting a target in pages is dangerous: slab caches
> > can be merged or slab pages can be pinned by objects belonging to a different
> > memory cgroup, so it might be not possible to reclaim pages no matter how hard
> > we're trying.
>
> Yeah, I agree it actually doesn't make too much sense to return the
> number of reclaimed objects. Other part of vmscan returns the number
> of base pages, the sizes of slab objects are varied, it may be much
> smaller than a page, for example, dentry may be 192 bytes.
>
> Another problem which doesn't help the feedback loop is even though
> bytes is returned instead of the number of objects, it doesn't mean
> that much memory is actually freed and available for allocation.

It might not be available for non-slab allocations, but at least new
slab objects can be allocated without polluting more pages. So it
might not be completely useless even if 0 pages were reclaimed.

> IMHO
> the number of really freed pages should be returned (I do understand
> it is not that easy for now), and returning 0 should be fine.

It's doable, there is already a mechanism in place which hooks into
the slub/slab/slob release path and stops the slab reclaim as a whole
if enough memory was freed.

> The
> current logic (returning the number of objects) may feed up something
> over-optimistic. I, at least, experienced once or twice that a
> significant amount of slab caches were shrunk, but actually 0 pages
> were freed actually. TBH the new slab controller may make it worse
> since the page may be pinned by the objects from other memcgs.

Of course, the denser the placement of objects is, the harder it is to get
the physical pages back. But usually it pays off by giving a dramatically
lower total number of slab pages.

If the total size of slab memory is not very high and more or less constant,
we don't have to move large numbers of pages back and forth between slab and
non-slab users. So if we can satisfy new allocations using "old" pages, it's
totally fine.

2022-04-06 13:55:52

by Dave Chinner

Subject: Re: [RFC] mm/vmscan: add periodic slab shrinker

On Tue, Apr 05, 2022 at 02:31:02PM -0700, Roman Gushchin wrote:
> On Tue, Apr 05, 2022 at 01:58:59PM -0700, Yang Shi wrote:
> > On Tue, Apr 5, 2022 at 9:36 AM Roman Gushchin <[email protected]> wrote:
> > > On Tue, Apr 05, 2022 at 03:17:10PM +1000, Dave Chinner wrote:
> > > > On Mon, Apr 04, 2022 at 12:08:25PM -0700, Roman Gushchin wrote:
> > > > > On Mon, Apr 04, 2022 at 11:09:48AM +1000, Dave Chinner wrote:
> > IMHO
> > the number of really freed pages should be returned (I do understand
> > it is not that easy for now), and returning 0 should be fine.
>
> It's doable, there is already a mechanism in place which hooks into
> the slub/slab/slob release path and stops the slab reclaim as a whole
> if enough memory was freed.

The reclaim state that accounts for slab pages freed really
needs to be first-class shrinker state that is aggregated at the
do_shrink_slab() level and passed back to the vmscan code. The
shrinker infrastructure itself should be aware of the progress each
shrinker is making - not just objects reclaimed but also pages
reclaimed - so it can make better decisions about how much work
should be done by each shrinker.

e.g. lots of objects in cache, lots of objects reclaimed, no pages
reclaimed is indicative of a fragmented slab cache. If this keeps
happening, we should be trying to apply extra pressure to this
specific cache, because the only method we have for getting a
fragmented cache to return some memory is to reclaim lots more
objects from it.
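
i.e. something along the lines of this rough sketch (invented names and
thresholds, nothing like this exists today):

struct shrinker_progress {
        unsigned long objects_reclaimed;        /* summed over recent scans */
        unsigned long pages_reclaimed;          /* slab pages actually freed */
        unsigned int pressure_boost;            /* starts at 1 */
};

static void update_pressure_boost(struct shrinker_progress *p)
{
        /*
         * Many objects went away but no pages came back: the cache is
         * internally fragmented, so push it harder on the next pass.
         */
        if (p->objects_reclaimed > 1024 && p->pages_reclaimed == 0) {
                if (p->pressure_boost < 8)
                        p->pressure_boost *= 2;
        } else {
                p->pressure_boost = 1;  /* making real progress again */
        }
}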

> > The
> > current logic (returning the number of objects) may feed up something
> > over-optimistic. I, at least, experienced once or twice that a
> > significant amount of slab caches were shrunk, but actually 0 pages
> > were freed actually. TBH the new slab controller may make it worse
> > since the page may be pinned by the objects from other memcgs.
>
> Of course, the more dense the placement of objects is, the harder is to get
> the physical pages back. But usually it pays off by having a dramatically
> lower total number of slab pages.

Unless you have tens of millions of objects in the cache. The dentry
cache is a prime example of this "lots of tiny cached objects" case, where
we have tens of objects per slab page and so can suffer badly from
internal fragmentation....

Cheers,

Dave.
--
Dave Chinner
[email protected]