2022-04-05 01:28:48

by Dave Chinner

Subject: Re: [RFC] mm/vmscan: add periodic slab shrinker

On Sun, Apr 03, 2022 at 08:56:18AM +0800, Hillf Danton wrote:
> On Sat, 2 Apr 2022 10:54:36 -0700 Roman Gushchin wrote:
> > Hello Hillf!
> >
> Hello Roman,
>
> > Thank you for sharing it, really interesting! I'm actually working on
> > the same problem.
>
> Good to know you have some interest in it.
> Feel free to let me know if you would like to take it over to avoid
> duplicated work on both sides.
>
> >
> > No code to share yet, but here are some of my thoughts:
> > 1) If there is a "natural" memory pressure, no additional slab
> > scanning is needed.
>
> Agree - the periodic shrinker can be canceled once kswapd wakes up.

I think we should be waking up per-node kswapd to do the periodic
shrinking, not adding yet another way of executing (thousands of)
shrinkers (across hundreds of nodes) from a single threaded context.

Indeed, the problem of "no reclaim when there is no memory
pressure" also affects the page cache, not just shrinkable caches
and we really don't want periodic reclaim to have a compeltely
different behaviour to general memory reclaim.

i.e. the amount of work that shrinkers need to do in a periodic scan
is largely determined by the rate of shrinkable cache memory usage
growth rather than memory reclaim priority as it is now. Hence there
needs to be a different high level "shrinker needs to do X amount of
work" calculation for periodic reclaim than there is now.

e.g. we calculate a rolling average of the size of the cache and a
rate of change over a series of polling operations (i.e. calling
->count_objects) and then when sustained growth is detected we start
trying to shrink the cache to limit the rate of growth of the cache.
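
Something like this - a rough, completely untested sketch, with all
the names invented - is the sort of bookkeeping I mean:

struct periodic_state {
	unsigned long	prev_count;	/* ->count_objects at the last poll */
	long		avg_growth;	/* smoothed growth per polling interval */
};

/* Hypothetical helper, called once per polling interval. */
static unsigned long periodic_scan_target(struct periodic_state *ps,
					   unsigned long count)
{
	long growth = (long)(count - ps->prev_count);

	ps->prev_count = count;
	if (growth < 0)
		growth = 0;

	/* exponentially weighted rolling average of the growth rate */
	ps->avg_growth += (growth - ps->avg_growth) / 4;

	/* arbitrary dead band: only push back on sustained growth */
	if (ps->avg_growth < 64)
		return 0;

	/* aim to cancel roughly one interval's worth of growth */
	return ps->avg_growth;
}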

If the cache keeps growing, then its objects are being repeatedly
referenced and it *should* keep growing. If it's one-off objects
that are causing the growth of the cache, and so the objects can be
reclaimed by the shrinker, then matching the periodic shrink scan to
the growth rate will substantially reduce the rate of growth of that
cache.

And if it's integrated into the existing do_shrink_slab
calculations, the moment actual memory reclaim calls the shrinker
the periodic scan calculations can be reset back to zero and it
starts again...
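
i.e. the reset could be as simple as this hypothetical helper
(building on the sketch above), called from do_shrink_slab() whenever
the scan is for real memory pressure rather than a periodic pass:

static void periodic_state_reset(struct periodic_state *ps,
				 unsigned long freeable)
{
	/* restart growth tracking from the current cache size */
	ps->prev_count = freeable;	/* what ->count_objects() just returned */
	ps->avg_growth = 0;
}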

> > 2) From a power perspective it's better to scan more at once, but
> > less often.
>
> The shrinker proposed is a catapult on the vmscan side without knowing
> where the cold slab objects are piling up in Dave's backyard but he is
> free to take different actions than the regular shrinker - IOW this
> shrinker alone does not make much sense wrt shooting six birds without
> the stone on the slab owner side.
>
> It is currently scanning *every* slab cache at an arbitrary frequency,
> once every 30 seconds - I am open to a minute or whatever.

Sorry, I don't understand what "Dave's backyard" is or why it would
ever need to be special cased?

> > 3) Maybe we need a feedback loop with the slab allocator: e.g. if slabs
> > are almost full there is more sense to do a proactive scanning and free
> > up some memory, otherwise we'll end up allocating more slabs. But it's
> > tricky.
>
> There are 31 bits available in the periodic flag added to shrink control.
>
> > 4) If the scanning is not resulting in any memory reclaim, maybe we
> > should (temporarily) exclude the corresponding shrinker from the scanning.
>
> Given the periodic flag, Dave is free to ignore the scan request and the
> scan result is currently dropped on the vmscan side because what is
> considered is the cold slab objects that for instance have been inactive
> for more than 30 seconds in every slab cache, rather than kswapd's cake.

I don't understand how passing a "periodic" flag to individual
shrinkers is really useful here. How does the shrinker
implementation use this to determine how much work it needs to do?

i.e. The amount of work a shrinker needs to perform is calculated by
the high level slab scanning code based on relative cache size and
reclaim priority. If there's a periodic scanner, it should be
calculating a specific amount of work for the shrinkers to do way up
in do_shrink_slab() and then asking the shrinker to perform that
work in exactly the way it does now - the shrinker itself doesn't
need to know anything about whether it's a periodic memory reclaim
scan or whether there's actual memory pressure - it just needs to
scan the oldest objects in its cache to try to reclaim them.
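
For reference, the existing work calculation is roughly this
(simplified from memory of do_shrink_slab(), common case where
shrinker->seeks is non-zero); a periodic pass would substitute a
growth-based delta here and leave everything below it untouched:

	freeable = shrinker->count_objects(shrinker, shrinkctl);

	delta = freeable >> priority;	/* DEF_PRIORITY is 12 -> 1/4096th */
	delta *= 4;
	do_div(delta, shrinker->seeks);	/* DEFAULT_SEEKS is 2 */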

Cheers,

Dave.
--
Dave Chinner
[email protected]


2022-04-06 10:06:20

by Stephen Brennan

Subject: Re: [RFC] mm/vmscan: add periodic slab shrinker

Dave Chinner <[email protected]> writes:
[snip]
> If the cache keeps growing, then its objects are being repeatedly
> referenced and it *should* keep growing. If it's one-off objects
> that are causing the growth of the cache, and so the objects can be
> reclaimed by the shrinker, then matching the periodic shrink scan to
> the growth rate will substantially reduce the rate of growth of that
> cache.

I can't speak for every slab cache, but I've been coming to the same
conclusion myself regarding the dentry cache. I think that the rate of
stepping through the LRU should be tied to the rate of allocations.
Truly in-use objects shouldn't be harmed by this, as they should get
referenced and rotated to the beginning of the LRU. But the one-offs
which are bloating the cache will be found and removed.

My dentry-related patch here [1] does tie the reclaim to the rate of
allocations. In that patch, I looked for sibling negative dentries to
reclaim, which is just silly in hindsight :)

I've implemented a version of this patch which just takes one step
through the LRU on each d_alloc. It's quite interesting to watch it
work. You can create 5 million negative dentries in directory /A via
stat(), and then create 5 million negative dentries in directory /B. The
total dentry slab size reaches 5 million but never goes past it, since
the old negative dentries from /A aren't really in use, and they get
pruned at the same rate as negative dentries from /B get created. On the
other hand, if you *continue* to stat() on the dentries of /A while you
create negative dentries in /B, then the cache grows to 10 million,
since the /A dentries are actually in use.

Maybe a solution could involve some generic list_lru machinery that can
let you do these sorts of incremental scans? Maybe batching them up so
instead of doing one every allocation, you do them every 100 or 1000?
It would still be up to the individual user to put this to good use in
the object allocation path.
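
Something like this is what I have in mind - a rough sketch only, and
none of these names exist today:

/* Per-lru allocation counter plus a batched incremental scan. */
struct lru_alloc_scan {
	atomic_long_t	nr_allocs;	/* allocations since the last scan */
	long		batch;		/* e.g. 100 or 1000 */
};

/*
 * Hypothetical helper, called from the object allocation path
 * (e.g. __d_alloc() for dentries). `isolate` is an ordinary
 * list_lru_walk_cb that frees or rotates entries.
 */
static void list_lru_alloc_tick(struct list_lru *lru,
				struct lru_alloc_scan *las,
				list_lru_walk_cb isolate, void *arg)
{
	if (atomic_long_inc_return(&las->nr_allocs) < las->batch)
		return;
	atomic_long_set(&las->nr_allocs, 0);

	/* one small incremental pass per batch of allocations */
	list_lru_walk(lru, isolate, arg, las->batch);
}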

Thanks,
Stephen

[1] https://lore.kernel.org/linux-fsdevel/[email protected]/

2022-04-06 12:19:14

by Matthew Wilcox (Oracle)

Subject: Re: [RFC] mm/vmscan: add periodic slab shrinker

On Tue, Apr 05, 2022 at 10:22:14AM -0700, Stephen Brennan wrote:
> I can't speak for every slab cache, but I've been coming to the same
> conclusion myself regarding the dentry cache. I think that the rate of
> stepping through the LRU should be tied to the rate of allocations.
> Truly in-use objects shouldn't be harmed by this, as they should get
> referenced and rotated to the beginning of the LRU. But the one-offs
> which are bloating the cache will be found and removed.

I agree with all this.

> I've implemented a version of this patch which just takes one step
> through the LRU on each d_alloc. It's quite interesting to watch it
> work. You can create 5 million negative dentries in directory /A via
> stat(), and then create 5 million negative dentries in directory /B. The
> total dentry slab size reaches 5 million but never goes past it, since
> the old negative dentries from /A aren't really in use, and they get
> pruned at the same rate as negative dentries from /B get created. On the
> other hand, if you *continue* to stat() on the dentries of /A while you
> create negative dentries in /B, then the cache grows to 10 million,
> since the /A dentries are actually in use.
>
> Maybe a solution could involve some generic list_lru machinery that can
> let you do these sorts of incremental scans? Maybe batching them up so
> instead of doing one every allocation, you do them every 100 or 1000?
> It would still be up to the individual user to put this to good use in
> the object allocation path.

I feel like we need to allow the list to both shrink and grow, depending
on how useful the entries in it are. So one counter per LRU, incremented
on every add. When that counter gets to 100, reset it to 0 and scan
110 entries. Maybe 0 of them can be reclaimed; maybe 110 of them can be.
But the list can shrink over time instead of being a "one in, one out"
scenario.

Clearly 110 is a magic number, but intuitively, attempting to shrink
by 10% feels reasonable. Need to strike a balance between shrinking
quickly enough and giving the cache time to figure out which entries
are actually useful.
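
In sketch form (made-up names, just to pin the arithmetic down):

struct lru_add_counter {
	unsigned long	nr_added;	/* bumped on every list_lru add */
};

/* Returns how many entries to scan after this add: 0 most of the
 * time, 110 once every 100 additions. */
static unsigned long lru_add_tick(struct lru_add_counter *c)
{
	if (++c->nr_added < 100)
		return 0;
	c->nr_added = 0;
	return 110;	/* 10% more than was added, so the list can shrink */
}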

2022-04-06 12:51:06

by Dave Chinner

Subject: Re: [RFC] mm/vmscan: add periodic slab shrinker

On Tue, Apr 05, 2022 at 10:18:01PM +0100, Matthew Wilcox wrote:
> On Tue, Apr 05, 2022 at 10:22:14AM -0700, Stephen Brennan wrote:
> > I can't speak for every slab cache, but I've been coming to the same
> > conclusion myself regarding the dentry cache. I think that the rate of
> > stepping through the LRU should be tied to the rate of allocations.
> > Truly in-use objects shouldn't be harmed by this, as they should get
> > referenced and rotated to the beginning of the LRU. But the one-offs
> > which are bloating the cache will be found and removed.
>
> I agree with all this.

Same here.

> > I've implemented a version of this patch which just takes one step
> > through the LRU on each d_alloc. It's quite interesting to watch it
> > work. You can create 5 million negative dentries in directory /A via
> > stat(), and then create 5 million negative dentries in directory /B. The
> > total dentry slab size reaches 5 million but never goes past it, since
> > the old negative dentries from /A aren't really in use, and they get
> > pruned at the same rate as negative dentries from /B get created. On the
> > other hand, if you *continue* to stat() on the dentries of /A while you
> > create negative dentries in /B, then the cache grows to 10 million,
> > since the /A dentries are actually in use.
> >
> > Maybe a solution could involve some generic list_lru machinery that can
> > let you do these sorts of incremental scans? Maybe batching them up so
> > instead of doing one every allocation, you do them every 100 or 1000?
> > It would still be up to the individual user to put this to good use in
> > the object allocation path.
>
> I feel like we need to allow the list to both shrink and grow, depending
> on how useful the entries in it are. So one counter per LRU, incremented
> on every add. When that counter gets to 100, reset it to 0 and scan
> 110 entries. Maybe 0 of them can be reclaimed; maybe 110 of them can be.
> But the list can shrink over time instead of being a "one in, one out"
> scenario.

Yes, this is pretty much what I've been saying we should be using
the list-lru for since .... Well, let's just say it was one of the
things I wanted to be able to do when I first created the list-lru
infrastructure.

But it is much more complex than this. One of the issues with purely
list-lru add-time accounting is that we cannot make reclaim
decisions from list-add context because the list-add can occur in
reclaim context. e.g. dentry reclaim will drop the last reference
to an inode, which then gets inserted into the inode list-lru in
reclaim context. The very next thing the superblock shrinker does
is scan the inode cache list-lru and remove a pre-calculated number
of objects from the list. Hence in reclaim context we can be both
increasing and decreasing the size of the list-lru by significant
percentages in a very short time window. This means it will be quite
challenging to make clear decisions based purely on lru list add
operations.

Hence I think we want to connect more than just the unused entries
to periodic reclaim - we really need to know both the number of free
objects on the list-lru as well as the total number of objects
allocated that could be on the list_lru. This would give us some
comparative measure of free objects vs active referenced objects
and that would allow better decisions to be made.

As it is, we've recently made a necessary connection between
allocation and the list-lru via kmem_cache_alloc_lru(). This was
done as part of the list-lru/memcg rework patchset I referenced
earlier in the thread.

This means that slab objects that are kept
on list_lrus for caching are now supplied with the list_lru at
allocation time. We already use this API for inode caches (via
inode_alloc_sb()) and the dentry cache (via __d_alloc()), so we
already have the infrastructure in place to do per-allocation
list-lru accounting for inodes and dentries, not just "per list-lru
add/remove" accounting.

Extending that to other slab-based list-lru users should be pretty
easy, and in doing so would remove another difference between memcg
and non-memcg aware list-lrus....

> Clearly 110 is a magic number, but intuitively, attempting to shrink
> by 10% feels reasonable. Need to strike a balance between shrinking
> quickly enough and giving the cache time to figure out which entries
> are actually useful.

Testing will teach us where the thresholds need to be pretty
quickly. :)

Cheers,

Dave.
--
Dave Chinner
[email protected]

2022-04-06 13:56:55

by Stephen Brennan

Subject: Re: [RFC] mm/vmscan: add periodic slab shrinker

Dave Chinner <[email protected]> writes:
> On Tue, Apr 05, 2022 at 10:18:01PM +0100, Matthew Wilcox wrote:
>> On Tue, Apr 05, 2022 at 10:22:14AM -0700, Stephen Brennan wrote:
>> > I can't speak for every slab cache, but I've been coming to the same
>> > conclusion myself regarding the dentry cache. I think that the rate of
>> > stepping through the LRU should be tied to the rate of allocations.
>> > Truly in-use objects shouldn't be harmed by this, as they should get
>> > referenced and rotated to the beginning of the LRU. But the one-offs
>> > which are bloating the cache will be found and removed.
>>
>> I agree with all this.
>
> Same here.
>
>> > I've implemented a version of this patch which just takes one step
>> > through the LRU on each d_alloc. It's quite interesting to watch it
>> > work. You can create 5 million negative dentries in directory /A via
>> > stat(), and then create 5 million negative dentries in directory /B. The
>> > total dentry slab size reaches 5 million but never goes past it, since
>> > the old negative dentries from /A aren't really in use, and they get
>> > pruned at the same rate as negative dentries from /B get created. On the
>> > other hand, if you *continue* to stat() on the dentries of /A while you
>> > create negative dentries in /B, then the cache grows to 10 million,
>> > since the /A dentries are actually in use.
>> >
>> > Maybe a solution could involve some generic list_lru machinery that can
>> > let you do these sorts of incremental scans? Maybe batching them up so
>> > instead of doing one every allocation, you do them every 100 or 1000?
>> > It would still be up to the individual user to put this to good use in
>> > the object allocation path.
>>
>> I feel like we need to allow the list to both shrink and grow, depending
>> on how useful the entries in it are. So one counter per LRU, incremented
>> on every add. When that counter gets to 100, reset it to 0 and scan
>> 110 entries. Maybe 0 of them can be reclaimed; maybe 110 of them can be.
>> But the list can shrink over time instead of being a "one in, one out"
>> scenario.
>
> Yes, this is pretty much what I've been saying we should be using
> the list-lru for since .... Well, let's just say it was one of the
> things I wanted to be able to do when I first created the list-lru
> infrastructure.
>
> But it is much more complex than this. One of the issues with purely
> list-lru add-time accounting is that we cannot make reclaim
> decisions from list-add context because the list-add can occur in
> reclaim context. e.g. dentry reclaim will drop the last reference
> to an inode, which then gets inserted into the inode list-lru in
> reclaim context. The very next thing the superblock shrinker does
> is scan the inode cache list-lru and remove a pre-calculated number
> of objects from the list. Hence in reclaim context we can be both
> increasing and decreasing the size of the list-lru by significant
> percentages in a very short time window. This means it will be quite
> challenging to make clear decisions based purely on lru list add
> operations.

Plus, for the dcache, dentries are added to the LRU the first time their
reference count drops to zero, but they're not removed if they gain a
new reference. So at least for the dentry case, it's not clear that
list_lru_add/del would be good indicators of free/in-use object counts.

> Hence I think we want to connect more than just the unused entries
> to periodic reclaim - we really need to know both the number of free
> objects on the list-lru as well as the total number of objects
> allocated that could be on the list_lru. This would give us some
> comparative measure of free objects vs active referenced objects
> and that would allow better decisions to be made.

The dentry branch I have works purely based on total allocated objects:
no differentiation between free and active referenced objects. I could
hook into dput() where reference counts drop to zero for the other part
of the equation, because like I said, list_lru_{add,del} aren't reliable
for the dentry cache.
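
If I do try it, it would probably be nothing more than this (the
counter field is invented), called at the point in dput() where the
last reference goes away and the dentry stays cached:

static inline void dentry_became_unused(struct dentry *dentry)
{
	/* hypothetical per-sb count of "became reclaimable" events */
	atomic_long_inc(&dentry->d_sb->s_dentry_unused_events);
}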

I have opinions about whether or not it would be helpful to add in the
dput() signal, but I'd rather just try it and see. I'm learning that my
opinion and intuition are not all that reliable when it comes to caching
and LRU algorithms; trial and error is key.

> As it is, we've recently made a necessary connection between
> allocation and the list-lru via kmem_cache_alloc_lru(). This was
> done as part of the list-lru/memcg rework patchset I referenced
> earlier in the thread.
>
> This means that slab objects that are kept
> on list_lrus for caching are now supplied with the list_lru at
> allocation time. We already use this API for inode caches (via
> inode_alloc_sb()) and the dentry cache (via __d_alloc()), so we
> already have the infrastructure in place to do per-allocation
> list-lru accounting for inodes and dentries, not just "per list-lru
> add/remove" accounting.
>
> Extending that to other slab-based list-lru users should be pretty
> easy, and in doing so would remove another difference between memcg
> and non-memcg aware list-lrus....
>
>> Clearly 110 is a magic number, but intuitively, attempting to shrink
>> by 10% feels reasonable. Need to strike a balance between shrinking
>> quickly enough and giving the cache time to figure out which entries
>> are actually useful.
>
> Testing will teach us where the thresholds need to be pretty
> quickly. :)

100% (that is, scanning 2 entries per allocation) was too high in my
very cursory trial: the dentry cache didn't seem to grow much during
filesystem workloads. 0% (that is, 1 per allocation) worked well
enough but, like Matthew says, won't tend to shrink the cache when
that is necessary.

Stephen

>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> [email protected]

2022-04-06 14:07:51

by Dave Chinner

Subject: Re: [RFC] mm/vmscan: add periodic slab shrinker

On Tue, Apr 05, 2022 at 06:06:38PM -0700, Stephen Brennan wrote:
> Dave Chinner <[email protected]> writes:
> > On Tue, Apr 05, 2022 at 10:18:01PM +0100, Matthew Wilcox wrote:
> >> On Tue, Apr 05, 2022 at 10:22:14AM -0700, Stephen Brennan wrote:
> >> > I can't speak for every slab cache, but I've been coming to the same
> >> > conclusion myself regarding the dentry cache. I think that the rate of
> >> > stepping through the LRU should be tied to the rate of allocations.
> >> > Truly in-use objects shouldn't be harmed by this, as they should get
> >> > referenced and rotated to the beginning of the LRU. But the one-offs
> >> > which are bloating the cache will be found and removed.
> >>
> >> I agree with all this.
> >
> > Same here.
> >
> >> > I've implemented a version of this patch which just takes one step
> >> > through the LRU on each d_alloc. It's quite interesting to watch it
> >> > work. You can create 5 million negative dentries in directory /A via
> >> > stat(), and then create 5 million negative dentries in directory /B. The
> >> > total dentry slab size reaches 5 million but never goes past it, since
> >> > the old negative dentries from /A aren't really in use, and they get
> >> > pruned at the same rate as negative dentries from /B get created. On the
> >> > other hand, if you *continue* to stat() on the dentries of /A while you
> >> > create negative dentries in /B, then the cache grows to 10 million,
> >> > since the /A dentries are actually in use.
> >> >
> >> > Maybe a solution could involve some generic list_lru machinery that can
> >> > let you do these sorts of incremental scans? Maybe batching them up so
> >> > instead of doing one every allocation, you do them every 100 or 1000?
> >> > It would still be up to the individual user to put this to good use in
> >> > the object allocation path.
> >>
> >> I feel like we need to allow the list to both shrink and grow, depending
> >> on how useful the entries in it are. So one counter per LRU, incremented
> >> on every add. When that counter gets to 100, reset it to 0 and scan
> >> 110 entries. Maybe 0 of them can be reclaimed; maybe 110 of them can be.
> >> But the list can shrink over time instead of being a "one in, one out"
> >> scenario.
> >
> > Yes, this is pretty much what I've been saying we should be using
> > the list-lru for since .... Well, let's just say it was one of the
> > things I wanted to be able to do when I first created the list-lru
> > infrastructure.
> >
> > But it is much more complex than this. One of the issues with purely
> > list-lru add-time accounting is that we cannot make reclaim
> > decisions from list-add context because the list-add can occur in
> > reclaim context. e.g. dentry reclaim will drop the last reference
> > to an inode, which then gets inserted into the inode list-lru in
> > reclaim context. The very next thing the superblock shrinker does
> > is scan the inode cache list-lru and remove a pre-calculated number
> > of objects from the list. Hence in reclaim context we can be both
> > increasing and decreasing the size of the list-lru by significant
> > percentages in a very short time window. This means it will be quite
> > challenging to make clear decisions based purely on lru list add
> > operations.
>
> Plus, for the dcache, dentries are added to the LRU the first time their
> reference count drops to zero, but they're not removed if they gain a
> new reference. So at least for the dentry case, it's not clear that
> list_lru_add/del would be good indicators of free/in-use object counts.

Same for several other list-lru caches - the lazy removal trick is
used by (at least) the dentry, inode, XFS buffer and XFS dquot
caches. It's simply a way of reducing lock contention on the LRU
list - generally only memory reclaim or object freeing needs list
removal. So why remove it on every new reference to an object only
to have to put it back on the list a short time later?
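
In sketch form (generic names, not lifted from any one of those
caches): the object goes on the LRU once, the first time it becomes
unused, and re-references only set a flag - no LRU lock, no list
manipulation.

struct cached_obj {
	struct list_head	lru;		/* on the list_lru while unused */
	unsigned long		flags;
};
#define OBJ_ON_LRU	0x1
#define OBJ_REFERENCED	0x2

/* last reference dropped: add to the LRU once, first time only */
static void cached_obj_last_put(struct cached_obj *obj, struct list_lru *lru)
{
	if (!(obj->flags & OBJ_ON_LRU)) {
		obj->flags |= OBJ_ON_LRU;
		list_lru_add(lru, &obj->lru);
	}
}

/* new reference taken: cheap, no LRU manipulation at all */
static void cached_obj_touch(struct cached_obj *obj)
{
	obj->flags |= OBJ_REFERENCED;
}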

> > Hence I think we want to connect more than just the unused entries
> > to periodic reclaim - we really need to know both the number of free
> > objects on the list-lru as well as the total number of objects
> > allocated that could be on the list_lru. This would give us some
> > comparative measure of free objects vs active referenced objects
> > and that would allow better decisions to be made.
>
> The dentry branch I have works purely based on total allocated objects:
> no differentiation between free and active referenced objects. I could
> hook into dput() where reference counts drop to zero for the other part
> of the equation, because like I said, list_lru_{add,del} aren't reliable
> for the dentry cache.

Right, but shrinkers work from the number of objects that are
potentially freeable in a cache, not from the number that are
currently allocated. i.e. they use the number of objects on the LRU
as a proxy for the size of the allocated cache.

My point is that we actually need both sets of stats to make decent
reclaim decisions - scanning large caches is not useful if there are
almost zero potentially reclaimable objects in the cache...
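
i.e. a typical ->count_objects() implementation is no more than this
(my_cache_lru standing in for whatever list_lru backs the cache):

static struct list_lru my_cache_lru;	/* hypothetical cache LRU */

static unsigned long my_cache_shrink_count(struct shrinker *shrink,
					   struct shrink_control *sc)
{
	/* the LRU population is the "freeable" number reclaim works from */
	return list_lru_shrink_count(&my_cache_lru, sc);
}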

That, however, brings into focus another issue: we conflate "reclaim
scanning" with list rotation (removing referenced bits and giving an
object another pass through the list). This forces reclaim to try to
reclaim a significant portion of any cache even when there is nothing
we can free immediately. If we don't, then we never strip the
referenced bits that would allow the objects to eventually be
reclaimed...
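
The conflation lives in the walk callback - with the cached_obj
sketch from above, the only place the referenced bit gets stripped is
here, which is why we have to keep scanning even when nothing is
immediately freeable:

static enum lru_status cached_obj_isolate(struct list_head *item,
					  struct list_lru_one *list,
					  spinlock_t *lock, void *arg)
{
	struct cached_obj *obj = container_of(item, struct cached_obj, lru);

	if (obj->flags & OBJ_REFERENCED) {
		obj->flags &= ~OBJ_REFERENCED;
		return LRU_ROTATE;	/* not freeable yet, give it another pass */
	}
	list_lru_isolate(list, item);
	return LRU_REMOVED;		/* caller can now free the object */
}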

> I have opinions about whether or not it would be helpful to add in the
> dput() signal, but I'd rather just try it and see. I'm learning that my
> opinion and intuition are not all that reliable when it comes to caching
> and LRU algorithms; trial and error is key.

*nod*

Cheers,

Dave.
--
Dave Chinner
[email protected]