Commit 9092c71bb724 ("mm: use sc->priority for slab shrink targets")
changed the way the target slab pressure is calculated and made it
priority-based:
	delta = freeable >> priority;
	delta *= 4;
	do_div(delta, shrinker->seeks);
The problem is that at the default priority (which is 12) no pressure
is applied at all if the number of potentially reclaimable objects
is less than 4096.
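For illustration, a quick user-space sketch of this arithmetic (not
kernel code: it assumes the default reclaim priority of 12 and
DEFAULT_SEEKS of 2, and stands in for do_div() with plain integer
division):

#include <stdio.h>

int main(void)
{
	const int priority = 12;	/* default reclaim priority */
	const unsigned int seeks = 2;	/* DEFAULT_SEEKS */
	unsigned long long freeable, delta;

	for (freeable = 1000; freeable <= 8000; freeable += 1000) {
		delta = freeable >> priority;	/* 0 for anything below 4096 */
		delta *= 4;
		delta /= seeks;			/* stands in for do_div() */
		printf("freeable=%llu -> delta=%llu\n", freeable, delta);
	}
	return 0;
}

Everything below 4096 freeable objects yields delta == 0, i.e. no scan
pressure at all.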
It wouldn't be a big deal if these objects were not pinning the
corresponding dying memory cgroups. 4096 dentries/inodes/radix tree
nodes/... is a reasonable number, but 4096 dying cgroups is not.
If there are no big spikes in memory pressure, and new memory cgroups
are created and destroyed periodically, this causes the number of
dying cgroups to grow steadily, resulting in a slow-ish and
hard-to-detect memory "leak". It's not a real leak, as the memory can
eventually be reclaimed, but in real life that may never happen. I've
seen hosts with a steadily climbing number of dying cgroups, showing
no sign of decline for months, even though the hosts were loaded with
a production workload.
This is an obvious waste of memory, and to prevent it, let's apply
minimal pressure even on small shrinker lists: e.g. if there are
freeable objects, let's scan at least min(freeable, scan_batch)
objects.
This fix significantly improves the chances of a dying cgroup being
reclaimed, and together with some previous patches stops the steady
growth of the number of dying cgroups on some of our hosts.
Signed-off-by: Roman Gushchin <[email protected]>
Cc: Josef Bacik <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Andrew Morton <[email protected]>
---
mm/vmscan.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index fa2c150ab7b9..c910cf6bf606 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -476,6 +476,10 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
 	delta = freeable >> priority;
 	delta *= 4;
 	do_div(delta, shrinker->seeks);
+
+	if (delta == 0 && freeable > 0)
+		delta = min(freeable, batch_size);
+
 	total_scan += delta;
 	if (total_scan < 0) {
 		pr_err("shrink_slab: %pF negative objects to delete nr=%ld\n",
--
2.17.1
On Fri, 2018-08-31 at 13:34 -0700, Roman Gushchin wrote:
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index fa2c150ab7b9..c910cf6bf606 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -476,6 +476,10 @@ static unsigned long do_shrink_slab(struct
> shrink_control *shrinkctl,
> delta = freeable >> priority;
> delta *= 4;
> do_div(delta, shrinker->seeks);
> +
> + if (delta == 0 && freeable > 0)
> + delta = min(freeable, batch_size);
> +
> total_scan += delta;
> if (total_scan < 0) {
> pr_err("shrink_slab: %pF negative objects to delete
> nr=%ld\n",
I agree that we need to shrink slabs with fewer than
4096 objects, but do we want to put more pressure on
a slab the moment it drops below 4096 than we applied
when it had just over 4096 objects on it?
With this patch, a slab with 5000 objects on it will
get 1 item scanned, while a slab with 4000 objects on
it will see shrinker->batch or SHRINK_BATCH objects
scanned every time.
I don't know if this would cause any issues, just
something to ponder.
If nobody thinks this is a problem, you can give the
patch my:
Acked-by: Rik van Riel <[email protected]>
--
All Rights Reversed.
On Fri, Aug 31, 2018 at 05:15:39PM -0400, Rik van Riel wrote:
> On Fri, 2018-08-31 at 13:34 -0700, Roman Gushchin wrote:
>
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index fa2c150ab7b9..c910cf6bf606 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -476,6 +476,10 @@ static unsigned long do_shrink_slab(struct
> > shrink_control *shrinkctl,
> > delta = freeable >> priority;
> > delta *= 4;
> > do_div(delta, shrinker->seeks);
> > +
> > + if (delta == 0 && freeable > 0)
> > + delta = min(freeable, batch_size);
> > +
> > total_scan += delta;
> > if (total_scan < 0) {
> > pr_err("shrink_slab: %pF negative objects to delete
> > nr=%ld\n",
>
> I agree that we need to shrink slabs with fewer than
> 4096 objects, but do we want to put more pressure on
> a slab the moment it drops below 4096 than we applied
> when it had just over 4096 objects on it?
>
> With this patch, a slab with 5000 objects on it will
> get 1 item scanned, while a slab with 4000 objects on
> it will see shrinker->batch or SHRINK_BATCH objects
> scanned every time.
>
> I don't know if this would cause any issues, just
> something to ponder.
Hm, fair enough. So, basically we can always do
delta = max(delta, min(freeable, batch_size));
Does it look better?
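In context that would be a sketch like this (untested; batch_size here
is the shrinker->batch / SHRINK_BATCH value computed earlier in
do_shrink_slab(), and max_t() may be needed to keep the type checking
happy, since delta is unsigned long long):

	delta = freeable >> priority;
	delta *= 4;
	do_div(delta, shrinker->seeks);

	/*
	 * Apply some minimal pressure even on small lists, so that
	 * objects pinning dying cgroups are eventually reclaimed,
	 * while keeping the scan target monotonic in freeable.
	 */
	delta = max_t(unsigned long long, delta,
		      min(freeable, batch_size));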
>
> If nobody thinks this is a problem, you can give the
> patch my:
>
> Acked-by: Rik van Riel <[email protected]>
>
Thanks!
On Fri, 2018-08-31 at 14:31 -0700, Roman Gushchin wrote:
> On Fri, Aug 31, 2018 at 05:15:39PM -0400, Rik van Riel wrote:
> > On Fri, 2018-08-31 at 13:34 -0700, Roman Gushchin wrote:
> >
> > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > index fa2c150ab7b9..c910cf6bf606 100644
> > > --- a/mm/vmscan.c
> > > +++ b/mm/vmscan.c
> > > @@ -476,6 +476,10 @@ static unsigned long do_shrink_slab(struct
> > > shrink_control *shrinkctl,
> > > delta = freeable >> priority;
> > > delta *= 4;
> > > do_div(delta, shrinker->seeks);
> > > +
> > > + if (delta == 0 && freeable > 0)
> > > + delta = min(freeable, batch_size);
> > > +
> > > total_scan += delta;
> > > if (total_scan < 0) {
> > > pr_err("shrink_slab: %pF negative objects to delete
> > > nr=%ld\n",
> >
> > I agree that we need to shrink slabs with fewer than
> > 4096 objects, but do we want to put more pressure on
> > a slab the moment it drops below 4096 than we applied
> > when it had just over 4096 objects on it?
> >
> > With this patch, a slab with 5000 objects on it will
> > get 1 item scanned, while a slab with 4000 objects on
> > it will see shrinker->batch or SHRINK_BATCH objects
> > scanned every time.
> >
> > I don't know if this would cause any issues, just
> > something to ponder.
>
> Hm, fair enough. So, basically we can always do
>
> delta = max(delta, min(freeable, batch_size));
>
> Does it look better?
Yeah, that looks fine to me.
That will lead to small cgroups having their small caches
reclaimed relatively more quickly than large caches, but
small caches should also be faster to refill once they
are needed again, so it is probably fine.
--
All Rights Reversed.
On Fri 31-08-18 14:31:41, Roman Gushchin wrote:
> On Fri, Aug 31, 2018 at 05:15:39PM -0400, Rik van Riel wrote:
> > On Fri, 2018-08-31 at 13:34 -0700, Roman Gushchin wrote:
> >
> > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > index fa2c150ab7b9..c910cf6bf606 100644
> > > --- a/mm/vmscan.c
> > > +++ b/mm/vmscan.c
> > > @@ -476,6 +476,10 @@ static unsigned long do_shrink_slab(struct
> > > shrink_control *shrinkctl,
> > > delta = freeable >> priority;
> > > delta *= 4;
> > > do_div(delta, shrinker->seeks);
> > > +
> > > + if (delta == 0 && freeable > 0)
> > > + delta = min(freeable, batch_size);
> > > +
> > > total_scan += delta;
> > > if (total_scan < 0) {
> > > pr_err("shrink_slab: %pF negative objects to delete
> > > nr=%ld\n",
> >
> > I agree that we need to shrink slabs with fewer than
> > 4096 objects, but do we want to put more pressure on
> > a slab the moment it drops below 4096 than we applied
> > when it had just over 4096 objects on it?
> >
> > With this patch, a slab with 5000 objects on it will
> > get 1 item scanned, while a slab with 4000 objects on
> > it will see shrinker->batch or SHRINK_BATCH objects
> > scanned every time.
> >
> > I don't know if this would cause any issues, just
> > something to ponder.
>
> Hm, fair enough. So, basically we can always do
>
> delta = max(delta, min(freeable, batch_size));
>
> Does it look better?
Why don't you use the same heuristic we use for the normal LRU reclaim?
/*
* If the cgroup's already been deleted, make sure to
* scrape out the remaining cache.
*/
if (!scan && !mem_cgroup_online(memcg))
	scan = min(size, SWAP_CLUSTER_MAX);
--
Michal Hocko
SUSE Labs
On Mon, Sep 03, 2018 at 08:29:56PM +0200, Michal Hocko wrote:
> On Fri 31-08-18 14:31:41, Roman Gushchin wrote:
> > On Fri, Aug 31, 2018 at 05:15:39PM -0400, Rik van Riel wrote:
> > > On Fri, 2018-08-31 at 13:34 -0700, Roman Gushchin wrote:
> > >
> > > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > > index fa2c150ab7b9..c910cf6bf606 100644
> > > > --- a/mm/vmscan.c
> > > > +++ b/mm/vmscan.c
> > > > @@ -476,6 +476,10 @@ static unsigned long do_shrink_slab(struct
> > > > shrink_control *shrinkctl,
> > > > delta = freeable >> priority;
> > > > delta *= 4;
> > > > do_div(delta, shrinker->seeks);
> > > > +
> > > > + if (delta == 0 && freeable > 0)
> > > > + delta = min(freeable, batch_size);
> > > > +
> > > > total_scan += delta;
> > > > if (total_scan < 0) {
> > > > pr_err("shrink_slab: %pF negative objects to delete
> > > > nr=%ld\n",
> > >
> > > I agree that we need to shrink slabs with fewer than
> > > 4096 objects, but do we want to put more pressure on
> > > a slab the moment it drops below 4096 than we applied
> > > when it had just over 4096 objects on it?
> > >
> > > With this patch, a slab with 5000 objects on it will
> > > get 1 item scanned, while a slab with 4000 objects on
> > > it will see shrinker->batch or SHRINK_BATCH objects
> > > scanned every time.
> > >
> > > I don't know if this would cause any issues, just
> > > something to ponder.
> >
> > Hm, fair enough. So, basically we can always do
> >
> > delta = max(delta, min(freeable, batch_size));
> >
> > Does it look better?
>
> Why don't you use the same heuristic we use for the normal LRU reclaim?
Because we do reparent kmem lru lists on offlining.
Take a look at memcg_offline_kmem().
Thanks!
On Mon 03-09-18 13:28:06, Roman Gushchin wrote:
> On Mon, Sep 03, 2018 at 08:29:56PM +0200, Michal Hocko wrote:
> > On Fri 31-08-18 14:31:41, Roman Gushchin wrote:
> > > On Fri, Aug 31, 2018 at 05:15:39PM -0400, Rik van Riel wrote:
> > > > On Fri, 2018-08-31 at 13:34 -0700, Roman Gushchin wrote:
> > > >
> > > > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > > > index fa2c150ab7b9..c910cf6bf606 100644
> > > > > --- a/mm/vmscan.c
> > > > > +++ b/mm/vmscan.c
> > > > > @@ -476,6 +476,10 @@ static unsigned long do_shrink_slab(struct
> > > > > shrink_control *shrinkctl,
> > > > > delta = freeable >> priority;
> > > > > delta *= 4;
> > > > > do_div(delta, shrinker->seeks);
> > > > > +
> > > > > + if (delta == 0 && freeable > 0)
> > > > > + delta = min(freeable, batch_size);
> > > > > +
> > > > > total_scan += delta;
> > > > > if (total_scan < 0) {
> > > > > pr_err("shrink_slab: %pF negative objects to delete
> > > > > nr=%ld\n",
> > > >
> > > > I agree that we need to shrink slabs with fewer than
> > > > 4096 objects, but do we want to put more pressure on
> > > > a slab the moment it drops below 4096 than we applied
> > > > when it had just over 4096 objects on it?
> > > >
> > > > With this patch, a slab with 5000 objects on it will
> > > > get 1 item scanned, while a slab with 4000 objects on
> > > > it will see shrinker->batch or SHRINK_BATCH objects
> > > > scanned every time.
> > > >
> > > > I don't know if this would cause any issues, just
> > > > something to ponder.
> > >
> > > Hm, fair enough. So, basically we can always do
> > >
> > > delta = max(delta, min(freeable, batch_size));
> > >
> > > Does it look better?
> >
> > Why don't you use the same heuristic we use for the normal LRU reclaim?
>
> Because we do reparent kmem lru lists on offlining.
> Take a look at memcg_offline_kmem().
Then I must be missing something. Why are we growing the number of dead
cgroups then?
--
Michal Hocko
SUSE Labs
On Tue, Sep 04, 2018 at 09:00:05AM +0200, Michal Hocko wrote:
> On Mon 03-09-18 13:28:06, Roman Gushchin wrote:
> > On Mon, Sep 03, 2018 at 08:29:56PM +0200, Michal Hocko wrote:
> > > On Fri 31-08-18 14:31:41, Roman Gushchin wrote:
> > > > On Fri, Aug 31, 2018 at 05:15:39PM -0400, Rik van Riel wrote:
> > > > > On Fri, 2018-08-31 at 13:34 -0700, Roman Gushchin wrote:
> > > > >
> > > > > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > > > > index fa2c150ab7b9..c910cf6bf606 100644
> > > > > > --- a/mm/vmscan.c
> > > > > > +++ b/mm/vmscan.c
> > > > > > @@ -476,6 +476,10 @@ static unsigned long do_shrink_slab(struct
> > > > > > shrink_control *shrinkctl,
> > > > > > delta = freeable >> priority;
> > > > > > delta *= 4;
> > > > > > do_div(delta, shrinker->seeks);
> > > > > > +
> > > > > > + if (delta == 0 && freeable > 0)
> > > > > > + delta = min(freeable, batch_size);
> > > > > > +
> > > > > > total_scan += delta;
> > > > > > if (total_scan < 0) {
> > > > > > pr_err("shrink_slab: %pF negative objects to delete
> > > > > > nr=%ld\n",
> > > > >
> > > > > I agree that we need to shrink slabs with fewer than
> > > > > 4096 objects, but do we want to put more pressure on
> > > > > a slab the moment it drops below 4096 than we applied
> > > > > when it had just over 4096 objects on it?
> > > > >
> > > > > With this patch, a slab with 5000 objects on it will
> > > > > get 1 item scanned, while a slab with 4000 objects on
> > > > > it will see shrinker->batch or SHRINK_BATCH objects
> > > > > scanned every time.
> > > > >
> > > > > I don't know if this would cause any issues, just
> > > > > something to ponder.
> > > >
> > > > Hm, fair enough. So, basically we can always do
> > > >
> > > > delta = max(delta, min(freeable, batch_size));
> > > >
> > > > Does it look better?
> > >
> > > Why don't you use the same heuristic we use for the normal LRU reclaim?
> >
> > Because we do reparent kmem lru lists on offlining.
> > Take a look at memcg_offline_kmem().
>
> Then I must be missing something. Why are we growing the number of dead
> cgroups then?
We do reparent LRU lists, but not objects. Objects (or, more precisely, pages)
are still holding a reference to the memcg.
Let's say we have systemd periodically restarting some service in system.slice.
Then, after removal of the service's memcg, all accounted objects are placed
into system.slice's LRU. But under small/moderate memory pressure we
won't scan it at all, unless it gets really big. If there are "only" a couple
of thousand objects, we don't scan it, and can easily end up with several
hundred pinned dying cgroups.
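(For reference, on a cgroup v2 host this growth is easy to watch via the
nr_dying_descendants counter in the root cgroup.stat. A minimal sketch,
assuming the default /sys/fs/cgroup mount point:)

#include <stdio.h>
#include <string.h>

int main(void)
{
	/* Assumes a cgroup v2 hierarchy mounted at /sys/fs/cgroup. */
	FILE *f = fopen("/sys/fs/cgroup/cgroup.stat", "r");
	char key[64];
	unsigned long long val;

	if (!f) {
		perror("cgroup.stat");
		return 1;
	}
	while (fscanf(f, "%63s %llu", key, &val) == 2) {
		if (!strcmp(key, "nr_dying_descendants"))
			printf("dying cgroups: %llu\n", val);
	}
	fclose(f);
	return 0;
}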
On Tue 04-09-18 08:34:49, Roman Gushchin wrote:
> On Tue, Sep 04, 2018 at 09:00:05AM +0200, Michal Hocko wrote:
> > On Mon 03-09-18 13:28:06, Roman Gushchin wrote:
> > > On Mon, Sep 03, 2018 at 08:29:56PM +0200, Michal Hocko wrote:
> > > > On Fri 31-08-18 14:31:41, Roman Gushchin wrote:
> > > > > On Fri, Aug 31, 2018 at 05:15:39PM -0400, Rik van Riel wrote:
> > > > > > On Fri, 2018-08-31 at 13:34 -0700, Roman Gushchin wrote:
> > > > > >
> > > > > > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > > > > > index fa2c150ab7b9..c910cf6bf606 100644
> > > > > > > --- a/mm/vmscan.c
> > > > > > > +++ b/mm/vmscan.c
> > > > > > > @@ -476,6 +476,10 @@ static unsigned long do_shrink_slab(struct
> > > > > > > shrink_control *shrinkctl,
> > > > > > > delta = freeable >> priority;
> > > > > > > delta *= 4;
> > > > > > > do_div(delta, shrinker->seeks);
> > > > > > > +
> > > > > > > + if (delta == 0 && freeable > 0)
> > > > > > > + delta = min(freeable, batch_size);
> > > > > > > +
> > > > > > > total_scan += delta;
> > > > > > > if (total_scan < 0) {
> > > > > > > pr_err("shrink_slab: %pF negative objects to delete
> > > > > > > nr=%ld\n",
> > > > > >
> > > > > > I agree that we need to shrink slabs with fewer than
> > > > > > 4096 objects, but do we want to put more pressure on
> > > > > > a slab the moment it drops below 4096 than we applied
> > > > > > when it had just over 4096 objects on it?
> > > > > >
> > > > > > With this patch, a slab with 5000 objects on it will
> > > > > > get 1 item scanned, while a slab with 4000 objects on
> > > > > > it will see shrinker->batch or SHRINK_BATCH objects
> > > > > > scanned every time.
> > > > > >
> > > > > > I don't know if this would cause any issues, just
> > > > > > something to ponder.
> > > > >
> > > > > Hm, fair enough. So, basically we can always do
> > > > >
> > > > > delta = max(delta, min(freeable, batch_size));
> > > > >
> > > > > Does it look better?
> > > >
> > > > Why don't you use the same heuristic we use for the normal LRU reclaim?
> > >
> > > Because we do reparent kmem lru lists on offlining.
> > > Take a look at memcg_offline_kmem().
> >
> > Then I must be missing something. Why are we growing the number of dead
> > cgroups then?
>
> We do reparent LRU lists, but not objects. Objects (or, more precisely, pages)
> are still holding a reference to the memcg.
OK, this is what I missed. I thought that the reparenting includes all
the pages as well. Is there any strong reason that we cannot do that?
Performance/Locking/etc.?
Or maybe do not reparent at all and rely on the same reclaim heuristic
we do for normal pages?
I am not opposing your patch but I am trying to figure out whether that
is the best approach.
--
Michal Hocko
SUSE Labs
On Tue, Sep 04, 2018 at 06:14:31PM +0200, Michal Hocko wrote:
> On Tue 04-09-18 08:34:49, Roman Gushchin wrote:
> > On Tue, Sep 04, 2018 at 09:00:05AM +0200, Michal Hocko wrote:
> > > On Mon 03-09-18 13:28:06, Roman Gushchin wrote:
> > > > On Mon, Sep 03, 2018 at 08:29:56PM +0200, Michal Hocko wrote:
> > > > > On Fri 31-08-18 14:31:41, Roman Gushchin wrote:
> > > > > > On Fri, Aug 31, 2018 at 05:15:39PM -0400, Rik van Riel wrote:
> > > > > > > On Fri, 2018-08-31 at 13:34 -0700, Roman Gushchin wrote:
> > > > > > >
> > > > > > > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > > > > > > index fa2c150ab7b9..c910cf6bf606 100644
> > > > > > > > --- a/mm/vmscan.c
> > > > > > > > +++ b/mm/vmscan.c
> > > > > > > > @@ -476,6 +476,10 @@ static unsigned long do_shrink_slab(struct
> > > > > > > > shrink_control *shrinkctl,
> > > > > > > > delta = freeable >> priority;
> > > > > > > > delta *= 4;
> > > > > > > > do_div(delta, shrinker->seeks);
> > > > > > > > +
> > > > > > > > + if (delta == 0 && freeable > 0)
> > > > > > > > + delta = min(freeable, batch_size);
> > > > > > > > +
> > > > > > > > total_scan += delta;
> > > > > > > > if (total_scan < 0) {
> > > > > > > > pr_err("shrink_slab: %pF negative objects to delete
> > > > > > > > nr=%ld\n",
> > > > > > >
> > > > > > > I agree that we need to shrink slabs with fewer than
> > > > > > > 4096 objects, but do we want to put more pressure on
> > > > > > > a slab the moment it drops below 4096 than we applied
> > > > > > > when it had just over 4096 objects on it?
> > > > > > >
> > > > > > > With this patch, a slab with 5000 objects on it will
> > > > > > > get 1 item scanned, while a slab with 4000 objects on
> > > > > > > it will see shrinker->batch or SHRINK_BATCH objects
> > > > > > > scanned every time.
> > > > > > >
> > > > > > > I don't know if this would cause any issues, just
> > > > > > > something to ponder.
> > > > > >
> > > > > > Hm, fair enough. So, basically we can always do
> > > > > >
> > > > > > delta = max(delta, min(freeable, batch_size));
> > > > > >
> > > > > > Does it look better?
> > > > >
> > > > > Why don't you use the same heuristic we use for the normal LRU reclaim?
> > > >
> > > > Because we do reparent kmem lru lists on offlining.
> > > > Take a look at memcg_offline_kmem().
> > >
> > > Then I must be missing something. Why are we growing the number of dead
> > > cgroups then?
> >
> > We do reparent LRU lists, but not objects. Objects (or, more precisely, pages)
> > are still holding a reference to the memcg.
>
> OK, this is what I missed. I thought that the reparenting includes all
> the pages as well. Is there any strong reason that we cannot do that?
> Performance/Locking/etc.?
>
> Or maybe do not reparent at all and rely on the same reclaim heuristic
> we do for normal pages?
>
> I am not opposing your patch but I am trying to figure out whether that
> is the best approach.
I don't think the current logic makes sense. Why should cgroups
with fewer than 4k kernel objects be excluded from being scanned?
Reparenting of all pages is definitely an option to consider,
but it's not free in any case, so if there is no problem,
why should we? Let's keep it as a last measure. In my case,
the proposed patch works perfectly: the number of dying cgroups
jumps around 100, where it grew steadily to 2k and more before.
I believe that reparenting of LRU lists is required to minimize
the number of LRU lists to scan, but I'm not sure.
On Tue 04-09-18 10:52:46, Roman Gushchin wrote:
> On Tue, Sep 04, 2018 at 06:14:31PM +0200, Michal Hocko wrote:
[...]
> > I am not opposing your patch but I am trying to figure out whether that
> > is the best approach.
>
> I don't think the current logic does make sense. Why should cgroups
> with less than 4k kernel objects be excluded from being scanned?
How is it any different from the LRU reclaim? Maybe it is just not
that visible because there are usually more pages there. But in principle it
is the same issue AFAICS.
> Reparenting of all pages is definitely an option to consider,
> but it's not free in any case, so if there is no problem,
> why should we? Let's keep it as a last measure. In my case,
> the proposed patch works perfectly: the number of dying cgroups
> jumps around 100, where it grew steadily to 2k and more before.
Let me emphasise that I am not opposing the patch. I just think that we
have made some decisions which are not ideal, but I would really like to
prevent us from building workarounds on top. If we have to reconsider some
of those decisions then let's do it. Maybe the priority scaling is just
too coarse, and what seems to work for normal LRUs doesn't work for
shrinkers.
> I believe that reparenting of LRU lists is required to minimize
> the number of LRU lists to scan, but I'm not sure.
Well, we do have more lists to scan for normal LRUs. It is true that
shrinkers add a multiplying factor to that, but in principle I guess we
really want to distinguish dead memcgs because we do want to reclaim
those much more than the rest. Those objects are basically of no use,
just eating resources. The pagecache has some chance to be reused at
least, but I fail to see why we should keep kernel objects around. Sure,
some of them might be harder to reclaim due to different lifetimes and
internal object management, but this doesn't change the fact that we
should try hard to reclaim those. So my gut feeling tells me that we
should have a way to distinguish them.
Btw. I do not see Vladimir on the CC list. Added (the thread starts
here http://lkml.kernel.org/r/[email protected])
--
Michal Hocko
SUSE Labs
[now CC Vladimir for real]
On Tue 04-09-18 20:06:31, Michal Hocko wrote:
> On Tue 04-09-18 10:52:46, Roman Gushchin wrote:
> > On Tue, Sep 04, 2018 at 06:14:31PM +0200, Michal Hocko wrote:
> [...]
> > > I am not opposing your patch but I am trying to figure out whether that
> > > is the best approach.
> >
> > I don't think the current logic does make sense. Why should cgroups
> > with less than 4k kernel objects be excluded from being scanned?
>
> How is it any different from the LRU reclaim? Maybe it is just not
> that visible because there are usually more pages there. But in principle it
> is the same issue AFAICS.
>
> > Reparenting of all pages is definitely an option to consider,
> > but it's not free in any case, so if there is no problem,
> > why should we? Let's keep it as a last measure. In my case,
> > the proposed patch works perfectly: the number of dying cgroups
> > jumps around 100, where it grew steadily to 2k and more before.
>
> Let me emphasise that I am not opposing the patch. I just think that we
> have made some decisions which are not ideal but I would really like to
> prevent us from building workarounds on top. If we have to reconsider some
> of those decisions then let's do it. Maybe the priority scaling is just
> too coarse, and what seems to work for normal LRUs doesn't work for
> shrinkers.
>
> > I believe that reparenting of LRU lists is required to minimize
> > the number of LRU lists to scan, but I'm not sure.
>
> Well, we do have more lists to scan for normal LRUs. It is true that
> shrinkers add a multiplying factor to that, but in principle I guess we
> really want to distinguish dead memcgs because we do want to reclaim
> those much more than the rest. Those objects are basically of no use
> just eating resources. The pagecache has some chance to be reused at
> least but I fail to see why we should keep kernel objects around. Sure,
> some of them might be harder to reclaim due to different life time and
> internal object management but this doesn't change the fact that we
> should try hard to reclaim those. So my gut feeling tells me that we
> should have a way to distinguish them.
>
> Btw. I do not see Vladimir on the CC list. Added (the thread starts
> here http://lkml.kernel.org/r/[email protected])
--
Michal Hocko
SUSE Labs
On Tue, Sep 04, 2018 at 10:52:46AM -0700, Roman Gushchin wrote:
> Reparenting of all pages is definitely an option to consider,
Reparenting pages would be great indeed, but I'm not sure we could do
that, because we'd have to walk over page lists of semi-active kmem
caches and do it consistently while some pages may be freed as we go.
Kmem caches are so optimized for performance that implementing such a
procedure without impacting any hot paths would be nearly impossible
IMHO. And there are two implementations (SLAB/SLUB), both of which we'd
have to support.
> but it's not free in any case, so if there is no problem,
> why should we? Let's keep it as a last measure. In my case,
> the proposed patch works perfectly: the number of dying cgroups
> jumps around 100, where it grew steadily to 2k and more before.
>
> I believe that reparenting of LRU lists is required to minimize
> the number of LRU lists to scan, but I'm not sure.
AFAIR the sole purpose of LRU reparenting is releasing the kmemcg_id as soon
as a cgroup directory is deleted. If we didn't do that, dead cgroups
would occupy slots in the per-memcg arrays (list_lru, kmem_cache), so if we
had, say, 10K dead cgroups, we'd have to allocate 80 KB arrays (10K pointers
of 8 bytes each) to store per-memcg data for each kmem_cache and list_lru.
Back when kmem accounting was introduced, we used kmalloc() for allocating
those arrays, so growing the size up to 80 KB would result in getting ENOMEM
when trying to create a cgroup too often. Now, we fall back on vmalloc(), so
maybe it wouldn't be a problem...
Alternatively, I guess we could "reparent" those dangling LRU objects
not to the parent cgroup's list_lru_memcg, but instead to a special
list_lru_memcg which wouldn't be assigned to any cgroup and which would
be reclaimed ASAP under both global and memcg pressure.