2018-11-02 00:16:47

by Dexuan Cui

Subject: Will the recent memory leak fixes be backported to longterm kernels?

Hi all,
When debugging a memory leak issue (https://github.com/coreos/bugs/issues/2516)
with v4.14.11-coreos, we noticed the same issue may have been fixed recently by
Roman in the latest mainline (i.e. Linus's master branch) according to comment #7 of
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1792349, which lists these
patches (I'm not sure if the 5-patch list is complete):

010cb21d4ede math64: prevent double calculation of DIV64_U64_ROUND_UP() arguments
f77d7a05670d mm: don't miss the last page because of round-off error
d18bf0af683e mm: drain memcg stocks on css offlining
71cd51b2e1ca mm: rework memcg kernel stack accounting
f3a2fccbce15 mm: slowly shrink slabs with a relatively small number of objects

Obviously at least some of the fixes are also needed in longterm kernels like v4.14.y,
but none of the 5 patches carries a "Cc: [email protected]" tag. I'm wondering whether
these patches will be backported to the longterm kernels. BTW, the patches are not
in v4.19, but I suppose they will be in v4.19.1-rc1?
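
(For reference, one quick way to check whether a given commit is already contained
in a tag of Linus's tree; <commit-sha> is just a placeholder for the mainline commit id:)

    # exits 0 if the commit is an ancestor of the v4.19 tag
    git merge-base --is-ancestor <commit-sha> v4.19 && echo "already in v4.19"
    # or show the first tag that contains it
    git describe --contains <commit-sha>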

Thanks,
-- Dexuan



2018-11-02 00:46:54

by Sasha Levin

Subject: Re: Will the recent memory leak fixes be backported to longterm kernels?

On Fri, Nov 02, 2018 at 12:16:02AM +0000, Dexuan Cui wrote:
>Hi all,
>When debugging a memory leak issue (https://github.com/coreos/bugs/issues/2516)
>with v4.14.11-coreos, we noticed the same issue may have been fixed recently by
>Roman in the latest mainline (i.e. Linus's master branch) according to comment #7 of
>https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1792349, which lists these
>patches (I'm not sure if the 5-patch list is complete):
>
>010cb21d4ede math64: prevent double calculation of DIV64_U64_ROUND_UP() arguments
>f77d7a05670d mm: don't miss the last page because of round-off error
>d18bf0af683e mm: drain memcg stocks on css offlining
>71cd51b2e1ca mm: rework memcg kernel stack accounting
>f3a2fccbce15 mm: slowly shrink slabs with a relatively small number of objects
>
>Obviously at least some of the fixes are also needed in the longterm kernels like v4.14.y,
>but none of the 5 patches has the "Cc: [email protected]" tag? I'm wondering if
>these patches will be backported to the longterm kernels. BTW, the patches are not
>in v4.19, but I suppose they will be in v4.19.1-rc1?

There was an issue with this series:
https://lkml.org/lkml/2018/10/23/586, so it's waiting on a fix to be
properly tested.

--
Thanks,
Sasha

2018-11-02 00:59:41

by Roman Gushchin

Subject: Re: Will the recent memory leak fixes be backported to longterm kernels?

On Fri, Nov 02, 2018 at 12:16:02AM +0000, Dexuan Cui wrote:
> Hi all,
> When debugging a memory leak issue (https://github.com/coreos/bugs/issues/2516)
> with v4.14.11-coreos, we noticed the same issue may have been fixed recently by
> Roman in the latest mainline (i.e. Linus's master branch) according to comment #7 of
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1792349, which lists these
> patches (I'm not sure if the 5-patch list is complete):
>
> 010cb21d4ede math64: prevent double calculation of DIV64_U64_ROUND_UP() arguments
> f77d7a05670d mm: don't miss the last page because of round-off error
> d18bf0af683e mm: drain memcg stocks on css offlining
> 71cd51b2e1ca mm: rework memcg kernel stack accounting
> f3a2fccbce15 mm: slowly shrink slabs with a relatively small number of objects
>
> Obviously at least some of the fixes are also needed in the longterm kernels like v4.14.y,
> but none of the 5 patches has the "Cc: [email protected]" tag? I'm wondering if
> these patches will be backported to the longterm kernels. BTW, the patches are not
> in v4.19, but I suppose they will be in v4.19.1-rc1?

Hello, Dexuan!

A couple of issues have been revealed recently; here are the fixes
(hashes are from the -next tree):

5f4b04528b5f mm: don't reclaim inodes with many attached pages
5a03b371ad6a mm: handle no memcg case in memcg_kmem_charge() properly

These two patches should be added to the series.

Re stable backporting, I'd really wait for some time. Memory reclaim is
quite a complex and fragile area, so even if the patches are correct by themselves,
they can easily cause a regression by revealing some other issue (as happened
with the inode reclaim case).

Thanks!

2018-11-02 02:48:49

by Dexuan Cui

Subject: RE: Will the recent memory leak fixes be backported to longterm kernels?

> From: Roman Gushchin <[email protected]>
> Sent: Thursday, November 1, 2018 17:58
>
> On Fri, Nov 02, 2018 at 12:16:02AM +0000, Dexuan Cui wrote:
> Hello, Dexuan!
>
> A couple of issues has been revealed recently, here are fixes
> (hashes are from the next tree):
>
> 5f4b04528b5f mm: don't reclaim inodes with many attached pages
> 5a03b371ad6a mm: handle no memcg case in memcg_kmem_charge()
> properly
>
> These two patches should be added to the serie.

Thanks for the new info!

> Re stable backporting, I'd really wait for some time. Memory reclaim is a
> quite complex and fragile area, so even if patches are correct by themselves,
> they can easily cause a regression by revealing some other issues (as it was
> with the inode reclaim case).

I totally agree. I'm now just wondering if there is any temporary workaround,
even if it means we have to run the kernel with some features disabled or
with suboptimal performance?

Thanks!
--Dexuan


2018-11-02 03:19:38

by Roman Gushchin

Subject: Re: Will the recent memory leak fixes be backported to longterm kernels?

On Fri, Nov 02, 2018 at 02:45:42AM +0000, Dexuan Cui wrote:
> > From: Roman Gushchin <[email protected]>
> > Sent: Thursday, November 1, 2018 17:58
> >
> > On Fri, Nov 02, 2018 at 12:16:02AM +0000, Dexuan Cui wrote:
> > Hello, Dexuan!
> >
> > A couple of issues has been revealed recently, here are fixes
> > (hashes are from the next tree):
> >
> > 5f4b04528b5f mm: don't reclaim inodes with many attached pages
> > 5a03b371ad6a mm: handle no memcg case in memcg_kmem_charge()
> > properly
> >
> > These two patches should be added to the serie.
>
> Thanks for the new info!
>
> > Re stable backporting, I'd really wait for some time. Memory reclaim is a
> > quite complex and fragile area, so even if patches are correct by themselves,
> > they can easily cause a regression by revealing some other issues (as it was
> > with the inode reclaim case).
>
> I totally agree. I'm now just wondering if there is any temporary workaround,
> even if that means we have to run the kernel with some features disabled or
> with a suboptimal performance?

I don't think there is any, except not using memory cgroups at all.
Limiting the number of cgroups that are created and destroyed helps too:
a faulty service running under systemd can be especially painful.

Thanks!

2018-11-02 08:04:36

by Michal Hocko

Subject: Re: Will the recent memory leak fixes be backported to longterm kernels?

On Fri 02-11-18 02:45:42, Dexuan Cui wrote:
[...]
> I totally agree. I'm now just wondering if there is any temporary workaround,
> even if that means we have to run the kernel with some features disabled or
> with a suboptimal performance?

One way would be to disable kmem accounting (the cgroup.memory=nokmem kernel
option). That would reduce memory isolation, because quite a lot of
memory will not be accounted for, but the primary source of in-flight and
hard-to-reclaim memory will be gone.
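
(On a GRUB-based distro that boot option can roughly be set as below; the file
path and the config-regeneration command vary by distro and boot loader:)

    # /etc/default/grub
    GRUB_CMDLINE_LINUX="... cgroup.memory=nokmem"
    # then regenerate the grub config (update-grub or grub2-mkconfig -o ...) and reboot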

Another workaround could be to use the force_empty knob we have in v1 and
use it when removing a cgroup. We do not have it in cgroup v2, though.
The file hasn't been added to v2 because we didn't really have any
proper usecase. Working around a bug doesn't sound like a _proper_
usecase, but I can imagine workloads that bring in a lot of metadata objects
that are not really interesting for later use, so something like a
targeted drop_caches...
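
(Roughly, on a v1 hierarchy; the path assumes the usual /sys/fs/cgroup/memory
mount point:)

    # reclaim as much as possible from the group before removing it (cgroup v1 only)
    echo 0 > /sys/fs/cgroup/memory/<group>/memory.force_empty
    rmdir /sys/fs/cgroup/memory/<group>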
--
Michal Hocko
SUSE Labs

2018-11-02 15:50:23

by Roman Gushchin

Subject: Re: Will the recent memory leak fixes be backported to longterm kernels?

On Fri, Nov 02, 2018 at 09:03:55AM +0100, Michal Hocko wrote:
> On Fri 02-11-18 02:45:42, Dexuan Cui wrote:
> [...]
> > I totally agree. I'm now just wondering if there is any temporary workaround,
> > even if that means we have to run the kernel with some features disabled or
> > with a suboptimal performance?
>
> One way would be to disable kmem accounting (cgroup.memory=nokmem kernel
> option). That would reduce the memory isolation because quite a lot of
> memory will not be accounted for but the primary source of in-flight and
> hard to reclaim memory will be gone.

In my experience, disabling kmem accounting doesn't really solve the issue
(without the patches), but it can lower the rate of the leak.

>
> Another workaround could be to use force_empty knob we have in v1 and
> use it when removing a cgroup. We do not have it in cgroup v2 though.
> The file hasn't been added to v2 because we didn't really have any
> proper usecase. Working around a bug doesn't sound like a _proper_
> usecase but I can imagine workloads that bring a lot of metadata objects
> that are not really interesting for later use so something like a
> targeted drop_caches...

This can help a bit too, but even using the system-wide drop_caches knob
unfortunately doesn't return all the memory.

Thanks!

2018-11-02 16:02:46

by Sasha Levin

Subject: Re: Will the recent memory leak fixes be backported to longterm kernels?

On Fri, Nov 02, 2018 at 02:45:42AM +0000, Dexuan Cui wrote:
>> From: Roman Gushchin <[email protected]>
>> Sent: Thursday, November 1, 2018 17:58
>>
>> On Fri, Nov 02, 2018 at 12:16:02AM +0000, Dexuan Cui wrote:
>> Hello, Dexuan!
>>
>> A couple of issues has been revealed recently, here are fixes
>> (hashes are from the next tree):
>>
>> 5f4b04528b5f mm: don't reclaim inodes with many attached pages
>> 5a03b371ad6a mm: handle no memcg case in memcg_kmem_charge()
>> properly
>>
>> These two patches should be added to the serie.
>
>Thanks for the new info!
>
>> Re stable backporting, I'd really wait for some time. Memory reclaim is a
>> quite complex and fragile area, so even if patches are correct by themselves,
>> they can easily cause a regression by revealing some other issues (as it was
>> with the inode reclaim case).
>
>I totally agree. I'm now just wondering if there is any temporary workaround,
>even if that means we have to run the kernel with some features disabled or
>with a suboptimal performance?

I'm not sure what workload you're seeing it on, but if you could merge
these 7 patches and confirm that they solve the problem you're seeing
without causing any regressions, it would be a useful test for the rest of us.
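
(A rough sketch for anyone who wants to try; the hashes below are the ones quoted
in this thread, so the two from -next may still change and some patches may need
manual backporting to an older stable tree:)

    # on top of a v4.14.y/v4.19.y based tree, with mainline and linux-next fetched
    git cherry-pick -x 010cb21d4ede f77d7a05670d d18bf0af683e 71cd51b2e1ca f3a2fccbce15
    git cherry-pick -x 5f4b04528b5f 5a03b371ad6a   # the two follow-up fixes from -next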

--
Thanks,
Sasha

2018-11-02 16:16:15

by Michal Hocko

Subject: Re: Will the recent memory leak fixes be backported to longterm kernels?

On Fri 02-11-18 15:48:57, Roman Gushchin wrote:
> On Fri, Nov 02, 2018 at 09:03:55AM +0100, Michal Hocko wrote:
> > On Fri 02-11-18 02:45:42, Dexuan Cui wrote:
> > [...]
> > > I totally agree. I'm now just wondering if there is any temporary workaround,
> > > even if that means we have to run the kernel with some features disabled or
> > > with a suboptimal performance?
> >
> > One way would be to disable kmem accounting (cgroup.memory=nokmem kernel
> > option). That would reduce the memory isolation because quite a lot of
> > memory will not be accounted for but the primary source of in-flight and
> > hard to reclaim memory will be gone.
>
> In my experience disabling the kmem accounting doesn't really solve the issue
> (without patches), but can lower the rate of the leak.

This is unexpected. 90cbc2508827e was introduced so that offline
memcgs get reclaimed even when they are small. But maybe you mean that
we still leak in the absence of memory pressure. Or what does prevent the
memcg from going down?

> > Another workaround could be to use force_empty knob we have in v1 and
> > use it when removing a cgroup. We do not have it in cgroup v2 though.
> > The file hasn't been added to v2 because we didn't really have any
> > proper usecase. Working around a bug doesn't sound like a _proper_
> > usecase but I can imagine workloads that bring a lot of metadata objects
> > that are not really interesting for later use so something like a
> > targeted drop_caches...
>
> This can help a bit too, but even using the system-wide drop_caches knob
> unfortunately doesn't return all the memory back.

Could you be more specific please?

--
Michal Hocko
SUSE Labs

2018-11-02 16:23:51

by Roman Gushchin

Subject: Re: Will the recent memory leak fixes be backported to longterm kernels?

On Fri, Nov 02, 2018 at 05:13:14PM +0100, Michal Hocko wrote:
> On Fri 02-11-18 15:48:57, Roman Gushchin wrote:
> > On Fri, Nov 02, 2018 at 09:03:55AM +0100, Michal Hocko wrote:
> > > On Fri 02-11-18 02:45:42, Dexuan Cui wrote:
> > > [...]
> > > > I totally agree. I'm now just wondering if there is any temporary workaround,
> > > > even if that means we have to run the kernel with some features disabled or
> > > > with a suboptimal performance?
> > >
> > > One way would be to disable kmem accounting (cgroup.memory=nokmem kernel
> > > option). That would reduce the memory isolation because quite a lot of
> > > memory will not be accounted for but the primary source of in-flight and
> > > hard to reclaim memory will be gone.
> >
> > In my experience disabling the kmem accounting doesn't really solve the issue
> > (without patches), but can lower the rate of the leak.
>
> This is unexpected. 90cbc2508827e was introduced to address offline
> memcgs to be reclaim even when they are small. But maybe you mean that
> we still leak in an absence of the memory pressure. Or what does prevent
> memcg from going down?

There are 3 independent issues contributing to this leak:
1) Kernel stack accounting weirdness: processes can reuse a stack accounted to a
different cgroup, so basically any running process can take a reference to any
cgroup.
2) We forget to scan the last page in the LRU list. So if we end up with a
1-page-long LRU, it can stay there basically forever.
3) We don't apply enough pressure on slab objects.

Because one reference is enough to keep the entire memcg structure in place,
we really have to close all three to eliminate the leak. Disabling kmem
accounting mitigates only the last one.

>
> > > Another workaround could be to use force_empty knob we have in v1 and
> > > use it when removing a cgroup. We do not have it in cgroup v2 though.
> > > The file hasn't been added to v2 because we didn't really have any
> > > proper usecase. Working around a bug doesn't sound like a _proper_
> > > usecase but I can imagine workloads that bring a lot of metadata objects
> > > that are not really interesting for later use so something like a
> > > targeted drop_caches...
> >
> > This can help a bit too, but even using the system-wide drop_caches knob
> > unfortunately doesn't return all the memory back.
>
> Could you be more specific please?

Sure, because problems 1) and 2) exist, echo 3 > /proc/sys/vm/drop_caches can't
reclaim all memcg structures in most cases.

Thanks!

2018-11-02 16:52:41

by Michal Hocko

Subject: Re: Will the recent memory leak fixes be backported to longterm kernels?

On Fri 02-11-18 16:22:41, Roman Gushchin wrote:
> On Fri, Nov 02, 2018 at 05:13:14PM +0100, Michal Hocko wrote:
> > On Fri 02-11-18 15:48:57, Roman Gushchin wrote:
> > > On Fri, Nov 02, 2018 at 09:03:55AM +0100, Michal Hocko wrote:
> > > > On Fri 02-11-18 02:45:42, Dexuan Cui wrote:
> > > > [...]
> > > > > I totally agree. I'm now just wondering if there is any temporary workaround,
> > > > > even if that means we have to run the kernel with some features disabled or
> > > > > with a suboptimal performance?
> > > >
> > > > One way would be to disable kmem accounting (cgroup.memory=nokmem kernel
> > > > option). That would reduce the memory isolation because quite a lot of
> > > > memory will not be accounted for but the primary source of in-flight and
> > > > hard to reclaim memory will be gone.
> > >
> > > In my experience disabling the kmem accounting doesn't really solve the issue
> > > (without patches), but can lower the rate of the leak.
> >
> > This is unexpected. 90cbc2508827e was introduced to address offline
> > memcgs to be reclaim even when they are small. But maybe you mean that
> > we still leak in an absence of the memory pressure. Or what does prevent
> > memcg from going down?
>
> There are 3 independent issues which are contributing to this leak:
> 1) Kernel stack accounting weirdness: processes can reuse stack accounted to
> different cgroups. So basically any running process can take a reference to any
> cgroup.

Yes, but kmem accounting should rule that out, right? If not then this
is a clear bug and easy to backport, because that would mean adding a
missing memcg_kmem_enabled() check.

> 2) We do forget to scan the last page in the LRU list. So if we ended up with
> 1-page long LRU, it can stay there basically forever.

Why doesn't
    /*
     * If the cgroup's already been deleted, make sure to
     * scrape out the remaining cache.
     */
    if (!scan && !mem_cgroup_online(memcg))
        scan = min(size, SWAP_CLUSTER_MAX);

in get_scan_count work for that case?

> 3) We don't apply enough pressure on slab objects.

Again, with kmem accounting disabled this should be moot.

> Because one reference is enough to keep the entire memcg structure in place,
> we really have to close all three to eliminate the leak. Disabling kmem
> accounting mitigates only the last one.
--
Michal Hocko
SUSE Labs

2018-11-02 17:27:18

by Roman Gushchin

Subject: Re: Will the recent memory leak fixes be backported to longterm kernels?

On Fri, Nov 02, 2018 at 05:51:47PM +0100, Michal Hocko wrote:
> On Fri 02-11-18 16:22:41, Roman Gushchin wrote:
> > On Fri, Nov 02, 2018 at 05:13:14PM +0100, Michal Hocko wrote:
> > > On Fri 02-11-18 15:48:57, Roman Gushchin wrote:
> > > > On Fri, Nov 02, 2018 at 09:03:55AM +0100, Michal Hocko wrote:
> > > > > On Fri 02-11-18 02:45:42, Dexuan Cui wrote:
> > > > > [...]
> > > > > > I totally agree. I'm now just wondering if there is any temporary workaround,
> > > > > > even if that means we have to run the kernel with some features disabled or
> > > > > > with a suboptimal performance?
> > > > >
> > > > > One way would be to disable kmem accounting (cgroup.memory=nokmem kernel
> > > > > option). That would reduce the memory isolation because quite a lot of
> > > > > memory will not be accounted for but the primary source of in-flight and
> > > > > hard to reclaim memory will be gone.
> > > >
> > > > In my experience disabling the kmem accounting doesn't really solve the issue
> > > > (without patches), but can lower the rate of the leak.
> > >
> > > This is unexpected. 90cbc2508827e was introduced to address offline
> > > memcgs to be reclaim even when they are small. But maybe you mean that
> > > we still leak in an absence of the memory pressure. Or what does prevent
> > > memcg from going down?
> >
> > There are 3 independent issues which are contributing to this leak:
> > 1) Kernel stack accounting weirdness: processes can reuse stack accounted to
> > different cgroups. So basically any running process can take a reference to any
> > cgroup.
>
> yes, but kmem accounting should rule that out, right? If not then this
> is a clear bug and easy to backport because that would mean to add a
> missing memcg_kmem_enabled check.

Yes, you're right, disabling kmem accounting should mitigate this problem.

>
> > 2) We do forget to scan the last page in the LRU list. So if we ended up with
> > 1-page long LRU, it can stay there basically forever.
>
> Why
> /*
> * If the cgroup's already been deleted, make sure to
> * scrape out the remaining cache.
> */
> if (!scan && !mem_cgroup_online(memcg))
> scan = min(size, SWAP_CLUSTER_MAX);
>
> in get_scan_count doesn't work for that case?

No, it doesn't. Let's look at the whole picture:

    size = lruvec_lru_size(lruvec, lru, sc->reclaim_idx);
    scan = size >> sc->priority;
    /*
     * If the cgroup's already been deleted, make sure to
     * scrape out the remaining cache.
     */
    if (!scan && !mem_cgroup_online(memcg))
        scan = min(size, SWAP_CLUSTER_MAX);

If size == 1, scan == 0 => scan = min(1, 32) == 1.
And after the proportional adjustment we'll have 0.

So, disabling kmem accounting mitigates the two other issues, but not this one.

Anyway, I'd prefer to wait a bit for the test results and then backport the
series as a whole.

Thanks!

2018-11-02 17:49:03

by Michal Hocko

Subject: Re: Will the recent memory leak fixes be backported to longterm kernels?

On Fri 02-11-18 17:25:58, Roman Gushchin wrote:
> On Fri, Nov 02, 2018 at 05:51:47PM +0100, Michal Hocko wrote:
> > On Fri 02-11-18 16:22:41, Roman Gushchin wrote:
[...]
> > > 2) We do forget to scan the last page in the LRU list. So if we ended up with
> > > 1-page long LRU, it can stay there basically forever.
> >
> > Why
> > /*
> > * If the cgroup's already been deleted, make sure to
> > * scrape out the remaining cache.
> > */
> > if (!scan && !mem_cgroup_online(memcg))
> > scan = min(size, SWAP_CLUSTER_MAX);
> >
> > in get_scan_count doesn't work for that case?
>
> No, it doesn't. Let's look at the whole picture:
>
> size = lruvec_lru_size(lruvec, lru, sc->reclaim_idx);
> scan = size >> sc->priority;
> /*
> * If the cgroup's already been deleted, make sure to
> * scrape out the remaining cache.
> */
> if (!scan && !mem_cgroup_online(memcg))
> scan = min(size, SWAP_CLUSTER_MAX);
>
> If size == 1, scan == 0 => scan = min(1, 32) == 1.
> And after proportional adjustment we'll have 0.

My Friday brain hurts when looking at this, but if it doesn't work as
advertised then it should be fixed. I do not see any of your patches touching
this logic, so how come it would work after they are applied?
--
Michal Hocko
SUSE Labs

2018-11-02 19:39:52

by Roman Gushchin

Subject: Re: Will the recent memory leak fixes be backported to longterm kernels?

On Fri, Nov 02, 2018 at 06:48:23PM +0100, Michal Hocko wrote:
> On Fri 02-11-18 17:25:58, Roman Gushchin wrote:
> > On Fri, Nov 02, 2018 at 05:51:47PM +0100, Michal Hocko wrote:
> > > On Fri 02-11-18 16:22:41, Roman Gushchin wrote:
> [...]
> > > > 2) We do forget to scan the last page in the LRU list. So if we ended up with
> > > > 1-page long LRU, it can stay there basically forever.
> > >
> > > Why
> > > /*
> > > * If the cgroup's already been deleted, make sure to
> > > * scrape out the remaining cache.
> > > */
> > > if (!scan && !mem_cgroup_online(memcg))
> > > scan = min(size, SWAP_CLUSTER_MAX);
> > >
> > > in get_scan_count doesn't work for that case?
> >
> > No, it doesn't. Let's look at the whole picture:
> >
> > size = lruvec_lru_size(lruvec, lru, sc->reclaim_idx);
> > scan = size >> sc->priority;
> > /*
> > * If the cgroup's already been deleted, make sure to
> > * scrape out the remaining cache.
> > */
> > if (!scan && !mem_cgroup_online(memcg))
> > scan = min(size, SWAP_CLUSTER_MAX);
> >
> > If size == 1, scan == 0 => scan = min(1, 32) == 1.
> > And after proportional adjustment we'll have 0.
>
> My friday brain hurst when looking at this but if it doesn't work as
> advertized then it should be fixed. I do not see any of your patches to
> touch this logic so how come it would work after them applied?

This part works as expected. But the following
    scan = div64_u64(scan * fraction[file], denominator);
reliably turns 1 page to scan into 0 pages to scan.

And this is the issue which my patches do address.
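
To illustrate with made-up numbers (going by the patch titles earlier in this
thread, DIV64_U64_ROUND_UP() is what the round-off fix switches the division to):

    scan = 1, fraction[file] = 3, denominator = 4    /* hypothetical values */
    div64_u64(1 * 3, 4)          == 0    /* the last page is never scanned */
    DIV64_U64_ROUND_UP(1 * 3, 4) == 1    /* at least one page gets scanned */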

Thanks!

2018-11-04 09:16:58

by Mike Rapoport

Subject: Re: Will the recent memory leak fixes be backported to longterm kernels?

On Fri, Nov 02, 2018 at 12:01:22PM -0400, Sasha Levin wrote:
> On Fri, Nov 02, 2018 at 02:45:42AM +0000, Dexuan Cui wrote:
> >>From: Roman Gushchin <[email protected]>
> >>Sent: Thursday, November 1, 2018 17:58
> >>
> >>On Fri, Nov 02, 2018 at 12:16:02AM +0000, Dexuan Cui wrote:
> >>Hello, Dexuan!
> >>
> >>A couple of issues has been revealed recently, here are fixes
> >>(hashes are from the next tree):
> >>
> >>5f4b04528b5f mm: don't reclaim inodes with many attached pages
> >>5a03b371ad6a mm: handle no memcg case in memcg_kmem_charge()
> >>properly
> >>
> >>These two patches should be added to the serie.
> >
> >Thanks for the new info!
> >
> >>Re stable backporting, I'd really wait for some time. Memory reclaim is a
> >>quite complex and fragile area, so even if patches are correct by themselves,
> >>they can easily cause a regression by revealing some other issues (as it was
> >>with the inode reclaim case).
> >
> >I totally agree. I'm now just wondering if there is any temporary workaround,
> >even if that means we have to run the kernel with some features disabled or
> >with a suboptimal performance?
>
> I'm not sure what workload you're seeing it on, but if you could merge
> these 7 patches and see that it solves the problem you're seeing and
> doesn't cause any regressions it'll be a useful test for the rest of us.

AFAIK, with Roman's patches backported to the Ubuntu version of 4.15, the
problem reported at [1] is solved.

[1] https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1792349

> --
> Thanks,
> Sasha
>

--
Sincerely yours,
Mike.


2018-11-05 09:22:00

by Michal Hocko

Subject: Re: Will the recent memory leak fixes be backported to longterm kernels?

On Fri 02-11-18 19:38:35, Roman Gushchin wrote:
> On Fri, Nov 02, 2018 at 06:48:23PM +0100, Michal Hocko wrote:
> > On Fri 02-11-18 17:25:58, Roman Gushchin wrote:
> > > On Fri, Nov 02, 2018 at 05:51:47PM +0100, Michal Hocko wrote:
> > > > On Fri 02-11-18 16:22:41, Roman Gushchin wrote:
> > [...]
> > > > > 2) We do forget to scan the last page in the LRU list. So if we ended up with
> > > > > 1-page long LRU, it can stay there basically forever.
> > > >
> > > > Why
> > > > /*
> > > > * If the cgroup's already been deleted, make sure to
> > > > * scrape out the remaining cache.
> > > > */
> > > > if (!scan && !mem_cgroup_online(memcg))
> > > > scan = min(size, SWAP_CLUSTER_MAX);
> > > >
> > > > in get_scan_count doesn't work for that case?
> > >
> > > No, it doesn't. Let's look at the whole picture:
> > >
> > > size = lruvec_lru_size(lruvec, lru, sc->reclaim_idx);
> > > scan = size >> sc->priority;
> > > /*
> > > * If the cgroup's already been deleted, make sure to
> > > * scrape out the remaining cache.
> > > */
> > > if (!scan && !mem_cgroup_online(memcg))
> > > scan = min(size, SWAP_CLUSTER_MAX);
> > >
> > > If size == 1, scan == 0 => scan = min(1, 32) == 1.
> > > And after proportional adjustment we'll have 0.
> >
> > My friday brain hurst when looking at this but if it doesn't work as
> > advertized then it should be fixed. I do not see any of your patches to
> > touch this logic so how come it would work after them applied?
>
> This part works as expected. But the following
> scan = div64_u64(scan * fraction[file], denominator);
> reliable turns 1 page to scan to 0 pages to scan.

OK, 68600f623d69 ("mm: don't miss the last page because of round-off
error") sounds like good and safe stable backport material.

--
Michal Hocko
SUSE Labs

2018-12-28 18:56:39

by Greg KH

Subject: Re: Will the recent memory leak fixes be backported to longterm kernels?

On Mon, Nov 05, 2018 at 10:21:23AM +0100, Michal Hocko wrote:
> On Fri 02-11-18 19:38:35, Roman Gushchin wrote:
> > On Fri, Nov 02, 2018 at 06:48:23PM +0100, Michal Hocko wrote:
> > > On Fri 02-11-18 17:25:58, Roman Gushchin wrote:
> > > > On Fri, Nov 02, 2018 at 05:51:47PM +0100, Michal Hocko wrote:
> > > > > On Fri 02-11-18 16:22:41, Roman Gushchin wrote:
> > > [...]
> > > > > > 2) We do forget to scan the last page in the LRU list. So if we ended up with
> > > > > > 1-page long LRU, it can stay there basically forever.
> > > > >
> > > > > Why
> > > > > /*
> > > > > * If the cgroup's already been deleted, make sure to
> > > > > * scrape out the remaining cache.
> > > > > */
> > > > > if (!scan && !mem_cgroup_online(memcg))
> > > > > scan = min(size, SWAP_CLUSTER_MAX);
> > > > >
> > > > > in get_scan_count doesn't work for that case?
> > > >
> > > > No, it doesn't. Let's look at the whole picture:
> > > >
> > > > size = lruvec_lru_size(lruvec, lru, sc->reclaim_idx);
> > > > scan = size >> sc->priority;
> > > > /*
> > > > * If the cgroup's already been deleted, make sure to
> > > > * scrape out the remaining cache.
> > > > */
> > > > if (!scan && !mem_cgroup_online(memcg))
> > > > scan = min(size, SWAP_CLUSTER_MAX);
> > > >
> > > > If size == 1, scan == 0 => scan = min(1, 32) == 1.
> > > > And after proportional adjustment we'll have 0.
> > >
> > > My friday brain hurst when looking at this but if it doesn't work as
> > > advertized then it should be fixed. I do not see any of your patches to
> > > touch this logic so how come it would work after them applied?
> >
> > This part works as expected. But the following
> > scan = div64_u64(scan * fraction[file], denominator);
> > reliable turns 1 page to scan to 0 pages to scan.
>
> OK, 68600f623d69 ("mm: don't miss the last page because of round-off
> error") sounds like a good and safe stable backport material.

Thanks for this, now queued up.

greg k-h

2019-01-30 00:24:18

by Sasha Levin

Subject: Re: Will the recent memory leak fixes be backported to longterm kernels?

On Fri, Dec 28, 2018 at 11:50:08AM +0100, Greg KH wrote:
>On Mon, Nov 05, 2018 at 10:21:23AM +0100, Michal Hocko wrote:
>> On Fri 02-11-18 19:38:35, Roman Gushchin wrote:
>> > On Fri, Nov 02, 2018 at 06:48:23PM +0100, Michal Hocko wrote:
>> > > On Fri 02-11-18 17:25:58, Roman Gushchin wrote:
>> > > > On Fri, Nov 02, 2018 at 05:51:47PM +0100, Michal Hocko wrote:
>> > > > > On Fri 02-11-18 16:22:41, Roman Gushchin wrote:
>> > > [...]
>> > > > > > 2) We do forget to scan the last page in the LRU list. So if we ended up with
>> > > > > > 1-page long LRU, it can stay there basically forever.
>> > > > >
>> > > > > Why
>> > > > > /*
>> > > > > * If the cgroup's already been deleted, make sure to
>> > > > > * scrape out the remaining cache.
>> > > > > */
>> > > > > if (!scan && !mem_cgroup_online(memcg))
>> > > > > scan = min(size, SWAP_CLUSTER_MAX);
>> > > > >
>> > > > > in get_scan_count doesn't work for that case?
>> > > >
>> > > > No, it doesn't. Let's look at the whole picture:
>> > > >
>> > > > size = lruvec_lru_size(lruvec, lru, sc->reclaim_idx);
>> > > > scan = size >> sc->priority;
>> > > > /*
>> > > > * If the cgroup's already been deleted, make sure to
>> > > > * scrape out the remaining cache.
>> > > > */
>> > > > if (!scan && !mem_cgroup_online(memcg))
>> > > > scan = min(size, SWAP_CLUSTER_MAX);
>> > > >
>> > > > If size == 1, scan == 0 => scan = min(1, 32) == 1.
>> > > > And after proportional adjustment we'll have 0.
>> > >
>> > > My friday brain hurst when looking at this but if it doesn't work as
>> > > advertized then it should be fixed. I do not see any of your patches to
>> > > touch this logic so how come it would work after them applied?
>> >
>> > This part works as expected. But the following
>> > scan = div64_u64(scan * fraction[file], denominator);
>> > reliable turns 1 page to scan to 0 pages to scan.
>>
>> OK, 68600f623d69 ("mm: don't miss the last page because of round-off
>> error") sounds like a good and safe stable backport material.
>
>Thanks for this, now queued up.
>
>greg k-h

It seems that 172b06c32b949 ("mm: slowly shrink slabs with a relatively
small number of objects") and a76cf1a474d ("mm: don't reclaim inodes
with many attached pages") cause a regression reported against the 4.19
stable tree: https://bugzilla.kernel.org/show_bug.cgi?id=202441 .

Given the history and complexity of these (and other patches from that
series), it would be nice to understand whether this is something that will be
fixed soon, or whether we should look into reverting the series for now.

--
Thanks,
Sasha

2019-01-30 06:00:23

by Roman Gushchin

Subject: Re: Will the recent memory leak fixes be backported to longterm kernels?

On Tue, Jan 29, 2019 at 07:23:56PM -0500, Sasha Levin wrote:
> On Fri, Dec 28, 2018 at 11:50:08AM +0100, Greg KH wrote:
> > On Mon, Nov 05, 2018 at 10:21:23AM +0100, Michal Hocko wrote:
> > > On Fri 02-11-18 19:38:35, Roman Gushchin wrote:
> > > > On Fri, Nov 02, 2018 at 06:48:23PM +0100, Michal Hocko wrote:
> > > > > On Fri 02-11-18 17:25:58, Roman Gushchin wrote:
> > > > > > On Fri, Nov 02, 2018 at 05:51:47PM +0100, Michal Hocko wrote:
> > > > > > > On Fri 02-11-18 16:22:41, Roman Gushchin wrote:
> > > > > [...]
> > > > > > > > 2) We do forget to scan the last page in the LRU list. So if we ended up with
> > > > > > > > 1-page long LRU, it can stay there basically forever.
> > > > > > >
> > > > > > > Why
> > > > > > > /*
> > > > > > > * If the cgroup's already been deleted, make sure to
> > > > > > > * scrape out the remaining cache.
> > > > > > > */
> > > > > > > if (!scan && !mem_cgroup_online(memcg))
> > > > > > > scan = min(size, SWAP_CLUSTER_MAX);
> > > > > > >
> > > > > > > in get_scan_count doesn't work for that case?
> > > > > >
> > > > > > No, it doesn't. Let's look at the whole picture:
> > > > > >
> > > > > > size = lruvec_lru_size(lruvec, lru, sc->reclaim_idx);
> > > > > > scan = size >> sc->priority;
> > > > > > /*
> > > > > > * If the cgroup's already been deleted, make sure to
> > > > > > * scrape out the remaining cache.
> > > > > > */
> > > > > > if (!scan && !mem_cgroup_online(memcg))
> > > > > > scan = min(size, SWAP_CLUSTER_MAX);
> > > > > >
> > > > > > If size == 1, scan == 0 => scan = min(1, 32) == 1.
> > > > > > And after proportional adjustment we'll have 0.
> > > > >
> > > > > My friday brain hurst when looking at this but if it doesn't work as
> > > > > advertized then it should be fixed. I do not see any of your patches to
> > > > > touch this logic so how come it would work after them applied?
> > > >
> > > > This part works as expected. But the following
> > > > scan = div64_u64(scan * fraction[file], denominator);
> > > > reliable turns 1 page to scan to 0 pages to scan.
> > >
> > > OK, 68600f623d69 ("mm: don't miss the last page because of round-off
> > > error") sounds like a good and safe stable backport material.
> >
> > Thanks for this, now queued up.
> >
> > greg k-h
>
> It seems that 172b06c32b949 ("mm: slowly shrink slabs with a relatively
> small number of objects") and a76cf1a474d ("mm: don't reclaim inodes
> with many attached pages") cause a regression reported against the 4.19
> stable tree: https://bugzilla.kernel.org/show_bug.cgi?id=202441 .
>
> Given the history and complexity of these (and other patches from that
> series) it would be nice to understand if this is something that will be
> fixed soon or should we look into reverting the series for now?

In that thread I've just suggested giving Rik's patch a chance, which will
hopefully mitigate or ease the regression
(https://lkml.org/lkml/2019/1/28/1865).

Of course, we can simply revert those changes, but this will re-introduce
the memory leak, so I'd leave that as a last option.

Thanks!