LinuxLists.cc - [REGRESSION] Re: [PATCH 6.1 033/219] memcg: drop kmem.limit_in

[permalink] [raw]

Subject: Re: [REGRESSION] Re: [PATCH 6.1 033/219] memcg: drop kmem.limit_in_bytes

On 9/20/2023 10:43 AM, Michal Hocko wrote:
> On Wed 20-09-23 01:11:01, Jeremi Piotrowski wrote:
>> On Sun, Sep 17, 2023 at 09:12:40PM +0200, Greg Kroah-Hartman wrote:
>>> 6.1-stable review patch. If anyone has any objections, please let me know.
>>>
>>> ------------------
>>
>> Hi Greg/Michal,
>>
>> This commit breaks userspace which makes it a bad commit for mainline and an
>> even worse commit for stable.
>>
>> We ingested 6.1.54 into our nightly testing and found that runc fails to gather
>> cgroup statistics (when reading kmem.limit_in_bytes). The same code is vendored
>> into kubelet and kubelet fails to start if this operation fails. 6.1.53 is
>> fine.
>
> Could you expand some more on why is the file read? It doesn't support
> writing to it for some time so how does reading it helps in any sense?
>
> Anyway, I do agree that the stable backport should be reverted.
>

This file is read together with all the other memcg files. Each prefix:

memory
memory.memsw
memory.kmem
memory.kmem.tcp

is combined with these suffixes

.usage_in_bytes
.max_usage_in_bytes
.failcnt
.limit_in_bytes

and read, the values are then forwarded on to other components for scheduling decisions.
You want to know the limit when checking the usage (is the usage close to the limit or not).

Userspace tolerates MEMCG/MEMCG_KMEM being disabled, but having a single file out of the
set missing is an anomaly. So maybe we could keep the dummy file just for the
sake of consistency? Cgroupv1 is legacy after all.

>>> Address this by wiping out the file completely and effectively get back to
>>> pre 4.5 era and CONFIG_MEMCG_KMEM=n configuration.
>>
>> On reads, the runc code checks for MEMCG_KMEM=n by checking
>> kmem.usage_in_bytes. If it is present then runc expects the other cgroup files
>> to be there (including kmem.limit_in_bytes). So this change is not effectively
>> the same.
>>
>> Here's a link to the PR that would be needed to handle this change in userspace
>> (not merged yet and would need to be propagated through the ecosystem):
>>
>> https://github.com/opencontainers/runc/pull/4018.
>
> Thanks. Does that mean the revert is still necessary for the Linus tree
> or do you expect that the fix can be merged and propagated in a
> reasonable time?
>

We can probably get runc and currently supported kubernetes versions patched in time
before 6.6 (or the next LTS kernel) hits LTS distros.

But there's still a bunch of users running cgroupv1 with unsupported kubernetes
versions that are still taking kernel updates as they come, so this might get reported
again next year if it stays in mainline.

2023-09-20 12:25:17

[permalink] [raw]

Subject: Re: [REGRESSION] Re: [PATCH 6.1 033/219] memcg: drop kmem.limit_in_bytes

On Wed 20-09-23 01:11:01, Jeremi Piotrowski wrote:
> On Sun, Sep 17, 2023 at 09:12:40PM +0200, Greg Kroah-Hartman wrote:
> > 6.1-stable review patch. If anyone has any objections, please let me know.
> >
> > ------------------
>
> Hi Greg/Michal,
>
> This commit breaks userspace which makes it a bad commit for mainline and an
> even worse commit for stable.
>
> We ingested 6.1.54 into our nightly testing and found that runc fails to gather
> cgroup statistics (when reading kmem.limit_in_bytes). The same code is vendored
> into kubelet and kubelet fails to start if this operation fails. 6.1.53 is
> fine.

Could you expand some more on why is the file read? It doesn't support
writing to it for some time so how does reading it helps in any sense?

Anyway, I do agree that the stable backport should be reverted.

> > Address this by wiping out the file completely and effectively get back to
> > pre 4.5 era and CONFIG_MEMCG_KMEM=n configuration.
>
> On reads, the runc code checks for MEMCG_KMEM=n by checking
> kmem.usage_in_bytes. If it is present then runc expects the other cgroup files
> to be there (including kmem.limit_in_bytes). So this change is not effectively
> the same.
>
> Here's a link to the PR that would be needed to handle this change in userspace
> (not merged yet and would need to be propagated through the ecosystem):
>
> https://github.com/opencontainers/runc/pull/4018.

Thanks. Does that mean the revert is still necessary for the Linus tree
or do you expect that the fix can be merged and propagated in a
reasonable time?

--
Michal Hocko
SUSE Labs

2023-09-20 13:34:18

by Greg Kroah-Hartman

[permalink] [raw]

Subject: Re: [REGRESSION] Re: [PATCH 6.1 033/219] memcg: drop kmem.limit_in_bytes

On Wed, Sep 20, 2023 at 01:08:43PM +0200, Michal Hocko wrote:
> On Wed 20-09-23 12:45:08, Greg KH wrote:
> [...]
> > Ok, then we should revert this, I'll go drop it in the stable trees, it
> > should also be reverted in Linus's tree too.
>
> A simple revert would break other users as noted in other response so
> wait with sending reverts to Linus before we agreen on the least painful
> solution.

A revert should cause the systems that stopped working to start working
again, so I'll keep the revert in the stable trees and wait for you to
work out the real solution in Linus's tree and then backport all of them
as needed.

thanks,

greg k-h

2023-09-20 15:07:47

by Greg Kroah-Hartman

[permalink] [raw]

Subject: Re: [REGRESSION] Re: [PATCH 6.1 033/219] memcg: drop kmem.limit_in_bytes

On Wed, Sep 20, 2023 at 10:43:56AM +0200, Michal Hocko wrote:
> On Wed 20-09-23 01:11:01, Jeremi Piotrowski wrote:
> > On Sun, Sep 17, 2023 at 09:12:40PM +0200, Greg Kroah-Hartman wrote:
> > > 6.1-stable review patch. If anyone has any objections, please let me know.
> > >
> > > ------------------
> >
> > Hi Greg/Michal,
> >
> > This commit breaks userspace which makes it a bad commit for mainline and an
> > even worse commit for stable.
> >
> > We ingested 6.1.54 into our nightly testing and found that runc fails to gather
> > cgroup statistics (when reading kmem.limit_in_bytes). The same code is vendored
> > into kubelet and kubelet fails to start if this operation fails. 6.1.53 is
> > fine.
>
> Could you expand some more on why is the file read? It doesn't support
> writing to it for some time so how does reading it helps in any sense?
>
> Anyway, I do agree that the stable backport should be reverted.

That will just postpone the breakage, we really shouldn't break
userspace.

That being said, having userspace "break" because a file is no longer
present is not good coding style on the userspace side at all. That's
why we have sysfs and single-value-files now, if the file isn't present,
then userspace instantly notices and can handle it. Much easier than
the old-style multi-fields-in-one-file problem.

> > > Address this by wiping out the file completely and effectively get back to
> > > pre 4.5 era and CONFIG_MEMCG_KMEM=n configuration.

The fact that this is a valid option (i.e. no file) with that config
option disabled makes me want to keep this as well, as how does
userspace handle this option disabled at all? Or old kernels?

I can drop this from stable kernels, but again, this feels like the runc
developers are just postponing the problem...

thanks,

greg k-h

2023-09-20 15:20:43

[permalink] [raw]

Subject: Re: [REGRESSION] Re: [PATCH 6.1 033/219] memcg: drop kmem.limit_in_bytes

On Wed 20-09-23 12:04:48, Jeremi Piotrowski wrote:
> On 9/20/2023 10:43 AM, Michal Hocko wrote:
> > On Wed 20-09-23 01:11:01, Jeremi Piotrowski wrote:
> >> On Sun, Sep 17, 2023 at 09:12:40PM +0200, Greg Kroah-Hartman wrote:
> >>> 6.1-stable review patch. If anyone has any objections, please let me know.
> >>>
> >>> ------------------
> >>
> >> Hi Greg/Michal,
> >>
> >> This commit breaks userspace which makes it a bad commit for mainline and an
> >> even worse commit for stable.
> >>
> >> We ingested 6.1.54 into our nightly testing and found that runc fails to gather
> >> cgroup statistics (when reading kmem.limit_in_bytes). The same code is vendored
> >> into kubelet and kubelet fails to start if this operation fails. 6.1.53 is
> >> fine.
> >
> > Could you expand some more on why is the file read? It doesn't support
> > writing to it for some time so how does reading it helps in any sense?
> >
> > Anyway, I do agree that the stable backport should be reverted.
> >
>
> This file is read together with all the other memcg files. Each prefix:
>
> memory
> memory.memsw
> memory.kmem
> memory.kmem.tcp
>
> is combined with these suffixes
>
> .usage_in_bytes
> .max_usage_in_bytes
> .failcnt
> .limit_in_bytes
>
> and read, the values are then forwarded on to other components for scheduling decisions.
> You want to know the limit when checking the usage (is the usage close to the limit or not).

You know there is no kmem limit as there is no way to set it for some
time (since 5.16 - i.e. 2 years ago). I can see that users following old
kernels could have missed that though.

> Userspace tolerates MEMCG/MEMCG_KMEM being disabled, but having a single file out of the
> set missing is an anomaly. So maybe we could keep the dummy file just for the
> sake of consistency? Cgroupv1 is legacy after all.

What we had was a dummy file. It didn't allow to write any value so it
would have always reported unlimited. The reason I've decided to remove
the file was that there were other users not being able to handle the
write failure while they are just fine not having the file. So we are
effectively between a rock and hard place here. Either way something is
broken. The other SW got fixed as well but similar to your case it takes
some time to absorb the change through all 3rd party users.

> >>> Address this by wiping out the file completely and effectively get back to
> >>> pre 4.5 era and CONFIG_MEMCG_KMEM=n configuration.
> >>
> >> On reads, the runc code checks for MEMCG_KMEM=n by checking
> >> kmem.usage_in_bytes.

Just one side note. Config options get renamed and their semantic
changes over time so I would just recomment to never make any
dependencies on any specific one.

> >> If it is present then runc expects the other cgroup files
> >> to be there (including kmem.limit_in_bytes). So this change is not effectively
> >> the same.
> >>
> >> Here's a link to the PR that would be needed to handle this change in userspace
> >> (not merged yet and would need to be propagated through the ecosystem):
> >>
> >> https://github.com/opencontainers/runc/pull/4018.
> >
> > Thanks. Does that mean the revert is still necessary for the Linus tree
> > or do you expect that the fix can be merged and propagated in a
> > reasonable time?
> >
>
> We can probably get runc and currently supported kubernetes versions patched in time
> before 6.6 (or the next LTS kernel) hits LTS distros.
>
> But there's still a bunch of users running cgroupv1 with unsupported kubernetes
> versions that are still taking kernel updates as they come, so this might get reported
> again next year if it stays in mainline.

I can see how 3rd party users are hard to get aligned but having a fix
available should allow them to apply it or is there any actual roadblock
for them to adapt as soon as they hit the issue?

I mean, normally I would be just fine reverting this API change because
it is disruptive but the only way to have the file available and not
break somebody is to revert 58056f77502f ("memcg, kmem: further
deprecate kmem.limit_in_bytes") as well. Or to ignore any value written
there but that sounds rather dubious. Although one could argue this
would mimic nokmem kernel option.

--
Michal Hocko
SUSE Labs

2023-09-20 15:34:54

by Greg Kroah-Hartman

[permalink] [raw]

Subject: Re: [REGRESSION] Re: [PATCH 6.1 033/219] memcg: drop kmem.limit_in_bytes

On Wed, Sep 20, 2023 at 12:21:37PM +0200, Jeremi Piotrowski wrote:
> On 9/20/2023 11:25 AM, Greg Kroah-Hartman wrote:
> > On Wed, Sep 20, 2023 at 10:43:56AM +0200, Michal Hocko wrote:
> >> On Wed 20-09-23 01:11:01, Jeremi Piotrowski wrote:
> >>> On Sun, Sep 17, 2023 at 09:12:40PM +0200, Greg Kroah-Hartman wrote:
> >>>> 6.1-stable review patch. If anyone has any objections, please let me know.
> >>>>
> >>>> ------------------
> >>>
> >>> Hi Greg/Michal,
> >>>
> >>> This commit breaks userspace which makes it a bad commit for mainline and an
> >>> even worse commit for stable.
> >>>
> >>> We ingested 6.1.54 into our nightly testing and found that runc fails to gather
> >>> cgroup statistics (when reading kmem.limit_in_bytes). The same code is vendored
> >>> into kubelet and kubelet fails to start if this operation fails. 6.1.53 is
> >>> fine.
> >>
> >> Could you expand some more on why is the file read? It doesn't support
> >> writing to it for some time so how does reading it helps in any sense?
> >>
> >> Anyway, I do agree that the stable backport should be reverted.
> >
> > That will just postpone the breakage, we really shouldn't break
> > userspace.
> >
> > That being said, having userspace "break" because a file is no longer
> > present is not good coding style on the userspace side at all. That's
> > why we have sysfs and single-value-files now, if the file isn't present,
> > then userspace instantly notices and can handle it. Much easier than
> > the old-style multi-fields-in-one-file problem.
> >
>
> The memcg files in this case are single-value, but userspace expects to be able
> to read memcg limits when it can read the usage (indicating MEMCG is enabled).
> If it can't - then something is off, and the node is marked unhealthy.
>
> >>>> Address this by wiping out the file completely and effectively get back to
> >>>> pre 4.5 era and CONFIG_MEMCG_KMEM=n configuration.
> >
> > The fact that this is a valid option (i.e. no file) with that config
> > option disabled makes me want to keep this as well, as how does
> > userspace handle this option disabled at all? Or old kernels?
> >
>
> Userspace has had to handle the case of MEMCG_KMEM=n, but that had 2 cases so far:
>
> limits/usage/max_usage/failcnt files are all available or none of them are available.
>
> Now it needs to handle 3 of 4 files being available, but only for kmem (and not plain
> memory, memsw or kmem.tcp). That's an inconsistency.
>
> > I can drop this from stable kernels, but again, this feels like the runc
> > developers are just postponing the problem...
> >
>
> Since cgroups v1 is deprecated, I think the runc developers haven't touched this part
> of the code in years and expected it to keep working while they wait for the long tail
> of usage to die out.

Ok, then we should revert this, I'll go drop it in the stable trees, it
should also be reverted in Linus's tree too.

thanks,

greg k-h

2023-09-20 15:50:41

[permalink] [raw]

Subject: Re: [REGRESSION] Re: [PATCH 6.1 033/219] memcg: drop kmem.limit_in_bytes

On Wed 20-09-23 15:25:23, Jeremi Piotrowski wrote:
> On 9/20/2023 1:07 PM, Michal Hocko wrote:
[...]
> > I mean, normally I would be just fine reverting this API change because
> > it is disruptive but the only way to have the file available and not
> > break somebody is to revert 58056f77502f ("memcg, kmem: further
> > deprecate kmem.limit_in_bytes") as well. Or to ignore any value written
> > there but that sounds rather dubious. Although one could argue this
> > would mimic nokmem kernel option.
> >
>
> I just want to make sure we don't introduce yet another new behavior in this legacy
> system. I have not seen breakage due to 58056f77502f. Mimicing nokmem sounds good but
> does this mean "don't enforce limits" (that should be fine) or "ignore writes to the limit"
> (=don't event store the written limit). The latter might have unintended consequences.

Yes it would mean that the limit is never enforced. Bad as it is the
thing is that the hard limit on kernel memory is broken by design and
unfixable. This causes all sorts of unexpected kernel allocation
failures that this is simply unsafe to use.

All that being said I can see the following options
1) keep the current upstream status and not export the file
2) revert both 58056f77502f and 86327e8eb94 and make it clear
that kmem.limit_in_bytes is unsupported so failures or misbehavior
as a result of the limit being hit are likely not going to be
investigated or fixed.
3) reverting like in 2) but never inforce the limit (so basically nokmem
semantic)

Shakeel, Johannes, Roman, Muchun Song what do you think?
--
Michal Hocko
SUSE Labs

2023-09-20 16:03:32

[permalink] [raw]

Subject: Re: [REGRESSION] Re: [PATCH 6.1 033/219] memcg: drop kmem.limit_in_bytes

On 9/20/2023 1:07 PM, Michal Hocko wrote:
> On Wed 20-09-23 12:04:48, Jeremi Piotrowski wrote:
>> On 9/20/2023 10:43 AM, Michal Hocko wrote:
>>> On Wed 20-09-23 01:11:01, Jeremi Piotrowski wrote:
>>>> On Sun, Sep 17, 2023 at 09:12:40PM +0200, Greg Kroah-Hartman wrote:
>>>>> 6.1-stable review patch. If anyone has any objections, please let me know.
>>>>>
>>>>> ------------------
>>>>
>>>> Hi Greg/Michal,
>>>>
>>>> This commit breaks userspace which makes it a bad commit for mainline and an
>>>> even worse commit for stable.
>>>>
>>>> We ingested 6.1.54 into our nightly testing and found that runc fails to gather
>>>> cgroup statistics (when reading kmem.limit_in_bytes). The same code is vendored
>>>> into kubelet and kubelet fails to start if this operation fails. 6.1.53 is
>>>> fine.
>>>
>>> Could you expand some more on why is the file read? It doesn't support
>>> writing to it for some time so how does reading it helps in any sense?
>>>
>>> Anyway, I do agree that the stable backport should be reverted.
>>>
>>
>> This file is read together with all the other memcg files. Each prefix:
>>
>> memory
>> memory.memsw
>> memory.kmem
>> memory.kmem.tcp
>>
>> is combined with these suffixes
>>
>> .usage_in_bytes
>> .max_usage_in_bytes
>> .failcnt
>> .limit_in_bytes
>>
>> and read, the values are then forwarded on to other components for scheduling decisions.
>> You want to know the limit when checking the usage (is the usage close to the limit or not).
>
> You know there is no kmem limit as there is no way to set it for some
> time (since 5.16 - i.e. 2 years ago). I can see that users following old
> kernels could have missed that though.

I know what you mean, but I think this generally went unnoticed because the limit file is read
unconditionally, but only written when a kmem limit is explicitly requested for a specific
container, which is rarely (if ever) done.

Regarding following old kernels: a majority of kubernetes users are still on 5.15 and only slowly
started shifting to >=6.1 very recently (this summer). This is mostly driven by distro vendor
policies which tend to follow the pattern of "follow LTS kernels but don't switch to the next
LTS immediately".

I know this is far from ideal for reporting these kinds of issues, would love to report
them as soon as a kernel release happens.

>
>> Userspace tolerates MEMCG/MEMCG_KMEM being disabled, but having a single file out of the
>> set missing is an anomaly. So maybe we could keep the dummy file just for the
>> sake of consistency? Cgroupv1 is legacy after all.
>
> What we had was a dummy file. It didn't allow to write any value so it
> would have always reported unlimited. The reason I've decided to remove
> the file was that there were other users not being able to handle the
> write failure while they are just fine not having the file. So we are
> effectively between a rock and hard place here. Either way something is
> broken. The other SW got fixed as well but similar to your case it takes
> some time to absorb the change through all 3rd party users.
>
>>>>> Address this by wiping out the file completely and effectively get back to
>>>>> pre 4.5 era and CONFIG_MEMCG_KMEM=n configuration.
>>>>
>>>> On reads, the runc code checks for MEMCG_KMEM=n by checking
>>>> kmem.usage_in_bytes.
>
> Just one side note. Config options get renamed and their semantic
> changes over time so I would just recomment to never make any
> dependencies on any specific one.
>

Right, what i meant is the logic is this, with checking the "usage"
file to determine whether the controller is available:

value, err := fscommon.GetCgroupParamUint(path, usage)
if err != nil {
if name != "" && os.IsNotExist(err) {
// Ignore ENOENT as swap and kmem controllers
// are optional in the kernel.
return cgroups.MemoryData{}, nil
}
return cgroups.MemoryData{}, err
}

and if it is, then it proceeds to read "limit_in_bytes" and the others.

>>>> If it is present then runc expects the other cgroup files
>>>> to be there (including kmem.limit_in_bytes). So this change is not effectively
>>>> the same.
>>>>
>>>> Here's a link to the PR that would be needed to handle this change in userspace
>>>> (not merged yet and would need to be propagated through the ecosystem):
>>>>
>>>> https://github.com/opencontainers/runc/pull/4018.
>>>
>>> Thanks. Does that mean the revert is still necessary for the Linus tree
>>> or do you expect that the fix can be merged and propagated in a
>>> reasonable time?
>>>
>>
>> We can probably get runc and currently supported kubernetes versions patched in time
>> before 6.6 (or the next LTS kernel) hits LTS distros.
>>
>> But there's still a bunch of users running cgroupv1 with unsupported kubernetes
>> versions that are still taking kernel updates as they come, so this might get reported
>> again next year if it stays in mainline.
>
> I can see how 3rd party users are hard to get aligned but having a fix
> available should allow them to apply it or is there any actual roadblock
> for them to adapt as soon as they hit the issue?
>

The issue with this is that these users are running a frozen set of kubernetes (+runc)
binaries, but still pull kernel updates from the base distro. These kubernetes versions
are out of maintenance so the code will not get fixed and no one will release fixed
binaries.

> I mean, normally I would be just fine reverting this API change because
> it is disruptive but the only way to have the file available and not
> break somebody is to revert 58056f77502f ("memcg, kmem: further
> deprecate kmem.limit_in_bytes") as well. Or to ignore any value written
> there but that sounds rather dubious. Although one could argue this
> would mimic nokmem kernel option.
>

I just want to make sure we don't introduce yet another new behavior in this legacy
system. I have not seen breakage due to 58056f77502f. Mimicing nokmem sounds good but
does this mean "don't enforce limits" (that should be fine) or "ignore writes to the limit"
(=don't event store the written limit). The latter might have unintended consequences.

2023-09-20 19:15:24

[permalink] [raw]

Subject: Re: [REGRESSION] Re: [PATCH 6.1 033/219] memcg: drop kmem.limit_in_bytes

On 9/20/2023 11:25 AM, Greg Kroah-Hartman wrote:
> On Wed, Sep 20, 2023 at 10:43:56AM +0200, Michal Hocko wrote:
>> On Wed 20-09-23 01:11:01, Jeremi Piotrowski wrote:
>>> On Sun, Sep 17, 2023 at 09:12:40PM +0200, Greg Kroah-Hartman wrote:
>>>> 6.1-stable review patch. If anyone has any objections, please let me know.
>>>>
>>>> ------------------
>>>
>>> Hi Greg/Michal,
>>>
>>> This commit breaks userspace which makes it a bad commit for mainline and an
>>> even worse commit for stable.
>>>
>>> We ingested 6.1.54 into our nightly testing and found that runc fails to gather
>>> cgroup statistics (when reading kmem.limit_in_bytes). The same code is vendored
>>> into kubelet and kubelet fails to start if this operation fails. 6.1.53 is
>>> fine.
>>
>> Could you expand some more on why is the file read? It doesn't support
>> writing to it for some time so how does reading it helps in any sense?
>>
>> Anyway, I do agree that the stable backport should be reverted.
>
> That will just postpone the breakage, we really shouldn't break
> userspace.
>
> That being said, having userspace "break" because a file is no longer
> present is not good coding style on the userspace side at all. That's
> why we have sysfs and single-value-files now, if the file isn't present,
> then userspace instantly notices and can handle it. Much easier than
> the old-style multi-fields-in-one-file problem.
>

The memcg files in this case are single-value, but userspace expects to be able
to read memcg limits when it can read the usage (indicating MEMCG is enabled).
If it can't - then something is off, and the node is marked unhealthy.

>>>> Address this by wiping out the file completely and effectively get back to
>>>> pre 4.5 era and CONFIG_MEMCG_KMEM=n configuration.
>
> The fact that this is a valid option (i.e. no file) with that config
> option disabled makes me want to keep this as well, as how does
> userspace handle this option disabled at all? Or old kernels?
>

Userspace has had to handle the case of MEMCG_KMEM=n, but that had 2 cases so far:

limits/usage/max_usage/failcnt files are all available or none of them are available.

Now it needs to handle 3 of 4 files being available, but only for kmem (and not plain
memory, memsw or kmem.tcp). That's an inconsistency.

> I can drop this from stable kernels, but again, this feels like the runc
> developers are just postponing the problem...
>

Since cgroups v1 is deprecated, I think the runc developers haven't touched this part
of the code in years and expected it to keep working while they wait for the long tail
of usage to die out.

> thanks,
>
> greg k-h

2023-09-20 20:11:22

[permalink] [raw]

Subject: Re: [REGRESSION] Re: [PATCH 6.1 033/219] memcg: drop kmem.limit_in_bytes

On Wed, Sep 20, 2023 at 9:55 AM Michal Hocko <[email protected]> wrote:
>
> On Wed 20-09-23 08:32:42, Shakeel Butt wrote:
> > Also I don't think reverting 58056f77502f would give any benefit.
>
> Not reverting 58056f77502f would re-introduce the regression in some
> non-patched versions of Docker runtimes which cannot handle ENOTSUPP.
> So I think we need to revert both or none of them. I would prefer the
> later (option 1) as the fix is trivial but I do understand headache
> of chasing all those outdated deployments or vendor code forks.

I think that would be too much conservative an approach but I don't
have a strong opinion against it. Also just to be clear we are not
talking about full revert of 58056f77502f but just the returning of
EOPNOTSUPP, right?

2023-09-20 21:26:29

[permalink] [raw]

Subject: Re: [REGRESSION] Re: [PATCH 6.1 033/219] memcg: drop kmem.limit_in_bytes

On Wed 20-09-23 12:45:08, Greg KH wrote:
[...]
> Ok, then we should revert this, I'll go drop it in the stable trees, it
> should also be reverted in Linus's tree too.

A simple revert would break other users as noted in other response so
wait with sending reverts to Linus before we agreen on the least painful
solution.

--
Michal Hocko
SUSE Labs

2023-09-20 23:06:21

[permalink] [raw]

Subject: Re: [REGRESSION] Re: [PATCH 6.1 033/219] memcg: drop kmem.limit_in_bytes

On Wed 20-09-23 08:32:42, Shakeel Butt wrote:
> Also I don't think reverting 58056f77502f would give any benefit.

Not reverting 58056f77502f would re-introduce the regression in some
non-patched versions of Docker runtimes which cannot handle ENOTSUPP.
So I think we need to revert both or none of them. I would prefer the
later (option 1) as the fix is trivial but I do understand headache
of chasing all those outdated deployments or vendor code forks.
--
Michal Hocko
SUSE Labs

2023-09-20 23:53:37

[permalink] [raw]

Subject: Re: [REGRESSION] Re: [PATCH 6.1 033/219] memcg: drop kmem.limit_in_bytes

On Wed, Sep 20, 2023 at 1:08 PM Michal Hocko <[email protected]> wrote:
>
[...]
> > have a strong opinion against it. Also just to be clear we are not
> > talking about full revert of 58056f77502f but just the returning of
> > EOPNOTSUPP, right?
>
> If we allow the limit to be set without returning a failure then we
> still have options 2 and 3 on how to deal with that. One of them is to
> enforce the limit.
>

Option 3 is a partial revert of 58056f77502f where we keep the no
limit enforcement and remove the EOPNOTSUPP return on write. Let's go
with option 3. In addition, let's add pr_warn_once on the read of
kmem.limit_in_bytes as well.

2023-09-21 00:22:14

[permalink] [raw]

Subject: Re: [REGRESSION] Re: [PATCH 6.1 033/219] memcg: drop kmem.limit_in_bytes

On Wed, Sep 20, 2023 at 6:47 AM Michal Hocko <[email protected]> wrote:
>
> On Wed 20-09-23 15:25:23, Jeremi Piotrowski wrote:
> > On 9/20/2023 1:07 PM, Michal Hocko wrote:
> [...]
> > > I mean, normally I would be just fine reverting this API change because
> > > it is disruptive but the only way to have the file available and not
> > > break somebody is to revert 58056f77502f ("memcg, kmem: further
> > > deprecate kmem.limit_in_bytes") as well. Or to ignore any value written
> > > there but that sounds rather dubious. Although one could argue this
> > > would mimic nokmem kernel option.
> > >
> >
> > I just want to make sure we don't introduce yet another new behavior in this legacy
> > system. I have not seen breakage due to 58056f77502f. Mimicing nokmem sounds good but
> > does this mean "don't enforce limits" (that should be fine) or "ignore writes to the limit"
> > (=don't event store the written limit). The latter might have unintended consequences.
>
> Yes it would mean that the limit is never enforced. Bad as it is the
> thing is that the hard limit on kernel memory is broken by design and
> unfixable. This causes all sorts of unexpected kernel allocation
> failures that this is simply unsafe to use.
>
> All that being said I can see the following options
> 1) keep the current upstream status and not export the file
> 2) revert both 58056f77502f and 86327e8eb94 and make it clear
> that kmem.limit_in_bytes is unsupported so failures or misbehavior
> as a result of the limit being hit are likely not going to be
> investigated or fixed.
> 3) reverting like in 2) but never inforce the limit (so basically nokmem
> semantic)
>
> Shakeel, Johannes, Roman, Muchun Song what do you think?

I think the safe option would be to revert 86327e8eb94 for now and put
pr_warn_once even for the read of kmem.limit_in_bytes? We can retry
86327e8eb94 in a year or so.

However personally I would prefer option 1. Also I don't think
reverting 58056f77502f would give any benefit.

2023-09-21 06:35:32

[permalink] [raw]

Subject: Re: [REGRESSION] Re: [PATCH 6.1 033/219] memcg: drop kmem.limit_in_bytes

On Wed 20-09-23 12:46:23, Shakeel Butt wrote:
> On Wed, Sep 20, 2023 at 9:55 AM Michal Hocko <[email protected]> wrote:
> >
> > On Wed 20-09-23 08:32:42, Shakeel Butt wrote:
> > > Also I don't think reverting 58056f77502f would give any benefit.
> >
> > Not reverting 58056f77502f would re-introduce the regression in some
> > non-patched versions of Docker runtimes which cannot handle ENOTSUPP.
> > So I think we need to revert both or none of them. I would prefer the
> > later (option 1) as the fix is trivial but I do understand headache
> > of chasing all those outdated deployments or vendor code forks.
>
> I think that would be too much conservative an approach but I don't

Well, TBH I do not really see any sifference between one set of broken
userspace or the other. Both are making assumptions on our interfaces
and they do not overlap unfortunately.

> have a strong opinion against it. Also just to be clear we are not
> talking about full revert of 58056f77502f but just the returning of
> EOPNOTSUPP, right?

If we allow the limit to be set without returning a failure then we
still have options 2 and 3 on how to deal with that. One of them is to
enforce the limit.

--
Michal Hocko
SUSE Labs

2023-09-21 18:00:29

[permalink] [raw]

Subject: Re: [REGRESSION] Re: [PATCH 6.1 033/219] memcg: drop kmem.limit_in_bytes

On Thu, Sep 21, 2023 at 4:21 AM Michal Hocko <[email protected]> wrote:
>
> On Thu 21-09-23 12:43:05, Jeremi Piotrowski wrote:
> > On 9/21/2023 9:52 AM, Michal Hocko wrote:
> > > On Wed 20-09-23 14:46:52, Shakeel Butt wrote:
> > >> On Wed, Sep 20, 2023 at 1:08 PM Michal Hocko <[email protected]> wrote:
> > >>>
> > >> [...]
> > >>>> have a strong opinion against it. Also just to be clear we are not
> > >>>> talking about full revert of 58056f77502f but just the returning of
> > >>>> EOPNOTSUPP, right?
> > >>>
> > >>> If we allow the limit to be set without returning a failure then we
> > >>> still have options 2 and 3 on how to deal with that. One of them is to
> > >>> enforce the limit.
> > >>>
> > >>
> > >> Option 3 is a partial revert of 58056f77502f where we keep the no
> > >> limit enforcement and remove the EOPNOTSUPP return on write. Let's go
> > >> with option 3. In addition, let's add pr_warn_once on the read of
> > >> kmem.limit_in_bytes as well.
> > >
> > > How about this?
> > > ---
> >
> > I'm OK with this approach. You're missing this in the patch below:
> >
> > // static struct cftype mem_cgroup_legacy_files[] = {
> >
> > + {
> > + .name = "kmem.limit_in_bytes",
> > + .private = MEMFILE_PRIVATE(_KMEM, RES_LIMIT),
> > + .write = mem_cgroup_write,
> > + .read_u64 = mem_cgroup_read_u64,
> > + },
>
> Of course. I've lost the hunk while massaging the revert. Thanks for
> spotting. Updated version below. Btw. I've decided to not pr_{warn,info}
> on the read side because realistically I do not think this will help all
> that much. I am worried we will get stuck with this for ever because
> there always be somebody stuck on unpatched userspace.
> ---
> From bb6702b698efd31f3f90f4f1dd36ffe223397bec Mon Sep 17 00:00:00 2001
> From: Michal Hocko <[email protected]>
> Date: Thu, 21 Sep 2023 09:38:29 +0200
> Subject: [PATCH] mm, memcg: reconsider kmem.limit_in_bytes deprecation
>
> This reverts commits 86327e8eb94c ("memcg: drop kmem.limit_in_bytes")
> and partially reverts 58056f77502f ("memcg, kmem: further deprecate
> kmem.limit_in_bytes") which have incrementally removed support for the
> kernel memory accounting hard limit. Unfortunately it has turned out
> that there is still userspace depending on the existence of
> memory.kmem.limit_in_bytes [1]. The underlying functionality is not
> really required but the non-existent file just confuses the userspace
> which fails in the result. The patch to fix this on the userspace side
> has been submitted but it is hard to predict how it will propagate
> through the maze of 3rd party consumers of the software.
>
> Now, reverting alone 86327e8eb94c is not an option because there is
> another set of userspace which cannot cope with ENOTSUPP returned when
> writing to the file. Therefore we have to go and revisit 58056f77502f
> as well. There are two ways to go ahead. Either we give up on the
> deprecation and fully revert 58056f77502f as well or we can keep
> kmem.limit_in_bytes but make the write a noop and warn about the fact.
> This should work for both known breaking workloads which depend on the
> existence but do not depend on the hard limit enforcement.
>
> [1] http://lkml.kernel.org/r/20230920081101.GA12096@linuxonhyperv3.guj3yctzbm1etfxqx2vob5hsef.xx.internal.cloudapp.net
> Fixes: 86327e8eb94c ("memcg: drop kmem.limit_in_bytes")
> Fixes: 58056f77502f ("memcg, kmem: further deprecate kmem.limit_in_bytes")
> Signed-off-by: Michal Hocko <[email protected]>

With one request below:

Acked-by: Shakeel Butt <[email protected]>

> ---
> Documentation/admin-guide/cgroup-v1/memory.rst | 7 +++++++
> mm/memcontrol.c | 18 ++++++++++++++++++
> 2 files changed, 25 insertions(+)
>
> diff --git a/Documentation/admin-guide/cgroup-v1/memory.rst b/Documentation/admin-guide/cgroup-v1/memory.rst
> index 5f502bf68fbc..ff456871bf4b 100644
> --- a/Documentation/admin-guide/cgroup-v1/memory.rst
> +++ b/Documentation/admin-guide/cgroup-v1/memory.rst
> @@ -92,6 +92,13 @@ Brief summary of control files.
> memory.oom_control set/show oom controls.
> memory.numa_stat show the number of memory usage per numa
> node
> + memory.kmem.limit_in_bytes Deprecated knob to set and read the kernel
> + memory hard limit. Kernel hard limit is not
> + supported since 5.16. Writing any value to
> + do file will not have any effect same as if
> + nokmem kernel parameter was specified.
> + Kernel memory is still charged and reported
> + by memory.kmem.usage_in_bytes.
> memory.kmem.usage_in_bytes show current kernel memory allocation
> memory.kmem.failcnt show the number of kernel memory usage
> hits limits
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index a4d3282493b6..0b161705ef36 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -3097,6 +3097,7 @@ static void obj_cgroup_uncharge_pages(struct obj_cgroup *objcg,
> static int obj_cgroup_charge_pages(struct obj_cgroup *objcg, gfp_t gfp,
> unsigned int nr_pages)
> {
> + struct page_counter *counter;
> struct mem_cgroup *memcg;
> int ret;
>
> @@ -3107,6 +3108,10 @@ static int obj_cgroup_charge_pages(struct obj_cgroup *objcg, gfp_t gfp,
> goto out;
>
> memcg_account_kmem(memcg, nr_pages);
> +
> + /* There is no way to set up kmem hard limit so this operation cannot fail */
> + if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
> + WARN_ON(!page_counter_try_charge(&memcg->kmem, nr_pages, &counter));

WARN_ON_ONCE() please.

2023-09-21 21:19:29

[permalink] [raw]

Subject: Re: [REGRESSION] Re: [PATCH 6.1 033/219] memcg: drop kmem.limit_in_bytes

On 9/21/2023 9:52 AM, Michal Hocko wrote:
> On Wed 20-09-23 14:46:52, Shakeel Butt wrote:
>> On Wed, Sep 20, 2023 at 1:08 PM Michal Hocko <[email protected]> wrote:
>>>
>> [...]
>>>> have a strong opinion against it. Also just to be clear we are not
>>>> talking about full revert of 58056f77502f but just the returning of
>>>> EOPNOTSUPP, right?
>>>
>>> If we allow the limit to be set without returning a failure then we
>>> still have options 2 and 3 on how to deal with that. One of them is to
>>> enforce the limit.
>>>
>>
>> Option 3 is a partial revert of 58056f77502f where we keep the no
>> limit enforcement and remove the EOPNOTSUPP return on write. Let's go
>> with option 3. In addition, let's add pr_warn_once on the read of
>> kmem.limit_in_bytes as well.
>
> How about this?
> ---

I'm OK with this approach. You're missing this in the patch below:

// static struct cftype mem_cgroup_legacy_files[] = {

+ {
+ .name = "kmem.limit_in_bytes",
+ .private = MEMFILE_PRIVATE(_KMEM, RES_LIMIT),
+ .write = mem_cgroup_write,
+ .read_u64 = mem_cgroup_read_u64,
+ },

Thanks,
Jeremi

>>From 81ae0797d8da1b9cfbf357b4be4787a5bbf46bb4 Mon Sep 17 00:00:00 2001
> From: Michal Hocko <[email protected]>
> Date: Thu, 21 Sep 2023 09:38:29 +0200
> Subject: [PATCH] mm, memcg: reconsider kmem.limit_in_bytes deprecation
>
> This reverts commits 86327e8eb94c ("memcg: drop kmem.limit_in_bytes")
> and partially reverts 58056f77502f ("memcg, kmem: further deprecate
> kmem.limit_in_bytes") which have incrementally removed support for the
> kernel memory accounting hard limit. Unfortunately it has turned out
> that there is still userspace depending on the existence of
> memory.kmem.limit_in_bytes [1]. The underlying functionality is not
> really required but the non-existent file just confuses the userspace
> which fails in the result. The patch to fix this on the userspace side
> has been submitted but it is hard to predict how it will propagate
> through the maze of 3rd party consumers of the software.
>
> Now, reverting alone 86327e8eb94c is not an option because there is
> another set of userspace which cannot cope with ENOTSUPP returned when
> writing to the file. Therefore we have to go and revisit 58056f77502f
> as well. There are two ways to go ahead. Either we give up on the
> deprecation and fully revert 58056f77502f as well or we can keep
> kmem.limit_in_bytes but make the write a noop and warn about the fact.
> This should work for both known breaking workloads which depend on the
> existence but do not depend on the hard limit enforcement.
>
> [1] http://lkml.kernel.org/r/20230920081101.GA12096@linuxonhyperv3.guj3yctzbm1etfxqx2vob5hsef.xx.internal.cloudapp.net
> Fixes: 86327e8eb94c ("memcg: drop kmem.limit_in_bytes")
> Fixes: 58056f77502f ("memcg, kmem: further deprecate kmem.limit_in_bytes")
> Signed-off-by: Michal Hocko <[email protected]>
> ---
> Documentation/admin-guide/cgroup-v1/memory.rst | 7 +++++++
> mm/memcontrol.c | 12 ++++++++++++
> 2 files changed, 19 insertions(+)
>
> diff --git a/Documentation/admin-guide/cgroup-v1/memory.rst b/Documentation/admin-guide/cgroup-v1/memory.rst
> index 5f502bf68fbc..ff456871bf4b 100644
> --- a/Documentation/admin-guide/cgroup-v1/memory.rst
> +++ b/Documentation/admin-guide/cgroup-v1/memory.rst
> @@ -92,6 +92,13 @@ Brief summary of control files.
> memory.oom_control set/show oom controls.
> memory.numa_stat show the number of memory usage per numa
> node
> + memory.kmem.limit_in_bytes Deprecated knob to set and read the kernel
> + memory hard limit. Kernel hard limit is not
> + supported since 5.16. Writing any value to
> + do file will not have any effect same as if
> + nokmem kernel parameter was specified.
> + Kernel memory is still charged and reported
> + by memory.kmem.usage_in_bytes.
> memory.kmem.usage_in_bytes show current kernel memory allocation
> memory.kmem.failcnt show the number of kernel memory usage
> hits limits
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index a4d3282493b6..ac7f14b2338d 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -3097,6 +3097,7 @@ static void obj_cgroup_uncharge_pages(struct obj_cgroup *objcg,
> static int obj_cgroup_charge_pages(struct obj_cgroup *objcg, gfp_t gfp,
> unsigned int nr_pages)
> {
> + struct page_counter *counter;
> struct mem_cgroup *memcg;
> int ret;
>
> @@ -3107,6 +3108,10 @@ static int obj_cgroup_charge_pages(struct obj_cgroup *objcg, gfp_t gfp,
> goto out;
>
> memcg_account_kmem(memcg, nr_pages);
> +
> + /* There is no way to set up kmem hard limit so this operation cannot fail */
> + if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
> + WARN_ON(!page_counter_try_charge(&memcg->kmem, nr_pages, &counter));
> out:
> css_put(&memcg->css);
>
> @@ -3867,6 +3872,13 @@ static ssize_t mem_cgroup_write(struct kernfs_open_file *of,
> case _MEMSWAP:
> ret = mem_cgroup_resize_max(memcg, nr_pages, true);
> break;
> + case _KMEM:
> + pr_warn_once("kmem.limit_in_bytes is deprecated and will be removed. "
> + "Writing any value to this file has no effect. "
> + "Please report your usecase to [email protected] if you "
> + "depend on this functionality.\n");
> + ret = 0;
> + break;
> case _TCP:
> ret = memcg_update_tcp_max(memcg, nr_pages);
> break;

2023-09-22 02:04:31

[permalink] [raw]

Subject: Re: [REGRESSION] Re: [PATCH 6.1 033/219] memcg: drop kmem.limit_in_bytes

On Wed 20-09-23 14:46:52, Shakeel Butt wrote:
> On Wed, Sep 20, 2023 at 1:08 PM Michal Hocko <[email protected]> wrote:
> >
> [...]
> > > have a strong opinion against it. Also just to be clear we are not
> > > talking about full revert of 58056f77502f but just the returning of
> > > EOPNOTSUPP, right?
> >
> > If we allow the limit to be set without returning a failure then we
> > still have options 2 and 3 on how to deal with that. One of them is to
> > enforce the limit.
> >
>
> Option 3 is a partial revert of 58056f77502f where we keep the no
> limit enforcement and remove the EOPNOTSUPP return on write. Let's go
> with option 3. In addition, let's add pr_warn_once on the read of
> kmem.limit_in_bytes as well.

How about this?
---
From 81ae0797d8da1b9cfbf357b4be4787a5bbf46bb4 Mon Sep 17 00:00:00 2001
From: Michal Hocko <[email protected]>
Date: Thu, 21 Sep 2023 09:38:29 +0200
Subject: [PATCH] mm, memcg: reconsider kmem.limit_in_bytes deprecation

This reverts commits 86327e8eb94c ("memcg: drop kmem.limit_in_bytes")
and partially reverts 58056f77502f ("memcg, kmem: further deprecate
kmem.limit_in_bytes") which have incrementally removed support for the
kernel memory accounting hard limit. Unfortunately it has turned out
that there is still userspace depending on the existence of
memory.kmem.limit_in_bytes [1]. The underlying functionality is not
really required but the non-existent file just confuses the userspace
which fails in the result. The patch to fix this on the userspace side
has been submitted but it is hard to predict how it will propagate
through the maze of 3rd party consumers of the software.

Now, reverting alone 86327e8eb94c is not an option because there is
another set of userspace which cannot cope with ENOTSUPP returned when
writing to the file. Therefore we have to go and revisit 58056f77502f
as well. There are two ways to go ahead. Either we give up on the
deprecation and fully revert 58056f77502f as well or we can keep
kmem.limit_in_bytes but make the write a noop and warn about the fact.
This should work for both known breaking workloads which depend on the
existence but do not depend on the hard limit enforcement.

[1] http://lkml.kernel.org/r/20230920081101.GA12096@linuxonhyperv3.guj3yctzbm1etfxqx2vob5hsef.xx.internal.cloudapp.net
Fixes: 86327e8eb94c ("memcg: drop kmem.limit_in_bytes")
Fixes: 58056f77502f ("memcg, kmem: further deprecate kmem.limit_in_bytes")
Signed-off-by: Michal Hocko <[email protected]>
---
Documentation/admin-guide/cgroup-v1/memory.rst | 7 +++++++
mm/memcontrol.c | 12 ++++++++++++
2 files changed, 19 insertions(+)

diff --git a/Documentation/admin-guide/cgroup-v1/memory.rst b/Documentation/admin-guide/cgroup-v1/memory.rst
index 5f502bf68fbc..ff456871bf4b 100644
--- a/Documentation/admin-guide/cgroup-v1/memory.rst
+++ b/Documentation/admin-guide/cgroup-v1/memory.rst
@@ -92,6 +92,13 @@ Brief summary of control files.
memory.oom_control set/show oom controls.
memory.numa_stat show the number of memory usage per numa
node
+ memory.kmem.limit_in_bytes Deprecated knob to set and read the kernel
+ memory hard limit. Kernel hard limit is not
+ supported since 5.16. Writing any value to
+ do file will not have any effect same as if
+ nokmem kernel parameter was specified.
+ Kernel memory is still charged and reported
+ by memory.kmem.usage_in_bytes.
memory.kmem.usage_in_bytes show current kernel memory allocation
memory.kmem.failcnt show the number of kernel memory usage
hits limits
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index a4d3282493b6..ac7f14b2338d 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3097,6 +3097,7 @@ static void obj_cgroup_uncharge_pages(struct obj_cgroup *objcg,
static int obj_cgroup_charge_pages(struct obj_cgroup *objcg, gfp_t gfp,
unsigned int nr_pages)
{
+ struct page_counter *counter;
struct mem_cgroup *memcg;
int ret;

@@ -3107,6 +3108,10 @@ static int obj_cgroup_charge_pages(struct obj_cgroup *objcg, gfp_t gfp,
goto out;

memcg_account_kmem(memcg, nr_pages);
+
+ /* There is no way to set up kmem hard limit so this operation cannot fail */
+ if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
+ WARN_ON(!page_counter_try_charge(&memcg->kmem, nr_pages, &counter));
out:
css_put(&memcg->css);

@@ -3867,6 +3872,13 @@ static ssize_t mem_cgroup_write(struct kernfs_open_file *of,
case _MEMSWAP:
ret = mem_cgroup_resize_max(memcg, nr_pages, true);
break;
+ case _KMEM:
+ pr_warn_once("kmem.limit_in_bytes is deprecated and will be removed. "
+ "Writing any value to this file has no effect. "
+ "Please report your usecase to [email protected] if you "
+ "depend on this functionality.\n");
+ ret = 0;
+ break;
case _TCP:
ret = memcg_update_tcp_max(memcg, nr_pages);
break;
--
2.30.2

--
Michal Hocko
SUSE Labs

2023-09-22 05:34:07

by Linux regression tracking (Thorsten Leemhuis)

[permalink] [raw]

Subject: Re: [REGRESSION] Re: [PATCH 6.1 033/219] memcg: drop kmem.limit_in_bytes

On Thu 21-09-23 10:25:11, Shakeel Butt wrote:
> On Thu, Sep 21, 2023 at 4:21 AM Michal Hocko <[email protected]> wrote:
[...]
> With one request below:
>
> Acked-by: Shakeel Butt <[email protected]>

Thanks.

> > @@ -3107,6 +3108,10 @@ static int obj_cgroup_charge_pages(struct obj_cgroup *objcg, gfp_t gfp,
> > goto out;
> >
> > memcg_account_kmem(memcg, nr_pages);
> > +
> > + /* There is no way to set up kmem hard limit so this operation cannot fail */
> > + if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
> > + WARN_ON(!page_counter_try_charge(&memcg->kmem, nr_pages, &counter));
>
> WARN_ON_ONCE() please.

Sure. This shouldn't really trigger, but it is true that if something
unexpected happens then it is likly to flood the log so _ONCE is safer.

I will wait for others to comment before I send the official patch.
To be completely honest I am not super happy about this way of handling
stuff, but considering the level of brokenness this seems like the
safest option. Especially when nobody really want to use the kernel
memory hard limit AFAIU.
--
Michal Hocko
SUSE Labs

2023-09-22 11:24:39

[permalink] [raw]

Subject: Re: [REGRESSION] Re: [PATCH 6.1 033/219] memcg: drop kmem.limit_in_bytes

[TLDR: I'm adding this report to the list of tracked Linux kernel
regressions; the text you find below is based on a few templates
paragraphs you might have encountered already in similar form.
See link in footer if these mails annoy you.]

On 20.09.23 10:11, Jeremi Piotrowski wrote:
> On Sun, Sep 17, 2023 at 09:12:40PM +0200, Greg Kroah-Hartman wrote:
>> 6.1-stable review patch. If anyone has any objections, please let me know.
>>
>> ------------------
>
> Hi Greg/Michal,
>
> This commit breaks userspace which makes it a bad commit for mainline and an
> even worse commit for stable.
>
> We ingested 6.1.54 into our nightly testing and found that runc fails to gather
> cgroup statistics (when reading kmem.limit_in_bytes). The same code is vendored
> into kubelet and kubelet fails to start if this operation fails. 6.1.53 is
> fine.
>
>> Address this by wiping out the file completely and effectively get back to
>> pre 4.5 era and CONFIG_MEMCG_KMEM=n configuration.
>
> On reads, the runc code checks for MEMCG_KMEM=n by checking
> kmem.usage_in_bytes. If it is present then runc expects the other cgroup files
> to be there (including kmem.limit_in_bytes). So this change is not effectively
> the same.
>
> Here's a link to the PR that would be needed to handle this change in userspace
> (not merged yet and would need to be propagated through the ecosystem):
>
> https://github.com/opencontainers/runc/pull/4018.

Thanks for the report. To be sure the issue doesn't fall through the
cracks unnoticed, I'm adding it to regzbot, the Linux kernel regression
tracking bot:

#regzbot ^introduced 86327e8eb94c52
#regzbot title mm, memcg: runc fails to gather cgroup statistics
#regzbot fix: mm, memcg: reconsider kmem.limit_in_bytes deprecation
#regzbot ignore-activity

FWIW, the porposed fix can be found here:
https://lore.kernel.org/all/[email protected]/

This isn't a regression? This issue or a fix for it are already
discussed somewhere else? It was fixed already? You want to clarify when
the regression started to happen? Or point out I got the title or
something else totally wrong? Then just reply and tell me -- ideally
while also telling regzbot about it, as explained by the page listed in
the footer of this mail.

Developers: When fixing the issue, remember to add 'Link:' tags pointing
to the report (the parent of this mail). See page linked in footer for
details.

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
That page also explains what to do if mails like this annoy you.

2023-09-22 14:53:53

by Johannes Weiner

[permalink] [raw]

Subject: Re: [REGRESSION] Re: [PATCH 6.1 033/219] memcg: drop kmem.limit_in_bytes

On Thu, Sep 21, 2023 at 01:21:54PM +0200, Michal Hocko wrote:
> @@ -3097,6 +3097,7 @@ static void obj_cgroup_uncharge_pages(struct obj_cgroup *objcg,
> static int obj_cgroup_charge_pages(struct obj_cgroup *objcg, gfp_t gfp,
> unsigned int nr_pages)
> {
> + struct page_counter *counter;
> struct mem_cgroup *memcg;
> int ret;
>
> @@ -3107,6 +3108,10 @@ static int obj_cgroup_charge_pages(struct obj_cgroup *objcg, gfp_t gfp,
> goto out;
>
> memcg_account_kmem(memcg, nr_pages);
> +
> + /* There is no way to set up kmem hard limit so this operation cannot fail */
> + if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
> + WARN_ON(!page_counter_try_charge(&memcg->kmem, nr_pages, &counter));

This hunk doesn't look quite right.

static void memcg_account_kmem(struct mem_cgroup *memcg, int nr_pages)
{
mod_memcg_state(memcg, MEMCG_KMEM, nr_pages);
if (!cgroup_subsys_on_dfl(memory_cgrp_subsys)) {
if (nr_pages > 0)
page_counter_charge(&memcg->kmem, nr_pages);
else
page_counter_uncharge(&memcg->kmem, -nr_pages);
}
}

Other than that, please add

Acked-by: Johannes Weiner <[email protected]>

2023-09-22 21:07:55

[permalink] [raw]

Subject: Re: [REGRESSION] Re: [PATCH 6.1 033/219] memcg: drop kmem.limit_in_bytes

On Thu 21-09-23 12:43:05, Jeremi Piotrowski wrote:
> On 9/21/2023 9:52 AM, Michal Hocko wrote:
> > On Wed 20-09-23 14:46:52, Shakeel Butt wrote:
> >> On Wed, Sep 20, 2023 at 1:08 PM Michal Hocko <[email protected]> wrote:
> >>>
> >> [...]
> >>>> have a strong opinion against it. Also just to be clear we are not
> >>>> talking about full revert of 58056f77502f but just the returning of
> >>>> EOPNOTSUPP, right?
> >>>
> >>> If we allow the limit to be set without returning a failure then we
> >>> still have options 2 and 3 on how to deal with that. One of them is to
> >>> enforce the limit.
> >>>
> >>
> >> Option 3 is a partial revert of 58056f77502f where we keep the no
> >> limit enforcement and remove the EOPNOTSUPP return on write. Let's go
> >> with option 3. In addition, let's add pr_warn_once on the read of
> >> kmem.limit_in_bytes as well.
> >
> > How about this?
> > ---
>
> I'm OK with this approach. You're missing this in the patch below:
>
> // static struct cftype mem_cgroup_legacy_files[] = {
>
> + {
> + .name = "kmem.limit_in_bytes",
> + .private = MEMFILE_PRIVATE(_KMEM, RES_LIMIT),
> + .write = mem_cgroup_write,
> + .read_u64 = mem_cgroup_read_u64,
> + },

Of course. I've lost the hunk while massaging the revert. Thanks for
spotting. Updated version below. Btw. I've decided to not pr_{warn,info}
on the read side because realistically I do not think this will help all
that much. I am worried we will get stuck with this for ever because
there always be somebody stuck on unpatched userspace.
---
From bb6702b698efd31f3f90f4f1dd36ffe223397bec Mon Sep 17 00:00:00 2001
From: Michal Hocko <[email protected]>
Date: Thu, 21 Sep 2023 09:38:29 +0200
Subject: [PATCH] mm, memcg: reconsider kmem.limit_in_bytes deprecation

This reverts commits 86327e8eb94c ("memcg: drop kmem.limit_in_bytes")
and partially reverts 58056f77502f ("memcg, kmem: further deprecate
kmem.limit_in_bytes") which have incrementally removed support for the
kernel memory accounting hard limit. Unfortunately it has turned out
that there is still userspace depending on the existence of
memory.kmem.limit_in_bytes [1]. The underlying functionality is not
really required but the non-existent file just confuses the userspace
which fails in the result. The patch to fix this on the userspace side
has been submitted but it is hard to predict how it will propagate
through the maze of 3rd party consumers of the software.

Now, reverting alone 86327e8eb94c is not an option because there is
another set of userspace which cannot cope with ENOTSUPP returned when
writing to the file. Therefore we have to go and revisit 58056f77502f
as well. There are two ways to go ahead. Either we give up on the
deprecation and fully revert 58056f77502f as well or we can keep
kmem.limit_in_bytes but make the write a noop and warn about the fact.
This should work for both known breaking workloads which depend on the
existence but do not depend on the hard limit enforcement.

[1] http://lkml.kernel.org/r/20230920081101.GA12096@linuxonhyperv3.guj3yctzbm1etfxqx2vob5hsef.xx.internal.cloudapp.net
Fixes: 86327e8eb94c ("memcg: drop kmem.limit_in_bytes")
Fixes: 58056f77502f ("memcg, kmem: further deprecate kmem.limit_in_bytes")
Signed-off-by: Michal Hocko <[email protected]>
---
Documentation/admin-guide/cgroup-v1/memory.rst | 7 +++++++
mm/memcontrol.c | 18 ++++++++++++++++++
2 files changed, 25 insertions(+)

diff --git a/Documentation/admin-guide/cgroup-v1/memory.rst b/Documentation/admin-guide/cgroup-v1/memory.rst
index 5f502bf68fbc..ff456871bf4b 100644
--- a/Documentation/admin-guide/cgroup-v1/memory.rst
+++ b/Documentation/admin-guide/cgroup-v1/memory.rst
@@ -92,6 +92,13 @@ Brief summary of control files.
memory.oom_control set/show oom controls.
memory.numa_stat show the number of memory usage per numa
node
+ memory.kmem.limit_in_bytes Deprecated knob to set and read the kernel
+ memory hard limit. Kernel hard limit is not
+ supported since 5.16. Writing any value to
+ do file will not have any effect same as if
+ nokmem kernel parameter was specified.
+ Kernel memory is still charged and reported
+ by memory.kmem.usage_in_bytes.
memory.kmem.usage_in_bytes show current kernel memory allocation
memory.kmem.failcnt show the number of kernel memory usage
hits limits
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index a4d3282493b6..0b161705ef36 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3097,6 +3097,7 @@ static void obj_cgroup_uncharge_pages(struct obj_cgroup *objcg,
static int obj_cgroup_charge_pages(struct obj_cgroup *objcg, gfp_t gfp,
unsigned int nr_pages)
{
+ struct page_counter *counter;
struct mem_cgroup *memcg;
int ret;

@@ -3107,6 +3108,10 @@ static int obj_cgroup_charge_pages(struct obj_cgroup *objcg, gfp_t gfp,
goto out;

memcg_account_kmem(memcg, nr_pages);
+
+ /* There is no way to set up kmem hard limit so this operation cannot fail */
+ if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
+ WARN_ON(!page_counter_try_charge(&memcg->kmem, nr_pages, &counter));
out:
css_put(&memcg->css);

@@ -3867,6 +3872,13 @@ static ssize_t mem_cgroup_write(struct kernfs_open_file *of,
case _MEMSWAP:
ret = mem_cgroup_resize_max(memcg, nr_pages, true);
break;
+ case _KMEM:
+ pr_warn_once("kmem.limit_in_bytes is deprecated and will be removed. "
+ "Writing any value to this file has no effect. "
+ "Please report your usecase to [email protected] if you "
+ "depend on this functionality.\n");
+ ret = 0;
+ break;
case _TCP:
ret = memcg_update_tcp_max(memcg, nr_pages);
break;
@@ -5077,6 +5089,12 @@ static struct cftype mem_cgroup_legacy_files[] = {
.seq_show = memcg_numa_stat_show,
},
#endif
+ {
+ .name = "kmem.limit_in_bytes",
+ .private = MEMFILE_PRIVATE(_KMEM, RES_LIMIT),
+ .write = mem_cgroup_write,
+ .read_u64 = mem_cgroup_read_u64,
+ },
{
.name = "kmem.usage_in_bytes",
.private = MEMFILE_PRIVATE(_KMEM, RES_USAGE),
--
2.30.2

--
Michal Hocko
SUSE Labs

2023-09-23 07:22:35

by Roman Gushchin

[permalink] [raw]

Subject: Re: [REGRESSION] Re: [PATCH 6.1 033/219] memcg: drop kmem.limit_in_bytes

On Wed, Sep 20, 2023 at 03:47:37PM +0200, Michal Hocko wrote:
> On Wed 20-09-23 15:25:23, Jeremi Piotrowski wrote:
> > On 9/20/2023 1:07 PM, Michal Hocko wrote:
> [...]
> > > I mean, normally I would be just fine reverting this API change because
> > > it is disruptive but the only way to have the file available and not
> > > break somebody is to revert 58056f77502f ("memcg, kmem: further
> > > deprecate kmem.limit_in_bytes") as well. Or to ignore any value written
> > > there but that sounds rather dubious. Although one could argue this
> > > would mimic nokmem kernel option.
> > >
> >
> > I just want to make sure we don't introduce yet another new behavior in this legacy
> > system. I have not seen breakage due to 58056f77502f. Mimicing nokmem sounds good but
> > does this mean "don't enforce limits" (that should be fine) or "ignore writes to the limit"
> > (=don't event store the written limit). The latter might have unintended consequences.
>
> Yes it would mean that the limit is never enforced. Bad as it is the
> thing is that the hard limit on kernel memory is broken by design and
> unfixable. This causes all sorts of unexpected kernel allocation
> failures that this is simply unsafe to use.
>
> All that being said I can see the following options
> 1) keep the current upstream status and not export the file
> 2) revert both 58056f77502f and 86327e8eb94 and make it clear
> that kmem.limit_in_bytes is unsupported so failures or misbehavior
> as a result of the limit being hit are likely not going to be
> investigated or fixed.
> 3) reverting like in 2) but never inforce the limit (so basically nokmem
> semantic)

Since it's a part of cgroup v1 interface, which is in a frozen state as a whole,
and there is no significant (performance, code complexity) benefit of
additionally deprecating kmem.limit_in_bytes, I vote for 2).
1) is also an option.

Thanks!

2023-09-25 16:35:09

[permalink] [raw]

Subject: Re: [REGRESSION] Re: [PATCH 6.1 033/219] memcg: drop kmem.limit_in_bytes

On Fri 22-09-23 09:30:17, Johannes Weiner wrote:
> On Thu, Sep 21, 2023 at 01:21:54PM +0200, Michal Hocko wrote:
> > @@ -3097,6 +3097,7 @@ static void obj_cgroup_uncharge_pages(struct obj_cgroup *objcg,
> > static int obj_cgroup_charge_pages(struct obj_cgroup *objcg, gfp_t gfp,
> > unsigned int nr_pages)
> > {
> > + struct page_counter *counter;
> > struct mem_cgroup *memcg;
> > int ret;
> >
> > @@ -3107,6 +3108,10 @@ static int obj_cgroup_charge_pages(struct obj_cgroup *objcg, gfp_t gfp,
> > goto out;
> >
> > memcg_account_kmem(memcg, nr_pages);
> > +
> > + /* There is no way to set up kmem hard limit so this operation cannot fail */
> > + if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
> > + WARN_ON(!page_counter_try_charge(&memcg->kmem, nr_pages, &counter));
>
> This hunk doesn't look quite right.
>
> static void memcg_account_kmem(struct mem_cgroup *memcg, int nr_pages)
> {
> mod_memcg_state(memcg, MEMCG_KMEM, nr_pages);
> if (!cgroup_subsys_on_dfl(memory_cgrp_subsys)) {
> if (nr_pages > 0)
> page_counter_charge(&memcg->kmem, nr_pages);
> else
> page_counter_uncharge(&memcg->kmem, -nr_pages);
> }
> }
>
> Other than that, please add

Good point. I have missed a8c49af3be5f ("memcg: add per-memcg total kernel memory stat")
introduced in 4.18

> Acked-by: Johannes Weiner <[email protected]>

Fixed version below. Andrew, it seems we have a good consensus for this.
Could you queue this up and send it to Linus please?
---
From 8c3cbe68bba0fe5103d8fe73a06b3608ed49bda0 Mon Sep 17 00:00:00 2001
From: Michal Hocko <[email protected]>
Date: Thu, 21 Sep 2023 09:38:29 +0200
Subject: [PATCH] mm, memcg: reconsider kmem.limit_in_bytes deprecation

This reverts commits 86327e8eb94c ("memcg: drop kmem.limit_in_bytes")
and partially reverts 58056f77502f ("memcg, kmem: further deprecate
kmem.limit_in_bytes") which have incrementally removed support for the
kernel memory accounting hard limit. Unfortunately it has turned out
that there is still userspace depending on the existence of
memory.kmem.limit_in_bytes [1]. The underlying functionality is not
really required but the non-existent file just confuses the userspace
which fails in the result. The patch to fix this on the userspace side
has been submitted but it is hard to predict how it will propagate
through the maze of 3rd party consumers of the software.

Now, reverting alone 86327e8eb94c is not an option because there is
another set of userspace which cannot cope with ENOTSUPP returned when
writing to the file. Therefore we have to go and revisit 58056f77502f
as well. There are two ways to go ahead. Either we give up on the
deprecation and fully revert 58056f77502f as well or we can keep
kmem.limit_in_bytes but make the write a noop and warn about the fact.
This should work for both known breaking workloads which depend on the
existence but do not depend on the hard limit enforcement.

Note to backporters to stable trees. a8c49af3be5f ("memcg: add per-memcg
total kernel memory stat") introduced in 4.18 has added memcg_account_kmem
so the accounting is not done by obj_cgroup_charge_pages directly for v1
anymore. Prior kernels need to add it explicitly (thanks to Johannes for
pointing this out).

[1] http://lkml.kernel.org/r/20230920081101.GA12096@linuxonhyperv3.guj3yctzbm1etfxqx2vob5hsef.xx.internal.cloudapp.net
Cc: stable
Fixes: 86327e8eb94c ("memcg: drop kmem.limit_in_bytes")
Fixes: 58056f77502f ("memcg, kmem: further deprecate kmem.limit_in_bytes")
Acked-by: Shakeel Butt <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Signed-off-by: Michal Hocko <[email protected]>
---
Documentation/admin-guide/cgroup-v1/memory.rst | 7 +++++++
mm/memcontrol.c | 14 ++++++++++++++
2 files changed, 21 insertions(+)

diff --git a/Documentation/admin-guide/cgroup-v1/memory.rst b/Documentation/admin-guide/cgroup-v1/memory.rst
index 5f502bf68fbc..ff456871bf4b 100644
--- a/Documentation/admin-guide/cgroup-v1/memory.rst
+++ b/Documentation/admin-guide/cgroup-v1/memory.rst
@@ -92,6 +92,13 @@ Brief summary of control files.
memory.oom_control set/show oom controls.
memory.numa_stat show the number of memory usage per numa
node
+ memory.kmem.limit_in_bytes Deprecated knob to set and read the kernel
+ memory hard limit. Kernel hard limit is not
+ supported since 5.16. Writing any value to
+ do file will not have any effect same as if
+ nokmem kernel parameter was specified.
+ Kernel memory is still charged and reported
+ by memory.kmem.usage_in_bytes.
memory.kmem.usage_in_bytes show current kernel memory allocation
memory.kmem.failcnt show the number of kernel memory usage
hits limits
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index a4d3282493b6..63bdaab2a906 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3097,6 +3097,7 @@ static void obj_cgroup_uncharge_pages(struct obj_cgroup *objcg,
static int obj_cgroup_charge_pages(struct obj_cgroup *objcg, gfp_t gfp,
unsigned int nr_pages)
{
+ struct page_counter *counter;
struct mem_cgroup *memcg;
int ret;

@@ -3867,6 +3868,13 @@ static ssize_t mem_cgroup_write(struct kernfs_open_file *of,
case _MEMSWAP:
ret = mem_cgroup_resize_max(memcg, nr_pages, true);
break;
+ case _KMEM:
+ pr_warn_once("kmem.limit_in_bytes is deprecated and will be removed. "
+ "Writing any value to this file has no effect. "
+ "Please report your usecase to [email protected] if you "
+ "depend on this functionality.\n");
+ ret = 0;
+ break;
case _TCP:
ret = memcg_update_tcp_max(memcg, nr_pages);
break;
@@ -5077,6 +5085,12 @@ static struct cftype mem_cgroup_legacy_files[] = {
.seq_show = memcg_numa_stat_show,
},
#endif
+ {
+ .name = "kmem.limit_in_bytes",
+ .private = MEMFILE_PRIVATE(_KMEM, RES_LIMIT),
+ .write = mem_cgroup_write,
+ .read_u64 = mem_cgroup_read_u64,
+ },
{
.name = "kmem.usage_in_bytes",
.private = MEMFILE_PRIVATE(_KMEM, RES_USAGE),
--
2.30.2

--
Michal Hocko
SUSE Labs

2023-09-25 19:31:13