2018-05-25 18:56:48

by Shakeel Butt

[permalink] [raw]
Subject: [PATCH] memcg: force charge kmem counter too

Based on several conditions the kernel can decide to force charge an
allocation for a memcg i.e. overcharge memcg->memory and memcg->memsw
counters. Do the same for memcg->kmem counter too. In cgroup-v1, this
bug can cause a __GFP_NOFAIL kmem allocation fail if an explicit limit
on kmem counter is set and reached.

Signed-off-by: Shakeel Butt <[email protected]>
---
mm/memcontrol.c | 21 +++++++++++++++++++--
1 file changed, 19 insertions(+), 2 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ab5673dbfc4e..0a88f824c550 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1893,6 +1893,18 @@ void mem_cgroup_handle_over_high(void)
current->memcg_nr_pages_over_high = 0;
}

+/*
+ * Based on try_charge() force charge conditions.
+ */
+static inline bool should_force_charge(gfp_t gfp_mask)
+{
+ return (unlikely(tsk_is_oom_victim(current) ||
+ fatal_signal_pending(current) ||
+ current->flags & PF_EXITING ||
+ current->flags & PF_MEMALLOC ||
+ gfp_mask & __GFP_NOFAIL));
+}
+
static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
unsigned int nr_pages)
{
@@ -2008,6 +2020,8 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
* The allocation either can't fail or will lead to more memory
* being freed very soon. Allow memory usage go over the limit
* temporarily by force charging it.
+ *
+ * NOTE: Please keep the should_force_charge() conditions in sync.
*/
page_counter_charge(&memcg->memory, nr_pages);
if (do_memsw_account())
@@ -2331,8 +2345,11 @@ int memcg_kmem_charge_memcg(struct page *page, gfp_t gfp, int order,

if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) &&
!page_counter_try_charge(&memcg->kmem, nr_pages, &counter)) {
- cancel_charge(memcg, nr_pages);
- return -ENOMEM;
+ if (!should_force_charge(gfp)) {
+ cancel_charge(memcg, nr_pages);
+ return -ENOMEM;
+ }
+ page_counter_charge(&memcg->kmem, nr_pages);
}

page->mem_cgroup = memcg;
--
2.17.0.921.gf22659ad46-goog



2018-05-26 18:52:42

by Vladimir Davydov

[permalink] [raw]
Subject: Re: [PATCH] memcg: force charge kmem counter too

On Fri, May 25, 2018 at 11:55:01AM -0700, Shakeel Butt wrote:
> Based on several conditions the kernel can decide to force charge an
> allocation for a memcg i.e. overcharge memcg->memory and memcg->memsw
> counters. Do the same for memcg->kmem counter too. In cgroup-v1, this
> bug can cause a __GFP_NOFAIL kmem allocation fail if an explicit limit
> on kmem counter is set and reached.

memory.kmem.limit is broken and unlikely to ever be fixed as this knob
was deprecated in cgroup-v2. The fact that hitting the limit doesn't
trigger reclaim can result in unexpected behavior from user's pov, like
getting ENOMEM while listing a directory. Bypassing the limit for NOFAIL
allocations isn't going to fix those problem. So I'd suggest to avoid
setting memory.kmem.limit instead of trying to fix it or, even better,
switch to cgroup-v2.

>
> Signed-off-by: Shakeel Butt <[email protected]>
> ---
> mm/memcontrol.c | 21 +++++++++++++++++++--
> 1 file changed, 19 insertions(+), 2 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index ab5673dbfc4e..0a88f824c550 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1893,6 +1893,18 @@ void mem_cgroup_handle_over_high(void)
> current->memcg_nr_pages_over_high = 0;
> }
>
> +/*
> + * Based on try_charge() force charge conditions.
> + */
> +static inline bool should_force_charge(gfp_t gfp_mask)
> +{
> + return (unlikely(tsk_is_oom_victim(current) ||
> + fatal_signal_pending(current) ||
> + current->flags & PF_EXITING ||
> + current->flags & PF_MEMALLOC ||
> + gfp_mask & __GFP_NOFAIL));
> +}
> +
> static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
> unsigned int nr_pages)
> {
> @@ -2008,6 +2020,8 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
> * The allocation either can't fail or will lead to more memory
> * being freed very soon. Allow memory usage go over the limit
> * temporarily by force charging it.
> + *
> + * NOTE: Please keep the should_force_charge() conditions in sync.
> */
> page_counter_charge(&memcg->memory, nr_pages);
> if (do_memsw_account())
> @@ -2331,8 +2345,11 @@ int memcg_kmem_charge_memcg(struct page *page, gfp_t gfp, int order,
>
> if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) &&
> !page_counter_try_charge(&memcg->kmem, nr_pages, &counter)) {
> - cancel_charge(memcg, nr_pages);
> - return -ENOMEM;
> + if (!should_force_charge(gfp)) {
> + cancel_charge(memcg, nr_pages);
> + return -ENOMEM;
> + }
> + page_counter_charge(&memcg->kmem, nr_pages);
> }
>
> page->mem_cgroup = memcg;

2018-05-26 22:38:05

by Shakeel Butt

[permalink] [raw]
Subject: Re: [PATCH] memcg: force charge kmem counter too

On Sat, May 26, 2018 at 11:51 AM, Vladimir Davydov
<[email protected]> wrote:
> On Fri, May 25, 2018 at 11:55:01AM -0700, Shakeel Butt wrote:
>> Based on several conditions the kernel can decide to force charge an
>> allocation for a memcg i.e. overcharge memcg->memory and memcg->memsw
>> counters. Do the same for memcg->kmem counter too. In cgroup-v1, this
>> bug can cause a __GFP_NOFAIL kmem allocation fail if an explicit limit
>> on kmem counter is set and reached.
>
> memory.kmem.limit is broken and unlikely to ever be fixed as this knob
> was deprecated in cgroup-v2. The fact that hitting the limit doesn't
> trigger reclaim can result in unexpected behavior from user's pov, like
> getting ENOMEM while listing a directory. Bypassing the limit for NOFAIL
> allocations isn't going to fix those problem.

I understand that fixing NOFAIL will not fix all other issues but it
still is better than current situation. IMHO we should keep fixing
kmem bit by bit.

One crazy idea is to just break it completely by force charging all the time.

2018-05-28 15:57:00

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH] memcg: force charge kmem counter too

On Sat 26-05-18 15:37:05, Shakeel Butt wrote:
> On Sat, May 26, 2018 at 11:51 AM, Vladimir Davydov
> <[email protected]> wrote:
> > On Fri, May 25, 2018 at 11:55:01AM -0700, Shakeel Butt wrote:
> >> Based on several conditions the kernel can decide to force charge an
> >> allocation for a memcg i.e. overcharge memcg->memory and memcg->memsw
> >> counters. Do the same for memcg->kmem counter too. In cgroup-v1, this
> >> bug can cause a __GFP_NOFAIL kmem allocation fail if an explicit limit
> >> on kmem counter is set and reached.
> >
> > memory.kmem.limit is broken and unlikely to ever be fixed as this knob
> > was deprecated in cgroup-v2. The fact that hitting the limit doesn't
> > trigger reclaim can result in unexpected behavior from user's pov, like
> > getting ENOMEM while listing a directory. Bypassing the limit for NOFAIL
> > allocations isn't going to fix those problem.
>
> I understand that fixing NOFAIL will not fix all other issues but it
> still is better than current situation. IMHO we should keep fixing
> kmem bit by bit.
>
> One crazy idea is to just break it completely by force charging all the time.

What is the limit good for then? Accounting?

--
Michal Hocko
SUSE Labs

2018-05-29 03:48:09

by Shakeel Butt

[permalink] [raw]
Subject: Re: [PATCH] memcg: force charge kmem counter too

On Mon, May 28, 2018 at 2:11 AM, Michal Hocko <[email protected]> wrote:
> On Sat 26-05-18 15:37:05, Shakeel Butt wrote:
>> On Sat, May 26, 2018 at 11:51 AM, Vladimir Davydov
>> <[email protected]> wrote:
>> > On Fri, May 25, 2018 at 11:55:01AM -0700, Shakeel Butt wrote:
>> >> Based on several conditions the kernel can decide to force charge an
>> >> allocation for a memcg i.e. overcharge memcg->memory and memcg->memsw
>> >> counters. Do the same for memcg->kmem counter too. In cgroup-v1, this
>> >> bug can cause a __GFP_NOFAIL kmem allocation fail if an explicit limit
>> >> on kmem counter is set and reached.
>> >
>> > memory.kmem.limit is broken and unlikely to ever be fixed as this knob
>> > was deprecated in cgroup-v2. The fact that hitting the limit doesn't
>> > trigger reclaim can result in unexpected behavior from user's pov, like
>> > getting ENOMEM while listing a directory. Bypassing the limit for NOFAIL
>> > allocations isn't going to fix those problem.
>>
>> I understand that fixing NOFAIL will not fix all other issues but it
>> still is better than current situation. IMHO we should keep fixing
>> kmem bit by bit.
>>
>> One crazy idea is to just break it completely by force charging all the time.
>
> What is the limit good for then? Accounting?
>

Unlike tcpmem, the kmem accounting is enabled by default. No need to
set the limit to enable accounting.

I think my crazy idea was just wrong and without much thought. Though
is there a precedence where the broken feature is not fixed because an
alternative is available?

2018-05-29 08:32:49

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH] memcg: force charge kmem counter too

On Mon 28-05-18 10:23:07, Shakeel Butt wrote:
> On Mon, May 28, 2018 at 2:11 AM, Michal Hocko <[email protected]> wrote:
> Though is there a precedence where the broken feature is not fixed
> because an alternative is available?

Well, I can see how breaking GFP_NOFAIL semantic is problematic, on the
other hand we keep saying that kmem accounting in v1 is hard usable and
strongly discourage people from using it. Sure we can add the code which
handles _this_ particular case but that wouldn't make the whole thing
more usable I strongly suspect. Maybe I am wrong and you can provide
some specific examples. Is GFP_NOFAIL that common to matter?

In any case we should balance between the code maintainability here.
Adding more cruft into the allocator path is not free.

--
Michal Hocko
SUSE Labs

2018-05-30 18:15:20

by Shakeel Butt

[permalink] [raw]
Subject: Re: [PATCH] memcg: force charge kmem counter too

On Tue, May 29, 2018 at 1:31 AM, Michal Hocko <[email protected]> wrote:
> On Mon 28-05-18 10:23:07, Shakeel Butt wrote:
>> On Mon, May 28, 2018 at 2:11 AM, Michal Hocko <[email protected]> wrote:
>> Though is there a precedence where the broken feature is not fixed
>> because an alternative is available?
>
> Well, I can see how breaking GFP_NOFAIL semantic is problematic, on the
> other hand we keep saying that kmem accounting in v1 is hard usable and
> strongly discourage people from using it. Sure we can add the code which
> handles _this_ particular case but that wouldn't make the whole thing
> more usable I strongly suspect. Maybe I am wrong and you can provide
> some specific examples. Is GFP_NOFAIL that common to matter?
>
> In any case we should balance between the code maintainability here.
> Adding more cruft into the allocator path is not free.
>

We do not use kmem limits internally and this is something I found
through code inspection. If this patch is increasing the cost of code
maintainability I am fine with dropping it but at least there should a
comment saying that kmem limits are broken and no need fix.

Shakeel

2018-05-31 06:02:41

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH] memcg: force charge kmem counter too

On Wed, May 30, 2018 at 11:14:33AM -0700, Shakeel Butt wrote:
> On Tue, May 29, 2018 at 1:31 AM, Michal Hocko <[email protected]> wrote:
> > On Mon 28-05-18 10:23:07, Shakeel Butt wrote:
> >> On Mon, May 28, 2018 at 2:11 AM, Michal Hocko <[email protected]> wrote:
> >> Though is there a precedence where the broken feature is not fixed
> >> because an alternative is available?
> >
> > Well, I can see how breaking GFP_NOFAIL semantic is problematic, on the
> > other hand we keep saying that kmem accounting in v1 is hard usable and
> > strongly discourage people from using it. Sure we can add the code which
> > handles _this_ particular case but that wouldn't make the whole thing
> > more usable I strongly suspect. Maybe I am wrong and you can provide
> > some specific examples. Is GFP_NOFAIL that common to matter?
> >
> > In any case we should balance between the code maintainability here.
> > Adding more cruft into the allocator path is not free.
> >
>
> We do not use kmem limits internally and this is something I found
> through code inspection. If this patch is increasing the cost of code
> maintainability I am fine with dropping it but at least there should a
> comment saying that kmem limits are broken and no need fix.

I agree.

Even, I didn't know kmem is strongly discouraged until now. Then,
why is it enabled by default on cgroup v1?

Let's turn if off with comment "It's broken so do not use/fix. Instead,
please move to cgroup v2".

2018-05-31 06:57:37

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH] memcg: force charge kmem counter too

On Thu 31-05-18 15:01:33, Minchan Kim wrote:
> On Wed, May 30, 2018 at 11:14:33AM -0700, Shakeel Butt wrote:
> > On Tue, May 29, 2018 at 1:31 AM, Michal Hocko <[email protected]> wrote:
> > > On Mon 28-05-18 10:23:07, Shakeel Butt wrote:
> > >> On Mon, May 28, 2018 at 2:11 AM, Michal Hocko <[email protected]> wrote:
> > >> Though is there a precedence where the broken feature is not fixed
> > >> because an alternative is available?
> > >
> > > Well, I can see how breaking GFP_NOFAIL semantic is problematic, on the
> > > other hand we keep saying that kmem accounting in v1 is hard usable and
> > > strongly discourage people from using it. Sure we can add the code which
> > > handles _this_ particular case but that wouldn't make the whole thing
> > > more usable I strongly suspect. Maybe I am wrong and you can provide
> > > some specific examples. Is GFP_NOFAIL that common to matter?
> > >
> > > In any case we should balance between the code maintainability here.
> > > Adding more cruft into the allocator path is not free.
> > >
> >
> > We do not use kmem limits internally and this is something I found
> > through code inspection. If this patch is increasing the cost of code
> > maintainability I am fine with dropping it but at least there should a
> > comment saying that kmem limits are broken and no need fix.
>
> I agree.
>
> Even, I didn't know kmem is strongly discouraged until now. Then,
> why is it enabled by default on cgroup v1?

You have to set a non-zero limit to make it active IIRC. The code is
compiled in because v2 enables it by default.

> Let's turn if off with comment "It's broken so do not use/fix. Instead,
> please move to cgroup v2".

--
Michal Hocko
SUSE Labs

2018-05-31 08:24:43

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH] memcg: force charge kmem counter too

On Thu, May 31, 2018 at 08:56:42AM +0200, Michal Hocko wrote:
> On Thu 31-05-18 15:01:33, Minchan Kim wrote:
> > On Wed, May 30, 2018 at 11:14:33AM -0700, Shakeel Butt wrote:
> > > On Tue, May 29, 2018 at 1:31 AM, Michal Hocko <[email protected]> wrote:
> > > > On Mon 28-05-18 10:23:07, Shakeel Butt wrote:
> > > >> On Mon, May 28, 2018 at 2:11 AM, Michal Hocko <[email protected]> wrote:
> > > >> Though is there a precedence where the broken feature is not fixed
> > > >> because an alternative is available?
> > > >
> > > > Well, I can see how breaking GFP_NOFAIL semantic is problematic, on the
> > > > other hand we keep saying that kmem accounting in v1 is hard usable and
> > > > strongly discourage people from using it. Sure we can add the code which
> > > > handles _this_ particular case but that wouldn't make the whole thing
> > > > more usable I strongly suspect. Maybe I am wrong and you can provide
> > > > some specific examples. Is GFP_NOFAIL that common to matter?
> > > >
> > > > In any case we should balance between the code maintainability here.
> > > > Adding more cruft into the allocator path is not free.
> > > >
> > >
> > > We do not use kmem limits internally and this is something I found
> > > through code inspection. If this patch is increasing the cost of code
> > > maintainability I am fine with dropping it but at least there should a
> > > comment saying that kmem limits are broken and no need fix.
> >
> > I agree.
> >
> > Even, I didn't know kmem is strongly discouraged until now. Then,
> > why is it enabled by default on cgroup v1?
>
> You have to set a non-zero limit to make it active IIRC. The code is

Maybe, no. I didn't set anything. IOW, it was a default without any setting
and I hit this as you can remember.
http://lkml.kernel.org/r/<[email protected]>
We don't need to allocate memory for stuff even maintainers discourage.

> compiled in because v2 enables it by default.
>
> > Let's turn if off with comment "It's broken so do not use/fix. Instead,
> > please move to cgroup v2".
>
> --
> Michal Hocko
> SUSE Labs

2018-05-31 08:52:16

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH] memcg: force charge kmem counter too

On Thu 31-05-18 17:23:17, Minchan Kim wrote:
> On Thu, May 31, 2018 at 08:56:42AM +0200, Michal Hocko wrote:
> > On Thu 31-05-18 15:01:33, Minchan Kim wrote:
> > > On Wed, May 30, 2018 at 11:14:33AM -0700, Shakeel Butt wrote:
> > > > On Tue, May 29, 2018 at 1:31 AM, Michal Hocko <[email protected]> wrote:
> > > > > On Mon 28-05-18 10:23:07, Shakeel Butt wrote:
> > > > >> On Mon, May 28, 2018 at 2:11 AM, Michal Hocko <[email protected]> wrote:
> > > > >> Though is there a precedence where the broken feature is not fixed
> > > > >> because an alternative is available?
> > > > >
> > > > > Well, I can see how breaking GFP_NOFAIL semantic is problematic, on the
> > > > > other hand we keep saying that kmem accounting in v1 is hard usable and
> > > > > strongly discourage people from using it. Sure we can add the code which
> > > > > handles _this_ particular case but that wouldn't make the whole thing
> > > > > more usable I strongly suspect. Maybe I am wrong and you can provide
> > > > > some specific examples. Is GFP_NOFAIL that common to matter?
> > > > >
> > > > > In any case we should balance between the code maintainability here.
> > > > > Adding more cruft into the allocator path is not free.
> > > > >
> > > >
> > > > We do not use kmem limits internally and this is something I found
> > > > through code inspection. If this patch is increasing the cost of code
> > > > maintainability I am fine with dropping it but at least there should a
> > > > comment saying that kmem limits are broken and no need fix.
> > >
> > > I agree.
> > >
> > > Even, I didn't know kmem is strongly discouraged until now. Then,
> > > why is it enabled by default on cgroup v1?
> >
> > You have to set a non-zero limit to make it active IIRC. The code is
>
> Maybe, no. I didn't set anything. IOW, it was a default without any setting
> and I hit this as you can remember.
> http://lkml.kernel.org/r/<[email protected]>
> We don't need to allocate memory for stuff even maintainers discourage.

Well, you can disable the whole kmem accounting by cgroup.memory=nokmem
kernel parameter. I wasn't very happy when we removed the config option
which would have put all the tracking aside even when no limit is set
but the argument was that the ifdefery was growing into an
unmaintainable mess and I have to agree with that. So here we are...
--
Michal Hocko
SUSE Labs