2022-04-22 18:15:34

by Roman Gushchin

Subject: Re: [PATCH v2 8/8] mm: Centralize & improve oom reporting in show_mem.c

On Fri, Apr 22, 2022 at 02:58:19PM +0200, Michal Hocko wrote:
> On Thu 21-04-22 19:48:37, Kent Overstreet wrote:
> > This patch:
> > - Changes show_mem() to always report on slab usage
> > - Instead of reporting on all slabs, we only report on top 10 slabs,
> > and in sorted order
>
> As I've already pointed out in the email thread for the previous
> version, this would be better in its own patch explaining why we want to
> make this unconditional and why to limit the number of caches to print.
> Why shouldn't the threshold be based on absolute size?
>
> > - Also reports on shrinkers, with the new shrinkers_to_text().
> > Shrinkers need to be included in OOM/allocation failure reporting
> > because they're responsible for memory reclaim - if a shrinker isn't
> > giving up its memory, we need to know which one and why.
>
> Again, I do agree that information about shrinkers can be useful but
> there are two main things to consider. Do we want to dump that
> information unconditionally? E.g. does it make sense to print it for all
> allocation requests (even high order, GFP_NOWAIT...)? Should there be
> any explicit trigger when to dump this data (like too many shrinkers
> failing etc)?

To add a concern: the largest shrinkers are usually memcg-aware. Scanning
over the whole cgroup tree (with potentially hundreds or thousands of cgroups)
and over all shrinkers from the oom context sounds like a bad idea to me.

IMO it's more appropriate to do this from userspace, by oomd or a similar daemon,
well before the in-kernel OOM kicks in.

>
> Last but not least let me echo the concern from the other reply. Memory
> allocations are not really reasonable to be done from the oom context, so
> pr_buf doesn't sound like a good tool here.

+1


2022-04-22 23:54:01

by Kent Overstreet

Subject: Re: [PATCH v2 8/8] mm: Centralize & improve oom reporting in show_mem.c

On Fri, Apr 22, 2022 at 08:09:48AM -0700, Roman Gushchin wrote:
> To add a concern: the largest shrinkers are usually memcg-aware. Scanning
> over the whole cgroup tree (with potentially hundreds or thousands of cgroups)
> and over all shrinkers from the oom context sounds like a bad idea to me.

Why would we be scanning over the whole cgroup tree? We don't do that in the
vmscan code, nor in the new report. If the OOM is for a specific cgroup, we should
probably only be reporting on memory usage for that cgroup (show_mem() is not
currently cgroup aware, but perhaps it should be).

> IMO it's more appropriate to do this from userspace, by oomd or a similar daemon,
> well before the in-kernel OOM kicks in.

The reason I've been introducing printbufs and the .to_text() method was
specifically to make this code general enough to be available from
sysfs/debugfs - so I see no reason why a userspace oomd couldn't make use of it
as well.

> > Last but not least let me echo the concern from the other reply. Memory
> > allocations are not really reasonable to be done from the oom context, so
> > pr_buf doesn't sound like a good tool here.
>
> +1

In my experience, it's rare to be _so_ out of memory that small kmalloc
allocations are failing - we'll be triggering the show_mem() report before that
happens.

However, if this turns out not to be the case in practice, or if there's a
consensus now that we really want to guard against this, I have some thoughts.
We could either:

- mempool-ify printbufs as a whole

- reserve some memory just for the show_mem() report, which would mean either
  adding support to printbuf for external buffers (subsuming what seq_buf
  does), or having shrinker .to_text() methods output to seq_buf instead of
  printbuf (ew, API fragmentation). A sketch of this option follows.
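
To make the second option concrete, here's a minimal sketch of a
reserved-buffer report path, using the existing seq_buf API for illustration
(the buffer and function names are hypothetical, not from the patch set):

#include <linux/seq_buf.h>

static char oom_report_buf[4096];	/* reserved at boot, never freed */

static void show_mem_report(void)
{
	struct seq_buf s;

	/* seq_buf writes into a caller-supplied buffer, so nothing is
	 * allocated on this path */
	seq_buf_init(&s, oom_report_buf, sizeof(oom_report_buf));
	seq_buf_printf(&s, "top slabs and shrinkers:\n");
	/* ... shrinker .to_text() output would be appended to &s here ... */
	printk("%s", oom_report_buf);
}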

2022-04-23 00:44:16

by Roman Gushchin

Subject: Re: [PATCH v2 8/8] mm: Centralize & improve oom reporting in show_mem.c

On Fri, Apr 22, 2022 at 07:48:20PM -0400, Kent Overstreet wrote:
> On Fri, Apr 22, 2022 at 08:09:48AM -0700, Roman Gushchin wrote:
> > To add a concern: the largest shrinkers are usually memcg-aware. Scanning
> > over the whole cgroup tree (with potentially hundreds or thousands of cgroups)
> > and over all shrinkers from the oom context sounds like a bad idea to me.
>
> Why would we be scanning over the whole cgroup tree? We don't do that in the
> vmscan code, nor in the new report. If the OOM is for a specific cgroup, we should
> probably only be reporting on memory usage for that cgroup (show_mem() is not
> currently cgroup aware, but perhaps it should be).

You're scanning over a small portion of all shrinker lists (on a machine with
cgroups), so the top-10 list has little value.
A global ->count_objects() call returns the number of objects at the
system/root_mem_cgroup level, not the shrinker's total.

>
> > IMO it's more appropriate to do this from userspace, by oomd or a similar daemon,
> > well before the in-kernel OOM kicks in.
>
> The reason I've been introducing printbufs and the .to_text() method was
> specifically to make this code general enough to be available from
> sysfs/debugfs - so I see no reason why a userspace oomd couldn't make use of it
> as well.

Of course, I've nothing against adding .to_text().

>
> > > Last but not least let me echo the concern from the other reply. Memory
> > > allocations are not really reasonable to be done from the oom context, so
> > > pr_buf doesn't sound like a good tool here.
> >
> > +1
>
> In my experience, it's rare to be _so_ out of memory that small kmalloc
> allocations are failing - we'll be triggering the show_mem() report before that
> happens.

I agree. However, the OOM killer _has_ to make progress even in such rare
circumstances.

2022-04-23 00:57:12

by Kent Overstreet

Subject: Re: [PATCH v2 8/8] mm: Centralize & improve oom reporting in show_mem.c

On Fri, Apr 22, 2022 at 05:27:41PM -0700, Roman Gushchin wrote:
> You're scanning over a small portion of all shrinker lists (on a machine with
> cgroups), so the top-10 list has little value.
> A global ->count_objects() call returns the number of objects at the
> system/root_mem_cgroup level, not the shrinker's total.

Not quite following what you're saying here...?

If you're complaining that my current top-10-shrinker report isn't memcg aware,
that's valid - I can fix that.

> > In my experience, it's rare to be _so_ out of memory that small kmalloc
> > allocations are failing - we'll be triggering the show_mem() report before that
> > happens.
>
> I agree. However, the OOM killer _has_ to make progress even in such rare
> circumstances.

Oh, and the concern is allocator recursion? Yeah, that's a good point.

Do you know if using memalloc_noreclaim_(save|restore) is sufficient for that,
or do we want GFP_ATOMIC? I'm already using GFP_ATOMIC for allocations when we
generate the report on slabs, since we're taking the slab mutex there.
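
For concreteness, the two patterns I'm asking about would look roughly like
this (report_alloc_* are hypothetical helpers, not from the patch set):

#include <linux/sched/mm.h>
#include <linux/slab.h>

static void *report_alloc_noreclaim(size_t len)
{
	unsigned int noreclaim_flag;
	void *buf;

	/* PF_MEMALLOC section: allocations in here skip direct reclaim
	 * (no recursion), but may dip into the memory reserves */
	noreclaim_flag = memalloc_noreclaim_save();
	buf = kmalloc(len, GFP_KERNEL);
	memalloc_noreclaim_restore(noreclaim_flag);

	return buf;
}

static void *report_alloc_atomic(size_t len)
{
	/* no reclaim either way; bounded access to the atomic reserves */
	return kmalloc(len, GFP_ATOMIC);
}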

2022-04-23 02:04:14

by Roman Gushchin

Subject: Re: [PATCH v2 8/8] mm: Centralize & improve oom reporting in show_mem.c

On Fri, Apr 22, 2022 at 08:46:07PM -0400, Kent Overstreet wrote:
> On Fri, Apr 22, 2022 at 05:27:41PM -0700, Roman Gushchin wrote:
> > You're scanning over a small portion of all shrinker lists (on a machine with
> > cgroups), so the top-10 list has little value.
> > A global ->count_objects() call returns the number of objects at the
> > system/root_mem_cgroup level, not the shrinker's total.
>
> Not quite following what you're saying here...?
>
> If you're complaining that my current top-10-shrinker report isn't memcg aware,
> that's valid - I can fix that.

For memcg-aware shrinkers, each memcg has its own LRU (per node).
If you want to print top-10 system-wide lists, you need to call
->count_objects() for each shrinker for each memcg for each node.
It's quite a lot of work for an oom context.
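
To make the cost concrete, a system-wide count for a single memcg-aware
shrinker would have to do something like this (hypothetical helper, not
existing code):

#include <linux/memcontrol.h>
#include <linux/nodemask.h>
#include <linux/shrinker.h>

static unsigned long shrinker_total_objects(struct shrinker *shrinker)
{
	struct mem_cgroup *memcg;
	unsigned long count, total = 0;
	int nid;

	/* walk every memcg in the system... */
	memcg = mem_cgroup_iter(NULL, NULL, NULL);
	do {
		/* ...and every NUMA node for each of them */
		for_each_node(nid) {
			struct shrink_control sc = {
				.gfp_mask	= GFP_KERNEL,
				.nid		= nid,
				.memcg		= memcg,
			};

			count = shrinker->count_objects(shrinker, &sc);
			if (count != SHRINK_EMPTY)
				total += count;
		}
	} while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)));

	return total;
}

With hundreds of cgroups and multiple nodes, that's hundreds or thousands of
->count_objects() calls per shrinker, multiplied by the number of shrinkers.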

>
> > > In my experience, it's rare to be _so_ out of memory that small kmalloc
> > > allocations are failing - we'll be triggering the show_mem() report before that
> > > happens.
> >
> > I agree. However, the OOM killer _has_ to make progress even in such rare
> > circumstances.
>
> Oh, and the concern is allocator recursion? Yeah, that's a good point.

Yes, but not the only problem.

>
> Do you know if using memalloc_noreclaim_(save|restore) is sufficient for that,
> or do we want GFP_ATOMIC? I'm already using GFP_ATOMIC for allocations when we
> generate the report on slabs, since we're taking the slab mutex there.

And this is another problem: grabbing _any_ locks from the oom context is asking
for trouble. You can potentially enter the oom path doing any allocation, so
now you have to check that no allocations are ever made while holding this lock.
And I'm not aware of any reasonable way to test it, so most likely it ends up
introducing some very subtle bugs, which will be triggered once a year.

Thanks!

2022-04-25 08:42:24

by Tetsuo Handa

Subject: Re: [PATCH v2 8/8] mm: Centralize & improve oom reporting in show_mem.c

On 2022/04/23 10:25, Roman Gushchin wrote:
>>> I agree. However, the OOM killer _has_ to make progress even in such rare
>>> circumstances.
>>
>> Oh, and the concern is allocator recursion? Yeah, that's a good point.
>
> Yes, but not the only problem.
>
>>
>> Do you know if using memalloc_noreclaim_(save|restore) is sufficient for that,
>> or do we want GFP_ATOMIC? I'm already using GFP_ATOMIC for allocations when we
>> generate the report on slabs, since we're taking the slab mutex there.
>
> And this is another problem: grabbing _any_ locks from the oom context is asking
> for trouble. You can potentially enter the oom path doing any allocation, so
> now you have to check that no allocations are ever made while holding this lock.
> And I'm not aware of any reasonable way to test it, so most likely it ends up
> introducing some very subtle bugs, which will be triggered once a year.
>

You can't allocate memory or take locks from the OOM context. Since the oom_lock
mutex serializes OOM reporting, you could use a statically pre-allocated buffer
for holding one line of output. Correlating the whole report can then be done by
a userspace program with the aid of CONFIG_PRINTK_CALLER=y.
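
Something like this minimal sketch (hypothetical and untested):

#include <linux/oom.h>
#include <linux/printk.h>
#include <linux/stdarg.h>

static char oom_line_buf[256];	/* protected by oom_lock */

static void oom_report_line(const char *fmt, ...)
{
	va_list args;

	lockdep_assert_held(&oom_lock);

	va_start(args, fmt);
	vsnprintf(oom_line_buf, sizeof(oom_line_buf), fmt, args);
	va_end(args);

	/* with CONFIG_PRINTK_CALLER=y every line carries the caller id, so
	 * userspace can stitch lines from the same context back together */
	pr_warn("%s\n", oom_line_buf);
}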

2022-04-25 09:52:49

by Michal Hocko

Subject: Re: [PATCH v2 8/8] mm: Centralize & improve oom reporting in show_mem.c

On Fri 22-04-22 20:46:07, Kent Overstreet wrote:
> On Fri, Apr 22, 2022 at 05:27:41PM -0700, Roman Gushchin wrote:
[...]
> > > In my experience, it's rare to be _so_ out of memory that small kmalloc
> > > allocations are failing - we'll be triggering the show_mem() report before that
> > > happens.
> >
> > I agree. However, the OOM killer _has_ to make progress even in such rare
> > circumstances.

Absolutely agreed!

> Oh, and the concern is allocator recursion? Yeah, that's a good point.

No, not really. The oom killer is running with PF_MEMALLOC context, so no
reclaim recursion is allowed. As I've already pointed out in the other
reply, the context will have access to memory reserves without any
constraints, so it could deplete them completely, resulting in other issues
during the recovery.

> Do you know if using memalloc_noreclaim_(save|restore) is sufficient for that,
> or do we want GFP_ATOMIC? I'm already using GFP_ATOMIC for allocations when we
> generate the report on slabs, since we're taking the slab mutex there.

No it's not. You simply _cannot_ allocate from the oom context.
--
Michal Hocko
SUSE Labs

2022-04-26 07:37:43

by Kent Overstreet

Subject: Re: [PATCH v2 8/8] mm: Centralize & improve oom reporting in show_mem.c

On Mon, Apr 25, 2022 at 11:28:26AM +0200, Michal Hocko wrote:
>
> > Do you know if using memalloc_noreclaim_(save|restore) is sufficient for that,
> > or do we want GFP_ATOMIC? I'm already using GFP_ATOMIC for allocations when we
> > generate the report on slabs, since we're taking the slab mutex there.
>
> No it's not. You simply _cannot_ allocate from the oom context.

Hmm, no, that can't be right. I've been using the patch set and it definitely
works, at least in my testing. Do you mean to say that we shouldn't? Can you
explain why?

2022-04-26 10:29:55

by Michal Hocko

Subject: Re: [PATCH v2 8/8] mm: Centralize & improve oom reporting in show_mem.c

On Mon 25-04-22 11:28:11, Kent Overstreet wrote:
> On Mon, Apr 25, 2022 at 11:28:26AM +0200, Michal Hocko wrote:
> >
> > > Do you know if using memalloc_noreclaim_(save|restore) is sufficient for that,
> > > or do we want GFP_ATOMIC? I'm already using GFP_ATOMIC for allocations when we
> > > generate the report on slabs, since we're taking the slab mutex there.
> >
> > No it's not. You simply _cannot_ allocate from the oom context.
>
> Hmm, no, that can't be right. I've been using the patch set and it definitely
> works, at least in my testing.

Yes, the world will not fall down, and what kind of effect this has really
depends on the workload.

> Do you mean to say that we shouldn't? Can you explain why?

I have already touched on that but let me reiterate. An allocation context
called from the oom path will have unbounded access to memory reserves.
Those are last-resort emergency pools of memory that are not normally
available, and there are areas which really depend on them to make further
progress and release memory pressure.

Swap over NFS would be one such example. If some other code path messes
with those reserves, the swap IO path could fail with all sorts of
fallout.

So, to be really exact in my statement: you can allocate from the OOM
context, but it is _strongly_ discouraged unless there is no other way
around it.

I would even claim that memory reclaim in general shouldn't rely on
memory allocations (other than mempools). If an allocation is really
necessary, then extra care has to be taken to prevent complete memory
depletion.
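
For illustration, the mempool pattern I mean looks like this (the names are
hypothetical):

#include <linux/init.h>
#include <linux/mempool.h>

static mempool_t *report_pool;

static int __init report_pool_init(void)
{
	/* two pre-allocated 4 KiB elements, reserved for this path only */
	report_pool = mempool_create_kmalloc_pool(2, 4096);
	return report_pool ? 0 : -ENOMEM;
}
core_initcall(report_pool_init);

mempool_alloc() with a sleeping gfp mask doesn't deplete the reserves: it
first tries the underlying allocator with __GFP_NOMEMALLOC and otherwise
waits for an element to be returned to the pool, so forward progress never
depends on new memory becoming available.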
--
Michal Hocko
SUSE Labs