2024-05-09 03:42:25

by Roman Gushchin

Subject: [PATCH rfc 0/9] mm: memcg: separate legacy cgroup v1 code and put under config option

Cgroups v2 have been around for a while and many users have fully adopted them,
so they never use cgroup v1 features and functionality. Yet they have to "pay"
for the cgroup v1 support anyway:
1) the kernel binary contains useless cgroup v1 code,
2) some common structures like task_struct and mem_cgroup contain cgroup
v1-specific members which are never used,
3) some code paths have additional checks which are not needed.

Cgroup v1's memory controller has a number of features that are not supported
by cgroup v2, and their implementation is pretty much self-contained.
Most notably, these features are: soft limit reclaim, OOM handling in userspace,
the complicated event notification system, and charge migration.

Cgroup v1-specific code in memcontrol.c is close to 4k lines in size and is
interwoven with generic and cgroup v2-specific code. It's a burden on
developers and maintainers.

This patchset aims to solve these problems by:
1) moving cgroup v1-specific memcg code to the new mm/memcontrol-v1.c file,
2) putting definitions shared by memcontrol.c and memcontrol-v1.c into the
mm/internal.h header,
3) introducing the CONFIG_MEMCG_V1 config option, turned on by default,
4) making memcontrol-v1.c compile only if CONFIG_MEMCG_V1 is set,
5) putting unused struct mem_cgroup and task_struct members under
CONFIG_MEMCG_V1 as well (see the stub-pattern sketch right after this list).
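
For illustration only (not part of the diffs in this series), the intended end
state for generic callers is the usual stub pattern: with CONFIG_MEMCG_V1
disabled, v1-only hooks collapse into static inline no-ops in
include/linux/memcontrol.h, so call sites need no #ifdefs. A minimal sketch,
assuming the final config symbol is indeed CONFIG_MEMCG_V1:

#ifdef CONFIG_MEMCG_V1
unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
					    gfp_t gfp_mask,
					    unsigned long *total_scanned);
#else /* !CONFIG_MEMCG_V1 */
static inline
unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
					    gfp_t gfp_mask,
					    unsigned long *total_scanned)
{
	/* no soft limit reclaim without the v1 memory controller */
	return 0;
}
#endif /* !CONFIG_MEMCG_V1 */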

This is an RFC version which is not 100% polished yet, but it would be great
to discuss and agree on the overall approach.

Some open questions, opinions are appreciated:
1) I'm considering renaming non-static functions in memcontrol-v1.c to have
a mem_cgroup_v1_ prefix (see the example below). Is this a good idea?
2) Do we want to extend this approach beyond the memory controller?
3) Is it better to use a new include/linux/memcontrol-v1.h instead of
mm/internal.h? Or mm/memcontrol-v1.h?
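
To make question 1 concrete, here is a purely hypothetical example of such a
renaming (none of these renames are part of this series), based on the
prototypes currently added to mm/internal.h:

void mem_cgroup_v1_update_tree(struct mem_cgroup *memcg, int nid);	/* today: mem_cgroup_update_tree() */
void mem_cgroup_v1_remove_from_trees(struct mem_cgroup *memcg);	/* today: mem_cgroup_remove_from_trees() */
void mem_cgroup_v1_soft_limit_reset(struct mem_cgroup *memcg);	/* today: mem_cgroup_soft_limit_reset() */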

Suggested-by: Matthew Wilcox (Oracle) <[email protected]>
Signed-off-by: Roman Gushchin <[email protected]>


Roman Gushchin (9):
mm: memcg: introduce memcontrol-v1.c
mm: memcg: move soft limit reclaim code to memcontrol-v1.c
mm: memcg: move charge migration code to memcontrol-v1.c
mm: memcg: move legacy memcg event code into memcontrol-v1.c
mm: memcg: move cgroup v1 interface files to memcontrol-v1.c
mm: memcg: move cgroup v1 oom handling code into memcontrol-v1.c
mm: memcg: put cgroup v1-specific code under a config option
mm: memcg: put corresponding struct mem_cgroup members under
CONFIG_MEMCG_V1
mm: memcg: put cgroup v1-related members of task_struct under config
option

include/linux/memcontrol.h | 165 +-
include/linux/sched.h | 5 +-
init/Kconfig | 7 +
mm/Makefile | 2 +
mm/internal.h | 124 ++
mm/memcontrol-v1.c | 2941 +++++++++++++++++++++++++
mm/memcontrol.c | 4121 ++++++------------------------------
7 files changed, 3765 insertions(+), 3600 deletions(-)
create mode 100644 mm/memcontrol-v1.c

--
2.43.2



2024-05-09 03:42:30

by Roman Gushchin

Subject: [PATCH rfc 1/9] mm: memcg: introduce memcontrol-v1.c

This patch introduces the mm/memcontrol-v1.c source file which will be used for
all legacy (cgroup v1) memory cgroup code.

For now, let's compile it whenever CONFIG_MEMCG is set, similar to mm/memcontrol.c.
Later on it can be switched to a separate config option, so that the legacy
code won't be compiled if not required.

Signed-off-by: Roman Gushchin <[email protected]>
---
mm/Makefile | 3 ++-
mm/memcontrol-v1.c | 2 ++
2 files changed, 4 insertions(+), 1 deletion(-)
create mode 100644 mm/memcontrol-v1.c

diff --git a/mm/Makefile b/mm/Makefile
index 25da205becdd..c717a3ee612e 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -26,6 +26,7 @@ KCOV_INSTRUMENT_page_alloc.o := n
KCOV_INSTRUMENT_debug-pagealloc.o := n
KCOV_INSTRUMENT_kmemleak.o := n
KCOV_INSTRUMENT_memcontrol.o := n
+KCOV_INSTRUMENT_memcontrol-v1.o := n
KCOV_INSTRUMENT_mmzone.o := n
KCOV_INSTRUMENT_vmstat.o := n
KCOV_INSTRUMENT_failslab.o := n
@@ -95,7 +96,7 @@ obj-$(CONFIG_NUMA) += memory-tiers.o
obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o
obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
-obj-$(CONFIG_MEMCG) += memcontrol.o vmpressure.o
+obj-$(CONFIG_MEMCG) += memcontrol.o memcontrol-v1.o vmpressure.o
ifdef CONFIG_SWAP
obj-$(CONFIG_MEMCG) += swap_cgroup.o
endif
diff --git a/mm/memcontrol-v1.c b/mm/memcontrol-v1.c
new file mode 100644
index 000000000000..4bc66fc244c0
--- /dev/null
+++ b/mm/memcontrol-v1.c
@@ -0,0 +1,2 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+
--
2.43.2


2024-05-09 03:42:47

by Roman Gushchin

Subject: [PATCH rfc 2/9] mm: memcg: move soft limit reclaim code to memcontrol-v1.c

From: Roman Gushchin <[email protected]>

Soft limits are cgroup v1-specific and are not supported
by cgroup v2, so let's move the corresponding code into
memcontrol-v1.c.

Aside from simply moving the code, this commit introduces
a trivial mem_cgroup_soft_limit_reset() function to reset
soft limits and also moves the global soft limit tree initialization
code into a new mem_cgroup_v1_init() function.

It also moves the corresponding definitions in include/linux/memcontrol.h
into a separate section at the end of the file. The idea is to group
all memcg v1-specific definitions in one place and to provide trivial
alternatives later on, to support compiling the kernel without
support for the legacy memory controller.
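
To make the new mem_cgroup_soft_limit_reset() helper concrete, here is the end
state condensed from the hunks below (nothing beyond what the diff itself does):

/* mm/memcontrol-v1.c */
void mem_cgroup_soft_limit_reset(struct mem_cgroup *memcg)
{
	WRITE_ONCE(memcg->soft_limit, PAGE_COUNTER_MAX);
}

/* mm/memcontrol.c: mem_cgroup_css_alloc() and mem_cgroup_css_reset() now do */
page_counter_set_high(&memcg->memory, PAGE_COUNTER_MAX);
mem_cgroup_soft_limit_reset(memcg);

The helper keeps the reset logic in v1 code, while memcontrol.c only calls the
wrapper instead of open-coding the WRITE_ONCE().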

Signed-off-by: Roman Gushchin <[email protected]>
---
include/linux/memcontrol.h | 28 +--
mm/internal.h | 5 +
mm/memcontrol-v1.c | 347 +++++++++++++++++++++++++++++++++++++
mm/memcontrol.c | 336 +----------------------------------
4 files changed, 372 insertions(+), 344 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 030d34e9d117..f77b6fbf38fd 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1120,10 +1120,6 @@ static inline void memcg_memory_event_mm(struct mm_struct *mm,

void split_page_memcg(struct page *head, int old_order, int new_order);

-unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
- gfp_t gfp_mask,
- unsigned long *total_scanned);
-
#else /* CONFIG_MEMCG */

#define MEM_CGROUP_ID_SHIFT 0
@@ -1575,13 +1571,6 @@ static inline void split_page_memcg(struct page *head, int old_order, int new_or
{
}

-static inline
-unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
- gfp_t gfp_mask,
- unsigned long *total_scanned)
-{
- return 0;
-}
#endif /* CONFIG_MEMCG */

/*
@@ -1932,4 +1921,21 @@ static inline bool mem_cgroup_zswap_writeback_enabled(struct mem_cgroup *memcg)
}
#endif

+/* Cgroup v1-specific definitions */
+
+#ifdef CONFIG_MEMCG
+unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
+ gfp_t gfp_mask,
+ unsigned long *total_scanned);
+#else /* CONFIG_MEMCG */
+
+static inline
+unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
+ gfp_t gfp_mask,
+ unsigned long *total_scanned)
+{
+ return 0;
+}
+#endif /* CONFIG_MEMCG */
+
#endif /* _LINUX_MEMCONTROL_H */
diff --git a/mm/internal.h b/mm/internal.h
index b2c75b12014e..19e96d626977 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1521,4 +1521,9 @@ static inline void shrinker_debugfs_remove(struct dentry *debugfs_entry,
void workingset_update_node(struct xa_node *node);
extern struct list_lru shadow_nodes;

+/* Memory cgroups v1-specific definitions */
+void mem_cgroup_update_tree(struct mem_cgroup *memcg, int nid);
+void mem_cgroup_remove_from_trees(struct mem_cgroup *memcg);
+void mem_cgroup_soft_limit_reset(struct mem_cgroup *memcg);
+
#endif /* __MM_INTERNAL_H */
diff --git a/mm/memcontrol-v1.c b/mm/memcontrol-v1.c
index 4bc66fc244c0..951e1c1189cc 100644
--- a/mm/memcontrol-v1.c
+++ b/mm/memcontrol-v1.c
@@ -1,2 +1,349 @@
// SPDX-License-Identifier: GPL-2.0-or-later

+#include <linux/memcontrol.h>
+#include <linux/mm_inline.h>
+
+#include "internal.h"
+
+/*
+ * Cgroups above their limits are maintained in a RB-Tree, independent of
+ * their hierarchy representation
+ */
+
+struct mem_cgroup_tree_per_node {
+ struct rb_root rb_root;
+ struct rb_node *rb_rightmost;
+ spinlock_t lock;
+};
+
+struct mem_cgroup_tree {
+ struct mem_cgroup_tree_per_node *rb_tree_per_node[MAX_NUMNODES];
+};
+
+static struct mem_cgroup_tree soft_limit_tree __read_mostly;
+
+/*
+ * Maximum loops in mem_cgroup_soft_reclaim(), used for soft
+ * limit reclaim to prevent infinite loops, if they ever occur.
+ */
+#define MEM_CGROUP_MAX_RECLAIM_LOOPS 100
+#define MEM_CGROUP_MAX_SOFT_LIMIT_RECLAIM_LOOPS 2
+
+static void __mem_cgroup_insert_exceeded(struct mem_cgroup_per_node *mz,
+ struct mem_cgroup_tree_per_node *mctz,
+ unsigned long new_usage_in_excess)
+{
+ struct rb_node **p = &mctz->rb_root.rb_node;
+ struct rb_node *parent = NULL;
+ struct mem_cgroup_per_node *mz_node;
+ bool rightmost = true;
+
+ if (mz->on_tree)
+ return;
+
+ mz->usage_in_excess = new_usage_in_excess;
+ if (!mz->usage_in_excess)
+ return;
+ while (*p) {
+ parent = *p;
+ mz_node = rb_entry(parent, struct mem_cgroup_per_node,
+ tree_node);
+ if (mz->usage_in_excess < mz_node->usage_in_excess) {
+ p = &(*p)->rb_left;
+ rightmost = false;
+ } else {
+ p = &(*p)->rb_right;
+ }
+ }
+
+ if (rightmost)
+ mctz->rb_rightmost = &mz->tree_node;
+
+ rb_link_node(&mz->tree_node, parent, p);
+ rb_insert_color(&mz->tree_node, &mctz->rb_root);
+ mz->on_tree = true;
+}
+
+static void __mem_cgroup_remove_exceeded(struct mem_cgroup_per_node *mz,
+ struct mem_cgroup_tree_per_node *mctz)
+{
+ if (!mz->on_tree)
+ return;
+
+ if (&mz->tree_node == mctz->rb_rightmost)
+ mctz->rb_rightmost = rb_prev(&mz->tree_node);
+
+ rb_erase(&mz->tree_node, &mctz->rb_root);
+ mz->on_tree = false;
+}
+
+static void mem_cgroup_remove_exceeded(struct mem_cgroup_per_node *mz,
+ struct mem_cgroup_tree_per_node *mctz)
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(&mctz->lock, flags);
+ __mem_cgroup_remove_exceeded(mz, mctz);
+ spin_unlock_irqrestore(&mctz->lock, flags);
+}
+
+static unsigned long soft_limit_excess(struct mem_cgroup *memcg)
+{
+ unsigned long nr_pages = page_counter_read(&memcg->memory);
+ unsigned long soft_limit = READ_ONCE(memcg->soft_limit);
+ unsigned long excess = 0;
+
+ if (nr_pages > soft_limit)
+ excess = nr_pages - soft_limit;
+
+ return excess;
+}
+
+void mem_cgroup_update_tree(struct mem_cgroup *memcg, int nid)
+{
+ unsigned long excess;
+ struct mem_cgroup_per_node *mz;
+ struct mem_cgroup_tree_per_node *mctz;
+
+ if (lru_gen_enabled()) {
+ if (soft_limit_excess(memcg))
+ lru_gen_soft_reclaim(memcg, nid);
+ return;
+ }
+
+ mctz = soft_limit_tree.rb_tree_per_node[nid];
+ if (!mctz)
+ return;
+ /*
+ * Necessary to update all ancestors when hierarchy is used.
+ * because their event counter is not touched.
+ */
+ for (; memcg; memcg = parent_mem_cgroup(memcg)) {
+ mz = memcg->nodeinfo[nid];
+ excess = soft_limit_excess(memcg);
+ /*
+ * We have to update the tree if mz is on RB-tree or
+ * mem is over its softlimit.
+ */
+ if (excess || mz->on_tree) {
+ unsigned long flags;
+
+ spin_lock_irqsave(&mctz->lock, flags);
+ /* if on-tree, remove it */
+ if (mz->on_tree)
+ __mem_cgroup_remove_exceeded(mz, mctz);
+ /*
+ * Insert again. mz->usage_in_excess will be updated.
+ * If excess is 0, no tree ops.
+ */
+ __mem_cgroup_insert_exceeded(mz, mctz, excess);
+ spin_unlock_irqrestore(&mctz->lock, flags);
+ }
+ }
+}
+
+void mem_cgroup_remove_from_trees(struct mem_cgroup *memcg)
+{
+ struct mem_cgroup_tree_per_node *mctz;
+ struct mem_cgroup_per_node *mz;
+ int nid;
+
+ for_each_node(nid) {
+ mz = memcg->nodeinfo[nid];
+ mctz = soft_limit_tree.rb_tree_per_node[nid];
+ if (mctz)
+ mem_cgroup_remove_exceeded(mz, mctz);
+ }
+}
+
+static struct mem_cgroup_per_node *
+__mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_node *mctz)
+{
+ struct mem_cgroup_per_node *mz;
+
+retry:
+ mz = NULL;
+ if (!mctz->rb_rightmost)
+ goto done; /* Nothing to reclaim from */
+
+ mz = rb_entry(mctz->rb_rightmost,
+ struct mem_cgroup_per_node, tree_node);
+ /*
+ * Remove the node now but someone else can add it back,
+ * we will to add it back at the end of reclaim to its correct
+ * position in the tree.
+ */
+ __mem_cgroup_remove_exceeded(mz, mctz);
+ if (!soft_limit_excess(mz->memcg) ||
+ !css_tryget(&mz->memcg->css))
+ goto retry;
+done:
+ return mz;
+}
+
+static struct mem_cgroup_per_node *
+mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_node *mctz)
+{
+ struct mem_cgroup_per_node *mz;
+
+ spin_lock_irq(&mctz->lock);
+ mz = __mem_cgroup_largest_soft_limit_node(mctz);
+ spin_unlock_irq(&mctz->lock);
+ return mz;
+}
+
+static int mem_cgroup_soft_reclaim(struct mem_cgroup *root_memcg,
+ pg_data_t *pgdat,
+ gfp_t gfp_mask,
+ unsigned long *total_scanned)
+{
+ struct mem_cgroup *victim = NULL;
+ int total = 0;
+ int loop = 0;
+ unsigned long excess;
+ unsigned long nr_scanned;
+ struct mem_cgroup_reclaim_cookie reclaim = {
+ .pgdat = pgdat,
+ };
+
+ excess = soft_limit_excess(root_memcg);
+
+ while (1) {
+ victim = mem_cgroup_iter(root_memcg, victim, &reclaim);
+ if (!victim) {
+ loop++;
+ if (loop >= 2) {
+ /*
+ * If we have not been able to reclaim
+ * anything, it might because there are
+ * no reclaimable pages under this hierarchy
+ */
+ if (!total)
+ break;
+ /*
+ * We want to do more targeted reclaim.
+ * excess >> 2 is not to excessive so as to
+ * reclaim too much, nor too less that we keep
+ * coming back to reclaim from this cgroup
+ */
+ if (total >= (excess >> 2) ||
+ (loop > MEM_CGROUP_MAX_RECLAIM_LOOPS))
+ break;
+ }
+ continue;
+ }
+ total += mem_cgroup_shrink_node(victim, gfp_mask, false,
+ pgdat, &nr_scanned);
+ *total_scanned += nr_scanned;
+ if (!soft_limit_excess(root_memcg))
+ break;
+ }
+ mem_cgroup_iter_break(root_memcg, victim);
+ return total;
+}
+
+unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
+ gfp_t gfp_mask,
+ unsigned long *total_scanned)
+{
+ unsigned long nr_reclaimed = 0;
+ struct mem_cgroup_per_node *mz, *next_mz = NULL;
+ unsigned long reclaimed;
+ int loop = 0;
+ struct mem_cgroup_tree_per_node *mctz;
+ unsigned long excess;
+
+ if (lru_gen_enabled())
+ return 0;
+
+ if (order > 0)
+ return 0;
+
+ mctz = soft_limit_tree.rb_tree_per_node[pgdat->node_id];
+
+ /*
+ * Do not even bother to check the largest node if the root
+ * is empty. Do it lockless to prevent lock bouncing. Races
+ * are acceptable as soft limit is best effort anyway.
+ */
+ if (!mctz || RB_EMPTY_ROOT(&mctz->rb_root))
+ return 0;
+
+ /*
+ * This loop can run a while, specially if mem_cgroup's continuously
+ * keep exceeding their soft limit and putting the system under
+ * pressure
+ */
+ do {
+ if (next_mz)
+ mz = next_mz;
+ else
+ mz = mem_cgroup_largest_soft_limit_node(mctz);
+ if (!mz)
+ break;
+
+ reclaimed = mem_cgroup_soft_reclaim(mz->memcg, pgdat,
+ gfp_mask, total_scanned);
+ nr_reclaimed += reclaimed;
+ spin_lock_irq(&mctz->lock);
+
+ /*
+ * If we failed to reclaim anything from this memory cgroup
+ * it is time to move on to the next cgroup
+ */
+ next_mz = NULL;
+ if (!reclaimed)
+ next_mz = __mem_cgroup_largest_soft_limit_node(mctz);
+
+ excess = soft_limit_excess(mz->memcg);
+ /*
+ * One school of thought says that we should not add
+ * back the node to the tree if reclaim returns 0.
+ * But our reclaim could return 0, simply because due
+ * to priority we are exposing a smaller subset of
+ * memory to reclaim from. Consider this as a longer
+ * term TODO.
+ */
+ /* If excess == 0, no tree ops */
+ __mem_cgroup_insert_exceeded(mz, mctz, excess);
+ spin_unlock_irq(&mctz->lock);
+ css_put(&mz->memcg->css);
+ loop++;
+ /*
+ * Could not reclaim anything and there are no more
+ * mem cgroups to try or we seem to be looping without
+ * reclaiming anything.
+ */
+ if (!nr_reclaimed &&
+ (next_mz == NULL ||
+ loop > MEM_CGROUP_MAX_SOFT_LIMIT_RECLAIM_LOOPS))
+ break;
+ } while (!nr_reclaimed);
+ if (next_mz)
+ css_put(&next_mz->memcg->css);
+ return nr_reclaimed;
+}
+
+void mem_cgroup_soft_limit_reset(struct mem_cgroup *memcg)
+{
+ WRITE_ONCE(memcg->soft_limit, PAGE_COUNTER_MAX);
+}
+
+static int __init mem_cgroup_v1_init(void)
+{
+ int node;
+
+ for_each_node(node) {
+ struct mem_cgroup_tree_per_node *rtpn;
+
+ rtpn = kzalloc_node(sizeof(*rtpn), GFP_KERNEL, node);
+
+ rtpn->rb_root = RB_ROOT;
+ rtpn->rb_rightmost = NULL;
+ spin_lock_init(&rtpn->lock);
+ soft_limit_tree.rb_tree_per_node[node] = rtpn;
+ }
+
+ return 0;
+}
+subsys_initcall(mem_cgroup_v1_init);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 5d4da23264fa..0c2196f42631 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -108,23 +108,6 @@ static bool do_memsw_account(void)
#define THRESHOLDS_EVENTS_TARGET 128
#define SOFTLIMIT_EVENTS_TARGET 1024

-/*
- * Cgroups above their limits are maintained in a RB-Tree, independent of
- * their hierarchy representation
- */
-
-struct mem_cgroup_tree_per_node {
- struct rb_root rb_root;
- struct rb_node *rb_rightmost;
- spinlock_t lock;
-};
-
-struct mem_cgroup_tree {
- struct mem_cgroup_tree_per_node *rb_tree_per_node[MAX_NUMNODES];
-};
-
-static struct mem_cgroup_tree soft_limit_tree __read_mostly;
-
/* for OOM */
struct mem_cgroup_eventfd_list {
struct list_head list;
@@ -199,13 +182,6 @@ static struct move_charge_struct {
.waitq = __WAIT_QUEUE_HEAD_INITIALIZER(mc.waitq),
};

-/*
- * Maximum loops in mem_cgroup_soft_reclaim(), used for soft
- * limit reclaim to prevent infinite loops, if they ever occur.
- */
-#define MEM_CGROUP_MAX_RECLAIM_LOOPS 100
-#define MEM_CGROUP_MAX_SOFT_LIMIT_RECLAIM_LOOPS 2
-
/* for encoding cft->private value on file */
enum res_type {
_MEM,
@@ -413,169 +389,6 @@ ino_t page_cgroup_ino(struct page *page)
return ino;
}

-static void __mem_cgroup_insert_exceeded(struct mem_cgroup_per_node *mz,
- struct mem_cgroup_tree_per_node *mctz,
- unsigned long new_usage_in_excess)
-{
- struct rb_node **p = &mctz->rb_root.rb_node;
- struct rb_node *parent = NULL;
- struct mem_cgroup_per_node *mz_node;
- bool rightmost = true;
-
- if (mz->on_tree)
- return;
-
- mz->usage_in_excess = new_usage_in_excess;
- if (!mz->usage_in_excess)
- return;
- while (*p) {
- parent = *p;
- mz_node = rb_entry(parent, struct mem_cgroup_per_node,
- tree_node);
- if (mz->usage_in_excess < mz_node->usage_in_excess) {
- p = &(*p)->rb_left;
- rightmost = false;
- } else {
- p = &(*p)->rb_right;
- }
- }
-
- if (rightmost)
- mctz->rb_rightmost = &mz->tree_node;
-
- rb_link_node(&mz->tree_node, parent, p);
- rb_insert_color(&mz->tree_node, &mctz->rb_root);
- mz->on_tree = true;
-}
-
-static void __mem_cgroup_remove_exceeded(struct mem_cgroup_per_node *mz,
- struct mem_cgroup_tree_per_node *mctz)
-{
- if (!mz->on_tree)
- return;
-
- if (&mz->tree_node == mctz->rb_rightmost)
- mctz->rb_rightmost = rb_prev(&mz->tree_node);
-
- rb_erase(&mz->tree_node, &mctz->rb_root);
- mz->on_tree = false;
-}
-
-static void mem_cgroup_remove_exceeded(struct mem_cgroup_per_node *mz,
- struct mem_cgroup_tree_per_node *mctz)
-{
- unsigned long flags;
-
- spin_lock_irqsave(&mctz->lock, flags);
- __mem_cgroup_remove_exceeded(mz, mctz);
- spin_unlock_irqrestore(&mctz->lock, flags);
-}
-
-static unsigned long soft_limit_excess(struct mem_cgroup *memcg)
-{
- unsigned long nr_pages = page_counter_read(&memcg->memory);
- unsigned long soft_limit = READ_ONCE(memcg->soft_limit);
- unsigned long excess = 0;
-
- if (nr_pages > soft_limit)
- excess = nr_pages - soft_limit;
-
- return excess;
-}
-
-static void mem_cgroup_update_tree(struct mem_cgroup *memcg, int nid)
-{
- unsigned long excess;
- struct mem_cgroup_per_node *mz;
- struct mem_cgroup_tree_per_node *mctz;
-
- if (lru_gen_enabled()) {
- if (soft_limit_excess(memcg))
- lru_gen_soft_reclaim(memcg, nid);
- return;
- }
-
- mctz = soft_limit_tree.rb_tree_per_node[nid];
- if (!mctz)
- return;
- /*
- * Necessary to update all ancestors when hierarchy is used.
- * because their event counter is not touched.
- */
- for (; memcg; memcg = parent_mem_cgroup(memcg)) {
- mz = memcg->nodeinfo[nid];
- excess = soft_limit_excess(memcg);
- /*
- * We have to update the tree if mz is on RB-tree or
- * mem is over its softlimit.
- */
- if (excess || mz->on_tree) {
- unsigned long flags;
-
- spin_lock_irqsave(&mctz->lock, flags);
- /* if on-tree, remove it */
- if (mz->on_tree)
- __mem_cgroup_remove_exceeded(mz, mctz);
- /*
- * Insert again. mz->usage_in_excess will be updated.
- * If excess is 0, no tree ops.
- */
- __mem_cgroup_insert_exceeded(mz, mctz, excess);
- spin_unlock_irqrestore(&mctz->lock, flags);
- }
- }
-}
-
-static void mem_cgroup_remove_from_trees(struct mem_cgroup *memcg)
-{
- struct mem_cgroup_tree_per_node *mctz;
- struct mem_cgroup_per_node *mz;
- int nid;
-
- for_each_node(nid) {
- mz = memcg->nodeinfo[nid];
- mctz = soft_limit_tree.rb_tree_per_node[nid];
- if (mctz)
- mem_cgroup_remove_exceeded(mz, mctz);
- }
-}
-
-static struct mem_cgroup_per_node *
-__mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_node *mctz)
-{
- struct mem_cgroup_per_node *mz;
-
-retry:
- mz = NULL;
- if (!mctz->rb_rightmost)
- goto done; /* Nothing to reclaim from */
-
- mz = rb_entry(mctz->rb_rightmost,
- struct mem_cgroup_per_node, tree_node);
- /*
- * Remove the node now but someone else can add it back,
- * we will to add it back at the end of reclaim to its correct
- * position in the tree.
- */
- __mem_cgroup_remove_exceeded(mz, mctz);
- if (!soft_limit_excess(mz->memcg) ||
- !css_tryget(&mz->memcg->css))
- goto retry;
-done:
- return mz;
-}
-
-static struct mem_cgroup_per_node *
-mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_node *mctz)
-{
- struct mem_cgroup_per_node *mz;
-
- spin_lock_irq(&mctz->lock);
- mz = __mem_cgroup_largest_soft_limit_node(mctz);
- spin_unlock_irq(&mctz->lock);
- return mz;
-}
-
/* Subset of node_stat_item for memcg stats */
static const unsigned int memcg_node_stat_items[] = {
NR_INACTIVE_ANON,
@@ -1983,56 +1796,6 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
return ret;
}

-static int mem_cgroup_soft_reclaim(struct mem_cgroup *root_memcg,
- pg_data_t *pgdat,
- gfp_t gfp_mask,
- unsigned long *total_scanned)
-{
- struct mem_cgroup *victim = NULL;
- int total = 0;
- int loop = 0;
- unsigned long excess;
- unsigned long nr_scanned;
- struct mem_cgroup_reclaim_cookie reclaim = {
- .pgdat = pgdat,
- };
-
- excess = soft_limit_excess(root_memcg);
-
- while (1) {
- victim = mem_cgroup_iter(root_memcg, victim, &reclaim);
- if (!victim) {
- loop++;
- if (loop >= 2) {
- /*
- * If we have not been able to reclaim
- * anything, it might because there are
- * no reclaimable pages under this hierarchy
- */
- if (!total)
- break;
- /*
- * We want to do more targeted reclaim.
- * excess >> 2 is not to excessive so as to
- * reclaim too much, nor too less that we keep
- * coming back to reclaim from this cgroup
- */
- if (total >= (excess >> 2) ||
- (loop > MEM_CGROUP_MAX_RECLAIM_LOOPS))
- break;
- }
- continue;
- }
- total += mem_cgroup_shrink_node(victim, gfp_mask, false,
- pgdat, &nr_scanned);
- *total_scanned += nr_scanned;
- if (!soft_limit_excess(root_memcg))
- break;
- }
- mem_cgroup_iter_break(root_memcg, victim);
- return total;
-}
-
#ifdef CONFIG_LOCKDEP
static struct lockdep_map memcg_oom_lock_dep_map = {
.name = "memcg_oom_lock",
@@ -3932,88 +3695,6 @@ static int mem_cgroup_resize_max(struct mem_cgroup *memcg,
return ret;
}

-unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
- gfp_t gfp_mask,
- unsigned long *total_scanned)
-{
- unsigned long nr_reclaimed = 0;
- struct mem_cgroup_per_node *mz, *next_mz = NULL;
- unsigned long reclaimed;
- int loop = 0;
- struct mem_cgroup_tree_per_node *mctz;
- unsigned long excess;
-
- if (lru_gen_enabled())
- return 0;
-
- if (order > 0)
- return 0;
-
- mctz = soft_limit_tree.rb_tree_per_node[pgdat->node_id];
-
- /*
- * Do not even bother to check the largest node if the root
- * is empty. Do it lockless to prevent lock bouncing. Races
- * are acceptable as soft limit is best effort anyway.
- */
- if (!mctz || RB_EMPTY_ROOT(&mctz->rb_root))
- return 0;
-
- /*
- * This loop can run a while, specially if mem_cgroup's continuously
- * keep exceeding their soft limit and putting the system under
- * pressure
- */
- do {
- if (next_mz)
- mz = next_mz;
- else
- mz = mem_cgroup_largest_soft_limit_node(mctz);
- if (!mz)
- break;
-
- reclaimed = mem_cgroup_soft_reclaim(mz->memcg, pgdat,
- gfp_mask, total_scanned);
- nr_reclaimed += reclaimed;
- spin_lock_irq(&mctz->lock);
-
- /*
- * If we failed to reclaim anything from this memory cgroup
- * it is time to move on to the next cgroup
- */
- next_mz = NULL;
- if (!reclaimed)
- next_mz = __mem_cgroup_largest_soft_limit_node(mctz);
-
- excess = soft_limit_excess(mz->memcg);
- /*
- * One school of thought says that we should not add
- * back the node to the tree if reclaim returns 0.
- * But our reclaim could return 0, simply because due
- * to priority we are exposing a smaller subset of
- * memory to reclaim from. Consider this as a longer
- * term TODO.
- */
- /* If excess == 0, no tree ops */
- __mem_cgroup_insert_exceeded(mz, mctz, excess);
- spin_unlock_irq(&mctz->lock);
- css_put(&mz->memcg->css);
- loop++;
- /*
- * Could not reclaim anything and there are no more
- * mem cgroups to try or we seem to be looping without
- * reclaiming anything.
- */
- if (!nr_reclaimed &&
- (next_mz == NULL ||
- loop > MEM_CGROUP_MAX_SOFT_LIMIT_RECLAIM_LOOPS))
- break;
- } while (!nr_reclaimed);
- if (next_mz)
- css_put(&next_mz->memcg->css);
- return nr_reclaimed;
-}
-
/*
* Reclaims as many pages from the given memcg as possible.
*
@@ -5791,7 +5472,7 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
return ERR_CAST(memcg);

page_counter_set_high(&memcg->memory, PAGE_COUNTER_MAX);
- WRITE_ONCE(memcg->soft_limit, PAGE_COUNTER_MAX);
+ mem_cgroup_soft_limit_reset(memcg);
#if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_ZSWAP)
memcg->zswap_max = PAGE_COUNTER_MAX;
WRITE_ONCE(memcg->zswap_writeback,
@@ -5964,7 +5645,7 @@ static void mem_cgroup_css_reset(struct cgroup_subsys_state *css)
page_counter_set_min(&memcg->memory, 0);
page_counter_set_low(&memcg->memory, 0);
page_counter_set_high(&memcg->memory, PAGE_COUNTER_MAX);
- WRITE_ONCE(memcg->soft_limit, PAGE_COUNTER_MAX);
+ mem_cgroup_soft_limit_reset(memcg);
page_counter_set_high(&memcg->swap, PAGE_COUNTER_MAX);
memcg_wb_domain_size_changed(memcg);
}
@@ -7992,7 +7673,7 @@ __setup("cgroup.memory=", cgroup_memory);
*/
static int __init mem_cgroup_init(void)
{
- int cpu, node;
+ int cpu;

/*
* Currently s32 type (can refer to struct batched_lruvec_stat) is
@@ -8009,17 +7690,6 @@ static int __init mem_cgroup_init(void)
INIT_WORK(&per_cpu_ptr(&memcg_stock, cpu)->work,
drain_local_stock);

- for_each_node(node) {
- struct mem_cgroup_tree_per_node *rtpn;
-
- rtpn = kzalloc_node(sizeof(*rtpn), GFP_KERNEL, node);
-
- rtpn->rb_root = RB_ROOT;
- rtpn->rb_rightmost = NULL;
- spin_lock_init(&rtpn->lock);
- soft_limit_tree.rb_tree_per_node[node] = rtpn;
- }
-
return 0;
}
subsys_initcall(mem_cgroup_init);
--
2.43.2


2024-05-09 03:43:01

by Roman Gushchin

Subject: [PATCH rfc 4/9] mm: memcg: move legacy memcg event code into memcontrol-v1.c

Cgroup v1's memory controller contains a pretty complicated
event notification mechanism which is not used by cgroup v2.
Let's move the corresponding code into memcontrol-v1.c.

Please note that mem_cgroup_event_ratelimit() remains in
memcontrol.c; moving it would require exporting too many
details of the memcg stats outside of memcontrol.c.
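
As a condensed sketch of the resulting boundary (simplified from the hunks
below; the IS_ENABLED(CONFIG_PREEMPT_RT) bail-out and the unlikely() hints are
omitted here): the ratelimit helper stays in memcontrol.c and is only declared
for the v1 code via mm/internal.h, while the event dispatch itself lives in
memcontrol-v1.c.

/* mm/internal.h */
bool mem_cgroup_event_ratelimit(struct mem_cgroup *memcg,
				enum mem_cgroup_events_target target);

/* mm/memcontrol-v1.c, simplified */
void memcg_check_events(struct mem_cgroup *memcg, int nid)
{
	/* threshold events are triggered at a finer grain than soft limit */
	if (mem_cgroup_event_ratelimit(memcg, MEM_CGROUP_TARGET_THRESH)) {
		bool do_softlimit;

		do_softlimit = mem_cgroup_event_ratelimit(memcg,
					MEM_CGROUP_TARGET_SOFTLIMIT);
		mem_cgroup_threshold(memcg);
		if (do_softlimit)
			mem_cgroup_update_tree(memcg, nid);
	}
}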

Signed-off-by: Roman Gushchin <[email protected]>
---
include/linux/memcontrol.h | 12 -
mm/internal.h | 55 ++-
mm/memcontrol-v1.c | 653 +++++++++++++++++++++++++++++++++++
mm/memcontrol.c | 687 +------------------------------------
4 files changed, 712 insertions(+), 695 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 60355f3cb67c..fc4aaa73aa5e 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -69,18 +69,6 @@ struct mem_cgroup_id {
refcount_t ref;
};

-/*
- * Per memcg event counter is incremented at every pagein/pageout. With THP,
- * it will be incremented by the number of pages. This counter is used
- * to trigger some periodic events. This is straightforward and better
- * than using jiffies etc. to handle periodic memcg event.
- */
-enum mem_cgroup_events_target {
- MEM_CGROUP_TARGET_THRESH,
- MEM_CGROUP_TARGET_SOFTLIMIT,
- MEM_CGROUP_NTARGETS,
-};
-
struct memcg_vmstats_percpu;
struct memcg_vmstats;
struct lruvec_stats_percpu;
diff --git a/mm/internal.h b/mm/internal.h
index 9ffd48375ae5..79104cfc08a9 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1521,6 +1521,29 @@ static inline void shrinker_debugfs_remove(struct dentry *debugfs_entry,
void workingset_update_node(struct xa_node *node);
extern struct list_lru shadow_nodes;

+#ifdef CONFIG_CGROUPS
+/* Whether legacy memory+swap accounting is active */
+static inline bool do_memsw_account(void)
+{
+ return !cgroup_subsys_on_dfl(memory_cgrp_subsys);
+}
+#endif
+
+/*
+ * Iteration constructs for visiting all cgroups (under a tree). If
+ * loops are exited prematurely (break), mem_cgroup_iter_break() must
+ * be used for reference counting.
+ */
+#define for_each_mem_cgroup_tree(iter, root) \
+ for (iter = mem_cgroup_iter(root, NULL, NULL); \
+ iter != NULL; \
+ iter = mem_cgroup_iter(root, iter, NULL))
+
+#define for_each_mem_cgroup(iter) \
+ for (iter = mem_cgroup_iter(NULL, NULL, NULL); \
+ iter != NULL; \
+ iter = mem_cgroup_iter(NULL, iter, NULL))
+
/* Memcontrol definitions used by memory cgroups v1-specific code */
int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
unsigned int nr_pages);
@@ -1535,11 +1558,27 @@ static inline int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
}

void mem_cgroup_charge_statistics(struct mem_cgroup *memcg, int nr_pages);
-void memcg_check_events(struct mem_cgroup *memcg, int nid);
void memcg_oom_recover(struct mem_cgroup *memcg);
void mem_cgroup_id_get_many(struct mem_cgroup *memcg, unsigned int n);
void mem_cgroup_id_put_many(struct mem_cgroup *memcg, unsigned int n);

+unsigned long mem_cgroup_usage(struct mem_cgroup *memcg, bool swap);
+
+/*
+ * Per memcg event counter is incremented at every pagein/pageout. With THP,
+ * it will be incremented by the number of pages. This counter is used
+ * to trigger some periodic events. This is straightforward and better
+ * than using jiffies etc. to handle periodic memcg event.
+ */
+enum mem_cgroup_events_target {
+ MEM_CGROUP_TARGET_THRESH,
+ MEM_CGROUP_TARGET_SOFTLIMIT,
+ MEM_CGROUP_NTARGETS,
+};
+
+bool mem_cgroup_event_ratelimit(struct mem_cgroup *memcg,
+ enum mem_cgroup_events_target target);
+
/* Memory cgroups v1-specific definitions */
void mem_cgroup_update_tree(struct mem_cgroup *memcg, int nid);
void mem_cgroup_remove_from_trees(struct mem_cgroup *memcg);
@@ -1557,4 +1596,18 @@ void mem_cgroup_cancel_attach(struct cgroup_taskset *tset);
bool mem_cgroup_wait_acct_move(struct mem_cgroup *memcg);
void mem_cgroup_move_task(void);

+/* for encoding cft->private value on file */
+enum res_type {
+ _MEM,
+ _MEMSWAP,
+ _KMEM,
+ _TCP,
+};
+
+void memcg_check_events(struct mem_cgroup *memcg, int nid);
+void mem_cgroup_oom_notify(struct mem_cgroup *memcg);
+ssize_t memcg_write_event_control(struct kernfs_open_file *of, char *buf,
+ size_t nbytes, loff_t off);
+void mem_cgroup_v1_offline_memcg(struct mem_cgroup *memcg);
+
#endif /* __MM_INTERNAL_H */
diff --git a/mm/memcontrol-v1.c b/mm/memcontrol-v1.c
index b39bfa4a8de6..5bf0b62cd7b5 100644
--- a/mm/memcontrol-v1.c
+++ b/mm/memcontrol-v1.c
@@ -4,8 +4,12 @@
#include <linux/pagewalk.h>
#include <linux/backing-dev.h>
#include <linux/swap.h>
+#include <linux/eventfd.h>
+#include <linux/poll.h>
+#include <linux/sort.h>
#include <linux/mm_inline.h>
#include <linux/swap_cgroup.h>
+#include <linux/file.h>

#include "internal.h"
#include "swap.h"
@@ -59,6 +63,54 @@ static struct move_charge_struct {
.waitq = __WAIT_QUEUE_HEAD_INITIALIZER(mc.waitq),
};

+/*
+ * cgroup_event represents events which userspace want to receive.
+ */
+struct mem_cgroup_event {
+ /*
+ * memcg which the event belongs to.
+ */
+ struct mem_cgroup *memcg;
+ /*
+ * eventfd to signal userspace about the event.
+ */
+ struct eventfd_ctx *eventfd;
+ /*
+ * Each of these stored in a list by the cgroup.
+ */
+ struct list_head list;
+ /*
+ * register_event() callback will be used to add new userspace
+ * waiter for changes related to this event. Use eventfd_signal()
+ * on eventfd to send notification to userspace.
+ */
+ int (*register_event)(struct mem_cgroup *memcg,
+ struct eventfd_ctx *eventfd, const char *args);
+ /*
+ * unregister_event() callback will be called when userspace closes
+ * the eventfd or on cgroup removing. This callback must be set,
+ * if you want provide notification functionality.
+ */
+ void (*unregister_event)(struct mem_cgroup *memcg,
+ struct eventfd_ctx *eventfd);
+ /*
+ * All fields below needed to unregister event when
+ * userspace closes eventfd.
+ */
+ poll_table pt;
+ wait_queue_head_t *wqh;
+ wait_queue_entry_t wait;
+ struct work_struct remove;
+};
+
+/* for OOM */
+struct mem_cgroup_eventfd_list {
+ struct list_head list;
+ struct eventfd_ctx *eventfd;
+};
+
+extern spinlock_t memcg_oom_lock;
+
static void __mem_cgroup_insert_exceeded(struct mem_cgroup_per_node *mz,
struct mem_cgroup_tree_per_node *mctz,
unsigned long new_usage_in_excess)
@@ -1310,6 +1362,607 @@ void mem_cgroup_move_task(void)
}
#endif

+static void __mem_cgroup_threshold(struct mem_cgroup *memcg, bool swap)
+{
+ struct mem_cgroup_threshold_ary *t;
+ unsigned long usage;
+ int i;
+
+ rcu_read_lock();
+ if (!swap)
+ t = rcu_dereference(memcg->thresholds.primary);
+ else
+ t = rcu_dereference(memcg->memsw_thresholds.primary);
+
+ if (!t)
+ goto unlock;
+
+ usage = mem_cgroup_usage(memcg, swap);
+
+ /*
+ * current_threshold points to threshold just below or equal to usage.
+ * If it's not true, a threshold was crossed after last
+ * call of __mem_cgroup_threshold().
+ */
+ i = t->current_threshold;
+
+ /*
+ * Iterate backward over array of thresholds starting from
+ * current_threshold and check if a threshold is crossed.
+ * If none of thresholds below usage is crossed, we read
+ * only one element of the array here.
+ */
+ for (; i >= 0 && unlikely(t->entries[i].threshold > usage); i--)
+ eventfd_signal(t->entries[i].eventfd);
+
+ /* i = current_threshold + 1 */
+ i++;
+
+ /*
+ * Iterate forward over array of thresholds starting from
+ * current_threshold+1 and check if a threshold is crossed.
+ * If none of thresholds above usage is crossed, we read
+ * only one element of the array here.
+ */
+ for (; i < t->size && unlikely(t->entries[i].threshold <= usage); i++)
+ eventfd_signal(t->entries[i].eventfd);
+
+ /* Update current_threshold */
+ t->current_threshold = i - 1;
+unlock:
+ rcu_read_unlock();
+}
+
+static void mem_cgroup_threshold(struct mem_cgroup *memcg)
+{
+ while (memcg) {
+ __mem_cgroup_threshold(memcg, false);
+ if (do_memsw_account())
+ __mem_cgroup_threshold(memcg, true);
+
+ memcg = parent_mem_cgroup(memcg);
+ }
+}
+
+/*
+ * Check events in order.
+ *
+ */
+void memcg_check_events(struct mem_cgroup *memcg, int nid)
+{
+ if (IS_ENABLED(CONFIG_PREEMPT_RT))
+ return;
+
+ /* threshold event is triggered in finer grain than soft limit */
+ if (unlikely(mem_cgroup_event_ratelimit(memcg,
+ MEM_CGROUP_TARGET_THRESH))) {
+ bool do_softlimit;
+
+ do_softlimit = mem_cgroup_event_ratelimit(memcg,
+ MEM_CGROUP_TARGET_SOFTLIMIT);
+ mem_cgroup_threshold(memcg);
+ if (unlikely(do_softlimit))
+ mem_cgroup_update_tree(memcg, nid);
+ }
+}
+
+static int mem_cgroup_oom_notify_cb(struct mem_cgroup *memcg)
+{
+ struct mem_cgroup_eventfd_list *ev;
+
+ spin_lock(&memcg_oom_lock);
+
+ list_for_each_entry(ev, &memcg->oom_notify, list)
+ eventfd_signal(ev->eventfd);
+
+ spin_unlock(&memcg_oom_lock);
+ return 0;
+}
+
+void mem_cgroup_oom_notify(struct mem_cgroup *memcg)
+{
+ struct mem_cgroup *iter;
+
+ for_each_mem_cgroup_tree(iter, memcg)
+ mem_cgroup_oom_notify_cb(iter);
+}
+
+static int compare_thresholds(const void *a, const void *b)
+{
+ const struct mem_cgroup_threshold *_a = a;
+ const struct mem_cgroup_threshold *_b = b;
+
+ if (_a->threshold > _b->threshold)
+ return 1;
+
+ if (_a->threshold < _b->threshold)
+ return -1;
+
+ return 0;
+}
+
+static int __mem_cgroup_usage_register_event(struct mem_cgroup *memcg,
+ struct eventfd_ctx *eventfd, const char *args, enum res_type type)
+{
+ struct mem_cgroup_thresholds *thresholds;
+ struct mem_cgroup_threshold_ary *new;
+ unsigned long threshold;
+ unsigned long usage;
+ int i, size, ret;
+
+ ret = page_counter_memparse(args, "-1", &threshold);
+ if (ret)
+ return ret;
+
+ mutex_lock(&memcg->thresholds_lock);
+
+ if (type == _MEM) {
+ thresholds = &memcg->thresholds;
+ usage = mem_cgroup_usage(memcg, false);
+ } else if (type == _MEMSWAP) {
+ thresholds = &memcg->memsw_thresholds;
+ usage = mem_cgroup_usage(memcg, true);
+ } else
+ BUG();
+
+ /* Check if a threshold crossed before adding a new one */
+ if (thresholds->primary)
+ __mem_cgroup_threshold(memcg, type == _MEMSWAP);
+
+ size = thresholds->primary ? thresholds->primary->size + 1 : 1;
+
+ /* Allocate memory for new array of thresholds */
+ new = kmalloc(struct_size(new, entries, size), GFP_KERNEL);
+ if (!new) {
+ ret = -ENOMEM;
+ goto unlock;
+ }
+ new->size = size;
+
+ /* Copy thresholds (if any) to new array */
+ if (thresholds->primary)
+ memcpy(new->entries, thresholds->primary->entries,
+ flex_array_size(new, entries, size - 1));
+
+ /* Add new threshold */
+ new->entries[size - 1].eventfd = eventfd;
+ new->entries[size - 1].threshold = threshold;
+
+ /* Sort thresholds. Registering of new threshold isn't time-critical */
+ sort(new->entries, size, sizeof(*new->entries),
+ compare_thresholds, NULL);
+
+ /* Find current threshold */
+ new->current_threshold = -1;
+ for (i = 0; i < size; i++) {
+ if (new->entries[i].threshold <= usage) {
+ /*
+ * new->current_threshold will not be used until
+ * rcu_assign_pointer(), so it's safe to increment
+ * it here.
+ */
+ ++new->current_threshold;
+ } else
+ break;
+ }
+
+ /* Free old spare buffer and save old primary buffer as spare */
+ kfree(thresholds->spare);
+ thresholds->spare = thresholds->primary;
+
+ rcu_assign_pointer(thresholds->primary, new);
+
+ /* To be sure that nobody uses thresholds */
+ synchronize_rcu();
+
+unlock:
+ mutex_unlock(&memcg->thresholds_lock);
+
+ return ret;
+}
+
+static int mem_cgroup_usage_register_event(struct mem_cgroup *memcg,
+ struct eventfd_ctx *eventfd, const char *args)
+{
+ return __mem_cgroup_usage_register_event(memcg, eventfd, args, _MEM);
+}
+
+static int memsw_cgroup_usage_register_event(struct mem_cgroup *memcg,
+ struct eventfd_ctx *eventfd, const char *args)
+{
+ return __mem_cgroup_usage_register_event(memcg, eventfd, args, _MEMSWAP);
+}
+
+static void __mem_cgroup_usage_unregister_event(struct mem_cgroup *memcg,
+ struct eventfd_ctx *eventfd, enum res_type type)
+{
+ struct mem_cgroup_thresholds *thresholds;
+ struct mem_cgroup_threshold_ary *new;
+ unsigned long usage;
+ int i, j, size, entries;
+
+ mutex_lock(&memcg->thresholds_lock);
+
+ if (type == _MEM) {
+ thresholds = &memcg->thresholds;
+ usage = mem_cgroup_usage(memcg, false);
+ } else if (type == _MEMSWAP) {
+ thresholds = &memcg->memsw_thresholds;
+ usage = mem_cgroup_usage(memcg, true);
+ } else
+ BUG();
+
+ if (!thresholds->primary)
+ goto unlock;
+
+ /* Check if a threshold crossed before removing */
+ __mem_cgroup_threshold(memcg, type == _MEMSWAP);
+
+ /* Calculate new number of threshold */
+ size = entries = 0;
+ for (i = 0; i < thresholds->primary->size; i++) {
+ if (thresholds->primary->entries[i].eventfd != eventfd)
+ size++;
+ else
+ entries++;
+ }
+
+ new = thresholds->spare;
+
+ /* If no items related to eventfd have been cleared, nothing to do */
+ if (!entries)
+ goto unlock;
+
+ /* Set thresholds array to NULL if we don't have thresholds */
+ if (!size) {
+ kfree(new);
+ new = NULL;
+ goto swap_buffers;
+ }
+
+ new->size = size;
+
+ /* Copy thresholds and find current threshold */
+ new->current_threshold = -1;
+ for (i = 0, j = 0; i < thresholds->primary->size; i++) {
+ if (thresholds->primary->entries[i].eventfd == eventfd)
+ continue;
+
+ new->entries[j] = thresholds->primary->entries[i];
+ if (new->entries[j].threshold <= usage) {
+ /*
+ * new->current_threshold will not be used
+ * until rcu_assign_pointer(), so it's safe to increment
+ * it here.
+ */
+ ++new->current_threshold;
+ }
+ j++;
+ }
+
+swap_buffers:
+ /* Swap primary and spare array */
+ thresholds->spare = thresholds->primary;
+
+ rcu_assign_pointer(thresholds->primary, new);
+
+ /* To be sure that nobody uses thresholds */
+ synchronize_rcu();
+
+ /* If all events are unregistered, free the spare array */
+ if (!new) {
+ kfree(thresholds->spare);
+ thresholds->spare = NULL;
+ }
+unlock:
+ mutex_unlock(&memcg->thresholds_lock);
+}
+
+static void mem_cgroup_usage_unregister_event(struct mem_cgroup *memcg,
+ struct eventfd_ctx *eventfd)
+{
+ return __mem_cgroup_usage_unregister_event(memcg, eventfd, _MEM);
+}
+
+static void memsw_cgroup_usage_unregister_event(struct mem_cgroup *memcg,
+ struct eventfd_ctx *eventfd)
+{
+ return __mem_cgroup_usage_unregister_event(memcg, eventfd, _MEMSWAP);
+}
+
+static int mem_cgroup_oom_register_event(struct mem_cgroup *memcg,
+ struct eventfd_ctx *eventfd, const char *args)
+{
+ struct mem_cgroup_eventfd_list *event;
+
+ event = kmalloc(sizeof(*event), GFP_KERNEL);
+ if (!event)
+ return -ENOMEM;
+
+ spin_lock(&memcg_oom_lock);
+
+ event->eventfd = eventfd;
+ list_add(&event->list, &memcg->oom_notify);
+
+ /* already in OOM ? */
+ if (memcg->under_oom)
+ eventfd_signal(eventfd);
+ spin_unlock(&memcg_oom_lock);
+
+ return 0;
+}
+
+static void mem_cgroup_oom_unregister_event(struct mem_cgroup *memcg,
+ struct eventfd_ctx *eventfd)
+{
+ struct mem_cgroup_eventfd_list *ev, *tmp;
+
+ spin_lock(&memcg_oom_lock);
+
+ list_for_each_entry_safe(ev, tmp, &memcg->oom_notify, list) {
+ if (ev->eventfd == eventfd) {
+ list_del(&ev->list);
+ kfree(ev);
+ }
+ }
+
+ spin_unlock(&memcg_oom_lock);
+}
+
+/*
+ * DO NOT USE IN NEW FILES.
+ *
+ * "cgroup.event_control" implementation.
+ *
+ * This is way over-engineered. It tries to support fully configurable
+ * events for each user. Such level of flexibility is completely
+ * unnecessary especially in the light of the planned unified hierarchy.
+ *
+ * Please deprecate this and replace with something simpler if at all
+ * possible.
+ */
+
+/*
+ * Unregister event and free resources.
+ *
+ * Gets called from workqueue.
+ */
+static void memcg_event_remove(struct work_struct *work)
+{
+ struct mem_cgroup_event *event =
+ container_of(work, struct mem_cgroup_event, remove);
+ struct mem_cgroup *memcg = event->memcg;
+
+ remove_wait_queue(event->wqh, &event->wait);
+
+ event->unregister_event(memcg, event->eventfd);
+
+ /* Notify userspace the event is going away. */
+ eventfd_signal(event->eventfd);
+
+ eventfd_ctx_put(event->eventfd);
+ kfree(event);
+ css_put(&memcg->css);
+}
+
+/*
+ * Gets called on EPOLLHUP on eventfd when user closes it.
+ *
+ * Called with wqh->lock held and interrupts disabled.
+ */
+static int memcg_event_wake(wait_queue_entry_t *wait, unsigned mode,
+ int sync, void *key)
+{
+ struct mem_cgroup_event *event =
+ container_of(wait, struct mem_cgroup_event, wait);
+ struct mem_cgroup *memcg = event->memcg;
+ __poll_t flags = key_to_poll(key);
+
+ if (flags & EPOLLHUP) {
+ /*
+ * If the event has been detached at cgroup removal, we
+ * can simply return knowing the other side will cleanup
+ * for us.
+ *
+ * We can't race against event freeing since the other
+ * side will require wqh->lock via remove_wait_queue(),
+ * which we hold.
+ */
+ spin_lock(&memcg->event_list_lock);
+ if (!list_empty(&event->list)) {
+ list_del_init(&event->list);
+ /*
+ * We are in atomic context, but cgroup_event_remove()
+ * may sleep, so we have to call it in workqueue.
+ */
+ schedule_work(&event->remove);
+ }
+ spin_unlock(&memcg->event_list_lock);
+ }
+
+ return 0;
+}
+
+static void memcg_event_ptable_queue_proc(struct file *file,
+ wait_queue_head_t *wqh, poll_table *pt)
+{
+ struct mem_cgroup_event *event =
+ container_of(pt, struct mem_cgroup_event, pt);
+
+ event->wqh = wqh;
+ add_wait_queue(wqh, &event->wait);
+}
+
+/*
+ * DO NOT USE IN NEW FILES.
+ *
+ * Parse input and register new cgroup event handler.
+ *
+ * Input must be in format '<event_fd> <control_fd> <args>'.
+ * Interpretation of args is defined by control file implementation.
+ */
+ssize_t memcg_write_event_control(struct kernfs_open_file *of, char *buf,
+ size_t nbytes, loff_t off)
+{
+ struct cgroup_subsys_state *css = of_css(of);
+ struct mem_cgroup *memcg = mem_cgroup_from_css(css);
+ struct mem_cgroup_event *event;
+ struct cgroup_subsys_state *cfile_css;
+ unsigned int efd, cfd;
+ struct fd efile;
+ struct fd cfile;
+ struct dentry *cdentry;
+ const char *name;
+ char *endp;
+ int ret;
+
+ if (IS_ENABLED(CONFIG_PREEMPT_RT))
+ return -EOPNOTSUPP;
+
+ buf = strstrip(buf);
+
+ efd = simple_strtoul(buf, &endp, 10);
+ if (*endp != ' ')
+ return -EINVAL;
+ buf = endp + 1;
+
+ cfd = simple_strtoul(buf, &endp, 10);
+ if ((*endp != ' ') && (*endp != '\0'))
+ return -EINVAL;
+ buf = endp + 1;
+
+ event = kzalloc(sizeof(*event), GFP_KERNEL);
+ if (!event)
+ return -ENOMEM;
+
+ event->memcg = memcg;
+ INIT_LIST_HEAD(&event->list);
+ init_poll_funcptr(&event->pt, memcg_event_ptable_queue_proc);
+ init_waitqueue_func_entry(&event->wait, memcg_event_wake);
+ INIT_WORK(&event->remove, memcg_event_remove);
+
+ efile = fdget(efd);
+ if (!efile.file) {
+ ret = -EBADF;
+ goto out_kfree;
+ }
+
+ event->eventfd = eventfd_ctx_fileget(efile.file);
+ if (IS_ERR(event->eventfd)) {
+ ret = PTR_ERR(event->eventfd);
+ goto out_put_efile;
+ }
+
+ cfile = fdget(cfd);
+ if (!cfile.file) {
+ ret = -EBADF;
+ goto out_put_eventfd;
+ }
+
+ /* the process need read permission on control file */
+ /* AV: shouldn't we check that it's been opened for read instead? */
+ ret = file_permission(cfile.file, MAY_READ);
+ if (ret < 0)
+ goto out_put_cfile;
+
+ /*
+ * The control file must be a regular cgroup1 file. As a regular cgroup
+ * file can't be renamed, it's safe to access its name afterwards.
+ */
+ cdentry = cfile.file->f_path.dentry;
+ if (cdentry->d_sb->s_type != &cgroup_fs_type || !d_is_reg(cdentry)) {
+ ret = -EINVAL;
+ goto out_put_cfile;
+ }
+
+ /*
+ * Determine the event callbacks and set them in @event. This used
+ * to be done via struct cftype but cgroup core no longer knows
+ * about these events. The following is crude but the whole thing
+ * is for compatibility anyway.
+ *
+ * DO NOT ADD NEW FILES.
+ */
+ name = cdentry->d_name.name;
+
+ if (!strcmp(name, "memory.usage_in_bytes")) {
+ event->register_event = mem_cgroup_usage_register_event;
+ event->unregister_event = mem_cgroup_usage_unregister_event;
+ } else if (!strcmp(name, "memory.oom_control")) {
+ event->register_event = mem_cgroup_oom_register_event;
+ event->unregister_event = mem_cgroup_oom_unregister_event;
+ } else if (!strcmp(name, "memory.pressure_level")) {
+ event->register_event = vmpressure_register_event;
+ event->unregister_event = vmpressure_unregister_event;
+ } else if (!strcmp(name, "memory.memsw.usage_in_bytes")) {
+ event->register_event = memsw_cgroup_usage_register_event;
+ event->unregister_event = memsw_cgroup_usage_unregister_event;
+ } else {
+ ret = -EINVAL;
+ goto out_put_cfile;
+ }
+
+ /*
+ * Verify @cfile should belong to @css. Also, remaining events are
+ * automatically removed on cgroup destruction but the removal is
+ * asynchronous, so take an extra ref on @css.
+ */
+ cfile_css = css_tryget_online_from_dir(cdentry->d_parent,
+ &memory_cgrp_subsys);
+ ret = -EINVAL;
+ if (IS_ERR(cfile_css))
+ goto out_put_cfile;
+ if (cfile_css != css) {
+ css_put(cfile_css);
+ goto out_put_cfile;
+ }
+
+ ret = event->register_event(memcg, event->eventfd, buf);
+ if (ret)
+ goto out_put_css;
+
+ vfs_poll(efile.file, &event->pt);
+
+ spin_lock_irq(&memcg->event_list_lock);
+ list_add(&event->list, &memcg->event_list);
+ spin_unlock_irq(&memcg->event_list_lock);
+
+ fdput(cfile);
+ fdput(efile);
+
+ return nbytes;
+
+out_put_css:
+ css_put(css);
+out_put_cfile:
+ fdput(cfile);
+out_put_eventfd:
+ eventfd_ctx_put(event->eventfd);
+out_put_efile:
+ fdput(efile);
+out_kfree:
+ kfree(event);
+
+ return ret;
+}
+
+void mem_cgroup_v1_offline_memcg(struct mem_cgroup *memcg)
+{
+ struct mem_cgroup_event *event, *tmp;
+
+ /*
+ * Unregister events and notify userspace.
+ * Notify userspace about cgroup removing only after rmdir of cgroup
+ * directory to avoid race between userspace and kernelspace.
+ */
+ spin_lock_irq(&memcg->event_list_lock);
+ list_for_each_entry_safe(event, tmp, &memcg->event_list, list) {
+ list_del_init(&event->list);
+ schedule_work(&event->remove);
+ }
+ spin_unlock_irq(&memcg->event_list_lock);
+}
+
static int __init mem_cgroup_v1_init(void)
{
int node;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index dc0a38d1107c..7c1a4ea0e9b5 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -46,9 +46,6 @@
#include <linux/slab.h>
#include <linux/swapops.h>
#include <linux/spinlock.h>
-#include <linux/eventfd.h>
-#include <linux/poll.h>
-#include <linux/sort.h>
#include <linux/fs.h>
#include <linux/seq_file.h>
#include <linux/parser.h>
@@ -59,7 +56,6 @@
#include <linux/cpu.h>
#include <linux/oom.h>
#include <linux/lockdep.h>
-#include <linux/file.h>
#include <linux/resume_user_mode.h>
#include <linux/psi.h>
#include <linux/seq_buf.h>
@@ -96,91 +92,13 @@ static bool cgroup_memory_nobpf __ro_after_init;
static DECLARE_WAIT_QUEUE_HEAD(memcg_cgwb_frn_waitq);
#endif

-/* Whether legacy memory+swap accounting is active */
-static bool do_memsw_account(void)
-{
- return !cgroup_subsys_on_dfl(memory_cgrp_subsys);
-}
-
#define THRESHOLDS_EVENTS_TARGET 128
#define SOFTLIMIT_EVENTS_TARGET 1024

-/* for OOM */
-struct mem_cgroup_eventfd_list {
- struct list_head list;
- struct eventfd_ctx *eventfd;
-};
-
-/*
- * cgroup_event represents events which userspace want to receive.
- */
-struct mem_cgroup_event {
- /*
- * memcg which the event belongs to.
- */
- struct mem_cgroup *memcg;
- /*
- * eventfd to signal userspace about the event.
- */
- struct eventfd_ctx *eventfd;
- /*
- * Each of these stored in a list by the cgroup.
- */
- struct list_head list;
- /*
- * register_event() callback will be used to add new userspace
- * waiter for changes related to this event. Use eventfd_signal()
- * on eventfd to send notification to userspace.
- */
- int (*register_event)(struct mem_cgroup *memcg,
- struct eventfd_ctx *eventfd, const char *args);
- /*
- * unregister_event() callback will be called when userspace closes
- * the eventfd or on cgroup removing. This callback must be set,
- * if you want provide notification functionality.
- */
- void (*unregister_event)(struct mem_cgroup *memcg,
- struct eventfd_ctx *eventfd);
- /*
- * All fields below needed to unregister event when
- * userspace closes eventfd.
- */
- poll_table pt;
- wait_queue_head_t *wqh;
- wait_queue_entry_t wait;
- struct work_struct remove;
-};
-
-static void mem_cgroup_threshold(struct mem_cgroup *memcg);
-static void mem_cgroup_oom_notify(struct mem_cgroup *memcg);
-
-/* for encoding cft->private value on file */
-enum res_type {
- _MEM,
- _MEMSWAP,
- _KMEM,
- _TCP,
-};
-
#define MEMFILE_PRIVATE(x, val) ((x) << 16 | (val))
#define MEMFILE_TYPE(val) ((val) >> 16 & 0xffff)
#define MEMFILE_ATTR(val) ((val) & 0xffff)

-/*
- * Iteration constructs for visiting all cgroups (under a tree). If
- * loops are exited prematurely (break), mem_cgroup_iter_break() must
- * be used for reference counting.
- */
-#define for_each_mem_cgroup_tree(iter, root) \
- for (iter = mem_cgroup_iter(root, NULL, NULL); \
- iter != NULL; \
- iter = mem_cgroup_iter(root, iter, NULL))
-
-#define for_each_mem_cgroup(iter) \
- for (iter = mem_cgroup_iter(NULL, NULL, NULL); \
- iter != NULL; \
- iter = mem_cgroup_iter(NULL, iter, NULL))
-
static inline bool task_is_dying(void)
{
return tsk_is_oom_victim(current) || fatal_signal_pending(current) ||
@@ -942,8 +860,8 @@ void mem_cgroup_charge_statistics(struct mem_cgroup *memcg, int nr_pages)
__this_cpu_add(memcg->vmstats_percpu->nr_page_events, nr_pages);
}

-static bool mem_cgroup_event_ratelimit(struct mem_cgroup *memcg,
- enum mem_cgroup_events_target target)
+bool mem_cgroup_event_ratelimit(struct mem_cgroup *memcg,
+ enum mem_cgroup_events_target target)
{
unsigned long val, next;

@@ -967,28 +885,6 @@ static bool mem_cgroup_event_ratelimit(struct mem_cgroup *memcg,
return false;
}

-/*
- * Check events in order.
- *
- */
-void memcg_check_events(struct mem_cgroup *memcg, int nid)
-{
- if (IS_ENABLED(CONFIG_PREEMPT_RT))
- return;
-
- /* threshold event is triggered in finer grain than soft limit */
- if (unlikely(mem_cgroup_event_ratelimit(memcg,
- MEM_CGROUP_TARGET_THRESH))) {
- bool do_softlimit;
-
- do_softlimit = mem_cgroup_event_ratelimit(memcg,
- MEM_CGROUP_TARGET_SOFTLIMIT);
- mem_cgroup_threshold(memcg);
- if (unlikely(do_softlimit))
- mem_cgroup_update_tree(memcg, nid);
- }
-}
-
struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p)
{
/*
@@ -1728,7 +1624,7 @@ static struct lockdep_map memcg_oom_lock_dep_map = {
};
#endif

-static DEFINE_SPINLOCK(memcg_oom_lock);
+DEFINE_SPINLOCK(memcg_oom_lock);

/*
* Check OOM-Killer is already running under our hierarchy.
@@ -3550,7 +3446,7 @@ static int mem_cgroup_hierarchy_write(struct cgroup_subsys_state *css,
return -EINVAL;
}

-static unsigned long mem_cgroup_usage(struct mem_cgroup *memcg, bool swap)
+unsigned long mem_cgroup_usage(struct mem_cgroup *memcg, bool swap)
{
unsigned long val;

@@ -4051,331 +3947,6 @@ static int mem_cgroup_swappiness_write(struct cgroup_subsys_state *css,
return 0;
}

-static void __mem_cgroup_threshold(struct mem_cgroup *memcg, bool swap)
-{
- struct mem_cgroup_threshold_ary *t;
- unsigned long usage;
- int i;
-
- rcu_read_lock();
- if (!swap)
- t = rcu_dereference(memcg->thresholds.primary);
- else
- t = rcu_dereference(memcg->memsw_thresholds.primary);
-
- if (!t)
- goto unlock;
-
- usage = mem_cgroup_usage(memcg, swap);
-
- /*
- * current_threshold points to threshold just below or equal to usage.
- * If it's not true, a threshold was crossed after last
- * call of __mem_cgroup_threshold().
- */
- i = t->current_threshold;
-
- /*
- * Iterate backward over array of thresholds starting from
- * current_threshold and check if a threshold is crossed.
- * If none of thresholds below usage is crossed, we read
- * only one element of the array here.
- */
- for (; i >= 0 && unlikely(t->entries[i].threshold > usage); i--)
- eventfd_signal(t->entries[i].eventfd);
-
- /* i = current_threshold + 1 */
- i++;
-
- /*
- * Iterate forward over array of thresholds starting from
- * current_threshold+1 and check if a threshold is crossed.
- * If none of thresholds above usage is crossed, we read
- * only one element of the array here.
- */
- for (; i < t->size && unlikely(t->entries[i].threshold <= usage); i++)
- eventfd_signal(t->entries[i].eventfd);
-
- /* Update current_threshold */
- t->current_threshold = i - 1;
-unlock:
- rcu_read_unlock();
-}
-
-static void mem_cgroup_threshold(struct mem_cgroup *memcg)
-{
- while (memcg) {
- __mem_cgroup_threshold(memcg, false);
- if (do_memsw_account())
- __mem_cgroup_threshold(memcg, true);
-
- memcg = parent_mem_cgroup(memcg);
- }
-}
-
-static int compare_thresholds(const void *a, const void *b)
-{
- const struct mem_cgroup_threshold *_a = a;
- const struct mem_cgroup_threshold *_b = b;
-
- if (_a->threshold > _b->threshold)
- return 1;
-
- if (_a->threshold < _b->threshold)
- return -1;
-
- return 0;
-}
-
-static int mem_cgroup_oom_notify_cb(struct mem_cgroup *memcg)
-{
- struct mem_cgroup_eventfd_list *ev;
-
- spin_lock(&memcg_oom_lock);
-
- list_for_each_entry(ev, &memcg->oom_notify, list)
- eventfd_signal(ev->eventfd);
-
- spin_unlock(&memcg_oom_lock);
- return 0;
-}
-
-static void mem_cgroup_oom_notify(struct mem_cgroup *memcg)
-{
- struct mem_cgroup *iter;
-
- for_each_mem_cgroup_tree(iter, memcg)
- mem_cgroup_oom_notify_cb(iter);
-}
-
-static int __mem_cgroup_usage_register_event(struct mem_cgroup *memcg,
- struct eventfd_ctx *eventfd, const char *args, enum res_type type)
-{
- struct mem_cgroup_thresholds *thresholds;
- struct mem_cgroup_threshold_ary *new;
- unsigned long threshold;
- unsigned long usage;
- int i, size, ret;
-
- ret = page_counter_memparse(args, "-1", &threshold);
- if (ret)
- return ret;
-
- mutex_lock(&memcg->thresholds_lock);
-
- if (type == _MEM) {
- thresholds = &memcg->thresholds;
- usage = mem_cgroup_usage(memcg, false);
- } else if (type == _MEMSWAP) {
- thresholds = &memcg->memsw_thresholds;
- usage = mem_cgroup_usage(memcg, true);
- } else
- BUG();
-
- /* Check if a threshold crossed before adding a new one */
- if (thresholds->primary)
- __mem_cgroup_threshold(memcg, type == _MEMSWAP);
-
- size = thresholds->primary ? thresholds->primary->size + 1 : 1;
-
- /* Allocate memory for new array of thresholds */
- new = kmalloc(struct_size(new, entries, size), GFP_KERNEL);
- if (!new) {
- ret = -ENOMEM;
- goto unlock;
- }
- new->size = size;
-
- /* Copy thresholds (if any) to new array */
- if (thresholds->primary)
- memcpy(new->entries, thresholds->primary->entries,
- flex_array_size(new, entries, size - 1));
-
- /* Add new threshold */
- new->entries[size - 1].eventfd = eventfd;
- new->entries[size - 1].threshold = threshold;
-
- /* Sort thresholds. Registering of new threshold isn't time-critical */
- sort(new->entries, size, sizeof(*new->entries),
- compare_thresholds, NULL);
-
- /* Find current threshold */
- new->current_threshold = -1;
- for (i = 0; i < size; i++) {
- if (new->entries[i].threshold <= usage) {
- /*
- * new->current_threshold will not be used until
- * rcu_assign_pointer(), so it's safe to increment
- * it here.
- */
- ++new->current_threshold;
- } else
- break;
- }
-
- /* Free old spare buffer and save old primary buffer as spare */
- kfree(thresholds->spare);
- thresholds->spare = thresholds->primary;
-
- rcu_assign_pointer(thresholds->primary, new);
-
- /* To be sure that nobody uses thresholds */
- synchronize_rcu();
-
-unlock:
- mutex_unlock(&memcg->thresholds_lock);
-
- return ret;
-}
-
-static int mem_cgroup_usage_register_event(struct mem_cgroup *memcg,
- struct eventfd_ctx *eventfd, const char *args)
-{
- return __mem_cgroup_usage_register_event(memcg, eventfd, args, _MEM);
-}
-
-static int memsw_cgroup_usage_register_event(struct mem_cgroup *memcg,
- struct eventfd_ctx *eventfd, const char *args)
-{
- return __mem_cgroup_usage_register_event(memcg, eventfd, args, _MEMSWAP);
-}
-
-static void __mem_cgroup_usage_unregister_event(struct mem_cgroup *memcg,
- struct eventfd_ctx *eventfd, enum res_type type)
-{
- struct mem_cgroup_thresholds *thresholds;
- struct mem_cgroup_threshold_ary *new;
- unsigned long usage;
- int i, j, size, entries;
-
- mutex_lock(&memcg->thresholds_lock);
-
- if (type == _MEM) {
- thresholds = &memcg->thresholds;
- usage = mem_cgroup_usage(memcg, false);
- } else if (type == _MEMSWAP) {
- thresholds = &memcg->memsw_thresholds;
- usage = mem_cgroup_usage(memcg, true);
- } else
- BUG();
-
- if (!thresholds->primary)
- goto unlock;
-
- /* Check if a threshold crossed before removing */
- __mem_cgroup_threshold(memcg, type == _MEMSWAP);
-
- /* Calculate new number of threshold */
- size = entries = 0;
- for (i = 0; i < thresholds->primary->size; i++) {
- if (thresholds->primary->entries[i].eventfd != eventfd)
- size++;
- else
- entries++;
- }
-
- new = thresholds->spare;
-
- /* If no items related to eventfd have been cleared, nothing to do */
- if (!entries)
- goto unlock;
-
- /* Set thresholds array to NULL if we don't have thresholds */
- if (!size) {
- kfree(new);
- new = NULL;
- goto swap_buffers;
- }
-
- new->size = size;
-
- /* Copy thresholds and find current threshold */
- new->current_threshold = -1;
- for (i = 0, j = 0; i < thresholds->primary->size; i++) {
- if (thresholds->primary->entries[i].eventfd == eventfd)
- continue;
-
- new->entries[j] = thresholds->primary->entries[i];
- if (new->entries[j].threshold <= usage) {
- /*
- * new->current_threshold will not be used
- * until rcu_assign_pointer(), so it's safe to increment
- * it here.
- */
- ++new->current_threshold;
- }
- j++;
- }
-
-swap_buffers:
- /* Swap primary and spare array */
- thresholds->spare = thresholds->primary;
-
- rcu_assign_pointer(thresholds->primary, new);
-
- /* To be sure that nobody uses thresholds */
- synchronize_rcu();
-
- /* If all events are unregistered, free the spare array */
- if (!new) {
- kfree(thresholds->spare);
- thresholds->spare = NULL;
- }
-unlock:
- mutex_unlock(&memcg->thresholds_lock);
-}
-
-static void mem_cgroup_usage_unregister_event(struct mem_cgroup *memcg,
- struct eventfd_ctx *eventfd)
-{
- return __mem_cgroup_usage_unregister_event(memcg, eventfd, _MEM);
-}
-
-static void memsw_cgroup_usage_unregister_event(struct mem_cgroup *memcg,
- struct eventfd_ctx *eventfd)
-{
- return __mem_cgroup_usage_unregister_event(memcg, eventfd, _MEMSWAP);
-}
-
-static int mem_cgroup_oom_register_event(struct mem_cgroup *memcg,
- struct eventfd_ctx *eventfd, const char *args)
-{
- struct mem_cgroup_eventfd_list *event;
-
- event = kmalloc(sizeof(*event), GFP_KERNEL);
- if (!event)
- return -ENOMEM;
-
- spin_lock(&memcg_oom_lock);
-
- event->eventfd = eventfd;
- list_add(&event->list, &memcg->oom_notify);
-
- /* already in OOM ? */
- if (memcg->under_oom)
- eventfd_signal(eventfd);
- spin_unlock(&memcg_oom_lock);
-
- return 0;
-}
-
-static void mem_cgroup_oom_unregister_event(struct mem_cgroup *memcg,
- struct eventfd_ctx *eventfd)
-{
- struct mem_cgroup_eventfd_list *ev, *tmp;
-
- spin_lock(&memcg_oom_lock);
-
- list_for_each_entry_safe(ev, tmp, &memcg->oom_notify, list) {
- if (ev->eventfd == eventfd) {
- list_del(&ev->list);
- kfree(ev);
- }
- }
-
- spin_unlock(&memcg_oom_lock);
-}
-
static int mem_cgroup_oom_control_read(struct seq_file *sf, void *v)
{
struct mem_cgroup *memcg = mem_cgroup_from_seq(sf);
@@ -4616,243 +4187,6 @@ static void memcg_wb_domain_size_changed(struct mem_cgroup *memcg)

#endif /* CONFIG_CGROUP_WRITEBACK */

-/*
- * DO NOT USE IN NEW FILES.
- *
- * "cgroup.event_control" implementation.
- *
- * This is way over-engineered. It tries to support fully configurable
- * events for each user. Such level of flexibility is completely
- * unnecessary especially in the light of the planned unified hierarchy.
- *
- * Please deprecate this and replace with something simpler if at all
- * possible.
- */
-
-/*
- * Unregister event and free resources.
- *
- * Gets called from workqueue.
- */
-static void memcg_event_remove(struct work_struct *work)
-{
- struct mem_cgroup_event *event =
- container_of(work, struct mem_cgroup_event, remove);
- struct mem_cgroup *memcg = event->memcg;
-
- remove_wait_queue(event->wqh, &event->wait);
-
- event->unregister_event(memcg, event->eventfd);
-
- /* Notify userspace the event is going away. */
- eventfd_signal(event->eventfd);
-
- eventfd_ctx_put(event->eventfd);
- kfree(event);
- css_put(&memcg->css);
-}
-
-/*
- * Gets called on EPOLLHUP on eventfd when user closes it.
- *
- * Called with wqh->lock held and interrupts disabled.
- */
-static int memcg_event_wake(wait_queue_entry_t *wait, unsigned mode,
- int sync, void *key)
-{
- struct mem_cgroup_event *event =
- container_of(wait, struct mem_cgroup_event, wait);
- struct mem_cgroup *memcg = event->memcg;
- __poll_t flags = key_to_poll(key);
-
- if (flags & EPOLLHUP) {
- /*
- * If the event has been detached at cgroup removal, we
- * can simply return knowing the other side will cleanup
- * for us.
- *
- * We can't race against event freeing since the other
- * side will require wqh->lock via remove_wait_queue(),
- * which we hold.
- */
- spin_lock(&memcg->event_list_lock);
- if (!list_empty(&event->list)) {
- list_del_init(&event->list);
- /*
- * We are in atomic context, but cgroup_event_remove()
- * may sleep, so we have to call it in workqueue.
- */
- schedule_work(&event->remove);
- }
- spin_unlock(&memcg->event_list_lock);
- }
-
- return 0;
-}
-
-static void memcg_event_ptable_queue_proc(struct file *file,
- wait_queue_head_t *wqh, poll_table *pt)
-{
- struct mem_cgroup_event *event =
- container_of(pt, struct mem_cgroup_event, pt);
-
- event->wqh = wqh;
- add_wait_queue(wqh, &event->wait);
-}
-
-/*
- * DO NOT USE IN NEW FILES.
- *
- * Parse input and register new cgroup event handler.
- *
- * Input must be in format '<event_fd> <control_fd> <args>'.
- * Interpretation of args is defined by control file implementation.
- */
-static ssize_t memcg_write_event_control(struct kernfs_open_file *of,
- char *buf, size_t nbytes, loff_t off)
-{
- struct cgroup_subsys_state *css = of_css(of);
- struct mem_cgroup *memcg = mem_cgroup_from_css(css);
- struct mem_cgroup_event *event;
- struct cgroup_subsys_state *cfile_css;
- unsigned int efd, cfd;
- struct fd efile;
- struct fd cfile;
- struct dentry *cdentry;
- const char *name;
- char *endp;
- int ret;
-
- if (IS_ENABLED(CONFIG_PREEMPT_RT))
- return -EOPNOTSUPP;
-
- buf = strstrip(buf);
-
- efd = simple_strtoul(buf, &endp, 10);
- if (*endp != ' ')
- return -EINVAL;
- buf = endp + 1;
-
- cfd = simple_strtoul(buf, &endp, 10);
- if ((*endp != ' ') && (*endp != '\0'))
- return -EINVAL;
- buf = endp + 1;
-
- event = kzalloc(sizeof(*event), GFP_KERNEL);
- if (!event)
- return -ENOMEM;
-
- event->memcg = memcg;
- INIT_LIST_HEAD(&event->list);
- init_poll_funcptr(&event->pt, memcg_event_ptable_queue_proc);
- init_waitqueue_func_entry(&event->wait, memcg_event_wake);
- INIT_WORK(&event->remove, memcg_event_remove);
-
- efile = fdget(efd);
- if (!efile.file) {
- ret = -EBADF;
- goto out_kfree;
- }
-
- event->eventfd = eventfd_ctx_fileget(efile.file);
- if (IS_ERR(event->eventfd)) {
- ret = PTR_ERR(event->eventfd);
- goto out_put_efile;
- }
-
- cfile = fdget(cfd);
- if (!cfile.file) {
- ret = -EBADF;
- goto out_put_eventfd;
- }
-
- /* the process need read permission on control file */
- /* AV: shouldn't we check that it's been opened for read instead? */
- ret = file_permission(cfile.file, MAY_READ);
- if (ret < 0)
- goto out_put_cfile;
-
- /*
- * The control file must be a regular cgroup1 file. As a regular cgroup
- * file can't be renamed, it's safe to access its name afterwards.
- */
- cdentry = cfile.file->f_path.dentry;
- if (cdentry->d_sb->s_type != &cgroup_fs_type || !d_is_reg(cdentry)) {
- ret = -EINVAL;
- goto out_put_cfile;
- }
-
- /*
- * Determine the event callbacks and set them in @event. This used
- * to be done via struct cftype but cgroup core no longer knows
- * about these events. The following is crude but the whole thing
- * is for compatibility anyway.
- *
- * DO NOT ADD NEW FILES.
- */
- name = cdentry->d_name.name;
-
- if (!strcmp(name, "memory.usage_in_bytes")) {
- event->register_event = mem_cgroup_usage_register_event;
- event->unregister_event = mem_cgroup_usage_unregister_event;
- } else if (!strcmp(name, "memory.oom_control")) {
- event->register_event = mem_cgroup_oom_register_event;
- event->unregister_event = mem_cgroup_oom_unregister_event;
- } else if (!strcmp(name, "memory.pressure_level")) {
- event->register_event = vmpressure_register_event;
- event->unregister_event = vmpressure_unregister_event;
- } else if (!strcmp(name, "memory.memsw.usage_in_bytes")) {
- event->register_event = memsw_cgroup_usage_register_event;
- event->unregister_event = memsw_cgroup_usage_unregister_event;
- } else {
- ret = -EINVAL;
- goto out_put_cfile;
- }
-
- /*
- * Verify @cfile should belong to @css. Also, remaining events are
- * automatically removed on cgroup destruction but the removal is
- * asynchronous, so take an extra ref on @css.
- */
- cfile_css = css_tryget_online_from_dir(cdentry->d_parent,
- &memory_cgrp_subsys);
- ret = -EINVAL;
- if (IS_ERR(cfile_css))
- goto out_put_cfile;
- if (cfile_css != css) {
- css_put(cfile_css);
- goto out_put_cfile;
- }
-
- ret = event->register_event(memcg, event->eventfd, buf);
- if (ret)
- goto out_put_css;
-
- vfs_poll(efile.file, &event->pt);
-
- spin_lock_irq(&memcg->event_list_lock);
- list_add(&event->list, &memcg->event_list);
- spin_unlock_irq(&memcg->event_list_lock);
-
- fdput(cfile);
- fdput(efile);
-
- return nbytes;
-
-out_put_css:
- css_put(css);
-out_put_cfile:
- fdput(cfile);
-out_put_eventfd:
- eventfd_ctx_put(event->eventfd);
-out_put_efile:
- fdput(efile);
-out_kfree:
- kfree(event);
-
- return ret;
-}
-
#if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_SLUB_DEBUG)
static int mem_cgroup_slab_show(struct seq_file *m, void *p)
{
@@ -5319,19 +4653,8 @@ static int mem_cgroup_css_online(struct cgroup_subsys_state *css)
static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
{
struct mem_cgroup *memcg = mem_cgroup_from_css(css);
- struct mem_cgroup_event *event, *tmp;

- /*
- * Unregister events and notify userspace.
- * Notify userspace about cgroup removing only after rmdir of cgroup
- * directory to avoid race between userspace and kernelspace.
- */
- spin_lock_irq(&memcg->event_list_lock);
- list_for_each_entry_safe(event, tmp, &memcg->event_list, list) {
- list_del_init(&event->list);
- schedule_work(&event->remove);
- }
- spin_unlock_irq(&memcg->event_list_lock);
+ mem_cgroup_v1_offline_memcg(memcg);

page_counter_set_min(&memcg->memory, 0);
page_counter_set_low(&memcg->memory, 0);
--
2.43.2
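For context, the "cgroup.event_control" interface removed above is driven
from userspace by writing "<event_fd> <control_fd> <args>" to the control
file and then blocking on the eventfd. Below is a minimal sketch of a
threshold listener; the mount point (/sys/fs/cgroup/memory), the "test"
cgroup and the 64M threshold are assumptions for illustration, not part of
the patch:

/* Minimal cgroup v1 memory-threshold listener (illustrative sketch).
 * Assumes the v1 memory controller is mounted at /sys/fs/cgroup/memory
 * and that a "test" cgroup already exists.
 */
#include <stdio.h>
#include <string.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/eventfd.h>

int main(void)
{
	const char *base = "/sys/fs/cgroup/memory/test";
	char path[256], cmd[64];
	uint64_t ticks;
	int efd, cfd, ctl;

	efd = eventfd(0, 0);				/* <event_fd> */
	if (efd < 0) {
		perror("eventfd");
		return 1;
	}

	snprintf(path, sizeof(path), "%s/memory.usage_in_bytes", base);
	cfd = open(path, O_RDONLY);			/* <control_fd> */
	if (cfd < 0) {
		perror("memory.usage_in_bytes");
		return 1;
	}

	snprintf(path, sizeof(path), "%s/cgroup.event_control", base);
	ctl = open(path, O_WRONLY);
	if (ctl < 0) {
		perror("cgroup.event_control");
		return 1;
	}

	/* "<event_fd> <control_fd> <args>": arm a 64M usage threshold. */
	snprintf(cmd, sizeof(cmd), "%d %d %llu", efd, cfd, 64ULL << 20);
	if (write(ctl, cmd, strlen(cmd)) < 0) {
		perror("event_control write");
		return 1;
	}

	/* Blocks until the threshold is crossed or the cgroup is removed. */
	if (read(efd, &ticks, sizeof(ticks)) == sizeof(ticks))
		printf("threshold crossed, %llu signal(s)\n",
		       (unsigned long long)ticks);
	return 0;
}

This is the path that lands in memcg_write_event_control() above; when the
cgroup goes away, memcg_event_wake() sees EPOLLHUP and schedules
memcg_event_remove() to tear the registration down.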


2024-05-09 03:43:04

by Roman Gushchin

[permalink] [raw]
Subject: [PATCH rfc 3/9] mm: memcg: move charge migration code to memcontrol-v1.c

From: Roman Gushchin <[email protected]>

Unlike the legacy cgroup v1 memory controller, the cgroup v2 memory
controller doesn't support moving charged pages between cgroups.

It's a fairly large and complicated piece of code, which has created a
number of problems in the past. Let's move it into memcontrol-v1.c.
This shaves about 1k lines off memcontrol.c and is another step
towards making the legacy memory controller code optionally compiled.

Signed-off-by: Roman Gushchin <[email protected]>
---
include/linux/memcontrol.h | 78 +--
mm/internal.h | 31 ++
mm/memcontrol-v1.c | 981 +++++++++++++++++++++++++++++++++++
mm/memcontrol.c | 1005 +-----------------------------------
4 files changed, 1059 insertions(+), 1036 deletions(-)
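
For reference, the interface whose implementation moves here is the cgroup
v1 "memory.move_charge_at_immigrate" knob handled by
mem_cgroup_move_charge_write(): userspace writes a mask of MOVE_ANON (0x1)
and/or MOVE_FILE (0x2) to the destination group and then migrates a task,
which is what triggers mem_cgroup_can_attach()/mem_cgroup_move_task(). A
minimal sketch of the userspace side follows; the mount point and the "dst"
group are assumptions for illustration, not part of the patch:

/* Illustrative sketch: enable cgroup v1 charge migration for a target
 * group and move a task into it. Assumes the v1 memory controller is
 * mounted at /sys/fs/cgroup/memory and that "dst" already exists.
 */
#include <stdio.h>

static int write_str(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");
	int ret;

	if (!f)
		return -1;
	ret = (fputs(val, f) < 0);
	if (fclose(f) || ret)
		return -1;
	return 0;
}

int main(int argc, char **argv)
{
	const char *dst = "/sys/fs/cgroup/memory/dst";
	char path[256];

	if (argc != 2) {
		fprintf(stderr, "usage: %s <pid>\n", argv[0]);
		return 1;
	}

	/* MOVE_ANON | MOVE_FILE == 0x3: migrate both anon and file charges. */
	snprintf(path, sizeof(path), "%s/memory.move_charge_at_immigrate", dst);
	if (write_str(path, "3")) {
		perror("move_charge_at_immigrate");
		return 1;
	}

	/* Moving the task is what ends up in mem_cgroup_can_attach() and
	 * mem_cgroup_move_task() below.
	 */
	snprintf(path, sizeof(path), "%s/cgroup.procs", dst);
	if (write_str(path, argv[1])) {
		perror("cgroup.procs");
		return 1;
	}
	return 0;
}

Writing the knob also emits the pr_warn_once() deprecation notice visible in
the memcontrol-v1.c hunk below; on cgroup v2 the file doesn't exist at all,
which is part of why this machinery can be compiled out.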

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index f77b6fbf38fd..60355f3cb67c 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -971,29 +971,10 @@ struct mem_cgroup *mem_cgroup_get_oom_group(struct task_struct *victim,
struct mem_cgroup *oom_domain);
void mem_cgroup_print_oom_group(struct mem_cgroup *memcg);

-void folio_memcg_lock(struct folio *folio);
-void folio_memcg_unlock(struct folio *folio);

void __mod_memcg_state(struct mem_cgroup *memcg, enum memcg_stat_item idx,
int val);

-/* try to stablize folio_memcg() for all the pages in a memcg */
-static inline bool mem_cgroup_trylock_pages(struct mem_cgroup *memcg)
-{
- rcu_read_lock();
-
- if (mem_cgroup_disabled() || !atomic_read(&memcg->moving_account))
- return true;
-
- rcu_read_unlock();
- return false;
-}
-
-static inline void mem_cgroup_unlock_pages(void)
-{
- rcu_read_unlock();
-}
-
/* idx can be of type enum memcg_stat_item or node_stat_item */
static inline void mod_memcg_state(struct mem_cgroup *memcg,
enum memcg_stat_item idx, int val)
@@ -1435,26 +1416,6 @@ mem_cgroup_print_oom_meminfo(struct mem_cgroup *memcg)
{
}

-static inline void folio_memcg_lock(struct folio *folio)
-{
-}
-
-static inline void folio_memcg_unlock(struct folio *folio)
-{
-}
-
-static inline bool mem_cgroup_trylock_pages(struct mem_cgroup *memcg)
-{
- /* to match folio_memcg_rcu() */
- rcu_read_lock();
- return true;
-}
-
-static inline void mem_cgroup_unlock_pages(void)
-{
- rcu_read_unlock();
-}
-
static inline void mem_cgroup_handle_over_high(gfp_t gfp_mask)
{
}
@@ -1927,6 +1888,25 @@ static inline bool mem_cgroup_zswap_writeback_enabled(struct mem_cgroup *memcg)
unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
gfp_t gfp_mask,
unsigned long *total_scanned);
+void folio_memcg_lock(struct folio *folio);
+void folio_memcg_unlock(struct folio *folio);
+
+/* try to stablize folio_memcg() for all the pages in a memcg */
+static inline bool mem_cgroup_trylock_pages(struct mem_cgroup *memcg)
+{
+ rcu_read_lock();
+
+ if (mem_cgroup_disabled() || !atomic_read(&memcg->moving_account))
+ return true;
+
+ rcu_read_unlock();
+ return false;
+}
+
+static inline void mem_cgroup_unlock_pages(void)
+{
+ rcu_read_unlock();
+}
#else /* CONFIG_MEMCG */

static inline
@@ -1936,6 +1916,26 @@ unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
{
return 0;
}
+
+static inline void folio_memcg_lock(struct folio *folio)
+{
+}
+
+static inline void folio_memcg_unlock(struct folio *folio)
+{
+}
+
+static inline bool mem_cgroup_trylock_pages(struct mem_cgroup *memcg)
+{
+ /* to match folio_memcg_rcu() */
+ rcu_read_lock();
+ return true;
+}
+
+static inline void mem_cgroup_unlock_pages(void)
+{
+ rcu_read_unlock();
+}
#endif /* CONFIG_MEMCG */

#endif /* _LINUX_MEMCONTROL_H */
diff --git a/mm/internal.h b/mm/internal.h
index 19e96d626977..9ffd48375ae5 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1521,9 +1521,40 @@ static inline void shrinker_debugfs_remove(struct dentry *debugfs_entry,
void workingset_update_node(struct xa_node *node);
extern struct list_lru shadow_nodes;

+/* Memcontrol definitions used by memory cgroups v1-specific code */
+int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
+ unsigned int nr_pages);
+
+static inline int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
+ unsigned int nr_pages)
+{
+ if (mem_cgroup_is_root(memcg))
+ return 0;
+
+ return try_charge_memcg(memcg, gfp_mask, nr_pages);
+}
+
+void mem_cgroup_charge_statistics(struct mem_cgroup *memcg, int nr_pages);
+void memcg_check_events(struct mem_cgroup *memcg, int nid);
+void memcg_oom_recover(struct mem_cgroup *memcg);
+void mem_cgroup_id_get_many(struct mem_cgroup *memcg, unsigned int n);
+void mem_cgroup_id_put_many(struct mem_cgroup *memcg, unsigned int n);
+
/* Memory cgroups v1-specific definitions */
void mem_cgroup_update_tree(struct mem_cgroup *memcg, int nid);
void mem_cgroup_remove_from_trees(struct mem_cgroup *memcg);
void mem_cgroup_soft_limit_reset(struct mem_cgroup *memcg);

+struct cftype;
+u64 mem_cgroup_move_charge_read(struct cgroup_subsys_state *css,
+ struct cftype *cft);
+int mem_cgroup_move_charge_write(struct cgroup_subsys_state *css,
+ struct cftype *cft, u64 val);
+
+struct cgroup_taskset;
+int mem_cgroup_can_attach(struct cgroup_taskset *tset);
+void mem_cgroup_cancel_attach(struct cgroup_taskset *tset);
+bool mem_cgroup_wait_acct_move(struct mem_cgroup *memcg);
+void mem_cgroup_move_task(void);
+
#endif /* __MM_INTERNAL_H */
diff --git a/mm/memcontrol-v1.c b/mm/memcontrol-v1.c
index 951e1c1189cc..b39bfa4a8de6 100644
--- a/mm/memcontrol-v1.c
+++ b/mm/memcontrol-v1.c
@@ -1,9 +1,14 @@
// SPDX-License-Identifier: GPL-2.0-or-later

#include <linux/memcontrol.h>
+#include <linux/pagewalk.h>
+#include <linux/backing-dev.h>
+#include <linux/swap.h>
#include <linux/mm_inline.h>
+#include <linux/swap_cgroup.h>

#include "internal.h"
+#include "swap.h"

/*
* Cgroups above their limits are maintained in a RB-Tree, independent of
@@ -29,6 +34,31 @@ static struct mem_cgroup_tree soft_limit_tree __read_mostly;
#define MEM_CGROUP_MAX_RECLAIM_LOOPS 100
#define MEM_CGROUP_MAX_SOFT_LIMIT_RECLAIM_LOOPS 2

+/* Stuff for move charges at task migration. */
+/*
+ * Types of charges to be moved.
+ */
+#define MOVE_ANON 0x1U
+#define MOVE_FILE 0x2U
+#define MOVE_MASK (MOVE_ANON | MOVE_FILE)
+
+/* "mc" and its members are protected by cgroup_mutex */
+static struct move_charge_struct {
+ spinlock_t lock; /* for from, to */
+ struct mm_struct *mm;
+ struct mem_cgroup *from;
+ struct mem_cgroup *to;
+ unsigned long flags;
+ unsigned long precharge;
+ unsigned long moved_charge;
+ unsigned long moved_swap;
+ struct task_struct *moving_task; /* a task moving charges */
+ wait_queue_head_t waitq; /* a waitq for other context */
+} mc = {
+ .lock = __SPIN_LOCK_UNLOCKED(mc.lock),
+ .waitq = __WAIT_QUEUE_HEAD_INITIALIZER(mc.waitq),
+};
+
static void __mem_cgroup_insert_exceeded(struct mem_cgroup_per_node *mz,
struct mem_cgroup_tree_per_node *mctz,
unsigned long new_usage_in_excess)
@@ -329,6 +359,957 @@ void mem_cgroup_soft_limit_reset(struct mem_cgroup *memcg)
WRITE_ONCE(memcg->soft_limit, PAGE_COUNTER_MAX);
}

+/*
+ * A routine for checking "mem" is under move_account() or not.
+ *
+ * Checking a cgroup is mc.from or mc.to or under hierarchy of
+ * moving cgroups. This is for waiting at high-memory pressure
+ * caused by "move".
+ */
+static bool mem_cgroup_under_move(struct mem_cgroup *memcg)
+{
+ struct mem_cgroup *from;
+ struct mem_cgroup *to;
+ bool ret = false;
+ /*
+ * Unlike task_move routines, we access mc.to, mc.from not under
+ * mutual exclusion by cgroup_mutex. Here, we take spinlock instead.
+ */
+ spin_lock(&mc.lock);
+ from = mc.from;
+ to = mc.to;
+ if (!from)
+ goto unlock;
+
+ ret = mem_cgroup_is_descendant(from, memcg) ||
+ mem_cgroup_is_descendant(to, memcg);
+unlock:
+ spin_unlock(&mc.lock);
+ return ret;
+}
+
+bool mem_cgroup_wait_acct_move(struct mem_cgroup *memcg)
+{
+ if (mc.moving_task && current != mc.moving_task) {
+ if (mem_cgroup_under_move(memcg)) {
+ DEFINE_WAIT(wait);
+ prepare_to_wait(&mc.waitq, &wait, TASK_INTERRUPTIBLE);
+ /* moving charge context might have finished. */
+ if (mc.moving_task)
+ schedule();
+ finish_wait(&mc.waitq, &wait);
+ return true;
+ }
+ }
+ return false;
+}
+
+/**
+ * folio_memcg_lock - Bind a folio to its memcg.
+ * @folio: The folio.
+ *
+ * This function prevents unlocked LRU folios from being moved to
+ * another cgroup.
+ *
+ * It ensures lifetime of the bound memcg. The caller is responsible
+ * for the lifetime of the folio.
+ */
+void folio_memcg_lock(struct folio *folio)
+{
+ struct mem_cgroup *memcg;
+ unsigned long flags;
+
+ /*
+ * The RCU lock is held throughout the transaction. The fast
+ * path can get away without acquiring the memcg->move_lock
+ * because page moving starts with an RCU grace period.
+ */
+ rcu_read_lock();
+
+ if (mem_cgroup_disabled())
+ return;
+again:
+ memcg = folio_memcg(folio);
+ if (unlikely(!memcg))
+ return;
+
+#ifdef CONFIG_PROVE_LOCKING
+ local_irq_save(flags);
+ might_lock(&memcg->move_lock);
+ local_irq_restore(flags);
+#endif
+
+ if (atomic_read(&memcg->moving_account) <= 0)
+ return;
+
+ spin_lock_irqsave(&memcg->move_lock, flags);
+ if (memcg != folio_memcg(folio)) {
+ spin_unlock_irqrestore(&memcg->move_lock, flags);
+ goto again;
+ }
+
+ /*
+ * When charge migration first begins, we can have multiple
+ * critical sections holding the fast-path RCU lock and one
+ * holding the slowpath move_lock. Track the task who has the
+ * move_lock for folio_memcg_unlock().
+ */
+ memcg->move_lock_task = current;
+ memcg->move_lock_flags = flags;
+}
+
+static void __folio_memcg_unlock(struct mem_cgroup *memcg)
+{
+ if (memcg && memcg->move_lock_task == current) {
+ unsigned long flags = memcg->move_lock_flags;
+
+ memcg->move_lock_task = NULL;
+ memcg->move_lock_flags = 0;
+
+ spin_unlock_irqrestore(&memcg->move_lock, flags);
+ }
+
+ rcu_read_unlock();
+}
+
+/**
+ * folio_memcg_unlock - Release the binding between a folio and its memcg.
+ * @folio: The folio.
+ *
+ * This releases the binding created by folio_memcg_lock(). This does
+ * not change the accounting of this folio to its memcg, but it does
+ * permit others to change it.
+ */
+void folio_memcg_unlock(struct folio *folio)
+{
+ __folio_memcg_unlock(folio_memcg(folio));
+}
+
+#ifdef CONFIG_SWAP
+/**
+ * mem_cgroup_move_swap_account - move swap charge and swap_cgroup's record.
+ * @entry: swap entry to be moved
+ * @from: mem_cgroup which the entry is moved from
+ * @to: mem_cgroup which the entry is moved to
+ *
+ * It succeeds only when the swap_cgroup's record for this entry is the same
+ * as the mem_cgroup's id of @from.
+ *
+ * Returns 0 on success, -EINVAL on failure.
+ *
+ * The caller must have charged to @to, IOW, called page_counter_charge() about
+ * both res and memsw, and called css_get().
+ */
+static int mem_cgroup_move_swap_account(swp_entry_t entry,
+ struct mem_cgroup *from, struct mem_cgroup *to)
+{
+ unsigned short old_id, new_id;
+
+ old_id = mem_cgroup_id(from);
+ new_id = mem_cgroup_id(to);
+
+ if (swap_cgroup_cmpxchg(entry, old_id, new_id) == old_id) {
+ mod_memcg_state(from, MEMCG_SWAP, -1);
+ mod_memcg_state(to, MEMCG_SWAP, 1);
+ return 0;
+ }
+ return -EINVAL;
+}
+#else
+static inline int mem_cgroup_move_swap_account(swp_entry_t entry,
+ struct mem_cgroup *from, struct mem_cgroup *to)
+{
+ return -EINVAL;
+}
+#endif
+
+u64 mem_cgroup_move_charge_read(struct cgroup_subsys_state *css,
+ struct cftype *cft)
+{
+ return mem_cgroup_from_css(css)->move_charge_at_immigrate;
+}
+
+#ifdef CONFIG_MMU
+int mem_cgroup_move_charge_write(struct cgroup_subsys_state *css,
+ struct cftype *cft, u64 val)
+{
+ struct mem_cgroup *memcg = mem_cgroup_from_css(css);
+
+ pr_warn_once("Cgroup memory moving (move_charge_at_immigrate) is deprecated. "
+ "Please report your usecase to [email protected] if you "
+ "depend on this functionality.\n");
+
+ if (val & ~MOVE_MASK)
+ return -EINVAL;
+
+ /*
+ * No kind of locking is needed in here, because ->can_attach() will
+ * check this value once in the beginning of the process, and then carry
+ * on with stale data. This means that changes to this value will only
+ * affect task migrations starting after the change.
+ */
+ memcg->move_charge_at_immigrate = val;
+ return 0;
+}
+#else
+int mem_cgroup_move_charge_write(struct cgroup_subsys_state *css,
+ struct cftype *cft, u64 val)
+{
+ return -ENOSYS;
+}
+#endif
+
+#ifdef CONFIG_MMU
+/* Handlers for move charge at task migration. */
+static int mem_cgroup_do_precharge(unsigned long count)
+{
+ int ret;
+
+ /* Try a single bulk charge without reclaim first, kswapd may wake */
+ ret = try_charge(mc.to, GFP_KERNEL & ~__GFP_DIRECT_RECLAIM, count);
+ if (!ret) {
+ mc.precharge += count;
+ return ret;
+ }
+
+ /* Try charges one by one with reclaim, but do not retry */
+ while (count--) {
+ ret = try_charge(mc.to, GFP_KERNEL | __GFP_NORETRY, 1);
+ if (ret)
+ return ret;
+ mc.precharge++;
+ cond_resched();
+ }
+ return 0;
+}
+
+union mc_target {
+ struct folio *folio;
+ swp_entry_t ent;
+};
+
+enum mc_target_type {
+ MC_TARGET_NONE = 0,
+ MC_TARGET_PAGE,
+ MC_TARGET_SWAP,
+ MC_TARGET_DEVICE,
+};
+
+static struct page *mc_handle_present_pte(struct vm_area_struct *vma,
+ unsigned long addr, pte_t ptent)
+{
+ struct page *page = vm_normal_page(vma, addr, ptent);
+
+ if (!page)
+ return NULL;
+ if (PageAnon(page)) {
+ if (!(mc.flags & MOVE_ANON))
+ return NULL;
+ } else {
+ if (!(mc.flags & MOVE_FILE))
+ return NULL;
+ }
+ get_page(page);
+
+ return page;
+}
+
+#if defined(CONFIG_SWAP) || defined(CONFIG_DEVICE_PRIVATE)
+static struct page *mc_handle_swap_pte(struct vm_area_struct *vma,
+ pte_t ptent, swp_entry_t *entry)
+{
+ struct page *page = NULL;
+ swp_entry_t ent = pte_to_swp_entry(ptent);
+
+ if (!(mc.flags & MOVE_ANON))
+ return NULL;
+
+ /*
+ * Handle device private pages that are not accessible by the CPU, but
+ * stored as special swap entries in the page table.
+ */
+ if (is_device_private_entry(ent)) {
+ page = pfn_swap_entry_to_page(ent);
+ if (!get_page_unless_zero(page))
+ return NULL;
+ return page;
+ }
+
+ if (non_swap_entry(ent))
+ return NULL;
+
+ /*
+ * Because swap_cache_get_folio() updates some statistics counter,
+ * we call find_get_page() with swapper_space directly.
+ */
+ page = find_get_page(swap_address_space(ent), swp_offset(ent));
+ entry->val = ent.val;
+
+ return page;
+}
+#else
+static struct page *mc_handle_swap_pte(struct vm_area_struct *vma,
+ pte_t ptent, swp_entry_t *entry)
+{
+ return NULL;
+}
+#endif
+
+static struct page *mc_handle_file_pte(struct vm_area_struct *vma,
+ unsigned long addr, pte_t ptent)
+{
+ unsigned long index;
+ struct folio *folio;
+
+ if (!vma->vm_file) /* anonymous vma */
+ return NULL;
+ if (!(mc.flags & MOVE_FILE))
+ return NULL;
+
+ /* folio is moved even if it's not RSS of this task(page-faulted). */
+ /* shmem/tmpfs may report page out on swap: account for that too. */
+ index = linear_page_index(vma, addr);
+ folio = filemap_get_incore_folio(vma->vm_file->f_mapping, index);
+ if (IS_ERR(folio))
+ return NULL;
+ return folio_file_page(folio, index);
+}
+
+/**
+ * mem_cgroup_move_account - move account of the folio
+ * @folio: The folio.
+ * @compound: charge the page as compound or small page
+ * @from: mem_cgroup which the folio is moved from.
+ * @to: mem_cgroup which the folio is moved to. @from != @to.
+ *
+ * The folio must be locked and not on the LRU.
+ *
+ * This function doesn't do "charge" to new cgroup and doesn't do "uncharge"
+ * from old cgroup.
+ */
+static int mem_cgroup_move_account(struct folio *folio,
+ bool compound,
+ struct mem_cgroup *from,
+ struct mem_cgroup *to)
+{
+ struct lruvec *from_vec, *to_vec;
+ struct pglist_data *pgdat;
+ unsigned int nr_pages = compound ? folio_nr_pages(folio) : 1;
+ int nid, ret;
+
+ VM_BUG_ON(from == to);
+ VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
+ VM_BUG_ON_FOLIO(folio_test_lru(folio), folio);
+ VM_BUG_ON(compound && !folio_test_large(folio));
+
+ ret = -EINVAL;
+ if (folio_memcg(folio) != from)
+ goto out;
+
+ pgdat = folio_pgdat(folio);
+ from_vec = mem_cgroup_lruvec(from, pgdat);
+ to_vec = mem_cgroup_lruvec(to, pgdat);
+
+ folio_memcg_lock(folio);
+
+ if (folio_test_anon(folio)) {
+ if (folio_mapped(folio)) {
+ __mod_lruvec_state(from_vec, NR_ANON_MAPPED, -nr_pages);
+ __mod_lruvec_state(to_vec, NR_ANON_MAPPED, nr_pages);
+ if (folio_test_pmd_mappable(folio)) {
+ __mod_lruvec_state(from_vec, NR_ANON_THPS,
+ -nr_pages);
+ __mod_lruvec_state(to_vec, NR_ANON_THPS,
+ nr_pages);
+ }
+ }
+ } else {
+ __mod_lruvec_state(from_vec, NR_FILE_PAGES, -nr_pages);
+ __mod_lruvec_state(to_vec, NR_FILE_PAGES, nr_pages);
+
+ if (folio_test_swapbacked(folio)) {
+ __mod_lruvec_state(from_vec, NR_SHMEM, -nr_pages);
+ __mod_lruvec_state(to_vec, NR_SHMEM, nr_pages);
+ }
+
+ if (folio_mapped(folio)) {
+ __mod_lruvec_state(from_vec, NR_FILE_MAPPED, -nr_pages);
+ __mod_lruvec_state(to_vec, NR_FILE_MAPPED, nr_pages);
+ }
+
+ if (folio_test_dirty(folio)) {
+ struct address_space *mapping = folio_mapping(folio);
+
+ if (mapping_can_writeback(mapping)) {
+ __mod_lruvec_state(from_vec, NR_FILE_DIRTY,
+ -nr_pages);
+ __mod_lruvec_state(to_vec, NR_FILE_DIRTY,
+ nr_pages);
+ }
+ }
+ }
+
+#ifdef CONFIG_SWAP
+ if (folio_test_swapcache(folio)) {
+ __mod_lruvec_state(from_vec, NR_SWAPCACHE, -nr_pages);
+ __mod_lruvec_state(to_vec, NR_SWAPCACHE, nr_pages);
+ }
+#endif
+ if (folio_test_writeback(folio)) {
+ __mod_lruvec_state(from_vec, NR_WRITEBACK, -nr_pages);
+ __mod_lruvec_state(to_vec, NR_WRITEBACK, nr_pages);
+ }
+
+ /*
+ * All state has been migrated, let's switch to the new memcg.
+ *
+ * It is safe to change page's memcg here because the page
+ * is referenced, charged, isolated, and locked: we can't race
+ * with (un)charging, migration, LRU putback, or anything else
+ * that would rely on a stable page's memory cgroup.
+ *
+ * Note that folio_memcg_lock is a memcg lock, not a page lock,
+ * to save space. As soon as we switch page's memory cgroup to a
+ * new memcg that isn't locked, the above state can change
+ * concurrently again. Make sure we're truly done with it.
+ */
+ smp_mb();
+
+ css_get(&to->css);
+ css_put(&from->css);
+
+ folio->memcg_data = (unsigned long)to;
+
+ __folio_memcg_unlock(from);
+
+ ret = 0;
+ nid = folio_nid(folio);
+
+ local_irq_disable();
+ mem_cgroup_charge_statistics(to, nr_pages);
+ memcg_check_events(to, nid);
+ mem_cgroup_charge_statistics(from, -nr_pages);
+ memcg_check_events(from, nid);
+ local_irq_enable();
+out:
+ return ret;
+}
+
+/**
+ * get_mctgt_type - get target type of moving charge
+ * @vma: the vma the pte to be checked belongs
+ * @addr: the address corresponding to the pte to be checked
+ * @ptent: the pte to be checked
+ * @target: the pointer the target page or swap ent will be stored(can be NULL)
+ *
+ * Context: Called with pte lock held.
+ * Return:
+ * * MC_TARGET_NONE - If the pte is not a target for move charge.
+ * * MC_TARGET_PAGE - If the page corresponding to this pte is a target for
+ * move charge. If @target is not NULL, the folio is stored in target->folio
+ * with extra refcnt taken (Caller should release it).
+ * * MC_TARGET_SWAP - If the swap entry corresponding to this pte is a
+ * target for charge migration. If @target is not NULL, the entry is
+ * stored in target->ent.
+ * * MC_TARGET_DEVICE - Like MC_TARGET_PAGE but page is device memory and
+ * thus not on the lru. For now such page is charged like a regular page
+ * would be as it is just special memory taking the place of a regular page.
+ * See Documentations/vm/hmm.txt and include/linux/hmm.h
+ */
+static enum mc_target_type get_mctgt_type(struct vm_area_struct *vma,
+ unsigned long addr, pte_t ptent, union mc_target *target)
+{
+ struct page *page = NULL;
+ struct folio *folio;
+ enum mc_target_type ret = MC_TARGET_NONE;
+ swp_entry_t ent = { .val = 0 };
+
+ if (pte_present(ptent))
+ page = mc_handle_present_pte(vma, addr, ptent);
+ else if (pte_none_mostly(ptent))
+ /*
+ * PTE markers should be treated as a none pte here, separated
+ * from other swap handling below.
+ */
+ page = mc_handle_file_pte(vma, addr, ptent);
+ else if (is_swap_pte(ptent))
+ page = mc_handle_swap_pte(vma, ptent, &ent);
+
+ if (page)
+ folio = page_folio(page);
+ if (target && page) {
+ if (!folio_trylock(folio)) {
+ folio_put(folio);
+ return ret;
+ }
+ /*
+ * page_mapped() must be stable during the move. This
+ * pte is locked, so if it's present, the page cannot
+ * become unmapped. If it isn't, we have only partial
+ * control over the mapped state: the page lock will
+ * prevent new faults against pagecache and swapcache,
+ * so an unmapped page cannot become mapped. However,
+ * if the page is already mapped elsewhere, it can
+ * unmap, and there is nothing we can do about it.
+ * Alas, skip moving the page in this case.
+ */
+ if (!pte_present(ptent) && page_mapped(page)) {
+ folio_unlock(folio);
+ folio_put(folio);
+ return ret;
+ }
+ }
+
+ if (!page && !ent.val)
+ return ret;
+ if (page) {
+ /*
+ * Do only loose check w/o serialization.
+ * mem_cgroup_move_account() checks the page is valid or
+ * not under LRU exclusion.
+ */
+ if (folio_memcg(folio) == mc.from) {
+ ret = MC_TARGET_PAGE;
+ if (folio_is_device_private(folio) ||
+ folio_is_device_coherent(folio))
+ ret = MC_TARGET_DEVICE;
+ if (target)
+ target->folio = folio;
+ }
+ if (!ret || !target) {
+ if (target)
+ folio_unlock(folio);
+ folio_put(folio);
+ }
+ }
+ /*
+ * There is a swap entry and a page doesn't exist or isn't charged.
+ * But we cannot move a tail-page in a THP.
+ */
+ if (ent.val && !ret && (!page || !PageTransCompound(page)) &&
+ mem_cgroup_id(mc.from) == lookup_swap_cgroup_id(ent)) {
+ ret = MC_TARGET_SWAP;
+ if (target)
+ target->ent = ent;
+ }
+ return ret;
+}
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+/*
+ * We don't consider PMD mapped swapping or file mapped pages because THP does
+ * not support them for now.
+ * Caller should make sure that pmd_trans_huge(pmd) is true.
+ */
+static enum mc_target_type get_mctgt_type_thp(struct vm_area_struct *vma,
+ unsigned long addr, pmd_t pmd, union mc_target *target)
+{
+ struct page *page = NULL;
+ struct folio *folio;
+ enum mc_target_type ret = MC_TARGET_NONE;
+
+ if (unlikely(is_swap_pmd(pmd))) {
+ VM_BUG_ON(thp_migration_supported() &&
+ !is_pmd_migration_entry(pmd));
+ return ret;
+ }
+ page = pmd_page(pmd);
+ VM_BUG_ON_PAGE(!page || !PageHead(page), page);
+ folio = page_folio(page);
+ if (!(mc.flags & MOVE_ANON))
+ return ret;
+ if (folio_memcg(folio) == mc.from) {
+ ret = MC_TARGET_PAGE;
+ if (target) {
+ folio_get(folio);
+ if (!folio_trylock(folio)) {
+ folio_put(folio);
+ return MC_TARGET_NONE;
+ }
+ target->folio = folio;
+ }
+ }
+ return ret;
+}
+#else
+static inline enum mc_target_type get_mctgt_type_thp(struct vm_area_struct *vma,
+ unsigned long addr, pmd_t pmd, union mc_target *target)
+{
+ return MC_TARGET_NONE;
+}
+#endif
+
+static int mem_cgroup_count_precharge_pte_range(pmd_t *pmd,
+ unsigned long addr, unsigned long end,
+ struct mm_walk *walk)
+{
+ struct vm_area_struct *vma = walk->vma;
+ pte_t *pte;
+ spinlock_t *ptl;
+
+ ptl = pmd_trans_huge_lock(pmd, vma);
+ if (ptl) {
+ /*
+ * Note their can not be MC_TARGET_DEVICE for now as we do not
+ * support transparent huge page with MEMORY_DEVICE_PRIVATE but
+ * this might change.
+ */
+ if (get_mctgt_type_thp(vma, addr, *pmd, NULL) == MC_TARGET_PAGE)
+ mc.precharge += HPAGE_PMD_NR;
+ spin_unlock(ptl);
+ return 0;
+ }
+
+ pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+ if (!pte)
+ return 0;
+ for (; addr != end; pte++, addr += PAGE_SIZE)
+ if (get_mctgt_type(vma, addr, ptep_get(pte), NULL))
+ mc.precharge++; /* increment precharge temporarily */
+ pte_unmap_unlock(pte - 1, ptl);
+ cond_resched();
+
+ return 0;
+}
+
+static const struct mm_walk_ops precharge_walk_ops = {
+ .pmd_entry = mem_cgroup_count_precharge_pte_range,
+ .walk_lock = PGWALK_RDLOCK,
+};
+
+static unsigned long mem_cgroup_count_precharge(struct mm_struct *mm)
+{
+ unsigned long precharge;
+
+ mmap_read_lock(mm);
+ walk_page_range(mm, 0, ULONG_MAX, &precharge_walk_ops, NULL);
+ mmap_read_unlock(mm);
+
+ precharge = mc.precharge;
+ mc.precharge = 0;
+
+ return precharge;
+}
+
+static int mem_cgroup_precharge_mc(struct mm_struct *mm)
+{
+ unsigned long precharge = mem_cgroup_count_precharge(mm);
+
+ VM_BUG_ON(mc.moving_task);
+ mc.moving_task = current;
+ return mem_cgroup_do_precharge(precharge);
+}
+
+/* cancels all extra charges on mc.from and mc.to, and wakes up all waiters. */
+static void __mem_cgroup_clear_mc(void)
+{
+ struct mem_cgroup *from = mc.from;
+ struct mem_cgroup *to = mc.to;
+
+ /* we must uncharge all the leftover precharges from mc.to */
+ if (mc.precharge) {
+ mem_cgroup_cancel_charge(mc.to, mc.precharge);
+ mc.precharge = 0;
+ }
+ /*
+ * we didn't uncharge from mc.from at mem_cgroup_move_account(), so
+ * we must uncharge here.
+ */
+ if (mc.moved_charge) {
+ mem_cgroup_cancel_charge(mc.from, mc.moved_charge);
+ mc.moved_charge = 0;
+ }
+ /* we must fixup refcnts and charges */
+ if (mc.moved_swap) {
+ /* uncharge swap account from the old cgroup */
+ if (!mem_cgroup_is_root(mc.from))
+ page_counter_uncharge(&mc.from->memsw, mc.moved_swap);
+
+ mem_cgroup_id_put_many(mc.from, mc.moved_swap);
+
+ /*
+ * we charged both to->memory and to->memsw, so we
+ * should uncharge to->memory.
+ */
+ if (!mem_cgroup_is_root(mc.to))
+ page_counter_uncharge(&mc.to->memory, mc.moved_swap);
+
+ mc.moved_swap = 0;
+ }
+ memcg_oom_recover(from);
+ memcg_oom_recover(to);
+ wake_up_all(&mc.waitq);
+}
+
+static void mem_cgroup_clear_mc(void)
+{
+ struct mm_struct *mm = mc.mm;
+
+ /*
+ * we must clear moving_task before waking up waiters at the end of
+ * task migration.
+ */
+ mc.moving_task = NULL;
+ __mem_cgroup_clear_mc();
+ spin_lock(&mc.lock);
+ mc.from = NULL;
+ mc.to = NULL;
+ mc.mm = NULL;
+ spin_unlock(&mc.lock);
+
+ mmput(mm);
+}
+
+int mem_cgroup_can_attach(struct cgroup_taskset *tset)
+{
+ struct cgroup_subsys_state *css;
+ struct mem_cgroup *memcg = NULL; /* unneeded init to make gcc happy */
+ struct mem_cgroup *from;
+ struct task_struct *leader, *p;
+ struct mm_struct *mm;
+ unsigned long move_flags;
+ int ret = 0;
+
+ /* charge immigration isn't supported on the default hierarchy */
+ if (cgroup_subsys_on_dfl(memory_cgrp_subsys))
+ return 0;
+
+ /*
+ * Multi-process migrations only happen on the default hierarchy
+ * where charge immigration is not used. Perform charge
+ * immigration if @tset contains a leader and whine if there are
+ * multiple.
+ */
+ p = NULL;
+ cgroup_taskset_for_each_leader(leader, css, tset) {
+ WARN_ON_ONCE(p);
+ p = leader;
+ memcg = mem_cgroup_from_css(css);
+ }
+ if (!p)
+ return 0;
+
+ /*
+ * We are now committed to this value whatever it is. Changes in this
+ * tunable will only affect upcoming migrations, not the current one.
+ * So we need to save it, and keep it going.
+ */
+ move_flags = READ_ONCE(memcg->move_charge_at_immigrate);
+ if (!move_flags)
+ return 0;
+
+ from = mem_cgroup_from_task(p);
+
+ VM_BUG_ON(from == memcg);
+
+ mm = get_task_mm(p);
+ if (!mm)
+ return 0;
+ /* We move charges only when we move a owner of the mm */
+ if (mm->owner == p) {
+ VM_BUG_ON(mc.from);
+ VM_BUG_ON(mc.to);
+ VM_BUG_ON(mc.precharge);
+ VM_BUG_ON(mc.moved_charge);
+ VM_BUG_ON(mc.moved_swap);
+
+ spin_lock(&mc.lock);
+ mc.mm = mm;
+ mc.from = from;
+ mc.to = memcg;
+ mc.flags = move_flags;
+ spin_unlock(&mc.lock);
+ /* We set mc.moving_task later */
+
+ ret = mem_cgroup_precharge_mc(mm);
+ if (ret)
+ mem_cgroup_clear_mc();
+ } else {
+ mmput(mm);
+ }
+ return ret;
+}
+
+void mem_cgroup_cancel_attach(struct cgroup_taskset *tset)
+{
+ if (mc.to)
+ mem_cgroup_clear_mc();
+}
+
+static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
+ unsigned long addr, unsigned long end,
+ struct mm_walk *walk)
+{
+ int ret = 0;
+ struct vm_area_struct *vma = walk->vma;
+ pte_t *pte;
+ spinlock_t *ptl;
+ enum mc_target_type target_type;
+ union mc_target target;
+ struct folio *folio;
+
+ ptl = pmd_trans_huge_lock(pmd, vma);
+ if (ptl) {
+ if (mc.precharge < HPAGE_PMD_NR) {
+ spin_unlock(ptl);
+ return 0;
+ }
+ target_type = get_mctgt_type_thp(vma, addr, *pmd, &target);
+ if (target_type == MC_TARGET_PAGE) {
+ folio = target.folio;
+ if (folio_isolate_lru(folio)) {
+ if (!mem_cgroup_move_account(folio, true,
+ mc.from, mc.to)) {
+ mc.precharge -= HPAGE_PMD_NR;
+ mc.moved_charge += HPAGE_PMD_NR;
+ }
+ folio_putback_lru(folio);
+ }
+ folio_unlock(folio);
+ folio_put(folio);
+ } else if (target_type == MC_TARGET_DEVICE) {
+ folio = target.folio;
+ if (!mem_cgroup_move_account(folio, true,
+ mc.from, mc.to)) {
+ mc.precharge -= HPAGE_PMD_NR;
+ mc.moved_charge += HPAGE_PMD_NR;
+ }
+ folio_unlock(folio);
+ folio_put(folio);
+ }
+ spin_unlock(ptl);
+ return 0;
+ }
+
+retry:
+ pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+ if (!pte)
+ return 0;
+ for (; addr != end; addr += PAGE_SIZE) {
+ pte_t ptent = ptep_get(pte++);
+ bool device = false;
+ swp_entry_t ent;
+
+ if (!mc.precharge)
+ break;
+
+ switch (get_mctgt_type(vma, addr, ptent, &target)) {
+ case MC_TARGET_DEVICE:
+ device = true;
+ fallthrough;
+ case MC_TARGET_PAGE:
+ folio = target.folio;
+ /*
+ * We can have a part of the split pmd here. Moving it
+ * can be done but it would be too convoluted so simply
+ * ignore such a partial THP and keep it in original
+ * memcg. There should be somebody mapping the head.
+ */
+ if (folio_test_large(folio))
+ goto put;
+ if (!device && !folio_isolate_lru(folio))
+ goto put;
+ if (!mem_cgroup_move_account(folio, false,
+ mc.from, mc.to)) {
+ mc.precharge--;
+ /* we uncharge from mc.from later. */
+ mc.moved_charge++;
+ }
+ if (!device)
+ folio_putback_lru(folio);
+put: /* get_mctgt_type() gets & locks the page */
+ folio_unlock(folio);
+ folio_put(folio);
+ break;
+ case MC_TARGET_SWAP:
+ ent = target.ent;
+ if (!mem_cgroup_move_swap_account(ent, mc.from, mc.to)) {
+ mc.precharge--;
+ mem_cgroup_id_get_many(mc.to, 1);
+ /* we fixup other refcnts and charges later. */
+ mc.moved_swap++;
+ }
+ break;
+ default:
+ break;
+ }
+ }
+ pte_unmap_unlock(pte - 1, ptl);
+ cond_resched();
+
+ if (addr != end) {
+ /*
+ * We have consumed all precharges we got in can_attach().
+ * We try charge one by one, but don't do any additional
+ * charges to mc.to if we have failed in charge once in attach()
+ * phase.
+ */
+ ret = mem_cgroup_do_precharge(1);
+ if (!ret)
+ goto retry;
+ }
+
+ return ret;
+}
+
+static const struct mm_walk_ops charge_walk_ops = {
+ .pmd_entry = mem_cgroup_move_charge_pte_range,
+ .walk_lock = PGWALK_RDLOCK,
+};
+
+static void mem_cgroup_move_charge(void)
+{
+ lru_add_drain_all();
+ /*
+ * Signal folio_memcg_lock() to take the memcg's move_lock
+ * while we're moving its pages to another memcg. Then wait
+ * for already started RCU-only updates to finish.
+ */
+ atomic_inc(&mc.from->moving_account);
+ synchronize_rcu();
+retry:
+ if (unlikely(!mmap_read_trylock(mc.mm))) {
+ /*
+ * Someone who are holding the mmap_lock might be waiting in
+ * waitq. So we cancel all extra charges, wake up all waiters,
+ * and retry. Because we cancel precharges, we might not be able
+ * to move enough charges, but moving charge is a best-effort
+ * feature anyway, so it wouldn't be a big problem.
+ */
+ __mem_cgroup_clear_mc();
+ cond_resched();
+ goto retry;
+ }
+ /*
+ * When we have consumed all precharges and failed in doing
+ * additional charge, the page walk just aborts.
+ */
+ walk_page_range(mc.mm, 0, ULONG_MAX, &charge_walk_ops, NULL);
+ mmap_read_unlock(mc.mm);
+ atomic_dec(&mc.from->moving_account);
+}
+
+void mem_cgroup_move_task(void)
+{
+ if (mc.to) {
+ mem_cgroup_move_charge();
+ mem_cgroup_clear_mc();
+ }
+}
+
+#else /* !CONFIG_MMU */
+int mem_cgroup_can_attach(struct cgroup_taskset *tset)
+{
+ return 0;
+}
+void mem_cgroup_cancel_attach(struct cgroup_taskset *tset)
+{
+}
+void mem_cgroup_move_task(void)
+{
+}
+#endif
+
static int __init mem_cgroup_v1_init(void)
{
int node;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 0c2196f42631..dc0a38d1107c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -28,7 +28,6 @@
#include <linux/page_counter.h>
#include <linux/memcontrol.h>
#include <linux/cgroup.h>
-#include <linux/pagewalk.h>
#include <linux/sched/mm.h>
#include <linux/shmem_fs.h>
#include <linux/hugetlb.h>
@@ -45,7 +44,6 @@
#include <linux/mutex.h>
#include <linux/rbtree.h>
#include <linux/slab.h>
-#include <linux/swap.h>
#include <linux/swapops.h>
#include <linux/spinlock.h>
#include <linux/eventfd.h>
@@ -71,7 +69,6 @@
#include <net/sock.h>
#include <net/ip.h>
#include "slab.h"
-#include "swap.h"

#include <linux/uaccess.h>

@@ -157,31 +154,6 @@ struct mem_cgroup_event {
static void mem_cgroup_threshold(struct mem_cgroup *memcg);
static void mem_cgroup_oom_notify(struct mem_cgroup *memcg);

-/* Stuffs for move charges at task migration. */
-/*
- * Types of charges to be moved.
- */
-#define MOVE_ANON 0x1U
-#define MOVE_FILE 0x2U
-#define MOVE_MASK (MOVE_ANON | MOVE_FILE)
-
-/* "mc" and its members are protected by cgroup_mutex */
-static struct move_charge_struct {
- spinlock_t lock; /* for from, to */
- struct mm_struct *mm;
- struct mem_cgroup *from;
- struct mem_cgroup *to;
- unsigned long flags;
- unsigned long precharge;
- unsigned long moved_charge;
- unsigned long moved_swap;
- struct task_struct *moving_task; /* a task moving charges */
- wait_queue_head_t waitq; /* a waitq for other context */
-} mc = {
- .lock = __SPIN_LOCK_UNLOCKED(mc.lock),
- .waitq = __WAIT_QUEUE_HEAD_INITIALIZER(mc.waitq),
-};
-
/* for encoding cft->private value on file */
enum res_type {
_MEM,
@@ -957,8 +929,7 @@ static unsigned long memcg_events_local(struct mem_cgroup *memcg, int event)
return READ_ONCE(memcg->vmstats->events_local[i]);
}

-static void mem_cgroup_charge_statistics(struct mem_cgroup *memcg,
- int nr_pages)
+void mem_cgroup_charge_statistics(struct mem_cgroup *memcg, int nr_pages)
{
/* pagein of a big page is an event. So, ignore page size */
if (nr_pages > 0)
@@ -1000,7 +971,7 @@ static bool mem_cgroup_event_ratelimit(struct mem_cgroup *memcg,
* Check events in order.
*
*/
-static void memcg_check_events(struct mem_cgroup *memcg, int nid)
+void memcg_check_events(struct mem_cgroup *memcg, int nid)
{
if (IS_ENABLED(CONFIG_PREEMPT_RT))
return;
@@ -1469,51 +1440,6 @@ static unsigned long mem_cgroup_margin(struct mem_cgroup *memcg)
return margin;
}

-/*
- * A routine for checking "mem" is under move_account() or not.
- *
- * Checking a cgroup is mc.from or mc.to or under hierarchy of
- * moving cgroups. This is for waiting at high-memory pressure
- * caused by "move".
- */
-static bool mem_cgroup_under_move(struct mem_cgroup *memcg)
-{
- struct mem_cgroup *from;
- struct mem_cgroup *to;
- bool ret = false;
- /*
- * Unlike task_move routines, we access mc.to, mc.from not under
- * mutual exclusion by cgroup_mutex. Here, we take spinlock instead.
- */
- spin_lock(&mc.lock);
- from = mc.from;
- to = mc.to;
- if (!from)
- goto unlock;
-
- ret = mem_cgroup_is_descendant(from, memcg) ||
- mem_cgroup_is_descendant(to, memcg);
-unlock:
- spin_unlock(&mc.lock);
- return ret;
-}
-
-static bool mem_cgroup_wait_acct_move(struct mem_cgroup *memcg)
-{
- if (mc.moving_task && current != mc.moving_task) {
- if (mem_cgroup_under_move(memcg)) {
- DEFINE_WAIT(wait);
- prepare_to_wait(&mc.waitq, &wait, TASK_INTERRUPTIBLE);
- /* moving charge context might have finished. */
- if (mc.moving_task)
- schedule();
- finish_wait(&mc.waitq, &wait);
- return true;
- }
- }
- return false;
-}
-
struct memory_stat {
const char *name;
unsigned int idx;
@@ -1906,7 +1832,7 @@ static int memcg_oom_wake_function(wait_queue_entry_t *wait,
return autoremove_wake_function(wait, mode, sync, arg);
}

-static void memcg_oom_recover(struct mem_cgroup *memcg)
+void memcg_oom_recover(struct mem_cgroup *memcg)
{
/*
* For the following lockless ->under_oom test, the only required
@@ -2097,87 +2023,6 @@ void mem_cgroup_print_oom_group(struct mem_cgroup *memcg)
pr_cont(" are going to be killed due to memory.oom.group set\n");
}

-/**
- * folio_memcg_lock - Bind a folio to its memcg.
- * @folio: The folio.
- *
- * This function prevents unlocked LRU folios from being moved to
- * another cgroup.
- *
- * It ensures lifetime of the bound memcg. The caller is responsible
- * for the lifetime of the folio.
- */
-void folio_memcg_lock(struct folio *folio)
-{
- struct mem_cgroup *memcg;
- unsigned long flags;
-
- /*
- * The RCU lock is held throughout the transaction. The fast
- * path can get away without acquiring the memcg->move_lock
- * because page moving starts with an RCU grace period.
- */
- rcu_read_lock();
-
- if (mem_cgroup_disabled())
- return;
-again:
- memcg = folio_memcg(folio);
- if (unlikely(!memcg))
- return;
-
-#ifdef CONFIG_PROVE_LOCKING
- local_irq_save(flags);
- might_lock(&memcg->move_lock);
- local_irq_restore(flags);
-#endif
-
- if (atomic_read(&memcg->moving_account) <= 0)
- return;
-
- spin_lock_irqsave(&memcg->move_lock, flags);
- if (memcg != folio_memcg(folio)) {
- spin_unlock_irqrestore(&memcg->move_lock, flags);
- goto again;
- }
-
- /*
- * When charge migration first begins, we can have multiple
- * critical sections holding the fast-path RCU lock and one
- * holding the slowpath move_lock. Track the task who has the
- * move_lock for folio_memcg_unlock().
- */
- memcg->move_lock_task = current;
- memcg->move_lock_flags = flags;
-}
-
-static void __folio_memcg_unlock(struct mem_cgroup *memcg)
-{
- if (memcg && memcg->move_lock_task == current) {
- unsigned long flags = memcg->move_lock_flags;
-
- memcg->move_lock_task = NULL;
- memcg->move_lock_flags = 0;
-
- spin_unlock_irqrestore(&memcg->move_lock, flags);
- }
-
- rcu_read_unlock();
-}
-
-/**
- * folio_memcg_unlock - Release the binding between a folio and its memcg.
- * @folio: The folio.
- *
- * This releases the binding created by folio_memcg_lock(). This does
- * not change the accounting of this folio to its memcg, but it does
- * permit others to change it.
- */
-void folio_memcg_unlock(struct folio *folio)
-{
- __folio_memcg_unlock(folio_memcg(folio));
-}
-
struct memcg_stock_pcp {
local_lock_t stock_lock;
struct mem_cgroup *cached; /* this never be root cgroup */
@@ -2657,8 +2502,8 @@ void mem_cgroup_handle_over_high(gfp_t gfp_mask)
css_put(&memcg->css);
}

-static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
- unsigned int nr_pages)
+int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
+ unsigned int nr_pages)
{
unsigned int batch = max(MEMCG_CHARGE_BATCH, nr_pages);
int nr_retries = MAX_RECLAIM_RETRIES;
@@ -2853,15 +2698,6 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
return 0;
}

-static inline int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
- unsigned int nr_pages)
-{
- if (mem_cgroup_is_root(memcg))
- return 0;
-
- return try_charge_memcg(memcg, gfp_mask, nr_pages);
-}
-
/**
* mem_cgroup_cancel_charge() - cancel an uncommitted try_charge() call.
* @memcg: memcg previously charged.
@@ -3601,44 +3437,6 @@ void split_page_memcg(struct page *head, int old_order, int new_order)
css_get_many(&memcg->css, old_nr / new_nr - 1);
}

-#ifdef CONFIG_SWAP
-/**
- * mem_cgroup_move_swap_account - move swap charge and swap_cgroup's record.
- * @entry: swap entry to be moved
- * @from: mem_cgroup which the entry is moved from
- * @to: mem_cgroup which the entry is moved to
- *
- * It succeeds only when the swap_cgroup's record for this entry is the same
- * as the mem_cgroup's id of @from.
- *
- * Returns 0 on success, -EINVAL on failure.
- *
- * The caller must have charged to @to, IOW, called page_counter_charge() about
- * both res and memsw, and called css_get().
- */
-static int mem_cgroup_move_swap_account(swp_entry_t entry,
- struct mem_cgroup *from, struct mem_cgroup *to)
-{
- unsigned short old_id, new_id;
-
- old_id = mem_cgroup_id(from);
- new_id = mem_cgroup_id(to);
-
- if (swap_cgroup_cmpxchg(entry, old_id, new_id) == old_id) {
- mod_memcg_state(from, MEMCG_SWAP, -1);
- mod_memcg_state(to, MEMCG_SWAP, 1);
- return 0;
- }
- return -EINVAL;
-}
-#else
-static inline int mem_cgroup_move_swap_account(swp_entry_t entry,
- struct mem_cgroup *from, struct mem_cgroup *to)
-{
- return -EINVAL;
-}
-#endif
-
static DEFINE_MUTEX(memcg_max_mutex);

static int mem_cgroup_resize_max(struct mem_cgroup *memcg,
@@ -4021,42 +3819,6 @@ static ssize_t mem_cgroup_reset(struct kernfs_open_file *of, char *buf,
return nbytes;
}

-static u64 mem_cgroup_move_charge_read(struct cgroup_subsys_state *css,
- struct cftype *cft)
-{
- return mem_cgroup_from_css(css)->move_charge_at_immigrate;
-}
-
-#ifdef CONFIG_MMU
-static int mem_cgroup_move_charge_write(struct cgroup_subsys_state *css,
- struct cftype *cft, u64 val)
-{
- struct mem_cgroup *memcg = mem_cgroup_from_css(css);
-
- pr_warn_once("Cgroup memory moving (move_charge_at_immigrate) is deprecated. "
- "Please report your usecase to [email protected] if you "
- "depend on this functionality.\n");
-
- if (val & ~MOVE_MASK)
- return -EINVAL;
-
- /*
- * No kind of locking is needed in here, because ->can_attach() will
- * check this value once in the beginning of the process, and then carry
- * on with stale data. This means that changes to this value will only
- * affect task migrations starting after the change.
- */
- memcg->move_charge_at_immigrate = val;
- return 0;
-}
-#else
-static int mem_cgroup_move_charge_write(struct cgroup_subsys_state *css,
- struct cftype *cft, u64 val)
-{
- return -ENOSYS;
-}
-#endif
-
#ifdef CONFIG_NUMA

#define LRU_ALL_FILE (BIT(LRU_INACTIVE_FILE) | BIT(LRU_ACTIVE_FILE))
@@ -5267,13 +5029,13 @@ static void mem_cgroup_id_remove(struct mem_cgroup *memcg)
}
}

-static void __maybe_unused mem_cgroup_id_get_many(struct mem_cgroup *memcg,
- unsigned int n)
+void __maybe_unused mem_cgroup_id_get_many(struct mem_cgroup *memcg,
+ unsigned int n)
{
refcount_add(n, &memcg->id.ref);
}

-static void mem_cgroup_id_put_many(struct mem_cgroup *memcg, unsigned int n)
+void mem_cgroup_id_put_many(struct mem_cgroup *memcg, unsigned int n)
{
if (refcount_sub_and_test(n, &memcg->id.ref)) {
mem_cgroup_id_remove(memcg);
@@ -5753,757 +5515,6 @@ static void mem_cgroup_css_rstat_flush(struct cgroup_subsys_state *css, int cpu)
atomic64_set(&memcg->vmstats->stats_updates, 0);
}

-#ifdef CONFIG_MMU
-/* Handlers for move charge at task migration. */
-static int mem_cgroup_do_precharge(unsigned long count)
-{
- int ret;
-
- /* Try a single bulk charge without reclaim first, kswapd may wake */
- ret = try_charge(mc.to, GFP_KERNEL & ~__GFP_DIRECT_RECLAIM, count);
- if (!ret) {
- mc.precharge += count;
- return ret;
- }
-
- /* Try charges one by one with reclaim, but do not retry */
- while (count--) {
- ret = try_charge(mc.to, GFP_KERNEL | __GFP_NORETRY, 1);
- if (ret)
- return ret;
- mc.precharge++;
- cond_resched();
- }
- return 0;
-}
-
-union mc_target {
- struct folio *folio;
- swp_entry_t ent;
-};
-
-enum mc_target_type {
- MC_TARGET_NONE = 0,
- MC_TARGET_PAGE,
- MC_TARGET_SWAP,
- MC_TARGET_DEVICE,
-};
-
-static struct page *mc_handle_present_pte(struct vm_area_struct *vma,
- unsigned long addr, pte_t ptent)
-{
- struct page *page = vm_normal_page(vma, addr, ptent);
-
- if (!page)
- return NULL;
- if (PageAnon(page)) {
- if (!(mc.flags & MOVE_ANON))
- return NULL;
- } else {
- if (!(mc.flags & MOVE_FILE))
- return NULL;
- }
- get_page(page);
-
- return page;
-}
-
-#if defined(CONFIG_SWAP) || defined(CONFIG_DEVICE_PRIVATE)
-static struct page *mc_handle_swap_pte(struct vm_area_struct *vma,
- pte_t ptent, swp_entry_t *entry)
-{
- struct page *page = NULL;
- swp_entry_t ent = pte_to_swp_entry(ptent);
-
- if (!(mc.flags & MOVE_ANON))
- return NULL;
-
- /*
- * Handle device private pages that are not accessible by the CPU, but
- * stored as special swap entries in the page table.
- */
- if (is_device_private_entry(ent)) {
- page = pfn_swap_entry_to_page(ent);
- if (!get_page_unless_zero(page))
- return NULL;
- return page;
- }
-
- if (non_swap_entry(ent))
- return NULL;
-
- /*
- * Because swap_cache_get_folio() updates some statistics counter,
- * we call find_get_page() with swapper_space directly.
- */
- page = find_get_page(swap_address_space(ent), swp_offset(ent));
- entry->val = ent.val;
-
- return page;
-}
-#else
-static struct page *mc_handle_swap_pte(struct vm_area_struct *vma,
- pte_t ptent, swp_entry_t *entry)
-{
- return NULL;
-}
-#endif
-
-static struct page *mc_handle_file_pte(struct vm_area_struct *vma,
- unsigned long addr, pte_t ptent)
-{
- unsigned long index;
- struct folio *folio;
-
- if (!vma->vm_file) /* anonymous vma */
- return NULL;
- if (!(mc.flags & MOVE_FILE))
- return NULL;
-
- /* folio is moved even if it's not RSS of this task(page-faulted). */
- /* shmem/tmpfs may report page out on swap: account for that too. */
- index = linear_page_index(vma, addr);
- folio = filemap_get_incore_folio(vma->vm_file->f_mapping, index);
- if (IS_ERR(folio))
- return NULL;
- return folio_file_page(folio, index);
-}
-
-/**
- * mem_cgroup_move_account - move account of the folio
- * @folio: The folio.
- * @compound: charge the page as compound or small page
- * @from: mem_cgroup which the folio is moved from.
- * @to: mem_cgroup which the folio is moved to. @from != @to.
- *
- * The folio must be locked and not on the LRU.
- *
- * This function doesn't do "charge" to new cgroup and doesn't do "uncharge"
- * from old cgroup.
- */
-static int mem_cgroup_move_account(struct folio *folio,
- bool compound,
- struct mem_cgroup *from,
- struct mem_cgroup *to)
-{
- struct lruvec *from_vec, *to_vec;
- struct pglist_data *pgdat;
- unsigned int nr_pages = compound ? folio_nr_pages(folio) : 1;
- int nid, ret;
-
- VM_BUG_ON(from == to);
- VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
- VM_BUG_ON_FOLIO(folio_test_lru(folio), folio);
- VM_BUG_ON(compound && !folio_test_large(folio));
-
- ret = -EINVAL;
- if (folio_memcg(folio) != from)
- goto out;
-
- pgdat = folio_pgdat(folio);
- from_vec = mem_cgroup_lruvec(from, pgdat);
- to_vec = mem_cgroup_lruvec(to, pgdat);
-
- folio_memcg_lock(folio);
-
- if (folio_test_anon(folio)) {
- if (folio_mapped(folio)) {
- __mod_lruvec_state(from_vec, NR_ANON_MAPPED, -nr_pages);
- __mod_lruvec_state(to_vec, NR_ANON_MAPPED, nr_pages);
- if (folio_test_pmd_mappable(folio)) {
- __mod_lruvec_state(from_vec, NR_ANON_THPS,
- -nr_pages);
- __mod_lruvec_state(to_vec, NR_ANON_THPS,
- nr_pages);
- }
- }
- } else {
- __mod_lruvec_state(from_vec, NR_FILE_PAGES, -nr_pages);
- __mod_lruvec_state(to_vec, NR_FILE_PAGES, nr_pages);
-
- if (folio_test_swapbacked(folio)) {
- __mod_lruvec_state(from_vec, NR_SHMEM, -nr_pages);
- __mod_lruvec_state(to_vec, NR_SHMEM, nr_pages);
- }
-
- if (folio_mapped(folio)) {
- __mod_lruvec_state(from_vec, NR_FILE_MAPPED, -nr_pages);
- __mod_lruvec_state(to_vec, NR_FILE_MAPPED, nr_pages);
- }
-
- if (folio_test_dirty(folio)) {
- struct address_space *mapping = folio_mapping(folio);
-
- if (mapping_can_writeback(mapping)) {
- __mod_lruvec_state(from_vec, NR_FILE_DIRTY,
- -nr_pages);
- __mod_lruvec_state(to_vec, NR_FILE_DIRTY,
- nr_pages);
- }
- }
- }
-
-#ifdef CONFIG_SWAP
- if (folio_test_swapcache(folio)) {
- __mod_lruvec_state(from_vec, NR_SWAPCACHE, -nr_pages);
- __mod_lruvec_state(to_vec, NR_SWAPCACHE, nr_pages);
- }
-#endif
- if (folio_test_writeback(folio)) {
- __mod_lruvec_state(from_vec, NR_WRITEBACK, -nr_pages);
- __mod_lruvec_state(to_vec, NR_WRITEBACK, nr_pages);
- }
-
- /*
- * All state has been migrated, let's switch to the new memcg.
- *
- * It is safe to change page's memcg here because the page
- * is referenced, charged, isolated, and locked: we can't race
- * with (un)charging, migration, LRU putback, or anything else
- * that would rely on a stable page's memory cgroup.
- *
- * Note that folio_memcg_lock is a memcg lock, not a page lock,
- * to save space. As soon as we switch page's memory cgroup to a
- * new memcg that isn't locked, the above state can change
- * concurrently again. Make sure we're truly done with it.
- */
- smp_mb();
-
- css_get(&to->css);
- css_put(&from->css);
-
- folio->memcg_data = (unsigned long)to;
-
- __folio_memcg_unlock(from);
-
- ret = 0;
- nid = folio_nid(folio);
-
- local_irq_disable();
- mem_cgroup_charge_statistics(to, nr_pages);
- memcg_check_events(to, nid);
- mem_cgroup_charge_statistics(from, -nr_pages);
- memcg_check_events(from, nid);
- local_irq_enable();
-out:
- return ret;
-}
-
-/**
- * get_mctgt_type - get target type of moving charge
- * @vma: the vma the pte to be checked belongs
- * @addr: the address corresponding to the pte to be checked
- * @ptent: the pte to be checked
- * @target: the pointer the target page or swap ent will be stored(can be NULL)
- *
- * Context: Called with pte lock held.
- * Return:
- * * MC_TARGET_NONE - If the pte is not a target for move charge.
- * * MC_TARGET_PAGE - If the page corresponding to this pte is a target for
- * move charge. If @target is not NULL, the folio is stored in target->folio
- * with extra refcnt taken (Caller should release it).
- * * MC_TARGET_SWAP - If the swap entry corresponding to this pte is a
- * target for charge migration. If @target is not NULL, the entry is
- * stored in target->ent.
- * * MC_TARGET_DEVICE - Like MC_TARGET_PAGE but page is device memory and
- * thus not on the lru. For now such page is charged like a regular page
- * would be as it is just special memory taking the place of a regular page.
- * See Documentations/vm/hmm.txt and include/linux/hmm.h
- */
-static enum mc_target_type get_mctgt_type(struct vm_area_struct *vma,
- unsigned long addr, pte_t ptent, union mc_target *target)
-{
- struct page *page = NULL;
- struct folio *folio;
- enum mc_target_type ret = MC_TARGET_NONE;
- swp_entry_t ent = { .val = 0 };
-
- if (pte_present(ptent))
- page = mc_handle_present_pte(vma, addr, ptent);
- else if (pte_none_mostly(ptent))
- /*
- * PTE markers should be treated as a none pte here, separated
- * from other swap handling below.
- */
- page = mc_handle_file_pte(vma, addr, ptent);
- else if (is_swap_pte(ptent))
- page = mc_handle_swap_pte(vma, ptent, &ent);
-
- if (page)
- folio = page_folio(page);
- if (target && page) {
- if (!folio_trylock(folio)) {
- folio_put(folio);
- return ret;
- }
- /*
- * page_mapped() must be stable during the move. This
- * pte is locked, so if it's present, the page cannot
- * become unmapped. If it isn't, we have only partial
- * control over the mapped state: the page lock will
- * prevent new faults against pagecache and swapcache,
- * so an unmapped page cannot become mapped. However,
- * if the page is already mapped elsewhere, it can
- * unmap, and there is nothing we can do about it.
- * Alas, skip moving the page in this case.
- */
- if (!pte_present(ptent) && page_mapped(page)) {
- folio_unlock(folio);
- folio_put(folio);
- return ret;
- }
- }
-
- if (!page && !ent.val)
- return ret;
- if (page) {
- /*
- * Do only loose check w/o serialization.
- * mem_cgroup_move_account() checks the page is valid or
- * not under LRU exclusion.
- */
- if (folio_memcg(folio) == mc.from) {
- ret = MC_TARGET_PAGE;
- if (folio_is_device_private(folio) ||
- folio_is_device_coherent(folio))
- ret = MC_TARGET_DEVICE;
- if (target)
- target->folio = folio;
- }
- if (!ret || !target) {
- if (target)
- folio_unlock(folio);
- folio_put(folio);
- }
- }
- /*
- * There is a swap entry and a page doesn't exist or isn't charged.
- * But we cannot move a tail-page in a THP.
- */
- if (ent.val && !ret && (!page || !PageTransCompound(page)) &&
- mem_cgroup_id(mc.from) == lookup_swap_cgroup_id(ent)) {
- ret = MC_TARGET_SWAP;
- if (target)
- target->ent = ent;
- }
- return ret;
-}
-
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-/*
- * We don't consider PMD mapped swapping or file mapped pages because THP does
- * not support them for now.
- * Caller should make sure that pmd_trans_huge(pmd) is true.
- */
-static enum mc_target_type get_mctgt_type_thp(struct vm_area_struct *vma,
- unsigned long addr, pmd_t pmd, union mc_target *target)
-{
- struct page *page = NULL;
- struct folio *folio;
- enum mc_target_type ret = MC_TARGET_NONE;
-
- if (unlikely(is_swap_pmd(pmd))) {
- VM_BUG_ON(thp_migration_supported() &&
- !is_pmd_migration_entry(pmd));
- return ret;
- }
- page = pmd_page(pmd);
- VM_BUG_ON_PAGE(!page || !PageHead(page), page);
- folio = page_folio(page);
- if (!(mc.flags & MOVE_ANON))
- return ret;
- if (folio_memcg(folio) == mc.from) {
- ret = MC_TARGET_PAGE;
- if (target) {
- folio_get(folio);
- if (!folio_trylock(folio)) {
- folio_put(folio);
- return MC_TARGET_NONE;
- }
- target->folio = folio;
- }
- }
- return ret;
-}
-#else
-static inline enum mc_target_type get_mctgt_type_thp(struct vm_area_struct *vma,
- unsigned long addr, pmd_t pmd, union mc_target *target)
-{
- return MC_TARGET_NONE;
-}
-#endif
-
-static int mem_cgroup_count_precharge_pte_range(pmd_t *pmd,
- unsigned long addr, unsigned long end,
- struct mm_walk *walk)
-{
- struct vm_area_struct *vma = walk->vma;
- pte_t *pte;
- spinlock_t *ptl;
-
- ptl = pmd_trans_huge_lock(pmd, vma);
- if (ptl) {
- /*
- * Note their can not be MC_TARGET_DEVICE for now as we do not
- * support transparent huge page with MEMORY_DEVICE_PRIVATE but
- * this might change.
- */
- if (get_mctgt_type_thp(vma, addr, *pmd, NULL) == MC_TARGET_PAGE)
- mc.precharge += HPAGE_PMD_NR;
- spin_unlock(ptl);
- return 0;
- }
-
- pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
- if (!pte)
- return 0;
- for (; addr != end; pte++, addr += PAGE_SIZE)
- if (get_mctgt_type(vma, addr, ptep_get(pte), NULL))
- mc.precharge++; /* increment precharge temporarily */
- pte_unmap_unlock(pte - 1, ptl);
- cond_resched();
-
- return 0;
-}
-
-static const struct mm_walk_ops precharge_walk_ops = {
- .pmd_entry = mem_cgroup_count_precharge_pte_range,
- .walk_lock = PGWALK_RDLOCK,
-};
-
-static unsigned long mem_cgroup_count_precharge(struct mm_struct *mm)
-{
- unsigned long precharge;
-
- mmap_read_lock(mm);
- walk_page_range(mm, 0, ULONG_MAX, &precharge_walk_ops, NULL);
- mmap_read_unlock(mm);
-
- precharge = mc.precharge;
- mc.precharge = 0;
-
- return precharge;
-}
-
-static int mem_cgroup_precharge_mc(struct mm_struct *mm)
-{
- unsigned long precharge = mem_cgroup_count_precharge(mm);
-
- VM_BUG_ON(mc.moving_task);
- mc.moving_task = current;
- return mem_cgroup_do_precharge(precharge);
-}
-
-/* cancels all extra charges on mc.from and mc.to, and wakes up all waiters. */
-static void __mem_cgroup_clear_mc(void)
-{
- struct mem_cgroup *from = mc.from;
- struct mem_cgroup *to = mc.to;
-
- /* we must uncharge all the leftover precharges from mc.to */
- if (mc.precharge) {
- mem_cgroup_cancel_charge(mc.to, mc.precharge);
- mc.precharge = 0;
- }
- /*
- * we didn't uncharge from mc.from at mem_cgroup_move_account(), so
- * we must uncharge here.
- */
- if (mc.moved_charge) {
- mem_cgroup_cancel_charge(mc.from, mc.moved_charge);
- mc.moved_charge = 0;
- }
- /* we must fixup refcnts and charges */
- if (mc.moved_swap) {
- /* uncharge swap account from the old cgroup */
- if (!mem_cgroup_is_root(mc.from))
- page_counter_uncharge(&mc.from->memsw, mc.moved_swap);
-
- mem_cgroup_id_put_many(mc.from, mc.moved_swap);
-
- /*
- * we charged both to->memory and to->memsw, so we
- * should uncharge to->memory.
- */
- if (!mem_cgroup_is_root(mc.to))
- page_counter_uncharge(&mc.to->memory, mc.moved_swap);
-
- mc.moved_swap = 0;
- }
- memcg_oom_recover(from);
- memcg_oom_recover(to);
- wake_up_all(&mc.waitq);
-}
-
-static void mem_cgroup_clear_mc(void)
-{
- struct mm_struct *mm = mc.mm;
-
- /*
- * we must clear moving_task before waking up waiters at the end of
- * task migration.
- */
- mc.moving_task = NULL;
- __mem_cgroup_clear_mc();
- spin_lock(&mc.lock);
- mc.from = NULL;
- mc.to = NULL;
- mc.mm = NULL;
- spin_unlock(&mc.lock);
-
- mmput(mm);
-}
-
-static int mem_cgroup_can_attach(struct cgroup_taskset *tset)
-{
- struct cgroup_subsys_state *css;
- struct mem_cgroup *memcg = NULL; /* unneeded init to make gcc happy */
- struct mem_cgroup *from;
- struct task_struct *leader, *p;
- struct mm_struct *mm;
- unsigned long move_flags;
- int ret = 0;
-
- /* charge immigration isn't supported on the default hierarchy */
- if (cgroup_subsys_on_dfl(memory_cgrp_subsys))
- return 0;
-
- /*
- * Multi-process migrations only happen on the default hierarchy
- * where charge immigration is not used. Perform charge
- * immigration if @tset contains a leader and whine if there are
- * multiple.
- */
- p = NULL;
- cgroup_taskset_for_each_leader(leader, css, tset) {
- WARN_ON_ONCE(p);
- p = leader;
- memcg = mem_cgroup_from_css(css);
- }
- if (!p)
- return 0;
-
- /*
- * We are now committed to this value whatever it is. Changes in this
- * tunable will only affect upcoming migrations, not the current one.
- * So we need to save it, and keep it going.
- */
- move_flags = READ_ONCE(memcg->move_charge_at_immigrate);
- if (!move_flags)
- return 0;
-
- from = mem_cgroup_from_task(p);
-
- VM_BUG_ON(from == memcg);
-
- mm = get_task_mm(p);
- if (!mm)
- return 0;
- /* We move charges only when we move a owner of the mm */
- if (mm->owner == p) {
- VM_BUG_ON(mc.from);
- VM_BUG_ON(mc.to);
- VM_BUG_ON(mc.precharge);
- VM_BUG_ON(mc.moved_charge);
- VM_BUG_ON(mc.moved_swap);
-
- spin_lock(&mc.lock);
- mc.mm = mm;
- mc.from = from;
- mc.to = memcg;
- mc.flags = move_flags;
- spin_unlock(&mc.lock);
- /* We set mc.moving_task later */
-
- ret = mem_cgroup_precharge_mc(mm);
- if (ret)
- mem_cgroup_clear_mc();
- } else {
- mmput(mm);
- }
- return ret;
-}
-
-static void mem_cgroup_cancel_attach(struct cgroup_taskset *tset)
-{
- if (mc.to)
- mem_cgroup_clear_mc();
-}
-
-static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
- unsigned long addr, unsigned long end,
- struct mm_walk *walk)
-{
- int ret = 0;
- struct vm_area_struct *vma = walk->vma;
- pte_t *pte;
- spinlock_t *ptl;
- enum mc_target_type target_type;
- union mc_target target;
- struct folio *folio;
-
- ptl = pmd_trans_huge_lock(pmd, vma);
- if (ptl) {
- if (mc.precharge < HPAGE_PMD_NR) {
- spin_unlock(ptl);
- return 0;
- }
- target_type = get_mctgt_type_thp(vma, addr, *pmd, &target);
- if (target_type == MC_TARGET_PAGE) {
- folio = target.folio;
- if (folio_isolate_lru(folio)) {
- if (!mem_cgroup_move_account(folio, true,
- mc.from, mc.to)) {
- mc.precharge -= HPAGE_PMD_NR;
- mc.moved_charge += HPAGE_PMD_NR;
- }
- folio_putback_lru(folio);
- }
- folio_unlock(folio);
- folio_put(folio);
- } else if (target_type == MC_TARGET_DEVICE) {
- folio = target.folio;
- if (!mem_cgroup_move_account(folio, true,
- mc.from, mc.to)) {
- mc.precharge -= HPAGE_PMD_NR;
- mc.moved_charge += HPAGE_PMD_NR;
- }
- folio_unlock(folio);
- folio_put(folio);
- }
- spin_unlock(ptl);
- return 0;
- }
-
-retry:
- pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
- if (!pte)
- return 0;
- for (; addr != end; addr += PAGE_SIZE) {
- pte_t ptent = ptep_get(pte++);
- bool device = false;
- swp_entry_t ent;
-
- if (!mc.precharge)
- break;
-
- switch (get_mctgt_type(vma, addr, ptent, &target)) {
- case MC_TARGET_DEVICE:
- device = true;
- fallthrough;
- case MC_TARGET_PAGE:
- folio = target.folio;
- /*
- * We can have a part of the split pmd here. Moving it
- * can be done but it would be too convoluted so simply
- * ignore such a partial THP and keep it in original
- * memcg. There should be somebody mapping the head.
- */
- if (folio_test_large(folio))
- goto put;
- if (!device && !folio_isolate_lru(folio))
- goto put;
- if (!mem_cgroup_move_account(folio, false,
- mc.from, mc.to)) {
- mc.precharge--;
- /* we uncharge from mc.from later. */
- mc.moved_charge++;
- }
- if (!device)
- folio_putback_lru(folio);
-put: /* get_mctgt_type() gets & locks the page */
- folio_unlock(folio);
- folio_put(folio);
- break;
- case MC_TARGET_SWAP:
- ent = target.ent;
- if (!mem_cgroup_move_swap_account(ent, mc.from, mc.to)) {
- mc.precharge--;
- mem_cgroup_id_get_many(mc.to, 1);
- /* we fixup other refcnts and charges later. */
- mc.moved_swap++;
- }
- break;
- default:
- break;
- }
- }
- pte_unmap_unlock(pte - 1, ptl);
- cond_resched();
-
- if (addr != end) {
- /*
- * We have consumed all precharges we got in can_attach().
- * We try charge one by one, but don't do any additional
- * charges to mc.to if we have failed in charge once in attach()
- * phase.
- */
- ret = mem_cgroup_do_precharge(1);
- if (!ret)
- goto retry;
- }
-
- return ret;
-}
-
-static const struct mm_walk_ops charge_walk_ops = {
- .pmd_entry = mem_cgroup_move_charge_pte_range,
- .walk_lock = PGWALK_RDLOCK,
-};
-
-static void mem_cgroup_move_charge(void)
-{
- lru_add_drain_all();
- /*
- * Signal folio_memcg_lock() to take the memcg's move_lock
- * while we're moving its pages to another memcg. Then wait
- * for already started RCU-only updates to finish.
- */
- atomic_inc(&mc.from->moving_account);
- synchronize_rcu();
-retry:
- if (unlikely(!mmap_read_trylock(mc.mm))) {
- /*
- * Someone who are holding the mmap_lock might be waiting in
- * waitq. So we cancel all extra charges, wake up all waiters,
- * and retry. Because we cancel precharges, we might not be able
- * to move enough charges, but moving charge is a best-effort
- * feature anyway, so it wouldn't be a big problem.
- */
- __mem_cgroup_clear_mc();
- cond_resched();
- goto retry;
- }
- /*
- * When we have consumed all precharges and failed in doing
- * additional charge, the page walk just aborts.
- */
- walk_page_range(mc.mm, 0, ULONG_MAX, &charge_walk_ops, NULL);
- mmap_read_unlock(mc.mm);
- atomic_dec(&mc.from->moving_account);
-}
-
-static void mem_cgroup_move_task(void)
-{
- if (mc.to) {
- mem_cgroup_move_charge();
- mem_cgroup_clear_mc();
- }
-}
-
-#else /* !CONFIG_MMU */
-static int mem_cgroup_can_attach(struct cgroup_taskset *tset)
-{
- return 0;
-}
-static void mem_cgroup_cancel_attach(struct cgroup_taskset *tset)
-{
-}
-static void mem_cgroup_move_task(void)
-{
-}
-#endif
-
#ifdef CONFIG_MEMCG_KMEM
static void mem_cgroup_fork(struct task_struct *task)
{
--
2.43.2


2024-05-09 03:43:33

by Roman Gushchin

[permalink] [raw]
Subject: [PATCH rfc 5/9] mm: memcg: move cgroup v1 interface files to memcontrol-v1.c

Move the legacy cgroup v1 memory controller interface files and the
corresponding code into memcontrol-v1.c.

Signed-off-by: Roman Gushchin <[email protected]>
---
mm/internal.h | 15 +-
mm/memcontrol-v1.c | 733 +++++++++++++++++++++++++++++++++++++++++++-
mm/memcontrol.c | 748 +--------------------------------------------
3 files changed, 751 insertions(+), 745 deletions(-)

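(A quick aside for reviewers, not part of the patch itself: the v1 interface
files keep relying on the MEMFILE_* encoding that this patch moves into
memcontrol-v1.c -- the counter type sits in the upper 16 bits of
cftype->private and the attribute in the lower 16 bits. Below is a minimal
userspace sketch of that encoding; the EX_* constants are made-up stand-ins
for the kernel's _MEM/_MEMSWAP/... and RES_* values and are only there to make
the example self-contained.)

#include <assert.h>
#include <stdio.h>

/* Same packing as the MEMFILE_* macros moved by this patch. */
#define MEMFILE_PRIVATE(x, val)	((x) << 16 | (val))
#define MEMFILE_TYPE(val)	((val) >> 16 & 0xffff)
#define MEMFILE_ATTR(val)	((val) & 0xffff)

/* Illustrative stand-ins for the kernel's counter types and attributes. */
enum { EX_MEM, EX_MEMSWAP, EX_KMEM, EX_TCP };
enum { EX_RES_USAGE, EX_RES_LIMIT, EX_RES_MAX_USAGE, EX_RES_FAILCNT };

int main(void)
{
	/* memory.limit_in_bytes would be encoded roughly like this... */
	int priv = MEMFILE_PRIVATE(EX_MEM, EX_RES_LIMIT);

	/* ...and a read/write handler recovers both halves from ->private. */
	assert(MEMFILE_TYPE(priv) == EX_MEM);
	assert(MEMFILE_ATTR(priv) == EX_RES_LIMIT);
	printf("type=%d attr=%d\n", MEMFILE_TYPE(priv), MEMFILE_ATTR(priv));

	return 0;
}
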
diff --git a/mm/internal.h b/mm/internal.h
index 79104cfc08a9..533aa999a450 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -13,6 +13,7 @@
#include <linux/rmap.h>
#include <linux/swap.h>
#include <linux/swapops.h>
+#include <linux/cgroup-defs.h>
#include <linux/tracepoint-defs.h>

struct folio_batch;
@@ -1579,6 +1580,14 @@ enum mem_cgroup_events_target {
bool mem_cgroup_event_ratelimit(struct mem_cgroup *memcg,
enum mem_cgroup_events_target target);

+int memory_stat_show(struct seq_file *m, void *v);
+unsigned long memcg_page_state_local(struct mem_cgroup *memcg, int idx);
+unsigned long memcg_page_state_output(struct mem_cgroup *memcg, int item);
+unsigned long memcg_page_state_local_output(struct mem_cgroup *memcg, int item);
+unsigned long memcg_events(struct mem_cgroup *memcg, int event);
+unsigned long memcg_events_local(struct mem_cgroup *memcg, int event);
+void drain_all_stock(struct mem_cgroup *root_memcg);
+
/* Memory cgroups v1-specific definitions */
void mem_cgroup_update_tree(struct mem_cgroup *memcg, int nid);
void mem_cgroup_remove_from_trees(struct mem_cgroup *memcg);
@@ -1606,8 +1615,10 @@ enum res_type {

void memcg_check_events(struct mem_cgroup *memcg, int nid);
void mem_cgroup_oom_notify(struct mem_cgroup *memcg);
-ssize_t memcg_write_event_control(struct kernfs_open_file *of, char *buf,
- size_t nbytes, loff_t off);
+void memcg1_stat_format(struct mem_cgroup *memcg, struct seq_buf *s);
void mem_cgroup_v1_offline_memcg(struct mem_cgroup *memcg);

+extern struct cftype memsw_files[];
+extern struct cftype mem_cgroup_legacy_files[];
+
#endif /* __MM_INTERNAL_H */
diff --git a/mm/memcontrol-v1.c b/mm/memcontrol-v1.c
index 5bf0b62cd7b5..cd711f6b8386 100644
--- a/mm/memcontrol-v1.c
+++ b/mm/memcontrol-v1.c
@@ -10,6 +10,7 @@
#include <linux/mm_inline.h>
#include <linux/swap_cgroup.h>
#include <linux/file.h>
+#include <linux/seq_buf.h>

#include "internal.h"
#include "swap.h"
@@ -111,6 +112,18 @@ struct mem_cgroup_eventfd_list {

extern spinlock_t memcg_oom_lock;

+enum {
+ RES_USAGE,
+ RES_LIMIT,
+ RES_MAX_USAGE,
+ RES_FAILCNT,
+ RES_SOFT_LIMIT,
+};
+
+#define MEMFILE_PRIVATE(x, val) ((x) << 16 | (val))
+#define MEMFILE_TYPE(val) ((val) >> 16 & 0xffff)
+#define MEMFILE_ATTR(val) ((val) & 0xffff)
+
static void __mem_cgroup_insert_exceeded(struct mem_cgroup_per_node *mz,
struct mem_cgroup_tree_per_node *mctz,
unsigned long new_usage_in_excess)
@@ -1801,8 +1814,8 @@ static void memcg_event_ptable_queue_proc(struct file *file,
* Input must be in format '<event_fd> <control_fd> <args>'.
* Interpretation of args is defined by control file implementation.
*/
-ssize_t memcg_write_event_control(struct kernfs_open_file *of, char *buf,
- size_t nbytes, loff_t off)
+static ssize_t memcg_write_event_control(struct kernfs_open_file *of, char *buf,
+ size_t nbytes, loff_t off)
{
struct cgroup_subsys_state *css = of_css(of);
struct mem_cgroup *memcg = mem_cgroup_from_css(css);
@@ -1946,6 +1959,722 @@ ssize_t memcg_write_event_control(struct kernfs_open_file *of, char *buf,
return ret;
}

+static DEFINE_MUTEX(memcg_max_mutex);
+
+static int mem_cgroup_resize_max(struct mem_cgroup *memcg,
+ unsigned long max, bool memsw)
+{
+ bool enlarge = false;
+ bool drained = false;
+ int ret;
+ bool limits_invariant;
+ struct page_counter *counter = memsw ? &memcg->memsw : &memcg->memory;
+
+ do {
+ if (signal_pending(current)) {
+ ret = -EINTR;
+ break;
+ }
+
+ mutex_lock(&memcg_max_mutex);
+ /*
+ * Make sure that the new limit (memsw or memory limit) doesn't
+ * break our basic invariant rule memory.max <= memsw.max.
+ */
+ limits_invariant = memsw ? max >= READ_ONCE(memcg->memory.max) :
+ max <= memcg->memsw.max;
+ if (!limits_invariant) {
+ mutex_unlock(&memcg_max_mutex);
+ ret = -EINVAL;
+ break;
+ }
+ if (max > counter->max)
+ enlarge = true;
+ ret = page_counter_set_max(counter, max);
+ mutex_unlock(&memcg_max_mutex);
+
+ if (!ret)
+ break;
+
+ if (!drained) {
+ drain_all_stock(memcg);
+ drained = true;
+ continue;
+ }
+
+ if (!try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL,
+ memsw ? 0 : MEMCG_RECLAIM_MAY_SWAP, NULL)) {
+ ret = -EBUSY;
+ break;
+ }
+ } while (true);
+
+ if (!ret && enlarge)
+ memcg_oom_recover(memcg);
+
+ return ret;
+}
+
+/*
+ * Reclaims as many pages from the given memcg as possible.
+ *
+ * Caller is responsible for holding css reference for memcg.
+ */
+static int mem_cgroup_force_empty(struct mem_cgroup *memcg)
+{
+ int nr_retries = MAX_RECLAIM_RETRIES;
+
+ /* we call try-to-free pages for make this cgroup empty */
+ lru_add_drain_all();
+
+ drain_all_stock(memcg);
+
+ /* try to free all pages in this cgroup */
+ while (nr_retries && page_counter_read(&memcg->memory)) {
+ if (signal_pending(current))
+ return -EINTR;
+
+ if (!try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL,
+ MEMCG_RECLAIM_MAY_SWAP, NULL))
+ nr_retries--;
+ }
+
+ return 0;
+}
+
+static ssize_t mem_cgroup_force_empty_write(struct kernfs_open_file *of,
+ char *buf, size_t nbytes,
+ loff_t off)
+{
+ struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+
+ if (mem_cgroup_is_root(memcg))
+ return -EINVAL;
+ return mem_cgroup_force_empty(memcg) ?: nbytes;
+}
+
+static u64 mem_cgroup_hierarchy_read(struct cgroup_subsys_state *css,
+ struct cftype *cft)
+{
+ return 1;
+}
+
+static int mem_cgroup_hierarchy_write(struct cgroup_subsys_state *css,
+ struct cftype *cft, u64 val)
+{
+ if (val == 1)
+ return 0;
+
+ pr_warn_once("Non-hierarchical mode is deprecated. "
+ "Please report your usecase to [email protected] if you "
+ "depend on this functionality.\n");
+
+ return -EINVAL;
+}
+
+static u64 mem_cgroup_read_u64(struct cgroup_subsys_state *css,
+ struct cftype *cft)
+{
+ struct mem_cgroup *memcg = mem_cgroup_from_css(css);
+ struct page_counter *counter;
+
+ switch (MEMFILE_TYPE(cft->private)) {
+ case _MEM:
+ counter = &memcg->memory;
+ break;
+ case _MEMSWAP:
+ counter = &memcg->memsw;
+ break;
+ case _KMEM:
+ counter = &memcg->kmem;
+ break;
+ case _TCP:
+ counter = &memcg->tcpmem;
+ break;
+ default:
+ BUG();
+ }
+
+ switch (MEMFILE_ATTR(cft->private)) {
+ case RES_USAGE:
+ if (counter == &memcg->memory)
+ return (u64)mem_cgroup_usage(memcg, false) * PAGE_SIZE;
+ if (counter == &memcg->memsw)
+ return (u64)mem_cgroup_usage(memcg, true) * PAGE_SIZE;
+ return (u64)page_counter_read(counter) * PAGE_SIZE;
+ case RES_LIMIT:
+ return (u64)counter->max * PAGE_SIZE;
+ case RES_MAX_USAGE:
+ return (u64)counter->watermark * PAGE_SIZE;
+ case RES_FAILCNT:
+ return counter->failcnt;
+ case RES_SOFT_LIMIT:
+ return (u64)READ_ONCE(memcg->soft_limit) * PAGE_SIZE;
+ default:
+ BUG();
+ }
+}
+
+/*
+ * This function doesn't do anything useful. Its only job is to provide a read
+ * handler for a file so that cgroup_file_mode() will add read permissions.
+ */
+static int mem_cgroup_dummy_seq_show(__always_unused struct seq_file *m,
+ __always_unused void *v)
+{
+ return -EINVAL;
+}
+
+static int memcg_update_tcp_max(struct mem_cgroup *memcg, unsigned long max)
+{
+ int ret;
+
+ mutex_lock(&memcg_max_mutex);
+
+ ret = page_counter_set_max(&memcg->tcpmem, max);
+ if (ret)
+ goto out;
+
+ if (!memcg->tcpmem_active) {
+ /*
+ * The active flag needs to be written after the static_key
+ * update. This is what guarantees that the socket activation
+ * function is the last one to run. See mem_cgroup_sk_alloc()
+ * for details, and note that we don't mark any socket as
+ * belonging to this memcg until that flag is up.
+ *
+ * We need to do this, because static_keys will span multiple
+ * sites, but we can't control their order. If we mark a socket
+ * as accounted, but the accounting functions are not patched in
+ * yet, we'll lose accounting.
+ *
+ * We never race with the readers in mem_cgroup_sk_alloc(),
+ * because when this value change, the code to process it is not
+ * patched in yet.
+ */
+ static_branch_inc(&memcg_sockets_enabled_key);
+ memcg->tcpmem_active = true;
+ }
+out:
+ mutex_unlock(&memcg_max_mutex);
+ return ret;
+}
+
+/*
+ * The user of this function is...
+ * RES_LIMIT.
+ */
+static ssize_t mem_cgroup_write(struct kernfs_open_file *of,
+ char *buf, size_t nbytes, loff_t off)
+{
+ struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+ unsigned long nr_pages;
+ int ret;
+
+ buf = strstrip(buf);
+ ret = page_counter_memparse(buf, "-1", &nr_pages);
+ if (ret)
+ return ret;
+
+ switch (MEMFILE_ATTR(of_cft(of)->private)) {
+ case RES_LIMIT:
+ if (mem_cgroup_is_root(memcg)) { /* Can't set limit on root */
+ ret = -EINVAL;
+ break;
+ }
+ switch (MEMFILE_TYPE(of_cft(of)->private)) {
+ case _MEM:
+ ret = mem_cgroup_resize_max(memcg, nr_pages, false);
+ break;
+ case _MEMSWAP:
+ ret = mem_cgroup_resize_max(memcg, nr_pages, true);
+ break;
+ case _KMEM:
+ pr_warn_once("kmem.limit_in_bytes is deprecated and will be removed. "
+ "Writing any value to this file has no effect. "
+ "Please report your usecase to [email protected] if you "
+ "depend on this functionality.\n");
+ ret = 0;
+ break;
+ case _TCP:
+ ret = memcg_update_tcp_max(memcg, nr_pages);
+ break;
+ }
+ break;
+ case RES_SOFT_LIMIT:
+ if (IS_ENABLED(CONFIG_PREEMPT_RT)) {
+ ret = -EOPNOTSUPP;
+ } else {
+ WRITE_ONCE(memcg->soft_limit, nr_pages);
+ ret = 0;
+ }
+ break;
+ }
+ return ret ?: nbytes;
+}
+
+static ssize_t mem_cgroup_reset(struct kernfs_open_file *of, char *buf,
+ size_t nbytes, loff_t off)
+{
+ struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+ struct page_counter *counter;
+
+ switch (MEMFILE_TYPE(of_cft(of)->private)) {
+ case _MEM:
+ counter = &memcg->memory;
+ break;
+ case _MEMSWAP:
+ counter = &memcg->memsw;
+ break;
+ case _KMEM:
+ counter = &memcg->kmem;
+ break;
+ case _TCP:
+ counter = &memcg->tcpmem;
+ break;
+ default:
+ BUG();
+ }
+
+ switch (MEMFILE_ATTR(of_cft(of)->private)) {
+ case RES_MAX_USAGE:
+ page_counter_reset_watermark(counter);
+ break;
+ case RES_FAILCNT:
+ counter->failcnt = 0;
+ break;
+ default:
+ BUG();
+ }
+
+ return nbytes;
+}
+
+#ifdef CONFIG_NUMA
+
+#define LRU_ALL_FILE (BIT(LRU_INACTIVE_FILE) | BIT(LRU_ACTIVE_FILE))
+#define LRU_ALL_ANON (BIT(LRU_INACTIVE_ANON) | BIT(LRU_ACTIVE_ANON))
+#define LRU_ALL ((1 << NR_LRU_LISTS) - 1)
+
+static unsigned long mem_cgroup_node_nr_lru_pages(struct mem_cgroup *memcg,
+ int nid, unsigned int lru_mask, bool tree)
+{
+ struct lruvec *lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(nid));
+ unsigned long nr = 0;
+ enum lru_list lru;
+
+ VM_BUG_ON((unsigned)nid >= nr_node_ids);
+
+ for_each_lru(lru) {
+ if (!(BIT(lru) & lru_mask))
+ continue;
+ if (tree)
+ nr += lruvec_page_state(lruvec, NR_LRU_BASE + lru);
+ else
+ nr += lruvec_page_state_local(lruvec, NR_LRU_BASE + lru);
+ }
+ return nr;
+}
+
+static unsigned long mem_cgroup_nr_lru_pages(struct mem_cgroup *memcg,
+ unsigned int lru_mask,
+ bool tree)
+{
+ unsigned long nr = 0;
+ enum lru_list lru;
+
+ for_each_lru(lru) {
+ if (!(BIT(lru) & lru_mask))
+ continue;
+ if (tree)
+ nr += memcg_page_state(memcg, NR_LRU_BASE + lru);
+ else
+ nr += memcg_page_state_local(memcg, NR_LRU_BASE + lru);
+ }
+ return nr;
+}
+
+static int memcg_numa_stat_show(struct seq_file *m, void *v)
+{
+ struct numa_stat {
+ const char *name;
+ unsigned int lru_mask;
+ };
+
+ static const struct numa_stat stats[] = {
+ { "total", LRU_ALL },
+ { "file", LRU_ALL_FILE },
+ { "anon", LRU_ALL_ANON },
+ { "unevictable", BIT(LRU_UNEVICTABLE) },
+ };
+ const struct numa_stat *stat;
+ int nid;
+ struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
+
+ mem_cgroup_flush_stats(memcg);
+
+ for (stat = stats; stat < stats + ARRAY_SIZE(stats); stat++) {
+ seq_printf(m, "%s=%lu", stat->name,
+ mem_cgroup_nr_lru_pages(memcg, stat->lru_mask,
+ false));
+ for_each_node_state(nid, N_MEMORY)
+ seq_printf(m, " N%d=%lu", nid,
+ mem_cgroup_node_nr_lru_pages(memcg, nid,
+ stat->lru_mask, false));
+ seq_putc(m, '\n');
+ }
+
+ for (stat = stats; stat < stats + ARRAY_SIZE(stats); stat++) {
+
+ seq_printf(m, "hierarchical_%s=%lu", stat->name,
+ mem_cgroup_nr_lru_pages(memcg, stat->lru_mask,
+ true));
+ for_each_node_state(nid, N_MEMORY)
+ seq_printf(m, " N%d=%lu", nid,
+ mem_cgroup_node_nr_lru_pages(memcg, nid,
+ stat->lru_mask, true));
+ seq_putc(m, '\n');
+ }
+
+ return 0;
+}
+#endif /* CONFIG_NUMA */
+
+static const unsigned int memcg1_stats[] = {
+ NR_FILE_PAGES,
+ NR_ANON_MAPPED,
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ NR_ANON_THPS,
+#endif
+ NR_SHMEM,
+ NR_FILE_MAPPED,
+ NR_FILE_DIRTY,
+ NR_WRITEBACK,
+ WORKINGSET_REFAULT_ANON,
+ WORKINGSET_REFAULT_FILE,
+#ifdef CONFIG_SWAP
+ MEMCG_SWAP,
+ NR_SWAPCACHE,
+#endif
+};
+
+static const char *const memcg1_stat_names[] = {
+ "cache",
+ "rss",
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ "rss_huge",
+#endif
+ "shmem",
+ "mapped_file",
+ "dirty",
+ "writeback",
+ "workingset_refault_anon",
+ "workingset_refault_file",
+#ifdef CONFIG_SWAP
+ "swap",
+ "swapcached",
+#endif
+};
+
+/* Universal VM events cgroup1 shows, original sort order */
+static const unsigned int memcg1_events[] = {
+ PGPGIN,
+ PGPGOUT,
+ PGFAULT,
+ PGMAJFAULT,
+};
+
+void memcg1_stat_format(struct mem_cgroup *memcg, struct seq_buf *s)
+{
+ unsigned long memory, memsw;
+ struct mem_cgroup *mi;
+ unsigned int i;
+
+ BUILD_BUG_ON(ARRAY_SIZE(memcg1_stat_names) != ARRAY_SIZE(memcg1_stats));
+
+ mem_cgroup_flush_stats(memcg);
+
+ for (i = 0; i < ARRAY_SIZE(memcg1_stats); i++) {
+ unsigned long nr;
+
+ nr = memcg_page_state_local_output(memcg, memcg1_stats[i]);
+ seq_buf_printf(s, "%s %lu\n", memcg1_stat_names[i], nr);
+ }
+
+ for (i = 0; i < ARRAY_SIZE(memcg1_events); i++)
+ seq_buf_printf(s, "%s %lu\n", vm_event_name(memcg1_events[i]),
+ memcg_events_local(memcg, memcg1_events[i]));
+
+ for (i = 0; i < NR_LRU_LISTS; i++)
+ seq_buf_printf(s, "%s %lu\n", lru_list_name(i),
+ memcg_page_state_local(memcg, NR_LRU_BASE + i) *
+ PAGE_SIZE);
+
+ /* Hierarchical information */
+ memory = memsw = PAGE_COUNTER_MAX;
+ for (mi = memcg; mi; mi = parent_mem_cgroup(mi)) {
+ memory = min(memory, READ_ONCE(mi->memory.max));
+ memsw = min(memsw, READ_ONCE(mi->memsw.max));
+ }
+ seq_buf_printf(s, "hierarchical_memory_limit %llu\n",
+ (u64)memory * PAGE_SIZE);
+ seq_buf_printf(s, "hierarchical_memsw_limit %llu\n",
+ (u64)memsw * PAGE_SIZE);
+
+ for (i = 0; i < ARRAY_SIZE(memcg1_stats); i++) {
+ unsigned long nr;
+
+ nr = memcg_page_state_output(memcg, memcg1_stats[i]);
+ seq_buf_printf(s, "total_%s %llu\n", memcg1_stat_names[i],
+ (u64)nr);
+ }
+
+ for (i = 0; i < ARRAY_SIZE(memcg1_events); i++)
+ seq_buf_printf(s, "total_%s %llu\n",
+ vm_event_name(memcg1_events[i]),
+ (u64)memcg_events(memcg, memcg1_events[i]));
+
+ for (i = 0; i < NR_LRU_LISTS; i++)
+ seq_buf_printf(s, "total_%s %llu\n", lru_list_name(i),
+ (u64)memcg_page_state(memcg, NR_LRU_BASE + i) *
+ PAGE_SIZE);
+
+#ifdef CONFIG_DEBUG_VM
+ {
+ pg_data_t *pgdat;
+ struct mem_cgroup_per_node *mz;
+ unsigned long anon_cost = 0;
+ unsigned long file_cost = 0;
+
+ for_each_online_pgdat(pgdat) {
+ mz = memcg->nodeinfo[pgdat->node_id];
+
+ anon_cost += mz->lruvec.anon_cost;
+ file_cost += mz->lruvec.file_cost;
+ }
+ seq_buf_printf(s, "anon_cost %lu\n", anon_cost);
+ seq_buf_printf(s, "file_cost %lu\n", file_cost);
+ }
+#endif
+}
+
+static u64 mem_cgroup_swappiness_read(struct cgroup_subsys_state *css,
+ struct cftype *cft)
+{
+ struct mem_cgroup *memcg = mem_cgroup_from_css(css);
+
+ return mem_cgroup_swappiness(memcg);
+}
+
+static int mem_cgroup_swappiness_write(struct cgroup_subsys_state *css,
+ struct cftype *cft, u64 val)
+{
+ struct mem_cgroup *memcg = mem_cgroup_from_css(css);
+
+ if (val > MAX_SWAPPINESS)
+ return -EINVAL;
+
+ if (!mem_cgroup_is_root(memcg))
+ WRITE_ONCE(memcg->swappiness, val);
+ else
+ WRITE_ONCE(vm_swappiness, val);
+
+ return 0;
+}
+
+static int mem_cgroup_oom_control_read(struct seq_file *sf, void *v)
+{
+ struct mem_cgroup *memcg = mem_cgroup_from_seq(sf);
+
+ seq_printf(sf, "oom_kill_disable %d\n", READ_ONCE(memcg->oom_kill_disable));
+ seq_printf(sf, "under_oom %d\n", (bool)memcg->under_oom);
+ seq_printf(sf, "oom_kill %lu\n",
+ atomic_long_read(&memcg->memory_events[MEMCG_OOM_KILL]));
+ return 0;
+}
+
+static int mem_cgroup_oom_control_write(struct cgroup_subsys_state *css,
+ struct cftype *cft, u64 val)
+{
+ struct mem_cgroup *memcg = mem_cgroup_from_css(css);
+
+ /* cannot set to root cgroup and only 0 and 1 are allowed */
+ if (mem_cgroup_is_root(memcg) || !((val == 0) || (val == 1)))
+ return -EINVAL;
+
+ WRITE_ONCE(memcg->oom_kill_disable, val);
+ if (!val)
+ memcg_oom_recover(memcg);
+
+ return 0;
+}
+
+#if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_SLUB_DEBUG)
+static int mem_cgroup_slab_show(struct seq_file *m, void *p)
+{
+ /*
+ * Deprecated.
+ * Please, take a look at tools/cgroup/memcg_slabinfo.py .
+ */
+ return 0;
+}
+#endif
+
+struct cftype mem_cgroup_legacy_files[] = {
+ {
+ .name = "usage_in_bytes",
+ .private = MEMFILE_PRIVATE(_MEM, RES_USAGE),
+ .read_u64 = mem_cgroup_read_u64,
+ },
+ {
+ .name = "max_usage_in_bytes",
+ .private = MEMFILE_PRIVATE(_MEM, RES_MAX_USAGE),
+ .write = mem_cgroup_reset,
+ .read_u64 = mem_cgroup_read_u64,
+ },
+ {
+ .name = "limit_in_bytes",
+ .private = MEMFILE_PRIVATE(_MEM, RES_LIMIT),
+ .write = mem_cgroup_write,
+ .read_u64 = mem_cgroup_read_u64,
+ },
+ {
+ .name = "soft_limit_in_bytes",
+ .private = MEMFILE_PRIVATE(_MEM, RES_SOFT_LIMIT),
+ .write = mem_cgroup_write,
+ .read_u64 = mem_cgroup_read_u64,
+ },
+ {
+ .name = "failcnt",
+ .private = MEMFILE_PRIVATE(_MEM, RES_FAILCNT),
+ .write = mem_cgroup_reset,
+ .read_u64 = mem_cgroup_read_u64,
+ },
+ {
+ .name = "stat",
+ .seq_show = memory_stat_show,
+ },
+ {
+ .name = "force_empty",
+ .write = mem_cgroup_force_empty_write,
+ },
+ {
+ .name = "use_hierarchy",
+ .write_u64 = mem_cgroup_hierarchy_write,
+ .read_u64 = mem_cgroup_hierarchy_read,
+ },
+ {
+ .name = "cgroup.event_control", /* XXX: for compat */
+ .write = memcg_write_event_control,
+ .flags = CFTYPE_NO_PREFIX | CFTYPE_WORLD_WRITABLE,
+ },
+ {
+ .name = "swappiness",
+ .read_u64 = mem_cgroup_swappiness_read,
+ .write_u64 = mem_cgroup_swappiness_write,
+ },
+ {
+ .name = "move_charge_at_immigrate",
+ .read_u64 = mem_cgroup_move_charge_read,
+ .write_u64 = mem_cgroup_move_charge_write,
+ },
+ {
+ .name = "oom_control",
+ .seq_show = mem_cgroup_oom_control_read,
+ .write_u64 = mem_cgroup_oom_control_write,
+ },
+ {
+ .name = "pressure_level",
+ .seq_show = mem_cgroup_dummy_seq_show,
+ },
+#ifdef CONFIG_NUMA
+ {
+ .name = "numa_stat",
+ .seq_show = memcg_numa_stat_show,
+ },
+#endif
+ {
+ .name = "kmem.limit_in_bytes",
+ .private = MEMFILE_PRIVATE(_KMEM, RES_LIMIT),
+ .write = mem_cgroup_write,
+ .read_u64 = mem_cgroup_read_u64,
+ },
+ {
+ .name = "kmem.usage_in_bytes",
+ .private = MEMFILE_PRIVATE(_KMEM, RES_USAGE),
+ .read_u64 = mem_cgroup_read_u64,
+ },
+ {
+ .name = "kmem.failcnt",
+ .private = MEMFILE_PRIVATE(_KMEM, RES_FAILCNT),
+ .write = mem_cgroup_reset,
+ .read_u64 = mem_cgroup_read_u64,
+ },
+ {
+ .name = "kmem.max_usage_in_bytes",
+ .private = MEMFILE_PRIVATE(_KMEM, RES_MAX_USAGE),
+ .write = mem_cgroup_reset,
+ .read_u64 = mem_cgroup_read_u64,
+ },
+#if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_SLUB_DEBUG)
+ {
+ .name = "kmem.slabinfo",
+ .seq_show = mem_cgroup_slab_show,
+ },
+#endif
+ {
+ .name = "kmem.tcp.limit_in_bytes",
+ .private = MEMFILE_PRIVATE(_TCP, RES_LIMIT),
+ .write = mem_cgroup_write,
+ .read_u64 = mem_cgroup_read_u64,
+ },
+ {
+ .name = "kmem.tcp.usage_in_bytes",
+ .private = MEMFILE_PRIVATE(_TCP, RES_USAGE),
+ .read_u64 = mem_cgroup_read_u64,
+ },
+ {
+ .name = "kmem.tcp.failcnt",
+ .private = MEMFILE_PRIVATE(_TCP, RES_FAILCNT),
+ .write = mem_cgroup_reset,
+ .read_u64 = mem_cgroup_read_u64,
+ },
+ {
+ .name = "kmem.tcp.max_usage_in_bytes",
+ .private = MEMFILE_PRIVATE(_TCP, RES_MAX_USAGE),
+ .write = mem_cgroup_reset,
+ .read_u64 = mem_cgroup_read_u64,
+ },
+ { }, /* terminate */
+};
+
+struct cftype memsw_files[] = {
+ {
+ .name = "memsw.usage_in_bytes",
+ .private = MEMFILE_PRIVATE(_MEMSWAP, RES_USAGE),
+ .read_u64 = mem_cgroup_read_u64,
+ },
+ {
+ .name = "memsw.max_usage_in_bytes",
+ .private = MEMFILE_PRIVATE(_MEMSWAP, RES_MAX_USAGE),
+ .write = mem_cgroup_reset,
+ .read_u64 = mem_cgroup_read_u64,
+ },
+ {
+ .name = "memsw.limit_in_bytes",
+ .private = MEMFILE_PRIVATE(_MEMSWAP, RES_LIMIT),
+ .write = mem_cgroup_write,
+ .read_u64 = mem_cgroup_read_u64,
+ },
+ {
+ .name = "memsw.failcnt",
+ .private = MEMFILE_PRIVATE(_MEMSWAP, RES_FAILCNT),
+ .write = mem_cgroup_reset,
+ .read_u64 = mem_cgroup_read_u64,
+ },
+ { }, /* terminate */
+};
+
void mem_cgroup_v1_offline_memcg(struct mem_cgroup *memcg)
{
struct mem_cgroup_event *event, *tmp;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 7c1a4ea0e9b5..cd7e5f67d9b5 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -95,10 +95,6 @@ static DECLARE_WAIT_QUEUE_HEAD(memcg_cgwb_frn_waitq);
#define THRESHOLDS_EVENTS_TARGET 128
#define SOFTLIMIT_EVENTS_TARGET 1024

-#define MEMFILE_PRIVATE(x, val) ((x) << 16 | (val))
-#define MEMFILE_TYPE(val) ((val) >> 16 & 0xffff)
-#define MEMFILE_ATTR(val) ((val) & 0xffff)
-
static inline bool task_is_dying(void)
{
return tsk_is_oom_victim(current) || fatal_signal_pending(current) ||
@@ -675,7 +671,7 @@ void __mod_memcg_state(struct mem_cgroup *memcg, enum memcg_stat_item idx,
}

/* idx can be of type enum memcg_stat_item or node_stat_item. */
-static unsigned long memcg_page_state_local(struct mem_cgroup *memcg, int idx)
+unsigned long memcg_page_state_local(struct mem_cgroup *memcg, int idx)
{
long x;
int i = memcg_stats_index(idx);
@@ -827,7 +823,7 @@ void __count_memcg_events(struct mem_cgroup *memcg, enum vm_event_item idx,
memcg_stats_unlock();
}

-static unsigned long memcg_events(struct mem_cgroup *memcg, int event)
+unsigned long memcg_events(struct mem_cgroup *memcg, int event)
{
int i = memcg_events_index(event);

@@ -837,7 +833,7 @@ static unsigned long memcg_events(struct mem_cgroup *memcg, int event)
return READ_ONCE(memcg->vmstats->events[i]);
}

-static unsigned long memcg_events_local(struct mem_cgroup *memcg, int event)
+unsigned long memcg_events_local(struct mem_cgroup *memcg, int event)
{
int i = memcg_events_index(event);

@@ -1422,15 +1418,13 @@ static int memcg_page_state_output_unit(int item)
}
}

-static inline unsigned long memcg_page_state_output(struct mem_cgroup *memcg,
- int item)
+unsigned long memcg_page_state_output(struct mem_cgroup *memcg, int item)
{
return memcg_page_state(memcg, item) *
memcg_page_state_output_unit(item);
}

-static inline unsigned long memcg_page_state_local_output(
- struct mem_cgroup *memcg, int item)
+unsigned long memcg_page_state_local_output(struct mem_cgroup *memcg, int item)
{
return memcg_page_state_local(memcg, item) *
memcg_page_state_output_unit(item);
@@ -1489,8 +1483,6 @@ static void memcg_stat_format(struct mem_cgroup *memcg, struct seq_buf *s)
WARN_ON_ONCE(seq_buf_has_overflowed(s));
}

-static void memcg1_stat_format(struct mem_cgroup *memcg, struct seq_buf *s);
-
static void memory_stat_format(struct mem_cgroup *memcg, struct seq_buf *s)
{
if (cgroup_subsys_on_dfl(memory_cgrp_subsys))
@@ -2077,7 +2069,7 @@ static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
* Drains all per-CPU charge caches for given root_memcg resp. subtree
* of the hierarchy under it.
*/
-static void drain_all_stock(struct mem_cgroup *root_memcg)
+void drain_all_stock(struct mem_cgroup *root_memcg)
{
int cpu, curcpu;

@@ -3333,119 +3325,6 @@ void split_page_memcg(struct page *head, int old_order, int new_order)
css_get_many(&memcg->css, old_nr / new_nr - 1);
}

-static DEFINE_MUTEX(memcg_max_mutex);
-
-static int mem_cgroup_resize_max(struct mem_cgroup *memcg,
- unsigned long max, bool memsw)
-{
- bool enlarge = false;
- bool drained = false;
- int ret;
- bool limits_invariant;
- struct page_counter *counter = memsw ? &memcg->memsw : &memcg->memory;
-
- do {
- if (signal_pending(current)) {
- ret = -EINTR;
- break;
- }
-
- mutex_lock(&memcg_max_mutex);
- /*
- * Make sure that the new limit (memsw or memory limit) doesn't
- * break our basic invariant rule memory.max <= memsw.max.
- */
- limits_invariant = memsw ? max >= READ_ONCE(memcg->memory.max) :
- max <= memcg->memsw.max;
- if (!limits_invariant) {
- mutex_unlock(&memcg_max_mutex);
- ret = -EINVAL;
- break;
- }
- if (max > counter->max)
- enlarge = true;
- ret = page_counter_set_max(counter, max);
- mutex_unlock(&memcg_max_mutex);
-
- if (!ret)
- break;
-
- if (!drained) {
- drain_all_stock(memcg);
- drained = true;
- continue;
- }
-
- if (!try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL,
- memsw ? 0 : MEMCG_RECLAIM_MAY_SWAP, NULL)) {
- ret = -EBUSY;
- break;
- }
- } while (true);
-
- if (!ret && enlarge)
- memcg_oom_recover(memcg);
-
- return ret;
-}
-
-/*
- * Reclaims as many pages from the given memcg as possible.
- *
- * Caller is responsible for holding css reference for memcg.
- */
-static int mem_cgroup_force_empty(struct mem_cgroup *memcg)
-{
- int nr_retries = MAX_RECLAIM_RETRIES;
-
- /* we call try-to-free pages for make this cgroup empty */
- lru_add_drain_all();
-
- drain_all_stock(memcg);
-
- /* try to free all pages in this cgroup */
- while (nr_retries && page_counter_read(&memcg->memory)) {
- if (signal_pending(current))
- return -EINTR;
-
- if (!try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL,
- MEMCG_RECLAIM_MAY_SWAP, NULL))
- nr_retries--;
- }
-
- return 0;
-}
-
-static ssize_t mem_cgroup_force_empty_write(struct kernfs_open_file *of,
- char *buf, size_t nbytes,
- loff_t off)
-{
- struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
-
- if (mem_cgroup_is_root(memcg))
- return -EINVAL;
- return mem_cgroup_force_empty(memcg) ?: nbytes;
-}
-
-static u64 mem_cgroup_hierarchy_read(struct cgroup_subsys_state *css,
- struct cftype *cft)
-{
- return 1;
-}
-
-static int mem_cgroup_hierarchy_write(struct cgroup_subsys_state *css,
- struct cftype *cft, u64 val)
-{
- if (val == 1)
- return 0;
-
- pr_warn_once("Non-hierarchical mode is deprecated. "
- "Please report your usecase to [email protected] if you "
- "depend on this functionality.\n");
-
- return -EINVAL;
-}
-
unsigned long mem_cgroup_usage(struct mem_cgroup *memcg, bool swap)
{
unsigned long val;
@@ -3468,67 +3347,6 @@ unsigned long mem_cgroup_usage(struct mem_cgroup *memcg, bool swap)
return val;
}

-enum {
- RES_USAGE,
- RES_LIMIT,
- RES_MAX_USAGE,
- RES_FAILCNT,
- RES_SOFT_LIMIT,
-};
-
-static u64 mem_cgroup_read_u64(struct cgroup_subsys_state *css,
- struct cftype *cft)
-{
- struct mem_cgroup *memcg = mem_cgroup_from_css(css);
- struct page_counter *counter;
-
- switch (MEMFILE_TYPE(cft->private)) {
- case _MEM:
- counter = &memcg->memory;
- break;
- case _MEMSWAP:
- counter = &memcg->memsw;
- break;
- case _KMEM:
- counter = &memcg->kmem;
- break;
- case _TCP:
- counter = &memcg->tcpmem;
- break;
- default:
- BUG();
- }
-
- switch (MEMFILE_ATTR(cft->private)) {
- case RES_USAGE:
- if (counter == &memcg->memory)
- return (u64)mem_cgroup_usage(memcg, false) * PAGE_SIZE;
- if (counter == &memcg->memsw)
- return (u64)mem_cgroup_usage(memcg, true) * PAGE_SIZE;
- return (u64)page_counter_read(counter) * PAGE_SIZE;
- case RES_LIMIT:
- return (u64)counter->max * PAGE_SIZE;
- case RES_MAX_USAGE:
- return (u64)counter->watermark * PAGE_SIZE;
- case RES_FAILCNT:
- return counter->failcnt;
- case RES_SOFT_LIMIT:
- return (u64)READ_ONCE(memcg->soft_limit) * PAGE_SIZE;
- default:
- BUG();
- }
-}
-
-/*
- * This function doesn't do anything useful. Its only job is to provide a read
- * handler for a file so that cgroup_file_mode() will add read permissions.
- */
-static int mem_cgroup_dummy_seq_show(__always_unused struct seq_file *m,
- __always_unused void *v)
-{
- return -EINVAL;
-}
-
#ifdef CONFIG_MEMCG_KMEM
static int memcg_online_kmem(struct mem_cgroup *memcg)
{
@@ -3590,390 +3408,6 @@ static void memcg_offline_kmem(struct mem_cgroup *memcg)
}
#endif /* CONFIG_MEMCG_KMEM */

-static int memcg_update_tcp_max(struct mem_cgroup *memcg, unsigned long max)
-{
- int ret;
-
- mutex_lock(&memcg_max_mutex);
-
- ret = page_counter_set_max(&memcg->tcpmem, max);
- if (ret)
- goto out;
-
- if (!memcg->tcpmem_active) {
- /*
- * The active flag needs to be written after the static_key
- * update. This is what guarantees that the socket activation
- * function is the last one to run. See mem_cgroup_sk_alloc()
- * for details, and note that we don't mark any socket as
- * belonging to this memcg until that flag is up.
- *
- * We need to do this, because static_keys will span multiple
- * sites, but we can't control their order. If we mark a socket
- * as accounted, but the accounting functions are not patched in
- * yet, we'll lose accounting.
- *
- * We never race with the readers in mem_cgroup_sk_alloc(),
- * because when this value change, the code to process it is not
- * patched in yet.
- */
- static_branch_inc(&memcg_sockets_enabled_key);
- memcg->tcpmem_active = true;
- }
-out:
- mutex_unlock(&memcg_max_mutex);
- return ret;
-}
-
-/*
- * The user of this function is...
- * RES_LIMIT.
- */
-static ssize_t mem_cgroup_write(struct kernfs_open_file *of,
- char *buf, size_t nbytes, loff_t off)
-{
- struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
- unsigned long nr_pages;
- int ret;
-
- buf = strstrip(buf);
- ret = page_counter_memparse(buf, "-1", &nr_pages);
- if (ret)
- return ret;
-
- switch (MEMFILE_ATTR(of_cft(of)->private)) {
- case RES_LIMIT:
- if (mem_cgroup_is_root(memcg)) { /* Can't set limit on root */
- ret = -EINVAL;
- break;
- }
- switch (MEMFILE_TYPE(of_cft(of)->private)) {
- case _MEM:
- ret = mem_cgroup_resize_max(memcg, nr_pages, false);
- break;
- case _MEMSWAP:
- ret = mem_cgroup_resize_max(memcg, nr_pages, true);
- break;
- case _KMEM:
- pr_warn_once("kmem.limit_in_bytes is deprecated and will be removed. "
- "Writing any value to this file has no effect. "
- "Please report your usecase to [email protected] if you "
- "depend on this functionality.\n");
- ret = 0;
- break;
- case _TCP:
- ret = memcg_update_tcp_max(memcg, nr_pages);
- break;
- }
- break;
- case RES_SOFT_LIMIT:
- if (IS_ENABLED(CONFIG_PREEMPT_RT)) {
- ret = -EOPNOTSUPP;
- } else {
- WRITE_ONCE(memcg->soft_limit, nr_pages);
- ret = 0;
- }
- break;
- }
- return ret ?: nbytes;
-}
-
-static ssize_t mem_cgroup_reset(struct kernfs_open_file *of, char *buf,
- size_t nbytes, loff_t off)
-{
- struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
- struct page_counter *counter;
-
- switch (MEMFILE_TYPE(of_cft(of)->private)) {
- case _MEM:
- counter = &memcg->memory;
- break;
- case _MEMSWAP:
- counter = &memcg->memsw;
- break;
- case _KMEM:
- counter = &memcg->kmem;
- break;
- case _TCP:
- counter = &memcg->tcpmem;
- break;
- default:
- BUG();
- }
-
- switch (MEMFILE_ATTR(of_cft(of)->private)) {
- case RES_MAX_USAGE:
- page_counter_reset_watermark(counter);
- break;
- case RES_FAILCNT:
- counter->failcnt = 0;
- break;
- default:
- BUG();
- }
-
- return nbytes;
-}
-
-#ifdef CONFIG_NUMA
-
-#define LRU_ALL_FILE (BIT(LRU_INACTIVE_FILE) | BIT(LRU_ACTIVE_FILE))
-#define LRU_ALL_ANON (BIT(LRU_INACTIVE_ANON) | BIT(LRU_ACTIVE_ANON))
-#define LRU_ALL ((1 << NR_LRU_LISTS) - 1)
-
-static unsigned long mem_cgroup_node_nr_lru_pages(struct mem_cgroup *memcg,
- int nid, unsigned int lru_mask, bool tree)
-{
- struct lruvec *lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(nid));
- unsigned long nr = 0;
- enum lru_list lru;
-
- VM_BUG_ON((unsigned)nid >= nr_node_ids);
-
- for_each_lru(lru) {
- if (!(BIT(lru) & lru_mask))
- continue;
- if (tree)
- nr += lruvec_page_state(lruvec, NR_LRU_BASE + lru);
- else
- nr += lruvec_page_state_local(lruvec, NR_LRU_BASE + lru);
- }
- return nr;
-}
-
-static unsigned long mem_cgroup_nr_lru_pages(struct mem_cgroup *memcg,
- unsigned int lru_mask,
- bool tree)
-{
- unsigned long nr = 0;
- enum lru_list lru;
-
- for_each_lru(lru) {
- if (!(BIT(lru) & lru_mask))
- continue;
- if (tree)
- nr += memcg_page_state(memcg, NR_LRU_BASE + lru);
- else
- nr += memcg_page_state_local(memcg, NR_LRU_BASE + lru);
- }
- return nr;
-}
-
-static int memcg_numa_stat_show(struct seq_file *m, void *v)
-{
- struct numa_stat {
- const char *name;
- unsigned int lru_mask;
- };
-
- static const struct numa_stat stats[] = {
- { "total", LRU_ALL },
- { "file", LRU_ALL_FILE },
- { "anon", LRU_ALL_ANON },
- { "unevictable", BIT(LRU_UNEVICTABLE) },
- };
- const struct numa_stat *stat;
- int nid;
- struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
-
- mem_cgroup_flush_stats(memcg);
-
- for (stat = stats; stat < stats + ARRAY_SIZE(stats); stat++) {
- seq_printf(m, "%s=%lu", stat->name,
- mem_cgroup_nr_lru_pages(memcg, stat->lru_mask,
- false));
- for_each_node_state(nid, N_MEMORY)
- seq_printf(m, " N%d=%lu", nid,
- mem_cgroup_node_nr_lru_pages(memcg, nid,
- stat->lru_mask, false));
- seq_putc(m, '\n');
- }
-
- for (stat = stats; stat < stats + ARRAY_SIZE(stats); stat++) {
-
- seq_printf(m, "hierarchical_%s=%lu", stat->name,
- mem_cgroup_nr_lru_pages(memcg, stat->lru_mask,
- true));
- for_each_node_state(nid, N_MEMORY)
- seq_printf(m, " N%d=%lu", nid,
- mem_cgroup_node_nr_lru_pages(memcg, nid,
- stat->lru_mask, true));
- seq_putc(m, '\n');
- }
-
- return 0;
-}
-#endif /* CONFIG_NUMA */
-
-static const unsigned int memcg1_stats[] = {
- NR_FILE_PAGES,
- NR_ANON_MAPPED,
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
- NR_ANON_THPS,
-#endif
- NR_SHMEM,
- NR_FILE_MAPPED,
- NR_FILE_DIRTY,
- NR_WRITEBACK,
- WORKINGSET_REFAULT_ANON,
- WORKINGSET_REFAULT_FILE,
-#ifdef CONFIG_SWAP
- MEMCG_SWAP,
- NR_SWAPCACHE,
-#endif
-};
-
-static const char *const memcg1_stat_names[] = {
- "cache",
- "rss",
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
- "rss_huge",
-#endif
- "shmem",
- "mapped_file",
- "dirty",
- "writeback",
- "workingset_refault_anon",
- "workingset_refault_file",
-#ifdef CONFIG_SWAP
- "swap",
- "swapcached",
-#endif
-};
-
-/* Universal VM events cgroup1 shows, original sort order */
-static const unsigned int memcg1_events[] = {
- PGPGIN,
- PGPGOUT,
- PGFAULT,
- PGMAJFAULT,
-};
-
-static void memcg1_stat_format(struct mem_cgroup *memcg, struct seq_buf *s)
-{
- unsigned long memory, memsw;
- struct mem_cgroup *mi;
- unsigned int i;
-
- BUILD_BUG_ON(ARRAY_SIZE(memcg1_stat_names) != ARRAY_SIZE(memcg1_stats));
-
- mem_cgroup_flush_stats(memcg);
-
- for (i = 0; i < ARRAY_SIZE(memcg1_stats); i++) {
- unsigned long nr;
-
- nr = memcg_page_state_local_output(memcg, memcg1_stats[i]);
- seq_buf_printf(s, "%s %lu\n", memcg1_stat_names[i], nr);
- }
-
- for (i = 0; i < ARRAY_SIZE(memcg1_events); i++)
- seq_buf_printf(s, "%s %lu\n", vm_event_name(memcg1_events[i]),
- memcg_events_local(memcg, memcg1_events[i]));
-
- for (i = 0; i < NR_LRU_LISTS; i++)
- seq_buf_printf(s, "%s %lu\n", lru_list_name(i),
- memcg_page_state_local(memcg, NR_LRU_BASE + i) *
- PAGE_SIZE);
-
- /* Hierarchical information */
- memory = memsw = PAGE_COUNTER_MAX;
- for (mi = memcg; mi; mi = parent_mem_cgroup(mi)) {
- memory = min(memory, READ_ONCE(mi->memory.max));
- memsw = min(memsw, READ_ONCE(mi->memsw.max));
- }
- seq_buf_printf(s, "hierarchical_memory_limit %llu\n",
- (u64)memory * PAGE_SIZE);
- seq_buf_printf(s, "hierarchical_memsw_limit %llu\n",
- (u64)memsw * PAGE_SIZE);
-
- for (i = 0; i < ARRAY_SIZE(memcg1_stats); i++) {
- unsigned long nr;
-
- nr = memcg_page_state_output(memcg, memcg1_stats[i]);
- seq_buf_printf(s, "total_%s %llu\n", memcg1_stat_names[i],
- (u64)nr);
- }
-
- for (i = 0; i < ARRAY_SIZE(memcg1_events); i++)
- seq_buf_printf(s, "total_%s %llu\n",
- vm_event_name(memcg1_events[i]),
- (u64)memcg_events(memcg, memcg1_events[i]));
-
- for (i = 0; i < NR_LRU_LISTS; i++)
- seq_buf_printf(s, "total_%s %llu\n", lru_list_name(i),
- (u64)memcg_page_state(memcg, NR_LRU_BASE + i) *
- PAGE_SIZE);
-
-#ifdef CONFIG_DEBUG_VM
- {
- pg_data_t *pgdat;
- struct mem_cgroup_per_node *mz;
- unsigned long anon_cost = 0;
- unsigned long file_cost = 0;
-
- for_each_online_pgdat(pgdat) {
- mz = memcg->nodeinfo[pgdat->node_id];
-
- anon_cost += mz->lruvec.anon_cost;
- file_cost += mz->lruvec.file_cost;
- }
- seq_buf_printf(s, "anon_cost %lu\n", anon_cost);
- seq_buf_printf(s, "file_cost %lu\n", file_cost);
- }
-#endif
-}
-
-static u64 mem_cgroup_swappiness_read(struct cgroup_subsys_state *css,
- struct cftype *cft)
-{
- struct mem_cgroup *memcg = mem_cgroup_from_css(css);
-
- return mem_cgroup_swappiness(memcg);
-}
-
-static int mem_cgroup_swappiness_write(struct cgroup_subsys_state *css,
- struct cftype *cft, u64 val)
-{
- struct mem_cgroup *memcg = mem_cgroup_from_css(css);
-
- if (val > MAX_SWAPPINESS)
- return -EINVAL;
-
- if (!mem_cgroup_is_root(memcg))
- WRITE_ONCE(memcg->swappiness, val);
- else
- WRITE_ONCE(vm_swappiness, val);
-
- return 0;
-}
-
-static int mem_cgroup_oom_control_read(struct seq_file *sf, void *v)
-{
- struct mem_cgroup *memcg = mem_cgroup_from_seq(sf);
-
- seq_printf(sf, "oom_kill_disable %d\n", READ_ONCE(memcg->oom_kill_disable));
- seq_printf(sf, "under_oom %d\n", (bool)memcg->under_oom);
- seq_printf(sf, "oom_kill %lu\n",
- atomic_long_read(&memcg->memory_events[MEMCG_OOM_KILL]));
- return 0;
-}
-
-static int mem_cgroup_oom_control_write(struct cgroup_subsys_state *css,
- struct cftype *cft, u64 val)
-{
- struct mem_cgroup *memcg = mem_cgroup_from_css(css);
-
- /* cannot set to root cgroup and only 0 and 1 are allowed */
- if (mem_cgroup_is_root(memcg) || !((val == 0) || (val == 1)))
- return -EINVAL;
-
- WRITE_ONCE(memcg->oom_kill_disable, val);
- if (!val)
- memcg_oom_recover(memcg);
-
- return 0;
-}
-
#ifdef CONFIG_CGROUP_WRITEBACK

#include <trace/events/writeback.h>
@@ -4187,147 +3621,6 @@ static void memcg_wb_domain_size_changed(struct mem_cgroup *memcg)

#endif /* CONFIG_CGROUP_WRITEBACK */

-#if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_SLUB_DEBUG)
-static int mem_cgroup_slab_show(struct seq_file *m, void *p)
-{
- /*
- * Deprecated.
- * Please, take a look at tools/cgroup/memcg_slabinfo.py .
- */
- return 0;
-}
-#endif
-
-static int memory_stat_show(struct seq_file *m, void *v);
-
-static struct cftype mem_cgroup_legacy_files[] = {
- {
- .name = "usage_in_bytes",
- .private = MEMFILE_PRIVATE(_MEM, RES_USAGE),
- .read_u64 = mem_cgroup_read_u64,
- },
- {
- .name = "max_usage_in_bytes",
- .private = MEMFILE_PRIVATE(_MEM, RES_MAX_USAGE),
- .write = mem_cgroup_reset,
- .read_u64 = mem_cgroup_read_u64,
- },
- {
- .name = "limit_in_bytes",
- .private = MEMFILE_PRIVATE(_MEM, RES_LIMIT),
- .write = mem_cgroup_write,
- .read_u64 = mem_cgroup_read_u64,
- },
- {
- .name = "soft_limit_in_bytes",
- .private = MEMFILE_PRIVATE(_MEM, RES_SOFT_LIMIT),
- .write = mem_cgroup_write,
- .read_u64 = mem_cgroup_read_u64,
- },
- {
- .name = "failcnt",
- .private = MEMFILE_PRIVATE(_MEM, RES_FAILCNT),
- .write = mem_cgroup_reset,
- .read_u64 = mem_cgroup_read_u64,
- },
- {
- .name = "stat",
- .seq_show = memory_stat_show,
- },
- {
- .name = "force_empty",
- .write = mem_cgroup_force_empty_write,
- },
- {
- .name = "use_hierarchy",
- .write_u64 = mem_cgroup_hierarchy_write,
- .read_u64 = mem_cgroup_hierarchy_read,
- },
- {
- .name = "cgroup.event_control", /* XXX: for compat */
- .write = memcg_write_event_control,
- .flags = CFTYPE_NO_PREFIX | CFTYPE_WORLD_WRITABLE,
- },
- {
- .name = "swappiness",
- .read_u64 = mem_cgroup_swappiness_read,
- .write_u64 = mem_cgroup_swappiness_write,
- },
- {
- .name = "move_charge_at_immigrate",
- .read_u64 = mem_cgroup_move_charge_read,
- .write_u64 = mem_cgroup_move_charge_write,
- },
- {
- .name = "oom_control",
- .seq_show = mem_cgroup_oom_control_read,
- .write_u64 = mem_cgroup_oom_control_write,
- },
- {
- .name = "pressure_level",
- .seq_show = mem_cgroup_dummy_seq_show,
- },
-#ifdef CONFIG_NUMA
- {
- .name = "numa_stat",
- .seq_show = memcg_numa_stat_show,
- },
-#endif
- {
- .name = "kmem.limit_in_bytes",
- .private = MEMFILE_PRIVATE(_KMEM, RES_LIMIT),
- .write = mem_cgroup_write,
- .read_u64 = mem_cgroup_read_u64,
- },
- {
- .name = "kmem.usage_in_bytes",
- .private = MEMFILE_PRIVATE(_KMEM, RES_USAGE),
- .read_u64 = mem_cgroup_read_u64,
- },
- {
- .name = "kmem.failcnt",
- .private = MEMFILE_PRIVATE(_KMEM, RES_FAILCNT),
- .write = mem_cgroup_reset,
- .read_u64 = mem_cgroup_read_u64,
- },
- {
- .name = "kmem.max_usage_in_bytes",
- .private = MEMFILE_PRIVATE(_KMEM, RES_MAX_USAGE),
- .write = mem_cgroup_reset,
- .read_u64 = mem_cgroup_read_u64,
- },
-#if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_SLUB_DEBUG)
- {
- .name = "kmem.slabinfo",
- .seq_show = mem_cgroup_slab_show,
- },
-#endif
- {
- .name = "kmem.tcp.limit_in_bytes",
- .private = MEMFILE_PRIVATE(_TCP, RES_LIMIT),
- .write = mem_cgroup_write,
- .read_u64 = mem_cgroup_read_u64,
- },
- {
- .name = "kmem.tcp.usage_in_bytes",
- .private = MEMFILE_PRIVATE(_TCP, RES_USAGE),
- .read_u64 = mem_cgroup_read_u64,
- },
- {
- .name = "kmem.tcp.failcnt",
- .private = MEMFILE_PRIVATE(_TCP, RES_FAILCNT),
- .write = mem_cgroup_reset,
- .read_u64 = mem_cgroup_read_u64,
- },
- {
- .name = "kmem.tcp.max_usage_in_bytes",
- .private = MEMFILE_PRIVATE(_TCP, RES_MAX_USAGE),
- .write = mem_cgroup_reset,
- .read_u64 = mem_cgroup_read_u64,
- },
- { }, /* terminate */
-};
-
/*
* Private memory cgroup IDR
*
@@ -5119,7 +4412,7 @@ static int memory_events_local_show(struct seq_file *m, void *v)
return 0;
}

-static int memory_stat_show(struct seq_file *m, void *v)
+int memory_stat_show(struct seq_file *m, void *v)
{
struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
char *buf = kmalloc(PAGE_SIZE, GFP_KERNEL);
@@ -6351,33 +5644,6 @@ static struct cftype swap_files[] = {
{ } /* terminate */
};

-static struct cftype memsw_files[] = {
- {
- .name = "memsw.usage_in_bytes",
- .private = MEMFILE_PRIVATE(_MEMSWAP, RES_USAGE),
- .read_u64 = mem_cgroup_read_u64,
- },
- {
- .name = "memsw.max_usage_in_bytes",
- .private = MEMFILE_PRIVATE(_MEMSWAP, RES_MAX_USAGE),
- .write = mem_cgroup_reset,
- .read_u64 = mem_cgroup_read_u64,
- },
- {
- .name = "memsw.limit_in_bytes",
- .private = MEMFILE_PRIVATE(_MEMSWAP, RES_LIMIT),
- .write = mem_cgroup_write,
- .read_u64 = mem_cgroup_read_u64,
- },
- {
- .name = "memsw.failcnt",
- .private = MEMFILE_PRIVATE(_MEMSWAP, RES_FAILCNT),
- .write = mem_cgroup_reset,
- .read_u64 = mem_cgroup_read_u64,
- },
- { }, /* terminate */
-};
-
#if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_ZSWAP)
/**
* obj_cgroup_may_zswap - check if this cgroup can zswap
--
2.43.2


2024-05-09 03:43:39

by Roman Gushchin

[permalink] [raw]
Subject: [PATCH rfc 8/9] mm: memcg: put corresponding struct mem_cgroup members under CONFIG_MEMCG_V1

Put members of struct mem_cgroup which are related to the legacy
cgroup v1 memory controller under the CONFIG_MEMCG_V1 config option.
Also, put the initialization and some trivial access code under the
same option.
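
For illustration, a condensed sketch of the resulting layout - abridged from
the include/linux/memcontrol.h hunks below, with everything else elided:

struct mem_cgroup {
        /* ... common and cgroup v2 fields ... */
#ifdef CONFIG_MEMCG_V1
        /* Legacy consumer-oriented counters */
        struct page_counter kmem;               /* v1 only */
        struct page_counter tcpmem;             /* v1 only */
#endif
        /* ... */
#ifdef CONFIG_MEMCG_V1
        unsigned long soft_limit;

        /* protected by memcg_oom_lock */
        bool oom_lock;
        int under_oom;

        /* OOM-Killer disable */
        int oom_kill_disable;

        /* thresholds, v1 events, charge migration state, ... */
#endif
        /* ... */
};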

Signed-off-by: Roman Gushchin <[email protected]>
---
include/linux/memcontrol.h | 33 ++++++++++++++++++++-------------
mm/memcontrol.c | 25 +++++++++++++++++++------
2 files changed, 39 insertions(+), 19 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index d2a4145b1909..4347d6889fa0 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -182,9 +182,11 @@ struct mem_cgroup {
struct page_counter memsw; /* v1 only */
};

+#ifdef CONFIG_MEMCG_V1
/* Legacy consumer-oriented counters */
struct page_counter kmem; /* v1 only */
struct page_counter tcpmem; /* v1 only */
+#endif

/* Range enforcement for interrupt charges */
struct work_struct high_work;
@@ -198,24 +200,15 @@ struct mem_cgroup {
*/
bool zswap_writeback;
#endif
-
- unsigned long soft_limit;
-
- /* vmpressure notifications */
- struct vmpressure vmpressure;
+ int swappiness;

/*
* Should the OOM killer kill all belonging tasks, had it kill one?
*/
bool oom_group;

- /* protected by memcg_oom_lock */
- bool oom_lock;
- int under_oom;
-
- int swappiness;
- /* OOM-Killer disable */
- int oom_kill_disable;
+ /* vmpressure notifications */
+ struct vmpressure vmpressure;

/* memory.events and memory.events.local */
struct cgroup_file events_file;
@@ -224,6 +217,16 @@ struct mem_cgroup {
/* handle for "memory.swap.events" */
struct cgroup_file swap_events_file;

+#ifdef CONFIG_MEMCG_V1
+ unsigned long soft_limit;
+
+ /* protected by memcg_oom_lock */
+ bool oom_lock;
+ int under_oom;
+
+ /* OOM-Killer disable */
+ int oom_kill_disable;
+
/* protect arrays of thresholds */
struct mutex thresholds_lock;

@@ -244,6 +247,7 @@ struct mem_cgroup {
/* taken only while moving_account > 0 */
spinlock_t move_lock;
unsigned long move_lock_flags;
+#endif

CACHELINE_PADDING(_pad1_);

@@ -279,12 +283,13 @@ struct mem_cgroup {
#endif

CACHELINE_PADDING(_pad2_);
-
+#ifdef CONFIG_MEMCG_V1
/*
* set > 0 if pages under this cgroup are moving to other cgroup.
*/
atomic_t moving_account;
struct task_struct *move_lock_task;
+#endif

struct memcg_vmstats_percpu __percpu *vmstats_percpu;

@@ -294,9 +299,11 @@ struct mem_cgroup {
struct memcg_cgwb_frn cgwb_frn[MEMCG_CGWB_FRN_CNT];
#endif

+#ifdef CONFIG_MEMCG_V1
/* List of events which userspace want to receive */
struct list_head event_list;
spinlock_t event_list_lock;
+#endif

#ifdef CONFIG_TRANSPARENT_HUGEPAGE
struct deferred_split deferred_split_queue;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d5883f748330..4dcfd1ef5d61 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1537,6 +1537,7 @@ void mem_cgroup_print_oom_meminfo(struct mem_cgroup *memcg)
pr_info("swap: usage %llukB, limit %llukB, failcnt %lu\n",
K((u64)page_counter_read(&memcg->swap)),
K((u64)READ_ONCE(memcg->swap.max)), memcg->swap.failcnt);
+#ifdef CONFIG_MEMCG_V1
else {
pr_info("memory+swap: usage %llukB, limit %llukB, failcnt %lu\n",
K((u64)page_counter_read(&memcg->memsw)),
@@ -1545,6 +1546,7 @@ void mem_cgroup_print_oom_meminfo(struct mem_cgroup *memcg)
K((u64)page_counter_read(&memcg->kmem)),
K((u64)memcg->kmem.max), memcg->kmem.failcnt);
}
+#endif

pr_info("Memory cgroup stats for ");
pr_cont_cgroup_path(memcg->css.cgroup);
@@ -2650,12 +2652,14 @@ struct obj_cgroup *get_obj_cgroup_from_folio(struct folio *folio)
static void memcg_account_kmem(struct mem_cgroup *memcg, int nr_pages)
{
mod_memcg_state(memcg, MEMCG_KMEM, nr_pages);
+#ifdef CONFIG_MEMCG_V1
if (!cgroup_subsys_on_dfl(memory_cgrp_subsys)) {
if (nr_pages > 0)
page_counter_charge(&memcg->kmem, nr_pages);
else
page_counter_uncharge(&memcg->kmem, -nr_pages);
}
+#endif
}


@@ -3602,12 +3606,14 @@ static struct mem_cgroup *mem_cgroup_alloc(struct mem_cgroup *parent)
goto fail;

INIT_WORK(&memcg->high_work, high_work_func);
+#ifdef CONFIG_MEMCG_V1
INIT_LIST_HEAD(&memcg->oom_notify);
mutex_init(&memcg->thresholds_lock);
spin_lock_init(&memcg->move_lock);
- vmpressure_init(&memcg->vmpressure);
INIT_LIST_HEAD(&memcg->event_list);
spin_lock_init(&memcg->event_list_lock);
+#endif
+ vmpressure_init(&memcg->vmpressure);
memcg->socket_pressure = jiffies;
#ifdef CONFIG_MEMCG_KMEM
memcg->kmemcg_id = -1;
@@ -3654,20 +3660,22 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
page_counter_set_high(&memcg->swap, PAGE_COUNTER_MAX);
if (parent) {
WRITE_ONCE(memcg->swappiness, mem_cgroup_swappiness(parent));
- WRITE_ONCE(memcg->oom_kill_disable, READ_ONCE(parent->oom_kill_disable));
-
page_counter_init(&memcg->memory, &parent->memory);
page_counter_init(&memcg->swap, &parent->swap);
+#ifdef CONFIG_MEMCG_V1
page_counter_init(&memcg->kmem, &parent->kmem);
page_counter_init(&memcg->tcpmem, &parent->tcpmem);
+ WRITE_ONCE(memcg->oom_kill_disable, READ_ONCE(parent->oom_kill_disable));
+#endif
} else {
init_memcg_stats();
init_memcg_events();
page_counter_init(&memcg->memory, NULL);
page_counter_init(&memcg->swap, NULL);
+#ifdef CONFIG_MEMCG_V1
page_counter_init(&memcg->kmem, NULL);
page_counter_init(&memcg->tcpmem, NULL);
-
+#endif
root_mem_cgroup = memcg;
return &memcg->css;
}
@@ -3802,12 +3810,14 @@ static void mem_cgroup_css_reset(struct cgroup_subsys_state *css)

page_counter_set_max(&memcg->memory, PAGE_COUNTER_MAX);
page_counter_set_max(&memcg->swap, PAGE_COUNTER_MAX);
+#ifdef CONFIG_MEMCG_V1
page_counter_set_max(&memcg->kmem, PAGE_COUNTER_MAX);
page_counter_set_max(&memcg->tcpmem, PAGE_COUNTER_MAX);
+ mem_cgroup_soft_limit_reset(memcg);
+#endif
page_counter_set_min(&memcg->memory, 0);
page_counter_set_low(&memcg->memory, 0);
page_counter_set_high(&memcg->memory, PAGE_COUNTER_MAX);
- mem_cgroup_soft_limit_reset(memcg);
page_counter_set_high(&memcg->swap, PAGE_COUNTER_MAX);
memcg_wb_domain_size_changed(memcg);
}
@@ -5018,6 +5028,7 @@ void mem_cgroup_sk_free(struct sock *sk)
bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages,
gfp_t gfp_mask)
{
+#ifdef CONFIG_MEMCG_V1
if (!cgroup_subsys_on_dfl(memory_cgrp_subsys)) {
struct page_counter *fail;

@@ -5032,6 +5043,7 @@ bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages,
}
return false;
}
+#endif

if (try_charge(memcg, gfp_mask, nr_pages) == 0) {
mod_memcg_state(memcg, MEMCG_SOCK, nr_pages);
@@ -5048,11 +5060,12 @@ bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages,
*/
void mem_cgroup_uncharge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages)
{
+#ifdef CONFIG_MEMCG_V1
if (!cgroup_subsys_on_dfl(memory_cgrp_subsys)) {
page_counter_uncharge(&memcg->tcpmem, nr_pages);
return;
}
-
+#endif
mod_memcg_state(memcg, MEMCG_SOCK, -nr_pages);

refill_stock(memcg, nr_pages);
--
2.43.2


2024-05-09 03:52:20

by Roman Gushchin

[permalink] [raw]
Subject: [PATCH rfc 6/9] mm: memcg: move cgroup v1 oom handling code into memcontrol-v1.c

Cgroup v1 supports a complicated userspace OOM handling mechanism,
which cgroup v2 does not. Let's move the corresponding code into
memcontrol-v1.c.

Aside from the mechanical code movement, this patch introduces two new
functions: mem_cgroup_v1_oom_prepare() and mem_cgroup_v1_oom_finish().
They implement the cgroup v1-specific parts of the common memcg OOM
handling path.
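
To illustrate how the two helpers slot into the common path, here is a
simplified sketch of mem_cgroup_oom() after this patch - condensed from the
mm/memcontrol.c hunk below, with the early bailout and surrounding details
elided:

static bool mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order)
{
	bool locked, ret;

	memcg_memory_event(memcg, MEMCG_OOM);

	/* v1: userspace OOM handling, under_oom marking, oom_lock */
	if (!mem_cgroup_v1_oom_prepare(memcg, mask, order, &locked))
		return false;

	ret = mem_cgroup_out_of_memory(memcg, mask, order);

	/* v1: drop the per-hierarchy oom_lock if it was taken */
	mem_cgroup_v1_oom_finish(memcg, &locked);

	return ret;
}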

Signed-off-by: Roman Gushchin <[email protected]>
---
mm/internal.h | 4 +-
mm/memcontrol-v1.c | 229 +++++++++++++++++++++++++++++++++++++++++++++
mm/memcontrol.c | 220 +------------------------------------------
3 files changed, 234 insertions(+), 219 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index 533aa999a450..1b94e2169e19 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1614,8 +1614,10 @@ enum res_type {
};

void memcg_check_events(struct mem_cgroup *memcg, int nid);
-void mem_cgroup_oom_notify(struct mem_cgroup *memcg);
void memcg1_stat_format(struct mem_cgroup *memcg, struct seq_buf *s);
+bool mem_cgroup_v1_oom_prepare(struct mem_cgroup *memcg, gfp_t mask, int order,
+ bool *locked);
+void mem_cgroup_v1_oom_finish(struct mem_cgroup *memcg, bool *locked);
void mem_cgroup_v1_offline_memcg(struct mem_cgroup *memcg);

extern struct cftype memsw_files[];
diff --git a/mm/memcontrol-v1.c b/mm/memcontrol-v1.c
index cd711f6b8386..15356bbbc058 100644
--- a/mm/memcontrol-v1.c
+++ b/mm/memcontrol-v1.c
@@ -2675,6 +2675,235 @@ struct cftype memsw_files[] = {
{ }, /* terminate */
};

+#ifdef CONFIG_LOCKDEP
+static struct lockdep_map memcg_oom_lock_dep_map = {
+ .name = "memcg_oom_lock",
+};
+#endif
+
+DEFINE_SPINLOCK(memcg_oom_lock);
+
+/*
+ * Check OOM-Killer is already running under our hierarchy.
+ * If someone is running, return false.
+ */
+static bool mem_cgroup_oom_trylock(struct mem_cgroup *memcg)
+{
+ struct mem_cgroup *iter, *failed = NULL;
+
+ spin_lock(&memcg_oom_lock);
+
+ for_each_mem_cgroup_tree(iter, memcg) {
+ if (iter->oom_lock) {
+ /*
+ * this subtree of our hierarchy is already locked
+ * so we cannot give a lock.
+ */
+ failed = iter;
+ mem_cgroup_iter_break(memcg, iter);
+ break;
+ } else
+ iter->oom_lock = true;
+ }
+
+ if (failed) {
+ /*
+ * OK, we failed to lock the whole subtree so we have
+ * to clean up what we set up to the failing subtree
+ */
+ for_each_mem_cgroup_tree(iter, memcg) {
+ if (iter == failed) {
+ mem_cgroup_iter_break(memcg, iter);
+ break;
+ }
+ iter->oom_lock = false;
+ }
+ } else
+ mutex_acquire(&memcg_oom_lock_dep_map, 0, 1, _RET_IP_);
+
+ spin_unlock(&memcg_oom_lock);
+
+ return !failed;
+}
+
+static void mem_cgroup_oom_unlock(struct mem_cgroup *memcg)
+{
+ struct mem_cgroup *iter;
+
+ spin_lock(&memcg_oom_lock);
+ mutex_release(&memcg_oom_lock_dep_map, _RET_IP_);
+ for_each_mem_cgroup_tree(iter, memcg)
+ iter->oom_lock = false;
+ spin_unlock(&memcg_oom_lock);
+}
+
+static void mem_cgroup_mark_under_oom(struct mem_cgroup *memcg)
+{
+ struct mem_cgroup *iter;
+
+ spin_lock(&memcg_oom_lock);
+ for_each_mem_cgroup_tree(iter, memcg)
+ iter->under_oom++;
+ spin_unlock(&memcg_oom_lock);
+}
+
+static void mem_cgroup_unmark_under_oom(struct mem_cgroup *memcg)
+{
+ struct mem_cgroup *iter;
+
+ /*
+ * Be careful about under_oom underflows because a child memcg
+ * could have been added after mem_cgroup_mark_under_oom.
+ */
+ spin_lock(&memcg_oom_lock);
+ for_each_mem_cgroup_tree(iter, memcg)
+ if (iter->under_oom > 0)
+ iter->under_oom--;
+ spin_unlock(&memcg_oom_lock);
+}
+
+bool mem_cgroup_v1_oom_prepare(struct mem_cgroup *memcg, gfp_t mask, int order,
+ bool *locked)
+{
+ /*
+ * We are in the middle of the charge context here, so we
+ * don't want to block when potentially sitting on a callstack
+ * that holds all kinds of filesystem and mm locks.
+ *
+ * cgroup1 allows disabling the OOM killer and waiting for outside
+ * handling until the charge can succeed; remember the context and put
+ * the task to sleep at the end of the page fault when all locks are
+ * released.
+ *
+ * On the other hand, in-kernel OOM killer allows for an async victim
+ * memory reclaim (oom_reaper) and that means that we are not solely
+ * relying on the oom victim to make a forward progress and we can
+ * invoke the oom killer here.
+ *
+ * Please note that mem_cgroup_out_of_memory might fail to find a
+ * victim and then we have to bail out from the charge path.
+ */
+ if (READ_ONCE(memcg->oom_kill_disable)) {
+ if (current->in_user_fault) {
+ css_get(&memcg->css);
+ current->memcg_in_oom = memcg;
+ current->memcg_oom_gfp_mask = mask;
+ current->memcg_oom_order = order;
+ }
+ return false;
+ }
+
+ mem_cgroup_mark_under_oom(memcg);
+
+ *locked = mem_cgroup_oom_trylock(memcg);
+
+ if (*locked)
+ mem_cgroup_oom_notify(memcg);
+
+ mem_cgroup_unmark_under_oom(memcg);
+
+ return true;
+}
+
+void mem_cgroup_v1_oom_finish(struct mem_cgroup *memcg, bool *locked)
+{
+ if (*locked)
+ mem_cgroup_oom_unlock(memcg);
+}
+
+static DECLARE_WAIT_QUEUE_HEAD(memcg_oom_waitq);
+
+struct oom_wait_info {
+ struct mem_cgroup *memcg;
+ wait_queue_entry_t wait;
+};
+
+static int memcg_oom_wake_function(wait_queue_entry_t *wait,
+ unsigned mode, int sync, void *arg)
+{
+ struct mem_cgroup *wake_memcg = (struct mem_cgroup *)arg;
+ struct mem_cgroup *oom_wait_memcg;
+ struct oom_wait_info *oom_wait_info;
+
+ oom_wait_info = container_of(wait, struct oom_wait_info, wait);
+ oom_wait_memcg = oom_wait_info->memcg;
+
+ if (!mem_cgroup_is_descendant(wake_memcg, oom_wait_memcg) &&
+ !mem_cgroup_is_descendant(oom_wait_memcg, wake_memcg))
+ return 0;
+ return autoremove_wake_function(wait, mode, sync, arg);
+}
+
+void memcg_oom_recover(struct mem_cgroup *memcg)
+{
+ /*
+ * For the following lockless ->under_oom test, the only required
+ * guarantee is that it must see the state asserted by an OOM when
+ * this function is called as a result of userland actions
+ * triggered by the notification of the OOM. This is trivially
+ * achieved by invoking mem_cgroup_mark_under_oom() before
+ * triggering notification.
+ */
+ if (memcg && memcg->under_oom)
+ __wake_up(&memcg_oom_waitq, TASK_NORMAL, 0, memcg);
+}
+
+/**
+ * mem_cgroup_oom_synchronize - complete memcg OOM handling
+ * @handle: actually kill/wait or just clean up the OOM state
+ *
+ * This has to be called at the end of a page fault if the memcg OOM
+ * handler was enabled.
+ *
+ * Memcg supports userspace OOM handling where failed allocations must
+ * sleep on a waitqueue until the userspace task resolves the
+ * situation. Sleeping directly in the charge context with all kinds
+ * of locks held is not a good idea, instead we remember an OOM state
+ * in the task and mem_cgroup_oom_synchronize() has to be called at
+ * the end of the page fault to complete the OOM handling.
+ *
+ * Returns %true if an ongoing memcg OOM situation was detected and
+ * completed, %false otherwise.
+ */
+bool mem_cgroup_oom_synchronize(bool handle)
+{
+ struct mem_cgroup *memcg = current->memcg_in_oom;
+ struct oom_wait_info owait;
+ bool locked;
+
+ /* OOM is global, do not handle */
+ if (!memcg)
+ return false;
+
+ if (!handle)
+ goto cleanup;
+
+ owait.memcg = memcg;
+ owait.wait.flags = 0;
+ owait.wait.func = memcg_oom_wake_function;
+ owait.wait.private = current;
+ INIT_LIST_HEAD(&owait.wait.entry);
+
+ prepare_to_wait(&memcg_oom_waitq, &owait.wait, TASK_KILLABLE);
+ mem_cgroup_mark_under_oom(memcg);
+
+ locked = mem_cgroup_oom_trylock(memcg);
+
+ if (locked)
+ mem_cgroup_oom_notify(memcg);
+
+ schedule();
+ mem_cgroup_unmark_under_oom(memcg);
+ finish_wait(&memcg_oom_waitq, &owait.wait);
+
+ if (locked)
+ mem_cgroup_oom_unlock(memcg);
+cleanup:
+ current->memcg_in_oom = NULL;
+ css_put(&memcg->css);
+ return true;
+}
+
void mem_cgroup_v1_offline_memcg(struct mem_cgroup *memcg)
{
struct mem_cgroup_event *event, *tmp;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index cd7e5f67d9b5..805efc98ae12 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1610,130 +1610,6 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
return ret;
}

-#ifdef CONFIG_LOCKDEP
-static struct lockdep_map memcg_oom_lock_dep_map = {
- .name = "memcg_oom_lock",
-};
-#endif
-
-DEFINE_SPINLOCK(memcg_oom_lock);
-
-/*
- * Check OOM-Killer is already running under our hierarchy.
- * If someone is running, return false.
- */
-static bool mem_cgroup_oom_trylock(struct mem_cgroup *memcg)
-{
- struct mem_cgroup *iter, *failed = NULL;
-
- spin_lock(&memcg_oom_lock);
-
- for_each_mem_cgroup_tree(iter, memcg) {
- if (iter->oom_lock) {
- /*
- * this subtree of our hierarchy is already locked
- * so we cannot give a lock.
- */
- failed = iter;
- mem_cgroup_iter_break(memcg, iter);
- break;
- } else
- iter->oom_lock = true;
- }
-
- if (failed) {
- /*
- * OK, we failed to lock the whole subtree so we have
- * to clean up what we set up to the failing subtree
- */
- for_each_mem_cgroup_tree(iter, memcg) {
- if (iter == failed) {
- mem_cgroup_iter_break(memcg, iter);
- break;
- }
- iter->oom_lock = false;
- }
- } else
- mutex_acquire(&memcg_oom_lock_dep_map, 0, 1, _RET_IP_);
-
- spin_unlock(&memcg_oom_lock);
-
- return !failed;
-}
-
-static void mem_cgroup_oom_unlock(struct mem_cgroup *memcg)
-{
- struct mem_cgroup *iter;
-
- spin_lock(&memcg_oom_lock);
- mutex_release(&memcg_oom_lock_dep_map, _RET_IP_);
- for_each_mem_cgroup_tree(iter, memcg)
- iter->oom_lock = false;
- spin_unlock(&memcg_oom_lock);
-}
-
-static void mem_cgroup_mark_under_oom(struct mem_cgroup *memcg)
-{
- struct mem_cgroup *iter;
-
- spin_lock(&memcg_oom_lock);
- for_each_mem_cgroup_tree(iter, memcg)
- iter->under_oom++;
- spin_unlock(&memcg_oom_lock);
-}
-
-static void mem_cgroup_unmark_under_oom(struct mem_cgroup *memcg)
-{
- struct mem_cgroup *iter;
-
- /*
- * Be careful about under_oom underflows because a child memcg
- * could have been added after mem_cgroup_mark_under_oom.
- */
- spin_lock(&memcg_oom_lock);
- for_each_mem_cgroup_tree(iter, memcg)
- if (iter->under_oom > 0)
- iter->under_oom--;
- spin_unlock(&memcg_oom_lock);
-}
-
-static DECLARE_WAIT_QUEUE_HEAD(memcg_oom_waitq);
-
-struct oom_wait_info {
- struct mem_cgroup *memcg;
- wait_queue_entry_t wait;
-};
-
-static int memcg_oom_wake_function(wait_queue_entry_t *wait,
- unsigned mode, int sync, void *arg)
-{
- struct mem_cgroup *wake_memcg = (struct mem_cgroup *)arg;
- struct mem_cgroup *oom_wait_memcg;
- struct oom_wait_info *oom_wait_info;
-
- oom_wait_info = container_of(wait, struct oom_wait_info, wait);
- oom_wait_memcg = oom_wait_info->memcg;
-
- if (!mem_cgroup_is_descendant(wake_memcg, oom_wait_memcg) &&
- !mem_cgroup_is_descendant(oom_wait_memcg, wake_memcg))
- return 0;
- return autoremove_wake_function(wait, mode, sync, arg);
-}
-
-void memcg_oom_recover(struct mem_cgroup *memcg)
-{
- /*
- * For the following lockless ->under_oom test, the only required
- * guarantee is that it must see the state asserted by an OOM when
- * this function is called as a result of userland actions
- * triggered by the notification of the OOM. This is trivially
- * achieved by invoking mem_cgroup_mark_under_oom() before
- * triggering notification.
- */
- if (memcg && memcg->under_oom)
- __wake_up(&memcg_oom_waitq, TASK_NORMAL, 0, memcg);
-}
-
/*
* Returns true if successfully killed one or more processes. Though in some
* corner cases it can return true even without killing any process.
@@ -1747,106 +1623,14 @@ static bool mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order)

memcg_memory_event(memcg, MEMCG_OOM);

- /*
- * We are in the middle of the charge context here, so we
- * don't want to block when potentially sitting on a callstack
- * that holds all kinds of filesystem and mm locks.
- *
- * cgroup1 allows disabling the OOM killer and waiting for outside
- * handling until the charge can succeed; remember the context and put
- * the task to sleep at the end of the page fault when all locks are
- * released.
- *
- * On the other hand, in-kernel OOM killer allows for an async victim
- * memory reclaim (oom_reaper) and that means that we are not solely
- * relying on the oom victim to make a forward progress and we can
- * invoke the oom killer here.
- *
- * Please note that mem_cgroup_out_of_memory might fail to find a
- * victim and then we have to bail out from the charge path.
- */
- if (READ_ONCE(memcg->oom_kill_disable)) {
- if (current->in_user_fault) {
- css_get(&memcg->css);
- current->memcg_in_oom = memcg;
- current->memcg_oom_gfp_mask = mask;
- current->memcg_oom_order = order;
- }
+ if (!mem_cgroup_v1_oom_prepare(memcg, mask, order, &locked))
return false;
- }
-
- mem_cgroup_mark_under_oom(memcg);
-
- locked = mem_cgroup_oom_trylock(memcg);
-
- if (locked)
- mem_cgroup_oom_notify(memcg);
-
- mem_cgroup_unmark_under_oom(memcg);
ret = mem_cgroup_out_of_memory(memcg, mask, order);
-
- if (locked)
- mem_cgroup_oom_unlock(memcg);
+ mem_cgroup_v1_oom_finish(memcg, &locked);

return ret;
}

-/**
- * mem_cgroup_oom_synchronize - complete memcg OOM handling
- * @handle: actually kill/wait or just clean up the OOM state
- *
- * This has to be called at the end of a page fault if the memcg OOM
- * handler was enabled.
- *
- * Memcg supports userspace OOM handling where failed allocations must
- * sleep on a waitqueue until the userspace task resolves the
- * situation. Sleeping directly in the charge context with all kinds
- * of locks held is not a good idea, instead we remember an OOM state
- * in the task and mem_cgroup_oom_synchronize() has to be called at
- * the end of the page fault to complete the OOM handling.
- *
- * Returns %true if an ongoing memcg OOM situation was detected and
- * completed, %false otherwise.
- */
-bool mem_cgroup_oom_synchronize(bool handle)
-{
- struct mem_cgroup *memcg = current->memcg_in_oom;
- struct oom_wait_info owait;
- bool locked;
-
- /* OOM is global, do not handle */
- if (!memcg)
- return false;
-
- if (!handle)
- goto cleanup;
-
- owait.memcg = memcg;
- owait.wait.flags = 0;
- owait.wait.func = memcg_oom_wake_function;
- owait.wait.private = current;
- INIT_LIST_HEAD(&owait.wait.entry);
-
- prepare_to_wait(&memcg_oom_waitq, &owait.wait, TASK_KILLABLE);
- mem_cgroup_mark_under_oom(memcg);
-
- locked = mem_cgroup_oom_trylock(memcg);
-
- if (locked)
- mem_cgroup_oom_notify(memcg);
-
- schedule();
- mem_cgroup_unmark_under_oom(memcg);
- finish_wait(&memcg_oom_waitq, &owait.wait);
-
- if (locked)
- mem_cgroup_oom_unlock(memcg);
-cleanup:
- current->memcg_in_oom = NULL;
- css_put(&memcg->css);
- return true;
-}
-
/**
* mem_cgroup_get_oom_group - get a memory cgroup to clean up after OOM
* @victim: task to be killed by the OOM killer
--
2.43.2


2024-05-09 03:52:29

by Roman Gushchin

[permalink] [raw]
Subject: [PATCH rfc 7/9] mm: memcg: put cgroup v1-specific code under a config option

Put legacy cgroup v1 memory controller code under a new
CONFIG_MEMCG_V1 config option. For now the option is turned on by
default to keep things backward-compatible. But users who have fully
adopted cgroup v2 and don't use cgroup v1 anymore can turn it off and
benefit from a smaller memory footprint and small CPU wins on some
memcg paths.
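
For reference, the new Kconfig entry (see the init/Kconfig hunk below), plus
an illustrative .config fragment for a cgroup v2-only build that opts out:

config MEMCG_V1
	bool "Legacy memory controller"
	depends on MEMCG
	default y
	help
	  Legacy cgroup v1 memory controller.

# illustrative .config fragment for a v2-only system:
CONFIG_MEMCG=y
# CONFIG_MEMCG_V1 is not set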

Signed-off-by: Roman Gushchin <[email protected]>
---
include/linux/memcontrol.h | 13 ++++++++++---
init/Kconfig | 7 +++++++
mm/Makefile | 3 ++-
mm/internal.h | 24 +++++++++++++++++++++++-
mm/memcontrol.c | 10 +++++++---
5 files changed, 49 insertions(+), 8 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index fc4aaa73aa5e..d2a4145b1909 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -954,7 +954,14 @@ static inline bool task_in_memcg_oom(struct task_struct *p)
return p->memcg_in_oom;
}

+#ifdef CONFIG_MEMCG_V1
bool mem_cgroup_oom_synchronize(bool wait);
+#else
+static inline bool mem_cgroup_oom_synchronize(bool wait)
+{
+ return false;
+}
+#endif
struct mem_cgroup *mem_cgroup_get_oom_group(struct task_struct *victim,
struct mem_cgroup *oom_domain);
void mem_cgroup_print_oom_group(struct mem_cgroup *memcg);
@@ -1872,7 +1879,7 @@ static inline bool mem_cgroup_zswap_writeback_enabled(struct mem_cgroup *memcg)

/* Cgroup v1-specific definitions */

-#ifdef CONFIG_MEMCG
+#ifdef CONFIG_MEMCG_V1
unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
gfp_t gfp_mask,
unsigned long *total_scanned);
@@ -1895,7 +1902,7 @@ static inline void mem_cgroup_unlock_pages(void)
{
rcu_read_unlock();
}
-#else /* CONFIG_MEMCG */
+#else /* CONFIG_MEMCG_V1 */

static inline
unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
@@ -1924,6 +1931,6 @@ static inline void mem_cgroup_unlock_pages(void)
{
rcu_read_unlock();
}
-#endif /* CONFIG_MEMCG */
+#endif /* CONFIG_MEMCG_V1 */

#endif /* _LINUX_MEMCONTROL_H */
diff --git a/init/Kconfig b/init/Kconfig
index 10d4a638d9ae..ce9b78279627 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -969,6 +969,13 @@ config MEMCG
help
Provides control over the memory footprint of tasks in a cgroup.

+config MEMCG_V1
+ bool "Legacy memory controller"
+ depends on MEMCG
+ default y
+ help
+ Legacy cgroup v1 memory controller.
+
config MEMCG_KMEM
bool
depends on MEMCG
diff --git a/mm/Makefile b/mm/Makefile
index c717a3ee612e..4e2fe5f6637c 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -96,7 +96,8 @@ obj-$(CONFIG_NUMA) += memory-tiers.o
obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o
obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
-obj-$(CONFIG_MEMCG) += memcontrol.o memcontrol-v1.o vmpressure.o
+obj-$(CONFIG_MEMCG_V1) += memcontrol-v1.o
+obj-$(CONFIG_MEMCG) += memcontrol.o vmpressure.o
ifdef CONFIG_SWAP
obj-$(CONFIG_MEMCG) += swap_cgroup.o
endif
diff --git a/mm/internal.h b/mm/internal.h
index 1b94e2169e19..8c5640ef85f8 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1559,7 +1559,6 @@ static inline int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
}

void mem_cgroup_charge_statistics(struct mem_cgroup *memcg, int nr_pages);
-void memcg_oom_recover(struct mem_cgroup *memcg);
void mem_cgroup_id_get_many(struct mem_cgroup *memcg, unsigned int n);
void mem_cgroup_id_put_many(struct mem_cgroup *memcg, unsigned int n);

@@ -1589,6 +1588,7 @@ unsigned long memcg_events_local(struct mem_cgroup *memcg, int event);
void drain_all_stock(struct mem_cgroup *root_memcg);

/* Memory cgroups v1-specific definitions */
+#ifdef CONFIG_MEMCG_V1
void mem_cgroup_update_tree(struct mem_cgroup *memcg, int nid);
void mem_cgroup_remove_from_trees(struct mem_cgroup *memcg);
void mem_cgroup_soft_limit_reset(struct mem_cgroup *memcg);
@@ -1618,9 +1618,31 @@ void memcg1_stat_format(struct mem_cgroup *memcg, struct seq_buf *s);
bool mem_cgroup_v1_oom_prepare(struct mem_cgroup *memcg, gfp_t mask, int order,
bool *locked);
void mem_cgroup_v1_oom_finish(struct mem_cgroup *memcg, bool *locked);
+void memcg_oom_recover(struct mem_cgroup *memcg);
void mem_cgroup_v1_offline_memcg(struct mem_cgroup *memcg);

extern struct cftype memsw_files[];
extern struct cftype mem_cgroup_legacy_files[];

+#else /* CONFIG_MEMCG_V1 */
+static inline void mem_cgroup_remove_from_trees(struct mem_cgroup *memcg) {}
+static inline void mem_cgroup_soft_limit_reset(struct mem_cgroup *memcg) {}
+
+static inline bool mem_cgroup_wait_acct_move(struct mem_cgroup *memcg)
+{
+ return false;
+}
+
+static inline void memcg_check_events(struct mem_cgroup *memcg, int nid) {}
+static inline void memcg1_stat_format(struct mem_cgroup *memcg, struct seq_buf *s) {}
+static inline bool mem_cgroup_v1_oom_prepare(struct mem_cgroup *memcg, gfp_t mask, int order,
+ bool *locked)
+{
+ return true;
+}
+static inline void mem_cgroup_v1_oom_finish(struct mem_cgroup *memcg, bool *locked) {}
+static inline void memcg_oom_recover(struct mem_cgroup *memcg) {}
+static inline void mem_cgroup_v1_offline_memcg(struct mem_cgroup *memcg) {}
+#endif /* CONFIG_MEMCG_V1 */
+
#endif /* __MM_INTERNAL_H */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 805efc98ae12..d5883f748330 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4435,18 +4435,20 @@ struct cgroup_subsys memory_cgrp_subsys = {
.css_free = mem_cgroup_css_free,
.css_reset = mem_cgroup_css_reset,
.css_rstat_flush = mem_cgroup_css_rstat_flush,
- .can_attach = mem_cgroup_can_attach,
#if defined(CONFIG_LRU_GEN) || defined(CONFIG_MEMCG_KMEM)
.attach = mem_cgroup_attach,
#endif
- .cancel_attach = mem_cgroup_cancel_attach,
- .post_attach = mem_cgroup_move_task,
#ifdef CONFIG_MEMCG_KMEM
.fork = mem_cgroup_fork,
.exit = mem_cgroup_exit,
#endif
.dfl_cftypes = memory_files,
+#ifdef CONFIG_MEMCG_V1
+ .can_attach = mem_cgroup_can_attach,
+ .cancel_attach = mem_cgroup_cancel_attach,
+ .post_attach = mem_cgroup_move_task,
.legacy_cftypes = mem_cgroup_legacy_files,
+#endif
.early_init = 0,
};

@@ -5618,7 +5620,9 @@ static int __init mem_cgroup_swap_init(void)
return 0;

WARN_ON(cgroup_add_dfl_cftypes(&memory_cgrp_subsys, swap_files));
+#ifdef CONFIG_MEMCG_V1
WARN_ON(cgroup_add_legacy_cftypes(&memory_cgrp_subsys, memsw_files));
+#endif
#if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_ZSWAP)
WARN_ON(cgroup_add_dfl_cftypes(&memory_cgrp_subsys, zswap_files));
#endif
--
2.43.2


2024-05-09 03:53:02

by Roman Gushchin

[permalink] [raw]
Subject: [PATCH rfc 9/9] mm: memcg: put cgroup v1-related members of task_struct under config option

Put the cgroup v1-related members of task_struct under the
CONFIG_MEMCG_V1 config option, so that users who have adopted cgroup v2
don't have to waste memory on fields that are never accessed.
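
For reference, a condensed view of the result, taken from the
include/linux/sched.h hunk below; with CONFIG_MEMCG_V1=n this saves roughly
16 bytes per task on 64-bit (a pointer, a gfp_t and an int, ignoring
padding effects):

struct task_struct {
	/* ... */
#ifdef CONFIG_MEMCG_V1
	struct mem_cgroup	*memcg_in_oom;
	gfp_t			memcg_oom_gfp_mask;
	int			memcg_oom_order;
#endif

#ifdef CONFIG_MEMCG
	/* Number of pages to reclaim on returning to userland: */
	unsigned int		memcg_nr_pages_over_high;
	/* ... */
#endif
	/* ... */
};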

Signed-off-by: Roman Gushchin <[email protected]>
---
include/linux/memcontrol.h | 11 ++++++++++-
include/linux/sched.h | 5 ++++-
2 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 4347d6889fa0..8005d749f8fc 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -944,6 +944,7 @@ void mem_cgroup_print_oom_context(struct mem_cgroup *memcg,

void mem_cgroup_print_oom_meminfo(struct mem_cgroup *memcg);

+#ifdef CONFIG_MEMCG_V1
static inline void mem_cgroup_enter_user_fault(void)
{
WARN_ON(current->in_user_fault);
@@ -961,9 +962,17 @@ static inline bool task_in_memcg_oom(struct task_struct *p)
return p->memcg_in_oom;
}

-#ifdef CONFIG_MEMCG_V1
bool mem_cgroup_oom_synchronize(bool wait);
#else
+static inline void mem_cgroup_enter_user_fault(void) {}
+
+static inline void mem_cgroup_exit_user_fault(void) {}
+
+static inline bool task_in_memcg_oom(struct task_struct *p)
+{
+ return false;
+}
+
static inline bool mem_cgroup_oom_synchronize(bool wait)
{
return false;
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 4118b3f959c3..2ecdeb7588e3 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1446,11 +1446,14 @@ struct task_struct {
unsigned int kcov_softirq;
#endif

-#ifdef CONFIG_MEMCG
+
+#ifdef CONFIG_MEMCG_V1
struct mem_cgroup *memcg_in_oom;
gfp_t memcg_oom_gfp_mask;
int memcg_oom_order;
+#endif

+#ifdef CONFIG_MEMCG
/* Number of pages to reclaim on returning to userland: */
unsigned int memcg_nr_pages_over_high;

--
2.43.2


2024-05-09 06:33:22

by Shakeel Butt

[permalink] [raw]
Subject: Re: [PATCH rfc 0/9] mm: memcg: separate legacy cgroup v1 code and put under config option

On Wed, May 08, 2024 at 08:41:29PM -0700, Roman Gushchin wrote:
> Cgroups v2 have been around for a while and many users have fully adopted them,
> so they never use cgroups v1 features and functionality. Yet they have to "pay"
> for the cgroup v1 support anyway:
> 1) the kernel binary contains useless cgroup v1 code,
> 2) some common structures like task_struct and mem_cgroup have never used
> cgroup v1-specific members,
> 3) some code paths have additional checks which are not needed.
>
> Cgroup v1's memory controller has a number of features that are not supported
> by cgroup v2 and their implementation is pretty much self contained.
> Most notably, these features are: soft limit reclaim, oom handling in userspace,
> complicated event notification system, charge migration.
>
> Cgroup v1-specific code in memcontrol.c is close to 4k lines in size and it's
> intervened with generic and cgroup v2-specific code. It's a burden on
> developers and maintainers.
>
> This patchset aims to solve these problems by:
> 1) moving cgroup v1-specific memcg code to the new mm/memcontrol-v1.c file,
> 2) putting definitions shared by memcontrol.c and memcontrol-v1.c into the
> mm/internal.h header
> 3) introducing the CONFIG_MEMCG_V1 config option, turned on by default
> 4) making memcontrol-v1.c to compile only if CONFIG_MEMCG_V1 is set
> 5) putting unused struct memory_cgroup and task_struct members under
> CONFIG_MEMCG_V1 as well.
>
> This is an RFC version, which is not 100% polished yet, so but it would be great
> to discuss and agree on the overall approach.
>
> Some open questions, opinions are appreciated:
> 1) I consider renaming non-static functions in memcontrol-v1.c to have
> mem_cgroup_v1_ prefix. Is this a good idea?
> 2) Do we want to extend it beyond the memory controller? Should
> 3) Is it better to use a new include/linux/memcontrol-v1.h instead of
> mm/internal.h? Or mm/memcontrol-v1.h.
>

Hi Roman,

A very timely and important topic and we should definitely talk about it
during LSFMM as well. I have been thinking about this problem for quite
some time and I am getting more and more convinced that we should aim to
completely deprecate memcg-v1.

More specifically:

1. What are the memcg-v1 features which have no alternative in memcg-v2
and are blockers for memcg-v1 users? (setting aside the cgroup v2
structural restrictions)

2. What are unused memcg-v1 features which we should start deprecating?

IMO we should systematically start deprecating memcg-v1 features and
start unblocking the users stuck on memcg-v1.

Now regarding the proposal in this series, I think it can be a first
step but should not give an impression that we are done. The only
concern I have is the potential for an "out of sight, out of mind"
situation with this change, but if we keep up the momentum of
deprecating memcg-v1
it should be fine.

I have CCed Greg and David from Google to get their opinion on what
memcg-v1 features are blockers for their memcg-v2 migration and if they
have concerns about the deprecation of memcg-v1 features.

Anyone else still on memcg-v1, please do provide your input.

thanks,
Shakeel

2024-05-09 14:22:22

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH rfc 0/9] mm: memcg: separate legacy cgroup v1 code and put under config option

On Wed, May 08, 2024 at 08:41:29PM -0700, Roman Gushchin wrote:
> Cgroups v2 have been around for a while and many users have fully adopted them,
> so they never use cgroups v1 features and functionality. Yet they have to "pay"
> for the cgroup v1 support anyway:
> 1) the kernel binary contains useless cgroup v1 code,
> 2) some common structures like task_struct and mem_cgroup have never used
> cgroup v1-specific members,
> 3) some code paths have additional checks which are not needed.
>
> Cgroup v1's memory controller has a number of features that are not supported
> by cgroup v2 and their implementation is pretty much self contained.
> Most notably, these features are: soft limit reclaim, oom handling in userspace,
> complicated event notification system, charge migration.
>
> Cgroup v1-specific code in memcontrol.c is close to 4k lines in size and it's
> intervened with generic and cgroup v2-specific code. It's a burden on
> developers and maintainers.

Great patchset. The moves look clean and straightforward to me at
first glance.

> This patchset aims to solve these problems by:
> 1) moving cgroup v1-specific memcg code to the new mm/memcontrol-v1.c file,

+1

> 2) putting definitions shared by memcontrol.c and memcontrol-v1.c into the
> mm/internal.h header

You proposed mm/memcontrol-v1.h below, IMO that's the best option.

> 3) introducing the CONFIG_MEMCG_V1 config option, turned on by default

+1

CONFIG_MEMCG1 should also work.

> 4) making memcontrol-v1.c to compile only if CONFIG_MEMCG_V1 is set

+1

> 5) putting unused struct memory_cgroup and task_struct members under
> CONFIG_MEMCG_V1 as well.

+1

>
> This is an RFC version, which is not 100% polished yet, so but it would be great
> to discuss and agree on the overall approach.
>
> Some open questions, opinions are appreciated:
> 1) I consider renaming non-static functions in memcontrol-v1.c to have
> mem_cgroup_v1_ prefix. Is this a good idea?

I think this would be great, to make it more obvious in memcontrol.c.

For core cgroup code, we used cgroup1_foo(). Maybe name them all
things like memcg1_update_tree() etc.? That's short and sweet while
sticking out visually pretty well.

> 2) Do we want to extend it beyond the memory controller? Should

Could you please elaborate? ^_^

> 3) Is it better to use a new include/linux/memcontrol-v1.h instead of
> mm/internal.h? Or mm/memcontrol-v1.h.

mm/memcontrol-v1.h sounds good to me.

> mm/memcontrol.c | 4121 ++++++++++++++++++++++---------------------------------------------------------------------------------------------------------------------------------

Lol, awesome.

2024-05-09 14:36:52

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH rfc 0/9] mm: memcg: separate legacy cgroup v1 code and put under config option

On Thu, May 09, 2024 at 10:22:10AM -0400, Johannes Weiner wrote:
> On Wed, May 08, 2024 at 08:41:29PM -0700, Roman Gushchin wrote:
> > 3) Is it better to use a new include/linux/memcontrol-v1.h instead of
> > mm/internal.h? Or mm/memcontrol-v1.h.
>
> mm/memcontrol-v1.h sounds good to me.

Argh, there is a folio_memcg_lock() callsite in fs/buffer.c. I suppose
include/linux/memcontrol-v1.h makes the most sense then.

2024-05-09 14:57:49

by Roman Gushchin

[permalink] [raw]
Subject: Re: [PATCH rfc 0/9] mm: memcg: separate legacy cgroup v1 code and put under config option

On Thu, May 09, 2024 at 10:36:35AM -0400, Johannes Weiner wrote:
> On Thu, May 09, 2024 at 10:22:10AM -0400, Johannes Weiner wrote:
> > On Wed, May 08, 2024 at 08:41:29PM -0700, Roman Gushchin wrote:
> > > 3) Is it better to use a new include/linux/memcontrol-v1.h instead of
> > > mm/internal.h? Or mm/memcontrol-v1.h.
> >
> > mm/memcontrol-v1.h sounds good to me.
>
> Argh, there is a folio_memcg_lock() callsite in fs/buffer.c. I suppose
> include/linux/memcontrol-v1.h makes the most sense then.

You mean put everything into include/linux/memcontrol-v1.h?
And functions from memcontrol.c used by memcontrol-v1.c into
include/linux/memcontrol.h?

It's an option I considered, but the downside is that we're "leaking"
a lot of internal definitions into the outside world, because
memcontrol.h is included everywhere.

So maybe mm/memcontrol-v1.h for definitions shared between v1 and v2
and keep exported functions in include/linux/memcontrol.h? There are
only a few of them.
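
To make that concrete, the split being discussed could look roughly like the
following - purely illustrative, reusing a couple of existing declarations as
examples rather than actual patch content:

/* mm/memcontrol-v1.h: v1 internals shared with memcontrol.c, e.g. */
void mem_cgroup_update_tree(struct mem_cgroup *memcg, int nid);
void memcg_oom_recover(struct mem_cgroup *memcg);

/* include/linux/memcontrol.h: the few v1 entry points used outside mm/, e.g. */
void folio_memcg_lock(struct folio *folio);
bool mem_cgroup_oom_synchronize(bool wait);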

Thanks!

2024-05-09 17:30:43

by Roman Gushchin

[permalink] [raw]
Subject: Re: [PATCH rfc 0/9] mm: memcg: separate legacy cgroup v1 code and put under config option

On Wed, May 08, 2024 at 11:33:07PM -0700, Shakeel Butt wrote:
> On Wed, May 08, 2024 at 08:41:29PM -0700, Roman Gushchin wrote:
> > Cgroups v2 have been around for a while and many users have fully adopted them,
> > so they never use cgroups v1 features and functionality. Yet they have to "pay"
> > for the cgroup v1 support anyway:
> > 1) the kernel binary contains useless cgroup v1 code,
> > 2) some common structures like task_struct and mem_cgroup have never used
> > cgroup v1-specific members,
> > 3) some code paths have additional checks which are not needed.
> >
> > Cgroup v1's memory controller has a number of features that are not supported
> > by cgroup v2 and their implementation is pretty much self contained.
> > Most notably, these features are: soft limit reclaim, oom handling in userspace,
> > complicated event notification system, charge migration.
> >
> > Cgroup v1-specific code in memcontrol.c is close to 4k lines in size and it's
> > intervened with generic and cgroup v2-specific code. It's a burden on
> > developers and maintainers.
> >
> > This patchset aims to solve these problems by:
> > 1) moving cgroup v1-specific memcg code to the new mm/memcontrol-v1.c file,
> > 2) putting definitions shared by memcontrol.c and memcontrol-v1.c into the
> > mm/internal.h header
> > 3) introducing the CONFIG_MEMCG_V1 config option, turned on by default
> > 4) making memcontrol-v1.c to compile only if CONFIG_MEMCG_V1 is set
> > 5) putting unused struct memory_cgroup and task_struct members under
> > CONFIG_MEMCG_V1 as well.
> >
> > This is an RFC version, which is not 100% polished yet, so but it would be great
> > to discuss and agree on the overall approach.
> >
> > Some open questions, opinions are appreciated:
> > 1) I consider renaming non-static functions in memcontrol-v1.c to have
> > mem_cgroup_v1_ prefix. Is this a good idea?
> > 2) Do we want to extend it beyond the memory controller? Should
> > 3) Is it better to use a new include/linux/memcontrol-v1.h instead of
> > mm/internal.h? Or mm/memcontrol-v1.h.
> >
>
> Hi Roman,
>
> A very timely and important topic and we should definitely talk about it
> during LSFMM as well. I have been thinking about this problem for quite
> sometime and I am getting more and more convinced that we should aim to
> completely deprecate memcg-v1.
>
> More specifically:
>
> 1. What are the memcg-v1 features which have no alternative in memcg-v2
> and are blocker for memcg-v1 users? (setting aside the cgroup v2
> structual restrictions)

I don't think there are any, except that there might be a certain cost
to migrate, so some companies might be reluctant to put in the
resources, since they don't see any immediate benefits.

>
> 2. What are unused memcg-v1 features which we should start deprecating?
>
> IMO we should systematically start deprecating memcg-v1 features and
> start unblocking the users stuck on memcg-v1.

I'm not sure we want to deprecate them one by one - it's a lot of work,
and maybe we can deprecate them all at once instead.

I think the only feature we might want to deprecate separately is
charge migration. It's the most annoying feature, as it requires a lot
of extra synchronization, which could be dropped otherwise, so it
complicates a lot of things. The other features are more or less
self-contained.

>
> Now regarding the proposal in this series, I think it can be a first
> step but should not give an impression that we are done.

Yeah, it's really only a first step.

> The only
> concern I have is the potential of "out of sight, out of mind" situation
> with this change but if we keep the momentum of deprecation of memcg-v1
> it should be fine.

My rough plan here:
1) move it out to a separate file and put under a config option, default on
2) clean up all remaining small bits here and there
.. < wait a year >
3) flip the config option to be off by default
.. < wait another year or two >
4) drop the code entirely

>
> I have CCed Greg and David from Google to get their opinion on what
> memcg-v1 features are blocker for their memcg-v2 migration and if they
> have concern in deprecation of memcg-v1 features.

Thank you!

2024-05-10 02:59:33

by David Rientjes

[permalink] [raw]
Subject: Re: [PATCH rfc 0/9] mm: memcg: separate legacy cgroup v1 code and put under config option

On Wed, 8 May 2024, Shakeel Butt wrote:

> Hi Roman,
>
> A very timely and important topic and we should definitely talk about it
> during LSFMM as well. I have been thinking about this problem for quite
> sometime and I am getting more and more convinced that we should aim to
> completely deprecate memcg-v1.
>

I think this would be a very worthwhile discussion at LSF/MM. I'm not
sure if it would be too late for someone to make a formal proposal for
it to be included in the schedule; Michal would know if there is an
opportunity.

I say that in light of
https://lore.kernel.org/bpf/[email protected]/T/#mb6c21b09543c434dd85e718a8ecf5ca6485e6d07
as well for the whole cgroup v1 -> v2 transition.

Chris, now cc'd, would know best about all of the dependencies that Google
has for memcg specifically.

> More specifically:
>
> 1. What are the memcg-v1 features which have no alternative in memcg-v2
> and are blocker for memcg-v1 users? (setting aside the cgroup v2
> structual restrictions)
>
> 2. What are unused memcg-v1 features which we should start deprecating?
>
> IMO we should systematically start deprecating memcg-v1 features and
> start unblocking the users stuck on memcg-v1.
>
> Now regarding the proposal in this series, I think it can be a first
> step but should not give an impression that we are done. The only
> concern I have is the potential of "out of sight, out of mind" situation
> with this change but if we keep the momentum of deprecation of memcg-v1
> it should be fine.
>
> I have CCed Greg and David from Google to get their opinion on what
> memcg-v1 features are blocker for their memcg-v2 migration and if they
> have concern in deprecation of memcg-v1 features.
>
> Anyone else still on memcg-v1, please do provide your input.
>
> thanks,
> Shakeel
>

2024-05-10 07:11:27

by Chris Li

[permalink] [raw]
Subject: Re: [PATCH rfc 0/9] mm: memcg: separate legacy cgroup v1 code and put under config option

On Thu, May 9, 2024 at 7:59 PM David Rientjes <[email protected]> wrote:
>
> On Wed, 8 May 2024, Shakeel Butt wrote:
>
> > Hi Roman,
> >
> > A very timely and important topic and we should definitely talk about it
> > during LSFMM as well. I have been thinking about this problem for quite
> > sometime and I am getting more and more convinced that we should aim to
> > completely deprecate memcg-v1.
> >
>
> I think this would be a very worthwhile discussion at LSF/MM, I'm not sure
> if it would be too late for someone to make a formal proposal for it to be
> included in the schedule. Michal would know if there is a opportunity.
>
> I say that in light of
> https://lore.kernel.org/bpf/[email protected]/T/#mb6c21b09543c434dd85e718a8ecf5ca6485e6d07
> as well for the whole cgroup v1 -> v2 transition.
>
> Chris, now cc'd, would know best about all of the dependencies that Google
> has for memcg specifically.

Thanks David,

Yes, I am very interested in that cgroup v1 -> v2 transition discussion.
>
> > More specifically:
> >
> > 1. What are the memcg-v1 features which have no alternative in memcg-v2
> > and are blocker for memcg-v1 users? (setting aside the cgroup v2
> > structual restrictions)

In the list mentioned by Roman: "soft limit reclaim, oom handling in userspace,
complicated event notification system, charge migration."

The "oom.control" and leak of user space oom control is a big one for google.
Some test frameworks also use "memory.force_empty".
Soft limit reclaim and charge migration are also used.

There is also the combined "memsw" limit enforcement. Google has an
internal workaround for v2, but it would be good if upstream could
support it directly.

BTW, I know you are not asking about the "cgroup v2 structural
restrictions", but not being able to give two cgroup controllers
different sets of processes is a bit too restrictive.

That is what I recall right now; I might be missing some small odd items.

Anyway, glad to join the discussion if there is a session.

Chris


> >
> > 2. What are unused memcg-v1 features which we should start deprecating?
> >
> > IMO we should systematically start deprecating memcg-v1 features and
> > start unblocking the users stuck on memcg-v1.
> >
> > Now regarding the proposal in this series, I think it can be a first
> > step but should not give an impression that we are done. The only
> > concern I have is the potential of "out of sight, out of mind" situation
> > with this change but if we keep the momentum of deprecation of memcg-v1
> > it should be fine.
> >
> > I have CCed Greg and David from Google to get their opinion on what
> > memcg-v1 features are blocker for their memcg-v2 migration and if they
> > have concern in deprecation of memcg-v1 features.
> >
> > Anyone else still on memcg-v1, please do provide your input.
> >
> > thanks,
> > Shakeel
> >

2024-05-10 08:10:22

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH rfc 0/9] mm: memcg: separate legacy cgroup v1 code and put under config option

On Thu 09-05-24 19:59:19, David Rientjes wrote:
> On Wed, 8 May 2024, Shakeel Butt wrote:
>
> > Hi Roman,
> >
> > A very timely and important topic and we should definitely talk about it
> > during LSFMM as well. I have been thinking about this problem for quite
> > sometime and I am getting more and more convinced that we should aim to
> > completely deprecate memcg-v1.
> >
>
> I think this would be a very worthwhile discussion at LSF/MM, I'm not sure
> if it would be too late for someone to make a formal proposal for it to be
> included in the schedule. Michal would know if there is a opportunity.

yes, I think we can and should have this discussion. I will put that on
the schedule. I will reference this email thread as a topic proposal
with Shakeel and Roman to lead the session.
--
Michal Hocko
SUSE Labs

2024-05-10 13:27:12

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH rfc 6/9] mm: memcg: move cgroup v1 oom handling code into memcontrol-v1.c

On Wed 08-05-24 20:41:35, Roman Gushchin wrote:
[...]
> @@ -1747,106 +1623,14 @@ static bool mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order)
>
> memcg_memory_event(memcg, MEMCG_OOM);
>
> - /*
> - * We are in the middle of the charge context here, so we
> - * don't want to block when potentially sitting on a callstack
> - * that holds all kinds of filesystem and mm locks.
> - *
> - * cgroup1 allows disabling the OOM killer and waiting for outside
> - * handling until the charge can succeed; remember the context and put
> - * the task to sleep at the end of the page fault when all locks are
> - * released.
> - *
> - * On the other hand, in-kernel OOM killer allows for an async victim
> - * memory reclaim (oom_reaper) and that means that we are not solely
> - * relying on the oom victim to make a forward progress and we can
> - * invoke the oom killer here.
> - *
> - * Please note that mem_cgroup_out_of_memory might fail to find a
> - * victim and then we have to bail out from the charge path.
> - */
> - if (READ_ONCE(memcg->oom_kill_disable)) {
> - if (current->in_user_fault) {
> - css_get(&memcg->css);
> - current->memcg_in_oom = memcg;
> - current->memcg_oom_gfp_mask = mask;
> - current->memcg_oom_order = order;
> - }
> + if (!mem_cgroup_v1_oom_prepare(memcg, mask, order, &locked))
> return false;
> - }
> -
> - mem_cgroup_mark_under_oom(memcg);
> -
> - locked = mem_cgroup_oom_trylock(memcg);

This really confused me, because it looks like the oom locking is
removed for v2, but this is not the case:
mem_cgroup_v1_oom_prepare() is not really v1-only code - in other
words, this is not going to be just return false for CONFIG_MEMCG_V1=n.

It makes sense to move the userspace oom handling out to the v1 file. I
would keep mem_cgroup_mark_under_oom here. I am not sure about the oom
locking thing because I think we can make it v1 only. For v2 I guess we
can go without this locking as the oom path is already locked and it
implements overkilling prevention (oom_evaluate_task) as it walks all
processes in the oom hierarchy.

> -
> - if (locked)
> - mem_cgroup_oom_notify(memcg);
> -
> - mem_cgroup_unmark_under_oom(memcg);
> ret = mem_cgroup_out_of_memory(memcg, mask, order);
> -
> - if (locked)
> - mem_cgroup_oom_unlock(memcg);
> + mem_cgroup_v1_oom_finish(memcg, &locked);
>
> return ret;
> }

--
Michal Hocko
SUSE Labs

2024-05-10 13:33:47

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH rfc 0/9] mm: memcg: separate legacy cgroup v1 code and put under config option

On Wed 08-05-24 20:41:29, Roman Gushchin wrote:
> Cgroups v2 have been around for a while and many users have fully adopted them,
> so they never use cgroups v1 features and functionality. Yet they have to "pay"
> for the cgroup v1 support anyway:
> 1) the kernel binary contains useless cgroup v1 code,
> 2) some common structures like task_struct and mem_cgroup have never used
> cgroup v1-specific members,
> 3) some code paths have additional checks which are not needed.
>
> Cgroup v1's memory controller has a number of features that are not supported
> by cgroup v2 and their implementation is pretty much self contained.
> Most notably, these features are: soft limit reclaim, oom handling in userspace,
> complicated event notification system, charge migration.
>
> Cgroup v1-specific code in memcontrol.c is close to 4k lines in size and it's
> intervened with generic and cgroup v2-specific code. It's a burden on
> developers and maintainers.
>
> This patchset aims to solve these problems by:
> 1) moving cgroup v1-specific memcg code to the new mm/memcontrol-v1.c file,
> 2) putting definitions shared by memcontrol.c and memcontrol-v1.c into the
> mm/internal.h header
> 3) introducing the CONFIG_MEMCG_V1 config option, turned on by default
> 4) making memcontrol-v1.c to compile only if CONFIG_MEMCG_V1 is set
> 5) putting unused struct memory_cgroup and task_struct members under
> CONFIG_MEMCG_V1 as well.

This makes sense, and I have to admit I didn't think this was so much
code to move. It will make the code base much easier to follow. I do
not think we can drop that code anytime soon, as there is still quite a
lot of use of v1 out there. From my experience there is no good reason
for many of them other than inertia, and those are just waiting for
somebody to move them to v2. There are some workloads which depend on
v1-only features, and we should discuss what to do about those.
--
Michal Hocko
SUSE Labs

2024-05-10 14:21:30

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH rfc 0/9] mm: memcg: separate legacy cgroup v1 code and put under config option

On Thu, May 09, 2024 at 07:57:30AM -0700, Roman Gushchin wrote:
> So maybe mm/memcontrol-v1.h for definitions shared between v1 and v2
> and keep exported functions in include/linux/memcontrol.h? There are
> only few of them.

That sounds best to me.

2024-05-16 03:36:45

by Yafang Shao

[permalink] [raw]
Subject: Re: [PATCH rfc 0/9] mm: memcg: separate legacy cgroup v1 code and put under config option

On Thu, May 9, 2024 at 2:33 PM Shakeel Butt <[email protected]> wrote:
>
> On Wed, May 08, 2024 at 08:41:29PM -0700, Roman Gushchin wrote:
> > Cgroups v2 have been around for a while and many users have fully adopted them,
> > so they never use cgroups v1 features and functionality. Yet they have to "pay"
> > for the cgroup v1 support anyway:
> > 1) the kernel binary contains useless cgroup v1 code,
> > 2) some common structures like task_struct and mem_cgroup have never used
> > cgroup v1-specific members,
> > 3) some code paths have additional checks which are not needed.
> >
> > Cgroup v1's memory controller has a number of features that are not supported
> > by cgroup v2 and their implementation is pretty much self contained.
> > Most notably, these features are: soft limit reclaim, oom handling in userspace,
> > complicated event notification system, charge migration.
> >
> > Cgroup v1-specific code in memcontrol.c is close to 4k lines in size and it's
> > intervened with generic and cgroup v2-specific code. It's a burden on
> > developers and maintainers.
> >
> > This patchset aims to solve these problems by:
> > 1) moving cgroup v1-specific memcg code to the new mm/memcontrol-v1.c file,
> > 2) putting definitions shared by memcontrol.c and memcontrol-v1.c into the
> > mm/internal.h header
> > 3) introducing the CONFIG_MEMCG_V1 config option, turned on by default
> > 4) making memcontrol-v1.c to compile only if CONFIG_MEMCG_V1 is set
> > 5) putting unused struct memory_cgroup and task_struct members under
> > CONFIG_MEMCG_V1 as well.
> >
> > This is an RFC version, which is not 100% polished yet, so but it would be great
> > to discuss and agree on the overall approach.
> >
> > Some open questions, opinions are appreciated:
> > 1) I consider renaming non-static functions in memcontrol-v1.c to have
> > mem_cgroup_v1_ prefix. Is this a good idea?
> > 2) Do we want to extend it beyond the memory controller? Should
> > 3) Is it better to use a new include/linux/memcontrol-v1.h instead of
> > mm/internal.h? Or mm/memcontrol-v1.h.
> >
>
> Hi Roman,
>
> A very timely and important topic and we should definitely talk about it
> during LSFMM as well. I have been thinking about this problem for quite
> sometime and I am getting more and more convinced that we should aim to
> completely deprecate memcg-v1.
>
> More specifically:
>
> 1. What are the memcg-v1 features which have no alternative in memcg-v2
> and are blocker for memcg-v1 users? (setting aside the cgroup v2
> structual restrictions)
>
> 2. What are unused memcg-v1 features which we should start deprecating?
>
> IMO we should systematically start deprecating memcg-v1 features and
> start unblocking the users stuck on memcg-v1.
>
> Now regarding the proposal in this series, I think it can be a first
> step but should not give an impression that we are done. The only
> concern I have is the potential of "out of sight, out of mind" situation
> with this change but if we keep the momentum of deprecation of memcg-v1
> it should be fine.
>
> I have CCed Greg and David from Google to get their opinion on what
> memcg-v1 features are blocker for their memcg-v2 migration and if they
> have concern in deprecation of memcg-v1 features.
>
> Anyone else still on memcg-v1, please do provide your input.

Hi Shakeel,

Hopefully I'm not too late. We are currently using memcg v1.

One specific feature we rely on in v1 is skmem accounting. In v1, we
can account for TCP memory usage without charging it to the memcg, which
is useful for monitoring the TCP memory usage generated by tasks running
in a container. However, in memcg v2, monitoring TCP memory requires
charging it to the container, which can easily cause OOM issues. It
would be better if we could monitor skmem usage in memcg v2 without
charging it, allowing us to account for it without the risk of
triggering OOM conditions.


--
Regards
Yafang

2024-05-16 17:30:02

by Roman Gushchin

[permalink] [raw]
Subject: Re: [PATCH rfc 0/9] mm: memcg: separate legacy cgroup v1 code and put under config option

On Thu, May 16, 2024 at 11:35:57AM +0800, Yafang Shao wrote:
> On Thu, May 9, 2024 at 2:33 PM Shakeel Butt <[email protected]> wrote:
> >
> > On Wed, May 08, 2024 at 08:41:29PM -0700, Roman Gushchin wrote:
> > > Cgroups v2 have been around for a while and many users have fully adopted them,
> > > so they never use cgroups v1 features and functionality. Yet they have to "pay"
> > > for the cgroup v1 support anyway:
> > > 1) the kernel binary contains useless cgroup v1 code,
> > > 2) some common structures like task_struct and mem_cgroup have never used
> > > cgroup v1-specific members,
> > > 3) some code paths have additional checks which are not needed.
> > >
> > > Cgroup v1's memory controller has a number of features that are not supported
> > > by cgroup v2 and their implementation is pretty much self contained.
> > > Most notably, these features are: soft limit reclaim, oom handling in userspace,
> > > complicated event notification system, charge migration.
> > >
> > > Cgroup v1-specific code in memcontrol.c is close to 4k lines in size and it's
> > > intervened with generic and cgroup v2-specific code. It's a burden on
> > > developers and maintainers.
> > >
> > > This patchset aims to solve these problems by:
> > > 1) moving cgroup v1-specific memcg code to the new mm/memcontrol-v1.c file,
> > > 2) putting definitions shared by memcontrol.c and memcontrol-v1.c into the
> > > mm/internal.h header
> > > 3) introducing the CONFIG_MEMCG_V1 config option, turned on by default
> > > 4) making memcontrol-v1.c to compile only if CONFIG_MEMCG_V1 is set
> > > 5) putting unused struct memory_cgroup and task_struct members under
> > > CONFIG_MEMCG_V1 as well.
> > >
> > > This is an RFC version, which is not 100% polished yet, so but it would be great
> > > to discuss and agree on the overall approach.
> > >
> > > Some open questions, opinions are appreciated:
> > > 1) I consider renaming non-static functions in memcontrol-v1.c to have
> > > mem_cgroup_v1_ prefix. Is this a good idea?
> > > 2) Do we want to extend it beyond the memory controller? Should
> > > 3) Is it better to use a new include/linux/memcontrol-v1.h instead of
> > > mm/internal.h? Or mm/memcontrol-v1.h.
> > >
> >
> > Hi Roman,
> >
> > A very timely and important topic and we should definitely talk about it
> > during LSFMM as well. I have been thinking about this problem for quite
> > sometime and I am getting more and more convinced that we should aim to
> > completely deprecate memcg-v1.
> >
> > More specifically:
> >
> > 1. What are the memcg-v1 features which have no alternative in memcg-v2
> > and are blocker for memcg-v1 users? (setting aside the cgroup v2
> > structual restrictions)
> >
> > 2. What are unused memcg-v1 features which we should start deprecating?
> >
> > IMO we should systematically start deprecating memcg-v1 features and
> > start unblocking the users stuck on memcg-v1.
> >
> > Now regarding the proposal in this series, I think it can be a first
> > step but should not give an impression that we are done. The only
> > concern I have is the potential of "out of sight, out of mind" situation
> > with this change but if we keep the momentum of deprecation of memcg-v1
> > it should be fine.
> >
> > I have CCed Greg and David from Google to get their opinion on what
> > memcg-v1 features are blocker for their memcg-v2 migration and if they
> > have concern in deprecation of memcg-v1 features.
> >
> > Anyone else still on memcg-v1, please do provide your input.
>
> Hi Shakeel,
>
> Hopefully I'm not too late. We are currently using memcg v1.
>
> One specific feature we rely on in v1 is skmem accounting. In v1, we
> account for TCP memory usage without charging it to memcg v1, which is
> useful for monitoring the TCP memory usage generated by tasks running
> in a container. However, in memcg v2, monitoring TCP memory requires
> charging it to the container, which can easily cause OOM issues. It
> would be better if we could monitor skmem usage without charging it in
> the memcg v2, allowing us to account for it without the risk of
> triggering OOM conditions.

Hi Yafang,

the data itself is available on cgroup v2 in memory.stat:sock; however,
you're right, it's charged on par with other types of memory. That was
one of the main principles of cgroup v2's memory controller, so I don't
think it can be changed.

So the feature you need is not skmem accounting, but something quite
opposite :)
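
For pure monitoring, the memory.stat:sock counter can be read from
userspace with no charging-related tuning; a minimal sketch (the cgroup
path is just an example):

#include <stdio.h>
#include <string.h>

/* Print the "sock" entry (in bytes) of a cgroup v2 memory.stat file. */
int main(void)
{
        const char *path = "/sys/fs/cgroup/mycontainer/memory.stat"; /* example path */
        char key[64];
        unsigned long long val;
        FILE *f = fopen(path, "r");

        if (!f) {
                perror("fopen");
                return 1;
        }
        while (fscanf(f, "%63s %llu", key, &val) == 2) {
                if (!strcmp(key, "sock")) {
                        printf("sock %llu bytes\n", val);
                        break;
                }
        }
        fclose(f);
        return 0;
}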

The question I have here: what makes socket memory different here?

Is it something specific to your setup (e.g. you mostly use memory.max
to protect against memory leaks in the userspace code, but socket memory
spikes are always caused by external traffic and are legit) or we have
more fundamental problems with the socket memory handling, e.g. we can't
effectively reclaim it under the memory pressure?

In the first case you can maintain a ~2-line non-upstream patch which
disables the charging while keeping the statistics - it's not perfect, but
likely the best option here. In the second case we need to collectively
fix it for cgroup v2.
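
A sketch of what such a tweak could look like, assuming the v2 socket
charging still goes through mem_cgroup_charge_skmem() and the MEMCG_SOCK
counter (illustrative only; the uncharge side would need the matching
change in mem_cgroup_uncharge_skmem()):

/*
 * Non-upstream sketch: keep memory.stat:sock up to date, but never
 * actually charge or fail socket memory on cgroup v2.
 */
bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages,
                             gfp_t gfp_mask)
{
        /* v1 (memory.kmem.tcp.*) path omitted from this sketch */
        mod_memcg_state(memcg, MEMCG_SOCK, nr_pages);   /* statistics only */
        return true;                                    /* never charge, never fail */
}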

Thanks!

2024-05-17 02:30:16

by Yafang Shao

[permalink] [raw]
Subject: Re: [PATCH rfc 0/9] mm: memcg: separate legacy cgroup v1 code and put under config option

On Fri, May 17, 2024 at 1:29 AM Roman Gushchin <[email protected]> wrote:
>
> On Thu, May 16, 2024 at 11:35:57AM +0800, Yafang Shao wrote:
> > On Thu, May 9, 2024 at 2:33 PM Shakeel Butt <shakeel.butt@linuxdev> wrote:
> > >
> > > On Wed, May 08, 2024 at 08:41:29PM -0700, Roman Gushchin wrote:
> > > > Cgroups v2 have been around for a while and many users have fully adopted them,
> > > > so they never use cgroups v1 features and functionality. Yet they have to "pay"
> > > > for the cgroup v1 support anyway:
> > > > 1) the kernel binary contains useless cgroup v1 code,
> > > > 2) some common structures like task_struct and mem_cgroup have never used
> > > > cgroup v1-specific members,
> > > > 3) some code paths have additional checks which are not needed.
> > > >
> > > > Cgroup v1's memory controller has a number of features that are not supported
> > > > by cgroup v2 and their implementation is pretty much self contained.
> > > > Most notably, these features are: soft limit reclaim, oom handling in userspace,
> > > > complicated event notification system, charge migration.
> > > >
> > > > Cgroup v1-specific code in memcontrol.c is close to 4k lines in size and it's
> > > > intervened with generic and cgroup v2-specific code. It's a burden on
> > > > developers and maintainers.
> > > >
> > > > This patchset aims to solve these problems by:
> > > > 1) moving cgroup v1-specific memcg code to the new mm/memcontrol-v1.c file,
> > > > 2) putting definitions shared by memcontrol.c and memcontrol-v1.c into the
> > > > mm/internal.h header
> > > > 3) introducing the CONFIG_MEMCG_V1 config option, turned on by default
> > > > 4) making memcontrol-v1.c to compile only if CONFIG_MEMCG_V1 is set
> > > > 5) putting unused struct memory_cgroup and task_struct members under
> > > > CONFIG_MEMCG_V1 as well.
> > > >
> > > > This is an RFC version, which is not 100% polished yet, so but it would be great
> > > > to discuss and agree on the overall approach.
> > > >
> > > > Some open questions, opinions are appreciated:
> > > > 1) I consider renaming non-static functions in memcontrol-v1.c to have
> > > > mem_cgroup_v1_ prefix. Is this a good idea?
> > > > 2) Do we want to extend it beyond the memory controller? Should
> > > > 3) Is it better to use a new include/linux/memcontrol-v1.h instead of
> > > > mm/internal.h? Or mm/memcontrol-v1.h.
> > > >
> > >
> > > Hi Roman,
> > >
> > > A very timely and important topic and we should definitely talk about it
> > > during LSFMM as well. I have been thinking about this problem for quite
> > > sometime and I am getting more and more convinced that we should aim to
> > > completely deprecate memcg-v1.
> > >
> > > More specifically:
> > >
> > > 1. What are the memcg-v1 features which have no alternative in memcg-v2
> > > and are blocker for memcg-v1 users? (setting aside the cgroup v2
> > > structual restrictions)
> > >
> > > 2. What are unused memcg-v1 features which we should start deprecating?
> > >
> > > IMO we should systematically start deprecating memcg-v1 features and
> > > start unblocking the users stuck on memcg-v1.
> > >
> > > Now regarding the proposal in this series, I think it can be a first
> > > step but should not give an impression that we are done. The only
> > > concern I have is the potential of "out of sight, out of mind" situation
> > > with this change but if we keep the momentum of deprecation of memcg-v1
> > > it should be fine.
> > >
> > > I have CCed Greg and David from Google to get their opinion on what
> > > memcg-v1 features are blocker for their memcg-v2 migration and if they
> > > have concern in deprecation of memcg-v1 features.
> > >
> > > Anyone else still on memcg-v1, please do provide your input.
> >
> > Hi Shakeel,
> >
> > Hopefully I'm not too late. We are currently using memcg v1.
> >
> > One specific feature we rely on in v1 is skmem accounting. In v1, we
> > account for TCP memory usage without charging it to memcg v1, which is
> > useful for monitoring the TCP memory usage generated by tasks running
> > in a container. However, in memcg v2, monitoring TCP memory requires
> > charging it to the container, which can easily cause OOM issues. It
> > would be better if we could monitor skmem usage without charging it in
> > the memcg v2, allowing us to account for it without the risk of
> > triggering OOM conditions.
>
> Hi Yafang,
>
> the data itself is available on cgroup v2 in memory.stat:sock, however
> you're right, it's charged on pair with other types of memory. It was
> one of the main principles of cgroup v2's memory controller, so I don't
> think it can be changed.
>
> So the feature you need is not skmem accounting, but something quite
> opposite :)
>
> The question I have here: what makes socket memory different here?
>
> Is it something specific to your setup (e.g. you mostly use memory.max
> to protect against memory leaks in the userspace code, but socket memory
> spikes are always caused by external traffic and are legit) or we have
> more fundamental problems with the socket memory handling, e.g. we can't
> effectively reclaim it under the memory pressure?

It is the first case.

>
> In the first case you can maintain a ~2-lines non-upstream patch which will
> disable the charging while maintaining statistics - it's not a perfect, but
> likely the best option here. In the second case we need collectively fix it
> for cgroup v2.
>

Thank you for your advice. Currently, we do not have any immediate
plans to migrate to cgroup v2. If we are required to use cgroup v2 in
the future, we will need to maintain non-upstream patches.

By the way, is there any reason we cannot keep this behavior
consistent with memcg v1 in the upstream kernel? That would save us
from having to maintain it locally.

--
Regards
Yafang

2024-05-18 02:14:17

by Roman Gushchin

[permalink] [raw]
Subject: Re: [PATCH rfc 0/9] mm: memcg: separate legacy cgroup v1 code and put under config option

On Fri, May 17, 2024 at 10:21:01AM +0800, Yafang Shao wrote:
> On Fri, May 17, 2024 at 1:29 AM Roman Gushchin <[email protected]> wrote:
> >
> > On Thu, May 16, 2024 at 11:35:57AM +0800, Yafang Shao wrote:
> > > On Thu, May 9, 2024 at 2:33 PM Shakeel Butt <[email protected]> wrote:
> > > >
> > > > On Wed, May 08, 2024 at 08:41:29PM -0700, Roman Gushchin wrote:
> > > > > Cgroups v2 have been around for a while and many users have fully adopted them,
> > > > > so they never use cgroups v1 features and functionality. Yet they have to "pay"
> > > > > for the cgroup v1 support anyway:
> > > > > 1) the kernel binary contains useless cgroup v1 code,
> > > > > 2) some common structures like task_struct and mem_cgroup have never used
> > > > > cgroup v1-specific members,
> > > > > 3) some code paths have additional checks which are not needed.
> > > > >
> > > > > Cgroup v1's memory controller has a number of features that are not supported
> > > > > by cgroup v2 and their implementation is pretty much self contained.
> > > > > Most notably, these features are: soft limit reclaim, oom handling in userspace,
> > > > > complicated event notification system, charge migration.
> > > > >
> > > > > Cgroup v1-specific code in memcontrol.c is close to 4k lines in size and it's
> > > > > intervened with generic and cgroup v2-specific code. It's a burden on
> > > > > developers and maintainers.
> > > > >
> > > > > This patchset aims to solve these problems by:
> > > > > 1) moving cgroup v1-specific memcg code to the new mm/memcontrol-v1.c file,
> > > > > 2) putting definitions shared by memcontrol.c and memcontrol-v1.c into the
> > > > > mm/internal.h header
> > > > > 3) introducing the CONFIG_MEMCG_V1 config option, turned on by default
> > > > > 4) making memcontrol-v1.c to compile only if CONFIG_MEMCG_V1 is set
> > > > > 5) putting unused struct memory_cgroup and task_struct members under
> > > > > CONFIG_MEMCG_V1 as well.
> > > > >
> > > > > This is an RFC version, which is not 100% polished yet, so but it would be great
> > > > > to discuss and agree on the overall approach.
> > > > >
> > > > > Some open questions, opinions are appreciated:
> > > > > 1) I consider renaming non-static functions in memcontrol-v1.c to have
> > > > > mem_cgroup_v1_ prefix. Is this a good idea?
> > > > > 2) Do we want to extend it beyond the memory controller? Should
> > > > > 3) Is it better to use a new include/linux/memcontrol-v1.h instead of
> > > > > mm/internal.h? Or mm/memcontrol-v1.h.
> > > > >
> > > >
> > > > Hi Roman,
> > > >
> > > > A very timely and important topic and we should definitely talk about it
> > > > during LSFMM as well. I have been thinking about this problem for quite
> > > > sometime and I am getting more and more convinced that we should aim to
> > > > completely deprecate memcg-v1.
> > > >
> > > > More specifically:
> > > >
> > > > 1. What are the memcg-v1 features which have no alternative in memcg-v2
> > > > and are blocker for memcg-v1 users? (setting aside the cgroup v2
> > > > structual restrictions)
> > > >
> > > > 2. What are unused memcg-v1 features which we should start deprecating?
> > > >
> > > > IMO we should systematically start deprecating memcg-v1 features and
> > > > start unblocking the users stuck on memcg-v1.
> > > >
> > > > Now regarding the proposal in this series, I think it can be a first
> > > > step but should not give an impression that we are done. The only
> > > > concern I have is the potential of "out of sight, out of mind" situation
> > > > with this change but if we keep the momentum of deprecation of memcg-v1
> > > > it should be fine.
> > > >
> > > > I have CCed Greg and David from Google to get their opinion on what
> > > > memcg-v1 features are blocker for their memcg-v2 migration and if they
> > > > have concern in deprecation of memcg-v1 features.
> > > >
> > > > Anyone else still on memcg-v1, please do provide your input.
> > >
> > > Hi Shakeel,
> > >
> > > Hopefully I'm not too late. We are currently using memcg v1.
> > >
> > > One specific feature we rely on in v1 is skmem accounting. In v1, we
> > > account for TCP memory usage without charging it to memcg v1, which is
> > > useful for monitoring the TCP memory usage generated by tasks running
> > > in a container. However, in memcg v2, monitoring TCP memory requires
> > > charging it to the container, which can easily cause OOM issues. It
> > > would be better if we could monitor skmem usage without charging it in
> > > the memcg v2, allowing us to account for it without the risk of
> > > triggering OOM conditions.
> >
> > Hi Yafang,
> >
> > the data itself is available on cgroup v2 in memory.stat:sock, however
> > you're right, it's charged on pair with other types of memory. It was
> > one of the main principles of cgroup v2's memory controller, so I don't
> > think it can be changed.
> >
> > So the feature you need is not skmem accounting, but something quite
> > opposite :)
> >
> > The question I have here: what makes socket memory different here?
> >
> > Is it something specific to your setup (e.g. you mostly use memory.max
> > to protect against memory leaks in the userspace code, but socket memory
> > spikes are always caused by external traffic and are legit) or we have
> > more fundamental problems with the socket memory handling, e.g. we can't
> > effectively reclaim it under the memory pressure?
>
> It is the first case.
>
> >
> > In the first case you can maintain a ~2-lines non-upstream patch which will
> > disable the charging while maintaining statistics - it's not a perfect, but
> > likely the best option here. In the second case we need collectively fix it
> > for cgroup v2.
> >
>
> Thank you for your advice. Currently, we do not have any immediate
> plans to migrate to cgroup v2. If we are required to use cgroup v2 in
> the future, we will need to maintain non-upstream patches.
>
> By the way, is there any reason we cannot keep this behavior
> consistent with memcg v1 in the upstream kernel? That would save us
> from having to maintain it locally.

The idea of handling various types of memory independently hasn't worked well
for most users: it makes the configuration trickier and more fragile.
It's also more expensive in terms of accounting overhead.

The tcpmem accounting is btw quite expensive by itself, so by switching to
cgroup v2 you might see (depending on your traffic and cpu load) some
nice performance benefits.

Thanks!

2024-05-18 07:33:19

by Shakeel Butt

[permalink] [raw]
Subject: Re: [PATCH rfc 0/9] mm: memcg: separate legacy cgroup v1 code and put under config option

On Thu, May 16, 2024 at 11:35:57AM +0800, Yafang Shao wrote:
> On Thu, May 9, 2024 at 2:33 PM Shakeel Butt <[email protected]> wrote:
> >
>
[...]
> Hi Shakeel,
>
> Hopefully I'm not too late. We are currently using memcg v1.
>
> One specific feature we rely on in v1 is skmem accounting. In v1, we
> account for TCP memory usage without charging it to memcg v1, which is
> useful for monitoring the TCP memory usage generated by tasks running
> in a container. However, in memcg v2, monitoring TCP memory requires
> charging it to the container, which can easily cause OOM issues. It
> would be better if we could monitor skmem usage without charging it in
> the memcg v2, allowing us to account for it without the risk of
> triggering OOM conditions.
>

Hi Yafang,

No worries. From what I understand, you are not really using v1's skmem
charging but just the network memory usage stats, and you are worried
that charging network memory to cgroup memory may cause OOMs. Is that
correct? Have you tried charging network memory to cgroup memory before
and seen OOMs? If so, I would really like to see the OOM reports.

I have two examples where v2's skmem charging is working fine in
production, namely Google and Meta. Google is still on v1, but for skmem
charging they have moved to v2 semantics. Actually, I have another
report from Cloudflare [0] where the tcp throttling mechanism for v2's
tcp memory accounting is too conservative for their production traffic.

Anyway, this just means that we need a more flexible way to provide
and enforce semantics for tcp memory pressure with a decent default
behavior. I will follow up on this separately.

[0] https://lore.kernel.org/lkml/CABWYdi0G7cyNFbndM-ELTDAR3x4Ngm0AehEp5aP0tfNkXUE+Uw@mail.gmail.com/

thanks,
Shakeel

2024-05-20 02:15:25

by Yafang Shao

[permalink] [raw]
Subject: Re: [PATCH rfc 0/9] mm: memcg: separate legacy cgroup v1 code and put under config option

On Sat, May 18, 2024 at 3:33 PM Shakeel Butt <[email protected]> wrote:
>
> On Thu, May 16, 2024 at 11:35:57AM +0800, Yafang Shao wrote:
> > On Thu, May 9, 2024 at 2:33 PM Shakeel Butt <shakeel.butt@linuxdev> wrote:
> > >
> >
> [...]
> > Hi Shakeel,
> >
> > Hopefully I'm not too late. We are currently using memcg v1.
> >
> > One specific feature we rely on in v1 is skmem accounting. In v1, we
> > account for TCP memory usage without charging it to memcg v1, which is
> > useful for monitoring the TCP memory usage generated by tasks running
> > in a container. However, in memcg v2, monitoring TCP memory requires
> > charging it to the container, which can easily cause OOM issues. It
> > would be better if we could monitor skmem usage without charging it in
> > the memcg v2, allowing us to account for it without the risk of
> > triggering OOM conditions.
> >
>
> Hi Yafang,
>
> No worries. From what I understand, you are not really using skmem
> charging of v1 but just the network memory usage stats and you are
> worried that charging network memory to cgroup memory may cause OOMs. Is
> that correct?

Correct.

> Have you tried charging network memory to cgroup memory
> before and saw OOMs? If yes then I would really like to see OOM reports.

No, we don't enable charging for TCP memory in memcg v1, and we
currently have no plans to add support for it.

>
> I have two examples where the v2's skmem charging is working fine in
> production namely Google and Meta. Google is still on v1 but for skmem
> charging, they have moved to v2 semantics. Actually I have another
> report from Cloudflare [0] where the tcp throttling mechanism for v2's
> tcp memory accounting is too much conservative for their production
> traffic.
>
> Anyways this just means that we need a more flexible way to provide
> and enforce semantics for tcp memory pressure with a decent default
> behavior. I will followup on this separately.
>
> [0] https://lore.kernel.org/lkml/CABWYdi0G7cyNFbndM-ELTDAR3x4Ngm0AehEp5aP0tfNkXUE+Uw@mail.gmail.com/

Thanks for your explanation.

--
Regards
Yafang

2024-05-22 17:59:33

by Kairui Song

[permalink] [raw]
Subject: Re: [PATCH rfc 0/9] mm: memcg: separate legacy cgroup v1 code and put under config option

On Thu, May 9, 2024 at 2:33 PM Shakeel Butt <[email protected]> wrote:
>
> On Wed, May 08, 2024 at 08:41:29PM -0700, Roman Gushchin wrote:
> > Cgroups v2 have been around for a while and many users have fully adopted them,
> > so they never use cgroups v1 features and functionality. Yet they have to "pay"
> > for the cgroup v1 support anyway:
> > 1) the kernel binary contains useless cgroup v1 code,
> > 2) some common structures like task_struct and mem_cgroup have never used
> > cgroup v1-specific members,
> > 3) some code paths have additional checks which are not needed.
> >
> > Cgroup v1's memory controller has a number of features that are not supported
> > by cgroup v2 and their implementation is pretty much self contained.
> > Most notably, these features are: soft limit reclaim, oom handling in userspace,
> > complicated event notification system, charge migration.
> >
> > Cgroup v1-specific code in memcontrol.c is close to 4k lines in size and it's
> > intervened with generic and cgroup v2-specific code. It's a burden on
> > developers and maintainers.
> >
> > This patchset aims to solve these problems by:
> > 1) moving cgroup v1-specific memcg code to the new mm/memcontrol-v1.c file,
> > 2) putting definitions shared by memcontrol.c and memcontrol-v1.c into the
> > mm/internal.h header
> > 3) introducing the CONFIG_MEMCG_V1 config option, turned on by default
> > 4) making memcontrol-v1.c to compile only if CONFIG_MEMCG_V1 is set
> > 5) putting unused struct memory_cgroup and task_struct members under
> > CONFIG_MEMCG_V1 as well.
> >
> > This is an RFC version, which is not 100% polished yet, so but it would be great
> > to discuss and agree on the overall approach.
> >
> > Some open questions, opinions are appreciated:
> > 1) I consider renaming non-static functions in memcontrol-v1.c to have
> > mem_cgroup_v1_ prefix. Is this a good idea?
> > 2) Do we want to extend it beyond the memory controller? Should
> > 3) Is it better to use a new include/linux/memcontrol-v1.h instead of
> > mm/internal.h? Or mm/memcontrol-v1.h.
> >
>
> Hi Roman,
>
> A very timely and important topic and we should definitely talk about it
> during LSFMM as well. I have been thinking about this problem for quite
> sometime and I am getting more and more convinced that we should aim to
> completely deprecate memcg-v1.
>
> More specifically:
>
> 1. What are the memcg-v1 features which have no alternative in memcg-v2
> and are blocker for memcg-v1 users? (setting aside the cgroup v2
> structual restrictions)
>
> 2. What are unused memcg-v1 features which we should start deprecating?
>
> IMO we should systematically start deprecating memcg-v1 features and
> start unblocking the users stuck on memcg-v1.
>
> Now regarding the proposal in this series, I think it can be a first
> step but should not give an impression that we are done. The only
> concern I have is the potential of "out of sight, out of mind" situation
> with this change but if we keep the momentum of deprecation of memcg-v1
> it should be fine.
>
> I have CCed Greg and David from Google to get their opinion on what
> memcg-v1 features are blocker for their memcg-v2 migration and if they
> have concern in deprecation of memcg-v1 features.
>
> Anyone else still on memcg-v1, please do provide your input.
>

Hi,

Sorry for joining the discussion late, but I'd like to add some info
here: we are using the "memsw" feature a lot. It's a very useful knob
for container memory overcommitting: it's a great abstraction of the
"expected total memory usage" of a container, so containers can't
allocate too much memory via SWAP, yet the host can still swap them out.

For a simple example, with memsw.limit == memory.limit, containers
can't exceed their original memory limit even with SWAP enabled - they
get OOM killed just as they used to - but the host is now able to
offload cold pages.

A similar ability seems absent with V2: with memory.swap.max == 0, the
host can't use SWAP to reclaim container memory at all. But with any
larger value, containers are able to overuse memory, causing delayed
OOM kills, thrashing, and a heavily imbalanced CPU/memory usage ratio,
especially with compressed SWAP backends.

Cgroup accounting of ZSWAP/ZRAM doesn't really help: we want to
account for the total raw usage, not the compressed usage. For example,
if a container uses tons of duplicated pages, it can allocate much more
memory than its limit allows, which can cause trouble.

I saw Chris also mentioned that Google has an internal workaround for
this on cgroup V2. This will be a blocker for us and a similar
workaround might be needed. It would be great to see upstream support
for this.
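
To make the gap concrete (the numbers are only illustrative): on v1,
setting memory.limit_in_bytes = memory.memsw.limit_in_bytes = 8G caps the
container's resident + swapped-out footprint at 8G in total, so the host
can swap cold pages out (resident shrinks, swap grows, the sum stays
bounded) while the container can never grow past 8G. On v2, the closest
setup is memory.max = 8G plus memory.swap.max = N, which allows up to
8G + N of combined usage, while memory.swap.max = 0 prevents the host
from swapping the container's pages out at all.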

2024-05-23 19:56:06

by Roman Gushchin

[permalink] [raw]
Subject: Re: [PATCH rfc 0/9] mm: memcg: separate legacy cgroup v1 code and put under config option

On Thu, May 23, 2024 at 01:58:49AM +0800, Kairui Song wrote:
> On Thu, May 9, 2024 at 2:33 PM Shakeel Butt <[email protected]> wrote:
> >
> > On Wed, May 08, 2024 at 08:41:29PM -0700, Roman Gushchin wrote:
> > > Cgroups v2 have been around for a while and many users have fully adopted them,
> > > so they never use cgroups v1 features and functionality. Yet they have to "pay"
> > > for the cgroup v1 support anyway:
> > > 1) the kernel binary contains useless cgroup v1 code,
> > > 2) some common structures like task_struct and mem_cgroup have never used
> > > cgroup v1-specific members,
> > > 3) some code paths have additional checks which are not needed.
> > >
> > > Cgroup v1's memory controller has a number of features that are not supported
> > > by cgroup v2 and their implementation is pretty much self contained.
> > > Most notably, these features are: soft limit reclaim, oom handling in userspace,
> > > complicated event notification system, charge migration.
> > >
> > > Cgroup v1-specific code in memcontrol.c is close to 4k lines in size and it's
> > > intervened with generic and cgroup v2-specific code. It's a burden on
> > > developers and maintainers.
> > >
> > > This patchset aims to solve these problems by:
> > > 1) moving cgroup v1-specific memcg code to the new mm/memcontrol-v1.c file,
> > > 2) putting definitions shared by memcontrol.c and memcontrol-v1.c into the
> > > mm/internal.h header
> > > 3) introducing the CONFIG_MEMCG_V1 config option, turned on by default
> > > 4) making memcontrol-v1.c to compile only if CONFIG_MEMCG_V1 is set
> > > 5) putting unused struct memory_cgroup and task_struct members under
> > > CONFIG_MEMCG_V1 as well.
> > >
> > > This is an RFC version, which is not 100% polished yet, so but it would be great
> > > to discuss and agree on the overall approach.
> > >
> > > Some open questions, opinions are appreciated:
> > > 1) I consider renaming non-static functions in memcontrol-v1.c to have
> > > mem_cgroup_v1_ prefix. Is this a good idea?
> > > 2) Do we want to extend it beyond the memory controller? Should
> > > 3) Is it better to use a new include/linux/memcontrol-v1.h instead of
> > > mm/internal.h? Or mm/memcontrol-v1.h.
> > >
> >
> > Hi Roman,
> >
> > A very timely and important topic and we should definitely talk about it
> > during LSFMM as well. I have been thinking about this problem for quite
> > sometime and I am getting more and more convinced that we should aim to
> > completely deprecate memcg-v1.
> >
> > More specifically:
> >
> > 1. What are the memcg-v1 features which have no alternative in memcg-v2
> > and are blocker for memcg-v1 users? (setting aside the cgroup v2
> > structual restrictions)
> >
> > 2. What are unused memcg-v1 features which we should start deprecating?
> >
> > IMO we should systematically start deprecating memcg-v1 features and
> > start unblocking the users stuck on memcg-v1.
> >
> > Now regarding the proposal in this series, I think it can be a first
> > step but should not give an impression that we are done. The only
> > concern I have is the potential of "out of sight, out of mind" situation
> > with this change but if we keep the momentum of deprecation of memcg-v1
> > it should be fine.
> >
> > I have CCed Greg and David from Google to get their opinion on what
> > memcg-v1 features are blocker for their memcg-v2 migration and if they
> > have concern in deprecation of memcg-v1 features.
> >
> > Anyone else still on memcg-v1, please do provide your input.
> >
>
> Hi,
>
> Sorry for joining the discussion late, but I'd like to add some info
> here: We are using the "memsw" feature a lot. It's a very useful knob
> for container memory overcommitting: It's a great abstraction of the
> "expected total memory usage" of a container, so containers can't
> allocate too much memory using SWAP, but still be able to SWAP out.
>
> For a simple example, with memsw.limit == memory.limit, containers
> can't exceed their original memory limit, even with SWAP enabled, they
> get OOM killed as how they used to, but the host is now able to
> offload cold pages.
>
> Similar ability seems absent with V2: With memory.swap.max == 0, the
> host can't use SWAP to reclaim container memory at all. But with a
> value larger than that, containers are able to overuse memory, causing
> delayed OOM kill, thrashing, CPU/Memory usage ratio could be heavily
> out of balance, especially with compress SWAP backends.
>
> Cgroup accounting of ZSWAP/ZRAM doesn't really help, we want to
> account for the total raw usage, not the compressed usage. One example
> is that if a container uses tons of duplicated pages, then it can
> allocate much more memory than it is limited, that could cause
> trouble.

So you don't need separate swap knobs, only combined, right?

> I saw Chris also mentioned Google has a workaround internally for it
> for Cgroup V2. This will be a blocker for us and a similar workaround
> might be needed. It will be great so see an upstream support for this.

I think that _at least_ we should refactor the code so that it would
be a minimal patch (e.g. one #define) to switch to the old mode.

I don't think it's reasonable to add a new interface, but having a
patch/config option or even a mount option which changes the semantics
of memory.swap.max to the v1-like behavior should be ok.

I'll try to do the first part (refactoring this code), and we can have
a discussion from there.
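
For context on why a single #define (or mount option) could be enough:
the combined memory+swap accounting in memcontrol.c is already keyed off
one predicate. A hedged sketch of extending it - do_memsw_account() is
the existing helper, while the mount-option flag name below is purely
hypothetical:

/*
 * Sketch only: today the legacy combined "memsw" accounting is enabled
 * exactly when the memory controller is not on the cgroup v2 hierarchy.
 * A v2 opt-in could be as small as extending this predicate, e.g. with a
 * hypothetical cgroup2 mount-option flag.
 */
static bool do_memsw_account(void)
{
        return !cgroup_subsys_on_dfl(memory_cgrp_subsys) ||
               (cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_MEMSW);  /* hypothetical */
}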

Thanks!

2024-05-23 20:33:29

by Chris Li

[permalink] [raw]
Subject: Re: [PATCH rfc 0/9] mm: memcg: separate legacy cgroup v1 code and put under config option

On Thu, May 23, 2024 at 12:56 PM Roman Gushchin
<[email protected]> wrote:
>
> On Thu, May 23, 2024 at 01:58:49AM +0800, Kairui Song wrote:
> > On Thu, May 9, 2024 at 2:33 PM Shakeel Butt <shakeel.butt@linuxdev> wrote:
> > >
> > > On Wed, May 08, 2024 at 08:41:29PM -0700, Roman Gushchin wrote:
> > > > Cgroups v2 have been around for a while and many users have fully adopted them,
> > > > so they never use cgroups v1 features and functionality. Yet they have to "pay"
> > > > for the cgroup v1 support anyway:
> > > > 1) the kernel binary contains useless cgroup v1 code,
> > > > 2) some common structures like task_struct and mem_cgroup have never used
> > > > cgroup v1-specific members,
> > > > 3) some code paths have additional checks which are not needed.
> > > >
> > > > Cgroup v1's memory controller has a number of features that are not supported
> > > > by cgroup v2 and their implementation is pretty much self contained.
> > > > Most notably, these features are: soft limit reclaim, oom handling in userspace,
> > > > complicated event notification system, charge migration.
> > > >
> > > > Cgroup v1-specific code in memcontrol.c is close to 4k lines in size and it's
> > > > intervened with generic and cgroup v2-specific code. It's a burden on
> > > > developers and maintainers.
> > > >
> > > > This patchset aims to solve these problems by:
> > > > 1) moving cgroup v1-specific memcg code to the new mm/memcontrol-v1.c file,
> > > > 2) putting definitions shared by memcontrol.c and memcontrol-v1.c into the
> > > > mm/internal.h header
> > > > 3) introducing the CONFIG_MEMCG_V1 config option, turned on by default
> > > > 4) making memcontrol-v1.c to compile only if CONFIG_MEMCG_V1 is set
> > > > 5) putting unused struct memory_cgroup and task_struct members under
> > > > CONFIG_MEMCG_V1 as well.
> > > >
> > > > This is an RFC version, which is not 100% polished yet, so but it would be great
> > > > to discuss and agree on the overall approach.
> > > >
> > > > Some open questions, opinions are appreciated:
> > > > 1) I consider renaming non-static functions in memcontrol-v1.c to have
> > > > mem_cgroup_v1_ prefix. Is this a good idea?
> > > > 2) Do we want to extend it beyond the memory controller? Should
> > > > 3) Is it better to use a new include/linux/memcontrol-v1.h instead of
> > > > mm/internal.h? Or mm/memcontrol-v1.h.
> > > >
> > >
> > > Hi Roman,
> > >
> > > A very timely and important topic and we should definitely talk about it
> > > during LSFMM as well. I have been thinking about this problem for quite
> > > sometime and I am getting more and more convinced that we should aim to
> > > completely deprecate memcg-v1.
> > >
> > > More specifically:
> > >
> > > 1. What are the memcg-v1 features which have no alternative in memcg-v2
> > > and are blocker for memcg-v1 users? (setting aside the cgroup v2
> > > structual restrictions)
> > >
> > > 2. What are unused memcg-v1 features which we should start deprecating?
> > >
> > > IMO we should systematically start deprecating memcg-v1 features and
> > > start unblocking the users stuck on memcg-v1.
> > >
> > > Now regarding the proposal in this series, I think it can be a first
> > > step but should not give an impression that we are done. The only
> > > concern I have is the potential of "out of sight, out of mind" situation
> > > with this change but if we keep the momentum of deprecation of memcg-v1
> > > it should be fine.
> > >
> > > I have CCed Greg and David from Google to get their opinion on what
> > > memcg-v1 features are blocker for their memcg-v2 migration and if they
> > > have concern in deprecation of memcg-v1 features.
> > >
> > > Anyone else still on memcg-v1, please do provide your input.
> > >
> >
> > Hi,
> >
> > Sorry for joining the discussion late, but I'd like to add some info
> > here: We are using the "memsw" feature a lot. It's a very useful knob
> > for container memory overcommitting: It's a great abstraction of the
> > "expected total memory usage" of a container, so containers can't
> > allocate too much memory using SWAP, but still be able to SWAP out.

+Michal,

Just FYI, we do have companies like Tencent using the V1 combined memsw
limitation as well. Google is not the only company using this API.

> >
> > For a simple example, with memsw.limit == memory.limit, containers
> > can't exceed their original memory limit, even with SWAP enabled, they
> > get OOM killed as how they used to, but the host is now able to
> > offload cold pages.
> >
> > Similar ability seems absent with V2: With memory.swap.max == 0, the
> > host can't use SWAP to reclaim container memory at all. But with a
> > value larger than that, containers are able to overuse memory, causing
> > delayed OOM kill, thrashing, CPU/Memory usage ratio could be heavily
> > out of balance, especially with compress SWAP backends.
> >
> > Cgroup accounting of ZSWAP/ZRAM doesn't really help, we want to
> > account for the total raw usage, not the compressed usage. One example
> > is that if a container uses tons of duplicated pages, then it can
> > allocate much more memory than it is limited, that could cause
> > trouble.
>
> So you don't need separate swap knobs, only combined, right?
>
> > I saw Chris also mentioned Google has a workaround internally for it
> > for Cgroup V2. This will be a blocker for us and a similar workaround
> > might be needed. It will be great so see an upstream support for this.
>
> I think that _at least_ we should refactor the code so that it would
> be a minimal patch (e.g. one #define) to switch to the old mode.

It would be great to have a path forward that allows cgroup V2 to
provide the combined memsw limitation.

>
> I don't think it's reasonable to add a new interface, but having a
> patch/config option or even a mount option which changes the semantics
> of memory.swap.max to the v1-like behavior should be ok.

Using sysctl or a slightly different cgroup API is fine. The feature
needs to be there.

> I'll try to do the first part (refactoring this code), and we can have
> a discussion from there.

Looking forward to it.

Thanks

Chris

2024-05-25 01:04:23

by Roman Gushchin

[permalink] [raw]
Subject: Re: [PATCH rfc 6/9] mm: memcg: move cgroup v1 oom handling code into memcontrol-v1.c

On Fri, May 10, 2024 at 03:26:35PM +0200, Michal Hocko wrote:
> On Wed 08-05-24 20:41:35, Roman Gushchin wrote:
> [...]
> > @@ -1747,106 +1623,14 @@ static bool mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order)
> >
> > memcg_memory_event(memcg, MEMCG_OOM);
> >
> > - /*
> > - * We are in the middle of the charge context here, so we
> > - * don't want to block when potentially sitting on a callstack
> > - * that holds all kinds of filesystem and mm locks.
> > - *
> > - * cgroup1 allows disabling the OOM killer and waiting for outside
> > - * handling until the charge can succeed; remember the context and put
> > - * the task to sleep at the end of the page fault when all locks are
> > - * released.
> > - *
> > - * On the other hand, in-kernel OOM killer allows for an async victim
> > - * memory reclaim (oom_reaper) and that means that we are not solely
> > - * relying on the oom victim to make a forward progress and we can
> > - * invoke the oom killer here.
> > - *
> > - * Please note that mem_cgroup_out_of_memory might fail to find a
> > - * victim and then we have to bail out from the charge path.
> > - */
> > - if (READ_ONCE(memcg->oom_kill_disable)) {
> > - if (current->in_user_fault) {
> > - css_get(&memcg->css);
> > - current->memcg_in_oom = memcg;
> > - current->memcg_oom_gfp_mask = mask;
> > - current->memcg_oom_order = order;
> > - }
> > + if (!mem_cgroup_v1_oom_prepare(memcg, mask, order, &locked))
> > return false;
> > - }
> > -
> > - mem_cgroup_mark_under_oom(memcg);
> > -
> > - locked = mem_cgroup_oom_trylock(memcg);
>
> This really confused me because this looks like the oom locking is
> removed for v2 but this is not the case because
> mem_cgroup_v1_oom_prepare is not really v1 only code - in other words
> this is not going to be just return false for CONFIG_MEMCG_V1=n.
>
> It makes sense to move the userspace oom handling out to the v1 file. I
> would keep mem_cgroup_mark_under_oom here.

Hm, I don't see any usage of memcg->under_oom outside of a v1-specific
context. I'm probably missing something - can you please clarify?

> I am not sure about the oom
> locking thing because I think we can make it v1 only. For v2 I guess we
> can go without this locking as the oom path is already locked and it
> implements overkilling prevention (oom_evaluate_task) as it walks all
> processes in the oom hierarchy.

It's a good point, and it's not obvious whether we really need any of this
on v2. I guess not, but I will think about it a bit more.

Thank you!

2024-05-28 17:20:55

by Kairui Song

[permalink] [raw]
Subject: Re: [PATCH rfc 0/9] mm: memcg: separate legacy cgroup v1 code and put under config option

On Fri, May 24, 2024 at 3:55 AM Roman Gushchin <[email protected]> wrote:
>
> On Thu, May 23, 2024 at 01:58:49AM +0800, Kairui Song wrote:
> > On Thu, May 9, 2024 at 2:33 PM Shakeel Butt <shakeel.butt@linuxdev> wrote:
> > >
> > > On Wed, May 08, 2024 at 08:41:29PM -0700, Roman Gushchin wrote:
> > > > Cgroups v2 have been around for a while and many users have fully adopted them,
> > > > so they never use cgroups v1 features and functionality. Yet they have to "pay"
> > > > for the cgroup v1 support anyway:
> > > > 1) the kernel binary contains useless cgroup v1 code,
> > > > 2) some common structures like task_struct and mem_cgroup have never used
> > > > cgroup v1-specific members,
> > > > 3) some code paths have additional checks which are not needed.
> > > >
> > > > Cgroup v1's memory controller has a number of features that are not supported
> > > > by cgroup v2 and their implementation is pretty much self contained.
> > > > Most notably, these features are: soft limit reclaim, oom handling in userspace,
> > > > complicated event notification system, charge migration.
> > > >
> > > > Cgroup v1-specific code in memcontrol.c is close to 4k lines in size and it's
> > > > intervened with generic and cgroup v2-specific code. It's a burden on
> > > > developers and maintainers.
> > > >
> > > > This patchset aims to solve these problems by:
> > > > 1) moving cgroup v1-specific memcg code to the new mm/memcontrol-v1.c file,
> > > > 2) putting definitions shared by memcontrol.c and memcontrol-v1.c into the
> > > > mm/internal.h header
> > > > 3) introducing the CONFIG_MEMCG_V1 config option, turned on by default
> > > > 4) making memcontrol-v1.c to compile only if CONFIG_MEMCG_V1 is set
> > > > 5) putting unused struct memory_cgroup and task_struct members under
> > > > CONFIG_MEMCG_V1 as well.
> > > >
> > > > This is an RFC version, which is not 100% polished yet, so but it would be great
> > > > to discuss and agree on the overall approach.
> > > >
> > > > Some open questions, opinions are appreciated:
> > > > 1) I consider renaming non-static functions in memcontrol-v1.c to have
> > > > mem_cgroup_v1_ prefix. Is this a good idea?
> > > > 2) Do we want to extend it beyond the memory controller? Should
> > > > 3) Is it better to use a new include/linux/memcontrol-v1.h instead of
> > > > mm/internal.h? Or mm/memcontrol-v1.h.
> > > >
> > >
> > > Hi Roman,
> > >
> > > A very timely and important topic and we should definitely talk about it
> > > during LSFMM as well. I have been thinking about this problem for quite
> > > sometime and I am getting more and more convinced that we should aim to
> > > completely deprecate memcg-v1.
> > >
> > > More specifically:
> > >
> > > 1. What are the memcg-v1 features which have no alternative in memcg-v2
> > > and are blocker for memcg-v1 users? (setting aside the cgroup v2
> > > structual restrictions)
> > >
> > > 2. What are unused memcg-v1 features which we should start deprecating?
> > >
> > > IMO we should systematically start deprecating memcg-v1 features and
> > > start unblocking the users stuck on memcg-v1.
> > >
> > > Now regarding the proposal in this series, I think it can be a first
> > > step but should not give an impression that we are done. The only
> > > concern I have is the potential of "out of sight, out of mind" situation
> > > with this change but if we keep the momentum of deprecation of memcg-v1
> > > it should be fine.
> > >
> > > I have CCed Greg and David from Google to get their opinion on what
> > > memcg-v1 features are blocker for their memcg-v2 migration and if they
> > > have concern in deprecation of memcg-v1 features.
> > >
> > > Anyone else still on memcg-v1, please do provide your input.
> > >
> >
> > Hi,
> >
> > Sorry for joining the discussion late, but I'd like to add some info
> > here: We are using the "memsw" feature a lot. It's a very useful knob
> > for container memory overcommitting: It's a great abstraction of the
> > "expected total memory usage" of a container, so containers can't
> > allocate too much memory using SWAP, but still be able to SWAP out.
> >
> > For a simple example, with memsw.limit == memory.limit, containers
> > can't exceed their original memory limit, even with SWAP enabled, they
> > get OOM killed as how they used to, but the host is now able to
> > offload cold pages.
> >
> > Similar ability seems absent with V2: With memory.swap.max == 0, the
> > host can't use SWAP to reclaim container memory at all. But with a
> > value larger than that, containers are able to overuse memory, causing
> > delayed OOM kill, thrashing, CPU/Memory usage ratio could be heavily
> > out of balance, especially with compress SWAP backends.
> >
> > Cgroup accounting of ZSWAP/ZRAM doesn't really help, we want to
> > account for the total raw usage, not the compressed usage. One example
> > is that if a container uses tons of duplicated pages, then it can
> > allocate much more memory than it is limited, that could cause
> > trouble.
>
> So you don't need separate swap knobs, only combined, right?

Yes, currently we use either combined or separate knobs.

> > I saw Chris also mentioned Google has a workaround internally for it
> > for Cgroup V2. This will be a blocker for us and a similar workaround
> > might be needed. It will be great so see an upstream support for this.
>
> I think that _at least_ we should refactor the code so that it would
> be a minimal patch (e.g. one #define) to switch to the old mode.
>
> I don't think it's reasonable to add a new interface, but having a
> patch/config option or even a mount option which changes the semantics
> of memory.swap.max to the v1-like behavior should be ok.
>
> I'll try to do the first part (refactoring this code), and we can have
> a discussion from there.

Thanks, that sounds like a good start.