Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) client-ip=23.128.96.18;
Date:   Tue, 27 Apr 2021 13:26:51 +0100
From:   Chris Down <chris@chrisdown.name>
To:     Alexander Sosna <alexander@sosna.de>
Cc:     linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH] Prevent OOM casualties by enforcing memcg limits
Message-ID: <YIgDCzmKaesjl8aU@chrisdown.name>
References: <ea6db5cc-f862-7c4b-d872-acb29c2d8193@sosna.de>
 <YIdWMC/iAdanDjLh@chrisdown.name>
 <410a58ba-d746-4ed6-a660-98b5f99258c3@sosna.de>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii; format=flowed
Content-Disposition: inline
In-Reply-To: <410a58ba-d746-4ed6-a660-98b5f99258c3@sosna.de>
User-Agent: Mutt/2.0.6 (98f8cb83) (2021-03-06)
Precedence: bulk

Alexander Sosna writes:
>> We don't guarantee that vm.overcommit_memory 2 means "no OOM killer". It
>> can still happen for a bunch of reasons, so I really hope PostgreSQL
>> isn't relying on that.
>>
>> Could you please be more clear about the "huge problem" being solved
>> here? I'm not seeing it.
>
>let me explain the problem I encounter and why I fell down the mm rabbit
>hole.  It is not a PostgreSQL specific problem but that's where I run
>into it.  PostgreSQL forks a backend for each client connection.  All
>backends have shared memory as well as local work memory.  When a
>backend needs more dynamic work_mem to execute a query, new memory
>is allocated.  It is normal that such an allocation can fail.  If the
>backend gets an ENOMEM the current query is rolled back and all dynamic
>work_mem is freed. The RDBMS stays operational and no other query is
>disturbed.
>
>When running in a memory cgroup - for example via systemd or on k8s -
>the kernel will not return ENOMEM even if the cgroup's memory limit is
>exceeded.  Instead the OOM killer is awakened and kills processes in the
>violating cgroup.  If any backend is killed with SIGKILL the shared
>memory of the whole cluster is deemed potentially corrupted and
>PostgreSQL needs to do an emergency restart.  This cancels all operations
>on all backends and it entails a potentially lengthy recovery process.
>Therefore the behavior is quite "costly".

My point that memory cgroups are completely overcommit agnostic isn't just a 
question of abstract semantics, but a practical one. Exceeding memory.max is 
not overcommitment, because overages are physical, not virtual, and that has 
vastly different ramifications in terms of what managing that overage means.
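
To make the virtual/physical distinction concrete, here is a minimal sketch, not 
kernel code. It assumes the simplified view that strict overcommit accounting 
compares Committed_AS against CommitLimit (both from /proc/meminfo, in kB) and 
ignores the admin and user reserves, while memory.max is compared against the 
bytes actually charged in memory.current. The function names are hypothetical, 
and the real charge path attempts reclaim before anything else happens.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values."""
    fields = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            fields[name.strip()] = int(parts[0])
    return fields


def would_refuse_commit(meminfo, request_kb):
    """Simplified strict-overcommit check (assumption: reserves ignored).

    This looks only at virtual commitments, not at pages in use.
    """
    return meminfo["Committed_AS"] + request_kb > meminfo["CommitLimit"]


def charge_needs_reclaim(memory_current, memory_max, nr_bytes):
    """Would charging nr_bytes push the cgroup past memory.max?

    Both arguments are the raw file contents; memory.max may be "max".
    This concerns physical usage, whatever has been committed virtually.
    """
    if memory_max.strip() == "max":
        return False
    return int(memory_current) + nr_bytes > int(memory_max)
```

The first check can fail an mmap() that would never have been touched; the 
second only matters once pages are actually faulted in and charged.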

For example, if we aggressively return ENOMEM at the memory.max boundary, there's 
no provision for memory reclaim to occur within its natural bounds. Now maybe 
your application likes that (which I find highly dubious), but from a memory 
balancing perspective it's just nonsensical: we need to ensure that we're 
assisting forward progress of the system at the cgroup level, especially with 
the huge amounts of slack generated.

>I totally understand that vm.overcommit_memory 2 does not mean "no OOM
>killer". IMHO it should mean "no OOM killer if we can avoid it" and I
>would highly appreciate if the kernel would use a less invasive means
>whenever possible.  I guess this might also be the expectation by many
>other users.  In my described case - which is a real pain for me - it is
>quite easy to tweak the kernel behavior in order to handle this and
>other similar situations with fewer casualties.  This is why I sent a
>patch instead of starting a theoretical discussion.

vm.overcommit_memory=2 means "don't overcommit", nothing less, nothing more. 
Adding more semantics is a very good way to make an extremely confusing and 
overloaded API.

This commit reminds me of the comments on cosmetic products that say "no 
parabens". Ok, so there's no parabens -- great, parabens are terrible -- but 
are you now using a much more dangerous preservative instead?

Likewise, this commit claims that it reduces the likelihood of invoking the OOM 
killer -- great, nobody wants their processes to be OOM killed. What do we have 
instead? Code that calls off memory allocations way, way before it needs to do 
so, and prevents the system from even getting into a state where it can 
efficiently evaluate how it should rebalance memory. That's really not a good 
tradeoff.

>What do you think is necessary to get this to an approvable quality?

The problem is not the code, it's the concept and the way it interacts with the 
rest of the mm subsystem. It asks the mm subsystem to deny memory allocations 
long before it has even had a chance to reliably rebalance (just as one 
example, to punt anon pages to swap) based on the new allocations, which 
doesn't make very much sense. It may not break in some highly trivial setups, 
but it certainly will not work well with stacking or machines with high 
volatility of the anon/file LRUs. You're also likely to see random ENOMEM 
failures from kernelspace when operating under this memcg context, long before 
such a response is necessary.

If you want to know when to back off allocations, use memory.high with PSI 
pressure metrics.
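
As a rough sketch of what that could look like from userspace: the cgroup v2 
memory.pressure file reports "some" and "full" lines with avg10/avg60/avg300 
percentages and a cumulative total in microseconds. The helper names and the 
10% threshold below are assumptions for illustration, not a recommendation; the 
right threshold depends on the workload.

```python
import os


def parse_psi(text):
    """Parse PSI text ("some ..." / "full ..." lines) into nested dicts."""
    result = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        values = {}
        for item in fields[1:]:
            key, _, value = item.partition("=")
            # "total" is an integer count of microseconds; the rest are percentages.
            values[key] = int(value) if key == "total" else float(value)
        result[fields[0]] = values
    return result


def read_memory_pressure(cgroup_dir):
    """Read and parse memory.pressure from a cgroup v2 directory."""
    with open(os.path.join(cgroup_dir, "memory.pressure")) as f:
        return parse_psi(f.read())


def should_back_off(psi, threshold=10.0):
    """Hypothetical policy: back off when "some" avg10 reaches the threshold."""
    return psi.get("some", {}).get("avg10", 0.0) >= threshold
```

An application could poll read_memory_pressure() on its own cgroup directory 
and shrink its working set or defer new allocations while should_back_off() 
returns True, with memory.high set below memory.max so that throttling and 
reclaim kick in first.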

I also would strongly suggest that vm.overcommit_memory=2 is the equivalent of 
using a bucket of ignited thermite to warm one's house.