Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67;
Date:   Thu, 19 Mar 2020 08:09:11 +0100
From:   Michal Hocko <mhocko@kernel.org>
To:     David Rientjes <rientjes@google.com>
Cc:     Andrew Morton <akpm@linux-foundation.org>,
        Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>,
        Vlastimil Babka <vbabka@suse.cz>,
        Robert Kolchmeyer <rkolchmeyer@google.com>,
        linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [patch v3] mm, oom: prevent soft lockup on memcg oom for UP
 systems
Message-ID: <20200319070911.GU21362@dhcp22.suse.cz>
References: <8395df04-9b7a-0084-4bb5-e430efe18b97@i-love.sakura.ne.jp>
 <alpine.DEB.2.21.2003161648370.47327@chino.kir.corp.google.com>
 <202003170318.02H3IpSx047471@www262.sakura.ne.jp>
 <alpine.DEB.2.21.2003162107580.97351@chino.kir.corp.google.com>
 <alpine.DEB.2.21.2003171752030.115787@chino.kir.corp.google.com>
 <20200318094219.GE21362@dhcp22.suse.cz>
 <alpine.DEB.2.21.2003181437270.70237@chino.kir.corp.google.com>
 <alpine.DEB.2.21.2003181458100.70237@chino.kir.corp.google.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <alpine.DEB.2.21.2003181458100.70237@chino.kir.corp.google.com>
Sender: linux-kernel-owner@vger.kernel.org
Precedence: bulk

On Wed 18-03-20 15:03:52, David Rientjes wrote:
> When a process is oom killed as a result of memcg limits and the victim
> is waiting to exit, nothing ends up actually yielding the processor back
> to the victim on UP systems with preemption disabled.  Instead, the
> charging process simply loops in memcg reclaim and eventually soft
> lockups.
> 
> For example, on an UP system with a memcg limited to 100MB, if three 
> processes each charge 40MB of heap with swap disabled, one of the charging 
> processes can loop endlessly trying to charge memory which starves the oom 
> victim.

This only happens if there is no reclaimable memory in the hierarchy.
That is a very specific condition. I do not see any way to get there
other than a misconfigured system where min protection prevents any
reclaim. Otherwise we have cond_resched both in the slab shrinking code
(do_shrink_slab) and in the LRU shrinking code (shrink_lruvec). If I am
wrong and those are insufficient then please be explicit about the
scenario.
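
To illustrate the point, here is a rough userspace model, not kernel
code. Every model_* name is hypothetical, and the real reclaim path is
considerably more involved. It only shows that a scheduling point inside
the scan loop is never reached when there is nothing to scan:

```c
/* Userspace model only; every name here is hypothetical, not a kernel API. */
#include <stdbool.h>
#include <stddef.h>

static unsigned long model_resched_points;

/* Stand-in for cond_resched(): just counts scheduling opportunities. */
static void model_cond_resched(void)
{
	model_resched_points++;
}

struct model_memcg {
	size_t reclaimable_pages;	/* pages that may be scanned */
	bool min_protected;		/* min protection covers all usage */
};

/*
 * Scan in batches and offer the CPU after each batch.  With nothing to
 * scan, the loop body never runs and no scheduling point is reached.
 */
static size_t model_shrink(struct model_memcg *memcg, size_t batch)
{
	size_t reclaimed = 0;

	if (memcg->min_protected || batch == 0)
		return 0;

	while (memcg->reclaimable_pages > 0) {
		size_t n = memcg->reclaimable_pages < batch ?
			   memcg->reclaimable_pages : batch;

		memcg->reclaimable_pages -= n;
		reclaimed += n;
		model_cond_resched();
	}
	return reclaimed;
}
```

In this model a group with 100 reclaimable pages and a batch of 32 hits
the scheduling point four times, while a fully protected group hits it
zero times.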

This is very important information to have in the changelog!

[...]

> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1576,6 +1576,12 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  	 */
>  	ret = should_force_charge() || out_of_memory(&oc);
>  	mutex_unlock(&oom_lock);
> +        /*
> +         * Give a killed process a good chance to exit before trying to
> +         * charge memory again.
> +         */
> +	if (ret)
> +		schedule_timeout_killable(1);

Why are you making this conditional? Say that there is no victim to
kill. The charge path would simply bail out and it would really depend
on the call chain whether there is a scheduling point or not. Isn't it
simply safer to call schedule_timeout_killable unconditionally at this
stage?
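
A rough userspace sketch of the difference, assuming a stub in place of
schedule_timeout_killable(1). The model_* names are hypothetical and
none of this is the actual mem_cgroup_out_of_memory() body:

```c
/* Userspace model only; the model_* names are hypothetical. */
#include <stdbool.h>

static unsigned long model_sleeps;

/* Stand-in for schedule_timeout_killable(1): counts yields. */
static void model_schedule_timeout(void)
{
	model_sleeps++;
}

/* As in the quoted hunk: yield only when the OOM path reports success. */
static bool model_oom_conditional(bool ret)
{
	if (ret)
		model_schedule_timeout();
	return ret;
}

/* The alternative asked about: yield regardless of the result. */
static bool model_oom_unconditional(bool ret)
{
	model_schedule_timeout();
	return ret;
}
```

With ret == false (no victim to kill) the conditional variant returns
without ever yielding, so whether the caller reaches a scheduling point
depends entirely on the rest of the call chain.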

>  	return ret;
>  }
>  
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3861,6 +3861,12 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
>  	}
>  out:
>  	mutex_unlock(&oom_lock);
> +	/*
> +	 * Give a killed process a good chance to exit before trying to
> +	 * allocate memory again.
> +	 */
> +	if (*did_some_progress)
> +		schedule_timeout_killable(1);

This doesn't make much sense either. Please remember that the primary
reason you are adding this schedule_timeout_killable in this path is to
reduce the priority inversion problem mentioned by Tetsuo, not to add a
missing scheduling point. The page allocator path doesn't lack regular
scheduling points: compaction, reclaim, should_reclaim_retry etc. all
have them.

>  	return page;
>  }
>  

-- 
Michal Hocko
SUSE Labs