Date: Fri, 15 Jun 2018 08:55:41 +0200
From: Michal Hocko
To: David Rientjes
Cc: Andrew Morton, Tetsuo Handa, linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [patch] mm, oom: fix unnecessary killing of additional processes
Message-ID: <20180615065541.GA24039@dhcp22.suse.cz>

On Thu 14-06-18 13:42:59, David Rientjes wrote:
> The oom reaper ensures forward progress by setting MMF_OOM_SKIP itself if
> it cannot reap an mm.
> This can happen for a variety of reasons,
> including:
>
>  - the inability to grab mm->mmap_sem in a sufficient amount of time,
>
>  - when the mm has blockable mmu notifiers that could cause the oom reaper
>    to stall indefinitely,
>
> but we can also add a third when the oom reaper can "reap" an mm but doing
> so is unlikely to free any amount of memory:
>
>  - when the mm's memory is fully mlocked.
>
> When all memory is mlocked, the oom reaper will not be able to free any
> substantial amount of memory. It sets MMF_OOM_SKIP before the victim can
> unmap and free its memory in exit_mmap() and subsequent oom victims are
> chosen unnecessarily. This is trivial to reproduce if all eligible
> processes on the system have mlocked their memory: the oom killer calls
> panic() even though forward progress can be made.
>
> This is the same issue where the exit path sets MMF_OOM_SKIP before
> unmapping memory and additional processes can be chosen unnecessarily
> because the oom killer is racing with exit_mmap().
>
> We can't simply defer setting MMF_OOM_SKIP, however, because if there is
> a true oom livelock in progress, it never gets set and no additional
> killing is possible.
>
> To fix this, this patch introduces a per-mm reaping timeout, initially set
> at 10s. It requires that the oom reaper's list becomes a properly linked
> list so that other mm's may be reaped while waiting for an mm's timeout to
> expire.
>
> This replaces the current timeouts in the oom reaper: (1) when trying to
> grab mm->mmap_sem 10 times in a row with HZ/10 sleeps in between and (2)
> a HZ sleep if there are blockable mmu notifiers. It extends it with
> timeout to allow an oom victim to reach exit_mmap() before choosing
> additional processes unnecessarily.
>
> The exit path will now set MMF_OOM_SKIP only after all memory has been
> freed, so additional oom killing is justified, and rely on MMF_UNSTABLE to
> determine when it can race with the oom reaper.
>
> The oom reaper will now set MMF_OOM_SKIP only after the reap timeout has
> lapsed because it can no longer guarantee forward progress.
>
> The reaping timeout is intentionally set for a substantial amount of time
> since oom livelock is a very rare occurrence and it's better to optimize
> for preventing additional (unnecessary) oom killing than a scenario that
> is much more unlikely.
>
> Signed-off-by: David Rientjes

Nacked-by: Michal Hocko

as already explained elsewhere in this email thread.

> ---
> Note: I understand there is an objection based on timeout based delays.
> This is currently the only possible way to avoid oom killing important
> processes completely unnecessarily. If the oom reaper can someday free
> all memory, including mlocked memory and those mm's with blockable mmu
> notifiers, and is guaranteed to always be able to grab mm->mmap_sem,
> this can be removed. I do not believe any such guarantee is possible
> and consider the massive killing of additional processes unnecessarily
> to be a regression introduced by the oom reaper and its very quick
> setting of MMF_OOM_SKIP to allow additional processes to be oom killed.

If you find oom reaper more harmful than useful I would be willing to
ack a command line option to disable it. Especially when you keep
claiming that the lockups are not really happening in your environment.

Other than that I've already pointed to a more robust solution. If you
are reluctant to try it out I will do, but introducing a timeout is just
papering over the real problem. Maybe we will not reach the state that
_all_ the memory is reapable but we definitely should try to make as
much as possible to be reapable and I do not see any fundamental
problems in that direction.
--
Michal Hocko
SUSE Labs