Received: by 2002:ac0:a5b6:0:0:0:0:0 with SMTP id m51-v6csp4544445imm; Mon, 18 Jun 2018 17:29:01 -0700 (PDT) X-Google-Smtp-Source: ADUXVKLKN1lf+mTTHh2P+ZF9O/HjDEZ4koZvagOmZeJpczQsTTJJncbWQycqd/FAraoc9oNUtTeP X-Received: by 2002:a17:902:7891:: with SMTP id q17-v6mr16619326pll.186.1529368141277; Mon, 18 Jun 2018 17:29:01 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1529368141; cv=none; d=google.com; s=arc-20160816; b=VPnpQvnCTnXMiuwLihpp9D7sHv0xJ6GTwzs9n42YEL2fQF49SRH/h5ykl4GQdKwC1B UNUFL2aBJuNiBArlxD4NjCA8Bq+6KUitiQHJXmEnYbFPYqlLTk87GeO3G7mBYQ8h57Pa RThz4At3TNXq3ubETeFoxPvNY1WYFX1Eyg02YI/fWEh/GpysiGDZCPRx2KkTFBUXEwUI Kv+xiX1c7Qhmdi/oEDxEKkWgO6ZwfdPFQs+vGC/KKUKCVuANWdBO0MPGbHdIu2E8V8Ju QPj6IdUPWCaYBVoxKMkEvP06gQgnUYfy/QbgR3PjtNIPodmLs4dYyG0b2O4Mb/uqSqMl la6w== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:content-transfer-encoding:mime-version :references:in-reply-to:message-id:subject:cc:to:from:date :arc-authentication-results; bh=u6Du66YhPpfO5imjYARZvbW9GcOf+Ym8h8U6YNvMWxc=; b=s1rM3LQEybRERp2x0jhxbHt1JDFpnXHSQG+5HMdBQE1XEEZM1Oogmkj3kDOoS8uQtc 78lUf/tgTFdB1te8OZau8GOB/tK1agKSlTyj/wNHjQck8bTLHPQr2dZ+TCq/0Bgqf+iT S+rgeKxiTWt9evTi24DScQLOdoAmxkXgirfoE+SsgYvoBNdCVkVFfe9gzQXOy5xiqt4c TwythAdf9nTDcq6DA5DGej/VQP0Lss3xlb9kKufqDaoqMX3Iaqwv5pY2FegA2oEhgXVX N6l5VMv59qFQiDB2sMvTbuIg2W0OrIIxXTDPN+AhU8D10yR0aJbzp5KwAg/UbWPpzB3J wv9A== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id v4-v6si15766168pfk.116.2018.06.18.17.28.32; Mon, 18 Jun 2018 17:29:01 -0700 (PDT) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S937109AbeFSA1f (ORCPT + 99 others); Mon, 18 Jun 2018 20:27:35 -0400 Received: from mail.linuxfoundation.org ([140.211.169.12]:40924 "EHLO mail.linuxfoundation.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S937066AbeFSA1f (ORCPT ); Mon, 18 Jun 2018 20:27:35 -0400 Received: from akpm3.svl.corp.google.com (unknown [104.133.9.92]) by mail.linuxfoundation.org (Postfix) with ESMTPSA id 491A2CDE; Tue, 19 Jun 2018 00:27:34 +0000 (UTC) Date: Mon, 18 Jun 2018 17:27:33 -0700 From: Andrew Morton To: David Rientjes Cc: Michal Hocko , Tetsuo Handa , linux-kernel@vger.kernel.org, linux-mm@kvack.org Subject: Re: [patch] mm, oom: fix unnecessary killing of additional processes Message-Id: <20180618172733.39322643725b196ff4e64703@linux-foundation.org> In-Reply-To: References: X-Mailer: Sylpheed 3.6.0 (GTK+ 2.24.31; x86_64-pc-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, 14 Jun 2018 13:42:59 -0700 (PDT) David Rientjes wrote: > The oom reaper ensures forward progress by setting MMF_OOM_SKIP itself if > it cannot reap an mm. This can happen for a variety of reasons, > including: > > - the inability to grab mm->mmap_sem in a sufficient amount of time, > > - when the mm has blockable mmu notifiers that could cause the oom reaper > to stall indefinitely, Maybe we should have more than one oom reaper thread? I assume the probability of the oom reaper thread blocking on an mmu notifier is small, so perhaps just dive in and hope for the best. If the oom reaper gets stuck then there's another thread ready to take over. And revisit the decision to use a kernel thread instead of workqueues. > but we can also add a third when the oom reaper can "reap" an mm but doing > so is unlikely to free any amount of memory: > > - when the mm's memory is fully mlocked. > > When all memory is mlocked, the oom reaper will not be able to free any > substantial amount of memory. It sets MMF_OOM_SKIP before the victim can > unmap and free its memory in exit_mmap() and subsequent oom victims are > chosen unnecessarily. This is trivial to reproduce if all eligible > processes on the system have mlocked their memory: the oom killer calls > panic() even though forward progress can be made. > > This is the same issue where the exit path sets MMF_OOM_SKIP before > unmapping memory and additional processes can be chosen unnecessarily > because the oom killer is racing with exit_mmap(). So what's actually happening here. A process has a large amount of mlocked memory, it has been oom-killed and it is in the process of releasing its memory and exiting, yes? If so, why does this task set MMF_OOM_SKIP on itself? Why aren't we just patiently waiting for its attempt to release meory? > We can't simply defer setting MMF_OOM_SKIP, however, because if there is > a true oom livelock in progress, it never gets set and no additional > killing is possible. I guess that's my answer. What causes this livelock? Process looping in alloc_pages while holding a lock the oom victim wants? > To fix this, this patch introduces a per-mm reaping timeout, initially set > at 10s. It requires that the oom reaper's list becomes a properly linked > list so that other mm's may be reaped while waiting for an mm's timeout to > expire. > > This replaces the current timeouts in the oom reaper: (1) when trying to > grab mm->mmap_sem 10 times in a row with HZ/10 sleeps in between and (2) > a HZ sleep if there are blockable mmu notifiers. It extends it with > timeout to allow an oom victim to reach exit_mmap() before choosing > additional processes unnecessarily. > > The exit path will now set MMF_OOM_SKIP only after all memory has been > freed, so additional oom killing is justified, That seems sensible, but why set MMF_OOM_SKIP at all? > and rely on MMF_UNSTABLE to > determine when it can race with the oom reaper. > > The oom reaper will now set MMF_OOM_SKIP only after the reap timeout has > lapsed because it can no longer guarantee forward progress. > > The reaping timeout is intentionally set for a substantial amount of time > since oom livelock is a very rare occurrence and it's better to optimize > for preventing additional (unnecessary) oom killing than a scenario that > is much more unlikely. What happened to the old idea of permitting the task which is blocking the oom victim to access additional reserves? Come to that, what happened to the really really old Andreaidea of not looping in the page allocator anyway? Return NULL instead... I dunno, I'm thrashing around here. We seem to be piling mess on top of mess and then being surprised that the result is a mess. > +#ifdef CONFIG_MMU > + /* When to give up on oom reaping this mm */ > + unsigned long reap_timeout; "timeout" implies "interval". To me, anyway. This is an absolute time, so something like reap_time would be clearer. Along with a comment explaining that the units are in jiffies. > +#endif > #if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS > pgtable_t pmd_huge_pte; /* protected by page_table_lock */ > #endif > diff --git a/include/linux/sched.h b/include/linux/sched.h > --- a/include/linux/sched.h > +++ b/include/linux/sched.h > @@ -1163,7 +1163,7 @@ struct task_struct { > #endif > int pagefault_disabled; > #ifdef CONFIG_MMU > - struct task_struct *oom_reaper_list; > + struct list_head oom_reap_list; Can we have a comment explaining its locking. > #endif > #ifdef CONFIG_VMAP_STACK > struct vm_struct *stack_vm_area; > > ... > > --- a/mm/mmap.c > +++ b/mm/mmap.c > @@ -3059,11 +3059,10 @@ void exit_mmap(struct mm_struct *mm) > if (unlikely(mm_is_oom_victim(mm))) { > /* > * Manually reap the mm to free as much memory as possible. > - * Then, as the oom reaper does, set MMF_OOM_SKIP to disregard > - * this mm from further consideration. Taking mm->mmap_sem for > - * write after setting MMF_OOM_SKIP will guarantee that the oom > - * reaper will not run on this mm again after mmap_sem is > - * dropped. > + * Then, set MMF_UNSTABLE to avoid racing with the oom reaper. > + * Taking mm->mmap_sem for write after setting MMF_UNSTABLE will > + * guarantee that the oom reaper will not run on this mm again > + * after mmap_sem is dropped. Comment should explain *why* we don't want the reaper to run on this mm again. > > ... >