Date: Fri, 25 May 2018 12:36:08 -0700 (PDT)
From: David Rientjes
To: Michal Hocko
cc: Tetsuo Handa, Andrew Morton, linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [rfc patch] mm, oom: fix unnecessary killing of additional processes
In-Reply-To: <20180525072636.GE11881@dhcp22.suse.cz>
References: <20180525072636.GE11881@dhcp22.suse.cz>

On Fri, 25 May 2018, Michal Hocko wrote:

> > The oom reaper ensures forward progress by setting MMF_OOM_SKIP itself if
> > it cannot reap an mm.
> > This can happen for a variety of reasons,
> > including:
> >
> >  - the inability to grab mm->mmap_sem in a sufficient amount of time,
> >
> >  - when the mm has blockable mmu notifiers that could cause the oom reaper
> >    to stall indefinitely,
> >
> > but we can also add a third when the oom reaper can "reap" an mm but doing
> > so is unlikely to free any amount of memory:
> >
> >  - when the mm's memory is fully mlocked.
> >
> > When all memory is mlocked, the oom reaper will not be able to free any
> > substantial amount of memory. It sets MMF_OOM_SKIP before the victim can
> > unmap and free its memory in exit_mmap() and subsequent oom victims are
> > chosen unnecessarily. This is trivial to reproduce if all eligible
> > processes on the system have mlocked their memory: the oom killer calls
> > panic() even though forward progress can be made.
> >
> > This is the same issue where the exit path sets MMF_OOM_SKIP before
> > unmapping memory and additional processes can be chosen unnecessarily
> > because the oom killer is racing with exit_mmap().
> >
> > We can't simply defer setting MMF_OOM_SKIP, however, because if there is
> > a true oom livelock in progress, it never gets set and no additional
> > killing is possible.
> >
> > To fix this, this patch introduces a per-mm reaping timeout, initially set
> > at 10s. It requires that the oom reaper's list becomes a properly linked
> > list so that other mm's may be reaped while waiting for an mm's timeout to
> > expire.
>
> No timeouts please! The proper way to handle this problem is to simply
> teach the oom reaper to handle mlocked areas.

That's not sufficient since the oom reaper is also not able to oom reap if
the mm has blockable mmu notifiers or all memory is shared filebacked
memory, so it immediately sets MMF_OOM_SKIP and additional processes are
oom killed.

The current implementation that relies on MAX_OOM_REAP_RETRIES is already
acting as a timeout for mm->mmap_sem, but it does so without attempting to
oom reap other victims, which may be what actually allows it to grab
mm->mmap_sem if the allocator is waiting on a lock. The solution, as
proposed, is to allow the oom reaper to iterate over all victims and try
to free memory rather than working on each victim one by one and giving
up.

But also note that even if oom reaping is possible, in the presence of an
antagonist that continues to allocate memory, it is possible to oom kill
additional victims unnecessarily if we aren't able to complete
free_pgtables() in exit_mmap() of the original victim.

So this patch is solving all three issues: allowing a process to *fully*
exit (including free_pgtables()) before setting MMF_OOM_SKIP, allowing the
oom reaper to act on parallel victims that may allow a victim to be
reaped, and preventing additional processes from being killed
unnecessarily when oom reaping isn't able to free memory (mlock, blockable
mmu invalidates, all VM_SHARED file backed, small rss, etc).

The vast majority of the time, oom reaping can occur with this change or
the process can reach exit_mmap() itself; oom livelock appears to be very
rare with this patch, even for mem cgroup constrained oom kills with very
tight limits, and thus it makes sense to wait for a prolonged period of
time before killing additional processes unnecessarily.
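
To make the proposed behavior concrete, here is a minimal userspace sketch of
the idea, not the patch itself. Everything in it is an assumption made for
illustration: struct victim, queue_victim(), reaper_pass(), the tick counter
standing in for jiffies, and REAP_TIMEOUT_TICKS standing in for the 10s
timeout. The oom_skip flag models MMF_OOM_SKIP, which is set only once the
victim has fully exited or its per-victim timeout has expired, while the
reaper keeps walking every queued victim instead of giving up on one.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the 10s per-mm timeout, in abstract ticks. */
#define REAP_TIMEOUT_TICKS 10UL

/* Hypothetical model of an oom victim; this is not the kernel's mm_struct. */
struct victim {
	const char *name;
	bool reapable;            /* false models mlock, blockable notifiers, ... */
	bool reaped;              /* set once a reap attempt freed memory */
	unsigned long exit_tick;  /* tick at which the victim finishes exiting */
	unsigned long deadline;   /* tick at which the reaper stops waiting */
	bool oom_skip;            /* models MMF_OOM_SKIP */
	struct victim *next;
};

/* Queue a victim on the reaper's list and start its timeout. */
static void queue_victim(struct victim **head, struct victim *v,
			 unsigned long now)
{
	assert(head != NULL && v != NULL);
	v->deadline = now + REAP_TIMEOUT_TICKS;
	v->oom_skip = false;
	v->next = *head;
	*head = v;
}

/*
 * One pass over every queued victim. Reaping is attempted for each one, but
 * oom_skip is set only when the victim has fully exited or timed out.
 * Returns the number of victims still pending on the list.
 */
static size_t reaper_pass(struct victim **head, unsigned long now)
{
	struct victim **link = head;
	size_t pending = 0;

	while (*link != NULL) {
		struct victim *v = *link;

		/* Try to free memory; this alone does not set oom_skip. */
		if (v->reapable && !v->reaped)
			v->reaped = true;

		if (now >= v->exit_tick || now >= v->deadline) {
			v->oom_skip = true;   /* further oom kills are now allowed */
			*link = v->next;      /* unlink from the reaper's list */
			v->next = NULL;
		} else {
			pending++;
			link = &v->next;
		}
	}
	return pending;
}
```

In this model, a fully mlocked victim that exits on its own at tick 3 gets
oom_skip at tick 3, and a victim that never exits gets it only when its
deadline passes, so no additional victim would be selected before then.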