Date: Thu, 31 May 2018 14:16:34 -0700 (PDT)
From: David Rientjes
To: Michal Hocko
cc: Tetsuo Handa, Andrew Morton, linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [rfc patch] mm, oom: fix unnecessary killing of additional processes
In-Reply-To: <20180531063212.GF15278@dhcp22.suse.cz>
References: <20180525072636.GE11881@dhcp22.suse.cz> <20180528081345.GD1517@dhcp22.suse.cz> <20180531063212.GF15278@dhcp22.suse.cz>

On Thu, 31 May 2018, Michal Hocko wrote:
> > It's not a random timeout, it's sufficiently long such that we don't
> > oom kill several processes needlessly in the very rare case where oom
> > livelock would actually prevent the original victim from exiting.  The
> > oom reaper processing an mm, finding everything to be mlocked, and
> > immediately setting MMF_OOM_SKIP is inappropriate.  This is rather
> > trivial to reproduce for a large memory hogging process that mlocks
> > all of its memory; we consistently see spurious and unnecessary oom
> > kills simply because the oom reaper has set MMF_OOM_SKIP very early.
>
> It takes quite some additional steps for admin to allow a large amount
> of mlocked memory and such an application should be really careful to
> not consume too much memory. So how come this is something you see that
> consistently? Is this some sort of bug or an unfortunate workload side
> effect? I am asking this because I really want to see how relevant this
> really is.

The bug is that the oom reaper sets MMF_OOM_SKIP almost immediately
after the victim has been chosen for oom kill, and we then get follow-up
oom kills; it is not that the process is able to mlock a large amount of
memory.  Mlock here is only being discussed as a single example.  Tetsuo
has brought up the example of all shared file-backed memory, and we've
discussed the mm having a single blockable mmu notifier.  Regardless of
how we arrive at the point where the oom reaper can't free memory, which
could be any of those three cases, this occurs whenever (1) the original
victim is sufficiently large that follow-up oom kills would otherwise be
unnecessary and (2) other threads allocate/charge before the oom victim
reaches exit_mmap().  We have examples of cases where oom reaping was
reported as successful, yet the rss numbers in the kernel log are very
similar to those at the time of the oom kill even though the process is
known not to mlock; the reason is that the oom reaper could free very
little memory due to blockable mmu notifiers.

> But the waiting periods just turn out to be a really poor design. There
> will be no good timeout to fit for everybody. We can do better and as
> long as this is the case the timeout based solution should be really
> rejected. It is a shortcut that doesn't really solve the underlying
> problem.

The current implementation is already a timeout based solution for
mmap_sem: the oom reaper spins trying to grab the sem and eventually
gives up.  This patch allows it to work on other mm's concurrently and
detects the timeout in a different way, with jiffies instead of an
iterator (a sketch of the idea is below).  I'd love a solution where we
can reliably detect an oom livelock and oom kill additional processes,
but only after the original victim has had a chance to do exit_mmap()
and without using a timeout; I don't see one being offered.  That Tetsuo
has seen issues with this in the past and suggested a similar proposal
means we are not the only ones feeling pain from this.

> > I'm open to hearing any other suggestions that you have other than
> > waiting some time period before MMF_OOM_SKIP gets set to solve this
> > problem.
>
> I've already offered one. Make mlocked pages reapable.

Making mlocked pages reapable would only solve the most trivial
reproducer of this.  Unless the oom reaper can guarantee that it will
never block and can free all the memory that exit_mmap() can free, we
need to ensure that a victim has a chance to reach the exit path on its
own before killing every other process on the system.
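To make the jiffies-based deferral concrete, here is a minimal sketch of
the idea.  This is illustrative only and not the actual patch: the
oom_reap_expire field, the ten second grace period, and
queue_oom_reaper_retry() are hypothetical names invented for the
example.

	/*
	 * Sketch: defer MMF_OOM_SKIP until a grace period has expired
	 * rather than setting it on the first failed reap pass.
	 */
	static void oom_reap_task(struct task_struct *tsk)
	{
		struct mm_struct *mm = tsk->signal->oom_mm;

		/* Stamp the deadline on the first reap attempt. */
		if (!mm->oom_reap_expire)
			mm->oom_reap_expire = jiffies + 10 * HZ;

		/* Reap what we can; mlocked or notifier-covered vmas may remain. */
		oom_reap_task_mm(tsk, mm);

		if (time_after_eq(jiffies, mm->oom_reap_expire)) {
			/* Grace period is up; a new victim may be selected. */
			set_bit(MMF_OOM_SKIP, &mm->flags);
		} else {
			/* Requeue: the victim may still reach exit_mmap() itself. */
			queue_oom_reaper_retry(tsk);
		}
	}

The point is only that MMF_OOM_SKIP, which the oom killer checks before
selecting an additional victim, is set after the victim has had a real
chance to exit on its own instead of on the first pass over an
unreapable mm.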
I'll fix the issue I identified with doing list_add_tail() rather than
list_add(), fix up the commit message per Tetsuo to identify the other
ways this can occur besides mlock, remove the rfc tag, and repost.
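As an aside for anyone following the list_add_tail() fix above:
list_add() inserts at the head of a list and list_add_tail() at the
tail, so only the latter keeps the reaper queue FIFO; with head
insertion, a newly killed victim is processed before victims that are
already part way through their wait.  Roughly, with hypothetical struct
and list names:

	/* LIFO: the newest victim jumps ahead of those already waiting. */
	list_add(&victim->oom_reap_list, &oom_reap_queue);

	/* FIFO: victims are processed in the order they were queued. */
	list_add_tail(&victim->oom_reap_list, &oom_reap_queue);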