Date: Tue, 6 Aug 2019 11:29:30 +0200
From: Michal Hocko
To: Johannes Weiner
Cc: Vlastimil Babka, "Artem S. Tashkinov",
	linux-kernel@vger.kernel.org, linux-mm, Suren Baghdasaryan
Subject: Re: Let's talk about the elephant in the room - the Linux
	kernel's inability to gracefully handle low memory pressure
Message-ID: <20190806092930.GO11812@dhcp22.suse.cz>
References: <20190805133119.GO7597@dhcp22.suse.cz>
	<20190805185542.GA4128@cmpxchg.org>
In-Reply-To: <20190805185542.GA4128@cmpxchg.org>

On Mon 05-08-19 14:55:42, Johannes Weiner wrote:
> On Mon, Aug 05, 2019 at 03:31:19PM +0200, Michal Hocko wrote:
> > On Mon 05-08-19 14:13:16, Vlastimil Babka wrote:
> > > On 8/4/19 11:23 AM, Artem S. Tashkinov wrote:
> > > > Hello,
> > > >
> > > > There's this bug which has been bugging many people for many
> > > > years already and which is reproducible in less than a few
> > > > minutes under the latest and greatest kernel, 5.2.6. All the
> > > > kernel parameters are set to defaults.
> > > >
> > > > Steps to reproduce:
> > > >
> > > > 1) Boot with mem=4G
> > > > 2) Disable swap to make everything faster (sudo swapoff -a)
> > > > 3) Launch a web browser, e.g. Chrome/Chromium and/or Firefox
> > > > 4) Start opening tabs in either of them and watch your free RAM
> > > >    decrease
> > > >
> > > > Once you hit a situation where opening a new tab requires more
> > > > RAM than is currently available, the system will stall hard. You
> > > > will barely be able to move the mouse pointer. Your disk LED will
> > > > be flashing incessantly (I'm not entirely sure why). You will not
> > > > be able to run new applications or close currently running ones.
> > > >
> > > > This little crisis may continue for minutes or even longer. I
> > > > think that's not how the system should behave in this situation.
> > > > I believe something must be done about that to avoid this stall.
> > >
> > > Yeah, that's a known problem, made worse by SSDs in fact, as they
> > > are able to keep refaulting the last remaining file pages fast
> > > enough, so there is still apparent progress in reclaim and OOM
> > > doesn't kick in.
> > >
> > > At this point, the likely solution will probably be based on
> > > pressure stall information (PSI) monitoring. I don't know how far
> > > we are from a built-in monitor with reasonable defaults for a
> > > desktop workload, so CCing relevant folks.
> >
> > Another potential approach would be to consider the refault
> > information we already have for file backed pages. Once we start
> > reclaiming only workingset pages then we should be thrashing, right?
> > It cannot be as precise as the cost model which can be defined
> > around PSI, but it might give us at least a fallback measure.
>
> NAK, this does *not* work. Not even as a fallback.
>
> There is no amount of refaults for which you can say whether they are
> a problem or not. It depends on the disk speed (obvious) but also on
> the workload's memory access patterns (somewhat less obvious).
>
> For example, we have workloads whose cache set doesn't quite fit into
> memory, but everything else is pretty much statically allocated and
> it rarely touches any new or one-off filesystem data. So there is
> always a steady rate of mostly uninterrupted refaults; however, most
> data accesses are hitting the cache! And we have fast SSDs that
> compensate for the refaults that do occur. The workload runs
> *completely fine*.

OK, thanks for this example.
I can see how a constant working set refault rate can work properly as
long as that rate stays below the overall IO capacity plus the
allocation demand for other purposes. Thanks!
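For anybody who wants to experiment in the meantime, a minimal
userspace monitor along the lines of the trigger example in
Documentation/accounting/psi.rst could look roughly like the sketch
below. The 150ms-of-stall-per-1s-window "some" threshold is an
arbitrary pick for illustration, not a recommended desktop default,
and this only reports events rather than acting on them:

/*
 * Minimal PSI trigger monitor, modeled on the example in
 * Documentation/accounting/psi.rst. Wakes up whenever the memory
 * "some" stall time exceeds 150ms within any 1s window; a real
 * low-memory daemon would react here instead of just printing.
 */
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	/* threshold 150000us of stall within a 1000000us window */
	const char trig[] = "some 150000 1000000";
	struct pollfd fds;

	fds.fd = open("/proc/pressure/memory", O_RDWR | O_NONBLOCK);
	if (fds.fd < 0) {
		fprintf(stderr, "open: %s\n", strerror(errno));
		return 1;
	}
	fds.events = POLLPRI;

	/* register the trigger; needs a 5.2+ kernel with PSI enabled */
	if (write(fds.fd, trig, strlen(trig) + 1) < 0) {
		fprintf(stderr, "write: %s\n", strerror(errno));
		return 1;
	}

	for (;;) {
		if (poll(&fds, 1, -1) < 0) {
			fprintf(stderr, "poll: %s\n", strerror(errno));
			return 1;
		}
		if (fds.revents & POLLERR) {
			fprintf(stderr, "trigger fd went away\n");
			return 1;
		}
		if (fds.revents & POLLPRI)
			printf("memory pressure event\n");
	}
}

-- 
Michal Hocko
SUSE Labs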