Date: Thu, 15 Aug 2019 19:02:15 +0200
From: Michal Hocko
To: Khalid Aziz
Cc: akpm@linux-foundation.org, vbabka@suse.cz, mgorman@techsingularity.net,
        dan.j.williams@intel.com, osalvador@suse.de, richard.weiyang@gmail.com,
        hannes@cmpxchg.org, arunks@codeaurora.org, rppt@linux.vnet.ibm.com,
        jgg@ziepe.ca, amir73il@gmail.com, alexander.h.duyck@linux.intel.com,
        linux-mm@kvack.org, linux-kernel-mentees@lists.linuxfoundation.org,
        linux-kernel@vger.kernel.org
Subject: Re: [RFC PATCH 0/2] Add predictive memory reclamation and compaction
Message-ID: <20190815170215.GQ9477@dhcp22.suse.cz>
References: <20190813014012.30232-1-khalid.aziz@oracle.com>
 <20190813140553.GK17933@dhcp22.suse.cz>
 <3cb0af00-f091-2f3e-d6cc-73a5171e6eda@oracle.com>
 <20190814085831.GS17933@dhcp22.suse.cz>
In-Reply-To:
On Thu 15-08-19 10:27:26, Khalid Aziz wrote:
> On 8/14/19 2:58 AM, Michal Hocko wrote:
> > On Tue 13-08-19 09:20:51, Khalid Aziz wrote:
> >> On 8/13/19 8:05 AM, Michal Hocko wrote:
> >>> On Mon 12-08-19 19:40:10, Khalid Aziz wrote:
> >>> [...]
> >>>> Patch 1 adds code to maintain a sliding lookback window of (time,
> >>>> number of free pages) points which can be updated continuously,
> >>>> and adds code to compute a best-fit line across these points. It
> >>>> also adds code to use the best-fit lines to determine whether the
> >>>> kernel must start reclamation or compaction.
> >>>>
> >>>> Patch 2 adds code to collect data points on free pages of various
> >>>> orders at different points in time, uses the code in patch 1 to
> >>>> update the sliding lookback window with these points, and kicks
> >>>> off reclamation or compaction based upon the results it gets.
> >>>
> >>> An important piece of information missing in your description is
> >>> why we need to keep that logic in the kernel. In other words, we
> >>> have the background reclaim that acts on a wmark range, and those
> >>> are tunable from userspace. The primary point of this background
> >>> reclaim is to keep balance and prevent direct reclaim. Why can you
> >>> not implement this or any other dynamic trend-watching watchdog
> >>> and tune watermarks accordingly? Something similar applies to
> >>> kcompactd, although we might be lacking a good interface.
> >>>
> >>
> >> Hi Michal,
> >>
> >> That is a very good question. As a matter of fact, the initial
> >> prototype to assess the feasibility of this approach was written in
> >> userspace for a very limited application. We wrote the initial
> >> prototype to monitor fragmentation and used
> >> /sys/devices/system/node/node*/compact to trigger compaction. The
> >> prototype demonstrated this approach has merits.
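For reference, the trend-watching core described in patch 1 (a sliding
lookback window of samples plus a least-squares best-fit line) is simple
enough to sketch in userspace. The class and names below are made up for
illustration, not taken from the patch:

```python
import collections


class TrendWindow:
    """Sliding lookback window of (time, free_pages) samples with a
    least-squares best-fit line, sketching the idea in patch 1."""

    def __init__(self, size=8):
        # deque with maxlen gives the sliding-window behavior for free
        self.samples = collections.deque(maxlen=size)

    def add(self, t, free_pages):
        self.samples.append((t, free_pages))

    def fit(self):
        """Return (slope, intercept) of the best-fit line over the
        window, or None if fewer than two distinct samples exist."""
        n = len(self.samples)
        if n < 2:
            return None
        sx = sum(t for t, _ in self.samples)
        sy = sum(f for _, f in self.samples)
        sxx = sum(t * t for t, _ in self.samples)
        sxy = sum(t * f for t, f in self.samples)
        denom = n * sxx - sx * sx
        if denom == 0:
            return None
        slope = (n * sxy - sx * sy) / denom
        intercept = (sy - slope * sx) / n
        return slope, intercept

    def time_to_exhaustion(self):
        """Predicted time at which free pages hit zero, or None when
        the trend is flat or rising (no exhaustion in sight)."""
        line = self.fit()
        if line is None or line[0] >= 0:
            return None
        slope, intercept = line
        return -intercept / slope  # x where the fitted y reaches 0
```

Feeding it samples of 1000, 900, 800, 700 free pages at t = 0..3 yields
a slope of -100 and a predicted exhaustion time of t = 10, which a
watchdog could compare against how long reclaim typically takes.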
> >>
> >> The primary reason to implement this logic in the kernel is to
> >> make the kernel self-tuning.
> >
> > What makes this particular self-tuning a universal win? In other
> > words, there are many ways to analyze memory pressure and feed it
> > back that I can think of. It is quite likely that very specific
> > workloads would have very specific demands there. I have seen cases
> > where a trivial increase of min_free_kbytes to a normally insane
> > value worked really well for a DB workload, because the wasted
> > memory didn't matter, for example.
>
> Hi Michal,
>
> The problem is not so much whether we have enough knobs available,
> but rather how we tweak them dynamically to avoid allocation stalls.
> Knobs like watermarks and min_free_kbytes are typically set once and
> left alone.

Does anything prevent tuning these knobs more dynamically based on
already exported metrics?

> Allocation stalls show up even on a much smaller scale than large DB
> or cloud platforms. I have seen them on a desktop-class machine
> running a few services in the background. The desktop is running
> gnome3; I would lock the screen and come back to unlock it a day or
> two later. In that time most of memory has been consumed by
> buffer/page cache. Just unlocking the screen can take 30+ seconds
> while the system reclaims pages to be able to swap back in all the
> processes that were inactive so far.

This sounds like a bug to me.

> It is true that different workloads will have different requirements,
> and that is what I am attempting to address here. Instead of tweaking
> the knobs statically based upon one workload's requirements, I am
> looking at the trend of memory consumption instead. A best-fit line
> showing the recent trend can be quite indicative of what the workload
> is doing in terms of memory.

Is there anything preventing you from following that trend from
userspace and triggering background reclaim earlier, so that you never
even get to direct reclaim?
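To make that question concrete: a userspace watchdog could read the
already exported metrics and nudge vm.watermark_scale_factor when free
memory is trending down. A minimal sketch follows; the helper names,
the doubling policy, and the 10% threshold are illustrative assumptions,
not an existing tool (only the /proc paths are real):

```python
import re


def free_kib(meminfo_text):
    """Extract MemFree (in KiB) from /proc/meminfo-style text."""
    m = re.search(r"^MemFree:\s+(\d+)\s+kB", meminfo_text, re.MULTILINE)
    return int(m.group(1)) if m else None


def suggest_scale_factor(history, current=10, low=0.10, total_kib=None):
    """Given recent MemFree samples (KiB, oldest first), suggest a new
    vm.watermark_scale_factor: double it (capped at 1000) while free
    memory is both falling and below `low` * total, otherwise keep the
    current value.  The policy and thresholds are illustrative only."""
    falling = len(history) >= 2 and history[-1] < history[0]
    if falling and total_kib and history[-1] < low * total_kib:
        return min(current * 2, 1000)
    return current


# An actual daemon would loop over the real files, roughly:
#   text = open("/proc/meminfo").read()
#   window.append(free_kib(text))
#   new = suggest_scale_factor(window, current, total_kib=total)
#   open("/proc/sys/vm/watermark_scale_factor", "w").write(str(new))
```

Raising watermark_scale_factor widens the gap at which kswapd wakes, so
background reclaim starts earlier, which is exactly the "trigger before
direct reclaim" behavior under discussion.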
> For instance, a cloud server might be running a certain number of
> instances for a few days, and it can end up using any memory not used
> up by tasks for buffer/page cache. Now the sysadmin gets a request to
> launch another instance, and when they try to do that, the system
> starts to allocate pages and soon runs out of free pages. We are now
> in the direct reclaim path, and it can take a significant amount of
> time to find all the free pages the new task needs. If the kernel
> were watching the memory consumption trend instead, it could see that
> the trend line shows a complete exhaustion of free pages or 100%
> fragmentation in the near future, irrespective of what the workload
> is.

I am confused now. How can an unpredictable action (like a sysadmin
starting a new workload) be handled by watching a memory consumption
history trend? From the above description I would expect the system to
be in a balanced state for a few days before a new instance is
launched. The only reasonable thing to do then is to trigger the
reclaim before the workload is spawned, but then what is the actual
difference between direct reclaim and an early reclaim?

[...]

> > I agree on this point. Is the current set of tuning sufficient?
> > What would be missing if not?
> >
>
> We have a knob available to force compaction immediately. That is
> helpful, and in some cases sysadmins have resorted to forcing
> compaction on all zones before launching a new cloud instance or
> loading a new database. Some admins have resorted to using
> /proc/sys/vm/drop_caches to force buffer/page cache pages to be freed
> up. Either of these solutions causes system load to go up immediately
> while kswapd/kcompactd run to free up and compact pages. This is far
> from ideal. Other knobs available seem to be hard to set correctly,
> especially on servers that run mixed workloads, which results in a
> regular stream of customer complaints coming in about the system
> stalling at the most inopportune times.
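The manual workaround described above is easy to script, which is part
of the argument that this belongs in userspace. The sysfs path and
/proc/sys/vm/compact_memory are the real kernel knobs; the helper
function itself is only an illustrative sketch (it needs root to have
any effect, and is a no-op where the files do not exist):

```python
import glob


def force_compaction(node_glob="/sys/devices/system/node/node*/compact"):
    """Force compaction the way the admins in this thread do: write
    '1' to each per-node compact file.  Writing to
    /proc/sys/vm/compact_memory instead compacts all zones at once.
    Returns the number of nodes successfully poked; requires root."""
    poked = 0
    for path in glob.glob(node_glob):
        try:
            with open(path, "w") as f:
                f.write("1")
            poked += 1
        except OSError:
            # not root, or the knob is absent on this kernel config
            pass
    return poked
```

The cost noted above is visible immediately after running this:
kswapd/kcompactd load spikes while pages are freed and compacted, which
is why doing it reactively before every instance launch is far from
ideal.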
Then let's talk about what is missing in the existing tuning we already
provide. I do agree that compaction needs some love, but I am under the
impression that min_free_kbytes and watermark_*_factor should give a
decent abstraction to control the background reclaim. If that is not
the case, then I am really interested in examples, because I might
easily be missing something there.

Thanks!
--
Michal Hocko
SUSE Labs