Date: Fri, 7 Dec 2018 08:49:54 +0100
From: Michal Hocko
To: Linus Torvalds
Cc: Andrea Arcangeli, mgorman@techsingularity.net, Vlastimil Babka,
    ying.huang@intel.com, s.priebe@profihost.ag,
    Linux List Kernel Mailing, alex.williamson@redhat.com, lkp@01.org,
    David Rientjes, kirill@shutemov.name, Andrew Morton,
    zi.yan@cs.rutgers.edu
Subject: Re: MADV_HUGEPAGE vs. NUMA semantic (was: Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression)
Message-ID: <20181207074954.GR1286@dhcp22.suse.cz>

On Thu 06-12-18 20:31:46, Linus Torvalds wrote:
> [ Oops. different thread for me due to edited subject, so I saw this
> after replying to the earlier email by David ]

Sorry about that, but I really wanted to keep the actual discussion
about semantics clearly separated, because the thread had grown too
large with back and forth that didn't lead anywhere.

> On Thu, Dec 6, 2018 at 1:14 AM Michal Hocko wrote:
> >
> > MADV_HUGEPAGE changes the picture because the caller expressed a need
> > for THP and is willing to go the extra mile to get it.
>
> Actually, I think MADV_HUGEPAGE should just be
> "TRANSPARENT_HUGEPAGE_ALWAYS but only for this vma".

Yes, that is the case, and I didn't want to make the description more
complicated than necessary, so I focused only on the current default.
But historically we have treated defrag=always and MADV_HUGEPAGE the
same.

[...]

> > I believe that something like the below would be sensible:
> > 1) THP on a local node with compaction not giving up too early
> > 2) THP on a remote node in NOWAIT mode - so no direct
> >    compaction/reclaim (trigger kswapd/kcompactd only for
> >    defrag=defer+madvise)
> > 3) fallback to the base page allocation
>
> That doesn't sound insane to me. That said, the numbers David quoted
> do fairly strongly imply that local small-pages are actually preferred
> to any remote THP pages.

As I and others pointed out elsewhere, the remote access penalty is
just a part of the picture and on its own might be quite misleading.
There are other aspects (TLB pressure, page table overhead, etc.) that
might amortize the access latency.
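Just to make sure we are talking about the same fallback chain, here
is an untested userspace mock of what I am proposing above. Every
helper name is made up purely for illustration - none of them is a
real allocator entry point:

/*
 * Illustration only.  thp_local_compact(), thp_remote_nowait() and
 * base_page() are made-up stand-ins for the real THP allocation
 * paths, not kernel interfaces.
 */
#include <stdio.h>
#include <stddef.h>

struct page { int nid; };

/* 1) THP on the local node; direct compaction allowed, but it should
 *    not give up too early. */
static struct page *thp_local_compact(int node)
{
        (void)node;
        return NULL;            /* pretend the local node is fragmented */
}

/* 2) THP on any node in NOWAIT mode: no direct reclaim/compaction,
 *    only wake kswapd/kcompactd for defrag=defer+madvise. */
static struct page *thp_remote_nowait(void)
{
        return NULL;            /* pretend remote nodes are tight as well */
}

/* 3) plain base page with the usual reclaim. */
static struct page *base_page(int node)
{
        static struct page p;

        p.nid = node;
        return &p;
}

static struct page *madv_hugepage_fault(int local_node)
{
        struct page *page;

        page = thp_local_compact(local_node);   /* 1) local THP */
        if (page)
                return page;

        page = thp_remote_nowait();             /* 2) remote THP, cheap only */
        if (page)
                return page;

        return base_page(local_node);           /* 3) fall back to 4k */
}

int main(void)
{
        struct page *page = madv_hugepage_fault(0);

        printf("allocated on node %d\n", page->nid);
        return 0;
}

The important property is that phase 2) never enters direct
reclaim/compaction, so a fragmented remote node cannot stall the
fault.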
> But *that* in turn makes for other possible questions:
>
>  - if the reason we couldn't get a local hugepage is that we're simply
> out of local memory (huge *or* small), then maybe a remote hugepage is
> better.
>
>    Note that this now implies that the choice can be an issue of "did
> the hugepage allocation fail due to fragmentation, or due to the node
> being low on memory"

How exactly do you tell? Many systems are simply low on memory due to
caching. A clean pagecache is quite cheap to reclaim, but it can be
more expensive to fault back in. Do we consider it a viable target?

> and there is the other question that I asked in the other thread
> (before the subject edit):
>
>  - how local is the load to begin with?
>
>    Relatively short-lived processes - or processes that are explicitly
> bound to a node - might have different preferences than some
> long-lived process where the CPU bounces around, and might have
> different trade-offs for the local vs. remote question too.

Agreed.

> So just based on David's numbers, and some wild handwaving on my part,
> a slightly more complex, but still very sensible default might be
> something like
>
>  1) try to do a cheap local node hugepage allocation
>
>     Rationale: everybody agrees this is the best case.
>
>     But if that fails:
>
>  2) look at compacting the local node, but not very hard.
>
>     If there's lots of memory on the local node, but synchronous
> compaction doesn't do anything easily, just fall back to small pages.

Do we reclaim at this stage, or is this mostly a GFP_NOWAIT attempt?

>     Rationale: local memory is generally more important than THP.
>
>     If that fails (i.e. the local node is simply low on memory):
>
>  3) Try to do a remote THP allocation
>
>     Rationale: Ok, we simply didn't have a lot of local memory, so
> it's not just a question of fragmentation. If it *had* been
> fragmentation, lots of small local pages would have been better than a
> remote THP page.
>
>     Oops, remote THP allocation failed (possibly after synchronous
> remote compaction, but maybe this is where we do kcompactd).
>
>  4) Just do any small page, and do reclaim etc. THP isn't happening,
> and it's not a priority when you're starting to feel memory pressure.

If 2) doesn't reclaim heavily (e.g. only tries to reclaim clean page
cache) or even does NOWAIT (which would be even better) then I _think_
this sounds sane.
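For comparison, here is the same kind of made-up-helper mock for the
default you are sketching. Again untested, and whether 2) may touch
clean page cache is exactly the open question above; the genuinely new
part is the explicit fragmentation vs. low-on-memory decision:

/*
 * Illustration only, same convention as the mock earlier in this
 * mail: every helper is a made-up stand-in, not a kernel interface.
 */
#include <stdio.h>
#include <stdbool.h>
#include <stddef.h>

struct page { int nid; };

static struct page pg;

/* 1) cheap local attempt: no reclaim, no compaction */
static struct page *thp_local_cheap(int node) { (void)node; return NULL; }
/* 2) light local compaction, still no heavy reclaim */
static struct page *thp_local_compact_light(int node) { (void)node; return NULL; }
/* 3) THP from any node */
static struct page *thp_any_node(void) { return NULL; }
static struct page *small_page_local(int node) { pg.nid = node; return &pg; }
/* 4) any small page, reclaim allowed */
static struct page *small_page_any_reclaim(void) { pg.nid = -1; return &pg; }
static bool node_low_on_memory(int node) { (void)node; return false; }

static struct page *default_thp_fault(int node)
{
        struct page *page;

        page = thp_local_cheap(node);           /* 1) best case */
        if (page)
                return page;

        page = thp_local_compact_light(node);   /* 2) easy compaction only */
        if (page)
                return page;

        /*
         * Local THP failed.  If the node still has plenty of memory,
         * the failure was fragmentation, and local small pages beat a
         * remote THP.  Only a genuinely low node justifies going remote.
         */
        if (!node_low_on_memory(node))
                return small_page_local(node);

        page = thp_any_node();                  /* 3) remote THP */
        if (page)
                return page;

        return small_page_any_reclaim();        /* 4) THP isn't happening */
}

int main(void)
{
        printf("allocated on node %d\n", default_thp_fault(0)->nid);
        return 0;
}

How hard 2) may try - and whether node_low_on_memory() can be answered
cheaply at all - is where I see the open questions.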
> In general, I really would want to avoid magic kernel command lines
> (or sysfs settings, or whatever) making a huge difference in behavior.
> So I really wish people would see the whole
> 'transparent_hugepage_flags' thing as a way for kernel developers to
> try different settings, not as a way for users to tune their loads.
>
> Our default should work as a sane default; we shouldn't have a "ok,
> let's have this sysfs tunable and let people make their own
> decisions". That's a cop-out.

Agreed. I cannot say I am happy with all the ways THP can be tuned; it
is quite confusing, to say the least.
-- 
Michal Hocko
SUSE Labs