Date: Thu, 26 Jul 2018 15:19:27 +0200
From: Michal Hocko
To: Cannon Matthews
Cc: mike.kravetz@oracle.com, akpm@linux-foundation.org,
    willy@infradead.org, linux-mm@kvack.org,
    linux-kernel@vger.kernel.org, Andres Lagar-Cavilla, Salman Qazi,
    Paul Turner, David Matlack, Peter Feiner, Alain Trinh,
    ying.huang@intel.com
Subject: Re: [PATCH v2] RFC: clear 1G pages with streaming stores on x86
Message-ID: <20180726131927.GK28386@dhcp22.suse.cz>
References: <20180724210923.GA20168@bombadil.infradead.org>
 <20180725023728.44630-1-cannonmatthews@google.com>
 <20180725125741.GL28386@dhcp22.suse.cz>
On Wed 25-07-18 10:55:40, Cannon Matthews wrote:
> On Wed, Jul 25, 2018 at 5:57 AM Michal Hocko wrote:
> >
> > [Cc Huang]
> >
> > On Tue 24-07-18 19:37:28, Cannon Matthews wrote:
> > > Reimplement clear_gigantic_page() to clear gigantic pages using
> > > the non-temporal streaming store instructions that bypass the
> > > cache (movnti), since an entire 1GiB region will not fit in the
> > > cache anyway.
> > >
> > > Doing an mlock() on a 512GiB 1G-hugetlb region previously would
> > > take on average 134 seconds, about 260ms/GiB, which is quite slow.
> > > Using `movnti` and optimizing the control flow over the
> > > constituent small pages, this can be improved roughly by a factor
> > > of 3-4x, with the 512GiB mlock() taking only 34 seconds on
> > > average, or 67ms/GiB.
> > >
> > > The assembly code for the __clear_page_nt routine is more or less
> > > taken directly from the output of gcc with -O3 for this function,
> > > with some tweaks to support arbitrary sizes and moving memory
> > > barriers:
> > >
> > > void clear_page_nt_64i (void *page)
> > > {
> > >   for (int i = 0; i < GiB /sizeof(long long int); ++i)
> > >   {
> > >     _mm_stream_si64 (((long long int*)page) + i, 0);
> > >   }
> > >   sfence();
> > > }
> > >
> > > In general I would love to hear any thoughts and feedback on this
> > > approach and any ways it could be improved.
> >
> > Well, I like it. In fact 2MB pages are in a similar situation even
> > though they fit into the cache, so the problem is not that pressing.
> > Anyway, if you have a standard DB workload which simply preallocates
> > large hugetlb shared files then it would help. Huang has gone a
> > different direction in c79b57e462b5 ("mm: hugetlb: clear target
> > sub-page last when clearing huge page") and I was asking about using
> > the mechanism you are proposing back then:
> > http://lkml.kernel.org/r/20170821115235.GD25956@dhcp22.suse.cz
> > I've got an explanation,
> > http://lkml.kernel.org/r/87h8x0whfs.fsf@yhuang-dev.intel.com
> > which hasn't really satisfied me, but I didn't really want to block
> > the obvious optimization. A similar approach has been proposed for
> > GB pages IIRC, but I do not see it in linux-next so I am not sure
> > what happened with it.
> >
> > Is there any reason to use a different scheme for GB and 2MB pages?
> > Why don't we settle on movnti for both? The first access will be a
> > miss but I am not really sure it matters all that much.
>
> My only hesitation is that while the benefits of doing it faster seem
> obvious at a 1GiB granularity, things become more subtle at 2M, and
> those pages are used much more frequently, where the negative impacts
> of this approach could outweigh the benefits.

Well, one would expect that even 2M huge pages would be long lived. And
that is usually the case for hugetlb pages, which are usually
preallocated and pre-faulted/initialized during startup.

> Not that that is actually the case, but I am not familiar enough to be
> confident proposing that, especially when it gets into the stuff in
> that response you linked about synchronous RAM loads and such.
>
> With the right benchmarking we could certainly motivate it one way or
> the other, but I wouldn't know where to begin to do so in a robust
> enough way.
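For reference, the snippet in the quoted changelog above is shorthand
rather than compilable C (GiB is undefined and sfence() is not a C
function). A self-contained userspace sketch of the same movnti-based
clear, built on the SSE2 intrinsics, could look like the following;
clear_region_nt, GIB, and the main() harness are illustrative names,
not taken from the patch:

#include <immintrin.h>
#include <stdlib.h>

#define GIB (1ULL << 30)

/* Zero a region with non-temporal stores (movnti), bypassing the
 * cache. Assumes size is a multiple of sizeof(long long). */
static void clear_region_nt(void *region, size_t size)
{
	long long *p = region;
	size_t i;

	for (i = 0; i < size / sizeof(long long); ++i)
		_mm_stream_si64(p + i, 0);

	/* Streaming stores are weakly ordered; fence before anything
	 * can observe the cleared memory. */
	_mm_sfence();
}

int main(void)
{
	void *buf = aligned_alloc(4096, GIB);

	if (!buf)
		return 1;
	clear_region_nt(buf, GIB);
	free(buf);
	return 0;
}

Note that _mm_stream_si64 is only available in 64-bit mode, so this
needs an x86-64 target (e.g. gcc -O2 nt_clear.c).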
>
> For the first access being a miss, there is the suggestion that Robert
> Elliot had above of doing a normal caching store on the sub-page that
> contains the faulting address, as an optimization to avoid that.
> Perhaps that would be enough.

Well, currently we are initializing pages towards the faulting address
(from both ends). Extending that to non-temporal mov shouldn't be hard.

-- 
Michal Hocko
SUSE Labs
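Below is a minimal sketch of the clearing order Michal refers to (the
c79b57e462b5 scheme: sub-pages are initialized from both ends towards
the faulting sub-page, which is cleared last so it is the most recently
touched); SUBPAGES, clear_subpage(), and clear_huge_page_order() are
illustrative stand-ins, not kernel APIs:

#include <stdio.h>

#define SUBPAGES 8	/* tiny example; a 2MB page has 512 4K sub-pages */

/* Stand-in for clearing one sub-page; prints the order instead. */
static void clear_subpage(int idx)
{
	printf("clearing sub-page %d\n", idx);
}

static void clear_huge_page_order(int target)
{
	int l, r;

	/* Walk towards the target sub-page from both ends. */
	for (l = 0, r = SUBPAGES - 1; l < target || r > target; ) {
		if (l < target)
			clear_subpage(l++);
		if (r > target)
			clear_subpage(r--);
	}
	/* The faulting sub-page goes last so it stays cache-hot. */
	clear_subpage(target);
}

int main(void)
{
	clear_huge_page_order(5);
	return 0;
}

Extending this to non-temporal stores, as discussed above, would mean
using movnti inside clear_subpage() for everything except the target
sub-page, which (per Robert Elliot's suggestion) would keep a normal
caching store.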