Date: Fri, 28 Dec 2018 17:42:08 +0800
From: Fengguang Wu
To: Michal Hocko
Cc: Andrew Morton, Linux Memory Management List, kvm@vger.kernel.org,
    LKML, Fan Du, Yao Yuan, Peng Dong, Huang Ying, Liu Jingqi,
    Dong Eddie, Dave Hansen, Zhang Yi, Dan Williams
Subject: Re: [RFC][PATCH v2 00/21] PMEM NUMA node and hotness accounting/migration
Message-ID: <20181228094208.7lgxhha34zpqu4db@wfg-t540p.sh.intel.com>
References: <20181226131446.330864849@intel.com>
 <20181227203158.GO16738@dhcp22.suse.cz>
 <20181228050806.ewpxtwo3fpw7h3lq@wfg-t540p.sh.intel.com>
 <20181228084105.GQ16738@dhcp22.suse.cz>
In-Reply-To: <20181228084105.GQ16738@dhcp22.suse.cz>
User-Agent: NeoMutt/20170609 (1.8.3)

On Fri, Dec 28, 2018 at 09:41:05AM +0100, Michal Hocko wrote:
>On Fri 28-12-18 13:08:06, Wu Fengguang wrote:
>[...]
>> Optimization: do hot/cold page tracking and migration
>> =====================================================
>>
>> Since PMEM is slower than DRAM, we need to make sure hot pages go to
>> DRAM and cold pages stay in PMEM, to get the best out of PMEM and DRAM.
>>
>> - DRAM=>PMEM cold page migration
>>
>>   It can be done in the kernel page reclaim path, near the anonymous
>>   page swap out point. Instead of swapping out, we now have the option
>>   to migrate cold pages to PMEM NUMA nodes.
>
>OK, this makes sense to me except I am not sure this is something that
>should be pmem specific. Is there any reason why we shouldn't migrate
>pages on memory pressure to other nodes in general? In other words,
>rather than paging out we would migrate over to the next node that is
>not under memory pressure. Swapout would be the next level when the
>memory is (almost) fully utilized. That wouldn't be pmem specific.

In the future there could be multiple memory levels with different
performance/size/cost metrics. There is ongoing HMAT work to describe
that. When it is ready, we can switch to the HMAT-based general
infrastructure. Then the code will no longer be PMEM specific, but will
do general promotion/demotion migrations between higher and lower
memory levels. Swapout could then happen from the lowest memory level.

Migration between peer nodes is the obvious simple way and a good
choice for the initial implementation. But yes, it's possible to
migrate to other nodes as well. For example, it could be combined with
NUMA balancing: if we know a page is mostly accessed by the other
socket, it would be best to migrate hot/cold pages directly to that
socket.
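
To make the reclaim-time demotion idea more concrete, here is a rough
sketch (illustration only, not the actual patches) of what a hook near
the anonymous page swap-out point could look like. pmem_node_of() and
new_page_on_node() are hypothetical helpers, and the migrate reason is
just a placeholder:

#include <linux/mm.h>
#include <linux/gfp.h>
#include <linux/migrate.h>

/* Hypothetical allocation callback: allocate the new page on the PMEM node. */
static struct page *new_page_on_node(struct page *page, unsigned long nid)
{
        return alloc_pages_node(nid, GFP_HIGHUSER_MOVABLE, 0);
}

/*
 * Sketch: called where an anonymous page (already isolated from the
 * LRU) would otherwise be swapped out.  Returns true if the page was
 * sent to a PMEM node instead.
 */
static bool try_demote_to_pmem(struct page *page, int src_nid)
{
        /* Hypothetical: the PMEM node sitting below this DRAM node. */
        int pmem_nid = pmem_node_of(src_nid);
        LIST_HEAD(pages);

        if (pmem_nid == NUMA_NO_NODE)
                return false;           /* no PMEM target, swap out */

        list_add(&page->lru, &pages);

        /* MR_NUMA_MISPLACED is only a placeholder reason here. */
        if (migrate_pages(&pages, new_page_on_node, NULL, pmem_nid,
                          MIGRATE_ASYNC, MR_NUMA_MISPLACED))
                return false;           /* migration failed, swap out */

        return true;
}
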
>> User space may also do it, however it cannot act on demand when
>> there is memory pressure in the DRAM nodes.
>>
>> - PMEM=>DRAM hot page migration
>>
>>   While LRU can be good enough for identifying cold pages, frequency
>>   based accounting can be more suitable for identifying hot pages.
>>
>>   Our design choice is to create a flexible user space daemon to
>>   drive the accounting and migration, with the necessary kernel
>>   support added by this patchset.
>
>We do have numa balancing, why can't we rely on it? This along with the
>above would allow having pmem numa nodes (cpuless nodes in fact)
>without any special casing, as a natural part of the MM. It would only
>be a matter of configuration to set the appropriate distance to allow a
>reasonable allocation fallback strategy.

Good question. We actually tried reusing the NUMA balancing mechanism
to do page-fault triggered migration: move_pages() would only call
change_prot_numa(), leaving the actual migration to the NUMA balancing
page faults. It turns out the two migration types have different
purposes (one is about hotness, the other about the home node) and
hence different implementation details. We ended up modifying quite a
few bits of the NUMA balancing logic -- removing rate limiting,
changing the target node logic, etc.

Those looked like unnecessary complexities for this posting. This v2
patchset mainly fulfills our first milestone goal: a minimal viable
solution that's relatively clean to backport. Even when preparing new
upstreamable versions, it may be good to keep things simple for the
initial upstream inclusion.
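
To illustrate the user space side mentioned above: once the daemon has
decided which pages of a process are hot, promoting them is in essence
a batched move_pages(2) call. A rough sketch only, not the actual
daemon -- the pid, page addresses and DRAM node number would come from
the accounting step and are made up here (link with -lnuma):

#include <numaif.h>             /* move_pages(), MPOL_MF_MOVE */
#include <stdio.h>

/*
 * Promote a batch of hot pages of "pid" to one DRAM node.
 * "pages" holds one address inside each hot page; "status" receives
 * the per-page result from the kernel.
 */
static long promote_hot_pages(int pid, void **pages, unsigned long count,
                              int dram_node)
{
        int nodes[count];
        int status[count];
        long ret;

        for (unsigned long i = 0; i < count; i++)
                nodes[i] = dram_node;   /* target DRAM node for every page */

        ret = move_pages(pid, count, pages, nodes, status, MPOL_MF_MOVE);
        if (ret < 0)
                perror("move_pages");

        return ret;
}
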
>I haven't looked at the implementation yet but if you are proposing a
>special cased zone lists then this is something CDM (Coherent Device
>Memory) was trying to do two years ago and there was quite some
>skepticism in the approach.

It looks like we are pretty different from CDM. :) We create new NUMA
nodes rather than a new ZONE as CDM did. The zonelists modification is
just to make the PMEM nodes more separated in the allocation fallback
order.

Thanks,
Fengguang
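
P.S. One nice side effect of exposing PMEM as plain (CPU-less) NUMA
nodes rather than a special zone: the existing NUMA APIs already work
on them. As a made-up example (node id 2 is hypothetical), user space
could pin known-cold data to a PMEM node with a standard mbind() call,
no PMEM-specific interface needed:

#include <numaif.h>
#include <sys/mman.h>

int main(void)
{
        size_t len = 1UL << 30;                 /* 1GB of cold data */
        unsigned long pmem_node = 2;            /* made-up PMEM node id */
        unsigned long nodemask = 1UL << pmem_node;

        void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED)
                return 1;

        /* Bind the range to the PMEM node with the ordinary NUMA API. */
        if (mbind(buf, len, MPOL_BIND, &nodemask,
                  sizeof(nodemask) * 8, 0))
                return 1;

        return 0;
}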