Date: Tue, 16 Apr 2019 09:47:14 +0200
From: Michal Hocko
To: Yang Shi
Cc: mgorman@techsingularity.net, riel@surriel.com, hannes@cmpxchg.org,
    akpm@linux-foundation.org, dave.hansen@intel.com, keith.busch@intel.com,
    dan.j.williams@intel.com, fengguang.wu@intel.com, fan.du@intel.com,
    ying.huang@intel.com, ziy@nvidia.com, linux-mm@kvack.org,
    linux-kernel@vger.kernel.org
Subject: Re: [v2 RFC PATCH 0/9] Another Approach to Use PMEM as NUMA Node
Message-ID: <20190416074714.GD11561@dhcp22.suse.cz>
References: <1554955019-29472-1-git-send-email-yang.shi@linux.alibaba.com>
 <20190412084702.GD13373@dhcp22.suse.cz>

On Mon 15-04-19 17:09:07, Yang Shi wrote:
> 
> 
> On 4/12/19 1:47 AM, Michal Hocko wrote:
> > On Thu 11-04-19 11:56:50, Yang Shi wrote:
> > [...]
> > > Design
> > > ======
> > > Basically, the approach is aimed to spread data from DRAM (closest to the
> > > local CPU) down further to PMEM and disk (typically assuming the lower
> > > tier storage is slower, larger and cheaper than the upper tier) by their
> > > hotness. The patchset tries to achieve this goal by doing memory
> > > promotion/demotion via NUMA balancing and memory reclaim, as the diagram
> > > below shows:
> > >
> > >     DRAM <--> PMEM <--> Disk
> > >       ^                   ^
> > >       |-------------------|
> > >                swap
> > >
> > > When DRAM has memory pressure, demote pages to PMEM via the page reclaim
> > > path. Then NUMA balancing will promote pages back to DRAM as long as the
> > > page is referenced again. Memory pressure on the PMEM node would push its
> > > inactive pages out to disk via swap.
> > >
> > > The promotion/demotion happens only between "primary" nodes (the nodes
> > > that have both CPU and memory) and PMEM nodes: no promotion/demotion
> > > between PMEM nodes, no promotion from DRAM to PMEM, and no demotion from
> > > PMEM to DRAM.
> > >
> > > The HMAT is effectively going to enforce "cpu-less" nodes for any memory
> > > range that has differentiated performance from the conventional memory
> > > pool, or differentiated performance for a specific initiator, per Dan
> > > Williams. So, assuming PMEM nodes are cpuless nodes sounds reasonable.
> > >
> > > However, cpuless nodes might not be PMEM nodes. But, actually, memory
> > > promotion/demotion doesn't care what kind of memory the target nodes are;
> > > they could be DRAM, PMEM or something else, as long as they are second
> > > tier memory (slower, larger and cheaper than regular DRAM), otherwise it
> > > sounds pointless to do such demotion.
> > >
> > > Defined an "N_CPU_MEM" nodemask for the nodes which have both CPU and
> > > memory, in order to distinguish them from cpuless nodes (memory only,
> > > i.e. PMEM nodes) and memoryless nodes (some architectures, e.g. Power,
> > > may have memoryless nodes). Typically, memory allocation would happen on
> > > such nodes by default unless cpuless nodes are specified explicitly;
> > > cpuless nodes would just be fallback nodes, so the N_CPU_MEM nodes are
> > > also known as "primary" nodes in this patchset. With a two tier memory
> > > system (i.e. DRAM + PMEM), this sounds good enough to demonstrate the
> > > promotion/demotion approach for now, and it looks more
> > > architecture-independent. But it may be better to construct such a node
> > > mask by reading hardware information (e.g. HMAT), particularly for more
> > > complex memory hierarchies.
> >
> > I still believe you are overcomplicating this without a strong reason.
> > Why cannot we start simple and build from there? In other words I do not
> > think we really need anything like N_CPU_MEM at all.
> 
> In this patchset N_CPU_MEM is used to tell us which nodes are cpuless nodes.
> They would be the preferred demotion target. Of course, we could rely on
> firmware to just demote to the next best node, but it may be a "preferred"
> node; if so I don't see too much benefit achieved by demotion. Am I missing
> anything?

Why cannot we simply demote in the proximity order? Why do you make
cpuless nodes so special? If other close nodes are vacant then just use
them.
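
[Purely illustrative sketch, not code from this patchset: one way "demote in
the proximity order" could look, walking memory nodes by NUMA distance. The
helper name pick_demotion_target() and the cpuless-only filter are assumptions
made for the example.]

#include <linux/kernel.h>
#include <linux/nodemask.h>
#include <linux/numa.h>
#include <linux/topology.h>

static int pick_demotion_target(int from_nid)
{
	int nid, best_nid = NUMA_NO_NODE;
	int best_dist = INT_MAX;

	/* Walk every node that has memory and pick the closest candidate. */
	for_each_node_state(nid, N_MEMORY) {
		int dist;

		if (nid == from_nid)
			continue;
		/*
		 * The patchset restricts targets to cpuless ("non-primary")
		 * nodes; dropping this check gives the plain proximity-order
		 * behaviour being asked about above.
		 */
		if (node_state(nid, N_CPU))
			continue;
		dist = node_distance(from_nid, nid);
		if (dist < best_dist) {
			best_dist = dist;
			best_nid = nid;
		}
	}

	/* NUMA_NO_NODE: no suitable target, fall back to regular reclaim. */
	return best_nid;
}

[Whether the N_CPU filter stays or goes is exactly the disagreement in the
thread; distance-only selection is the simpler variant suggested here.]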
> > I would expect that the very first attempt wouldn't do much more than
> > migrate to-be-reclaimed pages (without an explicit binding) with a
> 
> Do you mean respect mempolicy or cpuset when doing demotion? I was wondering
> about this, but I didn't do so in the current implementation since it may
> need to walk the rmap to retrieve the mempolicy in the reclaim path. Is
> there any easier way to do so?

You definitely have to follow policy. You cannot demote to a node which
is outside of the cpuset/mempolicy because you are breaking the contract
expected by userspace. That implies doing a rmap walk.

> > I would also not touch the numa balancing logic at this stage and rather
> > see how the current implementation behaves.
> 
> I agree we would prefer to start from something simpler and see how it works.
> 
> The "twice access" optimization is aimed at reducing the PMEM bandwidth
> burden since PMEM bandwidth is a scarce resource. I did compare "twice
> access" to "no twice access"; it does save a lot of bandwidth for some
> once-off access patterns. For example, when running a stress test with
> mmtests' usemem-stress-numa-compact, the kernel would promote ~600,000 pages
> with "twice access" in 4 hours, but it would promote ~80,000,000 pages
> without "twice access".

I presume this is a result of a synthetic workload, right? Or do you have
any numbers for a real life use case?
-- 
Michal Hocko
SUSE Labs
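
[Again purely illustrative: a minimal sketch of the "twice access" promotion
filter discussed above, i.e. only promote a PMEM page after NUMA balancing has
seen it faulted twice. page_test_and_set_hinted() and page_clear_hinted() are
made-up helpers for the example, not the patchset's actual mechanism.]

/*
 * Return true only on the second NUMA hint fault for this page, so that
 * one-off accesses never pay the PMEM -> DRAM migration (and bandwidth) cost.
 */
static bool should_promote(struct page *page)
{
	/* First fault: remember the access and leave the page on PMEM. */
	if (!page_test_and_set_hinted(page))
		return false;

	/* Second fault: the page looks genuinely hot, promote it. */
	return true;
}

[A demotion path using this scheme would call page_clear_hinted() when a page
is pushed back down, so it has to prove itself hot again before the next
promotion.]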