Date: Fri, 28 Dec 2018 13:08:06 +0800
From: Fengguang Wu
To: Michal Hocko
Cc: Andrew Morton, Linux Memory Management List, kvm@vger.kernel.org,
    LKML, Fan Du, Yao Yuan, Peng Dong, Huang Ying, Liu Jingqi,
    Dong Eddie, Dave Hansen, Zhang Yi, Dan Williams
Subject: Re: [RFC][PATCH v2 00/21] PMEM NUMA node and hotness accounting/migration
Message-ID: <20181228050806.ewpxtwo3fpw7h3lq@wfg-t540p.sh.intel.com>
References:
<20181226131446.330864849@intel.com> <20181227203158.GO16738@dhcp22.suse.cz>
In-Reply-To: <20181227203158.GO16738@dhcp22.suse.cz>
User-Agent: NeoMutt/20170609 (1.8.3)

On Thu, Dec 27, 2018 at 09:31:58PM +0100, Michal Hocko wrote:
>On Wed 26-12-18 21:14:46, Wu Fengguang wrote:
>> This is an attempt to use NVDIMM/PMEM as volatile NUMA memory that's
>> transparent to normal applications and virtual machines.
>>
>> The code is still in active development. It's provided for early
>> design review.
>
>So can we get a high level description of the design and expected
>usecases please?

Good question.

Use cases
=========

The general use case is to use PMEM as slower but cheaper "DRAM".
The suitable workloads are

- workloads that care about memory size more than bandwidth/latency
- workloads with a set of warm/cold pages that don't change rapidly
  over time
- low cost VMs/containers

Foundation: create PMEM NUMA nodes
==================================

To create PMEM nodes in a native kernel, Dave Hansen and Dan Williams
have working patches for the kernel and ndctl. According to Ying, it
will work like this:

        ndctl destroy-namespace -f namespace0.0
        ndctl destroy-namespace -f namespace1.0
        ipmctl create -goal MemoryMode=100
        reboot

To create PMEM nodes in QEMU VMs, current Debian/Fedora etc.
distros already support this:

        qemu-system-x86_64
        -machine pc,nvdimm
        -enable-kvm
        -smp 64
        -m 256G
        # DRAM node 0
        -object memory-backend-file,size=128G,share=on,mem-path=/dev/shm/qemu_node0,id=tmpfs-node0
        -numa node,cpus=0-31,nodeid=0,memdev=tmpfs-node0
        # PMEM node 1
        -object memory-backend-file,size=128G,share=on,mem-path=/dev/dax1.0,align=128M,id=dax-node1
        -numa node,cpus=32-63,nodeid=1,memdev=dax-node1

Optimization: do hot/cold page tracking and migration
=====================================================

Since PMEM is slower than DRAM, we need to make sure hot pages go to
DRAM and cold pages stay in PMEM, to get the best out of both.

- DRAM=>PMEM cold page migration

  It can be done in the kernel page reclaim path, near the anonymous
  page swap-out point. Instead of swapping out, we now have the option
  to migrate cold pages to PMEM NUMA nodes. User space may also do it,
  but it cannot act on demand when there is memory pressure in the
  DRAM nodes.

- PMEM=>DRAM hot page migration

  While LRU can be good enough for identifying cold pages, frequency
  based accounting is more suitable for identifying hot pages.

  Our design choice is to create a flexible user space daemon to drive
  the accounting and migration, with the necessary kernel support
  provided by this patchset.

The Linux kernel already offers move_pages(2) for user space to
migrate pages to specified NUMA nodes. The major gap lies in hotness
accounting.

User space driven hotness accounting
====================================

One way to find hot/cold pages is to scan the page tables multiple
times and collect the "accessed" bits. We created the kvm-ept-idle
kernel module to provide the "accessed" bits via the interface
/proc/PID/idle_pages. User space can open it and read the "accessed"
bits for a range of virtual addresses.
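For illustration, decoding such a per-page accessed bitmap on the user
space side might look like the sketch below. The one-bit-per-page,
LSB-first packing is an assumption made purely for this example; the
real /proc/PID/idle_pages encoding is whatever the kvm-ept-idle module
defines.

```python
PAGE_SIZE = 4096

def accessed_pages(bitmap, start_va):
    """Yield virtual addresses of pages whose "accessed" bit is set.

    bitmap: bytes read from an idle_pages-style interface, assumed
    here (for illustration only) to pack one bit per page, LSB first.
    start_va: virtual address corresponding to the first bit.
    """
    for byte_index, byte in enumerate(bitmap):
        for bit in range(8):
            if byte & (1 << bit):
                page_index = byte_index * 8 + bit
                yield start_va + page_index * PAGE_SIZE
```

A daemon would feed the addresses collected this way into its
per-page access counters across scan rounds.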
Inside the kernel module, there are two independent sets of page table
scan code, seamlessly providing the same interface:

- for QEMU, scan the HVA range of the VM's EPT (Extended Page Table)
- for others, scan the VA range of the process page table

With /proc/PID/idle_pages and move_pages(2), the user space daemon
can work like this.

One round of scan+migration:

        loop N=(3-10) times:
                sleep 0.01-10s (typical values)
                scan page tables and read/accumulate accessed bits into arrays
        treat pages with accessed_count == N as hot pages
        treat pages with accessed_count == 0 as cold pages
        migrate hot pages to DRAM nodes
        migrate cold pages to PMEM nodes (optional, may do it once
        every several scan rounds, to make sure they are really cold)

That just describes the bare minimal working model. A real world
daemon needs to consider a lot more to be useful and robust. The
notable issue is thrashing.

Hotness accounting can be rough and workloads can be unstable. We
need to avoid promoting a warm page to DRAM and then demoting it soon
after. The basic scheme is to auto-control the scan interval and
count, so that each round of scanning finds less than 1/2 the DRAM
size of hot pages. We may also do multiple rounds of scans before a
migration, to filter out unstable/bursty accesses.

In the long run, most of the accounted hot pages will already be in
DRAM, so only the new ones need to be migrated there. When doing so,
we should apply QoS and rate limiting to reduce the impact on user
workloads.

When user space drives hot page migration, the DRAM nodes may well be
under pressure, which in turn triggers in-kernel cold page migration.
The above "1/2 DRAM size" hot page target helps the kernel easily
find cold pages during LRU scans.

To avoid thrashing, it's also important to maintain consistent kernel
and user space views of hot/cold pages, since they migrate pages in
two different directions:
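The accumulate-and-classify step of the loop above can be sketched as
follows. The function names, the set-based input format, and the
auto-control heuristic are illustrative assumptions, not part of the
patchset:

```python
def classify(pages, scans):
    """Classify pages after N scan rounds.

    pages: all page addresses covered by the scans.
    scans: list of N sets; each set holds the pages whose accessed
           bit was seen set in that round.
    Returns (hot, cold): pages accessed in every round are hot, pages
    accessed in no round are cold; everything else is left in place.
    """
    n = len(scans)
    counts = {p: sum(p in s for s in scans) for p in pages}
    hot = {p for p, c in counts.items() if c == n}
    cold = {p for p, c in counts.items() if c == 0}
    return hot, cold

def next_scan_rounds(n, hot_bytes, dram_bytes):
    """Auto-control sketch: if this round found more hot memory than
    half of DRAM, scan more rounds next time to tighten the hot set."""
    return n + 1 if hot_bytes > dram_bytes // 2 else n
```

The daemon would then pass the hot set to move_pages(2) targeting
DRAM nodes, and (once every several rounds) the cold set targeting
PMEM nodes.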
- the regular page table scans will clear the PMD/PTE young bits
- user space compensates for that by setting PG_referenced via
  move_pages(hot pages, MPOL_MF_SW_YOUNG)

That guarantees the user space collected view of hot pages will be
conveyed to the kernel.

Regards,
Fengguang