Received: by 2002:ad5:474a:0:0:0:0:0 with SMTP id i10csp5592281imu; Wed, 26 Dec 2018 05:40:15 -0800 (PST) X-Google-Smtp-Source: ALg8bN7JyICG0En07Hw4O/3jLWL98Y6Q0dqM7eKVeA+AQUuw4B7l2TGFTcGcza8W9/g5Xpj3SnW2 X-Received: by 2002:a63:f811:: with SMTP id n17mr19444203pgh.23.1545831615083; Wed, 26 Dec 2018 05:40:15 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1545831615; cv=none; d=google.com; s=arc-20160816; b=NoYo6Z2XQ1++nKzVy7TJX5zHubc+xsvfoRNfZ87AhLtoqqVrBhWWNcax6gExC3U8O1 eWUGP0rPuyTANfd90cWbjCQO8FPINxPCWAcAWb327DRabHhPHehM7t0DvFlbHNTW4/lD nH36ya/nswxpFe4nu/z/YVbeJ04UAp+FT3Ula3MTcnqSAGDII/VS0S5Y7NyZKGNS8DPU qzhZhYEdXYHF4VN1tvPA/kz2frqr2sHbOBp+/to/pgD3J2C/fPJlU6CTC+Db5LEhNt15 nwGCGrSJRXUkm7TCMsJVdIa3ZRkYOLetygeEekByh24ncd23obWdmjdOSjA2Z4o7x0bd lhdQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:subject:cc:cc:cc:cc:cc:cc:cc:cc:cc:cc:cc :cc:cc:to:from:date:user-agent:message-id; bh=Mw9KZQ4orLVUVMOyOnjvBYlMWV4ZhCODP6p3UHosjp4=; b=epF8JNk3P4QA4hwgsg6NPdNUFm4J7EuUkpEQLgnBSzPxmYeb5UYloHQ53pQkgkuW3G LnvgbLhvXeZDAp+FVm9YcxU+jsFypALZITkWxeLfqpGYAXmdWVNIQ9M6hSjm9oHv4gaf cnGY3QN6aBBUwMbsaoydLGEnt9TP7tHk00rKJaxaBd2+xi0yhrSKxu89JzxOK94xe1fg 94+WWKPnerd/PfDeSL3TSMI7ANSxgSorQLk3P3Q1xGo4zQaXzLvhd71P4z0ChrdxEWOJ hYrGOG0SzGFDj1JgneqBc9DIBhOZWAX2BDmDouehO6Jq1r8HKMwRHL5GoHNWkXSmroyg yw+w== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=intel.com Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id d13si32285219pgu.40.2018.12.26.05.40.00; Wed, 26 Dec 2018 05:40:15 -0800 (PST) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=intel.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727246AbeLZNi3 (ORCPT + 99 others); Wed, 26 Dec 2018 08:38:29 -0500 Received: from mga06.intel.com ([134.134.136.31]:21292 "EHLO mga06.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726937AbeLZNhH (ORCPT ); Wed, 26 Dec 2018 08:37:07 -0500 X-Amp-Result: SKIPPED(no attachment in message) X-Amp-File-Uploaded: False Received: from orsmga003.jf.intel.com ([10.7.209.27]) by orsmga104.jf.intel.com with ESMTP/TLS/DHE-RSA-AES256-GCM-SHA384; 26 Dec 2018 05:37:06 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.56,400,1539673200"; d="scan'208";a="113358939" Received: from wangdan1-mobl1.ccr.corp.intel.com (HELO wfg-t570.sh.intel.com) ([10.254.210.154]) by orsmga003.jf.intel.com with ESMTP; 26 Dec 2018 05:37:01 -0800 Received: from wfg by wfg-t570.sh.intel.com with local (Exim 4.89) (envelope-from ) id 1gc9Mr-0005Ns-5X; Wed, 26 Dec 2018 21:37:01 +0800 Message-Id: <20181226131446.330864849@intel.com> User-Agent: quilt/0.65 Date: Wed, 26 Dec 2018 21:14:46 +0800 From: Fengguang Wu To: Andrew Morton cc: Linux Memory Management List cc: kvm@vger.kernel.org Cc: LKML cc: Fan Du cc: Yao Yuan cc: Peng Dong cc: Huang Ying CC: Liu Jingqi cc: Dong Eddie cc: Dave Hansen cc: Zhang Yi cc: Dan Williams cc: Fengguang Wu Subject: [RFC][PATCH v2 00/21] PMEM NUMA node and hotness accounting/migration Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org This is an attempt to use NVDIMM/PMEM as volatile NUMA memory that's transparent to normal applications and virtual machines. The code is still in active development. It's provided for early design review. Key functionalities: 1) create and describe PMEM NUMA node for NVDIMM memory 2) dumb /proc/PID/idle_pages interface, for user space driven hot page accounting 3) passive kernel cold page migration in page reclaim path 4) improved move_pages() for active user space hot/cold page migration (1) is foundation for transparent usage of NVDIMM for normal apps and virtual machines. (2-4) enable auto placing hot pages in DRAM for better performance. A user space migration daemon is being built based on this kernel patchset to make the full vertical solution. Base kernel is v4.20 . The patches are not suitable for upstreaming in near future -- some are quick hacks, some others need more works. However they are complete enough to demo the necessary kernel changes for the proposed app&VM transparent NVDIMM volatile use model. The interfaces are far from finalized. They kind of illustrate what would be necessary for creating a user space driven solution. The exact forms will ask for more thoughts and inputs. We may adopt HMAT based solution for NUMA node related interface when they are ready. The /proc/PID/idle_pages interface is standalone but non-trivial. Before upstreaming some day, it's expected to take long time to collect various real use cases and feedbacks, so as to refine and stabilize the format. Create PMEM numa node [PATCH 01/21] e820: cheat PMEM as DRAM Mark numa node as DRAM/PMEM [PATCH 02/21] acpi/numa: memorize NUMA node type from SRAT table [PATCH 03/21] x86/numa_emulation: fix fake NUMA in uniform case [PATCH 04/21] x86/numa_emulation: pass numa node type to fake nodes [PATCH 05/21] mmzone: new pgdat flags for DRAM and PMEM [PATCH 06/21] x86,numa: update numa node type [PATCH 07/21] mm: export node type {pmem|dram} under /sys/bus/node Point neighbor DRAM/PMEM to each other [PATCH 08/21] mm: introduce and export pgdat peer_node [PATCH 09/21] mm: avoid duplicate peer target node Standalone zonelist for DRAM and PMEM nodes [PATCH 10/21] mm: build separate zonelist for PMEM and DRAM node Keep page table pages in DRAM [PATCH 11/21] kvm: allocate page table pages from DRAM [PATCH 12/21] x86/pgtable: allocate page table pages from DRAM /proc/PID/idle_pages interface for virtual machine and normal tasks [PATCH 13/21] x86/pgtable: dont check PMD accessed bit [PATCH 14/21] kvm: register in mm_struct [PATCH 15/21] ept-idle: EPT walk for virtual machine [PATCH 16/21] mm-idle: mm_walk for normal task [PATCH 17/21] proc: introduce /proc/PID/idle_pages [PATCH 18/21] kvm-ept-idle: enable module Mark hot pages [PATCH 19/21] mm/migrate.c: add move_pages(MPOL_MF_SW_YOUNG) flag Kernel DRAM=>PMEM migration [PATCH 20/21] mm/vmscan.c: migrate anon DRAM pages to PMEM node [PATCH 21/21] mm/vmscan.c: shrink anon list if can migrate to PMEM arch/x86/include/asm/numa.h | 2 arch/x86/include/asm/pgalloc.h | 10 arch/x86/include/asm/pgtable.h | 3 arch/x86/kernel/e820.c | 3 arch/x86/kvm/Kconfig | 11 arch/x86/kvm/Makefile | 4 arch/x86/kvm/ept_idle.c | 841 +++++++++++++++++++++++++++++++ arch/x86/kvm/ept_idle.h | 116 ++++ arch/x86/kvm/mmu.c | 12 arch/x86/mm/numa.c | 3 arch/x86/mm/numa_emulation.c | 30 + arch/x86/mm/pgtable.c | 22 drivers/acpi/numa.c | 5 drivers/base/node.c | 21 fs/proc/base.c | 2 fs/proc/internal.h | 1 fs/proc/task_mmu.c | 54 + include/linux/mm_types.h | 11 include/linux/mmzone.h | 38 + mm/mempolicy.c | 14 mm/migrate.c | 13 mm/page_alloc.c | 77 ++ mm/pagewalk.c | 1 mm/vmscan.c | 38 + virt/kvm/kvm_main.c | 3 25 files changed, 1306 insertions(+), 29 deletions(-) V1 patches: https://lkml.org/lkml/2018/9/2/13 Regards, Fengguang