Subject: [PATCH 0/5] [v4] Allow persistent memory to be used like normal RAM
To: linux-kernel@vger.kernel.org
Cc: Dave Hansen, dan.j.williams@intel.com, dave.jiang@intel.com,
    zwisler@kernel.org, vishal.l.verma@intel.com,
    thomas.lendacky@amd.com, akpm@linux-foundation.org,
    mhocko@suse.com, linux-nvdimm@lists.01.org, linux-mm@kvack.org,
    ying.huang@intel.com, fengguang.wu@intel.com, bp@suse.de,
    bhelgaas@google.com, baiyaowei@cmss.chinamobile.com,
    tiwai@suse.de, jglisse@redhat.com
From: Dave Hansen
Date: Thu, 24 Jan 2019 15:14:41 -0800
Message-Id: <20190124231441.37A4A305@viggo.jf.intel.com>

v3 spurred a bunch of really good discussion.  Thanks to everybody
who made comments and suggestions!

I would still love some Acks on this from the folks on cc, even if it
is on just the patch touching your area.

Note: these are based on commit d2f33c19644 in:

	git://git.kernel.org/pub/scm/linux/kernel/git/djbw/nvdimm.git libnvdimm-pending

Changes since v3:
 * Move HMM-related resource warning instead of removing it
 * Use __request_resource() directly instead of devm.
 * Create a separate DAX_PMEM Kconfig option, complete with help text
 * Update patch descriptions and cover letter to give a better
   overview of use-cases and hardware where this might be useful.

Changes since v2:
 * Updates to dev_dax_kmem_probe() in patch 5:
   * Reject probes for devices with bad NUMA nodes.  Keeps slow
     memory from being added to node 0.
   * Use raw request_mem_region()
   * Add comments about permanent reservation
   * use dev_*() instead of printk's
 * Add references to nvdimm documentation in descriptions
 * Remove unneeded GPL export
 * Add Kconfig prompt and help text

Changes since v1:
 * Now based on git://git.kernel.org/pub/scm/linux/kernel/git/djbw/nvdimm.git
 * Use binding/unbinding from "dax bus" code
 * Move over to a "dax bus" driver from being an nvdimm driver

--

Persistent memory is cool.  But, currently, you have to rewrite
your applications to use it.  Wouldn't it be cool if you could
just have it show up in your system like normal RAM and get to
it like a slow blob of memory?  Well... have I got the patch
series for you!

== Background / Use Cases ==

Persistent memory devices (aka Non-Volatile DIMMs / NVDIMMs) are
described in detail in Documentation/nvdimm/nvdimm.txt.  However,
that documentation focuses on actually using them as storage.  This
set is focused on using NVDIMMs as a DRAM replacement.

This is intended for Intel-style NVDIMMs (aka Intel Optane DC
persistent memory).
These DIMMs are physically persistent, more akin to flash than
traditional RAM.  They are also expected to be more cost-effective
than using RAM, which is why folks want this set in the first place.

This set is not intended for RAM-based NVDIMMs.  Those are not
cost-effective vs. plain RAM, and using them here would simply be
a waste.

But, why would you bother with this approach?  Intel itself [1] has
announced a hardware feature that does something very similar:
"Memory Mode", which turns DRAM into a cache in front of persistent
memory, which is then as a whole used as normal "RAM".

Here are a few reasons:
1. The capacity of memory mode is the size of the persistent memory
   that you dedicate.  DRAM capacity is "lost" because it is used
   for cache.  With this set, you get PMEM+DRAM capacity for memory.
2. DRAM acts as a cache with memory mode, and caches can lead to
   unpredictable latencies.  Since memory mode is all-or-nothing
   (either all your DRAM is used as cache or none is), your entire
   memory space is exposed to these unpredictable latencies.  This
   solution lets you guarantee DRAM latencies if you need them.
3. The new "tier" of memory is exposed to software.  That means
   that you can build tiered applications or infrastructure.  A
   cloud provider could sell cheaper VMs that use more PMEM and
   more expensive ones that use DRAM.  That's impossible with
   memory mode.

Don't take this as criticism of memory mode.  Memory mode is
awesome, and doesn't strictly require *any* software changes (we
have software changes proposed for optimizing it though).  It has
tons of other advantages over *this* approach.  Basically, we
believe that the approach in these patches is complementary to
memory mode and that both can live side-by-side in harmony.

== Patch Set Overview ==

This series adds a new "driver" to which pmem devices can be
attached.  Once attached, the memory "owned" by the device is
hot-added to the kernel and managed like any other memory.
On systems with an HMAT (a new ACPI table), each socket (roughly)
will have a separate NUMA node for its persistent memory, so this
newly-added memory can be selected by its unique NUMA node.

== Testing Overview ==

Here's how I set up a system to test this thing:

1. Boot qemu with lots of memory: "-m 4096", for instance
2. Reserve 512MB of physical memory.  Reserving a spot at 2GB
   physical seems to work:

	memmap=512M!0x0000000080000000

   This will end up looking like a pmem device at boot.
3. Once booted, convert the fsdax device to "device dax":

	ndctl create-namespace -fe namespace0.0 -m dax

4. See patch 4 for instructions on binding the kmem driver
   to a device.
5. Now, online the new memory sections.  Perhaps:

	grep ^MemTotal /proc/meminfo
	for f in `grep -vl online /sys/devices/system/memory/*/state`; do
		echo $f: `cat $f`
		echo online_movable > $f
		grep ^MemTotal /proc/meminfo
	done

1. https://itpeernetwork.intel.com/intel-optane-dc-persistent-memory-operating-modes/#gs.RKG7BeIu

Cc: Dan Williams
Cc: Dave Jiang
Cc: Ross Zwisler
Cc: Vishal Verma
Cc: Tom Lendacky
Cc: Andrew Morton
Cc: Michal Hocko
Cc: linux-nvdimm@lists.01.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: Huang Ying
Cc: Fengguang Wu
Cc: Borislav Petkov
Cc: Bjorn Helgaas
Cc: Yaowei Bai
Cc: Takashi Iwai
Cc: Jerome Glisse
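
Addendum to testing step 5: if you want to rehearse the onlining
loop before pointing it at real sysfs, something like the sketch
below might help.  It is only a sketch and not part of the series.
The function name online_offline_blocks and its optional directory
argument are made up for illustration; the only things taken from
the steps above are the /sys/devices/system/memory/*/state files
and the "online_movable" value written to them.

```shell
#!/bin/sh
# Sketch: online every memory block whose state file reads "offline".
# The optional first argument is a hypothetical override of the sysfs
# memory directory, so the loop can be tried against a scratch tree.
online_offline_blocks() {
	root="${1:-/sys/devices/system/memory}"
	for f in "$root"/*/state; do
		# An unmatched glob stays literal; skip anything that
		# is not a regular file.
		[ -f "$f" ] || continue
		state=$(cat "$f")
		if [ "$state" = "offline" ]; then
			echo "$f: $state"
			echo online_movable > "$f"
		fi
	done
}
```

Calling online_offline_blocks with no argument acts on the real
sysfs tree (as root); bracket the call with
"grep ^MemTotal /proc/meminfo" to see the capacity change.  Passing
a scratch directory holding fake memoryN/state files makes it a dry
run.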