2015-06-25 09:42:12

by Dan Williams

Subject: [PATCH v2 00/17] libnvdimm: ->rw_bytes(), BLK, BTT, PMEM api, and unit tests

->rw_bytes() is a byte-aligned interface for accessing persistent memory
namespaces. The primary consumer of the ->rw_bytes() interface is the
BTT library.
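
As a rough illustration of what "byte-aligned" means here, the following
user-space sketch models a namespace as a flat buffer and an accessor
that takes an arbitrary byte offset and length. The names
(toy_namespace, toy_rw_bytes, TOY_READ, TOY_WRITE) are hypothetical and
are not the libnvdimm interface; the only assumption is that an access
is rejected when it would run past the end of the namespace.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

enum { TOY_READ = 0, TOY_WRITE = 1 };

/* Hypothetical stand-in for a namespace: a flat, byte-addressable buffer. */
struct toy_namespace {
	unsigned char *base;
	size_t size;
};

/*
 * Copy 'len' bytes at byte offset 'off'. No sector alignment is required.
 * Returns 0 on success, or -EFAULT if the range leaves the namespace.
 */
static int toy_rw_bytes(struct toy_namespace *ns, size_t off,
		void *buf, size_t len, int rw)
{
	/* written this way so that off + len cannot overflow */
	if (off > ns->size || len > ns->size - off)
		return -EFAULT;
	if (rw == TOY_WRITE)
		memcpy(ns->base + off, buf, len);
	else
		memcpy(buf, ns->base + off, len);
	return 0;
}
```

A caller such as a translation layer can use an accessor of this shape
to update a few bytes of metadata without issuing a full-sector write.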

BTT is a library that converts a byte-accessible namespace into a disk
with atomic sector update semantics (prevents sector tearing on crash or
power loss). The sinister aspect of sector tearing is that most
applications do not know they have an atomic sector dependency. At least
today's disks rarely tear sectors, and if they do you almost certainly
get a CRC error on access. NVDIMMs will always tear and always
silently.
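
To make the tearing problem concrete, here is a small user-space model,
offered as a sketch only. It contrasts an in-place write that is
interrupted part way with a write that goes to a spare block and is
published by a single map update. All names (toy_btt, toy_write_atomic,
and so on) are hypothetical, the "crash" is simulated by the survive
argument, and the model ignores the log and arena metadata that the real
layout carries (see the mapoff and logoff fields of struct btt_sb).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_SECTOR 8	/* tiny sector so the example is easy to follow */
#define TOY_NLBA   2	/* externally visible sectors */

/* One spare block: new data lands there first, then the map is flipped. */
struct toy_btt {
	unsigned char blocks[TOY_NLBA + 1][TOY_SECTOR];
	uint32_t map[TOY_NLBA];		/* lba -> block index */
	uint32_t free_block;
};

static void toy_btt_init(struct toy_btt *btt)
{
	memset(btt, 0, sizeof(*btt));
	for (uint32_t i = 0; i < TOY_NLBA; i++)
		btt->map[i] = i;
	btt->free_block = TOY_NLBA;
}

/* Overwrite in place, losing power after 'survive' bytes: a torn sector. */
static void toy_write_in_place(struct toy_btt *btt, uint32_t lba,
		const unsigned char *data, size_t survive)
{
	memcpy(btt->blocks[btt->map[lba]], data, survive);
}

/*
 * Write to the spare block, then publish with one map update. If power
 * is lost before the update (survive < TOY_SECTOR), readers still see
 * the complete old sector.
 */
static void toy_write_atomic(struct toy_btt *btt, uint32_t lba,
		const unsigned char *data, size_t survive)
{
	uint32_t old = btt->map[lba];

	memcpy(btt->blocks[btt->free_block], data, survive);
	if (survive < TOY_SECTOR)
		return;				/* simulated crash */
	btt->map[lba] = btt->free_block;	/* the publishing store */
	btt->free_block = old;			/* old block becomes the spare */
}

static const unsigned char *toy_read(const struct toy_btt *btt, uint32_t lba)
{
	return btt->blocks[btt->map[lba]];
}
```

In this model a reader sees either the whole old sector or the whole new
one after an interrupted toy_write_atomic(), whereas an interrupted
toy_write_in_place() leaves a mix of both with no error to flag it.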

BLK is a driver for NVDIMMs that provide sliding mmio windows to access
persistent memory.

The PMEM api ensures that writes have hit persistent media relative
to the completion of an i/o.

Changes since v1 [1]:

1/ The ->rw_bytes() interface has been removed from struct
block_device_operations and is now a common operation of NVDIMM
namespace devices. Accordingly a BTT instance is now a libnvdimm
device-model peer of a namespace rather than a stacked block device
driver. The BTT is no longer a driver in its own right; instead it is an
extension library used by a BLK or PMEM namespace. This clarifies the
device model and reduces the core implementation by a couple hundred
lines of code. (Christoph)

2/ Kill ND_MAX_REGIONS and ND_IOSTAT Kconfig options. (Christoph)

3/ Kill the separate out-of-range access checks in PMEM, BLK, and BTT
(Christoph)

4/ Kill the central helper for blk queue properties (Christoph)

5/ Added Toshi's numa patches with a change to set the numa info at
device create time rather than driver probe time.

6/ Cherry picked the PMEM api. The wider arch cleanups in the full
pmem-api series are too large / invasive to pick up at this late date.
This keeps the pmem.c driver x86-only for one more cycle.

[1]: https://lists.01.org/pipermail/linux-nvdimm/2015-June/001246.html

Diffstat since v1:

Documentation/nvdimm/btt.txt | 24 ++-
Documentation/nvdimm/nvdimm.txt | 79 ++++----
arch/x86/Kconfig | 1 +
arch/x86/include/asm/cacheflush.h | 72 +++++++
arch/x86/include/asm/io.h | 6 +
drivers/acpi/nfit.c | 17 +-
drivers/acpi/numa.c | 50 ++++-
drivers/nvdimm/Kconfig | 56 ++----
drivers/nvdimm/Makefile | 2 +-
drivers/nvdimm/blk.c | 148 +++++++-------
drivers/nvdimm/btt.c | 166 ++++------------
drivers/nvdimm/btt_devs.c | 403 +++++++++++++++++---------------------
drivers/nvdimm/bus.c | 190 ++++--------------
drivers/nvdimm/core.c | 30 ---
drivers/nvdimm/label.c | 5 +-
drivers/nvdimm/namespace_devs.c | 252 +++++++++++++++++++-----
drivers/nvdimm/nd-core.h | 47 +----
drivers/nvdimm/nd.h | 51 +++--
drivers/nvdimm/pmem.c | 185 ++++++++---------
drivers/nvdimm/region.c | 85 +-------
drivers/nvdimm/region_devs.c | 182 +++++++++++++----
include/linux/acpi.h | 5 +
include/linux/blkdev.h | 44 -----
include/linux/compiler.h | 2 +
include/linux/libnvdimm.h | 2 +
include/linux/nd.h | 63 +++++-
include/linux/pmem.h | 153 +++++++++++++++
include/uapi/linux/ndctl.h | 2 -
lib/Kconfig | 3 +
tools/testing/nvdimm/Kbuild | 2 +-
tools/testing/nvdimm/test/nfit.c | 1 +
31 files changed, 1272 insertions(+), 1056 deletions(-)
create mode 100644 include/linux/pmem.h

---

Dan Williams (8):
libnvdimm: infrastructure for btt devices
tools/testing/nvdimm: libnvdimm unit test infrastructure
libnvdimm: Non-Volatile Devices
libnvdimm, pmem: fix up max_hw_sectors
pmem: make_request cleanups
libnvdimm: enable iostat
pmem: flag pmem block devices as non-rotational
libnvdimm, nfit: handle unarmed dimms, mark namespaces read-only

Ross Zwisler (2):
libnvdimm, nfit, nd_blk: driver for BLK-mode access persistent memory
arch, x86: pmem api for ensuring durability of persistent memory updates

Toshi Kani (3):
acpi: Add acpi_map_pxm_to_online_node()
libnvdimm: Set numa_node to NVDIMM devices
libnvdimm: Add sysfs numa_node to NVDIMM devices

Vishal Verma (4):
nd_btt: atomic sector updates
fs/block_dev.c: skip rw_page if bdev has integrity
libnvdimm, btt: add support for blk integrity
libnvdimm, blk: add support for blk integrity


Documentation/nvdimm/btt.txt | 283 ++++++
Documentation/nvdimm/nvdimm.txt | 808 ++++++++++++++++++
MAINTAINERS | 39 +
arch/x86/Kconfig | 1
arch/x86/include/asm/cacheflush.h | 72 ++
arch/x86/include/asm/io.h | 6
drivers/acpi/nfit.c | 498 +++++++++++
drivers/acpi/nfit.h | 58 +
drivers/acpi/numa.c | 50 +
drivers/nvdimm/Kconfig | 42 +
drivers/nvdimm/Makefile | 7
drivers/nvdimm/blk.c | 384 +++++++++
drivers/nvdimm/btt.c | 1479 +++++++++++++++++++++++++++++++++
drivers/nvdimm/btt.h | 185 ++++
drivers/nvdimm/btt_devs.c | 426 ++++++++++
drivers/nvdimm/bus.c | 60 +
drivers/nvdimm/core.c | 69 ++
drivers/nvdimm/dimm_devs.c | 9
drivers/nvdimm/label.c | 5
drivers/nvdimm/namespace_devs.c | 295 ++++++-
drivers/nvdimm/nd-core.h | 5
drivers/nvdimm/nd.h | 86 ++
drivers/nvdimm/pmem.c | 181 ++--
drivers/nvdimm/region.c | 28 +
drivers/nvdimm/region_devs.c | 238 +++++
fs/block_dev.c | 4
include/linux/acpi.h | 5
include/linux/compiler.h | 2
include/linux/libnvdimm.h | 32 +
include/linux/nd.h | 63 +
include/linux/pmem.h | 153 +++
lib/Kconfig | 3
tools/testing/nvdimm/Kbuild | 40 +
tools/testing/nvdimm/Makefile | 7
tools/testing/nvdimm/config_check.c | 15
tools/testing/nvdimm/test/Kbuild | 8
tools/testing/nvdimm/test/iomap.c | 151 +++
tools/testing/nvdimm/test/nfit.c | 1116 +++++++++++++++++++++++++
tools/testing/nvdimm/test/nfit_test.h | 29 +
39 files changed, 6759 insertions(+), 183 deletions(-)
create mode 100644 Documentation/nvdimm/btt.txt
create mode 100644 Documentation/nvdimm/nvdimm.txt
create mode 100644 drivers/nvdimm/blk.c
create mode 100644 drivers/nvdimm/btt.c
create mode 100644 drivers/nvdimm/btt.h
create mode 100644 drivers/nvdimm/btt_devs.c
create mode 100644 include/linux/pmem.h
create mode 100644 tools/testing/nvdimm/Kbuild
create mode 100644 tools/testing/nvdimm/Makefile
create mode 100644 tools/testing/nvdimm/config_check.c
create mode 100644 tools/testing/nvdimm/test/Kbuild
create mode 100644 tools/testing/nvdimm/test/iomap.c
create mode 100644 tools/testing/nvdimm/test/nfit.c
create mode 100644 tools/testing/nvdimm/test/nfit_test.h


2015-06-25 09:42:24

by Dan Williams

Subject: [PATCH v2 01/17] libnvdimm: infrastructure for btt devices

NVDIMM namespaces, in addition to accepting "struct bio" based requests,
also have the capability to perform byte-aligned accesses. By default
only the bio/block interface is used. However, if another driver can
make effective use of the byte-aligned capability it can claim the
namespace and use the byte-aligned ->rw_bytes() interface.

The BTT driver is the first consumer of this mechanism to allow
adding atomic sector update semantics to a pmem or blk namespace. This
patch is the sysfs infrastructure to allow configuring a BTT instance
for a namespace. Enabling that BTT and performing i/o is in a
subsequent patch.
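
The claim step can be pictured with the following user-space sketch. It
loosely mirrors the attach/detach helpers in this patch, but it is a
simplified model under stated assumptions: the toy_* names are
hypothetical, and the device locking and reference counting that the
patch performs are deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical namespace: records which device, if any, has claimed it. */
struct toy_ns {
	const void *claim;
};

/* Hypothetical claimant, standing in for a btt device. */
struct toy_btt_dev {
	struct toy_ns *ndns;
};

/* A namespace has at most one holder; a second claim attempt fails. */
static bool toy_attach(struct toy_btt_dev *btt, struct toy_ns *ns)
{
	if (ns->claim || btt->ndns)
		return false;
	ns->claim = btt;
	btt->ndns = ns;
	return true;
}

/* Drop the claim so the namespace can be used raw or claimed again. */
static void toy_detach(struct toy_btt_dev *btt)
{
	if (!btt->ndns)
		return;
	btt->ndns->claim = NULL;
	btt->ndns = NULL;
}
```

While a claim is held, the namespace side is expected to refuse
reconfiguration, which is what the added "|| to_ndns(dev)->claim" checks
in namespace_devs.c do.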

Cc: Greg KH <[email protected]>
Cc: Neil Brown <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
drivers/nvdimm/Kconfig | 3
drivers/nvdimm/Makefile | 1
drivers/nvdimm/btt.h | 45 ++++
drivers/nvdimm/btt_devs.c | 423 +++++++++++++++++++++++++++++++++++++++
drivers/nvdimm/bus.c | 12 +
drivers/nvdimm/label.c | 5
drivers/nvdimm/namespace_devs.c | 204 ++++++++++++++++---
drivers/nvdimm/nd-core.h | 4
drivers/nvdimm/nd.h | 39 ++++
drivers/nvdimm/pmem.c | 132 ++++++++----
drivers/nvdimm/region.c | 8 -
drivers/nvdimm/region_devs.c | 38 +++-
include/linux/nd.h | 63 +++++-
13 files changed, 878 insertions(+), 99 deletions(-)
create mode 100644 drivers/nvdimm/btt.h
create mode 100644 drivers/nvdimm/btt_devs.c

diff --git a/drivers/nvdimm/Kconfig b/drivers/nvdimm/Kconfig
index 07a29113b870..5680e8e7a7aa 100644
--- a/drivers/nvdimm/Kconfig
+++ b/drivers/nvdimm/Kconfig
@@ -33,4 +33,7 @@ config BLK_DEV_PMEM

Say Y if you want to use an NVDIMM

+config BTT
+ def_bool y
+
endif
diff --git a/drivers/nvdimm/Makefile b/drivers/nvdimm/Makefile
index abce98f87f16..6085b4bd7312 100644
--- a/drivers/nvdimm/Makefile
+++ b/drivers/nvdimm/Makefile
@@ -11,3 +11,4 @@ libnvdimm-y += region_devs.o
libnvdimm-y += region.o
libnvdimm-y += namespace_devs.o
libnvdimm-y += label.o
+libnvdimm-$(CONFIG_BTT) += btt_devs.o
diff --git a/drivers/nvdimm/btt.h b/drivers/nvdimm/btt.h
new file mode 100644
index 000000000000..e8f6d8e0ddd3
--- /dev/null
+++ b/drivers/nvdimm/btt.h
@@ -0,0 +1,45 @@
+/*
+ * Block Translation Table library
+ * Copyright (c) 2014-2015, Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
+ * more details.
+ */
+
+#ifndef _LINUX_BTT_H
+#define _LINUX_BTT_H
+
+#include <linux/types.h>
+
+#define BTT_SIG_LEN 16
+#define BTT_SIG "BTT_ARENA_INFO\0"
+
+struct btt_sb {
+ u8 signature[BTT_SIG_LEN];
+ u8 uuid[16];
+ u8 parent_uuid[16];
+ __le32 flags;
+ __le16 version_major;
+ __le16 version_minor;
+ __le32 external_lbasize;
+ __le32 external_nlba;
+ __le32 internal_lbasize;
+ __le32 internal_nlba;
+ __le32 nfree;
+ __le32 infosize;
+ __le64 nextoff;
+ __le64 dataoff;
+ __le64 mapoff;
+ __le64 logoff;
+ __le64 info2off;
+ u8 padding[3968];
+ __le64 checksum;
+};
+
+#endif
diff --git a/drivers/nvdimm/btt_devs.c b/drivers/nvdimm/btt_devs.c
new file mode 100644
index 000000000000..4636beb8b7ed
--- /dev/null
+++ b/drivers/nvdimm/btt_devs.c
@@ -0,0 +1,423 @@
+/*
+ * Copyright(c) 2013-2015 Intel Corporation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ */
+#include <linux/blkdev.h>
+#include <linux/device.h>
+#include <linux/genhd.h>
+#include <linux/sizes.h>
+#include <linux/slab.h>
+#include <linux/fs.h>
+#include <linux/mm.h>
+#include "nd-core.h"
+#include "btt.h"
+#include "nd.h"
+
+static DEFINE_IDA(btt_ida);
+
+static void __nd_btt_detach_ndns(struct nd_btt *nd_btt)
+{
+ struct nd_namespace_common *ndns = nd_btt->ndns;
+
+ dev_WARN_ONCE(&nd_btt->dev, !mutex_is_locked(&ndns->dev.mutex)
+ || ndns->claim != &nd_btt->dev,
+ "%s: invalid claim\n", __func__);
+ ndns->claim = NULL;
+ nd_btt->ndns = NULL;
+ put_device(&ndns->dev);
+}
+
+static void nd_btt_detach_ndns(struct nd_btt *nd_btt)
+{
+ struct nd_namespace_common *ndns = nd_btt->ndns;
+
+ if (!ndns)
+ return;
+ get_device(&ndns->dev);
+ device_lock(&ndns->dev);
+ __nd_btt_detach_ndns(nd_btt);
+ device_unlock(&ndns->dev);
+ put_device(&ndns->dev);
+}
+
+static bool __nd_btt_attach_ndns(struct nd_btt *nd_btt,
+ struct nd_namespace_common *ndns)
+{
+ if (ndns->claim)
+ return false;
+ dev_WARN_ONCE(&nd_btt->dev, !mutex_is_locked(&ndns->dev.mutex)
+ || nd_btt->ndns,
+ "%s: invalid claim\n", __func__);
+ ndns->claim = &nd_btt->dev;
+ nd_btt->ndns = ndns;
+ get_device(&ndns->dev);
+ return true;
+}
+
+static bool nd_btt_attach_ndns(struct nd_btt *nd_btt,
+ struct nd_namespace_common *ndns)
+{
+ bool claimed;
+
+ device_lock(&ndns->dev);
+ claimed = __nd_btt_attach_ndns(nd_btt, ndns);
+ device_unlock(&ndns->dev);
+ return claimed;
+}
+
+static void nd_btt_release(struct device *dev)
+{
+ struct nd_btt *nd_btt = to_nd_btt(dev);
+
+ dev_dbg(dev, "%s\n", __func__);
+ nd_btt_detach_ndns(nd_btt);
+ ida_simple_remove(&btt_ida, nd_btt->id);
+ kfree(nd_btt->uuid);
+ kfree(nd_btt);
+}
+
+static struct device_type nd_btt_device_type = {
+ .name = "nd_btt",
+ .release = nd_btt_release,
+};
+
+bool is_nd_btt(struct device *dev)
+{
+ return dev->type == &nd_btt_device_type;
+}
+EXPORT_SYMBOL(is_nd_btt);
+
+struct nd_btt *to_nd_btt(struct device *dev)
+{
+ struct nd_btt *nd_btt = container_of(dev, struct nd_btt, dev);
+
+ WARN_ON(!is_nd_btt(dev));
+ return nd_btt;
+}
+EXPORT_SYMBOL(to_nd_btt);
+
+static const unsigned long btt_lbasize_supported[] = { 512, 4096, 0 };
+
+static ssize_t sector_size_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ struct nd_btt *nd_btt = to_nd_btt(dev);
+
+ return nd_sector_size_show(nd_btt->lbasize, btt_lbasize_supported, buf);
+}
+
+static ssize_t sector_size_store(struct device *dev,
+ struct device_attribute *attr, const char *buf, size_t len)
+{
+ struct nd_btt *nd_btt = to_nd_btt(dev);
+ ssize_t rc;
+
+ device_lock(dev);
+ nvdimm_bus_lock(dev);
+ rc = nd_sector_size_store(dev, buf, &nd_btt->lbasize,
+ btt_lbasize_supported);
+ dev_dbg(dev, "%s: result: %zd wrote: %s%s", __func__,
+ rc, buf, buf[len - 1] == '\n' ? "" : "\n");
+ nvdimm_bus_unlock(dev);
+ device_unlock(dev);
+
+ return rc ? rc : len;
+}
+static DEVICE_ATTR_RW(sector_size);
+
+static ssize_t uuid_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ struct nd_btt *nd_btt = to_nd_btt(dev);
+
+ if (nd_btt->uuid)
+ return sprintf(buf, "%pUb\n", nd_btt->uuid);
+ return sprintf(buf, "\n");
+}
+
+static ssize_t uuid_store(struct device *dev,
+ struct device_attribute *attr, const char *buf, size_t len)
+{
+ struct nd_btt *nd_btt = to_nd_btt(dev);
+ ssize_t rc;
+
+ device_lock(dev);
+ rc = nd_uuid_store(dev, &nd_btt->uuid, buf, len);
+ dev_dbg(dev, "%s: result: %zd wrote: %s%s", __func__,
+ rc, buf, buf[len - 1] == '\n' ? "" : "\n");
+ device_unlock(dev);
+
+ return rc ? rc : len;
+}
+static DEVICE_ATTR_RW(uuid);
+
+static ssize_t namespace_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ struct nd_btt *nd_btt = to_nd_btt(dev);
+ ssize_t rc;
+
+ nvdimm_bus_lock(dev);
+ rc = sprintf(buf, "%s\n", nd_btt->ndns
+ ? dev_name(&nd_btt->ndns->dev) : "");
+ nvdimm_bus_unlock(dev);
+ return rc;
+}
+
+static int namespace_match(struct device *dev, void *data)
+{
+ char *name = data;
+
+ return strcmp(name, dev_name(dev)) == 0;
+}
+
+static bool is_nd_btt_idle(struct device *dev)
+{
+ struct nd_region *nd_region = to_nd_region(dev->parent);
+ struct nd_btt *nd_btt = to_nd_btt(dev);
+
+ if (nd_region->btt_seed == dev || nd_btt->ndns || dev->driver)
+ return false;
+ return true;
+}
+
+static ssize_t __namespace_store(struct device *dev,
+ struct device_attribute *attr, const char *buf, size_t len)
+{
+ struct nd_btt *nd_btt = to_nd_btt(dev);
+ struct nd_namespace_common *ndns;
+ struct device *found;
+ char *name;
+
+ if (dev->driver) {
+ dev_dbg(dev, "%s: -EBUSY\n", __func__);
+ return -EBUSY;
+ }
+
+ name = kstrndup(buf, len, GFP_KERNEL);
+ if (!name)
+ return -ENOMEM;
+ strim(name);
+
+ if (strncmp(name, "namespace", 9) == 0 || strcmp(name, "") == 0)
+ /* pass */;
+ else {
+ len = -EINVAL;
+ goto out;
+ }
+
+ ndns = nd_btt->ndns;
+ if (strcmp(name, "") == 0) {
+ /* detach the namespace and destroy / reset the btt device */
+ nd_btt_detach_ndns(nd_btt);
+ if (is_nd_btt_idle(dev))
+ nd_device_unregister(dev, ND_ASYNC);
+ else {
+ nd_btt->lbasize = 0;
+ kfree(nd_btt->uuid);
+ nd_btt->uuid = NULL;
+ }
+ goto out;
+ } else if (ndns) {
+ dev_dbg(dev, "namespace already set to: %s\n",
+ dev_name(&ndns->dev));
+ len = -EBUSY;
+ goto out;
+ }
+
+ found = device_find_child(dev->parent, name, namespace_match);
+ if (!found) {
+ dev_dbg(dev, "'%s' not found under %s\n", name,
+ dev_name(dev->parent));
+ len = -ENODEV;
+ goto out;
+ }
+
+ ndns = to_ndns(found);
+ if (__nvdimm_namespace_capacity(ndns) < SZ_16M) {
+ dev_dbg(dev, "%s too small to host btt\n", name);
+ len = -ENXIO;
+ goto out_attach;
+ }
+
+ WARN_ON_ONCE(!is_nvdimm_bus_locked(&nd_btt->dev));
+ if (!nd_btt_attach_ndns(nd_btt, ndns)) {
+ dev_dbg(dev, "%s already claimed\n",
+ dev_name(&ndns->dev));
+ len = -EBUSY;
+ }
+
+ out_attach:
+ put_device(&ndns->dev); /* from device_find_child */
+ out:
+ kfree(name);
+ return len;
+}
+
+static ssize_t namespace_store(struct device *dev,
+ struct device_attribute *attr, const char *buf, size_t len)
+{
+ ssize_t rc;
+
+ nvdimm_bus_lock(dev);
+ device_lock(dev);
+ rc = __namespace_store(dev, attr, buf, len);
+ dev_dbg(dev, "%s: result: %zd wrote: %s%s", __func__,
+ rc, buf, buf[len - 1] == '\n' ? "" : "\n");
+ device_unlock(dev);
+ nvdimm_bus_unlock(dev);
+
+ return rc;
+}
+static DEVICE_ATTR_RW(namespace);
+
+static struct attribute *nd_btt_attributes[] = {
+ &dev_attr_sector_size.attr,
+ &dev_attr_namespace.attr,
+ &dev_attr_uuid.attr,
+ NULL,
+};
+
+static struct attribute_group nd_btt_attribute_group = {
+ .attrs = nd_btt_attributes,
+};
+
+static const struct attribute_group *nd_btt_attribute_groups[] = {
+ &nd_btt_attribute_group,
+ &nd_device_attribute_group,
+ NULL,
+};
+
+static struct device *__nd_btt_create(struct nd_region *nd_region,
+ unsigned long lbasize, u8 *uuid,
+ struct nd_namespace_common *ndns)
+{
+ struct nd_btt *nd_btt;
+ struct device *dev;
+
+ nd_btt = kzalloc(sizeof(*nd_btt), GFP_KERNEL);
+ if (!nd_btt)
+ return NULL;
+
+ nd_btt->id = ida_simple_get(&btt_ida, 0, 0, GFP_KERNEL);
+ if (nd_btt->id < 0) {
+ kfree(nd_btt);
+ return NULL;
+ }
+
+ nd_btt->lbasize = lbasize;
+ if (uuid)
+ uuid = kmemdup(uuid, 16, GFP_KERNEL);
+ nd_btt->uuid = uuid;
+ dev = &nd_btt->dev;
+ dev_set_name(dev, "btt%d", nd_btt->id);
+ dev->parent = &nd_region->dev;
+ dev->type = &nd_btt_device_type;
+ dev->groups = nd_btt_attribute_groups;
+ device_initialize(&nd_btt->dev);
+ if (ndns && !__nd_btt_attach_ndns(nd_btt, ndns)) {
+ dev_dbg(&ndns->dev, "%s failed, already claimed by %s\n",
+ __func__, dev_name(ndns->claim));
+ put_device(dev);
+ return NULL;
+ }
+ return dev;
+}
+
+struct device *nd_btt_create(struct nd_region *nd_region)
+{
+ struct device *dev = __nd_btt_create(nd_region, 0, NULL, NULL);
+
+ if (dev)
+ __nd_device_register(dev);
+ return dev;
+}
+
+/*
+ * nd_btt_sb_checksum: compute checksum for btt info block
+ *
+ * Returns a fletcher64 checksum of everything in the given info block
+ * except the last field (since that's where the checksum lives).
+ */
+u64 nd_btt_sb_checksum(struct btt_sb *btt_sb)
+{
+ u64 sum, sum_save;
+
+ sum_save = btt_sb->checksum;
+ btt_sb->checksum = 0;
+ sum = nd_fletcher64(btt_sb, sizeof(*btt_sb), 1);
+ btt_sb->checksum = sum_save;
+ return sum;
+}
+EXPORT_SYMBOL(nd_btt_sb_checksum);
+
+static int __nd_btt_probe(struct nd_btt *nd_btt,
+ struct nd_namespace_common *ndns, struct btt_sb *btt_sb)
+{
+ u64 checksum;
+
+ if (!btt_sb || !ndns || !nd_btt)
+ return -ENODEV;
+
+ if (nvdimm_read_bytes(ndns, SZ_4K, btt_sb, sizeof(*btt_sb)))
+ return -ENXIO;
+
+ if (nvdimm_namespace_capacity(ndns) < SZ_16M)
+ return -ENXIO;
+
+ if (memcmp(btt_sb->signature, BTT_SIG, BTT_SIG_LEN) != 0)
+ return -ENODEV;
+
+ checksum = le64_to_cpu(btt_sb->checksum);
+ btt_sb->checksum = 0;
+ if (checksum != nd_btt_sb_checksum(btt_sb))
+ return -ENODEV;
+ btt_sb->checksum = cpu_to_le64(checksum);
+
+ nd_btt->lbasize = le32_to_cpu(btt_sb->external_lbasize);
+ nd_btt->uuid = kmemdup(btt_sb->uuid, 16, GFP_KERNEL);
+ if (!nd_btt->uuid)
+ return -ENOMEM;
+
+ __nd_device_register(&nd_btt->dev);
+
+ return 0;
+}
+
+int nd_btt_probe(struct nd_namespace_common *ndns, void *drvdata)
+{
+ int rc;
+ struct device *dev;
+ struct btt_sb *btt_sb;
+ struct nd_region *nd_region = to_nd_region(ndns->dev.parent);
+
+ if (ndns->force_raw)
+ return -ENODEV;
+
+ nvdimm_bus_lock(&ndns->dev);
+ dev = __nd_btt_create(nd_region, 0, NULL, ndns);
+ nvdimm_bus_unlock(&ndns->dev);
+ if (!dev)
+ return -ENOMEM;
+ dev_set_drvdata(dev, drvdata);
+ btt_sb = kzalloc(sizeof(*btt_sb), GFP_KERNEL);
+ rc = __nd_btt_probe(to_nd_btt(dev), ndns, btt_sb);
+ kfree(btt_sb);
+ dev_dbg(&ndns->dev, "%s: btt: %s\n", __func__,
+ rc == 0 ? dev_name(dev) : "<none>");
+ if (rc < 0) {
+ __nd_btt_detach_ndns(to_nd_btt(dev));
+ put_device(dev);
+ }
+
+ return rc;
+}
+EXPORT_SYMBOL(nd_btt_probe);
diff --git a/drivers/nvdimm/bus.c b/drivers/nvdimm/bus.c
index ca802702440e..dd12f38397db 100644
--- a/drivers/nvdimm/bus.c
+++ b/drivers/nvdimm/bus.c
@@ -14,8 +14,10 @@
#include <linux/vmalloc.h>
#include <linux/uaccess.h>
#include <linux/module.h>
+#include <linux/blkdev.h>
#include <linux/fcntl.h>
#include <linux/async.h>
+#include <linux/genhd.h>
#include <linux/ndctl.h>
#include <linux/sched.h>
#include <linux/slab.h>
@@ -103,6 +105,7 @@ static int nvdimm_bus_probe(struct device *dev)

dev_dbg(&nvdimm_bus->dev, "%s.probe(%s) = %d\n", dev->driver->name,
dev_name(dev), rc);
+
if (rc != 0)
module_put(provider);
return rc;
@@ -163,14 +166,19 @@ static void nd_async_device_unregister(void *d, async_cookie_t cookie)
put_device(dev);
}

-void nd_device_register(struct device *dev)
+void __nd_device_register(struct device *dev)
{
dev->bus = &nvdimm_bus_type;
- device_initialize(dev);
get_device(dev);
async_schedule_domain(nd_async_device_register, dev,
&nd_async_domain);
}
+
+void nd_device_register(struct device *dev)
+{
+ device_initialize(dev);
+ __nd_device_register(dev);
+}
EXPORT_SYMBOL(nd_device_register);

void nd_device_unregister(struct device *dev, enum nd_async_mode mode)
diff --git a/drivers/nvdimm/label.c b/drivers/nvdimm/label.c
index 34148003fc73..96526dcfdd37 100644
--- a/drivers/nvdimm/label.c
+++ b/drivers/nvdimm/label.c
@@ -666,7 +666,7 @@ static int __blk_label_update(struct nd_region *nd_region,

/* don't allow updates that consume the last label */
if (nfree - alloc < 0 || nfree - alloc + victims < 1) {
- dev_info(&nsblk->dev, "insufficient label space\n");
+ dev_info(&nsblk->common.dev, "insufficient label space\n");
kfree(victim_map);
return -ENOSPC;
}
@@ -762,7 +762,8 @@ static int __blk_label_update(struct nd_region *nd_region,
continue;
res = to_resource(ndd, nd_label);
res->flags &= ~DPA_RESOURCE_ADJUSTED;
- dev_vdbg(&nsblk->dev, "assign label[%d] slot: %d\n", l, slot);
+ dev_vdbg(&nsblk->common.dev, "assign label[%d] slot: %d\n",
+ l, slot);
nd_mapping->labels[l++] = nd_label;
}
nd_mapping->labels[l] = NULL;
diff --git a/drivers/nvdimm/namespace_devs.c b/drivers/nvdimm/namespace_devs.c
index 50b502b1908e..2c50a0719f8d 100644
--- a/drivers/nvdimm/namespace_devs.c
+++ b/drivers/nvdimm/namespace_devs.c
@@ -102,7 +102,7 @@ static ssize_t __alt_name_store(struct device *dev, const char *buf,
} else
return -ENXIO;

- if (dev->driver)
+ if (dev->driver || to_ndns(dev)->claim)
return -EBUSY;

input = kmemdup(buf, len + 1, GFP_KERNEL);
@@ -133,7 +133,7 @@ out:

static resource_size_t nd_namespace_blk_size(struct nd_namespace_blk *nsblk)
{
- struct nd_region *nd_region = to_nd_region(nsblk->dev.parent);
+ struct nd_region *nd_region = to_nd_region(nsblk->common.dev.parent);
struct nd_mapping *nd_mapping = &nd_region->mapping[0];
struct nvdimm_drvdata *ndd = to_ndd(nd_mapping);
struct nd_label_id label_id;
@@ -152,9 +152,9 @@ static resource_size_t nd_namespace_blk_size(struct nd_namespace_blk *nsblk)
static int nd_namespace_label_update(struct nd_region *nd_region,
struct device *dev)
{
- dev_WARN_ONCE(dev, dev->driver,
+ dev_WARN_ONCE(dev, dev->driver || to_ndns(dev)->claim,
"namespace must be idle during label update\n");
- if (dev->driver)
+ if (dev->driver || to_ndns(dev)->claim)
return 0;

/*
@@ -666,7 +666,7 @@ static ssize_t __size_store(struct device *dev, unsigned long long val)
u8 *uuid = NULL;
int rc, i;

- if (dev->driver)
+ if (dev->driver || to_ndns(dev)->claim)
return -EBUSY;

if (is_namespace_pmem(dev)) {
@@ -733,12 +733,16 @@ static ssize_t __size_store(struct device *dev, unsigned long long val)
nd_namespace_pmem_set_size(nd_region, nspm,
val * nd_region->ndr_mappings);
} else if (is_namespace_blk(dev)) {
+ struct nd_namespace_blk *nsblk = to_nd_namespace_blk(dev);
+
/*
* Try to delete the namespace if we deleted all of its
- * allocation and this is not the seed device for the
- * region.
+ * allocation, this is not the seed device for the
+ * region, and it is not actively claimed by a btt
+ * instance.
*/
- if (val == 0 && nd_region->ns_seed != dev)
+ if (val == 0 && nd_region->ns_seed != dev
+ && !nsblk->common.claim)
nd_device_unregister(dev, ND_ASYNC);
}

@@ -789,26 +793,42 @@ static ssize_t size_store(struct device *dev,
return rc < 0 ? rc : len;
}

-static ssize_t size_show(struct device *dev,
- struct device_attribute *attr, char *buf)
+resource_size_t __nvdimm_namespace_capacity(struct nd_namespace_common *ndns)
{
- unsigned long long size = 0;
+ struct device *dev = &ndns->dev;

- nvdimm_bus_lock(dev);
if (is_namespace_pmem(dev)) {
struct nd_namespace_pmem *nspm = to_nd_namespace_pmem(dev);

- size = resource_size(&nspm->nsio.res);
+ return resource_size(&nspm->nsio.res);
} else if (is_namespace_blk(dev)) {
- size = nd_namespace_blk_size(to_nd_namespace_blk(dev));
+ return nd_namespace_blk_size(to_nd_namespace_blk(dev));
} else if (is_namespace_io(dev)) {
struct nd_namespace_io *nsio = to_nd_namespace_io(dev);

- size = resource_size(&nsio->res);
- }
- nvdimm_bus_unlock(dev);
+ return resource_size(&nsio->res);
+ } else
+ WARN_ONCE(1, "unknown namespace type\n");
+ return 0;
+}
+
+resource_size_t nvdimm_namespace_capacity(struct nd_namespace_common *ndns)
+{
+ resource_size_t size;

- return sprintf(buf, "%llu\n", size);
+ nvdimm_bus_lock(&ndns->dev);
+ size = __nvdimm_namespace_capacity(ndns);
+ nvdimm_bus_unlock(&ndns->dev);
+
+ return size;
+}
+EXPORT_SYMBOL(nvdimm_namespace_capacity);
+
+static ssize_t size_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ return sprintf(buf, "%llu\n", (unsigned long long)
+ nvdimm_namespace_capacity(to_ndns(dev)));
}
static DEVICE_ATTR(size, S_IRUGO, size_show, size_store);

@@ -897,8 +917,8 @@ static ssize_t uuid_store(struct device *dev,
{
struct nd_region *nd_region = to_nd_region(dev->parent);
u8 *uuid = NULL;
+ ssize_t rc = 0;
u8 **ns_uuid;
- ssize_t rc;

if (is_namespace_pmem(dev)) {
struct nd_namespace_pmem *nspm = to_nd_namespace_pmem(dev);
@@ -914,7 +934,10 @@ static ssize_t uuid_store(struct device *dev,
device_lock(dev);
nvdimm_bus_lock(dev);
wait_nvdimm_bus_probe_idle(dev);
- rc = nd_uuid_store(dev, &uuid, buf, len);
+ if (to_ndns(dev)->claim)
+ rc = -EBUSY;
+ if (rc >= 0)
+ rc = nd_uuid_store(dev, &uuid, buf, len);
if (rc >= 0)
rc = namespace_update_uuid(nd_region, dev, uuid, ns_uuid);
if (rc >= 0)
@@ -971,15 +994,18 @@ static ssize_t sector_size_store(struct device *dev,
{
struct nd_namespace_blk *nsblk = to_nd_namespace_blk(dev);
struct nd_region *nd_region = to_nd_region(dev->parent);
- ssize_t rc;
+ ssize_t rc = 0;

if (!is_namespace_blk(dev))
return -ENXIO;

device_lock(dev);
nvdimm_bus_lock(dev);
- rc = nd_sector_size_store(dev, buf, &nsblk->lbasize,
- ns_lbasize_supported);
+ if (to_ndns(dev)->claim)
+ rc = -EBUSY;
+ if (rc >= 0)
+ rc = nd_sector_size_store(dev, buf, &nsblk->lbasize,
+ ns_lbasize_supported);
if (rc >= 0)
rc = nd_namespace_label_update(nd_region, dev);
dev_dbg(dev, "%s: result: %zd %s: %s%s", __func__,
@@ -1034,12 +1060,48 @@ static ssize_t dpa_extents_show(struct device *dev,
}
static DEVICE_ATTR_RO(dpa_extents);

+static ssize_t holder_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ struct nd_namespace_common *ndns = to_ndns(dev);
+ ssize_t rc;
+
+ device_lock(dev);
+ rc = sprintf(buf, "%s\n", ndns->claim ? dev_name(ndns->claim) : "");
+ device_unlock(dev);
+
+ return rc;
+}
+static DEVICE_ATTR_RO(holder);
+
+static ssize_t force_raw_store(struct device *dev,
+ struct device_attribute *attr, const char *buf, size_t len)
+{
+ bool force_raw;
+ int rc = strtobool(buf, &force_raw);
+
+ if (rc)
+ return rc;
+
+ to_ndns(dev)->force_raw = force_raw;
+ return len;
+}
+
+static ssize_t force_raw_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ return sprintf(buf, "%d\n", to_ndns(dev)->force_raw);
+}
+static DEVICE_ATTR_RW(force_raw);
+
static struct attribute *nd_namespace_attributes[] = {
&dev_attr_nstype.attr,
&dev_attr_size.attr,
&dev_attr_uuid.attr,
+ &dev_attr_holder.attr,
&dev_attr_resource.attr,
&dev_attr_alt_name.attr,
+ &dev_attr_force_raw.attr,
&dev_attr_sector_size.attr,
&dev_attr_dpa_extents.attr,
NULL,
@@ -1066,7 +1128,9 @@ static umode_t namespace_visible(struct kobject *kobj,
return a->mode;
}

- if (a == &dev_attr_nstype.attr || a == &dev_attr_size.attr)
+ if (a == &dev_attr_nstype.attr || a == &dev_attr_size.attr
+ || a == &dev_attr_holder.attr
+ || a == &dev_attr_force_raw.attr)
return a->mode;

return 0;
@@ -1083,6 +1147,66 @@ static const struct attribute_group *nd_namespace_attribute_groups[] = {
NULL,
};

+struct nd_namespace_common *nvdimm_namespace_common_probe(struct device *dev)
+{
+ struct nd_btt *nd_btt = is_nd_btt(dev) ? to_nd_btt(dev) : NULL;
+ struct nd_namespace_common *ndns;
+ resource_size_t size;
+
+ if (nd_btt) {
+ ndns = nd_btt->ndns;
+ if (!ndns)
+ return ERR_PTR(-ENODEV);
+
+ /*
+ * Flush any in-progress probes / removals in the driver
+ * for the raw personality of this namespace.
+ */
+ device_lock(&ndns->dev);
+ device_unlock(&ndns->dev);
+ if (ndns->dev.driver) {
+ dev_dbg(&ndns->dev, "is active, can't bind %s\n",
+ dev_name(&nd_btt->dev));
+ return ERR_PTR(-EBUSY);
+ }
+ if (dev_WARN_ONCE(&ndns->dev, ndns->claim != &nd_btt->dev,
+ "host (%s) vs claim (%s) mismatch\n",
+ dev_name(&nd_btt->dev),
+ dev_name(ndns->claim)))
+ return ERR_PTR(-ENXIO);
+ } else {
+ ndns = to_ndns(dev);
+ if (ndns->claim) {
+ dev_dbg(dev, "claimed by %s, failing probe\n",
+ dev_name(ndns->claim));
+
+ return ERR_PTR(-ENXIO);
+ }
+ }
+
+ size = nvdimm_namespace_capacity(ndns);
+ if (size < ND_MIN_NAMESPACE_SIZE) {
+ dev_dbg(&ndns->dev, "%pa, too small must be at least %#x\n",
+ &size, ND_MIN_NAMESPACE_SIZE);
+ return ERR_PTR(-ENODEV);
+ }
+
+ if (is_namespace_pmem(&ndns->dev)) {
+ struct nd_namespace_pmem *nspm;
+
+ nspm = to_nd_namespace_pmem(&ndns->dev);
+ if (!nspm->uuid) {
+ dev_dbg(&ndns->dev, "%s: uuid not set\n", __func__);
+ return ERR_PTR(-ENODEV);
+ }
+ } else if (is_namespace_blk(&ndns->dev)) {
+ return ERR_PTR(-ENODEV); /* TODO */
+ }
+
+ return ndns;
+}
+EXPORT_SYMBOL(nvdimm_namespace_common_probe);
+
static struct device **create_namespace_io(struct nd_region *nd_region)
{
struct nd_namespace_io *nsio;
@@ -1099,7 +1223,7 @@ static struct device **create_namespace_io(struct nd_region *nd_region)
return NULL;
}

- dev = &nsio->dev;
+ dev = &nsio->common.dev;
dev->type = &namespace_io_device_type;
dev->parent = &nd_region->dev;
res = &nsio->res;
@@ -1313,7 +1437,7 @@ static struct device **create_namespace_pmem(struct nd_region *nd_region)
if (!nspm)
return NULL;

- dev = &nspm->nsio.dev;
+ dev = &nspm->nsio.common.dev;
dev->type = &namespace_pmem_device_type;
dev->parent = &nd_region->dev;
res = &nspm->nsio.res;
@@ -1346,7 +1470,7 @@ static struct device **create_namespace_pmem(struct nd_region *nd_region)
return devs;

err:
- namespace_pmem_release(&nspm->nsio.dev);
+ namespace_pmem_release(&nspm->nsio.common.dev);
return NULL;
}

@@ -1385,7 +1509,7 @@ static struct device *nd_namespace_blk_create(struct nd_region *nd_region)
if (!nsblk)
return NULL;

- dev = &nsblk->dev;
+ dev = &nsblk->common.dev;
dev->type = &namespace_blk_device_type;
nsblk->id = ida_simple_get(&nd_region->ns_ida, 0, 0, GFP_KERNEL);
if (nsblk->id < 0) {
@@ -1396,7 +1520,7 @@ static struct device *nd_namespace_blk_create(struct nd_region *nd_region)
dev->parent = &nd_region->dev;
dev->groups = nd_namespace_attribute_groups;

- return &nsblk->dev;
+ return &nsblk->common.dev;
}

void nd_region_create_blk_seed(struct nd_region *nd_region)
@@ -1413,6 +1537,18 @@ void nd_region_create_blk_seed(struct nd_region *nd_region)
nd_device_register(nd_region->ns_seed);
}

+void nd_region_create_btt_seed(struct nd_region *nd_region)
+{
+ WARN_ON(!is_nvdimm_bus_locked(&nd_region->dev));
+ nd_region->btt_seed = nd_btt_create(nd_region);
+ /*
+ * Seed creation failures are not fatal, provisioning is simply
+ * disabled until memory becomes available
+ */
+ if (!nd_region->btt_seed)
+ dev_err(&nd_region->dev, "failed to create btt namespace\n");
+}
+
static struct device **create_namespace_blk(struct nd_region *nd_region)
{
struct nd_mapping *nd_mapping = &nd_region->mapping[0];
@@ -1446,7 +1582,7 @@ static struct device **create_namespace_blk(struct nd_region *nd_region)
if (!res)
goto err;
nd_dbg_dpa(nd_region, ndd, res, "%s assign\n",
- dev_name(&nsblk->dev));
+ dev_name(&nsblk->common.dev));
break;
}
}
@@ -1462,7 +1598,7 @@ static struct device **create_namespace_blk(struct nd_region *nd_region)
nsblk = kzalloc(sizeof(*nsblk), GFP_KERNEL);
if (!nsblk)
goto err;
- dev = &nsblk->dev;
+ dev = &nsblk->common.dev;
dev->type = &namespace_blk_device_type;
dev->parent = &nd_region->dev;
dev_set_name(dev, "namespace%d.%d", nd_region->id, count);
@@ -1482,7 +1618,7 @@ static struct device **create_namespace_blk(struct nd_region *nd_region)
if (!res)
goto err;
nd_dbg_dpa(nd_region, ndd, res, "%s assign\n",
- dev_name(&nsblk->dev));
+ dev_name(&nsblk->common.dev));
}

dev_dbg(&nd_region->dev, "%s: discovered %d blk namespace%s\n",
@@ -1503,7 +1639,7 @@ static struct device **create_namespace_blk(struct nd_region *nd_region)
nsblk = kzalloc(sizeof(*nsblk), GFP_KERNEL);
if (!nsblk)
goto err;
- dev = &nsblk->dev;
+ dev = &nsblk->common.dev;
dev->type = &namespace_blk_device_type;
dev->parent = &nd_region->dev;
devs[count++] = dev;
@@ -1514,7 +1650,7 @@ static struct device **create_namespace_blk(struct nd_region *nd_region)
err:
for (i = 0; i < count; i++) {
nsblk = to_nd_namespace_blk(devs[i]);
- namespace_blk_release(&nsblk->dev);
+ namespace_blk_release(&nsblk->common.dev);
}
kfree(devs);
return NULL;
diff --git a/drivers/nvdimm/nd-core.h b/drivers/nvdimm/nd-core.h
index 78d6c51f4bac..5e6413964776 100644
--- a/drivers/nvdimm/nd-core.h
+++ b/drivers/nvdimm/nd-core.h
@@ -45,12 +45,14 @@ struct nvdimm {
bool is_nvdimm(struct device *dev);
bool is_nd_blk(struct device *dev);
bool is_nd_pmem(struct device *dev);
+struct nd_btt;
struct nvdimm_bus *walk_to_nvdimm_bus(struct device *nd_dev);
int __init nvdimm_bus_init(void);
void nvdimm_bus_exit(void);
void nd_region_probe_success(struct nvdimm_bus *nvdimm_bus, struct device *dev);
struct nd_region;
void nd_region_create_blk_seed(struct nd_region *nd_region);
+void nd_region_create_btt_seed(struct nd_region *nd_region);
void nd_region_disable(struct nvdimm_bus *nvdimm_bus, struct device *dev);
int nvdimm_bus_create_ndctl(struct nvdimm_bus *nvdimm_bus);
void nvdimm_bus_destroy_ndctl(struct nvdimm_bus *nvdimm_bus);
@@ -58,6 +60,7 @@ void nd_synchronize(void);
int nvdimm_bus_register_dimms(struct nvdimm_bus *nvdimm_bus);
int nvdimm_bus_register_regions(struct nvdimm_bus *nvdimm_bus);
int nvdimm_bus_init_interleave_sets(struct nvdimm_bus *nvdimm_bus);
+void __nd_device_register(struct device *dev);
int nd_match_dimm(struct device *dev, void *data);
struct nd_label_id;
char *nd_label_gen_id(struct nd_label_id *label_id, u8 *uuid, u32 flags);
@@ -77,4 +80,5 @@ struct resource *nsblk_add_resource(struct nd_region *nd_region,
resource_size_t start);
int nvdimm_num_label_slots(struct nvdimm_drvdata *ndd);
void get_ndd(struct nvdimm_drvdata *ndd);
+resource_size_t __nvdimm_namespace_capacity(struct nd_namespace_common *ndns);
#endif /* __ND_CORE_H__ */
diff --git a/drivers/nvdimm/nd.h b/drivers/nvdimm/nd.h
index bfa849617358..329e4d54969f 100644
--- a/drivers/nvdimm/nd.h
+++ b/drivers/nvdimm/nd.h
@@ -19,6 +19,10 @@
#include <linux/types.h>
#include "label.h"

+enum {
+ SECTOR_SHIFT = 9,
+};
+
struct nvdimm_drvdata {
struct device *dev;
int nsindex_size;
@@ -75,6 +79,7 @@ struct nd_region {
struct device dev;
struct ida ns_ida;
struct device *ns_seed;
+ struct device *btt_seed;
u16 ndr_mappings;
u64 ndr_size;
u64 ndr_start;
@@ -94,6 +99,14 @@ static inline unsigned nd_inc_seq(unsigned seq)
return next[seq & 3];
}

+struct nd_btt {
+ struct device dev;
+ struct nd_namespace_common *ndns;
+ unsigned long lbasize;
+ u8 *uuid;
+ int id;
+};
+
enum nd_async_mode {
ND_SYNC,
ND_ASYNC,
@@ -118,6 +131,30 @@ int nvdimm_init_nsarea(struct nvdimm_drvdata *ndd);
int nvdimm_init_config_data(struct nvdimm_drvdata *ndd);
int nvdimm_set_config_data(struct nvdimm_drvdata *ndd, size_t offset,
void *buf, size_t len);
+struct nd_btt *to_nd_btt(struct device *dev);
+struct btt_sb;
+u64 nd_btt_sb_checksum(struct btt_sb *btt_sb);
+#if IS_ENABLED(CONFIG_BTT)
+int nd_btt_probe(struct nd_namespace_common *ndns, void *drvdata);
+bool is_nd_btt(struct device *dev);
+struct device *nd_btt_create(struct nd_region *nd_region);
+#else
+static inline int nd_btt_probe(struct nd_namespace_common *ndns, void *drvdata)
+{
+ return -ENODEV;
+}
+
+static inline bool is_nd_btt(struct device *dev)
+{
+ return false;
+}
+
+static inline struct device *nd_btt_create(struct nd_region *nd_region)
+{
+ return NULL;
+}
+
+#endif
struct nd_region *to_nd_region(struct device *dev);
int nd_region_to_nstype(struct nd_region *nd_region);
int nd_region_register_namespaces(struct nd_region *nd_region, int *err);
@@ -132,4 +169,6 @@ void nvdimm_free_dpa(struct nvdimm_drvdata *ndd, struct resource *res);
struct resource *nvdimm_allocate_dpa(struct nvdimm_drvdata *ndd,
struct nd_label_id *label_id, resource_size_t start,
resource_size_t n);
+resource_size_t nvdimm_namespace_capacity(struct nd_namespace_common *ndns);
+struct nd_namespace_common *nvdimm_namespace_common_probe(struct device *dev);
#endif /* __ND_H__ */
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 90902a142e35..d0c6b4bdba69 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -121,44 +121,61 @@ static struct pmem_device *pmem_alloc(struct device *dev,
struct resource *res, int id)
{
struct pmem_device *pmem;
- struct gendisk *disk;
- int err;

- err = -ENOMEM;
pmem = kzalloc(sizeof(*pmem), GFP_KERNEL);
if (!pmem)
- goto out;
+ return ERR_PTR(-ENOMEM);

pmem->phys_addr = res->start;
pmem->size = resource_size(res);

- err = -EINVAL;
- if (!request_mem_region(pmem->phys_addr, pmem->size, "pmem")) {
+ if (!request_mem_region(pmem->phys_addr, pmem->size, dev_name(dev))) {
dev_warn(dev, "could not reserve region [0x%pa:0x%zx]\n",
&pmem->phys_addr, pmem->size);
- goto out_free_dev;
+ kfree(pmem);
+ return ERR_PTR(-EBUSY);
}

/*
* Map the memory as non-cachable, as we can't write back the contents
* of the CPU caches in case of a crash.
*/
- err = -ENOMEM;
pmem->virt_addr = ioremap_nocache(pmem->phys_addr, pmem->size);
- if (!pmem->virt_addr)
- goto out_release_region;
+ if (!pmem->virt_addr) {
+ release_mem_region(pmem->phys_addr, pmem->size);
+ kfree(pmem);
+ return ERR_PTR(-ENXIO);
+ }
+
+ return pmem;
+}
+
+static void pmem_detach_disk(struct pmem_device *pmem)
+{
+ del_gendisk(pmem->pmem_disk);
+ put_disk(pmem->pmem_disk);
+ blk_cleanup_queue(pmem->pmem_queue);
+}
+
+static int pmem_attach_disk(struct nd_namespace_common *ndns,
+ struct pmem_device *pmem)
+{
+ struct nd_region *nd_region = to_nd_region(ndns->dev.parent);
+ struct gendisk *disk;

pmem->pmem_queue = blk_alloc_queue(GFP_KERNEL);
if (!pmem->pmem_queue)
- goto out_unmap;
+ return -ENOMEM;

blk_queue_make_request(pmem->pmem_queue, pmem_make_request);
blk_queue_max_hw_sectors(pmem->pmem_queue, 1024);
blk_queue_bounce_limit(pmem->pmem_queue, BLK_BOUNCE_ANY);

disk = alloc_disk(0);
- if (!disk)
- goto out_free_queue;
+ if (!disk) {
+ blk_cleanup_queue(pmem->pmem_queue);
+ return -ENOMEM;
+ }

disk->major = pmem_major;
disk->first_minor = 0;
@@ -166,32 +183,47 @@ static struct pmem_device *pmem_alloc(struct device *dev,
disk->private_data = pmem;
disk->queue = pmem->pmem_queue;
disk->flags = GENHD_FL_EXT_DEVT;
- sprintf(disk->disk_name, "pmem%d", id);
- disk->driverfs_dev = dev;
+ sprintf(disk->disk_name, "pmem%d", nd_region->id);
+ disk->driverfs_dev = &ndns->dev;
set_capacity(disk, pmem->size >> 9);
pmem->pmem_disk = disk;

add_disk(disk);

- return pmem;
+ return 0;
+}

-out_free_queue:
- blk_cleanup_queue(pmem->pmem_queue);
-out_unmap:
- iounmap(pmem->virt_addr);
-out_release_region:
- release_mem_region(pmem->phys_addr, pmem->size);
-out_free_dev:
- kfree(pmem);
-out:
- return ERR_PTR(err);
+static int pmem_rw_bytes(struct nd_namespace_common *ndns,
+ resource_size_t offset, void *buf, size_t size, int rw)
+{
+ struct pmem_device *pmem = dev_get_drvdata(ndns->claim);
+
+ if (unlikely(offset + size > pmem->size)) {
+ dev_WARN_ONCE(&ndns->dev, 1, "request out of range\n");
+ return -EFAULT;
+ }
+
+ if (rw == READ)
+ memcpy(buf, pmem->virt_addr + offset, size);
+ else
+ memcpy(pmem->virt_addr + offset, buf, size);
+
+ return 0;
+}
+
+static int nvdimm_namespace_attach_btt(struct nd_namespace_common *ndns)
+{
+ /* TODO */
+ return -ENXIO;
+}
+
+static void nvdimm_namespace_detach_btt(struct nd_namespace_common *ndns)
+{
+ /* TODO */
}

static void pmem_free(struct pmem_device *pmem)
{
- del_gendisk(pmem->pmem_disk);
- put_disk(pmem->pmem_disk);
- blk_cleanup_queue(pmem->pmem_queue);
iounmap(pmem->virt_addr);
release_mem_region(pmem->phys_addr, pmem->size);
kfree(pmem);
@@ -200,40 +232,44 @@ static void pmem_free(struct pmem_device *pmem)
static int nd_pmem_probe(struct device *dev)
{
struct nd_region *nd_region = to_nd_region(dev->parent);
- struct nd_namespace_io *nsio = to_nd_namespace_io(dev);
+ struct nd_namespace_common *ndns;
+ struct nd_namespace_io *nsio;
struct pmem_device *pmem;
+ int rc;

- if (resource_size(&nsio->res) < ND_MIN_NAMESPACE_SIZE) {
- resource_size_t size = resource_size(&nsio->res);
-
- dev_dbg(dev, "%s: size: %pa, too small must be at least %#x\n",
- __func__, &size, ND_MIN_NAMESPACE_SIZE);
- return -ENODEV;
- }
-
- if (nd_region_to_nstype(nd_region) == ND_DEVICE_NAMESPACE_PMEM) {
- struct nd_namespace_pmem *nspm = to_nd_namespace_pmem(dev);
-
- if (!nspm->uuid) {
- dev_dbg(dev, "%s: uuid not set\n", __func__);
- return -ENODEV;
- }
- }
+ ndns = nvdimm_namespace_common_probe(dev);
+ if (IS_ERR(ndns))
+ return PTR_ERR(ndns);

+ nsio = to_nd_namespace_io(&ndns->dev);
pmem = pmem_alloc(dev, &nsio->res, nd_region->id);
if (IS_ERR(pmem))
return PTR_ERR(pmem);

dev_set_drvdata(dev, pmem);
-
- return 0;
+ ndns->rw_bytes = pmem_rw_bytes;
+ if (is_nd_btt(dev))
+ rc = nvdimm_namespace_attach_btt(ndns);
+ else if (nd_btt_probe(ndns, pmem) == 0) {
+ /* we'll come back as btt-pmem */
+ rc = -ENXIO;
+ } else
+ rc = pmem_attach_disk(ndns, pmem);
+ if (rc)
+ pmem_free(pmem);
+ return rc;
}

static int nd_pmem_remove(struct device *dev)
{
struct pmem_device *pmem = dev_get_drvdata(dev);

+ if (is_nd_btt(dev))
+ nvdimm_namespace_detach_btt(to_nd_btt(dev)->ndns);
+ else
+ pmem_detach_disk(pmem);
pmem_free(pmem);
+
return 0;
}

diff --git a/drivers/nvdimm/region.c b/drivers/nvdimm/region.c
index 9aba44e483e0..2a5f3f53d79d 100644
--- a/drivers/nvdimm/region.c
+++ b/drivers/nvdimm/region.c
@@ -33,12 +33,13 @@ static int nd_region_probe(struct device *dev)
num_ns->count = rc + err;
dev_set_drvdata(dev, num_ns);

+ if (rc && err && rc == err)
+ return -ENODEV;
+
+ nd_region->btt_seed = nd_btt_create(nd_region);
if (err == 0)
return 0;

- if (rc == err)
- return -ENODEV;
-
/*
* Given multiple namespaces per region, we do not want to
* disable all the successfully registered peer namespaces upon
@@ -66,6 +67,7 @@ static int nd_region_remove(struct device *dev)
/* flush attribute readers and disable */
nvdimm_bus_lock(dev);
nd_region->ns_seed = NULL;
+ nd_region->btt_seed = NULL;
dev_set_drvdata(dev, NULL);
nvdimm_bus_unlock(dev);

diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
index ac21ce419beb..f4cc1d848156 100644
--- a/drivers/nvdimm/region_devs.c
+++ b/drivers/nvdimm/region_devs.c
@@ -296,10 +296,28 @@ static ssize_t namespace_seed_show(struct device *dev,
}
static DEVICE_ATTR_RO(namespace_seed);

+static ssize_t btt_seed_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ struct nd_region *nd_region = to_nd_region(dev);
+ ssize_t rc;
+
+ nvdimm_bus_lock(dev);
+ if (nd_region->btt_seed)
+ rc = sprintf(buf, "%s\n", dev_name(nd_region->btt_seed));
+ else
+ rc = sprintf(buf, "\n");
+ nvdimm_bus_unlock(dev);
+
+ return rc;
+}
+static DEVICE_ATTR_RO(btt_seed);
+
static struct attribute *nd_region_attributes[] = {
&dev_attr_size.attr,
&dev_attr_nstype.attr,
&dev_attr_mappings.attr,
+ &dev_attr_btt_seed.attr,
&dev_attr_set_cookie.attr,
&dev_attr_available_size.attr,
&dev_attr_namespace_seed.attr,
@@ -345,15 +363,18 @@ u64 nd_region_interleave_set_cookie(struct nd_region *nd_region)

/*
* Upon successful probe/remove, take/release a reference on the
- * associated interleave set (if present)
+ * associated interleave set (if present), and plant new btt + namespace
+ * seeds.
*/
static void nd_region_notify_driver_action(struct nvdimm_bus *nvdimm_bus,
struct device *dev, bool probe)
{
+ struct nd_region *nd_region;
+
if (!probe && (is_nd_pmem(dev) || is_nd_blk(dev))) {
- struct nd_region *nd_region = to_nd_region(dev);
int i;

+ nd_region = to_nd_region(dev);
for (i = 0; i < nd_region->ndr_mappings; i++) {
struct nd_mapping *nd_mapping = &nd_region->mapping[i];
struct nvdimm_drvdata *ndd = nd_mapping->ndd;
@@ -365,14 +386,21 @@ static void nd_region_notify_driver_action(struct nvdimm_bus *nvdimm_bus,
nd_mapping->ndd = NULL;
atomic_dec(&nvdimm->busy);
}
- } else if (dev->parent && is_nd_blk(dev->parent) && probe) {
- struct nd_region *nd_region = to_nd_region(dev->parent);
-
+ }
+ if (dev->parent && is_nd_blk(dev->parent) && probe) {
+ nd_region = to_nd_region(dev->parent);
nvdimm_bus_lock(dev);
if (nd_region->ns_seed == dev)
nd_region_create_blk_seed(nd_region);
nvdimm_bus_unlock(dev);
}
+ if (is_nd_btt(dev) && probe) {
+ nd_region = to_nd_region(dev->parent);
+ nvdimm_bus_lock(dev);
+ if (nd_region->btt_seed == dev)
+ nd_region_create_btt_seed(nd_region);
+ nvdimm_bus_unlock(dev);
+ }
}

void nd_region_probe_success(struct nvdimm_bus *nvdimm_bus, struct device *dev)
diff --git a/include/linux/nd.h b/include/linux/nd.h
index 23276ea91690..507e47c86737 100644
--- a/include/linux/nd.h
+++ b/include/linux/nd.h
@@ -12,6 +12,7 @@
*/
#ifndef __LINUX_ND_H__
#define __LINUX_ND_H__
+#include <linux/fs.h>
#include <linux/ndctl.h>
#include <linux/device.h>

@@ -29,12 +30,32 @@ static inline struct nd_device_driver *to_nd_device_driver(
};

/**
+ * struct nd_namespace_common - core infrastructure of a namespace
+ * @force_raw: ignore other personalities for the namespace (e.g. btt)
+ * @dev: device model node
+ * @claim: when set, another personality has taken ownership of the namespace
+ * @rw_bytes: access the raw namespace capacity with byte-aligned transfers
+ */
+struct nd_namespace_common {
+ int force_raw;
+ struct device dev;
+ struct device *claim;
+ int (*rw_bytes)(struct nd_namespace_common *, resource_size_t offset,
+ void *buf, size_t size, int rw);
+};
+
+static inline struct nd_namespace_common *to_ndns(struct device *dev)
+{
+ return container_of(dev, struct nd_namespace_common, dev);
+}
+
+/**
* struct nd_namespace_io - infrastructure for loading an nd_pmem instance
* @dev: namespace device created by the nd region driver
* @res: struct resource conversion of a NFIT SPA table
*/
struct nd_namespace_io {
- struct device dev;
+ struct nd_namespace_common common;
struct resource res;
};

@@ -52,7 +73,6 @@ struct nd_namespace_pmem {

/**
* struct nd_namespace_blk - namespace for dimm-bounded persistent memory
- * @dev: namespace device creation by the nd region driver
* @alt_name: namespace name supplied in the dimm label
* @uuid: namespace name supplied in the dimm label
* @id: ida allocated id
@@ -61,7 +81,7 @@ struct nd_namespace_pmem {
* @res: discontiguous dpa extents for given dimm
*/
struct nd_namespace_blk {
- struct device dev;
+ struct nd_namespace_common common;
char *alt_name;
u8 *uuid;
int id;
@@ -72,7 +92,7 @@ struct nd_namespace_blk {

static inline struct nd_namespace_io *to_nd_namespace_io(struct device *dev)
{
- return container_of(dev, struct nd_namespace_io, dev);
+ return container_of(dev, struct nd_namespace_io, common.dev);
}

static inline struct nd_namespace_pmem *to_nd_namespace_pmem(struct device *dev)
@@ -84,7 +104,40 @@ static inline struct nd_namespace_pmem *to_nd_namespace_pmem(struct device *dev)

static inline struct nd_namespace_blk *to_nd_namespace_blk(struct device *dev)
{
- return container_of(dev, struct nd_namespace_blk, dev);
+ return container_of(dev, struct nd_namespace_blk, common.dev);
+}
+
+/**
+ * nvdimm_read_bytes() - synchronously read bytes from an nvdimm namespace
+ * @ndns: device to read
+ * @offset: namespace-relative starting offset
+ * @buf: buffer to fill
+ * @size: transfer length
+ *
+ * @buf is up-to-date upon return from this routine.
+ */
+static inline int nvdimm_read_bytes(struct nd_namespace_common *ndns,
+ resource_size_t offset, void *buf, size_t size)
+{
+ return ndns->rw_bytes(ndns, offset, buf, size, READ);
+}
+
+/**
+ * nvdimm_write_bytes() - synchronously write bytes to an nvdimm namespace
+ * @ndns: device to write
+ * @offset: namespace-relative starting offset
+ * @buf: buffer to drain
+ * @size: transfer length
+ *
+ * NVDIMM Namespace disks do not implement sectors internally. Depending on
+ * the @ndns, the contents of @buf may be in cpu cache, platform buffers,
+ * or on backing memory media upon return from this routine. Flushing
+ * to media is handled internal to the @ndns driver, if at all.
+ */
+static inline int nvdimm_write_bytes(struct nd_namespace_common *ndns,
+ resource_size_t offset, void *buf, size_t size)
+{
+ return ndns->rw_bytes(ndns, offset, buf, size, WRITE);
}

#define MODULE_ALIAS_ND_DEVICE(type) \

2015-06-25 09:42:51

by Dan Williams

Subject: [PATCH v2 02/17] nd_btt: atomic sector updates

From: Vishal Verma <[email protected]>

BTT stands for Block Translation Table, and is a way to provide power
fail sector atomicity semantics for block devices that have the ability
to perform byte granularity IO. It relies on the capability of libnvdimm
namespace devices to do byte aligned IO.

The BTT works as a stacked block device, and reserves a chunk of space
from the backing device for its accounting metadata. It is a bio-based
driver because all IO is done synchronously, and there is no queuing or
asynchronous completions at either the device or the driver level.

The BTT uses 'lanes' to index into various 'on-disk' data structures,
and lanes also act as a synchronization mechanism in case there are more
CPUs than available lanes. We did a comparison between two lane lock
strategies - first where we kept an atomic counter around that tracked
which was the last lane that was used, and 'our' lane was determined by
atomically incrementing that. That way, for the nr_cpus > nr_lanes case,
theoretically, no CPU would be blocked waiting for a lane. The other
strategy was to use the cpu number we're scheduled on and hash it to
a lane number. Theoretically, this could block an IO that could've
otherwise run using a different, free lane. But some fio workloads
showed that the direct cpu -> lane hash performed faster than tracking
'last lane' - my reasoning is the cache thrash caused by moving the
atomic variable made that approach slower than simply waiting out the
in-progress IO. This supports the conclusion that the driver can be a
very simple bio-based one that does synchronous IOs instead of queuing.
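
As a rough illustration of the two lane-selection strategies compared above, here is a minimal userspace sketch. It is not the driver code: the function names are hypothetical, and a plain modulo stands in for the cpu -> lane hash, which is an assumption.

```c
#include <assert.h>
#include <stdatomic.h>

/*
 * Strategy 1 (hypothetical name): a shared counter tracks the last lane
 * handed out. Every IO on every CPU modifies the same atomic variable,
 * so its cache line bounces between CPUs.
 */
static unsigned int lane_from_counter(atomic_uint *last_lane,
				      unsigned int nr_lanes)
{
	assert(nr_lanes > 0);
	return atomic_fetch_add(last_lane, 1) % nr_lanes;
}

/*
 * Strategy 2 (hypothetical name): derive the lane from the CPU number.
 * No shared state is touched, but two CPUs that map to the same lane
 * must serialize on that lane's lock (not shown here).
 */
static unsigned int lane_from_cpu(unsigned int cpu, unsigned int nr_lanes)
{
	assert(nr_lanes > 0);
	return cpu % nr_lanes;	/* assumed hash: simple modulo */
}
```

With 4 lanes, CPUs 1 and 5 both land on lane 1 under the second strategy, which is the blocking case described above.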

Cc: Andy Lutomirski <[email protected]>
Cc: Boaz Harrosh <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Neil Brown <[email protected]>
Cc: Jeff Moyer <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Greg KH <[email protected]>
[jmoyer: fix nmi watchdog timeout in btt_map_init]
[jmoyer: move btt initialization to module load path]
[jmoyer: fix memory leak in the btt initialization path]
[jmoyer: Don't overwrite corrupted arenas]
Signed-off-by: Vishal Verma <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
Documentation/nvdimm/btt.txt | 273 ++++++++
drivers/acpi/nfit.c | 1
drivers/nvdimm/Kconfig | 28 +
drivers/nvdimm/Makefile | 3
drivers/nvdimm/btt.c | 1371 +++++++++++++++++++++++++++++++++++++++
drivers/nvdimm/btt.h | 141 ++++
drivers/nvdimm/btt_devs.c | 3
drivers/nvdimm/namespace_devs.c | 24 +
drivers/nvdimm/nd.h | 22 +
drivers/nvdimm/pmem.c | 14
drivers/nvdimm/region.c | 12
drivers/nvdimm/region_devs.c | 82 ++
include/linux/libnvdimm.h | 1
13 files changed, 1950 insertions(+), 25 deletions(-)
create mode 100644 Documentation/nvdimm/btt.txt
create mode 100644 drivers/nvdimm/btt.c

diff --git a/Documentation/nvdimm/btt.txt b/Documentation/nvdimm/btt.txt
new file mode 100644
index 000000000000..95134d5ec4a0
--- /dev/null
+++ b/Documentation/nvdimm/btt.txt
@@ -0,0 +1,273 @@
+BTT - Block Translation Table
+=============================
+
+
+1. Introduction
+---------------
+
+Persistent memory based storage is able to perform IO at byte (or more
+accurately, cache line) granularity. However, we often want to expose such
+storage as traditional block devices. The block drivers for persistent memory
+will do exactly this. However, they do not provide any atomicity guarantees.
+Traditional SSDs typically provide protection against torn sectors in hardware,
+using stored energy in capacitors to complete in-flight block writes, or perhaps
+in firmware. We don't have this luxury with persistent memory - if a write is in
+progress, and we experience a power failure, the block will contain a mix of old
+and new data. Applications may not be prepared to handle such a scenario.
+
+The Block Translation Table (BTT) provides atomic sector update semantics for
+persistent memory devices, so that applications that rely on sector writes not
+being torn can continue to do so. The BTT manifests itself as a stacked block
+device, and reserves a portion of the underlying storage for its metadata. At
+the heart of it is an indirection table that re-maps all the blocks on the
+volume. It can be thought of as an extremely simple file system that only
+provides atomic sector updates.
+
+
+2. Static Layout
+----------------
+
+The underlying storage on which a BTT can be laid out is not limited in any way.
+The BTT, however, splits the available space into chunks of up to 512 GiB,
+called "Arenas".
+
+Each arena follows the same layout for its metadata, and all references in an
+arena are internal to it (with the exception of one field that points to the
+next arena). The following depicts the "On-disk" metadata layout:
+
+
+ Backing Store +-------> Arena
++---------------+ | +------------------+
+| | | | Arena info block |
+| Arena 0 +---+ | 4K |
+| 512G | +------------------+
+| | | |
++---------------+ | |
+| | | |
+| Arena 1 | | Data Blocks |
+| 512G | | |
+| | | |
++---------------+ | |
+| . | | |
+| . | | |
+| . | | |
+| | | |
+| | | |
++---------------+ +------------------+
+ | |
+ | BTT Map |
+ | |
+ | |
+ +------------------+
+ | |
+ | BTT Flog |
+ | |
+ +------------------+
+ | Info block copy |
+ | 4K |
+ +------------------+
+
+
+3. Theory of Operation
+----------------------
+
+
+a. The BTT Map
+--------------
+
+The map is a simple lookup/indirection table that maps an LBA to an internal
+block. Each map entry is 32 bits. The two most significant bits are special
+flags, and the remaining form the internal block number.
+
+Bit Description
+31 : TRIM flag - marks if the block was trimmed or discarded
+30 : ERROR flag - marks an error block. Cleared on write.
+29 - 0 : Mappings to internal 'postmap' blocks
+
+
+Some of the terminology that will be subsequently used:
+
+External LBA : LBA as made visible to upper layers.
+ABA : Arena Block Address - Block offset/number within an arena
+Premap ABA : The block offset into an arena, which was decided upon by range
+ checking the External LBA
+Postmap ABA : The block number in the "Data Blocks" area obtained after
+ indirection from the map
+nfree : The number of free blocks that are maintained at any given time.
+ This is the number of concurrent writes that can happen to the
+ arena.
+
+
+For example, after adding a BTT, we surface a disk of 1024G. We get a read for
+the external LBA at 768G. This falls into the second arena, and of the 512G
+worth of blocks that this arena contributes, this block is at 256G. Thus, the
+premap ABA is 256G. We now refer to the map, and find out the mapping for block
+'X' (256G) points to block 'Y', say '64'. Thus the postmap ABA is 64.
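
The map entry layout and the worked example above can be sketched in userspace C as follows. The macro, struct, and function names are hypothetical, and the sketch assumes every arena contributes the same number of blocks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names for the bit layout described in the table above. */
#define MAP_TRIM_FLAG	(UINT32_C(1) << 31)
#define MAP_ERROR_FLAG	(UINT32_C(1) << 30)
#define MAP_LBA_MASK	UINT32_C(0x3FFFFFFF)	/* bits 29 - 0 */

struct map_entry {
	bool trim;
	bool error;
	uint32_t postmap_aba;
};

/* Split a raw 32-bit map entry into its flags and postmap block number. */
static struct map_entry decode_map_entry(uint32_t raw)
{
	struct map_entry e = {
		.trim = (raw & MAP_TRIM_FLAG) != 0,
		.error = (raw & MAP_ERROR_FLAG) != 0,
		.postmap_aba = raw & MAP_LBA_MASK,
	};
	return e;
}

struct arena_addr {
	uint64_t arena;		/* zero-based arena index */
	uint64_t premap_aba;	/* offset within that arena */
};

/* Range-check an external LBA into (arena, premap ABA). */
static struct arena_addr lba_to_arena(uint64_t external_lba,
				      uint64_t per_arena)
{
	assert(per_arena > 0);
	struct arena_addr a = {
		.arena = external_lba / per_arena,
		.premap_aba = external_lba % per_arena,
	};
	return a;
}
```

Using G as the unit, lba_to_arena(768, 512) yields arena index 1 (the second arena) and a premap ABA of 256, matching the example.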
+
+
+b. The BTT Flog
+---------------
+
+The BTT provides sector atomicity by making every write an "allocating write",
+i.e. every write goes to a "free" block. A running list of free blocks is
+maintained in the form of the BTT flog. 'Flog' is a combination of the words
+"free list" and "log". The flog contains 'nfree' entries, and an entry contains:
+
+lba : The premap ABA that is being written to
+old_map : The old postmap ABA - after 'this' write completes, this will be a
+ free block.
+new_map : The new postmap ABA. The map will be updated to reflect this
+ lba->postmap_aba mapping, but we log it here in case we have to
+ recover.
+seq : Sequence number to mark which of the 2 sections of this flog entry is
+ valid/newest. It cycles between 01->10->11->01 (binary) under normal
+ operation, with 00 indicating an uninitialized state.
+lba' : alternate lba entry
+old_map': alternate old postmap entry
+new_map': alternate new postmap entry
+seq' : alternate sequence number.
+
+Each of the above fields is 32-bit, making one entry 32 bytes. Flog updates are
+done such that for any entry being written, it:
+a. overwrites the 'old' section in the entry based on sequence numbers
+b. writes the new entry such that the sequence number is written last.
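
A minimal sketch of how the sequence numbers could be used to pick the valid/newest section of a flog entry. The struct layout mirrors the eight fields listed above; the names and the error convention (-1) are assumptions and do not come from the driver.

```c
#include <assert.h>
#include <stdint.h>

/* One section of a flog entry; an entry holds two of these. */
struct flog_section {
	uint32_t lba;
	uint32_t old_map;
	uint32_t new_map;
	uint32_t seq;
};

struct flog_entry {
	struct flog_section sec[2];
};

/* Advance through the 01 -> 10 -> 11 -> 01 cycle. */
static uint32_t next_seq(uint32_t seq)
{
	assert(seq >= 1 && seq <= 3);
	return seq % 3 + 1;
}

/*
 * Return the index (0 or 1) of the newest section, or -1 if that cannot
 * be determined (both uninitialized, equal, or out of range).
 */
static int newest_section(const struct flog_entry *e)
{
	uint32_t s0 = e->sec[0].seq, s1 = e->sec[1].seq;

	if (s0 > 3 || s1 > 3)
		return -1;
	if (s0 == 0 && s1 == 0)
		return -1;
	if (s0 == 0)
		return 1;
	if (s1 == 0)
		return 0;
	if (next_seq(s0) == s1)
		return 1;
	if (next_seq(s1) == s0)
		return 0;
	return -1;
}
```

Since the sequence number is written last, a section whose other fields were only partially written still carries its old sequence number and loses this comparison.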
+
+
+c. The concept of lanes
+-----------------------
+
+While 'nfree' describes the number of IOs an arena can process
+concurrently, 'nlanes' is the number of IOs the BTT device as a whole can
+process.
+ nlanes = min(nfree, num_cpus)
+A lane number is obtained at the start of any IO, and is used for indexing into
+all the on-disk and in-memory data structures for the duration of the IO. It is
+protected by a spinlock.
+
+
+d. In-memory data structure: Read Tracking Table (RTT)
+------------------------------------------------------
+
+Consider a case where we have two threads, one doing reads and the other,
+writes. We can hit a condition where the writer thread grabs a free block to do
+a new IO, but the (slow) reader thread is still reading from it. In other words,
+the reader consulted a map entry, and started reading the corresponding block. A
+writer started writing to the same external LBA, and finished the write updating
+the map for that external LBA to point to its new postmap ABA. At this point the
+internal, postmap block that the reader is (still) reading has been inserted
+into the list of free blocks. If another write comes in for the same LBA, it can
+grab this free block, and start writing to it, causing the reader to read
+incorrect data. To prevent this, we introduce the RTT.
+
+The RTT is a simple, per arena table with 'nfree' entries. Every reader inserts
+into rtt[lane_number], the postmap ABA it is reading, and clears it after the
+read is complete. Every writer thread, after grabbing a free block, checks the
+RTT for its presence. If the postmap free block is in the RTT, it waits till the
+reader clears the RTT entry, and only then starts writing to it.
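
The RTT bookkeeping described above might look roughly like this. It is a single-threaded sketch only: the names and the RTT_EMPTY sentinel are assumptions, and a real implementation would additionally need atomic accesses and memory barriers between readers and writers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define RTT_EMPTY UINT32_MAX	/* assumed sentinel for "no read in flight" */

/* Reader: publish the postmap ABA about to be read from this lane. */
static void rtt_enter(uint32_t *rtt, size_t nfree, size_t lane,
		      uint32_t postmap_aba)
{
	assert(lane < nfree);
	rtt[lane] = postmap_aba;
}

/* Reader: clear the slot once the read has completed. */
static void rtt_leave(uint32_t *rtt, size_t nfree, size_t lane)
{
	assert(lane < nfree);
	rtt[lane] = RTT_EMPTY;
}

/* Writer: is any reader still using the free block we just grabbed? */
static bool rtt_busy(const uint32_t *rtt, size_t nfree, uint32_t postmap_aba)
{
	for (size_t i = 0; i < nfree; i++)
		if (rtt[i] == postmap_aba)
			return true;
	return false;
}
```

A writer would loop on rtt_busy() for its free block and only start writing once it returns false.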
+
+
+e. In-memory data structure: map locks
+--------------------------------------
+
+Consider a case where two writer threads are writing to the same LBA. There can
+be a race in the following sequence of steps:
+
+free[lane] = map[premap_aba]
+map[premap_aba] = postmap_aba
+
+Both threads can update their respective free[lane] with the same old, freed
+postmap_aba. This has made the layout inconsistent by losing a free entry, and
+at the same time, duplicating another free entry for two lanes.
+
+To solve this, we could have a single map lock (per arena) that has to be taken
+before performing the above sequence, but we feel that could be too contentious.
+Instead we use an array of (nfree) map_locks that is indexed by
+(premap_aba modulo nfree).
+
+
+f. Reconstruction from the Flog
+-------------------------------
+
+On startup, we analyze the BTT flog to create our list of free blocks. We walk
+through all the entries, and for each lane, of the set of two possible
+'sections', we always look at the most recent one only (based on the sequence
+number). The reconstruction rules/steps are simple:
+- Read map[log_entry.lba].
+- If log_entry.new matches the map entry, then log_entry.old is free.
+- If log_entry.new does not match the map entry, then log_entry.new is free.
+ (This case can only be caused by power-fails/unsafe shutdowns)
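
The reconstruction rules above reduce to one comparison per lane. A minimal sketch, assuming the newest section has already been selected and that map entries have had their flag bits masked off (struct and function names are hypothetical):

```c
#include <assert.h>
#include <stdint.h>

/* The newest section of one lane's flog entry. */
struct log_entry {
	uint32_t lba;		/* premap ABA that was being written */
	uint32_t old_map;	/* postmap ABA before the write */
	uint32_t new_map;	/* postmap ABA the write went to */
};

/* Return the postmap block that is free for this lane after startup. */
static uint32_t free_block_for_lane(const uint32_t *map,
				    const struct log_entry *newest)
{
	if (map[newest->lba] == newest->new_map)
		return newest->old_map;	/* map update completed */
	return newest->new_map;		/* interrupted before map update */
}
```

In the second case the map still points at the old block, so the data written to new_map was never made visible and that block can be reused.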
+
+
+g. Summarizing - Read and Write flows
+-------------------------------------
+
+Read:
+
+1. Convert external LBA to arena number + pre-map ABA
+2. Get a lane (and take lane_lock)
+3. Read map to get the entry for this pre-map ABA
+4. Enter post-map ABA into RTT[lane]
+5. If TRIM flag set in map, return zeroes, and end IO (go to step 8)
+6. If ERROR flag set in map, end IO with EIO (go to step 8)
+7. Read data from this block
+8. Remove post-map ABA entry from RTT[lane]
+9. Release lane (and lane_lock)
+
+Write:
+
+1. Convert external LBA to Arena number + pre-map ABA
+2. Get a lane (and take lane_lock)
+3. Use lane to index into in-memory free list and obtain a new block, next flog
+ index, next sequence number
+4. Scan the RTT to check if free block is present, and spin/wait if it is.
+5. Write data to this free block
+6. Read map to get the existing post-map ABA entry for this pre-map ABA
+7. Write flog entry: [premap_aba / old postmap_aba / new postmap_aba / seq_num]
+8. Write new post-map ABA into map.
+9. Write old post-map entry into the free list
+10. Calculate next sequence number and write into the free list entry
+11. Release lane (and lane_lock)
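
To make the ordering of the write flow concrete, here is a toy in-memory model of a single arena. It is only a sketch under loud assumptions: all names and sizes are made up, lane locking and the RTT check (steps 2, 4 and 11) are omitted, and each lane keeps a single flog section instead of two alternating ones.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { NBLOCKS = 4, NFREE = 2, SECTOR = 8 };	/* toy sizes */

struct flog { uint32_t lba, old_map, new_map, seq; };
struct free_ent { uint32_t block, seq; };

struct arena {
	uint8_t data[NBLOCKS + NFREE][SECTOR];	/* "Data Blocks" */
	uint32_t map[NBLOCKS];			/* premap -> postmap */
	struct flog flog[NFREE];		/* one entry per lane */
	struct free_ent freelist[NFREE];	/* in-memory free list */
};

static void arena_init(struct arena *a)
{
	memset(a, 0, sizeof(*a));
	for (uint32_t i = 0; i < NBLOCKS; i++)
		a->map[i] = i;			/* identity mapping */
	for (uint32_t l = 0; l < NFREE; l++) {
		a->freelist[l].block = NBLOCKS + l;
		a->freelist[l].seq = 1;
	}
}

static void arena_write(struct arena *a, unsigned int lane,
			uint32_t premap, const uint8_t *buf)
{
	assert(lane < NFREE && premap < NBLOCKS);

	uint32_t new_blk = a->freelist[lane].block;	/* step 3 */
	uint32_t seq = a->freelist[lane].seq;

	memcpy(a->data[new_blk], buf, SECTOR);		/* step 5 */
	uint32_t old_blk = a->map[premap];		/* step 6 */
	a->flog[lane] = (struct flog){ premap, old_blk, new_blk, seq }; /* 7 */
	a->map[premap] = new_blk;			/* step 8 */
	a->freelist[lane].block = old_blk;		/* step 9 */
	a->freelist[lane].seq = seq % 3 + 1;		/* step 10 */
}

static const uint8_t *arena_read(const struct arena *a, uint32_t premap)
{
	assert(premap < NBLOCKS);
	return a->data[a->map[premap]];
}
```

The old block is never modified in place: until step 8 a reader of the same premap ABA sees the complete old sector, and after it the complete new one.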
+
+
+4. Error Handling
+=================
+
+An arena would be in an error state if any of the metadata is corrupted
+irrecoverably, either due to a bug or a media error. The following conditions
+indicate an error:
+- Info block checksum does not match (and recovering from the copy also fails)
+- All internal available blocks are not uniquely and entirely addressed by the
+ sum of mapped blocks and free blocks (from the BTT flog).
+- Rebuilding free list from the flog reveals missing/duplicate/impossible
+ entries
+- A map entry is out of bounds
+
+If any of these error conditions are encountered, the arena is put into a read
+only state using a flag in the info block.
+
+
+5. In-kernel usage
+==================
+
+Any block driver that supports byte granularity IO to the storage may register
+with the BTT. It will have to provide the rw_bytes interface in its
+block_device_operations struct:
+
+ int (*rw_bytes)(struct gendisk *, void *, size_t, off_t, int rw);
+
+It may register with the BTT after it adds its own gendisk, using btt_init:
+
+ struct btt *btt_init(struct gendisk *disk, unsigned long long rawsize,
+ u32 lbasize, u8 uuid[], int maxlane);
+
+Note that maxlane is the maximum amount of concurrency the driver wishes to
+allow the BTT to use.
+
+The BTT 'disk' appears as a stacked block device that grabs the underlying block
+device in the O_EXCL mode.
+
+When the driver wishes to remove the backing disk, it should similarly call
+btt_fini using the same struct btt* handle that was provided to it by btt_init.
+
+ void btt_fini(struct btt *btt);
+
diff --git a/drivers/acpi/nfit.c b/drivers/acpi/nfit.c
index 35af6f7f0abd..fc38b49eff7d 100644
--- a/drivers/acpi/nfit.c
+++ b/drivers/acpi/nfit.c
@@ -902,6 +902,7 @@ static int acpi_nfit_init_mapping(struct acpi_nfit_desc *acpi_desc,
} else {
nd_mapping->size = nfit_mem->bdw->capacity;
nd_mapping->start = nfit_mem->bdw->start_address;
+ ndr_desc->num_lanes = nfit_mem->bdw->windows;
blk_valid = 1;
}

diff --git a/drivers/nvdimm/Kconfig b/drivers/nvdimm/Kconfig
index 5680e8e7a7aa..204ee0796411 100644
--- a/drivers/nvdimm/Kconfig
+++ b/drivers/nvdimm/Kconfig
@@ -8,11 +8,11 @@ menuconfig LIBNVDIMM
NFIT, or otherwise can discover NVDIMM resources, a libnvdimm
bus is registered to advertise PMEM (persistent memory)
namespaces (/dev/pmemX) and BLK (sliding mmio window(s))
- namespaces (/dev/ndX). A PMEM namespace refers to a memory
- resource that may span multiple DIMMs and support DAX (see
- CONFIG_DAX). A BLK namespace refers to an NVDIMM control
- region which exposes an mmio register set for windowed
- access mode to non-volatile memory.
+ namespaces (/dev/ndblkX.Y). A PMEM namespace refers to a
+ memory resource that may span multiple DIMMs and support DAX
+ (see CONFIG_DAX). A BLK namespace refers to an NVDIMM control
+ region which exposes an mmio register set for windowed access
+ mode to non-volatile memory.

if LIBNVDIMM

@@ -20,6 +20,7 @@ config BLK_DEV_PMEM
tristate "PMEM: Persistent memory block device support"
default LIBNVDIMM
depends on HAS_IOMEM
+ select ND_BTT if BTT
help
Memory ranges for PMEM are described by either an NFIT
(NVDIMM Firmware Interface Table, see CONFIG_NFIT_ACPI), a
@@ -33,7 +34,22 @@ config BLK_DEV_PMEM

Say Y if you want to use an NVDIMM

+config ND_BTT
+ tristate
+
config BTT
- def_bool y
+ bool "BTT: Block Translation Table (atomic sector updates)"
+ default y if LIBNVDIMM
+ help
+ The Block Translation Table (BTT) provides atomic sector
+ update semantics for persistent memory devices, so that
+ applications that rely on sector writes not being torn (a
+ guarantee that typical disks provide) can continue to do so.
+ The BTT manifests itself as an alternate personality for an
+ NVDIMM namespace, i.e. a namespace can be in raw mode (pmemX,
+ ndblkX.Y, etc...), or 'sectored' mode, (pmemXs, ndblkX.Ys,
+ etc...).
+
+ Select Y if unsure

endif
diff --git a/drivers/nvdimm/Makefile b/drivers/nvdimm/Makefile
index 6085b4bd7312..d2aab6c58492 100644
--- a/drivers/nvdimm/Makefile
+++ b/drivers/nvdimm/Makefile
@@ -1,8 +1,11 @@
obj-$(CONFIG_LIBNVDIMM) += libnvdimm.o
obj-$(CONFIG_BLK_DEV_PMEM) += nd_pmem.o
+obj-$(CONFIG_ND_BTT) += nd_btt.o

nd_pmem-y := pmem.o

+nd_btt-y := btt.o
+
libnvdimm-y := core.o
libnvdimm-y += bus.o
libnvdimm-y += dimm_devs.o
diff --git a/drivers/nvdimm/btt.c b/drivers/nvdimm/btt.c
new file mode 100644
index 000000000000..7ae38aac2c25
--- /dev/null
+++ b/drivers/nvdimm/btt.c
@@ -0,0 +1,1371 @@
+/*
+ * Block Translation Table
+ * Copyright (c) 2014-2015, Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
+ * more details.
+ */
+#include <linux/highmem.h>
+#include <linux/debugfs.h>
+#include <linux/blkdev.h>
+#include <linux/module.h>
+#include <linux/device.h>
+#include <linux/mutex.h>
+#include <linux/hdreg.h>
+#include <linux/genhd.h>
+#include <linux/sizes.h>
+#include <linux/ndctl.h>
+#include <linux/fs.h>
+#include <linux/nd.h>
+#include "btt.h"
+#include "nd.h"
+
+enum log_ent_request {
+ LOG_NEW_ENT = 0,
+ LOG_OLD_ENT
+};
+
+static int btt_major;
+
+static int arena_read_bytes(struct arena_info *arena, resource_size_t offset,
+ void *buf, size_t n)
+{
+ struct nd_btt *nd_btt = arena->nd_btt;
+ struct nd_namespace_common *ndns = nd_btt->ndns;
+
+ /* arena offsets are 4K from the base of the device */
+ offset += SZ_4K;
+ return nvdimm_read_bytes(ndns, offset, buf, n);
+}
+
+static int arena_write_bytes(struct arena_info *arena, resource_size_t offset,
+ void *buf, size_t n)
+{
+ struct nd_btt *nd_btt = arena->nd_btt;
+ struct nd_namespace_common *ndns = nd_btt->ndns;
+
+ /* arena offsets are 4K from the base of the device */
+ offset += SZ_4K;
+ return nvdimm_write_bytes(ndns, offset, buf, n);
+}
+
+static int btt_info_write(struct arena_info *arena, struct btt_sb *super)
+{
+ int ret;
+
+ ret = arena_write_bytes(arena, arena->info2off, super,
+ sizeof(struct btt_sb));
+ if (ret)
+ return ret;
+
+ return arena_write_bytes(arena, arena->infooff, super,
+ sizeof(struct btt_sb));
+}
+
+static int btt_info_read(struct arena_info *arena, struct btt_sb *super)
+{
+ WARN_ON(!super);
+ return arena_read_bytes(arena, arena->infooff, super,
+ sizeof(struct btt_sb));
+}
+
+/*
+ * 'raw' version of btt_map write
+ * Assumptions:
+ * mapping is in little-endian
+ * mapping contains 'E' and 'Z' flags as desired
+ */
+static int __btt_map_write(struct arena_info *arena, u32 lba, __le32 mapping)
+{
+ u64 ns_off = arena->mapoff + (lba * MAP_ENT_SIZE);
+
+ WARN_ON(lba >= arena->external_nlba);
+ return arena_write_bytes(arena, ns_off, &mapping, MAP_ENT_SIZE);
+}
+
+static int btt_map_write(struct arena_info *arena, u32 lba, u32 mapping,
+ u32 z_flag, u32 e_flag)
+{
+ u32 ze;
+ __le32 mapping_le;
+
+ /*
+ * This 'mapping' is supposed to be just the LBA mapping, without
+ * any flags set, so strip the flag bits.
+ */
+ mapping &= MAP_LBA_MASK;
+
+ ze = (z_flag << 1) + e_flag;
+ switch (ze) {
+ case 0:
+ /*
+ * We want to set neither of the Z or E flags, and
+ * in the actual layout, this means setting the bit
+ * positions of both to '1' to indicate a 'normal'
+ * map entry
+ */
+ mapping |= MAP_ENT_NORMAL;
+ break;
+ case 1:
+ mapping |= (1 << MAP_ERR_SHIFT);
+ break;
+ case 2:
+ mapping |= (1 << MAP_TRIM_SHIFT);
+ break;
+ default:
+ /*
+ * The case where Z and E are both sent in as '1' could be
+ * construed as a valid 'normal' case, but we decide not to,
+ * to avoid confusion
+ */
+ WARN_ONCE(1, "Invalid use of Z and E flags\n");
+ return -EIO;
+ }
+
+ mapping_le = cpu_to_le32(mapping);
+ return __btt_map_write(arena, lba, mapping_le);
+}
+
+static int btt_map_read(struct arena_info *arena, u32 lba, u32 *mapping,
+ int *trim, int *error)
+{
+ int ret;
+ __le32 in;
+ u32 raw_mapping, postmap, ze, z_flag, e_flag;
+ u64 ns_off = arena->mapoff + (lba * MAP_ENT_SIZE);
+
+ WARN_ON(lba >= arena->external_nlba);
+
+ ret = arena_read_bytes(arena, ns_off, &in, MAP_ENT_SIZE);
+ if (ret)
+ return ret;
+
+ raw_mapping = le32_to_cpu(in);
+
+ z_flag = (raw_mapping & MAP_TRIM_MASK) >> MAP_TRIM_SHIFT;
+ e_flag = (raw_mapping & MAP_ERR_MASK) >> MAP_ERR_SHIFT;
+ ze = (z_flag << 1) + e_flag;
+ postmap = raw_mapping & MAP_LBA_MASK;
+
+ /* Reuse the {z,e}_flag variables for *trim and *error */
+ z_flag = 0;
+ e_flag = 0;
+
+ switch (ze) {
+ case 0:
+ /* Initial state. Return postmap = premap */
+ *mapping = lba;
+ break;
+ case 1:
+ *mapping = postmap;
+ e_flag = 1;
+ break;
+ case 2:
+ *mapping = postmap;
+ z_flag = 1;
+ break;
+ case 3:
+ *mapping = postmap;
+ break;
+ default:
+ return -EIO;
+ }
+
+ if (trim)
+ *trim = z_flag;
+ if (error)
+ *error = e_flag;
+
+ return ret;
+}
+
+static int btt_log_read_pair(struct arena_info *arena, u32 lane,
+ struct log_entry *ent)
+{
+ WARN_ON(!ent);
+ return arena_read_bytes(arena,
+ arena->logoff + (2 * lane * LOG_ENT_SIZE), ent,
+ 2 * LOG_ENT_SIZE);
+}
+
+static struct dentry *debugfs_root;
+
+static void arena_debugfs_init(struct arena_info *a, struct dentry *parent,
+ int idx)
+{
+ char dirname[32];
+ struct dentry *d;
+
+ /* If for some reason, parent bttN was not created, exit */
+ if (!parent)
+ return;
+
+ snprintf(dirname, 32, "arena%d", idx);
+ d = debugfs_create_dir(dirname, parent);
+ if (IS_ERR_OR_NULL(d))
+ return;
+ a->debugfs_dir = d;
+
+ debugfs_create_x64("size", S_IRUGO, d, &a->size);
+ debugfs_create_x64("external_lba_start", S_IRUGO, d,
+ &a->external_lba_start);
+ debugfs_create_x32("internal_nlba", S_IRUGO, d, &a->internal_nlba);
+ debugfs_create_u32("internal_lbasize", S_IRUGO, d,
+ &a->internal_lbasize);
+ debugfs_create_x32("external_nlba", S_IRUGO, d, &a->external_nlba);
+ debugfs_create_u32("external_lbasize", S_IRUGO, d,
+ &a->external_lbasize);
+ debugfs_create_u32("nfree", S_IRUGO, d, &a->nfree);
+ debugfs_create_u16("version_major", S_IRUGO, d, &a->version_major);
+ debugfs_create_u16("version_minor", S_IRUGO, d, &a->version_minor);
+ debugfs_create_x64("nextoff", S_IRUGO, d, &a->nextoff);
+ debugfs_create_x64("infooff", S_IRUGO, d, &a->infooff);
+ debugfs_create_x64("dataoff", S_IRUGO, d, &a->dataoff);
+ debugfs_create_x64("mapoff", S_IRUGO, d, &a->mapoff);
+ debugfs_create_x64("logoff", S_IRUGO, d, &a->logoff);
+ debugfs_create_x64("info2off", S_IRUGO, d, &a->info2off);
+ debugfs_create_x32("flags", S_IRUGO, d, &a->flags);
+}
+
+static void btt_debugfs_init(struct btt *btt)
+{
+ int i = 0;
+ struct arena_info *arena;
+
+ btt->debugfs_dir = debugfs_create_dir(dev_name(&btt->nd_btt->dev),
+ debugfs_root);
+ if (IS_ERR_OR_NULL(btt->debugfs_dir))
+ return;
+
+ list_for_each_entry(arena, &btt->arena_list, list) {
+ arena_debugfs_init(arena, btt->debugfs_dir, i);
+ i++;
+ }
+}
+
+/*
+ * This function accepts the two log entries of a lane, and uses the
+ * sequence numbers to find the 'older' entry.
+ * If slot 0 was never written (seq == 0), it sets ent[0].seq to 1 in
+ * the caller's copy and reports slot 0 as the older one.
+ * It returns the index of the older entry, or -EINVAL on a bad pair.
+ *
+ * TODO The logic feels a bit kludge-y. make it better..
+ */
+static int btt_log_get_old(struct log_entry *ent)
+{
+ int old;
+
+ /*
+ * the first ever time this is seen, the entry goes into [0]
+ * the next time, the following logic works out to put this
+ * (next) entry into [1]
+ */
+ if (ent[0].seq == 0) {
+ ent[0].seq = cpu_to_le32(1);
+ return 0;
+ }
+
+ if (ent[0].seq == ent[1].seq)
+ return -EINVAL;
+ if (le32_to_cpu(ent[0].seq) + le32_to_cpu(ent[1].seq) > 5)
+ return -EINVAL;
+
+ if (le32_to_cpu(ent[0].seq) < le32_to_cpu(ent[1].seq)) {
+ if (le32_to_cpu(ent[1].seq) - le32_to_cpu(ent[0].seq) == 1)
+ old = 0;
+ else
+ old = 1;
+ } else {
+ if (le32_to_cpu(ent[0].seq) - le32_to_cpu(ent[1].seq) == 1)
+ old = 1;
+ else
+ old = 0;
+ }
+
+ return old;
+}
+
+static struct device *to_dev(struct arena_info *arena)
+{
+ return &arena->nd_btt->dev;
+}
+
+/*
+ * This function copies the desired (old/new) log entry into ent if
+ * it is not NULL. It returns the sub-slot number (0 or 1)
+ * where the desired log entry was found. Negative return values
+ * indicate errors.
+ */
+static int btt_log_read(struct arena_info *arena, u32 lane,
+ struct log_entry *ent, int old_flag)
+{
+ int ret;
+ int old_ent, ret_ent;
+ struct log_entry log[2];
+
+ ret = btt_log_read_pair(arena, lane, log);
+ if (ret)
+ return -EIO;
+
+ old_ent = btt_log_get_old(log);
+ if (old_ent < 0 || old_ent > 1) {
+ dev_info(to_dev(arena), "log corruption (%d): lane %u seq [%u, %u]\n",
+ old_ent, lane, le32_to_cpu(log[0].seq),
+ le32_to_cpu(log[1].seq));
+ /* TODO set error state? */
+ return -EIO;
+ }
+
+ ret_ent = (old_flag ? old_ent : (1 - old_ent));
+
+ if (ent != NULL)
+ memcpy(ent, &log[ret_ent], LOG_ENT_SIZE);
+
+ return ret_ent;
+}
+
+/*
+ * This function commits a log entry to media
+ * It does _not_ prepare the freelist entry for the next write
+ * btt_flog_write is the wrapper for updating the freelist elements
+ */
+static int __btt_log_write(struct arena_info *arena, u32 lane,
+ u32 sub, struct log_entry *ent)
+{
+ int ret;
+ /*
+ * Ignore the padding in log_entry for calculating log_half.
+ * The entry is 'committed' when we write the sequence number,
+ * and we want to ensure that that is the last thing written.
+ * We don't bother writing the padding as that would be extra
+ * media wear and write amplification
+ */
+ unsigned int log_half = (LOG_ENT_SIZE - 2 * sizeof(u64)) / 2;
+ u64 ns_off = arena->logoff + (((2 * lane) + sub) * LOG_ENT_SIZE);
+ void *src = ent;
+
+ /* split the 16B write into atomic, durable halves */
+ ret = arena_write_bytes(arena, ns_off, src, log_half);
+ if (ret)
+ return ret;
+
+ ns_off += log_half;
+ src += log_half;
+ return arena_write_bytes(arena, ns_off, src, log_half);
+}
+
+static int btt_flog_write(struct arena_info *arena, u32 lane, u32 sub,
+ struct log_entry *ent)
+{
+ int ret;
+
+ ret = __btt_log_write(arena, lane, sub, ent);
+ if (ret)
+ return ret;
+
+ /* prepare the next free entry */
+ arena->freelist[lane].sub = 1 - arena->freelist[lane].sub;
+ if (++(arena->freelist[lane].seq) == 4)
+ arena->freelist[lane].seq = 1;
+ arena->freelist[lane].block = le32_to_cpu(ent->old_map);
+
+ return ret;
+}
+
+/*
+ * This function initializes the BTT map to the initial state, which is
+ * all-zeroes, and indicates an identity mapping
+ */
+static int btt_map_init(struct arena_info *arena)
+{
+ int ret = -EINVAL;
+ void *zerobuf;
+ size_t offset = 0;
+ size_t chunk_size = SZ_2M;
+ size_t mapsize = arena->logoff - arena->mapoff;
+
+ zerobuf = kzalloc(chunk_size, GFP_KERNEL);
+ if (!zerobuf)
+ return -ENOMEM;
+
+ while (mapsize) {
+ size_t size = min(mapsize, chunk_size);
+
+ ret = arena_write_bytes(arena, arena->mapoff + offset, zerobuf,
+ size);
+ if (ret)
+ goto free;
+
+ offset += size;
+ mapsize -= size;
+ cond_resched();
+ }
+
+ free:
+ kfree(zerobuf);
+ return ret;
+}
+
+/*
+ * This function initializes the BTT log with 'fake' entries pointing
+ * to the initial reserved set of blocks as being free
+ */
+static int btt_log_init(struct arena_info *arena)
+{
+ int ret;
+ u32 i;
+ struct log_entry log, zerolog;
+
+ memset(&zerolog, 0, sizeof(zerolog));
+
+ for (i = 0; i < arena->nfree; i++) {
+ log.lba = cpu_to_le32(i);
+ log.old_map = cpu_to_le32(arena->external_nlba + i);
+ log.new_map = cpu_to_le32(arena->external_nlba + i);
+ log.seq = cpu_to_le32(LOG_SEQ_INIT);
+ ret = __btt_log_write(arena, i, 0, &log);
+ if (ret)
+ return ret;
+ ret = __btt_log_write(arena, i, 1, &zerolog);
+ if (ret)
+ return ret;
+ }
+
+ return 0;
+}
+
+static int btt_freelist_init(struct arena_info *arena)
+{
+ int old, new, ret;
+ u32 i, map_entry;
+ struct log_entry log_new, log_old;
+
+ arena->freelist = kcalloc(arena->nfree, sizeof(struct free_entry),
+ GFP_KERNEL);
+ if (!arena->freelist)
+ return -ENOMEM;
+
+ for (i = 0; i < arena->nfree; i++) {
+ old = btt_log_read(arena, i, &log_old, LOG_OLD_ENT);
+ if (old < 0)
+ return old;
+
+ new = btt_log_read(arena, i, &log_new, LOG_NEW_ENT);
+ if (new < 0)
+ return new;
+
+ /* sub points to the next one to be overwritten */
+ arena->freelist[i].sub = 1 - new;
+ arena->freelist[i].seq = nd_inc_seq(le32_to_cpu(log_new.seq));
+ arena->freelist[i].block = le32_to_cpu(log_new.old_map);
+
+ /* This implies a newly created or untouched flog entry */
+ if (log_new.old_map == log_new.new_map)
+ continue;
+
+ /* Check if map recovery is needed */
+ ret = btt_map_read(arena, le32_to_cpu(log_new.lba), &map_entry,
+ NULL, NULL);
+ if (ret)
+ return ret;
+ if ((le32_to_cpu(log_new.new_map) != map_entry) &&
+ (le32_to_cpu(log_new.old_map) == map_entry)) {
+ /*
+ * Last transaction wrote the flog, but wasn't able
+ * to complete the map write. So fix up the map.
+ */
+ ret = btt_map_write(arena, le32_to_cpu(log_new.lba),
+ le32_to_cpu(log_new.new_map), 0, 0);
+ if (ret)
+ return ret;
+ }
+
+ }
+
+ return 0;
+}
+
+static int btt_rtt_init(struct arena_info *arena)
+{
+ arena->rtt = kcalloc(arena->nfree, sizeof(u32), GFP_KERNEL);
+ if (arena->rtt == NULL)
+ return -ENOMEM;
+
+ return 0;
+}
+
+static int btt_maplocks_init(struct arena_info *arena)
+{
+ u32 i;
+
+ arena->map_locks = kcalloc(arena->nfree, sizeof(struct aligned_lock),
+ GFP_KERNEL);
+ if (!arena->map_locks)
+ return -ENOMEM;
+
+ for (i = 0; i < arena->nfree; i++)
+ spin_lock_init(&arena->map_locks[i].lock);
+
+ return 0;
+}
+
+static struct arena_info *alloc_arena(struct btt *btt, size_t size,
+ size_t start, size_t arena_off)
+{
+ struct arena_info *arena;
+ u64 logsize, mapsize, datasize;
+ u64 available = size;
+
+ arena = kzalloc(sizeof(struct arena_info), GFP_KERNEL);
+ if (!arena)
+ return NULL;
+ arena->nd_btt = btt->nd_btt;
+
+ if (!size)
+ return arena;
+
+ arena->size = size;
+ arena->external_lba_start = start;
+ arena->external_lbasize = btt->lbasize;
+ arena->internal_lbasize = roundup(arena->external_lbasize,
+ INT_LBASIZE_ALIGNMENT);
+ arena->nfree = BTT_DEFAULT_NFREE;
+ arena->version_major = 1;
+ arena->version_minor = 1;
+
+ if (available % BTT_PG_SIZE)
+ available -= (available % BTT_PG_SIZE);
+
+ /* Two pages are reserved for the super block and its copy */
+ available -= 2 * BTT_PG_SIZE;
+
+ /* The log takes a fixed amount of space based on nfree */
+ logsize = roundup(2 * arena->nfree * sizeof(struct log_entry),
+ BTT_PG_SIZE);
+ available -= logsize;
+
+ /* Calculate optimal split between map and data area */
+ arena->internal_nlba = div_u64(available - BTT_PG_SIZE,
+ arena->internal_lbasize + MAP_ENT_SIZE);
+ arena->external_nlba = arena->internal_nlba - arena->nfree;
+
+ mapsize = roundup((arena->external_nlba * MAP_ENT_SIZE), BTT_PG_SIZE);
+ datasize = available - mapsize;
+
+ /* 'Absolute' values, relative to start of storage space */
+ arena->infooff = arena_off;
+ arena->dataoff = arena->infooff + BTT_PG_SIZE;
+ arena->mapoff = arena->dataoff + datasize;
+ arena->logoff = arena->mapoff + mapsize;
+ arena->info2off = arena->logoff + logsize;
+ return arena;
+}
+
+static void free_arenas(struct btt *btt)
+{
+ struct arena_info *arena, *next;
+
+ list_for_each_entry_safe(arena, next, &btt->arena_list, list) {
+ list_del(&arena->list);
+ kfree(arena->rtt);
+ kfree(arena->map_locks);
+ kfree(arena->freelist);
+ debugfs_remove_recursive(arena->debugfs_dir);
+ kfree(arena);
+ }
+}
+
+/*
+ * This function checks if the metadata layout is valid and error free
+ */
+static int arena_is_valid(struct arena_info *arena, struct btt_sb *super,
+ u8 *uuid, u32 lbasize)
+{
+ u64 checksum;
+
+ if (memcmp(super->uuid, uuid, 16))
+ return 0;
+
+ checksum = le64_to_cpu(super->checksum);
+ super->checksum = 0;
+ if (checksum != nd_btt_sb_checksum(super))
+ return 0;
+ super->checksum = cpu_to_le64(checksum);
+
+ if (lbasize != le32_to_cpu(super->external_lbasize))
+ return 0;
+
+ /* TODO: figure out action for this */
+ if ((le32_to_cpu(super->flags) & IB_FLAG_ERROR_MASK) != 0)
+ dev_info(to_dev(arena), "Found arena with an error flag\n");
+
+ return 1;
+}
+
+/*
+ * This function reads an existing valid btt superblock and
+ * populates the corresponding arena_info struct
+ */
+static void parse_arena_meta(struct arena_info *arena, struct btt_sb *super,
+ u64 arena_off)
+{
+ arena->internal_nlba = le32_to_cpu(super->internal_nlba);
+ arena->internal_lbasize = le32_to_cpu(super->internal_lbasize);
+ arena->external_nlba = le32_to_cpu(super->external_nlba);
+ arena->external_lbasize = le32_to_cpu(super->external_lbasize);
+ arena->nfree = le32_to_cpu(super->nfree);
+ arena->version_major = le16_to_cpu(super->version_major);
+ arena->version_minor = le16_to_cpu(super->version_minor);
+
+ arena->nextoff = (super->nextoff == 0) ? 0 : (arena_off +
+ le64_to_cpu(super->nextoff));
+ arena->infooff = arena_off;
+ arena->dataoff = arena_off + le64_to_cpu(super->dataoff);
+ arena->mapoff = arena_off + le64_to_cpu(super->mapoff);
+ arena->logoff = arena_off + le64_to_cpu(super->logoff);
+ arena->info2off = arena_off + le64_to_cpu(super->info2off);
+
+ arena->size = (super->nextoff > 0) ? (le64_to_cpu(super->nextoff)) :
+ (arena->info2off - arena->infooff + BTT_PG_SIZE);
+
+ arena->flags = le32_to_cpu(super->flags);
+}
+
+static int discover_arenas(struct btt *btt)
+{
+ int ret = 0;
+ struct arena_info *arena;
+ struct btt_sb *super;
+ size_t remaining = btt->rawsize;
+ u64 cur_nlba = 0;
+ size_t cur_off = 0;
+ int num_arenas = 0;
+
+ super = kzalloc(sizeof(*super), GFP_KERNEL);
+ if (!super)
+ return -ENOMEM;
+
+ while (remaining) {
+ /* Alloc memory for arena */
+ arena = alloc_arena(btt, 0, 0, 0);
+ if (!arena) {
+ ret = -ENOMEM;
+ goto out_super;
+ }
+
+ arena->infooff = cur_off;
+ ret = btt_info_read(arena, super);
+ if (ret)
+ goto out;
+
+ if (!arena_is_valid(arena, super, btt->nd_btt->uuid,
+ btt->lbasize)) {
+ if (remaining == btt->rawsize) {
+ btt->init_state = INIT_NOTFOUND;
+ dev_info(to_dev(arena), "No existing arenas\n");
+ goto out;
+ } else {
+ dev_info(to_dev(arena),
+ "Found corrupted metadata!\n");
+ ret = -ENODEV;
+ goto out;
+ }
+ }
+
+ arena->external_lba_start = cur_nlba;
+ parse_arena_meta(arena, super, cur_off);
+
+ ret = btt_freelist_init(arena);
+ if (ret)
+ goto out;
+
+ ret = btt_rtt_init(arena);
+ if (ret)
+ goto out;
+
+ ret = btt_maplocks_init(arena);
+ if (ret)
+ goto out;
+
+ list_add_tail(&arena->list, &btt->arena_list);
+
+ remaining -= arena->size;
+ cur_off += arena->size;
+ cur_nlba += arena->external_nlba;
+ num_arenas++;
+
+ if (arena->nextoff == 0)
+ break;
+ }
+ btt->num_arenas = num_arenas;
+ btt->nlba = cur_nlba;
+ btt->init_state = INIT_READY;
+
+ kfree(super);
+ return ret;
+
+ out:
+ kfree(arena);
+ free_arenas(btt);
+ out_super:
+ kfree(super);
+ return ret;
+}
+
+static int create_arenas(struct btt *btt)
+{
+ size_t remaining = btt->rawsize;
+ size_t cur_off = 0;
+
+ while (remaining) {
+ struct arena_info *arena;
+ size_t arena_size = min_t(u64, ARENA_MAX_SIZE, remaining);
+
+ remaining -= arena_size;
+ if (arena_size < ARENA_MIN_SIZE)
+ break;
+
+ arena = alloc_arena(btt, arena_size, btt->nlba, cur_off);
+ if (!arena) {
+ free_arenas(btt);
+ return -ENOMEM;
+ }
+ btt->nlba += arena->external_nlba;
+ if (remaining >= ARENA_MIN_SIZE)
+ arena->nextoff = arena->size;
+ else
+ arena->nextoff = 0;
+ cur_off += arena_size;
+ list_add_tail(&arena->list, &btt->arena_list);
+ }
+
+ return 0;
+}
+
+/*
+ * This function completes arena initialization by writing
+ * all the metadata.
+ * It is only called for an uninitialized arena when a write
+ * to that arena occurs for the first time.
+ */
+static int btt_arena_write_layout(struct arena_info *arena, u8 *uuid)
+{
+ int ret;
+ struct btt_sb *super;
+
+ ret = btt_map_init(arena);
+ if (ret)
+ return ret;
+
+ ret = btt_log_init(arena);
+ if (ret)
+ return ret;
+
+ super = kzalloc(sizeof(struct btt_sb), GFP_NOIO);
+ if (!super)
+ return -ENOMEM;
+
+ strncpy(super->signature, BTT_SIG, BTT_SIG_LEN);
+ memcpy(super->uuid, uuid, 16);
+ super->flags = cpu_to_le32(arena->flags);
+ super->version_major = cpu_to_le16(arena->version_major);
+ super->version_minor = cpu_to_le16(arena->version_minor);
+ super->external_lbasize = cpu_to_le32(arena->external_lbasize);
+ super->external_nlba = cpu_to_le32(arena->external_nlba);
+ super->internal_lbasize = cpu_to_le32(arena->internal_lbasize);
+ super->internal_nlba = cpu_to_le32(arena->internal_nlba);
+ super->nfree = cpu_to_le32(arena->nfree);
+ super->infosize = cpu_to_le32(sizeof(struct btt_sb));
+ super->nextoff = cpu_to_le64(arena->nextoff);
+ /*
+ * Subtract arena->infooff (arena start) so numbers are relative
+ * to 'this' arena
+ */
+ super->dataoff = cpu_to_le64(arena->dataoff - arena->infooff);
+ super->mapoff = cpu_to_le64(arena->mapoff - arena->infooff);
+ super->logoff = cpu_to_le64(arena->logoff - arena->infooff);
+ super->info2off = cpu_to_le64(arena->info2off - arena->infooff);
+
+ super->flags = 0;
+ super->checksum = cpu_to_le64(nd_btt_sb_checksum(super));
+
+ ret = btt_info_write(arena, super);
+
+ kfree(super);
+ return ret;
+}
+
+/*
+ * This function completes the initialization for the BTT namespace
+ * such that it is ready to accept IOs
+ */
+static int btt_meta_init(struct btt *btt)
+{
+ int ret = 0;
+ struct arena_info *arena;
+
+ mutex_lock(&btt->init_lock);
+ list_for_each_entry(arena, &btt->arena_list, list) {
+ ret = btt_arena_write_layout(arena, btt->nd_btt->uuid);
+ if (ret)
+ goto unlock;
+
+ ret = btt_freelist_init(arena);
+ if (ret)
+ goto unlock;
+
+ ret = btt_rtt_init(arena);
+ if (ret)
+ goto unlock;
+
+ ret = btt_maplocks_init(arena);
+ if (ret)
+ goto unlock;
+ }
+
+ btt->init_state = INIT_READY;
+
+ unlock:
+ mutex_unlock(&btt->init_lock);
+ return ret;
+}
+
+/*
+ * This function calculates the arena in which the given LBA lies
+ * by doing a linear walk. This is acceptable since we expect only
+ * a few arenas. If we have backing devices that get much larger,
+ * we can construct a balanced binary tree of arenas at init time
+ * so that this range search becomes faster.
+ */
+static int lba_to_arena(struct btt *btt, sector_t sector, __u32 *premap,
+ struct arena_info **arena)
+{
+ struct arena_info *arena_list;
+ __u64 lba = div_u64(sector << SECTOR_SHIFT, btt->sector_size);
+
+ list_for_each_entry(arena_list, &btt->arena_list, list) {
+ if (lba < arena_list->external_nlba) {
+ *arena = arena_list;
+ *premap = lba;
+ return 0;
+ }
+ lba -= arena_list->external_nlba;
+ }
+
+ return -EIO;
+}
+
+/*
+ * The following (lock_map, unlock_map) are mostly just to improve
+ * readability, since they index into an array of locks
+ */
+static void lock_map(struct arena_info *arena, u32 premap)
+ __acquires(&arena->map_locks[idx].lock)
+{
+ u32 idx = (premap * MAP_ENT_SIZE / L1_CACHE_BYTES) % arena->nfree;
+
+ spin_lock(&arena->map_locks[idx].lock);
+}
+
+static void unlock_map(struct arena_info *arena, u32 premap)
+ __releases(&arena->map_locks[idx].lock)
+{
+ u32 idx = (premap * MAP_ENT_SIZE / L1_CACHE_BYTES) % arena->nfree;
+
+ spin_unlock(&arena->map_locks[idx].lock);
+}
+
+static u64 to_namespace_offset(struct arena_info *arena, u64 lba)
+{
+ return arena->dataoff + ((u64)lba * arena->internal_lbasize);
+}
+
+static int btt_data_read(struct arena_info *arena, struct page *page,
+ unsigned int off, u32 lba, u32 len)
+{
+ int ret;
+ u64 nsoff = to_namespace_offset(arena, lba);
+ void *mem = kmap_atomic(page);
+
+ ret = arena_read_bytes(arena, nsoff, mem + off, len);
+ kunmap_atomic(mem);
+
+ return ret;
+}
+
+static int btt_data_write(struct arena_info *arena, u32 lba,
+ struct page *page, unsigned int off, u32 len)
+{
+ int ret;
+ u64 nsoff = to_namespace_offset(arena, lba);
+ void *mem = kmap_atomic(page);
+
+ ret = arena_write_bytes(arena, nsoff, mem + off, len);
+ kunmap_atomic(mem);
+
+ return ret;
+}
+
+static void zero_fill_data(struct page *page, unsigned int off, u32 len)
+{
+ void *mem = kmap_atomic(page);
+
+ memset(mem + off, 0, len);
+ kunmap_atomic(mem);
+}
+
+static int btt_read_pg(struct btt *btt, struct page *page, unsigned int off,
+ sector_t sector, unsigned int len)
+{
+ int ret = 0;
+ int t_flag, e_flag;
+ struct arena_info *arena = NULL;
+ u32 lane = 0, premap, postmap;
+
+ while (len) {
+ u32 cur_len;
+
+ lane = nd_region_acquire_lane(btt->nd_region);
+
+ ret = lba_to_arena(btt, sector, &premap, &arena);
+ if (ret)
+ goto out_lane;
+
+ cur_len = min(btt->sector_size, len);
+
+ ret = btt_map_read(arena, premap, &postmap, &t_flag, &e_flag);
+ if (ret)
+ goto out_lane;
+
+ /*
+ * We loop to make sure that the post map LBA didn't change
+ * from under us between writing the RTT and doing the actual
+ * read.
+ */
+ while (1) {
+ u32 new_map;
+
+ if (t_flag) {
+ zero_fill_data(page, off, cur_len);
+ goto out_lane;
+ }
+
+ if (e_flag) {
+ ret = -EIO;
+ goto out_lane;
+ }
+
+ arena->rtt[lane] = RTT_VALID | postmap;
+ /*
+ * Barrier to make sure this write is not reordered
+ * to do the verification map_read before the RTT store
+ */
+ barrier();
+
+ ret = btt_map_read(arena, premap, &new_map, &t_flag,
+ &e_flag);
+ if (ret)
+ goto out_rtt;
+
+ if (postmap == new_map)
+ break;
+
+ postmap = new_map;
+ }
+
+ ret = btt_data_read(arena, page, off, postmap, cur_len);
+ if (ret)
+ goto out_rtt;
+
+ arena->rtt[lane] = RTT_INVALID;
+ nd_region_release_lane(btt->nd_region, lane);
+
+ len -= cur_len;
+ off += cur_len;
+ sector += btt->sector_size >> SECTOR_SHIFT;
+ }
+
+ return 0;
+
+ out_rtt:
+ arena->rtt[lane] = RTT_INVALID;
+ out_lane:
+ nd_region_release_lane(btt->nd_region, lane);
+ return ret;
+}
+
+static int btt_write_pg(struct btt *btt, sector_t sector, struct page *page,
+ unsigned int off, unsigned int len)
+{
+ int ret = 0;
+ struct arena_info *arena = NULL;
+ u32 premap = 0, old_postmap, new_postmap, lane = 0, i;
+ struct log_entry log;
+ int sub;
+
+ while (len) {
+ u32 cur_len;
+
+ lane = nd_region_acquire_lane(btt->nd_region);
+
+ ret = lba_to_arena(btt, sector, &premap, &arena);
+ if (ret)
+ goto out_lane;
+ cur_len = min(btt->sector_size, len);
+
+ if ((arena->flags & IB_FLAG_ERROR_MASK) != 0) {
+ ret = -EIO;
+ goto out_lane;
+ }
+
+ new_postmap = arena->freelist[lane].block;
+
+ /* Wait if the new block is being read from */
+ for (i = 0; i < arena->nfree; i++)
+ while (arena->rtt[i] == (RTT_VALID | new_postmap))
+ cpu_relax();
+
+
+ if (new_postmap >= arena->internal_nlba) {
+ ret = -EIO;
+ goto out_lane;
+ } else
+ ret = btt_data_write(arena, new_postmap, page,
+ off, cur_len);
+ if (ret)
+ goto out_lane;
+
+ lock_map(arena, premap);
+ ret = btt_map_read(arena, premap, &old_postmap, NULL, NULL);
+ if (ret)
+ goto out_map;
+ if (old_postmap >= arena->internal_nlba) {
+ ret = -EIO;
+ goto out_map;
+ }
+
+ log.lba = cpu_to_le32(premap);
+ log.old_map = cpu_to_le32(old_postmap);
+ log.new_map = cpu_to_le32(new_postmap);
+ log.seq = cpu_to_le32(arena->freelist[lane].seq);
+ sub = arena->freelist[lane].sub;
+ ret = btt_flog_write(arena, lane, sub, &log);
+ if (ret)
+ goto out_map;
+
+ ret = btt_map_write(arena, premap, new_postmap, 0, 0);
+ if (ret)
+ goto out_map;
+
+ unlock_map(arena, premap);
+ nd_region_release_lane(btt->nd_region, lane);
+
+ len -= cur_len;
+ off += cur_len;
+ sector += btt->sector_size >> SECTOR_SHIFT;
+ }
+
+ return 0;
+
+ out_map:
+ unlock_map(arena, premap);
+ out_lane:
+ nd_region_release_lane(btt->nd_region, lane);
+ return ret;
+}
+
+static int btt_do_bvec(struct btt *btt, struct page *page,
+ unsigned int len, unsigned int off, int rw,
+ sector_t sector)
+{
+ int ret;
+
+ if (rw == READ) {
+ ret = btt_read_pg(btt, page, off, sector, len);
+ flush_dcache_page(page);
+ } else {
+ flush_dcache_page(page);
+ ret = btt_write_pg(btt, sector, page, off, len);
+ }
+
+ return ret;
+}
+
+static void btt_make_request(struct request_queue *q, struct bio *bio)
+{
+ struct btt *btt = q->queuedata;
+ struct bvec_iter iter;
+ struct bio_vec bvec;
+ int err = 0, rw;
+
+ rw = bio_data_dir(bio);
+ bio_for_each_segment(bvec, bio, iter) {
+ unsigned int len = bvec.bv_len;
+
+ BUG_ON(len > PAGE_SIZE);
+ /* Make sure len is in multiples of sector size. */
+ /* XXX is this right? */
+ BUG_ON(len < btt->sector_size);
+ BUG_ON(len % btt->sector_size);
+
+ err = btt_do_bvec(btt, bvec.bv_page, len, bvec.bv_offset,
+ rw, iter.bi_sector);
+ if (err) {
+ dev_info(&btt->nd_btt->dev,
+ "io error in %s sector %lld, len %d,\n",
+ (rw == READ) ? "READ" : "WRITE",
+ (unsigned long long) iter.bi_sector, len);
+ goto out;
+ }
+ }
+
+out:
+ bio_endio(bio, err);
+}
+
+static int btt_rw_page(struct block_device *bdev, sector_t sector,
+ struct page *page, int rw)
+{
+ struct btt *btt = bdev->bd_disk->private_data;
+
+ btt_do_bvec(btt, page, PAGE_CACHE_SIZE, 0, rw, sector);
+ page_endio(page, rw & WRITE, 0);
+ return 0;
+}
+
+
+static int btt_getgeo(struct block_device *bd, struct hd_geometry *geo)
+{
+ /* some standard values */
+ geo->heads = 1 << 6;
+ geo->sectors = 1 << 5;
+ geo->cylinders = get_capacity(bd->bd_disk) >> 11;
+ return 0;
+}
+
+static const struct block_device_operations btt_fops = {
+ .owner = THIS_MODULE,
+ .rw_page = btt_rw_page,
+ .getgeo = btt_getgeo,
+};
+
+static int btt_blk_init(struct btt *btt)
+{
+ struct nd_btt *nd_btt = btt->nd_btt;
+ struct nd_namespace_common *ndns = nd_btt->ndns;
+
+ /* create a new disk and request queue for btt */
+ btt->btt_queue = blk_alloc_queue(GFP_KERNEL);
+ if (!btt->btt_queue)
+ return -ENOMEM;
+
+ btt->btt_disk = alloc_disk(0);
+ if (!btt->btt_disk) {
+ blk_cleanup_queue(btt->btt_queue);
+ return -ENOMEM;
+ }
+
+ nvdimm_namespace_disk_name(ndns, btt->btt_disk->disk_name);
+ btt->btt_disk->driverfs_dev = &btt->nd_btt->dev;
+ btt->btt_disk->major = btt_major;
+ btt->btt_disk->first_minor = 0;
+ btt->btt_disk->fops = &btt_fops;
+ btt->btt_disk->private_data = btt;
+ btt->btt_disk->queue = btt->btt_queue;
+ btt->btt_disk->flags = GENHD_FL_EXT_DEVT;
+
+ blk_queue_make_request(btt->btt_queue, btt_make_request);
+ blk_queue_logical_block_size(btt->btt_queue, btt->sector_size);
+ blk_queue_max_hw_sectors(btt->btt_queue, UINT_MAX);
+ blk_queue_bounce_limit(btt->btt_queue, BLK_BOUNCE_ANY);
+ queue_flag_set_unlocked(QUEUE_FLAG_NONROT, btt->btt_queue);
+ btt->btt_queue->queuedata = btt;
+
+ set_capacity(btt->btt_disk,
+ btt->nlba * btt->sector_size >> SECTOR_SHIFT);
+ add_disk(btt->btt_disk);
+
+ return 0;
+}
+
+static void btt_blk_cleanup(struct btt *btt)
+{
+ del_gendisk(btt->btt_disk);
+ put_disk(btt->btt_disk);
+ blk_cleanup_queue(btt->btt_queue);
+}
+
+/**
+ * btt_init - initialize a block translation table for the given device
+ * @nd_btt: device with BTT geometry and backing device info
+ * @rawsize: raw size in bytes of the backing device
+ * @lbasize: lba size of the backing device
+ * @uuid: A uuid for the backing device - this is stored on media
+ * @nd_region: parent region, used to acquire lanes for parallel requests
+ *
+ * Initialize a Block Translation Table on a backing device to provide
+ * single sector power fail atomicity.
+ *
+ * Context:
+ * Might sleep.
+ *
+ * Returns:
+ * Pointer to a new struct btt on success, NULL on failure.
+ */
+static struct btt *btt_init(struct nd_btt *nd_btt, unsigned long long rawsize,
+ u32 lbasize, u8 *uuid, struct nd_region *nd_region)
+{
+ int ret;
+ struct btt *btt;
+ struct device *dev = &nd_btt->dev;
+
+ btt = kzalloc(sizeof(struct btt), GFP_KERNEL);
+ if (!btt)
+ return NULL;
+
+ btt->nd_btt = nd_btt;
+ btt->rawsize = rawsize;
+ btt->lbasize = lbasize;
+ btt->sector_size = ((lbasize >= 4096) ? 4096 : 512);
+ INIT_LIST_HEAD(&btt->arena_list);
+ mutex_init(&btt->init_lock);
+ btt->nd_region = nd_region;
+
+ ret = discover_arenas(btt);
+ if (ret) {
+ dev_err(dev, "init: error in arena_discover: %d\n", ret);
+ goto out_free;
+ }
+
+ if (btt->init_state != INIT_READY) {
+ btt->num_arenas = (rawsize / ARENA_MAX_SIZE) +
+ ((rawsize % ARENA_MAX_SIZE) ? 1 : 0);
+ dev_dbg(dev, "init: %d arenas for %llu rawsize\n",
+ btt->num_arenas, rawsize);
+
+ ret = create_arenas(btt);
+ if (ret) {
+ dev_info(dev, "init: create_arenas: %d\n", ret);
+ goto out_free;
+ }
+
+ ret = btt_meta_init(btt);
+ if (ret) {
+ dev_err(dev, "init: error in meta_init: %d\n", ret);
+ goto out_free;
+ }
+ }
+
+ ret = btt_blk_init(btt);
+ if (ret) {
+ dev_err(dev, "init: error in blk_init: %d\n", ret);
+ goto out_free;
+ }
+
+ btt_debugfs_init(btt);
+
+ return btt;
+
+ out_free:
+ kfree(btt);
+ return NULL;
+}
+
+/**
+ * btt_fini - de-initialize a BTT
+ * @btt: the BTT handle that was generated by btt_init
+ *
+ * De-initialize a Block Translation Table on device removal
+ *
+ * Context:
+ * Might sleep.
+ */
+static void btt_fini(struct btt *btt)
+{
+ if (btt) {
+ btt_blk_cleanup(btt);
+ free_arenas(btt);
+ debugfs_remove_recursive(btt->debugfs_dir);
+ kfree(btt);
+ }
+}
+
+int nvdimm_namespace_attach_btt(struct nd_namespace_common *ndns)
+{
+ struct nd_btt *nd_btt = to_nd_btt(ndns->claim);
+ struct nd_region *nd_region;
+ struct btt *btt;
+ size_t rawsize;
+
+ if (!nd_btt->uuid || !nd_btt->ndns || !nd_btt->lbasize)
+ return -ENODEV;
+
+ rawsize = nvdimm_namespace_capacity(ndns) - SZ_4K;
+ if (rawsize < ARENA_MIN_SIZE)
+ return -ENXIO;
+ nd_region = to_nd_region(nd_btt->dev.parent);
+ btt = btt_init(nd_btt, rawsize, nd_btt->lbasize, nd_btt->uuid,
+ nd_region);
+ if (!btt)
+ return -ENOMEM;
+ nd_btt->btt = btt;
+
+ return 0;
+}
+EXPORT_SYMBOL(nvdimm_namespace_attach_btt);
+
+int nvdimm_namespace_detach_btt(struct nd_namespace_common *ndns)
+{
+ struct nd_btt *nd_btt = to_nd_btt(ndns->claim);
+ struct btt *btt = nd_btt->btt;
+
+ btt_fini(btt);
+ nd_btt->btt = NULL;
+
+ return 0;
+}
+EXPORT_SYMBOL(nvdimm_namespace_detach_btt);
+
+static int __init nd_btt_init(void)
+{
+ int rc;
+
+ BUILD_BUG_ON(sizeof(struct btt_sb) != SZ_4K);
+
+ btt_major = register_blkdev(0, "btt");
+ if (btt_major < 0)
+ return btt_major;
+
+ debugfs_root = debugfs_create_dir("btt", NULL);
+ if (IS_ERR_OR_NULL(debugfs_root)) {
+ rc = -ENXIO;
+ goto err_debugfs;
+ }
+
+ return 0;
+
+ err_debugfs:
+ unregister_blkdev(btt_major, "btt");
+
+ return rc;
+}
+
+static void __exit nd_btt_exit(void)
+{
+ debugfs_remove_recursive(debugfs_root);
+ unregister_blkdev(btt_major, "btt");
+}
+
+MODULE_ALIAS_ND_DEVICE(ND_DEVICE_BTT);
+MODULE_AUTHOR("Vishal Verma <[email protected]>");
+MODULE_LICENSE("GPL v2");
+module_init(nd_btt_init);
+module_exit(nd_btt_exit);
diff --git a/drivers/nvdimm/btt.h b/drivers/nvdimm/btt.h
index e8f6d8e0ddd3..8c95a7792c3e 100644
--- a/drivers/nvdimm/btt.h
+++ b/drivers/nvdimm/btt.h
@@ -19,6 +19,39 @@

#define BTT_SIG_LEN 16
#define BTT_SIG "BTT_ARENA_INFO\0"
+#define MAP_ENT_SIZE 4
+#define MAP_TRIM_SHIFT 31
+#define MAP_TRIM_MASK (1 << MAP_TRIM_SHIFT)
+#define MAP_ERR_SHIFT 30
+#define MAP_ERR_MASK (1 << MAP_ERR_SHIFT)
+#define MAP_LBA_MASK (~((1 << MAP_TRIM_SHIFT) | (1 << MAP_ERR_SHIFT)))
+#define MAP_ENT_NORMAL 0xC0000000
+#define LOG_ENT_SIZE sizeof(struct log_entry)
+#define ARENA_MIN_SIZE (1UL << 24) /* 16 MB */
+#define ARENA_MAX_SIZE (1ULL << 39) /* 512 GB */
+#define RTT_VALID (1UL << 31)
+#define RTT_INVALID 0
+#define INT_LBASIZE_ALIGNMENT 256
+#define BTT_PG_SIZE 4096
+#define BTT_DEFAULT_NFREE ND_MAX_LANES
+#define LOG_SEQ_INIT 1
+
+#define IB_FLAG_ERROR 0x00000001
+#define IB_FLAG_ERROR_MASK 0x00000001
+
+enum btt_init_state {
+ INIT_UNCHECKED = 0,
+ INIT_NOTFOUND,
+ INIT_READY
+};
+
+struct log_entry {
+ __le32 lba;
+ __le32 old_map;
+ __le32 new_map;
+ __le32 seq;
+ __le64 padding[2];
+};

struct btt_sb {
u8 signature[BTT_SIG_LEN];
@@ -42,4 +75,112 @@ struct btt_sb {
__le64 checksum;
};

+struct free_entry {
+ u32 block;
+ u8 sub;
+ u8 seq;
+};
+
+struct aligned_lock {
+ union {
+ spinlock_t lock;
+ u8 cacheline_padding[L1_CACHE_BYTES];
+ };
+};
+
+/**
+ * struct arena_info - handle for an arena
+ * @size: Size in bytes this arena occupies on the raw device.
+ * This includes arena metadata.
+ * @external_lba_start: The first external LBA in this arena.
+ * @internal_nlba: Number of internal blocks available in the arena
+ * including nfree reserved blocks
+ * @internal_lbasize: Internal and external lba sizes may be different as
+ * we can round up 'odd' external lbasizes such as 520B
+ * to be aligned.
+ * @external_nlba: Number of blocks contributed by the arena to the number
+ * reported to upper layers. (internal_nlba - nfree)
+ * @external_lbasize: LBA size as exposed to upper layers.
+ * @nfree: A reserve number of 'free' blocks that is used to
+ * handle incoming writes.
+ * @version_major: Metadata layout version major.
+ * @version_minor: Metadata layout version minor.
+ * @nextoff: Offset in bytes to the start of the next arena.
+ * @infooff: Offset in bytes to the info block of this arena.
+ * @dataoff: Offset in bytes to the data area of this arena.
+ * @mapoff: Offset in bytes to the map area of this arena.
+ * @logoff: Offset in bytes to the log area of this arena.
+ * @info2off: Offset in bytes to the backup info block of this arena.
+ * @freelist: Pointer to in-memory list of free blocks
+ * @rtt: Pointer to in-memory "Read Tracking Table"
+ * @map_locks: Spinlocks protecting concurrent map writes
+ * @nd_btt: Pointer to parent nd_btt structure.
+ * @list: List head for list of arenas
+ * @debugfs_dir: Debugfs dentry
+ * @flags: Arena flags - may signify error states.
+ *
+ * arena_info is a per-arena handle. Once an arena is narrowed down for an
+ * IO, this struct is passed around for the duration of the IO.
+ */
+struct arena_info {
+ u64 size; /* Total bytes for this arena */
+ u64 external_lba_start;
+ u32 internal_nlba;
+ u32 internal_lbasize;
+ u32 external_nlba;
+ u32 external_lbasize;
+ u32 nfree;
+ u16 version_major;
+ u16 version_minor;
+ /* Byte offsets to the different on-media structures */
+ u64 nextoff;
+ u64 infooff;
+ u64 dataoff;
+ u64 mapoff;
+ u64 logoff;
+ u64 info2off;
+ /* Pointers to other in-memory structures for this arena */
+ struct free_entry *freelist;
+ u32 *rtt;
+ struct aligned_lock *map_locks;
+ struct nd_btt *nd_btt;
+ struct list_head list;
+ struct dentry *debugfs_dir;
+ /* Arena flags */
+ u32 flags;
+};
+
+/**
+ * struct btt - handle for a BTT instance
+ * @btt_disk: Pointer to the gendisk for BTT device
+ * @btt_queue: Pointer to the request queue for the BTT device
+ * @arena_list: Head of the list of arenas
+ * @debugfs_dir: Debugfs dentry
+ * @nd_btt: Parent nd_btt struct
+ * @nlba: Number of logical blocks exposed to the upper layers
+ * after removing the amount of space needed by metadata
+ * @rawsize: Total size in bytes of the available backing device
+ * @lbasize: LBA size as requested and presented to upper layers.
+ * This is sector_size + size of any metadata.
+ * @sector_size: The Linux sector size - 512 or 4096
+ * @nd_region: Parent region, source of the per-lane locks
+ * @init_lock: Mutex used for the BTT initialization
+ * @init_state: Flag describing the initialization state for the BTT
+ * @num_arenas: Number of arenas in the BTT instance
+ */
+struct btt {
+ struct gendisk *btt_disk;
+ struct request_queue *btt_queue;
+ struct list_head arena_list;
+ struct dentry *debugfs_dir;
+ struct nd_btt *nd_btt;
+ u64 nlba;
+ unsigned long long rawsize;
+ u32 lbasize;
+ u32 sector_size;
+ struct nd_region *nd_region;
+ struct mutex init_lock;
+ int init_state;
+ int num_arenas;
+};
#endif
diff --git a/drivers/nvdimm/btt_devs.c b/drivers/nvdimm/btt_devs.c
index 4636beb8b7ed..cde9f075fcea 100644
--- a/drivers/nvdimm/btt_devs.c
+++ b/drivers/nvdimm/btt_devs.c
@@ -349,7 +349,8 @@ struct device *nd_btt_create(struct nd_region *nd_region)
*/
u64 nd_btt_sb_checksum(struct btt_sb *btt_sb)
{
- u64 sum, sum_save;
+ u64 sum;
+ __le64 sum_save;

sum_save = btt_sb->checksum;
btt_sb->checksum = 0;
diff --git a/drivers/nvdimm/namespace_devs.c b/drivers/nvdimm/namespace_devs.c
index 2c50a0719f8d..4aa647c8d644 100644
--- a/drivers/nvdimm/namespace_devs.c
+++ b/drivers/nvdimm/namespace_devs.c
@@ -76,6 +76,30 @@ static bool is_namespace_io(struct device *dev)
return dev ? dev->type == &namespace_io_device_type : false;
}

+const char *nvdimm_namespace_disk_name(struct nd_namespace_common *ndns,
+ char *name)
+{
+ struct nd_region *nd_region = to_nd_region(ndns->dev.parent);
+ const char *suffix = "";
+
+ if (ndns->claim && is_nd_btt(ndns->claim))
+ suffix = "s";
+
+ if (is_namespace_pmem(&ndns->dev) || is_namespace_io(&ndns->dev))
+ sprintf(name, "pmem%d%s", nd_region->id, suffix);
+ else if (is_namespace_blk(&ndns->dev)) {
+ struct nd_namespace_blk *nsblk;
+
+ nsblk = to_nd_namespace_blk(&ndns->dev);
+ sprintf(name, "ndblk%d.%d%s", nd_region->id, nsblk->id, suffix);
+ } else {
+ return NULL;
+ }
+
+ return name;
+}
+EXPORT_SYMBOL(nvdimm_namespace_disk_name);
+
static ssize_t nstype_show(struct device *dev,
struct device_attribute *attr, char *buf)
{
diff --git a/drivers/nvdimm/nd.h b/drivers/nvdimm/nd.h
index 329e4d54969f..78177a76a05d 100644
--- a/drivers/nvdimm/nd.h
+++ b/drivers/nvdimm/nd.h
@@ -20,6 +20,12 @@
#include "label.h"

enum {
+ /*
+ * Limits the maximum number of block apertures a dimm can
+ * support and is an input to the geometry/on-disk-format of a
+ * BTT instance
+ */
+ ND_MAX_LANES = 256,
SECTOR_SHIFT = 9,
};

@@ -75,6 +81,11 @@ static inline struct nd_namespace_index *to_next_namespace_index(
for (res = (ndd)->dpa.child, next = res ? res->sibling : NULL; \
res; res = next, next = next ? next->sibling : NULL)

+struct nd_percpu_lane {
+ int count;
+ spinlock_t lock;
+};
+
struct nd_region {
struct device dev;
struct ida ns_ida;
@@ -83,9 +94,10 @@ struct nd_region {
u16 ndr_mappings;
u64 ndr_size;
u64 ndr_start;
- int id;
+ int id, num_lanes;
void *provider_data;
struct nd_interleave_set *nd_set;
+ struct nd_percpu_lane __percpu *lane;
struct nd_mapping mapping[0];
};

@@ -99,9 +111,11 @@ static inline unsigned nd_inc_seq(unsigned seq)
return next[seq & 3];
}

+struct btt;
struct nd_btt {
struct device dev;
struct nd_namespace_common *ndns;
+ struct btt *btt;
unsigned long lbasize;
u8 *uuid;
int id;
@@ -156,6 +170,8 @@ static inline struct device *nd_btt_create(struct nd_region *nd_region)

#endif
struct nd_region *to_nd_region(struct device *dev);
+unsigned int nd_region_acquire_lane(struct nd_region *nd_region);
+void nd_region_release_lane(struct nd_region *nd_region, unsigned int lane);
int nd_region_to_nstype(struct nd_region *nd_region);
int nd_region_register_namespaces(struct nd_region *nd_region, int *err);
u64 nd_region_interleave_set_cookie(struct nd_region *nd_region);
@@ -171,4 +187,8 @@ struct resource *nvdimm_allocate_dpa(struct nvdimm_drvdata *ndd,
resource_size_t n);
resource_size_t nvdimm_namespace_capacity(struct nd_namespace_common *ndns);
struct nd_namespace_common *nvdimm_namespace_common_probe(struct device *dev);
+int nvdimm_namespace_attach_btt(struct nd_namespace_common *ndns);
+int nvdimm_namespace_detach_btt(struct nd_namespace_common *ndns);
+const char *nvdimm_namespace_disk_name(struct nd_namespace_common *ndns,
+ char *name);
#endif /* __ND_H__ */
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index d0c6b4bdba69..7346054bccbb 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -160,7 +160,6 @@ static void pmem_detach_disk(struct pmem_device *pmem)
static int pmem_attach_disk(struct nd_namespace_common *ndns,
struct pmem_device *pmem)
{
- struct nd_region *nd_region = to_nd_region(ndns->dev.parent);
struct gendisk *disk;

pmem->pmem_queue = blk_alloc_queue(GFP_KERNEL);
@@ -183,7 +182,7 @@ static int pmem_attach_disk(struct nd_namespace_common *ndns,
disk->private_data = pmem;
disk->queue = pmem->pmem_queue;
disk->flags = GENHD_FL_EXT_DEVT;
- sprintf(disk->disk_name, "pmem%d", nd_region->id);
+ nvdimm_namespace_disk_name(ndns, disk->disk_name);
disk->driverfs_dev = &ndns->dev;
set_capacity(disk, pmem->size >> 9);
pmem->pmem_disk = disk;
@@ -211,17 +210,6 @@ static int pmem_rw_bytes(struct nd_namespace_common *ndns,
return 0;
}

-static int nvdimm_namespace_attach_btt(struct nd_namespace_common *ndns)
-{
- /* TODO */
- return -ENXIO;
-}
-
-static void nvdimm_namespace_detach_btt(struct nd_namespace_common *ndns)
-{
- /* TODO */
-}
-
static void pmem_free(struct pmem_device *pmem)
{
iounmap(pmem->virt_addr);
diff --git a/drivers/nvdimm/region.c b/drivers/nvdimm/region.c
index 2a5f3f53d79d..eb8aebcd4800 100644
--- a/drivers/nvdimm/region.c
+++ b/drivers/nvdimm/region.c
@@ -10,6 +10,7 @@
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
* General Public License for more details.
*/
+#include <linux/cpumask.h>
#include <linux/module.h>
#include <linux/device.h>
#include <linux/nd.h>
@@ -18,10 +19,21 @@
static int nd_region_probe(struct device *dev)
{
int err;
+ static unsigned long once;
struct nd_region_namespaces *num_ns;
struct nd_region *nd_region = to_nd_region(dev);
int rc = nd_region_register_namespaces(nd_region, &err);

+ if (nd_region->num_lanes > num_online_cpus()
+ && nd_region->num_lanes < num_possible_cpus()
+ && !test_and_set_bit(0, &once)) {
+ dev_info(dev, "online cpus (%d) < concurrent i/o lanes (%d) < possible cpus (%d)\n",
+ num_online_cpus(), nd_region->num_lanes,
+ num_possible_cpus());
+ dev_info(dev, "setting nr_cpus=%d may yield better libnvdimm device performance\n",
+ nd_region->num_lanes);
+ }
+
num_ns = devm_kzalloc(dev, sizeof(*num_ns), GFP_KERNEL);
if (!num_ns)
return -ENOMEM;
diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
index f4cc1d848156..abf344651358 100644
--- a/drivers/nvdimm/region_devs.c
+++ b/drivers/nvdimm/region_devs.c
@@ -32,6 +32,7 @@ static void nd_region_release(struct device *dev)

put_device(&nvdimm->dev);
}
+ free_percpu(nd_region->lane);
ida_simple_remove(&region_ida, nd_region->id);
kfree(nd_region);
}
@@ -531,13 +532,66 @@ void *nd_region_provider_data(struct nd_region *nd_region)
}
EXPORT_SYMBOL_GPL(nd_region_provider_data);

+/**
+ * nd_region_acquire_lane - allocate and lock a lane
+ * @nd_region: region id and number of lanes possible
+ *
+ * A lane correlates to a BLK-data-window and/or a log slot in the BTT.
+ * We optimize for the common case where there are 256 lanes, one
+ * per-cpu. For larger systems we need to lock to share lanes. For now
+ * this implementation assumes the cost of maintaining an allocator for
+ * free lanes is on the order of the lock hold time, so it implements a
+ * static lane = cpu % num_lanes mapping.
+ *
+ * In the case of a BTT instance on top of a BLK namespace a lane may be
+ * acquired recursively. We lock on the first instance.
+ *
+ * In the case of a BTT instance on top of PMEM, we only acquire a lane
+ * for the BTT metadata updates.
+ */
+unsigned int nd_region_acquire_lane(struct nd_region *nd_region)
+{
+ unsigned int cpu, lane;
+
+ cpu = get_cpu();
+ if (nd_region->num_lanes < nr_cpu_ids) {
+ struct nd_percpu_lane *ndl_lock, *ndl_count;
+
+ lane = cpu % nd_region->num_lanes;
+ ndl_count = per_cpu_ptr(nd_region->lane, cpu);
+ ndl_lock = per_cpu_ptr(nd_region->lane, lane);
+ if (ndl_count->count++ == 0)
+ spin_lock(&ndl_lock->lock);
+ } else
+ lane = cpu;
+
+ return lane;
+}
+EXPORT_SYMBOL(nd_region_acquire_lane);
+
+void nd_region_release_lane(struct nd_region *nd_region, unsigned int lane)
+{
+ if (nd_region->num_lanes < nr_cpu_ids) {
+ unsigned int cpu = get_cpu();
+ struct nd_percpu_lane *ndl_lock, *ndl_count;
+
+ ndl_count = per_cpu_ptr(nd_region->lane, cpu);
+ ndl_lock = per_cpu_ptr(nd_region->lane, lane);
+ if (--ndl_count->count == 0)
+ spin_unlock(&ndl_lock->lock);
+ put_cpu();
+ }
+ put_cpu();
+}
+EXPORT_SYMBOL(nd_region_release_lane);
+
static struct nd_region *nd_region_create(struct nvdimm_bus *nvdimm_bus,
struct nd_region_desc *ndr_desc, struct device_type *dev_type,
const char *caller)
{
struct nd_region *nd_region;
struct device *dev;
- u16 i;
+ unsigned int i;

for (i = 0; i < ndr_desc->num_mappings; i++) {
struct nd_mapping *nd_mapping = &ndr_desc->nd_mapping[i];
@@ -557,9 +611,19 @@ static struct nd_region *nd_region_create(struct nvdimm_bus *nvdimm_bus,
if (!nd_region)
return NULL;
nd_region->id = ida_simple_get(&region_ida, 0, 0, GFP_KERNEL);
- if (nd_region->id < 0) {
- kfree(nd_region);
- return NULL;
+ if (nd_region->id < 0)
+ goto err_id;
+
+ nd_region->lane = alloc_percpu(struct nd_percpu_lane);
+ if (!nd_region->lane)
+ goto err_percpu;
+
+ for (i = 0; i < nr_cpu_ids; i++) {
+ struct nd_percpu_lane *ndl;
+
+ ndl = per_cpu_ptr(nd_region->lane, i);
+ spin_lock_init(&ndl->lock);
+ ndl->count = 0;
}

memcpy(nd_region->mapping, ndr_desc->nd_mapping,
@@ -573,6 +637,7 @@ static struct nd_region *nd_region_create(struct nvdimm_bus *nvdimm_bus,
nd_region->ndr_mappings = ndr_desc->num_mappings;
nd_region->provider_data = ndr_desc->provider_data;
nd_region->nd_set = ndr_desc->nd_set;
+ nd_region->num_lanes = ndr_desc->num_lanes;
ida_init(&nd_region->ns_ida);
dev = &nd_region->dev;
dev_set_name(dev, "region%d", nd_region->id);
@@ -584,11 +649,18 @@ static struct nd_region *nd_region_create(struct nvdimm_bus *nvdimm_bus,
nd_device_register(dev);

return nd_region;
+
+ err_percpu:
+ ida_simple_remove(&region_ida, nd_region->id);
+ err_id:
+ kfree(nd_region);
+ return NULL;
}

struct nd_region *nvdimm_pmem_region_create(struct nvdimm_bus *nvdimm_bus,
struct nd_region_desc *ndr_desc)
{
+ ndr_desc->num_lanes = ND_MAX_LANES;
return nd_region_create(nvdimm_bus, ndr_desc, &nd_pmem_device_type,
__func__);
}
@@ -599,6 +671,7 @@ struct nd_region *nvdimm_blk_region_create(struct nvdimm_bus *nvdimm_bus,
{
if (ndr_desc->num_mappings > 1)
return NULL;
+ ndr_desc->num_lanes = min(ndr_desc->num_lanes, ND_MAX_LANES);
return nd_region_create(nvdimm_bus, ndr_desc, &nd_blk_device_type,
__func__);
}
@@ -607,6 +680,7 @@ EXPORT_SYMBOL_GPL(nvdimm_blk_region_create);
struct nd_region *nvdimm_volatile_region_create(struct nvdimm_bus *nvdimm_bus,
struct nd_region_desc *ndr_desc)
{
+ ndr_desc->num_lanes = ND_MAX_LANES;
return nd_region_create(nvdimm_bus, ndr_desc, &nd_volatile_device_type,
__func__);
}
diff --git a/include/linux/libnvdimm.h b/include/linux/libnvdimm.h
index a59dca17b3aa..531d99dfac68 100644
--- a/include/linux/libnvdimm.h
+++ b/include/linux/libnvdimm.h
@@ -85,6 +85,7 @@ struct nd_region_desc {
const struct attribute_group **attr_groups;
struct nd_interleave_set *nd_set;
void *provider_data;
+ int num_lanes;
};

struct nvdimm_bus;

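The static lane mapping described in the nd_region_acquire_lane() comment above can be modeled in userspace. This is a minimal sketch, not the kernel implementation: struct lane_model and both helpers are hypothetical names, and the per-cpu recursion count and spinlock from the patch are left out so that only the cpu-to-lane arithmetic remains.

```c
#include <stdbool.h>

/* Hypothetical userspace model; these names are not kernel API. */
struct lane_model {
	unsigned int num_lanes;  /* lanes the region supports */
	unsigned int nr_cpu_ids; /* number of possible cpu ids */
};

/* Static mapping: cpus share lanes only when lanes are scarce. */
static unsigned int lane_for_cpu(const struct lane_model *m, unsigned int cpu)
{
	if (m->num_lanes < m->nr_cpu_ids)
		return cpu % m->num_lanes;
	return cpu;
}

/* A lock is only needed when more than one cpu can map to the same lane. */
static bool lane_needs_lock(const struct lane_model *m)
{
	return m->num_lanes < m->nr_cpu_ids;
}
```

With 4 lanes and 8 possible cpus, cpu 5 lands on lane 1 and has to take that lane's lock. With 256 lanes on the same machine every cpu keeps a private lane and the lock is skipped.
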
2015-06-25 09:43:09

by Dan Williams

Subject: [PATCH v2 03/17] libnvdimm, nfit, nd_blk: driver for BLK-mode access persistent memory

From: Ross Zwisler <[email protected]>

The libnvdimm implementation handles allocating DIMM physical address
space (DPA) between the PMEM and BLK mode interfaces. After DPA has been
allocated from a BLK-region to a BLK-namespace, the nd_blk driver
attaches to handle I/O as a struct bio based block device. Unlike PMEM,
BLK is required to handle platform-specific details like mmio register
formats and memory controller interleave. For this reason the generic
libnvdimm nd_blk driver calls back into the bus provider to carry out
the I/O.
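
As one example of such a register format, the block control word that write_blk_ctl() in this patch writes can be sketched in userspace as follows. The bit positions are taken from the patch. CACHE_LINE_SHIFT is an assumption standing in for the kernel's L1_CACHE_SHIFT (64-byte lines), and bcw_encode() is a hypothetical helper name.

```c
#include <stdint.h>

#define CACHE_LINE_SHIFT 6 /* assumption: 64-byte cache lines */

#define BCW_OFFSET_MASK ((1ULL << 48) - 1) /* bits 0..47: DPA in cache lines */
#define BCW_LEN_SHIFT   48                 /* bits 48..55: length in cache lines */
#define BCW_LEN_MASK    ((1ULL << 8) - 1)
#define BCW_CMD_SHIFT   56                 /* bit 56: 1 = write, 0 = read */

/* Hypothetical helper: pack dpa, len and direction into a control word. */
static uint64_t bcw_encode(uint64_t dpa, unsigned int len, unsigned int write)
{
	uint64_t cmd;

	cmd = (dpa >> CACHE_LINE_SHIFT) & BCW_OFFSET_MASK;
	cmd |= ((uint64_t)(len >> CACHE_LINE_SHIFT) & BCW_LEN_MASK)
		<< BCW_LEN_SHIFT;
	cmd |= (uint64_t)write << BCW_CMD_SHIFT;

	return cmd;
}
```

Under these assumptions a 4096-byte write at DPA 0x1000 encodes as 0x0140000000000040: cache line 0x40, a length of 0x40 lines, and the write bit set.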

This initial implementation handles the BLK interface defined by the
ACPI 6 NFIT [1] and the NVDIMM DSM Interface Example [2]. The interface
is composed of the DCR (dimm control region), BDW (block data window),
and IDT (interleave descriptor) NFIT structures plus the hardware
register format.

[1]: http://www.uefi.org/sites/default/files/resources/ACPI_6.0.pdf
[2]: http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf
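
The interleave translation that to_interleave_offset() in this patch performs with the IDT can be sketched in userspace as below. This is an illustrative model only: struct interleave_model and interleave_offset() are hypothetical names, plain division replaces div_u64_rem(), and the line offset product is widened to 64 bits, which the patch does not do.

```c
#include <stdint.h>

/* Hypothetical model of the fields the patch keeps in struct nfit_blk_mmio. */
struct interleave_model {
	uint64_t base_offset;
	uint32_t line_size;          /* bytes per interleave line */
	uint32_t num_lines;          /* entries in line_offset[] */
	uint64_t table_size;         /* num_lines * interleave_ways * line_size */
	const uint32_t *line_offset; /* per-line offsets, in units of lines */
};

/* Translate a linear offset into its location in the interleaved range. */
static uint64_t interleave_offset(uint64_t offset,
		const struct interleave_model *m)
{
	uint64_t line_no = offset / m->line_size;
	uint32_t sub_line_offset = offset % m->line_size;
	uint64_t table_skip_count = line_no / m->num_lines;
	uint32_t line_index = line_no % m->num_lines;
	uint64_t line_offset = (uint64_t)m->line_offset[line_index]
		* m->line_size;

	return m->base_offset + line_offset
		+ table_skip_count * m->table_size + sub_line_offset;
}
```

For a 2-way interleave with 256-byte lines and line offsets {0, 2}, linear offset 300 falls in line 1 at byte 44, so it translates to 2 * 256 + 44 = 556. Offset 512 wraps to the next table and translates to 1024.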

Cc: Andy Lutomirski <[email protected]>
Cc: Boaz Harrosh <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Signed-off-by: Ross Zwisler <[email protected]>
Acked-by: Rafael J. Wysocki <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
drivers/acpi/nfit.c | 449 ++++++++++++++++++++++++++++++++++++++-
drivers/acpi/nfit.h | 49 ++++
drivers/nvdimm/Kconfig | 13 +
drivers/nvdimm/Makefile | 3
drivers/nvdimm/blk.c | 245 +++++++++++++++++++++
drivers/nvdimm/dimm_devs.c | 9 +
drivers/nvdimm/namespace_devs.c | 65 ++++++
drivers/nvdimm/nd-core.h | 3
drivers/nvdimm/nd.h | 13 +
drivers/nvdimm/region.c | 8 +
drivers/nvdimm/region_devs.c | 91 +++++++-
include/linux/libnvdimm.h | 27 ++
12 files changed, 941 insertions(+), 34 deletions(-)
create mode 100644 drivers/nvdimm/blk.c

diff --git a/drivers/acpi/nfit.c b/drivers/acpi/nfit.c
index fc38b49eff7d..2aa6aa97b40c 100644
--- a/drivers/acpi/nfit.c
+++ b/drivers/acpi/nfit.c
@@ -13,12 +13,20 @@
#include <linux/list_sort.h>
#include <linux/libnvdimm.h>
#include <linux/module.h>
+#include <linux/mutex.h>
#include <linux/ndctl.h>
#include <linux/list.h>
#include <linux/acpi.h>
#include <linux/sort.h>
+#include <linux/io.h>
#include "nfit.h"

+/*
+ * For readq() and writeq() on 32-bit builds, the hi-lo, lo-hi order is
+ * irrelevant.
+ */
+#include <asm-generic/io-64-nonatomic-hi-lo.h>
+
static bool force_enable_dimms;
module_param(force_enable_dimms, bool, S_IRUGO|S_IWUSR);
MODULE_PARM_DESC(force_enable_dimms, "Ignore _STA (ACPI DIMM device) status");
@@ -72,7 +80,7 @@ static int acpi_nfit_ctl(struct nvdimm_bus_descriptor *nd_desc,

if (!adev)
return -ENOTTY;
- dimm_name = dev_name(&adev->dev);
+ dimm_name = nvdimm_name(nvdimm);
cmd_name = nvdimm_cmd_name(cmd);
dsm_mask = nfit_mem->dsm_mask;
desc = nd_cmd_dimm_desc(cmd);
@@ -279,6 +287,23 @@ static bool add_bdw(struct acpi_nfit_desc *acpi_desc,
return true;
}

+static bool add_idt(struct acpi_nfit_desc *acpi_desc,
+ struct acpi_nfit_interleave *idt)
+{
+ struct device *dev = acpi_desc->dev;
+ struct nfit_idt *nfit_idt = devm_kzalloc(dev, sizeof(*nfit_idt),
+ GFP_KERNEL);
+
+ if (!nfit_idt)
+ return false;
+ INIT_LIST_HEAD(&nfit_idt->list);
+ nfit_idt->idt = idt;
+ list_add_tail(&nfit_idt->list, &acpi_desc->idts);
+ dev_dbg(dev, "%s: idt index: %d num_lines: %d\n", __func__,
+ idt->interleave_index, idt->line_count);
+ return true;
+}
+
static void *add_table(struct acpi_nfit_desc *acpi_desc, void *table,
const void *end)
{
@@ -307,9 +332,9 @@ static void *add_table(struct acpi_nfit_desc *acpi_desc, void *table,
if (!add_bdw(acpi_desc, table))
return err;
break;
- /* TODO */
case ACPI_NFIT_TYPE_INTERLEAVE:
- dev_dbg(dev, "%s: idt\n", __func__);
+ if (!add_idt(acpi_desc, table))
+ return err;
break;
case ACPI_NFIT_TYPE_FLUSH_ADDRESS:
dev_dbg(dev, "%s: flush\n", __func__);
@@ -362,8 +387,11 @@ static int nfit_mem_add(struct acpi_nfit_desc *acpi_desc,
struct nfit_mem *nfit_mem, struct acpi_nfit_system_address *spa)
{
u16 dcr = __to_nfit_memdev(nfit_mem)->region_index;
+ struct nfit_memdev *nfit_memdev;
struct nfit_dcr *nfit_dcr;
struct nfit_bdw *nfit_bdw;
+ struct nfit_idt *nfit_idt;
+ u16 idt_idx, range_index;

list_for_each_entry(nfit_dcr, &acpi_desc->dcrs, list) {
if (nfit_dcr->dcr->region_index != dcr)
@@ -396,6 +424,26 @@ static int nfit_mem_add(struct acpi_nfit_desc *acpi_desc,
return 0;

nfit_mem_find_spa_bdw(acpi_desc, nfit_mem);
+
+ if (!nfit_mem->spa_bdw)
+ return 0;
+
+ range_index = nfit_mem->spa_bdw->range_index;
+ list_for_each_entry(nfit_memdev, &acpi_desc->memdevs, list) {
+ if (nfit_memdev->memdev->range_index != range_index ||
+ nfit_memdev->memdev->region_index != dcr)
+ continue;
+ nfit_mem->memdev_bdw = nfit_memdev->memdev;
+ idt_idx = nfit_memdev->memdev->interleave_index;
+ list_for_each_entry(nfit_idt, &acpi_desc->idts, list) {
+ if (nfit_idt->idt->interleave_index != idt_idx)
+ continue;
+ nfit_mem->idt_bdw = nfit_idt->idt;
+ break;
+ }
+ break;
+ }
+
return 0;
}

@@ -439,9 +487,19 @@ static int nfit_mem_dcr_init(struct acpi_nfit_desc *acpi_desc,
}

if (type == NFIT_SPA_DCR) {
+ struct nfit_idt *nfit_idt;
+ u16 idt_idx;
+
/* multiple dimms may share a SPA when interleaved */
nfit_mem->spa_dcr = spa;
nfit_mem->memdev_dcr = nfit_memdev->memdev;
+ idt_idx = nfit_memdev->memdev->interleave_index;
+ list_for_each_entry(nfit_idt, &acpi_desc->idts, list) {
+ if (nfit_idt->idt->interleave_index != idt_idx)
+ continue;
+ nfit_mem->idt_dcr = nfit_idt->idt;
+ break;
+ }
} else {
/*
* A single dimm may belong to multiple SPA-PM
@@ -871,6 +929,359 @@ static int acpi_nfit_init_interleave_set(struct acpi_nfit_desc *acpi_desc,
return 0;
}

+static u64 to_interleave_offset(u64 offset, struct nfit_blk_mmio *mmio)
+{
+ struct acpi_nfit_interleave *idt = mmio->idt;
+ u32 sub_line_offset, line_index, line_offset;
+ u64 line_no, table_skip_count, table_offset;
+
+ line_no = div_u64_rem(offset, mmio->line_size, &sub_line_offset);
+ table_skip_count = div_u64_rem(line_no, mmio->num_lines, &line_index);
+ line_offset = idt->line_offset[line_index]
+ * mmio->line_size;
+ table_offset = table_skip_count * mmio->table_size;
+
+ return mmio->base_offset + line_offset + table_offset + sub_line_offset;
+}
+
+static u64 read_blk_stat(struct nfit_blk *nfit_blk, unsigned int bw)
+{
+ struct nfit_blk_mmio *mmio = &nfit_blk->mmio[DCR];
+ u64 offset = nfit_blk->stat_offset + mmio->size * bw;
+
+ if (mmio->num_lines)
+ offset = to_interleave_offset(offset, mmio);
+
+ return readq(mmio->base + offset);
+}
+
+static void write_blk_ctl(struct nfit_blk *nfit_blk, unsigned int bw,
+ resource_size_t dpa, unsigned int len, unsigned int write)
+{
+ u64 cmd, offset;
+ struct nfit_blk_mmio *mmio = &nfit_blk->mmio[DCR];
+
+ enum {
+ BCW_OFFSET_MASK = (1ULL << 48)-1,
+ BCW_LEN_SHIFT = 48,
+ BCW_LEN_MASK = (1ULL << 8) - 1,
+ BCW_CMD_SHIFT = 56,
+ };
+
+ cmd = (dpa >> L1_CACHE_SHIFT) & BCW_OFFSET_MASK;
+ len = len >> L1_CACHE_SHIFT;
+ cmd |= ((u64) len & BCW_LEN_MASK) << BCW_LEN_SHIFT;
+ cmd |= ((u64) write) << BCW_CMD_SHIFT;
+
+ offset = nfit_blk->cmd_offset + mmio->size * bw;
+ if (mmio->num_lines)
+ offset = to_interleave_offset(offset, mmio);
+
+ writeq(cmd, mmio->base + offset);
+ /* FIXME: conditionally perform read-back if mandated by firmware */
+}
+
+static int acpi_nfit_blk_single_io(struct nfit_blk *nfit_blk,
+ resource_size_t dpa, void *iobuf, size_t len, int rw,
+ unsigned int lane)
+{
+ struct nfit_blk_mmio *mmio = &nfit_blk->mmio[BDW];
+ unsigned int copied = 0;
+ u64 base_offset;
+ int rc;
+
+ base_offset = nfit_blk->bdw_offset + dpa % L1_CACHE_BYTES
+ + lane * mmio->size;
+ /* TODO: non-temporal access, flush hints, cache management etc... */
+ write_blk_ctl(nfit_blk, lane, dpa, len, rw);
+ while (len) {
+ unsigned int c;
+ u64 offset;
+
+ if (mmio->num_lines) {
+ u32 line_offset;
+
+ offset = to_interleave_offset(base_offset + copied,
+ mmio);
+ div_u64_rem(offset, mmio->line_size, &line_offset);
+ c = min_t(size_t, len, mmio->line_size - line_offset);
+ } else {
+ offset = base_offset + nfit_blk->bdw_offset;
+ c = len;
+ }
+
+ if (rw)
+ memcpy(mmio->aperture + offset, iobuf + copied, c);
+ else
+ memcpy(iobuf + copied, mmio->aperture + offset, c);
+
+ copied += c;
+ len -= c;
+ }
+ rc = read_blk_stat(nfit_blk, lane) ? -EIO : 0;
+ return rc;
+}
+
+static int acpi_nfit_blk_region_do_io(struct nd_blk_region *ndbr,
+ resource_size_t dpa, void *iobuf, u64 len, int rw)
+{
+ struct nfit_blk *nfit_blk = nd_blk_region_provider_data(ndbr);
+ struct nfit_blk_mmio *mmio = &nfit_blk->mmio[BDW];
+ struct nd_region *nd_region = nfit_blk->nd_region;
+ unsigned int lane, copied = 0;
+ int rc = 0;
+
+ lane = nd_region_acquire_lane(nd_region);
+ while (len) {
+ u64 c = min(len, mmio->size);
+
+ rc = acpi_nfit_blk_single_io(nfit_blk, dpa + copied,
+ iobuf + copied, c, rw, lane);
+ if (rc)
+ break;
+
+ copied += c;
+ len -= c;
+ }
+ nd_region_release_lane(nd_region, lane);
+
+ return rc;
+}
+
+static void nfit_spa_mapping_release(struct kref *kref)
+{
+ struct nfit_spa_mapping *spa_map = to_spa_map(kref);
+ struct acpi_nfit_system_address *spa = spa_map->spa;
+ struct acpi_nfit_desc *acpi_desc = spa_map->acpi_desc;
+
+ WARN_ON(!mutex_is_locked(&acpi_desc->spa_map_mutex));
+ dev_dbg(acpi_desc->dev, "%s: SPA%d\n", __func__, spa->range_index);
+ iounmap(spa_map->iomem);
+ release_mem_region(spa->address, spa->length);
+ list_del(&spa_map->list);
+ kfree(spa_map);
+}
+
+static struct nfit_spa_mapping *find_spa_mapping(
+ struct acpi_nfit_desc *acpi_desc,
+ struct acpi_nfit_system_address *spa)
+{
+ struct nfit_spa_mapping *spa_map;
+
+ WARN_ON(!mutex_is_locked(&acpi_desc->spa_map_mutex));
+ list_for_each_entry(spa_map, &acpi_desc->spa_maps, list)
+ if (spa_map->spa == spa)
+ return spa_map;
+
+ return NULL;
+}
+
+static void nfit_spa_unmap(struct acpi_nfit_desc *acpi_desc,
+ struct acpi_nfit_system_address *spa)
+{
+ struct nfit_spa_mapping *spa_map;
+
+ mutex_lock(&acpi_desc->spa_map_mutex);
+ spa_map = find_spa_mapping(acpi_desc, spa);
+
+ if (spa_map)
+ kref_put(&spa_map->kref, nfit_spa_mapping_release);
+ mutex_unlock(&acpi_desc->spa_map_mutex);
+}
+
+static void __iomem *__nfit_spa_map(struct acpi_nfit_desc *acpi_desc,
+ struct acpi_nfit_system_address *spa)
+{
+ resource_size_t start = spa->address;
+ resource_size_t n = spa->length;
+ struct nfit_spa_mapping *spa_map;
+ struct resource *res;
+
+ WARN_ON(!mutex_is_locked(&acpi_desc->spa_map_mutex));
+
+ spa_map = find_spa_mapping(acpi_desc, spa);
+ if (spa_map) {
+ kref_get(&spa_map->kref);
+ return spa_map->iomem;
+ }
+
+ spa_map = kzalloc(sizeof(*spa_map), GFP_KERNEL);
+ if (!spa_map)
+ return NULL;
+
+ INIT_LIST_HEAD(&spa_map->list);
+ spa_map->spa = spa;
+ kref_init(&spa_map->kref);
+ spa_map->acpi_desc = acpi_desc;
+
+ res = request_mem_region(start, n, dev_name(acpi_desc->dev));
+ if (!res)
+ goto err_mem;
+
+ /* TODO: cacheability based on the spa type */
+ spa_map->iomem = ioremap_nocache(start, n);
+ if (!spa_map->iomem)
+ goto err_map;
+
+ list_add_tail(&spa_map->list, &acpi_desc->spa_maps);
+ return spa_map->iomem;
+
+ err_map:
+ release_mem_region(start, n);
+ err_mem:
+ kfree(spa_map);
+ return NULL;
+}
+
+/**
+ * nfit_spa_map - interleave-aware managed-mappings of acpi_nfit_system_address ranges
+ * @acpi_desc: NFIT-bus descriptor that provided the spa table entry
+ * @spa: spa table to map
+ *
+ * In the case where block-data-window apertures and
+ * dimm-control-regions are interleaved they will end up sharing a
+ * single request_mem_region() + ioremap() for the address range. In
+ * the style of devm nfit_spa_map() mappings are automatically dropped
+ * when all region devices referencing the same mapping are disabled /
+ * unbound.
+ */
+static void __iomem *nfit_spa_map(struct acpi_nfit_desc *acpi_desc,
+ struct acpi_nfit_system_address *spa)
+{
+ void __iomem *iomem;
+
+ mutex_lock(&acpi_desc->spa_map_mutex);
+ iomem = __nfit_spa_map(acpi_desc, spa);
+ mutex_unlock(&acpi_desc->spa_map_mutex);
+
+ return iomem;
+}
+
+static int nfit_blk_init_interleave(struct nfit_blk_mmio *mmio,
+ struct acpi_nfit_interleave *idt, u16 interleave_ways)
+{
+ if (idt) {
+ mmio->num_lines = idt->line_count;
+ mmio->line_size = idt->line_size;
+ if (interleave_ways == 0)
+ return -ENXIO;
+ mmio->table_size = mmio->num_lines * interleave_ways
+ * mmio->line_size;
+ }
+
+ return 0;
+}
+
+static int acpi_nfit_blk_region_enable(struct nvdimm_bus *nvdimm_bus,
+ struct device *dev)
+{
+ struct nvdimm_bus_descriptor *nd_desc = to_nd_desc(nvdimm_bus);
+ struct acpi_nfit_desc *acpi_desc = to_acpi_desc(nd_desc);
+ struct nd_blk_region *ndbr = to_nd_blk_region(dev);
+ struct nfit_blk_mmio *mmio;
+ struct nfit_blk *nfit_blk;
+ struct nfit_mem *nfit_mem;
+ struct nvdimm *nvdimm;
+ int rc;
+
+ nvdimm = nd_blk_region_to_dimm(ndbr);
+ nfit_mem = nvdimm_provider_data(nvdimm);
+ if (!nfit_mem || !nfit_mem->dcr || !nfit_mem->bdw) {
+ dev_dbg(dev, "%s: missing%s%s%s\n", __func__,
+ nfit_mem ? "" : " nfit_mem",
+ (nfit_mem && nfit_mem->dcr) ? "" : " dcr",
+ (nfit_mem && nfit_mem->bdw) ? "" : " bdw");
+ return -ENXIO;
+ }
+
+ nfit_blk = devm_kzalloc(dev, sizeof(*nfit_blk), GFP_KERNEL);
+ if (!nfit_blk)
+ return -ENOMEM;
+ nd_blk_region_set_provider_data(ndbr, nfit_blk);
+ nfit_blk->nd_region = to_nd_region(dev);
+
+ /* map block aperture memory */
+ nfit_blk->bdw_offset = nfit_mem->bdw->offset;
+ mmio = &nfit_blk->mmio[BDW];
+ mmio->base = nfit_spa_map(acpi_desc, nfit_mem->spa_bdw);
+ if (!mmio->base) {
+ dev_dbg(dev, "%s: %s failed to map bdw\n", __func__,
+ nvdimm_name(nvdimm));
+ return -ENOMEM;
+ }
+ mmio->size = nfit_mem->bdw->size;
+ mmio->base_offset = nfit_mem->memdev_bdw->region_offset;
+ mmio->idt = nfit_mem->idt_bdw;
+ mmio->spa = nfit_mem->spa_bdw;
+ rc = nfit_blk_init_interleave(mmio, nfit_mem->idt_bdw,
+ nfit_mem->memdev_bdw->interleave_ways);
+ if (rc) {
+ dev_dbg(dev, "%s: %s failed to init bdw interleave\n",
+ __func__, nvdimm_name(nvdimm));
+ return rc;
+ }
+
+ /* map block control memory */
+ nfit_blk->cmd_offset = nfit_mem->dcr->command_offset;
+ nfit_blk->stat_offset = nfit_mem->dcr->status_offset;
+ mmio = &nfit_blk->mmio[DCR];
+ mmio->base = nfit_spa_map(acpi_desc, nfit_mem->spa_dcr);
+ if (!mmio->base) {
+ dev_dbg(dev, "%s: %s failed to map dcr\n", __func__,
+ nvdimm_name(nvdimm));
+ return -ENOMEM;
+ }
+ mmio->size = nfit_mem->dcr->window_size;
+ mmio->base_offset = nfit_mem->memdev_dcr->region_offset;
+ mmio->idt = nfit_mem->idt_dcr;
+ mmio->spa = nfit_mem->spa_dcr;
+ rc = nfit_blk_init_interleave(mmio, nfit_mem->idt_dcr,
+ nfit_mem->memdev_dcr->interleave_ways);
+ if (rc) {
+ dev_dbg(dev, "%s: %s failed to init dcr interleave\n",
+ __func__, nvdimm_name(nvdimm));
+ return rc;
+ }
+
+ if (mmio->line_size == 0)
+ return 0;
+
+ if ((u32) nfit_blk->cmd_offset % mmio->line_size
+ + 8 > mmio->line_size) {
+ dev_dbg(dev, "cmd_offset crosses interleave boundary\n");
+ return -ENXIO;
+ } else if ((u32) nfit_blk->stat_offset % mmio->line_size
+ + 8 > mmio->line_size) {
+ dev_dbg(dev, "stat_offset crosses interleave boundary\n");
+ return -ENXIO;
+ }
+
+ return 0;
+}
+
+static void acpi_nfit_blk_region_disable(struct nvdimm_bus *nvdimm_bus,
+ struct device *dev)
+{
+ struct nvdimm_bus_descriptor *nd_desc = to_nd_desc(nvdimm_bus);
+ struct acpi_nfit_desc *acpi_desc = to_acpi_desc(nd_desc);
+ struct nd_blk_region *ndbr = to_nd_blk_region(dev);
+ struct nfit_blk *nfit_blk = nd_blk_region_provider_data(ndbr);
+ int i;
+
+ if (!nfit_blk)
+ return; /* never enabled */
+
+ /* auto-free BLK spa mappings */
+ for (i = 0; i < 2; i++) {
+ struct nfit_blk_mmio *mmio = &nfit_blk->mmio[i];
+
+ if (mmio->base)
+ nfit_spa_unmap(acpi_desc, mmio->spa);
+ }
+ nd_blk_region_set_provider_data(ndbr, NULL);
+ /* devm will free nfit_blk */
+}
+
static int acpi_nfit_init_mapping(struct acpi_nfit_desc *acpi_desc,
struct nd_mapping *nd_mapping, struct nd_region_desc *ndr_desc,
struct acpi_nfit_memory_map *memdev,
@@ -878,6 +1289,7 @@ static int acpi_nfit_init_mapping(struct acpi_nfit_desc *acpi_desc,
{
struct nvdimm *nvdimm = acpi_nfit_dimm_by_handle(acpi_desc,
memdev->device_handle);
+ struct nd_blk_region_desc *ndbr_desc;
struct nfit_mem *nfit_mem;
int blk_valid = 0;

@@ -908,6 +1320,10 @@ static int acpi_nfit_init_mapping(struct acpi_nfit_desc *acpi_desc,

ndr_desc->nd_mapping = nd_mapping;
ndr_desc->num_mappings = blk_valid;
+ ndbr_desc = to_blk_region_desc(ndr_desc);
+ ndbr_desc->enable = acpi_nfit_blk_region_enable;
+ ndbr_desc->disable = acpi_nfit_blk_region_disable;
+ ndbr_desc->do_io = acpi_nfit_blk_region_do_io;
if (!nvdimm_blk_region_create(acpi_desc->nvdimm_bus, ndr_desc))
return -ENOMEM;
break;
@@ -921,8 +1337,9 @@ static int acpi_nfit_register_region(struct acpi_nfit_desc *acpi_desc,
{
static struct nd_mapping nd_mappings[ND_MAX_MAPPINGS];
struct acpi_nfit_system_address *spa = nfit_spa->spa;
+ struct nd_blk_region_desc ndbr_desc;
+ struct nd_region_desc *ndr_desc;
struct nfit_memdev *nfit_memdev;
- struct nd_region_desc ndr_desc;
struct nvdimm_bus *nvdimm_bus;
struct resource res;
int count = 0, rc;
@@ -935,12 +1352,13 @@ static int acpi_nfit_register_region(struct acpi_nfit_desc *acpi_desc,

memset(&res, 0, sizeof(res));
memset(&nd_mappings, 0, sizeof(nd_mappings));
- memset(&ndr_desc, 0, sizeof(ndr_desc));
+ memset(&ndbr_desc, 0, sizeof(ndbr_desc));
res.start = spa->address;
res.end = res.start + spa->length - 1;
- ndr_desc.res = &res;
- ndr_desc.provider_data = nfit_spa;
- ndr_desc.attr_groups = acpi_nfit_region_attribute_groups;
+ ndr_desc = &ndbr_desc.ndr_desc;
+ ndr_desc->res = &res;
+ ndr_desc->provider_data = nfit_spa;
+ ndr_desc->attr_groups = acpi_nfit_region_attribute_groups;
list_for_each_entry(nfit_memdev, &acpi_desc->memdevs, list) {
struct acpi_nfit_memory_map *memdev = nfit_memdev->memdev;
struct nd_mapping *nd_mapping;
@@ -953,24 +1371,24 @@ static int acpi_nfit_register_region(struct acpi_nfit_desc *acpi_desc,
return -ENXIO;
}
nd_mapping = &nd_mappings[count++];
- rc = acpi_nfit_init_mapping(acpi_desc, nd_mapping, &ndr_desc,
+ rc = acpi_nfit_init_mapping(acpi_desc, nd_mapping, ndr_desc,
memdev, spa);
if (rc)
return rc;
}

- ndr_desc.nd_mapping = nd_mappings;
- ndr_desc.num_mappings = count;
- rc = acpi_nfit_init_interleave_set(acpi_desc, &ndr_desc, spa);
+ ndr_desc->nd_mapping = nd_mappings;
+ ndr_desc->num_mappings = count;
+ rc = acpi_nfit_init_interleave_set(acpi_desc, ndr_desc, spa);
if (rc)
return rc;

nvdimm_bus = acpi_desc->nvdimm_bus;
if (nfit_spa_type(spa) == NFIT_SPA_PM) {
- if (!nvdimm_pmem_region_create(nvdimm_bus, &ndr_desc))
+ if (!nvdimm_pmem_region_create(nvdimm_bus, ndr_desc))
return -ENOMEM;
} else if (nfit_spa_type(spa) == NFIT_SPA_VOLATILE) {
- if (!nvdimm_volatile_region_create(nvdimm_bus, &ndr_desc))
+ if (!nvdimm_volatile_region_create(nvdimm_bus, ndr_desc))
return -ENOMEM;
}
return 0;
@@ -996,11 +1414,14 @@ static int acpi_nfit_init(struct acpi_nfit_desc *acpi_desc, acpi_size sz)
u8 *data;
int rc;

+ INIT_LIST_HEAD(&acpi_desc->spa_maps);
INIT_LIST_HEAD(&acpi_desc->spas);
INIT_LIST_HEAD(&acpi_desc->dcrs);
INIT_LIST_HEAD(&acpi_desc->bdws);
+ INIT_LIST_HEAD(&acpi_desc->idts);
INIT_LIST_HEAD(&acpi_desc->memdevs);
INIT_LIST_HEAD(&acpi_desc->dimms);
+ mutex_init(&acpi_desc->spa_map_mutex);

data = (u8 *) acpi_desc->nfit;
end = data + sz;
diff --git a/drivers/acpi/nfit.h b/drivers/acpi/nfit.h
index b76e33629098..7bd38b7baf39 100644
--- a/drivers/acpi/nfit.h
+++ b/drivers/acpi/nfit.h
@@ -52,6 +52,11 @@ struct nfit_bdw {
struct list_head list;
};

+struct nfit_idt {
+ struct acpi_nfit_interleave *idt;
+ struct list_head list;
+};
+
struct nfit_memdev {
struct acpi_nfit_memory_map *memdev;
struct list_head list;
@@ -62,10 +67,13 @@ struct nfit_mem {
struct nvdimm *nvdimm;
struct acpi_nfit_memory_map *memdev_dcr;
struct acpi_nfit_memory_map *memdev_pmem;
+ struct acpi_nfit_memory_map *memdev_bdw;
struct acpi_nfit_control_region *dcr;
struct acpi_nfit_data_region *bdw;
struct acpi_nfit_system_address *spa_dcr;
struct acpi_nfit_system_address *spa_bdw;
+ struct acpi_nfit_interleave *idt_dcr;
+ struct acpi_nfit_interleave *idt_bdw;
struct list_head list;
struct acpi_device *adev;
unsigned long dsm_mask;
@@ -74,16 +82,57 @@ struct nfit_mem {
struct acpi_nfit_desc {
struct nvdimm_bus_descriptor nd_desc;
struct acpi_table_nfit *nfit;
+ struct mutex spa_map_mutex;
+ struct list_head spa_maps;
struct list_head memdevs;
struct list_head dimms;
struct list_head spas;
struct list_head dcrs;
struct list_head bdws;
+ struct list_head idts;
struct nvdimm_bus *nvdimm_bus;
struct device *dev;
unsigned long dimm_dsm_force_en;
};

+enum nd_blk_mmio_selector {
+ BDW,
+ DCR,
+};
+
+struct nfit_blk {
+ struct nfit_blk_mmio {
+ union {
+ void __iomem *base;
+ void *aperture;
+ };
+ u64 size;
+ u64 base_offset;
+ u32 line_size;
+ u32 num_lines;
+ u32 table_size;
+ struct acpi_nfit_interleave *idt;
+ struct acpi_nfit_system_address *spa;
+ } mmio[2];
+ struct nd_region *nd_region;
+ u64 bdw_offset; /* post interleave offset */
+ u64 stat_offset;
+ u64 cmd_offset;
+};
+
+struct nfit_spa_mapping {
+ struct acpi_nfit_desc *acpi_desc;
+ struct acpi_nfit_system_address *spa;
+ struct list_head list;
+ struct kref kref;
+ void __iomem *iomem;
+};
+
+static inline struct nfit_spa_mapping *to_spa_map(struct kref *kref)
+{
+ return container_of(kref, struct nfit_spa_mapping, kref);
+}
+
static inline struct acpi_nfit_memory_map *__to_nfit_memdev(
struct nfit_mem *nfit_mem)
{
diff --git a/drivers/nvdimm/Kconfig b/drivers/nvdimm/Kconfig
index 204ee0796411..72226acb5c0f 100644
--- a/drivers/nvdimm/Kconfig
+++ b/drivers/nvdimm/Kconfig
@@ -34,6 +34,19 @@ config BLK_DEV_PMEM

Say Y if you want to use an NVDIMM

+config ND_BLK
+ tristate "BLK: Block data window (aperture) device support"
+ default LIBNVDIMM
+ select ND_BTT if BTT
+ help
+ Support NVDIMMs, or other devices, that implement a BLK-mode
+ access capability. BLK-mode access uses memory-mapped-i/o
+ apertures to access persistent media.
+
+ Say Y if your platform firmware emits an ACPI.NFIT table
+ (CONFIG_ACPI_NFIT), or otherwise exposes BLK-mode
+ capabilities.
+
config ND_BTT
tristate

diff --git a/drivers/nvdimm/Makefile b/drivers/nvdimm/Makefile
index d2aab6c58492..594bb97c867a 100644
--- a/drivers/nvdimm/Makefile
+++ b/drivers/nvdimm/Makefile
@@ -1,11 +1,14 @@
obj-$(CONFIG_LIBNVDIMM) += libnvdimm.o
obj-$(CONFIG_BLK_DEV_PMEM) += nd_pmem.o
obj-$(CONFIG_ND_BTT) += nd_btt.o
+obj-$(CONFIG_ND_BLK) += nd_blk.o

nd_pmem-y := pmem.o

nd_btt-y := btt.o

+nd_blk-y := blk.o
+
libnvdimm-y := core.o
libnvdimm-y += bus.o
libnvdimm-y += dimm_devs.o
diff --git a/drivers/nvdimm/blk.c b/drivers/nvdimm/blk.c
new file mode 100644
index 000000000000..9ac0c266c15c
--- /dev/null
+++ b/drivers/nvdimm/blk.c
@@ -0,0 +1,245 @@
+/*
+ * NVDIMM Block Window Driver
+ * Copyright (c) 2014, Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
+ * more details.
+ */
+
+#include <linux/blkdev.h>
+#include <linux/fs.h>
+#include <linux/genhd.h>
+#include <linux/module.h>
+#include <linux/moduleparam.h>
+#include <linux/nd.h>
+#include <linux/sizes.h>
+#include "nd.h"
+
+struct nd_blk_device {
+ struct request_queue *queue;
+ struct gendisk *disk;
+ struct nd_namespace_blk *nsblk;
+ struct nd_blk_region *ndbr;
+ size_t disk_size;
+};
+
+static int nd_blk_major;
+
+static resource_size_t to_dev_offset(struct nd_namespace_blk *nsblk,
+ resource_size_t ns_offset, unsigned int len)
+{
+ int i;
+
+ for (i = 0; i < nsblk->num_resources; i++) {
+ if (ns_offset < resource_size(nsblk->res[i])) {
+ if (ns_offset + len > resource_size(nsblk->res[i])) {
+ dev_WARN_ONCE(&nsblk->common.dev, 1,
+ "illegal request\n");
+ return SIZE_MAX;
+ }
+ return nsblk->res[i]->start + ns_offset;
+ }
+ ns_offset -= resource_size(nsblk->res[i]);
+ }
+
+ dev_WARN_ONCE(&nsblk->common.dev, 1, "request out of range\n");
+ return SIZE_MAX;
+}
+
+static void nd_blk_make_request(struct request_queue *q, struct bio *bio)
+{
+ struct block_device *bdev = bio->bi_bdev;
+ struct gendisk *disk = bdev->bd_disk;
+ struct nd_namespace_blk *nsblk;
+ struct nd_blk_device *blk_dev;
+ struct nd_blk_region *ndbr;
+ struct bvec_iter iter;
+ struct bio_vec bvec;
+ int err = 0, rw;
+
+ blk_dev = disk->private_data;
+ nsblk = blk_dev->nsblk;
+ ndbr = blk_dev->ndbr;
+ rw = bio_data_dir(bio);
+ bio_for_each_segment(bvec, bio, iter) {
+ unsigned int len = bvec.bv_len;
+ resource_size_t dev_offset;
+ void *iobuf;
+
+ BUG_ON(len > PAGE_SIZE);
+
+ dev_offset = to_dev_offset(nsblk,
+ iter.bi_sector << SECTOR_SHIFT, len);
+ if (dev_offset == SIZE_MAX) {
+ err = -EIO;
+ goto out;
+ }
+
+ iobuf = kmap_atomic(bvec.bv_page);
+ err = ndbr->do_io(ndbr, dev_offset, iobuf + bvec.bv_offset,
+ len, rw);
+ kunmap_atomic(iobuf);
+ if (err)
+ goto out;
+ }
+
+ out:
+ bio_endio(bio, err);
+}
+
+static int nd_blk_rw_bytes(struct nd_namespace_common *ndns,
+ resource_size_t offset, void *iobuf, size_t n, int rw)
+{
+ struct nd_blk_device *blk_dev = dev_get_drvdata(ndns->claim);
+ struct nd_namespace_blk *nsblk = blk_dev->nsblk;
+ struct nd_blk_region *ndbr = blk_dev->ndbr;
+ resource_size_t dev_offset;
+
+ dev_offset = to_dev_offset(nsblk, offset, n);
+
+ if (unlikely(offset + n > blk_dev->disk_size)) {
+ dev_WARN_ONCE(&ndns->dev, 1, "request out of range\n");
+ return -EFAULT;
+ }
+
+ if (dev_offset == SIZE_MAX)
+ return -EIO;
+
+ return ndbr->do_io(ndbr, dev_offset, iobuf, n, rw);
+}
+
+static const struct block_device_operations nd_blk_fops = {
+ .owner = THIS_MODULE,
+};
+
+static int nd_blk_attach_disk(struct nd_namespace_common *ndns,
+ struct nd_blk_device *blk_dev)
+{
+ struct nd_namespace_blk *nsblk = to_nd_namespace_blk(&ndns->dev);
+ struct gendisk *disk;
+
+ blk_dev->queue = blk_alloc_queue(GFP_KERNEL);
+ if (!blk_dev->queue)
+ return -ENOMEM;
+
+ blk_queue_make_request(blk_dev->queue, nd_blk_make_request);
+ blk_queue_max_hw_sectors(blk_dev->queue, UINT_MAX);
+ blk_queue_bounce_limit(blk_dev->queue, BLK_BOUNCE_ANY);
+ blk_queue_logical_block_size(blk_dev->queue, nsblk->lbasize);
+ queue_flag_set_unlocked(QUEUE_FLAG_NONROT, blk_dev->queue);
+
+ disk = blk_dev->disk = alloc_disk(0);
+ if (!disk) {
+ blk_cleanup_queue(blk_dev->queue);
+ return -ENOMEM;
+ }
+
+ disk->driverfs_dev = &ndns->dev;
+ disk->major = nd_blk_major;
+ disk->first_minor = 0;
+ disk->fops = &nd_blk_fops;
+ disk->private_data = blk_dev;
+ disk->queue = blk_dev->queue;
+ disk->flags = GENHD_FL_EXT_DEVT;
+ nvdimm_namespace_disk_name(ndns, disk->disk_name);
+ set_capacity(disk, blk_dev->disk_size >> SECTOR_SHIFT);
+ add_disk(disk);
+
+ return 0;
+}
+
+static int nd_blk_probe(struct device *dev)
+{
+ struct nd_namespace_common *ndns;
+ struct nd_blk_device *blk_dev;
+ int rc;
+
+ ndns = nvdimm_namespace_common_probe(dev);
+ if (IS_ERR(ndns))
+ return PTR_ERR(ndns);
+
+ blk_dev = kzalloc(sizeof(*blk_dev), GFP_KERNEL);
+ if (!blk_dev)
+ return -ENOMEM;
+
+ blk_dev->disk_size = nvdimm_namespace_capacity(ndns);
+ blk_dev->ndbr = to_nd_blk_region(dev->parent);
+ blk_dev->nsblk = to_nd_namespace_blk(&ndns->dev);
+ dev_set_drvdata(dev, blk_dev);
+
+ ndns->rw_bytes = nd_blk_rw_bytes;
+ if (is_nd_btt(dev))
+ rc = nvdimm_namespace_attach_btt(ndns);
+ else if (nd_btt_probe(ndns, blk_dev) == 0) {
+ /* we'll come back as btt-blk */
+ rc = -ENXIO;
+ } else
+ rc = nd_blk_attach_disk(ndns, blk_dev);
+ if (rc)
+ kfree(blk_dev);
+ return rc;
+}
+
+static void nd_blk_detach_disk(struct nd_blk_device *blk_dev)
+{
+ del_gendisk(blk_dev->disk);
+ put_disk(blk_dev->disk);
+ blk_cleanup_queue(blk_dev->queue);
+}
+
+static int nd_blk_remove(struct device *dev)
+{
+ struct nd_blk_device *blk_dev = dev_get_drvdata(dev);
+
+ if (is_nd_btt(dev))
+ nvdimm_namespace_detach_btt(to_nd_btt(dev)->ndns);
+ else
+ nd_blk_detach_disk(blk_dev);
+ kfree(blk_dev);
+
+ return 0;
+}
+
+static struct nd_device_driver nd_blk_driver = {
+ .probe = nd_blk_probe,
+ .remove = nd_blk_remove,
+ .drv = {
+ .name = "nd_blk",
+ },
+ .type = ND_DRIVER_NAMESPACE_BLK,
+};
+
+static int __init nd_blk_init(void)
+{
+ int rc;
+
+ rc = register_blkdev(0, "nd_blk");
+ if (rc < 0)
+ return rc;
+
+ nd_blk_major = rc;
+ rc = nd_driver_register(&nd_blk_driver);
+
+ if (rc < 0)
+ unregister_blkdev(nd_blk_major, "nd_blk");
+
+ return rc;
+}
+
+static void __exit nd_blk_exit(void)
+{
+ driver_unregister(&nd_blk_driver.drv);
+ unregister_blkdev(nd_blk_major, "nd_blk");
+}
+
+MODULE_AUTHOR("Ross Zwisler <[email protected]>");
+MODULE_LICENSE("GPL v2");
+MODULE_ALIAS_ND_DEVICE(ND_DEVICE_NAMESPACE_BLK);
+module_init(nd_blk_init);
+module_exit(nd_blk_exit);
diff --git a/drivers/nvdimm/dimm_devs.c b/drivers/nvdimm/dimm_devs.c
index 83b179ed6d61..c05eb807d674 100644
--- a/drivers/nvdimm/dimm_devs.c
+++ b/drivers/nvdimm/dimm_devs.c
@@ -209,6 +209,15 @@ struct nvdimm *to_nvdimm(struct device *dev)
}
EXPORT_SYMBOL_GPL(to_nvdimm);

+struct nvdimm *nd_blk_region_to_dimm(struct nd_blk_region *ndbr)
+{
+ struct nd_region *nd_region = &ndbr->nd_region;
+ struct nd_mapping *nd_mapping = &nd_region->mapping[0];
+
+ return nd_mapping->nvdimm;
+}
+EXPORT_SYMBOL_GPL(nd_blk_region_to_dimm);
+
struct nvdimm_drvdata *to_ndd(struct nd_mapping *nd_mapping)
{
struct nvdimm *nvdimm = nd_mapping->nvdimm;
diff --git a/drivers/nvdimm/namespace_devs.c b/drivers/nvdimm/namespace_devs.c
index 4aa647c8d644..1ce1e70de44a 100644
--- a/drivers/nvdimm/namespace_devs.c
+++ b/drivers/nvdimm/namespace_devs.c
@@ -173,6 +173,65 @@ static resource_size_t nd_namespace_blk_size(struct nd_namespace_blk *nsblk)
return size;
}

+static bool __nd_namespace_blk_validate(struct nd_namespace_blk *nsblk)
+{
+ struct nd_region *nd_region = to_nd_region(nsblk->common.dev.parent);
+ struct nd_mapping *nd_mapping = &nd_region->mapping[0];
+ struct nvdimm_drvdata *ndd = to_ndd(nd_mapping);
+ struct nd_label_id label_id;
+ struct resource *res;
+ int count, i;
+
+ if (!nsblk->uuid || !nsblk->lbasize || !ndd)
+ return false;
+
+ count = 0;
+ nd_label_gen_id(&label_id, nsblk->uuid, NSLABEL_FLAG_LOCAL);
+ for_each_dpa_resource(ndd, res) {
+ if (strcmp(res->name, label_id.id) != 0)
+ continue;
+ /*
+ * Resources with unacknowledged adjustments indicate a
+ * failure to update labels
+ */
+ if (res->flags & DPA_RESOURCE_ADJUSTED)
+ return false;
+ count++;
+ }
+
+ /* These values match after a successful label update */
+ if (count != nsblk->num_resources)
+ return false;
+
+ for (i = 0; i < nsblk->num_resources; i++) {
+ struct resource *found = NULL;
+
+ for_each_dpa_resource(ndd, res)
+ if (res == nsblk->res[i]) {
+ found = res;
+ break;
+ }
+ /* stale resource */
+ if (!found)
+ return false;
+ }
+
+ return true;
+}
+
+resource_size_t nd_namespace_blk_validate(struct nd_namespace_blk *nsblk)
+{
+ resource_size_t size;
+
+ nvdimm_bus_lock(&nsblk->common.dev);
+ size = __nd_namespace_blk_validate(nsblk);
+ nvdimm_bus_unlock(&nsblk->common.dev);
+
+ return size;
+}
+EXPORT_SYMBOL(nd_namespace_blk_validate);
+
+
static int nd_namespace_label_update(struct nd_region *nd_region,
struct device *dev)
{
@@ -1224,7 +1283,11 @@ struct nd_namespace_common *nvdimm_namespace_common_probe(struct device *dev)
return ERR_PTR(-ENODEV);
}
} else if (is_namespace_blk(&ndns->dev)) {
- return ERR_PTR(-ENODEV); /* TODO */
+ struct nd_namespace_blk *nsblk;
+
+ nsblk = to_nd_namespace_blk(&ndns->dev);
+ if (!nd_namespace_blk_validate(nsblk))
+ return ERR_PTR(-ENODEV);
}

return ndns;
diff --git a/drivers/nvdimm/nd-core.h b/drivers/nvdimm/nd-core.h
index 5e6413964776..e1970c71ad1c 100644
--- a/drivers/nvdimm/nd-core.h
+++ b/drivers/nvdimm/nd-core.h
@@ -43,9 +43,8 @@ struct nvdimm {
};

bool is_nvdimm(struct device *dev);
-bool is_nd_blk(struct device *dev);
bool is_nd_pmem(struct device *dev);
-struct nd_btt;
+bool is_nd_blk(struct device *dev);
struct nvdimm_bus *walk_to_nvdimm_bus(struct device *nd_dev);
int __init nvdimm_bus_init(void);
void nvdimm_bus_exit(void);
diff --git a/drivers/nvdimm/nd.h b/drivers/nvdimm/nd.h
index 78177a76a05d..d7fc6ef4e4d8 100644
--- a/drivers/nvdimm/nd.h
+++ b/drivers/nvdimm/nd.h
@@ -101,6 +101,15 @@ struct nd_region {
struct nd_mapping mapping[0];
};

+struct nd_blk_region {
+ int (*enable)(struct nvdimm_bus *nvdimm_bus, struct device *dev);
+ void (*disable)(struct nvdimm_bus *nvdimm_bus, struct device *dev);
+ int (*do_io)(struct nd_blk_region *ndbr, resource_size_t dpa,
+ void *iobuf, u64 len, int rw);
+ void *blk_provider_data;
+ struct nd_region nd_region;
+};
+
/*
* Lookup next in the repeating sequence of 01, 10, and 11.
*/
@@ -170,8 +179,6 @@ static inline struct device *nd_btt_create(struct nd_region *nd_region)

#endif
struct nd_region *to_nd_region(struct device *dev);
-unsigned int nd_region_acquire_lane(struct nd_region *nd_region);
-void nd_region_release_lane(struct nd_region *nd_region, unsigned int lane);
int nd_region_to_nstype(struct nd_region *nd_region);
int nd_region_register_namespaces(struct nd_region *nd_region, int *err);
u64 nd_region_interleave_set_cookie(struct nd_region *nd_region);
@@ -191,4 +198,6 @@ int nvdimm_namespace_attach_btt(struct nd_namespace_common *ndns);
int nvdimm_namespace_detach_btt(struct nd_namespace_common *ndns);
const char *nvdimm_namespace_disk_name(struct nd_namespace_common *ndns,
char *name);
+int nd_blk_region_init(struct nd_region *nd_region);
+resource_size_t nd_namespace_blk_validate(struct nd_namespace_blk *nsblk);
#endif /* __ND_H__ */
diff --git a/drivers/nvdimm/region.c b/drivers/nvdimm/region.c
index eb8aebcd4800..f28f78ccff19 100644
--- a/drivers/nvdimm/region.c
+++ b/drivers/nvdimm/region.c
@@ -18,11 +18,10 @@

static int nd_region_probe(struct device *dev)
{
- int err;
+ int err, rc;
static unsigned long once;
struct nd_region_namespaces *num_ns;
struct nd_region *nd_region = to_nd_region(dev);
- int rc = nd_region_register_namespaces(nd_region, &err);

if (nd_region->num_lanes > num_online_cpus()
&& nd_region->num_lanes < num_possible_cpus()
@@ -34,6 +33,11 @@ static int nd_region_probe(struct device *dev)
nd_region->num_lanes);
}

+ rc = nd_blk_region_init(nd_region);
+ if (rc)
+ return rc;
+
+ rc = nd_region_register_namespaces(nd_region, &err);
num_ns = devm_kzalloc(dev, sizeof(*num_ns), GFP_KERNEL);
if (!num_ns)
return -ENOMEM;
diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
index abf344651358..fb35d451016f 100644
--- a/drivers/nvdimm/region_devs.c
+++ b/drivers/nvdimm/region_devs.c
@@ -11,6 +11,7 @@
* General Public License for more details.
*/
#include <linux/scatterlist.h>
+#include <linux/highmem.h>
#include <linux/sched.h>
#include <linux/slab.h>
#include <linux/sort.h>
@@ -34,7 +35,10 @@ static void nd_region_release(struct device *dev)
}
free_percpu(nd_region->lane);
ida_simple_remove(&region_ida, nd_region->id);
- kfree(nd_region);
+ if (is_nd_blk(dev))
+ kfree(to_nd_blk_region(dev));
+ else
+ kfree(nd_region);
}

static struct device_type nd_blk_device_type = {
@@ -71,6 +75,33 @@ struct nd_region *to_nd_region(struct device *dev)
}
EXPORT_SYMBOL_GPL(to_nd_region);

+struct nd_blk_region *to_nd_blk_region(struct device *dev)
+{
+ struct nd_region *nd_region = to_nd_region(dev);
+
+ WARN_ON(!is_nd_blk(dev));
+ return container_of(nd_region, struct nd_blk_region, nd_region);
+}
+EXPORT_SYMBOL_GPL(to_nd_blk_region);
+
+void *nd_region_provider_data(struct nd_region *nd_region)
+{
+ return nd_region->provider_data;
+}
+EXPORT_SYMBOL_GPL(nd_region_provider_data);
+
+void *nd_blk_region_provider_data(struct nd_blk_region *ndbr)
+{
+ return ndbr->blk_provider_data;
+}
+EXPORT_SYMBOL_GPL(nd_blk_region_provider_data);
+
+void nd_blk_region_set_provider_data(struct nd_blk_region *ndbr, void *data)
+{
+ ndbr->blk_provider_data = data;
+}
+EXPORT_SYMBOL_GPL(nd_blk_region_set_provider_data);
+
/**
* nd_region_to_nstype() - region to an integer namespace type
* @nd_region: region-device to interrogate
@@ -365,7 +396,8 @@ u64 nd_region_interleave_set_cookie(struct nd_region *nd_region)
/*
* Upon successful probe/remove, take/release a reference on the
* associated interleave set (if present), and plant new btt + namespace
- * seeds.
+ * seeds. Also, on the removal of a BLK region, notify the provider to
+ * disable the region.
*/
static void nd_region_notify_driver_action(struct nvdimm_bus *nvdimm_bus,
struct device *dev, bool probe)
@@ -385,8 +417,14 @@ static void nd_region_notify_driver_action(struct nvdimm_bus *nvdimm_bus,
nd_mapping->labels = NULL;
put_ndd(ndd);
nd_mapping->ndd = NULL;
- atomic_dec(&nvdimm->busy);
+ if (ndd)
+ atomic_dec(&nvdimm->busy);
}
+
+ if (is_nd_pmem(dev))
+ return;
+
+ to_nd_blk_region(dev)->disable(nvdimm_bus, dev);
}
if (dev->parent && is_nd_blk(dev->parent) && probe) {
nd_region = to_nd_region(dev->parent);
@@ -526,11 +564,21 @@ struct attribute_group nd_mapping_attribute_group = {
};
EXPORT_SYMBOL_GPL(nd_mapping_attribute_group);

-void *nd_region_provider_data(struct nd_region *nd_region)
+int nd_blk_region_init(struct nd_region *nd_region)
{
- return nd_region->provider_data;
+ struct device *dev = &nd_region->dev;
+ struct nvdimm_bus *nvdimm_bus = walk_to_nvdimm_bus(dev);
+
+ if (!is_nd_blk(dev))
+ return 0;
+
+ if (nd_region->ndr_mappings < 1) {
+ dev_err(dev, "invalid BLK region\n");
+ return -ENXIO;
+ }
+
+ return to_nd_blk_region(dev)->enable(nvdimm_bus, dev);
}
-EXPORT_SYMBOL_GPL(nd_region_provider_data);

/**
* nd_region_acquire_lane - allocate and lock a lane
@@ -591,6 +639,7 @@ static struct nd_region *nd_region_create(struct nvdimm_bus *nvdimm_bus,
{
struct nd_region *nd_region;
struct device *dev;
+ void *region_buf;
unsigned int i;

for (i = 0; i < ndr_desc->num_mappings; i++) {
@@ -605,10 +654,30 @@ static struct nd_region *nd_region_create(struct nvdimm_bus *nvdimm_bus,
}
}

- nd_region = kzalloc(sizeof(struct nd_region)
- + sizeof(struct nd_mapping) * ndr_desc->num_mappings,
- GFP_KERNEL);
- if (!nd_region)
+ if (dev_type == &nd_blk_device_type) {
+ struct nd_blk_region_desc *ndbr_desc;
+ struct nd_blk_region *ndbr;
+
+ ndbr_desc = to_blk_region_desc(ndr_desc);
+ ndbr = kzalloc(sizeof(*ndbr) + sizeof(struct nd_mapping)
+ * ndr_desc->num_mappings,
+ GFP_KERNEL);
+ if (ndbr) {
+ nd_region = &ndbr->nd_region;
+ ndbr->enable = ndbr_desc->enable;
+ ndbr->disable = ndbr_desc->disable;
+ ndbr->do_io = ndbr_desc->do_io;
+ }
+ region_buf = ndbr;
+ } else {
+ nd_region = kzalloc(sizeof(struct nd_region)
+ + sizeof(struct nd_mapping)
+ * ndr_desc->num_mappings,
+ GFP_KERNEL);
+ region_buf = nd_region;
+ }
+
+ if (!region_buf)
return NULL;
nd_region->id = ida_simple_get(&region_ida, 0, 0, GFP_KERNEL);
if (nd_region->id < 0)
@@ -653,7 +722,7 @@ static struct nd_region *nd_region_create(struct nvdimm_bus *nvdimm_bus,
err_percpu:
ida_simple_remove(&region_ida, nd_region->id);
err_id:
- kfree(nd_region);
+ kfree(region_buf);
return NULL;
}

diff --git a/include/linux/libnvdimm.h b/include/linux/libnvdimm.h
index 531d99dfac68..7fc1b25bdb5d 100644
--- a/include/linux/libnvdimm.h
+++ b/include/linux/libnvdimm.h
@@ -14,6 +14,7 @@
*/
#ifndef __LIBNVDIMM_H__
#define __LIBNVDIMM_H__
+#include <linux/kernel.h>
#include <linux/sizes.h>
#include <linux/types.h>

@@ -89,8 +90,24 @@ struct nd_region_desc {
};

struct nvdimm_bus;
-struct device;
struct module;
+struct device;
+struct nd_blk_region;
+struct nd_blk_region_desc {
+ int (*enable)(struct nvdimm_bus *nvdimm_bus, struct device *dev);
+ void (*disable)(struct nvdimm_bus *nvdimm_bus, struct device *dev);
+ int (*do_io)(struct nd_blk_region *ndbr, resource_size_t dpa,
+ void *iobuf, u64 len, int rw);
+ struct nd_region_desc ndr_desc;
+};
+
+static inline struct nd_blk_region_desc *to_blk_region_desc(
+ struct nd_region_desc *ndr_desc)
+{
+ return container_of(ndr_desc, struct nd_blk_region_desc, ndr_desc);
+
+}
+
struct nvdimm_bus *__nvdimm_bus_register(struct device *parent,
struct nvdimm_bus_descriptor *nfit_desc, struct module *module);
#define nvdimm_bus_register(parent, desc) \
@@ -99,10 +116,10 @@ void nvdimm_bus_unregister(struct nvdimm_bus *nvdimm_bus);
struct nvdimm_bus *to_nvdimm_bus(struct device *dev);
struct nvdimm *to_nvdimm(struct device *dev);
struct nd_region *to_nd_region(struct device *dev);
+struct nd_blk_region *to_nd_blk_region(struct device *dev);
struct nvdimm_bus_descriptor *to_nd_desc(struct nvdimm_bus *nvdimm_bus);
const char *nvdimm_name(struct nvdimm *nvdimm);
void *nvdimm_provider_data(struct nvdimm *nvdimm);
-void *nd_region_provider_data(struct nd_region *nd_region);
struct nvdimm *nvdimm_create(struct nvdimm_bus *nvdimm_bus, void *provider_data,
const struct attribute_group **groups, unsigned long flags,
unsigned long *dsm_mask);
@@ -120,5 +137,11 @@ struct nd_region *nvdimm_blk_region_create(struct nvdimm_bus *nvdimm_bus,
struct nd_region_desc *ndr_desc);
struct nd_region *nvdimm_volatile_region_create(struct nvdimm_bus *nvdimm_bus,
struct nd_region_desc *ndr_desc);
+void *nd_region_provider_data(struct nd_region *nd_region);
+void *nd_blk_region_provider_data(struct nd_blk_region *ndbr);
+void nd_blk_region_set_provider_data(struct nd_blk_region *ndbr, void *data);
+struct nvdimm *nd_blk_region_to_dimm(struct nd_blk_region *ndbr);
+unsigned int nd_region_acquire_lane(struct nd_region *nd_region);
+void nd_region_release_lane(struct nd_region *nd_region, unsigned int lane);
u64 nd_fletcher64(void *addr, size_t len, bool le);
#endif /* __LIBNVDIMM_H__ */

2015-06-25 09:46:51

by Dan Williams

Subject: [PATCH v2 04/17] tools/testing/nvdimm: libnvdimm unit test infrastructure

'libnvdimm' is the first driver sub-system in the kernel to implement
mocking for unit test coverage. The nfit_test module gets built as an
external module and arranges for external module replacements of nfit,
libnvdimm, nd_pmem, and nd_blk. These replacements use the linker
--wrap option to redirect calls to ioremap() + request_mem_region() to
custom defined unit test resources. The end result is a fully
functional nvdimm_bus, as far as userspace is concerned, but with the
capability to perform otherwise destructive tests on emulated resources.
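
As a rough illustration of the redirection described above, here is a
minimal userspace sketch. It is not the nfit_test code. With the linker
option --wrap=ioremap, calls to ioremap() resolve to __wrap_ioremap() and
the original stays reachable as __real_ioremap(). In this sketch both
symbols are defined by hand so that it compiles without special linker
flags. The struct test_resource type, the test_res range, and the
simplified signatures (uint64_t / size_t instead of the kernel's types)
are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical test resource: a fake physical range backed by plain memory. */
struct test_resource {
	uint64_t start;	/* fake physical base address */
	size_t size;	/* length of the range in bytes */
	void *buf;	/* backing buffer standing in for the device */
};

static unsigned char test_buf[4096];
static struct test_resource test_res = {
	.start = 0x100000000ULL,
	.size = sizeof(test_buf),
	.buf = test_buf,
};

/*
 * Stand-in for the original routine. In a --wrap=ioremap build the linker
 * supplies this symbol; here it just reports "no such hardware".
 */
void *__real_ioremap(uint64_t offset, size_t size)
{
	(void)offset;
	(void)size;
	return NULL;
}

/*
 * Replacement reached by callers of ioremap() under --wrap=ioremap.
 * Requests that fall entirely inside the test resource get a pointer into
 * the backing buffer; everything else falls through to the real routine.
 */
void *__wrap_ioremap(uint64_t offset, size_t size)
{
	if (offset >= test_res.start && size <= test_res.size &&
	    offset - test_res.start <= test_res.size - size)
		return (unsigned char *)test_res.buf + (offset - test_res.start);

	return __real_ioremap(offset, size);
}
```

The fall-through to __real_ioremap() is the design point: only ranges the
test registered are emulated, so the rest of the system keeps its normal
behavior.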

Q: Why not use QEMU for this emulation?
QEMU is not suitable for unit testing. QEMU's role is to faithfully
emulate the platform. A unit test's role is to unfaithfully implement
the platform with the goal of triggering bugs in the corners of the
sub-system implementation. As bugs are discovered in platforms, or the
sub-system itself, the unit tests are extended to backstop a fix with a
reproducer unit test.
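
To make "unfaithfully implement the platform" concrete, the following is a
hypothetical fault-injection sketch, again in plain userspace C and not
taken from nfit_test. The mocks fail on a chosen call so that a test can
check that the error-unwind path of the code under test releases what it
acquired. All names here (mock_request_region(), mock_map(),
setup_mapping(), leaks_with_failure_at()) are invented for the example.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

static int fail_at = -1;	/* index of the mocked call to fail, -1 disables */
static int calls;		/* mocked calls made so far */
static int live_regions;	/* regions claimed and not yet released */
static int live_maps;		/* mappings created and not yet torn down */
static char window[64];		/* stands in for a mapped aperture */

static int mock_request_region(void)
{
	if (calls++ == fail_at)
		return -EBUSY;
	live_regions++;
	return 0;
}

static void mock_release_region(void)
{
	live_regions--;
}

static void *mock_map(void)
{
	if (calls++ == fail_at)
		return NULL;
	live_maps++;
	return window;
}

/* Code under test: claim a region, then map it, unwinding on failure. */
static void *setup_mapping(void)
{
	void *p;

	if (mock_request_region())
		return NULL;

	p = mock_map();
	if (!p) {
		mock_release_region();
		return NULL;
	}

	return p;
}

/*
 * Inject a failure at mocked call n and report how many resources leaked.
 * Returns -1 if setup unexpectedly succeeded (no failure was hit).
 */
int leaks_with_failure_at(int n)
{
	fail_at = n;
	calls = 0;
	live_regions = 0;
	live_maps = 0;

	if (setup_mapping())
		return -1;

	return live_regions + live_maps;
}
```

A test would call leaks_with_failure_at() once per possible failure point
and expect zero leaks each time. That kind of coverage is hard to get from
a faithful platform emulation.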

Another problem with QEMU is that it would require coordination of 3
software projects instead of 2 (kernel + libndctl [1]) to maintain and
execute the tests. The chances for bit rot and the difficulty of
getting the tests running go up non-linearly as more components are
involved.


Q: Why submit this to the kernel tree instead of external modules in
libndctl?
Simple, to alleviate the same risk that out-of-tree external modules
face. Updates to drivers/nvdimm/ can be immediately evaluated to see if
they have any impact on tools/testing/nvdimm/.


Q: What are the negative implications of merging this?
It is a unique maintenance burden because the purpose of mocking an
interface to enable a unit test is to purposefully short circuit the
semantics of a routine to enable testing. For example,
__wrap_ioremap_cache() fakes the pmem driver into "ioremap()'ing" a test
resource buffer allocated by dma_alloc_coherent(). The future
maintenance burden hits when someone changes the semantics of
ioremap_cache() and wonders what the implications are for the unit test.

[1]: https://github.com/pmem/ndctl

Cc: <[email protected]>
Cc: Lv Zheng <[email protected]>
Cc: Robert Moore <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
drivers/acpi/nfit.c | 12
drivers/acpi/nfit.h | 6
tools/testing/nvdimm/Kbuild | 40 +
tools/testing/nvdimm/Makefile | 7
tools/testing/nvdimm/config_check.c | 15
tools/testing/nvdimm/test/Kbuild | 8
tools/testing/nvdimm/test/iomap.c | 151 ++++
tools/testing/nvdimm/test/nfit.c | 1113 +++++++++++++++++++++++++++++++++
tools/testing/nvdimm/test/nfit_test.h | 29 +
9 files changed, 1377 insertions(+), 4 deletions(-)
create mode 100644 tools/testing/nvdimm/Kbuild
create mode 100644 tools/testing/nvdimm/Makefile
create mode 100644 tools/testing/nvdimm/config_check.c
create mode 100644 tools/testing/nvdimm/test/Kbuild
create mode 100644 tools/testing/nvdimm/test/iomap.c
create mode 100644 tools/testing/nvdimm/test/nfit.c
create mode 100644 tools/testing/nvdimm/test/nfit_test.h

diff --git a/drivers/acpi/nfit.c b/drivers/acpi/nfit.c
index 2aa6aa97b40c..07d630e9f4ae 100644
--- a/drivers/acpi/nfit.c
+++ b/drivers/acpi/nfit.c
@@ -33,10 +33,11 @@ MODULE_PARM_DESC(force_enable_dimms, "Ignore _STA (ACPI DIMM device) status");

static u8 nfit_uuid[NFIT_UUID_MAX][16];

-static const u8 *to_nfit_uuid(enum nfit_uuids id)
+const u8 *to_nfit_uuid(enum nfit_uuids id)
{
return nfit_uuid[id];
}
+EXPORT_SYMBOL(to_nfit_uuid);

static struct acpi_nfit_desc *to_acpi_nfit_desc(
struct nvdimm_bus_descriptor *nd_desc)
@@ -581,11 +582,12 @@ static struct attribute_group acpi_nfit_attribute_group = {
.attrs = acpi_nfit_attributes,
};

-static const struct attribute_group *acpi_nfit_attribute_groups[] = {
+const struct attribute_group *acpi_nfit_attribute_groups[] = {
&nvdimm_bus_attribute_group,
&acpi_nfit_attribute_group,
NULL,
};
+EXPORT_SYMBOL_GPL(acpi_nfit_attribute_groups);

static struct acpi_nfit_memory_map *to_nfit_memdev(struct device *dev)
{
@@ -1323,7 +1325,7 @@ static int acpi_nfit_init_mapping(struct acpi_nfit_desc *acpi_desc,
ndbr_desc = to_blk_region_desc(ndr_desc);
ndbr_desc->enable = acpi_nfit_blk_region_enable;
ndbr_desc->disable = acpi_nfit_blk_region_disable;
- ndbr_desc->do_io = acpi_nfit_blk_region_do_io;
+ ndbr_desc->do_io = acpi_desc->blk_do_io;
if (!nvdimm_blk_region_create(acpi_desc->nvdimm_bus, ndr_desc))
return -ENOMEM;
break;
@@ -1407,7 +1409,7 @@ static int acpi_nfit_register_regions(struct acpi_nfit_desc *acpi_desc)
return 0;
}

-static int acpi_nfit_init(struct acpi_nfit_desc *acpi_desc, acpi_size sz)
+int acpi_nfit_init(struct acpi_nfit_desc *acpi_desc, acpi_size sz)
{
struct device *dev = acpi_desc->dev;
const void *end;
@@ -1446,6 +1448,7 @@ static int acpi_nfit_init(struct acpi_nfit_desc *acpi_desc, acpi_size sz)

return acpi_nfit_register_regions(acpi_desc);
}
+EXPORT_SYMBOL_GPL(acpi_nfit_init);

static int acpi_nfit_add(struct acpi_device *adev)
{
@@ -1470,6 +1473,7 @@ static int acpi_nfit_add(struct acpi_device *adev)
dev_set_drvdata(dev, acpi_desc);
acpi_desc->dev = dev;
acpi_desc->nfit = (struct acpi_table_nfit *) tbl;
+ acpi_desc->blk_do_io = acpi_nfit_blk_region_do_io;
nd_desc = &acpi_desc->nd_desc;
nd_desc->provider_name = "ACPI.NFIT";
nd_desc->ndctl = acpi_nfit_ctl;
diff --git a/drivers/acpi/nfit.h b/drivers/acpi/nfit.h
index 7bd38b7baf39..c62fffea8423 100644
--- a/drivers/acpi/nfit.h
+++ b/drivers/acpi/nfit.h
@@ -93,6 +93,8 @@ struct acpi_nfit_desc {
struct nvdimm_bus *nvdimm_bus;
struct device *dev;
unsigned long dimm_dsm_force_en;
+ int (*blk_do_io)(struct nd_blk_region *ndbr, resource_size_t dpa,
+ void *iobuf, u64 len, int rw);
};

enum nd_blk_mmio_selector {
@@ -146,4 +148,8 @@ static inline struct acpi_nfit_desc *to_acpi_desc(
{
return container_of(nd_desc, struct acpi_nfit_desc, nd_desc);
}
+
+const u8 *to_nfit_uuid(enum nfit_uuids id);
+int acpi_nfit_init(struct acpi_nfit_desc *nfit, acpi_size sz);
+extern const struct attribute_group *acpi_nfit_attribute_groups[];
#endif /* __NFIT_H__ */
diff --git a/tools/testing/nvdimm/Kbuild b/tools/testing/nvdimm/Kbuild
new file mode 100644
index 000000000000..8e9b64520ec1
--- /dev/null
+++ b/tools/testing/nvdimm/Kbuild
@@ -0,0 +1,40 @@
+ldflags-y += --wrap=ioremap_cache
+ldflags-y += --wrap=ioremap_nocache
+ldflags-y += --wrap=iounmap
+ldflags-y += --wrap=__request_region
+ldflags-y += --wrap=__release_region
+
+DRIVERS := ../../../drivers
+NVDIMM_SRC := $(DRIVERS)/nvdimm
+ACPI_SRC := $(DRIVERS)/acpi
+
+obj-$(CONFIG_LIBNVDIMM) += libnvdimm.o
+obj-$(CONFIG_BLK_DEV_PMEM) += nd_pmem.o
+obj-$(CONFIG_ND_BTT) += nd_btt.o
+obj-$(CONFIG_ND_BLK) += nd_blk.o
+obj-$(CONFIG_ACPI_NFIT) += nfit.o
+
+nfit-y := $(ACPI_SRC)/nfit.o
+nfit-y += config_check.o
+
+nd_pmem-y := $(NVDIMM_SRC)/pmem.o
+nd_pmem-y += config_check.o
+
+nd_btt-y := $(NVDIMM_SRC)/btt.o
+nd_btt-y += config_check.o
+
+nd_blk-y := $(NVDIMM_SRC)/blk.o
+nd_blk-y += config_check.o
+
+libnvdimm-y := $(NVDIMM_SRC)/core.o
+libnvdimm-y += $(NVDIMM_SRC)/bus.o
+libnvdimm-y += $(NVDIMM_SRC)/dimm_devs.o
+libnvdimm-y += $(NVDIMM_SRC)/dimm.o
+libnvdimm-y += $(NVDIMM_SRC)/region_devs.o
+libnvdimm-y += $(NVDIMM_SRC)/region.o
+libnvdimm-y += $(NVDIMM_SRC)/namespace_devs.o
+libnvdimm-y += $(NVDIMM_SRC)/label.o
+libnvdimm-$(CONFIG_BTT) += $(NVDIMM_SRC)/btt_devs.o
+libnvdimm-y += config_check.o
+
+obj-m += test/
diff --git a/tools/testing/nvdimm/Makefile b/tools/testing/nvdimm/Makefile
new file mode 100644
index 000000000000..3dfe024b4e7e
--- /dev/null
+++ b/tools/testing/nvdimm/Makefile
@@ -0,0 +1,7 @@
+KDIR ?= ../../../
+
+default:
+ $(MAKE) -C $(KDIR) M=$$PWD
+
+install: default
+ $(MAKE) -C $(KDIR) M=$$PWD modules_install
diff --git a/tools/testing/nvdimm/config_check.c b/tools/testing/nvdimm/config_check.c
new file mode 100644
index 000000000000..f2c7615554eb
--- /dev/null
+++ b/tools/testing/nvdimm/config_check.c
@@ -0,0 +1,15 @@
+#include <linux/kconfig.h>
+#include <linux/bug.h>
+
+void check(void)
+{
+ /*
+ * These kconfig symbols must be set to "m" for nfit_test to
+ * load and operate.
+ */
+ BUILD_BUG_ON(!IS_MODULE(CONFIG_LIBNVDIMM));
+ BUILD_BUG_ON(!IS_MODULE(CONFIG_BLK_DEV_PMEM));
+ BUILD_BUG_ON(!IS_MODULE(CONFIG_ND_BTT));
+ BUILD_BUG_ON(!IS_MODULE(CONFIG_ND_BLK));
+ BUILD_BUG_ON(!IS_MODULE(CONFIG_ACPI_NFIT));
+}
diff --git a/tools/testing/nvdimm/test/Kbuild b/tools/testing/nvdimm/test/Kbuild
new file mode 100644
index 000000000000..9241064970fe
--- /dev/null
+++ b/tools/testing/nvdimm/test/Kbuild
@@ -0,0 +1,8 @@
+ccflags-y := -I$(src)/../../../../drivers/nvdimm/
+ccflags-y += -I$(src)/../../../../drivers/acpi/
+
+obj-m += nfit_test.o
+obj-m += nfit_test_iomap.o
+
+nfit_test-y := nfit.o
+nfit_test_iomap-y := iomap.o
diff --git a/tools/testing/nvdimm/test/iomap.c b/tools/testing/nvdimm/test/iomap.c
new file mode 100644
index 000000000000..c85a6f6ba559
--- /dev/null
+++ b/tools/testing/nvdimm/test/iomap.c
@@ -0,0 +1,151 @@
+/*
+ * Copyright(c) 2013-2015 Intel Corporation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ */
+#include <linux/rculist.h>
+#include <linux/export.h>
+#include <linux/ioport.h>
+#include <linux/module.h>
+#include <linux/types.h>
+#include <linux/io.h>
+#include "nfit_test.h"
+
+static LIST_HEAD(iomap_head);
+
+static struct iomap_ops {
+ nfit_test_lookup_fn nfit_test_lookup;
+ struct list_head list;
+} iomap_ops = {
+ .list = LIST_HEAD_INIT(iomap_ops.list),
+};
+
+void nfit_test_setup(nfit_test_lookup_fn lookup)
+{
+ iomap_ops.nfit_test_lookup = lookup;
+ list_add_rcu(&iomap_ops.list, &iomap_head);
+}
+EXPORT_SYMBOL(nfit_test_setup);
+
+void nfit_test_teardown(void)
+{
+ list_del_rcu(&iomap_ops.list);
+ synchronize_rcu();
+}
+EXPORT_SYMBOL(nfit_test_teardown);
+
+static struct nfit_test_resource *get_nfit_res(resource_size_t resource)
+{
+ struct iomap_ops *ops;
+
+ ops = list_first_or_null_rcu(&iomap_head, typeof(*ops), list);
+ if (ops)
+ return ops->nfit_test_lookup(resource);
+ return NULL;
+}
+
+void __iomem *__nfit_test_ioremap(resource_size_t offset, unsigned long size,
+ void __iomem *(*fallback_fn)(resource_size_t, unsigned long))
+{
+ struct nfit_test_resource *nfit_res;
+
+ rcu_read_lock();
+ nfit_res = get_nfit_res(offset);
+ rcu_read_unlock();
+ if (nfit_res)
+ return (void __iomem *) nfit_res->buf + offset
+ - nfit_res->res->start;
+ return fallback_fn(offset, size);
+}
+
+void __iomem *__wrap_ioremap_cache(resource_size_t offset, unsigned long size)
+{
+ return __nfit_test_ioremap(offset, size, ioremap_cache);
+}
+EXPORT_SYMBOL(__wrap_ioremap_cache);
+
+void __iomem *__wrap_ioremap_nocache(resource_size_t offset, unsigned long size)
+{
+ return __nfit_test_ioremap(offset, size, ioremap_nocache);
+}
+EXPORT_SYMBOL(__wrap_ioremap_nocache);
+
+void __wrap_iounmap(volatile void __iomem *addr)
+{
+ struct nfit_test_resource *nfit_res;
+
+ rcu_read_lock();
+ nfit_res = get_nfit_res((unsigned long) addr);
+ rcu_read_unlock();
+ if (nfit_res)
+ return;
+ return iounmap(addr);
+}
+EXPORT_SYMBOL(__wrap_iounmap);
+
+struct resource *__wrap___request_region(struct resource *parent,
+ resource_size_t start, resource_size_t n, const char *name,
+ int flags)
+{
+ struct nfit_test_resource *nfit_res;
+
+ if (parent == &iomem_resource) {
+ rcu_read_lock();
+ nfit_res = get_nfit_res(start);
+ rcu_read_unlock();
+ if (nfit_res) {
+ struct resource *res = nfit_res->res + 1;
+
+ if (start + n > nfit_res->res->start
+ + resource_size(nfit_res->res)) {
+ pr_debug("%s: start: %llx n: %llx overflow: %pr\n",
+ __func__, start, n,
+ nfit_res->res);
+ return NULL;
+ }
+
+ res->start = start;
+ res->end = start + n - 1;
+ res->name = name;
+ res->flags = resource_type(parent);
+ res->flags |= IORESOURCE_BUSY | flags;
+ pr_debug("%s: %pr\n", __func__, res);
+ return res;
+ }
+ }
+ return __request_region(parent, start, n, name, flags);
+}
+EXPORT_SYMBOL(__wrap___request_region);
+
+void __wrap___release_region(struct resource *parent, resource_size_t start,
+ resource_size_t n)
+{
+ struct nfit_test_resource *nfit_res;
+
+ if (parent == &iomem_resource) {
+ rcu_read_lock();
+ nfit_res = get_nfit_res(start);
+ rcu_read_unlock();
+ if (nfit_res) {
+ struct resource *res = nfit_res->res + 1;
+
+ if (start != res->start || resource_size(res) != n)
+ pr_info("%s: start: %llx n: %llx mismatch: %pr\n",
+ __func__, start, n, res);
+ else
+ memset(res, 0, sizeof(*res));
+ return;
+ }
+ }
+ __release_region(parent, start, n);
+}
+EXPORT_SYMBOL(__wrap___release_region);
+
+MODULE_LICENSE("GPL v2");
diff --git a/tools/testing/nvdimm/test/nfit.c b/tools/testing/nvdimm/test/nfit.c
new file mode 100644
index 000000000000..7a4a5a5edbe4
--- /dev/null
+++ b/tools/testing/nvdimm/test/nfit.c
@@ -0,0 +1,1113 @@
+/*
+ * Copyright(c) 2013-2015 Intel Corporation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ */
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+#include <linux/platform_device.h>
+#include <linux/dma-mapping.h>
+#include <linux/libnvdimm.h>
+#include <linux/vmalloc.h>
+#include <linux/device.h>
+#include <linux/module.h>
+#include <linux/ndctl.h>
+#include <linux/sizes.h>
+#include <linux/slab.h>
+#include <nfit.h>
+#include <nd.h>
+#include "nfit_test.h"
+
+/*
+ * Generate an NFIT table to describe the following topology:
+ *
+ * BUS0: Interleaved PMEM regions, and aliasing with BLK regions
+ *
+ * (a) (b) DIMM BLK-REGION
+ * +----------+--------------+----------+---------+
+ * +------+ | blk2.0 | pm0.0 | blk2.1 | pm1.0 | 0 region2
+ * | imc0 +--+- - - - - region0 - - - -+----------+ +
+ * +--+---+ | blk3.0 | pm0.0 | blk3.1 | pm1.0 | 1 region3
+ * | +----------+--------------v----------v v
+ * +--+---+ | |
+ * | cpu0 | region1
+ * +--+---+ | |
+ * | +-------------------------^----------^ ^
+ * +--+---+ | blk4.0 | pm1.0 | 2 region4
+ * | imc1 +--+-------------------------+----------+ +
+ * +------+ | blk5.0 | pm1.0 | 3 region5
+ * +-------------------------+----------+-+-------+
+ *
+ * *) In this layout we have four dimms and two memory controllers in one
+ * socket. Each unique interface (BLK or PMEM) to DPA space
+ * is identified by a region device with a dynamically assigned id.
+ *
+ * *) The first portion of dimm0 and dimm1 are interleaved as REGION0.
+ * A single PMEM namespace "pm0.0" is created using half of the
+ * REGION0 SPA-range. REGION0 spans dimm0 and dimm1. PMEM namespaces
+ * allocate from the bottom of a region. The unallocated
+ * portion of REGION0 aliases with REGION2 and REGION3. That
+ * unallocated capacity is reclaimed as BLK namespaces ("blk2.0" and
+ * "blk3.0") starting at the base of each DIMM to offset (a) in those
+ * DIMMs. "pm0.0", "blk2.0" and "blk3.0" are free-form readable
+ * names that can be assigned to a namespace.
+ *
+ * *) In the last portion of dimm0 and dimm1 we have an interleaved
+ * SPA range, REGION1, that spans those two dimms as well as dimm2
+ * and dimm3. Some of REGION1 is allocated to a PMEM namespace named
+ * "pm1.0"; the rest is reclaimed in 4 BLK namespaces (one for each
+ * dimm in the interleave set), "blk2.1", "blk3.1", "blk4.0", and
+ * "blk5.0".
+ *
+ * *) The portions of dimm2 and dimm3 that do not participate in the
+ * REGION1 interleaved SPA range (i.e. the DPA addresses below offset
+ * (b)) are also included in the "blk4.0" and "blk5.0" namespaces.
+ * Note that BLK namespaces need not be contiguous in DPA-space, and
+ * can consume aliased capacity from multiple interleave sets.
+ *
+ * BUS1: Legacy NVDIMM (single contiguous range)
+ *
+ * region2
+ * +---------------------+
+ * |---------------------|
+ * || pm2.0 ||
+ * |---------------------|
+ * +---------------------+
+ *
+ * *) An NFIT table may describe a simple system-physical-address range
+ * with no BLK aliasing. This type of region may optionally
+ * reference an NVDIMM.
+ */
+enum {
+ NUM_PM = 2,
+ NUM_DCR = 4,
+ NUM_BDW = NUM_DCR,
+ NUM_SPA = NUM_PM + NUM_DCR + NUM_BDW,
+ NUM_MEM = NUM_DCR + NUM_BDW + 2 /* spa0 iset */ + 4 /* spa1 iset */,
+ DIMM_SIZE = SZ_32M,
+ LABEL_SIZE = SZ_128K,
+ SPA0_SIZE = DIMM_SIZE,
+ SPA1_SIZE = DIMM_SIZE*2,
+ SPA2_SIZE = DIMM_SIZE,
+ BDW_SIZE = 64 << 8,
+ DCR_SIZE = 12,
+ NUM_NFITS = 2, /* permit testing multiple NFITs per system */
+};
+
+struct nfit_test_dcr {
+ __le64 bdw_addr;
+ __le32 bdw_status;
+ __u8 aperature[BDW_SIZE];
+};
+
+#define NFIT_DIMM_HANDLE(node, socket, imc, chan, dimm) \
+ (((node & 0xfff) << 16) | ((socket & 0xf) << 12) \
+ | ((imc & 0xf) << 8) | ((chan & 0xf) << 4) | (dimm & 0xf))
+
+static u32 handle[NUM_DCR] = {
+ [0] = NFIT_DIMM_HANDLE(0, 0, 0, 0, 0),
+ [1] = NFIT_DIMM_HANDLE(0, 0, 0, 0, 1),
+ [2] = NFIT_DIMM_HANDLE(0, 0, 1, 0, 0),
+ [3] = NFIT_DIMM_HANDLE(0, 0, 1, 0, 1),
+};
+
+struct nfit_test {
+ struct acpi_nfit_desc acpi_desc;
+ struct platform_device pdev;
+ struct list_head resources;
+ void *nfit_buf;
+ dma_addr_t nfit_dma;
+ size_t nfit_size;
+ int num_dcr;
+ int num_pm;
+ void **dimm;
+ dma_addr_t *dimm_dma;
+ void **label;
+ dma_addr_t *label_dma;
+ void **spa_set;
+ dma_addr_t *spa_set_dma;
+ struct nfit_test_dcr **dcr;
+ dma_addr_t *dcr_dma;
+ int (*alloc)(struct nfit_test *t);
+ void (*setup)(struct nfit_test *t);
+};
+
+static struct nfit_test *to_nfit_test(struct device *dev)
+{
+ struct platform_device *pdev = to_platform_device(dev);
+
+ return container_of(pdev, struct nfit_test, pdev);
+}
+
+static int nfit_test_ctl(struct nvdimm_bus_descriptor *nd_desc,
+ struct nvdimm *nvdimm, unsigned int cmd, void *buf,
+ unsigned int buf_len)
+{
+ struct acpi_nfit_desc *acpi_desc = to_acpi_desc(nd_desc);
+ struct nfit_test *t = container_of(acpi_desc, typeof(*t), acpi_desc);
+ struct nfit_mem *nfit_mem = nvdimm_provider_data(nvdimm);
+ int i, rc;
+
+ if (!nfit_mem || !test_bit(cmd, &nfit_mem->dsm_mask))
+ return -ENXIO;
+
+ /* lookup label space for the given dimm */
+ for (i = 0; i < ARRAY_SIZE(handle); i++)
+ if (__to_nfit_memdev(nfit_mem)->device_handle == handle[i])
+ break;
+ if (i >= ARRAY_SIZE(handle))
+ return -ENXIO;
+
+ switch (cmd) {
+ case ND_CMD_GET_CONFIG_SIZE: {
+ struct nd_cmd_get_config_size *nd_cmd = buf;
+
+ if (buf_len < sizeof(*nd_cmd))
+ return -EINVAL;
+ nd_cmd->status = 0;
+ nd_cmd->config_size = LABEL_SIZE;
+ nd_cmd->max_xfer = SZ_4K;
+ rc = 0;
+ break;
+ }
+ case ND_CMD_GET_CONFIG_DATA: {
+ struct nd_cmd_get_config_data_hdr *nd_cmd = buf;
+ unsigned int len, offset = nd_cmd->in_offset;
+
+ if (buf_len < sizeof(*nd_cmd))
+ return -EINVAL;
+ if (offset >= LABEL_SIZE)
+ return -EINVAL;
+ if (nd_cmd->in_length + sizeof(*nd_cmd) > buf_len)
+ return -EINVAL;
+
+ nd_cmd->status = 0;
+ len = min(nd_cmd->in_length, LABEL_SIZE - offset);
+ memcpy(nd_cmd->out_buf, t->label[i] + offset, len);
+ rc = buf_len - sizeof(*nd_cmd) - len;
+ break;
+ }
+ case ND_CMD_SET_CONFIG_DATA: {
+ struct nd_cmd_set_config_hdr *nd_cmd = buf;
+ unsigned int len, offset = nd_cmd->in_offset;
+ u32 *status;
+
+ if (buf_len < sizeof(*nd_cmd))
+ return -EINVAL;
+ if (offset >= LABEL_SIZE)
+ return -EINVAL;
+ if (nd_cmd->in_length + sizeof(*nd_cmd) + 4 > buf_len)
+ return -EINVAL;
+
+ status = buf + nd_cmd->in_length + sizeof(*nd_cmd);
+ *status = 0;
+ len = min(nd_cmd->in_length, LABEL_SIZE - offset);
+ memcpy(t->label[i] + offset, nd_cmd->in_buf, len);
+ rc = buf_len - sizeof(*nd_cmd) - (len + 4);
+ break;
+ }
+ default:
+ return -ENOTTY;
+ }
+
+ return rc;
+}
+
+static DEFINE_SPINLOCK(nfit_test_lock);
+static struct nfit_test *instances[NUM_NFITS];
+
+static void release_nfit_res(void *data)
+{
+ struct nfit_test_resource *nfit_res = data;
+ struct resource *res = nfit_res->res;
+
+ spin_lock(&nfit_test_lock);
+ list_del(&nfit_res->list);
+ spin_unlock(&nfit_test_lock);
+
+ if (is_vmalloc_addr(nfit_res->buf))
+ vfree(nfit_res->buf);
+ else
+ dma_free_coherent(nfit_res->dev, resource_size(res),
+ nfit_res->buf, res->start);
+ kfree(res);
+ kfree(nfit_res);
+}
+
+static void *__test_alloc(struct nfit_test *t, size_t size, dma_addr_t *dma,
+ void *buf)
+{
+ struct device *dev = &t->pdev.dev;
+ struct resource *res = kzalloc(sizeof(*res) * 2, GFP_KERNEL);
+ struct nfit_test_resource *nfit_res = kzalloc(sizeof(*nfit_res),
+ GFP_KERNEL);
+ int rc;
+
+ if (!res || !buf || !nfit_res)
+ goto err;
+ rc = devm_add_action(dev, release_nfit_res, nfit_res);
+ if (rc)
+ goto err;
+ INIT_LIST_HEAD(&nfit_res->list);
+ memset(buf, 0, size);
+ nfit_res->dev = dev;
+ nfit_res->buf = buf;
+ nfit_res->res = res;
+ res->start = *dma;
+ res->end = *dma + size - 1;
+ res->name = "NFIT";
+ spin_lock(&nfit_test_lock);
+ list_add(&nfit_res->list, &t->resources);
+ spin_unlock(&nfit_test_lock);
+
+ return nfit_res->buf;
+ err:
+ if (buf && !is_vmalloc_addr(buf))
+ dma_free_coherent(dev, size, buf, *dma);
+ else if (buf)
+ vfree(buf);
+ kfree(res);
+ kfree(nfit_res);
+ return NULL;
+}
+
+static void *test_alloc(struct nfit_test *t, size_t size, dma_addr_t *dma)
+{
+ void *buf = vmalloc(size);
+
+ *dma = (unsigned long) buf;
+ return __test_alloc(t, size, dma, buf);
+}
+
+static void *test_alloc_coherent(struct nfit_test *t, size_t size,
+ dma_addr_t *dma)
+{
+ struct device *dev = &t->pdev.dev;
+ void *buf = dma_alloc_coherent(dev, size, dma, GFP_KERNEL);
+
+ return __test_alloc(t, size, dma, buf);
+}
+
+static struct nfit_test_resource *nfit_test_lookup(resource_size_t addr)
+{
+ int i;
+
+ for (i = 0; i < ARRAY_SIZE(instances); i++) {
+ struct nfit_test_resource *n, *nfit_res = NULL;
+ struct nfit_test *t = instances[i];
+
+ if (!t)
+ continue;
+ spin_lock(&nfit_test_lock);
+ list_for_each_entry(n, &t->resources, list) {
+ if (addr >= n->res->start && (addr < n->res->start
+ + resource_size(n->res))) {
+ nfit_res = n;
+ break;
+ } else if (addr >= (unsigned long) n->buf
+ && (addr < (unsigned long) n->buf
+ + resource_size(n->res))) {
+ nfit_res = n;
+ break;
+ }
+ }
+ spin_unlock(&nfit_test_lock);
+ if (nfit_res)
+ return nfit_res;
+ }
+
+ return NULL;
+}
+
+static int nfit_test0_alloc(struct nfit_test *t)
+{
+ size_t nfit_size = sizeof(struct acpi_table_nfit)
+ + sizeof(struct acpi_nfit_system_address) * NUM_SPA
+ + sizeof(struct acpi_nfit_memory_map) * NUM_MEM
+ + sizeof(struct acpi_nfit_control_region) * NUM_DCR
+ + sizeof(struct acpi_nfit_data_region) * NUM_BDW;
+ int i;
+
+ t->nfit_buf = test_alloc(t, nfit_size, &t->nfit_dma);
+ if (!t->nfit_buf)
+ return -ENOMEM;
+ t->nfit_size = nfit_size;
+
+ t->spa_set[0] = test_alloc_coherent(t, SPA0_SIZE, &t->spa_set_dma[0]);
+ if (!t->spa_set[0])
+ return -ENOMEM;
+
+ t->spa_set[1] = test_alloc_coherent(t, SPA1_SIZE, &t->spa_set_dma[1]);
+ if (!t->spa_set[1])
+ return -ENOMEM;
+
+ for (i = 0; i < NUM_DCR; i++) {
+ t->dimm[i] = test_alloc(t, DIMM_SIZE, &t->dimm_dma[i]);
+ if (!t->dimm[i])
+ return -ENOMEM;
+
+ t->label[i] = test_alloc(t, LABEL_SIZE, &t->label_dma[i]);
+ if (!t->label[i])
+ return -ENOMEM;
+ sprintf(t->label[i], "label%d", i);
+ }
+
+ for (i = 0; i < NUM_DCR; i++) {
+ t->dcr[i] = test_alloc(t, LABEL_SIZE, &t->dcr_dma[i]);
+ if (!t->dcr[i])
+ return -ENOMEM;
+ }
+
+ return 0;
+}
+
+static int nfit_test1_alloc(struct nfit_test *t)
+{
+ size_t nfit_size = sizeof(struct acpi_table_nfit)
+ + sizeof(struct acpi_nfit_system_address)
+ + sizeof(struct acpi_nfit_memory_map)
+ + sizeof(struct acpi_nfit_control_region);
+
+ t->nfit_buf = test_alloc(t, nfit_size, &t->nfit_dma);
+ if (!t->nfit_buf)
+ return -ENOMEM;
+ t->nfit_size = nfit_size;
+
+ t->spa_set[0] = test_alloc_coherent(t, SPA2_SIZE, &t->spa_set_dma[0]);
+ if (!t->spa_set[0])
+ return -ENOMEM;
+
+ return 0;
+}
+
+static void nfit_test_init_header(struct acpi_table_nfit *nfit, size_t size)
+{
+ memcpy(nfit->header.signature, ACPI_SIG_NFIT, 4);
+ nfit->header.length = size;
+ nfit->header.revision = 1;
+ memcpy(nfit->header.oem_id, "LIBND", 6);
+ memcpy(nfit->header.oem_table_id, "TEST", 5);
+ nfit->header.oem_revision = 1;
+ memcpy(nfit->header.asl_compiler_id, "TST", 4);
+ nfit->header.asl_compiler_revision = 1;
+}
+
+static void nfit_test0_setup(struct nfit_test *t)
+{
+ struct nvdimm_bus_descriptor *nd_desc;
+ struct acpi_nfit_desc *acpi_desc;
+ struct acpi_nfit_memory_map *memdev;
+ void *nfit_buf = t->nfit_buf;
+ size_t size = t->nfit_size;
+ struct acpi_nfit_system_address *spa;
+ struct acpi_nfit_control_region *dcr;
+ struct acpi_nfit_data_region *bdw;
+ unsigned int offset;
+
+ nfit_test_init_header(nfit_buf, size);
+
+ /*
+ * spa0 (interleave first half of dimm0 and dimm1, note storage
+ * does not actually alias the related block-data-window
+ * regions)
+ */
+ spa = nfit_buf + sizeof(struct acpi_table_nfit);
+ spa->header.type = ACPI_NFIT_TYPE_SYSTEM_ADDRESS;
+ spa->header.length = sizeof(*spa);
+ memcpy(spa->range_guid, to_nfit_uuid(NFIT_SPA_PM), 16);
+ spa->range_index = 0+1;
+ spa->address = t->spa_set_dma[0];
+ spa->length = SPA0_SIZE;
+
+ /*
+ * spa1 (interleave last half of the 4 DIMMS, note storage
+ * does not actually alias the related block-data-window
+ * regions)
+ */
+ spa = nfit_buf + sizeof(struct acpi_table_nfit) + sizeof(*spa);
+ spa->header.type = ACPI_NFIT_TYPE_SYSTEM_ADDRESS;
+ spa->header.length = sizeof(*spa);
+ memcpy(spa->range_guid, to_nfit_uuid(NFIT_SPA_PM), 16);
+ spa->range_index = 1+1;
+ spa->address = t->spa_set_dma[1];
+ spa->length = SPA1_SIZE;
+
+ /* spa2 (dcr0) dimm0 */
+ spa = nfit_buf + sizeof(struct acpi_table_nfit) + sizeof(*spa) * 2;
+ spa->header.type = ACPI_NFIT_TYPE_SYSTEM_ADDRESS;
+ spa->header.length = sizeof(*spa);
+ memcpy(spa->range_guid, to_nfit_uuid(NFIT_SPA_DCR), 16);
+ spa->range_index = 2+1;
+ spa->address = t->dcr_dma[0];
+ spa->length = DCR_SIZE;
+
+ /* spa3 (dcr1) dimm1 */
+ spa = nfit_buf + sizeof(struct acpi_table_nfit) + sizeof(*spa) * 3;
+ spa->header.type = ACPI_NFIT_TYPE_SYSTEM_ADDRESS;
+ spa->header.length = sizeof(*spa);
+ memcpy(spa->range_guid, to_nfit_uuid(NFIT_SPA_DCR), 16);
+ spa->range_index = 3+1;
+ spa->address = t->dcr_dma[1];
+ spa->length = DCR_SIZE;
+
+ /* spa4 (dcr2) dimm2 */
+ spa = nfit_buf + sizeof(struct acpi_table_nfit) + sizeof(*spa) * 4;
+ spa->header.type = ACPI_NFIT_TYPE_SYSTEM_ADDRESS;
+ spa->header.length = sizeof(*spa);
+ memcpy(spa->range_guid, to_nfit_uuid(NFIT_SPA_DCR), 16);
+ spa->range_index = 4+1;
+ spa->address = t->dcr_dma[2];
+ spa->length = DCR_SIZE;
+
+ /* spa5 (dcr3) dimm3 */
+ spa = nfit_buf + sizeof(struct acpi_table_nfit) + sizeof(*spa) * 5;
+ spa->header.type = ACPI_NFIT_TYPE_SYSTEM_ADDRESS;
+ spa->header.length = sizeof(*spa);
+ memcpy(spa->range_guid, to_nfit_uuid(NFIT_SPA_DCR), 16);
+ spa->range_index = 5+1;
+ spa->address = t->dcr_dma[3];
+ spa->length = DCR_SIZE;
+
+ /* spa6 (bdw for dcr0) dimm0 */
+ spa = nfit_buf + sizeof(struct acpi_table_nfit) + sizeof(*spa) * 6;
+ spa->header.type = ACPI_NFIT_TYPE_SYSTEM_ADDRESS;
+ spa->header.length = sizeof(*spa);
+ memcpy(spa->range_guid, to_nfit_uuid(NFIT_SPA_BDW), 16);
+ spa->range_index = 6+1;
+ spa->address = t->dimm_dma[0];
+ spa->length = DIMM_SIZE;
+
+ /* spa7 (bdw for dcr1) dimm1 */
+ spa = nfit_buf + sizeof(struct acpi_table_nfit) + sizeof(*spa) * 7;
+ spa->header.type = ACPI_NFIT_TYPE_SYSTEM_ADDRESS;
+ spa->header.length = sizeof(*spa);
+ memcpy(spa->range_guid, to_nfit_uuid(NFIT_SPA_BDW), 16);
+ spa->range_index = 7+1;
+ spa->address = t->dimm_dma[1];
+ spa->length = DIMM_SIZE;
+
+ /* spa8 (bdw for dcr2) dimm2 */
+ spa = nfit_buf + sizeof(struct acpi_table_nfit) + sizeof(*spa) * 8;
+ spa->header.type = ACPI_NFIT_TYPE_SYSTEM_ADDRESS;
+ spa->header.length = sizeof(*spa);
+ memcpy(spa->range_guid, to_nfit_uuid(NFIT_SPA_BDW), 16);
+ spa->range_index = 8+1;
+ spa->address = t->dimm_dma[2];
+ spa->length = DIMM_SIZE;
+
+ /* spa9 (bdw for dcr3) dimm3 */
+ spa = nfit_buf + sizeof(struct acpi_table_nfit) + sizeof(*spa) * 9;
+ spa->header.type = ACPI_NFIT_TYPE_SYSTEM_ADDRESS;
+ spa->header.length = sizeof(*spa);
+ memcpy(spa->range_guid, to_nfit_uuid(NFIT_SPA_BDW), 16);
+ spa->range_index = 9+1;
+ spa->address = t->dimm_dma[3];
+ spa->length = DIMM_SIZE;
+
+ offset = sizeof(struct acpi_table_nfit) + sizeof(*spa) * 10;
+ /* mem-region0 (spa0, dimm0) */
+ memdev = nfit_buf + offset;
+ memdev->header.type = ACPI_NFIT_TYPE_MEMORY_MAP;
+ memdev->header.length = sizeof(*memdev);
+ memdev->device_handle = handle[0];
+ memdev->physical_id = 0;
+ memdev->region_id = 0;
+ memdev->range_index = 0+1;
+ memdev->region_index = 0+1;
+ memdev->region_size = SPA0_SIZE/2;
+ memdev->region_offset = t->spa_set_dma[0];
+ memdev->address = 0;
+ memdev->interleave_index = 0;
+ memdev->interleave_ways = 2;
+
+ /* mem-region1 (spa0, dimm1) */
+ memdev = nfit_buf + offset + sizeof(struct acpi_nfit_memory_map);
+ memdev->header.type = ACPI_NFIT_TYPE_MEMORY_MAP;
+ memdev->header.length = sizeof(*memdev);
+ memdev->device_handle = handle[1];
+ memdev->physical_id = 1;
+ memdev->region_id = 0;
+ memdev->range_index = 0+1;
+ memdev->region_index = 1+1;
+ memdev->region_size = SPA0_SIZE/2;
+ memdev->region_offset = t->spa_set_dma[0] + SPA0_SIZE/2;
+ memdev->address = 0;
+ memdev->interleave_index = 0;
+ memdev->interleave_ways = 2;
+
+ /* mem-region2 (spa1, dimm0) */
+ memdev = nfit_buf + offset + sizeof(struct acpi_nfit_memory_map) * 2;
+ memdev->header.type = ACPI_NFIT_TYPE_MEMORY_MAP;
+ memdev->header.length = sizeof(*memdev);
+ memdev->device_handle = handle[0];
+ memdev->physical_id = 0;
+ memdev->region_id = 1;
+ memdev->range_index = 1+1;
+ memdev->region_index = 0+1;
+ memdev->region_size = SPA1_SIZE/4;
+ memdev->region_offset = t->spa_set_dma[1];
+ memdev->address = SPA0_SIZE/2;
+ memdev->interleave_index = 0;
+ memdev->interleave_ways = 4;
+
+ /* mem-region3 (spa1, dimm1) */
+ memdev = nfit_buf + offset + sizeof(struct acpi_nfit_memory_map) * 3;
+ memdev->header.type = ACPI_NFIT_TYPE_MEMORY_MAP;
+ memdev->header.length = sizeof(*memdev);
+ memdev->device_handle = handle[1];
+ memdev->physical_id = 1;
+ memdev->region_id = 1;
+ memdev->range_index = 1+1;
+ memdev->region_index = 1+1;
+ memdev->region_size = SPA1_SIZE/4;
+ memdev->region_offset = t->spa_set_dma[1] + SPA1_SIZE/4;
+ memdev->address = SPA0_SIZE/2;
+ memdev->interleave_index = 0;
+ memdev->interleave_ways = 4;
+
+ /* mem-region4 (spa1, dimm2) */
+ memdev = nfit_buf + offset + sizeof(struct acpi_nfit_memory_map) * 4;
+ memdev->header.type = ACPI_NFIT_TYPE_MEMORY_MAP;
+ memdev->header.length = sizeof(*memdev);
+ memdev->device_handle = handle[2];
+ memdev->physical_id = 2;
+ memdev->region_id = 0;
+ memdev->range_index = 1+1;
+ memdev->region_index = 2+1;
+ memdev->region_size = SPA1_SIZE/4;
+ memdev->region_offset = t->spa_set_dma[1] + 2*SPA1_SIZE/4;
+ memdev->address = SPA0_SIZE/2;
+ memdev->interleave_index = 0;
+ memdev->interleave_ways = 4;
+
+ /* mem-region5 (spa1, dimm3) */
+ memdev = nfit_buf + offset + sizeof(struct acpi_nfit_memory_map) * 5;
+ memdev->header.type = ACPI_NFIT_TYPE_MEMORY_MAP;
+ memdev->header.length = sizeof(*memdev);
+ memdev->device_handle = handle[3];
+ memdev->physical_id = 3;
+ memdev->region_id = 0;
+ memdev->range_index = 1+1;
+ memdev->region_index = 3+1;
+ memdev->region_size = SPA1_SIZE/4;
+ memdev->region_offset = t->spa_set_dma[1] + 3*SPA1_SIZE/4;
+ memdev->address = SPA0_SIZE/2;
+ memdev->interleave_index = 0;
+ memdev->interleave_ways = 4;
+
+ /* mem-region6 (spa/dcr0, dimm0) */
+ memdev = nfit_buf + offset + sizeof(struct acpi_nfit_memory_map) * 6;
+ memdev->header.type = ACPI_NFIT_TYPE_MEMORY_MAP;
+ memdev->header.length = sizeof(*memdev);
+ memdev->device_handle = handle[0];
+ memdev->physical_id = 0;
+ memdev->region_id = 0;
+ memdev->range_index = 2+1;
+ memdev->region_index = 0+1;
+ memdev->region_size = 0;
+ memdev->region_offset = 0;
+ memdev->address = 0;
+ memdev->interleave_index = 0;
+ memdev->interleave_ways = 1;
+
+ /* mem-region7 (spa/dcr1, dimm1) */
+ memdev = nfit_buf + offset + sizeof(struct acpi_nfit_memory_map) * 7;
+ memdev->header.type = ACPI_NFIT_TYPE_MEMORY_MAP;
+ memdev->header.length = sizeof(*memdev);
+ memdev->device_handle = handle[1];
+ memdev->physical_id = 1;
+ memdev->region_id = 0;
+ memdev->range_index = 3+1;
+ memdev->region_index = 1+1;
+ memdev->region_size = 0;
+ memdev->region_offset = 0;
+ memdev->address = 0;
+ memdev->interleave_index = 0;
+ memdev->interleave_ways = 1;
+
+ /* mem-region8 (spa/dcr2, dimm2) */
+ memdev = nfit_buf + offset + sizeof(struct acpi_nfit_memory_map) * 8;
+ memdev->header.type = ACPI_NFIT_TYPE_MEMORY_MAP;
+ memdev->header.length = sizeof(*memdev);
+ memdev->device_handle = handle[2];
+ memdev->physical_id = 2;
+ memdev->region_id = 0;
+ memdev->range_index = 4+1;
+ memdev->region_index = 2+1;
+ memdev->region_size = 0;
+ memdev->region_offset = 0;
+ memdev->address = 0;
+ memdev->interleave_index = 0;
+ memdev->interleave_ways = 1;
+
+ /* mem-region9 (spa/dcr3, dimm3) */
+ memdev = nfit_buf + offset + sizeof(struct acpi_nfit_memory_map) * 9;
+ memdev->header.type = ACPI_NFIT_TYPE_MEMORY_MAP;
+ memdev->header.length = sizeof(*memdev);
+ memdev->device_handle = handle[3];
+ memdev->physical_id = 3;
+ memdev->region_id = 0;
+ memdev->range_index = 5+1;
+ memdev->region_index = 3+1;
+ memdev->region_size = 0;
+ memdev->region_offset = 0;
+ memdev->address = 0;
+ memdev->interleave_index = 0;
+ memdev->interleave_ways = 1;
+
+ /* mem-region10 (spa/bdw0, dimm0) */
+ memdev = nfit_buf + offset + sizeof(struct acpi_nfit_memory_map) * 10;
+ memdev->header.type = ACPI_NFIT_TYPE_MEMORY_MAP;
+ memdev->header.length = sizeof(*memdev);
+ memdev->device_handle = handle[0];
+ memdev->physical_id = 0;
+ memdev->region_id = 0;
+ memdev->range_index = 6+1;
+ memdev->region_index = 0+1;
+ memdev->region_size = 0;
+ memdev->region_offset = 0;
+ memdev->address = 0;
+ memdev->interleave_index = 0;
+ memdev->interleave_ways = 1;
+
+ /* mem-region11 (spa/bdw1, dimm1) */
+ memdev = nfit_buf + offset + sizeof(struct acpi_nfit_memory_map) * 11;
+ memdev->header.type = ACPI_NFIT_TYPE_MEMORY_MAP;
+ memdev->header.length = sizeof(*memdev);
+ memdev->device_handle = handle[1];
+ memdev->physical_id = 1;
+ memdev->region_id = 0;
+ memdev->range_index = 7+1;
+ memdev->region_index = 1+1;
+ memdev->region_size = 0;
+ memdev->region_offset = 0;
+ memdev->address = 0;
+ memdev->interleave_index = 0;
+ memdev->interleave_ways = 1;
+
+ /* mem-region12 (spa/bdw2, dimm2) */
+ memdev = nfit_buf + offset + sizeof(struct acpi_nfit_memory_map) * 12;
+ memdev->header.type = ACPI_NFIT_TYPE_MEMORY_MAP;
+ memdev->header.length = sizeof(*memdev);
+ memdev->device_handle = handle[2];
+ memdev->physical_id = 2;
+ memdev->region_id = 0;
+ memdev->range_index = 8+1;
+ memdev->region_index = 2+1;
+ memdev->region_size = 0;
+ memdev->region_offset = 0;
+ memdev->address = 0;
+ memdev->interleave_index = 0;
+ memdev->interleave_ways = 1;
+
+ /* mem-region13 (spa/bdw3, dimm3) */
+ memdev = nfit_buf + offset + sizeof(struct acpi_nfit_memory_map) * 13;
+ memdev->header.type = ACPI_NFIT_TYPE_MEMORY_MAP;
+ memdev->header.length = sizeof(*memdev);
+ memdev->device_handle = handle[3];
+ memdev->physical_id = 3;
+ memdev->region_id = 0;
+ memdev->range_index = 9+1;
+ memdev->region_index = 3+1;
+ memdev->region_size = 0;
+ memdev->region_offset = 0;
+ memdev->address = 0;
+ memdev->interleave_index = 0;
+ memdev->interleave_ways = 1;
+
+ offset = offset + sizeof(struct acpi_nfit_memory_map) * 14;
+ /* dcr-descriptor0 */
+ dcr = nfit_buf + offset;
+ dcr->header.type = ACPI_NFIT_TYPE_CONTROL_REGION;
+ dcr->header.length = sizeof(struct acpi_nfit_control_region);
+ dcr->region_index = 0+1;
+ dcr->vendor_id = 0xabcd;
+ dcr->device_id = 0;
+ dcr->revision_id = 1;
+ dcr->serial_number = ~handle[0];
+ dcr->windows = 1;
+ dcr->window_size = DCR_SIZE;
+ dcr->command_offset = 0;
+ dcr->command_size = 8;
+ dcr->status_offset = 8;
+ dcr->status_size = 4;
+
+ /* dcr-descriptor1 */
+ dcr = nfit_buf + offset + sizeof(struct acpi_nfit_control_region);
+ dcr->header.type = ACPI_NFIT_TYPE_CONTROL_REGION;
+ dcr->header.length = sizeof(struct acpi_nfit_control_region);
+ dcr->region_index = 1+1;
+ dcr->vendor_id = 0xabcd;
+ dcr->device_id = 0;
+ dcr->revision_id = 1;
+ dcr->serial_number = ~handle[1];
+ dcr->windows = 1;
+ dcr->window_size = DCR_SIZE;
+ dcr->command_offset = 0;
+ dcr->command_size = 8;
+ dcr->status_offset = 8;
+ dcr->status_size = 4;
+
+ /* dcr-descriptor2 */
+ dcr = nfit_buf + offset + sizeof(struct acpi_nfit_control_region) * 2;
+ dcr->header.type = ACPI_NFIT_TYPE_CONTROL_REGION;
+ dcr->header.length = sizeof(struct acpi_nfit_control_region);
+ dcr->region_index = 2+1;
+ dcr->vendor_id = 0xabcd;
+ dcr->device_id = 0;
+ dcr->revision_id = 1;
+ dcr->serial_number = ~handle[2];
+ dcr->windows = 1;
+ dcr->window_size = DCR_SIZE;
+ dcr->command_offset = 0;
+ dcr->command_size = 8;
+ dcr->status_offset = 8;
+ dcr->status_size = 4;
+
+ /* dcr-descriptor3 */
+ dcr = nfit_buf + offset + sizeof(struct acpi_nfit_control_region) * 3;
+ dcr->header.type = ACPI_NFIT_TYPE_CONTROL_REGION;
+ dcr->header.length = sizeof(struct acpi_nfit_control_region);
+ dcr->region_index = 3+1;
+ dcr->vendor_id = 0xabcd;
+ dcr->device_id = 0;
+ dcr->revision_id = 1;
+ dcr->serial_number = ~handle[3];
+ dcr->windows = 1;
+ dcr->window_size = DCR_SIZE;
+ dcr->command_offset = 0;
+ dcr->command_size = 8;
+ dcr->status_offset = 8;
+ dcr->status_size = 4;
+
+ offset = offset + sizeof(struct acpi_nfit_control_region) * 4;
+ /* bdw0 (spa/dcr0, dimm0) */
+ bdw = nfit_buf + offset;
+ bdw->header.type = ACPI_NFIT_TYPE_DATA_REGION;
+ bdw->header.length = sizeof(struct acpi_nfit_data_region);
+ bdw->region_index = 0+1;
+ bdw->windows = 1;
+ bdw->offset = 0;
+ bdw->size = BDW_SIZE;
+ bdw->capacity = DIMM_SIZE;
+ bdw->start_address = 0;
+
+ /* bdw1 (spa/dcr1, dimm1) */
+ bdw = nfit_buf + offset + sizeof(struct acpi_nfit_data_region);
+ bdw->header.type = ACPI_NFIT_TYPE_DATA_REGION;
+ bdw->header.length = sizeof(struct acpi_nfit_data_region);
+ bdw->region_index = 1+1;
+ bdw->windows = 1;
+ bdw->offset = 0;
+ bdw->size = BDW_SIZE;
+ bdw->capacity = DIMM_SIZE;
+ bdw->start_address = 0;
+
+ /* bdw2 (spa/dcr2, dimm2) */
+ bdw = nfit_buf + offset + sizeof(struct acpi_nfit_data_region) * 2;
+ bdw->header.type = ACPI_NFIT_TYPE_DATA_REGION;
+ bdw->header.length = sizeof(struct acpi_nfit_data_region);
+ bdw->region_index = 2+1;
+ bdw->windows = 1;
+ bdw->offset = 0;
+ bdw->size = BDW_SIZE;
+ bdw->capacity = DIMM_SIZE;
+ bdw->start_address = 0;
+
+ /* bdw3 (spa/dcr3, dimm3) */
+ bdw = nfit_buf + offset + sizeof(struct acpi_nfit_data_region) * 3;
+ bdw->header.type = ACPI_NFIT_TYPE_DATA_REGION;
+ bdw->header.length = sizeof(struct acpi_nfit_data_region);
+ bdw->region_index = 3+1;
+ bdw->windows = 1;
+ bdw->offset = 0;
+ bdw->size = BDW_SIZE;
+ bdw->capacity = DIMM_SIZE;
+ bdw->start_address = 0;
+
+ acpi_desc = &t->acpi_desc;
+ set_bit(ND_CMD_GET_CONFIG_SIZE, &acpi_desc->dimm_dsm_force_en);
+ set_bit(ND_CMD_GET_CONFIG_DATA, &acpi_desc->dimm_dsm_force_en);
+ set_bit(ND_CMD_SET_CONFIG_DATA, &acpi_desc->dimm_dsm_force_en);
+ nd_desc = &acpi_desc->nd_desc;
+ nd_desc->ndctl = nfit_test_ctl;
+}
+
+static void nfit_test1_setup(struct nfit_test *t)
+{
+ size_t size = t->nfit_size, offset;
+ void *nfit_buf = t->nfit_buf;
+ struct acpi_nfit_memory_map *memdev;
+ struct acpi_nfit_control_region *dcr;
+ struct acpi_nfit_system_address *spa;
+
+ nfit_test_init_header(nfit_buf, size);
+
+ offset = sizeof(struct acpi_table_nfit);
+ /* spa0 (flat range with no bdw aliasing) */
+ spa = nfit_buf + offset;
+ spa->header.type = ACPI_NFIT_TYPE_SYSTEM_ADDRESS;
+ spa->header.length = sizeof(*spa);
+ memcpy(spa->range_guid, to_nfit_uuid(NFIT_SPA_PM), 16);
+ spa->range_index = 0+1;
+ spa->address = t->spa_set_dma[0];
+ spa->length = SPA2_SIZE;
+
+ offset += sizeof(*spa);
+ /* mem-region0 (spa0, dimm0) */
+ memdev = nfit_buf + offset;
+ memdev->header.type = ACPI_NFIT_TYPE_MEMORY_MAP;
+ memdev->header.length = sizeof(*memdev);
+ memdev->device_handle = 0;
+ memdev->physical_id = 0;
+ memdev->region_id = 0;
+ memdev->range_index = 0+1;
+ memdev->region_index = 0+1;
+ memdev->region_size = SPA2_SIZE;
+ memdev->region_offset = 0;
+ memdev->address = 0;
+ memdev->interleave_index = 0;
+ memdev->interleave_ways = 1;
+
+ offset += sizeof(*memdev);
+ /* dcr-descriptor0 */
+ dcr = nfit_buf + offset;
+ dcr->header.type = ACPI_NFIT_TYPE_CONTROL_REGION;
+ dcr->header.length = sizeof(struct acpi_nfit_control_region);
+ dcr->region_index = 0+1;
+ dcr->vendor_id = 0xabcd;
+ dcr->device_id = 0;
+ dcr->revision_id = 1;
+ dcr->serial_number = ~0;
+ dcr->code = 0x201;
+ dcr->windows = 0;
+ dcr->window_size = 0;
+ dcr->command_offset = 0;
+ dcr->command_size = 0;
+ dcr->status_offset = 0;
+ dcr->status_size = 0;
+}
+
+static int nfit_test_blk_do_io(struct nd_blk_region *ndbr, resource_size_t dpa,
+ void *iobuf, u64 len, int rw)
+{
+ struct nfit_blk *nfit_blk = ndbr->blk_provider_data;
+ struct nfit_blk_mmio *mmio = &nfit_blk->mmio[BDW];
+ struct nd_region *nd_region = &ndbr->nd_region;
+ unsigned int lane;
+
+ lane = nd_region_acquire_lane(nd_region);
+ if (rw)
+ memcpy(mmio->base + dpa, iobuf, len);
+ else
+ memcpy(iobuf, mmio->base + dpa, len);
+ nd_region_release_lane(nd_region, lane);
+
+ return 0;
+}
+
+static int nfit_test_probe(struct platform_device *pdev)
+{
+ struct nvdimm_bus_descriptor *nd_desc;
+ struct acpi_nfit_desc *acpi_desc;
+ struct device *dev = &pdev->dev;
+ struct nfit_test *nfit_test;
+ int rc;
+
+ nfit_test = to_nfit_test(&pdev->dev);
+
+ /* common alloc */
+ if (nfit_test->num_dcr) {
+ int num = nfit_test->num_dcr;
+
+ nfit_test->dimm = devm_kcalloc(dev, num, sizeof(void *),
+ GFP_KERNEL);
+ nfit_test->dimm_dma = devm_kcalloc(dev, num, sizeof(dma_addr_t),
+ GFP_KERNEL);
+ nfit_test->label = devm_kcalloc(dev, num, sizeof(void *),
+ GFP_KERNEL);
+ nfit_test->label_dma = devm_kcalloc(dev, num,
+ sizeof(dma_addr_t), GFP_KERNEL);
+ nfit_test->dcr = devm_kcalloc(dev, num,
+ sizeof(struct nfit_test_dcr *), GFP_KERNEL);
+ nfit_test->dcr_dma = devm_kcalloc(dev, num,
+ sizeof(dma_addr_t), GFP_KERNEL);
+ if (nfit_test->dimm && nfit_test->dimm_dma && nfit_test->label
+ && nfit_test->label_dma && nfit_test->dcr
+ && nfit_test->dcr_dma)
+ /* pass */;
+ else
+ return -ENOMEM;
+ }
+
+ if (nfit_test->num_pm) {
+ int num = nfit_test->num_pm;
+
+ nfit_test->spa_set = devm_kcalloc(dev, num, sizeof(void *),
+ GFP_KERNEL);
+ nfit_test->spa_set_dma = devm_kcalloc(dev, num,
+ sizeof(dma_addr_t), GFP_KERNEL);
+ if (nfit_test->spa_set && nfit_test->spa_set_dma)
+ /* pass */;
+ else
+ return -ENOMEM;
+ }
+
+ /* per-nfit specific alloc */
+ if (nfit_test->alloc(nfit_test))
+ return -ENOMEM;
+
+ nfit_test->setup(nfit_test);
+ acpi_desc = &nfit_test->acpi_desc;
+ acpi_desc->dev = &pdev->dev;
+ acpi_desc->nfit = nfit_test->nfit_buf;
+ acpi_desc->blk_do_io = nfit_test_blk_do_io;
+ nd_desc = &acpi_desc->nd_desc;
+ nd_desc->attr_groups = acpi_nfit_attribute_groups;
+ acpi_desc->nvdimm_bus = nvdimm_bus_register(&pdev->dev, nd_desc);
+ if (!acpi_desc->nvdimm_bus)
+ return -ENXIO;
+
+ rc = acpi_nfit_init(acpi_desc, nfit_test->nfit_size);
+ if (rc) {
+ nvdimm_bus_unregister(acpi_desc->nvdimm_bus);
+ return rc;
+ }
+
+ return 0;
+}
+
+static int nfit_test_remove(struct platform_device *pdev)
+{
+ struct nfit_test *nfit_test = to_nfit_test(&pdev->dev);
+ struct acpi_nfit_desc *acpi_desc = &nfit_test->acpi_desc;
+
+ nvdimm_bus_unregister(acpi_desc->nvdimm_bus);
+
+ return 0;
+}
+
+static void nfit_test_release(struct device *dev)
+{
+ struct nfit_test *nfit_test = to_nfit_test(dev);
+
+ kfree(nfit_test);
+}
+
+static const struct platform_device_id nfit_test_id[] = {
+ { KBUILD_MODNAME },
+ { },
+};
+
+static struct platform_driver nfit_test_driver = {
+ .probe = nfit_test_probe,
+ .remove = nfit_test_remove,
+ .driver = {
+ .name = KBUILD_MODNAME,
+ },
+ .id_table = nfit_test_id,
+};
+
+#ifdef CONFIG_CMA_SIZE_MBYTES
+#define CMA_SIZE_MBYTES CONFIG_CMA_SIZE_MBYTES
+#else
+#define CMA_SIZE_MBYTES 0
+#endif
+
+static __init int nfit_test_init(void)
+{
+ int rc, i;
+
+ nfit_test_setup(nfit_test_lookup);
+
+ for (i = 0; i < NUM_NFITS; i++) {
+ struct nfit_test *nfit_test;
+ struct platform_device *pdev;
+ static int once;
+
+ nfit_test = kzalloc(sizeof(*nfit_test), GFP_KERNEL);
+ if (!nfit_test) {
+ rc = -ENOMEM;
+ goto err_register;
+ }
+ INIT_LIST_HEAD(&nfit_test->resources);
+ switch (i) {
+ case 0:
+ nfit_test->num_pm = NUM_PM;
+ nfit_test->num_dcr = NUM_DCR;
+ nfit_test->alloc = nfit_test0_alloc;
+ nfit_test->setup = nfit_test0_setup;
+ break;
+ case 1:
+ nfit_test->num_pm = 1;
+ nfit_test->alloc = nfit_test1_alloc;
+ nfit_test->setup = nfit_test1_setup;
+ break;
+ default:
+ rc = -EINVAL;
+ goto err_register;
+ }
+ pdev = &nfit_test->pdev;
+ pdev->name = KBUILD_MODNAME;
+ pdev->id = i;
+ pdev->dev.release = nfit_test_release;
+ rc = platform_device_register(pdev);
+ if (rc) {
+ put_device(&pdev->dev);
+ goto err_register;
+ }
+
+ instances[i] = nfit_test;
+
+ rc = dma_coerce_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(64));
+ if (rc)
+ goto err_register;
+
+ if (!once++) {
+ dma_addr_t dma;
+ void *buf;
+
+ buf = dma_alloc_coherent(&pdev->dev, SZ_128M, &dma,
+ GFP_KERNEL);
+ if (!buf) {
+ rc = -ENOMEM;
+ dev_warn(&pdev->dev, "need 128M of free cma\n");
+ goto err_register;
+ }
+ dma_free_coherent(&pdev->dev, SZ_128M, buf, dma);
+ }
+ }
+
+ rc = platform_driver_register(&nfit_test_driver);
+ if (rc)
+ goto err_register;
+ return 0;
+
+ err_register:
+ for (i = 0; i < NUM_NFITS; i++)
+ if (instances[i])
+ platform_device_unregister(&instances[i]->pdev);
+ nfit_test_teardown();
+ return rc;
+}
+
+static __exit void nfit_test_exit(void)
+{
+ int i;
+
+ platform_driver_unregister(&nfit_test_driver);
+ for (i = 0; i < NUM_NFITS; i++)
+ platform_device_unregister(&instances[i]->pdev);
+ nfit_test_teardown();
+}
+
+module_init(nfit_test_init);
+module_exit(nfit_test_exit);
+MODULE_LICENSE("GPL v2");
+MODULE_AUTHOR("Intel Corporation");
diff --git a/tools/testing/nvdimm/test/nfit_test.h b/tools/testing/nvdimm/test/nfit_test.h
new file mode 100644
index 000000000000..96c5e16d7db9
--- /dev/null
+++ b/tools/testing/nvdimm/test/nfit_test.h
@@ -0,0 +1,29 @@
+/*
+ * Copyright(c) 2013-2015 Intel Corporation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ */
+#ifndef __NFIT_TEST_H__
+#define __NFIT_TEST_H__
+
+struct nfit_test_resource {
+ struct list_head list;
+ struct resource *res;
+ struct device *dev;
+ void *buf;
+};
+
+typedef struct nfit_test_resource *(*nfit_test_lookup_fn)(resource_size_t);
+void __iomem *__wrap_ioremap_nocache(resource_size_t offset,
+ unsigned long size);
+void __wrap_iounmap(volatile void __iomem *addr);
+void nfit_test_setup(nfit_test_lookup_fn lookup);
+void nfit_test_teardown(void);
+#endif
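
A reading aid for the nfit_test0_setup() hunk above, not part of the patch: the eight single-DIMM memdev entries (mem-region6 through mem-region13) differ only in which DIMM they name and which SPA range they reference, so the pattern can be written as a loop. The sketch below is a userspace model under stated assumptions. `struct memdev_sketch`, `fill_single_dimm_memdevs()` and the `*_SKETCH` constants are hypothetical names, and only the fields that vary between entries are modeled.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in: only the acpi_nfit_memory_map fields that vary. */
struct memdev_sketch {
	uint32_t device_handle;
	uint16_t physical_id;
	uint16_t range_index;
	uint16_t region_index;
};

#define NUM_DIMMS_SKETCH	4
#define NUM_SINGLE_DIMM_SKETCH	8	/* mem-region6 .. mem-region13 */
#define FIRST_SPA_SKETCH	2	/* mem-region6 references spa2 */

/*
 * Entries 0..3 model the dcr mappings (mem-region6..9) and entries 4..7
 * the bdw mappings (mem-region10..13). Table indices are 1-based, hence
 * the trailing "+ 1", matching the "2+1", "0+1" style used in the patch.
 */
static void fill_single_dimm_memdevs(struct memdev_sketch *out,
		const uint32_t *handle)
{
	for (size_t i = 0; i < NUM_SINGLE_DIMM_SKETCH; i++) {
		unsigned int dimm = i % NUM_DIMMS_SKETCH;

		out[i].device_handle = handle[dimm];
		out[i].physical_id = dimm;
		out[i].range_index = FIRST_SPA_SKETCH + i + 1;
		out[i].region_index = dimm + 1;
	}
}
```

The patch spells every entry out by hand, which makes each table entry greppable on its own; the loop form only makes the index arithmetic easier to check.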

2015-06-25 09:46:39

by Dan Williams

Subject: [PATCH v2 05/17] libnvdimm: Non-Volatile Devices

Maintainer information and documentation for drivers/nvdimm

Cc: Andy Lutomirski <[email protected]>
Cc: Boaz Harrosh <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Neil Brown <[email protected]>
Cc: Greg KH <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
Documentation/nvdimm/btt.txt | 24 +
Documentation/nvdimm/nvdimm.txt | 808 +++++++++++++++++++++++++++++++++++++++
MAINTAINERS | 39 ++
3 files changed, 858 insertions(+), 13 deletions(-)
create mode 100644 Documentation/nvdimm/nvdimm.txt

diff --git a/Documentation/nvdimm/btt.txt b/Documentation/nvdimm/btt.txt
index 95134d5ec4a0..b91443f577dc 100644
--- a/Documentation/nvdimm/btt.txt
+++ b/Documentation/nvdimm/btt.txt
@@ -80,9 +80,17 @@ block. Each map entry is 32 bits. The two most significant bits are special
flags, and the remaining form the internal block number.

Bit Description
-31 : TRIM flag - marks if the block was trimmed or discarded
-30 : ERROR flag - marks an error block. Cleared on write.
-29 - 0 : Mappings to internal 'postmap' blocks
+31 - 30 : Error and Zero flags - Used in the following way:
+ Bit Description
+ 31 30
+ -----------------------------------------------------------------------
+ 00 Initial state. Reads return zeroes; Premap = Postmap
+ 01 Zero state: Reads return zeroes
+ 10 Error state: Reads fail; Writes clear 'E' bit
+ 11 Normal Block – has valid postmap
+
+
+29 - 0 : Mappings to internal 'postmap' blocks


Some of the terminology that will be subsequently used:
@@ -127,10 +135,11 @@ old_map': alternate old postmap entry
new_map': alternate new postmap entry
seq' : alternate sequence number.

-Each of the above fields is 32-bit, making one entry 16 bytes. Flog updates are
+Each of the above fields is 32-bit, making one entry 32 bytes. Entries are also
+padded to 64 bytes to avoid cache line sharing or aliasing. Flog updates are
done such that for any entry being written, it:
a. overwrites the 'old' section in the entry based on sequence numbers
-b. writes the new entry such that the sequence number is written last.
+b. writes the 'new' section such that the sequence number is written last.


c. The concept of lanes
@@ -141,8 +150,9 @@ concurrently, 'nlanes' is the number of IOs the BTT device as a whole can
process.
nlanes = min(nfree, num_cpus)
A lane number is obtained at the start of any IO, and is used for indexing into
-all the on-disk and in-memory data structures for the duration of the IO. It is
-protected by a spinlock.
+all the on-disk and in-memory data structures for the duration of the IO. If
+there are more CPUs than the max number of available lanes, then lanes are
+protected by spinlocks.


d. In-memory data structure: Read Tracking Table (RTT)
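
As an illustration of the map entry encoding documented in the first btt.txt hunk above (not part of the patch), the two flag bits and the 30-bit postmap field could be decoded as follows. All names here (`btt_map_state_sketch`, `map_entry_state()`, `map_entry_postmap()`, the `*_SKETCH` macros) are hypothetical and are not taken from the BTT implementation. Returning the low 30 bits in the zero and error states is an assumption, since the table only states that the postmap is valid in the normal state.

```c
#include <stdint.h>

#define MAP_FLAGS_SHIFT_SKETCH	30
#define MAP_POSTMAP_MASK_SKETCH	0x3fffffffu	/* bits 29 - 0 */

/* Value of bits 31:30, in the order given by the table. */
enum btt_map_state_sketch {
	MAP_STATE_INITIAL = 0,	/* 00: reads return zeroes, premap == postmap */
	MAP_STATE_ZERO = 1,	/* 01: reads return zeroes */
	MAP_STATE_ERROR = 2,	/* 10: reads fail, writes clear the 'E' bit */
	MAP_STATE_NORMAL = 3,	/* 11: valid postmap */
};

static enum btt_map_state_sketch map_entry_state(uint32_t entry)
{
	return (enum btt_map_state_sketch)(entry >> MAP_FLAGS_SHIFT_SKETCH);
}

/*
 * In the initial state the mapping is the identity, so the premap block
 * number is returned unchanged. Otherwise return the low 30 bits.
 */
static uint32_t map_entry_postmap(uint32_t entry, uint32_t premap)
{
	if (map_entry_state(entry) == MAP_STATE_INITIAL)
		return premap;
	return entry & MAP_POSTMAP_MASK_SKETCH;
}
```

For example, an entry of 0xC0000005 is a normal block whose postmap is 5, while an all-zero entry for premap block 9 maps to block 9.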
diff --git a/Documentation/nvdimm/nvdimm.txt b/Documentation/nvdimm/nvdimm.txt
new file mode 100644
index 000000000000..197a0b6b0582
--- /dev/null
+++ b/Documentation/nvdimm/nvdimm.txt
@@ -0,0 +1,808 @@
+ LIBNVDIMM: Non-Volatile Devices
+ libnvdimm - kernel / libndctl - userspace helper library
+ [email protected]
+ v13
+
+
+ Glossary
+ Overview
+ Supporting Documents
+ Git Trees
+ LIBNVDIMM PMEM and BLK
+ Why BLK?
+ PMEM vs BLK
+ BLK-REGIONs, PMEM-REGIONs, Atomic Sectors, and DAX
+ Example NVDIMM Platform
+ LIBNVDIMM Kernel Device Model and LIBNDCTL Userspace API
+ LIBNDCTL: Context
+ libndctl: instantiate a new library context example
+ LIBNVDIMM/LIBNDCTL: Bus
+ libnvdimm: control class device in /sys/class
+ libnvdimm: bus
+ libndctl: bus enumeration example
+ LIBNVDIMM/LIBNDCTL: DIMM (NMEM)
+ libnvdimm: DIMM (NMEM)
+ libndctl: DIMM enumeration example
+ LIBNVDIMM/LIBNDCTL: Region
+ libnvdimm: region
+ libndctl: region enumeration example
+ Why Not Encode the Region Type into the Region Name?
+ How Do I Determine the Major Type of a Region?
+ LIBNVDIMM/LIBNDCTL: Namespace
+ libnvdimm: namespace
+ libndctl: namespace enumeration example
+ libndctl: namespace creation example
+ Why the Term "namespace"?
+ LIBNVDIMM/LIBNDCTL: Block Translation Table "btt"
+ libnvdimm: btt layout
+ libndctl: btt creation example
+ Summary LIBNDCTL Diagram
+
+
+Glossary
+--------
+
+PMEM: A system-physical-address range where writes are persistent. A
+block device composed of PMEM is capable of DAX. A PMEM address range
+may span an interleave of several DIMMs.
+
+BLK: A set of one or more programmable memory mapped apertures provided
+by a DIMM to access its media. This indirection precludes the
+performance benefit of interleaving, but enables DIMM-bounded failure
+modes.
+
+DPA: DIMM Physical Address, is a DIMM-relative offset. With one DIMM in
+the system there would be a 1:1 system-physical-address:DPA association.
+Once more DIMMs are added a memory controller interleave must be
+decoded to determine the DPA associated with a given
+system-physical-address. BLK capacity always has a 1:1 relationship
+with a single-DIMM's DPA range.
+
+DAX: File system extensions to bypass the page cache and block layer to
+mmap persistent memory, from a PMEM block device, directly into a
+process address space.
+
+BTT: Block Translation Table: Persistent memory is byte addressable.
+Existing software may have an expectation that the power-fail-atomicity
+of writes is at least one sector, 512 bytes. The BTT is an indirection
+table with atomic update semantics to front a PMEM/BLK block device
+driver and present arbitrary atomic sector sizes.
+
+LABEL: Metadata stored on a DIMM device that partitions and identifies
+(persistently names) storage between PMEM and BLK. It also partitions
+BLK storage to host BTTs with different parameters per BLK-partition.
+Note that traditional partition tables, GPT/MBR, are layered on top of a
+BLK or PMEM device.
+
+
+Overview
+--------
+
+The LIBNVDIMM subsystem provides support for three types of NVDIMMs, namely,
+PMEM, BLK, and NVDIMM devices that can simultaneously support both PMEM
+and BLK mode access. These three modes of operation are described by
+the "NVDIMM Firmware Interface Table" (NFIT) in ACPI 6. While the LIBNVDIMM
+implementation is generic and supports pre-NFIT platforms, it was guided
+by the superset of capabilities needed to support this ACPI 6 definition
+for NVDIMM resources. The bulk of the kernel implementation is in place
+to handle the case where DPA accessible via PMEM is aliased with DPA
+accessible via BLK. When that occurs a LABEL is needed to reserve DPA
+for exclusive access via one mode at a time.
+
+Supporting Documents
+ACPI 6: http://www.uefi.org/sites/default/files/resources/ACPI_6.0.pdf
+NVDIMM Namespace: http://pmem.io/documents/NVDIMM_Namespace_Spec.pdf
+DSM Interface Example: http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf
+Driver Writer's Guide: http://pmem.io/documents/NVDIMM_Driver_Writers_Guide.pdf
+
+Git Trees
+LIBNVDIMM: https://git.kernel.org/cgit/linux/kernel/git/djbw/nvdimm.git
+LIBNDCTL: https://github.com/pmem/ndctl.git
+PMEM: https://github.com/01org/prd
+
+
+LIBNVDIMM PMEM and BLK
+------------------
+
+Prior to the arrival of the NFIT, non-volatile memory was described to a
+system in various ad-hoc ways. Usually only the bare minimum was
+provided, namely, a single system-physical-address range where writes
+are expected to be durable after a system power loss. Now, the NFIT
+specification standardizes not only the description of PMEM, but also
+BLK and platform message-passing entry points for control and
+configuration.
+
+For each NVDIMM access method (PMEM, BLK), LIBNVDIMM provides a block
+device driver:
+
+ 1. PMEM (nd_pmem.ko): Drives a system-physical-address range. This
+ range is contiguous in system memory and may be interleaved (hardware
+ memory controller striped) across multiple DIMMs. When interleaved the
+ platform may optionally provide details of which DIMMs are participating
+ in the interleave.
+
+ Note that while LIBNVDIMM describes system-physical-address ranges that may
+ alias with BLK access as ND_NAMESPACE_PMEM ranges and those without
+ alias as ND_NAMESPACE_IO ranges, to the nd_pmem driver there is no
+ distinction. The different device-types are an implementation detail
+ that userspace can exploit to implement policies like "only interface
+ with address ranges from certain DIMMs". It is worth noting that when
+ aliasing is present and a DIMM lacks a label, then no block device can
+ be created by default as userspace needs to do at least one allocation
+ of DPA to the PMEM range. In contrast ND_NAMESPACE_IO ranges, once
+ registered, can be immediately attached to nd_pmem.
+
+ 2. BLK (nd_blk.ko): This driver performs I/O using a set of platform
+ defined apertures. A set of apertures will all access just one DIMM.
+ Multiple windows allow multiple concurrent accesses, much like
+ tagged-command-queuing, and would likely be used by different threads or
+ different CPUs.
+
+ The NFIT specification defines a standard format for a BLK-aperture, but
+ the spec also allows for vendor specific layouts, and non-NFIT BLK
+ implementations may have other designs for BLK I/O. For this reason "nd_blk"
+ calls back into platform-specific code to perform the I/O. One such
+ implementation is defined in the "Driver Writer's Guide" and "DSM
+ Interface Example".
+
+
+Why BLK?
+--------
+
+While PMEM provides direct byte-addressable CPU-load/store access to
+NVDIMM storage, it does not provide the best system RAS (reliability,
+availability, and serviceability) model. An access to a corrupted
+system-physical-address causes a CPU exception, while an access
+to a corrupted address through a BLK-aperture causes that block window
+to raise an error status in a register. The latter is more aligned with
+the standard error model that host-bus-adapter attached disks present.
+Also, if an administrator ever wants to replace a memory module, it is
+easier to service a system at DIMM boundaries. Compare this to PMEM,
+where data could be interleaved in an opaque hardware-specific manner
+across several DIMMs.
+
+PMEM vs BLK
+BLK-apertures solve this RAS problem, but their presence is also the
+major contributing factor to the complexity of the ND subsystem. They
+complicate the implementation because PMEM and BLK alias in DPA space.
+Any given DIMM's DPA-range may contribute to one or more
+system-physical-address sets of interleaved DIMMs, *and* may also be
+accessed in its entirety through its BLK-aperture. Accessing a DPA
+through a system-physical-address while simultaneously accessing the
+same DPA through a BLK-aperture has undefined results. For this reason,
+DIMMs with this dual interface configuration include a DSM function to
+store/retrieve a LABEL. The LABEL effectively partitions the DPA-space
+into exclusive system-physical-address and BLK-aperture accessible
+regions. For simplicity a DIMM is allowed a PMEM "region" per each
+interleave set in which it is a member. The remaining DPA space can be
+carved into an arbitrary number of BLK devices with discontiguous
+extents.
+
+BLK-REGIONs, PMEM-REGIONs, Atomic Sectors, and DAX
+--------------------------------------------------
+
+One of the few
+reasons to allow multiple BLK namespaces per REGION is so that each
+BLK-namespace can be configured with a BTT with unique atomic sector
+sizes. While a PMEM device can host a BTT the LABEL specification does
+not provide for a sector size to be specified for a PMEM namespace.
+This is due to the expectation that the primary usage model for PMEM is
+via DAX, and the BTT is incompatible with DAX. However, for the cases
+where an application or filesystem still needs atomic sector update
+guarantees it can register a BTT on a PMEM device or partition. See
+LIBNVDIMM/LIBNDCTL: Block Translation Table "btt".
+
+
+Example NVDIMM Platform
+-----------------------
+
+For the remainder of this document the following diagram will be
+referenced for any example sysfs layouts.
+
+
+ (a) (b) DIMM BLK-REGION
+ +-------------------+--------+--------+--------+
++------+ | pm0.0 | blk2.0 | pm1.0 | blk2.1 | 0 region2
+| imc0 +--+- - - region0- - - +--------+ +--------+
++--+---+ | pm0.0 | blk3.0 | pm1.0 | blk3.1 | 1 region3
+ | +-------------------+--------v v--------+
++--+---+ | |
+| cpu0 | region1
++--+---+ | |
+ | +----------------------------^ ^--------+
++--+---+ | blk4.0 | pm1.0 | blk4.0 | 2 region4
+| imc1 +--+----------------------------| +--------+
++------+ | blk5.0 | pm1.0 | blk5.0 | 3 region5
+ +----------------------------+--------+--------+
+
+In this platform we have four DIMMs and two memory controllers in one
+socket. Each unique interface (BLK or PMEM) to DPA space is identified
+by a region device with a dynamically assigned id (REGION0 - REGION5).
+
+ 1. The first portion of DIMM0 and DIMM1 are interleaved as REGION0. A
+ single PMEM namespace is created in the REGION0-SPA-range that spans
+ DIMM0 and DIMM1 with a user-specified name of "pm0.0". Some of that
+ interleaved system-physical-address range is reclaimed as BLK-aperture
+ accessed space starting at DPA-offset (a) into each DIMM. In that
+ reclaimed space we create two BLK-aperture "namespaces" from REGION2 and
+ REGION3 where "blk2.0" and "blk3.0" are just human readable names that
+ could be set to any user-desired name in the LABEL.
+
+ 2. In the last portion of DIMM0 and DIMM1 we have an interleaved
+ system-physical-address range, REGION1, that spans those two DIMMs as
+ well as DIMM2 and DIMM3. Some of REGION1 is allocated to a PMEM namespace
+ named "pm1.0"; the rest is reclaimed in 4 BLK-aperture namespaces (one for
+ each DIMM in the interleave set), "blk2.1", "blk3.1", "blk4.0", and
+ "blk5.0".
+
+ 3. The portions of DIMM2 and DIMM3 that do not participate in the REGION1
+ interleaved system-physical-address range (i.e. the DPA addresses below
+ offset (b)) are also included in the "blk4.0" and "blk5.0" namespaces.
+ Note that this example shows that BLK-aperture namespaces don't need to
+ be contiguous in DPA-space.
+
+ This bus is provided by the kernel under the device
+ /sys/devices/platform/nfit_test.0 when CONFIG_NFIT_TEST is enabled and
+ the nfit_test.ko module is loaded. This not only tests LIBNVDIMM but the
+ acpi_nfit.ko driver as well.
+
+
+LIBNVDIMM Kernel Device Model and LIBNDCTL Userspace API
+----------------------------------------------------
+
+What follows is a description of the LIBNVDIMM sysfs layout and a
+corresponding object hierarchy diagram as viewed through the LIBNDCTL
+api. The example sysfs paths and diagrams are relative to the Example
+NVDIMM Platform which is also the LIBNVDIMM bus used in the LIBNDCTL unit
+test.
+
+LIBNDCTL: Context
+Every api call in the LIBNDCTL library requires a context that holds the
+logging parameters and other library instance state. The library is
+based on the libabc template:
+https://git.kernel.org/cgit/linux/kernel/git/kay/libabc.git/
+
+LIBNDCTL: instantiate a new library context example
+
+ struct ndctl_ctx *ctx;
+
+ if (ndctl_new(&ctx) == 0)
+ return ctx;
+ else
+ return NULL;
+
+LIBNVDIMM/LIBNDCTL: Bus
+-------------------
+
+A bus has a 1:1 relationship with an NFIT. The current expectation for
+ACPI based systems is that there is only ever one platform-global NFIT.
+That said, it is trivial to register multiple NFITs; the specification
+does not preclude it. The infrastructure supports multiple busses and
+we use this capability to test multiple NFIT configurations in the
+unit test.
+
+LIBNVDIMM: control class device in /sys/class
+
+This character device accepts DSM messages to be passed to a DIMM
+identified by its NFIT handle.
+
+ /sys/class/nd/ndctl0
+ |-- dev
+ |-- device -> ../../../ndbus0
+ |-- subsystem -> ../../../../../../../class/nd
+
+
+
+LIBNVDIMM: bus
+
+ struct nvdimm_bus *nvdimm_bus_register(struct device *parent,
+ struct nvdimm_bus_descriptor *nfit_desc);
+
+ /sys/devices/platform/nfit_test.0/ndbus0
+ |-- commands
+ |-- nd
+ |-- nfit
+ |-- nmem0
+ |-- nmem1
+ |-- nmem2
+ |-- nmem3
+ |-- power
+ |-- provider
+ |-- region0
+ |-- region1
+ |-- region2
+ |-- region3
+ |-- region4
+ |-- region5
+ |-- uevent
+ `-- wait_probe
+
+LIBNDCTL: bus enumeration example
+Find the bus handle that describes the bus from Example NVDIMM Platform
+
+ static struct ndctl_bus *get_bus_by_provider(struct ndctl_ctx *ctx,
+ const char *provider)
+ {
+ struct ndctl_bus *bus;
+
+ ndctl_bus_foreach(ctx, bus)
+ if (strcmp(provider, ndctl_bus_get_provider(bus)) == 0)
+ return bus;
+
+ return NULL;
+ }
+
+ bus = get_bus_by_provider(ctx, "nfit_test.0");
+
+
+LIBNVDIMM/LIBNDCTL: DIMM (NMEM)
+---------------------------
+
+The DIMM device provides a character device for sending commands to
+hardware, and it is a container for LABELs. If the DIMM is defined by
+NFIT then an optional 'nfit' attribute sub-directory is available to add
+NFIT-specifics.
+
+Note that the kernel device name for "DIMMs" is "nmemX". The NFIT
+describes these devices via "Memory Device to System Physical Address
+Range Mapping Structure", and there is no requirement that they actually
+be physical DIMMs, so we use a more generic name.
+
+LIBNVDIMM: DIMM (NMEM)
+
+ struct nvdimm *nvdimm_create(struct nvdimm_bus *nvdimm_bus, void *provider_data,
+ const struct attribute_group **groups, unsigned long flags,
+ unsigned long *dsm_mask);
+
+ /sys/devices/platform/nfit_test.0/ndbus0
+ |-- nmem0
+ | |-- available_slots
+ | |-- commands
+ | |-- dev
+ | |-- devtype
+ | |-- driver -> ../../../../../bus/nd/drivers/nvdimm
+ | |-- modalias
+ | |-- nfit
+ | | |-- device
+ | | |-- format
+ | | |-- handle
+ | | |-- phys_id
+ | | |-- rev_id
+ | | |-- serial
+ | | `-- vendor
+ | |-- state
+ | |-- subsystem -> ../../../../../bus/nd
+ | `-- uevent
+ |-- nmem1
+ [..]
+
+
+LIBNDCTL: DIMM enumeration example
+
+Note, in this example we are assuming NFIT-defined DIMMs which are
+identified by an "nfit_handle", a 32-bit value where:
+Bit 3:0 DIMM number within the memory channel
+Bit 7:4 memory channel number
+Bit 11:8 memory controller ID
+Bit 15:12 socket ID (within scope of a Node controller if node controller is present)
+Bit 27:16 Node Controller ID
+Bit 31:28 Reserved
+
+ static struct ndctl_dimm *get_dimm_by_handle(struct ndctl_bus *bus,
+ unsigned int handle)
+ {
+ struct ndctl_dimm *dimm;
+
+ ndctl_dimm_foreach(bus, dimm)
+ if (ndctl_dimm_get_handle(dimm) == handle)
+ return dimm;
+
+ return NULL;
+ }
+
+ #define DIMM_HANDLE(n, s, i, c, d) \
+ (((n & 0xfff) << 16) | ((s & 0xf) << 12) | ((i & 0xf) << 8) \
+ | ((c & 0xf) << 4) | (d & 0xf))
+
+ dimm = get_dimm_by_handle(bus, DIMM_HANDLE(0, 0, 0, 0, 0));
+
+LIBNVDIMM/LIBNDCTL: Region
+----------------------
+
+A generic REGION device is registered for each PMEM range or BLK-aperture
+set. Per the example there are 6 regions: 2 PMEM and 4 BLK-aperture
+sets on the "nfit_test.0" bus. The primary role of a region is to be a
+container of "mappings". A mapping is a tuple of <DIMM,
+DPA-start-offset, length>.
+
+LIBNVDIMM provides a built-in driver for these REGION devices. This driver
+is responsible for reconciling the aliased DPA mappings across all
+regions, parsing the LABEL, if present, and then emitting NAMESPACE
+devices with the resolved/exclusive DPA-boundaries for the nd_pmem or
+nd_blk device driver to consume.
+
+In addition to the generic attributes of "mapping"s, "interleave_ways"
+and "size", the REGION device also exports some convenience attributes.
+"nstype" indicates the integer type of namespace-device this region
+emits, "devtype" duplicates the DEVTYPE variable stored by udev at the
+'add' event, "modalias" duplicates the MODALIAS variable stored by udev
+at the 'add' event, and finally, the optional "spa_index" is provided in
+the case where the region is defined by a SPA.
+
+LIBNVDIMM: region
+
+ struct nd_region *nvdimm_pmem_region_create(struct nvdimm_bus *nvdimm_bus,
+ struct nd_region_desc *ndr_desc);
+ struct nd_region *nvdimm_blk_region_create(struct nvdimm_bus *nvdimm_bus,
+ struct nd_region_desc *ndr_desc);
+
+ /sys/devices/platform/nfit_test.0/ndbus0
+ |-- region0
+ | |-- available_size
+ | |-- btt0
+ | |-- btt_seed
+ | |-- devtype
+ | |-- driver -> ../../../../../bus/nd/drivers/nd_region
+ | |-- init_namespaces
+ | |-- mapping0
+ | |-- mapping1
+ | |-- mappings
+ | |-- modalias
+ | |-- namespace0.0
+ | |-- namespace_seed
+ | |-- numa_node
+ | |-- nfit
+ | | `-- spa_index
+ | |-- nstype
+ | |-- set_cookie
+ | |-- size
+ | |-- subsystem -> ../../../../../bus/nd
+ | `-- uevent
+ |-- region1
+ [..]
+
+LIBNDCTL: region enumeration example
+
+Sample region retrieval routines based on NFIT-unique data like
+"spa_index" (interleave set id) for PMEM and "nfit_handle" (dimm id) for
+BLK.
+
+ static struct ndctl_region *get_pmem_region_by_spa_index(struct ndctl_bus *bus,
+ unsigned int spa_index)
+ {
+ struct ndctl_region *region;
+
+ ndctl_region_foreach(bus, region) {
+ if (ndctl_region_get_type(region) != ND_DEVICE_REGION_PMEM)
+ continue;
+ if (ndctl_region_get_spa_index(region) == spa_index)
+ return region;
+ }
+ return NULL;
+ }
+
+ static struct ndctl_region *get_blk_region_by_dimm_handle(struct ndctl_bus *bus,
+ unsigned int handle)
+ {
+ struct ndctl_region *region;
+
+ ndctl_region_foreach(bus, region) {
+ struct ndctl_mapping *map;
+
+ if (ndctl_region_get_type(region) != ND_DEVICE_REGION_BLOCK)
+ continue;
+ ndctl_mapping_foreach(region, map) {
+ struct ndctl_dimm *dimm = ndctl_mapping_get_dimm(map);
+
+ if (ndctl_dimm_get_handle(dimm) == handle)
+ return region;
+ }
+ }
+ return NULL;
+ }
+
+
+Why Not Encode the Region Type into the Region Name?
+----------------------------------------------------
+
+At first glance it seems that, since NFIT defines just the PMEM and BLK
+interface types, we should simply name REGION devices with something
+derived from those type names. However, the ND subsystem explicitly keeps
+the REGION name generic and expects userspace to always consider the
+region-attributes for 4 reasons:
+
+ 1. There are already more than two REGION and "namespace" types. For
+ PMEM there are two subtypes. As mentioned previously we have PMEM where
+ the constituent DIMM devices are known and anonymous PMEM. For BLK
+ regions the NFIT specification already anticipates vendor specific
+ implementations. The exact distinction of what a region contains is in
+ the region-attributes not the region-name or the region-devtype.
+
+ 2. A region with zero child-namespaces is a possible configuration. For
+ example, the NFIT allows for a DCR to be published without a
+ corresponding BLK-aperture. This equates to a DIMM that can only accept
+ control/configuration messages, but no i/o through a descendant block
+ device. Again, this "type" is advertised in the attributes ('mappings'
+ == 0) and the name does not tell you much.
+
+ 3. What if a third major interface type arises in the future? Outside
+ of vendor specific implementations, it's not difficult to envision a
+ third class of interface type beyond BLK and PMEM. With a generic name
+ for the REGION level of the device-hierarchy old userspace
+ implementations can still make sense of new kernel advertised
+ region-types. Userspace can always rely on the generic region
+ attributes like "mappings", "size", etc and the expected child devices
+ named "namespace". This generic format of the device-model hierarchy
+ allows the LIBNVDIMM and LIBNDCTL implementations to be more uniform and
+ future-proof.
+
+ 4. There are more robust mechanisms for determining the major type of a
+ region than a device name. See the next section, How Do I Determine the
+ Major Type of a Region?
+
+How Do I Determine the Major Type of a Region?
+----------------------------------------------
+
+Outside of the blanket recommendation of "use libndctl", or simply
+looking at the kernel header (/usr/include/linux/ndctl.h) to decode the
+"nstype" integer attribute, here are some other options.
+
+ 1. module alias lookup:
+
+ The whole point of region/namespace device type differentiation is to
+ decide which block-device driver will attach to a given LIBNVDIMM namespace.
+ One can simply use the modalias to look up the resulting module. It's
+ important to note that this method is robust in the presence of a
+ vendor-specific driver down the road. If a vendor-specific
+ implementation wants to supplant the standard nd_blk driver it can with
+ minimal impact to the rest of LIBNVDIMM.
+
+ In fact, a vendor may also want to have a vendor-specific region-driver
+ (outside of nd_region). For example, if a vendor defined its own LABEL
+ format it would need its own region driver to parse that LABEL and emit
+ the resulting namespaces. The output from module resolution is more
+ accurate than a region-name or region-devtype.
+
+ 2. udev:
+
+ The kernel "devtype" is registered in the udev database
+ # udevadm info --path=/devices/platform/nfit_test.0/ndbus0/region0
+ P: /devices/platform/nfit_test.0/ndbus0/region0
+ E: DEVPATH=/devices/platform/nfit_test.0/ndbus0/region0
+ E: DEVTYPE=nd_pmem
+ E: MODALIAS=nd:t2
+ E: SUBSYSTEM=nd
+
+ # udevadm info --path=/devices/platform/nfit_test.0/ndbus0/region4
+ P: /devices/platform/nfit_test.0/ndbus0/region4
+ E: DEVPATH=/devices/platform/nfit_test.0/ndbus0/region4
+ E: DEVTYPE=nd_blk
+ E: MODALIAS=nd:t3
+ E: SUBSYSTEM=nd
+
+ ...and is available as a region attribute, but keep in mind that the
+ "devtype" does not indicate sub-type variations and scripts should
+ really examine the other attributes.
+
+ 3. type specific attributes:
+
+ As it currently stands a BLK-aperture region will never have a
+ "nfit/spa_index" attribute, but neither will a non-NFIT PMEM region. A
+ BLK region with a "mappings" value of 0 is, as mentioned above, a DIMM
+ that does not allow I/O. A PMEM region with a "mappings" value of zero
+ is a simple system-physical-address range.
+
+
+LIBNVDIMM/LIBNDCTL: Namespace
+-------------------------
+
+A REGION, after resolving DPA aliasing and LABEL specified boundaries,
+surfaces one or more "namespace" devices. The arrival of a "namespace"
+device currently triggers either the nd_blk or nd_pmem driver to load
+and register a disk/block device.
+
+LIBNVDIMM: namespace
+Here is a sample layout from the three major types of NAMESPACE where
+namespace0.0 represents DIMM-info-backed PMEM (note that it has a 'uuid'
+attribute), namespace2.0 represents a BLK namespace (note that it has a
+'sector_size' attribute), and namespace6.0 represents an anonymous
+PMEM namespace (note that it has no 'uuid' attribute because it does not
+support a LABEL).
+
+ /sys/devices/platform/nfit_test.0/ndbus0/region0/namespace0.0
+ |-- alt_name
+ |-- devtype
+ |-- dpa_extents
+ |-- force_raw
+ |-- modalias
+ |-- numa_node
+ |-- resource
+ |-- size
+ |-- subsystem -> ../../../../../../bus/nd
+ |-- type
+ |-- uevent
+ `-- uuid
+ /sys/devices/platform/nfit_test.0/ndbus0/region2/namespace2.0
+ |-- alt_name
+ |-- devtype
+ |-- dpa_extents
+ |-- force_raw
+ |-- modalias
+ |-- numa_node
+ |-- sector_size
+ |-- size
+ |-- subsystem -> ../../../../../../bus/nd
+ |-- type
+ |-- uevent
+ `-- uuid
+ /sys/devices/platform/nfit_test.1/ndbus1/region6/namespace6.0
+ |-- block
+ | `-- pmem0
+ |-- devtype
+ |-- driver -> ../../../../../../bus/nd/drivers/pmem
+ |-- force_raw
+ |-- modalias
+ |-- numa_node
+ |-- resource
+ |-- size
+ |-- subsystem -> ../../../../../../bus/nd
+ |-- type
+ `-- uevent
+
+LIBNDCTL: namespace enumeration example
+Namespaces are indexed relative to their parent region, as in the example
+below. These indexes are mostly static from boot to boot, but the
+subsystem makes no guarantees in this regard. For a static namespace
+identifier use its 'uuid' attribute.
+
+static struct ndctl_namespace *get_namespace_by_id(struct ndctl_region *region,
+ unsigned int id)
+{
+ struct ndctl_namespace *ndns;
+
+ ndctl_namespace_foreach(region, ndns)
+ if (ndctl_namespace_get_id(ndns) == id)
+ return ndns;
+
+ return NULL;
+}
+
+LIBNDCTL: namespace creation example
+Idle namespaces are automatically created by the kernel if a given
+region has enough available capacity to create a new namespace.
+Namespace instantiation involves finding an idle namespace and
+configuring it. For the most part the setting of namespace attributes
+can occur in any order, the only constraint is that 'uuid' must be set
+before 'size'. This enables the kernel to track DPA allocations
+internally with a static identifier.
+
+static int configure_namespace(struct ndctl_region *region,
+		struct ndctl_namespace *ndns,
+		struct namespace_parameters *parameters)
+{
+	char devname[50];
+
+	snprintf(devname, sizeof(devname), "namespace%d.%d",
+			ndctl_region_get_id(region), parameters->id);
+
+	ndctl_namespace_set_alt_name(ndns, devname);
+	/* 'uuid' must be set prior to setting size! */
+	ndctl_namespace_set_uuid(ndns, parameters->uuid);
+	ndctl_namespace_set_size(ndns, parameters->size);
+	/* unlike pmem namespaces, blk namespaces have a sector size */
+	if (parameters->lbasize)
+		ndctl_namespace_set_sector_size(ndns, parameters->lbasize);
+	return ndctl_namespace_enable(ndns);
+}
+
+
+Why the Term "namespace"?
+
+ 1. Why not "volume" for instance? "volume" ran the risk of confusing ND
+ as a volume manager like device-mapper.
+
+ 2. The term originated to describe the sub-devices that can be created
+ within an NVME controller (see the nvme specification:
+ http://www.nvmexpress.org/specifications/), and NFIT namespaces are
+ meant to parallel the capabilities and configurability of
+ NVME-namespaces.
+
+
+LIBNVDIMM/LIBNDCTL: Block Translation Table "btt"
+---------------------------------------------
+
+A BTT (design document: http://pmem.io/2014/09/23/btt.html) is a stacked
+block device driver that fronts either the whole block device or a
+partition of a block device emitted by either a PMEM or BLK NAMESPACE.
+
+LIBNVDIMM: btt layout
+Every region will start out with at least one BTT device which is the
+seed device. To activate it set the "namespace", "uuid", and
+"sector_size" attributes and then bind the device to the nd_pmem or
+nd_blk driver depending on the region type.
+
+ /sys/devices/platform/nfit_test.1/ndbus0/region0/btt0/
+ |-- namespace
+ |-- delete
+ |-- devtype
+ |-- modalias
+ |-- numa_node
+ |-- sector_size
+ |-- subsystem -> ../../../../../bus/nd
+ |-- uevent
+ `-- uuid
+
+LIBNDCTL: btt creation example
+Similar to namespaces, an idle BTT device is automatically created per
+region. Each time this "seed" btt device is configured and enabled a new
+seed is created. Creating a BTT configuration involves two steps:
+finding an idle BTT and assigning it to consume a PMEM or BLK namespace.
+
+ static struct ndctl_btt *get_idle_btt(struct ndctl_region *region)
+ {
+ struct ndctl_btt *btt;
+
+ ndctl_btt_foreach(region, btt)
+ if (!ndctl_btt_is_enabled(btt)
+ && !ndctl_btt_is_configured(btt))
+ return btt;
+
+ return NULL;
+ }
+
+	static int configure_btt(struct ndctl_region *region,
+			struct btt_parameters *parameters)
+	{
+		struct ndctl_btt *btt = get_idle_btt(region);
+
+		if (!btt)
+			return -ENXIO;
+		ndctl_btt_set_uuid(btt, parameters->uuid);
+		ndctl_btt_set_sector_size(btt, parameters->sector_size);
+		ndctl_btt_set_namespace(btt, parameters->ndns);
+		/* turn off the raw mode device, then turn on btt access */
+		ndctl_namespace_disable(parameters->ndns);
+		return ndctl_btt_enable(btt);
+	}
+
+Once instantiated a new inactive btt seed device will appear underneath
+the region.
+
+Once a "namespace" is removed from a BTT that instance of the BTT device
+will be deleted or otherwise reset to default values. This deletion is
+only at the device model level. In order to destroy a BTT the "info
+block" needs to be destroyed. Note, that to destroy a BTT the media
+needs to be written in raw mode. By default, the kernel will autodetect
+the presence of a BTT and disable raw mode. This autodetect behavior
+can be suppressed by enabling raw mode for the namespace via the
+ndctl_namespace_set_raw_mode() api.
+
+
+Summary LIBNDCTL Diagram
+------------------------
+
+For the given example above, here is the view of the objects as seen by the LIBNDCTL api:
+ +---+
+ |CTX| +---------+ +--------------+ +---------------+
+ +-+-+ +-> REGION0 +---> NAMESPACE0.0 +--> PMEM8 "pm0.0" |
+ | | +---------+ +--------------+ +---------------+
++-------+ | | +---------+ +--------------+ +---------------+
+| DIMM0 <-+ | +-> REGION1 +---> NAMESPACE1.0 +--> PMEM6 "pm1.0" |
++-------+ | | | +---------+ +--------------+ +---------------+
+| DIMM1 <-+ +-v--+ | +---------+ +--------------+ +---------------+
++-------+ +-+BUS0+---> REGION2 +-+-> NAMESPACE2.0 +--> ND6 "blk2.0" |
+| DIMM2 <-+ +----+ | +---------+ | +--------------+ +----------------------+
++-------+ | | +-> NAMESPACE2.1 +--> ND5 "blk2.1" | BTT2 |
+| DIMM3 <-+ | +--------------+ +----------------------+
++-------+ | +---------+ +--------------+ +---------------+
+ +-> REGION3 +-+-> NAMESPACE3.0 +--> ND4 "blk3.0" |
+ | +---------+ | +--------------+ +----------------------+
+ | +-> NAMESPACE3.1 +--> ND3 "blk3.1" | BTT1 |
+ | +--------------+ +----------------------+
+ | +---------+ +--------------+ +---------------+
+ +-> REGION4 +---> NAMESPACE4.0 +--> ND2 "blk4.0" |
+ | +---------+ +--------------+ +---------------+
+ | +---------+ +--------------+ +----------------------+
+ +-> REGION5 +---> NAMESPACE5.0 +--> ND1 "blk5.0" | BTT0 |
+ +---------+ +--------------+ +---------------+------+
+
+
diff --git a/MAINTAINERS b/MAINTAINERS
index 590304b96b03..c2f55aed811d 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -5912,6 +5912,39 @@ M: Sasha Levin <[email protected]>
S: Maintained
F: tools/lib/lockdep/

+LIBNVDIMM: NON-VOLATILE MEMORY DEVICE SUBSYSTEM
+M: Dan Williams <[email protected]>
+L: [email protected]
+Q: https://patchwork.kernel.org/project/linux-nvdimm/list/
+S: Supported
+F: drivers/nvdimm/*
+F: include/linux/nd.h
+F: include/linux/libnvdimm.h
+F: include/uapi/linux/ndctl.h
+
+LIBNVDIMM BLK: MMIO-APERTURE DRIVER
+M: Ross Zwisler <[email protected]>
+L: [email protected]
+Q: https://patchwork.kernel.org/project/linux-nvdimm/list/
+S: Supported
+F: drivers/nvdimm/blk.c
+F: drivers/nvdimm/region_devs.c
+F: drivers/acpi/nfit*
+
+LIBNVDIMM BTT: BLOCK TRANSLATION TABLE
+M: Vishal Verma <[email protected]>
+L: [email protected]
+Q: https://patchwork.kernel.org/project/linux-nvdimm/list/
+S: Supported
+F: drivers/nvdimm/btt*
+
+LIBNVDIMM PMEM: PERSISTENT MEMORY DRIVER
+M: Ross Zwisler <[email protected]>
+L: [email protected]
+Q: https://patchwork.kernel.org/project/linux-nvdimm/list/
+S: Supported
+F: drivers/nvdimm/pmem.c
+
LINUX FOR IBM pSERIES (RS/6000)
M: Paul Mackerras <[email protected]>
W: http://www.ibm.com/linux/ltc/projects/ppc
@@ -8136,12 +8169,6 @@ S: Maintained
F: Documentation/blockdev/ramdisk.txt
F: drivers/block/brd.c

-PERSISTENT MEMORY DRIVER
-M: Ross Zwisler <[email protected]>
-L: [email protected]
-S: Supported
-F: drivers/block/pmem.c
-
RANDOM NUMBER DRIVER
M: "Theodore Ts'o" <[email protected]>
S: Maintained
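
The "nfit_handle" bit layout described in the documentation added by the
patch above can also be read in the other direction. What follows is a
minimal user-space sketch, not part of the patch and not part of libndctl:
struct nfit_handle_fields and the two helper names are hypothetical and
exist only for illustration. The encoder uses the same packing as the
DIMM_HANDLE() macro from the documentation.

```c
/* Hypothetical container for the fields of an nfit_handle. */
struct nfit_handle_fields {
	unsigned int node;       /* bits 27:16, Node Controller ID */
	unsigned int socket;     /* bits 15:12, socket ID */
	unsigned int controller; /* bits 11:8, memory controller ID */
	unsigned int channel;    /* bits 7:4, memory channel number */
	unsigned int dimm;       /* bits 3:0, DIMM number in the channel */
};

/* Pack the fields; same layout as DIMM_HANDLE(n, s, i, c, d). */
static unsigned int nfit_handle_encode(const struct nfit_handle_fields *f)
{
	return ((f->node & 0xfff) << 16) | ((f->socket & 0xf) << 12)
		| ((f->controller & 0xf) << 8) | ((f->channel & 0xf) << 4)
		| (f->dimm & 0xf);
}

/* Unpack a handle; bits 31:28 are reserved and ignored here. */
static struct nfit_handle_fields nfit_handle_decode(unsigned int handle)
{
	struct nfit_handle_fields f = {
		.node = (handle >> 16) & 0xfff,
		.socket = (handle >> 12) & 0xf,
		.controller = (handle >> 8) & 0xf,
		.channel = (handle >> 4) & 0xf,
		.dimm = handle & 0xf,
	};

	return f;
}
```

For example, decoding the handle 0x0123 yields node 0, socket 0, memory
controller 1, channel 2, DIMM 3, and encoding those fields again returns
0x0123.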

2015-06-25 09:46:31

by Dan Williams

Subject: [PATCH v2 06/17] fs/block_dev.c: skip rw_page if bdev has integrity

From: Vishal Verma <[email protected]>

If a block device has bio integrity enabled, rw_page will bypass the
integrity payload, which is undesirable. Skip rw_page if this is the
case.

Currently brd and zram provide rw_page, and the proposed 'nd' drivers
will too.

Cc: Jens Axboe <[email protected]>
Cc: Martin K. Petersen <[email protected]>
Suggested-by: Matthew Wilcox <[email protected]>
Signed-off-by: Vishal Verma <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
fs/block_dev.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index c7e4163ede87..054ef1bbb821 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -376,7 +376,7 @@ int bdev_read_page(struct block_device *bdev, sector_t sector,
struct page *page)
{
const struct block_device_operations *ops = bdev->bd_disk->fops;
- if (!ops->rw_page)
+ if (!ops->rw_page || bdev_get_integrity(bdev))
return -EOPNOTSUPP;
return ops->rw_page(bdev, sector + get_start_sect(bdev), page, READ);
}
@@ -407,7 +407,7 @@ int bdev_write_page(struct block_device *bdev, sector_t sector,
int result;
int rw = (wbc->sync_mode == WB_SYNC_ALL) ? WRITE_SYNC : WRITE;
const struct block_device_operations *ops = bdev->bd_disk->fops;
- if (!ops->rw_page)
+ if (!ops->rw_page || bdev_get_integrity(bdev))
return -EOPNOTSUPP;
set_page_writeback(page);
result = ops->rw_page(bdev, sector + get_start_sect(bdev), page, rw);
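
The effect of the two hunks above can be modelled outside the kernel. The
sketch below is an illustration only: struct model_bdev and every function
in it are hypothetical user-space stand-ins, with has_integrity playing the
role of a non-NULL bdev_get_integrity() result. It shows the decision the
patch introduces, namely that the rw_page fast path is refused both when
the driver lacks it and when the device carries an integrity profile.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct model_bdev;
typedef int (*model_rw_page_fn)(struct model_bdev *bdev, unsigned long sector);

/* Hypothetical stand-in for a block device and its operations. */
struct model_bdev {
	model_rw_page_fn rw_page;  /* NULL when the driver has no fast path */
	bool has_integrity;        /* models bdev_get_integrity() != NULL */
	unsigned long start_sect;  /* models get_start_sect() */
	unsigned long last_sector; /* records what the driver was asked for */
};

/* A trivial driver callback that only records the absolute sector. */
static int model_driver_rw_page(struct model_bdev *bdev, unsigned long sector)
{
	bdev->last_sector = sector;
	return 0;
}

/* Same guard as the patched bdev_read_page(). */
static int model_bdev_read_page(struct model_bdev *bdev, unsigned long sector)
{
	if (!bdev->rw_page || bdev->has_integrity)
		return -EOPNOTSUPP;
	return bdev->rw_page(bdev, sector + bdev->start_sect);
}
```

In this model a caller that receives -EOPNOTSUPP is expected to fall back
to the regular bio submission path, which is where an integrity payload
would be attached.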

2015-06-25 09:46:21

by Dan Williams

Subject: [PATCH v2 07/17] libnvdimm, btt: add support for blk integrity

From: Vishal Verma <[email protected]>

Support multiple block sizes (sector + metadata) using the blk integrity
framework. This registers a new integrity template that defines the
protection information tuple size based on the configured metadata size,
and simply acts as a passthrough for protection information generated by
another layer. The metadata is written to the storage as-is, and read back
with each sector.

Signed-off-by: Vishal Verma <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
drivers/nvdimm/btt.c | 129 +++++++++++++++++++++++++++++++++++++++------
drivers/nvdimm/btt.h | 2 -
drivers/nvdimm/btt_devs.c | 3 +
drivers/nvdimm/core.c | 37 +++++++++++++
drivers/nvdimm/nd.h | 1
5 files changed, 154 insertions(+), 18 deletions(-)

diff --git a/drivers/nvdimm/btt.c b/drivers/nvdimm/btt.c
index 7ae38aac2c25..18a2463c2300 100644
--- a/drivers/nvdimm/btt.c
+++ b/drivers/nvdimm/btt.c
@@ -837,6 +837,11 @@ static int btt_meta_init(struct btt *btt)
return ret;
}

+static u32 btt_meta_size(struct btt *btt)
+{
+ return btt->lbasize - btt->sector_size;
+}
+
/*
* This function calculates the arena in which the given LBA lies
* by doing a linear walk. This is acceptable since we expect only
@@ -921,8 +926,63 @@ static void zero_fill_data(struct page *page, unsigned int off, u32 len)
kunmap_atomic(mem);
}

-static int btt_read_pg(struct btt *btt, struct page *page, unsigned int off,
- sector_t sector, unsigned int len)
+#ifdef CONFIG_BLK_DEV_INTEGRITY
+static int btt_rw_integrity(struct btt *btt, struct bio_integrity_payload *bip,
+ struct arena_info *arena, u32 postmap, int rw)
+{
+ unsigned int len = btt_meta_size(btt);
+ u64 meta_nsoff;
+ int ret = 0;
+
+ if (bip == NULL)
+ return 0;
+
+ meta_nsoff = to_namespace_offset(arena, postmap) + btt->sector_size;
+
+ while (len) {
+ unsigned int cur_len;
+ struct bio_vec bv;
+ void *mem;
+
+ bv = bvec_iter_bvec(bip->bip_vec, bip->bip_iter);
+ /*
+ * The 'bv' obtained from bvec_iter_bvec has its .bv_len and
+ * .bv_offset already adjusted for iter->bi_bvec_done, and we
+ * can use those directly
+ */
+
+ cur_len = min(len, bv.bv_len);
+ mem = kmap_atomic(bv.bv_page);
+ if (rw)
+ ret = arena_write_bytes(arena, meta_nsoff,
+ mem + bv.bv_offset, cur_len);
+ else
+ ret = arena_read_bytes(arena, meta_nsoff,
+ mem + bv.bv_offset, cur_len);
+
+ kunmap_atomic(mem);
+ if (ret)
+ return ret;
+
+ len -= cur_len;
+ meta_nsoff += cur_len;
+ bvec_iter_advance(bip->bip_vec, &bip->bip_iter, cur_len);
+ }
+
+ return ret;
+}
+
+#else /* CONFIG_BLK_DEV_INTEGRITY */
+static int btt_rw_integrity(struct btt *btt, struct bio_integrity_payload *bip,
+ struct arena_info *arena, u32 postmap, int rw)
+{
+ return 0;
+}
+#endif
+
+static int btt_read_pg(struct btt *btt, struct bio_integrity_payload *bip,
+ struct page *page, unsigned int off, sector_t sector,
+ unsigned int len)
{
int ret = 0;
int t_flag, e_flag;
@@ -984,6 +1044,12 @@ static int btt_read_pg(struct btt *btt, struct page *page, unsigned int off,
if (ret)
goto out_rtt;

+ if (bip) {
+ ret = btt_rw_integrity(btt, bip, arena, postmap, READ);
+ if (ret)
+ goto out_rtt;
+ }
+
arena->rtt[lane] = RTT_INVALID;
nd_region_release_lane(btt->nd_region, lane);

@@ -1001,8 +1067,9 @@ static int btt_read_pg(struct btt *btt, struct page *page, unsigned int off,
return ret;
}

-static int btt_write_pg(struct btt *btt, sector_t sector, struct page *page,
- unsigned int off, unsigned int len)
+static int btt_write_pg(struct btt *btt, struct bio_integrity_payload *bip,
+ sector_t sector, struct page *page, unsigned int off,
+ unsigned int len)
{
int ret = 0;
struct arena_info *arena = NULL;
@@ -1036,12 +1103,19 @@ static int btt_write_pg(struct btt *btt, sector_t sector, struct page *page,
if (new_postmap >= arena->internal_nlba) {
ret = -EIO;
goto out_lane;
- } else
- ret = btt_data_write(arena, new_postmap, page,
- off, cur_len);
+ }
+
+ ret = btt_data_write(arena, new_postmap, page, off, cur_len);
if (ret)
goto out_lane;

+ if (bip) {
+ ret = btt_rw_integrity(btt, bip, arena, new_postmap,
+ WRITE);
+ if (ret)
+ goto out_lane;
+ }
+
lock_map(arena, premap);
ret = btt_map_read(arena, premap, &old_postmap, NULL, NULL);
if (ret)
@@ -1081,18 +1155,18 @@ static int btt_write_pg(struct btt *btt, sector_t sector, struct page *page,
return ret;
}

-static int btt_do_bvec(struct btt *btt, struct page *page,
- unsigned int len, unsigned int off, int rw,
- sector_t sector)
+static int btt_do_bvec(struct btt *btt, struct bio_integrity_payload *bip,
+ struct page *page, unsigned int len, unsigned int off,
+ int rw, sector_t sector)
{
int ret;

if (rw == READ) {
- ret = btt_read_pg(btt, page, off, sector, len);
+ ret = btt_read_pg(btt, bip, page, off, sector, len);
flush_dcache_page(page);
} else {
flush_dcache_page(page);
- ret = btt_write_pg(btt, sector, page, off, len);
+ ret = btt_write_pg(btt, bip, sector, page, off, len);
}

return ret;
@@ -1100,11 +1174,23 @@ static int btt_do_bvec(struct btt *btt, struct page *page,

static void btt_make_request(struct request_queue *q, struct bio *bio)
{
+ struct bio_integrity_payload *bip = bio_integrity(bio);
struct btt *btt = q->queuedata;
struct bvec_iter iter;
struct bio_vec bvec;
int err = 0, rw;

+ /*
+ * bio_integrity_enabled also checks if the bio already has an
+ * integrity payload attached. If it does, we *don't* do a
+ * bio_integrity_prep here - the payload has been generated by
+ * another kernel subsystem, and we just pass it through.
+ */
+ if (bio_integrity_enabled(bio) && bio_integrity_prep(bio)) {
+ err = -EIO;
+ goto out;
+ }
+
rw = bio_data_dir(bio);
bio_for_each_segment(bvec, bio, iter) {
unsigned int len = bvec.bv_len;
@@ -1115,7 +1201,7 @@ static void btt_make_request(struct request_queue *q, struct bio *bio)
BUG_ON(len < btt->sector_size);
BUG_ON(len % btt->sector_size);

- err = btt_do_bvec(btt, bvec.bv_page, len, bvec.bv_offset,
+ err = btt_do_bvec(btt, bip, bvec.bv_page, len, bvec.bv_offset,
rw, iter.bi_sector);
if (err) {
dev_info(&btt->nd_btt->dev,
@@ -1135,7 +1221,7 @@ static int btt_rw_page(struct block_device *bdev, sector_t sector,
{
struct btt *btt = bdev->bd_disk->private_data;

- btt_do_bvec(btt, page, PAGE_CACHE_SIZE, 0, rw, sector);
+ btt_do_bvec(btt, NULL, page, PAGE_CACHE_SIZE, 0, rw, sector);
page_endio(page, rw & WRITE, 0);
return 0;
}
@@ -1188,15 +1274,26 @@ static int btt_blk_init(struct btt *btt)
queue_flag_set_unlocked(QUEUE_FLAG_NONROT, btt->btt_queue);
btt->btt_queue->queuedata = btt;

- set_capacity(btt->btt_disk,
- btt->nlba * btt->sector_size >> SECTOR_SHIFT);
+ set_capacity(btt->btt_disk, 0);
add_disk(btt->btt_disk);
+ if (btt_meta_size(btt)) {
+ int rc = nd_integrity_init(btt->btt_disk, btt_meta_size(btt));
+
+ if (rc) {
+ del_gendisk(btt->btt_disk);
+ put_disk(btt->btt_disk);
+ blk_cleanup_queue(btt->btt_queue);
+ return rc;
+ }
+ }
+ set_capacity(btt->btt_disk, btt->nlba * btt->sector_size >> 9);

return 0;
}

static void btt_blk_cleanup(struct btt *btt)
{
+ blk_integrity_unregister(btt->btt_disk);
del_gendisk(btt->btt_disk);
put_disk(btt->btt_disk);
blk_cleanup_queue(btt->btt_queue);
diff --git a/drivers/nvdimm/btt.h b/drivers/nvdimm/btt.h
index 8c95a7792c3e..2caa0ef7e67a 100644
--- a/drivers/nvdimm/btt.h
+++ b/drivers/nvdimm/btt.h
@@ -31,7 +31,7 @@
#define ARENA_MAX_SIZE (1ULL << 39) /* 512 GB */
#define RTT_VALID (1UL << 31)
#define RTT_INVALID 0
-#define INT_LBASIZE_ALIGNMENT 256
+#define INT_LBASIZE_ALIGNMENT 64
#define BTT_PG_SIZE 4096
#define BTT_DEFAULT_NFREE ND_MAX_LANES
#define LOG_SEQ_INIT 1
diff --git a/drivers/nvdimm/btt_devs.c b/drivers/nvdimm/btt_devs.c
index cde9f075fcea..b6724cfbcfca 100644
--- a/drivers/nvdimm/btt_devs.c
+++ b/drivers/nvdimm/btt_devs.c
@@ -104,7 +104,8 @@ struct nd_btt *to_nd_btt(struct device *dev)
}
EXPORT_SYMBOL(to_nd_btt);

-static const unsigned long btt_lbasize_supported[] = { 512, 4096, 0 };
+static const unsigned long btt_lbasize_supported[] = { 512, 520, 528,
+ 4096, 4104, 4160, 4224, 0 };

static ssize_t sector_size_show(struct device *dev,
struct device_attribute *attr, char *buf)
diff --git a/drivers/nvdimm/core.c b/drivers/nvdimm/core.c
index dd824d7c2669..1d96b9a6e4cc 100644
--- a/drivers/nvdimm/core.c
+++ b/drivers/nvdimm/core.c
@@ -13,6 +13,7 @@
#include <linux/libnvdimm.h>
#include <linux/export.h>
#include <linux/module.h>
+#include <linux/blkdev.h>
#include <linux/device.h>
#include <linux/ctype.h>
#include <linux/ndctl.h>
@@ -361,6 +362,42 @@ void nvdimm_bus_unregister(struct nvdimm_bus *nvdimm_bus)
}
EXPORT_SYMBOL_GPL(nvdimm_bus_unregister);

+#ifdef CONFIG_BLK_DEV_INTEGRITY
+static int nd_pi_nop_generate_verify(struct blk_integrity_iter *iter)
+{
+ return 0;
+}
+
+int nd_integrity_init(struct gendisk *disk, unsigned long meta_size)
+{
+ struct blk_integrity integrity = {
+ .name = "ND-PI-NOP",
+ .generate_fn = nd_pi_nop_generate_verify,
+ .verify_fn = nd_pi_nop_generate_verify,
+ .tuple_size = meta_size,
+ .tag_size = meta_size,
+ };
+ int ret;
+
+ ret = blk_integrity_register(disk, &integrity);
+ if (ret)
+ return ret;
+
+ blk_queue_max_integrity_segments(disk->queue, 1);
+
+ return 0;
+}
+EXPORT_SYMBOL(nd_integrity_init);
+
+#else /* CONFIG_BLK_DEV_INTEGRITY */
+int nd_integrity_init(struct gendisk *disk, unsigned long meta_size)
+{
+ return 0;
+}
+EXPORT_SYMBOL(nd_integrity_init);
+
+#endif
+
static __init int libnvdimm_init(void)
{
int rc;
diff --git a/drivers/nvdimm/nd.h b/drivers/nvdimm/nd.h
index d7fc6ef4e4d8..6a969a885d70 100644
--- a/drivers/nvdimm/nd.h
+++ b/drivers/nvdimm/nd.h
@@ -135,6 +135,7 @@ enum nd_async_mode {
ND_ASYNC,
};

+int nd_integrity_init(struct gendisk *disk, unsigned long meta_size);
void wait_nvdimm_bus_probe_idle(struct device *dev);
void nd_device_register(struct device *dev);
void nd_device_unregister(struct device *dev, enum nd_async_mode mode);
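
The size arithmetic behind the patch above can be tried in user space. The
sketch below is an illustration under stated assumptions, not kernel code:
all sketch_* names are hypothetical. sketch_meta_size() mirrors
btt_meta_size() and the alignment value is the one the patch sets in
btt.h. Two things are assumed rather than shown in the patch: that the
data portion is 4096 bytes for an lbasize of 4096 or more and 512 bytes
otherwise, and that each on-media block slot is the lbasize rounded up to
INT_LBASIZE_ALIGNMENT. Offsets are relative to the start of the data area.

```c
#include <stdint.h>

#define INT_LBASIZE_ALIGNMENT 64 /* value after the btt.h change */

/* Assumption: data portion is 4096 bytes for lbasize >= 4096, else 512. */
static uint32_t sketch_sector_size(uint32_t lbasize)
{
	return lbasize >= 4096 ? 4096 : 512;
}

/* Mirrors btt_meta_size(): metadata bytes stored with each sector. */
static uint32_t sketch_meta_size(uint32_t lbasize)
{
	return lbasize - sketch_sector_size(lbasize);
}

/* Assumption: the on-media slot is lbasize rounded up to the alignment. */
static uint32_t sketch_internal_lbasize(uint32_t lbasize)
{
	return (lbasize + INT_LBASIZE_ALIGNMENT - 1)
		/ INT_LBASIZE_ALIGNMENT * INT_LBASIZE_ALIGNMENT;
}

/* Metadata follows the data within a slot, as in btt_rw_integrity(). */
static uint64_t sketch_meta_offset(uint64_t block, uint32_t lbasize)
{
	return block * sketch_internal_lbasize(lbasize)
		+ sketch_sector_size(lbasize);
}
```

Under these assumptions an lbasize of 520 carries 8 bytes of metadata in a
576-byte slot, and an lbasize of 4160 carries 64 bytes in a 4160-byte slot.
The metadata size is also the value passed to nd_integrity_init() as the
tuple size.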

2015-06-25 09:43:22

by Dan Williams

Subject: [PATCH v2 08/17] libnvdimm, blk: add support for blk integrity

From: Vishal Verma <[email protected]>

Support multiple block sizes (sector + metadata) for nd_blk in the
same way as done for the BTT. Add the idea of an 'internal' lbasize,
which is properly aligned and padded, and store metadata in this space.

Signed-off-by: Vishal Verma <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
drivers/nvdimm/blk.c | 174 ++++++++++++++++++++++++++++++++++-----
drivers/nvdimm/btt.h | 1
drivers/nvdimm/core.c | 3 +
drivers/nvdimm/namespace_devs.c | 3 -
drivers/nvdimm/nd.h | 1
5 files changed, 159 insertions(+), 23 deletions(-)

diff --git a/drivers/nvdimm/blk.c b/drivers/nvdimm/blk.c
index 9ac0c266c15c..5c44e067652f 100644
--- a/drivers/nvdimm/blk.c
+++ b/drivers/nvdimm/blk.c
@@ -27,10 +27,17 @@ struct nd_blk_device {
struct nd_namespace_blk *nsblk;
struct nd_blk_region *ndbr;
size_t disk_size;
+ u32 sector_size;
+ u32 internal_lbasize;
};

static int nd_blk_major;

+static u32 nd_blk_meta_size(struct nd_blk_device *blk_dev)
+{
+ return blk_dev->nsblk->lbasize - blk_dev->sector_size;
+}
+
static resource_size_t to_dev_offset(struct nd_namespace_blk *nsblk,
resource_size_t ns_offset, unsigned int len)
{
@@ -52,41 +59,145 @@ static resource_size_t to_dev_offset(struct nd_namespace_blk *nsblk,
return SIZE_MAX;
}

+#ifdef CONFIG_BLK_DEV_INTEGRITY
+static int nd_blk_rw_integrity(struct nd_blk_device *blk_dev,
+ struct bio_integrity_payload *bip, u64 lba,
+ int rw)
+{
+ unsigned int len = nd_blk_meta_size(blk_dev);
+ resource_size_t dev_offset, ns_offset;
+ struct nd_namespace_blk *nsblk;
+ struct nd_blk_region *ndbr;
+ int err = 0;
+
+ nsblk = blk_dev->nsblk;
+ ndbr = blk_dev->ndbr;
+ ns_offset = lba * blk_dev->internal_lbasize + blk_dev->sector_size;
+ dev_offset = to_dev_offset(nsblk, ns_offset, len);
+ if (dev_offset == SIZE_MAX)
+ return -EIO;
+
+ while (len) {
+ unsigned int cur_len;
+ struct bio_vec bv;
+ void *iobuf;
+
+ bv = bvec_iter_bvec(bip->bip_vec, bip->bip_iter);
+ /*
+ * The 'bv' obtained from bvec_iter_bvec has its .bv_len and
+ * .bv_offset already adjusted for iter->bi_bvec_done, and we
+ * can use those directly
+ */
+
+ cur_len = min(len, bv.bv_len);
+ iobuf = kmap_atomic(bv.bv_page);
+ err = ndbr->do_io(ndbr, dev_offset, iobuf + bv.bv_offset,
+ cur_len, rw);
+ kunmap_atomic(iobuf);
+ if (err)
+ return err;
+
+ len -= cur_len;
+ dev_offset += cur_len;
+ bvec_iter_advance(bip->bip_vec, &bip->bip_iter, cur_len);
+ }
+
+ return err;
+}
+
+#else /* CONFIG_BLK_DEV_INTEGRITY */
+static int nd_blk_rw_integrity(struct nd_blk_device *blk_dev,
+ struct bio_integrity_payload *bip, u64 lba,
+ int rw)
+{
+ return 0;
+}
+#endif
+
+static int nd_blk_do_bvec(struct nd_blk_device *blk_dev,
+ struct bio_integrity_payload *bip, struct page *page,
+ unsigned int len, unsigned int off, int rw,
+ sector_t sector)
+{
+ struct nd_blk_region *ndbr = blk_dev->ndbr;
+ resource_size_t dev_offset, ns_offset;
+ int err = 0;
+ void *iobuf;
+ u64 lba;
+
+ while (len) {
+ unsigned int cur_len;
+
+ /*
+ * If we don't have an integrity payload, we don't have to
+ * split the bvec into sectors, as this would cause unnecessary
+ * Block Window setup/move steps. the do_io routine is capable
+ * of handling len <= PAGE_SIZE.
+ */
+ cur_len = bip ? min(len, blk_dev->sector_size) : len;
+
+ lba = div_u64(sector << SECTOR_SHIFT, blk_dev->sector_size);
+ ns_offset = lba * blk_dev->internal_lbasize;
+ dev_offset = to_dev_offset(blk_dev->nsblk, ns_offset, cur_len);
+ if (dev_offset == SIZE_MAX)
+ return -EIO;
+
+ iobuf = kmap_atomic(page);
+ err = ndbr->do_io(ndbr, dev_offset, iobuf + off, cur_len, rw);
+ kunmap_atomic(iobuf);
+ if (err)
+ return err;
+
+ if (bip) {
+ err = nd_blk_rw_integrity(blk_dev, bip, lba, rw);
+ if (err)
+ return err;
+ }
+ len -= cur_len;
+ off += cur_len;
+ sector += blk_dev->sector_size >> SECTOR_SHIFT;
+ }
+
+ return err;
+}
+
static void nd_blk_make_request(struct request_queue *q, struct bio *bio)
{
struct block_device *bdev = bio->bi_bdev;
struct gendisk *disk = bdev->bd_disk;
- struct nd_namespace_blk *nsblk;
+ struct bio_integrity_payload *bip;
struct nd_blk_device *blk_dev;
- struct nd_blk_region *ndbr;
struct bvec_iter iter;
struct bio_vec bvec;
int err = 0, rw;

+ /*
+ * bio_integrity_enabled also checks if the bio already has an
+ * integrity payload attached. If it does, we *don't* do a
+ * bio_integrity_prep here - the payload has been generated by
+ * another kernel subsystem, and we just pass it through.
+ */
+ if (bio_integrity_enabled(bio) && bio_integrity_prep(bio)) {
+ err = -EIO;
+ goto out;
+ }
+
+ bip = bio_integrity(bio);
blk_dev = disk->private_data;
- nsblk = blk_dev->nsblk;
- ndbr = blk_dev->ndbr;
rw = bio_data_dir(bio);
bio_for_each_segment(bvec, bio, iter) {
unsigned int len = bvec.bv_len;
- resource_size_t dev_offset;
- void *iobuf;

BUG_ON(len > PAGE_SIZE);
-
- dev_offset = to_dev_offset(nsblk,
- iter.bi_sector << SECTOR_SHIFT, len);
- if (dev_offset == SIZE_MAX) {
- err = -EIO;
+ err = nd_blk_do_bvec(blk_dev, bip, bvec.bv_page, len,
+ bvec.bv_offset, rw, iter.bi_sector);
+ if (err) {
+ dev_info(&blk_dev->nsblk->common.dev,
+ "io error in %s sector %lld, len %d,\n",
+ (rw == READ) ? "READ" : "WRITE",
+ (unsigned long long) iter.bi_sector, len);
goto out;
}
-
- iobuf = kmap_atomic(bvec.bv_page);
- err = ndbr->do_io(ndbr, dev_offset, iobuf + bvec.bv_offset,
- len, rw);
- kunmap_atomic(iobuf);
- if (err)
- goto out;
}

out:
@@ -121,8 +232,12 @@ static const struct block_device_operations nd_blk_fops = {
static int nd_blk_attach_disk(struct nd_namespace_common *ndns,
struct nd_blk_device *blk_dev)
{
- struct nd_namespace_blk *nsblk = to_nd_namespace_blk(&ndns->dev);
+ resource_size_t available_disk_size;
struct gendisk *disk;
+ u64 internal_nlba;
+
+ internal_nlba = div_u64(blk_dev->disk_size, blk_dev->internal_lbasize);
+ available_disk_size = internal_nlba * blk_dev->sector_size;

blk_dev->queue = blk_alloc_queue(GFP_KERNEL);
if (!blk_dev->queue)
@@ -131,7 +246,7 @@ static int nd_blk_attach_disk(struct nd_namespace_common *ndns,
blk_queue_make_request(blk_dev->queue, nd_blk_make_request);
blk_queue_max_hw_sectors(blk_dev->queue, UINT_MAX);
blk_queue_bounce_limit(blk_dev->queue, BLK_BOUNCE_ANY);
- blk_queue_logical_block_size(blk_dev->queue, nsblk->lbasize);
+ blk_queue_logical_block_size(blk_dev->queue, blk_dev->sector_size);
queue_flag_set_unlocked(QUEUE_FLAG_NONROT, blk_dev->queue);

disk = blk_dev->disk = alloc_disk(0);
@@ -148,15 +263,28 @@ static int nd_blk_attach_disk(struct nd_namespace_common *ndns,
disk->queue = blk_dev->queue;
disk->flags = GENHD_FL_EXT_DEVT;
nvdimm_namespace_disk_name(ndns, disk->disk_name);
- set_capacity(disk, blk_dev->disk_size >> SECTOR_SHIFT);
+ set_capacity(disk, 0);
add_disk(disk);

+ if (nd_blk_meta_size(blk_dev)) {
+ int rc = nd_integrity_init(disk, nd_blk_meta_size(blk_dev));
+
+ if (rc) {
+ del_gendisk(disk);
+ put_disk(disk);
+ blk_cleanup_queue(blk_dev->queue);
+ return rc;
+ }
+ }
+
+ set_capacity(disk, available_disk_size >> SECTOR_SHIFT);
return 0;
}

static int nd_blk_probe(struct device *dev)
{
struct nd_namespace_common *ndns;
+ struct nd_namespace_blk *nsblk;
struct nd_blk_device *blk_dev;
int rc;

@@ -168,9 +296,13 @@ static int nd_blk_probe(struct device *dev)
if (!blk_dev)
return -ENOMEM;

+ nsblk = to_nd_namespace_blk(&ndns->dev);
blk_dev->disk_size = nvdimm_namespace_capacity(ndns);
blk_dev->ndbr = to_nd_blk_region(dev->parent);
blk_dev->nsblk = to_nd_namespace_blk(&ndns->dev);
+ blk_dev->internal_lbasize = roundup(nsblk->lbasize,
+ INT_LBASIZE_ALIGNMENT);
+ blk_dev->sector_size = ((nsblk->lbasize >= 4096) ? 4096 : 512);
dev_set_drvdata(dev, blk_dev);

ndns->rw_bytes = nd_blk_rw_bytes;
diff --git a/drivers/nvdimm/btt.h b/drivers/nvdimm/btt.h
index 2caa0ef7e67a..75b0d80a6bd9 100644
--- a/drivers/nvdimm/btt.h
+++ b/drivers/nvdimm/btt.h
@@ -31,7 +31,6 @@
#define ARENA_MAX_SIZE (1ULL << 39) /* 512 GB */
#define RTT_VALID (1UL << 31)
#define RTT_INVALID 0
-#define INT_LBASIZE_ALIGNMENT 64
#define BTT_PG_SIZE 4096
#define BTT_DEFAULT_NFREE ND_MAX_LANES
#define LOG_SEQ_INIT 1
diff --git a/drivers/nvdimm/core.c b/drivers/nvdimm/core.c
index 1d96b9a6e4cc..4288169432de 100644
--- a/drivers/nvdimm/core.c
+++ b/drivers/nvdimm/core.c
@@ -379,6 +379,9 @@ int nd_integrity_init(struct gendisk *disk, unsigned long meta_size)
};
int ret;

+ if (meta_size == 0)
+ return 0;
+
ret = blk_integrity_register(disk, &integrity);
if (ret)
return ret;
diff --git a/drivers/nvdimm/namespace_devs.c b/drivers/nvdimm/namespace_devs.c
index 1ce1e70de44a..27d69bd3b4d6 100644
--- a/drivers/nvdimm/namespace_devs.c
+++ b/drivers/nvdimm/namespace_devs.c
@@ -1059,7 +1059,8 @@ static ssize_t resource_show(struct device *dev,
}
static DEVICE_ATTR_RO(resource);

-static const unsigned long ns_lbasize_supported[] = { 512, 0 };
+static const unsigned long ns_lbasize_supported[] = { 512, 520, 528,
+ 4096, 4104, 4160, 4224, 0 };

static ssize_t sector_size_show(struct device *dev,
struct device_attribute *attr, char *buf)
diff --git a/drivers/nvdimm/nd.h b/drivers/nvdimm/nd.h
index 6a969a885d70..e73c34dcd935 100644
--- a/drivers/nvdimm/nd.h
+++ b/drivers/nvdimm/nd.h
@@ -27,6 +27,7 @@ enum {
*/
ND_MAX_LANES = 256,
SECTOR_SHIFT = 9,
+ INT_LBASIZE_ALIGNMENT = 64,
};

struct nvdimm_drvdata {
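
The sizing arithmetic that nd_blk_probe() and nd_blk_attach_disk() apply
in the patch above can be checked with a small userspace sketch. This is
only a model under stated assumptions: the helper names below are
hypothetical and do not exist in the driver; only the constants and the
formulas are taken from the diff.

```c
#include <assert.h>
#include <stdint.h>

/* Constants as defined in drivers/nvdimm/nd.h by the patch above. */
enum { INT_LBASIZE_ALIGNMENT = 64, SECTOR_SHIFT = 9 };

/* Hypothetical helper: on-media stride of one LBA (data + metadata),
 * i.e. lbasize rounded up to a multiple of INT_LBASIZE_ALIGNMENT. */
static inline uint64_t internal_lbasize(uint64_t lbasize)
{
	assert(lbasize > 0);
	return (lbasize + INT_LBASIZE_ALIGNMENT - 1)
		/ INT_LBASIZE_ALIGNMENT * INT_LBASIZE_ALIGNMENT;
}

/* Hypothetical helper: sector size exposed to the block layer. */
static inline uint64_t sector_size(uint64_t lbasize)
{
	return lbasize >= 4096 ? 4096 : 512;
}

/* Hypothetical helper: disk capacity in 512-byte sectors. */
static inline uint64_t capacity_sectors(uint64_t disk_size, uint64_t lbasize)
{
	uint64_t internal_nlba = disk_size / internal_lbasize(lbasize);

	return (internal_nlba * sector_size(lbasize)) >> SECTOR_SHIFT;
}

/* Hypothetical helper: namespace offset of the data for a given lba. */
static inline uint64_t data_offset(uint64_t lba, uint64_t lbasize)
{
	return lba * internal_lbasize(lbasize);
}

/* Hypothetical helper: namespace offset of the integrity metadata,
 * which is stored directly after the sector's data. */
static inline uint64_t meta_offset(uint64_t lba, uint64_t lbasize)
{
	return data_offset(lba, lbasize) + sector_size(lbasize);
}
```

For example, a namespace with lbasize 520 uses a 576-byte internal
stride and exposes 512-byte sectors, so a 576000-byte namespace yields
1000 sectors of capacity.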

2015-06-25 09:43:40

by Dan Williams

Subject: [PATCH v2 09/17] libnvdimm, pmem: fix up max_hw_sectors

There is no hardware limit to enforce on the size of the i/o that can be passed
to an nvdimm block device, so set it to UINT_MAX.

Reviewed-by: Vishal Verma <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
drivers/nvdimm/pmem.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 7346054bccbb..d29a42adb95a 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -167,7 +167,7 @@ static int pmem_attach_disk(struct nd_namespace_common *ndns,
return -ENOMEM;

blk_queue_make_request(pmem->pmem_queue, pmem_make_request);
- blk_queue_max_hw_sectors(pmem->pmem_queue, 1024);
+ blk_queue_max_hw_sectors(pmem->pmem_queue, UINT_MAX);
blk_queue_bounce_limit(pmem->pmem_queue, BLK_BOUNCE_ANY);

disk = alloc_disk(0);

2015-06-25 09:46:07

by Dan Williams

Subject: [PATCH v2 10/17] pmem: make_request cleanups

Various cleanups:

1/ Kill the BUG_ON since we've already told the block layer we don't
support DISCARD on all these drivers.

2/ Kill the 'rw' variable, no need to cache it.

3/ Kill the local 'sector' variable. bio_for_each_segment() is already
advancing the iterator's sector number by the bio_vec length.

4/ Kill the check for accessing past the end of the device;
generic_make_request_checks() already does that.
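
Item 3 can be illustrated with a userspace model. This is a sketch
under assumptions, not the block layer's implementation: 'struct seg'
and segment_start_sector() are hypothetical stand-ins, and the only
behaviour modelled is that the sector advances by each segment's
length in 512-byte units, which is the bookkeeping the removed local
'sector' variable duplicated.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the length field of a struct bio_vec. */
struct seg {
	uint32_t len;	/* segment length in bytes */
};

/*
 * Sector at which segment 'idx' starts, given the bio's starting
 * sector. Each preceding segment advances the position by its length
 * in 512-byte sectors (len >> 9), mirroring the arithmetic of the
 * deleted 'sector += bvec.bv_len >> 9'.
 */
static inline uint64_t segment_start_sector(uint64_t start,
		const struct seg *segs, size_t nsegs, size_t idx)
{
	uint64_t sector = start;
	size_t i;

	assert(idx < nsegs);
	for (i = 0; i < idx; i++)
		sector += segs[i].len >> 9;
	return sector;
}
```

With segments of 4096, 512, and 1024 bytes starting at sector 8, the
segments begin at sectors 8, 16, and 17 respectively.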

Suggested-by: Christoph Hellwig <[email protected]>
[hch: kill access past end of the device check]
Reviewed-by: Vishal Verma <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
drivers/nvdimm/pmem.c | 26 +++++---------------------
1 file changed, 5 insertions(+), 21 deletions(-)

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index d29a42adb95a..e846a627ebdf 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -58,31 +58,15 @@ static void pmem_do_bvec(struct pmem_device *pmem, struct page *page,

static void pmem_make_request(struct request_queue *q, struct bio *bio)
{
- struct block_device *bdev = bio->bi_bdev;
- struct pmem_device *pmem = bdev->bd_disk->private_data;
- int rw;
struct bio_vec bvec;
- sector_t sector;
struct bvec_iter iter;
- int err = 0;
-
- if (bio_end_sector(bio) > get_capacity(bdev->bd_disk)) {
- err = -EIO;
- goto out;
- }
-
- BUG_ON(bio->bi_rw & REQ_DISCARD);
+ struct block_device *bdev = bio->bi_bdev;
+ struct pmem_device *pmem = bdev->bd_disk->private_data;

- rw = bio_data_dir(bio);
- sector = bio->bi_iter.bi_sector;
- bio_for_each_segment(bvec, bio, iter) {
+ bio_for_each_segment(bvec, bio, iter)
pmem_do_bvec(pmem, bvec.bv_page, bvec.bv_len, bvec.bv_offset,
- rw, sector);
- sector += bvec.bv_len >> 9;
- }
-
-out:
- bio_endio(bio, err);
+ bio_data_dir(bio), iter.bi_sector);
+ bio_endio(bio, 0);
}

static int pmem_rw_page(struct block_device *bdev, sector_t sector,

2015-06-25 09:45:42

by Dan Williams

Subject: [PATCH v2 11/17] libnvdimm: enable iostat

This is disabled by default as the overhead is prohibitive, but if the
user takes the action to turn it on we'll oblige.
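
The gating pattern this patch adds (do nothing unless accounting is
enabled, otherwise bracket the i/o with start/end bookkeeping) can be
sketched in userspace. Everything below is a hypothetical model:
'struct stats' and the two functions are illustrative names, and the
counters only loosely correspond to the per-partition statistics the
real helpers update.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a disk's i/o statistics. */
struct stats {
	bool enabled;		/* models the queue's io_stat flag */
	unsigned long ios;	/* completed-or-started request count */
	unsigned long sectors;	/* sectors transferred */
	unsigned long ticks;	/* accumulated i/o duration */
	int in_flight;		/* requests currently being serviced */
};

/* Returns false (and touches nothing) when accounting is disabled. */
static inline bool iostat_start(struct stats *s, unsigned long now,
		unsigned long nr_sectors, unsigned long *start)
{
	if (!s->enabled)
		return false;

	*start = now;
	s->ios++;
	s->sectors += nr_sectors;
	s->in_flight++;
	return true;
}

/* Only called when iostat_start() returned true. */
static inline void iostat_end(struct stats *s, unsigned long now,
		unsigned long start)
{
	assert(s->in_flight > 0);
	s->ticks += now - start;
	s->in_flight--;
}
```

In this model the disabled path costs a single flag test, which is the
point of splitting the check from the bookkeeping.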

Reviewed-by: Vishal Verma <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
drivers/nvdimm/blk.c | 7 ++++++-
drivers/nvdimm/btt.c | 7 ++++++-
drivers/nvdimm/core.c | 29 +++++++++++++++++++++++++++++
drivers/nvdimm/nd.h | 13 +++++++++++++
drivers/nvdimm/pmem.c | 5 +++++
5 files changed, 59 insertions(+), 2 deletions(-)

diff --git a/drivers/nvdimm/blk.c b/drivers/nvdimm/blk.c
index 5c44e067652f..96ef38ceeceb 100644
--- a/drivers/nvdimm/blk.c
+++ b/drivers/nvdimm/blk.c
@@ -168,8 +168,10 @@ static void nd_blk_make_request(struct request_queue *q, struct bio *bio)
struct bio_integrity_payload *bip;
struct nd_blk_device *blk_dev;
struct bvec_iter iter;
+ unsigned long start;
struct bio_vec bvec;
int err = 0, rw;
+ bool do_acct;

/*
* bio_integrity_enabled also checks if the bio already has an
@@ -185,6 +187,7 @@ static void nd_blk_make_request(struct request_queue *q, struct bio *bio)
bip = bio_integrity(bio);
blk_dev = disk->private_data;
rw = bio_data_dir(bio);
+ do_acct = nd_iostat_start(bio, &start);
bio_for_each_segment(bvec, bio, iter) {
unsigned int len = bvec.bv_len;

@@ -196,9 +199,11 @@ static void nd_blk_make_request(struct request_queue *q, struct bio *bio)
"io error in %s sector %lld, len %d,\n",
(rw == READ) ? "READ" : "WRITE",
(unsigned long long) iter.bi_sector, len);
- goto out;
+ break;
}
}
+ if (do_acct)
+ nd_iostat_end(bio, start);

out:
bio_endio(bio, err);
diff --git a/drivers/nvdimm/btt.c b/drivers/nvdimm/btt.c
index 18a2463c2300..c02065aed03d 100644
--- a/drivers/nvdimm/btt.c
+++ b/drivers/nvdimm/btt.c
@@ -1177,8 +1177,10 @@ static void btt_make_request(struct request_queue *q, struct bio *bio)
struct bio_integrity_payload *bip = bio_integrity(bio);
struct btt *btt = q->queuedata;
struct bvec_iter iter;
+ unsigned long start;
struct bio_vec bvec;
int err = 0, rw;
+ bool do_acct;

/*
* bio_integrity_enabled also checks if the bio already has an
@@ -1191,6 +1193,7 @@ static void btt_make_request(struct request_queue *q, struct bio *bio)
goto out;
}

+ do_acct = nd_iostat_start(bio, &start);
rw = bio_data_dir(bio);
bio_for_each_segment(bvec, bio, iter) {
unsigned int len = bvec.bv_len;
@@ -1208,9 +1211,11 @@ static void btt_make_request(struct request_queue *q, struct bio *bio)
"io error in %s sector %lld, len %d,\n",
(rw == READ) ? "READ" : "WRITE",
(unsigned long long) iter.bi_sector, len);
- goto out;
+ break;
}
}
+ if (do_acct)
+ nd_iostat_end(bio, start);

out:
bio_endio(bio, err);
diff --git a/drivers/nvdimm/core.c b/drivers/nvdimm/core.c
index 4288169432de..cb62ec6a12d0 100644
--- a/drivers/nvdimm/core.c
+++ b/drivers/nvdimm/core.c
@@ -214,6 +214,35 @@ ssize_t nd_sector_size_store(struct device *dev, const char *buf,
}
}

+void __nd_iostat_start(struct bio *bio, unsigned long *start)
+{
+ struct gendisk *disk = bio->bi_bdev->bd_disk;
+ const int rw = bio_data_dir(bio);
+ int cpu = part_stat_lock();
+
+ *start = jiffies;
+ part_round_stats(cpu, &disk->part0);
+ part_stat_inc(cpu, &disk->part0, ios[rw]);
+ part_stat_add(cpu, &disk->part0, sectors[rw], bio_sectors(bio));
+ part_inc_in_flight(&disk->part0, rw);
+ part_stat_unlock();
+}
+EXPORT_SYMBOL(__nd_iostat_start);
+
+void nd_iostat_end(struct bio *bio, unsigned long start)
+{
+ struct gendisk *disk = bio->bi_bdev->bd_disk;
+ unsigned long duration = jiffies - start;
+ const int rw = bio_data_dir(bio);
+ int cpu = part_stat_lock();
+
+ part_stat_add(cpu, &disk->part0, ticks[rw], duration);
+ part_round_stats(cpu, &disk->part0);
+ part_dec_in_flight(&disk->part0, rw);
+ part_stat_unlock();
+}
+EXPORT_SYMBOL(nd_iostat_end);
+
static ssize_t commands_show(struct device *dev,
struct device_attribute *attr, char *buf)
{
diff --git a/drivers/nvdimm/nd.h b/drivers/nvdimm/nd.h
index e73c34dcd935..cf6849b72c4f 100644
--- a/drivers/nvdimm/nd.h
+++ b/drivers/nvdimm/nd.h
@@ -13,6 +13,7 @@
#ifndef __ND_H__
#define __ND_H__
#include <linux/libnvdimm.h>
+#include <linux/blkdev.h>
#include <linux/device.h>
#include <linux/mutex.h>
#include <linux/ndctl.h>
@@ -201,5 +202,17 @@ int nvdimm_namespace_detach_btt(struct nd_namespace_common *ndns);
const char *nvdimm_namespace_disk_name(struct nd_namespace_common *ndns,
char *name);
int nd_blk_region_init(struct nd_region *nd_region);
+void __nd_iostat_start(struct bio *bio, unsigned long *start);
+static inline bool nd_iostat_start(struct bio *bio, unsigned long *start)
+{
+ struct gendisk *disk = bio->bi_bdev->bd_disk;
+
+ if (!blk_queue_io_stat(disk->queue))
+ return false;
+
+ __nd_iostat_start(bio, start);
+ return true;
+}
+void nd_iostat_end(struct bio *bio, unsigned long start);
resource_size_t nd_namespace_blk_validate(struct nd_namespace_blk *nsblk);
#endif /* __ND_H__ */
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index e846a627ebdf..09195e3b7453 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -58,14 +58,19 @@ static void pmem_do_bvec(struct pmem_device *pmem, struct page *page,

static void pmem_make_request(struct request_queue *q, struct bio *bio)
{
+ bool do_acct;
+ unsigned long start;
struct bio_vec bvec;
struct bvec_iter iter;
struct block_device *bdev = bio->bi_bdev;
struct pmem_device *pmem = bdev->bd_disk->private_data;

+ do_acct = nd_iostat_start(bio, &start);
bio_for_each_segment(bvec, bio, iter)
pmem_do_bvec(pmem, bvec.bv_page, bvec.bv_len, bvec.bv_offset,
bio_data_dir(bio), iter.bi_sector);
+ if (do_acct)
+ nd_iostat_end(bio, start);
bio_endio(bio, 0);
}

2015-06-25 09:44:05

by Dan Williams

Subject: [PATCH v2 12/17] pmem: flag pmem block devices as non-rotational

...since they are effectively SSDs as far as userspace is concerned.

Reviewed-by: Vishal Verma <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
drivers/nvdimm/pmem.c | 1 +
1 file changed, 1 insertion(+)

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 09195e3b7453..a9709db0704c 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -158,6 +158,7 @@ static int pmem_attach_disk(struct nd_namespace_common *ndns,
blk_queue_make_request(pmem->pmem_queue, pmem_make_request);
blk_queue_max_hw_sectors(pmem->pmem_queue, UINT_MAX);
blk_queue_bounce_limit(pmem->pmem_queue, BLK_BOUNCE_ANY);
+ queue_flag_set_unlocked(QUEUE_FLAG_NONROT, pmem->pmem_queue);

disk = alloc_disk(0);
if (!disk) {

2015-06-25 09:43:55

by Dan Williams

Subject: [PATCH v2 13/17] libnvdimm, nfit: handle unarmed dimms, mark namespaces read-only

Upon detection of an unarmed dimm in a region, arrange for descendant
BTT, PMEM, or BLK instances to be read-only. A dimm is primarily marked
"unarmed" via flags passed by platform firmware (NFIT).

The flags in the NFIT memory device sub-structure indicate the state of
the data on the nvdimm relative to its energy source or last "flush to
persistence". For the most part there is nothing the driver can do but
advertise the state of these flags in sysfs and emit a message if
firmware indicates that the contents of the device may be corrupted.
However, for the case of ACPI_NFIT_MEM_ARMED (which, despite its name,
is set when the dimm is not armed), the driver can arrange for
the block devices incorporating that nvdimm to be marked read-only.
This is a safe default as the data is still available and new writes are
held off until the administrator either forces read-write mode, or the
energy source becomes armed.

A 'read_only' attribute is added to REGION devices to allow for
overriding the default read-only policy of all descendant block devices.
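
The policy described above can be modelled in userspace as follows.
This is a sketch under assumptions: the structs and function names are
hypothetical simplifications, and only the NDD_* flag values and the
"region decides, disk follows on revalidate" flow come from this patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Flag values as defined in include/linux/libnvdimm.h by this patch. */
enum { NDD_ALIASING = 1 << 0, NDD_UNARMED = 1 << 1 };

/* Hypothetical, simplified stand-ins for the region and the disk. */
struct region { int ro; };
struct disk { int ro; };

/* A region defaults to read-only if any of its dimms is unarmed. */
static inline int region_default_ro(const unsigned long *dimm_flags,
		size_t nr_dimms)
{
	size_t i;

	for (i = 0; i < nr_dimms; i++)
		if (dimm_flags[i] & NDD_UNARMED)
			return 1;
	return 0;
}

/*
 * Sync the disk's policy with its parent region, as a revalidate
 * would. Returns true when the disk's read-only state changed.
 */
static inline bool revalidate(struct disk *disk, const struct region *region)
{
	assert(disk && region);
	if (region->ro == disk->ro)
		return false;

	disk->ro = region->ro;
	return true;
}
```

In this model an administrator override corresponds to writing the
region's 'ro' field and revalidating the disk again.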

Signed-off-by: Dan Williams <[email protected]>
---
drivers/acpi/nfit.c | 31 +++++++++++++++++++++++++++++++
drivers/acpi/nfit.h | 3 +++
drivers/nvdimm/blk.c | 2 ++
drivers/nvdimm/btt.c | 10 ++++++++--
drivers/nvdimm/bus.c | 18 ++++++++++++++++++
drivers/nvdimm/nd.h | 3 ++-
drivers/nvdimm/pmem.c | 2 ++
drivers/nvdimm/region_devs.c | 29 +++++++++++++++++++++++++++++
include/linux/libnvdimm.h | 2 ++
tools/testing/nvdimm/test/nfit.c | 3 +++
10 files changed, 100 insertions(+), 3 deletions(-)

diff --git a/drivers/acpi/nfit.c b/drivers/acpi/nfit.c
index 07d630e9f4ae..1f6f1b1a54f4 100644
--- a/drivers/acpi/nfit.c
+++ b/drivers/acpi/nfit.c
@@ -668,6 +668,20 @@ static ssize_t serial_show(struct device *dev,
}
static DEVICE_ATTR_RO(serial);

+static ssize_t flags_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ u16 flags = to_nfit_memdev(dev)->flags;
+
+ return sprintf(buf, "%s%s%s%s%s\n",
+ flags & ACPI_NFIT_MEM_SAVE_FAILED ? "save " : "",
+ flags & ACPI_NFIT_MEM_RESTORE_FAILED ? "restore " : "",
+ flags & ACPI_NFIT_MEM_FLUSH_FAILED ? "flush " : "",
+ flags & ACPI_NFIT_MEM_ARMED ? "arm " : "",
+ flags & ACPI_NFIT_MEM_HEALTH_OBSERVED ? "smart " : "");
+}
+static DEVICE_ATTR_RO(flags);
+
static struct attribute *acpi_nfit_dimm_attributes[] = {
&dev_attr_handle.attr,
&dev_attr_phys_id.attr,
@@ -676,6 +690,7 @@ static struct attribute *acpi_nfit_dimm_attributes[] = {
&dev_attr_format.attr,
&dev_attr_serial.attr,
&dev_attr_rev_id.attr,
+ &dev_attr_flags.attr,
NULL,
};

@@ -768,6 +783,7 @@ static int acpi_nfit_register_dimms(struct acpi_nfit_desc *acpi_desc)
struct nvdimm *nvdimm;
unsigned long flags = 0;
u32 device_handle;
+ u16 mem_flags;
int rc;

device_handle = __to_nfit_memdev(nfit_mem)->device_handle;
@@ -785,6 +801,10 @@ static int acpi_nfit_register_dimms(struct acpi_nfit_desc *acpi_desc)
if (nfit_mem->bdw && nfit_mem->memdev_pmem)
flags |= NDD_ALIASING;

+ mem_flags = __to_nfit_memdev(nfit_mem)->flags;
+ if (mem_flags & ACPI_NFIT_MEM_ARMED)
+ flags |= NDD_UNARMED;
+
rc = acpi_nfit_add_dimm(acpi_desc, nfit_mem, device_handle);
if (rc)
continue;
@@ -797,6 +817,17 @@ static int acpi_nfit_register_dimms(struct acpi_nfit_desc *acpi_desc)

nfit_mem->nvdimm = nvdimm;
dimm_count++;
+
+ if ((mem_flags & ACPI_NFIT_MEM_FAILED_MASK) == 0)
+ continue;
+
+ dev_info(acpi_desc->dev, "%s: failed: %s%s%s%s\n",
+ nvdimm_name(nvdimm),
+ mem_flags & ACPI_NFIT_MEM_SAVE_FAILED ? "save " : "",
+ mem_flags & ACPI_NFIT_MEM_RESTORE_FAILED ? "restore " : "",
+ mem_flags & ACPI_NFIT_MEM_FLUSH_FAILED ? "flush " : "",
+ mem_flags & ACPI_NFIT_MEM_ARMED ? "arm " : "");
+
}

return nvdimm_bus_check_dimm_count(acpi_desc->nvdimm_bus, dimm_count);
diff --git a/drivers/acpi/nfit.h b/drivers/acpi/nfit.h
index c62fffea8423..81f2e8c5a79c 100644
--- a/drivers/acpi/nfit.h
+++ b/drivers/acpi/nfit.h
@@ -22,6 +22,9 @@

#define UUID_NFIT_BUS "2f10e7a4-9e91-11e4-89d3-123b93f75cba"
#define UUID_NFIT_DIMM "4309ac30-0d11-11e4-9191-0800200c9a66"
+#define ACPI_NFIT_MEM_FAILED_MASK (ACPI_NFIT_MEM_SAVE_FAILED \
+ | ACPI_NFIT_MEM_RESTORE_FAILED | ACPI_NFIT_MEM_FLUSH_FAILED \
+ | ACPI_NFIT_MEM_ARMED)

enum nfit_uuids {
NFIT_SPA_VOLATILE,
diff --git a/drivers/nvdimm/blk.c b/drivers/nvdimm/blk.c
index 96ef38ceeceb..4f97b248c236 100644
--- a/drivers/nvdimm/blk.c
+++ b/drivers/nvdimm/blk.c
@@ -232,6 +232,7 @@ static int nd_blk_rw_bytes(struct nd_namespace_common *ndns,

static const struct block_device_operations nd_blk_fops = {
.owner = THIS_MODULE,
+ .revalidate_disk = nvdimm_revalidate_disk,
};

static int nd_blk_attach_disk(struct nd_namespace_common *ndns,
@@ -283,6 +284,7 @@ static int nd_blk_attach_disk(struct nd_namespace_common *ndns,
}

set_capacity(disk, available_disk_size >> SECTOR_SHIFT);
+ revalidate_disk(disk);
return 0;
}

diff --git a/drivers/nvdimm/btt.c b/drivers/nvdimm/btt.c
index c02065aed03d..411c7b2bb37a 100644
--- a/drivers/nvdimm/btt.c
+++ b/drivers/nvdimm/btt.c
@@ -1245,6 +1245,7 @@ static const struct block_device_operations btt_fops = {
.owner = THIS_MODULE,
.rw_page = btt_rw_page,
.getgeo = btt_getgeo,
+ .revalidate_disk = nvdimm_revalidate_disk,
};

static int btt_blk_init(struct btt *btt)
@@ -1292,6 +1293,7 @@ static int btt_blk_init(struct btt *btt)
}
}
set_capacity(btt->btt_disk, btt->nlba * btt->sector_size >> 9);
+ revalidate_disk(btt->btt_disk);

return 0;
}
@@ -1346,7 +1348,11 @@ static struct btt *btt_init(struct nd_btt *nd_btt, unsigned long long rawsize,
goto out_free;
}

- if (btt->init_state != INIT_READY) {
+ if (btt->init_state != INIT_READY && nd_region->ro) {
+ dev_info(dev, "%s is read-only, unable to init btt metadata\n",
+ dev_name(&nd_region->dev));
+ goto out_free;
+ } else if (btt->init_state != INIT_READY) {
btt->num_arenas = (rawsize / ARENA_MAX_SIZE) +
((rawsize % ARENA_MAX_SIZE) ? 1 : 0);
dev_dbg(dev, "init: %d arenas for %llu rawsize\n",
@@ -1361,7 +1367,7 @@ static struct btt *btt_init(struct nd_btt *nd_btt, unsigned long long rawsize,
ret = btt_meta_init(btt);
if (ret) {
dev_err(dev, "init: error in meta_init: %d\n", ret);
- return NULL;
+ goto out_free;
}
}

diff --git a/drivers/nvdimm/bus.c b/drivers/nvdimm/bus.c
index dd12f38397db..ec59f1f26d95 100644
--- a/drivers/nvdimm/bus.c
+++ b/drivers/nvdimm/bus.c
@@ -227,6 +227,24 @@ int __nd_driver_register(struct nd_device_driver *nd_drv, struct module *owner,
}
EXPORT_SYMBOL(__nd_driver_register);

+int nvdimm_revalidate_disk(struct gendisk *disk)
+{
+ struct device *dev = disk->driverfs_dev;
+ struct nd_region *nd_region = to_nd_region(dev->parent);
+ const char *pol = nd_region->ro ? "only" : "write";
+
+ if (nd_region->ro == get_disk_ro(disk))
+ return 0;
+
+ dev_info(dev, "%s read-%s, marking %s read-%s\n",
+ dev_name(&nd_region->dev), pol, disk->disk_name, pol);
+ set_disk_ro(disk, nd_region->ro);
+
+ return 0;
+
+}
+EXPORT_SYMBOL(nvdimm_revalidate_disk);
+
static ssize_t modalias_show(struct device *dev, struct device_attribute *attr,
char *buf)
{
diff --git a/drivers/nvdimm/nd.h b/drivers/nvdimm/nd.h
index cf6849b72c4f..b870de9add79 100644
--- a/drivers/nvdimm/nd.h
+++ b/drivers/nvdimm/nd.h
@@ -96,7 +96,7 @@ struct nd_region {
u16 ndr_mappings;
u64 ndr_size;
u64 ndr_start;
- int id, num_lanes;
+ int id, num_lanes, ro;
void *provider_data;
struct nd_interleave_set *nd_set;
struct nd_percpu_lane __percpu *lane;
@@ -188,6 +188,7 @@ u64 nd_region_interleave_set_cookie(struct nd_region *nd_region);
void nvdimm_bus_lock(struct device *dev);
void nvdimm_bus_unlock(struct device *dev);
bool is_nvdimm_bus_locked(struct device *dev);
+int nvdimm_revalidate_disk(struct gendisk *disk);
void nvdimm_drvdata_release(struct kref *kref);
void put_ndd(struct nvdimm_drvdata *ndd);
int nd_label_reserve_dpa(struct nvdimm_drvdata *ndd);
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index a9709db0704c..42b766f33e59 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -104,6 +104,7 @@ static const struct block_device_operations pmem_fops = {
.owner = THIS_MODULE,
.rw_page = pmem_rw_page,
.direct_access = pmem_direct_access,
+ .revalidate_disk = nvdimm_revalidate_disk,
};

static struct pmem_device *pmem_alloc(struct device *dev,
@@ -178,6 +179,7 @@ static int pmem_attach_disk(struct nd_namespace_common *ndns,
pmem->pmem_disk = disk;

add_disk(disk);
+ revalidate_disk(disk);

return 0;
}
diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
index fb35d451016f..8f8c7ea485f1 100644
--- a/drivers/nvdimm/region_devs.c
+++ b/drivers/nvdimm/region_devs.c
@@ -345,11 +345,35 @@ static ssize_t btt_seed_show(struct device *dev,
}
static DEVICE_ATTR_RO(btt_seed);

+static ssize_t read_only_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ struct nd_region *nd_region = to_nd_region(dev);
+
+ return sprintf(buf, "%d\n", nd_region->ro);
+}
+
+static ssize_t read_only_store(struct device *dev,
+ struct device_attribute *attr, const char *buf, size_t len)
+{
+ bool ro;
+ int rc = strtobool(buf, &ro);
+ struct nd_region *nd_region = to_nd_region(dev);
+
+ if (rc)
+ return rc;
+
+ nd_region->ro = ro;
+ return len;
+}
+static DEVICE_ATTR_RW(read_only);
+
static struct attribute *nd_region_attributes[] = {
&dev_attr_size.attr,
&dev_attr_nstype.attr,
&dev_attr_mappings.attr,
&dev_attr_btt_seed.attr,
+ &dev_attr_read_only.attr,
&dev_attr_set_cookie.attr,
&dev_attr_available_size.attr,
&dev_attr_namespace_seed.attr,
@@ -641,6 +665,7 @@ static struct nd_region *nd_region_create(struct nvdimm_bus *nvdimm_bus,
struct device *dev;
void *region_buf;
unsigned int i;
+ int ro = 0;

for (i = 0; i < ndr_desc->num_mappings; i++) {
struct nd_mapping *nd_mapping = &ndr_desc->nd_mapping[i];
@@ -652,6 +677,9 @@ static struct nd_region *nd_region_create(struct nvdimm_bus *nvdimm_bus,

return NULL;
}
+
+ if (nvdimm->flags & NDD_UNARMED)
+ ro = 1;
}

if (dev_type == &nd_blk_device_type) {
@@ -707,6 +735,7 @@ static struct nd_region *nd_region_create(struct nvdimm_bus *nvdimm_bus,
nd_region->provider_data = ndr_desc->provider_data;
nd_region->nd_set = ndr_desc->nd_set;
nd_region->num_lanes = ndr_desc->num_lanes;
+ nd_region->ro = ro;
ida_init(&nd_region->ns_ida);
dev = &nd_region->dev;
dev_set_name(dev, "region%d", nd_region->id);
diff --git a/include/linux/libnvdimm.h b/include/linux/libnvdimm.h
index 7fc1b25bdb5d..dc799a29ed1a 100644
--- a/include/linux/libnvdimm.h
+++ b/include/linux/libnvdimm.h
@@ -21,6 +21,8 @@
enum {
/* when a dimm supports both PMEM and BLK access a label is required */
NDD_ALIASING = 1 << 0,
+ /* unarmed memory devices may not persist writes */
+ NDD_UNARMED = 1 << 1,

/* need to set a limit somewhere, but yes, this is likely overkill */
ND_IOCTL_MAX_BUFLEN = SZ_4M,
diff --git a/tools/testing/nvdimm/test/nfit.c b/tools/testing/nvdimm/test/nfit.c
index 7a4a5a5edbe4..4b69b8368de0 100644
--- a/tools/testing/nvdimm/test/nfit.c
+++ b/tools/testing/nvdimm/test/nfit.c
@@ -874,6 +874,9 @@ static void nfit_test1_setup(struct nfit_test *t)
memdev->address = 0;
memdev->interleave_index = 0;
memdev->interleave_ways = 1;
+ memdev->flags = ACPI_NFIT_MEM_SAVE_FAILED | ACPI_NFIT_MEM_RESTORE_FAILED
+ | ACPI_NFIT_MEM_FLUSH_FAILED | ACPI_NFIT_MEM_HEALTH_OBSERVED
+ | ACPI_NFIT_MEM_ARMED;

offset += sizeof(*memdev);
/* dcr-descriptor0 */

2015-06-25 09:44:49

by Dan Williams

Subject: [PATCH v2 14/17] acpi: Add acpi_map_pxm_to_online_node()

From: Toshi Kani <[email protected]>

The kernel initializes the NUMA topology of CPUs and memory from the
ACPI SRAT table. Some other ACPI tables, such as NFIT and DMAR, also
contain proximity IDs describing the NUMA topology of their devices.
This information can be used to improve performance of these devices.

This patch introduces acpi_map_pxm_to_online_node(), which is
similar to acpi_map_pxm_to_node(), but always returns an online
node. When the mapped node from a given proximity ID is offline,
it looks up the node distance table and returns the nearest
online node.
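
The nearest-online-node lookup can be sketched in userspace as below.
This is a model under assumptions, not the kernel implementation:
NR_NODES, NO_NODE, the 'online' array, and the 'distance' table are
hypothetical stand-ins for the kernel's node mask and node distance
table.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

#define NR_NODES 4	/* hypothetical number of nodes in the model */
#define NO_NODE (-1)	/* models NUMA_NO_NODE */

/*
 * Return 'node' if it is online, otherwise the online node with the
 * smallest distance from it. An unmapped node is treated as node 0.
 */
static inline int nearest_online_node(int node, const bool *online,
		int distance[NR_NODES][NR_NODES])
{
	int n, best, min_dist = INT_MAX;

	if (node == NO_NODE)
		node = 0;
	assert(node >= 0 && node < NR_NODES);

	if (online[node])
		return node;

	best = node;
	for (n = 0; n < NR_NODES; n++) {
		if (!online[n])
			continue;
		/* always measure from the originally mapped node */
		if (distance[node][n] < min_dist) {
			min_dist = distance[node][n];
			best = n;
		}
	}
	return best;
}
```

The sketch tracks the best candidate in a separate variable so that
every distance is measured from the originally mapped node rather than
from the most recent candidate.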

ACPI device drivers, which are called after the NUMA initialization
has completed in the kernel, can call this interface to obtain their
device NUMA topology from ACPI tables. Such drivers do not have to
deal with offline nodes. A node may be offline when a device's
proximity ID is unique, when no SRAT memory entry exists, or when NUMA
is disabled, e.g. with "numa=off" on x86.

This patch also moves the pxm range check from acpi_get_node() to
acpi_map_pxm_to_node().

Signed-off-by: Toshi Kani <[email protected]>
Acked-by: Rafael J. Wysocki <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
drivers/acpi/numa.c | 50 +++++++++++++++++++++++++++++++++++++++++++++++---
include/linux/acpi.h | 5 +++++
2 files changed, 52 insertions(+), 3 deletions(-)

diff --git a/drivers/acpi/numa.c b/drivers/acpi/numa.c
index 1333cbdc3ea2..acaa3b4ea504 100644
--- a/drivers/acpi/numa.c
+++ b/drivers/acpi/numa.c
@@ -29,6 +29,8 @@
#include <linux/errno.h>
#include <linux/acpi.h>
#include <linux/numa.h>
+#include <linux/nodemask.h>
+#include <linux/topology.h>

#define PREFIX "ACPI: "

@@ -70,7 +72,12 @@ static void __acpi_map_pxm_to_node(int pxm, int node)

int acpi_map_pxm_to_node(int pxm)
{
- int node = pxm_to_node_map[pxm];
+ int node;
+
+ if (pxm < 0 || pxm >= MAX_PXM_DOMAINS)
+ return NUMA_NO_NODE;
+
+ node = pxm_to_node_map[pxm];

if (node == NUMA_NO_NODE) {
if (nodes_weight(nodes_found_map) >= MAX_NUMNODES)
@@ -83,6 +90,45 @@ int acpi_map_pxm_to_node(int pxm)
return node;
}

+/**
+ * acpi_map_pxm_to_online_node - Map proximity ID to online node
+ * @pxm: ACPI proximity ID
+ *
+ * This is similar to acpi_map_pxm_to_node(), but always returns an online
+ * node. When the mapped node from a given proximity ID is offline, it
+ * looks up the node distance table and returns the nearest online node.
+ *
+ * ACPI device drivers, which are called after the NUMA initialization has
+ * completed in the kernel, can call this interface to obtain their device
+ * NUMA topology from ACPI tables. Such drivers do not have to deal with
+ * offline nodes. A node may be offline when a device proximity ID is
+ * unique, SRAT memory entry does not exist, or NUMA is disabled, ex.
+ * "numa=off" on x86.
+ */
+int acpi_map_pxm_to_online_node(int pxm)
+{
+ int node, n, dist, min_dist;
+
+ node = acpi_map_pxm_to_node(pxm);
+
+ if (node == NUMA_NO_NODE)
+ node = 0;
+
+ if (!node_online(node)) {
+ min_dist = INT_MAX;
+ for_each_online_node(n) {
+ dist = node_distance(node, n);
+ if (dist < min_dist) {
+ min_dist = dist;
+ node = n;
+ }
+ }
+ }
+
+ return node;
+}
+EXPORT_SYMBOL(acpi_map_pxm_to_online_node);
+
static void __init
acpi_table_print_srat_entry(struct acpi_subtable_header *header)
{
@@ -328,8 +374,6 @@ int acpi_get_node(acpi_handle handle)
int pxm;

pxm = acpi_get_pxm(handle);
- if (pxm < 0 || pxm >= MAX_PXM_DOMAINS)
- return NUMA_NO_NODE;

return acpi_map_pxm_to_node(pxm);
}
diff --git a/include/linux/acpi.h b/include/linux/acpi.h
index e4da5e35e29c..1b3bbb11d11c 100644
--- a/include/linux/acpi.h
+++ b/include/linux/acpi.h
@@ -289,8 +289,13 @@ extern void acpi_dmi_osi_linux(int enable, const struct dmi_system_id *d);
extern void acpi_osi_setup(char *str);

#ifdef CONFIG_ACPI_NUMA
+int acpi_map_pxm_to_online_node(int pxm);
int acpi_get_node(acpi_handle handle);
#else
+static inline int acpi_map_pxm_to_online_node(int pxm)
+{
+ return 0;
+}
static inline int acpi_get_node(acpi_handle handle)
{
return 0;

2015-06-25 09:44:43

by Dan Williams

Subject: [PATCH v2 15/17] libnvdimm: Set numa_node to NVDIMM devices

From: Toshi Kani <[email protected]>

The ACPI NFIT table has System Physical Address Range Structure entries
that describe the proximity ID of each range when
ACPI_NFIT_PROXIMITY_VALID is set in the flags.

Change acpi_nfit_register_region() to map a proximity ID to its node ID
and store it in a new numa_node field of nd_region_desc, which is then
conveyed to the nd_region device.

The device core arranges for btt and namespace devices to inherit their
node from their parent region.

Signed-off-by: Toshi Kani <[email protected]>
[djbw: move set_dev_node() from region 'probe' to 'create']
Signed-off-by: Dan Williams <[email protected]>
---
drivers/acpi/nfit.c | 6 ++++++
drivers/nvdimm/region_devs.c | 4 +++-
include/linux/libnvdimm.h | 1 +
3 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/drivers/acpi/nfit.c b/drivers/acpi/nfit.c
index 1f6f1b1a54f4..d96c8fe974dd 100644
--- a/drivers/acpi/nfit.c
+++ b/drivers/acpi/nfit.c
@@ -1392,6 +1392,12 @@ static int acpi_nfit_register_region(struct acpi_nfit_desc *acpi_desc,
ndr_desc->res = &res;
ndr_desc->provider_data = nfit_spa;
ndr_desc->attr_groups = acpi_nfit_region_attribute_groups;
+ if (spa->flags & ACPI_NFIT_PROXIMITY_VALID)
+ ndr_desc->numa_node = acpi_map_pxm_to_online_node(
+ spa->proximity_domain);
+ else
+ ndr_desc->numa_node = NUMA_NO_NODE;
+
list_for_each_entry(nfit_memdev, &acpi_desc->memdevs, list) {
struct acpi_nfit_memory_map *memdev = nfit_memdev->memdev;
struct nd_mapping *nd_mapping;
diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
index 8f8c7ea485f1..b2ec5045f5a5 100644
--- a/drivers/nvdimm/region_devs.c
+++ b/drivers/nvdimm/region_devs.c
@@ -738,13 +738,15 @@ static struct nd_region *nd_region_create(struct nvdimm_bus *nvdimm_bus,
nd_region->ro = ro;
ida_init(&nd_region->ns_ida);
dev = &nd_region->dev;
+ device_initialize(dev);
dev_set_name(dev, "region%d", nd_region->id);
dev->parent = &nvdimm_bus->dev;
dev->type = dev_type;
dev->groups = ndr_desc->attr_groups;
+ set_dev_node(dev, ndr_desc->numa_node);
nd_region->ndr_size = resource_size(ndr_desc->res);
nd_region->ndr_start = ndr_desc->res->start;
- nd_device_register(dev);
+ __nd_device_register(dev);

return nd_region;

diff --git a/include/linux/libnvdimm.h b/include/linux/libnvdimm.h
index dc799a29ed1a..30b3deaafd51 100644
--- a/include/linux/libnvdimm.h
+++ b/include/linux/libnvdimm.h
@@ -89,6 +89,7 @@ struct nd_region_desc {
struct nd_interleave_set *nd_set;
void *provider_data;
int num_lanes;
+ int numa_node;
};

struct nvdimm_bus;
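
The flag test added to acpi_nfit_register_region() can be modeled in userspace roughly as follows. This is only a sketch: map_pxm_to_node() and its table are hypothetical stand-ins for acpi_map_pxm_to_online_node(), and the value of ACPI_NFIT_PROXIMITY_VALID is assumed here rather than taken from the ACPICA headers.

```c
#include <assert.h>
#include <stddef.h>

#define NUMA_NO_NODE (-1)
/* Assumed bit position; check the ACPICA headers in your tree. */
#define ACPI_NFIT_PROXIMITY_VALID (1 << 1)

/* Hypothetical proximity-domain to node table (not real firmware data). */
static const int pxm_to_node_table[] = { 0, 1, 1, 0 };

/* Hypothetical stand-in for acpi_map_pxm_to_online_node(). */
static int map_pxm_to_node(unsigned int pxm)
{
	size_t n = sizeof(pxm_to_node_table) / sizeof(pxm_to_node_table[0]);

	return pxm < n ? pxm_to_node_table[pxm] : NUMA_NO_NODE;
}

/* Only trust the proximity domain when the SPA flags say it is valid. */
static int region_numa_node(unsigned int spa_flags, unsigned int pxm)
{
	if (spa_flags & ACPI_NFIT_PROXIMITY_VALID)
		return map_pxm_to_node(pxm);
	return NUMA_NO_NODE;
}
```

A range without the flag set always ends up with NUMA_NO_NODE, whatever value the proximity_domain field happens to hold.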

2015-06-25 09:44:33

by Dan Williams

Subject: [PATCH v2 16/17] libnvdimm: Add sysfs numa_node to NVDIMM devices

From: Toshi Kani <[email protected]>

Add a sysfs 'numa_node' attribute to the I/O-related NVDIMM devices
under /sys/bus/nd/devices: regionN, namespaceN.0, and bttN.
When bttN is not set up, its numa_node reads as -1 (NUMA_NO_NODE).

An example of numa_node values on a 2-socket system with a single
NVDIMM range on each socket is shown below.
/sys/bus/nd/devices
|-- btt0/numa_node:-1
|-- btt1/numa_node:0
|-- namespace0.0/numa_node:0
|-- namespace1.0/numa_node:1
|-- region0/numa_node:0
|-- region1/numa_node:1

These numa_node files are then linked under the block class of
their device names.
/sys/class/block/pmem0/device/numa_node:0
/sys/class/block/pmem0s/device/numa_node:0
/sys/class/block/pmem1/device/numa_node:1

This enables numactl(8) to accept 'block:' and 'file:' paths of
pmem and btt devices as shown in the examples below.
numactl --preferred block:pmem0 --show
numactl --preferred file:/dev/pmem0s --show

Signed-off-by: Toshi Kani <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
drivers/acpi/nfit.c | 1 +
drivers/nvdimm/btt_devs.c | 1 +
drivers/nvdimm/bus.c | 30 ++++++++++++++++++++++++++++++
drivers/nvdimm/namespace_devs.c | 1 +
include/linux/libnvdimm.h | 1 +
5 files changed, 34 insertions(+)

diff --git a/drivers/acpi/nfit.c b/drivers/acpi/nfit.c
index d96c8fe974dd..2161fa178c8d 100644
--- a/drivers/acpi/nfit.c
+++ b/drivers/acpi/nfit.c
@@ -873,6 +873,7 @@ static const struct attribute_group *acpi_nfit_region_attribute_groups[] = {
&nd_region_attribute_group,
&nd_mapping_attribute_group,
&nd_device_attribute_group,
+ &nd_numa_attribute_group,
&acpi_nfit_region_attribute_group,
NULL,
};
diff --git a/drivers/nvdimm/btt_devs.c b/drivers/nvdimm/btt_devs.c
index b6724cfbcfca..2dfb529f4d35 100644
--- a/drivers/nvdimm/btt_devs.c
+++ b/drivers/nvdimm/btt_devs.c
@@ -294,6 +294,7 @@ static struct attribute_group nd_btt_attribute_group = {
static const struct attribute_group *nd_btt_attribute_groups[] = {
&nd_btt_attribute_group,
&nd_device_attribute_group,
+ &nd_numa_attribute_group,
NULL,
};

diff --git a/drivers/nvdimm/bus.c b/drivers/nvdimm/bus.c
index ec59f1f26d95..1380954a6593 100644
--- a/drivers/nvdimm/bus.c
+++ b/drivers/nvdimm/bus.c
@@ -274,6 +274,36 @@ struct attribute_group nd_device_attribute_group = {
};
EXPORT_SYMBOL_GPL(nd_device_attribute_group);

+static ssize_t numa_node_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ return sprintf(buf, "%d\n", dev_to_node(dev));
+}
+static DEVICE_ATTR_RO(numa_node);
+
+static struct attribute *nd_numa_attributes[] = {
+ &dev_attr_numa_node.attr,
+ NULL,
+};
+
+static umode_t nd_numa_attr_visible(struct kobject *kobj, struct attribute *a,
+ int n)
+{
+ if (!IS_ENABLED(CONFIG_NUMA))
+ return 0;
+
+ return a->mode;
+}
+
+/**
+ * nd_numa_attribute_group - NUMA attributes for all devices on an nd bus
+ */
+struct attribute_group nd_numa_attribute_group = {
+ .attrs = nd_numa_attributes,
+ .is_visible = nd_numa_attr_visible,
+};
+EXPORT_SYMBOL_GPL(nd_numa_attribute_group);
+
int nvdimm_bus_create_ndctl(struct nvdimm_bus *nvdimm_bus)
{
dev_t devt = MKDEV(nvdimm_bus_major, nvdimm_bus->id);
diff --git a/drivers/nvdimm/namespace_devs.c b/drivers/nvdimm/namespace_devs.c
index 27d69bd3b4d6..fef0dd80d4ad 100644
--- a/drivers/nvdimm/namespace_devs.c
+++ b/drivers/nvdimm/namespace_devs.c
@@ -1228,6 +1228,7 @@ static struct attribute_group nd_namespace_attribute_group = {
static const struct attribute_group *nd_namespace_attribute_groups[] = {
&nd_device_attribute_group,
&nd_namespace_attribute_group,
+ &nd_numa_attribute_group,
NULL,
};

diff --git a/include/linux/libnvdimm.h b/include/linux/libnvdimm.h
index 30b3deaafd51..75e3af01ee32 100644
--- a/include/linux/libnvdimm.h
+++ b/include/linux/libnvdimm.h
@@ -38,6 +38,7 @@ enum {
extern struct attribute_group nvdimm_bus_attribute_group;
extern struct attribute_group nvdimm_attribute_group;
extern struct attribute_group nd_device_attribute_group;
+extern struct attribute_group nd_numa_attribute_group;
extern struct attribute_group nd_region_attribute_group;
extern struct attribute_group nd_mapping_attribute_group;
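
For tools that want the value without going through numactl(8), reading the new attribute from userspace could look like the sketch below. read_numa_node() is a hypothetical helper, not part of any library; it only assumes the file holds a single decimal integer followed by a newline, as numa_node_show() emits.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Read a sysfs-style numa_node file. Returns 0 and stores the node in
 * *node on success, or -1 if the file is missing or malformed. A stored
 * value of -1 means NUMA_NO_NODE.
 */
static int read_numa_node(const char *path, int *node)
{
	FILE *f = fopen(path, "r");
	int ok;

	if (!f)
		return -1;
	ok = fscanf(f, "%d", node) == 1;
	fclose(f);
	return ok ? 0 : -1;
}
```

A caller would pass a path such as /sys/bus/nd/devices/region0/numa_node. Keeping the error return separate from the stored node avoids confusing "no such file" with NUMA_NO_NODE.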

2015-06-25 09:44:25

by Dan Williams

Subject: [PATCH v2 17/17] arch, x86: pmem api for ensuring durability of persistent memory updates

From: Ross Zwisler <[email protected]>

Based on an original patch by Ross Zwisler [1].

Writes to persistent memory have the potential to be posted to cpu
cache, cpu write buffers, and platform write buffers (memory controller)
before being committed to persistent media. Provide apis,
memcpy_to_pmem(), wmb_pmem(), and memremap_pmem(), to write data to
pmem and assert that it is durable in PMEM (a persistent linear address
range). A '__pmem' attribute is added so sparse can track proper usage
of pointers to pmem.

This continues the status quo of pmem being x86-only for 4.2; reworks
to ioremap and a wider implementation of memremap() will enable other
archs in 4.3.

[1]: https://lists.01.org/pipermail/linux-nvdimm/2015-May/000932.html

Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Signed-off-by: Ross Zwisler <[email protected]>
[djbw: various reworks]
Signed-off-by: Dan Williams <[email protected]>
---
arch/x86/Kconfig | 1
arch/x86/include/asm/cacheflush.h | 72 +++++++++++++++++
arch/x86/include/asm/io.h | 6 +
drivers/nvdimm/pmem.c | 33 +++++---
include/linux/compiler.h | 2
include/linux/pmem.h | 153 +++++++++++++++++++++++++++++++++++++
lib/Kconfig | 3 +
7 files changed, 257 insertions(+), 13 deletions(-)
create mode 100644 include/linux/pmem.h

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 1a2cbf641667..62564ddf7f78 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -27,6 +27,7 @@ config X86
select ARCH_HAS_DEBUG_STRICT_USER_COPY_CHECKS
select ARCH_HAS_FAST_MULTIPLIER
select ARCH_HAS_GCOV_PROFILE_ALL
+ select ARCH_HAS_PMEM_API
select ARCH_MIGHT_HAVE_PC_PARPORT
select ARCH_MIGHT_HAVE_PC_SERIO
select HAVE_AOUT if X86_32
diff --git a/arch/x86/include/asm/cacheflush.h b/arch/x86/include/asm/cacheflush.h
index 47c8e32f621a..ec23bb753a3e 100644
--- a/arch/x86/include/asm/cacheflush.h
+++ b/arch/x86/include/asm/cacheflush.h
@@ -4,6 +4,7 @@
/* Caches aren't brain-dead on the intel. */
#include <asm-generic/cacheflush.h>
#include <asm/special_insns.h>
+#include <asm/uaccess.h>

/*
* The set_memory_* API can be used to change various attributes of a virtual
@@ -104,4 +105,75 @@ static inline int rodata_test(void)
}
#endif

+#ifdef ARCH_HAS_NOCACHE_UACCESS
+
+/**
+ * arch_memcpy_to_pmem - copy data to persistent memory
+ * @dst: destination buffer for the copy
+ * @src: source buffer for the copy
+ * @n: length of the copy in bytes
+ *
+ * Copy data to persistent memory media via non-temporal stores so that
+ * a subsequent arch_wmb_pmem() can flush cpu and memory controller
+ * write buffers to guarantee durability.
+ */
+static inline void arch_memcpy_to_pmem(void __pmem *dst, const void *src,
+ size_t n)
+{
+ int unwritten;
+
+ /*
+ * We are copying between two kernel buffers; if
+ * __copy_from_user_inatomic_nocache() returns an error (page
+ * fault), we would have already reported a general protection fault
+ * before the WARN+BUG.
+ */
+ unwritten = __copy_from_user_inatomic_nocache((void __force *) dst,
+ (void __user *) src, n);
+ if (WARN(unwritten, "%s: fault copying %p <- %p unwritten: %d\n",
+ __func__, dst, src, unwritten))
+ BUG();
+}
+
+/**
+ * arch_wmb_pmem - synchronize writes to persistent memory
+ *
+ * After a series of arch_memcpy_to_pmem() operations this drains data
+ * from cpu write buffers and any platform (memory controller) buffers
+ * to ensure that written data is durable on persistent memory media.
+ */
+static inline void arch_wmb_pmem(void)
+{
+ /*
+ * wmb() to 'sfence' all previous writes such that they are
+ * architecturally visible to 'pcommit'. Note, that we've
+ * already arranged for pmem writes to avoid the cache via
+ * arch_memcpy_to_pmem().
+ */
+ wmb();
+ pcommit_sfence();
+}
+
+static inline bool __arch_has_wmb_pmem(void)
+{
+#ifdef CONFIG_X86_64
+ /*
+ * We require that wmb() be an 'sfence', that is only guaranteed on
+ * 64-bit builds
+ */
+ return static_cpu_has(X86_FEATURE_PCOMMIT);
+#else
+ return false;
+#endif
+}
+#else /* ARCH_HAS_NOCACHE_UACCESS i.e. ARCH=um */
+extern void arch_memcpy_to_pmem(void __pmem *dst, const void *src, size_t n);
+extern void arch_wmb_pmem(void);
+
+static inline bool __arch_has_wmb_pmem(void)
+{
+ return false;
+}
+#endif
+
#endif /* _ASM_X86_CACHEFLUSH_H */
diff --git a/arch/x86/include/asm/io.h b/arch/x86/include/asm/io.h
index 34a5b93704d3..c60c3f3b0183 100644
--- a/arch/x86/include/asm/io.h
+++ b/arch/x86/include/asm/io.h
@@ -247,6 +247,12 @@ static inline void flush_write_buffers(void)
#endif
}

+static inline void __pmem *arch_memremap_pmem(resource_size_t offset,
+ unsigned long size)
+{
+ return (void __force __pmem *) ioremap_cache(offset, size);
+}
+
#endif /* __KERNEL__ */

extern void native_io_delay(void);
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 42b766f33e59..ade9eb917a4d 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -23,6 +23,7 @@
#include <linux/module.h>
#include <linux/moduleparam.h>
#include <linux/slab.h>
+#include <linux/pmem.h>
#include <linux/nd.h>
#include "nd.h"

@@ -32,7 +33,7 @@ struct pmem_device {

/* One contiguous memory region per device */
phys_addr_t phys_addr;
- void *virt_addr;
+ void __pmem *virt_addr;
size_t size;
};

@@ -44,13 +45,14 @@ static void pmem_do_bvec(struct pmem_device *pmem, struct page *page,
{
void *mem = kmap_atomic(page);
size_t pmem_off = sector << 9;
+ void __pmem *pmem_addr = pmem->virt_addr + pmem_off;

if (rw == READ) {
- memcpy(mem + off, pmem->virt_addr + pmem_off, len);
+ memcpy_from_pmem(mem + off, pmem_addr, len);
flush_dcache_page(page);
} else {
flush_dcache_page(page);
- memcpy(pmem->virt_addr + pmem_off, mem + off, len);
+ memcpy_to_pmem(pmem_addr, mem + off, len);
}

kunmap_atomic(mem);
@@ -71,6 +73,10 @@ static void pmem_make_request(struct request_queue *q, struct bio *bio)
bio_data_dir(bio), iter.bi_sector);
if (do_acct)
nd_iostat_end(bio, start);
+
+ if (bio_data_dir(bio))
+ wmb_pmem();
+
bio_endio(bio, 0);
}

@@ -94,7 +100,8 @@ static long pmem_direct_access(struct block_device *bdev, sector_t sector,
if (!pmem)
return -ENODEV;

- *kaddr = pmem->virt_addr + offset;
+ /* FIXME convert DAX to comprehend that this mapping has a lifetime */
+ *kaddr = (void __force *) pmem->virt_addr + offset;
*pfn = (pmem->phys_addr + offset) >> PAGE_SHIFT;

return pmem->size - offset;
@@ -118,6 +125,8 @@ static struct pmem_device *pmem_alloc(struct device *dev,

pmem->phys_addr = res->start;
pmem->size = resource_size(res);
+ if (!arch_has_pmem_api())
+ dev_warn(dev, "unable to guarantee persistence of writes\n");

if (!request_mem_region(pmem->phys_addr, pmem->size, dev_name(dev))) {
dev_warn(dev, "could not reserve region [0x%pa:0x%zx]\n",
@@ -126,11 +135,7 @@ static struct pmem_device *pmem_alloc(struct device *dev,
return ERR_PTR(-EBUSY);
}

- /*
- * Map the memory as non-cachable, as we can't write back the contents
- * of the CPU caches in case of a crash.
- */
- pmem->virt_addr = ioremap_nocache(pmem->phys_addr, pmem->size);
+ pmem->virt_addr = memremap_pmem(pmem->phys_addr, pmem->size);
if (!pmem->virt_addr) {
release_mem_region(pmem->phys_addr, pmem->size);
kfree(pmem);
@@ -195,16 +200,18 @@ static int pmem_rw_bytes(struct nd_namespace_common *ndns,
}

if (rw == READ)
- memcpy(buf, pmem->virt_addr + offset, size);
- else
- memcpy(pmem->virt_addr + offset, buf, size);
+ memcpy_from_pmem(buf, pmem->virt_addr + offset, size);
+ else {
+ memcpy_to_pmem(pmem->virt_addr + offset, buf, size);
+ wmb_pmem();
+ }

return 0;
}

static void pmem_free(struct pmem_device *pmem)
{
- iounmap(pmem->virt_addr);
+ memunmap_pmem(pmem->virt_addr);
release_mem_region(pmem->phys_addr, pmem->size);
kfree(pmem);
}
diff --git a/include/linux/compiler.h b/include/linux/compiler.h
index 867722591be2..9a528d945498 100644
--- a/include/linux/compiler.h
+++ b/include/linux/compiler.h
@@ -21,6 +21,7 @@
# define __rcu __attribute__((noderef, address_space(4)))
#else
# define __rcu
+# define __pmem __attribute__((noderef, address_space(5)))
#endif
extern void __chk_user_ptr(const volatile void __user *);
extern void __chk_io_ptr(const volatile void __iomem *);
@@ -42,6 +43,7 @@ extern void __chk_io_ptr(const volatile void __iomem *);
# define __cond_lock(x,c) (c)
# define __percpu
# define __rcu
+# define __pmem
#endif

/* Indirect macros required for expanded argument pasting, eg. __LINE__. */
diff --git a/include/linux/pmem.h b/include/linux/pmem.h
new file mode 100644
index 000000000000..f6481a0b1d4f
--- /dev/null
+++ b/include/linux/pmem.h
@@ -0,0 +1,153 @@
+/*
+ * Copyright(c) 2015 Intel Corporation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ */
+#ifndef __PMEM_H__
+#define __PMEM_H__
+
+#include <linux/io.h>
+
+#ifdef CONFIG_ARCH_HAS_PMEM_API
+#include <asm/cacheflush.h>
+#else
+static inline void arch_wmb_pmem(void)
+{
+ BUG();
+}
+
+static inline bool __arch_has_wmb_pmem(void)
+{
+ return false;
+}
+
+static inline void __pmem *arch_memremap_pmem(resource_size_t offset,
+ unsigned long size)
+{
+ return NULL;
+}
+
+static inline void arch_memcpy_to_pmem(void __pmem *dst, const void *src,
+ size_t n)
+{
+ BUG();
+}
+#endif
+
+/*
+ * Architectures that define ARCH_HAS_PMEM_API must provide
+ * implementations for arch_memremap_pmem(), arch_memcpy_to_pmem(),
+ * arch_wmb_pmem(), and __arch_has_wmb_pmem().
+ */
+
+static inline void memcpy_from_pmem(void *dst, void __pmem const *src, size_t size)
+{
+ memcpy(dst, (void __force const *) src, size);
+}
+
+static inline void memunmap_pmem(void __pmem *addr)
+{
+ iounmap((void __force __iomem *) addr);
+}
+
+/**
+ * arch_has_wmb_pmem - true if wmb_pmem() ensures durability
+ *
+ * For a given cpu implementation within an architecture it is possible
+ * that wmb_pmem() resolves to a nop. In the case this returns
+ * false, pmem api users are unable to ensure durability and may want to
+ * fall back to a different data consistency model, or otherwise notify
+ * the user.
+ */
+static inline bool arch_has_wmb_pmem(void)
+{
+ if (IS_ENABLED(CONFIG_ARCH_HAS_PMEM_API))
+ return __arch_has_wmb_pmem();
+ return false;
+}
+
+static inline bool arch_has_pmem_api(void)
+{
+ return IS_ENABLED(CONFIG_ARCH_HAS_PMEM_API) && arch_has_wmb_pmem();
+}
+
+/*
+ * These defaults seek to offer decent performance and minimize the
+ * window between i/o completion and writes being durable on media.
+ * However, it is undefined / architecture specific whether
+ * default_memremap_pmem + default_memcpy_to_pmem is sufficient for
+ * making data durable relative to i/o completion.
+ */
+static void default_memcpy_to_pmem(void __pmem *dst, const void *src,
+ size_t size)
+{
+ memcpy((void __force *) dst, src, size);
+}
+
+static void __pmem *default_memremap_pmem(resource_size_t offset,
+ unsigned long size)
+{
+ /* TODO: convert to ioremap_wt() */
+ return (void __pmem __force *)ioremap_nocache(offset, size);
+}
+
+/**
+ * memremap_pmem - map physical persistent memory for pmem api
+ * @offset: physical address of persistent memory
+ * @size: size of the mapping
+ *
+ * Establish a mapping of the architecture specific memory type expected
+ * by memcpy_to_pmem() and wmb_pmem(). For example, it may be
+ * the case that an uncacheable or writethrough mapping is sufficient,
+ * or a writeback mapping provided memcpy_to_pmem() and
+ * wmb_pmem() arrange for the data to be written through the
+ * cache to persistent media.
+ */
+static inline void __pmem *memremap_pmem(resource_size_t offset,
+ unsigned long size)
+{
+ if (arch_has_pmem_api())
+ return arch_memremap_pmem(offset, size);
+ return default_memremap_pmem(offset, size);
+}
+
+/**
+ * memcpy_to_pmem - copy data to persistent memory
+ * @dst: destination buffer for the copy
+ * @src: source buffer for the copy
+ * @n: length of the copy in bytes
+ *
+ * Perform a memory copy that results in the destination of the copy
+ * being effectively evicted from, or never written to, the processor
+ * cache hierarchy after the copy completes. After memcpy_to_pmem()
+ * data may still reside in cpu or platform buffers, so this operation
+ * must be followed by a wmb_pmem().
+ */
+static inline void memcpy_to_pmem(void __pmem *dst, const void *src, size_t n)
+{
+ if (arch_has_pmem_api())
+ arch_memcpy_to_pmem(dst, src, n);
+ else
+ default_memcpy_to_pmem(dst, src, n);
+}
+
+/**
+ * wmb_pmem - synchronize writes to persistent memory
+ *
+ * After a series of memcpy_to_pmem() operations this drains data from
+ * cpu write buffers and any platform (memory controller) buffers to
+ * ensure that written data is durable on persistent memory media.
+ */
+static inline void wmb_pmem(void)
+{
+ if (arch_has_pmem_api())
+ arch_wmb_pmem();
+}
+#endif /* __PMEM_H__ */
diff --git a/lib/Kconfig b/lib/Kconfig
index 601965a948e8..d27c13a91c28 100644
--- a/lib/Kconfig
+++ b/lib/Kconfig
@@ -522,4 +522,7 @@ source "lib/fonts/Kconfig"
config ARCH_HAS_SG_CHAIN
def_bool n

+config ARCH_HAS_PMEM_API
+ bool
+
endmenu
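
The dispatch in include/linux/pmem.h, and the "copy, then one barrier" pattern used by pmem_rw_bytes(), can be modeled in userspace roughly as follows. This is a sketch only: every model_* name is hypothetical, the "arch" copy is a plain memcpy() instead of non-temporal stores, and the barrier is just a counter standing in for arch_wmb_pmem().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical knob standing in for arch_has_pmem_api(). */
static bool model_has_pmem_api;
/* Counts how many times the modeled arch barrier ran. */
static int model_barriers;

static void model_arch_memcpy_to_pmem(void *dst, const void *src, size_t n)
{
	memcpy(dst, src, n); /* the x86 version uses non-temporal stores */
}

static void model_memcpy_to_pmem(void *dst, const void *src, size_t n)
{
	if (model_has_pmem_api)
		model_arch_memcpy_to_pmem(dst, src, n);
	else
		memcpy(dst, src, n); /* models default_memcpy_to_pmem() */
}

static void model_wmb_pmem(void)
{
	if (model_has_pmem_api)
		model_barriers++; /* models arch_wmb_pmem() */
	/* otherwise a no-op: durability is not guaranteed */
}

/* Write path: copy the payload, then issue a single barrier. */
static void model_pmem_write(void *pmem, const void *buf, size_t n)
{
	model_memcpy_to_pmem(pmem, buf, n);
	model_wmb_pmem();
}
```

In the model, as in the patch, callers never test for the arch api themselves. Without it the data is still copied, but no barrier runs, which is why pmem_alloc() warns that persistence cannot be guaranteed.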

2015-06-25 17:45:50

by Toshi Kani

Subject: Re: [PATCH v2 15/17] libnvdimm: Set numa_node to NVDIMM devices

On Thu, 2015-06-25 at 05:37 -0400, Dan Williams wrote:
> From: Toshi Kani <[email protected]>
>
> ACPI NFIT table has System Physical Address Range Structure entries that
> describe a proximity ID of each range when ACPI_NFIT_PROXIMITY_VALID is
> set in the flags.
>
> Change acpi_nfit_register_region() to map a proximity ID to its node ID,
> and set it to a new numa_node field of nd_region_desc, which is then
> conveyed to the nd_region device.
>
> The device core arranges for btt and namespace devices to inherit their
> node from their parent region.
>
> Signed-off-by: Toshi Kani <[email protected]>
> [djbw: move set_dev_node() from region 'probe' to 'create']

Sorry, I failed to mention another issue, which led me to call
set_dev_node() in probe. nd_async_device_register() calls device_add(),
which does:

/* use parent numa_node */
if (parent)
set_dev_node(dev, dev_to_node(parent));

and overwrites numa_node to -1. Since region's parent is ndbusN, we
cannot set numa_node to the parent. So, I had to set it in probe.

Thanks,
-Toshi
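
The overwrite described in this message can be reproduced with a small userspace model. This is a sketch only: struct model_dev and model_device_add() are hypothetical and capture nothing of device_add() beyond the parent-inheritance step quoted above.

```c
#include <assert.h>
#include <stddef.h>

#define NUMA_NO_NODE (-1)

/* Minimal stand-in for struct device; only the fields the model needs. */
struct model_dev {
	struct model_dev *parent;
	int numa_node;
};

/* Models only the "use parent numa_node" step of device_add(). */
static void model_device_add(struct model_dev *dev)
{
	if (dev->parent)
		dev->numa_node = dev->parent->numa_node;
}
```

A region whose node was set to 1 at create time comes out of model_device_add() with NUMA_NO_NODE, because its parent bus has no node. The same step is what lets namespace and btt devices pick up the node of their parent region.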

2015-06-25 17:47:57

by Dan Williams

Subject: Re: [PATCH v2 15/17] libnvdimm: Set numa_node to NVDIMM devices

On Thu, Jun 25, 2015 at 10:45 AM, Toshi Kani <[email protected]> wrote:
> On Thu, 2015-06-25 at 05:37 -0400, Dan Williams wrote:
>> From: Toshi Kani <[email protected]>
>>
>> ACPI NFIT table has System Physical Address Range Structure entries that
>> describe a proximity ID of each range when ACPI_NFIT_PROXIMITY_VALID is
>> set in the flags.
>>
>> Change acpi_nfit_register_region() to map a proximity ID to its node ID,
>> and set it to a new numa_node field of nd_region_desc, which is then
>> conveyed to the nd_region device.
>>
>> The device core arranges for btt and namespace devices to inherit their
>> node from their parent region.
>>
>> Signed-off-by: Toshi Kani <[email protected]>
>> [djbw: move set_dev_node() from region 'probe' to 'create']
>
> Sorry, I failed to mention other issue, which led me call set_dev_node()
> in probe. nd_async_device_register() calls device_add(), which does:
>
> /* use parent numa_node */
> if (parent)
> set_dev_node(dev, dev_to_node(parent));
>
> and overwrites numa_node to -1. Since region's parent is ndbusN, we
> cannot set numa_node to the parent. So, I had to set it in probe.

ah, ok.

2015-06-25 18:35:02

by Dan Williams

Subject: Re: [PATCH v2 15/17] libnvdimm: Set numa_node to NVDIMM devices

On Thu, 2015-06-25 at 11:45 -0600, Toshi Kani wrote:
> On Thu, 2015-06-25 at 05:37 -0400, Dan Williams wrote:
> > From: Toshi Kani <[email protected]>
> >
> > ACPI NFIT table has System Physical Address Range Structure entries that
> > describe a proximity ID of each range when ACPI_NFIT_PROXIMITY_VALID is
> > set in the flags.
> >
> > Change acpi_nfit_register_region() to map a proximity ID to its node ID,
> > and set it to a new numa_node field of nd_region_desc, which is then
> > conveyed to the nd_region device.
> >
> > The device core arranges for btt and namespace devices to inherit their
> > node from their parent region.
> >
> > Signed-off-by: Toshi Kani <[email protected]>
> > [djbw: move set_dev_node() from region 'probe' to 'create']
>
> Sorry, I failed to mention other issue, which led me call set_dev_node()
> in probe. nd_async_device_register() calls device_add(), which does:
>
> /* use parent numa_node */
> if (parent)
> set_dev_node(dev, dev_to_node(parent));
>
> and overwrites numa_node to -1. Since region's parent is ndbusN, we
> cannot set numa_node to the parent. So, I had to set it in probe.

In general, I still don't like leaving it up to ->probe() which is
within its rights to fail and not set the node. How about the following
that moves it to the bus uevent code? Should get triggered before probe
so the numa_node is valid before userspace is ever notified about the
device.

device_add() does:

kobject_uevent(&dev->kobj, KOBJ_ADD);
bus_probe_device(dev);

...so I think we're good, agree? I also added a missing init of
ndr_desc.numa_node in arch/x86/kernel/pmem.c, see below.
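
The ordering argument can be checked with a small userspace model. It is a sketch under stated assumptions: the model_* names are hypothetical, and it assumes the bus uevent callback runs as part of kobject_uevent(KOBJ_ADD), after the parent's node has been copied and before bus_probe_device().

```c
#include <assert.h>

#define NUMA_NO_NODE (-1)

struct model_region {
	int region_node;        /* node recorded at create time */
	int dev_node;           /* node reported by the device */
	int node_seen_by_probe; /* what probe observed */
};

/* Models the bus uevent hook: restore the node from the region. */
static void model_uevent(struct model_region *r)
{
	r->dev_node = r->region_node;
}

/* Models a probe that records what it sees and then fails. */
static int model_probe(struct model_region *r)
{
	r->node_seen_by_probe = r->dev_node;
	return -1;
}

static int model_device_add(struct model_region *r, int parent_node)
{
	r->dev_node = parent_node; /* "use parent numa_node" */
	model_uevent(r);           /* kobject_uevent(KOBJ_ADD) */
	return model_probe(r);     /* bus_probe_device() */
}
```

Even though the modeled probe fails, the node is already correct by the time it runs, which is the property the uevent placement is meant to provide.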

8<-----
Subject: libnvdimm: Set numa_node to NVDIMM devices

From: Toshi Kani <[email protected]>

ACPI NFIT table has System Physical Address Range Structure entries that
describe a proximity ID of each range when ACPI_NFIT_PROXIMITY_VALID is
set in the flags.

Change acpi_nfit_register_region() to map a proximity ID to its node ID,
and set it to a new numa_node field of nd_region_desc, which is then
conveyed to the nd_region device.

The device core arranges for btt and namespace devices to inherit their
node from their parent region.

Signed-off-by: Toshi Kani <[email protected]>
[djbw: move set_dev_node() from region.c to bus.c]
Signed-off-by: Dan Williams <[email protected]>
---
arch/x86/kernel/pmem.c | 1 +
drivers/acpi/nfit.c | 6 ++++++
drivers/nvdimm/bus.c | 6 ++++++
drivers/nvdimm/nd.h | 2 +-
drivers/nvdimm/region_devs.c | 1 +
include/linux/libnvdimm.h | 1 +
6 files changed, 16 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/pmem.c b/arch/x86/kernel/pmem.c
index 0f4ef472ab9e..64f90f53bb85 100644
--- a/arch/x86/kernel/pmem.c
+++ b/arch/x86/kernel/pmem.c
@@ -67,6 +67,7 @@ static __init int register_e820_pmem(void)
memset(&ndr_desc, 0, sizeof(ndr_desc));
ndr_desc.res = &res;
ndr_desc.attr_groups = e820_pmem_region_attribute_groups;
+ ndr_desc.numa_node = NUMA_NO_NODE;
if (!nvdimm_pmem_region_create(nvdimm_bus, &ndr_desc))
goto err;
}
diff --git a/drivers/acpi/nfit.c b/drivers/acpi/nfit.c
index 1f6f1b1a54f4..d96c8fe974dd 100644
--- a/drivers/acpi/nfit.c
+++ b/drivers/acpi/nfit.c
@@ -1392,6 +1392,12 @@ static int acpi_nfit_register_region(struct acpi_nfit_desc *acpi_desc,
ndr_desc->res = &res;
ndr_desc->provider_data = nfit_spa;
ndr_desc->attr_groups = acpi_nfit_region_attribute_groups;
+ if (spa->flags & ACPI_NFIT_PROXIMITY_VALID)
+ ndr_desc->numa_node = acpi_map_pxm_to_online_node(
+ spa->proximity_domain);
+ else
+ ndr_desc->numa_node = NUMA_NO_NODE;
+
list_for_each_entry(nfit_memdev, &acpi_desc->memdevs, list) {
struct acpi_nfit_memory_map *memdev = nfit_memdev->memdev;
struct nd_mapping *nd_mapping;
diff --git a/drivers/nvdimm/bus.c b/drivers/nvdimm/bus.c
index ec59f1f26d95..205344643852 100644
--- a/drivers/nvdimm/bus.c
+++ b/drivers/nvdimm/bus.c
@@ -48,6 +48,12 @@ static int to_nd_device_type(struct device *dev)

static int nvdimm_bus_uevent(struct device *dev, struct kobj_uevent_env *env)
{
+ /*
+ * Ensure that region devices always have their numa node set as
+ * early as possible.
+ */
+ if (is_nd_pmem(dev) || is_nd_blk(dev))
+ set_dev_node(dev, to_nd_region(dev)->numa_node);
return add_uevent_var(env, "MODALIAS=" ND_DEVICE_MODALIAS_FMT,
to_nd_device_type(dev));
}
diff --git a/drivers/nvdimm/nd.h b/drivers/nvdimm/nd.h
index b870de9add79..72c26461835d 100644
--- a/drivers/nvdimm/nd.h
+++ b/drivers/nvdimm/nd.h
@@ -96,7 +96,7 @@ struct nd_region {
u16 ndr_mappings;
u64 ndr_size;
u64 ndr_start;
- int id, num_lanes, ro;
+ int id, num_lanes, ro, numa_node;
void *provider_data;
struct nd_interleave_set *nd_set;
struct nd_percpu_lane __percpu *lane;
diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
index 8f8c7ea485f1..55b424f6ba0d 100644
--- a/drivers/nvdimm/region_devs.c
+++ b/drivers/nvdimm/region_devs.c
@@ -736,6 +736,7 @@ static struct nd_region *nd_region_create(struct nvdimm_bus *nvdimm_bus,
nd_region->nd_set = ndr_desc->nd_set;
nd_region->num_lanes = ndr_desc->num_lanes;
nd_region->ro = ro;
+ nd_region->numa_node = ndr_desc->numa_node;
ida_init(&nd_region->ns_ida);
dev = &nd_region->dev;
dev_set_name(dev, "region%d", nd_region->id);
diff --git a/include/linux/libnvdimm.h b/include/linux/libnvdimm.h
index dc799a29ed1a..30b3deaafd51 100644
--- a/include/linux/libnvdimm.h
+++ b/include/linux/libnvdimm.h
@@ -89,6 +89,7 @@ struct nd_region_desc {
struct nd_interleave_set *nd_set;
void *provider_data;
int num_lanes;
+ int numa_node;
};

struct nvdimm_bus;



2015-06-25 21:31:40

by Dan Williams

Subject: Re: [PATCH v2 15/17] libnvdimm: Set numa_node to NVDIMM devices

On Thu, Jun 25, 2015 at 11:34 AM, Williams, Dan J
<[email protected]> wrote:
> On Thu, 2015-06-25 at 11:45 -0600, Toshi Kani wrote:
>> On Thu, 2015-06-25 at 05:37 -0400, Dan Williams wrote:
>> > From: Toshi Kani <[email protected]>
>> >
>> > ACPI NFIT table has System Physical Address Range Structure entries that
>> > describe a proximity ID of each range when ACPI_NFIT_PROXIMITY_VALID is
>> > set in the flags.
>> >
>> > Change acpi_nfit_register_region() to map a proximity ID to its node ID,
>> > and set it to a new numa_node field of nd_region_desc, which is then
>> > conveyed to the nd_region device.
>> >
>> > The device core arranges for btt and namespace devices to inherit their
>> > node from their parent region.
>> >
>> > Signed-off-by: Toshi Kani <[email protected]>
>> > [djbw: move set_dev_node() from region 'probe' to 'create']
>>
>> Sorry, I failed to mention other issue, which led me call set_dev_node()
>> in probe. nd_async_device_register() calls device_add(), which does:
>>
>> /* use parent numa_node */
>> if (parent)
>> set_dev_node(dev, dev_to_node(parent));
>>
>> and overwrites numa_node to -1. Since region's parent is ndbusN, we
>> cannot set numa_node to the parent. So, I had to set it in probe.
>
> In general, I still don't like leaving it up to ->probe() which is
> within its rights to fail and not set the node. How about the following
> that moves it to the bus uevent code? Should get triggered before probe
> so the numa_node is valid before userspace is ever notified about the
> device.
>
> device_add() does:
>
> kobject_uevent(&dev->kobj, KOBJ_ADD);
> bus_probe_device(dev);
>
> ...so I think we're good, agree? I also added a missing init of
> ndr_desc.numa_node in arch/x86/kernel/pmem.c, see below.

This looks good in a quick manual test. It's interesting/illustrative
that I inadvertently broke the one bit of the libnvdimm sysfs
interface that did not have unit test coverage.

2015-06-25 21:52:17

by Toshi Kani

Subject: Re: [PATCH v2 15/17] libnvdimm: Set numa_node to NVDIMM devices

On Thu, 2015-06-25 at 14:31 -0700, Dan Williams wrote:
> On Thu, Jun 25, 2015 at 11:34 AM, Williams, Dan J
> <[email protected]> wrote:
> > On Thu, 2015-06-25 at 11:45 -0600, Toshi Kani wrote:
> >> On Thu, 2015-06-25 at 05:37 -0400, Dan Williams wrote:
> >> > From: Toshi Kani <[email protected]>
> >> >
> >> > ACPI NFIT table has System Physical Address Range Structure entries that
> >> > describe a proximity ID of each range when ACPI_NFIT_PROXIMITY_VALID is
> >> > set in the flags.
> >> >
> >> > Change acpi_nfit_register_region() to map a proximity ID to its node ID,
> >> > and set it to a new numa_node field of nd_region_desc, which is then
> >> > conveyed to the nd_region device.
> >> >
> >> > The device core arranges for btt and namespace devices to inherit their
> >> > node from their parent region.
> >> >
> >> > Signed-off-by: Toshi Kani <[email protected]>
> >> > [djbw: move set_dev_node() from region 'probe' to 'create']
> >>
> >> Sorry, I failed to mention other issue, which led me call set_dev_node()
> >> in probe. nd_async_device_register() calls device_add(), which does:
> >>
> >> /* use parent numa_node */
> >> if (parent)
> >> set_dev_node(dev, dev_to_node(parent));
> >>
> >> and overwrites numa_node to -1. Since region's parent is ndbusN, we
> >> cannot set numa_node to the parent. So, I had to set it in probe.
> >
> > In general, I still don't like leaving it up to ->probe() which is
> > within its rights to fail and not set the node. How about the following
> > that moves it to the bus uevent code? Should get triggered before probe
> > so the numa_node is valid before userspace is ever notified about the
> > device.
> >
> > device_add() does:
> >
> > kobject_uevent(&dev->kobj, KOBJ_ADD);
> > bus_probe_device(dev);
> >
> > ...so I think we're good, agree? I also added a missing init of
> > ndr_desc.numa_node in arch/x86/kernel/pmem.c, see below.
>
> This looks good in a quick manual test. It's interesting/illustrative
> that I inadvertently broke the one bit of the libnvdimm sysfs
> interface that did not have unit test coverage.

Sorry, I was interrupted. Yes, this works fine for region and
namespace devices. I'd like to check with you about btt, since the
attach logic has changed in v2.

Previously, as described in patch 16/17, a bttN bound to pmem had a valid
numa_node value, and the seed device btt0 had -1.

/sys/bus/nd/devices
|-- btt0/numa_node:-1
|-- btt1/numa_node:0

In this version, there are unbound (seeding?) btt0-3 for every region
(there are 4 regions) and btt4 & 5 bound to pmem0 & 3 on my system.

btt0/numa_node:0
btt1/numa_node:0
btt2/numa_node:1
btt3/numa_node:1
btt4/numa_node:0
btt5/numa_node:1

btt0
-> ../../../devices/LNXSYSTM:00/LNXSYBUS:00/ACPI0012:00/ndbus0/region0/btt0
btt1
-> ../../../devices/LNXSYSTM:00/LNXSYBUS:00/ACPI0012:00/ndbus0/region1/btt1
btt2
-> ../../../devices/LNXSYSTM:00/LNXSYBUS:00/ACPI0012:00/ndbus0/region2/btt2
btt3
-> ../../../devices/LNXSYSTM:00/LNXSYBUS:00/ACPI0012:00/ndbus0/region3/btt3
btt4
-> ../../../devices/LNXSYSTM:00/LNXSYBUS:00/ACPI0012:00/ndbus0/region0/btt4
btt5
-> ../../../devices/LNXSYSTM:00/LNXSYBUS:00/ACPI0012:00/ndbus0/region3/btt5

And unbound bttNs attach to different regions across a reboot.

btt0/numa_node:0
btt1/numa_node:1
btt2/numa_node:1
btt3/numa_node:0
btt4/numa_node:0
btt5/numa_node:1

btt0
-> ../../../devices/LNXSYSTM:00/LNXSYBUS:00/ACPI0012:00/ndbus0/region0/btt0
btt1
-> ../../../devices/LNXSYSTM:00/LNXSYBUS:00/ACPI0012:00/ndbus0/region3/btt1
btt2
-> ../../../devices/LNXSYSTM:00/LNXSYBUS:00/ACPI0012:00/ndbus0/region2/btt2
btt3
-> ../../../devices/LNXSYSTM:00/LNXSYBUS:00/ACPI0012:00/ndbus0/region1/btt3
btt4
-> ../../../devices/LNXSYSTM:00/LNXSYBUS:00/ACPI0012:00/ndbus0/region0/btt4
btt5
-> ../../../devices/LNXSYSTM:00/LNXSYBUS:00/ACPI0012:00/ndbus0/region3/btt5

Is this how you'd expect btt to work in this version? (I have not
looked at the btt changes yet)

Thanks,
-Toshi

2015-06-25 22:01:05

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH v2 15/17] libnvdimm: Set numa_node to NVDIMM devices

On Thu, Jun 25, 2015 at 2:51 PM, Toshi Kani <[email protected]> wrote:
> On Thu, 2015-06-25 at 14:31 -0700, Dan Williams wrote:
>> On Thu, Jun 25, 2015 at 11:34 AM, Williams, Dan J
>> <[email protected]> wrote:
>> > On Thu, 2015-06-25 at 11:45 -0600, Toshi Kani wrote:
>> >> On Thu, 2015-06-25 at 05:37 -0400, Dan Williams wrote:
>> >> > From: Toshi Kani <[email protected]>
>> >> >
>> >> > ACPI NFIT table has System Physical Address Range Structure entries that
>> >> > describe a proximity ID of each range when ACPI_NFIT_PROXIMITY_VALID is
>> >> > set in the flags.
>> >> >
>> >> > Change acpi_nfit_register_region() to map a proximity ID to its node ID,
>> >> > and set it to a new numa_node field of nd_region_desc, which is then
>> >> > conveyed to the nd_region device.
>> >> >
>> >> > The device core arranges for btt and namespace devices to inherit their
>> >> > node from their parent region.
>> >> >
>> >> > Signed-off-by: Toshi Kani <[email protected]>
>> >> > [djbw: move set_dev_node() from region 'probe' to 'create']
>> >>
>> >> Sorry, I failed to mention other issue, which led me call set_dev_node()
>> >> in probe. nd_async_device_register() calls device_add(), which does:
>> >>
>> >> /* use parent numa_node */
>> >> if (parent)
>> >> set_dev_node(dev, dev_to_node(parent));
>> >>
>> >> and overwrites numa_node to -1. Since region's parent is ndbusN, we
>> >> cannot set numa_node to the parent. So, I had to set it in probe.
>> >
>> > In general, I still don't like leaving it up to ->probe() which is
>> > within its rights to fail and not set the node. How about the following
>> > that moves it to the bus uevent code? Should get triggered before probe
>> > so the numa_node is valid before userspace is ever notified about the
>> > device.
>> >
>> > device_add() does:
>> >
>> > kobject_uevent(&dev->kobj, KOBJ_ADD);
>> > bus_probe_device(dev);
>> >
>> > ...so I think we're good, agree? I also added a missing init of
>> > ndr_desc.numa_node in arch/x86/kernel/pmem.c, see below.
>>
>> This looks good in a quick manual test. It's interesting/illustrative
>> that I inadvertently broke the one bit of the libnvdimm sysfs
>> interface that did not have unit test coverage.
>
> Sorry I had some interrupt. Yes, this works fine for region &
> namespace. I'd like to check with you for btt since the attach logic
> has changed in v2.
>
> Previously, as described in patch 16/17, bttN bound to pmem had a valid
> numa_node value, and seeding btt0 had -1.
>
> /sys/bus/nd/devices
> |-- btt0/numa_node:-1
> |-- btt1/numa_node:0
>
> In this version, there are unbound (seeding?) btt0-3 for every region
> (there are 4 regions) and btt4 & 5 bound to pmem0 & 3 on my system.
>
> btt0/numa_node:0
> btt1/numa_node:0
> btt2/numa_node:1
> btt3/numa_node:1
> btt4/numa_node:0
> btt5/numa_node:1
>
> btt0
> -> ../../../devices/LNXSYSTM:00/LNXSYBUS:00/ACPI0012:00/ndbus0/region0/btt0
> btt1
> -> ../../../devices/LNXSYSTM:00/LNXSYBUS:00/ACPI0012:00/ndbus0/region1/btt1
> btt2
> -> ../../../devices/LNXSYSTM:00/LNXSYBUS:00/ACPI0012:00/ndbus0/region2/btt2
> btt3
> -> ../../../devices/LNXSYSTM:00/LNXSYBUS:00/ACPI0012:00/ndbus0/region3/btt3
> btt4
> -> ../../../devices/LNXSYSTM:00/LNXSYBUS:00/ACPI0012:00/ndbus0/region0/btt4
> btt5
> -> ../../../devices/LNXSYSTM:00/LNXSYBUS:00/ACPI0012:00/ndbus0/region3/btt5
>
> And unbound bttNs attach to different regions across a reboot.
>
> btt0/numa_node:0
> btt1/numa_node:1
> btt2/numa_node:1
> btt3/numa_node:0
> btt4/numa_node:0
> btt5/numa_node:1
>
> btt0
> -> ../../../devices/LNXSYSTM:00/LNXSYBUS:00/ACPI0012:00/ndbus0/region0/btt0
> btt1
> -> ../../../devices/LNXSYSTM:00/LNXSYBUS:00/ACPI0012:00/ndbus0/region3/btt1
> btt2
> -> ../../../devices/LNXSYSTM:00/LNXSYBUS:00/ACPI0012:00/ndbus0/region2/btt2
> btt3
> -> ../../../devices/LNXSYSTM:00/LNXSYBUS:00/ACPI0012:00/ndbus0/region1/btt3
> btt4
> -> ../../../devices/LNXSYSTM:00/LNXSYBUS:00/ACPI0012:00/ndbus0/region0/btt4
> btt5
> -> ../../../devices/LNXSYSTM:00/LNXSYBUS:00/ACPI0012:00/ndbus0/region3/btt5
>
> Is this how you'd expect btt to work in this version? (I have not
> looked at the btt changes yet)

Yes, this looks fine.

As requested by Christoph, in the latest version BTTs are child
devices of regions rather than buses. They automatically inherit the
numa_node of the parent region. In your dump above the numa_nodes are
not changing from boot to boot; rather, the BTTs are registered
asynchronously, so they get different ids from boot to boot. Userspace
should not care what the btt id is, and the naming trick we use to
give block devices static names would not work for BTTs. The child
block device of the BTT will still have the static name we
discussed earlier (/dev/pmemXs or /dev/ndblkX.Ys) because the scan
order of those is deterministic.
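
To illustrate keying off the parent region instead of the btt id, here
is a minimal shell sketch that resolves each btt symlink and prints the
device name, its parent region, and its numa_node. The function name
btt_parent_regions and its optional directory argument are made up for
this example; the sysfs layout is the one shown in the dump above.

```shell
#!/bin/sh
# btt_parent_regions [DEVICES_DIR]
# Prints "<btt> <parent region> <numa_node>" for every btt* entry.
# DEVICES_DIR is a hypothetical override for testing; it defaults to
# /sys/bus/nd/devices.
btt_parent_regions() {
    devdir=${1:-/sys/bus/nd/devices}
    for link in "$devdir"/btt*; do
        # Skip the unexpanded glob when no btt devices exist.
        [ -e "$link" ] || continue
        # The entries are symlinks; the directory that contains the
        # resolved target is the parent region device.
        target=$(readlink -f "$link") || continue
        region=$(basename "$(dirname "$target")")
        # Print "?" if numa_node cannot be read.
        node=$(cat "$link/numa_node" 2>/dev/null || echo "?")
        printf '%s %s %s\n' "$(basename "$link")" "$region" "$node"
    done
}
```

Because the region comes from the resolved path, the output stays
meaningful even when the btt ids are handed out in a different order on
the next boot.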

2015-06-25 22:11:57

by Toshi Kani

[permalink] [raw]
Subject: Re: [PATCH v2 15/17] libnvdimm: Set numa_node to NVDIMM devices

On Thu, 2015-06-25 at 15:00 -0700, Dan Williams wrote:
> On Thu, Jun 25, 2015 at 2:51 PM, Toshi Kani <[email protected]> wrote:
> > On Thu, 2015-06-25 at 14:31 -0700, Dan Williams wrote:
> >> On Thu, Jun 25, 2015 at 11:34 AM, Williams, Dan J
> >> <[email protected]> wrote:
> >> > On Thu, 2015-06-25 at 11:45 -0600, Toshi Kani wrote:
> >> >> On Thu, 2015-06-25 at 05:37 -0400, Dan Williams wrote:
> >> >> > From: Toshi Kani <[email protected]>
> >> >> >
> >> >> > ACPI NFIT table has System Physical Address Range Structure entries that
> >> >> > describe a proximity ID of each range when ACPI_NFIT_PROXIMITY_VALID is
> >> >> > set in the flags.
> >> >> >
> >> >> > Change acpi_nfit_register_region() to map a proximity ID to its node ID,
> >> >> > and set it to a new numa_node field of nd_region_desc, which is then
> >> >> > conveyed to the nd_region device.
> >> >> >
> >> >> > The device core arranges for btt and namespace devices to inherit their
> >> >> > node from their parent region.
> >> >> >
> >> >> > Signed-off-by: Toshi Kani <[email protected]>
> >> >> > [djbw: move set_dev_node() from region 'probe' to 'create']
> >> >>
> >> >> Sorry, I failed to mention other issue, which led me call set_dev_node()
> >> >> in probe. nd_async_device_register() calls device_add(), which does:
> >> >>
> >> >> /* use parent numa_node */
> >> >> if (parent)
> >> >> set_dev_node(dev, dev_to_node(parent));
> >> >>
> >> >> and overwrites numa_node to -1. Since region's parent is ndbusN, we
> >> >> cannot set numa_node to the parent. So, I had to set it in probe.
> >> >
> >> > In general, I still don't like leaving it up to ->probe() which is
> >> > within its rights to fail and not set the node. How about the following
> >> > that moves it to the bus uevent code? Should get triggered before probe
> >> > so the numa_node is valid before userspace is ever notified about the
> >> > device.
> >> >
> >> > device_add() does:
> >> >
> >> > kobject_uevent(&dev->kobj, KOBJ_ADD);
> >> > bus_probe_device(dev);
> >> >
> >> > ...so I think we're good, agree? I also added a missing init of
> >> > ndr_desc.numa_node in arch/x86/kernel/pmem.c, see below.
> >>
> >> This looks good in a quick manual test. It's interesting/illustrative
> >> that I inadvertently broke the one bit of the libnvdimm sysfs
> >> interface that did not have unit test coverage.
> >
> > Sorry I had some interrupt. Yes, this works fine for region &
> > namespace. I'd like to check with you for btt since the attach logic
> > has changed in v2.
> >
> > Previously, as described in patch 16/17, bttN bound to pmem had a valid
> > numa_node value, and seeding btt0 had -1.
> >
> > /sys/bus/nd/devices
> > |-- btt0/numa_node:-1
> > |-- btt1/numa_node:0
> >
> > In this version, there are unbound (seeding?) btt0-3 for every region
> > (there are 4 regions) and btt4 & 5 bound to pmem0 & 3 on my system.
> >
> > btt0/numa_node:0
> > btt1/numa_node:0
> > btt2/numa_node:1
> > btt3/numa_node:1
> > btt4/numa_node:0
> > btt5/numa_node:1
> >
> > btt0
> > -> ../../../devices/LNXSYSTM:00/LNXSYBUS:00/ACPI0012:00/ndbus0/region0/btt0
> > btt1
> > -> ../../../devices/LNXSYSTM:00/LNXSYBUS:00/ACPI0012:00/ndbus0/region1/btt1
> > btt2
> > -> ../../../devices/LNXSYSTM:00/LNXSYBUS:00/ACPI0012:00/ndbus0/region2/btt2
> > btt3
> > -> ../../../devices/LNXSYSTM:00/LNXSYBUS:00/ACPI0012:00/ndbus0/region3/btt3
> > btt4
> > -> ../../../devices/LNXSYSTM:00/LNXSYBUS:00/ACPI0012:00/ndbus0/region0/btt4
> > btt5
> > -> ../../../devices/LNXSYSTM:00/LNXSYBUS:00/ACPI0012:00/ndbus0/region3/btt5
> >
> > And unbound bttNs attach to different regions across a reboot.
> >
> > btt0/numa_node:0
> > btt1/numa_node:1
> > btt2/numa_node:1
> > btt3/numa_node:0
> > btt4/numa_node:0
> > btt5/numa_node:1
> >
> > btt0
> > -> ../../../devices/LNXSYSTM:00/LNXSYBUS:00/ACPI0012:00/ndbus0/region0/btt0
> > btt1
> > -> ../../../devices/LNXSYSTM:00/LNXSYBUS:00/ACPI0012:00/ndbus0/region3/btt1
> > btt2
> > -> ../../../devices/LNXSYSTM:00/LNXSYBUS:00/ACPI0012:00/ndbus0/region2/btt2
> > btt3
> > -> ../../../devices/LNXSYSTM:00/LNXSYBUS:00/ACPI0012:00/ndbus0/region1/btt3
> > btt4
> > -> ../../../devices/LNXSYSTM:00/LNXSYBUS:00/ACPI0012:00/ndbus0/region0/btt4
> > btt5
> > -> ../../../devices/LNXSYSTM:00/LNXSYBUS:00/ACPI0012:00/ndbus0/region3/btt5
> >
> > Is this how you'd expect btt to work in this version? (I have not
> > looked at the btt changes yet)
>
> Yes, this looks fine.
>
> As requested by Christoph, in the latest version BTTs are child
> devices of regions rather than busses. They automatically inherit the
> numa_node of the parent region. In your dump above the numa_nodes are
> not changing from boot-to-boot, instead the BTTs are registered
> asynchronously so get different ids from boot-to-boot. Userspace
> should not care what the btt id is and the same naming trick we use to
> give block devices static names would not work for BTTs. The child
> block device of the BTT will still have the static name as we
> discussed earlier (/dev/pmemXs or /dev/ndblkX.Ys) because the scan
> order of those is deterministic.

Yes, I see no problem with bound BTTs and their device files. So, how
do we bind BTT with this new version?

Thanks,
-Toshi


2015-06-25 22:35:05

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH v2 15/17] libnvdimm: Set numa_node to NVDIMM devices

On Thu, Jun 25, 2015 at 3:11 PM, Toshi Kani <[email protected]> wrote:
> On Thu, 2015-06-25 at 15:00 -0700, Dan Williams wrote:
> Yes, I see no problem with bound BTTs and their device files. So, how
> do we bind BTT with this new version?
>

# cd /sys/bus/nd/devices
# uuidgen > btt6/uuid
# echo 4096 > btt6/sector_size
# echo namespace6.0 > btt6/namespace
# echo namespace6.0 > ../drivers/nd_pmem/unbind
# echo btt6 > ../drivers/nd_pmem/bind

After reboot, when the system sees namespace6.0 again it will notice
the btt instance and attach bttX instead. The net effect is that now
you'll only ever have /dev/pmem6 or /dev/pmem6s, never both at the
same time; having both was a side effect of the stacking approach.

I'll post the patch that updates libndctl and the unit tests shortly
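
The sequence above can be wrapped in a small helper that stops at the
first failed write, so that an error in an early step is not masked by
the later ones. This is only a sketch under the assumptions of this
thread: btt_bind and its arguments are hypothetical names, while the
uuid, sector_size, and namespace attributes and the nd_pmem bind/unbind
files are the ones used in the commands above.

```shell
#!/bin/sh
# btt_bind BTT NAMESPACE [SECTOR_SIZE] [ND_ROOT]
# Configures BTT on top of NAMESPACE and hands the btt device to
# nd_pmem. ND_ROOT is a hypothetical override for testing; it defaults
# to /sys/bus/nd. Returns non-zero as soon as one write fails.
btt_bind() {
    btt=$1
    ns=$2
    sector=${3:-4096}
    root=${4:-/sys/bus/nd}

    # Same order as the manual steps: describe the btt, claim the
    # namespace, detach the raw namespace, then bind the btt.
    uuidgen > "$root/devices/$btt/uuid" &&
    echo "$sector" > "$root/devices/$btt/sector_size" &&
    echo "$ns" > "$root/devices/$btt/namespace" &&
    echo "$ns" > "$root/drivers/nd_pmem/unbind" &&
    echo "$btt" > "$root/drivers/nd_pmem/bind"
}
```

For the example above this would be invoked as
"btt_bind btt6 namespace6.0 4096", with the exit status telling the
caller whether every step was accepted.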

2015-06-25 22:55:42

by Toshi Kani

[permalink] [raw]
Subject: Re: [PATCH v2 15/17] libnvdimm: Set numa_node to NVDIMM devices

On Thu, 2015-06-25 at 15:34 -0700, Dan Williams wrote:
> On Thu, Jun 25, 2015 at 3:11 PM, Toshi Kani <[email protected]> wrote:
> > On Thu, 2015-06-25 at 15:00 -0700, Dan Williams wrote:
> > Yes, I see no problem with bound BTTs and their device files. So, how
> > do we bind BTT with this new version?
> >
>
> # cd /sys/bus/nd/devices
> # uuidgen > btt6/uuid
> # echo 4096 > btt6/sector_size
> # echo namespace6.0 > btt6/namespace
> # echo namespace6.0 > ../drivers/nd_pmem/unbind
> # echo btt6 > ../drivers/nd_pmem/bind
>
> After reboot, when the system sees namespace6.0 again it will notice
> the btt instance and attach bttX instead. The net effect is that now
> you'll only ever have /dev/pmem6 or /dev/pmem6s, never both at the
> same time that was a side effect of the stacking approach.
>
> I'll post the patch that updates libndctl and the unit tests shortly

Maybe I am missing something, but I am getting errors on my system. (I
used btt0 since there is no btt6.)

# cat bind.sh
set -x
cd /sys/bus/nd/devices
uuidgen > btt0/uuid
echo 4096 > btt0/sector_size
echo namespace0.0 > btt0/namespace
echo namespace0.0 > ../drivers/nd_pmem/unbind
echo btt0 > ../drivers/nd_pmem/bind

# sh bind.sh
+ cd /sys/bus/nd/devices
+ uuidgen
+ echo 4096
+ echo namespace0.0
bind.sh: line 6: echo: write error: Device or resource busy
+ echo namespace0.0
bind.sh: line 7: echo: write error: No such device
+ echo btt0
bind.sh: line 8: echo: write error: No such device

# dmesg
:
[12513.839162] nd btt0: uuid_store: result: 0 wrote:
b32cd195-9aae-4c54-a5ac-49adb50a8a98
[12513.880286] nd btt0: sector_size_store: result: 0 wrote: 4096
[12513.909494] nd btt0: namespace0.0 already claimed
[12513.933364] nd btt0: namespace_store: result: -16 wrote: namespace0.0
[12513.966808] ndbus0: nd_pmem.probe(btt0) = -19

Thanks,
-Toshi

2015-06-25 23:42:53

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH v2 15/17] libnvdimm: Set numa_node to NVDIMM devices

On Thu, 2015-06-25 at 16:55 -0600, Toshi Kani wrote:
> On Thu, 2015-06-25 at 15:34 -0700, Dan Williams wrote:
> > On Thu, Jun 25, 2015 at 3:11 PM, Toshi Kani <[email protected]> wrote:
> > > On Thu, 2015-06-25 at 15:00 -0700, Dan Williams wrote:
> > > Yes, I see no problem with bound BTTs and their device files. So, how
> > > do we bind BTT with this new version?
> > >
> >
> > # cd /sys/bus/nd/devices
> > # uuidgen > btt6/uuid
> > # echo 4096 > btt6/sector_size
> > # echo namespace6.0 > btt6/namespace
> > # echo namespace6.0 > ../drivers/nd_pmem/unbind
> > # echo btt6 > ../drivers/nd_pmem/bind
> >
> > After reboot, when the system sees namespace6.0 again it will notice
> > the btt instance and attach bttX instead. The net effect is that now
> > you'll only ever have /dev/pmem6 or /dev/pmem6s, never both at the
> > same time that was a side effect of the stacking approach.
> >
> > I'll post the patch that updates libndctl and the unit tests shortly
>
> Maybe I am missing something, but I am getting errors on my system. (I
> used btt0 since there is no btt6.)
>
> # cat bind.sh
> set -x
> cd /sys/bus/nd/devices
> uuidgen > btt0/uuid
> echo 4096 > btt0/sector_size
> echo namespace0.0 > btt0/namespace
> echo namespace0.0 > ../drivers/nd_pmem/unbind
> echo btt0 > ../drivers/nd_pmem/bind
>
> # sh bind.sh
> + cd /sys/bus/nd/devices
> + uuidgen
> + echo 4096
> + echo namespace0.0
> bind.sh: line 6: echo: write error: Device or resource busy
> + echo namespace0.0
> bind.sh: line 7: echo: write error: No such device
> + echo btt0
> bind.sh: line 8: echo: write error: No such device
>
> # dmesg
> :
> [12513.839162] nd btt0: uuid_store: result: 0 wrote:
> b32cd195-9aae-4c54-a5ac-49adb50a8a98
> [12513.880286] nd btt0: sector_size_store: result: 0 wrote: 4096
> [12513.909494] nd btt0: namespace0.0 already claimed
> [12513.933364] nd btt0: namespace_store: result: -16 wrote: namespace0.0
> [12513.966808] ndbus0: nd_pmem.probe(btt0) = -19
>

So this turned out to be a perfect example of why we might want to have
the region-id in the btt device name, just like namespaces: btt0
was actually bound to namespace4.0 on Toshi's system. The following
update, which I will fold into the series, fixes this. Note that the
association of btt id to namespace is still non-deterministic, i.e.
btt0.1 could be assigned as the btt for namespace0.0, but at least when
looking at /sys/bus/nd/devices it will be clear which btts belong to
which regions.

# ls /sys/bus/nd/devices
btt0.0 btt3.0 btt6.0 namespace2.0 namespace5.0 nmem1 nmem4 region2 region5
btt1.0 btt4.0 namespace0.0 namespace3.0 namespace6.0 nmem2 region0 region3 region6
btt2.0 btt5.0 namespace1.0 namespace4.0 nmem0 nmem3 region1 region4
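
With the region id encoded in the name, a script can recover the parent
region from the device name alone. A minimal sketch, assuming the
"btt<region>.<id>" format produced by the patch below; btt_region_of is
a hypothetical helper name.

```shell
#!/bin/sh
# btt_region_of NAME
# Prints "region<N>" for a name of the form btt<N>.<id>; returns 1 for
# anything else (including the old flat "bttN" names).
btt_region_of() {
    name=$1
    rest=${name#btt}
    # Name must start with "btt".
    [ "$rest" != "$name" ] || return 1
    region=${rest%%.*}
    # Name must contain a "." after the region id.
    [ "$region" != "$rest" ] || return 1
    # Region id must be a non-empty decimal number.
    case $region in
        ''|*[!0-9]*) return 1 ;;
    esac
    echo "region$region"
}
```

The helper only identifies the region; as noted above, which namespace
inside that region a given btt ends up claiming is still not
deterministic.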


8<------------
diff --git a/drivers/nvdimm/btt_devs.c b/drivers/nvdimm/btt_devs.c
index 2dfb529f4d35..6ac8c0fea3ec 100644
--- a/drivers/nvdimm/btt_devs.c
+++ b/drivers/nvdimm/btt_devs.c
@@ -21,8 +21,6 @@
 #include "btt.h"
 #include "nd.h"

-static DEFINE_IDA(btt_ida);
-
 static void __nd_btt_detach_ndns(struct nd_btt *nd_btt)
 {
 	struct nd_namespace_common *ndns = nd_btt->ndns;
@@ -75,11 +73,12 @@ static bool nd_btt_attach_ndns(struct nd_btt *nd_btt,

 static void nd_btt_release(struct device *dev)
 {
+	struct nd_region *nd_region = to_nd_region(dev->parent);
 	struct nd_btt *nd_btt = to_nd_btt(dev);

 	dev_dbg(dev, "%s\n", __func__);
 	nd_btt_detach_ndns(nd_btt);
-	ida_simple_remove(&btt_ida, nd_btt->id);
+	ida_simple_remove(&nd_region->btt_ida, nd_btt->id);
 	kfree(nd_btt->uuid);
 	kfree(nd_btt);
 }
@@ -309,7 +308,7 @@ static struct device *__nd_btt_create(struct nd_region *nd_region,
 	if (!nd_btt)
 		return NULL;

-	nd_btt->id = ida_simple_get(&btt_ida, 0, 0, GFP_KERNEL);
+	nd_btt->id = ida_simple_get(&nd_region->btt_ida, 0, 0, GFP_KERNEL);
 	if (nd_btt->id < 0) {
 		kfree(nd_btt);
 		return NULL;
@@ -320,7 +319,7 @@ static struct device *__nd_btt_create(struct nd_region *nd_region,
 		uuid = kmemdup(uuid, 16, GFP_KERNEL);
 	nd_btt->uuid = uuid;
 	dev = &nd_btt->dev;
-	dev_set_name(dev, "btt%d", nd_btt->id);
+	dev_set_name(dev, "btt%d.%d", nd_region->id, nd_btt->id);
 	dev->parent = &nd_region->dev;
 	dev->type = &nd_btt_device_type;
 	dev->groups = nd_btt_attribute_groups;
diff --git a/drivers/nvdimm/nd.h b/drivers/nvdimm/nd.h
index 72c26461835d..c41f53e74277 100644
--- a/drivers/nvdimm/nd.h
+++ b/drivers/nvdimm/nd.h
@@ -91,6 +91,7 @@ struct nd_percpu_lane {
 struct nd_region {
 	struct device dev;
 	struct ida ns_ida;
+	struct ida btt_ida;
 	struct device *ns_seed;
 	struct device *btt_seed;
 	u16 ndr_mappings;
diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
index 55b424f6ba0d..a5233422f9dc 100644
--- a/drivers/nvdimm/region_devs.c
+++ b/drivers/nvdimm/region_devs.c
@@ -738,6 +738,7 @@ static struct nd_region *nd_region_create(struct nvdimm_bus *nvdimm_bus,
 	nd_region->ro = ro;
 	nd_region->numa_node = ndr_desc->numa_node;
 	ida_init(&nd_region->ns_ida);
+	ida_init(&nd_region->btt_ida);
 	dev = &nd_region->dev;
 	dev_set_name(dev, "region%d", nd_region->id);
 	dev->parent = &nvdimm_bus->dev;


2015-06-26 00:55:47

by Toshi Kani

[permalink] [raw]
Subject: Re: [PATCH v2 15/17] libnvdimm: Set numa_node to NVDIMM devices

On Thu, 2015-06-25 at 23:42 +0000, Williams, Dan J wrote:
> On Thu, 2015-06-25 at 16:55 -0600, Toshi Kani wrote:
> > On Thu, 2015-06-25 at 15:34 -0700, Dan Williams wrote:
> > > On Thu, Jun 25, 2015 at 3:11 PM, Toshi Kani <[email protected]> wrote:
> > > > On Thu, 2015-06-25 at 15:00 -0700, Dan Williams wrote:
> > > > Yes, I see no problem with bound BTTs and their device files. So, how
> > > > do we bind BTT with this new version?
> > > >
> > >
> > > # cd /sys/bus/nd/devices
> > > # uuidgen > btt6/uuid
> > > # echo 4096 > btt6/sector_size
> > > # echo namespace6.0 > btt6/namespace
> > > # echo namespace6.0 > ../drivers/nd_pmem/unbind
> > > # echo btt6 > ../drivers/nd_pmem/bind
> > >
> > > After reboot, when the system sees namespace6.0 again it will notice
> > > the btt instance and attach bttX instead. The net effect is that now
> > > you'll only ever have /dev/pmem6 or /dev/pmem6s, never both at the
> > > same time that was a side effect of the stacking approach.
> > >
> > > I'll post the patch that updates libndctl and the unit tests shortly
> >
> > Maybe I am missing something, but I am getting errors on my system. (I
> > used btt0 since there is no btt6.)
> >
> > # cat bind.sh
> > set -x
> > cd /sys/bus/nd/devices
> > uuidgen > btt0/uuid
> > echo 4096 > btt0/sector_size
> > echo namespace0.0 > btt0/namespace
> > echo namespace0.0 > ../drivers/nd_pmem/unbind
> > echo btt0 > ../drivers/nd_pmem/bind
> >
> > # sh bind.sh
> > + cd /sys/bus/nd/devices
> > + uuidgen
> > + echo 4096
> > + echo namespace0.0
> > bind.sh: line 6: echo: write error: Device or resource busy
> > + echo namespace0.0
> > bind.sh: line 7: echo: write error: No such device
> > + echo btt0
> > bind.sh: line 8: echo: write error: No such device
> >
> > # dmesg
> > :
> > [12513.839162] nd btt0: uuid_store: result: 0 wrote:
> > b32cd195-9aae-4c54-a5ac-49adb50a8a98
> > [12513.880286] nd btt0: sector_size_store: result: 0 wrote: 4096
> > [12513.909494] nd btt0: namespace0.0 already claimed
> > [12513.933364] nd btt0: namespace_store: result: -16 wrote: namespace0.0
> > [12513.966808] ndbus0: nd_pmem.probe(btt0) = -19
> >
>
> So this turned out to be a perfect example of why we might want to have
> the region-id in the btt device name just like namespaces, because btt0
> was actually bound to namespace4.0 on Toshi's system. The following
> update, that I will fold in to the series, fixes this. Note that the
> association of btt id to to namespace is still non-deterministic. I.e.
> btt0.1 could be assigned as the btt for namespace0.0, but at least when
> looking at /sys/bus/nd/devices it will be clear which btts belong to
> which regions.
>
> # ls /sys/bus/nd/devices
> btt0.0 btt3.0 btt6.0 namespace2.0 namespace5.0 nmem1 nmem4 region2 region5
> btt1.0 btt4.0 namespace0.0 namespace3.0 namespace6.0 nmem2 region0 region3 region6
> btt2.0 btt5.0 namespace1.0 namespace4.0 nmem0 nmem3 region1 region4

Yes, this works nicely. :-)

Now, how do I unbind BTT? I did the following as a guess, but BTT got
reattached again before I had a chance to delete the metadata, for
which I need /dev/pmemN.

NUM=1
cd /sys/bus/nd/devices
echo "" > btt${NUM}.1/namespace
echo btt${NUM}.1 > ../drivers/nd_pmem/unbind
echo namespace${NUM}.0 > ../drivers/nd_pmem/bind

Thanks,
-Toshi


2015-06-26 01:09:01

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH v2 15/17] libnvdimm: Set numa_node to NVDIMM devices

On Thu, Jun 25, 2015 at 5:55 PM, Toshi Kani <[email protected]> wrote:
> On Thu, 2015-06-25 at 23:42 +0000, Williams, Dan J wrote:
>> On Thu, 2015-06-25 at 16:55 -0600, Toshi Kani wrote:
>> > On Thu, 2015-06-25 at 15:34 -0700, Dan Williams wrote:
>> > > On Thu, Jun 25, 2015 at 3:11 PM, Toshi Kani <[email protected]> wrote:
>> > > > On Thu, 2015-06-25 at 15:00 -0700, Dan Williams wrote:
>> > > > Yes, I see no problem with bound BTTs and their device files. So, how
>> > > > do we bind BTT with this new version?
>> > > >
>> > >
>> > > # cd /sys/bus/nd/devices
>> > > # uuidgen > btt6/uuid
>> > > # echo 4096 > btt6/sector_size
>> > > # echo namespace6.0 > btt6/namespace
>> > > # echo namespace6.0 > ../drivers/nd_pmem/unbind
>> > > # echo btt6 > ../drivers/nd_pmem/bind
>> > >
>> > > After reboot, when the system sees namespace6.0 again it will notice
>> > > the btt instance and attach bttX instead. The net effect is that now
>> > > you'll only ever have /dev/pmem6 or /dev/pmem6s, never both at the
>> > > same time that was a side effect of the stacking approach.
>> > >
>> > > I'll post the patch that updates libndctl and the unit tests shortly
>> >
>> > Maybe I am missing something, but I am getting errors on my system. (I
>> > used btt0 since there is no btt6.)
>> >
>> > # cat bind.sh
>> > set -x
>> > cd /sys/bus/nd/devices
>> > uuidgen > btt0/uuid
>> > echo 4096 > btt0/sector_size
>> > echo namespace0.0 > btt0/namespace
>> > echo namespace0.0 > ../drivers/nd_pmem/unbind
>> > echo btt0 > ../drivers/nd_pmem/bind
>> >
>> > # sh bind.sh
>> > + cd /sys/bus/nd/devices
>> > + uuidgen
>> > + echo 4096
>> > + echo namespace0.0
>> > bind.sh: line 6: echo: write error: Device or resource busy
>> > + echo namespace0.0
>> > bind.sh: line 7: echo: write error: No such device
>> > + echo btt0
>> > bind.sh: line 8: echo: write error: No such device
>> >
>> > # dmesg
>> > :
>> > [12513.839162] nd btt0: uuid_store: result: 0 wrote:
>> > b32cd195-9aae-4c54-a5ac-49adb50a8a98
>> > [12513.880286] nd btt0: sector_size_store: result: 0 wrote: 4096
>> > [12513.909494] nd btt0: namespace0.0 already claimed
>> > [12513.933364] nd btt0: namespace_store: result: -16 wrote: namespace0.0
>> > [12513.966808] ndbus0: nd_pmem.probe(btt0) = -19
>> >
>>
>> So this turned out to be a perfect example of why we might want to have
>> the region-id in the btt device name just like namespaces, because btt0
>> was actually bound to namespace4.0 on Toshi's system. The following
>> update, that I will fold in to the series, fixes this. Note that the
>> association of btt id to to namespace is still non-deterministic. I.e.
>> btt0.1 could be assigned as the btt for namespace0.0, but at least when
>> looking at /sys/bus/nd/devices it will be clear which btts belong to
>> which regions.
>>
>> # ls /sys/bus/nd/devices
>> btt0.0 btt3.0 btt6.0 namespace2.0 namespace5.0 nmem1 nmem4 region2 region5
>> btt1.0 btt4.0 namespace0.0 namespace3.0 namespace6.0 nmem2 region0 region3 region6
>> btt2.0 btt5.0 namespace1.0 namespace4.0 nmem0 nmem3 region1 region4
>
> Yes, this works nicely. :-)
>
> Now, how do I unbind BTT? I did the following as a guess, but BTT got
> reattached again before I have a chance to delete the metadata, which I
> need /dev/pmemN.
>
> NUM=1
> cd /sys/bus/nd/devices
> echo "" > btt${NUM}.1/namespace
> echo btt${NUM}.1 > ../drivers/nd_pmem/unbind

echo 1 > namespace${NUM}.0/force_raw

> echo namespace${NUM}.0 > ../drivers/nd_pmem/bind
>

2015-06-26 01:22:20

by Toshi Kani

[permalink] [raw]
Subject: Re: [PATCH v2 15/17] libnvdimm: Set numa_node to NVDIMM devices

On Thu, 2015-06-25 at 18:08 -0700, Dan Williams wrote:
> On Thu, Jun 25, 2015 at 5:55 PM, Toshi Kani <[email protected]> wrote:
> > On Thu, 2015-06-25 at 23:42 +0000, Williams, Dan J wrote:
> >> On Thu, 2015-06-25 at 16:55 -0600, Toshi Kani wrote:
:
> >
> > Now, how do I unbind BTT? I did the following as a guess, but BTT got
> > reattached again before I have a chance to delete the metadata, which I
> > need /dev/pmemN.
> >
> > NUM=1
> > cd /sys/bus/nd/devices
> > echo "" > btt${NUM}.1/namespace
> > echo btt${NUM}.1 > ../drivers/nd_pmem/unbind
>
> echo 1 > namespace${NUM}.0/force_raw
>
> > echo namespace${NUM}.0 > ../drivers/nd_pmem/bind
> >

Cool! Yes, it works with force_raw. (I changed the ordering a
bit.)

NUM=1
cd /sys/bus/nd/devices
echo btt${NUM}.1 > ../drivers/nd_pmem/unbind
echo "" > btt${NUM}.1/namespace
echo 1 > namespace${NUM}.0/force_raw
echo namespace${NUM}.0 > ../drivers/nd_pmem/bind
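
The same steps can be sketched as a helper that stops at the first
failed write. The names btt_unbind_raw and ND_ROOT are made up for this
example; the attributes written (namespace, force_raw) and the nd_pmem
bind/unbind files are the ones used in the commands above, in the same
order.

```shell
#!/bin/sh
# btt_unbind_raw BTT NAMESPACE [ND_ROOT]
# Detaches BTT from nd_pmem and brings NAMESPACE back up in raw mode so
# that the btt metadata can be reached through /dev/pmemN.
# ND_ROOT is a hypothetical override for testing; it defaults to
# /sys/bus/nd. Returns non-zero as soon as one write fails.
btt_unbind_raw() {
    btt=$1
    ns=$2
    root=${3:-/sys/bus/nd}

    # 1. stop the driver on the btt device
    echo "$btt" > "$root/drivers/nd_pmem/unbind" &&
    # 2. release the namespace claimed by the btt
    echo "" > "$root/devices/$btt/namespace" &&
    # 3. ask for raw mode so the btt is not re-attached on probe
    echo 1 > "$root/devices/$ns/force_raw" &&
    # 4. bind the namespace itself to nd_pmem
    echo "$ns" > "$root/drivers/nd_pmem/bind"
}
```

With NUM=1 as above, the call would be
"btt_unbind_raw btt1.1 namespace1.0".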

Thanks,
-Toshi

2015-06-26 02:22:14

by Toshi Kani

[permalink] [raw]
Subject: Re: [PATCH v2 16/17] libnvdimm: Add sysfs numa_node to NVDIMM devices

On Thu, 2015-06-25 at 05:37 -0400, Dan Williams wrote:
> From: Toshi Kani <[email protected]>
>
> Add support of sysfs 'numa_node' to I/O-related NVDIMM devices
> under /sys/bus/nd/devices, regionN, namespaceN.0, and bttN.
> When bttN is not set up, its numa_node returns -1 (NUMA_NO_NODE).
>
> An example of numa_node values on a 2-socket system with a single
> NVDIMM range on each socket is shown below.
> /sys/bus/nd/devices
> |-- btt0/numa_node:-1
> |-- btt1/numa_node:0
> |-- namespace0.0/numa_node:0
> |-- namespace1.0/numa_node:1
> |-- region0/numa_node:0
> |-- region1/numa_node:1
>
> These numa_node files are then linked under the block class of
> their device names.
> /sys/class/block/pmem0/device/numa_node:0
> /sys/class/block/pmem0s/device/numa_node:0
> /sys/class/block/pmem1/device/numa_node:1
>
> This enables numactl(8) to accept 'block:' and 'file:' paths of
> pmem and btt devices as shown in the examples below.
> numactl --preferred block:pmem0 --show
> numactl --preferred file:/dev/pmem0s --show
>
> Signed-off-by: Toshi Kani <[email protected]>
> Signed-off-by: Dan Williams <[email protected]>

Can you please update the commit log with the following? It reflects
the changes in sysfs btt files.

Thanks,
-Toshi

=====
From: Toshi Kani <[email protected]>

Add support of sysfs 'numa_node' to I/O-related NVDIMM devices
under /sys/bus/nd/devices, regionN, namespaceN.0, and bttN.x.

An example of numa_node values on a 2-socket system with a single
NVDIMM range on each socket is shown below.
/sys/bus/nd/devices
|-- btt0.0/numa_node:0
|-- btt1.0/numa_node:1
|-- btt1.1/numa_node:1
|-- namespace0.0/numa_node:0
|-- namespace1.0/numa_node:1
|-- region0/numa_node:0
|-- region1/numa_node:1

These numa_node files are then linked under the block class of
their device names.
/sys/class/block/pmem0/device/numa_node:0
/sys/class/block/pmem1s/device/numa_node:1

This enables numactl(8) to accept 'block:' and 'file:' paths of
pmem and btt devices as shown in the examples below.
numactl --preferred block:pmem0 --show
numactl --preferred file:/dev/pmem1s --show

Signed-off-by: Toshi Kani <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
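
For scripts that want the node number directly rather than going
through numactl, the value can be read from the block class link
described in the commit log. A minimal sketch; blk_numa_node and its
optional sysfs-root argument are hypothetical, while the path follows
the /sys/class/block/<dev>/device/numa_node layout shown above.

```shell
#!/bin/sh
# blk_numa_node DEV [SYSFS_ROOT]
# Prints the numa_node of block device DEV (e.g. pmem0 or pmem1s).
# SYSFS_ROOT is a hypothetical override for testing; it defaults to
# /sys. Returns 1 if the attribute is missing or unreadable.
blk_numa_node() {
    node_file="${2:-/sys}/class/block/$1/device/numa_node"
    [ -r "$node_file" ] || return 1
    cat "$node_file"
}
```

A caller would typically check the exit status first and treat a
printed value of -1 (NUMA_NO_NODE) as "no affinity information".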

2015-06-26 15:26:54

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH v2 16/17] libnvdimm: Add sysfs numa_node to NVDIMM devices

On Thu, Jun 25, 2015 at 7:21 PM, Toshi Kani <[email protected]> wrote:
> On Thu, 2015-06-25 at 05:37 -0400, Dan Williams wrote:
>> From: Toshi Kani <[email protected]>
>>
>> Add support of sysfs 'numa_node' to I/O-related NVDIMM devices
>> under /sys/bus/nd/devices, regionN, namespaceN.0, and bttN.
>> When bttN is not set up, its numa_node returns -1 (NUMA_NO_NODE).
>>
>> An example of numa_node values on a 2-socket system with a single
>> NVDIMM range on each socket is shown below.
>> /sys/bus/nd/devices
>> |-- btt0/numa_node:-1
>> |-- btt1/numa_node:0
>> |-- namespace0.0/numa_node:0
>> |-- namespace1.0/numa_node:1
>> |-- region0/numa_node:0
>> |-- region1/numa_node:1
>>
>> These numa_node files are then linked under the block class of
>> their device names.
>> /sys/class/block/pmem0/device/numa_node:0
>> /sys/class/block/pmem0s/device/numa_node:0
>> /sys/class/block/pmem1/device/numa_node:1
>>
>> This enables numactl(8) to accept 'block:' and 'file:' paths of
>> pmem and btt devices as shown in the examples below.
>> numactl --preferred block:pmem0 --show
>> numactl --preferred file:/dev/pmem0s --show
>>
>> Signed-off-by: Toshi Kani <[email protected]>
>> Signed-off-by: Dan Williams <[email protected]>
>
> Can you please update the commit log with the following? It reflects
> the changes in sysfs btt files.

Done, thanks!

2015-06-30 10:22:26

by Dan Carpenter

[permalink] [raw]
Subject: Re: [PATCH v2 17/17] arch, x86: pmem api for ensuring durability of persistent memory updates

On Thu, Jun 25, 2015 at 05:37:49AM -0400, Dan Williams wrote:
> diff --git a/include/linux/compiler.h b/include/linux/compiler.h
> index 867722591be2..9a528d945498 100644
> --- a/include/linux/compiler.h
> +++ b/include/linux/compiler.h
> @@ -21,6 +21,7 @@
> # define __rcu __attribute__((noderef, address_space(4)))

On this side of the #ifdef CONFIG_SPARSE_RCU_POINTER statement __pmem
isn't defined, so it leads to a build error when running a CHECKER on
today's linux-next.

I would define __pmem away, but I don't understand why __pmem and
CONFIG_SPARSE_RCU_POINTER are related at all. Maybe it should be
outside the if statement?

> #else
> # define __rcu
> +# define __pmem __attribute__((noderef, address_space(5)))
> #endif
> extern void __chk_user_ptr(const volatile void __user *);
> extern void __chk_io_ptr(const volatile void __iomem *);
> @@ -42,6 +43,7 @@ extern void __chk_io_ptr(const volatile void __iomem *);
> # define __cond_lock(x,c) (c)
> # define __percpu
> # define __rcu
> +# define __pmem
> #endif
>
> /* Indirect macros required for expanded argument pasting, eg. __LINE__. */

regards,
dan carpenter

2015-06-30 16:23:52

by Dan Williams

Subject: Re: [PATCH v2 17/17] arch, x86: pmem api for ensuring durability of persistent memory updates

On Tue, 2015-06-30 at 13:21 +0300, Dan Carpenter wrote:
> On Thu, Jun 25, 2015 at 05:37:49AM -0400, Dan Williams wrote:
> > diff --git a/include/linux/compiler.h b/include/linux/compiler.h
> > index 867722591be2..9a528d945498 100644
> > --- a/include/linux/compiler.h
> > +++ b/include/linux/compiler.h
> > @@ -21,6 +21,7 @@
> > # define __rcu __attribute__((noderef, address_space(4)))
>
> On this side of the #ifdef CONFIG_SPARSE_RCU_POINTER statement then
> __pmem isn't defined so it leads to a build error running a CHECKER on
> today's linux-next.
>
> I would define __pmem away, but I don't understand why __pmem and
> CONFIG_SPARSE_RCU_POINTER are related at all. Maybe it should be
> outside the if statement?
>

Yes it should be outside the if...

8<----
Subject: sparse: fix misplaced __pmem definition

From: Dan Williams <[email protected]>

Move the definition of __pmem outside of CONFIG_SPARSE_RCU_POINTER to fix:

drivers/nvdimm/pmem.c:198:17: sparse: too many arguments for function __builtin_expect
drivers/nvdimm/pmem.c:36:33: sparse: expected ; at end of declaration
drivers/nvdimm/pmem.c:48:21: sparse: void declaration

...due to __pmem failing to be defined in some configurations when
CONFIG_SPARSE_RCU_POINTER=y.

Reported-by: kbuild test robot <[email protected]>
Reported-by: Dan Carpenter <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
include/linux/compiler.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/compiler.h b/include/linux/compiler.h
index 26fc8bc77f85..d8fbd500330e 100644
--- a/include/linux/compiler.h
+++ b/include/linux/compiler.h
@@ -17,11 +17,11 @@
# define __release(x) __context__(x,-1)
# define __cond_lock(x,c) ((c) ? ({ __acquire(x); 1; }) : 0)
# define __percpu __attribute__((noderef, address_space(3)))
+# define __pmem __attribute__((noderef, address_space(5)))
#ifdef CONFIG_SPARSE_RCU_POINTER
# define __rcu __attribute__((noderef, address_space(4)))
#else
# define __rcu
-# define __pmem __attribute__((noderef, address_space(5)))
#endif
extern void __chk_user_ptr(const volatile void __user *);
extern void __chk_io_ptr(const volatile void __iomem *);
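
To illustrate the layout the fix above arrives at, here is a minimal standalone sketch rather than the kernel header itself. The attribute definitions mirror the ones in the diff; demo_copy_to_pmem is a hypothetical helper and not the pmem API. Under sparse (__CHECKER__) __pmem is defined unconditionally and only __rcu depends on CONFIG_SPARSE_RCU_POINTER; under a regular compiler all the annotations expand to nothing.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#ifdef __CHECKER__
/* Seen only by sparse: __pmem no longer depends on the RCU option. */
# define __force __attribute__((force))
# define __pmem  __attribute__((noderef, address_space(5)))
# ifdef CONFIG_SPARSE_RCU_POINTER
#  define __rcu  __attribute__((noderef, address_space(4)))
# else
#  define __rcu
# endif
#else
/* Regular compilers see empty annotations. */
# define __force
# define __pmem
# define __rcu
#endif

/* Hypothetical helper: copy into a __pmem-annotated buffer. The
 * __force cast tells sparse the address-space change is deliberate. */
static void demo_copy_to_pmem(void __pmem *dst, const void *src, size_t n)
{
	assert(dst != NULL && src != NULL);
	memcpy((void __force *)dst, src, n);
}
```

With the original placement in the #else branch, a sparse run with CONFIG_SPARSE_RCU_POINTER=y would reach a declaration like the one above with __pmem undefined, which is consistent with the parse errors quoted in the commit message.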
