[ resend to add dm-devel, linux-block, and fs-devel, apologies for the
duplicates ]
Changes since v1 [1] and the dax-fs RFC [2]:
* rename struct dax_inode to struct dax_device (Christoph)
* rewrite arch_memcpy_to_pmem() in C with inline asm
* use QUEUE_FLAG_WC to gate dax cache management (Jeff)
* add device-mapper plumbing for the ->copy_from_iter() and ->flush()
dax_operations
* kill struct blk_dax_ctl and bdev_direct_access (Christoph)
* cleanup the ->direct_access() calling convention to be page based
(Christoph)
* introduce dax_get_by_host() and don't pollute struct super_block with
dax_device details (Christoph)
[1]: https://lists.01.org/pipermail/linux-nvdimm/2017-January/008586.html
[2]: https://lwn.net/Articles/713064/
---
A few months back, in the course of reviewing the memcpy_nocache()
proposal from Brian, Linus proposed that the pmem-specific
memcpy_to_pmem() routine be implemented at the driver level [3]:
"Quite frankly, the whole 'memcpy_nocache()' idea or (ab-)using
copy_user_nocache() just needs to die. It's idiotic.
As you point out, it's also fundamentally buggy crap.
Throw it away. There is no possible way this is ever valid or
portable. We're not going to lie and claim that it is.
If some driver ends up using 'movnt' by hand, that is up to that
*driver*. But no way in hell should we care about this one whit in
the sense of <linux/uaccess.h>."
This feedback also dovetails with another fs/dax.c design wart: it is
hard coded to assume the backing device is pmem. We call the pmem
specific copy, clear, and flush routines even if the backing device
driver is one of the other 3 dax drivers (axonram, dcssblk, or brd).
There is no reason to spend cpu cycles flushing the cache after writing
to brd, for example, since it is using volatile memory for storage.
Moreover, the pmem driver might be fronting a volatile memory range
published by the ACPI NFIT, or the platform might have arranged to flush
cpu caches on power fail. This latter capability is a feature that has
appeared in embedded storage appliances (pre-ACPI-NFIT nvdimm
platforms).
So, this series:
1/ moves what was previously named "the pmem api" out of the global
namespace and into drivers that need to be concerned with
architecture specific persistent memory considerations.
2/ arranges for dax to stop abusing __copy_user_nocache() and implements
a libnvdimm-local memcpy that uses 'movnt' on x86_64. This might be
expanded in the future to use 'movntdqa' if the copy size is above
some threshold, or expanded with support for other architectures [4].
3/ makes cache maintenance optional by arranging for dax to call driver
specific copy and flush operations only if the driver publishes them.
4/ allows filesystem-dax cache management to be controlled by the block
device write-cache queue flag. The pmem driver is updated to clear
that flag by default when pmem is driving volatile memory.
[3]: https://lists.01.org/pipermail/linux-nvdimm/2017-January/008364.html
[4]: https://lists.01.org/pipermail/linux-nvdimm/2017-April/009478.html
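To make points 3/ and 4/ above concrete, here is a minimal userspace
sketch of the dispatch rule, not the kernel implementation. Every name
in it (the sketch_ prefix, pmem_like_ops, brd_like_ops, the write_cache
field and the flush_calls counter) is hypothetical and exists only for
illustration. It assumes the gate is "flush only if the driver publishes
a flush operation and the write-cache flag is set", and that a missing
copy operation falls back to a plain memcpy().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

struct sketch_dax_dev;

/* Hypothetical stand-in for dax_operations; both callbacks are optional. */
struct sketch_dax_ops {
	size_t (*copy_from)(struct sketch_dax_dev *dev, void *dst,
			    const void *src, size_t len);
	void (*flush)(struct sketch_dax_dev *dev, void *addr, size_t len);
};

struct sketch_dax_dev {
	const struct sketch_dax_ops *ops;
	bool write_cache;		/* models the QUEUE_FLAG_WC gate */
	unsigned int flush_calls;	/* instrumentation for the example */
};

/* Use the driver's copy routine if it published one, else plain memcpy(). */
size_t sketch_dax_copy(struct sketch_dax_dev *dev, void *dst,
		       const void *src, size_t len)
{
	assert(dev && dev->ops);
	if (dev->ops->copy_from)
		return dev->ops->copy_from(dev, dst, src, len);
	memcpy(dst, src, len);
	return len;
}

/* Skip cache maintenance unless the flag is set and a flush op exists. */
void sketch_dax_flush(struct sketch_dax_dev *dev, void *addr, size_t len)
{
	assert(dev && dev->ops);
	if (!dev->write_cache || !dev->ops->flush)
		return;
	dev->ops->flush(dev, addr, len);
}

/* A pmem-like driver cares about persistence, so it publishes ->flush(). */
static void pmem_like_flush(struct sketch_dax_dev *dev, void *addr, size_t len)
{
	(void)addr;
	(void)len;
	dev->flush_calls++;	/* a real driver would write back cpu caches */
}

const struct sketch_dax_ops pmem_like_ops = {
	.flush = pmem_like_flush,
};

/* A brd-like driver fronts volatile memory and publishes neither op. */
const struct sketch_dax_ops brd_like_ops = {
	.copy_from = NULL,
	.flush = NULL,
};
```

With this shape a brd-like device never pays for a flush, and a
pmem-like device stops paying for one as soon as its write-cache flag is
cleared, which is the behavior the series is after for volatile ranges.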
These patches have been through a round of build regression fixes
prompted by reports from the 0day robot. All review is welcome, but the
patches that need extra attention are the device-mapper and uio changes
(copy_from_iter_ops).
This series is based on a merge of char-misc-next (for cdev api reworks)
and libnvdimm-fixes (dax locking and __copy_user_nocache fixes).
---
Dan Williams (33):
device-dax: rename 'dax_dev' to 'dev_dax'
dax: refactor dax-fs into a generic provider of 'struct dax_device' instances
dax: add a facility to lookup a dax device by 'host' device name
dax: introduce dax_operations
pmem: add dax_operations support
axon_ram: add dax_operations support
brd: add dax_operations support
dcssblk: add dax_operations support
block: kill bdev_dax_capable()
dax: introduce dax_direct_access()
dm: add dax_device and dax_operations support
dm: teach dm-targets to use a dax_device + dax_operations
ext2, ext4, xfs: retrieve dax_device for iomap operations
Revert "block: use DAX for partition table reads"
filesystem-dax: convert to dax_direct_access()
block, dax: convert bdev_dax_supported() to dax_direct_access()
block: remove block_device_operations ->direct_access()
x86, dax, pmem: remove indirection around memcpy_from_pmem()
dax, pmem: introduce 'copy_from_iter' dax operation
dm: add ->copy_from_iter() dax operation support
filesystem-dax: convert to dax_copy_from_iter()
dax, pmem: introduce an optional 'flush' dax_operation
dm: add ->flush() dax operation support
filesystem-dax: convert to dax_flush()
x86, dax: replace clear_pmem() with open coded memset + dax_ops->flush
x86, dax, libnvdimm: move wb_cache_pmem() to libnvdimm
x86, libnvdimm, pmem: move arch_invalidate_pmem() to libnvdimm
x86, libnvdimm, dax: stop abusing __copy_user_nocache
uio, libnvdimm, pmem: implement cache bypass for all copy_from_iter() operations
libnvdimm, pmem: fix persistence warning
libnvdimm, nfit: enable support for volatile ranges
filesystem-dax: gate calls to dax_flush() on QUEUE_FLAG_WC
libnvdimm, pmem: disable dax flushing when pmem is fronting a volatile region
MAINTAINERS | 2
arch/powerpc/platforms/Kconfig | 1
arch/powerpc/sysdev/axonram.c | 45 +++-
arch/x86/Kconfig | 1
arch/x86/include/asm/pmem.h | 141 ------------
arch/x86/include/asm/string_64.h | 1
block/Kconfig | 1
block/partition-generic.c | 17 -
drivers/Makefile | 2
drivers/acpi/nfit/core.c | 15 +
drivers/block/Kconfig | 1
drivers/block/brd.c | 52 +++-
drivers/dax/Kconfig | 10 +
drivers/dax/Makefile | 5
drivers/dax/dax.h | 15 -
drivers/dax/device-dax.h | 25 ++
drivers/dax/device.c | 415 +++++++++++------------------------
drivers/dax/pmem.c | 10 -
drivers/dax/super.c | 445 ++++++++++++++++++++++++++++++++++++++
drivers/md/Kconfig | 1
drivers/md/dm-core.h | 1
drivers/md/dm-linear.c | 53 ++++-
drivers/md/dm-snap.c | 6 -
drivers/md/dm-stripe.c | 65 ++++--
drivers/md/dm-target.c | 6 -
drivers/md/dm.c | 112 ++++++++--
drivers/nvdimm/Kconfig | 6 +
drivers/nvdimm/Makefile | 1
drivers/nvdimm/bus.c | 10 -
drivers/nvdimm/claim.c | 9 -
drivers/nvdimm/core.c | 2
drivers/nvdimm/dax_devs.c | 2
drivers/nvdimm/dimm_devs.c | 2
drivers/nvdimm/namespace_devs.c | 9 -
drivers/nvdimm/nd-core.h | 9 +
drivers/nvdimm/pfn_devs.c | 4
drivers/nvdimm/pmem.c | 82 +++++--
drivers/nvdimm/pmem.h | 26 ++
drivers/nvdimm/region_devs.c | 39 ++-
drivers/nvdimm/x86.c | 155 +++++++++++++
drivers/s390/block/Kconfig | 1
drivers/s390/block/dcssblk.c | 44 +++-
fs/block_dev.c | 117 +++-------
fs/dax.c | 302 ++++++++++++++------------
fs/ext2/inode.c | 9 +
fs/ext4/inode.c | 9 +
fs/iomap.c | 3
fs/xfs/xfs_iomap.c | 10 +
include/linux/blkdev.h | 19 --
include/linux/dax.h | 43 +++-
include/linux/device-mapper.h | 14 +
include/linux/iomap.h | 1
include/linux/libnvdimm.h | 10 +
include/linux/pmem.h | 165 --------------
include/linux/string.h | 8 +
include/linux/uio.h | 4
lib/Kconfig | 6 -
lib/iov_iter.c | 25 ++
tools/testing/nvdimm/Kbuild | 11 +
tools/testing/nvdimm/pmem-dax.c | 21 +-
60 files changed, 1584 insertions(+), 1042 deletions(-)
delete mode 100644 arch/x86/include/asm/pmem.h
create mode 100644 drivers/dax/device-dax.h
rename drivers/dax/{dax.c => device.c} (60%)
create mode 100644 drivers/dax/super.c
create mode 100644 drivers/nvdimm/x86.c
delete mode 100644 include/linux/pmem.h
We want dax capable drivers to be able to publish a set of dax
operations [1]. However, we do not want to further abuse block_devices
to advertise these operations. Instead we will attach these operations
to a dax device and add a lookup mechanism to go from block device path
to a dax device. A dax capable driver like pmem or brd is responsible
for registering a dax device, alongside a block device, and then a dax
capable filesystem is responsible for retrieving the dax device by path
name if it wants to call dax_operations.
For now, we refactor the dax pseudo-fs to be a generic facility, rather
than an implementation detail of the device-dax use case. A "dax
device" is then just an inode + dax infrastructure, and "Device DAX" is
a mapping service layered on top of that base 'struct dax_device'.
"Filesystem DAX" is then a mapping service that layers a filesystem on
top of that same base device. Filesystem DAX is associated with a
block_device for now, but it may be associated directly with a dax
device in the future, or for new pmem-only filesystems.
[1]: https://lkml.org/lkml/2017/1/19/880
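As an illustration of the "inode + dax infrastructure" layering
described above, the following userspace sketch mimics how a public
inode can be converted back to its dax device through an embedded cdev.
It is a simplified model under stated assumptions, not the kernel code:
the sketch_ types are hypothetical stand-ins for struct inode, struct
cdev and struct dax_device, and the SRCU grace period that the real
kill path waits for is deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace restatement of the container_of() idea. */
#define sketch_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct sketch_cdev {
	int unused;
};

struct sketch_inode {
	struct sketch_cdev *i_cdev;
};

/* Anchor object: embeds its own inode and cdev, carries driver data. */
struct sketch_dax_device {
	struct sketch_inode inode;
	struct sketch_cdev cdev;
	void *private;
	bool alive;
};

void sketch_dax_init(struct sketch_dax_device *dax_dev, void *private)
{
	assert(dax_dev);
	dax_dev->inode.i_cdev = &dax_dev->cdev;
	dax_dev->private = private;
	dax_dev->alive = true;
}

/* Public inode -> dax device, by way of inode->i_cdev. */
struct sketch_dax_device *sketch_inode_dax(struct sketch_inode *inode)
{
	return sketch_container_of(inode->i_cdev, struct sketch_dax_device,
				   cdev);
}

/* Dax device -> its embedded inode. */
struct sketch_inode *sketch_dax_inode(struct sketch_dax_device *dax_dev)
{
	return &dax_dev->inode;
}

/* Mark the device dead and drop the driver data. */
void sketch_kill_dax(struct sketch_dax_device *dax_dev)
{
	if (!dax_dev)
		return;
	dax_dev->alive = false;
	/* the kernel waits for in-flight readers here; omitted in this model */
	dax_dev->private = NULL;
}
```

The point of going through i_cdev rather than the inode itself is that
the inode seen at open time is the device node's inode, not the one
embedded in the dax device, so only the cdev pointer is shared between
the two.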
Suggested-by: Christoph Hellwig <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
drivers/Makefile | 2
drivers/dax/Kconfig | 10 +
drivers/dax/Makefile | 5 +
drivers/dax/dax.h | 20 +--
drivers/dax/device-dax.h | 25 ++++
drivers/dax/device.c | 241 ++++++----------------------------
drivers/dax/pmem.c | 2
drivers/dax/super.c | 303 +++++++++++++++++++++++++++++++++++++++++++
include/linux/dax.h | 3
tools/testing/nvdimm/Kbuild | 10 +
10 files changed, 404 insertions(+), 217 deletions(-)
create mode 100644 drivers/dax/device-dax.h
rename drivers/dax/{dax.c => device.c} (77%)
create mode 100644 drivers/dax/super.c
diff --git a/drivers/Makefile b/drivers/Makefile
index 2eced9afba53..0442e982cf35 100644
--- a/drivers/Makefile
+++ b/drivers/Makefile
@@ -71,7 +71,7 @@ obj-$(CONFIG_PARPORT) += parport/
obj-$(CONFIG_NVM) += lightnvm/
obj-y += base/ block/ misc/ mfd/ nfc/
obj-$(CONFIG_LIBNVDIMM) += nvdimm/
-obj-$(CONFIG_DEV_DAX) += dax/
+obj-$(CONFIG_DAX) += dax/
obj-$(CONFIG_DMA_SHARED_BUFFER) += dma-buf/
obj-$(CONFIG_NUBUS) += nubus/
obj-y += macintosh/
diff --git a/drivers/dax/Kconfig b/drivers/dax/Kconfig
index 9e95bf94eb13..b7053eafd88e 100644
--- a/drivers/dax/Kconfig
+++ b/drivers/dax/Kconfig
@@ -1,8 +1,13 @@
-menuconfig DEV_DAX
+menuconfig DAX
tristate "DAX: direct access to differentiated memory"
+ select SRCU
default m if NVDIMM_DAX
+
+if DAX
+
+config DEV_DAX
+ tristate "Device DAX: direct access mapping device"
depends on TRANSPARENT_HUGEPAGE
- select SRCU
help
Support raw access to differentiated (persistence, bandwidth,
latency...) memory via an mmap(2) capable character
@@ -11,7 +16,6 @@ menuconfig DEV_DAX
baseline memory pool. Mappings of a /dev/daxX.Y device impose
restrictions that make the mapping behavior deterministic.
-if DEV_DAX
config DEV_DAX_PMEM
tristate "PMEM DAX: direct access to persistent memory"
diff --git a/drivers/dax/Makefile b/drivers/dax/Makefile
index 27c54e38478a..dc7422530462 100644
--- a/drivers/dax/Makefile
+++ b/drivers/dax/Makefile
@@ -1,4 +1,7 @@
-obj-$(CONFIG_DEV_DAX) += dax.o
+obj-$(CONFIG_DAX) += dax.o
+obj-$(CONFIG_DEV_DAX) += device_dax.o
obj-$(CONFIG_DEV_DAX_PMEM) += dax_pmem.o
+dax-y := super.o
dax_pmem-y := pmem.o
+device_dax-y := device.o
diff --git a/drivers/dax/dax.h b/drivers/dax/dax.h
index ea176d875d60..2472d9da96db 100644
--- a/drivers/dax/dax.h
+++ b/drivers/dax/dax.h
@@ -1,5 +1,5 @@
/*
- * Copyright(c) 2016 Intel Corporation. All rights reserved.
+ * Copyright(c) 2016 - 2017 Intel Corporation. All rights reserved.
*
* This program is free software; you can redistribute it and/or modify
* it under the terms of version 2 of the GNU General Public License as
@@ -12,14 +12,12 @@
*/
#ifndef __DAX_H__
#define __DAX_H__
-struct device;
-struct dev_dax;
-struct resource;
-struct dax_region;
-void dax_region_put(struct dax_region *dax_region);
-struct dax_region *alloc_dax_region(struct device *parent,
- int region_id, struct resource *res, unsigned int align,
- void *addr, unsigned long flags);
-struct dev_dax *devm_create_dev_dax(struct dax_region *dax_region,
- struct resource *res, int count);
+struct dax_device;
+struct dax_device *alloc_dax(void *private);
+void put_dax(struct dax_device *dax_dev);
+bool dax_alive(struct dax_device *dax_dev);
+void kill_dax(struct dax_device *dax_dev);
+struct dax_device *inode_dax(struct inode *inode);
+struct inode *dax_inode(struct dax_device *dax_dev);
+void *dax_get_private(struct dax_device *dax_dev);
#endif /* __DAX_H__ */
diff --git a/drivers/dax/device-dax.h b/drivers/dax/device-dax.h
new file mode 100644
index 000000000000..fdcd9769ffde
--- /dev/null
+++ b/drivers/dax/device-dax.h
@@ -0,0 +1,25 @@
+/*
+ * Copyright(c) 2016 Intel Corporation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ */
+#ifndef __DEVICE_DAX_H__
+#define __DEVICE_DAX_H__
+struct device;
+struct dev_dax;
+struct resource;
+struct dax_region;
+void dax_region_put(struct dax_region *dax_region);
+struct dax_region *alloc_dax_region(struct device *parent,
+ int region_id, struct resource *res, unsigned int align,
+ void *addr, unsigned long flags);
+struct dev_dax *devm_create_dev_dax(struct dax_region *dax_region,
+ struct resource *res, int count);
+#endif /* __DEVICE_DAX_H__ */
diff --git a/drivers/dax/dax.c b/drivers/dax/device.c
similarity index 77%
rename from drivers/dax/dax.c
rename to drivers/dax/device.c
index 376fdd353aea..19a42edbfa03 100644
--- a/drivers/dax/dax.c
+++ b/drivers/dax/device.c
@@ -1,5 +1,5 @@
/*
- * Copyright(c) 2016 Intel Corporation. All rights reserved.
+ * Copyright(c) 2016 - 2017 Intel Corporation. All rights reserved.
*
* This program is free software; you can redistribute it and/or modify
* it under the terms of version 2 of the GNU General Public License as
@@ -13,10 +13,7 @@
#include <linux/pagemap.h>
#include <linux/module.h>
#include <linux/device.h>
-#include <linux/magic.h>
-#include <linux/mount.h>
#include <linux/pfn_t.h>
-#include <linux/hash.h>
#include <linux/cdev.h>
#include <linux/slab.h>
#include <linux/dax.h>
@@ -24,16 +21,7 @@
#include <linux/mm.h>
#include "dax.h"
-static dev_t dax_devt;
-DEFINE_STATIC_SRCU(dax_srcu);
static struct class *dax_class;
-static DEFINE_IDA(dax_minor_ida);
-static int nr_dax = CONFIG_NR_DEV_DAX;
-module_param(nr_dax, int, S_IRUGO);
-static struct vfsmount *dax_mnt;
-static struct kmem_cache *dax_cache __read_mostly;
-static struct super_block *dax_superblock __read_mostly;
-MODULE_PARM_DESC(nr_dax, "max number of device-dax instances");
/**
* struct dax_region - mapping infrastructure for dax devices
@@ -59,19 +47,16 @@ struct dax_region {
/**
* struct dev_dax - instance data for a subdivision of a dax region
* @region - parent region
- * @dev - device backing the character device
- * @cdev - core chardev data
- * @alive - !alive + srcu grace period == no new mappings can be established
+ * @dax_dev - core dax functionality
+ * @dev - device core
* @id - child id in the region
* @num_resources - number of physical address extents in this device
* @res - array of physical address ranges
*/
struct dev_dax {
struct dax_region *region;
- struct inode *inode;
+ struct dax_device *dax_dev;
struct device dev;
- struct cdev cdev;
- bool alive;
int id;
int num_resources;
struct resource res[0];
@@ -144,117 +129,6 @@ static const struct attribute_group *dax_region_attribute_groups[] = {
NULL,
};
-static struct inode *dax_alloc_inode(struct super_block *sb)
-{
- return kmem_cache_alloc(dax_cache, GFP_KERNEL);
-}
-
-static void dax_i_callback(struct rcu_head *head)
-{
- struct inode *inode = container_of(head, struct inode, i_rcu);
-
- kmem_cache_free(dax_cache, inode);
-}
-
-static void dax_destroy_inode(struct inode *inode)
-{
- call_rcu(&inode->i_rcu, dax_i_callback);
-}
-
-static const struct super_operations dax_sops = {
- .statfs = simple_statfs,
- .alloc_inode = dax_alloc_inode,
- .destroy_inode = dax_destroy_inode,
- .drop_inode = generic_delete_inode,
-};
-
-static struct dentry *dax_mount(struct file_system_type *fs_type,
- int flags, const char *dev_name, void *data)
-{
- return mount_pseudo(fs_type, "dax:", &dax_sops, NULL, DAXFS_MAGIC);
-}
-
-static struct file_system_type dax_type = {
- .name = "dax",
- .mount = dax_mount,
- .kill_sb = kill_anon_super,
-};
-
-static int dax_test(struct inode *inode, void *data)
-{
- return inode->i_cdev == data;
-}
-
-static int dax_set(struct inode *inode, void *data)
-{
- inode->i_cdev = data;
- return 0;
-}
-
-static struct inode *dax_inode_get(struct cdev *cdev, dev_t devt)
-{
- struct inode *inode;
-
- inode = iget5_locked(dax_superblock, hash_32(devt + DAXFS_MAGIC, 31),
- dax_test, dax_set, cdev);
-
- if (!inode)
- return NULL;
-
- if (inode->i_state & I_NEW) {
- inode->i_mode = S_IFCHR;
- inode->i_flags = S_DAX;
- inode->i_rdev = devt;
- mapping_set_gfp_mask(&inode->i_data, GFP_USER);
- unlock_new_inode(inode);
- }
- return inode;
-}
-
-static void init_once(void *inode)
-{
- inode_init_once(inode);
-}
-
-static int dax_inode_init(void)
-{
- int rc;
-
- dax_cache = kmem_cache_create("dax_cache", sizeof(struct inode), 0,
- (SLAB_HWCACHE_ALIGN|SLAB_RECLAIM_ACCOUNT|
- SLAB_MEM_SPREAD|SLAB_ACCOUNT),
- init_once);
- if (!dax_cache)
- return -ENOMEM;
-
- rc = register_filesystem(&dax_type);
- if (rc)
- goto err_register_fs;
-
- dax_mnt = kern_mount(&dax_type);
- if (IS_ERR(dax_mnt)) {
- rc = PTR_ERR(dax_mnt);
- goto err_mount;
- }
- dax_superblock = dax_mnt->mnt_sb;
-
- return 0;
-
- err_mount:
- unregister_filesystem(&dax_type);
- err_register_fs:
- kmem_cache_destroy(dax_cache);
-
- return rc;
-}
-
-static void dax_inode_exit(void)
-{
- kern_unmount(dax_mnt);
- unregister_filesystem(&dax_type);
- kmem_cache_destroy(dax_cache);
-}
-
static void dax_region_free(struct kref *kref)
{
struct dax_region *dax_region;
@@ -363,7 +237,7 @@ static int check_vma(struct dev_dax *dev_dax, struct vm_area_struct *vma,
struct device *dev = &dev_dax->dev;
unsigned long mask;
- if (!dev_dax->alive)
+ if (!dax_alive(dev_dax->dax_dev))
return -ENXIO;
/* prevent private mappings from being established */
@@ -582,7 +456,7 @@ static int dev_dax_huge_fault(struct vm_fault *vmf,
? "write" : "read",
vmf->vma->vm_start, vmf->vma->vm_end, pe_size);
- id = srcu_read_lock(&dax_srcu);
+ id = dax_read_lock();
switch (pe_size) {
case PE_SIZE_PTE:
rc = __dev_dax_pte_fault(dev_dax, vmf);
@@ -596,7 +470,7 @@ static int dev_dax_huge_fault(struct vm_fault *vmf,
default:
rc = VM_FAULT_SIGBUS;
}
- srcu_read_unlock(&dax_srcu, id);
+ dax_read_unlock(id);
return rc;
}
@@ -614,11 +488,17 @@ static const struct vm_operations_struct dax_vm_ops = {
static int dax_mmap(struct file *filp, struct vm_area_struct *vma)
{
struct dev_dax *dev_dax = filp->private_data;
- int rc;
+ int rc, id;
dev_dbg(&dev_dax->dev, "%s\n", __func__);
+ /*
+ * We lock to check dax_dev liveness and will re-check at
+ * fault time.
+ */
+ id = dax_read_lock();
rc = check_vma(dev_dax, vma, __func__);
+ dax_read_unlock(id);
if (rc)
return rc;
@@ -664,12 +544,13 @@ static unsigned long dax_get_unmapped_area(struct file *filp,
static int dax_open(struct inode *inode, struct file *filp)
{
- struct dev_dax *dev_dax;
+ struct dax_device *dax_dev = inode_dax(inode);
+ struct inode *__dax_inode = dax_inode(dax_dev);
+ struct dev_dax *dev_dax = dax_get_private(dax_dev);
- dev_dax = container_of(inode->i_cdev, struct dev_dax, cdev);
dev_dbg(&dev_dax->dev, "%s\n", __func__);
- inode->i_mapping = dev_dax->inode->i_mapping;
- inode->i_mapping->host = dev_dax->inode;
+ inode->i_mapping = __dax_inode->i_mapping;
+ inode->i_mapping->host = __dax_inode;
filp->f_mapping = inode->i_mapping;
filp->private_data = dev_dax;
inode->i_flags = S_DAX;
@@ -698,36 +579,34 @@ static void dev_dax_release(struct device *dev)
{
struct dev_dax *dev_dax = to_dev_dax(dev);
struct dax_region *dax_region = dev_dax->region;
+ struct dax_device *dax_dev = dev_dax->dax_dev;
ida_simple_remove(&dax_region->ida, dev_dax->id);
- ida_simple_remove(&dax_minor_ida, MINOR(dev->devt));
dax_region_put(dax_region);
- iput(dev_dax->inode);
+ put_dax(dax_dev);
kfree(dev_dax);
}
static void kill_dev_dax(struct dev_dax *dev_dax)
{
- /*
- * Note, rcu is not protecting the liveness of dev_dax, rcu is
- * ensuring that any fault handlers that might have seen
- * dev_dax->alive == true, have completed. Any fault handlers
- * that start after synchronize_srcu() has started will abort
- * upon seeing dev_dax->alive == false.
- */
- dev_dax->alive = false;
- synchronize_srcu(&dax_srcu);
- unmap_mapping_range(dev_dax->inode->i_mapping, 0, 0, 1);
+ struct dax_device *dax_dev = dev_dax->dax_dev;
+ struct inode *inode = dax_inode(dax_dev);
+
+ kill_dax(dax_dev);
+ unmap_mapping_range(inode->i_mapping, 0, 0, 1);
}
static void unregister_dev_dax(void *dev)
{
struct dev_dax *dev_dax = to_dev_dax(dev);
+ struct dax_device *dax_dev = dev_dax->dax_dev;
+ struct inode *inode = dax_inode(dax_dev);
+ struct cdev *cdev = inode->i_cdev;
dev_dbg(dev, "%s\n", __func__);
kill_dev_dax(dev_dax);
- cdev_device_del(&dev_dax->cdev, dev);
+ cdev_device_del(cdev, dev);
put_device(dev);
}
@@ -735,11 +614,12 @@ struct dev_dax *devm_create_dev_dax(struct dax_region *dax_region,
struct resource *res, int count)
{
struct device *parent = dax_region->dev;
+ struct dax_device *dax_dev;
struct dev_dax *dev_dax;
- int rc = 0, minor, i;
+ struct inode *inode;
struct device *dev;
struct cdev *cdev;
- dev_t dev_t;
+ int rc = 0, i;
dev_dax = kzalloc(sizeof(*dev_dax) + sizeof(*res) * count, GFP_KERNEL);
if (!dev_dax)
@@ -765,33 +645,25 @@ struct dev_dax *devm_create_dev_dax(struct dax_region *dax_region,
goto err_id;
}
- minor = ida_simple_get(&dax_minor_ida, 0, 0, GFP_KERNEL);
- if (minor < 0) {
- rc = minor;
- goto err_minor;
- }
+ dax_dev = alloc_dax(dev_dax);
+ if (!dax_dev)
+ goto err_dax;
- dev_t = MKDEV(MAJOR(dax_devt), minor);
+ /* from here on we're committed to teardown via dev_dax_release() */
dev = &dev_dax->dev;
- dev_dax->inode = dax_inode_get(&dev_dax->cdev, dev_t);
- if (!dev_dax->inode) {
- rc = -ENOMEM;
- goto err_inode;
- }
-
- /* from here on we're committed to teardown via dev_dax_release() */
device_initialize(dev);
- cdev = &dev_dax->cdev;
+ inode = dax_inode(dax_dev);
+ cdev = inode->i_cdev;
cdev_init(cdev, &dax_fops);
cdev->owner = parent->driver->owner;
dev_dax->num_resources = count;
- dev_dax->alive = true;
+ dev_dax->dax_dev = dax_dev;
dev_dax->region = dax_region;
kref_get(&dax_region->kref);
- dev->devt = dev_t;
+ dev->devt = inode->i_rdev;
dev->class = dax_class;
dev->parent = parent;
dev->groups = dax_attribute_groups;
@@ -811,9 +683,7 @@ struct dev_dax *devm_create_dev_dax(struct dax_region *dax_region,
return dev_dax;
- err_inode:
- ida_simple_remove(&dax_minor_ida, minor);
- err_minor:
+ err_dax:
ida_simple_remove(&dax_region->ida, dev_dax->id);
err_id:
kfree(dev_dax);
@@ -824,38 +694,13 @@ EXPORT_SYMBOL_GPL(devm_create_dev_dax);
static int __init dax_init(void)
{
- int rc;
-
- rc = dax_inode_init();
- if (rc)
- return rc;
-
- nr_dax = max(nr_dax, 256);
- rc = alloc_chrdev_region(&dax_devt, 0, nr_dax, "dax");
- if (rc)
- goto err_chrdev;
-
dax_class = class_create(THIS_MODULE, "dax");
- if (IS_ERR(dax_class)) {
- rc = PTR_ERR(dax_class);
- goto err_class;
- }
-
- return 0;
-
- err_class:
- unregister_chrdev_region(dax_devt, nr_dax);
- err_chrdev:
- dax_inode_exit();
- return rc;
+ return PTR_ERR_OR_ZERO(dax_class);
}
static void __exit dax_exit(void)
{
class_destroy(dax_class);
- unregister_chrdev_region(dax_devt, nr_dax);
- ida_destroy(&dax_minor_ida);
- dax_inode_exit();
}
MODULE_AUTHOR("Intel Corporation");
diff --git a/drivers/dax/pmem.c b/drivers/dax/pmem.c
index 2c736fc4508b..d4ca19bd74eb 100644
--- a/drivers/dax/pmem.c
+++ b/drivers/dax/pmem.c
@@ -16,7 +16,7 @@
#include <linux/pfn_t.h>
#include "../nvdimm/pfn.h"
#include "../nvdimm/nd.h"
-#include "dax.h"
+#include "device-dax.h"
struct dax_pmem {
struct device *dev;
diff --git a/drivers/dax/super.c b/drivers/dax/super.c
new file mode 100644
index 000000000000..c9f85f1c086e
--- /dev/null
+++ b/drivers/dax/super.c
@@ -0,0 +1,303 @@
+/*
+ * Copyright(c) 2017 Intel Corporation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ */
+#include <linux/pagemap.h>
+#include <linux/module.h>
+#include <linux/mount.h>
+#include <linux/magic.h>
+#include <linux/cdev.h>
+#include <linux/hash.h>
+#include <linux/slab.h>
+#include <linux/fs.h>
+
+static int nr_dax = CONFIG_NR_DEV_DAX;
+module_param(nr_dax, int, S_IRUGO);
+MODULE_PARM_DESC(nr_dax, "max number of dax device instances");
+
+static dev_t dax_devt;
+DEFINE_STATIC_SRCU(dax_srcu);
+static struct vfsmount *dax_mnt;
+static DEFINE_IDA(dax_minor_ida);
+static struct kmem_cache *dax_cache __read_mostly;
+static struct super_block *dax_superblock __read_mostly;
+
+int dax_read_lock(void)
+{
+ return srcu_read_lock(&dax_srcu);
+}
+EXPORT_SYMBOL_GPL(dax_read_lock);
+
+void dax_read_unlock(int id)
+{
+ srcu_read_unlock(&dax_srcu, id);
+}
+EXPORT_SYMBOL_GPL(dax_read_unlock);
+
+/**
+ * struct dax_device - anchor object for dax services
+ * @inode: core vfs
+ * @cdev: optional character interface for "device dax"
+ * @private: dax driver private data
+ * @alive: !alive + rcu grace period == no new operations / mappings
+ */
+struct dax_device {
+ struct inode inode;
+ struct cdev cdev;
+ void *private;
+ bool alive;
+};
+
+bool dax_alive(struct dax_device *dax_dev)
+{
+ lockdep_assert_held(&dax_srcu);
+ return dax_dev->alive;
+}
+EXPORT_SYMBOL_GPL(dax_alive);
+
+/*
+ * Note, rcu is not protecting the liveness of dax_dev, rcu is ensuring
+ * that any fault handlers or operations that might have seen
+ * dax_alive(), have completed. Any operations that start after
+ * synchronize_srcu() has run will abort upon seeing !dax_alive().
+ */
+void kill_dax(struct dax_device *dax_dev)
+{
+ if (!dax_dev)
+ return;
+
+ dax_dev->alive = false;
+ synchronize_srcu(&dax_srcu);
+ dax_dev->private = NULL;
+}
+EXPORT_SYMBOL_GPL(kill_dax);
+
+static struct inode *dax_alloc_inode(struct super_block *sb)
+{
+ struct dax_device *dax_dev;
+
+ dax_dev = kmem_cache_alloc(dax_cache, GFP_KERNEL);
+ return &dax_dev->inode;
+}
+
+static struct dax_device *to_dax_dev(struct inode *inode)
+{
+ return container_of(inode, struct dax_device, inode);
+}
+
+static void dax_i_callback(struct rcu_head *head)
+{
+ struct inode *inode = container_of(head, struct inode, i_rcu);
+ struct dax_device *dax_dev = to_dax_dev(inode);
+
+ ida_simple_remove(&dax_minor_ida, MINOR(inode->i_rdev));
+ kmem_cache_free(dax_cache, dax_dev);
+}
+
+static void dax_destroy_inode(struct inode *inode)
+{
+ struct dax_device *dax_dev = to_dax_dev(inode);
+
+ WARN_ONCE(dax_dev->alive,
+ "kill_dax() must be called before final iput()\n");
+ call_rcu(&inode->i_rcu, dax_i_callback);
+}
+
+static const struct super_operations dax_sops = {
+ .statfs = simple_statfs,
+ .alloc_inode = dax_alloc_inode,
+ .destroy_inode = dax_destroy_inode,
+ .drop_inode = generic_delete_inode,
+};
+
+static struct dentry *dax_mount(struct file_system_type *fs_type,
+ int flags, const char *dev_name, void *data)
+{
+ return mount_pseudo(fs_type, "dax:", &dax_sops, NULL, DAXFS_MAGIC);
+}
+
+static struct file_system_type dax_fs_type = {
+ .name = "dax",
+ .mount = dax_mount,
+ .kill_sb = kill_anon_super,
+};
+
+static int dax_test(struct inode *inode, void *data)
+{
+ dev_t devt = *(dev_t *) data;
+
+ return inode->i_rdev == devt;
+}
+
+static int dax_set(struct inode *inode, void *data)
+{
+ dev_t devt = *(dev_t *) data;
+
+ inode->i_rdev = devt;
+ return 0;
+}
+
+static struct dax_device *dax_dev_get(dev_t devt)
+{
+ struct dax_device *dax_dev;
+ struct inode *inode;
+
+ inode = iget5_locked(dax_superblock, hash_32(devt + DAXFS_MAGIC, 31),
+ dax_test, dax_set, &devt);
+
+ if (!inode)
+ return NULL;
+
+ dax_dev = to_dax_dev(inode);
+ if (inode->i_state & I_NEW) {
+ dax_dev->alive = true;
+ inode->i_cdev = &dax_dev->cdev;
+ inode->i_mode = S_IFCHR;
+ inode->i_flags = S_DAX;
+ mapping_set_gfp_mask(&inode->i_data, GFP_USER);
+ unlock_new_inode(inode);
+ }
+
+ return dax_dev;
+}
+
+struct dax_device *alloc_dax(void *private)
+{
+ struct dax_device *dax_dev;
+ dev_t devt;
+ int minor;
+
+ minor = ida_simple_get(&dax_minor_ida, 0, nr_dax, GFP_KERNEL);
+ if (minor < 0)
+ return NULL;
+
+ devt = MKDEV(MAJOR(dax_devt), minor);
+ dax_dev = dax_dev_get(devt);
+ if (!dax_dev)
+ goto err_inode;
+
+ dax_dev->private = private;
+ return dax_dev;
+
+ err_inode:
+ ida_simple_remove(&dax_minor_ida, minor);
+ return NULL;
+}
+EXPORT_SYMBOL_GPL(alloc_dax);
+
+void put_dax(struct dax_device *dax_dev)
+{
+ if (!dax_dev)
+ return;
+ iput(&dax_dev->inode);
+}
+EXPORT_SYMBOL_GPL(put_dax);
+
+/**
+ * inode_dax: convert a public inode into its dax_dev
+ * @inode: An inode with i_cdev pointing to a dax_dev
+ *
+ * Note this is not equivalent to to_dax_dev() which is for private
+ * internal use where we know the inode filesystem type == dax_fs_type.
+ */
+struct dax_device *inode_dax(struct inode *inode)
+{
+ struct cdev *cdev = inode->i_cdev;
+
+ return container_of(cdev, struct dax_device, cdev);
+}
+EXPORT_SYMBOL_GPL(inode_dax);
+
+struct inode *dax_inode(struct dax_device *dax_dev)
+{
+ return &dax_dev->inode;
+}
+EXPORT_SYMBOL_GPL(dax_inode);
+
+void *dax_get_private(struct dax_device *dax_dev)
+{
+ return dax_dev->private;
+}
+EXPORT_SYMBOL_GPL(dax_get_private);
+
+static void init_once(void *_dax_dev)
+{
+ struct dax_device *dax_dev = _dax_dev;
+ struct inode *inode = &dax_dev->inode;
+
+ inode_init_once(inode);
+}
+
+static int __dax_fs_init(void)
+{
+ int rc;
+
+ dax_cache = kmem_cache_create("dax_cache", sizeof(struct dax_device), 0,
+ (SLAB_HWCACHE_ALIGN|SLAB_RECLAIM_ACCOUNT|
+ SLAB_MEM_SPREAD|SLAB_ACCOUNT),
+ init_once);
+ if (!dax_cache)
+ return -ENOMEM;
+
+ rc = register_filesystem(&dax_fs_type);
+ if (rc)
+ goto err_register_fs;
+
+ dax_mnt = kern_mount(&dax_fs_type);
+ if (IS_ERR(dax_mnt)) {
+ rc = PTR_ERR(dax_mnt);
+ goto err_mount;
+ }
+ dax_superblock = dax_mnt->mnt_sb;
+
+ return 0;
+
+ err_mount:
+ unregister_filesystem(&dax_fs_type);
+ err_register_fs:
+ kmem_cache_destroy(dax_cache);
+
+ return rc;
+}
+
+static void __dax_fs_exit(void)
+{
+ kern_unmount(dax_mnt);
+ unregister_filesystem(&dax_fs_type);
+ kmem_cache_destroy(dax_cache);
+}
+
+static int __init dax_fs_init(void)
+{
+ int rc;
+
+ rc = __dax_fs_init();
+ if (rc)
+ return rc;
+
+ nr_dax = max(nr_dax, 256);
+ rc = alloc_chrdev_region(&dax_devt, 0, nr_dax, "dax");
+ if (rc)
+ __dax_fs_exit();
+ return rc;
+}
+
+static void __exit dax_fs_exit(void)
+{
+ unregister_chrdev_region(dax_devt, nr_dax);
+ ida_destroy(&dax_minor_ida);
+ __dax_fs_exit();
+}
+
+MODULE_AUTHOR("Intel Corporation");
+MODULE_LICENSE("GPL v2");
+subsys_initcall(dax_fs_init);
+module_exit(dax_fs_exit);
diff --git a/include/linux/dax.h b/include/linux/dax.h
index d8a3dc042e1c..5b62f5d19aea 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -8,6 +8,9 @@
struct iomap_ops;
+int dax_read_lock(void);
+void dax_read_unlock(int id);
+
/*
* We use lowest available bit in exceptional entry for locking, one bit for
* the entry size (PMD) and two more to tell us if the entry is a huge zero
diff --git a/tools/testing/nvdimm/Kbuild b/tools/testing/nvdimm/Kbuild
index 405212be044a..2033ad03b8cd 100644
--- a/tools/testing/nvdimm/Kbuild
+++ b/tools/testing/nvdimm/Kbuild
@@ -28,7 +28,10 @@ obj-$(CONFIG_ND_BTT) += nd_btt.o
obj-$(CONFIG_ND_BLK) += nd_blk.o
obj-$(CONFIG_X86_PMEM_LEGACY) += nd_e820.o
obj-$(CONFIG_ACPI_NFIT) += nfit.o
-obj-$(CONFIG_DEV_DAX) += dax.o
+ifeq ($(CONFIG_DAX),m)
+obj-$(CONFIG_DAX) += dax.o
+endif
+obj-$(CONFIG_DEV_DAX) += device_dax.o
obj-$(CONFIG_DEV_DAX_PMEM) += dax_pmem.o
nfit-y := $(ACPI_SRC)/core.o
@@ -48,9 +51,12 @@ nd_blk-y += config_check.o
nd_e820-y := $(NVDIMM_SRC)/e820.o
nd_e820-y += config_check.o
-dax-y := $(DAX_SRC)/dax.o
+dax-y := $(DAX_SRC)/super.o
dax-y += config_check.o
+device_dax-y := $(DAX_SRC)/device.o
+device_dax-y += config_check.o
+
dax_pmem-y := $(DAX_SRC)/pmem.o
dax_pmem-y += config_check.o
For the current block_device based filesystem-dax path, we need a way
for it to look up the dax_device associated with a block_device. Add a
'host' property to a dax_device that can be used for this purpose. It is
a free-form string, but for a dax_device associated with a block device
it is the bdev name.
This is a stop-gap until filesystems are able to mount on a dax-inode
directly.
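As a rough userspace illustration of the lookup described above (not the
kernel code itself), the sketch below keeps devices in a small hash table
keyed by host name and skips devices that are no longer alive. The names
demo_dax_device, demo_add_host() and demo_get_by_host() are hypothetical,
the table size is arbitrary, and the djb2 hash stands in for the
hashlen_string() call used by the patch. Locking and reference counting
are deliberately left out.

```c
#include <stddef.h>
#include <string.h>

#define HOST_HASH_SIZE 64	/* illustrative; the patch derives its size from PAGE_SIZE */

/* Hypothetical stand-in for struct dax_device: only the lookup fields. */
struct demo_dax_device {
	struct demo_dax_device *next;	/* hash-bucket chain */
	const char *host;		/* e.g. the bdev name, or NULL */
	int alive;			/* cleared when the device is killed */
};

static struct demo_dax_device *host_table[HOST_HASH_SIZE];

/* djb2 string hash; an assumption, the patch uses hashlen_string(). */
static unsigned int demo_host_hash(const char *host)
{
	unsigned int h = 5381;

	while (*host)
		h = h * 33 + (unsigned char)*host++;
	return h % HOST_HASH_SIZE;
}

/* Register a device under its host name; devices without a host stay unhashed. */
static void demo_add_host(struct demo_dax_device *dev, const char *host)
{
	unsigned int hash;

	dev->next = NULL;
	dev->host = host;
	if (!host)
		return;
	hash = demo_host_hash(host);
	dev->next = host_table[hash];
	host_table[hash] = dev;
}

/* Return the live device registered under @host, or NULL if there is none. */
static struct demo_dax_device *demo_get_by_host(const char *host)
{
	struct demo_dax_device *dev;

	if (!host)
		return NULL;
	for (dev = host_table[demo_host_hash(host)]; dev; dev = dev->next)
		if (dev->alive && strcmp(host, dev->host) == 0)
			return dev;
	return NULL;
}
```

A caller that only knows a block device name would pass that name to the
lookup and treat a NULL result as "no dax_device for this bdev".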
Signed-off-by: Dan Williams <[email protected]>
---
drivers/dax/dax.h | 2 +
drivers/dax/device.c | 2 +
drivers/dax/super.c | 83 ++++++++++++++++++++++++++++++++++++++++++++++++--
include/linux/dax.h | 1 +
4 files changed, 82 insertions(+), 6 deletions(-)
diff --git a/drivers/dax/dax.h b/drivers/dax/dax.h
index 2472d9da96db..246a24d68d4c 100644
--- a/drivers/dax/dax.h
+++ b/drivers/dax/dax.h
@@ -13,7 +13,7 @@
#ifndef __DAX_H__
#define __DAX_H__
struct dax_device;
-struct dax_device *alloc_dax(void *private);
+struct dax_device *alloc_dax(void *private, const char *host);
void put_dax(struct dax_device *dax_dev);
bool dax_alive(struct dax_device *dax_dev);
void kill_dax(struct dax_device *dax_dev);
diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index 19a42edbfa03..db68f4fa8ce0 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -645,7 +645,7 @@ struct dev_dax *devm_create_dev_dax(struct dax_region *dax_region,
goto err_id;
}
- dax_dev = alloc_dax(dev_dax);
+ dax_dev = alloc_dax(dev_dax, NULL);
if (!dax_dev)
goto err_dax;
diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index c9f85f1c086e..bb22956a106b 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -30,6 +30,10 @@ static DEFINE_IDA(dax_minor_ida);
static struct kmem_cache *dax_cache __read_mostly;
static struct super_block *dax_superblock __read_mostly;
+#define DAX_HASH_SIZE (PAGE_SIZE / sizeof(struct hlist_head))
+static struct hlist_head dax_host_list[DAX_HASH_SIZE];
+static DEFINE_SPINLOCK(dax_host_lock);
+
int dax_read_lock(void)
{
return srcu_read_lock(&dax_srcu);
@@ -46,12 +50,15 @@ EXPORT_SYMBOL_GPL(dax_read_unlock);
* struct dax_device - anchor object for dax services
* @inode: core vfs
* @cdev: optional character interface for "device dax"
+ * @host: optional name for lookups where the device path is not available
* @private: dax driver private data
* @alive: !alive + rcu grace period == no new operations / mappings
*/
struct dax_device {
+ struct hlist_node list;
struct inode inode;
struct cdev cdev;
+ const char *host;
void *private;
bool alive;
};
@@ -63,6 +70,11 @@ bool dax_alive(struct dax_device *dax_dev)
}
EXPORT_SYMBOL_GPL(dax_alive);
+static int dax_host_hash(const char *host)
+{
+ return hashlen_hash(hashlen_string("DAX", host)) % DAX_HASH_SIZE;
+}
+
/*
* Note, rcu is not protecting the liveness of dax_dev, rcu is ensuring
* that any fault handlers or operations that might have seen
@@ -75,6 +87,12 @@ void kill_dax(struct dax_device *dax_dev)
return;
dax_dev->alive = false;
+
+ spin_lock(&dax_host_lock);
+ if (!hlist_unhashed(&dax_dev->list))
+ hlist_del_init(&dax_dev->list);
+ spin_unlock(&dax_host_lock);
+
synchronize_srcu(&dax_srcu);
dax_dev->private = NULL;
}
@@ -98,6 +116,8 @@ static void dax_i_callback(struct rcu_head *head)
struct inode *inode = container_of(head, struct inode, i_rcu);
struct dax_device *dax_dev = to_dax_dev(inode);
+ kfree(dax_dev->host);
+ dax_dev->host = NULL;
ida_simple_remove(&dax_minor_ida, MINOR(inode->i_rdev));
kmem_cache_free(dax_cache, dax_dev);
}
@@ -169,26 +189,49 @@ static struct dax_device *dax_dev_get(dev_t devt)
return dax_dev;
}
-struct dax_device *alloc_dax(void *private)
+static void dax_add_host(struct dax_device *dax_dev, const char *host)
+{
+ int hash;
+
+ INIT_HLIST_NODE(&dax_dev->list);
+ if (!host)
+ return;
+
+ dax_dev->host = host;
+ hash = dax_host_hash(host);
+ spin_lock(&dax_host_lock);
+ hlist_add_head(&dax_dev->list, &dax_host_list[hash]);
+ spin_unlock(&dax_host_lock);
+}
+
+struct dax_device *alloc_dax(void *private, const char *__host)
{
struct dax_device *dax_dev;
+ const char *host;
dev_t devt;
int minor;
+ host = kstrdup(__host, GFP_KERNEL);
+ if (__host && !host)
+ return NULL;
+
minor = ida_simple_get(&dax_minor_ida, 0, nr_dax, GFP_KERNEL);
if (minor < 0)
- return NULL;
+ goto err_minor;
devt = MKDEV(MAJOR(dax_devt), minor);
dax_dev = dax_dev_get(devt);
if (!dax_dev)
- goto err_inode;
+ goto err_dev;
+ dax_add_host(dax_dev, host);
dax_dev->private = private;
return dax_dev;
- err_inode:
+ err_dev:
ida_simple_remove(&dax_minor_ida, minor);
+ err_minor:
+ kfree(host);
return NULL;
}
EXPORT_SYMBOL_GPL(alloc_dax);
@@ -202,6 +245,38 @@ void put_dax(struct dax_device *dax_dev)
EXPORT_SYMBOL_GPL(put_dax);
/**
+ * dax_get_by_host() - temporary lookup mechanism for filesystem-dax
+ * @host: alternate name for the device registered by a dax driver
+ */
+struct dax_device *dax_get_by_host(const char *host)
+{
+ struct dax_device *dax_dev, *found = NULL;
+ int hash, id;
+
+ if (!host)
+ return NULL;
+
+ hash = dax_host_hash(host);
+
+ id = dax_read_lock();
+ spin_lock(&dax_host_lock);
+ hlist_for_each_entry(dax_dev, &dax_host_list[hash], list) {
+ if (!dax_alive(dax_dev)
+ || strcmp(host, dax_dev->host) != 0)
+ continue;
+
+ if (igrab(&dax_dev->inode))
+ found = dax_dev;
+ break;
+ }
+ spin_unlock(&dax_host_lock);
+ dax_read_unlock(id);
+
+ return found;
+}
+EXPORT_SYMBOL_GPL(dax_get_by_host);
+
+/**
* inode_dax: convert a public inode into its dax_dev
* @inode: An inode with i_cdev pointing to a dax_dev
*
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 5b62f5d19aea..9b2d5ba10d7d 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -10,6 +10,7 @@ struct iomap_ops;
int dax_read_lock(void);
void dax_read_unlock(int id);
+struct dax_device *dax_get_by_host(const char *host);
/*
* We use lowest available bit in exceptional entry for locking, one bit for
Track a set of dax_operations per dax_device that can be set at
alloc_dax() time. These operations will be used to stop the abuse of
block_device_operations for communicating dax capabilities to
filesystems. They will also be used to replace the "pmem api" and move
pmem-specific cache maintenance, and other dax-driver-specific
filesystem-dax operations, to dax device methods. In particular, this
allows us to stop abusing __copy_user_nocache(), via memcpy_to_pmem(),
by substituting a driver-specific replacement.
This is a standalone introduction of the operations. Follow-on patches
convert each dax-driver and teach fs/dax.c to use ->direct_access() from
dax_operations instead of block_device_operations.
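To make the indirection concrete, here is a minimal userspace sketch of a
per-device operations table with a single page-based direct_access
method. Everything prefixed with demo_ is hypothetical, a plain unsigned
long stands in for pfn_t, and the page size is assumed to be 4096 bytes;
it only models the dispatch pattern, not the kernel implementation.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_PAGE_SIZE 4096UL	/* assumed page size for the sketch */

struct demo_dax_device;

/* Mirrors the shape of struct dax_operations: one page-based method. */
struct demo_dax_operations {
	long (*direct_access)(struct demo_dax_device *dev, unsigned long pgoff,
			      long nr_pages, void **kaddr, unsigned long *pfn);
};

struct demo_dax_device {
	void *private;				/* driver private data */
	const struct demo_dax_operations *ops;	/* set once at allocation time */
};

/* Hypothetical driver state: one contiguous memory range. */
struct demo_ram {
	char *virt;		/* virtual mapping of the range */
	unsigned long phys;	/* physical base address */
	unsigned long size;	/* size in bytes */
};

/* Driver method: translate a page offset into an address and a pfn. */
static long demo_ram_direct_access(struct demo_dax_device *dev,
		unsigned long pgoff, long nr_pages, void **kaddr,
		unsigned long *pfn)
{
	struct demo_ram *ram = dev->private;
	unsigned long offset = pgoff * DEMO_PAGE_SIZE;

	(void)nr_pages;		/* this simple driver has no per-request limit */
	if (offset >= ram->size)
		return 0;
	*kaddr = ram->virt + offset;
	*pfn = (ram->phys + offset) / DEMO_PAGE_SIZE;
	return (long)((ram->size - offset) / DEMO_PAGE_SIZE);
}

static const struct demo_dax_operations demo_ram_ops = {
	.direct_access = demo_ram_direct_access,
};

/* Core-side dispatch: callers need not know which driver backs @dev. */
static long demo_direct_access(struct demo_dax_device *dev, unsigned long pgoff,
		long nr_pages, void **kaddr, unsigned long *pfn)
{
	assert(dev && dev->ops && dev->ops->direct_access);
	return dev->ops->direct_access(dev, pgoff, nr_pages, kaddr, pfn);
}
```

The point of the pattern is that a filesystem-side caller only ever sees
demo_direct_access(); driver-specific behavior stays behind the ops
pointer supplied when the device is allocated.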
Suggested-by: Christoph Hellwig <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
drivers/dax/dax.h | 4 +++-
drivers/dax/device.c | 6 +++++-
drivers/dax/super.c | 6 +++++-
include/linux/dax.h | 10 ++++++++++
4 files changed, 23 insertions(+), 3 deletions(-)
diff --git a/drivers/dax/dax.h b/drivers/dax/dax.h
index 246a24d68d4c..617bbc24be2b 100644
--- a/drivers/dax/dax.h
+++ b/drivers/dax/dax.h
@@ -13,7 +13,9 @@
#ifndef __DAX_H__
#define __DAX_H__
struct dax_device;
-struct dax_device *alloc_dax(void *private, const char *host);
+struct dax_operations;
+struct dax_device *alloc_dax(void *private, const char *host,
+ const struct dax_operations *ops);
void put_dax(struct dax_device *dax_dev);
bool dax_alive(struct dax_device *dax_dev);
void kill_dax(struct dax_device *dax_dev);
diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index db68f4fa8ce0..a0db055054a4 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -645,7 +645,11 @@ struct dev_dax *devm_create_dev_dax(struct dax_region *dax_region,
goto err_id;
}
- dax_dev = alloc_dax(dev_dax, NULL);
+ /*
+ * No 'host' or dax_operations since there is no access to this
+ * device outside of mmap of the resulting character device.
+ */
+ dax_dev = alloc_dax(dev_dax, NULL, NULL);
if (!dax_dev)
goto err_dax;
diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index bb22956a106b..45ccfc043da8 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -17,6 +17,7 @@
#include <linux/cdev.h>
#include <linux/hash.h>
#include <linux/slab.h>
+#include <linux/dax.h>
#include <linux/fs.h>
static int nr_dax = CONFIG_NR_DEV_DAX;
@@ -61,6 +62,7 @@ struct dax_device {
const char *host;
void *private;
bool alive;
+ const struct dax_operations *ops;
};
bool dax_alive(struct dax_device *dax_dev)
@@ -204,7 +206,8 @@ static void dax_add_host(struct dax_device *dax_dev, const char *host)
spin_unlock(&dax_host_lock);
}
-struct dax_device *alloc_dax(void *private, const char *__host)
+struct dax_device *alloc_dax(void *private, const char *__host,
+ const struct dax_operations *ops)
{
struct dax_device *dax_dev;
const char *host;
@@ -225,6 +228,7 @@ struct dax_device *alloc_dax(void *private, const char *__host)
goto err_dev;
dax_add_host(dax_dev, host);
+ dax_dev->ops = ops;
dax_dev->private = private;
return dax_dev;
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 9b2d5ba10d7d..74ebb92b625a 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -7,6 +7,16 @@
#include <asm/pgtable.h>
struct iomap_ops;
+struct dax_device;
+struct dax_operations {
+ /*
+ * direct_access: translate a device-relative
+ * logical-page-offset into an absolute physical pfn. Return the
+ * number of pages available for DAX at that pfn.
+ */
+ long (*direct_access)(struct dax_device *, pgoff_t, long,
+ void **, pfn_t *);
+};
int dax_read_lock(void);
void dax_read_unlock(int id);
Set up a dax_device to have the same lifetime as the axon_ram block
device and add a ->direct_access() method that is equivalent to
axon_ram_direct_access(). Once fs/dax.c has been converted to use
dax_operations, the old axon_ram_direct_access() will be removed.
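The equivalence between the two entry points comes down to unit
conversion: the legacy method takes a sector and returns bytes, while the
dax method takes a page offset and returns pages. A small userspace
sketch of that arithmetic follows; the demo_ names are hypothetical and
512-byte sectors with 4096-byte pages are assumed.

```c
#include <assert.h>

#define DEMO_PAGE_SIZE 4096L	/* assumed page size */
#define DEMO_SECTOR_SIZE 512L	/* assumed sector size */

/* Page-based core helper: pages remaining from @pgoff in a bank of @bank_size bytes. */
static long demo_pages_available(long bank_size, long pgoff)
{
	long offset = pgoff * DEMO_PAGE_SIZE;

	assert(offset <= bank_size);
	return (bank_size - offset) / DEMO_PAGE_SIZE;
}

/* Legacy sector-based wrapper: sector in, bytes out, built on the page helper. */
static long demo_blk_bytes_available(long bank_size, long sector)
{
	long pgoff = sector * DEMO_SECTOR_SIZE / DEMO_PAGE_SIZE;

	return demo_pages_available(bank_size, pgoff) * DEMO_PAGE_SIZE;
}
```

Note that the sector-to-page division rounds down, so in this sketch a
sector that is not page aligned is silently treated as the start of its
containing page.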
Signed-off-by: Dan Williams <[email protected]>
---
arch/powerpc/platforms/Kconfig | 1 +
arch/powerpc/sysdev/axonram.c | 48 +++++++++++++++++++++++++++++++++++-----
2 files changed, 43 insertions(+), 6 deletions(-)
diff --git a/arch/powerpc/platforms/Kconfig b/arch/powerpc/platforms/Kconfig
index 7e3a2ebba29b..33244e3d9375 100644
--- a/arch/powerpc/platforms/Kconfig
+++ b/arch/powerpc/platforms/Kconfig
@@ -284,6 +284,7 @@ config CPM2
config AXON_RAM
tristate "Axon DDR2 memory device driver"
depends on PPC_IBM_CELL_BLADE && BLOCK
+ select DAX
default m
help
It registers one block device per Axon's DDR2 memory bank found
diff --git a/arch/powerpc/sysdev/axonram.c b/arch/powerpc/sysdev/axonram.c
index f523ac883150..ad857d5e81b1 100644
--- a/arch/powerpc/sysdev/axonram.c
+++ b/arch/powerpc/sysdev/axonram.c
@@ -25,6 +25,7 @@
#include <linux/bio.h>
#include <linux/blkdev.h>
+#include <linux/dax.h>
#include <linux/device.h>
#include <linux/errno.h>
#include <linux/fs.h>
@@ -62,6 +63,7 @@ static int azfs_major, azfs_minor;
struct axon_ram_bank {
struct platform_device *device;
struct gendisk *disk;
+ struct dax_device *dax_dev;
unsigned int irq_id;
unsigned long ph_addr;
unsigned long io_addr;
@@ -137,25 +139,47 @@ axon_ram_make_request(struct request_queue *queue, struct bio *bio)
return BLK_QC_T_NONE;
}
+static long
+__axon_ram_direct_access(struct axon_ram_bank *bank, pgoff_t pgoff, long nr_pages,
+ void **kaddr, pfn_t *pfn)
+{
+ resource_size_t offset = pgoff * PAGE_SIZE;
+
+ *kaddr = (void *) bank->io_addr + offset;
+ *pfn = phys_to_pfn_t(bank->ph_addr + offset, PFN_DEV);
+ return (bank->size - offset) / PAGE_SIZE;
+}
+
/**
* axon_ram_direct_access - direct_access() method for block device
* @device, @sector, @data: see block_device_operations method
*/
static long
-axon_ram_direct_access(struct block_device *device, sector_t sector,
+axon_ram_blk_direct_access(struct block_device *device, sector_t sector,
void **kaddr, pfn_t *pfn, long size)
{
struct axon_ram_bank *bank = device->bd_disk->private_data;
- loff_t offset = (loff_t)sector << AXON_RAM_SECTOR_SHIFT;
- *kaddr = (void *) bank->io_addr + offset;
- *pfn = phys_to_pfn_t(bank->ph_addr + offset, PFN_DEV);
- return bank->size - offset;
+ return __axon_ram_direct_access(bank, (sector * 512) / PAGE_SIZE,
+ size / PAGE_SIZE, kaddr, pfn) * PAGE_SIZE;
}
static const struct block_device_operations axon_ram_devops = {
.owner = THIS_MODULE,
- .direct_access = axon_ram_direct_access
+ .direct_access = axon_ram_blk_direct_access
+};
+
+static long
+axon_ram_dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff, long nr_pages,
+ void **kaddr, pfn_t *pfn)
+{
+ struct axon_ram_bank *bank = dax_get_private(dax_dev);
+
+ return __axon_ram_direct_access(bank, pgoff, nr_pages, kaddr, pfn);
+}
+
+static const struct dax_operations axon_ram_dax_ops = {
+ .direct_access = axon_ram_dax_direct_access,
};
/**
@@ -219,6 +243,7 @@ static int axon_ram_probe(struct platform_device *device)
goto failed;
}
+
bank->disk->major = azfs_major;
bank->disk->first_minor = azfs_minor;
bank->disk->fops = &axon_ram_devops;
@@ -227,6 +252,11 @@ static int axon_ram_probe(struct platform_device *device)
sprintf(bank->disk->disk_name, "%s%d",
AXON_RAM_DEVICE_NAME, axon_ram_bank_id);
+ bank->dax_dev = alloc_dax(bank, bank->disk->disk_name,
+ &axon_ram_dax_ops);
+ if (!bank->dax_dev)
+ goto failed;
+
bank->disk->queue = blk_alloc_queue(GFP_KERNEL);
if (bank->disk->queue == NULL) {
dev_err(&device->dev, "Cannot register disk queue\n");
@@ -278,6 +308,10 @@ static int axon_ram_probe(struct platform_device *device)
del_gendisk(bank->disk);
put_disk(bank->disk);
}
+ if (bank->dax_dev) {
+ kill_dax(bank->dax_dev);
+ put_dax(bank->dax_dev);
+ }
device->dev.platform_data = NULL;
if (bank->io_addr != 0)
iounmap((void __iomem *) bank->io_addr);
@@ -300,6 +334,8 @@ axon_ram_remove(struct platform_device *device)
device_remove_file(&device->dev, &dev_attr_ecc);
free_irq(bank->irq_id, device);
+ kill_dax(bank->dax_dev);
+ put_dax(bank->dax_dev);
del_gendisk(bank->disk);
put_disk(bank->disk);
iounmap((void __iomem *) bank->io_addr);
Set up a dax_device to have the same lifetime as the pmem block device
and add a ->direct_access() method that is equivalent to
pmem_direct_access(). Once fs/dax.c has been converted to use
dax_operations, the old pmem_direct_access() will be removed.
Signed-off-by: Dan Williams <[email protected]>
---
drivers/dax/dax.h | 7 ----
drivers/nvdimm/Kconfig | 1 +
drivers/nvdimm/pmem.c | 61 +++++++++++++++++++++++++++++++--------
drivers/nvdimm/pmem.h | 7 +++-
include/linux/dax.h | 6 ++++
tools/testing/nvdimm/pmem-dax.c | 21 ++++++-------
6 files changed, 70 insertions(+), 33 deletions(-)
diff --git a/drivers/dax/dax.h b/drivers/dax/dax.h
index 617bbc24be2b..f9e5feea742c 100644
--- a/drivers/dax/dax.h
+++ b/drivers/dax/dax.h
@@ -13,13 +13,6 @@
#ifndef __DAX_H__
#define __DAX_H__
struct dax_device;
-struct dax_operations;
-struct dax_device *alloc_dax(void *private, const char *host,
- const struct dax_operations *ops);
-void put_dax(struct dax_device *dax_dev);
-bool dax_alive(struct dax_device *dax_dev);
-void kill_dax(struct dax_device *dax_dev);
struct dax_device *inode_dax(struct inode *inode);
struct inode *dax_inode(struct dax_device *dax_dev);
-void *dax_get_private(struct dax_device *dax_dev);
#endif /* __DAX_H__ */
diff --git a/drivers/nvdimm/Kconfig b/drivers/nvdimm/Kconfig
index 59e750183b7f..5bdd499b5f4f 100644
--- a/drivers/nvdimm/Kconfig
+++ b/drivers/nvdimm/Kconfig
@@ -20,6 +20,7 @@ if LIBNVDIMM
config BLK_DEV_PMEM
tristate "PMEM: Persistent memory block device support"
default LIBNVDIMM
+ select DAX
select ND_BTT if BTT
select ND_PFN if NVDIMM_PFN
help
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 5b536be5a12e..fbbcf8154eec 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -28,6 +28,7 @@
#include <linux/pfn_t.h>
#include <linux/slab.h>
#include <linux/pmem.h>
+#include <linux/dax.h>
#include <linux/nd.h>
#include "pmem.h"
#include "pfn.h"
@@ -199,13 +200,13 @@ static int pmem_rw_page(struct block_device *bdev, sector_t sector,
}
/* see "strong" declaration in tools/testing/nvdimm/pmem-dax.c */
-__weak long pmem_direct_access(struct block_device *bdev, sector_t sector,
- void **kaddr, pfn_t *pfn, long size)
+__weak long __pmem_direct_access(struct pmem_device *pmem, pgoff_t pgoff,
+ long nr_pages, void **kaddr, pfn_t *pfn)
{
- struct pmem_device *pmem = bdev->bd_queue->queuedata;
- resource_size_t offset = sector * 512 + pmem->data_offset;
+ resource_size_t offset = PFN_PHYS(pgoff) + pmem->data_offset;
- if (unlikely(is_bad_pmem(&pmem->bb, sector, size)))
+ if (unlikely(is_bad_pmem(&pmem->bb, PFN_PHYS(pgoff) / 512,
+ PFN_PHYS(nr_pages))))
return -EIO;
*kaddr = pmem->virt_addr + offset;
*pfn = phys_to_pfn_t(pmem->phys_addr + offset, pmem->pfn_flags);
@@ -215,26 +216,51 @@ __weak long pmem_direct_access(struct block_device *bdev, sector_t sector,
* requested range.
*/
if (unlikely(pmem->bb.count))
- return size;
- return pmem->size - pmem->pfn_pad - offset;
+ return nr_pages;
+ return PHYS_PFN(pmem->size - pmem->pfn_pad - offset);
+}
+
+static long pmem_blk_direct_access(struct block_device *bdev, sector_t sector,
+ void **kaddr, pfn_t *pfn, long size)
+{
+ struct pmem_device *pmem = bdev->bd_queue->queuedata;
+
+ return __pmem_direct_access(pmem, PHYS_PFN(sector * 512),
+ PHYS_PFN(size), kaddr, pfn);
}
static const struct block_device_operations pmem_fops = {
.owner = THIS_MODULE,
.rw_page = pmem_rw_page,
- .direct_access = pmem_direct_access,
+ .direct_access = pmem_blk_direct_access,
.revalidate_disk = nvdimm_revalidate_disk,
};
+static long pmem_dax_direct_access(struct dax_device *dax_dev,
+ pgoff_t pgoff, long nr_pages, void **kaddr, pfn_t *pfn)
+{
+ struct pmem_device *pmem = dax_get_private(dax_dev);
+
+ return __pmem_direct_access(pmem, pgoff, nr_pages, kaddr, pfn);
+}
+
+static const struct dax_operations pmem_dax_ops = {
+ .direct_access = pmem_dax_direct_access,
+};
+
static void pmem_release_queue(void *q)
{
blk_cleanup_queue(q);
}
-static void pmem_release_disk(void *disk)
+static void pmem_release_disk(void *__pmem)
{
- del_gendisk(disk);
- put_disk(disk);
+ struct pmem_device *pmem = __pmem;
+
+ kill_dax(pmem->dax_dev);
+ put_dax(pmem->dax_dev);
+ del_gendisk(pmem->disk);
+ put_disk(pmem->disk);
}
static int pmem_attach_disk(struct device *dev,
@@ -245,6 +271,7 @@ static int pmem_attach_disk(struct device *dev,
struct vmem_altmap __altmap, *altmap = NULL;
struct resource *res = &nsio->res;
struct nd_pfn *nd_pfn = NULL;
+ struct dax_device *dax_dev;
int nid = dev_to_node(dev);
struct nd_pfn_sb *pfn_sb;
struct pmem_device *pmem;
@@ -325,6 +352,7 @@ static int pmem_attach_disk(struct device *dev,
disk = alloc_disk_node(0, nid);
if (!disk)
return -ENOMEM;
+ pmem->disk = disk;
disk->fops = &pmem_fops;
disk->queue = q;
@@ -336,9 +364,16 @@ static int pmem_attach_disk(struct device *dev,
return -ENOMEM;
nvdimm_badblocks_populate(nd_region, &pmem->bb, res);
disk->bb = &pmem->bb;
- device_add_disk(dev, disk);
- if (devm_add_action_or_reset(dev, pmem_release_disk, disk))
+ dax_dev = alloc_dax(pmem, disk->disk_name, &pmem_dax_ops);
+ if (!dax_dev) {
+ put_disk(disk);
+ return -ENOMEM;
+ }
+ pmem->dax_dev = dax_dev;
+
+ device_add_disk(dev, disk);
+ if (devm_add_action_or_reset(dev, pmem_release_disk, pmem))
return -ENOMEM;
revalidate_disk(disk);
diff --git a/drivers/nvdimm/pmem.h b/drivers/nvdimm/pmem.h
index b4ee4f71b4a1..7f4dbd72a90a 100644
--- a/drivers/nvdimm/pmem.h
+++ b/drivers/nvdimm/pmem.h
@@ -5,8 +5,6 @@
#include <linux/pfn_t.h>
#include <linux/fs.h>
-long pmem_direct_access(struct block_device *bdev, sector_t sector,
- void **kaddr, pfn_t *pfn, long size);
/* this definition is in it's own header for tools/testing/nvdimm to consume */
struct pmem_device {
/* One contiguous memory region per device */
@@ -20,5 +18,10 @@ struct pmem_device {
/* trim size when namespace capacity has been section aligned */
u32 pfn_pad;
struct badblocks bb;
+ struct dax_device *dax_dev;
+ struct gendisk *disk;
};
+
+long __pmem_direct_access(struct pmem_device *pmem, pgoff_t pgoff,
+ long nr_pages, void **kaddr, pfn_t *pfn);
#endif /* __NVDIMM_PMEM_H__ */
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 74ebb92b625a..39a0312c45c3 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -21,6 +21,12 @@ struct dax_operations {
int dax_read_lock(void);
void dax_read_unlock(int id);
struct dax_device *dax_get_by_host(const char *host);
+struct dax_device *alloc_dax(void *private, const char *host,
+ const struct dax_operations *ops);
+void put_dax(struct dax_device *dax_dev);
+bool dax_alive(struct dax_device *dax_dev);
+void kill_dax(struct dax_device *dax_dev);
+void *dax_get_private(struct dax_device *dax_dev);
/*
* We use lowest available bit in exceptional entry for locking, one bit for
diff --git a/tools/testing/nvdimm/pmem-dax.c b/tools/testing/nvdimm/pmem-dax.c
index c9b8c48f85fc..b53596ad601b 100644
--- a/tools/testing/nvdimm/pmem-dax.c
+++ b/tools/testing/nvdimm/pmem-dax.c
@@ -15,13 +15,13 @@
#include <pmem.h>
#include <nd.h>
-long pmem_direct_access(struct block_device *bdev, sector_t sector,
- void **kaddr, pfn_t *pfn, long size)
+long __pmem_direct_access(struct pmem_device *pmem, pgoff_t pgoff,
+ long nr_pages, void **kaddr, pfn_t *pfn)
{
- struct pmem_device *pmem = bdev->bd_queue->queuedata;
- resource_size_t offset = sector * 512 + pmem->data_offset;
+ resource_size_t offset = PFN_PHYS(pgoff) + pmem->data_offset;
- if (unlikely(is_bad_pmem(&pmem->bb, sector, size)))
+ if (unlikely(is_bad_pmem(&pmem->bb, PFN_PHYS(pgoff) / 512,
+ PFN_PHYS(nr_pages))))
return -EIO;
/*
@@ -34,11 +34,10 @@ long pmem_direct_access(struct block_device *bdev, sector_t sector,
*kaddr = pmem->virt_addr + offset;
page = vmalloc_to_page(pmem->virt_addr + offset);
*pfn = page_to_pfn_t(page);
- dev_dbg_ratelimited(disk_to_dev(bdev->bd_disk)->parent,
- "%s: sector: %#llx pfn: %#lx\n", __func__,
- (unsigned long long) sector, page_to_pfn(page));
+ pr_debug_ratelimited("%s: pmem: %p pgoff: %#lx pfn: %#lx\n",
+ __func__, pmem, pgoff, page_to_pfn(page));
- return PAGE_SIZE;
+ return 1;
}
*kaddr = pmem->virt_addr + offset;
@@ -49,6 +48,6 @@ long pmem_direct_access(struct block_device *bdev, sector_t sector,
* requested range.
*/
if (unlikely(pmem->bb.count))
- return size;
- return pmem->size - pmem->pfn_pad - offset;
+ return nr_pages;
+ return PHYS_PFN(pmem->size - pmem->pfn_pad - offset);
}
Set up a dax_device to have the same lifetime as the dcssblk block device
and add a ->direct_access() method that is equivalent to
dcssblk_direct_access(). Once fs/dax.c has been converted to use
dax_operations, the old dcssblk_direct_access() will be removed.
Cc: Gerald Schaefer <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
drivers/s390/block/Kconfig | 1 +
drivers/s390/block/dcssblk.c | 54 +++++++++++++++++++++++++++++++++++-------
2 files changed, 46 insertions(+), 9 deletions(-)
diff --git a/drivers/s390/block/Kconfig b/drivers/s390/block/Kconfig
index 4a3b62326183..0acb8c2f9475 100644
--- a/drivers/s390/block/Kconfig
+++ b/drivers/s390/block/Kconfig
@@ -14,6 +14,7 @@ config BLK_DEV_XPRAM
config DCSSBLK
def_tristate m
+ select DAX
prompt "DCSSBLK support"
depends on S390 && BLOCK
help
diff --git a/drivers/s390/block/dcssblk.c b/drivers/s390/block/dcssblk.c
index 415d10a67b7a..682a9eb4934d 100644
--- a/drivers/s390/block/dcssblk.c
+++ b/drivers/s390/block/dcssblk.c
@@ -18,6 +18,7 @@
#include <linux/interrupt.h>
#include <linux/platform_device.h>
#include <linux/pfn_t.h>
+#include <linux/dax.h>
#include <asm/extmem.h>
#include <asm/io.h>
@@ -30,8 +31,10 @@ static int dcssblk_open(struct block_device *bdev, fmode_t mode);
static void dcssblk_release(struct gendisk *disk, fmode_t mode);
static blk_qc_t dcssblk_make_request(struct request_queue *q,
struct bio *bio);
-static long dcssblk_direct_access(struct block_device *bdev, sector_t secnum,
+static long dcssblk_blk_direct_access(struct block_device *bdev, sector_t secnum,
void **kaddr, pfn_t *pfn, long size);
+static long dcssblk_dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
+ long nr_pages, void **kaddr, pfn_t *pfn);
static char dcssblk_segments[DCSSBLK_PARM_LEN] = "\0";
@@ -40,7 +43,11 @@ static const struct block_device_operations dcssblk_devops = {
.owner = THIS_MODULE,
.open = dcssblk_open,
.release = dcssblk_release,
- .direct_access = dcssblk_direct_access,
+ .direct_access = dcssblk_blk_direct_access,
+};
+
+static const struct dax_operations dcssblk_dax_ops = {
+ .direct_access = dcssblk_dax_direct_access,
};
struct dcssblk_dev_info {
@@ -57,6 +64,7 @@ struct dcssblk_dev_info {
struct request_queue *dcssblk_queue;
int num_of_segments;
struct list_head seg_list;
+ struct dax_device *dax_dev;
};
struct segment_info {
@@ -389,6 +397,8 @@ dcssblk_shared_store(struct device *dev, struct device_attribute *attr, const ch
}
list_del(&dev_info->lh);
+ kill_dax(dev_info->dax_dev);
+ put_dax(dev_info->dax_dev);
del_gendisk(dev_info->gd);
blk_cleanup_queue(dev_info->dcssblk_queue);
dev_info->gd->queue = NULL;
@@ -525,6 +535,7 @@ dcssblk_add_store(struct device *dev, struct device_attribute *attr, const char
int rc, i, j, num_of_segments;
struct dcssblk_dev_info *dev_info;
struct segment_info *seg_info, *temp;
+ struct dax_device *dax_dev;
char *local_buf;
unsigned long seg_byte_size;
@@ -654,6 +665,11 @@ dcssblk_add_store(struct device *dev, struct device_attribute *attr, const char
if (rc)
goto put_dev;
+ dax_dev = alloc_dax(dev_info, dev_info->gd->disk_name,
+ &dcssblk_dax_ops);
+ if (!dax_dev)
+ goto put_dev;
+
get_device(&dev_info->dev);
device_add_disk(&dev_info->dev, dev_info->gd);
@@ -752,6 +768,8 @@ dcssblk_remove_store(struct device *dev, struct device_attribute *attr, const ch
}
list_del(&dev_info->lh);
+ kill_dax(dev_info->dax_dev);
+ put_dax(dev_info->dax_dev);
del_gendisk(dev_info->gd);
blk_cleanup_queue(dev_info->dcssblk_queue);
dev_info->gd->queue = NULL;
@@ -883,21 +901,39 @@ dcssblk_make_request(struct request_queue *q, struct bio *bio)
}
static long
-dcssblk_direct_access (struct block_device *bdev, sector_t secnum,
+__dcssblk_direct_access(struct dcssblk_dev_info *dev_info, pgoff_t pgoff,
+ long nr_pages, void **kaddr, pfn_t *pfn)
+{
+ resource_size_t offset = pgoff * PAGE_SIZE;
+ unsigned long dev_sz;
+
+ dev_sz = dev_info->end - dev_info->start + 1;
+ *kaddr = (void *) dev_info->start + offset;
+ *pfn = __pfn_to_pfn_t(PFN_DOWN(dev_info->start + offset), PFN_DEV);
+
+ return (dev_sz - offset) / PAGE_SIZE;
+}
+
+static long
+dcssblk_blk_direct_access(struct block_device *bdev, sector_t secnum,
void **kaddr, pfn_t *pfn, long size)
{
struct dcssblk_dev_info *dev_info;
- unsigned long offset, dev_sz;
dev_info = bdev->bd_disk->private_data;
if (!dev_info)
return -ENODEV;
- dev_sz = dev_info->end - dev_info->start + 1;
- offset = secnum * 512;
- *kaddr = (void *) dev_info->start + offset;
- *pfn = __pfn_to_pfn_t(PFN_DOWN(dev_info->start + offset), PFN_DEV);
+ return __dcssblk_direct_access(dev_info, PHYS_PFN(secnum * 512),
+ PHYS_PFN(size), kaddr, pfn) * PAGE_SIZE;
+}
+
+static long
+dcssblk_dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
+ long nr_pages, void **kaddr, pfn_t *pfn)
+{
+ struct dcssblk_dev_info *dev_info = dax_get_private(dax_dev);
- return dev_sz - offset;
+ return __dcssblk_direct_access(dev_info, pgoff, nr_pages, kaddr, pfn);
}
static void
Replace bdev_direct_access() with dax_direct_access(), which uses a
dax_device and dax_operations instead of a block_device and
block_device_operations for dax. Once all consumers of the old API have
been converted, bdev_direct_access() will be deleted.
Given that block device partitioning decisions can cause dax page
alignment constraints to be violated, this also introduces the
bdev_dax_pgoff() helper. It calculates a logical pgoff relative to the
dax_device and also checks for page alignment.
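The alignment concern can be shown with a short userspace model of the
helper's arithmetic. The function name demo_dax_pgoff() is hypothetical,
the partition start is passed in directly instead of being read from a
block_device, and 512-byte sectors with 4096-byte pages are assumed.

```c
#include <errno.h>
#include <stddef.h>

#define DEMO_PAGE_SIZE 4096ULL		/* assumed page size */
#define DEMO_SECTOR_SIZE 512ULL		/* assumed sector size */

/*
 * @start_sect: first sector of the partition on the whole device
 * @sector:     sector relative to the partition
 * @size:       length of the request in bytes
 * @pgoff:      optional output, page offset relative to the whole device
 *
 * Returns 0 if both the byte offset and the size are page aligned,
 * -EINVAL otherwise.
 */
static int demo_dax_pgoff(unsigned long long start_sect,
		unsigned long long sector, size_t size,
		unsigned long long *pgoff)
{
	unsigned long long phys_off = (start_sect + sector) * DEMO_SECTOR_SIZE;

	if (pgoff)
		*pgoff = phys_off / DEMO_PAGE_SIZE;
	if (phys_off % DEMO_PAGE_SIZE || size % DEMO_PAGE_SIZE)
		return -EINVAL;
	return 0;
}
```

For example, under these assumptions a partition starting at sector 2048
keeps every page-aligned file offset page aligned on the device, whereas
a partition starting at sector 63 begins 32256 bytes into the device,
which is not a multiple of 4096, so the check fails.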
Signed-off-by: Dan Williams <[email protected]>
---
block/Kconfig | 1 +
drivers/dax/super.c | 39 +++++++++++++++++++++++++++++++++++++++
fs/block_dev.c | 14 ++++++++++++++
include/linux/blkdev.h | 1 +
include/linux/dax.h | 2 ++
5 files changed, 57 insertions(+)
diff --git a/block/Kconfig b/block/Kconfig
index e9f780f815f5..93da7fc3f254 100644
--- a/block/Kconfig
+++ b/block/Kconfig
@@ -6,6 +6,7 @@ menuconfig BLOCK
default y
select SBITMAP
select SRCU
+ select DAX
help
Provide block layer support for the kernel.
diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 45ccfc043da8..23ce3ab49f10 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -65,6 +65,45 @@ struct dax_device {
const struct dax_operations *ops;
};
+/**
+ * dax_direct_access() - translate a device pgoff to an absolute pfn
+ * @dax_dev: a dax_device instance representing the logical memory range
+ * @pgoff: offset in pages from the start of the device to translate
+ * @nr_pages: number of consecutive pages caller can handle relative to @pfn
+ * @kaddr: output parameter that returns a virtual address mapping of pfn
+ * @pfn: output parameter that returns an absolute pfn translation of @pgoff
+ *
+ * Return: negative errno if an error occurs, otherwise the number of
+ * pages accessible at the device relative @pgoff.
+ */
+long dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff, long nr_pages,
+ void **kaddr, pfn_t *pfn)
+{
+ long avail;
+
+ /*
+ * The device driver is allowed to sleep, in order to make the
+ * memory directly accessible.
+ */
+ might_sleep();
+
+ if (!dax_dev)
+ return -EOPNOTSUPP;
+
+ if (!dax_alive(dax_dev))
+ return -ENXIO;
+
+ if (nr_pages < 0)
+ return nr_pages;
+
+ avail = dax_dev->ops->direct_access(dax_dev, pgoff, nr_pages,
+ kaddr, pfn);
+ if (!avail)
+ return -ERANGE;
+ return min(avail, nr_pages);
+}
+EXPORT_SYMBOL_GPL(dax_direct_access);
+
bool dax_alive(struct dax_device *dax_dev)
{
lockdep_assert_held(&dax_srcu);
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 7f40ea2f0875..2f7885712575 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -18,6 +18,7 @@
#include <linux/module.h>
#include <linux/blkpg.h>
#include <linux/magic.h>
+#include <linux/dax.h>
#include <linux/buffer_head.h>
#include <linux/swap.h>
#include <linux/pagevec.h>
@@ -762,6 +763,19 @@ long bdev_direct_access(struct block_device *bdev, struct blk_dax_ctl *dax)
}
EXPORT_SYMBOL_GPL(bdev_direct_access);
+int bdev_dax_pgoff(struct block_device *bdev, sector_t sector, size_t size,
+ pgoff_t *pgoff)
+{
+ phys_addr_t phys_off = (get_start_sect(bdev) + sector) * 512;
+
+ if (pgoff)
+ *pgoff = PHYS_PFN(phys_off);
+ if (phys_off % PAGE_SIZE || size % PAGE_SIZE)
+ return -EINVAL;
+ return 0;
+}
+EXPORT_SYMBOL(bdev_dax_pgoff);
+
/**
* bdev_dax_supported() - Check if the device supports dax for filesystem
* @sb: The superblock of the device
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index f72708399b83..612c497d1461 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1958,6 +1958,7 @@ extern int bdev_write_page(struct block_device *, sector_t, struct page *,
struct writeback_control *);
extern long bdev_direct_access(struct block_device *, struct blk_dax_ctl *);
extern int bdev_dax_supported(struct super_block *, int);
+int bdev_dax_pgoff(struct block_device *, sector_t, size_t, pgoff_t *pgoff);
#else /* CONFIG_BLOCK */
struct block_device;
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 39a0312c45c3..7e62e280c11f 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -27,6 +27,8 @@ void put_dax(struct dax_device *dax_dev);
bool dax_alive(struct dax_device *dax_dev);
void kill_dax(struct dax_device *dax_dev);
void *dax_get_private(struct dax_device *dax_dev);
+long dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff, long nr_pages,
+ void **kaddr, pfn_t *pfn);
/*
* We use lowest available bit in exceptional entry for locking, one bit for
bdev_dax_capable() is leftover dead code that has since been replaced by
bdev_dax_supported().
Signed-off-by: Dan Williams <[email protected]>
---
fs/block_dev.c | 24 ------------------------
include/linux/blkdev.h | 1 -
2 files changed, 25 deletions(-)
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 2eca00ec4370..7f40ea2f0875 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -807,30 +807,6 @@ int bdev_dax_supported(struct super_block *sb, int blocksize)
}
EXPORT_SYMBOL_GPL(bdev_dax_supported);
-/**
- * bdev_dax_capable() - Return if the raw device is capable for dax
- * @bdev: The device for raw block device access
- */
-bool bdev_dax_capable(struct block_device *bdev)
-{
- struct blk_dax_ctl dax = {
- .size = PAGE_SIZE,
- };
-
- if (!IS_ENABLED(CONFIG_FS_DAX))
- return false;
-
- dax.sector = 0;
- if (bdev_direct_access(bdev, &dax) < 0)
- return false;
-
- dax.sector = bdev->bd_part->nr_sects - (PAGE_SIZE / 512);
- if (bdev_direct_access(bdev, &dax) < 0)
- return false;
-
- return true;
-}
-
/*
* pseudo-fs
*/
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 5a7da607ca04..f72708399b83 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1958,7 +1958,6 @@ extern int bdev_write_page(struct block_device *, sector_t, struct page *,
struct writeback_control *);
extern long bdev_direct_access(struct block_device *, struct blk_dax_ctl *);
extern int bdev_dax_supported(struct super_block *, int);
-extern bool bdev_dax_capable(struct block_device *);
#else /* CONFIG_BLOCK */
struct block_device;
Allocate a dax_device to represent the capacity of a device-mapper
instance. Provide a ->direct_access() method via the new dax_operations
indirection that mirrors the functionality of the current direct_access
support via block_device_operations. Once fs/dax.c has been converted
to use dax_operations the old dm_blk_direct_access() will be removed.
A new helper dm_dax_get_live_target() is introduced to separate some of
the dm-specifics from the direct_access implementation.
This enabling covers only the top-level dm representation presented to
upper layers. Converting the target direct_access implementations is
deferred to a separate patch.
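
For reviewers, the unit conversion between the legacy byte-based entry
point and the new page-based one can be modelled in user space. This is
only a sketch: the sketch_* and DM_SKETCH_* names are hypothetical, a 4K
page size is assumed, and the target is reduced to a single end sector
instead of a real dm_target.

```c
#include <assert.h>
#include <stdint.h>

#define DM_SKETCH_PAGE_SIZE	4096L	/* assumed; the kernel uses PAGE_SIZE */
#define DM_SKETCH_PAGE_SECTORS	(DM_SKETCH_PAGE_SIZE / 512)
#define DM_SKETCH_EIO		5

/*
 * Page-based model of dm_dax_direct_access(): returns the number of
 * pages usable at pgoff, capped at the end of the (hypothetical)
 * target, or a negative errno.
 */
static long sketch_dax_direct_access(uint64_t pgoff, long nr_pages,
		uint64_t target_end_sector)
{
	uint64_t sector = pgoff * DM_SKETCH_PAGE_SECTORS;
	long len;

	if (sector >= target_end_sector)
		return -DM_SKETCH_EIO;
	len = (long)((target_end_sector - sector) / DM_SKETCH_PAGE_SECTORS);
	if (len < 1)
		return -DM_SKETCH_EIO;
	return nr_pages < len ? nr_pages : len;
}

/* Byte-based model of the transitional dm_blk_direct_access() wrapper. */
static long sketch_blk_direct_access(uint64_t sector, long size,
		uint64_t target_end_sector)
{
	long nr_pages = sketch_dax_direct_access(
			sector / DM_SKETCH_PAGE_SECTORS,
			size / DM_SKETCH_PAGE_SIZE, target_end_sector);

	/* errors pass through unchanged, page counts become bytes */
	return nr_pages < 0 ? nr_pages : nr_pages * DM_SKETCH_PAGE_SIZE;
}

/* Usage: a target that ends at sector 24, i.e. three 4K pages. */
static void sketch_dm_usage(void)
{
	assert(sketch_blk_direct_access(8, 8192, 24) == 8192);
	assert(sketch_blk_direct_access(16, 8192, 24) == 4096);
	assert(sketch_blk_direct_access(24, 4096, 24) == -DM_SKETCH_EIO);
}
```

The point of the model is that the wrapper never scales a negative
errno, which is the same guard the transitional dm_blk_direct_access()
in this patch applies.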
Cc: Toshi Kani <[email protected]>
Cc: Mike Snitzer <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
drivers/md/Kconfig | 1
drivers/md/dm-core.h | 1
drivers/md/dm.c | 84 ++++++++++++++++++++++++++++++++++-------
include/linux/device-mapper.h | 1
4 files changed, 73 insertions(+), 14 deletions(-)
diff --git a/drivers/md/Kconfig b/drivers/md/Kconfig
index b7767da50c26..1de8372d9459 100644
--- a/drivers/md/Kconfig
+++ b/drivers/md/Kconfig
@@ -200,6 +200,7 @@ config BLK_DEV_DM_BUILTIN
config BLK_DEV_DM
tristate "Device mapper support"
select BLK_DEV_DM_BUILTIN
+ select DAX
---help---
Device-mapper is a low level volume manager. It works by allowing
people to specify mappings for ranges of logical sectors. Various
diff --git a/drivers/md/dm-core.h b/drivers/md/dm-core.h
index 136fda3ff9e5..538630190f66 100644
--- a/drivers/md/dm-core.h
+++ b/drivers/md/dm-core.h
@@ -58,6 +58,7 @@ struct mapped_device {
struct target_type *immutable_target_type;
struct gendisk *disk;
+ struct dax_device *dax_dev;
char name[16];
void *interface_ptr;
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index dfb75979e455..bd56dfe43a99 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -16,6 +16,7 @@
#include <linux/blkpg.h>
#include <linux/bio.h>
#include <linux/mempool.h>
+#include <linux/dax.h>
#include <linux/slab.h>
#include <linux/idr.h>
#include <linux/hdreg.h>
@@ -908,31 +909,68 @@ int dm_set_target_max_io_len(struct dm_target *ti, sector_t len)
}
EXPORT_SYMBOL_GPL(dm_set_target_max_io_len);
-static long dm_blk_direct_access(struct block_device *bdev, sector_t sector,
- void **kaddr, pfn_t *pfn, long size)
+static struct dm_target *dm_dax_get_live_target(struct mapped_device *md,
+ sector_t sector, int *srcu_idx)
{
- struct mapped_device *md = bdev->bd_disk->private_data;
struct dm_table *map;
struct dm_target *ti;
- int srcu_idx;
- long len, ret = -EIO;
- map = dm_get_live_table(md, &srcu_idx);
+ map = dm_get_live_table(md, srcu_idx);
if (!map)
- goto out;
+ return NULL;
ti = dm_table_find_target(map, sector);
if (!dm_target_is_valid(ti))
- goto out;
+ return NULL;
- len = max_io_len(sector, ti) << SECTOR_SHIFT;
- size = min(len, size);
+ return ti;
+}
- if (ti->type->direct_access)
- ret = ti->type->direct_access(ti, sector, kaddr, pfn, size);
-out:
+static long dm_dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
+ long nr_pages, void **kaddr, pfn_t *pfn)
+{
+ struct mapped_device *md = dax_get_private(dax_dev);
+ sector_t sector = pgoff * PAGE_SECTORS;
+ struct dm_target *ti;
+ long len, ret = -EIO;
+ int srcu_idx;
+
+ ti = dm_dax_get_live_target(md, sector, &srcu_idx);
+
+ if (!ti)
+ goto out;
+ if (!ti->type->direct_access)
+ goto out;
+ len = max_io_len(sector, ti) / PAGE_SECTORS;
+ if (len < 1)
+ goto out;
+ nr_pages = min(len, nr_pages);
+ if (ti->type->direct_access) {
+ ret = ti->type->direct_access(ti, sector, kaddr, pfn,
+ nr_pages * PAGE_SIZE);
+ /*
+ * FIXME: convert ti->type->direct_access to return
+ * nr_pages directly.
+ */
+ if (ret >= 0)
+ ret /= PAGE_SIZE;
+ }
+ out:
dm_put_live_table(md, srcu_idx);
- return min(ret, size);
+
+ return ret;
+}
+
+static long dm_blk_direct_access(struct block_device *bdev, sector_t sector,
+ void **kaddr, pfn_t *pfn, long size)
+{
+ struct mapped_device *md = bdev->bd_disk->private_data;
+ struct dax_device *dax_dev = md->dax_dev;
+ long nr_pages = size / PAGE_SIZE;
+
+ nr_pages = dm_dax_direct_access(dax_dev, sector / PAGE_SECTORS,
+ nr_pages, kaddr, pfn);
+ return nr_pages < 0 ? nr_pages : nr_pages * PAGE_SIZE;
}
/*
@@ -1437,6 +1475,7 @@ static int next_free_minor(int *minor)
}
static const struct block_device_operations dm_blk_dops;
+static const struct dax_operations dm_dax_ops;
static void dm_wq_work(struct work_struct *work);
@@ -1483,6 +1522,12 @@ static void cleanup_mapped_device(struct mapped_device *md)
if (md->bs)
bioset_free(md->bs);
+ if (md->dax_dev) {
+ kill_dax(md->dax_dev);
+ put_dax(md->dax_dev);
+ md->dax_dev = NULL;
+ }
+
if (md->disk) {
spin_lock(&_minor_lock);
md->disk->private_data = NULL;
@@ -1510,6 +1555,7 @@ static void cleanup_mapped_device(struct mapped_device *md)
static struct mapped_device *alloc_dev(int minor)
{
int r, numa_node_id = dm_get_numa_node();
+ struct dax_device *dax_dev;
struct mapped_device *md;
void *old_md;
@@ -1574,6 +1620,12 @@ static struct mapped_device *alloc_dev(int minor)
md->disk->queue = md->queue;
md->disk->private_data = md;
sprintf(md->disk->disk_name, "dm-%d", minor);
+
+ dax_dev = alloc_dax(md, md->disk->disk_name, &dm_dax_ops);
+ if (!dax_dev)
+ goto bad;
+ md->dax_dev = dax_dev;
+
add_disk(md->disk);
format_dev_t(md->name, MKDEV(_major, minor));
@@ -2781,6 +2833,10 @@ static const struct block_device_operations dm_blk_dops = {
.owner = THIS_MODULE
};
+static const struct dax_operations dm_dax_ops = {
+ .direct_access = dm_dax_direct_access,
+};
+
/*
* module hooks
*/
diff --git a/include/linux/device-mapper.h b/include/linux/device-mapper.h
index a7e6903866fd..bcba4d89089c 100644
--- a/include/linux/device-mapper.h
+++ b/include/linux/device-mapper.h
@@ -130,6 +130,7 @@ typedef int (*dm_busy_fn) (struct dm_target *ti);
*/
typedef long (*dm_direct_access_fn) (struct dm_target *ti, sector_t sector,
void **kaddr, pfn_t *pfn, long size);
+#define PAGE_SECTORS (PAGE_SIZE / 512)
void dm_error(const char *message);
Now that a dax_device is plumbed through all dax-capable drivers we can
switch from block_device_operations to dax_operations for invoking
->direct_access.
This also lets us kill off some usages of struct blk_dax_ctl on the way
to its eventual removal.
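
Each converted call site first translates a partition-relative sector
into a device page offset with bdev_dax_pgoff() and only then calls
dax_direct_access(). The arithmetic can be checked with a user-space
sketch. The names below are hypothetical, a 4K page size is assumed, and
the partition start is passed in directly instead of being read from a
block_device.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PGOFF_SKETCH_PAGE_SIZE	4096ULL	/* assumed; the kernel uses PAGE_SIZE */
#define PGOFF_SKETCH_EINVAL	22

/*
 * Model of bdev_dax_pgoff(): convert (partition start + sector) to a
 * page offset and reject requests that are not page aligned.
 */
static int sketch_bdev_dax_pgoff(uint64_t part_start_sect, uint64_t sector,
		size_t size, uint64_t *pgoff)
{
	uint64_t phys_off = (part_start_sect + sector) * 512;

	/* like the patch, store the page offset before validating */
	if (pgoff)
		*pgoff = phys_off / PGOFF_SKETCH_PAGE_SIZE;
	if (phys_off % PGOFF_SKETCH_PAGE_SIZE || size % PGOFF_SKETCH_PAGE_SIZE)
		return -PGOFF_SKETCH_EINVAL;
	return 0;
}

/* Usage: a partition that starts at sector 2048 (1MiB aligned). */
static void sketch_pgoff_usage(void)
{
	uint64_t pgoff = 0;

	/* (2048 + 8) * 512 = 1052672 bytes = page 257 */
	assert(sketch_bdev_dax_pgoff(2048, 8, 4096, &pgoff) == 0);
	assert(pgoff == 257);
	/* a partition start that is off by one sector is rejected */
	assert(sketch_bdev_dax_pgoff(2049, 8, 4096, &pgoff)
			== -PGOFF_SKETCH_EINVAL);
}
```

Note that a sub-page size is rejected as well, so callers that operate
on part of a page need to pass a page-aligned size to the translation
step.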
Suggested-by: Christoph Hellwig <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
fs/dax.c | 277 +++++++++++++++++++++++++++++----------------------
fs/iomap.c | 3 -
include/linux/dax.h | 6 +
3 files changed, 162 insertions(+), 124 deletions(-)
diff --git a/fs/dax.c b/fs/dax.c
index b78a6947c4f5..ce9dc9c3e829 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -55,32 +55,6 @@ static int __init init_dax_wait_table(void)
}
fs_initcall(init_dax_wait_table);
-static long dax_map_atomic(struct block_device *bdev, struct blk_dax_ctl *dax)
-{
- struct request_queue *q = bdev->bd_queue;
- long rc = -EIO;
-
- dax->addr = ERR_PTR(-EIO);
- if (blk_queue_enter(q, true) != 0)
- return rc;
-
- rc = bdev_direct_access(bdev, dax);
- if (rc < 0) {
- dax->addr = ERR_PTR(rc);
- blk_queue_exit(q);
- return rc;
- }
- return rc;
-}
-
-static void dax_unmap_atomic(struct block_device *bdev,
- const struct blk_dax_ctl *dax)
-{
- if (IS_ERR(dax->addr))
- return;
- blk_queue_exit(bdev->bd_queue);
-}
-
static int dax_is_pmd_entry(void *entry)
{
return (unsigned long)entry & RADIX_DAX_PMD;
@@ -553,21 +527,30 @@ static int dax_load_hole(struct address_space *mapping, void **entry,
return ret;
}
-static int copy_user_dax(struct block_device *bdev, sector_t sector, size_t size,
- struct page *to, unsigned long vaddr)
+static int copy_user_dax(struct block_device *bdev, struct dax_device *dax_dev,
+ sector_t sector, size_t size, struct page *to,
+ unsigned long vaddr)
{
- struct blk_dax_ctl dax = {
- .sector = sector,
- .size = size,
- };
- void *vto;
-
- if (dax_map_atomic(bdev, &dax) < 0)
- return PTR_ERR(dax.addr);
+ void *vto, *kaddr;
+ pgoff_t pgoff;
+ pfn_t pfn;
+ long rc;
+ int id;
+
+ rc = bdev_dax_pgoff(bdev, sector, size, &pgoff);
+ if (rc)
+ return rc;
+
+ id = dax_read_lock();
+ rc = dax_direct_access(dax_dev, pgoff, PHYS_PFN(size), &kaddr, &pfn);
+ if (rc < 0) {
+ dax_read_unlock(id);
+ return rc;
+ }
vto = kmap_atomic(to);
- copy_user_page(vto, (void __force *)dax.addr, vaddr, to);
+ copy_user_page(vto, (void __force *)kaddr, vaddr, to);
kunmap_atomic(vto);
- dax_unmap_atomic(bdev, &dax);
+ dax_read_unlock(id);
return 0;
}
@@ -735,12 +718,16 @@ static void dax_mapping_entry_mkclean(struct address_space *mapping,
}
static int dax_writeback_one(struct block_device *bdev,
- struct address_space *mapping, pgoff_t index, void *entry)
+ struct dax_device *dax_dev, struct address_space *mapping,
+ pgoff_t index, void *entry)
{
struct radix_tree_root *page_tree = &mapping->page_tree;
- struct blk_dax_ctl dax;
- void *entry2, **slot;
- int ret = 0;
+ void *entry2, **slot, *kaddr;
+ long ret = 0, id;
+ sector_t sector;
+ pgoff_t pgoff;
+ size_t size;
+ pfn_t pfn;
/*
* A page got tagged dirty in DAX mapping? Something is seriously
@@ -789,26 +776,29 @@ static int dax_writeback_one(struct block_device *bdev,
* 'entry'. This allows us to flush for PMD_SIZE and not have to
* worry about partial PMD writebacks.
*/
- dax.sector = dax_radix_sector(entry);
- dax.size = PAGE_SIZE << dax_radix_order(entry);
+ sector = dax_radix_sector(entry);
+ size = PAGE_SIZE << dax_radix_order(entry);
+
+ id = dax_read_lock();
+ ret = bdev_dax_pgoff(bdev, sector, size, &pgoff);
+ if (ret)
+ goto dax_unlock;
/*
- * We cannot hold tree_lock while calling dax_map_atomic() because it
- * eventually calls cond_resched().
+ * dax_direct_access() may sleep, so we cannot hold tree_lock over
+ * its invocation.
*/
- ret = dax_map_atomic(bdev, &dax);
- if (ret < 0) {
- put_locked_mapping_entry(mapping, index, entry);
- return ret;
- }
+ ret = dax_direct_access(dax_dev, pgoff, size / PAGE_SIZE, &kaddr, &pfn);
+ if (ret < 0)
+ goto dax_unlock;
- if (WARN_ON_ONCE(ret < dax.size)) {
+ if (WARN_ON_ONCE(ret < size / PAGE_SIZE)) {
ret = -EIO;
- goto unmap;
+ goto dax_unlock;
}
- dax_mapping_entry_mkclean(mapping, index, pfn_t_to_pfn(dax.pfn));
- wb_cache_pmem(dax.addr, dax.size);
+ dax_mapping_entry_mkclean(mapping, index, pfn_t_to_pfn(pfn));
+ wb_cache_pmem(kaddr, size);
/*
* After we have flushed the cache, we can clear the dirty tag. There
* cannot be new dirty data in the pfn after the flush has completed as
@@ -818,8 +808,8 @@ static int dax_writeback_one(struct block_device *bdev,
spin_lock_irq(&mapping->tree_lock);
radix_tree_tag_clear(page_tree, index, PAGECACHE_TAG_DIRTY);
spin_unlock_irq(&mapping->tree_lock);
- unmap:
- dax_unmap_atomic(bdev, &dax);
+ dax_unlock:
+ dax_read_unlock(id);
put_locked_mapping_entry(mapping, index, entry);
return ret;
@@ -840,6 +830,7 @@ int dax_writeback_mapping_range(struct address_space *mapping,
struct inode *inode = mapping->host;
pgoff_t start_index, end_index;
pgoff_t indices[PAGEVEC_SIZE];
+ struct dax_device *dax_dev;
struct pagevec pvec;
bool done = false;
int i, ret = 0;
@@ -850,6 +841,10 @@ int dax_writeback_mapping_range(struct address_space *mapping,
if (!mapping->nrexceptional || wbc->sync_mode != WB_SYNC_ALL)
return 0;
+ dax_dev = dax_get_by_host(bdev->bd_disk->disk_name);
+ if (!dax_dev)
+ return -EIO;
+
start_index = wbc->range_start >> PAGE_SHIFT;
end_index = wbc->range_end >> PAGE_SHIFT;
@@ -870,38 +865,49 @@ int dax_writeback_mapping_range(struct address_space *mapping,
break;
}
- ret = dax_writeback_one(bdev, mapping, indices[i],
- pvec.pages[i]);
- if (ret < 0)
+ ret = dax_writeback_one(bdev, dax_dev, mapping,
+ indices[i], pvec.pages[i]);
+ if (ret < 0) {
+ put_dax(dax_dev);
return ret;
+ }
}
}
+ put_dax(dax_dev);
return 0;
}
EXPORT_SYMBOL_GPL(dax_writeback_mapping_range);
static int dax_insert_mapping(struct address_space *mapping,
- struct block_device *bdev, sector_t sector, size_t size,
- void **entryp, struct vm_area_struct *vma, struct vm_fault *vmf)
+ struct block_device *bdev, struct dax_device *dax_dev,
+ sector_t sector, size_t size, void **entryp,
+ struct vm_area_struct *vma, struct vm_fault *vmf)
{
unsigned long vaddr = vmf->address;
- struct blk_dax_ctl dax = {
- .sector = sector,
- .size = size,
- };
- void *ret;
void *entry = *entryp;
+ void *ret, *kaddr;
+ pgoff_t pgoff;
+ int id, rc;
+ pfn_t pfn;
- if (dax_map_atomic(bdev, &dax) < 0)
- return PTR_ERR(dax.addr);
- dax_unmap_atomic(bdev, &dax);
+ rc = bdev_dax_pgoff(bdev, sector, size, &pgoff);
+ if (rc)
+ return rc;
- ret = dax_insert_mapping_entry(mapping, vmf, entry, dax.sector, 0);
+ id = dax_read_lock();
+ rc = dax_direct_access(dax_dev, pgoff, PHYS_PFN(size), &kaddr, &pfn);
+ if (rc < 0) {
+ dax_read_unlock(id);
+ return rc;
+ }
+ dax_read_unlock(id);
+
+ ret = dax_insert_mapping_entry(mapping, vmf, entry, sector, 0);
if (IS_ERR(ret))
return PTR_ERR(ret);
*entryp = ret;
- return vm_insert_mixed(vma, vaddr, dax.pfn);
+ return vm_insert_mixed(vma, vaddr, pfn);
}
/**
@@ -950,24 +956,34 @@ static bool dax_range_is_aligned(struct block_device *bdev,
return true;
}
-int __dax_zero_page_range(struct block_device *bdev, sector_t sector,
- unsigned int offset, unsigned int length)
+int __dax_zero_page_range(struct block_device *bdev,
+ struct dax_device *dax_dev, sector_t sector,
+ unsigned int offset, unsigned int size)
{
- struct blk_dax_ctl dax = {
- .sector = sector,
- .size = PAGE_SIZE,
- };
-
- if (dax_range_is_aligned(bdev, offset, length)) {
- sector_t start_sector = dax.sector + (offset >> 9);
+ if (dax_range_is_aligned(bdev, offset, size)) {
+ sector_t start_sector = sector + (offset >> 9);
return blkdev_issue_zeroout(bdev, start_sector,
- length >> 9, GFP_NOFS, true);
+ size >> 9, GFP_NOFS, true);
} else {
- if (dax_map_atomic(bdev, &dax) < 0)
- return PTR_ERR(dax.addr);
- clear_pmem(dax.addr + offset, length);
- dax_unmap_atomic(bdev, &dax);
+ pgoff_t pgoff;
+ long rc, id;
+ void *kaddr;
+ pfn_t pfn;
+
+ rc = bdev_dax_pgoff(bdev, sector, PAGE_SIZE, &pgoff);
+ if (rc)
+ return rc;
+
+ id = dax_read_lock();
+ rc = dax_direct_access(dax_dev, pgoff, 1, &kaddr,
+ &pfn);
+ if (rc < 0) {
+ dax_read_unlock(id);
+ return rc;
+ }
+ clear_pmem(kaddr + offset, size);
+ dax_read_unlock(id);
}
return 0;
}
@@ -982,9 +998,12 @@ static loff_t
dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
struct iomap *iomap)
{
+ struct block_device *bdev = iomap->bdev;
+ struct dax_device *dax_dev = iomap->dax_dev;
struct iov_iter *iter = data;
loff_t end = pos + length, done = 0;
ssize_t ret = 0;
+ int id;
if (iov_iter_rw(iter) == READ) {
end = min(end, i_size_read(inode));
@@ -1009,34 +1028,42 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
(end - 1) >> PAGE_SHIFT);
}
+ id = dax_read_lock();
while (pos < end) {
unsigned offset = pos & (PAGE_SIZE - 1);
- struct blk_dax_ctl dax = { 0 };
+ const size_t size = ALIGN(length + offset, PAGE_SIZE);
+ const sector_t sector = dax_iomap_sector(iomap, pos);
ssize_t map_len;
+ pgoff_t pgoff;
+ void *kaddr;
+ pfn_t pfn;
if (fatal_signal_pending(current)) {
ret = -EINTR;
break;
}
- dax.sector = dax_iomap_sector(iomap, pos);
- dax.size = (length + offset + PAGE_SIZE - 1) & PAGE_MASK;
- map_len = dax_map_atomic(iomap->bdev, &dax);
+ ret = bdev_dax_pgoff(bdev, sector, size, &pgoff);
+ if (ret)
+ break;
+
+ map_len = dax_direct_access(dax_dev, pgoff, PHYS_PFN(size),
+ &kaddr, &pfn);
if (map_len < 0) {
ret = map_len;
break;
}
- dax.addr += offset;
+ map_len = PFN_PHYS(map_len);
+ kaddr += offset;
map_len -= offset;
if (map_len > end - pos)
map_len = end - pos;
if (iov_iter_rw(iter) == WRITE)
- map_len = copy_from_iter_pmem(dax.addr, map_len, iter);
+ map_len = copy_from_iter_pmem(kaddr, map_len, iter);
else
- map_len = copy_to_iter(dax.addr, map_len, iter);
- dax_unmap_atomic(iomap->bdev, &dax);
+ map_len = copy_to_iter(kaddr, map_len, iter);
if (map_len <= 0) {
ret = map_len ? map_len : -EFAULT;
break;
@@ -1046,6 +1073,7 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
length -= map_len;
done += map_len;
}
+ dax_read_unlock(id);
return done ? done : ret;
}
@@ -1152,8 +1180,8 @@ static int dax_iomap_pte_fault(struct vm_fault *vmf,
clear_user_highpage(vmf->cow_page, vaddr);
break;
case IOMAP_MAPPED:
- error = copy_user_dax(iomap.bdev, sector, PAGE_SIZE,
- vmf->cow_page, vaddr);
+ error = copy_user_dax(iomap.bdev, iomap.dax_dev,
+ sector, PAGE_SIZE, vmf->cow_page, vaddr);
break;
default:
WARN_ON_ONCE(1);
@@ -1178,8 +1206,8 @@ static int dax_iomap_pte_fault(struct vm_fault *vmf,
mem_cgroup_count_vm_event(vmf->vma->vm_mm, PGMAJFAULT);
major = VM_FAULT_MAJOR;
}
- error = dax_insert_mapping(mapping, iomap.bdev, sector,
- PAGE_SIZE, &entry, vmf->vma, vmf);
+ error = dax_insert_mapping(mapping, iomap.bdev, iomap.dax_dev,
+ sector, PAGE_SIZE, &entry, vmf->vma, vmf);
/* -EBUSY is fine, somebody else faulted on the same PTE */
if (error == -EBUSY)
error = 0;
@@ -1229,41 +1257,48 @@ static int dax_pmd_insert_mapping(struct vm_fault *vmf, struct iomap *iomap,
loff_t pos, void **entryp)
{
struct address_space *mapping = vmf->vma->vm_file->f_mapping;
+ const sector_t sector = dax_iomap_sector(iomap, pos);
+ struct dax_device *dax_dev = iomap->dax_dev;
struct block_device *bdev = iomap->bdev;
struct inode *inode = mapping->host;
- struct blk_dax_ctl dax = {
- .sector = dax_iomap_sector(iomap, pos),
- .size = PMD_SIZE,
- };
- long length = dax_map_atomic(bdev, &dax);
- void *ret = NULL;
-
- if (length < 0) /* dax_map_atomic() failed */
+ const size_t size = PMD_SIZE;
+ void *ret = NULL, *kaddr;
+ long length = 0;
+ pgoff_t pgoff;
+ pfn_t pfn;
+ int id;
+
+ if (bdev_dax_pgoff(bdev, sector, size, &pgoff) != 0)
goto fallback;
- if (length < PMD_SIZE)
- goto unmap_fallback;
- if (pfn_t_to_pfn(dax.pfn) & PG_PMD_COLOUR)
- goto unmap_fallback;
- if (!pfn_t_devmap(dax.pfn))
- goto unmap_fallback;
-
- dax_unmap_atomic(bdev, &dax);
- ret = dax_insert_mapping_entry(mapping, vmf, *entryp, dax.sector,
+ id = dax_read_lock();
+ length = dax_direct_access(dax_dev, pgoff, PHYS_PFN(size), &kaddr, &pfn);
+ if (length < 0)
+ goto unlock_fallback;
+ length = PFN_PHYS(length);
+
+ if (length < size)
+ goto unlock_fallback;
+ if (pfn_t_to_pfn(pfn) & PG_PMD_COLOUR)
+ goto unlock_fallback;
+ if (!pfn_t_devmap(pfn))
+ goto unlock_fallback;
+ dax_read_unlock(id);
+
+ ret = dax_insert_mapping_entry(mapping, vmf, *entryp, sector,
RADIX_DAX_PMD);
if (IS_ERR(ret))
goto fallback;
*entryp = ret;
- trace_dax_pmd_insert_mapping(inode, vmf, length, dax.pfn, ret);
+ trace_dax_pmd_insert_mapping(inode, vmf, length, pfn, ret);
return vmf_insert_pfn_pmd(vmf->vma, vmf->address, vmf->pmd,
- dax.pfn, vmf->flags & FAULT_FLAG_WRITE);
+ pfn, vmf->flags & FAULT_FLAG_WRITE);
- unmap_fallback:
- dax_unmap_atomic(bdev, &dax);
+unlock_fallback:
+ dax_read_unlock(id);
fallback:
- trace_dax_pmd_insert_mapping_fallback(inode, vmf, length,
- dax.pfn, ret);
+ trace_dax_pmd_insert_mapping_fallback(inode, vmf, length, pfn, ret);
return VM_FAULT_FALLBACK;
}
diff --git a/fs/iomap.c b/fs/iomap.c
index 141c3cd55a8b..4ec0d6ac8bc1 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -360,7 +360,8 @@ static int iomap_dax_zero(loff_t pos, unsigned offset, unsigned bytes,
sector_t sector = iomap->blkno +
(((pos & ~(PAGE_SIZE - 1)) - iomap->offset) >> 9);
- return __dax_zero_page_range(iomap->bdev, sector, offset, bytes);
+ return __dax_zero_page_range(iomap->bdev, iomap->dax_dev, sector,
+ offset, bytes);
}
static loff_t
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 0d0d890f9186..d3158e74a59e 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -70,11 +70,13 @@ void dax_wake_mapping_entry_waiter(struct address_space *mapping,
pgoff_t index, void *entry, bool wake_all);
#ifdef CONFIG_FS_DAX
-int __dax_zero_page_range(struct block_device *bdev, sector_t sector,
+int __dax_zero_page_range(struct block_device *bdev,
+ struct dax_device *dax_dev, sector_t sector,
unsigned int offset, unsigned int length);
#else
static inline int __dax_zero_page_range(struct block_device *bdev,
- sector_t sector, unsigned int offset, unsigned int length)
+ struct dax_device *dax_dev, sector_t sector,
+ unsigned int offset, unsigned int length)
{
return -ENXIO;
}
In preparation for converting fs/dax.c to use dax_direct_access()
instead of bdev_direct_access(), add the plumbing to retrieve the
dax_device associated with a given block_device.
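
The lifetime rule this plumbing relies on is that ->iomap_begin() takes
a reference via dax_get_by_host() and ->iomap_end() drops it with
put_dax(), including when no dax_device was found. A user-space sketch
of that pairing follows. All sketch_* names are hypothetical, the
registry is a fixed table instead of the real host lookup, and it
assumes put_dax() tolerates a NULL argument, as the unconditional calls
in this patch imply.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct sketch_dax_device {
	const char *host;
	int refcount;
};

/* stand-in for the kernel's lookup of dax devices by host name */
static struct sketch_dax_device sketch_registry[] = {
	{ "pmem0", 0 },
	{ "dm-0", 0 },
};

/* returns a referenced device, or NULL when the host is unknown */
static struct sketch_dax_device *sketch_dax_get_by_host(const char *host)
{
	size_t i;

	for (i = 0; i < sizeof(sketch_registry) / sizeof(sketch_registry[0]); i++) {
		if (strcmp(sketch_registry[i].host, host) == 0) {
			sketch_registry[i].refcount++;
			return &sketch_registry[i];
		}
	}
	return NULL;
}

/* drops a reference; a NULL device is a no-op (assumption, see above) */
static void sketch_put_dax(struct sketch_dax_device *dev)
{
	if (!dev)
		return;
	assert(dev->refcount > 0);
	dev->refcount--;
}

struct sketch_iomap {
	struct sketch_dax_device *dax_dev;
};

/* mirrors the blk_queue_dax() test added to each ->iomap_begin() */
static void sketch_iomap_begin(struct sketch_iomap *iomap,
		const char *disk_name, int queue_is_dax)
{
	iomap->dax_dev = queue_is_dax ?
		sketch_dax_get_by_host(disk_name) : NULL;
}

/* mirrors the unconditional put_dax() added to each ->iomap_end() */
static void sketch_iomap_end(struct sketch_iomap *iomap)
{
	sketch_put_dax(iomap->dax_dev);
	iomap->dax_dev = NULL;
}
```

Every begin must be matched by exactly one end, otherwise the reference
count on the dax_device never returns to its resting value.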
Signed-off-by: Dan Williams <[email protected]>
---
fs/ext2/inode.c | 9 ++++++++-
fs/ext4/inode.c | 9 ++++++++-
fs/xfs/xfs_iomap.c | 10 ++++++++++
include/linux/iomap.h | 1 +
4 files changed, 27 insertions(+), 2 deletions(-)
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 128cce540645..4c9d2d44e879 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -799,6 +799,7 @@ int ext2_get_block(struct inode *inode, sector_t iblock,
static int ext2_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
unsigned flags, struct iomap *iomap)
{
+ struct block_device *bdev;
unsigned int blkbits = inode->i_blkbits;
unsigned long first_block = offset >> blkbits;
unsigned long max_blocks = (length + (1 << blkbits) - 1) >> blkbits;
@@ -812,8 +813,13 @@ static int ext2_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
return ret;
iomap->flags = 0;
- iomap->bdev = inode->i_sb->s_bdev;
+ bdev = inode->i_sb->s_bdev;
+ iomap->bdev = bdev;
iomap->offset = (u64)first_block << blkbits;
+ if (blk_queue_dax(bdev->bd_queue))
+ iomap->dax_dev = dax_get_by_host(bdev->bd_disk->disk_name);
+ else
+ iomap->dax_dev = NULL;
if (ret == 0) {
iomap->type = IOMAP_HOLE;
@@ -835,6 +841,7 @@ static int
ext2_iomap_end(struct inode *inode, loff_t offset, loff_t length,
ssize_t written, unsigned flags, struct iomap *iomap)
{
+ put_dax(iomap->dax_dev);
if (iomap->type == IOMAP_MAPPED &&
written < length &&
(flags & IOMAP_WRITE))
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 4247d8d25687..2cb2634daa99 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3305,6 +3305,7 @@ static int ext4_releasepage(struct page *page, gfp_t wait)
static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
unsigned flags, struct iomap *iomap)
{
+ struct block_device *bdev;
unsigned int blkbits = inode->i_blkbits;
unsigned long first_block = offset >> blkbits;
unsigned long last_block = (offset + length - 1) >> blkbits;
@@ -3373,7 +3374,12 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
}
iomap->flags = 0;
- iomap->bdev = inode->i_sb->s_bdev;
+ bdev = inode->i_sb->s_bdev;
+ iomap->bdev = bdev;
+ if (blk_queue_dax(bdev->bd_queue))
+ iomap->dax_dev = dax_get_by_host(bdev->bd_disk->disk_name);
+ else
+ iomap->dax_dev = NULL;
iomap->offset = first_block << blkbits;
if (ret == 0) {
@@ -3406,6 +3412,7 @@ static int ext4_iomap_end(struct inode *inode, loff_t offset, loff_t length,
int blkbits = inode->i_blkbits;
bool truncate = false;
+ put_dax(iomap->dax_dev);
if (!(flags & IOMAP_WRITE) || (flags & IOMAP_FAULT))
return 0;
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 288ee5b840d7..4b47403f8089 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -976,6 +976,7 @@ xfs_file_iomap_begin(
int nimaps = 1, error = 0;
bool shared = false, trimmed = false;
unsigned lockmode;
+ struct block_device *bdev;
if (XFS_FORCED_SHUTDOWN(mp))
return -EIO;
@@ -1063,6 +1064,14 @@ xfs_file_iomap_begin(
}
xfs_bmbt_to_iomap(ip, iomap, &imap);
+
+ /* optionally associate a dax device with the iomap bdev */
+ bdev = iomap->bdev;
+ if (blk_queue_dax(bdev->bd_queue))
+ iomap->dax_dev = dax_get_by_host(bdev->bd_disk->disk_name);
+ else
+ iomap->dax_dev = NULL;
+
if (shared)
iomap->flags |= IOMAP_F_SHARED;
return 0;
@@ -1140,6 +1149,7 @@ xfs_file_iomap_end(
unsigned flags,
struct iomap *iomap)
{
+ put_dax(iomap->dax_dev);
if ((flags & IOMAP_WRITE) && iomap->type == IOMAP_DELALLOC)
return xfs_file_iomap_end_delalloc(XFS_I(inode), offset,
length, written, iomap);
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 7291810067eb..f753e788da31 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -41,6 +41,7 @@ struct iomap {
u16 type; /* type of mapping */
u16 flags; /* flags for mapping */
struct block_device *bdev; /* block device for I/O */
+ struct dax_device *dax_dev; /* dax_dev for dax operations */
};
/*
Now that all the producers and consumers of dax interfaces have been
converted to using dax_operations on a dax_device, remove the block
device direct_access enabling.
Signed-off-by: Dan Williams <[email protected]>
---
arch/powerpc/sysdev/axonram.c | 23 ++++-----------------
drivers/block/brd.c | 15 --------------
drivers/md/dm.c | 13 ------------
drivers/nvdimm/pmem.c | 10 ---------
drivers/s390/block/dcssblk.c | 16 ---------------
fs/block_dev.c | 45 -----------------------------------------
include/linux/blkdev.h | 17 ---------------
7 files changed, 4 insertions(+), 135 deletions(-)
diff --git a/arch/powerpc/sysdev/axonram.c b/arch/powerpc/sysdev/axonram.c
index ad857d5e81b1..83eb56ff1d2c 100644
--- a/arch/powerpc/sysdev/axonram.c
+++ b/arch/powerpc/sysdev/axonram.c
@@ -139,6 +139,10 @@ axon_ram_make_request(struct request_queue *queue, struct bio *bio)
return BLK_QC_T_NONE;
}
+static const struct block_device_operations axon_ram_devops = {
+ .owner = THIS_MODULE,
+};
+
static long
__axon_ram_direct_access(struct axon_ram_bank *bank, pgoff_t pgoff, long nr_pages,
void **kaddr, pfn_t *pfn)
@@ -150,25 +154,6 @@ __axon_ram_direct_access(struct axon_ram_bank *bank, pgoff_t pgoff, long nr_page
return (bank->size - offset) / PAGE_SIZE;
}
-/**
- * axon_ram_direct_access - direct_access() method for block device
- * @device, @sector, @data: see block_device_operations method
- */
-static long
-axon_ram_blk_direct_access(struct block_device *device, sector_t sector,
- void **kaddr, pfn_t *pfn, long size)
-{
- struct axon_ram_bank *bank = device->bd_disk->private_data;
-
- return __axon_ram_direct_access(bank, (sector * 512) / PAGE_SIZE,
- size / PAGE_SIZE, kaddr, pfn) * PAGE_SIZE;
-}
-
-static const struct block_device_operations axon_ram_devops = {
- .owner = THIS_MODULE,
- .direct_access = axon_ram_blk_direct_access
-};
-
static long
axon_ram_dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff, long nr_pages,
void **kaddr, pfn_t *pfn)
diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index 60f3193c9ce2..bfa4ed2c75ef 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -395,18 +395,6 @@ static long __brd_direct_access(struct brd_device *brd, pgoff_t pgoff,
return 1;
}
-static long brd_blk_direct_access(struct block_device *bdev, sector_t sector,
- void **kaddr, pfn_t *pfn, long size)
-{
- struct brd_device *brd = bdev->bd_disk->private_data;
- long nr_pages = __brd_direct_access(brd, PHYS_PFN(sector * 512),
- PHYS_PFN(size), kaddr, pfn);
-
- if (nr_pages < 0)
- return nr_pages;
- return nr_pages * PAGE_SIZE;
-}
-
static long brd_dax_direct_access(struct dax_device *dax_dev,
pgoff_t pgoff, long nr_pages, void **kaddr, pfn_t *pfn)
{
@@ -418,14 +406,11 @@ static long brd_dax_direct_access(struct dax_device *dax_dev,
static const struct dax_operations brd_dax_ops = {
.direct_access = brd_dax_direct_access,
};
-#else
-#define brd_blk_direct_access NULL
#endif
static const struct block_device_operations brd_fops = {
.owner = THIS_MODULE,
.rw_page = brd_rw_page,
- .direct_access = brd_blk_direct_access,
};
/*
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index ef4c6f8cad47..79d5f5fd823e 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -957,18 +957,6 @@ static long dm_dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
return ret;
}
-static long dm_blk_direct_access(struct block_device *bdev, sector_t sector,
- void **kaddr, pfn_t *pfn, long size)
-{
- struct mapped_device *md = bdev->bd_disk->private_data;
- struct dax_device *dax_dev = md->dax_dev;
- long nr_pages = size / PAGE_SIZE;
-
- nr_pages = dm_dax_direct_access(dax_dev, sector / PAGE_SECTORS,
- nr_pages, kaddr, pfn);
- return nr_pages < 0 ? nr_pages : nr_pages * PAGE_SIZE;
-}
-
/*
* A target may call dm_accept_partial_bio only from the map routine. It is
* allowed for all bio types except REQ_PREFLUSH.
@@ -2823,7 +2811,6 @@ static const struct block_device_operations dm_blk_dops = {
.open = dm_blk_open,
.release = dm_blk_close,
.ioctl = dm_blk_ioctl,
- .direct_access = dm_blk_direct_access,
.getgeo = dm_blk_getgeo,
.pr_ops = &dm_pr_ops,
.owner = THIS_MODULE
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index fbbcf8154eec..85b85633d674 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -220,19 +220,9 @@ __weak long __pmem_direct_access(struct pmem_device *pmem, pgoff_t pgoff,
return PHYS_PFN(pmem->size - pmem->pfn_pad - offset);
}
-static long pmem_blk_direct_access(struct block_device *bdev, sector_t sector,
- void **kaddr, pfn_t *pfn, long size)
-{
- struct pmem_device *pmem = bdev->bd_queue->queuedata;
-
- return __pmem_direct_access(pmem, PHYS_PFN(sector * 512),
- PHYS_PFN(size), kaddr, pfn);
-}
-
static const struct block_device_operations pmem_fops = {
.owner = THIS_MODULE,
.rw_page = pmem_rw_page,
- .direct_access = pmem_blk_direct_access,
.revalidate_disk = nvdimm_revalidate_disk,
};
diff --git a/drivers/s390/block/dcssblk.c b/drivers/s390/block/dcssblk.c
index 682a9eb4934d..1586e11b979e 100644
--- a/drivers/s390/block/dcssblk.c
+++ b/drivers/s390/block/dcssblk.c
@@ -31,8 +31,6 @@ static int dcssblk_open(struct block_device *bdev, fmode_t mode);
static void dcssblk_release(struct gendisk *disk, fmode_t mode);
static blk_qc_t dcssblk_make_request(struct request_queue *q,
struct bio *bio);
-static long dcssblk_blk_direct_access(struct block_device *bdev, sector_t secnum,
- void **kaddr, pfn_t *pfn, long size);
static long dcssblk_dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
long nr_pages, void **kaddr, pfn_t *pfn);
@@ -43,7 +41,6 @@ static const struct block_device_operations dcssblk_devops = {
.owner = THIS_MODULE,
.open = dcssblk_open,
.release = dcssblk_release,
- .direct_access = dcssblk_blk_direct_access,
};
static const struct dax_operations dcssblk_dax_ops = {
@@ -915,19 +912,6 @@ __dcssblk_direct_access(struct dcssblk_dev_info *dev_info, pgoff_t pgoff,
}
static long
-dcssblk_blk_direct_access(struct block_device *bdev, sector_t secnum,
- void **kaddr, pfn_t *pfn, long size)
-{
- struct dcssblk_dev_info *dev_info;
-
- dev_info = bdev->bd_disk->private_data;
- if (!dev_info)
- return -ENODEV;
- return __dcssblk_direct_access(dev_info, PHYS_PFN(secnum * 512),
- PHYS_PFN(size), kaddr, pfn) * PAGE_SIZE;
-}
-
-static long
dcssblk_dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
long nr_pages, void **kaddr, pfn_t *pfn)
{
diff --git a/fs/block_dev.c b/fs/block_dev.c
index ecbdc8f9f718..10e21465d5a9 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -718,51 +718,6 @@ int bdev_write_page(struct block_device *bdev, sector_t sector,
}
EXPORT_SYMBOL_GPL(bdev_write_page);
-/**
- * bdev_direct_access() - Get the address for directly-accessibly memory
- * @bdev: The device containing the memory
- * @dax: control and output parameters for ->direct_access
- *
- * If a block device is made up of directly addressable memory, this function
- * will tell the caller the PFN and the address of the memory. The address
- * may be directly dereferenced within the kernel without the need to call
- * ioremap(), kmap() or similar. The PFN is suitable for inserting into
- * page tables.
- *
- * Return: negative errno if an error occurs, otherwise the number of bytes
- * accessible at this address.
- */
-long bdev_direct_access(struct block_device *bdev, struct blk_dax_ctl *dax)
-{
- sector_t sector = dax->sector;
- long avail, size = dax->size;
- const struct block_device_operations *ops = bdev->bd_disk->fops;
-
- /*
- * The device driver is allowed to sleep, in order to make the
- * memory directly accessible.
- */
- might_sleep();
-
- if (size < 0)
- return size;
- if (!blk_queue_dax(bdev_get_queue(bdev)) || !ops->direct_access)
- return -EOPNOTSUPP;
- if ((sector + DIV_ROUND_UP(size, 512)) >
- part_nr_sects_read(bdev->bd_part))
- return -ERANGE;
- sector += get_start_sect(bdev);
- if (sector % (PAGE_SIZE / 512))
- return -EINVAL;
- avail = ops->direct_access(bdev, sector, &dax->addr, &dax->pfn, size);
- if (!avail)
- return -ERANGE;
- if (avail > 0 && avail & ~PAGE_MASK)
- return -ENXIO;
- return min(avail, size);
-}
-EXPORT_SYMBOL_GPL(bdev_direct_access);
-
int bdev_dax_pgoff(struct block_device *bdev, sector_t sector, size_t size,
pgoff_t *pgoff)
{
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 612c497d1461..848f87eb1905 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1916,28 +1916,12 @@ static inline bool integrity_req_gap_front_merge(struct request *req,
#endif /* CONFIG_BLK_DEV_INTEGRITY */
-/**
- * struct blk_dax_ctl - control and output parameters for ->direct_access
- * @sector: (input) offset relative to a block_device
- * @addr: (output) kernel virtual address for @sector populated by driver
- * @pfn: (output) page frame number for @addr populated by driver
- * @size: (input) number of bytes requested
- */
-struct blk_dax_ctl {
- sector_t sector;
- void *addr;
- long size;
- pfn_t pfn;
-};
-
struct block_device_operations {
int (*open) (struct block_device *, fmode_t);
void (*release) (struct gendisk *, fmode_t);
int (*rw_page)(struct block_device *, sector_t, struct page *, bool);
int (*ioctl) (struct block_device *, fmode_t, unsigned, unsigned long);
int (*compat_ioctl) (struct block_device *, fmode_t, unsigned, unsigned long);
- long (*direct_access)(struct block_device *, sector_t, void **, pfn_t *,
- long);
unsigned int (*check_events) (struct gendisk *disk,
unsigned int clearing);
/* ->media_changed() is DEPRECATED, use ->check_events() instead */
@@ -1956,7 +1940,6 @@ extern int __blkdev_driver_ioctl(struct block_device *, fmode_t, unsigned int,
extern int bdev_read_page(struct block_device *, sector_t, struct page *);
extern int bdev_write_page(struct block_device *, sector_t, struct page *,
struct writeback_control *);
-extern long bdev_direct_access(struct block_device *, struct blk_dax_ctl *);
extern int bdev_dax_supported(struct super_block *, int);
int bdev_dax_pgoff(struct block_device *, sector_t, size_t, pgoff_t *pgoff);
#else /* CONFIG_BLOCK */
The direct-I/O write path for a pmem device must ensure that data is
flushed to a power-fail safe zone when the operation is complete.
However, other dax capable block devices, like brd, do not have this
requirement. Introduce a 'copy_from_iter' dax operation so that pmem
can inject cache management without imposing this overhead on other dax
capable block_device drivers.
This is also a first step toward moving all architecture-specific pmem
operations into the pmem driver.
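As a rough illustration of the cache management that pmem injects for
the iovec case, the flush decision can be modeled in userspace. This is
only a sketch: model_iovec_flushes() and the MODEL_* macros are
hypothetical names, the cache line size is passed in rather than read
from boot_cpu_data, and nothing is actually flushed. It only counts how
many single-line write-backs would be issued.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace stand-ins for IS_ALIGNED() and ALIGN(). */
#define MODEL_IS_ALIGNED(x, a)	(((x) & ((a) - 1)) == 0)
#define MODEL_ALIGN(x, a)	(((x) + ((a) - 1)) & ~((uintptr_t)(a) - 1))

/*
 * Count the write-backs issued after a non-temporal copy of 'bytes'
 * bytes to 'dest'. 'clsize' is the cache line size (a power of two).
 * A count of 0 means the whole transfer is assumed to have bypassed
 * the cache.
 */
static int model_iovec_flushes(uintptr_t dest, size_t bytes, size_t clsize)
{
	uintptr_t start = dest;
	size_t flushed;
	int n = 0;

	if (bytes < 8) {
		/* only an aligned 4-byte store avoids the cache */
		if (!MODEL_IS_ALIGNED(dest, 4) || bytes != 4)
			n++;
	} else {
		/* unaligned head: write back the first line */
		if (!MODEL_IS_ALIGNED(dest, 8)) {
			dest = MODEL_ALIGN(dest, clsize);
			n++;
		}

		/* unaligned tail: write back the last line */
		flushed = dest - start;
		if (bytes > flushed && !MODEL_IS_ALIGNED(bytes - flushed, 8))
			n++;
	}
	return n;
}
```

With a 64-byte line, an aligned 64-byte copy to 0x1000 needs no
write-back, while the same copy to 0x1004 needs one for the head and one
for the tail.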
Cc: Jan Kara <[email protected]>
Cc: Jeff Moyer <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Ross Zwisler <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
drivers/nvdimm/pmem.c | 43 +++++++++++++++++++++++++++++++++++++++++++
include/linux/dax.h | 3 +++
2 files changed, 46 insertions(+)
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 3b3dab73d741..e501df4ab4b4 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -220,6 +220,48 @@ __weak long __pmem_direct_access(struct pmem_device *pmem, pgoff_t pgoff,
return PHYS_PFN(pmem->size - pmem->pfn_pad - offset);
}
+static size_t pmem_copy_from_iter(struct dax_device *dax_dev, pgoff_t pgoff,
+ void *addr, size_t bytes, struct iov_iter *i)
+{
+ size_t len;
+
+ /* TODO: skip the write-back by always using non-temporal stores */
+ len = copy_from_iter_nocache(addr, bytes, i);
+
+ /*
+ * In the iovec case on x86_64 copy_from_iter_nocache() uses
+ * non-temporal stores for the bulk of the transfer, but we need
+ * to manually flush if the transfer is unaligned. A cached
+ * memory copy is used when destination or size is not naturally
+ * aligned. That is:
+ * - Require 8-byte alignment when size is 8 bytes or larger.
+ * - Require 4-byte alignment when size is 4 bytes.
+ *
+ * In the non-iovec case the entire destination needs to be
+ * flushed.
+ */
+ if (iter_is_iovec(i)) {
+ unsigned long flushed, dest = (unsigned long) addr;
+
+ if (bytes < 8) {
+ if (!IS_ALIGNED(dest, 4) || (bytes != 4))
+ wb_cache_pmem(addr, 1);
+ } else {
+ if (!IS_ALIGNED(dest, 8)) {
+ dest = ALIGN(dest, boot_cpu_data.x86_clflush_size);
+ wb_cache_pmem(addr, 1);
+ }
+
+ flushed = dest - (unsigned long) addr;
+ if (bytes > flushed && !IS_ALIGNED(bytes - flushed, 8))
+ wb_cache_pmem(addr + bytes - 1, 1);
+ }
+ } else
+ wb_cache_pmem(addr, bytes);
+
+ return len;
+}
+
static const struct block_device_operations pmem_fops = {
.owner = THIS_MODULE,
.rw_page = pmem_rw_page,
@@ -236,6 +278,7 @@ static long pmem_dax_direct_access(struct dax_device *dax_dev,
static const struct dax_operations pmem_dax_ops = {
.direct_access = pmem_dax_direct_access,
+ .copy_from_iter = pmem_copy_from_iter,
};
static void pmem_release_queue(void *q)
diff --git a/include/linux/dax.h b/include/linux/dax.h
index d3158e74a59e..156f067d4db5 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -16,6 +16,9 @@ struct dax_operations {
*/
long (*direct_access)(struct dax_device *, pgoff_t, long,
void **, pfn_t *);
+ /* copy_from_iter: dax-driver override for default copy_from_iter */
+ size_t (*copy_from_iter)(struct dax_device *, pgoff_t, void *, size_t,
+ struct iov_iter *);
};
int dax_read_lock(void);
Allow device-mapper to route copy_from_iter operations to the
per-target implementation. In order for the device stacking to work we
need a dax_dev and a pgoff relative to that device. This gives each
layer of the stack the information it needs to look up the operation
pointer for the next level.
This conceptually allows for an array of mixed device drivers with
varying copy_from_iter implementations.
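The stacking idea can be sketched in userspace as follows. This is a
simplified model, not the device-mapper code: struct model_dev and the
model_* functions are hypothetical, a flat source buffer stands in for
the iov_iter, and the pgoff translation is a plain addition instead of
the sector math done by bdev_dax_pgoff().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct model_dev;

typedef size_t (*model_copy_fn)(struct model_dev *dev, unsigned long pgoff,
				void *dst, const void *src, size_t bytes);

struct model_dev {
	model_copy_fn copy;		/* optional; NULL means plain memcpy */
	struct model_dev *lower;	/* next device down the stack */
	unsigned long start_pgoff;	/* where this mapping begins in 'lower' */
	unsigned long seen_pgoff;	/* last pgoff a leaf driver was handed */
	int flushes;			/* write-backs a leaf performed */
};

/* Entry point: use the device's own copy routine if it has one. */
static size_t model_copy(struct model_dev *dev, unsigned long pgoff,
			 void *dst, const void *src, size_t bytes)
{
	if (!dev->copy) {
		memcpy(dst, src, bytes);
		return bytes;
	}
	return dev->copy(dev, pgoff, dst, src, bytes);
}

/* Stacking layer: translate pgoff, then ask the lower device. */
static size_t model_linear_copy(struct model_dev *dev, unsigned long pgoff,
				void *dst, const void *src, size_t bytes)
{
	return model_copy(dev->lower, dev->start_pgoff + pgoff,
			  dst, src, bytes);
}

/* Leaf driver that performs cache management after the copy. */
static size_t model_pmem_copy(struct model_dev *dev, unsigned long pgoff,
			      void *dst, const void *src, size_t bytes)
{
	memcpy(dst, src, bytes);
	dev->seen_pgoff = pgoff;
	dev->flushes++;		/* stands in for a cache write-back */
	return bytes;
}
```

A linear layer with start_pgoff 16 stacked on a pmem-like leaf hands the
leaf pgoff 19 for a copy at pgoff 3, and the leaf flushes. The same layer
stacked on a leaf with no copy routine falls back to memcpy with no
flush.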
Cc: Toshi Kani <[email protected]>
Cc: Mike Snitzer <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
drivers/dax/super.c | 13 +++++++++++++
drivers/md/dm-linear.c | 15 +++++++++++++++
drivers/md/dm-stripe.c | 20 ++++++++++++++++++++
drivers/md/dm.c | 26 ++++++++++++++++++++++++++
include/linux/dax.h | 2 ++
include/linux/device-mapper.h | 3 +++
6 files changed, 79 insertions(+)
diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 23ce3ab49f10..73f0da8e5d27 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -17,6 +17,7 @@
#include <linux/cdev.h>
#include <linux/hash.h>
#include <linux/slab.h>
+#include <linux/uio.h>
#include <linux/dax.h>
#include <linux/fs.h>
@@ -104,6 +105,18 @@ long dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff, long nr_pages,
}
EXPORT_SYMBOL_GPL(dax_direct_access);
+size_t dax_copy_from_iter(struct dax_device *dax_dev, pgoff_t pgoff, void *addr,
+ size_t bytes, struct iov_iter *i)
+{
+ if (!dax_alive(dax_dev))
+ return 0;
+
+ if (!dax_dev->ops->copy_from_iter)
+ return copy_from_iter(addr, bytes, i);
+ return dax_dev->ops->copy_from_iter(dax_dev, pgoff, addr, bytes, i);
+}
+EXPORT_SYMBOL_GPL(dax_copy_from_iter);
+
bool dax_alive(struct dax_device *dax_dev)
{
lockdep_assert_held(&dax_srcu);
diff --git a/drivers/md/dm-linear.c b/drivers/md/dm-linear.c
index c5a52f4dae81..5fe44a0ddfab 100644
--- a/drivers/md/dm-linear.c
+++ b/drivers/md/dm-linear.c
@@ -158,6 +158,20 @@ static long linear_dax_direct_access(struct dm_target *ti, pgoff_t pgoff,
return dax_direct_access(dax_dev, pgoff, nr_pages, kaddr, pfn);
}
+static size_t linear_dax_copy_from_iter(struct dm_target *ti, pgoff_t pgoff,
+ void *addr, size_t bytes, struct iov_iter *i)
+{
+ struct linear_c *lc = ti->private;
+ struct block_device *bdev = lc->dev->bdev;
+ struct dax_device *dax_dev = lc->dev->dax_dev;
+ sector_t dev_sector, sector = pgoff * PAGE_SECTORS;
+
+ dev_sector = linear_map_sector(ti, sector);
+ if (bdev_dax_pgoff(bdev, dev_sector, ALIGN(bytes, PAGE_SIZE), &pgoff))
+ return 0;
+ return dax_copy_from_iter(dax_dev, pgoff, addr, bytes, i);
+}
+
static struct target_type linear_target = {
.name = "linear",
.version = {1, 3, 0},
@@ -169,6 +183,7 @@ static struct target_type linear_target = {
.prepare_ioctl = linear_prepare_ioctl,
.iterate_devices = linear_iterate_devices,
.direct_access = linear_dax_direct_access,
+ .dax_copy_from_iter = linear_dax_copy_from_iter,
};
int __init dm_linear_init(void)
diff --git a/drivers/md/dm-stripe.c b/drivers/md/dm-stripe.c
index cb4b1e9e16ab..4f45d23249b2 100644
--- a/drivers/md/dm-stripe.c
+++ b/drivers/md/dm-stripe.c
@@ -330,6 +330,25 @@ static long stripe_dax_direct_access(struct dm_target *ti, pgoff_t pgoff,
return dax_direct_access(dax_dev, pgoff, nr_pages, kaddr, pfn);
}
+static size_t stripe_dax_copy_from_iter(struct dm_target *ti, pgoff_t pgoff,
+ void *addr, size_t bytes, struct iov_iter *i)
+{
+ sector_t dev_sector, sector = pgoff * PAGE_SECTORS;
+ struct stripe_c *sc = ti->private;
+ struct dax_device *dax_dev;
+ struct block_device *bdev;
+ uint32_t stripe;
+
+ stripe_map_sector(sc, sector, &stripe, &dev_sector);
+ dev_sector += sc->stripe[stripe].physical_start;
+ dax_dev = sc->stripe[stripe].dev->dax_dev;
+ bdev = sc->stripe[stripe].dev->bdev;
+
+ if (bdev_dax_pgoff(bdev, dev_sector, ALIGN(bytes, PAGE_SIZE), &pgoff))
+ return 0;
+ return dax_copy_from_iter(dax_dev, pgoff, addr, bytes, i);
+}
+
/*
* Stripe status:
*
@@ -448,6 +467,7 @@ static struct target_type stripe_target = {
.iterate_devices = stripe_iterate_devices,
.io_hints = stripe_io_hints,
.direct_access = stripe_dax_direct_access,
+ .dax_copy_from_iter = stripe_dax_copy_from_iter,
};
int __init dm_stripe_init(void)
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 79d5f5fd823e..8c8579efcba2 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -19,6 +19,7 @@
#include <linux/dax.h>
#include <linux/slab.h>
#include <linux/idr.h>
+#include <linux/uio.h>
#include <linux/hdreg.h>
#include <linux/delay.h>
#include <linux/wait.h>
@@ -957,6 +958,30 @@ static long dm_dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
return ret;
}
+static size_t dm_dax_copy_from_iter(struct dax_device *dax_dev, pgoff_t pgoff,
+ void *addr, size_t bytes, struct iov_iter *i)
+{
+ struct mapped_device *md = dax_get_private(dax_dev);
+ sector_t sector = pgoff * PAGE_SECTORS;
+ struct dm_target *ti;
+ long ret = 0;
+ int srcu_idx;
+
+ ti = dm_dax_get_live_target(md, sector, &srcu_idx);
+
+ if (!ti)
+ goto out;
+ if (!ti->type->dax_copy_from_iter) {
+ ret = copy_from_iter(addr, bytes, i);
+ goto out;
+ }
+ ret = ti->type->dax_copy_from_iter(ti, pgoff, addr, bytes, i);
+ out:
+ dm_put_live_table(md, srcu_idx);
+
+ return ret;
+}
+
/*
* A target may call dm_accept_partial_bio only from the map routine. It is
* allowed for all bio types except REQ_PREFLUSH.
@@ -2818,6 +2843,7 @@ static const struct block_device_operations dm_blk_dops = {
static const struct dax_operations dm_dax_ops = {
.direct_access = dm_dax_direct_access,
+ .copy_from_iter = dm_dax_copy_from_iter,
};
/*
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 156f067d4db5..cd8561bb21f3 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -32,6 +32,8 @@ void kill_dax(struct dax_device *dax_dev);
void *dax_get_private(struct dax_device *dax_dev);
long dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff, long nr_pages,
void **kaddr, pfn_t *pfn);
+size_t dax_copy_from_iter(struct dax_device *dax_dev, pgoff_t pgoff, void *addr,
+ size_t bytes, struct iov_iter *i);
/*
* We use lowest available bit in exceptional entry for locking, one bit for
diff --git a/include/linux/device-mapper.h b/include/linux/device-mapper.h
index df830d167892..e18e4277d09d 100644
--- a/include/linux/device-mapper.h
+++ b/include/linux/device-mapper.h
@@ -130,6 +130,8 @@ typedef int (*dm_busy_fn) (struct dm_target *ti);
*/
typedef long (*dm_dax_direct_access_fn) (struct dm_target *ti, pgoff_t pgoff,
long nr_pages, void **kaddr, pfn_t *pfn);
+typedef size_t (*dm_dax_copy_from_iter_fn)(struct dm_target *ti, pgoff_t pgoff,
+ void *addr, size_t bytes, struct iov_iter *i);
#define PAGE_SECTORS (PAGE_SIZE / 512)
void dm_error(const char *message);
@@ -179,6 +181,7 @@ struct target_type {
dm_iterate_devices_fn iterate_devices;
dm_io_hints_fn io_hints;
dm_dax_direct_access_fn direct_access;
+ dm_dax_copy_from_iter_fn dax_copy_from_iter;
/* For internal device-mapper use. */
struct list_head list;
Kill off the final user of bdev_direct_access() and struct blk_dax_ctl.
Signed-off-by: Dan Williams <[email protected]>
---
fs/block_dev.c | 48 ++++++++++++++++++++++++++++--------------------
1 file changed, 28 insertions(+), 20 deletions(-)
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 2f7885712575..ecbdc8f9f718 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -788,35 +788,43 @@ EXPORT_SYMBOL(bdev_dax_pgoff);
*/
int bdev_dax_supported(struct super_block *sb, int blocksize)
{
- struct blk_dax_ctl dax = {
- .sector = 0,
- .size = PAGE_SIZE,
- };
- int err;
+ struct block_device *bdev = sb->s_bdev;
+ struct dax_device *dax_dev;
+ pgoff_t pgoff;
+ int err, id;
+ void *kaddr;
+ pfn_t pfn;
+ long len;
if (blocksize != PAGE_SIZE) {
vfs_msg(sb, KERN_ERR, "error: unsupported blocksize for dax");
return -EINVAL;
}
- err = bdev_direct_access(sb->s_bdev, &dax);
- if (err < 0) {
- switch (err) {
- case -EOPNOTSUPP:
- vfs_msg(sb, KERN_ERR,
- "error: device does not support dax");
- break;
- case -EINVAL:
- vfs_msg(sb, KERN_ERR,
- "error: unaligned partition for dax");
- break;
- default:
- vfs_msg(sb, KERN_ERR,
- "error: dax access failed (%d)", err);
- }
+ err = bdev_dax_pgoff(bdev, 0, PAGE_SIZE, &pgoff);
+ if (err) {
+ vfs_msg(sb, KERN_ERR, "error: unaligned partition for dax");
return err;
}
+ dax_dev = dax_get_by_host(bdev->bd_disk->disk_name);
+ if (!dax_dev) {
+ vfs_msg(sb, KERN_ERR, "error: device does not support dax");
+ return -EOPNOTSUPP;
+ }
+
+ id = dax_read_lock();
+ len = dax_direct_access(dax_dev, pgoff, 1, &kaddr, &pfn);
+ dax_read_unlock(id);
+
+ put_dax(dax_dev);
+
+ if (len < 1) {
+ vfs_msg(sb, KERN_ERR,
+ "error: dax access failed (%ld)", len);
+ return len < 0 ? len : -EIO;
+ }
+
return 0;
}
EXPORT_SYMBOL_GPL(bdev_dax_supported);
Filesystem-DAX flushes caches whenever it writes to the address returned
through dax_direct_access() and when writing back dirty radix entries.
That flushing is only required in the pmem case, so add a dax operation
to allow pmem to take this extra action, but skip it for other dax
capable devices that do not provide a flush routine.
An example for this differentiation might be a volatile ram disk where
there is no expectation of persistence. In fact the pmem driver itself might
front such an address range specified by the NFIT. So, this "no flush"
property might be something passed down by the bus / libnvdimm.
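A minimal userspace sketch of the "optional flush" idea, under the
assumption that a device without a flush routine simply gets no cache
management. The model_* names are hypothetical and the pmem-like flush
only accumulates a byte count instead of writing back cache lines.

```c
#include <assert.h>
#include <stddef.h>

struct model_dax_ops {
	/* optional: NULL when the device needs no cache management */
	void (*flush)(void *addr, size_t size);
};

struct model_dax_dev {
	const struct model_dax_ops *ops;
};

/* Total bytes a flush routine was asked to write back. */
static size_t model_flushed_bytes;

static void model_pmem_flush(void *addr, size_t size)
{
	(void)addr;
	model_flushed_bytes += size;
}

/* pmem-like device: persistence requires a write-back. */
static const struct model_dax_ops model_pmem_ops = {
	.flush = model_pmem_flush,
};

/* Volatile ram disk: no flush routine, so nothing is done. */
static const struct model_dax_ops model_ramdisk_ops = {
	.flush = NULL,
};

static void model_dax_flush(struct model_dax_dev *dev, void *addr, size_t size)
{
	if (dev->ops->flush)
		dev->ops->flush(addr, size);
}
```

Calling model_dax_flush() on a device using model_ramdisk_ops leaves
model_flushed_bytes untouched, while the same call on a model_pmem_ops
device adds the requested size.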
Cc: Christoph Hellwig <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Ross Zwisler <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
drivers/nvdimm/pmem.c | 11 +++++++++++
include/linux/dax.h | 2 ++
2 files changed, 13 insertions(+)
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index e501df4ab4b4..822b85fb3365 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -276,9 +276,20 @@ static long pmem_dax_direct_access(struct dax_device *dax_dev,
return __pmem_direct_access(pmem, pgoff, nr_pages, kaddr, pfn);
}
+static void pmem_dax_flush(struct dax_device *dax_dev, pgoff_t pgoff,
+ void *addr, size_t size)
+{
+ /*
+ * TODO: move arch specific cache management into the driver
+ * directly.
+ */
+ wb_cache_pmem(addr, size);
+}
+
static const struct dax_operations pmem_dax_ops = {
.direct_access = pmem_dax_direct_access,
.copy_from_iter = pmem_copy_from_iter,
+ .flush = pmem_dax_flush,
};
static void pmem_release_queue(void *q)
diff --git a/include/linux/dax.h b/include/linux/dax.h
index cd8561bb21f3..c88bbcba26d9 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -19,6 +19,8 @@ struct dax_operations {
/* copy_from_iter: dax-driver override for default copy_from_iter */
size_t (*copy_from_iter)(struct dax_device *, pgoff_t, void *, size_t,
struct iov_iter *);
+ /* flush: optional driver-specific cache management after writes */
+ void (*flush)(struct dax_device *, pgoff_t, void *, size_t);
};
int dax_read_lock(void);
Now that all possible providers of the dax_operations copy_from_iter
method are implemented, switch filesystem-dax to call the driver rather
than copy_from_iter_pmem().
Signed-off-by: Dan Williams <[email protected]>
---
arch/x86/include/asm/pmem.h | 50 -------------------------------------------
fs/dax.c | 3 ++-
include/linux/pmem.h | 24 ---------------------
3 files changed, 2 insertions(+), 75 deletions(-)
diff --git a/arch/x86/include/asm/pmem.h b/arch/x86/include/asm/pmem.h
index d5a22bac9988..60e8edbe0205 100644
--- a/arch/x86/include/asm/pmem.h
+++ b/arch/x86/include/asm/pmem.h
@@ -66,56 +66,6 @@ static inline void arch_wb_cache_pmem(void *addr, size_t size)
}
/**
- * arch_copy_from_iter_pmem - copy data from an iterator to PMEM
- * @addr: PMEM destination address
- * @bytes: number of bytes to copy
- * @i: iterator with source data
- *
- * Copy data from the iterator 'i' to the PMEM buffer starting at 'addr'.
- */
-static inline size_t arch_copy_from_iter_pmem(void *addr, size_t bytes,
- struct iov_iter *i)
-{
- size_t len;
-
- /* TODO: skip the write-back by always using non-temporal stores */
- len = copy_from_iter_nocache(addr, bytes, i);
-
- /*
- * In the iovec case on x86_64 copy_from_iter_nocache() uses
- * non-temporal stores for the bulk of the transfer, but we need
- * to manually flush if the transfer is unaligned. A cached
- * memory copy is used when destination or size is not naturally
- * aligned. That is:
- * - Require 8-byte alignment when size is 8 bytes or larger.
- * - Require 4-byte alignment when size is 4 bytes.
- *
- * In the non-iovec case the entire destination needs to be
- * flushed.
- */
- if (iter_is_iovec(i)) {
- unsigned long flushed, dest = (unsigned long) addr;
-
- if (bytes < 8) {
- if (!IS_ALIGNED(dest, 4) || (bytes != 4))
- arch_wb_cache_pmem(addr, 1);
- } else {
- if (!IS_ALIGNED(dest, 8)) {
- dest = ALIGN(dest, boot_cpu_data.x86_clflush_size);
- arch_wb_cache_pmem(addr, 1);
- }
-
- flushed = dest - (unsigned long) addr;
- if (bytes > flushed && !IS_ALIGNED(bytes - flushed, 8))
- arch_wb_cache_pmem(addr + bytes - 1, 1);
- }
- } else
- arch_wb_cache_pmem(addr, bytes);
-
- return len;
-}
-
-/**
* arch_clear_pmem - zero a PMEM memory range
* @addr: virtual start address
* @size: number of bytes to zero
diff --git a/fs/dax.c b/fs/dax.c
index ce9dc9c3e829..11b9909c91df 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1061,7 +1061,8 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
map_len = end - pos;
if (iov_iter_rw(iter) == WRITE)
- map_len = copy_from_iter_pmem(kaddr, map_len, iter);
+ map_len = dax_copy_from_iter(dax_dev, pgoff, kaddr,
+ map_len, iter);
else
map_len = copy_to_iter(kaddr, map_len, iter);
if (map_len <= 0) {
diff --git a/include/linux/pmem.h b/include/linux/pmem.h
index 71ecf3d46aac..9d542a5600e4 100644
--- a/include/linux/pmem.h
+++ b/include/linux/pmem.h
@@ -31,13 +31,6 @@ static inline void arch_memcpy_to_pmem(void *dst, const void *src, size_t n)
BUG();
}
-static inline size_t arch_copy_from_iter_pmem(void *addr, size_t bytes,
- struct iov_iter *i)
-{
- BUG();
- return 0;
-}
-
static inline void arch_clear_pmem(void *addr, size_t size)
{
BUG();
@@ -80,23 +73,6 @@ static inline void memcpy_to_pmem(void *dst, const void *src, size_t n)
}
/**
- * copy_from_iter_pmem - copy data from an iterator to PMEM
- * @addr: PMEM destination address
- * @bytes: number of bytes to copy
- * @i: iterator with source data
- *
- * Copy data from the iterator 'i' to the PMEM buffer starting at 'addr'.
- * See blkdev_issue_flush() note for memcpy_to_pmem().
- */
-static inline size_t copy_from_iter_pmem(void *addr, size_t bytes,
- struct iov_iter *i)
-{
- if (arch_has_pmem_api())
- return arch_copy_from_iter_pmem(addr, bytes, i);
- return copy_from_iter_nocache(addr, bytes, i);
-}
-
-/**
* clear_pmem - zero a PMEM memory range
* @addr: virtual start address
* @size: number of bytes to zero
Allow device-mapper to route flush operations to the
per-target implementation. In order for the device stacking to work we
need a dax_dev and a pgoff relative to that device. This gives each
layer of the stack the information it needs to look up the operation
pointer for the next level.
This conceptually allows for an array of mixed device drivers with
varying flush implementations.
Cc: Toshi Kani <[email protected]>
Cc: Mike Snitzer <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
drivers/dax/super.c | 11 +++++++++++
drivers/md/dm-linear.c | 15 +++++++++++++++
drivers/md/dm-stripe.c | 20 ++++++++++++++++++++
drivers/md/dm.c | 19 +++++++++++++++++++
include/linux/dax.h | 2 ++
include/linux/device-mapper.h | 3 +++
6 files changed, 70 insertions(+)
diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 73f0da8e5d27..1253c05a2e53 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -117,6 +117,17 @@ size_t dax_copy_from_iter(struct dax_device *dax_dev, pgoff_t pgoff, void *addr,
}
EXPORT_SYMBOL_GPL(dax_copy_from_iter);
+void dax_flush(struct dax_device *dax_dev, pgoff_t pgoff, void *addr,
+ size_t size)
+{
+ if (!dax_alive(dax_dev))
+ return;
+
+ if (dax_dev->ops->flush)
+ dax_dev->ops->flush(dax_dev, pgoff, addr, size);
+}
+EXPORT_SYMBOL_GPL(dax_flush);
+
bool dax_alive(struct dax_device *dax_dev)
{
lockdep_assert_held(&dax_srcu);
diff --git a/drivers/md/dm-linear.c b/drivers/md/dm-linear.c
index 5fe44a0ddfab..70d8439a1b63 100644
--- a/drivers/md/dm-linear.c
+++ b/drivers/md/dm-linear.c
@@ -172,6 +172,20 @@ static size_t linear_dax_copy_from_iter(struct dm_target *ti, pgoff_t pgoff,
return dax_copy_from_iter(dax_dev, pgoff, addr, bytes, i);
}
+static void linear_dax_flush(struct dm_target *ti, pgoff_t pgoff, void *addr,
+ size_t size)
+{
+ struct linear_c *lc = ti->private;
+ struct block_device *bdev = lc->dev->bdev;
+ struct dax_device *dax_dev = lc->dev->dax_dev;
+ sector_t dev_sector, sector = pgoff * PAGE_SECTORS;
+
+ dev_sector = linear_map_sector(ti, sector);
+ if (bdev_dax_pgoff(bdev, dev_sector, ALIGN(size, PAGE_SIZE), &pgoff))
+ return;
+ dax_flush(dax_dev, pgoff, addr, size);
+}
+
static struct target_type linear_target = {
.name = "linear",
.version = {1, 3, 0},
@@ -184,6 +198,7 @@ static struct target_type linear_target = {
.iterate_devices = linear_iterate_devices,
.direct_access = linear_dax_direct_access,
.dax_copy_from_iter = linear_dax_copy_from_iter,
+ .dax_flush = linear_dax_flush,
};
int __init dm_linear_init(void)
diff --git a/drivers/md/dm-stripe.c b/drivers/md/dm-stripe.c
index 4f45d23249b2..829fd438318d 100644
--- a/drivers/md/dm-stripe.c
+++ b/drivers/md/dm-stripe.c
@@ -349,6 +349,25 @@ static size_t stripe_dax_copy_from_iter(struct dm_target *ti, pgoff_t pgoff,
return dax_copy_from_iter(dax_dev, pgoff, addr, bytes, i);
}
+static void stripe_dax_flush(struct dm_target *ti, pgoff_t pgoff, void *addr,
+ size_t size)
+{
+ sector_t dev_sector, sector = pgoff * PAGE_SECTORS;
+ struct stripe_c *sc = ti->private;
+ struct dax_device *dax_dev;
+ struct block_device *bdev;
+ uint32_t stripe;
+
+ stripe_map_sector(sc, sector, &stripe, &dev_sector);
+ dev_sector += sc->stripe[stripe].physical_start;
+ dax_dev = sc->stripe[stripe].dev->dax_dev;
+ bdev = sc->stripe[stripe].dev->bdev;
+
+ if (bdev_dax_pgoff(bdev, dev_sector, ALIGN(size, PAGE_SIZE), &pgoff))
+ return;
+ dax_flush(dax_dev, pgoff, addr, size);
+}
+
/*
* Stripe status:
*
@@ -468,6 +487,7 @@ static struct target_type stripe_target = {
.io_hints = stripe_io_hints,
.direct_access = stripe_dax_direct_access,
.dax_copy_from_iter = stripe_dax_copy_from_iter,
+ .dax_flush = stripe_dax_flush,
};
int __init dm_stripe_init(void)
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 8c8579efcba2..6a97711cdbdf 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -982,6 +982,24 @@ static size_t dm_dax_copy_from_iter(struct dax_device *dax_dev, pgoff_t pgoff,
return ret;
}
+static void dm_dax_flush(struct dax_device *dax_dev, pgoff_t pgoff, void *addr,
+ size_t size)
+{
+ struct mapped_device *md = dax_get_private(dax_dev);
+ sector_t sector = pgoff * PAGE_SECTORS;
+ struct dm_target *ti;
+ int srcu_idx;
+
+ ti = dm_dax_get_live_target(md, sector, &srcu_idx);
+
+ if (!ti)
+ goto out;
+ if (ti->type->dax_flush)
+ ti->type->dax_flush(ti, pgoff, addr, size);
+ out:
+ dm_put_live_table(md, srcu_idx);
+}
+
/*
* A target may call dm_accept_partial_bio only from the map routine. It is
* allowed for all bio types except REQ_PREFLUSH.
@@ -2844,6 +2862,7 @@ static const struct block_device_operations dm_blk_dops = {
static const struct dax_operations dm_dax_ops = {
.direct_access = dm_dax_direct_access,
.copy_from_iter = dm_dax_copy_from_iter,
+ .flush = dm_dax_flush,
};
/*
diff --git a/include/linux/dax.h b/include/linux/dax.h
index c88bbcba26d9..2a4f533ef5e7 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -36,6 +36,8 @@ long dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff, long nr_pages,
void **kaddr, pfn_t *pfn);
size_t dax_copy_from_iter(struct dax_device *dax_dev, pgoff_t pgoff, void *addr,
size_t bytes, struct iov_iter *i);
+void dax_flush(struct dax_device *dax_dev, pgoff_t pgoff, void *addr,
+ size_t size);
/*
* We use lowest available bit in exceptional entry for locking, one bit for
diff --git a/include/linux/device-mapper.h b/include/linux/device-mapper.h
index e18e4277d09d..3c5eab2d2a75 100644
--- a/include/linux/device-mapper.h
+++ b/include/linux/device-mapper.h
@@ -132,6 +132,8 @@ typedef long (*dm_dax_direct_access_fn) (struct dm_target *ti, pgoff_t pgoff,
long nr_pages, void **kaddr, pfn_t *pfn);
typedef size_t (*dm_dax_copy_from_iter_fn)(struct dm_target *ti, pgoff_t pgoff,
void *addr, size_t bytes, struct iov_iter *i);
+typedef void (*dm_dax_flush_fn)(struct dm_target *ti, pgoff_t pgoff, void *addr,
+ size_t size);
#define PAGE_SECTORS (PAGE_SIZE / 512)
void dm_error(const char *message);
@@ -182,6 +184,7 @@ struct target_type {
dm_io_hints_fn io_hints;
dm_dax_direct_access_fn direct_access;
dm_dax_copy_from_iter_fn dax_copy_from_iter;
+ dm_dax_flush_fn dax_flush;
/* For internal device-mapper use. */
struct list_head list;
With all calls to wb_cache_pmem() re-directed through the pmem driver, we
can kill the pmem api indirection. arch_wb_cache_pmem() is now
optionally supplied by an arch specific extension to libnvdimm. Same as
before, pmem flushing is only defined for x86_64, but it is
straightforward to add other archs in the future.
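For reference, the range rounding done by the x86 implementation can be
modeled in userspace. This is a sketch only: model_wb_lines() is a
hypothetical name, the cache line size is a parameter instead of
boot_cpu_data.x86_clflush_size, and the loop counts lines where the
kernel version issues clwb().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Count the cache lines visited for the range [addr, addr + size).
 * The start is rounded down to a line boundary; 'clsize' must be a
 * power of two.
 */
static size_t model_wb_lines(uintptr_t addr, size_t size, size_t clsize)
{
	uintptr_t mask = clsize - 1;
	uintptr_t end = addr + size;
	uintptr_t p;
	size_t n = 0;

	for (p = addr & ~mask; p < end; p += clsize)
		n++;	/* the kernel version issues clwb(p) here */
	return n;
}
```

With 64-byte lines, a 1-byte range at 0x1003 touches one line, and a
2-byte range at 0x103f straddles a boundary and touches two.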
Cc: <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Jeff Moyer <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Oliver O'Halloran <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Ross Zwisler <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
arch/x86/include/asm/pmem.h | 21 ---------------------
drivers/nvdimm/Makefile | 1 +
drivers/nvdimm/pmem.c | 14 +++++---------
drivers/nvdimm/pmem.h | 8 ++++++++
drivers/nvdimm/x86.c | 36 ++++++++++++++++++++++++++++++++++++
include/linux/pmem.h | 19 -------------------
tools/testing/nvdimm/Kbuild | 1 +
7 files changed, 51 insertions(+), 49 deletions(-)
create mode 100644 drivers/nvdimm/x86.c
diff --git a/arch/x86/include/asm/pmem.h b/arch/x86/include/asm/pmem.h
index f4c119d253f3..4759a179aa52 100644
--- a/arch/x86/include/asm/pmem.h
+++ b/arch/x86/include/asm/pmem.h
@@ -44,27 +44,6 @@ static inline void arch_memcpy_to_pmem(void *dst, const void *src, size_t n)
BUG();
}
-/**
- * arch_wb_cache_pmem - write back a cache range with CLWB
- * @vaddr: virtual start address
- * @size: number of bytes to write back
- *
- * Write back a cache range using the CLWB (cache line write back)
- * instruction. Note that @size is internally rounded up to be cache
- * line size aligned.
- */
-static inline void arch_wb_cache_pmem(void *addr, size_t size)
-{
- u16 x86_clflush_size = boot_cpu_data.x86_clflush_size;
- unsigned long clflush_mask = x86_clflush_size - 1;
- void *vend = addr + size;
- void *p;
-
- for (p = (void *)((unsigned long)addr & ~clflush_mask);
- p < vend; p += x86_clflush_size)
- clwb(p);
-}
-
static inline void arch_invalidate_pmem(void *addr, size_t size)
{
clflush_cache_range(addr, size);
diff --git a/drivers/nvdimm/Makefile b/drivers/nvdimm/Makefile
index 909554c3f955..9eafb1dd2876 100644
--- a/drivers/nvdimm/Makefile
+++ b/drivers/nvdimm/Makefile
@@ -24,3 +24,4 @@ libnvdimm-$(CONFIG_ND_CLAIM) += claim.o
libnvdimm-$(CONFIG_BTT) += btt_devs.o
libnvdimm-$(CONFIG_NVDIMM_PFN) += pfn_devs.o
libnvdimm-$(CONFIG_NVDIMM_DAX) += dax_devs.o
+libnvdimm-$(CONFIG_X86_64) += x86.o
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 822b85fb3365..c77a3a757729 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -245,19 +245,19 @@ static size_t pmem_copy_from_iter(struct dax_device *dax_dev, pgoff_t pgoff,
if (bytes < 8) {
if (!IS_ALIGNED(dest, 4) || (bytes != 4))
- wb_cache_pmem(addr, 1);
+ arch_wb_cache_pmem(addr, 1);
} else {
if (!IS_ALIGNED(dest, 8)) {
dest = ALIGN(dest, boot_cpu_data.x86_clflush_size);
- wb_cache_pmem(addr, 1);
+ arch_wb_cache_pmem(addr, 1);
}
flushed = dest - (unsigned long) addr;
if (bytes > flushed && !IS_ALIGNED(bytes - flushed, 8))
- wb_cache_pmem(addr + bytes - 1, 1);
+ arch_wb_cache_pmem(addr + bytes - 1, 1);
}
} else
- wb_cache_pmem(addr, bytes);
+ arch_wb_cache_pmem(addr, bytes);
return len;
}
@@ -279,11 +279,7 @@ static long pmem_dax_direct_access(struct dax_device *dax_dev,
static void pmem_dax_flush(struct dax_device *dax_dev, pgoff_t pgoff,
void *addr, size_t size)
{
- /*
- * TODO: move arch specific cache management into the driver
- * directly.
- */
- wb_cache_pmem(addr, size);
+ arch_wb_cache_pmem(addr, size);
}
static const struct dax_operations pmem_dax_ops = {
diff --git a/drivers/nvdimm/pmem.h b/drivers/nvdimm/pmem.h
index 7f4dbd72a90a..c4b3371c7f88 100644
--- a/drivers/nvdimm/pmem.h
+++ b/drivers/nvdimm/pmem.h
@@ -5,6 +5,14 @@
#include <linux/pfn_t.h>
#include <linux/fs.h>
+#ifdef CONFIG_ARCH_HAS_PMEM_API
+void arch_wb_cache_pmem(void *addr, size_t size);
+#else
+static inline void arch_wb_cache_pmem(void *addr, size_t size)
+{
+}
+#endif
+
/* this definition is in it's own header for tools/testing/nvdimm to consume */
struct pmem_device {
/* One contiguous memory region per device */
diff --git a/drivers/nvdimm/x86.c b/drivers/nvdimm/x86.c
new file mode 100644
index 000000000000..79d7267da4d2
--- /dev/null
+++ b/drivers/nvdimm/x86.c
@@ -0,0 +1,36 @@
+/*
+ * Copyright(c) 2015 - 2017 Intel Corporation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ */
+#include <asm/cacheflush.h>
+#include <asm/cpufeature.h>
+#include <asm/special_insns.h>
+
+/**
+ * arch_wb_cache_pmem - write back a cache range with CLWB
+ * @addr: virtual start address
+ * @size: number of bytes to write back
+ *
+ * Write back a cache range using the CLWB (cache line write back)
+ * instruction.
+ */
+void arch_wb_cache_pmem(void *addr, size_t size)
+{
+ u16 x86_clflush_size = boot_cpu_data.x86_clflush_size;
+ unsigned long clflush_mask = x86_clflush_size - 1;
+ void *vend = addr + size;
+ void *p;
+
+ for (p = (void *)((unsigned long)addr & ~clflush_mask);
+ p < vend; p += x86_clflush_size)
+ clwb(p);
+}
+EXPORT_SYMBOL_GPL(arch_wb_cache_pmem);
diff --git a/include/linux/pmem.h b/include/linux/pmem.h
index 772bd02a5b52..33ae761f010a 100644
--- a/include/linux/pmem.h
+++ b/include/linux/pmem.h
@@ -31,11 +31,6 @@ static inline void arch_memcpy_to_pmem(void *dst, const void *src, size_t n)
BUG();
}
-static inline void arch_wb_cache_pmem(void *addr, size_t size)
-{
- BUG();
-}
-
static inline void arch_invalidate_pmem(void *addr, size_t size)
{
BUG();
@@ -80,18 +75,4 @@ static inline void invalidate_pmem(void *addr, size_t size)
if (arch_has_pmem_api())
arch_invalidate_pmem(addr, size);
}
-
-/**
- * wb_cache_pmem - write back processor cache for PMEM memory range
- * @addr: virtual start address
- * @size: number of bytes to write back
- *
- * Write back the processor cache range starting at 'addr' for 'size' bytes.
- * See blkdev_issue_flush() note for memcpy_to_pmem().
- */
-static inline void wb_cache_pmem(void *addr, size_t size)
-{
- if (arch_has_pmem_api())
- arch_wb_cache_pmem(addr, size);
-}
#endif /* __PMEM_H__ */
diff --git a/tools/testing/nvdimm/Kbuild b/tools/testing/nvdimm/Kbuild
index 2033ad03b8cd..0e0c444737e9 100644
--- a/tools/testing/nvdimm/Kbuild
+++ b/tools/testing/nvdimm/Kbuild
@@ -72,6 +72,7 @@ libnvdimm-$(CONFIG_ND_CLAIM) += $(NVDIMM_SRC)/claim.o
libnvdimm-$(CONFIG_BTT) += $(NVDIMM_SRC)/btt_devs.o
libnvdimm-$(CONFIG_NVDIMM_PFN) += $(NVDIMM_SRC)/pfn_devs.o
libnvdimm-$(CONFIG_NVDIMM_DAX) += $(NVDIMM_SRC)/dax_devs.o
+libnvdimm-$(CONFIG_X86_64) += $(NVDIMM_SRC)/x86.o
libnvdimm-y += config_check.o
obj-m += test/
The pmem driver assumes that, if platform firmware describes the memory
devices associated with a persistent memory range and
CONFIG_ARCH_HAS_PMEM_API=y, it has all the mechanisms necessary to
flush data to a power-fail safe zone. We already warn if the firmware
does not describe memory devices, but we also need to warn if the
architecture does not claim pmem support.
Cc: Jan Kara <[email protected]>
Cc: Jeff Moyer <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Ross Zwisler <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
drivers/nvdimm/region_devs.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
index 307a48060aa3..5976f6c0407f 100644
--- a/drivers/nvdimm/region_devs.c
+++ b/drivers/nvdimm/region_devs.c
@@ -970,8 +970,9 @@ int nvdimm_has_flush(struct nd_region *nd_region)
struct nd_region_data *ndrd = dev_get_drvdata(&nd_region->dev);
int i;
- /* no nvdimm == flushing capability unknown */
- if (nd_region->ndr_mappings == 0)
+ /* no nvdimm or pmem api == flushing capability unknown */
+ if (nd_region->ndr_mappings == 0
+ || !IS_ENABLED(CONFIG_ARCH_HAS_PMEM_API))
return -ENXIO;
for (i = 0; i < nd_region->ndr_mappings; i++)
The clear_pmem() helper simply combines a memset() with a cache flush.
Now that the flush routine is optionally provided by the dax device
driver, we can avoid unnecessary cache management on dax devices
fronting volatile memory.
With clear_pmem() gone we can follow on with a patch to make pmem cache
management completely defined within the pmem driver.
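For illustration only, here is a minimal userspace sketch of the
resulting calling pattern: zero the range with a plain memset(), then
hand the write-back decision to an optional driver callback. All of the
sketch_* names are hypothetical stand-ins and are not the kernel's
struct dax_device or dax_operations definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the driver-provided operations. */
struct sketch_dax_ops {
	/* NULL when the backing memory is volatile and needs no write-back. */
	void (*flush)(void *addr, size_t size);
};

/* Hypothetical stand-in for a dax device. */
struct sketch_dax_dev {
	const struct sketch_dax_ops *ops;
};

/* Counts bytes a "pmem" driver was asked to write back (sketch only). */
static size_t sketch_flushed_bytes;

static void sketch_pmem_flush(void *addr, size_t size)
{
	(void)addr;
	sketch_flushed_bytes += size;
}

/* Call the driver's flush routine only if it provides one. */
static void sketch_dax_flush(struct sketch_dax_dev *dev, void *addr,
			     size_t size)
{
	assert(dev != NULL);
	if (dev->ops && dev->ops->flush)
		dev->ops->flush(addr, size);
}

/* Replacement for a combined "clear" helper: memset, then optional flush. */
static void sketch_zero_range(struct sketch_dax_dev *dev, void *kaddr,
			      size_t offset, size_t size)
{
	char *start = (char *)kaddr + offset;

	memset(start, 0, size);
	sketch_dax_flush(dev, start, size);
}
```

A device whose ops leave flush NULL pays only for the memset(), which
is the behaviour wanted for drivers such as brd that sit on volatile
memory.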
Cc: <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Jeff Moyer <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Ross Zwisler <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
arch/x86/include/asm/pmem.h | 13 -------------
fs/dax.c | 3 ++-
include/linux/pmem.h | 21 ---------------------
3 files changed, 2 insertions(+), 35 deletions(-)
diff --git a/arch/x86/include/asm/pmem.h b/arch/x86/include/asm/pmem.h
index 60e8edbe0205..f4c119d253f3 100644
--- a/arch/x86/include/asm/pmem.h
+++ b/arch/x86/include/asm/pmem.h
@@ -65,19 +65,6 @@ static inline void arch_wb_cache_pmem(void *addr, size_t size)
clwb(p);
}
-/**
- * arch_clear_pmem - zero a PMEM memory range
- * @addr: virtual start address
- * @size: number of bytes to zero
- *
- * Write zeros into the memory range starting at 'addr' for 'size' bytes.
- */
-static inline void arch_clear_pmem(void *addr, size_t size)
-{
- memset(addr, 0, size);
- arch_wb_cache_pmem(addr, size);
-}
-
static inline void arch_invalidate_pmem(void *addr, size_t size)
{
clflush_cache_range(addr, size);
diff --git a/fs/dax.c b/fs/dax.c
index edbf988de86c..edee7e8298bc 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -982,7 +982,8 @@ int __dax_zero_page_range(struct block_device *bdev,
dax_read_unlock(id);
return rc;
}
- clear_pmem(kaddr + offset, size);
+ memset(kaddr + offset, 0, size);
+ dax_flush(dax_dev, pgoff, kaddr + offset, size);
dax_read_unlock(id);
}
return 0;
diff --git a/include/linux/pmem.h b/include/linux/pmem.h
index 9d542a5600e4..772bd02a5b52 100644
--- a/include/linux/pmem.h
+++ b/include/linux/pmem.h
@@ -31,11 +31,6 @@ static inline void arch_memcpy_to_pmem(void *dst, const void *src, size_t n)
BUG();
}
-static inline void arch_clear_pmem(void *addr, size_t size)
-{
- BUG();
-}
-
static inline void arch_wb_cache_pmem(void *addr, size_t size)
{
BUG();
@@ -73,22 +68,6 @@ static inline void memcpy_to_pmem(void *dst, const void *src, size_t n)
}
/**
- * clear_pmem - zero a PMEM memory range
- * @addr: virtual start address
- * @size: number of bytes to zero
- *
- * Write zeros into the memory range starting at 'addr' for 'size' bytes.
- * See blkdev_issue_flush() note for memcpy_to_pmem().
- */
-static inline void clear_pmem(void *addr, size_t size)
-{
- if (arch_has_pmem_api())
- arch_clear_pmem(addr, size);
- else
- memset(addr, 0, size);
-}
-
-/**
* invalidate_pmem - flush a pmem range from the cache hierarchy
* @addr: virtual start address
* @size: bytes to invalidate (internally aligned to cache line size)
The pmem driver attaches to both persistent and volatile memory ranges
advertised by the ACPI NFIT. When the region is volatile it is redundant
to spend cycles flushing caches at fsync(). Check if the hosting region
is volatile and do not set QUEUE_FLAG_WC if it is.
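As a rough sketch of the decision described above, the following
userspace C models how the two write-cache flags could be derived from
the region. The sketch_* types and the meaning assigned to a positive
has_flush value are assumptions made for illustration; only the
"negative means flushing capability unknown" convention is taken from
the patch itself.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical summary of what the region reports. */
struct sketch_region {
	bool is_pmem;   /* persistent region (as opposed to volatile) */
	int has_flush;  /* < 0: unknown, 0: none, > 0: assumed flush support */
};

/* Hypothetical result: what the driver would pass to the block layer. */
struct sketch_queue_flags {
	bool wc;      /* advertise a volatile write cache */
	bool fua;     /* advertise forced-unit-access support */
	bool warned;  /* persistence could not be guaranteed */
};

static struct sketch_queue_flags
sketch_pick_flags(const struct sketch_region *r)
{
	struct sketch_queue_flags f = { false, false, false };

	assert(r != NULL);
	if (r->has_flush < 0)
		f.warned = true;          /* capability unknown: warn only */
	else
		f.fua = r->has_flush != 0;
	f.wc = r->is_pmem;                /* volatile regions skip cache flushes */
	return f;
}
```

With these inputs a volatile region ends up with neither flag set, so
an fsync() on it no longer triggers cache management.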
Cc: Jan Kara <[email protected]>
Cc: Jeff Moyer <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Ross Zwisler <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
drivers/nvdimm/pmem.c | 9 +++++++--
drivers/nvdimm/region_devs.c | 6 ++++++
include/linux/libnvdimm.h | 1 +
3 files changed, 14 insertions(+), 2 deletions(-)
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index b000c6db5731..42876a75dab8 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -275,6 +275,7 @@ static int pmem_attach_disk(struct device *dev,
struct vmem_altmap __altmap, *altmap = NULL;
struct resource *res = &nsio->res;
struct nd_pfn *nd_pfn = NULL;
+ int has_flush, fua = 0, wbc;
struct dax_device *dax_dev;
int nid = dev_to_node(dev);
struct nd_pfn_sb *pfn_sb;
@@ -302,8 +303,12 @@ static int pmem_attach_disk(struct device *dev,
dev_set_drvdata(dev, pmem);
pmem->phys_addr = res->start;
pmem->size = resource_size(res);
- if (nvdimm_has_flush(nd_region) < 0)
+ has_flush = nvdimm_has_flush(nd_region);
+ if (has_flush < 0)
dev_warn(dev, "unable to guarantee persistence of writes\n");
+ else
+ fua = has_flush;
+ wbc = nvdimm_has_cache(nd_region);
if (!devm_request_mem_region(dev, res->start, resource_size(res),
dev_name(&ndns->dev))) {
@@ -344,7 +349,7 @@ static int pmem_attach_disk(struct device *dev,
return PTR_ERR(addr);
pmem->virt_addr = addr;
- blk_queue_write_cache(q, true, true);
+ blk_queue_write_cache(q, wbc, fua);
blk_queue_make_request(q, pmem_make_request);
blk_queue_physical_block_size(q, PAGE_SIZE);
blk_queue_max_hw_sectors(q, UINT_MAX);
diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
index 2df259010720..a085f7094b76 100644
--- a/drivers/nvdimm/region_devs.c
+++ b/drivers/nvdimm/region_devs.c
@@ -989,6 +989,12 @@ int nvdimm_has_flush(struct nd_region *nd_region)
}
EXPORT_SYMBOL_GPL(nvdimm_has_flush);
+int nvdimm_has_cache(struct nd_region *nd_region)
+{
+ return is_nd_pmem(&nd_region->dev);
+}
+EXPORT_SYMBOL_GPL(nvdimm_has_cache);
+
void __exit nd_region_devs_exit(void)
{
ida_destroy(&region_ida);
diff --git a/include/linux/libnvdimm.h b/include/linux/libnvdimm.h
index a98004745768..b733030107bb 100644
--- a/include/linux/libnvdimm.h
+++ b/include/linux/libnvdimm.h
@@ -162,6 +162,7 @@ void nd_region_release_lane(struct nd_region *nd_region, unsigned int lane);
u64 nd_fletcher64(void *addr, size_t len, bool le);
void nvdimm_flush(struct nd_region *nd_region);
int nvdimm_has_flush(struct nd_region *nd_region);
+int nvdimm_has_cache(struct nd_region *nd_region);
#ifdef CONFIG_ARCH_HAS_PMEM_API
void arch_memcpy_to_pmem(void *dst, void *src, unsigned size);
#define ARCH_MEMREMAP_PMEM MEMREMAP_WB
Allow volatile nfit ranges to participate in all the same infrastructure
provided for persistent memory regions. A resulting namespace
device will still be called "pmem", but the parent region type will be
"nd_volatile". This is in preparation for disabling the dax ->flush()
operation in the pmem driver when it is hosted on a volatile range.
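To make the new grouping concrete, here is a small sketch of how the
region predicates compose: call sites that previously tested for pmem
alone now accept either memory type, while call sites that only care
about "is this a region at all" also accept blk. The enum and the
sketch_* helpers are hypothetical; the kernel helpers compare
dev->type against per-type device_type objects rather than an enum.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical device classes, standing in for dev->type comparisons. */
enum sketch_region_type {
	SKETCH_DIMM,      /* not a region at all */
	SKETCH_BLK,
	SKETCH_PMEM,
	SKETCH_VOLATILE,
};

/* Byte-addressable memory region, persistent or not. */
static bool sketch_is_memory(enum sketch_region_type t)
{
	return t == SKETCH_PMEM || t == SKETCH_VOLATILE;
}

/* Any region type, including block-window regions. */
static bool sketch_is_region(enum sketch_region_type t)
{
	return sketch_is_memory(t) || t == SKETCH_BLK;
}

/* Self-check of the sketch: every memory region is also a region. */
static void sketch_check_invariants(void)
{
	for (int t = SKETCH_DIMM; t <= SKETCH_VOLATILE; t++)
		if (sketch_is_memory((enum sketch_region_type)t))
			assert(sketch_is_region((enum sketch_region_type)t));
}
```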
Cc: Jan Kara <[email protected]>
Cc: Jeff Moyer <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Ross Zwisler <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
drivers/acpi/nfit/core.c | 9 ++++++++-
drivers/nvdimm/bus.c | 10 +++++-----
drivers/nvdimm/core.c | 2 +-
drivers/nvdimm/dax_devs.c | 2 +-
drivers/nvdimm/dimm_devs.c | 2 +-
drivers/nvdimm/namespace_devs.c | 8 ++++----
drivers/nvdimm/nd-core.h | 9 +++++++++
drivers/nvdimm/pfn_devs.c | 4 ++--
drivers/nvdimm/region_devs.c | 27 ++++++++++++++-------------
9 files changed, 45 insertions(+), 28 deletions(-)
diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c
index 8b4c6212737c..6ac31846c4df 100644
--- a/drivers/acpi/nfit/core.c
+++ b/drivers/acpi/nfit/core.c
@@ -2162,6 +2162,13 @@ static bool nfit_spa_is_virtual(struct acpi_nfit_system_address *spa)
nfit_spa_type(spa) == NFIT_SPA_PCD);
}
+static bool nfit_spa_is_volatile(struct acpi_nfit_system_address *spa)
+{
+ return (nfit_spa_type(spa) == NFIT_SPA_VDISK ||
+ nfit_spa_type(spa) == NFIT_SPA_VCD ||
+ nfit_spa_type(spa) == NFIT_SPA_VOLATILE);
+}
+
static int acpi_nfit_register_region(struct acpi_nfit_desc *acpi_desc,
struct nfit_spa *nfit_spa)
{
@@ -2236,7 +2243,7 @@ static int acpi_nfit_register_region(struct acpi_nfit_desc *acpi_desc,
ndr_desc);
if (!nfit_spa->nd_region)
rc = -ENOMEM;
- } else if (nfit_spa_type(spa) == NFIT_SPA_VOLATILE) {
+ } else if (nfit_spa_is_volatile(spa)) {
nfit_spa->nd_region = nvdimm_volatile_region_create(nvdimm_bus,
ndr_desc);
if (!nfit_spa->nd_region)
diff --git a/drivers/nvdimm/bus.c b/drivers/nvdimm/bus.c
index 351bac8f6503..d4173fbdba28 100644
--- a/drivers/nvdimm/bus.c
+++ b/drivers/nvdimm/bus.c
@@ -37,13 +37,13 @@ static int to_nd_device_type(struct device *dev)
{
if (is_nvdimm(dev))
return ND_DEVICE_DIMM;
- else if (is_nd_pmem(dev))
+ else if (is_memory(dev))
return ND_DEVICE_REGION_PMEM;
else if (is_nd_blk(dev))
return ND_DEVICE_REGION_BLK;
else if (is_nd_dax(dev))
return ND_DEVICE_DAX_PMEM;
- else if (is_nd_pmem(dev->parent) || is_nd_blk(dev->parent))
+ else if (is_nd_region(dev->parent))
return nd_region_to_nstype(to_nd_region(dev->parent));
return 0;
@@ -55,7 +55,7 @@ static int nvdimm_bus_uevent(struct device *dev, struct kobj_uevent_env *env)
* Ensure that region devices always have their numa node set as
* early as possible.
*/
- if (is_nd_pmem(dev) || is_nd_blk(dev))
+ if (is_nd_region(dev))
set_dev_node(dev, to_nd_region(dev)->numa_node);
return add_uevent_var(env, "MODALIAS=" ND_DEVICE_MODALIAS_FMT,
to_nd_device_type(dev));
@@ -64,7 +64,7 @@ static int nvdimm_bus_uevent(struct device *dev, struct kobj_uevent_env *env)
static struct module *to_bus_provider(struct device *dev)
{
/* pin bus providers while regions are enabled */
- if (is_nd_pmem(dev) || is_nd_blk(dev)) {
+ if (is_nd_region(dev)) {
struct nvdimm_bus *nvdimm_bus = walk_to_nvdimm_bus(dev);
return nvdimm_bus->nd_desc->module;
@@ -771,7 +771,7 @@ void wait_nvdimm_bus_probe_idle(struct device *dev)
static int pmem_active(struct device *dev, void *data)
{
- if (is_nd_pmem(dev) && dev->driver)
+ if (is_memory(dev) && dev->driver)
return -EBUSY;
return 0;
}
diff --git a/drivers/nvdimm/core.c b/drivers/nvdimm/core.c
index 9303cfeb8bee..875ef4cecb35 100644
--- a/drivers/nvdimm/core.c
+++ b/drivers/nvdimm/core.c
@@ -504,7 +504,7 @@ void nvdimm_badblocks_populate(struct nd_region *nd_region,
struct nvdimm_bus *nvdimm_bus;
struct list_head *poison_list;
- if (!is_nd_pmem(&nd_region->dev)) {
+ if (!is_memory(&nd_region->dev)) {
dev_WARN_ONCE(&nd_region->dev, 1,
"%s only valid for pmem regions\n", __func__);
return;
diff --git a/drivers/nvdimm/dax_devs.c b/drivers/nvdimm/dax_devs.c
index 45fa82cae87c..6a92b84c8072 100644
--- a/drivers/nvdimm/dax_devs.c
+++ b/drivers/nvdimm/dax_devs.c
@@ -89,7 +89,7 @@ struct device *nd_dax_create(struct nd_region *nd_region)
struct device *dev = NULL;
struct nd_dax *nd_dax;
- if (!is_nd_pmem(&nd_region->dev))
+ if (!is_memory(&nd_region->dev))
return NULL;
nd_dax = nd_dax_alloc(nd_region);
diff --git a/drivers/nvdimm/dimm_devs.c b/drivers/nvdimm/dimm_devs.c
index 8b721321be5b..d814e5adab5c 100644
--- a/drivers/nvdimm/dimm_devs.c
+++ b/drivers/nvdimm/dimm_devs.c
@@ -403,7 +403,7 @@ int alias_dpa_busy(struct device *dev, void *data)
struct resource *res;
int i;
- if (!is_nd_pmem(dev))
+ if (!is_memory(dev))
return 0;
nd_region = to_nd_region(dev);
diff --git a/drivers/nvdimm/namespace_devs.c b/drivers/nvdimm/namespace_devs.c
index 2580f6655bea..4fa3c1fce6bf 100644
--- a/drivers/nvdimm/namespace_devs.c
+++ b/drivers/nvdimm/namespace_devs.c
@@ -111,7 +111,7 @@ static int is_uuid_busy(struct device *dev, void *data)
static int is_namespace_uuid_busy(struct device *dev, void *data)
{
- if (is_nd_pmem(dev) || is_nd_blk(dev))
+ if (is_nd_region(dev))
return device_for_each_child(dev, data, is_uuid_busy);
return 0;
}
@@ -786,7 +786,7 @@ static int __reserve_free_pmem(struct device *dev, void *data)
struct nd_label_id label_id;
int i;
- if (!is_nd_pmem(dev))
+ if (!is_memory(dev))
return 0;
nd_region = to_nd_region(dev);
@@ -1875,7 +1875,7 @@ static struct device *nd_namespace_pmem_create(struct nd_region *nd_region)
struct resource *res;
struct device *dev;
- if (!is_nd_pmem(&nd_region->dev))
+ if (!is_memory(&nd_region->dev))
return NULL;
nspm = kzalloc(sizeof(*nspm), GFP_KERNEL);
@@ -2155,7 +2155,7 @@ static struct device **scan_labels(struct nd_region *nd_region)
}
dev->parent = &nd_region->dev;
devs[count++] = dev;
- } else if (is_nd_pmem(&nd_region->dev)) {
+ } else if (is_memory(&nd_region->dev)) {
/* clean unselected labels */
for (i = 0; i < nd_region->ndr_mappings; i++) {
struct list_head *l, *e;
diff --git a/drivers/nvdimm/nd-core.h b/drivers/nvdimm/nd-core.h
index 8623e57c2ce3..f26263cc8a36 100644
--- a/drivers/nvdimm/nd-core.h
+++ b/drivers/nvdimm/nd-core.h
@@ -63,7 +63,16 @@ struct blk_alloc_info {
bool is_nvdimm(struct device *dev);
bool is_nd_pmem(struct device *dev);
+bool is_nd_volatile(struct device *dev);
bool is_nd_blk(struct device *dev);
+static inline bool is_nd_region(struct device *dev)
+{
+ return is_nd_pmem(dev) || is_nd_blk(dev) || is_nd_volatile(dev);
+}
+static inline bool is_memory(struct device *dev)
+{
+ return is_nd_pmem(dev) || is_nd_volatile(dev);
+}
struct nvdimm_bus *walk_to_nvdimm_bus(struct device *nd_dev);
int __init nvdimm_bus_init(void);
void nvdimm_bus_exit(void);
diff --git a/drivers/nvdimm/pfn_devs.c b/drivers/nvdimm/pfn_devs.c
index 6c033c9a2f06..7cc77684316b 100644
--- a/drivers/nvdimm/pfn_devs.c
+++ b/drivers/nvdimm/pfn_devs.c
@@ -331,7 +331,7 @@ struct device *nd_pfn_create(struct nd_region *nd_region)
struct nd_pfn *nd_pfn;
struct device *dev;
- if (!is_nd_pmem(&nd_region->dev))
+ if (!is_memory(&nd_region->dev))
return NULL;
nd_pfn = nd_pfn_alloc(nd_region);
@@ -354,7 +354,7 @@ int nd_pfn_validate(struct nd_pfn *nd_pfn, const char *sig)
if (!pfn_sb || !ndns)
return -ENODEV;
- if (!is_nd_pmem(nd_pfn->dev.parent))
+ if (!is_memory(nd_pfn->dev.parent))
return -ENODEV;
if (nvdimm_read_bytes(ndns, SZ_4K, pfn_sb, sizeof(*pfn_sb)))
diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
index 5976f6c0407f..2df259010720 100644
--- a/drivers/nvdimm/region_devs.c
+++ b/drivers/nvdimm/region_devs.c
@@ -168,6 +168,11 @@ bool is_nd_blk(struct device *dev)
return dev ? dev->type == &nd_blk_device_type : false;
}
+bool is_nd_volatile(struct device *dev)
+{
+ return dev ? dev->type == &nd_volatile_device_type : false;
+}
+
struct nd_region *to_nd_region(struct device *dev)
{
struct nd_region *nd_region = container_of(dev, struct nd_region, dev);
@@ -214,7 +219,7 @@ EXPORT_SYMBOL_GPL(nd_blk_region_set_provider_data);
*/
int nd_region_to_nstype(struct nd_region *nd_region)
{
- if (is_nd_pmem(&nd_region->dev)) {
+ if (is_memory(&nd_region->dev)) {
u16 i, alias;
for (i = 0, alias = 0; i < nd_region->ndr_mappings; i++) {
@@ -242,7 +247,7 @@ static ssize_t size_show(struct device *dev,
struct nd_region *nd_region = to_nd_region(dev);
unsigned long long size = 0;
- if (is_nd_pmem(dev)) {
+ if (is_memory(dev)) {
size = nd_region->ndr_size;
} else if (nd_region->ndr_mappings == 1) {
struct nd_mapping *nd_mapping = &nd_region->mapping[0];
@@ -278,7 +283,7 @@ static ssize_t set_cookie_show(struct device *dev,
struct nd_region *nd_region = to_nd_region(dev);
struct nd_interleave_set *nd_set = nd_region->nd_set;
- if (is_nd_pmem(dev) && nd_set)
+ if (is_memory(dev) && nd_set)
/* pass, should be precluded by region_visible */;
else
return -ENXIO;
@@ -305,7 +310,7 @@ resource_size_t nd_region_available_dpa(struct nd_region *nd_region)
if (!ndd)
return 0;
- if (is_nd_pmem(&nd_region->dev)) {
+ if (is_memory(&nd_region->dev)) {
available += nd_pmem_available_dpa(nd_region,
nd_mapping, &overlap);
if (overlap > blk_max_overlap) {
@@ -469,10 +474,10 @@ static umode_t region_visible(struct kobject *kobj, struct attribute *a, int n)
struct nd_interleave_set *nd_set = nd_region->nd_set;
int type = nd_region_to_nstype(nd_region);
- if (!is_nd_pmem(dev) && a == &dev_attr_pfn_seed.attr)
+ if (!is_memory(dev) && a == &dev_attr_pfn_seed.attr)
return 0;
- if (!is_nd_pmem(dev) && a == &dev_attr_dax_seed.attr)
+ if (!is_memory(dev) && a == &dev_attr_dax_seed.attr)
return 0;
if (a != &dev_attr_set_cookie.attr
@@ -483,7 +488,7 @@ static umode_t region_visible(struct kobject *kobj, struct attribute *a, int n)
|| type == ND_DEVICE_NAMESPACE_BLK)
&& a == &dev_attr_available_size.attr)
return a->mode;
- else if (is_nd_pmem(dev) && nd_set)
+ else if (is_memory(dev) && nd_set)
return a->mode;
return 0;
@@ -535,7 +540,7 @@ static void nd_region_notify_driver_action(struct nvdimm_bus *nvdimm_bus,
{
struct nd_region *nd_region;
- if (!probe && (is_nd_pmem(dev) || is_nd_blk(dev))) {
+ if (!probe && is_nd_region(dev)) {
int i;
nd_region = to_nd_region(dev);
@@ -553,12 +558,8 @@ static void nd_region_notify_driver_action(struct nvdimm_bus *nvdimm_bus,
if (ndd)
atomic_dec(&nvdimm->busy);
}
-
- if (is_nd_pmem(dev))
- return;
}
- if (dev->parent && (is_nd_blk(dev->parent) || is_nd_pmem(dev->parent))
- && probe) {
+ if (dev->parent && is_nd_region(dev->parent) && probe) {
nd_region = to_nd_region(dev->parent);
nvdimm_bus_lock(dev);
if (nd_region->ns_seed == dev)
The pmem and nd_blk drivers both need to copy data through the cpu
cache to persistent memory. To date they have been abusing
__copy_user_nocache through the memcpy_to_pmem abstraction, but this has
several problems:
* __copy_user_nocache does not guarantee that it will always avoid the
cache. While we have fixed the cases where the pmem usage might
trigger that behavior it's a fragile assumption and burdens the
uaccess.h implementation with worrying about the distinction between
'nocache' and the stricter write-through semantic needed by pmem.
Quoting Linus: "Quite frankly, the whole "memcpy_nocache()" idea or
(ab-)using copy_user_nocache() just needs to die. ... If some driver
ends up using "movnt" by hand, that is up to that *driver*."
* It implements SMAP (supervisor mode access protection) which is only
meant for user copies.
* It expects faults. For in-kernel copies, faults are fatal and we
should not be coding for exception handling in that case.
arch_memcpy_to_pmem() is effectively a copy of __copy_user_nocache()
minus SMAP, unaligned support, and exception handling. The configuration
symbol ARCH_HAS_PMEM_API is also moved local to libnvdimm to be next to
the implementation.
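The way the copy is partitioned can be sketched in portable C. This is
not the implementation from this patch: plain memcpy() stands in for
both the movnti stores and the cached copy plus write-back, and the
sketch_* names and the statistics structure are hypothetical, added
only to show which bytes take which path.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical bookkeeping: bytes handled by each stage of the copy. */
struct sketch_copy_stats {
	size_t head_cached;  /* cached copy + write-back to align dest to 8 */
	size_t nt_32;        /* bytes moved by the 4x8 non-temporal loop */
	size_t nt_8;         /* bytes moved by the 1x8 non-temporal loop */
	size_t nt_4;         /* bytes moved by the 1x4 non-temporal loop */
	size_t tail_cached;  /* cached copy + write-back for the remainder */
};

static struct sketch_copy_stats
sketch_memcpy_to_pmem(void *dst, const void *src, size_t size)
{
	struct sketch_copy_stats st = { 0, 0, 0, 0, 0 };
	unsigned char *d = dst;
	const unsigned char *s = src;
	size_t misalign;

	assert(dst != NULL && src != NULL);

	/* Stage 1: bring the destination up to 8-byte alignment. */
	misalign = (uintptr_t)d & 7;
	if (misalign) {
		size_t len = 8 - misalign;

		if (len > size)
			len = size;
		memcpy(d, s, len);  /* real code also writes back the cache */
		st.head_cached = len;
		d += len; s += len; size -= len;
	}

	/* Stage 2: bulk of the copy in 32-byte groups. */
	while (size >= 32) {
		memcpy(d, s, 32);   /* stands in for four 8-byte movnti stores */
		st.nt_32 += 32;
		d += 32; s += 32; size -= 32;
	}

	/* Stage 3: remaining 8-byte words. */
	while (size >= 8) {
		memcpy(d, s, 8);
		st.nt_8 += 8;
		d += 8; s += 8; size -= 8;
	}

	/* Stage 4: a remaining 4-byte word. */
	while (size >= 4) {
		memcpy(d, s, 4);
		st.nt_4 += 4;
		d += 4; s += 4; size -= 4;
	}

	/* Stage 5: fewer than 4 bytes left go through the cache. */
	if (size) {
		memcpy(d, s, size); /* real code also writes back the cache */
		st.tail_cached = size;
	}
	return st;
}
```

For example, a 60-byte copy to a destination that is 3 bytes past an
8-byte boundary splits into 5 head bytes, 32 + 16 + 4 bytes in the
three non-temporal loops, and 3 tail bytes.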
Cc: <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Jeff Moyer <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Toshi Kani <[email protected]>
Cc: Tony Luck <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Oliver O'Halloran <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Ross Zwisler <[email protected]>
Cc: Linus Torvalds <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
MAINTAINERS | 2 -
arch/x86/Kconfig | 1 -
arch/x86/include/asm/pmem.h | 48 -----------------------------
drivers/acpi/nfit/core.c | 3 +-
drivers/nvdimm/Kconfig | 4 ++
drivers/nvdimm/claim.c | 4 +-
drivers/nvdimm/namespace_devs.c | 1 -
drivers/nvdimm/pmem.c | 4 +-
drivers/nvdimm/region_devs.c | 1 -
drivers/nvdimm/x86.c | 65 +++++++++++++++++++++++++++++++++++++++
fs/dax.c | 1 -
include/linux/libnvdimm.h | 9 +++++
include/linux/pmem.h | 59 -----------------------------------
lib/Kconfig | 3 --
14 files changed, 83 insertions(+), 122 deletions(-)
delete mode 100644 arch/x86/include/asm/pmem.h
delete mode 100644 include/linux/pmem.h
diff --git a/MAINTAINERS b/MAINTAINERS
index 819d5e8b668a..1c4da1bebd7c 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -7458,8 +7458,6 @@ L: [email protected]
Q: https://patchwork.kernel.org/project/linux-nvdimm/list/
S: Supported
F: drivers/nvdimm/pmem.c
-F: include/linux/pmem.h
-F: arch/*/include/asm/pmem.h
LIGHTNVM PLATFORM SUPPORT
M: Matias Bjorling <[email protected]>
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index cc98d5a294ee..d377da696903 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -53,7 +53,6 @@ config X86
select ARCH_HAS_GCOV_PROFILE_ALL
select ARCH_HAS_KCOV if X86_64
select ARCH_HAS_MMIO_FLUSH
- select ARCH_HAS_PMEM_API if X86_64
select ARCH_HAS_SET_MEMORY
select ARCH_HAS_SG_CHAIN
select ARCH_HAS_STRICT_KERNEL_RWX
diff --git a/arch/x86/include/asm/pmem.h b/arch/x86/include/asm/pmem.h
deleted file mode 100644
index ded2541a7ba9..000000000000
--- a/arch/x86/include/asm/pmem.h
+++ /dev/null
@@ -1,48 +0,0 @@
-/*
- * Copyright(c) 2015 Intel Corporation. All rights reserved.
- *
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of version 2 of the GNU General Public License as
- * published by the Free Software Foundation.
- *
- * This program is distributed in the hope that it will be useful, but
- * WITHOUT ANY WARRANTY; without even the implied warranty of
- * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
- * General Public License for more details.
- */
-#ifndef __ASM_X86_PMEM_H__
-#define __ASM_X86_PMEM_H__
-
-#include <linux/uaccess.h>
-#include <asm/cacheflush.h>
-#include <asm/cpufeature.h>
-#include <asm/special_insns.h>
-
-#ifdef CONFIG_ARCH_HAS_PMEM_API
-/**
- * arch_memcpy_to_pmem - copy data to persistent memory
- * @dst: destination buffer for the copy
- * @src: source buffer for the copy
- * @n: length of the copy in bytes
- *
- * Copy data to persistent memory media via non-temporal stores so that
- * a subsequent pmem driver flush operation will drain posted write queues.
- */
-static inline void arch_memcpy_to_pmem(void *dst, const void *src, size_t n)
-{
- int rem;
-
- /*
- * We are copying between two kernel buffers, if
- * __copy_from_user_inatomic_nocache() returns an error (page
- * fault) we would have already reported a general protection fault
- * before the WARN+BUG.
- */
- rem = __copy_from_user_inatomic_nocache(dst, (void __user *) src, n);
- if (WARN(rem, "%s: fault copying %p <- %p unwritten: %d\n",
- __func__, dst, src, rem))
- BUG();
-}
-
-#endif /* CONFIG_ARCH_HAS_PMEM_API */
-#endif /* __ASM_X86_PMEM_H__ */
diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c
index d0c07b2344e4..8b4c6212737c 100644
--- a/drivers/acpi/nfit/core.c
+++ b/drivers/acpi/nfit/core.c
@@ -20,7 +20,6 @@
#include <linux/list.h>
#include <linux/acpi.h>
#include <linux/sort.h>
-#include <linux/pmem.h>
#include <linux/io.h>
#include <linux/nd.h>
#include <asm/cacheflush.h>
@@ -1776,7 +1775,7 @@ static int acpi_nfit_blk_single_io(struct nfit_blk *nfit_blk,
}
if (rw)
- memcpy_to_pmem(mmio->addr.aperture + offset,
+ arch_memcpy_to_pmem(mmio->addr.aperture + offset,
iobuf + copied, c);
else {
if (nfit_blk->dimm_flags & NFIT_BLK_READ_FLUSH)
diff --git a/drivers/nvdimm/Kconfig b/drivers/nvdimm/Kconfig
index 5bdd499b5f4f..4d45196d6f94 100644
--- a/drivers/nvdimm/Kconfig
+++ b/drivers/nvdimm/Kconfig
@@ -36,6 +36,10 @@ config BLK_DEV_PMEM
Say Y if you want to use an NVDIMM
+config ARCH_HAS_PMEM_API
+ depends on X86_64
+ def_bool y
+
config ND_BLK
tristate "BLK: Block data window (aperture) device support"
default LIBNVDIMM
diff --git a/drivers/nvdimm/claim.c b/drivers/nvdimm/claim.c
index 1e13a196ce4b..0b222c8ce1d8 100644
--- a/drivers/nvdimm/claim.c
+++ b/drivers/nvdimm/claim.c
@@ -10,9 +10,9 @@
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
* General Public License for more details.
*/
+#include <linux/libnvdimm.h>
#include <linux/device.h>
#include <linux/sizes.h>
-#include <linux/pmem.h>
#include "nd-core.h"
#include "pmem.h"
#include "pfn.h"
@@ -267,7 +267,7 @@ static int nsio_rw_bytes(struct nd_namespace_common *ndns,
rc = -EIO;
}
- memcpy_to_pmem(nsio->addr + offset, buf, size);
+ arch_memcpy_to_pmem(nsio->addr + offset, buf, size);
nvdimm_flush(to_nd_region(ndns->dev.parent));
return rc;
diff --git a/drivers/nvdimm/namespace_devs.c b/drivers/nvdimm/namespace_devs.c
index 1b481a5fb966..2580f6655bea 100644
--- a/drivers/nvdimm/namespace_devs.c
+++ b/drivers/nvdimm/namespace_devs.c
@@ -14,7 +14,6 @@
#include <linux/device.h>
#include <linux/sort.h>
#include <linux/slab.h>
-#include <linux/pmem.h>
#include <linux/list.h>
#include <linux/nd.h>
#include "nd-core.h"
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 769a510c20e8..329895ca88e1 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -27,8 +27,8 @@
#include <linux/vmalloc.h>
#include <linux/pfn_t.h>
#include <linux/slab.h>
-#include <linux/pmem.h>
#include <linux/dax.h>
+#include <linux/uio.h>
#include <linux/nd.h>
#include "pmem.h"
#include "pfn.h"
@@ -79,7 +79,7 @@ static void write_pmem(void *pmem_addr, struct page *page,
{
void *mem = kmap_atomic(page);
- memcpy_to_pmem(pmem_addr, mem + off, len);
+ arch_memcpy_to_pmem(pmem_addr, mem + off, len);
kunmap_atomic(mem);
}
diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
index b7cb5066d961..307a48060aa3 100644
--- a/drivers/nvdimm/region_devs.c
+++ b/drivers/nvdimm/region_devs.c
@@ -15,7 +15,6 @@
#include <linux/sched.h>
#include <linux/slab.h>
#include <linux/hash.h>
-#include <linux/pmem.h>
#include <linux/sort.h>
#include <linux/io.h>
#include <linux/nd.h>
diff --git a/drivers/nvdimm/x86.c b/drivers/nvdimm/x86.c
index 07478ed7ce97..d99b452332a9 100644
--- a/drivers/nvdimm/x86.c
+++ b/drivers/nvdimm/x86.c
@@ -40,3 +40,68 @@ void arch_invalidate_pmem(void *addr, size_t size)
clflush_cache_range(addr, size);
}
EXPORT_SYMBOL_GPL(arch_invalidate_pmem);
+
+void arch_memcpy_to_pmem(void *_dst, void *_src, unsigned size)
+{
+ unsigned long dest = (unsigned long) _dst;
+ unsigned long source = (unsigned long) _src;
+
+ /* cache copy and flush to align dest */
+ if (!IS_ALIGNED(dest, 8)) {
+ unsigned len = min_t(unsigned, size, ALIGN(dest, 8) - dest);
+
+ memcpy((void *) dest, (void *) source, len);
+ arch_wb_cache_pmem((void *) dest, len);
+ dest += len;
+ source += len;
+ size -= len;
+ if (!size)
+ return;
+ }
+
+ /* 4x8 movnti loop */
+ while (size >= 32) {
+ asm("movq (%0), %%r8\n"
+ "movq 8(%0), %%r9\n"
+ "movq 16(%0), %%r10\n"
+ "movq 24(%0), %%r11\n"
+ "movnti %%r8, (%1)\n"
+ "movnti %%r9, 8(%1)\n"
+ "movnti %%r10, 16(%1)\n"
+ "movnti %%r11, 24(%1)\n"
+ :: "r" (source), "r" (dest)
+ : "memory", "r8", "r9", "r10", "r11");
+ dest += 32;
+ source += 32;
+ size -= 32;
+ }
+
+ /* 1x8 movnti loop */
+ while (size >= 8) {
+ asm("movq (%0), %%r8\n"
+ "movnti %%r8, (%1)\n"
+ :: "r" (source), "r" (dest)
+ : "memory", "r8");
+ dest += 8;
+ source += 8;
+ size -= 8;
+ }
+
+ /* 1x4 movnti loop */
+ while (size >= 4) {
+ asm("movl (%0), %%r8d\n"
+ "movnti %%r8d, (%1)\n"
+ :: "r" (source), "r" (dest)
+ : "memory", "r8");
+ dest += 4;
+ source += 4;
+ size -= 4;
+ }
+
+ /* cache copy for remaining bytes */
+ if (size) {
+ memcpy((void *) dest, (void *) source, size);
+ arch_wb_cache_pmem((void *) dest, size);
+ }
+}
+EXPORT_SYMBOL_GPL(arch_memcpy_to_pmem);
diff --git a/fs/dax.c b/fs/dax.c
index edee7e8298bc..f37ed21e4093 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -25,7 +25,6 @@
#include <linux/mm.h>
#include <linux/mutex.h>
#include <linux/pagevec.h>
-#include <linux/pmem.h>
#include <linux/sched.h>
#include <linux/sched/signal.h>
#include <linux/uio.h>
diff --git a/include/linux/libnvdimm.h b/include/linux/libnvdimm.h
index 77e7af32543f..a98004745768 100644
--- a/include/linux/libnvdimm.h
+++ b/include/linux/libnvdimm.h
@@ -162,4 +162,13 @@ void nd_region_release_lane(struct nd_region *nd_region, unsigned int lane);
u64 nd_fletcher64(void *addr, size_t len, bool le);
void nvdimm_flush(struct nd_region *nd_region);
int nvdimm_has_flush(struct nd_region *nd_region);
+#ifdef CONFIG_ARCH_HAS_PMEM_API
+void arch_memcpy_to_pmem(void *dst, void *src, unsigned size);
+#define ARCH_MEMREMAP_PMEM MEMREMAP_WB
+#else
+static inline void arch_memcpy_to_pmem(void *dst, void *src, unsigned size)
+{
+}
+#define ARCH_MEMREMAP_PMEM MEMREMAP_WT
+#endif /* CONFIG_ARCH_HAS_PMEM_API */
#endif /* __LIBNVDIMM_H__ */
diff --git a/include/linux/pmem.h b/include/linux/pmem.h
deleted file mode 100644
index 559c00848583..000000000000
--- a/include/linux/pmem.h
+++ /dev/null
@@ -1,59 +0,0 @@
-/*
- * Copyright(c) 2015 Intel Corporation. All rights reserved.
- *
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of version 2 of the GNU General Public License as
- * published by the Free Software Foundation.
- *
- * This program is distributed in the hope that it will be useful, but
- * WITHOUT ANY WARRANTY; without even the implied warranty of
- * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
- * General Public License for more details.
- */
-#ifndef __PMEM_H__
-#define __PMEM_H__
-
-#include <linux/io.h>
-#include <linux/uio.h>
-
-#ifdef CONFIG_ARCH_HAS_PMEM_API
-#define ARCH_MEMREMAP_PMEM MEMREMAP_WB
-#include <asm/pmem.h>
-#else
-#define ARCH_MEMREMAP_PMEM MEMREMAP_WT
-/*
- * These are simply here to enable compilation, all call sites gate
- * calling these symbols with arch_has_pmem_api() and redirect to the
- * implementation in asm/pmem.h.
- */
-static inline void arch_memcpy_to_pmem(void *dst, const void *src, size_t n)
-{
- BUG();
-}
-#endif
-
-static inline bool arch_has_pmem_api(void)
-{
- return IS_ENABLED(CONFIG_ARCH_HAS_PMEM_API);
-}
-
-/**
- * memcpy_to_pmem - copy data to persistent memory
- * @dst: destination buffer for the copy
- * @src: source buffer for the copy
- * @n: length of the copy in bytes
- *
- * Perform a memory copy that results in the destination of the copy
- * being effectively evicted from, or never written to, the processor
- * cache hierarchy after the copy completes. After memcpy_to_pmem()
- * data may still reside in cpu or platform buffers, so this operation
- * must be followed by a blkdev_issue_flush() on the pmem block device.
- */
-static inline void memcpy_to_pmem(void *dst, const void *src, size_t n)
-{
- if (arch_has_pmem_api())
- arch_memcpy_to_pmem(dst, src, n);
- else
- memcpy(dst, src, n);
-}
-#endif /* __PMEM_H__ */
diff --git a/lib/Kconfig b/lib/Kconfig
index 0c8b78a9ae2e..0c4aac6ef394 100644
--- a/lib/Kconfig
+++ b/lib/Kconfig
@@ -545,9 +545,6 @@ config SG_POOL
config ARCH_HAS_SG_CHAIN
def_bool n
-config ARCH_HAS_PMEM_API
- bool
-
config ARCH_HAS_MMIO_FLUSH
bool
Introduce copy_from_iter_ops() to enable passing custom sub-routines to
iterate_and_advance(). Define pmem operations that guarantee cache
bypass to supplement the existing usage of __copy_from_iter_nocache()
backed by arch_wb_cache_pmem().
Cc: Jan Kara <[email protected]>
Cc: Jeff Moyer <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Toshi Kani <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Ross Zwisler <[email protected]>
Cc: Linus Torvalds <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
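For reviewers unfamiliar with iterate_and_advance(), the idea in the
changelog above can be sketched in plain userspace C. This is only an
illustration under stated assumptions: struct seg, struct copy_ops and
copy_from_segs_ops() are hypothetical stand-ins for the iov_iter segment
kinds (iovec, bvec, kvec) and for copy_from_iter_ops(); none of them
exist in the kernel. The point is that the caller supplies one copy
routine per segment kind and the walker only dispatches.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for the iterator's segment kinds. */
enum seg_kind { SEG_USER, SEG_PAGE, SEG_KERNEL };

struct seg {
	enum seg_kind kind;
	const void *base;
	size_t len;
};

/* Hypothetical callback table: one copy routine per segment kind. */
struct copy_ops {
	void (*user)(void *dst, const void *src, size_t len);
	void (*page)(void *dst, const void *src, size_t len);
	void (*kernel)(void *dst, const void *src, size_t len);
};

/* Walk the segments, dispatch on kind, and return the bytes copied. */
static size_t copy_from_segs_ops(void *addr, size_t bytes,
				 const struct seg *segs, size_t nr_segs,
				 const struct copy_ops *ops)
{
	char *to = addr;
	size_t copied = 0;

	assert(addr && segs && ops);
	for (size_t n = 0; n < nr_segs && copied < bytes; n++) {
		size_t len = segs[n].len;

		/* Clamp the last segment to the requested byte count. */
		if (len > bytes - copied)
			len = bytes - copied;
		switch (segs[n].kind) {
		case SEG_USER:
			ops->user(to + copied, segs[n].base, len);
			break;
		case SEG_PAGE:
			ops->page(to + copied, segs[n].base, len);
			break;
		case SEG_KERNEL:
			ops->kernel(to + copied, segs[n].base, len);
			break;
		}
		copied += len;
	}
	return copied;
}

/* Example callback: a cached copy that also counts its invocations. */
static int plain_calls;

static void plain_copy(void *dst, const void *src, size_t len)
{
	memcpy(dst, src, len);
	plain_calls++;
}
```

In the patch itself the three callbacks are pmem_from_user(),
pmem_from_page() and arch_memcpy_to_pmem(), so each segment kind gets a
routine that bypasses or writes back the cache.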
drivers/nvdimm/Kconfig | 1 +
drivers/nvdimm/pmem.c | 38 +-------------------------------------
drivers/nvdimm/pmem.h | 7 +++++++
drivers/nvdimm/x86.c | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
include/linux/uio.h | 4 ++++
lib/Kconfig | 3 +++
lib/iov_iter.c | 25 +++++++++++++++++++++++++
7 files changed, 89 insertions(+), 37 deletions(-)
diff --git a/drivers/nvdimm/Kconfig b/drivers/nvdimm/Kconfig
index 4d45196d6f94..28002298cdc8 100644
--- a/drivers/nvdimm/Kconfig
+++ b/drivers/nvdimm/Kconfig
@@ -38,6 +38,7 @@ config BLK_DEV_PMEM
config ARCH_HAS_PMEM_API
depends on X86_64
+ select COPY_FROM_ITER_OPS
def_bool y
config ND_BLK
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 329895ca88e1..b000c6db5731 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -223,43 +223,7 @@ __weak long __pmem_direct_access(struct pmem_device *pmem, pgoff_t pgoff,
static size_t pmem_copy_from_iter(struct dax_device *dax_dev, pgoff_t pgoff,
void *addr, size_t bytes, struct iov_iter *i)
{
- size_t len;
-
- /* TODO: skip the write-back by always using non-temporal stores */
- len = copy_from_iter_nocache(addr, bytes, i);
-
- /*
- * In the iovec case on x86_64 copy_from_iter_nocache() uses
- * non-temporal stores for the bulk of the transfer, but we need
- * to manually flush if the transfer is unaligned. A cached
- * memory copy is used when destination or size is not naturally
- * aligned. That is:
- * - Require 8-byte alignment when size is 8 bytes or larger.
- * - Require 4-byte alignment when size is 4 bytes.
- *
- * In the non-iovec case the entire destination needs to be
- * flushed.
- */
- if (iter_is_iovec(i)) {
- unsigned long flushed, dest = (unsigned long) addr;
-
- if (bytes < 8) {
- if (!IS_ALIGNED(dest, 4) || (bytes != 4))
- arch_wb_cache_pmem(addr, 1);
- } else {
- if (!IS_ALIGNED(dest, 8)) {
- dest = ALIGN(dest, boot_cpu_data.x86_clflush_size);
- arch_wb_cache_pmem(addr, 1);
- }
-
- flushed = dest - (unsigned long) addr;
- if (bytes > flushed && !IS_ALIGNED(bytes - flushed, 8))
- arch_wb_cache_pmem(addr + bytes - 1, 1);
- }
- } else
- arch_wb_cache_pmem(addr, bytes);
-
- return len;
+ return arch_copy_from_iter_pmem(addr, bytes, i);
}
static const struct block_device_operations pmem_fops = {
diff --git a/drivers/nvdimm/pmem.h b/drivers/nvdimm/pmem.h
index 00005900c1b7..574b63fb5376 100644
--- a/drivers/nvdimm/pmem.h
+++ b/drivers/nvdimm/pmem.h
@@ -3,11 +3,13 @@
#include <linux/badblocks.h>
#include <linux/types.h>
#include <linux/pfn_t.h>
+#include <linux/uio.h>
#include <linux/fs.h>
#ifdef CONFIG_ARCH_HAS_PMEM_API
void arch_wb_cache_pmem(void *addr, size_t size);
void arch_invalidate_pmem(void *addr, size_t size);
+size_t arch_copy_from_iter_pmem(void *addr, size_t bytes, struct iov_iter *i);
#else
static inline void arch_wb_cache_pmem(void *addr, size_t size)
{
@@ -15,6 +17,11 @@ static inline void arch_wb_cache_pmem(void *addr, size_t size)
static inline void arch_invalidate_pmem(void *addr, size_t size)
{
}
+static inline size_t arch_copy_from_iter_pmem(void *addr, size_t bytes,
+ struct iov_iter *i)
+{
+ return copy_from_iter_nocache(addr, bytes, i);
+}
#endif
/* this definition is in it's own header for tools/testing/nvdimm to consume */
diff --git a/drivers/nvdimm/x86.c b/drivers/nvdimm/x86.c
index d99b452332a9..bc145d760d43 100644
--- a/drivers/nvdimm/x86.c
+++ b/drivers/nvdimm/x86.c
@@ -10,6 +10,9 @@
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
* General Public License for more details.
*/
+#include <linux/uio.h>
+#include <linux/uaccess.h>
+#include <linux/highmem.h>
#include <asm/cacheflush.h>
#include <asm/cpufeature.h>
#include <asm/special_insns.h>
@@ -105,3 +108,48 @@ void arch_memcpy_to_pmem(void *_dst, void *_src, unsigned size)
}
}
EXPORT_SYMBOL_GPL(arch_memcpy_to_pmem);
+
+static int pmem_from_user(void *dst, const void __user *src, unsigned size)
+{
+ unsigned long flushed, dest = (unsigned long) dst;
+ int rc = __copy_from_user_nocache(dst, src, size);
+
+ /*
+ * On x86_64 __copy_from_user_nocache() uses non-temporal stores
+ * for the bulk of the transfer, but we need to manually flush
+ * if the transfer is unaligned. A cached memory copy is used
+ * when destination or size is not naturally aligned. That is:
+ * - Require 8-byte alignment when size is 8 bytes or larger.
+ * - Require 4-byte alignment when size is 4 bytes.
+ */
+ if (size < 8) {
+ if (!IS_ALIGNED(dest, 4) || size != 4)
+ arch_wb_cache_pmem(dst, 1);
+ } else {
+ if (!IS_ALIGNED(dest, 8)) {
+ dest = ALIGN(dest, boot_cpu_data.x86_clflush_size);
+ arch_wb_cache_pmem(dst, 1);
+ }
+
+ flushed = dest - (unsigned long) dst;
+ if (size > flushed && !IS_ALIGNED(size - flushed, 8))
+ arch_wb_cache_pmem(dst + size - 1, 1);
+ }
+
+ return rc;
+}
+
+static void pmem_from_page(char *to, struct page *page, size_t offset, size_t len)
+{
+ char *from = kmap_atomic(page);
+
+ arch_memcpy_to_pmem(to, from + offset, len);
+ kunmap_atomic(from);
+}
+
+size_t arch_copy_from_iter_pmem(void *addr, size_t bytes, struct iov_iter *i)
+{
+ return copy_from_iter_ops(addr, bytes, i, pmem_from_user, pmem_from_page,
+ arch_memcpy_to_pmem);
+}
+EXPORT_SYMBOL_GPL(arch_copy_from_iter_pmem);
diff --git a/include/linux/uio.h b/include/linux/uio.h
index 804e34c6f981..edb78f3fe2c8 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -91,6 +91,10 @@ size_t copy_to_iter(const void *addr, size_t bytes, struct iov_iter *i);
size_t copy_from_iter(void *addr, size_t bytes, struct iov_iter *i);
bool copy_from_iter_full(void *addr, size_t bytes, struct iov_iter *i);
size_t copy_from_iter_nocache(void *addr, size_t bytes, struct iov_iter *i);
+size_t copy_from_iter_ops(void *addr, size_t bytes, struct iov_iter *i,
+ int (*user)(void *, const void __user *, unsigned),
+ void (*page)(char *, struct page *, size_t, size_t),
+ void (*copy)(void *, void *, unsigned));
bool copy_from_iter_full_nocache(void *addr, size_t bytes, struct iov_iter *i);
size_t iov_iter_zero(size_t bytes, struct iov_iter *);
unsigned long iov_iter_alignment(const struct iov_iter *i);
diff --git a/lib/Kconfig b/lib/Kconfig
index 0c4aac6ef394..4d8f575e65b3 100644
--- a/lib/Kconfig
+++ b/lib/Kconfig
@@ -404,6 +404,9 @@ config DMA_VIRT_OPS
depends on HAS_DMA && (!64BIT || ARCH_DMA_ADDR_T_64BIT)
default n
+config COPY_FROM_ITER_OPS
+ bool
+
config CHECK_SIGNATURE
bool
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index e68604ae3ced..85f8021504e3 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -571,6 +571,31 @@ size_t copy_from_iter(void *addr, size_t bytes, struct iov_iter *i)
}
EXPORT_SYMBOL(copy_from_iter);
+#ifdef CONFIG_COPY_FROM_ITER_OPS
+size_t copy_from_iter_ops(void *addr, size_t bytes, struct iov_iter *i,
+ int (*user)(void *, const void __user *, unsigned),
+ void (*page)(char *, struct page *, size_t, size_t),
+ void (*copy)(void *, void *, unsigned))
+{
+ char *to = addr;
+
+ if (unlikely(i->type & ITER_PIPE)) {
+ WARN_ON(1);
+ return 0;
+ }
+ iterate_and_advance(i, bytes, v,
+ user((to += v.iov_len) - v.iov_len, v.iov_base,
+ v.iov_len),
+ page((to += v.bv_len) - v.bv_len, v.bv_page, v.bv_offset,
+ v.bv_len),
+ copy((to += v.iov_len) - v.iov_len, v.iov_base, v.iov_len)
+ )
+
+ return bytes;
+}
+EXPORT_SYMBOL_GPL(copy_from_iter_ops);
+#endif
+
bool copy_from_iter_full(void *addr, size_t bytes, struct iov_iter *i)
{
char *to = addr;
Kill the globally defined invalidate_pmem() wrapper and move
arch_invalidate_pmem() into libnvdimm so that we can ultimately remove
the public pmem api.
Cc: <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Jeff Moyer <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Ross Zwisler <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
arch/x86/include/asm/pmem.h | 4 ----
drivers/nvdimm/claim.c | 3 ++-
drivers/nvdimm/pmem.c | 2 +-
drivers/nvdimm/pmem.h | 4 ++++
drivers/nvdimm/x86.c | 6 ++++++
include/linux/pmem.h | 19 -------------------
6 files changed, 13 insertions(+), 25 deletions(-)
diff --git a/arch/x86/include/asm/pmem.h b/arch/x86/include/asm/pmem.h
index 4759a179aa52..ded2541a7ba9 100644
--- a/arch/x86/include/asm/pmem.h
+++ b/arch/x86/include/asm/pmem.h
@@ -44,9 +44,5 @@ static inline void arch_memcpy_to_pmem(void *dst, const void *src, size_t n)
BUG();
}
-static inline void arch_invalidate_pmem(void *addr, size_t size)
-{
- clflush_cache_range(addr, size);
-}
#endif /* CONFIG_ARCH_HAS_PMEM_API */
#endif /* __ASM_X86_PMEM_H__ */
diff --git a/drivers/nvdimm/claim.c b/drivers/nvdimm/claim.c
index 3a35e8028b9c..1e13a196ce4b 100644
--- a/drivers/nvdimm/claim.c
+++ b/drivers/nvdimm/claim.c
@@ -14,6 +14,7 @@
#include <linux/sizes.h>
#include <linux/pmem.h>
#include "nd-core.h"
+#include "pmem.h"
#include "pfn.h"
#include "btt.h"
#include "nd.h"
@@ -261,7 +262,7 @@ static int nsio_rw_bytes(struct nd_namespace_common *ndns,
cleared /= 512;
badblocks_clear(&nsio->bb, sector, cleared);
}
- invalidate_pmem(nsio->addr + offset, size);
+ arch_invalidate_pmem(nsio->addr + offset, size);
} else
rc = -EIO;
}
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index c77a3a757729..769a510c20e8 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -69,7 +69,7 @@ static int pmem_clear_poison(struct pmem_device *pmem, phys_addr_t offset,
badblocks_clear(&pmem->bb, sector, cleared);
}
- invalidate_pmem(pmem->virt_addr + offset, len);
+ arch_invalidate_pmem(pmem->virt_addr + offset, len);
return rc;
}
diff --git a/drivers/nvdimm/pmem.h b/drivers/nvdimm/pmem.h
index c4b3371c7f88..00005900c1b7 100644
--- a/drivers/nvdimm/pmem.h
+++ b/drivers/nvdimm/pmem.h
@@ -7,10 +7,14 @@
#ifdef CONFIG_ARCH_HAS_PMEM_API
void arch_wb_cache_pmem(void *addr, size_t size);
+void arch_invalidate_pmem(void *addr, size_t size);
#else
static inline void arch_wb_cache_pmem(void *addr, size_t size)
{
}
+static inline void arch_invalidate_pmem(void *addr, size_t size)
+{
+}
#endif
/* this definition is in it's own header for tools/testing/nvdimm to consume */
diff --git a/drivers/nvdimm/x86.c b/drivers/nvdimm/x86.c
index 79d7267da4d2..07478ed7ce97 100644
--- a/drivers/nvdimm/x86.c
+++ b/drivers/nvdimm/x86.c
@@ -34,3 +34,9 @@ void arch_wb_cache_pmem(void *addr, size_t size)
clwb(p);
}
EXPORT_SYMBOL_GPL(arch_wb_cache_pmem);
+
+void arch_invalidate_pmem(void *addr, size_t size)
+{
+ clflush_cache_range(addr, size);
+}
+EXPORT_SYMBOL_GPL(arch_invalidate_pmem);
diff --git a/include/linux/pmem.h b/include/linux/pmem.h
index 33ae761f010a..559c00848583 100644
--- a/include/linux/pmem.h
+++ b/include/linux/pmem.h
@@ -30,11 +30,6 @@ static inline void arch_memcpy_to_pmem(void *dst, const void *src, size_t n)
{
BUG();
}
-
-static inline void arch_invalidate_pmem(void *addr, size_t size)
-{
- BUG();
-}
#endif
static inline bool arch_has_pmem_api(void)
@@ -61,18 +56,4 @@ static inline void memcpy_to_pmem(void *dst, const void *src, size_t n)
else
memcpy(dst, src, n);
}
-
-/**
- * invalidate_pmem - flush a pmem range from the cache hierarchy
- * @addr: virtual start address
- * @size: bytes to invalidate (internally aligned to cache line size)
- *
- * For platforms that support clearing poison this flushes any poisoned
- * ranges out of the cache
- */
-static inline void invalidate_pmem(void *addr, size_t size)
-{
- if (arch_has_pmem_api())
- arch_invalidate_pmem(addr, size);
-}
#endif /* __PMEM_H__ */
Some platforms arrange for cpu caches to be flushed on power-fail. On
those platforms there is no requirement that the kernel track and flush
potentially dirty cache lines. Given that we still insert entries into
the radix tree for locking purposes, this patch only disables the cache
flush loop, not the dirty tracking.
Userspace can override the default cache setting via the block device
queue "write_cache" attribute in sysfs.
Cc: Jan Kara <[email protected]>
Cc: Jeff Moyer <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Ross Zwisler <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
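A minimal userspace sketch of the gating described above, for
illustration only. The demo_* names and the bit number are assumptions
made for the example; the kernel's QUEUE_FLAG_WC value and test_bit()
implementation differ. It shows that the flush is skipped when the
write-cache flag is clear, while the rest of the writeback path is
unchanged.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical bit number; not the kernel's QUEUE_FLAG_WC value. */
#define DEMO_QUEUE_FLAG_WC 3

struct demo_queue {
	unsigned long queue_flags;
};

/* Simplified, non-atomic stand-in for test_bit(). */
static bool demo_test_bit(int nr, const unsigned long *addr)
{
	return (*addr >> nr) & 1UL;
}

/* Stand-in for dax_flush(): only records how much was flushed. */
static size_t demo_flushed_bytes;

static void demo_dax_flush(void *kaddr, size_t size)
{
	(void)kaddr;
	demo_flushed_bytes += size;
}

/* Dirty tracking would happen here unconditionally; only the flush is gated. */
static void demo_writeback_one(struct demo_queue *q, void *kaddr, size_t size)
{
	assert(q);
	if (demo_test_bit(DEMO_QUEUE_FLAG_WC, &q->queue_flags))
		demo_dax_flush(kaddr, size);
}
```

With the flag clear a call to demo_writeback_one() flushes nothing; once
the flag is set the same call flushes the full range.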
fs/dax.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/fs/dax.c b/fs/dax.c
index f37ed21e4093..5b7ee1bc74d0 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -797,7 +797,8 @@ static int dax_writeback_one(struct block_device *bdev,
}
dax_mapping_entry_mkclean(mapping, index, pfn_t_to_pfn(pfn));
- dax_flush(dax_dev, pgoff, kaddr, size);
+ if (test_bit(QUEUE_FLAG_WC, &bdev->bd_queue->queue_flags))
+ dax_flush(dax_dev, pgoff, kaddr, size);
/*
* After we have flushed the cache, we can clear the dirty tag. There
* cannot be new dirty data in the pfn after the flush has completed as
@@ -982,7 +983,8 @@ int __dax_zero_page_range(struct block_device *bdev,
return rc;
}
memset(kaddr + offset, 0, size);
- dax_flush(dax_dev, pgoff, kaddr + offset, size);
+ if (test_bit(QUEUE_FLAG_WC, &bdev->bd_queue->queue_flags))
+ dax_flush(dax_dev, pgoff, kaddr + offset, size);
dax_read_unlock(id);
}
return 0;
Filesystem-DAX flushes caches whenever it writes to the address returned
through dax_direct_access() and when writing back dirty radix entries.
That flushing is only required in the pmem case, so the dax_flush()
helper skips cache management work when the underlying driver does not
specify a flush method.
We still do all the dirty tracking since the radix entry will already be
there for locking purposes. However, the work to clean the entry will be
a nop for some dax drivers.
Cc: Jan Kara <[email protected]>
Cc: Jeff Moyer <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Ross Zwisler <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
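The "skip when the driver has no flush method" behaviour described
above can be sketched as follows. This is a simplified illustration:
the demo_* types are hypothetical, and the real dax_flush() also takes
a pgoff argument, which is omitted here.

```c
#include <assert.h>
#include <stddef.h>

struct demo_dax_device;

/* Hypothetical, reduced operations table with an optional ->flush(). */
struct demo_dax_operations {
	void (*flush)(struct demo_dax_device *dev, void *addr, size_t size);
};

struct demo_dax_device {
	const struct demo_dax_operations *ops;
	size_t flushed;	/* bytes flushed so far, for demonstration */
};

/* Helper: a driver without ->flush() makes this a nop. */
static void demo_dax_flush(struct demo_dax_device *dev, void *addr, size_t size)
{
	assert(dev && dev->ops);
	if (!dev->ops->flush)
		return;
	dev->ops->flush(dev, addr, size);
}

/* A pmem-like driver provides a flush method... */
static void demo_pmem_flush(struct demo_dax_device *dev, void *addr, size_t size)
{
	(void)addr;
	dev->flushed += size;
}

static const struct demo_dax_operations demo_pmem_ops = {
	.flush = demo_pmem_flush,
};

/* ...while a brd-like driver backed by volatile memory does not. */
static const struct demo_dax_operations demo_brd_ops = {
	.flush = NULL,
};
```

Callers stay identical for both drivers; the decision lives entirely in
the helper.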
fs/dax.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/dax.c b/fs/dax.c
index 11b9909c91df..edbf988de86c 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -798,7 +798,7 @@ static int dax_writeback_one(struct block_device *bdev,
}
dax_mapping_entry_mkclean(mapping, index, pfn_t_to_pfn(pfn));
- wb_cache_pmem(kaddr, size);
+ dax_flush(dax_dev, pgoff, kaddr, size);
/*
* After we have flushed the cache, we can clear the dirty tag. There
* cannot be new dirty data in the pfn after the flush has completed as
memcpy_from_pmem() maps directly to memcpy_mcsafe(). The wrapper
serves no real benefit aside from affording a more generic function name
than the x86-specific 'mcsafe'. However, this would not be the first time
that x86 terminology leaked into the global namespace. For lack of a
better name, just use memcpy_mcsafe() directly.
This conversion also catches a place where we should have been using
plain memcpy(), acpi_nfit_blk_single_io().
Cc: <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Jeff Moyer <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Tony Luck <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Ross Zwisler <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
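The generic fallback added to include/linux/string.h follows the usual
"arch header defines a guard macro, generic header supplies a default"
pattern. A userspace sketch, assuming hypothetical demo_* names (the
guard macro and functions below are not kernel symbols):

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * An architecture that can recover from machine checks during a copy
 * would define DEMO_HAVE_ARCH_MEMCPY_MCSAFE and provide its own
 * demo_memcpy_mcsafe(). Everyone else gets this plain-memcpy default,
 * which cannot fail and therefore always returns 0.
 */
#ifndef DEMO_HAVE_ARCH_MEMCPY_MCSAFE
static inline int demo_memcpy_mcsafe(void *dst, const void *src, size_t cnt)
{
	memcpy(dst, src, cnt);
	return 0;
}
#endif

/* Caller in the style of read_pmem(): any copy error becomes -EIO. */
static int demo_read_pmem(void *mem, const void *pmem_addr, size_t len)
{
	assert(mem && pmem_addr);
	if (demo_memcpy_mcsafe(mem, pmem_addr, len))
		return -EIO;
	return 0;
}
```

Callers check the return value unconditionally, so they need no
arch_has_pmem_api()-style gating.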
arch/x86/include/asm/pmem.h | 5 -----
arch/x86/include/asm/string_64.h | 1 +
drivers/acpi/nfit/core.c | 3 +--
drivers/nvdimm/claim.c | 2 +-
drivers/nvdimm/pmem.c | 2 +-
include/linux/pmem.h | 23 -----------------------
include/linux/string.h | 8 ++++++++
7 files changed, 12 insertions(+), 32 deletions(-)
diff --git a/arch/x86/include/asm/pmem.h b/arch/x86/include/asm/pmem.h
index 529bb4a6487a..d5a22bac9988 100644
--- a/arch/x86/include/asm/pmem.h
+++ b/arch/x86/include/asm/pmem.h
@@ -44,11 +44,6 @@ static inline void arch_memcpy_to_pmem(void *dst, const void *src, size_t n)
BUG();
}
-static inline int arch_memcpy_from_pmem(void *dst, const void *src, size_t n)
-{
- return memcpy_mcsafe(dst, src, n);
-}
-
/**
* arch_wb_cache_pmem - write back a cache range with CLWB
* @vaddr: virtual start address
diff --git a/arch/x86/include/asm/string_64.h b/arch/x86/include/asm/string_64.h
index a164862d77e3..733bae07fb29 100644
--- a/arch/x86/include/asm/string_64.h
+++ b/arch/x86/include/asm/string_64.h
@@ -79,6 +79,7 @@ int strcmp(const char *cs, const char *ct);
#define memset(s, c, n) __memset(s, c, n)
#endif
+#define __HAVE_ARCH_MEMCPY_MCSAFE 1
__must_check int memcpy_mcsafe_unrolled(void *dst, const void *src, size_t cnt);
DECLARE_STATIC_KEY_FALSE(mcsafe_key);
diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c
index c8ea9d698cd0..d0c07b2344e4 100644
--- a/drivers/acpi/nfit/core.c
+++ b/drivers/acpi/nfit/core.c
@@ -1783,8 +1783,7 @@ static int acpi_nfit_blk_single_io(struct nfit_blk *nfit_blk,
mmio_flush_range((void __force *)
mmio->addr.aperture + offset, c);
- memcpy_from_pmem(iobuf + copied,
- mmio->addr.aperture + offset, c);
+ memcpy(iobuf + copied, mmio->addr.aperture + offset, c);
}
copied += c;
diff --git a/drivers/nvdimm/claim.c b/drivers/nvdimm/claim.c
index ca6d572c48fc..3a35e8028b9c 100644
--- a/drivers/nvdimm/claim.c
+++ b/drivers/nvdimm/claim.c
@@ -239,7 +239,7 @@ static int nsio_rw_bytes(struct nd_namespace_common *ndns,
if (rw == READ) {
if (unlikely(is_bad_pmem(&nsio->bb, sector, sz_align)))
return -EIO;
- return memcpy_from_pmem(buf, nsio->addr + offset, size);
+ return memcpy_mcsafe(buf, nsio->addr + offset, size);
}
if (unlikely(is_bad_pmem(&nsio->bb, sector, sz_align))) {
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 85b85633d674..3b3dab73d741 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -89,7 +89,7 @@ static int read_pmem(struct page *page, unsigned int off,
int rc;
void *mem = kmap_atomic(page);
- rc = memcpy_from_pmem(mem + off, pmem_addr, len);
+ rc = memcpy_mcsafe(mem + off, pmem_addr, len);
kunmap_atomic(mem);
if (rc)
return -EIO;
diff --git a/include/linux/pmem.h b/include/linux/pmem.h
index e856c2cb0fe8..71ecf3d46aac 100644
--- a/include/linux/pmem.h
+++ b/include/linux/pmem.h
@@ -31,12 +31,6 @@ static inline void arch_memcpy_to_pmem(void *dst, const void *src, size_t n)
BUG();
}
-static inline int arch_memcpy_from_pmem(void *dst, const void *src, size_t n)
-{
- BUG();
- return -EFAULT;
-}
-
static inline size_t arch_copy_from_iter_pmem(void *addr, size_t bytes,
struct iov_iter *i)
{
@@ -65,23 +59,6 @@ static inline bool arch_has_pmem_api(void)
return IS_ENABLED(CONFIG_ARCH_HAS_PMEM_API);
}
-/*
- * memcpy_from_pmem - read from persistent memory with error handling
- * @dst: destination buffer
- * @src: source buffer
- * @size: transfer length
- *
- * Returns 0 on success negative error code on failure.
- */
-static inline int memcpy_from_pmem(void *dst, void const *src, size_t size)
-{
- if (arch_has_pmem_api())
- return arch_memcpy_from_pmem(dst, src, size);
- else
- memcpy(dst, src, size);
- return 0;
-}
-
/**
* memcpy_to_pmem - copy data to persistent memory
* @dst: destination buffer for the copy
diff --git a/include/linux/string.h b/include/linux/string.h
index 26b6f6a66f83..9d6f189157e2 100644
--- a/include/linux/string.h
+++ b/include/linux/string.h
@@ -114,6 +114,14 @@ extern int memcmp(const void *,const void *,__kernel_size_t);
#ifndef __HAVE_ARCH_MEMCHR
extern void * memchr(const void *,int,__kernel_size_t);
#endif
+#ifndef __HAVE_ARCH_MEMCPY_MCSAFE
+static inline __must_check int memcpy_mcsafe(void *dst, const void *src,
+ size_t cnt)
+{
+ memcpy(dst, src, cnt);
+ return 0;
+}
+#endif
void *memchr_inv(const void *s, int c, size_t n);
char *strreplace(char *s, char old, char new);
commit d1a5f2b4d8a1 ("block: use DAX for partition table reads") was
part of a stalled effort to allow dax mappings of block devices. Since
then the device-dax mechanism has filled the role of dax-mapping static
device ranges.
Now that we are moving ->direct_access() from a block_device operation
to a dax_device operation, we would need block devices to map and carry
their own dax_device reference.
Unless / until we decide to revive dax mapping of raw block devices
through the dax_device scheme, there is no need to carry
read_dax_sector(). Its removal in turn allows for the removal of
bdev_direct_access() and should have been included in commit
223757016837 ("block_dev: remove DAX leftovers").
Cc: Jeff Moyer <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
block/partition-generic.c | 17 ++---------------
fs/dax.c | 20 --------------------
include/linux/dax.h | 6 ------
3 files changed, 2 insertions(+), 41 deletions(-)
diff --git a/block/partition-generic.c b/block/partition-generic.c
index 7afb9907821f..5dfac337b0f2 100644
--- a/block/partition-generic.c
+++ b/block/partition-generic.c
@@ -16,7 +16,6 @@
#include <linux/kmod.h>
#include <linux/ctype.h>
#include <linux/genhd.h>
-#include <linux/dax.h>
#include <linux/blktrace_api.h>
#include "partitions/check.h"
@@ -631,24 +630,12 @@ int invalidate_partitions(struct gendisk *disk, struct block_device *bdev)
return 0;
}
-static struct page *read_pagecache_sector(struct block_device *bdev, sector_t n)
-{
- struct address_space *mapping = bdev->bd_inode->i_mapping;
-
- return read_mapping_page(mapping, (pgoff_t)(n >> (PAGE_SHIFT-9)),
- NULL);
-}
-
unsigned char *read_dev_sector(struct block_device *bdev, sector_t n, Sector *p)
{
+ struct address_space *mapping = bdev->bd_inode->i_mapping;
struct page *page;
- /* don't populate page cache for dax capable devices */
- if (IS_DAX(bdev->bd_inode))
- page = read_dax_sector(bdev, n);
- else
- page = read_pagecache_sector(bdev, n);
-
+ page = read_mapping_page(mapping, (pgoff_t)(n >> (PAGE_SHIFT-9)), NULL);
if (!IS_ERR(page)) {
if (PageError(page))
goto fail;
diff --git a/fs/dax.c b/fs/dax.c
index de622d4282a6..b78a6947c4f5 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -101,26 +101,6 @@ static int dax_is_empty_entry(void *entry)
return (unsigned long)entry & RADIX_DAX_EMPTY;
}
-struct page *read_dax_sector(struct block_device *bdev, sector_t n)
-{
- struct page *page = alloc_pages(GFP_KERNEL, 0);
- struct blk_dax_ctl dax = {
- .size = PAGE_SIZE,
- .sector = n & ~((((int) PAGE_SIZE) / 512) - 1),
- };
- long rc;
-
- if (!page)
- return ERR_PTR(-ENOMEM);
-
- rc = dax_map_atomic(bdev, &dax);
- if (rc < 0)
- return ERR_PTR(rc);
- memcpy_from_pmem(page_address(page), dax.addr, PAGE_SIZE);
- dax_unmap_atomic(bdev, &dax);
- return page;
-}
-
/*
* DAX radix tree locking
*/
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 7e62e280c11f..0d0d890f9186 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -70,15 +70,9 @@ void dax_wake_mapping_entry_waiter(struct address_space *mapping,
pgoff_t index, void *entry, bool wake_all);
#ifdef CONFIG_FS_DAX
-struct page *read_dax_sector(struct block_device *bdev, sector_t n);
int __dax_zero_page_range(struct block_device *bdev, sector_t sector,
unsigned int offset, unsigned int length);
#else
-static inline struct page *read_dax_sector(struct block_device *bdev,
- sector_t n)
-{
- return ERR_PTR(-ENXIO);
-}
static inline int __dax_zero_page_range(struct block_device *bdev,
sector_t sector, unsigned int offset, unsigned int length)
{
Arrange for dm to look up the dax services available from member devices.
Update the dax-capable targets, linear and stripe, to route dax
operations to the underlying device. Change the target-internal
->direct_access() method to more closely align with the dax_operations
->direct_access() calling convention.
Cc: Toshi Kani <[email protected]>
Cc: Mike Snitzer <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
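The page-based translation that the linear target now performs can be
worked through in userspace. This sketch rests on assumptions that are
not shown in this patch: that linear_map_sector() computes
"start + (sector - target begin)", and that bdev_dax_pgoff() adds the
partition start and rejects byte offsets that are not page aligned. All
demo_* names and the 4096-byte page size are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define DEMO_PAGE_SIZE		4096ULL
#define DEMO_PAGE_SECTORS	(DEMO_PAGE_SIZE / 512)

struct demo_linear {
	uint64_t ti_begin;	/* first sector of the target in the dm device */
	uint64_t start;		/* sector offset of the mapping on the member */
	uint64_t part_start;	/* start sector of the member partition */
};

/* Translate a dm-device page offset into a member-device page offset. */
static int demo_linear_pgoff(const struct demo_linear *lc, uint64_t pgoff,
			     uint64_t *dev_pgoff)
{
	uint64_t sector, dev_sector, phys_off;

	assert(lc && dev_pgoff);
	sector = pgoff * DEMO_PAGE_SECTORS;
	dev_sector = lc->start + (sector - lc->ti_begin);
	phys_off = (lc->part_start + dev_sector) * 512;

	/* A misaligned mapping cannot be expressed in whole pages. */
	if (phys_off % DEMO_PAGE_SIZE)
		return -EINVAL;
	*dev_pgoff = phys_off / DEMO_PAGE_SIZE;
	return 0;
}
```

For example, with start = 16 sectors, page 3 of the dm device is sector
24, which maps to device sector 40, i.e. byte offset 20480 or page 5. A
start of 17 sectors leaves every page misaligned, so the translation
fails rather than returning a partial page.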
drivers/md/dm-linear.c | 27 +++++++++++++--------------
drivers/md/dm-snap.c | 6 +++---
drivers/md/dm-stripe.c | 29 ++++++++++++++---------------
drivers/md/dm-target.c | 6 +++---
drivers/md/dm.c | 16 ++++++----------
include/linux/device-mapper.h | 7 ++++---
6 files changed, 43 insertions(+), 48 deletions(-)
diff --git a/drivers/md/dm-linear.c b/drivers/md/dm-linear.c
index 4788b0b989a9..c5a52f4dae81 100644
--- a/drivers/md/dm-linear.c
+++ b/drivers/md/dm-linear.c
@@ -9,6 +9,7 @@
#include <linux/init.h>
#include <linux/blkdev.h>
#include <linux/bio.h>
+#include <linux/dax.h>
#include <linux/slab.h>
#include <linux/device-mapper.h>
@@ -141,22 +142,20 @@ static int linear_iterate_devices(struct dm_target *ti,
return fn(ti, lc->dev, lc->start, ti->len, data);
}
-static long linear_direct_access(struct dm_target *ti, sector_t sector,
- void **kaddr, pfn_t *pfn, long size)
+static long linear_dax_direct_access(struct dm_target *ti, pgoff_t pgoff,
+ long nr_pages, void **kaddr, pfn_t *pfn)
{
+ long ret;
struct linear_c *lc = ti->private;
struct block_device *bdev = lc->dev->bdev;
- struct blk_dax_ctl dax = {
- .sector = linear_map_sector(ti, sector),
- .size = size,
- };
- long ret;
-
- ret = bdev_direct_access(bdev, &dax);
- *kaddr = dax.addr;
- *pfn = dax.pfn;
-
- return ret;
+ struct dax_device *dax_dev = lc->dev->dax_dev;
+ sector_t dev_sector, sector = pgoff * PAGE_SECTORS;
+
+ dev_sector = linear_map_sector(ti, sector);
+ ret = bdev_dax_pgoff(bdev, dev_sector, nr_pages * PAGE_SIZE, &pgoff);
+ if (ret)
+ return ret;
+ return dax_direct_access(dax_dev, pgoff, nr_pages, kaddr, pfn);
}
static struct target_type linear_target = {
@@ -169,7 +168,7 @@ static struct target_type linear_target = {
.status = linear_status,
.prepare_ioctl = linear_prepare_ioctl,
.iterate_devices = linear_iterate_devices,
- .direct_access = linear_direct_access,
+ .direct_access = linear_dax_direct_access,
};
int __init dm_linear_init(void)
diff --git a/drivers/md/dm-snap.c b/drivers/md/dm-snap.c
index c65feeada864..e152d9817c81 100644
--- a/drivers/md/dm-snap.c
+++ b/drivers/md/dm-snap.c
@@ -2302,8 +2302,8 @@ static int origin_map(struct dm_target *ti, struct bio *bio)
return do_origin(o->dev, bio);
}
-static long origin_direct_access(struct dm_target *ti, sector_t sector,
- void **kaddr, pfn_t *pfn, long size)
+static long origin_dax_direct_access(struct dm_target *ti, pgoff_t pgoff,
+ long nr_pages, void **kaddr, pfn_t *pfn)
{
DMWARN("device does not support dax.");
return -EIO;
@@ -2368,7 +2368,7 @@ static struct target_type origin_target = {
.postsuspend = origin_postsuspend,
.status = origin_status,
.iterate_devices = origin_iterate_devices,
- .direct_access = origin_direct_access,
+ .direct_access = origin_dax_direct_access,
};
static struct target_type snapshot_target = {
diff --git a/drivers/md/dm-stripe.c b/drivers/md/dm-stripe.c
index 28193a57bf47..cb4b1e9e16ab 100644
--- a/drivers/md/dm-stripe.c
+++ b/drivers/md/dm-stripe.c
@@ -11,6 +11,7 @@
#include <linux/init.h>
#include <linux/blkdev.h>
#include <linux/bio.h>
+#include <linux/dax.h>
#include <linux/slab.h>
#include <linux/log2.h>
@@ -308,27 +309,25 @@ static int stripe_map(struct dm_target *ti, struct bio *bio)
return DM_MAPIO_REMAPPED;
}
-static long stripe_direct_access(struct dm_target *ti, sector_t sector,
- void **kaddr, pfn_t *pfn, long size)
+static long stripe_dax_direct_access(struct dm_target *ti, pgoff_t pgoff,
+ long nr_pages, void **kaddr, pfn_t *pfn)
{
+ sector_t dev_sector, sector = pgoff * PAGE_SECTORS;
struct stripe_c *sc = ti->private;
- uint32_t stripe;
+ struct dax_device *dax_dev;
struct block_device *bdev;
- struct blk_dax_ctl dax = {
- .size = size,
- };
+ uint32_t stripe;
long ret;
- stripe_map_sector(sc, sector, &stripe, &dax.sector);
-
- dax.sector += sc->stripe[stripe].physical_start;
+ stripe_map_sector(sc, sector, &stripe, &dev_sector);
+ dev_sector += sc->stripe[stripe].physical_start;
+ dax_dev = sc->stripe[stripe].dev->dax_dev;
bdev = sc->stripe[stripe].dev->bdev;
- ret = bdev_direct_access(bdev, &dax);
- *kaddr = dax.addr;
- *pfn = dax.pfn;
-
- return ret;
+ ret = bdev_dax_pgoff(bdev, dev_sector, nr_pages * PAGE_SIZE, &pgoff);
+ if (ret)
+ return ret;
+ return dax_direct_access(dax_dev, pgoff, nr_pages, kaddr, pfn);
}
/*
@@ -448,7 +447,7 @@ static struct target_type stripe_target = {
.status = stripe_status,
.iterate_devices = stripe_iterate_devices,
.io_hints = stripe_io_hints,
- .direct_access = stripe_direct_access,
+ .direct_access = stripe_dax_direct_access,
};
int __init dm_stripe_init(void)
diff --git a/drivers/md/dm-target.c b/drivers/md/dm-target.c
index 43d3445b121d..6a7968f93f3c 100644
--- a/drivers/md/dm-target.c
+++ b/drivers/md/dm-target.c
@@ -142,8 +142,8 @@ static void io_err_release_clone_rq(struct request *clone)
{
}
-static long io_err_direct_access(struct dm_target *ti, sector_t sector,
- void **kaddr, pfn_t *pfn, long size)
+static long io_err_dax_direct_access(struct dm_target *ti, pgoff_t pgoff,
+ long nr_pages, void **kaddr, pfn_t *pfn)
{
return -EIO;
}
@@ -157,7 +157,7 @@ static struct target_type error_target = {
.map = io_err_map,
.clone_and_map_rq = io_err_clone_and_map_rq,
.release_clone_rq = io_err_release_clone_rq,
- .direct_access = io_err_direct_access,
+ .direct_access = io_err_dax_direct_access,
};
int __init dm_target_init(void)
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index bd56dfe43a99..ef4c6f8cad47 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -630,6 +630,7 @@ static int open_table_device(struct table_device *td, dev_t dev,
}
td->dm_dev.bdev = bdev;
+ td->dm_dev.dax_dev = dax_get_by_host(bdev->bd_disk->disk_name);
return 0;
}
@@ -643,7 +644,9 @@ static void close_table_device(struct table_device *td, struct mapped_device *md
bd_unlink_disk_holder(td->dm_dev.bdev, dm_disk(md));
blkdev_put(td->dm_dev.bdev, td->dm_dev.mode | FMODE_EXCL);
+ put_dax(td->dm_dev.dax_dev);
td->dm_dev.bdev = NULL;
+ td->dm_dev.dax_dev = NULL;
}
static struct table_device *find_table_device(struct list_head *l, dev_t dev,
@@ -945,16 +948,9 @@ static long dm_dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
if (len < 1)
goto out;
nr_pages = min(len, nr_pages);
- if (ti->type->direct_access) {
- ret = ti->type->direct_access(ti, sector, kaddr, pfn,
- nr_pages * PAGE_SIZE);
- /*
- * FIXME: convert ti->type->direct_access to return
- * nr_pages directly.
- */
- if (ret >= 0)
- ret /= PAGE_SIZE;
- }
+ if (ti->type->direct_access)
+ ret = ti->type->direct_access(ti, pgoff, nr_pages, kaddr, pfn);
+
out:
dm_put_live_table(md, srcu_idx);
diff --git a/include/linux/device-mapper.h b/include/linux/device-mapper.h
index bcba4d89089c..df830d167892 100644
--- a/include/linux/device-mapper.h
+++ b/include/linux/device-mapper.h
@@ -128,14 +128,15 @@ typedef int (*dm_busy_fn) (struct dm_target *ti);
* < 0 : error
* >= 0 : the number of bytes accessible at the address
*/
-typedef long (*dm_direct_access_fn) (struct dm_target *ti, sector_t sector,
- void **kaddr, pfn_t *pfn, long size);
+typedef long (*dm_dax_direct_access_fn) (struct dm_target *ti, pgoff_t pgoff,
+ long nr_pages, void **kaddr, pfn_t *pfn);
#define PAGE_SECTORS (PAGE_SIZE / 512)
void dm_error(const char *message);
struct dm_dev {
struct block_device *bdev;
+ struct dax_device *dax_dev;
fmode_t mode;
char name[16];
};
@@ -177,7 +178,7 @@ struct target_type {
dm_busy_fn busy;
dm_iterate_devices_fn iterate_devices;
dm_io_hints_fn io_hints;
- dm_direct_access_fn direct_access;
+ dm_dax_direct_access_fn direct_access;
/* For internal device-mapper use. */
struct list_head list;
Set up a dax_device to have the same lifetime as the brd block device and
add a ->direct_access() method that is equivalent to
brd_direct_access(). Once fs/dax.c has been converted to use
dax_operations the old brd_direct_access() will be removed.
Signed-off-by: Dan Williams <[email protected]>
---
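Note for reviewers (not part of the commit): the legacy block-layer
wrapper and the new dax entry point differ only in units. The former
takes a sector and returns bytes, the latter takes a page offset and
returns pages. What follows is a user-space illustration of that
conversion, not driver code. It assumes 4 KiB pages and 512-byte
sectors and redefines PHYS_PFN()/PFN_PHYS() locally. The helper names
sector_to_pgoff(), pgoff_to_sector() and nr_pages_to_bytes() are
hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 4 KiB pages and 512-byte sectors. */
#define PAGE_SHIFT	12
#define PAGE_SIZE	(1UL << PAGE_SHIFT)
#define SECTOR_SIZE	512

/* Local stand-ins for the kernel's PHYS_PFN()/PFN_PHYS() helpers. */
#define PHYS_PFN(x)	((unsigned long)((x) >> PAGE_SHIFT))
#define PFN_PHYS(x)	((uint64_t)(x) << PAGE_SHIFT)

/* Hypothetical helper: sector -> page offset, as the blk wrapper does. */
static unsigned long sector_to_pgoff(uint64_t sector)
{
	return PHYS_PFN(sector * SECTOR_SIZE);
}

/* Hypothetical helper: page offset -> first sector of that page. */
static uint64_t pgoff_to_sector(unsigned long pgoff)
{
	return PFN_PHYS(pgoff) / SECTOR_SIZE;
}

/*
 * Hypothetical helper: turn a page count from the dax-style call back
 * into the byte count the legacy interface expects. Negative values
 * are error codes and pass through unchanged.
 */
static long nr_pages_to_bytes(long nr_pages)
{
	if (nr_pages < 0)
		return nr_pages;
	return nr_pages * (long)PAGE_SIZE;
}
```

Under these assumptions a sector that is not page aligned rounds down
to the page that contains it, so sector 7 maps to page 0 and sector 8
to page 1.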
drivers/block/Kconfig | 1 +
drivers/block/brd.c | 65 +++++++++++++++++++++++++++++++++++++++++--------
2 files changed, 55 insertions(+), 11 deletions(-)
diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig
index f744de7a0f9b..e66956fc2c88 100644
--- a/drivers/block/Kconfig
+++ b/drivers/block/Kconfig
@@ -339,6 +339,7 @@ config BLK_DEV_SX8
config BLK_DEV_RAM
tristate "RAM block device support"
+ select DAX if BLK_DEV_RAM_DAX
---help---
Saying Y here will allow you to use a portion of your RAM memory as
a block device, so that you can make file systems on it, read and
diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index 3adc32a3153b..60f3193c9ce2 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -21,6 +21,7 @@
#include <linux/slab.h>
#ifdef CONFIG_BLK_DEV_RAM_DAX
#include <linux/pfn_t.h>
+#include <linux/dax.h>
#endif
#include <linux/uaccess.h>
@@ -41,6 +42,9 @@ struct brd_device {
struct request_queue *brd_queue;
struct gendisk *brd_disk;
+#ifdef CONFIG_BLK_DEV_RAM_DAX
+ struct dax_device *dax_dev;
+#endif
struct list_head brd_list;
/*
@@ -375,30 +379,53 @@ static int brd_rw_page(struct block_device *bdev, sector_t sector,
}
#ifdef CONFIG_BLK_DEV_RAM_DAX
-static long brd_direct_access(struct block_device *bdev, sector_t sector,
- void **kaddr, pfn_t *pfn, long size)
+static long __brd_direct_access(struct brd_device *brd, pgoff_t pgoff,
+ long nr_pages, void **kaddr, pfn_t *pfn)
{
- struct brd_device *brd = bdev->bd_disk->private_data;
struct page *page;
if (!brd)
return -ENODEV;
- page = brd_insert_page(brd, sector);
+ page = brd_insert_page(brd, PFN_PHYS(pgoff) / 512);
if (!page)
return -ENOSPC;
*kaddr = page_address(page);
*pfn = page_to_pfn_t(page);
- return PAGE_SIZE;
+ return 1;
+}
+
+static long brd_blk_direct_access(struct block_device *bdev, sector_t sector,
+ void **kaddr, pfn_t *pfn, long size)
+{
+ struct brd_device *brd = bdev->bd_disk->private_data;
+ long nr_pages = __brd_direct_access(brd, PHYS_PFN(sector * 512),
+ PHYS_PFN(size), kaddr, pfn);
+
+ if (nr_pages < 0)
+ return nr_pages;
+ return nr_pages * PAGE_SIZE;
+}
+
+static long brd_dax_direct_access(struct dax_device *dax_dev,
+ pgoff_t pgoff, long nr_pages, void **kaddr, pfn_t *pfn)
+{
+ struct brd_device *brd = dax_get_private(dax_dev);
+
+ return __brd_direct_access(brd, pgoff, nr_pages, kaddr, pfn);
}
+
+static const struct dax_operations brd_dax_ops = {
+ .direct_access = brd_dax_direct_access,
+};
#else
-#define brd_direct_access NULL
+#define brd_blk_direct_access NULL
#endif
static const struct block_device_operations brd_fops = {
.owner = THIS_MODULE,
.rw_page = brd_rw_page,
- .direct_access = brd_direct_access,
+ .direct_access = brd_blk_direct_access,
};
/*
@@ -441,7 +468,9 @@ static struct brd_device *brd_alloc(int i)
{
struct brd_device *brd;
struct gendisk *disk;
-
+#ifdef CONFIG_BLK_DEV_RAM_DAX
+ struct dax_device *dax_dev;
+#endif
brd = kzalloc(sizeof(*brd), GFP_KERNEL);
if (!brd)
goto out;
@@ -469,9 +498,6 @@ static struct brd_device *brd_alloc(int i)
blk_queue_max_discard_sectors(brd->brd_queue, UINT_MAX);
brd->brd_queue->limits.discard_zeroes_data = 1;
queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, brd->brd_queue);
-#ifdef CONFIG_BLK_DEV_RAM_DAX
- queue_flag_set_unlocked(QUEUE_FLAG_DAX, brd->brd_queue);
-#endif
disk = brd->brd_disk = alloc_disk(max_part);
if (!disk)
goto out_free_queue;
@@ -484,8 +510,21 @@ static struct brd_device *brd_alloc(int i)
sprintf(disk->disk_name, "ram%d", i);
set_capacity(disk, rd_size * 2);
+#ifdef CONFIG_BLK_DEV_RAM_DAX
+ queue_flag_set_unlocked(QUEUE_FLAG_DAX, brd->brd_queue);
+ dax_dev = alloc_dax(brd, disk->disk_name, &brd_dax_ops);
+ if (!dax_dev)
+ goto out_free_inode;
+#endif
+
+
return brd;
+#ifdef CONFIG_BLK_DEV_RAM_DAX
+out_free_inode:
+ kill_dax(dax_dev);
+ put_dax(dax_dev);
+#endif
out_free_queue:
blk_cleanup_queue(brd->brd_queue);
out_free_dev:
@@ -525,6 +564,10 @@ static struct brd_device *brd_init_one(int i, bool *new)
static void brd_del_one(struct brd_device *brd)
{
list_del(&brd->brd_list);
+#ifdef CONFIG_BLK_DEV_RAM_DAX
+ kill_dax(brd->dax_dev);
+ put_dax(brd->dax_dev);
+#endif
del_gendisk(brd->brd_disk);
brd_free(brd);
}
In preparation for introducing a struct dax_device type to the kernel
global type namespace, rename dax_dev to dev_dax. A 'dax_device'
instance will be a generic device-driver object for any provider of dax
functionality. A 'dev_dax' object is a device-dax-driver local /
internal instance.
Signed-off-by: Dan Williams <[email protected]>
---
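Note for reviewers (not part of the commit): the naming split can be
pictured with a small user-space sketch. The struct layouts are cut
down to the minimum, container_of() is redefined locally, and struct
dax_device is left opaque because the generic type is only introduced
later in the series. Treat the field selection as an assumption made
for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Local stand-in for the kernel's container_of(). */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Minimal stand-in for the kernel's struct device. */
struct device {
	int id;
};

/* Generic object for any provider of dax functionality (opaque here). */
struct dax_device;

/* Device-dax driver local instance data, formerly 'struct dax_dev'. */
struct dev_dax {
	int num_resources;
	struct device dev;
};

/* Recover the driver-local object from its embedded struct device. */
static struct dev_dax *to_dev_dax(struct device *dev)
{
	return container_of(dev, struct dev_dax, dev);
}
```

After the rename, code that holds a 'struct device *' for a device-dax
character device goes through to_dev_dax(), while the 'dax_device'
name stays free for the generic object.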
drivers/dax/dax.c | 206 ++++++++++++++++++++++++++--------------------------
drivers/dax/dax.h | 4 +
drivers/dax/pmem.c | 8 +-
3 files changed, 109 insertions(+), 109 deletions(-)
diff --git a/drivers/dax/dax.c b/drivers/dax/dax.c
index 352cc54056ce..376fdd353aea 100644
--- a/drivers/dax/dax.c
+++ b/drivers/dax/dax.c
@@ -57,7 +57,7 @@ struct dax_region {
};
/**
- * struct dax_dev - subdivision of a dax region
+ * struct dev_dax - instance data for a subdivision of a dax region
* @region - parent region
* @dev - device backing the character device
* @cdev - core chardev data
@@ -66,7 +66,7 @@ struct dax_region {
* @num_resources - number of physical address extents in this device
* @res - array of physical address ranges
*/
-struct dax_dev {
+struct dev_dax {
struct dax_region *region;
struct inode *inode;
struct device dev;
@@ -323,47 +323,47 @@ struct dax_region *alloc_dax_region(struct device *parent, int region_id,
}
EXPORT_SYMBOL_GPL(alloc_dax_region);
-static struct dax_dev *to_dax_dev(struct device *dev)
+static struct dev_dax *to_dev_dax(struct device *dev)
{
- return container_of(dev, struct dax_dev, dev);
+ return container_of(dev, struct dev_dax, dev);
}
static ssize_t size_show(struct device *dev,
struct device_attribute *attr, char *buf)
{
- struct dax_dev *dax_dev = to_dax_dev(dev);
+ struct dev_dax *dev_dax = to_dev_dax(dev);
unsigned long long size = 0;
int i;
- for (i = 0; i < dax_dev->num_resources; i++)
- size += resource_size(&dax_dev->res[i]);
+ for (i = 0; i < dev_dax->num_resources; i++)
+ size += resource_size(&dev_dax->res[i]);
return sprintf(buf, "%llu\n", size);
}
static DEVICE_ATTR_RO(size);
-static struct attribute *dax_device_attributes[] = {
+static struct attribute *dev_dax_attributes[] = {
&dev_attr_size.attr,
NULL,
};
-static const struct attribute_group dax_device_attribute_group = {
- .attrs = dax_device_attributes,
+static const struct attribute_group dev_dax_attribute_group = {
+ .attrs = dev_dax_attributes,
};
static const struct attribute_group *dax_attribute_groups[] = {
- &dax_device_attribute_group,
+ &dev_dax_attribute_group,
NULL,
};
-static int check_vma(struct dax_dev *dax_dev, struct vm_area_struct *vma,
+static int check_vma(struct dev_dax *dev_dax, struct vm_area_struct *vma,
const char *func)
{
- struct dax_region *dax_region = dax_dev->region;
- struct device *dev = &dax_dev->dev;
+ struct dax_region *dax_region = dev_dax->region;
+ struct device *dev = &dev_dax->dev;
unsigned long mask;
- if (!dax_dev->alive)
+ if (!dev_dax->alive)
return -ENXIO;
/* prevent private mappings from being established */
@@ -397,23 +397,23 @@ static int check_vma(struct dax_dev *dax_dev, struct vm_area_struct *vma,
return 0;
}
-static phys_addr_t pgoff_to_phys(struct dax_dev *dax_dev, pgoff_t pgoff,
+static phys_addr_t pgoff_to_phys(struct dev_dax *dev_dax, pgoff_t pgoff,
unsigned long size)
{
struct resource *res;
phys_addr_t phys;
int i;
- for (i = 0; i < dax_dev->num_resources; i++) {
- res = &dax_dev->res[i];
+ for (i = 0; i < dev_dax->num_resources; i++) {
+ res = &dev_dax->res[i];
phys = pgoff * PAGE_SIZE + res->start;
if (phys >= res->start && phys <= res->end)
break;
pgoff -= PHYS_PFN(resource_size(res));
}
- if (i < dax_dev->num_resources) {
- res = &dax_dev->res[i];
+ if (i < dev_dax->num_resources) {
+ res = &dev_dax->res[i];
if (phys + size - 1 <= res->end)
return phys;
}
@@ -421,19 +421,19 @@ static phys_addr_t pgoff_to_phys(struct dax_dev *dax_dev, pgoff_t pgoff,
return -1;
}
-static int __dax_dev_pte_fault(struct dax_dev *dax_dev, struct vm_fault *vmf)
+static int __dev_dax_pte_fault(struct dev_dax *dev_dax, struct vm_fault *vmf)
{
- struct device *dev = &dax_dev->dev;
+ struct device *dev = &dev_dax->dev;
struct dax_region *dax_region;
int rc = VM_FAULT_SIGBUS;
phys_addr_t phys;
pfn_t pfn;
unsigned int fault_size = PAGE_SIZE;
- if (check_vma(dax_dev, vmf->vma, __func__))
+ if (check_vma(dev_dax, vmf->vma, __func__))
return VM_FAULT_SIGBUS;
- dax_region = dax_dev->region;
+ dax_region = dev_dax->region;
if (dax_region->align > PAGE_SIZE) {
dev_dbg(dev, "%s: alignment (%#x) > fault size (%#x)\n",
__func__, dax_region->align, fault_size);
@@ -443,7 +443,7 @@ static int __dax_dev_pte_fault(struct dax_dev *dax_dev, struct vm_fault *vmf)
if (fault_size != dax_region->align)
return VM_FAULT_SIGBUS;
- phys = pgoff_to_phys(dax_dev, vmf->pgoff, PAGE_SIZE);
+ phys = pgoff_to_phys(dev_dax, vmf->pgoff, PAGE_SIZE);
if (phys == -1) {
dev_dbg(dev, "%s: pgoff_to_phys(%#lx) failed\n", __func__,
vmf->pgoff);
@@ -462,20 +462,20 @@ static int __dax_dev_pte_fault(struct dax_dev *dax_dev, struct vm_fault *vmf)
return VM_FAULT_NOPAGE;
}
-static int __dax_dev_pmd_fault(struct dax_dev *dax_dev, struct vm_fault *vmf)
+static int __dev_dax_pmd_fault(struct dev_dax *dev_dax, struct vm_fault *vmf)
{
unsigned long pmd_addr = vmf->address & PMD_MASK;
- struct device *dev = &dax_dev->dev;
+ struct device *dev = &dev_dax->dev;
struct dax_region *dax_region;
phys_addr_t phys;
pgoff_t pgoff;
pfn_t pfn;
unsigned int fault_size = PMD_SIZE;
- if (check_vma(dax_dev, vmf->vma, __func__))
+ if (check_vma(dev_dax, vmf->vma, __func__))
return VM_FAULT_SIGBUS;
- dax_region = dax_dev->region;
+ dax_region = dev_dax->region;
if (dax_region->align > PMD_SIZE) {
dev_dbg(dev, "%s: alignment (%#x) > fault size (%#x)\n",
__func__, dax_region->align, fault_size);
@@ -499,7 +499,7 @@ static int __dax_dev_pmd_fault(struct dax_dev *dax_dev, struct vm_fault *vmf)
return VM_FAULT_SIGBUS;
pgoff = linear_page_index(vmf->vma, pmd_addr);
- phys = pgoff_to_phys(dax_dev, pgoff, PMD_SIZE);
+ phys = pgoff_to_phys(dev_dax, pgoff, PMD_SIZE);
if (phys == -1) {
dev_dbg(dev, "%s: pgoff_to_phys(%#lx) failed\n", __func__,
pgoff);
@@ -513,10 +513,10 @@ static int __dax_dev_pmd_fault(struct dax_dev *dax_dev, struct vm_fault *vmf)
}
#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
-static int __dax_dev_pud_fault(struct dax_dev *dax_dev, struct vm_fault *vmf)
+static int __dev_dax_pud_fault(struct dev_dax *dev_dax, struct vm_fault *vmf)
{
unsigned long pud_addr = vmf->address & PUD_MASK;
- struct device *dev = &dax_dev->dev;
+ struct device *dev = &dev_dax->dev;
struct dax_region *dax_region;
phys_addr_t phys;
pgoff_t pgoff;
@@ -524,10 +524,10 @@ static int __dax_dev_pud_fault(struct dax_dev *dax_dev, struct vm_fault *vmf)
unsigned int fault_size = PUD_SIZE;
- if (check_vma(dax_dev, vmf->vma, __func__))
+ if (check_vma(dev_dax, vmf->vma, __func__))
return VM_FAULT_SIGBUS;
- dax_region = dax_dev->region;
+ dax_region = dev_dax->region;
if (dax_region->align > PUD_SIZE) {
dev_dbg(dev, "%s: alignment (%#x) > fault size (%#x)\n",
__func__, dax_region->align, fault_size);
@@ -551,7 +551,7 @@ static int __dax_dev_pud_fault(struct dax_dev *dax_dev, struct vm_fault *vmf)
return VM_FAULT_SIGBUS;
pgoff = linear_page_index(vmf->vma, pud_addr);
- phys = pgoff_to_phys(dax_dev, pgoff, PUD_SIZE);
+ phys = pgoff_to_phys(dev_dax, pgoff, PUD_SIZE);
if (phys == -1) {
dev_dbg(dev, "%s: pgoff_to_phys(%#lx) failed\n", __func__,
pgoff);
@@ -564,20 +564,20 @@ static int __dax_dev_pud_fault(struct dax_dev *dax_dev, struct vm_fault *vmf)
vmf->flags & FAULT_FLAG_WRITE);
}
#else
-static int __dax_dev_pud_fault(struct dax_dev *dax_dev, struct vm_fault *vmf)
+static int __dev_dax_pud_fault(struct dev_dax *dev_dax, struct vm_fault *vmf)
{
return VM_FAULT_FALLBACK;
}
#endif /* !CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
-static int dax_dev_huge_fault(struct vm_fault *vmf,
+static int dev_dax_huge_fault(struct vm_fault *vmf,
enum page_entry_size pe_size)
{
int rc, id;
struct file *filp = vmf->vma->vm_file;
- struct dax_dev *dax_dev = filp->private_data;
+ struct dev_dax *dev_dax = filp->private_data;
- dev_dbg(&dax_dev->dev, "%s: %s: %s (%#lx - %#lx) size = %d\n", __func__,
+ dev_dbg(&dev_dax->dev, "%s: %s: %s (%#lx - %#lx) size = %d\n", __func__,
current->comm, (vmf->flags & FAULT_FLAG_WRITE)
? "write" : "read",
vmf->vma->vm_start, vmf->vma->vm_end, pe_size);
@@ -585,13 +585,13 @@ static int dax_dev_huge_fault(struct vm_fault *vmf,
id = srcu_read_lock(&dax_srcu);
switch (pe_size) {
case PE_SIZE_PTE:
- rc = __dax_dev_pte_fault(dax_dev, vmf);
+ rc = __dev_dax_pte_fault(dev_dax, vmf);
break;
case PE_SIZE_PMD:
- rc = __dax_dev_pmd_fault(dax_dev, vmf);
+ rc = __dev_dax_pmd_fault(dev_dax, vmf);
break;
case PE_SIZE_PUD:
- rc = __dax_dev_pud_fault(dax_dev, vmf);
+ rc = __dev_dax_pud_fault(dev_dax, vmf);
break;
default:
rc = VM_FAULT_SIGBUS;
@@ -601,28 +601,28 @@ static int dax_dev_huge_fault(struct vm_fault *vmf,
return rc;
}
-static int dax_dev_fault(struct vm_fault *vmf)
+static int dev_dax_fault(struct vm_fault *vmf)
{
- return dax_dev_huge_fault(vmf, PE_SIZE_PTE);
+ return dev_dax_huge_fault(vmf, PE_SIZE_PTE);
}
-static const struct vm_operations_struct dax_dev_vm_ops = {
- .fault = dax_dev_fault,
- .huge_fault = dax_dev_huge_fault,
+static const struct vm_operations_struct dax_vm_ops = {
+ .fault = dev_dax_fault,
+ .huge_fault = dev_dax_huge_fault,
};
static int dax_mmap(struct file *filp, struct vm_area_struct *vma)
{
- struct dax_dev *dax_dev = filp->private_data;
+ struct dev_dax *dev_dax = filp->private_data;
int rc;
- dev_dbg(&dax_dev->dev, "%s\n", __func__);
+ dev_dbg(&dev_dax->dev, "%s\n", __func__);
- rc = check_vma(dax_dev, vma, __func__);
+ rc = check_vma(dev_dax, vma, __func__);
if (rc)
return rc;
- vma->vm_ops = &dax_dev_vm_ops;
+ vma->vm_ops = &dax_vm_ops;
vma->vm_flags |= VM_MIXEDMAP | VM_HUGEPAGE;
return 0;
}
@@ -633,13 +633,13 @@ static unsigned long dax_get_unmapped_area(struct file *filp,
unsigned long flags)
{
unsigned long off, off_end, off_align, len_align, addr_align, align;
- struct dax_dev *dax_dev = filp ? filp->private_data : NULL;
+ struct dev_dax *dev_dax = filp ? filp->private_data : NULL;
struct dax_region *dax_region;
- if (!dax_dev || addr)
+ if (!dev_dax || addr)
goto out;
- dax_region = dax_dev->region;
+ dax_region = dev_dax->region;
align = dax_region->align;
off = pgoff << PAGE_SHIFT;
off_end = off + len;
@@ -664,14 +664,14 @@ static unsigned long dax_get_unmapped_area(struct file *filp,
static int dax_open(struct inode *inode, struct file *filp)
{
- struct dax_dev *dax_dev;
+ struct dev_dax *dev_dax;
- dax_dev = container_of(inode->i_cdev, struct dax_dev, cdev);
- dev_dbg(&dax_dev->dev, "%s\n", __func__);
- inode->i_mapping = dax_dev->inode->i_mapping;
- inode->i_mapping->host = dax_dev->inode;
+ dev_dax = container_of(inode->i_cdev, struct dev_dax, cdev);
+ dev_dbg(&dev_dax->dev, "%s\n", __func__);
+ inode->i_mapping = dev_dax->inode->i_mapping;
+ inode->i_mapping->host = dev_dax->inode;
filp->f_mapping = inode->i_mapping;
- filp->private_data = dax_dev;
+ filp->private_data = dev_dax;
inode->i_flags = S_DAX;
return 0;
@@ -679,9 +679,9 @@ static int dax_open(struct inode *inode, struct file *filp)
static int dax_release(struct inode *inode, struct file *filp)
{
- struct dax_dev *dax_dev = filp->private_data;
+ struct dev_dax *dev_dax = filp->private_data;
- dev_dbg(&dax_dev->dev, "%s\n", __func__);
+ dev_dbg(&dev_dax->dev, "%s\n", __func__);
return 0;
}
@@ -694,55 +694,55 @@ static const struct file_operations dax_fops = {
.mmap = dax_mmap,
};
-static void dax_dev_release(struct device *dev)
+static void dev_dax_release(struct device *dev)
{
- struct dax_dev *dax_dev = to_dax_dev(dev);
- struct dax_region *dax_region = dax_dev->region;
+ struct dev_dax *dev_dax = to_dev_dax(dev);
+ struct dax_region *dax_region = dev_dax->region;
- ida_simple_remove(&dax_region->ida, dax_dev->id);
+ ida_simple_remove(&dax_region->ida, dev_dax->id);
ida_simple_remove(&dax_minor_ida, MINOR(dev->devt));
dax_region_put(dax_region);
- iput(dax_dev->inode);
- kfree(dax_dev);
+ iput(dev_dax->inode);
+ kfree(dev_dax);
}
-static void kill_dax_dev(struct dax_dev *dax_dev)
+static void kill_dev_dax(struct dev_dax *dev_dax)
{
/*
- * Note, rcu is not protecting the liveness of dax_dev, rcu is
+ * Note, rcu is not protecting the liveness of dev_dax, rcu is
* ensuring that any fault handlers that might have seen
- * dax_dev->alive == true, have completed. Any fault handlers
+ * dev_dax->alive == true, have completed. Any fault handlers
* that start after synchronize_srcu() has started will abort
- * upon seeing dax_dev->alive == false.
+ * upon seeing dev_dax->alive == false.
*/
- dax_dev->alive = false;
+ dev_dax->alive = false;
synchronize_srcu(&dax_srcu);
- unmap_mapping_range(dax_dev->inode->i_mapping, 0, 0, 1);
+ unmap_mapping_range(dev_dax->inode->i_mapping, 0, 0, 1);
}
-static void unregister_dax_dev(void *dev)
+static void unregister_dev_dax(void *dev)
{
- struct dax_dev *dax_dev = to_dax_dev(dev);
+ struct dev_dax *dev_dax = to_dev_dax(dev);
dev_dbg(dev, "%s\n", __func__);
- kill_dax_dev(dax_dev);
- cdev_device_del(&dax_dev->cdev, dev);
+ kill_dev_dax(dev_dax);
+ cdev_device_del(&dev_dax->cdev, dev);
put_device(dev);
}
-struct dax_dev *devm_create_dax_dev(struct dax_region *dax_region,
+struct dev_dax *devm_create_dev_dax(struct dax_region *dax_region,
struct resource *res, int count)
{
struct device *parent = dax_region->dev;
- struct dax_dev *dax_dev;
+ struct dev_dax *dev_dax;
int rc = 0, minor, i;
struct device *dev;
struct cdev *cdev;
dev_t dev_t;
- dax_dev = kzalloc(sizeof(*dax_dev) + sizeof(*res) * count, GFP_KERNEL);
- if (!dax_dev)
+ dev_dax = kzalloc(sizeof(*dev_dax) + sizeof(*res) * count, GFP_KERNEL);
+ if (!dev_dax)
return ERR_PTR(-ENOMEM);
for (i = 0; i < count; i++) {
@@ -752,16 +752,16 @@ struct dax_dev *devm_create_dax_dev(struct dax_region *dax_region,
rc = -EINVAL;
break;
}
- dax_dev->res[i].start = res[i].start;
- dax_dev->res[i].end = res[i].end;
+ dev_dax->res[i].start = res[i].start;
+ dev_dax->res[i].end = res[i].end;
}
if (i < count)
goto err_id;
- dax_dev->id = ida_simple_get(&dax_region->ida, 0, 0, GFP_KERNEL);
- if (dax_dev->id < 0) {
- rc = dax_dev->id;
+ dev_dax->id = ida_simple_get(&dax_region->ida, 0, 0, GFP_KERNEL);
+ if (dev_dax->id < 0) {
+ rc = dev_dax->id;
goto err_id;
}
@@ -772,55 +772,55 @@ struct dax_dev *devm_create_dax_dev(struct dax_region *dax_region,
}
dev_t = MKDEV(MAJOR(dax_devt), minor);
- dev = &dax_dev->dev;
- dax_dev->inode = dax_inode_get(&dax_dev->cdev, dev_t);
- if (!dax_dev->inode) {
+ dev = &dev_dax->dev;
+ dev_dax->inode = dax_inode_get(&dev_dax->cdev, dev_t);
+ if (!dev_dax->inode) {
rc = -ENOMEM;
goto err_inode;
}
- /* from here on we're committed to teardown via dax_dev_release() */
+ /* from here on we're committed to teardown via dev_dax_release() */
device_initialize(dev);
- cdev = &dax_dev->cdev;
+ cdev = &dev_dax->cdev;
cdev_init(cdev, &dax_fops);
cdev->owner = parent->driver->owner;
- dax_dev->num_resources = count;
- dax_dev->alive = true;
- dax_dev->region = dax_region;
+ dev_dax->num_resources = count;
+ dev_dax->alive = true;
+ dev_dax->region = dax_region;
kref_get(&dax_region->kref);
dev->devt = dev_t;
dev->class = dax_class;
dev->parent = parent;
dev->groups = dax_attribute_groups;
- dev->release = dax_dev_release;
- dev_set_name(dev, "dax%d.%d", dax_region->id, dax_dev->id);
+ dev->release = dev_dax_release;
+ dev_set_name(dev, "dax%d.%d", dax_region->id, dev_dax->id);
rc = cdev_device_add(cdev, dev);
if (rc) {
- kill_dax_dev(dax_dev);
+ kill_dev_dax(dev_dax);
put_device(dev);
return ERR_PTR(rc);
}
- rc = devm_add_action_or_reset(dax_region->dev, unregister_dax_dev, dev);
+ rc = devm_add_action_or_reset(dax_region->dev, unregister_dev_dax, dev);
if (rc)
return ERR_PTR(rc);
- return dax_dev;
+ return dev_dax;
err_inode:
ida_simple_remove(&dax_minor_ida, minor);
err_minor:
- ida_simple_remove(&dax_region->ida, dax_dev->id);
+ ida_simple_remove(&dax_region->ida, dev_dax->id);
err_id:
- kfree(dax_dev);
+ kfree(dev_dax);
return ERR_PTR(rc);
}
-EXPORT_SYMBOL_GPL(devm_create_dax_dev);
+EXPORT_SYMBOL_GPL(devm_create_dev_dax);
static int __init dax_init(void)
{
diff --git a/drivers/dax/dax.h b/drivers/dax/dax.h
index ddd829ab58c0..ea176d875d60 100644
--- a/drivers/dax/dax.h
+++ b/drivers/dax/dax.h
@@ -13,13 +13,13 @@
#ifndef __DAX_H__
#define __DAX_H__
struct device;
-struct dax_dev;
+struct dev_dax;
struct resource;
struct dax_region;
void dax_region_put(struct dax_region *dax_region);
struct dax_region *alloc_dax_region(struct device *parent,
int region_id, struct resource *res, unsigned int align,
void *addr, unsigned long flags);
-struct dax_dev *devm_create_dax_dev(struct dax_region *dax_region,
+struct dev_dax *devm_create_dev_dax(struct dax_region *dax_region,
struct resource *res, int count);
#endif /* __DAX_H__ */
diff --git a/drivers/dax/pmem.c b/drivers/dax/pmem.c
index 033f49b31fdc..2c736fc4508b 100644
--- a/drivers/dax/pmem.c
+++ b/drivers/dax/pmem.c
@@ -61,8 +61,8 @@ static int dax_pmem_probe(struct device *dev)
int rc;
void *addr;
struct resource res;
- struct dax_dev *dax_dev;
struct nd_pfn_sb *pfn_sb;
+ struct dev_dax *dev_dax;
struct dax_pmem *dax_pmem;
struct nd_region *nd_region;
struct nd_namespace_io *nsio;
@@ -130,12 +130,12 @@ static int dax_pmem_probe(struct device *dev)
return -ENOMEM;
/* TODO: support for subdividing a dax region... */
- dax_dev = devm_create_dax_dev(dax_region, &res, 1);
+ dev_dax = devm_create_dev_dax(dax_region, &res, 1);
- /* child dax_dev instances now own the lifetime of the dax_region */
+ /* child dev_dax instances now own the lifetime of the dax_region */
dax_region_put(dax_region);
- return PTR_ERR_OR_ZERO(dax_dev);
+ return PTR_ERR_OR_ZERO(dev_dax);
}
static struct nd_device_driver dax_pmem_driver = {
On Mon, 17 Apr 2017 12:09:32 -0700
Dan Williams <[email protected]> wrote:
> Setup a dax_dev to have the same lifetime as the dcssblk block device
> and add a ->direct_access() method that is equivalent to
> dcssblk_direct_access(). Once fs/dax.c has been converted to use
> dax_operations the old dcssblk_direct_access() will be removed.
>
> Cc: Gerald Schaefer <[email protected]>
> Signed-off-by: Dan Williams <[email protected]>
> ---
> drivers/s390/block/Kconfig | 1 +
> drivers/s390/block/dcssblk.c | 54 +++++++++++++++++++++++++++++++++++-------
> 2 files changed, 46 insertions(+), 9 deletions(-)
>
> diff --git a/drivers/s390/block/Kconfig b/drivers/s390/block/Kconfig
> index 4a3b62326183..0acb8c2f9475 100644
> --- a/drivers/s390/block/Kconfig
> +++ b/drivers/s390/block/Kconfig
> @@ -14,6 +14,7 @@ config BLK_DEV_XPRAM
>
> config DCSSBLK
> def_tristate m
> + select DAX
> prompt "DCSSBLK support"
> depends on S390 && BLOCK
> help
> diff --git a/drivers/s390/block/dcssblk.c b/drivers/s390/block/dcssblk.c
> index 415d10a67b7a..682a9eb4934d 100644
> --- a/drivers/s390/block/dcssblk.c
> +++ b/drivers/s390/block/dcssblk.c
> @@ -18,6 +18,7 @@
> #include <linux/interrupt.h>
> #include <linux/platform_device.h>
> #include <linux/pfn_t.h>
> +#include <linux/dax.h>
> #include <asm/extmem.h>
> #include <asm/io.h>
>
> @@ -30,8 +31,10 @@ static int dcssblk_open(struct block_device *bdev, fmode_t mode);
> static void dcssblk_release(struct gendisk *disk, fmode_t mode);
> static blk_qc_t dcssblk_make_request(struct request_queue *q,
> struct bio *bio);
> -static long dcssblk_direct_access(struct block_device *bdev, sector_t secnum,
> +static long dcssblk_blk_direct_access(struct block_device *bdev, sector_t secnum,
> void **kaddr, pfn_t *pfn, long size);
> +static long dcssblk_dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
> + long nr_pages, void **kaddr, pfn_t *pfn);
>
> static char dcssblk_segments[DCSSBLK_PARM_LEN] = "\0";
>
> @@ -40,7 +43,11 @@ static const struct block_device_operations dcssblk_devops = {
> .owner = THIS_MODULE,
> .open = dcssblk_open,
> .release = dcssblk_release,
> - .direct_access = dcssblk_direct_access,
> + .direct_access = dcssblk_blk_direct_access,
> +};
> +
> +static const struct dax_operations dcssblk_dax_ops = {
> + .direct_access = dcssblk_dax_direct_access,
> };
>
> struct dcssblk_dev_info {
> @@ -57,6 +64,7 @@ struct dcssblk_dev_info {
> struct request_queue *dcssblk_queue;
> int num_of_segments;
> struct list_head seg_list;
> + struct dax_device *dax_dev;
> };
>
> struct segment_info {
> @@ -389,6 +397,8 @@ dcssblk_shared_store(struct device *dev, struct device_attribute *attr, const ch
> }
> list_del(&dev_info->lh);
>
> + kill_dax(dev_info->dax_dev);
> + put_dax(dev_info->dax_dev);
> del_gendisk(dev_info->gd);
> blk_cleanup_queue(dev_info->dcssblk_queue);
> dev_info->gd->queue = NULL;
> @@ -525,6 +535,7 @@ dcssblk_add_store(struct device *dev, struct device_attribute *attr, const char
> int rc, i, j, num_of_segments;
> struct dcssblk_dev_info *dev_info;
> struct segment_info *seg_info, *temp;
> + struct dax_device *dax_dev;
> char *local_buf;
> unsigned long seg_byte_size;
>
> @@ -654,6 +665,11 @@ dcssblk_add_store(struct device *dev, struct device_attribute *attr, const char
> if (rc)
> goto put_dev;
>
> + dax_dev = alloc_dax(dev_info, dev_info->gd->disk_name,
> + &dcssblk_dax_ops);
> + if (!dax_dev)
> + goto put_dev;
> +
The returned dax_dev should be stored into dev_info->dax_dev, for later use
by kill/put_dax(). This can also be done directly, so that we don't need the
local dax_dev variable here.
Also, in the error case, a proper rc should be set before going to put_dev,
probably -ENOMEM.
I took a quick look at the patches for the other affected drivers, and it
looks like axonram also has the "missing rc" issue, and brd has the "missing
brd->dax_dev init" issue; pmem seems to be fine.
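Roughly what I have in mind, as an untested user-space sketch rather
than a patch: alloc_dax() is mocked here with the same argument order
used in the hunk above, the structs are reduced to the fields needed,
and dcssblk_setup_dax() and mock_alloc_fails are hypothetical names
that exist only to make the excerpt self-contained.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

/* Stand-ins for kernel types; only the fields this sketch needs. */
struct dax_operations {
	int unused;
};

struct dax_device {
	void *private;
};

struct dcssblk_dev_info {
	struct dax_device *dax_dev;
};

static const struct dax_operations dcssblk_dax_ops = { 0 };

/* Test knob for the mock below (hypothetical, not in the driver). */
static int mock_alloc_fails;

/* Mock of alloc_dax(): returns NULL when the allocation fails. */
static struct dax_device *alloc_dax(void *private, const char *host,
		const struct dax_operations *ops)
{
	struct dax_device *dax_dev;

	(void)host;
	(void)ops;
	if (mock_alloc_fails)
		return NULL;
	dax_dev = calloc(1, sizeof(*dax_dev));
	if (dax_dev)
		dax_dev->private = private;
	return dax_dev;
}

/*
 * Hypothetical excerpt of dcssblk_add_store(): the result is stored
 * directly in dev_info->dax_dev, so no local variable is needed, and
 * rc is set before jumping to the error label.
 */
static int dcssblk_setup_dax(struct dcssblk_dev_info *dev_info,
		const char *disk_name)
{
	int rc;

	dev_info->dax_dev = alloc_dax(dev_info, disk_name, &dcssblk_dax_ops);
	if (!dev_info->dax_dev) {
		rc = -ENOMEM;
		goto put_dev;
	}
	return 0;

put_dev:
	/* The real driver would drop its references here. */
	return rc;
}
```

With dev_info->dax_dev populated, the kill_dax()/put_dax() calls in the
shared_store and remove_store paths operate on a valid pointer.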
> get_device(&dev_info->dev);
> device_add_disk(&dev_info->dev, dev_info->gd);
>
> @@ -752,6 +768,8 @@ dcssblk_remove_store(struct device *dev, struct device_attribute *attr, const ch
> }
>
> list_del(&dev_info->lh);
> + kill_dax(dev_info->dax_dev);
> + put_dax(dev_info->dax_dev);
> del_gendisk(dev_info->gd);
> blk_cleanup_queue(dev_info->dcssblk_queue);
> dev_info->gd->queue = NULL;
> @@ -883,21 +901,39 @@ dcssblk_make_request(struct request_queue *q, struct bio *bio)
> }
>
> static long
> -dcssblk_direct_access (struct block_device *bdev, sector_t secnum,
> +__dcssblk_direct_access(struct dcssblk_dev_info *dev_info, pgoff_t pgoff,
> + long nr_pages, void **kaddr, pfn_t *pfn)
> +{
> + resource_size_t offset = pgoff * PAGE_SIZE;
> + unsigned long dev_sz;
> +
> + dev_sz = dev_info->end - dev_info->start + 1;
> + *kaddr = (void *) dev_info->start + offset;
> + *pfn = __pfn_to_pfn_t(PFN_DOWN(dev_info->start + offset), PFN_DEV);
> +
> + return (dev_sz - offset) / PAGE_SIZE;
> +}
> +
> +static long
> +dcssblk_blk_direct_access(struct block_device *bdev, sector_t secnum,
> void **kaddr, pfn_t *pfn, long size)
> {
> struct dcssblk_dev_info *dev_info;
> - unsigned long offset, dev_sz;
>
> dev_info = bdev->bd_disk->private_data;
> if (!dev_info)
> return -ENODEV;
> - dev_sz = dev_info->end - dev_info->start + 1;
> - offset = secnum * 512;
> - *kaddr = (void *) dev_info->start + offset;
> - *pfn = __pfn_to_pfn_t(PFN_DOWN(dev_info->start + offset), PFN_DEV);
> + return __dcssblk_direct_access(dev_info, PHYS_PFN(secnum * 512),
> + PHYS_PFN(size), kaddr, pfn) * PAGE_SIZE;
> +}
> +
> +static long
> +dcssblk_dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
> + long nr_pages, void **kaddr, pfn_t *pfn)
> +{
> + struct dcssblk_dev_info *dev_info = dax_get_private(dax_dev);
>
> - return dev_sz - offset;
> + return __dcssblk_direct_access(dev_info, pgoff, nr_pages, kaddr, pfn);
> }
>
> static void
>
On Wed, Apr 19, 2017 at 8:31 AM, Gerald Schaefer
<[email protected]> wrote:
> On Mon, 17 Apr 2017 12:09:32 -0700
> Dan Williams <[email protected]> wrote:
>
>> Setup a dax_dev to have the same lifetime as the dcssblk block device
>> and add a ->direct_access() method that is equivalent to
>> dcssblk_direct_access(). Once fs/dax.c has been converted to use
>> dax_operations the old dcssblk_direct_access() will be removed.
>>
>> Cc: Gerald Schaefer <[email protected]>
>> Signed-off-by: Dan Williams <[email protected]>
>> ---
>> drivers/s390/block/Kconfig | 1 +
>> drivers/s390/block/dcssblk.c | 54 +++++++++++++++++++++++++++++++++++-------
>> 2 files changed, 46 insertions(+), 9 deletions(-)
>>
>> diff --git a/drivers/s390/block/Kconfig b/drivers/s390/block/Kconfig
>> index 4a3b62326183..0acb8c2f9475 100644
>> --- a/drivers/s390/block/Kconfig
>> +++ b/drivers/s390/block/Kconfig
>> @@ -14,6 +14,7 @@ config BLK_DEV_XPRAM
>>
>> config DCSSBLK
>> def_tristate m
>> + select DAX
>> prompt "DCSSBLK support"
>> depends on S390 && BLOCK
>> help
>> diff --git a/drivers/s390/block/dcssblk.c b/drivers/s390/block/dcssblk.c
>> index 415d10a67b7a..682a9eb4934d 100644
>> --- a/drivers/s390/block/dcssblk.c
>> +++ b/drivers/s390/block/dcssblk.c
>> @@ -18,6 +18,7 @@
>> #include <linux/interrupt.h>
>> #include <linux/platform_device.h>
>> #include <linux/pfn_t.h>
>> +#include <linux/dax.h>
>> #include <asm/extmem.h>
>> #include <asm/io.h>
>>
>> @@ -30,8 +31,10 @@ static int dcssblk_open(struct block_device *bdev, fmode_t mode);
>> static void dcssblk_release(struct gendisk *disk, fmode_t mode);
>> static blk_qc_t dcssblk_make_request(struct request_queue *q,
>> struct bio *bio);
>> -static long dcssblk_direct_access(struct block_device *bdev, sector_t secnum,
>> +static long dcssblk_blk_direct_access(struct block_device *bdev, sector_t secnum,
>> void **kaddr, pfn_t *pfn, long size);
>> +static long dcssblk_dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
>> + long nr_pages, void **kaddr, pfn_t *pfn);
>>
>> static char dcssblk_segments[DCSSBLK_PARM_LEN] = "\0";
>>
>> @@ -40,7 +43,11 @@ static const struct block_device_operations dcssblk_devops = {
>> .owner = THIS_MODULE,
>> .open = dcssblk_open,
>> .release = dcssblk_release,
>> - .direct_access = dcssblk_direct_access,
>> + .direct_access = dcssblk_blk_direct_access,
>> +};
>> +
>> +static const struct dax_operations dcssblk_dax_ops = {
>> + .direct_access = dcssblk_dax_direct_access,
>> };
>>
>> struct dcssblk_dev_info {
>> @@ -57,6 +64,7 @@ struct dcssblk_dev_info {
>> struct request_queue *dcssblk_queue;
>> int num_of_segments;
>> struct list_head seg_list;
>> + struct dax_device *dax_dev;
>> };
>>
>> struct segment_info {
>> @@ -389,6 +397,8 @@ dcssblk_shared_store(struct device *dev, struct device_attribute *attr, const ch
>> }
>> list_del(&dev_info->lh);
>>
>> + kill_dax(dev_info->dax_dev);
>> + put_dax(dev_info->dax_dev);
>> del_gendisk(dev_info->gd);
>> blk_cleanup_queue(dev_info->dcssblk_queue);
>> dev_info->gd->queue = NULL;
>> @@ -525,6 +535,7 @@ dcssblk_add_store(struct device *dev, struct device_attribute *attr, const char
>> int rc, i, j, num_of_segments;
>> struct dcssblk_dev_info *dev_info;
>> struct segment_info *seg_info, *temp;
>> + struct dax_device *dax_dev;
>> char *local_buf;
>> unsigned long seg_byte_size;
>>
>> @@ -654,6 +665,11 @@ dcssblk_add_store(struct device *dev, struct device_attribute *attr, const char
>> if (rc)
>> goto put_dev;
>>
>> + dax_dev = alloc_dax(dev_info, dev_info->gd->disk_name,
>> + &dcssblk_dax_ops);
>> + if (!dax_dev)
>> + goto put_dev;
>> +
>
> The returned dax_dev should be stored into dev_info->dax_dev, for later use
> by kill/put_dax(). This can also be done directly, so that we don't need the
> local dax_dev variable here.
>
> Also, in the error case, a proper rc should be set before going to put_dev,
> probably -ENOMEM.
>
> I took a quick look at the patches for the other affected drivers, and it
> looks like axonram also has the "missing rc" issue, and brd the "missing
> brd->dax_dev init" issue; pmem seems to be fine.
Thank you for taking a look. I'll get this fixed up.
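For reference, a minimal userspace sketch of the two fixes requested above: store the alloc_dax() result directly in dev_info->dax_dev, and set rc before jumping to put_dev. The struct layouts, stub_alloc_dax(), and add_store_tail() are hypothetical stand-ins so the snippet compiles outside the kernel; this is not the actual follow-up patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel types; only the fields used here. */
struct dax_device { int unused; };
struct dcssblk_dev_info { struct dax_device *dax_dev; };

/* Stub for alloc_dax(): returns NULL when 'fail' is set, to exercise the error path. */
static struct dax_device *stub_alloc_dax(int fail)
{
	static struct dax_device dev;

	return fail ? NULL : &dev;
}

/* Models the tail of dcssblk_add_store() with both review comments applied. */
static int add_store_tail(struct dcssblk_dev_info *dev_info, int fail)
{
	int rc = 0;

	assert(dev_info != NULL);

	/* Store directly into dev_info so kill_dax()/put_dax() can find it later. */
	dev_info->dax_dev = stub_alloc_dax(fail);
	if (!dev_info->dax_dev) {
		rc = -ENOMEM;	/* without this, the error path would return rc == 0 */
		goto put_dev;
	}
	return 0;

put_dev:
	/* The real function would drop its device reference here. */
	return rc;
}
```

With this shape the local dax_dev variable is no longer needed, and the teardown path sees the stored pointer.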
On Mon, Apr 17, 2017 at 12:09 PM, Dan Williams <[email protected]> wrote:
> Allocate a dax_device to represent the capacity of a device-mapper
> instance. Provide a ->direct_access() method via the new dax_operations
> indirection that mirrors the functionality of the current direct_access
> support via block_device_operations. Once fs/dax.c has been converted
> to use dax_operations the old dm_blk_direct_access() will be removed.
>
> A new helper dm_dax_get_live_target() is introduced to separate some of
> the dm-specifics from the direct_access implementation.
>
> This enabling is only for the top-level dm representation to upper
> layers. Converting target direct_access implementations is deferred to a
> separate patch.
>
> Cc: Toshi Kani <[email protected]>
> Cc: Mike Snitzer <[email protected]>
Hi Mike,
Any concerns with these dax_device and dax_operations changes to
device-mapper for the upcoming merge window?
> Signed-off-by: Dan Williams <[email protected]>
> ---
> drivers/md/Kconfig | 1
> drivers/md/dm-core.h | 1
> drivers/md/dm.c | 84 ++++++++++++++++++++++++++++++++++-------
> include/linux/device-mapper.h | 1
> 4 files changed, 73 insertions(+), 14 deletions(-)
>
> diff --git a/drivers/md/Kconfig b/drivers/md/Kconfig
> index b7767da50c26..1de8372d9459 100644
> --- a/drivers/md/Kconfig
> +++ b/drivers/md/Kconfig
> @@ -200,6 +200,7 @@ config BLK_DEV_DM_BUILTIN
> config BLK_DEV_DM
> tristate "Device mapper support"
> select BLK_DEV_DM_BUILTIN
> + select DAX
> ---help---
> Device-mapper is a low level volume manager. It works by allowing
> people to specify mappings for ranges of logical sectors. Various
> diff --git a/drivers/md/dm-core.h b/drivers/md/dm-core.h
> index 136fda3ff9e5..538630190f66 100644
> --- a/drivers/md/dm-core.h
> +++ b/drivers/md/dm-core.h
> @@ -58,6 +58,7 @@ struct mapped_device {
> struct target_type *immutable_target_type;
>
> struct gendisk *disk;
> + struct dax_device *dax_dev;
> char name[16];
>
> void *interface_ptr;
> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> index dfb75979e455..bd56dfe43a99 100644
> --- a/drivers/md/dm.c
> +++ b/drivers/md/dm.c
> @@ -16,6 +16,7 @@
> #include <linux/blkpg.h>
> #include <linux/bio.h>
> #include <linux/mempool.h>
> +#include <linux/dax.h>
> #include <linux/slab.h>
> #include <linux/idr.h>
> #include <linux/hdreg.h>
> @@ -908,31 +909,68 @@ int dm_set_target_max_io_len(struct dm_target *ti, sector_t len)
> }
> EXPORT_SYMBOL_GPL(dm_set_target_max_io_len);
>
> -static long dm_blk_direct_access(struct block_device *bdev, sector_t sector,
> - void **kaddr, pfn_t *pfn, long size)
> +static struct dm_target *dm_dax_get_live_target(struct mapped_device *md,
> + sector_t sector, int *srcu_idx)
> {
> - struct mapped_device *md = bdev->bd_disk->private_data;
> struct dm_table *map;
> struct dm_target *ti;
> - int srcu_idx;
> - long len, ret = -EIO;
>
> - map = dm_get_live_table(md, &srcu_idx);
> + map = dm_get_live_table(md, srcu_idx);
> if (!map)
> - goto out;
> + return NULL;
>
> ti = dm_table_find_target(map, sector);
> if (!dm_target_is_valid(ti))
> - goto out;
> + return NULL;
>
> - len = max_io_len(sector, ti) << SECTOR_SHIFT;
> - size = min(len, size);
> + return ti;
> +}
>
> - if (ti->type->direct_access)
> - ret = ti->type->direct_access(ti, sector, kaddr, pfn, size);
> -out:
> +static long dm_dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
> + long nr_pages, void **kaddr, pfn_t *pfn)
> +{
> + struct mapped_device *md = dax_get_private(dax_dev);
> + sector_t sector = pgoff * PAGE_SECTORS;
> + struct dm_target *ti;
> + long len, ret = -EIO;
> + int srcu_idx;
> +
> + ti = dm_dax_get_live_target(md, sector, &srcu_idx);
> +
> + if (!ti)
> + goto out;
> + if (!ti->type->direct_access)
> + goto out;
> + len = max_io_len(sector, ti) / PAGE_SECTORS;
> + if (len < 1)
> + goto out;
> + nr_pages = min(len, nr_pages);
> + if (ti->type->direct_access) {
> + ret = ti->type->direct_access(ti, sector, kaddr, pfn,
> + nr_pages * PAGE_SIZE);
> + /*
> + * FIXME: convert ti->type->direct_access to return
> + * nr_pages directly.
> + */
> + if (ret >= 0)
> + ret /= PAGE_SIZE;
> + }
> + out:
> dm_put_live_table(md, srcu_idx);
> - return min(ret, size);
> +
> + return ret;
> +}
> +
> +static long dm_blk_direct_access(struct block_device *bdev, sector_t sector,
> + void **kaddr, pfn_t *pfn, long size)
> +{
> + struct mapped_device *md = bdev->bd_disk->private_data;
> + struct dax_device *dax_dev = md->dax_dev;
> + long nr_pages = size / PAGE_SIZE;
> +
> + nr_pages = dm_dax_direct_access(dax_dev, sector / PAGE_SECTORS,
> + nr_pages, kaddr, pfn);
> + return nr_pages < 0 ? nr_pages : nr_pages * PAGE_SIZE;
> }
>
> /*
> @@ -1437,6 +1475,7 @@ static int next_free_minor(int *minor)
> }
>
> static const struct block_device_operations dm_blk_dops;
> +static const struct dax_operations dm_dax_ops;
>
> static void dm_wq_work(struct work_struct *work);
>
> @@ -1483,6 +1522,12 @@ static void cleanup_mapped_device(struct mapped_device *md)
> if (md->bs)
> bioset_free(md->bs);
>
> + if (md->dax_dev) {
> + kill_dax(md->dax_dev);
> + put_dax(md->dax_dev);
> + md->dax_dev = NULL;
> + }
> +
> if (md->disk) {
> spin_lock(&_minor_lock);
> md->disk->private_data = NULL;
> @@ -1510,6 +1555,7 @@ static void cleanup_mapped_device(struct mapped_device *md)
> static struct mapped_device *alloc_dev(int minor)
> {
> int r, numa_node_id = dm_get_numa_node();
> + struct dax_device *dax_dev;
> struct mapped_device *md;
> void *old_md;
>
> @@ -1574,6 +1620,12 @@ static struct mapped_device *alloc_dev(int minor)
> md->disk->queue = md->queue;
> md->disk->private_data = md;
> sprintf(md->disk->disk_name, "dm-%d", minor);
> +
> + dax_dev = alloc_dax(md, md->disk->disk_name, &dm_dax_ops);
> + if (!dax_dev)
> + goto bad;
> + md->dax_dev = dax_dev;
> +
> add_disk(md->disk);
> format_dev_t(md->name, MKDEV(_major, minor));
>
> @@ -2781,6 +2833,10 @@ static const struct block_device_operations dm_blk_dops = {
> .owner = THIS_MODULE
> };
>
> +static const struct dax_operations dm_dax_ops = {
> + .direct_access = dm_dax_direct_access,
> +};
> +
> /*
> * module hooks
> */
> diff --git a/include/linux/device-mapper.h b/include/linux/device-mapper.h
> index a7e6903866fd..bcba4d89089c 100644
> --- a/include/linux/device-mapper.h
> +++ b/include/linux/device-mapper.h
> @@ -130,6 +130,7 @@ typedef int (*dm_busy_fn) (struct dm_target *ti);
> */
> typedef long (*dm_direct_access_fn) (struct dm_target *ti, sector_t sector,
> void **kaddr, pfn_t *pfn, long size);
> +#define PAGE_SECTORS (PAGE_SIZE / 512)
>
> void dm_error(const char *message);
>
>
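The byte-based to page-based conversion in the patch above can be modeled in userspace as follows. This is a sketch under stated assumptions: a 4 KiB page size, a fixed max_io_sectors argument standing in for max_io_len(sector, ti), and a stub target that maps whatever it is asked for. All sketch_* and stub_* names are hypothetical.

```c
#include <assert.h>
#include <errno.h>

#define SKETCH_PAGE_SIZE	4096L	/* assumption: 4 KiB pages */
#define SKETCH_PAGE_SECTORS	(SKETCH_PAGE_SIZE / 512)

/* Hypothetical target: still byte based, maps exactly 'size' bytes. */
static long stub_target_direct_access(unsigned long sector, long size)
{
	(void)sector;
	return size;
}

/* Page-based entry point, modeled on dm_dax_direct_access(). */
static long sketch_dax_direct_access(unsigned long pgoff, long nr_pages,
				     long max_io_sectors)
{
	unsigned long sector = pgoff * SKETCH_PAGE_SECTORS;
	long len = max_io_sectors / SKETCH_PAGE_SECTORS;
	long ret;

	assert(max_io_sectors >= 0);

	if (len < 1)
		return -EIO;	/* less than one whole page is addressable */
	if (len < nr_pages)
		nr_pages = len;	/* clamp to the target's max I/O length */

	ret = stub_target_direct_access(sector, nr_pages * SKETCH_PAGE_SIZE);
	if (ret >= 0)
		ret /= SKETCH_PAGE_SIZE;	/* bytes back to pages */
	return ret;
}

/* Legacy byte-based wrapper, modeled on dm_blk_direct_access(). */
static long sketch_blk_direct_access(unsigned long sector, long size,
				     long max_io_sectors)
{
	long nr_pages = sketch_dax_direct_access(sector / SKETCH_PAGE_SECTORS,
						 size / SKETCH_PAGE_SIZE,
						 max_io_sectors);

	return nr_pages < 0 ? nr_pages : nr_pages * SKETCH_PAGE_SIZE;
}
```

The wrapper passes negative errno values through unchanged and only scales successful page counts back to bytes, which matches the return statement in dm_blk_direct_access() above.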
[ adding akpm, sfr, and jens ]
I applied this series and pushed it out for the nvdimm.git branch that
gets auto pulled into -next. The set is still awaiting acks from
device-mapper, ext4, xfs, and vfs (for the copy_from_iter_ops, patch
29/33). If those come next week perhaps this can be merged for 4.12,
but if not this will need to wait until 4.13.
There are some minor collisions with Al's copy_from_user rework, the
new dax tracepoints, and the removal of discard support from the brd
driver. A sample merge is available here:
https://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm.git/log/?h=libnvdimm-for-4.12-merge
If it causes any other problems just drop and I'll retry for 4.13.
On Mon, Apr 17, 2017 at 12:08 PM, Dan Williams <[email protected]> wrote:
> [ resend to add dm-devel, linux-block, and fs-devel, apologies for the
> duplicates ]
>
> Changes since v1 [1] and the dax-fs RFC [2]:
> * rename struct dax_inode to struct dax_device (Christoph)
> * rewrite arch_memcpy_to_pmem() in C with inline asm
> * use QUEUE_FLAG_WC to gate dax cache management (Jeff)
> * add device-mapper plumbing for the ->copy_from_iter() and ->flush()
> dax_operations
> * kill struct blk_dax_ctl and bdev_direct_access (Christoph)
> * cleanup the ->direct_access() calling convention to be page based
> (Christoph)
> * introduce dax_get_by_host() and don't pollute struct super_block with
> dax_device details (Christoph)
>
> [1]: https://lists.01.org/pipermail/linux-nvdimm/2017-January/008586.html
> [2]: https://lwn.net/Articles/713064/
>
> ---
> A few months back, in the course of reviewing the memcpy_nocache()
> proposal from Brian, Linus proposed that the pmem specific
> memcpy_to_pmem() routine be moved to be implemented at the driver level
> [3]:
>
> "Quite frankly, the whole 'memcpy_nocache()' idea or (ab-)using
> copy_user_nocache() just needs to die. It's idiotic.
>
> As you point out, it's also fundamentally buggy crap.
>
> Throw it away. There is no possible way this is ever valid or
> portable. We're not going to lie and claim that it is.
>
> If some driver ends up using 'movnt' by hand, that is up to that
> *driver*. But no way in hell should we care about this one whit in
> the sense of <linux/uaccess.h>."
>
> This feedback also dovetails with another fs/dax.c design wart of being
> hard coded to assume the backing device is pmem. We call the pmem
> specific copy, clear, and flush routines even if the backing device
> driver is one of the other 3 dax drivers (axonram, dcssblk, or brd).
> There is no reason to spend cpu cycles flushing the cache after writing
> to brd, for example, since it is using volatile memory for storage.
>
> Moreover, the pmem driver might be fronting a volatile memory range
> published by the ACPI NFIT, or the platform might have arranged to flush
> cpu caches on power fail. This latter capability is a feature that has
> appeared in embedded storage appliances (pre-ACPI-NFIT nvdimm
> platforms).
>
> So, this series:
>
> 1/ moves what was previously named "the pmem api" out of the global
> namespace and into drivers that need to be concerned with
> architecture specific persistent memory considerations.
>
> 2/ arranges for dax to stop abusing __copy_user_nocache() and implements
> a libnvdimm-local memcpy that uses 'movnt' on x86_64. This might be
> expanded in the future to use 'movntdqa' if the copy size is above
> some threshold, or expanded with support for other architectures [4].
>
> 3/ makes cache maintenance optional by arranging for dax to call driver
> specific copy and flush operations only if the driver publishes them.
>
> 4/ allows filesystem-dax cache management to be controlled by the block
> device write-cache queue flag. The pmem driver is updated to clear
> that flag by default when pmem is driving volatile memory.
>
> [3]: https://lists.01.org/pipermail/linux-nvdimm/2017-January/008364.html
> [4]: https://lists.01.org/pipermail/linux-nvdimm/2017-April/009478.html
>
> These patches have been through a round of build regression fixes
> notified by the 0day robot. All review welcome, but the patches that
> need extra attention are the device-mapper and uio changes
> (copy_from_iter_ops).
>
> This series is based on a merge of char-misc-next (for cdev api reworks)
> and libnvdimm-fixes (dax locking and __copy_user_nocache fixes).
>
> ---
>
> Dan Williams (33):
> device-dax: rename 'dax_dev' to 'dev_dax'
> dax: refactor dax-fs into a generic provider of 'struct dax_device' instances
> dax: add a facility to lookup a dax device by 'host' device name
> dax: introduce dax_operations
> pmem: add dax_operations support
> axon_ram: add dax_operations support
> brd: add dax_operations support
> dcssblk: add dax_operations support
> block: kill bdev_dax_capable()
> dax: introduce dax_direct_access()
> dm: add dax_device and dax_operations support
> dm: teach dm-targets to use a dax_device + dax_operations
> ext2, ext4, xfs: retrieve dax_device for iomap operations
> Revert "block: use DAX for partition table reads"
> filesystem-dax: convert to dax_direct_access()
> block, dax: convert bdev_dax_supported() to dax_direct_access()
> block: remove block_device_operations ->direct_access()
> x86, dax, pmem: remove indirection around memcpy_from_pmem()
> dax, pmem: introduce 'copy_from_iter' dax operation
> dm: add ->copy_from_iter() dax operation support
> filesystem-dax: convert to dax_copy_from_iter()
> dax, pmem: introduce an optional 'flush' dax_operation
> dm: add ->flush() dax operation support
> filesystem-dax: convert to dax_flush()
> x86, dax: replace clear_pmem() with open coded memset + dax_ops->flush
> x86, dax, libnvdimm: move wb_cache_pmem() to libnvdimm
> x86, libnvdimm, pmem: move arch_invalidate_pmem() to libnvdimm
> x86, libnvdimm, dax: stop abusing __copy_user_nocache
> uio, libnvdimm, pmem: implement cache bypass for all copy_from_iter() operations
> libnvdimm, pmem: fix persistence warning
> libnvdimm, nfit: enable support for volatile ranges
> filesystem-dax: gate calls to dax_flush() on QUEUE_FLAG_WC
> libnvdimm, pmem: disable dax flushing when pmem is fronting a volatile region
>
>
> MAINTAINERS | 2
> arch/powerpc/platforms/Kconfig | 1
> arch/powerpc/sysdev/axonram.c | 45 +++-
> arch/x86/Kconfig | 1
> arch/x86/include/asm/pmem.h | 141 ------------
> arch/x86/include/asm/string_64.h | 1
> block/Kconfig | 1
> block/partition-generic.c | 17 -
> drivers/Makefile | 2
> drivers/acpi/nfit/core.c | 15 +
> drivers/block/Kconfig | 1
> drivers/block/brd.c | 52 +++-
> drivers/dax/Kconfig | 10 +
> drivers/dax/Makefile | 5
> drivers/dax/dax.h | 15 -
> drivers/dax/device-dax.h | 25 ++
> drivers/dax/device.c | 415 +++++++++++------------------------
> drivers/dax/pmem.c | 10 -
> drivers/dax/super.c | 445 ++++++++++++++++++++++++++++++++++++++
> drivers/md/Kconfig | 1
> drivers/md/dm-core.h | 1
> drivers/md/dm-linear.c | 53 ++++-
> drivers/md/dm-snap.c | 6 -
> drivers/md/dm-stripe.c | 65 ++++--
> drivers/md/dm-target.c | 6 -
> drivers/md/dm.c | 112 ++++++++--
> drivers/nvdimm/Kconfig | 6 +
> drivers/nvdimm/Makefile | 1
> drivers/nvdimm/bus.c | 10 -
> drivers/nvdimm/claim.c | 9 -
> drivers/nvdimm/core.c | 2
> drivers/nvdimm/dax_devs.c | 2
> drivers/nvdimm/dimm_devs.c | 2
> drivers/nvdimm/namespace_devs.c | 9 -
> drivers/nvdimm/nd-core.h | 9 +
> drivers/nvdimm/pfn_devs.c | 4
> drivers/nvdimm/pmem.c | 82 +++++--
> drivers/nvdimm/pmem.h | 26 ++
> drivers/nvdimm/region_devs.c | 39 ++-
> drivers/nvdimm/x86.c | 155 +++++++++++++
> drivers/s390/block/Kconfig | 1
> drivers/s390/block/dcssblk.c | 44 +++-
> fs/block_dev.c | 117 +++-------
> fs/dax.c | 302 ++++++++++++++------------
> fs/ext2/inode.c | 9 +
> fs/ext4/inode.c | 9 +
> fs/iomap.c | 3
> fs/xfs/xfs_iomap.c | 10 +
> include/linux/blkdev.h | 19 --
> include/linux/dax.h | 43 +++-
> include/linux/device-mapper.h | 14 +
> include/linux/iomap.h | 1
> include/linux/libnvdimm.h | 10 +
> include/linux/pmem.h | 165 --------------
> include/linux/string.h | 8 +
> include/linux/uio.h | 4
> lib/Kconfig | 6 -
> lib/iov_iter.c | 25 ++
> tools/testing/nvdimm/Kbuild | 11 +
> tools/testing/nvdimm/pmem-dax.c | 21 +-
> 60 files changed, 1584 insertions(+), 1042 deletions(-)
> delete mode 100644 arch/x86/include/asm/pmem.h
> create mode 100644 drivers/dax/device-dax.h
> rename drivers/dax/{dax.c => device.c} (60%)
> create mode 100644 drivers/dax/super.c
> create mode 100644 drivers/nvdimm/x86.c
> delete mode 100644 include/linux/pmem.h
On Thu, Apr 20 2017 at 12:30pm -0400,
Dan Williams <[email protected]> wrote:
> On Mon, Apr 17, 2017 at 12:09 PM, Dan Williams <[email protected]> wrote:
> > Allocate a dax_device to represent the capacity of a device-mapper
> > instance. Provide a ->direct_access() method via the new dax_operations
> > indirection that mirrors the functionality of the current direct_access
> > support via block_device_operations. Once fs/dax.c has been converted
> > to use dax_operations the old dm_blk_direct_access() will be removed.
> >
> > A new helper dm_dax_get_live_target() is introduced to separate some of
> > the dm-specifics from the direct_access implementation.
> >
> > This enabling is only for the top-level dm representation to upper
> > layers. Converting target direct_access implementations is deferred to a
> > separate patch.
> >
> > Cc: Toshi Kani <[email protected]>
> > Cc: Mike Snitzer <[email protected]>
>
> Hi Mike,
>
> Any concerns with these dax_device and dax_operations changes to
> device-mapper for the upcoming merge window?
Sorry for the delay.
I just reviewed them; overall they look good. The enabling functions in the
DAX code, that are mixed in with the DM changes, could maybe be factored
out into separate earlier patches but I don't feel that strongly about
that.
Feel free to add this tag to the handful of relevant DM patches:
Reviewed-by: Mike Snitzer <[email protected]>
I haven't done a merge with the linux-dm.git 'dm-4.12' branch but it'd
be good to verify there aren't any merge conflicts. If there are then
it'd be nice to know going in to the merge so that we can forecast as
much to Linus.
I really appreciate you doing this work!
Thanks,
Mike
On Fri, Apr 21, 2017 at 6:06 PM, Dan Williams <[email protected]> wrote:
> [ adding akpm, sfr, and jens ]
>
> I applied this series and pushed it out for the nvdimm.git branch that
> gets auto pulled into -next. The set is still awaiting acks from
> device-mapper, ext4, xfs, and vfs (for the copy_from_iter_ops, patch
> 29/33). If those come next week perhaps this can be merged for 4.12,
> but if not this will need to wait until 4.13.
>
> There are some minor collisions with Al's copy_from_user rework, the
> new dax tracepoints, and the removal of discard support from the brd
> driver. A sample merge is available here:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm.git/log/?h=libnvdimm-for-4.12-merge
>
> If it causes any other problems just drop and I'll retry for 4.13.
Al has nak'd the uaccess related changes, and I'll need to rework
those patches to move the pmem routines into lib/iov_iter.c directly.
That doesn't affect the dax_device and dax_operations work, so I'm
still looking to move forward with that change. That reduces the set
targeting 4.12 to just the first 18 patches from this series:
Dan Williams (18):
device-dax: rename 'dax_dev' to 'dev_dax'
dax: refactor dax-fs into a generic provider of 'struct dax_device' instances
dax: add a facility to lookup a dax device by 'host' device name
dax: introduce dax_operations
pmem: add dax_operations support
axon_ram: add dax_operations support
brd: add dax_operations support
dcssblk: add dax_operations support
block: kill bdev_dax_capable()
dax: introduce dax_direct_access()
dm: add dax_device and dax_operations support
dm: teach dm-targets to use a dax_device + dax_operations
ext2, ext4, xfs: retrieve dax_device for iomap operations
Revert "block: use DAX for partition table reads"
filesystem-dax: convert to dax_direct_access()
block, dax: convert bdev_dax_supported() to dax_direct_access()
block: remove block_device_operations ->direct_access()
x86, dax, pmem: remove indirection around memcpy_from_pmem()
On Mon, 2017-04-17 at 12:09 -0700, Dan Williams wrote:
> diff --git a/drivers/md/Kconfig b/drivers/md/Kconfig
> index b7767da50c26..1de8372d9459 100644
> --- a/drivers/md/Kconfig
> +++ b/drivers/md/Kconfig
> @@ -200,6 +200,7 @@ config BLK_DEV_DM_BUILTIN
> config BLK_DEV_DM
> tristate "Device mapper support"
> select BLK_DEV_DM_BUILTIN
> + select DAX
> ---help---
> Device-mapper is a low level volume manager. It works by allowing
> people to specify mappings for ranges of logical sectors. Various
(replying to an e-mail of three months ago)
Hello Dan,
While building a v4.12 kernel I noticed that enabling device mapper support
now unconditionally enables DAX. I think there are plenty of systems that use
dm but do not need DAX. Have you considered reworking this so that, instead
of dm selecting DAX, DAX support is only enabled in dm if CONFIG_DAX is
enabled?
Thanks,
Bart.
On Fri, Jul 28 2017 at 12:17pm -0400,
Bart Van Assche <[email protected]> wrote:
> On Mon, 2017-04-17 at 12:09 -0700, Dan Williams wrote:
> > diff --git a/drivers/md/Kconfig b/drivers/md/Kconfig
> > index b7767da50c26..1de8372d9459 100644
> > --- a/drivers/md/Kconfig
> > +++ b/drivers/md/Kconfig
> > @@ -200,6 +200,7 @@ config BLK_DEV_DM_BUILTIN
> > config BLK_DEV_DM
> > tristate "Device mapper support"
> > select BLK_DEV_DM_BUILTIN
> > + select DAX
> > ---help---
> > Device-mapper is a low level volume manager. It works by allowing
> > people to specify mappings for ranges of logical sectors. Various
>
> (replying to an e-mail of three months ago)
>
> Hello Dan,
>
> While building a v4.12 kernel I noticed that enabling device mapper support
> now unconditionally enables DAX. I think there are plenty of systems that use
> dm but do not need DAX. Have you considered reworking this so that, instead
> of dm selecting DAX, DAX support is only enabled in dm if CONFIG_DAX is
> enabled?
I haven't but patches to do so would be welcomed.
Mike
On Fri, Jul 28, 2017 at 9:17 AM, Bart Van Assche <[email protected]> wrote:
> On Mon, 2017-04-17 at 12:09 -0700, Dan Williams wrote:
>> diff --git a/drivers/md/Kconfig b/drivers/md/Kconfig
>> index b7767da50c26..1de8372d9459 100644
>> --- a/drivers/md/Kconfig
>> +++ b/drivers/md/Kconfig
>> @@ -200,6 +200,7 @@ config BLK_DEV_DM_BUILTIN
>> config BLK_DEV_DM
>> tristate "Device mapper support"
>> select BLK_DEV_DM_BUILTIN
>> + select DAX
>> ---help---
>> Device-mapper is a low level volume manager. It works by allowing
>> people to specify mappings for ranges of logical sectors. Various
>
> (replying to an e-mail of three months ago)
>
> Hello Dan,
>
> While building a v4.12 kernel I noticed that enabling device mapper support
> now unconditionally enables DAX. I think there are plenty of systems that use
> dm but do not need DAX. Have you considered reworking this so that, instead
> of dm selecting DAX, DAX support is only enabled in dm if CONFIG_DAX is
> enabled?
>
I'd rather flip this around and add a CONFIG_DM_DAX that gates whether
DM enables / links to the DAX core. I'll take a look at a patch.
On Sat, 2017-07-29 at 12:57 -0700, Dan Williams wrote:
> On Fri, Jul 28, 2017 at 9:17 AM, Bart Van Assche <[email protected]> wrote:
> > On Mon, 2017-04-17 at 12:09 -0700, Dan Williams wrote:
> > > diff --git a/drivers/md/Kconfig b/drivers/md/Kconfig
> > > index b7767da50c26..1de8372d9459 100644
> > > --- a/drivers/md/Kconfig
> > > +++ b/drivers/md/Kconfig
> > > @@ -200,6 +200,7 @@ config BLK_DEV_DM_BUILTIN
> > > config BLK_DEV_DM
> > > tristate "Device mapper support"
> > > select BLK_DEV_DM_BUILTIN
> > > + select DAX
> > > ---help---
> > > Device-mapper is a low level volume manager. It works by allowing
> > > people to specify mappings for ranges of logical sectors. Various
> >
> > (replying to an e-mail of three months ago)
> >
> > Hello Dan,
> >
> > While building a v4.12 kernel I noticed that enabling device mapper support
> > now unconditionally enables DAX. I think there are plenty of systems that use
> > dm but do not need DAX. Have you considered reworking this so that, instead
> > of dm selecting DAX, DAX support is only enabled in dm if CONFIG_DAX is
> > enabled?
>
> I'd rather flip this around and add a CONFIG_DM_DAX that gates whether
> DM enables / links to the DAX core. I'll take a look at a patch.
Thanks! Please also consider moving all DAX-related dm code into a separate
source file so that the number of #ifdef CONFIG_DM_DAX statements can be
kept to an absolute minimum.
Bart.
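One common way to keep #ifdef out of the callers, sketched here in userspace C: a header provides either the real function or a static inline no-op stub, depending on the config symbol. SKETCH_CONFIG_DM_DAX, sketch_dm_dax_setup(), sketch_alloc_dev(), and the dm-dax.c file name are hypothetical; no such patch appears in this thread.

```c
#include <assert.h>
#include <stddef.h>

/* Define this to model CONFIG_DM_DAX=y; left undefined to model =n. */
/* #define SKETCH_CONFIG_DM_DAX 1 */

#ifdef SKETCH_CONFIG_DM_DAX
/* Real implementation; would live in a separate file (hypothetical dm-dax.c). */
static int sketch_dm_dax_setup(int *has_dax)
{
	*has_dax = 1;
	return 0;
}
#else
/* Stub: succeeds without doing any DAX work, so callers need no #ifdef. */
static inline int sketch_dm_dax_setup(int *has_dax)
{
	*has_dax = 0;
	return 0;
}
#endif

/* Caller modeled on alloc_dev(): identical source for both configurations. */
static int sketch_alloc_dev(int *has_dax)
{
	assert(has_dax != NULL);
	return sketch_dm_dax_setup(has_dax);
}
```

The single #ifdef lives next to the declaration, so the core dm code compiles unchanged whether or not the DAX pieces are built.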