Changes since v7 [1]:
* Introduce noop_direct_IO() and use it to clean up xfs_dax_aops,
ext4_dax_aops, and ext2_dax_aops (Jan, Christoph)
* Clarify dax_associate_entry() vs zero-page and empty entries with
for_each_mapped_pfn() and a comment (Jan)
* Collect reviewed-by's from Jan and Darrick
* Fix an ARCH=UML build failure that made me realize that the patch to
enable filesystems to trigger ->page_free() callbacks was incomplete
with respect to the device-mapper dax enabling.
The investigation into supporting device-mapper with DEV_PAGEMAP_OPS
resulted in a wider rework that includes: 1) picking up the
CONFIG_DAX_DRIVER patches that missed the 4.16 merge window, and 2)
refactoring the build so that the s390 dcssblk driver can use
FS_DAX_LIMITED while everyone else gets full-blown FS_DAX +
DEV_PAGEMAP_OPS with the pmem driver.
[1]: https://lists.01.org/pipermail/linux-nvdimm/2018-March/014913.html
[2]: https://lists.01.org/pipermail/linux-nvdimm/2018-March/014921.html
---
Background:
get_user_pages() in the filesystem pins file-backed memory pages for
access by devices performing dma. However, it only pins the memory
pages, not the page-to-file offset association. If a file is truncated,
the pages are mapped out of the file and dma may continue indefinitely
into a page that is owned by a device driver. This breaks coherency of
the file vs dma, but the assumption is that if userspace wants the
file-space truncated it does not matter what data is inbound from the
device; it is not relevant anymore. The only expectation is that dma can
safely continue while the filesystem reallocates the block(s).
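For illustration, the driver-side pattern described above looks roughly
like the sketch below. Only get_user_pages_fast() and put_page() are
real APIs (the 4.16-era gup_fast signature is assumed); the *_device_dma()
helpers are hypothetical stand-ins for real driver code:

	/* Sketch: pin user pages for dma; helper names are made up. */
	static int example_pin_for_dma(unsigned long user_addr, int npages,
			struct page **pages)
	{
		int i, pinned;

		pinned = get_user_pages_fast(user_addr, npages, 1 /* write */, pages);
		if (pinned <= 0)
			return pinned ? pinned : -EFAULT;

		start_device_dma(pages, pinned);	/* hypothetical driver helper */

		/*
		 * The pages stay pinned while the device performs dma, but
		 * nothing here prevents the filesystem from truncating the
		 * file and reallocating the blocks in the meantime.
		 */
		wait_for_device_dma();			/* hypothetical driver helper */
		for (i = 0; i < pinned; i++)
			put_page(pages[i]);
		return 0;
	}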
Problem:
This expectation that dma can safely continue while the filesystem
changes the block map is broken by dax. With dax the target dma page
*is* the filesystem block. The model of leaving the page pinned for dma,
but truncating the block out of the file, means that the filesystem is
free to reallocate a block under active dma to another file, and the
expected data-incoherency situation has now turned into active
data corruption.
Solution:
Defer all filesystem operations (fallocate(), truncate()) on a dax-mode
file while any page/block in the file is under active dma. This solution
assumes that dma is transient. Cases where dma operations are known not
to be transient, like RDMA, have been explicitly disabled via commits
like 5f1d43de5416 "IB/core: disable memory registration of
filesystem-dax vmas".
The dax_layout_busy_page() routine is called by filesystems with a lock
held against mm faults (i_mmap_lock) to find pinned / busy dax pages.
The process of looking up a busy page invalidates all mappings so that
any subsequent get_user_pages() blocks on i_mmap_lock. The filesystem
continues to call dax_layout_busy_page() until it finally returns no
more active pages. This approach assumes that the page pinning is
transient; if that assumption is violated, the system would likely have
hung from the uncompleted I/O.
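In practice a filesystem is expected to loop over dax_layout_busy_page()
roughly as sketched below before changing the block map. This mirrors
the xfs_break_dax_layouts() implementation later in the series; the
fs_mmap_lock/unlock() and wait_for_page_idle() helpers are placeholders
for the filesystem's own locking and its ___wait_var_event() usage:

	/* Sketch of the dax_layout_busy_page() retry loop. */
	static int fs_break_dax_layouts(struct inode *inode)
	{
		struct page *page;

		for (;;) {
			/* called with the fs mmap lock held to block new faults */
			page = dax_layout_busy_page(inode->i_mapping);
			if (!page)
				return 0; /* no pinned pages, safe to unmap extents */

			/* drop the lock, wait for the page to go idle (count == 1) */
			fs_mmap_unlock(inode);		/* placeholder */
			wait_for_page_idle(page);	/* placeholder */
			fs_mmap_lock(inode);		/* placeholder */
		}
	}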
---
Dan Williams (18):
dax: store pfns in the radix
fs, dax: prepare for dax-specific address_space_operations
block, dax: remove dead code in blkdev_writepages()
xfs, dax: introduce xfs_dax_aops
ext4, dax: introduce ext4_dax_aops
ext2, dax: introduce ext2_dax_aops
fs, dax: use page->mapping to warn if truncate collides with a busy page
dax: introduce CONFIG_DAX_DRIVER
dax, dm: allow device-mapper to operate without dax support
dax, dm: introduce ->fs_{claim,release}() dax_device infrastructure
mm, dax: enable filesystems to trigger dev_pagemap ->page_free callbacks
memremap: split devm_memremap_pages() and memremap() infrastructure
mm, dev_pagemap: introduce CONFIG_DEV_PAGEMAP_OPS
memremap: mark devm_memremap_pages() EXPORT_SYMBOL_GPL
mm, fs, dax: handle layout changes to pinned dax mappings
xfs: prepare xfs_break_layouts() to be called with XFS_MMAPLOCK_EXCL
xfs: prepare xfs_break_layouts() for another layout type
xfs, dax: introduce xfs_break_dax_layouts()
drivers/dax/Kconfig | 5 +
drivers/dax/super.c | 118 +++++++++++++++++++---
drivers/md/Kconfig | 1
drivers/md/dm-linear.c | 6 +
drivers/md/dm-log-writes.c | 95 +++++++++---------
drivers/md/dm-stripe.c | 6 +
drivers/md/dm.c | 66 +++++++++++-
drivers/nvdimm/Kconfig | 2
drivers/nvdimm/pmem.c | 3 -
drivers/s390/block/Kconfig | 2
fs/Kconfig | 1
fs/block_dev.c | 5 -
fs/dax.c | 238 ++++++++++++++++++++++++++++++++++----------
fs/ext2/ext2.h | 1
fs/ext2/inode.c | 46 +++++----
fs/ext2/namei.c | 18 ---
fs/ext2/super.c | 6 +
fs/ext4/inode.c | 42 ++++++--
fs/ext4/super.c | 6 +
fs/libfs.c | 39 +++++++
fs/xfs/xfs_aops.c | 34 +++---
fs/xfs/xfs_aops.h | 1
fs/xfs/xfs_file.c | 73 ++++++++++++-
fs/xfs/xfs_inode.h | 16 +++
fs/xfs/xfs_ioctl.c | 8 -
fs/xfs/xfs_iops.c | 21 +++-
fs/xfs/xfs_pnfs.c | 16 ++-
fs/xfs/xfs_pnfs.h | 6 +
fs/xfs/xfs_super.c | 20 ++--
include/linux/dax.h | 115 ++++++++++++++++++---
include/linux/fs.h | 4 +
include/linux/memremap.h | 25 +----
include/linux/mm.h | 71 ++++++++++---
kernel/Makefile | 3 -
kernel/iomem.c | 167 +++++++++++++++++++++++++++++++
kernel/memremap.c | 210 +++++----------------------------------
mm/Kconfig | 5 +
mm/gup.c | 5 +
mm/hmm.c | 13 --
mm/swap.c | 3 -
40 files changed, 1047 insertions(+), 475 deletions(-)
create mode 100644 kernel/iomem.c
In preparation for examining the busy state of dax pages in the truncate
path, switch from sectors to pfns in the radix.
Cc: Jeff Moyer <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Ross Zwisler <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
drivers/dax/super.c | 15 +++++++--
fs/dax.c | 83 +++++++++++++++++++--------------------------------
2 files changed, 43 insertions(+), 55 deletions(-)
diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index ecdc292aa4e4..2b2332b605e4 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -124,10 +124,19 @@ int __bdev_dax_supported(struct super_block *sb, int blocksize)
return len < 0 ? len : -EIO;
}
- if ((IS_ENABLED(CONFIG_FS_DAX_LIMITED) && pfn_t_special(pfn))
- || pfn_t_devmap(pfn))
+ if (IS_ENABLED(CONFIG_FS_DAX_LIMITED) && pfn_t_special(pfn)) {
+ /*
+ * An arch that has enabled the pmem api should also
+ * have its drivers support pfn_t_devmap()
+ *
+ * This is a developer warning and should not trigger in
+ * production. dax_flush() will crash since it depends
+ * on being able to do (page_address(pfn_to_page())).
+ */
+ WARN_ON(IS_ENABLED(CONFIG_ARCH_HAS_PMEM_API));
+ } else if (pfn_t_devmap(pfn)) {
/* pass */;
- else {
+ } else {
pr_debug("VFS (%s): error: dax support not enabled\n",
sb->s_id);
return -EOPNOTSUPP;
diff --git a/fs/dax.c b/fs/dax.c
index 0276df90e86c..b646a46e4d12 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -73,16 +73,15 @@ fs_initcall(init_dax_wait_table);
#define RADIX_DAX_ZERO_PAGE (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 2))
#define RADIX_DAX_EMPTY (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 3))
-static unsigned long dax_radix_sector(void *entry)
+static unsigned long dax_radix_pfn(void *entry)
{
return (unsigned long)entry >> RADIX_DAX_SHIFT;
}
-static void *dax_radix_locked_entry(sector_t sector, unsigned long flags)
+static void *dax_radix_locked_entry(unsigned long pfn, unsigned long flags)
{
return (void *)(RADIX_TREE_EXCEPTIONAL_ENTRY | flags |
- ((unsigned long)sector << RADIX_DAX_SHIFT) |
- RADIX_DAX_ENTRY_LOCK);
+ (pfn << RADIX_DAX_SHIFT) | RADIX_DAX_ENTRY_LOCK);
}
static unsigned int dax_radix_order(void *entry)
@@ -526,12 +525,13 @@ static int copy_user_dax(struct block_device *bdev, struct dax_device *dax_dev,
*/
static void *dax_insert_mapping_entry(struct address_space *mapping,
struct vm_fault *vmf,
- void *entry, sector_t sector,
+ void *entry, pfn_t pfn_t,
unsigned long flags, bool dirty)
{
struct radix_tree_root *page_tree = &mapping->page_tree;
- void *new_entry;
+ unsigned long pfn = pfn_t_to_pfn(pfn_t);
pgoff_t index = vmf->pgoff;
+ void *new_entry;
if (dirty)
__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
@@ -546,7 +546,7 @@ static void *dax_insert_mapping_entry(struct address_space *mapping,
}
spin_lock_irq(&mapping->tree_lock);
- new_entry = dax_radix_locked_entry(sector, flags);
+ new_entry = dax_radix_locked_entry(pfn, flags);
if (dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) {
/*
@@ -657,17 +657,14 @@ static void dax_mapping_entry_mkclean(struct address_space *mapping,
i_mmap_unlock_read(mapping);
}
-static int dax_writeback_one(struct block_device *bdev,
- struct dax_device *dax_dev, struct address_space *mapping,
- pgoff_t index, void *entry)
+static int dax_writeback_one(struct dax_device *dax_dev,
+ struct address_space *mapping, pgoff_t index, void *entry)
{
struct radix_tree_root *page_tree = &mapping->page_tree;
- void *entry2, **slot, *kaddr;
- long ret = 0, id;
- sector_t sector;
- pgoff_t pgoff;
+ void *entry2, **slot;
+ unsigned long pfn;
+ long ret = 0;
size_t size;
- pfn_t pfn;
/*
* A page got tagged dirty in DAX mapping? Something is seriously
@@ -683,10 +680,10 @@ static int dax_writeback_one(struct block_device *bdev,
goto put_unlocked;
/*
* Entry got reallocated elsewhere? No need to writeback. We have to
- * compare sectors as we must not bail out due to difference in lockbit
+ * compare pfns as we must not bail out due to difference in lockbit
* or entry type.
*/
- if (dax_radix_sector(entry2) != dax_radix_sector(entry))
+ if (dax_radix_pfn(entry2) != dax_radix_pfn(entry))
goto put_unlocked;
if (WARN_ON_ONCE(dax_is_empty_entry(entry) ||
dax_is_zero_entry(entry))) {
@@ -712,33 +709,15 @@ static int dax_writeback_one(struct block_device *bdev,
/*
* Even if dax_writeback_mapping_range() was given a wbc->range_start
* in the middle of a PMD, the 'index' we are given will be aligned to
- * the start index of the PMD, as will the sector we pull from
- * 'entry'. This allows us to flush for PMD_SIZE and not have to
- * worry about partial PMD writebacks.
+ * the start index of the PMD, as will the pfn we pull from 'entry'.
+ * This allows us to flush for PMD_SIZE and not have to worry about
+ * partial PMD writebacks.
*/
- sector = dax_radix_sector(entry);
+ pfn = dax_radix_pfn(entry);
size = PAGE_SIZE << dax_radix_order(entry);
- id = dax_read_lock();
- ret = bdev_dax_pgoff(bdev, sector, size, &pgoff);
- if (ret)
- goto dax_unlock;
-
- /*
- * dax_direct_access() may sleep, so cannot hold tree_lock over
- * its invocation.
- */
- ret = dax_direct_access(dax_dev, pgoff, size / PAGE_SIZE, &kaddr, &pfn);
- if (ret < 0)
- goto dax_unlock;
-
- if (WARN_ON_ONCE(ret < size / PAGE_SIZE)) {
- ret = -EIO;
- goto dax_unlock;
- }
-
- dax_mapping_entry_mkclean(mapping, index, pfn_t_to_pfn(pfn));
- dax_flush(dax_dev, kaddr, size);
+ dax_mapping_entry_mkclean(mapping, index, pfn);
+ dax_flush(dax_dev, page_address(pfn_to_page(pfn)), size);
/*
* After we have flushed the cache, we can clear the dirty tag. There
* cannot be new dirty data in the pfn after the flush has completed as
@@ -749,8 +728,6 @@ static int dax_writeback_one(struct block_device *bdev,
radix_tree_tag_clear(page_tree, index, PAGECACHE_TAG_DIRTY);
spin_unlock_irq(&mapping->tree_lock);
trace_dax_writeback_one(mapping->host, index, size >> PAGE_SHIFT);
- dax_unlock:
- dax_read_unlock(id);
put_locked_mapping_entry(mapping, index);
return ret;
@@ -808,8 +785,8 @@ int dax_writeback_mapping_range(struct address_space *mapping,
break;
}
- ret = dax_writeback_one(bdev, dax_dev, mapping,
- indices[i], pvec.pages[i]);
+ ret = dax_writeback_one(dax_dev, mapping, indices[i],
+ pvec.pages[i]);
if (ret < 0) {
mapping_set_error(mapping, ret);
goto out;
@@ -877,6 +854,7 @@ static int dax_load_hole(struct address_space *mapping, void *entry,
int ret = VM_FAULT_NOPAGE;
struct page *zero_page;
void *entry2;
+ pfn_t pfn;
zero_page = ZERO_PAGE(0);
if (unlikely(!zero_page)) {
@@ -884,14 +862,15 @@ static int dax_load_hole(struct address_space *mapping, void *entry,
goto out;
}
- entry2 = dax_insert_mapping_entry(mapping, vmf, entry, 0,
+ pfn = page_to_pfn_t(zero_page);
+ entry2 = dax_insert_mapping_entry(mapping, vmf, entry, pfn,
RADIX_DAX_ZERO_PAGE, false);
if (IS_ERR(entry2)) {
ret = VM_FAULT_SIGBUS;
goto out;
}
- vm_insert_mixed(vmf->vma, vaddr, page_to_pfn_t(zero_page));
+ vm_insert_mixed(vmf->vma, vaddr, pfn);
out:
trace_dax_load_hole(inode, vmf, ret);
return ret;
@@ -1200,8 +1179,7 @@ static int dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp,
if (error < 0)
goto error_finish_iomap;
- entry = dax_insert_mapping_entry(mapping, vmf, entry,
- dax_iomap_sector(&iomap, pos),
+ entry = dax_insert_mapping_entry(mapping, vmf, entry, pfn,
0, write && !sync);
if (IS_ERR(entry)) {
error = PTR_ERR(entry);
@@ -1280,13 +1258,15 @@ static int dax_pmd_load_hole(struct vm_fault *vmf, struct iomap *iomap,
void *ret = NULL;
spinlock_t *ptl;
pmd_t pmd_entry;
+ pfn_t pfn;
zero_page = mm_get_huge_zero_page(vmf->vma->vm_mm);
if (unlikely(!zero_page))
goto fallback;
- ret = dax_insert_mapping_entry(mapping, vmf, entry, 0,
+ pfn = page_to_pfn_t(zero_page);
+ ret = dax_insert_mapping_entry(mapping, vmf, entry, pfn,
RADIX_DAX_PMD | RADIX_DAX_ZERO_PAGE, false);
if (IS_ERR(ret))
goto fallback;
@@ -1409,8 +1389,7 @@ static int dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp,
if (error < 0)
goto finish_iomap;
- entry = dax_insert_mapping_entry(mapping, vmf, entry,
- dax_iomap_sector(&iomap, pos),
+ entry = dax_insert_mapping_entry(mapping, vmf, entry, pfn,
RADIX_DAX_PMD, write && !sync);
if (IS_ERR(entry))
goto finish_iomap;
In preparation for the dax implementation to start associating dax pages
to inodes via page->mapping, we need to provide a 'struct
address_space_operations' instance for dax. Define some generic VFS aops
helpers for dax. These noop implementations exist in the dax case to
prevent the VFS from falling back to operations with page-cache
assumptions. Also move the dax_writeback_mapping_range() declaration
behind CONFIG_FS_DAX and provide a stub, since it may not be referenced
in the FS_DAX=n case.
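For example, a filesystem's dax address_space_operations can then be
reduced to something like the following sketch. The names
example_dax_writepages()/example_dax_aops and the choice of bdev are
hypothetical; the real xfs/ext4/ext2 conversions come in the following
patches:

	static int example_dax_writepages(struct address_space *mapping,
			struct writeback_control *wbc)
	{
		/* a real filesystem passes its own backing block_device here */
		return dax_writeback_mapping_range(mapping,
				mapping->host->i_sb->s_bdev, wbc);
	}

	static const struct address_space_operations example_dax_aops = {
		.writepages		= example_dax_writepages,
		.direct_IO		= noop_direct_IO,
		.set_page_dirty		= noop_set_page_dirty,
		.invalidatepage		= noop_invalidatepage,
	};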
Cc: Jeff Moyer <[email protected]>
Cc: Ross Zwisler <[email protected]>
Suggested-by: Matthew Wilcox <[email protected]>
Suggested-by: Jan Kara <[email protected]>
Suggested-by: Christoph Hellwig <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Suggested-by: Dave Chinner <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
fs/libfs.c | 39 +++++++++++++++++++++++++++++++++++++++
include/linux/dax.h | 12 +++++++++---
include/linux/fs.h | 4 ++++
3 files changed, 52 insertions(+), 3 deletions(-)
diff --git a/fs/libfs.c b/fs/libfs.c
index 7ff3cb904acd..0fb590d79f30 100644
--- a/fs/libfs.c
+++ b/fs/libfs.c
@@ -1060,6 +1060,45 @@ int noop_fsync(struct file *file, loff_t start, loff_t end, int datasync)
}
EXPORT_SYMBOL(noop_fsync);
+int noop_set_page_dirty(struct page *page)
+{
+ /*
+ * Unlike __set_page_dirty_no_writeback that handles dirty page
+ * tracking in the page object, dax does all dirty tracking in
+ * the inode address_space in response to mkwrite faults. In the
+ * dax case we only need to worry about potentially dirty CPU
+ * caches, not dirty page cache pages to write back.
+ *
+ * This callback is defined to prevent fallback to
+ * __set_page_dirty_buffers() in set_page_dirty().
+ */
+ return 0;
+}
+EXPORT_SYMBOL_GPL(noop_set_page_dirty);
+
+void noop_invalidatepage(struct page *page, unsigned int offset,
+ unsigned int length)
+{
+ /*
+ * There is no page cache to invalidate in the dax case, however
+ * we need this callback defined to prevent falling back to
+ * block_invalidatepage() in do_invalidatepage().
+ */
+}
+EXPORT_SYMBOL_GPL(noop_invalidatepage);
+
+ssize_t noop_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
+{
+ /*
+ * iomap based filesystems support direct I/O without need for
+ * this callback. However, it still needs to be set in
+ * inode->a_ops so that open/fcntl know that direct I/O is
+ * generally supported.
+ */
+ return -EINVAL;
+}
+EXPORT_SYMBOL_GPL(noop_direct_IO);
+
/* Because kfree isn't assignment-compatible with void(void*) ;-/ */
void kfree_link(void *p)
{
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 0185ecdae135..ae27a7efe7ab 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -38,6 +38,7 @@ static inline void put_dax(struct dax_device *dax_dev)
}
#endif
+struct writeback_control;
int bdev_dax_pgoff(struct block_device *, sector_t, size_t, pgoff_t *pgoff);
#if IS_ENABLED(CONFIG_FS_DAX)
int __bdev_dax_supported(struct super_block *sb, int blocksize);
@@ -57,6 +58,8 @@ static inline void fs_put_dax(struct dax_device *dax_dev)
}
struct dax_device *fs_dax_get_by_bdev(struct block_device *bdev);
+int dax_writeback_mapping_range(struct address_space *mapping,
+ struct block_device *bdev, struct writeback_control *wbc);
#else
static inline int bdev_dax_supported(struct super_block *sb, int blocksize)
{
@@ -76,6 +79,12 @@ static inline struct dax_device *fs_dax_get_by_bdev(struct block_device *bdev)
{
return NULL;
}
+
+static inline int dax_writeback_mapping_range(struct address_space *mapping,
+ struct block_device *bdev, struct writeback_control *wbc)
+{
+ return -EOPNOTSUPP;
+}
#endif
int dax_read_lock(void);
@@ -121,7 +130,4 @@ static inline bool dax_mapping(struct address_space *mapping)
return mapping->host && IS_DAX(mapping->host);
}
-struct writeback_control;
-int dax_writeback_mapping_range(struct address_space *mapping,
- struct block_device *bdev, struct writeback_control *wbc);
#endif
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 79c413985305..44f7f7080faa 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -3129,6 +3129,10 @@ extern int simple_rmdir(struct inode *, struct dentry *);
extern int simple_rename(struct inode *, struct dentry *,
struct inode *, struct dentry *, unsigned int);
extern int noop_fsync(struct file *, loff_t, loff_t, int);
+extern int noop_set_page_dirty(struct page *page);
+extern void noop_invalidatepage(struct page *page, unsigned int offset,
+ unsigned int length);
+extern ssize_t noop_direct_IO(struct kiocb *iocb, struct iov_iter *iter);
extern int simple_empty(struct dentry *);
extern int simple_readpage(struct file *file, struct page *page);
extern int simple_write_begin(struct file *file, struct address_space *mapping,
Catch cases where extent unmap operations encounter pages that are
pinned / busy. Typically these are pages pinned for active dma. This
warning is a canary for potential data corruption, as truncated blocks
could be reallocated to a new file while the device is still performing
i/o.
Here is an example of a collision that this implementation catches:
WARNING: CPU: 2 PID: 1286 at fs/dax.c:343 dax_disassociate_entry+0x55/0x80
[..]
Call Trace:
__dax_invalidate_mapping_entry+0x6c/0xf0
dax_delete_mapping_entry+0xf/0x20
truncate_exceptional_pvec_entries.part.12+0x1af/0x200
truncate_inode_pages_range+0x268/0x970
? tlb_gather_mmu+0x10/0x20
? up_write+0x1c/0x40
? unmap_mapping_range+0x73/0x140
xfs_free_file_space+0x1b6/0x5b0 [xfs]
? xfs_file_fallocate+0x7f/0x320 [xfs]
? down_write_nested+0x40/0x70
? xfs_ilock+0x21d/0x2f0 [xfs]
xfs_file_fallocate+0x162/0x320 [xfs]
? rcu_read_lock_sched_held+0x3f/0x70
? rcu_sync_lockdep_assert+0x2a/0x50
? __sb_start_write+0xd0/0x1b0
? vfs_fallocate+0x20c/0x270
vfs_fallocate+0x154/0x270
SyS_fallocate+0x43/0x80
entry_SYSCALL_64_fastpath+0x1f/0x96
Cc: Jeff Moyer <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Ross Zwisler <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
fs/dax.c | 63 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 63 insertions(+)
diff --git a/fs/dax.c b/fs/dax.c
index b646a46e4d12..a77394fe586e 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -298,6 +298,63 @@ static void put_unlocked_mapping_entry(struct address_space *mapping,
dax_wake_mapping_entry_waiter(mapping, index, entry, false);
}
+static unsigned long dax_entry_size(void *entry)
+{
+ if (dax_is_zero_entry(entry))
+ return 0;
+ else if (dax_is_empty_entry(entry))
+ return 0;
+ else if (dax_is_pmd_entry(entry))
+ return PMD_SIZE;
+ else
+ return PAGE_SIZE;
+}
+
+static unsigned long dax_radix_end_pfn(void *entry)
+{
+ return dax_radix_pfn(entry) + dax_entry_size(entry) / PAGE_SIZE;
+}
+
+/*
+ * Iterate through all mapped pfns represented by an entry, i.e. skip
+ * 'empty' and 'zero' entries.
+ */
+#define for_each_mapped_pfn(entry, pfn) \
+ for (pfn = dax_radix_pfn(entry); \
+ pfn < dax_radix_end_pfn(entry); pfn++)
+
+static void dax_associate_entry(void *entry, struct address_space *mapping)
+{
+ unsigned long pfn;
+
+ if (IS_ENABLED(CONFIG_FS_DAX_LIMITED))
+ return;
+
+ for_each_mapped_pfn(entry, pfn) {
+ struct page *page = pfn_to_page(pfn);
+
+ WARN_ON_ONCE(page->mapping);
+ page->mapping = mapping;
+ }
+}
+
+static void dax_disassociate_entry(void *entry, struct address_space *mapping,
+ bool trunc)
+{
+ unsigned long pfn;
+
+ if (IS_ENABLED(CONFIG_FS_DAX_LIMITED))
+ return;
+
+ for_each_mapped_pfn(entry, pfn) {
+ struct page *page = pfn_to_page(pfn);
+
+ WARN_ON_ONCE(trunc && page_ref_count(page) > 1);
+ WARN_ON_ONCE(page->mapping && page->mapping != mapping);
+ page->mapping = NULL;
+ }
+}
+
/*
* Find radix tree entry at given index. If it points to an exceptional entry,
* return it with the radix tree entry locked. If the radix tree doesn't
@@ -404,6 +461,7 @@ static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index,
}
if (pmd_downgrade) {
+ dax_disassociate_entry(entry, mapping, false);
radix_tree_delete(&mapping->page_tree, index);
mapping->nrexceptional--;
dax_wake_mapping_entry_waiter(mapping, index, entry,
@@ -453,6 +511,7 @@ static int __dax_invalidate_mapping_entry(struct address_space *mapping,
(radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_DIRTY) ||
radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_TOWRITE)))
goto out;
+ dax_disassociate_entry(entry, mapping, trunc);
radix_tree_delete(page_tree, index);
mapping->nrexceptional--;
ret = 1;
@@ -547,6 +606,10 @@ static void *dax_insert_mapping_entry(struct address_space *mapping,
spin_lock_irq(&mapping->tree_lock);
new_entry = dax_radix_locked_entry(pfn, flags);
+ if (dax_entry_size(entry) != dax_entry_size(new_entry)) {
+ dax_disassociate_entry(entry, mapping, false);
+ dax_associate_entry(new_entry, mapping);
+ }
if (dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) {
/*
Change device-mapper's DAX dependency to require the presence of at
least one DAX_DRIVER. This allows device-mapper to be built without
bringing along the DAX core, which is especially wasteful when no DAX
drivers, like BLK_DEV_PMEM, are configured.
Cc: Alasdair Kergon <[email protected]>
Reported-by: Bart Van Assche <[email protected]>
Reported-by: kbuild test robot <[email protected]>
Reviewed-by: Mike Snitzer <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
drivers/md/Kconfig | 1
drivers/md/dm-linear.c | 6 +++
drivers/md/dm-log-writes.c | 95 +++++++++++++++++++++++---------------------
drivers/md/dm-stripe.c | 6 +++
drivers/md/dm.c | 10 +++--
include/linux/dax.h | 30 +++++++++++---
6 files changed, 92 insertions(+), 56 deletions(-)
diff --git a/drivers/md/Kconfig b/drivers/md/Kconfig
index 2c8ac3688815..6dfc328b8f99 100644
--- a/drivers/md/Kconfig
+++ b/drivers/md/Kconfig
@@ -201,7 +201,6 @@ config BLK_DEV_DM_BUILTIN
config BLK_DEV_DM
tristate "Device mapper support"
select BLK_DEV_DM_BUILTIN
- select DAX
---help---
Device-mapper is a low level volume manager. It works by allowing
people to specify mappings for ranges of logical sectors. Various
diff --git a/drivers/md/dm-linear.c b/drivers/md/dm-linear.c
index d5f8eff7c11d..89443e0ededa 100644
--- a/drivers/md/dm-linear.c
+++ b/drivers/md/dm-linear.c
@@ -154,6 +154,7 @@ static int linear_iterate_devices(struct dm_target *ti,
return fn(ti, lc->dev, lc->start, ti->len, data);
}
+#if IS_ENABLED(CONFIG_DAX_DRIVER)
static long linear_dax_direct_access(struct dm_target *ti, pgoff_t pgoff,
long nr_pages, void **kaddr, pfn_t *pfn)
{
@@ -184,6 +185,11 @@ static size_t linear_dax_copy_from_iter(struct dm_target *ti, pgoff_t pgoff,
return dax_copy_from_iter(dax_dev, pgoff, addr, bytes, i);
}
+#else
+#define linear_dax_direct_access NULL
+#define linear_dax_copy_from_iter NULL
+#endif
+
static struct target_type linear_target = {
.name = "linear",
.version = {1, 4, 0},
diff --git a/drivers/md/dm-log-writes.c b/drivers/md/dm-log-writes.c
index 3362d866793b..7fcb4216973f 100644
--- a/drivers/md/dm-log-writes.c
+++ b/drivers/md/dm-log-writes.c
@@ -610,51 +610,6 @@ static int log_mark(struct log_writes_c *lc, char *data)
return 0;
}
-static int log_dax(struct log_writes_c *lc, sector_t sector, size_t bytes,
- struct iov_iter *i)
-{
- struct pending_block *block;
-
- if (!bytes)
- return 0;
-
- block = kzalloc(sizeof(struct pending_block), GFP_KERNEL);
- if (!block) {
- DMERR("Error allocating dax pending block");
- return -ENOMEM;
- }
-
- block->data = kzalloc(bytes, GFP_KERNEL);
- if (!block->data) {
- DMERR("Error allocating dax data space");
- kfree(block);
- return -ENOMEM;
- }
-
- /* write data provided via the iterator */
- if (!copy_from_iter(block->data, bytes, i)) {
- DMERR("Error copying dax data");
- kfree(block->data);
- kfree(block);
- return -EIO;
- }
-
- /* rewind the iterator so that the block driver can use it */
- iov_iter_revert(i, bytes);
-
- block->datalen = bytes;
- block->sector = bio_to_dev_sectors(lc, sector);
- block->nr_sectors = ALIGN(bytes, lc->sectorsize) >> lc->sectorshift;
-
- atomic_inc(&lc->pending_blocks);
- spin_lock_irq(&lc->blocks_lock);
- list_add_tail(&block->list, &lc->unflushed_blocks);
- spin_unlock_irq(&lc->blocks_lock);
- wake_up_process(lc->log_kthread);
-
- return 0;
-}
-
static void log_writes_dtr(struct dm_target *ti)
{
struct log_writes_c *lc = ti->private;
@@ -920,6 +875,52 @@ static void log_writes_io_hints(struct dm_target *ti, struct queue_limits *limit
limits->io_min = limits->physical_block_size;
}
+#if IS_ENABLED(CONFIG_DAX_DRIVER)
+static int log_dax(struct log_writes_c *lc, sector_t sector, size_t bytes,
+ struct iov_iter *i)
+{
+ struct pending_block *block;
+
+ if (!bytes)
+ return 0;
+
+ block = kzalloc(sizeof(struct pending_block), GFP_KERNEL);
+ if (!block) {
+ DMERR("Error allocating dax pending block");
+ return -ENOMEM;
+ }
+
+ block->data = kzalloc(bytes, GFP_KERNEL);
+ if (!block->data) {
+ DMERR("Error allocating dax data space");
+ kfree(block);
+ return -ENOMEM;
+ }
+
+ /* write data provided via the iterator */
+ if (!copy_from_iter(block->data, bytes, i)) {
+ DMERR("Error copying dax data");
+ kfree(block->data);
+ kfree(block);
+ return -EIO;
+ }
+
+ /* rewind the iterator so that the block driver can use it */
+ iov_iter_revert(i, bytes);
+
+ block->datalen = bytes;
+ block->sector = bio_to_dev_sectors(lc, sector);
+ block->nr_sectors = ALIGN(bytes, lc->sectorsize) >> lc->sectorshift;
+
+ atomic_inc(&lc->pending_blocks);
+ spin_lock_irq(&lc->blocks_lock);
+ list_add_tail(&block->list, &lc->unflushed_blocks);
+ spin_unlock_irq(&lc->blocks_lock);
+ wake_up_process(lc->log_kthread);
+
+ return 0;
+}
+
static long log_writes_dax_direct_access(struct dm_target *ti, pgoff_t pgoff,
long nr_pages, void **kaddr, pfn_t *pfn)
{
@@ -956,6 +957,10 @@ static size_t log_writes_dax_copy_from_iter(struct dm_target *ti,
dax_copy:
return dax_copy_from_iter(lc->dev->dax_dev, pgoff, addr, bytes, i);
}
+#else
+#define log_writes_dax_direct_access NULL
+#define log_writes_dax_copy_from_iter NULL
+#endif
static struct target_type log_writes_target = {
.name = "log-writes",
diff --git a/drivers/md/dm-stripe.c b/drivers/md/dm-stripe.c
index b5e892149c54..ac2e8ee9d586 100644
--- a/drivers/md/dm-stripe.c
+++ b/drivers/md/dm-stripe.c
@@ -311,6 +311,7 @@ static int stripe_map(struct dm_target *ti, struct bio *bio)
return DM_MAPIO_REMAPPED;
}
+#if IS_ENABLED(CONFIG_DAX_DRIVER)
static long stripe_dax_direct_access(struct dm_target *ti, pgoff_t pgoff,
long nr_pages, void **kaddr, pfn_t *pfn)
{
@@ -351,6 +352,11 @@ static size_t stripe_dax_copy_from_iter(struct dm_target *ti, pgoff_t pgoff,
return dax_copy_from_iter(dax_dev, pgoff, addr, bytes, i);
}
+#else
+#define stripe_dax_direct_access NULL
+#define stripe_dax_copy_from_iter NULL
+#endif
+
/*
* Stripe status:
*
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 68136806d365..ffc93aecc02a 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1800,7 +1800,7 @@ static void cleanup_mapped_device(struct mapped_device *md)
static struct mapped_device *alloc_dev(int minor)
{
int r, numa_node_id = dm_get_numa_node();
- struct dax_device *dax_dev;
+ struct dax_device *dax_dev = NULL;
struct mapped_device *md;
void *old_md;
@@ -1866,9 +1866,11 @@ static struct mapped_device *alloc_dev(int minor)
md->disk->private_data = md;
sprintf(md->disk->disk_name, "dm-%d", minor);
- dax_dev = alloc_dax(md, md->disk->disk_name, &dm_dax_ops);
- if (!dax_dev)
- goto bad;
+ if (IS_ENABLED(CONFIG_DAX_DRIVER)) {
+ dax_dev = alloc_dax(md, md->disk->disk_name, &dm_dax_ops);
+ if (!dax_dev)
+ goto bad;
+ }
md->dax_dev = dax_dev;
add_disk_no_queue_reg(md->disk);
diff --git a/include/linux/dax.h b/include/linux/dax.h
index ae27a7efe7ab..f9eb22ad341e 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -26,16 +26,39 @@ extern struct attribute_group dax_attribute_group;
#if IS_ENABLED(CONFIG_DAX)
struct dax_device *dax_get_by_host(const char *host);
+struct dax_device *alloc_dax(void *private, const char *host,
+ const struct dax_operations *ops);
void put_dax(struct dax_device *dax_dev);
+void kill_dax(struct dax_device *dax_dev);
+void dax_write_cache(struct dax_device *dax_dev, bool wc);
+bool dax_write_cache_enabled(struct dax_device *dax_dev);
#else
static inline struct dax_device *dax_get_by_host(const char *host)
{
return NULL;
}
-
+static inline struct dax_device *alloc_dax(void *private, const char *host,
+ const struct dax_operations *ops)
+{
+ /*
+ * Callers should check IS_ENABLED(CONFIG_DAX) to know if this
+ * NULL is an error or expected.
+ */
+ return NULL;
+}
static inline void put_dax(struct dax_device *dax_dev)
{
}
+static inline void kill_dax(struct dax_device *dax_dev)
+{
+}
+static inline void dax_write_cache(struct dax_device *dax_dev, bool wc)
+{
+}
+static inline bool dax_write_cache_enabled(struct dax_device *dax_dev)
+{
+ return false;
+}
#endif
struct writeback_control;
@@ -89,18 +112,13 @@ static inline int dax_writeback_mapping_range(struct address_space *mapping,
int dax_read_lock(void);
void dax_read_unlock(int id);
-struct dax_device *alloc_dax(void *private, const char *host,
- const struct dax_operations *ops);
bool dax_alive(struct dax_device *dax_dev);
-void kill_dax(struct dax_device *dax_dev);
void *dax_get_private(struct dax_device *dax_dev);
long dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff, long nr_pages,
void **kaddr, pfn_t *pfn);
size_t dax_copy_from_iter(struct dax_device *dax_dev, pgoff_t pgoff, void *addr,
size_t bytes, struct iov_iter *i);
void dax_flush(struct dax_device *dax_dev, void *addr, size_t size);
-void dax_write_cache(struct dax_device *dax_dev, bool wc);
-bool dax_write_cache_enabled(struct dax_device *dax_dev);
ssize_t dax_iomap_rw(struct kiocb *iocb, struct iov_iter *iter,
const struct iomap_ops *ops);
In preparation for allowing filesystems to augment the dev_pagemap
associated with a dax_device, add an ->fs_claim() callback. The
->fs_claim() callback is leveraged by the device-mapper dax
implementation to iterate all member devices in the map and repeat the
claim operation across the array.
In order to resolve collisions between filesystem operations and DMA to
DAX-mapped pages, we need a callback when DMA completes. With a callback
we can hold off filesystem operations while DMA is in-flight and then
resume those operations when the last put_page() occurs on a DMA page.
The ->fs_claim() operation arranges for this callback to be registered,
although that implementation is saved for a later patch.
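The expected filesystem-side usage is roughly the following sketch. The
owner cookie and error code are illustrative, and the actual xfs/ext4
wiring comes in later patches in the series:

	/* Sketch: claim the dax_device at mount, release it at unmount. */
	static int example_fs_claim_dax(struct super_block *sb,
			struct dax_device **dax_devp)
	{
		struct dax_device *dax_dev;

		dax_dev = fs_dax_claim_bdev(sb->s_bdev, sb); /* sb as owner cookie */
		if (!dax_dev)
			return -EOPNOTSUPP; /* not dax capable, or claimed elsewhere */
		*dax_devp = dax_dev;
		return 0;
	}

	static void example_fs_release_dax(struct super_block *sb,
			struct dax_device *dax_dev)
	{
		fs_dax_release(dax_dev, sb);
	}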
Cc: Alasdair Kergon <[email protected]>
Cc: Mike Snitzer <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Ross Zwisler <[email protected]>
Cc: "Jérôme Glisse" <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Jan Kara <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
drivers/dax/super.c | 80 ++++++++++++++++++++++++++++++++++++++++++++++
drivers/md/dm.c | 56 ++++++++++++++++++++++++++++++++
include/linux/dax.h | 16 +++++++++
include/linux/memremap.h | 8 +++++
4 files changed, 160 insertions(+)
diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 2b2332b605e4..c4cf284dfe1c 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -29,6 +29,7 @@ static struct vfsmount *dax_mnt;
static DEFINE_IDA(dax_minor_ida);
static struct kmem_cache *dax_cache __read_mostly;
static struct super_block *dax_superblock __read_mostly;
+static DEFINE_MUTEX(devmap_lock);
#define DAX_HASH_SIZE (PAGE_SIZE / sizeof(struct hlist_head))
static struct hlist_head dax_host_list[DAX_HASH_SIZE];
@@ -169,9 +170,88 @@ struct dax_device {
const char *host;
void *private;
unsigned long flags;
+ struct dev_pagemap *pgmap;
const struct dax_operations *ops;
};
+#if IS_ENABLED(CONFIG_FS_DAX)
+static void generic_dax_pagefree(struct page *page, void *data)
+{
+ /* TODO: wakeup page-idle waiters */
+}
+
+struct dax_device *fs_dax_claim(struct dax_device *dax_dev, void *owner)
+{
+ struct dev_pagemap *pgmap;
+
+ if (!dax_dev->pgmap)
+ return dax_dev;
+ pgmap = dax_dev->pgmap;
+
+ mutex_lock(&devmap_lock);
+ if (pgmap->data && pgmap->data == owner) {
+ /* dm might try to claim the same device more than once... */
+ mutex_unlock(&devmap_lock);
+ return dax_dev;
+ } else if (pgmap->page_free || pgmap->page_fault
+ || pgmap->type != MEMORY_DEVICE_HOST) {
+ put_dax(dax_dev);
+ mutex_unlock(&devmap_lock);
+ return NULL;
+ }
+
+ pgmap->type = MEMORY_DEVICE_FS_DAX;
+ pgmap->page_free = generic_dax_pagefree;
+ pgmap->data = owner;
+ mutex_unlock(&devmap_lock);
+
+ return dax_dev;
+}
+EXPORT_SYMBOL_GPL(fs_dax_claim);
+
+struct dax_device *fs_dax_claim_bdev(struct block_device *bdev, void *owner)
+{
+ struct dax_device *dax_dev;
+
+ if (!blk_queue_dax(bdev->bd_queue))
+ return NULL;
+ dax_dev = fs_dax_get_by_host(bdev->bd_disk->disk_name);
+ if (dax_dev->ops->fs_claim)
+ return dax_dev->ops->fs_claim(dax_dev, owner);
+ else
+ return fs_dax_claim(dax_dev, owner);
+}
+EXPORT_SYMBOL_GPL(fs_dax_claim_bdev);
+
+void __fs_dax_release(struct dax_device *dax_dev, void *owner)
+{
+ struct dev_pagemap *pgmap = dax_dev ? dax_dev->pgmap : NULL;
+
+ put_dax(dax_dev);
+ if (!pgmap)
+ return;
+ if (!pgmap->data)
+ return;
+
+ mutex_lock(&devmap_lock);
+ WARN_ON(pgmap->data != owner);
+ pgmap->type = MEMORY_DEVICE_HOST;
+ pgmap->page_free = NULL;
+ pgmap->data = NULL;
+ mutex_unlock(&devmap_lock);
+}
+EXPORT_SYMBOL_GPL(__fs_dax_release);
+
+void fs_dax_release(struct dax_device *dax_dev, void *owner)
+{
+ if (dax_dev->ops->fs_release)
+ dax_dev->ops->fs_release(dax_dev, owner);
+ else
+ __fs_dax_release(dax_dev, owner);
+}
+EXPORT_SYMBOL_GPL(fs_dax_release);
+#endif
+
static ssize_t write_cache_show(struct device *dev,
struct device_attribute *attr, char *buf)
{
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index ffc93aecc02a..964cb7537f11 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1090,6 +1090,60 @@ static size_t dm_dax_copy_from_iter(struct dax_device *dax_dev, pgoff_t pgoff,
return ret;
}
+static int dm_dax_dev_claim(struct dm_target *ti, struct dm_dev *dev,
+ sector_t start, sector_t len, void *owner)
+{
+ if (fs_dax_claim(dev->dax_dev, owner))
+ return 0;
+ /*
+ * Outside of a kernel bug there is no reason a dax_dev should
+ * fail a claim attempt. Device-mapper should have exclusive
+ * ownership of the dm_dev and the filesystem should have
+ * exclusive ownership of the dm_target.
+ */
+ WARN_ON_ONCE(1);
+ return -ENXIO;
+}
+
+static int dm_dax_dev_release(struct dm_target *ti, struct dm_dev *dev,
+ sector_t start, sector_t len, void *owner)
+{
+ __fs_dax_release(dev->dax_dev, owner);
+ return 0;
+}
+
+static struct dax_device *dm_dax_iterate(struct dax_device *dax_dev,
+ iterate_devices_callout_fn fn, void *arg)
+{
+ struct mapped_device *md = dax_get_private(dax_dev);
+ struct dm_table *map;
+ struct dm_target *ti;
+ int i, srcu_idx;
+
+ map = dm_get_live_table(md, &srcu_idx);
+
+ for (i = 0; i < dm_table_get_num_targets(map); i++) {
+ ti = dm_table_get_target(map, i);
+
+ if (ti->type->iterate_devices)
+ ti->type->iterate_devices(ti, fn, arg);
+ }
+
+ dm_put_live_table(md, srcu_idx);
+ return dax_dev;
+}
+
+static struct dax_device *dm_dax_fs_claim(struct dax_device *dax_dev,
+ void *owner)
+{
+ return dm_dax_iterate(dax_dev, dm_dax_dev_claim, owner);
+}
+
+static void dm_dax_fs_release(struct dax_device *dax_dev, void *owner)
+{
+ dm_dax_iterate(dax_dev, dm_dax_dev_release, owner);
+}
+
/*
* A target may call dm_accept_partial_bio only from the map routine. It is
* allowed for all bio types except REQ_PREFLUSH and REQ_OP_ZONE_RESET.
@@ -3111,6 +3165,8 @@ static const struct block_device_operations dm_blk_dops = {
static const struct dax_operations dm_dax_ops = {
.direct_access = dm_dax_direct_access,
.copy_from_iter = dm_dax_copy_from_iter,
+ .fs_claim = dm_dax_fs_claim,
+ .fs_release = dm_dax_fs_release,
};
/*
diff --git a/include/linux/dax.h b/include/linux/dax.h
index f9eb22ad341e..e9d59a6b06e1 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -20,6 +20,10 @@ struct dax_operations {
/* copy_from_iter: required operation for fs-dax direct-i/o */
size_t (*copy_from_iter)(struct dax_device *, pgoff_t, void *, size_t,
struct iov_iter *);
+ /* fs_claim: setup filesytem parameters for the device's dev_pagemap */
+ struct dax_device *(*fs_claim)(struct dax_device *, void *);
+ /* fs_release: restore device's dev_pagemap to its default state */
+ void (*fs_release)(struct dax_device *, void *);
};
extern struct attribute_group dax_attribute_group;
@@ -83,6 +87,8 @@ static inline void fs_put_dax(struct dax_device *dax_dev)
struct dax_device *fs_dax_get_by_bdev(struct block_device *bdev);
int dax_writeback_mapping_range(struct address_space *mapping,
struct block_device *bdev, struct writeback_control *wbc);
+struct dax_device *fs_dax_claim(struct dax_device *dax_dev, void *owner);
+void __fs_dax_release(struct dax_device *dax_dev, void *owner);
#else
static inline int bdev_dax_supported(struct super_block *sb, int blocksize)
{
@@ -108,6 +114,16 @@ static inline int dax_writeback_mapping_range(struct address_space *mapping,
{
return -EOPNOTSUPP;
}
+
+static inline struct dax_device *fs_dax_claim(struct dax_device *dax_dev,
+ void *owner)
+{
+ return NULL;
+}
+
+static inline void __fs_dax_release(struct dax_device *dax_dev, void *owner)
+{
+}
#endif
int dax_read_lock(void);
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 7b4899c06f49..02d6d042ee7f 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -53,11 +53,19 @@ struct vmem_altmap {
* driver can hotplug the device memory using ZONE_DEVICE and with that memory
* type. Any page of a process can be migrated to such memory. However no one
* should be allow to pin such memory so that it can always be evicted.
+ *
+ * MEMORY_DEVICE_FS_DAX:
+ * When MEMORY_DEVICE_HOST memory is represented by a device that can
+ * host a filesystem, for example /dev/pmem0, that filesystem can
+ * register for a callback when a page is idled. For the filesystem-dax
+ * case page idle callbacks are used to coordinate DMA vs
+ * hole-punch/truncate.
*/
enum memory_type {
MEMORY_DEVICE_HOST = 0,
MEMORY_DEVICE_PRIVATE,
MEMORY_DEVICE_PUBLIC,
+ MEMORY_DEVICE_FS_DAX,
};
/*
The HMM sub-system extended dev_pagemap to arrange a callback when a
dev_pagemap managed page is freed. Since a dev_pagemap page is free /
idle when its reference count is 1, this requires an additional branch
to check the page type at put_page() time. Given put_page() is a hot
path, we do not want to incur that check if HMM is not in use, so a
static branch is used to avoid that overhead when not necessary.
Now, the FS_DAX implementation wants to reuse this mechanism for
receiving dev_pagemap ->page_free() callbacks. Rework the HMM-specific
static-key into a generic mechanism that either HMM or FS_DAX code paths
can enable.
For ARCH=um builds, and any other arch that lacks ZONE_DEVICE support,
care must be taken to compile out the DEV_PAGEMAP_OPS infrastructure.
However, we still need to support FS_DAX in the FS_DAX_LIMITED case
implemented by the s390/dcssblk driver.
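Any code path that installs a ->page_free() callback is expected to pin
the static key around the callback's lifetime, roughly as sketched below
(the example_* names are hypothetical, and the locking that the real
fs_dax_claim()/__fs_dax_release() paths take is omitted):

	static void example_page_free(struct page *page, void *data)
	{
		/* hypothetical: wake anyone waiting for this page to go idle */
	}

	static void example_enable_page_free(struct dev_pagemap *pgmap)
	{
		dev_pagemap_get_ops();		/* flip the put_page() static key */
		pgmap->type = MEMORY_DEVICE_FS_DAX;
		pgmap->page_free = example_page_free;
	}

	static void example_disable_page_free(struct dev_pagemap *pgmap)
	{
		pgmap->type = MEMORY_DEVICE_HOST;
		pgmap->page_free = NULL;
		dev_pagemap_put_ops();		/* release the static-key reference */
	}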
Cc: Martin Schwidefsky <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Michal Hocko <[email protected]>
Reported-by: Thomas Meyer <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: "Jérôme Glisse" <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
drivers/dax/super.c | 4 ++-
fs/Kconfig | 1 +
include/linux/dax.h | 35 ++++++++++++++++-------
include/linux/memremap.h | 17 -----------
include/linux/mm.h | 71 ++++++++++++++++++++++++++++++++++------------
kernel/memremap.c | 30 +++++++++++++++++--
mm/Kconfig | 5 +++
mm/hmm.c | 13 +-------
mm/swap.c | 3 +-
9 files changed, 116 insertions(+), 63 deletions(-)
diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 7d260f118a39..3bafaddd02f1 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -164,7 +164,7 @@ struct dax_device {
const struct dax_operations *ops;
};
-#if IS_ENABLED(CONFIG_FS_DAX)
+#if IS_ENABLED(CONFIG_DEV_PAGEMAP_OPS)
static void generic_dax_pagefree(struct page *page, void *data)
{
/* TODO: wakeup page-idle waiters */
@@ -190,6 +190,7 @@ struct dax_device *fs_dax_claim(struct dax_device *dax_dev, void *owner)
return NULL;
}
+ dev_pagemap_get_ops();
pgmap->type = MEMORY_DEVICE_FS_DAX;
pgmap->page_free = generic_dax_pagefree;
pgmap->data = owner;
@@ -228,6 +229,7 @@ void __fs_dax_release(struct dax_device *dax_dev, void *owner)
pgmap->type = MEMORY_DEVICE_HOST;
pgmap->page_free = NULL;
pgmap->data = NULL;
+ dev_pagemap_put_ops();
mutex_unlock(&devmap_lock);
}
EXPORT_SYMBOL_GPL(__fs_dax_release);
diff --git a/fs/Kconfig b/fs/Kconfig
index bc821a86d965..1e050e012eb9 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -38,6 +38,7 @@ config FS_DAX
bool "Direct Access (DAX) support"
depends on MMU
depends on !(ARM || MIPS || SPARC)
+ select DEV_PAGEMAP_OPS if (ZONE_DEVICE && !FS_DAX_LIMITED)
select FS_IOMAP
select DAX
help
diff --git a/include/linux/dax.h b/include/linux/dax.h
index a88ff009e2a1..a36b74aa96e8 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -4,6 +4,7 @@
#include <linux/fs.h>
#include <linux/mm.h>
+#include <linux/genhd.h>
#include <linux/radix-tree.h>
#include <asm/pgtable.h>
@@ -87,12 +88,8 @@ static inline struct dax_device *fs_dax_get_by_host(const char *host)
return dax_get_by_host(host);
}
-struct dax_device *fs_dax_claim_bdev(struct block_device *bdev, void *owner);
-void fs_dax_release(struct dax_device *dax_dev, void *owner);
int dax_writeback_mapping_range(struct address_space *mapping,
struct block_device *bdev, struct writeback_control *wbc);
-struct dax_device *fs_dax_claim(struct dax_device *dax_dev, void *owner);
-void __fs_dax_release(struct dax_device *dax_dev, void *owner);
#else
static inline int bdev_dax_supported(struct super_block *sb, int blocksize)
{
@@ -104,26 +101,42 @@ static inline struct dax_device *fs_dax_get_by_host(const char *host)
return NULL;
}
+static inline int dax_writeback_mapping_range(struct address_space *mapping,
+ struct block_device *bdev, struct writeback_control *wbc)
+{
+ return -EOPNOTSUPP;
+}
+#endif
+
+#if IS_ENABLED(CONFIG_DEV_PAGEMAP_OPS)
+struct dax_device *fs_dax_claim_bdev(struct block_device *bdev, void *owner);
+struct dax_device *fs_dax_claim(struct dax_device *dax_dev, void *owner);
+void __fs_dax_release(struct dax_device *dax_dev, void *owner);
+void fs_dax_release(struct dax_device *dax_dev, void *owner);
+#else
+#ifdef CONFIG_BLOCK
static inline struct dax_device *fs_dax_claim_bdev(struct block_device *bdev,
void *owner)
{
- return NULL;
+ return fs_dax_get_by_host(bdev->bd_disk->disk_name);
}
-
-static inline void fs_dax_release(struct dax_device *dax_dev, void *owner)
+#else
+static inline struct dax_device *fs_dax_claim_bdev(struct block_device *bdev,
+ void *owner)
{
+ return NULL;
}
+#endif
-static inline int dax_writeback_mapping_range(struct address_space *mapping,
- struct block_device *bdev, struct writeback_control *wbc)
+static inline void fs_dax_release(struct dax_device *dax_dev, void *owner)
{
- return -EOPNOTSUPP;
+ put_dax(dax_dev);
}
static inline struct dax_device *fs_dax_claim(struct dax_device *dax_dev,
void *owner)
{
- return NULL;
+ return dax_dev;
}
static inline void __fs_dax_release(struct dax_device *dax_dev, void *owner)
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 02d6d042ee7f..8cc619fe347b 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -1,7 +1,6 @@
/* SPDX-License-Identifier: GPL-2.0 */
#ifndef _LINUX_MEMREMAP_H_
#define _LINUX_MEMREMAP_H_
-#include <linux/mm.h>
#include <linux/ioport.h>
#include <linux/percpu-refcount.h>
@@ -137,8 +136,6 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
unsigned long vmem_altmap_offset(struct vmem_altmap *altmap);
void vmem_altmap_free(struct vmem_altmap *altmap, unsigned long nr_pfns);
-
-static inline bool is_zone_device_page(const struct page *page);
#else
static inline void *devm_memremap_pages(struct device *dev,
struct dev_pagemap *pgmap)
@@ -169,20 +166,6 @@ static inline void vmem_altmap_free(struct vmem_altmap *altmap,
}
#endif /* CONFIG_ZONE_DEVICE */
-#if defined(CONFIG_DEVICE_PRIVATE) || defined(CONFIG_DEVICE_PUBLIC)
-static inline bool is_device_private_page(const struct page *page)
-{
- return is_zone_device_page(page) &&
- page->pgmap->type == MEMORY_DEVICE_PRIVATE;
-}
-
-static inline bool is_device_public_page(const struct page *page)
-{
- return is_zone_device_page(page) &&
- page->pgmap->type == MEMORY_DEVICE_PUBLIC;
-}
-#endif /* CONFIG_DEVICE_PRIVATE || CONFIG_DEVICE_PUBLIC */
-
static inline void put_dev_pagemap(struct dev_pagemap *pgmap)
{
if (pgmap)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index ad06d42adb1a..be9969e3cf09 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -812,27 +812,65 @@ static inline bool is_zone_device_page(const struct page *page)
}
#endif
-#if defined(CONFIG_DEVICE_PRIVATE) || defined(CONFIG_DEVICE_PUBLIC)
-void put_zone_device_private_or_public_page(struct page *page);
-DECLARE_STATIC_KEY_FALSE(device_private_key);
-#define IS_HMM_ENABLED static_branch_unlikely(&device_private_key)
-static inline bool is_device_private_page(const struct page *page);
-static inline bool is_device_public_page(const struct page *page);
-#else /* CONFIG_DEVICE_PRIVATE || CONFIG_DEVICE_PUBLIC */
-static inline void put_zone_device_private_or_public_page(struct page *page)
+#ifdef CONFIG_DEV_PAGEMAP_OPS
+void dev_pagemap_get_ops(void);
+void dev_pagemap_put_ops(void);
+void __put_devmap_managed_page(struct page *page);
+DECLARE_STATIC_KEY_FALSE(devmap_managed_key);
+static inline bool put_devmap_managed_page(struct page *page)
+{
+ if (!static_branch_unlikely(&devmap_managed_key))
+ return false;
+ if (!is_zone_device_page(page))
+ return false;
+ switch (page->pgmap->type) {
+ case MEMORY_DEVICE_PRIVATE:
+ case MEMORY_DEVICE_PUBLIC:
+ case MEMORY_DEVICE_FS_DAX:
+ __put_devmap_managed_page(page);
+ return true;
+ default:
+ break;
+ }
+ return false;
+}
+
+static inline bool is_device_private_page(const struct page *page)
{
+ return is_zone_device_page(page) &&
+ page->pgmap->type == MEMORY_DEVICE_PRIVATE;
}
-#define IS_HMM_ENABLED 0
+
+static inline bool is_device_public_page(const struct page *page)
+{
+ return is_zone_device_page(page) &&
+ page->pgmap->type == MEMORY_DEVICE_PUBLIC;
+}
+
+#else /* CONFIG_DEV_PAGEMAP_OPS */
+static inline void dev_pagemap_get_ops(void)
+{
+}
+
+static inline void dev_pagemap_put_ops(void)
+{
+}
+
+static inline bool put_devmap_managed_page(struct page *page)
+{
+ return false;
+}
+
static inline bool is_device_private_page(const struct page *page)
{
return false;
}
+
static inline bool is_device_public_page(const struct page *page)
{
return false;
}
-#endif /* CONFIG_DEVICE_PRIVATE || CONFIG_DEVICE_PUBLIC */
-
+#endif /* CONFIG_DEV_PAGEMAP_OPS */
static inline void get_page(struct page *page)
{
@@ -850,16 +888,13 @@ static inline void put_page(struct page *page)
page = compound_head(page);
/*
- * For private device pages we need to catch refcount transition from
- * 2 to 1, when refcount reach one it means the private device page is
- * free and we need to inform the device driver through callback. See
+ * For devmap managed pages we need to catch refcount transition from
+ * 2 to 1, when refcount reach one it means the page is free and we
+ * need to inform the device driver through callback. See
* include/linux/memremap.h and HMM for details.
*/
- if (IS_HMM_ENABLED && unlikely(is_device_private_page(page) ||
- unlikely(is_device_public_page(page)))) {
- put_zone_device_private_or_public_page(page);
+ if (put_devmap_managed_page(page))
return;
- }
if (put_page_testzero(page))
__put_page(page);
diff --git a/kernel/memremap.c b/kernel/memremap.c
index 52a2742f527f..07a6a405cf3d 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -302,8 +302,30 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
return pgmap;
}
-#if IS_ENABLED(CONFIG_DEVICE_PRIVATE) || IS_ENABLED(CONFIG_DEVICE_PUBLIC)
-void put_zone_device_private_or_public_page(struct page *page)
+#ifdef CONFIG_DEV_PAGEMAP_OPS
+DEFINE_STATIC_KEY_FALSE(devmap_managed_key);
+EXPORT_SYMBOL_GPL(devmap_managed_key);
+static atomic_t devmap_enable;
+
+/*
+ * Toggle the static key for ->page_free() callbacks when dev_pagemap
+ * pages go idle.
+ */
+void dev_pagemap_get_ops(void)
+{
+ if (atomic_inc_return(&devmap_enable) == 1)
+ static_branch_enable(&devmap_managed_key);
+}
+EXPORT_SYMBOL_GPL(dev_pagemap_get_ops);
+
+void dev_pagemap_put_ops(void)
+{
+ if (atomic_dec_and_test(&devmap_enable))
+ static_branch_disable(&devmap_managed_key);
+}
+EXPORT_SYMBOL_GPL(dev_pagemap_put_ops);
+
+void __put_devmap_managed_page(struct page *page)
{
int count = page_ref_dec_return(page);
@@ -323,5 +345,5 @@ void put_zone_device_private_or_public_page(struct page *page)
} else if (!count)
__put_page(page);
}
-EXPORT_SYMBOL(put_zone_device_private_or_public_page);
-#endif /* CONFIG_DEVICE_PRIVATE || CONFIG_DEVICE_PUBLIC */
+EXPORT_SYMBOL_GPL(__put_devmap_managed_page);
+#endif /* CONFIG_DEV_PAGEMAP_OPS */
diff --git a/mm/Kconfig b/mm/Kconfig
index c782e8fb7235..dc32828984a3 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -700,6 +700,9 @@ config ARCH_HAS_HMM
config MIGRATE_VMA_HELPER
bool
+config DEV_PAGEMAP_OPS
+ bool
+
config HMM
bool
select MIGRATE_VMA_HELPER
@@ -720,6 +723,7 @@ config DEVICE_PRIVATE
bool "Unaddressable device memory (GPU memory, ...)"
depends on ARCH_HAS_HMM
select HMM
+ select DEV_PAGEMAP_OPS
help
Allows creation of struct pages to represent unaddressable device
@@ -730,6 +734,7 @@ config DEVICE_PUBLIC
bool "Addressable device memory (like GPU memory)"
depends on ARCH_HAS_HMM
select HMM
+ select DEV_PAGEMAP_OPS
help
Allows creation of struct pages to represent addressable device
diff --git a/mm/hmm.c b/mm/hmm.c
index 320545b98ff5..4aa554e76d06 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -35,15 +35,6 @@
#define PA_SECTION_SIZE (1UL << PA_SECTION_SHIFT)
-#if defined(CONFIG_DEVICE_PRIVATE) || defined(CONFIG_DEVICE_PUBLIC)
-/*
- * Device private memory see HMM (Documentation/vm/hmm.txt) or hmm.h
- */
-DEFINE_STATIC_KEY_FALSE(device_private_key);
-EXPORT_SYMBOL(device_private_key);
-#endif /* CONFIG_DEVICE_PRIVATE || CONFIG_DEVICE_PUBLIC */
-
-
#if IS_ENABLED(CONFIG_HMM_MIRROR)
static const struct mmu_notifier_ops hmm_mmu_notifier_ops;
@@ -996,7 +987,7 @@ struct hmm_devmem *hmm_devmem_add(const struct hmm_devmem_ops *ops,
resource_size_t addr;
int ret;
- static_branch_enable(&device_private_key);
+ dev_pagemap_get_ops();
devmem = devres_alloc_node(&hmm_devmem_release, sizeof(*devmem),
GFP_KERNEL, dev_to_node(device));
@@ -1090,7 +1081,7 @@ struct hmm_devmem *hmm_devmem_add_resource(const struct hmm_devmem_ops *ops,
if (res->desc != IORES_DESC_DEVICE_PUBLIC_MEMORY)
return ERR_PTR(-EINVAL);
- static_branch_enable(&device_private_key);
+ dev_pagemap_get_ops();
devmem = devres_alloc_node(&hmm_devmem_release, sizeof(*devmem),
GFP_KERNEL, dev_to_node(device));
diff --git a/mm/swap.c b/mm/swap.c
index 0f17330dd0e5..eed846cfc8b8 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -29,6 +29,7 @@
#include <linux/cpu.h>
#include <linux/notifier.h>
#include <linux/backing-dev.h>
+#include <linux/memremap.h>
#include <linux/memcontrol.h>
#include <linux/gfp.h>
#include <linux/uio.h>
@@ -744,7 +745,7 @@ void release_pages(struct page **pages, int nr)
flags);
locked_pgdat = NULL;
}
- put_zone_device_private_or_public_page(page);
+ put_devmap_managed_page(page);
continue;
}
The devm_memremap_pages() facility is tightly integrated with the
kernel's memory hotplug functionality. It injects an altmap argument
deep into the architecture-specific vmemmap implementation to allow
allocating from specific reserved pages, and it has Linux-specific
assumptions about page structure reference counting relative to
get_user_pages() and get_user_pages_fast(). It was an oversight that
this was not marked EXPORT_SYMBOL_GPL from the outset.
Cc: Michal Hocko <[email protected]>
Cc: "Jérôme Glisse" <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
kernel/memremap.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/memremap.c b/kernel/memremap.c
index 07a6a405cf3d..4b0e17df8981 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -257,7 +257,7 @@ void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap)
devres_free(pgmap);
return ERR_PTR(error);
}
-EXPORT_SYMBOL(devm_memremap_pages);
+EXPORT_SYMBOL_GPL(devm_memremap_pages);
unsigned long vmem_altmap_offset(struct vmem_altmap *altmap)
{
xfs_break_dax_layouts(), similar to xfs_break_leased_layouts(), scans
for busy / pinned dax pages and waits for those pages to go idle before
any potential extent unmap operation.
dax_layout_busy_page() handles synchronizing against new page-busy
events (get_user_pages). It invalidates all mappings to trigger the
get_user_pages slow path, which will eventually block on the xfs inode
lock held in XFS_MMAPLOCK_EXCL mode. If dax_layout_busy_page() finds a
busy page, it returns it for xfs to wait for the page-idle event that
will fire when the page reference count reaches 1 (recall ZONE_DEVICE
pages are idle at count 1, see generic_dax_pagefree()).
While waiting, the XFS_MMAPLOCK_EXCL lock is dropped in order not to
deadlock a process that might be trying to elevate the page count of
more pages before arranging for any of them to go idle. I.e. the typical
case of submitting I/O is that iov_iter_get_pages() elevates the
reference count of all pages in the I/O before starting I/O on the first
page. The process of elevating the reference count of all pages involved
in an I/O may cause faults that need to take XFS_MMAPLOCK_EXCL.
Cc: Jan Kara <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: "Darrick J. Wong" <[email protected]>
Cc: Ross Zwisler <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
fs/xfs/xfs_file.c | 60 +++++++++++++++++++++++++++++++++++++++++++----------
1 file changed, 49 insertions(+), 11 deletions(-)
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 51e6506bdcb1..0342f6fb782f 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -752,6 +752,38 @@ xfs_file_write_iter(
return ret;
}
+static void
+xfs_wait_var_event(
+ struct inode *inode,
+ uint iolock,
+ bool *did_unlock)
+{
+ struct xfs_inode *ip = XFS_I(inode);
+
+ *did_unlock = true;
+ xfs_iunlock(ip, iolock);
+ schedule();
+ xfs_ilock(ip, iolock);
+}
+
+static int
+xfs_break_dax_layouts(
+ struct inode *inode,
+ uint iolock,
+ bool *did_unlock)
+{
+ struct page *page;
+
+ *did_unlock = false;
+ page = dax_layout_busy_page(inode->i_mapping);
+ if (!page)
+ return 0;
+
+ return ___wait_var_event(&page->_refcount,
+ atomic_read(&page->_refcount) == 1, TASK_INTERRUPTIBLE,
+ 0, 0, xfs_wait_var_event(inode, iolock, did_unlock));
+}
+
int
xfs_break_layouts(
struct inode *inode,
@@ -763,17 +795,23 @@ xfs_break_layouts(
ASSERT(xfs_isilocked(XFS_I(inode), XFS_IOLOCK_SHARED|XFS_IOLOCK_EXCL));
- switch (reason) {
- case BREAK_UNMAP:
- ASSERT(xfs_isilocked(XFS_I(inode), XFS_MMAPLOCK_EXCL));
- /* fall through */
- case BREAK_WRITE:
- error = xfs_break_leased_layouts(inode, iolock, &retry);
- break;
- default:
- WARN_ON_ONCE(1);
- return -EINVAL;
- }
+ do {
+ switch (reason) {
+ case BREAK_UNMAP:
+ ASSERT(xfs_isilocked(XFS_I(inode), XFS_MMAPLOCK_EXCL));
+
+ error = xfs_break_dax_layouts(inode, *iolock, &retry);
+ /* fall through */
+ case BREAK_WRITE:
+ if (error || retry)
+ break;
+ error = xfs_break_leased_layouts(inode, iolock, &retry);
+ break;
+ default:
+ WARN_ON_ONCE(1);
+ return -EINVAL;
+ }
+ } while (error == 0 && retry);
return error;
}
In preparation for adding coordination between extent unmap operations
and busy dax-pages, update xfs_break_layouts() to permit it to be called
with the mmap lock held. This lock scheme will be required for
coordinating the break of 'dax layouts' (non-idle dax (ZONE_DEVICE)
pages mapped into the file's address space). Breaking dax layouts will
be added to xfs_break_layouts() in a future patch; for now this preps
the unmap call sites to take and hold XFS_MMAPLOCK_EXCL over the call to
xfs_break_layouts().
Cc: "Darrick J. Wong" <[email protected]>
Cc: Ross Zwisler <[email protected]>
Cc: Dave Chinner <[email protected]>
Suggested-by: Christoph Hellwig <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: "Darrick J. Wong" <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
fs/xfs/xfs_file.c | 5 +----
fs/xfs/xfs_ioctl.c | 5 +----
fs/xfs/xfs_iops.c | 10 +++++++---
fs/xfs/xfs_pnfs.c | 3 ++-
4 files changed, 11 insertions(+), 12 deletions(-)
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 9ea08326f876..18edf04811d0 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -768,7 +768,7 @@ xfs_file_fallocate(
struct xfs_inode *ip = XFS_I(inode);
long error;
enum xfs_prealloc_flags flags = 0;
- uint iolock = XFS_IOLOCK_EXCL;
+ uint iolock = XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL;
loff_t new_size = 0;
bool do_file_insert = false;
@@ -782,9 +782,6 @@ xfs_file_fallocate(
if (error)
goto out_unlock;
- xfs_ilock(ip, XFS_MMAPLOCK_EXCL);
- iolock |= XFS_MMAPLOCK_EXCL;
-
if (mode & FALLOC_FL_PUNCH_HOLE) {
error = xfs_free_file_space(ip, offset, len);
if (error)
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index 89fb1eb80aae..4151fade4bb1 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -614,7 +614,7 @@ xfs_ioc_space(
struct xfs_inode *ip = XFS_I(inode);
struct iattr iattr;
enum xfs_prealloc_flags flags = 0;
- uint iolock = XFS_IOLOCK_EXCL;
+ uint iolock = XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL;
int error;
/*
@@ -648,9 +648,6 @@ xfs_ioc_space(
if (error)
goto out_unlock;
- xfs_ilock(ip, XFS_MMAPLOCK_EXCL);
- iolock |= XFS_MMAPLOCK_EXCL;
-
switch (bf->l_whence) {
case 0: /*SEEK_SET*/
break;
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 951e84df5576..d23aa08426f9 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -1028,13 +1028,17 @@ xfs_vn_setattr(
if (iattr->ia_valid & ATTR_SIZE) {
struct xfs_inode *ip = XFS_I(d_inode(dentry));
- uint iolock = XFS_IOLOCK_EXCL;
+ uint iolock;
+
+ xfs_ilock(ip, XFS_MMAPLOCK_EXCL);
+ iolock = XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL;
error = xfs_break_layouts(d_inode(dentry), &iolock);
- if (error)
+ if (error) {
+ xfs_iunlock(ip, XFS_MMAPLOCK_EXCL);
return error;
+ }
- xfs_ilock(ip, XFS_MMAPLOCK_EXCL);
error = xfs_vn_setattr_size(dentry, iattr);
xfs_iunlock(ip, XFS_MMAPLOCK_EXCL);
} else {
diff --git a/fs/xfs/xfs_pnfs.c b/fs/xfs/xfs_pnfs.c
index aa6c5c193f45..6ea7b0b55d02 100644
--- a/fs/xfs/xfs_pnfs.c
+++ b/fs/xfs/xfs_pnfs.c
@@ -43,7 +43,8 @@ xfs_break_layouts(
while ((error = break_layout(inode, false) == -EWOULDBLOCK)) {
xfs_iunlock(ip, *iolock);
error = break_layout(inode, true);
- *iolock = XFS_IOLOCK_EXCL;
+ *iolock &= ~XFS_IOLOCK_SHARED;
+ *iolock |= XFS_IOLOCK_EXCL;
xfs_ilock(ip, *iolock);
}
When xfs is operating as the back-end of a pNFS block server, it
prevents collisions between local and remote operations by requiring a
lease to be held for remotely accessed blocks. Local filesystem
operations break those leases before writing or mutating the extent map
of the file.
A similar mechanism is needed to prevent operations on pinned dax
mappings, like device-DMA, from colliding with extent unmap operations.
BREAK_WRITE and BREAK_UNMAP are introduced as two distinct levels of
layout breaking.
Layouts are broken in the BREAK_WRITE case to ensure that layout-holders
do not collide with local writes. Additionally, layouts are broken in
the BREAK_UNMAP case to make sure the layout-holder has a consistent
view of the file's extent map. While BREAK_WRITE breaks can be satisfied
by recalling FL_LAYOUT leases, BREAK_UNMAP breaks additionally require
waiting for busy dax-pages to go idle while holding XFS_MMAPLOCK_EXCL.
After this refactoring xfs_break_layouts() becomes the entry point for
coordinating both types of breaks. Finally, xfs_break_leased_layouts()
becomes just the BREAK_WRITE handler.
Note that the unlock tracking is needed by a follow-on change that will
coordinate retrying either break handler until both successfully test
for a lease break while maintaining the lock state.
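As a rough illustration of the intended call-site usage (hypothetical
helpers that assume the usual xfs-internal headers; the real call sites
are converted in the diff below), local write paths only need
BREAK_WRITE, while extent-unmap paths use BREAK_UNMAP with
XFS_MMAPLOCK_EXCL already held:
/* sketch: a local write only needs lease-holders broken */
static int example_break_for_write(struct inode *inode, uint *iolock)
{
	return xfs_break_layouts(inode, iolock, BREAK_WRITE);
}
/* sketch: truncate/hole-punch also needs busy dax pages to go idle */
static int example_break_for_unmap(struct inode *inode, uint *iolock)
{
	ASSERT(xfs_isilocked(XFS_I(inode), XFS_MMAPLOCK_EXCL));
	return xfs_break_layouts(inode, iolock, BREAK_UNMAP);
}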
Cc: Ross Zwisler <[email protected]>
Cc: "Darrick J. Wong" <[email protected]>
Reported-by: Dave Chinner <[email protected]>
Reported-by: Christoph Hellwig <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
fs/xfs/xfs_file.c | 30 ++++++++++++++++++++++++++++--
fs/xfs/xfs_inode.h | 16 ++++++++++++++++
fs/xfs/xfs_ioctl.c | 3 +--
fs/xfs/xfs_iops.c | 6 +++---
fs/xfs/xfs_pnfs.c | 13 +++++++------
fs/xfs/xfs_pnfs.h | 6 ++++--
6 files changed, 59 insertions(+), 15 deletions(-)
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 18edf04811d0..51e6506bdcb1 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -350,7 +350,7 @@ xfs_file_aio_write_checks(
if (error <= 0)
return error;
- error = xfs_break_layouts(inode, iolock);
+ error = xfs_break_layouts(inode, iolock, BREAK_WRITE);
if (error)
return error;
@@ -752,6 +752,32 @@ xfs_file_write_iter(
return ret;
}
+int
+xfs_break_layouts(
+ struct inode *inode,
+ uint *iolock,
+ enum layout_break_reason reason)
+{
+ bool retry = false;
+ int error = 0;
+
+ ASSERT(xfs_isilocked(XFS_I(inode), XFS_IOLOCK_SHARED|XFS_IOLOCK_EXCL));
+
+ switch (reason) {
+ case BREAK_UNMAP:
+ ASSERT(xfs_isilocked(XFS_I(inode), XFS_MMAPLOCK_EXCL));
+ /* fall through */
+ case BREAK_WRITE:
+ error = xfs_break_leased_layouts(inode, iolock, &retry);
+ break;
+ default:
+ WARN_ON_ONCE(1);
+ return -EINVAL;
+ }
+
+ return error;
+}
+
#define XFS_FALLOC_FL_SUPPORTED \
(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE | \
FALLOC_FL_COLLAPSE_RANGE | FALLOC_FL_ZERO_RANGE | \
@@ -778,7 +804,7 @@ xfs_file_fallocate(
return -EOPNOTSUPP;
xfs_ilock(ip, iolock);
- error = xfs_break_layouts(inode, &iolock);
+ error = xfs_break_layouts(inode, &iolock, BREAK_UNMAP);
if (error)
goto out_unlock;
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index 3e8dc990d41c..7e1a077dfc04 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -379,6 +379,20 @@ static inline void xfs_ifunlock(struct xfs_inode *ip)
>> XFS_ILOCK_SHIFT)
/*
+ * Layouts are broken in the BREAK_WRITE case to ensure that
+ * layout-holders do not collide with local writes. Additionally,
+ * layouts are broken in the BREAK_UNMAP case to make sure the
+ * layout-holder has a consistent view of the file's extent map. While
+ * BREAK_WRITE breaks can be satisfied be recalling FL_LAYOUT leases,
+ * BREAK_UNMAP breaks additionally require waiting for busy dax-pages to
+ * go idle.
+ */
+enum layout_break_reason {
+ BREAK_WRITE,
+ BREAK_UNMAP,
+};
+
+/*
* For multiple groups support: if S_ISGID bit is set in the parent
* directory, group of new file is set to that of the parent, and
* new subdirectory gets S_ISGID bit from parent.
@@ -447,6 +461,8 @@ int xfs_zero_eof(struct xfs_inode *ip, xfs_off_t offset,
xfs_fsize_t isize, bool *did_zeroing);
int xfs_zero_range(struct xfs_inode *ip, xfs_off_t pos, xfs_off_t count,
bool *did_zero);
+int xfs_break_layouts(struct inode *inode, uint *iolock,
+ enum layout_break_reason reason);
/* from xfs_iops.c */
extern void xfs_setup_inode(struct xfs_inode *ip);
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index 4151fade4bb1..91e73d663099 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -39,7 +39,6 @@
#include "xfs_icache.h"
#include "xfs_symlink.h"
#include "xfs_trans.h"
-#include "xfs_pnfs.h"
#include "xfs_acl.h"
#include "xfs_btree.h"
#include <linux/fsmap.h>
@@ -644,7 +643,7 @@ xfs_ioc_space(
return error;
xfs_ilock(ip, iolock);
- error = xfs_break_layouts(inode, &iolock);
+ error = xfs_break_layouts(inode, &iolock, BREAK_UNMAP);
if (error)
goto out_unlock;
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index d23aa08426f9..04abb077e91a 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -37,7 +37,6 @@
#include "xfs_da_btree.h"
#include "xfs_dir2.h"
#include "xfs_trans_space.h"
-#include "xfs_pnfs.h"
#include "xfs_iomap.h"
#include <linux/capability.h>
@@ -1027,13 +1026,14 @@ xfs_vn_setattr(
int error;
if (iattr->ia_valid & ATTR_SIZE) {
- struct xfs_inode *ip = XFS_I(d_inode(dentry));
+ struct inode *inode = d_inode(dentry);
+ struct xfs_inode *ip = XFS_I(inode);
uint iolock;
xfs_ilock(ip, XFS_MMAPLOCK_EXCL);
iolock = XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL;
- error = xfs_break_layouts(d_inode(dentry), &iolock);
+ error = xfs_break_layouts(inode, &iolock, BREAK_UNMAP);
if (error) {
xfs_iunlock(ip, XFS_MMAPLOCK_EXCL);
return error;
diff --git a/fs/xfs/xfs_pnfs.c b/fs/xfs/xfs_pnfs.c
index 6ea7b0b55d02..40e69edb7e2e 100644
--- a/fs/xfs/xfs_pnfs.c
+++ b/fs/xfs/xfs_pnfs.c
@@ -31,17 +31,18 @@
* rules in the page fault path we don't bother.
*/
int
-xfs_break_layouts(
+xfs_break_leased_layouts(
struct inode *inode,
- uint *iolock)
+ uint *iolock,
+ bool *did_unlock)
{
struct xfs_inode *ip = XFS_I(inode);
int error;
- ASSERT(xfs_isilocked(ip, XFS_IOLOCK_SHARED|XFS_IOLOCK_EXCL));
-
+ *did_unlock = false;
while ((error = break_layout(inode, false) == -EWOULDBLOCK)) {
xfs_iunlock(ip, *iolock);
+ *did_unlock = true;
error = break_layout(inode, true);
*iolock &= ~XFS_IOLOCK_SHARED;
*iolock |= XFS_IOLOCK_EXCL;
@@ -121,8 +122,8 @@ xfs_fs_map_blocks(
* Lock out any other I/O before we flush and invalidate the pagecache,
* and then hand out a layout to the remote system. This is very
* similar to direct I/O, except that the synchronization is much more
- * complicated. See the comment near xfs_break_layouts for a detailed
- * explanation.
+ * complicated. See the comment near xfs_break_leased_layouts
+ * for a detailed explanation.
*/
xfs_ilock(ip, XFS_IOLOCK_EXCL);
diff --git a/fs/xfs/xfs_pnfs.h b/fs/xfs/xfs_pnfs.h
index bf45951e28fe..0f2f51037064 100644
--- a/fs/xfs/xfs_pnfs.h
+++ b/fs/xfs/xfs_pnfs.h
@@ -9,11 +9,13 @@ int xfs_fs_map_blocks(struct inode *inode, loff_t offset, u64 length,
int xfs_fs_commit_blocks(struct inode *inode, struct iomap *maps, int nr_maps,
struct iattr *iattr);
-int xfs_break_layouts(struct inode *inode, uint *iolock);
+int xfs_break_leased_layouts(struct inode *inode, uint *iolock,
+ bool *did_unlock);
#else
static inline int
-xfs_break_layouts(struct inode *inode, uint *iolock)
+xfs_break_leased_layouts(struct inode *inode, uint *iolock, bool *did_unlock)
{
+ *did_unlock = false;
return 0;
}
#endif /* CONFIG_EXPORTFS_BLOCK_OPS */
In order to resolve collisions between filesystem operations and DMA to
DAX mapped pages we need a callback when DMA completes. With a callback
we can hold off filesystem operations while DMA is in-flight and then
resume those operations when the last put_page() occurs on a DMA page.
Recall that the 'struct page' entries for DAX memory are created with
devm_memremap_pages(). That routine arranges for the pages to be
allocated, but never onlined, so a DAX page is DMA-idle when its
reference count reaches one.
Also recall that the HMM sub-system added infrastructure to trap the
page-idle (2-to-1 reference count) transition of the pages allocated by
devm_memremap_pages() and trigger a callback via the 'struct
dev_pagemap' associated with the page range. Whereas the HMM callbacks
go to a device driver that manages bounce pages in device memory, in the
filesystem-dax case we will call back to a filesystem-specified
callback.
Since the callback is not known at devm_memremap_pages() time we arrange
for the filesystem to install it at mount time. No functional changes
are expected as this only registers a nop handler for the ->page_free()
event for device-mapped pages.
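To show how the pieces are meant to pair up once the series is complete
(an illustrative sketch with hypothetical function names; in this patch
the registered handler is still a nop and the wake-up side arrives in a
later patch):
#include <linux/mm.h>
#include <linux/page_ref.h>
#include <linux/wait_bit.h>
/* provider side: the dev_pagemap ->page_free handler installed at claim time */
static void example_page_free(struct page *page, void *data)
{
	wake_up_var(&page->_refcount);
}
/* consumer side: a filesystem waiting for a busy dax page to go DMA-idle */
static void example_wait_for_page_idle(struct page *page)
{
	wait_var_event(&page->_refcount, page_ref_count(page) == 1);
}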
Cc: Michal Hocko <[email protected]>
Reviewed-by: "Jérôme Glisse" <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
drivers/dax/super.c | 21 +++++++++++----------
drivers/nvdimm/pmem.c | 3 ++-
fs/ext2/super.c | 6 +++---
fs/ext4/super.c | 6 +++---
fs/xfs/xfs_super.c | 20 ++++++++++----------
include/linux/dax.h | 23 ++++++++++++++---------
6 files changed, 43 insertions(+), 36 deletions(-)
diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index c4cf284dfe1c..7d260f118a39 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -63,16 +63,6 @@ int bdev_dax_pgoff(struct block_device *bdev, sector_t sector, size_t size,
}
EXPORT_SYMBOL(bdev_dax_pgoff);
-#if IS_ENABLED(CONFIG_FS_DAX)
-struct dax_device *fs_dax_get_by_bdev(struct block_device *bdev)
-{
- if (!blk_queue_dax(bdev->bd_queue))
- return NULL;
- return fs_dax_get_by_host(bdev->bd_disk->disk_name);
-}
-EXPORT_SYMBOL_GPL(fs_dax_get_by_bdev);
-#endif
-
/**
* __bdev_dax_supported() - Check if the device supports dax for filesystem
* @sb: The superblock of the device
@@ -579,6 +569,17 @@ struct dax_device *alloc_dax(void *private, const char *__host,
}
EXPORT_SYMBOL_GPL(alloc_dax);
+struct dax_device *alloc_dax_devmap(void *private, const char *host,
+ const struct dax_operations *ops, struct dev_pagemap *pgmap)
+{
+ struct dax_device *dax_dev = alloc_dax(private, host, ops);
+
+ if (dax_dev)
+ dax_dev->pgmap = pgmap;
+ return dax_dev;
+}
+EXPORT_SYMBOL_GPL(alloc_dax_devmap);
+
void put_dax(struct dax_device *dax_dev)
{
if (!dax_dev)
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 06f8dcc52ca6..e6d7351f3379 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -408,7 +408,8 @@ static int pmem_attach_disk(struct device *dev,
nvdimm_badblocks_populate(nd_region, &pmem->bb, &bb_res);
disk->bb = &pmem->bb;
- dax_dev = alloc_dax(pmem, disk->disk_name, &pmem_dax_ops);
+ dax_dev = alloc_dax_devmap(pmem, disk->disk_name, &pmem_dax_ops,
+ &pmem->pgmap);
if (!dax_dev) {
put_disk(disk);
return -ENOMEM;
diff --git a/fs/ext2/super.c b/fs/ext2/super.c
index 7666c065b96f..6ae20e319bc4 100644
--- a/fs/ext2/super.c
+++ b/fs/ext2/super.c
@@ -172,7 +172,7 @@ static void ext2_put_super (struct super_block * sb)
brelse (sbi->s_sbh);
sb->s_fs_info = NULL;
kfree(sbi->s_blockgroup_lock);
- fs_put_dax(sbi->s_daxdev);
+ fs_dax_release(sbi->s_daxdev, sb);
kfree(sbi);
}
@@ -817,7 +817,7 @@ static unsigned long descriptor_loc(struct super_block *sb,
static int ext2_fill_super(struct super_block *sb, void *data, int silent)
{
- struct dax_device *dax_dev = fs_dax_get_by_bdev(sb->s_bdev);
+ struct dax_device *dax_dev = fs_dax_claim_bdev(sb->s_bdev, sb);
struct buffer_head * bh;
struct ext2_sb_info * sbi;
struct ext2_super_block * es;
@@ -1213,7 +1213,7 @@ static int ext2_fill_super(struct super_block *sb, void *data, int silent)
kfree(sbi->s_blockgroup_lock);
kfree(sbi);
failed:
- fs_put_dax(dax_dev);
+ fs_dax_release(dax_dev, sb);
return ret;
}
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 39bf464c35f1..315a323729e3 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -952,7 +952,7 @@ static void ext4_put_super(struct super_block *sb)
if (sbi->s_chksum_driver)
crypto_free_shash(sbi->s_chksum_driver);
kfree(sbi->s_blockgroup_lock);
- fs_put_dax(sbi->s_daxdev);
+ fs_dax_release(sbi->s_daxdev, sb);
kfree(sbi);
}
@@ -3398,7 +3398,7 @@ static void ext4_set_resv_clusters(struct super_block *sb)
static int ext4_fill_super(struct super_block *sb, void *data, int silent)
{
- struct dax_device *dax_dev = fs_dax_get_by_bdev(sb->s_bdev);
+ struct dax_device *dax_dev = fs_dax_claim_bdev(sb->s_bdev, sb);
char *orig_data = kstrdup(data, GFP_KERNEL);
struct buffer_head *bh;
struct ext4_super_block *es = NULL;
@@ -4408,7 +4408,7 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
out_free_base:
kfree(sbi);
kfree(orig_data);
- fs_put_dax(dax_dev);
+ fs_dax_release(dax_dev, sb);
return err ? err : ret;
}
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 93588ea3d3d2..ef7dd7148c0b 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -724,7 +724,7 @@ xfs_close_devices(
xfs_free_buftarg(mp, mp->m_logdev_targp);
xfs_blkdev_put(logdev);
- fs_put_dax(dax_logdev);
+ fs_dax_release(dax_logdev, mp);
}
if (mp->m_rtdev_targp) {
struct block_device *rtdev = mp->m_rtdev_targp->bt_bdev;
@@ -732,10 +732,10 @@ xfs_close_devices(
xfs_free_buftarg(mp, mp->m_rtdev_targp);
xfs_blkdev_put(rtdev);
- fs_put_dax(dax_rtdev);
+ fs_dax_release(dax_rtdev, mp);
}
xfs_free_buftarg(mp, mp->m_ddev_targp);
- fs_put_dax(dax_ddev);
+ fs_dax_release(dax_ddev, mp);
}
/*
@@ -753,9 +753,9 @@ xfs_open_devices(
struct xfs_mount *mp)
{
struct block_device *ddev = mp->m_super->s_bdev;
- struct dax_device *dax_ddev = fs_dax_get_by_bdev(ddev);
- struct dax_device *dax_logdev = NULL, *dax_rtdev = NULL;
+ struct dax_device *dax_ddev = fs_dax_claim_bdev(ddev, mp);
struct block_device *logdev = NULL, *rtdev = NULL;
+ struct dax_device *dax_logdev = NULL, *dax_rtdev = NULL;
int error;
/*
@@ -765,7 +765,7 @@ xfs_open_devices(
error = xfs_blkdev_get(mp, mp->m_logname, &logdev);
if (error)
goto out;
- dax_logdev = fs_dax_get_by_bdev(logdev);
+ dax_logdev = fs_dax_claim_bdev(logdev, mp);
}
if (mp->m_rtname) {
@@ -779,7 +779,7 @@ xfs_open_devices(
error = -EINVAL;
goto out_close_rtdev;
}
- dax_rtdev = fs_dax_get_by_bdev(rtdev);
+ dax_rtdev = fs_dax_claim_bdev(rtdev, mp);
}
/*
@@ -813,14 +813,14 @@ xfs_open_devices(
xfs_free_buftarg(mp, mp->m_ddev_targp);
out_close_rtdev:
xfs_blkdev_put(rtdev);
- fs_put_dax(dax_rtdev);
+ fs_dax_release(dax_rtdev, mp);
out_close_logdev:
if (logdev && logdev != ddev) {
xfs_blkdev_put(logdev);
- fs_put_dax(dax_logdev);
+ fs_dax_release(dax_logdev, mp);
}
out:
- fs_put_dax(dax_ddev);
+ fs_dax_release(dax_ddev, mp);
return error;
}
diff --git a/include/linux/dax.h b/include/linux/dax.h
index e9d59a6b06e1..a88ff009e2a1 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -32,6 +32,8 @@ extern struct attribute_group dax_attribute_group;
struct dax_device *dax_get_by_host(const char *host);
struct dax_device *alloc_dax(void *private, const char *host,
const struct dax_operations *ops);
+struct dax_device *alloc_dax_devmap(void *private, const char *host,
+ const struct dax_operations *ops, struct dev_pagemap *pgmap);
void put_dax(struct dax_device *dax_dev);
void kill_dax(struct dax_device *dax_dev);
void dax_write_cache(struct dax_device *dax_dev, bool wc);
@@ -50,6 +52,12 @@ static inline struct dax_device *alloc_dax(void *private, const char *host,
*/
return NULL;
}
+static inline struct dax_device *alloc_dax_devmap(void *private,
+ const char *host, const struct dax_operations *ops,
+ struct dev_pagemap *pgmap)
+{
+ return NULL;
+}
static inline void put_dax(struct dax_device *dax_dev)
{
}
@@ -79,12 +87,8 @@ static inline struct dax_device *fs_dax_get_by_host(const char *host)
return dax_get_by_host(host);
}
-static inline void fs_put_dax(struct dax_device *dax_dev)
-{
- put_dax(dax_dev);
-}
-
-struct dax_device *fs_dax_get_by_bdev(struct block_device *bdev);
+struct dax_device *fs_dax_claim_bdev(struct block_device *bdev, void *owner);
+void fs_dax_release(struct dax_device *dax_dev, void *owner);
int dax_writeback_mapping_range(struct address_space *mapping,
struct block_device *bdev, struct writeback_control *wbc);
struct dax_device *fs_dax_claim(struct dax_device *dax_dev, void *owner);
@@ -100,13 +104,14 @@ static inline struct dax_device *fs_dax_get_by_host(const char *host)
return NULL;
}
-static inline void fs_put_dax(struct dax_device *dax_dev)
+static inline struct dax_device *fs_dax_claim_bdev(struct block_device *bdev,
+ void *owner)
{
+ return NULL;
}
-static inline struct dax_device *fs_dax_get_by_bdev(struct block_device *bdev)
+static inline void fs_dax_release(struct dax_device *dax_dev, void *owner)
{
- return NULL;
}
static inline int dax_writeback_mapping_range(struct address_space *mapping,
Background:
get_user_pages() in the filesystem pins file backed memory pages for
access by devices performing dma. However, it only pins the memory pages
not the page-to-file offset association. If a file is truncated the
pages are mapped out of the file and dma may continue indefinitely into
a page that is owned by a device driver. This breaks coherency of the
file vs dma, but the assumption is that if userspace wants the
file-space truncated it does not matter what data is inbound from the
device, it is not relevant anymore. The only expectation is that dma can
safely continue while the filesystem reallocates the block(s).
Problem:
This expectation that dma can safely continue while the filesystem
changes the block map is broken by dax. With dax the target dma page
*is* the filesystem block. The model of leaving the page pinned for dma,
but truncating the file block out of the file, means that the filesystem
is free to reallocate a block under active dma to another file and now
the expected data-incoherency situation has turned into active
data-corruption.
Solution:
Defer all filesystem operations (fallocate(), truncate()) on a dax mode
file while any page/block in the file is under active dma. This solution
assumes that dma is transient. Cases where dma operations are known to
not be transient, like RDMA, have been explicitly disabled via
commits like 5f1d43de5416 "IB/core: disable memory registration of
filesystem-dax vmas".
The dax_layout_busy_page() routine is called by filesystems with a lock
held against mm faults (i_mmap_lock) to find pinned / busy dax pages.
The process of looking up a busy page invalidates all mappings
to trigger any subsequent get_user_pages() to block on i_mmap_lock.
The filesystem continues to call dax_layout_busy_page() until it finally
returns no more active pages. This approach assumes that the page
pinning is transient; if that assumption is violated, the system would
likely have hung from the uncompleted I/O.
Cc: Jan Kara <[email protected]>
Cc: Jeff Moyer <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: "Darrick J. Wong" <[email protected]>
Cc: Ross Zwisler <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Andrew Morton <[email protected]>
Reported-by: Christoph Hellwig <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
drivers/dax/super.c | 2 +
fs/dax.c | 92 +++++++++++++++++++++++++++++++++++++++++++++++++++
include/linux/dax.h | 25 ++++++++++++++
mm/gup.c | 5 +++
4 files changed, 123 insertions(+), 1 deletion(-)
diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 3bafaddd02f1..91bfc34e3ca7 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -167,7 +167,7 @@ struct dax_device {
#if IS_ENABLED(CONFIG_DEV_PAGEMAP_OPS)
static void generic_dax_pagefree(struct page *page, void *data)
{
- /* TODO: wakeup page-idle waiters */
+ wake_up_var(&page->_refcount);
}
struct dax_device *fs_dax_claim(struct dax_device *dax_dev, void *owner)
diff --git a/fs/dax.c b/fs/dax.c
index a77394fe586e..c01f7989e0aa 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -355,6 +355,19 @@ static void dax_disassociate_entry(void *entry, struct address_space *mapping,
}
}
+static struct page *dax_busy_page(void *entry)
+{
+ unsigned long pfn;
+
+ for_each_mapped_pfn(entry, pfn) {
+ struct page *page = pfn_to_page(pfn);
+
+ if (page_ref_count(page) > 1)
+ return page;
+ }
+ return NULL;
+}
+
/*
* Find radix tree entry at given index. If it points to an exceptional entry,
* return it with the radix tree entry locked. If the radix tree doesn't
@@ -496,6 +509,85 @@ static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index,
return entry;
}
+/**
+ * dax_layout_busy_page - find first pinned page in @mapping
+ * @mapping: address space to scan for a page with ref count > 1
+ *
+ * DAX requires ZONE_DEVICE mapped pages. These pages are never
+ * 'onlined' to the page allocator so they are considered idle when
+ * page->count == 1. A filesystem uses this interface to determine if
+ * any page in the mapping is busy, i.e. for DMA, or other
+ * get_user_pages() usages.
+ *
+ * It is expected that the filesystem is holding locks to block the
+ * establishment of new mappings in this address_space. I.e. it expects
+ * to be able to run unmap_mapping_range() and subsequently not race
+ * mapping_mapped() becoming true. It expects that get_user_pages() pte
+ * walks are performed under rcu_read_lock().
+ */
+struct page *dax_layout_busy_page(struct address_space *mapping)
+{
+ pgoff_t indices[PAGEVEC_SIZE];
+ struct page *page = NULL;
+ struct pagevec pvec;
+ pgoff_t index, end;
+ unsigned i;
+
+ /*
+ * In the 'limited' case get_user_pages() for dax is disabled.
+ */
+ if (IS_ENABLED(CONFIG_FS_DAX_LIMITED))
+ return NULL;
+
+ if (!dax_mapping(mapping) || !mapping_mapped(mapping))
+ return NULL;
+
+ pagevec_init(&pvec);
+ index = 0;
+ end = -1;
+ /*
+ * Flush dax_layout_lock() sections to ensure all possible page
+ * references have been taken, or otherwise arrange for faults
+ * to block on the filesystem lock that is taken for
+ * establishing new mappings.
+ */
+ unmap_mapping_range(mapping, 0, 0, 1);
+ synchronize_rcu();
+
+ while (index < end && pagevec_lookup_entries(&pvec, mapping, index,
+ min(end - index, (pgoff_t)PAGEVEC_SIZE),
+ indices)) {
+ for (i = 0; i < pagevec_count(&pvec); i++) {
+ struct page *pvec_ent = pvec.pages[i];
+ void *entry;
+
+ index = indices[i];
+ if (index >= end)
+ break;
+
+ if (!radix_tree_exceptional_entry(pvec_ent))
+ continue;
+
+ spin_lock_irq(&mapping->tree_lock);
+ entry = get_unlocked_mapping_entry(mapping, index, NULL);
+ if (entry)
+ page = dax_busy_page(entry);
+ put_unlocked_mapping_entry(mapping, index, entry);
+ spin_unlock_irq(&mapping->tree_lock);
+ if (page)
+ break;
+ }
+ pagevec_remove_exceptionals(&pvec);
+ pagevec_release(&pvec);
+ index++;
+
+ if (page)
+ break;
+ }
+ return page;
+}
+EXPORT_SYMBOL_GPL(dax_layout_busy_page);
+
static int __dax_invalidate_mapping_entry(struct address_space *mapping,
pgoff_t index, bool trunc)
{
diff --git a/include/linux/dax.h b/include/linux/dax.h
index a36b74aa96e8..1b0ad014bc28 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -90,6 +90,8 @@ static inline struct dax_device *fs_dax_get_by_host(const char *host)
int dax_writeback_mapping_range(struct address_space *mapping,
struct block_device *bdev, struct writeback_control *wbc);
+
+struct page *dax_layout_busy_page(struct address_space *mapping);
#else
static inline int bdev_dax_supported(struct super_block *sb, int blocksize)
{
@@ -106,6 +108,11 @@ static inline int dax_writeback_mapping_range(struct address_space *mapping,
{
return -EOPNOTSUPP;
}
+
+static inline struct page *dax_layout_busy_page(struct address_space *mapping)
+{
+ return NULL;
+}
#endif
#if IS_ENABLED(CONFIG_DEV_PAGEMAP_OPS)
@@ -113,6 +120,16 @@ struct dax_device *fs_dax_claim_bdev(struct block_device *bdev, void *owner);
struct dax_device *fs_dax_claim(struct dax_device *dax_dev, void *owner);
void __fs_dax_release(struct dax_device *dax_dev, void *owner);
void fs_dax_release(struct dax_device *dax_dev, void *owner);
+
+static inline void dax_layout_lock(void)
+{
+ rcu_read_lock();
+}
+
+static inline void dax_layout_unlock(void)
+{
+ rcu_read_unlock();
+}
#else
#ifdef CONFIG_BLOCK
static inline struct dax_device *fs_dax_claim_bdev(struct block_device *bdev,
@@ -142,6 +159,14 @@ static inline struct dax_device *fs_dax_claim(struct dax_device *dax_dev,
static inline void __fs_dax_release(struct dax_device *dax_dev, void *owner)
{
}
+
+static inline void dax_layout_lock(void)
+{
+}
+
+static inline void dax_layout_unlock(void)
+{
+}
#endif
int dax_read_lock(void);
diff --git a/mm/gup.c b/mm/gup.c
index 1b46e6e74881..a81efac6983a 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -13,6 +13,7 @@
#include <linux/sched/signal.h>
#include <linux/rwsem.h>
#include <linux/hugetlb.h>
+#include <linux/dax.h>
#include <asm/mmu_context.h>
#include <asm/pgtable.h>
@@ -693,7 +694,9 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
if (unlikely(fatal_signal_pending(current)))
return i ? i : -ERESTARTSYS;
cond_resched();
+ dax_layout_lock();
page = follow_page_mask(vma, start, foll_flags, &page_mask);
+ dax_layout_unlock();
if (!page) {
int ret;
ret = faultin_page(tsk, vma, start, &foll_flags,
@@ -1809,7 +1812,9 @@ int get_user_pages_fast(unsigned long start, int nr_pages, int write,
if (gup_fast_permitted(start, nr_pages, write)) {
local_irq_disable();
+ dax_layout_lock();
gup_pgd_range(addr, end, write, pages, &nr);
+ dax_layout_unlock();
local_irq_enable();
ret = nr;
}
In support of allowing device-mapper to compile out idle/dead code when
there are no dax providers in the system, introduce the DAX_DRIVER
symbol. This is selected by all leaf drivers that device-mapper might be
layered on top of. This allows device-mapper to conditionally 'select DAX'
only when a provider is present.
Cc: Martin Schwidefsky <[email protected]>
Cc: Heiko Carstens <[email protected]>
Reported-by: Bart Van Assche <[email protected]>
Reviewed-by: Mike Snitzer <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
drivers/dax/Kconfig | 5 ++++-
drivers/nvdimm/Kconfig | 2 +-
drivers/s390/block/Kconfig | 2 +-
3 files changed, 6 insertions(+), 3 deletions(-)
diff --git a/drivers/dax/Kconfig b/drivers/dax/Kconfig
index b79aa8f7a497..e0700bf4893a 100644
--- a/drivers/dax/Kconfig
+++ b/drivers/dax/Kconfig
@@ -1,3 +1,7 @@
+config DAX_DRIVER
+ select DAX
+ bool
+
menuconfig DAX
tristate "DAX: direct access to differentiated memory"
select SRCU
@@ -16,7 +20,6 @@ config DEV_DAX
baseline memory pool. Mappings of a /dev/daxX.Y device impose
restrictions that make the mapping behavior deterministic.
-
config DEV_DAX_PMEM
tristate "PMEM DAX: direct access to persistent memory"
depends on LIBNVDIMM && NVDIMM_DAX && DEV_DAX
diff --git a/drivers/nvdimm/Kconfig b/drivers/nvdimm/Kconfig
index a65f2e1d9f53..40cbdb16e23e 100644
--- a/drivers/nvdimm/Kconfig
+++ b/drivers/nvdimm/Kconfig
@@ -20,7 +20,7 @@ if LIBNVDIMM
config BLK_DEV_PMEM
tristate "PMEM: Persistent memory block device support"
default LIBNVDIMM
- select DAX
+ select DAX_DRIVER
select ND_BTT if BTT
select ND_PFN if NVDIMM_PFN
help
diff --git a/drivers/s390/block/Kconfig b/drivers/s390/block/Kconfig
index 1444333210c7..9ac7574e3cfb 100644
--- a/drivers/s390/block/Kconfig
+++ b/drivers/s390/block/Kconfig
@@ -15,8 +15,8 @@ config BLK_DEV_XPRAM
config DCSSBLK
def_tristate m
- select DAX
select FS_DAX_LIMITED
+ select DAX_DRIVER
prompt "DCSSBLK support"
depends on S390 && BLOCK
help
Currently, kernel/memremap.c contains generic code for supporting
memremap() (CONFIG_HAS_IOMEM) and devm_memremap_pages()
(CONFIG_ZONE_DEVICE). This causes ongoing build maintenance problems as
additions to memremap.c, especially for the ZONE_DEVICE case, need to be
carefully placed within ifdef guards. Remove the need for these
ifdef guards by moving the ZONE_DEVICE support functions to their own
compilation unit.
Cc: Jan Kara <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: "Jérôme Glisse" <[email protected]>
Cc: Ross Zwisler <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
kernel/Makefile | 3 +
kernel/iomem.c | 167 ++++++++++++++++++++++++++++++++++++++++++++++++++
kernel/memremap.c | 178 +----------------------------------------------------
3 files changed, 171 insertions(+), 177 deletions(-)
create mode 100644 kernel/iomem.c
diff --git a/kernel/Makefile b/kernel/Makefile
index f85ae5dfa474..9b9241361311 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -112,7 +112,8 @@ obj-$(CONFIG_JUMP_LABEL) += jump_label.o
obj-$(CONFIG_CONTEXT_TRACKING) += context_tracking.o
obj-$(CONFIG_TORTURE_TEST) += torture.o
-obj-$(CONFIG_HAS_IOMEM) += memremap.o
+obj-$(CONFIG_HAS_IOMEM) += iomem.o
+obj-$(CONFIG_ZONE_DEVICE) += memremap.o
$(obj)/configs.o: $(obj)/config_data.h
diff --git a/kernel/iomem.c b/kernel/iomem.c
new file mode 100644
index 000000000000..f7525e14ebc6
--- /dev/null
+++ b/kernel/iomem.c
@@ -0,0 +1,167 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#include <linux/device.h>
+#include <linux/types.h>
+#include <linux/io.h>
+#include <linux/mm.h>
+
+#ifndef ioremap_cache
+/* temporary while we convert existing ioremap_cache users to memremap */
+__weak void __iomem *ioremap_cache(resource_size_t offset, unsigned long size)
+{
+ return ioremap(offset, size);
+}
+#endif
+
+#ifndef arch_memremap_wb
+static void *arch_memremap_wb(resource_size_t offset, unsigned long size)
+{
+ return (__force void *)ioremap_cache(offset, size);
+}
+#endif
+
+#ifndef arch_memremap_can_ram_remap
+static bool arch_memremap_can_ram_remap(resource_size_t offset, size_t size,
+ unsigned long flags)
+{
+ return true;
+}
+#endif
+
+static void *try_ram_remap(resource_size_t offset, size_t size,
+ unsigned long flags)
+{
+ unsigned long pfn = PHYS_PFN(offset);
+
+ /* In the simple case just return the existing linear address */
+ if (pfn_valid(pfn) && !PageHighMem(pfn_to_page(pfn)) &&
+ arch_memremap_can_ram_remap(offset, size, flags))
+ return __va(offset);
+
+ return NULL; /* fallback to arch_memremap_wb */
+}
+
+/**
+ * memremap() - remap an iomem_resource as cacheable memory
+ * @offset: iomem resource start address
+ * @size: size of remap
+ * @flags: any of MEMREMAP_WB, MEMREMAP_WT, MEMREMAP_WC,
+ * MEMREMAP_ENC, MEMREMAP_DEC
+ *
+ * memremap() is "ioremap" for cases where it is known that the resource
+ * being mapped does not have i/o side effects and the __iomem
+ * annotation is not applicable. In the case of multiple flags, the different
+ * mapping types will be attempted in the order listed below until one of
+ * them succeeds.
+ *
+ * MEMREMAP_WB - matches the default mapping for System RAM on
+ * the architecture. This is usually a read-allocate write-back cache.
+ * Morever, if MEMREMAP_WB is specified and the requested remap region is RAM
+ * memremap() will bypass establishing a new mapping and instead return
+ * a pointer into the direct map.
+ *
+ * MEMREMAP_WT - establish a mapping whereby writes either bypass the
+ * cache or are written through to memory and never exist in a
+ * cache-dirty state with respect to program visibility. Attempts to
+ * map System RAM with this mapping type will fail.
+ *
+ * MEMREMAP_WC - establish a writecombine mapping, whereby writes may
+ * be coalesced together (e.g. in the CPU's write buffers), but is otherwise
+ * uncached. Attempts to map System RAM with this mapping type will fail.
+ */
+void *memremap(resource_size_t offset, size_t size, unsigned long flags)
+{
+ int is_ram = region_intersects(offset, size,
+ IORESOURCE_SYSTEM_RAM, IORES_DESC_NONE);
+ void *addr = NULL;
+
+ if (!flags)
+ return NULL;
+
+ if (is_ram == REGION_MIXED) {
+ WARN_ONCE(1, "memremap attempted on mixed range %pa size: %#lx\n",
+ &offset, (unsigned long) size);
+ return NULL;
+ }
+
+ /* Try all mapping types requested until one returns non-NULL */
+ if (flags & MEMREMAP_WB) {
+ /*
+ * MEMREMAP_WB is special in that it can be satisifed
+ * from the direct map. Some archs depend on the
+ * capability of memremap() to autodetect cases where
+ * the requested range is potentially in System RAM.
+ */
+ if (is_ram == REGION_INTERSECTS)
+ addr = try_ram_remap(offset, size, flags);
+ if (!addr)
+ addr = arch_memremap_wb(offset, size);
+ }
+
+ /*
+ * If we don't have a mapping yet and other request flags are
+ * present then we will be attempting to establish a new virtual
+ * address mapping. Enforce that this mapping is not aliasing
+ * System RAM.
+ */
+ if (!addr && is_ram == REGION_INTERSECTS && flags != MEMREMAP_WB) {
+ WARN_ONCE(1, "memremap attempted on ram %pa size: %#lx\n",
+ &offset, (unsigned long) size);
+ return NULL;
+ }
+
+ if (!addr && (flags & MEMREMAP_WT))
+ addr = ioremap_wt(offset, size);
+
+ if (!addr && (flags & MEMREMAP_WC))
+ addr = ioremap_wc(offset, size);
+
+ return addr;
+}
+EXPORT_SYMBOL(memremap);
+
+void memunmap(void *addr)
+{
+ if (is_vmalloc_addr(addr))
+ iounmap((void __iomem *) addr);
+}
+EXPORT_SYMBOL(memunmap);
+
+static void devm_memremap_release(struct device *dev, void *res)
+{
+ memunmap(*(void **)res);
+}
+
+static int devm_memremap_match(struct device *dev, void *res, void *match_data)
+{
+ return *(void **)res == match_data;
+}
+
+void *devm_memremap(struct device *dev, resource_size_t offset,
+ size_t size, unsigned long flags)
+{
+ void **ptr, *addr;
+
+ ptr = devres_alloc_node(devm_memremap_release, sizeof(*ptr), GFP_KERNEL,
+ dev_to_node(dev));
+ if (!ptr)
+ return ERR_PTR(-ENOMEM);
+
+ addr = memremap(offset, size, flags);
+ if (addr) {
+ *ptr = addr;
+ devres_add(dev, ptr);
+ } else {
+ devres_free(ptr);
+ return ERR_PTR(-ENXIO);
+ }
+
+ return addr;
+}
+EXPORT_SYMBOL(devm_memremap);
+
+void devm_memunmap(struct device *dev, void *addr)
+{
+ WARN_ON(devres_release(dev, devm_memremap_release,
+ devm_memremap_match, addr));
+}
+EXPORT_SYMBOL(devm_memunmap);
diff --git a/kernel/memremap.c b/kernel/memremap.c
index 4dd4274cabe2..52a2742f527f 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -1,15 +1,5 @@
-/*
- * Copyright(c) 2015 Intel Corporation. All rights reserved.
- *
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of version 2 of the GNU General Public License as
- * published by the Free Software Foundation.
- *
- * This program is distributed in the hope that it will be useful, but
- * WITHOUT ANY WARRANTY; without even the implied warranty of
- * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
- * General Public License for more details.
- */
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright(c) 2015 Intel Corporation. All rights reserved. */
#include <linux/radix-tree.h>
#include <linux/device.h>
#include <linux/types.h>
@@ -20,169 +10,6 @@
#include <linux/swap.h>
#include <linux/swapops.h>
-#ifndef ioremap_cache
-/* temporary while we convert existing ioremap_cache users to memremap */
-__weak void __iomem *ioremap_cache(resource_size_t offset, unsigned long size)
-{
- return ioremap(offset, size);
-}
-#endif
-
-#ifndef arch_memremap_wb
-static void *arch_memremap_wb(resource_size_t offset, unsigned long size)
-{
- return (__force void *)ioremap_cache(offset, size);
-}
-#endif
-
-#ifndef arch_memremap_can_ram_remap
-static bool arch_memremap_can_ram_remap(resource_size_t offset, size_t size,
- unsigned long flags)
-{
- return true;
-}
-#endif
-
-static void *try_ram_remap(resource_size_t offset, size_t size,
- unsigned long flags)
-{
- unsigned long pfn = PHYS_PFN(offset);
-
- /* In the simple case just return the existing linear address */
- if (pfn_valid(pfn) && !PageHighMem(pfn_to_page(pfn)) &&
- arch_memremap_can_ram_remap(offset, size, flags))
- return __va(offset);
-
- return NULL; /* fallback to arch_memremap_wb */
-}
-
-/**
- * memremap() - remap an iomem_resource as cacheable memory
- * @offset: iomem resource start address
- * @size: size of remap
- * @flags: any of MEMREMAP_WB, MEMREMAP_WT, MEMREMAP_WC,
- * MEMREMAP_ENC, MEMREMAP_DEC
- *
- * memremap() is "ioremap" for cases where it is known that the resource
- * being mapped does not have i/o side effects and the __iomem
- * annotation is not applicable. In the case of multiple flags, the different
- * mapping types will be attempted in the order listed below until one of
- * them succeeds.
- *
- * MEMREMAP_WB - matches the default mapping for System RAM on
- * the architecture. This is usually a read-allocate write-back cache.
- * Morever, if MEMREMAP_WB is specified and the requested remap region is RAM
- * memremap() will bypass establishing a new mapping and instead return
- * a pointer into the direct map.
- *
- * MEMREMAP_WT - establish a mapping whereby writes either bypass the
- * cache or are written through to memory and never exist in a
- * cache-dirty state with respect to program visibility. Attempts to
- * map System RAM with this mapping type will fail.
- *
- * MEMREMAP_WC - establish a writecombine mapping, whereby writes may
- * be coalesced together (e.g. in the CPU's write buffers), but is otherwise
- * uncached. Attempts to map System RAM with this mapping type will fail.
- */
-void *memremap(resource_size_t offset, size_t size, unsigned long flags)
-{
- int is_ram = region_intersects(offset, size,
- IORESOURCE_SYSTEM_RAM, IORES_DESC_NONE);
- void *addr = NULL;
-
- if (!flags)
- return NULL;
-
- if (is_ram == REGION_MIXED) {
- WARN_ONCE(1, "memremap attempted on mixed range %pa size: %#lx\n",
- &offset, (unsigned long) size);
- return NULL;
- }
-
- /* Try all mapping types requested until one returns non-NULL */
- if (flags & MEMREMAP_WB) {
- /*
- * MEMREMAP_WB is special in that it can be satisifed
- * from the direct map. Some archs depend on the
- * capability of memremap() to autodetect cases where
- * the requested range is potentially in System RAM.
- */
- if (is_ram == REGION_INTERSECTS)
- addr = try_ram_remap(offset, size, flags);
- if (!addr)
- addr = arch_memremap_wb(offset, size);
- }
-
- /*
- * If we don't have a mapping yet and other request flags are
- * present then we will be attempting to establish a new virtual
- * address mapping. Enforce that this mapping is not aliasing
- * System RAM.
- */
- if (!addr && is_ram == REGION_INTERSECTS && flags != MEMREMAP_WB) {
- WARN_ONCE(1, "memremap attempted on ram %pa size: %#lx\n",
- &offset, (unsigned long) size);
- return NULL;
- }
-
- if (!addr && (flags & MEMREMAP_WT))
- addr = ioremap_wt(offset, size);
-
- if (!addr && (flags & MEMREMAP_WC))
- addr = ioremap_wc(offset, size);
-
- return addr;
-}
-EXPORT_SYMBOL(memremap);
-
-void memunmap(void *addr)
-{
- if (is_vmalloc_addr(addr))
- iounmap((void __iomem *) addr);
-}
-EXPORT_SYMBOL(memunmap);
-
-static void devm_memremap_release(struct device *dev, void *res)
-{
- memunmap(*(void **)res);
-}
-
-static int devm_memremap_match(struct device *dev, void *res, void *match_data)
-{
- return *(void **)res == match_data;
-}
-
-void *devm_memremap(struct device *dev, resource_size_t offset,
- size_t size, unsigned long flags)
-{
- void **ptr, *addr;
-
- ptr = devres_alloc_node(devm_memremap_release, sizeof(*ptr), GFP_KERNEL,
- dev_to_node(dev));
- if (!ptr)
- return ERR_PTR(-ENOMEM);
-
- addr = memremap(offset, size, flags);
- if (addr) {
- *ptr = addr;
- devres_add(dev, ptr);
- } else {
- devres_free(ptr);
- return ERR_PTR(-ENXIO);
- }
-
- return addr;
-}
-EXPORT_SYMBOL(devm_memremap);
-
-void devm_memunmap(struct device *dev, void *addr)
-{
- WARN_ON(devres_release(dev, devm_memremap_release,
- devm_memremap_match, addr));
-}
-EXPORT_SYMBOL(devm_memunmap);
-
-#ifdef CONFIG_ZONE_DEVICE
static DEFINE_MUTEX(pgmap_lock);
static RADIX_TREE(pgmap_radix, GFP_KERNEL);
#define SECTION_MASK ~((1UL << PA_SECTION_SHIFT) - 1)
@@ -474,7 +301,6 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
return pgmap;
}
-#endif /* CONFIG_ZONE_DEVICE */
#if IS_ENABLED(CONFIG_DEVICE_PRIVATE) || IS_ENABLED(CONFIG_DEVICE_PUBLIC)
void put_zone_device_private_or_public_page(struct page *page)
In preparation for the dax implementation to start associating dax pages
to inodes via page->mapping, we need to provide a 'struct
address_space_operations' instance for dax. Otherwise, direct-I/O
triggers incorrect page cache assumptions and warnings.
Cc: "Theodore Ts'o" <[email protected]>
Cc: Andreas Dilger <[email protected]>
Cc: [email protected]
Cc: Jan Kara <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
fs/ext4/inode.c | 42 +++++++++++++++++++++++++++++++-----------
1 file changed, 31 insertions(+), 11 deletions(-)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index c94780075b04..249a97b19181 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2725,12 +2725,6 @@ static int ext4_writepages(struct address_space *mapping,
percpu_down_read(&sbi->s_journal_flag_rwsem);
trace_ext4_writepages(inode, wbc);
- if (dax_mapping(mapping)) {
- ret = dax_writeback_mapping_range(mapping, inode->i_sb->s_bdev,
- wbc);
- goto out_writepages;
- }
-
/*
* No pages to write? This is mainly a kludge to avoid starting
* a transaction for special inodes like journal inode on last iput()
@@ -2955,6 +2949,27 @@ static int ext4_writepages(struct address_space *mapping,
return ret;
}
+static int ext4_dax_writepages(struct address_space *mapping,
+ struct writeback_control *wbc)
+{
+ int ret;
+ long nr_to_write = wbc->nr_to_write;
+ struct inode *inode = mapping->host;
+ struct ext4_sb_info *sbi = EXT4_SB(mapping->host->i_sb);
+
+ if (unlikely(ext4_forced_shutdown(EXT4_SB(inode->i_sb))))
+ return -EIO;
+
+ percpu_down_read(&sbi->s_journal_flag_rwsem);
+ trace_ext4_writepages(inode, wbc);
+
+ ret = dax_writeback_mapping_range(mapping, inode->i_sb->s_bdev, wbc);
+ trace_ext4_writepages_result(inode, wbc, ret,
+ nr_to_write - wbc->nr_to_write);
+ percpu_up_read(&sbi->s_journal_flag_rwsem);
+ return ret;
+}
+
static int ext4_nonda_switch(struct super_block *sb)
{
s64 free_clusters, dirty_clusters;
@@ -3857,10 +3872,6 @@ static ssize_t ext4_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
if (ext4_has_inline_data(inode))
return 0;
- /* DAX uses iomap path now */
- if (WARN_ON_ONCE(IS_DAX(inode)))
- return 0;
-
trace_ext4_direct_IO_enter(inode, offset, count, iov_iter_rw(iter));
if (iov_iter_rw(iter) == READ)
ret = ext4_direct_IO_read(iocb, iter);
@@ -3946,6 +3957,13 @@ static const struct address_space_operations ext4_da_aops = {
.error_remove_page = generic_error_remove_page,
};
+static const struct address_space_operations ext4_dax_aops = {
+ .writepages = ext4_dax_writepages,
+ .direct_IO = noop_direct_IO,
+ .set_page_dirty = noop_set_page_dirty,
+ .invalidatepage = noop_invalidatepage,
+};
+
void ext4_set_aops(struct inode *inode)
{
switch (ext4_inode_journal_mode(inode)) {
@@ -3958,7 +3976,9 @@ void ext4_set_aops(struct inode *inode)
default:
BUG();
}
- if (test_opt(inode->i_sb, DELALLOC))
+ if (IS_DAX(inode))
+ inode->i_mapping->a_ops = &ext4_dax_aops;
+ else if (test_opt(inode->i_sb, DELALLOC))
inode->i_mapping->a_ops = &ext4_da_aops;
else
inode->i_mapping->a_ops = &ext4_aops;
In preparation for the dax implementation to start associating dax pages
to inodes via page->mapping, we need to provide a 'struct
address_space_operations' instance for dax. Otherwise, direct-I/O
triggers incorrect page cache assumptions and warnings like the
following:
WARNING: CPU: 27 PID: 1783 at fs/xfs/xfs_aops.c:1468
xfs_vm_set_page_dirty+0xf3/0x1b0 [xfs]
[..]
CPU: 27 PID: 1783 Comm: dma-collision Tainted: G O 4.15.0-rc2+ #984
[..]
Call Trace:
set_page_dirty_lock+0x40/0x60
bio_set_pages_dirty+0x37/0x50
iomap_dio_actor+0x2b7/0x3b0
? iomap_dio_zero+0x110/0x110
iomap_apply+0xa4/0x110
iomap_dio_rw+0x29e/0x3b0
? iomap_dio_zero+0x110/0x110
? xfs_file_dio_aio_read+0x7c/0x1a0 [xfs]
xfs_file_dio_aio_read+0x7c/0x1a0 [xfs]
xfs_file_read_iter+0xa0/0xc0 [xfs]
__vfs_read+0xf9/0x170
vfs_read+0xa6/0x150
SyS_pread64+0x93/0xb0
entry_SYSCALL_64_fastpath+0x1f/0x96
...where the default set_page_dirty() handler assumes that dirty state
is being tracked in 'struct page' flags.
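For reference, the dax case only needs a no-op ->set_page_dirty, since
dax dirty state is tracked in the radix tree via mkwrite faults rather
than in 'struct page' flags. A minimal sketch (hypothetical name; the
real noop_set_page_dirty() helper is added elsewhere in this series):
#include <linux/mm.h>
/* sketch: nothing to record in page flags, dax tracks dirty state elsewhere */
static int example_dax_set_page_dirty(struct page *page)
{
	return 0;	/* report "not newly dirtied" to the core mm */
}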
Cc: Jeff Moyer <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Ross Zwisler <[email protected]>
Suggested-by: Jan Kara <[email protected]>
Suggested-by: Dave Chinner <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
fs/xfs/xfs_aops.c | 34 ++++++++++++++++++----------------
fs/xfs/xfs_aops.h | 1 +
fs/xfs/xfs_iops.c | 5 ++++-
3 files changed, 23 insertions(+), 17 deletions(-)
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 9c6a830da0ee..e7a56c4786ff 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -1194,16 +1194,22 @@ xfs_vm_writepages(
int ret;
xfs_iflags_clear(XFS_I(mapping->host), XFS_ITRUNCATED);
- if (dax_mapping(mapping))
- return dax_writeback_mapping_range(mapping,
- xfs_find_bdev_for_inode(mapping->host), wbc);
-
ret = write_cache_pages(mapping, wbc, xfs_do_writepage, &wpc);
if (wpc.ioend)
ret = xfs_submit_ioend(wbc, wpc.ioend, ret);
return ret;
}
+STATIC int
+xfs_dax_writepages(
+ struct address_space *mapping,
+ struct writeback_control *wbc)
+{
+ xfs_iflags_clear(XFS_I(mapping->host), XFS_ITRUNCATED);
+ return dax_writeback_mapping_range(mapping,
+ xfs_find_bdev_for_inode(mapping->host), wbc);
+}
+
/*
* Called to move a page into cleanable state - and from there
* to be released. The page should already be clean. We always
@@ -1367,17 +1373,6 @@ xfs_get_blocks(
return error;
}
-STATIC ssize_t
-xfs_vm_direct_IO(
- struct kiocb *iocb,
- struct iov_iter *iter)
-{
- /*
- * We just need the method present so that open/fcntl allow direct I/O.
- */
- return -EINVAL;
-}
-
STATIC sector_t
xfs_vm_bmap(
struct address_space *mapping,
@@ -1500,8 +1495,15 @@ const struct address_space_operations xfs_address_space_operations = {
.releasepage = xfs_vm_releasepage,
.invalidatepage = xfs_vm_invalidatepage,
.bmap = xfs_vm_bmap,
- .direct_IO = xfs_vm_direct_IO,
+ .direct_IO = noop_direct_IO,
.migratepage = buffer_migrate_page,
.is_partially_uptodate = block_is_partially_uptodate,
.error_remove_page = generic_error_remove_page,
};
+
+const struct address_space_operations xfs_dax_aops = {
+ .writepages = xfs_dax_writepages,
+ .direct_IO = noop_direct_IO,
+ .set_page_dirty = noop_set_page_dirty,
+ .invalidatepage = noop_invalidatepage,
+};
diff --git a/fs/xfs/xfs_aops.h b/fs/xfs/xfs_aops.h
index 88c85ea63da0..69346d460dfa 100644
--- a/fs/xfs/xfs_aops.h
+++ b/fs/xfs/xfs_aops.h
@@ -54,6 +54,7 @@ struct xfs_ioend {
};
extern const struct address_space_operations xfs_address_space_operations;
+extern const struct address_space_operations xfs_dax_aops;
int xfs_setfilesize(struct xfs_inode *ip, xfs_off_t offset, size_t size);
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 56475fcd76f2..951e84df5576 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -1272,7 +1272,10 @@ xfs_setup_iops(
case S_IFREG:
inode->i_op = &xfs_inode_operations;
inode->i_fop = &xfs_file_operations;
- inode->i_mapping->a_ops = &xfs_address_space_operations;
+ if (IS_DAX(inode))
+ inode->i_mapping->a_ops = &xfs_dax_aops;
+ else
+ inode->i_mapping->a_ops = &xfs_address_space_operations;
break;
case S_IFDIR:
if (xfs_sb_version_hasasciici(&XFS_M(inode->i_sb)->m_sb))
In preparation for the dax implementation to start associating dax pages
to inodes via page->mapping, we need to provide a 'struct
address_space_operations' instance for dax. Otherwise, direct-I/O
triggers incorrect page cache assumptions and warnings.
Cc: Jan Kara <[email protected]>
Reported-by: kbuild test robot <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
fs/ext2/ext2.h | 1 +
fs/ext2/inode.c | 46 +++++++++++++++++++++++++++-------------------
fs/ext2/namei.c | 18 ++----------------
3 files changed, 30 insertions(+), 35 deletions(-)
diff --git a/fs/ext2/ext2.h b/fs/ext2/ext2.h
index 032295e1d386..cc40802ddfa8 100644
--- a/fs/ext2/ext2.h
+++ b/fs/ext2/ext2.h
@@ -814,6 +814,7 @@ extern const struct inode_operations ext2_file_inode_operations;
extern const struct file_operations ext2_file_operations;
/* inode.c */
+extern void ext2_set_file_ops(struct inode *inode);
extern const struct address_space_operations ext2_aops;
extern const struct address_space_operations ext2_nobh_aops;
extern const struct iomap_ops ext2_iomap_ops;
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 9b2ac55ac34f..1e01fabef130 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -940,9 +940,6 @@ ext2_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
loff_t offset = iocb->ki_pos;
ssize_t ret;
- if (WARN_ON_ONCE(IS_DAX(inode)))
- return -EIO;
-
ret = blockdev_direct_IO(iocb, inode, iter, ext2_get_block);
if (ret < 0 && iov_iter_rw(iter) == WRITE)
ext2_write_failed(mapping, offset + count);
@@ -952,17 +949,16 @@ ext2_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
static int
ext2_writepages(struct address_space *mapping, struct writeback_control *wbc)
{
-#ifdef CONFIG_FS_DAX
- if (dax_mapping(mapping)) {
- return dax_writeback_mapping_range(mapping,
- mapping->host->i_sb->s_bdev,
- wbc);
- }
-#endif
-
return mpage_writepages(mapping, wbc, ext2_get_block);
}
+static int
+ext2_dax_writepages(struct address_space *mapping, struct writeback_control *wbc)
+{
+ return dax_writeback_mapping_range(mapping,
+ mapping->host->i_sb->s_bdev, wbc);
+}
+
const struct address_space_operations ext2_aops = {
.readpage = ext2_readpage,
.readpages = ext2_readpages,
@@ -990,6 +986,13 @@ const struct address_space_operations ext2_nobh_aops = {
.error_remove_page = generic_error_remove_page,
};
+static const struct address_space_operations ext2_dax_aops = {
+ .writepages = ext2_dax_writepages,
+ .direct_IO = noop_direct_IO,
+ .set_page_dirty = noop_set_page_dirty,
+ .invalidatepage = noop_invalidatepage,
+};
+
/*
* Probably it should be a library function... search for first non-zero word
* or memcmp with zero_page, whatever is better for particular architecture.
@@ -1388,6 +1391,18 @@ void ext2_set_inode_flags(struct inode *inode)
inode->i_flags |= S_DAX;
}
+void ext2_set_file_ops(struct inode *inode)
+{
+ inode->i_op = &ext2_file_inode_operations;
+ inode->i_fop = &ext2_file_operations;
+ if (IS_DAX(inode))
+ inode->i_mapping->a_ops = &ext2_dax_aops;
+ else if (test_opt(inode->i_sb, NOBH))
+ inode->i_mapping->a_ops = &ext2_nobh_aops;
+ else
+ inode->i_mapping->a_ops = &ext2_aops;
+}
+
struct inode *ext2_iget (struct super_block *sb, unsigned long ino)
{
struct ext2_inode_info *ei;
@@ -1480,14 +1495,7 @@ struct inode *ext2_iget (struct super_block *sb, unsigned long ino)
ei->i_data[n] = raw_inode->i_block[n];
if (S_ISREG(inode->i_mode)) {
- inode->i_op = &ext2_file_inode_operations;
- if (test_opt(inode->i_sb, NOBH)) {
- inode->i_mapping->a_ops = &ext2_nobh_aops;
- inode->i_fop = &ext2_file_operations;
- } else {
- inode->i_mapping->a_ops = &ext2_aops;
- inode->i_fop = &ext2_file_operations;
- }
+ ext2_set_file_ops(inode);
} else if (S_ISDIR(inode->i_mode)) {
inode->i_op = &ext2_dir_inode_operations;
inode->i_fop = &ext2_dir_operations;
diff --git a/fs/ext2/namei.c b/fs/ext2/namei.c
index e078075dc66f..55f7caadb093 100644
--- a/fs/ext2/namei.c
+++ b/fs/ext2/namei.c
@@ -107,14 +107,7 @@ static int ext2_create (struct inode * dir, struct dentry * dentry, umode_t mode
if (IS_ERR(inode))
return PTR_ERR(inode);
- inode->i_op = &ext2_file_inode_operations;
- if (test_opt(inode->i_sb, NOBH)) {
- inode->i_mapping->a_ops = &ext2_nobh_aops;
- inode->i_fop = &ext2_file_operations;
- } else {
- inode->i_mapping->a_ops = &ext2_aops;
- inode->i_fop = &ext2_file_operations;
- }
+ ext2_set_file_ops(inode);
mark_inode_dirty(inode);
return ext2_add_nondir(dentry, inode);
}
@@ -125,14 +118,7 @@ static int ext2_tmpfile(struct inode *dir, struct dentry *dentry, umode_t mode)
if (IS_ERR(inode))
return PTR_ERR(inode);
- inode->i_op = &ext2_file_inode_operations;
- if (test_opt(inode->i_sb, NOBH)) {
- inode->i_mapping->a_ops = &ext2_nobh_aops;
- inode->i_fop = &ext2_file_operations;
- } else {
- inode->i_mapping->a_ops = &ext2_aops;
- inode->i_fop = &ext2_file_operations;
- }
+ ext2_set_file_ops(inode);
mark_inode_dirty(inode);
d_tmpfile(dentry, inode);
unlock_new_inode(inode);
Block device inodes never have S_DAX set, so kill the check for DAX and
diversion to dax_writeback_mapping_range().
Cc: Jeff Moyer <[email protected]>
Cc: Ross Zwisler <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Dave Chinner <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
fs/block_dev.c | 5 -----
1 file changed, 5 deletions(-)
diff --git a/fs/block_dev.c b/fs/block_dev.c
index fe09ef9c21f3..846ee2d31781 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1946,11 +1946,6 @@ static int blkdev_releasepage(struct page *page, gfp_t wait)
static int blkdev_writepages(struct address_space *mapping,
struct writeback_control *wbc)
{
- if (dax_mapping(mapping)) {
- struct block_device *bdev = I_BDEV(mapping->host);
-
- return dax_writeback_mapping_range(mapping, bdev, wbc);
- }
return generic_writepages(mapping, wbc);
}
On Fri 30-03-18 21:02:36, Dan Williams wrote:
> In preparation for the dax implementation to start associating dax pages
> to inodes via page->mapping, we need to provide a 'struct
> address_space_operations' instance for dax. Otherwise, direct-I/O
> triggers incorrect page cache assumptions and warnings.
>
> Cc: "Theodore Ts'o" <[email protected]>
> Cc: Andreas Dilger <[email protected]>
> Cc: [email protected]
> Cc: Jan Kara <[email protected]>
> Signed-off-by: Dan Williams <[email protected]>
Looks good. You can add:
Reviewed-by: Jan Kara <[email protected]>
Honza
> ---
> fs/ext4/inode.c | 42 +++++++++++++++++++++++++++++++-----------
> 1 file changed, 31 insertions(+), 11 deletions(-)
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index c94780075b04..249a97b19181 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -2725,12 +2725,6 @@ static int ext4_writepages(struct address_space *mapping,
> percpu_down_read(&sbi->s_journal_flag_rwsem);
> trace_ext4_writepages(inode, wbc);
>
> - if (dax_mapping(mapping)) {
> - ret = dax_writeback_mapping_range(mapping, inode->i_sb->s_bdev,
> - wbc);
> - goto out_writepages;
> - }
> -
> /*
> * No pages to write? This is mainly a kludge to avoid starting
> * a transaction for special inodes like journal inode on last iput()
> @@ -2955,6 +2949,27 @@ static int ext4_writepages(struct address_space *mapping,
> return ret;
> }
>
> +static int ext4_dax_writepages(struct address_space *mapping,
> + struct writeback_control *wbc)
> +{
> + int ret;
> + long nr_to_write = wbc->nr_to_write;
> + struct inode *inode = mapping->host;
> + struct ext4_sb_info *sbi = EXT4_SB(mapping->host->i_sb);
> +
> + if (unlikely(ext4_forced_shutdown(EXT4_SB(inode->i_sb))))
> + return -EIO;
> +
> + percpu_down_read(&sbi->s_journal_flag_rwsem);
> + trace_ext4_writepages(inode, wbc);
> +
> + ret = dax_writeback_mapping_range(mapping, inode->i_sb->s_bdev, wbc);
> + trace_ext4_writepages_result(inode, wbc, ret,
> + nr_to_write - wbc->nr_to_write);
> + percpu_up_read(&sbi->s_journal_flag_rwsem);
> + return ret;
> +}
> +
> static int ext4_nonda_switch(struct super_block *sb)
> {
> s64 free_clusters, dirty_clusters;
> @@ -3857,10 +3872,6 @@ static ssize_t ext4_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
> if (ext4_has_inline_data(inode))
> return 0;
>
> - /* DAX uses iomap path now */
> - if (WARN_ON_ONCE(IS_DAX(inode)))
> - return 0;
> -
> trace_ext4_direct_IO_enter(inode, offset, count, iov_iter_rw(iter));
> if (iov_iter_rw(iter) == READ)
> ret = ext4_direct_IO_read(iocb, iter);
> @@ -3946,6 +3957,13 @@ static const struct address_space_operations ext4_da_aops = {
> .error_remove_page = generic_error_remove_page,
> };
>
> +static const struct address_space_operations ext4_dax_aops = {
> + .writepages = ext4_dax_writepages,
> + .direct_IO = noop_direct_IO,
> + .set_page_dirty = noop_set_page_dirty,
> + .invalidatepage = noop_invalidatepage,
> +};
> +
> void ext4_set_aops(struct inode *inode)
> {
> switch (ext4_inode_journal_mode(inode)) {
> @@ -3958,7 +3976,9 @@ void ext4_set_aops(struct inode *inode)
> default:
> BUG();
> }
> - if (test_opt(inode->i_sb, DELALLOC))
> + if (IS_DAX(inode))
> + inode->i_mapping->a_ops = &ext4_dax_aops;
> + else if (test_opt(inode->i_sb, DELALLOC))
> inode->i_mapping->a_ops = &ext4_da_aops;
> else
> inode->i_mapping->a_ops = &ext4_aops;
>
--
Jan Kara <[email protected]>
SUSE Labs, CR
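For context, the noop_* helpers that ext4_dax_aops (and ext2_dax_aops earlier)
plug in are added elsewhere in this series in fs/libfs.c. Roughly, they are
stubs whose only job is to keep the generic page-cache paths from ever running
on a dax mapping; a sketch, with details possibly differing from the actual
patch:

int noop_set_page_dirty(struct page *page)
{
	/*
	 * Dax does all dirty tracking in the inode's address_space in
	 * response to mkwrite faults; defining this prevents fallback to
	 * __set_page_dirty_buffers() in set_page_dirty().
	 */
	return 0;
}

void noop_invalidatepage(struct page *page, unsigned int offset,
		unsigned int length)
{
	/*
	 * There is no page cache to invalidate in the dax case; defining
	 * this prevents fallback to block_invalidatepage().
	 */
}

ssize_t noop_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
{
	/*
	 * iomap based filesystems do direct-I/O without this callback,
	 * but ->direct_IO must be non-NULL so open(O_DIRECT)/fcntl()
	 * report that direct-I/O is supported.
	 */
	return -EINVAL;
}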
On Fri 30-03-18 21:02:41, Dan Williams wrote:
> In preparation for the dax implementation to start associating dax pages
> to inodes via page->mapping, we need to provide a 'struct
> address_space_operations' instance for dax. Otherwise, direct-I/O
> triggers incorrect page cache assumptions and warnings.
>
> Cc: Jan Kara <[email protected]>
> Reported-by: kbuild test robot <[email protected]>
> Signed-off-by: Dan Williams <[email protected]>
Looks good. You can add:
Reviewed-by: Jan Kara <[email protected]>
Honza
> ---
> fs/ext2/ext2.h | 1 +
> fs/ext2/inode.c | 46 +++++++++++++++++++++++++++-------------------
> fs/ext2/namei.c | 18 ++----------------
> 3 files changed, 30 insertions(+), 35 deletions(-)
>
> diff --git a/fs/ext2/ext2.h b/fs/ext2/ext2.h
> index 032295e1d386..cc40802ddfa8 100644
> --- a/fs/ext2/ext2.h
> +++ b/fs/ext2/ext2.h
> @@ -814,6 +814,7 @@ extern const struct inode_operations ext2_file_inode_operations;
> extern const struct file_operations ext2_file_operations;
>
> /* inode.c */
> +extern void ext2_set_file_ops(struct inode *inode);
> extern const struct address_space_operations ext2_aops;
> extern const struct address_space_operations ext2_nobh_aops;
> extern const struct iomap_ops ext2_iomap_ops;
> diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
> index 9b2ac55ac34f..1e01fabef130 100644
> --- a/fs/ext2/inode.c
> +++ b/fs/ext2/inode.c
> @@ -940,9 +940,6 @@ ext2_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
> loff_t offset = iocb->ki_pos;
> ssize_t ret;
>
> - if (WARN_ON_ONCE(IS_DAX(inode)))
> - return -EIO;
> -
> ret = blockdev_direct_IO(iocb, inode, iter, ext2_get_block);
> if (ret < 0 && iov_iter_rw(iter) == WRITE)
> ext2_write_failed(mapping, offset + count);
> @@ -952,17 +949,16 @@ ext2_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
> static int
> ext2_writepages(struct address_space *mapping, struct writeback_control *wbc)
> {
> -#ifdef CONFIG_FS_DAX
> - if (dax_mapping(mapping)) {
> - return dax_writeback_mapping_range(mapping,
> - mapping->host->i_sb->s_bdev,
> - wbc);
> - }
> -#endif
> -
> return mpage_writepages(mapping, wbc, ext2_get_block);
> }
>
> +static int
> +ext2_dax_writepages(struct address_space *mapping, struct writeback_control *wbc)
> +{
> + return dax_writeback_mapping_range(mapping,
> + mapping->host->i_sb->s_bdev, wbc);
> +}
> +
> const struct address_space_operations ext2_aops = {
> .readpage = ext2_readpage,
> .readpages = ext2_readpages,
> @@ -990,6 +986,13 @@ const struct address_space_operations ext2_nobh_aops = {
> .error_remove_page = generic_error_remove_page,
> };
>
> +static const struct address_space_operations ext2_dax_aops = {
> + .writepages = ext2_dax_writepages,
> + .direct_IO = noop_direct_IO,
> + .set_page_dirty = noop_set_page_dirty,
> + .invalidatepage = noop_invalidatepage,
> +};
> +
> /*
> * Probably it should be a library function... search for first non-zero word
> * or memcmp with zero_page, whatever is better for particular architecture.
> @@ -1388,6 +1391,18 @@ void ext2_set_inode_flags(struct inode *inode)
> inode->i_flags |= S_DAX;
> }
>
> +void ext2_set_file_ops(struct inode *inode)
> +{
> + inode->i_op = &ext2_file_inode_operations;
> + inode->i_fop = &ext2_file_operations;
> + if (IS_DAX(inode))
> + inode->i_mapping->a_ops = &ext2_dax_aops;
> + else if (test_opt(inode->i_sb, NOBH))
> + inode->i_mapping->a_ops = &ext2_nobh_aops;
> + else
> + inode->i_mapping->a_ops = &ext2_aops;
> +}
> +
> struct inode *ext2_iget (struct super_block *sb, unsigned long ino)
> {
> struct ext2_inode_info *ei;
> @@ -1480,14 +1495,7 @@ struct inode *ext2_iget (struct super_block *sb, unsigned long ino)
> ei->i_data[n] = raw_inode->i_block[n];
>
> if (S_ISREG(inode->i_mode)) {
> - inode->i_op = &ext2_file_inode_operations;
> - if (test_opt(inode->i_sb, NOBH)) {
> - inode->i_mapping->a_ops = &ext2_nobh_aops;
> - inode->i_fop = &ext2_file_operations;
> - } else {
> - inode->i_mapping->a_ops = &ext2_aops;
> - inode->i_fop = &ext2_file_operations;
> - }
> + ext2_set_file_ops(inode);
> } else if (S_ISDIR(inode->i_mode)) {
> inode->i_op = &ext2_dir_inode_operations;
> inode->i_fop = &ext2_dir_operations;
> diff --git a/fs/ext2/namei.c b/fs/ext2/namei.c
> index e078075dc66f..55f7caadb093 100644
> --- a/fs/ext2/namei.c
> +++ b/fs/ext2/namei.c
> @@ -107,14 +107,7 @@ static int ext2_create (struct inode * dir, struct dentry * dentry, umode_t mode
> if (IS_ERR(inode))
> return PTR_ERR(inode);
>
> - inode->i_op = &ext2_file_inode_operations;
> - if (test_opt(inode->i_sb, NOBH)) {
> - inode->i_mapping->a_ops = &ext2_nobh_aops;
> - inode->i_fop = &ext2_file_operations;
> - } else {
> - inode->i_mapping->a_ops = &ext2_aops;
> - inode->i_fop = &ext2_file_operations;
> - }
> + ext2_set_file_ops(inode);
> mark_inode_dirty(inode);
> return ext2_add_nondir(dentry, inode);
> }
> @@ -125,14 +118,7 @@ static int ext2_tmpfile(struct inode *dir, struct dentry *dentry, umode_t mode)
> if (IS_ERR(inode))
> return PTR_ERR(inode);
>
> - inode->i_op = &ext2_file_inode_operations;
> - if (test_opt(inode->i_sb, NOBH)) {
> - inode->i_mapping->a_ops = &ext2_nobh_aops;
> - inode->i_fop = &ext2_file_operations;
> - } else {
> - inode->i_mapping->a_ops = &ext2_aops;
> - inode->i_fop = &ext2_file_operations;
> - }
> + ext2_set_file_ops(inode);
> mark_inode_dirty(inode);
> d_tmpfile(dentry, inode);
> unlock_new_inode(inode);
>
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Fri, Mar 30, 2018 at 9:03 PM, Dan Williams <[email protected]> wrote:
> In preparation for allowing filesystems to augment the dev_pagemap
> associated with a dax_device, add an ->fs_claim() callback. The
> ->fs_claim() callback is leveraged by the device-mapper dax
> implementation to iterate all member devices in the map and repeat the
> claim operation across the array.
>
> In order to resolve collisions between filesystem operations and DMA to
> DAX mapped pages we need a callback when DMA completes. With a callback
> we can hold off filesystem operations while DMA is in-flight and then
> resume those operations when the last put_page() occurs on a DMA page.
> The ->fs_claim() operation arranges for this callback to be registered,
> although that implementation is saved for a later patch.
>
> Cc: Alasdair Kergon <[email protected]>
> Cc: Mike Snitzer <[email protected]>
Mike, do these DM touches look ok to you? We need these ->fs_claim()
/ ->fs_release() interfaces for device-mapper to set up filesystem-dax
infrastructure on all sub-devices whenever a dax-capable DM device is
mounted. It builds on the device-mapper dax dependency removal
patches.
> Cc: Matthew Wilcox <[email protected]>
> Cc: Ross Zwisler <[email protected]>
> Cc: "Jérôme Glisse" <[email protected]>
> Cc: Christoph Hellwig <[email protected]>
> Cc: Jan Kara <[email protected]>
> Signed-off-by: Dan Williams <[email protected]>
> ---
> drivers/dax/super.c | 80 ++++++++++++++++++++++++++++++++++++++++++++++
> drivers/md/dm.c | 56 ++++++++++++++++++++++++++++++++
> include/linux/dax.h | 16 +++++++++
> include/linux/memremap.h | 8 +++++
> 4 files changed, 160 insertions(+)
>
> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> index 2b2332b605e4..c4cf284dfe1c 100644
> --- a/drivers/dax/super.c
> +++ b/drivers/dax/super.c
> @@ -29,6 +29,7 @@ static struct vfsmount *dax_mnt;
> static DEFINE_IDA(dax_minor_ida);
> static struct kmem_cache *dax_cache __read_mostly;
> static struct super_block *dax_superblock __read_mostly;
> +static DEFINE_MUTEX(devmap_lock);
>
> #define DAX_HASH_SIZE (PAGE_SIZE / sizeof(struct hlist_head))
> static struct hlist_head dax_host_list[DAX_HASH_SIZE];
> @@ -169,9 +170,88 @@ struct dax_device {
> const char *host;
> void *private;
> unsigned long flags;
> + struct dev_pagemap *pgmap;
> const struct dax_operations *ops;
> };
>
> +#if IS_ENABLED(CONFIG_FS_DAX)
> +static void generic_dax_pagefree(struct page *page, void *data)
> +{
> + /* TODO: wakeup page-idle waiters */
> +}
> +
> +struct dax_device *fs_dax_claim(struct dax_device *dax_dev, void *owner)
> +{
> + struct dev_pagemap *pgmap;
> +
> + if (!dax_dev->pgmap)
> + return dax_dev;
> + pgmap = dax_dev->pgmap;
> +
> + mutex_lock(&devmap_lock);
> + if (pgmap->data && pgmap->data == owner) {
> + /* dm might try to claim the same device more than once... */
> + mutex_unlock(&devmap_lock);
> + return dax_dev;
> + } else if (pgmap->page_free || pgmap->page_fault
> + || pgmap->type != MEMORY_DEVICE_HOST) {
> + put_dax(dax_dev);
> + mutex_unlock(&devmap_lock);
> + return NULL;
> + }
> +
> + pgmap->type = MEMORY_DEVICE_FS_DAX;
> + pgmap->page_free = generic_dax_pagefree;
> + pgmap->data = owner;
> + mutex_unlock(&devmap_lock);
> +
> + return dax_dev;
> +}
> +EXPORT_SYMBOL_GPL(fs_dax_claim);
> +
> +struct dax_device *fs_dax_claim_bdev(struct block_device *bdev, void *owner)
> +{
> + struct dax_device *dax_dev;
> +
> + if (!blk_queue_dax(bdev->bd_queue))
> + return NULL;
> + dax_dev = fs_dax_get_by_host(bdev->bd_disk->disk_name);
> + if (dax_dev->ops->fs_claim)
> + return dax_dev->ops->fs_claim(dax_dev, owner);
> + else
> + return fs_dax_claim(dax_dev, owner);
> +}
> +EXPORT_SYMBOL_GPL(fs_dax_claim_bdev);
> +
> +void __fs_dax_release(struct dax_device *dax_dev, void *owner)
> +{
> + struct dev_pagemap *pgmap = dax_dev ? dax_dev->pgmap : NULL;
> +
> + put_dax(dax_dev);
> + if (!pgmap)
> + return;
> + if (!pgmap->data)
> + return;
> +
> + mutex_lock(&devmap_lock);
> + WARN_ON(pgmap->data != owner);
> + pgmap->type = MEMORY_DEVICE_HOST;
> + pgmap->page_free = NULL;
> + pgmap->data = NULL;
> + mutex_unlock(&devmap_lock);
> +}
> +EXPORT_SYMBOL_GPL(__fs_dax_release);
> +
> +void fs_dax_release(struct dax_device *dax_dev, void *owner)
> +{
> + if (dax_dev->ops->fs_release)
> + dax_dev->ops->fs_release(dax_dev, owner);
> + else
> + __fs_dax_release(dax_dev, owner);
> +}
> +EXPORT_SYMBOL_GPL(fs_dax_release);
> +#endif
> +
> static ssize_t write_cache_show(struct device *dev,
> struct device_attribute *attr, char *buf)
> {
> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> index ffc93aecc02a..964cb7537f11 100644
> --- a/drivers/md/dm.c
> +++ b/drivers/md/dm.c
> @@ -1090,6 +1090,60 @@ static size_t dm_dax_copy_from_iter(struct dax_device *dax_dev, pgoff_t pgoff,
> return ret;
> }
>
> +static int dm_dax_dev_claim(struct dm_target *ti, struct dm_dev *dev,
> + sector_t start, sector_t len, void *owner)
> +{
> + if (fs_dax_claim(dev->dax_dev, owner))
> + return 0;
> + /*
> + * Outside of a kernel bug there is no reason a dax_dev should
> + * fail a claim attempt. Device-mapper should have exclusive
> + * ownership of the dm_dev and the filesystem should have
> + * exclusive ownership of the dm_target.
> + */
> + WARN_ON_ONCE(1);
> + return -ENXIO;
> +}
> +
> +static int dm_dax_dev_release(struct dm_target *ti, struct dm_dev *dev,
> + sector_t start, sector_t len, void *owner)
> +{
> + __fs_dax_release(dev->dax_dev, owner);
> + return 0;
> +}
> +
> +static struct dax_device *dm_dax_iterate(struct dax_device *dax_dev,
> + iterate_devices_callout_fn fn, void *arg)
> +{
> + struct mapped_device *md = dax_get_private(dax_dev);
> + struct dm_table *map;
> + struct dm_target *ti;
> + int i, srcu_idx;
> +
> + map = dm_get_live_table(md, &srcu_idx);
> +
> + for (i = 0; i < dm_table_get_num_targets(map); i++) {
> + ti = dm_table_get_target(map, i);
> +
> + if (ti->type->iterate_devices)
> + ti->type->iterate_devices(ti, fn, arg);
> + }
> +
> + dm_put_live_table(md, srcu_idx);
> + return dax_dev;
> +}
> +
> +static struct dax_device *dm_dax_fs_claim(struct dax_device *dax_dev,
> + void *owner)
> +{
> + return dm_dax_iterate(dax_dev, dm_dax_dev_claim, owner);
> +}
> +
> +static void dm_dax_fs_release(struct dax_device *dax_dev, void *owner)
> +{
> + dm_dax_iterate(dax_dev, dm_dax_dev_release, owner);
> +}
> +
> /*
> * A target may call dm_accept_partial_bio only from the map routine. It is
> * allowed for all bio types except REQ_PREFLUSH and REQ_OP_ZONE_RESET.
> @@ -3111,6 +3165,8 @@ static const struct block_device_operations dm_blk_dops = {
> static const struct dax_operations dm_dax_ops = {
> .direct_access = dm_dax_direct_access,
> .copy_from_iter = dm_dax_copy_from_iter,
> + .fs_claim = dm_dax_fs_claim,
> + .fs_release = dm_dax_fs_release,
> };
>
> /*
> diff --git a/include/linux/dax.h b/include/linux/dax.h
> index f9eb22ad341e..e9d59a6b06e1 100644
> --- a/include/linux/dax.h
> +++ b/include/linux/dax.h
> @@ -20,6 +20,10 @@ struct dax_operations {
> /* copy_from_iter: required operation for fs-dax direct-i/o */
> size_t (*copy_from_iter)(struct dax_device *, pgoff_t, void *, size_t,
> struct iov_iter *);
> + /* fs_claim: setup filesytem parameters for the device's dev_pagemap */
> + struct dax_device *(*fs_claim)(struct dax_device *, void *);
> + /* fs_release: restore device's dev_pagemap to its default state */
> + void (*fs_release)(struct dax_device *, void *);
> };
>
> extern struct attribute_group dax_attribute_group;
> @@ -83,6 +87,8 @@ static inline void fs_put_dax(struct dax_device *dax_dev)
> struct dax_device *fs_dax_get_by_bdev(struct block_device *bdev);
> int dax_writeback_mapping_range(struct address_space *mapping,
> struct block_device *bdev, struct writeback_control *wbc);
> +struct dax_device *fs_dax_claim(struct dax_device *dax_dev, void *owner);
> +void __fs_dax_release(struct dax_device *dax_dev, void *owner);
> #else
> static inline int bdev_dax_supported(struct super_block *sb, int blocksize)
> {
> @@ -108,6 +114,16 @@ static inline int dax_writeback_mapping_range(struct address_space *mapping,
> {
> return -EOPNOTSUPP;
> }
> +
> +static inline struct dax_device *fs_dax_claim(struct dax_device *dax_dev,
> + void *owner)
> +{
> + return NULL;
> +}
> +
> +static inline void __fs_dax_release(struct dax_device *dax_dev, void *owner)
> +{
> +}
> #endif
>
> int dax_read_lock(void);
> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
> index 7b4899c06f49..02d6d042ee7f 100644
> --- a/include/linux/memremap.h
> +++ b/include/linux/memremap.h
> @@ -53,11 +53,19 @@ struct vmem_altmap {
> * driver can hotplug the device memory using ZONE_DEVICE and with that memory
> * type. Any page of a process can be migrated to such memory. However no one
> * should be allow to pin such memory so that it can always be evicted.
> + *
> + * MEMORY_DEVICE_FS_DAX:
> + * When MEMORY_DEVICE_HOST memory is represented by a device that can
> + * host a filesystem, for example /dev/pmem0, that filesystem can
> + * register for a callback when a page is idled. For the filesystem-dax
> + * case page idle callbacks are used to coordinate DMA vs
> + * hole-punch/truncate.
> */
> enum memory_type {
> MEMORY_DEVICE_HOST = 0,
> MEMORY_DEVICE_PRIVATE,
> MEMORY_DEVICE_PUBLIC,
> + MEMORY_DEVICE_FS_DAX,
> };
>
> /*
>
On Tue, Apr 03 2018 at 2:24pm -0400,
Dan Williams <[email protected]> wrote:
> On Fri, Mar 30, 2018 at 9:03 PM, Dan Williams <[email protected]> wrote:
> > In preparation for allowing filesystems to augment the dev_pagemap
> > associated with a dax_device, add an ->fs_claim() callback. The
> > ->fs_claim() callback is leveraged by the device-mapper dax
> > implementation to iterate all member devices in the map and repeat the
> > claim operation across the array.
> >
> > In order to resolve collisions between filesystem operations and DMA to
> > DAX mapped pages we need a callback when DMA completes. With a callback
> > we can hold off filesystem operations while DMA is in-flight and then
> > resume those operations when the last put_page() occurs on a DMA page.
> > The ->fs_claim() operation arranges for this callback to be registered,
> > although that implementation is saved for a later patch.
> >
> > Cc: Alasdair Kergon <[email protected]>
> > Cc: Mike Snitzer <[email protected]>
>
> Mike, do these DM touches look ok to you? We need these ->fs_claim()
> / ->fs_release() interfaces for device-mapper to set up filesystem-dax
> infrastructure on all sub-devices whenever a dax-capable DM device is
> mounted. It builds on the device-mapper dax dependency removal
> patches.
I'd prefer dm_dax_iterate() be renamed to dm_dax_iterate_devices()
But dm_dax_iterate() is weird... it is simply returning the struct
dax_device *dax_dev that is passed: seemingly without actually directly
changing anything about that dax_device (I can infer that you're
claiming the underlying devices, but...)
In general, users of ti->type->iterate_devices can get a result back
(via the 'int' return)... you aren't using it that way (and maybe dax will
never have a need to return an answer). But all said, I think I'd
prefer to see dm_dax_iterate_devices() return void.
But please let me know if I'm missing something, thanks.
Mike
On Tue, Apr 3, 2018 at 12:39 PM, Mike Snitzer <[email protected]> wrote:
> On Tue, Apr 03 2018 at 2:24pm -0400,
> Dan Williams <[email protected]> wrote:
>
>> On Fri, Mar 30, 2018 at 9:03 PM, Dan Williams <[email protected]> wrote:
>> > In preparation for allowing filesystems to augment the dev_pagemap
>> > associated with a dax_device, add an ->fs_claim() callback. The
>> > ->fs_claim() callback is leveraged by the device-mapper dax
>> > implementation to iterate all member devices in the map and repeat the
>> > claim operation across the array.
>> >
>> > In order to resolve collisions between filesystem operations and DMA to
>> > DAX mapped pages we need a callback when DMA completes. With a callback
>> > we can hold off filesystem operations while DMA is in-flight and then
>> > resume those operations when the last put_page() occurs on a DMA page.
>> > The ->fs_claim() operation arranges for this callback to be registered,
>> > although that implementation is saved for a later patch.
>> >
>> > Cc: Alasdair Kergon <[email protected]>
>> > Cc: Mike Snitzer <[email protected]>
>>
>> Mike, do these DM touches look ok to you? We need these ->fs_claim()
>> / ->fs_release() interfaces for device-mapper to set up filesystem-dax
>> infrastructure on all sub-devices whenever a dax-capable DM device is
>> mounted. It builds on the device-mapper dax dependency removal
>> patches.
>
> I'd prefer dm_dax_iterate() be renamed to dm_dax_iterate_devices()
Ok, I'll fix that up.
> But dm_dax_iterate() is weird... it is simply returning the struct
> dax_device *dax_dev that is passed: seemingly without actually directly
> changing anything about that dax_device (I can infer that you're
> claiming the underlying devices, but...)
I could at least add a note to see the comment in dm_dax_dev_claim().
The filesystem caller expects to get a dax_dev back or NULL from
fs_dax_claim_bdev() if the claim failed. For fs_dax_claim() the return
value could simply be bool for pass / fail, but I used dax_dev NULL /
not-NULL instead.
In the case of device-mapper the claim attempt can't fail for
conflicting ownership reasons because the exclusive ownership of the
underlying block device is already established by device-mapper before
the fs claims the device-mapper dax device.
> In general, users of ti->type->iterate_devices can get a result back
> (via the 'int' return)... you aren't using it that way (and maybe dax will
> never have a need to return an answer). But all said, I think I'd
> prefer to see dm_dax_iterate_devices() return void.
>
> But please let me know if I'm missing something, thanks.
Oh, yeah, I like that better. I'll just make it return void and have
dm_dax_fs_claim() return the dax_dev directly.
Thanks Mike!
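Taken together, the agreed rework could look roughly like this sketch (not the
final patch): dm_dax_iterate_devices() returns void and dm_dax_fs_claim()
hands the dax_device back directly:

static void dm_dax_iterate_devices(struct dax_device *dax_dev,
		iterate_devices_callout_fn fn, void *arg)
{
	struct mapped_device *md = dax_get_private(dax_dev);
	struct dm_table *map;
	struct dm_target *ti;
	int i, srcu_idx;

	map = dm_get_live_table(md, &srcu_idx);

	for (i = 0; i < dm_table_get_num_targets(map); i++) {
		ti = dm_table_get_target(map, i);

		/* claim / release each underlying dm_dev */
		if (ti->type->iterate_devices)
			ti->type->iterate_devices(ti, fn, arg);
	}

	dm_put_live_table(md, srcu_idx);
}

static struct dax_device *dm_dax_fs_claim(struct dax_device *dax_dev,
		void *owner)
{
	dm_dax_iterate_devices(dax_dev, dm_dax_dev_claim, owner);
	return dax_dev;
}

static void dm_dax_fs_release(struct dax_device *dax_dev, void *owner)
{
	dm_dax_iterate_devices(dax_dev, dm_dax_dev_release, owner);
}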
On Fri 30-03-18 21:03:30, Dan Williams wrote:
> Background:
>
> get_user_pages() in the filesystem pins file backed memory pages for
> access by devices performing dma. However, it only pins the memory pages
> not the page-to-file offset association. If a file is truncated the
> pages are mapped out of the file and dma may continue indefinitely into
> a page that is owned by a device driver. This breaks coherency of the
> file vs dma, but the assumption is that if userspace wants the
> file-space truncated it does not matter what data is inbound from the
> device, it is not relevant anymore. The only expectation is that dma can
> safely continue while the filesystem reallocates the block(s).
>
> Problem:
>
> This expectation that dma can safely continue while the filesystem
> changes the block map is broken by dax. With dax the target dma page
> *is* the filesystem block. The model of leaving the page pinned for dma,
> but truncating the file block out of the file, means that the filesytem
> is free to reallocate a block under active dma to another file and now
> the expected data-incoherency situation has turned into active
> data-corruption.
>
> Solution:
>
> Defer all filesystem operations (fallocate(), truncate()) on a dax mode
> file while any page/block in the file is under active dma. This solution
> assumes that dma is transient. Cases where dma operations are known to
> not be transient, like RDMA, have been explicitly disabled via
> commits like 5f1d43de5416 "IB/core: disable memory registration of
> filesystem-dax vmas".
>
> The dax_layout_busy_page() routine is called by filesystems with a lock
> held against mm faults (i_mmap_lock) to find pinned / busy dax pages.
> The process of looking up a busy page invalidates all mappings
> to trigger any subsequent get_user_pages() to block on i_mmap_lock.
> The filesystem continues to call dax_layout_busy_page() until it finally
> returns no more active pages. This approach assumes that the page
> pinning is transient, if that assumption is violated the system would
> have likely hung from the uncompleted I/O.
>
> Cc: Jan Kara <[email protected]>
> Cc: Jeff Moyer <[email protected]>
> Cc: Dave Chinner <[email protected]>
> Cc: Matthew Wilcox <[email protected]>
> Cc: Alexander Viro <[email protected]>
> Cc: "Darrick J. Wong" <[email protected]>
> Cc: Ross Zwisler <[email protected]>
> Cc: Dave Hansen <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Reported-by: Christoph Hellwig <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
> Signed-off-by: Dan Williams <[email protected]>
> ---
> drivers/dax/super.c | 2 +
> fs/dax.c | 92 +++++++++++++++++++++++++++++++++++++++++++++++++++
> include/linux/dax.h | 25 ++++++++++++++
> mm/gup.c | 5 +++
> 4 files changed, 123 insertions(+), 1 deletion(-)
...
> +/**
> + * dax_layout_busy_page - find first pinned page in @mapping
> + * @mapping: address space to scan for a page with ref count > 1
> + *
> + * DAX requires ZONE_DEVICE mapped pages. These pages are never
> + * 'onlined' to the page allocator so they are considered idle when
> + * page->count == 1. A filesystem uses this interface to determine if
> + * any page in the mapping is busy, i.e. for DMA, or other
> + * get_user_pages() usages.
> + *
> + * It is expected that the filesystem is holding locks to block the
> + * establishment of new mappings in this address_space. I.e. it expects
> + * to be able to run unmap_mapping_range() and subsequently not race
> + * mapping_mapped() becoming true. It expects that get_user_pages() pte
> + * walks are performed under rcu_read_lock().
> + */
> +struct page *dax_layout_busy_page(struct address_space *mapping)
> +{
> + pgoff_t indices[PAGEVEC_SIZE];
> + struct page *page = NULL;
> + struct pagevec pvec;
> + pgoff_t index, end;
> + unsigned i;
> +
> + /*
> + * In the 'limited' case get_user_pages() for dax is disabled.
> + */
> + if (IS_ENABLED(CONFIG_FS_DAX_LIMITED))
> + return NULL;
> +
> + if (!dax_mapping(mapping) || !mapping_mapped(mapping))
> + return NULL;
> +
> + pagevec_init(&pvec);
> + index = 0;
> + end = -1;
> + /*
> + * Flush dax_layout_lock() sections to ensure all possible page
> + * references have been taken, or otherwise arrange for faults
> + * to block on the filesystem lock that is taken for
> + * establishing new mappings.
> + */
> + unmap_mapping_range(mapping, 0, 0, 1);
> + synchronize_rcu();
So I still don't like the use of RCU for this. It just seems like an abuse
of RCU. Furthermore it has a hefty latency cost for the truncate path. A
trivial test that truncates the last page of a 16k file 100 times, with only
the first page mmapped:
DAX+your patches 3.899s
non-DAX 0.015s
So you can see synchronize_rcu() increased the time to run truncate(2) more
than 200-fold (the process is indeed sitting in __wait_rcu_gp all the
time). IMHO that's just too costly.
> + while (index < end && pagevec_lookup_entries(&pvec, mapping, index,
> + min(end - index, (pgoff_t)PAGEVEC_SIZE),
> + indices)) {
> + for (i = 0; i < pagevec_count(&pvec); i++) {
> + struct page *pvec_ent = pvec.pages[i];
> + void *entry;
> +
> + index = indices[i];
> + if (index >= end)
> + break;
> +
> + if (!radix_tree_exceptional_entry(pvec_ent))
> + continue;
This would be a bug - so WARN_ON_ONCE() here?
> +
> + spin_lock_irq(&mapping->tree_lock);
> + entry = get_unlocked_mapping_entry(mapping, index, NULL);
> + if (entry)
> + page = dax_busy_page(entry);
> + put_unlocked_mapping_entry(mapping, index, entry);
> + spin_unlock_irq(&mapping->tree_lock);
> + if (page)
> + break;
> + }
> + pagevec_remove_exceptionals(&pvec);
> + pagevec_release(&pvec);
> + index++;
> +
> + if (page)
> + break;
> + }
> + return page;
> +}
> +EXPORT_SYMBOL_GPL(dax_layout_busy_page);
> +
> static int __dax_invalidate_mapping_entry(struct address_space *mapping,
> pgoff_t index, bool trunc)
> {
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
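A reproducer along the lines of the test described above might look roughly
like this sketch (the mount point, file layout, and timing loop are
assumptions, not the actual test that produced the numbers; error handling is
omitted):

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
	int fd = open("/mnt/dax/testfile", O_CREAT | O_RDWR, 0644);
	struct timespec t0, t1;
	void *map;
	int i;

	ftruncate(fd, 16384);
	/* keep the first page mmapped so the mapping counts as mapped */
	map = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < 100; i++) {
		ftruncate(fd, 12288);	/* drop the last page ... */
		ftruncate(fd, 16384);	/* ... and restore it */
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);

	printf("100 truncates: %.3fs\n", (t1.tv_sec - t0.tv_sec) +
			(t1.tv_nsec - t0.tv_nsec) / 1e9);
	munmap(map, 4096);
	close(fd);
	return 0;
}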
On Fri 30-03-18 21:03:46, Dan Williams wrote:
> xfs_break_dax_layouts(), similar to xfs_break_leased_layouts(), scans
> for busy / pinned dax pages and waits for those pages to go idle before
> any potential extent unmap operation.
>
> dax_layout_busy_page() handles synchronizing against new page-busy
> events (get_user_pages). It invalidates all mappings to trigger the
> get_user_pages slow path which will eventually block on the xfs inode
> lock held in XFS_MMAPLOCK_EXCL mode. If dax_layout_busy_page() finds a
> busy page it returns it for xfs to wait for the page-idle event that
> will fire when the page reference count reaches 1 (recall ZONE_DEVICE
> pages are idle at count 1, see generic_dax_pagefree()).
>
> While waiting, the XFS_MMAPLOCK_EXCL lock is dropped in order to not
> deadlock the process that might be trying to elevate the page count of
> more pages before arranging for any of them to go idle. I.e. the typical
> case of submitting I/O is that iov_iter_get_pages() elevates the
> reference count of all pages in the I/O before starting I/O on the first
> page. The process of elevating the reference count of all pages involved
> in an I/O may cause faults that need to take XFS_MMAPLOCK_EXCL.
>
> Cc: Jan Kara <[email protected]>
> Cc: Dave Chinner <[email protected]>
> Cc: "Darrick J. Wong" <[email protected]>
> Cc: Ross Zwisler <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
> Signed-off-by: Dan Williams <[email protected]>
...
> ---
> fs/xfs/xfs_file.c | 60 +++++++++++++++++++++++++++++++++++++++++++----------
> 1 file changed, 49 insertions(+), 11 deletions(-)
>
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 51e6506bdcb1..0342f6fb782f 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -752,6 +752,38 @@ xfs_file_write_iter(
> return ret;
> }
>
> +static void
> +xfs_wait_var_event(
> + struct inode *inode,
> + uint iolock,
> + bool *did_unlock)
> +{
> + struct xfs_inode *ip = XFS_I(inode);
> +
> + *did_unlock = true;
> + xfs_iunlock(ip, iolock);
> + schedule();
> + xfs_ilock(ip, iolock);
> +}
With this scheme, there's a problem: it can be easily livelocked. E.g.
when I created a program that maps a file on a DAX fs and does AIO DIO from
it indefinitely (with 64 iocbs in flight), truncate of that file never gets
past xfs_break_layouts(). The reason is that once we drop all locks, new
iocbs can be submitted; they grab new page references, and these prevent
truncation the next time around... So I think we need to somehow fix this
retry scheme so that we guarantee forward progress of the truncate. E.g. if
we kept the IOLOCK held, that would prevent new iocbs from being submitted...
Honza
> +
> +static int
> +xfs_break_dax_layouts(
> + struct inode *inode,
> + uint iolock,
> + bool *did_unlock)
> +{
> + struct page *page;
> +
> + *did_unlock = false;
> + page = dax_layout_busy_page(inode->i_mapping);
> + if (!page)
> + return 0;
> +
> + return ___wait_var_event(&page->_refcount,
> + atomic_read(&page->_refcount) == 1, TASK_INTERRUPTIBLE,
> + 0, 0, xfs_wait_var_event(inode, iolock, did_unlock));
> +}
> +
> int
> xfs_break_layouts(
> struct inode *inode,
> @@ -763,17 +795,23 @@ xfs_break_layouts(
>
> ASSERT(xfs_isilocked(XFS_I(inode), XFS_IOLOCK_SHARED|XFS_IOLOCK_EXCL));
>
> - switch (reason) {
> - case BREAK_UNMAP:
> - ASSERT(xfs_isilocked(XFS_I(inode), XFS_MMAPLOCK_EXCL));
> - /* fall through */
> - case BREAK_WRITE:
> - error = xfs_break_leased_layouts(inode, iolock, &retry);
> - break;
> - default:
> - WARN_ON_ONCE(1);
> - return -EINVAL;
> - }
> + do {
> + switch (reason) {
> + case BREAK_UNMAP:
> + ASSERT(xfs_isilocked(XFS_I(inode), XFS_MMAPLOCK_EXCL));
> +
> + error = xfs_break_dax_layouts(inode, *iolock, &retry);
> + /* fall through */
> + case BREAK_WRITE:
> + if (error || retry)
> + break;
> + error = xfs_break_leased_layouts(inode, iolock, &retry);
> + break;
> + default:
> + WARN_ON_ONCE(1);
> + return -EINVAL;
> + }
> + } while (error == 0 && retry);
>
> return error;
> }
>
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Wed 04-04-18 11:46:56, Jan Kara wrote:
> On Fri 30-03-18 21:03:30, Dan Williams wrote:
> > Background:
> >
> > get_user_pages() in the filesystem pins file backed memory pages for
> > access by devices performing dma. However, it only pins the memory pages
> > not the page-to-file offset association. If a file is truncated the
> > pages are mapped out of the file and dma may continue indefinitely into
> > a page that is owned by a device driver. This breaks coherency of the
> > file vs dma, but the assumption is that if userspace wants the
> > file-space truncated it does not matter what data is inbound from the
> > device, it is not relevant anymore. The only expectation is that dma can
> > safely continue while the filesystem reallocates the block(s).
> >
> > Problem:
> >
> > This expectation that dma can safely continue while the filesystem
> > changes the block map is broken by dax. With dax the target dma page
> > *is* the filesystem block. The model of leaving the page pinned for dma,
> > but truncating the file block out of the file, means that the filesytem
> > is free to reallocate a block under active dma to another file and now
> > the expected data-incoherency situation has turned into active
> > data-corruption.
> >
> > Solution:
> >
> > Defer all filesystem operations (fallocate(), truncate()) on a dax mode
> > file while any page/block in the file is under active dma. This solution
> > assumes that dma is transient. Cases where dma operations are known to
> > not be transient, like RDMA, have been explicitly disabled via
> > commits like 5f1d43de5416 "IB/core: disable memory registration of
> > filesystem-dax vmas".
> >
> > The dax_layout_busy_page() routine is called by filesystems with a lock
> > held against mm faults (i_mmap_lock) to find pinned / busy dax pages.
> > The process of looking up a busy page invalidates all mappings
> > to trigger any subsequent get_user_pages() to block on i_mmap_lock.
> > The filesystem continues to call dax_layout_busy_page() until it finally
> > returns no more active pages. This approach assumes that the page
> > pinning is transient, if that assumption is violated the system would
> > have likely hung from the uncompleted I/O.
> >
> > Cc: Jan Kara <[email protected]>
> > Cc: Jeff Moyer <[email protected]>
> > Cc: Dave Chinner <[email protected]>
> > Cc: Matthew Wilcox <[email protected]>
> > Cc: Alexander Viro <[email protected]>
> > Cc: "Darrick J. Wong" <[email protected]>
> > Cc: Ross Zwisler <[email protected]>
> > Cc: Dave Hansen <[email protected]>
> > Cc: Andrew Morton <[email protected]>
> > Reported-by: Christoph Hellwig <[email protected]>
> > Reviewed-by: Christoph Hellwig <[email protected]>
> > Signed-off-by: Dan Williams <[email protected]>
> > ---
> > drivers/dax/super.c | 2 +
> > fs/dax.c | 92 +++++++++++++++++++++++++++++++++++++++++++++++++++
> > include/linux/dax.h | 25 ++++++++++++++
> > mm/gup.c | 5 +++
> > 4 files changed, 123 insertions(+), 1 deletion(-)
>
> ...
>
> > +/**
> > + * dax_layout_busy_page - find first pinned page in @mapping
> > + * @mapping: address space to scan for a page with ref count > 1
> > + *
> > + * DAX requires ZONE_DEVICE mapped pages. These pages are never
> > + * 'onlined' to the page allocator so they are considered idle when
> > + * page->count == 1. A filesystem uses this interface to determine if
> > + * any page in the mapping is busy, i.e. for DMA, or other
> > + * get_user_pages() usages.
> > + *
> > + * It is expected that the filesystem is holding locks to block the
> > + * establishment of new mappings in this address_space. I.e. it expects
> > + * to be able to run unmap_mapping_range() and subsequently not race
> > + * mapping_mapped() becoming true. It expects that get_user_pages() pte
> > + * walks are performed under rcu_read_lock().
> > + */
> > +struct page *dax_layout_busy_page(struct address_space *mapping)
> > +{
> > + pgoff_t indices[PAGEVEC_SIZE];
> > + struct page *page = NULL;
> > + struct pagevec pvec;
> > + pgoff_t index, end;
> > + unsigned i;
> > +
> > + /*
> > + * In the 'limited' case get_user_pages() for dax is disabled.
> > + */
> > + if (IS_ENABLED(CONFIG_FS_DAX_LIMITED))
> > + return NULL;
> > +
> > + if (!dax_mapping(mapping) || !mapping_mapped(mapping))
> > + return NULL;
> > +
> > + pagevec_init(&pvec);
> > + index = 0;
> > + end = -1;
> > + /*
> > + * Flush dax_layout_lock() sections to ensure all possible page
> > + * references have been taken, or otherwise arrange for faults
> > + * to block on the filesystem lock that is taken for
> > + * establishing new mappings.
> > + */
> > + unmap_mapping_range(mapping, 0, 0, 1);
> > + synchronize_rcu();
>
> So I still don't like the use of RCU for this. It just seems like an abuse
> of RCU. Furthermore it has a hefty latency cost for the truncate path. A
> trivial test that truncates the last page of a 16k file 100 times, with
> only the first page mmapped:
>
> DAX+your patches 3.899s
> non-DAX 0.015s
>
> So you can see synchronize_rcu() increased the time to run truncate(2) more
> than 200-fold (the process is indeed sitting in __wait_rcu_gp all the
> time). IMHO that's just too costly.
Forgot to add some more thoughts: Maybe we could use a global percpu rwsem
for this instead of RCU? That would cut down the truncate latency, and the
cost on the GUP path should be very small. Alternatively, I'm still not
convinced that my PageTruncateInProgress() idea cannot be made to work - that
would be free on the GUP side for the non-DAX case, relatively cheap for the
DAX case, and also reasonably cheap for the truncate side. But I admit it
requires more work on the fs side to propagate the offsets that are going to
be truncated into the DAX helper.
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
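To make the percpu rwsem idea concrete, a minimal sketch (assuming a single
global lock; the names and helpers below are hypothetical and not part of the
posted series): get_user_pages() would take the cheap read side around its
page-reference walk, while dax_layout_busy_page() would replace
synchronize_rcu() with a write-side acquire/release that flushes out any
in-flight walkers:

static DEFINE_STATIC_PERCPU_RWSEM(dax_layout_rwsem);

/* get_user_pages() side: cheap, per-cpu on the fast path */
static inline void dax_layout_lock(void)
{
	percpu_down_read(&dax_layout_rwsem);
}

static inline void dax_layout_unlock(void)
{
	percpu_up_read(&dax_layout_rwsem);
}

/* dax_layout_busy_page() side: would stand in for synchronize_rcu() */
static void dax_layout_flush_holders(void)
{
	percpu_down_write(&dax_layout_rwsem);
	percpu_up_write(&dax_layout_rwsem);
}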
On Wed, Apr 4, 2018 at 2:46 AM, Jan Kara <[email protected]> wrote:
> On Fri 30-03-18 21:03:30, Dan Williams wrote:
>> Background:
>>
>> get_user_pages() in the filesystem pins file backed memory pages for
>> access by devices performing dma. However, it only pins the memory pages
>> not the page-to-file offset association. If a file is truncated the
>> pages are mapped out of the file and dma may continue indefinitely into
>> a page that is owned by a device driver. This breaks coherency of the
>> file vs dma, but the assumption is that if userspace wants the
>> file-space truncated it does not matter what data is inbound from the
>> device, it is not relevant anymore. The only expectation is that dma can
>> safely continue while the filesystem reallocates the block(s).
>>
>> Problem:
>>
>> This expectation that dma can safely continue while the filesystem
>> changes the block map is broken by dax. With dax the target dma page
>> *is* the filesystem block. The model of leaving the page pinned for dma,
>> but truncating the file block out of the file, means that the filesytem
>> is free to reallocate a block under active dma to another file and now
>> the expected data-incoherency situation has turned into active
>> data-corruption.
>>
>> Solution:
>>
>> Defer all filesystem operations (fallocate(), truncate()) on a dax mode
>> file while any page/block in the file is under active dma. This solution
>> assumes that dma is transient. Cases where dma operations are known to
>> not be transient, like RDMA, have been explicitly disabled via
>> commits like 5f1d43de5416 "IB/core: disable memory registration of
>> filesystem-dax vmas".
>>
>> The dax_layout_busy_page() routine is called by filesystems with a lock
>> held against mm faults (i_mmap_lock) to find pinned / busy dax pages.
>> The process of looking up a busy page invalidates all mappings
>> to trigger any subsequent get_user_pages() to block on i_mmap_lock.
>> The filesystem continues to call dax_layout_busy_page() until it finally
>> returns no more active pages. This approach assumes that the page
>> pinning is transient, if that assumption is violated the system would
>> have likely hung from the uncompleted I/O.
>>
>> Cc: Jan Kara <[email protected]>
>> Cc: Jeff Moyer <[email protected]>
>> Cc: Dave Chinner <[email protected]>
>> Cc: Matthew Wilcox <[email protected]>
>> Cc: Alexander Viro <[email protected]>
>> Cc: "Darrick J. Wong" <[email protected]>
>> Cc: Ross Zwisler <[email protected]>
>> Cc: Dave Hansen <[email protected]>
>> Cc: Andrew Morton <[email protected]>
>> Reported-by: Christoph Hellwig <[email protected]>
>> Reviewed-by: Christoph Hellwig <[email protected]>
>> Signed-off-by: Dan Williams <[email protected]>
>> ---
>> drivers/dax/super.c | 2 +
>> fs/dax.c | 92 +++++++++++++++++++++++++++++++++++++++++++++++++++
>> include/linux/dax.h | 25 ++++++++++++++
>> mm/gup.c | 5 +++
>> 4 files changed, 123 insertions(+), 1 deletion(-)
>
> ...
>
>> +/**
>> + * dax_layout_busy_page - find first pinned page in @mapping
>> + * @mapping: address space to scan for a page with ref count > 1
>> + *
>> + * DAX requires ZONE_DEVICE mapped pages. These pages are never
>> + * 'onlined' to the page allocator so they are considered idle when
>> + * page->count == 1. A filesystem uses this interface to determine if
>> + * any page in the mapping is busy, i.e. for DMA, or other
>> + * get_user_pages() usages.
>> + *
>> + * It is expected that the filesystem is holding locks to block the
>> + * establishment of new mappings in this address_space. I.e. it expects
>> + * to be able to run unmap_mapping_range() and subsequently not race
>> + * mapping_mapped() becoming true. It expects that get_user_pages() pte
>> + * walks are performed under rcu_read_lock().
>> + */
>> +struct page *dax_layout_busy_page(struct address_space *mapping)
>> +{
>> + pgoff_t indices[PAGEVEC_SIZE];
>> + struct page *page = NULL;
>> + struct pagevec pvec;
>> + pgoff_t index, end;
>> + unsigned i;
>> +
>> + /*
>> + * In the 'limited' case get_user_pages() for dax is disabled.
>> + */
>> + if (IS_ENABLED(CONFIG_FS_DAX_LIMITED))
>> + return NULL;
>> +
>> + if (!dax_mapping(mapping) || !mapping_mapped(mapping))
>> + return NULL;
>> +
>> + pagevec_init(&pvec);
>> + index = 0;
>> + end = -1;
>> + /*
>> + * Flush dax_layout_lock() sections to ensure all possible page
>> + * references have been taken, or otherwise arrange for faults
>> + * to block on the filesystem lock that is taken for
>> + * establishing new mappings.
>> + */
>> + unmap_mapping_range(mapping, 0, 0, 1);
>> + synchronize_rcu();
>
> So I still don't like the use of RCU for this. It just seems as an abuse to
> use RCU like that. Furthermore it has a hefty latency cost for the truncate
> path. A trivial test to truncate 100 times the last page of a 16k file that
> is mmaped (only the first page):
>
> DAX+your patches 3.899s
> non-DAX 0.015s
>
> So you can see synchronize_rcu() increased time to run truncate(2) more
> than 200 times (the process is indeed sitting in __wait_rcu_gp all the
> time). IMHO that's just too costly.
Agree. I was quietly hoping that it wouldn't be that bad, but numbers
are numbers.
At this point I think we should just go with the
address_space_operations conversions and the sector-to-pfn conversion
for what's stored in the dax radix for 4.17-rc1, and circle back on a
better way to do this synchronization for 4.18.
Hi Dan,
I caught the following bug on linux-next 20180404; git bisect brought me to this commit:
commit 8e4d1ccc5286d2c3da6515b92323a3529aa64496 (HEAD, refs/bisect/bad)
Author: Dan Williams <[email protected]>
Date: Sat Oct 21 14:41:13 2017 -0700
mm, dax: enable filesystems to trigger dev_pagemap ->page_free callbacks
[ 11.278768] BUG: unable to handle kernel NULL pointer dereference at 0000000000000440
[ 11.279999] IP: fs_dax_release+0x5/0x90
[ 11.280587] PGD 0 P4D 0
[ 11.280973] Oops: 0000 [#1] SMP PTI
[ 11.281500] Modules linked in:
[ 11.281968] CPU: 1 PID: 1 Comm: swapper/0 Not tainted 4.16.0-rc4-00193-g8e4d1ccc5286 #7
[ 11.283163] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1.fc26 04/01/2014
[ 11.284418] RIP: 0010:fs_dax_release+0x5/0x90
[ 11.285068] RSP: 0000:ffffb1480062fbd8 EFLAGS: 00010287
[ 11.285845] RAX: 0000000000000001 RBX: ffff9e2cb823c088 RCX: 0000000000000003
[ 11.286896] RDX: 0000000000000000 RSI: ffff9e2cb823c088 RDI: 0000000000000000
[ 11.287980] RBP: ffffb1480062fcd8 R08: 0000000000000001 R09: 0000000000000000
[ 11.289147] R10: ffffb1480062fb20 R11: 0000000000000000 R12: 00000000ffffffea
[ 11.290576] R13: 0000000000000000 R14: 0000000000000000 R15: ffff9e2cb823a048
[ 11.291630] FS: 0000000000000000(0000) GS:ffff9e2cbfd00000(0000) knlGS:0000000000000000
[ 11.292781] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 11.293602] CR2: 0000000000000440 CR3: 000000007d21e001 CR4: 00000000003606e0
[ 11.294817] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 11.296827] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 11.298293] Call Trace:
[ 11.298728] ext4_fill_super+0x31b/0x39d0
[ 11.299441] ? sget_userns+0x155/0x500
[ 11.300144] ? vsnprintf+0x253/0x4b0
[ 11.301223] ? ext4_calculate_overhead+0x4a0/0x4a0
[ 11.301801] ? snprintf+0x45/0x70
[ 11.302214] ? ext4_calculate_overhead+0x4a0/0x4a0
[ 11.302822] mount_bdev+0x17b/0x1b0
[ 11.303332] mount_fs+0x35/0x150
[ 11.303803] vfs_kern_mount.part.25+0x54/0x150
[ 11.304443] do_mount+0x620/0xd60
[ 11.304935] ? memdup_user+0x3e/0x70
[ 11.305458] SyS_mount+0x80/0xd0
[ 11.305931] mount_block_root+0x105/0x2b7
[ 11.306512] ? SyS_mknod+0x16b/0x1f0
[ 11.307035] ? set_debug_rodata+0x11/0x11
[ 11.307616] prepare_namespace+0x135/0x16b
[ 11.308215] kernel_init_freeable+0x271/0x297
[ 11.308838] ? rest_init+0xd0/0xd0
[ 11.309322] kernel_init+0xa/0x110
[ 11.309821] ret_from_fork+0x3a/0x50
[ 11.310347] Code: a5 45 31 ed e8 5d 5e 36 00 eb d7 48 c7 c7 20 48 2f a5 e8 4f 5e 36 00 eb c9 0f 1f 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 <48> 8b 87 40 04 00 00 48 8b 40 18 48 85 c0 74 05 e9 c6 7e 60 00
[ 11.313168] RIP: fs_dax_release+0x5/0x90 RSP: ffffb1480062fbd8
[ 11.313991] CR2: 0000000000000440
[ 11.314475] ---[ end trace 8acbb19b74409665 ]---
On Fri, Mar 30, 2018 at 09:03:08PM -0700, Dan Williams wrote:
> In order to resolve collisions between filesystem operations and DMA to
> DAX mapped pages we need a callback when DMA completes. With a callback
> we can hold off filesystem operations while DMA is in-flight and then
> resume those operations when the last put_page() occurs on a DMA page.
>
> Recall that the 'struct page' entries for DAX memory are created with
> devm_memremap_pages(). That routine arranges for the pages to be
> allocated, but never onlined, so a DAX page is DMA-idle when its
> reference count reaches one.
>
> Also recall that the HMM sub-system added infrastructure to trap the
> page-idle (2-to-1 reference count) transition of the pages allocated by
> devm_memremap_pages() and trigger a callback via the 'struct
> dev_pagemap' associated with the page range. Whereas the HMM callbacks
> are going to a device driver to manage bounce pages in device-memory in
> the filesystem-dax case we will call back to filesystem specified
> callback.
>
> Since the callback is not known at devm_memremap_pages() time we arrange
> for the filesystem to install it at mount time. No functional changes
> are expected as this only registers a nop handler for the ->page_free()
> event for device-mapped pages.
>
> Cc: Michal Hocko <[email protected]>
> Reviewed-by: "Jérôme Glisse" <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
> Reviewed-by: Jan Kara <[email protected]>
> Signed-off-by: Dan Williams <[email protected]>
> ---
> drivers/dax/super.c | 21 +++++++++++----------
> drivers/nvdimm/pmem.c | 3 ++-
> fs/ext2/super.c | 6 +++---
> fs/ext4/super.c | 6 +++---
> fs/xfs/xfs_super.c | 20 ++++++++++----------
> include/linux/dax.h | 23 ++++++++++++++---------
> 6 files changed, 43 insertions(+), 36 deletions(-)
>
> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> index c4cf284dfe1c..7d260f118a39 100644
> --- a/drivers/dax/super.c
> +++ b/drivers/dax/super.c
> @@ -63,16 +63,6 @@ int bdev_dax_pgoff(struct block_device *bdev, sector_t sector, size_t size,
> }
> EXPORT_SYMBOL(bdev_dax_pgoff);
>
> -#if IS_ENABLED(CONFIG_FS_DAX)
> -struct dax_device *fs_dax_get_by_bdev(struct block_device *bdev)
> -{
> - if (!blk_queue_dax(bdev->bd_queue))
> - return NULL;
> - return fs_dax_get_by_host(bdev->bd_disk->disk_name);
> -}
> -EXPORT_SYMBOL_GPL(fs_dax_get_by_bdev);
> -#endif
> -
> /**
> * __bdev_dax_supported() - Check if the device supports dax for filesystem
> * @sb: The superblock of the device
> @@ -579,6 +569,17 @@ struct dax_device *alloc_dax(void *private, const char *__host,
> }
> EXPORT_SYMBOL_GPL(alloc_dax);
>
> +struct dax_device *alloc_dax_devmap(void *private, const char *host,
> + const struct dax_operations *ops, struct dev_pagemap *pgmap)
> +{
> + struct dax_device *dax_dev = alloc_dax(private, host, ops);
> +
> + if (dax_dev)
> + dax_dev->pgmap = pgmap;
> + return dax_dev;
> +}
> +EXPORT_SYMBOL_GPL(alloc_dax_devmap);
> +
> void put_dax(struct dax_device *dax_dev)
> {
> if (!dax_dev)
> diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
> index 06f8dcc52ca6..e6d7351f3379 100644
> --- a/drivers/nvdimm/pmem.c
> +++ b/drivers/nvdimm/pmem.c
> @@ -408,7 +408,8 @@ static int pmem_attach_disk(struct device *dev,
> nvdimm_badblocks_populate(nd_region, &pmem->bb, &bb_res);
> disk->bb = &pmem->bb;
>
> - dax_dev = alloc_dax(pmem, disk->disk_name, &pmem_dax_ops);
> + dax_dev = alloc_dax_devmap(pmem, disk->disk_name, &pmem_dax_ops,
> + &pmem->pgmap);
> if (!dax_dev) {
> put_disk(disk);
> return -ENOMEM;
> diff --git a/fs/ext2/super.c b/fs/ext2/super.c
> index 7666c065b96f..6ae20e319bc4 100644
> --- a/fs/ext2/super.c
> +++ b/fs/ext2/super.c
> @@ -172,7 +172,7 @@ static void ext2_put_super (struct super_block * sb)
> brelse (sbi->s_sbh);
> sb->s_fs_info = NULL;
> kfree(sbi->s_blockgroup_lock);
> - fs_put_dax(sbi->s_daxdev);
> + fs_dax_release(sbi->s_daxdev, sb);
> kfree(sbi);
> }
>
> @@ -817,7 +817,7 @@ static unsigned long descriptor_loc(struct super_block *sb,
>
> static int ext2_fill_super(struct super_block *sb, void *data, int silent)
> {
> - struct dax_device *dax_dev = fs_dax_get_by_bdev(sb->s_bdev);
> + struct dax_device *dax_dev = fs_dax_claim_bdev(sb->s_bdev, sb);
> struct buffer_head * bh;
> struct ext2_sb_info * sbi;
> struct ext2_super_block * es;
> @@ -1213,7 +1213,7 @@ static int ext2_fill_super(struct super_block *sb, void *data, int silent)
> kfree(sbi->s_blockgroup_lock);
> kfree(sbi);
> failed:
> - fs_put_dax(dax_dev);
> + fs_dax_release(dax_dev, sb);
> return ret;
> }
>
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index 39bf464c35f1..315a323729e3 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -952,7 +952,7 @@ static void ext4_put_super(struct super_block *sb)
> if (sbi->s_chksum_driver)
> crypto_free_shash(sbi->s_chksum_driver);
> kfree(sbi->s_blockgroup_lock);
> - fs_put_dax(sbi->s_daxdev);
> + fs_dax_release(sbi->s_daxdev, sb);
> kfree(sbi);
> }
>
> @@ -3398,7 +3398,7 @@ static void ext4_set_resv_clusters(struct super_block *sb)
>
> static int ext4_fill_super(struct super_block *sb, void *data, int silent)
> {
> - struct dax_device *dax_dev = fs_dax_get_by_bdev(sb->s_bdev);
> + struct dax_device *dax_dev = fs_dax_claim_bdev(sb->s_bdev, sb);
> char *orig_data = kstrdup(data, GFP_KERNEL);
> struct buffer_head *bh;
> struct ext4_super_block *es = NULL;
> @@ -4408,7 +4408,7 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
> out_free_base:
> kfree(sbi);
> kfree(orig_data);
> - fs_put_dax(dax_dev);
> + fs_dax_release(dax_dev, sb);
> return err ? err : ret;
> }
>
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index 93588ea3d3d2..ef7dd7148c0b 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -724,7 +724,7 @@ xfs_close_devices(
>
> xfs_free_buftarg(mp, mp->m_logdev_targp);
> xfs_blkdev_put(logdev);
> - fs_put_dax(dax_logdev);
> + fs_dax_release(dax_logdev, mp);
> }
> if (mp->m_rtdev_targp) {
> struct block_device *rtdev = mp->m_rtdev_targp->bt_bdev;
> @@ -732,10 +732,10 @@ xfs_close_devices(
>
> xfs_free_buftarg(mp, mp->m_rtdev_targp);
> xfs_blkdev_put(rtdev);
> - fs_put_dax(dax_rtdev);
> + fs_dax_release(dax_rtdev, mp);
> }
> xfs_free_buftarg(mp, mp->m_ddev_targp);
> - fs_put_dax(dax_ddev);
> + fs_dax_release(dax_ddev, mp);
> }
>
> /*
> @@ -753,9 +753,9 @@ xfs_open_devices(
> struct xfs_mount *mp)
> {
> struct block_device *ddev = mp->m_super->s_bdev;
> - struct dax_device *dax_ddev = fs_dax_get_by_bdev(ddev);
> - struct dax_device *dax_logdev = NULL, *dax_rtdev = NULL;
> + struct dax_device *dax_ddev = fs_dax_claim_bdev(ddev, mp);
> struct block_device *logdev = NULL, *rtdev = NULL;
> + struct dax_device *dax_logdev = NULL, *dax_rtdev = NULL;
> int error;
>
> /*
> @@ -765,7 +765,7 @@ xfs_open_devices(
> error = xfs_blkdev_get(mp, mp->m_logname, &logdev);
> if (error)
> goto out;
> - dax_logdev = fs_dax_get_by_bdev(logdev);
> + dax_logdev = fs_dax_claim_bdev(logdev, mp);
> }
>
> if (mp->m_rtname) {
> @@ -779,7 +779,7 @@ xfs_open_devices(
> error = -EINVAL;
> goto out_close_rtdev;
> }
> - dax_rtdev = fs_dax_get_by_bdev(rtdev);
> + dax_rtdev = fs_dax_claim_bdev(rtdev, mp);
> }
>
> /*
> @@ -813,14 +813,14 @@ xfs_open_devices(
> xfs_free_buftarg(mp, mp->m_ddev_targp);
> out_close_rtdev:
> xfs_blkdev_put(rtdev);
> - fs_put_dax(dax_rtdev);
> + fs_dax_release(dax_rtdev, mp);
> out_close_logdev:
> if (logdev && logdev != ddev) {
> xfs_blkdev_put(logdev);
> - fs_put_dax(dax_logdev);
> + fs_dax_release(dax_logdev, mp);
> }
> out:
> - fs_put_dax(dax_ddev);
> + fs_dax_release(dax_ddev, mp);
> return error;
> }
>
> diff --git a/include/linux/dax.h b/include/linux/dax.h
> index e9d59a6b06e1..a88ff009e2a1 100644
> --- a/include/linux/dax.h
> +++ b/include/linux/dax.h
> @@ -32,6 +32,8 @@ extern struct attribute_group dax_attribute_group;
> struct dax_device *dax_get_by_host(const char *host);
> struct dax_device *alloc_dax(void *private, const char *host,
> const struct dax_operations *ops);
> +struct dax_device *alloc_dax_devmap(void *private, const char *host,
> + const struct dax_operations *ops, struct dev_pagemap *pgmap);
> void put_dax(struct dax_device *dax_dev);
> void kill_dax(struct dax_device *dax_dev);
> void dax_write_cache(struct dax_device *dax_dev, bool wc);
> @@ -50,6 +52,12 @@ static inline struct dax_device *alloc_dax(void *private, const char *host,
> */
> return NULL;
> }
> +static inline struct dax_device *alloc_dax_devmap(void *private,
> + const char *host, const struct dax_operations *ops,
> + struct dev_pagemap *pgmap)
> +{
> + return NULL;
> +}
> static inline void put_dax(struct dax_device *dax_dev)
> {
> }
> @@ -79,12 +87,8 @@ static inline struct dax_device *fs_dax_get_by_host(const char *host)
> return dax_get_by_host(host);
> }
>
> -static inline void fs_put_dax(struct dax_device *dax_dev)
> -{
> - put_dax(dax_dev);
> -}
> -
> -struct dax_device *fs_dax_get_by_bdev(struct block_device *bdev);
> +struct dax_device *fs_dax_claim_bdev(struct block_device *bdev, void *owner);
> +void fs_dax_release(struct dax_device *dax_dev, void *owner);
> int dax_writeback_mapping_range(struct address_space *mapping,
> struct block_device *bdev, struct writeback_control *wbc);
> struct dax_device *fs_dax_claim(struct dax_device *dax_dev, void *owner);
> @@ -100,13 +104,14 @@ static inline struct dax_device *fs_dax_get_by_host(const char *host)
> return NULL;
> }
>
> -static inline void fs_put_dax(struct dax_device *dax_dev)
> +static inline struct dax_device *fs_dax_claim_bdev(struct block_device *bdev,
> + void *owner)
> {
> + return NULL;
> }
>
> -static inline struct dax_device *fs_dax_get_by_bdev(struct block_device *bdev)
> +static inline void fs_dax_release(struct dax_device *dax_dev, void *owner)
> {
> - return NULL;
> }
>
> static inline int dax_writeback_mapping_range(struct address_space *mapping,
On Wed, Apr 4, 2018 at 2:23 PM, Andrei Vagin <[email protected]> wrote:
> Hi Dan,
>
> I caught the following bug on linux-next 20180404; git bisect brought me to this commit:
Yes, I will be yanking this functionality out of -next shortly and trying
again for v4.18.
[ adding Stephen ]
On Wed, Apr 4, 2018 at 2:27 PM, Dan Williams <[email protected]> wrote:
> On Wed, Apr 4, 2018 at 2:23 PM, Andrei Vagin <[email protected]> wrote:
>> Hi Dan,
>>
>> I caught the following bug on linux-next 20180404; git bisect brought me to this commit:
>
> Yes, I will be yanking this functionality out of -next shortly and will try
> again for v4.18.
New branch pushed out with this offending commit removed.
On Wed, Apr 04, 2018 at 02:23:40PM -0700, Andrei Vagin wrote:
> Hi Dan,
>
> I caught the following bug on linux-next 20180404; git bisect brought me to this commit:
The following patch fixes the problem:
diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 5b13da127982..a67a7fe75fd5 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -228,6 +228,10 @@ static void __fs_dax_release(struct dax_device *dax_dev, void *owner)
void fs_dax_release(struct dax_device *dax_dev, void *owner)
{
+ if (!dax_dev) {
+ printk("%s:%d: dax_dev == NULL\n", __func__, __LINE__);
+ return;
+ }
if (dax_dev->ops->fs_release)
dax_dev->ops->fs_release(dax_dev, owner);
else
And here is dmesg from my test vm:
[root@fc24 ~]# dmesg | grep -A 2 -B 2 dax
[ 14.659318] md: Skipping autodetection of RAID arrays. (raid=autodetect will force)
[ 14.662436] EXT4-fs (vda2): couldn't mount as ext3 due to feature incompatibilities
[ 14.663983] fs_dax_release:232: dax_dev == NULL
[ 14.665646] EXT4-fs (vda2): couldn't mount as ext2 due to feature incompatibilities
[ 14.667047] fs_dax_release:232: dax_dev == NULL
[ 14.668933] EXT4-fs (vda2): INFO: recovery required on readonly filesystem
[ 14.670039] EXT4-fs (vda2): write access will be enabled during recovery
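So the diagnostic confirms that fs_dax_release() is reached with a NULL dax_dev whenever the mounted block device has no dax capability and fs_dax_claim_bdev() returned NULL. A minimal sketch of the guard the printk above stands in for, mirroring the existing NULL check in put_dax(); the fallback call is assumed from the hunk context and may differ in the final fix:

void fs_dax_release(struct dax_device *dax_dev, void *owner)
{
        /* Nothing was claimed at mount time for a non-dax block device. */
        if (!dax_dev)
                return;

        if (dax_dev->ops->fs_release)
                dax_dev->ops->fs_release(dax_dev, owner);
        else
                __fs_dax_release(dax_dev, owner);       /* assumed default path */
}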
>
> commit 8e4d1ccc5286d2c3da6515b92323a3529aa64496 (HEAD, refs/bisect/bad)
> Author: Dan Williams <[email protected]>
> Date: Sat Oct 21 14:41:13 2017 -0700
>
> mm, dax: enable filesystems to trigger dev_pagemap ->page_free callbacks
>
>
> [ 11.278768] BUG: unable to handle kernel NULL pointer dereference at 0000000000000440
> [ 11.279999] IP: fs_dax_release+0x5/0x90
> [ 11.280587] PGD 0 P4D 0
> [ 11.280973] Oops: 0000 [#1] SMP PTI
> [ 11.281500] Modules linked in:
> [ 11.281968] CPU: 1 PID: 1 Comm: swapper/0 Not tainted 4.16.0-rc4-00193-g8e4d1ccc5286 #7
> [ 11.283163] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1.fc26 04/01/2014
> [ 11.284418] RIP: 0010:fs_dax_release+0x5/0x90
> [ 11.285068] RSP: 0000:ffffb1480062fbd8 EFLAGS: 00010287
> [ 11.285845] RAX: 0000000000000001 RBX: ffff9e2cb823c088 RCX: 0000000000000003
> [ 11.286896] RDX: 0000000000000000 RSI: ffff9e2cb823c088 RDI: 0000000000000000
> [ 11.287980] RBP: ffffb1480062fcd8 R08: 0000000000000001 R09: 0000000000000000
> [ 11.289147] R10: ffffb1480062fb20 R11: 0000000000000000 R12: 00000000ffffffea
> [ 11.290576] R13: 0000000000000000 R14: 0000000000000000 R15: ffff9e2cb823a048
> [ 11.291630] FS: 0000000000000000(0000) GS:ffff9e2cbfd00000(0000) knlGS:0000000000000000
> [ 11.292781] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 11.293602] CR2: 0000000000000440 CR3: 000000007d21e001 CR4: 00000000003606e0
> [ 11.294817] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 11.296827] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [ 11.298293] Call Trace:
> [ 11.298728] ext4_fill_super+0x31b/0x39d0
> [ 11.299441] ? sget_userns+0x155/0x500
> [ 11.300144] ? vsnprintf+0x253/0x4b0
> [ 11.301223] ? ext4_calculate_overhead+0x4a0/0x4a0
> [ 11.301801] ? snprintf+0x45/0x70
> [ 11.302214] ? ext4_calculate_overhead+0x4a0/0x4a0
> [ 11.302822] mount_bdev+0x17b/0x1b0
> [ 11.303332] mount_fs+0x35/0x150
> [ 11.303803] vfs_kern_mount.part.25+0x54/0x150
> [ 11.304443] do_mount+0x620/0xd60
> [ 11.304935] ? memdup_user+0x3e/0x70
> [ 11.305458] SyS_mount+0x80/0xd0
> [ 11.305931] mount_block_root+0x105/0x2b7
> [ 11.306512] ? SyS_mknod+0x16b/0x1f0
> [ 11.307035] ? set_debug_rodata+0x11/0x11
> [ 11.307616] prepare_namespace+0x135/0x16b
> [ 11.308215] kernel_init_freeable+0x271/0x297
> [ 11.308838] ? rest_init+0xd0/0xd0
> [ 11.309322] kernel_init+0xa/0x110
> [ 11.309821] ret_from_fork+0x3a/0x50
> [ 11.310347] Code: a5 45 31 ed e8 5d 5e 36 00 eb d7 48 c7 c7 20 48 2f a5 e8 4f 5e 36 00 eb c9 0f 1f 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 <48> 8b 87 40 04 00 00 48 8b 40 18 48 85 c0 74 05 e9 c6 7e 60 00
> [ 11.313168] RIP: fs_dax_release+0x5/0x90 RSP: ffffb1480062fbd8
> [ 11.313991] CR2: 0000000000000440
> [ 11.314475] ---[ end trace 8acbb19b74409665 ]---
>
>
> On Fri, Mar 30, 2018 at 09:03:08PM -0700, Dan Williams wrote:
> > In order to resolve collisions between filesystem operations and DMA to
> > DAX mapped pages we need a callback when DMA completes. With a callback
> > we can hold off filesystem operations while DMA is in-flight and then
> > resume those operations when the last put_page() occurs on a DMA page.
> >
> > Recall that the 'struct page' entries for DAX memory are created with
> > devm_memremap_pages(). That routine arranges for the pages to be
> > allocated, but never onlined, so a DAX page is DMA-idle when its
> > reference count reaches one.
> >
> > Also recall that the HMM sub-system added infrastructure to trap the
> > page-idle (2-to-1 reference count) transition of the pages allocated by
> > devm_memremap_pages() and trigger a callback via the 'struct
> > dev_pagemap' associated with the page range. Whereas the HMM callbacks
> > are going to a device driver to manage bounce pages in device-memory in
> > the filesystem-dax case we will call back to filesystem specified
> > callback.
> >
> > Since the callback is not known at devm_memremap_pages() time we arrange
> > for the filesystem to install it at mount time. No functional changes
> > are expected as this only registers a nop handler for the ->page_free()
> > event for device-mapped pages.
> >
> > Cc: Michal Hocko <[email protected]>
> > Reviewed-by: "Jérôme Glisse" <[email protected]>
> > Reviewed-by: Christoph Hellwig <[email protected]>
> > Reviewed-by: Jan Kara <[email protected]>
> > Signed-off-by: Dan Williams <[email protected]>
> > ---
> > drivers/dax/super.c | 21 +++++++++++----------
> > drivers/nvdimm/pmem.c | 3 ++-
> > fs/ext2/super.c | 6 +++---
> > fs/ext4/super.c | 6 +++---
> > fs/xfs/xfs_super.c | 20 ++++++++++----------
> > include/linux/dax.h | 23 ++++++++++++++---------
> > 6 files changed, 43 insertions(+), 36 deletions(-)
> >
> > diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> > index c4cf284dfe1c..7d260f118a39 100644
> > --- a/drivers/dax/super.c
> > +++ b/drivers/dax/super.c
> > @@ -63,16 +63,6 @@ int bdev_dax_pgoff(struct block_device *bdev, sector_t sector, size_t size,
> > }
> > EXPORT_SYMBOL(bdev_dax_pgoff);
> >
> > -#if IS_ENABLED(CONFIG_FS_DAX)
> > -struct dax_device *fs_dax_get_by_bdev(struct block_device *bdev)
> > -{
> > - if (!blk_queue_dax(bdev->bd_queue))
> > - return NULL;
> > - return fs_dax_get_by_host(bdev->bd_disk->disk_name);
> > -}
> > -EXPORT_SYMBOL_GPL(fs_dax_get_by_bdev);
> > -#endif
> > -
> > /**
> > * __bdev_dax_supported() - Check if the device supports dax for filesystem
> > * @sb: The superblock of the device
> > @@ -579,6 +569,17 @@ struct dax_device *alloc_dax(void *private, const char *__host,
> > }
> > EXPORT_SYMBOL_GPL(alloc_dax);
> >
> > +struct dax_device *alloc_dax_devmap(void *private, const char *host,
> > + const struct dax_operations *ops, struct dev_pagemap *pgmap)
> > +{
> > + struct dax_device *dax_dev = alloc_dax(private, host, ops);
> > +
> > + if (dax_dev)
> > + dax_dev->pgmap = pgmap;
> > + return dax_dev;
> > +}
> > +EXPORT_SYMBOL_GPL(alloc_dax_devmap);
> > +
> > void put_dax(struct dax_device *dax_dev)
> > {
> > if (!dax_dev)
> > diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
> > index 06f8dcc52ca6..e6d7351f3379 100644
> > --- a/drivers/nvdimm/pmem.c
> > +++ b/drivers/nvdimm/pmem.c
> > @@ -408,7 +408,8 @@ static int pmem_attach_disk(struct device *dev,
> > nvdimm_badblocks_populate(nd_region, &pmem->bb, &bb_res);
> > disk->bb = &pmem->bb;
> >
> > - dax_dev = alloc_dax(pmem, disk->disk_name, &pmem_dax_ops);
> > + dax_dev = alloc_dax_devmap(pmem, disk->disk_name, &pmem_dax_ops,
> > + &pmem->pgmap);
> > if (!dax_dev) {
> > put_disk(disk);
> > return -ENOMEM;
> > diff --git a/fs/ext2/super.c b/fs/ext2/super.c
> > index 7666c065b96f..6ae20e319bc4 100644
> > --- a/fs/ext2/super.c
> > +++ b/fs/ext2/super.c
> > @@ -172,7 +172,7 @@ static void ext2_put_super (struct super_block * sb)
> > brelse (sbi->s_sbh);
> > sb->s_fs_info = NULL;
> > kfree(sbi->s_blockgroup_lock);
> > - fs_put_dax(sbi->s_daxdev);
> > + fs_dax_release(sbi->s_daxdev, sb);
> > kfree(sbi);
> > }
> >
> > @@ -817,7 +817,7 @@ static unsigned long descriptor_loc(struct super_block *sb,
> >
> > static int ext2_fill_super(struct super_block *sb, void *data, int silent)
> > {
> > - struct dax_device *dax_dev = fs_dax_get_by_bdev(sb->s_bdev);
> > + struct dax_device *dax_dev = fs_dax_claim_bdev(sb->s_bdev, sb);
> > struct buffer_head * bh;
> > struct ext2_sb_info * sbi;
> > struct ext2_super_block * es;
> > @@ -1213,7 +1213,7 @@ static int ext2_fill_super(struct super_block *sb, void *data, int silent)
> > kfree(sbi->s_blockgroup_lock);
> > kfree(sbi);
> > failed:
> > - fs_put_dax(dax_dev);
> > + fs_dax_release(dax_dev, sb);
> > return ret;
> > }
> >
> > diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> > index 39bf464c35f1..315a323729e3 100644
> > --- a/fs/ext4/super.c
> > +++ b/fs/ext4/super.c
> > @@ -952,7 +952,7 @@ static void ext4_put_super(struct super_block *sb)
> > if (sbi->s_chksum_driver)
> > crypto_free_shash(sbi->s_chksum_driver);
> > kfree(sbi->s_blockgroup_lock);
> > - fs_put_dax(sbi->s_daxdev);
> > + fs_dax_release(sbi->s_daxdev, sb);
> > kfree(sbi);
> > }
> >
> > @@ -3398,7 +3398,7 @@ static void ext4_set_resv_clusters(struct super_block *sb)
> >
> > static int ext4_fill_super(struct super_block *sb, void *data, int silent)
> > {
> > - struct dax_device *dax_dev = fs_dax_get_by_bdev(sb->s_bdev);
> > + struct dax_device *dax_dev = fs_dax_claim_bdev(sb->s_bdev, sb);
> > char *orig_data = kstrdup(data, GFP_KERNEL);
> > struct buffer_head *bh;
> > struct ext4_super_block *es = NULL;
> > @@ -4408,7 +4408,7 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
> > out_free_base:
> > kfree(sbi);
> > kfree(orig_data);
> > - fs_put_dax(dax_dev);
> > + fs_dax_release(dax_dev, sb);
> > return err ? err : ret;
> > }
> >
> > diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> > index 93588ea3d3d2..ef7dd7148c0b 100644
> > --- a/fs/xfs/xfs_super.c
> > +++ b/fs/xfs/xfs_super.c
> > @@ -724,7 +724,7 @@ xfs_close_devices(
> >
> > xfs_free_buftarg(mp, mp->m_logdev_targp);
> > xfs_blkdev_put(logdev);
> > - fs_put_dax(dax_logdev);
> > + fs_dax_release(dax_logdev, mp);
> > }
> > if (mp->m_rtdev_targp) {
> > struct block_device *rtdev = mp->m_rtdev_targp->bt_bdev;
> > @@ -732,10 +732,10 @@ xfs_close_devices(
> >
> > xfs_free_buftarg(mp, mp->m_rtdev_targp);
> > xfs_blkdev_put(rtdev);
> > - fs_put_dax(dax_rtdev);
> > + fs_dax_release(dax_rtdev, mp);
> > }
> > xfs_free_buftarg(mp, mp->m_ddev_targp);
> > - fs_put_dax(dax_ddev);
> > + fs_dax_release(dax_ddev, mp);
> > }
> >
> > /*
> > @@ -753,9 +753,9 @@ xfs_open_devices(
> > struct xfs_mount *mp)
> > {
> > struct block_device *ddev = mp->m_super->s_bdev;
> > - struct dax_device *dax_ddev = fs_dax_get_by_bdev(ddev);
> > - struct dax_device *dax_logdev = NULL, *dax_rtdev = NULL;
> > + struct dax_device *dax_ddev = fs_dax_claim_bdev(ddev, mp);
> > struct block_device *logdev = NULL, *rtdev = NULL;
> > + struct dax_device *dax_logdev = NULL, *dax_rtdev = NULL;
> > int error;
> >
> > /*
> > @@ -765,7 +765,7 @@ xfs_open_devices(
> > error = xfs_blkdev_get(mp, mp->m_logname, &logdev);
> > if (error)
> > goto out;
> > - dax_logdev = fs_dax_get_by_bdev(logdev);
> > + dax_logdev = fs_dax_claim_bdev(logdev, mp);
> > }
> >
> > if (mp->m_rtname) {
> > @@ -779,7 +779,7 @@ xfs_open_devices(
> > error = -EINVAL;
> > goto out_close_rtdev;
> > }
> > - dax_rtdev = fs_dax_get_by_bdev(rtdev);
> > + dax_rtdev = fs_dax_claim_bdev(rtdev, mp);
> > }
> >
> > /*
> > @@ -813,14 +813,14 @@ xfs_open_devices(
> > xfs_free_buftarg(mp, mp->m_ddev_targp);
> > out_close_rtdev:
> > xfs_blkdev_put(rtdev);
> > - fs_put_dax(dax_rtdev);
> > + fs_dax_release(dax_rtdev, mp);
> > out_close_logdev:
> > if (logdev && logdev != ddev) {
> > xfs_blkdev_put(logdev);
> > - fs_put_dax(dax_logdev);
> > + fs_dax_release(dax_logdev, mp);
> > }
> > out:
> > - fs_put_dax(dax_ddev);
> > + fs_dax_release(dax_ddev, mp);
> > return error;
> > }
> >
> > diff --git a/include/linux/dax.h b/include/linux/dax.h
> > index e9d59a6b06e1..a88ff009e2a1 100644
> > --- a/include/linux/dax.h
> > +++ b/include/linux/dax.h
> > @@ -32,6 +32,8 @@ extern struct attribute_group dax_attribute_group;
> > struct dax_device *dax_get_by_host(const char *host);
> > struct dax_device *alloc_dax(void *private, const char *host,
> > const struct dax_operations *ops);
> > +struct dax_device *alloc_dax_devmap(void *private, const char *host,
> > + const struct dax_operations *ops, struct dev_pagemap *pgmap);
> > void put_dax(struct dax_device *dax_dev);
> > void kill_dax(struct dax_device *dax_dev);
> > void dax_write_cache(struct dax_device *dax_dev, bool wc);
> > @@ -50,6 +52,12 @@ static inline struct dax_device *alloc_dax(void *private, const char *host,
> > */
> > return NULL;
> > }
> > +static inline struct dax_device *alloc_dax_devmap(void *private,
> > + const char *host, const struct dax_operations *ops,
> > + struct dev_pagemap *pgmap)
> > +{
> > + return NULL;
> > +}
> > static inline void put_dax(struct dax_device *dax_dev)
> > {
> > }
> > @@ -79,12 +87,8 @@ static inline struct dax_device *fs_dax_get_by_host(const char *host)
> > return dax_get_by_host(host);
> > }
> >
> > -static inline void fs_put_dax(struct dax_device *dax_dev)
> > -{
> > - put_dax(dax_dev);
> > -}
> > -
> > -struct dax_device *fs_dax_get_by_bdev(struct block_device *bdev);
> > +struct dax_device *fs_dax_claim_bdev(struct block_device *bdev, void *owner);
> > +void fs_dax_release(struct dax_device *dax_dev, void *owner);
> > int dax_writeback_mapping_range(struct address_space *mapping,
> > struct block_device *bdev, struct writeback_control *wbc);
> > struct dax_device *fs_dax_claim(struct dax_device *dax_dev, void *owner);
> > @@ -100,13 +104,14 @@ static inline struct dax_device *fs_dax_get_by_host(const char *host)
> > return NULL;
> > }
> >
> > -static inline void fs_put_dax(struct dax_device *dax_dev)
> > +static inline struct dax_device *fs_dax_claim_bdev(struct block_device *bdev,
> > + void *owner)
> > {
> > + return NULL;
> > }
> >
> > -static inline struct dax_device *fs_dax_get_by_bdev(struct block_device *bdev)
> > +static inline void fs_dax_release(struct dax_device *dax_dev, void *owner)
> > {
> > - return NULL;
> > }
> >
> > static inline int dax_writeback_mapping_range(struct address_space *mapping,
Hi Dan,
On Wed, 4 Apr 2018 14:35:20 -0700 Dan Williams <[email protected]> wrote:
>
> New branch pushed out with this offending commit removed.
Thanks, I refetched.
--
Cheers,
Stephen Rothwell
[ adding Paul and Josh ]
On Wed, Apr 4, 2018 at 2:46 AM, Jan Kara <[email protected]> wrote:
> On Fri 30-03-18 21:03:30, Dan Williams wrote:
>> Background:
>>
>> get_user_pages() in the filesystem pins file backed memory pages for
>> access by devices performing dma. However, it only pins the memory pages
>> not the page-to-file offset association. If a file is truncated the
>> pages are mapped out of the file and dma may continue indefinitely into
>> a page that is owned by a device driver. This breaks coherency of the
>> file vs dma, but the assumption is that if userspace wants the
>> file-space truncated it does not matter what data is inbound from the
>> device, it is not relevant anymore. The only expectation is that dma can
>> safely continue while the filesystem reallocates the block(s).
>>
>> Problem:
>>
>> This expectation that dma can safely continue while the filesystem
>> changes the block map is broken by dax. With dax the target dma page
>> *is* the filesystem block. The model of leaving the page pinned for dma,
>> but truncating the file block out of the file, means that the filesystem
>> is free to reallocate a block under active dma to another file and now
>> the expected data-incoherency situation has turned into active
>> data-corruption.
>>
>> Solution:
>>
>> Defer all filesystem operations (fallocate(), truncate()) on a dax mode
>> file while any page/block in the file is under active dma. This solution
>> assumes that dma is transient. Cases where dma operations are known to
>> not be transient, like RDMA, have been explicitly disabled via
>> commits like 5f1d43de5416 "IB/core: disable memory registration of
>> filesystem-dax vmas".
>>
>> The dax_layout_busy_page() routine is called by filesystems with a lock
>> held against mm faults (i_mmap_lock) to find pinned / busy dax pages.
>> The process of looking up a busy page invalidates all mappings
>> to trigger any subsequent get_user_pages() to block on i_mmap_lock.
>> The filesystem continues to call dax_layout_busy_page() until it finally
>> returns no more active pages. This approach assumes that the page
>> pinning is transient, if that assumption is violated the system would
>> have likely hung from the uncompleted I/O.
>>
>> Cc: Jan Kara <[email protected]>
>> Cc: Jeff Moyer <[email protected]>
>> Cc: Dave Chinner <[email protected]>
>> Cc: Matthew Wilcox <[email protected]>
>> Cc: Alexander Viro <[email protected]>
>> Cc: "Darrick J. Wong" <[email protected]>
>> Cc: Ross Zwisler <[email protected]>
>> Cc: Dave Hansen <[email protected]>
>> Cc: Andrew Morton <[email protected]>
>> Reported-by: Christoph Hellwig <[email protected]>
>> Reviewed-by: Christoph Hellwig <[email protected]>
>> Signed-off-by: Dan Williams <[email protected]>
>> ---
>> drivers/dax/super.c | 2 +
>> fs/dax.c | 92 +++++++++++++++++++++++++++++++++++++++++++++++++++
>> include/linux/dax.h | 25 ++++++++++++++
>> mm/gup.c | 5 +++
>> 4 files changed, 123 insertions(+), 1 deletion(-)
>
> ...
>
>> +/**
>> + * dax_layout_busy_page - find first pinned page in @mapping
>> + * @mapping: address space to scan for a page with ref count > 1
>> + *
>> + * DAX requires ZONE_DEVICE mapped pages. These pages are never
>> + * 'onlined' to the page allocator so they are considered idle when
>> + * page->count == 1. A filesystem uses this interface to determine if
>> + * any page in the mapping is busy, i.e. for DMA, or other
>> + * get_user_pages() usages.
>> + *
>> + * It is expected that the filesystem is holding locks to block the
>> + * establishment of new mappings in this address_space. I.e. it expects
>> + * to be able to run unmap_mapping_range() and subsequently not race
>> + * mapping_mapped() becoming true. It expects that get_user_pages() pte
>> + * walks are performed under rcu_read_lock().
>> + */
>> +struct page *dax_layout_busy_page(struct address_space *mapping)
>> +{
>> + pgoff_t indices[PAGEVEC_SIZE];
>> + struct page *page = NULL;
>> + struct pagevec pvec;
>> + pgoff_t index, end;
>> + unsigned i;
>> +
>> + /*
>> + * In the 'limited' case get_user_pages() for dax is disabled.
>> + */
>> + if (IS_ENABLED(CONFIG_FS_DAX_LIMITED))
>> + return NULL;
>> +
>> + if (!dax_mapping(mapping) || !mapping_mapped(mapping))
>> + return NULL;
>> +
>> + pagevec_init(&pvec);
>> + index = 0;
>> + end = -1;
>> + /*
>> + * Flush dax_layout_lock() sections to ensure all possible page
>> + * references have been taken, or otherwise arrange for faults
>> + * to block on the filesystem lock that is taken for
>> + * establishing new mappings.
>> + */
>> + unmap_mapping_range(mapping, 0, 0, 1);
>> + synchronize_rcu();
>
> So I still don't like the use of RCU for this. It just seems as an abuse to
> use RCU like that. Furthermore it has a hefty latency cost for the truncate
> path. A trivial test to truncate 100 times the last page of a 16k file that
> is mmaped (only the first page):
>
> DAX+your patches 3.899s
> non-DAX 0.015s
>
> So you can see synchronize_rcu() increased time to run truncate(2) more
> than 200 times (the process is indeed sitting in __wait_rcu_gp all the
> time). IMHO that's just too costly.
I wonder if this can be trivially solved by using srcu. I.e. we don't
need to wait for a global quiescent state, just a
get_user_pages_fast() quiescent state. ...or is that an abuse of the
srcu api?
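For reference, a minimal sketch of what an SRCU-based quiescent point for get_user_pages_fast() could look like; the srcu domain name and the call sites are illustrative only, not part of the posted series:

#include <linux/srcu.h>

/* Hypothetical SRCU domain covering the gup fast-path pte walk. */
DEFINE_STATIC_SRCU(dax_gup_srcu);

/* Read side: would wrap the pte walk in get_user_pages_fast(). */
static void gup_fast_read_side_example(void)
{
        int idx;

        idx = srcu_read_lock(&dax_gup_srcu);
        /* ... walk page tables and take page references ... */
        srcu_read_unlock(&dax_gup_srcu, idx);
}

/* Write side: would replace synchronize_rcu() in dax_layout_busy_page(). */
static void dax_flush_gup_walkers_example(void)
{
        /*
         * Waits only for CPUs currently inside the dax_gup_srcu read-side
         * section, not for a full global RCU grace period.
         */
        synchronize_srcu(&dax_gup_srcu);
}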
On Sat, Apr 07, 2018 at 12:38:24PM -0700, Dan Williams wrote:
> [ adding Paul and Josh ]
>
> On Wed, Apr 4, 2018 at 2:46 AM, Jan Kara <[email protected]> wrote:
> > On Fri 30-03-18 21:03:30, Dan Williams wrote:
> >> [..]
> >> + unmap_mapping_range(mapping, 0, 0, 1);
> >> + synchronize_rcu();
> >
> > So I still don't like the use of RCU for this. It just seems as an abuse to
> > use RCU like that. Furthermore it has a hefty latency cost for the truncate
> > path. A trivial test to truncate 100 times the last page of a 16k file that
> > is mmaped (only the first page):
> >
> > DAX+your patches 3.899s
> > non-DAX 0.015s
> >
> > So you can see synchronize_rcu() increased time to run truncate(2) more
> > than 200 times (the process is indeed sitting in __wait_rcu_gp all the
> > time). IMHO that's just too costly.
>
> I wonder if this can be trivially solved by using srcu. I.e. we don't
> need to wait for a global quiescent state, just a
> get_user_pages_fast() quiescent state. ...or is that an abuse of the
> srcu api?
From what I can see (not that I claim to understand DAX), SRCU
is worth trying. Another thing to try (as a test) is to replace the
synchronize_rcu() above with synchronize_rcu_expedited(), which might
get you an order-of-magnitude improvement, or thereabouts.
Thanx, Paul
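As a quick experiment, that suggestion is a one-line substitution at the point quoted above in dax_layout_busy_page() (sketch only, and test only given the IPI cost discussed next):

        unmap_mapping_range(mapping, 0, 0, 1);
-       synchronize_rcu();
+       synchronize_rcu_expedited();    /* much shorter wait, but IPIs on every call */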
On Sat 07-04-18 12:38:24, Dan Williams wrote:
> [ adding Paul and Josh ]
>
> On Wed, Apr 4, 2018 at 2:46 AM, Jan Kara <[email protected]> wrote:
> > On Fri 30-03-18 21:03:30, Dan Williams wrote:
> >> [..]
> >> + unmap_mapping_range(mapping, 0, 0, 1);
> >> + synchronize_rcu();
> >
> > So I still don't like the use of RCU for this. It just seems as an abuse to
> > use RCU like that. Furthermore it has a hefty latency cost for the truncate
> > path. A trivial test to truncate 100 times the last page of a 16k file that
> > is mmaped (only the first page):
> >
> > DAX+your patches 3.899s
> > non-DAX 0.015s
> >
> > So you can see synchronize_rcu() increased time to run truncate(2) more
> > than 200 times (the process is indeed sitting in __wait_rcu_gp all the
> > time). IMHO that's just too costly.
>
> I wonder if this can be trivially solved by using srcu. I.e. we don't
> need to wait for a global quiescent state, just a
> get_user_pages_fast() quiescent state. ...or is that an abuse of the
> srcu api?
Well, I'd rather use the percpu rwsemaphore (linux/percpu-rwsem.h) than
SRCU. It is a more-or-less standard locking mechanism rather than relying
on implementation properties of SRCU, which is a data-structure protection
method. And the overhead of a percpu rwsemaphore for your use case should be
about the same as that of SRCU.
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
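A minimal sketch of the percpu-rwsem shape Jan is suggesting; the lock name and call sites are illustrative, and as the follow-ups below show, the read side turns out to be a problem in the gup fast path:

#include <linux/percpu-rwsem.h>

/* Hypothetical lock serializing gup_fast against dax layout changes. */
static DEFINE_STATIC_PERCPU_RWSEM(dax_layout_sem);

/* Read side, as it would sit around the get_user_pages_fast() pte walk. */
static void gup_fast_read_side_example(void)
{
        percpu_down_read(&dax_layout_sem);
        /* ... walk page tables and take page references ... */
        percpu_up_read(&dax_layout_sem);
}

/* Write side, held while dax_layout_busy_page() scans for busy pages. */
static void truncate_write_side_example(void)
{
        percpu_down_write(&dax_layout_sem);
        /* ... unmap_mapping_range() and scan the mapping ... */
        percpu_up_write(&dax_layout_sem);
}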
On Mon, Apr 9, 2018 at 9:49 AM, Jan Kara <[email protected]> wrote:
> On Sat 07-04-18 12:38:24, Dan Williams wrote:
[..]
>> I wonder if this can be trivially solved by using srcu. I.e. we don't
>> need to wait for a global quiescent state, just a
>> get_user_pages_fast() quiescent state. ...or is that an abuse of the
>> srcu api?
>
> Well, I'd rather use the percpu rwsemaphore (linux/percpu-rwsem.h) than
> SRCU. It is a more-or-less standard locking mechanism rather than relying
> on implementation properties of SRCU which is a data structure protection
> method. And the overhead of percpu rwsemaphore for your use case should be
> about the same as that of SRCU.
I was just about to ask that. Yes, it seems they would share similar
properties and it would be better to use the explicit implementation
rather than a side effect of srcu.
On Mon, Apr 09, 2018 at 06:39:10PM +0200, Jan Kara wrote:
> On Sat 07-04-18 20:11:13, Paul E. McKenney wrote:
> > On Sat, Apr 07, 2018 at 12:38:24PM -0700, Dan Williams wrote:
> > > [ adding Paul and Josh ]
> > >
> > > On Wed, Apr 4, 2018 at 2:46 AM, Jan Kara <[email protected]> wrote:
> > > > On Fri 30-03-18 21:03:30, Dan Williams wrote:
> > > >> [..]
> > > >> + unmap_mapping_range(mapping, 0, 0, 1);
> > > >> + synchronize_rcu();
> > > >
> > > > So I still don't like the use of RCU for this. It just seems as an abuse to
> > > > use RCU like that. Furthermore it has a hefty latency cost for the truncate
> > > > path. A trivial test to truncate 100 times the last page of a 16k file that
> > > > is mmaped (only the first page):
> > > >
> > > > DAX+your patches 3.899s
> > > > non-DAX 0.015s
> > > >
> > > > So you can see synchronize_rcu() increased time to run truncate(2) more
> > > > than 200 times (the process is indeed sitting in __wait_rcu_gp all the
> > > > time). IMHO that's just too costly.
> > >
> > > I wonder if this can be trivially solved by using srcu. I.e. we don't
> > > need to wait for a global quiescent state, just a
> > > get_user_pages_fast() quiescent state. ...or is that an abuse of the
> > > srcu api?
> >
> > From what I can see (not that I claim to understand DAX), SRCU
> > is worth trying. Another thing to try (as a test) is to replace the
> > synchronize_rcu() above with synchronize_rcu_expedited(), which might
> > get you an order of magnitude or thereabouts.
>
> But having synchronize_rcu_expedited() easily triggerable by userspace
> (potentially every 100 usec or even less) is not a great thing, right?
> It would be hogging the system with IPIs...
Yes, and that is why I have "(as a test)" above. If doing that restores
performance in the trivial-truncation case, that at least lets us know what
needs to happen, even though it does have some drawbacks.
And there is a synchronize_srcu_expedited() that does not do IPIs, if
that helps.
Another approach is to use call_rcu(), but I am guessing that you cannot
safely return to user until the grace period has completed.
Thanx, Paul
On Sat 07-04-18 20:11:13, Paul E. McKenney wrote:
> On Sat, Apr 07, 2018 at 12:38:24PM -0700, Dan Williams wrote:
> > [ adding Paul and Josh ]
> >
> > On Wed, Apr 4, 2018 at 2:46 AM, Jan Kara <[email protected]> wrote:
> > > On Fri 30-03-18 21:03:30, Dan Williams wrote:
> > >> [..]
> > >> + unmap_mapping_range(mapping, 0, 0, 1);
> > >> + synchronize_rcu();
> > >
> > > So I still don't like the use of RCU for this. It just seems as an abuse to
> > > use RCU like that. Furthermore it has a hefty latency cost for the truncate
> > > path. A trivial test to truncate 100 times the last page of a 16k file that
> > > is mmaped (only the first page):
> > >
> > > DAX+your patches 3.899s
> > > non-DAX 0.015s
> > >
> > > So you can see synchronize_rcu() increased time to run truncate(2) more
> > > than 200 times (the process is indeed sitting in __wait_rcu_gp all the
> > > time). IMHO that's just too costly.
> >
> > I wonder if this can be trivially solved by using srcu. I.e. we don't
> > need to wait for a global quiescent state, just a
> > get_user_pages_fast() quiescent state. ...or is that an abuse of the
> > srcu api?
>
> From what I can see (not that I claim to understand DAX), SRCU
> is worth trying. Another thing to try (as a test) is to replace the
> synchronize_rcu() above with synchronize_rcu_expedited(), which might
> get you an order of magnitude or thereabouts.
But having synchronize_rcu_expedited() easily triggerable by userspace
(potentially every 100 usec or even less) is not a great thing, right?
It would be hogging the system with IPIs...
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Mon, Apr 9, 2018 at 9:51 AM, Dan Williams <[email protected]> wrote:
> On Mon, Apr 9, 2018 at 9:49 AM, Jan Kara <[email protected]> wrote:
>> On Sat 07-04-18 12:38:24, Dan Williams wrote:
> [..]
>>> I wonder if this can be trivially solved by using srcu. I.e. we don't
>>> need to wait for a global quiescent state, just a
>>> get_user_pages_fast() quiescent state. ...or is that an abuse of the
>>> srcu api?
>>
>> Well, I'd rather use the percpu rwsemaphore (linux/percpu-rwsem.h) than
>> SRCU. It is a more-or-less standard locking mechanism rather than relying
>> on implementation properties of SRCU which is a data structure protection
>> method. And the overhead of percpu rwsemaphore for your use case should be
>> about the same as that of SRCU.
>
> I was just about to ask that. Yes, it seems they would share similar
> properties and it would be better to use the explicit implementation
> rather than a side effect of srcu.
...unfortunately:
BUG: sleeping function called from invalid context at
./include/linux/percpu-rwsem.h:34
[..]
Call Trace:
dump_stack+0x85/0xcb
___might_sleep+0x15b/0x240
dax_layout_lock+0x18/0x80
get_user_pages_fast+0xf8/0x140
...and thinking about it more, srcu is a better fit. We don't need the
100% exclusion provided by an rwsem; we only need the guarantee that
all cpus that might have been running get_user_pages_fast() have
finished it at least once.
In my tests, synchronize_srcu is a bit slower than unpatched for the
trivial 100-truncate test, but certainly not the 200x latency you were
seeing with synchronize_rcu.
Elapsed time:
0.006149178 unpatched
0.009426360 srcu
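The might_sleep splat follows from where the read side has to live: get_user_pages_fast() performs its walk with local interrupts disabled, so a blocking primitive like percpu_down_read() cannot be used there, while an srcu read lock can. A sketch of the contrast, with illustrative names rather than the actual mm/gup.c code:

#include <linux/irqflags.h>
#include <linux/srcu.h>

DEFINE_STATIC_SRCU(dax_gup_srcu);       /* hypothetical domain, as in the earlier sketch */

static void gup_fast_critical_section_example(void)
{
        unsigned long flags;
        int idx;

        local_irq_save(flags);          /* gup_fast walks page tables with IRQs off */

        /*
         * percpu_down_read() may sleep, hence the "sleeping function called
         * from invalid context" splat above.  srcu_read_lock() only bumps a
         * per-cpu counter and is safe in this context.
         */
        idx = srcu_read_lock(&dax_gup_srcu);

        /* ... pte walk, page_cache_get_speculative(), etc. ... */

        srcu_read_unlock(&dax_gup_srcu, idx);
        local_irq_restore(flags);
}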
On Fri, Apr 13, 2018 at 03:03:51PM -0700, Dan Williams wrote:
> On Mon, Apr 9, 2018 at 9:51 AM, Dan Williams <[email protected]> wrote:
> > On Mon, Apr 9, 2018 at 9:49 AM, Jan Kara <[email protected]> wrote:
> >> On Sat 07-04-18 12:38:24, Dan Williams wrote:
> > [..]
> >>> I wonder if this can be trivially solved by using srcu. I.e. we don't
> >>> need to wait for a global quiescent state, just a
> >>> get_user_pages_fast() quiescent state. ...or is that an abuse of the
> >>> srcu api?
> >>
> >> Well, I'd rather use the percpu rwsemaphore (linux/percpu-rwsem.h) than
> >> SRCU. It is a more-or-less standard locking mechanism rather than relying
> >> on implementation properties of SRCU which is a data structure protection
> >> method. And the overhead of percpu rwsemaphore for your use case should be
> >> about the same as that of SRCU.
> >
> > I was just about to ask that. Yes, it seems they would share similar
> > properties and it would be better to use the explicit implementation
> > rather than a side effect of srcu.
>
> ...unfortunately:
>
> BUG: sleeping function called from invalid context at
> ./include/linux/percpu-rwsem.h:34
> [..]
> Call Trace:
> dump_stack+0x85/0xcb
> ___might_sleep+0x15b/0x240
> dax_layout_lock+0x18/0x80
> get_user_pages_fast+0xf8/0x140
>
> ...and thinking about it more srcu is a better fit. We don't need the
> 100% exclusion provided by an rwsem we only need the guarantee that
> all cpus that might have been running get_user_pages_fast() have
> finished it at least once.
>
> In my tests synchronize_srcu is a bit slower than unpatched for the
> trivial 100 truncate test, but certainly not the 200x latency you were
> seeing with synchronize_rcu.
>
> Elapsed time:
> 0.006149178 unpatched
> 0.009426360 srcu
You might want to try synchronize_srcu_expedited(). Unlike plain RCU,
it does not send IPIs, so it should be less controversial. And it might
well more than make up the performance difference you are seeing above.
Thanx, Paul
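With the hypothetical srcu domain from the sketches above, that is again a one-line substitution (sketch only):

-       synchronize_srcu(&dax_gup_srcu);
+       synchronize_srcu_expedited(&dax_gup_srcu);      /* expedited, but without the IPIs of expedited RCU */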
On Fri 13-04-18 15:03:51, Dan Williams wrote:
> On Mon, Apr 9, 2018 at 9:51 AM, Dan Williams <[email protected]> wrote:
> > On Mon, Apr 9, 2018 at 9:49 AM, Jan Kara <[email protected]> wrote:
> >> On Sat 07-04-18 12:38:24, Dan Williams wrote:
> > [..]
> >>> I wonder if this can be trivially solved by using srcu. I.e. we don't
> >>> need to wait for a global quiescent state, just a
> >>> get_user_pages_fast() quiescent state. ...or is that an abuse of the
> >>> srcu api?
> >>
> >> Well, I'd rather use the percpu rwsemaphore (linux/percpu-rwsem.h) than
> >> SRCU. It is a more-or-less standard locking mechanism rather than relying
> >> on implementation properties of SRCU which is a data structure protection
> >> method. And the overhead of percpu rwsemaphore for your use case should be
> >> about the same as that of SRCU.
> >
> > I was just about to ask that. Yes, it seems they would share similar
> > properties and it would be better to use the explicit implementation
> > rather than a side effect of srcu.
>
> ...unfortunately:
>
> BUG: sleeping function called from invalid context at
> ./include/linux/percpu-rwsem.h:34
> [..]
> Call Trace:
> dump_stack+0x85/0xcb
> ___might_sleep+0x15b/0x240
> dax_layout_lock+0x18/0x80
> get_user_pages_fast+0xf8/0x140
>
> ...and thinking about it more srcu is a better fit. We don't need the
> 100% exclusion provided by an rwsem we only need the guarantee that
> all cpus that might have been running get_user_pages_fast() have
> finished it at least once.
>
> In my tests synchronize_srcu is a bit slower than unpatched for the
> trivial 100 truncate test, but certainly not the 200x latency you were
> seeing with synchronize_rcu.
>
> Elapsed time:
> 0.006149178 unpatched
> 0.009426360 srcu
Hum, right. Yesterday I was looking into KSM for a different reason and
I noticed that it also write-protects pages and deals with races with GUP.
And what KSM relies on is:
write_protect_page()
...
entry = ptep_clear_flush(vma, pvmw.address, pvmw.pte);
/*
* Check that no O_DIRECT or similar I/O is in progress on the
* page
*/
if (page_mapcount(page) + 1 + swapped != page_count(page)) {
page used -> bail
}
And this really works because gup_pte_range() does:
page = pte_page(pte);
head = compound_head(page);
if (!page_cache_get_speculative(head))
goto pte_unmap;
if (unlikely(pte_val(pte) != pte_val(*ptep))) {
bail
}
So either write_protect_page() sees the elevated page reference, or
gup_pte_range() bails because it sees that the pte has changed.
In the truncate path things are a bit different, but in principle the same
should work: once truncate blocks page faults and unmaps pages from the page
tables, we can be sure that either GUP will not grab the page anymore or we'll
see an elevated page count. So IMO there's no need for any additional locking
against the GUP path (but a comment explaining this is highly desirable, I
guess).
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
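In other words, the ordering Jan describes can be captured by the existing reference-count test plus the comment he asks for. A hedged sketch of how that documentation might read next to the busy-page scan; the helper name is illustrative, not from a posted patch:

#include <linux/mm.h>
#include <linux/page_ref.h>

/*
 * Why no extra locking against get_user_pages_fast() is needed:
 *
 * 1. Truncate blocks new page faults and calls unmap_mapping_range(),
 *    clearing the ptes that map the dax pages.
 * 2. gup_pte_range() takes its reference with page_cache_get_speculative()
 *    and only then re-checks the pte; if the pte was cleared in step 1 it
 *    drops the reference and bails.
 * 3. So after step 1, either GUP already holds a reference (the page count
 *    is elevated and the page is reported busy), or GUP will observe the
 *    cleared pte and never pin the page.
 */
static inline bool dax_page_is_busy(struct page *page)
{
        /* ZONE_DEVICE pages are idle at a reference count of 1. */
        return page_ref_count(page) > 1;
}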
On Thu, Apr 19, 2018 at 3:44 AM, Jan Kara <[email protected]> wrote:
> On Fri 13-04-18 15:03:51, Dan Williams wrote:
>> On Mon, Apr 9, 2018 at 9:51 AM, Dan Williams <[email protected]> wrote:
>> > On Mon, Apr 9, 2018 at 9:49 AM, Jan Kara <[email protected]> wrote:
>> >> On Sat 07-04-18 12:38:24, Dan Williams wrote:
>> > [..]
>> >>> I wonder if this can be trivially solved by using srcu. I.e. we don't
>> >>> need to wait for a global quiescent state, just a
>> >>> get_user_pages_fast() quiescent state. ...or is that an abuse of the
>> >>> srcu api?
>> >>
>> >> Well, I'd rather use the percpu rwsemaphore (linux/percpu-rwsem.h) than
>> >> SRCU. It is a more-or-less standard locking mechanism rather than relying
>> >> on implementation properties of SRCU which is a data structure protection
>> >> method. And the overhead of percpu rwsemaphore for your use case should be
>> >> about the same as that of SRCU.
>> >
>> > I was just about to ask that. Yes, it seems they would share similar
>> > properties and it would be better to use the explicit implementation
>> > rather than a side effect of srcu.
>>
>> ...unfortunately:
>>
>> BUG: sleeping function called from invalid context at
>> ./include/linux/percpu-rwsem.h:34
>> [..]
>> Call Trace:
>> dump_stack+0x85/0xcb
>> ___might_sleep+0x15b/0x240
>> dax_layout_lock+0x18/0x80
>> get_user_pages_fast+0xf8/0x140
>>
>> ...and thinking about it more srcu is a better fit. We don't need the
>> 100% exclusion provided by an rwsem we only need the guarantee that
>> all cpus that might have been running get_user_pages_fast() have
>> finished it at least once.
>>
>> In my tests synchronize_srcu is a bit slower than unpatched for the
>> trivial 100 truncate test, but certainly not the 200x latency you were
>> seeing with synchronize_rcu.
>>
>> Elapsed time:
>> 0.006149178 unpatched
>> 0.009426360 srcu
>
> Hum, right. Yesterday I was looking into KSM for a different reason and
> I noticed that it also write-protects pages and deals with races with GUP.
> And what KSM relies on is:
>
> write_protect_page()
> ...
> entry = ptep_clear_flush(vma, pvmw.address, pvmw.pte);
> /*
> * Check that no O_DIRECT or similar I/O is in progress on the
> * page
> */
> if (page_mapcount(page) + 1 + swapped != page_count(page)) {
> page used -> bail
Slick.
> }
>
> And this really works because gup_pte_range() does:
>
> page = pte_page(pte);
> head = compound_head(page);
>
> if (!page_cache_get_speculative(head))
> goto pte_unmap;
>
> if (unlikely(pte_val(pte) != pte_val(*ptep))) {
> bail
Need to add a similar check to __gup_device_huge_pmd.
> }
>
> So either write_protect_page() sees the elevated page reference, or
> gup_pte_range() bails because it sees that the pte has changed.
>
> In the truncate path things are a bit different but in principle the same
> should work - once truncate blocks page faults and unmaps pages from page
> tables, we can be sure GUP will not grab the page anymore or we'll see
> elevated page count. So IMO there's no need for any additional locking
> against the GUP path (but a comment explaining this is highly desirable I
> guess).
Yes, those "pte_val(pte) != pte_val(*ptep)" checks should be
documented for the same reason we require comments on rmb/wmb pairs.
I'll take a look, thanks Jan.
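For completeness, a sketch of the kind of recheck being referred to for __gup_device_huge_pmd(), mirroring the gup_pte_range() pattern quoted above; the helper and its callers are assumptions, not a posted patch:

#include <linux/mm.h>

/*
 * @orig is the pmd value sampled before the device-huge path took its page
 * references, @pmdp points at the live entry.  If truncate or hole-punch
 * changed the block map in between, the live pmd no longer matches and the
 * caller must drop the references it just took and fall back to the slow
 * path; this is the pmd-level analogue of the pte_val(pte) != pte_val(*ptep)
 * check in gup_pte_range().
 */
static bool pmd_unchanged_since_sample(pmd_t orig, pmd_t *pmdp)
{
        return likely(pmd_val(orig) == pmd_val(*pmdp));
}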